Encyclopedia of Statistics in Behavioral Science – Volume 3

VOLUME 3

M Estimators of Location. 1109–1110
Mahalanobis Distance. 1110–1111
Mahalanobis, Prasanta Chandra. 1112–1114
Mail Surveys. 1114–1119
Mallows Cp Statistic. 1119–1120
Mantel-Haenszel Methods. 1120–1126
Marginal Independence. 1126–1128
Marginal Models for Clustered Data. 1128–1133
Markov Chain Monte Carlo and Bayesian Statistics. 1134–1143
Markov Chain Monte-Carlo Item Response Theory Estimation. 1143–1148
Markov Chains. 1148–1149
Markov, Andrei Andreevich. 1149–1152
Martingales. 1152–1154
Matching. 1154–1158
Mathematical Psychology. 1158–1164
Maximum Likelihood Estimation. 1164–1170
Maximum Likelihood Item Response Theory Estimation. 1170–1175
Maxwell, Albert Ernest. 1175–1176
Measurement: Overview. 1176–1183
Measures of Association. 1183–1192
Median. 1192–1193
Median Absolute Deviation. 1193
Median Test. 1193–1194
Mediation. 1194–1198
Mendelian Genetics Rediscovered. 1198–1204
Mendelian Inheritance and Segregation Analysis. 1205–1206
Meta-Analysis. 1206–1217
Microarrays. 1217–1221
Mid-p Values. 1221–1223
Minimum Spanning Tree. 1223–1229
Misclassification Rates. 1229–1234
Missing Data. 1234–1238
Model Based Cluster Analysis. 1238
Model Evaluation. 1239–1242
Model Fit: Assessment of. 1243–1249
Model Identifiability. 1249–1251
Model Selection. 1251–1253
Models for Matched Pairs. 1253–1256
Moderation. 1256–1258
Moments. 1258–1260
Monotonic Regression. 1260–1261
Monte Carlo Goodness of Fit Tests. 1261–1264
Monte Carlo Simulation. 1264–1271
Multidimensional Item Response Theory Models. 1272–1280
Multidimensional Scaling. 1280–1289
Multidimensional Unfolding. 1289–1294
Multigraph Modeling. 1294–1296
Multilevel and SEM Approaches to Growth Curve Modeling. 1296–1305
Multiple Baseline Designs. 1306–1309
Multiple Comparison Procedures. 1309–1325
Multiple Comparison Tests: Nonparametric and Resampling Approaches. 1325–1331
Multiple Imputation. 1331–1332
Multiple Informants. 1332–1333
Multiple Linear Regression. 1333–1338
Multiple Testing. 1338–1343
Multi-trait Multi-method Analyses. 1343–1347
Multivariate Analysis of Variance. 1359–1363
Multivariate Analysis: Bayesian. 1348–1352
Multivariate Analysis: Overview. 1352–1359
Multivariate Genetic Analysis. 1363–1370
Multivariate Multiple Regression. 1370–1373
Multivariate Normality Tests. 1373–1379
Multivariate Outliers. 1379–1384
Neural Networks. 1387–1393
Neuropsychology. 1393–1398
New Item Types and Scoring. 1398–1401
Neyman, Jerzy. 1401–1402
Neyman-Pearson Inference. 1402–1408
Nightingale, Florence. 1408–1409
Nonequivalent Control Group Design. 1410–1411
Nonlinear Mixed Effects Models. 1411–1415
Nonlinear Models. 1416–1419
Nonparametric Correlation (rs). 1419–1420
Nonparametric Correlation (tau). 1420–1421
Nonparametric Item Response Theory Models. 1421–1426
Nonparametric Regression. 1426–1430
Nonrandom Samples. 1430–1433
Nonresponse in Sample Surveys. 1433–1436
Nonshared Environment. 1436–1439
Normal Scores & Expected Order Statistics. 1439–1441
Nuisance Variables. 1441–1442
Number Needed to Treat. 1448–1450
Number of Clusters. 1442–1446
Number of Matches and Magnitude of Correlation. 1446–1448
Observational Study. 1451–1462
Odds and Odds Ratios. 1462–1467
One Way Designs: Nonparametric and Resampling Approaches. 1468–1474
Optimal Design for Categorical Variables. 1474–1479
Optimal Scaling. 1479–1482
Optimization Methods. 1482–1491
Ordinal Regression Models. 1491–1494
Outlier Detection. 1494–1497
Outliers. 1497–1498
Overlapping Clusters. 1498–1500
P Values. 1501–1503
Page's Ordered Alternatives Test. 1503–1504
Paired Observations, Distribution Free Methods. 1505–1509
Panel Study. 1510–1511
Paradoxes. 1511–1517
Parsimony/Occam's Razor. 1517–1518
Partial Correlation Coefficients. 1518–1523
Partial Least Squares. 1523–1529
Path Analysis and Path Diagrams. 1529–1531
Pattern Recognition. 1532–1535
Pearson Product Moment Correlation. 1537–1539
Pearson, Egon Sharpe. 1536
Pearson, Karl. 1536–1537
Percentiles. 1539–1540
Permutation Based Inference. 1540–1541
Person Misfit. 1541–1547
Pie Chart. 1547–1548
Pitman Test. 1548–1550
Placebo Effect. 1550–1552
Point Biserial Correlation. 1552–1553
Polychoric Correlation. 1554–1555
Polynomial Model. 1555–1557
Population Stratification. 1557
Power. 1558–1564
Power Analysis for Categorical Methods. 1565–1570
Power and Sample Size in Multilevel Linear Models. 1570–1573
Prediction Analysis of Cross-Classifications. 1573–1579
Prevalence. 1579–1580
Principal Component Analysis. 1580–1584
Principal Components and Extensions. 1584–1594
Probability: An Introduction. 1600–1605
Probability Plots. 1605–1608
Probits. 1608–1610
Procrustes Analysis. 1610–1614
Projection Pursuit. 1614–1617
Propensity Score. 1617–1619
Prospective and Retrospective Studies. 1619–1621
Proximity Measures. 1621–1628
Psychophysical Scaling. 1628–1632
Qualitative Research. 1633–1636
Quantiles. 1636–1637
Quantitative Methods in Personality Research. 1637–1641
Quartiles. 1641
Quasi-Experimental Designs. 1641–1644
Quasi-Independence. 1644–1647
Quasi-Symmetry in Contingency Tables. 1647–1650
Quetelet, Adolphe. 1650–1651
M Estimators of Location
RAND R. WILCOX
Volume 3, pp. 1109–1110 in Encyclopedia of Statistics in Behavioral Science
ISBN-13: 978-0-470-86080-9  ISBN-10: 0-470-86080-4
Editors: Brian S. Everitt & David C. Howell
John Wiley & Sons, Ltd, Chichester, 2005
M Estimators of Location

Under normality, the sample mean has a smaller standard error than the sample median. A consequence is that hypothesis testing methods based on the mean have more power; the probability of rejecting the null hypothesis is higher than when using the median. But under nonnormality, there are general conditions under which this is no longer true, a result first derived by Laplace over two centuries ago. In fact, any method based on means can have poor power. This raises the issue of whether an alternative to the mean and median can be found that maintains relatively high power under normality but continues to have high power in situations in which the mean performs poorly. Three types of estimators aimed at achieving this goal have received considerable attention: M-estimators, L-estimators, and R-estimators. L-estimators include trimmed and Winsorized means, and the median, as special cases (see Trimmed Means; Winsorized Robust Measures).

To describe M-estimators, first consider the least squares approach to estimation. Given some data, how might we choose a value, say c, that is typical of what we observe? The least squares approach is to choose c so as to minimize the sum of the squared differences between the observations and c. In symbols, if we observe $X_1, \ldots, X_n$, the goal is to choose c so as to minimize $\sum_{i=1}^{n}(X_i - c)^2$. This is accomplished by setting $c = \bar{X}$, the sample mean. If we replace squared differences with absolute values, we get the median instead. That is, if the goal is to choose c so as to minimize $\sum_{i=1}^{n}|X_i - c|$, the answer is the usual sample median. But the sample mean can have a relatively large standard error under small departures from normality [1, 3–6], and the median performs rather poorly if sampling is indeed from a normal distribution. So an issue of some practical importance is whether some measure of the difference between c and the observations can be found that not only performs relatively well under normality but also continues to perform well in situations where the mean is unsatisfactory.
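As a quick numerical illustration of the two minimization problems just described (this check is ours, not part of the original entry; it uses the 19 observations from the worked example further below), a grid search over candidate values c lands on the sample mean for squared deviations and on the sample median for absolute deviations:

```python
# Illustrative check only: grid search over candidate values of c.
import numpy as np

x = np.array([77, 87, 88, 114, 151, 210, 219, 246, 253, 262,
              296, 299, 306, 376, 428, 515, 666, 1310, 2611], dtype=float)

grid = np.linspace(x.min(), x.max(), 100001)                # candidate values of c
sq_loss = ((x[None, :] - grid[:, None]) ** 2).sum(axis=1)   # sum of squared deviations
abs_loss = np.abs(x[None, :] - grid[:, None]).sum(axis=1)   # sum of absolute deviations

print(grid[sq_loss.argmin()], x.mean())        # both approximately 448.1 (the mean)
print(grid[abs_loss.argmin()], np.median(x))   # both approximately 262 (the median)
```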
Several possibilities have been proposed; see, for example, [2–4]. One that seems to be particularly useful was derived by Huber [2]. There is no explicit equation for computing his estimator, but there are two practical ways of dealing with this problem. The first is to use an iterative estimation method. Software for implementing the method is easily written and is available, for example, in [6]. The second is to use an approximation of this estimator that inherits its positive features and is easier to compute. This approximation is called a one-step M-estimator. In essence, a one-step M-estimator searches for any outliers, which are values that are unusually large or small. This is done using a method based on the median of the data plus a measure of dispersion called the Median Absolute Deviation (MAD). (Outlier detection methods based on the mean and usual standard deviation are known to be unsatisfactory; see [5] and [6].) Then, any outliers are discarded and the remaining values are averaged. If no outliers are found, the one-step M-estimator becomes the mean. However, if the number of outliers having an unusually large value differs from the number of outliers having an unusually small value, an additional adjustment is made in order to achieve an appropriate approximation of the M-estimator of location. There are some advantages in ignoring this additional adjustment, but there are negative consequences as well [6].

To describe the computational details, consider any n observations, say $X_1, \ldots, X_n$. Let M be the median and compute $|X_1 - M|, \ldots, |X_n - M|$. The median of these n differences is MAD, the Median Absolute Deviation. Let MADN = MAD/0.6745, let $i_1$ be the number of observations $X_i$ for which $(X_i - M)/\mathrm{MADN} < -K$, and let $i_2$ be the number of observations such that $(X_i - M)/\mathrm{MADN} > K$, where typically K = 1.28 is used to get a relatively small standard error under normality. The one-step M-estimator of location (based on Huber's $\Psi$) is

$$\hat{\mu}_{os} = \frac{K(\mathrm{MADN})(i_2 - i_1) + \sum_{i=i_1+1}^{n-i_2} X_{(i)}}{n - i_1 - i_2}, \qquad (1)$$

where $X_{(1)} \le \cdots \le X_{(n)}$ are the observations written in ascending order.
Computing a one-step M-estimator (with K = 1.28) is illustrated with the following n = 19 observations:

77 87 88 114 151 210 219 246 253 262 296 299 306 376 428 515 666 1310 2611.

It can be seen that M = 262 and that MADN = MAD/0.6745 = 114/0.6745 = 169. If for each observed value we subtract the median and divide by MADN, we get

−1.09 −1.04 −1.03 −0.88 −0.66 −0.31 −0.25 −0.095 −0.05 0.00 0.20 0.22 0.26 0.67 0.98 1.50 2.39 6.20 13.90.

So there are four values larger than the median that are declared outliers: 515, 666, 1310, and 2611. That is, $i_2 = 4$. No values less than the median are declared outliers, so $i_1 = 0$. The sum of the values not declared outliers is 77 + 87 + · · · + 428 = 3412. So the value of the one-step M-estimator is

$$\frac{1.28(169)(4 - 0) + 3412}{19 - 0 - 4} = 285.$$
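A minimal Python sketch of this computation is given below. The function name and coding details are ours rather than part of the entry, but the steps follow the recipe above (MAD, MADN = MAD/0.6745, outlier counts $i_1$ and $i_2$ with K = 1.28, and equation (1)):

```python
import numpy as np

def one_step_m(x, k=1.28):
    """One-step M-estimator of location based on Huber's psi (illustrative sketch)."""
    x = np.sort(np.asarray(x, dtype=float))
    m = np.median(x)
    madn = np.median(np.abs(x - m)) / 0.6745      # MADN: MAD rescaled for normality
    z = (x - m) / madn
    i1 = int(np.sum(z < -k))                      # outliers below the median
    i2 = int(np.sum(z > k))                       # outliers above the median
    kept = x[i1:len(x) - i2]                      # middle order statistics
    return (k * madn * (i2 - i1) + kept.sum()) / (len(x) - i1 - i2)

data = [77, 87, 88, 114, 151, 210, 219, 246, 253, 262,
        296, 299, 306, 376, 428, 515, 666, 1310, 2611]
print(one_step_m(data))   # approximately 285, matching the worked example above
```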
References

[1] Hampel, F.R., Ronchetti, E.M., Rousseeuw, P.J. & Stahel, W.A. (1986). Robust Statistics, Wiley, New York.
[2] Huber, P.J. (1964). Robust estimation of location parameters, Annals of Mathematical Statistics 35, 73–101.
[3] Huber, P.J. (1981). Robust Statistics, Wiley, New York.
[4] Staudte, R.G. & Sheather, S.J. (1990). Robust Estimation and Testing, Wiley, New York.
[5] Wilcox, R.R. (2003). Applying Contemporary Statistical Techniques, Academic Press, San Diego.
[6] Wilcox, R.R. (2004). Introduction to Robust Estimation and Hypothesis Testing, 2nd Edition, Academic Press, San Diego.
RAND R. WILCOX
Mahalanobis Distance
CARL J. HUBERTY
Volume 3, pp. 1110–1111 in Encyclopedia of Statistics in Behavioral Science
ISBN-13: 978-0-470-86080-9  ISBN-10: 0-470-86080-4
Editors: Brian S. Everitt & David C. Howell
John Wiley & Sons, Ltd, Chichester, 2005
Mahalanobis Distance

It may be recalled from studying the Pythagorean theorem in a geometry course that the (Euclidean) distance between two points, $(x_1, y_1)$ and $(x_2, y_2)$, in a two-dimensional space is given by

$$d = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}. \qquad (1)$$

From a statistical viewpoint, this distance treats the two variables involved as uncorrelated, with each variable having a standard deviation of 1.0. It was in 1936 that a statistician from India, Prasanta C. Mahalanobis (1893–1972), introduced a generalization of the distance concept while investigating anthropometric problems [1]. It is a generalization in the sense of dealing with more than two variables whose intercorrelations may range from −1.0 to 1.0, and whose standard deviations may vary.

Let there be a set of p variables. For a single analysis unit (or subject), let $\mathbf{x}_i$ denote a vector of p variable scores for unit i. Also, let $\mathbf{S}$ denote the p × p covariance matrix that reflects the p variable interrelationships. Then, the Mahalanobis distance between unit i and unit j is given by

$$D_{ij} = \sqrt{(\mathbf{x}_i - \mathbf{x}_j)'\,\mathbf{S}^{-1}(\mathbf{x}_i - \mathbf{x}_j)}, \qquad (2)$$

where the prime denotes a matrix transpose and $\mathbf{S}^{-1}$ denotes the inverse of $\mathbf{S}$ – the inverse standardizes the distance. (The radicand is a (1 × p)(p × p)(p × 1) triple product that results in a scalar.) Geometrically, $\mathbf{x}_i$ and $\mathbf{x}_j$ represent points for the two units in a p-dimensional space, and $D_{ij}$ represents the distance between the two points (in the p-dimensional space).

Now, suppose there are two groups of units. Let $\bar{\mathbf{x}}_1$ represent the vector of p means in group 1; similarly for group 2, $\bar{\mathbf{x}}_2$. Then, the distance between the two mean vectors (called centroids) is analogously given by the scalar

$$D_{12} = \sqrt{(\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2)'\,\mathbf{S}^{-1}(\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2)}. \qquad (3)$$

This $D_{12}$ represents the distance between the two group centroids. The $\mathbf{S}$ matrix used is generally the two-group error (or pooled) covariance matrix.

A third type of Mahalanobis distance is that between a point representing an individual unit and a point representing a group centroid. The distance between unit i and the group 1 centroid is given by

$$D_{i1} = \sqrt{(\mathbf{x}_i - \bar{\mathbf{x}}_1)'\,\mathbf{S}^{-1}(\mathbf{x}_i - \bar{\mathbf{x}}_1)}. \qquad (4)$$

The $\mathbf{S}$ matrix here may be the covariance matrix for group 1 or, in the case of multiple groups, the error covariance matrix based on all of the groups.

In sum, there are three types of Mahalanobis distance indices: $D_{ij}$ – unit-to-unit, $D_{12}$ – group-to-group, and $D_{i1}$ – unit-to-group. These distance indices are utilized in a variety of multivariate analyses. A summary of some uses is provided in Table 1.
Table 1  The use of distance indices in some multivariate analyses

                                            Unit-to-unit   Group-to-group   Unit-to-group
Hotelling's T²; multivariate analysis
  of variance (MANOVA)                                           X
Contrasts                                                        X
Predictive discriminant analysis                                                    X
Cluster analysis                                 X                                  X
Pattern recognition                                                                 X
Multivariate outlier detection                                                      X
As alluded to earlier, a Mahalanobis D value may be viewed as a standardized distance. A related standardized distance index was proposed by Jacob Cohen (1923–1998) in 1969 [2]. The Cohen d is applied in a two-group, one-outcome-variable context. Let x denote the outcome variable, s denote the standard deviation of one group of x scores (or the error standard deviation for the two groups), and $\bar{x}_1$ denote the mean of the outcome variable scores for group 1. The Cohen index, then, is

$$d_{12} = \frac{\bar{x}_1 - \bar{x}_2}{s} = (\bar{x}_1 - \bar{x}_2)\,s^{-1}, \qquad (5)$$

which is a special case of $D_{12}$. This $d_{12}$ is sometimes considered as an effect size index (see Effect Size Measures).
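The three distance indices are straightforward to compute with standard linear algebra routines. The following sketch is illustrative only (the data are simulated and the variable names are ours, not from the entry); it uses a pooled covariance matrix for S:

```python
import numpy as np

def mahalanobis(u, v, S):
    """Mahalanobis distance between two p-dimensional points given a covariance matrix S."""
    d = np.asarray(u, dtype=float) - np.asarray(v, dtype=float)
    return float(np.sqrt(d @ np.linalg.inv(S) @ d))

rng = np.random.default_rng(0)
group1 = rng.normal(loc=0.0, scale=1.0, size=(50, 3))   # 50 units on p = 3 variables
group2 = rng.normal(loc=1.0, scale=1.0, size=(50, 3))

# Pooled (error) covariance matrix for the two groups
S = (49 * np.cov(group1, rowvar=False) + 49 * np.cov(group2, rowvar=False)) / 98

D_ij = mahalanobis(group1[0], group1[1], S)                       # unit-to-unit
D_12 = mahalanobis(group1.mean(axis=0), group2.mean(axis=0), S)   # group-to-group (centroids)
D_i1 = mahalanobis(group2[0], group1.mean(axis=0), S)             # unit-to-group
```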
References

[1] Mahalanobis, P.C. (1936). On the generalized distance in statistics, Proceedings of the National Institute of Science, Calcutta 2, 49–55.
[2] Cohen, J. (1969). Statistical Power Analysis for the Behavioral Sciences, Academic Press, New York.
(See also Hierarchical Clustering; k-means Analysis; Multidimensional Scaling) CARL J. HUBERTY
Mahalanobis, Prasanta Chandra
GARETH HAGGER-JOHNSON
Volume 3, pp. 1112–1114 in Encyclopedia of Statistics in Behavioral Science
ISBN-13: 978-0-470-86080-9  ISBN-10: 0-470-86080-4
Editors: Brian S. Everitt & David C. Howell
John Wiley & Sons, Ltd, Chichester, 2005
Mahalanobis, Prasanta Chandra

Born: June 29, 1893, in Calcutta, India [1].
Died: June 28, 1972, in Calcutta, India [1].
Mahalanobis once invoked a mental construct of 'five concentric circles' to characterize the domains of science and statistics [3]. At the center is physics and at the outer layers are survey methods or areas where the variables are mostly unidentifiable, uncontrollable and 'free'. As a student of physics and mathematics at Cambridge [4], he began his own career at the central geometric point of this construct. By the end of his life, he had made theoretical and applied contributions to every sphere. In 1922, he became professor of Physics at the Presidency College in Calcutta [7]. As 'a physicist by training, statistician by instinct and a planner by conviction' [9], his interests led him to found the Indian Statistical Institute (ISI) in 1931 [10] to promote interdisciplinary research. In 1933, he launched the internationally renowned journal Sankhyā, serving as editor for forty years. Among Mahalanobis' various achievements at the ISI were the establishment of undergraduate, postgraduate and Ph.D. level courses in statistics, and the reform in 1944 to make the ISI fully coeducational [5, 10]. His work made Delhi the 'Mecca' of statisticians, economists and planners worldwide [9]. At the national level, the Professor was chair of the National Income Committee set up by the Government of India in 1949 [1]. Mahalanobis was involved in the establishment of the Central Statistical Organization, the Perspective Planning Division in the Planning Commission and the National Sample Survey (perhaps his biggest achievement) [2]. In 1954, Prime Minister Nehru enlisted Mahalanobis to plan studies at the ISI to inform India's second and third Five Year Plans [10] and he became honorary Statistical Adviser to the Government [4]. To economists, this was an important contribution [3, 7] and it is noteworthy that his lack of training in economics did not prevent him from undertaking this major responsibility [10]. He disliked accepted forms of knowledge and enjoyed problems that offered sufficient subject challenge, regardless of the topic [7]. When dealing with economic problems, he was used to saying that
he was glad he was not exposed to formal education in economics [8]! He developed the Mahalanobis Model, which was initially rejected by the planning commission. However, Nehru approved the model and their combined efforts have been described as ‘the golden age of planning’ [3]. The ISI was declared an Institute of national importance by the Parliament in 1959 with guaranteed funding. It had previously survived on project and ad hoc grants. The Indian Statistical Service (ISS) was established in 1961 [10] and the Professor was elected Fellow of the Royal Society of Great Britain in 1945 for his outstanding contributions to theoretical and applied statistics [9]. Other international awards included Honorary Member of the Academy of Sciences in the USSR and Honorary Member of the American Academy of Sciences. He was frequently called upon to collaborate with scientific research and foreign scientists [2, 4]. The ISI itself worked with scientists from the USSR, UK, USA and Australia. The Professor’s contributions to the theory and practice of large-scale surveys have been the most celebrated [7]. His orientation in physics served as a starting point [8]. The three main contributions were [7]: (a) to provide means of obtaining an optimal sample design, which would either minimize the sampling variance of the estimate for a given cost, or minimize the cost for a given standard error of the estimate; (b) to show how one or more pilot surveys could be utilized to estimate the parameters of the variance and cost functions; (c) to suggest and use various techniques for measurement and control of sampling and nonsampling errors. He developed ‘interpenetrating network of samples in sample surveys’, which can help control for observational errors and judge the validity of survey results. The technique can also be used to measure variation across investigators, between different methods of data collection and input, and variation across seasons [7]. Total variation is split into three components: sampling error, ascertainment error, and tabulation error. In modern terminology, Mahalanobis’ four types of sampling (unitary unrestricted, unitary configurational, zonal unrestricted, and zonal configuration) correspond to unrestricted simple random sampling, unrestricted cluster sampling, stratified simple random sampling, and stratified cluster sampling [8] (see Survey Sampling Procedures). His watchwords were randomization, statistical control, and cost [8]. However, he also believed that samples should be independently
investigated and analyzed, or at the very least be split into two subsamples for analysis, despite the increase in cost. The Mahalanobis distance statistic is perhaps the most widely known aspect of Mahalanobis' work today. This statistic is used in problems of taxonomical classification [10], in cluster analysis (see Cluster Analysis: Overview; Hierarchical Clustering) and for detecting multivariate outliers in datasets. It is helpful when drawing inferences on interrelationships between populations and speculating on their origin, or for measuring divergence between two or more groups. Alternative measures of quantitative characteristics such as social class can be compared using it [5]. Pearson rejected a paper on this area of work for the journal Biometrika, but Fisher soon recognized the importance of the work and provided the term Mahalanobis D². In later life, Mahalanobis developed Fractile Graphical Analysis (FGA), a nonparametric method for the comparison of two samples. For example, it can be used to compare the characteristics of a group at different time-points, or two groups at different places. It can also be used to test the normality or log normality of a frequency distribution [6]. The Professor also developed educational tests and studies of the correlations between intelligence or aptitude tests and success in school leaving certificates and other examinations. He made important contributions to factor analysis, which appear in early volumes of Sankhyā. His work on soil heterogeneity led him to meet Fisher and they became friends. They shared views on the foundations and the methodological aspects of statistics, and also on the role of statistics as a new technology. Perhaps unusually, Mahalanobis had numerous interests outside his scientific pursuits. He was interested in art, poetry (particularly of Rabindranath Tagore) and literature [7], anthropology [11], and architecture. The various social, cultural, and intellectual movements in Bengal were also sources of interest. He was not 'an ivory-tower scholar or armchair intellectual', and 'hardly any issue of the time failed to make him take a stand one way or the other' [7]. Economic development was one of many broader objectives he believed in: including social change, modernization, national security and international peace [11]. Because he was on such good terms with Indian and world politicians [10], it has been said that Mahalanobis was the only scientist in the world who, when the Cold War was at its height,
was received with as much warmth in Washington and London as in Moscow and Beijing. A genius for locating talent, Mahalanobis’ approach to recruitment prevented a ‘brain drain’ from the ISI. He employed a strategy called ‘brain irrigation’, paying low salaries for posts earmarked for individual people [11]. When they left, the post disappeared. He felt that job security bred inefficiency and used this technique as a screening, quality control mechanism. His personality was unmistakably conscientious and driven, but a flair for argument and an impatience for bureaucracy ‘made the Professor a fighter all his life’ [3]. He believed in the public sector but not the typical civil servant [11] who he saw as inefficient, with little idea of the function and use of science. He struggled with governmental bureaucrats continuously in order to retain the autonomy of the ISI. He could talk with great effectiveness in small groups, using his histrionic talents to command attention, but was less effective in larger gatherings [6]. His character has been summarized as tough, courageous, tenacious, bold [6], intellectual, dynamic, devoted, loving, proud [3], odd (particularly in his attitude to money), but periodically despondent and depressed [11]. Four principal qualities outlined by Chatterjee [3] were: (a) practical mindedness: a preference for things tangible rather than abstract; (b) breadth of vision and farsightedness, a knack for looking beyond problems to envisage the long-term implications of possible solutions; (c) extraordinary organizing ability, although his wife Rani helped alleviate occasional absent mindedness [11]; (d) an innate sense of humanism and nationalism: strengthened in later life by his contacts with Nehru and Tagore [7]. The resourcefulness of his personality and the variety of his works set Mahalanobis apart from other scientists of his time [5]. Many types of research were welcomed at the ISI, but Rudra [11] recalled the Professor saying, ‘There is one kind of research that I shall not allow to be carried out at the Institute. I will not support anybody working on problems of aeronavigation in viscous fluid’. He therefore did not approve of research that neither had any contribution to pure theory nor to solving practical problems. Knowledge was for socially useful purposes, and this conviction found expression in the later phases of his life [7]. Mahalanobis paid almost equal attention to both theoretical and applied statistical research.
According to him, statistics was an applied science: its justification centered on the help it can give in solving a problem [7].
References

[1] Bhattacharya, N. (1996a). Professor Mahalanobis and large scale sample surveys, in Science, Society and Planning. A Centenary Tribute to Professor P.C. Mahalanobis, D. Bhattacharyya, ed., Progressive Publishers, Calcutta, pp. 34–56.
[2] Bhattacharyya, D. (1996b). Introduction, in Science, Society and Planning. A Centenary Tribute to Professor P.C. Mahalanobis, D. Bhattacharyya, ed., Progressive Publishers, Calcutta, pp. ix–xv.
[3] Chatterjee, S.K. (1996). P.C. Mahalanobis's journey through statistics to economic planning, in Science, Society and Planning. A Centenary Tribute to Professor P.C. Mahalanobis, D. Bhattacharyya, ed., Progressive Publishers, Calcutta, pp. 99–109.
[4] Deshmukh, C.D. (1965). Foreword, in Contributions to Statistics. Presented to Professor P. C. Mahalanobis on the Occasion of his 70th Birthday, C.R. Rao & D.B. Lahiri, eds, Statistical Publishing Society/Pergamon Press, Calcutta, Oxford.
[5] Goon, A.M. (1996). P.C. Mahalanobis: scientist, activist and man of letters, in Science, Society and Planning. A Centenary Tribute to Professor P.C. Mahalanobis, D. Bhattacharyya, ed., Progressive Publishers, Calcutta, pp. 89–98.
[6] Iyengar, N.S. (1996). A tribute: Professor P.C. Mahalanobis and fractile graphical analysis, in Science, Society and Planning. A Centenary Tribute to Professor P.C. Mahalanobis, D. Bhattacharyya, ed., Progressive Publishers, Calcutta, pp. 57–62.
[7] Mukherjee, M. (1996). The Professor's demise: our reactions, in Science, Society and Planning. A Centenary Tribute to Professor P.C. Mahalanobis, D. Bhattacharyya, ed., Progressive Publishers, Calcutta, pp. 6–14.
[8] Murthy, M.N. (1965). On Mahalanobis' contributions to the development of sample survey theory and methods, in Contributions to Statistics. Presented to Professor P. C. Mahalanobis on the Occasion of his 70th Birthday, C.R. Rao & D.B. Lahiri, eds, Statistical Publishing Society/Pergamon Press, Calcutta, Oxford, pp. 283–315.
[9] Rao, C.R. (1973). Prasanta Chandra Mahalanobis, Biographical Memoirs of Fellows of the Royal Society 19, 455–492.
[10] Rao, C.R. (1996). Mahalanobis era in statistics, in Science, Society and Planning. A Centenary Tribute to Professor P.C. Mahalanobis, D. Bhattacharyya, ed., Progressive Publishers, Calcutta, pp. 15–33.
[11] Rudra, A. (1996). Prasanta Chandra Mahalanobis. A Biography, Oxford University Press, Delhi, Oxford.

Further Reading

Mahalanobis, P.C. (1944). On large scale sample surveys, Philosophical Transactions of the Royal Society of London. Series B, Biological Sciences 231, 329–451.
Mahalanobis, P.C. (1946a). Recent experiments in statistical sampling in the Indian Statistical Institute, Journal of the Royal Statistical Society. Series A 109, 325–378.
Mahalanobis, P.C. (1946b). Use of small-size plots in sample surveys for crop yields, Nature 158, 798–799.
Mahalanobis, P.C. (1953). On some aspects of the Indian National Sample Survey, Bulletin of the International Statistical Institute 34, 5–14.
Mahalanobis, P.C. & Lahiri, D.B. (1961). Analysis of errors in censuses and surveys with special emphasis on the experience in India, Bulletin of the International Statistical Institute 38, 401–433.
GARETH HAGGER-JOHNSON
Mail Surveys
THOMAS W. MANGIONE
Volume 3, pp. 1114–1119 in Encyclopedia of Statistics in Behavioral Science
ISBN-13: 978-0-470-86080-9  ISBN-10: 0-470-86080-4
Editors: Brian S. Everitt & David C. Howell
John Wiley & Sons, Ltd, Chichester, 2005
Mail Surveys A mailed survey is one of several methods of collecting data that researchers can use to answer questions from a sample of the population. Mailed surveys usually involve the research team mailing a questionnaire to a potential respondent who then fills it out and returns the survey by mail. The major advantage of the mailed survey approach is its relatively low cost for data collection compared to telephone surveys or in-person interview surveys. A disadvantage of mailed surveys is that they often achieve a much lower response rate – percent of persons returning the survey from all of those asked to fill out the survey – than other data collection methods. Research studies conducted over the past few decades, however, have found ways to improve response rates to mailed surveys in many situations [5, 17–19, 44]. One general way to improve response rates to mailed surveys is to use this methodology only when it is appropriate. Mailed surveys are a good data collection choice when: 1. the budget for the study is relatively modest; 2. the sample of respondents are widely distributed geographically; 3. the data collection for a large sample needs to be completed in a relatively short time frame; 4. the validity of the answers to the questions would be improved if respondents could answer questions at their own pace; 5. the extra privacy of not having to give the answers to an interviewer would improve the veracity of the answers; 6. the study has a modest number of questions; and 7. the research sample has a moderate to high interest in the survey topic. All mailed survey studies should incorporate three basic elements. The study mailing should include a well-crafted respondent letter, a preaddressed and postage-paid return envelope, and include a promise of confidentiality of answers or, preferably, anonymity of answers. How the respondent letter is written is important because it is usually the sole mechanism for describing the study’s purpose, explaining the procedures to be followed, and motivating the respondent to participate [2, 12, 43, 46, 74]. The following features
contribute to a well-crafted letter: it is not too long (limit to one page if possible); it begins with an engaging sentence; it clearly tells the respondent why the study is important; it explains who the sample is and how people were selected; it explains how confidentiality will be maintained; it indicates that participation is voluntary but emphasizes the importance of participation; it is printed on letterhead that clearly identifies the research institution; it tells the respondent how to return the survey; and it is easy to read in terms of type size, layout, and language level. Early studies of mail surveys showed that including a preaddressed and postage-paid envelope is critically important to the success of a mail survey [3, 6, 38, 61, 67, 81]. Interestingly, research also has shown that using pretty commemorative stamps on both the initial delivered package and the return envelope improve response rates slightly [41, 50, 59]. Confidentiality is provided by not putting names on the surveys but instead using an ID number. Furthermore, confidentiality is maintained by keeping returned surveys under lock and key, keeping the list which links ID numbers to names in a separate locked place or password protected file, and presenting the data in reports in such a way that individuals are not identified. Anonymity can be achieved by not putting an ID number or other identifier on the surveys so that when they are returned the researcher does not know who returned it. Using this procedure, however, makes it difficult to send reminders to those who did not return their survey [7, 8, 14, 15, 25, 31, 34, 35, 54, 62, 66, 69, 78]. Beyond these three basic elements, there are two major strategies for improving response rates – sending reminders to those who have not responded and providing incentives to return the survey. The goal of a reminder is to increase motivation to respond. Reminders are best sent just to those who have not returned the survey so that the language of the reminder letter can be focused on those who have not returned their survey. Reminders should be sent out approximately 10 to 14 days after the previous mailing. This interval is not too short and hence will not waste a reminder on someone who intends to return the survey. Also, the interval is not too long, so that nonparticipating respondents will still remember what this is about. The first reminder should just be a postcard or letter and encourage the respondent to complete their survey. The second reminder should include answers to probable concerns the respondents
might have in not returning the survey as well as a replacement copy of the survey itself. The third reminder should again be a letter or postcard and the content should focus on the message that this is the last chance to participate. Some studies alter the last reminder by using a telephone reminder or delivering the letter with some type of premium postage like special delivery or overnight mail [16, 20, 23, 24, 26, 28, 33, 46, 49, 51, 53, 56, 81]. There is one important variation on these methods if you want to both provide anonymity to respondents and be able to send reminders to those who have not responded. The surveys are sent without any sort of ID number on them. However, in addition, the initial mailed packet includes a postcard. The person’s name or ID number are printed on the postcard. On the back of the postcard is a message from the respondent to the study director. It says, ‘I am returning my survey, so I do not need any more reminders’. The respondents are instructed in the respondent letter and maybe also at the end of the survey, to return the survey and the postcard, but to return them separately in order to maintain anonymity. By using this dual return mechanism, study directors can send reminders only to those who have not returned their surveys while at the same time maintaining anonymity of the returned surveys. The major concern in using this procedure is that many people will return the postcard but not the surveys. This turns out not to be the case. In general, about 90% of those returning their surveys also return a postcard. Almost certainly there are some people among the 90% who do not return their surveys, but the proportion is relatively small. Incentives are the other strategy to use to improve response rates. Lots of different incentive materials have been used such as movie tickets, ball-point pens, coffee mugs, and cash or checks. The advantage of money or checks is that the value of the incentive is clear. On the other hand, other types of incentives may be perceived of having value greater than what was actually spent [22, 29, 32, 40, 42, 51, 56, 73, 81, 82]. There are three types of incentives based on who gets them and when. The three types are ‘promised–rewards’, ‘lottery awards’, and ‘up-front rewards’. Promised rewards set up a ‘contract’ with the potential respondent such as: ‘each person who sends back the completed survey will be paid $5’. Although improvements in response rates compared to no incentive studies have been observed for
this approach, the improvements are generally modest [82]. Lottery methods are a form of promised-reward, but with a contingency – only one, or a few, of the participating respondents will be randomly selected to receive a reward. This method also differs from the promised rewards method in that the respondents who are selected receive a relatively large reward – $100 to $1000 or sometimes more. Generally speaking, the effectiveness of the lottery method over the basic promised-reward strategy depends on the respondents’ perceptions of the chances of winning and the attractiveness of the ‘prize’ [36, 42, 58]. Up-front incentives work the best. For this method, everyone who is asked to participate in a study is given a ‘small’ incentive that is enclosed with the initial mailing. The incentive is usually described as a ‘token of our appreciation for your participation’. Everyone can keep the incentive whether or not they respond. This unconditional reward seems to establish a sense of trust and admiration for the institution carrying out the study and thereby motivates respondents to ‘help out these good people’ by returning their survey. The size of the up-front incentive does not have to be that large to produce a notable increase in response rates. It is common to see incentives in the $5–$10 range [6, 37, 72, 79]. To get the best response rates, it is recommended that researchers use both reminders and incentives in their study designs [47]. When applying both procedures, it is not unusual to achieve response rates in the 60 to 80% range. By comparison, studies that only send out a single mailing with no incentives achieve response rates of 30% or less. In addition to these two major procedures to improve response rates, there are a variety of other things that can be done by the researcher to achieve a small improvement in response rates. For example, somewhat better response rates can be achieved by producing a survey that looks professional and is laid out in a pleasing manner, has a reading level that is not too difficult, and includes instructions that are easy to follow. Past research has shown that response rates can be increased by: 1. Using personalization in correspondence such as inserting the person’s name in the salutation or by having the researcher actually sign the letter in ink [2, 11, 21, 30, 45, 52, 54, 55, 70, 74.]
2. Sending a prenotification letter a week or two before the actual survey is sent out to 'warn' the respondent of their selection as a participant and to be on the look-out for the survey [1, 9, 27, 33, 39, 48, 54, 63, 65, 71, 75, 77, 80, 81]. 3. Using deadlines in the respondent letters to give respondents a heightened sense of priority to respond. Soft deadlines such as 'please respond within the next two weeks so we don't have to send you a reminder' provides a push without preventing the researcher from sending a further reminder [34, 41, 51, 56, 64, 68, 76]. 4. Using a questionnaire of modest length, say 10 to 20 pages, rather than an overly long survey will improve response rates (see Survey Questionnaire Design). Research has produced mixed results concerning the length of a survey. Obviously shorter surveys are less burdensome; but longer surveys may communicate a greater sense of the importance of the research issue. In general, the recommendation is to include no more material than is necessary to address the central hypotheses of the research [4, 10, 12, 13, 57, 60, 73]. Mail surveys, when used appropriately and conducted utilizing procedures that have been shown to improve response rates, offer an attractive alternative to more expensive telephone or in-person interviews.
References

[1] Allen, C.T., Schewe, C.D. & Wijk, G. (1980). More on self-perception theory's foot technique in the precall/mail survey setting, Journal of Marketing Research 17, 498–502.
[2] Andreasen, A.R. (1970). Personalizing mail questionnaire correspondence, Public Opinion Quarterly 34, 273–277.
[3] Armstrong, J.S. & Lusk, E.J. (1987). Return postage in mail surveys, Public Opinion Quarterly 51, 233–248.
[4] Berdie, D.R. (1973). Questionnaire length and response rate, Journal of Applied Psychology 58, 278–280.
[5] Biemer, P.R., Groves, R.M., Lyberg, L.E., Mathiowetz, N.A. & Sudman, S. (1991). Measurement Errors in Surveys, John Wiley & Sons, New York.
[6] Blumberg, H.H., Fuller, C. & Hare, A.P. (1974). Response rates in postal surveys, Public Opinion Quarterly 38, 113–123.
[7] Boek, W.E. & Lade, J.H. (1963). Test of the usefulness of the postcard technique in a mail questionnaire study, Public Opinion Quarterly 27, 303–306.
[8] Bradt, K. (1955). Usefulness of a postcard technique in a mail questionnaire study, Public Opinion Quarterly 19, 218–222.
[9] Brunner, A.G. & Carroll, Jr, S.J. (1969). Effect of prior notification on the refusal rate in fixed address surveys, Journal of Advertising Research 9, 42–44.
[10] Burchell, B. & Marsh, C. (1992). Effect of questionnaire length on survey response, Quality and Quantity 26, 233–244.
[11] Carpenter, E.H. (1975). Personalizing mail surveys: a replication and reassessment, Public Opinion Quarterly 38, 614–620.
[12] Champion, D.J. & Sear, A.M. (1969). Questionnaire response rates: a methodological analysis, Social Forces 47, 335–339.
[13] Childers, T.L. & Ferrell, O.C. (1979). Response rates and perceived questionnaire length in mail surveys, Journal of Marketing Research 16, 429–431.
[14] Childers, T.L. & Skinner, S.J. (1985). Theoretical and empirical issues in the identification of survey respondents, Journal of the Market Research Society 27, 39–53.
[15] Cox III, E.P., Anderson Jr, W.T. & Fulcher, D.G. (1974). Reappraising mail survey response rates, Journal of Marketing Research 11, 413–417.
[16] Denton, J., Tsai, C. & Chevrette, P. (1988). Effects on survey responses of subject, incentives, and multiple mailings, Journal of Experimental Education 56, 77–82.
[17] Dillman, D.A. (1972). Increasing mail questionnaire response in large samples of the general public, Public Opinion Quarterly 36, 254–257.
[18] Dillman, D.A. (1978). Mail and Telephone Surveys: The Total Design Method, John Wiley & Sons, New York.
[19] Dillman, D.A. (1999). Mail and Internet Surveys: The Tailored Design Method, John Wiley & Sons, New York.
[20] Dillman, D., Carpenter, E., Christenson, J. & Brooks, R. (1974). Increasing mail questionnaire response: a four state comparison, American Sociological Review 39, 744–756.
[21] Dillman, D.A. & Frey, J.H. (1974). Contribution of personalization to mail questionnaire response as an element of a previously tested method, Journal of Applied Psychology 59, 297–301.
[22] Duncan, W.J. (1979). Mail questionnaires in survey research: a review of response inducement techniques, Journal of Management 5, 39–55.
[23] Eckland, B. (1965). Effects of prodding to increase mail back returns, Journal of Applied Psychology 49, 165–169.
[24] Etzel, M.J. & Walker, B.J. (1974). Effects of alternative follow-up procedures on mail survey response rates, Journal of Applied Psychology 59, 219–221.
[25] Filion, F.L. (1975). Estimating bias due to nonresponse in mail surveys, Public Opinion Quarterly 39, 482–492.
[26] Filion, F.L. (1976). Exploring and correcting for nonresponse bias using follow-ups on nonrespondents, Pacific Sociological Review 19, 401–408.
[27] Ford, N.M. (1967). The advance letter in mail surveys, Journal of Marketing Research 4, 202–204.
[28] Ford, R.N. & Zeisel, H. (1949). Bias in mail surveys cannot be controlled by one mailing, Public Opinion Quarterly 13, 495–501.
[29] Fox, R.J., Crask, M.R. & Kim, J. (1988). Mail survey response rate: a meta-analysis of selected techniques for inducing response, Public Opinion Quarterly 52, 467–491.
[30] Frazier, G. & Bird, K. (1958). Increasing the response of a mail questionnaire, Journal of Marketing 22, 186–187.
[31] Fuller, C. (1974). Effect of anonymity on return rate and response bias in a mail survey, Journal of Applied Psychology 59, 292–296.
[32] Furse, D.H. & Stewart, D.W. (1982). Monetary incentives versus promised contribution to charity: new evidence on mail survey response, Journal of Marketing Research 19, 375–380.
[33] Furse, D.H., Stewart, D.W. & Rados, D.L. (1981). Effects of foot-in-the-door, cash incentives, and followups on survey response, Journal of Marketing Research 18, 473–478.
[34] Futrell, C. & Swan, J.E. (1977). Anonymity and response by salespeople to a mail questionnaire, Journal of Marketing Research 14, 611–616.
[35] Futrell, C. & Hise, R.T. (1982). The effects on anonymity and a same-day deadline on the response rate to mail surveys, European Research 10, 171–175.
[36] Gajraj, A.M., Faria, A.J. & Dickinson, J.R. (1990). Comparison of the effect of promised and provided lotteries, monetary and gift incentives on mail survey response rate, speed and cost, Journal of the Market Research Society 32, 141–162.
[37] Hancock, J.W. (1940). An experimental study of four methods of measuring unit costs of obtaining attitude toward the retail store, Journal of Applied Psychology 24, 213–230.
[38] Harris, J.R. & Guffey Jr, H.J. (1978). Questionnaire returns: stamps versus business reply envelopes revisited, Journal of Marketing Research 15, 290–293.
[39] Heaton Jr, E.E. (1965). Increasing mail questionnaire returns with a preliminary letter, Journal of Advertising Research 5, 36–39.
[40] Heberlein, T.A. & Baumgartner, R. (1978). Factors affecting response rates to mailed questionnaires: a quantitative analysis of the published literature, American Sociological Review 43, 447–462.
[41] Henley Jr, J.R. (1976). Response rate to mail questionnaires with a return deadline, Public Opinion Quarterly 40, 374–375.
[42] Hopkins, K.D. & Gullickson, A.R. (1992). Response rates in survey research: a meta-analysis of the effects of monetary gratuities, Journal of Experimental Education 61, 52–62.
[43] Hornik, J. (1981). Time cue and time perception effect on response to mail surveys, Journal of Marketing Research 18, 243–248.
[44] House, J.S., Gerber, W. & McMichael, A.J. (1977). Increasing mail questionnaire response: a controlled replication and extension, Public Opinion Quarterly 41, 95–99.
[45] Houston, M.J. & Jefferson, R.W. (1975). The negative effects of personalization on response patterns in mail surveys, Journal of Marketing Research 12, 114–117.
[46] Houston, M.J. & Nevin, J.R. (1977). The effects of source and appeal on mail survey response patterns, Journal of Marketing Research 14, 374–377.
[47] James, J.M. & Bolstein, R. (1990). Effect of monetary incentives and follow-up mailings on the response rate and response quality in mail surveys, Public Opinion Quarterly 54, 346–361.
[48] Jolson, M.A. (1977). How to double or triple mail response rates, Journal of Marketing 41, 78–81.
[49] Jones, W.H. & Lang, J.R. (1980). Sample composition bias and response bias in a mail survey: a comparison of inducement methods, Journal of Marketing Research 17, 69–76.
[50] Jones, W.H. & Linda, G. (1978). Multiple criteria effects in a mail survey experiment, Journal of Marketing Research 15, 280–284.
[51] Kanuk, L. & Berenson, C. (1975). Mail surveys and response rates: a literature review, Journal of Marketing Research 12, 440–453.
[52] Kawash, M.B. & Aleamoni, L.M. (1971). Effect of personal signature on the initial rate of return of a mailed questionnaire, Journal of Applied Psychology 55, 589–592.
[53] Kephart, W.M. & Bressler, M. (1958). Increasing the responses to mail questionnaires, Public Opinion Quarterly 22, 123–132.
[54] Kerin, R.A. & Peterson, R.A. (1977). Personalization, respondent anonymity, and response distortion in mail surveys, Journal of Applied Psychology 62, 86–89.
[55] Kimball, A.E. (1961). Increasing the rate of return in mail surveys, Journal of Marketing 25, 63–65.
[56] Linsky, A.S. (1975). Stimulating responses to mailed questionnaires: a review, Public Opinion Quarterly 39, 82–101.
[57] Lockhart, D.C. (1991). Mailed surveys to physicians: the effect of incentives and length on the return rate, Journal of Pharmaceutical Marketing & Management 6, 107–121.
[58] Lorenzi, P., Friedmann, R. & Paolillo, J. (1988). Consumer mail survey responses: more (unbiased) bang for the buck, Journal of Consumer Marketing 5, 31–40.
[59] Martin, J.D. & McConnell, J.P. (1970). Mail questionnaire response induction: the effect of four variables on the response of a random sample to a difficult questionnaire, Social Science Quarterly 51, 409–414.
[60] Mason, W.S., Dressel, R.J. & Bain, R.K. (1961). An experimental study of factors affecting response to a mail survey of beginning teachers, Public Opinion Quarterly 25, 296–299.
[61] McCrohan, K.F. & Lowe, L.S. (1981). A cost/benefit approach to postage used on mail questionnaires, Journal of Marketing 45, 130–133.
[62] McDaniel, S.W. & Jackson, R.W. (1981). An investigation of respondent anonymity's effect on mailed questionnaire response rate and quality, Journal of Market Research Society 23, 150–160.
[63] Myers, J.H. & Haug, A.F. (1969). How a preliminary letter affects mail survey return and costs, Journal of Advertising Research 9, 37–39.
[64] Nevin, J.R. & Ford, N.M. (1976). Effects of a deadline and a veiled threat on mail survey responses, Journal of Applied Psychology 61, 116–118.
[65] Parsons, R.J. & Medford, T.S. (1972). The effect of advanced notice in mail surveys of homogeneous groups, Public Opinion Quarterly 36, 258–259.
[66] Pearlin, L.I. (1961). The appeals of anonymity in questionnaire response, Public Opinion Quarterly 25, 640–647.
[67] Price, D.O. (1950). On the use of stamped return envelopes with mail questionnaires, American Sociological Review 15, 672–673.
[68] Roberts, R.E., McCrory, O.F. & Forthofer, R.N. (1978). Further evidence on using a deadline to stimulate responses to a mail survey, Public Opinion Quarterly 42, 407–410.
[69] Rosen, N. (1960). Anonymity and attitude measurement, Public Opinion Quarterly 24, 675–680.
[70] Rucker, M., Hughes, R., Thompson, R., Harrison, A. & Vanderlip, N. (1984). Personalization of mail surveys: too much of a good thing? Educational and Psychological Measurement 44, 893–905.
[71] Schegelmilch, B.B. & Diamantopoulos, S. (1991). Prenotification and mail survey response rates: a quantitative integration of the literature, Journal of the Market Research Society 33, 243–255.
[72] Schewe, C.D. & Cournoyer, N.D. (1976). Prepaid vs. promised incentives to questionnaire response: further evidence, Public Opinion Quarterly 40, 105–107.
[73] Scott, C. (1961). Research on mail surveys, Journal of the Royal Statistical Society, Series A, Part 2 124, 143–205.
[74] Simon, R. (1967). Responses to personal and form letters in mail surveys, Journal of Advertising Research 7, 28–30.
[75] Stafford, J.E. (1966). Influence of preliminary contact on mail returns, Journal of Marketing Research 3, 410–411.
[76] Vocino, T. (1977). Three variables in stimulating responses to mailed questionnaires, Journal of Marketing 41, 76–77.
[77] Walker, B.J. & Burdick, R.K. (1977). Advance correspondence and error in mail surveys, Journal of Marketing Research 14, 379–382.
[78] Wildman, R.C. (1977). Effects of anonymity and social settings on survey responses, Public Opinion Quarterly 41, 74–79.
[79] Wotruba, T.R. (1966). Monetary inducements and mail questionnaire response, Journal of Marketing Research 3, 398–400.
[80] Wynn, G.W. & McDaniel, S.W. (1985). The effect of alternative foot-in-the-door manipulations on mailed questionnaire response rate and quality, Journal of the Market Research Society 27, 15–26.
[81] Yammarino, F.J., Skinner, S.J. & Childers, T.L. (1991). Understanding mail survey response behavior, Public Opinion Quarterly 55, 613–639.
[82] Yu, J. & Cooper, H. (1983). A quantitative review of research design effects on response rates to questionnaires, Journal of Marketing Research 20, 36–44.
(See also Survey Sampling Procedures) THOMAS W. MANGIONE
Mallows' Cp Statistic
CHARLES E. LANCE
Volume 3, pp. 1119–1120 in Encyclopedia of Statistics in Behavioral Science
ISBN-13: 978-0-470-86080-9  ISBN-10: 0-470-86080-4
Editors: Brian S. Everitt & David C. Howell
John Wiley & Sons, Ltd, Chichester, 2005
Mallows' Cp Statistic

A common problem in applications of multiple linear regression analysis in the behavioral sciences is that of predictor subset selection [10]. The goal of subset selection, also referred to as variable selection (e.g., [4]) or model selection (e.g., [13]), is to choose a smaller subset of predictors from a relatively larger number that is available so that the resulting regression model is parsimonious, yet has good predictive ability [3, 4, 12]. The problem of subset selection arises, for example, when a researcher seeks a predictive model that cross-validates well (see Cross-validation and [1, 11]), or when there is redundancy amongst the predictors leading to multicollinearity [3]. There are several approaches to predictor subset selection, including forward selection, backward elimination, and stepwise regression [2, 10]. A fourth, 'all possible subsets' procedure fits all $2^k - 1$ distinct models to determine a best fitting model (BFM) on the basis of some statistical criterion. A number of such criteria for choosing a BFM can be considered [5], including $R^2$, one of several forms of $R^2$ adjusted for the number of predictors [11], the mean squared error, or one of a number of criteria based on information theory (e.g., Akaike's criterion, [6]). Mallows' [8, 9] Cp statistic is one such criterion that is related to this latter class of indices. For any model containing a subset of p predictors from the total number of k predictors, Mallows' Cp can be written as

$$C_p = \frac{\mathrm{RSS}_p}{\mathrm{MSE}_k} + 2p - n, \qquad (1)$$

where $\mathrm{RSS}_p$ is the residual sum of squares for the p-variable model, $\mathrm{MSE}_k$ is the mean squared error for the full (k-variable) model, and n is sample size. As such, Cp indexes the mean squared error of prediction for 'subset' models relative to the full model with a penalty for inclusion of unimportant predictors [7]. Because $\mathrm{MSE}_k$ is usually estimated as $\mathrm{RSS}_k/(n - k)$, Cp can also be written as

$$C_p = (n - k)\,\frac{\mathrm{RSS}_p}{\mathrm{RSS}_k} + 2p - n. \qquad (2)$$

Note that if the $R^2$ for a p-variable model is substantially less than the $R^2$ for the full k-variable model (i.e., 'important' variables are excluded from the subset), $\mathrm{RSS}_p/\mathrm{RSS}_k$ will be large compared to the situation in which the p-variable subset model includes all or most of the 'important' predictors. In the latter case, the $\mathrm{RSS}_p/\mathrm{RSS}_k$ ratio approaches 1.00, and if the model is relatively parsimonious, a large 'penalty' (in the form of 2p) is not invoked and Cp is small. Note, however, that for the full model, p = k, so that $\mathrm{RSS}_p/\mathrm{RSS}_k = 1$ and Cp necessarily equals p. In practice, for models with p < k predictors, variable subsets with lower Cp values (around p or less) indicate preferred subset models. It is commonly recommended to plot models' Cp as a function of p and choose the predictor subset with minimum Cp as the preferred subset model (see [3], [8], and [9] for examples).

References

[1] Camstra, A. & Boomsma, A. (1992). Cross-validation in regression and covariance structure analysis: an overview, Sociological Methods & Research 21, 89–115.
[2] Cohen, J., Cohen, P., West, S.G. & Aiken, L.S. (2003). Applied Multiple Regression/Correlation Analysis for the Behavioral Sciences, 3rd Edition, Lawrence Erlbaum, Mahwah.
[3] Fox, J. (1991). Regression Diagnostics, Sage Publications, Newbury Park.
[4] George, E.I. (2000). The variable selection problem, Journal of the American Statistical Association 95, 1304–1308.
[5] Hocking, R.R. (1972). Criteria for selection of a subset regression: which one should be used? Technometrics 14, 967–970.
[6] Honda, Y. (1994). Information criteria in identifying regression models, Communications in Statistics – Theory and Methods 23, 2815–2829.
[7] Kobayashi, M. & Sakata, S. (1990). Mallows' Cp criterion and unbiasedness of model selection, Journal of Econometrics 45, 385–395.
[8] Mallows, C.L. (1973). Some comments on Cp, Technometrics 15, 661–675.
[9] Mallows, C.L. (1995). More comments on Cp, Technometrics 37, 362–372.
[10] Miller, A.J. (1990). Subset Selection in Regression, Chapman and Hall, London.
[11] Raju, N.S., Bilgic, R., Edwards, J.E. & Fleer, P.F. (1997). Methodology review: estimation of population validity and cross-validity, and the use of equal weights in prediction, Applied Psychological Measurement 21, 291–305.
[12] Thompson, M.L. (1978). Selection of variables in multiple regression: part I. A review and evaluation, International Statistical Review 46, 1–19.
[13] Zuccaro, C. (1992). Mallows' Cp statistic and model selection in multiple linear regression, Journal of the Market Research Society 34, 163–172.

CHARLES E. LANCE
Mantel–Haenszel Methods
ÁNGEL M. FIDALGO
Volume 3, pp. 1120–1126 in Encyclopedia of Statistics in Behavioral Science ISBN-13: 978-0-470-86080-9 ISBN-10: 0-470-86080-4 Editors Brian S. Everitt & David C. Howell John Wiley & Sons, Ltd, Chichester, 2005
Mantel–Haenszel Methods
Numerous studies within the behavioral sciences have attempted to determine the degree of association between an explanatory variable, called the factor (e.g., treatments, potentially harmful exposure), and another variable that is assumed to be determined by or dependent upon the factor variable, called the response variable (e.g., degree of improvement, state of health). Moreover, quite frequently, researchers wish to control the modulating effect of other variables, known as stratification variables or covariables (e.g., age, level of illness, gender), on the relationship between factor and response (see Analysis of Covariance). Stratification variables may be the result of the research design, as in a multicenter clinical trial in which the strata correspond to the different hospitals where the treatments have been applied, or of a posteriori considerations made after the study data have been obtained. In any case, when factor and response variables are reported on categorical measurement scales (see Categorizing Data), either nominal or ordinal, the resulting data can be summarized in contingency tables, and the methods based on the work of Cochran [4] and Mantel and Haenszel [16] are commonly used for their analysis. In a general way, it can be said that these methods provide measures and significance tests of two-way association that control for the effects of covariables. The null hypothesis (H0) they test is that of ‘partial no-association’, which establishes that in each one of the strata of the covariable, the response variable is distributed randomly with respect to the levels of the factor. In the simplest case, both factor and response are dichotomous variables, and the data can be summarized in a set of Q 2 × 2 contingency tables (denoted by Q : 2 × 2), where each table corresponds to a stratum or level of the covariable, or to each combination of levels in the case of there being several covariables (see Two by Two Contingency Tables). To establish notation, let n_hij denote the frequency count of observations occurring at the ith factor level (row), (i = 1, . . . , R = 2), the jth level of the response variable (column), (j = 1, . . . , C = 2), and the hth level of the covariable or stratum, (h = 1, . . . , Q). In Table 1, this notation is applied to a typical study on risk factors and state of health.

Table 1  The 2 × 2 contingency table for the hth stratum

                                 Health status
  Exposure status     Condition present   Condition absent   Total
  Not exposed         n_h11               n_h12              N_h1·
  Exposed             n_h21               n_h22              N_h2·
  Total               N_h·1               N_h·2              N_h··
In sets of 2 × 2 tables, Mantel–Haenszel (MH) statistics are asymptotic tests that follow a chi-squared distribution (see Catalogue of Probability Density Functions) with one degree of freedom (df) and that, in general, can be expressed thus:

$$\chi^2_{MH} = \frac{\left( \sum_{h=1}^{Q} a - \sum_{h=1}^{Q} E(A) \right)^2}{\sum_{h=1}^{Q} \operatorname{var}(A)}. \qquad (1)$$
Here, a is the observed frequency in one of the cells of the table, E(A) is the frequency expected under H0, and var(A) is the variance of that frequency under H0, all summed across the strata. The strategy used for testing H0 is as simple as comparing the frequencies observed in the contingency table (n_hij) with the frequencies that could be expected in the case of there being no relationship between factor and response. Naturally, in order to carry out the test, we need to establish a probabilistic model that determines the probability of obtaining a under H0. According to the assumed probability model, E(A) and var(A) will adopt different forms. Mantel and Haenszel [16], conditioning on the row and column totals (N_h1·, N_h2·, N_h·1, and N_h·2), that is, taking those values as fixed, establish that, under H0, the observed frequencies in each table (n_hij) follow the multiple hypergeometric probability model (see Catalogue of Probability Density Functions):

$$\Pr(n_{hij} \mid H_0) = \frac{\prod_{i=1}^{R} N_{hi\cdot}!\; \prod_{j=1}^{C} N_{h\cdot j}!}{N_{h\cdot\cdot}!\; \prod_{i=1}^{R}\prod_{j=1}^{C} n_{hij}!}. \qquad (2)$$
Furthermore, on assuming that the marginal totals (N_hi·) and (N_h·j) are fixed, this distribution can be expressed in terms of the count n_h11 alone, since it determines the rest of the cell counts (n_h12, n_h21, and n_h22). Under H0, the hypergeometric mean and variance of n_h11 are E(n_h11) = N_h1· N_h·1/N_h·· and var(n_h11) = N_h1· N_h2· N_h·1 N_h·2/[N²_h·· (N_h·· − 1)], respectively. We thus obtain the MH chi-squared test:

$$\chi^2_{MH} = \frac{\left( \left| \sum_{h=1}^{Q} n_{h11} - \sum_{h=1}^{Q} E(n_{h11}) \right| - 0.5 \right)^2}{\sum_{h=1}^{Q} \operatorname{var}(n_{h11})}. \qquad (3)$$
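As a computational illustration (not part of the original article), the following Python sketch evaluates (3) for a set of 2 × 2 strata; the example tables are made up.

```python
# A minimal sketch of the Mantel-Haenszel chi-square of (3); each stratum is
# a 2 x 2 table laid out as [[n11, n12], [n21, n22]].
import numpy as np

def mh_chi_square(tables, continuity=0.5):
    a = e = v = 0.0
    for t in tables:
        t = np.asarray(t, dtype=float)
        row1, row2 = t[0].sum(), t[1].sum()
        col1, col2 = t[:, 0].sum(), t[:, 1].sum()
        n = t.sum()
        a += t[0, 0]                                   # observed n_h11
        e += row1 * col1 / n                           # hypergeometric mean
        v += row1 * row2 * col1 * col2 / (n ** 2 * (n - 1))   # hypergeometric variance
    return (abs(a - e) - continuity) ** 2 / v

# Two hypothetical strata
strata = [[[20, 10], [15, 25]], [[30, 20], [10, 25]]]
print(round(mh_chi_square(strata), 2))
```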
Cochran [4], taking only the total sample in each stratum (N_h··) as fixed and assuming a binomial distribution, proposed a statistic that was quite similar to χ²_MH. The Cochran statistic can be rewritten in terms of (3), eliminating from it the continuity correction (0.5) and substituting var(n_h11) by var*(n_h11) = N_h1· N_h2· N_h·1 N_h·2/N³_h··. The essential equivalence between the two statistics when we have moderate sample sizes in each stratum means that the techniques expressed here are frequently referred to as the Cochran–Mantel–Haenszel methods. Nevertheless, statistic (3) is preferable, since it only demands that the combined row sample sizes N_·i· = Σ_{h=1}^{Q} Σ_{j=1}^{C} n_hij be large (e.g., N_·i· > 30), while that of Cochran demands moderate to large sample sizes (e.g., N_h·· > 20) in each table. A more precise criterion in terms of sample demands [15] recommends applying χ²_MH only if both of the expressions

$$\sum_{h=1}^{Q} E(n_{h11}) - \sum_{h=1}^{Q} \max(0,\, N_{h1\cdot} - N_{h\cdot 2}) \quad \text{and} \quad \sum_{h=1}^{Q} \min(N_{h\cdot 1},\, N_{h1\cdot}) - \sum_{h=1}^{Q} E(n_{h11}) \qquad (4)$$

are over five. When the sample requirements are not fulfilled, either because of the small sample size or because of the highly skewed observed table margins, we will have to use exact tests, such as those of Fisher or Birch [1]. In the case of rejecting H0 at the α significance level (if χ²_MH ≥ χ²_α(1), the critical chi-squared value with 1 df), the following step
is to determine the degree of association between factor and response. Among the numerous measures of association available for contingency tables (rate ratio, relative risk, prevalence ratio, etc.), Mantel and Haenszel [16] employed odds ratios. The characteristics of this measure of association are examined as follows. It should be noted that if the variables were independent, the odds of the probability (π) of responding in column 1 instead of column 2 (π_hi1/π_hi2) would be equal at all the levels of the factor. Therefore, the ratio of the odds, referred to as the odds ratio {α = (π_h11/π_h12)/(π_h21/π_h22)} or cross-product ratio (α = π_h11 π_h22/π_h12 π_h21), will be 1. Assuming homogeneity of the odds ratios of each stratum (α_1 = α_2 = · · · = α_Q), the MH measure of association calculated across all 2 × 2 contingency tables is the common odds ratio estimator (α̂_MH), given by

$$\hat{\alpha}_{MH} = \frac{\sum_{h=1}^{Q} R_h}{\sum_{h=1}^{Q} S_h} = \frac{\sum_{h=1}^{Q} n_{h11}\, n_{h22}/N_{h\cdot\cdot}}{\sum_{h=1}^{Q} n_{h12}\, n_{h21}/N_{h\cdot\cdot}}. \qquad (5)$$
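A short sketch (again not part of the original article) of how (5) is computed from a set of stratified 2 × 2 tables; the strata used here are invented.

```python
# A minimal sketch of the Mantel-Haenszel common odds ratio estimator of (5).
import numpy as np

def mh_common_odds_ratio(tables):
    R = S = 0.0
    for t in tables:
        t = np.asarray(t, dtype=float)
        n = t.sum()
        R += t[0, 0] * t[1, 1] / n   # R_h = n_h11 * n_h22 / N_h..
        S += t[0, 1] * t[1, 0] / n   # S_h = n_h12 * n_h21 / N_h..
    return R / S

strata = [[[20, 10], [15, 25]], [[30, 20], [10, 25]]]
print(round(mh_common_odds_ratio(strata), 2))
```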
The range of α̂_MH varies between 0 and ∞. A value of 1 represents the hypothesis of no association. If 1 < α̂_MH < ∞, the first response is more likely in row 1 than in row 2; if 0 < α̂_MH < 1, the first response is less likely in row 1 than in row 2. Because of the skewness of the distribution of α̂_MH, it is more convenient to use ln(α̂_MH), the natural logarithm of α̂_MH. In this case, independence corresponds to ln(α̂_MH) = 0, the ln of the common odds ratio being symmetrical about this value. This means that, for ln(α̂_MH), two values that are equal except for their sign, such as ln(2) = 0.69 and ln(0.5) = −0.69, represent the same degree of association. For constructing confidence intervals around α̂_MH, we need an estimator of its variance, and that which presents the best properties [12] is that of Robins, Breslow, and Greenland [18]:

$$\operatorname{var}(\hat{\alpha}_{MH}) = (\hat{\alpha}_{MH})^2 \left[ \frac{\sum_{h=1}^{Q} P_h R_h}{2\left(\sum_{h=1}^{Q} R_h\right)^2} + \frac{\sum_{h=1}^{Q} (P_h S_h + Q_h R_h)}{2 \sum_{h=1}^{Q} R_h \sum_{h=1}^{Q} S_h} + \frac{\sum_{h=1}^{Q} Q_h S_h}{2\left(\sum_{h=1}^{Q} S_h\right)^2} \right], \qquad (6)$$

where P_h = (n_h11 + n_h22)/N_h·· and Q_h = (n_h12 + n_h21)/N_h·· (R_h and S_h being the stratum terms in (5)). If we construct the intervals on the basis of ln(α̂_MH), we must make the following adjustment: var[ln(α̂_MH)] = var(α̂_MH)/(α̂_MH)². Thus, the 100(1 − α)% confidence interval for ln(α̂_MH) will be equal to ln(α̂_MH) ± z_{α/2} √var[ln(α̂_MH)]. For α̂_MH, the 100(1 − α)% confidence limits will be equal to exp(ln(α̂_MH) ± z_{α/2} √var[ln(α̂_MH)]). In relation to these aspects, it should be pointed out that nonfulfillment of the assumption of homogeneity of the odds ratios (α_1 = α_2 = · · · = α_Q) does not invalidate α̂_MH as a measure of association; even so, given that it is a weighted average of the stratum-specific odds ratios, it makes its interpretation difficult. In the case that the individual odds ratios differ substantially in direction, it is preferable to use the information provided by these odds ratios than to use α̂_MH. Likewise, nonfulfillment of the assumption of homogeneity does not invalidate χ²_MH as a test of association, though it does reduce its statistical power [2]. Breslow and Day [3, 20] provide a test for checking the mentioned assumption (see Breslow–Day Statistic). The statistics shown above are applied to the particular case of sets of 2 × 2 contingency tables. Fortunately, from the outset, various extensions have been proposed for these statistics [14, 16], all of them being particular cases of the analysis of sets of contingency tables with dimensions Q : R × C.
The data structure for this general contingency table is shown in Table 2.

Table 2  Data structure in the hth stratum

                          Response variable categories
  Factor levels     1        2       ·    j        ·    C        Total
  1                 n_h11    n_h12   ·    n_h1j    ·    n_h1C    N_h1·
  2                 n_h21    n_h22   ·    n_h2j    ·    n_h2C    N_h2·
  ...               ...      ...          ...           ...      ...
  i                 n_hi1    n_hi2   ·    n_hij    ·    n_hiC    N_hi·
  ...               ...      ...          ...           ...      ...
  R                 n_hR1    n_hR2   ·    n_hRj    ·    n_hRC    N_hR·
  Total             N_h·1    N_h·2   ·    N_h·j    ·    N_h·C    N_h··

In the general case, the H0 of no-association will be tested against different alternative hypotheses (H1) that will be a function of the scale on which factor and response are measured. Thus, we shall have a variety of statistics that will serve for detecting general association (both variables are nominal), mean score differences (factor is nominal and response ordinal), and linear correlation (both variables are ordinal). The standard generalized Mantel–Haenszel test is defined, in terms of matrices, by Landis, Heyman, and Koch [13], as

$$Q_{GMH} = \left[\sum_{h=1}^{Q} (\mathbf{n}_h - \mathbf{m}_h)' A_h'\right] \left[\sum_{h=1}^{Q} A_h V_h A_h'\right]^{-1} \left[\sum_{h=1}^{Q} A_h (\mathbf{n}_h - \mathbf{m}_h)\right], \qquad (7)$$

where n_h, m_h, V_h, and A_h are, respectively, the vector of observed frequencies, the vector of expected frequencies, the covariance matrix, and a matrix of linear functions defined in accordance with the H1 of interest. From Table 2, these vectors are defined as n_h = (n_h11, n_h21, . . . , n_hRC)′; m_h = N_h·· (p_h·∗ ⊗ p_h∗·), p_h·∗ and p_h∗· being vectors with the marginal row proportions (p_hi· = N_hi·/N_h··) and the marginal column proportions (p_h·j = N_h·j/N_h··), and ⊗ denoting the Kronecker product multiplication; V_h = [N²_h··/(N_h·· − 1)]{(D_{p_h·∗} − p_h·∗ p_h·∗′) ⊗ (D_{p_h∗·} − p_h∗· p_h∗·′)}, where D_p is a diagonal matrix with the elements of the vector p on its main diagonal. As has been pointed out, depending on the measurement scale of factor and response, (7) will be resolved, via definition of the matrix A_h (A_h =
C_h ⊗ R_h), in a different statistic for detecting each H1. Briefly, these are as follows:

QGMH(1). When the row variable and the column variable are nominal, the H1 specifies that the distribution of the response variable differs in nonspecific patterns across levels of the row factor. Here, R_h = [I_{R−1}, −J_{R−1}] and C_h = [I_{C−1}, −J_{C−1}], where I_{R−1} is an identity matrix and J_{R−1} is an (R − 1) × 1 vector of ones. Under H0, QGMH(1) follows approximately a chi-squared distribution with df = (R − 1)(C − 1).

QGMH(2). When only the column variable is ordinal, the H1 establishes that the mean responses differ across the factor levels, R_h being the same as that used in the previous case and C_h = (c_h1, . . . , c_hC), where c_hj is an appropriate score reflecting the ordinal nature of the jth category of response for the hth stratum. Selection of the values of C_h admits different possibilities that are well described in [13]. Under H0, QGMH(2) has approximately a chi-squared distribution with df = (R − 1).

QGMH(3). If both variables are ordinal, the H1 establishes the existence of a linear progression (linear trend) in the mean responses across the levels of the factor (see Trend Tests for Counts and Proportions). In this case, C_h can be defined as the same as that for the mean responses difference and R_h = (r_h1, . . . , r_hR), where r_hi is an appropriate score reflecting the ordinal nature of the ith factor level for the hth stratum. Under H0, QGMH(3) has approximately a chi-squared distribution with df = 1.

It should be noted how the successive H1s specify more and more restrictive patterns of association, so that each statistic increases the statistical power with respect to the previous ones for detecting its particular pattern of association. For example, QGMH(1) can detect linear patterns of association, but it will do so with less power than QGMH(3). Furthermore, the increase in power of QGMH(3) compared to QGMH(1) is achieved at the cost of an inability to detect more complex patterns of association. Obviously, when C = R = 2, QGMH(1) = QGMH(2) = QGMH(3) = χ²_MH, except for the lack of the continuity correction. While MH methods have satisfactorily resolved the testing of the H0 in the general case, to date
there is no estimator of the degree of association that is completely generalizable for Q : R × C tables. The interested reader can find generalizations, always complex, of α̂_MH to Q : 2 × C tables (C > 2) in [10], [17], and [21]. The range of application of the methodology presented in this article is enormous: it is widely used in epidemiology, meta-analysis, the analysis of survival data (see Survival Analysis) (where it is known by the name of the log-rank test), and psychometric research on differential item functioning [5–7, 9]. This is undoubtedly due to its simplicity and flexibility and to its minimal demands for guaranteeing the validity of its results: on the one hand, it requires only that the sample size summed across the strata be sufficiently large for asymptotic results to hold (such that the MH test statistic can perfectly well be applied to matched case-control studies with only two subjects in each of the Q tables, as long as the number of tables is large); on the other hand, it permits the use of samples of convenience because it does not assume a known sampling link to a larger reference population. This is possible thanks to the fact that the H0 of interest – that the distribution of the responses is random with respect to the levels of the factor – induces a probabilistic structure (the multiple hypergeometric distribution) that allows for judgment of its compatibility with the observed data without the need for external assumptions. Thus, it can be applied to experimental designs, group designs, and repeated-measures designs [12, 19, 22, 23] (see Repeated Measures Analysis of Variance), and also to designs based on observation or of a historical nature, such as retrospective studies, nonrandomized studies, or case–control studies [11], regardless of how the sampling was carried out. This undoubted advantage is offset by a disadvantage: given that the probability distributions employed are determined by the observed data (see (1)), the conclusions obtained will apply only to the sample under study. Consequently, generalization of the results to the target population should be based on arguments about the representativeness of the sample [11].
Example

We shall illustrate the use of the Mantel–Haenszel methods on the basis of the results of a survey on bullying and harassment at work (mobbing) carried out in Spain [8]. Table 3 shows the number of people in the survey who over a period of 6 months or more have been bullied at least once a week (bullied), and the number of those not fulfilling this criterion (no bullying).

Table 3  Characteristics of mobbing victims

  Gender   Job category              No bullying   Bullied   Total   % bullied
  Men      Manual                    148           28        176     15.9
           Clerical                   65           13         78     16.7
           Specialized technician    121           18        139     12.9
           Middle manager             95            7        102      6.9
           Manager/executive          29            2         31      6.5
  Women    Manual                     98           22        120     18.3
           Clerical                  144           32        176     18.2
           Specialized technician     43           10         53     18.9
           Middle manager             38            7         45     15.6
           Manager/executive           8            1          9     11.1

In order to employ all the statistics considered, we shall analyze the data in different ways. Let us suppose that the main objective of the research is to determine whether there is a relationship between job category and mobbing. Moreover, as it is suspected that gender may be related to mobbing, we decided to control the effect of this variable. The result of applying the statistic χ²_MH, without the continuity correction, to the following contingency table

            No bullying   Bullied   Total
  Men       458           68        526
  Women     331           72        403
  Total     789           140       929
indicates the pertinence of this adjustment. Indeed, we find that χ²_MH takes the value 4.34, which has a p value of 0.037 with 1 df. Thus, assuming a significance level of .05, we reject the H0 of ‘no partial association’ in favor of the alternative that suggests that bullying behavior is not distributed homogeneously between men and women. A value of 1.47 in α̂_MH indicates that women have a higher probability of suffering mobbing than men do. The 95% confidence interval for the common odds ratio estimator, using the variance in (6) with a result of 0.07, is (1.02, 2.10). In the case of transforming α̂_MH to the logarithmic scale [ln(α̂_MH) = 0.38], the variance would equal 0.03, and the confidence interval, (0.02, 0.74). Note how 1.02 = exp(0.02) and 2.10 = exp(0.74). It should also be noted how our data satisfy the sample demands of the Mantel–Fleiss criterion (4): Σ_{h=1}^{1} E(n_h11) = 446.73, Σ_{h=1}^{1} max(0, 526 − 140) = 386, and Σ_{h=1}^{1} min(789, 526) = 526, so that (446.73 − 386) > 5 and (526 − 446.73) > 5.
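The two headline figures can be checked with a few lines of code; the following sketch (not part of the original article) recomputes the chi-square (without the continuity correction) and the odds ratio for this single-stratum table.

```python
# A rough check of the single-stratum example above.
import numpy as np

table = np.array([[458.0, 68.0],    # men: no bullying, bullied
                  [331.0, 72.0]])   # women: no bullying, bullied
n = table.sum()
e11 = table[0].sum() * table[:, 0].sum() / n
v11 = (table[0].sum() * table[1].sum() *
       table[:, 0].sum() * table[:, 1].sum()) / (n ** 2 * (n - 1))
chi2 = (table[0, 0] - e11) ** 2 / v11          # no 0.5 correction, as in the text
odds_ratio = (table[0, 0] * table[1, 1]) / (table[0, 1] * table[1, 0])
print(round(chi2, 2), round(odds_ratio, 2))    # approximately 4.34 and 1.47
```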
In line with the research objective, Table 4 shows the results of the Mantel–Haenszel statistics for association between job category and mobbing adjusted for gender; also shown is the linear operator matrix defined in accordance with the H1s of general association (QGMH(1)), mean row scores difference (QGMH(2)), and trend in mean scores (QGMH(3)).

Table 4  Results of the Mantel–Haenszel statistics with the respective linear operator matrices

  Statistic   Value   df   p     A_h = C_h ⊗ R_h
  QGMH(1)     5.74    4    .22   [1  −1] ⊗ [I_4  −J_4]
  QGMH(2)     5.74    4    .22   [1  2] ⊗ [I_4  −J_4]
  QGMH(3)     4.70    1    .03   [1  2] ⊗ [1  2  3  4  5]

(Here I_4 denotes the 4 × 4 identity matrix and J_4 a 4 × 1 vector of ones.)

Considering the data in Table 4, a series of points can be made. First, when the response variable is dichotomous, as is the case here, QGMH(1) and QGMH(2) offer the same results. Second, if we suppose that both the response variable and the factor are ordinal variables, then there exists a statistically significant linear relationship (p < .05) between job category and level of harassment or bullying, once the effect of gender is controlled; more specifically, the higher the level of occupation, the less bullying there is. Third, as we can see, QGMH(3) is the most powerful statistic, via reduction of df, for detecting the linear relationship between factor and response variable. Finally, we can conclude that almost 82% (4.70/5.74) of the nonspecific difference in mean scores can be explained by the linear tendency. Note: The software for calculating the generalized Mantel–Haenszel statistics is available upon request from the author.
References

[1] Agresti, A. (1992). A survey of exact inference for contingency tables, Statistical Science 7, 131–177.
[2] Birch, M.W. (1964). The detection of partial association, I: the 2 × 2 case, Journal of the Royal Statistical Society, Series B 26, 313–324.
[3] Breslow, N.E. & Day, N.E. (1980). Statistical Methods in Cancer Research I: The Analysis of Case-Control Studies, International Agency for Research on Cancer, Lyon.
[4] Cochran, W.G. (1954). Some methods for strengthening the common χ² test, Biometrics 10, 417–451.
[5] Fidalgo, A.M. (1994). MHDIF: a computer program for detecting uniform and nonuniform differential item functioning with the Mantel-Haenszel procedure, Applied Psychological Measurement 18, 300.
[6] Fidalgo, A.M., Ferreres, D. & Muñiz, J. (2004). Utility of the Mantel-Haenszel procedure for detecting differential item functioning with small samples, Educational and Psychological Measurement 64(6), 925–936.
[7] Fidalgo, A.M., Mellenbergh, G.J. & Muñiz, J. (2000). Effects of amount of DIF, test length, and purification type on robustness and power of Mantel-Haenszel procedures, Methods of Psychological Research Online 5(3), 43–53.
[8] Fidalgo, A.M. & Piñuel, I. (2004). La escala Cisneros como herramienta de valoración del mobbing [Cisneros scale to assess psychological harassment or mobbing at work], Psicothema 16, 615–624.
[9] Holland, W.P. & Thayer, D.T. (1988). Differential item performance and the Mantel-Haenszel procedure, in Test Validity, H. Wainer & H.I. Braun, eds, LEA, Hillsdale, pp. 129–145.
[10] Greenland, S. (1989). Generalized Mantel-Haenszel estimators for K 2 × J tables, Biometrics 45, 183–191.
[11] Koch, G.G., Gillings, D.B. & Stokes, M.E. (1980). Biostatistical implications of design, sampling and measurement to health science data analysis, Annual Review of Public Health 1, 163–225.
[12] Kuritz, S.J., Landis, J.R. & Koch, G.G. (1988). A general overview of Mantel-Haenszel methods: applications and recent developments, Annual Review of Public Health 9, 123–160.
[13] Landis, J.R., Heyman, E.R. & Koch, G.G. (1978). Average partial association in three-way contingency tables: a review and discussion of alternative tests, International Statistical Review 46, 237–254.
[14] Mantel, N. (1963). Chi-square tests with one degree of freedom; extension of the Mantel-Haenszel procedure, Journal of the American Statistical Association 58, 690–700.
[15] Mantel, N. & Fleiss, J.L. (1980). Minimum expected cell size requirements for the Mantel-Haenszel one-degree-of-freedom chi-square test and a related rapid procedure, American Journal of Epidemiology 112, 129–143.
[16] Mantel, N. & Haenszel, W. (1959). Statistical aspects of the analysis of data from retrospective studies of disease, Journal of the National Cancer Institute 22, 719–748.
[17] Mickey, R.M. & Elashoff, R.M. (1985). A generalization of the Mantel-Haenszel estimator of partial association for 2 × J × K tables, Biometrics 41, 623–635.
[18] Robins, J., Breslow, N. & Greenland, S. (1986). Estimators of the Mantel-Haenszel variance consistent in both sparse data and large-strata limiting models, Biometrics 42, 311–324.
[19] Stokes, M.E., Davis, C.S. & Koch, G. (2000). Categorical Data Analysis Using the SAS System, 2nd Edition, SAS Institute Inc., Cary.
[20] Tarone, R.E., Gart, J.J. & Hauck, W.W. (1983). On the asymptotic inefficiency of certain noniterative estimators of a common relative risk or odds ratio, Biometrika 70, 519–522.
[21] Yanagawa, T. & Fujii, Y. (1995). Projection-method Mantel-Haenszel estimator for K 2 × J tables, Journal of the American Statistical Association 90, 649–656.
[22] Zhang, J. & Boos, D.D. (1996). Mantel-Haenszel test statistics for correlated binary data, Biometrics 53, 1185–1198.
[23] Zhang, J. & Boos, D.D. (1997). Generalized Cochran-Mantel-Haenszel test statistics for correlated categorical data, Communications in Statistics: Theory and Methods 26(8), 1813–1837.
ÁNGEL M. FIDALGO
Marginal Independence
VANCE W. BERGER AND JIALU ZHANG
Volume 3, pp. 1126–1128 in Encyclopedia of Statistics in Behavioral Science ISBN-13: 978-0-470-86080-9 ISBN-10: 0-470-86080-4 Editors Brian S. Everitt & David C. Howell John Wiley & Sons, Ltd, Chichester, 2005
Marginal Independence The definition of statistical independence is rather straightforward. Specifically, if f (x,y|θ) denotes the joint distribution of random variables X and Y , then the marginal distributions of X and Y can be written as f (x|θ) = ∫ f (x,y|θ) dy and f (y|θ) = ∫ f (x,y|θ) dx, respectively, where θ is some parameter in the distribution functions, and in the discrete case one would use summation instead of integration (see Contingency Tables). Then the variables X and Y are independent if their joint distribution is the product of the two marginal distributions, f (x,y|θ) = f (x|θ) f (y|θ). The intuition here is that knowledge of one of the variables offers no knowledge of the other. Yet the simplicity of this definition belies the complexity that may arise when it is applied in practice. For example, it may seem intuitively obvious that shoe size and intelligence should not be correlated or associated. That is, they should be independent. This may be true when considering a given cohort defined by age group, but when the age is allowed to vary, it is not hard to see that older children would be expected to have both larger feet and more education than younger children. This confounding variable, age, could then lead to the finding that children with larger shoe sizes appear to be more intelligent than children with smaller shoe sizes. Conversely, one would expect height and weight to be positively associated, and not independent, because all things considered, taller individuals tend to also be heavier. However, this assumes that we are talking about the relation between height and weight among a set of distinct individuals. It is also possible to consider the relation between height and weight for a given individual over time. While many individuals will gain weight as they age and get taller over a period of years, one may also consider the relation between height and weight for a given individual over the period of one day. That is, the height and weight of a given individual can be measured every hour during a given day. Among these measurements, height and weight may be independent. Finally, consider income and age. For any given individual who does not spend much time out of work, income would tend to rise over time, as the age also increases. As such, these two variables should have a positive association over time for a given
individual. Yet, to the extent that educational opportunities improve over time, a younger generation may have access to better paying positions than their older counterparts. This trend could compensate for, or even reverse, the association within individuals. We see, then, that the association between two variables depends very much on the context, and so the context must always be borne in mind when discussing association, causality, or independence. In fact, it is possible to create the appearance of a positive relation, either unwittingly or otherwise, when in fact no association exists. Berkson [4] demonstrated that when two events are independent in a population at large, they become negatively associated in a subpopulation characterized by the requirement of one of these events. That is, imagine a trauma center that handles only victims of gun-shot wounds and victims of car accidents, and suppose that in the population at large the two are independent. Then the probability of having been shot would not depend on whether one has been in a car accident. However, if one wanted to over-sample the cases, and study the relationship between these two events in our hypothetical trauma center, then one might find a negative association due to the missing cell corresponding to neither event. See Table 1. In the trauma center, there would be an empty upper-left cell, corresponding to neither the X-event nor the Y -event. That is, n11 would be replaced with zero, and this would create the appearance of a negative association when in fact none exists. It is also possible to intentionally create the appearance of such a positive association between treatment and outcome with strategic patient selection [2]. Unfortunately, there is no universal context, and marginal independence without consideration of covariates does not tell us much about association within levels defined by the covariates. Suppose, for example, that Table 1 turned out as shown in Table 2. In Table 2 X and Y are marginally independent because each cell entry (50) can be expressed
Table 1 x y 0 1 Sum
Creating dependence between variables X and Y 0
1
Sum
n11 = 0 n21 n+1 = n21
n12 n22 n+2
n1+ = n12 n2+ n++
2
Marginal Independence
Table 2 x y 0 1 Sum
Table 3 gender Males x y 0 1 Sum
Independence between variables X and Y 0
1
Sum
50 50 100
50 50 100
100 100 200
Independence between variables X and Y by
0
1
Sum
25 25 50
25 25 50
50 50 100
Females x y 0 1 Sum
0
1
sum
25 25 50
25 25 50
50 50 100
as the product of its row and column proportions, and the grand total, 200. That is, 50 = (100/200)(100/200)(200). In general, for discrete variables, X and Y are marginally independent if P (x, y) = P (x)P (y). But now consider a covariate. To illustrate, suppose that X is the treatment received, Y is the outcome, and Z is gender. From Table 2 we see that treatment and outcome are independent, but what does this really tell us? When gender is considered, the gender-specific tables may turn out to confirm this independence, as in Table 3. In Table 3, each treatment is equally effective for each gender, with a 50% response rate. It is also possible that Table 2 conceals a true treatment∗ gender interaction, so that each treatment is better for one gender, and overall they compensate so as to balance out. Consider, for example, Table 4. In Table 4, Treatment 0 is better for males, with a 60% response rate (compared to 40% for Treatment 1), but this is reversed for females. What may be more surprising is that, even with marginal independence, either treatment may still be better for all levels of the covariate. Consider, for example, Table 5. In Table 5 we see that more males take Treatment 0 (62%), more females take Treatment 1 (62%), and
Table 4 Compensating dependence between variables X and Y by gender Males x y 0 1 Sum
0
1
Sum
20 30 50
30 20 50
50 50 100
0
1
Sum
30 20 50
20 30 50
50 50 100
Females x y 0 1 Sum
Table 5 Males x y 0 1 Sum
Separation of variables X and Y by gender
0
1
Sum
18 44 62
2 36 38
20 80 100
0
1
Sum
32 6 38
48 14 62
80 20 100
Females x y 0 1 Sum
males have a better response rate than females for either treatment (44/62 vs. 6/38 for Treatment 0, and 36/38 vs. 14/62 for Treatment 1). However, it is also apparent that Treatment 1 is more effective than Treatment 0, both for males (36/38 vs. 44/62) and females (14/62 vs. 6/38). Such reinforcing effects can be masked when the two tables are combined to form Table 2; this is possible precisely because more of the better responders (males) took the worse treatment (Treatment 0), thereby making it look better than it should relative to the better treatment (Treatment 1). See [1], Paradoxes and [3] for more information regarding Simpson’s paradox (see Two by Two Contingency Tables). Clearly, different conclusions might be reached from the same data, depending on whether the marginal table or the partial tables are considered.
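The numerical pattern described above is easy to verify directly; the following sketch (not part of the original article) rebuilds Tables 2 and 5 and prints the gender-specific and combined success rates for each treatment.

```python
# A small numeric check of marginal independence with conditional dependence:
# rows are y = 0, 1 and columns are x = 0, 1, matching the article's tables.
import numpy as np

males   = np.array([[18, 2], [44, 36]])   # Table 5, males
females = np.array([[32, 48], [6, 14]])   # Table 5, females
combined = males + females                 # recovers Table 2: every cell equals 50

def success_rate(t, x):
    return t[1, x] / t[:, x].sum()         # P(Y = 1 | X = x)

print(combined)
for name, t in [("males", males), ("females", females), ("combined", combined)]:
    print(name, round(success_rate(t, 0), 2), round(success_rate(t, 1), 2))
# Treatment 1 is better within each gender, yet the combined table shows no association.
```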
References

[1] Baker, S.G. & Kramer, B.S. (2001). Good for women, good for men, bad for people: Simpson’s paradox and the importance of sex-specific analysis in observational studies, Journal of Women’s Health and Gender-Based Medicine 10, 867–872.
[2] Berger, V.W., Rezvani, A. & Makarewicz, V.A. (2003). Direct effect on validity of response run-in selection in clinical trials, Controlled Clinical Trials 24, 156–166.
[3] Berger, V.W. (2004). Valid adjustment of randomized comparisons for binary covariates, Biometrical Journal 46(5), 589–594.
[4] Berkson, J. (1946). Limitations of the application of fourfold table analysis to hospital data, Biometric Bulletin 2, 47–53.
VANCE W. BERGER AND JIALU ZHANG
Marginal Models for Clustered Data GARRETT M. FITZMAURICE Volume 3, pp. 1128–1133 in Encyclopedia of Statistics in Behavioral Science ISBN-13: 978-0-470-86080-9 ISBN-10: 0-470-86080-4 Editors Brian S. Everitt & David C. Howell John Wiley & Sons, Ltd, Chichester, 2005
Marginal Models for Clustered Data Introduction Clustered data commonly arise in studies in the behavioral sciences. For example, clustered data arise when observations are obtained on children nested within schools. Other common examples of naturally occurring clusters in the population are households, clinics, medical practices, and neighborhoods. Longitudinal and repeated measures studies (see Longitudinal Data Analysis and Repeated Measures Analysis of Variance) also give rise to clustered data. The main feature of clustered data that needs to be accounted for in any analysis is the fact that units from the same cluster are likely to respond in a more similar manner than units from different clusters. This intracluster correlation invalidates the crucial assumption of independence that is the cornerstone of so many standard statistical techniques. As a result, the straightforward application of standard regression models (e.g., multiple linear regression for a continuous response or logistic regression for a binary response) to clustered data is no longer valid unless some allowance for the clustering is made. There are a number of ways to extend regression models to handle clustered data. All of these procedures account for the within-cluster correlation among the responses, though they differ in approach. Moreover, the method of accounting for the withincluster association has important implications for the interpretation of the regression coefficients in models for discrete response data (e.g., binary data or counts). The need to distinguish models according to the interpretation of their regression coefficients has led to the use of the terms ‘marginal models’ and ‘mixed effects models’; the former are often referred to as ‘population-average models’, the latter as ‘cluster-specific models’ (see [3, 6, 8]). In this article, we focus on marginal models for clustered data; the meaning of the term ‘marginal’, as used in this context, will soon be apparent.
Marginal Models Marginal models are primarily used to make inferences about population means. A distinctive feature
of marginal models is that they separately model the mean response and the within-cluster association among the responses. The goal is to make inferences about the former, while the latter is regarded as a nuisance characteristic of the data that must be accounted for to make correct inferences about the population mean response. The term marginal in this context is used to emphasize that we are modeling each response within a cluster separately and that the model for the mean response depends only on the covariates of interest, and not on any random effects or on any of the other responses within the same cluster. This is in contrast to mixed effects models where the mean response depends not only on covariates but also on random effects (see Generalized Linear Mixed Models; Hierarchical Models). In our discussion of marginal models, the main focus is on discrete response data, for example, binary responses and counts. However, it is worth stressing that marginal models can be applied equally to continuous response data. Before we describe the main features of marginal models, we need to introduce some notation for clustered data. Let i index units within a cluster and j index clusters. We let Yij denote the response on the ith unit within the j th cluster; the response can be continuous, binary, or a count. For example, Yij might be the outcome for the ith individual in the j th clinic or the ith repeated measurement on the j th subject. Associated with each Yij is a collection of p covariates, Xij 1 , . . . , Xijp . We can group these into a p × 1 vector denoted by Xij = (Xij 1 , . . . , Xijp ) . A marginal model has the following three-part specification: (1)
The conditional mean of each response, denoted by µij = E(Yij | Xij), is assumed to depend on the p covariates through a known ‘link function’,

$$g(\mu_{ij}) = \beta_0 + \beta_1 X_{ij1} + \beta_2 X_{ij2} + \cdots + \beta_p X_{ijp}. \qquad (1)$$

The link function applies a known transformation to the mean, g(µij), and then links the transformed mean to a linear combination of the covariates (see Generalized Linear Models (GLM)).

(2) Given the covariates, the variance of each Yij is assumed to depend on the mean according to

$$\operatorname{Var}(Y_{ij} \mid X_{ij}) = \phi\, v(\mu_{ij}), \qquad (2)$$

where v(µij) is a known ‘variance function’ (i.e., a known function of the mean, µij) and φ is a scale parameter that may be known or may need to be estimated.

(3) The within-cluster association among the responses, given the covariates, is assumed to be a function of additional parameters, α (and also depends upon the means, µij).
In less technical terms, marginal models take a suitable transformation of the mean response (e.g., a logit transformation for binary responses or a log transformation for count data) and relate the transformed mean response to the covariates. Note that the model for the mean response allows each response within a cluster to depend on covariates but not on any random effects or on any of the other responses within the same cluster. The use of nonlinear link functions, for example, log(µij ), ensures that the model produces predictions of the mean response that are within the allowable range. For example, when analyzing a binary response, µij has interpretation in terms of the probability of ‘success’ (with 0 < µij < 1). If the mean response, here the probability of success, is related directly to a linear combination of the covariates, the regression model can yield predicted probabilities outside of the range from 0 to 1. The use of certain nonlinear link functions, for example, logit(µij ), ensures that this cannot happen. This three-part specification of a marginal model can be considered a natural extension of generalized linear models. Generalized linear models are a collection of regression models for analyzing diverse types of univariate responses (e.g., continuous, binary, counts). They include as special cases the standard linear regression and analysis of variance (ANOVA) models for a normally distributed continuous response, logistic regression models for a binary response, and log-linear models or Poisson regression models for counts (see Generalized Linear Models (GLM)). Although generalized linear models encompass a much broader range of regression models, these three are among the most widely used regression models in applications. Marginal models can be thought of as extensions
of generalized linear models, developed for the analysis of independent observations, to the setting of clustered data. The first two parts correspond to the standard generalized linear model, albeit with no distributional assumptions about the responses. It is the third component, the incorporation of the within-cluster association among the responses, that represents the main extension of generalized linear models. To highlight the main components of marginal models, we consider some examples using the three-part specification given earlier. Example 1. Marginal model for counts Suppose that Yij is a count and we wish to relate the mean count (or expected rate) to the covariates. Counts are often modeled as Poisson random variables (see Catalogue of Probability Density Functions), using a log link function. This motivates the following illustration of a marginal model for Yij : (1)
The mean of Yij is related to the covariates through a log link function,

$$\log(\mu_{ij}) = \beta_0 + \beta_1 X_{ij1} + \beta_2 X_{ij2} + \cdots + \beta_p X_{ijp}. \qquad (3)$$

(2) The variance of each Yij, given the effects of the covariates, depends on the mean response,

$$\operatorname{Var}(Y_{ij} \mid X_{ij}) = \phi\, \mu_{ij}, \qquad (4)$$

where φ is a scale parameter that needs to be estimated.

(3) The within-cluster association among the responses is assumed to have an exchangeable correlation (or equi-correlated) pattern,

$$\operatorname{Corr}(Y_{ij}, Y_{i'j} \mid X_{ij}, X_{i'j}) = \alpha \quad (\text{for } i \neq i'), \qquad (5)$$

where i and i′ index two distinct units within the jth cluster.
The marginal model specified above is a log-linear regression model (see Log-linear Models), with an extra-Poisson variance assumption. The withincluster association is specified in terms of a single correlation parameter, α. In this example, the extraPoisson variance assumption allows the variance to be inflated (relative to Poisson variability) by a factor φ (when φ > 1). In many applications, count data have variability that far exceeds that predicted by
Marginal Models for Clustered Data the Poisson distribution; a phenomenon referred to as overdispersion. Example 2. Marginal model for a binary response Next, suppose that Yij is a binary response, taking values of 0 (denoting ‘failure’) or 1 (denoting ‘success’), and it is of interest to relate the mean of Yij , µij = E(Yij |Xij ) = Pr(Yij = 1|Xij ), to the covariates. The distribution of each Yij is Bernoulli (see Catalogue of Probability Density Functions) and the probability of success is often modeled using a logit link function. Also, for a Bernoulli random variable, the variance is a known function of the mean. This motivates the following illustration of a marginal model for Yij : (1)
The mean of Yij, or probability of success, is related to the covariates by a logit link function,

$$\operatorname{logit}(\mu_{ij}) = \log\left(\frac{\mu_{ij}}{1 - \mu_{ij}}\right) = \beta_0 + \beta_1 X_{ij1} + \beta_2 X_{ij2} + \cdots + \beta_p X_{ijp}. \qquad (6)$$

(2) The variance of each Yij depends only on the mean response (i.e., φ = 1),

$$\operatorname{Var}(Y_{ij} \mid X_{ij}) = \mu_{ij}(1 - \mu_{ij}). \qquad (7)$$

(3) The within-subject association among the vector of repeated responses is assumed to have an exchangeable log odds ratio pattern,

$$\log \operatorname{OR}(Y_{ij}, Y_{i'j} \mid X_{ij}, X_{i'j}) = \alpha, \qquad (8)$$

where

$$\operatorname{OR}(Y_{ij}, Y_{i'j} \mid X_{ij}, X_{i'j}) = \frac{\Pr(Y_{ij} = 1, Y_{i'j} = 1 \mid X_{ij}, X_{i'j})\, \Pr(Y_{ij} = 0, Y_{i'j} = 0 \mid X_{ij}, X_{i'j})}{\Pr(Y_{ij} = 1, Y_{i'j} = 0 \mid X_{ij}, X_{i'j})\, \Pr(Y_{ij} = 0, Y_{i'j} = 1 \mid X_{ij}, X_{i'j})}. \qquad (9)$$

The marginal model specified above is a logistic regression model, with a Bernoulli variance assumption, Var(Yij | Xij) = µij(1 − µij). An exchangeable within-cluster association is specified in terms of pairwise log odds ratios rather than correlations, a natural metric of association for binary responses (see Odds and Odds Ratios).
These two examples are purely illustrative. They demonstrate how the choices of the three components of a marginal model might differ according to the type of response variable. However, in principle, any suitable link function can be chosen and alternative assumptions about the variances and within-cluster associations can be made. Finally, our description of marginal models does not require distributional assumptions for the observations, only a regression model for the mean response. In principle, this three-part specification of a marginal model can be extended by making full distributional assumptions about the responses within a cluster. For discrete data, the joint distribution requires specification of the mean vector and pairwise (or two-way) associations, as well as the three-, fourand higher-way associations among the responses (see, for example, [1, 2, 4]). Furthermore, as the number of responses within a cluster increases, the number of association parameters proliferates rapidly. In short, with discrete data there is no simple analog of the multivariate normal distribution (see Catalogue of Probability Density Functions). As a result, specification of the joint distribution for discrete data is inherently difficult and maximum likelihood estimation can be computationally difficult except in very simple cases. Fortunately, assumptions about the joint distribution are not necessary for estimation of the parameters of the marginal model. The avoidance of distributional assumptions leads to a method of estimation known as generalized estimating equations (GEE) (see Generalized Estimating Equations (GEE)). The GEE approach provides a convenient alternative to maximum likelihood estimation for estimating the parameters of marginal models (see [5, 7]) and has been implemented in many of the commercially available statistical software packages, for example, SAS, Stata, S-Plus, SUDAAN, and GenStat (see Software for Statistical Analyses).
Contrasting Marginal and Mixed Effects Models A crucial aspect of marginal models is that the mean response and within-cluster association are modeled separately. This separation has important implications for interpretation of the regression coefficients. In particular, the regression coefficients in the model for the mean have population-averaged interpretations.
That is, the regression coefficients describe the effects of covariates on the mean response, where the mean response is averaged over the clusters that comprise the target population; hence, they are referred to as population-averaged effects. For example, the regression coefficients might have interpretation in terms of contrasts of the overall mean responses in certain subpopulations (e.g., different intervention groups). In contrast, mixed effects models account for clustering in the data by assuming there is natural heterogeneity across clusters in a subset of the regression coefficients. Specifically, a subset of the regression coefficients are assumed to vary across clusters according to some underlying distribution; these are referred to as random effects (see Fixed and Random Effects). The correlation among observations within a cluster arises from their sharing of common random effects. Although the introduction of random effects can simply be thought of as a means of accounting for the within-cluster correlation, it has important implications for the interpretation of the regression coefficients. Unlike marginal models, the regression coefficients in mixed effects models have cluster-specific, rather than population-averaged, interpretations. That is, due to the nonlinear link functions (e.g., logit or log) that are usually adopted for discrete responses, the regression coefficients in mixed effects models describe covariate effects on the mean response for a typical cluster. The subtle distinctions between these two classes of models for clustered data can be illustrated in the following example based on a pre-post study design with a binary response (e.g., denoting ‘success’ or ‘failure’). Suppose individuals are measured at baseline (pretest) and following an intervention intended to increase the probability of success (posttest). The ‘cluster’ is comprised of the pair of binary responses obtained on the same individual at baseline and postbaseline. These clustered data can be analyzed using a mixed effects logistic regression model,

$$\operatorname{logit}\{E(Y_{ij} \mid X_{ij}, b_j)\} = \beta_0^* + \beta_1^* X_{ij} + b_j, \qquad (10)$$

where bj is normally distributed, with mean zero and variance σb². The single covariate in this model takes values Xij = 0 at baseline and Xij = 1 post-baseline (see Dummy Variables). In this mixed effects model, β0* and β1* are the fixed effects and bj is the random effect. This model assumes that individuals (or clusters) differ in terms of their underlying propensity for success; this heterogeneity is expressed in terms of the variability of the random effect, bj. For a ‘typical’ individual from the population (where a ‘typical’ individual can be thought of as one with unobserved random effect bj = 0, the center of the distribution of bj), the log odds of success at baseline is β0*; the log odds of success following the intervention is β0* + β1*. The log odds of success at baseline and postbaseline are displayed in Figure 1, for the case where β0* = −1.75, β1* = 3.0, and σb² = 1.5. At baseline, the log odds has a normal distribution with mean and median of −1.75 (see the unshaded density for the log odds in Figure 1). From Figure 1 it is clear that there is heterogeneity in the odds of success, with approximately 95% of individuals having a baseline log odds of success that varies from −4.15 to 0.65 (or −1.75 ± 1.96√1.5). When the odds of success is translated to the probability scale (see vertical axis of Figure 1),

$$E(Y_{ij} \mid X_{ij}, b_j) = \Pr(Y_{ij} = 1 \mid X_{ij}, b_j) = \frac{e^{\beta_0^* + \beta_1^* X_{ij} + b_j}}{1 + e^{\beta_0^* + \beta_1^* X_{ij} + b_j}}, \qquad (11)$$

the baseline probability of success for a typical individual (i.e., an individual with bj = 0) from the population is 0.148 (or e^{−1.75}/(1 + e^{−1.75})). Furthermore, approximately 95% of individuals have a baseline probability of success that varies from 0.016 to 0.657. From Figure 1 it is transparent that the symmetric, normal distribution for the baseline log odds of success does not translate into a corresponding symmetric, normal distribution for the probability of success. Instead, the subject-specific probabilities have a positively skewed distribution with median, but not mean, of 0.148 (see solid line in Figure 1). Because of the skewness, the mean of the subject-specific baseline probabilities is pulled towards the tail and is equal to 0.202 (see dashed line in Figure 1). Thus, the probability of success for a ‘typical’ individual from the population (0.148) is not the same as the prevalence of success in the same population (0.202), due to the nonlinearity of the relationship between subject-specific probabilities and log odds. Similarly, although the log odds of success postbaseline has a normal distribution (see the shaded density for the log odds in Figure 1), the subject-specific post-baseline probabilities have a negatively skewed distribution with median, but not mean, of 0.777 (see solid line in Figure 1). Because of the skewness, the mean is pulled towards the tail and is equal to 0.726 (see dashed line in Figure 1).

Figure 1  Subject-specific probability of success as a function of subject-specific log odds of success at baseline (unshaded densities) and post-baseline (shaded densities). Solid lines represent medians of the distributions; dashed lines represent means of the distributions.

Figure 1 highlights how the effect of intervention on the log odds of success for a typical individual (or cluster) from the population, β1* = 3.0, is not the same as the contrast of population log odds. The latter is what is estimated in a marginal model, say

$$\operatorname{logit}\{E(Y_{ij} \mid X_{ij})\} = \beta_0 + \beta_1 X_{ij}, \qquad (12)$$

and can be obtained by comparing or contrasting the log odds of success in the population at baseline, log(0.202/0.798) = −1.374, with the log odds of success in the population postbaseline, log(0.726/0.274) = 0.974. This yields a population-averaged measure of effect, β1 = 2.348, which is approximately 22% smaller than β1*, the subject-specific (or cluster-specific) effect of intervention. This simple example highlights how the choice of method for accounting for the within-cluster association has consequences for the interpretation of the regression model parameters.
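The population-averaged quantities quoted above can be recovered numerically by averaging the subject-specific probabilities over the random-effects distribution; the following sketch (not part of the original article) does so with SciPy's numerical integration.

```python
# Averaging expit(beta0* + beta1*x + b) over b ~ N(0, 1.5) gives roughly
# 0.202 at baseline and 0.726 post-baseline, and a population-averaged logit
# contrast of about 2.35 versus the subject-specific 3.0.
import numpy as np
from scipy import integrate, stats

beta0, beta1, sigma2 = -1.75, 3.0, 1.5

def marginal_prob(x):
    def integrand(b):
        p = 1.0 / (1.0 + np.exp(-(beta0 + beta1 * x + b)))
        return p * stats.norm.pdf(b, scale=np.sqrt(sigma2))
    value, _ = integrate.quad(integrand, -np.inf, np.inf)
    return value

p0, p1 = marginal_prob(0), marginal_prob(1)
print(round(p0, 3), round(p1, 3))
print(round(np.log(p1 / (1 - p1)) - np.log(p0 / (1 - p0)), 3))
```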
Summary

Marginal models are widely used for the analysis of clustered data. They are most useful when the focus of inference is on the overall population mean response, averaged over all the clusters that comprise the population. The distinctive feature of marginal models is that they model each response within the cluster separately. They assume dependence of the mean response on covariates but not on any random effects or other responses within the same cluster. This is in contrast to mixed effects models where the mean response depends not only on covariates but on random effects. There are a number of important distinctions between marginal and mixed effects models that go beyond simple differences in approaches to accounting for the within-cluster association among the responses. In particular, these two broad classes of regression models have somewhat different targets of inference and have regression coefficients with distinct interpretations. In general, the choice of method for analyzing discrete clustered data cannot be made through any automatic procedure. Rather, it must be made on subject-matter grounds. Different models for discrete clustered data have somewhat different targets of inference and thereby address subtly different scientific questions regarding the dependence of the mean response on covariates.
References

[1] Bahadur, R.R. (1961). A representation of the joint distribution of responses to n dichotomous items, in Studies in Item Analysis and Prediction, H. Solomon, ed., Stanford University Press, Stanford, pp. 158–168.
[2] George, E.O. & Bowman, D. (1995). A full likelihood procedure for analyzing exchangeable binary data, Biometrics 51, 512–523.
[3] Graubard, B.I. & Korn, E.L. (1994). Regression analysis with clustered data, Statistics in Medicine 13, 509–522.
[4] Kupper, L.L. & Haseman, J.K. (1978). The use of a correlated binomial model for the analysis of certain toxicological experiments, Biometrics 34, 69–76.
[5] Liang, K.-Y. & Zeger, S.L. (1986). Longitudinal data analysis using generalized linear models, Biometrika 73, 13–22.
[6] Neuhaus, J.M., Kalbfleisch, J.D. & Hauck, W.W. (1991). A comparison of cluster-specific and population-averaged approaches for analyzing correlated binary data, International Statistical Review 59, 25–35.
[7] Zeger, S.L. & Liang, K.-Y. (1986). Longitudinal data analysis for discrete and continuous outcomes, Biometrics 42, 121–130.
[8] Zeger, S.L., Liang, K.-Y. & Albert, P.S. (1988). Models for longitudinal data: a generalized estimating equation approach, Biometrics 44, 1049–1060.
GARRETT M. FITZMAURICE
Markov Chain Monte Carlo and Bayesian Statistics PETER CONGDON Volume 3, pp. 1134–1143 in Encyclopedia of Statistics in Behavioral Science ISBN-13: 978-0-470-86080-9 ISBN-10: 0-470-86080-4 Editors Brian S. Everitt & David C. Howell John Wiley & Sons, Ltd, Chichester, 2005
Markov Chain Monte Carlo and Bayesian Statistics Introduction Bayesian inference (see Bayesian Statistics) differs from classical inference in treating parameters as random variables, one consequence being probability values on hypotheses, and confidence intervals on parameters (or ‘credible intervals’ for Bayesians), that are concordant with commonsense interpretations [31]. Thus, in the Bayesian paradigm, credible intervals are areas under the probability distributions of the parameters involved. Despite inferential and philosophical advantages of this kind, practical application of Bayesian methods was formerly prohibited by the need for numerical integration, except in straightforward problems with a small number of parameters. This problem has been overcome in the last 15 years by the advent of simulation methods known as Markov Chain Monte Carlo (MCMC) algorithms; see [2, 21, 49]. MCMC sample-based estimation methods overcome problems associated with numerical procedures that were in use in the 1980s. They can handle high-dimensional problems and explore the distributions of parameters, regardless of the forms of the distributions of the likelihood and the parameters. Using MCMC methods, one can obtain compute posterior summary statistics (means, variances, 95% intervals) for parameters or other structural quantities. Starting from postulated or ‘prior’ distributions of the parameters, improved or ‘posterior’ estimates of the distributions are obtained by randomly sampling from parameter distributions in turn and updating the parameter values until stable distributions are generated. The implementation of these methods has resulted in Bayesian methods actually being easier to apply than ‘classical’ methods to some complex problems with large numbers of parameters, for example, those involving multiple random effects (see Generalized Linear Mixed Models) [6]. Such methods are now available routinely via software such as WINBUGS (see the page at http.www.mrc-bsu.ac.uk for programs and examples). Bayesian data analysis (especially via modern MCMC methods) has
a range of other advantages such as the ability to combine inferences over different models when no model is preeminent in terms of fit (‘model averaging’) and permitting comparisons of fit between nonnested models. By contrast, classical hypothesis testing is appropriate for testing nested models that differ only with respect to the inclusion/exclusion of particular parameters. Bayesian inference can be seen as a process of learning about parameters. Thus, Bayesian learning does not undertake statistical analysis in isolation but draws on existing knowledge in prior framing of the model, which can be quite important (and beneficial) in clinical, epidemiological, and health applications [48, Chapter 5]. The estimation process then combines existing evidence with the actual study data at hand. The result can be seen as a form of evidence accumulation. Prior knowledge about parameters and updated knowledge about them are expressed in terms of densities: the data analysis converts existing knowledge expressed in a prior density to updated knowledge expressed in a posterior density (see Bayesian Statistics). The Bayesian analysis also produces predictions (e.g., the predicted response for new values of independent variables, as in the case of predicting a relapse risk for a hypothetical patient with an ‘average’ profile). It also provides information about functions of parameters and data (e.g., a ratio of two regression coefficients). One may also assess hypotheses about parameters (e.g., that a regression coefficient is positive) by obtaining a posterior probability relating to the hypothesis being true. Modern sampling methods also allow for representing densities of parameters that may be far from Normal (e.g., skew or multimodal), whereas maximum likelihood estimation and classical tests rely on asymptotic Normality approximations under which a parameter has a symmetric density. In addition to the ease with which exact densities of parameters (i.e., possibly asymmetric) are obtained, other types of inference may be simplified by using the sampling output from an MCMC analysis. For example, a hypothesis that a regression parameter is positive is assessed by the proportion of iterations where the sampled value of the parameter is positive. Bayesian methods are advantageous for random effects models in which ‘pooling strength’ acts to provide more reliable inferences about individual cases. This has relevance in applications such as multilevel
analysis (see Generalized Linear Mixed Models) and meta-analysis in which, for example, institutions or treatments are being compared. A topical UK illustration relates to a Bayesian analysis of child surgery deaths in Bristol, summarized in a Health Department report [16]; see [40]. Subsequent sections consider modern Bayesian methods and inference procedures in more detail. We first consider the principles of Bayesian methodology and the general principles of MCMC estimation. Then, sampling algorithms are considered as well as possible ‘problems for the unwary’ in terms of convergence and identifiability raised when models become increasingly complex. We next consider issues of prior specification (a distinct feature of the Bayes approach) and questions of model assessment. A worked example using data on drug regimes for schizophrenic patients involves the question of missing data: this frequently occurs in panel studies (in the form of ‘attrition’), and standard approaches such as omitting subjects with missing data lead to biased inference.
Bayesian Updating

In more formal terms, the state of existing knowledge about a parameter, or, viewed another way, a statement of the uncertainty about a parameter set θ = (θ_1, θ_2, ..., θ_d), is expressed in a prior density p(θ). The likelihood of a particular set of data y, given θ, is denoted p(y|θ). For example, the standard constant variance linear regression (see Multiple Linear Regression) with p predictors (including the intercept) involves a Normal likelihood with θ = (β, σ²), where β is a vector of length p and σ² is the error variance. The updated or posterior uncertainty about θ, having taken account of the data, is expressed in the density p(θ|y). From the usual conditional probability rules (see Probability: An Introduction), this density can be written

p(θ|y) = p(θ, y) / p(y).   (1)

Applying conditional probability again gives

p(θ|y) = p(y|θ) p(θ) / p(y).   (2)
The divisor p(y) is a constant and the updating process therefore takes the form
p(θ|y) ∝ p(y|θ) p(θ).   (3)
This can be stated as ‘the posterior is proportional to likelihood times prior’ or, in practical terms, updated knowledge about parameters combines existing knowledge with the sample data at hand.
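As a small concrete illustration of this updating rule (not part of the original entry), the following sketch performs conjugate beta–binomial updating in Python; the prior parameters and the counts of successes and failures are arbitrary placeholders.

```python
from scipy.stats import beta

# Conjugate updating: a Beta(a0, b0) prior combined with binomial data gives
# a Beta(a0 + successes, b0 + failures) posterior ('posterior ∝ likelihood × prior').
a0, b0 = 2.0, 2.0            # assumed prior parameters
successes, failures = 7, 13  # placeholder data

a_post, b_post = a0 + successes, b0 + failures
print("posterior mean:", a_post / (a_post + b_post))
print("95% credible interval:", beta.interval(0.95, a_post, b_post))
```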
The Place of MCMC Methods

The genesis of estimation via MCMC sampling methods lies in the need to obtain expectations of, or densities for, functions of parameters g(θ), and of model predictions z, that take account of the information in both the data and the prior. Functions g(θ) might, for instance, be differences in probabilities of relapse under different treatments, where the probabilities are predicted by a logistic regression, or the total of sensitivity and specificity, sometimes known as the Youden Index [10]. These are standard indicators of the effectiveness of screening instruments, for example [33]; the sensitivity is the proportion of cases correctly identified and the specificity is the probability that the test correctly identifies that a healthy individual is disease free. The expectation of g(θ) is obtained by integrating over the posterior density for θ:

E_{θ|y}[g(θ)] = ∫ g(θ) p(θ|y) dθ,   (4)

while the density for future observations ('replicate data') is

p(z|y) = ∫ p(z|θ, y) p(θ|y) dθ.   (5)

Often, the major interest is in the marginal densities of the parameters themselves, where the marginal density of the jth parameter θ_j is obtained by integrating out all other parameters:

p(θ_j|y) = ∫ p(θ|y) dθ_1 dθ_2 ... dθ_{j−1} dθ_{j+1} ... dθ_d.   (6)
Such expectations or densities may be obtained analytically for conjugate analyses such as a binomial likelihood where the probability has a beta prior (see Catalogue of Probability Density Functions). (A conjugate analysis occurs when the prior and posterior densities for θj have the same form, e.g., both Normal or both beta.) Results can be obtained by asymptotic approximations [5], or by analytic approximations [34]. Such approximations are not appropriate for posteriors that are non-Normal or
where there is multimodality. An alternative strategy facilitated by contemporary computer technology is to use sampling-based approximations based on the Monte Carlo principle. The idea here is to use repeated sampling to build up a picture of marginal densities such as p(θ_j|y): modal values of the density will be sampled most often, those in the tails relatively infrequently. The Monte Carlo method in general applications assumes a sample of independent simulations u^(1), u^(2), ..., u^(T) from a density p(u), whereby E[g(u)] is estimated as

ḡ_T = (1/T) Σ_{t=1}^{T} g(u^(t)).   (7)
However, in MCMC applications, independent sampling from the posterior target density p(θ|y) is not typically feasible. It is valid, however, to use dependent samples θ^(t), provided the sampling satisfactorily covers the support of p(θ|y) [28]. MCMC methods generate pseudorandom dependent draws from probability distributions such as p(θ|y) via Markov chains. Sampling from the Markov chain converges to a stationary distribution, namely, π(θ) = p(θ|y), if certain requirements on the chain are satisfied, namely irreducibility, aperiodicity, and positive recurrence (see Note 1); see [44]. Under these conditions, the average ḡ_T tends to E_π[g(u)] with probability 1 as T → ∞, despite dependent sampling. Remaining practical questions include establishing an MCMC sampling scheme and establishing that convergence to a steady state has been obtained [14].
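To make the Monte Carlo principle in (7) concrete, a minimal sketch with independent draws is shown below; the density p(u) and the function g(u) are arbitrary choices used purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Monte Carlo estimate of E[g(u)] for u ~ p(u), taking p(u) = N(0, 1) and
# g(u) = u**2 purely for illustration (the true expectation is 1).
T = 100_000
u = rng.normal(size=T)
print("Monte Carlo estimate of E[g(u)]:", (u**2).mean())
```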
MCMC Sampling Algorithms

The Metropolis–Hastings (M–H) algorithm [29] is the baseline for MCMC sampling schemes. Let θ^(t) be the current parameter value in an MCMC sampling sequence. The M–H algorithm assesses whether a move to a potential new parameter value θ* should be made on the basis of the transition probability

α(θ*|θ^(t)) = min{1, [p(θ*|y) f(θ^(t)|θ*)] / [p(θ^(t)|y) f(θ*|θ^(t))]}.   (8)

The density f is known as a proposal or jumping density. The rate at which a proposal generated by f is accepted (the acceptance rate) depends on how close θ* is to θ^(t), and this depends on the variance σ² assumed in the proposal density. For a Normal proposal density, a higher acceptance rate follows from reducing σ², but with the risk that the posterior density will take longer to explore. If the proposal density is symmetric, so that f(θ^(t)|θ*) = f(θ*|θ^(t)), then the M–H algorithm reduces to an algorithm used by Metropolis et al. [41] for indirect simulation of energy distributions, whereby

α(θ*|θ^(t)) = min{1, p(θ*|y) / p(θ^(t)|y)}.   (9)

While it is possible for the proposal density to relate to the entire parameter set, it is often computationally simpler to divide θ into blocks or components and use componentwise updating. Gibbs sampling is a special case of the componentwise M–H algorithm, whereby the proposal density for updating θ_j is the full conditional density f_j(θ_j|θ_{k≠j}), so that proposals are accepted with probability 1. This sampler was originally developed by Geman and Geman [24] for Bayesian image reconstruction, with its full potential for simulating marginal distributions by repeated draws recognized by Gelfand and Smith [21]. The Gibbs sampler involves parameter-by-parameter updating, which when completed forms the transition from θ^(t) to θ^(t+1):

1. θ_1^(t+1) ∼ f_1(θ_1|θ_2^(t), θ_3^(t), ..., θ_d^(t));
2. θ_2^(t+1) ∼ f_2(θ_2|θ_1^(t+1), θ_3^(t), ..., θ_d^(t));
...
d. θ_d^(t+1) ∼ f_d(θ_d|θ_1^(t+1), θ_2^(t+1), ..., θ_{d−1}^(t+1)).
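A minimal numerical sketch of the symmetric (random-walk) version of the algorithm in (9) for a single parameter is given below; the target density, proposal standard deviation, and starting value are assumptions made purely for illustration and are not part of the original entry.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_target(theta):
    # Unnormalized log posterior; a standard Normal is used here only as a stand-in.
    return -0.5 * theta**2

T, sigma = 10_000, 1.0   # number of iterations and proposal standard deviation
theta = 0.0              # starting value theta^(0)
draws = np.empty(T)

for t in range(T):
    theta_star = theta + sigma * rng.normal()      # symmetric proposal
    log_alpha = log_target(theta_star) - log_target(theta)
    if np.log(rng.uniform()) < log_alpha:          # accept with probability min(1, ratio)
        theta = theta_star
    draws[t] = theta                               # on rejection the current value is repeated

print("posterior mean estimate:", draws[T // 2:].mean())
```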
Repeated sampling from M–H samplers such as the Gibbs sampler generates an autocorrelated sequence of numbers that, subject to regularity conditions (ergodicity, etc.), eventually ‘forgets’ the starting values θ (0) = (θ1(0) , θ2(0) , . . . , θd(0) ) used to initialize the chain and converges to a stationary sampling distribution p(θ|y). Full conditional densities can be obtained by abstracting out from the full model density (likelihood times prior) those elements including θj and treating other components as constants [26]. Consider a conjugate model for Poisson count data yi with means µi that are themselves gamma distributed. This is a model appropriate for overdispersed count data with actual variability var(y) exceeding that under
the Poisson model. This sort of data often occurs because of variations in frailty, susceptibility, or ability between subjects; a study by Bockenholt et al. provides an illustration involving counts of emotional experiences [7]. Thus, neuroticism is closely linked to proneness to experience unpleasant emotions, while extraversion is linked with sociability, enthusiasm, and pleasure arousal: hence, neuroticism and extraversion are correlated with counts of intense unpleasant and pleasant emotions. For example, y_i ∼ Po(µ_i t_i) might be the number of emotions experienced by a particular person in a time interval t_i and µ_i the average emotion count under the model. Suppose variations in proneness follow a gamma density µ_i ∼ Ga(α, β), namely,

f(µ_i|α, β) = β^α µ_i^{α−1} e^{−β µ_i} / Γ(α),   (10)

and further that α ∼ E(a), β ∼ Ga(b, c), where a, b, and c are preset constants; this prior structure is used by George et al. [25]. The posterior density p(θ|y), of θ = (µ_1, ..., µ_n, α, β) given y, is then proportional to

[β^α / Γ(α)]^n e^{−aα} β^{b−1} e^{−cβ} ∏_{i=1}^{n} e^{−µ_i t_i} µ_i^{y_i} µ_i^{α−1} e^{−β µ_i},

since all constants (such as the denominator y_i! in the Poisson likelihood) are included in the proportionality constant. The conditional densities of µ_i and β are f_1(µ_i|α, β) = Ga(y_i + α, β + t_i) and f_2(β|α, µ) = Ga(b + nα, c + Σ_i µ_i), respectively, while that of α is

f_3(α|β, µ) ∝ [β^α / Γ(α)]^n e^{−aα} (∏_{i=1}^{n} µ_i)^{α−1}.   (11)

This density is nonstandard but log-concave and cannot be sampled from directly; however, adaptive rejection sampling [27] may be used. By contrast, sampling from the gamma densities for µ_i and β is straightforward. For a Gibbs sampling MCMC application, we would repeatedly sample µ_i^(t+1) from f_1 conditional on α^(t) and β^(t), then β^(t+1) from f_2 conditional on µ_i^(t+1) and α^(t), and α^(t+1) from f_3 conditional on µ_i^(t+1) and β^(t+1). By repeated sampling of µ_i^(t), α^(t), and β^(t), for iterations t = 1, ..., T, we approximate the marginal densities of these parameters, with the approximation improving as T increases.
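A minimal sketch of this Gibbs scheme is given below. The data values and the prior constants a, b, and c are placeholders, and a random-walk Metropolis step on log α is substituted for adaptive rejection sampling simply to keep the code short; it is a sketch of the scheme under those assumptions, not a reproduction of any published implementation.

```python
import numpy as np
from math import lgamma

rng = np.random.default_rng(0)

# Placeholder overdispersed count data y_i with exposure times t_i.
y = np.array([3, 0, 5, 2, 7, 1, 4, 2])
t = np.ones_like(y, dtype=float)
n = len(y)
a, b, c = 1.0, 0.1, 0.1   # assumed prior constants: alpha ~ E(a), beta ~ Ga(b, c)

def log_f3(alpha, beta, mu):
    # Log of the unnormalized full conditional f3(alpha | beta, mu) in (11).
    return n * (alpha * np.log(beta) - lgamma(alpha)) - a * alpha \
        + (alpha - 1.0) * np.log(mu).sum()

T = 5000
alpha, beta = 1.0, 1.0
draws = np.zeros((T, 2))

for it in range(T):
    # Gibbs draws from the gamma full conditionals (numpy uses scale = 1/rate).
    mu = rng.gamma(shape=y + alpha, scale=1.0 / (beta + t))
    beta = rng.gamma(shape=b + n * alpha, scale=1.0 / (c + mu.sum()))
    # Metropolis step for alpha on the log scale (a stand-in for adaptive rejection sampling).
    prop = alpha * np.exp(0.3 * rng.normal())
    log_acc = log_f3(prop, beta, mu) - log_f3(alpha, beta, mu) \
        + np.log(prop) - np.log(alpha)   # Jacobian correction for the log-scale proposal
    if np.log(rng.uniform()) < log_acc:
        alpha = prop
    draws[it] = (alpha, beta)

print("posterior means (alpha, beta):", draws[T // 2:].mean(axis=0))
```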
Convergence and Identifiability

There are many unresolved questions around the assessment of convergence of MCMC sampling procedures [14]. It is preferable to use two or more parallel chains with diverse starting values to ensure full coverage of the sample space of the parameters, and to diminish the chance that the sampling will become trapped in a small part of the space [23]. Single long runs may be adequate for straightforward problems, or as a preliminary to obtain inputs to multiple chains. Convergence for multiple chains may be assessed using Gelman–Rubin scale reduction factors (SRF) that compare variation in the sampled parameter values within and between chains. Parameter samples from poorly identified models will show wide divergence in the sample paths between different chains, and variability of sampled parameter values between chains will considerably exceed the variability within any one chain. In practice, SRFs under 1.2 are taken as indicating convergence. Analysis of sequences of samples from an MCMC chain amounts to an application of time-series methods, in regard to problems such as assessing stationarity in an autocorrelated sequence. Autocorrelation at lags 1, 2, and so on may be assessed from the full set of sampled values θ^(t), θ^(t+1), θ^(t+2), ..., or from subsamples K steps apart θ^(t), θ^(t+K), θ^(t+2K), ..., and so on. If the chains are mixing satisfactorily, then the autocorrelations in the θ^(t) will fade to zero as the lag increases (e.g., at lag 10 or 20). High autocorrelations that do not decay mean that less information about the posterior distribution is provided by each iteration and a higher sample size T is necessary to cover the parameter space.

Problems of convergence in MCMC sampling may reflect problems in model identifiability due to overfitting or redundant parameters. Choice of diffuse priors increases the chance of poorly identified models, especially in complex hierarchical models or small samples [20]. Elicitation of more informative priors or application of parameter constraints may assist identification and convergence. Slow convergence will show in poor 'mixing', with high autocorrelation between successive sampled values of parameters, and trace plots that wander rather than rapidly fluctuating around a stable mean. Conversely, running multiple chains often assists in diagnosing poor identifiability of models. Correlation between parameters tends to delay convergence
and increase the dependence between successive iterations. Reparameterization to reduce correlation – such as centering predictor variables in regression – usually improves convergence [22].
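The basic Gelman–Rubin scale reduction factor can be computed directly from parallel chains. The sketch below uses the simple between/within variance ratio without finite-sample refinements, and the simulated chains are placeholders for real MCMC output.

```python
import numpy as np

def scale_reduction_factor(chains):
    # chains: an (M, T) array holding M parallel chains of length T for one parameter.
    m, t = chains.shape
    b = t * chains.mean(axis=1).var(ddof=1)   # between-chain variance
    w = chains.var(axis=1, ddof=1).mean()     # within-chain variance
    var_hat = (t - 1) / t * w + b / t         # pooled estimate of the posterior variance
    return np.sqrt(var_hat / w)

rng = np.random.default_rng(1)
chains = rng.normal(size=(2, 1000))            # placeholder 'converged' chains
print("SRF:", scale_reduction_factor(chains))  # values below about 1.2 indicate convergence
```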
Choice of Prior

Choice of the prior density is an important issue in Bayesian inference, as inferences may be sensitive to the choice of prior, though the role of the prior may diminish as sample size increases. From the viewpoint of a subject-matter specialist, the prior is the way to include existing subject-matter knowledge and ensure the analysis is 'evidence consistent'. There may be problems in choosing appropriate priors for certain types of parameters: variance parameters in random effects models are a leading example [15]. A long-running (and essentially unresolvable) debate in Bayesian statistics revolves around the choice between objective 'off-the-shelf' priors, as against 'subjective' priors that may include subject-matter knowledge. There may be a preference for off-the-shelf or reference priors that remove any subjectivity in the analysis. In practice, just proper but still diffuse priors are a popular choice (see the WINBUGS manuals for several examples), and a sensible strategy is to carry out a sensitivity analysis with a range of such priors. Just proper priors are those that still satisfy the conditions for a valid density: they integrate to 1, whereas 'improper' priors do not. However, they are diffuse in the sense of having a large variance that does not contain any real information about the location of the parameter.

Informative subjective priors, based on elicited opinion from scientific specialists, historical studies, or the weight of established evidence, can also be justified. One may even carry out a preliminary evidence synthesis using forms of meta-analysis to set an informative prior; this may be relevant for treatment effects in clinical studies or disease prevalence rates [48]. Suppose the unknown parameter is the proportion π of parasuicide patients making another suicide attempt within a year of the index event. There is considerable evidence on this question, and following a literature review, a prior might be set with mean recurrence rate 15%, and 95% range between 7.5 and 22.5%. This corresponds approximately to a beta prior with parameters 7.5 and 42.5. A set of previous representative studies might be used more formally in a form of meta-analysis, though if there are limits to the applicability of previous studies to the current target population, the information from previous studies may be down-weighted in some way. For example, the precision of an estimated relative risk or prevalence rate from a previous study may be halved. A diffuse prior on π might be a Be(1, 1) prior, which is in fact equivalent to assuming a proportion uniformly distributed between 0 and 1.
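A routine way of translating an elicited mean and spread into a beta prior is the method of moments. The short sketch below assumes a prior standard deviation of 0.05 for the recurrence proportion, which reproduces roughly the Be(7.5, 42.5) prior mentioned above, and then reports the implied 95% prior interval.

```python
from scipy.stats import beta

def beta_from_moments(mean, sd):
    # Method of moments: a + b acts as a 'prior sample size'.
    nu = mean * (1 - mean) / sd**2 - 1
    return mean * nu, (1 - mean) * nu

a, b = beta_from_moments(0.15, 0.05)        # assumed prior sd of 0.05
print("beta parameters:", a, b)             # roughly (7.5, 42.5)
print("implied 95% prior interval:", beta.ppf([0.025, 0.975], a, b))
```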
Model Comparison and Criticism

Having chosen a prior and obtained convergence for a set of alternative models, one is faced with choosing between models (or possibly combining inferences over them) and with diagnostic checking (e.g., assessing outliers). Several methods have been proposed for model choice and diagnosis based on Bayesian principles. These include many features of classical model assessment, such as penalizing complexity and requiring accurate predictions (i.e., cross-validation). To develop a Bayesian adaptation of frequentist model choice via the AIC, Spiegelhalter et al. [47] propose estimating the effective total number of parameters, or model dimension, d_e. Thus, d_e is the difference between the mean D̄ of the sampled deviances D^(t) and the deviance D(θ̄|y) at the parameter posterior mean θ̄. This total is generally less than the nominal number of parameters in complex hierarchical random effects models (see Generalized Linear Mixed Models). The deviance information criterion is then DIC = D(θ̄|y) + 2d_e. Congdon [12] considers repeated parallel sampling of models to obtain the density of DIC_jk = DIC_j − DIC_k for models j and k.

Formal Bayesian model assessment is based on updating prior model probabilities P(M_j) to posterior model probabilities P(M_j|Y) after observing the data. Imagine a scenario to obtain probabilities of a particular model being true, given a set of data Y, and assuming that one of J models, including the model in question, is true; or possibly that truth resides in averaging over a subset of the J models. Then

P(M_j|Y) = P(M_j) P(Y|M_j) / Σ_{j=1}^{J} P(M_j) P(Y|M_j),   (12)

where P(Y|M_j) are known as marginal likelihoods. Approximation methods for P(Y|M_j) include those presented by Gelfand and Dey [19], Newton and Raftery [42], and Chib [8]. The Bayes factor is used for comparing one model against another under the formal approach and is the ratio of marginal likelihoods

B_12 = P(Y|M_1) / P(Y|M_2).   (13)

Congdon [13] considers formal choice based on Monte Carlo estimates of P(M_j|Y) without trying to approximate marginal likelihoods. Whereas data analysis is often based on selecting a single best model and making inferences as if that model were true, such an approach neglects uncertainty about the model itself, expressed in the posterior model probabilities P(M_j|Y) [30]. Such uncertainty implies that for closely competing models, inferences should be based on model-averaged estimates

E[g(θ)] = Σ_j P(M_j|Y) g(θ_j|y).   (14)
Problems with formal Bayes model choice, especially in complex models or when priors are diffuse, have led to alternative model-assessment procedures, such as those including principles of predictive cross-validation; see [18, 35].
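Because D̄, D(θ̄|y), and hence d_e and the DIC are simple functions of the MCMC output, they are easy to compute once a model has been fitted. The sketch below shows one way this might be done; the deviance values are placeholders, and in practice D(θ̄|y) would be evaluated at the posterior means of the parameters.

```python
import numpy as np

def dic(deviance_samples, deviance_at_posterior_mean):
    # deviance_samples: D(theta^(t)|y) evaluated at each sampled parameter value.
    d_bar = np.mean(deviance_samples)
    d_e = d_bar - deviance_at_posterior_mean   # effective number of parameters
    return deviance_at_posterior_mean + 2 * d_e, d_e

dev_samples = np.array([212.4, 215.1, 210.9, 213.7, 214.2])  # placeholder values
print("DIC, d_e:", dic(dev_samples, 208.5))
```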
Bayesian Applications in Psychology and Behavioral Sciences

In psychological and behavioral studies, the methods commonly applied range from clinical trial designs with individual patients as subjects to population-based ecological studies. The most frequent analytic frameworks are multivariate analysis and structural equation models (SEMs), more generally, latent class analysis (LCA), diagnosis and classification studies (see Discriminant Analysis; k-means Analysis), and panel and case-control studies. Bayesian applications to psychological data analysis include SEM, LCA, and item response analysis, see [3, 9, 37, 38, 46], and factor analysis per se [1, 4, 36, 43, 45]. Here we provide a brief overview of a Bayesian approach to missing data in panel studies, with a worked example using the WINBUGS package (code available at www.geog.qmul.ac.uk/staff/congdon.html). This package offers great analytic flexibility while leaving issues such as choice of sampling methods to an inbuilt expert system [11].
An Illustrative Application

Panel studies with data y_it on subjects i at times t are frequently subject to attrition (permanent loss from observation from a given point despite starting the study), or intermittently missing responses [32]. Suppose that some data are missing in a follow-up study of a particular new treatment due to early patient withdrawal. One may wonder whether this loss to observation is 'informative', for instance, if early exit is due to adverse consequences of the treatment. The fact that some patients do withdraw generates a new form of data, namely, binary indicators r_it = 1 when the response y_it is missing, and r_it = 0 when a response is present. If a subject drops out permanently at time t, they contribute to the likelihood at that point with an observation r_it = 1, but are not subsequently included in the model. If withdrawal is informative, then the probability that an exit occurs (the probability that r = 1) is related to y_it. However, this value is not observed if the patient exits, raising the problem that a model for Pr(r = 1) has to be in terms of a hypothetical value, namely, the value for y_it that would have been observed had the response actually happened [39]. This type of missing data mechanism is known as missingness not at random (MNAR) (see Missing Data). By contrast, if missingness is random (often written as MAR, or missingness at random), then Pr(r_it = 1) may depend on values of observed variables, such as preceding y values (y_{i,t−1}, y_{i,t−2}, etc.), but not on the values of possibly missing variables such as y_it itself. Another option is missingness completely at random (abbreviated as MCAR), when none of the data, collected or missing, is relevant to explaining the chance of missingness (see Missing Data). Only in this scenario is complete case analysis valid. In concrete modeling terms, in the MCAR model, missingness at time t is not related to any other variable, whereas in the informative model missingness could be related to any other variable, missing or observed. Suppose we relate the probability π_it that r_it = 1 to current and preceding values of the response y and to a known covariate x (e.g., time in the trial). Then,

logit(π_it) = η_1 + η_2 y_it + η_3 y_{i,t−1} + η_4 x_it + η_5 x_{i,t−1}.   (15)

Under the MNAR scenario, dropout at time t (causing r_it to be 1) may be related to the possibly
missing value of y_it at that time. This would mean that the current value of y influences the chance that r = 1 (i.e., η_2 ≠ 0). Note that y_it is an extra unknown when r_it = 1 (i.e., the hypothetical missing value is imputed or 'augmented' under the model). Under a MAR scenario, by contrast, previous observed values of y, or current/preceding x, may be relevant to the chance of a missing value, but the current value of y is not relevant. So η_3, η_4, and η_5 may be nonzero, but one would expect η_2 = 0. Finally, MCAR missingness would mean η_2 = η_3 = η_4 = η_5 = 0. To illustrate how dropout might be modeled in panel studies, consider a longitudinal trial comparing a control group with two drug treatments for schizophrenia, haloperidol and risperidone [17]. The response y_it was the Positive and Negative Symptom Scale (PANSS), with higher scores denoting more severe illness, obtained at seven time points (selection, baseline, and at weeks 1, 2, 4, 6, and 8). Let v_t denote the number of weeks, with the baseline defined by v_2 = 0, and selection into the study by v_1 = −1. Cumulative attrition is only 0.6% at baseline but reaches 1.7%, 13.5%, 23.6%, and 39.7% in successive waves, reaching 48.5% in the final wave. The question is whether attrition is related to health status: if the dropout rate is higher for those with high PANSS scores (e.g., because they are gaining no benefit from their treatment), then observed time paths of PANSS scores are in a sense unrepresentative. Let treatment be specified by a trichotomous indicator G_i (control, haloperidol, risperidone). We assume the data are Normal with y_it ∼ N(µ_it, σ²). If subject i drops out at time t = t*, then r_it* = 1 at that time but the person is excluded from the model for times t > t*. The model for µ_it includes a grand mean M, main treatment effects δ_j (j = 1, ..., 3), linear and quadratic time effects, θ_j and γ_j, both specific to treatment j, and random terms over both subjects and individual readings:

µ_it = M + δ_{G_i} + θ_{G_i} v_t + γ_{G_i} v_t² + U_i + e_it.   (16)
Since the model includes a grand mean, a corner constraint δ1 = 0 is needed for identifiability. The Ui are subject level indicators of unmeasured morbidity factors, sometimes called ‘permanent’ random effects. The model for eit allows for dependence over time (autocorrelation), with an allowance also for the unequally spaced observations (the gap between
observations is sometimes 1 week, sometimes 2). Thus,

e_it ∼ N(ρ^{v_t − v_{t−1}} e_{i,t−1}, σ_e²).   (17)

We take ρ, the correlation between errors one week apart, to be between 0 and 1. Note that the model for y also implicitly includes an uncorrelated error term with variance σ². The dropout models considered are more basic than the one discussed above. Thus, two options are considered. The first option is a noninformative (MAR) model allowing dependence on the preceding (and observed) y_{i,t−1} but not on the current, possibly missing, y_it. So

logit(π_it) = η_11 + η_12 y_{i,t−1}.   (18)
The second adopts the alternative informative missingness scenario (MNAR), namely,

logit(π_it) = η_21 + η_22 y_it,   (19)
since this allows dependence on (possibly missing) contemporaneous scores y_it. We adopt relatively diffuse priors on all the model parameters (those defining µ_it and π_it), including N(0, 10) priors on δ_j and θ_j and N(0, 1) priors on γ_j. With the noninformative dropout model, similar results to those cited by Diggle [17, p. 221] are obtained (see Table 1).

Table 1   PANSS model parameters, alternative dropout models

                                Noninformative               Informative
                                Mean     2.5%     97.5%      Mean     2.5%     97.5%
Response model
  η11                          −5.32    −5.84    −4.81
  η12                           0.034    0.028    0.039
  η21                                                        −5.58    −6.28    −4.80
  η22                                                         0.034    0.026    0.041
Observation model
  Main treatment effects
    δ2                          1.43    −1.95     5.02        3.32    −0.95     7.69
    δ3                         −1.72    −4.29     1.49       −4.66    −8.63    −0.66
  Autocorrelation parameter
    ρ                           0.94     0.91     0.97        0.96     0.93     0.99
  Linear time effects
    θ1                          0.04    −1.29     1.48        0.06    −0.61     0.68
    θ2                          1.03    −0.11     2.25        1.15     0.38     1.88
    θ3                         −1.86    −2.40    −1.16       −0.81    −1.16    −0.42
  Quadratic time effects
    γ1                         −0.066   −0.258    0.117      −0.017   −0.186    0.158
    γ2                         −0.066   −0.232    0.095       0.046   −0.135    0.240
    γ3                          0.104    0.023    0.175       0.108    0.023    0.199

Dropout increases with PANSS score under the first dropout model (the coefficient η12 is clearly positive, with its 95% CI restricted to positive values), so those remaining in the trial are increasingly 'healthier' than the true average, and so increasingly unrepresentative. The main treatment effect for risperidone, namely δ3, has a negative mean (in line with the new drug producing lower average PANSS scores over all observation points), but its 95% interval is not quite conclusive. The estimates for the treatment-specific linear trends in PANSS scores (θ1, θ2, and θ3) do, however, conclusively show a fall under risperidone, and provide evidence for the effectiveness of the new drug. The quadratic effect for risperidone reflects a slowing in PANSS decline for later observations. Note that the results presented in Table 1 constitute 'posterior summaries' in a very simplified form.

Introducing the current PANSS score y_it into the model for the response indicator r_it makes the dropout model informative. The fit improves slightly under the predictive criterion [35] based on comparing replicate data z_it to actual data y_it. The parameter η22 has a 95% CI confined to positive values and suggests that missingness may be informative; to establish this conclusively, one would consider more elaborate dropout models including both y_it, earlier y values y_{i,t−1}, y_{i,t−2}, and so on, and possibly treatment group too. In terms of inferences on treatment effectiveness, both main treatment and linear time effects for risperidone are significantly negative under the informative model, though the time slope is less acute.

Note that maximum likelihood analysis is rather difficult for this type of problem. In addition to the fact that model estimation depends on imputed y_it responses when r_it = 1, it can be seen that the main effect treatment parameters under the first dropout model have somewhat skewed densities. This compromises classical significance testing and the derivation of confidence limits in a maximum likelihood (ML) analysis. To carry out analysis via classical (e.g., ML) methods for such a model requires technical knowledge and programming skills beyond those of the average psychological researcher. Using a Bayes approach, however, it is possible for a fairly unsophisticated user of WINBUGS to apply this method. Note that the original presentation of the informative analysis via classical methods
does not include parameter standard errors [17, Table 9.5], whereas obtaining full parameter summaries (which are anyway more extensive than simply means and standard errors) is unproblematic via MCMC sampling. As an example of a relatively complex hypothesis test that is simple under MCMC sampling, consider the probability that δ3 < min(δ1 , δ2 ), namely, that the main treatment effect (in terms of reducing the PANSS score) under risperidone is greater than either of the other two treatments. This involves inserting a single line in WINBUGS, namely, test.del <- step(min(delta[1], delta[2])-delta[3])
and monitoring the proportion of iterations where the condition δ3 < min(δ1 , δ2 ) holds. Under the informative dropout model, we find this probability to be .993 (based on the second half of a two-chain run of 10 000 iterations), whereas for the noninformative dropout model it is .86 (confirming the inconclusive nature of inference on the parameter δ3 under this model). Such flexibility in hypothesis testing is one feature of Bayesian sampling estimation via MCMC. As a postscript, we can say that the current ‘state of play’ on simulation-based estimation leaves much
scope for development. We can see that MCMC has brought a revolution to the implementation of Bayesian inference and this is likely to continue as computing speeds continue to improve. Our analysis has demonstrated the ease of determining posterior distributions in complex applications, and of developing significance tests that allow for skewness and other non-Normality. These, among other benefits of MCMC techniques, are discussed in a number of accessible books, for example [3, 28].
Note

1. Suppose a chain is defined on a space S. A chain is irreducible if for any pair of states (s_i, s_j) ∈ S there is a nonzero probability that the chain can move from s_i to s_j in a finite number of steps. A state is positive recurrent if the number of steps the chain needs to revisit the state has a finite mean. If all its states are positive recurrent, then the chain itself is positive recurrent. A state has period k if it can only be revisited after a number of steps that is a multiple of k; otherwise, the state is aperiodic. If all its states are aperiodic, then the chain itself is aperiodic. Positive recurrence and aperiodicity together constitute ergodicity.

References

[1] Aitkin, M. & Aitkin, I. (2005). Bayesian inference for factor scores, in Contemporary Advances in Psychometrics, Lawrence Erlbaum (in press).
[2] Albert, J. & Chib, S. (1996). Computation in Bayesian econometrics: an introduction to Markov Chain Monte Carlo, in Advances in Econometrics, Vol. 11A, T. Fomby & R. Hill, eds, JAI Press, pp. 3–24.
[3] Albert, J. & Johnson, V. (1999). Ordinal Data Modeling, Springer-Verlag, New York.
[4] Arminger, G. & Muthén, B. (1998). A Bayesian approach to nonlinear latent variable models using the Gibbs sampler and the Metropolis-Hastings algorithm, Psychometrika 63, 271–300.
[5] Bernardo, J. & Smith, A. (1994). Bayesian Theory, Wiley, Chichester.
[6] Best, N., Spiegelhalter, D., Thomas, A. & Brayne, C. (1996). Bayesian analysis of realistically complex models, Journal of the Royal Statistical Society, Series A 159, 323–342.
[7] Bockenholt, U., Kamakura, W. & Wedel, M. (2003). The structure of self-reported emotional experiences: a mixed-effects Poisson factor model, British Journal of Mathematical and Statistical Psychology 56, 215–229.
[8] Chib, S. (1995). Marginal likelihood from the Gibbs output, Journal of the American Statistical Association 90, 1313–1321.
[9] Coleman, M., Cook, S., Matthysse, S., Barnard, J., Lo, Y., Levy, D., Rubin, D. & Holzman, P. (2002). Spatial and object working memory impairments in schizophrenia patients: a Bayesian item-response theory analysis, Journal of Abnormal Psychology 111, 425–435.
[10] Congdon, P. (2001). Predicting adverse infant health outcomes using routine screening variables: modelling the impact of interdependent risk factors, Journal of Applied Statistics 28, 183–197.
[11] Congdon, P. (2003). Applied Bayesian Models, John Wiley.
[12] Congdon, P. (2005a). Bayesian predictive model comparison via parallel sampling, Computational Statistics and Data Analysis 48(4), 735–753.
[13] Congdon, P. (2005b). Bayesian model choice based on Monte Carlo estimates of posterior model probabilities, Computational Statistics and Data Analysis (in press).
[14] Cowles, M. & Carlin, B. (1996). Markov Chain Monte Carlo convergence diagnostics: a comparative study, Journal of the American Statistical Association 91, 883–904.
[15] Daniels, M. (1999). A prior for the variance in hierarchical models, Canadian Journal of Statistics 27, 567–578.
[16] Department of Health (2002). Learning from Bristol: The Department of Health's Response to the Report of the Public Inquiry into Children's Heart Surgery at the Bristol Royal Infirmary 1984–1995. (Available from http://www.doh.gov.uk/bristolinquiryresponse/index.htm).
[17] Diggle, P. (1998). Dealing with missing values in longitudinal studies, in Statistical Analysis of Medical Data: New Developments, B. Everitt & G. Dunn, eds, Hodder Arnold, London.
[18] Geisser, S. & Eddy, W. (1979). A predictive approach to model selection, Journal of the American Statistical Association 74, 153–160.
[19] Gelfand, A. & Dey, D. (1994). Bayesian model choice: asymptotics and exact calculations, Journal of the Royal Statistical Society, Series B 56, 501–514.
[20] Gelfand, A. & Sahu, S. (1999). Identifiability, improper priors, and Gibbs sampling for generalized linear models, Journal of the American Statistical Association 94, 247–253.
[21] Gelfand, A. & Smith, A. (1990). Sampling-based approaches to calculating marginal densities, Journal of the American Statistical Association 85, 398–409.
[22] Gelfand, A., Sahu, S. & Carlin, B. (1995). Efficient parameterizations for normal linear mixed models, Biometrika 82, 479–488.
[23] Gelman, A. & Rubin, D. (1992). Inference from iterative simulation using multiple sequences, Statistical Science 7, 457–511.
[24] Geman, S. & Geman, D. (1984). Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images, IEEE Transactions on Pattern Analysis and Machine Intelligence 6, 721–741.
[25] George, E., Makov, U. & Smith, A. (1993). Conjugate likelihood distributions, Scandinavian Journal of Statistics 20, 147–156.
[26] Gilks, W. (1996). Full conditional distributions, in Markov Chain Monte Carlo in Practice, W. Gilks, S. Richardson & D. Spiegelhalter, eds, Chapman and Hall, London, pp. 75–88.
[27] Gilks, W. & Wild, P. (1992). Adaptive rejection sampling for Gibbs sampling, Applied Statistics 41, 337–348.
[28] Gilks, W., Richardson, S. & Spiegelhalter, D. (1996). Introducing Markov chain Monte Carlo, in Markov Chain Monte Carlo in Practice, W. Gilks, S. Richardson & D. Spiegelhalter, eds, Chapman and Hall, London, pp. 1–20.
[29] Hastings, W. (1970). Monte Carlo sampling methods using Markov chains and their applications, Biometrika 57, 97–109.
[30] Hoeting, J., Madigan, D., Raftery, A. & Volinsky, C. (1999). Bayesian model averaging: a tutorial, Statistical Science 14, 382–401.
[31] Hoijtink, H. & Klugkist, I. (2003). Using the Bayesian Approach in Psychological Research, Manuscript, Department of Methodology and Statistics, Utrecht University, The Netherlands.
[32] Ibrahim, J., Chen, M. & Lipsitz, S. (2001). Missing responses in generalized linear mixed models when the missing data mechanism is nonignorable, Biometrika 88, 551–564.
[33] Joseph, L., Gyorkos, T. & Coupal, L. (1995). Bayesian estimation of disease prevalence and the parameters of diagnostic tests in the absence of a gold standard, American Journal of Epidemiology 141, 263–272.
[34] Kass, R., Tierney, L. & Kadane, J. (1988). Asymptotics in Bayesian computation, in Bayesian Statistics 3, J. Bernardo, M. DeGroot, D. Lindley & A. Smith, eds, Oxford University Press, pp. 261–278.
[35] Laud, P. & Ibrahim, J. (1995). Predictive model selection, Journal of the Royal Statistical Society, Series B 57, 247–262.
[36] Lee, S. & Press, S. (1998). Robustness of Bayesian factor analysis estimates, Communications in Statistics – Theory and Methods 27, 1871–1893.
[37] Lee, S. & Song, X. (2004). Bayesian model selection for mixtures of structural equation models with an unknown number of components, British Journal of Mathematical and Statistical Psychology 56, 145–165.
[38] Lewis, C. (1983). Bayesian inference for latent abilities, in On Educational Testing, S.B. Anderson & J.S. Helmick, eds, Jossey-Bass, San Francisco, pp. 224–251.
[39] Little, R. & Rubin, D. (2002). Statistical Analysis with Missing Data, 2nd edition, John Wiley, New York.
[40] Marshall, C., Best, N., Bottle, A. & Aylin, P. (2004). Statistical issues in the prospective monitoring of health outcomes across multiple units, Journal of the Royal Statistical Society, Series A 167, 541–559.
[41] Metropolis, N., Rosenbluth, A., Rosenbluth, M., Teller, A. & Teller, E. (1953). Equations of state calculations by fast computing machines, Journal of Chemical Physics 21, 1087–1092.
[42] Newton, D. & Raftery, A. (1994). Approximate Bayesian inference by the weighted bootstrap, Journal of the Royal Statistical Society, Series B 56, 3–48.
[43] Press, S. & Shigemasu, K. (1999). A note on choosing the number of factors, Communications in Statistics – Theory and Methods 28, 1653–1670.
[44] Roberts, C. (1996). Markov chain concepts related to sampling algorithms, in Markov Chain Monte Carlo in Practice, W. Gilks, S. Richardson & D. Spiegelhalter, eds, Chapman and Hall, London, pp. 45–59.
[45] Rowe, D. (2003). Multivariate Bayesian Statistics: Models for Source Separation and Signal Unmixing, CRC Press, Boca Raton.
[46] Scheines, R., Hoijtink, H. & Boomsma, A. (1999). Bayesian estimation and testing of structural equation models, Psychometrika 64, 37–52.
[47] Spiegelhalter, D., Best, N., Carlin, B. & van der Linde, A. (2002). Bayesian measures of model complexity and fit, Journal of the Royal Statistical Society, Series B 64, 583–639.
[48] Spiegelhalter, D., Abrams, K. & Myles, J. (2004). Bayesian Approaches to Clinical Trials and Health-Care Evaluation, John Wiley.
[49] Tierney, L. (1994). Markov chains for exploring posterior distributions, Annals of Statistics 22, 1701–1762.
(See also Markov Chain Monte Carlo Item Response Theory Estimation) PETER CONGDON
Markov Chain Monte Carlo Item Response Theory Estimation
LISA A. KELLER
Volume 3, pp. 1143–1148 in Encyclopedia of Statistics in Behavioral Science
ISBN-13: 978-0-470-86080-9   ISBN-10: 0-470-86080-4
Editors: Brian S. Everitt & David C. Howell
John Wiley & Sons, Ltd, Chichester, 2005
Markov Chain Monte Carlo Item Response Theory Estimation

Introduction

Item response theory (IRT) (see Item Response Theory (IRT) Models for Polytomous Response Data; Item Response Theory (IRT) Models for Rating Scale Data) postulates the probability of a correct response to an item, given a person's underlying ability level, θ. There are many types of item response models, all of which are designed to model different testing situations. However, all of these models contain parameters that characterize the items, and parameters that characterize the examinee. For these models to be useful, the item and person parameters need to be estimated from item response data. Traditionally, joint maximum likelihood (JML) and marginal maximum likelihood (MML) procedures have been used to estimate item parameters (see Maximum Likelihood Estimation). MML has been favored over JML as the JML estimates are not consistent [8]. The MML estimates have proved to be very robust and have performed well for the traditional models. However, the need for new models is growing as testing practices become more sophisticated. As model complexity increases, so does the difficulty of estimating the model parameters. Therefore, different estimation methods are necessary if model development is to continue, since the MML procedure is not feasible for estimating all models. Fortunately, with the advancements in technology, different estimation methods have become possible. As a result, several new models have been developed to meet the growing needs of measurement. Among the more popular models are the testlet response model [2, 17, 18] and the hierarchical rater model [12, 15]. Both of these models have employed Markov Chain Monte Carlo (MCMC) methods for estimating item parameters (see Markov Chain Monte Carlo and Bayesian Statistics). MCMC procedures are becoming more popular among psychometricians, primarily due to their flexibility. As noted above, new problems that were previously too difficult, or even impossible, to solve with the traditional methods can now be solved. In concept, the
idea of MCMC estimation is to simulate a sample from the posterior distribution of the parameters, and to use the sample to obtain point estimates of the parameters. Since some (most) posterior distributions are difficult to sample from directly, modifications can be made to the process to allow for sampling from a more convenient distribution (e.g., normal), yet selecting the values from that distribution that could realistically have come from the posterior distribution of interest. Any distribution may be used provided that the range of that distribution is chosen appropriately [4]. The drawback to MCMC techniques is the time it often takes to provide a solution. The amount of time depends on many factors, including the choice of the distribution, and some analyses can take weeks of computer time! However, as computing power continues to grow, so will the ease of implementing MCMC methods. As such, it is a technique that will likely gain even more popularity than it enjoys today. Initial research into MCMC for estimating IRT parameters began with the traditional models, for which MML estimates were available. First, dichotomous models were studied [1, 10, 11, 13], and then this methodology was extended to polytomous models [9, 14, 19]. By doing so, estimates obtained with MML and MCMC could be compared, and the MCMC procedures appear to produce estimates similar to those found using MML. As these models are relatively simple, MCMC for IRT parameters will be discussed using these models. The general principles can be applied to estimating parameters of more complex models.
Markov Chain Monte Carlo

Before describing the details of Markov Chain Monte Carlo estimation, a brief description of Markov chains and the introduction of some terminology are necessary.
Markov Chains

A Markov chain can be thought of as a system that evolves through various stages and, at any given stage, exists in a certain state. In a Markov chain, the state the system occupies at any stage is determined solely by probabilities. Further, if a Markov chain is in a particular state
at a given stage, the probability that in the next stage it will be at a given state depends only on the present state, and not on past states. That is, at stage k, the probability that it exists in a given state in stage k + 1 depends only on the state of the chain at stage k, and not stages 1 . . . k − 1. The probabilities of going to the various states are termed the transition probabilities and are assumed known. Given that there are typically multiple states that the Markov chain can inhabit at a given stage, the transition matrix provides the transition probabilities of going from one state to any other state. That is, the pij element of the matrix is the probability of going from state i to state j . A distinguishing feature of a Markov chain is that these probabilities do not change over time.
Notation and Terminology

Let s_i^(m) denote the probability that the Markov chain is in state i after m transitions. Supposing that there are n different states, there are n such probabilities. These probabilities can be organized into an n × 1 vector, which will be referred to as the state vector of the Markov chain. After each transition, a new state vector is formed. As the chain must be in some state at any given stage, the sum of the elements of each state vector must be one. As a Markov chain progresses, the state vectors may tend to converge to a vector of probabilities, say S. Therefore, for a 'suitably large' number of transitions m, it is reasonable to assume S^(m) = S. This vector S is referred to as the steady state vector, or stationary vector, for the Markov chain. That is, the vector S provides the long-term probability that the chain will be in each of the various states. In order for the Markov chain to converge to a unique steady-state vector, some regularity conditions must be met: the transition matrix P must be a stochastic matrix, that is, the sum of any row is one, and for some integer m, P^m has every entry positive.
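A quick numerical check of these ideas (not part of the original entry) is to iterate the state vector for a small, arbitrarily chosen stochastic matrix until it settles down to the steady state vector S:

```python
import numpy as np

# A hypothetical 3-state transition matrix P; each row sums to one.
P = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.6, 0.2],
              [0.3, 0.3, 0.4]])

s = np.array([1.0, 0.0, 0.0])   # arbitrary starting state vector
for _ in range(200):            # s^(m+1) = s^(m) P
    s = s @ P
print("steady state vector:", s)  # the same limit is reached from any starting vector here
```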
Markov Chain Monte Carlo Estimation

The basic idea behind MCMC estimation is to create a Markov chain consisting of states M_0, M_1, M_2, ..., where M_k = (θ^k, β^k), and θ, β are the unknown IRT parameters. The chain is created so that the stationary distribution, π(θ, β), to which the chain
converges, is the posterior distribution, p(θ, β|X), where X is the matrix of item response data. States (observations) from the Markov chain are simulated and inferences about the parameters are made based on these observations. The behavior of the Markov chain is determined by its transition kernel t[(θ^0, β^0), (θ^1, β^1)] = P[M_{k+1} = (θ^1, β^1)|M_k = (θ^0, β^0)], which is the probability of moving to a new state, (θ^1, β^1), given that the current state is (θ^0, β^0) [13]. The stationary distribution π(θ, β) satisfies

∫_{θ,β} t[(θ^0, β^0), (θ^1, β^1)] π(θ^0, β^0) d(θ^0, β^0) = π(θ^1, β^1) [13].   (1)
If we can define the transition kernel such that the stationary distribution, π(θ, β), is equal to the posterior distribution, p(θ, β|X), then, after a suitably large number of transitions, m, the remaining observations of the Markov chain behave as if from the posterior distribution. The first m observations are discarded, and are often referred to as the 'burn-in' [13]. The observations from the stationary distribution can then be considered a sample from the posterior distribution, and estimates of parameters can be obtained using sample statistics. Given this simple outline of Markov chain techniques, methods for creating the appropriate Markov chain are needed. There are several methods for generating the Markov chains. The most widely used include Gibbs sampling and the Metropolis–Hastings (M–H) algorithm within Gibbs. Each of these methods will be discussed next.

Gibbs Sampling. It was shown by Geman and Geman [6] that the transition kernel

t_G[(θ^0, β^0), (θ^1, β^1)] = p(θ^1|β^0, X) p(β^1|θ^1, X)   (2)

has stationary distribution π(θ, β) = p(θ, β|X). A Markov chain with a transition kernel constructed in this manner is called a Gibbs sampler; the factors p(θ^1|β^0, X) and p(β^1|θ^1, X) are the complete conditional distributions of the model. Observations from the Gibbs sampler (θ^k, β^k) are simulated by repeated sampling from the complete conditional distributions. Therefore, to go from (θ^{k−1}, β^{k−1}) to (θ^k, β^k), two transition steps are required:
1. Draw θ^k ∼ p(θ|X, β^{k−1});
2. Draw β^k ∼ p(β|X, θ^k).

Hence, the Gibbs sampler employs the standard IRT technique: estimate one set of parameters holding the other fixed and known. It can be noted that both p(θ|X, β) and p(β|X, θ) are proportional to the joint distribution p(X, θ, β) = p(X|θ, β)p(θ, β):

p(θ|X, β) = p(X|θ, β)p(θ, β) / ∫ p(X|θ, β)p(θ, β) dθ

and

p(β|X, θ) = p(X|θ, β)p(θ, β) / ∫ p(X|θ, β)p(θ, β) dβ.   (3)

Assuming independence of θ and β, we can write this as p(θ|X, β) ∝ p(X|θ, β)p(θ) and p(β|X, θ) ∝ p(X|θ, β)p(β). Setting up a Gibbs sampler requires computing the normalizing constants, ∫ p(X|θ, β)p(θ, β) dθ and ∫ p(X|θ, β)p(θ, β) dβ, which can be computationally complex. However, there are techniques available in MCMC to circumvent these calculations. One popular choice is to use the Metropolis–Hastings algorithm within Gibbs sampling.

Metropolis–Hastings within Gibbs. As discussed above, Gibbs sampling yields a Markov chain having the desired stationary distribution, p(θ, β|X). However, since it is not trivial to sample from any general distribution, the Gibbs sampler is modified by using the Metropolis–Hastings algorithm, which allows the user to specify a proposal distribution that is easier to sample from than the complete conditional distributions, and to employ a rejection technique that discards draws from the proposal distribution that are unlikely to also be draws from the desired conditional distribution. By doing so, a chain with the same stationary distribution can be produced [16]. The Gibbs sampler is modified by choosing proposal distributions for each transition step instead of directly sampling from the complete conditional distributions. The M–H within Gibbs sampler uses different proposal distributions, g_θ(θ^0, θ^1) and g_β(β^0, β^1), for each transition step:
1. Attempt to draw θ^k ∼ p(θ|β^{k−1}, X):
   (a) Draw θ* ∼ g_θ(θ^{k−1}, θ).
   (b) Accept θ^k = θ* with probability

       α(θ^{k−1}, θ*) = min{ [p(X|θ*, β^{k−1}) p(θ*, β^{k−1}) g_θ(θ*, θ^{k−1})] / [p(X|θ^{k−1}, β^{k−1}) p(θ^{k−1}, β^{k−1}) g_θ(θ^{k−1}, θ*)], 1 };   (4)

       otherwise set θ^k = θ^{k−1}.
2. Attempt to draw β^k ∼ p(β|θ^k, X):
   (a) Draw β* ∼ g_β(β^{k−1}, β).
   (b) Accept β^k = β* with probability

       α(β^{k−1}, β*) = min{ [p(X|θ^k, β*) p(θ^k, β*) g_β(β*, β^{k−1})] / [p(X|θ^k, β^{k−1}) p(θ^k, β^{k−1}) g_β(β^{k−1}, β*)], 1 };   (5)

       otherwise set β^k = β^{k−1}.

The resulting Markov chain has the desired stationary distribution π(θ, β) = p(θ, β|X). It should be noted that if either g_θ or g_β is symmetric, then it cancels out of the acceptance ratio. While it may be desirable to use a symmetric distribution so that the proposal density drops out of the calculation, the chain will converge to the stationary distribution more quickly if asymmetric jumping rules are used [4].
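A compact numerical sketch of M–H within Gibbs for a simplified one-parameter (Rasch-type) model is given below; the simulated data, the N(0, 1) priors on both person and item parameters, and the proposal standard deviations are assumptions made for illustration, and the code is not a reproduction of any particular published implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated Rasch-type responses: P(X_ij = 1) = logistic(theta_i - b_j).
n_persons, n_items = 200, 10
theta_true = rng.normal(size=n_persons)
b_true = rng.normal(size=n_items)
X = rng.binomial(1, 1.0 / (1.0 + np.exp(-(theta_true[:, None] - b_true[None, :]))))

def loglik(theta, b):
    # Matrix of log-likelihood contributions (persons x items).
    eta = theta[:, None] - b[None, :]
    return X * eta - np.log1p(np.exp(eta))

theta = np.zeros(n_persons)   # person parameters
b = np.zeros(n_items)         # item difficulties
T = 2000
b_draws = np.zeros((T, n_items))

for k in range(T):
    # Step 1: random-walk proposals for each theta_i, accepted element-wise.
    theta_prop = theta + 0.5 * rng.normal(size=n_persons)
    log_r = (loglik(theta_prop, b).sum(axis=1) - loglik(theta, b).sum(axis=1)
             + 0.5 * (theta**2 - theta_prop**2))   # N(0, 1) prior ratio
    theta = np.where(np.log(rng.uniform(size=n_persons)) < log_r, theta_prop, theta)

    # Step 2: random-walk proposals for each item difficulty b_j.
    b_prop = b + 0.5 * rng.normal(size=n_items)
    log_r = (loglik(theta, b_prop).sum(axis=0) - loglik(theta, b).sum(axis=0)
             + 0.5 * (b**2 - b_prop**2))
    b = np.where(np.log(rng.uniform(size=n_items)) < log_r, b_prop, b)
    b_draws[k] = b

print("posterior mean difficulties:", b_draws[T // 2:].mean(axis=0).round(2))
```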
Example

Patz and Junker [13] used the following proposal distributions for estimating parameters of the two-parameter logistic model:

g_B(log(β_1j^{k−1}), log(β_1j)) = N(β_1j^{k−1}, 0.3),
g_B(β_2j^{k−1}, β_2j) = N(β_2j^{k−1}, 1.1),   (6)
where β1j is the a-parameter for item j , and β2j is the b-parameter for item j . The above algorithm provides a means to estimate both item parameters and person parameters jointly. However, in traditional estimation, MML has been favored over JML due to the consistency of the marginal estimates that is lacking in the joint
estimates. Yao, Patz, and Hanson [20] developed an M–H within Gibbs algorithm that provides marginal estimates of the item parameters. Once the item parameter estimates are obtained, person parameters can be estimated using ML. The marginal distribution of the item parameters is given by:

P(β|X) ∝ P(β) ∫ P(X|θ, β) P(θ) dθ.

This can be written as:

P(β|X) ∝ P(β) ∏_{i=1}^{N} ∫_θ ∏_{j=1}^{J} P(X_ij|θ_i, β_j) P(θ) dθ,   (7)

which can be approximated using quadrature points, resulting in:

P(β|X) ≅ P(β) ∏_{i=1}^{N} h(β, X_i),   (8)

where

h(β, X_i) = Σ_{l=1}^{n} ∏_{j=1}^{J} P_ij(X_ij|θ_l, β_j) A_l,

and θ_1, ..., θ_n are quadrature points for the examinee population distribution, and A_1, ..., A_n are the weights at these points [20]. The algorithm used to draw samples from this marginal distribution of the item parameters is given below. Assuming that there are J items to be calibrated, the index j will refer to the item, whereas the index i will refer to one of the N examinees.

(a) Draw β_j* ∼ g_m(β_j|β_j^{m−1}), where g_m(β_j|β_j^{m−1}) is a transition kernel from β_j^{m−1} to β_j.
(b) For each item j = 1, ..., J, calculate the acceptance probability of the proposed parameters:

    α_j*(β^{m−1}, β*) = min{ [P(β_j*|β_{<j}^m, β_{>j}^{m−1}, X) g_m(β_j^{m−1}|β_j*)] / [P(β_j^{m−1}|β_{<j}^m, β_{>j}^{m−1}, X) g_m(β_j*|β_j^{m−1})], 1 },   (9)

    where

    β_{<j}^m = (β_i^m, i = 1, ..., j − 1),   β_{>j}^m = (β_i^m, i = j + 1, ..., J).

(c) Accept each β_j^m = β_j* with probability α_j*; otherwise let β_j^m = β_j^{m−1}.
Example

There are many examples of using MCMC techniques to estimate parameters in IRT in the literature, as noted above. The example of Yao, Patz, and Hanson [20], where the marginal distribution was used, will be used here to illustrate the types of proposal distributions used for estimating parameters of the three-parameter logistic model. Since the 3PL is the model used, there are three item parameters to be estimated for each item: a_j, b_j, and c_j. These parameters will be represented as elements of a parameter vector β, and as such, the notation will be a_j = β_1j, b_j = β_2j, c_j = β_3j.

Proposal Distributions. Example proposal distributions for the three parameters are:

g_m(β_1j|β_1j^{m−1}) ∼ N(β_1j^{m−1}, 0.17²),
g_m(log(β_2j)|log(β_2j^{m−1})) ∼ N(β_2j^{m−1}, 0.17²),
g_m(β_3j|β_3j^{m−1}) = 0.10 if β_3j ∈ (β_3j^{m−1} − 0.05, β_3j^{m−1} + 0.05), and 0.00 otherwise.   (10)
Convergence of the Markov Chain
The parameter estimates resulting from a Markov chain cannot be trusted unless there is evidence that the chain has converged to the posterior distribution. Cowles and Carlin [3] provide a description of several methods used to assess convergence, as well as software to implement the techniques. Another
approach is to create several different chains by choosing different starting values. If all the chains result in the same parameter estimates, this provides evidence that they have converged to a common stationary distribution [5]. A common tool used to assess convergence is Geweke's [7] convergence diagnostic. The diagnostic is based on the equality of means of different parts of the Markov chain. Specifically, comparing the means from the first and last parts of the chain, the two samples should have the same mean if the draws came from the stationary distribution. A z-test is used to test the equality of means, where the asymptotic standard error of the difference is used as the standard error in the test statistic. Details are provided in [7].
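A simplified version of this diagnostic can be computed directly from a chain of sampled values. The sketch below uses naive standard errors rather than the spectral density estimates of the original diagnostic, and the chain itself is a placeholder for real MCMC output.

```python
import numpy as np

def geweke_z(chain, first=0.1, last=0.5):
    # Compare the mean of the first 10% of the chain with the mean of the last 50%.
    n = len(chain)
    a = chain[: int(first * n)]
    b = chain[int((1.0 - last) * n):]
    se = np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
    return (a.mean() - b.mean()) / se

rng = np.random.default_rng(2)
chain = rng.normal(size=5000)           # placeholder chain of sampled parameter values
print("z statistic:", geweke_z(chain))  # |z| well above about 2 suggests non-convergence
```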
References

[1] Baker, F. (1998). An investigation of parameter recovery characteristics of a Gibbs sampling procedure, Applied Psychological Measurement 22(2), 153–169.
[2] Bradlow, E.T., Wainer, H. & Wang, X. (1999). A Bayesian random effects model for testlets, Psychometrika 64(2), 153–168.
[3] Cowles, K. & Carlin, B. (1995). Markov Chain Monte Carlo Convergence Diagnostics: A Comparative Review (University of Minnesota Technical Report), University of Minnesota, Minneapolis.
[4] Gelman, A., Carlin, J.B., Stern, H.S. & Rubin, D.B. (1995). Bayesian Data Analysis, Chapman & Hall, New York.
[5] Gelman, A. & Rubin, D.B. (1992). Inference from iterative simulation using multiple sequences, Statistical Science 7, 457–472.
[6] Geman, S. & Geman, D. (1984). Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images, IEEE Transactions on Pattern Analysis and Machine Intelligence 6, 721–741.
[7] Geweke, J. (1992). Evaluating the accuracy of sampling-based approaches to calculating posterior moments, in Bayesian Statistics 4, J.M. Bernardo, J.O. Berger, A.P. Dawid & A.F.M. Smith, eds, Clarendon Press, Oxford.
[8] Hambleton, R.K. & Swaminathan, H. (1985). Item Response Theory: Principles and Applications, Kluwer-Nijhoff Publishing, Boston.
[9] Hoitjink, H. & Molenaar, I.W. (1997). A multidimensional item response model: constrained latent class analysis using the Gibbs sampler and posterior predictive checks, Psychometrika 67, 171–189.
[10] Keller, L.A. (2002). A comparison of MML and MCMC estimation for the two-parameter logistic model, in Paper Presented at the Meeting of the National Council on Measurement in Education, New Orleans.
[11] Kim, S.-H. & Cohen, A.S. (1999). Accuracy of parameter estimation in Gibbs sampling under the two-parameter logistic model, in Paper Presented at the Meeting of the American Educational Research Association, Quebec.
[12] Patz, R.J. (1996). Markov Chain Monte Carlo Methods for Item Response Theory Models with Applications for the National Assessment of Educational Progress, Unpublished doctoral dissertation, Carnegie Mellon University, Pittsburgh.
[13] Patz, R.J. & Junker, B.W. (1999). A straightforward approach to Markov Chain Monte Carlo methods for item response models, Journal of Educational and Behavioral Statistics 24(2), 146–178.
[14] Patz, R.J. & Junker, B.W. (1999). Applications and extensions of MCMC in IRT: multiple item types, missing data, and rated responses, Journal of Educational and Behavioral Statistics 24(4), 342–366.
[15] Patz, R.J., Junker, B.W., Johnson, M.S. & Mariano, L.T. (2002). The hierarchical rater model for rated test items and its application to large-scale educational data, Journal of Educational and Behavioral Statistics 27(4), 341–384.
[16] Tierney, L. (1994). Exploring posterior distributions with Markov chains, Annals of Statistics 22, 1701–1762.
[17] Wainer, H., Bradlow, E.T. & Du, Z. (2000). Testlet response theory: an analog for the 3-PL useful in adaptive testing, in Computerized Adaptive Testing: Theory and Practice, W.J. van der Linden & C.A.W. Glas, eds, Kluwer, Boston, pp. 245–270.
[18] Wang, X., Bradlow, E.T. & Wainer, H. (2002). A general Bayesian model for testlets: theory and applications, Applied Psychological Measurement 26, 109–128.
[19] Wollack, J.A., Bolt, D.M., Cohen, A.S. & Lee, Y.-S. (2002). Recovery of item parameters in the nominal response model: a comparison of marginal likelihood estimation and Markov Chain Monte Carlo estimation, Applied Psychological Measurement 26(3), 339–352.
[20] Yao, L., Patz, R.J. & Hanson, B.A. (2002). More efficient Markov Chain Monte Carlo estimation in IRT using marginal posteriors, in Paper Presented at the Meeting of the National Council on Measurement in Education, New Orleans.
Further Reading Chib, S. & Greenberg, E. (1995). Understanding the Metropolis-Hasting algorithm, The American Statistician 49, 327–335. Gelman, A., Roberts, G.O. & Gilks, W.R. (1996). Efficient metropolis jumping rules, in Bayesian Statistics 5: Proceedings of the Fifth Valencia International Meeting, J.M. Bernardo, J.O. Berger, A.P. Dawid & A.F.M. Smith, eds, Oxford University Press, New York, pp. 599–608.
LISA A. KELLER
Markov, Andrei Andreevich ADRIAN BANKS Volume 3, pp. 1148–1149 in Encyclopedia of Statistics in Behavioral Science ISBN-13: 978-0-470-86080-9 ISBN-10: 0-470-86080-4 Editors Brian S. Everitt & David C. Howell John Wiley & Sons, Ltd, Chichester, 2005
Markov, Andrei Andreevich

Born: June 14, 1856, in Ryazan, Russia.
Died: July 20, 1922, in Petrograd (now St Petersburg), Russia.
Markov was a Russian mathematician who developed the theory of sequences of dependent random variables in order to extend the weak law of large numbers and the central limit theorem. In the process, he introduced the concept of a chain of dependent variables, now known as a Markov chain. It is this idea that has been applied and developed in a number of areas of the behavioral sciences. Markov’s family moved to St Petersburg when he was a boy. At school, he did not excel in most subjects, with the exception of mathematics, in which he showed precocious talent, developing what he believed to be a novel method for the solution of linear differential equations (it transpired that it was not new). On graduation, he entered the Faculty of Mechanics and Mathematics at St Petersburg University, where he would stay for the remainder of his career. He defended his Ph.D. thesis in 1884, ‘On certain applications of algebraic continued fractions’. Markov was politically very active; for example, he opposed honorary membership of the Academy of Sciences for members of the royal family as he felt they had not earned the honor. His research was partially fueled by a similarly vehement dispute with Nekrasov, who claimed that independence is required for the law of large numbers. Markov’s study of what became known as Markov chains ultimately disproved Nekrasov’s ideas. In 1913, Markov included in his book the first application of a Markov
chain. He studied the sequence of 20 000 letters in A. S. Pushkin’s poem ‘Eugeny Onegin’ and established the probability of transitions between the vowels and consonants, which form a Markov chain. He died in 1922 as a result of sepsis that developed after one of several operations he underwent, over the course of his life, for a congenital knee deformity and its later complications. A Markov chain consists of multiple states of a process and the probabilities of moving from one state to another over time, known as transition probabilities. A first-order Markov chain has the property that the subsequent state of the process is dependent on the current state but not on the preceding states. Higher-order Markov chains are dependent on previous states. The chief advantage that this method holds is the ability to analyze sequences of data. Markov chains have been used to model a wide range of behavioral data and cognitive processes, especially learning. Hidden Markov models are an extension in which the observable states are the product of unobservable states, which form a Markov chain. More recently, Markov chains have been used to generate sequences of data for use in Monte Carlo studies.
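To make the definition concrete, here is a minimal sketch (Python; the two states echo Markov’s vowel–consonant example, but the transition probabilities are invented for illustration and are not Markov’s estimates) of how a first-order Markov chain is simulated from its transition probabilities:

    import random

    # Transition probabilities: one row per current state; each row sums to 1.
    P = {
        "vowel":     {"vowel": 0.13, "consonant": 0.87},
        "consonant": {"vowel": 0.66, "consonant": 0.34},
    }

    def step(current):
        """Draw the next state given only the current state (the first-order property)."""
        r = random.random()
        cumulative = 0.0
        for nxt, prob in P[current].items():
            cumulative += prob
            if r < cumulative:
                return nxt
        return nxt  # guard against floating-point rounding

    def simulate(start, n_steps):
        chain = [start]
        for _ in range(n_steps):
            chain.append(step(chain[-1]))
        return chain

    print(simulate("vowel", 20))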
Further Reading

Basharin, G.P., Langville, A.N. & Naumov, V.A. (2004). The life and work of A.A. Markov, Linear Algebra and its Applications 386, 3–26.
Markov, A.A. (1971). Extension of the limit theorems of probability theory to a sum of variables connected in a chain, reprinted in R. Howard, ed., Dynamic Probabilistic Systems, Vol. 1: Markov Chains, Wiley, New York.
Wickens, T.D. (1989). Models for Behavior: Stochastic Processes in Psychology, W.H. Freeman, San Francisco.
ADRIAN BANKS
Markov Chains DONALD LAMING Volume 3, pp. 1149–1152 in Encyclopedia of Statistics in Behavioral Science ISBN-13: 978-0-470-86080-9 ISBN-10: 0-470-86080-4 Editors Brian S. Everitt & David C. Howell John Wiley & Sons, Ltd, Chichester, 2005
Markov Chains Suppose you are playing blackjack. You are first dealt a ‘6’ and then a ‘9’, to give you a miserable total of 15. What are you going to do? If you choose to ‘hit’ (draw an additional card), your total will rise to
16, 17, 18, 19, 20 or 21, each with a probability of 1/13 (ignoring the complications of sampling without replacement), and with probability 7/13 you will ‘go bust’. Those probabilities are independent of whether you received the ‘6’ before the ‘9’, or received the ‘9’ first. They are independent of whether your 15 was made up of a ‘6’ and a ‘9’, or a ‘5’ and a ‘10’, or a ‘7’ and an ‘8’. Your probabilities of future success depend solely on your present total, and owe nothing to how that total was reached. That is the Markov property.

                          Total after drawing one additional card
Present
 total     12    13    14    15    16    17    18    19    20    21    >21
  11      1/13  1/13  1/13  1/13  1/13  1/13  1/13  1/13  1/13  4/13    0
  12        0   1/13  1/13  1/13  1/13  1/13  1/13  1/13  1/13  1/13   4/13
  13        0     0   1/13  1/13  1/13  1/13  1/13  1/13  1/13  1/13   5/13
  14        0     0     0   1/13  1/13  1/13  1/13  1/13  1/13  1/13   6/13
  15        0     0     0     0   1/13  1/13  1/13  1/13  1/13  1/13   7/13
  16        0     0     0     0     0   1/13  1/13  1/13  1/13  1/13   8/13
  17        0     0     0     0     0     0   1/13  1/13  1/13  1/13   9/13
  18        0     0     0     0     0     0     0   1/13  1/13  1/13  10/13
  19        0     0     0     0     0     0     0     0   1/13  1/13  11/13
  20        0     0     0     0     0     0     0     0     0   1/13  12/13
                                                                            (1)

The matrix above sets out the probabilities of different outcomes following a ‘hit’ with different totals (again ignoring the complications of sampling without replacement, and I emphasize that this is only a part of the probabilities associated with blackjack, but it will suffice for illustration). The future course of play depends only on the present total and this constitutes a state of the Markov chain. Each row of the matrix sums to unity and is therefore a proper multinomial distribution of outcomes, a different distribution for each state. If the total after one ‘hit’ still falls short of 21, it is possible to ‘hit’ again, and the transformation defined by the matrix can be applied more than once. More generally, a matrix of this kind (with all rows summing to unity) can be used to calculate the possible evolution of many systems that similarly inhabit a set of discrete states. The essential condition for such calculation is that the entire future of the system depends only on its present state and is entirely independent of its past history.
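As a concrete illustration, here is a minimal sketch (Python; it treats 21 and ‘bust’ as absorbing, counts aces as 1, and ignores sampling without replacement, as in the matrix above) of how the transformation defined by matrix (1) can be applied more than once to a distribution over the present total:

    from fractions import Fraction

    F = Fraction
    totals = list(range(11, 22)) + ["bust"]          # states 11..21 plus 'bust'

    def row(present):
        """One row of matrix (1): probabilities for the total after one more card."""
        probs = {t: F(0) for t in totals}
        if present in (21, "bust"):                  # 21 and 'bust' are treated as absorbing
            probs[present] = F(1)
            return probs
        for card in list(range(1, 10)) + [10, 10, 10, 10]:   # A..9, plus 10/J/Q/K
            new = present + card
            probs[new if new <= 21 else "bust"] += F(1, 13)
        return probs

    def hit(dist):
        """Apply the transition matrix once to a distribution over the present total."""
        out = {t: F(0) for t in totals}
        for state, p in dist.items():
            for new, q in row(state).items():
                out[new] += p * q
        return out

    dist = {t: F(0) for t in totals}
    dist[15] = F(1)                                  # start with the miserable total of 15
    after_one = hit(dist)                            # P(bust) = 7/13, P(16..21) = 1/13 each
    after_two = hit(after_one)                       # the matrix applied a second time
    print(after_one["bust"], after_two["bust"])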
Everyday examples of Markov systems tend to evolve in continuous time. The duration of telephone calls tends to an exponential distribution to a surprising degree of accuracy, and exhibits the Markov property. Suppose you telephone a call center to enquire about an insurance claim. ‘Please hold; one of our consultants will be with you shortly.’ About ten minutes later, still waiting for a ‘consultant’, you are losing patience. The unpalatable fact is that the time you must now expect to wait, after already waiting for ten minutes, is just as long as when you first started. The Markov property means that the future (how long you still have to wait) is entirely independent of the past (how long you have waited already). Light bulbs, on the other hand, are not Markov devices. The probability of failure as you switch the light on increases with the age of the bulb. Likewise, if the probabilities in (1) were recalculated on the basis of sampling without replacement, they would be found to depend on how the total had been reached – a ‘6’ and a ‘9’ versus a ‘5’ and a ‘10’ versus a ‘7’ and an ‘8’. The chief use of Markov chains in psychology has been in the formulation of models for learning. Bower [1] asked his subjects to learn a list of 10
paired-associates. The stimuli were pairs of consonants and five were paired with the digit ‘1’, five with ‘2’. Subjects were required to guess, if they did not know the correct response; after each response they were told the correct answer. Bower proposed the following model, comprised of two Markov states (L & U) and an initial distribution of pairs between them.

                    Trial n + 1        Probability   Initial
                     L        U        correct       distribution
Trial n    L         1        0           1               0
           U         c      1 − c        1/2              1
                                                               (2)

This model supposes that on each trial each hitherto unlearned pairing is learnt with probability c and, until a pairing is learnt, the subject is guessing (correct with probability 1/2). Once a pairing is learnt (state L), subsequent responses are always correct. There is no exit from state L, which is therefore absorbing. State U (unlearned), on the other hand, is transient. (More precisely, a state is said to be transient if return to that state is less than certain.) Bower’s data fitted his model like a glove: but it also needs to be said that any more complicated experiment poses problems not seen here. This is not the only way that Bower’s idea can be formulated. The three-state model

                    Trial n + 1                           Probability   Initial
                     A            C             E         correct       distribution
Trial n    A         1            0             0            1           (1/2)d
           C         0           1/2           1/2           1           (1/2)(1 − d)
           E         d      (1/2)(1 − d)  (1/2)(1 − d)       0            1/2
                                                                               (3)

with d = 2c/(1 + c) is equivalent to the previous two-state model (2), in the sense that the probability of any and every particular set of data is the same whichever model is used for the calculation [4, pp. 312–314]. The data for each paired-associate in Bower’s experiment consists of a sequence of guesses followed by a criterion sequence of correct responses. The errors all occur in State E, correct guesses prior to the last error in State C, and all the responses following learning (plus, possibly some correct guesses immediately prior to learning), in State A. These three states are all identifiable, in the sense that (provided there is a sufficiently long terminal sequence of correct responses to allow the inference that the pair has been learned) it can be inferred uniquely from the sequence of responses which state the system occupied at any given trial in the sequence. Models (2) and (3) are equivalent.

A more elaborate model of this kind was proposed by Theios and Brelsford [6] to describe avoidance learning by rats. The rats start off naïve (State N) and are shocked. At that point, they learn how to escape with probability c and exit State N. They also learn, for sure, the connection between the warning signal (90 dB noise) and the ensuing shock, but that connection is formed only temporarily with probability (1 − d) (i.e., exit to States T or F) and may be forgotten with probability q (State F, rather than T) before the next trial. A trial in State F (meaning of warning signal forgotten) means that the rat will be shocked, whence the connection of the warning signal to the following shock may be acquired permanently with probability e. Here again all the states are identifiable from the observed sequence of responses.

                    Trial n + 1                                        Probability     Initial
                     P            T                F            N      of avoidance    distribution
Trial n    P         1            0                0            0           1               0
           T         0          1 − q              q            0           1               0
           F         e      (1 − e)(1 − q)     (1 − e)q         0           0               0
           N        cd      c(1 − d)(1 − q)    c(1 − d)q      1 − c         0               1
                                                                                                 (4)

The accuracy of the model was well demonstrated in a series of experimental manipulations by Brelsford [2].
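To illustrate how such a chain generates data, here is a short sketch (Python; the value of c and the numbers of items and trials are arbitrary) that simulates Bower’s two-state model (2) and compares the observed proportion correct on each trial with the implied learning curve 1 − (1/2)(1 − c)^(n−1):

    import random

    c = 0.3                                   # learning-rate parameter (illustrative value)
    n_items, n_trials = 10000, 10

    correct = [0] * n_trials
    for _ in range(n_items):
        learned = False                       # every pair starts in the unlearned state U
        for t in range(n_trials):
            if learned or random.random() < 0.5:   # state L always correct; U guesses at 1/2
                correct[t] += 1
            if not learned and random.random() < c:
                learned = True                # transition U -> L with probability c
    for t in range(n_trials):
        observed = correct[t] / n_items
        theory = 1 - 0.5 * (1 - c) ** t       # 1 - (1/2)(1 - c)^(n - 1), with n = t + 1
        print(t + 1, round(observed, 3), round(theory, 3))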
A rather different use of a Markov chain is illustrated by Shannon’s [5] approximations to English text. English text consists of strings of letters and spaces (ignoring the punctuation) conforming to various high-level sequential constraints. Shannon approximated those constraints to varying degrees with a Markov chain. A zeroth order approximation was produced by selecting succeeding characters independently and at random, each with probability 1/27. A first-order approximation consisted of a similar sequence of letters, but now selected in proportion to their frequencies of occurrence in English text. A second-order approximation was constructed by matching successive pairs of letters to their natural frequency of occurrence. In this approximation, not only did the overall letter frequencies match English text, but the probabilities of selection depended also on the letter preceding. A third order approximation meant that each sequence of three successive letters was matched in frequency to its occurrence in English text:

IN NO IST LAT WHEY CRATICT FROURE BIRS GROCID PONDENOME OF DEMONSTURES OF THE REPTAGIN IS REGOACTIONA OF CRE.

The corresponding Markov chain now has 27² distinct states. There was, however, no need to construct the transition matrix (a hideous task); it was sufficient to take the last two letters of the approximation and open a book at random, reading on until those same two letters were discovered in correct sequence. The approximation continued with whatever letter followed the two in the book.

The 27 × 27 matrix for constructing a second-order approximation illustrates some further properties of Markov chains. Since all 27 characters will recur, sooner or later, in English text, all states of the chain (the letters A to Z and the space) can be reached from all other states. The chain is said to be irreducible. In comparison, in example (4), return to State N is not possible after it has been left and thereafter the chain can be reduced to just three states. If a Markov chain is irreducible, it follows that all states are certain to recur in due course. A periodic state is one that can be entered only every n steps, after which the process moves on to the other states. States that are not subject to such a restriction are aperiodic, and if return is also certain, they are called ergodic. Since the 27 × 27 matrix in question is irreducible, all its states are ergodic. It follows that, however the approximation to English text starts off, after a sufficient length has been generated, the frequency of different states (letters and spaces) will tend to a stationary distribution, which, in this case, will correspond to the long-run frequency of the characters in English text.

To sum up, the matrix (5) exemplifies several important features of Markov chains. The asterisks denote any arbitrary entry between 0 and 1; the row sums, of course, are all unity.

          S1   S2   S3   S4   S5   S6   S7   S8   S9
    S1     *    *    *    *    *    *    *    *    *
    S2     *    *    *    *    *    *    *    *    *
    S3     0    0    1    0    0    0    0    0    0
    S4     0    0    0    0    1    0    0    0    0
    S5     0    0    0    0    0    1    0    0    0
    S6     0    0    0    1    0    0    0    0    0
    S7     0    0    0    0    0    0    *    *    *
    S8     0    0    0    0    0    0    *    *    *
    S9     0    0    0    0    0    0    *    *    *
                                                       (5)

1. It is possible to access all other states from S1 and S2, but return (except from S1 to S2 and vice versa) is not possible. S1 and S2 are transient.
2. Once S3 is entered, there is no exit. S3 is absorbing.
3. S4, S5, and S6 exit only to each other. They form a closed (absorbing) subset; the subchain comprised of these three states is irreducible. Moreover, once this set is entered, S4 exits only to S5, S5 only to S6, and S6 only to S4, so that thereafter each of these states is visited every three steps. These states are periodic, with period 3.
4. S7, S8, and S9 likewise exit only to each other, and also form a closed irreducible subchain. But each of these three states can exit to each of the other two and the subchain is aperiodic. Since return to each of these three states is certain, once the subset has been entered, the subchain is ergodic.

For further properties of Markov chains the reader should consult [3, Chs. XV & XVI].
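Ergodic chains such as the subchain S7–S9, or the 27-state letter chain, converge to a stationary distribution from any starting point; a minimal sketch (Python; the three-state transition matrix is an arbitrary illustrative ergodic chain, not the letter chain):

    # Repeatedly applying an irreducible, aperiodic transition matrix drives any
    # starting distribution toward the same stationary distribution.
    P = [
        [0.5, 0.3, 0.2],
        [0.2, 0.6, 0.2],
        [0.1, 0.4, 0.5],
    ]

    def one_step(dist, P):
        n = len(dist)
        return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]

    dist = [1.0, 0.0, 0.0]              # start entirely in the first state
    for _ in range(100):
        dist = one_step(dist, P)
    print([round(p, 4) for p in dist])  # approximately stationary, whatever the start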
References

[1] Bower, G.H. (1961). Application of a model to paired-associate learning, Psychometrika 26, 255–280.
[2] Brelsford Jr, J.W. (1967). Experimental manipulation of state occupancy in a Markov model for avoidance conditioning, Journal of Mathematical Psychology 4, 21–47.
[3] Feller, W. (1957). An Introduction to Probability Theory and its Applications, Vol. 1, 2nd Edition, Wiley, New York.
[4] Greeno, J.G. & Steiner, T.E. (1964). Markovian processes with identifiable states: general considerations and application to all-or-none learning, Psychometrika 29, 309–333.
[5] Shannon, C.E. (1948). A mathematical theory of communication, Bell System Technical Journal 27, 379–423.
[6] Theios, J. & Brelsford Jr, J.W. (1966). Theoretical interpretations of a Markov model for avoidance conditioning, Journal of Mathematical Psychology 3, 140–162.
(See also Markov, Andrei Andreevich) DONALD LAMING
Martingales DONALD LAMING Volume 3, pp. 1152–1154 in Encyclopedia of Statistics in Behavioral Science ISBN-13: 978-0-470-86080-9 ISBN-10: 0-470-86080-4 Editors Brian S. Everitt & David C. Howell John Wiley & Sons, Ltd, Chichester, 2005
Martingales A simple example of a martingale would be one’s cumulative winnings in an ‘absolutely fair’ game. For example, Edith and William play a regular game of bridge, once a week, against Norman and Sarah for £10 per 100 points. This is a very satisfactory arrangement because these two pairs are exactly matched in their card skills. Coupled with a truly random assignment of cards to the four players, this match ensures that their expected winnings on each hand are zero; the game is absolutely fair. This does not, however, preclude one pair or the other building up a useful profit over the course of time. Norman and Sarah’s cumulative winnings are Sn = X1 + X2 + · · · + Xn ,
(1)
where Xi is the amount they won on the ith hand. This equation describes an additive process, and there are two elementary conditions on the Xi that permit simple and powerful inferences about the behavior of the sums Sn.

1. If the Xn are independent and identically distributed, (1) describes a random walk (q.v.). In a random walk, successive steps, Xn, are independent; this means that the distributions of successive Sn can be calculated by repeated convolution of the Xn.
2. If, instead, the expectations of successive Xn are zero, the sequence {Sn} is a martingale. I emphasize that Xn need not be independent of the preceding steps {Xi, i = 1, . . . , n − 1}, merely that its expectation is always zero, irrespective of the values that those preceding steps happen to have had. Formally, E{Xn+1 | X1, X2, . . . , Xn} = 0. This has the consequence that E{Sn+1 | X1, X2, . . . , Xn} = Sn (which is the usual definition of a martingale).

Continuing the preceding example, the most important rewards in bridge come from making ‘game’, that is, 100 points or more from contracts bid and made. If Norman and Sarah already have a part-score toward ‘game’ from a previous hand, their bidding strategy will be different in consequence. Once a partnership has made ‘game’, most of their penalties (for failing to make a contract) are doubled; so bidding strategy changes again. None of
these relationships abrogate the martingale property; the concept is broad. The martingale property means that the variances of the sums Sn are monotonically increasing, whatever the relationship of Xn+1 to the preceding Xn. The variance of Sn+1 is

E{(Sn + Xn+1)²} = E{Sn² + 2SnXn+1 + Xn+1²} = E{Sn² + Xn+1²},          (2)
because E{Xn+1 } = 0. From this, it follows that if the variances of the sums Sn are bounded, then Sn tends to a limiting distribution [2, p. 236]. Martingales are important, not so much as models of behavior in their own right, but as a concept that simplifies the analysis of more complicated models, as the following example illustrates. Yellott [9] reported a long binary prediction experiment in which observers had to guess which of two lights, A and B, would light up on the next trial. A fundamental question was whether observers modified their patterns of guessing on every trial or only when they found they had guessed wrongly. The first possibility may be represented by a linear model. Let pn be the probability of choosing light A on trial n. If Light A does indeed light up, put pn+1 = αpn + (1 – α);
(3a)

otherwise

pn+1 = αpn.          (3b)
Following each trial pn in (3) is replaced by a weighted average of pn and the light actually observed (counting Light A as 1 and Light B as 0). If Light A comes on with probability a, fixed over successive trials, then pn tends asymptotically to a – that is, ‘probability matching’, a result that is commonly observed in binary prediction. This finding provides important motivation for the linear model, except that the same result can be derived from other models [1, esp. pp. 179–181]. So, for the last 50 trials of his experiment, Yellott switched on whichever light the subject selected (i.e., noncontingent success reinforcement with probability 1). If observers modified their patterns of guessing only when they had guessed wrong, noncontingent success reinforcement should effect no change at all in the pattern of guessing. But for the linear model on
trial n + 1, E{pn+1} = pn[αpn + (1 − α)] + (1 − pn)[αpn] = pn,          (4)
so the sequence {pn } is a martingale. The variance of pn therefore increases monotonically and, for any one sequence of trials, pn should tend either to 0 or to 1. Yellott found no systematic changes of that kind. However, a formally similar experiment by Howarth and Bulmer [4] yielded a different result. The observers in this experiment were asked to report a faint flash of light that had been adjusted to permit about 50% detections. The intensity of the light was constant over successive trials, but there was no knowledge of results. So the response, detection or failure to detect, fulfilled a role analogous to noncontingent success reinforcement. Successive responses were statistically related in a manner consistent with the linear model (3). Moreover, the authors reported ‘The experiment was stopped on two occasions after the probability of seeing had dropped almost to zero’ [4, p. 164]; that is, for two observers, pn decreased to 0. There was a similar decrease to zero in the proportion of positive diagnoses by a consultant pathologist engaged in screening cervical smears [3]. Her frame of judgment shifted to the point that she was passing as ‘OK’ virtually every smear presented to her for expert examination. The tasks of predicting the onset of one of two lights or of detecting faint flashes of light do not make sense unless both alternative responses are appropriate on different trials. This generates a prior expectation that some particular proportion of each kind of response will be required. A prior expectation can be incorporated into the linear model by replacing the reinforcement in (3) (Light A = 1, Light B = 0) with the response actually uttered (there was no reinforcement in Howarth and Bulmer’s experiment) and then replacing the response (1 for a detection, 0 for a miss) with a weighted average of the prior expectation and the response. This gives pn+1 = αpn + (1 – α)[bπ∞ + (1 – b)];
(5a)

otherwise

pn+1 = αpn + (1 − α)bπ∞.          (5b)
Here, π∞ is the prior expectation of what proportion of trials will present a flash of light for detection,
and the expression in square brackets applies the linear model of (3) to the effect of that prior expectation. The probability of detection tends asymptotically to π∞ . But, notwithstanding that the mean no longer satisfies the martingale condition, the variance still increases to a limiting value, greater than would be obtained from independent binomial trials [5, p. 464]. This increased variability shows up in the proportions of positive smears reported by different pathologists [6], in the proportions of decisions (prescription, laboratory investigation, referral to consultant, followup appointment) by family doctors [8, p. 158] and in the precision of the ‘quantal’ experiment. The ‘quantal’ experiment was a procedure introduced by Stevens and Volkmann [7] for measuring detection thresholds. Increments of a fixed magnitude were presented repeatedly for 25 successive trials. The detection probability for that magnitude was estimated by the proportion of those increments reported by the observer. The precision of 25 successive observations in the ‘quantal’ experiment is equivalent to about five independent binomial trials [5] (see Catalogue of Probability Density Functions).
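Returning to the linear model (3) and the martingale property (4), a small sketch (Python; α, the starting probability, and the numbers of trials and simulated observers are arbitrary) of behavior under noncontingent success reinforcement: the mean of pn stays at its starting value while individual sequences drift toward 0 or 1.

    import random

    alpha, n_trials, n_subjects = 0.9, 2000, 1000
    p0 = 0.5

    finals = []
    for _ in range(n_subjects):
        p = p0
        for _ in range(n_trials):
            if random.random() < p:          # the response itself is 'reinforced'
                p = alpha * p + (1 - alpha)  # (3a): chose A, p moves toward 1
            else:
                p = alpha * p                # (3b): chose B, p moves toward 0
        finals.append(p)

    mean_p = sum(finals) / n_subjects
    absorbed = sum(1 for p in finals if p < 0.01 or p > 0.99) / n_subjects
    print(round(mean_p, 3), round(absorbed, 3))   # mean stays near 0.5; most sequences absorb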
References

[1] Atkinson, R.C. & Estes, W.K. (1963). Stimulus sampling theory, in Handbook of Mathematical Psychology, Vol. 2, R.D. Luce, R.R. Bush & E. Galanter, eds, Wiley, New York, pp. 121–268.
[2] Feller, W. (1957). An Introduction to Probability Theory and its Applications, Vol. 2, Wiley, New York.
[3] Fitzpatrick, J., Utting, J., Kenyon, E. & Burston, J. (1987). Internal Review into the Laboratory at the Women’s Hospital, Liverpool, Liverpool Health Authority, 8th September.
[4] Howarth, C.I. & Bulmer, M.G. (1956). Non-random sequences in visual threshold experiments, Quarterly Journal of Experimental Psychology 8, 163–171.
[5] Laming, D. (1974). The sequential structure of the quantal experiment, Journal of Mathematical Psychology 11, 453–472.
[6] Laming, D. (1995). Screening cervical smears, British Journal of Psychology 86, 507–516.
[7] Stevens, S.S. & Volkmann, J. (1940). The quantum of sensory discrimination, Science 92, 583–585.
[8] Wilkin, D., Hallam, L., Leavey, R. & Metcalfe, D. (1987). Anatomy of Urban General Practice, 1987, Tavistock Publications, London.
[9] Yellott Jr, J.I. (1969). Probability learning with noncontingent success, Journal of Mathematical Psychology 6, 541–575.
DONALD LAMING
Matching VANCE W. BERGER, LI LIU AND CHAU THACH Volume 3, pp. 1154–1158 in Encyclopedia of Statistics in Behavioral Science ISBN-13: 978-0-470-86080-9 ISBN-10: 0-470-86080-4 Editors Brian S. Everitt & David C. Howell John Wiley & Sons, Ltd, Chichester, 2005
Matching Models for Matched Pairs Evaluation is generally inherently comparative. For example, a treatment is generally neither ‘good’ nor ‘bad’ in absolute terms, but rather better than or worse than another treatment at bringing about the desired result or outcome. The studies that allow for such comparative evaluations must then also be comparative. It would not do, for example, to study only one treatment and then declare it to be the winner. To isolate the effects of the treatments under study, avoid confounding treatment effects with unit effects, and possibly increase the power of the study to detect treatment differences, the units under study must be carefully matched across treatment groups. To appreciate the need to control such confounding, consider comparing the survival times of patients in a cancer clinic to those of patients in a sports injury clinic. It might be expected that the patients with sports injuries would live longer, but this is not generally accepted as proof that treatment (for either sports injuries or for cancer) is better at a sports injury clinic than it is at a cancer center. The difference in survival times across the ‘treatment’ groups is attributable not to the treatments themselves (cancer center vs. sports injury clinic) but rather to underlying differences in the patients who choose one or the other. While the aforementioned example is an obvious example of confounding, and one that is not very likely to lead to confusion, there are less obvious instances of confounding that can lead to false conclusions. For example, Marcus [6] evaluated a randomized study of a culturally sensitive AIDS education program [13]. At baseline, the treatment group had significantly lower AIDS knowledge scores (39.89 vs. 36.72 on a 52-question test, p = 0.005), so an unadjusted comparison would be confounded. The key to avoiding confounding is in ensuring the comparability of the comparison groups in every way other than the treatments under study. This way, by the process of elimination, any observed difference can be attributed to differences in the effects of the treatments under study. Ideally, the control group for any given patient or set of patients would be the same patient or set of patients, under identical conditions. This is the idea behind crossover trials,
in which patients are randomized not to treatment conditions, but rather to sequences of treatment conditions, so that each patient experiences each treatment condition, in some order. But crossovers are not the ideal solution, because while each patient is exposed to each treatment condition, this exposure is not under identical conditions. Time must necessarily elapse between the sequential exposures, and patients do not remain the same over time, even if left untreated. The irony is that the very nature of the crossover design, in which the patient is treated initially with one treatment in the hope of improving some condition (i.e., changing the patient), interferes with homogeneity over time. Under some conditions, carryover effects (the effect of treatment during one period on outcomes measured during a subsequent period when a different treatment is being administered) may be minimal, but, in general, this is a serious concern. Besides crossover designs, there are other methods to ensure matching. The matching of the comparison groups may be at the group level, at the individual unit level, or sometimes at an intermediate level. For example, if unrestricted randomization is used to create the comparison groups, then the hope is that the groups are comparable with respect to both observed and unobserved covariates, but there is no such hope for any particular subset of the comparison groups. If, however, the randomization is stratified, say by gender, then the hope is that the females in one treatment group will be comparable to the females in each other treatment group, and that the males in one treatment group will be comparable to the males in each other treatment group (see Stratification). When randomization is impractical or impossible, the matching may be undertaken in manual mode. There are three types of matching – simple matching, symmetrical matching, and split sample [5]. In simple matching, a given set of covariates, such as age, gender, and residential zip code, may be specified, and each unit (subject) may be matched to one (resulting in paired data) or several other subjects on the basis of these covariates. In most situations, simple matching is also done to increase the power of the study to detect differences between the treatment groups due to the reduction in the variability between the treatment groups. Pauling [8] presented a data set of simple matching in his Table 33.2. This table presents the time until cancer patients were determined to be beyond treatment.
There are 11 cancer patients who were treated with ascorbic acid (vitamin C), and ‘for each treated patient, 10 controls were found of the same sex, within five years of the same age, and who had suffered from cancer of the same primary organ and histological tumor type. These 1000 cancer patients (overall there were more than 11 cases, but only these 11 cases are presented in Table 33.2) comprise the control group. The controls received the same treatment as the ascorbic-treated patients except for the ascorbate’ [8]. Of note, Pauling [8] went on to state that ‘It is believed that the ascorbate-treated patients represent a random selection of all of the terminal patients in the hospital, even though no formal randomization process was used’ [8]. It would seem that what is meant here is that there are no systematic differences to distinguish the cases from the controls, and this may be true, but it does not, in any way, equate to proper randomization [1]. The matching, though not constituting randomization, may provide somewhat of a basis for inference, because if the matching were without any hidden bias, then the time to ‘untreatability’ for the case would have the same chance to exceed or to be exceeded by that of any of the controls matched to that case. Under this assumption, one may proceed with an analysis based on the ranking of the case among its controls. For example, Case #28 had a time of 113 days, which is larger than that of five controls and smaller than that of the other five controls. So, Case #28 is quite typical of its matched controls. Of course, any inference would be based on all 11 cases, and not just Case #11. There is missing data for one of the controls for Case #35, so we exclude this case for ease of illustration (a better analysis, albeit a more complicated one, would make use of all cases, including this one), and proceed with the remaining 10 cases, as in Table 1. Cases #37 and #99 had times of 0 days, and each had at least one control with the same time, giving rise to the ties in Table 1. These ties also complicate the analysis somewhat, but for our purposes, it will not matter if they are resolved so that the case had the longer time or the shorter time, as the ultimate conclusion will be that there was no significant difference at the 0.05 level. To illustrate this, we take a conservative approach (actually, it is a liberal approach, as it enhances the differences between the groups, but this is conservative relative to the claim that there is no significant difference) as follows. First, we note that
Table 1   Data from Pauling [8]

Case    Controls with     Ties    Controls with
        shorter times             longer times
 28           5             0           5
 33           1             0           9
 34           6             0           4
 36           9             0           1
 37           0             2           8
 38           2             0           8
 84           6             0           4
 90           4             0           6
 92           2             0           8
 99           0             1           9
Sum          35             3          62
there were 62 shorter case times versus only 35 longer case times (this could form the basis for a U-test [10, 11], ignoring the ties), and so the best chance to find a difference would be with a one-sided test to detect shorter case times. We then help this hypothesis along by resolving the ties so that the case times are always shorter. This results in combining the last two columns of Table 1, so that it is 35 versus 65. As noted, the totals, be it 35 versus 62 or 35 versus 65, are sufficient to enable us to conduct a U-test (see Wilcoxon–Mann–Whitney Test) [10, 11]. However, we propose a different analysis because the U-test treats as interchangeable all comparisons; we would like to make more use of the matched sets. Consider, then, any given rank out of the 11 (one case matched to 10 controls). For example, consider the second largest. Under the null hypothesis, the probability of being at this rank or higher is 2/11, so this can form the basis for a binomial test. Of course, one can conduct a binomial test for any of the rank positions. The data summary that supports such analyses is presented in Table 2. The rank of the case is one more than the number of control times exceeded by the case time, or one more than the second column of Table 1. It is known that artificially small P values result from selection of the test based upon the data at hand, in particular, when dealing with maximized differences among cutpoints of an ordered contingency table, or a Lancaster decomposition (see Kolmogorov–Smirnov Tests) [2, 9]. Nevertheless, we select the binomial test that yields the most significant (smallest) one-sided P value, which is at rank 7. We see that 9 of the 10 cases had a rank of seven or less. The point probability of this outcome is, by
Table 2   Data from Table 1 put into a form amenable for binomial analysis

Rank     Sets in which the      Cumulative
         case has this rank
  1              2ᵃ                  2
  2              1                   3
  3              2                   5
  4              0                   5
  5              1                   6
  6              1                   7
  7              2                   9
  8              0                   9
  9              0                   9
 10              1                  10
 11              0                  10
Total           10

ᵃ Cases #37 and #99 have the lowest rank by the tie-breaking convention.
the binomial formula, (10)(7/11)⁹(4/11) = 0.0622. The P value is this plus the null probability of all other more extreme outcomes. In this case, the only more extreme outcome is to have all 10 cases have a rank of seven or less, and this has null probability (7/11)¹⁰ = 0.0109, so the one-sided binomial P value is the sum of these two null probabilities, or 0.0731, which, with every advantage given toward a finding of statistical significance, still came up short. This is not surprising with so small a sample size. The method, of course, can be applied with larger sample sizes as well. The second type of matching is symmetrical matching, where the effects of two different treatments are tested on opposite sides of the body. Kohnen et al. [4] compared the difference between standard and a modified (zero-compression) Hansatome microkeratome head in the incidence of epithelial defects. Ninety-three patients (186 eyes) were enrolled in the study. To avoid confounding, each patient’s two eyes were matched. In one eye, the flaps were created using the standard Hansatome head and in the other eye, the flaps were created using a modified design (zero-compression head). Epithelial defects occurred in 21 eyes in which the standard head was used and in 2 eyes (2.1%) in which the zero-compression head was used. Two patients who had an epithelial defect using the zero-compression head also had an epithelial defect in the eye in which the standard head was used. McNemar’s test is appropriate to analyze such data from matched pairs of subjects
with a dichotomous (yes–no) response. The P value based on McNemar’s test [7] is less than 0.001, and suggests that the modified Hansatome head significantly reduced the occurrence of the epithelial defects. To compare matched pairs with continuous response, the paired t Test [3] or Wilcoxon signed-ranks test [14] is proper. Shea et al. [12] studied the microarchitectural bone adaptations of the concave and convex spinal facets in idiopathic scoliosis. Biopsy specimens of facet pairs at matched anatomic levels were obtained from eight patients. The concave and convex facets were analyzed for bone porosity. The mean porosity (and standard deviation) for the concave and convex facets was 16.5% ± 5.8% and 24.1% ± 6.2%, respectively. The P value based on the paired t Test is less than 0.03, and suggests that the facets on the convex side were significantly more porous than those on the concave side. The third type of matching is the split sample design, in which each individual is divided into two parts and one treatment is randomized to one part, while the other part gets the second treatment. For example, a piece of fabric could be cut into two pieces and each piece is tested with one of two detergents. Symmetrical matching and split samples both result in paired data. With these designs, the study usually has more power to detect differences between the treatment groups, compared to unmatched designs, because the individual variation is reduced. Matching bears some resemblance to randomized trials that stratify by time of entry onto the trial, as cohorts of patients form blocks, and randomization occurs within the block. Often, each block will have the same number of cases (called treated patients) and controls (patients treated with the control group), with obvious modification for more than two treatment groups. The similarity in the previous sentence applies both within blocks and across blocks; that is, often any given block has the same number of patients randomized to each treatment condition (1:1 allocation within each block), and the size of each block will be the same (constant block size). To the extent that time trends render patients within a block more comparable to each other than to patients in other blocks, we have a structure similar to that of the Pauling [8] matched study, except that each matched set, or block, may have more than one case, or treated patient.
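Returning to the binomial analysis of the Pauling data above, a short sketch (Python) that reproduces the reported one-sided P value from Table 2:

    from math import comb

    # Under the null hypothesis, the case's rank among its 10 controls is uniform
    # on 1..11, so P(rank <= 7) = 7/11 for each of the 10 matched sets.
    p = 7 / 11
    n = 10

    point_prob = comb(n, 9) * p**9 * (1 - p)   # exactly 9 of 10 sets with rank <= 7
    more_extreme = p**10                       # all 10 sets with rank <= 7
    p_value = point_prob + more_extreme
    print(round(point_prob, 4), round(more_extreme, 4), round(p_value, 4))
    # 0.0622, 0.0109, 0.0731 -- the values reported in the text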
References

[1] Berger, V.W. & Bears, J. (2003). When can a clinical trial be called ‘randomized’? Vaccine 21, 468–472.
[2] Berger, V.W. & Ivanova, A. (2002). Adaptive tests for ordered categorical data, Journal of Modern Applied Statistical Methods 1(2), 269–280.
[3] Box, G., Hunter, W.G. & Hunter, J.S. (1978). Statistics for Experimenters: An Introduction to Design, Data Analysis, and Model Building, John Wiley & Sons, New York.
[4] Kohnen, T., Terzi, E., Mirshahi, A. & Buhren, J. (2004). Intraindividual comparison of epithelial defects during laser in situ keratomileusis using standard and zero-compression Hansatome microkeratome heads, Journal of Cataract and Refractive Surgery 30(1), 123–126.
[5] Langley, R. (1971). Practical Statistics Simply Explained, Dover Publications, New York.
[6] Marcus, S.M. (2001). A sensitivity analysis for subverting randomization in controlled trials, Statistics in Medicine 20, 545–555.
[7] McNemar, Q. (1947). Note on the sampling error of the difference between correlated proportions or percentages, Psychometrika 12, 153–157.
[8] Pauling, L. (1985). Supplemental ascorbate, vitamin C, in the supportive treatment of cancer, in Data, D.F. Andrews & A.M. Herzberg, eds, Springer-Verlag, New York, pp. 203–207.
[9] Permutt, T. & Berger, V.W. (2000). A new look at rank tests in ordered 2 × k contingency tables, Communications in Statistics – Theory and Methods 29(5 and 6), 989–1003.
[10] Pettitt, A.N. (1985). Mann-Whitney-Wilcoxon statistic, in The Encyclopedia of Statistical Sciences, Vol. 5, S. Kotz & N.L. Johnson, eds, Wiley, New York.
[11] Serfling, R.J. (1988). U-statistics, in The Encyclopedia of Statistical Sciences, Vol. 9, S. Kotz & N.L. Johnson, eds, Wiley, New York.
[12] Shea, K.G., Ford, T., Bloebaum, R.D., D’Astous, J. & King, H. (2004). A comparison of the microarchitectural bone adaptations of the concave and convex thoracic spinal facets in idiopathic scoliosis, The Journal of Bone and Joint Surgery (American Volume) 86-A(5), 1000–1006.
[13] Stevenson, H.C. & Davis, G. (1994). Impact of culturally sensitive AIDS video education on the AIDS risk knowledge of African American adolescents, AIDS Education and Prevention 6, 40–52.
[14] Wayne, W.D. (1978). Applied Nonparametric Statistics, Houghton Mifflin Company.
VANCE W. BERGER, LI LIU AND CHAU THACH
Mathematical Psychology RICHARD A. CHECHILE Volume 3, pp. 1158–1164 in Encyclopedia of Statistics in Behavioral Science ISBN-13: 978-0-470-86080-9 ISBN-10: 0-470-86080-4 Editors Brian S. Everitt & David C. Howell John Wiley & Sons, Ltd, Chichester, 2005
Mathematical Psychology Mathematical psychology deals with the use of mathematical and computational modeling methods to measure and explain psychological processes. Although probabilistic models dominate the work in this area, mathematical psychology is distinctly different from the statistical analysis of data. While mathematical psychology has a strong interest in the measurement of psychological processes, it does not deal with multiple-item tests such as intelligence tests. The study of multi-item tests and questionnaires is a central focus of psychometrics (see History of Psychometrics; Psychophysical Scaling). Psychometrics has a close linkage with education and clinical psychology–fields concerned with psychological assessment. In contradistinction to psychometrics, mathematical psychology is concerned with measuring and describing the behavioral patterns demonstrated in experimental research. Thus, mathematical psychology has a close tie to experimental psychology–a linkage that is analogous to the connection between theoretical and experimental physics.
The Development of Mathematical Psychology One of the earliest examples of experimental psychology is an 1860 book by Fechner [10] on psychophysics – the science of relating physical dimensions to perceived psychological dimensions. This treatise is also a pioneering example of mathematical psychology. While Fechner’s work was followed by a steady stream of experimental research, there was not a corresponding development for mathematical psychology. Mathematical psychology was virtually nonexistent until after World War II. By that time psychological science had progressed to the point where (a) statistical tools for data analysis were common, (b) rich databases were developed in most areas of psychology that demonstrated regular behavioral patterns, and (c) a wide range of theories had been proposed that were largely verbal descriptions. During World War II, a number of psychologists worked with engineers, physicists, and mathematicians. This collaboration stimulated
the development of a more rigorous theoretical psychology that employed mathematical methods. That time period also saw the rapid development of new branches of applied mathematics such as control theory, cybernetics, information theory, system theory, game theory, and automata theory. These new mathematical developments would all prove useful in the modeling of psychological processes. By the 1950s, a number of papers with mathematical models were being published regularly in the journal, Psychological Review. Books and monographs about mathematical psychology also began appearing [22]. Regular summer workshops on mathematical behavioral sciences were being conducted at Stanford University. Early in the 1960s, there was an explosion of edited books with high quality papers on mathematical psychology. Eventually, textbooks about mathematical psychology were published [7, 14, 15, 27, 33]. These texts helped to introduce the subject of mathematical psychology into the graduate training programs at a number of universities. In 1964, the Journal of Mathematical Psychology was launched with an editorial board of Richard Atkinson, Robert Bush, Clyde Coombs, William Estes, Duncan Luce, William McGill, George Miller, and Patrick Suppes. The first conference on mathematical psychology was held in 1968. The next eight meetings were informally organized by the interested parties, but in 1976, an official professional society was created – the Society for Mathematical Psychology. The officers of this organization (a) decide the location and support the arrangements for an annual conference, (b) select the editor and recommend policies for the Journal of Mathematical Psychology, (c) recognize new researchers in the field by offering a ‘young investigator’ award, and (d) recognize especially important new work by awarding an ‘outstanding paper’ prize. Outside North America, the subfield of mathematical psychology developed as well. In 1965, the British Journal of Statistical Psychology changed its name to the British Journal of Mathematical and Statistical Psychology in order to include mathematical psychology papers along with the more traditional test theory and psychometric papers. In 1967, the journal Mathematical Social Sciences was created with an editorial board containing many European social scientists as well as North Americans. In 1971, a group of European mathematical psychologists held a conference
in Paris. The group has come to be called the European Mathematical Psychology Group. This society also has an annual meeting. A number of edited books have emerged from the papers presented at this conference series. In 1989, Australian and Asian mathematical psychologists created their own series of research meetings (i.e., the Australasian Mathematical Psychology Conference).
Research in Mathematical Psychology Mathematical psychology has come to be a broad topic reflecting the utilization of mathematical models in any area of psychology. Because of this breadth, it has not been possible to capture the content of mathematical psychology with any single volume. Consequently, mathematical psychology research is perhaps best explained with selected examples. The following examples were chosen to reflect a range of modeling techniques and research areas.
Signal Detection Theory Signal detection theory (SDT) is a well-known model that has become a standard tool used by experimental psychologists. Excellent sources of information about this approach can be found in [11, 34]. The initial problem that motivated the development of SDT was the measurement of the strength of a sensory stimulus and the control of decision biases that affect detection processes. On each trial of a typical experiment, the subject is presented with either a faint stimulus or no stimulus. If the subject adopts a lenient standard for claiming that the stimulus is present, then the subject is likely to be correct when the stimulus is present. However, such a lenient standard for decision making means that the subject is also likely to have many false alarms, that is, claiming a stimulus is present when it is absent. Conversely, if a subject adopts a strict standard for a detection decision, then there will be few false alarms, but at the cost of failing on many occasions to detect the stimulus. Although SDT developed in the context of sensory psychophysics, the model has also been extensively used in memory and information processing research because the procedure of providing either stimulus-present or stimulus-absent trials is common in many research areas.
In what might be called the standard form of SDT, it is assumed that there are two underlying normal distributions on a psychological strength axis. One distribution represents the case when the stimulus is absent, and the other distribution is for the case when the stimulus is present. The stimulus-present distribution is shifted to the right on the strength axis relative to the stimulus-absent distribution. Given data from at least two experimental conditions where different decision criteria are used, it is possible to estimate the parameter d′, which is defined as the separation between the two normal distributions divided by the standard deviation of the stimulus-absent distribution. It is also possible to estimate the ratio of the standard deviations of the two distributions.
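In the equal-variance special case, d′ can be computed directly from a hit rate and a false-alarm rate as the difference of their normal-deviate transforms; a minimal sketch (Python; the rates are invented for illustration):

    from statistics import NormalDist

    z = NormalDist().inv_cdf        # inverse of the standard normal distribution function

    def d_prime(hit_rate, false_alarm_rate):
        """Equal-variance sensitivity index: separation of the two distributions
        in units of their common standard deviation."""
        return z(hit_rate) - z(false_alarm_rate)

    # Illustrative rates from a lenient and a strict decision criterion
    print(round(d_prime(0.84, 0.31), 2))   # lenient criterion
    print(round(d_prime(0.60, 0.12), 2))   # strict criterion: d' is roughly unchanged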
Multinomial Process Tree Models The data collected in a signal detection experiment are usually categorical, that is, the hit rates and false alarm rates under conditions that have different decision criteria. In fact, many other experimental tasks also consist of measuring proportions in various response categories. A number of mathematical psychologists prefer to model these proportions directly in terms of underlying psychological processes. The categorical information is referred to as multinomial data. In this area of research, the mathematical psychologist generates a detailed probability description as to how underlying psychological processes result in the various proportions for the observed multinomial data. Most researchers develop a probability tree representation where the branches of the tree correspond to the probabilities of the latent psychological processes. The leaves of the tree are the observed categories in an experiment. These models have thus come to be called multinomial processing tree (MPT) models. An extensive review of many of MPT models in psychological research can be found in [3]. Given experimental data, that is, observed proportions for the various categories for responding, the latent psychological parameters of interest can be estimated. General methods for estimating parameters for this class of models are specified in [5, 12]. The goal of MPT modeling is to obtain measures of the latent psychological parameters. For example, a method was developed to use MPT models to obtain separate
Mathematical Psychology measures of memory storage and retrieval for a task that has an initial free recall test that is followed by a series of forced-choice recognition tests [5, 6]. MPT models are tightly linked to a specific experimental task. The mathematical psychologist often must invent an experimental task that provides a means of measuring the underlying psychological processes of interest.
Information Processing and Reaction Time Models A number of mathematical psychologists model processes in the areas of psychophysics, information processing (see Information Theory), and cognition; see [9, 16, 19, 30]. These researchers have typically been interested in explaining properties of dependent measures from experiments, such as the participants’ response time, percentage correct, or the trade-off of time with task accuracy. For example, many experiments require the person to make a choice response, that is, the individual must decide if a stimulus is the same or different from some standard. The statistical properties of the time distribution of ‘same’ responses are not equivalent to those of the ‘different’ responses. Mathematical psychologists have been interested in accounting for the entire response-time distribution for each type of response. One successful model for response time is the relative judgment random walk model [17]. Random walk models are represented by a state variable (a real-valued number that reflects the current state of evidence accumulation). The state variable is contained between two fixed boundaries. The state variable for the next time increment is the same as the prior value plus or minus a random step amount. Eventually, the random walk of the state variable terminates when the variable reaches either one of the two boundaries. The random walk model results in a host of predictions for the same-different response times.
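A minimal sketch (Python; the drift, step size, and boundary values are arbitrary) of a random walk decision process of the kind just described, returning a choice and a response time on each simulated trial:

    import random

    def random_walk_trial(drift=0.02, step=1.0, upper=20.0, lower=-20.0):
        """Accumulate evidence until one boundary is reached; return (response, time)."""
        x, t = 0.0, 0
        while lower < x < upper:
            # the state variable moves up or down by a random step amount
            x += step if random.random() < 0.5 + drift else -step
            t += 1
        return ("same" if x >= upper else "different", t)

    trials = [random_walk_trial() for _ in range(10000)]
    same = [t for resp, t in trials if resp == "same"]
    diff = [t for resp, t in trials if resp == "different"]
    # The two response-time distributions need not coincide, as noted in the text
    print(len(same), sum(same) / len(same))
    print(len(diff), sum(diff) / len(diff))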
Axiomatic Measurement Theories and Functional Equations Research in this area does not deal with specific stochastic models of psychological processes but rather focuses on the necessary and sufficient conditions for a general type of measurement scale or a form of a functional relationship among variables.
Typically, a minimal set of principles or axioms is considered, and the consequences of the assumed axioms are derived. If the resulting theory is not supported empirically, then at least one of the axioms must be in error. Furthermore, this approach provides experimental psychologists with an opportunity to test the theory by assessing the validity of essential axioms. A well-known example of the axiomatic approach is the choice axiom [18]. To illustrate the axiom, let us consider an example of establishing a preference ordering of a set of candidates. For each pair of candidates, there will be choice probability Pij that denotes the probability that candidate i is preferred over candidate j . The choice axiom states that the probability that any candidate is preferred is statistically independent of the removal of one of the candidates. In essence, this principle is the same as a long-held principle in voting theory called the independence of irrelevant alternatives, that is, the relative strength between two candidates should not change if another candidate is disqualified. Luce shows that this axiom implies that there must exist numbers vi associated with each alternative, and that the choice probability is vi /(vi + vj ). The result of this formal analysis also establishes that we cannot characterize the choice probabilities in terms of a ratio of underlying strengths if we can show empirically that for some pair of candidates the relative preference is altered by the removal of some other candidate. A similar approach to research in mathematical psychology can be found in the utilization of functional equation theory. D’Alembert, an eighteenth century mathematician, initiated this branch of mathematics, and it has proved to be important for psychological theorizing. Perhaps the most famous illustration of functional equation analysis from mathematics is the Cauchy equation. It can be proved that the only continuous function that satisfies the constraint that f (x + y) = f (x) + f (y) where x and y are nonnegative is f (x) = cx, where c is a constant. Functional equation analysis can be used to determine the form of a mathematical relationship between critical variables. Theorems are developed that demonstrate that only one mathematical function satisfies a given set of requirements without the necessity of curve fitting or statistical analysis. To illustrate this approach, let us consider the analysis provided by Luce [20] in the
area of psychophysics. Consider two stimuli of intensities I and I′, respectively, and let f(I) and f(I′) be the corresponding psychological perceptions of the stimuli. It can be established that if I′/I = c, where c is a constant, and if f(I′)/f(I) = g(c), then f(I) must be a power function, that is, f(I) = AI^β, where A and β are positive constants that are independent of I.
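A small numerical sketch (Python; the strength values are arbitrary) of Luce’s ratio rule and the independence-of-irrelevant-alternatives property it implies:

    # Choice probabilities from underlying strengths v_i: P(i | set) = v_i / sum of v_j in the set
    v = {"A": 4.0, "B": 2.0, "C": 1.0}

    def choice_prob(i, alternatives):
        return v[i] / sum(v[j] for j in alternatives)

    full = ["A", "B", "C"]
    reduced = ["A", "B"]                      # candidate C removed

    ratio_full = choice_prob("A", full) / choice_prob("B", full)
    ratio_reduced = choice_prob("A", reduced) / choice_prob("B", reduced)
    print(ratio_full, ratio_reduced)          # both equal v_A / v_B = 2.0: the relative
                                              # preference is unchanged by removing C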
Judgment and Decision-making Models Economics and mathematics have developed rational models for decision making. However, there is considerable evidence from experimental psychology that people often do not behave according to these ‘normative’ rational models [13]. For example, let us consider the psychological worth of a gamble. Normative economic theory posits that a risky commodity (e.g., a gamble) is worth the expected utility of the gamble. Utility theory was first formulated by the mathematician Daniel Bernoulli in 1738 as a solution to a gambling paradox. Prior to that time, the worth of a gamble was the expected value for the gamble, for example, given a gamble that has a 0.7 probability for a gain of $10 and a probability of 0.3 for a loss of $10, the expected value is 0.7($10) + (0.3)(−$10) = $4. However, Bernoulli considered a complex gamble based on a bet doubling system and showed that the gamble had infinite expected value. Yet, individuals did not perceive that gamble as having infinite value. To resolve this discrepancy, Bernoulli replaced the monetary values with subjective worth numbers for the monetary outcomes – these numbers were called utility values. Thus, for the above example, the subjective utility is 0.7[U ($10)] + (0.3)[U (−$10)], where the u function is nonlinear, monotonic increasing function of dollar value. General axioms for expected utility theory have been developed [31], and that framework has become a central theoretical perspective in economics. However, experimental psychologists provided numerous demonstrations that this theory is not an accurate descriptive theory. For example, the Allais paradox illustrates a problem with expected utility theory [7, 32]. Subjects greatly prefer a certain $2400 to a gamble where there is a 0.33 probability for $2500, a 0.66 probability for $2400, and a 0.01 chance for $0. Thus, from expected utility theory, it follows that U ($2400) > 0.33[U ($2500)] +
0.66[U ($2400)] + 0.01[U ($0)], which is equivalent to 0.34[U ($2400)] > 0.33[U ($2500)] + 0.01[U ($0)]. However, these same subjects prefer a lottery with a 0.33 chance for $2500 to a lottery that has a 0.34 chance for $2400. This second preference implies according to expected utility theory that 0.33[U ($2500)] + 0.67[U ($0)] > 0.34[U ($2400)] + 0.66[U ($0)], which is equivalent to 0.34[U ($2400)] < 0.33[U ($2500)] + 0.01[U ($0)]. Notice that preferences for the gambles are inconsistent, that is, from the first preference ordering we deduced that 0.34[U ($2400)] > 0.33[U ($2500)] + 0.01[U ($0)], but from the second preference ordering we obtained that 0.34[U ($2400)] < 0.33[U ($2500)] + 0.01[U ($0)]. Clearly, there is a violation of a prediction from expected utility theory. In an effort to achieve more realistic theories for the perception of the utility of gambles, mathematical psychologists and theoretical economists formulated alternative models. For example, Luce [21] extensively explores alternative utility models in an effort to find a model that more accurately describes the behavior of individuals.
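The inconsistency can also be demonstrated computationally: for any utility function whatsoever, the expected-utility difference between the first pair of gambles is exactly the negative of the difference between the second pair, so no assignment of utilities can reproduce both observed preferences. A minimal sketch follows; the square-root utility is an arbitrary illustration, not a claim about actual preferences:

```python
# Minimal sketch of the Allais paradox: under expected utility theory the two
# observed preference patterns imply contradictory inequalities for any U.
def U(x):
    return x ** 0.5            # arbitrary illustrative utility function

def expected_utility(gamble):
    return sum(p * U(x) for p, x in gamble)

A = [(1.00, 2400)]                                    # the sure thing
B = [(0.33, 2500), (0.66, 2400), (0.01, 0)]           # risky alternative to A
C = [(0.33, 2500), (0.67, 0)]
D = [(0.34, 2400), (0.66, 0)]

diff1 = expected_utility(A) - expected_utility(B)     # preferring A over B requires diff1 > 0
diff2 = expected_utility(D) - expected_utility(C)     # preferring C over D requires diff2 < 0
# The two differences are identical (up to floating-point rounding), so both
# observed preferences cannot hold under expected utility theory.
print(round(diff1, 10), round(diff2, 10))
```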
Models of Memory The modeling of memory has been an active area of research; see [26]. One of the best-known memory models is the Atkinson and Shiffrin multiple-store model [2]. This model deals with both short-term and long-term memory with a system of three memory stores that can be used in a wide variety of ways. The memory stores are a very short-term sensory register, a short-term store, and a permanent memory store. This model stimulated considerable experimental and theoretical research. The Atkinson and Shiffrin model did not carefully distinguish between recall and recognition behavior measures. These measures are quite different. In subsequent memory research, there has been more attention given to the similarities and differences in recall and recognition measures. In particular, a considerable amount of interest has centered on a class of recognition memory models that has come to be called ‘global matching models’. Models in this class differ widely as to the basic representation of information. One type of representation takes a position like that of the Estes array model [8]. For this model, each memory is considered as a separate N -dimensional vector of attributes. A recognition
memory probe activates all of the existing items in the entire memory system by an amount that is dependent on the similarity between the probe and the memory representation. In the Estes array model for classification and recognition, the similarity function between two N -dimensional memory vectors is defined by the expression ts N−k , where t represents the similarity value for the match of k attributes and s is a value between 0 and 1 that is associated with the reduced activation caused by a feature mismatch. The recognition decision is a function of the total similarity produced by the recognition memory probe with all the items in the memory system. Another model in the global matching class is the Murdock TODAM model [24, 25]. This model uses a distributed representation of information in the memory system. It is assumed that memory is composed of a single vector. The TODAM model does not have separate vectors for various items in memory. Recognition is based on a vector function that depends only on the recognition probe and current state of the memory vector. As more items are added to memory, previous item information may be lost. For TODAM and many other memory models, Monte Carlo (i.e., random sampling) methods are used to generate the model predictions for various experimental conditions. In general, these models have a number of parameters – often more parameters than are identifiable in a single condition of an experiment. However, given values for the model parameters, it would be possible to account for the data obtained in that condition as well as many other experimental conditions. In fact, a successful model is usually applied across a wide range of experiments and conditions without major changes in the values for the parameters.
Neural Network Models

Many of the mathematical models of psychological processes fall into the category of neural networks. There is also considerable interest in neural network models outside of psychology. For example, physicists modeling materials known as spin glasses have found that neural network models, which were originally developed to explain learning, can also describe the behavior of these substances [4]. In general, there is a distinction between real and artificial neural networks. Real neural networks deal with brain structures such as the hippocampus or the visual cortex [28]. For real neural networks, the researcher focuses on the functioning of a set of cells or a brain structure as opposed to the behavior of the whole animal. However, artificial neural networks are algorithms mainly for learning and memory, and are more likely to be linked to observable behavior. Artificial neural networks are distributed computing systems that pass information throughout a network (see [1, 23, 29]). As the network 'experiences' new input patterns and receives feedback from its 'behavior', there are changes in the properties of links and nodes in the network, which in turn affect the behavior of the network. Artificial neural networks have been valuable as models for learning and pattern recognition. There are many possible arrangements for artificial neural networks.

Summary

Mathematical psychology is the branch of theoretical psychology that uses mathematics to measure and describe the regularities that are obtained from psychological experiments. Some research in this area (i.e., signal detection theory and multinomial processing tree models) is focused on extracting measures of latent psychological processes from a host of observable measures. Other models in mathematical psychology are focused on the description of the changes in psychological processes across experimental conditions. Examples of this approach would include models of information processing, reaction time, learning, memory, and decision making. The mathematical tools used in mathematical psychology are diverse and reflect the wide range of psychological problems under investigation.

References
[1] Anderson, J.A. (1995). Practical Neural Modeling, MIT Press, Cambridge.
[2] Atkinson, R.C. & Shiffrin, R. (1968). Human memory: a proposed system and its control processes, in The Psychology of Learning and Motivation: Advances in Research and Theory, Vol. 2, K.W. Spence & J.T. Spence, eds, Academic Press, New York, pp. 89–195.
[3] Batchelder, W.H. & Riefer, D.M. (1999). Theoretical and empirical review of multinomial process tree modeling, Psychonomic Bulletin & Review 6, 57–86.
[4] Bovier, A. & Picco, P., eds (1998). Mathematical Aspects of Spin Glasses and Neural Networks, Birkhäuser, Boston.
[5] Chechile, R.A. (1998). A new method for estimating model parameters for multinomial data, Journal of Mathematical Psychology 42, 432–471.
[6] Chechile, R.A. & Soraci, S.A. (1999). Evidence for a multiple-process account of the generation effect, Memory 7, 483–508.
[7] Coombs, C.H., Dawes, R.M. & Tversky, A. (1970). Mathematical Psychology: An Elementary Introduction, Prentice-Hall, Englewood Cliffs.
[8] Estes, W.K. (1994). Classification and Cognition, Oxford University Press, New York.
[9] Falmagne, J.-C. (1985). Elements of Psychophysical Theory, Oxford University Press, Oxford.
[10] Fechner, G.T. (1860). Elemente der Psychophysik, Breitkopf & Härtel, Leipzig.
[11] Green, D.M. & Swets, J.A. (1966). Signal Detection Theory and Psychophysics, Wiley, New York.
[12] Hu, X. & Phillips, G.A. (1999). GPT.EXE: a powerful tool for the visualization and analysis of general processing tree models, Behavior Research Methods, Instruments & Computers 31, 220–234.
[13] Kahneman, D., Slovic, P. & Tversky, A. (1982). Judgment Under Uncertainty: Heuristics and Biases, Cambridge University Press, Cambridge.
[14] Laming, D. (1973). Mathematical Psychology, Academic Press, London.
[15] Levine, G. & Burke, C.J. (1972). Mathematical Model Techniques for Learning Theories, Academic Press, New York.
[16] Link, S.W. (1992). The Wave Theory of Difference and Similarity, Erlbaum, Hillsdale.
[17] Link, S.W. & Heath, R.A. (1975). A sequential theory of psychological discrimination, Psychometrika 40, 77–105.
[18] Luce, R.D. (1959). Individual Choice Behavior: A Theoretical Analysis, Wiley, New York.
[19] Luce, R.D. (1986). Response Times: Their Role in Inferring Elementary Mental Organization, Oxford University Press, New York.
[20] Luce, R.D. (1993). Sound & Hearing: A Conceptual Introduction, Erlbaum, Hillsdale.
[21] Luce, R.D. (2000). Utility of Gains and Losses: Measurement-Theoretical and Experimental Approaches, Erlbaum, London.
[22] Luce, R.D., Bush, R.R. & Galanter, E., eds (1963, 1965). Handbook of Mathematical Psychology, Vols. I, II and III, Wiley, New York.
[23] McClelland, J.L. & Rumelhart, D.E. (1986). Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 2, MIT Press, Cambridge.
[24] Murdock, B.B. (1982). A theory for the storage and retrieval of item and associative information, Psychological Review 89, 609–626.
[25] Murdock, B.B. (1993). TODAM2: a model for the storage and retrieval of item, associative, and serial-order information, Psychological Review 100, 183–203.
[26] Raaijmakers, J.G.W. & Shiffrin, R.M. (2002). Models of memory, in Stevens' Handbook of Experimental Psychology, Vol. 2, H. Pashler & D. Medin, eds, Wiley, New York, pp. 43–76.
[27] Restle, F. & Greeno, J.G. (1970). Introduction to Mathematical Psychology, Addison-Wesley, Reading.
[28] Rolls, E.T. & Treves, A. (1998). Neural Networks and Brain Function, Oxford University Press, Oxford.
[29] Rumelhart, D.E. & McClelland, J.L. (1986). Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1, MIT Press, Cambridge.
[30] Townsend, J.T. & Ashby, F.G. (1983). Stochastic Modeling of Elementary Psychological Processes, Cambridge University Press, Cambridge.
[31] von Neumann, J. & Morgenstern, O. (1947). Theory of Games and Economic Behavior, Princeton University Press, Princeton.
[32] von Winterfeldt, D. & Edwards, W. (1986). Decision Analysis and Behavioral Research, Cambridge University Press, Cambridge.
[33] Wickens, T.D. (1982). Models for Behavior: Stochastic Processes in Psychology, W. H. Freeman, San Francisco.
[34] Wickens, T.D. (2002). Elementary Signal Detection Theory, Oxford University Press, Oxford.
RICHARD A. CHECHILE
Maximum Likelihood Estimation
CRAIG K. ENDERS
Volume 3, pp. 1164–1170
in
Encyclopedia of Statistics in Behavioral Science
ISBN-13: 978-0-470-86080-9
ISBN-10: 0-470-86080-4
Editors
Brian S. Everitt & David C. Howell
John Wiley & Sons, Ltd, Chichester, 2005
Maximum Likelihood Estimation

Many advanced statistical models (e.g., structural equation models) rely on maximum likelihood (ML) estimation. In this entry, we will explore the basic principles of ML estimation using a small data set consisting of 10 scores from the Beck Depression Inventory (BDI) and a measure of perceived social support (PSS). ML parameter estimates are desirable because they are both consistent (i.e., the estimate approaches the population parameter as sample size increases) and efficient (i.e., have the lowest possible variance, or sampling fluctuation), but these characteristics are asymptotic (i.e., true in large samples). Thus, although these data are useful for pedagogical purposes, it would normally be unwise to use ML with such a small N. The data vectors are as follows.

BDI = [5, 33, 17, 21, 13, 17, 5, 13, 17, 13],  PSS = [17, 13, 13, 5, 17, 17, 19, 15, 3, 11].   (1)

The goal of ML estimation is to identify the population parameters (e.g., a mean, regression coefficient, etc.) most likely to have produced the sample data (see Catalogue of Probability Density Functions). This is accomplished by computing a value called the likelihood that summarizes the fit of the data to a particular parameter estimate. Likelihood is conceptually similar to probability, although strictly speaking they are not the same. To determine how 'likely' the sample data are, we must first make an assumption about the population score distribution. The normal distribution is frequently used for this purpose. The shape of a normal curve is described by a complicated formula called a probability density function (PDF). The PDF describes the relationship between a set of scores (on the horizontal axis) and the relative probability of observing a given score (on the vertical axis). Thus, the height of the normal curve at a given point along the x-axis provides information about the relative frequency of that score in a normally distributed population with a given mean and variance, µ and σ². The univariate normal PDF is

f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-[(x-\mu)/\sigma]^2/2},   (2)
and is composed of two parts: the term in the exponent is the squared standardized distance from the mean, also known as Mahalanobis distance, and the term preceding the exponent is a scaling factor that makes the area under the curve equal to one. The normal PDF plays a key role in computing the likelihood value for a sample. To illustrate, suppose it was known that the BDI had a mean of 20 and standard deviation of 5 in a particular population. The likelihood associated with a BDI score of 21 is computed by substituting µ = 20, σ = 5, and xi = 21 into (2) as follows.

L_i = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-[(x_i-\mu)/\sigma]^2/2} = \frac{1}{\sqrt{2(3.14)(5^2)}}\, e^{-[(21-20)/5]^2/2} = .078.   (3)

The resulting value, .078, can be interpreted as the relative probability of xi = 21 in a normally distributed population with µ = 20 and σ = 5. Graphically, .078 represents the height of this normal distribution at a value of 21, as seen in Figure 1. Extending this concept, the likelihood for every case can be computed in a similar fashion, the results of which are displayed in Table 1. For comparative purposes, notice that the likelihood associated with a BDI score of 5 is approximately .001. This tells us that the relative probability of xi = 5 is much lower than that of xi = 21 (.001 versus .078, respectively), owing to the fact that the latter score is more likely to have occurred from a normally distributed population with µ = 20. By extension, this also illustrates that the likelihood associated with a given xi will change as the population parameters, µ and σ, change.

Table 1  Likelihood and log likelihood values for the hypothetical sample

Case   BDI   Likelihood (Li)   Log Li
1        5   .001              −7.028
2       33   .003              −5.908
3       17   .067              −2.708
4       21   .078              −2.548
5       13   .030              −3.508
6       17   .067              −2.708
7        5   .001              −7.028
8       13   .030              −3.508
9       17   .067              −2.708
10      13   .030              −3.508
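A minimal Python sketch of these calculations (using scipy's normal density rather than hand computation; the data and the assumed µ = 20 and σ = 5 come from the text) reproduces the likelihood and log likelihood columns of Table 1:

```python
import numpy as np
from scipy.stats import norm

bdi = np.array([5, 33, 17, 21, 13, 17, 5, 13, 17, 13])
mu, sigma = 20.0, 5.0                                   # assumed population parameters

likelihoods = norm.pdf(bdi, loc=mu, scale=sigma)        # equation (2) evaluated at each score
log_likelihoods = norm.logpdf(bdi, loc=mu, scale=sigma)

for score, li, log_li in zip(bdi, likelihoods, log_likelihoods):
    print(int(score), round(float(li), 3), round(float(log_li), 3))   # matches Table 1

# Summing the log values gives the sample log likelihood (approximately -41.16),
# which is discussed below.
print(round(float(log_likelihoods.sum()), 2))
```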
Figure 1  Graph of the univariate normal PDF. The height of the curve at xi = 21 represents the relative probability (i.e., likelihood) of this score in a normally distributed population with µ = 20 and σ = 5
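A figure of this kind is straightforward to generate. The following minimal matplotlib sketch (the plotting choices are assumptions, not the original figure code) draws the normal density with µ = 20 and σ = 5 and marks its height at a BDI score of 21:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

mu, sigma = 20.0, 5.0
x = np.linspace(0, 40, 400)

plt.plot(x, norm.pdf(x, mu, sigma))                                # the univariate normal PDF
plt.vlines(21, 0, norm.pdf(21, mu, sigma), linestyles="dashed")    # height = likelihood of 21
plt.xlabel("BDI score")
plt.ylabel("Likelihood")
plt.show()
```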
Having established the likelihood for individual xi values, the likelihood for the sample is obtained via multiplication. From probability theory, the joint probability for a set of independent events is obtained by multiplying individual probabilities (e.g., the probability of jointly observing two heads from independent coin tosses is (.50)(.50) = .25). Strictly speaking, likelihood values are not probabilities; the probability associated with any single score from a continuous PDF is zero. Nevertheless, the likelihood value for the entire sample is defined as the product of individual likelihood values as follows:

L = \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-[(x_i-\mu)/\sigma]^2/2},   (4)

where ∏ is the product operator. Carrying out this computation using the individual Li values from Table 1, the likelihood value for the sample is approximately .000000000000000001327 – a very small number! Because L becomes exceedingly small as sample size increases, the logarithm of (4) can be used to make the problem more computationally attractive. Recall that one of the rules of logarithms is log(ab) = log(a) + log(b). Applying logarithms to (4) gives the log likelihood, which is an additive, rather than multiplicative, model.

\log L = \sum_{i=1}^{N} \log\left( \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-[(x_i-\mu)/\sigma]^2/2} \right).   (5)

Individual log Li values are shown in Table 1, and the log likelihood value for the sample is the sum
of individual log Li, approximately −41.16. The log likelihood value summarizes the likelihood that this sample of 10 cases originated from a normally distributed population with parameters µ = 20 and σ = 5. As can be inferred from the individual log Li values in the table, higher values (i.e., values closer to zero) are indicative of a higher relative probability. As we shall see, the value of the log likelihood can be used to ascertain the 'fit' of a sample to a particular set of population parameters. Thus far, we have worked under a scenario where population parameters were known. More typically, population parameters are unknown quantities that must be estimated from the data. The process of identifying the unknown population quantities involves 'trying out' different parameter values and calculating the log likelihood for each. The final maximum likelihood estimate (MLE) is the parameter value that produced the highest log likelihood value. To illustrate, suppose we were to estimate the BDI population mean from the data. For simplicity, assume the population standard deviation is σ = 5. The sample log likelihoods for µ values ranging between 10 and 20 are given in Table 2.

Table 2  Log likelihood values for different estimates of the population mean

µ estimate   Log likelihood
10           −42.764
11           −40.804
12           −39.244
13           −38.084
14           −37.324
15           −36.964
16           −37.004
17           −37.444
18           −38.284
19           −39.524
20           −41.164

Figure 2  The maximum of the log likelihood function is found at µ = 15.4

Beginning with µ = 10, the log likelihood steadily increases (i.e., improves) until µ reaches a value of 15, after which the log likelihood decreases (i.e., gets worse). Thus, it appears that the MLE of µ is near a value of 15. The relationship between the population parameters and the log likelihood value, known as the log likelihood function, can also be depicted graphically. As seen in Figure 2, the height of the log likelihood
function reaches a maximum between µ = 15 and 16. More precisely, the function is maximized at µ = 15.40, where the log likelihood takes on a value of −36.9317645. To confirm this value is indeed the maximum, note that the log likelihood values for µ = 15.399 and 15.401 are −36.9317647 and −36.9317647, respectively, both of which are smaller (i.e., worse) than the log likelihood produced by µ = 15.40. Thus, the MLE of the BDI population mean is 15.40, as it is the value that maximizes the log likelihood function. Said another way, µ = 15.4 is the population parameter most likely to have produced this sample of 10 cases. In practice, the process of 'trying out' different parameter values en route to maximizing the log likelihood function is aided by calculus. In calculus, a slope, or rate of change, of a function at a fixed point is known as a derivative. To illustrate, tangent lines are displayed at three points on the log likelihood
function in Figure 3. The slope of these lines is the first derivative of the function with respect to µ. Notice that the derivative (i.e., slope) of the function at µ = 6 is positive, while the derivative at µ = 24 is negative. You have probably already surmised that the tangent line has a slope of zero at the value of µ associated with the function’s maximum. Thus, the MLE can be identified via calculus by setting the first derivative to zero and solving for corresponding value of µ on the horizontal axis. Identifying the point on the log likelihood function where the first derivative equals zero does not ensure that we have located a maximum. For example, imagine a U-shaped log likelihood function where the first derivative is zero at ‘the bottom of the valley’ rather than ‘at the top of the hill’. Fortunately, verifying that the log likelihood function is at its maximum, rather than minimum, can be accomplished by checking the sign of the second derivative. Because second derivatives also play an important role in estimating the variance of an MLE, a brief digression into calculus is warranted. Suppose we were to compute the first derivative for every point on the log likelihood function. For µ < 15.4, the first derivatives are positive, but decrease in value as µ approaches 15.4 (i.e., the slopes of the tangent lines become increasingly flat close to the maximum). For µ > 15.4, the derivatives become increasingly negative (i.e., more steep) as you move away from the maximum. Now imagine creating a new graph that displays the value of the first derivative (on the vertical axis) for every estimate of µ on the horizontal axis. Such a graph is called a derivative function, and second derivatives are defined as the slope of a line tangent to the
Figure 3  First derivatives are slopes of lines tangent to the log likelihood function. The derivative associated with µ = 6 is positive, while the derivative associated with µ = 24 is negative. The derivative is zero at the function's maximum, µ = 15.4
derivative function (i.e., the derivative of a derivative, or rate of change in the slopes). A log likelihood function that is concave down (e.g., Figure 3) would produce a derivative function that begins in the upper left quadrant of the coordinate system (the derivatives are large positive numbers at µ < 15.4), crosses the horizontal axis at the function’s maximum value, and continues with a downward slope into the lower right quadrant of the coordinate system (the derivatives become increasingly negative at µ > 15.4). Further, the second derivative (i.e., the slope of a tangent line) at any point along such a function would be negative. In contrast, the derivative function for a U-shaped log likelihood would stretch from the upper right quadrant to the lower left quadrant, and would produce a positive second derivative. Thus, a negative second derivative verifies that the log likelihood function is at a maximum. Identifying the MLE is important, but we also want to know how much uncertainty is associated with an estimate – this is accomplished by examining the parameter’s standard error. The variance of an MLE, the square root of which is its standard error, is also a function of the second derivatives. In contrast to Figure 2, imagine a log likelihood function that is very flat. The first derivatives, or slopes, would change very little from one value of µ to the next. One way to think about second derivatives is that they quantify the rate of change in the first derivatives. A high rate of change in the first derivatives (e.g., Figure 2) reflects greater certainty about the parameter estimate (and thus a lower standard error), while
derivatives that change very little (e.g., a relatively flat log likelihood function) would produce a larger standard error. When estimating multiple parameters, the matrix of second derivatives, called the information matrix, is used to produce a covariance matrix (see Correlation and Covariance Matrices) of the parameters (different from a covariance matrix of scores), the diagonal of which reflects the variance of the MLEs. The standard errors are found by taking the square root of the diagonal elements in the parameter covariance matrix. Having established some basic concepts, let us examine a slightly more complicated scenario involving multiple parameters. To illustrate, consider the regression of BDI scores on PSS. There are now two parameters of interest, a regression intercept, β0, and slope, β1 (see Multiple Linear Regression). The log likelihood given in (5) is altered by replacing µ with the conditional mean from the regression equation (i.e., µ = β0 + β1 xi).

\log L = \sum_{i=1}^{N} \log\left( \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-[(y_i - \beta_0 - \beta_1 x_i)/\sigma]^2/2} \right).   (6)
Because there are now two unknowns, estimation involves ‘trying out’ different values for both β0 and β1 . Because the log likelihood changes as a function of two parameters, the log likelihood appears as a three-dimensional surface. In Figure 4, the values of β0 and β1 are displayed on separate axes, and the vertical axis gives the value of the log likelihood for every combination of β0 and β1 . A precise solution
Figure 4  Log likelihood surface for simple regression estimation. ML estimates of β0 and β1 are 23.93 and −.66, respectively
for β0 and β1 relies on the calculus concepts outlined previously, but an approximate maximum can be identified at the intersection of β0 = 25 and β1 = −.60 (the ML estimates obtained from a model-fitting program are β0 = 23.93 and β1 = −.66). Standard errors for the MLEs are obtained as a function of the second derivatives, and can be used to construct single-parameter significance tests of the null hypothesis, βj = 0. Because the asymptotic sampling distribution for an MLE is itself a normal distribution, a z test is constructed by dividing the parameter estimate by its asymptotic standard error (ASE), the square root of the appropriate diagonal element in the parameter covariance matrix.

z = \frac{\hat{\beta} - 0}{ASE(\hat{\beta})}.   (7)
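A minimal Python sketch of this estimation follows; it maximizes the log likelihood in (6) numerically over β0, β1, and σ using the BDI and PSS data, then approximates the ASEs from the inverse Hessian returned by the optimizer (so the standard errors are approximate). The optimizer settings and starting values are arbitrary assumptions, not part of the original analysis.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

bdi = np.array([5, 33, 17, 21, 13, 17, 5, 13, 17, 13])
pss = np.array([17, 13, 13, 5, 17, 17, 19, 15, 3, 11])

def neg_log_lik(params):
    b0, b1, log_sigma = params
    sigma = np.exp(log_sigma)                  # keeps sigma positive
    mu = b0 + b1 * pss                         # conditional mean, as in equation (6)
    return -np.sum(norm.logpdf(bdi, loc=mu, scale=sigma))

fit = minimize(neg_log_lik, x0=[0.0, 0.0, np.log(5.0)], method="BFGS")
b0_hat, b1_hat = fit.x[0], fit.x[1]            # roughly 23.93 and -0.66

ase = np.sqrt(np.diag(fit.hess_inv))           # approximate asymptotic standard errors
z_b1 = b1_hat / ase[1]                         # roughly -.66 / .43 = -1.5, as in equation (7)
print(b0_hat, b1_hat, ase[1], z_b1)
```

Because the regression coefficients enter the likelihood only through the conditional mean of a normal distribution, the ML estimates of β0 and β1 here coincide with the ordinary least squares estimates.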
To illustrate, the ASE for β1 is .429, and the corresponding z ratio is −.656/.429 = −1.528. This z ratio can subsequently be compared to a critical value obtained from the unit normal table. For example, the critical value for a two-tailed test using α = .05 is ±1.96, so β1 is not significantly different from zero. The ASEs can also be used to construct confidence intervals around the MLEs in the usual fashion. A more general strategy for conducting significance tests involves comparing log likelihood values
from two nested models. Models are said to be nested if the parameters of one model (i.e., the ‘reduced’ model) are a subset of the parameters from a second model (i.e., the ‘full’ model). For example, consider a multiple regression analysis with five predictor variables, x1 through x5 . After obtaining MLEs of the regression coefficients, suppose we tested a reduced model involving only x1 and x2 . If the log likelihood changed very little after removing x3 through x5 , we might reasonably conclude that this subset of predictors is doing little to improve the fit of the model – conceptually, this is similar to testing a set of predictors using an F ratio. More formally, the likelihood ratio test is defined as the difference between −2 log L values (sometimes called the deviance) for the full and reduced model. This difference is asymptotically distributed as a chi-square statistic with degrees of freedom equal to the difference in the number of estimated parameters between the two models. A nonsignificant likelihood ratio test indicates that the fit of the reduced model is not significantly worse than that of the full model. It should be noted that the likelihood ratio test is only appropriate for use with nested models, and could not be used to compare models with different sets of predictors, for example. Likelihood-based indices such as the Bayesian Information Criterion (BIC) can
instead be used for this purpose (see Bayesian Statistics). The examples provided thus far have relied on the univariate normal PDF. However, many common statistical models (e.g., structural equation models) are derived under the multivariate normal PDF, a generalization of the univariate normal PDF to k ≥ 2 dimensions. The multivariate normal distribution (see Catalogue of Probability Density Functions) can only be displayed graphically in three dimensions when there are two variables, and in this case the surface of the bivariate normal distribution looks something like a bell-shaped 'mound' (e.g., see [2, p. 152]). The multivariate normal PDF is

f(x) = \frac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}}\, e^{-(x-\mu)' \Sigma^{-1} (x-\mu)/2}.   (8)

In (8), x and µ are vectors of scores and means, respectively, and Σ is the covariance matrix of the scores. Although (8) is expressed using vectors and matrices, the major components of the PDF remain the same; the term in the exponent is a multivariate extension of Mahalanobis distance that takes into account both the variances and covariances of the scores, and the term preceding the exponent is a scaling factor that makes the volume under the density function equal to one (in the multivariate case, probability values are expressed as volumes under the surface). Consistent with the previous discussion, estimation proceeds by 'trying out' different values for µ and Σ in search of estimates that maximize the log likelihood. Note that in some cases the elements of µ and Σ may not be the ultimate parameters of interest but may be functions of model parameters (e.g., Σ may contain covariances predicted by the parameters from a structural equation model, or µ may be a function of regression coefficients from a hierarchical model). In any case, an individual's contribution to the likelihood is obtained by substituting her vector of scores, x, into (8) and solving, given the estimates for µ and Σ. As before, the sample likelihood is the product of (8) over the N cases, and the log likelihood is obtained by summing the logarithm of (8) over the N cases. The estimation of multiple parameters generally requires the use of iterative numerical optimization techniques to maximize the log likelihood. One such
technique, the expectation maximization (EM) algorithm, is discussed in detail in [11], and these algorithms are discussed in more detail elsewhere [1]. Conceptually, the estimation process is much like climbing to the top of a hill (i.e., the log likelihood surface), similar to that shown in Figure 4. The first step of the iterative process involves the specification of initial values for the parameters of interest. These starting values may, in some cases, be provided by the analyst, and in other cases are arbitrary values provided automatically by the model-fitting program. The choice of different starting values is tantamount to starting the climb from different locations on the topography of the log likelihood surface, so a good set of starting values may reduce the number of 'steps' required to reach the maximum. In any case, these initial values are substituted into the multivariate PDF to obtain an initial log likelihood value. In subsequent iterations, adjustments to the parameter estimates are chosen such that the log likelihood consistently improves (this is accomplished using derivatives), eventually reaching its maximum. As the solution approaches its maximum, the log likelihood value will change very little between subsequent iterations, and the process is said to have converged if this change falls below some threshold, or convergence criterion. Before closing, it is important to distinguish between full maximum likelihood (FML) and restricted maximum likelihood (RML), both of which are commonly implemented in model-fitting programs. Notice that the log likelihood obtained from the multivariate normal PDF includes a mean vector, µ, and a covariance matrix, Σ. FML estimates these two sets of parameters simultaneously, but the elements in Σ are not adjusted for the uncertainty associated with the estimation of µ – this is tantamount to computing the variance using N rather than N − 1 in the denominator. As such, FML variance estimates will exhibit some degree of negative bias (i.e., they will tend to be too small), particularly in small samples. RML corrects this problem by removing the parameter estimates associated with µ (e.g., regression coefficients) from the likelihood. Thus, maximizing the RML log likelihood only involves the estimation of parameters associated with Σ (in some contexts referred to as variance components). Point estimates for µ are still produced when using RML, but these parameters are
estimated in a separate step using the RML estimates of the variances and covariances. In practice, parameter estimates obtained from FML and RML tend to be quite similar, perhaps trivially different in many cases. Nevertheless, the distinction between these two methods has important implications for hypothesis testing using the likelihood ratio test. Because parameters associated with µ (e.g., regression coefficients) do not appear in the RML likelihood, the likelihood ratio can only be used to test hypotheses involving variances and covariances. From a substantive standpoint, these tests are often of secondary interest, as our primary hypotheses typically involve means, regression coefficients, etc. Thus, despite the theoretical advantages associated with RML, FML may be preferred in many applied research settings, particularly given that the two approaches tend to produce similar variance estimates.
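The FML 'N rather than N − 1' point made above is easy to verify numerically: maximizing the univariate normal likelihood over both µ and σ² yields the variance estimate with N in the denominator. A minimal sketch using the BDI scores from this entry (the optimizer choice and starting values are arbitrary assumptions):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

bdi = np.array([5, 33, 17, 21, 13, 17, 5, 13, 17, 13])

def neg_log_lik(params):
    mu, log_sigma = params
    return -np.sum(norm.logpdf(bdi, loc=mu, scale=np.exp(log_sigma)))

fit = minimize(neg_log_lik, x0=[np.mean(bdi), np.log(np.std(bdi))], method="Nelder-Mead")
sigma2_ml = np.exp(fit.x[1]) ** 2

print(sigma2_ml)              # approximately np.var(bdi): the ML (FML-style) estimate
print(np.var(bdi))            # divides by N
print(np.var(bdi, ddof=1))    # divides by N - 1 and is therefore larger
```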
In closing, the intent of this manuscript was to provide the reader with a brief overview of the basic principles of ML estimation. Suffice to say, we have only scratched the surface in the short space allotted here, and interested readers are encouraged to consult more detailed sources on the topic [1].
References

[1] Eliason, S.R. (1993). Maximum Likelihood Estimation: Logic and Practice, Sage, Newbury Park.
[2] Johnson, R.A. & Wichern, D.W. (2002). Applied Multivariate Statistical Analysis, 5th Edition, Prentice Hall, Upper Saddle River.
(See also Direct Maximum Likelihood Estimation; Optimization Methods)

CRAIG K. ENDERS
Maximum Likelihood Item Response Theory Estimation
ANDRÉ A. RUPP
Volume 3, pp. 1170–1175
in
Encyclopedia of Statistics in Behavioral Science
ISBN-13: 978-0-470-86080-9
ISBN-10: 0-470-86080-4
Editors
Brian S. Everitt & David C. Howell
John Wiley & Sons, Ltd, Chichester, 2005
Maximum Likelihood Item Response Theory Estimation

Owing to a variety of theoretical and computational advances in the last 15 years, the measurement literature has seen an upsurge in the theoretical development of modeling approaches and the practical development of routines for estimating their constituent parameters [10, 16, 19]. Despite the methodological and technical variety of these approaches, however, several common principles and approaches underlie the parameter estimation processes for them. This entry reviews these principles and approaches, but for more in-depth readings, a consultation of the comprehensive book by Baker [1], the didactic on the expectation maximization (EM) algorithm by Harwell, Baker, and Zwarts [8] and its extension to Bayesian estimation (see Bayesian Statistics; Bayesian Item Response Theory Estimation) by Harwell and Baker [7], the technical articles on estimation with the EM algorithm by Bock and Aitkin [2], Bock and Lieberman [3], Dempster, Laird, and Rubin [5], and Mislevy [11, 12], as well as the review article by Rupp [20], is recommended.
Conceptual Foundations for Parameter Estimation

The process of calibrating statistical models is generally concerned with two different sets of parameters, one set belonging to the assessment items and one set belonging to the examinees that interact with the items. While it is common to assign labels to item parameters such as 'difficulty' and 'guessing', no such meaning is strictly implied by the statistical models, and philosophical considerations regarding parameter interpretation, albeit important, are not the focus here. In this entry, examinees are denoted by i = 1, . . . , I, items are denoted by j = 1, . . . , J, the latent predictor variable (see Latent Variable) is unidimensional and denoted by θ, and the response probability for a correct response to a given item or item category is denoted by P or, more specifically, by Pj(Xij = xij|θ). The following descriptions will be general and, therefore, will leave the specific functional form that separates different latent variable models unspecified. However, as an example, one may want to consider the unidimensional three-parameter logistic model from item response theory (IRT) (see Item Response Theory (IRT) Models for Polytomous Response Data; Item Response Theory (IRT) Models for Rating Scale Data), with the functional form

P_j(X_{ij} = x_{ij} \mid \theta) = \gamma_j + (1 - \gamma_j)\, \frac{\exp[\alpha_j(\theta - \beta_j)]}{1 + \exp[\alpha_j(\theta - \beta_j)]}.   (1)

To better understand common estimation approaches, it is necessary to understand a fundamental assumption that is made by most latent variable models, namely, that of conditional or local independence. This assumption states that the underlying data-generation mechanism for an observed data structure, as formalized by a statistical model, is of that dimensionality d that renders responses to individual items independent of one another for any given person. As a result, the conditional probability of observing a response vector xi for a given examinee can then be expressed as

P(X_i = x_i \mid \theta) = \prod_{j=1}^{J} P_j(X_{ij} = x_{ij} \mid \theta).   (2)

Given that the responses of the examinees are independent of one another, because examinees are typically viewed as randomly and independently sampled from a population of examinees with latent variable distribution g(θ) [9], the conditional probability of observing all response patterns (i.e., the conditional probability of observing the data) can, therefore, be expressed as the double product

P(X = x \mid \theta) = \prod_{i=1}^{I} \prod_{j=1}^{J} P_j(X_{ij} = x_{ij} \mid \theta).   (3)
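A minimal Python sketch of (1)–(3) follows; the item parameter values, latent trait values, and the small response matrix are invented purely for illustration:

```python
import numpy as np

def p_3pl(theta, alpha, beta, gamma):
    """Equation (1): probability of a correct response under the 3PL model."""
    return gamma + (1.0 - gamma) / (1.0 + np.exp(-alpha * (theta - beta)))

# Invented item parameters (discrimination, difficulty, guessing) and responses.
alpha = np.array([1.2, 0.8, 1.5])
beta  = np.array([-0.5, 0.0, 1.0])
gamma = np.array([0.2, 0.25, 0.2])
x     = np.array([[1, 1, 0],           # rows: examinees, columns: items
                  [0, 1, 1]])
theta = np.array([0.3, -0.2])          # provisional latent trait values

# Local independence turns the data probability into a product, equations (2) and (3).
p = p_3pl(theta[:, None], alpha, beta, gamma)                    # I x J matrix of P_j(theta_i)
lik_per_person = np.prod(np.where(x == 1, p, 1 - p), axis=1)     # equation (2)
lik_data = np.prod(lik_per_person)                               # equation (3)
print(lik_per_person, lik_data)
```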
This probability, if thought of as a function of θ, is also known as the likelihood for the data, L(θ|X = x). Under the assumption of θ as a random effect, one can further integrate out θ to obtain the unconditional or marginal probability of observing
the data,

P(X = x) = \prod_{i=1}^{I} \int \left[ \prod_{j=1}^{J} P_j(X_{ij} = x_{ij} \mid \theta_i) \right] g(\theta)\, d\theta,   (4)

also known as the marginal likelihood for the data. Here, g(θ) denotes the probability distribution of θ in the population, which is often assumed to be standard normal but which can, technically, be of any form as long as it has interval-scale support on the real numbers. However, while (3) and (4) are useful for a conceptual understanding of the estimation routines, it is numerically easier to work with their logarithmic counterparts. Hence, one obtains, on the basis of (3), the log-likelihood of the data (log-L),

\log L = \sum_{i=1}^{I} \sum_{j=1}^{J} \log P_j(X_{ij} = x_{ij} \mid \theta)   (5)

and, based on (4), the marginal log-likelihood of the data (log-LM),

\log L_M = \sum_{i=1}^{I} \log \int \left[ \prod_{j=1}^{J} P_j(X_{ij} = x_{ij} \mid \theta) \right] g(\theta)\, d\theta,   (6)

which are theoretical expressions that include latent θ values that need to be estimated for practical implementation.

Estimation of θ Values

Estimating latent θ values amounts to replacing them by manifest values from a finite subset of the real numbers. This requires the establishment of a latent variable metric, and a common latent variable metric is one with a mean of 0 and a standard deviation of 1. Consequently, one obtains the estimated counterparts to (3 and 5), which are the estimated likelihood,

\tilde{P}(X = x \mid \theta) \cong \prod_{i=1}^{I} \prod_{j=1}^{J} \tilde{P}_j(X_{ij} = x_{ij} \mid T)   (7)

and the estimated log-likelihood,

\log \tilde{L} \cong \sum_{i=1}^{I} \sum_{j=1}^{J} \log \tilde{P}_j(X_{ij} = x_{ij} \mid T)   (8)

where the letter T is used to denote that manifest values are used and P̃ indicates that this results in an estimated probability. Similarly, (4 and 6) require the estimation of the distribution g(θ) where, again, a suitable subset of the real numbers is selected (e.g., the interval from −4 to 4 if one assumes that θ ∼ N(0, 1)), along with a number of K evaluation points that are typically selected to be equally spaced in that interval. For each evaluation point, the approximate value of the selected density function is then computed, which can be done for a theoretically selected distribution (e.g., a standard normal distribution) or for an empirically estimated distribution (i.e., one that is estimated from the data). For example, in BILOG-MG [24] and MULTILOG [22], initial density weights, also known as prior density weights, are chosen to start the estimation routine, which then become adjusted or replaced at each iteration cycle by empirically estimated weights, also known as posterior density weights [11]. If one denotes the kth density weight by A(Tk), one thus obtains, as a counterpart to (4), the estimated marginal likelihood,

\tilde{P}(X = x) = \prod_{i=1}^{I} \left[ \sum_{k=1}^{K} \left( \prod_{j=1}^{J} \tilde{P}_j(X_{ij} = x_{ij} \mid T_k) \right) A(T_k) \right]   (9)

and, as a counterpart to (6), the estimated marginal log-likelihood,

\log \tilde{L}_M = \sum_{i=1}^{I} \log \left[ \sum_{k=1}^{K} \left( \prod_{j=1}^{J} \tilde{P}_j(X_{ij} = x_{ij} \mid T_k) \right) A(T_k) \right]   (10)

where, again, the letter T is used to denote that manifest values are used and P̃ indicates that this results in an estimated probability. With these equations at hand, we are now ready to discuss the three most common estimation approaches, Joint Maximum Likelihood (JML), Conditional Maximum Likelihood (CML), and Marginal Maximum Likelihood (MML).
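Before turning to those approaches, here is a minimal Python sketch of the marginalization in (9) and (10); the quadrature points, prior weights, item parameters, and responses are all illustrative assumptions rather than output from any particular program:

```python
import numpy as np
from scipy.stats import norm

# Illustrative 3PL item parameters and binary responses (invented for this sketch).
alpha = np.array([1.2, 0.8, 1.5])
beta  = np.array([-0.5, 0.0, 1.0])
gamma = np.array([0.2, 0.25, 0.2])
x     = np.array([[1, 1, 0],
                  [0, 1, 1],
                  [1, 0, 0]])

# K equally spaced evaluation points T_k on [-4, 4] with standard normal prior
# density weights A(T_k), normalized to sum to one.
T = np.linspace(-4, 4, 21)
A = norm.pdf(T)
A = A / A.sum()

def p_item(theta):
    """3PL response probabilities at a given theta, as in equation (1)."""
    return gamma + (1 - gamma) / (1 + np.exp(-alpha * (theta - beta)))

# Estimated marginal log-likelihood, equation (10): for each examinee, weight the
# conditional pattern likelihood over the evaluation points, then sum the logs.
log_L_M = 0.0
for xi in x:
    cond_lik = np.array([np.prod(np.where(xi == 1, p_item(t), 1 - p_item(t))) for t in T])
    log_L_M += np.log(np.sum(cond_lik * A))
print(log_L_M)
```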
JML and CML Estimation

Historically, JML was the approach of choice for estimating item and examinee parameters. It is based on an iterative estimation scheme that cyclically estimates item and examinee parameters by computing the partial derivatives of the log-L (see (6 and 8)) with respect to these parameters, setting them equal to 0, and solving for the desired parameter values. At each step, provisional parameter estimates from the previous step are used to obtain updated parameter estimates for the current step. While this approach is both intuitively appealing and practically relatively easy to implement, it suffers from theoretical drawbacks. Primarily, it can be shown that the resulting parameter estimates are not consistent, which means that, asymptotically, they do not become arbitrarily close to their population targets in probability, which would be desirable [4, p. 323]. This is understandable, however, because, with each additional examinee, an additional incidental θ parameter is added to the pool of parameters to be estimated and, with each additional item, a certain number of structural item parameters is added as well, so that increasing either the number of examinees or the number of items does not improve asymptotic properties of the joint set of estimators. An alternative estimation approach is CML, which circumvents this problem by replacing the unknown θ values directly by values of manifest statistics such as the total score. For this approach to behave properly, however, the manifest statistics have to be sufficient for θ, which means that they have to contain all available information in the data about θ and have to be directly computable. In other words, the use of CML is restricted to classes of models with sufficient statistics such as the Rasch model in IRT [6, 23]. Since the application of CML is restricted to subclasses of models and JML does not possess optimal estimation properties, a different approach is needed, which is the niche that MML fills.
MML Estimation

The basic idea in MML is similar to CML, namely, to overcome the theoretical inconsistency of item parameter estimators and to resolve the iterative dependency of previous parameter estimates in JML. This is accomplished by first integrating out θ (see (4 and 6)) and then maximizing the log-LM (see (9 and 10)) to obtain the MML estimates of the item parameters using first- and second-order derivatives and a numerical algorithm such as Newton–Raphson for solving the equations involved. The practical maximization process of (10) is done via a modification of a missing-data algorithm, the EM algorithm [2, 3, 5]. This algorithm uses the expected number of examinees at each evaluation point, n̄jk, and the expected number of correct responses at each evaluation point, r̄jk, as 'artificial' data and then maximizes the log-LM at each iteration. Specifically, the EM algorithm employs Bayes Theorem (see Bayesian Belief Networks). In general, the theorem expresses the posterior probability of an event, after observing the data, as a function of the likelihood for the data and the prior probability of the event, before observing the data. In order to implement the EM algorithm, the kernel of the log-LM is expressed with respect to the posterior probability for θ, which is

P_i(\theta \mid X_i = x_i) = \frac{L(X_i = x_i \mid \theta)\, h(\theta)}{\int L(X_i = x_i \mid \theta)\, h(\theta)\, d\theta}   (11)

where h(θ) is a prior distribution for θ. Using this theorem, the MML estimation process within the EM algorithm is comprised of three steps, which are repeated until convergence of the item parameter estimates is achieved. First, the posterior probability of θ for each examinee i at each evaluation point k is computed via

P_{ik}(T_k \mid X_i) = \frac{\left[ \prod_{j=1}^{J} P_j(X_{ij} = x_{ij} \mid T_k) \right] A(T_k)}{\sum_{s=1}^{K} \left[ \prod_{j=1}^{J} P_j(X_{ij} = x_{ij} \mid T_s) \right] A(T_s)}   (12)

as an approximation to (11) at evaluation point k. This is accomplished by using provisional item parameter estimates from the previous iteration to compute Pj(Xij = xij|Tk) for a chosen model. Second, using these posterior probabilities, the artificial data for each item j at each evaluation point k are generated using

\bar{n}_{jk} = \sum_{i=1}^{I} P_{ik}(T_k \mid X_i), \qquad \bar{r}_{jk} = \sum_{i=1}^{I} X_{ij}\, P_{ik}(T_k \mid X_i).   (13)
Third, the first-order derivatives of the estimated log −LM function in (10) with respect to the item parameters are set to 0 and are solved for the item parameters. In addition, the information matrix at these point estimates is computed using the Newton–Gauss/Fisher scoring algorithm to estimate their precision. For that purpose, (10) and its derivatives are rewritten using the artificial data; the entire process is then repeated until convergence of parameter estimates has been achieved. Instead of just performing the above steps, however, programs such as BILOG-MG and MULTILOG allow for a fully Bayesian estimation process. In that framework, additional prior distributions can be specified for all item parameters and all examinee distribution parameters, which are then incorporated into the log −LM and its derivatives where they basically contribute additive terms. Finally, it should be noted that it is common to group examinees by observed response patterns and to use the observed frequencies of each response pattern to reduce the number of computations but that step is not reproduced in this exposition to preserve notational clarity. Thus, rather than obtaining individual θ estimates, in MML one obtains item parameter estimates and only the distribution parameters of g(θ). Nevertheless, if subsequently desired, the item parameter estimates can be used as ‘known’ values to obtain estimates of the examinee parameters using maximum likelihood (ML), Bayesian expected a posteriori (EAP ) or Bayesian maximum a posteriori (MAP ) estimation and their precision can be estimated using the estimated information matrices at the point estimates for the ML approach, the estimated standard deviation of the posterior distributions for the EAP approach, or the estimated posterior information matrices for the MAP approach.
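The E-step computations in (12) and (13) are compact to express in code. A minimal sketch follows, reusing invented item parameters, responses, and quadrature weights of the same form as in the earlier sketches; this is not the BILOG-MG or MULTILOG implementation:

```python
import numpy as np
from scipy.stats import norm

# Invented provisional 3PL item parameters and binary responses.
alpha = np.array([1.2, 0.8, 1.5])
beta  = np.array([-0.5, 0.0, 1.0])
gamma = np.array([0.2, 0.25, 0.2])
x     = np.array([[1, 1, 0],
                  [0, 1, 1],
                  [1, 0, 0]])

T = np.linspace(-4, 4, 21)             # evaluation points T_k
A = norm.pdf(T); A = A / A.sum()       # prior density weights A(T_k)

# Conditional likelihood of each examinee's response pattern at each evaluation point.
P = gamma + (1 - gamma) / (1 + np.exp(-alpha * (T[:, None] - beta)))   # K x J
cond_lik = np.array([[np.prod(np.where(xi == 1, P[k], 1 - P[k]))
                      for k in range(len(T))] for xi in x])            # I x K

# Equation (12): posterior weight of each evaluation point for each examinee.
post = cond_lik * A
post = post / post.sum(axis=1, keepdims=True)                          # I x K

# Equation (13): expected counts ("artificial data") for the M-step.
n_bar = post.sum(axis=0)       # expected examinees at each T_k (same for every item j here)
r_bar = x.T @ post             # J x K expected numbers of correct responses
print(n_bar.shape, r_bar.shape)
```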
Alternative Estimation Approaches

While JML, CML, and, specifically, MML within a fully Bayesian framework are the most flexible estimation approaches, extensions of these approaches and other parameter estimation techniques exist. For example, it is possible to incorporate collateral information about examinees into the MML estimation process to improve its precision [13]. Nonparametric models sometimes require techniques such as principal components estimation [15] or kernel smoothing [18]. More complex psychometric models, as commonly used in cognitively diagnostic assessment, for example, sometimes require specialized software routines altogether [14, 17, 21]. Almost all estimation approaches are based on the foundational principles presented herein, which substantially unify modeling approaches from an estimation perspective.
References

[1] Baker, F.B. (1992). Item Response Theory: Parameter Estimation Techniques, Assessment Systems Corporation, St. Paul.
[2] Bock, R.D. & Aitkin, M. (1981). Marginal maximum likelihood estimation of item parameters: application of an EM algorithm, Psychometrika 46, 443–459.
[3] Bock, R.D. & Lieberman, M. (1970). Fitting a response model to n dichotomously scored items, Psychometrika 35, 179–197.
[4] Casella, G. & Berger, R.L. (1990). Statistical Inference, Duxbury, Belmont.
[5] Dempster, A.P., Laird, N.M. & Rubin, D.B. (1977). Maximum likelihood from incomplete data via the EM algorithm (with discussion), Journal of the Royal Statistical Society. Series B 39, 1–38.
[6] Embretson, S.E. & Reise, S.P. (2000). Item Response Theory for Psychologists, Erlbaum, Mahwah.
[7] Harwell, M.R. & Baker, F.B. (1991). The use of prior distributions in marginalized Bayesian item parameter estimation: a didactic, Applied Psychological Measurement 15, 375–389.
[8] Harwell, M.R., Baker, F.B. & Zwarts, M. (1988). Item parameter estimation via marginal maximum likelihood and an EM algorithm: a didactic, Journal of Educational Statistics 13, 243–271.
[9] Holland, P.W. (1990). On the sampling theory foundations of item response theory models, Psychometrika 55, 577–602.
[10] McDonald, R.P. (1999). Test Theory: A Unified Treatment, Erlbaum, Mahwah.
[11] Mislevy, R.J. (1984). Estimating latent distributions, Psychometrika 49, 359–381.
[12] Mislevy, R.J. (1986). Bayes modal estimation in item response models, Psychometrika 51, 177–195.
[13] Mislevy, R.J. & Sheehan, K.M. (1989). The role of collateral information about examinees in item parameter estimation, Psychometrika 54, 661–679.
[14] Mislevy, R.J., Steinberg, L.S., Breyer, F.J., Almond, R.G. & Johnson, L. (2002). Making sense of data from complex assessments, Applied Measurement in Education 15, 363–389.
[15] Mokken, R.J. (1997). Nonparametric models for dichotomous responses, in Handbook of Modern Item Response Theory, W.J. van der Linden & R.K. Hambleton, eds, Springer-Verlag, New York, pp. 351–368.
[16] Muthén, B.O. (2002). Beyond SEM: general latent variable modeling, Behaviormetrika 20, 81–117.
[17] Nichols, P.D., Chipman, S.F. & Brennan, R.L., eds (1995). Cognitively Diagnostic Assessment, Erlbaum, Hillsdale.
[18] Ramsay, J.O. (1997). A functional approach to modeling test data, in Handbook of Modern Item Response Theory, W.J. van der Linden & R.K. Hambleton, eds, Springer-Verlag, New York, pp. 381–394.
[19] Rupp, A.A. (2002). Feature selection for choosing and assembling measurement models: a building-block based organization, International Journal of Testing 2, 311–360.
[20] Rupp, A.A. (2003). Item response modeling with BILOG-MG and MULTILOG for Windows, International Journal of Testing 4, 365–384.
[21] Tatsuoka, K.K. (1995). Architecture of knowledge structures and cognitive diagnosis: a statistical pattern recognition and classification approach, in Cognitively Diagnostic Assessment, P.D. Nichols, S.F. Chipman & R.L. Brennan, eds, Erlbaum, Hillsdale, pp. 327–360.
[22] Thissen, D. (1991). MULTILOG: Multiple Category Item Analysis and Test Scoring Using Item Response Theory [Computer software], Scientific Software International, Chicago.
[23] van der Linden, W.J. & Hambleton, R.K., eds (1997). Handbook of Modern Item Response Theory, Springer-Verlag, New York.
[24] Zimowski, M.F., Muraki, E., Mislevy, R.J. & Bock, R.D. (1996). BILOG-MG: Multiple-Group IRT Analysis and Test Maintenance for Binary Items [Computer software], Scientific Software International, Chicago.
ANDRÉ A. RUPP
Maxwell, Albert Ernest
DAVID J. BARTHOLOMEW
Volume 3, pp. 1175–1176
in
Encyclopedia of Statistics in Behavioral Science
ISBN-13: 978-0-470-86080-9
ISBN-10: 0-470-86080-4
Editors
Brian S. Everitt & David C. Howell
John Wiley & Sons, Ltd, Chichester, 2005
Maxwell, Albert Ernest

Born: July 7, 1916, in County Cavan.
Died: 1996, in Leeds.
Albert Ernest Maxwell, always known as ‘Max’, is remembered primarily in the research community for his collaboration with D. N. Lawley in producing the monograph Factor Analysis as a Statistical Method. The first edition was published as one of Butterworth’s mathematical texts in 1963, with a reprint four years later [3] (see also [2]). The second edition followed in 1971 with publication in the United States by American Elsevier. The second edition was larger in all senses of the word and included new work by J¨oreskog and others on fitting the factor model by maximum likelihood. It also contained new results on the sampling behavior of the estimates. The title of the book is significant. Up to its appearance, factor analysis (which had been invented by Spearman in 1904) had been largely developed within the psychological community – although it had points of contact with principal component analysis introduced by Harold Hotelling in 1933. Factor analysis had been largely ignored by statisticians, though M. G. Kendall and M. S. Bartlett were notable exceptions. More than anyone else, Lawley and Maxwell brought factor analysis onto the statistical stage and provided a definitive formulation of the factor model and its statistical properties. This is still found, essentially unchanged, in many contemporary texts. It would still be many years, however, before the prejudices of many statisticians would be overcome and the technique would obtain its rightful place in the statistical tool kit. That it has done so is due in no small measure to Maxwell’s powers of exposition as a writer and teacher. Maxwell, himself, once privately expressed a pessimistic view of the reception of this book by the statistical community but its long life and frequent citation belie that judgment. Less widely known, perhaps, was his role as a teacher of statistics in the behavioral sciences. This was focused on his work at the Institute of Psychiatry in the University of London where he taught and advised from 1952 until his retirement in 1978. The distillation of these efforts is contained in another small but influential monograph Multivariate Analysis in Behavioural Research published in 1977 [5].
This ranged much more widely than factor analysis and principal components analysis, and included, for example, a chapter on the analysis of variance in matrix notation, which was more of a novelty when it was published in 1977 than it would be now. The book was clear and comprehensive but, perhaps, a little too concise for the average student. But, backed by good teaching, such as that Maxwell provided, it must have made a major impact on generations of research workers. Unusually perhaps, space was found for a chapter on the analysis of contingency tables. This summarized a topic that had been the subject of another of his earlier, characteristically brief, monographs, this time Analysing Qualitative Data from 1964 [4]. A final chapter by his colleague, Brian Everitt, introduced cluster analysis. The publication of Everitt’s monograph on that subject, initially on behalf of the long defunct UK Social Science Research Council (SSRC), was strongly supported by Maxwell [1]. Much of the best work by academics is done in the course of advisory work, supervision, refereeing, committee work, and so forth. This leaves little mark on the pages of history, but Maxwell’s role at the Institute and beyond made full use of his talents in that direction. Outside the Institute, he did a stint on the Statistics Committee of the SSRC where his profound knowledge of psychological statistics did much to set the course of funded research in his field as well as, occasionally, enlightening his fellow committee members. Maxwell’s career did not follow the standard academic model. He developed an early interest in psychology and mathematics at Trinity College, Dublin. This was followed by a spell as a teacher at St. Patrick’s Cathedral School in Dublin of which he became the headmaster at the age of 25. His conversion to full-time academic work was made possible by the award of a Ph.D. from the University of Edinburgh in 1950. Two years later, he left teaching and took up a post as lecturer in statistics at the Institute of Psychiatry, where he remained for the rest of his working life. Latterly, he was head of the Biometrics unit at the Institute.
References

[1] Everitt, B.S. (1974). Cluster Analysis, Heinemann, London.
[2] Maxwell, A.E. (1961). Recent trends in factor analysis, Journal of the Royal Statistical Society. Series A 124, 49–59.
[3] Maxwell, A.E. & Lawley, D.N. (1963). Factor Analysis as a Statistical Method, 2nd Edition, 1971, Butterworths, London.
[4] Maxwell, A.E. (1964). Analysing Qualitative Data, Wiley, London.
[5] Maxwell, A.E. (1977). Multivariate Analysis in Behavioural Research, Chapman & Hall, London (Monographs on applied probability and statistics).
DAVID J. BARTHOLOMEW
Measurement: Overview
JOEL MICHELL
Volume 3, pp. 1176–1183
Behavioral science admits a diverse class of practices under the heading of measurement. These extend beyond the measurement of physical and biological attributes and include transformations of frequencies (e.g., in psychometrics), summated ratings (e.g., in social psychology), and direct numerical estimates of subjective magnitudes (e.g., in psychophysics). These are used in the hope of measuring specifically psychological attributes and are important sources of data for statistical analyses in behavioral science. Since the Second World War, the consensus within this science has been that measurement is 'the assignment of numerals to objects or events according to rules' [38 p. 667], a definition of measurement unique to social and behavioral science. Earlier definitions, such as the one suggested by the founder of quantitative behavioral science, G. T. Fechner [7], that 'the measurement of a quantity consists of ascertaining how often a unit quantity of the same kind is contained in it' (p. 38), reflect traditional quantitative science. More recent definitions, for example that 'measurement is (or should be) a process of assigning numbers to objects in such a way that interesting qualitative empirical relations among the objects are reflected in the numbers themselves as well as in important properties of the number system' [43 p. 394], are shaped by modern measurement theory. Recent dictionary entries, like Colman's [5], defining measurement as the 'systematic assignment of numbers to represent quantitative attributes of objects or events' (p. 433), also reveal a shift from the earlier consensus. Tensions between the range of practices behavioral scientists call measurement, the way measurement is defined in traditional quantitative science, and various concepts of measurement within philosophy of science are responsible for this definitional diversity.
The Traditional Concept of Measurement

The traditional concept of measurement derives from Euclid's Elements [10]. The category of quantity is central. An attribute is quantitative if its levels sustain ratios. Take length, for example. Any pair of lengths, l1 and l2, interrelate additively, in the sense that there exist whole numbers, n and m, such that n·l1 > l2 and m·l2 > l1. Prior to Euclid, a magnitude of a quantity, such as a specific length, was said to be measured by the number of units which, when added together, equaled it exactly. Such a concept of measurement does not accommodate all pairs of magnitudes; in particular, it does not accommodate incommensurable pairs. It was known that there are some pairs of lengths, for example, for which there are no whole numbers n and m such that n times the first exactly spans m times the second (e.g., the side and diagonal of a square) and that, therefore, for such pairs, the measure of the first relative to the second does not equal a numerical ratio (i.e., a ratio of two whole numbers). It was a major conceptual breakthrough to recognize (as Book V of Euclid's Elements suggests) that the ratio of any pair of magnitudes, such as lengths l1 and l2 (including incommensurable pairs), always falls between two infinite classes of numerical ratios as follows:

{the class of ratios of n to m} < the ratio of l1 to l2 < {the class of ratios of p to q},

where n, m, p, and q range over all whole numbers such that n·l2 < m·l1 and q·l1 < p·l2. This meant that the measure of l1 relative to l2 could be understood as the positive real number (as it became known, much later, toward the end of the nineteenth century [6]) falling between these two classes. This breakthrough not only explained the role of numbers in measurement, it also provided a guide to the internal structure of quantitative attributes: they are attributes in which ratios between any two levels equal positive real numbers. The German mathematician Otto Hölder [12] (see also [27] and [28]) was the first to characterize the structure of unbounded, continuous, quantitative attributes, and his seven axioms of quantity are similar to the following set [26] applying to lengths. Letting a, b, c, . . . be any specific lengths and letting a + b = c denote the relation holding between lengths a, b, and c when c is entirely composed of discrete parts, a and b:

1. For every pair of lengths, a and b, one and only one of the following is true: (i) a = b; (ii) there exists another length, c, such that a = b + c; (iii) there exists another length, d, such that b = a + d.
2. For any lengths a and b, a + b > a.
3. For any lengths a and b, a + b = b + a.
4. For any lengths a, b, and c, a + (b + c) = (a + b) + c.
5. For any length a, there is another length, b, such that b < a.
6. For any lengths a and b, there is another length, c, such that c = a + b.
7. For every nonempty class of lengths having an upper bound, there is a least upper bound. (Note that an upper bound of a class of lengths is any length not less than any member of the class and that a least upper bound is an upper bound not greater than any other upper bound.)

The first four of these conditions state what it means for lengths to be additive. The remaining three ensure that the characterization excludes no lengths (i.e., there is no greatest or least length, nor gaps in the ordered series of them). All measurable, unbounded, continuous, quantitative attributes of physics (e.g., length, mass, time, etc.) are taken to possess this kind of structure. Hölder proved that ratios between levels of any attribute having this kind of structure possess the structure of the positive real numbers. This is a necessary condition for the identification of such ratios by real numbers. If quantitative attributes are taken to have this kind of structure, then the meaning of measurement is explicit: measurement is the estimation of the ratio between a magnitude of a quantitative attribute and a unit belonging to the same attribute. This is the way measurement is understood in physics. For example,

Quantities are abstract concepts possessing two main properties: they can be measured, that means that the ratio of two quantities of the same kind, a pure number, can be established by experiment; and they can enter into a mathematical scheme expressing their definitions or the laws of physics. A unit for a kind of quantity is a sample of that quantity chosen by convention to have the value 1. So that, as already stated by Clerk Maxwell,

physical quantity = pure number × unit.   (1)
This equation means that the ratio of the quantitative abstract concept to the unit is a pure number [41 pp. 765–766].
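The bracketing idea described above can be made concrete with a short computation. The sketch below is only an illustration: it assumes nothing beyond the Python standard library, takes an incommensurable pair of lengths (the side and diagonal of a unit square, chosen here as an example), and finds, for denominators up to a given size, the tightest numerical ratios lying below and above their ratio. The two classes close in on the positive real number 1.41421. . . .

```python
# A minimal sketch of the Euclidean bracketing described above: the ratio of two
# magnitudes l1 and l2 lies above every numerical ratio n/m with n*l2 < m*l1 and
# below every p/q with q*l1 < p*l2.  The particular lengths used are illustrative.
from fractions import Fraction
from math import sqrt

def bounding_ratios(l1, l2, max_denominator):
    """Tightest rational bounds below and above the ratio l1 : l2."""
    lower, upper = None, None
    for den in range(1, max_denominator + 1):
        num = int(den * l1 / l2)          # largest whole number with num/den <= l1/l2
        if num >= 1 and num * l2 < den * l1:
            cand = Fraction(num, den)
            if lower is None or cand > lower:
                lower = cand
        cand = Fraction(num + 1, den)     # smallest num/den lying above l1/l2
        if (num + 1) * l2 > den * l1 and (upper is None or cand < upper):
            upper = cand
    return lower, upper

side, diagonal = 1.0, sqrt(2.0)           # an incommensurable pair of lengths
for d in (10, 100, 1000):
    lo, hi = bounding_ratios(diagonal, side, d)
    print(d, float(lo), float(hi))        # the bounds converge on 1.41421356...
```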
Measurement in physics is the fixed star relative to which measurement in other sciences steers. There have always been those who believe that behavioral
scientists must attempt to understand their quantitative practices within the framework of the traditional concept of measurement (see [25] and [26]). However, from the time of Fechner, Gustav T, it was questioned (e.g., see [45]) whether the attributes thought to be measurable within behavioral science also possess quantitative structure. As posed relative to the traditional concept of measurement, this question raised an issue of evidence: on what scientific grounds is it reasonable to conclude that an attribute possesses quantitative structure? In an important paper [11], the German scientist, Hermann von Helmholtz, made the case that this issue of evidence is solved for physical, quantitative attributes. Helmholtz argued that evidence supporting the conclusion that the attributes measured in physical science are quantitative comes in two forms, namely, those later designated fundamental and derived measurement by Campbell [2]. In the case of fundamental measurement, evidence for quantitative structure is gained via an empirical operation for concatenating objects having different levels of the attribute, which results in those levels adding together. For example, combining rigid, straight rods linearly, end to end adds their respective lengths; placing marbles together in the same pan of a beam balance adds their respective weights; and starting a second process contiguous with the cessation of a first adds their respective durations. In this regard, the seven conditions given above describing the structure of a quantitative attribute are not causal laws. In and of themselves, they say nothing about how objects possessing levels of the attribute will behave under different conditions. These seven conditions are of the kind that J. S. Mill [29] called uniformities of coexistence. They specify the internal structure of an attribute, not the behavior of objects or events manifesting magnitudes of the attribute. This means that there is no logical necessity that fundamental measurement be possible for any attribute. That it is for a number of attributes (e.g., the geometric attributes, weight, and time), albeit in each case only for a very limited range of levels, is a happy accident of nature. In the case of derived measurement, evidence for quantitative structure is indirect, depending upon relations between attributes already known to be quantitative and able to be measured. That a theoretical attribute is quantitative is indicated via derived measurement when objects believed to be equivalent with respect to the attribute also manifest a
constancy in some quantitative function of other, measurable attributes, as, for example, the ratio of mass to volume for different objects, all composed of the same kind of substance, is always a numerical constant. Then it seems reasonable to infer that the theoretical attribute is quantitative and measured by the relevant quantitative function. Since physical scientists have generally restricted the claim that attributes are quantitative to those to which either fundamental or derived measurement apply, there is little controversy, although some attributes, such as temperature, remained controversial longer than others (see, e.g., [22]). However, regarding the nonphysical attributes of behavioral science, the question of evidence for quantitative structure remained permanently controversial. Nonetheless, and despite the arguments of Fechner's critics, the traditional concept of measurement provided no basis to reject the hypothesis that psychological attributes are quantitative. At most, it indicated a gap in the available evidence for those wishing to accept this hypothesis as an already established truth.
The Representational Concept of Measurement

By the close of the nineteenth century, the traditional concept of measurement was losing support within the philosophy of science and early in the twentieth century, the representational concept became dominant within that discipline. This was because thinking had drifted away from the traditional view of number (i.e., that they are ratios of quantities [32]) and toward the view that the concept of number is logically independent of the concept of quantity [35]. If these concepts are logically unrelated, then an alternative to the traditional account of measurement is required. Bertrand Russell suggested the representational concept, according to which measurement depends upon the existence of isomorphisms between the internal structure of attributes and the structure of subsystems of the real numbers. Since Hölder's paper [12] proves an isomorphism between quantitative attributes and positive real numbers, the representational concept fits all instances of physical measurement as easily as the traditional concept. Where the representational concept has an edge over the traditional concept is
in its capacity to accommodate the numerical representation of nonquantitative attributes. Chief amongst these are so-called intensive magnitudes. While the concept of intensive magnitude had been widely discussed in the later middle ages, by the eighteenth century, its meaning had altered (see [14]). For medieval scholars, an intensive magnitude was an attribute capable of increase or decrease (i.e., what would now be called an ordinal attribute) understood by analogy with quantitative structure and, thereby, hypothesized to be both quantitative and measurable. Nineteenth century scholars, likewise, thought of an intensive magnitude as an attribute capable of increase or decrease, but it was thought of as one in which each degree is an indivisible unity and not able to be conceptualized as composed of parts and, so, it was one that could not be quantitative. While some intensive attributes, such as temperature was then believed to be, were associated with numbers, the association was thought of as ‘correct only as to the more or less, not as to the how much’ as Russell [34 p. 55] put it. Many behavioral scientists, retreating in the face of Fechner’s critics, had, by the beginning of the twentieth century, agreed that different levels of sensation were only intensive magnitudes and not fully quantitative in structure. Titchener [42 p. 48] summarized this consensus (using S to stand for sensation): Now it is clear that, in a certain sense, the S may properly be termed a magnitude (Gr¨osse, grandeur): in the sense, namely, that we speak of a ‘more’ and ‘less’ of S-intensity. Our second cup of coffee is sweeter than the first; the water today is colder than it was yesterday; A’s voice carries farther than B’s. On the other hand, the S is not, in any sense, a quantity (messbare Gr¨osse, Quantit¨at, quantit´e).
The representational concept of measurement provides a rationale for the conclusion that psychophysical methods enable measurement because it admits the numerical representation of intensive magnitudes. While some advocates of the representational concept of measurement (e.g., [9] and [8]) were critical of the claims of behavioral scientists, others (e.g., [4]) argued that a range of procedures for the measurement of psychological attributes were instances of the measurement of intensive attributes and the representational concept of measurement began to find its way into the behavioral science literature (e.g., [13] and [39]). S. S. Stevens([38] and [39]) took the representational concept further than previous advocates.
Stevens’ interpretation of the representational theory was that measurement amounts to numerically modeling ‘aspects of the empirical world’ [39 p. 23]. The aspects modeled may differ, producing different types of Scales of Measurement: modeling a set of discrete, unordered classes gives a nominal scale; modeling an ordered attribute gives an ordinal scale; modeling differences between levels of an attribute gives an interval scale; and, on top of that, modeling ratios between levels of an attribute gives a ratio scale. Also, these different types of scales were said to differ with respect to the group of numerical transformations that change the specific numbers used but leave the type of measurement scale unchanged (see Transformation). Thus, any one-to-one transformation of the numbers used in a nominal scale map them into a new nominal scale; any order-preserving (i.e., increasing monotonic) transformation of the numbers used in an ordinal scale maps them into a new ordinal scale; any positive linear transformation (i.e., multiplication by a positive constant together with adding or subtracting a constant) maps the numbers used in an interval scale into a new interval scale; and, finally, any positive similarities transformation (i.e., multiplication by a positive constant) maps the numbers used in a ratio scale into a new ratio scale. Stevens proposed a connection between type of scale and appropriateness of statistical operations. For example, he argued that the computation of means and variances was not permissible given either nominal or ordinal scale measures. He unsuccessfully attempted to justify his prescriptions on the basis of an alleged invariance of the relevant statistics under admissible scale transformations of the measurements involved. His doctrine of permissible statistics had been anticipated by Johnson [13] and it remains a controversial feature of Stevens’ theory (see [24] and [44]). Otherwise, Stevens’ theory of scales of measurement is still widely accepted within behavioral science. The identification of these types of scales of measurement with associated classes of admissible transformations was an important contribution to the representational concept and it established a base for further development of this concept by the philosopher, Patrick Suppes, in association with the behavioral scientist, R. Duncan Luce. In a three-volume work, Foundations of Measurement, Luce and Suppes, with David Krantz and Amos Tversky, brought the representational concept of measurement to an
advanced state, carrying forward H¨older’s axiomatic approach to measurement theory and using the conceptual framework of set theory. This approach to measurement involves four steps: 1. An empirical relational system is specified as a nonempty set of entities (objects or attributes of some kind) together with a finite number of distinct qualitative (i.e., nonnumerical) relations between the elements of this set. These elements and relations are the empirical primitives of the system. 2. A set of axioms is stated in terms of the empirical primitives. To the extent that these axioms are testable, they constitute a scientific theory. To the extent that they are supported by data, there is evidence favoring that theory. 3. A numerical relational system is identified such that a set of homomorphic or isomorphic mappings between the empirical and numerical systems can be proved to exist. This proof is referred to as a representation theorem. 4. A specification of how the elements of this set of homomorphisms or isomorphisms relate to one another is given, generally by identifying to which class of mathematical functions all transformations of any one element of this set into the other elements belong. A demonstration of this specification is referred to as a uniqueness theorem for the scale of measurement involved. This approach was extended by these authors to the measurement of more than one attribute at a time, for example, to a situation in which the empirical relational system involves an ordering upon the levels of a dependent variable as it relates to two independent variables. As an illustration of this point, consider an ordering upon performances on a set of intellectual tasks (composing a psychological test, say) as related to the abilities of the people and the difficulties of the tasks. The elements of the relevant set are ordered triples (viz., a person’s level of ability; a task’s level of difficulty; and the level of performance of a person of that ability on a task of that difficulty). The relevant representation theorem proves that the order on the levels of performance satisfies certain conditions (what Krantz, Luce, Suppes & Tversky [16] call double cancellation, solvability and an Archimedean condition) if and only if it is isomorphic to a numerical system in which the elements are triples of real numbers, x, y, and z, such that x + y = z and where
levels of ability map to the first number in each triple, levels of difficulty to the second, levels of performance to the third, and order between different levels of performance maps to order between the corresponding values of these third numbers (the zs). The relevant uniqueness theorem proves that the mapping from the three attributes (in this case, abilities, difficulties, and performances) to the triples of numbers produces three interval scales of measurement, in Stevens' sense. This particular kind of example of the representational approach to measurement is known as conjoint measurement and while it was a relatively new concept within measurement theory, it has proved to be theoretically important. For example, it makes explicit the evidential basis of derived measurement in physical sciences in a way that earlier treatments never did. Of the three conditions mentioned above (double cancellation, solvability, and the Archimedean condition), only the first is directly testable. As a result, applications of conjoint measurement theory make use of the fact that a finite substructure of any empirical system satisfying these three conditions will always satisfy a finite hierarchy of cancellation conditions (single cancellation (or independence), double cancellation, triple cancellation, etc.). All of the conditions in this hierarchy are directly testable. Following the lead of Suppes and Zinnes [40], Luce, Krantz, Suppes & Tversky [21] attempted to reinterpret Stevens' doctrine of permissible statistics using the concept of meaningfulness. Relations between the objects or attributes measured are said to be meaningful if and only if definable in terms of the primitives of the relevant empirical relational system. This concept of meaningfulness is related to the invariance, under admissible scale transformations of the measurements involved, of the truth (or falsity) of statements about the relation. The best example of the application of the representational concept of measurement within behavioral science is Luce's [20] investigations of utility in situations involving individual decision making under conditions of risk and uncertainty. Luce's application demonstrates the way in which ideas in representational measurement theory may be combined with experimental research. Within the representational concept, the validity of any proposed method of measurement must always be underwritten by advances in experimental science.
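Because only the cancellation conditions are directly testable, they are easy to check mechanically. The following is a minimal sketch under hypothetical data: a small matrix of ordered performance levels, indexed by ability (rows) and task difficulty (columns), is tested for single cancellation (independence) and for double cancellation. The matrix values, the function names, and the particular form of the checks are assumptions for illustration only, not a reconstruction of any published procedure.

```python
# Sketch of the directly testable cancellation conditions for a two-factor
# conjoint structure.  Rows index ability levels, columns index task
# difficulty levels; cell values stand for ordered performance levels and
# only their order is used.  The data are hypothetical.
from itertools import combinations

P = [
    [1, 2, 4],
    [2, 4, 6],
    [3, 5, 7],
]

def independence(P):
    """Single cancellation: every row induces the same ordering of the columns,
    and every column induces the same ordering of the rows."""
    n_rows, n_cols = len(P), len(P[0])
    rows_ok = all(
        (P[i][a] <= P[i][b]) == (P[k][a] <= P[k][b])
        for i in range(n_rows) for k in range(n_rows)
        for a, b in combinations(range(n_cols), 2)
    )
    cols_ok = all(
        (P[a][j] <= P[b][j]) == (P[a][m] <= P[b][m])
        for j in range(n_cols) for m in range(n_cols)
        for a, b in combinations(range(n_rows), 2)
    )
    return rows_ok and cols_ok

def double_cancellation(P):
    """If P[a][y] >= P[b][x] and P[b][z] >= P[c][y], then P[a][z] >= P[c][x],
    for every choice of three row indices a < b < c and columns x < y < z."""
    n_rows, n_cols = len(P), len(P[0])
    for a, b, c in combinations(range(n_rows), 3):
        for x, y, z in combinations(range(n_cols), 3):
            if P[a][y] >= P[b][x] and P[b][z] >= P[c][y] and not P[a][z] >= P[c][x]:
                return False
    return True

print(independence(P), double_cancellation(P))  # True True for this matrix
```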
Despite the comprehensive treatment of the representational concept within Foundations of Measurement, and also by other authors (e.g., [30] and [31]), the attitude of behavioral scientists to this concept is ambivalent. The range of practices currently accepted within behavioral science as providing interval or ratio scales extends beyond the class demonstrated to be scales of those kinds under the representational concept. These include psychometric methods, which account for the greater part of the practice of measurement in behavioral science. While certain restricted psychometric models may be given a representational interpretation (e.g., Rasch’s [33] probabilistic item response model (see [15]), in general, the widespread practice of treating test scores (or components of test scores derived via multivariate techniques like linear factor analysis) as interval scale measures of psychological attributes has no justification under the representational concept of measurement. As a consequence, advocates of established practice within behavioral science have done little more than pay lip service to the representational concept.
The Operational Concept

As a movement in the philosophy of science, operationism was relatively short-lived, but it flourished long enough to gain a foothold within behavioral science, one that has not been dislodged by the philosophical arguments constructed by its many opponents. Operationism was initiated in 1927 by the physicist, P. W. Bridgman [1]. It had a profound influence upon Stevens, who became one of its leading advocates within behavioral science (see [36] and [37]). Operationism was part of a wider movement in philosophy of science, one that tended to think of scientific theories and concepts as reducible to directly observable terms or directly performable operations. The best-known quotation from Bridgman's writings is his precept that 'in general we mean by any concept nothing more than a set of operations; the concept is synonymous with the corresponding set of operations' ([1 p. 5], italics in original). If this is applied to the concept of measurement, then measurement is nothing more than 'a specified procedure of action which, when followed, yields a number' [17 p. 39].
The operational and representational concepts of measurement are not necessarily incompatible. The genius of Stevens’ definition, that ‘measurement is the assignment of numerals to objects or events according to rules’ [38 p. 667] lies in the fact that it fits both the representational and the operationist concepts of measurement. If the ‘rule’ for making numerical assignments is understood as one that results in an isomorphism or homomorphism between empirical and numerical relational systems, then Stevens’ definition clearly fits the representational concept. On the other hand, if the ‘rule’ is taken to be any numbergenerating procedure, then it fits the operational concept. Furthermore, when the components of the relevant empirical relational system are defined operationally, then the representational concept of measurement is given an operational interpretation. (This kind of interpretation was Stevens’ [39] preference). Within the psychometric tradition, Stevens’ definition of measurement is widely endorsed (e.g., [19] and [23]). Any specific psychological test may be thought of as providing a precisely specified kind of procedure that yields a number (a test score) for any given person on each occasion of administration. The success of psychometrics is based upon the usefulness of psychological tests in educational and organizational applications and the operational concept of measurement has meant that these applications can be understood using a discourse that embodies a widely accepted concept of measurement. Any psychological test delivers measurements (in the operational sense), but not all such measurements are equally useful. The usefulness of test scores as measurements is generally understood relative to two attributes: reliability and validity (see Reliability: Definitions and Estimation; Validity Theory and Applications [19]). Within this tradition, any measure is thought to be additively composed of two components: a true component (defined as the expected value of the measures across an infinite number of hypothetical occasions of testing with person and test held constant) and an error component (the difference between the measure and the true component on each occasion). The reliability of a set of measurements provided by a test used within a population of people is understood as the proportion of the variance of the measurements that is true. The validity of a test is usually assessed via methods of linear regression (see Multiple Linear Regression), for predictive validity, and linear factor analysis, for construct validity, that is,
in terms of the proportion of true variance shared with an observed criterion variable or with a postulated theoretical construct. The kinds of theoretical constructs postulated include cognitive abilities (such as general ability) and personality traits (such as extraversion). There has been considerable debate within behavioral science as to how realistically such constructs should be interpreted. While the inclusiveness of operationism has obvious advantages, it seems natural to interpret theoretical constructs as if they are real, causally explanatory attributes of people. This raises the issue of the internal structure of such attributes, that is, whether there is evidence that they are quantitative. In this way, thinking about measurement naturally veers away from operationism and toward the traditional concept. Some behavioral scientists admit that numbers such as test scores, that are thought of as measurements of theoretical psychological attributes, ‘generally reflect only ordinal relations between the amounts of the attribute’ ([23] p. 446) (see [3] for an extended treatment of this topic and [18] for a discussion in the context of psychophysics). These authors maintain that at present there is no compelling evidence that psychological attributes are quantitative. Despite this, the habit of referring to such numbers as measurements is established in behavioral science and not likely to change. Thus, pressure will continue within behavioral science for a concept that admits as measurement the assignment of numbers to nonquantitative attributes.
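The true-score-plus-error decomposition sketched above is easy to illustrate by simulation. The following is a minimal sketch, not a description of any particular test: true scores and independent errors are generated with assumed standard deviations, and reliability is estimated as the proportion of observed-score variance that is true-score variance. The sample size, means, and standard deviations are arbitrary illustrative choices.

```python
# Minimal simulation of the classical true-score-plus-error decomposition:
# observed = true + error, with reliability = var(true) / var(observed).
import random

random.seed(1)
n_people = 10_000
true_sd, error_sd = 10.0, 6.0            # assumed values for illustration

true = [random.gauss(50.0, true_sd) for _ in range(n_people)]
observed = [t + random.gauss(0.0, error_sd) for t in true]

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

reliability = variance(true) / variance(observed)
print(round(reliability, 3))             # close to 10**2 / (10**2 + 6**2) = 0.735
```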
References

[1] Bridgman, P.W. (1927). The Logic of Modern Physics, Macmillan, New York.
[2] Campbell, N.R. (1920). Physics, the Elements, Cambridge University Press, Cambridge.
[3] Cliff, N. & Keats, J.A. (2003). Ordinal Measurement in the Behavioral Sciences, Lawrence Erlbaum Associates, Mahwah.
[4] Cohen, M.R. & Nagel, E. (1934). An Introduction to Logic and Scientific Method, Routledge & Kegan Paul Ltd, London.
[5] Colman, A.M. (2001). A Dictionary of Psychology, Oxford University Press, Oxford.
[6] Dedekind, R. (1901). Essays on the Theory of Numbers (Translated by W.W. Beman), Open Court, Chicago.
[7] Fechner, G.T. (1860). Elemente der Psychophysik, Breitkopf and Hartel, Leipzig.
[8] Ferguson, A., Myers, C.S., Bartlett, R.J., Banister, H., Bartlett, F.C., Brown, W., Campbell, N.R., Craik, K.J.W., Drever, J., Guild, J., Houstoun, R.A., Irwin, J.C., Kaye, G.W.C., Philpott, S.J.F., Richardson, L.F., Shaxby, J.H., Smith, T., Thouless, R.H. & Tucker, W.S. (1940). Quantitative estimates of sensory events: final report, Advancement of Science 1, 331–349.
[9] Ferguson, A., Myers, C.S., Bartlett, R.J., Banister, H., Bartlett, F.C., Brown, W., Campbell, N.R., Drever, J., Guild, J., Houstoun, R.A., Irwin, J.C., Kaye, G.W.C., Philpott, S.J.F., Richardson, L.F., Shaxby, J.H., Smith, T., Thouless, R.H. & Tucker, W.S. (1938). Quantitative estimates of sensory events: interim report, British Association for the Advancement of Science 108, 277–334.
[10] Heath, T.L. (1908). The Thirteen Books of Euclid's Elements, Vol. 2, Cambridge University Press, Cambridge.
[11] Helmholtz, H.v. (1887). Zählen und Messen erkenntnistheoretisch betrachtet, in Philosophische Aufsätze Eduard Zeller zu seinem fünfzigjährigen Doktorjubiläum gewidmet, Fues' Verlag, Leipzig.
[12] Hölder, O. (1901). Die Axiome der Quantität und die Lehre vom Mass, Berichte über die Verhandlungen der Königlich Sächsischen Gesellschaft der Wissenschaften zu Leipzig, Mathematisch-Physische Klasse 53, 1–46.
[13] Johnson, H.M. (1936). Pseudo-mathematics in the social sciences, American Journal of Psychology 48, 342–351.
[14] Kant, I. (1997). Critique of Pure Reason, Translated and edited by P. Guyer & A.E. Wood, eds, Cambridge University Press, Cambridge.
[15] Keats, J. (1967). Test theory, Annual Review of Psychology 18, 217–238.
[16] Krantz, D.H., Luce, R.D., Suppes, P. & Tversky, A. (1971). Foundations of Measurement, Vol. 1, Academic Press, New York.
[17] Lad, F. (1996). Operational Subjective Statistical Methods: a Mathematical, Philosophical and Historical Introduction, John Wiley & Sons, New York.
[18] Laming, D. (1997). The Measurement of Sensation, Oxford University Press, Oxford.
[19] Lord, F.M. & Novick, M.R. (1968). Statistical Theories of Mental Test Scores, Addison-Wesley, Reading.
[20] Luce, R.D. (2000). Utility of Gains and Losses: Measurement-theoretical and Experimental Approaches, Lawrence Erlbaum, Mahwah.
[21] Luce, R.D., Krantz, D.H., Suppes, P. & Tversky, A. (1990). Foundations of Measurement, Vol. 3, Academic Press, New York.
[22] Mach, E. (1896). Critique of the concept of temperature (Translated by Scott-Taggart, M.J. & Ellis, B.; reprinted in Ellis, B. (1968). Basic Concepts of Measurement, Cambridge University Press, Cambridge, pp. 183–196).
[23] McDonald, R.P. (1999). Test Theory: a Unified Treatment, Lawrence Erlbaum, Mahwah.
[24] Michell, J. (1986). Measurement scales and statistics: a clash of paradigms, Psychological Bulletin 100, 398–407.
[25] Michell, J. (1997). Quantitative science and the definition of measurement in psychology, British Journal of Psychology 88, 355–383.
[26] Michell, J. (1999). Measurement in Psychology: a Critical History of a Methodological Concept, Cambridge University Press, Cambridge.
[27] Michell, J. & Ernst, C. (1996). The axioms of quantity and the theory of measurement, Part I, Journal of Mathematical Psychology 40, 235–252.
[28] Michell, J. & Ernst, C. (1997). The axioms of quantity and the theory of measurement, Part II, Journal of Mathematical Psychology 41, 345–356.
[29] Mill, J.S. (1843). A System of Logic, Parker, London.
[30] Narens, L. (1985). Abstract Measurement Theory, MIT Press, Cambridge.
[31] Narens, L. (2002). Theories of Meaningfulness, Lawrence Erlbaum, Mahwah.
[32] Newton, I. (1728). Universal arithmetic: or, a treatise of arithmetical composition and resolution (Reprinted in Whiteside, D.T. (1967). The Mathematical Works of Isaac Newton, Vol. 2, Johnson Reprint Corporation, New York).
[33] Rasch, G. (1960). Probabilistic Models for Some Intelligence and Attainment Tests, Danmarks Paedagogiske Institut, Copenhagen.
[34] Russell, B. (1896). On some difficulties of continuous quantity (Originally unpublished paper reprinted in Griffin, N. & Lewis, A.C., eds, The Collected Papers of Bertrand Russell, Volume 2: Philosophical Papers 1896–99, Routledge, London, pp. 46–58).
[35] Russell, B. (1903). Principles of Mathematics, Cambridge University Press, Cambridge.
[36] Stevens, S.S. (1935). The operational definition of psychological terms, Psychological Review 42, 517–527.
[37] Stevens, S.S. (1936). Psychology: the propaedeutic science, Philosophy of Science 3, 90–103.
[38] Stevens, S.S. (1946). On the theory of scales of measurement, Science 103, 667–680.
[39] Stevens, S.S. (1951). Mathematics, measurement and psychophysics, in Handbook of Experimental Psychology, S.S. Stevens, ed., Wiley, New York, pp. 1–49.
[40] Suppes, P. & Zinnes, J. (1963). Basic measurement theory, in Handbook of Mathematical Psychology, Vol. 1, R.D. Luce, R.R. Bush & E. Galanter, eds, Wiley, New York, pp. 1–76.
[41] Terrien, J. (1980). The practical importance of systems of units; their trend parallels progress in physics, in Proceedings of the International School of Physics 'Enrico Fermi' Course LXVIII, Metrology and Fundamental Constants, A.F. Milone & P. Giacomo, eds, North-Holland, Amsterdam, pp. 765–769.
[42] Titchener, E.B. (1905). Experimental Psychology: A Manual of Laboratory Practice, Macmillan, London.
[43] Townsend, J.T. & Ashby, F.G. (1984). Measurement scales and statistics: the misconception misconceived, Psychological Bulletin 96, 394–401.
[44] Velleman, P.F. & Wilkinson, L. (1993). Nominal, ordinal, interval, and ratio typologies are misleading, American Statistician 47, 65–72.
[45] Von Kries, J. (1882). Über die Messung intensiver Grössen und über das sogenannte psychophysische Gesetz, Vierteljahrsschrift für Wissenschaftliche Philosophie 6, 257–294.
(See also Latent Variable; Scales of Measurement) JOEL MICHELL
Measures of Association
SCOTT L. HERSHBERGER AND DENNIS G. FISHER
Volume 3, pp. 1183–1192
Measures of association quantify the statistical dependence of two or more categorical variables. Many measures of association have been proposed; in this article, we restrict our attention to the ones that are most commonly used. We broadly divide the measures into those suitable for (a) unordered (nominal) categorical variables, and (b) ordered categorical variables. Within these two categories, further subdivisions can be defined. In the case of unordered categorical variables, measures of association can be categorized as (a) measures based on the odds ratio; (b) measures based on Pearson’s χ 2 ; and (c) measures of predictive association. In the case of ordered (ordinal) categorical variables, measures of association can be categorized as (a) measures of concordance-discordance and (b) measures based on derived scores.
Measures of Association for Unordered Categorical Variables
Measures Based on the Odds Ratio

The Odds Ratio. The odds that an event will occur are computed by dividing the probability that the event will occur by the probability that the event will not occur (see Odds and Odds Ratios) [11]. For example, consider the contingency table shown in Table 1, with sex (s) as the row variable and employment status (e) as the column variable. For males, the odds of being employed equal the probability of being employed divided by the probability of not being employed:

\[
\mathrm{odds}_{y|m} = \frac{p(y|m)}{1 - p(y|m)} = \frac{p(y|m)}{p(n|m)} = \frac{42/75}{33/75} = \frac{.56}{.44} = 1.27. \qquad (1)
\]

Table 1   Employment status of males and females

                          Employment status
Sex             Employed (y)    Not employed (n)    Total
Male (m)              42                33             75
Female (f)            40               125            165
Total                 82               158            240

Similarly, for females:

\[
\mathrm{odds}_{y|f} = \frac{p(y|f)}{1 - p(y|f)} = \frac{p(y|f)}{p(n|f)} = \frac{40/165}{125/165} = \frac{.242}{.758} = .319. \qquad (2)
\]

These odds are large when employment is likely, and small when it is unlikely. The odds can be interpreted as the number of times that employment occurs for each time it does not. For example, since the odds of employment for males are 1.27, approximately 1.27 males are employed for every unemployed male, or, in round numbers, 5 males are employed for every 4 males that are unemployed. For females, there is approximately one employed female for every three unemployed females. The association between sex and employment status can be expressed by the degree to which the two odds differ. This difference can be summarized by the odds ratio (α):

\[
\alpha = \frac{\mathrm{odds}_{y|m}}{\mathrm{odds}_{y|f}} = \frac{p(y|m)/p(n|m)}{p(y|f)/p(n|f)} = \frac{1.27}{.32} = 3.98. \qquad (3)
\]

This odds ratio indicates that for every one employed female, there are four employed males. Although not necessary, it is customary to place the larger of the two odds in the numerator [11]. Also note that the odds ratio can be expressed as the crossproduct of the cell frequencies of the contingency table:

\[
\alpha = \frac{p(y|m)\,p(n|f)}{p(y|f)\,p(n|m)} = \frac{(42/75)(125/165)}{(40/165)(33/75)} = \frac{(42)(125)}{(40)(33)} = 3.98. \qquad (4)
\]
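The arithmetic in equations (1) through (4) can be verified with a few lines of code. The sketch below simply recomputes the two odds, the odds ratio, the cross-product form, and the log odds ratio from the Table 1 counts; the variable names are illustrative.

```python
# A short check of equations (1)-(4), using the Table 1 counts.
from math import log

n11, n12 = 42, 33     # males: employed, not employed
n21, n22 = 40, 125    # females: employed, not employed

odds_male = n11 / n12                                   # p(y|m) / p(n|m)
odds_female = n21 / n22                                 # p(y|f) / p(n|f)
alpha = odds_male / odds_female                         # odds ratio
cross_product = (n11 * n22) / (n12 * n21)               # same quantity

print(round(odds_male, 2), round(odds_female, 3))       # 1.27 0.32
print(round(alpha, 2), round(cross_product, 2))         # 3.98 3.98
print(round(log(alpha), 2), round(log(1 / alpha), 2))   # 1.38 -1.38
```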
Table 2   Employment status within three regions

                          Employment status
Region          Employed (y)    Not employed (n)    Total
West (w)              90                10            100
South (s)             80                20            100
East (e)              99                 1            100
Total                269                31            300

Thus, the odds ratio is also known as the crossproduct ratio. The odds ratio can be applied to larger than 2 × 2 contingency tables, although its interpretation becomes difficult when there are more than two columns. However, when the number of rows is greater than two, interpreting the odds ratio
is relatively straightforward. For example, consider Table 2. Within any of the three regions, we can determine the odds that someone in that region will be employed. This is accomplished by dividing the probability for those in a given region being employed by the probability for those in the same region not being employed. Thus, for the west, the odds of being employed are (90/100)/(10/100) = 9. For the south, the odds of being employed are (80/100)/(20/100) = 4. For the east, the odds of being employed are (99/100)/(1/100) = 99. From these three odds, we can compute the following three odds ratios: (a) the odds ratio of someone in the east being employed versus someone in west being employed (α = 99/9 = 11); (b) the odds ratio of someone in the east being employed versus someone in the south being employed (α = 99/4 = 24.75); and (c) the odds ratio of someone in the west being employed versus someone in the south being employed (α = 9/4 = 2.25). Thus, the odds of someone in the east being employed are 11 times larger than the odds of someone in the west being employed, and 24.75 times larger than someone in the south. The odds of someone in the west being employed are 2.25 times larger than the odds of someone in the south being employed. The odds ratio is invariant under interchanges of rows and columns; switching only rows or columns changes α to 1/α [11]. Odds ratios are not symmetric around one: An odds ratio larger than one by a given amount indicates a smaller effect than an odds ratio smaller than one by the same amount [11]. While the magnitude of an odds ratio less than one is restricted to the range between zero and one, odds ratios greater than one are not restricted, allowing the ratio to potentially take on any value. If the natural logarithm (ln) of the odds ratio is taken, the
odds ratio is symmetric above and below one, with ln(1) = 0. For example, to take the 2 × 2 example above, the odds ratio of a male being employed, compared to a female, was 3.98. If we reverse that and take the odds ratio of a female being employed compared to a male, it is .32/1.27 = .25. These ratios are clearly not symmetric about 1.00. However, ln(3.98) = 1.38 and ln(.25) = −1.38, and these ratios are symmetric.

Yule's Q Coefficient of Association. Both α and ln(α) range from −∞ to +∞. In order to restrict the odds ratio within the interval −1 to +1, Yule introduced the coefficient of association, Q, for 2 × 2 tables. Its definition is [12]:

\[
Q = \frac{n_{11} n_{22} - n_{12} n_{21}}{n_{11} n_{22} + n_{12} n_{21}} = \frac{\alpha - 1}{\alpha + 1}. \qquad (5)
\]

Therefore, if the odds ratio (α) expressing the relationship between sex and employment is 3.98,

\[
Q = \frac{(42)(125) - (33)(40)}{(42)(125) + (33)(40)} = \frac{3.98 - 1}{3.98 + 1} = .59. \qquad (6)
\]

Yule's Q is algebraically equal to the 2 × 2 Goodman-Kruskal γ coefficient (described below), and, thus, measures the degree of concordance or discordance between two variables.

Yule's Coefficient of Colligation. As an alternative to Q, Yule proposed the coefficient of colligation [12]:

\[
Y = \frac{\sqrt{n_{11} n_{22}} - \sqrt{n_{12} n_{21}}}{\sqrt{n_{11} n_{22}} + \sqrt{n_{12} n_{21}}} = \frac{\sqrt{\alpha} - 1}{\sqrt{\alpha} + 1}. \qquad (7)
\]

The coefficient of colligation between sex and employment is

\[
Y = \frac{\sqrt{(42)(125)} - \sqrt{(33)(40)}}{\sqrt{(42)(125)} + \sqrt{(33)(40)}} = \frac{\sqrt{3.98} - 1}{\sqrt{3.98} + 1} = .33. \qquad (8)
\]

The coefficient of colligation has a different interpretation than the coefficient of association: it is interpreted as a Pearson product-moment correlation coefficient r (see below), but is not algebraically equivalent to one. Both Q and Y are symmetric measures of association, and are invariant under changes in the ordering of rows and columns.
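A short check of equations (5) through (8): the snippet below recomputes Yule's Q and the coefficient of colligation from the cell counts and from the odds ratio, confirming that the two routes agree (the exact values round to .60 and .33; the .59 above reflects rounding of α).

```python
# Yule's Q (eqs 5-6) and the coefficient of colligation (eqs 7-8) for the same table.
from math import sqrt

n11, n12, n21, n22 = 42, 33, 40, 125
alpha = (n11 * n22) / (n12 * n21)

Q = (n11 * n22 - n12 * n21) / (n11 * n22 + n12 * n21)
Q_from_alpha = (alpha - 1) / (alpha + 1)

Y = (sqrt(n11 * n22) - sqrt(n12 * n21)) / (sqrt(n11 * n22) + sqrt(n12 * n21))
Y_from_alpha = (sqrt(alpha) - 1) / (sqrt(alpha) + 1)

print(round(Q, 2), round(Q_from_alpha, 2))   # 0.6 0.6
print(round(Y, 2), round(Y_from_alpha, 2))   # 0.33 0.33
```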
Measures Based on Pearson's χ²

Pearson's Chi-square Goodness-of-fit Test Statistic. The Pearson chi-square statistic (see Goodness of Fit for Categorical Variables),

\[
\chi^2 = \sum_i \sum_j \frac{(n_{ij} - \hat{n}_{ij})^2}{\hat{n}_{ij}}, \qquad (9)
\]

where \hat{n}_{ij} is the expected frequency in cell ij, can be usefully transformed into several measures of association [1].

The Phi Coefficient. The phi coefficient is defined as [5]:

\[
\phi = \sqrt{\frac{\chi^2}{N}}. \qquad (10)
\]

For sex and employment,

\[
\phi = \sqrt{\frac{23.12}{240}} = 0.31. \qquad (11)
\]

The phi coefficient can vary between 0 and 1, and is algebraically equivalent to |r|. However, the lower and upper limits of φ in a 2 × 2 table are dependent on two conditions. In order for φ to equal −1 or +1, (a) (n11 + n12) = (n21 + n22), and (b) (n11 + n21) = (n12 + n22).

The Pearson Product-moment Correlation Coefficient. The general formula for the Pearson product-moment correlation coefficient is

\[
r = \frac{\sum_i (X_i - \bar{X})(Y_i - \bar{Y})}{N s_x s_y}, \qquad (12)
\]

where X and Y are two continuous, interval-level variables. The categories of a dichotomous variable can be coded 0 and 1, and used in the formula for r. In a 2 × 2 contingency table, the calculations reduce to [10]:

\[
r = \frac{n_{11} n_{22} - n_{12} n_{21}}{\sqrt{n_{1+} n_{2+} n_{+1} n_{+2}}}. \qquad (13)
\]

For sex and employment, r = [(42)(125) − (33)(40)]/√[(75)(165)(82)(158)] = .31.
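The following sketch recomputes the chi-square statistic, the phi coefficient, and the closed-form 2 × 2 correlation of equations (9), (10), and (13) from the sex-by-employment counts; expected frequencies are formed from the marginal totals in the usual way.

```python
# Chi-square (eq 9), phi (eq 10), and the 2 x 2 correlation (eq 13).
from math import sqrt

table = [[42, 33],
         [40, 125]]
row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]
N = sum(row_totals)

chi_square = sum(
    (table[i][j] - row_totals[i] * col_totals[j] / N) ** 2
    / (row_totals[i] * col_totals[j] / N)
    for i in range(2) for j in range(2)
)
phi = sqrt(chi_square / N)
r = (table[0][0] * table[1][1] - table[0][1] * table[1][0]) / sqrt(
    row_totals[0] * row_totals[1] * col_totals[0] * col_totals[1]
)
print(round(chi_square, 2), round(phi, 2), round(r, 2))  # 23.12 0.31 0.31
```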
A symmetric measure of association also invariant to row and column order, r varies between −1 and +1. From the formula, it is apparent that r = 1 if n12 = n21 = 0, and r = −1 if n11 = n22 = 0. In a standardized 2 × 2 table, where each marginal probability = .5, r = Y; otherwise |r| < |Y|, except when the variables are independent or completely related.

Cramér's Phi Coefficient. Cramér's phi coefficient V is an extension of the phi coefficient to contingency tables that are larger than 2 × 2 [10]:

\[
V = \sqrt{\frac{\chi^2}{N(k - 1)}}, \qquad (14)
\]

where k is the number of rows or columns, whichever is smaller. Cramér's phi coefficient is based on the fact that the maximum value chi-square can attain for a set of data is N(k − 1); thus, when chi-square is equal to N(k − 1), V = 1. Cramér's phi coefficient and the phi coefficient are equivalent for 2 × 2 tables. When applied to a 4 × 2 table for region and employment,

\[
V = \sqrt{\frac{204.38}{500(2 - 1)}} = .64. \qquad (15)
\]

The Contingency Coefficient. Another variation of the phi coefficient, the contingency coefficient C, is a measure of association that can be computed for an r × c table of any size [9]:

\[
C = \sqrt{\frac{\chi^2}{\chi^2 + N}}. \qquad (16)
\]

Applied to the 4 × 2 table between region and employment,

\[
C = \sqrt{\frac{204.38}{204.38 + 500}} = .54. \qquad (17)
\]

Note that, unlike V, C computed for 2 × 2 tables does not equal φ. For example, while V and φ both equal .31 for the association between sex and employment,

\[
C = \sqrt{\frac{23.12}{23.12 + 240}} = .29. \qquad (18)
\]
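Cramér's V and the contingency coefficient depend only on chi-square, N, and the table's smaller dimension, so they are simple to reproduce. The sketch below uses the chi-square values quoted in the text (204.38 for the 4 × 2 region-by-employment table with N = 500, and 23.12 for the 2 × 2 table with N = 240); the exact results round to .64, .54, and .30 (the .29 above reflects rounding).

```python
# Cramer's V (eqs 14-15) and the contingency coefficient (eqs 16-18).
from math import sqrt

def cramers_v(chi_square, n, k):          # k = min(rows, columns)
    return sqrt(chi_square / (n * (k - 1)))

def contingency_c(chi_square, n):
    return sqrt(chi_square / (chi_square + n))

print(round(cramers_v(204.38, 500, 2), 2))      # 0.64
print(round(contingency_c(204.38, 500), 2))     # 0.54
print(round(contingency_c(23.12, 240), 2))      # 0.3
```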
The contingency coefficient can never equal 1, since N cannot equal 0. Thus, the range of C is 0 ≤ C < +1. Another limitation of C is its dependence on the number of rows and columns in an r × c table. The upper limit of C is

\[
C_{\max} = \sqrt{\frac{k - 1}{k}}. \qquad (19)
\]

Therefore, for 2 × 2 tables, C can never exceed .71; that is, C_max = √[(2 − 1)/2] = .71. In fact, C will always be less than 1, even when the association between two variables is perfect [9]. In addition, values of C for two tables can only be compared when the tables have the same number of rows and the same number of columns. To counter these limitations, an adjustment can be applied to C:

\[
C_{\mathrm{adj}} = \frac{C}{C_{\max}}. \qquad (20)
\]

Thus, when the association between two variables is 1.00, C_adj will reflect this. For region and employment, C_adj = .54/.71 = .76, and for sex and employment, C_adj = .29/.71 = .41.

Measures of Predictive Association

Another class of measures of association for nominal variables is measures of prediction analogous in concept to the multiple correlation coefficient in regression analysis [12]. When there is an association between two nominal variables X and Y, then knowledge about X allows one to obtain knowledge about Y, more knowledge than would have been available without X. Let V(Y) be the dispersion of Y and V(Y|X) be the conditional dispersion of Y given X. A measure of prediction

\[
\phi_{Y.X} = 1 - \frac{V(Y|X)}{V(Y)} \qquad (21)
\]

compares the conditional dispersion of Y given X to the unconditional dispersion of Y, similar to how the multiple correlation coefficient compares the conditional variance of the dependent variable to its unconditional variance [1]. When φ_Y.X = 0, X and Y are independently distributed; when φ_Y.X = 1, X is a perfect predictor of Y.

We describe four measures that operationalize the idea underlying φ_Y.X. The first measure is the asymmetric lambda coefficient of Goodman and Kruskal [7]:

\[
\lambda(Y = c|X = r) = \frac{\sum_i \max_j p_{ij} - \max_j p_{+j}}{1 - \max_j p_{+j}},
\qquad
\lambda(X = r|Y = c) = \frac{\sum_j \max_i p_{ij} - \max_i p_{i+}}{1 - \max_i p_{i+}}. \qquad (22)
\]

The second is the symmetric lambda coefficient (λ) of Goodman and Kruskal [7]:

\[
\lambda = \frac{\sum_i \max_j p_{ij} + \sum_j \max_i p_{ij} - \max_j p_{+j} - \max_i p_{i+}}{2 - \max_j p_{+j} - \max_i p_{i+}}. \qquad (23)
\]

The third is the asymmetric uncertainty coefficient of Theil [4]:

\[
U(Y = c|X = r) = \frac{-\sum_i p_{i+} \ln(p_{i+}) + \bigl(-\sum_j p_{+j} \ln(p_{+j})\bigr) - \bigl(-\sum_i \sum_j p_{ij} \ln(p_{ij})\bigr)}{-\sum_j p_{+j} \ln(p_{+j})},
\]
\[
U(X = r|Y = c) = \frac{-\sum_i p_{i+} \ln(p_{i+}) + \bigl(-\sum_j p_{+j} \ln(p_{+j})\bigr) - \bigl(-\sum_i \sum_j p_{ij} \ln(p_{ij})\bigr)}{-\sum_i p_{i+} \ln(p_{i+})}. \qquad (24)
\]

The fourth measure of predictive association is U, Theil's symmetric uncertainty coefficient [4]:

\[
U = \frac{2\bigl[-\sum_i p_{i+} \ln(p_{i+}) + \bigl(-\sum_j p_{+j} \ln(p_{+j})\bigr) - \bigl(-\sum_i \sum_j p_{ij} \ln(p_{ij})\bigr)\bigr]}{-\sum_i p_{i+} \ln(p_{i+}) + \bigl(-\sum_j p_{+j} \ln(p_{+j})\bigr)}. \qquad (25)
\]

Table 3   The computation of the four measures of predictive association

           Y1      Y2      Y3      Y4     Total
X1         28      32      49      36      145
X2         44      21      17      12       94
X3         46      31      53      63      193
Total     118      84     119     111      432

We use the contingency table shown in Table 3 to illustrate, in turn, the computation of the four measures of predictive association. The first measure, asymmetric λ(Y|X), is interpreted as the relative decrease in the probability of incorrectly predicting the column variable Y between not knowing X and knowing X. As such, lambda can also be interpreted as a proportional reduction in variation measure – how much of the variance in one variable is accounted for by the other? In the equation for λ(Y|X), 1 − Σ_i max_j p_ij is the minimum probability of error from the prediction that Y is a function of X, and 1 − max_j p_+j is the minimum probability of error from a prediction that Y is a constant over X. For the data we obtain:

\[
\lambda(Y|X) = \frac{((49 + 44 + 63)/432) - (119/432)}{1 - (119/432)} = \frac{.36 - .28}{1 - .28} = .11. \qquad (26)
\]

Thus, there is an 11% improvement in predicting the column variable Y given the knowledge of the row variable X. Asymmetric λ has the range 0 ≤ λ(Y|X) ≤ 1. In general, λ(Y|X) ≠ λ(X|Y); λ(X|Y) is the relative decrease in the probability of incorrectly predicting the row variable X between not knowing Y and knowing Y – hence, the term 'asymmetric'. For these data,

\[
\lambda(X|Y) = \frac{((46 + 32 + 53 + 63)/432) - (193/432)}{1 - (193/432)} = \frac{.45 - .44}{1 - .44} = .02. \qquad (27)
\]

Symmetric λ is the average of the two asymmetric lambdas and has the range 0 ≤ λ ≤ 1. For our example,

\[
\lambda = \frac{.36 + .45 - .28 - .44}{2 - .28 - .44} = \frac{.09}{1.28} = .07. \qquad (28)
\]

Theil's asymmetric uncertainty coefficient, U(Y|X), is the proportion of uncertainty in the column variable Y that is explained by the row variable X, or, alternatively, U(X|Y) is the proportion of uncertainty in the row variable X that is explained by the column variable Y. The asymmetric uncertainty coefficient has the range 0 ≤ U(Y|X) ≤ 1. For U(Y|X), we obtain for these data:

\[
U(Y|X) = \frac{(1.06) + (1.38) - (2.40)}{1.38} = .03, \qquad (29)
\]

and for U(X|Y),

\[
U(X|Y) = \frac{(1.06) + (1.38) - (2.40)}{1.06} = .04. \qquad (30)
\]

The symmetric U is computed as

\[
U = \frac{2[(1.06) + (1.38) - (2.40)]}{(1.06) + (1.38)} = .03. \qquad (31)
\]
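The lambda and uncertainty coefficients of equations (22) through (31) can be recomputed directly from the Table 3 counts. The sketch below does so; note that because the text rounds the intermediate proportions to two decimals, its reported .11 and .02 correspond to the exact values .118 and .004 printed here, while the uncertainty coefficients agree with the rounded values above.

```python
# Goodman-Kruskal lambdas (eqs 22-23) and Theil's uncertainty coefficients
# (eqs 24-25) for the Table 3 counts.
from math import log

counts = [
    [28, 32, 49, 36],
    [44, 21, 17, 12],
    [46, 31, 53, 63],
]
N = sum(sum(row) for row in counts)
p = [[c / N for c in row] for row in counts]
p_row = [sum(row) for row in p]                      # p_i+
p_col = [sum(col) for col in zip(*p)]                # p_+j

row_maxes = sum(max(row) for row in p)
col_maxes = sum(max(col) for col in zip(*p))

lam_y_given_x = (row_maxes - max(p_col)) / (1 - max(p_col))
lam_x_given_y = (col_maxes - max(p_row)) / (1 - max(p_row))
lam_sym = (row_maxes + col_maxes - max(p_col) - max(p_row)) / (2 - max(p_col) - max(p_row))

H_row = -sum(q * log(q) for q in p_row)              # entropy of the row margin
H_col = -sum(q * log(q) for q in p_col)              # entropy of the column margin
H_joint = -sum(q * log(q) for row in p for q in row if q > 0)
mutual_info = H_row + H_col - H_joint

print(round(lam_y_given_x, 3), round(lam_x_given_y, 3), round(lam_sym, 3))
# 0.118 0.004 0.069
print(round(mutual_info / H_col, 2), round(mutual_info / H_row, 2),
      round(2 * mutual_info / (H_row + H_col), 2))
# 0.03 0.04 0.03
```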
Both the asymmetric and symmetric uncertainty coefficients have the range 0 ≤ U ≤ 1. The quantities in the numerators of the equations for the asymmetric lambda and uncertainty coefficients are interpreted as measures of variation for nominal responses: in the case of the lambda coefficients, the variation measure is called the Gini concentration, and the variation measure used in the uncertainty coefficients is called the entropy [12]. More specifically, in λ(Y|X), the numerator represents the variance of the Y or column variable:

\[
V(Y) = \sum_i \max_j p_{ij} - \max_j p_{+j}, \qquad (32)
\]

and in λ(X|Y), the numerator represents the variance of the X or row variable:

\[
V(X) = \sum_j \max_i p_{ij} - \max_i p_{i+}. \qquad (33)
\]

Analogously, the numerator of U(Y|X) is the variance of Y,

\[
V(Y) = -\sum_i p_{i+} \ln(p_{i+}) + \Bigl(-\sum_j p_{+j} \ln(p_{+j})\Bigr) - \Bigl(-\sum_i \sum_j p_{ij} \ln(p_{ij})\Bigr), \qquad (34)
\]

and the numerator of U(X|Y) is the variance of X,

\[
V(X) = -\sum_i p_{i+} \ln(p_{i+}) + \Bigl(-\sum_j p_{+j} \ln(p_{+j})\Bigr) - \Bigl(-\sum_i \sum_j p_{ij} \ln(p_{ij})\Bigr). \qquad (35)
\]

However, in these measures, there is a positive relationship between the number of categories and the magnitude of the variation, thus introducing ambiguity in evaluating both the variance of the variables and their relationship.

Measures of Association for Ordered Categorical Variables

Measures of Concordance/Discordance

A pair of observations is concordant if the observation that ranks higher on X also ranks higher on Y. A pair of observations is discordant if the observation that ranks higher on X ranks lower on Y (see Nonparametric Correlation (tau)). The number of concordant pairs is

\[
C = \sum_{i < i'} \sum_{j < j'} n_{ij} n_{i'j'}, \qquad (36)
\]

where the first summation is over all pairs of rows i < i', and the second summation is over all pairs of columns j < j'. The number of discordant pairs is

\[
D = \sum_{i < i'} \sum_{j > j'} n_{ij} n_{i'j'}, \qquad (37)
\]

where the first summation is over all rows i < i', and the second summation is over all columns j > j'. To illustrate the calculation of C and D, consider the 3 × 3 contingency table shown in Table 4, cross-classifying income with education. In this table, the number of concordant pairs is

\[
C = 302(331 + 250 + 155 + 185) + 105(250 + 185) + 409(155 + 185) + 331(185) = 524\,112. \qquad (38)
\]

Note that the process of identifying concordant pairs amounts to taking each cell frequency (e.g., 302) in turn, and multiplying that frequency by each of the cell frequencies 'southeast' of it (e.g., 331, 250, 155, 185).
Table 4   Income for three levels of education

                                        Income
Education                     Low     Medium     High     Total    Proportion
Less than high school         302        105       23       430       0.243
High school                   409        331      250       990       0.557
Greater than high school       15        155      185       355       0.200
Total                         726        591      458      1775       1.00
Proportion                  0.409      0.333    0.258      1.00
The number of discordant pairs is

\[
D = 23(409 + 331 + 15 + 155) + 105(409 + 15) + 250(15 + 155) + 331(15) = 112\,915. \qquad (39)
\]

Discordant pairs can readily be identified by taking each cell frequency (e.g., 23) in turn, and multiplying that cell frequency by each of the cell frequencies 'southwest' of it (e.g., 409, 331, 15, 155).

Let n_{i+} = Σ_j n_{ij} and n_{+j} = Σ_i n_{ij}. We can express the total number of pairs of observations as

\[
\frac{n(n - 1)}{2} = C + D + T_X + T_Y - T_{XY}, \qquad (40)
\]

where T_X = Σ_i n_{i+}(n_{i+} − 1)/2 is the number of pairs tied on the row variable X, T_Y = Σ_j n_{+j}(n_{+j} − 1)/2 is the number of pairs tied on the column variable Y, and T_XY = Σ_i Σ_j n_{ij}(n_{ij} − 1)/2 is the number of pairs from a common cell (tied on X and Y). In this formula for n(n − 1)/2, T_XY is subtracted because pairs tied on both X and Y have been counted twice, once in T_X and once in T_Y. Therefore,

\[
T_X = \frac{(430 \times 429) + (990 \times 989) + (355 \times 354)}{2} = 644\,625,
\]
\[
T_Y = \frac{(726 \times 725) + (591 \times 590) + (458 \times 457)}{2} = 542\,173,
\]
\[
T_{XY} = \frac{(302 \times 301) + (105 \times 104) + \cdots + (185 \times 184)}{2} = 249\,400,
\]
\[
\frac{n(n - 1)}{2} = (524\,112) + (112\,915) + (644\,625) + (542\,173) - (249\,400) = 1\,574\,425. \qquad (41)
\]

Kendall's tau Statistics. Several measures of concordance/discordance are based on the difference C − D. Kendall's three tau (τ) statistics are among the most well known of these, and can be applied to r × c contingency tables with r ≥ 2 and c ≥ 2. The row and column variables do not have to have the same number of categories. The tau statistics have the range −1 ≤ τ ≤ +1. The three tau statistics are [6]:
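The pair-counting in equations (36) through (41) is mechanical and easily checked. The sketch below counts concordant pairs, discordant pairs, and the three kinds of ties directly from the Table 4 frequencies and verifies the identity in equation (40); the variable names are illustrative.

```python
# Concordant pairs, discordant pairs, and ties (eqs 36-41) for Table 4.
table = [
    [302, 105, 23],    # less than high school
    [409, 331, 250],   # high school
    [15, 155, 185],    # greater than high school
]
R, K = len(table), len(table[0])
row_tot = [sum(r) for r in table]
col_tot = [sum(c) for c in zip(*table)]
n = sum(row_tot)

concordant = sum(table[i][j] * table[k][l]
                 for i in range(R) for j in range(K)
                 for k in range(i + 1, R) for l in range(j + 1, K))
discordant = sum(table[i][j] * table[k][l]
                 for i in range(R) for j in range(K)
                 for k in range(i + 1, R) for l in range(j))
T_x = sum(t * (t - 1) // 2 for t in row_tot)
T_y = sum(t * (t - 1) // 2 for t in col_tot)
T_xy = sum(c * (c - 1) // 2 for r in table for c in r)

print(concordant, discordant)                   # 524112 112915
print(T_x, T_y, T_xy)                           # 644625 542173 249400
print(n * (n - 1) // 2,
      concordant + discordant + T_x + T_y - T_xy)  # both 1574425
```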
\[
\tau_a = \frac{2(C - D)}{N(N - 1)},
\qquad
\tau_b = \frac{C - D}{\sqrt{(C + D + T_Y - T_{XY})(C + D + T_X - T_{XY})}},
\qquad
\tau_c = \frac{2m(C - D)}{N^2(m - 1)}, \qquad (42)
\]

where m is the number of rows or columns, whichever is smaller. For our data, the tau statistics are equal to

\[
\tau_a = \frac{2(524\,112 - 112\,915)}{1775(1775 - 1)} = .26,
\]
\[
\tau_b = \frac{524\,112 - 112\,915}{\sqrt{(524\,112 + 112\,915 + 542\,173 - 249\,400)(524\,112 + 112\,915 + 644\,625 - 249\,400)}} = .42,
\]
\[
\tau_c = \frac{3 \times 2(524\,112 - 112\,915)}{1775^2(3 - 1)} = .39. \qquad (43)
\]
One note of qualification is that tau-a assumes there are no tied observations; thus, strictly speaking, it is not applicable to contingency tables. It is generally more useful as a measure of rank correlation. For 2 × 2 tables, τb simplifies to the Pearson product-moment correlation obtained by assigning any scores to the rows and columns consistent with their orderings.

Goodman and Kruskal's Gamma. Goodman and Kruskal suggested yet another measure, gamma (γ), based on C − D [7]:

\[
\gamma = \frac{C - D}{C + D}. \qquad (44)
\]

For our example data, γ = (524 112 − 112 915)/(524 112 + 112 915) = .65. Gamma has the range −1 ≤ γ ≤ +1, equal to +1 when the data are concentrated in the upper-left to lower-right diagonal, and equal to −1 for the converse pattern. Although γ does equal zero when the two variables are independent, two variables can be completely dependent and still have a value of γ less than unity. For 2 × 2 tables, γ is equal to Yule's Q.
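Given the pair counts above, the tau statistics and gamma of equations (42) through (44) follow in a few lines. The sketch below simply reuses the values C = 524 112, D = 112 915, T_X = 644 625, T_Y = 542 173, and T_XY = 249 400 with N = 1775 and m = 3.

```python
# Kendall's taus (eqs 42-43) and Goodman-Kruskal's gamma (eq 44).
from math import sqrt

C, D = 524_112, 112_915
T_x, T_y, T_xy = 644_625, 542_173, 249_400
N, m = 1_775, 3

tau_a = 2 * (C - D) / (N * (N - 1))
tau_b = (C - D) / sqrt((C + D + T_y - T_xy) * (C + D + T_x - T_xy))
tau_c = 2 * m * (C - D) / (N ** 2 * (m - 1))
gamma = (C - D) / (C + D)

print(round(tau_a, 2), round(tau_b, 2), round(tau_c, 2), round(gamma, 2))
# 0.26 0.42 0.39 0.65
```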
Somers' d. Gamma and the tau statistics are symmetric measures. Somers proposed an alternative asymmetric measure [7]:
\[
d(Y|X) = \frac{C-D}{n(n-1)/2 - T_X}, \qquad (45)
\]
for predicting the column variable $Y$ from the row variable $X$, or
\[
d(X|Y) = \frac{C-D}{n(n-1)/2 - T_Y}, \qquad (46)
\]
for predicting $X$ from $Y$. For these data, we compute
\[
d(Y|X) = \frac{524\,112 - 112\,915}{1\,574\,425 - 644\,625} = .44, \qquad
d(X|Y) = \frac{524\,112 - 112\,915}{1\,574\,425 - 542\,173} = .40. \qquad (47)
\]
Somers' d can be interpreted as the difference between the proportions of concordant and discordant pairs, using only those pairs that are untied on the predictor variable. For $2 \times 2$ tables, Somers' d simplifies to the difference of proportions $n_{11}/n_{1+} - n_{21}/n_{2+}$ (for $d(Y|X)$), or to the difference of proportions $n_{11}/n_{+1} - n_{12}/n_{+2}$ (for $d(X|Y)$). Yet another interpretation of Somers' d is as a least-squares regression slope [4]. To see this more clearly, we rewrite $d(Y|X)$ as $d_{YX}$ and $d(X|Y)$ as $d_{XY}$, and note that
\[
\tau_b^2 = d_{YX} d_{XY}; \qquad (48)
\]
for our data, $.42^2 = (.44)(.40)$. Recalling that in least-squares regression
\[
r^2 = b_{YX} b_{XY}, \qquad (49)
\]
we see the analogy between $\tau_b^2$ and $r^2$ as coefficients of determination, and $d_{YX}$, $d_{XY}$ and $b_{YX}$, $b_{XY}$ as regression coefficients. In addition, we note that, as is true for $\gamma$, Somers' d equals zero when the two variables are independent, but is not necessarily equal to unity when the two variables are completely dependent.
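Using the same quoted totals, Somers' d in both directions and the identity in (48) can be verified directly (again a sketch rather than part of the entry):

```python
C, D = 524_112, 112_915
TX, TY = 644_625, 542_173
total_pairs = 1_574_425                 # n(n - 1) / 2 from (41)

d_yx = (C - D) / (total_pairs - TX)     # predicting Y from X, equation (45)
d_xy = (C - D) / (total_pairs - TY)     # predicting X from Y, equation (46)

print(round(d_yx, 2), round(d_xy, 2))   # 0.44 0.4
print(round((d_yx * d_xy) ** 0.5, 2))   # 0.42, i.e. tau_b, as in (48)
```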
Measures Based on Derived Scores

Some methods for measuring the association between ordinal variables require assigning scores to the levels of the ordinal variables (see Ordinal Regression Models). When a contingency table is involved, scale values are assigned to row and column categories, the data are treated as a grouped frequency distribution, and the Pearson product-moment correlation is computed. Specifically, let $x_i$ and $y_j$ be the values assigned to the rows and columns. The Pearson product-moment correlation for grouped data is then
\[
r = \frac{CP_{XY}}{\sqrt{SS_X SS_Y}}, \qquad (50)
\]
where $CP_{XY}$ is the sum of cross products,
\[
CP_{XY} = \sum_{i,j} x_i y_j n_{ij} - \frac{\left(\sum_i x_i n_{i+}\right)\left(\sum_j y_j n_{+j}\right)}{n_{++}}, \qquad (51)
\]
and $SS_X$ and $SS_Y$ are the sums of squares,
\[
SS_X = \sum_i x_i^2 n_{i+} - \frac{\left(\sum_i x_i n_{i+}\right)^2}{n_{++}}, \qquad
SS_Y = \sum_j y_j^2 n_{+j} - \frac{\left(\sum_j y_j n_{+j}\right)^2}{n_{++}}. \qquad (52)
\]
Therefore, if for our $3 \times 3$ contingency table for income and education we assign the values '1', '2', and '3' to the three levels of each variable, we have
\[
CP_{XY} = 6863 - \frac{(3475)(3282)}{1775} = 437.68,
\]
\[
SS_X = 7585 - \frac{(3475)^2}{1775} = 781.83, \qquad
SS_Y = 7212 - \frac{(3282)^2}{1775} = 1143.54,
\]
\[
r = \frac{437.68}{\sqrt{(781.83)(1143.54)}} = .46. \qquad (53)
\]
The value of $r = .46$ is also known as the Spearman rank correlation coefficient $r_S$, sometimes referred to as Spearman's rho [8]. The Spearman rank correlation coefficient can also be computed by using
\[
r_S = \frac{6\sum d^2}{n(n^2-1)}, \qquad (54)
\]
where $d$ is the difference in the rank scores between $X$ and $Y$. Computed by using this formula,
\[
r_S = \frac{6(20\,706)^2}{1775(1775^2-1)} = .46. \qquad (55)
\]
Instead of using the rank scores of '1', '2', and '3' directly, we can use ridit scores [2]. Ridit scores are cumulative probabilities; each ridit score represents the proportion of observations below category $j$ plus half the proportion within $j$. We illustrate the calculation of ridit scores using the income and education contingency table. Beneath each marginal total is the proportion of observations in the row or column. The ridit score for the low income category is $.5(.409) = .205$; for medium income, $.409 + .5(.333) = .576$; and for high income, the ridit score is $.409 + .333 + .5(.258) = .871$. The ridit score for less than high school education is $.5(.243) = .122$; for high school education, $.243 + .5(.557) = .522$; and for greater than high school education, the ridit score is $.243 + .557 + .5(.200) = .900$. Applying the Pearson product-moment correlation formula to the ridit scores, we obtain $r_S = .47$. Scores for the ordered levels of a categorical variable can also be derived by assuming that the variables have an underlying bivariate normal distribution. Normal scores computed from $2 \times 2$ contingency tables produce the tetrachoric correlation; when computed from larger tables, the correlation is known as the polychoric correlation [3]. The tetrachoric/polychoric correlation is the maximum likelihood estimate of the Pearson product-moment correlation between the bivariate normally distributed variables. The polychoric correlation between education and income is .57.
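The score-based measures lend themselves to the same treatment. The sketch below computes the grouped-data Pearson correlation of (50) to (52) for user-supplied scores, and derives ridit scores from the marginal proportions as described above. NumPy is assumed, and the 3 × 3 table is hypothetical because the entry's income-by-education table is not reproduced on this page; applied to that table with integer or ridit scores, it should reproduce the .46 and .47 reported above.

```python
import numpy as np

def grouped_pearson(table, x_scores, y_scores):
    """Pearson correlation for a contingency table with scores assigned
    to the row (x) and column (y) categories, as in (50)-(52)."""
    t = np.asarray(table, dtype=float)
    x = np.asarray(x_scores, dtype=float)
    y = np.asarray(y_scores, dtype=float)
    n = t.sum()
    cp = (np.outer(x, y) * t).sum() - (x @ t.sum(1)) * (y @ t.sum(0)) / n
    ssx = (x**2 @ t.sum(1)) - (x @ t.sum(1))**2 / n
    ssy = (y**2 @ t.sum(0)) - (y @ t.sum(0))**2 / n
    return cp / np.sqrt(ssx * ssy)

def ridit_scores(margin):
    """Proportion below each category plus half the proportion within it."""
    p = np.asarray(margin, dtype=float) / np.sum(margin)
    return np.cumsum(p) - 0.5 * p

table = np.asarray([[20, 10, 5],
                    [10, 30, 10],
                    [5, 10, 25]], dtype=float)   # hypothetical 3 x 3 table

r_int = grouped_pearson(table, [1, 2, 3], [1, 2, 3])              # integer scores
r_rid = grouped_pearson(table, ridit_scores(table.sum(1)),
                        ridit_scores(table.sum(0)))               # ridit scores
print(round(r_int, 2), round(r_rid, 2))
```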
References

[1] Agresti, A. (1984). Analysis of Ordinal Categorical Data, Wiley, New York.
[2] Bross, I.D.J. (1958). How to use ridit analysis, Biometrics 14, 18–38.
[3] Drasgow, F. (1986). Polychoric and polyserial correlations, in Encyclopedia of Statistical Sciences, Vol. 7, S. Kotz & N.L. Johnson, eds, Wiley, New York, pp. 68–74.
[4] Everitt, B.S. (1992). The Analysis of Contingency Tables, 2nd Edition, Chapman & Hall, Boca Raton.
[5] Fleiss, J.L. Statistical Methods for Rates and Proportions, 2nd Edition, Wiley, New York.
[6] Gibbons, J.D. (1993). Nonparametric Measures of Association, Sage Publications, Newbury Park.
[7] Goodman, L.A. & Kruskal, W.H. (1972). Measures of association for cross-classification, IV, Journal of the American Statistical Association 67, 415–421.
[8] Hildebrand, D.K., Laing, J.D. & Rosenthal, H. (1977). Analysis of Ordinal Data, Sage Publications, Newbury Park.
[9] Liebetrau, A.M. (1983). Measures of Association, Sage Publications, Newbury Park.
[10] Reynolds, H.T. (1984). Analysis of Nominal Data, Sage Publications, Newbury Park.
[11] Rudas, T. (1998). Odds Ratios in the Analysis of Contingency Tables, Sage Publications, Thousand Oaks.
[12] Wickens, T.D. (1986). Multiway Contingency Tables Analysis for the Social Sciences, Lawrence Erlbaum, Hillsdale.
SCOTT L. HERSHBERGER AND DENNIS G. FISHER
Median
DAVID CLARK-CARTER
Volume 3, pp. 1192–1193
Median

The median is a measure of central tendency (or location). It is defined as the point in an ordered set of data such that 50% of scores fall below it and 50% above it. The median is thus the 50th percentile and the second quartile (Q2). To find the median, we first place a set of scores in order of their size and then locate the value that is in the middle of the ordered set. This is straightforward when there is an odd number of scores in the set, as the median will be the member of the set that has as many scores below it as above it. For example, for the set of seven numbers 1, 3, 4, 5, 7, 9, and 11, 5 is the median and there are three numbers below it and three above it. Formally, when the number of scores in the set (N) is odd, the median is the [(N + 1)/2]th value in the set. Accordingly, for the above-mentioned data, the median is the 4th in the ordered set, which is, as before, 5. When there is an even number of values in the set, the median is taken to be the mean of the (N/2)th and the [N/2 + 1]th values. Suppose that the above-mentioned set had the value 15 added to it so that it now contained eight numbers; then the median would be the mean of the 4th and 5th scores in the ordered set, that is, the mean of 7 and 9, which is 8.

When data are presented as frequencies within ranges rather than as individual values, it is possible to estimate the value of the median using interpolation. Suppose that we want to find the median age of a set of 100 young people from the data presented in Table 1.

Table 1   Frequency table of ages of 100 young people

Age in years   Frequency   Cumulative frequency
0 to 4         26          26
5 to 9         22          48
10 to 14       35          83
15 to 19       17          100

It can be seen quite easily that the median is in the age range 10 to 14 years. However, as we have no other information, we are forced to assume that the ages of the 35 people in that age range are evenly distributed across it. Given that assumption, we can see that the median age is near the beginning of that range, as the 48th percentile (from the cumulative frequency) is in the age range 5 to 9. One equation for calculating an approximate value for the median in such a situation is
\[
\text{Median} = L_m + C_m \times \frac{\tfrac{1}{2}N - F_{m-1}}{F_m}, \qquad (1)
\]
where $L_m$ is the lowest value in the range that contains the median, $C_m$ is the width of the range that contains the median, $F_{m-1}$ is the cumulative frequency below the range that contains the median, $F_m$ is the frequency within the range that contains the median, and $N$ is the total sample size. Therefore, for the tabulated data, the median would be found from
\[
10 + 5 \times \frac{\tfrac{1}{2} \times 100 - 48}{35} = 10.29. \qquad (2)
\]
An advantage of the median is that it is not affected by extreme scores; for example, if instead of adding 15 to the original 7-item set above we added 100, then the median would still be 8. However, as with the mean, with discrete distributions a data set containing an even number of values can produce a median that is not a possible value of the variable, for example, if it was found that the median number of people in a family is 3.5. The median is an important robust measure of location for exploratory data analysis (EDA) and is a key element of a box plot [1].
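A minimal sketch of equation (1) in Python (the function and argument names are my own, not from the entry):

```python
def interpolated_median(lower_bounds, widths, freqs):
    """Approximate median for grouped frequency data, following equation (1)."""
    n = sum(freqs)
    cum = 0
    for L, C, F in zip(lower_bounds, widths, freqs):
        if cum + F >= n / 2:                   # this class contains the median
            return L + C * (n / 2 - cum) / F
        cum += F

# Ages of 100 young people from Table 1 (classes 0-4, 5-9, 10-14, 15-19).
print(interpolated_median([0, 5, 10, 15], [5, 5, 5, 5], [26, 22, 35, 17]))
# 10.2857..., i.e. 10.29 as in equation (2)
```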
Reference

[1] Cleveland, W.S. (1985). The Elements of Graphing Data, Wadsworth, Monterey.
DAVID CLARK-CARTER
Median Absolute Deviation
DAVID C. HOWELL
Volume 3, pp. 1193–1193
Median Absolute Deviation

The median absolute deviation (M.A.D.) is a robust measure of scale, usually defined as
\[
\mathrm{M.A.D.} = \operatorname{median}\left|X_i - \operatorname{median}(X)\right|. \qquad (1)
\]
We can see that the M.A.D. is simply the median distance of observations from the median of the distribution. For a normally distributed variable, the M.A.D. is equal to 0.6745 times the standard deviation (SD) and 0.8453 times the average deviation (AD). (For distributions of the same shape, these estimators are linearly related.) The M.A.D. becomes increasingly smaller than the standard deviation for distributions with thicker tails. The major advantage of the M.A.D. over the AD, both of which are more efficient than the SD for even slightly heavier than normal distributions, is that the M.A.D. is more robust. This led Mosteller and Tukey [1] to class the M.A.D. and the interquartile range as robust measures of scale, as 'the best of an inferior lot.'
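A brief illustration of the definition and of the 0.6745 relationship for normal data, assuming NumPy is available (the scaling check is sample-based and therefore approximate):

```python
import numpy as np

def mad(x):
    """Median absolute deviation: median of |x_i - median(x)|."""
    x = np.asarray(x, dtype=float)
    return np.median(np.abs(x - np.median(x)))

rng = np.random.default_rng(1)
z = rng.normal(loc=0.0, scale=1.0, size=100_000)
print(mad(z))                         # close to 0.6745 for a standard normal
print(mad([1, 3, 4, 5, 7, 9, 11]))    # 2.0 for the 7-number set in the Median entry
```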
Reference

[1] Mosteller, F. & Tukey, J.W. (1977). Data Analysis and Regression: A Second Course in Statistics, Addison-Wesley, Reading.
DAVID C. HOWELL
Median Test
SHLOMO SAWILOWSKY AND CLIFFORD E. LUNNEBORG
Volume 3, pp. 1193–1194
Median Test

This test for the equivalence of the medians of k ≥ 2 sampled populations was introduced by Mood in 1950 [3]. The null hypothesis is that the k populations have the same median and the alternative is that at least two of the populations have differing medians. The test requires random samples to be taken independently from the k populations and is based on the following assumption: if all the populations have a common median, then the probability that a randomly sampled value will lie above the grand sample median will be the same for each of the populations [1]. To carry out the test, one first finds the grand sample median, the median of the combined samples. Then, the sample data are reduced to a two-way table of frequencies. The k columns of the table correspond to the k samples and the two rows correspond to sample observations falling above and at or below the grand sample median.

As an example, [2] reports the results of an experiment on the impact of caffeine on finger tapping rates. Volunteers were trained at the task and then randomized to receive a drink containing 0 mg, 100 mg, or 200 mg caffeine. Two hours after consuming the drink, the rates of tapping, given in Table 1, were recorded.

Table 1   Caffeine and finger tapping rates

Dosage (mg)   Tapping speeds
0             242, 245, 244, 248, 247, 248, 242, 244, 246, 242
100           248, 246, 245, 247, 248, 250, 247, 246, 243, 244
200           246, 248, 250, 252, 248, 250, 246, 248, 245, 250

The grand sample median for these data is a value between 246 and 247. By custom, it is taken to be midway, at 246.5. Table 2 describes how the 30 tapping speeds, 10 in each caffeine dosage group, are distributed about this grand median.

Table 2   Distribution of tapping speeds about the grand median

                     0 mg   100 mg   200 mg
Above grand median   3      5        7
Below grand median   7      5        3

Where, as here, k > 2, the test is completed as a chi-squared test of independence in a two-way table. The resulting P value is 0.2019. As the expected cell frequencies are small in this example, the result is only approximate. Where k = 2, Fisher's exact test (see Exact Methods for Categorical Data) can be used to test for independence. As the name implies, the resulting P value is exact rather than approximate.
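A sketch of the whole procedure for the caffeine data, assuming SciPy is available, is given below; it reproduces the counts in Table 2 and the quoted P value of 0.2019.

```python
import numpy as np
from scipy.stats import chi2_contingency

groups = {
    0:   [242, 245, 244, 248, 247, 248, 242, 244, 246, 242],
    100: [248, 246, 245, 247, 248, 250, 247, 246, 243, 244],
    200: [246, 248, 250, 252, 248, 250, 246, 248, 245, 250],
}

pooled = np.concatenate(list(groups.values()))
grand_median = np.median(pooled)                      # 246.5

# Rows: above versus at-or-below the grand median; columns: dosage groups.
table = np.array(
    [[np.sum(np.array(g) > grand_median) for g in groups.values()],
     [np.sum(np.array(g) <= grand_median) for g in groups.values()]])

chi2, p, df, expected = chi2_contingency(table, correction=False)
print(table)                          # rows [3 5 7] above and [7 5 3] below
print(round(chi2, 2), round(p, 4))    # 3.2 0.2019
```

SciPy's scipy.stats.median_test wraps the same steps in a single call.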
References

[1] Conover, W.J. (1999). Practical Nonparametric Statistics, 3rd Edition, Wiley, New York.
[2] Draper, N.R. & Smith, H. (1981). Applied Regression Analysis, 2nd Edition, Wiley, New York.
[3] Mood, A.M. (1950). Introduction to the Theory of Statistics, McGraw-Hill, New York.

(See also Distribution-free Inference, an Overview)

SHLOMO SAWILOWSKY AND CLIFFORD E. LUNNEBORG
Mediation
DAVID A. KENNY
Volume 3, pp. 1194–1198
Mediation

Mediation presumes a causal chain. As seen in Figure 1, the variable X causes M (path a) and M in turn causes Y (path b). The variable M is said to mediate the X to Y relationship. Complete mediation is defined as the case in which variable X no longer affects Y after M has been controlled, and so path c′ is zero. Partial mediation is defined as the case in which the effect of X on Y is reduced when M is introduced, but c′ is not equal to zero. The path from X to Y without M being controlled for is designated here as c. A mediator is sometimes called an intervening variable and is said to elaborate a relationship.

Figure 1   Basic mediational model: the variable M mediates the X to Y relationship (X → M by path a, M → Y by path b, and X → Y directly by path c′)

Given the model in Figure 1, there are, according to Baron and Kenny [1], four steps in a mediational analysis. They are as follows.

Step 1: Show that X predicts Y. Use Y as the criterion variable in a regression equation and X as a predictor. This step establishes that X is effective and that there is some effect that may be mediated.

Step 2: Show that X predicts the mediator, M. Use M as the criterion variable in a regression equation and X as a predictor (estimate and test path a in Figure 1). This step essentially involves treating the mediator as if it were an outcome variable.

Step 3: Show that the mediator, M, predicts Y. Use Y as the criterion variable in a regression equation and X and M as predictors (estimate and test path b in Figure 1). It is insufficient just to correlate the mediator with Y because the mediator and Y may both be caused by X. The effect of X must be controlled in establishing the effect of the mediator on Y.

Step 4: To establish that M completely mediates the X−Y relationship, the effect of X on Y controlling for M should be zero (estimate and test path c′ in Figure 1). The effects in both Steps 3 and 4 are estimated in the same regression equation where Y is the criterion and X and M are the predictors.

If all of these steps are met, then the data are consistent with the hypothesis that M completely mediates the X−Y relationship. Partial mediation is demonstrated by meeting all but Step 4. It can sometimes happen that Step 1 fails because of a suppressor effect (c′ and ab have different signs). Hence, the essential steps in testing for mediation are 2 and 3. Meeting these steps does not, however, conclusively prove the mediation hypothesis because there are other models that are consistent with the data. So for instance, if one makes the mediator the outcome and vice versa, the results may look like 'mediation'. Design (e.g., measuring M before Y) and measurement (e.g., not using self-report to measure M and Y) features can strengthen the confidence that the model in Figure 1 is correct.

If both Steps 2 and 3 are met, then necessarily the effect of X on Y is reduced when M is introduced. Theoretically, the amount of reduction, also called the indirect effect, or c − c′, equals ab. So, the amount of reduction in the effect of X on Y is ab and not the change in an inferential statistic such as F, p, or variance explained. Mediation or the indirect effect equals, in principle, the reduction in the effect of X on Y when the mediator is introduced. The fundamental equation in mediational analysis is
\[
\text{Total effect} = \text{Direct effect} + \text{Indirect effect}, \qquad (1)
\]
or mathematically,
\[
c = c' + ab. \qquad (2)
\]
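The four steps amount to a handful of ordinary least-squares regressions. The sketch below, using simulated data and the statsmodels package (an assumption; the entry prescribes no software), estimates c, a, b, and c′ and illustrates the identity c − c′ = ab of equation (2). The variable names and effect sizes are illustrative only.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=n)
M = 0.5 * X + rng.normal(size=n)              # path a
Y = 0.4 * M + 0.2 * X + rng.normal(size=n)    # paths b and c'

def ols(y, *predictors):
    """Fit an OLS regression of y on the given predictors plus an intercept."""
    return sm.OLS(y, sm.add_constant(np.column_stack(predictors))).fit()

c = ols(Y, X).params[1]            # Step 1: total effect of X on Y
a = ols(M, X).params[1]            # Step 2: X -> M
fit_xm = ols(Y, X, M)
c_prime = fit_xm.params[1]         # Step 4: direct effect of X controlling for M
b = fit_xm.params[2]               # Step 3: M -> Y controlling for X

print(round(c - c_prime, 4), round(a * b, 4))   # identical for OLS, as in (2)
```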
When multiple linear regression is used, ab exactly equals c − c′, regardless of whether standardized or unstandardized coefficients are used (see Standardized Regression Coefficients). However, for logistic regression, multilevel modeling (see Generalized Linear Mixed Models), and structural equation modeling, ab only approximately equals c − c′. In such cases, it is probably advisable to estimate c by ab + c′ instead of estimating it directly.

There are two major ways to evaluate the null hypothesis that ab = 0. The simple approach is to test each of the two paths individually, and if both are statistically significant, then mediation is indicated [4]. Alternatively, many researchers use a test developed by Sobel [11] that involves the standard error of ab. It requires the standard error of a, or $s_a$ (which equals $a/t_a$, where $t_a$ is the t test of coefficient a), and the standard error of b, or $s_b$ (which equals $b/t_b$). The standard error of ab can be shown to equal approximately $\sqrt{b^2 s_a^2 + a^2 s_b^2}$. The test of the indirect effect is given by dividing ab by the square root of the above variance and treating the ratio as a Z test (i.e., larger than 1.96 in absolute value is significant at the 0.05 level, two tailed). The power of the Sobel test is very low and the test is very conservative, yielding too few Type I errors. For instance, in a simulation conducted by Hoyle and Kenny [2], in one condition the rate of rejecting the null hypothesis was sometimes found to be below 0.05 even when the indirect effect was not zero. Hoyle and Kenny [2] recommend having at least 200 cases for this test. Several groups of researchers (e.g., [7, 9]) are seeking an alternative to the Sobel test that is not so conservative.
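A hedged sketch of the Sobel test just described, with SciPy assumed for the normal tail probability; the path estimates and standard errors passed in at the bottom are hypothetical.

```python
import math
from scipy.stats import norm

def sobel_test(a, sa, b, sb):
    """First-order Sobel z test for the indirect effect ab."""
    se_ab = math.sqrt(b**2 * sa**2 + a**2 * sb**2)
    z = (a * b) / se_ab
    p = 2 * norm.sf(abs(z))       # two-tailed P value
    return z, p

# Hypothetical path estimates and standard errors:
print(sobel_test(a=0.50, sa=0.08, b=0.40, sb=0.10))
```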
The mediation model in Figure 1 may not be properly specified. If the variable X is an intervention, then it can be assumed to be measured without error and that it may cause M and Y, and not vice versa. If the mediator has measurement error (i.e., has less than perfect reliability), then it is likely that effects are biased. The effect of the mediator on the outcome (path b) is underestimated and the effect of the intervention on the outcome (path c′) tends to be overestimated (given that ab is positive). The overestimation of this path is exacerbated to the extent to which the mediator is caused by the intervention. To remove the biasing effect of measurement error, multiple indicators of the mediator can be found and a latent variable approach can be employed. One would need to use a structural equation modeling program (e.g., AMOS or LISREL – see Structural Equation Modeling: Software) to estimate such a model. If a latent variable approach is not used, the researcher needs to demonstrate that the reliability of the mediator is very high.

Another specification error is that the mediator may be caused by the outcome variable (Y causes M). Because the intervention is typically a manipulated variable, the direction of causation from it is known. However, because both the mediator and the outcome variables are not manipulated variables, they may both cause each other. For instance, it might be the case that the 'outcome' may actually mediate the effect of the intervention on the 'mediator'. Generally, reverse causal effects are ruled out theoretically. That is, a causal effect from Y to M does not make theoretical sense. Design considerations may also lessen the plausibility of reverse causation effects. The mediator should be measured prior to the outcome variable, and efforts should be made to determine that the two do not share method effects (e.g., both self-reports from the same person). If one can assume that c′ is zero, then reverse causal effects can be estimated. That is, if one assumes that there is complete mediation (X does not cause Y), one can allow for the mediator to cause the outcome and the outcome to cause the mediator. Smith [10] discussed a method to allow for estimation of reverse causal effects. In essence, both the mediator and the outcome variable are treated as outcome variables and they each may mediate the effects of the other. To be able to employ the Smith approach, there must be, for both the mediator and the outcome variable, a variable that is known to cause each of them but does not cause the other.

Another specification error is that M does not cause Y, but rather some unmeasured variable causes both variables. For instance, the covariation between M and Y is due to social desirability. Other names for spuriousness are confounding, omitted variables, selection, and the third-variable problem. The only real solution for spuriousness (besides randomization) is to measure potentially confounding variables. Thus, if there are concerns about social desirability, that variable should be measured and controlled for in the analysis. One can also estimate how strong spuriousness must be to explain the effect [8].

If M is a successful mediator, it is then necessarily correlated with the intervention (path a) and so there will be collinearity. This will affect the precision of the estimates of the last set of regression equations. Thus, the power of the test of the coefficients b and c′ will be compromised. (The effective sample size for these
tests is approximately $N(1 - a^2)$, where N is the overall sample size and a is a standardized path.) Therefore, if M is a strong mediator (path a), to achieve equivalent power the sample size might have to be larger than what it would be if M were a weak mediator. Sometimes the mediator is chosen too close to the intervention, and path a is relatively large and b small. Such a proximal mediator can sometimes be viewed as a manipulation check. Alternatively, sometimes the mediator is chosen too close to the outcome (a distal mediator), and so b is large and a is small. Ideally, the standardized values of a and b should be nearly equal. Even better in terms of maximizing power, standardized b should be somewhat larger than standardized a. The net effect is then that sometimes the power of the test of ab can be increased by lowering the size of a, thereby lowering the size of ab. Bigger is not necessarily better.

The model in Figure 1 may be more complicated in that there may be additional variables. There might be covariates and there might be multiple interventions (Xs), mediators (Ms), or outcomes (Ys). A covariate is a variable that a researcher wishes to control for in the analysis. Such variables might be background variables (e.g., age and gender). They would be included in all of the regression analyses. If such variables are allowed to interact with X or M, they would become moderator variables.

Consider the case in which there is X1 and X2. For instance, these variables might be two different components of a treatment. It could then be determined if M mediates the X1−Y and X2−Y relationships. So in Step 1, Y is regressed on X1 and X2. In Step 2, M is regressed on X1 and X2. Finally, in Steps 3 and 4, Y is regressed on M, X1, and X2. Structural equation modeling might be used to conduct simultaneous tests.

Consider the case in which M1 and M2 mediate the X−Y relationship. For Step 2, two regressions are performed, one for M1 and one for M2. For Steps 3 and 4, one can test each mediator separately or both in combination. There is greater parsimony in a combined analysis, but power may suffer if the two mediators are correlated. Of course, if they are strongly correlated, the researcher might consider combining them into a single variable or having the two serve as indicators of the same latent variable.

Consider the case in which there are two outcome variables, Y1 and Y2. The Baron and Kenny steps would be done for each outcome. Step 2 would need
to be done only once since Y is not involved in that regression.

Sometimes the interactive effect of two independent variables on an outcome variable is mediated by the effect of a process variable. For instance, an intervention may be more effective for men than for women. All the steps would be repeated, but now X would not be a main effect but an interaction. The variable X would be included in the analysis, but its effect would not be the central focus. Rather, the focus would be on the interaction of X. Sometimes a variable acts as a mediator for some groups (e.g., males) and not others (e.g., females). There are two different ways in which the mediation may be moderated: the mediator interacts with some variable to cause the outcome (path b varies), or the intervention interacts with a variable to cause the mediator (path a varies). In this case, all the variables cause one another.

For simple models, one is better off using multiple regression, as Structural Equation Modeling (SEM) tests are only approximate. However, if one wants to test relatively complex hypotheses (that M1, M2, and M3 do not mediate the X−Y relationship; that M mediates the X to Y1 and X to Y2 relationships; that M mediates the X1 to Y and X2 to Y relationships; or that a equals b), then SEM can be a useful method. Some structural equation modeling computer programs provide measures and tests of indirect effects. In some cases, the variables X, M, or Y are measured with error, and multiple indicators are used to measure them. SEM can be used to perform the Baron and Kenny steps. However, the following difficulty arises: the measurement model would be somewhat different each time that the model is estimated. Thus, c and c′ are not directly comparable because the meaning of Y varies. If, however, the measurement model were the same in both models (e.g., the loadings were fixed to one in both analyses), the analyses would be comparable.

Sometimes, X, M, and Y are measured repeatedly for each person. Judd, Kenny, and McClelland [3] discuss the case in which there are two measurements per participant. They discuss creating difference scores for X, M, and Y, and then examine the extent to which the differences in X and M predict differences in Y. If the difference in X affects the difference in Y less when the difference in M is introduced, there is evidence of mediation. Alternatively, mediational models can be tested using multilevel modeling [5]. Within this approach, the effect of X and M on Y can be estimated, assuming there is a sufficient number of observations for each person. The advantages of multilevel modeling over the previously described difference-score approach are several: the estimate of the effect is more precise; a statistical evaluation of whether mediation differs by persons is possible; and missing data and unequal observations are less problematic. Within multilevel modeling, it does not happen that c exactly equals c′ + ab; the decomposition only approximately holds. Krull and MacKinnon [6] discuss the degree of difference in these two estimates. One potential advantage of a multilevel mediational analysis is that one can test whether the mediation effects vary by participant. So M might be a stronger mediator for some persons than for others. If the mediator were coping style and the causal effect were the effect of stress on mood, it might be that a particular coping style is more effective for some people than for others (moderated mediation).

In conclusion, mediational analysis appears to be a simple task. A series of multiple regression analyses are performed, and, given the pattern of statistically significant regression coefficients, mediation can be inferred. In actuality, mediational analysis is not formulaic, and careful attention needs to be given to questions concerning the correct specification of the causal model, a thorough understanding of the area of study, and knowledge of the measures and design of the study.
References

[1] Baron, R.M. & Kenny, D.A. (1986). The moderator-mediator variable distinction in social psychological research: conceptual, strategic and statistical considerations, Journal of Personality and Social Psychology 51, 1173–1182.
[2] Hoyle, R.H. & Kenny, D.A. (1999). Statistical power and tests of mediation, in Statistical Strategies for Small Sample Research, R.H. Hoyle, ed., Sage Publications, Newbury Park.
[3] Judd, C.M., Kenny, D.A. & McClelland, G.H. (2001). Estimating and testing mediation and moderation in within-participant designs, Psychological Methods 6, 115–134.
[4] Kenny, D.A., Kashy, D.A. & Bolger, N. (1998). Data analysis in social psychology, in The Handbook of Social Psychology, Vol. 1, 4th Edition, D. Gilbert, S. Fiske & G. Lindzey, eds, McGraw-Hill, Boston, pp. 233–265.
[5] Kenny, D.A., Korchmaros, J.D. & Bolger, N. (2003). Lower level mediation in multilevel models, Psychological Methods 8, 115–128.
[6] Krull, J.L. & MacKinnon, D.P. (1999). Multilevel mediation modeling in group-based intervention studies, Evaluation Review 23, 418–444.
[7] MacKinnon, D.P., Lockwood, C.M., Hoffman, J.M., West, S.G. & Sheets, V. (2002). A comparison of methods to test the significance of the mediated effect, Psychological Methods 7, 82–104.
[8] Mauro, R. (1990). Understanding L.O.V.E. (left out variables error): a method for estimating the effects of omitted variables, Psychological Bulletin 108, 314–329.
[9] Shrout, P.E. & Bolger, N. (2002). Mediation in experimental and nonexperimental studies: new procedures and recommendations, Psychological Methods 7, 422–445.
[10] Smith, E.R. (1982). Beliefs, attributions, and evaluations: nonhierarchical models of mediation in social cognition, Journal of Personality and Social Psychology 43, 348–259.
[11] Sobel, M.E. (1982). Asymptotic confidence intervals for indirect effects in structural equation models, in Sociological Methodology 1982, S. Leinhardt, ed., American Sociological Association, Washington, pp. 290–312.
(See also Structural Equation Modeling: Multilevel; Structural Equation Modeling: Nonstandard Cases)

DAVID A. KENNY
Mendelian Genetics Rediscovered
DAVID DICKINS
Volume 3, pp. 1198–1204
Mendelian Genetics Rediscovered

Introductory Remarks

The schoolchild's notion of the 'rediscovery of Mendel' is that

1. Mendel was the first person to make systematic studies of heredity.
2. He used large enough numbers and mathematical skill to get firm results from which his famous laws could be derived.
3. Because Mendel worked in a monastery in Moravia and published in an obscure journal, Darwin (whose own understanding of genetics was incorrect, thereby leaving a logical flaw in his theory of evolution) never got to discover this work.
4. By coincidence, three scientists, De Vries in Holland, Correns in Germany, and Tschermak in Austria, all independently arrived at the same laws of heredity in 1900, and only then realized Mendel had achieved this 34 years earlier.
5. Rediscovery of these laws paved the way for a complete acceptance of Darwin's theory, with natural selection as the primary mechanism of evolutionary change.

All of these need correction.
Mendel was the First Person to Make Systematic Studies of Heredity

The first of such experiments should probably be associated with the name of Joseph Gottlieb Koelreuter, a German botanist who made seminal contributions to plant hybridization [22] a century before Mendel. Cyrill Napp, a previous abbot of the monastery, of which Mendel himself eventually became abbot, through contact with sheep-breeders, realized the economic importance to the locality of understanding heredity, and sent Mendel to the University of Vienna to be trained in experimental science1 [21]. In Vienna, Mendel received a very good scientific education in biology and especially in Physics: he had courses from Doppler and von Ettinghausen, and
the latter, in particular, gave him the methodological sophistication to deliberately employ a hypothetico-deductive approach. His Botany professor was Franz Unger, who was a forerunner of Darwin, and who published an evolutionary theory in an Attempt of a History of the Plant World in 1852. Mendel reported it was Unger's ponderings on how new species arose from old that prompted him to start his experiments. As the title of his famous paper [13, 14] suggests, rather than seeking some general laws of heredity, Mendel was actually trying to study hybridization, in the tradition, started by Linnaeus, that this might lead to the formation of new species. This was not a mechanism for substantive, macroevolutionary change, and some [5] have doubted whether Mendel actually believed in evolution as an explanation for the entirety of living organisms. He had a copy, annotated by him, of the sixth edition of the Origin of Species [2] (which is preserved in the museum in Brno), but certainly Darwin's idea of natural selection being the main mechanism of evolutionary change did not drive Mendel to search for appropriate units of heredity on which this process could work. Using the garden pea (Pisum sativum), Mendel sought contrasting pairs of definitive characteristics, producing a list of seven, namely:

1. ripe seeds smooth or wrinkled;
2. cotyledon yellow or green;
3. seed coat white or grey;
4. ripe pod smooth and not constricted anywhere, or wrinkled and constricted between the seeds;
5. unripe pod green or vivid yellow;
6. lots of flowers along the main stem, or only at the end;
7. stem tall (6–7 ft) or short (9–18 in.).

When members of these contrasting pairs were crossed, he found a 3 : 1 ratio in their offspring for that pair of characters. When they were then self-pollinated, the less frequent variant, and one third of the more frequent variant, bred true (gave only offspring the same as themselves), whereas the remaining two thirds of the more frequent variant produced a similar 3 : 1 ratio. This led him to postulate (or confirmed his postulation) that some hypothetical entity responsible for influencing development in either one way or the other must exist as two copies in a particular plant, but as only one copy in a pollen grain or ovum. If the two copies were of
opposite kind, one kind would always be dominant, and the other recessive, so that the former character would always be expressed in a hybrid2 , while the other would lie dormant, yet be passed on to offspring. These were only conceptual ‘atoms’, probably not considered by Mendel to be material entities that might one day be chemically isolated. All of this was prior to the discovery of chromosomes, the individual character and continuity of which was suspected, especially by Boveri, in the 1890s, but only finally shown by Montgomery and Sutton in 1901 and 1902. The three so-called laws of Mendel were neither proposed by him, nor were they really laws. Independent segregation relates to the really important idea that a character which can exist (as we would now say phenotypically in two forms can be due to two factors, one inherited from one parent, and one from the other. That such factors (corresponding only with the wisdom of hindsight to today’s genes) may exist, and that there may be two or more alternative forms of them (today’s set of alleles) were the really important ideas of Mendel that his experiments revolutionarily enshrine. Independent assortment of one pair of factors in relation to some other such pair is not a law, since it depends upon the location of the different pairs on different chromosomes. Mendel was either extremely lucky to select seven pairs of characteristics presumably related to the seven different chromosomes of this species, a chance of 1/163, or he neglected noise in his data (in fact due to what we now know as linkage) or perhaps attributed it to a procedural error, and rejecting such data, and arrived at the ‘correct’ result by calculation based on prior assumptions. And the third ‘law’, of dominance and recessiveness, of course (as Mendel himself knew) only applies to certain pairs of alleles, such as those in which the recessive allele results in the inability to synthesize a particular enzyme, which may be manufactured in sufficient quantity in the presence of just one copy of the dominant allele [8]. In many cases the heterozygote may be intermediate between, or qualitatively different from either homozygote.
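As an aside on how such counts are compared with Mendelian expectations, and the kind of calculation that lies behind the 'too good to be true' judgement discussed in the next section, the sketch below runs a chi-square goodness-of-fit test of the two-character counts quoted in Note 3 (315 : 108 : 101 : 32) against the 9 : 3 : 3 : 1 prediction. SciPy is assumed, and this is an illustration rather than a reconstruction of Fisher's actual analysis.

```python
from scipy.stats import chisquare

observed = [315, 108, 101, 32]                  # counts quoted in Note 3
total = sum(observed)
expected = [total * w / 16 for w in (9, 3, 3, 1)]

stat, p = chisquare(observed, f_exp=expected)
print(round(stat, 2), round(p, 3))
# chi-square of about 0.47 with a P value above 0.9: an unusually good fit
```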
He used Large Enough Numbers and Mathematical Skill to Get Firm Results from which his famous Laws could be Derived

Certainly, the numbers of plants and crosses Mendel used were appropriately large for the probabilistic nature of the task in hand. Mendel knew nothing of statistical proofs, but realized that large samples would be needed, and on the whole tested large numbers in order to produce reliable results. Mendel's carefully cast three laws led with impeccable logic to the prediction of a 9 : 3 : 3 : 1 ratio for the independent segregation of two pairs of 'factors' (Elemente)3. Now, the actual ratios which Mendel seems to have claimed to have found fit these predictions so well that they have been deemed (by R. A. Fisher, no less [7]) to be too good, that is, very unlikely to have been obtained by chance, given the numbers of crosses Mendel carried out. Charitably, Jacob Bronowski [1] created the memorable image of Mendel's brother friar 'research assistants' innocently counting out the relative numbers of types of peas so as to please Gregor, rather than Mendel himself inventing them. Another explanation might be that Mendel went on collecting data until the predicted ratios emerged, and then stopped, poor practice leading in effect to an unwitting rigging of results. Fisher made his criticism, which is by no means universally accepted [12, 19], in the context of a paper [7] that also patiently attempts to reconstruct what actual experiments Mendel may have done, and overall eulogizes unstintingly the scientific prestige of Mendel. The ethics do not matter4: the principles by means of which he set out to predict the expected outcomes are what Mendel is famous for, since we now know, for pairs of nonlethal alleles with loci on separate chromosomes, that these rules correctly describe the default situation. It is a long way from here, however, to the 'modern synthesis' [6, 9], and neither Mendel, nor initially 'Mendelism' (in the hands of the three 'Rediscoverers'), were headed in that direction.

Because Mendel worked in a monastery in Moravia and published in an obscure journal, Darwin (whose own understanding of genetics was incorrect, thereby leaving a logical flaw in his theory of evolution) never got to discover this work

Gregor Mendel gave an oral paper, and later published his work in the proceedings of the local society of naturalists in Brünn, Austria (now Brno, Czech Republic), in 1866. The Verhandlungen of the Brünn Society were sent to the Royal Society and the Linnaean Society (inter alia), but there seems to be no
Mendelian Genetics Rediscovered evidence for the notion that Darwin had a personal copy (uncut, and therefore unread). Even if Darwin had come across Mendel’s paper, the fact that it was in German and proposed ideas different from his own theory of pangenesis would have been sufficient, some have suggested, to discourage Darwin from further browsing. A more interesting (alternative history) question (put by Mayr [12], and answered by him in the negative) is whether Darwin would have appreciated the significance of Mendel’s work, had he carefully read it. Darwin was inclined towards a ‘hard’ view of inheritance (that it proceeded to determine the features of offspring with no relation to prevailing environmental conditions). He nevertheless made some allowance, increasingly through his life, for ‘soft’ inheritance, for example, in the effects of use and disuse on characters, such as the reduction or loss of eyes in cave-dwelling species. (This we would now attribute to the cessation of stabilizing selection, plus some possible benefit from jettisoning redundant machinery.) Such ‘Lamarckian’ effects were allowed for in Darwin’s own genetic theory, his ‘provisional hypothesis of pangenesis’ [3], in which hereditary particles, which Darwin called gemmules, dispersed around the individual’s body, where they would be subject to environmental influences, before returning to the reproductive organs to influence offspring. The clear distinction, made importantly by the great German biologist Weismann [23], between germ plasm and somatic cells, with the hereditary particles being passed down an unbroken line in the germ plasm, uninfluenced by what happened in the soma, thus banishing soft inheritance5 , came too late for Darwin and Mendel. After it, the discoveries of Mendel became much easier to ‘remake’. Darwin always ranked natural selection as a more potent force in evolutionary change, however, and this was a gradual process working on the general variation in the population. Darwin was aware of ‘sports’ and ‘freaks’, and thus in a sense had the notion of mutations (which De Vries – see subsequent passage – made the basis of his theory of evolution), but these were not important for Darwin’s theory, and he might have seen the explicit discontinuities of Mendel’s examples as exceptions to general population variation. The work of Fisher and others in population genetics, which explains the latter in terms of the former, thereby reconciling Mendelism
3
with natural selection, had yet to be done: this provided the key to the so-called ‘Modern Synthesis’ [9] of the 1940s. We have seen that Darwin, with the gemmules of his pangenesis theory, had himself postulated a particulate theory of heredity. However, he, together with Galton, Weismann, and De Vries, in their particulate theories of inheritance, postulated the existence of multiple identical elements for a given character in each cell nucleus, including the germ cells. This would not have led to easily predictable ratios. For Mendel, only one Elemente would enter a gamete. In the heterozygote, the unlike elements (differierenden Elemente) would remain separate, though Mendel for some reason assumed that blending would occur (in the homozygote) if the two elements were the same (gleichartigen Elemente). So perhaps Darwin would not have seen Mendel’s work as the answer to Jenkin’s devastating criticism [10] of his theory, based on the notion of blending inheritance, that after a generation or two, the effects of natural selection would be diluted to nothing. This, of course, is the statistical principle of regression to the mean in a Gaussian distribution, which Galton, Darwin’s cousin, later emphasized. Galton could not see how the average of a population could shift through selection, and looked to genetic factors peculiar to those at the tail ends of distributions to provide a source of change, as in positive eugenic plans to encourage those possessed of ‘hereditary genius’ to carefully marry those with similar benefits. Of the 40 reprints of [14], Mendel is known to have sent copies to two famous botanists, A. Kerner von Marilaun at Innsbruck, and N¨ageli, a Swiss, but by that time professor of botany in Munich, and famous for his own purely speculative theory of inheritance in conformity with current physics (a kind of String theory of its day!). Now N¨ageli was a dampening influence upon Mendel (we only have Mendel’s letters to N¨ageli, not vice versa). N¨ageli did not cite Mendel in his influential book [17], nor did he encourage Mendel to publish his voluminous breeding experiments with Pisum and his assiduous confirmations in other species. Had Mendel ignored this advice the meme of his work would probably not have lain dormant for a generation, nor would it have needed to be ‘rediscovered’. N¨ageli simply urged Mendel to test his theory on hawkweeds (Hieracium), which he did, with negative results: we now know parthenogenesis is common in this genus. With his
4
Mendelian Genetics Rediscovered
training in physics (a discipline famous for vaulting general laws, perhaps bereft of the only Golden Rule of Biology – like that of the Revolutionist in George Bernard Shaw’s Handbook – which is that there is no Golden Rule), Mendel asserted that ‘A final decision can be reached only when the results of detailed experiments from the most diverse plant families are available.’ The Hieracium work he did publish in 1870: it was to be his only other paper, before he ascended his institution’s hierarchy and was lost to administration the following year.
By Coincidence Three Scientists, De Vries in Holland, Correns in Germany, and Tschermak in Austria, all Independently Arrived in 1900 at the Same Laws of Heredity and only then Realized Mendel had done this 34 Years Earlier At the turn of the twentieth century, Correns and Tschermak were both working on the garden pea also, and found references to Mendel’s paper during their literature searches. Most important was its citation in Focke’s review of plant hybridization, Die Pflanzen–Mischlinge (1881), (though Focke did not understand Mendel’s work), and is the only one of the 15 actual citations that is relevant to its content that does little to stimulate the reader to consult the original paper6 . De Vries also found the reference. It would be like harking back to a 1970 paper now, but in a strikingly smaller corpus of publications. Hugo (Marie) de Vries was a Dutchman, in his early fifties in 1900, and Professor of Botany at the University of Amsterdam. In 1886, he became interested in the many differences between wild and cultivated varieties of the Evening Primrose (Oenothera lamarckiana), and the sudden appearance, apparently at random, of new forms or varieties when he cultivated this plant, for which he coined the term ‘mutations’. Their sudden appearance seemed to him to contradict the slow, small variations of Darwin’s natural selection, and he thought that it was through mutations that new species were formed, so that their study provided an experimental way of understanding the mechanism of evolution. De Vries had already proposed a particulate theory of inheritance in his Intracellular Pangenesis (1889) [4], and in 1892 began a program of breeding experiments on unit characters in several plant
species. With large samples, he obtained clear segregations in more than 30 different species and varieties, and felt justified in publishing a general law. In a footnote to one of three papers given at meetings and quickly published in 1900, he stated that he had only learned of the existence of Mendel’s paper7 after he had completed most of these experiments and deduced from his own results the statements made in this text. Though De Vries carefully attributed originality to Mendel in all his subsequent publications, scholars [5] have argued about the veracity of this assertion. Some of his data, based on hundreds of crosses, we would now see as quite good approximations to 3 : 1 ratios in F2 crosses, but in his writings around this time, De Vries talked of 2 : 1 or 4 : 1 ratios, or quoted percentage splits such as 77.5% : 22.5% or 75.5% : 24.5%. Mayr [12] opines that it cannot be determined whether or not De Vries had given up his original theory of multiple particles determining characteristics in favor of Mendel’s single elements. Perhaps through palpable disappointment at having been anticipated by Mendel, he did not follow through to the full implications of Mendel’s results as indicative of a general genetic principle, seeing it as only one of several mechanisms, even asserting to Bateson that ‘Mendelism is an exception to the general rule of crossing.’ De Vries became more concerned about his mutation theory of evolution (Die Mutationstheorie, 1901); he was the first to develop the concept of mutability of hereditary units. However, rather than the point gene mutations later studied by Morgan, De Vries’ mutations were mostly chromosome rearrangements resulting from the highly atypical features that happen to occur in Oenothera, the single genus in which De Vries described them. Carl Erich Correns, in his mid-thirties in 1900, was a German botany instructor at the University of T¨ubingen, was also working with garden peas, and similarly claimed to have achieved a sudden insight into Mendelian segregation in October 1899. Being busy with other work, he only read Mendel a few weeks later, and he only rapidly wrote up his own results after he had seen a reprint of one of the papers of De Vries in 1900. He readily acknowledged Mendel’s priority, and thought his own rediscovery had been much easier, given that he, unlike Mendel, was following the work of Weismann. Correns went on to produce a lot more supportive evidence throughout his career in a number of German
Mendelian Genetics Rediscovered universities, and he postulated a physical coupling of genetic factors to account for the consistent inheritance of certain characters together, anticipating the development of the concept of linkage by the American geneticist Thomas Hunt Morgan. Though his 1900 paper8 showed no understanding of the basic principles of Mendelian inheritance [15, 16], the third man usually mentioned as a rediscoverer of Mendel is the Austrian botanist Erich Tschermak von Seysenegg, not quite 30 in 1900. Two years before, he had begun breeding experiments, also on the garden pea, in the Botanical Garden of Ghent, and the following year he continued these in a private garden, whilst doing voluntary work at the Imperial Family’s Foundation at Esslingen near Vienna. Again, it was while writing up his results that he found a reference to Mendel’s work that he was able to access in the University of Vienna. It duplicated, and in some ways went beyond his own work. In his later work, he took up a position in the Academy of Agriculture in Vienna in 1901 and was made a professor five years later. He applied Mendel’s principles to the development of new plants such as a new strain of barley, an improved oat hybrid, and hybrids between wheat and rye. It is really William Bateson (1861–1926) who should be included as the most influential of those who were the first to react appropriately to the small opus by Mendel lurking in the literature, even though his awareness of it was due initially to De Vries. Bateson was the first proponent of Mendelism in English-speaking countries, and coined the term ‘genetics’ (1906). Immediately after the appearance of papers by De Vries, Correns and Tschermak, Bateson reported on these to the Royal Horticultural Society in May 1900, and then read, and was inspired by, Mendel’s paper Experiments in Plant Hybridisation shortly thereafter. Bateson was also responsible for the first English translation of this paper, which was published in the Journal of the Royal Horticultural Society in 1901. Bateson’s is a fascinating case of a brilliant and enthusiastic scientist, who turned against his earlier Darwinian insights, and adopted, with considerable grasp, the new principles of Mendelism. Paradoxically, this served to obstruct the eventual realization that these were in fact the way to vindicate and finally establish the overarching
5
explanatory power of Darwin’s evolution by natural selection. Initially, Bateson shared the interests of his friend Weldon, who had been influenced by Pearson’s statistical approach to study continuous variation in populations, but when he himself came to do this in the field, choosing fishes in remote Russian lakes, he strengthened his conviction that natural selection could not produce evolutionary change from the continuum of small variations he discovered. He urgently sought some particulate theory of heredity to account for the larger changes he thought necessary to drive evolution, and thought he had found this in Mendelism. He was stimulated by De Vries’s notion that species formation occurred when mutations spread in a population, and when he later read a 1900 paper by De Vries describing the 3 : 1 ratio results (in the train on his way to a meeting of the Royal Horticultural Society), he changed the text of his paper to claim that Galton’s regression law was in need of amendment. Bateson became an insightful and influential exponent of the new Mendelism, carrying out confirmatory experiments in animals, and coining terms (in addition to ‘genetics’) which soon became key conceptual tools, including ‘allelomorph’ (later shortened to ‘allele’), ‘homozygote’ and ‘heterozygote’. The term ‘gene’, however, was coined by the Danish botanist William Johannsen. Unfortunately, he sought to define each gene in terms of the specific phenotypic character for which it was deemed responsible. The distinction between ‘genotype’ and ‘phenotype’ was drawn early by G. Udny Yule in Cambridge. Yule, unlike Johannsen, insightfully saw that if many genes combined to influence a particular character, and if this character were a relatively simple unit, which a phenotypic feature arbitrarily seized upon an experimenter might well not be, then Mendelism might deliver the many tiny changes beloved of Darwin. This was an idea whose time was yet to come.
Rediscovery of these Laws Paved the way for a Complete Acceptance of Darwin’s Theory with Natural Selection as the Primary Mechanism of Evolutionary Change In the end it did, but not until after a long period of misunderstanding and dispute, which was finally brought to an end by means of the Modern Synthesis.
6
Mendelian Genetics Rediscovered
As we have seen, there were two main sides to the dispute. On the one hand were the so-called biometricians, led by Karl Pearson, who defended Darwinian natural selection as the major cause of evolution through the cumulative effects of small, continuous, individual variations (which they assumed passed from one generation to the next without being limited by Mendel’s laws of inheritance). On the other were the champions of ‘Mendelism’, who were mostly also mutationists: they felt that if we understood how heredity worked to produce big changes, then ipso facto we would have an explanation for the surges of the evolutionary process. The resolution of the dispute was brought about in the 1920s and 1930s by the mathematical work of the population geneticists, particularly J. B. S. Haldane and R. A. Fisher in Britain, and Sewall Wright in the United States. They showed that continuous variation was entirely explicable in terms of Mendel’s laws, and that natural selection could act on these small variations to produce major evolutionary changes. Mutationism became discredited. Recent squabbles about the relative size and continuity of evolutionary changes, between those tilted towards relative, rapid, quite large changes ‘punctuating’ long periods of unchanging equilibrium, and those in favor of smaller and more continuous change, are but a pale reflection of the earlier dispute. In its original German, Mendel’s paper repetitively used the term Entwicklung, the nearest English equivalent to which is probably ‘development’ [20]. It was also combined, German-fashion, with other words, as in Entwicklungsreihe, (developmental series); die Entwicklungsgeschichte (the history of development); das Entwicklungs-Gesetz (the law of development). Mendel at least (if not all of the ‘rediscoverers’ and their colleagues who thought they were addressing the same problems as Mendel), would have been delighted by the present day rise of developmental genetics, that is, the study of how the genetic material actually shapes the growth and differentiation of the organism, made possible, among other advances, by the discovery of the genetic code and the mapping of more and more genomes. He might have been surprised too, by the explanatory power of theories derived from population genetics, based on his own original work, combined with natural selection, to account for the evolution of behavior, and of human nature.
Notes

1. According to the presentation by Roger Wood (Manchester University, UK), in a joint paper with Vitezslav Orel (Mendelianum, Brno, Czech Republic) at a conference in 2000, at the Academy of Sciences in Paris, to mark the centenary of the conference of the Academy to which De Vries reported.
2. Mendel was not clear about the status of species versus varieties (not an easy issue today) and used the term hybrid for crosses at either level, whereas we would confine it to species crosses today.
3. With the first two pairs of characters listed above Mendel actually seems to have obtained 315 round yellow : 108 round green : 101 wrinkled yellow : 32 wrinkled green.
4. As the noted historian of psychology Leslie Hearnshaw once suggested to me, the fact that Mendel went on to publish his subsequent failures to obtain similar results in the Hawkweed (Hieracium) does much to allay any suspicions as to his honesty.
5. Weismann was right in principle, but instead of the cytological distinction, we have today the 'central dogma' of biochemical genetics, that information can pass from DNA to proteins, and from DNA to DNA, but not from protein to DNA [11].
6. 'Mendel's numerous crossings gave results which were quite similar to those of Knight, but Mendel believed that he found constant numerical relationships between the types of the crosses.' (Focke 1881, quoted by Fisher [7].)
7. In his book entitled Origins of Mendelism [18], the historian Robert Olby relates that just prior to publication 'from his friend Professor Beijernick in Delft [De Vries] received a reprint of Mendel's paper with the comment: "I know that you are studying hybrids, so perhaps the enclosed reprint of the year 1865 by a certain Mendel is still of some interest to you"'.
8. http://www.esp.org/foundations/genetics/classical/holdings/t/et-00.pdf
References

[1] Bronowski, J. (1973). The Ascent of Man, British Broadcasting Corporation Consumer Publishing, London.
[2] Darwin, C. (1859). On the Origin of Species by Means of Natural Selection, or the Preservation of Favoured Races in the Struggle for Life, 1st Edition, John Murray, London.
[3] Darwin, C. (1868). The Variation of Plants and Animals Under Domestication, Vols 1–2, John Murray, London.
[4] De Vries, H. (1889). Intracelluläre Pangenesis, Gustav Fischer, Jena.
[5] Depew, D.J. & Weber, B.H. (1996). Darwinism Evolving: Systems Dynamics and the Genealogy of Natural Selection, The MIT Press, Cambridge & London, England.
[6] Dobzhansky, T. (1937). Genetics and the Origin of Species, Columbia University Press, New York.
[7] Fisher, R.A. (1936). Has Mendel's work been rediscovered? Annals of Science 1, 115–137.
[8] Garrod, A.E. (1909). Inborn Errors of Metabolism, Frowde & Hodder, London.
[9] Huxley, J. (1942). Evolution, the Modern Synthesis, Allen & Unwin, London.
[10] Jenkin, F. (1867). The origin of species, North British Review 45, 277–318.
[11] Maynard Smith, J. (1989). Evolutionary Genetics, Oxford University Press, Oxford.
[12] Mayr, E. (1985). The Growth of Biological Thought: Diversity, Evolution, and Inheritance, Harvard University Press, Cambridge.
[13] Mendel, J.G. (1865). Experiments in Plant Hybridisation. http://www.laskerfoundation.org/news/gnn/timeline/mendel 1913.pdf
[14] Mendel, J.G. (1865). Versuche über Pflanzen-hybriden, Verhandlungen des Naturforschenden Vereines in Brünn 4, 3–47.
[15] Monaghan, F. & Corcos, A. (1986). Tschermak: a non-discoverer of mendelism. I. An historical note, Journal of Heredity 77, 468–469.
[16] Monaghan, F. & Corcos, A. (1987). Tschermak: a non-discoverer of mendelism. II. A critique, Journal of Heredity 78, 208–210.
[17] Nägeli, C. (1884). Mechanisch-Physiologische Theorie der Abstammungslehre, Oldenbourg, Leipzig.
[18] Olby, R.C. (1966). Origins of Mendelism, Schocken Books, New York.
[19] Pilgrim, I. (1986). A solution to the too-good-to-be-true paradox and Gregor Mendel, The Journal of Heredity 77, 218–220.
[20] Sandler, I. (2000). Development: Mendel's legacy to genetics, Genetics 154(1), 7–11.
[21] Veuille, M. (2000). 1900–2000: how the Mendelian revolution came about. The Rediscovery of Mendel's Laws (1900), International Conference, Paris, 23–25 March 2000; Trends in Genetics 16(9), 380.
[22] Waller, J. (2002). Fabulous Science: Fact and Fiction in the History of Scientific Discovery, Oxford University Press, Oxford.
[23] Weismann, A. (1892). Das Keimplasma: Eine Theorie der Vererbung, Gustav Fischer, Jena.
DAVID DICKINS
Mendelian Inheritance and Segregation Analysis
P. SHAM
Volume 3, pp. 1205–1206 in Encyclopedia of Statistics in Behavioral Science
ISBN-13: 978-0-470-86080-9  ISBN-10: 0-470-86080-4
Editors: Brian S. Everitt & David C. Howell
John Wiley & Sons, Ltd, Chichester, 2005
Mendelian Inheritance and Segregation Analysis

Gregor Mendel (1822–1884), Abbot at the St Thomas Monastery of the Augustinian Order in Brünn, conducted the seminal experiments that demonstrated the existence of genes and characterized how they are transmitted from parents to offspring, thus laying the foundation of the science of genetics. Mendel chose the garden pea as his experimental organism, and selected seven characteristics that are dichotomous and therefore easy to measure. Mendel's experiments, and their results, can be summarized as follows:

1. After repeated inbreeding, plants became uniform in each characteristic (e.g., all tall). These are known as pure lines.
2. When two pure lines with opposite characteristics are crossed (e.g., tall and short), one of the characteristics is present in all the offspring (e.g., they are all tall). The offspring are said to be the F1 generation, and the characteristic present in F1 (e.g., tall) is said to be dominant, while the alternative, absent characteristic (e.g., short) is said to be recessive.
3. When two plants of the F1 generation are crossed, the offspring (F2) display the dominant (e.g., tall) and recessive (e.g., short) characteristics in the ratio 3 : 1. This cross is called an intercross.
4. When an F1 plant is crossed with the parental recessive pure line, the offspring display the dominant and recessive characteristics in the ratio 1 : 1. This cross is called a backcross.

Mendel explained these observations by formulating the law of segregation. This law states that each individual contains two inherited factors (or genes) for each pair of characteristics, and that during reproduction one of these two factors is transmitted to the offspring, each with 50% probability. There are two alternative forms of the genes, called alleles, corresponding to each dichotomous character. When the two genes in an individual are of the same allele, the individual's genotype is said to be homozygous; otherwise it is said to be heterozygous. An individual with a heterozygous genotype (e.g., Aa) has the same characteristic as an individual with a homozygous genotype (e.g., AA) of the dominant allele (e.g., A). The explanations of the above observations are then as follows:

1. Repeated inbreeding produces homozygous lines (e.g., AA, aa) that will always display the same characteristic in successive generations.
2. When two pure lines (AA and aa) are crossed, the offspring (F1) will all be heterozygous (Aa), and therefore have the same characteristic as the homozygote of the dominant allele (A).
3. When two F1 (Aa) individuals are crossed, the gametes A and a from one parent combine at random with the gametes A and a from the other to form the offspring genotypes AA, Aa, aA, and aa, in equal numbers. The ratio of offspring with dominant and recessive characteristics is therefore 3 : 1.
4. When an F1 (Aa) individual is crossed with the recessive (aa) pure line, the offspring will be a 50 : 50 mixture of Aa and aa genotypes, so that the ratio of offspring with dominant and recessive characteristics is 1 : 1.

The characteristic 3 : 1 and 1 : 1 ratios among the offspring of an intercross and a backcross, respectively, are known as Mendelian segregation ratios. Mendel's work was the first demonstration of the existence of discrete heritable factors, although the significance of this was overlooked for many years, until 1900 when three botanists, separately and independently, rediscovered the same principles.
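As an illustration of the law of segregation (a small sketch added to this entry, not part of the original text), the following Python snippet enumerates the equally likely gamete combinations for the intercross and backcross described above and recovers the 3 : 1 and 1 : 1 phenotype ratios.

```python
from collections import Counter
from itertools import product

def cross(parent1, parent2):
    """Enumerate the equally likely offspring genotypes of two parental genotypes.

    Each parent transmits either of its two alleles with probability 1/2, so the
    four gamete combinations are equally likely.
    """
    offspring = ["".join(sorted(g1 + g2)) for g1, g2 in product(parent1, parent2)]
    return Counter(offspring)

def phenotype_ratio(genotype_counts, dominant="A"):
    """Collapse genotype counts into dominant/recessive phenotype counts."""
    pheno = Counter()
    for genotype, n in genotype_counts.items():
        pheno["dominant" if dominant in genotype else "recessive"] += n
    return pheno

print(phenotype_ratio(cross("Aa", "Aa")))  # intercross: 3 dominant : 1 recessive
print(phenotype_ratio(cross("Aa", "aa")))  # backcross:  2 : 2, i.e. 1 : 1
```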
Classical Segregation Analysis

The characteristic 1 : 1 and 3 : 1 segregation ratios provide a method of checking whether a disease in humans is caused by mutations at a single gene. For a rare dominant disease, the disease mutation (A) is likely to be rare, so that individuals homozygous for the disease mutation (AA) are likely to be exceedingly rare. A mating between affected and unaffected individuals is therefore very likely to be a backcross (Aa × aa), with a predicted segregation ratio of 1 : 1 among the offspring. An investigation of such matings to test the segregation ratio of offspring against the hypothetical value of 1 : 1 is a form of classical segregation analysis. For rare recessive conditions, the most informative mating is the intercross (Aa × Aa) with a predicted
segregation ratio of 3 : 1. However, because the condition is recessive, the parents are both normal and are indistinguishable from other mating types (e.g., aa × aa, Aa × aa). It is therefore necessary to recognize Aa × Aa matings from the fact that an offspring is affected. This, however, introduces an ascertainment bias to the “apparent segregation ratio”. To take an extreme case, if all families in the community have only one offspring, then ascertaining families with at least one affected offspring will result in all offspring being affected. A number of methods have been developed to take ascertainment procedure into account when conducting segregation analysis of putative recessive disorders. These include the proband method, the singles method, and maximum likelihood methods.
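The ascertainment problem can be made concrete with a small simulation (an added sketch, not part of the original entry): intercross (Aa × Aa) families are generated, only those containing at least one affected offspring are retained, and the 'apparent segregation ratio' among the retained offspring is computed. It overshoots the true value of 1/4, and the bias grows as sibships get smaller.

```python
import random

def apparent_segregation_ratio(n_families=100_000, sibship_size=2, p_affected=0.25, seed=1):
    """Proportion affected among offspring of ascertained intercross families.

    A family is ascertained only if it contains at least one affected offspring,
    which biases the apparent ratio upwards relative to the true value of 1/4.
    """
    rng = random.Random(seed)
    affected = total = 0
    for _ in range(n_families):
        sibship = [rng.random() < p_affected for _ in range(sibship_size)]
        if any(sibship):                      # ascertainment condition
            affected += sum(sibship)
            total += sibship_size
    return affected / total

for size in (1, 2, 5):
    print(f"sibship size {size}: apparent ratio = {apparent_segregation_ratio(sibship_size=size):.3f}")
# With one-child families every ascertained offspring is affected, as noted in the text.
```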
Complex Segregation Analysis

Complex segregation analysis is concerned with the detection of a gene that has a major impact on the phenotype (called a major locus), even though it is not the only influence and other genetic and environmental factors are also involved in determining the phenotype. The involvement of other factors reduces the strength of relationship between the major locus and the phenotype. Because of this, it is not possible to deduce the underlying mating type from the phenotypes of family members. Instead, it is necessary to consider all the possible mating types for each family. Even if a specific mating type could be isolated, the segregation ratio would not be expected to follow classical Mendelian ratios. For
these reasons, this form of segregation analysis is said to be complex. Complex segregation analysis is usually conducted using maximum likelihood methods under one of two models. The first is a generalized single locus model with generalized transmission parameters. This differs from a Mendelian model in two ways. First, the probabilities of disease given genotype (called penetrances) are not necessarily 0 or 1, but can take intermediate values. Secondly, the probabilities of transmitting an allele (e.g., A) given parental genotype (e.g., AA, Aa, and aa) are not necessarily 1, 1/2, or 0, but can take other values. A test for a major locus is provided by a test of whether these transmission probabilities conform to the Mendelian values (1, 1/2, and 0). The second model is the mixed model, which contains a single major locus against a polygenic background. A test for a major locus is provided by a test of this mixed model against a pure polygenic model without a major locus component. Both forms of analyses can be applied to both qualitative (e.g., disease) and quantitative traits, and are usually conducted in a maximum likelihood framework, with adjustment for the ascertainment procedure. The generalized transmission test and the mixed model test have been combined into a unified model and implemented in the POINTER program. Other methods for complex segregation analysis, for example, using regressive models, have also been developed.

P. SHAM
Meta-Analysis
WIM VAN DEN NOORTGATE AND PATRICK ONGHENA
Volume 3, pp. 1206–1217 in Encyclopedia of Statistics in Behavioral Science
ISBN-13: 978-0-470-86080-9  ISBN-10: 0-470-86080-4
Editors: Brian S. Everitt & David C. Howell
John Wiley & Sons, Ltd, Chichester, 2005
Meta-Analysis

Introduction

Small Progress in Behavioral Science?

It is well known that the results of empirical studies are subject to random fluctuation if they are based on samples of subjects instead of on the complete population on which the study is focused. In a study evaluating the effect of a specific treatment, for instance, the population effect size is typically estimated using the effect size that is observed in the sample. Traditionally, researchers deal with the uncertainty associated with the estimates by performing significance tests or by constructing confidence intervals around the effect size estimates. In both procedures, one refers implicitly to the sampling distribution of the effect sizes, which is the distribution of the observed effect sizes if the study had been replicated an infinite number of times. Unfortunately, research in behavioral science is characterized by relatively small-sample studies, small population effects and large initial differences between subjects [49, 58]. Consequently, confidence intervals around effect sizes are often unsatisfactorily large and thus relatively uninformative, and the power of the significance tests is small. Researchers, at least those who are aware of the considerable uncertainty about the study results, therefore often conclude their research report with a call for more research.

During the last decades, several topics were indeed investigated several times, some of them even dozens of times. Mosteller and Colditz [44] talk about an information explosion. Yet, reviewers of the results of studies evaluating a similar treatment are often disappointed. While in some studies positive effects are found, in other studies, no effects or even negative effects are obtained. This is not only true for sets of studies that differ from each other in the characteristics of the subjects that were investigated or in the way the independent variable is manipulated, but also for sets of studies that are more or less replications of each other. Reviewers have a hard job seeing the wood for the trees and often fall back on personal strategies to summarize the results of a set of studies. Different reviewers therefore often come to different conclusions, even if they discuss the
same set of studies [49, 58]. The conflicting results from empirical studies have brought some to the pessimistic idea that researchers in these domains do not progress, and have driven some politicians and practitioners toward relying on their own feelings instead of on scientific results. The rise of meta-analysis offered new perspectives.
The Meta-analytic Revolution

Since the beginning of the twentieth century, there have been some modest attempts to summarize study results in a quantitative and objective way (e.g., [48] and [29]; see [59]). It was, however, not before the appearance of an article by Glass in 1976 [23] that the idea of a quantitative integration of study results was explicitly described. Glass coined the term 'meta-analysis' and defined it as '. . . the analysis of analyses. I use it to refer to the statistical analysis of a large collection of analysis results from individual studies for the purpose of integrating the findings' (p. 3).
The introduction of the term meta-analysis and the description of simple meta-analytic procedures were the start of a spectacular rise of the popularity of quantitative research synthesis. Fiske [17] even used the term meta-analytic revolution. Besides the popularity of meta-analysis in education [41], applications showed up in a wide variety of research domains [46]. The quantitative approach gradually supplemented or even pushed out the traditional narrative reviews of research literature [35]. Another indication of the growing popularity of meta-analysis was the mounting number and size of meta-analytic handbooks [1]. The growing attention for meta-analysis gradually affected the views on research progress in behavioral science. Researchers became aware that even if studies are set up in a very similar way, conflicting results due to random fluctuation alone are not surprising, especially when study results are described on the significant–nonsignificant dichotomy. By quantitatively combining the results of several similar trials, meta-analyses have the potential of averaging out the influence of random fluctuation, resulting in more steady estimates of the overall effect and a higher power in testing this overall effect size. Meta-analytic methods were developed in order to distinguish the random variation in study results from ‘true’ between
study heterogeneity, and in order to explain the latter by estimating and testing moderating effects of study characteristics.
Performing a Meta-analysis

A meta-analysis consists of several steps (see e.g., [9] for an extensive discussion):

1. Formulating the research questions or hypotheses and defining selection criteria. Just like in primary research, a clear formulation of the research question is crucial for meta-analytic research. Together with practical considerations, the research question results in a set of criteria that are used to select studies. The most common selection criteria relate to the population from which study participants are sampled, the dependent and the independent variables and their indicators, and the quality of the study. Note that formulating a very specific research question and using very strict selection criteria eventually results in a set of similar studies for which the results of the meta-analysis are relatively easy to interpret, but on the other side of the coin, the set of studies will also be relatively small, at the expense of the reliability of the results.

2. Looking for studies investigating these questions. A thorough search includes an exploration of journals, books, doctoral dissertations, published or unpublished research reports, conference papers, and so on. Sources of information are, for instance, databases that are printed or available online or on CD-ROM, contacts with experts in the research domain, and reference lists of relevant material. To avoid bias in the meta-analytic results (see below), it is a good idea to use different sources of information and to include published as well as unpublished study results. Studies are selected on the basis of the selection criteria from the first step.

3. Extracting relevant data. Study outcomes and characteristics are selected, coded for each study, and assembled in a database. The study characteristics can be used later in order to account for possible heterogeneity in study outcomes.

4. Converting study results to a comparable measure. Study results are generally reported by means of descriptive statistics (means, standard deviations, etc.), test statistics (t, F, χ², . . .) or P values, but the way of reporting is usually very different from study to study. Moreover, variables are typically not measured on the same scale in all studies. The comparison of the study outcomes, therefore, requires a conversion to a common standardized measure. One possibility is to use P values. The meaning of P values does not depend on the way the variables are measured, or on the statistical test that is used in the study. A disadvantage of P values, however, is that they depend not only on the effect that is observed but also on the sample size. A very small difference between two groups, for instance, can be statistically significant if the groups are large, while a large difference can be statistically nonsignificant if the groups are relatively small. Although techniques for combining P values have been described, in current meta-analyses, usually measures are combined that express the magnitude of the observed relation, independently of the sample size and the measurement scale. Examples of such standardized effect size measures are the Pearson correlation coefficient and the odds ratio. A popular effect size measure in behavioral science is the standardized mean difference, used to express the difference between the means of an experimental and a control condition on a continuous variable, or more generally the difference between the means of two groups:

   δ = (µE − µC)/σ,   (1)

   with µE and µC equal to the population mean under the experimental and the control condition, respectively, and σ equal to the common population standard deviation. The population effect size δ is estimated by its sample counterpart:

   d = (x̄E − x̄C)/sp,   (2)

   with x̄E and x̄C equal to the sample means, and sp equal to the pooled sample standard deviation. Assuming normal population distributions with a common variance under both conditions, the sampling distribution of d is approximately normal with mean δ and variance equal to [30]:

   σ̂²d = (nE + nC)/(nE nC) + d²/[2(nE + nC)].   (3)

   (A short computational sketch of this conversion is given after this list.)

5. Combining and/or comparing these results and possibly looking for moderator variables. This step forms the essence of the meta-analysis. The effect sizes from the studies are analyzed statistically. The unknown parameters of one or more meta-analytic models are estimated and tested. Common statistical models and techniques to combine or compare effect sizes will be illustrated extensively below by means of an example.

6. Interpreting and reporting the results. In this phase, the researcher returns to the research question(s) and tries to answer these questions based on the results of the analysis. The research report ideally describes explicitly the research questions, the inclusion criteria, the sources used in the search for studies, a list of the studies that were selected, the observed effect sizes, the study sample sizes, and the most important study characteristics. It should also contain a description of how study results were converted to a common measure, which models, techniques, estimation procedures, and software were used to combine these measures, and which assumptions were made for the analyses. It is often a good idea to illustrate the meta-analytic results graphically, for instance, using a funnel plot, a stem-and-leaf plot, a histogram or a plot of the interval estimates of the observed study effect sizes and the overall effect size estimate (see below).
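The conversion of step 4 is easy to script. The following Python sketch (an illustration added here, not part of the original entry) computes the standardized mean difference of Equation (2) and its estimated sampling variance of Equation (3) from the summary statistics of a single two-group study; the group means, standard deviations, and sizes used below are made-up numbers.

```python
from math import sqrt

def standardized_mean_difference(mean_e, mean_c, sd_e, sd_c, n_e, n_c):
    """Return (d, var_d): the standardized mean difference and its sampling variance.

    d follows Equation (2) with the pooled standard deviation; var_d follows Equation (3).
    """
    # Pooled sample standard deviation s_p
    s_p = sqrt(((n_e - 1) * sd_e**2 + (n_c - 1) * sd_c**2) / (n_e + n_c - 2))
    d = (mean_e - mean_c) / s_p
    var_d = (n_e + n_c) / (n_e * n_c) + d**2 / (2 * (n_e + n_c))
    return d, var_d

# Hypothetical study: experimental group mean 103.1, control 100.0, SDs near 15
d, var_d = standardized_mean_difference(103.1, 100.0, 15.2, 14.8, 40, 40)
print(round(d, 3), round(sqrt(var_d), 3))   # observed effect size and its standard error
```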
An Example

Raudenbush [50, 51] combined the results of 18 experiments investigating the effect of teacher expectations on the intellectual development of their pupils. In each of the studies, the researcher tried to create high expectancies for an experimental group of pupils, while this was not the case for a control group. The observed differences between groups were converted to the standardized mean difference (Table 1). Table 1 also describes for each study the length of contact between teacher and pupil prior to the expectancy induction, as well as the standard error of the observed effect sizes (3). Since the sampling distribution of the standardized mean difference is approximately normal with a standard deviation estimated by the standard error of estimation, one could easily construct confidence intervals around the observed effect sizes. The confidence intervals are presented in Figure 1. Note that somewhat more than half of the observed effect sizes are larger than zero. In three studies, zero is not included in the confidence interval, and the (positive) observed effect sizes therefore are statistically significantly different from zero. Following Cohen [6], calling a standardized mean difference of 0.20, 0.50, and 0.80 a small, moderate, and large effect respectively, almost all observed effect sizes suggest that there is generally only a small effect. With the naked eye, it is difficult to see whether there is a general effect of teacher expectancy, and how large this effect is. Moreover, it is difficult to see whether the observed differences between study outcomes are due to sampling variation, or to intrinsic differences between the studies. In the following section, we show how these questions can be dealt with in meta-analysis. Different models are illustrated, which differ in complexity and in the underlying assumptions.

Table 1  Summary results of experiments assessing the effect of teacher expectancy on pupil IQ, reproduced from Raudenbush and Bryk [51], with permission of AERA

Study  Reference  Weeks of prior contact  Effect size  Standard error of effect size estimate
1      [55]        2    0.03   0.125
2      [7]        21    0.12   0.147
3      [36]       19   −0.14   0.167
4      [47]        0    1.18   0.373
5      [47]        0    0.26   0.369
6      [12]        3   −0.06   0.103
7      [15]       17   −0.02   0.103
8      [4]        24   −0.32   0.220
9      [38]        0    0.27   0.164
10     [42]        1    0.80   0.251
11     [3]         0    0.54   0.302
12     [19]        0    0.18   0.223
13     [37]        1   −0.02   0.289
14     [32]        2    0.23   0.290
15     [16]       17   −0.18   0.159
16     [28]        5   −0.06   0.167
17     [56]        1    0.30   0.139
18     [18]        2    0.07   0.094
19     [21]        7   −0.07   0.174

Figure 1  Confidence intervals for the observed effect sizes and for the estimated mean effect size under a fixed effects model (FEM) or a random effects model (REM)
Fixed Effects Model

As discussed above, observed effect sizes can be conceived of as randomly fluctuating. If in a specific study another sample had been taken (from the same population), the observed effect size would probably not have been exactly the same. Suppose that differences in the meta-analytic data set between the observed effect sizes can be entirely accounted for by sampling variation. In this case, we can model the study results as:

dj = δ + ej,   (4)
with dj the observed effect size in study j, δ the common population effect size, and ej a residual due to sampling variation. The primary purpose of the meta-analysis is then to estimate or test the overall effect, δ. The overall effect is usually estimated by averaging the observed effects. Since, in general, the population effect size is estimated more reliably in
large studies, effect sizes are sometimes weighted by the study sample sizes in calculating the average. A similar approach, resulting in a decreased MSE, is to weight by the precision of the estimates, that is, the inverse of the squared standard error. The precision is estimated as the inverse of the sampling variance estimate (as given in Equation 3):

δ̂ = Σ_{j=1..k} wj dj / Σ_{j=1..k} wj,   with wj = 1/σ̂²dj and k the total number of studies.   (5)
The precision of the estimate of the overall effect is the sum of the individual precisions. The standard error of the estimate of the overall effect is estimated by:

SE(δ̂) = (precision)^(−1/2) = (Σ_{j=1..k} 1/σ̂²dj)^(−1/2).   (6)
For the example, the estimate of the overall effect is 0.060, with a corresponding standard error of 0.036. Since the sampling distribution of the overall effect size estimate is again approximately normal, an approximate 95% confidence interval equals [0.060 − 1.96 * 0.036; 0.060 + 1.96 * 0.036] = [−0.011; 0.131]. Because zero is included in the confidence interval, we conclude that the overall effect size estimate is not statistically significant at the 0.05 level. This is equivalent to comparing the estimate divided by the standard error with a standard normal distribution, z = 1.67, p = 0.09. The confidence interval around the overall effect size estimate is also presented in Figure 1. It can be seen that the interval is much smaller than the confidence intervals of the individual studies, illustrating the increased precision or power when estimating or testing a common effect size by means of a meta-analysis.

The assumption that the population effect size is the same in all studies is frequently unlikely. Studies often differ in, for example, the operationalization of the dependent or the independent variable and the population from which the study participants are sampled, and it is often unlikely that these study or population characteristics are unrelated to the effect that is investigated. Before using the fixed effects techniques, it is therefore recommended to test the homogeneity of the study results. A popular homogeneity test is the test of Cochran [5]. The test statistic

Q = Σ_{j=1..k} (dj − δ̂)²/σ̂²dj   (7)
follows approximately a chi-square distribution with (k − 1) degrees of freedom (see Note 1). For the example, the homogeneity test reveals that it is highly unlikely that such relatively large differences in observed effect sizes are entirely due to sampling variation, χ²(18) = 35.83, p = 0.007. The assumption of a common population effect size is relaxed in the following models. For the sake of completeness, we want to remark that the fixed effects techniques (see Fixed and Random Effects; Fixed Effect Models) described above are actually not only appropriate when the population effect size is the same for all studies, but also when the population effect size differs from study to study but the researcher is interested only in the population effect sizes that are studied. In the last case, (5) is not used to estimate a common population effect size, but rather to estimate the mean of the population effect sizes studied in the studies included in the meta-analytic data set.
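The fixed effects computations are straightforward to reproduce. The sketch below (added as an illustration; Python is not part of the original entry, and the hard-coded data are the effect sizes and standard errors of Table 1) computes the precision-weighted estimate (5), its standard error (6), and Cochran's Q (7); it should return approximately the values reported above (0.060, 0.036, and Q ≈ 35.8 on 18 degrees of freedom).

```python
# Observed standardized mean differences and their standard errors (Table 1)
d = [0.03, 0.12, -0.14, 1.18, 0.26, -0.06, -0.02, -0.32, 0.27, 0.80,
     0.54, 0.18, -0.02, 0.23, -0.18, -0.06, 0.30, 0.07, -0.07]
se = [0.125, 0.147, 0.167, 0.373, 0.369, 0.103, 0.103, 0.220, 0.164, 0.251,
      0.302, 0.223, 0.289, 0.290, 0.159, 0.167, 0.139, 0.094, 0.174]

w = [1 / s**2 for s in se]                       # precision weights, Equation (5)
delta_hat = sum(wj * dj for wj, dj in zip(w, d)) / sum(w)
se_delta = sum(w) ** -0.5                        # Equation (6)
q = sum(wj * (dj - delta_hat) ** 2 for wj, dj in zip(w, d))   # Cochran's Q, Equation (7)

print(f"fixed effects estimate = {delta_hat:.3f}, SE = {se_delta:.3f}")
print(f"z = {delta_hat / se_delta:.2f}, Q = {q:.2f} on {len(d) - 1} df")
```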
Random Effects Model

Cronbach [10] argued that the treatments that are investigated in a meta-analytic set of empirical studies often can be regarded as a random sample from a population of possible treatments. As a result, studies often do not investigate a common population effect size, but rather a distribution of population effect sizes, from which the population effect sizes from the studies in the data set represent only a random sample. While in the fixed effects model described above (4) the population effect size is assumed to be constant over studies, in the random effects model, the population effect size is assumed to be a stochastic variable (see Fixed and Random Effects). The population effect size that is estimated in a specific study is modeled as the mean of the population distribution of effect sizes plus a random residual [11, 31, 35]:

δj = γ + uj.   (8)
An observed effect size deviates from this study-specific population effect size due to sampling variation:

dj = δj + ej.   (9)

Combining (8) and (9) results in:

dj = γ + uj + ej.   (10)
If dj is an unbiased estimate of δj, the variance of the observed effect size is the sum of the variance of the population effect size and the variance of the observed effect sizes around these population effect sizes:

σ²dj = σ²δ + σ²(dj|δj) = σ²u + σ²e.   (11)

Although better estimators are available [70], an obvious estimate of the population variance of the effect sizes therefore is [31]:

σ̂²δ = s²d − (Σ_{j=1..k} σ̂²dj|δj)/k.   (12)
To estimate γ , the mean of the population distribution of effect sizes, one could again use the precision weighted average of the observed effect sizes (5). While in the fixed effects model the precision associated with an observed effect size equals the inverse of the sampling variance alone, the precision based on a random effects model equals the inverse of the sampling variance plus the population variance. Since this population variance is the same for all studies, assuming a random effects model instead of a fixed effects model has an equalizing effect on the weights. For the example, the estimate of the variance in population effect sizes is 0.080, the estimate of the mean effect size 0.114. Assuming that the distribution of the true effect sizes is normal, 95% of the estimated population distribution of effect sizes therefore is located between -0.013 and 0.271. For the example, the equalizing effect of using a random effects model resulted in an estimate of the mean effect size that is larger than that for the fixed effects model (0.060), since in the example larger standard errors are in general associated with larger observed effect sizes. The standard error of the estimate can again be estimated using (6), but in this case, (11) is used to
estimate the variance of the observed effect sizes. The standard error therefore will be larger than for the fixed effects model, which is not surprising: the mean effect size can be estimated more precisely if one assumes that in all studies exactly the same effect size is estimated. For the example, the standard error of the estimate of the mean effect size equals 0.079, resulting in a 95% confidence interval equal to [−0.041; 0.269]. Once more, the interval includes zero, which means that even if there is no real effect, it is not unlikely that a mean effect size estimate of 0.114 or a more extreme one is found, z = 1.44, p = 0.15. Since the effect is assumed to depend on the study, the researcher may be interested in the effect in one or a few specific studies. Note that the observed effect sizes are in fact estimates of the study-specific population effect sizes. Alternatively, one could use the mean effect size estimate to estimate the effect in each single study. While the first kind of estimate seems reasonable if studies are large (and the observed effect sizes are more precise estimates of the population effect sizes), the second estimate is sensible if studies are much alike (this is if the between study variance is smaller). The empirical Bayes estimate of the effect in a certain study is an optimal combination of both kinds of estimates. In this combination, more weight will be given to the mean effect if studies are more similar and if the study is small. Hence, in these situations, the estimates are more ‘shrunken’ to the mean effect. Because of this property of shrinkage, empirical Bayes estimates are often called shrinkage estimates. Empirical Bayes estimates ‘borrow strength’ from the data from other studies: the MSE associated with the empirical Bayes estimates of the study effect sizes is, in general, smaller than the MSE of the observed effect sizes. For more details about empirical Bayes estimates, see, for example, [51] and [52]. In the random effects model, population effect sizes are assumed to be exchangeable. This means that it is assumed that there is no prior reason to believe that the true effect in a specific study is larger (or smaller) than in another study. This assumption is quite often too restrictive, since frequently study characteristics are known that can be assumed to have a systematic moderating effect. In Table 1, for instance, the number of weeks of prior contact between pupils and teachers is given for each study. The number of weeks of prior contact could be
supposed to affect the effect of manipulating the expectancy of teachers toward their pupils, since manipulating the expectancies is easier if teachers do not know the pupil yet. In the following models, the moderating effect of study characteristics is modeled explicitly.
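A minimal random effects computation for the example can be scripted as follows (an added sketch, not part of the original entry; it uses the simple moment estimator of Equation (12) rather than the better estimators referred to above). With the Table 1 data it should give a between-study variance of about 0.080, a mean effect estimate of about 0.114 with standard error about 0.079, and it also forms the empirical Bayes ('shrinkage') estimates described in the text.

```python
from statistics import variance  # sample variance with denominator k - 1

d = [0.03, 0.12, -0.14, 1.18, 0.26, -0.06, -0.02, -0.32, 0.27, 0.80,
     0.54, 0.18, -0.02, 0.23, -0.18, -0.06, 0.30, 0.07, -0.07]
se = [0.125, 0.147, 0.167, 0.373, 0.369, 0.103, 0.103, 0.220, 0.164, 0.251,
      0.302, 0.223, 0.289, 0.290, 0.159, 0.167, 0.139, 0.094, 0.174]
k = len(d)

# Equation (12): between-study variance = sample variance of the observed effect
# sizes minus the average sampling variance (truncated at zero if negative).
tau2 = max(variance(d) - sum(s**2 for s in se) / k, 0.0)

w = [1 / (tau2 + s**2) for s in se]          # random effects precision weights
gamma_hat = sum(wj * dj for wj, dj in zip(w, d)) / sum(w)
se_gamma = sum(w) ** -0.5

# Empirical Bayes ('shrinkage') estimates: a weighted compromise between each
# observed effect size and the estimated mean effect, with more shrinkage for
# small (imprecise) studies.
shrunk = [(tau2 / (tau2 + s**2)) * dj + (s**2 / (tau2 + s**2)) * gamma_hat
          for dj, s in zip(d, se)]

print(f"between-study variance = {tau2:.3f}")
print(f"mean effect = {gamma_hat:.3f}, SE = {se_gamma:.3f}")
print(f"study 4: observed {d[3]:.2f}, empirical Bayes estimate {shrunk[3]:.2f}")
```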
Fixed Effects Regression Model

Several methods have been proposed to account for moderator variables. Hedges and Olkin [31], for example, proposed to use an adapted analysis of variance to explore the moderating effect of a categorical study characteristic. For a continuous study characteristic, they proposed to use a fixed effects regression model. In this model, study outcomes differ due to sampling variation and due to the effect of study characteristics:

dj = δj + ej = γ0 + Σ_{s=1..S} γs Wsj + ej,   (13)
with Wsj equal to the value of study j on the study characteristic s, and S the total number of study characteristics included in the model. Note that the effect of a categorical study characteristic can also be modeled by means of such a fixed effects regression model, by means of dummy variables indicating the category the study belongs to. The fixed effects regression model simplifies to the fixed effects model described above in case the population effect sizes do not depend on the study characteristics. Unknown parameters can be estimated using the weighted least squares procedure, weighting the observed effect sizes by their (estimated) precision as we did before for the fixed effects model (5). Details are given by Hedges and Olkin [31]. If, for the example, one study characteristic is included with levels 0, 1, 2, and 3 for respectively 0, 1, 2, and 3 or more weeks of prior contact, the estimate of the regression intercept equals 0.407, with a standard error of 0.087, while the estimated moderating effect of the number of weeks equals −0.157 with a standard error of 0.036. This means that if there was no prior contact between pupils and teachers, elevating the expectancy of teachers can be expected to have a positive effect, z = 4.678, p < 0.001, but this treatment effect decreases significantly with the length of prior contact, z = −4.361, p < 0.001.
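The weighted least squares estimates for the example can be obtained with a few lines of NumPy (an added sketch, not part of the original entry; the recoding of weeks of prior contact into the levels 0 to 3 follows the text). With the Table 1 data it should reproduce approximately the reported intercept of 0.407 (SE 0.087) and slope of −0.157 (SE 0.036).

```python
import numpy as np

d = [0.03, 0.12, -0.14, 1.18, 0.26, -0.06, -0.02, -0.32, 0.27, 0.80,
     0.54, 0.18, -0.02, 0.23, -0.18, -0.06, 0.30, 0.07, -0.07]
se = [0.125, 0.147, 0.167, 0.373, 0.369, 0.103, 0.103, 0.220, 0.164, 0.251,
      0.302, 0.223, 0.289, 0.290, 0.159, 0.167, 0.139, 0.094, 0.174]
weeks = [2, 21, 19, 0, 0, 3, 17, 24, 0, 1, 0, 0, 1, 2, 17, 5, 1, 2, 7]

x = np.array([min(w, 3) for w in weeks], dtype=float)   # recoded moderator: 0, 1, 2, 3+
X = np.column_stack([np.ones_like(x), x])               # intercept and moderator
W = np.diag(1.0 / np.array(se) ** 2)                    # precision weights
y = np.array(d)

# Weighted least squares: coefficients (X'WX)^{-1} X'Wy, covariance matrix (X'WX)^{-1}
xtwx_inv = np.linalg.inv(X.T @ W @ X)
coef = xtwx_inv @ X.T @ W @ y
se_coef = np.sqrt(np.diag(xtwx_inv))

print("intercept and slope:", np.round(coef, 3))
print("standard errors:    ", np.round(se_coef, 3))
```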
Mixed Effects Model
In the fixed effects regression model, possible differences in population effect sizes are entirely attributed to (known) study characteristics. The dependence of δj on the study characteristics is considered to be nonstochastic. The random effects regression model accounts for the possibility that population effect sizes vary partly randomly, partly according to known study characteristics:

dj = δj + ej = γ0 + Σ_{s=1..S} γs Wsj + uj + ej.   (14)
Since in the random effects regression model the population effect sizes depend on fixed effects (the γ's) and random effects (the u's), the model is also called a mixed effects model. Raudenbush and Bryk [51] showed that the mixed effects meta-analytic model is a special case of a hierarchical linear model or linear multilevel model, and proposed to use maximum likelihood estimation procedures that are commonly used to estimate the parameters of multilevel models, assuming normal residuals. For the example, the maximum likelihood estimate of the residual between-study variance equals zero. This means that for the example the model simplifies to the fixed effects regression model, and the parameter estimates and corresponding standard errors for the fixed effects are the same as the ones given above. Differences between the underlying population effect sizes are explained entirely by the length of prior contact between pupils and teachers.
Threats for Meta-analysis

Despite its growing popularity, meta-analysis has always been a point of controversy and has been the subject of lively debates (see, e.g., [13], [14], [33], [34], [62], [67] and [72]). Critics point to interpretation problems due to, among other things, combining studies of dissimilar quality, including dependent study results, incomparability of different kinds of effect size measures, or a lack of essential data to calculate effect size measures. Another problem is the 'mixing of apples and oranges' due to combining studies investigating dissimilar research questions or using dissimilar study designs, dissimilar independent or dependent variables, or participants from dissimilar populations.
The criticism of meta-analysis that probably received most attention is the file drawer problem [57], which refers to the idea that the drawers of researchers may be filled with statistically nonsignificant unpublished study results, since researchers are more inclined to submit manuscripts describing significant results, and manuscripts with significant results are more likely to be accepted for publication. In addition, the results of small studies are less likely to be published, unless the observed effect sizes are relatively large. One way to detect the file drawer problem and the resulting publication bias is to construct a funnel plot [40]. A funnel plot is a scatter plot with the study sample size as the vertical axis and the observed effect sizes as the horizontal axis. If there is no publication bias, observed effect sizes of studies with smaller sample sizes will generally be more variable, while the expected mean effect size will be independent of the sample size. The shape of the scatter plot therefore will look like a symmetric funnel. In case of publication bias, the most extreme negative effect sizes or statistically nonsignificant effect sizes are not published and thus less likely to be included in the meta-analytic data set, resulting in a gap in the funnel plot for small studies and small or negative effect sizes. A funnel plot of the data from the example (see Note 2) is presented in Figure 2. It can be seen that there is indeed some evidence for publication bias: for small studies there are some extreme positive observed effect sizes but no extreme negative observed effect sizes, resulting in a higher mean effect size for smaller studies. The asymmetry of the funnel plot, however, is largely due to only two effect sizes, and may well be caused by coincidence. This is confirmed by performing a distribution-free statistical test of the correlation between effect size and sample size, based on Spearman's rho [2]. While there is a tendency toward a negative correlation, this relation is statistically not significant at the 0.05 level, z = −1.78, p = 0.08. Since a negative correlation between effect size and the sample size is expected in the presence of publication bias, an alternative approach to assess publication bias is to include the sample size as a moderator variable in a meta-analytic regression model. More information about methods for identifying and correcting for publication bias can be found in [2].

Figure 2  Funnel plot for the example data

One might expect that some of the problems will become less important due to the fact that since the rise of meta-analysis, researchers and editors became aware of the importance of publishing nonsignificant results and of reporting exact P values, effect sizes, or test statistics, and meta-analysts became aware of the importance of looking for nonpublished study results. In addition, some of the criticisms are especially applicable to the relatively simple early meta-analytic techniques. The problem of 'mixing apples and oranges', for instance, is less pronounced if the heterogeneity in effect sizes is appropriately modeled using a random effects model and/or by using moderator variables. 'Mixing apples and oranges' can even yield interesting information if a regression model is used, and the 'kind of fruit' is included in the model by means of one or more moderator variables. The use of the framework of multilevel models for meta-analysis further can offer an elegant solution for the problem of multiple effect size measures in some or all studies: a three-level model can be used, modeling within-study variation in addition to sampling variation and between-study variation (see [20] for an example). Nevertheless, a meta-analysis remains a tenuous statistical analysis that should be performed rigorously.
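The sample sizes needed for a funnel plot can be recovered from the reported standard errors by inverting Equation (3) under the equal-group-sizes assumption of Note 2. The following sketch (an added illustration using SciPy, not part of the original entry) does this and then computes a Spearman rank correlation between effect size and sample size as a simple check for funnel plot asymmetry; it is not the exact test reported in the text, so the z and p values will differ somewhat.

```python
from scipy.stats import spearmanr

d = [0.03, 0.12, -0.14, 1.18, 0.26, -0.06, -0.02, -0.32, 0.27, 0.80,
     0.54, 0.18, -0.02, 0.23, -0.18, -0.06, 0.30, 0.07, -0.07]
se = [0.125, 0.147, 0.167, 0.373, 0.369, 0.103, 0.103, 0.220, 0.164, 0.251,
      0.302, 0.223, 0.289, 0.290, 0.159, 0.167, 0.139, 0.094, 0.174]

# Invert Equation (3) with nE = nC = n: se^2 = 2/n + d^2/(4n), so n = (2 + d^2/4)/se^2
total_n = [round(2 * (2 + dj**2 / 4) / sj**2) for dj, sj in zip(d, se)]

rho, p = spearmanr(d, total_n)
print("recovered total sample sizes:", total_n)
print(f"Spearman correlation between effect size and sample size: {rho:.2f} (p = {p:.2f})")
```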
Literature and Software

The article by Glass [23] and the formulation of some of the simple meta-analytic techniques by Glass and colleagues (see e.g., [24], [25] and [64]) might be considered as the breakthrough of meta-analysis. Besides the calculation of the mean and standard deviation of the effect sizes, moderator variables are looked for by calculating correlation coefficients between study characteristics and effect sizes, by means of a multiple regression analysis or by performing separate meta-analyses for different groups of studies. In the 1980s, these meta-analytic techniques were further developed, and the focus moved from estimating the mean effect size to detecting and explaining study heterogeneity. Hunter, Schmidt, and Jackson [35, 39] especially paid attention to possible sources of error (e.g., sampling error, measurement error, range restriction, computational, transcriptional, and typographical errors) by which effect sizes are affected, and to the correction of effect sizes for these artifacts. Rosenthal [58] described some simple techniques to compare and combine P values or measures of effect size. A more statistically oriented introduction to meta-analysis, including an overview and discussion of methods for combining P values, is given by Hedges and Olkin [31]. Rubin [61] proposed the random effects model for meta-analysis, which was further developed by DerSimonian and Laird [11] and Hedges and Olkin [31]. Raudenbush and Bryk [51] showed that the general framework of hierarchical linear modeling encompasses a lot of previously described meta-analytic methodology, yielding similar results [70], but at the same time extending its possibilities by allowing random and fixed effects to be modeled simultaneously in a mixed effects model. Goldstein, Yang, Omar, Turner, and Thompson [27] illustrated the flexibility of the use of hierarchical linear models for meta-analysis. The flexibility of the hierarchical linear models makes them also applicable for, for instance, combining the results from single-case empirical studies [68, 69]. Parameters of these hierarchical linear models are usually estimated using maximum likelihood procedures, although other estimation procedures could be used, for instance Bayesian estimation [22, 65]. An excellent book for basic and advanced meta-analysis, dealing with each of the steps of a meta-analysis, is Cooper and Hedges [8]. This and other instructional, methodological, or application-oriented books on meta-analysis are reviewed by Becker [1]. More information about one of the steps in a meta-analysis, converting summary statistics, test statistics, P values and measures of effect size to a common measure, is found in [8], [43], [53], [54], [60] and [66].

Several packages for performing meta-analysis are available, such as META [63], which implements techniques for fixed and random effects models and can be downloaded free of charge together with a manual from http://www.fu-berlin.de/gesund/gesu−engl/meta−e.htm, Advanced BASIC Meta-analysis [45], implementing the ideas of Rosenthal, and MetaWin for performing meta-analyses using fixed, random, and mixed effects models and implementing parametric or resampling based tests (http://www.metawinsoft.com). Meta-analyses, however, can also be performed using general statistical packages, such as SAS [73]. A more complete overview of specialized and general software for performing meta-analysis, together with references to software reviews, is given on the homepages of William Shadish (http://faculty.ucmerced.edu/wshadish/Meta-Analysis%20Links.htm) and Alex Sutton (http://www.prw.le.ac.uk/epidemio/personal/ajs22/meta). Since meta-analytic models can be considered special forms of the hierarchical linear model, software and estimation procedures for hierarchical linear models can be used. Examples are MLwiN (http://multilevel.ioe.ac.uk), based on the work of the Centre for Multilevel Modelling from the Institute of Education in London [26], HLM (http://www.ssicentral.com/hlm/hlm.htm), based on the work of Raudenbush and Bryk [52], and SAS proc MIXED [74].
Notes

1. The sampling distribution of Q often is only roughly approximated by a chi-square distribution, resulting in a conservative or liberal homogeneity test. Van den Noortgate & Onghena [71], therefore, propose a bootstrap version of the homogeneity test.
2. Equation 3 was used to calculate group sizes based on the observed effect sizes and standard errors given in Table 1. In each study, the two group sizes were assumed to be equal.
References

[1] Becker, B.J. (1998). Mega-review: books on meta-analysis, Journal of Educational and Behavioral Statistics 1, 77–92.
[2] Begg, C. (1994). Publication bias, in The Handbook of Research Synthesis, H. Cooper & L.V. Hedges, eds, Sage, New York, pp. 399–409.
[3] Carter, D.L. (1971). The effect of teacher expectations on the self-esteem and academic performance of seventh grade students (Doctoral dissertation, University of Tennessee, 1970). Dissertation Abstracts International, 31, 4539–A. (University Microfilms No. 7107612).
[4] Claiborn, W. (1969). Expectancy effects in the classroom: a failure to replicate, Journal of Educational Psychology 60, 377–383.
[5] Cochran, W.G. (1954). The combination of estimates from different experiments, Biometrics 10, 101–129.
[6] Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences, 2nd Edition, Erlbaum, Hillsdale.
[7] Conn, L.K., Edwards, C.N., Rosenthal, R. & Crowne, D. (1968). Perception of emotion and response to teachers' expectancy by elementary school children, Psychological Reports 22, 27–34.
[8] Cooper, H. & Hedges, L.V. (1994a). The Handbook of Research Synthesis, Sage, New York.
[9] Cooper, H. & Hedges, L.V. (1994b). Research synthesis as a scientific enterprise, in The Handbook of Research Synthesis, H. Cooper & L.V. Hedges, eds, Sage, New York, pp. 3–14.
[10] Cronbach, L.J. (1980). Toward Reform of Program Evaluation, Jossey-Bass, San Francisco.
[11] DerSimonian, R. & Laird, N. (1986). Meta-analysis in clinical trials, Controlled Clinical Trials 7, 177–188.
[12] Evans, J. & Rosenthal, R. (1969). Interpersonal self-fulfilling prophecies: further extrapolations from the laboratory to the classroom, Proceedings of the 77th Annual Convention of the American Psychological Association 4, 371–372.
[13] Eysenck, H.J. (1978). An exercise in mega-silliness, American Psychologist 39, 517.
[14] Eysenck, H.J. (1995). Meta-analysis squared – does it make sense? The American Psychologist 50, 110–111.
[15] Fielder, W.R., Cohen, R.D. & Feeney, S. (1971). An attempt to replicate the teacher expectancy effect, Psychological Reports 29, 1223–1228.
[16] Fine, L. (1972). The effects of positive teacher expectancy on the reading achievement of pupils in grade two (Doctoral dissertation, Temple University, 1972). Dissertation Abstracts International, 33, 1510–A. (University Microfilms No. 7227180).
[17] Fiske, D.W. (1983). The meta-analytic revolution in outcome research, Journal of Consulting and Clinical Psychology 51, 65–70.
[18] Fleming, E. & Anttonen, R. (1971). Teacher expectancy or my fair lady, American Educational Research Journal 8, 241–252.
[19] Flowers, C.E. (1966). Effects of an arbitrary accelerated group placement on the tested academic achievement of educationally disadvantaged students (Doctoral dissertation, Columbia University, 1966). Dissertation Abstracts International, 27, 991–A. (University Microfilms No. 6610288).
[20] Geeraert, L., Van den Noortgate, W., Hellinckx, W., Grietens, H., Van Assche, V. & Onghena, P. (2004). The effects of early prevention programs for families with young children at risk for physical child abuse and neglect: a meta-analysis, Child Maltreatment 9(3), 277–291.
[21] Ginsburg, R.E. (1971). An examination of the relationship between teacher expectations and student performance on a test of intellectual functioning (Doctoral dissertation, University of Utah, 1970). Dissertation Abstracts International, 31, 3337–A. (University Microfilms No. 710922).
[22] Gelman, A., Carlin, J.B., Stern, H.S. & Rubin, D.B. (1995). Bayesian Data Analysis, Chapman & Hall, London.
[23] Glass, G.V. (1976). Primary, secondary, and meta-analysis of research, Educational Researcher 5, 3–8.
[24] Glass, G.V. (1977). Integrating findings: the meta-analysis of research, Review of Research in Education 5, 351–379.
[25] Glass, G.V., McGraw, B. & Smith, M.L. (1981). Meta-Analysis in Social Research, Sage, Beverly Hills.
[26] Goldstein, H. (2003). Multilevel Statistical Models, 3rd Edition, Arnold, London.
[27] Goldstein, H., Yang, M., Omar, R., Turner, R. & Thompson, S. (2000). Meta-analysis using multilevel models with an application to the study of class size effects, Journal of the Royal Statistical Society, Series C, Applied Statistics 49, 339–412.
[28] Grieger, R.M.H. (1971). The effects of teacher expectancies on the intelligence of students and the behaviors of teachers (Doctoral dissertation, Ohio State University, 1970). Dissertation Abstracts International, 31, 3338–A. (University Microfilms No. 710922).
[29] Gosset, W.S. (1908). The probable error of a mean, Biometrika 6, 1–25.
[30] Hedges, L.V. (1981). Distribution theory for Glass's estimator of effect size and related estimators, Journal of Educational Statistics 6, 107–128.
[31] Hedges, L.V. & Olkin, I. (1985). Statistical Methods for Meta-analysis, Academic Press, Orlando.
[32] Henrikson, H.A. (1971). An investigation of the influence of teacher expectation upon the intellectual and academic performance of disadvantaged children (Doctoral dissertation, University of Illinois Urbana-Champaign, 1970). Dissertation Abstracts International, 31, 6278–A. (University Microfilms No. 7114791).
[33] Holroyd, K.A. & Penzien, D.B. (1989). Letters to the editor. Meta-analysis minus the analysis: a prescription for confusion, Pain 39, 359–361.
[34] Hoyle, R.H. (1993). On the relation between data and theory, The American Psychologist 48, 1094–1095.
[35] Hunter, J.E. & Schmidt, F.L. (1990). Methods of Meta-analysis: Correcting Error and Bias in Research Findings, Sage, Newbury Park.
[36] Jose, J. & Cody, J. (1971). Teacher-pupil interaction as it relates to attempted changes in teacher expectancy of academic ability achievement, American Educational Research Journal 8, 39–49.
[37] Keshock, J.D. (1971). An investigation of the effects of the expectancy phenomenon upon the intelligence, achievement, and motivation of inner-city elementary school children (Doctoral dissertation, Case Western Reserve, 1970). Dissertation Abstracts International, 32, 243–A. (University Microfilms No. 7119010).
[38] Kester, S.W. & Letchworth, G.A. (1972). Communication of teacher expectations and their effects on achievement and attitudes of secondary school students, Journal of Educational Research 66, 51–55.
[39] Hunter, J.E., Schmidt, F.L. & Jackson, G.B. (1982). Meta-analysis: Cumulating Findings Across Research, Sage, Beverly Hills.
[40] Light, R.J. & Pillemer, D.B. (1984). Summing Up: The Science of Reviewing Research, Harvard University Press, Cambridge.
[41] Lipsey, M.W. & Wilson, D.B. (1993). The efficacy of psychological, educational, and behavioral treatment: confirmation from meta-analysis, American Psychologist 48, 1181–1209.
[42] Maxwell, M.L. (1971). A study of the effects of teacher expectations on the IQ and academic performance of children (Doctoral dissertation, Case Western Reserve, 1970). Dissertation Abstracts International, 31, 3345–A. (University Microfilms No. 7101725).
[43] Morris, S.B. & DeShon, R.P. (2002). Combining effect size estimates in meta-analysis with repeated measures and independent-groups designs, Psychological Methods 7, 105–125.
[44] Mosteller, F. & Colditz, G.A. (1996). Understanding research synthesis (meta-analysis), Annual Review of Public Health 17, 1–23.
[45] Mullen, B. (1989). Advanced BASIC Meta-analysis, Erlbaum, Hillsdale.
[46] National Research Council. (1992). Combining Information: Statistical Issues and Opportunities for Research, National Academy Press, Washington.
[47] Pellegrini, R. & Hicks, R. (1972). Prophecy effects and tutorial instruction for the disadvantaged child, American Educational Research Journal 9, 413–419.
[48] Pearson, K. (1904). Report on certain enteric fever inoculation statistics, British Medical Journal 3, 1243–1246.
[49] Pillemer, D.B. (1984). Conceptual issues in research synthesis, Journal of Special Education 18, 27–40.
[50] Raudenbush, S.W. (1984). Magnitude of teacher expectancy effects on pupil IQ as a function of the credibility of expectancy induction: a synthesis of findings from 18 experiments, Journal of Educational Psychology 76, 85–97.
[51] Raudenbush, S.W. & Bryk, A.S. (1985). Empirical Bayes meta-analysis, Journal of Educational Statistics 10, 75–98.
[52] Raudenbush, S.W. & Bryk, A.S. (2002). Hierarchical Linear Models: Applications and Data Analysis Methods, 2nd Edition, Sage, London.
[53] Ray, J.W. & Shadish, R. (1996). How interchangeable are different estimators of effect size? Journal of Consulting and Clinical Psychology 64, 1316–1325.
[54] Richardson, J.T.E. (1996). Measures of effect size, Behavior Research Methods, Instruments & Computers 28, 12–22.
[55] Rosenthal, R., Baratz, S. & Hall, C.M. (1974). Teacher behavior, teacher expectations, and gains in pupils' rated creativity, Journal of Genetic Psychology 124, 115–121.
[56] Rosenthal, R. & Jacobson, L. (1968). Pygmalion in the Classroom, Holt, Rinehart & Winston, New York.
[57] Rosenthal, R. (1979). The "file drawer problem" and tolerance for null results, Psychological Bulletin 86, 638–641.
[58] Rosenthal, R. (1991). Meta-analytic Procedures for Social Research, Sage, Newbury Park.
[59] Rosenthal, R. (1998). Meta-analysis: concepts, corollaries and controversies, in Advances in Psychological Science, Vol. 1: Social, Personal, and Cultural Aspects, J.G. Adair, D. Belanger & K.L. Dion, eds, Psychology Press, Hove.
[60] Rosenthal, R., Rosnow, R.L. & Rubin, D.B. (2000). Contrasts and Effect Sizes in Behavioral Research: A Correlational Approach, Cambridge University Press, Cambridge.
[61] Rubin, D.B. (1981). Estimation in parallel randomized experiments, Journal of Educational Statistics 6, 377–400.
[62] Schmidt, F.L. (1993). Data, theory, and meta-analysis: response to Hoyle, The American Psychologist 48, 1096.
[63] Schwarzer, R. (1989). Meta-analysis Programs. Program Manual, Institut für Psychologie, Freie Universität Berlin, Berlin.
[64] Smith, M.L. & Glass, G.V. (1977). Meta-analysis of psychotherapy outcome studies, American Psychologist 32, 752–760.
[65] Smith, T.S., Spiegelhalter, D.J. & Thomas, A. (1995). Bayesian approaches to random-effects meta-analysis: a comparative study, Statistics in Medicine 14, 2685–2699.
[66] Tatsuoka, M. (1993). Effect size, in Data Analysis in the Behavioral Sciences, G. Keren & C. Lewis, eds, Erlbaum, Hillsdale, pp. 461–479.
[67] Thompson, S.G. & Pocock, S.J. (1991). Can meta-analyses be trusted? Lancet 338, 1127–1130.
[68] Van den Noortgate, W. & Onghena, P. (2003a). Combining single-case experimental studies using hierarchical linear models, School Psychology Quarterly 18, 325–346.
[69] Van den Noortgate, W. & Onghena, P. (2003b). Hierarchical linear models for the quantitative integration of effect sizes in single-case research, Behavior Research Methods, Instruments, & Computers 35, 1–10.
[70] Van den Noortgate, W. & Onghena, P. (2003c). Multilevel meta-analysis: a comparison with traditional meta-analytical procedures, Educational and Psychological Measurement 63, 765–790.
[71] Van den Noortgate, W. & Onghena, P. (2003d). A parametric bootstrap version of Hedges' homogeneity test, Journal of Modern Applied Statistical Methods 2, 73–79.
[72] Van Mechelen, I. (1986). In search of an interpretation of meta-analytic findings, Psychologica Belgica 26, 185–197.
[73] Wang, M.C. & Bushman, B.J. (1999). Integrating Results Through Meta-analytic Review Using SAS Software, SAS Institute, Inc., Cary.
[74] Yang, M. (2003). A Review of Random Effects Modelling in SAS (Release 8.2), Institute of Education, London.
WIM VAN DEN NOORTGATE AND PATRICK ONGHENA
Microarrays. LEONARD C. SCHALKWYK. Volume 3, pp. 1217–1221.
Microarrays The culture of molecular biology values categorical results, and a generation of scientists used to bands on gels and DNA sequence is confronting statistics for the first time as they design microarray experiments and analyze data. This can sometimes be seen in the style of data presentation used, for example, the red–green false color images often used to summarize differences between the signal from a pair of RNA samples hybridized to the same array in an experiment using two fluorescent labels. This kind of experimental design allows many of the possible sources of measurement error to be at least equalized between a pair of samples, but gives no information on the reliability of the result. This has led to a widespread understanding of microarrays as a survey method, whose indicative results must be checked by other methods. As new uses for microarrays are conceived and costs come down, this is changing and more attention is being paid to statistical methods. This is particularly true for the highly standardized, industrially produced arrays such as the Affymetrix GeneChips.
The Affymetrix Array I will discuss the Affymetrix product in this article, though many aspects of the discussion are applicable to other kinds of microarrays. I will also limit the discussion to arrays used for gene expression analysis, although once again much of the discussion will also be relevant to microarrays designed for genotyping and other purposes. I will start with a brief description of the method, going through the steps of data analysis from the raw image upwards, addressing experimental design last. The principle of measurement in microarrays is nucleic acid hybridization. Immobilized on the array are known sequences of DNA (normally, although RNA or synthetic nucleic acid analogues would also be possible), known as probes. Reacted with these in solution is a mixture of unknown (labeled RNA or DNA) fragments to be analyzed, known as targets (traditionally these terms were used the other way around). The targets bind to the probes in (ideally) a sequence-specific manner and fill the available probe sites to an extent that depends on
the target concentration. After the unbound target is washed away, the quantity of target bound to each probe is determined by a fluorescence method. In the case of Affymetrix arrays, the probes are synthesized in situ using a proprietary photolithography method, which generates an unknown but small quantity of the desired sequence on an ever-smaller feature size. On current products the feature size is 11 µm square, allowing nearly 300 000 features on a 6-mm square chip. With arrays made by mechanical spotting or similar methods, locating features and generating a representative intensity value can be demanding, but grid location and adjustment are relatively straightforward using the physical anchorage and fluorescent guidespots in the Affymetrix system. Each of the grid squares (representing a feature) contains approximately 25 pixels after an outer rim has been discarded; these are averaged. With current protocols, saturated (SD = 0) cells are rare.
Normalization and Outliers Because the amplified fluorescence assay used in the Affymetrix system is in a single color, experimental fluctuations in labeling, hybridization, and scanning need to be taken into account by normalizing across arrays of an experiment. This has been an extensively studied topic in array studies of all kinds and there are many methods in use [10]. A related issue is the identification of outlying intensity values likely to be due to production flaws, imperfect hybridization, or dust specks. The Affymetrix software performs a simple scaling using a chip-specific constant chosen to make the trimmed mean of intensities across the entire chip a predetermined value. This is probably adequate for modest differences in overall intensity between experiments done in a single series, but it does assume linearity in the response of signal to target concentration, which is unlikely to be true across the entire intensity range. Several excellent third party programs that are free, at least for academic use, offer other options. dChip [9], for example, finds a subset of probes whose intensity ranking does not vary across an experimental set of chips and uses this to fit a normalization curve. An excellent toolbox for exploring these issues is provided by packages from the Bioconductor project, particularly affy [5].
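To make the idea of global scaling concrete, here is a minimal Python sketch (not the Affymetrix implementation): each chip is rescaled so that its trimmed mean of intensities matches a common target. The 2% trim, the target value of 500, and the simulated intensities are illustrative assumptions.

```python
import numpy as np
from scipy import stats

def scale_normalize(intensities, target=500.0, trim=0.02):
    """Rescale each chip (column) so that its trimmed mean equals `target`.

    intensities: 2-D array, rows = probes, columns = chips.
    Returns the rescaled array and the per-chip scaling factors.
    """
    trimmed = np.array([stats.trim_mean(intensities[:, j], trim)
                        for j in range(intensities.shape[1])])
    factors = target / trimmed
    return intensities * factors, factors

# Example: three chips with different overall brightness
rng = np.random.default_rng(0)
raw = rng.lognormal(mean=5, sigma=1, size=(1000, 3)) * np.array([1.0, 1.4, 0.7])
normalized, factors = scale_normalize(raw)
print(np.round(factors, 2))
```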
Summarizing the Probeset The high density made possible by the photolithographic method allows multiple probes to be used for each transcript (such a set of probes is called a probeset), which means that the (still not fully understood) variation in the binding characteristics of different oligonucleotide sequences can be buffered or accounted for. For each 25-nucleotide probe (perfect match (PM)), a control probe with a single base change in the central (13th) position is synthesized. This was conceived as a control for the sequence specificity of the hybridization or a background value, but this is problematic. Each probeset consists (in current Affymetrix products) of 11 such perfect match-mismatch probe pairs. The next level of data reduction is the distillation of the probeset into a single intensity value for the transcript. Methods of doing this remain under active discussion. Early Affymetrix software used a proprietary method to generate signal values from intensities, but Microarray Suite (MAS) versions 5 and later use a relatively well-documented robust mean approach [7]. This is a robust mean (Tukey's biweight) for the perfect match (PM) probes and separately for the mismatch (MM) probes. The latter is subtracted and the result is set to zero where MM > PM. Other summary methods use model-fitting with or without MM values or thermodynamic data on nucleic acid annealing, to take into account the different performance of different probe sequences. These may have effects on the variability of measurements [8, 13, 12]. It may, under some circumstances, make sense to treat the measurements from individual probes as separate measurements [1].
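The robust-mean step can be sketched as follows. This is a simplified illustration of the approach just described, not the exact MAS 5 signal algorithm (which, among other things, works with an 'ideal mismatch' on a log scale); the one-step Tukey biweight, its tuning constants, and the example intensities are assumptions made for the sketch.

```python
import numpy as np

def tukey_biweight(x, c=5.0, eps=1e-4):
    """One-step Tukey biweight location estimate of the values in x."""
    x = np.asarray(x, dtype=float)
    m = np.median(x)
    s = np.median(np.abs(x - m)) + eps               # MAD as a robust scale
    u = (x - m) / (c * s)
    w = np.where(np.abs(u) < 1, (1 - u ** 2) ** 2, 0.0)
    return np.sum(w * x) / np.sum(w)

def probeset_signal(pm, mm):
    """Robust mean of the PM probes minus robust mean of the MM probes, floored at zero."""
    return max(tukey_biweight(pm) - tukey_biweight(mm), 0.0)

# Example: a probeset with 11 PM/MM probe pairs
pm = [812, 940, 760, 1020, 890, 805, 1110, 950, 870, 990, 930]
mm = [210, 260, 190, 300, 240, 220, 330, 250, 230, 280, 260]
print(round(probeset_signal(pm, mm), 1))
```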
Statistical Analysis of Replicated Arrays Once a list of signal values (in some arbitrary units) is available, the interesting part begins and the user is largely on his own. By this I mean that specialized microarray software does not offer very extensive support for statistical evaluation of the data. An honorable exception is dChip which offers analysis of variance (ANOVA) through a link with R, and the affy package from Bioconductor [5] which is implemented in the same R statistical computing environment. It may be a wise design decision to limit the statistical features of specialized microarray software in that there is nothing unique about the data,
which really should be thought about and analyzed like other experimental data. The signal value for a given probeset is like any other single measurement, and variance and replication need to be considered just as in other types of experiments. Of course the issue of multiple testing requires particular attention, since tens of thousands of transcripts are measured in parallel. Having worked with the experimental data and reduced it to a value for each probeset (loosely, transcript), the analyst can transfer the data to the statistical package he likes best. Because statistical tests are done on each of thousands of transcripts, a highly programmable and flexible environment is needed. R is increasingly popular for this purpose and there are several microarray related packages available. The affy package (part of BioConductor) can be installed from a menu in recent versions of R, and offers methods and data structures to do this quite conveniently. Most of the available methods for summarizing probesets are available and are practical on a personal computer (PC) with one gigabyte (GB) of random access memory (RAM). Whether processing of the data is done in R or with other software, it is a real time saver, especially for the beginner, to have the data in a structure that is not unnecessarily complicated – normally a dataframe with one row for each probeset (and the probeset ID as row name), and columns for each chip with brief names identifying the sample. In advance of testing hypotheses, it may be good to filter the data, removing from consideration datasets that have no usable information, which may also aid in thinking about how much multiple testing is really being done. It is also the time to consider whether the data should be transformed. Because outlying signals should have been dealt with in the process of normalization and probeset summarization, the chief criterion for filtration is expression level. Traditional molecular biology wisdom is that roughly half of all genes are expressed in a given tissue. Whether a qualitative call of expression/no expression actually corresponds to a biological reality is open to question, and there is the added complication that any tissue will consist of multiple cell types with differing expression patterns. In practice, differences between weak expression signals will be dominated by noise. MAS 5 provides a presence/absence call for each probeset which can be used as a filtering criterion. One complication is that the fraction
of probesets 'present' varies from one hybridization to the next, and is used as an experimental quality measure. An alternative would be to consider those probesets whose signal values exceed the mean (or some other quantile) across the whole array. It is common to log-transform signal data; this is convenient in terms of informal fold-change criteria, although the statistical rationale is usually unclear. The distribution of signal values across a single array is clearly not normal, but the distribution that is generally of interest is that of a given probeset across multiple arrays, where there are fewer data. The natural starting point in analyzing the data from a set of microarrays is to look for differences between experimental groups. This will generally be a more or less straightforward ANOVA, applied to each of the probesets on the array used. Depending on the design of the experiment, it may be useful to use a generalized linear mixed model to reflect the different sources of variance. Some additional power should also be available from the fact that the dispersion of measurements in the thousands of probesets across the set of microarrays is likely to be similar. A reasonable approach to this may be 'variance shrinking' [4]. The most difficult and controversial aspect of the analysis is the extent to which the assumptions of ANOVA, in particular, independence within experimental groups, are violated. Peculiarly extreme views on this (especially with regard to inbred strains), as well as bogus inflated significance, are encountered. Pragmatically there will be some violation, but as long as the design and analysis take careful and fair account of the most important sources of variance, we are unlikely to be fooled. The resulting vector of P values, or more to the point, of probesets ranked by P values, then needs to be evaluated in light of the experimental question and your own opinions on multiple testing. As a starting point, Bonferroni correction (see Multiple Comparison Procedures) for the number of probesets is extremely, perhaps absurdly, conservative, but in many experiments there will be probesets that will differ even by this criterion. Many probesets will have background levels of signal for all samples, and among those that present something to measure, there will be many that are highly correlated. False-discovery-rate (FDR) methods [2, 11] offer a sensible
alternative, but in general, experimental results will be a ranked list of candidates, and where this list is cut off may depend on practicalities, or on other information. On the Affymetrix arrays, there is a substantial amount of duplication (multiple probesets for the same transcript), which may give additional support to some weak differences. In other cases, multiple genes of the same pathway, gene family, or genomic region might show coordinate changes for which the evidence may be weak when considered individually. This argues for considerable exploration of the data, and also public archiving of complete data sets. Hierarchical clustering of expression patterns across chips using a distance measure such as 1 − |r|, where r is the Pearson correlation between the patterns of two transcripts, is an often used example of a way of exploring the data for groups of transcripts with coordinate expression patterns. This is best done with a filtered candidate list as described earlier, because the bulk of transcripts will not differ, and little will be learned by clustering noise at high computational cost. There is a growing amount of annotation on each transcript and a particularly convenient set of tools for filtering, clustering, and graphic presentation with links to annotation is provided by dChip. In some experiments, the objective is not so much the identification of interesting genes as classification of samples (targets), such as diagnostic samples [6]. Here, classification techniques such as discriminant analysis or supervised neural networks can be used to allocate samples to prespecified groups.
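As a rough sketch of the per-probeset testing and false-discovery-rate ranking discussed in this section, the following code applies a row-wise two-group test and the Benjamini-Hochberg procedure [2]; the simulated data, group sizes, and the use of Welch t tests (rather than a full ANOVA or mixed model) are illustrative choices.

```python
import numpy as np
from scipy import stats

def two_group_fdr(signals, group_a, group_b, alpha=0.05):
    """Row-wise Welch t tests plus a Benjamini-Hochberg FDR cut-off.

    signals: 2-D array, rows = probesets, columns = chips.
    Returns the indices of probesets declared significant, ordered by P value,
    together with the full vector of P values.
    """
    _, pvals = stats.ttest_ind(signals[:, group_a], signals[:, group_b],
                               axis=1, equal_var=False)
    order = np.argsort(pvals)
    m = len(pvals)
    thresholds = alpha * np.arange(1, m + 1) / m      # k/m * alpha
    passed = np.nonzero(pvals[order] <= thresholds)[0]
    k = passed.max() + 1 if passed.size else 0        # largest k with p_(k) <= k/m * alpha
    return order[:k], pvals

# Example: 5000 probesets, 4 chips per group, the first 50 probesets truly shifted
rng = np.random.default_rng(1)
x = rng.normal(size=(5000, 8))
x[:50, 4:] += 1.5
hits, pvals = two_group_fdr(x, group_a=[0, 1, 2, 3], group_b=[4, 5, 6, 7])
print(len(hits), "probesets pass the FDR threshold")
```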
Experimental Design In developing an experimental design, initially scientists tend to be mesmerized by the high cost of a single determination, which is currently about £500/$1000 in consumables for one RNA on one array. The cost of spotted cDNA arrays (especially the marginal cost for a laboratory making their own) can be much lower, but because of lower spotting densities and lower information per spot (because of greater technical variation), it is not simple to make a fair comparison. The Affymetrix software is designed to support the side-by-side comparison so loved by molecular biologists. A ‘presence’ call is produced complete with a P value for each probeset, and there is provision for pairwise analysis in
which a ‘difference’ call with a P value is produced. These treat the separate probes of each probeset as separate determinations in order to provide a semblance of statistical support to the values given. This is not a fair assessment of even the purely technical error of the experimental determination, because it knows nothing of the differences in RNA preparation, amplification, and labeling between the two samples. Overall intensity differences between chips resulting from differences in labeling efficiency should be largely dealt with by normalization, but it is easy to see how there could be transcript-specific differences. RNA species differ in stability (susceptibility to degradation by nucleases) and length, for example. Nonetheless, the protocols are quite well standardized and the two chip side-by-side design does give a reasonably robust indication of large differences between samples, which is also what has traditionally been sought from the two color spotted cDNA array experiment. The need for replicates starts to become obvious when the resulting list of candidate differences is reviewed – it may be fairly long, and the cost of follow up of the levels of many individuals by other measurement methods such as quantitative polymerase chain reaction (PCR) mounts quickly. The greatest source of variation that needs to be considered in the experimental design is the biological variation in the experimental material. This is particularly so when dealing with human samples where it has to be remembered that every person (except for identical twins) is a unique genetic background that has experienced a unique and largely uncontrolled environment. Further variation comes from tissue collection, and it is often uncertain whether tissue samples are precisely anatomically comparable. This biological variation can be physically averaged using pooling. The temptation to do this should be resisted as much as is practical because it does not give any information on variance. Realistically, experimental designs are decided after dividing the funds available by the cost of a chip. There is no rule of thumb for how many replicates are necessary, since this depends on the true effect size and the number of groups to be compared. Even with eight experimental groups and four replicates of each, there is only 80% power to detect between-group variability of one within-group SD. Whether this corresponds to a 10% difference or a twofold difference depends on the variability of
the biological material (see Sample Size and Power Calculation). Rather than minimizing cost by minimizing array replicates, value for money can ideally be maximized by using materials which can be mined many times. An outstanding example of this is the WebQTL database [3], where microarray data on a growing number of tissues are stored along with other phenotype data for the BxD mouse recombinant inbred panel. Here, the value of the data for repeated mining is particularly great, because the data derives from a genetically reproducible biological material (inbred mouse strains).
References
[1] Barrera, L., Benner, C., Tao, Y.-C., Winzeler, E. & Zhou, Y. (2004). Leveraging two-way probe-level block design for identifying differential gene expression with high-density oligonucleotide arrays, BMC Bioinformatics 5, 42.
[2] Benjamini, Y. & Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing, Journal of the Royal Statistical Society Series B 57, 289–300.
[3] Chesler, E.J., Lu, L., Wang, J., Williams, R.W. & Manly, K.F. (2004). WebQTL: rapid exploratory analysis of gene expression and genetic networks for brain and behavior, Nature Neuroscience 7, 485–486.
[4] Churchill, G. (2004). Using ANOVA to analyze microarray data, BioTechniques 37, 173–175.
[5] Gautier, L., Cope, L., Bolstad, B.M. & Irizarry, R.A. (2004). Affy-analysis of Affymetrix GeneChip data at the probe level, Bioinformatics 20, 307–315.
[6] Golub, T.R., Slonim, D.K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J.P., Coller, H., Loh, M.L., Downing, J.R., Caligiuri, M.A., Bloomfield, C.D. & Lander, E.S. (1999). Molecular classification of cancer: class discovery and class prediction by gene expression monitoring, Science 286, 531–537.
[7] Affymetrix Part No. 701110 Rev 1 (2001). http://www.affymetrix.com/support/technical/technotes/statistical reference guide.pdf
[8] Li, C. & Wong, W.H. (2001a). Model-based analysis of oligonucleotide arrays: expression index computation and outlier detection, Proceedings of the National Academy of Sciences of the United States of America 98, 31–36.
[9] Li, C. & Wong, W.H. (2001b). Model-based analysis of oligonucleotide arrays: model validation, design issues and standard error application, Genome Biology 2(8), research0032.1–0032.11.
[10] Quackenbush, J. (2001). Microarray data normalization and transformation, Nature Genetics 32(Suppl), 496–501.
[11] Storey, J.D. & Tibshirani, R. (2003). Statistical significance for genome-wide studies, Proceedings of the National Academy of Sciences of the United States of America 100, 9440–9445.
[12] Wu, Z. & Irizarry, R. (2004). Preprocessing of oligonucleotide array data, Nature Biotechnology 22, 656–657.
[13] Zhang, L., Miles, M.F. & Adalpe, F.D. (2003). A model of molecular interactions on short oligonucleotide arrays, Nature Biotechnology 21, 818–821.
LEONARD C. SCHALKWYK
Mid-P Values. VANCE W. BERGER. Volume 3, pp. 1221–1223.
Mid-P Values One of the most important statistical procedures is the hypothesis test, in which a formal analysis is conducted to produce a P value, which is used to summarize the strength of evidence against the null hypothesis. There are a few complicated testing procedures that are not based on a test statistic [3]. However, these procedures are the exception, and the general rule is that a test statistic is the basis for a hypothesis test. For example, if one were to test the fairness of a coin, one could toss the coin a given number of times, say ten times, and record the number of heads observed. This number of heads would serve as the test statistic, and it would be compared to a known null reference distribution, constructed under the assumption that the coin is, in fact, fair. The logic is an application of modus tollens [1], which states that if A implies B, then not B implies not A. If the coin is fair, then we expect (it is likely that we will see) a certain number of heads. If instead we observe a radically different number of heads, then we conclude that the coin was not fair. How many heads do we need to observe to conclude that the coin is not fair? The distance between the observed data and what would be predicted by the null hypothesis (in this case, that the coin is fair) is generally measured not by the absolute magnitude of the deviation (say the number of heads minus the null expected value, five), but rather by a P value. This P value is the probability, computed under the assumption that the null hypothesis is true, of observing a result as extreme as, or more extreme than, the one we actually observed. For example, if we are conducting a one-sided test, we would like to conclude that the coin is not fair if, in fact, it is biased toward producing too many heads. We then observe eight heads out of the ten tosses. The P value is the probability of observing eight, nine, or ten heads when flipping a fair coin. Using the binomial distribution (see Binomial Distribution: Estimating and Testing Parameters), we can compute the P value, which is [45 + 10 + 1]/1024, or 0.0547. It is customary to test at the 0.05 significance level, meaning that the P value would need to be less than 0.05 to be considered significant. Clearly, eight of ten heads is not significant at the 0.05 significance level, and we cannot rule out that the coin is fair. One may ask if we could have rejected
the null hypothesis had we observed instead nine heads. In this case, the P value would be 11/1024, or 0.0107. This result would be significant at the 0.05 level and gives us a decision rule: reject the null hypothesis if nine or ten heads are observed in ten flips of the coin. Generally, for continuous distributions, the significance level that is used to determine the decision rule, 0.05 in this case, is also the null probability of rejection, or the probability of a Type I error. But in this case, we see that a Type I error occurs if the coin is fair and we observe nine or ten heads, and this outcome occurs with probability 0.0107, not 0.05. There is a discrepancy between the intended Type I error rate, 0.05, and the actual Type I error rate, 0.0107. It is unlikely that there would be any serious objection to having an error probability that is smaller than it was intended to be. However, a consequence of this conservatism is that the power to detect the alternative also suffers, and this would lead to objections. Even if the coin is biased toward heads, it may not be so biased that nine or ten heads will typically be observed. The extent of conservatism will be decreased with a larger sample size, but there is an approach to dealing with conservatism without increasing the sample size. Specifically, consider these two probabilities, P {X > k} and P {X ≥ k}, where X is the test statistic expressed as a random variable (in our example, the number of heads to be observed) and k is the observed value of X (eight in our example). If these two quantities, P {X > k} and P {X ≥ k}, were the same, as they would be with a continuous reference distribution for X, then there would be no discreteness, no conservatism, no associated loss of power, and no need to consider the mid-P value. But these two quantities are not the same when dealing with a discrete distribution, and using the latter makes the P value larger than it ought to be, because it includes the null probability of all outcomes with test statistic equal to k. That is, there are 45 outcomes (ordered sets of eight heads and two tails), and all have the same value of the test statistic, eight. What if we used just half of these outcomes? More precisely, what if we used half the probability of these outcomes, instead of all of it? We could compute P {X > k} = 0.0107 and P {X = k} = 0.04395, so P {X ≥ k} = 0.0107 + 0.04395 = 0.0547 (as before), or we could use instead 0.0107 + (0.04395)/2 = 0.0327. This latter
computation leads to the mid-P value. In general, the mid-p is P {X > k} + (1/2)P {X = k}. See [4–7] for more details regarding the development of the mid-P value. Certainly, the mid-P value is smaller than the usual P value, and so it is less conservative. However, it does not allow one to recover the basic quantities, P {X > k} and P {X ≥ k}. Moreover, it is not a true P value, as it is anticonservative, in the sense that it does not preserve the true Type I error rate [2]. One modification of the mid-P value is based on recognizing the mid-P value as being the midpoint of the interval [P {X > k}, P {X ≥ k}]. Instead of presenting simply the midpoint of this interval, why not present the entire interval, in the form [P {X > k}, P {X ≥ k}]? This is the P value interval [2], and, as mentioned, it has as its midpoint the mid-P value. But it also tells us the usual P value as its upper endpoint, and it shows us the ideal P value, in the absence of any conservatism, as its lower endpoint. For the example, the P value interval would be (0.0107, 0.0547).
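For the coin example, the ordinary P value, the mid-P value, and the P value interval can all be computed directly from the binomial distribution; a small sketch (the function name is illustrative):

```python
from scipy.stats import binom

def p_value_summary(k, n=10, p=0.5):
    """One-sided P value, mid-P value, and P value interval for k observed heads."""
    p_greater = binom.sf(k, n, p)        # P{X > k}
    p_equal = binom.pmf(k, n, p)         # P{X = k}
    p_value = p_greater + p_equal        # P{X >= k}, the usual P value
    mid_p = p_greater + 0.5 * p_equal
    return p_value, mid_p, (p_greater, p_value)

p_val, mid_p, interval = p_value_summary(8)
print(round(p_val, 4), round(mid_p, 4), tuple(round(x, 4) for x in interval))
# 0.0547 0.0327 (0.0107, 0.0547)
```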
References
[1] Asogawa, M. (2000). A connectionist production system, which can perform both modus ponens and modus tollens simultaneously, Expert Systems 17, 3–12.
[2] Berger, V.W. (2001). The p value interval as an inferential tool, Journal of the Royal Statistical Society, Series D (The Statistician) 50, 79–85.
[3] Berger, V.W. & Sackrowitz, H. (1997). Improving tests for superior treatment in contingency tables, Journal of the American Statistical Association 92, 700–705.
[4] Berry, G. & Armitage, P. (1995). Mid-P confidence intervals: a brief review, Journal of the Royal Statistical Society, Series D (The Statistician) 44(4), 417–423.
[5] Lancaster, H.O. (1949). The combination of probabilities arising from data in discrete distributions, Biometrika 36, 370–382.
[6] Lancaster, H.O. (1952). Statistical control of counting experiments, Biometrika 39, 419–422.
[7] Lancaster, H.O. (1961). Significance tests in discrete distributions, Journal of the American Statistical Association 56, 223–234.
VANCE W. BERGER
Minimum Spanning Tree. GEORGE MICHAILIDIS. Volume 3, pp. 1223–1229.
Minimum Spanning Tree Introduction The minimum spanning tree (MST) problem is one of the oldest problems in graph theory, dating back to the early 1900s [13]. MSTs find applications in such diverse areas as least cost electrical wiring, minimum cost connecting communication and transportation networks, network reliability problems, minimum stress networks, clustering and numerical taxonomy (see Cluster Analysis: Overview), algorithms for solving traveling salesman problems, and multiterminal network flows, among others. At the theoretical level, its significance stems from the fact that it can be solved in polynomial time – the execution time is bounded by a polynomial function of the problem size – by greedy type algorithms. Such algorithms always take the best immediate solution while searching for an answer. Because of their myopic nature, they are relatively easy to construct, but in many optimization problems they lead to suboptimal solutions. Problems solvable by greedy algorithms have been identified with the class of matroids. Furthermore, greedy algorithms have been extensively studied for their complexity structure and MST algorithms have been at the center of this endeavor. In this paper, we present the main computational aspects of the problem and discuss some key applications in statistics, probability, and data analysis.
Some Useful Preliminaries In this section, we introduce several concepts that prove useful for the developments that follow.

Definition 1: An undirected graph G = (V, E) is a structure consisting of two sets V and E. The elements of V are called vertices (nodes) and the elements of E are called edges. An edge e = (u, v), u, v ∈ V, is an unordered pair of vertices u and v. An example of an undirected graph is shown in Figure 1. Notice that there is a self-loop (an edge whose endpoints coincide) for vertex E and a multiedge between nodes B and E.

Definition 2: A path p = {v0, v1, ..., vk} in a graph G from vertex v0 to vertex vk is a sequence of vertices such that (vi, vi+1) is an edge in G for 0 ≤ i < k. Any edge may be used only once in a path.

Definition 3: A cycle in a graph G is a path whose end vertices are the same; that is, v0 = vk. Notice the cycle formed by the edges connecting nodes D, F, and G.

Definition 4: A graph G is said to be connected if there is a path between every pair of vertices.

Definition 5: A tree T is a connected graph that has no cycles (acyclic graph).

Definition 6: A spanning tree T of a graph G is a subgraph of G that is a tree and contains all the vertices of G. A spanning tree is shown in Figure 2.

Figure 1 Illustration of an undirected graph

Figure 2 A simple undirected graph and a corresponding spanning tree
Figure 3 A simple undirected graph and its adjacency matrix W

Figure 4 A weighted undirected graph and its adjacency matrix W
A useful mathematical representation of an undirected graph G = (V , E) is through its adjacency matrix W , which is a |V | × |V | matrix with Wij = 1, if an edge exists between vertices i and j and Wij = 0, otherwise. An example of an undirected graph and its adjacency matrix is given in Figure 3. It can easily be seen that W is a binary symmetric matrix with zero diagonal elements. In many cases, the edges are associated with nonnegative weights that capture the strength of the relationship between the end vertices. This gives rise to a weighted undirected graph that can be represented by a symmetric matrix with nonnegative entries, as shown in Figure 4.
The Minimum Spanning Tree Problem The MST problem is defined as follows: Let G = (V , E) be a connected weighted graph. Find a spanning tree of G whose total edge-weight is a minimum. Remark: When V is a subset of a metric space (e.g., V represents points on the plane equipped with some distance measure), then a solution T to the MST problem represents the shortest network connecting all points in V .
As with most graph theory problems, the MST problem is very simple to state, and it has attracted a lot of interest for finding its solution. A history of the problem and the algorithmic approaches proposed for its solution can be found in Graham and Hell [13], with an update in Nesetril [20]. As the latter author observes, '(the MST) is a cornerstone of combinatorial optimization and in a sense its cradle'. We present next two classical (textbook) algorithms due to Prim [21] and Kruskal [16], respectively. For each algorithm, the solution T is initialized with the minimum weight edge and its two endpoints. Furthermore, let v(T) denote the number of vertices in T and v(F) the number of vertices in a collection of trees, that is, a forest.
Prim's algorithm: While v(T) < |V| do:
• Interrogate edges (in increasing order of their weights) until one is found that has one of its endpoints in T and its other endpoint in V − T and has the minimum weight among all edges that satisfy the above requirement.
• Add this edge and its endpoint to T and increase v(T) by 1.
Kruskal's algorithm: While v(F) < |V| do:
• Interrogate edges (in increasing order of their weights) until one is found that does not generate a cycle in the current forest F.
• Add this edge and its endpoints to F and increase v(F) by 1 or 2.

Figure 5 The progression of the two classical MST algorithms with a total cost of 7.4
An illustration of the progression of the two algorithms on a toy graph is shown in Figure 5. Notice that whereas Prim's algorithm grows a single tree, Kruskal's algorithm grows a forest. For a proof of the optimality of these two algorithms, see [23]. In the worst-case scenario, these two algorithms take approximately |E| × log |V| steps to compute the solution, a cost that is mainly dominated by the sorting step. Over the last two decades, better algorithms with essentially a number of steps proportional to the number of edges have been proposed by several authors; see [5, 7, 11]. However, the work of Moret et al. [19] suggests that in practice Prim's algorithm outperforms its competitors on average.
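A minimal Python sketch of Kruskal's algorithm, using a union-find structure to detect cycles, may make the procedure concrete. The edge weights below reproduce those of the toy graph of Figure 5 (total MST weight 7.4), but the exact topology of that graph is assumed here for illustration.

```python
def kruskal_mst(n_vertices, edges):
    """Kruskal's algorithm. `edges` is a list of (weight, u, v) tuples with
    vertices labeled 0 .. n_vertices - 1; returns the list of MST edges."""
    parent = list(range(n_vertices))

    def find(v):                           # root of v's component, with path halving
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    mst = []
    for w, u, v in sorted(edges):          # interrogate edges by increasing weight
        ru, rv = find(u), find(v)
        if ru != rv:                       # accept only edges that do not close a cycle
            parent[ru] = rv
            mst.append((w, u, v))
            if len(mst) == n_vertices - 1:
                break
    return mst

# Five vertices A..E coded as 0..4; weights as in Figure 5, topology assumed
edges = [(1.5, 0, 3), (1.5, 3, 4), (2.2, 0, 1), (2.2, 2, 4),
         (3.3, 1, 2), (3.3, 1, 4), (3.3, 2, 3), (8.0, 0, 4)]
tree = kruskal_mst(5, edges)
print(tree, "total weight:", round(sum(w for w, _, _ in tree), 1))
```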
Applications of the MST In this section, we provide a brief review of some important applications and results of the MST in statistics, data analysis, and probability.
Multivariate Two-sample Tests In a series of papers, Friedman and Rafsky [8, 9] proposed the following generalization of classical
nonparametric two-sample tests for multivariate data (see Multivariate Analysis: Overview): suppose we have two multivariate samples on $\mathbb{R}^d$, $X_n = \{X_1, X_2, \ldots, X_n\}$ from a distribution $F_X$ and $Y_m = \{Y_1, Y_2, \ldots, Y_m\}$ from another distribution $F_Y$. The null hypothesis of interest is $H_0: F_X = F_Y$ and the alternative hypothesis is $H_1: F_X \neq F_Y$. The proposed test procedure is as follows:
1. Generate a weighted graph G whose n + m nodes represent the data points of the pooled samples, with the edge weights corresponding to Euclidean distances between the points.
2. Construct the MST, T, of G.
3. Remove all the edges in T for which the endpoints come from different samples.
4. Define the test statistic $R_{n,m}$ = # of disjoint subtrees.
Using the asymptotic normality of $R_{n,m}$ under $H_0$, the null hypothesis is rejected at the significance level α if
$$\frac{R_{n,m} - E(R_{n,m})}{\sqrt{\operatorname{Var}(R_{n,m} \mid C_{n,m})}} < \Phi^{-1}(\alpha), \qquad (1)$$
where $\Phi^{-1}(\alpha)$ is the α-quantile of the standard normal distribution and $C_{n,m}$ is the number of edge pairs of T that share a common node. Expressions for $E(R_{n,m})$ and $\operatorname{Var}(R_{n,m} \mid C_{n,m})$ are given in [8]. In [14] it is further shown that $R_{n,m}$ is asymptotically distribution-free under $H_0$, and that the above test is universally consistent.
Figure 6 Demonstration of the multivariate two-sample runs test
A demonstration of the two-sample test is given in Figure 6. The 40 points come from a two-dimensional multivariate normal distribution with mean zero, variances $\sigma_1^2 = \sigma_2^2 = 0.36$, and covariance $\sigma_{12} = 0$. The 30 points come from a two-dimensional multivariate normal distribution with mean vector (2, 2), variances $\sigma_1^2 = \sigma_2^2 = 1$, and covariance $\sigma_{12} = 0$ (see Catalogue of Probability Density Functions). The value of the observed test statistic is $R_{40,30} = 6$ and the z-score is around −6.1, which suggests that $H_0$ is rejected in favor of $H_1$. Remark: In [8], a Smirnov type of test using a rooted MST is also presented for the multivariate two-sample problem.
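A sketch of the computation of $R_{n,m}$ using SciPy's MST routine is given below: pool the two samples, build the Euclidean MST, delete the edges whose endpoints come from different samples, and count the resulting disjoint subtrees. The simulated samples roughly mimic the demonstration above; the comparison with the critical value in (1), which requires the moments given in [8], is omitted.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components

def friedman_rafsky_runs(sample_x, sample_y):
    """Number of disjoint subtrees left after cutting between-sample MST edges."""
    pooled = np.vstack([sample_x, sample_y])
    labels = np.array([0] * len(sample_x) + [1] * len(sample_y))
    dist = squareform(pdist(pooled))                # Euclidean distance matrix
    mst = minimum_spanning_tree(dist).toarray()     # each MST edge stored once
    rows, cols = np.nonzero(mst)
    keep = labels[rows] == labels[cols]             # within-sample edges only
    pruned = np.zeros_like(mst)
    pruned[rows[keep], cols[keep]] = mst[rows[keep], cols[keep]]
    n_subtrees, _ = connected_components(pruned, directed=False)
    return n_subtrees

rng = np.random.default_rng(2)
x = rng.normal(0, 0.6, size=(40, 2))      # mean zero, variance 0.36
y = rng.normal(2, 1.0, size=(30, 2))      # mean (2, 2), variance 1
print("R =", friedman_rafsky_runs(x, y))  # a small value of R favors rejecting H0
```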
MST and Multivariate Data Analysis The goal of many multivariate data analytic techniques is to uncover interesting patterns and relationships in multivariate data (see Multivariate Analysis: Overview). In this direction, the MST has proved a useful tool for grouping objects into homogeneous groups, but also in visualizing the structure of multivariate data.
The presentation of Prim's algorithm in the section titled 'The Minimum Spanning Tree Problem' basically shows that the MST is essentially identical to the single linkage agglomerative hierarchical clustering algorithm [12] (see Hierarchical Clustering). By removing the K − 1 highest weight edges, one obtains a clustering solution with K groups [25]. An illustration of the MST as a clustering tool is given in Figure 7, where the first cluster corresponds to objects along a circular pattern and the second cluster to a square pattern inside the circle. It is worth noting that many popular clustering algorithms such as K-means and other agglomerative algorithms are going to miss the underlying group structure, as the bottom right panel in Figure 7 shows. The MST has also been used for identifying influential multivariate observations [15] (see Multivariate Outliers), for highlighting inaccuracies of low-dimensional representations of high-dimensional data through multidimensional scaling [3] and for visualizing structure in high dimensions [17] (see k means Analysis).
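The clustering use of the MST can be sketched as follows: build the MST of the data and delete the K − 1 heaviest edges, so that the remaining connected components are the K single linkage clusters. The ring-and-square data below are an illustrative stand-in for the example of Figure 7.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components

def mst_clusters(points, n_clusters):
    """Cut the n_clusters - 1 heaviest MST edges and label the resulting components."""
    mst = minimum_spanning_tree(squareform(pdist(points))).toarray()
    rows, cols = np.nonzero(mst)
    k_cut = n_clusters - 1
    heaviest = np.argsort(mst[rows, cols])[-k_cut:] if k_cut > 0 else []
    mst[rows[heaviest], cols[heaviest]] = 0.0       # remove the heaviest edges
    _, labels = connected_components(mst, directed=False)
    return labels

# A noisy circular pattern with a tight square pattern inside it
rng = np.random.default_rng(3)
angles = rng.uniform(0, 2 * np.pi, 80)
circle = np.c_[np.cos(angles), np.sin(angles)] + rng.normal(0, 0.03, (80, 2))
square = rng.uniform(-0.25, 0.25, size=(40, 2))
labels = mst_clusters(np.vstack([circle, square]), n_clusters=2)
print(np.bincount(labels))   # sizes of the clusters obtained
```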
Figure 7 A two-dimensional data example with two underlying clusters (top left panel), the corresponding MST (top right panel), the single linkage dendrogram (bottom left panel) and the average linkage dendrogram (bottom right panel)
The MST in Geometric Probability The importance of the MST in many practical situations is that it determines the dominant skeleton structure of a point set by outlining the shortest path between nearest neighbors. Specifically, given a set of points $X_n = \{x_1, x_2, \ldots, x_n\}$ in $\mathbb{R}^d$, the MST $T(X_n)$ connects all the points in the set by using as the weight function on the edges of the underlying complete graph the Euclidean distance. Steele [22] established that if the distances $d_{ij}$, $1 \le i, j \le n$, are independent and identically distributed with common cdf $F(d_{ij})$, then
$$\lim_{n\to\infty} \frac{T(X_n)}{n^{(d-1)/d}} = \beta(d) \quad \text{a.s.}, \qquad (2)$$
where β(d) is a constant that depends only on the dimension d. The above result is a variation on a theme in geometric probability pioneered by the celebrated paper of Bearwood et al. [2] on the length of the traveling salesman tour in Euclidean space. The exact value of the constant β(d) is not known in general (see [1]). The above result has provided a theoretical basis for using MSTs for image registration [18], for pattern recognition problems [24] and in assignment problems in wireless networks [4].
Concluding Remarks An MST is an object that has attracted a lot of interest in graph and network theory. In this paper, we have also shown several of its uses in various areas of statistics and probability. Some other applications of MSTs have been in defining measures of multivariate association [10] (an extension of Kendall’s τ measure) and more recently in estimating the intrinsic dimensionality of a nonlinear manifold from sparse data sampled from it [6].
Acknowledgments The author would like to thank Jan de Leeuw and Brian Everitt for many useful suggestions and comments. This work was supported in part by NSF under grants IIS–9988095 and DMS–0214171 and NIH grant 1P41RR018627–01.

References
[1] Avram, F. & Bertsimas, D. (1992). The minimum spanning tree constant in geometric probability and under the independent model: a unified approach, Annals of Applied Probability 2, 113–130.
[2] Bearwood, J., Halton, J.H. & Hammersley, J.M. (1959). The shortest path through many points, Proceedings of the Cambridge Philosophical Society 55, 299–327.
[3] Bienfait, B. & Gasteiger, J. (1997). Checking the projection display of multivariate data with colored graphs, Journal of Molecular Graphics and Modeling 15, 203–215.
[4] Blough, D.M., Leoncini, M., Resta, G. & Santi, P. (2002). On the symmetric range assignment problem in wireless ad hoc networks, 2nd IFIP International Conference on Theoretical Computer Science, Montreal, Canada.
[5] Chazelle, B. (2000). A minimum spanning tree algorithm with inverse Ackermann type complexity, Journal of the ACM 47, 1028–1047.
[6] Costa, J. & Hero, A.O. (2004). Geodesic entropic graphs for dimension and entropy estimation in manifold learning, IEEE Transactions on Signal Processing 52(8), 2210–2221.
[7] Fredman, M.L. & Tarjan, R.E. (1987). Fibonacci heaps and their use in improved network optimization algorithms, Journal of the ACM 34, 596–615.
[8] Friedman, J. & Rafsky, L. (1979). Multivariate generalizations of the Wald-Wolfowitz and Smirnov two-sample tests, Annals of Statistics 7, 697–717.
[9] Friedman, J. & Rafsky, L. (1981). Graphics for the multivariate two-sample tests, Journal of the American Statistical Association 76, 277–295.
[10] Friedman, J. & Rafsky, L. (1981). Graph-theoretic measures of multivariate association and prediction, Annals of Statistics 11, 377–391.
[11] Gabow, H.N., Galil, Z., Spencer, T.H. & Tarjan, R.E. (1986). Efficient algorithms for minimum spanning trees on directed and undirected graphs, Combinatorica 6, 109–122.
[12] Gower, J.C. & Ross, G.J.S. (1969). Minimum spanning trees and single linkage clustering, Applied Statistics 18, 54–64.
[13] Graham, R.L. & Hell, P. (1985). On the history of the minimum spanning tree problem, Annals of the History of Computing 7, 43–57.
[14] Henze, N. & Penrose, M.D. (1999). On the multivariate runs test, Annals of Statistics 27, 290–298.
[15] Jolliffe, I.T., Jones, B. & Morgan, B.J.T. (1995). Identifying influential observations in hierarchical cluster analysis, Journal of Applied Statistics 22, 61–80.
[16] Kruskal, J.B. (1956). On the shortest spanning subtree of a graph and the traveling salesman problem, Proceedings of the American Mathematical Society 7, 48–50.
[17] Kwon, S. & Cook, D. (1998). Using a grand tour and minimal spanning tree to detect structure in high dimensions, Computing Science and Statistics 30, 224–228.
[18] Ma, B., Hero, A.O., Gorma, J. & Olivier, M. (2000). Image registration with minimum spanning tree algorithms, Proceedings of IEEE International Conference on Image Processing 1, 481–484.
[19] Moret, B.M.E. & Shapiro, H.D. (1994). An empirical assessment of algorithms for constructing a minimum spanning tree, DIMACS Monographs in Discrete Mathematics and Theoretical Computer Science 15, 99–117.
[20] Nesetril, J. (1997). Some remarks on the history of the MST problem, Archivum Mathematicum 33, 15–22.
[21] Prim, R.C. (1957). The shortest connecting network and some generalizations, Bell Systems Technical Journal 36, 1389–1401.
[22] Steele, M.J. (1988). Growth rates of Euclidean minimal spanning trees with power weighted edges, Annals of Probability 16, 1767–1787.
[23] Tarjan, R.E. (1983). Data Structures and Network Algorithms, Society for Industrial and Applied Mathematics, Philadelphia.
[24] Toussaint, G. (1980). The relative neighborhood graph of a finite planar set, Pattern Recognition 13, 261–268.
[25] Zahn, C.T. (1971). Graph theoretic methods for detecting and describing gestalt clusters, IEEE Transactions on Computers 1, 68–86.
GEORGE MICHAILIDIS
Misclassification Rates. WOJTEK J. KRZANOWSKI. Volume 3, pp. 1229–1234.
Misclassification Rates Many situations arise in practice in which individuals have to be allocated into one of a number of prespecified classes. For example, applicants to a bank for a loan may be segregated into two groups: ‘repayers’ or ‘defaulters’; a psychiatrist may have to decide whether a patient is suffering from depression or not, and if yes, then which of a number of types of depressive illness it is most likely to be; and applicants for employment in a company may be classified as being either ‘suitable’ or ‘not suitable’ for the position by an industrial psychologist. Rather than merely relying on subjective expert assessment in such cases, use is often made of quantitative predictions via mathematical or computational formulae, termed allocation rules or classifiers, which are based on a battery of relevant measurements taken on each individual. Thus, the bank manager might provide the classifier with information on each individual’s financial circumstances as given, say, by his/her income, savings, number and size of existing loans, mortgage commitments, and so on, as well as concomitant information on such factors as nature of employment, marital status, educational level attained, and similar. The psychiatrist would perhaps supply information on the patient’s responses to questions about levels of concentration, loss of sleep, feelings of inadequacy, stress at work, and ability to make decisions; while the industrial psychologist might use numerical scores assessing the subject’s letter of application, previous educational attainments, performance at interview, level of ambition, and so on. The construction of such classifiers has a long history, dating from the 1930s. Original application areas were in the biological and agricultural sciences, but the methods spread rapidly to the behavioral and social sciences and are now used widely in most areas of human activity. Many different forms of classifier have been suggested and currently exist, including simple combinations of measured variables as in the linear discriminant function and the quadratic discriminant function (see Discriminant Analysis), more complicated explicit functions such as the logistic discriminant function and adaptive regression splines (see Logistic Regression; Scatterplot Smoothers), implicitly defined functions, and ‘black box’ routines such as feed-forward neural networks (see Neural Networks), tree-based methods such
as decision trees or classification and regression trees, numerically based methods including ones based on kernels, wavelets, or nearest neighbors, and dimensionality-based techniques such as support vector machines. The methodology underlying the construction of these classifiers is described elsewhere (see Discriminant Analysis). We will simply assume that the form of classifier has been chosen in a particular application, and will focus on the assessment of efficacy of that classifier. Since the objective of classification is to allocate individuals to preexisting classes, and any classification must by its nature be subject to error, the most obvious measure of efficacy of a classifier is the proportion of individuals that it is likely to misclassify. Hence, the estimation of misclassification rates is an important practical consideration. In any given application, there will be a set of observations available for individuals whose true class membership is known, from which the classifier is to be built; this is known as the training set. For example, the bank manager will have data on the chosen variables for a set of known defaulters, as well as for those who are known to have repaid their loans. Likewise, the psychiatrist will have measured the relevant variables for patients known to be suffering from each type of depressive illness as well as individuals who are not ill, and the industrial psychologist will have a similarly relevant data set available. Once the form of classifier has been chosen, these data are used to estimate its parameters; this is known as training the classifier. As a specific example, consider a set of data originally given by Reaven and Miller [9]. These authors were interested in studying the relationship between chemical subclinical and overt nonketotic diabetes in nonobese adult subjects. The three primary variables used in the analysis were glucose intolerance (x1 ), insulin response to oral glucose (x2 ), and insulin resistance (x3 ), and all subjects in the study were assigned by medical diagnosis to one of the three classes ‘normal’, ‘chemical diabetic’ and ‘overt diabetic’. For illustrative purposes, we will use a simple linear discriminant function to distinguish the first and third classes, so our training set consists of the 76 normal and 33 overt diabetic subjects. The resulting trained classifier is y = −21.525 + 0.023x1 + 0.017x2 + 0.015x3 , and future patients can be assigned to one or other class according to their observed value of y.
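The fitting step can be illustrated with a short sketch of Fisher's linear discriminant rule. The diabetes measurements themselves are not reproduced here, so the two training samples below are simulated stand-ins (the group means and spreads are invented), and the resulting coefficients will therefore not match those quoted above.

```python
import numpy as np

def fit_linear_discriminant(group1, group2):
    """Fisher's linear discriminant: returns (intercept, coefficients) such that
    y = intercept + x @ coefficients is negative for group 1 and positive for group 2."""
    n1, n2 = len(group1), len(group2)
    m1, m2 = group1.mean(axis=0), group2.mean(axis=0)
    pooled = (((n1 - 1) * np.cov(group1, rowvar=False)
               + (n2 - 1) * np.cov(group2, rowvar=False)) / (n1 + n2 - 2))
    coefs = np.linalg.solve(pooled, m2 - m1)
    intercept = -coefs @ (m1 + m2) / 2          # cut-off midway between the group means
    return intercept, coefs

def classify(intercept, coefs, x):
    """Allocate each row of x to group 1 (label 0) or group 2 (label 1)."""
    return (intercept + x @ coefs > 0).astype(int)

# Simulated stand-ins for the 76 'normal' and 33 'overt diabetic' training subjects
rng = np.random.default_rng(8)
normal = rng.multivariate_normal([350, 170, 110], 400 * np.eye(3), size=76)
overt = rng.multivariate_normal([550, 120, 300], 400 * np.eye(3), size=33)
b0, b = fit_linear_discriminant(normal, overt)
print(np.round(b0, 3), np.round(b, 3))
print("training-set allocations:", np.bincount(classify(b0, b, np.vstack([normal, overt]))))
```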
The problem now is to evaluate the (future) performance of such a trained classifier, and this can be done by estimating misclassification rates from each of the classes (i.e., the class-conditional misclassification rates). If the classifier has been constructed using a theoretical derivation from an assumed probability model, then there may be a theoretical expression available from which the misclassification rates can be estimated. For example, if we are classifying individuals into one of two groups, and we can assume that the measured variables follow multivariate normal distributions in the two groups, with mean vectors $\mu_1$, $\mu_2$, respectively, and a common dispersion matrix $\Sigma$, then standard theory [7, p. 59] shows that the optimal classifier in terms of minimizing costs due to misclassification is the linear discriminant function used in the example above. Moreover, under the assumption of equal costs from each group and equal prior probabilities of each group, the probability of misclassifying an individual from each group by this function is $\Phi\{-\tfrac{1}{2}[(\mu_1 - \mu_2)^t \Sigma^{-1}(\mu_1 - \mu_2)]^{1/2}\}$, where $x^t$ denotes the transpose of the vector x and $\Phi(u)$ is the integral of the standard normal density function between minus infinity and u. The parameters $\mu_1$, $\mu_2$ and $\Sigma$ can be estimated by the group mean vectors and pooled within-groups covariance matrix of the training data, and substituting these values in the expression above will thus yield an estimate of the misclassification rates. Doing this for the diabetes example, we estimate the misclassification rate from each class to be 0.018. However, estimation of the misclassification rates in this way is heavily dependent on the distributional assumptions made (whereas the classifier itself may be more widely applicable). For example, it turns out that a linear discriminant function is a useful classifier in many situations, even when the data do not come from normal populations, but the assumption of normality is critical to the above estimates of misclassification rates, and these estimates can be wrong when the data are not normally distributed. Moreover, there are many classifiers (e.g., neural networks, classification trees) for which appropriate probability models are very difficult if not impossible to specify. Hence, in general, reliance on distributional assumptions is dangerous and the training data must be used in a more direct fashion to estimate misclassification rates. The most intuitive approach is simply to apply the classifier to all training set members in turn, and
to estimate the misclassification rates by the proportion of training set members from each group that are misclassified. This is known as the resubstitution method of estimation, and the resulting estimates are often called the apparent error rates. However, it can be seen easily that this is a biased method that will generally be overoptimistic (i.e., will underestimate the misclassification rates). This is because the classifier has itself been built from the training data and has utilized any differences between the groups in an optimal fashion, thereby minimizing as far as possible any misclassifications. Future individuals for classification may well lie outside the bounds of the training set individuals or have some different characteristics, and so misclassifications are likely to be more frequent. The smaller the available data set, the more will the classifier be tailored to it and, hence, the more extreme will be the difference in performance on future individuals. In this case, we say that the training data have been overfitted and the classifier has poor generalization. The only way to ensure unbiased estimates of misclassification rates is to use different sets of data for forming and assessing the classifier. This can be achieved by having a training set for building the classifier, and then an independent test set (drawn from the same source) for estimating its misclassification rates. If the original training set is very large, then it can be randomly split into two portions, one of which forms the classifier training set and the other the test set. The proportions of individuals misclassified from each group of the test set then form the required estimates. The final classifier for future use is then obtained from the combined set. However, the process of dividing the data into two sets raises problems. If the training set is not very large, then the classifier being assessed may be poorly estimated, and may then differ considerably from the one finally used on future individuals; while if the test set is not very large, then the estimates of misclassification rates are subject to small-sample volatility and may not be reliable. Of course, if the data sets are huge, then the scope for overfitting the training data is small and resubstitution will give very similar results to test set estimation, and the problems do not arise. However, in many practical situations, the available data set is small to moderate in size, so resubstitution will not be reliable while splitting the set into training, and test sets is not viable. In these cases, we need something a little more ingenious, and cross-validation,
jackknifing, and bootstrapping are three data-based methods that have been proposed to fill the gap.
Cross-validation Cross-validation aims to steer a middle ground between splitting the data into just one training and one test set, and using all the data for both building the classifier and assessing it. In this method, the data are divided into subsets of size k; each subset is used in turn as the test set for a classifier constructed from the remainder of the data, and the proportions of misclassified individuals in each group, when the results from all subsets have been combined, give the estimates of misclassification rates. The final step is to build the classifier for future individuals from the full data set. This method was first introduced in [6] for the case k = 1, and this case is now commonly called the leave-one-out estimate of misclassification rates, while the general case is usually termed the k-fold cross-validation estimate. Clearly, if k = 1 and each individual is left out in turn, then the final classifier for future individuals will not differ much from the classifiers used to allocate each omitted individual, but a lot of computing may be necessary to complete the process. On the other hand, if k is quite large, then there is much less computing but more discrepancy between the classifiers obtained during the assessment procedure and the one used for future classifications. The leave-one-out estimator has been studied quite extensively, and has been shown to be approximately unbiased but to have large variance. Nevertheless, it is popular in applications. When a value of k greater than 1 is preferred, values around 6 to 10 are commonly used.
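A sketch of the cross-validation estimate, leaving out subsets of size k in turn, is given below. To keep the sketch short, a deliberately simple nearest-group-mean rule stands in for whatever classifier is actually being assessed, and the simulated two-group data are illustrative.

```python
import numpy as np

def fit_nearest_mean(x, y):
    """A deliberately simple stand-in classifier: store the two group mean vectors."""
    return np.array([x[y == 0].mean(axis=0), x[y == 1].mean(axis=0)])

def predict_nearest_mean(means, x):
    return np.linalg.norm(x[:, None, :] - means[None, :, :], axis=2).argmin(axis=1)

def cv_error_rates(x, y, k=1, seed=0):
    """Cross-validation estimate of the class-conditional misclassification rates,
    leaving out subsets of size k in turn (k = 1 gives the leave-one-out estimate)."""
    n = len(y)
    order = np.random.default_rng(seed).permutation(n)
    wrong = np.zeros(2)
    for start in range(0, n, k):
        test = order[start:start + k]
        train = np.setdiff1d(order, test)
        pred = predict_nearest_mean(fit_nearest_mean(x[train], y[train]), x[test])
        for g in (0, 1):
            wrong[g] += np.sum((y[test] == g) & (pred != g))
    return wrong / np.bincount(y, minlength=2)

rng = np.random.default_rng(9)
x = np.vstack([rng.normal(0, 1, (40, 2)), rng.normal(1.2, 1, (30, 2))])
y = np.array([0] * 40 + [1] * 30)
print(np.round(cv_error_rates(x, y, k=1), 3), np.round(cv_error_rates(x, y, k=7), 3))
```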
Jackknifing The jackknife was originally introduced in [8] as a general way of reducing the bias of an estimator, and its application to misclassification rate estimation came later. It uses the same operations as the leave-one-out variant of cross-validation, but with a different purpose; namely, to reduce the bias of the resubstitution estimator. Let us denote by $e_{r,i}$ the resubstitution estimate of misclassification rate for group i as obtained from a classifier built from the whole training set. Suppose now that the jth individual is omitted from the data, a new classifier is built from the reduced data set, and $e_{r,i}^{(j)}$ is the resubstitution estimate of misclassification rate for this classifier in group i. If this process is repeated, leaving out each of the n individuals in the data in turn, the average of the n resubstitution estimates of misclassification rate in group i (each based on a sample of size n − 1) can be found as $\bar{e}_{r,i} = (1/n)\sum_{j} e_{r,i}^{(j)}$. Then the jackknife estimate of bias in $e_{r,i}$ is $(n-1)(e_{r,i} - \bar{e}_{r,i})$, so that the corrected estimate becomes $e_{r,i} + (n-1)(e_{r,i} - \bar{e}_{r,i})$. For some more details, and some relationships between this estimator and the leave-one-out estimator, see [2].
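The jackknife correction can be sketched as follows, again with the simple nearest-group-mean rule as a stand-in classifier; the helper functions are repeated so that the sketch is self-contained, and the simulated data are illustrative.

```python
import numpy as np

def fit_nearest_mean(x, y):
    return np.array([x[y == 0].mean(axis=0), x[y == 1].mean(axis=0)])

def predict_nearest_mean(means, x):
    return np.linalg.norm(x[:, None, :] - means[None, :, :], axis=2).argmin(axis=1)

def apparent_error_rates(x, y):
    """Resubstitution (apparent) class-conditional error rates e_{r,i}."""
    pred = predict_nearest_mean(fit_nearest_mean(x, y), x)
    return np.array([np.mean(pred[y == g] != g) for g in (0, 1)])

def jackknife_corrected_rates(x, y):
    """Corrected estimate e_{r,i} + (n - 1)(e_{r,i} - mean_j e_{r,i}^{(j)}), as in the text."""
    n = len(y)
    e_r = apparent_error_rates(x, y)
    loo = np.array([apparent_error_rates(np.delete(x, j, axis=0), np.delete(y, j))
                    for j in range(n)])
    return e_r + (n - 1) * (e_r - loo.mean(axis=0))

rng = np.random.default_rng(10)
x = np.vstack([rng.normal(0, 1, (40, 2)), rng.normal(1.2, 1, (30, 2))])
y = np.array([0] * 40 + [1] * 30)
print(np.round(apparent_error_rates(x, y), 3), np.round(jackknife_corrected_rates(x, y), 3))
```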
Bootstrapping

An alternative approach to removal of the bias in the resubstitution estimator is provided by bootstrapping. The basic bootstrap estimator works as follows. Suppose that we draw a sample of size $n$ with replacement from the training data. This will have repeats of some of the individuals in the original data while others will be absent from it; it is known as a bootstrap sample, and it plays the role of a potential sample from the population from which the training data came. A classifier is built from the bootstrap sample and its resubstitution estimate of misclassification rate for group $i$ is computed; denote this by $e_{r,i}^{b}$. If this classifier is then applied to the original training data and the proportion of misclassified group $i$ individuals is computed as $e_{r,i}^{cb}$, then this latter quantity is an estimate of the 'true' misclassification rate for group $i$ using the classifier from the bootstrap sample. The difference $d^{b} = e_{r,i}^{cb} - e_{r,i}^{b}$ is then an estimate of the bias in the resubstitution misclassification rate. To smooth out the vagaries of individual samples, we take a large number of bootstrap samples and obtain the average difference $\bar{d}$ over these samples; this is the bias correction to add to $e_{r,i}$. The bootstrap has also been used for direct estimation of error rates. A number of ways of doing this have been proposed (see [1] for example), but one of the most successful is the so-called 632 bootstrap. In this method, the estimate of misclassification rate for group $i$ is given by $0.368\,e_{r,i} + 0.632\,e_i$, where $e_i$ is the proportion of all points not in each bootstrap sample that were misclassified for group $i$, and $e_{r,i}$ is the resubstitution estimate as before. The value 0.632 arises because it is equal to $1 - e^{-1}$, which is
the approximate probability that a specified individual will be selected for the bootstrap sample.
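The 632 estimator described above can be sketched as follows; the nearest-group-mean classifier, the number of bootstrap samples B, and the simulated data are illustrative assumptions, not taken from the entry.

```python
import numpy as np

def nearest_mean_fit(X, y):
    groups = np.unique(y)
    means = np.array([X[y == g].mean(axis=0) for g in groups])
    return lambda Z: groups[((Z[:, None, :] - means[None]) ** 2).sum(axis=2).argmin(axis=1)]

def bootstrap_632_rates(X, y, B=200, seed=0):
    """Per-group 632 bootstrap estimate 0.368 * e_r + 0.632 * e_out."""
    rng = np.random.default_rng(seed)
    n = len(y)
    groups = np.unique(y)
    # resubstitution (apparent) rates from the classifier built on the full training set
    full = nearest_mean_fit(X, y)
    e_r = {g: np.mean(full(X[y == g]) != g) for g in groups}
    # pooled error on points left out of each bootstrap sample
    wrong = {g: 0.0 for g in groups}
    count = {g: 0 for g in groups}
    for _ in range(B):
        boot = rng.integers(0, n, n)            # sample of size n with replacement
        out = np.setdiff1d(np.arange(n), boot)  # points not in the bootstrap sample
        clf = nearest_mean_fit(X[boot], y[boot])
        pred = clf(X[out])
        for g in groups:
            mask = y[out] == g
            count[g] += mask.sum()
            wrong[g] += (pred[mask] != g).sum()
    e_out = {g: wrong[g] / count[g] for g in groups}
    return {g: 0.368 * e_r[g] + 0.632 * e_out[g] for g in groups}

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1, (60, 2)), rng.normal(1.5, 1, (40, 2))])
y = np.array([0] * 60 + [1] * 40)
print(bootstrap_632_rates(X, y))
```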
Tuning

The above methods are the ones currently recommended for estimating misclassification rates, but in some complex situations, they may become embedded in the classifier itself. This can happen for a number of reasons. One typical situation occurs when some parameters of the classifier have to be optimized by finding those values that minimize the misclassification rates; this is known as tuning the classifier. An example of this usage is in multilayer perceptron neural networks, where the training process is iterative and some mechanism is necessary for deciding when to stop. The obvious point at which to stop is the one at which the network performance is optimized, that is, the one at which its misclassification rate is minimized. If the training data set is sufficiently large, then the network is trained on one portion of it, the misclassification rates are continuously estimated from a second portion, and the training stops when these rates reach a minimum. Note, however, that quoting this achieved minimum rate as the estimate of future performance of the network would be incorrect, sometimes badly so. In effect, the same mistake would be made here as was made by using the resubstitution estimate in the ordinary classifier; the classifier has been trained to minimize the misclassifications on the second set of available data, so the minimum achieved rate is an overoptimistic assessment of how it will fare on future data. To obtain an unbiased estimate, we need a third (independent) set of data on which to assess the trained network. So, ideally, we need to divide our data into three: one set for training the network, one set to decide when training stops, and a third set to assess the final network. Many overoptimistic claims were made in the early days of development of neural networks because this point was not appreciated.
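A hedged sketch of the three-way division just described. Instead of a neural network, it uses a k-nearest-neighbour classifier whose tuning parameter is chosen on a validation set, with future performance then quoted from a separate test set; the classifier and data are illustrative stand-ins, not the method of the entry.

```python
import numpy as np

def knn_predict(train_X, train_y, X, k):
    """Classify each row of X by majority vote among its k nearest training points."""
    d = ((X[:, None, :] - train_X[None, :, :]) ** 2).sum(axis=2)
    nearest = np.argsort(d, axis=1)[:, :k]
    votes = train_y[nearest]
    return np.array([np.bincount(v).argmax() for v in votes])

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, (150, 2)), rng.normal(1.2, 1, (150, 2))])
y = np.array([0] * 150 + [1] * 150)

# three-way split: train / validation (for tuning) / test (for unbiased assessment)
idx = rng.permutation(len(y))
tr, va, te = idx[:180], idx[180:240], idx[240:]

# tune k by minimizing the misclassification rate on the validation set
errors = {}
for k in (1, 3, 5, 7, 9, 15):
    pred = knn_predict(X[tr], y[tr], X[va], k)
    errors[k] = np.mean(pred != y[va])
best_k = min(errors, key=errors.get)

# the minimized validation error is overoptimistic; quote the test-set error instead
test_error = np.mean(knn_predict(X[tr], y[tr], X[te], best_k) != y[te])
print(best_k, errors[best_k], test_error)
```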
Variable Selection A related problem occurs when variable selection is an objective of the analysis, whatever classifier is used. Typically, a set of variables is deemed to be optimal if it yields a minimum number of misclassifications. The same comments as above are relevant in
this situation also, and simply quoting this minimum achieved rate will be an overoptimistic assessment of future performance of these variables. However, if data sets are small, then an additional problem frequently encountered is that the estimation of misclassification rates for deciding on variable subsets is effected by means of the leave-one-out process. If this is the case, then an unbiased assessment of the overall procedure will require a nested leave-one-out process: each individual is omitted from the data in turn, and then the whole procedure is conducted on the remaining individuals. If this procedure involves variable selection using leave-one-out to choose the variables, then this must be done, and only when it is completed is the first omitted individual classified. Then the next individual is omitted and the whole procedure is repeated, and so on. Thus, considerably more computing needs to be done than at first meets the eye, but this is essential if unbiased estimates are to be obtained. Comparative studies showing the bias incurred when these procedures are sidestepped have been described in [3] for tuning and [4] for variable selection. Note that in all the discussion above we have focused on estimation of the class-conditional misclassification rates. The overall misclassification rate of a classifier is given by a weighted average of the class-conditional rates, where the weights are the prior probabilities of the classes. Often, these prior probabilities have to be estimated. There is no problem if the training data have been obtained by random sampling of the whole population (i.e., mixture sampling), as the proportions of each class in the training data then yield adequate estimates of the prior probabilities. However, the proportions cannot be used in this way whenever the investigator has controlled the class sizes (separate sampling, as for example in case-control studies). In such cases, additional external information is generally needed for estimating prior probabilities. A final matter to note is that all the estimates discussed above are point estimates, that is, singlevalue ‘guesses’ at the true error rate of a given classifier. Such point estimates will always be subject to sampling variability, so an interval estimate might give a more realistic representation. This topic has not received much attention, but a recent study [5] has looked at coverage rates of various intervals generated by standard jackknife, bootstrap, and crossvalidation methods. The conclusion was that 632
bootstrap error rates plus jackknife-after-bootstrap estimation of their standard errors gave the most reliable confidence intervals. To illustrate some of these ideas, consider the results from [5] on discriminating between the 76 normal subjects and the 33 subjects suffering from overt nonketotic diabetes using the linear discriminant function introduced above. All data-based estimates of misclassification rates differed from the normal-based estimates of 0.018 for each class, suggesting that the data deviate from normality to some extent. The apparent error rates were 0.000 and 0.061 for the two groups respectively, while the 632 bootstrap rates for these groups were 0.002 and 0.091. This illustrates the overoptimism typical of the resubstitution process, but the bootstrap error rates, nevertheless, seem to indicate that the classifier is a good one with less than a 10% chance of misallocating to the 'harder' group of diabetics. However, when the variability in estimation is taken into account and 90% confidence intervals are calculated from the bootstrap rates, we find intervals of (0.000, 0.012) for the normal subjects and (0.000, 0.195) for the diabetics. Hence, classification to the diabetic group could be much poorer than first thought. The interested reader can find more detailed discussion of the above topics, with allied lists of references, in [2, 7].
References

[1] Efron, B. (1983). Estimating the error of a prediction rule: improvement on cross-validation, Journal of the American Statistical Association 78, 316–331.
[2] Hand, D.J. (1997). Construction and Assessment of Classification Rules, John Wiley & Sons, Chichester.
[3] Jonathan, P., Krzanowski, W.J. & McCarthy, W.V. (2000). On the use of cross-validation to assess performance in multivariate prediction, Statistics and Computing 10, 209–229.
[4] Krzanowski, W.J. (1995). Selection of variables, and assessment of their performance, in mixed-variable discriminant analysis, Computational Statistics and Data Analysis 19, 419–431.
[5] Krzanowski, W.J. (2001). Data-based interval estimation of classification error rates, Journal of Applied Statistics 28, 585–595.
[6] Lachenbruch, P.A. & Mickey, M.R. (1968). Estimation of error rates in discriminant analysis, Technometrics 10, 1–11.
[7] McLachlan, G.J. (1992). Discriminant Analysis and Statistical Pattern Recognition, John Wiley & Sons, New York.
[8] Quenouille, M. (1956). Notes on bias in estimation, Biometrika 43, 353–360.
[9] Reaven, G.M. & Miller, R.G. (1979). An attempt to define the nature of chemical diabetes using a multidimensional analysis, Diabetologia 16, 17–24.
WOJTEK J. KRZANOWSKI
Missing Data RODERICK J. LITTLE Volume 3, pp. 1234–1238 in Encyclopedia of Statistics in Behavioral Science ISBN-13: 978-0-470-86080-9 ISBN-10: 0-470-86080-4 Editors Brian S. Everitt & David C. Howell John Wiley & Sons, Ltd, Chichester, 2005
Missing Data Introduction Missing values arise in behavioral science data for many reasons: dropouts in longitudinal studies (see Longitudinal Data Analysis), unit nonresponse in sample surveys where some individuals are not contacted or refusal to respond, refusal to answer particular items in a questionnaire, missing components in an index variable constructed by summing values of particular items. Missing data can also arise by design. For example, suppose one objective in a study of obesity is to estimate the distribution of a measure Y1 of body fat in the population, and correlate it with other factors. Suppose Y1 is expensive to measure but a proxy measure Y2 , such as body mass index, which is a function of height and weight, is inexpensive to measure. Then it may be useful to measure Y2 and covariates, X, for a large sample and Y1 , Y2 and X for a smaller subsample. The subsample allows predictions of the missing values of Y1 to be generated for the larger sample, yielding more efficient estimates than are possible from the subsample alone. Unless missing data are a deliberate feature of the study design, the most important step in dealing with missing data is to try to avoid it during the data-collection stage. Since data are still likely to be missing despite these efforts, it is important to try to collect covariates that are predictive of the missing values, so that an adequate adjustment can be made. In addition, the process that leads to missing values should be determined during the collection of data if possible, since this information helps to model the missing-data mechanism when an adjustment for the missing values is performed [2]. We distinguish three major approaches to the analysis of missing data: 1. Discard incomplete cases and analyze the remainder (complete-case analysis), as discussed in the section titled Complete-case, Availablecase, and Weighting Analysis”; 2. Impute or fill in the missing values and then analyze the filled-in data, as discussed in the sections titled Single Imputation and Multiple Imputation; 3. Analyze the incomplete data by a method that does not require a complete (that is, a rectangular) data set, as discussed in the sections
titled Maximum Likelihood for Ignorable Models, Maximum Likelihood for Nonignorable Models, and Bayesian Simulation Methods. A basic assumption in all our methods is that missingness of a particular value hides a true underlying value that is meaningful for analysis. This may seem obvious but is not always the case. For example, in a study of a behavioral intervention for people with heart disease, it is not meaningful to consider a quality of life measure to be missing for subjects who die prematurely during the course of the study. Rather, it is preferable to restrict the analysis to the quality of life measures of individuals who are alive. Let $Y = (y_{ij})$ denote an $(N \times p)$ rectangular dataset without missing values, with $i$th row $y_i = (y_{i1}, \ldots, y_{ip})$, where $y_{ij}$ is the value of variable $Y_j$ for subject $i$. With missing values, the pattern of missing data is defined by the missing-data indicator matrix $M = (m_{ij})$, such that $m_{ij} = 1$ if $y_{ij}$ is missing and $m_{ij} = 0$ if $y_{ij}$ is present. An important example of a special pattern is univariate nonresponse, where missingness is confined to a single variable. Another is monotone missing data, where the variables can be arranged so that $Y_{j+1}, \ldots, Y_p$ are missing for all cases where $Y_j$ is missing, for all $j = 1, \ldots, p-1$. This pattern arises commonly in longitudinal data subject to attrition, where once a subject drops out, no more data are observed. Some methods for handling missing data apply to any pattern of missing data, whereas other methods assume a special pattern. The missing-data mechanism addresses the reasons why values are missing, and in particular, whether these reasons relate to values in the data set. For example, subjects in a longitudinal intervention may be more likely to drop out of a study because they feel the treatment was ineffective, which might be related to a poor value of an outcome measure. Rubin [5] treated M as a random matrix, and characterized the missing-data mechanism by the conditional distribution of M given Y, say $f(M \mid Y, \phi)$, where $\phi$ denotes unknown parameters. Assume independent observations, and let $m_i$ and $y_i$ denote the rows of M and Y corresponding to individual $i$. When missingness for case $i$ does not depend on the values of the data, that is,

$$f(m_i \mid y_i, \phi) = f(m_i \mid \phi) \quad \text{for all } y_i, \phi, \qquad (1)$$
the data are called missing completely at random (MCAR). An MCAR mechanism is plausible in
planned missing-data designs, but is a strong assumption when missing data do not occur by design, because missingness often does depend on recorded variables. A less restrictive assumption is that missingness depends only on data that are observed, say $y_{obs,i}$, and not on data that are missing, say $y_{mis,i}$; that is,

$$f(m_i \mid y_i, \phi) = f(m_i \mid y_{obs,i}, \phi) \quad \text{for all } y_{mis,i}, \phi. \qquad (2)$$

The missing-data mechanism is then called missing at random (MAR). Many methods for handling missing data assume the mechanism is MCAR, and yield biased estimates when the data are not MCAR. Better methods rely only on the MAR assumption.
Complete-case, Available-case, and Weighting Analysis

A common default approach is complete-case (CC) analysis, also known as listwise deletion, where incomplete cases are discarded and standard analysis methods applied to the complete cases [3, Section 3.2]. In many statistical packages, this is the default analysis. When the missing data are MCAR, the complete cases are a random subsample of the original sample, and CC analysis results in valid (but often inefficient) inferences. If the data are not MCAR then the complete cases are a biased sample, and CC analysis is often (though not always) biased, with a bias that depends on the degree of departure from MCAR, the amount of missing data, and the specifics of the analysis. The potential for bias is why sample surveys with high rates of unit nonresponse (say 30% or more) are often considered unreliable for making inferences to the whole population. A modification of CC analysis, commonly used to handle unit nonresponse in surveys, is to weight respondents by the inverse of an estimate of the probability of response [3, Section 3.3]. A simple approach to estimation is to form adjustment cells (or subclasses) on the basis of background variables and weight respondents in an adjustment cell by the inverse of the response rate in that cell. This method eliminates bias if respondents within each adjustment cell can be regarded as a random subsample of the original sample in that cell (i.e., the data are MAR given indicators for the
adjustment cells). A useful extension with extensive background information is response propensity stratification (see Propensity Score), where adjustment cells are based on similar values of estimates of probability of response, computed by a logistic regression of the indicator for missingness on the background variables. Although weighting methods can be useful for reducing nonresponse bias, they are potentially inefficient. Available-case (AC) analysis [3, Section 3.4] is a straightforward attempt to exploit the incomplete information by using all the cases available to estimate each individual parameter. For example, suppose the objective is to estimate the correlation matrix of a set of continuous variables Y1 , . . . , Yp . AC analysis uses all the cases with both Yj and Yk observed to estimate the correlation of Yj and Yk , 1 ≤ j, k ≤ p. Since the sample base of available cases for measuring each correlation includes the set of complete cases, the AC method appears to make better use of available information. The sample base changes from correlation to correlation, however, creating potential problems when the missing data are not MCAR. In the presence of high correlations, AC can be worse than CC analysis, and there is no guarantee that the AC correlation matrix is even positive definite.
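A minimal sketch of the adjustment-cell weighting described above, assuming the data are MAR given the cell indicator; the variable names and the simulated response mechanism are illustrative, not taken from the entry.

```python
import numpy as np

def weighted_mean_with_adjustment_cells(y, responded, cell):
    """Weight each respondent by the inverse of the response rate in its adjustment cell,
    then estimate the population mean of y from the respondents only."""
    y = np.asarray(y, dtype=float)
    responded = np.asarray(responded, dtype=bool)
    cell = np.asarray(cell)
    weights = np.zeros(len(y))
    for c in np.unique(cell):
        in_cell = cell == c
        rate = responded[in_cell].mean()          # estimated response probability in cell c
        weights[in_cell & responded] = 1.0 / rate
    return np.sum(weights[responded] * y[responded]) / np.sum(weights[responded])

# illustrative data: response rates and y both differ across cells (MAR given the cell)
rng = np.random.default_rng(4)
cell = rng.integers(0, 3, 1000)
y = 10 + 2 * cell + rng.normal(0, 1, 1000)
responded = rng.random(1000) < np.array([0.9, 0.6, 0.3])[cell]
print(y.mean(), y[responded].mean(), weighted_mean_with_adjustment_cells(y, responded, cell))
```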
Single Imputation

Methods that impute or fill in the missing values have the advantage that, unlike CC analysis, observed values in the incomplete cases are retained. A simple version imputes missing values by their unconditional sample means based on the observed data, but this method often leads to biased inferences and cannot be recommended in any generality [3, Section 4.2.1]. An improvement is conditional mean imputation [3, Section 4.2.2], in which each missing value is replaced by an estimate of its conditional mean given the observed values. For example, in the case of univariate nonresponse with $Y_1, \ldots, Y_{p-1}$ fully observed and $Y_p$ sometimes missing, regression imputation estimates the regression of $Y_p$ on $Y_1, \ldots, Y_{p-1}$ from the complete cases, and the resulting prediction equation is used to impute the estimated conditional mean for each missing value of $Y_p$. For a general pattern of missing data, the missing values for each case can be imputed from the regression of the missing variables on the observed
variables, computed using the set of complete cases. Iterative versions of this method lead (with some important adjustments) to maximum likelihood estimates under multivariate normality [3, Section 11.2]. Although conditional mean imputation yields best predictions of the missing values in the sense of mean squared error, it leads to distorted estimates of quantities that are not linear in the data, such as percentiles, variances, and correlations. A solution to this problem is to impute random draws from the predictive distribution rather than best predictions. An example is stochastic regression imputation, in which each missing value is replaced by its regression prediction plus a random error with variance equal to the estimated residual variance. Another approach is hot-deck imputation, which classifies respondents and nonrespondents into adjustment cells based on similar values of observed variables and then imputes values for nonrespondents from randomly chosen respondents in the same cell. A more general approach to hot-deck imputation matches nonrespondents and respondents using a distance metric based on the observed variables. For example, predictive mean matching imputes values from cases that have similar predictive means from a regression of the missing variable on observed variables. This method is somewhat robust to misspecification of the regression model used to create the matching metric [3, Section 4.3]. The imputation methods discussed so far assume the missing data are MAR. In contrast, models that are not missing at random (NMAR) assert that even if a respondent and nonrespondent to $Y_p$ appear identical with respect to observed variables $Y_1, \ldots, Y_{p-1}$, their $Y_p$-values differ systematically. A crucial point about the use of NMAR models is that there is never direct evidence in the data to address the validity of their underlying assumptions. Thus, whenever NMAR models are being considered, it is prudent to consider several NMAR models and explore the sensitivity of analyses to the choice of model [3, Chapter 15].
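A minimal sketch of stochastic regression imputation for univariate nonresponse, assuming fully observed covariates and normal residuals; the function name and interface are hypothetical illustrations rather than any particular package's API.

```python
import numpy as np

def stochastic_regression_impute(X, y, seed=0):
    """Impute missing entries of y (coded np.nan) by the regression prediction from the
    fully observed covariates X plus a normal residual draw."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    obs = ~np.isnan(y)
    Xd = np.column_stack([np.ones(len(y)), X])            # add an intercept column
    beta, *_ = np.linalg.lstsq(Xd[obs], y[obs], rcond=None)
    resid = y[obs] - Xd[obs] @ beta
    sigma = np.sqrt(resid @ resid / (obs.sum() - Xd.shape[1]))
    y_imp = y.copy()
    y_imp[~obs] = Xd[~obs] @ beta + rng.normal(0, sigma, (~obs).sum())
    return y_imp

rng = np.random.default_rng(5)
X = rng.normal(size=200)
y = 2 + 0.5 * X + rng.normal(0, 1, 200)
y[rng.random(200) < 0.3] = np.nan
print(np.nanmean(y), stochastic_regression_impute(X, y).mean())
```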
Multiple Imputation

A serious defect with imputation is that a single imputed value cannot represent the uncertainty about which value to impute, so analyses that treat imputed values just like observed values generally underestimate uncertainty, even if nonresponse is modeled
correctly. Multiple imputation (MI, [6, 7]) fixes this problem by creating a set of Q (say Q = 5 or 10) draws for each missing value from a predictive distribution, and thence Q datasets, each containing different sets of draws of the missing values. We then apply the standard complete-data analysis to each of the Q datasets and combine the results in a simple way. In particular, for scalar estimands, the MI estimate is the average of the estimates from the Q datasets, and the variance of the estimate is the average of the variances from the Q datasets plus (1 + 1/Q) times the sample variance of the estimates over the Q datasets (the factor 1 + 1/Q is a small-Q correction). The last quantity here estimates the contribution to the variance from imputation uncertainty, missed (i.e., set to zero) by single imputation methods. Another benefit of multiple imputation is that the averaging over datasets results in more efficient point estimates than does single random imputation. Often, MI is not much more difficult than doing a single imputation – the additional computing from repeating an analysis Q times is not a major burden and methods for combining inferences are straightforward. Most of the work is in generating good predictive distributions for the missing values.
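A minimal sketch of the combining rules described in the paragraph above; the numerical estimates and variances are invented for illustration, and further refinements (such as interval estimation) are not shown.

```python
import numpy as np

def combine_mi(estimates, variances):
    """Combine Q completed-data point estimates and their variances:
    pooled estimate, and total variance = average within-imputation variance
    plus (1 + 1/Q) times the between-imputation variance."""
    estimates = np.asarray(estimates, dtype=float)
    variances = np.asarray(variances, dtype=float)
    Q = len(estimates)
    qbar = estimates.mean()                     # MI point estimate
    within = variances.mean()                   # average of the Q completed-data variances
    between = estimates.var(ddof=1)             # sample variance of the Q estimates
    total = within + (1 + 1 / Q) * between
    return qbar, total

# e.g., five completed-data estimates of a mean and their squared standard errors
est = [4.8, 5.1, 4.9, 5.3, 5.0]
var = [0.20, 0.22, 0.19, 0.21, 0.20]
print(combine_mi(est, var))
```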
Maximum Likelihood for Ignorable Models Maximum likelihood (ML) avoids imputation by formulating a statistical model and basing inference on the likelihood function of the incomplete data [3, Section 6.2]. Define Y and M as above, and let X denote an (n × q) matrix of fixed covariates, assumed fully observed, with ith row xi = (xi1 , . . . , xiq ) where xij is the value of covariate Xj for subject i. Covariates that are not fully observed should be treated as random variables and modeled with the set of Yj ’s. Two likelihoods are commonly considered. The full likelihood is obtained by integrating the missing values out of the joint density of Y and M given X. The ignorable likelihood is obtained by integrating the missing values out of the joint density of Y given X, ignoring the distribution of M. The ignorable likelihood is easier to work with than the full likelihood, since it is easier to handle computationally, and more importantly because it avoids the need to specify a model for the missing-data mechanism, about
which little is known in many situations. Rubin [5] showed that valid inferences about parameters can be based on the ignorable likelihood when the data are MAR, as defined above. (A secondary condition, distinctness, is needed for these inferences to be fully efficient). Large-sample inferences about parameters can be based on standard ML theory [3, Chapter 6]. In many problems, maximization of the likelihood requires numerical methods. Standard optimization methods such as Newton–Raphson or Scoring can be applied. Alternatively, we can apply the Expectation–Maximization (EM ) algorithm [1] or one of its extensions [3, 4]. Reference [3] includes many applications of EM to particular models, including normal and t models for multivariate continuous data, loglinear models for multiway contingency tables, and the general location model for mixtures of continuous and categorical variables. Asymptotic standard errors are not readily available from EM, unlike numerical methods like scoring or Newton–Raphson. A simple approach is to use the bootstrap or the jackknife method [3, Chapter 5]. An alternative is to switch to a Bayesian simulation method that simulates the posterior distribution of the parameters (see the section titled Bayesian Simulation Methods).
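As one concrete and deliberately simple illustration of ML estimation by EM under an ignorable mechanism, the sketch below estimates the mean and covariance of a bivariate normal variable when the second component is sometimes missing. The example and its assumptions (bivariate normality, MAR) are chosen for illustration and are not taken from the entry or from [3].

```python
import numpy as np

def em_bivariate_normal(y1, y2, n_iter=100):
    """EM estimates of the mean and covariance of (Y1, Y2) when Y1 is fully observed
    and some Y2 values are missing (np.nan), assuming an ignorable (MAR) mechanism."""
    miss = np.isnan(y2)
    cc = ~miss
    mu = np.array([y1.mean(), y2[cc].mean()])          # start from complete cases
    S = np.cov(np.vstack([y1[cc], y2[cc]]))
    for _ in range(n_iter):
        # E-step: expected Y2 and Y2^2 for incomplete cases, given Y1 and current parameters
        b = S[0, 1] / S[0, 0]
        e_y2 = np.where(miss, mu[1] + b * (y1 - mu[0]), y2)
        resvar = S[1, 1] - S[0, 1] ** 2 / S[0, 0]
        e_y2sq = np.where(miss, e_y2 ** 2 + resvar, y2 ** 2)
        # M-step: update parameters from the expected sufficient statistics
        mu = np.array([y1.mean(), e_y2.mean()])
        s11 = np.mean(y1 ** 2) - mu[0] ** 2
        s12 = np.mean(y1 * e_y2) - mu[0] * mu[1]
        s22 = np.mean(e_y2sq) - mu[1] ** 2
        S = np.array([[s11, s12], [s12, s22]])
    return mu, S

rng = np.random.default_rng(6)
y1 = rng.normal(0, 1, 500)
y2 = 1 + 0.8 * y1 + rng.normal(0, 0.6, 500)
y2[y1 > 0.5] = np.nan     # missingness depends only on the observed y1 (MAR)
print(em_bivariate_normal(y1, y2))
```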
Maximum Likelihood for Nonignorable Models

Nonignorable, non-MAR models apply when missingness depends on the missing values (see the section titled Introduction). A correct likelihood analysis must be based on the full likelihood from a model for the joint distribution of Y and M. The standard likelihood asymptotics apply to nonignorable models providing the parameters are identified, and computational tools such as EM also apply to this more general class of models. However, often the information available to estimate both the parameters of the missing-data mechanism and the parameters of the complete-data model is very limited, and estimates are sensitive to misspecification of the model. Often, a sensitivity analysis is needed to see how much the answers change for various assumptions about the missing-data mechanism [3, Chapter 15].
Bayesian Simulation Methods Maximum likelihood is most useful when sample sizes are large, since then the log-likelihood is nearly
quadratic and can be summarized well using the ML estimate $\hat{\theta}$ and its large-sample variance–covariance matrix. When sample sizes are small, a useful alternative approach is to add a prior distribution for the parameters and compute the posterior distribution of the parameters of interest. Since the posterior distribution rarely has a simple analytic form for incomplete-data problems, simulation methods are often used to generate draws of $\theta$ from the posterior distribution $p(\theta \mid Y_{obs}, M, X)$. Data augmentation [10] is an iterative method of simulating the posterior distribution of $\theta$ that combines features of the EM algorithm and multiple imputation, with Q imputations of each missing value at each iteration. It can be thought of as a small-sample refinement of the EM algorithm using simulation, with the imputation step corresponding to the E-step (random draws replacing expectation) and the posterior step corresponding to the M-step (random draws replacing MLE). When Q is set equal to one, data augmentation is a special case of the Gibbs sampler. This algorithm can be run independently Q times to generate Q iid draws from the approximate joint posterior distribution of the parameters and the missing data. The draws of the missing data can be used as multiple imputations, yielding a direct link with the methods in the section titled Multiple Imputation.
Conclusion In general, the advent of modern computing has made more principled methods of dealing with missing data practical, such as multiple imputation (e.g. PROC MI in [8]) or ignorable ML for repeated measures with dropouts (see Dropouts in Longitudinal Studies: Methods of Analysis). See [3] or [9] to learn more about missing-data methodology.
Acknowledgments This research was partially supported by National Science Foundation Grant DMS 9803720.
References

[1] Dempster, A.P., Laird, N.M. & Rubin, D.B. (1977). Maximum likelihood from incomplete data via the EM algorithm (with discussion), Journal of the Royal Statistical Society, Series B 39, 1–38.
[2] Little, R.J.A. (1995). Modeling the drop-out mechanism in longitudinal studies, Journal of the American Statistical Association 90, 1112–1121.
[3] Little, R.J.A. & Rubin, D.B. (2002). Statistical Analysis with Missing Data, 2nd Edition, John Wiley, New York.
[4] McLachlan, G.J. & Krishnan, T. (1997). The EM Algorithm and Extensions, Wiley, New York.
[5] Rubin, D.B. (1976). Inference and missing data, Biometrika 63, 581–592.
[6] Rubin, D.B. (1987). Multiple Imputation for Nonresponse in Surveys, Wiley, New York.
[7] Rubin, D.B. (1996). Multiple imputation after 18+ years, Journal of the American Statistical Association 91, 473–489; with discussion, 507–515; and rejoinder, 515–517.
[8] SAS. (2003). SAS/STAT Software, Version 9, SAS Institute, Inc., Cary.
[9] Schafer, J.L. (1997). Analysis of Incomplete Multivariate Data, Chapman & Hall, London.
[10] Tanner, M.A. & Wong, W.H. (1987). The calculation of posterior distributions by data augmentation, Journal of the American Statistical Association 82, 528–550.
(See also Dropouts in Longitudinal Studies: Methods of Analysis) RODERICK J. LITTLE
Model Based Cluster Analysis BRIAN S. EVERITT Volume 3, pp. 1238–1238 in Encyclopedia of Statistics in Behavioral Science ISBN-13: 978-0-470-86080-9 ISBN-10: 0-470-86080-4 Editors Brian S. Everitt & David C. Howell John Wiley & Sons, Ltd, Chichester, 2005
Model Based Cluster Analysis Many methods of cluster analysis are based on ad hoc ‘sorting’ algorithms rather than some sound statistical model (see Hierarchical Clustering; k means Analysis). This makes it difficult to evaluate and assess solutions in terms of a set of explicit assumptions. More statistically respectable are model-based methods in which an explicit model for the clustering process is specified, containing parameters defining the characteristics of the assumed clusters. Such an approach replaces ‘sorting’ with estimation and may lead to formal
tests of cluster structure, particularly for the number of clusters. The most well-known type of model-based clustering procedure uses finite mixture distributions. Other possibilities are described in [2] and include the classification likelihood approach [1] and latent class analysis.
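A hedged sketch of model-based clustering via a finite normal mixture, assuming the scikit-learn library is available (the entry itself names no software). Choosing the number of components by BIC is one common formal criterion for the number of clusters; it is not the only possibility described in the literature cited below.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(7)
X = np.vstack([rng.normal(0, 1, (100, 2)),
               rng.normal(4, 1, (100, 2)),
               rng.normal((0, 5), 1, (100, 2))])

# fit finite normal mixtures with 1..6 components and compare them by BIC
bics = {}
for g in range(1, 7):
    gm = GaussianMixture(n_components=g, random_state=0).fit(X)
    bics[g] = gm.bic(X)
best = min(bics, key=bics.get)
labels = GaussianMixture(n_components=best, random_state=0).fit_predict(X)
print(best, bics)
```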
References

[1] Banfield, J.D. & Raftery, A.E. (1993). Model-based Gaussian and non-Gaussian clustering, Biometrics 49, 803–822.
[2] Everitt, B.S., Landau, S. & Leese, M. (2001). Cluster Analysis, 4th Edition, Arnold Publishing, London.
BRIAN S. EVERITT
Model Evaluation DANIEL J. NAVARRO AND JAY I. MYUNG Volume 3, pp. 1239–1242 in Encyclopedia of Statistics in Behavioral Science ISBN-13: 978-0-470-86080-9 ISBN-10: 0-470-86080-4 Editors Brian S. Everitt & David C. Howell John Wiley & Sons, Ltd, Chichester, 2005
Model Evaluation Introduction The main reason for building models is to link theoretical ideas to observed data, and the central question that we are interested in is 'Is the model any good?' When dealing with quantitative models, we can at least partially answer this question using statistical tools. Before going into detail, there is a touchy, even philosophical, issue that one cannot ignore. A naive view of modeling is to identify the underlying process (truth) that has actually generated the data. This is an ill-posed problem, meaning that the solution is nonunique. The finite data sample rarely contains sufficient information to lead to a single process and is also corrupted by unavoidable random noise, blurring the identification. An implication of noise-corrupted data is that it is not in general possible to determine with complete certainty whether what we are fitting is the regularity, which we are interested in, or the noise, which we are not. A model that assumes a certain amount of error is present may be worse than a yet to be postulated model that can explain more of what we thought of as error in the first model. In short, identifying the true model based on data samples is an unachievable goal. Furthermore, the 'truth' of any phenomenon is likely to be rather different from any proposed model. Ultimately, it is crucial to recognize that all models are wrong, and a realistic goal of modeling is to find a model that represents a 'good' approximation to the truth in a statistically defined sense. In what follows, we assume that we have a model M with k free parameters $\theta = (\theta_1, \ldots, \theta_k)$, and a data set that consists of n observations $X = (X_1, \ldots, X_n)$. Quantitative models generally come in two main types: They either assign some probability to the observed data $f(X \mid \theta)$ (probabilistic models), or they produce a single predicted data set $X^{prd}(\theta)$ (deterministic models). We should note that most model testing and model selection methods require a probabilistic formulation, so it is commonplace to define a model as $M = \{f(X \mid \theta) : \theta \in \Theta\}$, where $\Theta$ is the parameter space. When written in this form, a model can be conceptualized as a family of probability distributions.
Model Fitting

At a minimum, any reasonable model needs to be able to mimic the structure of the data: It needs to be able to 'fit' the data. When measuring the goodness of a model's fit, we find the parameter values that allow the model to best mimic the data, denoted $\hat{\theta}$. The two most common methods for this are maximum likelihood estimation (for probabilistic models) and least squares estimation (for deterministic models). In the maximum likelihood approach, introduced by Sir Ronald Fisher in the 1920s, $\hat{\theta}$ is the set of parameter values that maximizes $f(X \mid \theta)$, and is referred to as the maximum likelihood estimate (MLE). The corresponding measure of fit is the maximized log-likelihood $\hat{L} = \ln f(X \mid \hat{\theta})$. See [3] for a tutorial on maximum likelihood estimation with example applications in cognitive psychology. Alternatively, the least squares estimate of $\hat{\theta}$ is the set of parameters that minimizes the sum of squared errors (SSE), and the minimized SSE value is denoted by $\hat{E}$:

$$\hat{E} = \sum_{i=1}^{n} \left(X_i - X_i^{prd}(\hat{\theta})\right)^2. \qquad (1)$$

When this approach is employed, there are several commonly used measures of fit. They are mean squared error $MSE = \hat{E}/n$, root mean squared deviation $RMSD = \sqrt{\hat{E}/n}$, and squared correlation (also known as proportion of variance accounted for) $r^2 = 1 - \hat{E}/SST$. In the last formula, SST stands for the sum of squares total, defined as $SST = \sum_{i=1}^{n} (X_i - \bar{X})^2$, where $\bar{X}$ denotes the mean of X. There is a nice correspondence between maximum likelihood and least squares: for a model with independent, identically and normally distributed errors, the same set of parameters both maximizes the log-likelihood L and minimizes the sum of squared errors SSE. Model fitting yields goodness-of-fit measures, such as $\hat{L}$ or $\hat{E}$, that tell us how well the model fits the observed data sample but by themselves are not particularly meaningful. If our model has a minimized sum of squared errors of 0.132, should we be impressed or not? In other words, a goodness-of-fit measure may be useful as a purely descriptive measure, but by itself is not amenable to statistical inference. This is because the measure does not address the relevant
question: 'Does the model provide an adequate fit to the data, in a defined sense?' This question is answered in model testing.
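A minimal sketch of the least squares fit measures defined above; the 'forgetting curve' data and the fitted parameter values are invented for illustration and are not taken from the entry.

```python
import numpy as np

def fit_measures(X, X_pred):
    """Goodness-of-fit measures for a deterministic model's predictions X_pred of data X."""
    X, X_pred = np.asarray(X, float), np.asarray(X_pred, float)
    n = len(X)
    sse = np.sum((X - X_pred) ** 2)            # E-hat, the minimized SSE if X_pred uses theta-hat
    mse = sse / n
    rmsd = np.sqrt(mse)
    sst = np.sum((X - X.mean()) ** 2)          # sum of squares total
    r2 = 1 - sse / sst                         # proportion of variance accounted for
    return {"SSE": sse, "MSE": mse, "RMSD": rmsd, "r2": r2}

# e.g., an exponential retention model X_pred = a * exp(-b * t) at illustrative fitted a, b
t = np.arange(1, 9)
X = np.array([0.94, 0.77, 0.61, 0.52, 0.44, 0.35, 0.31, 0.27])
X_pred = 1.10 * np.exp(-0.18 * t)
print(fit_measures(X, X_pred))
```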
Model Testing

Classical null hypothesis testing is a standard method of judging a model's goodness of fit. The idea is to set up a null hypothesis that 'the model is correct', obtain the P value, and then make a decision about rejecting or retaining the hypothesis by comparing the resulting P value with the alpha level. For discrete data such as frequency counts, the two most popular methods are the Pearson chi-square ($\chi^2$) test and the log-likelihood ratio test ($G^2$), which have test statistics given by

$$\chi^2 = \sum_{i=1}^{n} \frac{\left(X_i - X_i^{prd}(\hat{\theta})\right)^2}{X_i^{prd}(\hat{\theta})}; \qquad G^2 = -2 \sum_{i=1}^{n} X_i \ln \frac{X_i^{prd}(\hat{\theta})}{X_i}, \qquad (2)$$
where ln is the natural logarithm of base e. Both of these statistics have the nice property that they are always nonnegative, and are equal to zero when the observed and predicted data are in full agreement. In other words, the larger the statistic, the greater the discrepancy. Under the null hypothesis, both are approximately distributed as a $\chi^2$ distribution with (n − k − 1) degrees of freedom, so we would reject the null hypothesis if the observed value of the statistic exceeds the critical value obtained by setting an appropriate α level (equivalently, if the P value is smaller than α). For continuous data such as response time, goodness-of-fit tests are a little more complicated, since there are no general-purpose methods available for testing the validity of a single model, unlike the discrete case. Instead, we rely on the generalized likelihood ratio test that involves two models. In this test, in addition to the theoretically motivated model, denoted by $M_r$ (reduced model), we create a second model, $M_f$ (full model), such that the reduced model is obtained as a special case of the full model by fixing one or more of $M_f$'s parameters. Then, the goodness of fit of the reduced model is assessed by the following $G^2$ statistic:

$$G^2 = 2(\ln \hat{L}_f - \ln \hat{L}_r), \qquad (3)$$
recalling that $\hat{L}$ denotes the maximized log-likelihood. Under the null hypothesis that the theoretically motivated reduced model is correct, the above statistic is approximately distributed as $\chi^2$ with degrees of freedom given by the difference in the number of parameters $(k_f - k_r)$. If the hypothesis is retained (rejected), then we conclude that the reduced model $M_r$ provides (does not provide) an adequate description of the data (see [4] for an example application of this test).
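A minimal sketch of the generalized likelihood ratio test just described, assuming the maximized log-likelihoods of the full and reduced models have already been obtained; the numerical values are invented for illustration.

```python
from scipy.stats import chi2

def likelihood_ratio_test(loglik_full, loglik_reduced, k_full, k_reduced):
    """G^2 = 2(lnL_f - lnL_r), referred to a chi-square with k_f - k_r degrees of freedom."""
    g2 = 2.0 * (loglik_full - loglik_reduced)
    df = k_full - k_reduced
    p = chi2.sf(g2, df)
    return g2, df, p

# e.g., a reduced 2-parameter model nested within a full 4-parameter model
print(likelihood_ratio_test(loglik_full=-512.3, loglik_reduced=-516.9, k_full=4, k_reduced=2))
```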
Model Selection What does it mean that a model provides an adequate fit of the data? One should not jump to the conclusion that one has identified the underlying regularity. A good fit merely puts the model on a list of candidate models worthy of further consideration. It is entirely possible that there are several distinct models that fit the data well, all passing goodness of fit tests. How should we then choose among such models? This is the problem of model selection. In model selection, the goal is to select the one, among a set of candidate models, that represents the closest approximation to the underlying process in some defined sense. Choosing the model that best fits a particular set of observed data will not accomplish the goal. This is because a model can achieve a superior fit to its competitors for reasons unrelated to the model’s exactness. For instance, it is well known that a complex model with many parameters and highly nonlinear form can often fit data better than a simple model with few parameters even if the latter generated the data. This is called overfitting. Avoiding overfitting is what every model selection method is set to accomplish. The essential idea behind modern model selection methods is to recognize that, since data are inherently noisy, an ideal model is one that captures only the underlying phenomenon, not the noise. Since noise is idiosyncratic to a particular data set, a model that captures noise will make poor predictions about future events. This leads to the present-day ‘gold standard’ of model selection, generalizability. Generalizability, or predictive accuracy, refers to a model’s ability to predict the statistics of future, as yet unseen, data samples from the same process that generated the observed data sample. The intuitively simplest way to measure generalizability is to estimate it directly from the data, using
cross-validation (CV; [10]). In cross-validation, we split the data set into two samples, the calibration sample $X_c$ and the test sample $X_t$. We first estimate the best-fitting parameters by fitting the model to $X_c$, which we denote $\hat{\theta}(X_c)$. The generalizability estimate is obtained by measuring the fit of the model to the test sample at those original parameters, that is,

$$CV = \ln f\left(X_t \mid \hat{\theta}(X_c)\right). \qquad (4)$$

The main attraction of CV is its ease of implementation (see [4] for an example application to psychological models). All that is required is a model-fitting procedure and a resampling scheme. One concern with CV is that the test sample may not be truly independent of the calibration sample: Since both were produced in the same experiment, systematic sources of error variation are likely to induce correlated noise across the two samples, artificially inflating the CV measure. An alternative approach is to use theoretical measures of generalizability based on a single sample. In most of these theoretical approaches, generalizability is measured by suitably combining goodness-of-fit with model complexity. The practical difference between them is the way in which complexity is measured. One of the earliest measures of this kind was the Akaike information criterion (AIC; [1]), which treats complexity as the number of parameters k:

$$AIC = -2 \ln f(X \mid \hat{\theta}) + 2k, \qquad (5)$$
The method prescribes that the model minimizing AIC should be chosen. AIC seeks to find the model that lies 'closest' to the true distribution, as measured by the Kullback–Leibler [8] discrepancy. As shown in the above criterion equation, this is achieved by trading off the first term on the right-hand side, the minus goodness-of-fit (lack-of-fit) term, against the second, complexity term. As such, a complex model with many parameters, having a large value of the complexity term, will not be selected unless its fit justifies the extra complexity. In this sense, AIC represents a formalization of the principle of Occam's razor, which states 'Entities should not be multiplied beyond necessity' (William of Occam, ca. 1290–1349) (see Parsimony/Occham's Razor). Another approach is given by the much older notion of Bayesian statistics. In the Bayesian approach, we assume that a priori uncertainty about
the value of model parameters is represented by a prior distribution $\pi(\theta)$. Upon observing the data X, this prior is updated, yielding a posterior distribution $\pi(\theta \mid X) \propto f(X \mid \theta)\pi(\theta)$. In order to make inferences about the model (rather than its parameters), we integrate across the posterior distribution. Under the assumption that all models are a priori equally likely (because the Bayesian approach requires model priors as well as parameter priors), Bayesian model selection chooses the model M with the highest marginal likelihood, defined as

$$f(X \mid M) = \int f(X \mid \theta)\,\pi(\theta)\, d\theta. \qquad (6)$$

The ratio of two marginal likelihoods is called a Bayes factor (BF; [2]), which is a widely used method of model selection in Bayesian inference. The two integrals in the Bayes factor are nontrivial to compute unless $f(X \mid \theta)$ and $\pi(\theta)$ form a conjugate family. Monte Carlo methods are usually required to compute BF, especially for highly parameterized models (see Markov Chain Monte Carlo and Bayesian Statistics). A large-sample approximation of BF yields the easily computable Bayesian information criterion (BIC; [9]):

$$BIC = -2 \ln f(X \mid \hat{\theta}) + k \ln n. \qquad (7)$$
The model minimizing BIC should be chosen. It is important to recognize that the BIC is based on a number of restrictive assumptions. If these assumptions are met, then the difference between two BIC values approaches twice the logarithm of the Bayes factor as n approaches infinity. A third approach is minimum description length (MDL; [5]), which originates in algorithmic coding theory. In MDL, a model is viewed as a code that can be used to compress the data. That is, data sets that have some regular structure can be compressed substantially if we know what that structure is. Since a model is essentially a hypothesis about the nature of the regularities that we expect to find in data, a good model should allow us to compress the data set effectively. From an MDL standpoint, we choose the model that permits the greatest compression of data in its total description: That is, the description of data obtainable with the help of the model plus the description of the model itself. A series of papers by Rissanen expanded on and refined this idea, yielding a number of different model selection criteria (one
of which was essentially identical to the BIC). The most complete MDL criterion currently available is the stochastic complexity (SC; [7]) of the data relative to the model,

$$SC = -\ln f(X \mid \hat{\theta}) + \ln \int f\left(Y \mid \hat{\theta}(Y)\right) dY. \qquad (8)$$

Note that the second term of SC represents a measure of model complexity. Since the integral over the sample space is generally nontrivial to compute, it is common to use the Fisher-information approximation (FIA; [6]): Under regularity conditions, the stochastic complexity asymptotically approaches

$$FIA = -\ln f(X \mid \hat{\theta}) + \frac{k}{2}\ln\frac{n}{2\pi} + \ln \int_{\Theta} \sqrt{\det I(\theta)}\, d\theta, \qquad (9)$$

where $I(\theta)$ is the expected Fisher information matrix of sample size one, consisting of the covariances between the partial derivatives of L with respect to the parameters. Once again, the integral can still be intractable, but it is generally easier to calculate than the exact SC. As in AIC and BIC, the first term of FIA is the lack-of-fit term, and the second and third terms together represent a complexity measure. From the viewpoint of FIA, complexity is determined by the number of free parameters (k) and sample size (n), but also by the 'functional form' of the model equation, as implied by the Fisher information $I(\theta)$, and the range of the parameter space $\Theta$. When using generalizability measures, it is important to recognize that AIC, BIC, and FIA are all asymptotic criteria, and are only guaranteed to work as n becomes arbitrarily large, and when certain regularity conditions are met. The AIC and BIC, in particular, can be misleading for small n. The FIA is safer (i.e., the error level generally falls faster as n increases), but it too can still be misleading in some cases. The SC and BF criteria are more sensitive, since they are exact rather than asymptotic criteria, and can be quite powerful even when presented with very similar models or small samples. However, they can be difficult to employ, and often need to be approximated numerically. The status of CV is a little more complicated, since it is not always clear what CV is doing, but its performance in practice is often better than AIC or BIC, though it is not usually as good as SC, FIA, or BF.
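The AIC and BIC trade-off between lack of fit and complexity can be sketched as follows; the log-likelihood values, parameter counts, and sample size are invented for illustration.

```python
import numpy as np

def aic(loglik, k):
    """AIC = -2 ln f(X|theta-hat) + 2k."""
    return -2.0 * loglik + 2 * k

def bic(loglik, k, n):
    """BIC = -2 ln f(X|theta-hat) + k ln n."""
    return -2.0 * loglik + k * np.log(n)

# two candidate models fitted to the same n = 120 observations (illustrative log-likelihoods)
n = 120
candidates = {"model A (k=2)": (-310.4, 2), "model B (k=5)": (-306.1, 5)}
for name, (ll, k) in candidates.items():
    print(name, round(aic(ll, k), 1), round(bic(ll, k, n), 1))
```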
Conclusion When evaluating a model, there are a number of factors to consider. Broadly speaking, statistical methods can be used to measure the descriptive adequacy of a model (by fitting it to data and testing those fits), as well as its generalizability and simplicity (using model selection tools). However, the strength of the underlying theory also depends on its interpretability, its consistency with other findings, and its overall plausibility. These things are inherently subjective judgments, but they are no less important for that. As always, there is no substitute for thoughtful evaluations and good judgment. After all, statistical evaluations are only one part of a good analysis.
References

[1] Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle, in Second International Symposium on Information Theory, B.N. Petrov & F. Csaki, eds, Akademiai Kiado, Budapest, pp. 267–281.
[2] Kass, R.E. & Raftery, A.E. (1995). Bayes factors, Journal of the American Statistical Association 90(430), 773–795.
[3] Myung, I.J. (2003). Tutorial on maximum likelihood estimation, Journal of Mathematical Psychology 47, 90–100.
[4] Myung, I.J. & Pitt, M.A. (2002). Mathematical modeling, in Stevens' Handbook of Experimental Psychology, 3rd Edition, J. Wixted, ed., John Wiley & Sons, New York, pp. 429–459.
[5] Rissanen, J. (1989). Stochastic Complexity in Statistical Inquiry, World Scientific, Singapore.
[6] Rissanen, J. (1996). Fisher information and stochastic complexity, IEEE Transactions on Information Theory 42, 40–47.
[7] Rissanen, J. (2001). Strong optimality of the normalized ML models as universal codes and information in data, IEEE Transactions on Information Theory 47, 1712–1717.
[8] Schervish, M.J. (1995). Theory of Statistics, Springer, New York.
[9] Schwarz, G. (1978). Estimating the dimension of a model, Annals of Statistics 6(2), 461–464.
[10] Stone, M. (1974). Cross-validatory choice and assessment of statistical predictions, Journal of the Royal Statistical Society, Series B 36, 111–147.
DANIEL J. NAVARRO AND JAY I. MYUNG
Model Fit: Assessment of CEES A.W. GLAS Volume 3, pp. 1243–1249 in Encyclopedia of Statistics in Behavioral Science ISBN-13: 978-0-470-86080-9 ISBN-10: 0-470-86080-4 Editors Brian S. Everitt & David C. Howell John Wiley & Sons, Ltd, Chichester, 2005
Model Fit: Assessment of Introduction Item response theory (IRT) models (see Item Response Theory (IRT) Models for Dichotomous Data) provide a useful and well-founded framework for measurement in the social sciences. However, the applications of these models are only valid if the model fits the data. IRT models are based on a number of explicit assumptions, so methods for the evaluation of model fit focus on these assumptions. Model fit can be viewed from two perspectives: the items and the respondents. In the first case, for every item, residuals (differences between predictions from the estimated model and observations) and item fit statistics are computed to assess whether the item violates the model. In the second case, residuals and person-fit statistics are computed for every person to assess whether the responses to the items follow the model. The most important assumptions evaluated are subpopulation invariance (the violation is often labeled differential item functioning), the form of the item response function, and local stochastic independence. The first assumption entails that the item responses can be described by the same parameters in all possible subpopulations. Subpopulations are defined on the basis of background variables that should not be relevant in a specific testing situation. One might think of gender, race, age, or socioeconomic status. The second assumption addressed is the form of the item response function that describes the relation between a latent variable, say proficiency, and observable responses to items. Evaluation of the appropriateness of the item response function is usually done by comparing observed and expected item response frequencies given some measure of the latent trait level. The third assumption targeted is local stochastic independence. The assumption entails that responses to different items are independent given the latent trait value. So the proposed latent variables completely describe the responses and no additional variables are necessary to describe the responses. A final remark in this introduction pertains to the relation between formal tests of model fit and residual analyses. A well-known problem with formal tests of model fit is that they tend to reject the model even
for moderate sample sizes. That is, their power (the probability of rejection when the model is violated) grows very fast as a function of the sample size. As a result, small deviations from the IRT model may cause a rejection of the model while these deviations may hardly have practical consequences in the foreseen application of the model. Inspection of the magnitude of the residuals can shed light on the severity of the model violation. The reason for addressing the problem of evaluation of model fit in the framework of formal model tests is that the alternative hypotheses in these model tests clarify which model assumptions are exactly targeted by the residuals. This will be explained further below. We will start by describing evaluation of fit to IRT models for dichotomous items, then a general approach to evaluation of model fit for a general class of IRT models will be outlined, and finally, a small example will be given. The focus will be on parameterized IRT models in a likelihood-based framework. The relation with other approaches to IRT will be briefly sketched in the conclusion.
Assessing Model Fit for Items with Dichotomous Responses

In the 1-, 2-, and 3-parameter logistic models (1PLM, 2PLM, and 3PLM) [2], it is assumed that the proficiency level of a respondent (indexed i) can be represented by a one-dimensional proficiency parameter $\theta_i$. In the 3PLM, the probability of a correct response as a function of $\theta_i$ is given by

$$\Pr(Y_{ik} = 1 \mid \theta_i, a_k, b_k, c_k) = P_k(\theta_i) = c_k + (1 - c_k)\,\frac{\exp(a_k(\theta_i - b_k))}{1 + \exp(a_k(\theta_i - b_k))}, \qquad (1)$$
where Yik is a random variable assuming a value equal to one if a correct response was given to item k, and zero otherwise. The three item parameters, ak , bk , and ck are called the discrimination, difficulty and guessing parameter, respectively. The 2PLM follows upon setting the guessing parameter ck equal to zero, and the 1PLM follows upon introducing the additional constraint ak = 1.
Testing the Form of the Item Characteristic Function
Ideally, a test of the fit of the item response function $P_k(\theta_i)$ would be based on assessing whether the proportion of correct responses to item k of respondents with a proficiency level $\theta^*$ matches $P_k(\theta^*)$. In the 2PLM and 3PLM this has the problem that the estimates of the proficiency parameters are virtually unique, so for every available θ-value there is only one observed response on item k available. As a result, we cannot meaningfully compute proportions of correct responses given a value of θ. However, the number-correct scores and the estimates of θ are usually highly correlated, and, therefore, tests of model fit can be based on the assumption that groups of respondents with the same number-correct score will probably be quite homogeneous with respect to their proficiency level. So a broad class of test statistics has the form

$$Q_1 = \sum_{r=1}^{K-1} N_r \frac{(O_{rk} - E_{rk})^2}{E_{rk}(1 - E_{rk})}, \qquad (2)$$

where K is the test length, $N_r$ is the number of respondents obtaining a number-correct score r, $O_{rk}$ is the proportion of respondents with a score r and a correct score on item k, and $E_{rk}$ is the analogous probability under the IRT model. Examples of statistics of this general form are a statistic proposed by Van den Wollenberg [25] for the 1PLM in the framework of conditional maximum likelihood (CML) estimation and a test statistic for the 2PLM and 3PLM proposed by Orlando and Thissen [13] in the framework of marginal maximum likelihood (MML) estimation. For details on the computation of these statistics see [13] and [25] (see Maximum Likelihood Estimation). Simulation studies show that the $Q_1$-statistic is well approximated by a $\chi^2$-distribution [7, 22]. For long tests, it would be practical if a number of adjacent scores could be combined, say to obtain 4 to 6 score-level groups. So the sum over number-correct scores r would be replaced by a sum over score levels g (g = 1, . . . , G) and the test would be based on the differences $O_{gk} - E_{gk}$. Unfortunately, it turns out that the variances of these differences cannot be properly estimated using an expression analogous to the denominator of Formula (2). The problem of weighting the differences $O_{gk} - E_{gk}$ with a proper estimate of their variance (and covariance) will be returned to below.
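A minimal sketch of the $Q_1$ computation for a single item. The expected probabilities $E_{rk}$ must come from a fitted IRT model (for example, MML estimates of the 2PLM or 3PLM), which is not implemented here, so the interface is hypothetical.

```python
import numpy as np

def q1_item_fit(responses, item, expected):
    """Q1 for one item: sum over number-correct scores r = 1..K-1 of
    N_r * (O_rk - E_rk)^2 / (E_rk * (1 - E_rk)).
    `responses` is a persons-by-items 0/1 matrix; `expected[r]` is the model probability
    E_rk of a correct response to `item` for score group r (assumed to satisfy 0 < E_rk < 1)."""
    K = responses.shape[1]
    total = responses.sum(axis=1)          # number-correct scores
    q1 = 0.0
    for r in range(1, K):
        in_group = total == r
        n_r = in_group.sum()
        if n_r == 0:
            continue
        o_rk = responses[in_group, item].mean()
        e_rk = expected[r]
        q1 += n_r * (o_rk - e_rk) ** 2 / (e_rk * (1 - e_rk))
    return q1
```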
Differential Item Functioning

Differential item functioning (DIF) is a difference in item responses between equally proficient members of two or more groups. One might think of a test of foreign language comprehension, where items referring to football might impede girls. The poor performance of the girls on the football-related items must not be attributed to their low proficiency level but to their lack of knowledge of football. Since DIF is highly undesirable in fair testing, methods for the detection of DIF are extensively studied. Several techniques for detection of DIF have been proposed. Most of them are based on evaluation of differences in response probabilities between groups conditional on some measure of proficiency. The most generally used technique is based on the Mantel-Haenszel (MH) statistic [9]. In this approach, the respondent's number-correct score is used as a proxy for proficiency, and DIF is evaluated by testing whether the response probabilities differ between the score groups. In an IRT model, proficiency is represented by a latent variable θ, and DIF can be assessed in the framework of IRT by evaluating whether the same item parameters apply in subgroups (say subgroups g = 1, . . . , G) that are homogeneous with respect to θ. This is achieved by generalizing the test statistic in Formula (2) to

$$Q_1 = \sum_{g=1}^{G} \sum_{r=1}^{K-1} N_{gr} \frac{(O_{grk} - E_{grk})^2}{E_{grk}(1 - E_{grk})}, \qquad (3)$$
where $N_{gr}$ is the number of respondents obtaining a number-correct score r in subgroup g, and $O_{grk}$ and $E_{grk}$ are the observed proportion and estimated probability for that combination of g and r. Combination of number-correct scores has similar complications as discussed above, so this problem will also be addressed in the last section.
Testing Local Independence and Multidimensionality The statistics of the previous section can be used for testing whether the data support the form of the item response functions. Another assumption underlying the IRT models presented above is unidimensionality.
Suppose unidimensionality is violated. If the respondent's position on one latent trait is fixed, the assumption of local stochastic independence requires that the association between the items vanishes. In the case of more than one dimension, however, the respondent's position in the latent space is not sufficiently described by a single one-dimensional proficiency parameter and, as a consequence, the association between the responses to the items given this one proficiency parameter will not vanish. Therefore, tests for unidimensionality are based on the association between the items. Van den Wollenberg [25] and Yen [26, 27] show that violation of local independence can be tested using a test statistic based on the evaluation of the association between items in a 2-by-2 table. Applying this idea to the 3PLM in an MML framework, a statistic can be based on the difference between observed and expected frequencies given by

d_{kl} = n_{kl} - E(N_{kl}) = n_{kl} - \sum_{r=2}^{K-2} n_r P(Y_k = 1, Y_l = 1 \mid R = r),    (4)

where n_{kl} is the observed number of respondents making item k and item l correct, in the group of respondents obtaining a score between 2 and K-2, and E(N_{kl}) is its expectation. Only scores between 2 and K-2 are considered, because respondents with a score less than 2 cannot make both items correct, and respondents with a score greater than K-2 cannot make both items incorrect. So these respondents contribute no information to the 2-by-2 table. Using Pearson's X2-statistic for association in a 2-by-2 table results in

S3_{kl} = \frac{d_{kl}^2}{E(N_{kl})} + \frac{d_{kl}^2}{E(N_{k\bar{l}})} + \frac{d_{kl}^2}{E(N_{\bar{k}l})} + \frac{d_{kl}^2}{E(N_{\bar{k}\bar{l}})},    (5)

where E(N_{k\bar{l}}) is the expectation of making item k correct and item l wrong, and E(N_{\bar{k}l}) and E(N_{\bar{k}\bar{l}}) are defined analogously. Simulation studies by Glas and Suárez-Falcón (2003) show that this statistic is well approximated by a χ2-distribution with one degree of freedom.
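The following sketch (a rough illustration, not the authors' implementation) evaluates d_kl and the statistic in (5) for one item pair. It assumes that the model-implied conditional cell probabilities P(Y_k = a, Y_l = b | R = r) are available from the fitted 3PLM; the data layout and function name are hypothetical.

    import numpy as np
    from scipy.stats import chi2

    def s3_pair(responses, k, l, cell_probs):
        # responses  : (N, K) array of 0/1 item scores
        # cell_probs : cell_probs[r] holds the four model-implied probabilities
        #              (P(Yk=1,Yl=1|R=r), P(Yk=1,Yl=0|R=r),
        #               P(Yk=0,Yl=1|R=r), P(Yk=0,Yl=0|R=r)),
        #              assumed to come from the fitted IRT model
        K = responses.shape[1]
        r = responses.sum(axis=1)
        keep = (r >= 2) & (r <= K - 2)               # only informative scores
        resp, r = responses[keep], r[keep]

        yk, yl = resp[:, k], resp[:, l]
        observed = np.array([np.sum((yk == a) & (yl == b))
                             for a, b in [(1, 1), (1, 0), (0, 1), (0, 0)]])

        n_r = np.bincount(r, minlength=K)            # respondents per score r
        expected = sum(n_r[s] * np.asarray(cell_probs[s], dtype=float)
                       for s in range(2, K - 1))

        d = observed[0] - expected[0]                # Eq. (4)
        s3 = np.sum(d ** 2 / expected)               # Eq. (5)
        return d, s3, chi2.sf(s3, df=1)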
Person Fit

In the previous sections, item fit was investigated across respondents. Analogously, fit of respondents can be investigated across items. Usually, investigation of item fit precedes the investigation of person fit, but evaluation of item and person fit can also be an iterative process. Person fit is used to check whether a person's pattern of item responses is unlikely given the model. Unlikely response patterns may occur because respondents are unmotivated or unable to give proper responses that relate to the relevant proficiency variable, or because they have preknowledge of the correct answers, or because they are cheating. As an example we will consider a person fit statistic proposed by Smith [19]. The set of test items is divided into G nonoverlapping subtests denoted A_g (g = 1, . . . , G) and the test is based on the discrepancies between the observed scores and the expected scores under the model summed within subsets of items. That is, the statistic is defined as

UB = \frac{1}{G-1} \sum_{g=1}^{G} \frac{\left[ \sum_{k \in A_g} \left( y_k - P_k(\theta) \right) \right]^2}{\sum_{k \in A_g} P_k(\theta)\left(1 - P_k(\theta)\right)},    (6)

Since the statistic is computed only for individual students, the index i was dropped. One of the problems of this statistic is that its distribution under the null hypothesis cannot be derived because the effects of the estimation of the parameters are not taken into account. Snijders [20] proposed a correction factor that solves this problem.
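A rough sketch of the UB computation follows (not the authors' code). It assumes that the respondent's estimated θ and the model-implied success probabilities P_k(θ) are already available, and that the subtest partition is supplied as index lists; all names are hypothetical.

    import numpy as np

    def ub_person_fit(y, p, subtests):
        # y        : length-K array of 0/1 item scores for one respondent
        # p        : length-K array of model probabilities P_k(theta), evaluated
        #            at the respondent's estimated theta (assumed given)
        # subtests : list of index arrays A_1, ..., A_G partitioning the items
        G = len(subtests)
        terms = []
        for idx in subtests:
            resid = np.sum(y[idx] - p[idx])           # summed residual within A_g
            var = np.sum(p[idx] * (1.0 - p[idx]))     # its variance under the model
            terms.append(resid ** 2 / var)
        return np.sum(terms) / (G - 1)                # Eq. (6)

    # Hypothetical usage with two subtests of five items each:
    # y = np.array([1, 1, 0, 1, 1, 0, 0, 1, 0, 0]); p = np.full(10, 0.6)
    # ub = ub_person_fit(y, p, [np.arange(5), np.arange(5, 10)])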
Polytomous Items and Multidimensional Models
In the previous section, assessment of fit to IRT models was introduced by considering one-dimensional IRT models for dichotomously scored items. In this section, we will consider a more general framework, where items are scored polytomously (with dichotomous scoring as a special case) and where the latent proficiency can be multidimensional. In polytomous scoring the response to an item k can be in one of the categories j = 0, . . . , Mk . In one-dimensional IRT models for polytomous items, it is assumed that the probability of scoring in an item category depends on a one-dimensional latent variable θ. An example of the category response functions for an item with
Figure 1   Response functions of a polytomously scored item: the category response functions for j = 0, 1, 2, 3 and the expected item score E(Xk | θ), plotted against θ
four ordered response categories is given in Figure 1. Note that the forms of the response functions are plausible for proficiency items: the response function of the zero-category decreases as a function of proficiency, the response function for the highest category increases with proficiency, and the response functions for the intermediate categories are single-peaked. Item response models giving rise to response functions as in Figure 1 fall into three classes [11]: adjacent-category models [10], continuation-ratio models [24] and cumulative probability models [18]. The rationales underlying the models are very different but the shapes of their response functions are very similar. As an example, we give a model in the first class, the generalized partial credit model (GPCM) by Muraki [12]. The probability of a student i scoring in category j on item k is given by

p(Y_{ikj} = 1 \mid \theta_i) = \frac{\exp(j a_k \theta_i - b_{kj})}{1 + \sum_{h=1}^{M_k} \exp(h a_k \theta_i - b_{kh})},    (7)

where Y_{ikj} is a random variable assuming a value one if a response was given in category j (j = 1, . . . , M_k), and assuming a value zero otherwise. Usually, in IRT models it is assumed that there is one (dominant) latent variable that explains test performance. However, it may be clear a priori that multiple latent variables are involved, or the dimensionality of the latent variable structure might not be clear at all. In these cases, multidimensional IRT models can serve confirmatory and exploratory purposes. A multidimensional version of the GPCM is obtained by assuming a Q-dimensional latent variable (θ_i1, . . . , θ_iq, . . . , θ_iQ) and by replacing a_k θ_i in (7) by the sum Σ_{q=1}^{Q} a_kq θ_iq. Usually, it is assumed that (θ_i1, . . . , θ_iq, . . . , θ_iQ) has a multivariate normal distribution [16]. Since the model can be viewed as a factor-analysis model (see Factor Analysis: Exploratory) [23], the latent variables (θ_i1, . . . , θ_iq, . . . , θ_iQ) are often called factor scores, while the parameters (a_k1, . . . , a_kq, . . . , a_kQ) are referred to as factor loadings.
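As a concrete illustration of (7), the following sketch (not part of the original article; the parameter values in the usage comment are made up) evaluates the GPCM category probabilities for a single item.

    import numpy as np

    def gpcm_probs(theta, a, b):
        # theta : latent proficiency of the respondent
        # a     : discrimination parameter a_k of the item
        # b     : array (b_k1, ..., b_kMk) of category parameters; category 0
        #         is the reference with linear predictor 0
        j = np.arange(1, len(b) + 1)                # categories 1, ..., Mk
        num = np.exp(j * a * theta - b)             # exp(j a theta - b_kj)
        denom = 1.0 + num.sum()                     # category 0 contributes exp(0) = 1
        return np.concatenate(([1.0], num)) / denom # probabilities for j = 0, ..., Mk

    # Hypothetical item with four categories (Mk = 3), as in Figure 1:
    # gpcm_probs(theta=0.5, a=1.2, b=np.array([-0.8, 0.1, 1.0]))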
A General Framework for Assessing Model Fit
In this section, we will describe a general approach to the evaluation of fit to IRT models based on the Lagrange multiplier test [15, 1]. Let η1 be a vector of the parameters of some IRT model, and let η2 be a vector of parameters added to this IRT model to obtain a more general model. Let h(η1) and h(η2) be the first-order derivatives of the log-likelihood function. The parameters η1 of the IRT model are estimated by maximum likelihood, so h(η1) = 0. The hypothesis η2 = 0 can be tested using the statistic LM = h(η2)^t Σ^{-1} h(η2), where Σ is the covariance matrix of h(η2). Details on the computation of h(η2) and Σ for IRT models in an MML framework can
be found in Glas [4, 5]. It can be proved that the LM-statistic has an asymptotic χ 2 -distribution with degrees of freedom equal to the number of parameters in η2 . In the next sections, it will be shown how this can be applied to construct tests of fit to IRT models based on residuals.
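The LM statistic itself is a simple quadratic form. The sketch below is only an illustration under the assumption that the derivatives h(η2) and their covariance matrix Σ have already been obtained from the fitted model (obtaining them is the model-specific part, see [4, 5]); the function name is hypothetical.

    import numpy as np
    from scipy.stats import chi2

    def lm_statistic(h2, sigma):
        # h2    : vector of first-order derivatives of the log-likelihood with
        #         respect to the added parameters eta2, evaluated at eta2 = 0
        # sigma : covariance matrix of h2 (both assumed to come from the fitted model)
        h2 = np.asarray(h2, dtype=float)
        lm = float(h2 @ np.linalg.solve(sigma, h2))
        df = h2.size                    # one degree of freedom per added parameter
        return lm, chi2.sf(lm, df)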
Fit of the Response Functions

As above, the score range is partitioned into G subsets to form subgroups that are homogeneous with respect to θ. A problem with polytomously scored items is that there are often too few observations on low item categories in high number-correct score groups and on high item categories in low number-correct score groups. Therefore, we will consider the expectation of the item score, defined by

X_{ik} = \sum_{j=1}^{M_k} j Y_{ikj},    (8)

rather than the category scores themselves. In Figure 1, the expected item score function given θ for the GPCM is drawn as a dashed line. To define the statistic, an indicator function w(y_i^{(k)}, g) is introduced that is equal to one if the number-correct score on the response pattern without item k, say y_i^{(k)}, falls in subrange g, and equal to zero if this is not the case. We will now detail the approach further for the GPCM. The alternative model on which the LM test is based is given by

P(Y_{ikj} = 1 \mid \theta_i, \delta_g, w(y_i^{(k)}, g) = 1) = \frac{\exp(j a_k \theta_i + j \delta_g - b_{kj})}{1 + \sum_{h=1}^{M_k} \exp(h a_k \theta_i + h \delta_g - b_{kh})},    (9)

for j = 1, . . . , M_k. Under the null model, which is the GPCM model, the additional parameter δ_g is equal to zero. In the alternative model, δ_g acts as a shift in ability for subgroup g. If we define η2 = (δ_1, . . . , δ_g, . . . , δ_G), the hypothesis η2 = 0 can be tested using the LM-statistic defined above. It can be shown that h(η2) is a vector of the differences between the observed and expected item scores in the subgroups g = 1, . . . , G. In an MML framework, the statistic is based on the differences

\sum_{i \mid g} \left( x_{ik} - E[X_{ik} \mid y_i] \right),    (10)

for g = 1, . . . , G, where the summations are over all respondents in subgroup g, and the expectation is the posterior expectation given response pattern y_i (for details, see [5]).

Evaluation of Local Independence

Local independence can also be evaluated using the framework of LM-tests. For the GPCM, dependency between item k and item l can be modeled as

P(Y_{ikj} = 1, Y_{ilp} = 1 \mid \theta_i, \delta_{kl}) = \frac{\exp(j \theta_i - b_{kj} + p \theta_i - b_{lp} + \delta_{kjlp})}{1 + \sum_{g} \sum_{h} \exp(g \theta_i - b_{kg} + h \theta_i - b_{lh} + \delta_{kglh})}.    (11)

Note that the parameter δ_{kjlp} models the association between the two items. The LM test can be used to test the special model, where δ_{kjlp} = 0, against the alternative model, where δ_{kjlp} ≠ 0. In an MML framework, the statistic is based on the differences

n_{kjlp} - E(N_{kjlp} \mid y_i),    (12)

where n_{kjlp} is the observed number of respondents scoring in category j of item k and in category p of item l, and E(N_{kjlp}) is its posterior expectation (see [5]). The statistic has an asymptotic χ2-distribution with (M_k − 1)(M_l − 1) degrees of freedom.

Person-fit

The fact that person-fit statistics are computed using estimates of the proficiency parameters θ can lead to serious problems with respect to their distribution under the null hypothesis [20]. However, person-fit statistics can also be redefined as LM-statistics. For instance, the UB-test can be viewed as a test whether the same proficiency parameter θ can account for the responses in all partial response patterns. For the GPCM, we model the alternative that this is not the case by

P(Y_{kj} = 1 \mid \theta, k \in A_g) = \frac{\exp[j(\theta + \theta_g) - b_{kj}]}{1 + \sum_{h=1}^{M_k} \exp[h(\theta + \theta_g) - b_{kh}]}.    (13)
One subtest, say the first, should be used as a reference. Further, the test length is usually too short to consider many subtests, so in practice we often consider only two subtests. Defining η2 = θ2 leads to a test statistic for assessing whether the total score on the second part of the test is as expected from the first part of the test.
An Example

Part of a school effectiveness study serves as a very small illustration. In a study of the effectiveness of Belgian secondary schools, several cognitive and noncognitive tests were administered to 2207 pupils at the end of the first, second, fourth, and sixth school year. The ultimate goal of the analyses was to estimate a correlation matrix between all scales over all time points using concurrent MML estimates of a multidimensional IRT model. Here we focus on the first step of these analyses, which was checking whether a one-dimensional GPCM held for each scale at each time point. The example pertains to the scale for 'Academic Self Esteem', which consisted of 9 items with five response categories each. The item parameters were estimated by MML assuming a standard normal distribution for θ. To compute the LM-statistics, the score range was divided into four sections in such a way that the numbers of respondents scoring in the sections were approximately equal. Section 1 contained the (partial) total-scores r ≤ 7, Section 2 contained the scores 8 ≤ r ≤ 10, Section 3 the scores 11 ≤ r ≤ 13, and Section 4 the scores r ≥ 14. The results are given in Table 1. The column labeled 'LM' gives the values of the LM-statistics; the column labeled 'Prob' gives the significance probabilities. The statistics have three degrees of freedom.
Note that 6 of the 9 LM-tests were significant at a 5% significance level. To assess the seriousness of the misfit, the observed and the expected average item scores in the subgroups are shown under the headings 'Obs' and 'Exp', respectively. Note that the observed average scores increased with the score level of the group. Further, it can be seen that the observed and expected values were quite close: the largest absolute difference was .09 and the mean absolute difference was approximately .02. So from the values of the LM-statistics, it must be concluded that the observed item scores are definitely outside the confidence intervals of their expectations, but the precision of the predictions from the model is good enough to accept the model for the intended application.
Conclusion

The principles sketched above pertain to a broad class of parameterized IRT models, in both logistic and normal-ogive formulations, and their multidimensional generalizations. The focus was on a likelihood-based framework. Recently, however, a Bayesian estimation framework for IRT models has emerged [14, 3] (see Bayesian Item Response Theory Estimation). Evaluation of model fit can be based on the same rationales and statistics as outlined above, only here test statistics are implemented as so-called posterior predictive checks. Examples of this approach are given by Hoijtink [8] and Glas and Meijer [6], who show how item- and person-fit statistics can be used as posterior predictive checks. Another important realm of IRT is that of the so-called nonparametric IRT models [17, 21]. Here, too,
Table 1   Results of the LM test to evaluate fit of the item response functions for the scale 'academic self esteem'

                       Group 1        Group 2        Group 3        Group 4
Item    LM     Prob    Obs    Exp     Obs    Exp     Obs    Exp     Obs    Exp
1        7.95  0.05    1.37   1.37    1.68   1.71    1.90   1.87    2.10   2.07
2       11.87  0.01    0.45   0.50    0.85   0.87    1.23   1.16    1.61   1.64
3        9.51  0.02    0.70   0.77    1.21   1.19    1.53   1.49    1.91   1.92
4        0.64  0.89    0.98   0.97    1.36   1.36    1.60   1.62    1.99   1.98
5       10.62  0.01    0.32   0.33    0.73   0.71    0.99   0.98    1.31   1.34
6       16.90  0.00    0.34   0.33    0.67   0.67    0.98   0.94    1.26   1.31
7       15.17  0.00    0.77   0.77    1.19   1.19    1.48   1.45    1.78   1.85
8        2.37  0.50    0.73   0.72    1.12   1.10    1.35   1.36    1.76   1.77
9        2.41  0.49    1.56   1.54    1.81   1.85    2.04   2.01    2.21   2.23
fit to IRT models can be evaluated by comparing observed proportions and expected probabilities of item responses conditional on number-correct scores, response probabilities in subpopulations, and responses to pairs of items. So the principles are the same as in parametric IRT models, but the implementation differs between applications.
References

[1] Aitchison, J. & Silvey, S.D. (1958). Maximum likelihood estimation of parameters subject to restraints, Annals of Mathematical Statistics 29, 813–828.
[2] Birnbaum, A. (1968). Some latent trait models, in Statistical Theories of Mental Test Scores, F.M. Lord & M.R. Novick, eds, Addison-Wesley, Reading.
[3] Bradlow, E.T., Wainer, H. & Wang, X. (1999). A Bayesian random effects model for testlets, Psychometrika 64, 153–168.
[4] Glas, C.A.W. (1998). Detection of differential item functioning using Lagrange multiplier tests, Statistica Sinica 8, 647–667.
[5] Glas, C.A.W. (1999). Modification indices for the 2PL and the nominal response model, Psychometrika 64, 273–294.
[6] Glas, C.A.W. & Meijer, R.R. (2003). A Bayesian approach to person fit analysis in item response theory models, Applied Psychological Measurement 27, 217–233.
[7] Glas, C.A.W. & Suárez-Falcón, J.C. (2003). A comparison of item-fit statistics for the three-parameter logistic model, Applied Psychological Measurement 27, 87–106.
[8] Hoijtink, H. (2001). Conditional independence and differential item functioning in the two-parameter logistic model, in Essays on Item Response Theory, A. Boomsma, M.A.J. van Duijn & T.A.B. Snijders, eds, Springer, New York, pp. 109–130.
[9] Holland, P.W. & Thayer, D.T. (1988). Differential item functioning and the Mantel-Haenszel procedure, in Test Validity, H. Wainer & H.I. Braun, eds, Lawrence Erlbaum, Hillsdale.
[10] Masters, G.N. (1982). A Rasch model for partial credit scoring, Psychometrika 47, 149–174.
[11] Mellenbergh, G.J. (1995). Conceptual notes on models for discrete polytomous item responses, Applied Psychological Measurement 19, 91–100.
[12] Muraki, E. (1992). A generalized partial credit model: application of an EM algorithm, Applied Psychological Measurement 16, 159–176.
[13] Orlando, M. & Thissen, D. (2000). Likelihood-based item-fit indices for dichotomous item response theory models, Applied Psychological Measurement 24, 50–64.
[14] Patz, R.J. & Junker, B.W. (1999). A straightforward approach to Markov chain Monte Carlo methods for item response models, Journal of Educational and Behavioral Statistics 24, 146–178.
[15] Rao, C.R. (1947). Large sample tests of statistical hypothesis concerning several parameters with applications to problems of estimation, Proceedings of the Cambridge Philosophical Society 44, 50–57.
[16] Reckase, M.D. (1997). A linear logistic multidimensional model for dichotomous item response data, in Handbook of Modern Item Response Theory, W.J. van der Linden & R.K. Hambleton, eds, Springer, New York, pp. 271–286.
[17] Rosenbaum, P.R. (1984). Testing the conditional independence and monotonicity assumptions of item response theory, Psychometrika 49, 425–436.
[18] Samejima, F. (1969). Estimation of latent ability using a pattern of graded scores, Psychometrika Monograph Supplement (17), 100–114.
[19] Smith, R.M. (1986). Person fit in the Rasch model, Educational and Psychological Measurement 46, 359–372.
[20] Snijders, T. (2001). Asymptotic distribution of person-fit statistics with estimated person parameter, Psychometrika 66, 331–342.
[21] Stout, W.F. (1987). A nonparametric approach for assessing latent trait dimensionality, Psychometrika 52, 589–617.
[22] Suárez-Falcón, J.C. & Glas, C.A.W. (2003). Evaluation of global testing procedures for item fit to the Rasch model, British Journal of Mathematical and Statistical Psychology 56, 127–143.
[23] Takane, Y. & de Leeuw, J. (1987). On the relationship between item response theory and factor analysis of discretized variables, Psychometrika 52, 393–408.
[24] Tutz, G. (1990). Sequential item response models with an ordered response, British Journal of Mathematical and Statistical Psychology 43, 39–55.
[25] Van den Wollenberg, A.L. (1982). Two new tests for the Rasch model, Psychometrika 47, 123–140.
[26] Yen, W. (1984). Effects of local item dependence on the fit and equating performance of the three-parameter logistic model, Applied Psychological Measurement 8, 125–145.
[27] Yen, W. (1993). Scaling performance assessments: strategies for managing local item dependence, Journal of Educational Measurement 30, 187–213.
CEES A.W. GLAS
Model Identifiability GUAN-HUA HUANG Volume 3, pp. 1249–1251 in Encyclopedia of Statistics in Behavioral Science ISBN-13: 978-0-470-86080-9 ISBN-10: 0-470-86080-4 Editors Brian S. Everitt & David C. Howell John Wiley & Sons, Ltd, Chichester, 2005
Model Identifiability

In some statistical models, different parameter values can give rise to identical probability distributions. When this happens, there will be a number of different parameter values associated with the maximum likelihood of any set of observed data. This is referred to as the model identifiability problem. For example, if someone attempts to compute the regression equation predicting Y from three variables X1, X2, and their sum (X1 + X2), the program will probably crash or give an error message because it cannot find a unique solution. The model is the same if Y = 0.5X1 + 1.0X2 + 1.5(X1 + X2), Y = 1.0X1 + 1.5X2 + 1.0(X1 + X2), or Y = 2.0X1 + 2.5X2 + 0.0(X1 + X2); indeed there are an infinite number of equally good possible solutions. Model identifiability is a particular problem for the latent class model, a statistical method for finding the underlying traits from a set of psychological tests, because, by postulating latent variables, it is easy to introduce more parameters into a model than can be fitted from the data. A model is identifiable if the parameter values uniquely determine the probability distribution of the data and the probability distribution of the data uniquely determines the parameter values. Formally, let φ be the parameter value of the model, y be the observed data, and F(y; φ) be the probability distribution of the data. A model is identifiable if, for all (φ0, φ) ∈ Φ and for all y ∈ SY,

F(y; φ0) = F(y; φ) if and only if φ0 = φ,    (1)

where Φ denotes the set of all possible parameter values, and SY is the set of all possible values of the data. The most common cause of model nonidentifiability is a poorly specified model. If the number of unique model parameters exceeds the number of independent pieces of observed information, the model is not identifiable. Consider the example of a latent class model that classifies people into three states (severely depressed/mildly depressed/not depressed) and that is used to account for the responses of a group of people to three psychological tests with binary (positive/negative) outcomes. Let (Y1, Y2, Y3) denote the test results and let each take the value
1 when the outcome is positive and 0 when it is negative. S specifies the unobservable state, with S = 1 when there is no depression, 2 when the depression is mild, and 3 when the depression is severe. The probability of the test results is then

\Pr(Y_1 = y_1, Y_2 = y_2, Y_3 = y_3) = \sum_{j=1}^{3} \Pr(S = j) \prod_{m=1}^{3} \Pr(Y_m = 1 \mid S = j)^{y_m} \Pr(Y_m = 0 \mid S = j)^{1 - y_m}.    (2)
The test results have 2^3 − 1 = 7 independent patterns and the model requires 11 unique parameters (two probabilities for depression status, Pr(S = 3) and Pr(S = 2), and one conditional probability Pr(Ym = 1|S = j) for each depression status j and test m); therefore, the model is not identifiable. If the model is not identifiable, one can make it so by imposing various constraints upon the parameters. When there appears to be sufficient total observed information for the number of estimated parameters, it is also necessary to specify the model unambiguously. For the above latent class model, suppose that, for the second and third tests, the probabilities of observing a positive test result are the same for people with severe, mild, or no depression (i.e., Pr(Ym = 1|S = 3) = Pr(Ym = 1|S = 2) = Pr(Ym = 1|S = 1) = pm for m = 2, 3). In other words, only the first test discriminates between the unobservable states of depression. The model now has only seven parameters, which is equal to the number of independent test result patterns. The probability distribution of test results becomes

\Pr(Y_1 = y_1, Y_2 = y_2, Y_3 = y_3) = \psi \prod_{m=2}^{3} (p_m)^{y_m} (1 - p_m)^{1 - y_m},    (3)

where

\psi = (1 - \eta_2 - \eta_3)(p_{11})^{y_1}(1 - p_{11})^{1 - y_1} + \eta_2 (p_{12})^{y_1}(1 - p_{12})^{1 - y_1} + \eta_3 (p_{13})^{y_1}(1 - p_{13})^{1 - y_1},    (4)

η2 = Pr(S = 2), η3 = Pr(S = 3), p11 = Pr(Y1 = 1|S = 1), p12 = Pr(Y1 = 1|S = 2), and p13 = Pr(Y1 = 1|S = 3). ψ imposes two restrictions on the parameters
(i.e., for y1 = 1 or 0), and there are five parameters to consider (i.e., η2, η3, p11, p12 and p13). Because the number of restrictions is less than the number of parameters of interest, ψ and the above latent class model are not identifiable: the same probability distributions could be generated by supposing that there was a large chance of being in a state with a small effect on the probability of being positive on test 1, or by supposing that there was a small chance of being in this state but it was associated with a large probability of responding positively. Sometimes it is difficult to find an identifiable model. A weaker form of identification, called local identifiability, may exist; namely, it may be that other parameters generate the same probability distribution as φ0 does, but one can find an open neighborhood of φ0 that contains none of these parameters [3]. For example, suppose we are interested in β in the regression Y = β²X (β being the square root of the association between Y and X). β = 1 and β = −1 result in the same prediction of Y; thus, the model is not (globally) identifiable. However, the model is locally identifiable because one can easily find two nonoverlapping intervals, (0.5, 1.5) and (−1.5, −0.5), for 1 and −1, respectively. A locally but not globally identifiable model does not have a unique interpretation, but one can be sure that, in the neighborhood of the selected solution, there exist no other equally good solutions; thus, the problem is reduced to determining the regions where local identifiability applies. This concept is especially useful in models containing nonlinearities, as in the above regression example, or models with complex structures, for example, factor analysis, latent class models and Markov Chain Monte Carlo. It is difficult to specify general conditions that are sufficient to guarantee (global) identifiability. Fortunately, it is fairly easy to determine local identifiability. One can require that the columns of the Jacobian matrix, the first-order partial derivative of the
likelihood function with respect to the unique model parameters, are independent [2, 3]. Alternatively, we can examine whether the Fisher information matrix possesses eigenvalues greater than zero [4]. Formann [1] showed that these two approaches are equivalent. A standard practice for checking local identifiability involves using multiple sets of initial values for parameter estimation. Different sets of initial values that yield the same likelihood maximum should result in the same final parameter estimates. If not, the model is not locally identifiable. When applying a nonidentifiable model, different people may draw different conclusions from the same model of the observed data. Before one can meaningfully discuss the estimation of a model, model identifiability must be verified. If researchers come up against identifiability problems, they can first identify the parameters involved in the lack of identifiability from their extremely large asymptotic standard errors [1], and then impose reasonable constraints on identified parameters based on prior knowledge or empirical information.
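As a rough numerical companion to the regression example above and to the eigenvalue criterion just described (not taken from the original article), the following sketch builds the design matrix for predicting Y from X1, X2 and their sum and shows that the corresponding information matrix is singular, so the three regression weights are not identifiable.

    import numpy as np

    rng = np.random.default_rng(0)
    x1, x2 = rng.normal(size=100), rng.normal(size=100)
    X = np.column_stack([x1, x2, x1 + x2])   # predictors X1, X2 and their sum

    info = X.T @ X                           # proportional to the Fisher information
    eigvals = np.linalg.eigvalsh(info)       # smallest eigenvalue is (numerically) zero
    rank = np.linalg.matrix_rank(X)          # rank 2 < 3 parameters: not identifiable

    print(eigvals, rank)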
References

[1] Formann, A.K. (1992). Linear logistic latent class analysis for polytomous data, Journal of the American Statistical Association 87, 476–486.
[2] Goodman, L.A. (1974). Exploratory latent structure analysis using both identifiable and unidentifiable models, Biometrika 61, 215–231.
[3] McHugh, R.B. (1956). Efficient estimation and local identification in latent class analysis, Psychometrika 21, 331–347.
[4] Rothenberg, T.J. (1971). Identification in parametric models, Econometrica 39, 577–591.
GUAN-HUA HUANG
Model Selection LILIA M. CORTINA Volume 3, pp. 1251–1253 in Encyclopedia of Statistics in Behavioral Science ISBN-13: 978-0-470-86080-9 ISBN-10: 0-470-86080-4 Editors Brian S. Everitt & David C. Howell John Wiley & Sons, Ltd, Chichester, 2005
Model Selection When a behavioral scientist statistically models relationships among a set of variables, the goal is to provide a meaningful and parsimonious explanation for those relationships, ultimately achieving a close approximation to reality. However, given the complexity inherent in social science data and the phenomena they attempt to capture, there are typically multiple plausible explanations for any given set of observations. Even when one model fits well, other models with different substantive interpretations are virtually always possible. The task at hand, then, is to determine ‘which models are the “fittest” to survive’ [ 1, p. 71]. In this article, I will review issues to consider when selecting among alternative models, using structural equation modeling (SEM) as a context. However, the same general principles apply to other types of statistical models (e.g., multiple regression, analysis of variance ) as well. In light of the existence of multiple viable explanations for a set of relationships, one approach is to formulate multiple models in advance, test each with the same set of data, and determine which model shows superior qualities in terms of fit, parsimony, interpretability, and meaningfulness. Each model should have a strong theoretical rationale. Justifications for such a model comparison strategy are many. For example, this approach is quite reasonable when investigating a new research domain – if a phenomenon is not yet well understood, there is typically some degree of uncertainty about how it operates, so it makes sense to explore different alternatives [4]. This exploration should happen at the model-development stage (a priori ) rather than at the model-fitting stage (post hoc), to avoid capitalizing on chance. Even in established lines of research, scientists may have competing theoretical propositions to test, or equivocal findings from prior research could suggest multiple modeling possibilities. Finally, researchers can argue more persuasively for a chosen model if they can demonstrate its statistical superiority over rival, theoretically compelling models. For this reason, some methodologists advocate that consideration of multiple alternatives be standard practice when modeling behavioral phenomena [e.g., [2, 5]]. After one has found strong, theoretically sensible results for a set of competing models, the question of
selection then arises: Which model to retain? Model selection guidelines vary depending on whether the alternative models are nested or nonnested. Generally speaking, if Model B is nested within Model A, then Model B contains the same variables as Model A but specifies fewer relations among them. In other words, Model B is a special case or a subset of Model A; because it is more restrictive, Model B cannot fit the data as well [6]. However, Model B provides a more parsimonious explanation of the data, so if the decrement in overall fit is trivial, then B is often considered the better model. To compare the fit of nested structural equation models, the chi-square difference test (also termed the Likelihood Ratio or LR test; see Maximum Likelihood Estimation) is most often employed. For example, when comparing nested Models A and B, the researcher estimates both models, computing the overall χ2 fit statistic for each. She or he then calculates the difference between the two χ2 values (χ2ModelB − χ2ModelA). The result is distributed as a χ2, with degrees of freedom (df) equal to the difference in df for the two models (dfModelB − dfModelA). If this value is significant (according to a chi-square table), then the restrictions imposed in the smaller Model B led to a significant worsening of fit, so the more comprehensive Model A should be retained. (Another way to characterize this same situation is to say that the additional relationships introduced in Model A led to a significant improvement in fit, again suggesting that Model A should be retained.) However, if this test does not point to a significant difference in the fit of these two models, the simpler, nested Model B is typically retained [e.g., 1, 6]. When comparing alternative models that are nonnested, a chi-square difference test is not appropriate. Instead, one can rely on 'information measures of fit', such as Akaike's Information Criterion (AIC), the Corrected Akaike's Information Criterion (CAIC), or the single-sample Expected Cross Validation Index (ECVI). These measures do not appear often in behavioral research, but they are appropriate for the comparison of alternative nonnested models. The researcher simply estimates the models, computes one of these fit indices for each, and then rank-orders the models according to the chosen index; the model with the lowest index value shows the best fit to the data [3]. In the cases reviewed above, only overall 'goodness of fit' is addressed. However, during model
selection, researchers should also pay close attention to the substantive implications of obtained results: Are the structure, valence, and magnitude of estimated relationships consistent with theory? Are the parameter estimates interpretable and meaningful? Even if fit indices suggest one model to be superior to others, this model is useless if it makes no sense from a substantive perspective [4]. Moreover, there often exist multiple plausible models that are mathematically comparable and yield identical fit to a given dataset. These equivalent models differ only in terms of substantive meaning, so this becomes the primary factor driving model selection [5]. One final caveat about model selection bears mention. Even after following all of the practices reviewed here – specifying a parsimonious model based on strong theory, testing it against viable alternatives, and evaluating it to have superior fit and interpretability, a researcher still cannot definitively claim to have captured ‘Truth,’ or even to have identified THE model that BEST approximates reality. Many, many models are always plausible, and selection of an excellent model could artifactually result from failure to consider every possible alternative [4, 5]. With behavioral processes being the complex, messy phenomena that they are, we can only aspire to represent them imperfectly in statistical models,
rarely (if ever) knowing the ‘true’ model. Ay, there’s the rub [6].
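To illustrate the chi-square difference test and the information-measure approach described above, here is a small sketch; the fit values are placeholders standing in for the statistics an SEM program would report, and the AIC line shows one common SEM formulation rather than the only possible one.

    from scipy.stats import chi2

    # Hypothetical output from two nested structural equation models:
    chisq_a, df_a, n_par_a = 112.4, 48, 30    # Model A (less restrictive)
    chisq_b, df_b, n_par_b = 121.9, 52, 26    # Model B (nested in A)

    delta_chisq = chisq_b - chisq_a           # chi-square difference test
    delta_df = df_b - df_a
    p = chi2.sf(delta_chisq, delta_df)
    # p < .05: the restrictions in B significantly worsen fit, retain Model A;
    # otherwise the more parsimonious Model B is typically retained.

    # For nonnested models, rank by an information measure instead, e.g.
    # AIC = chi-square + 2 * (number of free parameters); lowest value wins.
    aic_a = chisq_a + 2 * n_par_a
    aic_b = chisq_b + 2 * n_par_b
    print(round(delta_chisq, 1), delta_df, round(p, 4), aic_a, aic_b)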
References

[1] Bollen, K.A. (1989). Structural Equations with Latent Variables, John Wiley & Sons, New York.
[2] Breckler, S.J. (1990). Applications of covariance structure modeling in psychology: cause for concern? Psychological Bulletin 107, 260–273.
[3] Hu, L. & Bentler, P.M. (1995). Evaluating model fit, in Structural Equation Modeling: Concepts, Issues, and Applications, R.H. Hoyle, ed., Sage Publications, Thousand Oaks, pp. 76–99.
[4] MacCallum, R.C. (1995). Model specification: procedures, strategies, and related issues, in Structural Equation Modeling: Concepts, Issues, and Applications, R.H. Hoyle, ed., Sage Publications, Thousand Oaks, pp. 16–36.
[5] MacCallum, R.C., Wegener, D.T., Uchino, B.N. & Fabrigar, L.R. (1993). The problem of equivalent models in applications of covariance structure analysis, Psychological Bulletin 114, 185–199.
[6] Pedhazur, E.J. (1997). Multiple Regression in Behavioral Research: Explanation and Prediction, 3rd Edition, Wadsworth Publishers.
(See also Goodness of Fit) LILIA M. CORTINA
Models for Matched Pairs VANCE W. BERGER AND JIALU ZHANG
Volume 3, pp. 1253–1256 in Encyclopedia of Statistics in Behavioral Science ISBN-13: 978-0-470-86080-9 ISBN-10: 0-470-86080-4 Editors Brian S. Everitt & David C. Howell John Wiley & Sons, Ltd, Chichester, 2005
Models for Matched Pairs When randomization is impractical or impossible, matched studies are often used to enhance the comparability of comparison groups, so as to facilitate the assessment of the association between an exposure and an event while minimizing confounding. Matched pairs designs, which tend to be more resistant to biases than are historically controlled studies, are characterized by a particular type of statistical dependence, in which each pair in the sample contains observations either from the same subject or from subjects that are related in some way (see Matching). The matching is generally based on subject characteristics such as age, gender, or residential zip code, but could also be on a propensity score based on many such subject characteristics [13]. The standard design for such studies is the case-control study, in which each case is matched to either one control or multiple controls. The simplest example would probably be binary data (see Binomial Distribution: Estimating and Testing Parameters) with one-to-one matched pairs, meaning that the response or outcome variable is binary and each case is matched to a single control. There are several approaches to the analysis of matched data. For example, the matched pairs t Test can be used for a continuous response. Stratified analysis, the McNemar test [11], and conditional logistic regression [6, 7] can be used for data with discrete or binary responses. In some cases, it even pays to break the matching [9].
The Matched Pairs t Test

The matched pairs t Test is used to test for a difference between measurements taken on subjects before an intervention or an event versus measurements taken on them after an intervention or an event. In the test, each subject is allowed to serve as his or her own control. The matched pairs t Test reduces confounding that could result from comparing one group of subjects receiving one treatment to a different group of subjects receiving a different treatment. Let X = (x1, x2, . . . , xn) be the first observations from each of the n subjects, and let Y = (y1, y2, . . . , yn) be the second observations from the
same n subjects. The test statistic is a function of the differences d = (d1, d2, . . . , dn), where di = xi − yi. If the di can be considered to have a normal distribution (an assumption not to be taken lightly [2, 4]) with mean zero and arbitrary variance, then it is appropriate to use the t Test. The test statistic can be written as follows:

t = \frac{\sum_{i} d_i / n}{S_d / \sqrt{n}},    (1)
where Sd is the standard deviation computed from (d1, d2, . . . , dn). Under the null hypothesis, t follows a Student's t distribution with mean zero (see Catalogue of Probability Density Functions). If this test is rejected, then one would conclude that the before and after observations from the same pair are not equivalent to each other.
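A small numerical illustration (the before/after values are made up) computes the statistic of (1) by hand and checks it against the paired test in scipy.

    import numpy as np
    from scipy import stats

    # Hypothetical before/after measurements on n = 8 subjects:
    x = np.array([12.1, 9.8, 11.4, 10.2, 13.0, 9.5, 10.9, 11.7])   # before
    y = np.array([11.3, 9.9, 10.1, 9.6, 12.2, 9.0, 10.4, 10.8])    # after

    d = x - y
    t_manual = d.mean() / (d.std(ddof=1) / np.sqrt(len(d)))        # Eq. (1)

    t_scipy, p_value = stats.ttest_rel(x, y)                       # same t statistic
    print(round(t_manual, 3), round(t_scipy, 3), round(p_value, 3))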
The McNemar Test When dealing with discrete data, we denote by πij the probability of the first observation of a pair having outcome i and the second observation having outcome j . Let nij be the number of such pairs. Clearly, then, πij can be estimated by nij /n, where n is the total number of pairs in the data. If the response is binary, then McNemar’s test can be applied to test the marginal homogeneity of the contingency table with an exposure and an event. That is, McNemar’s test tests the null hypothesis that H0 : π+1 = π1+ , where π+1 = π11 + π21 and π1+ = π11 + π12 . This is equivalent, of course, to testing the null hypothesis that π21 = π12 . This is called symmetry, as in Table 1. The McNemar test statistic z0 is computed as follows: z0 =
\frac{n_{21} - n_{12}}{(n_{21} + n_{12})^{1/2}}.    (2)
Table 1   The structure of a hypothetical 2 × 2 contingency table

                       After
Before        1        2        Total
1             π11      π12      π1+
2             π21      π22      π2+
Total         π+1      π+2      π++
Now, asymptotically, z0² follows a χ2 distribution with one degree of freedom. One can also use a continuity correction, an exact test, or still other variations [10].
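A short sketch (the discordant-pair counts are hypothetical) computes the statistic of (2) and its two-sided p value, both from the normal form and from the equivalent chi-square form.

    import numpy as np
    from scipy.stats import norm, chi2

    n21, n12 = 25, 12                                # hypothetical discordant pairs

    z0 = (n21 - n12) / np.sqrt(n21 + n12)            # Eq. (2)
    p_from_z = 2 * norm.sf(abs(z0))                  # two-sided normal test
    p_from_chi2 = chi2.sf(z0 ** 2, df=1)             # equivalent chi-square form

    print(round(z0, 3), round(p_from_z, 4), round(p_from_chi2, 4))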
Stratified Analysis Stratified analysis is used to control confounding by covariates that are associated with both the outcome and the exposure [14]. For example, if a study examines the association between smoking and lung cancer among several age groups, then age may have a confounding effect on the apparent association between smoking and lung cancer, because there may be an age imbalance with respect to both smoking and incidence of lung cancer. In such a case, the data would need to be stratified by age (see Stratification). One method for doing this is the Mantel-Haenszel test [10], which is performed to measure the average conditional association by estimating the common odds ratio as the sum of weighted odds ratios from each stratum. Of course, in reality, the odds ratios across the strata need not be common, and so the Breslow-Day test [5] can be used to test the homogeneity of odds ratios across the strata. If the strata do not have a common odds ratio, then the association between smoking and lung cancer should be tested separately in each stratum.
Conditional Logistic Regression

If the matching is one-to-one so that there is one control per case, and if the response is binary with outcomes 0 and 1, then within each matched set there are four possibilities for the pair (case, control): (1,0), (0,1), (1,1), (0,0). Denote by (Y_{i1}, Y_{i2}) the two observations of the ith matched set. Then the conditional logistic regression model can be expressed as follows:

\operatorname{logit}[P(Y_{it} = 1)] = \alpha_i + \beta x_{it}, \qquad i = 1, 2, \ldots, n; \; t = 1, 2,    (3)
where xit is the explanatory variable of interest and αi and β are model parameters. In particular, αi describes the matched-set-specific effect while β is a common effect across pairs. With the above
model, one can assume independence of the observations, both for different subjects and within the same subject. Hosmer and Lemeshow [8] provide a dataset with 189 observations representing 189 women, of whom 59 had low birth-weight babies and 130 had normal-weight babies. Several risk factors for low birth-weight babies were under investigation. For example, whether the mother smoked or not (sm) and history of hypertension (ht) were considered as possible predictors, among others. A subset containing data from 20 women is used here (Table 2) as an example to illustrate the 1:1 matching. Specifically, 20 women were matched by age (to the nearest year), and divided accordingly into 10 strata. In each age stratum, there is one case of low birth weight (bwt = 1) and one control (bwt = 0). The conditional logistic regression model for this data set can be written as follows:

\operatorname{logit}[P(Y_{it} = 1)] = \alpha_i + \beta_{sm} x_{sm,it}, \qquad i = 1, 2, \ldots, 10; \; t = 1, 2,    (4)
where Yit represents the outcome for woman t in age group i, αi is the specific effect in age group i,

Table 2   An example of a matched data set (data extracted from Hosmer, D.W. & Lemeshow, S. (2000). Applied Logistic Regression, 2nd Edition, Wiley-Interscience)

Stratum   bwt   Age   Race   Smoke   ht
1         0     14    1      0       0
1         1     14    3      1       0
2         0     15    2      0       0
2         1     15    3      0       0
3         0     16    3      0       0
3         1     16    3      0       0
4         0     17    3      0       0
4         1     17    3      1       0
5         0     17    1      1       0
5         1     17    1      1       0
6         0     17    2      0       0
6         1     17    1      1       0
7         0     17    2      0       0
7         1     17    2      0       0
8         0     17    3      0       0
8         1     17    2      0       1
9         0     18    1      1       0
9         1     18    3      0       0
10        0     18    1      1       0
10        1     18    2      1       0
and x_{sm,it} represents the smoking status of woman t in age group i, which can be either 1 or 0. In this particular case, β_sm is estimated to be 0.55. Therefore, the odds of having a low birth-weight baby for a woman in group i who smokes are exp(β_sm) = 1.74 times the odds for a woman in the same group who does not smoke. Conditional logistic regression can also be used in an n:m matching scenario. If the complete low birth-weight dataset from Hosmer and Lemeshow [8] is used, then each age stratum, instead of having one case and one control, contains multiple cases and multiple controls. A similar conditional logistic regression model can be applied. The conditional logistic regression model can also include extra predictors for the covariates that are not controlled by matched pairs. It is also possible to treat the αi's as random effects to eliminate the large number of nuisance parameters. The pair-specific effect can be modeled as a parameter following some normal distribution with unknown mean and variance. If, instead of the two levels that would prevail in the binary response case, the response has J > 2 levels, then a reference level is chosen for the purpose of comparisons. Without loss of generality, say that level J is considered to be the reference level, to which other levels are compared. Then the model is as follows:

\log \frac{P(Y_{it} = k)}{P(Y_{it} = J)} = \alpha_{ik} + \beta_k x_{it}, \qquad k = 1, 2, \ldots, J - 1; \; t = 1, 2.    (5)
Clearly, this is a generalization of the model for binary data. Specialized methods also exist for analyzing matched pairs in a nonparametric fashion with missing data without assuming that the missing data are missing completely at random (see Missing Data) [1]. When considering 2 × 2 matched pairs and testing for noninferiority, it has been found that asymptotic tests may exceed the claimed nominal Type I error rate [12], and so an exact test (see Exact Methods for Categorical Data) would generally be preferred. The very term ‘exact test’ may appear to be a misnomer in the context of a matched design, because there is neither random sampling nor random allocation, and hence, technically, no basis for formal hypothesis testing [3]. However, the basis for inference in matched designs is distinct from that of randomized designs, which
involve either random sampling or random allocation (see Randomization Based Tests). Specifically, while in a randomized design the outcome of a given subject exposed at a given time to a given treatment is generally taken as fixed (not random), the outcome of a matched design is taken as the random quantity. So here the randomness is a within-subject factor, or, more correctly, is random even within the combination of subject, treatment, and time. That such a random component to the outcomes exists needs to be determined on a case-by-case basis. Rubin [13] pointed out that the lack of randomization creates sensitivity to the assignment mechanism, which cannot be avoided simply by using Bayesian methods instead of randomization-based methods.
References

[1] Akritas, M.G., Kuha, J. & Osgood, D.W. (2002). A nonparametric approach to matched pairs with missing data, Sociological Methods & Research 30(3), 425–454.
[2] Berger, V.W. (2000). Pros and cons of permutation tests in clinical trials, Statistics in Medicine 19, 1319–1328.
[3] Berger, V.W. & Bears, J. (2003). When can a clinical trial be called 'randomized'? Vaccine 21, 468–472.
[4] Berger, V.W., Lunneborg, C., Ernst, M.D. & Levine, J.G. (2002). Parametric analyses in randomized clinical trials, Journal of Modern Applied Statistical Methods 1(1), 74–82.
[5] Breslow, N.E. & Day, N.E. (1993). Statistical Methods in Cancer Research, Vol. I: The Analysis of Case-Control Studies, IARC Scientific Publications, No. 32, Oxford University Press, New York.
[6] Cox, D.R. (1958). Planning of Experiments, John Wiley, New York.
[7] Cox, D.R. (1970). The Analysis of Binary Data, Methuen, London.
[8] Hosmer, D.W. & Lemeshow, S. (2000). Applied Logistic Regression, 2nd Edition, Wiley-Interscience.
[9] Lynn, H.S. & McCulloch, C.E. (1992). When does it pay to break the matches for analysis of a matched-pairs design? Biometrics 48, 397–409.
[10] Mantel, N. & Haenszel, W. (1959). Statistical aspects of the analysis of data from retrospective studies of diseases, Journal of the National Cancer Institute 22, 719–748.
[11] McNemar, Q. (1947). Note on the sampling error of the difference between correlated proportions or percentages, Psychometrika 12, 153–157.
[12] Rigby, A.S. & Robinson, M.B. (2000). Statistical methods in epidemiology. IV. Confounding and the matched pairs odds ratio, Disability and Rehabilitation 22(6), 259–265.
[13] Rubin, D.B. (1991). Practical implications of modes of statistical inference for causal effects and the critical role of the assignment mechanism, Biometrics 47(4), 1213–1234.
[14] Sidik, K. (2003). Exact unconditional tests for testing non-inferiority in matched-pairs design, Statistics in Medicine 22, 265–278.

Further Reading

Agresti, A. (2002). Categorical Data Analysis, 2nd Edition, Wiley-Interscience.
VANCE W. BERGER AND JIALU ZHANG
Moderation J. MATTHEW BEAUBIEN Volume 3, pp. 1256–1258 in Encyclopedia of Statistics in Behavioral Science ISBN-13: 978-0-470-86080-9 ISBN-10: 0-470-86080-4 Editors Brian S. Everitt & David C. Howell John Wiley & Sons, Ltd, Chichester, 2005
Moderation Every field of scientific inquiry begins with the search for universal predictor–criterion correlations. Unfortunately, universal relationships rarely exist in nature [4]. At best, researchers find that their hypothesized correlations are weaker than expected. At worst, the hypothesized correlations are wildly inconsistent from study to study. The resulting cacophony has caused more than a few researchers to abandon promising lines of research for others that are (hopefully) more tractable. In some cases, however, these counterintuitive or conflicting findings motivate the researchers to reexamine their underlying theoretical models. For example, the researchers may attempt to specify the conditions under which the hypothesized predictor–criterion relationship will hold true. This theoretical-based search for moderator variables – or interaction effects – is one indication of the sophistication or maturity of a field of study [3].
The Moderator-mediator Distinction

Many researchers confuse the concepts of moderation and mediation [3, 5]. However, the distinction is relatively simple. Moderation concerns the effect of a third variable (z) on the strength or direction of a predictor–criterion (x − y) correlation. Therefore, moderation addresses the issues of 'when?', 'for whom?', or 'under what conditions?' the hypothesized predictor–criterion correlation holds true. In essence, moderator variables are nothing more than interaction effects. A moderated relationship is represented by a single, nonadditive, linear function where the criterion variable (y) varies as a product of the independent (x) and moderator variables (z) [5]. Algebraically, this function is expressed as y = f(x, z). When analyzed using multiple regression, the function is usually expressed as y = b1 x + b2 z + b3 xz + e. Specifically, the predicted value of y is modeled as a function of the independent variable (x), the moderator variable (z), their interaction (xz), and measurement error (e). Graphically, moderated relationships are represented by an arrow from the moderator (z) that intersects the x − y relationship at a 90° angle (see Figure 1). Although
Figure 1 Examples of (a) direct, (b) moderated, and (c) mediated relationships
moderated relationships can help resolve apparently contradictory research findings, they do not imply causality. By contrast, mediation represents one or more links in the causal chain (z) between the predictor (x) and criterion (y) variables. Therefore, mediation addresses the issues of 'how?' or 'why?' the predictor variable influences the criterion [3]. In essence, mediator variables are caused by the predictor and, in turn, predict the criterion. Unlike moderated relationships, mediated relationships are represented by two or more additive, linear functions. Algebraically, these functions are expressed as y = f(x), z = f(x), and y = f(z). When analyzed using multiple regression, these functions are usually expressed as z = bx + e and y = bz + e. Specifically, the predicted value of the mediator (z) varies as a function of the independent variable (x) and measurement error (e). In addition, the predicted value of the criterion variable (y) varies as a function of the mediator (z) and measurement error (e). Graphically, mediated relationships are represented as a series of arrows in which the predictor variable influences the mediator variable, which in turn influences the criterion variable (see Figure 1). Unlike moderated relationships, mediated relationships specify the chain of causality [3, 5].
Testing Moderated Relationships

Moderated relationships can be tested in a variety of ways. When both the predictor and moderator variables are measured as categorical variables, the moderated relationship can be tested using analysis of variance (ANOVA). However, when one or both are measured on a continuous scale, hierarchical regression is preferred (see Hierarchical Models). Many researchers favor regression because it is more flexible than ANOVA. It also eliminates the need to artificially dichotomize continuous variables. Regardless of which analytical technique is used, the tests are conducted in very similar ways. First, the researcher needs to carefully choose the moderator variable. The moderator variable is typically selected on the basis of previous research, theory, or both. Second, the researcher needs to specify the nature of the moderated relationship. Most common are enhancing interactions, which occur when the moderator enhances the effect of the predictor variable, or buffering interactions, which occur when the moderator weakens the effect of the predictor variable. Less common are antagonistic interactions, which occur when the predictor and moderator variables have the same effect on the criterion, but their interaction produces an opposite effect [3]. Third, the researcher needs to ensure that the study has sufficient statistical power to detect an interaction effect. Previous research suggests that the power to detect interactions is substantially lower than the 0.80 threshold [1]. Researchers should consider several factors when attempting to maximize their study's statistical power. For example, researchers should consider not only the total effect size (R2) but also the incremental effect size (ΔR2) when selecting the necessary minimum sample sizes. Other important tasks include selecting reliable and valid measures, taking steps such as oversampling to avoid range restriction, centering predictor variables to reduce collinearity, and ensuring that subgroups have equivalent sample sizes and error variances (i.e., when one or more of the variables is measured on a categorical scale) [3]. Fourth, the researcher needs to create the appropriate product terms. These product terms, which are created by multiplying the predictor and moderator variables, represent the interaction between them.
Fifth, the researcher needs to structure the equation. For example, using hierarchical regression, the researcher would enter the predictor and moderator variables during the first step. After controlling for these variables, the researcher would enter the interaction terms during the second step. The significance of the interaction term is determined by examining the direction of the interaction term's regression weight, the magnitude of the effect (ΔR2), and its statistical significance [3]. Finally, the researcher should plot the effects to determine the type of effect. For each grouping variable, the researcher should plot the scores at the mean, one standard deviation above the mean, and one standard deviation below the mean. These plots should help the researcher to visualize the form of the moderator effect: an enhancing interaction, a buffering interaction, or an antagonistic interaction. Alternatively, the researcher could test the statistical significance of the simple regression slopes for different values of the moderator variable [2].
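The following sketch (simulated data, not from the original entry) works through these steps in a minimal way: it centers the predictor and moderator, adds the product term in a second step, and tests the increment in R² with an F test.

    import numpy as np
    from scipy.stats import f as f_dist

    rng = np.random.default_rng(1)
    n = 200
    x = rng.normal(size=n)                        # predictor
    z = rng.normal(size=n)                        # moderator
    y = 0.4 * x + 0.3 * z + 0.25 * x * z + rng.normal(size=n)

    x_c, z_c = x - x.mean(), z - z.mean()         # centering reduces collinearity

    def r_squared(design, y):
        beta, *_ = np.linalg.lstsq(design, y, rcond=None)
        resid = y - design @ beta
        return 1 - resid.var() / y.var()

    ones = np.ones(n)
    r2_main = r_squared(np.column_stack([ones, x_c, z_c]), y)             # step 1
    r2_full = r_squared(np.column_stack([ones, x_c, z_c, x_c * z_c]), y)  # step 2

    delta_r2 = r2_full - r2_main                  # incremental effect of the product term
    k_full = 3                                    # predictors in the full model
    f_stat = (delta_r2 / 1) / ((1 - r2_full) / (n - k_full - 1))
    p_value = f_dist.sf(f_stat, 1, n - k_full - 1)
    print(round(delta_r2, 3), round(f_stat, 2), round(p_value, 4))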
Miscellaneous In this entry, we explored the basics of moderation, the search for interaction effects. The discussion was limited to single-sample studies using analytical techniques such as ANOVA or hierarchical regression. However, moderator analyses can also be assessed in other ways. For example, moderators can be assessed in structural equation modeling (SEM) or meta-analysis by running the model separately for various subgroups and comparing the two sets of results. Regardless of how moderators are tested, previous research suggests that tests for moderation tend to be woefully underpowered. Therefore, it should come as no surprise that many researchers have failed to find significant interaction effects, even though they are believed to be the norm, rather than the exception [4].
References

[1] Aguinis, H., Boik, R. & Pierce, C. (2001). A generalized solution for approximating the power to detect effects of categorical moderator variables using multiple regression, Organizational Research Methods 4, 291–323.
[2] Cohen, J. & Cohen, P. (1983). Applied Multiple Regression/Correlation Analysis for the Behavioral Sciences, 2nd Edition, Lawrence Erlbaum, Mahwah.
[3] Frazier, P.A., Tix, A.P. & Barron, K.E. (2004). Testing moderator and mediator effects in counseling psychology research, Journal of Counseling Psychology 51, 115–134.
[4] Jaccard, J., Turrisi, R. & Wan, C. (1990). Interaction Effects in Multiple Regression, Sage Publications, Thousand Oaks.
[5] James, L.R. & Brett, J.M. (1984). Mediators, moderators, and tests for mediation, Journal of Applied Psychology 69, 307–321.
J. MATTHEW BEAUBIEN
Moments REBECCA WALWYN Volume 3, pp. 1258–1260 in Encyclopedia of Statistics in Behavioral Science ISBN-13: 978-0-470-86080-9 ISBN-10: 0-470-86080-4 Editors Brian S. Everitt & David C. Howell John Wiley & Sons, Ltd, Chichester, 2005
Moments
Moments are an important class of expectation used to describe probability distributions. Together, the entire set of moments of a random variable will generally determine its probability distribution exactly. There are three main types of moments:

1. raw moments,
2. central moments, and
3. factorial moments.
Raw Moments Where a random variable is denoted by the letter X, and k is any positive integer, the kth raw moment of X is defined as E(X k ), the expectation of the random variable X raised to the power k. Raw moments are usually denoted by µk where µk = E(X k ), if that expectation exists. The first raw moment of X is µ1 = E(X), also referred to as the mean of X. The second raw moment of X is µ2 = E(X 2 ), the third µ3 = E(X 3 ), and so on. If the kth moment of X exists, then all moments of lower order also exist. Therefore, if the E(X 2 ) exists, it follows that E(X) exists.
Central Moments Where X again denotes a random variable and k is any positive integer, the kth central moment of X is defined as E[(X − c)k ], the expectation of X minus a constant, all raised to the power k. Where the constant is the mean of the random variable, this is referred to as the kth central moment around the mean. Central moments around the mean are usually denoted by µk where µk = E[(X − µX )k ]. The first central moment is equal to zero as µ1 = E[(X − µX )] = E(X) − E(µX ) = µX − µX = 0. In fact, if the probability distribution is symmetrical around the mean (e.g., the normal distribution) all odd central moments of X around the mean are equal to zero, provided they exist. The most important central moment is the second central moment of X around the mean. This is µ2 = E[(X − µX )2 ], the variance of X. The third central moment about the mean, µ3 = E[(X − µX )3 ], is sometimes used as a measure of
0.3
f (x)
Moments are an important class of expectation used to describe probability distributions. Together, the entire set of moments of a random variable will generally determine its probability distribution exactly. There are three main types of moments:
0.2 0.1 0.0 1.00 2.00 3.00 4.00 5.00 6.00 7.00 8.00 9.00 X
Figure 1 An example of an asymmetrical probability distribution where the third central moment around the mean is equal to zero
asymmetry or skewness. As an odd central moment around the mean, µ3 is equal to zero if the probability distribution is symmetrical. If the distribution is negatively skewed, the third central moment about the mean is negative, and if it is positively skewed, the third central moment around the mean is positive. Thus, knowledge of the shape of the distribution provides information about the value of µ3 . Knowledge of µ3 does not necessarily provide information about the shape of the distribution, however. A value of zero may not indicate that the distribution is symmetrical. As an illustration of this, µ3 is approximately equal to zero for the distribution depicted in Figure 1, but it is not symmetrical. The third central moment is therefore not used much in practice. The fourth central moment about the mean, µ4 = E[(X − µX )4 ], is sometimes used as a measure of excess or kurtosis. This is the degree of flatness of the distribution near its center. The coefficient of kurtosis (µ4 /σ 4 − 3) is sometimes used to compare an observed distribution to that of a normal curve. Positive values are thought to be indicative of a distribution that is more peaked around its center than that of a normal curve, and negative values are thought to be indicative of a distribution that is more flat around its center than that of a normal curve. However, as was the case for the third central moment around the mean, the coefficient of kurtosis does not always indicate what it is supposed to.
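As a hedged illustration (not from the original entry), the following Python sketch estimates the central moments and the skewness and kurtosis coefficients just described from a simulated sample; scipy is assumed to be available, and the gamma sample is chosen only to give visible skew.

```python
import numpy as np
from scipy import stats

# Sample from a skewed distribution to illustrate sample versions of the moments.
rng = np.random.default_rng(0)
x = rng.gamma(shape=2.0, scale=1.0, size=100_000)

mean = x.mean()                      # first raw moment, E(X)
m2 = np.mean((x - mean) ** 2)        # second central moment (variance)
m3 = np.mean((x - mean) ** 3)        # third central moment (sign reflects skew)
m4 = np.mean((x - mean) ** 4)        # fourth central moment

print('skewness  mu3/sigma^3   :', m3 / m2 ** 1.5, ' scipy:', stats.skew(x))
print('kurtosis  mu4/sigma^4 - 3:', m4 / m2 ** 2 - 3, ' scipy:', stats.kurtosis(x))
```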
Factorial Moments
Finally, where X denotes a random variable and k is any positive integer, the kth factorial moment of X is defined as the expectation E[X(X − 1) . . . (X − k + 1)]. The first factorial moment of X is therefore E(X), the second factorial moment of X is E[X(X − 1)] = E(X² − X), and so on. Factorial moments are easier to calculate than raw moments for some random variables (usually discrete). As raw moments can be obtained from factorial moments and vice versa, it is sometimes easier to obtain the raw moments for a random variable from its factorial moments.
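A small illustrative sketch (not part of the entry): for a Poisson variable the factorial moments have a simple closed form, which is easy to check by simulation and which in turn yields the raw moments.

```python
import numpy as np

# For a Poisson(lam) variable the kth factorial moment is lam**k, which is why
# factorial moments are convenient for this (discrete) distribution.
rng = np.random.default_rng(0)
lam = 3.0
x = rng.poisson(lam, size=200_000)

second_factorial = np.mean(x * (x - 1))          # E[X(X-1)]
second_raw = second_factorial + x.mean()         # E(X^2) = E[X(X-1)] + E(X)
print(second_factorial, 'vs lam**2 =', lam**2)
print(second_raw, 'vs lam**2 + lam =', lam**2 + lam)
```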
Moment Generating Function
For each type of moment, there is a function that can be used to generate all of the moments of a random variable or probability distribution. This is referred to as the moment generating function and denoted mgf, m_X(t), or m(t). In practice, however, it is often easier to calculate moments directly. The main use of the moment generating function is therefore in characterizing a distribution and for theoretical purposes. For instance, if a moment generating function of a random variable exists, then this moment generating function uniquely determines the corresponding distribution function. As such, it can be shown that if the moment generating functions of two random variables both exist and are equal for all values of t in an interval around zero, then the two cumulative distribution functions are equal. However, existence of all moments is not equivalent to existence of the moment generating function. More information on the topic of moments and moment generating functions is given in [1, 2, 3].
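As an illustration (not from the entry), the following sketch uses sympy, assuming it can evaluate the Gaussian integral in closed form, to compute the mgf of a standard normal variable and recover its raw moments by differentiating at t = 0.

```python
import sympy as sp

# mgf of a standard normal variable, m(t) = E(exp(tX)); successive derivatives
# at t = 0 recover the raw moments (0, 1, 0, 3, ...).
x, t = sp.symbols('x t', real=True)
density = sp.exp(-x**2 / 2) / sp.sqrt(2 * sp.pi)
mgf = sp.simplify(sp.integrate(sp.exp(t * x) * density, (x, -sp.oo, sp.oo)))
print(mgf)                                        # expected: exp(t**2/2)

for k in range(1, 5):
    print(k, sp.diff(mgf, t, k).subs(t, 0))       # the kth raw moment of N(0, 1)
```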
References
[1] Casella, G. & Berger, R.L. (1990). Statistical Inference, Duxbury Press, Belmont.
[2] DeGroot, M.H. (1986). Probability and Statistics, 2nd Edition, Addison-Wesley, Reading.
[3] Mood, A.M., Graybill, F.A. & Boes, D.C. (1974). Introduction to the Theory of Statistics, McGraw-Hill, Singapore.
REBECCA WALWYN
Monotonic Regression
JAN DE LEEUW
Volume 3, pp. 1260–1261 in Encyclopedia of Statistics in Behavioral Science. ISBN-13: 978-0-470-86080-9; ISBN-10: 0-470-86080-4. Editors: Brian S. Everitt & David C. Howell. John Wiley & Sons, Ltd, Chichester, 2005.
Monotonic Regression
In linear regression, we fit a linear function y = α + βx to a scatterplot of n points (x_i, y_i). We find the parameters α and β by minimizing
σ(α, β) = Σ_{i=1}^{n} w_i (y_i − α − βx_i)²,    (1)
where the w_i are known positive weights (see Multiple Linear Regression). In the more general nonlinear regression problem, we fit a nonlinear function φ_θ(x) by minimizing
σ(θ) = Σ_{i=1}^{n} w_i (y_i − φ_θ(x_i))²    (2)
over the parameters θ. In both cases, consequently, we select the minimizing function from a family of functions indexed by a small number of parameters. In some statistical techniques, low-dimensional parametric models are too restrictive. In nonmetric multidimensional scaling [3], for example, we can only use the rank order of the x_i and not their actual numerical values. Parametric methods become useless, but we still can fit the best fitting monotone (increasing) function nonparametrically. Suppose there are no ties in x, and the x_i are ordered such that x_1 < · · · < x_n. In monotone regression, we minimize
σ(z) = Σ_{i=1}^{n} w_i (y_i − z_i)²    (3)
over z, under the linear inequality restrictions that z_1 ≤ · · · ≤ z_n. If the solution to this problem is ẑ, then the best fitting increasing function is the set of pairs (x_i, ẑ_i). In monotone regression, the number of parameters is equal to the number of observations. The only reason we do not get a perfect solution all the time is because of the order restrictions on z. Actual computation of the best fitting monotone function is based on the theorem that if y_i > y_{i+1}, then ẑ_i = ẑ_{i+1}. In words: if two consecutive values of y are in the wrong order, then the two corresponding consecutive values of the solution ẑ will be equal. This basic theorem leads to a simple algorithm, because knowing that two values of ẑ must be equal reduces the number of parameters by one. We thus have a monotone regression problem with n − 1
parameters. Either the elements are now in the correct order, or there is a violation, in which case we can reduce the problem to one with n − 2 parameters, and so on. This process always comes to an end, in the worst possible case when we only have a single parameter left, which is obviously monotone. We can formalize this in more detail as the up-and-down-blocks algorithm of [4]. It is illustrated in Table 1, in which the first column is y. The first violation we find is 3 > 0, or 3 is not up-satisfied. We merge the two elements into a block, which contains their weighted average 3/2 (in our example all weights are one). But now 2 > 3/2, and thus the new value 3/2 is not down-satisfied. We merge all three values into a block of three and find 5/3, which is both up-satisfied and down-satisfied. We then continue with the next violation. Clearly, the algorithm produces a decreasing number of blocks. The value of a block is computed using weighted averaging, where the weight of a block is the sum of the weights of the elements in the block. In our example, we wind up with only two blocks, and thus the best fitting monotone function ẑ is a step function with a single step from 5/3 to 4. The result is plotted in Figure 1. The line through the points x and y is obviously the best possible fitting function. The best fitting monotone function, which we just computed, is the step function consisting of the two horizontal lines.
If x has ties, then this simple algorithm does not apply. There are two straightforward adaptations [2]. In the primary approach to ties, we start our monotone regression with blocks of y values corresponding to the ties in x. Thus, we require tied x values to correspond with tied z values. In the secondary approach, we pose no constraints on tied values, and it can be shown that in that case we merely have to order the y values such that they are increasing in blocks of tied x values, and then perform an ordinary monotone regression.

Table 1  Computing monotone regression
y      →      →      →      ẑ
2      2      5/3    5/3    5/3
3      3/2    5/3    5/3    5/3
0      3/2    5/3    5/3    5/3
6      6      6      6      4
6      6      6      3      4
0      0      0      3      4
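The following Python function is a minimal sketch (not the author's code) of the up-and-down-blocks, or pool-adjacent-violators, idea, reproducing the result in Table 1 for y = (2, 3, 0, 6, 6, 0) with unit weights.

```python
# Pool adjacent violators with weighted block averaging.
def monotone_regression(y, w=None):
    w = [1.0] * len(y) if w is None else list(w)
    blocks = []                      # each block holds [weighted mean, total weight, count]
    for value, weight in zip(y, w):
        blocks.append([float(value), float(weight), 1])
        # merge while the newest block violates monotonicity with its predecessor
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, n2 = blocks.pop()
            m1, w1, n1 = blocks.pop()
            blocks.append([(m1 * w1 + m2 * w2) / (w1 + w2), w1 + w2, n1 + n2])
    fitted = []                      # expand blocks back to one fitted value per point
    for mean, _, count in blocks:
        fitted.extend([mean] * count)
    return fitted

print(monotone_regression([2, 3, 0, 6, 6, 0]))
# [1.666..., 1.666..., 1.666..., 4.0, 4.0, 4.0] -- the step from 5/3 to 4 in Table 1
```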
Monotone regression can be generalized in several important directions. First, basically the same algorithm can be used to minimize any separable function of the form Σ_{i=1}^{n} f(y_i − z_i), with f any convex function with a minimum at zero. For instance, f can be the absolute value function, in which case we merge blocks by computing medians instead of means. And second, we can generalize the algorithm from weak orders to partial orders in which some elements cannot be compared; for details, see [1]. Finally, it is sometimes necessary to compute the least squares monotone regression with a nondiagonal weight matrix. In this case, the simple block merging algorithms no longer apply, and more general quadratic programming methods must be used.

Figure 1  Plotting monotone regression (y against x: the data points, the best fitting line, and the two-step best fitting monotone function)

References
[1] Barlow, R.E., Bartholomew, R.J., Bremner, J.M. & Brunk, H.D. (1972). Statistical Inference under Order Restrictions, Wiley, New York.
[2] De Leeuw, J. (1977). Correctness of Kruskal's algorithms for monotone regression with ties, Psychometrika 42, 141–144.
[3] Kruskal, J.B. (1964a). Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis, Psychometrika 29, 1–27.
[4] Kruskal, J.B. (1964b). Nonmetric multidimensional scaling: a numerical method, Psychometrika 29, 115–129.
JAN DE LEEUW
Monte Carlo Goodness of Fit Tests
JULIAN BESAG
Volume 3, pp. 1261–1264 in Encyclopedia of Statistics in Behavioral Science. ISBN-13: 978-0-470-86080-9; ISBN-10: 0-470-86080-4. Editors: Brian S. Everitt & David C. Howell. John Wiley & Sons, Ltd, Chichester, 2005.
Monte Carlo Goodness of Fit Tests

Monte Carlo P values
It is often necessary, particularly at a preliminary stage of data analysis, to investigate the compatibility between a known multivariate distribution {π(x) : x ∈ S} and a corresponding single observation x^(1) ∈ S. Here a 'single observation' may mean a vector, perhaps corresponding to a random sample from a univariate distribution, or a table or an image or whatever. In exponential families (see Generalized Linear Models (GLM)), the requirement that the distribution is known can be achieved by first conditioning on sufficient statistics so as to eliminate the parameters from the original formulation. In frequentist inference, evidence of a conflict between x^(1) and π is quantified by the P value obtained by comparing the observed value u^(1) of a particular test statistic u = u(x) with its 'null distribution' under π. For small datasets, exact calculations can be made, but usually the null distribution of u is intractable analytically and computationally, and so asymptotic chi-squared approximations are invoked. However, such approximations are often invalid because the data are too sparse. This is common in analyzing multidimensional contingency tables; see [1, Section 7.1.5] for a 2^5 table in which the conclusion is questionable. For definiteness, suppose that x^(1) is a table and that relatively large values of u^(1) indicate a conflict with π. Then an alternative to the above approach is available if a random sample of tables x^(2), . . . , x^(m) can be drawn from π, producing values u^(2), . . . , u^(m) of the test statistic u. For if x^(1) is indeed from π and ignoring for the moment the possibility of ties, the rank of u^(1) among u^(1), . . . , u^(m) is uniform on 1, . . . , m. It follows that, if u^(1) turns out to be kth largest among all m values, an exact P value k/m can be declared. This procedure, suggested independently in [2] and [8], is called a Monte Carlo test (see Monte Carlo Simulation), though there is sometimes confusion with approximate P values obtained by using simulation to estimate the percentiles of the null distribution of u. Both types of P values converge to the P value in the preceding paragraph as m → ∞. The choice of m is governed by computational considerations, with m = 100 or 1000 or 10 000 the
most popular. Note that, if several investigators carry out the same test on the same data x (1) , they will generally obtain slightly different P values, despite the fact that marginally each result is exact! Such differences should not be important at a preliminary stage of analysis and disparities diminish as m increases. Ties between ranks can occur with discrete data, in which case one can quote a corresponding range of P values, though one may also eliminate the problem by using a randomized rule. For detailed investigation of Monte Carlo tests when π corresponds to a random sample of n observations from a population, see [11, 13]. A useful refinement is provided by sequential Monte Carlo tests [4] (see Sequential Testing). First, one specifies a maximum number of simulations m − 1, as before, but now additionally a minimum number h, typically 10 or 20. Then x (2) , . . . , x (m) are drawn sequentially from π but with the proviso that sampling is terminated if ever h of the corresponding u(t) ’s exceed u(1) , in which case a P value h/ l is declared, where l ≤ m − 1 is the number of simulations; otherwise, the eventual P value is k/m, as before. See [3] for the validity of this procedure. Sequential tests encourage early termination when there is no evidence against π but continue sampling and produce a finely graduated P value when the evidence against the model is substantial. For example, if the model is correct and one chooses m = 1000 and h = 20, the expected sample size is reduced to 98. For more on simple Monte Carlo tests, see, for example [14]. Such tests have been especially useful in the preliminary analysis of spatial data; see, for example, [5] and [7]. The simplest application occurs in assessing whether a spatial point pattern over a perhaps awkwardly shaped study region A is consistent with a homogeneous Poisson process. By conditioning on the observed number of points n, the test is reduced to one of uniformity in which comparisons are made between the data and m − 1 realizations of n points placed entirely at random within A, using any choice of test statistic that is sensitive to interesting departures from the Poisson process.
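As a hedged illustration (not part of the entry), the following Python sketch carries out a simple Monte Carlo test of the kind just described, using the absolute sample skewness as the test statistic for a standard normal null; both the statistic and the data are chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def u(x):
    # test statistic: absolute sample skewness (large values signal conflict with normality)
    d = x - x.mean()
    return abs(np.mean(d**3) / np.mean(d**2) ** 1.5)

x_obs = rng.exponential(size=50) - 1.0          # data that in fact violate the null
m = 1000
u_obs = u(x_obs)
u_sim = [u(rng.normal(size=x_obs.size)) for _ in range(m - 1)]

# if the observed statistic is kth largest among all m values, the exact P value is k/m
k = 1 + sum(v >= u_obs for v in u_sim)
print('Monte Carlo P value:', k / m)
```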
Markov Chain Monte Carlo P values
Unfortunately, it is not generally practicable to generate samples directly from the target distribution π. For example, this holds even when testing for no
three-way interaction in a three-dimensional contingency table: here the face totals are the sufficient statistics but it is not known how to generate random samples from the corresponding conditional distribution, except for very small tables. However, it is almost always possible to employ the Metropolis– Hastings algorithm [12, 16] to construct the transition matrix or kernel P of a Markov chain for which π is a stationary distribution. Furthermore, under the null hypothesis, if one seeds the chain by the data x (1) , then the subsequent states x (2) , . . . , x (m) are also sampled from π. This provides an advantage over other Markov Chain Monte Carlo (MCMC) applications in which a burn-in phase is required to achieve stationarity. However, there is now the problem that successive states are dependent and so there is no obvious way in which to devise a legitimate P value for the test. Leaving gaps of r steps, where r is large, between each x (t) and x (t+1) or, in other words, replacing P by P r , reduces the problem but could still lead to serious bias and, in any case, the goal in MCMC methods is to accommodate the dependence rather than effectively eliminate it, which might require prohibitively long runs (see Markov Chain Monte Carlo and Bayesian Statistics). Two remedies that incorporate dependence and yet retain the exact P values of simple Monte Carlo testing are given in [3]. Both involve running the chain backwards, as well as forwards, in time. This is possible for any stationary Markov chain via its corresponding backwards transition matrix or kernel Q and is trivial if P is time reversible, because then Q = P . Reversibility can always be arranged but we do not assume it here in describing the simpler of two fixes in [3]. Thus, instead of running the chain forwards, suppose we run it backwards from x (1) for r steps, using Q, to obtain a state x (0) , say. The value of the integer r is entirely under our control here. We then run the chain forwards from x (0) for r steps, using P , and do this m−1 times independently to obtain states x (2) , . . . , x (m) that are contemporaneous with x (1) . It is clear that, if x (1) is a draw from π, then so are x (0) , x (2) , . . . , x (m) but not only this: x (1) , . . . , x (m) have an underlying joint distribution that is exchangeable. Moreover, for any choice of test statistic u = u(x), this property must be inherited by the corresponding u(1) , . . . , u(m) . Hence, if x (1) is a draw from π, the rank of u(1) among u(1) , . . . , u(m) is once again uniform and provides an exact P value,
just as for a simple Monte Carlo test. The procedure is rigorous because P values are calculated on the basis of a correct model. Note that x^(0) must be ignored and also that it is not permissible to generate separate x^(0)'s, else x^(2), . . . , x^(m), although exchangeable with each other, are no longer exchangeable with x^(1). The value of r should be large enough to provide ample scope for mobility around the state space S, so that simulations can reach more probable parts of S when the formulation is inappropriate. That is, larger values of r tend to improve the power of the test, although the P value itself is valid for any value of r, apart from dealing with ties. Note that it is not essential for validity of the exact P value that P be irreducible. However, this may lead to a loss of power in the test. Irreducibility fails or is in question for many applications to multidimensional contingency tables: the search for better algorithms is currently a hot topic in computational algebra. Finally, see [4] for sequential versions of both procedures in [3].

Example: Exact P values for the Rasch Model
Consider an r × s table of binary variables x_ij. For example, in educational testing, x_ij = 1 or 0 corresponds to the correct (1) or incorrect (0) response of candidate i to item j. See [6] for two well-known LSAT datasets, each with 1000 candidates and 5 questions. For such tables, the Rasch model [17] asserts that all responses are independent and that the odds of 1 to 0 in cell (i, j) are θ_ij : 1, with θ_ij = φ_i ψ_j, where the φ_i's and ψ_j's are unknown parameters that can be interpreted as measuring the relative aptitude of the candidates and the difficulties of the items, respectively. The probability of a table x is then
Π_{i=1}^{r} Π_{j=1}^{s} θ_ij^{x_ij} / (1 + θ_ij) = Π_i φ_i^{x_i+} Π_j ψ_j^{x_+j} / Π_i Π_j (1 + φ_i ψ_j)    (1)
and the row and column totals xi+ and x+j are sufficient statistics for the φi ’s and ψj ’s. Thus, if we condition on the row and column totals, the φi ’s and ψj ’s are eliminated and we obtain a uniform distribution π(x) on the space of tables with the same xi+ ’s and x+j ’s. However, this space is generally huge and enumeration is out of the question; nor are simple Monte Carlo tests available.
Binary tables also occur in evolutionary biology, with x_ij identifying presence or absence of species i in location j; see [10, 15], for example. Here the Rasch model accommodates differences between species and differences between locations, but departures from it can suggest competition between the species. To construct a test for the Rasch model via MCMC, we require an algorithm that maintains a uniform distribution π on the space S of binary tables x with the same row and column totals as in the data x^(1). The simplest move that preserves the margins is depicted below, where a, b = 0 or 1, and the two row indices and the two column indices are the same on the right as on the left:

· · ·  b  · · ·  a  · · ·                · · ·  a  · · ·  b  · · ·
· · ·  a  · · ·  b  · · ·      →         · · ·  b  · · ·  a  · · ·

Of course, there is no change in the configuration unless a ≠ b. It can be shown that any table in S can be reached from any other by a sequence of such switches, so that irreducibility is guaranteed. Among several possible ways in which the algorithm can proceed (see [3]), the simplest is to repeatedly choose two rows and two columns at random and to propose the corresponding swap if this is valid or retain the current table if it is not. This defines a Metropolis algorithm and, since π is uniform, all proposals are accepted.
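The following Python sketch (a hypothetical implementation, not the author's code) shows the margin-preserving swap chain just described; because the target distribution is uniform on the set of tables with fixed margins, every valid proposal is accepted.

```python
import numpy as np

def swap_chain(table, steps, rng):
    # Repeatedly pick two rows and two columns; if the 2x2 submatrix is a
    # checkerboard ([[1,0],[0,1]] or [[0,1],[1,0]]), swap it, else keep the table.
    x = table.copy()
    r, s = x.shape
    for _ in range(steps):
        i1, i2 = rng.choice(r, size=2, replace=False)
        j1, j2 = rng.choice(s, size=2, replace=False)
        sub = x[np.ix_([i1, i2], [j1, j2])]
        if sub[0, 0] == sub[1, 1] and sub[0, 1] == sub[1, 0] and sub[0, 0] != sub[0, 1]:
            x[np.ix_([i1, i2], [j1, j2])] = 1 - sub
    return x

rng = np.random.default_rng(0)
x0 = np.array([[1, 1, 0], [1, 0, 1], [0, 1, 1]])
x1 = swap_chain(x0, steps=200, rng=rng)
print(x1.sum(axis=0), x1.sum(axis=1))   # row and column margins are unchanged
```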
Closing Comments
We close with some brief additional comments on MCMC tests. As regards the test statistic u(x), the choice should reflect the main alternatives that one has in mind. For example, in educational testing, interest might center on departures from the Rasch model caused by correlation between patterns of correct or incorrect responses to certain items. Then u(x) might be a function of the coincidence matrix, whose (j, j′) element is the frequency with which candidates provide the same response to items j and j′. Corresponding statistics are easy to define and can provide powerful tools, but note that the total score in the matrix is of no use because it is fixed by the row and column totals of the data. For ecologic applications, [10] provides an interesting discussion and advocates using a statistic based on the co-occurrence matrix. Second, it is often natural to apply several different statistics to the same data: there is no particular objection to this at an exploratory stage, provided that all the results are reported. Finally, we caution against some misleading claims. Thus, the Knight's move algorithm [18] for the Rasch model is simply incorrect: see [10]. Also, some MCMC tests are referred to as 'exact' when in fact they do not apply either of the corrections described in [3] and are therefore approximations; see [9], for example.

References
[1] Agresti, A. (1996). An Introduction to Categorical Data Analysis, Wiley.
[2] Barnard, G.A. (1963). Discussion of paper by M. S. Bartlett, Journal of the Royal Statistical Society, Series B 25, 294.
[3] Besag, J.E. & Clifford, P. (1989). Generalized Monte Carlo significance tests, Biometrika 76, 633–642.
[4] Besag, J.E. & Clifford, P. (1991). Sequential Monte Carlo p-values, Biometrika 78, 301–304.
[5] Besag, J.E. & Diggle, P.J. (1977). Simple Monte Carlo tests for spatial pattern, Applied Statistics 26, 327–333.
[6] Bock, R.D. & Lieberman, M. (1970). Fitting a response model for n dichotomously scored items, Psychometrika 35, 179–197.
[7] Diggle, P.J. (1983). Statistical Analysis of Spatial Point Patterns, Academic Press, London.
[8] Dwass, M. (1957). Modified randomization tests for nonparametric hypotheses, Annals of Mathematical Statistics 28, 181–187.
[9] Forster, J.J., McDonald, J.W. & Smith, P.W.F. (2003). Markov chain Monte Carlo exact inference for binomial and multinomial logistic regression models, Statistics and Computing 13, 169–177.
[10] Gotelli, N.J. & Entsminger, G.L. (2001). Swap and fill algorithms in null model analysis: rethinking the knight's tour, Oecologia 129, 281–291.
[11] Hall, P. & Titterington, D.M. (1989). The effect of simulation order on level accuracy and power of Monte Carlo tests, Journal of the Royal Statistical Society, Series B 51, 459–467.
[12] Hastings, W.K. (1970). Monte Carlo sampling methods using Markov chains and their applications, Biometrika 57, 97–109.
[13] Jöckel, K.-H. (1986). Finite-sample properties and asymptotic efficiency of Monte Carlo tests, Annals of Statistics 14, 336–347.
[14] Manly, B.F.J. (1991). Randomization and Monte Carlo Methods in Biology, Chapman & Hall, London.
[15] Manly, B.F.J. (1995). A note on the analysis of species co-occurrences, Ecology 76, 1109–1115.
[16] Metropolis, N., Rosenbluth, A.W., Rosenbluth, M.N., Teller, A.H. & Teller, E. (1953). Equations of state calculations by fast computing machines, Journal of Chemical Physics 21, 1087–1092.
[17] Rasch, G. (1960). Probabilistic Models for Some Intelligence and Attainment Tests, Danish Educational Research Institute, Copenhagen.
[18] Sanderson, J.G., Moulton, M.P. & Selfridge, R.G. (1998). Null matrices and the analysis of species co-occurrences, Oecologia 116, 275–283.
JULIAN BESAG
Monte Carlo Simulation
JAMES E. GENTLE
Volume 3, pp. 1264–1271 in Encyclopedia of Statistics in Behavioral Science. ISBN-13: 978-0-470-86080-9; ISBN-10: 0-470-86080-4. Editors: Brian S. Everitt & David C. Howell. John Wiley & Sons, Ltd, Chichester, 2005.
Monte Carlo Simulation

Introduction
Monte Carlo methods use random processes to estimate mathematical or physical quantities, to study distributions of random variables, to study and compare statistical procedures, and to study the behavior of complex systems. Monte Carlo methods had been used occasionally by statisticians for many years, but with the development of high-speed computers, Monte Carlo methods became viable alternatives to theoretical and experimental methods in studying complicated physical processes. The random samples used in a Monte Carlo method are generated on the computer, and are more properly called 'pseudorandom numbers'. An early example of how a random process could be used to evaluate a fixed mathematical quantity is the Buffon needle problem. The French naturalist Comte de Buffon showed that the probability that a needle of length l thrown randomly onto a grid of parallel lines with distance d (≥ l) apart intersects a line is 2l/(πd). The value of π can, therefore, be estimated by tossing a needle onto a lined grid many times and counting the number of times the needle crosses one of the lines. (See [2], pp. 274–275, for discussion of the problem and variations on the method.) A key element of the Buffon needle problem is that there is no intrinsic random element; randomness is introduced to study the deterministic problem of evaluating a mathematical constant. The idea of simulating a random process to study its distributional properties is so basic and straightforward that these methods were used in very early studies of probability distributions. An early documented use of a Monte Carlo method was by the American statistician Erastus Lyman De Forest in 1876 in a study of smoothing a time series (see [4]). Another important early use of Monte Carlo was by 'Student' (see Gosset, William Sealy) in studying the distributions of the correlation coefficient and of the t statistic (see Catalogue of Probability Density Functions). Student used actual biometric data to simulate realizations of normally distributed random variables.
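As an illustration (not part of the original entry), the following Python sketch simulates the Buffon needle experiment; note that sampling the needle's angle already uses the value of π, so this is only a demonstration of the idea.

```python
import numpy as np

# Buffon's needle with l <= d: P(cross) = 2*l/(pi*d), so pi ~= 2*l/(p_hat*d).
rng = np.random.default_rng(0)
l, d, n = 1.0, 1.0, 1_000_000

center = rng.uniform(0, d / 2, size=n)        # distance from needle center to nearest line
angle = rng.uniform(0, np.pi / 2, size=n)     # acute angle between needle and the lines
crosses = center <= (l / 2) * np.sin(angle)

p_hat = crosses.mean()
print('estimate of pi:', 2 * l / (p_hat * d))
```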
Monte Carlo Evaluation of an Integral
In its simplest form, Monte Carlo simulation is the evaluation of a definite integral
θ = ∫_D f(x) dx    (1)
by identifying a random variable Y with support on D and density p(y) and a function g such that the expected value of g(Y) is θ:
E(g(Y)) = ∫_D g(y) p(y) dy = ∫_D f(y) dy = θ.    (2)
In the simplest case, D is the interval [a, b], and Y is taken to be a random variable with a uniform density over [a, b]; that is, p(y) in (2) is the constant uniform density. In this case,
θ = (b − a) E(f(Y)).    (3)
The problem of evaluating the integral becomes the familiar statistical problem of estimating a mean, E(f(Y)). From a sample of size m, a good estimate of θ is the sample mean,
θ̂ = (b − a) Σ_{i=1}^{m} f(y_i) / m,    (4)
where the y_i are values of a random sample from a uniform distribution over (a, b). The estimate is unbiased (see Estimation):
E(θ̂) = (b − a) Σ_i E(f(Y_i)) / m = (b − a) E(f(Y)) = ∫_a^b f(x) dx.
The variance is
V(θ̂) = (b − a)² Σ_i V(f(Y_i)) / m²    (5)
     = (b − a)² V(f(Y)) / m
     = ((b − a)/m) ∫_a^b ( f(x) − (1/(b − a)) ∫_a^b f(t) dt )² dx.    (6)
The integral in (6) is a measure of the roughness of the function.
Consider again the problem of evaluation of the integral in (1) that has been rewritten as in (2). Now suppose that we can generate m random variates y_i from the distribution with density p. Then our estimate of θ is just
θ̂ = Σ_{i=1}^{m} g(y_i) / m.    (7)
Compare this estimator with the estimator in (4). The use of a probability density as a weighting function allows us to apply the Monte Carlo method to improper integrals (that is, integrals with infinite ranges of integration). The first thing to note, therefore, is that the estimator (7) applies to integrals over general domains, while the estimator (4) applies only to integrals over finite intervals. Another important difference is that the variance of the estimator in (7) is likely to be smaller than that of the estimator in (4). The square root of the variance (that is, the standard deviation of the estimator) is a good measure of the range within which different realizations of the estimator of the integral may fall. Under certain assumptions, using the standard deviation of the estimator, we can define statistical “confidence intervals” for the true value of the integral θ. Loosely speaking, a confidence interval is an interval about an estimator θ that in repeated sampling would include the true value θ a specified portion of the time. (The specified portion is the “level” of the confidence interval and is often chosen to be 90% or 95%. Obviously, all other things being equal, the higher the level of confidence, the wider the interval must be.) Because of the dependence of the confidence interval on the standard deviation, the standard deviation is sometimes called a “probabilistic error bound”. The word “bound” is misused here, of course, but in any event, the standard deviation does provide some measure of a sampling “error”. From (6), we note that the order of error in terms of the Monte Carlo sample size is O(m−1/2 ). This results in the usual diminished returns of ordinary
statistical estimators; to halve the error, the sample size must be quadrupled. An important property of the standard deviation of a Monte Carlo estimate of a definite integral is that the order in terms of the number of function evaluations is independent of the dimensionality of the integral. On the other hand, the usual error bounds for numerical quadrature are O(m−2/d ), where d is the dimensionality. For one or two dimensions, it is generally better to use one of the standard methods of numerical quadrature, such as Newton–Cotes methods, extrapolation or Romberg methods, and Gaussian quadrature, rather than Monte Carlo quadrature.
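As a hedged illustration (not from the entry), the following Python sketch applies the crude estimator (4) to a simple one-dimensional integral and reports the estimate with its estimated standard deviation in the "value (sd)" style recommended in the next section.

```python
import numpy as np

# Crude Monte Carlo for theta = int_1^3 exp(-x) dx = exp(-1) - exp(-3).
rng = np.random.default_rng(0)
a, b, m = 1.0, 3.0, 100_000

y = rng.uniform(a, b, size=m)
vals = np.exp(-y)
theta_hat = (b - a) * vals.mean()
se = (b - a) * vals.std(ddof=1) / np.sqrt(m)

exact = np.exp(-a) - np.exp(-b)
print(f'{theta_hat:.4f} ({se:.4f})   exact: {exact:.4f}')
```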
Experimental Error in Monte Carlo Methods
Monte Carlo methods are sampling methods; therefore, the estimates that result from Monte Carlo procedures have associated sampling errors. The fact that the estimate is not equal to its expected value (assuming that the estimator is unbiased) is not an "error" or a "mistake"; it is just a result of the variance of the random (or pseudorandom) data. Monte Carlo methods are experiments using random data. The variability of the random data results in experimental error, just as in other scientific experiments in which randomness is a recognized component. As in any statistical estimation problem, an estimate should be accompanied by an estimate of its variance. The estimate of the variance of the estimator of interest is usually just the sample variance of computed values of the estimator of interest. Following standard practice, we could use the square root of the variance (that is, the standard deviation) of the Monte Carlo estimator to form an approximate confidence interval for the integral being estimated. In reporting numerical results from Monte Carlo simulations, it is mandatory to give some statement of the level of the experimental error. An effective way of doing this is by giving the sample standard deviation. When a number of results are reported, and the standard deviations vary from one to the other, a good way of presenting the results is to write the standard deviation in parentheses beside the result itself, for example, 3.147 (0.0051).
Notice that if the standard deviation is of order 10^{-3}, the precision of the main result is not greater than 10^{-3}. Just because the computations are done at a higher precision is no reason to write the number as if it had more significant digits.
Variance of Monte Carlo Estimators
The variance of a Monte Carlo estimator has important uses in assessing the quality of the estimate of the integral. The expression for the variance, as in (6), is likely to be very complicated and to contain terms that are unknown. We therefore need methods for estimating the variance of the Monte Carlo estimator. A Monte Carlo estimate usually has the form of the estimator of θ in (4):
θ̂ = c Σ_{i=1}^{m} f_i / m.    (8)
The variance of the estimator has the form of (6); that is, it is a multiple of the integral of the squared deviation of f from its average over the domain of integration. An estimator of the variance is
V̂(θ̂) = c² Σ_{i=1}^{m} (f_i − f̄)² / (m − 1).    (9)
This estimator is appropriate only if the elements of the set of random variables {F_i}, on which we have observations {f_i}, are (assumed to be) independent and thus have zero correlations. Our discussion of variance in Monte Carlo methods that are based on pseudorandom numbers follows the pretense that the numbers are realizations of random variables, and the main concern in pseudorandom number generation is the simulation of a sequence of i.i.d. random variables. In quasirandom number generation, the attempt is to get a sample that is spread out over the sample space more evenly than could be expected from a random sample. Monte Carlo methods based on quasirandom numbers, or "quasi-Monte Carlo" methods, do not admit discussion of variance in the technical sense.

Variance Reduction
An objective in sampling is to reduce the variance of the estimators while preserving other good qualities, such as unbiasedness. Variance reduction results in statistically efficient estimators. The emphasis on efficient Monte Carlo sampling goes back to the early days of digital computing, but the issues are just as important today (or tomorrow) because, presumably, we are solving bigger problems. The general techniques used in statistical sampling apply to Monte Carlo sampling, and there is a mature theory for sampling designs that yield efficient estimators. Except for straightforward analytic reduction, discussed in the next section, techniques for reducing the variance of a Monte Carlo estimator are called "swindles" (especially if they are thought to be particularly clever). The common thread in variance reduction is to use additional information about the problem in order to reduce the effect of random sampling on the variance of the observations. This is one of the fundamental principles of all statistical design.

Analytic Reduction
The first principle in estimation is to use any known quantity to improve the estimate. For example, suppose that the problem is to evaluate the integral
θ = ∫_D f(x) dx    (10)
by Monte Carlo methods. Now, suppose that D_1 and D_2 are such that D_1 ∪ D_2 = D and D_1 ∩ D_2 = ∅, and consider the representation of the integral
θ = ∫_{D_1} f(x) dx + ∫_{D_2} f(x) dx    (11)
= θ1 + θ2 . Now, suppose that a part of this decomposition of the original problem is known (that is, suppose that we know θ1 ). It is very likely that it would be better to use Monte Carlo methods only to estimate θ2 and take as our estimate of θ the sum of the known θ1 and the estimated value of θ2 . This seems intuitively obvious, and it is generally true unless there is some relationship between f (x1 ) and f (x2 ), where x1 is in D1 and x2 is in D2 . If there is some known relationship, however, it may be possible to improve
the estimate θ̂_2 of θ_2 by using a transformation of the same random numbers used for θ̂_1 to estimate θ_1. For example, if θ̂_1 is larger than the known value of θ_1, the proportionality of the overestimate, (θ̂_1 − θ_1)/θ_1, may be used to adjust θ̂_2. This is the same principle as ratio or regression estimation in ordinary sampling theory.
Stratified Sampling and Importance Sampling
In stratified sampling (see Stratification), certain proportions of the total sample are taken from specified regions (or "strata") of the sample space. The objective in stratified sampling may be to ensure that all regions are covered. Another objective is to reduce the overall variance of the estimator by sampling more heavily where the function is rough; that is, where the values f(x_i) are likely to exhibit a lot of variability. Stratified sampling is usually performed by forming distinct subregions with different importance functions in each. This is the same idea as in analytic reduction except that Monte Carlo sampling is used in each region. Stratified sampling is based on exactly the same principle in sampling methods in which the allocation is proportional to the variance (see [3]). In some of the literature on Monte Carlo methods, stratified sampling is called "geometric splitting".
In importance sampling, just as may be the case in stratified sampling, regions corresponding to large values of the integrand are sampled more heavily. In importance sampling, however, instead of a finite number of regions, we allow the relative sampling density to change continuously. This is accomplished by careful choice of p in the decomposition implied by (2). We have
θ = ∫_D f(x) dx = ∫_D ( f(x)/p(x) ) p(x) dx,    (12)
where p(x) is a probability density over D. The density p(x) is called the importance function. Stratified sampling can be thought of as importance sampling in which the importance function is composed of a mixture of densities. In some of the literature on Monte Carlo methods, stratified sampling and importance sampling are said to use "weight windows".
From a sample of size m from the distribution with density p, we have the estimator
θ̂ = (1/m) Σ_i f(x_i)/p(x_i).    (13)
Generating the random variates from the distribution with density p weights the sampling into regions of higher probability with respect to p. By judicious choice of p, we can reduce the variance of the estimator. The variance of the estimator is
V(θ̂) = (1/m) V( f(X)/p(X) ),    (14)
where the variance is taken with respect to the distribution of the random variable X with density p(x). Now,
V( f(X)/p(X) ) = E( f²(X)/p²(X) ) − [ E( f(X)/p(X) ) ]².    (15)
The objective in importance sampling is to choose p so that this variance is minimized. Because
[ E( f(X)/p(X) ) ]² = ( ∫_D f(x) dx )²,    (16)
the choice involves only the first term in the expression for the variance. By Jensen's inequality, we have a lower bound on that term:
E( f²(X)/p²(X) ) ≥ [ E( |f(X)|/p(X) ) ]² = ( ∫_D |f(x)| dx )².    (17)
That bound is obviously achieved when
p(x) = |f(x)| / ∫_D |f(x)| dx.    (18)
Of course, if we knew ∫_D |f(x)| dx, we would probably know ∫_D f(x) dx and would not even be considering the Monte Carlo procedure to estimate the integral. In practice, for importance sampling we would seek a probability density p that is nearly proportional to |f|; that is, such that |f(x)|/p(x) is nearly constant.
The problem of choosing an importance function is very similar to the problem of choosing a majorizing function for the acceptance/rejection method. Selection of an importance function involves the principles of function approximation with the added constraint that the approximating function be a probability density from which it is easy to generate random variates.
Let us now consider another way of developing the estimator (13). Let h(x) = f(x)/p(x) (where p(x) is positive; otherwise, let h(x) = 0) and generate y_1, . . . , y_m from a density g(y) with support D. Compute importance weights,
w_i = p(y_i)/g(y_i),    (19)
and form the estimate of the integral as
θ̂ = Σ_i w_i h(y_i) / Σ_i w_i.    (20)
In this form of the estimator, g(y) is a trial density, just as in the acceptance/rejection methods. This form of the estimator has similarities to weighted resampling. By the same reasoning as above, we see that the trial density should be "close" to f; that is, optimally, g(x) = c|f(x)| for some constant c. Although the variance of the estimator in (13) and (20) may appear rather simple, the term E((f(X)/p(X))²) could be quite large if p (or g) becomes small at some point where f is large. Of course, the objective in importance sampling is precisely to prevent that, but if the functions are not well-understood, it may happen. An element of the Monte Carlo sample at a point where p is small and f is large has an unduly large influence on the overall estimate. Because of this kind of possibility, importance sampling must be used with some care. (See [2], chapter 7, for further discussion of the method.)
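As an illustration (not the author's code), the following Python sketch applies the importance-sampling estimator (13) to an integral over an unbounded domain, where the crude estimator (4) does not apply; the choice of integrand and importance density is hypothetical.

```python
import numpy as np

# theta = int_0^inf x * exp(-x^2/2) dx = 1, estimated by drawing from the
# exponential importance density p(x) = exp(-x), x > 0.
rng = np.random.default_rng(0)
m = 200_000

x = rng.exponential(scale=1.0, size=m)      # draws from p
f = x * np.exp(-x**2 / 2)                   # integrand at the draws
p = np.exp(-x)                              # importance density at the draws

ratios = f / p
theta_hat = ratios.mean()
se = ratios.std(ddof=1) / np.sqrt(m)
print(f'{theta_hat:.4f} ({se:.4f})   exact: 1.0')
```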
Use of Covariates
Another way of reducing the variance, just as in ordinary sampling, is to use covariates. Any variable that is correlated with the variable of interest has potential value in reducing the variance of the estimator. Such a variable is useful if it is easy to generate and if it has properties that are known or that can be computed easily. In the general case in Monte Carlo sampling, covariates are called control variates. Two special cases are called antithetic variates and common variates. We first describe the general case, and then the two special cases. We then relate the use of covariates to the statistical method sometimes called "Rao-Blackwellization".
Control Variates. Suppose that Y is a random variable, and the Monte Carlo method involves estimation of E(Y). Suppose that X is a random variable with known expectation, E(X), and consider the random variable
Ỹ = Y − b(X − E(X)).    (21)
The expectation of Ỹ is the same as that of Y, and its variance is
V(Ỹ) = V(Y) − 2b Cov(Y, X) + b² V(X).    (22)
For reducing the variance, the optimal value of b is Cov(Y, X)/V(X). With this choice, V(Ỹ) < V(Y) as long as Cov(Y, X) ≠ 0. Even if Cov(Y, X) is not known, there is a b that depends only on the sign of Cov(Y, X) for which the variance of Ỹ is less than the variance of Y. The variable X is called a control variate. This method has long been used in survey sampling, where Ỹ in (21) is called a regression estimator. Use of these facts in Monte Carlo methods requires identification of a control variable X that can be simulated simultaneously with Y. If the properties of X are not known but can be estimated (by Monte Carlo methods), the use of X as a control variate can still reduce the variance of the estimator. These ideas can obviously be extended to more than one control variate:
Ỹ = Y − b_1(X_1 − E(X_1)) − · · · − b_k(X_k − E(X_k)).    (23)
The optimal values of the b_i's depend on the full variance-covariance matrix. The usual regression estimates for the coefficients can be used if the variance-covariance matrix is not known. Identification of appropriate control variates often requires some ingenuity, although in some special cases, there may be techniques that are almost always applicable.
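A minimal sketch (not from the entry) of a control variate as in (21): the known mean of a uniform variate is used to reduce the variance when estimating E(exp(U)); the example and constants are illustrative only.

```python
import numpy as np

# Estimate E(Y) with Y = exp(U), U ~ Uniform(0,1), using X = U (mean 1/2 known).
rng = np.random.default_rng(0)
m = 100_000
u = rng.uniform(size=m)
y = np.exp(u)

b = np.cov(y, u)[0, 1] / u.var()        # estimated optimal b = Cov(Y, X) / V(X)
y_tilde = y - b * (u - 0.5)             # control-variate adjustment as in (21)

print('crude          :', y.mean(), ' per-observation variance:', y.var())
print('control variate:', y_tilde.mean(), ' per-observation variance:', y_tilde.var())
# both estimate e - 1 = 1.71828...; the adjusted version has much smaller variance
```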
Antithetic Variates. Again consider the problem of estimating the integral
θ = ∫_a^b f(x) dx    (24)
by Monte Carlo methods. The standard crude Monte Carlo estimator, (4), is (b − a) Σ_i f(x_i)/n, where the x_i are uniform over (a, b). It would seem intuitively plausible that our estimate would be subject to less sampling variability if, for each x_i, we used its "mirror"
x̃_i = a + (b − x_i).    (25)
This mirror value is called an antithetic variate, and use of antithetic variates can be effective in reducing the variance of the Monte Carlo estimate, especially if the integrand is nearly uniform. For a sample of size n, the estimator is
θ̂ = ((b − a)/n) Σ_{i=1}^{n/2} ( f(x_i) + f(x̃_i) ).
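As an illustration (not the author's code), the following sketch compares the crude and antithetic estimators for a simple, hypothetical integral on (0, 1).

```python
import numpy as np

# theta = int_0^1 exp(x) dx = e - 1, by crude and by antithetic sampling.
rng = np.random.default_rng(0)
n = 100_000

x = rng.uniform(size=n // 2)
crude = np.exp(rng.uniform(size=n)).mean()                 # n independent draws
antithetic = 0.5 * (np.exp(x) + np.exp(1.0 - x)).mean()    # n/2 draws plus their mirrors

print('crude     :', crude)
print('antithetic:', antithetic, '(exact:', np.e - 1, ')')
```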
The variance of the sum is the sum of the variances plus twice the covariance. Antithetic variates have negative covariances, thus reducing the variance of the sum. Antithetic variates from distributions other than the uniform can also be formed. The linear transformation that works for uniform antithetic variates cannot be used, however. A simple way of obtaining negatively correlated variates from other distributions is just to use antithetic uniforms in the inverse CDF. If the variates are generated using acceptance/rejection, antithetic variates can be used in the majorizing distribution. Common Variates. Often, in Monte Carlo simulation, the objective is to estimate the differences in parameters of two random processes. The two parameters are likely to be positively correlated. If that is the case, then the variance in the individual differences is likely to be smaller than the variance of the difference of the overall estimates. Suppose, for example, that we have two statistics, T and S, that are unbiased estimators of some parameter of a given distribution. We would like to know the difference in the variances of these estimators, V(T ) − V(S) (because the one with the smaller variance is better). We assume that each statistic is a function of a
random sample, {x_1, . . . , x_n}. A Monte Carlo estimate of the variance of the statistic T for a sample of size n is obtained by generating m samples of size n from the given distribution, computing T_i for the ith sample, and then computing
V̂(T) = Σ_{i=1}^{m} (T_i − T̄)² / (m − 1).    (26)
Rather than doing this for T and S separately, using the unbiasedness, we could first observe
V(T) − V(S) = E(T²) − E(S²) = E(T² − S²)    (27)
and hence estimate the latter quantity. Because the estimators are likely to be positively correlated, the variance of the Monte Carlo estimator of E(T² − S²) is likely to be smaller than the variance of V̂(T) − V̂(S). If we compute T² − S² from each sample (that is, if we use common variates), we are likely to have a more precise estimate of the difference in the variances of the two estimators, T and S.
Rao-Blackwellization. As in the discussion of control variates above, suppose that we have two random variables Y and X and we want to estimate E(f(Y, X)) with an estimator of the form T = Σ_i f(Y_i, X_i)/m. Now suppose that we can evaluate E(f(Y, X) | X = x). (This is similar to what is done in using (21) above.) Now, E(E(f(Y, X) | X = x)) = E(f(Y, X)), so the estimator
T̃ = Σ_{i=1}^{m} E(f(Y_i, X) | X = x_i) / m    (28)
has the same expectation as T. However, we have
V(f(Y, X)) = V(E(f(Y, X) | X = x)) + E(V(f(Y, X) | X = x));    (29)
that is,
V(f(Y, X)) ≥ V(E(f(Y, X) | X = x)).    (30)
Therefore, T̃ is preferable to T because it has the same expectation but no larger variance. (The function f may depend on Y only. In that case, if Y and X are independent, we can gain nothing.) The principle of minimum variance unbiased estimation leads us to consider statistics such as T conditioned on other statistics. The Rao-Blackwell
Theorem (see any text on mathematical statistics) tells us that if a sufficient statistic exists, the greatest improvement in variance while still requiring unbiasedness occurs when the conditioning is done with respect to a sufficient statistic. This process of conditioning a given estimator on another statistic is called Rao-Blackwellization. (This name is often used even if the conditioning statistic is not sufficient.)
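As a hedged illustration of the common variates idea described earlier in this section (not part of the entry), the following sketch estimates V(T) − V(S) for the sample mean and sample median of normal samples, once with common simulated samples and once with independent runs; the sizes and distributions are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 25, 5000

samples = rng.normal(size=(m, n))
t = samples.mean(axis=1)                 # T: sample mean (unbiased for 0)
s = np.median(samples, axis=1)           # S: sample median (unbiased for 0)

# common variates: T_i^2 - S_i^2 from the same sample, as in (27)
diff_common = (t**2 - s**2).mean()

# contrast: independent simulation runs for the two statistics
t2 = rng.normal(size=(m, n)).mean(axis=1)
s2 = np.median(rng.normal(size=(m, n)), axis=1)
diff_indep = (t2**2).mean() - (s2**2).mean()

print('V(T) - V(S), common variates :', diff_common)
print('V(T) - V(S), independent runs:', diff_indep)
print('theoretical V(T) = 1/n       :', 1 / n)
```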
Applications of Monte Carlo Simulation
Monte Carlo simulation is widely used in many fields of science and business. In the physical sciences, Monte Carlo methods were first employed on a major scale in the 1940s, and their use continues to grow. System simulation has been an important methodology in operations research since the 1960s. In more recent years, simulation of financial processes has become an important tool in the investments industry. Monte Carlo simulation has two distinct applications in statistics. One is in the study of statistical methods, and the other is as a part of a statistical method for analysis of data.
Monte Carlo Studies of Statistical Methods. The performance of a statistical method, such as a t Test, for example, depends, among other things, on the underlying distribution of the sample to which it is applied. For simple distributions and for simple statistical procedures, it may be possible to work out analytically such things as the power of a test or the exact distribution of a test statistic or estimator. In more complicated situations, however, these properties cannot be derived analytically. The properties can be studied by Monte Carlo, however. The procedure is simple: we merely simulate on the computer many samples from the assumed underlying distribution, compute the statistic of interest from each sample, and use the sample of computed statistics to assess its sampling distribution. There is a wealth of literature and software for the generation of the random numbers required in the first step of this process (see, for example, [2]). Monte Carlo simulation to study statistical methods is most often employed in the comparison of methods, frequently in the context of robustness studies. It is relatively easy to compare the relative performance of, say, a t Test with a sign test under a
wide range of scenarios by generating multiple samples under each scenario and evaluating the t Test and the sign test for each sample in the given scenario. This application of Monte Carlo simulation is so useful that it is employed by a large proportion of the research articles published in statistics. (In the 2002 volume of the Journal of the American Statistical Association, for example, more than 80% of the articles included Monte Carlo (see Bayesian Statistics) studies of the performance of the statistical methods.) Monte Carlo Methods in Data Analysis. In computational statistics, Monte Carlo methods are used as part of the overall methodology of data analysis. Examples include Monte Carlo tests, Monte Carlo bootstrapping, and Markov chain Monte Carlo for the evaluation of Bayesian posterior distributions. A Monte Carlo test of an hypothesis, just as any statistical hypothesis test, uses a random sample of observed data. As with any statistical test, a test statistic is computed from the observed sample. In the usual statistical tests, the computed test statistic is compared with the quantiles of the distribution of the test statistic under the null hypothesis, and the null hypothesis is rejected if the computed value is deemed sufficiently extreme. Often, however, we may not know the distribution of the test statistic under the null hypothesis. In this case, if we can simulate random samples from a distribution specified by the null hypothesis, we can simulate the distribution of the test statistic by generating many such samples, and computing the test statistic for each one. We then compare the observed value of the test statistic with the ones computed from the simulated random samples. Just as in the standard methods of statistical hypothesis testing, we reject the null hypothesis if the observed value of the test statistic is deemed sufficiently extreme. This is called a Monte Carlo test. Somewhat surprisingly, a Monte Carlo test needs only a fairly small number of random samples to be relatively precise. In most cases, only 100 or so samples are adequate. See chapters 2 through 4 of [1] for further discussion of Monte Carlo tests and other Monte Carlo methods in statistical data analysis.
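As an illustration of the kind of Monte Carlo comparison described above (not from the entry), the following sketch estimates the power of a t test and of a sign test under one hypothetical scenario; scipy is assumed to be available, and the distribution, sample size, and shift are illustrative only.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, m, shift, alpha = 30, 2000, 0.4, 0.05

reject_t = reject_sign = 0
for _ in range(m):
    x = rng.standard_t(df=3, size=n) + shift        # heavy-tailed, shifted population
    reject_t += stats.ttest_1samp(x, 0.0).pvalue < alpha
    k = int(np.sum(x > 0))
    reject_sign += stats.binomtest(k, n, 0.5).pvalue < alpha

print('estimated power, t test   :', reject_t / m)
print('estimated power, sign test:', reject_sign / m)
```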
References
[1] Gentle, J.E. (2002). Elements of Computational Statistics, Springer, New York.
[2] Gentle, J.E. (2003). Random Number Generation and Monte Carlo Methods, 2nd Edition, Springer, New York.
[3] Särndal, C.-E., Swensson, B. & Wretman, J. (1992). Model Assisted Survey Sampling, Springer-Verlag, New York.
[4] Stigler, S.M. (1978). Mathematical statistics in the early states, Annals of Statistics 6, 239–265.
(See also Randomization Based Tests) JAMES E. GENTLE
Multidimensional Item Response Theory Models
TERRY A. ACKERMAN
Volume 3, pp. 1272–1280 in Encyclopedia of Statistics in Behavioral Science. ISBN-13: 978-0-470-86080-9; ISBN-10: 0-470-86080-4. Editors: Brian S. Everitt & David C. Howell. John Wiley & Sons, Ltd, Chichester, 2005.
Multidimensional Item Response Theory Models
Many educational and psychological tests are inherently multidimensional, meaning these tests measure two or more dimensions or constructs [27]. A construct is a theoretical representation of the underlying trait, concept, attribute, processes, and/or structure that the test is designed to measure [16]. Tests that are composed of items each measuring the same construct, or same composite of multiple constructs, are considered to be unidimensional. If, however, different items are measuring different constructs, or different composites of multiple constructs, the test can be considered to be multidimensional. It is important to distinguish between construct-irrelevant or invalid traits that are being measured versus those traits that are valid and replicable [5]. If a test is unidimensional, then it is appropriate to report examinee performance on the test as a single score. If a test is multidimensional, then reporting examinee results is more problematic. In some cases, if the skills are distinct, a profile of scores may be most appropriate. If the items are measuring similar composites of skills, then a single score may suffice. Problems of how to report results can easily arise. For example, consider a test in which easy items measure skill A and difficult items measure skill B. If the results of this test are reported on a single score scale, comparing results could be impossible because low scores represent differences in skill A and high scores represent differences in skill B. Response data represent the interaction between a group of examinees and a set of items. Surprisingly, an assessment can be either unidimensional or multidimensional depending on the set of skills inherent in a particular group of examinees who take the test. Consider the following two items:
Item 1: If 500 − 4X = 138, then X = ?
Item 2: Janelle went to the store and bought four pieces of candy. She gave the clerk $5.00 and received $1.38 back. How much did one piece of candy cost?
Item 1 requires the examinee to use algebra to solve the linear equation. Item 2, a ‘story problem’, requires the examinee to read the scenario, translate the text to an algebraic expression, and then solve. If examinees vary in the range of reading skill required to read and translate the text in item 2, then the item has the potential to measure a composite of both reading and algebra. If all the examinees vary in reading skill, but in a range beyond the level of reading required by item 2, then this item will most likely distinguish only between levels of examinee proficiency in algebra. One item is always unidimensional. However, two or more items, each measuring a different composite of, say, algebra and reading, have the potential to yield multidimensional data. Ironically, the same items administered to one group of examinees may result in unidimensional data, yet when given to another group of examinees, may yield multidimensional data.
Determining if Data are Multidimensional
The first step in any multidimensional item response theory (MIRT) analysis (see Item Response Theory (IRT) Models for Dichotomous Data) is to determine whether the data are indeed multidimensional. Dimensionality and MIRT analyses should be supported by, and perhaps even preempted with, substantive judgment. A thorough analysis of the knowledge and skills needed to successfully respond to each item should be conducted. It might be helpful to conduct this substantive analysis by referring to the test specifications and the opinions of experts who have extensive knowledge of the content and of the examinees' cognitive skills. If subsets of items measure different content knowledge and/or cognitive skills, then these items have the potential to represent distinct dimensions if given to a group of examinees that vary on these skills. Many empirical methods have been proposed to investigate the dimensionality of test data (e.g., [10, 11]). These quantitative methods range from linear factor analysis to several nonparametric methods [9, 12, 17, 18, 25, 32]. Unfortunately, most of the dimensionality tools available to practitioners are exploratory in nature, and many of these tools produce results that contradict substantive dimensionality hypotheses [10].
Factor analysis is a data reduction technique that uses the inter-item correlations to distinguish a small set of underlying skills or factors. A factor can be substantively identified by noting which items load highly on the factor. Additionally, scree plots of eigenvalues can be used to determine if more than one factor is necessary to account for the total variance observed in the test scores. Unfortunately, there is no one method that psychometricians can agree upon to determine how large eigenvalues have to be to indicate a set of test data are indeed multidimensional. A nonparametric approach that some researchers have found useful is hierarchical cluster analysis [21]. This approach uses proximity matrices (see Proximity Measures) and clustering rules to form homogeneous groups of items. This is an iterative procedure. For an n-item test, the procedure will form n-clusters, then n − 1 cluster, and so on until all the items are combined into a single cluster. Researchers often examine the two- or three-cluster solutions to determine if these solutions define identifiable traits. One drawback with this approach is that because the solution for each successive iteration of the algorithm is dependent on the previous solution, the solution attained at one or more levels may not be optimal. A relatively new approach is a procedure called DETECT [32]. DETECT is an exploratory nonparametric dimensionality assessment procedure that estimates the number of dominant dimensions present in a data set and the magnitude of the departure from unidimensionality using a genetic algorithm. DETECT also identifies the dominant dimension measured by each item. This procedure produces mutually exclusive, dimensionally homogeneous clusters of items. Perhaps one of the more promising nonparametric approaches for determining if two groups of items are dimensionally distinct is the program DIMTEST based on the work of Stout [24]. Hypotheses about multiple dimensions that are formulated using substantive analysis, factor analysis, cluster analysis, or DETECT can be tested using DIMTEST. This program provides a statistical test of significance, which can verify that test data are not unidimensional. Once it has been determined that the test data are multidimensional, practitioners need to determine which multidimensional model best describes the response process for predicting correct response on the individual items.
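As an illustration only (not part of the entry, and not the DETECT or DIMTEST software), the following Python sketch simulates two-dimensional compensatory response data and inspects the eigenvalues of the inter-item correlation matrix, the scree idea mentioned above; all parameter values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_persons, n_items = 2000, 20

theta = rng.normal(size=(n_persons, 2))        # two latent traits
a = np.zeros((n_items, 2))
a[:10, 0], a[10:, 1] = 1.2, 1.2                # first 10 items load on trait 1, rest on trait 2
d = rng.uniform(-1, 1, size=n_items)

logits = theta @ a.T + d                       # compensatory two-dimensional model
p = 1 / (1 + np.exp(-logits))
x = (rng.uniform(size=p.shape) < p).astype(int)

eigvals = np.linalg.eigvalsh(np.corrcoef(x, rowvar=False))[::-1]
print(np.round(eigvals[:5], 2))                # two dominant eigenvalues suggest two dimensions
```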
MIRT Models

Reckase [19] defined the probability of a correct response for subject j on item i under a compensatory model as

P_C(u_{ij} = 1 \mid \theta_{j1}, \theta_{j2}) = \frac{1}{1 + e^{-(a_{i1}\theta_{j1} + a_{i2}\theta_{j2} + d_i)}}, \qquad (1)
where uij is the dichotomous score (0 = wrong, 1 = correct), θj1 and θj2 are the two ability parameters for dimensions 1 and 2, ai1 and ai2 are the item discrimination parameters for dimensions 1 and 2, and di is the scalar difficulty parameter. Even though there is a discrimination parameter for each dimension, there is only one difficulty parameter; that is, difficulty parameters for each dimension are indeterminate. Reckase called the model M2PL because it is a two-dimensional version of the two-parameter (discrimination and difficulty) unidimensional item response theory (IRT) model. In a two-dimensional latent ability space, the ai vector designates the θ1-θ2 combination that is being best measured (i.e., the composite for which the item can optimally discriminate). If a1 = a2, both dimensions are measured equally well. If a1 = 0 and a2 > 0, discrimination occurs only along θ2. Graphically, the M2PL model probability of correct response for an item forms an item response surface, as opposed to the unidimensional item characteristic curve. Four different perspectives of this surface for an item with a1 = 1.0, a2 = 1.0, and d = 0.20 are illustrated in Figure 1. The M2PL model is denoted as a compensatory model because of the addition of terms in the logit. This feature makes it possible for an examinee with low ability on one dimension to compensate with higher ability on the other dimension. Figure 2 illustrates the equiprobability contours for the response surface of the item shown in Figure 1. For the compensatory model, the contours are equally spaced and parallel across the response surface. The contour lines become closer together as the slope of the response surface becomes steeper, that is, more discriminating. Notice that Examinee A (high θ1, low θ2) has the same probability of correct response as Examinee B (high θ2, low θ1).
Figure 1 Four perspectives of a compensatory response surface (item characteristic surface with a1 = 1.00, a2 = 1.00, d = 0.20)

Figure 2 The contour plot of a compensatory item response surface with equal discrimination parameters (a1 = 1.00, a2 = 1.00, d = 0.30). Examinees with opposite ability profiles have the same probability of correct answer (i.e., compensation).
Note, however, that the degree of compensation is greatest when a1 = a2. Obviously, the more a1 and a2 differ (e.g., a1 = 0 and a2 > 0, or a2 = 0 and a1 > 0), the less compensation can occur. That is, for integration of skills to be possible, an item must require the use of both skills.
Another multidimensional model that researchers have investigated is the noncompensatory model proposed by Sympson [26]. This model (also known as the partial compensatory model) expresses the probability of a correct response for subject j on item i as

P(u_{ij} = 1 \mid \theta_{j1}, \theta_{j2}) = \frac{1}{1 + e^{-a_{i1}(\theta_{j1} - b_{i1})}} \times \frac{1}{1 + e^{-a_{i2}(\theta_{j2} - b_{i2})}}, \qquad (2)

where uij is the dichotomous score (0 = wrong, 1 = correct), θj1 and θj2 are the two ability parameters for dimensions 1 and 2, ai1 and ai2 are the item discrimination parameters for dimensions 1 and 2, and bi1 and bi2 are the item difficulty parameters for dimensions 1 and 2. Because this model is essentially the product of two 2PL unidimensional IRT models, the overall probability of a correct response is bounded above by the smaller of the two component probabilities. A graph of the item characteristic surface for an item with parameters a1 = 1.0, a2 = 1.0, b1 = 0.0, and b2 = 0.0 is shown in Figure 3. Examining the contour plot for this item (Figure 4) enables one to see the noncompensatory nature of the model. Note that Examinee A (high θ1, low θ2) and Examinee B (low θ1, high θ2) have approximately the same probability as Examinee C (low θ1, low θ2), indicating that little compensation actually occurs. It should be noted that the noncompensatory surface actually curves around, creating the curvilinear equiprobability contours shown in Figure 4. Spray, Ackerman, and Carlson [23] extended the work of Reckase and Sympson by formulating a generalized MIRT model that combines the compensatory and noncompensatory models.
Figure 3 Four perspectives of a noncompensatory response surface
Figure 4 The contour plot of a noncompensatory response surface. No compensation occurs for being high on only one ability.

Figure 5 A contour of a generalized model response surface with µ = 0.15
Letting f1 = a1(θ1 − b1) and f2 = a2(θ2 − b2), the compensatory model (1) can be written as

P_C = \frac{e^{(f_1 + f_2)}}{1 + e^{(f_1 + f_2)}}, \qquad (3)

and the noncompensatory model (2) can be written as

P_{NC} = \frac{e^{(f_1 + f_2)}}{[1 + e^{(f_1 + f_2)}] + e^{f_1} + e^{f_2}}. \qquad (4)

The generalized model can then be expressed as

P_G = \frac{e^{(f_1 + f_2)}}{[1 + e^{(f_1 + f_2)}] + \mu[e^{f_1} + e^{f_2}]}, \qquad (5)

where µ represents a compensation or integration parameter that ranges from 0 to 1. If µ = 0, then PG is equivalent to the compensatory model (1), and if µ = 1, then PG is equivalent to the noncompensatory model (2). As µ increases from 0, the degree of compensation decreases, and the equiprobability contours become more curvilinear. Contour plots for µ parameters of 0.1 and 0.3 are shown in Figure 5. For a more comprehensive development of different types of item response theory models, both unidimensional and multidimensional, readers are referred to van der Linden and Hambleton [28].
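To make the three response functions concrete, here is a minimal Python/NumPy sketch, not taken from the article, that evaluates the generalized model (5); setting µ = 0 recovers the compensatory form and µ = 1 the noncompensatory form. The ability and item parameter values are illustrative only.

```python
import numpy as np

def mirt_probability(theta1, theta2, a1, a2, b1, b2, mu):
    """Correct-response probability under the generalized two-dimensional model (5).

    mu = 0 gives the compensatory model, mu = 1 the noncompensatory
    (partial compensatory) model; intermediate values interpolate.
    """
    f1 = a1 * (theta1 - b1)
    f2 = a2 * (theta2 - b2)
    return np.exp(f1 + f2) / ((1.0 + np.exp(f1 + f2)) + mu * (np.exp(f1) + np.exp(f2)))

# Illustrative item: equal discriminations, difficulties at the origin
theta1, theta2 = 1.5, -1.5          # examinee high on dimension 1, low on dimension 2
for mu in (0.0, 0.15, 1.0):
    p = mirt_probability(theta1, theta2, a1=1.0, a2=1.0, b1=0.0, b2=0.0, mu=mu)
    print(f"mu = {mu:.2f}: P(correct) = {p:.3f}")
```

For this examinee with an opposite ability profile, the probability drops from 0.50 under the compensatory model to roughly 0.15 under the noncompensatory model, mirroring the contour plots discussed above.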
Estimating Item and Ability Parameters

Several software estimation programs were developed to estimate item parameters for the compensatory model, including NOHARM [8] and TESTFACT [30]. NOHARM uses a nonlinear factor analytic approach [14, 15] and can estimate item parameters for models having from one to six dimensions. It has both an exploratory and a confirmatory mode (see Factor Analysis: Confirmatory). Unfortunately, NOHARM does not have the capability to estimate examinees' ability levels. TESTFACT uses a full-information approach to estimating both item and ability parameters for multidimensional compensatory models. A relatively new approach using a Markov chain Monte Carlo procedure was used by Bolt and Lall [6] to estimate parameters for both the compensatory and noncompensatory models. This estimation approach was used to estimate the degree of compensation in the generalized model by Ackerman and Turner [4], though with limited success. Another interesting approach, using the genetic algorithm from the program DETECT, has been proposed by Zhang [31] to estimate parameters for the noncompensatory model. To date, there has been very little research on the goodness-of-fit of multidimensional models. One exception is the work of Ackerman,
Hombo, and Neustel [3], which looked at goodness-of-fit measures using the compensatory model.

Graphical Representations of Multidimensional Items and Information

To better understand and interpret MIRT item parameters, practitioners can use a variety of graphical techniques [2]. Item characteristic surface plots and contour plots do not allow the practitioner to examine and compare several items simultaneously. To get around this limitation, Reckase and McKinley [20] developed item vector plots. In this approach, each item is represented by a vector that conveys three characteristics: discrimination, difficulty, and location. Using orthogonal axes, discrimination corresponds to the length of the item response vector. This length represents the maximum amount of discrimination and is referred to as MDISC. For item i, MDISC is given by

\mathrm{MDISC}_i = \sqrt{a_{i1}^2 + a_{i2}^2}, \qquad (6)

where ai1 and ai2 are the logistic model discrimination parameters. The tail of the vector lies on the p = 0.5 equiprobability contour. If extended, all vectors would pass through the origin of the latent trait plane. Further, the a parameters are constrained to be positive, as with unidimensional IRT, and thus the item vectors are located only in the first and third quadrants. MDISC is analogous to the a parameter in unidimensional IRT. Difficulty corresponds to the location of the vector in space. The signed distance from the origin to the p = 0.5 equiprobability contour, denoted by D, is given by Reckase [19] as

D_i = \frac{-d_i}{\mathrm{MDISC}_i}, \qquad (7)

where di is the difficulty parameter for item i. The sign of this distance indicates the relative difficulty of the item. Items with negative D are relatively easy and lie in the third quadrant, whereas items with positive D are relatively hard and lie in the first quadrant. D is analogous to the b parameter in unidimensional IRT. Location corresponds to the angular direction of each item relative to the positive θ1 axis. The location of item i is given by

\alpha_i = \arccos\!\left(\frac{a_{i1}}{\mathrm{MDISC}_i}\right). \qquad (8)
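As a small numerical illustration of (6)-(8), the following sketch computes MDISC, the signed distance D, and the angle α for a few two-dimensional items; the item parameters are hypothetical and not taken from the article.

```python
import numpy as np

# Hypothetical M2PL item parameters: columns are a1, a2, d
items = np.array([
    [1.0, 1.0, 0.20],
    [1.4, 0.3, -0.50],
    [0.2, 1.1, 0.80],
])

a1, a2, d = items[:, 0], items[:, 1], items[:, 2]
mdisc = np.sqrt(a1**2 + a2**2)              # equation (6): maximum discrimination
D = -d / mdisc                              # equation (7): signed distance from the origin
alpha = np.degrees(np.arccos(a1 / mdisc))   # equation (8): angle with the theta1 axis

for i in range(len(items)):
    print(f"item {i+1}: MDISC = {mdisc[i]:.2f}, D = {D[i]:.2f}, alpha = {alpha[i]:.1f} deg")
```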
A vector with location αi greater than 45 degrees is a better measure of θ2 than θ1 , whereas a vector with a location αi less than 45 degrees is a better measure of θ1 . By examining the discrimination, difficulty, and location of each item response vector, the degree of similarity in the two-dimensional composite for all items on the test can be viewed. By colorcoding items according to content, practitioners can understand better the different ability composites that are being assessed by their tests. In Figure 6 is a vector plot for 25 items from a mathematics usage test. Note that the composite angle 43.1° represents the average composite direction of the vectors weighted by each item’s MDISC value. This direction indicates two-dimensional composite that would be estimated if this test were calibrated using a unidimensional model. Information. In item response theory, measurement precision is evaluated using information. The reciprocal of the information function is the asymptotic variance of the maximum likelihood estimate of ability. This relationship implies that the larger the
information function, the smaller the asymptotic variance and the greater the measurement precision. Multidimensional information (MINF) serves as one measure of precision. MINF is computed in a manner similar to its unidimensional IRT counterpart, except that the direction of the information is also taken into account, as shown in the formula

\mathrm{MINF} = P_i(\theta)\,[1 - P_i(\theta)]\left[\sum_{k=1}^{m} a_{ik}\cos\alpha_{ik}\right]^{2}. \qquad (9)

Figure 6 A vector plot of 25 ACT mathematics usage items (composite angle: 43.1°)
MINF provides a measure of information at any point on the latent ability plane (i.e., measurement precision relative to the θ1 , θ2 composite). MINF can be computed at the item level or at the test level (where the test information is the sum of the item information functions). Reckase and McKinley [20] developed a clamshell plot to represent information with MINF (the representation was said to resemble clamshells, hence the term). To create the clamshells, the amount of information is computed at 49 uniformly spaced points on a 7 × 7 grid in the θ1 , θ2 – space. At each of the 49 points, the amount of information is computed for 10 different directions or ability composites from 0 to 90 degrees in 10-degree increments, and represented as the length of the 10 lines in each clamshell. Figure 7 contains the clamshell plot for the 25 items
whose vectors are displayed in Figure 6. At the ability point (0,0), the clamshell vectors are almost of equal length, indicating that most of the composites tend to be measured with equal accuracy, and much more accurately than for examinees located at the ability point (3,3). Ackerman [1] expanded upon the work of Reckase and McKinley and provided several other ways to graphically examine the information of two-dimensional tests. One example of his work is a number plot that is somewhat similar to the clamshell plot. However, at each of the 49 points, the direction of maximum information is given as a numeric value on the grid, while the amount of information is represented by the size of the font for each numeric value (the larger the font, the greater the information). Figure 8 shows the number plot corresponding to the clamshell plot in Figure 7. Note that for examinees located at (0,0), the composite direction that is being best measured is at 40°. Such plots help to determine if two forms of a test are truly parallel.
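A minimal sketch of how the information in (9) might be evaluated for a single compensatory item is given below. The item parameters are hypothetical, and the two-dimensional direction is described by one angle measured from the θ1 axis (so the direction cosines are cos α and cos(90° − α)); this is an illustrative reading of (9), not code from the article.

```python
import numpy as np

def minf(theta, a, d, angle_deg):
    """Multidimensional information (9) for a compensatory M2PL item at ability
    point `theta`, in the composite direction `angle_deg` degrees from theta1."""
    theta, a = np.asarray(theta, float), np.asarray(a, float)
    p = 1.0 / (1.0 + np.exp(-(a @ theta + d)))           # compensatory probability
    angles = np.radians([angle_deg, 90.0 - angle_deg])   # direction cosines in 2D
    return p * (1.0 - p) * (a @ np.cos(angles)) ** 2

# Information at the origin for directions 0, 40, and 90 degrees
for deg in (0.0, 40.0, 90.0):
    print(f"{deg:5.1f} deg: MINF = {minf([0.0, 0.0], [1.0, 1.0], 0.2, deg):.3f}")
```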
Figure 7 A clamshell information plot of 25 ACT mathematics usage items (M2PL test information vectors)

Figure 8 A number plot indicating the direction of maximum information at 49 ability locations

Future Research Directions in MIRT

Multidimensional item response theory holds a great deal of promise for future psychometric research.
Directions in which MIRT research needs to head include computer adaptive testing [13], differential item functioning [22, 29], and diagnostic testing [7]. Other areas that need more development include expanding MIRT interpretations beyond two dimensions and to polytomous or Likert-type data [12, 17].
References

[1] Ackerman, T.A. (1994). Creating a test information profile in a two-dimensional latent space, Applied Psychological Measurement 18, 257–275.
[2] Ackerman, T.A. (1996). Graphical representation of multidimensional item response theory analyses, Applied Psychological Measurement 20, 311–330.
[3] Ackerman, T.A., Hombo, C. & Neustel, S. (2002). Evaluating indices used to assess the goodness-of-fit of the compensatory multidimensional item response theory model, Paper presented at the meeting of the National Council on Measurement in Education, New Orleans.
[4] Ackerman, T.A. & Turner, R. (2003). Estimation and application of a generalized MIRT model: assessing the degree of compensation between two latent abilities, Paper presented at the meeting of the National Council on Measurement in Education, Chicago.
[5] AERA, APA, NCME. (1999). Standards for Educational and Psychological Testing, American Educational Research Association, Washington.
[6] Bolt, D.M. & Lall, V.F. (2003). Estimation of compensatory and noncompensatory multidimensional item response models using Markov chain Monte Carlo, Applied Psychological Measurement 27(6), 395–414.
[7] DiBello, L., Stout, W. & Roussos, L. (1995). Unified cognitive/psychometric diagnostic assessment likelihood-based classification techniques, in Cognitive Diagnostic Assessment, P. Nichols, S. Chipman & R. Brennan, eds, Erlbaum, Hillsdale, pp. 361–389.
[8] Fraser, C. (1988). NOHARM: An IBM PC Computer Program for Fitting Both Unidimensional and Multidimensional Normal Ogive Models of Latent Trait Theory, The University of New England, Armidale.
[9] Gessaroli, M.E. & De Champlain, A.F. (1996). Using an approximate chi-square statistic to test the number of dimensions underlying the responses to a set of items, Journal of Educational Measurement 33, 157–179.
[10] Hambleton, R.K. & Rovinelli, R.J. (1986). Assessing the dimensionality of a set of test items, Applied Psychological Measurement 10, 287–302.
[11] Hattie, J. (1985). Methodology review: assessing unidimensionality of tests and items, Applied Psychological Measurement 9, 139–164.
[12] Hattie, J., Krakowski, K., Rogers, H.J. & Swaminathan, H. (1996). An assessment of Stout's index of essential unidimensionality, Applied Psychological Measurement 20, 1–14.
[13] Luecht, R.M. (1996). Multidimensional computer adaptive testing in a certification or licensure context, Applied Psychological Measurement 20, 389–404.
[14] McDonald, R.P. (1967). Nonlinear factor analysis, Psychometric Monograph No. 15.
[15] McDonald, R.P. (1999). Test Theory: A Unified Approach, Erlbaum, Hillsdale.
[16] Messick, S. (1989). Validity, in Educational Measurement, 3rd Edition, R.L. Linn, ed., American Council on Education, Macmillan, New York, pp. 13–103.
[17] Nandakumar, R. (1991). Traditional dimensionality versus essential dimensionality, Journal of Educational Measurement 28, 99–117.
[18] Nandakumar, R. & Stout, W. (1993). Refinement of Stout's procedure for assessing latent trait unidimensionality, Journal of Educational Statistics 18, 41–68.
[19] Reckase, M.D. (1985). The difficulty of test items that measure more than one ability, Applied Psychological Measurement 9, 401–412.
[20] Reckase, M.D. & McKinley, R.L. (1991). The discrimination power of items that measure more than one dimension, Applied Psychological Measurement 14, 361–373.
[21] Roussos, L. (1992). Hierarchical Agglomerative Clustering Computer Programs Manual, University of Illinois at Urbana-Champaign, Department of Statistics (unpublished manuscript).
[22] Roussos, L. & Stout, W. (1996). A multidimensionality-based DIF analysis paradigm, Applied Psychological Measurement 20, 355–371.
[23] Spray, J.A., Ackerman, T.A. & Carlson, J. (1986). An analysis of multidimensional item response theory models, Paper presented at the meeting of the Office of Naval Research Contractors, Gatlinburg.
[24] Stout, W. (1987). A nonparametric approach for assessing latent trait unidimensionality, Psychometrika 52, 589–617.
[25] Stout, W., Habing, B., Douglas, J., Kim, H.R., Roussos, L. & Zhang, J. (1996). Conditional covariance-based nonparametric multidimensionality assessment, Applied Psychological Measurement 20, 331–354.
[26] Sympson, J.B. (1978). A model for testing multidimensional items, in Proceedings of the 1977 Computerized Adaptive Testing Conference, D.J. Weiss, ed., University of Minnesota, Department of Psychology, Psychometric Methods Program, Minneapolis, pp. 82–98.
[27] Traub, R.E. (1983). A priori consideration in choosing an item response theory model, in Applications of Item Response Theory, R.K. Hambleton, ed., Educational Research Institute of British Columbia, Vancouver, pp. 57–70.
[28] van der Linden, W.J. & Hambleton, R.K. (1997). Handbook of Modern Item Response Theory, Springer, New York.
[29] Walker, C.M. & Beretvas, S.N. (2001). An empirical investigation demonstrating the multidimensional DIF paradigm: a cognitive explanation for DIF, Journal of Educational Measurement 38, 147–163.
[30] Wilson, D., Wood, R. & Gibbons, R. (1987). TESTFACT [Computer program], Scientific Software, Mooresville.
[31] Zhang, J. (2001). Using multidimensional IRT to analyze item response data, Paper presented at the meeting of the National Council on Measurement in Education, Seattle.
[32] Zhang, J. & Stout, W. (1999). The theoretical DETECT index of dimensionality and its application to approximate simple structure, Psychometrika 64, 231–249.
Further Reading Ackerman, T.A. (1994). Using multidimensional item response theory to understand what items and tests are measuring, Applied Measurement in Education 7, 255–278.
Kelderman, H. & Rijkes, C. (1994). Loglinear multidimensional IRT models for polytomously scored items, Psychometrika 2(59), 149–176. Nandakumar, R., Yu, F., Li, H. & Stout, W. (1998). Assessing unidimensionality of polytomous data, Applied Psychological Measurement 22, 99–115.
TERRY A. ACKERMAN
Multidimensional Scaling

PATRICK J.F. GROENEN AND MICHEL VAN DE VELDEN

Volume 3, pp. 1280–1289 in Encyclopedia of Statistics in Behavioral Science. ISBN-13: 978-0-470-86080-9; ISBN-10: 0-470-86080-4. Editors: Brian S. Everitt & David C. Howell. John Wiley & Sons, Ltd, Chichester, 2005.
Multidimensional Scaling

Introduction

Multidimensional scaling is a statistical technique originating in psychometrics. The data used for multidimensional scaling (MDS) are dissimilarities between pairs of objects (see Proximity Measures). The main objective of MDS is to represent these dissimilarities as distances between points in a low dimensional space such that the distances correspond as closely as possible to the dissimilarities. Let us introduce the method by means of a small example. Ekman [7] collected data to study the perception of 14 different colors. Every pair of colors was judged by a respondent from having 'no similarity' to being 'identical'. The obtained scores can be scaled in such a way that identical colors are denoted by 0, and completely different colors by 1. The averages of these dissimilarity scores over the 31 respondents are presented in Table 1. Starting from wavelength 434, the colors range from bluish-purple, blue, green, yellow, to red. Note that the dissimilarities are symmetric: the extent to which colors with wavelengths 490 and 584 are the same is equal to that of colors 584 and 490. Therefore, it suffices to present only the lower triangular part of the data in Table 1. Also, the diagonal is not of interest in MDS because the distance of an object with itself is necessarily zero.
MDS tries to represent the dissimilarities in Table 1 in a map. Figure 1 presents such an MDS map in 2 dimensions. We see that the colors, denoted by their wavelengths, are represented in the shape of a circle. The interpretation of this map should be done in terms of the depicted interpoint distances. Note that, as distances do not change under rotation, a rotation of the plot does not affect the interpretation. Similarly, a translation of the solution (that is, a shift of all coordinates by a fixed value per dimension) does not change the distances either, nor does a reflection of one or both of the axes. Figure 1 should be interpreted as follows. Colors that are located close to each other are perceived as being similar, for example, the colors with wavelengths 434 (violet) and 445 (indigo), or 628 and 651 (both red). In contrast, colors that are positioned far away from each other, such as 490 (green) and 610 (orange), indicate a large difference in perception. The circular form obtained in this example is in accordance with theory on the perception of colors. Summarizing, MDS is a technique that translates a table of dissimilarities between pairs of objects into a map where distances between the points match the dissimilarities as well as possible. The use of MDS is not limited to psychology but has applications in a wide range of disciplines, such as sociology, economics, biology, chemistry, and archaeology. Often, it is used as a technique for exploring the data. In addition, it can be used as a technique for dimension reduction. Sometimes, as in chemistry, the objective is to reconstruct a 3D model of large DNA molecules for which only partial information about the distances between atoms is available.

Table 1
Dissimilarities of colors with wavelengths from 434 to 674 nm [7]
nm    434   445   465   472   490   504   537   555   584   600   610   628   651   674
434    –
445   0.14   –
465   0.58  0.50   –
472   0.58  0.56  0.19   –
490   0.82  0.78  0.53  0.46   –
504   0.94  0.91  0.83  0.75  0.39   –
537   0.93  0.93  0.90  0.90  0.69  0.38   –
555   0.96  0.93  0.92  0.91  0.74  0.55  0.27   –
584   0.98  0.98  0.98  0.98  0.93  0.86  0.78  0.67   –
600   0.93  0.96  0.99  0.99  0.98  0.92  0.86  0.81  0.42   –
610   0.91  0.93  0.98  1.00  0.98  0.98  0.95  0.96  0.63  0.26   –
628   0.88  0.89  0.99  0.99  0.99  0.98  0.98  0.97  0.73  0.50  0.24   –
651   0.87  0.87  0.95  0.98  0.98  0.98  0.98  0.98  0.80  0.59  0.38  0.15   –
674   0.84  0.86  0.97  0.96  1.00  0.99  1.00  0.98  0.77  0.72  0.45  0.32  0.24   –
Figure 1 MDS solution in 2 dimensions of the color data in Table 1
Data for MDS In the previous section, we introduced MDS as a method to describe relationships between objects on the basis of observed dissimilarities. However, instead of dissimilarities we often observe similarities between objects. Correlations, for example, can be interpreted as similarities. By converting the similarities into dissimilarities MDS can easily be applied to similarity data. There are several ways of transforming similarities into dissimilarities. For example, we may take one divided by the similarity or we can apply any monotone decreasing function that yields nonnegative values (dissimilarities cannot be negative). However, in Section ‘Transformations of the Data’, we shall see that by applying transformations in MDS, there is no need to transform similarities into dissimilarities. To indicate both similarity and dissimilarity data, we use the generic term proximities. Data in MDS can be obtained in a variety of ways. We distinguish between the direct collection of proximities versus derived proximities. The color data of the previous section is an example of direct proximities. That is, the data arrives in the format of proximities. Often, this is not the case and our data does not consist of proximities between variables. However, by considering an appropriate measure, proximities can be derived from the original data. For example, consider the case where objects are rated on several variables. If the interest lies in representing the variables, we can calculate the correlation matrix
as measure of similarity between the variables. MDS can be applied to describe the relationship between the variables on the basis of the derived proximities. Alternatively, if interest lies in the objects, Euclidean distances can be computed between the objects using the variables as dimensions. In this case, we use high dimensional Euclidean distances as dissimilarities and we can use MDS to reconstruct these distances in a low dimensional space. Co-occurrence data are another source for obtaining dissimilarities. For such data, a respondent groups the objects into partitions and an n × n incidence matrix is derived where a one indicates that a pair of objects is in the same group and a zero indicates that they are in different groups. By considering the frequencies of objects being in the same or different groups and by applying special measures (such as the so-called Jaccard similarity measure), we obtain proximities. For a detailed discussion of various (dis)similarity measures, we refer to [8] (see Proximity Measures).
Formalizing Multidimensional Scaling

To formalize MDS, we need some notation. Let n be the number of different objects and let the dissimilarity for objects i and j be given by δij. The coordinates are gathered in an n × p matrix X, where p is the dimensionality of the solution, to be specified in advance by the user. Thus, row i of X gives the coordinates for object i. Let dij(X) be the Euclidean distance between rows i and j of X, defined as

d_{ij}(X) = \left[\sum_{s=1}^{p} (x_{is} - x_{js})^2\right]^{1/2}, \qquad (1)

that is, the length of the shortest line connecting points i and j. The objective of MDS is to find a matrix X such that dij(X) matches δij as closely as possible. This objective can be formulated in a variety of ways, but here we use the definition of raw Stress σ²(X), that is,

\sigma^2(X) = \sum_{i=2}^{n}\sum_{j=1}^{i-1} w_{ij}\,(\delta_{ij} - d_{ij}(X))^2 \qquad (2)

by Kruskal [11, 12], who was the first to propose a formal measure for doing MDS. This measure is also referred to as the least-squares MDS model. Note that, due to the symmetry of the dissimilarities and the distances, the summation only involves the pairs ij where i > j. Here, wij is a user-defined weight that must be nonnegative. For example, many MDS programs implicitly choose wij = 0 for dissimilarities that are missing.
The minimization of σ²(X) is a rather complex problem that cannot be solved in closed form. Therefore, MDS programs use iterative numerical algorithms to find a matrix X for which σ²(X) is a minimum. One of the best algorithms available is the SMACOF algorithm [1, 3, 4, 5] based on iterative majorization. The SMACOF algorithm has been implemented in the SPSS procedure Proxscal [13]. In Section 'The SMACOF Algorithm', we give a brief illustration of the SMACOF algorithm. Because Euclidean distances do not change under rotation, translation, and reflection, these operations may be freely applied to an MDS solution without affecting the raw Stress. Many MDS programs use this indeterminacy to center the coordinates so that they sum to zero dimension-wise. The freedom of rotation is often exploited to put the solution in so-called principal axis orientation. That is, the axes are rotated in such a way that the variance of X is maximal along the first dimension, the second dimension is uncorrelated with the first and has again maximal variance, and so on.
Here, we have discussed the Stress measure for MDS. However, there are several other measures for doing MDS. In Section 'Alternative Measures for Doing Multidimensional Scaling', we briefly discuss other popular definitions of Stress.
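Although the original entry contains no code, the raw Stress in (2) is easy to evaluate directly. The sketch below (NumPy, with a made-up dissimilarity matrix and a random start configuration) computes the Euclidean distances of (1) and the resulting raw Stress.

```python
import numpy as np

def raw_stress(delta, X, W=None):
    """Raw Stress (2): weighted squared differences between dissimilarities
    delta[i, j] and Euclidean distances among the rows of the configuration X."""
    n = X.shape[0]
    if W is None:
        W = np.ones((n, n))
    diff = X[:, None, :] - X[None, :, :]
    D = np.sqrt((diff ** 2).sum(axis=-1))     # d_ij(X), equation (1)
    i, j = np.tril_indices(n, k=-1)           # pairs with i > j
    return np.sum(W[i, j] * (delta[i, j] - D[i, j]) ** 2)

# Tiny illustration with a hypothetical 4-object dissimilarity matrix
delta = np.array([[0.0, 1.0, 2.0, 2.2],
                  [1.0, 0.0, 1.5, 2.0],
                  [2.0, 1.5, 0.0, 1.0],
                  [2.2, 2.0, 1.0, 0.0]])
X0 = np.random.default_rng(1).normal(size=(4, 2))   # random 2D start configuration
print("raw Stress of the random start:", round(raw_stress(delta, X0), 3))
```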
Transformations of the Data

So far, we have assumed that the dissimilarities are known. However, this is often not the case. Consider, for example, the situation in which the objects have been ranked. That is, the dissimilarities between the objects are not known, but their order is known. In such a case, we would like to assign numerical values to the proximities in such a way that these values exhibit the same rank order as the data. These numerical values are usually called disparities, d-hats, or pseudo-distances, and they are denoted by \hat{d}. The task of MDS now becomes to simultaneously obtain disparities and coordinates in such a way that the coordinates represent the disparities (and thus the original rank order of the data) as well as possible. This objective can be captured in minimizing a slight adaptation of raw Stress, that is,

\sigma^2(\hat{d}, X) = \sum_{i=2}^{n}\sum_{j=1}^{i-1} w_{ij}\,(\hat{d}_{ij} - d_{ij}(X))^2, \qquad (3)
over both \hat{d} and X, where \hat{d} is the vector containing \hat{d}_{ij} for all pairs. The process of finding the disparities is called optimal scaling and was first introduced by Kruskal [11, 12]. Optimal scaling aims to find a transformation of the data that fits the distances in the MDS solution as well as possible. To avoid the trivial optimal scaling solution X = 0 and \hat{d}_{ij} = 0 for all ij, we impose a length constraint on the disparities such that the sum of squared d-hats equals a fixed constant, for example, \sum_{i=2}^{n}\sum_{j=1}^{i-1} w_{ij}\hat{d}_{ij}^2 = n(n-1)/2 [3].
Transformations of the data are often used in MDS. Figure 2 shows a few examples of transformation plots for the color example. Let us look at some special cases. Suppose that we choose \hat{d}_{ij} = \delta_{ij} for all ij. Then, minimizing (3) without the length constraint is exactly the same as minimizing (2). Minimizing (3) with the length constraint only changes \hat{d}_{ij} = a\delta_{ij}, where a is a scalar chosen in such a way that the length constraint is satisfied. This transformation is called a ratio transformation (Figure 2a). Note that, in this case, the relative differences of a\delta_{ij} are the same as those for \delta_{ij}. Hence, the relative differences
Figure 2 Four transformations often used in MDS: (a) ratio, (b) interval, (c) ordinal, (d) spline
of the dij (X) in (2) and (3) are also the same. Ratio MDS can be seen as the most restrictive transformation in MDS. An obvious extension to the ratio transformation is obtained by allowing the dˆij to be a linear transformation of the δij . That is, dˆij = a + bδij , for some unknown values of a and b. Figure 2b depicts an interval transformation. This transformation may be chosen if there is reason to believe that δij = 0 does not have any particular interpretation. An interval transformation that is almost horizontal reveals little about the data as different dissimilarities are transformed to similar disparities. In such a case, the constant term will dominate the dˆij ’s. On the other hand, a good interval transformation is obtained if the line is not horizontal and the constant term is reasonably small with respect to the rest. For ordinal MDS, the dˆij are only required to have the same rank order as δij . That is, if for two pairs of objects ij and kl we have δij ≤ δkl then the corresponding disparities must satisfy dˆij ≤ dˆkl .
An example of an ordinal transformation in MDS is given in Figure 2c. Typically, an ordinal transformation shows a step function. Similar to the case for interval transformations, it is not a good sign if the transformation plot shows a horizontal line. Moreover, if the transformation plot only exhibits a few steps, ordinal MDS does not use finer information available in the data. Ordinal MDS is particularly suited if the original data are rank orders. To compute an ordinal transformation a method called monotone regression can be used. A monotone spline transformation offers more freedom than an interval transformation, but never more than an ordinal transformation (see Scatterplot Smoothers). The advantage of a spline transformation over an ordinal transformation is that it will yield a smooth transformation. Figure 2d shows an example of a spline transformation. A spline transformation is built on two ideas. First, the range of the δij ’s can be subdivided into connected intervals. Then, for each interval, the data are transformed
using a polynomial of a specified degree. For example, a second-degree polynomial imposes that \hat{d}_{ij} = a\delta_{ij}^2 + b\delta_{ij} + c. The special feature of a spline is that at the connections of the intervals, the so-called interior knots, the two polynomials connect smoothly. The spline transformation in Figure 2d was obtained by choosing one interior knot at 0.90 and by using second-degree polynomials. For MDS, it is important that the transformation is monotone increasing. This requirement is automatically satisfied for monotone splines or I-splines (see [14, 1]). For choosing a transformation in MDS, it suffices to know that a spline transformation is smooth and nonlinear. The amount of nonlinearity is governed by the number of interior knots specified. Unless the number of dissimilarities is very large, a few interior knots for a second-degree spline usually work well.
There are several reasons to use transformations in MDS. One reason concerns the fit of the data in low dimensionality. By choosing a transformation that is less restrictive than the ratio transformation, a better fit may be obtained. Alternatively, there may exist theoretical reasons why a transformation of the dissimilarities is desired. Ordered from most to least restrictive, the transformations are ratio, interval, spline, and ordinal. If the data are dissimilarities, then it is necessary that the transformation is monotone increasing (as in Figure 2) so that pairs with higher dissimilarities are indeed modeled by larger distances. Conversely, if the data are similarities, then the transformation should be monotone decreasing so that more similar pairs are modeled by smaller distances. A ratio transformation is not possible for similarities. The reason is that the \hat{d}_{ij}'s must be nonnegative. This implies that the transformation must include an intercept. In the MDS literature, one often encounters the terms metric and nonmetric MDS. Metric MDS refers to the ratio and interval transformations, whereas all other transformations, such as ordinal and spline transformations, are covered by the term nonmetric MDS. We believe, however, that it is better to refer directly to the type of transformation that is used. There exist other, a priori, transformations of the data that are not optimal in the sense described above, that is, transformations that are not obtained by minimizing (3). The advantage of optimal transformations is that the exact form of the transformation is unknown and is determined optimally together with the MDS configuration.

Diagnostics

In order to assess the quality of the MDS solution, we can study the differences between the MDS solution and the data. One convenient way to do this is by inspecting the so-called Shepard diagram. A Shepard diagram shows both the transformation and the error. Let pij denote the proximity between objects i and j. Then, a Shepard diagram plots simultaneously the pairs (pij, dij(X)) and (pij, \hat{d}_{ij}). In Figure 3, solid points denote the pairs (pij, dij(X)) and open circles represent the pairs (pij, \hat{d}_{ij}). By connecting the open circles, a line is obtained representing the relationship between the proximities and the disparities, which is equivalent to the transformation plots in Figure 2. The vertical distances between the open and closed circles are equal to \hat{d}_{ij} - d_{ij}(X), that is, they give the errors of representation for each pair of objects. Hence, the Shepard diagram can be used to inspect both the residuals of the MDS solution and the transformation. Outliers can be detected, as well as possible systematic deviations. Figure 3 gives the Shepard diagram for the ratio MDS solution of Figure 1 using the color data. We see that all the errors corresponding to low proximities are positive, whereas the errors for the higher proximities are all negative. This kind of heteroscedasticity suggests the use of a more liberal transformation. Figure 4 gives the Shepard diagram for an ordinal transformation. As the solid points are closer to the line connecting the open circles, we may indeed conclude that the heteroscedasticity has gone and that the fit has become better.

Figure 3 Shepard diagram for ordinal MDS of the color data, where the proximities are dissimilarities

Figure 4 Shepard diagram for ordinal MDS of the color data, where the proximities are dissimilarities
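The ordinal transformation discussed above rests on monotone (isotonic) regression of the distances on the rank order of the dissimilarities. The following pool-adjacent-violators sketch is a simplified, unweighted illustration without tie handling or the length constraint, and is not the algorithm used by any particular MDS program.

```python
import numpy as np

def monotone_regression(y):
    """Pool-adjacent-violators: nondecreasing sequence closest to y in least squares."""
    blocks = []  # each block is [mean value, number of pooled elements]
    for v in y:
        blocks.append([float(v), 1.0])
        # Merge adjacent blocks while the monotonicity constraint is violated
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            v2, w2 = blocks.pop()
            v1, w1 = blocks.pop()
            blocks.append([(v1 * w1 + v2 * w2) / (w1 + w2), w1 + w2])
    return np.concatenate([[b[0]] * int(b[1]) for b in blocks])

# Distances ordered by increasing dissimilarity; their monotone fit gives disparities
d_sorted = np.array([0.4, 0.3, 0.7, 0.6, 0.9, 1.2])
print(monotone_regression(d_sorted))   # [0.35 0.35 0.65 0.65 0.9 1.2]
```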
Choosing the Dimensionality Several methods have been proposed to choose the dimensionality of the MDS solution. However, no definite strategy is present. Unidimensional scaling, that is, p = 1 (with ratio transformation) has to be treated with special care because the usual MDS algorithms will end up in local minima that can be far from global minima.
One approach to determine the dimensionality is to compute MDS solutions for a range of dimensions, say from 2 to 6 dimensions, and plot the Stress against the dimension. Similar to common practice in principal component analysis, we then use the elbow criterion to determine the dimensionality. That is, we choose the number of dimensions where a bend in the curve occurs. Another approach (for ordinal MDS), proposed by [15], compares the Stress values against the Stress of generated data. However, perhaps the most important criterion for choosing the dimensionality is simply based on the interpretability of the map. Therefore, the vast majority of reported MDS solutions are done in two dimensions and occasionally in three dimensions. The interpretability criterion is a valid one especially when MDS is used for exploration of the data.
The SMACOF Algorithm In Section ‘Formalizing Multidimensional Scaling’, we mentioned that a popular algorithm for minimizing Stress is the SMACOF Algorithm. Its major feature is that it guarantees lower Stress values in each iteration. Here we briefly sketch how this algorithm works. Nowadays, the acronym SMACOF stands for Scaling by Majorizing a Complex Function. To understand how it works, consider Figure 5a. Suppose that we have dissimilarity data on 13 stock
Figure 5 The Stress function and the majorizing function for the supporting point (0,0) in Panel (a), and the majorizing function at convergence in Panel (b). Reproduced by permission of Vrieseborch Publishers
markets and that the two dimensional MDS solution is given by the points in the horizontal plane of Figure 5a and 5b. Now, suppose that the position of the ‘nikkei’ index was unknown. Then we can calculate the value of Stress as a function of the two coordinates for ‘nikkei’. The surface in Figure 5a shows these values of Stress for every potential position of ‘nikkei’. To minimize Stress we must find the coordinates that yield the lowest Stress. Hence, the final point must be located in the horizontal plane under the lowest value of the surface. To find this point, we use an iterative procedure that is based on majorization. First, as an initial point for ‘nikkei’ we choose the origin in Figure 5a. Then, a so-called majorization function is chosen in such a way that, for this initial point, its value is equal to the Stress value, and elsewhere it lies above the Stress surface. Here, the majorizing function is chosen to be quadratic and is visualized in Figure 5a as the bowl-shaped surface above the Stress function surface. Now, as the Stress surface is always below the majorization function, the value of Stress evaluated at the point corresponding to the minimum of the majorization function, will be lower than the initial Stress value. Hence, the initial point can be updated by calculating the minimum of the majorization function which is easy because the majorizing function is quadratic. Using the updated point we repeat this process until the coordinates remain practically
constant. Figure 5b shows the subsequent coordinates for ‘nikkei’ obtained by majorization as a line with connected dots marking the path to its final position.
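For the special case in which all weights equal one, the majorization update has a closed form known as the Guttman transform, and iterating it is the core of SMACOF. The sketch below is a bare-bones illustration of that idea, without the convergence checks, weighting, and transformations of production implementations such as Proxscal.

```python
import numpy as np

def smacof(delta, p=2, n_iter=100, seed=2):
    """Minimize raw Stress for an n x n dissimilarity matrix delta (all weights one)
    by repeated Guttman transforms; each update cannot increase the Stress."""
    rng = np.random.default_rng(seed)
    n = delta.shape[0]
    X = rng.normal(size=(n, p))                        # random start configuration
    for _ in range(n_iter):
        diff = X[:, None, :] - X[None, :, :]
        D = np.sqrt((diff ** 2).sum(axis=-1))          # current distances
        with np.errstate(divide="ignore", invalid="ignore"):
            B = -np.where(D > 0, delta / D, 0.0)       # off-diagonal b_ij = -delta_ij / d_ij
        np.fill_diagonal(B, 0.0)
        np.fill_diagonal(B, -B.sum(axis=1))            # b_ii = -sum of the other b_ij
        X = B @ X / n                                  # Guttman transform
        X -= X.mean(axis=0)                            # center the configuration
    return X

delta = np.array([[0.0, 1.0, 2.0],
                  [1.0, 0.0, 1.5],
                  [2.0, 1.5, 0.0]])
print(np.round(smacof(delta), 3))
```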
Alternative Measures for Doing Multidimensional Scaling

In addition to the raw Stress measure introduced in Section 'Formalizing Multidimensional Scaling', there exist other measures for doing MDS. Here we give a short overview of some of the most popular alternatives. First, we discuss normalized raw Stress,

\sigma_n^2(\hat{d}, X) = \frac{\sum_{i=2}^{n}\sum_{j=1}^{i-1} w_{ij}\,(\hat{d}_{ij} - d_{ij}(X))^2}{\sum_{i=2}^{n}\sum_{j=1}^{i-1} w_{ij}\,\hat{d}_{ij}^2}, \qquad (4)

which is simply raw Stress divided by the sum of squared dissimilarities. The advantage of this measure over raw Stress is that its value is independent of the scale of the dissimilarities and their number. Thus, multiplying the dissimilarities by a positive factor will not change (4) at a local minimum, whereas the coordinates will be the same up to the same factor.
The second measure is Kruskal's Stress-1 formula,

\sigma_1(\hat{d}, X) = \left[\frac{\sum_{i=2}^{n}\sum_{j=1}^{i-1} w_{ij}\,(\hat{d}_{ij} - d_{ij}(X))^2}{\sum_{i=2}^{n}\sum_{j=1}^{i-1} w_{ij}\,d_{ij}^2(X)}\right]^{1/2}, \qquad (5)
which is equal to the square root of raw Stress divided by the sum of squared distances. This measure is of importance because many MDS programs and publications report this value. It can be proved that, at a local minimum of \sigma_n^2(\hat{d}, X), \sigma_1(\hat{d}, X) also has a local minimum with the same configuration up to a multiplicative constant. In addition, the square root of normalized raw Stress is equal to Stress-1 [1]. A third measure is Kruskal's Stress-2, which is similar to Stress-1 except that the denominator is based on the variance of the distances instead of the sum of squares. Stress-2 can be used to avoid the situation where all distances are almost equal. A final measure that seems reasonably popular is called S-Stress (implemented in the program ALSCAL); it measures the sum of squared errors between squared distances and squared dissimilarities [16]. The disadvantage of this measure is that it tends to give solutions in which large dissimilarities are overemphasized and the small dissimilarities are not well represented.
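Continuing the earlier raw-Stress sketch, normalized raw Stress (4) and Stress-1 (5) can be computed along the same lines; the code below uses unit weights and treats the dissimilarities themselves as the disparities, which corresponds to ratio MDS.

```python
import numpy as np

def stress_measures(dhat, X):
    """Normalized raw Stress (4) and Kruskal's Stress-1 (5) with unit weights."""
    n = X.shape[0]
    diff = X[:, None, :] - X[None, :, :]
    D = np.sqrt((diff ** 2).sum(axis=-1))
    i, j = np.tril_indices(n, k=-1)
    raw = np.sum((dhat[i, j] - D[i, j]) ** 2)
    normalized = raw / np.sum(dhat[i, j] ** 2)
    stress1 = np.sqrt(raw / np.sum(D[i, j] ** 2))
    # At a local minimum of (4), sqrt(normalized) equals stress1 up to rounding.
    return normalized, stress1
```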
Pitfalls If missing dissimilarities are present, a special problem may occur for certain patterns of missing dissimilarities. For example, if it is possible to split the objects in two or more sets such that the between-set
weights wij are all zero, we are dealing with independent MDS problems, one for each set. If this situation is not recognized, you may inadvertently interpret the missing between set distances. With only a few missing values, this situation is unlikely to happen. However, when dealing with many missing values, one should verify that the problem does not occur. Another important issue is to understand what MDS will do if there is no information in the data, that is, when all dissimilarities are equal. Such a case can be seen as maximally uninformative and therefore as a null model. Solutions of empirical data should deviate from this null model. This situation was studied in great detail by [2]. It turned out that for constant dissimilarities, MDS will find in one dimension points equally spread on a line (see Figure 6). In two dimensions, the points lie on concentric circles [6] and in three dimensions (or higher), the points lie equally spaced on the surface of a sphere. Because all dissimilarities are equal, any permutation of these points yield an equally good fit. This type of degeneracy can be easily recognized by checking the Shepard diagram. For example, if all disparities (or dissimilarities in ratio MDS) fall into a small interval considerably different from zero, we are dealing with the case of (almost) constant dissimilarities. For such a case, we advise redoing the MDS analysis with a more restrictive transformation, for example, using monotone splines, an interval transformation or even ratio MDS. A final pitfall for MDS are local minima. A local minimum for Stress implies that small changes in the configuration always have a worse Stress than the local minimum solution. However, larger changes in the configuration may yield a lower Stress. A configuration with the overall lowest Stress value is called a global minimum. In general, MDS algorithms that minimize Stress cannot guarantee the retrieval of
Figure 6 Solutions for constant dissimilarities with n = 30. The left plot shows the unidimensional solution and the right plot a 2D solution
a global minimum. However, if the dimensionality is exactly n − 1, it is known that ratio MDS only has one minimum, which is consequently global. Moreover, when p = n − 1 is specified, MDS often yields a solution that fits in a dimensionality lower than n − 1. If so, then this MDS solution is also a global minimum. A different case is that of unidimensional scaling. For unidimensional scaling with a ratio transformation, it is well known that there are many local minima, and the problem can better be solved using combinatorial methods. For low dimensionality, like p = 2 or p = 3, experiments indicated that the number of different local minima ranges from a few to several thousands. For an overview of issues concerning local minima in ratio MDS, we refer to [9] and [10]. When transformations are used, there are fewer local minima and the probability of finding a global minimum increases. As a general strategy, we advise using multiple random starts (say 100 random starts) and retaining the solution with the lowest Stress. If most random starts end in the same candidate minimum, then there probably exist only a few local minima. However, if the random starts end in many different local minima, the data exhibit a serious local minimum problem. In that case, it is advisable to increase the number of random starts and retain the best solution.
References

[1] Borg, I. & Groenen, P.J.F. (1997). Modern Multidimensional Scaling: Theory and Applications, Springer, New York.
[2] Buja, A., Logan, B.F., Reeds, J.R. & Shepp, L.A. (1994). Inequalities and positive-definite functions arising from a problem in multidimensional scaling, The Annals of Statistics 22, 406–438.
[3] De Leeuw, J. (1977). Applications of convex analysis to multidimensional scaling, in Recent Developments in Statistics, J.R. Barra, F. Brodeau, G. Romier & B. van Cutsem, eds, North-Holland, Amsterdam, pp. 133–145.
[4] De Leeuw, J. (1988). Convergence of the majorization method for multidimensional scaling, Journal of Classification 5, 163–180.
[5] De Leeuw, J. & Heiser, W.J. (1980). Multidimensional scaling with restrictions on the configuration, in Multivariate Analysis, Vol. V, P.R. Krishnaiah, ed., North-Holland, Amsterdam, pp. 501–522.
[6] De Leeuw, J. & Stoop, I. (1984). Upper bounds of Kruskal's Stress, Psychometrika 49, 391–402.
[7] Ekman, G. (1954). Dimensions of color vision, Journal of Psychology 38, 467–474.
[8] Gower, J.C. & Legendre, P. (1986). Metric and Euclidean properties of dissimilarity coefficients, Journal of Classification 3, 5–48.
[9] Groenen, P.J.F. & Heiser, W.J. (1996). The tunneling method for global optimization in multidimensional scaling, Psychometrika 61, 529–550.
[10] Groenen, P.J.F., Heiser, W.J. & Meulman, J.J. (1999). Global optimization in least-squares multidimensional scaling by distance smoothing, Journal of Classification 16, 225–254.
[11] Kruskal, J.B. (1964a). Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis, Psychometrika 29, 1–27.
[12] Kruskal, J.B. (1964b). Nonmetric multidimensional scaling: a numerical method, Psychometrika 29, 115–129.
[13] Meulman, J.J., Heiser, W.J. & SPSS. (1999). SPSS Categories 10.0, SPSS, Chicago.
[14] Ramsay, J.O. (1988). Monotone regression splines in action, Statistical Science 3(4), 425–461.
[15] Spence, I. & Ogilvie, J.C. (1973). A table of expected stress values for random rankings in nonmetric multidimensional scaling, Multivariate Behavioral Research 8, 511–517.
[16] Takane, Y., Young, F.W. & De Leeuw, J. (1977). Nonmetric individual differences multidimensional scaling: an alternating least-squares method with optimal scaling features, Psychometrika 42, 7–67.
(See also Multidimensional Unfolding; Scaling Asymmetric Matrices; Scaling of Preferential Choice) PATRICK J.F. GROENEN AND MICHEL VAN DE VELDEN
Multidimensional Unfolding

JAN DE LEEUW

Volume 3, pp. 1289–1294 in Encyclopedia of Statistics in Behavioral Science. ISBN-13: 978-0-470-86080-9; ISBN-10: 0-470-86080-4. Editors: Brian S. Everitt & David C. Howell. John Wiley & Sons, Ltd, Chichester, 2005.
Multidimensional Unfolding

The unfolding model is a geometric model for preference and choice. It locates individuals and alternatives as points in a joint space, and it says that an individual will pick the alternative in the choice set closest to its ideal point. Unfolding originated in the work of Coombs [4] and his students. It is perhaps the dominant model in both scaling of preference data and attitude scaling. The multidimensional unfolding technique computes solutions to the equations of the unfolding model. It can be defined as multidimensional scaling of off-diagonal matrices. This means the data are dissimilarities between n row objects and m column objects, collected in an n × m matrix Δ. An important example is preference data, where δij indicates, for instance, how much individual i dislikes object j. In unfolding, we have many of the same distinctions as in general multidimensional scaling: there is unidimensional and multidimensional unfolding, metric and nonmetric unfolding, and there are many possible choices of loss functions that can be minimized.
First we will look at (metric) unfolding as defining the system of equations δij = dij(X, Y), where X is the n × p configuration matrix of row points, Y is the m × p configuration matrix of column points, and

d_{ij}(X, Y) = \sqrt{\sum_{s=1}^{p} (x_{is} - y_{js})^2}. \qquad (1)
Clearly, an equivalent system of algebraic equations is δ²ij = d²ij(X, Y), and this system expands to

\delta_{ij}^2 = \sum_{s=1}^{p} x_{is}^2 + \sum_{s=1}^{p} y_{js}^2 - 2\sum_{s=1}^{p} x_{is} y_{js}. \qquad (2)
We can rewrite this in matrix form as \Delta^{(2)} = a e_m' + e_n b' - 2XY', where a and b contain the row and column sums of squares, and where e is used for a vector with all elements equal to one. If we define the centering operators J_n = I_n - e_n e_n'/n and J_m = I_m - e_m e_m'/m, then we see that doubly centering the matrix of squared dissimilarities gives the basic result

H = -\tfrac{1}{2} J_n \Delta^{(2)} J_m = \tilde{X}\tilde{Y}', \qquad (3)

where \tilde{X} = J_n X and \tilde{Y} = J_m Y are centered versions of X and Y. For our system of equations to be solvable, it is necessary that rank(H) ≤ p. Solving the system, or finding an approximate solution by using the singular value decomposition, already gives us an idea about X and Y, except that we do not know the relative location and orientation of the two point clouds. More precisely, if H = PQ' is a full rank decomposition of H, then the solutions X and Y of our system of equations δ²ij = d²ij(X, Y) can be written in the form

X = (P + e_n\alpha')T, \qquad Y = (Q + e_m\beta')(T')^{-1}, \qquad (4)

which leaves us with only the p(p + 2) unknowns in α, β, and T still to be determined. By using the fact that the solution is invariant under translation and rotation, we can actually reduce this to (1/2)p(p + 3) parameters. One way to find these additional parameters is given in [10]. Instead of trying to find an exact solution, if one actually exists, by algebraic means, we can also define a multidimensional unfolding loss function and minimize it. In the most basic and classical form, we have the Stress loss function

\sigma(X, Y) = \sum_{i=1}^{n}\sum_{j=1}^{m} w_{ij}\,(\delta_{ij} - d_{ij}(X, Y))^2. \qquad (5)
This is identical to an ordinary multidimensional scaling problem in which the diagonal (row–row and column–column) weights are zero. Or, to put it differently, in unfolding, the dissimilarities between different row objects and different column objects are missing. Thus, any multidimensional scaling program that can handle weights and missing data can be used to minimize this loss function. Details are in [7] or [1, Part III]. One can also consider measuring loss using SStress, the sum of squared differences between the squared dissimilarities and squared distances. This has been considered in [6, 11]. We use an example from [9, p. 152]. The Department of Psychology at the University of Nijmegen has, or had, nine different areas of research and teaching. Each of the 39 psychologists working in the department ranked all nine areas in order of relevance for their work. The areas are given in Table 1.
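To make the off-diagonal loss function (5) concrete, here is a small NumPy sketch that evaluates the unfolding Stress for given row and column configurations; the preference matrix and configurations are randomly generated stand-ins, not the Roskam data.

```python
import numpy as np

def unfolding_stress(delta, X, Y, W=None):
    """Stress (5) for an n x m off-diagonal dissimilarity matrix delta, with row
    configuration X (n x p) and column configuration Y (m x p)."""
    if W is None:
        W = np.ones_like(delta)
    D = np.sqrt(((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1))  # d_ij(X, Y)
    return np.sum(W * (delta - D) ** 2)

rng = np.random.default_rng(3)
delta = rng.uniform(0, 2, size=(5, 3))   # hypothetical 5 individuals by 3 objects
X = rng.normal(size=(5, 2))              # ideal points
Y = rng.normal(size=(3, 2))              # object points
print(round(unfolding_stress(delta, X, Y), 3))
```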
Figure 1 Metric unfolding of the Roskam data
We apply metric unfolding, in two dimensions, and find the solution in Figure 1. In this analysis, we used the rank orders, more precisely the numbers 0 to 8. Thus, for good fit, first choices should coincide with ideal points. The grouping of the nine areas in the solution is quite natural and shows the contrast between the more scientific and the more humanistic and clinical areas.

Table 1 Nine psychology areas

Area                                                   Plot code
Social Psychology                                      SOC
Educational and Developmental Psychology               EDU
Clinical Psychology                                    CLI
Mathematical Psychology and Psychological Statistics   MAT
Experimental Psychology                                EXP
Cultural Psychology and Psychology of Religion         CUL
Industrial Psychology                                  IND
Test Construction and Validation                       TST
Physiological and Animal Psychology                    PHY
In this case, and in many other cases, the problems we are analyzing suggest that we really are interested in nonmetric unfolding. It is difficult to think of actual applications of metric unfolding, except perhaps in the life and physical sciences. This does not mean that metric unfolding is uninteresting. Most nonmetric unfolding algorithms solve metric unfolding subproblems, and one can often make a case for metric unfolding as a robust form to solve nonmetric unfolding problems. The original techniques proposed by Coombs [4] were purely nonmetric and did not even lead to metric representations. In preference analysis, the prototypical area of application, we often only have ranking information. Each individual ranks a number of candidates, or food samples, or investment opportunities. The ranking information is row-conditional, which means we cannot compare the ranks given by individual i to the ranks given by individual k. The order is defined only within rows. Metric data are generally unconditional, because we can compare numbers both within and between rows. Because of the paucity of information (only rank order, only row-conditional, only off-diagonal), the usual Kruskal approach to
nonmetric unfolding often leads to degenerate solutions, even after clever renormalization and partitioning of the loss function [8]. In Figure 2, we give the solution minimizing

\sigma(X, Y, \hat{\Delta}) = \sum_{i=1}^{n} \frac{\sum_{j=1}^{m} w_{ij}\,\bigl(\hat{\delta}_{ij} - d_{ij}(X, Y)\bigr)^{2}}{\sum_{j=1}^{m} w_{ij}\,\bigl(\hat{\delta}_{ij} - \bar{\hat{\delta}}_{i}\bigr)^{2}}   (6)
over X and Y and over those matrices of disparities \hat{\Delta} whose rows are monotone with the ranks given by the psychologists. Thus, there is a separate monotone regression computed for each of the 39 rows. The solution is roughly the same as the metric one, but there is more clustering and clumping in the plot, and this makes the visual representation much less clear. It is quite possible that continuing to iterate to higher precision will lead to even more degeneracy.

[Figure 2: Nonmetric unfolding of the Roskam data. Joint plot of the 39 psychologists and the nine area points against Dimension 1 and Dimension 2.]

More recently, Busing et al. [2] have adapted the Kruskal approach to nonmetric unfolding by penalizing for the flatness of the monotone regression function. One would expect even more problems when the data are not even rank orders but just binary choices.
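The row-wise monotone regression that produces the disparities in (6) is, on its own, a standard isotonic regression. The sketch below assumes numpy and scikit-learn; the function name and arguments are mine, not part of any unfolding package. It shows only this inner step of a Kruskal-type nonmetric unfolding algorithm: for each row, the current configuration distances are regressed monotonically on that row's ranks.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def row_monotone_disparities(ranks, D):
    """One inner step of nonmetric unfolding: per-row monotone regression.

    ranks : n x m array of within-row rank numbers.
    D     : n x m array of distances for the current configuration.
    Returns n x m disparities, monotone with the ranks of each row and as
    close as possible (least squares) to the current distances.
    """
    dhat = np.empty_like(D, dtype=float)
    for i in range(D.shape[0]):
        iso = IsotonicRegression(increasing=True)
        dhat[i] = iso.fit_transform(ranks[i], D[i])
    return dhat
```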
Suppose n individuals have to choose one alternative from a set of m alternatives. The data can be coded as an indicator matrix, which is an n × m binary matrix with exactly one unit element in each row. The unfolding model says there are n points x_i and m points y_j in ℝ^p such that if individual i picks alternative j, then ‖x_i − y_j‖ ≤ ‖x_i − y_ℓ‖ for all ℓ = 1, . . . , m. More concisely, we use the m points y_j to draw a Voronoi diagram. This is illustrated in Figure 3 for six points in the plane. There is one Voronoi cell for each y_j, and the cell (which can be bounded or unbounded) contains exactly those points that are closer to y_j than to any of the other y_ℓ's. The unfolding model says that individuals are in the Voronoi cells of the objects they pick. This clearly leaves room for a lot of indeterminacy in the actual placement of the points. The situation becomes more favorable if we have more than one indicator matrix, that is, if each individual makes more than one choice. There is a Voronoi diagram for each choice and individuals must be in the Voronoi cells of the object they choose for each of the diagrams. Superimposing the diagrams creates smaller and smaller regions that each individual must be in, and the unfolding model requires the intersection of the Voronoi cells determined by the choices of any individual to be nonempty.
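Checking whether a configuration satisfies this choice model does not require computing the Voronoi diagram explicitly, because lying in the Voronoi cell of y_j is the same as y_j being the nearest object point. A minimal sketch (numpy assumed; the function and variable names are illustrative):

```python
import numpy as np

def choices_consistent(X, Y, choice):
    """X: n x p individual points, Y: m x p object points,
    choice: length-n array with the index of the object each individual picked.
    Returns a boolean array: True where the chosen object is the nearest one,
    i.e., where the individual lies in that object's Voronoi cell."""
    d = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1)  # n x m distances
    return d.argmin(axis=1) == np.asarray(choice)
```

With several choices per individual, the same check is applied once per indicator matrix, and the model holds only if every individual passes all of them, which is the nonempty-intersection condition described above.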
[Figure 3: A Voronoi diagram for six points in the plane.]

[Figure 4: Unfolding binary data. Points labeled by their choice patterns (000, 001, 011, 100, 101, 110, 111), separated by three hyperplanes labeled 1+/1−, 2+/2−, and 3+/3−.]
It is perhaps simplest to apply this idea to binary choices. The Voronoi cells in this case are half spaces defined by hyperplanes dividing ℝ^p into two parts. All individuals choosing the first of the two alternatives must be on one side of the hyperplane, all others must be on the other side. There is a hyperplane for each choice.
This is the nonmetric factor analysis model studied first by [5]. It is illustrated in Figure 4. The prototype here is roll call data [3]. If 100 US senators vote on 20 issues, then the unfolding model says that (for a representation in the plane) there are 100 points and 20 lines, such that each issue-line separates the ‘aye’ and the ‘nay’ voters for
that issue. Unfolding, in this case, can be done by correspondence analysis, or by maximum likelihood logit or probit techniques. We give an example, using 20 issues selected by Americans for Democratic Action, the 2000 US Senate, and the logit technique (Figure 5). The issue lines are the perpendicular bisectors of the lines connecting the 'aye' and 'nay' points of the issues. We see from the figure how polarized American politics is, with almost all lines going through the center and separating Democrats from Republicans.

[Figure 5: The 2000 US senate. Roll call plot of the 100 senator points and the 20 issue lines against Dimension 1 and Dimension 2.]
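As a rough illustration of the logit idea (not the estimator used in [3], and with simulated votes rather than the Senate data), one can model the probability of an 'aye' as a logistic function of a linear score of the voter's point and estimate voter points and issue hyperplanes jointly by gradient descent. Everything below, names and tuning constants included, is an assumption made for the sketch; real analyses also need identification constraints, since the configuration can be rotated, scaled, and translated without changing the likelihood.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, p = 100, 20, 2                       # voters, issues, dimensions

# Simulated roll-call matrix V (1 = 'aye', 0 = 'nay') from a known 2D structure.
x_true = rng.normal(size=(n, p))
w_true = rng.normal(size=(m, p))
b_true = rng.normal(size=m)
V = (rng.random((n, m)) < 1 / (1 + np.exp(-(x_true @ w_true.T + b_true)))).astype(float)

# Parameters to estimate: voter points X, issue normals W, issue offsets b.
X = rng.normal(scale=0.1, size=(n, p))
W = rng.normal(scale=0.1, size=(m, p))
b = np.zeros(m)

step = 0.05
for _ in range(2000):
    prob = 1 / (1 + np.exp(-(X @ W.T + b)))   # P(aye), an n x m matrix
    G = prob - V                               # gradient of -loglik w.r.t. the linear score
    X -= step * G @ W / m
    W -= step * G.T @ X / n
    b -= step * G.mean(axis=0)
# Each row of W with its offset in b defines an issue hyperplane separating
# predicted 'aye' from 'nay' voters in the estimated configuration X.
```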
References

[1] Borg, I. & Groenen, P.J.F. (1997). Modern Multidimensional Scaling: Theory and Applications, Springer, New York.
[2] Busing, F.M.T.A., Groenen, P.J.F. & Heiser, W.J. (2004). Avoiding degeneracy in multidimensional unfolding by penalizing on the coefficient of variation, Psychometrika (in press).
[3] Clinton, J., Jackman, S. & Rivers, D. (2004). The statistical analysis of roll call data, American Political Science Review 98, 355–370.
[4] Coombs, C.H. (1964). A Theory of Data, Wiley.
[5] Coombs, C.H. & Kao, R.C. (1955). Nonmetric Factor Analysis, Engineering Research Bulletin 38, Engineering Research Institute, University of Michigan, Ann Arbor.
[6] Greenacre, M.J. & Browne, M.W. (1986). An efficient alternating least-squares algorithm to perform multidimensional unfolding, Psychometrika 51, 241–250.
[7] Heiser, W.J. (1981). Unfolding analysis of proximity data, Ph.D. thesis, University of Leiden.
[8] Kruskal, J.B. & Carroll, J.D. (1969). Geometrical models and badness of fit functions, in Multivariate Analysis, Vol. II, P.R. Krishnaiah, ed., North Holland, pp. 639–671.
[9] Roskam, E.E.C.H.I. (1968). Metric analysis of ordinal data in psychology, Ph.D. thesis, University of Leiden.
[10] Schönemann, P.H. (1970). On metric multidimensional unfolding, Psychometrika 35, 349–366.
[11] Takane, Y., Young, F.W. & De Leeuw, J. (1977). Nonmetric individual differences in multidimensional scaling: an alternating least-squares method with optimal scaling features, Psychometrika 42, 7–67.
JAN DE LEEUW
Multigraph Modeling HARRY J. KHAMIS Volume 3, pp. 1294–1296 in Encyclopedia of Statistics in Behavioral Science ISBN-13: 978-0-470-86080-9 ISBN-10: 0-470-86080-4 Editors Brian S. Everitt & David C. Howell John Wiley & Sons, Ltd, Chichester, 2005
Multigraph Modeling
Multigraph modeling is a graphical technique used to (1) identify whether a hierarchical loglinear model (HLM) is decomposable or not, (2) obtain an explicit factorization of joint probabilities in terms of marginal probabilities for decomposable HLMs, and (3) interpret the conditional independence structure (see Probability: An Introduction) of a given HLM of a contingency table. Consider a multiway contingency table and a corresponding HLM (see e.g., [1–3], or [11]). The HLM is uniquely characterized by its generating class (or minimal sufficient configuration), which establishes the correspondence between the model parameters and the associated minimal sufficient statistics. As an example, consider the model of conditional independence for a three-way contingency table (using Agresti's [1] notation):

\log m_{ijk} = \mu + \lambda^{1}_{i} + \lambda^{2}_{j} + \lambda^{3}_{k} + \lambda^{13}_{ik} + \lambda^{23}_{jk},   (1)
where mijk denotes the expected cell frequency for the ith row, j th column, and kth layer of the contingency table, and the parameters on the right side of the equation represent certain contrasts of log mijk . The generating class for this model is denoted by [13][23]; it corresponds to the inclusion-maximal sets of indices in the model, {1, 3} and {2, 3}, referred to as the generators of the model. This model represents conditional independence of factors 1 and 2 given factor 3, written as [1 ⊗ 2|3]. A graph G is a mathematical object that consists of two sets: (1) a set of vertices, V, and (2) a set of edges, E, consisting of pairs of elements taken from V. The diagram of the graph is a picture in which a circle or dot or some other symbol represents a vertex and a line represents an edge. The generator multigraph, or simply multigraph, M of an HLM is a graph in which the vertex set consists of the set of generators of the HLM, and two vertices are joined by edges that are equal in number to the number of indices shared by them. The multigraph for the model in (1) is given in Figure 1. The vertices of this multigraph consist of the two generators of the model {1, 3} and {2, 3}, and there is a single edge joining the two vertices because {1, 3} ∩ {2, 3} = {3}. Note the one-to-one correspondence between the generating class and the multigraph representation.
[Figure 1: Generator multigraph for the loglinear model [13][23]: the two generators {1, 3} and {2, 3} joined by a single edge.]
A maximum spanning tree T of a multigraph is a connected graph with no circuits (or closed loops) that includes each vertex of the multigraph such that the sum of all of the edges is maximum. Each maximum spanning tree consists of a family of sets of factor indices called the branches of the tree. For the multigraph in Figure 1, the maximum spanning tree is the edge (branch) joining the two vertices and is denoted by T = {3}. An edge cutset of a multigraph is an inclusion-minimal set of multiedges whose removal disconnects the multigraph. For the model [13][23] with the multigraph given in Figure 1, there is a single-edge cutset that disconnects the two vertices, {3}, and it is the minimum number of edges that does so. General results concerning the multigraph are given as follows. An HLM is decomposable if and only if [10]

d = \sum_{S \in V(T)} |S| - \sum_{S \in B(T)} |S|,   (2)

where d is the number of factors in the contingency table, T is any maximum spanning tree of the multigraph, V(T) and B(T) are the set of vertices and set of branches of T, respectively, and S ∈ V(T) and S ∈ B(T) represent the factor indices contained in V(T) and B(T), respectively. For decomposable models, the joint distribution for the associated contingency table is [10]:

P[v_1, v_2, \ldots, v_d] = \frac{\prod_{S \in V(T)} P[v : V \in S]}{\prod_{S \in B(T)} P[v : V \in S]},   (3)

where P[v_1, v_2, . . . , v_d] represents the probability associated with level v_1 of the first factor, level v_2 of the second factor, and so on, and level v_d of the dth factor. P[v : V ∈ S] denotes the marginal probability indexed on those indices contained in S (summing over all other indices). From this factorization, an explicit formula for the maximum likelihood estimator (see Maximum Likelihood Estimation) can be obtained (see, e.g., [2]).
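The factorization in (3) is easy to check numerically. The sketch below (numpy assumed; the probabilities are made up purely for illustration) builds a 2 × 2 × 2 table satisfying [1 ⊗ 2|3] and verifies that p_{ijk} = p_{i+k} p_{+jk}/p_{++k}, which is the form (3) takes with T = {3} as the single branch.

```python
import numpy as np

# A joint table P(1, 2, 3) with factors 1 and 2 conditionally independent given 3.
p3 = np.array([0.4, 0.6])                       # P(factor 3)
p1_g3 = np.array([[0.7, 0.3], [0.2, 0.8]])      # P(factor 1 | factor 3), row = level of 3
p2_g3 = np.array([[0.5, 0.5], [0.1, 0.9]])      # P(factor 2 | factor 3)
p = np.einsum("k,ki,kj->ijk", p3, p1_g3, p2_g3)

p_13 = p.sum(axis=1)          # P[v : V in {1, 3}], a vertex marginal
p_23 = p.sum(axis=0)          # P[v : V in {2, 3}], the other vertex marginal
p_3 = p.sum(axis=(0, 1))      # P[v : V in {3}], the branch marginal

reconstructed = p_13[:, None, :] * p_23[None, :, :] / p_3[None, None, :]
assert np.allclose(reconstructed, p)    # factorization (3) reproduces the joint table
```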
In order to identify the conditional independence structure of a given HLM, the branches and edge cutsets of the multigraph are used. For a given multigraph M and set of factors S, construct the multigraph M/S by removing each factor of S from each generator (vertex in the multigraph) and removing each edge corresponding to that factor. For decomposable models, S is chosen to be a branch of any maximum spanning tree of M, and for nondecomposable models, S is chosen to be the factors corresponding to an edge cutset of M. Then, the conditional independence interpretation for the HLM is the following: the sets of factors in the disconnected components of M/S are mutually independent, conditional on S [10]. For the multigraph in Figure 1, d = 3, T = {3}, V(T) = {{1, 3}, {2, 3}}, and B(T) = {3}. By (2), 3 = (2 + 2) − 1, so that the HLM [13][23] is decomposable, and by (3), the factorization of the joint distribution is (using simpler notation) p_{ijk} = p_{i+k}\,p_{+jk}/p_{++k}. Upon removing S = {3} from the multigraph, the sets of factors in the disconnected components of M/S are {1} and {2}. Hence, the interpretation of [13][23] is [1 ⊗ 2|3]. Edwards and Kreiner [6] analyzed a set of data in the form of a five-way contingency table from an investigation conducted at the Institute for Social Research, Copenhagen. A sample of 1592 employed men, 18 to 67 years old, were asked whether in the preceding year they had done any work they would have previously paid a craftsman to do. The variables included in the study are shown in Table 1. One of the HLMs that fits the data well is [AME][RME][AMT]. The multigraph for this model is given in Figure 2.
[Figure 2: Generator multigraph for the loglinear model [AME][RME][AMT]: the three generators AME, RME, and AMT, with edges between each pair equal in number to the factors they share.]
Table 1

Variable             Symbol    Levels
Age                  A         <30, 31–45, 46–67
Response             R         Yes, no
Mode of residence    M         Rent, own
Employment           E         Skilled, unskilled, other
Type of residence    T         Apartment, house

[Figure 3: Multigraphs M/S and conditional independence interpretations for the loglinear model [AME][RME][AMT] with branches: (a) S = {M, E}, with disconnected components {A, T} and {R}, giving [A, T ⊗ R | M, E]; (b) S = {A, M}, with disconnected components {R, E} and {T}, giving [R, E ⊗ T | A, M].]
The maximum spanning tree for this multigraph is T = {{M, E}, {A, M}}, where V(T) = {{A, M, E}, {R, M, E}, {A, M, T}} and B(T) = {{M, E}, {A, M}}. From (2), 5 = (3 + 3 + 3) − (2 + 2), indicating that [AME][RME][AMT] is a decomposable model, and the factorization of the joint probability can be obtained directly from (3). The multigraph M/S that results from choosing the branch S = {M, E} is shown in Figure 3(a). The resulting interpretation is [A, T ⊗ R|M, E]. Analogously, for S = {A, M}, we have [R, E ⊗ T|A, M] (Figure 3b). The multigraph is typically smaller (i.e., has fewer vertices) than the first-order interaction graph introduced by Darroch et al. [4] (see also [5], [8], and [9]), especially for contingency tables with many factors and few generators. For the theoretical development of multigraph modeling, see [10]. For a further description of the application of multigraph modeling and additional examples, see [7].
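A small script can reproduce the decomposability check in (2) for this example. The sketch below assumes the networkx library; the representation of generators as frozensets and all variable names are choices made here for illustration, and it presumes the generator multigraph is connected.

```python
import networkx as nx

# Generators of the model [AME][RME][AMT]; edge weight = number of shared factors.
generators = [frozenset("AME"), frozenset("RME"), frozenset("AMT")]
d = len(set().union(*generators))       # number of factors in the table, here 5

G = nx.Graph()
G.add_nodes_from(generators)
for i, g1 in enumerate(generators):
    for g2 in generators[i + 1:]:
        if g1 & g2:
            G.add_edge(g1, g2, weight=len(g1 & g2))

T = nx.maximum_spanning_tree(G)         # any maximum spanning tree will do
vertex_sum = sum(len(v) for v in T.nodes)                    # sum of |S| over V(T)
branch_sum = sum(w for _, _, w in T.edges.data("weight"))    # sum of |S| over B(T)

decomposable = d == vertex_sum - branch_sum   # 5 == (3 + 3 + 3) - (2 + 2) -> True
```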
References

[1] Agresti, A. (2000). Categorical Data Analysis, 2nd Edition, Wiley, New York.
[2] Bishop, Y.M.M., Fienberg, S.E. & Holland, P.W. (1975). Discrete Multivariate Analysis: Theory and Practice, MIT Press, Cambridge.
[3] Christensen, R. (1990). Log-linear Models, Springer-Verlag, New York.
[4] Darroch, J.N., Lauritzen, S.L. & Speed, T.P. (1980). Markov fields and log-linear interaction models for contingency tables, Annals of Statistics 8, 522–539.
[5] Edwards, D. (1995). Introduction to Graphical Modelling, Springer-Verlag, New York.
[6] Edwards, D. & Kreiner, S. (1983). The analysis of contingency tables by graphical models, Biometrika 70, 553–565.
[7] Khamis, H.J. (1996). Application of the multigraph representation of hierarchical log-linear models, in Categorical Variables in Developmental Research: Methods of Analysis, Chap. 11, A. von Eye & C.C. Clogg, eds, Academic Press, New York.
[8] Khamis, H.J. & McKee, T.A. (1997). Chordal graph models of contingency tables, Computers and Mathematics with Applications 34, 89–97.
[9] Lauritzen, S.L. (1996). Graphical Models, Oxford University Press, New York.
[10] McKee, T.A. & Khamis, H.J. (1996). Multigraph representations of hierarchical loglinear models, Journal of Statistical Planning and Inference 53, 63–74.
[11] Wickens, T.D. (1989). Multiway Contingency Tables Analysis for the Social Sciences, Erlbaum, Hillsdale.
HARRY J. KHAMIS
Multilevel and SEM Approaches to Growth Curve Modeling
JOOP HOX AND REINOUD D. STOEL
Volume 3, pp. 1296–1305 in Encyclopedia of Statistics in Behavioral Science ISBN-13: 978-0-470-86080-9 ISBN-10: 0-470-86080-4 Editors Brian S. Everitt & David C. Howell John Wiley & Sons, Ltd, Chichester, 2005
Multilevel and SEM Approaches to Growth Curve Modeling Introduction A broad range of statistical methods exists for analyzing data from longitudinal designs (see Longitudinal Data Analysis). Each of these methods has specific features and the use of a particular method in a specific situation depends on such things as the type of research, the research question, and so on. The central concern of longitudinal research, however, revolves around the description of patterns of stability and change, and the explanation of how and why change does or does not take place [9]. A common design for longitudinal research in the social sciences is the panel or repeated measures design (see Repeated Measures Analysis of Variance), in which a sample of subjects is observed at more than one point in time. If all individuals provide measurements at the same set of occasions, we have a fixed occasions design. When occasions are varying, we have a set of measures taken at different points in time for different individuals. Such data occur, for instance, in growth studies, where individual measurements are collected for a sample of individuals at different occasions in their development (see Growth Curve Modeling). The data collection could be at fixed occasions, but the individuals have different ages. The distinction between fixed occasions designs and varying occasions designs is important, since they may lead to different analysis methods. Several distinct statistical techniques are available for the analyses of panel data. In recent years, growth curve modeling has become popular [11, 12, 13, 24]. All subjects in a given population are assumed to have developmental curves of the same functional form (e.g., all linear), but the parameters describing their curves may differ. With linear developmental curves, for example, there may be individual differences in the initial level as well as in the growth rate or rate of change. Growth curve analysis is a statistical technique to estimate these parameters. Growth curve analysis is used to obtain a description of the mean growth in a population over a specific period of time (see Growth Curve Modeling). However, the
main emphasis lies in explaining variability between subjects in the parameters that describe their growth curves, that is, in interindividual differences in intraindividual change [25]. The model on which growth curve analysis is based, the growth curve model, can be approached from several perspectives. On the one hand, the model can be constructed as a standard two-level multilevel regression (MLR) model [4, 5, 20] (see Linear Multilevel Models). The repeated measures are positioned at the lowest level (level-1 or the occasion level), and are then treated as nested within the individuals (level-2 or the individual level), the same way as a standard cross-sectional multilevel model treats children as being nested within classes. The model can therefore be estimated using standard MLR software. On the other hand, the model can be constructed as a structural equation model (SEM). Structural equation modeling uses latent variables to account for the relations between the observed variables, hence the name latent growth curve (LGC) model. The two approaches can be used to formulate equivalent models, providing identical estimates for a given data set [3].
The Longitudinal Multilevel or Latent Growth Curve Model Both MLR and LGC incorporate the factor ‘time’ explicitly. Within the MLR framework time is modeled as an independent variable at the lowest level, the individual is defined at the second level, and explanatory variables can be included at all existing levels. The intercept and slope describe the mean growth. Interindividual differences in the parameters describing the growth curve are modeled as random effects for the intercept and slope of the time variable. The LGC approach adopts a latent variable view. Time is incorporated as specific constrained values for the factor loadings of the latent variable that represents the slope of the growth curve; all factor loadings of the latent variable that represents the intercept are constrained to the value of 1. The latent variable means for the intercept and slope factor describe the mean growth. Interindividual differences in the parameters describing the growth curve are modeled as the (co)variances of the intercept and slope factors. The mean and covariance structure of the latent variables in LGC analysis correspond to the fixed and
random effects in MLR analysis, and this makes it possible to specify exactly the same model as a LGC or MLR model [23]. If this is done, exactly the same parameter estimates will emerge, as will be illustrated in the example. The general growth curve model, for the repeatedly measured variable y_ti of individual i at occasion t, may be written as:

y_{ti} = \lambda_{0t}\,\eta_{0i} + \lambda_{1t}\,\eta_{1i} + \gamma_{2t}\,x_{ti} + \varepsilon_{ti}
\eta_{0i} = \nu_{0} + \gamma_{0} z_{i} + \zeta_{0i}                    (1)
\eta_{1i} = \nu_{1} + \gamma_{1} z_{i} + \zeta_{1i},

where λ1t denotes the time of measurement and λ0t a constant equal to the value of 1. Note that in a fixed occasions design λ1t will typically be a consecutive series of integers (e.g., [0, 1, 2, . . . , T]), equal for all individuals, while in a varying occasions design λ1t can take on different values across individuals. The individual intercept and slope of the growth curve are represented by η0i and η1i, respectively, with expectations ν0 and ν1, and random departures or residuals, ζ0i and ζ1i, respectively. γ2t represents the effect of the time-varying covariate x_ti; γ0 and γ1 are the effects of the time-invariant covariate on the initial level and linear slope. Time-specific deviations are represented by the independent and identically normally distributed ε_ti, with variance σ_ε². The variances of ζ0i and ζ1i, and their covariance, are collected in the covariance matrix

\Sigma_{\zeta} = \begin{pmatrix} \sigma_{0}^{2} & \sigma_{01} \\ \sigma_{01} & \sigma_{1}^{2} \end{pmatrix}.   (2)

Furthermore, it is assumed that cov(ε_ti, ε_t′i) = 0 for t ≠ t′, cov(ε_ti, η_0i) = 0, and cov(ε_ti, η_1i) = 0. Within the longitudinal MLR model η0i and η1i are the random parameters, and λ1t is an observed variable representing time. In the LGC model η0i and η1i are the latent variables and λ0t and λ1t are parameters, that is, factor loadings. Thus, the only difference between the models is the way time is incorporated in the model. In the MLR model time is introduced as a fixed explanatory variable, whereas in the LGC model it is introduced via the factor loadings. So, in the longitudinal MLR model an additional variable is added, and in the LGC model the factor loadings for the repeatedly measured variable are constrained in such a way that they represent time. The consequence of this is that with reference to the basic
growth curve model, MLR is essentially a univariate approach, with time points treated as observations of the same variable, whereas the LGC model is essentially a multivariate approach, with each time point treated as a separate variable [23]. Figure 1 presents a path diagram depicting a LGC model for four measurement occasions, for simplicity without covariates. Following SEM conventions, the first path for the latent slope factor, which is constrained to equal zero, is usually not present in the diagram. The specific ways MLR and LGC model 'time' have certain consequences for the analysis. In the LGC approach, λ1t cannot vary between subjects, which makes it best suited for a fixed occasions design. LGC modeling can be used for designs with varying occasions by modeling all existing occasions and viewing the varying occasions as a missing data problem, but when the number of existing cases is large this approach becomes unmanageable. In the MLR approach, λ1t is simply a time-varying explanatory variable that can take on any value, which makes MLR the best approach if there are a large number of varying occasions. There are also some differences between the LGC and MLR approach in the ways the model can be extended. In the LGC approach, it is straightforward to embed the LGC model in a larger path model, for instance, by combining several growth curves in one model, or by using the intercept and slope factors as predictors for outcome values measured at a later occasion. The MLR approach does not deal well with such extended models. On the other hand, in the MLR approach it is simple to add more levels, for instance to model a growth process of pupils nested in classes nested in schools. In the LGC approach, it is possible to embed a LGC model in a two-level structural equation model [14], but adding more levels is problematic.
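To make the MLR formulation concrete, the following minimal sketch simulates linear growth of the form of (1) (without covariates) and fits it as a two-level mixed model with a random intercept and a random slope for time. It assumes numpy, pandas, and statsmodels; the sample size, parameter values, and variable names are illustrative only and are not taken from any real study.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)
N, T = 300, 4                                        # individuals, fixed occasions 0..3
nu0, nu1, sd0, sd1, sde = 10.0, 2.0, 0.9, 0.8, 0.5   # illustrative parameter values

eta0 = nu0 + rng.normal(0, sd0, N)                   # individual intercepts
eta1 = nu1 + rng.normal(0, sd1, N)                   # individual slopes
ids = np.repeat(np.arange(N), T)
time = np.tile(np.arange(T), N)
y = eta0[ids] + eta1[ids] * time + rng.normal(0, sde, N * T)

data = pd.DataFrame({"id": ids, "time": time, "y": y})
model = smf.mixedlm("y ~ time", data, groups=data["id"], re_formula="~time")
print(model.fit().summary())   # fixed effects estimate nu0 and nu1; the random-effect
                               # (co)variances estimate the intercept and slope variances
```

An equivalent LGC specification would instead declare four observed variables y0 to y3, an intercept factor with loadings fixed to 1, and a slope factor with loadings fixed to 0, 1, 2, 3, in SEM software.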
Example

We will illustrate the application of both the MLR model and the LGC model using a hypothetical study in which data on the language acquisition of 300 children were collected during primary school at 4 consecutive occasions. Besides this, data were collected on the children's intelligence, as well as, on each occasion, a measure of their emotional well-being. The same data have been analyzed by Stoel et al. [23], who also discuss extensions and applications of both models.
[Figure 1: Path diagram of a four-wave latent growth curve model. An intercept factor η0 with loadings λ00 = λ01 = λ02 = λ03 = 1 and a slope factor η1 with loadings λ10 = 0, λ11 = 1, λ12 = 2, λ13 = 3, connected by the covariance σ01, load on the observed variables y0 to y3, each with a residual e0 to e3.]

The aim of the study is testing the hypothesis that there exists substantial growth in language acquisition, and that there is substantial variability between the children in their growth curves. Given interindividual differences in the growth curves, it is hypothesized that intelligence explains (part of) the interindividual variation in the growth curves and that emotional well-being explains the time-specific deviations from the mean growth curve. The covariance matrix and mean vector are presented in Table 1. Analyzing these data using both the MLR and LGC model with Maximum Likelihood estimation leads to the parameter estimates presented in Table 2.

Table 1 Covariance matrix and means vector

       y1      y2      y3      y4      x1      x2      x3      x4      z       means
y1     1.58                                                                     9.83
y2     1.28    4.15                                                            11.72
y3     1.52    4.91    8.90                                                    13.66
y4     1.77    6.83   11.26   17.50                                            15.65
x1     0.99   −0.08   −0.26   −0.33    2.17                                     0.00
x2     0.16    1.44    0.17    0.18    0.13    2.56                             0.00
x3     0.07   −0.13    1.20   −0.27   −0.06   −0.09    2.35                     0.00
x4     0.08    0.23    0.46    1.98    0.03   −0.04    0.10    2.24             0.00
z      0.34    1.28    2.01    3.01   −0.08    0.07    0.01    0.06    0.96     0.00

Note: y = language acquisition, x = emotional well-being, z = intelligence.
Table 2 Maximum likelihood estimates of the parameters of (1), using multilevel regression and latent growth curve analysis

Parameter        MLR              LGC
Fixed part
  ν0             9.89 (.06)       9.89 (.06)
  ν1             1.96 (.05)       1.96 (.05)
  γ0             0.40 (.06)       0.40 (.06)
  γ1             0.90 (.05)       0.90 (.05)
  γ2             0.55 (.01)       0.55 (.01)
Random part
  σε²            0.25 (.01)       0.25 (.01)
  σ0²            0.78 (.08)       0.78 (.08)
  σ1²            0.64 (.06)       0.64 (.06)
  σ01            0.00 (.05)       0.00 (.05)

Note: Standard errors are given in parentheses. The chi-square test of model fit for the LGC model: χ²(25) = 37.84 (p = .95); RMSEA = 0.00. For the MLR model: −2 log-likelihood = 3346.114.
The first column of Table 2 presents the relevant parameters; the second and third columns show the parameter estimates of, respectively, the MLR and LGC models. As one can see in Table 2, the parameter estimates are the same and, consequently, both approaches would lead to the same substantive conclusions. According to the overall measure of fit provided by SEM, the model seems to fit the data quite well. Thus, the conclusions can be summarized as follows. After controlling for the effect of the covariates, a mean growth curve emerges with an initial level of 9.89 and a growth rate of 1.96. The significant variation between the subjects around these mean values implies that subjects start their growth process at different values and grow subsequently at different rates. The correlation between initial level and growth rate is zero. In other words, the initial level has no predictive value for the growth rate. Intelligence has a positive effect on both the initial level and growth rate, leading to the conclusion that children who are more intelligent show a higher score at the first measurement occasion and a greater increase in language acquisition than children with lower intelligence. Emotional well-being explains the time-specific deviations from the mean growth curve. That is, children with a higher emotional well-being at a specific time point show a higher score on language acquisition than is predicted by their growth curve.
Dichotomous and Ordinal Data

Both conventional structural equation modeling and multilevel regression analysis assume that the outcome variable(s) are continuous and have a (multivariate) normal distribution. In practice, many variables are measured as ordinal categorical variables, for example, the responses on a five- or seven-point Likert attitude question. Often, researchers treat such variables as if they were continuous and normal variables. If the number of response categories is fairly large and the response distribution is symmetric, treating these variables as continuous normal variables appears to work quite well. For instance, Bollen and Barb [2] show in a simulation study that if bivariate normal variables are categorized into at least five response categories, the difference between the correlation between the original variables and the correlation of the categorized variables is small (see Categorizing Data). Johnson and Creech [7] show that this also holds for parameter estimates and model fit. However, when the number of categories is smaller than five, the distortion becomes sizable. It is clear that such variables require special treatment. A categorical ordinal variable can be viewed as a crude observation of an underlying latent variable. The same model that is used for the continuous variables is used, but it is assumed to hold for the underlying latent response. The residuals are assumed to have a standard normal distribution, or a logistic distribution. The categories of the ordinal variable arise from applying thresholds to the latent continuous variable. Assume that we have an ordered categorical variable with three categories, for example, 'disagree', 'neutral', and 'agree'. The relation of this variable to the underlying normal latent variable is depicted in Figure 2. The position on the latent variable determines which categorical response is observed. Specifically,

y_i = \begin{cases} 1, & \text{if } y_i^{*} \le \tau_1 \\ 2, & \text{if } \tau_1 < y_i^{*} \le \tau_2 \\ 3, & \text{if } \tau_2 < y_i^{*}, \end{cases}   (3)

where y_i is the observed categorical variable, y_i^{*} is the latent continuous variable, and τ1 and τ2 are the thresholds. Note that a dichotomous variable only has one threshold, which becomes the intercept in a regression equation.

[Figure 2: Illustration of two thresholds (τ1 and τ2) underlying a three-category variable: 1 = Disagree, 2 = Neutral, 3 = Agree.]

To analyze ordered categorical data in a multilevel regression context, the common approach is
to assume a normal or a logistic distribution (see Catalogue of Probability Density Functions). There are several estimation methods available; most software relies on a Taylor series approximation to the likelihood, although some software is capable of numerical integration of the likelihood. For a discussion of estimation and computational details in nonnormal multilevel models, we refer to the literature, for example [4, 20]. The common approach in structural equation modeling is to estimate polychoric correlations, that is, the correlations between the underlying latent responses, and apply conventional SEM methods. This approach also assumes a standard normal distribution for the residuals. For statistical issues and computational details, we refer to the literature, for example [1, 8] (see Structural Equation Modeling: Categorical Variables). The important point here for both types of modeling is that when the number of categories of an outcome variable is small, using an analysis approach that assumes continuous variables may lead to strongly biased results.
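The threshold mechanism in (3) is easy to mimic directly; the snippet below (numpy assumed, thresholds chosen arbitrarily for illustration) generates latent standard normal responses and cuts them into three observed categories.

```python
import numpy as np

rng = np.random.default_rng(0)
y_star = rng.normal(size=1000)          # latent continuous responses y*
tau = np.array([-0.5, 0.5])             # illustrative thresholds tau1 < tau2
y_obs = np.digitize(y_star, tau) + 1    # observed categories: 1, 2, or 3
```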
To give an indication of the extent of the bias, the outcome variables of the language acquisition example were dichotomized on their common median. This leads to a data set where the proportions of 'zero' and 'one' scores on the four Y variables taken together are 50% each, with the proportion of 'one' scores going up from 0.04 at the first occasion, through 0.44 and 0.72, to 0.81 at the last occasion. Table 3 presents the results of three multilevel analyses for a model with a linear random effect for time: (1) Maximum Likelihood estimation on a standardized continuous outcome Z (mean zero and variance one across Y1 to Y4) assuming normality, (2) Maximum Likelihood estimation on the dichotomized outcome D assuming normality, and (3) approximate Maximum Likelihood estimation on the dichotomized outcome D assuming a logistic model and using a Laplace approximation (Laplace6 in HLM5, [22]). The parameter estimates are different across all three methods. This is not surprising, because the Z-scores are standardized to a mean of 0 and a variance of 1, the dichotomous variables have an overall mean
Table 3 Estimates of the parameters of (1), using multilevel regression and different estimation methods

Parameter      ML on Z assuming normality    ML on D assuming normality    Approximate ML on D assuming logistic
Fixed part
  ν0                 −0.82                        0.11                          −2.44
  ν1                  0.54                        0.26                           1.80
Random part
  σε²                 0.07                        0.10                           –
  σ0²                 0.07                        0.01                           0.24
  σ1²                 0.12                        0.01                           0.60
  σ01                 0.03                        0.01                           0.38
of 0.5 and a variance of 0.25, and the underlying continuous variable y* has a mean of 0 and a residual variance of approximately 3.29. However, statistical tests on the variance components lead to drastically different conclusions. An ordinary likelihood-ratio test on the variances is not appropriate, because the null hypothesis is a value that lies on the boundary of the parameter space. Instead, we apply the chi-square test described by Raudenbush and Bryk [21], which is based on the residuals. The multilevel regression analysis on the continuous Z-scores leads to significant variance components (p < 0.01). The multilevel logistic regression analysis on the dichotomized D-scores leads to variance components that are also significant (p < 0.01). On the other hand, the multilevel regression analysis on the dichotomized D-scores using standard Maximum Likelihood estimation for continuous outcomes leads to variance components that are not significant (p > 0.80). Thus, for our example data, using standard Maximum Likelihood estimation assuming a continuous outcome on the dichotomized variable leads to substantively different and in fact misleading conclusions.
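A quick way to see this kind of distortion for oneself is to simulate growth data, dichotomize them at the common median, and compare the random-effect estimates from a model that (incorrectly) treats the dichotomous outcome as continuous with those from the continuous analysis. The sketch below assumes numpy, pandas, and statsmodels, uses made-up parameter values, and omits the logistic variant, which requires software with numerical integration or a Laplace approximation as mentioned above.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
N, T = 300, 4
ids = np.repeat(np.arange(N), T)
time = np.tile(np.arange(T), N)
y = (10 + rng.normal(0, 1, N)[ids]              # random intercepts
     + (2 + rng.normal(0, 0.8, N)[ids]) * time   # random slopes
     + rng.normal(0, 0.5, N * T))                # time-specific residuals

z = (y - y.mean()) / y.std()             # standardized continuous outcome
d = (y > np.median(y)).astype(float)     # dichotomized at the common median

for label, outcome in [("Z", z), ("D", d)]:
    data = pd.DataFrame({"id": ids, "time": time, "y": outcome})
    fit = smf.mixedlm("y ~ time", data, groups=data["id"], re_formula="~time").fit()
    print(label, fit.cov_re.round(3))    # random intercept/slope (co)variances shrink badly for D
```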
Extensions The model in (1) can be extended in a number of ways. We will describe some of these extensions in this section separately, but they can in fact be combined in one model.
Extending the Number of Levels

First, let us assume that we have collected data on several occasions from individuals within classes, and that there are (systematic) differences between classes in intercept and slope. The model in (1) can easily account for such a 'three-level' structure by adding the class-specific subscript j. The model then becomes:

y_{tij} = \lambda_{0t}\,\eta_{0ij} + \lambda_{1t}\,\eta_{1ij} + \gamma_{2t}\,x_{tij} + \varepsilon_{tij}
\eta_{0ij} = \nu_{0j} + \gamma_{0} z_{i} + \zeta_{0ij}
\eta_{1ij} = \nu_{1j} + \gamma_{1} z_{i} + \zeta_{1ij}                    (4)
\nu_{0j} = \nu_{0} + \zeta_{2j}
\nu_{1j} = \nu_{1} + \zeta_{3j},

and reflects the fact that the mean intercept and slope (ν0j and ν1j, respectively) may be different across classes. Note that (4) turns into (1) if ζ2j and ζ3j are constrained to zero. Incorporating class-level covariates and additional higher levels in the hierarchy is straightforward.

Extending the Measurement Model

Secondly, the model in (1) can be easily extended to include multiple indicators of a construct at each occasion explicitly. Essentially this extension merges the growth curve model with a measurement (latent factor) model at each occasion. For example, if y_ti represents a mean score over R items, we may recognize that y_ti = (1/R) \sum_{r=1}^{R} y_rti. Instead of modeling observed item parcels, the R individual items y_rti can be modeled directly, that is, explicitly as indicators of a latent construct or factor at each measurement occasion. A growth model may then be constructed to explain the variance and covariance among the first-order latent factors. This approach has been termed second-order growth modeling, in contrast to first-order growth modeling on the observed indicators. Different names for the same model are 'curve-of-factors model' and 'multiple indicator latent growth model' [12]. Note that modeling multiple indicators in a longitudinal setting requires a test on the structure of the measurements, that is, a test of measurement invariance or factorial invariance [1, 8]. The model incorporating all y_rti explicitly then becomes:

y_{rti} = \alpha_{r} + \lambda_{r}\,\eta_{ti} + \varepsilon_{ri}
\eta_{ti} = \lambda_{0t}\,\eta_{0i} + \lambda_{1t}\,\eta_{1i} + \gamma_{2t}\,x_{ti} + \zeta_{ti}
\eta_{0i} = \nu_{0} + \gamma_{0} z_{i} + \zeta_{0i}                    (5)
\eta_{1i} = \nu_{1} + \gamma_{1} z_{i} + \zeta_{1i},

where αr and λr represent, respectively, the item-specific intercept and factor loading of item r, and εri is a residual. ηti is an individual- and time-specific latent factor corresponding to y_ti of Model (1), and ζti is a random deviation corresponding to εti of Model (1). The growth curve model is subsequently built on the latent factor scores ηti, with λ1t representing the time of measurement and λ0t a constant equal to the value of 1. This model thus allows for a separation of measurement error εri and individual time-specific
deviation ζti. In Model (1) these components are confounded in εti.
Nonlinear Growth

The model discussed so far assumes linear growth. The factor time is incorporated explicitly in the model by constraining λ1t in (1) to known values that represent the occasions at which the subjects were measured. This is, however, not a necessary restriction; it is possible to estimate a more general, that is nonlinear, model in which values of λ1t are estimated (see Nonlinear Mixed Effects Models; Nonlinear Models). Thus, instead of constraining λ1t to, for example, [0, 1, 2, 3, . . . , T], some elements are left free to be estimated, providing information on the shape of the growth curve. For purposes of identification, at least two elements of λ1t need to be fixed. The remaining values are then estimated to provide information on the shape of the curve; λ1t then becomes [0, 1, λ12, λ13, . . . , λ1T−1]. So, essentially, a linear model is estimated, while the nonlinear interpretation comes from relating the estimated λ1t to the real time frame [13, 24].
Further Extensions

Further noteworthy extensions of the standard growth model in (1), which we will briefly sum up here, are:

• The assumption of independent and identically distributed residuals can be relaxed. In other words, the model in (1) may incorporate a more complex type of residual structure. In fact, any type of residual structure can be implemented, provided the resulting model is identified.
• As stated earlier, the assumption that all individuals have been measured at the same measurement occasions as implied by Model (1) can be relaxed by giving λ1t in (1) an individual subscript i. λ1ti can subsequently be partly, or even completely, different across individuals. However, using LGC modeling requires that we treat different λ1ti's across subjects as a balanced design with missing data (i.e., that not all subjects have been measured at all occasions), and assumptions about the missing data mechanism need to be made.
• Growth mixture modeling provides an interesting extension of conventional growth curve analysis. By incorporating a categorical latent variable it is possible to represent a mixture of subpopulations where population membership is not known but instead must be inferred from the data [15, 16, 18]. See Li et al. [10] for a didactic example of this methodology.
Estimation and Software

When applied to longitudinal data as described above, the MLR and LGC model are identical; they only differ in their representation. However, these models come from different traditions, and the software was originally developed to analyze different problems. This has consequences for the way the data are entered into the program, the choices the analyst must make, and the ease with which specific extensions of the model are handled by the software. LGC modeling is a special case of the general approach known as structural equation modeling (SEM). Structural equation modeling is inherently a multivariate analysis method, and it is therefore straightforward to extend the basic model with other (latent or observed) variables. Standard SEM software abounds with options to test the fit of the model, compare groups, and constrain parameters within and across groups. This makes SEM a very flexible analysis tool, and LGC modeling using SEM shows this flexibility. Typically, the different measurement occasions are introduced as separate variables. Time-varying covariates are also introduced as separate variables that affect the outcome measures at the corresponding measurement occasions. Time-invariant covariates are typically incorporated in the model by giving these an effect on the latent variables that represent the intercept or the slope. However, it is also possible to allow the time-invariant covariates to have direct effects on the outcome variables at each measurement occasion. This leads to a different model. In LGC modeling using SEM, it is a straightforward extension to model effects of the latent intercept and slope variables on other variables, including analyzing two LGC trajectories in one model and investigating how their intercepts and slopes are related. The flexibility of LGC analysis using SEM implies that the analyst is responsible for ensuring that the
model is set up properly. For instance, one extension of the basic LGC model discussed in the previous section is to use a number of indicators for the outcome measure and extend the model by including a measurement model. In this situation, the growth model is modeled on the consecutive latent variables. To ensure that the measurement model is invariant over time, the corresponding factor loadings for measurements across time must be constrained to be equal. In addition, since the LGC model involves changes in individual scores and the overall mean across time, means and intercepts are included in the model, and the corresponding intercepts must also be constrained equal over time. Adding additional levels to the model is relatively difficult using the SEM approach. Muthén [14] has proposed a limited information Maximum Likelihood method to estimate parameters in two-level SEM. This approach works well [6], and can be implemented in standard SEM software [5]. Since the LGC model can be estimated using standard SEM, two-level SEM can include a LGC model at the individual (lowest) level, with groups at the second level. However, adding more levels is cumbersome in the SEM approach.
multilevel regression software by including an extra level. This is used, for instance, to model multivariate outcomes, cross-classified data, and specify measurement models. For such models, a nesting structure of up to five levels is not unusual, and not all multilevel regression software can accommodate this. The MLR model is more limited than SEM. For instance, it is not possible to let the intercept and slopes act as predictors in a more elaborate path model. The limitations show especially when models are estimated that include latent variables. For instance, models with latent variables over time that are indicated by observed variables, easy to specify in SEM, can be set up in MLR using an extra variable level. At this (lowest) level, dummy variables are used to indicate variables that belong to the same construct at different measurement occasions. The regression coefficients for these dummies are allowed to vary at the occasion level, and they are interpreted as latent variables in a measurement model. However, this measurement model is more restricted than the measurement in the analogous SEM. In the MLR approach, the measurement model is a factor model with equal loadings for all variables, and one common error variance for the unique factors. In some situations, for instance, when the indicators are items measured using the same response scale, this restriction may be reasonable. It also ensures that the measurement model is invariant over time. The important issue is of course that this restriction cannot be relaxed in the MLR model, and it cannot be tested. Most modern structural equation modeling software can be used to analyze LGC models. If the data are unbalanced, either by design or because of panel attrition, it is important that the software supports analyzing incomplete data using the raw Likelihood. If there are categorical response variables, it is important that the software supports their analysis. At the time of writing, only Muth´en’s software Mplus supports the combination of categorical incomplete data [17]. Longitudinal data can be handled by all multilevel software. Some software supports analyzing specific covariance structures over time, such as autoregressive models. When outcome variables may be categorical, there is considerable variation in the estimation methods employed. Most multilevel regression relies on Taylor series linearization, but increasingly numerical integration is used, which is regarded as more accurate.
Multilevel and SEM Approaches to Growth Curve Modeling A recent development in the field is that the distinction between MLR and LGC analysis is blurring. Advanced structural equation modeling software is now incorporating some multilevel features. Mplus, for example, goes a long way towards bridging the gap between the two approaches [15, 16]. On the other hand MLR software is incorporating features of LGC modeling. Two MLR software packages allow linear relations between the growth parameters: HLM [22] and GLLAMM [19]. HLM offers a variety of residual covariance structures for MLR models. The GLLAMM framework is especially powerful; it can be viewed as a multilevel regression approach that allows factor loadings, variable-specific unique variances, as well as structural equations among latent variables (both factors and random coefficients). In addition, it supports categorical and incomplete data. As the result of further developments in both statistical models and software, the two approaches to growth curve modeling may in time merge (see Software for Statistical Analyses; Structural Equation Modeling: Software).
growth model must be embedded in a larger number of hierarchical data levels.
References [1] [2]
[3]
[4] [5] [6]
[7]
[8]
Discussion Many methods are available for the analysis of longitudinal data. There is no single preferred procedure, since different substantial questions dictate different data structures and statistical models. This entry focuses on growth curve analysis. Growth curve analysis, and its SEM variant latent growth curve analysis, has advantages for the study of change if it can be assumed that change is systematically related to the passage of time. Identifying individual differences in change, as well as understanding the process of change are considered critical issues in developmental research by many scholars. Growth curve analysis explicitly reflects on both intra-individual change and interindividual differences in such change. In this entry, we described the general growth curve model and discussed differences between the multilevel regression approach and latent growth curve analysis using structural equation modeling. The basic growth curve model has a similar representation, and gives equivalent results in both approaches. Differences exist in the ways the model can be extended. In many instances, latent growth curve analysis is preferred because of its greater flexibility. Multilevel Regression is preferable if the
9
[9] [10]
[11]
[12]
[13] [14] [15]
[16] [17]
Bollen, K.A. (1989). Structural Equations with Latent Variables., Wiley, New York. Bollen, K.A. & Barb, K. (1981). Pearson’s R and coarsely categorized measures, American Sociological Review 46, 232–239. Curran, P. (2003). Have multilevel models been structural equation models all along? Multivariate Behavior Research 38(4), 529–569. Goldstein, H. (2003). Multilevel Statistical Models, Edward Arnold, London. Hox, J.J. (2002). Multilevel Analysis: Techniques and Applications, Lawrence Erlbaum, Mahwah. Hox, J.J. & Maas, C.J.M. (2001). The accuracy of multilevel structural equation modeling with pseudobalanced groups and small samples, Structural Equation Modeling 8, 157–174. Johnson, D.R. & Creech, J.C. (1983). Ordinal measures in multiple indicator models: a simulation study of categorization error, American Sociological Review 48, 398–407. Kaplan, D. (2000). Structural Equation Modeling, Sage Publications, Thousand Oaks. Kessler, R.C. & Greenberg, D.F. (1981). Linear Panel Analysis, Academic Press, New York. Li, F., Duncan, T.E., Duncan, S.C. & Acock, A. (2001). Latent growth modeling of longitudinal data: a finite growth mixture modeling approach, Structural Equation Modeling 8, 493–530. McArdle, J.J. (1986). Latent variable growth within behavior genetic models, Behavior Genetics 16, 163–200. McArdle, J.J. (1988). Dynamic but structural equation modeling of repeated measures data, in Handbook of Multivariate Experimental Psychology, 2nd Edition, R.B. Cattel & J. Nesselroade, eds, Plenum Press, New York, pp. 561–614. Meredith, W.M. & Tisak, J. (1990). Latent curve analysis, Psychometrika 55, 107–122. Muth´en, B. (1994). Multilevel covariance structure analysis, Sociological Methods & Research 22, 376–398. Muth´en, B. (2001). Second-generation structural equation modeling with a combination of categorical and continuous latent variables: New opportunities for latent class-latent growth modeling, in New Methods for the Analysis of Change, L.M. Collins & A.G. Sayer, eds, American Psychological Association, Washington, pp. 291–322. Muth´en, B.O. (2002). Beyond SEM: general latent variable modeling, Behaviormetrika 29, 81–117. Muth´en, L.K. & Muth´en, B.O. (2004). Mplus Users Guide, Muth´en & Muth´en, Los Angeles.
10
Multilevel and SEM Approaches to Growth Curve Modeling
[18]
Muth´en, B. & Shedden, K. (1999). Finite mixture modeling with mixture outcomes using the EM algorithm, Biometrics 55, 463–469. [19] Rabe-Hesketh, S., Skrondal, A. & Pickles, A. (2004). Generalized multilevel structural equation modelling, Psychometrika 69, 167–190. [20] Rasbash, J., Browne, W., Goldstein, H., Yang, M., Plewis, I., Healy, M., Woodhouse, G., Draper, D., Langford, I. & Lewis, T. (2000). A User’s Guide to MlwiN , Multilevel Models Project, University of London, London. [21] Raudenbush, S.W. & Bryk, A.S. (2002). Hierarchical Linear Models: Applications and Data Analysis Methods, Sage Publications, Newbury Park. [22] Raudenbush, S., Bryk, A., Cheong, Y.F. & Congdon, R. (2000). HLM 5. Hierarchical Linear and Nonlinear Modeling, Scientific Software International, Chicago. [23] Stoel, R.D., Van den Wittenboer, G. & Hox, J.J. (2003). Analyzing longitudinal data using multilevel regression
[24]
[25]
and latent growth curve analysis, Metodologia de las Ciencas del Comportamiento 5, 21–42. Stoel, R.D., van den Wittenboer, G. & Hox, J.J. (2004). Methodological issues in the application of the latent growth curve model, in Recent Developments in Structural Equation Modeling: Theory and Applications, K. van Montfort, H. Oud & A. Satorra, eds, Kluwer Academic Publishers, Amsterdam, pp. 241–262. Willett, J.B. & Sayer, A.G. (1994). Using covariance structure analysis to detect correlates and predictors of individual change over time, Psychological Bulletin 116, 363–381.
(See also Structural Equation Modeling: Multilevel) JOOP HOX
AND
REINOUD D. STOEL
Multiple Baseline Designs
JOHN FERRON AND HEATHER SCOTT
Volume 3, pp. 1306–1309 in Encyclopedia of Statistics in Behavioral Science ISBN-13: 978-0-470-86080-9 ISBN-10: 0-470-86080-4 Editors Brian S. Everitt & David C. Howell John Wiley & Sons, Ltd, Chichester, 2005
Multiple Baseline Designs Single-case designs were developed to allow researchers to examine the effect of a treatment for a single case, where the case may be an individual participant or a group such as a class of students. Multiple-baseline designs [1] are an extension of the most basic single-case design, the AB, or interrupted time-series design. With an AB design, the researcher repeatedly measures the behavior of interest prior to intervention. These observations become the baseline phase (A). The researcher then introduces a treatment and continues to repeatedly measure the behavior, creating the treatment phase (B) of the design. If a change in behavior is observed, some may question whether the change was the result of the treatment or whether it resulted from maturation or some event that happened to coincide with treatment implementation. To allow researchers to rule out these alternative explanations for observed changes, more elaborate single-case designs were developed. Multiple-baseline designs extend the AB design, such that a baseline phase (A) and treatment phase (B) are established for multiple participants, multiple behaviors, or multiple settings. The initiation of the treatment phases is staggered across time creating baselines of different lengths for the different participants, behaviors, or settings. The general form of the design with 12 observations and 4 baselines is presented in Figure 1. To further illustrate the logic of this design, a graphical display of data from a multiple baseline across participant design is presented in Figure 2. When the first participant enters treatment, there is a notable change in behavior for this participant. The other participant, who is still in baseline, does not show appreciable changes in behavior when the treatment is initiated with the first participant. This makes history or maturational effects less plausible as explanations for why the first participant’s behavior
changed. Put another way, when changes are due to history or maturation, we would anticipate change for each participant, but we would not expect those changes to stagger themselves across time in a manner that happened to coincide with the staggered interventions.

Applications

The multiple-baseline design is often employed in applied clinical and educational settings. However, its application may be extended to a variety of other disciplines and situations. The following are examples in which this design may be utilized:

• Educators might find the multiple baseline across individuals to be suitable for studying methods for reducing disruptive behaviors of students in the classroom.
• Psychologists might find the multiple baseline across behaviors to be suitable for investigating the effects of teaching children with autism to use socially appropriate gestures in combination with oral communication.
• Counselors might find the multiple baseline across participants to be suitable for studying the effects of training staff in self-supervision and uses of empathy in counseling situations.
• Therapists might find the multiple baseline across groups to be suitable for examining the effectiveness of treating anxiety disorders, phobias, and obsessive-compulsive disorders in adolescents.
• Retailers and grocery stores might find the multiple-baseline design across departments to be suitable for studying training effects on cleaning behaviors of employees.
• Teacher preparation programs might find the multiple baseline across participants to be suitable for examining the result of specific teaching practices.
• Medical practitioners or social workers might find the multiple baseline across behaviors to be suitable for studying the impact of teaching child care and safety skills to parents with intellectual disabilities.
Educators might find the multiple baseline across individuals to be suitable for studying methods for reducing disruptive behaviors of students in the classroom. Psychologists might find the multiple baseline across behaviors to be suitable for investigating the effects of teaching children with autism to use socially appropriate gestures in combination with oral communication. Counselors might find the multiple baseline across participants to be suitable for studying the effects of training staff in self-supervision and uses of empathy in counseling situations. Therapists might find the multiple baseline across groups to be suitable for examining the effectiveness of treating anxiety disorders, phobias, and obsessive-compulsive disorders in adolescents. Retailers and grocery stores might find the multiple-baseline design across departments to be suitable for studying training effects on cleaning behaviors of employees. Teacher preparation programs might find the multiple baseline across participants to be suitable for examining the result of specific teaching practices. O8 O9
O10 O11 O12
O8 O9
O10 O11 O12
O8 O9
O10 O11 O12
O8 O9 X O10 O11 O12
Figure 1 Diagram of a multiple-baseline design where Os represent observations and Xs represent changes from baseline to treatment phases
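For readers who want to experiment with the logic of Figure 1, the following minimal simulation (numpy assumed; the baseline level, noise, effect size, and intervention points are invented for illustration) generates four series with staggered intervention points and a hypothetical level shift once treatment begins.

```python
import numpy as np

rng = np.random.default_rng(7)
n_obs, shift = 12, 20.0
intervention_after = [3, 5, 7, 9]       # staggered start of the treatment (B) phase

series = []
for start in intervention_after:
    baseline = 40 + rng.normal(0, 5, n_obs)        # noisy phase-A level
    treated = np.arange(n_obs) >= start            # phase-B indicator
    series.append(baseline + shift * treated)      # level shift after intervention
data = np.vstack(series)   # rows = participants, behaviors, or settings
```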
[Figure 2: Graphical display of multiple-baseline design: two stacked panels (vertical axes 0 to 100), one per participant, with the intervention introduced later for the second participant than for the first.]

Design Strengths and Weaknesses

The widespread use of multiple-baseline designs can be attributed to a series of practical and conceptual strengths. The multiple-baseline design allows researchers to focus extensively on a single participant or on a few participants. This may be advantageous in contexts where researchers wish to study cases that are relatively rare, when researchers are implementing particularly complex and time-consuming treatments, and when they are devoted to showing effects on specific individuals. Among single-case designs, the multiple-baseline design allows researchers to consider history and maturation effects without requiring them to withdraw the treatment. This can be seen as particularly valuable in clinical settings where it may not be ethical to withdraw an effective treatment. Multiple-baseline designs provide relatively strong grounds for causal inferences when treatment effects can be seen that coincide with the unique intervention times for each participant, behavior, or setting. Things are less clear when there is a lack of independence between baselines, so that treatment of one baseline impacts another, or when treatment effects are not consistent across participants, behaviors, or settings. It should also be noted that the evidence for inferring a treatment effect for any one participant tends to be established less clearly than it could be in a single-case design involving a reversal. Finally, relative to group designs, the multiple-baseline design involves relatively few participants and thus there is little means for establishing generalizability.

Design Issues

More Complex Designs
The basic multiple-baseline design can be extended to have a more complex between series structure. For example, a researcher could conduct a multiplebaseline design across participants and behaviors by studying three behaviors of each of three participants. A researcher could also extend the multiplebaseline design by using a more complex within series structure. For example, a researcher conducting a multiple-baseline design across participants may for each participant include a baseline phase, followed by a treatment phase, followed by a second baseline phase. Whether one is considering the basic multiplebaseline design or a more complex extension, there are several additional design features that should be considered, including the number of baselines, the number of observations to be collected in each time series, and the decision about when to intervene within each series.
Number of Baselines

Multiple-baseline designs need to include at least two baselines, and the use of three or four baselines is more common. Assuming other things are equal, it is preferable to have greater numbers of baselines. Having more baselines provides a greater number of replications of the effect, and allows greater confidence that the observed effect in a particular time series was the result of the intervention.
Number of Observations

The number of observations within a time series varies greatly from one multiple-baseline study to the next. One study may have a phase with two or three observations, while another may have more than 30 observations in each phase. When other things are equal, greater numbers of observations lead to stronger statements about treatment effects. The number of observations needed depends heavily on the amount of variation in the baseline data and the size of the anticipated effect. Suppose that a researcher wishes to intervene with four students who consistently spend no time on a task during a particular activity (i.e., 0% is observed for each baseline observation). Suppose further that the intervention is expected to increase the time on task to above 80%. Under these conditions, a few observations within each phase are ample. However, if the baseline observations fluctuated between 0% and 80% with an average of 40% and the hope was to move the average to at least 60% for each student, many more observations would be needed.
Placement of Intervention Points

Systematic Assignment. The points at which the researcher will intervene can be established in several different ways. In some cases, the researcher simply chooses the points a priori on the basis of obtaining an even staggering across time. For example, the researcher may choose to intervene for the four baselines at times 4, 6, 8, and 10, respectively. This method seems to work well when baseline observations are essentially constant. Under these conditions, temporal stability is assumed and the researcher uses what has been referred to as the scientific solution to causal inference. When baseline observations are more variable, inferences become substantially more difficult and one may alter the method of assigning intervention points to facilitate drawing treatment effect inferences.

Response-guided Assignment. One option is to use a response-guided strategy where the data are viewed and intervention points are chosen on the basis of the emerging data. For example, a researcher gathering baseline data on each of the four individuals may notice variability in each participant's behavior. The researcher may be able to identify sources of this variability and make adjustments to the experiment to control the identified factors. After controlling these factors, the baseline data may stabilize, and the researcher may feel comfortable intervening with the first participant. The researcher would continue
watching the accumulating observations waiting until the data for the first participant demonstrated the anticipated effect. At this point, if the baseline data were still stable for the other three participants, the researcher would intervene with the second participant. Interventions for the third and fourth participants would be made in a similar manner, each time waiting until the data had shown a clear pattern before beginning the next treatment phase. The response-guided strategy works relatively well when initial sources of variability to the baseline data can be identified and controlled. Variability may be resulting from unreliability in the measurement process that could be corrected with training, or it may be resulting from changes in the experimental conditions. As examples, variation may be associated with changes in the time of the observation, changes in the activities taking place during the observation period, changes in the people present during the observation period, or changes in the events preceding the observation period. By further standardizing conditions, variation in the baseline data can be reduced. If near constancy can be obtained in the baseline data, temporal stability can be assumed, and treatment effects can be readily seen. When baseline variability cannot be controlled, one may turn to statistical methods for making treatment effect inferences. Interestingly, the response-guided strategy for establishing intervention points can lead to difficulties in establishing inferences statistically. Researchers wishing to make statistical inferences may turn to an approach that includes some form of random assignment. Random Assignment. One approach is to randomly assign participants to designated intervention times (see Randomization). To illustrate, consider a researcher who plans to conduct a multiple-baseline study across participants where 12 observations will be gathered on each of the four participants. The researcher could decide that the interventions would occur on the 4th, 6th, 8th, and 10th point in time. The researcher could then randomly decide which participant would be treated at the 4th point in time, which would be treated at the 6th point in time, which on the 8th, and which on the 10th. A second method of randomization would be to randomly select an intervention point for a participant under the restriction that there would be at least n observations in each phase. For example, the
researcher could randomly choose a time point for each participant under the constraint that the chosen point would fall between the 4th and 10th observation. It may turn out that a researcher chooses the interventions for the four participants to coincide with the 8th, 4th, 7th, and 5th observations, respectively. This leads to more possible assignments than the first method, but could possibly lead to the assignment of all interventions to the same point in time. This would be inconsistent with the temporal staggering, which is a defining feature in the multiple-baseline design, so one may wish to further restrict the randomization so this is not possible. One way of doing this is to randomly choose intervention times under greater constraints and then randomly assign participants to intervention times. For example, a researcher could choose four intervention times by randomly choosing the 3rd or 4th, then randomly choosing the 5th or 6th, then randomly choosing the 7th or 8th, and then randomly choosing the 9th or 10th. The four participants could then be randomly assigned to these four intervention times. Finally, one could consider coupling one of these randomization methods with a response-guided strategy. For example, a researcher could work to control sources of baseline variability, and then make the random assignment after the data had stabilized as much as possible.
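These randomization schemes are easy to script. The following is a minimal Python sketch, not taken from the original article, of the constrained scheme just described: intervention times are first drawn from restricted windows, and participants are then randomly paired with those staggered times. All function and variable names are illustrative.

```python
import random

def staggered_intervention_times(windows, seed=None):
    """Randomly choose one intervention time from each constrained window,
    e.g. windows = [(3, 4), (5, 6), (7, 8), (9, 10)] picks the 3rd or 4th
    observation for the earliest baseline, the 5th or 6th for the next, etc."""
    rng = random.Random(seed)
    return [rng.choice(range(lo, hi + 1)) for lo, hi in windows]

def assign_participants(participants, times, seed=None):
    """Randomly pair participants with the chosen (staggered) intervention times."""
    rng = random.Random(seed)
    shuffled = participants[:]
    rng.shuffle(shuffled)
    return dict(zip(shuffled, sorted(times)))

# Example: four participants, 12 observations each, staggered start points
times = staggered_intervention_times([(3, 4), (5, 6), (7, 8), (9, 10)], seed=1)
plan = assign_participants(["P1", "P2", "P3", "P4"], times, seed=2)
print(plan)  # e.g. {'P3': 3, 'P1': 6, 'P4': 8, 'P2': 9}
```

Because the staggered windows never overlap, this sketch also guarantees the temporal staggering that defines the design.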
Analysis

Researchers often start their analysis by constructing a graphical display of the multiple-baseline data. In many applications, a visual analysis of the graphical display is the only analysis provided. There are, however, a variety of additional methods available to help make inferences about treatment effects in multiple-baseline studies. If one makes intervention assignments randomly, then it is possible to use a randomization test to make
an inference about the presence of a treatment effect. Randomization tests [2] have been developed for each of the randomization strategies previously described, and these tests are statistically valid even when history or maturational effects have influenced the data. It is important to note, however, that these tests do not lead to inferences about the size of the treatment effect. When baseline data are variable, effect size inferences are often difficult and researchers should consider statistical modeling approaches. A variety of statistical models are available for time series data [3], including models with a relatively simple assumed error structure, such as a standard regression model that assumes an independent error structure, to more complex time series models that have multiple parameters to capture dependencies among the errors, such as an autoregressive moving average model. Different error structure assumptions will typically lead to different estimated standard errors and potentially different inferential statements about the size of treatment effects. Consequently, care should be taken to establish the creditability of the assumed statistical model, which often requires a relatively large number of observations.
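When participants have been randomly assigned to a fixed set of staggered intervention times, the reference distribution for a randomization test can be built by re-pairing participants with those times in every possible way. The sketch below is a generic illustration only, not the regulated-randomization test of [2]; the test statistic (the baseline-to-treatment shift summed over participants) and all names are my own choices.

```python
from itertools import permutations
import numpy as np

def shift_statistic(series_by_participant, times_by_participant):
    """Sum over participants of (mean treatment-phase score - mean baseline score)."""
    total = 0.0
    for p, y in series_by_participant.items():
        t = times_by_participant[p]
        y = np.asarray(y, dtype=float)
        total += y[t:].mean() - y[:t].mean()
    return total

def randomization_test(series_by_participant, observed_assignment):
    """One-sided p value over all re-pairings of participants to the same
    staggered intervention times."""
    participants = list(series_by_participant)
    times = [observed_assignment[p] for p in participants]
    observed = shift_statistic(series_by_participant, observed_assignment)
    stats = []
    for perm in permutations(times):
        assignment = dict(zip(participants, perm))
        stats.append(shift_statistic(series_by_participant, assignment))
    stats = np.array(stats)
    return float(np.mean(stats >= observed))  # proportion as or more extreme

# Example with 4 participants and 12 observations each (toy data)
rng = np.random.default_rng(0)
assignment = {"P1": 3, "P2": 5, "P3": 7, "P4": 9}
data = {p: rng.normal(0, 1, 12) + np.where(np.arange(12) >= t, 2.0, 0.0)
        for p, t in assignment.items()}
print(randomization_test(data, assignment))  # small p value expected here
```

With four participants there are only 24 equally likely re-pairings, so the smallest attainable p value is 1/24, which is one reason larger numbers of baselines are preferable.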
References

[1] Baer, D.M., Wolf, M.M. & Risley, T.R. (1968). Some current dimensions of applied behavior analysis, Journal of Applied Behavior Analysis 1, 91–97.
[2] Koehler, M. & Levin, J. (1998). Regulated randomization: a potentially sharper analytical tool for the multiple-baseline design, Psychological Methods 3, 206–217.
[3] McKnight, S.D., McKean, J.W. & Huitema, B.E. (2000). A double bootstrap method to analyze linear models with autoregressive error terms, Psychological Methods 5, 87–101.
JOHN FERRON AND HEATHER SCOTT
Multiple Comparison Procedures

H.J. KESELMAN, BURT HOLLAND AND ROBERT A. CRIBBIE
Volume 3, pp. 1309–1325 in Encyclopedia of Statistics in Behavioral Science ISBN-13: 978-0-470-86080-9 ISBN-10: 0-470-86080-4 Editors Brian S. Everitt & David C. Howell John Wiley & Sons, Ltd, Chichester, 2005
Multiple Comparison Procedures Introduction When more than two groups are compared for the effects of a treatment variable, researchers frequently adopt multiple comparison procedures (MCPs) in order to specify the exact nature of the treatment effects. Comparisons between two groups, pairwise comparisons, are frequently of interest to applied researchers, comparisons such as examining the effects of two different types of drugs, from a set of groups administered different drugs, on symptom dissipation. Complex comparisons, nonpairwise comparisons, the comparison of say one group (e.g., a group receiving a placebo instead of an active drug) to the average effect of two other groups (e.g., say two groups receiving some amount of the drug–50 and 100 mg) are also at times of interest to applied researchers. In both cases, the intent of the researcher is to examine focused questions about the nature of the treatment variable, in contrast to the answer provided by an examination of a global hypothesis through an omnibus test statistic (e.g., say, the use of the analysis of variance F test in order to determine if there is an effect of J > 2(j = 1, . . . , J ) different types of drugs on symptom dissipation). A recent survey of statistical practices of behavioral science researchers indicates that the Tukey [60] and Scheff´e [55] MCPs are very popular methods for conducting pairwise and complex comparisons, respectively (Keselman et al., [28]). However, behavioral science researchers now have available to them a plethora of procedures that they can choose from when their interest lies in conducting pairwise and complex comparisons among their treatment group means. In most cases, these newer procedures will provide either a test that is more insensitive, that is more robust, to the nonconformity of applied data to the derivational assumptions (i.e., homogeneity of population variances and normally distributed data) of the classical procedures (Tukey and Scheff´e) or will provide greater sensitivity (statistical power) to detect effects when they are present. Therefore, the purpose of our paper is to present a brief introduction to some of the newer methods for conducting multiple comparisons. Our presentation will predominately
be devoted to an examination of pairwise methods since researchers perform these comparisons more often than complex comparisons. However, we will discuss some methods for conducting complex comparisons among treatment group means. Topics that we briefly review prior to our presentation of newer MCPs will be methods of Type I error control and simultaneous versus stepwise multiple testing. It is important to note that the MCPs that are presented in our paper were also selected for discussion, because researchers can, in most cases, obtain numerical results with a statistical package, and in particular, through the SAS [52] system of computer programs. The SAS system presents a comprehensive up-to-date array of MCPs (see [65]). Accordingly, we acknowledge at the beginning of our presentation that some of the material we present follows closely Westfall et al.’s presentation. However, we also present procedures that are not available through the SAS system. In particular, we discuss a number of procedures that we believe are either new and interesting ways of examining pairwise comparisons (e.g., the model comparison approach of Dayton [6]) or have been shown to be insensitive to the usual assumptions associated with some of the procedures discussed by Westfall et al. (e.g., MCPs based on robust estimators).
Type I Error Control Researchers who test a hypothesis concerning mean differences between two treatment groups are often faced with the task of specifying a significance level, or decision criterion, for determining whether the difference is significant. The level of significance specifies the maximum probability of rejecting the null hypothesis when it is true (i.e., committing a Type I error). As α decreases, researchers can be more confident that rejection of the null hypothesis signifies a true difference between population means, although the probability of not detecting a false null hypothesis (i.e., a Type II error) increases. Researchers faced with the difficult, yet important, task of quantifying the relative importance of Type I and Type II errors have traditionally selected some accepted level of significance, for example α = .05. However, determining how to control Type I errors is much less simple when multiple tests of significance (e.g., all possible pairwise comparisons between group means) will be computed. This is
because when multiple tests of significance are computed, how one chooses to control Type I errors can affect whether one can conclude that effects are statistically significant or not. The multiplicity problem in statistical inference refers to selecting the statistically significant findings from a large set of findings (tests) to support or refute one’s research hypotheses. Selecting the statistically significant findings from a larger pool of results that also contain nonsignificant findings is problematic since when multiple tests of significance are computed, the probability that at least one will be significant by chance alone increases with the number of tests examined (see Error Rates). Discussions on how to deal with multiplicity of testing have permeated many literatures for decades and continue to this day. In one camp are those who believe that the occurrence of any false positive must be guarded at all costs (see [13], [40], [48, 49, 50], [66]); that is, as promulgated by Thomas Ryan, pursuing a false lead can result in the waste of much time and expense, and is an error of inference that accordingly should be stringently controlled. Those in this camp deal with the multiplicity issue by setting α for the entire set of tests computed. For example, in the pairwise multiple comparison problem, Tukey’s [60] MCP uses a critical value wherein the probability of making at least one Type I error in the set of pairwise comparisons tests is equal to α. This type of control has been referred to in the literature as family-wise or experiment-wise (FWE) control. These respective terms come from setting a level of significance over all tests computed in an experiment, hence experiment-wise control, or setting the level of significance over a set (family) of conceptually related tests, hence FWE control. Multiple comparisonists seem to have settled on the family-wise label. As indicated, for the set of pairwise tests, Tukey’s procedure sets a FWE for the family consisting of all pairwise comparisons. Those in the opposing camp maintain that stringent Type I error control results in a loss of statistical power and consequently important treatment effects go undetected (see [47], [54], [72]). Members of this camp typically believe the error rate should be set per comparison (the probability of rejecting a given comparison) (hereafter referred to as the comparison-wise error-CWE rate) and usually recommend a five percent level of significance, allowing the overall error rate (i.e., FWE) to inflate with the number of tests
computed. In effect, those who adopt comparisonwise control ignore the multiplicity issue. For example, a researcher comparing four groups (J = 4) may be interested in determining if there are significant pairwise mean differences between any of the groups. If the probability of committing a Type I error is set at α for each comparison (comparison-wise control = CWE), then the probability that at least one Type I error is committed over all C = 4(4 − 1)/2 = 6 pairwise comparisons can be much higher than α. On the other hand, if the probability of committing a Type I error is set at α for the entire family of pairwise comparisons, then the probability of committing a Type I error for each of the C comparisons can be much lower than α. Clearly then, the conclusions of an experiment can be greatly affected by the level of significance and unit of analysis over which Type I error control is imposed. The FWE rate relates to a family (containing, in general, say k elements) of comparisons. A family of comparisons, as we indicated, refers to a set of conceptually related comparisons, for example, all possible pairwise comparisons, all possible complex comparisons, trend comparisons, and so on. As Miller [40] points out, specification of a family of comparisons, being self-defined by the researcher, can vary depending on the research paradigm. For example, in the context of a one-way design, numerous families can be defined: A family of all comparisons performed on the data, a family of all pairwise comparisons, a family of all complex comparisons. (Readers should keep in mind that if multiple families of comparisons are defined (e.g., one for pairwise comparisons and one for complex comparisons), then given that erroneous conclusions can be reached within each family, the overall Type I FWE rate will be a function of the multiple subfamily-wise rates.) Specifying family size is a very important component of multiple testing. As Westfall et al. [65, p. 10] note, differences in conclusions reached from statistical analyses that control for multiplicity of testing (FWE) and those that do not (CWE) are directly related to family size. That is, the larger the family size, the less likely individual tests will be found to be statistically significant with family-wise control. Accordingly, to achieve as much sensitivity as possible to detect true differences and yet maintain control over multiplicity effects, Westfall et al. recommend that researchers ‘choose smaller, more focused families rather than broad ones, and
Multiple Comparison Procedures (to avoid cheating) that such determination must be made a priori . . . .’ Definitions of the CWE and FWE rates appear in many sources (e.g., see [34], [48], [40], [59], [60]). Controlling the FWE rate has been recommended by many researchers (e.g., [16], [45], [48], [50], [60]) and is ‘the most commonly endorsed approach to accomplishing Type I error control’ [56, p. 577]. Keselman et al. [28] report that approximately 85 percent of researchers conducting pairwise comparisons adopt some form of FWE control. Although many MCPs purport to control FWE, some provide ‘strong’ FWE control while others only provide ‘weak’ FWE control. Procedures are said to provide strong control if FWE is maintained across all null hypotheses; that is, under the complete null configuration (µ1 = µ2 = · · · = µJ ) and all possible partial null configurations (An example of a partial null hypothesis is µ1 = µ2 = · · · = µJ −1 = µJ ). Weak control, on the other hand, only provides protection for the complete null hypothesis, that is, not for all partial null hypotheses as well. The distinction between strong and weak FWE control is important because as Westfall et al. [65] note, the two types of FWE control, in fact, control different error rates. Weak control only controls the Type I error rate for falsely rejecting the complete null hypothesis and accordingly allows the rate to exceed, say 5%, for the composite null hypotheses. On the other hand, strong control sets the error rate at, say 5%, for all (component) hypotheses. For example, if CWE = 1 − (1 − 0.05)1/k , the family-wise rate is controlled in a strong sense for testing k independent tests. Examples of MCPs that only weakly control FWE are the Newman [41] Keuls [33] and Duncan [8] procedures. As indicated, several different error rates have been proposed in the multiple comparison literature. The majority of discussion in the literature has focused on the FWE and CWE rates (e.g., see [34], [48], [40], [59], [60]), although other error rates, such as the false discovery rate (FDR) also have been proposed (e.g., see Benjamini & Hochberg [2]). False Discovery Rate Control. Work in the area of multiple hypothesis testing is far from static, and one of the newer interesting contributions to this area is an alternative conceptualization for defining errors in the multiple-testing problem; that is the FDR, presented by Benjamini and Hochberg [2]. FDR is defined
by these authors as the expected proportion of the number of erroneous rejections to the total number of rejections, that is, it is the expected proportion of false discoveries or false positives. Benjamini and Hochberg [2] provide a number of illustrations where FDR control seems more reasonable than family-wise or comparison-wise control. Exploratory research, for example, would be one area of application for FDR control. That is, in new areas of inquiry where we are merely trying to see what parameters might be important for the phenomenon under investigation, a few errors of inference should be tolerable; thus, one can reasonably adopt the less stringent FDR method of control which does not completely ignore the multiple-testing problem, as does comparison-wise control, and yet, provides greater sensitivity than family-wise control. Only at later stages in the development of our conceptual formulations does one need more stringent familywise control. Another area where FDR control might be preferred over family-wise control, suggested by Benjamini and Hochberg [2], would be when two treatments (say, treatments for dyslexia) are being compared in multiple subgroups (say, children of different ages). In studies of this sort, where an overall decision regarding the efficacy of the treatment is not of interest but, rather where separate recommendations would be made within each subgroup, researchers likely should be willing to tolerate a few errors of inference and accordingly would profit from adopting FDR rather than family-wise control. Very recently, use of the FDR criterion has become widespread when making inferences in research involving the human genome, where family sizes in the thousands are common. See the review by Dudoit, Shaffer and Boldrick [7], and references contained therein. Since multiple testing with FDR tends to detect more significant differences than testing with FWE, some researchers may be tempted to automatically prefer FDR control to FWE control. We caution that researchers who use FDR should be obligated to explain, in terms of the definitions of the two criteria, why it is more appropriate to control FDR than FWE in the context of their research.
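To make the difference between these error rates concrete, the sketch below (illustrative only) shows how the probability of at least one Type I error grows when C independent comparisons are each tested at a comparison-wise α, and contrasts Bonferroni (FWE-style) with Benjamini–Hochberg (FDR-style) decisions using the six P values that reappear in the worked example later in this entry.

```python
import numpy as np

alpha, C = 0.05, 6  # e.g. all pairwise comparisons among J = 4 means
fwe_if_uncorrected = 1 - (1 - alpha) ** C
print(f"P(at least one Type I error) across {C} independent tests: {fwe_if_uncorrected:.3f}")

# Ordered p values for the six comparisons (same values as the later example)
p = np.array([0.0014, 0.0044, 0.0097, 0.0145, 0.0490, 0.1239])

# Bonferroni: compare each p value to alpha / C (strong FWE control)
bonferroni_reject = p <= alpha / C

# Benjamini-Hochberg step-up: find the largest i with p_(i) <= (i / C) * alpha
thresholds = (np.arange(1, C + 1) / C) * alpha
below = np.nonzero(p <= thresholds)[0]
k = below.max() + 1 if below.size else 0
bh_reject = np.arange(1, C + 1) <= k

print("Bonferroni rejections:", bonferroni_reject)
print("BH (FDR) rejections:  ", bh_reject)
```

With these values, testing each comparison at α = .05 carries roughly a 26% chance of at least one false positive if all nulls were true; Bonferroni rejects two hypotheses while BH rejects four, consistent with the greater sensitivity of FDR control described above.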
Types of MCPs

MCPs can examine hypotheses either simultaneously or sequentially. A simultaneous MCP conducts all
comparisons regardless of whether the omnibus test, or any other comparison, is significant (or not significant) using a constant critical value. Such procedures are frequently referred to as simultaneous test procedures (STPs) (see Einot & Gabriel [11]). A sequential (stepwise) MCP considers either the significance of the omnibus test or the significance of other comparisons (or both) in evaluating the significance of a particular comparison; multiple critical values are used to assess statistical significance. MCPs that require a significant omnibus test in order to conduct pairwise comparisons have been referred to as protected tests. MCPs that consider the significance of other comparisons when evaluating the significance of a particular comparison can be either step-down or step-up procedures. Step-down procedures begin by testing the most extreme test statistic and nonsignificance of the most extreme test statistics implies nonsignificance for less extreme test statistics. Step-up procedures begin by testing the least extreme test statistic and significance of least extreme test statistics can imply significance for larger test statistics. In the equal sample sizes case, if a smaller pairwise difference is statistically significant, so is a larger pairwise difference, and conversely. However, in the unequal sample-size cases, one can have a smaller pairwise difference be significant and a larger pairwise difference nonsignificant, if the sample sizes for the means comprising the smaller difference are much larger than the sample sizes for the means comprising the larger difference. One additional point regarding STP and stepwise procedures is important to note. STPs allow researchers to examine simultaneous intervals around the statistics of interest whereas stepwise procedures do not (see, however, [4]).
Preliminaries

A mathematical model that can be adopted when examining pairwise and/or complex comparisons of means in a one-way completely randomized design is

$$Y_{ij} = \mu_j + \epsilon_{ij}, \qquad (1)$$

where $Y_{ij}$ is the score of the $i$th subject ($i = 1, \ldots, n$) in the $j$th ($j = 1, \ldots, J$) group ($\sum_j n = N$), $\mu_j$ is the $j$th group mean, and $\epsilon_{ij}$ is the random error for the $i$th subject in the $j$th group. In the typical application of the model, it is assumed that the $\epsilon_{ij}$s are normally and independently distributed and that the treatment group variances ($\sigma_j^2$s) are equal. Relevant sample estimates include

$$\hat{\mu}_j = \bar{Y}_j = \sum_{i=1}^{n} \frac{Y_{ij}}{n} \quad \text{and} \quad \hat{\sigma}^2 = \mathrm{MSE} = \sum_{j=1}^{J}\sum_{i=1}^{n} \frac{(Y_{ij} - \bar{Y}_j)^2}{J(n-1)}. \qquad (2)$$
Pairwise Comparisons

A confidence interval for a pairwise difference $\mu_j - \mu_{j'}$ is

$$\bar{Y}_j - \bar{Y}_{j'} \pm c_\alpha \hat{\sigma}\sqrt{\frac{2}{n}}, \qquad (3)$$

where $c_\alpha$ is selected such that FWE = $\alpha$. In the case of all possible pairwise comparisons, one needs a $c_\alpha$ for the set such that the intervals simultaneously surround the true differences with a specified level of significance. That is, for all $j \neq j'$, $c_\alpha$ must satisfy

$$P\!\left(\bar{Y}_j - \bar{Y}_{j'} - c_\alpha \hat{\sigma}\sqrt{\frac{2}{n}} \;\leq\; \mu_j - \mu_{j'} \;\leq\; \bar{Y}_j - \bar{Y}_{j'} + c_\alpha \hat{\sigma}\sqrt{\frac{2}{n}}\right) = 1 - \alpha. \qquad (4)$$

A hypothesis for the comparison ($H_c: \mu_j - \mu_{j'} = 0$) can be examined with the test statistic

$$t_c = \frac{\bar{Y}_j - \bar{Y}_{j'}}{(2\,\mathrm{MSE}/n)^{1/2}}. \qquad (5)$$
MCPs that assume normally distributed data and homogeneity of variances are given below.

Tukey. Tukey [60] proposed a STP for all pairwise comparisons. Tukey's MCP uses a critical value obtained from the Studentized range distribution. In particular, statistical significance, with FWE control, is assessed by comparing

$$|t_c| \geq \frac{q_{(J,\,J(n-1))}}{\sqrt{2}}, \qquad (6)$$
where q is a value from the Studentized range distribution (see [34]) based on J means and J (n − 1) error degrees of freedom. Tukey’s procedure can
be implemented in SAS's [52] generalized linear model (GLM) program.

Tukey–Kramer [35]. It is also important to note that Tukey's method, as well as other MCPs, can be utilized when group sizes are unequal. A pairwise test statistic for the unequal sample-size case would be

$$t_{j,j'} = \frac{\bar{Y}_j - \bar{Y}_{j'}}{\sqrt{\mathrm{MSE}\,(1/n_j + 1/n_{j'})}} \quad (j \neq j'). \qquad (7)$$

Accordingly, the significance of a pairwise difference is assessed by comparing

$$|t_{j,j'}| > \frac{q_{(J,\;\sum_j (n_j - 1))}}{\sqrt{2}}. \qquad (8)$$
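For readers working outside SAS, the Tukey–Kramer test in (7) and (8) can be computed directly from the Studentized range distribution. A minimal Python sketch follows (assuming SciPy 1.7 or later, which provides scipy.stats.studentized_range; the data and names are illustrative).

```python
import numpy as np
from itertools import combinations
from scipy.stats import studentized_range

def tukey_kramer(groups, alpha=0.05):
    """All pairwise comparisons via equations (7)-(8); returns (j, k, t, reject)."""
    groups = [np.asarray(g, dtype=float) for g in groups]
    J = len(groups)
    n = np.array([len(g) for g in groups])
    means = np.array([g.mean() for g in groups])
    df_error = int((n - 1).sum())
    mse = sum(((g - g.mean()) ** 2).sum() for g in groups) / df_error
    crit = studentized_range.ppf(1 - alpha, J, df_error) / np.sqrt(2)
    results = []
    for j, k in combinations(range(J), 2):
        t_jk = (means[j] - means[k]) / np.sqrt(mse * (1 / n[j] + 1 / n[k]))
        results.append((j, k, t_jk, abs(t_jk) > crit))
    return results

rng = np.random.default_rng(42)
data = [rng.normal(mu, 1.0, size) for mu, size in [(0, 12), (0.2, 10), (1.5, 15)]]
for j, k, t, sig in tukey_kramer(data):
    print(f"groups {j} vs {k}: t = {t:.2f}, reject = {sig}")
```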
Hayter [17] proved that the Tukey–Kramer MCP only approximately controls the FWE – the rate is slightly conservative, that is, the true rate of Type I error will be less than the significance level. The GLM procedure in SAS will automatically compute the Kramer version of Tukey’s test when group sizes are unequal. Fisher–Hayter’s Two-stage MCP. Fisher [12] proposed conducting multiple t Tests on the C pairwise comparisons following rejection of the omnibus ANOVA null hypothesis (see [29], [34]). The pairwise null hypotheses are assessed for statistical significance by referring tc to t(α/2,ν) , where t(α/2,ν) is the upper 100(1 − α/2) percentile from Student’s distribution with parameter ν. If the ANOVA F is nonsignificant, comparisons among means are not conducted; that is, the pairwise hypotheses are retained as null. It should be noted that Fisher’s [12] least significant difference (LSD) procedure only provides Type I error protection via the level of significance associated with the ANOVA null hypothesis, that is, the complete null hypothesis. For other configurations of means not specified under the ANOVA null hypothesis (e.g., µ1 = µ2 = · · · = µJ −1 << µJ ), the rate of family-wise Type I error can be much in excess of the level of significance (Hayter [18]; Hochberg & Tamhane [20]; Keselman et al. [29]; Ryan, [51]). Hayter [18] proposed a modification to Fisher’s LSD that would provide strong control over FWE. Like the LSD procedure, no comparisons are tested
unless the omnibus test is significant. If the omnibus test is significant, then $H_c$ is rejected if

$$|t_c| > \frac{q_{(J-1,\,\nu)}}{\sqrt{2}}. \qquad (9)$$
Studentized range critical values can be obtained through SASs PROBMC (see [65], p. 46). It should be noted that many authors recommend Fisher’s two-stage test for pairwise comparisons when J = 3 (see [27], [37]). These recommendations are based on Type I error control, power and ease of computation issues. Hochberg’s Sequentially Acceptive Step-up Bonferroni Procedure. In this procedure [19], the P values corresponding to the m statistics (e.g., tc ) for testing the hypotheses H1 , . . . , Hm are ordered from smallest to largest. Then, for any i = m, m − 1, . . . , 1, if pi ≤ α/(m − i + 1), the Hochberg procedure rejects all Hi (i ≤ i). According to this procedure, therefore, one begins by assessing the largest P value, pm . If pm ≤ α, all hypotheses are rejected. If pm > α, then Hm is accepted and one proceeds to compare p(m−1) to α/2. If p(m−1) ≤ α/2, then all Hi (i = m − 1, . . . , 1) are rejected; if not, then H(m−1) is accepted and one proceeds to compare p(m−2) with α/3, and so on. Shaffer’s Sequentially Rejective Bonferroni Procedure that Begins with an Omnibus Test. Like the preceding procedure, the P values associated with the test statistics are rank ordered. In Shaffer’s procedure [57], however, one begins by comparing the smallest P value, p1 , to α/m. If p1 > α/m, statistical testing stops and all pairwise contrast hypotheses (Hi , 1 ≤ i ≤ m) are retained; on the other hand, if p1 ≤ α/m, H1 is rejected and one proceeds to test the remaining hypotheses in a similar step-down fashion by comparing the associated P values to α/m∗ , where m∗ is equal to the maximum number of true null hypotheses, given the number of hypotheses rejected at previous steps. Appropriate denominators for each α-stage test for designs containing up to ten treatment levels can be found in Shaffer’s Table 2. Shaffer [57] proposed a modification to her sequentially rejective Bonferroni procedure which involves beginning this procedure with an omnibus test. (Though MCPs that begin with an omnibus test frequently are presented with the F test, other
omnibus tests (e.g., a range statistic) can also be applied to these MCPs.) If the omnibus test is declared nonsignificant, statistical testing stops and all pairwise differences are declared nonsignificant. On the other hand, if one rejects the omnibus null hypothesis, one proceeds to test pairwise contrasts using the sequentially rejective Bonferroni procedure previously described, with the exception that the smallest P value is compared to a significance level which reflects the information conveyed by the rejection of the omnibus null hypothesis. For example, for m = 6, rejection of the omnibus null hypothesis implies at least one inequality of means and, therefore, the smallest P value is compared to α/3, rather than α/6.

Benjamini and Hochberg's (BH) FDR Procedure. In this procedure [2], the P values corresponding to the m pairwise statistics for testing the hypotheses $H_1, \ldots, H_m$ are ordered from smallest to largest, that is, $p_1 \leq p_2 \leq \cdots \leq p_m$, where $m = J(J-1)/2$. Let k be the largest value of i for which $p_i \leq (i/m)\alpha$; then reject all $H_i$, $i = 1, 2, \ldots, k$. According to this procedure, one begins by assessing the largest P value, $p_m$, proceeding to smaller P values as long as $p_i > (i/m)\alpha$. Testing stops when $p_k \leq (k/m)\alpha$.

Benjamini and Hochberg [3] also presented a modified (adaptive) (BH-A) version of their original procedure that utilizes the data to estimate the number of true $H_c$s. [The adaptive BH procedure has only been demonstrated, not proven, to control FDR, and only in the independent case.] With the original procedure, when the number of true null hypotheses ($C_T$) is less than the total number of hypotheses, the FDR is controlled at a level less than that specified (α). To compute the Benjamini and Hochberg [2] procedure, the $p_c$ values are ordered (smallest to largest) $p_1, \ldots, p_C$, and for any $c = C, C-1, \ldots, 1$, if $p_c \leq \alpha(c/C)$, reject all $H_{c'}$ ($c' \leq c$). If all $H_c$s are retained, testing stops. If any $H_c$ is rejected with the criterion of the BH procedure, then testing continues by estimating the slopes $S_c = (1 - p_c)/(C + 1 - c)$, where $c = 1, \ldots, C$. Then, for any $c = C, C-1, \ldots, 1$, if $p_c \leq \alpha(c/\hat{C}_T)$, reject all $H_{c'}$ ($c' \leq c$), where $\hat{C}_T = \min[(1/S^*) + 1, C]$, $[x]$ is the largest integer less than or equal to x, and $S^*$ is the minimum value of $S_c$ such that $S_c < S_{c-1}$. If all $S_c > S_{c-1}$, $S^*$ is set at C.

One disadvantage of the BH-A procedure, noted by both Benjamini and Hochberg [3] and Holland and Cheung [21], is that it is possible for an $H_c$ to be rejected with $p_c > \alpha$. Therefore, it is suggested by both sets of authors that $H_c$ only be rejected if: (a) the hypothesis satisfies the rejection criterion of the BH-A; and (b) $p_c \leq \alpha$.

To illustrate this procedure, assume a researcher has conducted a study with J = 4 and α = .05. The ordered P values associated with the C = 6 pairwise comparisons are: $p_1 = .0014$, $p_2 = .0044$, $p_3 = .0097$, $p_4 = .0145$, $p_5 = .0490$, and $p_6 = .1239$. The first stage of the BH-A procedure would involve comparing $p_6 = .1239$ to $\alpha(c/C) = .05(6/6) = .05$. Since .1239 > .05, the procedure would continue by comparing $p_5 = .0490$ to $\alpha(c/C) = .05(5/6) = .0417$. Again, since .0490 > .0417, the procedure would continue by comparing $p_4 = .0145$ to $\alpha(c/C) = .05(4/6) = .0333$. Since .0145 < .0333, $H_4$ would be rejected. Because at least one $H_c$ was rejected during the first stage, testing continues by estimating each of the slopes, $S_c = (1 - p_c)/(C + 1 - c)$, for $c = 1, \ldots, C$. The calculated slopes for this example are: $S_1 = .1664$, $S_2 = .1991$, $S_3 = .2475$, $S_4 = .3285$, $S_5 = .4755$, and $S_6 = .8761$. Given that all $S_c > S_{c-1}$, $S^*$ is set at C = 6. The estimated number of true nulls is then determined by $\hat{C}_T = \min[(1/S^*) + 1, C] = \min[(1/6) + 1, 6] = \min[1.1667, 6] = 1$. Therefore, the BH-A procedure would compare $p_6 = .1239$ to $\alpha(c/\hat{C}_T) = .05(6/1) = .30$. Since .1239 < .30, but .1239 > α, $H_6$ would not be rejected and the procedure would continue by comparing $p_5 = .0490$ to $\alpha(c/\hat{C}_T) = .05(5/1) = .25$. Since .0490 < .25 and .0490 < α, $H_5$ would be rejected; in addition, all remaining $H_c$ would also be rejected (i.e., $H_1$, $H_2$, $H_3$, and $H_4$).

Stepwise MCPs Based on the Closure Principle. As we indicated previously, researchers can adopt stepwise procedures when examining all possible pairwise comparisons, and typically they provide greater sensitivity to detect differences than do STPs, for example, Tukey's [60] method, while still maintaining strong FWE control. In this section, we present some methods related to closed-testing sequential MCPs that can be obtained through the SAS system of programs. As Westfall et al. [65, p. 150] note, it was in the past two decades (from their 1999 publication) that a unified approach to stepwise testing has evolved.
Multiple Comparison Procedures The unifying concept has been the closure principle. MCPs based on this principle have been designated as closed-testing procedures. These methods are designated as closed-testing procedures because they address families of hypotheses that are closed under intersection (∩). By definition, a closed family ‘is one for which any subset intersection hypothesis involving members of the family of tests is also a member of the family’. The closed-testing principle has led to a way of performing multiple tests of significance such that FWE is strongly controlled with results that are coherent. A coherent MCP is one that avoids inconsistencies in that it will not reject a hypothesis without rejecting all hypotheses implying it (see [20], pp. 44–45). Because closed-testing procedures were not always easy to derive, various authors derived other simplified stepwise procedures which are computationally simpler, though at the expense of providing smaller α values than what theoretically could be obtained with a closed-testing procedure. Naturally, as a consequence of having smaller α values (Type I errors are too tightly controlled), these simpler stepwise MCPs would not be as powerful as exact closed-testing methods. Nonetheless, these methods are still typically more powerful than STPs (e.g., Tukey) and therefore are recommended and furthermore, researchers can obtain numerical results through the SAS system. REGWQ. One such method was introduced by Ryan [49], Einot and Gabriel [11] and Welsch [64] and is available through SAS. One can better understand the logic of the REGWQ procedure if we first introduce one of the most popular stepwise strategies for examining pairwise differences between means, the Newman-Keuls (NK) procedure. In this procedure, the means are rank ordered from smallest to largest and the difference between the smallest and largest means is first subjected to a statistical test, typically with a range statistic (Q), at an α level of significance. If this difference is not significant, testing stops and all pairwise differences are regarded as null. If, on the other hand, this first range test is statistically significant, one ‘steps-down’ to examine the two J − 1 subsets of ordered means, that is, the smallest mean versus the next-to-largest mean and the largest mean versus the next-to-smallest mean, with each tested at an α level of significance. At each stage of testing, only subsets of ordered
means that are statistically significant are subjected to further testing. Although the NK procedure is very popular among applied researchers, it is becoming increasingly well known that when J > 3 it does not limit the FWE to α (see [20], p. 69). Ryan [49] and Welsch [64], however, have shown how to adjust the subset levels of significance in order to provide strong FWE control. Specifically, in order to strongly control FWE a researcher must:

• Test all subset ($p = 2, \ldots, J$) hypotheses at $\alpha_p = 1 - (1 - \alpha)^{p/J}$ for $p = 2, \ldots, J-2$, and at level $\alpha_p = \alpha$ for $p = J-1, J$ (these levels are tabulated in the short sketch following this list).
• Start testing with an examination of the complete null hypothesis $\mu_1 = \mu_2 = \cdots = \mu_J$, and if it is rejected, step down to examine subsets of $J-1$ means, $J-2$ means, and so on.
• Accept as null, without testing, all subset hypotheses implied by a homogeneity hypothesis that has not been rejected.
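The stepwise significance levels in the first step are simple to tabulate; a short illustrative Python sketch for J = 5 and α = .05:

```python
alpha, J = 0.05, 5

# Ryan-Einot-Gabriel-Welsch subset levels: alpha_p for subsets of p means
alpha_p = {p: (alpha if p >= J - 1 else 1 - (1 - alpha) ** (p / J))
           for p in range(2, J + 1)}
for p, a in sorted(alpha_p.items()):
    print(f"subsets of size {p}: alpha_p = {a:.4f}")
```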
REGWQ can be implemented with the SAS GLM program. We remind the reader, however, that this procedure cannot be used to construct simultaneous confidence intervals.
MCPs that Allow Variances to be Heterogeneous The previously presented procedures assume that the population variances are equal across treatment conditions. Given available knowledge about the nonrobustness of MCPs with conventional test statistics (e.g., t, F), and evidence that population variances are commonly unequal (see [28], [67]), researchers who persist in applying MCPs with conventional test statistics increase the risk of Type I errors. As Olejnik and Lee [43, p. 14] conclude, ‘most applied researchers are unaware of the problem [of using conventional test statistics with heterogeneous variances (see Heteroscedasticity and Complex Variation) and probably are unaware of the alternative solutions when variances differ’. Although recommendations in the literature have focused on the Games–Howell [14], or Dunnett [10] procedures for designs with unequal σj2 s (e.g., see [34], [59]), sequential procedures can provide more power than STPs while generally controlling the FWE (see [24], [36]).
The SAS software can once again be used to obtain numerical results. In particular, Westfall et al. [65, pp. 206–207] provide SAS programs for logically constrained step-down pairwise tests when heteroscedasticity exists. The macro uses SAS's mixed-model program (PROC MIXED), which allows for a nonconstant error structure across groups. As well, the program adopts the Satterthwaite [53] solution for error df. Westfall et al. remind the reader that the solution requires large data sets in order to provide approximately correct FWE control.

It is important to note that other non-SAS solutions are possible in the heteroscedastic case. For completeness, we note how these can be obtained. Specifically, sequential procedures based on the usual $t_c$ statistic can be easily modified for unequal $\sigma_j^2$s (and unequal $n_j$s) by substituting Welch's [62] statistic, $t_W(\nu_W)$, for $t_c(\nu)$, where

$$t_W = \frac{\bar{Y}_j - \bar{Y}_{j'}}{\sqrt{\dfrac{s_j^2}{n_j} + \dfrac{s_{j'}^2}{n_{j'}}}}, \qquad (10)$$

and $s_j^2$ and $s_{j'}^2$ represent the sample variances for the $j$th and $j'$th group, respectively. This statistic is approximated as a t variate with critical value $t_{(1-\alpha/2,\,\nu_W)}$, the $100(1-\alpha/2)$ quantile of Student's t distribution with df

$$\nu_W = \frac{\left(\dfrac{s_j^2}{n_j} + \dfrac{s_{j'}^2}{n_{j'}}\right)^2}{\dfrac{(s_j^2/n_j)^2}{n_j - 1} + \dfrac{(s_{j'}^2/n_{j'})^2}{n_{j'} - 1}}. \qquad (11)$$

For procedures simultaneously comparing more than two means, or when an omnibus test statistic is required (protected tests), robust alternatives to the usual analysis of variance (ANOVA) F statistic have been suggested. Possibly the best-known robust omnibus test is that due to Welch [63]. With the Welch procedure, the omnibus null hypothesis is rejected if $F_W > F_{(J-1,\,\nu_w)}$, where

$$F_W = \frac{\displaystyle\sum_{j=1}^{J} w_j(\bar{Y}_j - \tilde{Y})^2/(J-1)}{1 + \dfrac{2(J-2)}{J^2 - 1}\displaystyle\sum_{j=1}^{J}\frac{(1 - w_j/\sum_j w_j)^2}{n_j - 1}}, \qquad (12)$$

where $w_j = n_j/s_j^2$ and $\tilde{Y} = \sum_j w_j\bar{Y}_j/\sum_j w_j$. The statistic is approximately distributed as an F variate and is referred to the critical value $F_{(1-\alpha,\,J-1,\,\nu_w)}$, the $100(1-\alpha)$ quantile of the F distribution with $J-1$ and $\nu_w$ df, where

$$\nu_w = \frac{J^2 - 1}{3\displaystyle\sum_{j=1}^{J}\dfrac{(1 - w_j/\sum_j w_j)^2}{n_j - 1}}. \qquad (13)$$
The Welch test has been found to be robust for largest to smallest variance ratios less than 10:1 (Wilcox, Charlin & Thompson [71]). On the basis of the preceding, one can use the nonpooled Welch test and its accompanying df to obtain various stepwise MCPs. For example, Keselman et al. [27] verified that one can use this approach with Hochberg’s [19] step-up Bonferroni MCP (see Westfall et al. [65], pp. 32–33) as well as with Benjamini and Hochberg’s [2] FDR method to conduct all possible pairwise comparisons in the heteroscedastic case.
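A nonpooled (Welch) pairwise approach of the kind just described can be scripted directly. The following sketch (illustrative only, not the SAS macro of Westfall et al.) computes the statistic in (10), its df in (11), and applies Hochberg's step-up criterion to the resulting P values.

```python
import numpy as np
from itertools import combinations
from scipy.stats import t as t_dist

def welch_pairwise_pvalues(groups):
    """Two-sided p values for all pairwise comparisons using equations (10)-(11)."""
    out = []
    for (j, yj), (k, yk) in combinations(enumerate(map(np.asarray, groups)), 2):
        vj, vk = yj.var(ddof=1) / len(yj), yk.var(ddof=1) / len(yk)
        tw = (yj.mean() - yk.mean()) / np.sqrt(vj + vk)
        df = (vj + vk) ** 2 / (vj ** 2 / (len(yj) - 1) + vk ** 2 / (len(yk) - 1))
        out.append(((j, k), 2 * t_dist.sf(abs(tw), df)))
    return out

def hochberg_stepup(pvalues, alpha=0.05):
    """Hochberg's step-up rule: reject the hypotheses with the i smallest p values,
    where i is the largest index with p_(i) <= alpha / (m - i + 1)."""
    order = np.argsort(pvalues)
    p_sorted = np.asarray(pvalues)[order]
    m = len(pvalues)
    reject_upto = 0
    for i in range(m, 0, -1):                 # start from the largest p value
        if p_sorted[i - 1] <= alpha / (m - i + 1):
            reject_upto = i
            break
    reject = np.zeros(m, dtype=bool)
    reject[order[:reject_upto]] = True
    return reject

rng = np.random.default_rng(7)
groups = [rng.normal(0, 1, 15), rng.normal(0, 3, 10), rng.normal(1.5, 2, 20)]
pairs, pvals = zip(*welch_pairwise_pvalues(groups))
print(dict(zip(pairs, hochberg_stepup(list(pvals)))))
```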
MCPs That Can be Used When Data are Nonnormal An underlying assumption of all of the previously presented MCPs is that the populations from which the data are sampled are normally distributed; this assumption, however, may rarely be accurate (see [39], [44], [68]) (Tukey [61] suggests that most populations are skewed and/or contain outliers). Researchers falsely assuming normally distributed data risk obtaining biased Type I and/or Type II error rates for many patterns of nonnormality, especially when other assumptions are also not satisfied (e.g., variance homogeneity) (see [70]). Bootstrap and Permutation Tests. The SAS system allows users to obtain both simultaneous and stepwise pairwise comparisons of means with methods that do not presume normally distributed data. In particular, users can use either bootstrap or permutation methods to compute all possible pairwise comparisons. Bootstrapping allows users to create their own empirical distribution of the data and hence P values are accordingly based on the empirically obtained distribution, not a theoretically presumed distribution. , is For example, the empirical distribution, say F obtained by sampling, with replacement, the pooled
Multiple Comparison Procedures sample residuals ij = Yij − µj = Yij − Y¯j . That is, rather than assume that residuals are normally distributed, one uses empirically generated residuals to estimate the true shape of the distribution. From the pooled sample residuals one generates bootstrap data. An example program for all possible pairwise comparisons is given by Westfall et al. [65, p. 229]. As well, pairwise comparisons of means (or ranks) can be obtained through permutation of the data with the program provided by Westfall et al. [65, pp. 233–234]. Permutation tests also do not require that the data be normally distributed. Instead of resampling with replacement from a pooled sample of residuals, permutation tests take the observed data (Y11 , . . . , Yn1 1 , . . . , Y1J , . . . , YnJ J ) and randomly redistributes them to the treatment groups, and summary statistics (i.e., means or ranks) are then computed on the randomly redistributed data. The original outcomes (all possible pairwise differences from the original sample means) are then compared to the randomly generated values (e.g., all possible pairwise differences in the permutation samples). When users adopt this approach to combat the effects of nonnormality they should take heed of the cautionary note provided by Westfall et al. [65, p. 234], namely, the procedure may not control the FWE when the data have heterogeneous variances, particularly when group sizes are unequal. Thus, we introduce another approach, pairwise comparisons based on robust estimators and a heteroscedastic statistic, an approach that has been demonstrated to generally control the FWE when data are nonnormal and heterogeneous even when group sizes are unequal.
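Before turning to that approach, here is a minimal sketch of the resampling idea just described. It is an illustration of the general strategy only (not the Westfall et al. programs): residuals are pooled and resampled with replacement under the complete null, and the maximum absolute Welch-type pairwise statistic from each bootstrap sample supplies an FWE-controlling critical value.

```python
import numpy as np
from itertools import combinations

def max_abs_pairwise_t(groups):
    """Largest |t| over all Welch-type pairwise statistics."""
    stats = []
    for yj, yk in combinations(groups, 2):
        se = np.sqrt(yj.var(ddof=1) / len(yj) + yk.var(ddof=1) / len(yk))
        stats.append(abs(yj.mean() - yk.mean()) / se)
    return max(stats)

def bootstrap_critical_value(groups, n_boot=5000, alpha=0.05, seed=0):
    """Critical value for max|t| from residuals resampled under the complete null."""
    rng = np.random.default_rng(seed)
    residuals = np.concatenate([g - g.mean() for g in groups])
    sizes = [len(g) for g in groups]
    max_stats = []
    for _ in range(n_boot):
        sample = rng.choice(residuals, size=sum(sizes), replace=True)
        boot_groups, start = [], 0
        for n in sizes:
            boot_groups.append(sample[start:start + n])
            start += n
        max_stats.append(max_abs_pairwise_t(boot_groups))
    return float(np.quantile(max_stats, 1 - alpha))

rng = np.random.default_rng(3)
groups = [rng.normal(m, s, n) for m, s, n in [(0, 1, 12), (0.5, 2, 9), (2, 1, 15)]]
crit = bootstrap_critical_value(groups)
print(f"max |t| = {max_abs_pairwise_t(groups):.2f}, bootstrap critical value = {crit:.2f}")
```

Each observed pairwise |t| would then be compared to the bootstrap critical value; pairs exceeding it are declared significant.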
MCPs That Can be Used When Data are Neither Normal Nor When Variances are Heterogeneous Trimmed Means Approach. A different type of testing procedure, based on trimmed (or censored) means (see Trimmed Means), has been discussed by Yuen and Dixon [73] and Wilcox [69, 70], and is purportedly robust to violations of normality. That is, it is well known that the usual group means and variances, which are the bases for all of the previously described procedures, are greatly influenced by the presence of extreme observations
in distributions. In particular, the standard error of the usual mean can become seriously inflated when the underlying distribution has heavy tails. Accordingly, adopting a nonrobust measure 'can give a distorted view of how the typical individual in one group compares to the typical individual in another, and about accurate probability coverage, controlling the probability of a Type I error, and achieving relatively high power' [69, p. 66]. By substituting robust measures of location and scale for the usual mean and variance, it should be possible to obtain test statistics that are insensitive to the combined effects of variance heterogeneity and nonnormality. While a wide range of robust estimators have been proposed in the literature (see [15]), the trimmed mean and Winsorized variance (see Winsorized Robust Measures) are intuitively appealing because of their computational simplicity and good theoretical properties [69]. The standard error of the trimmed mean is less affected by departures from normality than the usual standard error of the mean because extreme observations, that is, observations in the tails of a distribution, are censored or removed. Trimmed means are computed by removing a percentage of observations from each of the tails of a distribution (set of observations). Let $Y_{(1)} \leq Y_{(2)} \leq \cdots \leq Y_{(n)}$ represent the ordered observations associated with a group. Let $g = [\gamma n]$, where $\gamma$ represents the proportion of observations that are to be trimmed in each tail of the distribution and $[x]$ is notation for the largest integer not exceeding x. Wilcox [69] suggests that 20% trimming should be used. The effective sample size becomes $h = n - 2g$. Then the sample trimmed mean is

$$\bar{Y}_t = \frac{1}{h}\sum_{i=g+1}^{n-g} Y_{(i)}. \qquad (14)$$
An estimate of the standard error of the trimmed mean is based on the Winsorized mean and Winsorized sum of squares (see Winsorized Robust Measures). The sample Winsorized mean is

$$\bar{Y}_W = \frac{1}{n}\left[(g+1)Y_{(g+1)} + Y_{(g+2)} + \cdots + Y_{(n-g-1)} + (g+1)Y_{(n-g)}\right], \qquad (15)$$

and the sample Winsorized sum of squared deviations is

$$SSD_W = (g+1)(Y_{(g+1)} - \bar{Y}_W)^2 + (Y_{(g+2)} - \bar{Y}_W)^2 + \cdots + (Y_{(n-g-1)} - \bar{Y}_W)^2 + (g+1)(Y_{(n-g)} - \bar{Y}_W)^2. \qquad (16)$$
Accordingly, the sample Winsorized variance is $\hat{\sigma}_W^2 = SSD_W/(n-1)$, and the squared standard error of the mean is estimated as [58]

$$d = \frac{(n-1)\,\hat{\sigma}_W^2}{h(h-1)}. \qquad (17)$$
To test a pairwise comparison null hypothesis, compute $\bar{Y}_t$ and $d$ for the $j$th group, and label the results $\bar{Y}_{tj}$ and $d_j$. The robust pairwise test (see Keselman, Lix & Kowalchuk [31]) becomes

$$t_W = \frac{\bar{Y}_{tj} - \bar{Y}_{tj'}}{\sqrt{d_j + d_{j'}}}, \qquad (18)$$

with estimated df

$$\nu_W = \frac{(d_j + d_{j'})^2}{d_j^2/(h_j - 1) + d_{j'}^2/(h_{j'} - 1)}. \qquad (19)$$
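The quantities in (14)–(19) take only a few lines of code. The sketch below (illustrative names, 20% trimming as suggested by Wilcox [69]) returns the trimmed mean, the squared standard error d, and the robust pairwise statistic with its df.

```python
import numpy as np
from scipy.stats import t as t_dist

def trimmed_quantities(y, gamma=0.20):
    """Trimmed mean (14), Winsorized variance, and squared SE d (17) for one group."""
    y = np.sort(np.asarray(y, dtype=float))
    n = len(y)
    g = int(gamma * n)                       # number trimmed in each tail
    h = n - 2 * g                            # effective sample size
    y_trim_mean = y[g:n - g].mean()          # equation (14)
    y_wins = y.copy()
    y_wins[:g] = y[g]                        # Winsorize the lower tail
    y_wins[n - g:] = y[n - g - 1]            # Winsorize the upper tail
    ssd_w = ((y_wins - y_wins.mean()) ** 2).sum()   # equations (15)-(16)
    var_w = ssd_w / (n - 1)                  # Winsorized variance
    d = (n - 1) * var_w / (h * (h - 1))      # equation (17)
    return y_trim_mean, d, h

def robust_pairwise_test(y1, y2, gamma=0.20):
    """Equations (18)-(19): Welch-type test on trimmed means."""
    m1, d1, h1 = trimmed_quantities(y1, gamma)
    m2, d2, h2 = trimmed_quantities(y2, gamma)
    tw = (m1 - m2) / np.sqrt(d1 + d2)
    df = (d1 + d2) ** 2 / (d1 ** 2 / (h1 - 1) + d2 ** 2 / (h2 - 1))
    return tw, df, 2 * t_dist.sf(abs(tw), df)

rng = np.random.default_rng(11)
a = rng.standard_t(df=3, size=25) + 1.0      # heavy-tailed data with a shift
b = rng.standard_t(df=3, size=30)
print(robust_pairwise_test(a, b))
```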
When trimmed means are being compared, the null hypothesis relates to the equality of populationtrimmed means, instead of population means. Therefore, instead of testing H0 : µj = µj , a researcher would test the null hypothesis, H0 : µtj = µtj , where µt represents the population-trimmed mean (Many researchers subscribe to the position that inferences pertaining to robust parameters are more valid than inferences pertaining to the usual least squares parameters, when they are dealing with populations that are nonnormal in form. Yuen and Dixon [73] and Wilcox [69] report that for long-tailed distributions, tests based on trimmed means and Winsorized variances can be much more powerful than tests based on the usual mean and variance. Accordingly, when researchers feel they are dealing with nonnormal data, they can replace the usual least squares estimators of central tendency and variability with robust estimators and apply these estimators in any of the previously recommended MCPs. A Model Testing Procedure. The procedure to be described takes a completely different approach to specifying differences between the treatment group means. That is, unlike previous approaches which rely on a test statistic to reject or accept pairwise null hypotheses, the approach to be described uses an information criterion statistic to select a configuration
of population means which most likely corresponds with the observed data. Thus, as Dayton [6, p. 145] notes, 'model-selection techniques are not statistical tests for which type I error control is an issue.' When testing all pairwise comparisons, intransitive decisions are extremely common with conventional MCPs [6]. An intransitive decision refers to declaring a population mean ($\mu_j$) not significantly different from each of two other population means ($\mu_j = \mu_{j'}$, $\mu_j = \mu_{j''}$), when the latter two means are themselves declared significantly different ($\mu_{j'} \neq \mu_{j''}$). For example, a researcher conducting all pairwise comparisons (J = 4) may decide not to reject any hypotheses implied by $\mu_1 = \mu_2 = \mu_3$ or $\mu_3 = \mu_4$, but reject $\mu_1 = \mu_4$ and $\mu_2 = \mu_4$, based on results from a conventional MCP. Interpreting the results of this experiment can be ambiguous, especially concerning the outcome for $\mu_3$. Dayton [6] proposed a model testing approach based on Akaike's Information Criterion (AIC) [1]. Mutually exclusive and transitive models are each evaluated using AIC, and the model having the minimum AIC is retained as the most probable population mean configuration, where

$$\mathrm{AIC} = SS_w + \sum_j n_j(\bar{Y}_j - \bar{Y}_{mj})^2 + 2q, \qquad (20)$$
$\bar{Y}_{mj}$ is the estimated sample mean for the $j$th group (given the hypothesized population mean configuration for the $m$th model), $SS_w$ is the ANOVA pooled within-group sum of squares, and q is the degrees of freedom for the model. For example, for J = 4 (with ordered means) there would be $2^{J-1} = 8$ different models to be evaluated ({1234}, {1, 234}, {12, 34}, {123, 4}, {1, 2, 34}, {12, 3, 4}, {1, 23, 4}, {1, 2, 3, 4}). To illustrate, the model {12, 3, 4} postulates a population mean configuration where groups one and two are derived from the same population, while groups three and four each represent independent populations. The model having the lowest AIC value would be retained as the most probable population model.

Dayton's AIC model-testing approach has the virtue of avoiding intransitive decisions. It is more powerful in the sense of all-pairs power than Tukey's MCP, which is not designed to avoid intransitive decisions. One finding reported by Dayton, as well as Huang and Dayton [25], is that the AIC has a slight bias for selecting more complicated models than the true model. For example, Dayton found that for the mean pattern {12, 3, 4}, AIC selected the more complicated pattern {1, 2, 3, 4} more than ten percent of the time, whereas AIC only rarely selected less complicated models (e.g., {12, 34}). This tendency can present a special problem for the complete null case {1234}, where AIC has a tendency to select more complicated models. Consequently, a recommendation by Huang and Dayton [25] is to use an omnibus test to screen for the null case, and then, assuming rejection of the null, apply the Dayton procedure.

Dayton's [6] model testing approach can be modified to handle heterogeneous treatment group variances. Like the original procedure, mutually exclusive and transitive models are each evaluated using AIC, and the model having the minimum AIC is retained as the most probable population mean configuration. For heterogeneous variances,

$$\mathrm{AIC} = -2\left[-\frac{N(\ln(2\pi) + 1)}{2} - \frac{1}{2}\sum_j n_j \ln(S_j)\right] + 2q, \qquad (21)$$

where $S_j$ is the biased variance for the $j$th group, substituting the estimated group mean (given the hypothesized mean configuration for the $m$th model) for the actual group mean in the calculation of the variance. As in the original Dayton procedure, an appropriate omnibus test can also be applied.
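A sketch of the homogeneous-variance version of this model-selection idea is given below. It is illustrative only: it enumerates the 2^(J-1) ordered-partition models and evaluates equation (20), with the model degrees of freedom q taken here as the number of distinct means in the model, which is an assumption on my part rather than a detail taken from the article.

```python
import numpy as np
from itertools import product

def ordered_partitions(J):
    """All 2**(J-1) ways of cutting J ordered groups into contiguous clusters."""
    for cuts in product([False, True], repeat=J - 1):
        clusters, current = [], [0]
        for j, cut in enumerate(cuts, start=1):
            if cut:
                clusters.append(current)
                current = []
            current.append(j)
        clusters.append(current)
        yield clusters

def dayton_aic(groups):
    """Evaluate equation (20) for every ordered-partition model; smaller is better."""
    groups = [np.asarray(g, dtype=float) for g in groups]
    order = np.argsort([g.mean() for g in groups])          # work with ordered means
    groups = [groups[i] for i in order]
    ss_w = sum(((g - g.mean()) ** 2).sum() for g in groups)
    results = []
    for model in ordered_partitions(len(groups)):
        penalty = 0.0
        for cluster in model:
            pooled = np.concatenate([groups[j] for j in cluster]).mean()
            penalty += sum(len(groups[j]) * (groups[j].mean() - pooled) ** 2
                           for j in cluster)
        q = len(model)                                       # assumed model df
        results.append((ss_w + penalty + 2 * q, model))
    return min(results)

rng = np.random.default_rng(5)
data = [rng.normal(mu, 1.0, 20) for mu in (0.0, 0.1, 1.0, 1.1)]
print(dayton_aic(data))   # expected to favour something like {12, 34}
```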
Complex Comparisons

To introduce some methods that can be adopted when investigating complex comparisons among treatment group means, we first expand on our introductory definitions. Specifically, we let

\psi = c_1\mu_1 + c_2\mu_2 + \cdots + c_J\mu_J,   (22)

where the coefficients (c_j's) defining the contrast sum to zero (i.e., \sum_{j=1}^{J} c_j = 0), represent the population complex contrast that we are interested in subjecting to a test of significance. To test H_0: \psi = 0, we replace the unknown population values with their least squares estimates, that is, the sample means, and subject the sample contrast estimate

\hat{\psi} = c_1\bar{Y}_1 + c_2\bar{Y}_2 + \cdots + c_J\bar{Y}_J   (23)

to a test with the statistic

t_{\hat{\psi}} = \frac{\hat{\psi}}{\sqrt{\mathrm{MSE}\sum_{j=1}^{J} c_j^2/n_j}}.   (24)
As we indicated in the beginning of our paper, a very popular method for examining complex contrasts is Scheffé's [55] method. Scheffé's method, as we indicated, is an STP method that provides FWE control, that is, a procedure that uses one critical value to assess the statistical significance of a set of complex comparisons. The simultaneous critical value is

\sqrt{(J-1)F_{1-\alpha; J-1, \nu}},   (25)

where F_{1-\alpha; J-1, \nu} is the 1 − α quantile from the sampling distribution of F based on J − 1 and ν numerator and denominator degrees of freedom (the numerator and denominator df associated with the omnibus ANOVA F test). Accordingly, one rejects the hypothesis H_0 when

t_{\hat{\psi}} \ge \sqrt{(J-1)F_{1-\alpha; J-1, \nu}}.   (26)

Bonferroni. Another very popular STP method for evaluating a set of m complex comparisons is to compare the P values associated with the t_ψ̂ statistics to the Dunn–Bonferroni critical value α/m, or one may refer |t_ψ̂| to the corresponding simultaneous critical value (see Kirk [34], p. 829 for the table of critical values). As has been pointed out, researchers can compare, a priori, the Scheffé [55] and Dunn–Bonferroni critical values, choosing the smaller of the two, in order to obtain the more powerful of the two STPs. That is, the statistical procedure with the smaller critical value will provide more statistical power to detect true complex comparison differences between the means. In general, if there are 20 or fewer comparisons, the Dunn–Bonferroni method will provide the smaller critical value and hence the more powerful approach, with the reverse being the case when there are more than 20 comparisons [65, p. 117]. One may also adopt stepwise MCPs when examining a set of m complex comparisons. These stepwise methods should result in greater sensitivity to detect effects than the Dunn–Bonferroni method. Many others have devised stepwise Bonferroni-type MCPs, for example, Holm [23], Holland and Copenhaver [22],
Hochberg [19], Rom [46], and so on. All of these procedures provide FWE control; the minor differences between them can result in small differences in power to detect effects. Thus, we recommend the Hochberg [19] sequentially acceptive step-up Bonferroni procedure previously described because it is simple to understand and implement. Finally, we note that the Scheffé [55], Dunn–Bonferroni [9], and Hochberg [19] procedures can be adapted to robust estimation and testing by adopting trimmed means and Winsorized variances instead of the usual least squares estimators. That is, to circumvent the biasing effects of nonnormality and variance heterogeneity, researchers can adopt a heteroscedastic Welch-type statistic and its accompanying modified degrees of freedom, applying robust estimators. For example, the heteroscedastic test statistic for the Dunn–Bonferroni [9] and Hochberg [19] procedures would be

t_{\hat{\psi}} = \frac{\hat{\psi}}{\sqrt{\sum_{j=1}^{J} c_j^2 s_j^2/n_j}},   (27)

where the statistic is approximately distributed as a Student t variable with

df_W = \frac{\left(\sum_{j=1}^{J} c_j^2 s_j^2/n_j\right)^2}{\sum_{j=1}^{J}\left[(c_j^2 s_j^2/n_j)^2/(n_j - 1)\right]}.   (28)
Accordingly, as we specified previously, one replaces the least squares means with trimmed means and the least squares variances with variances based on Winsorized sums of squares. That is,

t_{\hat{\psi}_t} = \frac{\hat{\psi}_t}{\sqrt{\sum_{j=1}^{J} c_j^2 d_j}},   (29)

where the sample comparison of trimmed means equals

\hat{\psi}_t = c_1\bar{Y}_{t1} + c_2\bar{Y}_{t2} + \cdots + c_J\bar{Y}_{tJ}   (30)

and the error degrees of freedom are given by

df_{Wt} = \frac{\left(\sum_{j=1}^{J} c_j^2 d_j\right)^2}{\sum_{j=1}^{J}\left[(c_j^2 d_j)^2/(h_j - 1)\right]}.   (31)
A robust Scheffé [55] procedure would be similarly implemented; however, one would use the heteroscedastic Brown and Forsythe [5] statistic (see Kirk [34], p. 155). It is important to note that, although we are unaware of any empirical investigations that have examined tests of complex contrasts with robust estimators, based on empirical investigations related to pairwise comparisons we believe these methods would provide approximate FWE control under conditions of nonnormality and variance heterogeneity and should possess more power to detect group differences than procedures based on least squares estimators.
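As a concrete illustration of (29) through (31), the following minimal Python sketch computes the trimmed-mean contrast, its Winsorized-variance standard error, and the approximate degrees of freedom. It is not the authors' SAS/IML or SPSS code; the function name, the 20 percent trimming proportion, and the example coefficients are assumptions made for illustration.

```python
# Minimal sketch of the heteroscedastic contrast test with trimmed means
# and Winsorized variances, following (29)-(31).
import numpy as np
from scipy import stats

def trimmed_contrast_test(groups, c, trim=0.20):
    """Welch-type test of H0: sum_j c_j * mu_tj = 0 for independent groups."""
    psi_t, terms = 0.0, []
    for cj, y in zip(c, groups):
        y = np.sort(np.asarray(y, dtype=float))
        n = len(y)
        g = int(np.floor(trim * n))            # observations trimmed per tail
        h = n - 2 * g                          # effective sample size
        t_mean = y[g:n - g].mean()             # trimmed mean
        w = np.concatenate(([y[g]] * g, y[g:n - g], [y[n - g - 1]] * g))
        d = (n - 1) * w.var(ddof=1) / (h * (h - 1))   # d_j of the text
        psi_t += cj * t_mean
        terms.append((cj ** 2 * d, h))
    se = np.sqrt(sum(t for t, _ in terms))
    t_stat = psi_t / se                                       # equation (29)
    df = se ** 4 / sum(t ** 2 / (h - 1) for t, h in terms)    # equation (31)
    p_value = 2 * stats.t.sf(abs(t_stat), df)
    return t_stat, df, p_value

# Example: contrasting group 1 with the average of groups 2 and 3 would use
# c = [1, -0.5, -0.5]; the resulting P values can then be referred to the
# Dunn-Bonferroni criterion or to Hochberg's step-up constants.
```

With 20 observations per group, 20 percent trimming removes four cases from each tail, which matches the 12 retained scores per group mentioned in the numerical example below.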
Extensions

We have presented newer MCPs within the context of a one-way completely randomized design, highlighting procedures that should be robust to variance heterogeneity and nonnormality. We presented these methods because we concur with others that behavioral science data are not likely to conform to the derivational assumptions (e.g., normality and variance homogeneity) that were adopted to derive the classical procedures, for example, [60] and [55]. For completeness, we note that robust procedures, that is, MCPs (omnibus tests as well) that employ a heteroscedastic test statistic and adopt robust estimators rather than the usual least squares estimators for the mean and variance, have also been enumerated within the context of factorial between-subjects and factorial between- by within-subjects repeated measures designs. Accordingly, applied researchers can note the generalization of the methods presented in our paper to those that have been presented within these more general contexts by Keselman and Lix [30], Lix and Keselman [38], Keselman [26], and Keselman, Wilcox, and Lix [32].
Numerical Example
Table 1  Data values and summary statistics (means and standard deviations)

   J1      J2      J3      J4      J5
   17      17      17      20      20
   15      14      14      15      23
   17      15      19      18      18
   13      12      15      20      26
   22      18      15      14      17
   14      18      19      18      14
   12      16      20      18      16
   15      18      18      16      32
   14      16      22      21      21
   16      20      16      23      23
   14      15      13      15      29
   15      19      16      22      21
   17      16      23      18      26
   11      16      22      19      22
   14      19      24      19      22
   13      13      20      23      17
   17      18      18      23      27
   15      16      25      18      18
   12      19      13      18      21
   17      16      24      23      18
 15.00   16.55   18.65   19.05   21.55
  2.47    2.11    3.79    2.82    4.64

Table 2  FWE (α = .05) significant (*) comparisons for the data from Table 1

µj − µj′    TK   HY   HC   SH   BH   BH-A   RG   BT   PM   TM
1 vs 2
1 vs 3       *    *    *    *    *    *      *    *    *    *
1 vs 4       *    *    *    *    *    *      *    *    *    *
1 vs 5       *    *    *    *    *    *      *    *    *    *
2 vs 3                                 *
2 vs 4                            *    *                     *
2 vs 5       *    *    *    *    *    *      *    *    *    *
3 vs 4
3 vs 5            *    *    *    *    *      *    *    *
4 vs 5                            *    *      *

Note: TK = Tukey (1953); HY = Hayter (1986); HC = Hochberg (1988); SH = Shaffer (1979); BT (Bootstrap) and PM (Permutation) = Westfall et al.; BH = Benjamini & Hochberg (1995); BH-A (Adaptive) = Benjamini & Hochberg (2002); RG = Ryan (1960)-Einot & Gabriel (1975)-Welsch (1977); TM = trimmed means (and Winsorized variances) used with a nonpooled t test and BH critical constants. Raw P values (1 vs 2, . . ., 4 vs 5) for the SAS (1999) procedures are .1406, .0007, .0002, < .0001, .0469, .0185, < .0001, .7022, .0065, and .0185. The corresponding values for the trimmed means tests are .0352, .0102, .0001, .0003, .1428, .0076, .0044, .6507, .1271, and .1660.

We present a numerical example for the previously discussed MCPs so that the reader can check his/her facility to work with the SAS/Westfall et al. [65] programs and to demonstrate through example the differences between their operating characteristics. (Readers can obtain a copy of the SAS [52] and SPSS [42] syntax that we used to obtain numerical results from the third author.) In particular, the data (n1 = n2 = · · · = nJ = 20) presented in Table 1 were randomly generated by us, though they could represent the outcomes of a problem-solving task where the five groups were given different clues to solve the problem; the dependent measure was the time, in seconds, that it took to solve the task. The bottom two rows of the table contain the group means and standard deviations, respectively. Table 2 contains FWE (α = .05) significant (*) values for the 10 pairwise comparisons for the five groups. The results reported in Table 2 generally conform, not surprisingly, to the properties of the MCPs that we discussed previously. In particular, of the ten comparisons, four were found to be statistically significant with the Tukey [60] procedure: µ1 − µ3, µ1 − µ4, µ1 − µ5, and µ2 − µ5. In addition to these four comparisons, the Hayter [18], Hochberg [19], Shaffer [57], bootstrap [see 65], and permutation [see 65] MCPs found one additional comparison (µ3 − µ5) to be statistically significant. The REGWQ procedure also identified these comparisons as significant plus one additional one, µ4 − µ5. The BH [2] and BH-A [3] MCPs detected these contrasts as well, plus one and two others, respectively. In particular, BH and BH-A also identified µ2 − µ4 as significant, while BH-A additionally declared µ2 − µ3 significant. Thus, out of the ten comparisons, BH-A declared eight to be statistically significant. Clearly the procedures based on the more liberal FDR found more comparisons to be statistically significant than the FWE-controlling MCPs. (Numerical results for BH-A were not obtained through SAS; they were obtained through hand calculations.) We also investigated the ten pairwise comparisons with the trimmed means and model-testing procedures; the results for the trimmed means analysis are also reported in Table 2. In particular, we computed the group trimmed means (Ȳt1 = 14.92, Ȳt2 = 16.67, Ȳt3 = 18.50, Ȳt4 = 19.08, and Ȳt5 = 21.08) as well as the group Winsorized variances (σ̂²W1 = 2.58, σ̂²W2 = 1.62, σ̂²W3 = 8.16, σ̂²W4 = 3.00, and σ̂²W5
= 10.26). These values can be obtained with the SAS/IML program discussed by Keselman et al. [32], or one can create a 'special' SPSS [42] data set to calculate nonpooled t statistics (t_W and ν_W) and their corresponding P values (through the ONEWAY program). (These programs can be obtained from the first author.) The results reported in Table 2 indicate that with this approach five comparisons were found to be statistically significant: µ1 − µ3, µ1 − µ4, µ1 − µ5, µ2 − µ4, and µ2 − µ5. Clearly, other MCPs had greater power to detect more pairwise differences. However, the reader should remember that robust estimation should result in more powerful tests when data are nonnormal as well as heterogeneous (see Wilcox [70]), which was not the case with our numerical example data. Furthermore, trimmed results were based on 12 subjects per group, not 20. With regard to the model-testing approach, we examined the 2^{J−1} models of nonoverlapping subsets of ordered means and used the minimum AIC value to find the best model, that is, the model that 'is expected to result in the smallest loss of precision relative to the true, but unknown, model' (Dayton [6], p. 145). From the 16 models examined, the two models with the smallest AIC values were {1, 2, 34, 5} (AIC = 527.5) and {12, 34, 5} (AIC = 527.8). The 'winning' model combines one pair, but clearly there is another model that is plausible given the data available. (Results were obtained through hand calculations. However, a GAUSS program is available from the Department of Measurement & Statistics, University of Maryland web site.) Though this ambiguity might
seem like a negative feature of the model-testing approach, Dayton [6] would maintain that being able to enumerate a set of conclusions (i.e., competing models) provides a broader, more comprehensive perspective regarding group differences than does the traditional approach. We also computed a set of complex contrasts (nine) among the five treatment group means to allow the reader to check his/her understanding of the computational operations associated with the MCPs that can be used when examining a set of complex contrasts. The population contrasts examined, their sample values, and the decisions regarding statistical significance (FWE = .05) are enumerated in Table 3. Again, results conform to the operating characteristics of the MCPs. In particular, the Scheffé [55] procedure found the fewest number of significant contrasts, Hochberg's [19] step-up Bonferroni procedure the most, and the number found to be significant according to the Dunn–Bonferroni [9] criterion was intermediate to Scheffé [55] and Hochberg [19]. When applying Hochberg's [19] MCP with trimmed means and Winsorized variances, six of the nine complex contrasts were found to be statistically significant.
Table 3  FWE (α = .05) significant (*) complex contrasts for the data from Table 1

Contrast (ψ)                              ψ̂       ψ̂t     Scheffé   Bonferroni   Hochberg   Hochberg/TM
.5µ1 + .5µ2 − .33µ3 − .33µ4 − .33µ5     −3.99   −3.78      *          *            *           *
µ1 − .33µ3 − .33µ4 − .33µ5              −4.77   −4.65      *          *            *           *
µ2 − .33µ3 − .33µ4 − .33µ5              −3.22   −2.90      *          *            *           *
µ3 − .5µ4 − .5µ5                        −1.65   −1.58
−.5µ3 + µ4 − .5µ5                       −1.05   −0.71
.5µ3 + .5µ4 − µ5                         2.70    2.29                 *            *
µ1 − .5µ2 − .5µ3                        −2.60   −2.67                 *            *           *
µ2 − .5µ3 − .5µ4                        −2.30   −2.13                              *
.5µ1 + .5µ2 − µ3                         2.88    2.71      *          *            *

Note: ψ̂ is the value of the contrast for the original means; ψ̂t is the value of the contrast for the trimmed means; Scheffé = Scheffé (1959); Hochberg = Hochberg (1988); Hochberg/TM = Hochberg FWE control with trimmed means (and Winsorized variances) and a nonpooled (Welch) t test. Raw P values are < .001, < .001, < .001, .071, .248, .004, .005, .012, and .002. The corresponding values for the trimmed means tests are .0000, .0000, .0010, .2324, .5031, .1128, .0043, .0124, and .0339.

References

[1] Akaike, H. (1974). A new look at the statistical model identification, IEEE Transactions on Automatic Control AC-19, 716–723.
[2] Benjamini, Y. & Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing, Journal of the Royal Statistical Society B 57, 289–300.
[3] Benjamini, Y. & Hochberg, Y. (2000). On the adaptive control of the false discovery rate in multiple testing with independent statistics, Journal of Educational and Behavioral Statistics 25, 60–83.
[4] Bofinger, E., Hayter, A.J. & Liu, W. (1993). The construction of upper confidence bounds on the range of several location parameters, Journal of the American Statistical Association 88, 906–911.
[5] Brown, M.B. & Forsythe, A.B. (1974). The ANOVA and multiple comparisons for data with heterogeneous variances, Biometrics 30, 719–724.
[6] Dayton, C.M. (1998). Information criteria for the paired-comparisons problem, The American Statistician 52, 144–151.
[7] Dudoit, S., Shaffer, J.P. & Boldrick, J.C. (2003). Multiple hypothesis testing in microarray experiments, Statistical Science 18, 71–103.
[8] Duncan, D.B. (1955). Multiple range and multiple F tests, Biometrics 11, 1–42.
[9] Dunn, O.J. (1961). Multiple comparisons among means, Journal of the American Statistical Association 56, 52–64.
[10] Dunnett, C.W. (1980). Pairwise multiple comparisons in the unequal variance case, Journal of the American Statistical Association 75, 796–800.
[11] Einot, I. & Gabriel, K.R. (1975). A study of the powers of several methods of multiple comparisons, Journal of the American Statistical Association 70, 574–583.
[12] Fisher, R.A. (1935). The Design of Experiments, Oliver & Boyd, Edinburgh.
[13] Games, P.A. (1971). Multiple comparisons of means, American Educational Research Journal 8, 531–565.
[14] Games, P.A. & Howell, J.F. (1976). Pairwise multiple comparison procedures with unequal n's and/or variances, Journal of Educational Statistics 1, 113–125.
[15] Gross, A.M. (1976). Confidence interval robustness with long-tailed symmetric distributions, Journal of the American Statistical Association 71, 409–416.
[16] Hancock, G.R. & Klockars, A.J. (1996). The quest for α: developments in multiple comparison procedures in the quarter century since Games (1971), Review of Educational Research 66, 269–306.
[17] Hayter, A.J. (1984). A proof of the conjecture that the Tukey-Kramer multiple comparisons procedure is conservative, Annals of Statistics 12, 61–75.
[18] Hayter, A.J. (1986). The maximum familywise error rate of Fisher's least significant difference test, Journal of the American Statistical Association 81, 1000–1004.
[19] Hochberg, Y. (1988). A sharper Bonferroni procedure for multiple tests of significance, Biometrika 75, 800–802.
[20] Hochberg, Y. & Tamhane, A.C. (1987). Multiple Comparison Procedures, John Wiley & Sons, New York.
[21] Holland, B. & Cheung, S.H. (2002). Family size robustness criteria for multiple comparison procedures, Journal of the Royal Statistical Society B 64, 63–77.
[22] Holland, B.S. & Copenhaver, M.D. (1987). An improved sequentially rejective Bonferroni test procedure, Biometrics 43, 417–423.
[23] Holm, S. (1979). A simple sequentially rejective multiple test procedure, Scandinavian Journal of Statistics 6, 65–70.
[24] Hsuing, T. & Olejnik, S. (1994). Power of pairwise multiple comparisons in the unequal variance case, Communications in Statistics: Simulation and Computation 23, 691–710.
[25] Huang, C.J. & Dayton, C.M. (1995). Detecting patterns of bivariate mean vectors using model-selection criteria, British Journal of Mathematical and Statistical Psychology 48, 129–147.
[26] Keselman, H.J. (1998). Testing treatment effects in repeated measure designs: an update for psychophysiological researchers, Psychophysiology 35, 470–478.
[27] Keselman, H.J., Cribbie, R.A. & Holland, B. (1999). The pairwise multiple comparison multiplicity problem: an alternative approach to familywise and comparisonwise Type I error control, Psychological Methods 4, 58–69.
[28] Keselman, H.J., Huberty, C.J., Lix, L.M., Olejnik, S., Cribbie, R., Donahue, B., Kowalchuk, R.K., Lowman, L.L., Petoskey, M.D., Keselman, J.C. & Levin, J.R. (1998). Statistical practices of educational researchers: an analysis of their ANOVA, MANOVA, and ANCOVA analyses, Review of Educational Research 68, 350–386.
[29] Keselman, H.J., Keselman, J.C. & Games, P.A. (1991). Maximum familywise Type I error rate: the least significant difference, Newman-Keuls, and other multiple comparison procedures, Psychological Bulletin 110, 155–161.
[30] Keselman, H.J. & Lix, L.M. (1995). Improved repeated measures stepwise multiple comparison procedures, Journal of Educational and Behavioral Statistics 20, 83–99.
[31] Keselman, H.J., Lix, L.M. & Kowalchuk, R.K. (1998). Multiple comparison procedures for trimmed means, Psychological Methods 3, 123–141.
[32] Keselman, H.J., Wilcox, R.R. & Lix, L.M. (2003). A generally robust approach to hypothesis testing in independent and correlated groups designs, Psychophysiology 40, 586–596.
[33] Keuls, M. (1952). The use of the "Studentized range" in connection with an analysis of variance, Euphytica 1, 112–122.
[34] Kirk, R.E. (1995). Experimental Design: Procedures for the Behavioral Sciences, Brooks/Cole Publishing Company, Toronto.
[35] Kramer, C.Y. (1956). Extension of the multiple range test to group means with unequal numbers of replications, Biometrics 12, 307–310.
[36] Kromrey, J.D. & La Rocca, M.A. (1994). Power and Type I error rates of new pairwise multiple comparison procedures under heterogeneous variances, Journal of Experimental Education 63, 343–362.
[37] Levin, J.R., Serlin, R.C. & Seaman, M.A. (1994). A controlled, powerful multiple-comparison strategy for several situations, Psychological Bulletin 115, 153–159.
[38] Lix, L.M. & Keselman, H.J. (1995). Approximate degrees of freedom tests: a unified perspective on testing for mean equality, Psychological Bulletin 117, 547–560.
[39] Micceri, T. (1989). The unicorn, the normal curve, and other improbable creatures, Psychological Bulletin 105, 156–166.
[40] Miller, R.G. (1981). Simultaneous Statistical Inference, 2nd Edition, Springer-Verlag, New York.
[41] Newman, D. (1939). The distribution of the range in samples from a normal population, expressed in terms of an independent estimate of standard deviation, Biometrika 31, 20–30.
[42] Norusis, M.J. (1997). SPSS 9.0 Guide to Data Analysis, Prentice Hall.
[43] Olejnik, S. & Lee, J. (1990). Multiple comparison procedures when population variances differ, Paper presented at the Annual Meeting of the American Educational Research Association, Boston.
[44] Pearson, E.S. (1931). The analysis of variance in cases of nonnormal variation, Biometrika 23, 114–133.
[45] Petrinovich, L.F. & Hardyck, C.D. (1969). Error rates for multiple comparison methods: some evidence concerning the frequency of erroneous conclusions, Psychological Bulletin 71, 43–54.
[46] Rom, D.M. (1990). A sequentially rejective test procedure based on a modified Bonferroni inequality, Biometrika 77, 663–665.
[47] Rothman, K. (1990). No adjustments are needed for multiple comparisons, Epidemiology 1, 43–46.
[48] Ryan, T.A. (1959). Multiple comparisons in psychological research, Psychological Bulletin 56, 26–47.
[49] Ryan, T.A. (1960). Significance tests for multiple comparison of proportions, variances, and other statistics, Psychological Bulletin 57, 318–328.
[50] Ryan, T.A. (1962). The experiment as the unit for computing rates of error, Psychological Bulletin 59, 305.
[51] Ryan, T.A. (1980). Comment on "Protecting the overall rate of Type I errors for pairwise comparisons with an omnibus test statistic", Psychological Bulletin 88, 354–355.
[52] SAS Institute Inc. (1999). SAS/STAT User's Guide, Version 7, SAS Institute, Cary.
[53] Satterthwaite, F.E. (1946). An approximate distribution of estimates of variance components, Biometrics Bulletin 2, 110–114.
[54] Saville, D.J. (1990). Multiple comparison procedures: the practical solution, The American Statistician 44, 174–180.
[55] Scheffé, H. (1959). The Analysis of Variance, Wiley.
[56] Seaman, M.A., Levin, J.R. & Serlin, R.C. (1991). New developments in pairwise multiple comparisons: some powerful and practicable procedures, Psychological Bulletin 110, 577–586.
[57] Shaffer, J.P. (1986). Modified sequentially rejective multiple test procedures, Journal of the American Statistical Association 81, 826–831.
[58] Staudte, R.G. & Sheather, S.J. (1990). Robust Estimation and Testing, Wiley, New York.
[59] Toothaker, L.E. (1991). Multiple Comparisons for Researchers, Sage Publications, Newbury Park.
[60] Tukey, J.W. (1953). The Problem of Multiple Comparisons, Princeton University, Department of Statistics, Unpublished manuscript.
[61] Tukey, J.W. (1960). A survey of sampling from contaminated normal distributions, in I. Olkin, S. Ghurye, W. Hoeffding, W. Madow & H. Mann, eds, Contributions to Probability and Statistics, Stanford University Press, Stanford.
[62] Welch, B.L. (1938). The significance of the difference between two means when population variances are unequal, Biometrika 38, 330–336.
[63] Welch, B.L. (1951). On the comparison of several mean values: an alternative approach, Biometrika 38, 330–336.
[64] Welsch, R.E. (1977). Stepwise multiple comparison procedures, Journal of the American Statistical Association 72, 566–575.
[65] Westfall, P.H., Tobias, R.D., Rom, D., Wolfinger, R.D. & Hochberg, Y. (1999). Multiple Comparisons and Multiple Tests, SAS Institute, Cary.
[66] Westfall, P.H. & Young, S.S. (1993). Resampling-Based Multiple Testing: Examples and Methods for p-Value Adjustment, Wiley, New York.
[67] Wilcox, R.R. (1988). A new alternative to the ANOVA F and new results on James' second order method, British Journal of Mathematical and Statistical Psychology 41, 109–117.
[68] Wilcox, R.R. (1990). Comparing the means of two independent groups, Biometrics Journal 32, 771–780.
[69] Wilcox, R.R. (1995). ANOVA: the practical importance of heteroscedastic methods, using trimmed means versus means, and designing simulation studies, British Journal of Mathematical and Statistical Psychology 48, 99–114.
[70] Wilcox, R.R. (1997). Three multiple comparison procedures for trimmed means, Biometrical Journal 37, 643–656.
[71] Wilcox, R.R., Charlin, V.L. & Thompson, K.L. (1986). New Monte Carlo results on the robustness of the ANOVA F, W and F* statistics, Communications in Statistics: Simulation and Computation 15, 933–943.
[72] Wilson, W. (1962). A note on the inconsistency inherent in the necessity to perform multiple comparisons, Psychological Bulletin 59, 296–300.
[73] Yuen, K.K. & Dixon, W.J. (1973). The approximate behavior of the two sample trimmed t, Biometrika 60, 369–374.
H.J. KESELMAN, BURT HOLLAND AND ROBERT A. CRIBBIE
Multiple Comparison Tests: Nonparametric and Resampling Approaches
H.J. KESELMAN AND RAND R. WILCOX
Volume 3, pp. 1325–1331 in Encyclopedia of Statistics in Behavioral Science ISBN-13: 978-0-470-86080-9 ISBN-10: 0-470-86080-4 Editors Brian S. Everitt & David C. Howell John Wiley & Sons, Ltd, Chichester, 2005
Multiple Comparison Tests: Nonparametric and Resampling Approaches
Introduction

An underlying assumption of classical multiple comparison procedures (MCPs) is that the populations from which the data are sampled are normally distributed. Additional assumptions are that the population variances are equal (the homogeneity of variances assumption) and that the errors or observations are independent from one another (the independence of observations assumption). Although it may be convenient (both practically and statistically) for researchers to assume that their samples are obtained from normally distributed populations, this assumption may rarely be accurate [21, 31]. Researchers falsely assuming normally distributed data risk obtaining biased tests and relatively high Type II error rates for many patterns of nonnormality, especially when other assumptions are also not satisfied (e.g., variance homogeneity) (see [31]). Inaccurate confidence intervals occur as well. The assumptions associated with the classical test statistics are typically associated with the following mathematical model that describes the sources that contribute to the magnitude of the dependent scores. Specifically, a mathematical model that can be adopted when examining pairwise and/or complex comparisons of means in a one-way completely randomized design is

Y_{ij} = \mu_j + \epsilon_{ij},   (1)

where Y_ij is the score of the ith subject (i = 1, . . . , n_j) in the jth (j = 1, . . . , J) group, Σ_j n_j = N, µ_j is the jth group mean, and ε_ij is the random error for the ith subject in the jth group. As indicated, in the typical application of the model, it is assumed that the ε_ij's are normally and independently distributed and that the treatment group variances (σ²_j's) are equal. Relevant sample estimates include

\hat{\mu}_j = \bar{Y}_j = \frac{\sum_{i=1}^{n_j} Y_{ij}}{n_j} \quad\text{and}\quad \hat{\sigma}^2 = \mathrm{MSE} = \frac{\sum_{j=1}^{J}\sum_{i=1}^{n_j}(Y_{ij} - \bar{Y}_j)^2}{\sum_{j=1}^{J}(n_j - 1)}.   (2)

Pairwise Comparisons

A confidence interval, assuming equal sample sizes n for convenience, for a pairwise difference µ_j − µ_j′ has the form

\bar{Y}_j - \bar{Y}_{j'} \pm c_\alpha \hat{\sigma}\sqrt{2/n_j},

where c_α is selected such that the overall rate of Type I error (the probability of making at least one Type I error in the set of, say, m tests), that is, the familywise rate of Type I error (FWE), equals α. In the case of all possible pairwise comparisons, one needs a c_α such that the simultaneous probability coverage achieves a specified level. That is, for all j ≠ j′, c_α must satisfy

P\left(\bar{Y}_j - \bar{Y}_{j'} - c_\alpha \hat{\sigma}\sqrt{2/n_j} \le \mu_j - \mu_{j'} \le \bar{Y}_j - \bar{Y}_{j'} + c_\alpha \hat{\sigma}\sqrt{2/n_j}\right) = 1 - \alpha.   (3)
Resampling Methods

Researchers can use both simultaneous and stepwise MCPs for pairwise comparisons of means with methods that do not assume normally distributed data. (Simultaneous MCPs use one critical value to assess statistical significance, while stepwise procedures use a succession of critical values to assess statistical significance.) In particular, users can use either permutation or bootstrap methods to compute all possible pairwise comparisons, leading to hypothesis tests of such comparisons. Pairwise comparisons of groups can be obtained through permutation of the data with the program provided by Westfall et al. [28, pp. 233–234]. Permutation tests do not require that the data be normally distributed. Instead of resampling with replacement from a pooled sample of residuals, permutation tests take the observed data (Y_11, . . . , Y_{n_1 1}, . . . , Y_{1J}, . . . , Y_{n_J J})
and randomly redistribute them to the treatment groups, and summary statistics (i.e., means or ranks) are then computed on the randomly redistributed data. The original outcomes (all possible pairwise differences from the original sample means) are then compared to the randomly generated values (e.g., all possible pairwise differences in the permutation samples). See [22] and [26]. Permutation tests can be used with virtually any measure of location, but regardless of which measure of location is used, they are designed to test the hypothesis that groups have identical distributions (e.g., [23]). If, for example, a permutation test based on means is used, it is not robust (see Robust Testing Procedures) if the goal is to make inferences about means (e.g., [1]). When users adopt this approach to combat the effects of nonnormality, they should also heed the cautionary note provided by Westfall et al. [28, p. 234], namely, the procedure may not control the FWE when the data are heterogeneous, particularly when group sizes are unequal. Thus, we will introduce another approach, pairwise comparisons based on robust estimators and a heteroscedastic statistic, an approach that has been demonstrated to generally control the FWE when data are nonnormal and heterogeneous even when group sizes are unequal. Prior to introducing bootstrapping with robust estimators, it is important to note for completeness that researchers also can adopt nonparametric methods (see Distribution-free Inference, an Overview) to examine pairwise and/or complex contrasts among treatment group means (see e.g., [8] and [11]). However, one should remember that this approach is only equivalent to the classical approach of comparing treatment group means (comparing the same full and restricted models) when the distributions that are being compared are equivalent except for possible differences in location (i.e., a shift in location). That is, the classical and nonparametric approaches test the same hypothesis when the assumptions of the shift model hold; otherwise, the nonparametric approach is not testing merely for a shift in location parameters of the J groups (see [20], [4]). Generally, conventional nonparametric tests are not aimed at making inferences about means or some measure of location. For example, the Wilcoxon–Mann–Whitney test is based on an estimate of p(X < Y ), the probability that an observation from the first group is less
than an observation from the second (e.g., see [3]). If restrictive assumptions are made about the distributions being compared, conventional nonparametric tests have implications about measures of location [6], but there are general conditions where a more accurate description is that they test the hypothesis of identical distributions. Interesting exceptions are given by Brunner, Domhof, and Langer [2] and Cliff [3]. For those researchers who believe nonparametric methods are appropriate (e.g., Kruskal–Wallis), we refer the reader to [8], [11], [20], or [25].
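To illustrate the resampling idea described above, the following minimal Python sketch permutes the pooled observations and uses the maximum absolute pairwise mean difference as the reference distribution, yielding FWE-adjusted P values. It is a sketch of the general strategy rather than Westfall et al.'s [28] program; the number of permutations and the use of ordinary means are illustrative assumptions.

```python
# Minimal sketch: single-step permutation procedure based on the maximum
# pairwise mean difference; group data are supplied as a list of arrays.
import itertools
import numpy as np

def permutation_pairwise(groups, n_perm=5000, seed=1):
    """FWE-adjusted P values for all pairwise mean differences."""
    rng = np.random.default_rng(seed)
    y = np.concatenate([np.asarray(g, dtype=float) for g in groups])
    labels = np.repeat(np.arange(len(groups)), [len(g) for g in groups])
    pairs = list(itertools.combinations(range(len(groups)), 2))

    def pair_diffs(labs):
        means = [y[labs == j].mean() for j in range(len(groups))]
        return {(j, k): abs(means[j] - means[k]) for j, k in pairs}

    observed = pair_diffs(labels)
    # Null distribution of the maximum pairwise difference under random
    # reassignment of the pooled observations to the groups
    null_max = np.array([max(pair_diffs(rng.permutation(labels)).values())
                         for _ in range(n_perm)])
    return {pair: float((null_max >= d).mean()) for pair, d in observed.items()}
```

As noted above, a procedure of this kind tests the hypothesis of identical distributions and may not control the FWE when variances are heterogeneous, particularly with unequal group sizes.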
Robust Estimation

Bootstrapping methods provide an estimate of the distribution of the test statistic, yielding P values that are not based on a theoretically presumed distribution (see e.g., [19]; [29]). An example SAS [22] program for all possible pairwise comparisons (of least squares means) is given by Westfall et al. [28, p. 229]. Westfall and Young's [29] results suggest that Type I error control could be improved further by combining a bootstrap method with one based on trimmed means. When researchers feel that they are dealing with populations that are nonnormal in form, and thus subscribe to the position that inferences pertaining to robust parameters are more valid than inferences pertaining to the usual least squares parameters, then procedures based on robust estimators, say trimmed means, should be adopted. Wilcox et al. [30] provide empirical support for the use of robust estimators and test statistics with bootstrap-determined critical values in one-way independent groups designs. This benefit has also been demonstrated in correlated groups designs (see [14]; [15]). Accordingly, researchers can apply robust estimates of central tendency and variability to a heteroscedastic test statistic (see Heteroscedasticity and Complex Variation) (e.g., Welch's test [27]; also see [16]). When trimmed means are being compared, the multiple comparison null hypothesis pertains to the equality of population trimmed means, that is, H0: ψ = µ_tj − µ_tj′ = 0 (j ≠ j′). Although the null hypothesis stipulates that the population trimmed means are equal, we believe this is a reasonable hypothesis to examine since trimmed means, as opposed to the usual (least squares) means, provide better estimates of the typical individual in distributions that either contain outliers or are skewed. That is, when distributions are skewed, trimmed means do not estimate µ but rather some value (i.e., µ_t) that is typically closer to the bulk of the observations. (Another way of conceptualizing the unknown parameter µ_t is that it is simply the population counterpart of the sample trimmed mean (see [12] and [9]).) And lastly, as Zhou, Gao, and Hui [34] point out, distributions are typically skewed. Thus, with robust estimation, the trimmed group means (µ̂_tj's) replace the least squares group means (µ̂_j's), the Winsorized group variance estimators (σ̂²_Wj's) (see Winsorized Robust Measures) replace the least squares variances (σ̂²_j's), and h_j replaces n_j, and accordingly one computes the robust version of a heteroscedastic test statistic (see [33], [32]). Definitions of trimmed means, Winsorized variances, and the standard error of a trimmed mean can be found in [19] or [30, 31]. To test H0: µ_t1 − µ_t2 = 0 ≡ µ_t1 = µ_t2 (equality of population trimmed means), let d_j = (n_j − 1)σ̂²_wj/[h_j(h_j − 1)], where σ̂²_wj is the gamma-Winsorized variance and h_j is the effective sample size, that is, the size after trimming (j = 1, 2). Yuen's [33] test is

t_Y = \frac{\hat{\mu}_{t1} - \hat{\mu}_{t2}}{\sqrt{d_1 + d_2}},   (4)

where µ̂_tj is the γ-trimmed mean for the jth group and the estimated degrees of freedom are

\nu_Y = \frac{(d_1 + d_2)^2}{d_1^2/(h_1 - 1) + d_2^2/(h_2 - 1)}.   (5)
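A minimal Python sketch of Yuen's test (4) and (5) follows; the 20 percent trimming proportion and the function name are assumptions for illustration, and the helper is not the SAS/IML program referred to below.

```python
# Minimal sketch of Yuen's trimmed t test for two independent groups.
import numpy as np
from scipy import stats

def yuen_test(x, y, trim=0.20):
    def trim_stats(a):
        a = np.sort(np.asarray(a, dtype=float))
        n = len(a)
        g = int(np.floor(trim * n))              # cases trimmed per tail
        h = n - 2 * g                            # effective sample size
        t_mean = a[g:n - g].mean()               # trimmed mean
        w = np.concatenate(([a[g]] * g, a[g:n - g], [a[n - g - 1]] * g))
        d = (n - 1) * w.var(ddof=1) / (h * (h - 1))   # d_j in the text
        return t_mean, d, h
    m1, d1, h1 = trim_stats(x)
    m2, d2, h2 = trim_stats(y)
    t_y = (m1 - m2) / np.sqrt(d1 + d2)                           # (4)
    nu = (d1 + d2) ** 2 / (d1**2 / (h1 - 1) + d2**2 / (h2 - 1))  # (5)
    return t_y, nu, 2 * stats.t.sf(abs(t_y), nu)
```

The returned P value refers t_Y to a Student t distribution with ν_Y degrees of freedom; the bootstrap alternative described next replaces that reference distribution with an empirically estimated one.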
Bootstrapping

Following Westfall and Young [29], and as enumerated by Wilcox [30, 31], let C_ij = Y_ij − µ̂_tj; thus, the C_ij values are the empirical distribution of the jth group, centered so that the observed trimmed mean is zero. That is, the empirical distributions are shifted so that the null hypothesis of equal trimmed means is true in the sample. The strategy behind the bootstrap is to use the shifted empirical distributions to estimate an appropriate critical value. For each j, obtain a bootstrap sample by randomly sampling with replacement n_j observations from the C_ij values, yielding Y*_1, . . . , Y*_{n_j}. Let t*_Y be the value of the test statistic based on the bootstrap sample. To control the FWE for a set of contrasts, the following approach can be used. Set t*_m = max t*_Y, the maximum being taken over all j ≠ j′. Repeat this process B times, yielding t*_{m1}, . . . , t*_{mB}. Let t*_{m(1)} ≤ · · · ≤ t*_{m(B)} be the t*_{mb} values written in ascending order, and let q = (1 − α)B, rounded to the nearest integer. Then a test of a null hypothesis is obtained by comparing t_Y to t*_{m(q)} (i.e., whether t_Y ≥ t*_{m(q)}), where q is determined so that the FWE is approximately α. See [19, pp. 404–407], [29], or [32, pp. 437–443]. Keselman, Wilcox, and Lix [16] present a SAS/IML [24] program which can be used to apply bootstrapping methods with robust estimators to obtain numerical results. The program can also be obtained from the first author's website at http://www.umanitoba.ca/faculties/arts/psychology/. This program is an extension of the program found in Lix and Keselman [18]. Tests of individual contrasts or families of contrasts may be performed (in addition, omnibus main effect or interaction tests may be performed). The program can be applied in a variety of research designs. See [19].
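The following minimal Python sketch implements the centering-and-bootstrap strategy just described for all pairwise comparisons; the trimming proportion, the number of bootstrap samples B, and the helper names are illustrative assumptions rather than features of the Keselman, Wilcox, and Lix [16] program.

```python
# Minimal sketch: bootstrap critical value for the maximum absolute Yuen
# statistic over all pairs of groups, after centering each group at its
# trimmed mean so that the null hypothesis is true in the sample.
import itertools
import numpy as np

def _trim_stats(a, trim):
    a = np.sort(np.asarray(a, dtype=float))
    n = len(a)
    g = int(np.floor(trim * n))
    h = n - 2 * g
    w = np.concatenate(([a[g]] * g, a[g:n - g], [a[n - g - 1]] * g))
    d = (n - 1) * w.var(ddof=1) / (h * (h - 1))
    return a[g:n - g].mean(), d                      # trimmed mean and d_j

def yuen_stat(x, y, trim=0.20):
    m1, d1 = _trim_stats(x, trim)
    m2, d2 = _trim_stats(y, trim)
    return (m1 - m2) / np.sqrt(d1 + d2)

def bootstrap_max_critical(groups, trim=0.20, alpha=0.05, B=599, seed=1):
    rng = np.random.default_rng(seed)
    # Center each group at its trimmed mean (shifted empirical distributions)
    centered = [np.asarray(g, float) - _trim_stats(g, trim)[0] for g in groups]
    pairs = list(itertools.combinations(range(len(groups)), 2))
    max_t = np.empty(B)
    for b in range(B):
        boot = [rng.choice(c, size=len(c), replace=True) for c in centered]
        max_t[b] = max(abs(yuen_stat(boot[j], boot[k], trim)) for j, k in pairs)
    q = int(round((1 - alpha) * B))                  # q = (1 - alpha)B, rounded
    return np.sort(max_t)[q - 1]                     # the qth smallest maximum
```

Each observed pairwise |t_Y|, computed from the uncentered data, is then judged statistically significant if it equals or exceeds the returned critical value.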
Complex Comparisons

To introduce some methods that can be adopted when investigating complex comparisons among treatment group means, we first provide some definitions. Specifically, we let

\psi = c_1\mu_1 + c_2\mu_2 + \cdots + c_J\mu_J,   (6)

where the coefficients (c_j's) defining the contrast sum to zero (i.e., \sum_{j=1}^{J} c_j = 0), represent the population contrast that we are interested in subjecting to a test of significance. To test H_0: \psi = 0, estimate ψ with

\hat{\psi} = c_1\bar{Y}_1 + c_2\bar{Y}_2 + \cdots + c_J\bar{Y}_J.   (7)

The usual homoscedastic statistic is

t_{\hat{\psi}} = \frac{\hat{\psi}}{\sqrt{\mathrm{MSE}\sum_{j=1}^{J} c_j^2/n_j}}.   (8)
A very popular simultaneous method for evaluating a set of m complex comparisons is to compare the P values associated with the t_ψ̂ statistics to the Dunn–Bonferroni critical value α/m (see [17, p. 829] for the table of critical values). One may also adopt stepwise MCPs when examining a set of m complex comparisons. These stepwise methods should result in greater sensitivity to
detect effects than the Dunn–Bonferroni [5] method. Many others have devised stepwise Bonferroni-type MCPs, for example, [10], [7], and so on. All of these procedures provide FWE control; the minor differences between them can result in small differences in power to detect effects. Thus, we recommend the Hochberg [7] sequentially acceptive step-up Bonferroni procedure because it is simple to understand and implement.
Hochberg's [7] Sequentially Acceptive Step-up Bonferroni Procedure

In this procedure, the P values corresponding to the m statistics (e.g., t_ψ̂) for testing the hypotheses H_1, . . . , H_m are ordered from smallest to largest. Then, for any i = m, m − 1, . . . , 1, if p_i ≤ α/(m − i + 1), the Hochberg procedure rejects all H_i′ (i′ ≤ i). According to this procedure, therefore, one begins by assessing the largest P value, p_m. If p_m ≤ α, all hypotheses are rejected. If p_m > α, then H_m is accepted and one proceeds to compare p_(m−1) to α/2. If p_(m−1) ≤ α/2, then all H_i (i = m − 1, . . . , 1) are rejected; if not, then H_(m−1) is accepted and one proceeds to compare p_(m−2) with α/3, and so on. The Dunn–Bonferroni [5] and Hochberg [7] procedures can be adapted to robust estimation and testing by adopting trimmed means and Winsorized variances instead of the usual least squares estimators. That is, to circumvent the biasing effects of nonnormality and variance heterogeneity, researchers can adopt a heteroscedastic Welch-type statistic and its accompanying modified degrees of freedom, applying robust estimators. For example, the heteroscedastic test statistic for the Dunn–Bonferroni [5] and Hochberg [7] procedures would be

t_{\hat{\psi}} = \frac{\hat{\psi}}{\sqrt{\sum_{j=1}^{J} c_j^2 s_j^2/n_j}},   (9)

where the statistic is approximately distributed as a Student t variable with

df_W = \frac{\left(\sum_{j=1}^{J} c_j^2 s_j^2/n_j\right)^2}{\sum_{j=1}^{J}\left[(c_j^2 s_j^2/n_j)^2/(n_j - 1)\right]}.   (10)

Accordingly, as we specified previously, one replaces the least squares means with trimmed means and the least squares variances with variances based on Winsorized sums of squares. That is,

t_{\hat{\psi}_t} = \frac{\hat{\psi}_t}{\sqrt{\sum_{j=1}^{J} c_j^2 d_j}},   (11)

where the sample comparison of trimmed means equals

\hat{\psi}_t = c_1\bar{Y}_{t1} + c_2\bar{Y}_{t2} + \cdots + c_J\bar{Y}_{tJ}   (12)

and the error degrees of freedom are given by

df_{Wt} = \frac{\left(\sum_{j=1}^{J} c_j^2 d_j\right)^2}{\sum_{j=1}^{J}\left[(c_j^2 d_j)^2/(h_j - 1)\right]}.   (13)

A bootstrap version of Hochberg's [7] sequentially acceptive step-up Bonferroni procedure can be obtained in the following manner. Corresponding to the ordered P values are the |t_ψ̂t| statistics. These pairwise statistics can be rank ordered according to their size, and thus p_m (the largest P value) will correspond to the smallest |t_ψ̂t| statistic. Thus, the smallest t_ψ̂t (or largest P value) is bootstrapped. That is, as Westfall and Young [29, p. 47] maintain, 'The resampled P values. . . are computed using the same calculations which produced the original P values. . . from the original. . . data.' Accordingly, let |t*_ψ̂t(1)| ≤ · · · ≤ |t*_ψ̂t(B)| be the |t*_ψ̂t(b)| values for this smallest comparison written in ascending order, and let m_r = [(1 − α)B]. Then statistical significance is determined by comparing

|t_{\hat{\psi}_t}| \ge t^*_{\hat{\psi}_t(m_r)}.   (14)
If this statistic fails to reach significance, then the next smallest |t_ψ̂t| statistic is bootstrapped and compared to the m_r = [(1 − α/2)B] quantile. If necessary, the procedure continues with the next smallest |t_ψ̂t|. Other approaches for FWE control with resampling techniques are enumerated by Lunneborg [19, pp. 404–407] and Westfall and Young [29].
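Because the step-up logic above drives both the ordinary and the bootstrap versions of the procedure, a minimal Python sketch of Hochberg's sequentially acceptive step-up Bonferroni rule applied to a set of P values is given below; the example P values are illustrative and the function is not part of any of the programs cited in this article.

```python
# Minimal sketch of Hochberg's step-up Bonferroni rule for m P values.
import numpy as np

def hochberg(pvalues, alpha=0.05):
    """Return a boolean array: True where the hypothesis is rejected."""
    p = np.asarray(pvalues, dtype=float)
    order = np.argsort(p)                 # indices from smallest to largest P
    m = len(p)
    reject = np.zeros(m, dtype=bool)
    # Step up from the largest P value: p_(i) is compared with alpha/(m - i + 1)
    for rank in range(m, 0, -1):          # rank m corresponds to the largest P
        idx = order[rank - 1]
        if p[idx] <= alpha / (m - rank + 1):
            reject[order[:rank]] = True   # reject this and all smaller P values
            break
    return reject

# Example with three illustrative P values:
# print(hochberg([0.21, 0.004, 0.03]))    # -> [False, True, False]
```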
It is important to note that, although we are unaware of any empirical investigations that have examined tests of complex contrasts with robust estimators, based on empirical investigations related to pairwise comparisons we believe these methods would provide approximate FWE control under conditions of nonnormality and variance heterogeneity and should possess more power to detect group differences than procedures based on least squares estimators.
Numerical Example

We use an example data set presented in Keselman et al. [16] to illustrate the procedures enumerated in this paper. Keselman et al. modified a data set given by Karayanidis, Andrews, Ward, and Michie [13] where the authors compared the performance of three age groups (Young, Middle, Old) on auditory selective attention processes. The dependent reaction time (in milliseconds) scores are reported in Table 1. We used the SAS/IML program demonstrated in [16, pp. 589–590] to obtain numerical results for tests of pairwise comparisons and a SAS/IML program that we wrote to obtain numerical results for tests of complex contrasts. Estimates of group trimmed means and standard errors are presented in Table 2.

Table 1  Example data set

Young:  518.29 548.42 524.10 666.63 488.84 676.40 482.43 531.18 504.62 609.53 584.68 609.09 495.15 502.69 484.36 519.10 572.10 524.12 495.24
Middle: 335.59 353.54 493.08 469.01 338.43 499.10 404.27 494.31 487.30 485.85 886.41 437.50
Old:    558.95 538.56 586.39 530.23 629.22 691.84 557.24 528.50 565.43 536.03 594.69 645.69 558.61 519.01 538.83

Table 2  Descriptive statistics [trimmed means and (standard errors)]

Young: 532.98 (15.27)   Middle: 453.11 (26.86)   Old: 559.41 (11.00)
Three pairwise comparisons were computed: (1) Young versus Middle, (2) Young versus Old, and (3) Middle versus Old. The values of t_Y and ν_Y for the three comparisons are (1) 6.68 and 11.55, (2) 1.97 and 19.72, and (3) 13.41 and 9.31; for .05 FWE control the critical value is 12.56. Accordingly, based on the methodology enumerated in this paper and the critical value obtained with the program discussed by Keselman et al. [16], only the third comparison is judged to be statistically significant. We also computed three complex comparisons: (1) Young versus the average of Middle and Old, (2) Middle versus the average of Young and Old, and (3) Old versus the average of Young and Middle. Using the bootstrapped version of Hochberg's [7] step-up Bonferroni MCP, none of the contrasts are statistically significant. That is, the ordered |t_ψ̂t| values were 1.27 (comparison 2), 3.27 (comparison 1), and 3.50 (comparison 3). The corresponding critical t*_ψ̂t(m_r) values were 2.20, 4.14, and 4.15, respectively.
References

[1] Boik, R.J. (1987). The Fisher-Pitman permutation test: a non-robust alternative to the normal theory F test when variances are heterogeneous, British Journal of Mathematical and Statistical Psychology 40, 26–42.
[2] Brunner, E., Domhof, S. & Langer, F. (2002). Nonparametric Analysis of Longitudinal Data in Factorial Experiments, Wiley, New York.
[3] Cliff, N. (1996). Ordinal Methods for Behavioral Data Analysis, Lawrence Erlbaum, Mahwah.
[4] Delaney, H.D. & Vargha, A. (2002). Comparing several robust tests of stochastic equality with ordinally scaled variables and small to moderate sized samples, Psychological Methods 7, 485–503.
[5] Dunn, O.J. (1961). Multiple comparisons among means, Journal of the American Statistical Association 56, 52–64.
[6] Hettmansperger, T.P. & McKean, J.W. (1998). Robust Nonparametric Statistical Methods, Arnold, London.
[7] Hochberg, Y. (1988). A sharper Bonferroni procedure for multiple tests of significance, Biometrika 75, 800–802.
[8] Hochberg, Y. & Tamhane, A.C. (1987). Multiple Comparison Procedures, Wiley, New York.
[9] Hogg, R.V. (1974). Adaptive robust procedures: a partial review and some suggestions for future applications and theory, Journal of the American Statistical Association 69, 909–927.
[10] Holm, S. (1979). A simple sequentially rejective multiple test procedure, Scandinavian Journal of Statistics 6, 65–70.
[11] Hsu, J.C. (1996). Multiple Comparisons Theory and Methods, Chapman & Hall, New York.
[12] Huber, P.J. (1972). Robust statistics: a review, Annals of Mathematical Statistics 43, 1041–1067.
[13] Karayanidis, F., Andrews, S., Ward, P.B. & Michie, P.T. (1995). ERP indices of auditory selective attention in aging and Parkinson's disease, Psychophysiology 32, 335–350.
[14] Keselman, H.J., Algina, J., Wilcox, R.R. & Kowalchuk, R.K. (2000). Testing repeated measures hypotheses when covariance matrices are heterogeneous: revisiting the robustness of the Welch-James test again, Educational and Psychological Measurement 60, 925–938.
[15] Keselman, H.J., Kowalchuk, R.K., Algina, J., Lix, L.M. & Wilcox, R.R. (2000). Testing treatment effects in repeated measures designs: trimmed means and bootstrapping, British Journal of Mathematical and Statistical Psychology 53, 175–191.
[16] Keselman, H.J., Wilcox, R.R. & Lix, L.M. (2003). A generally robust approach to hypothesis testing in independent and correlated groups designs, Psychophysiology 40, 586–596.
[17] Kirk, R.E. (1995). Experimental Design: Procedures for the Behavioral Sciences, Brooks/Cole Publishing Company, Toronto.
[18] Lix, L.M. & Keselman, H.J. (1995). Approximate degrees of freedom tests: a unified perspective on testing for mean equality, Psychological Bulletin 117, 547–560.
[19] Lunneborg, C.E. (2000). Data Analysis by Resampling, Duxbury, Pacific Grove.
[20] Maxwell, S.E. & Delaney, H.D. (2004). Designing Experiments and Analyzing Data, 2nd Edition, Lawrence Erlbaum, Mahwah.
[21] Micceri, T. (1989). The unicorn, the normal curve, and other improbable creatures, Psychological Bulletin 105, 156–166.
[22] Opdyke, J.D. (2003). Fast permutation tests that maximize power under conventional Monte Carlo sampling for pairwise and multiple comparisons, Journal of Modern Applied Statistical Methods 2(1), 27–49.
[23] Pesarin, F. (2001). Multivariate Permutation Tests, Wiley, New York.
[24] SAS Institute. (1999). SAS/STAT User's Guide, Version 7, SAS Institute, Cary.
[25] Sprent, P. (1993). Applied Nonparametric Statistical Methods, 2nd Edition, Chapman & Hall, London.
[26] Troendle, J.F. (1996). A permutational step-up method of testing multiple outcomes, Biometrics 52, 846–859.
[27] Welch, B.L. (1938). The significance of the difference between two means when population variances are unequal, Biometrika 38, 330–336.
[28] Westfall, P.H., Tobias, R.D., Rom, D., Wolfinger, R.D. & Hochberg, Y. (1999). Multiple Comparisons and Multiple Tests, SAS Institute, Cary.
[29] Westfall, P.H. & Young, S.S. (1993). Resampling-Based Multiple Testing: Examples and Methods for P value Adjustment, Wiley, New York.
[30] Wilcox, R.R. (1990). Comparing the means of two independent groups, Biometrics Journal 32, 771–780.
[31] Wilcox, R.R. (1997). Introduction to Robust Estimation and Hypothesis Testing, Academic Press, San Diego.
[32] Wilcox, R.R. (2003). Applying Contemporary Statistical Techniques, Academic Press, New York.
[33] Yuen, K.K. (1974). The two-sample trimmed t for unequal population variances, Biometrika 61, 165–170.
[34] Zhou, X., Gao, S. & Hui, S.L. (1997). Methods for comparing the means of two independent log-normal samples, Biometrics 53, 1129–1135.
Further Reading

Keselman, H.J., Lix, L.M. & Kowalchuk, R.K. (1998). Multiple comparison procedures for trimmed means, Psychological Methods 3, 123–141.
Keselman, H.J., Othman, A.R., Wilcox, R.R. & Fradette, K. (2004). The new and improved two-sample t test, Psychological Science 15, 47–51.
H.J. KESELMAN AND RAND R. WILCOX
Multiple Imputation BRIAN S. EVERITT Volume 3, pp. 1331–1332 in Encyclopedia of Statistics in Behavioral Science ISBN-13: 978-0-470-86080-9 ISBN-10: 0-470-86080-4 Editors Brian S. Everitt & David C. Howell John Wiley & Sons, Ltd, Chichester, 2005
Multiple Imputation

A method by which missing values in a data set are replaced by more than one, usually between 3 and 10, simulated versions. Each of the simulated complete datasets is then analyzed by the method relevant to the investigation at hand, and the results combined to produce estimates, standard errors, and confidence intervals that incorporate missing data uncertainty. Introducing appropriate random errors into the imputation process makes it possible to get approximately unbiased estimates of all parameters, although the data must be missing at random (see Dropouts in Longitudinal Data; Dropouts in Longitudinal Studies: Methods of Analysis) for this to
be the case. The multiple imputations themselves are created by a Bayesian approach (see Bayesian Statistics and Markov Chain Monte Carlo and Bayesian Statistics), which requires specification of a parametric model for the complete data and, if necessary, a model for the mechanism by which data become missing. A comprehensive account of multiple imputation and details of associated software are given in Schafer [1].
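The combination step can be sketched briefly: under the standard combining rules, the pooled estimate is the average of the per-imputation estimates, and its variance adds the between-imputation spread to the average within-imputation variance. The short Python sketch below is illustrative only; the function name and example numbers are assumptions, and it is not the software described in Schafer [1].

```python
# Minimal sketch of pooling a parameter estimate across m imputed data sets.
import numpy as np

def pool_estimates(estimates, variances):
    q = np.asarray(estimates, dtype=float)   # one estimate per imputed data set
    u = np.asarray(variances, dtype=float)   # its squared standard error
    m = len(q)
    q_bar = q.mean()                         # pooled estimate
    u_bar = u.mean()                         # average within-imputation variance
    b = q.var(ddof=1)                        # between-imputation variance
    total_var = u_bar + (1 + 1 / m) * b
    return q_bar, np.sqrt(total_var)         # pooled estimate and standard error

# Example: pooling a coefficient estimated in 5 imputed data sets
# pool_estimates([1.9, 2.1, 2.0, 2.2, 1.8], [0.25, 0.28, 0.24, 0.27, 0.26])
```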
Reference

[1] Schafer, J. (1997). The Analysis of Incomplete Multivariate Data, CRC/Chapman & Hall, Boca Raton.
BRIAN S. EVERITT
Multiple Informants KIMBERLY J. SAUDINO Volume 3, pp. 1332–1333 in Encyclopedia of Statistics in Behavioral Science ISBN-13: 978-0-470-86080-9 ISBN-10: 0-470-86080-4 Editors Brian S. Everitt & David C. Howell John Wiley & Sons, Ltd, Chichester, 2005
Multiple Informants

Most behavioral researchers agree that it is desirable to obtain information about subjects from multiple sources or informants. For example, the importance of using multiple informants for the assessment of behavior problems in children has long been emphasized in the phenotypic literature. Correlations between different informants' ratings of child behavior problems are typically low – in the range of 0.60 between informants with similar roles (e.g., parent–parent); 0.30 between informants who have different roles (e.g., parent–teacher); and 0.20 between self-reports and other informant ratings [1]. These low correlations are typically interpreted as indicating that different raters provide different information about behavior problems because they view the child in different contexts or situations; however, error and rater bias can also contribute to low agreement between informants. Whatever the reasons for disagreement among informants, the use of multiple informants allows researchers to gain a fuller understanding of the behavior under study. In quantitative genetic analyses, relying on a single informant may not paint a complete picture of the etiology of the behavior of interest. Using the above example, parent and teacher ratings assess behaviors in very different contexts; consequently, the genetic and environmental factors that influence behaviors at home might differ from those that influence the same behaviors in the classroom. There may also be some question as to whether parents and teachers actually are assessing the same behaviors. In addition, informants' response tendencies, standards, or behavioral expectations may affect their ratings – rater biases that cannot be detected without information from multiple informants. Analysis of data from multiple informants can inform about the extent to which different informants' behavioral ratings are influenced by the same genetic and environmental factors, and can explain why there is agreement and/or disagreement amongst informants. Three classes of quantitative genetic models have been applied to data from multiple informants: biometric models, psychometric models, and bias models [5]. Each makes explicit assumptions about the reasons for agreement and disagreement among informants. Biometric models such as the Independent Pathway model [3] posit that genes and environments
contribute to covariance between informants through separate genetic and environmental pathways. This model decomposes the genetic, shared environmental, and nonshared environmental variances of multiple informants’ ratings (e.g., parent, teacher, and child) into genetic and environmental effects that are common to all informants, and genetic and environmental effects that are specific to each informant. Under this model, covariance between informants can arise due to different factors. That is, although all informants’ ratings may be intercorrelated, the correlation between any two informants’ ratings may be due to different factors (e.g., the correlation between parent and teacher could have different sources than the correlation between parent and child). Like the Cholesky Decomposition (also a biometric model), the Independent Pathway Model can be considered to be ‘agnostic’ in that it does not specify that the different informants are assessing the same phenotype, rather, it just allows that the phenotypes being assessed by each informant are correlated [2]. The Psychometric or Common Pathway model [3, 4] is more restrictive, positing that genes and environments influence covariation between raters through a single common pathway. This model suggests that correlations between informants arise because they are assessing a common phenotype. This common phenotype is then influenced by genetic and/or environmental influences. As is the case for the Independent Pathway model, this model also allows genetic and environmental effects specific to each informant. Under this model, genetic and/or environmental sources of covariance are the same across all informants. Informants’ ratings agree because they tap the same latent phenotype (i.e., they are assessing the same behaviors). Informants’ ratings differ because, to some extent, they also assess different phenotypes due to the fact that each informant contributes different but valid information about the target’s behavior. As with the Psychometric model, the Rater Bias model [2, 6] assumes that informants agree because they are assessing the same latent phenotype that is influenced by genetic and environmental factors; however, this model also assumes that disagreement between informants is due to rater bias and unreliability. That is, this model does not include informantspecific genetic or environmental influences – anything that is not reliable trait variance (i.e., the common phenotype) is bias or error, both of which are estimated in the model. Rater biases refer to the
informant’s tendency to consistently overestimate or underestimate the behavior of targets [2]. Because bias is conceptualized as consistency within an informant across targets, the Rater Bias model requires that the same informant assess both members of a twin or sibling pair. Other multiple informant models do not require this. All of the above models allow the estimation of genetic and environmental correlations (i.e., degree of genetic and/or environmental overlap) between informants, and the extent to which genetic and environmental factors contribute to the phenotypic correlations between informants (i.e., bivariate heritability and environmentality). Comparisons of the relative fits of the different models make it possible to get some understanding of differential informant effects. By comparing the Biometric and Psychometric models, it is possible to determine whether it is reasonable to assume that different informants are assessing the same phenotypes. That is, if a Biometric model provides the best fit to the data, then the possibility that informants are assessing different, albeit correlated, phenotypes must be considered. Similarly, comparisons between the Psychometric and Rater Bias models inform about the presence of valid informant differences versus rater biases.
References

[1] Achenbach, T.M., McConaughy, S.H. & Howell, C.T. (1987). Child/adolescent behavioral and emotional problems: implications of cross-informant correlations for situational specificity, Psychological Bulletin 101, 213–232.
[2] Hewitt, J.K., Silberg, J.L., Neale, M.C., Eaves, L.J. & Erikson, M. (1992). The analysis of parental ratings of children's behavior using LISREL, Behavior Genetics 22, 292–317.
[3] Kendler, K.S., Heath, A.C., Martin, N.G. & Eaves, L.J. (1987). Symptoms of anxiety and symptoms of depression: same genes, different environments? Archives of General Psychiatry 44, 451–457.
[4] McArdle, J.J. & Goldsmith, H.H. (1990). Alternative common-factor models for multivariate biometric analyses, Behavior Genetics 20, 569–608.
[5] Neale, M.C. & Cardon, L.R. (1992). Methodology for Genetic Studies of Twins and Families, Kluwer Academic Publishers, Dordrecht.
[6] Neale, M.C. & Stevenson, J. (1989). Rater bias in the EASI temperament survey: a twin study, Journal of Personality and Social Psychology 56, 446–455.
KIMBERLY J. SAUDINO
Multiple Linear Regression
STEPHEN G. WEST AND LEONA S. AIKEN
Volume 3, pp. 1333–1338 in Encyclopedia of Statistics in Behavioral Science
ISBN-13: 978-0-470-86080-9; ISBN-10: 0-470-86080-4
Editors: Brian S. Everitt & David C. Howell. John Wiley & Sons, Ltd, Chichester, 2005
Multiple Linear Regression Multiple regression addresses questions about the relationship between a set of independent variables (IVs) and a dependent variable (DV). It can be used to describe the relationship, to predict future scores on the DV, or to test specific hypotheses based on scientific theory or prior research. Multiple regression most often focuses on linear relationships between the IVs and the DV, but can be extended to examine other forms of relationships. Multiple regression begins by writing an equation in which the DV is a weighted linear combination of the independent variables. In general, the regression equation may be written as Y = b0 + b1X1 + b2X2 + · · · + bpXp + e. Y is the DV, each of the Xs is an independent variable, each of the bs is the corresponding regression coefficient (weight), and e is the error in prediction (residual) for each case. The linear combination excluding the residual, b0 + b1X1 + b2X2 + · · · + bpXp, is also known as the predicted value or Ŷ, the score we would expect on the DV based on the scores on the set of IVs. To illustrate, we use data from 56 live births taken from [4]. The IVs were the Age of the mother in years, the Term of the pregnancy in weeks, and the Sex of the infant (0 = girls, 1 = boys). The DV is
the Weight of the infant in grams. Figure 1(a) is a scatterplot of the (Term, Weight) pair for each case. Our initial analysis (Model 1) predicts infant Weight from one IV, Term. The regression equation is written as Weight = b0 + b1Term + e. The results are shown in Table 1. b0 = −2490 is the intercept, the predicted value of Weight when Term = 0. b1 = 149 is the slope, the number of grams of increase in Weight for each 1-week increase in Term. Each of the regression coefficients is tested against a population value of 0 using the formula t = bi/sbi, with df = n − p − 1, where sbi is the estimate of the standard error of bi. Here, n = 56 is the sample size and p = 1 is the number of predictor variables. We cannot reject the null hypothesis that the infant’s Weight at 0 weeks (the moment of conception) is 0 g, although a term of 0 weeks for a live birth is impossible. Hence, this conclusion should be treated very cautiously. The test of b1 indicates that there is a positive weight gain for each 1-week increase in Term; the best estimate is 149 g per week. The 95% confidence interval for the corresponding population regression coefficient β1 is b1 ± t0.975 sb1 = 149 ± (2)(38.8) = 71.4 to 226.6. R² = 0.21 is the squared correlation between Y and Ŷ or, alternatively, .21 is SSpredicted/SStotal, the proportion of variation in Weight accounted for by Term. Model 2 predicts Weight from Term and mother’s Age. Figure 1(b) portrays the relationship between Weight and Age. To improve the interpretability of the intercept [9], we mean center Term, Term C = Term −
Figure 1 Scatterplots of raw data (a) Weight vs. Term (b) Weight vs. Age Note: x represents male; o represents female. Each point represents one observation. The best fitting straight line is superimposed in each scatterplot.
Table 1   Model 1: regression of infant weight on term of pregnancy

Coefficient estimates
Label            Estimate   Std. error   t-value   P value
b0. Intercept    −2490      1537.0       −1.62     0.11
b1. Term         149        38.8         3.84      0.0003
R Squared: 0.2148   Number of cases: 56   Degrees of freedom: 54

Summary Analysis of Variance Table
Source       df   SS         MS         F       P value
Regression   1    3598397    3598397    14.77   0.0003
Residual     54   13155587   243622
Total        55   16753984
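As a concrete companion to Table 1, the sketch below fits Model 1 by ordinary least squares and reports the same quantities (coefficients, standard errors, t values, p values, R²). The function name is ours and the data are simulated stand-ins, since the 56 birth-weight cases from [4] are not reproduced here; run on the actual data it should match Table 1.

```python
import numpy as np
from scipy import stats

def ols_summary(X, y):
    """Ordinary least squares: coefficients, standard errors, t values, two-tailed
    p values, and R-squared. X: (n, p) predictors without the intercept; y: (n,)."""
    n, p = X.shape
    D = np.column_stack([np.ones(n), X])        # add the intercept column
    b, *_ = np.linalg.lstsq(D, y, rcond=None)
    resid = y - D @ b
    df = n - p - 1
    mse = resid @ resid / df                    # residual mean square
    se = np.sqrt(np.diag(mse * np.linalg.inv(D.T @ D)))
    t = b / se
    pvals = 2 * stats.t.sf(np.abs(t), df)
    r2 = 1 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
    return b, se, t, pvals, r2

# Simulated stand-ins for the 56 (Term, Weight) cases; substituting the actual
# data from [4] should reproduce the Table 1 estimates.
rng = np.random.default_rng(0)
term = rng.uniform(34, 42, size=56)
weight = -2490 + 149 * term + rng.normal(0, 500, size=56)
print(ols_summary(term[:, None], weight))
```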
Table 2   Model 2: regression of infant weight on term of pregnancy and mother’s age

Coefficient estimates
Label            Estimate   Std. error   t-value   P value
b0. Intercept    3413       59.3         57.56     <0.0001
b1. Term-c       141        35.0         4.04      0.0002
b2. Age-c        48         13.0         3.72      0.0005
R Squared: 0.3771   Number of cases: 56   Degrees of freedom: 53

Summary Analysis of Variance Table
Source       df   SS         MS         F       P value
Regression   2    6318298    3159149    16.04   <0.0001
Residual     53   10435686   196900
Total        55   16753984
Mean(Term), and Age, Age C = Age − Mean(Age). In this sample, Mean(Term) = 39.6 weeks and Mean(Age) = 26.3 years. The results are presented in Table 2. The intercept b0 now represents the predicted value of Weight = 3413 g when Term C and Age C both equal 0, at the mean Age of all mothers and mean Term length of all pregnancies in the sample. This value is equal to the mean birth weight in the sample. b1 = 141 is the slope for Term, the gain in Weight for each 1-week increase in Term, holding the mother’s Age constant. b2 = 48 is the slope for Age, the gain in Weight for each 1-year increase in mother’s Age, holding Term constant. Mean centering does not affect the value of the highest order regression
coefficients, here the slopes. The addition of Term to the equation results in an increase in R2 from 0.21 to 0.38. Model 3 illustrates the addition of a categorical IV, infant Sex, to the regression equation. For G categories, G − 1 code variables are needed to represent the categorical variable. Sex has two categories (female, male), so one code variable is needed. Several different schemes including dummy, effect, and contrast codes are available; the challenge is to select the scheme that provides the optimal interpretation for the research question [3]. Here we use a dummy code scheme in which 0 = female and 1 = male (see Dummy Variables). Table 3 presents
Table 3   Model 3: regression of infant weight on term, mother’s age, and sex

Coefficient estimates
Label         unstand. bi   sbi     95% CI          stand. βi   sr²    pr²    t-value   P value
Intercept     3423          88.2    3246 to 3600    –           –      –      38.8      <0.0001
Age-c         48            13.2    21.6 to 74.8    .40         .159   .203   3.6       0.0006
Term-c        141           35.4    70.5 to 212.5   .44         .192   .235   4.0       0.0002
Sex           −18           121.0   −261 to +225    −.02        .000   .000   −0.1      0.8822
R Squared: 0.3774   Number of cases: 56   Degrees of freedom: 52

Summary Analysis of Variance Table
Source       df   SS         MS        F       P value
Regression   3    6322743    2107581   10.51   <0.0001
Residual     52   10431241   200601

Diagnostic Statistics                       Largest Absolute Value
Leverage (hii)                              0.24
Distance (ti)                               2.49
Global Influence (DFFITSi)                  0.66
Specific Influence DFBETASij – Term         −0.46
DFBETASij – Age                             0.56
DFBETASij – Sex                             0.36
the results. Again, Term C and Age C are mean centered. The intercept b0 now represents the predicted value of Weight when Age C and Term C are 0 (their respective mean values) and Sex = 0 (female infant). b1 = 141 represents the slope for Term C, b2 = 48 represents the slope for Age C, and b3 = −18 indicates that males are on average 18 g lighter, holding Age and Term constant. The t Tests indicate that the intercept, the slope for Term, and the slope for Age are greater than 0. However, we cannot reject the null hypothesis that the two sexes will have equal birth weights in the population, holding Age and Term constant. The 95% confidence intervals indicate the plausible range for each of the regression coefficients in the population. Finally, the R² is still 0.38. The gain in prediction can be tested when any set of 1 or more predictors is added to the equation,
F = [(SSregression−full − SSregression−reduced)/dfset] / [SSresidual−full/dfresidual−full]    (1)
Comparing Model (3), the full model with all 3 predictors, with Model (1), the reduced model that only included Term as a predictor, we find

F = [(6322743 − 3598397)/2] / [10431241/52] = 1362173/200601 ≈ 6.79;   df = 2, 52;   p < 0.01    (2)
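A minimal sketch of the incremental F test in (1) and (2), plugging in the sums of squares and degrees of freedom reported in Tables 1 and 3; the function name is ours, not part of any package.

```python
from scipy import stats

def incremental_f(ss_reg_full, ss_reg_reduced, df_set, ss_resid_full, df_resid_full):
    """F test for the gain in prediction when a set of predictors is added (Equation 1)."""
    f = ((ss_reg_full - ss_reg_reduced) / df_set) / (ss_resid_full / df_resid_full)
    return f, stats.f.sf(f, df_set, df_resid_full)

# Model 3 (full) versus Model 1 (reduced); SS and df values from Tables 1 and 3.
f, p = incremental_f(6322743, 3598397, 2, 10431241, 52)
print(round(f, 2), round(p, 4))   # F ≈ 6.79 on (2, 52) df
```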
The necessary Sums of Squares (SS ) are given in Tables 1 and 3 and the df1 is the difference in the number of predictors between the full and reduced models, here 3 − 1 = 2, and the df2 is identical to that of the full model. This test indicates that adding Age and Sex to the model leads to a statistically significant increase in prediction. Table 3 also presents the information about the relationship between each IV and the DV in three other metrics. The standardized regression coefficient indicates the number of standard deviations (SD) the DV changes for each 1 SD change in the IV, controlling for the other IVs. The squared semipartial (part) correlation indicates the unique proportion of the total variation in the DV accounted for by each
IV. The squared partial correlation (see Partial Correlation Coefficients) indicates the unique proportion of the remaining variation in the DV accounted for by an IV after the contributions of each of the other IVs have been subtracted out. The squared partial correlation will be larger than the squared semipartial correlation unless the other IVs collectively do not account for any variation. More complex forms of multiple regression models can also be specified. A quadratic relationship [1] can be expressed as Y = b0 + b1X + b2X² + e. An interaction [1] in which Y depends on the combination of X1 and X2 can be expressed as Y = b0 + b1X1 + b2X2 + b3X1X2 + e. In some cases, such as when there is a very large range on the DV, Y may be transformed by replacing it by a function of Y, for example, log(Y). Similar transformations may also be performed on each IV [4]. Nonlinear models which follow a variety of mathematical functions can also be estimated [6, 8] (see Nonlinear Models). For example, the growth of learning over repeated trials often initially increases rapidly and then gets slower and slower until it approaches a maximum possible value (asymptote). This form can often be represented by an exponential function. Multiple regression is adversely affected by violations of assumptions [3, 4, 6]. First, data may be clustered because individuals are measured in groups or repeated measurements are taken on the same individuals over time (nonindependence) (see Linear Multilevel Models and Generalized Linear Mixed Models). Second, the variance of the residuals around the regression line may not be constant (heteroscedasticity). Third, the residuals may not have a normal distribution (nonnormality). Violations of these assumptions may require special procedures such as random coefficient models (3) for clustering and the use of alternative estimation procedures for nonconstant variance to produce proper results, notably correct standard errors. Careful examination of plots of the residuals (see Figure 2) can diagnose problems in the regression model and lead to improvements in the model (e.g., addition of an omitted IV). Another problem is outliers, extreme data points that are far from the mass of observations [3, 4, 6]. Outliers may represent observational errors or extremely unusual true cases (e.g., an 80-year-old college student). Outliers can potentially profoundly affect the values of the regression coefficients, even reversing the sign in extreme cases. Outlier statistics are case statistics, with a set of different diagnostic statistics being computed for each case i in the data set. Leverage (hii) measures how extreme a case is in the set of IVs; it is a measure of the distance of case i from the centroid of the Xs (the mean of the IVs). Distance measures the discrepancy between the observed Y and the predicted Ŷ (Y − Ŷ) for each case, indicating how poorly the model fits the case in question. Because outliers can affect the slope of the
Figure 2 Model checking plots (a) Residuals vs. Predicted Values (b) Q–Q Plot of Residuals vs Normal Distribution Note: In (a), residuals should be approximately evenly distributed around the 0-line if the residuals are homoscedastic. In (b), the residuals should approximately fit the superimposed straight line if they are normally distributed. The assumptions of homoscedasticity and normality of residuals appear to be met.
regression line, the externally Studentized residual ti is used rather than the simple residual e, as ti provides a much clearer diagnosis. Conceptually, the regression equation is estimated with case i deleted from the data set. The values of the IVs for case i are substituted into this equation to calculate Ŷi(i) for case i. ti is the standardized difference between the observed value of Yi and Ŷi(i). Influence, which is the combination of high leverage and high distance, indicates the extent to which the results of the regression equation are affected by the outlier. DFFITSi is a standardized measure of global influence which describes the number of SDs by which Ŷ would change if case i were deleted: DFFITSi = ti√(hii/(1 − hii)). (Cook’s Di is a closely related alternative measure of global influence in a different metric.) Finally, DFBETASij is a standardized measure of influence which describes the number of SDs regression coefficient bj would change if case i were deleted. Cases with extreme values on the outlier diagnostic statistics should be examined, with suggested rule-of-thumb cutoff values for large sample sizes being 2(p + 1)/n for hii, ±3 or ±4 for ti, ±2√((p + 1)/n) for DFFITSi, and ±2/√n for DFBETASij. No extreme outliers were identified in our birth weight data set. Methods of addressing outliers include dropping them from the data set and limiting the conclusions that are reached, and using robust regression procedures that downweight their influence (see Influential Observations). In summary, multiple regression is a highly flexible method for studying the relationship between a set of IVs and a DV. Different types of IVs may be studied, and a variety of forms of the relationship between each IV and the DV may be represented. Multiple regression requires attention to its assumptions and to possible outliers to obtain optimal results and improve the model. A chapter-length introduction [2], full-length introductions for behavioral scientists [3, 5, 7], and full-length introductions for statisticians [6, 8, 10] to multiple regression are available. Numerous multiple regression programs exist, with SPSS Regression and SAS PROC Reg being most commonly used (see Software for Statistical Analyses). Cook and Weisberg [4] offer an excellent freeware program, ARC (http://www.stat.umn.edu/arc/).
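To make the case diagnostics above concrete, here is a sketch that computes leverage, externally Studentized residuals, DFFITS, and DFBETAS from the standard OLS identities and flags cases against the rule-of-thumb cutoffs quoted earlier. The function name and the simulated data are ours; with the real Term, Age, and Sex predictors the largest absolute values would correspond to the diagnostics row of Table 3.

```python
import numpy as np

def case_diagnostics(X, y):
    """Leverage h_ii, externally Studentized residuals t_i, DFFITS_i, and DFBETAS_ij
    for an OLS fit. X: (n, p) predictors without the intercept; y: (n,) response."""
    n, p = X.shape
    D = np.column_stack([np.ones(n), X])          # design matrix with intercept
    XtX_inv = np.linalg.inv(D.T @ D)
    h = np.einsum("ij,jk,ik->i", D, XtX_inv, D)   # leverage = diagonal of hat matrix
    b = XtX_inv @ D.T @ y
    e = y - D @ b                                 # raw residuals
    df = n - p - 1
    # delete-one residual variance from the standard identity
    s2_del = (e @ e - e**2 / (1 - h)) / (df - 1)
    t_i = e / np.sqrt(s2_del * (1 - h))           # externally Studentized residuals
    dffits = t_i * np.sqrt(h / (1 - h))           # global influence
    # DFBETAS: standardized change in each coefficient if case i were deleted
    dfbeta = (XtX_inv @ D.T * (e / (1 - h))).T    # (n, p+1) raw changes
    dfbetas = dfbeta / np.sqrt(s2_del[:, None] * np.diag(XtX_inv)[None, :])
    return h, t_i, dffits, dfbetas

# Illustration on simulated predictors (56 cases, 3 IVs), flagging cases against
# the large-sample cutoffs quoted in the text.
rng = np.random.default_rng(0)
X = rng.normal(size=(56, 3))
y = X @ [140.0, 50.0, -20.0] + rng.normal(0, 450, 56)
h, t_i, dffits, dfbetas = case_diagnostics(X, y)
p = X.shape[1]
print((h > 2 * (p + 1) / 56).sum(), (np.abs(t_i) > 3).sum(),
      (np.abs(dffits) > 2 * np.sqrt((p + 1) / 56)).sum())
```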
References

[1] Aiken, L.S. & West, S.G. (1991). Multiple Regression: Testing and Interpreting Interactions, Sage Publications, Newbury Park.
[2] Aiken, L.S. & West, S.G. (2003). Multiple linear regression, in Handbook of Psychology, (Vol. 2): Research Methods in Psychology, J.A. Schinka & W.F. Velicer, eds, Wiley, New York, pp. 483–532.
[3] Cohen, J., Cohen, P., West, S.G. & Aiken, L.S. (2003). Applied Multiple Regression/Correlation Analysis for the Behavioral Sciences, 3rd Edition, Lawrence Erlbaum, Mahwah.
[4] Cook, R.D. & Weisberg, S. (1999). Applied Regression Including Computing and Graphics, Wiley, New York.
[5] Fox, J. (1997). Applied Regression Analysis, Linear Models, and Related Methods, Sage Publications, Thousand Oaks.
[6] Neter, J., Kutner, M.H., Nachtsheim, C.J. & Wasserman, W. (1996). Applied Linear Regression Models, 3rd Edition, Irwin, Chicago.
[7] Pedhazur, E.J. (1997). Multiple Regression in Behavioral Research, 3rd Edition, Harcourt Brace, Fort Worth.
[8] Ryan, T.P. (1996). Modern Regression Methods, Wiley, New York.
[9] Wainer, H. (2000). The centercept: an estimable and meaningful regression parameter, Psychological Science 11, 434–436.
[10] Weisberg, S. (2005). Applied Linear Regression, 3rd Edition, Wiley, New York.
STEPHEN G. WEST AND LEONA S. AIKEN
Multiple Testing
RICHARD B. DARLINGTON
Volume 3, pp. 1338–1343 in Encyclopedia of Statistics in Behavioral Science
ISBN-13: 978-0-470-86080-9; ISBN-10: 0-470-86080-4
Editors: Brian S. Everitt & David C. Howell. John Wiley & Sons, Ltd, Chichester, 2005
Multiple Testing The multiple-testing problem arises when a person or team performs two or more significance tests, because the probability of finding at least one significant result exceeds the nominal probability α. Multiple-testing methods are designed to correct for this liberal bias. Typical cases of multiple tests are the multiple comparisons of cell means in analysis of variance, testing all the values in a matrix of correlations (see Correlation and Covariance Matrices), testing all the slopes in a multiple linear regression, testing each cell in a contingency table for the cell frequency’s deviation from its expected value, and finding multiple tests of the same general question in a review of scientific literature. The multiple-comparisons problem has received far more attention than the others, and a separate article is devoted to that topic (see A Priori v Post Hoc Testing). We also cover that multiple-comparisons problem briefly here, but we focus more heavily on the more general problem. If pm denotes the most significant value among B significance levels, the problem is to find the probability that a value at least as small as pm would have been found just by chance in the total set of B values. We shall denote this probability as PC, for ‘corrected pm’. We do not generally assume that the B trials are statistically independent; of the five examples in the opening paragraph, trials are independent only in the last one, in which the various significance levels are obtained from completely different experiments. When independence does exist, the exact formula for PC is well known: (1 − pm) is the probability of failing to find the pm in question on any one trial, so (1 − pm)^B is the probability of failing on all B trials, so 1 − (1 − pm)^B is the probability of finding at least that level on at least one trial. Thus PC = 1 − (1 − pm)^B. Here we review the most widely used methods for correcting for multiple tests and also consider more general questions that arise when dealing with multiple tests. We begin with a review of methods.
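The exact correction for independent tests and the Bonferroni product can be compared directly; a short sketch (function names ours):

```python
def exact_corrected_p(p_min, B):
    """Chance of a p value at least this small somewhere among B independent tests."""
    return 1 - (1 - p_min) ** B

def bonferroni_corrected_p(p_min, B):
    """Bonferroni correction; can exceed 1 when B or p_min is large."""
    return B * p_min

# Most significant of 10 independent tests is p = .005
print(exact_corrected_p(0.005, 10))        # ≈ 0.0489
print(bonferroni_corrected_p(0.005, 10))   # 0.05
```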
The Bonferroni Method Ryan [5] and Dunn [3] were actually the first prominent advocates of the method that is now generally called the Bonferroni method, but nevertheless
here we shall call the method by its most widely used name. The Bonferroni method is noteworthy for its extreme flexibility; it can be applied in all of the cases mentioned in our opening paragraph, plus many more. It is also remarkably simple. If you have selected the most significant value pm out of B significance levels that you computed, then you simply multiply these two values together, so PC = B × pm. The Bonferroni method has become increasingly practical in recent years because modern computer packages can usually give the exact values of very small p values; these are, of course, necessary for the Bonferroni method. There is no requirement that the tests analyzed by the Bonferroni method be statistically similar. For instance, they may be any mixture of parametric and nonparametric tests. In theory, they could even be a mixture of one-tailed and two-tailed tests, but scientifically, the conclusions would usually be most meaningful if all tests were one or the other. If tests are independent and the true PC is .05 or below, the Bonferroni method has only a slight conservative bias in comparison to the exact formula mentioned above. By trying the exact formula and the Bonferroni formula with various values of B, one can readily verify that when B < 10⁹ and the exact formula yields PC = .05, the Bonferroni formula never yields PC over .0513. But when the exact PC is very high, the Bonferroni PC may be far higher still – even over 1.0. Thus, if the Bonferroni formula yields PC > .05 for independent tests, you should take the extra time to calculate the exact value. Contrary to intuition, the Bonferroni method can be reasonably powerful even when B is very large. For instance, consider a Pearson correlation of .5 in a sample of 52 cases. We have p < .000001, which would yield PC below .05 even if B were 50 000. Ryan [6] has shown that the Bonferroni method is never too liberal, even if the various tests in the analysis are nonindependent. The more positively correlated the tests in the analysis are, the more conservative the Bonferroni test is relative to some theoretically optimum test. One can think of the ordinary two-tailed test as a one-tailed test corrected by a Bonferroni factor of 2 to allow for the possibility that the investigator chose the direction post hoc to fit the data. There is no conservative bias in the two-tailed test because the two possible one-tailed tests are perfectly negatively correlated – they can never
both yield positive results. But there is no liberal bias even in this worst case. The Bonferroni method easily handles a problem that arises in the area of multiple comparisons, which the classical methods for multiple comparisons cannot handle. If you have decided in advance that you are interested in only some subset of all the possible comparisons, and you test only those, then you can apply the Bonferroni method to correct for the exact number of tests you have performed. There is also Bonferroni layering. If the most significant of k results is significant after a Bonferroni correction, then you can test the next most significant result using a Bonferroni factor of (k − 1) because that is the most significant of the remaining (k − 1) results. You can then continue, testing the third, fourth, and other most significant results using correction factors of (k − 2), (k − 3), and so on.
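Bonferroni layering amounts to a short step-down routine: sort the p values and test the most significant remaining one at each stage with a shrinking correction factor. A sketch under those rules (function name ours):

```python
def bonferroni_layering(pvalues, alpha=0.05):
    """Step-down Bonferroni layering: test the most significant of the remaining
    results with factors k, k-1, k-2, ...; stop at the first nonsignificant result."""
    k = len(pvalues)
    retained = []
    for step, p in enumerate(sorted(pvalues)):
        corrected = min(1.0, (k - step) * p)
        if corrected > alpha:
            break
        retained.append((p, corrected))
    return retained

print(bonferroni_layering([0.001, 0.012, 0.020, 0.40]))
# three of the four results survive the layering
```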
Multiple Comparisons The opening paragraph gave five instances of the multiple-tests problem, and there are many others. Thus, the problem of multiple comparisons in analysis of variance is just one of many instances. But this one instance has attracted extraordinary attention from statisticians. Lengthy reviews of this work appear in works like [4] and [7], and the topic is covered in a separate article in this encyclopedia (see A Priori v Post Hoc Testing). Thus, we mention only a few highlights here. In analysis of variance, a pairwise comparison is a test of the difference between two particular cell means. For simplicity, we shall emphasize one-way analyses of variance, though our discussion applies with little change to more complex designs. If there are k cells in a design, then the number of possible pairwise comparisons is k(k − 1)/2. Thus, one valid approach is to compute all possible pairwise values of t, test the most significant one using a Bonferroni correction factor of k(k − 1)/2, and then use Bonferroni layering to test the successively smaller values of t. This approach is very flexible: it does not require equal cell frequencies or the assumption of equal within-cell variances, and tests may be one-tailed, two-tailed, or even mixed. Virtually all alternative methods are less flexible but do gain a small amount of power relative to the method just mentioned.
Several of these methods use the studentized range, defined as sr = (Mi − Mj)/√(MSE/nh), where Mi and Mj are the two cell means in question, MSE is the mean squared error, and nh is the harmonic mean of the two cell frequencies. This is essentially the t that one would calculate to test the difference between just those two cell means, with two differences: sr uses the pooled variance for the entire analysis rather than for just the two cells being tested, and sr is multiplied by an extra factor of √2. The latter difference arises because nh = 2/(1/ni + 1/nj), which is twice the comparable value used in the ordinary t Test. The studentized range is then compared to critical values specific to the method in question. Perhaps the best-known method using sr is the Tukey HSD (honestly significant difference) method. This provides critical values for the largest of all the k(k − 1)/2 values of sr for a given data set. The HSD test is slightly more powerful than the Bonferroni method. For instance, when k = 10, residual degrees of freedom = 60, and α = .05 two-tailed, the critical t-values for the Tukey and Bonferroni methods are 3.29 and 3.43 respectively. The Duncan method was designed to address this same problem; it also provides tables of critical values of sr. This method is widely available but is commonly considered logically flawed. It has the unusual property that the larger the k, the more likely the outcome is to be significant when the null is true. For instance, if k = 100 and if tests are at the .05 level and if all 100 true cell means are equal, the probability is nevertheless .9938 that at least one comparison will be found to be significant. The general expression for such values is 1 − (1 − α)^(k−1). Duncan argued that this is reasonable, since when k is large, the largest cell difference is very likely to reflect a real difference. This view seems overly optimistic – if you were testing 100 different pills, each promising to make users lose 20 pounds of weight in the next month, would you want to use a statistical method almost guaranteed to conclude that at least one was effective? The Fisher protected t Test is also called the LSD (least significant difference) or protected LSD test. Unlike the HSD or Duncan method, it requires no special tables. Rather, the investigator first tests the overall F, and if it is significant, tests as many comparisons as desired with ordinary t Tests.
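The 3.29 versus 3.43 comparison quoted above can be reproduced from the studentized range and t distributions; a sketch assuming SciPy 1.7 or later, which provides studentized_range:

```python
from scipy.stats import studentized_range, t

k, df, alpha = 10, 60, 0.05
n_comp = k * (k - 1) // 2                      # 45 pairwise comparisons

q_crit = studentized_range.ppf(1 - alpha, k, df)
tukey_t = q_crit / 2 ** 0.5                    # HSD critical value on the t scale
bonf_t = t.ppf(1 - alpha / (2 * n_comp), df)   # Bonferroni critical t, two-tailed

print(round(tukey_t, 2), round(bonf_t, 2))     # ≈ 3.29 and 3.43
```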
This test is controversial, and its advantages and disadvantages are discussed further below. The Dunnett method is for a different problem in which the k cells include k − 1 treatment groups and one control group, and the only comparisons to be tested are between the control group and the treatment groups. It uses a studentized range. The Bonferroni method could be used here with a correction factor of k − 1. The Dunnett method is slightly more powerful than the Bonferroni method. The Scheffé method is the only method discussed here that applies to comparisons involving more than two cell means, such as when you average the highest 3 of 10 cell means, average the lowest 3 cell means, and want to test the difference between the two averages. Weighted averages may also be used. Since the number of possible weighted averages is infinite, the Bonferroni method cannot be used because B would be infinite. The Scheffé method allows the investigator to use any post hoc strategy whatsoever to select the most significant comparisons and to test as many comparisons as desired. When corrected by the Scheffé formula, the tests will still be valid. The Scheffé formula is simple: compute t for the comparison of interest, then compute F = t²/(k − 1), and compare F to an F table using the same df one would use to test the overall F for that data set. The Scheffé method is less powerful than any of the other methods discussed here and is thus not recommended if any of the other methods are applicable.
Layering Methods for Multiple Comparisons Suppose there are 10 cells in a one-way analysis of variance, and you number the cells in the order of their means, from 1 (lowest) to 10 (highest). Suppose that by the HSD method you have found a significant difference between cell means 1 and 10. You might then want to test the difference between means 1 and 9 and the difference between means 2 and 10. If the 1 to 9 difference is significant, you might want to test the difference between means 1 and 8 and the difference between means 2 and 9. Continuing in this way so long as results continue to be significant, you might ultimately want to test the difference between adjacent cell means. This entire process is called layering or step-down analysis, and several methods for it have been proposed.
The Newman–Keuls method simply uses the HSD tables for lower values of k, setting k = m + 1 where m is the difference between the ranks of the two means being compared. For instance, suppose you are testing the difference between means 2 and 7, so m = 5. These means are the highest and lowest of a set of 6 means, so use the HSD tables with k = 6. This method has been criticized as too liberal, but is defended in a later section. The Ryan–Einot–Gabriel–Welsch method is a noncontroversial Bonferroni-type method for the layering problem in multiple comparisons. Compute the t and p for any comparison in the ordinary way, but then multiply p by a Bonferroni correction factor of k × m/2.
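As a sketch of the correction just described (function name ours):

```python
def regw_corrected_p(p, k, m):
    """Bonferroni-type correction factor k*m/2 for a comparison of two means whose
    ranks differ by m among k cell means, as described for the REGW procedure."""
    return min(1.0, p * k * m / 2)

print(regw_corrected_p(0.004, k=10, m=5))   # 0.004 * 25 = 0.1
```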
Contingency Tables Suppose you want to test each cell in a contingency table for a difference between that cell’s observed and expected frequency. The sufficient statistics for that null hypothesis are the cell frequency, the corresponding column and row totals, and the grand total for the entire table. Thus you can use those four frequencies to construct a 2 × 2 table in which the four frequencies are the frequency in the focal cell, all other cases in the focal row, all other cases in the focal column, and all cases not in any of the first three. One can then apply any valid 2 × 2 test to that table. If you do this for each cell in the table, the Bonferroni method can be used to correct for multiple tests. If R and C are the numbers of rows and columns, then the total number of tests is R × C, so that should be the Bonferroni correction. Bonferroni layering may be used. It could be argued that once a cell has been identified as an outlying cell, its frequency should somehow be ignored in computing the expected frequencies for other cells. As explained in [1], that problem can be addressed by fitting models with ‘structural zeros’ for cells already identified as outliers.
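A sketch of the cell-by-cell construction described above: collapse the table around a focal cell into a 2 × 2 table, test it, and Bonferroni-correct by R × C. The choice of Fisher's exact test and the counts below are ours; any valid 2 × 2 test could be substituted.

```python
import numpy as np
from scipy.stats import fisher_exact

def focal_cell_test(table, i, j):
    """Collapse an R x C table around cell (i, j) into a 2 x 2 table, test it, and
    return the raw p value and the Bonferroni-corrected (R*C) p value."""
    table = np.asarray(table)
    a = table[i, j]                      # focal cell
    b = table[i, :].sum() - a            # rest of the focal row
    c = table[:, j].sum() - a            # rest of the focal column
    d = table.sum() - a - b - c          # all remaining cases
    _, p = fisher_exact([[a, b], [c, d]])
    return p, min(1.0, p * table.size)

counts = [[20, 5, 10],
          [8, 25, 7]]
print(focal_cell_test(counts, 0, 0))
```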
What Tests Should One Correct for? Particularly in the multiple-comparisons area, writers have generally assumed that the set of tests used in a single correction should be all the tests in a single ‘experiment’, or in a complex experiment, all
the tests in a single ‘effect’ – that is, all the tests in a single line of the analysis of variance table. The corrected significance levels calculated in this way are called ‘experimentwise’ levels in the former case or ‘familywise’ levels in the latter. But these answers do not apply to many of the examples in our opening paragraph, such as when one is reviewing many studies by different investigators. The broader answer would seem to be that one is trying to reach a particular conclusion, and one should correct for all the tests one scanned to try to support that conclusion. It should be plausible, to at least some people, that all the individual null hypotheses in the scanned set are true, so that any results that are significant individually appeared mainly because so many tests were scanned. Thus, for instance, we would never have to correct for all the tests ever performed in the history of science because there is no serious claim that all the null hypotheses in the history of science are true – the dozens of reliable appliances we use every day contradict that hypothesis. A single result can drastically change the set that seems appropriate to correct for. For instance, suppose you search the literature on a topic and find 50 significance tests by various investigators. They do not all test the same null hypothesis, though all are in the same general area. One possibility is that all 50 null hypotheses are true, and that might be plausible. But suppose you can divide the 50 tests into 5 groups (A to E) on the basis of their scientific content. Suppose you find that single tests in groups B and D are significant even after any reasonable correction. Then you can dismiss the null hypotheses that there are no real effects in areas B and D, and you might evaluate other tests in those areas with no further correction for multiple tests. Of course, you have also dismissed the null hypothesis of no real effects in the total set of 50. Thus, in areas A, C, and E, you might apply corrections for multiple tests, correcting each only for the number of tests in those subareas. Thus, one or two highly significant results may, as in this example, greatly change your view of what ‘sets’ need to be tested. The choice of the proper sets should be made by subject-matter experts, not by statistical consultants. This line of reasoning can be used to defend the Newman–Keuls method. In a one-way analysis of variance, suppose the set of k cell means is divided into subsets such that all true cell means within a subset are equal but all the subsets differ from each
other. In that situation, the Newman–Keuls method fails the familywise and experimentwise criteria of validity because the significant results found in the early stages of the layering process will lead the analyst to perform multiple tests in the final stages, and it is highly probable that some of those later tests will be significant (producing Type 1 errors) just by chance. But it can be argued that if all the cell means in one subset have been shown to differ from all those in another subset, then tests within the two subsets are now effectively in two different scientific areas, even though all the cell means were originally estimated in the same ‘experiment’. We all agree that tests in different scientific areas need not be corrected for each other, even though that policy does mean that on any given day some type 1 errors will be made somewhere in the world. By that line of reasoning, the Newman–Keuls method seems valid. It is harder to see any similar justification for the Fisher LSD method. If one of 10 cell means is far above the other 9, the overall F may well be significant. But this provides no obvious reason for logically separating the lower 9 cell means from each other. If you have selected the largest difference among these 9 cell means, you would be selecting the largest of 36 comparisons. Yet, LSD would allow you to test that with no correction for post hoc selection. There is no clear justification for such a process. The Duncan method is far more liberal than LSD and also seems unreasonable.
An Apparent Contradiction Investigators often observe significant results in an overall test of the null hypothesis – for instance, a significant overall F in a one-way analysis of variance – but then they find no significant individual results. This can happen even without corrections for multiple tests but is far more common when corrections are made. What conclusions can be drawn in such cases? To answer this question, it is useful to distinguish between specific and vague conclusions. A vague conclusion includes the phrase ‘at least one’ or some similar phrase. For instance, in the significant overall F just mentioned, the proper conclusion is that ‘at least one’ of the differences between cell means must be nonzero. But no specific conclusions can be
drawn – conclusions specifying exactly which differences are nonzero, or even naming one difference known to be nonzero. Results like this occur simply because it takes more evidence to reach specific conclusions than vague conclusions. This fact is not limited to statistics. For instance, a detective may conclude after only a few hours of investigation that a murder must have been committed by ‘at least one’ of five people but may take months to name one particular suspect, or may never be able to name one. Thus the situation we are discussing arises because the sample contains enough evidence to reach a vague conclusion but not enough evidence to reach a specific conclusion. For another example, suppose five coins are each flipped 10 times to test the hypothesis that all are fair coins, against the one-tailed hypothesis that some or all are biased toward heads. Suppose the numbers of heads for the five coins are, respectively, 5, 6, 7, 8, and 9. Even the last of the 5 coins is not significantly different from the null at the .05 level after correction for post hoc selection; using the exact correction formula for independent tests yields PC = .0526 for that coin. But collectively, there are 35 heads in 50 flips, and that result is significant with p = .0033. We can reach the vague conclusion that at least one of the coins must be biased toward heads, but we cannot reach any specific conclusions about which one or ones.
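The coin example can be checked numerically with one-tailed binomial p values and the exact correction for independent tests; a short sketch:

```python
from scipy.stats import binom

def one_tailed_p(heads, flips):
    """P(at least this many heads) under a fair coin."""
    return binom.sf(heads - 1, flips, 0.5)

p_best = one_tailed_p(9, 10)      # best single coin: 9 heads in 10 flips
print(1 - (1 - p_best) ** 5)      # ≈ .0526 after exact correction for 5 coins
print(one_tailed_p(35, 50))       # all coins pooled: 35 heads in 50 flips, ≈ .0033
```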
More Techniques for Independent Tests, and the ‘File-Drawer’ Problem
Examples like the last one raise an interesting question: If an overall test is significant but no tests for specific conclusions can be reached, can we reach a conclusion that is more specific than the one already reached, even if not as specific as we might like? The answer is yes. For instance, in the five-coin example, consider the conclusion that at least one of the last two coins is biased toward heads. If we ignored the problem of post hoc selection we would notice that those two coins showed 17 heads in 20 trials, yielding p = .0013 one-tailed. But there are 10 ways to select two coins from five, so we must make a 10-fold Bonferroni correction, yielding PC = .013.
We thus have a significant result yielding a conclusion more specific than the conclusion that at least one of the five coins is biased toward heads. This kind of technique seems particularly useful in literature reviews, sometimes allowing conclusions that are positive and fairly specific even when no individual studies are significant after correction for post hoc selection. Methods for this problem are discussed at length in [2]. These methods also permit allowance for the file-drawer problem – the problem of possible unpublished negative results. In our example, suppose we were worried that as many as four other coins had been flipped but their results had not been given to us because they came out negative. Thus, our two coins with 17 heads between them in fact yielded the best results of nine coins, not merely the best of five. Thus our appropriate Bonferroni correction is 9!/(2! 7!) = 36. Applying this correction we have PC = 36 × .0013 = .0464. Thus our two-coin conclusion is significant beyond the .05 level even after allowing for the possibility that we are unaware of several negative results.
References

[1] Bishop, Y.M., Feinberg, S.E. & Holland, P.W. (1975). Discrete Multivariate Analysis: Theory and Practice, MIT Press, Cambridge.
[2] Darlington, R.B. & Hayes, A.F. (2000). Combining independent P values: extensions of the Stouffer and binomial methods, Psychological Methods 5, 496–515.
[3] Dunn, O.J. (1961). Multiple comparisons among means, Journal of the American Statistical Association 56, 52–64.
[4] Kirk, R.E. (1995). Experimental Design: Procedures for the Behavioral Sciences, 3rd Edition, Brooks/Cole, Pacific Grove.
[5] Ryan, T.A. (1959). Multiple comparisons in psychological research, Psychological Bulletin 56, 26–47.
[6] Ryan, T.A. (1960). Significance tests for multiple comparison of proportions, variances, and other statistics, Psychological Bulletin 57, 318–328.
[7] Winer, B.J., Brown, D.R. & Michels, K.M. (1991). Statistical Principles in Experimental Design, 3rd Edition, McGraw-Hill, New York.
RICHARD B. DARLINGTON
Multitrait–Multimethod Analyses
SUSAN A. PARDY, LEANDRE R. FABRIGAR AND PENNY S. VISSER
Volume 3, pp. 1343–1347 in Encyclopedia of Statistics in Behavioral Science
ISBN-13: 978-0-470-86080-9; ISBN-10: 0-470-86080-4
Editors: Brian S. Everitt & David C. Howell. John Wiley & Sons, Ltd, Chichester, 2005
Multitrait–Multimethod Analyses The multitrait-multimethod (MTMM) design provides a framework for assessing the validity of psychological measures [5] (see Measurement: Overview). Although a variety of MTMM research designs exist, all share a critical feature: the factorial crossing of constructs and measurement procedures (i.e., a set of constructs is assessed with each of several different methods of measurement). MTMM designs have been used to confirm that measures intended to reflect different constructs do assess distinct constructs (i.e., discriminant validity), to confirm that measures intended to assess a given construct are in fact assessing the same construct (i.e., convergent validity), and to gauge the relative influence of method and trait effects.
Traditional Approach to MTMM Analyses Although MTMM studies are ubiquitous, there has been surprisingly little consensus on how MTMM data should be analyzed. Traditionally, data have been analyzed by visually examining four properties of the resulting correlation matrix of measures in MTMM designs [5]. The first property is ‘monotrait-heteromethod’ correlations (i.e., correlations between different measures of the same construct). Large monotrait-heteromethod correlations indicate convergent validity. These correlations cannot be interpreted in isolation, however, because distinct measures of a construct may covary for other reasons (e.g., because measures tap unintended constructs or contain systematic measurement error). Thus, the ‘heterotrait-heteromethod’ correlations (i.e., correlations between different measures of different constructs) should also be examined. Each monotrait-heteromethod correlation should be larger than the heterotrait-heteromethod correlations that share the same row or column in the matrix. In other words, correlations between different measures of the same construct should be larger than correlations between construct/measure combinations that share neither construct nor measurement procedure. Third, the ‘heterotrait-monomethod’ correlations (i.e., correlations between measures of different
constructs with the same measurement procedures) should be inspected. Monotrait-heteromethod correlations should be larger than heterotrait-monomethod correlations. That is, observations of the same construct with different measurement procedures should be more strongly correlated than are observations of different constructs using the same measurement procedures. Finally, all of the heterotrait triangles should be examined. These triangles are the correlations between each of the constructs measured via the same procedures and the correlations between each of the constructs measured via different procedures. Similar patterns of associations among the constructs should be evident in each of the triangles. Although the traditional analytical approach has proven useful, it has significant limitations. For example, it articulates no clear assumptions regarding the underlying nature of traits and methods (e.g., whether traits are correlated with methods) and it implies some assumptions that are unrealistic (e.g., random measurement error is ignored). Further, it provides no means for formally assessing the plausibility of underlying assumptions, and it is also cumbersome to implement with large matrices. Finally, implementation of the traditional approach is an inherently subjective and imprecise process. Some aspects of the matrix may be consistent with the guidelines, whereas others may not, and there are gradations to how well the requirements have been met. Because of these limitations, formal statistical procedures for analyzing MTMM data have been developed. The goal of these approaches is to allow for more precise hypothesis testing and more efficient summarizing of patterns of associations within MTMM correlation matrices.
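The four Campbell and Fiske [5] comparisons can be organized programmatically for a correlation matrix ordered traits-within-methods. The sketch below uses a small hypothetical 3-trait by 2-method matrix whose values are invented purely for illustration; the function name is ours.

```python
import numpy as np

n_traits, n_methods = 3, 2
# Hypothetical MTMM correlation matrix: traits T1..T3 within method 1, then method 2.
R = np.array([
    [1.00, 0.30, 0.25, 0.55, 0.22, 0.18],
    [0.30, 1.00, 0.28, 0.20, 0.60, 0.21],
    [0.25, 0.28, 1.00, 0.17, 0.19, 0.58],
    [0.55, 0.20, 0.17, 1.00, 0.33, 0.27],
    [0.22, 0.60, 0.19, 0.33, 1.00, 0.31],
    [0.18, 0.21, 0.58, 0.27, 0.31, 1.00],
])

def mtmm_blocks(R, n_traits, n_methods):
    """Return monotrait-heteromethod (validity diagonal), heterotrait-heteromethod,
    and heterotrait-monomethod correlations from a trait-within-method matrix."""
    mono_hetero, hetero_hetero, hetero_mono = [], [], []
    for m1 in range(n_methods):
        for m2 in range(n_methods):
            block = R[m1*n_traits:(m1+1)*n_traits, m2*n_traits:(m2+1)*n_traits]
            off_diag = block[~np.eye(n_traits, dtype=bool)]
            if m1 == m2:
                hetero_mono.extend(off_diag)
            elif m1 < m2:
                mono_hetero.extend(np.diag(block))
                hetero_hetero.extend(off_diag)
    return map(np.array, (mono_hetero, hetero_hetero, hetero_mono))

mh, hh, hm = mtmm_blocks(R, n_traits, n_methods)
# Convergent validity: the validity diagonal should be large; discriminant validity:
# it should exceed the heterotrait correlations in the same rows and columns.
print(mh.mean() > hh.max(), mh.mean() > hm.max())
```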
Analysis of Variance (ANOVA) Approach The analysis of variance (ANOVA) approach [6, 9, 14] is rooted in the notion that every MTMM observation is a joint function of three factors: trait, method, and person. ANOVA procedures can be used to parse the variance in MTMM data into that which is attributable to each of these factors and to their interactions. Variance attributable to person (i.e., a large person main effect) is often interpreted as evidence of convergent validity, but not as defined in the classic Campbell and Fiske [5] sense. Rather, it indicates that a general factor accounts for a
substantial portion of the variance in the MTMM matrix (which is useful as a comparison point for higher-order interactions). Variance attributable to the person by trait interaction reflects the degree to which people vary in their level of each trait, averaging across methods of measurement. A large person by trait interaction implies low heterotrait correlations (i.e., discriminant validity). Discriminant validity is high when the person by trait interaction accounts for substantially more variance than the person main effect. Variance attributable to the person by method interaction reflects the degree to which people vary in their responses to the measurement procedures, averaging across traits. Large person by method interactions are undesirable because they indicate that methods of measurement account for substantial variance in the MTMM matrix. The ANOVA approach is useful in that it parsimoniously summarizes the pattern of correlations in an MTMM matrix, and in that it assesses the relative contribution of various sources of variance. However, the ANOVA also requires unrealistic assumptions (i.e., uncorrelated trait factors, uncorrelated method factors, equivalent reliability for measures, the person by trait by method interaction contains only error variance). Furthermore, the use of the person main effect as an index of convergent validity is problematic because it indicates that individual differences account for variation in observations, collapsing across traits and methods. This is conceptually different from the original notion of convergent validity, reflected in the correlations between different methods measuring a single construct. Finally, the ANOVA yields only omnibus information about trait and method effect. The approach precludes the exploration of correlations among traits, among methods, and between traits and methods.
Traditional MTMM-Confirmatory Factor Analysis (CFA) Model The traditional MTMM-confirmatory factor analysis (see Factor Analysis: Confirmatory) model [7] uses covariance structure modeling to estimate a latent factor model in which each measure reflects a trait factor, a method factor, and a unique factor (i.e., random measurement and systematic measurement error specific to that measure). This model implies that measures assessing different constructs but employing a common method will be correlated because
of a shared method of measurement, and measures assessing the same construct but employing different methods will correlate because of a common underlying trait factor. The model also posits that trait factors are intercorrelated and that method factors are intercorrelated. High discriminant validity is reflected by weak correlations among trait factors. High convergent validity is reflected by large trait loadings and weak method factor loadings. Convergent validity is also improved when the variances in unique factors are small. The traditional MTMM-CFA provides a parsimonious representation of MTMM data. Its assumptions are relatively explicit, and for the most part, intuitively follow from the original conceptualization of MTMM designs. Unfortunately, this approach routinely encounters computational difficulties in the form of improper solutions. The reasons for this problem are not entirely understood but might involve statistical or empirical identification problems. There are also certain ambiguities in the assumptions of this model that have generally gone unrecognized. For example, although the model assumes that traits and method effects are additive, interactions are possible [4, 15]. Moreover, the model actually allows for two potentially competing types of trait-method interactions. A hierarchical version of this model has been proposed [12]. This version has potential advantages over the first-order model in that it may be less susceptible to problems related to empirical identification, but it also suffers from some of the same limitations such as including competing types of traitmethod interactions [15].
Restricted Versions of the Traditional CFA Model In an effort to take advantage of the strengths of CFA while avoiding computational problems, some researchers [1, 10, 13] have proposed imposing constraints on the traditional MTMM-CFA model, thereby reducing the number of model parameters. For example, one may constrain the factor loadings from each method factor to the observed variables measured with that method to be equal to 1, and constrain the correlations between method factors to be zero. By imposing constraints, the computational difficulties that arise from underidentification may be remedied. However, these constraints require
assumptions that are often unrealistic. In addition, the imposition of constraints reduces the amount of information provided by the model. For example, restrictions may preclude investigating correlations among method factors or which observed variables a particular trait most strongly influences.
Correlated Uniqueness Model In this model [8, 11], only traits are represented as latent constructs. The influence of methods is represented in correlated unique variances for measures that share a common method. Thus, the unique variances in this model represent three sources of variability in measures: random error of measurement, specific factors, and method factors. Because both random error and specific factors are by definition uncorrelated, correlations among unique variances are attributable to a shared method of measurement. In this model, traits are permitted to correlate, but method factors are implied to be uncorrelated. Low correlations among traits indicate discriminant validity, and high convergent validity is represented by strong trait factor loadings. Method effects are gauged by the magnitude of the correlations among unique variances, with weak correlations implying greater convergent validity. One advantage of this model is that it is less susceptible to computational difficulties. However, this model is based on similar assumptions to the traditional MTMM-CFA model, and therefore, the potential for the same logical inconsistencies exist. Also, because method factors are not directly represented, the model is not well suited for addressing specific questions regarding method effects.
Covariance Component Analysis The Covariance Component Analysis (CCA) model [16] (see Covariance Structure Models) is based on the main effects portion of a random effects analysis of variance (see Fixed and Random Effects), but is typically tested using covariance structure modeling. Although there are a number of ways to parameterize this model, the Browne [3] parameterization is probably the most useful. In this parameterization, trait and method factors are treated as deviations from a general response factor (i.e., a ‘G’ factor). The scale of the latent variables is set by constraining the variance of the ‘G’ factor to 1.00.
Trait and method factors are represented by ‘Trait-deviation’ and ‘Method-deviation’ factors, and as such, the number of factors representing methods or traits is one less than the number of methods or traits included in the matrix. Trait deviations are permitted to correlate with each other but not with method deviations and vice versa. Neither trait nor method deviations correlate with the general response factor. Finally, a latent variable influencing each observed variable is specified. Random measurement error is incorporated into the model through error terms influencing each of the latent variables associated with each measured variable. Unfortunately, even this parameterization does not provide immediately useful estimates, because estimates for the traits and methods sum to zero. Thus, it is necessary to add ‘G’ separately to the trait and method deviations so that the correlation matrix among traits and the correlation matrix among methods can be computed. These matrices are then used to determine discriminant and convergent validity. Low correlations among traits imply discriminant validity. Convergent validity is supported by strong positive correlations among method factors, which indicate that the methods produce similar patterns of responses. The CCA also provides estimates of communalities for the measured variables. The CCA model has advantages over many other models because it is a truly additive model and thus does not allow for the conceptual ambiguities that can occur in many CFA models. Furthermore, the matrices produced efficiently summarize properties of the data. Importantly, the CCA approach does not have the computational problems characteristic of the traditional MTMM-CFA model. A potential limitation is that it provides relatively little information about the performance of individual observed variables.
Composite Direct Product Model Traditionally, MTMM researchers have assumed that trait and method factors combine in an additive fashion to influence measures. However, interactions among trait and method factors may exist, thus making additive models inappropriate for some data sets. The Composite Direct Product (CDP) model [2] was developed to deal with interactions between traits and
methods. In this model, each measure is assumed to be a function of a multiplicative combination of the trait and the method underlying that measure. Each measure is also assumed to be influenced by a unique factor. The model allows for correlations among trait factors and correlations among method factors. No correlations are permitted between method factors and trait factors. High discriminant validity is reflected in weak correlations among traits. As with the CCA, high correlations among methods suggest high convergent validity. Estimates of communalities provide information about the influence of random measurement error and specific factors. One strength of the model is that its parameters can be related quite directly to the questions of interest in MTMM research. The CDP improves upon the traditional MTMM-CFA model and its variants in that it does not suffer from the conceptual ambiguities of these approaches. Moreover, the CDP is less susceptible to computational problems. Applications of the CDP have suggested that it usually produces satisfactory fit and sensible parameter estimates. Interestingly, when applied to data sets for which there is little evidence of trait-method interactions, the CDP model often provides similar fit and parameter estimates to the CCA model. However, the CDP model appears to sometimes perform better than the CCA model for data sets in which there is evidence of trait-method interactions. As with the CCA approach, the model provides relatively little information about the performance of individual measures.
Conclusions When analyzing MTMM data, it seems most sensible to begin with a consideration of the four properties of the MTMM correlation matrix originally proposed by Campbell and Fiske [5]. If the properties of the original MTMM matrix are consistent with the results of more formal analyses, researchers can be more confident in the inferences they draw from those results. In addition, the assumptions of more formal analytical methods (e.g., do traits and methods combine in an additive or multiplicative manner?) can be evaluated by examining the original MTMM matrix [15]. Researchers should then consider what formal statistical analyses are most appropriate. Many approaches have serious limitations. The CCA and the CDP probably have the strongest conceptual
foundations and the fewest practical limitations. Of these two, the CDP has the advantages of being easier to implement and perhaps more robust to violations of its underlying assumptions about the nature of trait-method effects. However, until the contexts in which the CDP and CCA differ are better understood, it might be useful to supplement initial CDP analyses with a CCA analysis using Browne’s [3] parameterization. If the results are similar, one can be confident that the conclusions hold under assumptions of either additive or multiplicative trait-method effects. If they are found to substantially differ, then it is necessary to carefully evaluate the plausibility of each model in light of conceptual and empirical considerations.
References

[1] Alwin, D.F. (1974). Approaches to the interpretation of relationships in the multitrait-multimethod matrix, in Sociological Methodology 1973–1974, H.L. Costner, ed., Jossey-Bass, San Francisco, pp. 79–105.
[2] Browne, M.W. (1984). The decomposition of multitrait-multimethod matrices, British Journal of Mathematical and Statistical Psychology 37, 1–21.
[3] Browne, M.W. (1989). Relationships between an additive model and a multiplicative model for multitrait-multimethod matrices, in Multiway Data Analysis, R. Coppi & S. Bolasco, eds, Elsevier Science, North-Holland, pp. 507–520.
[4] Browne, M.W. (1993). Models for multitrait-multimethod matrices, in Psychometric Methodology: Proceedings of the 7th European Meeting of the Psychometric Society in Trier, R. Steyer, K.F. Wender & K.F. Widaman, eds, Gustav Fischer Verlag, New York, pp. 61–73.
[5] Campbell, D.T. & Fiske, D.W. (1959). Convergent and discriminant validation by the multitrait-multimethod matrix, Psychological Bulletin 56, 81–105.
[6] Guilford, J.P. (1954). Psychometric Methods, McGraw-Hill, New York.
[7] Joreskog, K.G. (1974). Analyzing psychological data by structural analysis of covariance matrices, in Contemporary Developments in Mathematical Psychology, Vol. 2, R.C. Atkinson, D.H. Krantz, R.D. Luce & P. Suppes, eds, W.H. Freeman, San Francisco, pp. 1–56.
[8] Kenny, D.A. (1979). Correlation and Causality, Wiley, New York.
[9] Kenny, D.A. (1995). The multitrait-multimethod matrix: design, analysis, and conceptual issues, in Personality Research, Methods, and Theory: A Festschrift Honoring Donald W. Fiske, P.E. Shrout & S.T. Fiske, eds, Erlbaum, Hillsdale, pp. 111–124.
[10] Kenny, D.A. & Kashy, D.A. (1992). Analysis of the multitrait-multimethod matrix by confirmatory factor analysis, Psychological Bulletin 112, 165–172.
[11] Marsh, H.W. (1989). Confirmatory factor analyses of multitrait-multimethod data: many problems and a few solutions, Applied Psychological Measurement 13, 335–361.
[12] Marsh, H.W. & Hocevar, D. (1988). A new, more powerful approach to multitrait-multimethod analyses: application of second-order confirmatory factor analysis, Journal of Applied Psychology 73, 107–117.
[13] Millsap, R.E. (1992). Sufficient conditions for rotational uniqueness in the additive MTMM model, British Journal of Mathematical and Statistical Psychology 45, 125–138.
[14] Stanley, J.C. (1961). Analysis of unreplicated three-way classifications, with applications to rater bias and trait independence, Psychometrika 26, 205–219.
[15] Visser, P.S., Fabrigar, L.R., Wegener, D.T. & Browne, M.W. (2004). Analyzing Multitrait-Multimethod Data, University of Chicago, Unpublished manuscript.
[16] Wothke, W. (1995). Covariance components analysis of the multitrait-multimethod matrix, in Personality Research, Methods, and Theory: A Festschrift Honoring Donald W. Fiske, P.E. Shrout & S.T. Fiske, eds, Erlbaum, Hillsdale, pp. 125–144.
(See also Structural Equation Modeling: Overview; Structural Equation Modeling: Software) SUSAN A. PARDY, LEANDRE R. FABRIGAR AND PENNY S. VISSER
Multivariate Analysis: Bayesian JIM PRESS Volume 3, pp. 1348–1352 in Encyclopedia of Statistics in Behavioral Science ISBN-13: 978-0-470-86080-9 ISBN-10: 0-470-86080-4 Editors Brian S. Everitt & David C. Howell John Wiley & Sons, Ltd, Chichester, 2005
Multivariate Analysis: Bayesian

Multivariate Bayesian analysis is that branch of statistics that uses Bayes' theorem (see Bayesian Belief Networks) to make inferences about several, generally correlated, unobservable and unknown quantities. The unknown quantities may index probability distributions; they may be hypotheses or propositions; or they may be probabilities themselves (see Bayesian Statistics). Such procedures have widespread applications in the behavioral sciences.
Bayes' Theorem

In Bayesian analysis, an unknown quantity is assigned a probability distribution, to represent one's degree-of-belief about the unknown quantity. This degree-of-belief is then modified via Bayes' theorem, as new information becomes available through observational data, experience, or new theory. Multivariate Bayesian inference is based on Bayes' theorem for correlated random variables. Symbolically, let Θ denote a collection (vector) of k unobservable jointly continuous random variables, and X a collection of N observable, p-dimensional jointly continuous random vectors. Let f(·), g(·), and h(·) denote probability density functions of their arguments. (Lowercase letters will be used to represent observed values of the random variables designated by uppercase letters. Sometimes, for convenience, we will use lowercase letters throughout. We use boldface to designate vectors and matrices.) Bayes' theorem for continuous X and continuous Θ asserts that

h(θ|x) = (1/Q) f(x|θ) g(θ),   (1)

where θ and x denote values of Θ and X, respectively, and Q denotes a one-dimensional constant (possibly depending on x, but not on θ), which is given by

Q = ∫ f(x|θ) g(θ) dθ.   (2)

The integration is taken over all possible values in k-dimensional space, and the notation f(x|θ)
should be understood to mean the density of the conditional distribution of X, given Θ = θ. f(x|θ) is the joint probability density function for X|Θ = θ. When it is viewed as a function of θ, it is called the likelihood function (see Maximum Likelihood Estimation). When X and/or Θ are discrete, we replace integrals by sums, and probability density functions by probability mass functions. g(θ) is the prior density of Θ, since it is the probability density of Θ prior to having observed X. Note that the prior density should not depend in any way upon the current data set, although it certainly could, and often does, depend upon earlier-obtained data sets. If the prior were permitted to depend upon the current data set, using Bayes' theorem in this inappropriate way would violate the laws of probability. h(θ|x) is the posterior density of Θ, since it is the distribution of Θ 'subsequent' to having observed X = x. Bayesian inference in multivariate distributions is based on the posterior distribution of the unobservable random variables, Θ, given the observable data (the unobservable random variable may be a vector or a matrix). A Bayesian estimator of θ is generally taken to be a measure of location of the marginal posterior distribution of Θ, such as its mean, median, or mode. To obtain the marginal posterior density of θ given the data, it is often necessary to integrate the joint posterior density over spaces of other unobservable random variables that are jointly distributed with θ. For example, to determine the marginal posterior density of a mean vector θ, given data x, from the joint posterior density of a mean vector θ and a covariance matrix Σ, given x, we must integrate the joint posterior density of (θ, Σ), given x, over all elements of Σ that make it positive definite.
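To make (1) and (2) concrete, the following sketch evaluates a posterior numerically on a grid for a two-dimensional θ. It is an illustration only: the simulated data, the unit-variance normal likelihood, and the normal prior are assumptions, not part of the original entry. The normalizing constant Q is approximated by summing over the grid, and a marginal posterior is obtained by summing out one component.

```python
import numpy as np

# Grid over a two-dimensional theta = (theta1, theta2)
theta1 = np.linspace(-3.0, 3.0, 121)
theta2 = np.linspace(-3.0, 3.0, 121)
T1, T2 = np.meshgrid(theta1, theta2, indexing="ij")
cell = (theta1[1] - theta1[0]) * (theta2[1] - theta2[0])

# Observed data: 20 bivariate observations x_i ~ N(theta, I) (simulated for illustration)
rng = np.random.default_rng(0)
x = rng.normal(loc=[0.5, -1.0], scale=1.0, size=(20, 2))

def log_likelihood(t1, t2):
    # log f(x | theta) under independent unit-variance normal components
    sq = (x[:, 0, None, None] - t1) ** 2 + (x[:, 1, None, None] - t2) ** 2
    return -0.5 * sq.sum(axis=0)

# Prior g(theta): independent N(0, 2^2) components (an assumption for this sketch)
log_prior = -0.5 * (T1 ** 2 + T2 ** 2) / 4.0

log_post_unnorm = log_likelihood(T1, T2) + log_prior
post_unnorm = np.exp(log_post_unnorm - log_post_unnorm.max())

Q = post_unnorm.sum() * cell                 # numerical analogue of (2), up to a constant factor
posterior = post_unnorm / post_unnorm.sum()  # h(theta | x) as probabilities on the grid

# Marginal posterior of theta1: sum out theta2; posterior mean as a Bayesian estimator
marginal_theta1 = posterior.sum(axis=1)
post_mean = ((T1 * posterior).sum(), (T2 * posterior).sum())
print("posterior mean of theta:", post_mean)
print("marginal posterior mode of theta1:", theta1[marginal_theta1.argmax()])
```

The same grid logic extends to higher dimensions only at rapidly growing cost, which is one reason the numerical methods discussed later in this entry rely on Monte Carlo sampling instead.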
Bayesian Inferences

Bayesian confidence regions (called credibility regions) are obtainable for any preassigned level of credibility directly from the cumulative distribution function of the posterior distribution. We make a distinction here between 'credibility' and 'confidence' that is fundamental, and not just a simple choice of alternative words. The credibility region is a probability region for the unknown, unobservable vector or matrix, conditional on the specific value of the observables that happened to have been observed in this instance,
regardless of what values of the observables might be observed in other instances (the region is based upon P{Θ|X}). For example, R denotes a 95% credibility region for Θ|X if

P{Θ ∈ R | X} = 95%.   (3)
The confidence region, by contrast, is obtained from a probability statement about the observable variables, conditional on the unobservable ones. So it really represents a region based upon the distribution of X as to where the observables are likely to be, rather than where the unobservables are likely to be (the region is based upon P{X|θ}). (For more details, see Confidence Intervals.) When nonuniform, proper prior distributions are used, the resulting credibility and confidence regions will generally be quite different from one another. Predictions about a data vector(s) or matrix not yet observed are carried out by averaging the likelihood for the future observation vector(s) or matrix over the best information we have about the indexing parameters of its distribution, namely, the posterior distribution of the indexing parameters, given the data already observed. Hypothesis testing may be carried out by comparing the posterior probabilities of all competing hypotheses, given all data observed, and selecting the hypothesis with the largest posterior probability. These notions are identical with those in univariate Bayesian analysis. In multivariate Bayesian analysis, however, in order to make posterior inferences about a given hypothesis, conditional upon the observable data, it is generally necessary to integrate over all the components of Θ.
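As an illustration of a credibility statement of the form P{Θ ∈ R | X} = 95%, the sketch below forms equal-tailed 95% credibility intervals componentwise from posterior draws. The draws are simulated stand-ins for MCMC output, so all numbers here are assumptions used purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
# Pretend these are posterior draws of a k = 3 dimensional theta,
# e.g., produced by an MCMC sampler (simulated here for illustration).
posterior_draws = rng.multivariate_normal(
    mean=[0.0, 1.0, -0.5],
    cov=[[1.0, 0.3, 0.0], [0.3, 1.0, 0.2], [0.0, 0.2, 1.0]],
    size=5000,
)

# Equal-tailed 95% credibility interval for each component of theta
lower, upper = np.percentile(posterior_draws, [2.5, 97.5], axis=0)
point_estimate = posterior_draws.mean(axis=0)   # posterior mean as a Bayesian estimator

for j in range(posterior_draws.shape[1]):
    print(f"theta_{j + 1}: {point_estimate[j]:.2f}  "
          f"95% credibility interval [{lower[j]:.2f}, {upper[j]:.2f}]")
```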
Multivariate Prior Distributions

The process of developing a prior distribution to express the beliefs of the analyst about the likely values of a collection of unobservables is called multivariate subjective probability assessment. None of the variables in a collection of unobservables, Θ, is ever known. The multivariate prior probability density function, g(θ) for continuous θ (or its counterpart, the prior probability mass function for discrete θ), is used to denote the degrees of belief the analyst holds about θ. The parameters that index the prior distribution are called hyperparameters.
Table 1  Prior distribution g(θ1, θ2)

             θ2 = 0    θ2 = 1
  θ1 = 0       0.2       0.1
  θ1 = 1       0.3       0.4
For example, suppose a vector mean θ is bivariate (k = 2), so that there are two unobservable, one-dimensional random variables Θ1 and Θ2. Suppose further (for simplicity) that Θ1 and Θ2 are discrete random variables, and let g(θ1, θ2) denote the joint probability mass function for Θ = (Θ1, Θ2)′. Suppose Θ1 and Θ2 can each assume only two values, 0 and 1, and the analyst believes the probable values to be given by those in Table 1. Thus, for example, the analyst believes that the chance that Θ1 and Θ2 are both 1 is 0.4, that is,

P{Θ1 = 1, Θ2 = 1} = g(1, 1) = 0.4.   (4)
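The prior of Table 1 can be stored as a 2 × 2 array and combined with a likelihood over the four possible values of (θ1, θ2), mirroring Bayes' theorem in the discrete case. Only the prior below comes from Table 1; the likelihood values are hypothetical and introduced solely for illustration.

```python
import numpy as np

# Prior g(theta1, theta2) from Table 1: rows index theta1 = 0, 1; columns index theta2 = 0, 1
prior = np.array([[0.2, 0.1],
                  [0.3, 0.4]])
assert np.isclose(prior.sum(), 1.0)

print("P(Theta1 = 1, Theta2 = 1) =", prior[1, 1])   # 0.4, as in (4)
print("marginal of Theta1:", prior.sum(axis=1))     # P(Theta1 = 0), P(Theta1 = 1)

# Hypothetical likelihood f(x | theta1, theta2) for some observed x (assumed values)
likelihood = np.array([[0.5, 0.2],
                       [0.1, 0.7]])

posterior = prior * likelihood
posterior /= posterior.sum()   # divide by Q, the sum over all four (theta1, theta2) values
print("posterior:\n", posterior)
```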
Note that this bivariate prior distribution represents the beliefs of the analyst, and need not correspond to the beliefs of anyone else, neither any individual, nor any group. Other individuals may feel quite differently about Θ. In some situations, the analyst does not feel at all knowledgeable about the likely values of unknown, unobservable variables. In such cases, he/she will probably resort to using a 'vague' (sometimes called 'diffuse' or 'noninformative') prior distribution. Let Θ denote a collection of k continuous, unknown variables, each defined on (−∞, ∞). g(θ) is a vague prior density for Θ if the elements of Θ are mutually independent, and if the probability mass of each variable is diffused evenly over all possible values. We write the (improper) prior density for Θ as g(θ) ∝ constant, where ∝ denotes proportionality. Note that while this density characterization corresponds in form to the density of a uniform distribution, the fact that this uniform distribution must be defined over the entire real line means that the distribution is improper (it does not integrate to one) except in the case where the components of θ are defined on a finite interval. While prior distributions may sometimes be
improper, the corresponding posteriors must always be proper. The vague prior for positive definite random variables is quite different. If Σ denotes a k-dimensional square and symmetric positive definite matrix of variances and covariances, a vague (improper) prior for Σ is given by (see [6]):

g(Σ) ∝ |Σ|^(−(k+1)/2),

where |Σ| denotes the determinant of the matrix Σ. Sometimes natural conjugate or other families of multivariate prior distributions are used (see, e.g., [10, 11]). Multivariate prior distributions are often difficult to assess because of the complexities of thinking in many dimensions simultaneously, and also because the number of parameters to assess is so much greater than in the univariate case. For some methods that have been proposed for assessing multivariate prior distributions, see, for example, [7, 10] or [11, Chapter 5].
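A minimal sketch (not from the original entry) of evaluating this vague prior for Σ on the log scale; the example matrix is an arbitrary assumption, and the density is defined only up to proportionality.

```python
import numpy as np

def log_vague_prior(sigma: np.ndarray) -> float:
    """log g(Sigma) = -((k + 1) / 2) * log|Sigma|, up to an additive constant."""
    k = sigma.shape[0]
    sign, logdet = np.linalg.slogdet(sigma)   # numerically stable log-determinant
    if sign <= 0:
        raise ValueError("Sigma must be positive definite")
    return -0.5 * (k + 1) * logdet

sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
print(log_vague_prior(sigma))
```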
Numerical Methods

Numerical methods are of fundamental importance in Bayesian multivariate analysis. They are used for evaluating and approximating the normalizing constants of multivariate posterior densities, for finding marginal densities, for calculating moments and other functions of marginal densities, for sampling from multivariate posterior densities, and for many other needs associated with multivariate Bayesian statistical inference. The principal methods currently in use are computer-intensive: Markov Chain Monte Carlo (MCMC) methods (see Markov Chain Monte Carlo and Bayesian Statistics) and the Metropolis-Hastings algorithm ([1–4, 8, 9], [11, Chapter 6], and [16, 17]).

An Example

The data for this illustrative example first appeared in Kendall [8]. There are 48 applicants for a certain job and they have been scored on 15 variables regarding their acceptability. The variables are as shown in Table 2. The question is: Is there an underlying subset of factors (of dimension less than 15) that explain the variation observed in the 15-dimensional scores? If so, the applicants could then be compared more easily.

Table 2  The 15 variables on which the applicants were scored (data reproduced from Kendall, Multivariate Analysis, 2nd Edition [8])

 1. Form of letter of application
 2. Appearance
 3. Academic ability
 4. Likability
 5. Self-confidence
 6. Lucidity
 7. Honesty
 8. Salesmanship
 9. Experience
10. Drive
11. Ambition
12. Grasp
13. Potential
14. Keenness to join
15. Suitability

Adopt the Bayesian factor analysis model

x_j = Λ f_j + ε_j,   j = 1, . . . , N,   m < p,   (5)

where Λ is p × m, f_j is m × 1, and ε_j is p × 1.
Here, Λ denotes the (unobservable) factor-loading matrix; f_j denotes the (unobservable) factor score vector for subject j, with F ≡ (f_1, . . . , f_N)′; the ε_j's are assumed to be (unobservable) disturbance vectors, mutually uncorrelated and normally distributed as N(0, Ψ), where Ψ is permitted to be a general, symmetric, positive definite matrix, although E(Ψ) is assumed to be a diagonal matrix. The dimension of the problem is p; in this example, p = 15; m denotes the number of underlying factors (in this example, m = 4, but it is generally unknown); and N denotes the number of subjects; in this example, N = 48. There are three unknown matrix parameters: (Λ, F, Ψ). Assume a priori that

g(Λ, F, Ψ) = g1(Λ | Ψ) g2(Ψ) g3(F),   (6)

where the g's denote prior probability densities of their respective arguments. Thus, F is assumed to be a priori independent of (Λ, Ψ). Adopt, respectively, the normal and inverted Wishart prior densities:

g1(Λ | Ψ) ∝ |Ψ|^(−0.5m) exp{−0.5 tr[Ψ^(−1) (Λ − Λ0) H (Λ − Λ0)′]},

g2(Ψ) ∝ |Ψ|^(−0.5υ) exp{−0.5 tr(Ψ^(−1) B)},

where 'tr' denotes the matrix trace operation. There are many prior densities for F that might be used (see [10]). For simplicity, adopt the vague prior density g3(F) ∝ constant. The quantities H, Λ0, υ, and B are hyperparameters that must be assessed. A model with four factors is postulated on the basis of having carried out a principal components analysis and having found that four factors accounted for 81.5% of the variance. This is therefore the first guess for m. Adopt the simple hyperparameter structure H = n0·Im, for some preassigned scalar n0, where Im denotes the identity matrix of order m. In this example, take H = 10·I4, B = 0.2·I15, and υ = 33. On the basis of the underlying beliefs, the following transposed location hyperparameter, Λ0′, for the factor-loading matrix was constructed (rows correspond to the four factors, columns to the 15 variables of Table 2):

          1    2    3    4    5    6    7    8    9   10   11   12   13   14   15
Λ0′ =  [ 0.0  0.0  0.0  0.0  0.7  0.7  0.0  0.7  0.0  0.7  0.7  0.7  0.7  0.0  0.0
         0.0  0.0  0.7  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
         0.7  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.7  0.0  0.0  0.0  0.0  0.0  0.7
         0.0  0.0  0.0  0.7  0.0  0.0  0.7  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0 ]   (7)

The four underlying factors are found as linear combinations of the original 15 variables. (See [11, Chapter 15], [12, 13], for additional details.) Factor scores and credibility intervals for the factor scores are given in [11, Chapter 15]. It is shown in [12] how the number of underlying factors may be selected by maximizing the posterior probability over m, the number of factors. An empirical Bayes estimation approach is used in [14] to assess the hyperparameters in the model. Other proposals for assessing the hyperparameters may be found in [5] and [15].
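A sketch of how this prior specification might be set up numerically. The values of H, B, and υ are those stated above; the loading pattern used for Λ0, the mapping of υ onto scipy's inverted Wishart parameterization, and the data-generation step are illustrative assumptions rather than a reproduction of the published analysis.

```python
import numpy as np
from scipy.stats import invwishart

rng = np.random.default_rng(42)
p, m, N = 15, 4, 48                      # variables, factors, applicants

# Hyperparameters as stated in the text
H = 10.0 * np.eye(m)                     # H = 10 * I_4
B = 0.2 * np.eye(p)                      # B = 0.2 * I_15
nu = 33

# Prior location Lambda0 (p x m): 0.7 where a variable is expected to load, 0 elsewhere
# (pattern follows the transposed matrix (7) above).
Lambda0 = np.zeros((p, m))
Lambda0[[4, 5, 7, 9, 10, 11, 12], 0] = 0.7   # factor 1: variables 5, 6, 8, 10, 11, 12, 13
Lambda0[2, 1] = 0.7                          # factor 2: variable 3
Lambda0[[0, 8, 14], 2] = 0.7                 # factor 3: variables 1, 9, 15
Lambda0[[3, 6], 3] = 0.7                     # factor 4: variables 4, 7

# Draw Psi from an inverted Wishart prior (scipy parameterization; treating nu as the
# degrees-of-freedom argument is an assumption of this sketch).
Psi = invwishart(df=nu, scale=B).rvs(random_state=rng)

# g1(Lambda | Psi) can be read as a matrix-normal prior with row covariance Psi and
# column covariance H^{-1}; sample it via Lambda = Lambda0 + chol(Psi) Z chol(H^{-1})'.
L_psi = np.linalg.cholesky(Psi)
L_hinv = np.linalg.cholesky(np.linalg.inv(H))
Z = rng.standard_normal((p, m))
Lambda = Lambda0 + L_psi @ Z @ L_hinv.T

# Simulate data from the factor model (5): x_j = Lambda f_j + eps_j
F = rng.standard_normal((N, m))                       # stand-in for the vague prior on F
eps = rng.multivariate_normal(np.zeros(p), Psi, size=N)
X = F @ Lambda.T + eps                                # N x p data matrix
print(X.shape)
```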
References

[1] Chib, S. (2003). Markov chain Monte Carlo methods, Chapter 6 of Subjective and Objective Bayesian Methods: Principles, Models and Applications, S. James Press, John Wiley & Sons, New York.
[2] Gelfand, A.E. & Smith, A.F.M. (1990). Sampling based approaches to calculating marginal densities, Journal of the American Statistical Association 85, 398–409.
[3] Geman, S. & Geman, D. (1984). Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images, IEEE Transactions on Pattern Analysis and Machine Intelligence 6, 721–741.
[4] Hastings, W.K. (1970). Monte Carlo sampling methods using Markov chains and their applications, Biometrika 57, 97–109.
[5] Hyashi, K. (1997). The Press-Shigemasu factor analysis model with estimated hyperparameters, Ph.D. thesis, Department of Psychology, University of North Carolina, Chapel Hill.
[6] Jeffreys, H. (1961). Theory of Probability, 3rd Edition, Clarendon, Oxford.
[7] Kass, R. & Wasserman, L. (1996). The selection of prior distributions by formal rules, Journal of the American Statistical Association 91, 1343–1370.
[8] Kendall, M. Multivariate Analysis, 2nd Edition, Charles Griffin.
[9] Metropolis, N., Rosenbluth, A.W., Rosenbluth, M.N., Teller, A.H. & Teller, E. (1953). Equations of state calculations by fast computing machines, Journal of Chemical Physics 21, 1087–1092.
[10] Press, S.J. (1982). Applied Multivariate Analysis: Using Bayesian and Frequentist Methods of Inference, 2nd Edition (Revised), Krieger, Melbourne.
[11] Press, S.J. (2003). Subjective and Objective Bayesian Statistics: Principles, Models, and Applications, John Wiley & Sons, New York.
[12] Press, S.J. & Shigemasu, K. (1989). Bayesian inference in factor analysis, in Contributions to Probability and Statistics: Essays in Honor of Ingram Olkin, L.J. Gleser, M.D. Perlman, S.J. Press & A.R. Simpson, eds, Springer-Verlag, New York, pp. 271–287.
[13] Press, S.J. & Shigemasu, K. (1999). A note on choosing the number of factors, Communications in Statistics: Theory and Methods 28(8).
[14] Press, S.J., Shin, K.-Il. & Lee, S.E. (2004). Empirical Bayes assessment of the hyperparameters in Bayesian factor analysis, submitted to Communications in Statistics: Theory and Methods.
[15] Shera, D.M. & Ibrahim, J.G. (1998). Prior elicitation and computation for Bayesian factor analysis, submitted manuscript, Department of Biostatistics, School of Public Health, Harvard University, Boston.
[16] Spiegelhalter, D., Thomas, A., Best, N. & Gilks, W. (1994). BUGS: Bayesian inference using Gibbs sampling, available from MRC Biostatistics Unit, Cambridge.
[17] Tanner, M.A. & Wong, W. (1987). The calculation of posterior distributions (with discussion), Journal of the American Statistical Association 82, 528–550.
JIM PRESS
Multivariate Analysis: Overview WOJTEK J. KRZANOWSKI Volume 3, pp. 1352–1359 in Encyclopedia of Statistics in Behavioral Science ISBN-13: 978-0-470-86080-9 ISBN-10: 0-470-86080-4 Editors Brian S. Everitt & David C. Howell John Wiley & Sons, Ltd, Chichester, 2005
Multivariate Analysis: Overview

Many social or behavioral studies lead to the collection of data in which more than one variable (feature, attribute) is measured on each sample member (individual, subject). This can happen either by design, where the variables have been chosen because they represent essential descriptors of the system under study, or because of expediency or economy, where the study has been either difficult or expensive to organize, so as much information as possible is collected once it is in progress. If p variables (X1, X2, . . . , Xp, say) have been measured on each of n sample individuals, then the resulting values can be displayed in a data matrix X. Standard convention is that rows relate to sample individuals and columns to variables, so that X has n rows and p columns, and the element xij in the ith row and jth column is the value of the jth variable exhibited by the ith individual. For example, the data matrix in Table 1 gives the values of seven measurements in inches (chest, waist, hand, head, height, forearm, wrist) taken on each of 17 first-year university students.

Table 1  Data matrix representing seven measures on each of 17 students

Row  Chest  Waist  Hand  Head   Height  Arm   Wrist
 1   36.0   29.00  8.00  22.00  70.50   10.0  6.50
 2   36.0   29.00  8.50  23.00  70.00   10.5  6.50
 3   38.5   31.00  8.50  24.00  72.50   13.0  7.00
 4   36.0   31.00  8.50  23.00  71.00   11.0  7.00
 5   39.0   33.00  9.00  24.00  75.00   11.0  7.50
 6   37.0   33.00  9.00  22.00  71.00   10.0  7.00
 7   38.0   30.50  8.50  23.00  70.00   10.0  7.00
 8   39.0   32.00  9.00  23.00  70.00   11.0  7.00
 9   39.0   32.00  8.00  24.00  73.00   11.0  7.00
10   36.0   30.00  7.00  22.00  70.00   14.0  6.50
11   38.0   32.00  8.00  23.00  71.00   11.0  7.00
12   41.0   36.00  8.50  23.00  75.00   12.0  7.00
13   38.0   33.00  9.00  23.00  73.50   12.0  7.00
14   36.0   31.00  9.00  23.00  70.50   11.5  7.00
15   43.0   32.00  9.00  22.50  74.00   12.0  8.00
16   36.0   30.00  8.50  22.00  68.00   10.0  6.50
17   38.0   34.00  8.00  23.00  70.00   13.0  7.00

In most cases, the rows of a data matrix will not be related in any way, as the values observed on
any one individual in a sample will not be influenced by the values of any other individual. This is obvious in the case of the students’ measurements. However, since the whole set of variables is measured on each individual, the columns will be related to a greater or lesser extent. In the example above, tall students will tend to have large values for all the variables, while short ones will tend to have small values across the board. There is, therefore, correlation between any pair of variables, and a typical multivariate data matrix will have a possibly complex set of interdependencies linking the columns. This means that it is dangerous to analyze the variables individually if general conclusions are desired about the overall system. Rather, techniques that analyze the whole ensemble of variables are needed, and this is the province of multivariate analysis. Multivariate methods have had a slightly curious genesis and development. The earliest work, dating from the end of the nineteenth century, was rooted in practical problems arising from social and educational research, specifically in the disentangling of factors underlying correlated Intelligence Quotient (IQ) tests. This focus gradually widened and led to many developments in the first half of the twentieth century, but because computational aids were very limited, these developments were primarily mathematical. Indeed, multivariate research at that time might almost be viewed as a branch of linear algebra, and practical applications were very limited as a consequence of the formidable computational requirements attaching to most techniques. However, the advent and rapid development of electronic computers in the second half of the twentieth century removed these barriers. The first effect was an expansion in the practical use of established techniques, which was then followed by renewed interest in development of new techniques. Inevitably, this development became more and more dependent on sophisticated computational support, and, as the century drew to a close, the research emphasis had shifted perceptibly from mathematics to computer science. Currently, new results in multivariate analysis are reported as readily in sociological, psychological, biological, or computer science journals as they are in the more traditional statistical, mathematical, or engineering ones. The general heading ‘multivariate analysis’ covers a great variety of different techniques that can be grouped together for ease of description in various
ways. Different authors have different preferences, but, for the purpose of this necessarily brief survey, I will consider them under four headings: Visualization and Description; Extrapolation and Inference; Discrimination and Classification; Modeling and Explanation. I will then end the article by briefly surveying current trends and developments.
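As a concrete illustration of the interdependencies among the columns of a data matrix, the sketch below (not from the original article) computes the correlation matrix for the first five rows of Table 1; the truncation to five students is purely for brevity.

```python
import numpy as np

# First five rows of Table 1 (chest, waist, hand, head, height, forearm, wrist)
X = np.array([
    [36.0, 29.0, 8.0, 22.0, 70.5, 10.0, 6.5],
    [36.0, 29.0, 8.5, 23.0, 70.0, 10.5, 6.5],
    [38.5, 31.0, 8.5, 24.0, 72.5, 13.0, 7.0],
    [36.0, 31.0, 8.5, 23.0, 71.0, 11.0, 7.0],
    [39.0, 33.0, 9.0, 24.0, 75.0, 11.0, 7.5],
])

R = np.corrcoef(X, rowvar=False)   # p x p correlation matrix between the columns
print(np.round(R, 2))
```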
Visualization and Description

A data matrix contains a lot of information, often of quite complex nature. Large-scale investigations or surveys may contain thousands of individuals and hundreds of variables, in which case assimilation of the raw data is a hopeless task. Even in a relatively small dataset, such as that of the university students above, the interrelationships among the variables make direct assimilation of the numbers difficult. As a first step towards analysis, a judicious pictorial representation may help to uncover patterns, identify unusual observations, suggest relationships between variables, and indicate whether assumptions made by formal techniques are reasonable or not. So, effort has been expended in devising ways of representing multivariate data with these purposes in mind, and several broad classes of descriptive techniques can be identified. The simplest idea derives from the pictogram and is a method for directly representing a multivariate sample in two dimensions (so that it can be drawn on a sheet of paper or shown on a computer screen). The aim here is to represent each individual in the sample by a single object (such as a box, a star, a tree, or a face) using the values of the variables for that individual to determine the size or features of the object (see Star and Profile Plots). So, for example, each variable might be associated with a branch of a tree, and the value of the variable determines its length, or each variable is associated with a facial feature, and the value determines either its size or its shape. The idea is simple and appealing, and the methods can be applied whether the data are quantitative, qualitative, or a mixture of both, but the limitations are obvious: difficulties in relating individuals and variables or directly comparing the latter across individuals, inability to cope with large samples, and no scope for developing any interpretations or analyses from the pictures. Andrews' curves [2], where each individual is represented by a Fourier curve over a range
of values, introduces more quantitative possibilities, but limitations still exist. A more fruitful line of approach when the data are quantitative is to imagine the n × p data matrix as being represented by n points in p dimensions. Each variable corresponds to a dimension and each individual to a point, the coordinates of each point on p orthogonal axes being given by the values of the p variables for the corresponding individual. Of course, if p is greater than 2, then such a configuration cannot be seen directly and must somehow be approximated. The scatterplot has long been a powerful method of data display for bivariate data, and computer graphics now offer the user a large number of possibilities. The basic building block is the scatterplot matrix, in which the scatterplots for all p(p − 1) ordered pairs of variables are obtained and arranged in matrix fashion (so that each marginal scale applies to p − 1 plots). The separate plots can be linked by tagging (where individual points are tagged so that they can be identified in each plot), brushing (where sets of points can either be highlighted in, or deleted from, each plot), and painting (where color is used to distinguish grouping or overplotting in the diagrams). The development of such depth-cueing devices as kinematic and stereoscopic displays has led to three-dimensional scatterplot depiction, and modern software packages now routinely include such facilities as dynamic rotations of three-dimensional scatterplots, tourplots, and rocking rotations (i.e., ones with limited angular displacements). Moreover, window-based systems allow the graphics to be displayed in multiple windows that are linked by their observations and/or their variables. For details, see [25] or [26]. Despite their ingenuity and technical sophistication, the above displays are still limited to pairs or triples of variables and stop short at pure visualization. To encompass all variables simultaneously, and to allow the possibility of substantive interpretation, we must move on to projection techniques. The idea here is to find a low-dimensional subspace of the p-dimensional space into which the n points can be projected; for a sufficiently low dimensionality (2 or 3), the points can then be plotted and any patterns in the data easily discerned. The problem is to find a projection that minimizes the distortion caused by approximation in a lower dimensionality. However, what is meant by ‘minimizing distortion’ will depend on what features of the data are considered to
be important; many different projection methods now exist, each highlighting different aspects of the data. The oldest such technique, dating back to the turn of the twentieth century, is principal component analysis (PCA) [17]. This has many different characterizations, but the two most common ones are as the subspace into which the points are orthogonally projected with least sum of squared (perpendicular) displacements, or, equivalently, as the subspace which maximizes the total variance among the coordinates of the points. The latter characterization identifies the subspace as the one with maximum inter-point scatter, so is the one in which the overall configuration of points is best viewed to detect interesting patterns. The technique has various side benefits: it has an analytical solution from which configurations in any chosen dimensionality can be obtained, and the chosen subspace is defined by a set of orthogonal linear combinations or contrasts among the original variables. The latter property enables directions of scatter to be substantively interpreted, or reified, which adds power to any analysis. If the data have arisen from a number of a priori groups, and one objective in data visualization is to highlight group differences, then canonical variate analysis (CVA) [6] is the appropriate technique to use (see Canonical Correlation Analysis; Discriminant Analysis). This has similarities to PCA in that the requisite subspaces are again defined by linear combinations or contrasts of the original variables, and the solution is also obtained analytically for any dimensionality, but the projections in this case are not orthogonal, and, hence, interpretation is a little harder. If other facets of the data are to be highlighted (e.g., subspaces in which either skewness or kurtosis is maximized, or subspaces in which the data are as nonnormal as possible), then computational searches rather than analytical solutions must be undertaken. All such problems fall into the category known as projection pursuit [18]. Moreover, dimensionality reduction can be optimally linked to some other statistical techniques in this way, as with projection pursuit regression. The above projection methods require quantitative data, in order for the underlying model of n points in p dimensions to be realizable. So what can be done if qualitative data (such as presence/absence of attributes, or categories like different colors) are present? Here, we can link the concepts of dissimilarity and distance to enable us to construct a
low-dimensional representation of the data for visualization and interpretation (see Proximity Measures). There are many ways of measuring the dissimilarity between two individuals [15]. If we choose a suitable measure appropriate for our data, then we can compute a dissimilarity matrix that gives the dissimilarity between every pair of individuals in the sample. Multidimensional scaling is a suite of techniques that represents sample individuals by points in low-dimensional space in such a way that the distance between any two points approximates as closely as possible the dissimilarity between the corresponding pair of individuals. If the approximation is in terms of exact values of distance/dissimilarity, then metric multidimensional scaling provides an analytical solution along similar lines to PCA, while if the approximation is between the rank orders of the dissimilarities and distances, then nonmetric multidimensional scaling provides a computational solution along the lines of projection pursuit. Many other variants of multidimensional scaling now also exist [5]. The above are the most commonly used multivariate visualization techniques in practical applications. They are essentially linear techniques, in that they deal with linear combinations of variables and involve linear algebra for their solution. Various nonlinear techniques have recently been developed [13], but are only slowly finding their way into general use.
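Before moving on, a minimal sketch of the projection idea behind PCA: principal component scores obtained from the singular value decomposition of the centered data matrix. The data here are randomly generated placeholders, so the example is illustrative rather than drawn from the article.

```python
import numpy as np

def pca_scores(X: np.ndarray, n_components: int = 2):
    """Project the n x p data matrix X onto its first principal components."""
    Xc = X - X.mean(axis=0)                         # center each column
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    loadings = Vt[:n_components].T                  # p x n_components direction vectors
    scores = Xc @ loadings                          # coordinates of the n points in the subspace
    explained = (s ** 2) / (s ** 2).sum()           # proportion of variance per component
    return scores, loadings, explained[:n_components]

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 7))                        # placeholder data with p = 7 variables
scores, loadings, explained = pca_scores(X, 2)
print("variance explained by first two components:", explained)
```

The two-dimensional scores returned here are exactly the coordinates one would plot to view the configuration of points in the maximum-variance plane.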
Extrapolation and Inference

Frequently, a multivariate data set comes either as a sample from some single population of interest or as a collection of samples from two or more populations of interest. The main focus then may not be on the sample data per se, but rather on extrapolating from the sample(s) to the population(s) of interest, that is, on drawing inferences about the population(s). For example, the data on university students given above is actually a subset of a larger collection, in which the same measurements were taken on samples of both male and female students from each of a number of different institutions. Interest may then center on determining whether there are any differences overall, either between genders or institutions, in terms of the features measured for the samples. In order to make inferences about populations, it is necessary to postulate probability models for these
populations, and then to couch the inferences in terms of statements about the parameters of these models. In classical (frequentist) inference, the traditional ways of doing this are through either estimation (whether point or interval) of the parameters or testing hypotheses about them. In Bayesian inference, all statements are channeled through the posterior distribution of the model parameters (see Bayesian Statistics). Taking classical methods first, the underlying population model for quantitative multivariate data has long been the multivariate normal distribution (see Catalogue of Probability Density Functions), although some attention has also been paid in recent years to the class of elliptically contoured distributions [11]. The multivariate normal distribution is characterized completely by its mean vector and dispersion matrix, and most traditional methods of estimation (such as maximum likelihood, least squares, or invariance) lead to the sample mean vector and some constant times the sample covariance matrix as the estimates of the corresponding population parameters. Moreover, distribution theory easily leads to confidence regions, at least for the population mean vector. So, for estimation, multivariate results appear to be ‘obvious’ generalizations of familiar univariate quantities. Hypothesis testing is not so straightforward, however. Early results were derived by researchers who sought ‘intuitive’ generalizations of univariate test statistics, such as Hotelling’s T 2 statistics for oneand two-sample tests of means as generalizations of the familiar one- and two-sample t statistics in univariate theory. It soon became evident that this avenue afforded limited prospects, however, so a more structured approach was necessary. Several general methods of test construction thus evolved, and the most common of these were the likelihood ratio, the union-intersection, and the invariant test procedures. An unfortunate side effect of this multiplicity is that each approach in general leads to a different test statistic in any single situation, the only exceptions being those that lead to the T 2 statistics mentioned above. Typically, the likelihood ratio approach produces test statistics involving products of all eigenvalues of particular matrices, the union-intersection approach produces test statistics involving just the largest eigenvalue of such matrices, and other approaches frequently produce test statistics involving the sum of the eigenvalues; in
statistical software, the product statistic is generally termed Wilks’ lambda, the largest eigenvalue is usually referred to as Roy’s largest root, while the sum statistic is either Pillai’s trace or Hotelling–Lawley’s trace (for more details see Multivariate Analysis of Variance). Tables of critical values exist [24] for conducting significance tests with these statistics, but it should be noted that the different statistics test for different departures from the null hypothesis, and so, in any given situation, may provide contradictory inferences. One- or two-sample tests can be derived for most situations by using one of the above approaches, and most standard textbooks on multivariate analysis quote the relevant statistics and associated distribution theory. When there are more than two groups, or if there is a more complex design structure on the data (such as a factorial cross-classification, for example), then the multivariate analysis of variance (MANOVA) provides the natural analog of univariate analysis of variance (ANOVA). The calculations for any particular MANOVA design are exactly the same as for the corresponding ANOVA design, except that they are carried out for each separate variable and then for each pair of variables, and the results are collected together into matrices. All resulting ANOVA sums of squares are thereby translated into MANOVA sums of squares and product matrices, and in place of a ratio of mean squares for any particular ANOVA effect, we have a product of one matrix with the inverse of another. Test statistics are then based on the eigenvalues of these matrices, using one of the four variants described above. Moreover, if any effect proves to be significant, then canonical variate analysis (also referred to as descriptive discriminant analysis; see Discriminant Analysis) can be used to investigate the reasons for the significance. Most classical multivariate tests require the data to come from either the multivariate normal or an elliptically contoured distribution for the theory to be valid and for the resulting tests to have good properties. This is sometimes a considerable restriction (see Multivariate Normality Tests). However, the computing revolution that took place in the last quarter of the twentieth century means that very heavy computing can now be undertaken for even the most routine analysis. This has opened up the possibility of conducting inference computationally rather than mathematically, and such computer-intensive procedures as
permutation testing, bootstrapping, and jackknifing [7] now form a routine approach to inference. Bayesian inference requires the specification of a prior distribution for the unknown parameters, and this prior distribution is converted into a posterior distribution by multiplying it by the likelihood of the data. Inference is then conducted with reference to this posterior distribution, perhaps after integrating out those parameters in which there is no interest. For many years, multivariate Bayesian inference was hampered by the complicated multidimensional integrals that had to be effected during the derivation of the posterior distribution, and older textbooks restricted attention either to a small set of conjugate prior distributions or to the case of prior uncertainty. Since the development of Markov Chain Monte Carlo (MCMC) methods in the early 1990s, however, the area has been opened up considerably, and, now, virtually any prior distribution can be handled numerically using the software package BUGS.
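A compact sketch of the two-sample Hotelling T² statistic mentioned above, together with its standard F transformation. The data are simulated, so everything in the example is an assumption used for illustration only.

```python
import numpy as np
from scipy import stats

def hotelling_two_sample(x1: np.ndarray, x2: np.ndarray):
    """Two-sample Hotelling T^2 test assuming a common covariance matrix."""
    n1, p = x1.shape
    n2, _ = x2.shape
    d = x1.mean(axis=0) - x2.mean(axis=0)
    S_pooled = ((n1 - 1) * np.cov(x1, rowvar=False)
                + (n2 - 1) * np.cov(x2, rowvar=False)) / (n1 + n2 - 2)
    T2 = (n1 * n2 / (n1 + n2)) * d @ np.linalg.solve(S_pooled, d)
    F = (n1 + n2 - p - 1) / (p * (n1 + n2 - 2)) * T2      # exact F transformation
    p_value = stats.f.sf(F, p, n1 + n2 - p - 1)
    return T2, F, p_value

rng = np.random.default_rng(2)
x1 = rng.multivariate_normal([0.0, 0.0, 0.0], np.eye(3), size=30)
x2 = rng.multivariate_normal([0.5, 0.0, 0.0], np.eye(3), size=25)
print(hotelling_two_sample(x1, x2))
```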
Discrimination and Classification

Whenever there are a priori groups in a data set, for example the genders or the institutions for the measurements on university students, the above methods give a mechanism for deciding whether the populations from which the groups came are different or not. If the populations are different, then further questions become relevant: How can we best characterize the differences? How can we best allocate a future unidentified individual to his or her correct group? The first of these questions concerns discrimination between the groups, while the second concerns classification of (future) individuals to the specified groups. Such questions have assumed great currency, particularly in such areas as health, where it is important to be able to distinguish between, and to correctly allocate individuals to, groups such as disease classes. This topic is a genuinely multivariate one, as intuitively one feels that increasing the number of variables on each individual should increase the ability to discriminate between groups (see Discriminant Analysis). Unfortunately, however, the word 'classification' has also been used in a rather different context, namely, in the sense of partitioning a collection of individuals into distinct classes. This usage has come down from biological/taxonomic applications, and in the statistical literature, has been distinguished
from allocation by calling it clustering. Another way of distinguishing the two usages, popularized in the pattern recognition and computer science literature, is to refer to the allocation usage as supervised classification, and to the partitioning usage as unsupervised classification. This is the nomenclature that we will adopt in the present overview. At the heart of supervised classification is a function or set of functions of the measured variables. For simplicity, we will refer just to a single function f (x ). If distinguishing the groups is the prime purpose, then f (x ) is usually termed a discriminant function, while if allocation of individuals to groups is the intended usage, it is termed a classification function or classifier. The first appearance of such a quantity in the literature was the now famous linear discriminant function between two groups, derived by R.A. Fisher in 1935 on purely intuitive grounds. Subsequently, Welch demonstrated in 1938 that it formed an optimal classifier into one of two multivariate normal populations that shared the same dispersion matrix, while Anderson provided the sample version in 1951. In the years since then, a great variety of discriminant functions and classifiers have been suggested. Relaxing the assumption of common dispersion matrices in normal populations leads to a quadratic discriminant function; eschewing distributional assumptions altogether requires a nonparametric discriminant function; estimating probabilities of class membership for classification purposes can be done via logistic discriminant functions (see Logistic Regression); producing a classifier by means of a sequence of decisions taken on individual variables results in a tree-based classifier (see Classification and Regression Trees); while recent years have seen the development of a variety of computer-intensive classifiers using neural networks, regularized discriminant functions, and radial basis functions. Bayesian approaches to parametric discriminant analysis were hampered in the early days for the same reasons as already described above, but MCMC procedures are now beginning to be applied in the classification field also. With so many potential classifiers to choose from, ways of assessing their performance is an important consideration. There are various possible measures of performance [16], but the most common one in practical use is the error rate associated with a classifier (see Misclassification Rates). Another focus of interest
in supervised classification is the question of variable selection; in many problems, only a restricted number of variables provide genuine discriminatory information between groups, while the other variables essentially only contribute noise. Classification performance, as well as economy of future measurement, can, therefore, be enhanced by finding this ‘effective’ subset of variables, and much research has, therefore, been conducted into variable selection procedures. In the Bayesian approach, variable selection requires the use of reversible jump MCMC. Full technical details of all these aspects can be found in [8, 16, 22]. Unsupervised classification has an even longer history than the supervised variety, dating back essentially to the numerical taxonomy and biological classification of the nineteenth century. It has developed as a series of ad-hoc advances rather than as a systematic field, and can now be characterized in terms of several groups of loosely connected methods. The oldest set of methods are the hierarchical clustering methods in which the sample of n individuals is either progressively grouped, starting with separate groups for each individual and finishing with a single group containing all individuals by fusing two existing groups at each stage, or progressively splitting from a single group containing all individuals down to n groups, each containing a single individual, by dividing an existing group into two at each stage. Both approaches result in a ‘family tree’ structure, showing how individuals either successively fuse together or split apart, and this structure fitted well into the biological classification schemes that were its genesis (see Hierarchical Clustering). However, over the years, many different criteria have been proposed for fusing or splitting groups, so now the class of hierarchical methods is a very large one. By contrast with hierarchical methods are the optimization methods of clustering, in which the sample of n individuals is partitioned into a prespecified number g of groups in such a way as to optimize some numerical criterion that quantifies the two desirable features of internal cohesion within groups and separability between groups (see k -means Analysis). As can be imagined, many different criteria have been proposed to quantify these features, so there are also many such clustering methods. Moreover, each of these methods often requires formidable computing resources if n is large and needs to be repeated for different values of g if an optimum number of groups is also desired.
Other possible approaches to clustering involve probabilistic models for groups (mode-splitting or mixture separation; see Model Based Cluster Analysis) or latent variable models (see below). Further technical details may be found in [10, 14].
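A minimal sketch of the optimization approach to clustering: Lloyd's k-means algorithm, which alternates between assigning points to the nearest of g centroids and recomputing the centroids. The code and data are illustrative and not tied to any particular package.

```python
import numpy as np

def kmeans(X: np.ndarray, g: int, n_iter: int = 100, seed: int = 0):
    """Partition the rows of X into g clusters by reducing within-cluster sums of squares."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=g, replace=False)]
    for _ in range(n_iter):
        # Distance of every point to every centroid, then nearest-centroid assignment
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute centroids (empty clusters are not handled in this sketch)
        new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(g)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0.0, 1.0, (40, 2)), rng.normal(4.0, 1.0, (40, 2))])
labels, centroids = kmeans(X, g=2)
print(centroids)
```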
Modeling and Explanation

It has already been stressed that association between the variables is one of the chief characteristics of a multivariate data set. If the variables are quantitative, then the correlation coefficient is a familiar measure of the strength of association between two variables, while if the variables are categorical, then the data set can be summarized in a contingency table and the strength of association between the variables can be assessed through a chi-square test using the chi-square measure X². If there are many variables in a data set, then there are potentially many associations between pairs of variables to consider. In the students' measurements, for example, there are seven variables and, hence, 21 correlations to assess. With 20 variables, there will be 190 correlations, and the number rapidly escalates as the number of variables increases. The pairwise nature of these values means that many interrelationships exist among a set of such correlations, and these interrelationships must be disentangled if any sense is to be made of the overall situation. One simple technique is to calculate partial correlations between specified variables when others are considered to be held fixed. This will help to determine whether an apparently high association between a pair of variables is genuine or has been induced by their mutual association with the other variables. Another simple technique is to treat correlation r as 'similarity' between two variables, in which case a measure such as 1 − r² or (1 − r)/2 can be viewed as a 'distance' between them. Applying some form of multidimensional scaling to a resulting set of distances will then represent the variables as points on a low-dimensional graph and will show up groups of 'similar' variables as well as highlighting those variables that are very different from the rest. If variables are qualitative/categorical, then correspondence analysis is a technique for obtaining low-dimensional representations that highlight relationships, both between variables and between groups of individuals. Finally, if the variables have some
form of a priori grouping structure on them, then canonical correlation analysis will extract essential association between these groups from the detail of the whole correlation matrix. However, if a deeper understanding is to be gained into the nature of a set of dependencies, possibly leading on to a substantive explanation of the system, then some form of underlying model must be introduced. This area forms arguably the oldest single strand of multivariate research, that of latent variable models. The origins date back to the pioneering work of Spearman and others in the field of educational testing at the start of the twentieth century, where empirical studies showed the presence of correlations between scores achieved by subjects over a range of different tests. Initial work attempted to explain these correlations by supposing that each score was made up of the subject's value on an underlying, unobservable, variable termed 'intelligence', plus a 'residual' specific to the individual subject and test. First results seemed promising, but it soon became evident that such a single-variable explanation was not sufficient. The natural extension was then to postulate a model in which a particular test score was made up of a combination of contributions from a set of unobservable underlying variables in addition to 'intelligence,' say 'numerical ability,' 'verbal facility,' 'memory,' and so on, plus a residual specific to the individual subject and test. If these underlying factors could account for the observed inter-test correlations, then they would provide a characterization of the system. Determination of number and type of factors, and extraction of the combinations appropriate to each test thus became known as factor analysis. The ideas soon spread to other disciplines in the behavioral science, where unobservable (or latent) variables such as 'extroversion', 'political leanings', or 'degree of industrialization' were considered to be reasonable constructs for explanations of inter-variable dependencies. The stumbling block in the early years was the lack of reliable computational methods for estimating the various model parameters and quantities. Indeed, the first half of the twentieth century was riven with conflicting suggestions and controversies, and many of the proposed methods often failed to find satisfactory solutions. The development of the maximum likelihood approach by Lawley and Maxwell in the 1940s was one major advance, but computational problems still remained and were not satisfactorily
resolved till the work of Jöreskog [19]. Now, it is a popular and much used technique throughout the behavioral sciences (although many technical pitfalls still remain, and successful applications require both skill and care). However, it is essentially only applicable to quantitative data. Development with qualitative or categorical variables was much slower, the first account being that by Lazarsfeld and Henry [20] under the banner of latent structure analysis. This area then underwent rapid development, and, now, there exist many variants such as latent class analysis, latent trait analysis, latent profile analysis, and others (see Finite Mixture Distributions). Further details of these techniques may be found in [4, 9].
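The partial correlations mentioned above can be obtained from the inverse of the correlation matrix; the following sketch (illustrative, with simulated data) does exactly that.

```python
import numpy as np

def partial_correlations(R: np.ndarray) -> np.ndarray:
    """Partial correlation of each pair of variables, holding all other variables fixed."""
    P = np.linalg.inv(R)
    D = np.sqrt(np.outer(np.diag(P), np.diag(P)))
    pcor = -P / D                      # standard relation to the inverse correlation matrix
    np.fill_diagonal(pcor, 1.0)
    return pcor

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 4))
X[:, 3] = X[:, 0] + X[:, 1] + 0.1 * rng.normal(size=200)   # induce dependence for illustration
R = np.corrcoef(X, rowvar=False)
print(np.round(partial_correlations(R), 2))
```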
Current Trends and Developments

The dramatic developments in computer power over the past twenty years or so have already been noted above, and it is these developments that drive much of the current progress in multivariate analysis. If the first half of the twentieth century can be described as years of mathematical foundations and development, and the succeeding twenty or thirty years as ones of widening practical applicability, then recent developments have centered firmly on computational power. Many of these developments have been brought on by the enormous data sets that arise in modern scientific research, due in no small measure to highly sophisticated and rapid data-collecting instruments. Thus, we have spectroscopy in chemistry, flow cytometry in medicine, and microarray experiments in biology (see Microarrays) which all yield multivariate data sets comprising many thousands of observations. Such data matrices bring their own problems and objectives, and this has led to many advances in computationally intensive multivariate analysis. Some specific techniques include: curve fitting through localized nonparametric smoothing and recursive partitioning, a flexible version of which is MARS ('multivariate adaptive regression splines') [12]; the use of wavelets for function estimation [1]; the analysis of data that themselves come in the form of functions [23]; the detection of pattern in two-way arrays by using biclustering [21]; and the decomposition of multivariate spatial and temporal data into meaningful components [3]. While much of the impetus for these developments has come from the scientific area, many of the
resultant techniques are, of course, applicable to the large data sets that are becoming more prevalent in behavioral science also. Indeed, a completely new field of study, data mining, has sprung up in recent years to deal with explorative aspects of large-scale multivariate data, drawing on elements of computer science as well as statistics. It is evident that the field is a rich one, and many more developments are likely in the future.
References

[1] Abramovich, F., Bailey, T.C. & Sapatinas, T. (2000). Wavelet analysis and its statistical applications, Journal of the Royal Statistical Society, Series D 49, 1–29.
[2] Andrews, D.F. (1972). Plots of high-dimensional data, Biometrics 28, 125–136.
[3] Badcock, J., Bailey, T.C., Krzanowski, W.J. & Jonathan, P. (2004). Two projection methods for multivariate process data, Technometrics 46, 392–403.
[4] Bartholomew, D.J. & Knott, M. (1999). Latent Variable Models and Factor Analysis, 2nd Edition, Arnold, London.
[5] Borg, I. & Groenen, P. (1997). Modern Multidimensional Scaling, Theory and Applications, Springer, New York.
[6] Campbell, N.A. & Atchley, W.R. (1981). The geometry of canonical variate analysis, Systematic Zoology 30, 268–280.
[7] Davison, A.C. & Hinkley, D.V. (1997). Bootstrap Methods and their Applications, University Press, Cambridge.
[8] Denison, D.G.T., Holmes, C.C., Mallick, B.K. & Smith, A.F.M. (2002). Bayesian Methods for Nonlinear Classification and Regression, Wiley, Chichester.
[9] Everitt, B.S. (1984). An Introduction to Latent Variable Models, Chapman & Hall, London.
[10] Everitt, B.S., Landau, S. & Leese, M. (2001). Cluster Analysis, 4th Edition, Arnold, London.
[11] Fang, K.-T. & Zhang, Y.-T. (1990). Generalized Multivariate Analysis, Science Press, Beijing and Springer, Berlin.
[12] Friedman, J.H. (1991). Multivariate adaptive regression splines (with discussion), Annals of Statistics 19, 1–141.
[13] Gifi, A. (1990). Nonlinear Multivariate Analysis, Wiley, New York.
[14] Gordon, A.D. (1999). Classification, 2nd Edition, Chapman & Hall/CRC, Boca Raton.
[15] Gower, J.C. & Legendre, P. (1986). Metric and Euclidean properties of dissimilarity coefficients, Journal of Classification 3, 5–48.
[16] Hand, D.J. (1997). Construction and Assessment of Classification Rules, Wiley, Chichester.
[17] Jolliffe, I.T. (2002). Principal Component Analysis, 2nd Edition, Springer, New York.
[18] Jones, M.C. & Sibson, R. (1987). What is projection pursuit? (with discussion), Journal of the Royal Statistical Society, Series A 150, 1–36.
[19] Jöreskog, K.G. (1967). Some contributions to maximum likelihood factor analysis, Psychometrika 32, 443–482.
[20] Lazarsfeld, P.F. & Henry, N.W. (1968). Latent Structure Analysis, Houghton-Mifflin, Boston.
[21] Lazzeroni, L. & Owen, A. (2002). Plaid models for gene expression data, Statistica Sinica 12, 61–86.
[22] McLachlan, G.J. (1992). Discriminant Analysis and Statistical Pattern Recognition, Wiley, New York.
[23] Ramsay, J.O. & Silverman, B.W. (1997). Functional Data Analysis, Springer, New York.
[24] Seber, G.A.F. (1984). Multivariate Observations (Appendix D), Wiley, New York.
[25] Wegman, E.J. & Carr, D.B. (1993). Statistical graphics and visualization, in Computational Statistics. Handbook of Statistics 9, C.R. Rao, ed., North Holland, Amsterdam, pp. 857–958.
[26] Young, F.W., Faldowski, R.A. & McFarlane, M.M. (1993). Multivariate statistical visualization, in Computational Statistics. Handbook of Statistics 9, C.R. Rao, ed., North Holland, Amsterdam, pp. 959–998.
(See also Multivariate Multiple Regression) WOJTEK J. KRZANOWSKI
Multivariate Analysis of Variance JOACHIM HARTUNG
AND
GUIDO KNAPP
Volume 3, pp. 1359–1363 in Encyclopedia of Statistics in Behavioral Science ISBN-13: 978-0-470-86080-9 ISBN-10: 0-470-86080-4 Editors Brian S. Everitt & David C. Howell John Wiley & Sons, Ltd, Chichester, 2005
Multivariate Analysis of Variance
The multivariate linear model is used to explain and to analyze the relationship between one or more independent or explanatory variables and p > 1 quantitative dependent or response variables that have been observed for each of n subjects. When all the explanatory variables are quantitative, the multivariate linear model is referred to as a multivariate multiple linear regression model (see Multivariate Multiple Regression). When all the explanatory variables are measured on a qualitative scale, or in other words, when the observations can only be made within a fixed number of factor level combinations, the model is called a multivariate analysis of variance model. The techniques used to draw inferences about the model parameters are typically referred to by the generic term multivariate analysis of variance or, for short, MANOVA. The simplest MANOVA model is the one-way model in which the p quantitative variables are only affected by a single factor with r levels. The one-way MANOVA model can also be thought of as explaining p quantitative variables arising from r different populations. Typically, the null hypothesis of interest in such a model is that the mean responses of the groups corresponding to the different levels of the factor (or the r populations) are equal. The alternative hypothesis is that there is at least one pair of levels of the factor with different mean responses. Consider the hypothetical example from Morrison [2] that has been changed slightly for the current presentation (Table 1). Assume that s = 8 rats were randomly allocated to each of r = 3 drug treatments (i.e., n = 24) and weight losses in grams were observed for each rat at two time points (after 1 week and after 2 weeks).

Table 1  Weight losses in n = 24 rats (first week, second week)

                 Drug
      A           B            C
   (5, 6)      (7, 6)      (21, 15)
   (5, 4)      (7, 7)      (14, 11)
   (9, 9)      (9, 12)     (17, 12)
   (7, 6)      (6, 8)      (12, 10)
   (7, 10)     (10, 13)    (16, 12)
   (6, 6)      (8, 7)      (14, 9)
   (9, 7)      (7, 6)      (14, 8)
   (8, 10)     (6, 9)      (10, 5)

Let yij1 denote the weight loss of rat j with drug i after 1 week, yij2 the weight loss of the same rat after 2 weeks, and yij = (yij1, yij2)T the vector of observations for rat j on drug i. Let µi1 and µi2 further denote the means of variables 1 and 2 for group i. Then, the expectation of yij is µi = (µi1, µi2)T, and a one-way MANOVA model for this
example is given by yij = µi + eij ,
i = 1, 2,
r = 3,
j = 1, . . . , s = 8,
(1)
with e_{ij} = (e_{ij1}, e_{ij2})^T a vector of errors that arises from a multivariate normal distribution (see Catalogue of Probability Density Functions) with mean vector 0 and (unknown) covariance matrix \Sigma. Consequently, the vector of observations is multivariate normally distributed with mean vector \mu_i and covariance matrix \Sigma. Note that it is assumed that the covariance matrix is identical for all vectors of observations (see Covariance Matrices: Testing Equality of). Moreover, it is assumed that the vectors of observations are stochastically independent.

To estimate the mean responses, the least squares principle (see Least Squares Estimation) can be employed. The least squares estimate, \hat\mu_i, of the mean response vector \mu_i corresponds to the vector of least squares estimates from separate univariate analysis of variance (ANOVA) models, that is, \hat\mu_i = (\hat\mu_{i1}, \hat\mu_{i2})^T, with \hat\mu_{i1} = (1/s)\sum_{j=1}^{s} y_{ij1} and \hat\mu_{i2} = (1/s)\sum_{j=1}^{s} y_{ij2}. In the rat example, the least squares estimates of the mean responses are \hat\mu_1 = (7, 7.25)^T, \hat\mu_2 = (7.5, 8.5)^T, and \hat\mu_3 = (14.75, 10.25)^T.

Also of interest in a MANOVA model is a formal test of zero group difference. This translates into testing the null hypothesis of equal mean responses at each level of the factor, that is,

H_0: \mu_1 = \mu_2 = \cdots = \mu_r.   (2)
In univariate one-way ANOVA, a test for equality of group means is derived by partitioning the total sum of squares into a component that is due to the factor in question and an error sum of squares (see Analysis of Variance). This principle can be generalized to the multivariate case to derive a test for the multivariate hypothesis H_0. The total sum-of-squares-and-products matrix is given by

SST = \sum_{i=1}^{r}\sum_{j=1}^{s} (y_{ij} - \bar{y}_{..})(y_{ij} - \bar{y}_{..})^T,   (3)

with \bar{y}_{..} the vector where each component is the arithmetic mean of all the observations of the corresponding quantitative variable. Note that SST is a symmetric p x p-dimensional matrix. The total sum-of-squares-and-products matrix SST can be partitioned into the sum-of-squares-and-products matrix according to the factor, say SSA, and the error sum-of-squares-and-products matrix, say SSE. It holds that

SSA = s \sum_{i=1}^{r} (\hat\mu_i - \bar{y}_{..})(\hat\mu_i - \bar{y}_{..})^T   (4)

and

SSE = SST - SSA.   (5)

The p x p error sum-of-squares-and-products matrix SSE is used to estimate the unknown covariance matrix \Sigma. The estimate is given by SSE divided by the so-called error degrees of freedom, r(s - 1). Note that the error degrees of freedom in MANOVA are identical to the error degrees of freedom in the univariate analysis of variance. An equivalent assertion holds for the degrees of freedom that correspond to the sum-of-squares-and-products matrix of the factor in question, here SSA. The degrees of freedom corresponding to SSA in MANOVA are (r - 1), which is the same value as in univariate analysis of variance. The degrees of freedom corresponding to a factor in question are often called hypothesis degrees of freedom.

In the rat example, the 2 x 2-dimensional matrices SSA and SSE are given as

SSA = \begin{pmatrix} 301 & 97.5 \\ 97.5 & 36.333 \end{pmatrix} \quad and \quad SSE = \begin{pmatrix} 109.5 & 98.5 \\ 98.5 & 147 \end{pmatrix}.   (6)

In MANOVA models, there are four statistics for testing H_0, namely, Pillai's Trace, Wilks' Lambda, the Hotelling-Lawley Trace, and Roy's Greatest Root. All four test statistics can be calculated from the eigenvalues \xi_1, \xi_2, \ldots, \xi_p (with \xi_1 \ge \xi_2 \ge \cdots \ge \xi_p) of the matrix product M of SSA and SSE^{-1}. Specifically, the test statistics are given by

Pillai's Trace:  P = \sum_{i=1}^{p} \frac{\xi_i}{1 + \xi_i},

Wilks' Lambda:  W = \prod_{i=1}^{p} \frac{1}{1 + \xi_i},

Hotelling-Lawley Trace:  HL = \sum_{i=1}^{p} \xi_i,

Roy's Greatest Root:  R = \xi_1.   (7)

In the rat example, the eigenvalues of M = (SSA)(SSE)^{-1} are \xi_1 = 4.4883 and \xi_2 = 0.0498, so that the values of the four test statistics read as follows:

Pillai's Trace:  P = \sum_{i=1}^{2} \frac{\xi_i}{1 + \xi_i} = 0.8652,

Wilks' Lambda:  W = \prod_{i=1}^{2} \frac{1}{1 + \xi_i} = 0.1735,

Hotelling-Lawley Trace:  HL = \sum_{i=1}^{2} \xi_i = 4.5381,

Roy's Greatest Root:  R = \xi_1 = 4.4883.   (8)
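For readers who want to reproduce these numbers, the following minimal sketch (assuming Python with NumPy, which is not part of the original entry) forms M = (SSA)(SSE)^{-1} from the matrices in (6) and evaluates the four statistics in (7):

```python
import numpy as np

# SSA and SSE for the rat example, as given in (6)
SSA = np.array([[301.0, 97.5],
                [97.5, 36.333]])
SSE = np.array([[109.5, 98.5],
                [98.5, 147.0]])

# Eigenvalues of M = SSA * SSE^{-1}, sorted in decreasing order
xi = np.linalg.eigvals(SSA @ np.linalg.inv(SSE)).real
xi = np.sort(xi)[::-1]

pillai = np.sum(xi / (1 + xi))        # about 0.865
wilks = np.prod(1 / (1 + xi))         # about 0.174
hotelling_lawley = np.sum(xi)         # about 4.538
roy = xi[0]                           # about 4.488

print(xi, pillai, wilks, hotelling_lawley, roy)
```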
When the number of quantitative variables p is only one – the univariate case – then all four test statistics yield equivalent results since the matrix product (SSA)(SSE )−1 reduces to the scalar value SSA/SSE that can be simply transformed to the usual F-statistic in univariate analysis of variance. Also, when the hypothesis degrees of freedom are exactly one, as in a simple comparison of a factor with only
two levels, all four statistics yield equivalent results. For a given number of groups, the critical values of the exact distributions of the four test statistics under H_0 can be found in various textbooks, see [3] or [4]. In software packages, however, the values of the test statistics are transformed into F values, and the distributions of these F values under H_0 are approximated by F distributions with suitable numerator and denominator degrees of freedom (see Catalogue of Probability Density Functions). In the case of p = 2 variables, the transformed Wilks' Lambda is exactly F-distributed. The F value of Roy's Greatest Root is, in general, an upper bound, so that the reported P value is a lower bound and the test tends to be overly optimistic.

Table 2 shows the results of the four multivariate tests for the rat example. All four tests indicate that the differences between the three drugs are statistically significant.

The one-way MANOVA model can be extended by including further factors additively. These factors can be nested factors, cross-classified factors, or interaction terms, and they usually reflect further structure in the different populations. As in univariate analysis of variance, the focus would be on determining whether there is evidence of any factorial main effects or interactions (see Analysis of Variance). On the basis of the eigenvalues of the matrix product of a hypothesis sum-of-squares-and-products matrix according to the factor under consideration and the inverse of the error sum-of-squares-and-products matrix, the four tests described above can be applied for each hypothesis test. Originally, Morrison [2] used the example to motivate a two-way MANOVA model with interactions.
Table 2  Multivariate tests in the hypothetical one-way MANOVA model

Statistic                 Value    F-value      Num DF(c)   Den DF(d)   Approx. P value
Pillai's Trace            0.865     8.006          4           42         <0.0001
Wilks' Lambda             0.174    14.004(a)       4           40         <0.0001
Hotelling-Lawley Trace    4.538    21.556          4           38         <0.0001
Roy's Greatest Root       4.488    47.127(b)       2           21         <0.0001

(a) F-statistic for Wilks' Lambda is exact. (b) F-statistic for Roy's Greatest Root is an upper bound. (c) Numerator degrees of freedom of the F distribution of the transformed statistic. (d) Denominator degrees of freedom of the F distribution of the transformed statistic.
Table 3  Multivariate tests in the two-way MANOVA model with interactions for the hypothetical example

Factor       Statistic                 Value    F-value      Num DF(c)   Den DF(d)   Approx. P value
Drug         Pillai's Trace            0.880     7.077          4           36          0.0003
             Wilks' Lambda             0.039    12.199(a)       4           34         <0.0001
             Hotelling-Lawley Trace    4.640    18.558          4           32         <0.0001
             Roy's Greatest Root       4.576    41.184(b)       2           18         <0.0001
Sex          Pillai's Trace            0.007     0.064(a)       2           17          0.938
             Wilks' Lambda             0.993     0.064(a)       2           17          0.938
             Hotelling-Lawley Trace    0.008     0.064(a)       2           17          0.938
             Roy's Greatest Root       0.008     0.064(a)       2           17          0.938
Drug x sex   Pillai's Trace            0.227     1.152          4           36          0.348
             Wilks' Lambda             0.774     1.159(a)       4           34          0.347
             Hotelling-Lawley Trace    0.290     1.159          4           32          0.346
             Roy's Greatest Root       0.284     2.554(b)       2           18          0.106

(a) F-statistic is exact. (b) F-statistic is an upper bound. (c) Numerator degrees of freedom of the F distribution of the transformed statistic. (d) Denominator degrees of freedom of the F distribution of the transformed statistic.
The first four rats in each drug group in Table 1 are male and the remainder female. The design, therefore, allows assessment of the main effect of the three drugs, the main effect of sex, and the interactions between drug and sex. The interactions between drug and sex would mean that the differences between male and female rat responses varied across the drugs. Table 3 shows the results from the original two-way analysis. Again, weight-loss differences between the drugs are statistically significant, whereas the main effect of factor sex and the interaction between drug and sex are not statistically significant. Since the comparison between male and female rats has exactly one hypothesis degree of freedom, the four multivariate tests yield equivalent results. The question that may now arise is which test statistic should be used. One criterion for choosing a test statistic is the power of the tests. Another criterion is the robustness of the tests when assumptions of the basic data distributions, here independent multivariate normally distributed vectors of observation with equal covariance matrix, are violated. Hand and Taylor [1] extensively discuss the choice of the test statistic in MANOVA models. Wilks’ Lambda is most popular in the literature and the statistic is a generalized likelihood ratio test statistic. Pillai’s Trace is recommended in [1] on the basis of simulation results for power and robustness of the multivariate tests. In standard (fixed effects) MANOVA, one is usually interested in analyzing the influence of the given
levels of the factors on the dependent variables, as described above. But sometimes, the levels of a factor are more appropriately considered a random sample from an (infinite) number of levels. This can be accommodated by treating the effects of the levels as random effects. In this case, besides the covariance matrix of the error term, further covariance matrices according to the random effects, also called multivariate variance components, contribute to the total covariance matrix of the dependent variables (for more details on random effects MANOVA, see [5]).
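As an illustration of how the two-way analysis summarized in Table 3 might be run in practice, here is a brief sketch assuming Python with pandas and the statsmodels MANOVA class (neither of which is part of the original entry); the column names week1, week2, drug, and sex are hypothetical labels for the layout described above, with the first four rats in each drug group being male:

```python
import pandas as pd
from statsmodels.multivariate.manova import MANOVA

# Weight losses from Table 1; within each drug the first four rats are male
data = {
    "A": [(5, 6), (5, 4), (9, 9), (7, 6), (7, 10), (6, 6), (9, 7), (8, 10)],
    "B": [(7, 6), (7, 7), (9, 12), (6, 8), (10, 13), (8, 7), (7, 6), (6, 9)],
    "C": [(21, 15), (14, 11), (17, 12), (12, 10), (16, 12), (14, 9), (14, 8), (10, 5)],
}
rows = []
for drug, pairs in data.items():
    for j, (w1, w2) in enumerate(pairs):
        rows.append({"drug": drug, "sex": "M" if j < 4 else "F",
                     "week1": w1, "week2": w2})
df = pd.DataFrame(rows)

# Two-way MANOVA with interaction; mv_test() reports Pillai, Wilks,
# Hotelling-Lawley, and Roy for each model term
fit = MANOVA.from_formula("week1 + week2 ~ drug * sex", data=df)
print(fit.mv_test())
```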
References

[1] Hand, D.J. & Taylor, C.C. (1987). Multivariate Analysis of Variance and Repeated Measures, Chapman & Hall, London.
[2] Morrison, D.F. (1976). Multivariate Statistical Methods, 2nd Edition, McGraw-Hill, New York.
[3] Rao, C.R. (2002). Linear Statistical Inference and its Application, 2nd Edition, John Wiley, New York.
[4] Seber, G.A.F. (1984). Multivariate Observations, John Wiley, New York.
[5] Timm, N.H. (2002). Applied Multivariate Analysis, Springer, New York.
(See also Multivariate Analysis: Overview; Repeated Measures Analysis of Variance) JOACHIM HARTUNG
AND
GUIDO KNAPP
Multivariate Genetic Analysis
NATHAN A. GILLESPIE AND NICHOLAS G. MARTIN
Volume 3, pp. 1363-1370
Multivariate Genetic Analysis

Bivariate Heritability

The genetic and environmental components of covariation can be separated by a technique which is analogous to factor analysis [8]. Just as structural equation modeling (SEM) can be used to analyze the components of variance influencing a single variable in the case of univariate analysis, SEM can also be used to analyze the sources and structure of covariation underlying multiple variables [8]. When based on genetically informative relatives such as twins, this methodology allows researchers to estimate the extent to which genetic and environmental influences are shared in common by several traits or are trait specific. Information not only comes from the covariance between the variables but also from the cross-twin cross-trait covariances. More precisely, a larger cross-twin cross-trait correlation between monozygotic twin pairs as compared with dizygotic twin pairs suggests that covariance between the variables is partially due to genetic factors. A typical starting point in bivariate and multivariate analysis is the Cholesky decomposition. Other methods include common and independent genetic pathway models, as well as genetic simplex models, which will also be discussed.

Cholesky Decomposition

The most commonly used multivariate technique in the Classical Twin design (see Twin Designs) is the Cholesky decomposition. As shown in Figure 1, the Cholesky is a method of triangular decomposition where the first variable (y1) is assumed to be caused by a latent factor (see Latent Variable) (eta1) that can explain the variance in the remaining variables (y2, ..., yn). The second variable (y2) is assumed to be caused by a second latent factor (eta2) that can explain variance in the second as well as the remaining variables (y2, ..., yn). This pattern continues until the final observed variable (yn) is explained by a latent variable (etan), which is constrained from explaining the variance in any of the previous observed variables. A Cholesky decomposition is specified for each latent source of variance A, D, C, or E, and as in the univariate case, ACE, ADE, AE, DE, CE, and E models are fitted to the data (see ACE Model). The expected variance-covariance matrix in the Cholesky decomposition is parameterized in terms of n latent factors (where n is the number of variables). All variables load on the first latent factor, n - 1 variables load on the second factor, and so on, until the final variable loads on the nth latent factor only. Each source of phenotypic variation (i.e., A, C or D, and E) is parameterized in the same way. Therefore, the full factor Cholesky does not distinguish between common factor and specific factor variance and does not estimate a specific factor effect for any variable except the last.

Figure 1  Multivariate Cholesky triangular decomposition; y1, ..., yn = observed phenotypic variables, eta1, ..., etan = latent factors
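To make the parameterization concrete, the sketch below (Python with NumPy; an illustration under simplifying assumptions, not code from the original entry) builds the expected twin-pair covariance matrix for a bivariate AE Cholesky model, in which each source of variance is written as a lower-triangular path matrix times its transpose:

```python
import numpy as np

# Hypothetical lower-triangular Cholesky paths for A and E (2 variables)
La = np.array([[0.8, 0.0],
               [0.4, 0.5]])   # additive genetic paths
Le = np.array([[0.6, 0.0],
               [0.1, 0.7]])   # nonshared environmental paths

Sigma_A = La @ La.T            # genetic covariance component
Sigma_E = Le @ Le.T            # nonshared environmental component
Sigma_P = Sigma_A + Sigma_E    # phenotypic covariance for one twin

def twin_pair_cov(r_a):
    """Expected 4x4 covariance for a twin pair.

    r_a is the genetic correlation between twins: 1.0 for MZ, 0.5 for DZ.
    A is shared between twins in proportion r_a; E is uncorrelated.
    """
    top = np.hstack([Sigma_P, r_a * Sigma_A])
    bottom = np.hstack([r_a * Sigma_A, Sigma_P])
    return np.vstack([top, bottom])

print(twin_pair_cov(1.0))   # MZ pairs
print(twin_pair_cov(0.5))   # DZ pairs
```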
Table 1  Phenotypic factor correlations for the factor analytic dimensions of depression, phobic anxiety, and somatic distress. Female correlations (n = 2219) appear above the diagonal and male correlations (n = 1418) below the diagonal (reproduced from [3, p. 455])

                      1       2       3
1 Depression          -      0.64    0.52
2 Phobic anxiety     0.63     -      0.58
3 Somatic distress   0.54    0.58     -
Table 2  Univariate model-fitting for the factor analytic dimensions of depression, phobic anxiety, and somatic distress. The table includes standardized proportions of variance attributable to genetic and environmental effects (reproduced from [4, p. 1056])

Model             A      C      E      -2LL         df      change in -2LL   change in df    p
Depression
  ACE            0.33   0.00   0.67   10 720.59    7987
  AE             0.33    -     0.67   10 720.59    7988        0.05               1          .82
  CE              -     0.24   0.76   10 729.87    7988        9.28               1          b
  E               -      -     1.00   10 791.02    7989       70.44               2          c
Phobic anxiety
  ACE            0.37   0.03   0.59    7968.50     7979
  AE             0.41    -     0.59    7968.62     7980        0.11               1          .74
  CE              -     0.30   0.70    7976.54     7980        8.04               1          b
  E               -      -     1.00    8051.14     7981       82.64               2          c
Somatic distress
  ACE            0.11   0.17   0.72    9563.28     7969
  AE             0.32    -     0.68    9566.50     7970        3.22               1          .07
  CE              -     0.25   0.75    9564.08     7970        0.79               1          .37
  E               -      -     1.00    9624.57     7971       61.29               2          c

Note: A, C & E = additive genetic, shared/common environment, and nonshared environment variance. Results based on maximum likelihood. -2LL = -2 log-likelihood. a p < .05, b p < .01, c p < .001.

Although symptoms of fatigue and somatic distress are frequently comorbid with anxiety and depressive disorders, a number of studies [6, 7, 14] have demonstrated that a significant proportion of patients with somatic disorders do not meet the criteria for other psychological disorders. As an example of how the Cholesky decomposition can be used
to resolve these sorts of questions, we administered self-report measures of anxiety, depression, and somatic distress to a community-based sample of 3469 Australian twin individuals aged 18 to 28 years. As shown in Table 1, the phenotypic correlations between somatic distress, depression, and phobic anxiety, for males and females alike, are all high. The results for the univariate genetic analyses in Table 2 reveal that an additive genetic (see Additive Genetic Variance) and nonshared environmental effects model best explains individual differences in depression and phobic anxiety scores, for male and female twins alike. The same could not be said for somatic distress because there is insufficient power to choose between additive genetic or shared environment effects as the source of familial aggregation in somatic distress. This limitation can be overcome using multivariate genetic analysis, which has greater power to detect genetic and environmental effects by making use of all the covariance terms between variables. Moreover, it will allow us to determine whether somatic distress is etiologically distinct from self-report measures of depression and anxiety.
As shown in Table 3, an additive genetic and nonshared environment (AE) model best explained the sources of covariation between the three factors. This is illustrated in Figure 2, where 33% (i.e., 0.32^2/(0.43^2 + 0.15^2 + 0.32^2)) of the genetic variance in somatic distress is due to specific gene action unrelated to depression or phobic anxiety. In addition, 74% of the individual environmental influence on somatic distress is unrelated to depression and phobic anxiety. These results support previous findings that somatic symptoms are partly etiologically distinct, both genetically and environmentally, from the symptoms of anxiety and depression.
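The quoted proportion is simple to verify from the standardized genetic paths reported for somatic distress in Figure 2 (a minimal sketch, using only the three path values cited in the text above):

```python
# Genetic paths into somatic distress from A1, A2, and A3 (Figure 2)
a1, a2, a3 = 0.43, 0.15, 0.32

specific_share = a3**2 / (a1**2 + a2**2 + a3**2)
print(round(specific_share, 2))  # about 0.33
```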
Common and Independent Genetic Pathway Models

Alternate multivariate methods can be used to estimate common factor and specific factor variance (see Factor Analysis: Exploratory). For instance, the common pathway model in Figure 3a assumes that the genetic and environmental effects (A, C and E) (see ACE Model) contribute to one or more latent
Table 3  Multivariate Cholesky decomposition model-fitting results. Results are based on combined male and female data adjusted for sex differences in the prevalence of depression, phobic anxiety, and somatic distress (reproduced from [4, p. 1056])

Model    -2LL         df        change in -2LL   change in df    p
ACE      26 000.29    15 285
AE       26 003.82    15 291        3.53               6         .74
CE       26 015.20    15 291       14.91               6         a
E        26 147.97    15 297      147.68              12         c

Note: A, C & E = additive genetic, shared/common environment, and nonshared environment variance. Results based on maximum likelihood. -2LL = -2 log-likelihood. a p < .05, b p < .01, c p < .001.
Figure 2  Path diagram showing standardized path coefficients and 95% confidence intervals for the latent genetic (A1 to A3) and environmental (E1 to E3) effects (reproduced from [4], p. 1057)
intervening variables (η), which in turn are responsible for the observed patterns of covariance between symptoms (y1 , . . . , yn ). This is in contrast to the independent pathway model in Figure 3b, which predicts that genes and environment have different effects on the covariance between symptoms. Because it can be shown algebraically that the common pathway is nested within the independent pathway model, the two models can
be compared using a likelihood ratio chi-squared statistic (see Goodness of Fit). Parker’s 25-item Parental Bonding Instrument (PBI) [13] was designed to measure maternal and paternal parenting along the dimensions of Care and Overprotection [11, 12]. However, factor analysis of the short 14-item version based on 4514 females, aged 18 to 45, has yielded three correlated factors: Autonomy, Coldness, and Overprotection [5].
Figure 3  Common (a) and independent (b) genetic pathway models; eta = common factor, y1, ..., yn = observed phenotypic variables, A, C & E (common) = latent genetic and environmental common factors, A, C & E (residual) = latent genetic and environmental residual factors

Table 4  Best fitting univariate models with standardized proportions of variance attributable to genetic and environmental variance (reproduced from [5, p. 390])

PBI dimension      A      C      E     -2LL       df
Coldness          .61     -     .39   5979.01    3603
Overprotection    .22    .24    .54   7438.84    3602
Autonomy          .33    .17    .51   9216.17    3599

Note: A, C & E = additive genetic, shared/common environment, and nonshared environment variance. Results based on maximum likelihood. -2LL = -2 log-likelihood.
Table 5  Comparison of the common and independent pathway models for the PBI dimensions of Autonomy, Coldness, and Overprotection [5]

Model                   chi-square    df      p       AIC
Independent pathway        9.17       15     .87    -20.83
Common pathway            13.39       19     .82    -24.61

Note: Results based on weighted least squares. AIC = Akaike Information Criterion.
Univariate analyses of the three dimensions, which are summarized in Table 4, reveal that variation in parental Overprotection and Autonomy can be best explained by additive genetic, shared, and nonshared environmental effects, whereas the best fitting model for Coldness includes additive genetic and nonshared environmental effects. As is shown in Table 5, when compared to an independent pathway model,
a common pathway genetic model provided a more parsimonious fit to the three PBI dimensions. The common pathway model is illustrated in Figure 4.
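The comparison in Table 5 can be reproduced with a few lines (a sketch assuming Python with SciPy, not part of the original entry; the AIC convention used here, chi-square minus twice the degrees of freedom, is the one that matches the tabled values):

```python
from scipy.stats import chi2

# Model chi-square statistics and degrees of freedom from Table 5
chi_ind, df_ind = 9.17, 15      # independent pathway
chi_com, df_com = 13.39, 19     # common pathway (nested in the former)

# Chi-square difference test for the nested models
delta_chi = chi_com - chi_ind
delta_df = df_com - df_ind
p_value = chi2.sf(delta_chi, delta_df)

# AIC = chi-square minus twice the degrees of freedom
aic_ind = chi_ind - 2 * df_ind   # -20.83
aic_com = chi_com - 2 * df_com   # -24.61

print(delta_chi, delta_df, round(p_value, 2), aic_ind, aic_com)
```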
Genetic Simplex Modeling

When genetically informative longitudinal data are available, a multivariate Cholesky can again be
Figure 4  Common pathway genetic model (saturated) for the PBI dimensions with standardized proportions of variance and 95% confidence intervals (reproduced from [5, p. 391]); AUTO = Autonomy, OVERP = Overprotection, COLD = Coldness. Results based on weighted least squares
fitted to determine the extent to which genetic and environmental influences are shared in common by a trait measured at different time points. However, this approach is limited in so far as it does not take full advantage of the time-series nature of the data, that is, that causation is unidirectional through time [1]. One solution is to fit a simplex model, which explicitly takes into account the longitudinal nature of the data. As shown in Figure 5, simplex models are autoregressive, whereby the genetic and environmental latent variables at time i are causally related to the immediately preceding latent variables (ηi−1 ). Eta (ηi ) is a latent variable (i.e., A, C, E, or D) at time i, βi is the regression of the latent factor on
the immediately preceding latent factor eta_{i-1}, and zeta_i is the new input or innovation at time i. When using data from MZ and DZ twin pairs, structural equations can be specified for additive genetic sources of variation (A), common environmental sources (C), nonadditive genetic sources of variation such as dominance or epistasis (D), and unique environmental sources of variation (E). Because measurement error does not influence observed variables at subsequent time points, simplex designs permit discrimination between transient factors affecting measurement at one time point only and factors that are continuously present or exert a long-term influence throughout the time series [1, 9]. Although denoted as error variance, the error parameters will also include variance attributable to short-term nonshared environmental effects. We have used this model-fitting approach to investigate the stability and magnitude of genetic and environmental effects underlying major dimensions of adolescent personality across time [2]. The Junior Eysenck Personality Questionnaire (JEPQ) was administered to over 540 twin pairs at ages 12, 14, and 16 years. Results for JEPQ Neuroticism are presented here.

Figure 5  General simplex model; beta = regression of the latent factor on the previous latent factor, zeta = new input or innovation at time i, eta = latent variable at time i
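The autoregressive structure just described implies a simple recursion for the latent variances and covariances; the sketch below (Python with NumPy; the path and innovation values are hypothetical, not estimates from the article) builds the additive genetic covariance matrix implied by a three-wave simplex:

```python
import numpy as np

beta = [0.9, 0.7]          # regressions of A_t on A_{t-1} (hypothetical)
innov = [1.0, 0.3, 0.2]    # variances of the innovations zeta_1..zeta_3

n_t = len(innov)
cov_A = np.zeros((n_t, n_t))
cov_A[0, 0] = innov[0]
for t in range(1, n_t):
    # variance recursion: var(A_t) = beta_t^2 * var(A_{t-1}) + var(zeta_t)
    cov_A[t, t] = beta[t - 1] ** 2 * cov_A[t - 1, t - 1] + innov[t]
    # covariances with earlier occasions are carried forward through beta_t
    for s in range(t):
        cov_A[t, s] = cov_A[s, t] = beta[t - 1] * cov_A[t - 1, s]

print(cov_A)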
Table 6  Additive genetic (above diagonal) and nonshared environmental (below diagonal) latent factor correlations for JEPQ Neuroticism (reproduced from [2])

                     Females                     Males
                1       2       3           1       2       3
1  12 years     -      0.76    0.68         -      0.86    0.79
2  14 years    0.40     -      0.94        0.24     -      0.74
3  16 years    0.36    0.41     -          0.12    0.53     -
Table 7  Multivariate model-fitting results for JEPQ Neuroticism based on twins aged 12, 14, and 16 years (reproduced from [2])

                            Females                                              Males
Model           -2LL        df      change    change   p         -2LL        df      change    change   p
                                    in -2LL   in df                                  in -2LL   in df
Cholesky ACE   10 424.88   1803                                  10 016.48   1753
Simplex ACE    10 424.95   1805     0.07        2      .96       10 016.88   1755     0.40        2      .82
AE             10 425.34   1810     0.39        5     1.00       10 021.26   1760     4.38        5      .50
Drop zeta_a3   10 426.57   1811     1.23        1      .27       10 058.59   1761    37.32        1      c
CE             10 432.45   1810     7.50        5      .19       10 032.32   1760    15.44        5      b
E              10 663.09   1815   238.14       10      c         10 650.99   1765   634.11       10      c

Note: Results based on maximum likelihood. zeta_a3 = genetic innovation at time 3. Best fitting models in bold in the original.
Figure 6  Best fitting genetic simplex model for female and male Neuroticism (reproduced from [2]); N12-N16 = Neuroticism at 12 to 16 years, A1-A3, E1-E3 = additive genetic and nonshared environmental effects, zeta_a1-zeta_a3, zeta_e1-zeta_e3 = additive genetic and nonshared environmental innovations, epsilon_1-epsilon_3 = error parameters at 12 to 16 years; double-/single-headed arrows = variance components/path coefficients

The additive genetic factor correlations based on a Cholesky decomposition are shown in Table 6. These reveal that the latent additive genetic factors are
highly correlated. This is consistent with a pleiotropic model of gene action, whereby the same genes explain variation across different time points. As shown in Table 7, the ACE simplex models provided a better explanation of the Neuroticism time-series data, insofar as their fit was no worse than that of the corresponding Cholesky decompositions. The final best-fitting AE simplex models for male and female Neuroticism are shown in Figure 6. It is difficult to imagine that genetic variation in personality is completely determined by age 12. As shown in Figure 6, smaller genetic innovations are observed for male Neuroticism at 14 and 16, as well as for female Neuroticism at 14. These smaller genetic innovations potentially hint at age-specific genetic effects related to developmental or hormonal changes during puberty and psychosexual development. When data are limited to three time points, a common genetic factor model will also provide a fit comparable to that of the genetic simplex model. Other possible modeling strategies include biometric growth models (see [10]). Despite these limitations, time-series data, even when based on only three time points, still provide an opportunity to test explicit hypotheses of genetic continuity. Moreover, the same data are ideal for fitting univariate and multivariate linkage models to detect quantitative trait loci of significant effect. The above sections have provided an introduction to bivariate and multivariate analyses and how these methods can be used to estimate the genetic and environmental covariance between phenotypic measures. This should give the reader an appreciation for the flexibility of SEM approaches to address more complicated questions beyond univariate decompositions. For a more detailed treatment of this subject, see [9].
References

[1] Boomsma, D.I., Martin, N.G. & Molenaar, P.C. (1989). Factor and simplex models for repeated measures: application to two psychomotor measures of alcohol sensitivity in twins, Behavior Genetics 19, 79-96.
[2] Gillespie, N.A., Evans, D.E., Wright, M.M. & Martin, N.G. (submitted). Genetic simplex modeling of Eysenck's dimensions of personality in a sample of young Australian twins, Journal of Child Psychology and Psychiatry.
[3] Gillespie, N., Kirk, K.M., Heath, A.C., Martin, N.G. & Hickie, I. (1999). Somatic distress as a distinct psychological dimension, Social Psychiatry and Psychiatric Epidemiology 34, 451-458.
[4] Gillespie, N.G., Zhu, G., Heath, A.C., Hickie, I.B. & Martin, N.G. (2000). The genetic aetiology of somatic distress, Psychological Medicine 30, 1051-1061.
[5] Gillespie, N.G., Zhu, G., Neale, M.C., Heath, A.C. & Martin, N.G. (2003). Direction of causation modeling between measures of distress and parental bonding, Behavior Genetics 33, 383-396.
[6] Hickie, I.B., Lloyd, A., Wakefield, D. & Parker, G. (1990). The psychiatric status of patients with chronic fatigue syndrome, British Journal of Psychiatry 156, 534-540.
[7] Kroenke, K., Wood, D.R., Mangelsdorff, D., Meier, N.J. & Powell, J.B. (1988). Chronic fatigue in primary care, Journal of the American Medical Association 260, 926-934.
[8] Martin, N.G. & Eaves, L.J. (1977). The genetical analysis of covariance structure, Heredity 38, 79-95.
[9] Neale, M.C. & Cardon, L.R. (1992). Methodology for Genetic Studies of Twins and Families, NATO ASI Series, Kluwer Academic Publishers, Dordrecht.
[10] Neale, M.C. & McArdle, J.J. (2000). Structured latent growth curves for twin data, Twin Research 3, 165-177.
[11] Parker, G. (1989). The parental bonding instrument: psychometric properties reviewed, Psychiatric Developments 4, 317-335.
[12] Parker, G. (1990). The parental bonding instrument. A decade of research, Social Psychiatry and Psychiatric Epidemiology 25, 281-282.
[13] Parker, G., Tupling, H. & Brown, L.B. (1979). A parental bonding instrument, British Journal of Medical Psychology 52, 1-10.
[14] Wessely, S. & Powell, R. (1989). Fatigue syndromes: a comparison of chronic postviral with neuromuscular and affective disorders, Journal of Neurology, Neurosurgery, and Psychiatry 52, 940-948.
NATHAN A. GILLESPIE AND NICHOLAS G. MARTIN
Multivariate Multiple Regression
JOACHIM HARTUNG AND GUIDO KNAPP
Volume 3, pp. 1370-1373
Multivariate Multiple Regression

The multivariate linear model is used to explain and to analyze the relationship between one or more explanatory variables and p > 1 quantitative dependent or response variables that have been observed for each of n subjects. In case all the explanatory variables are qualitative, the multivariate linear model is called the multivariate analysis of variance (MANOVA) model. When all the explanatory variables are quantitative, that is, when a multivariate system of quantitative variables is given in which the relationships between p dependent quantitative variables and, say, q independent quantitative variables are of interest, the model is referred to as a multivariate regression model. Multivariate regression analysis is used to investigate these relationships when the p dependent variables are correlated. In contrast, when the dependent variables are uncorrelated, the relationships can be assessed by carrying out p univariate regression analyses (see Regression Models). Often, there is an implied predictive aim in the investigation, and the formulation of appropriate and parsimonious relationships among the variables is a necessary prerequisite.

Consider the small example given in [2]. The data in Table 1 show four measurements: chest circumference (CC), midupper arm circumference (MUAC), height, and age (in months), for a sample of nine girls. One practical objective would be to develop a predictive model for CC and MUAC from knowledge of height and age.
The dependent variables CC and MUAC are highly correlated with each other (Pearson's correlation coefficient is 0.77), so they should be incorporated in a single multivariate regression model for maximum efficiency, as multiple regression analyses for each variable separately would ignore this correlation in the construction of hypothesis tests or confidence intervals. Let y_{i1} denote the CC of the ith girl, y_{i2} the MUAC, x_{i1} the height, and x_{i2} the age; then, the univariate regression models for each variable are

y_{i1} = \beta_{01} + \beta_{11} x_{i1} + \beta_{21} x_{i2} + e_{i1}, \quad i = 1, \ldots, n = 9,   (1)

and

y_{i2} = \beta_{02} + \beta_{12} x_{i1} + \beta_{22} x_{i2} + e_{i2}, \quad i = 1, \ldots, n = 9.   (2)

In the multivariate case, the observations y_{i1} and y_{i2} are put in a row vector, so that the model has the form

(y_{i1} \; y_{i2}) = (1 \; x_{i1} \; x_{i2}) \begin{pmatrix} \beta_{01} & \beta_{02} \\ \beta_{11} & \beta_{12} \\ \beta_{21} & \beta_{22} \end{pmatrix} + (e_{i1} \; e_{i2}), \quad i = 1, \ldots, n = 9.   (3)
In general, the model for the observed data can be written in matrix form as Y = XB + E
(4)
where Y is the n × p matrix whose ith row contains the values of the dependent variables for the ith subject, X is the n × (q + 1)-matrix, also called the
Table 1  Four measurements taken on each of nine young girls

Individual   Chest circumference (cm), y1   Midupper arm circumference (cm), y2   Height (cm), x1   Age (months), x2
1            58.4                           14.0                                  80                21
2            59.2                           15.0                                  75                27
3            60.3                           15.0                                  78                27
4            57.4                           13.0                                  75                22
5            59.5                           14.0                                  79                26
6            58.1                           14.5                                  78                26
7            58.0                           12.5                                  75                23
8            55.5                           11.0                                  64                22
9            59.2                           12.5                                  80                22
regression matrix, whose ith row contains a one and the values of the explanatory variables for the ith subject, B is the (q + 1) x p matrix of unknown regression parameters, and E is an n x p matrix of random variables whose rows are independent observations from a multivariate normal distribution with mean zero and covariance matrix \Sigma. Note that it is assumed that the covariance matrix is identical for each row of E. The assumption of multivariate normality is only necessary when tests or confidence regions have to be constructed for the parameters or the predicted values. For other aspects, it is sufficient to assume that the rows of E are uncorrelated and have mean zero and covariance matrix \Sigma.

The first step in multivariate regression analysis is to fit a model to the observed data. This requires the estimation of the unknown parameters in B and \Sigma, which can be done by maximum likelihood when normality is assumed or by the least-squares approach (see Least Squares Estimation) when no distributional assumptions are made. For the parameter matrix B, both approaches lead to the same estimate, so that the actual assumptions are not critical to the outcome of the analysis. For standard application of the theory, it is necessary that the number of subjects n is greater than p + q, the total number of quantitative variables. Furthermore, the matrix X has to be of full rank (q + 1) so that the inverse (X^T X)^{-1} exists. When these conditions are met, the estimator \hat{B} of the parameter matrix B is given by

\hat{B} = (X^T X)^{-1} X^T Y.   (5)

Note that the regression matrix X is identical in all the corresponding univariate regression analyses. Consequently, the columns of the estimated parameter matrix \hat{B} are the estimators that would be obtained from the corresponding univariate regression analyses. The estimator of the covariance matrix \Sigma is built from the matrix of residuals

\hat{E} = Y - \hat{Y} = Y - X\hat{B}.   (6)
The maximum likelihood estimator of \Sigma is given as \hat\Sigma_{ML} = \hat{E}^T \hat{E}/n, while an unbiased estimator of \Sigma is \hat\Sigma_{unbiased} = \hat{E}^T \hat{E}/(n - q - 1). Details of the maximum likelihood derivation can be found in [3], while [4] covers the least-squares approach.
In the example, the estimated parameter matrix is given by

\hat{B} = \begin{pmatrix} 36.7774 & -5.5258 \\ 0.2042 & 0.1405 \\ 0.2545 & 0.3477 \end{pmatrix}.   (7)

The residual sum-of-squares-and-products matrix is given by

\hat{E}^T \hat{E} = \begin{pmatrix} 2.4049 & -0.3558 \\ -0.3558 & 2.8713 \end{pmatrix},   (8)

so that the maximum likelihood estimate of \Sigma has the form

\hat\Sigma_{ML} = \frac{1}{9} \hat{E}^T \hat{E} = \begin{pmatrix} 0.2672 & -0.0395 \\ -0.0395 & 0.3190 \end{pmatrix},   (9)

while the unbiased estimator of \Sigma is

\hat\Sigma_{unbiased} = \frac{1}{6} \hat{E}^T \hat{E} = \begin{pmatrix} 0.4008 & -0.0593 \\ -0.0593 & 0.4786 \end{pmatrix}.   (10)

When the above assumptions of the general multivariate regression model are fulfilled, inference about the model parameters is possible. In analogy to univariate regression, the first test of interest is of the significance of the regression: do the explanatory variables help at all in predicting the dependent variables, or would simple prediction from the unconditional mean of Y be just as accurate? Formally, the null hypothesis can be formulated that all the parameters of B that multiply any of the explanatory variables are zero. Under this null hypothesis, the mean of Y is 1\mu^T, where \mu is the population mean vector of the dependent variables and 1 is an n-vector of ones. The maximum likelihood estimator of \mu is \bar{y}, the sample mean vector of the dependent variables. Then, the maximum likelihood estimate of \Sigma under the null hypothesis can be written as n\tilde\Sigma = Y^T Y - n\bar{y}\bar{y}^T. This total sum-of-squares-and-products matrix can be decomposed as

(Y^T Y - n\bar{y}\bar{y}^T) = (\hat{Y}^T \hat{Y} - n\bar{y}\bar{y}^T) + \hat{E}^T \hat{E}.   (11)
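The estimates in (7) to (10) can be reproduced from the data in Table 1 with a short sketch (assuming Python with NumPy, which is not part of the original entry); the printed values should agree with the text up to rounding:

```python
import numpy as np

# Data from Table 1: CC, MUAC (responses); height, age (explanatory)
Y = np.array([[58.4, 14.0], [59.2, 15.0], [60.3, 15.0], [57.4, 13.0],
              [59.5, 14.0], [58.1, 14.5], [58.0, 12.5], [55.5, 11.0],
              [59.2, 12.5]])
height = [80, 75, 78, 75, 79, 78, 75, 64, 80]
age = [21, 27, 27, 22, 26, 26, 23, 22, 22]

n, p = Y.shape
q = 2
X = np.column_stack([np.ones(n), height, age])   # n x (q + 1) regression matrix

B_hat = np.linalg.solve(X.T @ X, X.T @ Y)        # equation (5)
E_hat = Y - X @ B_hat                            # residual matrix, equation (6)
EtE = E_hat.T @ E_hat                            # equation (8)

Sigma_ml = EtE / n                               # equation (9)
Sigma_unbiased = EtE / (n - q - 1)               # equation (10)

print(B_hat)
print(Sigma_ml)
print(Sigma_unbiased)
```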
Writing H = \hat{Y}^T \hat{Y} - n\bar{y}\bar{y}^T, the decomposition of the total sum-of-squares-and-products matrix can be summarized in a table just like in MANOVA models (see Table 2). The null hypothesis of no relationship between the explanatory and the dependent variables can be tested by any of the four multivariate test statistics,
Table 2  MANOVA table for multivariate regression

Source                    Sum-of-squares-and-products matrix       Degrees of freedom
Multivariate regression   H                                        q
Residual                  \hat{E}^T \hat{E}                        n - q - 1
Total                     Y^T Y - n\bar{y}\bar{y}^T                n - 1
namely, Pillai’s Trace, Wilks’ Lambda, the HotellingLawley Trace, and Roy’s Greatest Root. All the test statistics are based on the eigenvalues of the ˆ −1 H. (For more details see ˆ T E) matrix product (E Multivariate Analysis of Variance). Having established that a relationship does exist between the dependent variables and the explanatory variables for predictive purposes, the next objective often is to derive the most parsimonious relationship, that is, the one that has good predictive accuracy while involving the fewest number of explanatory variables. In this context, the null hypothesis to test is that a specified subset of explanatory variables provides no additional predictive information when the remaining explanatory variables are included in the model. Suppose that the null hypothesis is that the last q2 explanatory variables have no predictive information when the first q1 explanatory variables are included in the model, q1 + q2 = q. Then, the sum-of-squares-and-products matrix H can be partitioned in H = H(q1 ) + H(q2 |q1 ) . The matrix H(q1 ) is ˜ T E, ˜ where E ˜ is the residgiven by H(q1 ) = T − E ual matrix when only the first q1 explanatory variables are included in the multivariate regression, Table 3 Variable Height
Age
a
T is the total sum-of-squares-and-product matrix, and H(q2 |q1 ) = H − H(q1 ) . The tests statistics are then based on the eigenvalues of the matrix product ˆ −1 H(q2 |q1 ) . ˆ T E) (E In statistical software packages, usually the test of the null hypothesis that a specified explanatory variable has no predictive information when all the other explanatory variables are included in the multivariate regression model is reported by default. Since the hypothesis degrees of freedom are exactly one in this situation, the four multivariate tests yield equivalent results (for details see Multivariate Analysis of Variance). Applying the multivariate tests to the example, the results put together in Table 3 are obtained. The analysis indicates that both explanatory variables height and age significantly affect the two dependent variables after adjusting for the effect of the other variable. The fit of a model to a set of data involves a number of assumptions. These assumptions can relate either to the systematic part of the model, that is, to the form of the relationship between explanatory and dependent variables, or to the random part of the model, that is, to the distribution of the dependent variables. In a multivariate regression model, the ˆ contains information that can matrix of residuals E be used to assess the adequacy of the model. If the model is correct and all distributional assumptions are ˆ will look approximately like valid, then the rows of E independent observations from a multivariate normal distribution with mean 0 and covariance matrix . A detailed discussion of procedures for checking model adequacy is provided in [1] and [2].
Multivariate tests on the predictive information of height and age in the example Statistic
Value
F Value
Num DFb
Den DFc
Approx. P value
Pillai’s Trace Wilks’ Lambda Hotelling–Lawley Trace Roy’s Greatest Root Pillai’s Trace Wilks’ Lambda Hotelling–Lawley Trace Roy’s Greatest Root
0.839 0.161 5.193 5.193 0.785 0.215 3.659 3.659
12.981a 12.981a 12.981a 12.981a 9.148a 9.148a 9.148a 9.148a
2 2 2 2 2 2 2 2
5 5 5 5 5 5 5 5
0.0105 0.0105 0.0105 0.0105 0.0213 0.0213 0.0213 0.0213
F statistic is exact. Numerator degrees of freedom of the F distribution of the transformed statistic. c Denominator degrees of freedom of the F distribution of the transformed statistic. b
References

[1] Gnanadesikan, R. (1997). Methods for Statistical Data Analysis of Multivariate Observations, 2nd Edition, John Wiley, New York.
[2] Krzanowski, W.J. (2000). Principles of Multivariate Analysis: A User's Perspective, Oxford University Press, Oxford.
[3] Mardia, K.V., Kent, J.T. & Bibby, J.M. (1995). Multivariate Analysis, Academic Press, London.
[4] Rao, C.R. (2002). Linear Statistical Inference and its Application, 2nd Edition, John Wiley, New York.
(See also Multivariate Analysis: Overview) JOACHIM HARTUNG
AND
GUIDO KNAPP
Multivariate Normality Tests
H.J. KESELMAN
Volume 3, pp. 1373-1379
Multivariate Normality Tests Behavioral science researchers frequently gather multiple measures in their research endeavors. That is, for many areas of investigation in order to capture the phenomenon under investigation numerous measures are gathered. For example, Harasym et al. [6] obtained scores on multiple personal traits using the Myers-Briggs inventory for nursing students with three different styles of learning. Knapp and Miller [11] discuss the measurement of multiple dimensions of healthcare quality as part of a system-wide evaluation program. Indeed, as Tabachnick and Fidell [29, p. 3] note, ‘Experimental research designs with multiple DVs (dependent variables) were unusual at one time. Now, however, with attempts to make experimental designs more realistic,. . ., experiments often have several DVs. It is dangerous to run an experiment with only one DV and risk missing the impact of the IV (independent variable) because the most sensitive DV is not measured’. The assumption of multivariate normality (see Catalogue of Probability Density Functions) underlies many multivariate techniques, for example, multivariate analysis of variance, discriminant analysis, canonical correlation and maximum likelihood factor analysis (see Factor Analysis: Exploratory), to name but a few. That is to say, many areas of multivariate analysis involve parameter estimation and tests of significance and the validity of these procedures require that the data follow a multivariate normal distribution. As Timm [30, p. 118] notes, ‘The study of robust estimation for location and dispersion of model parameters, the identification of outliers, the analysis of multivariate residuals, and the assessment of the effects of model assumptions on tests of significance and power are as important in multivariate analysis as they are in univariate analysis.’ In the remainder of this paper, I review methods for assessing the validity of the multivariate normality assumption.
Multivariate Normality

Multivariate normality is most easily described by generalizing univariate normality. In particular, the normal density function for a random variable y, having mean \mu and variance \sigma^2, is

f(y) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-(y-\mu)^2/(2\sigma^2)}.   (1)

Notationally, when y has the density given in (1) we can write y ~ N(\mu, \sigma^2). As is well known, the density function is depicted by a bell-shaped curve. Now if y has a multivariate normal distribution with mean \mu and covariance \Sigma, its density is given as

g(y) = \frac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}}\, e^{-(y-\mu)^T \Sigma^{-1} (y-\mu)/2},   (2)

where p is the number of variables, ^T is the transpose operator, and \Sigma^{-1} is the inverse of \Sigma. Again, notationally, we can say that when y has the density given in (2), y is ~ N_p(\mu, \Sigma). Note that the p x p population covariance matrix \Sigma is the generalization of \sigma^2, and the p x 1 mean vector \mu is the generalization of the scalar mean \mu. The graph of a particular case of (2) is a (p + 1)-dimensional bell-shaped surface.

Another correspondence between multivariate and univariate normality is important to note. In the univariate case, the expression (y - \mu)^2/\sigma^2 = (y - \mu)(\sigma^2)^{-1}(y - \mu) in the exponent of (1) is a distance statistic, that is, it measures the squared distance from y to \mu in \sigma^2 units. Similarly, (y - \mu)^T \Sigma^{-1} (y - \mu) in the exponent of (2) is the squared generalized distance from y to \mu. This squared distance is also referred to as Mahalanobis's distance statistic, and a sample estimate (D^2) can be obtained from

D^2 = (y - \bar{y})^T S^{-1} (y - \bar{y}),   (3)

where \bar{y} is the p x 1 vector of sample means and S is the p x p sample covariance matrix (see [8], p. 55).

Properties of Multivariate Normal Distributions

The multivariate normal distribution has a number of important properties that bear on how one can assess whether data are indeed multivariate normally distributed.

- Linear combinations of the variables in y are themselves normally distributed.
- All marginal distributions for any subset of y are normally distributed. This fact means that each variable in the set is normally distributed. That is, multivariate normality implies univariate normality for each individual variable in the set y. However, the converse is not true; univariate normality of the individual variables does not imply multivariate normality.
- All conditional distributions (say y1 | y2) are normally distributed.
Assessing Multivariate Normality

According to Timm [30, Section 3.7, p. 118], 'The two most important problems in multivariate data analysis are the detection of outliers and the evaluation of multivariate normality'. Timm presents a very detailed strategy for detecting outlying values and for assessing multivariate normality. The steps he (as well as others) suggests are:

1. Evaluate each individual variable for univariate normality. A number of procedures can be followed for assessing univariate normality, such as checking the skewness and kurtosis measures for each individual variable (see [4, 13, 17], Chapter 4). As well, one can use a global test such as the Shapiro and Wilk [23] test. Statistical packages usually provide these assessments, for example, SAS's [20] UNIVARIATE procedure (see Software for Statistical Analyses).
2. Use normal probability plots (Q-Q plots) for each individual variable, plots that compare each sample distribution to population quantiles of the normal distribution. Rencher [17] indicates that many researchers find this technique to be most satisfactory as a visual check for nonnormality. (A brief sketch of steps 1 and 2 appears after this list.)
3. Transform nonnormal variables. Researchers can use Box and Cox's [3] power transformation. Timm [30, p. 130] and Timm and Mieczkowski [31, pp. 34-35] provide a SAS program for the Box and Cox power transformation (see Transformation).
4. Locate outlying values (see the SAS programs in Timm and Mieczkowski [31, pp. 26-31]). If outlying values are present, they must be dealt with (see [30, p. 119]). See the references in [33], where a number of (multivariate) outlier detection procedures are described.
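Steps 1 and 2 can also be carried out outside SAS; the following minimal sketch (assuming Python with SciPy and Matplotlib, which are not part of the original entry) computes skewness, kurtosis, and the Shapiro-Wilk test and draws a normal Q-Q plot for one variable:

```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
y = rng.lognormal(size=100)     # a deliberately nonnormal variable

# Step 1: univariate skewness, kurtosis, and the Shapiro-Wilk test
print("skewness:", stats.skew(y))
print("excess kurtosis:", stats.kurtosis(y))
w, p = stats.shapiro(y)
print("Shapiro-Wilk W = %.3f, p = %.4f" % (w, p))

# Step 2: normal probability (Q-Q) plot
stats.probplot(y, dist="norm", plot=plt)
plt.show()
```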
These four steps are for evaluating normality and outlying values with respect to marginal (univariate) normality. However, as previously indicated, univariate normality is not a sufficient condition for multivariate normality. As Looney [13, p. 64] notes, 'Even if UVN (univariate normality) is tenable for each of the variables individually, however, this does not necessarily imply MVN (multivariate normality) since there are non-MVN distributions that have UVN marginals.' Nonetheless, some authors believe that the results from assessing univariate normality are good indications of whether multivariate normality exists or not. For example, Gnanadesikan [5, p. 168] indicates 'In practice, except for rare or pathological examples, the presence of joint (multivariate) normality is likely to be detected quite often by methods directed at studying the marginal (univariate) normality of the observations on each variable'. This view was also noted by Johnson and Wichern [10, p. 153]: '. . . for most practical work, one-dimensional and two-dimensional investigations are ordinarily sufficient. Fortunately, pathological data sets that are normal in lower dimensional representation but nonnormal in higher dimensions are not frequently encountered in practice'. These viewpoints suggest that it is sufficient to examine marginal and bivariate normality in order to assess multivariate normality.

Procedures, however, are available for directly assessing multivariate normality. Romeu and Ozturk [18] compared goodness-of-fit tests for multivariate normality, and their results indicate that the multivariate tests of skewness and kurtosis proposed by Mardia [14, 16] are most reliable for assessing multivariate normality. The skewness and kurtosis measures are defined, respectively, as

\beta_{1,p} = E\{[(x - \mu)^T \Sigma^{-1} (y - \mu)]^3\},   (4)

where x and y denote independent, identically distributed random vectors, and

\beta_{2,p} = E\{[(y - \mu)^T \Sigma^{-1} (y - \mu)]^2\}.   (5)

The sample estimators are, respectively,

\hat\beta_{1,p} = \frac{1}{n^2} \sum_{i=1}^{n}\sum_{j=1}^{n} [(y_i - \bar{y})^T S^{-1} (y_j - \bar{y})]^3   (6)

and

\hat\beta_{2,p} = \frac{1}{n} \sum_{i=1}^{n} [(y_i - \bar{y})^T S^{-1} (y_i - \bar{y})]^2.   (7)
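A direct computation of the sample measures in (6) and (7), together with their usual large-sample tests (n times the skewness measure divided by 6 referred to a chi-squared distribution, and a standardized kurtosis measure referred to the standard normal), is sketched below; Python with NumPy and SciPy is an assumption, not part of the original entry:

```python
import numpy as np
from scipy import stats

def mardia(Y):
    """Mardia's multivariate skewness and kurtosis with asymptotic tests."""
    Y = np.asarray(Y, dtype=float)
    n, p = Y.shape
    dev = Y - Y.mean(axis=0)
    S = dev.T @ dev / n                 # divide-by-n covariance estimate
    G = dev @ np.linalg.inv(S) @ dev.T  # G[i, j] = (y_i - ybar)' S^{-1} (y_j - ybar)

    b1 = (G ** 3).sum() / n ** 2        # sample skewness, equation (6)
    b2 = (np.diag(G) ** 2).sum() / n    # sample kurtosis, equation (7)

    # Large-sample reference distributions
    skew_stat = n * b1 / 6
    skew_df = p * (p + 1) * (p + 2) / 6
    p_skew = stats.chi2.sf(skew_stat, skew_df)

    kurt_stat = (b2 - p * (p + 2)) / np.sqrt(8 * p * (p + 2) / n)
    p_kurt = 2 * stats.norm.sf(abs(kurt_stat))

    return b1, p_skew, b2, p_kurt

rng = np.random.default_rng(0)
print(mardia(rng.normal(size=(200, 3))))     # should not reject
print(mardia(rng.lognormal(size=(200, 3))))  # should reject
```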
The sampling distributions of \hat\beta_{1,p} and \hat\beta_{2,p} are given by Timm [30, p. 121]. Mardia [14, 15] provided percentage points of \hat\beta_{1,p} and \hat\beta_{2,p} for p = 2, 3, and 4; for other values of p, or when n >= 50, approximations provided by Rencher [17, p. 113] can be used. A macro provided by SAS (%MULTINORM; see Timm [30]) can be used to compute these multivariate tests of skewness and kurtosis (see also [31, pp. 32-34]). The macro %MULTINORM can be downloaded from either the Springer-Verlag website (http://www.springer-ny.com/detail.tpl?isbn=0387953477) or from Timm's website (http://www.pitt.edu/~timm). As Timm indicates, rejection of either test suggests that the data contain either multivariate outliers or are not multivariate normally distributed.

Other tests for multivariate normality have been proposed. Looney [13] discusses a number of them: Royston's [19] multivariate extension of the Shapiro-Wilk test, Small's [24] multivariate extensions of the univariate skewness and kurtosis measures, Srivastava's [26] measures of multivariate skewness and kurtosis, and Srivastava and Hui's [27] Shapiro-Wilk tests. Commenting on these different tests, Looney [13, p. 69] notes that 'Since no one test for MVN can be expected to be uniformly most powerful against all non-MVN alternatives, we maintain that the best strategy currently available is to perform a battery of tests and then base the assessment of MVN on the aggregate of results.'

If one rejects the multivariate normality hypothesis, there are a number of paths that one may follow. First, researchers can search for multivariate outliers. However, detecting multivariate outliers is a much more difficult task than identifying univariate outlying values. Graphical procedures might help here (see [22]). The SAS [21] system offers a comprehensive graphing procedure (SAS/INSIGHT; see the discussion and examples in Timm [30, pp. 124-131]). Another approach that researchers frequently employ to detect multivariate outliers is based on the Mahalanobis distance statistic (3). That is, one can plot the ordered D_i^2 values against the expected order statistics (q_i) from a chi-squared distribution (see also Timm and Mieczkowski [31, pp. 22-31]). As Timm points out, if the data are multivariate normal, the plotted pairs (D_i^2, q_i) should fall close to a straight line. Timm also notes that there are formal tests for detecting multivariate outliers (see [2]), suggesting, however, that these tests have limited value.
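The chi-squared probability plot just described is easy to produce; here is a brief sketch (again assuming Python with NumPy, SciPy, and Matplotlib, which are not part of the original entry):

```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

def mahalanobis_qq(Y):
    """Plot ordered squared Mahalanobis distances against chi-squared quantiles."""
    Y = np.asarray(Y, dtype=float)
    n, p = Y.shape
    dev = Y - Y.mean(axis=0)
    S_inv = np.linalg.inv(np.cov(Y, rowvar=False))
    d2 = np.einsum("ij,jk,ik->i", dev, S_inv, dev)   # D_i^2 for each case

    d2_sorted = np.sort(d2)
    probs = (np.arange(1, n + 1) - 0.5) / n          # plotting positions
    q = stats.chi2.ppf(probs, df=p)                  # expected order statistics

    plt.scatter(q, d2_sorted)
    plt.plot(q, q)                                   # reference line
    plt.xlabel("Chi-squared quantiles")
    plt.ylabel("Ordered D^2")
    plt.show()

rng = np.random.default_rng(0)
mahalanobis_qq(rng.normal(size=(100, 3)))
```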
Wilcox [32, Chapter 6], on the other hand, presents some newer, more satisfactory, methods for detecting multivariate outliers. Methods for dealing with multivariate outlying values are also more complex than in the univariate case. As Timm [30] notes, the Box–Cox power transformation has been generalized to the multivariate case by Andrews et al. [1]; however, determining the appropriate transformation is complicated. As an alternative to seeking an appropriate transformation of the data, Timm suggests that researchers can adopt a data reduction transformation, a l`a principal components analysis, thereby only using a subset (hopefully which are multivariate normally distributed) of the data. Another strategy is to adopt robust estimators instead of the usual least squares estimates; that is, one can utilize trimmed means and a Winsorized variance-covariance matrix. These robust estimators can as well be used to detect outlying values (see [32, 33]) and/or in a multivariate test statistic (see e.g., [12]).
Grouped Data For many multivariate investigations, data are categorized by group membership (e.g., one-way multivariate analysis of variance). Accordingly, the procedures previously described would be applied to the data within each treatment group (or cell, from a factorial design). Stevens [28] and Tabachnick and Fidell [29] present strategies for applying the previously enumerated procedures for grouped data. As well, they present data sets, indicating statistical software (SAS [20] and SPSS [25]) that can be used to obtain numerical results. It should be noted that when comparing group means it is more important to examine skewness than kurtosis. As DeCarlo [4, p. 297] notes ‘. . . a general finding for univariate and multivariate data is that tests of means appear to be affected by skew more than kurtosis’.
Numerical Example

To illustrate the use of the methods previously discussed, a hypothetical data set was created by generating data from a lognormal distribution (having skewness = 6.18 and kurtosis = 110.93). In particular, lognormal data were generated for each of two groups (N1 = 40 and N2 = 60), where there were
Figure 1  Group 1 scores (Y1 and Y2), plotted by observation number: scores from a lognormal distribution
Table 1  Univariate and multivariate normality tests

Variable   N    Multivariate skewness & kurtosis   Test              Test statistic value   P value
Group 1
Y1         40                                      Shapiro-Wilk             0.806           .0000
Y2         40                                      Shapiro-Wilk             0.584           .0000
           40            15.2437                   Mardia skewness        114.849           .0000
           40            26.6297                   Mardia kurtosis         14.728           .0000
Group 2
Y1         60                                      Shapiro-Wilk             0.699           .0000
Y2         60                                      Shapiro-Wilk             0.485           .0000
           60            22.0340                   Mardia skewness        239.200           .0000
           60            39.4248                   Mardia kurtosis         30.427           .0000
two dependent measures (Y1 and Y2 ) per subject (see Figures 1 and 2). As indicated, multivariate normality can be assessed by determining whether each of the dependent measures is normally distributed; if not, then multivariate normality is not satisfied. Researchers, however, can choose to bypass these univariate tests (i.e., Shapiro-Wilk) and merely proceed to examine the multivariate measures of skewness and kurtosis with the tests due to Mardia [14]. For completeness, results from both approaches are presented in Table 1. Within each group, both the univariate and multivariate tests for normality are statistically significant.
Having found that the data are not multivariate normally distributed, researchers have to choose some course of action to deal with this problem. As previously indicated, one can search for a transformation to the data in an attempt to achieve multivariate normality, locate outlying values and deal with them, or adopt some robust form of analysis, that is, a procedure (e.g., test statistic) whose accuracy is not affected by nonnormality. Part of the motivation for presenting a two-group problem was to demonstrate a robust solution for circumventing nonnormality. That is, the motivation for the hypothetical data set was that such data could have
Figure 2   Scores from a lognormal distribution (Group 2: Y1 and Y2 plotted against observation number)
been gathered in an experiment where a researcher intended to test a hypothesis that the two groups were equivalent on the set of dependent measures. The classical solution to this problem would be to test the hypothesis that the mean performance on each dependent variable is equal across groups, that is, H0: µ1 = µ2, where µj = [µj1 µj2]′ and j = 1, 2. Hotelling's [7] T² is the conventional statistic for testing this multivariate hypothesis; however, it rests on a set of derivational assumptions that are not likely to be satisfied. Specifically, this test assumes that the outcome measurements follow a multivariate normal distribution and exhibit a common covariance structure (covariance homogeneity). One solution for obtaining a robust test of the multivariate null hypothesis is to adopt a heteroscedastic statistic (one that does not assume covariance homogeneity) with robust estimators, that is, estimators intended to be less affected by nonnormality. Trimmed means and Winsorized covariance matrices are robust estimators that
can be used instead of the usual least squares estimators of the group means and covariance matrices (see Robust Statistics for Multivariate Methods). Lix and Keselman [12] enumerated a number of robust test statistics for the two-group multivariate problem. For the data presented in Figures 1 and 2, the P value for Hotelling's T² statistic is .1454, a nonsignificant result, whereas the P value associated with any one of the robust procedures (e.g., Johansen's [9] test) is .0000, a statistically significant result. Consequently, the biasing effects of nonnormality and/or covariance heterogeneity have been circumvented by adopting a robust procedure, that is, a procedure based on robust estimators and a heteroscedastic test statistic.
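For readers who wish to compute the classical part of this comparison themselves, here is a minimal sketch of the two-sample Hotelling T² test (pooled covariance, F reference distribution). It is not the robust Johansen-type procedure of Lix and Keselman [12], and the simulated data below are placeholders rather than the values plotted in Figures 1 and 2.

```python
import numpy as np
from scipy import stats

def hotelling_t2(X1, X2):
    """Classical two-sample Hotelling T^2 test (assumes multivariate normality
    and a common covariance matrix); returns T^2 and its P value."""
    X1, X2 = np.asarray(X1, float), np.asarray(X2, float)
    n1, p = X1.shape
    n2 = X2.shape[0]
    d = X1.mean(axis=0) - X2.mean(axis=0)
    S_pooled = ((n1 - 1) * np.cov(X1, rowvar=False) +
                (n2 - 1) * np.cov(X2, rowvar=False)) / (n1 + n2 - 2)
    t2 = (n1 * n2 / (n1 + n2)) * d @ np.linalg.solve(S_pooled, d)
    f_stat = t2 * (n1 + n2 - p - 1) / (p * (n1 + n2 - 2))   # convert T^2 to an F statistic
    return t2, stats.f.sf(f_stat, p, n1 + n2 - p - 1)

rng = np.random.default_rng(3)
print(hotelling_t2(rng.lognormal(size=(40, 2)), rng.lognormal(size=(60, 2))))
```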
Summary

Evaluating multivariate normality is a crucial aspect of multivariate analyses. Though the task is considerably more involved than is the case when there is
but a single measure, researchers would be remiss to ignore the issue. Indeed, most classical multivariate procedures require multivariate normality in order to provide accurate tests of significance and confidence coefficients, as well as to safeguard the power to detect multivariate effects. Researchers should, at a minimum, assess each of the individual variables for univariate normality. As indicated, data cannot be multivariate normal if the marginal distributions are not univariate normal. Checking for univariate normality is straightforward: statistical packages (e.g., SAS [20], SPSS [25]) provide measures of skewness and kurtosis as well as formal tests (e.g., Shapiro-Wilk). Perhaps bivariate normality should be assessed as well. For some authors (e.g., [5, 10, 28]), these procedures should suffice to assess the multivariate normality assumption. However, others suggest that researchers go on to formally assess multivariate normality [13, 17, 30]. (One could bypass tests of univariate normality and proceed directly to an examination of multivariate normality.) The approach most frequently cited for assessing multivariate normality is the method due to Mardia [14], who presented measures and tests of multivariate skewness and kurtosis. If the data are found to contain multivariate outlying values and/or are not multivariate normal in form, researchers can adopt a number of ameliorative actions. Data can be transformed (not a straightforward task), outlying values can be detected and removed (see [30, 32, 33]), or researchers can adopt robust estimators with robust test statistics (see [12, 30, 32]). With regard to this last option, some authors would recommend that researchers abandon testing and examining the classical parameters (e.g., population least squares means and variance-covariance matrices, and classical multivariate measures of effect size and confidence intervals) in favor of robust parameters (e.g., population trimmed means and Winsorized covariance matrices) estimated with robust estimators, thereby bypassing the assessment of multivariate normality altogether by exclusively adopting robust procedures (e.g., see [32]).
References

[1] Andrews, D.F., Gnanadesikan, R. & Warner, J.L. (1971). Transformations of multivariate data, Biometrics 27, 825–840.
[2] Barnett, V. & Lewis, T. (1994). Outliers in Statistical Data, Wiley, New York.
[3] Box, G.E.P. & Cox, D.R. (1964). An analysis of transformations, Journal of the Royal Statistical Society, Series B 26, 211–252.
[4] DeCarlo, L.T. (1997). On the meaning and use of kurtosis, Psychological Methods 2, 292–307.
[5] Gnanadesikan, R. (1977). Methods for Statistical Data Analysis of Multivariate Observations, Wiley, New York.
[6] Harasym, P.H., Leong, E.J., Lucier, G.E. & Lorscheider, F.L. (1996). Relationship between Myers-Briggs psychological traits and use of course objectives in anatomy and physiology, Evaluation & The Health Professions 19, 243–252.
[7] Hotelling, H. (1931). The generalization of Student's ratio, Annals of Mathematical Statistics 2, 360–378.
[8] Huberty, C.J. (1994). Applied Discriminant Analysis, Wiley, New York.
[9] Johansen, S. (1980). The Welch-James approximation to the distribution of the residual sum of squares in a weighted linear regression, Biometrika 67, 85–92.
[10] Johnson, R.A. & Wichern, D.W. (1992). Applied Multivariate Statistical Analysis, 3rd Edition, Prentice Hall, Englewood Cliffs.
[11] Knapp, R.G. & Miller, M.C. (1983). Monitoring simultaneously two or more indices of health care, Evaluation & The Health Professions 6, 465–482.
[12] Lix, L.M. & Keselman, H.J. (2004). Multivariate tests of means in independent groups designs: effects of covariance heterogeneity and non-normality, Evaluation & The Health Professions 27, 45–69.
[13] Looney, S.W. (1995). How to use tests for univariate normality to assess multivariate normality, The American Statistician 49, 64–70.
[14] Mardia, K.V. (1970). Measures of multivariate skewness and kurtosis with applications, Biometrika 57, 519–530.
[15] Mardia, K.V. (1974). Applications of some measures of multivariate skewness and kurtosis for testing normality and robustness studies, Sankhya, Series B 36, 115–128.
[16] Mardia, K.V. (1980). Tests for univariate and multivariate normality, in Handbook of Statistics, Vol. 1, Analysis of Variance, P.R. Krishnaiah, ed., North-Holland, New York, 279–320.
[17] Rencher, A.C. (1995). Methods of Multivariate Analysis, Wiley, New York.
[18] Romeu, J.L. & Ozturk, A. (1993). A comparative study of goodness-of-fit tests for multivariate normality, Journal of Multivariate Analysis 46, 309–334.
[19] Royston, J.P. (1983). Some techniques for assessing multivariate normality based on the Shapiro-Wilk W, Applied Statistics 32, 121–133.
[20] SAS Institute Inc. (1990). SAS Procedure Guide, Version 6, 3rd Edition, SAS Institute, Cary.
[21] SAS Institute Inc. (1993). SAS/INSIGHT User's Guide, Version 6, 2nd Edition, SAS Institute, Cary.
[22] Seber, G.A.F. (1984). Multivariate Observations, Wiley, New York.
[23] Shapiro, S.S. & Wilk, M.B. (1965). An analysis of variance test for normality, Biometrika 52, 591–611.
[24] Small, N.J.H. (1980). Marginal skewness and kurtosis in testing multivariate normality, Applied Statistics 29, 85–87.
[25] SPSS Inc. (1994). SPSS Advanced Statistics 6.1, SPSS Inc., Chicago.
[26] Srivastava, M.S. (1984). A measure of skewness and kurtosis and a graphical method for assessing multivariate normality, Statistics and Probability Letters 2, 263–267.
[27] Srivastava, M.S. & Hui, T.K. (1987). On assessing multivariate normality based on the Shapiro-Wilk W statistic, Statistics and Probability Letters 5, 15–18.
[28] Stevens, J.P. (2002). Applied Multivariate Statistics for the Social Sciences, Lawrence Erlbaum, Mahwah.
[29] Tabachnick, B.G. & Fidell, L.S. (2001). Using Multivariate Statistics, 4th Edition, Allyn & Bacon, New York.
[30] Timm, N.H. (2002). Applied Multivariate Analysis, Springer, New York.
[31] Timm, N.H. & Mieczkowski, T.A. (1997). Univariate and Multivariate General Linear Models: Theory and Applications Using SAS Software, SAS Institute, Cary.
[32] Wilcox, R.R. (in press). Introduction to Robust Estimation and Hypothesis Testing, 2nd Edition, Academic Press, San Diego.
[33] Wilcox, R.R. & Keselman, H.J. (in press). A skipped multivariate measure of location: one- and two-sample hypothesis testing, Journal of Modern Applied Statistical Methods.
H.J. KESELMAN
Multivariate Outliers NICK FIELLER Volume 3, pp. 1379–1384 in Encyclopedia of Statistics in Behavioral Science ISBN-13: 978-0-470-86080-9 ISBN-10: 0-470-86080-4 Editors Brian S. Everitt & David C. Howell John Wiley & Sons, Ltd, Chichester, 2005
Multivariate Outliers

The presence of outliers in univariate data is readily detected because they must be extreme observations that can easily be identified numerically or graphically. The outliers are amongst the largest or smallest observations, and it is not difficult to identify and examine these and perhaps test them for discordancy in relation to the model proposed for the data (see Outliers; Outlier Detection). By contrast, outliers in multivariate data present special difficulties, primarily because of the lack of a clear definition of the 'extreme observations' in such a data set. While the intuitive notion is that the extreme observations are those 'furthest from the main body of the data', this does not help locate them, since there are many possible 'directions' in which they could be separated. As many authors have commented, univariate outliers 'stick out at one end or the other but multivariate outliers just stick out somewhere'. Figure 1 shows a plot of two components, B and I, from a set of data containing measurements of nine trace elements (labelled A, B, . . ., I) measured in a collection of 269 archaic clay vessels. There is clearly one outlier 'sticking out' in a direction of forty-five degrees from the main body of the data, and there are maybe some suspicious observations sticking out in the opposite direction. Most proposed techniques for pinpointing any outliers, which may then be tested formally for discordancy (or, maybe, just examined for deeper understanding of the data), rely on calculations of sample statistics such as the mean and covariance matrix, which
Figure 1   Outlier revealed on bivariate plot (trace element 'I' plotted against trace element 'B')
can be seriously affected by the outliers themselves. With univariate data, inflation of sample variance and change in mean does not camouflage their hiding place; by contrast, distortion of the sample covariance can hide multivariate outliers effectively, especially when there are multiple outliers placed in different directions. This feature makes the use of robust estimates of the mean and covariance matrix a possibility to be considered, though this may or may not be effective. This is discussed further below. As with all outlier problems in the whole range of contexts of statistical analysis, there are three distinct objectives: identification, testing for discordancy and accommodation. Irrespective of whether the first two have been considered, the third of these can be handled by the routine use of robust methods; specifically, use of robust estimates of mean and covariance (see Robust Statistics for Multivariate Methods). These methods are widely available in many statistical software packages, for example, R, S-Plus, and SAS and are not specifically discussed here (see Software for Statistical Analyses). Identification may be followed by a formal test of discordancy or it may be that the identification stage has merely revealed simple mistakes in the data recording or, maybe some unexpected feature of interest in its own right. Identification may proceed in step with formal tests in hunting for multiple outliers where masking and swamping are ever-present dangers.
Identification

Strategies for identification begin with simple graphical methods as an aid to visual inspection. For an n × p data matrix X = (x1, . . . , xn) representing a random sample of n p-dimensional observations xi, i = 1, . . . , n, univariate plots of original components (or boxplots, or listing the ordered values numerically) should reveal any simple extreme potential recording errors. Bivariate (or pseudo three-dimensional; see Three Dimensional (3D) Scatterplots) plots of original components are only worthwhile for low-dimensional data, perhaps for p ≤ 10 or 15, say (an effective upper limit for easy scanning of matrix plots of all pairwise combinations of components). These may not reveal anything other than simple recording errors in a single component which are not extreme enough to have caused concern on first examination of just that component alone. Further, they will not
reveal outliers that are not simple recording errors but are attributable to more interesting unexpected causes. Figure 1 above shows just such an example where the outlier could be a low value of element B or a high value of element I. They are nevertheless worthwhile, especially since such matrix plots are quick and easy to produce (see Scatterplot Matrices). For higher dimensional data sets, it is not easy to focus on all of the bivariate scatterplots individually. A more sophisticated approach is to use an initial dimensionality reduction method as a preliminary step to reduce the combinations of components that need to be examined. The obvious candidate is principal component analysis, based on either the covariance or the correlation matrix (or on both in turn) and examining bivariate scatterplots of successive pairs of principal components. These methods have the key advantage in that they can be used even if n < p, that is, more dimensions than data points. Here, the rationale is that the presence of outliers will distort the covariance matrix resulting in ‘biasing’ a principal component by ‘pulling it towards’ the outlier, thus allowing it to be revealed as an extreme observation on a scatterplot including that principal component. Clearly, a gross outlier will be revealed on the high order principal components while a minor one will be revealed on those associated with the smaller eigenvalues. It is particularly sensible to examine plots of PCs around the ‘cutoff point’ on a scree plot of the eigenvalues since outliers might increase the apparent effective dimensionality of the data set and so be revealed in this ‘additional’ dimension. These methods can be effective for either low numbers of outliers in relation to the sample size or for ‘clusters’ of outliers that arise from a common cause and so ‘stick out’ in the same direction. For large numbers of heterogeneous outliers, then, a robust version of principal component analysis is a possibility, although this loses the rationale that the outliers distort the PCs towards them by distorting the covariance or correlation matrix. Returning to the measurements of nine trace elements of the clay vessels, Figure 2 shows a scree plot of cumulative variance explained by successive principal components calculated from the correlation matrix of the nine variables on a subset of 53 of the vessels. This suggests that the inherent dimensionality of the data is around five, with the first five principal components containing 93% of the variation and so it could be that outliers would be revealed on plots
Figure 2   Scree plot of cumulative percentage of variance against number of components, with the cut-off point indicated
of principal components around that value. Figure 3 below shows scatter plots of the data referred to successive pairs of the first five principal components. Inspection of these suggests that the two observations on the right of the lower right-hand plot (i.e. with high scores on the fifth component) would be worth investigating further, though they are not exceptionally unusual. Closer inspection (with interactive brushing) reveals that one of this pair occurs as the extreme top right observation in the top left plot of PC1 versus PC2 in Figure 3, that is, it has almost the highest scores on both of the first two principal components. Further inspection with brushing reveals that the group of three observations on the top right of PC3 versus PC4 (i.e. those with the three highest scores on the fourth component) have very similar values on all other components and so might be considered a group of outliers from a common source. A more intensive graphical procedure, which is only viable for reasonably modest data sets (say n <∼ 50) with more observations than dimensions (n > p) and where only a small number of outliers is envisaged is the use of outlier displaying components (ODCs). These are really just linear discriminant coordinates (sometimes called canonical variate coordinates) (see Discriminant Analysis). For a single outlier, these ODCs are calculated by dividing the data set into two groups, one consisting of just one observation xj and the other of the (n − 1) observations omitting xj , j = 1, . . . , n (so that there is potentially a separate ODC for each observation). It is easily shown that the univariate likelihood ratio test statistic for discordancy of xj under a Normal mean slippage model calculated from the values projected
Figure 3   Pairwise plots of the data on successive pairs of the first five principal components
onto this ODC is numerically identical to the equivalent statistic (see below), calculated from the original p-dimensional data. Thus, the one-dimensional data projected onto the ODC capture all the numerical information on the discordancy of that observation (though the reference distribution for a formal test of discordancy does depend on the dimensionality p). Picking that with the highest value of the statistic will reveal the most extreme outlier. The generalization to two outliers and beyond depends on whether they arise from a common or two distinct slippages. If the former, then there is just one ODC for each possible pair and, if the latter, then there are two. For three or more, the number of possibilities increases rapidly and so the use is limited as a tool for pure detection of outliers to modest data sets with low contamination rates. However, the procedure can be effective for displaying outliers already identified and for plotting subsidiary data in the same coordinate system. Examination of loadings of original components may give information on the nature of the outliers in the spirit of union-intersection test procedures.
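Before turning to the clay-vessel example, the principal-component screening strategy described earlier is easy to sketch. The code below is an illustration only: it computes correlation-matrix principal components for simulated data with one artificially planted outlier and lists the most extreme observation on each component; plotting and brushing are left out.

```python
import numpy as np

def pc_scores(X):
    """Principal component scores from the correlation matrix (a minimal sketch
    of the screening described above)."""
    X = np.asarray(X, float)
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # standardize -> correlation PCA
    R = np.corrcoef(Z, rowvar=False)
    eigval, eigvec = np.linalg.eigh(R)                 # eigh returns ascending order
    order = np.argsort(eigval)[::-1]
    return Z @ eigvec[:, order], eigval[order]

rng = np.random.default_rng(4)
X = rng.normal(size=(53, 9))                 # 53 cases, 9 trace-element-like variables
X[0] += 8 * rng.normal(size=9)               # one artificial gross outlier
scores, eigval = pc_scores(X)
print("variance explained (%):", np.round(100 * eigval / eigval.sum(), 1))
# Observations with unusually large |score| on any component are candidates.
print("most extreme observation on each PC:", np.argmax(np.abs(scores), axis=0))
```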
Returning to the data on trace elements in claypots, the plot in Figure 4 on the left displays the identified outlier in Figure 1 on the single outlier displaying component (ODC) for that observation as the horizontal axis. This has been calculated using all nine dimensions rather than just two as in Figure 1 and this contains all the information on the deviation of this observation from the rest of the data. The vertical axis has been chosen as the first principal component but other choices might be sensible, for example, a component around the cutoff value of four or five or else on a ‘subprincipal component’: that vector which maximizes the variance subject to the constraint of being orthogonal with the ODC. Examination of the loadings of the trace elements in the ODC show heavy (negative) weighting on element I with moderate contributions from trace elements F (negative) and E (positive). Trace element B, the horizontal axis in Figure 1, has a very small contribution and, thus, it might be suspected that any recording error is in the measurement of element I. A matrix plot of all components (not shown here) does not
Figure 4   Illustrations of outlier displaying components: left, the single-outlier ODC (horizontal axis) plotted against PC1 (47%); right, the Figure 3 subset plotted on outlier displaying components 1 and 2
reveal this, primarily because this pair of elements is the only one exhibiting any strong correlation. It may be seen that there are several further observations separated from the main body of the data in the direction of the ODC, both with low values (as with the outlier) and with high values. Attention might be directed towards these samples. It should be noted, of course, that the sign of the ODC is arbitrary in the sense that all the coefficients could be multiplied by −1 (thus reflecting the plot about a vertical axis), in the same way that signs of principal components are arbitrary. The figure on the right of Figure 4 displays the subset of the data discussed in relation to Figure 3. Two groups of outliers have been identified. One is the pair of extreme observation on the fifth principal component and these appear on the right of the plot. The other is the triple of extreme observations on the fourth principal component. These appear in the lower left of the plot. However, two further observations appear separated from the main body in this coordinate system (at the top of the plot) and further examination of these samples might be of interest. Informal examination of the loadings of these two ODCs suggests that the first is dominated by trace elements D and E and the second by E, F and I. Whether this information is of key interest and an aid in further understanding is, of course, a matter for the scientist involved rather than purely a statistical result. The methods outlined above have the advantage that they display outliers in relation to the original data – they are methods for selecting interesting views of the raw data that highlight the outliers. For data sets with large numbers of observations and
dimensions, this approach may not be viable, and various more systematic approaches have been suggested based on some variant of probability plotting. The starting point is to consider ordered values of a measure of generalized squared distance of each observation from a measure of central location, Rj(xj, C, Σ) = (xj − C)′ Σ⁻¹ (xj − C), where C is a measure of location (e.g., the sample mean or a robust estimate of the population mean) and Σ is a measure of scale (e.g., the sample covariance or a robust estimate of the population value). The choice of the sample mean and covariance for C and Σ yields the squared Mahalanobis distance of each observation from the mean. It is known that if xj comes from a multivariate Normal distribution (see Catalogue of Probability Density Functions), then the distribution of Rj(xj, C, Σ) is well approximated by a gamma distribution (see Catalogue of Probability Density Functions) with, dependent upon Σ, some shape parameter that needs to be estimated. A plot of the ordered values against expected order statistics would reveal outliers as deviating from the straight line at the upper end. Barnett and Lewis [1] give further details and references. It should be noted that, typically, these methods are only available when n > p; in other cases, a (very robust) dimensionality reduction procedure might be considered first. Further, these methods are primarily of use when the number of outliers is relatively small. In cases where there is a high multiplicity of outliers, the phenomena of masking and swamping make them less effective. An alternative approach, typified by Hadi [2, 3], is based upon finding a 'clean' interior subset of observations and successively adding more observations to the subset. Hadi recommends starting
with the p + 1 observations with the smallest values of Rj(xj, C, Σ), with very robust choices for C and Σ. An alternative might be to start by peeling away successive convex hulls to leave a small interior subset. Successive observations are added to the subset by choosing the one with the smallest value of Rj(xj, C, Σ), calculated just from the clean initial subset (with C and Σ chosen as its sample mean and covariance) and with the ordering updated at each step. Hadi gives a stopping criterion based on simulations from multivariate Normal distributions; the remaining observations not included in the clean subset are declared discordant.
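A minimal sketch of the generalized squared distances Rj(xj, C, Σ) follows. By default it uses the sample mean and covariance, and a chi-square quantile is used only as a rough screening cut-off; the gamma approximation with estimated shape, robust choices of C and Σ, and Hadi's forward-selection refinement are not implemented here, and the planted outliers are illustrative.

```python
import numpy as np
from scipy import stats

def squared_distances(X, C=None, S=None):
    """Generalized squared distances R_j(x_j, C, S) = (x_j - C)' S^{-1} (x_j - C).

    By default C and S are the sample mean and covariance (squared Mahalanobis
    distances); robust estimates can be passed in instead."""
    X = np.asarray(X, float)
    C = X.mean(axis=0) if C is None else C
    S = np.cov(X, rowvar=False) if S is None else S
    diff = X - C
    return np.einsum("ij,jk,ik->i", diff, np.linalg.inv(S), diff)

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 5))
X[:2] += 6.0                                   # two planted outliers
R = squared_distances(X)
# Under multivariate normality with known parameters R ~ chi-square(p); the
# chi-square quantile below is used only as a rough screening cut-off.
cutoff = stats.chi2.ppf(0.975, df=X.shape[1])
print("flagged observations:", np.where(R > cutoff)[0])
```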
Tests for Discordancy

Formal tests of discordancy rely on distributional assumptions; an outlier can only be tested for discordancy in relation to a formal statistical model. Most available formal tests presume multivariate Normality; exceptions are tests for single outliers in bivariate exponential and bivariate Pareto distributions. Barnett and Lewis [1] provide some details and tables of critical values for these two distributions. For multivariate Normal distributions (see Catalogue of Probability Density Functions), the likelihood ratio statistic for a test of discordancy of a single outlier arising from a similar distribution with a change in mean (i.e., a mean slippage model) is easily shown to be equivalent to the Mahalanobis distance from the sample mean (i.e., Rj(xj, C, Σ), with C and Σ taken as the sample mean and covariance). In one dimension, this is equivalent to the studentized distance from the sample mean. Barnett and Lewis [1] give tables of 5% and 1% critical values of this statistic for various values of n ≤ 500 and p ≤ 5, together with similar tables of critical values of equivalent statistics when the population covariance, and also both the population mean and covariance, are known, as well as for the case where an independent external estimate of the covariance is available. Additionally, they provide similar
tables for the appropriate statistic for a pair of outliers and references to further sources.
Implementation

Although the techniques described here are not specifically available as an 'outlier detection and testing module' in any standard package, they can be readily implemented using standard modules provided for principal component analysis, linear discriminant analysis, and robust calculation of the mean and covariance, provided there is also the capability of direct evaluation of algebraic expressions involving products of matrices and so on. Such facilities are certainly available in R, S-PLUS, and SAS. For example, to obtain a single-outlier ODC plot, a group indicator needs to be created which labels the single observation as group 1 and the remaining observations as group 2, and then a standard linear discriminant analysis can be performed. Some packages may fail to perform standard linear discriminant analysis if there is only one observation in a group. However, it is easy to calculate the single-outlier ODC for observation xj directly as Σ⁻¹xj, where Σ is the sample covariance matrix or a robust version of it. The matrix calculation facilities would probably be needed for implementation of the robust versions of the techniques.
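The direct calculation mentioned above can be sketched as follows. Here the discriminant direction is formed as the inverse covariance applied to the deviation of the candidate observation from the mean of the remaining observations, which is one common way of realizing the abbreviated formula quoted in the text; the planted outlier and the unit-length scaling are illustrative assumptions rather than part of the source.

```python
import numpy as np

def single_outlier_odc(X, j, robust_cov=None):
    """Direction (ODC) separating observation j from the remaining n - 1 points.

    A sketch: the linear discriminant direction S^{-1}(x_j - mean of the rest),
    with S the covariance of the remaining observations (or a robust matrix
    passed in by the caller). The scaling of the direction is arbitrary."""
    X = np.asarray(X, float)
    rest = np.delete(X, j, axis=0)
    S = np.cov(rest, rowvar=False) if robust_cov is None else robust_cov
    direction = np.linalg.solve(S, X[j] - rest.mean(axis=0))
    return direction / np.linalg.norm(direction)

rng = np.random.default_rng(6)
X = rng.normal(size=(40, 4))
X[7, 0] += 10.0                              # plant a single outlier
w = single_outlier_odc(X, 7)
print("loadings of the ODC:", np.round(w, 2))
print("projections onto the ODC:", np.round(X @ w, 2))
```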
References

[1] Barnett, V. & Lewis, T. (1994). Outliers in Statistical Data, 3rd Edition, Wiley, New York.
[2] Hadi, A.S. (1992). Identifying multiple outliers in multivariate data, Journal of the Royal Statistical Society, Series B 54, 761–771.
[3] Hadi, A.S. (1994). A modification of a method for the detection of outliers in multivariate samples, Journal of the Royal Statistical Society, Series B 56, 393–396.
NICK FIELLER
Neural Networks FIONN MURTAGH Volume 3, pp. 1387–1393 in Encyclopedia of Statistics in Behavioral Science ISBN-13: 978-0-470-86080-9 ISBN-10: 0-470-86080-4 Editors Brian S. Everitt & David C. Howell John Wiley & Sons, Ltd, Chichester, 2005
Neural Networks

Introduction

The two types of algorithms to be described can be said to be neuro-mimetic. The human or animal brain is composed of simple computing engines called neurons, and interconnections called synapses, which seem to store most of the information in the form of electrical signals. Thus, neural network algorithms usually avail of neurons (also called units) that perform relatively simple computing tasks, such as summing weighted inputs and then thresholding the value found, usually in a 'soft' manner. Soft thresholding is, loosely, approximate thresholding, and is counterposed to hard thresholding; a definition is discussed below. The second point of importance in neural network algorithms is that the interconnections between neurons usually play an important role. Since we are mostly dealing with software implementations of such algorithms, the role of electrical signals is quite simply taken up by a weight – a numeric value – associated with the interconnection. In regard to the categorization or clustering objectives that these algorithms can cater for, there is an important distinction to be made between, on the one hand, the multilayer perceptron and, on the other hand, the Kohonen self-organizing feature map. The multilayer perceptron (MLP) is an example of a supervised method, in that a training set of samples or items of known properties is used. Supervised classification is also known as discriminant analysis. The importance of this sort of algorithm is that data from the real world are presented to it (of course, in a very specific form and format), and the algorithm learns from this (i.e., iteratively updates its parameters or essential stored values). So, reminiscent of a human baby learning its first lessons in the real world, a multilayer perceptron can adapt by learning. The industrial and commercial end-objective is to have an adaptable piece of decision-making software on the production line to undertake quality control, or carry out some other decision-based task in some other context. The Kohonen self-organizing feature map is an example of an unsupervised method. Cluster analysis methods are also in this family of methods (see k-means Analysis). What we have in mind here is that the data, in some measure, sorts itself out or
‘self-organizes’ so that human interpretation becomes easier. The industrial and commercial end-objective here is to have large and potentially vast databases go some way towards facilitating human understanding of their contents. In dealing with the MLP, the single perceptron is first described – it can be described in simple terms – and subsequently the networking of perceptrons, in a set of interconnected, multiple layers to form the MLP. The self-organizing feature map method is then described. A number of examples are given of both the MLP and Kohonen maps. A short article on a number of neural network approaches, including those covered here, can be seen in [4]. A more in-depth review, including various themes not covered in this article, can be found in [7]. Taking the regression theme further are the excellent articles by Schumacher et al. [10] and Vach et al. [11].
Multilayer Perceptron

Perceptron Learning

Improvement of classification decision boundaries, or assessment of the assignment of new cases, have both been implemented using neural networks since the 1950s. The perceptron algorithm is due to work by Rosenblatt in the late 1950s. The perceptron, a simple computing engine which has been dubbed a 'linear machine' for reasons which will become clear below, is best related to supervised classification. The idea of the perceptron has been influential, and generalization in the form of multilayer networks will be looked at later. Let x be an input vector of binary values, o an output scalar, and w a vector of weights (or learning coefficients, initially containing arbitrary values). The perceptron calculates o = Σj wj xj. Let θ be some threshold. If o ≥ θ when we would have wished o < θ for the given input, then the given input vector has been incorrectly categorized. We therefore seek to modify the weights and the threshold. Set θ ← θ + 1 to make it less likely that the wrong categorization will take place again. If xj = 0, then no change is made to wj. If xj = 1, then set wj ← wj − 1 to lessen the influence of this weight.
If the output was found to be less than the threshold, when it should have been greater for the given input, then the reciprocal updating schedule is implemented. If a set of weights exist, then the perceptron algorithm will find them. A counter example is the exclusive-or, XOR problem, shown in Table 1. Here, it can be verified that no way exists to choose values of w1 and w2 to allow discrimination between the first and fourth vectors (on the one hand), and the second and third (on the other hand). Network designs which are more sophisticated than the simple perceptron can be used to solve this problem. The network shown in Figure 1 is a feedforward three-layer network. This type of network is the most widely used. It has directed links from all nodes at one level to all nodes at the next level. Such a network architecture can be related to a linear algebra formulation, as will be seen below. It has been found to provide a reasonably straightforward architecture
Table 1   The XOR, or exclusive-or, problem. Linear separation between the desired output classes, 0 and 1, using just one separating line is impossible

  Input vector    Desired output
  0, 0            0
  0, 1            1
  1, 0            1
  1, 1            0
Figure 1   The MLP showing a three-layer architecture (input layer, hidden layer, output layer) on the left, and one arc, with weight wij from node i to node j, on the right. The circles represent neurons (or units)
which can also carry out learning or supervised classification tasks quite well. For further discussion of the perceptron, Gallant [2] can be referred to for various interesting extensions and alternatives. Other architectures are clearly possible: directed links between all pairs of nodes in the network, as in the Hopfield model, directed links from later levels back to earlier levels as in recurrent networks, and so on. In a feedforward multilayer perceptron, learning, weight estimation (Figure 1, right), takes place through use of gradient descent. This is an iterative sequence of updates to the weights along the directed links of the MLP. Based on the discrepancy between obtained output and desired output, some fraction of the overall discrepancy is propagated backwards, layer by layer, in the MLP. This learning algorithm is also known as backpropagation. Its aim is to bring the weights in line with what is desired, which implies that the discrepancy between obtained and desired output will decrease. Often a large number of feed forward, and backpropagation updates, are needed in order to fix the weights in the MLP. There is no guarantee that the weights will be optimal. At best, they are locally suboptimal in weight space. It is not difficult to see that we can view what has been described in linear algebra terms. Consider the matrix W1 as defining the weights between all input layer neurons, i, and all hidden layer neurons, j . Next consider the matrix W2 as defining the weights between all hidden layer neurons, j , and all output layer neurons, k. For simplicity, we consider the case of the three-layer network but results are straightforwardly applicable to a network with more layers. Given an input vector, x, the values at the hidden layer are given by xW1 . The values at the output layer are then xW1 W2 . Note that we can ‘collapse’ our network to one layer by seeking the weights matrix W = W1 W2 . If y is the target vector, then we are seeking a solution of the equation xW = y. This is linear regression. The backpropagation algorithm described above would not be interesting for such a classical problem. Backpropagation assumes greater relevance when nonlinear transfer functions are used at neurons.
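Before moving on to nonlinear units, the perceptron update rule described above can be written out in a few lines. The following toy sketch (not taken from the source) runs it on the XOR patterns of Table 1; because no separating weights exist, the error count never reaches zero and the loop simply stops after a fixed number of passes.

```python
# Perceptron update rule from the text, applied to the XOR patterns of Table 1.
# Binary inputs; theta is the threshold. A toy sketch: for XOR the rule never
# converges, illustrating the counterexample discussed above.
patterns = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
w = [0.0, 0.0]
theta = 0.0

for epoch in range(25):
    errors = 0
    for x, target in patterns:
        o = sum(wj * xj for wj, xj in zip(w, x))
        fired = o >= theta
        if fired and target == 0:          # fired, but should not have
            theta += 1
            w = [wj - xj for wj, xj in zip(w, x)]
            errors += 1
        elif not fired and target == 1:    # did not fire, but should have
            theta -= 1
            w = [wj + xj for wj, xj in zip(w, x)]
            errors += 1
    if errors == 0:
        break

print("weights:", w, "threshold:", theta, "errors in last pass:", errors)
```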
Generalized Delta Rule for Nonlinear Units

Nonlinear transformations are less tractable mathematically but may offer more sensitive modeling of real data. They provide a more faithful modeling
Neural Networks of electronic gates or biological neurons. Now, we will amend the delta rule introduced in the preceding section to take into consideration the case of a nonlinear transfer function at each neuron. Consider the accumulation of weighted values at any neuron. This is passed through a differentiable and nondecreasing transformation. In the matrix formalism discussed above, this is just a linear stretch or rescaling function. For a nonlinear alternative, this transfer function is usually taken as a sigmoidal one. If it were a step function, implying that thresholding is carried out at the neuron, this would violate our requirement for a differentiable function. Taking the derivative of the transfer function, in other words, finding its slope (in a particular direction), is part and parcel of the backpropagation or gradient descent optimization commonly used. A sigmoidal function is an elongated ‘S’ function, which is not allowed to buckle backwards. One possibility is the function y = (1 + e−x )−1 . Another choice is the hyperbolic tangent or tanh function: for x > 20.0, y = +1.0; for y < −20.0, y = −1.0; otherwise y = (ex − e−x )/(ex + e−x ). Both of these functions are invertible and continuously differentiable. Both have semilinear zones which allow good (linear) fidelity to input data. Both can make ‘soft’ or fuzzy decisions. Finally, they are similar to the response curve of a biological neuron. A ‘pattern’ is an input observation vector, for instance, a row of an input data table. The overall training algorithm is as follows: present pattern; feed forward through successive layers; backpropagate, that is, update weights; repeat. An alternative is to determine the changes in weights as a result of presenting all patterns. This so-called ‘off-line’ updating of weights is computationally more efficient but loses out on the adaptivity of the overall approach. We have discussed how discrepancy between obtained and desired output is used to drive or direct the learning. The discrepancy is called the error, and is usually a squared Euclidean distance, or a mean square error (i.e., mean squared Euclidean distance). Say that we are using binary data only. Such data could correspond to some qualitative or categorical data coding. In regard to outputs obtained, in practice, one must use approximations to hard-limiting values of 0, 1; for example, 0.2, 0.8 can be used. Any output value less than or equal to 0.2 is taken as tantamount
to 0, and any output value greater than or equal to 0.8 is taken as tantamount to 1. If we are dealing instead with quantitative, real-valued data, then we clearly have a small issue to address: a sigmoidal 'squashing' transfer function at the output nodes means that output values will be between 0 and 1. To get around this limitation, we can use sigmoidal transfer functions at all neurons save the output layer, where we use a linear transfer function at each neuron. The MLP architecture using the generalized delta rule can be very slow, for the following reasons. The compounding effect of sigmoids at successive levels can lead to a very flat energy surface, so movement towards fixed points is not easy. A second reason for slowness when using the generalized delta rule is that the weights all tend to try to attain the same value together. This is unhelpful – it would be better for weights to prioritize themselves when changing value. An approach known as cascade correlation, due to Scott Fahlman, addresses this by introducing weights one at a time. The number of hidden layers in an MLP and the number of nodes in each layer can vary for a given problem. In general, more nodes offer greater sensitivity to the problem being solved, but also the risk of overfitting. The network architecture which should be used in any particular problem cannot be specified in advance. A three-layer network with the hidden layer containing, say, fewer than 2m neurons, where m is the number of values in the input pattern, can be suggested. Other optimization approaches and a range of examples are considered in [4, 7]. Among books, we recommend [1], [3], and [9].
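The transfer functions and the 'soft' output convention just described are easy to illustrate. The sketch below is for illustration only; the saturation constants and the 0.2/0.8 cut-offs are taken from the text, and the helper names are arbitrary.

```python
import numpy as np

def sigmoid(x):
    """Logistic transfer function y = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

def tanh_clipped(x):
    """Hyperbolic tangent with the saturation convention quoted in the text."""
    return np.where(x > 20.0, 1.0, np.where(x < -20.0, -1.0, np.tanh(x)))

def harden(outputs, low=0.2, high=0.8):
    """Map 'soft' outputs to 0/1 using the 0.2/0.8 convention; values in
    between are left as NaN (undecided) in this sketch."""
    out = np.full_like(outputs, np.nan, dtype=float)
    out[outputs <= low] = 0.0
    out[outputs >= high] = 1.0
    return out

x = np.array([-25.0, -1.0, 0.0, 1.0, 25.0])
print(sigmoid(x))
print(tanh_clipped(x))
print(harden(sigmoid(x)))
```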
Example: Inferring Price from Other Car Attributes

The source of the data used was the large machine learning datasets archive at the University of California Irvine. The origin of the data was import yearbook and insurance reports. There were 205 instances or records. Each instance had 26 attributes, of which 15 were continuous, 1 integer, and 10 nominal. Missing attribute values were denoted by '?'. Table 2 describes the attributes used. Some explanations on Table 2 follow. 'Symboling' refers to the actuarial risk factor symbol associated with the price (+3: risky; −3: safe).
Table 2   Description of variables in car data set

  Attribute               Attribute range
  1. symboling            −3, −2, −1, 0, 1, 2, 3
  2. normalized-losses    continuous from 65 to 256
  3. make                 alfa-romeo, audi, bmw, chevrolet, dodge, honda, isuzu, jaguar, mazda, mercedes-benz, mercury, mitsubishi, nissan, peugeot, plymouth, porsche, renault, saab, subaru, toyota, volkswagen, volvo
  4. fuel-type            diesel, gas
  5. aspiration           standard, turbo
  6. num-of-doors         four, two
  7. body-style           hardtop, wagon, sedan, hatchback, convertible
  8. drive-wheels         4wd, fwd, rwd
  9. engine-location      front, rear
  10. wheel-base          continuous from 86.6 to 120.9
  11. length              continuous from 141.1 to 208.1
  12. width               continuous from 60.3 to 72.3
  13. height              continuous from 47.8 to 59.8
  14. curb-weight         continuous from 1488 to 4066
  15. engine-type         dohc, dohcv, l, ohc, ohcf, ohcv, rotor
  16. num-of-cylinders    8, 5, 4, 6, 3, 12, 2
  17. engine-size         continuous from 61 to 326
  18. fuel-system         1bbl, 2bbl, 4bbl, idi, mfi, mpfi, spdi, spfi
  19. bore                continuous from 2.54 to 3.94
  20. stroke              continuous from 2.07 to 4.17
  21. compression-ratio   continuous from 7 to 23
  22. horsepower          continuous from 48 to 288
  23. peak-rpm            continuous from 4150 to 6600
  24. city-mpg            continuous from 13 to 49
  25. highway-mpg         continuous from 16 to 54
  26. price               continuous from 5118 to 45 400
number 2: relative average loss payment per insured vehicle year, normalized for all cars within a particular size classification. As a sample of the data, the first three records are shown in Table 3. The objective was to predict the Table 3
price of a car using all other attributes. The coding and processing carried out by us is summarized as follows. Records with missing prices were ignored, leaving 201 records. All categorical attribute values were mapped onto 1, 2, . . . . Missing attribute values were set to zero. Price attribute (to be predicted) was discretized into 10 classes. All nonmissing attributes were normalized to lie in the interval [0.1, 1.0]. A test set of 28 uniformly sequenced cases was selected; a training set was given by the remaining 173 cases. A three-layer MLP was used, with 25 inputs and 10 outputs. The number of units (neurons) in the hidden layer was assessed on the training set, and covered 4, 8, 12, . . . , 36, 40 neurons. For comparison, a k-Nearest Neighbors discriminant analysis was also carried out with values of k = 1, 2, 4. (K-Nearest Neighbors involves taking a majority decision for assignment, among the class assignments found for the k Nearest neighbors. Like MLP, it is a nonlinear mapping method.) Best performance was found on the training set using 28 hidden layer neurons. Results obtained were as follows, – learning on the training set and generalization on the test set. MLP 25–28–10 network: Learning: 96.53% Generalization: 71.43% k-Nearest Neighbors k = 1: Learning: 97.64%, Generalization: 67.86% k = 2: Learning: 85.54%, Generalization: 71.43% k = 4: Learning: 73.99%, Generalization: 71.43% Hence, the MLP’s generalization ability was better than that of 1-Nearest Neighbor’s, but equal to that of 2- and 3-Nearest Neighbor’s. Clearly, further analysis (feature selection, data coding) could be useful. An advantage of the MLP is its undemanding requirements for coding the input data.
Sample of three records from car data set
3, ?, alfa-romeo, gas, std, two, convertible, rwd, front, 88.60, 168.90, 64.10, 48.80, 2548, dohc, four, 130, mpfi, 3.47, 2.68, 9.00, 111, 5000, 21, 27, 13 495 3, ?, alfa-romeo, gas, std, two, convertible, rwd, front, 88.60, 168.90, 64.10, 48.80, 2548, dohc, four, 130, mpfi, 3.47, 2.68, 9.00, 111, 5000, 21, 27, 16 500 1, ?, alfa-romeo, gas, std, two, hatchback, rwd, front, 94.50, 171.20, 65.50, 52.40, 2823, ohcv, six, 152, mpfi, 2.68, 3.47, 9.00, 154, 5000, 19, 26, 16 500
Neural Networks In the example just considered, we had 201 cars, with measures on 25 attributes. Let us denote this data as matrix X, of dimensions 201 × 25. The weights between input layer and hidden layer can be represented by matrix U , of dimensions 25 × 28. The weights between hidden layer and output layer can be represented by matrix W , of dimensions 28 × 10. Let us take the sigmoid function used as f (z) = 1/(1 + e−z ), acting elementwise on its argument, z. The sigmoid function can be straightforwardly extended for use on a matrix (i.e., set of vectors, z). We will denote the outputs as Y , a 201 × 10 matrix, giving assignments – close to 0 or close to 1 – for all 201 cars, with respect to 10 price categories. Then, finally, our trained MLP gives us the following nonlinear relationship: Y = f ((f (XU ))W )
5
(a)
(1)
The Kohonen Self-organizing Feature Map A self-organizing map can be considered as a display grid where objects are classified (Figure 2). In such a grid, similar objects are located in the same area. In the example (Figure 2, right), a global classification is shown: the three different shapes are located in three different clusters. Furthermore, the largest objects are located towards the center of the grid, and each cluster is ordered: the largest objects are at one side of a cluster, smaller shapes are at the other side. This leads to an interesting and effective use of the Kohonen self-organizing feature map for both the presentation of information, and as an interactive user interface. As such, it constitutes an active and responsive output from the Kohonen map algorithm. A self-organizing map is constructed as follows. •
• • •
Each object is described by a vector. In Figure 2, the vector has two components: the first one corresponds to the number of angles, and the other one to the width of the area. Initially, a vector is randomly associated with each box (or node) of the grid. Each document is located in a box whose descriptive vector is the most similar to the object’s vector. During an iterative learning process, the components of the nodes describing vectors are
(b)
Figure 2 Notional view of Kohonen self-organizing feature map. Object locations. (a) before learning. (b) after learning
modified. The learning process produces the classification. The Kohonen self-organizing feature map (SOFM) with a regular grid output representational or display space, therefore involves determining cluster representative vectors, such that observation vector inputs are parsimoniously summarized (clustering objective); and in addition, the cluster representative vectors are positioned in representational space so that similar vectors are close (low-dimensional projection objective) in the representation space.
6
Neural Networks
The output representation grid is usually a square, regular grid. ‘Closeness’ is usually based on the Euclidean distance. The SOFM algorithm is an iterative update one, not similar to other widely used clustering algorithms (e.g., k-means clustering). In representation space, a neighborhood of grid nodes is set at the start. Updates of all cluster centers in a neighborhood are updated at each update iteration. As the algorithm proceeds, the neighborhood is allowed to shrink. The result obtained by the SOFM algorithm is suboptimal, as also is the case usually for clustering algorithms of this sort (again, for example, k-means). A comparative study of SOFM performance can be found in [5]. An evaluation of its use as a visual user interface can be found in [6, 8].
References [1] [2]
[3] [4]
Bishop, C.M. (1995). Neural Networks for Pattern Recognition, Oxford University Press. Gallant, S.I. (1990). Perceptron-based learning algorithms, IEEE Transactions on Neural Networks 1, 179–191. MacKay, D.J.C. (2003). Information Theory, Inference, and Learning Algorithms, Cambridge University Press. Murtagh, F. (1990). Multilayer perceptrons for classification and regression, Neurocomputing 2, 183–197.
[5]
Murtagh, F. (1994). Neural network and related massively parallel methods for statistics: a short overview, International Statistical Review 62, 275–288. [6] Murtagh, F. (1996). Neural networks for clustering, in Clustering and Classification, P., Arabie, L.J., Hubert & G., De Soete, eds, World Scientific. [7] Murtagh, F. & Hern´andez-Pajares, M. (1995). The Kohonen self-organizing feature map method: an assessment, Journal of Classification 12, 165–190. [8] Poin¸cot, P., Lesteven, S. & Murtagh, F. (2000). Maps of information spaces: assessments from astronomy, Journal of the American Society for Information Science 51, 1081–1089. [9] Ripley, B.D. (1996). Pattern Recognition and Neural Networks, Cambridge University Press. [10] Schumacher, M., Roßner, R. & Vach, W. (1996). Neural networks and logistic regression: Part I, Computational Statistics and Data Analysis 21, 661–682. [11] Vach, W., Roßner, R. & Schumacher, M. (1996). Neural networks and logistic regression: Part II, Computational Statistics and Data Analysis 21, 683–701.
Further Reading Murtagh, F., Taskaya, T., Contreras, P., Mothe, J. & Englmeier, K. (2003). Interactive visual user interfaces: a survey, Artificial Intelligence Review 19, 263–283.
FIONN MURTAGH
Neuropsychology YANA SUCHY Volume 3, pp. 1393–1398 in Encyclopedia of Statistics in Behavioral Science ISBN-13: 978-0-470-86080-9 ISBN-10: 0-470-86080-4 Editors Brian S. Everitt & David C. Howell John Wiley & Sons, Ltd, Chichester, 2005
Neuropsychology Neuropsychology is a scientific and clinical discipline that is concerned with two related issues that are, at least theoretically, methodologically, and statistically, distinct. These are: (a) Understanding brain-behavior relationship, and (b) the development and application of methods (including statistical methods) that allow us to assess behavioral manifestations of various types of brain dysfunction. To illustrate how these clinical and scientific interests come together practically, let us follow a hypothetical case scenario. Imagine a man, let us call him Mr. Fixit, who, while repairing a loose gutter 30 feet above the ground, falls from his ladder. Let us further imagine that this man is fortunate enough to fall into a pile of dry leaves. The man escapes this accident without fractures or internal injuries, although he does fall unconscious briefly. He regains consciousness within a few minutes, feeling woozy, shaky, and sick in his stomach. His wife takes him to the hospital, where he is diagnosed with a concussion and released, with the instruction to take it easy for a few days until the wooziness and nausea go away. Let us further imagine that this man is an owner of a small business, and that his daily activities require him to organize and keep in mind many details as he receives inquiries from potential customers, questions from his five employees, and quotes from suppliers. The man returns to work after a few days of rest, looking forward to catching up on paperwork, phone calls, purchases, and business meetings. By the end of the first week, however, he feels frustrated, irritated, exhausted, and no closer to catching up than he was when he first returned to work. By the end of the second week, the situation is no better: In fact, not only is he nowhere close to catching up but he is also falling further behind. After a couple of more weeks, the man’s wife convinces him that these difficulties seem to be linked to his fall, and that he should see his doctor. Upon his wife’s urging, Mr. Fixit goes back to the neurologist who first examined him, but his visit is not very satisfying: He is told to give it time, that his ‘symptoms’ are normal, and that he should not worry about it. To complicate matters further, Mr. Fixit has noticed that his ladder has a manufacturing defect that may have caused his fall, and is now considering legal action. And, so, to bring this story
to the topic of this article, Mr. Fixit ends up in a neuropsychologist’s office, asking whether his fall might have permanently damaged his brain in such a way as to cause him the difficulties he has been experiencing. How would a neuropsychologist go about answering such a question? To avoid an unduly complicated discussion, suffice it to say that the neuropsychologist first needs to have a working knowledge of the topics that are subsumed under the heading of ‘brain-behavior relationship’. For example, the neuropsychologist must understand what type of damage typically occurs in a person’s brain as a result of a fall such as the one suffered by Mr. Fixit, and what behaviors are typically affected by such damage. Second, the neuropsychologist must possess ‘assessment instruments’ that allow him to examine Mr. Fixit and interpret the results. The remainder of this chapter focuses on the statistical methods that allow neuropsychologists to both gather knowledge about brain-behavior relationship and to develop appropriate assessment instruments. We will return to the specifics of Mr. Fixit’s case to illustrate methodological and statistical issues pertinent to both experimental and clinical neuropsychology.
Statistical Methods Used for Examination of Brain-behavior Relationship The first question that a neuropsychologist examining Mr. Fixit needs to answer is whether it is realistic that Mr. Fixit’s fall resulted in the symptoms that Mr. Fixit describes. To answer questions of this type, neuropsychologists conduct studies in which they compare test performances of individuals who sustained brain injury to healthy controls. Such studies typically use a simple independent t Test, or, if more than two groups are compared, a one-way Analysis of Variance (ANOVA). Past studies utilizing such comparisons have demonstrated that patients who have sustained a mild traumatic brain injury (similar to the one sustained by Mr. Fixit) may exhibit difficulties with memory and attention, or difficulties in organization and planning. Although simple in their design, comparisons of this kind continue to yield invaluable information about the specific nature of patients’ deficits. For example, Keith Cicerone [4] recently examined different types of attention in a group of patients similar
2
Neuropsychology
to Mr. Fixit. Independent t Tests showed that, from among seven different measures of attention, only those that required vigilant processing of rapidly presented information were performed more poorly among brain-injured, as compared to healthy, participants. In contrast, brief focusing of attention was no different for the two groups. However, because different measures have different discriminability, a failure to find differences between groups with respect to a particular variable (such as the ability to briefly focus attention) can be difficult to interpret. In particular, it is not clear whether such a failure truly suggests that patients are unimpaired, or whether it suggests inadequate measurement or inadequate statistical power. One way to address this question is to examine interaction effects using factorial ANOVA (see Factorial Designs). As an example, Schmitter-Edgecombe et al. [9] compared brain-injured and non–braininjured participants on several indices of memory. They conducted a mixed factor ANOVA with group (injured vs. noninjured) as a between-subjects factor and different components of memory performance as within-subjects factors. Their results yielded a main effect of group, showing that brain-injured patients were generally worse on memory overall. However, the presence of an interaction between group and memory component also allowed the researchers to state with a reasonable degree of certainty that certain facets of memory processing were not impaired among brain-injured patients. In addition to understanding what kinds of deficits are typical among patients who have sustained a brain injury, it is equally important to understand the time course of symptom development and symptom recovery? In other words, is it realistic that Mr. Fixit’s symptoms would not have improved or disappeared by the time he saw the neuropsychologist? And, relatedly, should he be concerned that some symptoms will never disappear? Studies that address these questions generally rely on Repeated Measures Analysis of Variance, again using group membership as a between-subjects factor and time since injury (e.g., one week, one month, six months) as a withinsubjects factor. This method tells us whether either, or both, groups improved in subsequent testing sessions (a within-subjects effect), as well as whether the two groups differ from one another at all, or on only some testing sessions (a between-subjects effect).
As an example, Bruce and Echemendia [3], in their study of sports-related brain injuries, applied this technique using group membership (brain-injured vs. non-brain-injured athletes) as the between-subjects factor and time since injury (2 hours, 48 hours, 1 week, and 1 month) as the within-subjects factor. They found that, whereas healthy participants’ performance improved on second testing (due to practice effects), no such improvement was present for the brain-injured group. Essentially, this means that memory performance declined initially in the brain-injured group during the period of 2 to 48 hours, likely due to acute effects of injury such as brain swelling. The results further showed that brain-injured patients’ memory abilities recovered significantly by the time of the one-month follow-up assessment. Thus, Mr. Fixit’s physician was correct in encouraging Mr. Fixit to ‘give it more time’. Addressing Confounds. As is the case with much of clinical research, research in neuropsychology typically relies on quasi-experimental designs. It is well recognized that such designs are subject to a variety of confounds (see Confounding Variable). Because much research in clinical neuropsychology is based on datasets that come from actual patients who were examined initially for clinical purposes rather than research purposes, confounds in neuropsychology are perhaps even more troublesome than in some other clinical areas. Thus, for example, although the studies presented up to this point demonstrate that patients like Mr. Fixit exhibit poorer performances than healthy controls with respect to attention and memory, we cannot be absolutely certain that all of the observed group differences are due to the effects of brain injury. Instead, we might argue that people who fall from ladders or are otherwise ‘accident prone’ tend to suffer from attentional and memory problems even prior to their injuries. In fact, perhaps these accident-prone individuals have always been somewhat forgetful and inattentive, and perhaps these were the very characteristics that caused them to suffer their injuries in the first place. Bruce and Echemendia [3], in their study of sports-related brain injury, found just such a scenario. Specifically, the athletes who sustained head injuries had lower premorbid (i.e., preinjury) SAT scores than those in a control group. This scenario poses the following questions: What is responsible for the poor
memory and attentional performances of the brain-injured group? Is it the lower premorbid functioning (reflected in their lower SAT scores), or is it the brain injury? Addressing this question statistically is not as simple as one might think. On the one hand, some investigators attempt to address this problem by using the Analysis of Covariance (ANCOVA), as did Bruce and Echemendia [3] in their study when they used the SAT scores as a covariate. On the other hand, Miller and Chapman [6] argue that the only appropriate use of ANCOVA is to account for random variation on the covariate, not to ‘control’ for differences between groups. However, while it is true that the results of an ANCOVA regarding premorbid differences are not easy to interpret, some investigators maintain that conducting an ANCOVA in this situation is appropriate, as it might at least afford ruling out the effect of the covariate. In particular, if group differences continue to be present even after a covariate is added to the analysis, it is clear that the covariate does not account for such differences. If, on the other hand, adding a covariate abolishes a given effect, the results cannot be readily interpreted. To avoid this controversy, some researchers match groups on a variety of potential confound variables, most commonly age, education, and gender. This approach, however, is not controversy-free either, as it may lead to samples that are unrealistically constrained or are not representative of the population of interest. For example, it is well understood that mild traumatic brain injury patients who are involved in litigation (as Mr. Fixit might become, due to his defective ladder) perform more poorly on neuropsychological tests than those who are not involved in litigation [8]. It is also well understood that a substantial subgroup of litigating patients consciously or unconsciously either feign or exaggerate their cognitive difficulties [1]. Thus, researchers who are interested in understanding the nature of cognitive deficits resulting from a traumatic brain injury often remove the litigating patients from their sample. However, it is possible that these patients litigate because of continued cognitive difficulties. For example, would Mr. Fixit even think of litigation (regardless of whether he noticed the defect on his ladder) if he were not experiencing difficulties at work? Thus, by excluding these patients from our samples, we may be underestimating the extent of
deleterious effects of traumatic brain injury on cognition. Because of these statistical and methodological controversies, accounting for confounds is a constant struggle both in neuropsychology research and in neuropsychology practice. In fact, these issues will figure largely in the mind of Mr. Fixit’s neuropsychologist who attempts to determine the true nature and severity of Mr. Fixit’s difficulties.
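As an illustration of the covariate-adjustment strategy debated above, here is a minimal ANCOVA sketch with statsmodels; the data are simulated, and the SAT-like premorbid score and column names are hypothetical stand-ins rather than variables from any cited study.

```python
# A minimal sketch, not the Bruce & Echemendia analysis itself: an ANCOVA that
# tests the group effect on a memory score while entering a simulated premorbid
# (SAT-like) score as a covariate. All names and values are illustrative.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(1)
n = 40
sat = rng.normal(1000, 100, size=2 * n)
group = np.repeat(["injured", "control"], n)
# Memory depends on premorbid ability plus a group effect (simulated)
memory = 0.03 * sat + np.where(group == "injured", -4.0, 0.0) + rng.normal(0, 3, 2 * n)
df = pd.DataFrame({"memory": memory, "sat": sat, "group": group})

ancova = smf.ols("memory ~ C(group) + sat", data=df).fit()
print(anova_lm(ancova, typ=2))   # group effect adjusted for the covariate
```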
Neuropsychological Assessment Instruments Understanding the realistic nature and course of Mr. Fixit’s difficulties allows the clinician to select a proper battery of tests for Mr. Fixit’s assessment. However, there are hundreds of neuropsychological instruments available on the market. How does a neuropsychologist choose the most appropriate ones for a given case? Research and statistical methods that address different aspects of this question are described below.
Validity The first step in selecting assessment instruments is to select instruments that measure the constructs of interest (see Validity Theory and Applications). In this case, the various functional domains that might have been affected by Mr. Fixit’s fall, such as attention and memory, represent our constructs. While some constructs and the related assessment instruments are relatively straightforward and require relatively little validation past that conducted as part of the initial test development, others are complex or even controversial, requiring continual reexamination. An example of such a complex construct is a functional domain called executive abilities. Executive abilities allow people to make intelligent choices, avoid impulsive and disorganized actions, plan ahead, and follow through with plans. Research shows that executive abilities are sometimes compromised following a traumatic brain injury. Given Mr. Fixit’s difficulties managing his small business, it is quite likely that impaired executive functions are to blame. Assessment of executive abilities is arguably among the greatest challenges in clinical neuropsychology, with much research devoted to the validation of measures that assess this domain. A common statistical method for addressing the question of a test’s construct validity is a factor
analysis (see Factor Analysis: Exploratory). Factor analysis can demonstrate that the items on a test follow a certain theoretically cogent structure. Typically, a test’s factor structure is examined during the initial test development process. However, it is sometimes the case that factor analysis research may offer new insights even after a test has been in clinical use for some time. For example, based on their day-in and day-out experiences with patients, clinicians may sometimes become aware of certain qualitative aspects of test performance that may be related to specific disease processes. For these clinical impressions to be confirmed, a formal validation is required. An example of such a validation study was that of Osmon and Suchy [7] who examined several qualitative aspects of performance on the Wisconsin Card Sorting Test, a popular test of executive functioning. Factor analyses yielded three factors, demonstrating that different types of errors represent separate components of executive processing. Thus, the results support the notion that the Wisconsin Card Sorting Test assesses more than one executive ability, which is in line with the conceptualization of executive functions as a nonunitary construct. In addition to examining the factor structure of a test, validation can also be accomplished by examining whether a measure correlates with other instruments that are known to measure the same construct. A study conducted by Bogod and colleagues [2] is an example of this approach. Specifically, these researchers were interested in determining the validity of a measure of self-awareness (called SADI). Self-awareness, or awareness of deficits, is sometimes impaired in patients who have suffered a traumatic brain injury. Bogod and colleagues used such patients to examine whether SADI scores correlated with another index of self-awareness. These analyses yielded the first tentative support for the measure’s construct validity. In particular, the analyses showed that, although the SADI did not correlate with the deficits reported by the patient or the patients’ relatives, it did correlate with the degree of discrepancy between the patients’ and the relatives’ reports. In other words, the measure was related not to the severity of one’s disability, but rather to one’s inability to accurately appraise one’s disability. Additionally, because deficits in self-awareness are, at least theoretically, related to deficits in executive functioning, Bogod and colleagues examined the relationship
between the SADI and measures of executive functions. Medium-sized correlations provided additional support for the measure, as well as for the theoretical conceptualization of the construct of self-awareness and its relationship to other cognitive abilities. However, demonstrating that a particular test measures a particular cognitive construct does not address the question of whether a deficit on a test relates to real-life abilities. In other words, does a test predict how a person will function in everyday life? This is often referred to as the test’s ‘ecological validity’. As an example, Suchy and colleagues [10] determined whether an executive deficit identified during the course of hospitalization predicted patients’ daily functioning following discharge. These researchers used a series of Stepwise Multiple Regressions (see Multiple Linear Regression) in which different behavioral outcomes (e.g., whether patients prepare their own meals, take care of their hygiene needs, or manage their own finances) served as criterion variables and three factors of the Behavioral Dyscontrol Scale (BDS), a measure of executive functioning, served as predictors. Because Stepwise Multiple Regressions can be used to generate parsimonious models of prediction, the authors showed not only that the BDS was a significant predictor of actual daily functioning, but also which components of the measure were the best predictors. A caution about Stepwise Multiple Regressions is that they can unduly capitalize on the idiosyncrasies of the sample. As a result, even a small difference in sample characteristics can lead to a variable’s exclusion or inclusion in the final model. For that reason, it is a good idea to test the stability of generated models by repeating the analyses with random subsamples drawn from the original sample. Discrepancies in results must be resolved both statistically and theoretically prior to drawing final conclusions. Finally, because patients and settings vary widely across different neuropsychological studies, it is common for a single measure to yield a variety of contradictory findings. For example, the sensitivity and specificity of the Wisconsin Card Sorting Test have been an ongoing source of controversy in neuropsychology research and practice. Some studies have found the measure to have good sensitivity and specificity for frontal lobe integrity (the presumed substrate of executive abilities), whereas others have shown no such relationship. Recently, Demakis [5] addressed the ongoing controversy about the test’s
utility by conducting a meta-analysis. The composite finding of 24 studies yielded small but statistically significant effect sizes, providing validation for both sides of the argument: Yes, the measure is related to frontal lobe functioning, and, yes, the relationship is weak.
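Before turning to classification accuracy, here is a minimal sketch of the subsample stability check recommended above for stepwise prediction models; the three hypothetical 'BDS factor' predictors are simulated, and cross-validated forward selection from scikit-learn stands in for classical p-value-based stepwise entry.

```python
# A minimal sketch of a stability check for data-driven predictor selection:
# repeat the selection on random subsamples and track how often each predictor
# is retained. Predictors and outcome are simulated for illustration only.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import SequentialFeatureSelector

rng = np.random.default_rng(2)
n = 120
X = rng.normal(size=(n, 3))                      # three hypothetical BDS factors
y = 0.6 * X[:, 0] + 0.1 * X[:, 2] + rng.normal(scale=1.0, size=n)

counts = np.zeros(3)
for _ in range(200):                             # repeat on random 70% subsamples
    idx = rng.choice(n, size=int(0.7 * n), replace=False)
    sfs = SequentialFeatureSelector(LinearRegression(),
                                    n_features_to_select=1, direction="forward")
    sfs.fit(X[idx], y[idx])
    counts += sfs.get_support()                  # which predictor entered first

print("selection frequency per predictor:", counts / 200)
```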
Clinical Utility Research and statistical methods described up to this point allow us to determine that, on average, patients who have suffered a traumatic brain injury perform more poorly than healthy controls on validated measures of executive functioning, attention, and memory. This means that Mr. Fixit’s difficulties might very well be real. However, we don’t know which measures are the best for making decisions about Mr. Fixit’s specific situation. To answer that question, researchers must examine how successful different measures are at patient classification. Traditionally, Discriminant Function Analysis (DFA) has been the statistic of choice for examining a measure’s classification accuracy. Although this procedure provides information about the overall hit rate and its statistical significance, these results do not always generalize to clinical settings. First, this method produces cutting scores that will maximize the overall hit rate, regardless of what the clinical needs are. In other words, in some clinical situations, it may be more useful to have a test with a high specificity, whereas, in other situations, a test of high sensitivity may be needed. Second, classifications are maximized by using discriminant function coefficients that require a conversion of the raw data. Such conversions are not always practical, and, as a result, are generally of little clinical use. In recent years, the Receiver Operating Characteristic (ROC) Curve has gained popularity over DFA in neuropsychology research. This procedure allows determination of exact sensitivity and specificity associated with every possible score on a given test. Based on ROC data, one can choose a cutting score with a high sensitivity to ‘rule in’ patients who are at risk for a disorder, high specificity to ‘rule out’ patients who are at low risk for a disorder, or an intermediate sensitivity and specificity to maximize the overall hit rate. Additionally, the ROC analysis allows determination of whether the classification rates afforded by the test are statistically significant.
To conduct an ROC analysis, one needs a sample of subjects for whom an external validation of ‘impaired’ is available. Depending on the setting in which a test is employed, a variety of criteria can be used. Some examples of external criteria of impairment include inability to succeed at work (as was the case with Mr. Fixit), inability to live independently (a common issue with the elderly), or an external diagnosis of a condition based on neuroradiology or other medical data. By conducting an ROC analysis with such a sample, empirically based cutting scores can be applied in clinical settings to make decisions about patients for whom external criteria of impairment are not known.
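A minimal sketch of such an ROC analysis, using simulated test scores and a simulated external criterion of impairment (all values are hypothetical):

```python
# A minimal sketch: sensitivity and specificity for every possible cutting score
# on a test, given an external criterion of impairment. Data are simulated.
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(3)
impaired = np.r_[np.ones(50), np.zeros(50)]                    # external criterion
score = np.r_[rng.normal(40, 10, 50), rng.normal(55, 10, 50)]  # lower score = worse

# Lower scores indicate impairment, so flip the sign before computing the curve
fpr, tpr, thresholds = roc_curve(impaired, -score)
sensitivity, specificity = tpr, 1 - fpr
print("AUC:", roc_auc_score(impaired, -score))

# Choose the first cut score reaching 90% sensitivity (to 'rule in' at-risk patients)
i = np.argmax(sensitivity >= 0.90)
print("cut =", -thresholds[i], "sens =", sensitivity[i], "spec =", specificity[i])
```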
Conclusions A variety of statistical techniques are needed to conduct research that allows a clinician to address even one clinical scenario. In our hypothetical scenario, t Tests, ANOVAs, and ANCOVAs helped our clinician to determine whether Mr. Fixit’s complaints were realistic and what cognitive difficulties might account for his complaints. Factor analyses, correlational analyses, meta-analyses, DFA, and ROC curve analyses helped determine what measures are most appropriate for assessing the exact nature of Mr. Fixit’s difficulties. Multiple regressions allowed predictions about the impact of his deficits on his everyday functioning.
References

[1] Binder, L. (1993). Assessment of malingering after mild head trauma with the Portland digit recognition Test, Journal of Clinical and Experimental Neuropsychology 15, 170–182.
[2] Bogod, N.M., Mateer, C.A. & MacDonald, S.W.S. (2003). Self-awareness after traumatic brain injury: a comparison of measures and their relationship to executive functions, Journal of the International Neuropsychological Society 9, 450–458.
[3] Bruce, J.M. & Echemendia, R.J. (2003). Delayed-onset deficits in verbal encoding strategies among patients with mild traumatic brain injury, Neuropsychology 17, 622–629.
[4] Cicerone, K.D. (1997). Clinical sensitivity of four measures of attention to mild traumatic brain injury, The Clinical Neuropsychologist 11, 266–272.
[5] Demakis, G. (2003). A meta-analytic review of the sensitivity of the Wisconsin card sorting Test to frontal and lateralized frontal brain damage, Neuropsychology 17, 255–264.
[6] Miller, G.M. & Chapman, J.P. (2001). Misunderstanding analysis of covariance, Journal of Abnormal Psychology 110(1), 40–48.
[7] Osmon, D.C. & Suchy, Y. (1996). Fractionating frontal lobe functions: factors of the Milwaukee card sorting Test, Archives of Clinical Neuropsychology 11, 541–552.
[8] Reitan, R.M. & Wolfson, D. (1996). The question of validity of neuropsychological test scores among head-injured litigants: development of a dissimulation index, Archives of Clinical Neuropsychology 11, 573–580.
[9] Schmitter-Edgecombe, M., Marks, W., Wright, M.J. & Ventura, M. (2004). Retrieval inhibition in directed forgetting following severe closed-head injury, Neuropsychology 18, 104–114.
[10] Suchy, Y., Blint, A. & Osmon, D.C. (1997). Behavioral dyscontrol scale: criterion and predictive validity in an inpatient rehabilitation unit sample, The Clinical Neuropsychologist 11, 258–265.
YANA SUCHY
New Item Types and Scoring FRITZ DRASGOW AND KRISTA MATTERN
Volume 3, pp. 1398–1401 in Encyclopedia of Statistics in Behavioral Science ISBN-13: 978-0-470-86080-9 ISBN-10: 0-470-86080-4 Editors Brian S. Everitt & David C. Howell John Wiley & Sons, Ltd, Chichester, 2005
New Item Types and Scoring Computers have been used to administer tests since the 1960s [2]. At first, computerized tests were just that; tests still consisted of traditional multiple-choice questions, but examinees read the items from a computer screen or teletype rather than a printed test booklet. Improvements in the technological capabilities of the computer over the decades have revolutionized the field of testing and assessment (see Computer-Adaptive Testing). Advances in hardware capabilities (e.g., memory, hard disks, video adaptor cards, sound cards) and software (e.g., graphical user interfaces, authoring tools) have enabled test developers to create a variety of new item types. Researchers and test developers have devised these new item types to assess skills and performance abilities not well measured by traditional item types, as well as to measure efficiently characteristics traditionally assessed by a multiple-choice format. In this entry, a review of many of these new item types is presented. Moreover, issues related to scoring innovative items are described. Because the area is rapidly growing and changing, it is impossible to give an exhaustive review of all developments; however, some of the most important advancements in new item types and related scoring issues are highlighted.
Simulations Recently, several licensing and credentialing boards have concluded that traditional multiple-choice items may adequately assess candidates’ factual knowledge of a given domain, but fail to measure the skills and performance abilities necessary for good job performance. For example, once candidates pass the Uniform Certified Public Accountant Examination (UCPAE), they, as accountants, do not answer multiple-choice questions on the job; instead they must often address vague and ambiguous problems posed to them by clients. As a result, many licensing boards have begun incorporating simulated workplace situations in their exams. These simulations are designed to faithfully reflect important challenges faced by employees on the job; adequate responses are required for satisfactory performance.
In April of 2004, the American Institute of Certified Public Accountants (AICPA) implemented a new computerized UCPAE comprising multiple-choice items and simulations. The simulations require examinees to perform the activities needed to answer questions posed by clients. Consequently, they may enter information into spreadsheets, complete cash flow reports, search the definitive literature (which is available as an on-line resource), and write a letter to the client that provides their advice and states its justification. Clearly, the advantage of this assessment format is that it provides an authentic test of the skills and abilities necessary to be a competent accountant. One disadvantage of the simulation format is that many of the items are highly interdependent due to the nature of the task. For example, a typical simulation, which might require about 20 responses, might require an examinee to enter several values in a balance sheet. If an examinee’s first entry in the balance sheet is incorrect, it is unlikely that his/her entry in the next cell will be correct. Having two highly intercorrelated items (i.e., the interitem correlation may exceed 0.90) is problematic because, essentially, the same question is asked twice. The examinee who correctly answers the first item is rewarded twice, whereas the candidate who answers the first item incorrectly is penalized twice. As a result, the AICPA is considering alternative scoring methods, such as not scoring redundant items or giving examinees one point if they correctly answer all parts of redundant sets of items and zero points if they do not. In 1999, the National Board of Medical Examiners (NBME) changed its licensing test to a computerized format, which included simulated patient encounters [5]. These simulations were developed to assess critical diagnostic skills required to be a competent physician. The candidate assumes the role of the physician and must treat simulated patients with presenting symptoms. On the basis of these symptoms, the candidate can request medical history, order a test, request a consultation, or order a treatment; virtually any action of a practicing physician can be taken. As in the real world, these activities take time. Therefore, if the candidate orders a test that requires an hour to complete, the candidate must advance an on-screen clock by one hour in order to obtain the results. Simultaneously, the patient’s symptoms and condition also progress by one hour. Clearly, this simulation format provides a highly authentic test of the
skills required to be a competent physician; however, scoring becomes very complicated because candidates can take any of literally thousands of actions.
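The all-or-none rule mentioned above for redundant sets of simulation responses can be sketched in a few lines; the item identifiers and keyed values below are hypothetical and are not drawn from the actual UCPAE.

```python
# A minimal sketch of the all-or-none scoring alternative for a redundant set of
# interdependent simulation responses: one point only if every entry matches.
def score_redundant_set(responses, key):
    """Return 1 if all keyed entries are answered correctly, else 0."""
    return int(all(responses.get(item) == answer for item, answer in key.items()))

key = {"cash_row_1": 1200, "cash_row_2": 800, "cash_row_3": 400}   # hypothetical key
print(score_redundant_set({"cash_row_1": 1200, "cash_row_2": 800, "cash_row_3": 400}, key))  # 1
print(score_redundant_set({"cash_row_1": 1000, "cash_row_2": 800, "cash_row_3": 400}, key))  # 0
```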
Items that Incorporate Media Early computerized test development efforts focused on multiple-choice items were presented in a format that included text and simple graphics [7]. However, incorporating media within a test has improved measurement in tests such as Vispoel’s [10] tonal memory computer adaptive test (CAT) that assesses musical aptitude, Ackerman, Evans, Park, Tamassia, and Turner’s [1]. Test of Dermatological Skin Disorders that assesses the ability to identify and diagnose different types of skin disorders, and Olson-Buchanan, Drasgow, Moberg, Mead, Keenan, and Donovan’s [8] Conflict Resolution Skills Assessment that measures the interpersonal skill of conflict resolution. Vispoel’s [10] tonal memory CAT improves on traditional tests of music aptitude that are administered via paper and pencil with a tape recorder controlled by the test administrator at the front of the room. Biases such as seat location and disruptive neighbors are eliminated because each examinee completes the test via computer with a pair of headsets, which offers superior sound quality to that of a tape cassette. Vispoel [10] demonstrated that fewer questions have to be administered because the test is adaptive, which reduced problems of fatigue and boredom; furthermore, the test pace is controlled by the examinee rather than the administrator. Many of the benefits of Vispoel’s tonal memory CAT are analogous to the benefits of Ackerman et al.’s Test of Dermatological Skin Disorders [1]. For example, this test, which scores examinees on their ability to identify skin disorders, is improved when computerized because the high-resolution color image on a computer monitor is superior to that of a slide projector. Moreover, the computerized test allows examinees to zoom on an image. Note that examinees seated at the back of the room are no longer at a visual disadvantage, and, the respondent, not the test administrator, controls the pace of item administration. Olson-Buchanan et al.’s [8] Conflict Resolution Skills Assessment presents video clips of realistic workplace problems a manager might encounter and then asks the examinee what he/she would do in each situation. Its developers believe that examinees become more psychologically involved in the
situation when watching the video clips than when reading passages. Chan and Schmitt [4] found that reading comprehension correlated 0.45 with a paper-and-pencil assessment of an interpersonal skill, but only 0.05 with a video-based version. Thus, it appears that the video format enables a more valid assessment of interpersonal skill by reducing the influence of cognitive ability. On the other hand, scoring these types of items presents a challenge. Is it best to use subject matter experts to determine the scoring or should empirical keying be used? Are the wrong options equally wrong or are some better than others? If so, how do you score them? Test developers are investigating the best way to score these item types.
Innovative Mathematical Item Types Traditional standardized tests assessing mathematical ability such as the American College Testing (ACT) and the Scholastic Assessment Test (SAT) rely solely on a multiple-choice format and, consequently, examinees have a high probability (25% if there are four options) of answering correctly by simply guessing. However, three novel mathematics item types, mathematical expressions, generating examples, and graphical modeling, require examinees to construct their own response [3], which reduces the probability of answering these problems correctly by guessing to roughly zero. Mathematical expressions are items that require examinees to construct a numerical equation to represent the problem [3]. For example, if x is the mean of 7, 8, and y, what is the value of y in terms of x? The obvious solution is y = 3x − 7 − 8; however, y = 3x − 15, y = 3x − (7 + 8), and y = (6x − 14 − 16)/2 are also correct. In fact, there are an infinite number of correct solutions, which makes scoring problematic. Fortunately, Bennett et al. [3] developed a scoring algorithm, which converts answers to their simplest form. In a test of the scoring algorithm, Bennett et al. [3] found the accuracy to be 99.6%. Generating examples are items that require examinees to construct examples that fulfill the constraints given in the problem [3]. For example, suppose x and y are positive integers and 2x + 4y = 20; provide two possible solution sets for x and y (Solution: x = 4, y = 3 and x = 6, y = 2). Again, there is more than one right answer, and therefore scoring is not as straightforward as with multiple-choice items.
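A minimal sketch of equivalence checking for such constructed responses, using SymPy to reduce the difference between a response and the keyed solution to simplest form; this illustrates the general idea only and is not the algorithm Bennett et al. actually used.

```python
# A minimal sketch: a constructed response is scored correct if it is
# algebraically equivalent to the keyed solution for the item described above.
import sympy as sp

x = sp.symbols("x")
key = 3 * x - 15                        # y = 3x - 15

def is_correct(response: str) -> bool:
    candidate = sp.sympify(response)
    return sp.simplify(candidate - key) == 0

print(is_correct("3*x - 7 - 8"))        # True
print(is_correct("(6*x - 14 - 16)/2"))  # True
print(is_correct("3*x - 14"))           # False
```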
Graphical modeling items require examinees to construct a graphical representation of the information provided in the problem [3]. For example, examinees may be asked to use a grid to draw a triangle that has an area of 12. As before, there are many solutions to this problem. These problems require the test developer to create and validate more extensive scoring rules than a simple multiple-choice key. Nonetheless, by requiring examinees to construct responses, Bennett et al. [3] showed that these item types have enriched the assessment of mathematical ability.
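For the graphical-modeling example above, a scoring rule need only verify the stated constraint rather than match a single keyed drawing; the sketch below checks the area of a drawn triangle with the shoelace formula (the grid coordinates are hypothetical).

```python
# A minimal sketch: score a graphical-modeling response by verifying the
# constraint (area of 12) rather than comparing against one keyed answer.
def triangle_area(p1, p2, p3):
    (x1, y1), (x2, y2), (x3, y3) = p1, p2, p3
    return abs(x1 * (y2 - y3) + x2 * (y3 - y1) + x3 * (y1 - y2)) / 2

response = [(0, 0), (6, 0), (0, 4)]     # one of many acceptable triangles
print(triangle_area(*response) == 12)   # True
```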
Innovative Verbal Item Types Like standardized tests of mathematical ability, most tests of verbal ability rely on multiple-choice items. However, recent innovations in the assessment of verbal ability may change the status quo. The following section describes two examples of innovation in verbal assessment: passage editing and essays with automated scoring. A traditional item type used to assess verbal ability requires examinees to read a sentence with a section underlined and choose from a list of options the alternative that best corrects the section grammatically. The problem with this type of item is that it points the examinee to the error by underlining it. Would examinees know there was an error if it were not pointed out to them? Passage editing bypasses this problem because it requires examinees to locate an error within a passage (nothing is underlined) as well as correct it [6, 11]. This results in a more rigorous test of verbal ability than the traditional format because it assesses both the ability to locate errors and the ability to correct them. It is difficult to dispute the argument that having an examinee write an essay is the most authentic test of writing ability. Unfortunately, essays are often omitted from standardized tests because they are costly and time-consuming to score. However, innovations in computerized scoring such as e-rater [9] make the inclusion of essays a realistic option. E-rater is a computerized essay grader that utilizes scoring algorithms based on the rules of natural language processing. In an interesting study examining the validity of e-rater, Powers et al. [9] invited a wide variety of people to submit essays that might be scored incorrectly by the computer. A Professor of Computational Linguistics received the highest possible score by
writing a paragraph and then repeating it 37 times. E-rater has since been improved, and Powers et al. [9] state that e-rater is now capable of correctly scoring most of the essays that had previously tricked it.
Conclusion Psychological testing and assessment has been revolutionized by the advent of the computer and, specifically, by improvements in the capabilities of modern computers. As a result, test developers have explored a wide variety of innovative approaches to assessment; however, there is a cost. Scoring innovative item types is rarely as straightforward as applying a multiple-choice key and is often quite complex. This review presents some of the newest innovations in item types along with related scoring issues. These innovations in item types and concomitant scoring algorithms represent genuine improvement in psychological testing.
References

[1] Ackerman, T.A., Evans, J., Park, K.S., Tamassia, C. & Turner, R. (1999). Computer assessment using visual stimuli: a test of dermatological skin disorders, in Innovations in Computerized Assessment, F. Drasgow & J.B. Olson-Buchanan, eds, Lawrence Erlbaum, Mahwah, pp. 137–150.
[2] Bartram, D. & Bayliss, R. (1984). Automated testing: past, present and future, Journal of Occupational Psychology 57, 221–237.
[3] Bennett, R.E., Morely, M. & Quardt, D. (2000). Three response types for broadening the conception of mathematical problem solving in computerized tests, Applied Psychological Measurement 24, 294–309.
[4] Chan, D. & Schmitt, N. (1997). Video-based versus paper-and-pencil method of assessment in situational judgment tests: subgroup differences in test performance and face validity perceptions, Journal of Applied Psychology 82, 143–159.
[5] Clyman, S.G., Melnick, D.E. & Clauser, B.E. (1999). Computer-based case simulations from medicine: assessing skills in patient management, in Innovative Simulations for Assessing Professional Competence, A. Tekian, C.H. McGuire & W.C. McGahie, eds, University of Illinois, Department of Medical Education, Chicago, pp. 29–41.
[6] Davey, T., Godwin, J. & Mittelholtz, D. (1997). Developing and scoring an innovative computerized writing assessment, Journal of Educational Measurement 34, 21–41.
[7] McBride, J.R. (1997). The Marine Corps exploratory development project: 1977–1982, in Computerized Adaptive Testing: From Inquiry to Operation, W.A. Sands, B.K. Waters & J.R. McBride, eds, American Psychological Association, Washington, pp. 59–67.
[8] Olson-Buchanan, J.B., Drasgow, F., Moberg, P.J., Mead, A.D., Keenan, P.A. & Donovan, M.A. (1998). Interactive video assessment of conflict resolution skills, Personnel Psychology 51, 1–24.
[9] Powers, D.E., Burstein, J.C., Chodorow, M., Fowles, M.E. & Kukich, K. (2002). Stumping e-rater: challenging the validity of automated essay scoring, Computers in Human Behavior 18, 103–134.
[10] Vispoel, W.P. (1999). Creating computerized adaptive tests of musical aptitude: problems, solutions, and future directions, in Innovations in Computerized Assessment, F. Drasgow & J.B. Olson-Buchanan, eds, Lawrence Erlbaum, Mahwah, pp. 151–176.
[11] Zenisky, A.L. & Sireci, S.G. (2002). Technological innovations in large-scale assessment, Applied Measurement in Education 15, 337–362.
FRITZ DRASGOW AND KRISTA MATTERN
Neyman, Jerzy DAVID C. HOWELL Volume 3, pp. 1401–1402 in Encyclopedia of Statistics in Behavioral Science ISBN-13: 978-0-470-86080-9 ISBN-10: 0-470-86080-4 Editors Brian S. Everitt & David C. Howell John Wiley & Sons, Ltd, Chichester, 2005
Neyman, Jerzy
Born: April 16, 1894, Bendery, Russia.
Died: August 5, 1981, Berkeley, CA.
Jerzy Neyman spent the first half of his life moving back and forth between Russia, Poland, England, and France, but spent the last half of his life firmly ensconced in Berkeley, California. In the process, he helped lay the foundation of modern statistical theory and built one of the major statistics departments in the United States. Jerzy Neyman was born to a Polish family in Russia in 1894. His father was an attorney, and Neyman was educated at home until he was 10, at which point he was fluent in five languages. When his father died, the family moved to Kharkov, where he attended school and eventually university. While at university, he read the work of Lebesgue and proved five theorems on the Lebesgue integral on his own. At the end of the First World War, he was imprisoned briefly. He remained at Kharkov, teaching mathematics until 1921, at which point he moved to Poland. He took a position as a senior statistical assistant at the Agricultural Institute but soon moved to Warsaw to pursue his PhD, which he received in 1924. In 1925, Neyman won a Polish Government Fellowship to work with Karl Pearson in London, but he was disappointed with Pearson’s knowledge of mathematical statistics. In the next year, he won a Rockefeller Fellowship to study pure mathematics in Paris. While in London, Neyman had formed a friendship with Pearson’s son Egon Pearson, and they began to collaborate in 1927 while Neyman was still in Paris. This collaboration had profound effects on the history and theory of statistics. In 1928, Neyman returned again to Poland to be head of the Biometric Laboratory at the Nencki Institute of Warsaw. He remained there until returning to London in 1934 but continued his collaboration with Egon Pearson throughout that period. The Neyman–Pearson approach to hypothesis testing is outlined elsewhere (see Neyman–Pearson Inference) and will not be elaborated on here. But it is worth pointing out that their work developed the concepts of the alternative hypothesis, power,
Type II errors, ‘critical region’, and the ‘size’ of the critical region, which they named the significance level. Neyman had also developed the concept of a ‘confidence interval’ while living in Poland, though he did not publish that work until 1937. In 1934, Neyman returned to London to take a position in Pearson’s department, and they continued to publish together for another four years. They clashed harshly with Fisher, who was at that time the head of the Eugenics department at University College, which was the other half of Karl Pearson’s old department. Those battles are well known to statisticians and were anything but polite – especially on Fisher’s side. In 1938, Neyman accepted a position of Professor of Mathematics at Berkeley and moved to the United States, where he was to remain. He established and directed the Statistics Laboratory at Berkeley, and in 1955 he founded and chaired the Department of Statistics. O’Connor and Robertson [2] note that his efforts at establishing a department were not always appreciated by all at the University. An assistant to the President of Berkeley once wrote: Here, a willful, persistent and distinguished director has succeeded, step by step over a fifteen-year period, against the wish of his department chairman and dean, in converting a small ‘laboratory’ or institute into, in terms of numbers of students taught, an enormously expensive unit; and he argues that the unit should be renamed a ‘department’ because no additional expense will occur.
That quotation will likely sound familiar to those who are members of academic institutions. In 1949, he published the first report of the Symposium of Mathematical Statistics and Probability, and this symposium became the well-known Berkeley Symposium, which continued every five years until 1971. Additional material on Neyman’s life and influence can be found in [1], [3], and [4].
References

[1] Chiang, C.L. Jerzy Neyman, 1894–1981. Found on the Internet at http://www.amstat.org/about/statisticians/index.cfm?fuseaction=biosinfo&BioID=11.
[2] O’Connor, J.J. & Robertson, E.F. (2004). Neyman, J. Found on the Internet at http://www-gap.dcs.st-and.ac.uk/~history/Mathematicians/Neyman.html.
[3] Pearson, E.S. (1966). The Neyman-Pearson story: 1926–1934. Historical sidelights on an episode in Anglo-Polish collaboration, in Festschrift for J. Neyman, New York.
[4] Reid, C. (1998). Neyman, Springer Verlag, New York.
DAVID C. HOWELL
Neyman–Pearson Inference RAYMOND S. NICKERSON Volume 3, pp. 1402–1408 in Encyclopedia of Statistics in Behavioral Science ISBN-13: 978-0-470-86080-9 ISBN-10: 0-470-86080-4 Editors Brian S. Everitt & David C. Howell John Wiley & Sons, Ltd, Chichester, 2005
Neyman–Pearson Inference Jerzy Neyman [9] credits Pierre Laplace [5] with being the first to attempt to test a statistical hypothesis. Prominent among the many people who have contributed to the development of mathematical statistics from the time of Laplace until now are the developers of schools of thought that have dominated the use of statistics by psychologists for the analysis of experimental data. One of these schools of thought derives from the work of Ronald A. Fisher, the other from that of Neyman and Egon Pearson (son of Karl Pearson, for whom the Pearson correlation coefficient was named). A third school, emanating from the work of the Reverend Thomas Bayes, has had considerable impact on the theorizing of psychologists, especially in their efforts to develop normative and descriptive models of decision making and choice, but it has been applied much less to the analysis of experimental data, despite eloquent pleas by proponents of Bayesian analysis for such an application.
The Collaboration The collaboration between Neyman and E. Pearson began when the former spent a year as a visitor in Karl Pearson’s laboratory at University College, London in 1925. It lasted for the better part of a decade during which time the two co-authored several papers. The collaboration, which has been recounted by Pearson [15] in a Festschrift for Neyman, was carried on primarily by mail with Neyman in Poland and Pearson in England. In their first paper on hypothesis testing, Neyman and Pearson [10] set the stage for the ideas they were to present this way: ‘One of the most common as well as most important problems which arise in the interpretation of statistical results, is that of deciding whether or not a particular sample may be judged as likely to have been randomly drawn from a certain population, whose form may be either completely or only partially specified. We may term Hypothesis A the hypothesis that the population from which the sample has been randomly drawn is that specified, namely Π. In general the method of procedure is to apply certain tests or criteria, the results of which will
enable the investigator to decide with a greater or less degree of confidence whether to accept or reject Hypothesis A, or, as is often the case, will show him that further data are required before a decision can be reached’ (p. 175). In this paper, they work out the implications of certain methods of testing, assuming random sampling from populations with normal, rectangular, and exponential distributions. They emphasize the importance of the assumption one makes about the shape of the underlying distribution in applying statistical methods and about the randomness of sampling (the latter especially in the case of small samples). Later, Pearson [15] described this paper as having ‘some of the character of a research report, putting on record a number of lines of inquiry, some of which had not got very far. The way ahead was open; for example, five different methods of approach were suggested for the test of the hypothesis that two different samples came from normal populations having a common mean!’ (p. 463).
Neyman and Pearson versus Fisher As Macdonald [6] has pointed out, both the Fisherian and Neyman–Pearson approaches to statistical inference were concerned with establishing that an observed effect could not plausibly be attributed to sampling error. Perhaps because of this commonality, the blending of the ideas of Fisher with those of Neyman and Pearson has been so thorough, and so thoroughly assimilated by the now conventional treatment of statistical analysis within psychology, that it is difficult to distinguish the two schools of thought. What is generally taught to psychology students in introductory courses on statistics and experimental design makes little, if anything, of this distinction. Typically, statistical analysis is presented as an integrated whole and who originated specific ideas is not stressed. Seldom is it pointed out that the Fisherian approach to statistical analysis and that of Neyman and Pearson were seen by their proponents to be incompatible in several ways and that the rivalry between these pioneers is a colorful chapter in the history of applied statistics. An account of the disagreements between Fisher and Karl Pearson, which were especially contentious, and how they carried over to the relationship between Fisher and Neyman and Egon Pearson has been told with flair by Salsburg [16].
Neyman and Pearson considered Fisher’s method of hypothesis testing, which posed questions that sought yes–no answers – yes, the hypothesis under test is rejected, no it is not – to be too limited. ‘It is indeed obvious, upon a little consideration, that the mere fact that a particular sample may be expected to occur very rarely in sampling from Π would not in itself justify the rejection of the hypothesis that it had been so drawn, if there were no other more probable hypotheses conceivable’ [10, p. 178]. They proposed an approach to take account of how the weight of evidence contributes to the relative statistical plausibility of each of two mutually exclusive hypotheses. So, in distinction from Fisher’s yes–no approach, Neyman and Pearson’s may be characterized as either–or, because it leads to a judgment in favor of one or the other of the candidate hypotheses. ‘[I]n the great majority of problems we cannot so isolate the relation of Σ to Π; we reject Hypothesis A not merely because Σ is of rare occurrence, but because there are other hypotheses as to the origin of Σ which it seems more reasonable to accept’ [10, p. 183]. Neyman and Pearson’s approach to statistical testing differed from that of Fisher in other respects as well. Fisher prohibited acceptance of the null hypothesis (the hypothesis of no difference between two samples, or no effect of an experimental manipulation); if the results did not justify rejection of the null hypothesis, the most that could be said was that the null hypothesis could not be rejected. (One of the concerns that have been expressed about the effect of this aspect of null-hypothesis testing is that too often ‘failure’ to reject the null hypothesis has been equated, inappropriately, with failure of a research effort.) The Neyman–Pearson approach permitted acceptance of either of the hypotheses under consideration, including the null if it was one of them, although it provided for making the acceptance (and hence erroneous acceptance) of one of the hypotheses more difficult than acceptance of the other. Other distinguishing features of the Neyman–Pearson approach include the distinction between error types, the conception of sample spaces, the pre-experiment specification of decision criteria, and the notion of the power of a test. As to the kinds of conclusions that can be drawn from the results of statistical tests, Neyman and Pearson describe them as ‘rules of behaviour’, and they explicitly contrast this with the idea that one can take the results as indicative of the probability that a hypothesis of interest is true or false. In the
first of two 1933 papers, Neyman and Pearson [12] address the question: ‘Is it possible that there are any efficient tests of hypotheses based upon the theory of probability, and if so, what is their nature?’ (p. 290). Their answer is a carefully qualified yes. ‘Without hoping to know whether each separate hypothesis is true or false, we may search for rules to govern our behaviour with regard to them, in following which we insure that, in the long run of experience, we shall not be too often wrong. Here, for example, would be such a “rule of behaviour”: to decide whether a hypothesis, H, of a given type be rejected or not, calculate a specified character, x, of the observed facts; if x > x0 reject H, if x ≤ x0 accept H. Such a rule tells us nothing as to whether in a particular case H is true when x ≤ x0 or false when x > x0 . But it may often be proved that if we behave according to such a rule, then in the long run we shall reject H when it is true not more, say, than once in a hundred times, and in addition we may have evidence that we shall reject H sufficiently often when it is false’ (p. 291).
Error Types and Level of Significance Neyman and Pearson recognized that there are two possible ways to be right, and two possible ways to be wrong, in drawing a conclusion from a test, and that the two ways of being right (wrong) may not be equally desirable (undesirable). In particular, they stressed that deciding in favor of H0 when H1 is true may be a more (or less) acceptable error than deciding in favor of H1 when H0 is correct. ‘These two sources of error can rarely be eliminated completely; in some cases it will be more important to avoid the first, in others the second. . . From the point of view of mathematical theory all that we can do is to show how the risk of the errors may be controlled and minimized. The use of these statistical tools in any given case, in determining just how the balance should be struck, must be left to the investigator’ [12, p. 296]. The idea that the main purpose of statistical testing is to identify and learn from error has been expounded in some detail in [7]. Assuming that in most cases of hypothesis testing the two types of error would not be considered equally serious, Neyman and Pearson defined as ‘the error of the first kind’ (generally referred to as Type I error) the error that one would most wish to avoid. ‘The first demand of the mathematical theory is to
deduce such test criteria as would ensure that the probability of committing an error of the first kind would equal (or approximately equal, or not exceed) a preassigned number α, such as α = 0.05 or 0.01, etc. This number is called the level of significance’ [9, p. 161]. It is important to stress that α is a conditional probability, specifically the probability of rejecting H0 – here referring to the null hypothesis in the sense of the hypothesis under test, not necessarily the hypothesis of no difference – conditional on its being true. And its use is justified only when certain assumptions involving random sampling and the parameters of the distributions from which the sampling is done are met. This conditional-probability status of α is often not appreciated and it is treated as the absolute probability of occurrence of a Type I error. But, by definition, a Type I error can be made only when H0 is true; the unconditional probability of the occurrence of a Type I error is the product of the probability that H0 is true and the probability of a Type I error given that it is true. A similar observation pertains to Type II error (failure to reject H0 when it is false, i.e., failure to accept H1 when it is true), which, by definition, can be made only when H0 is false. The probability of occurrence of a Type II error, conditional on H0 being false, is usually referred to as β (beta), and the unconditional probability of a Type II error is the product of β and the probability that H0 is false. Computing the absolute unconditional probabilities of Type I and Type II errors generally is not possible, because the probability of H0 being true (or false) is not known and, indeed, whether to behave as though it were true, to use Neyman and Pearson’s term, is what one wants to determine by the application of statistical procedures. Failure to recognize the conditionality of α and β has led to much confusion in the interpretation of results of statistical techniques [14].
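The arithmetic behind this distinction is simple enough to sketch; the prior probability that H0 is true is purely illustrative, since, as noted above, it is not known in practice.

```python
# A minimal sketch: alpha and beta are conditional probabilities, so the
# unconditional probability of each error type also depends on how often H0 is
# true. The prior used here is hypothetical.
p_h0_true = 0.5        # illustrative proportion of tested hypotheses that are true
alpha = 0.05           # P(reject H0 | H0 true)
beta = 0.20            # P(accept H0 | H0 false)

p_type_I = p_h0_true * alpha           # unconditional P(Type I error)
p_type_II = (1 - p_h0_true) * beta     # unconditional P(Type II error)
print(p_type_I, p_type_II)
```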
Sample Spaces and Critical Areas Neyman and Pearson conceptualize the outcome of an experiment in terms of the sample spaces that represent the possible values a specified statistic could take under each of the hypotheses of interest. A point in a sample space represents a particular experimental outcome, if the associated hypothesis is true. To test a hypothesis, one divides the sample
space for that hypothesis into two regions and then applies the rule of accepting the hypothesis if the statistic falls in one of these regions and rejecting it if it falls in the other. In null-hypothesis testing, the convention is to define a region, known as the critical region, and to reject the hypothesis if the statistic falls in that region. The situation may be represented as in Figure 1. Each distribution represents the theoretical distribution – the sample space – of some statistic assuming the specified hypothesis is true. The vertical line (decision criterion) represents the division of the space into two regions, the critical region being to the right of the line.
[Figure 1: two panels, (a) and (b). In each, a distribution under H0 and a distribution under H1 are separated by a vertical decision criterion; the region to the left of the criterion is labeled ‘Accept H0’ and the region to its right is labeled ‘Accept H1’ and ‘Critical region’.]
Figure 1 In panel a, the distribution to the left represents the distribution under H0; that to the right represents the distribution under H1. The area under the H0 distribution and to the right of the decision criterion represents α, the probability of a Type I error conditional on H0 being true; that under the H1 distribution and to the left of the criterion represents β, the probability of a Type II error conditional on H1 being true; the area under the H1 distribution and to the right of the criterion, 1 − β, is said to be the power of the test, which is the probability of deciding in favor of H1 conditional on its being true. One decreases the conditional probability of a Type I error by moving the decision criterion to the right, but in doing so, one necessarily increases the conditional probability of a Type II error. Panel b shows the situation assuming a much greater overlap between the two distributions, illustrating that a greater degree of overlap necessarily means a greater error rate. In this case, holding the decision criterion so as to keep the conditional probability of a Type I error constant means greatly increasing the conditional probability of a Type II error.
The area under the curve representing H0 and to the right of the criterion represents
the probability of a Type I error on the condition that H0 is true; the area under the curve representing H1 and to the left of the decision criterion represents the probability of a Type II error, on the condition that H1 is true. As is clear from the figure, given the distributions shown, the probability of a Type I error, conditional on H0 being true, can be decreased only at the expense of an increase in the probability of a Type II error, conditional on H1 being true. Figure 1 has been drawn to represent cases in which the theoretical distributions of H0 and H1 overlap relatively little (a) and relatively greatly (b). Obviously, the greater the degree of overlap, the more a stringent criterion for minimizing the conditional probability of a Type I error will cost in terms of allowance of a high conditional probability of a Type II error. We should also note that, as drawn, the figures show the distributions having the same variance and indeed the same Gaussian shape. Such are the default assumptions that are often made in null-hypothesis testing; however, the theoretical distributions used need not be Gaussian, and procedures exist for taking unequal variances into account. Neyman and Pearson [12] devote most of their first 1933 paper to a discussion of the nontrivial problem of specifying a ‘best critical region’ for controlling Type II error, for a criterion that has been selected to yield a specified (typically small) conditional probability of a Type I error. Such a selection must depend, of course, on what the alternative hypothesis(es) is (are). When there are more alternative hypotheses than one, the best critical region with regard to one of the alternative hypotheses will not generally be the best critical region with respect to another. Neyman and Pearson show that in certain problems it is possible to specify a ‘common family of best critical regions for H0 with regard to the whole class of admissible alternative hypotheses’ (p. 297). They develop procedures for identifying best critical regions in the sense of maximizing the probability of accepting H1 when it is true, given a fixed criterion for erroneously rejecting H0 when it is true.
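A minimal numerical sketch of Figure 1, assuming two Gaussian sampling distributions with hypothetical means and a common standard deviation: moving the criterion to the right lowers the conditional probability of a Type I error while raising that of a Type II error.

```python
# A minimal sketch of the alpha/beta trade-off for two Gaussian sample spaces
# and a decision criterion. Means, standard deviation, and criteria are
# illustrative assumptions only.
from scipy.stats import norm

mu0, mu1, sd = 0.0, 1.0, 1.0
for criterion in (1.64, 2.33):                      # stricter criterion lies further right
    alpha = norm.sf(criterion, loc=mu0, scale=sd)   # area under H0 beyond the criterion
    beta = norm.cdf(criterion, loc=mu1, scale=sd)   # area under H1 below the criterion
    print(f"c={criterion}: alpha={alpha:.3f}, beta={beta:.3f}, power={1 - beta:.3f}")
```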
Power In setting the decision criterion so as to make the conditional probability of a Type I error very small, one typically ensures that the conditional probability of a Type II error, β, is relatively large, which means
that the chance of failing to note a real effect, if there is one, is fairly high. Neyman and Pearson [13] define the power of a statistical test with respect to the alternative hypothesis as 1-β, which is the probability of detecting an effect (rejecting the null) if there is a real effect to be detected (if the null is false). The power of a statistical test is a function of three variables, as can be seen from Figure 1: the effect size (the difference between the means of the distributions), relative variability (the variance or standard deviation of the distributions), and the location of the decision criterion. Often, sample size is mentioned as one of the determinants of power, but this is because the relative variability of a distribution decreases as sample size is increased. Variability can be affected in other ways as well (see Power).
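A minimal sketch of that power calculation for a one-sided z test of a mean, with illustrative numbers: power rises with effect size and with sample size, which shrinks the standard error.

```python
# A minimal sketch: power as a function of effect size, variability, and the
# criterion for a one-sided z test. All numbers are hypothetical.
from math import sqrt
from scipy.stats import norm

def power(effect, sigma, n, alpha=0.05):
    se = sigma / sqrt(n)
    criterion = norm.isf(alpha, loc=0, scale=se)   # cutoff under H0
    return norm.sf(criterion, loc=effect, scale=se)

print(round(power(effect=0.5, sigma=1.0, n=10), 3))
print(round(power(effect=0.5, sigma=1.0, n=40), 3))   # larger n, higher power
print(round(power(effect=0.8, sigma=1.0, n=10), 3))   # larger effect, higher power
```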
Statistical Tests and Judgment In view of the tendency of some textbook writers to present hypothesis testing procedures, including those of Neyman and Pearson, as noncontroversial formulaic methods that can be applied in cook-book fashion to the problem of extracting information from data, it is worth noting that Neyman and Pearson were sensitive to the need to exercise good judgment in the application of statistical methods to the testing of hypotheses and they characterized the methods as themselves being aids to judgment; they would undoubtedly have been appalled at how the techniques they developed are sometimes applied with little thought as to whether they are really appropriate to the situation and the results interpreted in a categorical fashion. In general, Neyman and Pearson saw statistical tests as attempts to quantify commonsense notions about the interpretation of probabilistic data. ‘In testing whether a given sample, Σ, is likely to have been drawn from a population Π, we have started from the simple principle that appears to be used in the judgments of ordinary life – that the degree of confidence placed in an hypothesis depends upon the relative probability or improbability of alternative hypotheses. From this point of view any criterion which is to assist in scaling the degree of confidence with which we accept or reject the hypothesis that Σ has been randomly drawn from Π should be one which decreases as the probability (defined in some definite manner) of alternative hypotheses becomes
Neyman–Pearson Inference relatively greater. Now it is of course impossible in practice to scale the confidence with which we form a judgment with any single numerical criterion, partly because there will nearly always be present certain a priori conditions and limitations which cannot be expressed in exact terms. Yet though it may be impossible to bring the ideal situation into agreement with the real, some form of numerical measure is essential as a guide and control’ [11, p. 263]. So, while Neyman and Pearson argue the need for quantitative tests, they stress that they are to be used to aid judgment and not to supplant it. ‘[T]ests should only be regarded as tools which must be used with discretion and understanding, and not as instruments which in themselves give the final verdict’ [10, p. 232]. Again, ‘The role of sound judgment in statistical analysis is of great importance and in a large number of problems common sense may alone suffice to determine the appropriate method of attack’ [12, p. 292]. They note, in particular, the need for judgment in the interpretation of likelihood ratios, the introduction of which they credit to Fisher [2]. The likelihood ratio is the ratio of the probability of obtaining the sample in hand if H1 were true to the probability of obtaining the sample if H0 were true. But, the likelihood ratio is to be weighed along with other, not necessarily numerical considerations in arriving at a conclusion. ‘There is little doubt that the criterion of likelihood is one which will assist the investigator in reaching his final judgment; the greater be the likelihood of some alternative Hypothesis A’ (not ruled out by other considerations), compared with that of A, the greater will become his hesitation in accepting the latter. It is true that in practice when asking whether can have come from , we have usually certain a priori grounds for believing that this may be true, or if not so, for expecting that ’ differs from in certain directions only. But such expectations can rarely be expressed in numerical terms. The statistician can balance the numerical verdict of likelihood against the vaguer expectations derived from a priori considerations, and it must be left to his judgment to decide at what point the evidence in favour of alternative hypotheses becomes so convincing that Hypothesis A must be rejected’ [10, p. 187]. Often Neyman and Pearson speak of taking a priori beliefs into account in selecting testing procedures or interpreting the results of tests. They
sometimes note that more than one approach to statistical testing can be taken with equal justification, making the choice somewhat a matter of personal taste. And they emphasize the need for discretion or an ‘attitude of caution’ in the drawing of conclusions from the results of tests. They viewed statistical analysis as only one of several reasons that should weigh in an investigator’s decision as to whether to accept or reject a particular hypothesis, and noted that the combined effect of those reasons is unlikely to be expressible in numerical terms. ‘All that is possible for [an investigator] is to balance the results of a mathematical summary, formed upon certain assumptions, against other less precise impressions based upon a priori or a posteriori considerations. The tests themselves give no final verdict, but as tools help the worker who is using them to form his final decision; one man may prefer to use one method, a second another, and yet in the long run there may be little to choose between the value of their conclusions’ [10, p. 176]. Neyman and Pearson give examples of situations in which they would consider it reasonable to reject a hypothesis on the basis of common sense despite the outcome of a statistical test favoring acceptance. They stress the importance of a ‘clear understanding [in the mind of a user of a statistical test] of the process of reasoning on which the test is based’ [10, p. 230], and note that that process is an individual matter: ‘we do not claim that the method which has been most helpful to ourselves will be of greatest assistance to others. It would seem to be a case where each individual must reason out for himself his own philosophy’ [10, p. 230]. In sum, Neyman and Pearson presented their approach to statistical hypothesis testing in considerably less doctrinaire terms than methods of hypothesis testing are often presented today.
Other Sources

Other sources of information on the Neyman–Pearson school of thought regarding statistical significance testing and the differences between it and the Fisherian school include [1, 3, 4, 7], and [8].
Notes

1. Neyman and Pearson use a variety of letters and subscripts to designate hypotheses under test. Except when quoting Neyman and Pearson, I here use H0 and H1 to represent respectively the main (often null) hypothesis under test and an alternative, incompatible, hypothesis. In practice, H1 is often the negation of H0.
2. In [11], Neyman and Pearson distinguish between a simple hypothesis (e.g., ‘that has been drawn from an exactly defined population’) and a composite hypothesis (e.g., ‘that has been drawn from an unspecified population belonging to a clearly defined subset ω of the set’). The distinction is ignored in this article because it is not critical to the principles discussed.
3. In ‘two-tailed’ testing, appropriate when the alternative hypothesis is that there is an effect (e.g., a difference between means), but the direction is not specified, the critical region can be composed of two subregions, one on each end of the distributions.
References

[1] Cowles, M.P. (1989). Statistics in Psychology: A Historical Perspective, Lawrence Erlbaum, Hillsdale.
[2] Fisher, R.A. (1922). On the mathematical foundations of theoretical statistics, Philosophical Transactions of the Royal Society, A 222, 309–368.
[3] Gigerenzer, G. & Murray, D.J. (1987). Cognition as Intuitive Statistics, Lawrence Erlbaum, Hillsdale.
[4] Gigerenzer, G., Swijtink, Z., Porter, T., Daston, L., Beatty, J. & Krüger, L. (1989). The Empire of Chance: How Probability Changed Science and Everyday Life, Cambridge University Press, New York.
[5] Laplace, P.S. (1812). Théorie Analytique des Probabilités, Académie Française, Paris.
[6] Macdonald, R.R. (1997). On statistical testing in psychology, British Journal of Psychology 88, 333–347.
[7] Mayo, D. (1996). Error and the Growth of Experimental Knowledge, University of Chicago Press, Chicago.
[8] Mulaik, S.A., Raju, N.S. & Harshman, R.A. (1997). There is a time and a place for significance testing: appendix, in What if there were no Significance Tests? L.L. Harlow, S.A. Mulaik & J.H. Steiger, eds, Lawrence Erlbaum, Mahwah, pp. 103–111.
[9] Neyman, J. (1974). The emergence of mathematical statistics: a historical sketch with particular reference to the United States, in On the History of Statistics and Probability, D.B. Owen, ed., Marcel Dekker, New York, pp. 147–193.
[10] Neyman, J. & Pearson, E.S. (1928a). On the use and interpretation of certain test criteria for purposes of statistical inference. Part I, Biometrika 20A, 175–240.
[11] Neyman, J. & Pearson, E.S. (1928b). On the use and interpretation of certain test criteria for purposes of statistical inference. Part II, Biometrika 20A, 263–294.
[12] Neyman, J. & Pearson, E.S. (1933a). On the problem of the most efficient tests of statistical hypotheses, Philosophical Transactions of the Royal Society, A 231, 289–337.
[13] Neyman, J. & Pearson, E.S. (1933b). The testing of statistical hypotheses in relation to probability a priori, Proceedings of the Cambridge Philosophical Society 29, 492–510.
[14] Nickerson, R.S. (2000). Null hypothesis statistical testing: a review of an old and continuing controversy, Psychological Methods 5, 241–301.
[15] Pearson, E.S. (1970). The Neyman-Pearson story: 1926–1934: historical sidelights on an episode in Anglo-Polish collaboration, in Studies in the History of Statistics and Probability, E.S. Pearson & M. Kendall, eds, Charles Griffin, London, pp. 455–477. (Originally published in 1966.)
[16] Salsburg, D. (2001). The Lady Tasting Tea: How Statistics Revolutionized Science in the Twentieth Century, W.H. Freeman, New York.
(See also Classical Statistical Inference: Practice versus Presentation) RAYMOND S. NICKERSON
Nightingale, Florence DAVID C. HOWELL Volume 3, pp. 1408–1409 in Encyclopedia of Statistics in Behavioral Science ISBN-13: 978-0-470-86080-9 ISBN-10: 0-470-86080-4 Editors Brian S. Everitt & David C. Howell John Wiley & Sons, Ltd, Chichester, 2005
Nightingale, Florence

Born: May 12, 1820, in Florence, Italy.
Died: August 13, 1910, in London, England.
Florence Nightingale was born into a wealthy English family early in the nineteenth century and was educated at home by her father. That education included mathematics at some level. She continued her studies of mathematics under James Joseph Sylvester. She was also later influenced by the work of Quetelet, who was already well known for the application of statistical methods to data from a variety of fields (see [1] for the influence of Quetelet’s writing and a short biography of Nightingale). Nightingale received her initial training in nursing in Egypt, followed by experience in Germany and France. On returning to England, she took up the position of Superintendent of the Establishment for Gentlewomen during Illness, on an unpaid basis. When the Crimean War broke out in 1854, Nightingale was sent by the British Secretary of War
to Turkey as head of a group of 38 nurses. The Times had published articles on the poor medical facilities available to British soldiers, and the introduction of nurses to military hospitals was hoped to address the problem. Nightingale was disturbed by the lack of sanitary conditions and fought the military establishment to improve the quality of care. Most importantly to statistics, she collected data on the causes of death among soldiers and used her connections to publicize her results. She was able to show that soldiers were many times more likely to die from illnesses contracted as a result of poor sanitation in the hospitals or from wounds left untreated than to die from enemy fire. To illustrate her point, she plotted her data in the form of polar area diagrams, whose areas were proportional to the cause of deaths. Such a diagram is shown below (Figure 1), and can be found in [2]. Diagrams such as this have since been referred to as Coxcombs, although Small [3] has argued that Nightingale used that term to refer to the collection of such graphics. This graphic was published 10 years before the famous graphic by Minard on the size of Napoleon’s army in the invasion of Russia. Unlike
[Figure 1  Causes of death among military personnel in the Crimean War, April 1855 to March 1856 (polar area diagram adapted from http://www.florence-nightingale-avenging-angel.co.uk/Coxcomb.htm). The areas of the medium, light, and dark gray wedges are each measured from the center as the common vertex: the medium wedges represent the deaths from Preventible or Mitigable Zymotic Diseases, the light wedges the deaths from wounds, and the dark wedges the deaths from all other causes.]
Minard’s graphic, it carries a clear social message and played an important role in the reform of military hospitals. It represents an early, though certainly not the first, example of the use of statistical data for social change. Among other graphics, Nightingale created a simple line graph that showed the death rates of civilian and military personnel during peacetime. The data were further broken down by age. The implications were unmistakable and emphasize the importance of controlling confounding variables. Such a graph can be seen in [3]. Because of her systematic collection of data and use of such visual representations, Nightingale played an important role in the history of statistics. Though she had little use for the theory of statistics, she was influential because of her ability to use statistical data to influence public policy. Nightingale was elected as the first female Fellow of the Royal Statistical Society in 1858, and was made an honorary member of the American Statistical Association, in 1874. During the 1890s she worked unsuccessfully with Galton on the establishment of a Chair of Statistics at Oxford. For many years of her life she was bed-ridden, and she died in 1910 in London, having published
over 200 books, reports, and pamphlets. Additional material on her life can be found in a paper by Conquest in [4].
References

[1] Diamond, M. & Stone, M. (1981). Nightingale on Quetelet, Journal of the Royal Statistical Society, Series A 144, 66–79, 176–213, 332–351.
[2] Nightingale, F.A. (1859). Contribution to the Sanitary History of the British Army During the Late War with Russia, Harrison, London.
[3] Small, H. (1998). Paper from Stats & Lamps Research Conference organized by the Florence Nightingale Museum at St. Thomas’ Hospital, 18th March 1998, Constable, London. A version of this paper is available at http://www.florence-nightingale.co muse.uk/small.htm
[4] Stinnett, S. (1990). Women in statistics: sesquicentennial activities, The American Statistician 44, 74–80.
(See also Graphical Methods pre-20th Century) DAVID C. HOWELL
Nonequivalent Group Design CHARLES S. REICHARDT Volume 3, pp. 1410–1411 in Encyclopedia of Statistics in Behavioral Science ISBN-13: 978-0-470-86080-9 ISBN-10: 0-470-86080-4 Editors Brian S. Everitt & David C. Howell John Wiley & Sons, Ltd, Chichester, 2005
Nonequivalent Group Design A nonequivalent group design is used to estimate the relative effects of two or more treatment conditions. In the simplest case, only two treatment conditions are compared. These two conditions could be alternative interventions such as a novel and a standard treatment, or one of the treatment conditions could be a control condition consisting of no intervention. The participants in a nonequivalent group design can be either individuals or aggregates of individuals, such as classrooms, communities, or businesses. Each of the treatments being compared is given to a different group of participants. After the treatment conditions are implemented, the relative effects of the treatments are assessed by comparing the performances across the groups on an outcome measure. The nonequivalent group design is a quasiexperiment (as opposed to a randomized experiment) (see Quasi-experimental Designs) because the participants are assigned to the treatment conditions nonrandomly. Assignment is nonrandom, for example, when participants self-select the treatments they are to receive on the basis of their personal preferences. Assignment is also nonrandom when treatments are assigned by administrators on the basis of what is most convenient or some other nonrandom criterion. Nonequivalent group designs also arise when treatments are assigned to preexisting groups of participants that were created originally for some other purpose without using random assignment. Differences in the composition of the treatment groups are called selection differences [7]. Because the groups of participants who receive the different treatments are formed nonrandomly, selection differences can be systematic and can bias the estimates of the treatment effects. For example, if the participants in one group tend to be more motivated than those in another, and if motivation affects the outcome that is being measured, the groups will perform differently even if the two treatments are equally effective. The primary task in the analysis of data from a nonequivalent group design is to estimate the relative effects of the treatments while taking account of the potentially biasing effects of selection differences. A variety of statistical methods have been proposed for taking account of the biasing effects
of selection differences [3], [8]. The most common techniques are change-score analysis, matching or blocking (using either propensity scores or other covariates), analysis of covariance (with or without correction for unreliability in the covariates), and selection-bias modeling. These methods all require that one or more pretreatment measures have been collected. In general, the best pretreatment measures are those that are operationally identical to the outcome (i.e., posttreatment) measures. The statistical methods differ in how they use pretreatment measures to adjust for selection differences. Change-score analysis assumes that the size of the average posttreatment difference will be the same as the size of the average pretreatment difference in the absence of a treatment effect. Matching and blocking adjusts for selection differences by comparing participants who have been equated on their pretreatment measures from the treatment conditions. Analysis of covariance is similar to matching, except that equating is accomplished mathematically rather than by physical pairings. Unreliability in the pretreatment measures can be taken into account in the analysis of covariance using structural equation modeling techniques [2]. When there are multiple pretreatment measures, propensity scores are often used in either matching or analysis of covariance, where a participant’s propensity score is the probability that the participant is in one rather than another treatment condition, which is estimated using the multiple pretreatment measures [6]. Selection-bias modeling corrects the effects of selection differences by modeling the nonrandom selection process [1]. Unfortunately, there is no guarantee that any of these or any other statistical methods will properly remove the biases due to selection differences. Each method imposes different assumptions about the effects of selection differences, and it is difficult to know which set of assumptions is most appropriate in any given set of circumstances. Typically, researchers should use multiple statistical procedures to try to ‘bracket’ the effects of selection differences by imposing a range of plausible assumptions [5]. Researchers can often improve the credibility of their results by adding design features (such as pretreatment measures at multiple points in time and nonequivalent dependent variables) to create a pattern of outcomes that cannot plausibly be explained as a result of selection differences [4]. In addition, because uncertainty about the size of treatment effects
is reduced to the extent selection differences are small, researchers should strive to use groups (such as cohorts) that are as initially similar as possible. Nonetheless, even under the best of conditions, the results of nonequivalent group designs tend to be less credible than the results from randomized experiments, because random assignment is the optimal way to make treatment groups initially equivalent.
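As a minimal sketch (not taken from any of the cited studies) of two of the adjustment strategies described above, the following R code simulates a pretest–posttest nonequivalent group design and compares a change-score analysis with an analysis of covariance; all variable names and numerical values are hypothetical.

```r
## Hypothetical data: 'pre' and 'post' are operationally identical measures,
## 'tmt' is a nonrandomly assigned treatment indicator (0 = control, 1 = treated).
set.seed(1)
n    <- 200
tmt  <- rbinom(n, 1, 0.5)
pre  <- rnorm(n, mean = 50 + 3 * tmt, sd = 10)        # selection difference at pretest
post <- 5 + 0.9 * pre + 4 * tmt + rnorm(n, sd = 8)    # simulated treatment effect of 4
d    <- data.frame(pre, post, tmt)

## Change-score analysis: assumes the pretest group difference would be
## maintained at posttest in the absence of a treatment effect.
change_fit <- lm(I(post - pre) ~ tmt, data = d)

## Analysis of covariance: equates the groups on the pretest mathematically.
ancova_fit <- lm(post ~ pre + tmt, data = d)

summary(change_fit)$coefficients["tmt", ]
summary(ancova_fit)$coefficients["tmt", ]
```

Because the two analyses rest on different assumptions about the selection differences, their estimates will generally not agree; reporting both is one way of ‘bracketing’ the treatment effect.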
References

[1] Barnow, B.S., Cain, G.C. & Goldberger, A.S. (1980). Issues in the analysis of selectivity bias, in Evaluation Studies Review Annual, Vol. 5, E.W. Stromsdorfer & G. Farkas, eds, Sage Publications, Newbury Park.
[2] Magidson, J. & Sorbom, D. (1982). Adjusting for confounding factors in quasi-experiments: another reanalysis of the Westinghouse Head Start evaluation, Educational Evaluation and Policy Analysis 4, 321–329.
[3] Reichardt, C.S. (1979). The statistical analysis of data from nonequivalent group designs, in Quasi-experimentation: Design and Analysis Issues for Field Settings, T.D. Cook & D.T. Campbell, eds, Rand McNally, Chicago, pp. 147–205.
[4] Reichardt, C.S. (2000). A typology of strategies for ruling out threats to validity, in Research Design: Donald Campbell’s Legacy, Vol. 2, L. Bickman, ed., Sage Publications, Thousand Oaks, pp. 89–115.
[5] Reichardt, C.S. & Gollob, H.F. (1987). Taking uncertainty into account when estimating effects, in Multiple Methods for Program Evaluation. New Directions for Program Evaluation No. 35, M.M. Mark & R.L. Shotland, eds, Jossey-Bass, San Francisco, pp. 7–22.
[6] Rosenbaum, P.R. & Rubin, D.B. (1983). The central role of the propensity score in observational studies for causal effects, Biometrika 70, 41–55.
[7] Shadish, W.R., Cook, T.D. & Campbell, D.T. (2002). Experimental and Quasi-Experimental Designs for Generalized Causal Inference, Houghton Mifflin, Boston.
[8] Winship, C. & Morgan, S.L. (1999). The estimation of causal effects from observational data, Annual Review of Sociology 25, 659–707.
CHARLES S. REICHARDT
Nonlinear Mixed Effects Models
MARY J. LINDSTROM AND HOURI K. VORPERIAN
Volume 3, pp. 1411–1415 in Encyclopedia of Statistics in Behavioral Science ISBN-13: 978-0-470-86080-9 ISBN-10: 0-470-86080-4 Editors Brian S. Everitt & David C. Howell John Wiley & Sons, Ltd, Chichester, 2005
Nonlinear Mixed Effects Models Linear mixed-effects (LME) models were developed to model hierarchical (or nested) data. However, before the publication of [2], methods for estimating LME models were not widely available to handle hierarchical data where the observations within a subcategory (often subject) are not equally correlated. This situation usually arises because the observations have been taken over time. This type of hierarchical data is often referred to as growth curve, longitudinal, or (more generally) repeated measures data. We will begin with a quick review of LME and nonlinear regression models to set up ideas and notation.
Linear Mixed-effects Models (LME)
Example: Heights of Girls

The Berkeley growth data [5] includes heights of boys and girls measured between ages 1 to 18 years. In Figure 1, we show a subset of the data for the girls over a range of ages where one might postulate a linear model for the relationship between age and height, for example:

yij = β1 + β2 agej + eij ,   i = 1, . . . , M   (1)

where M is the number of subjects, yij is the jth height on the ith subject, β1 is the intercept, agej is the jth age, β2 is the slope for age, and eij is the error term for the jth age for the ith subject.

[Figure 1  Heights of 27 girls measured at 8.5, 9, 9.5, 10, and 10.5 years; height (cm) is plotted against age (years).]
LME Modeling

We might be tempted to fit the model described in (1) to the data relating to heights of girls. However, the data do not fulfill the assumptions of the typical regression model, that is, Cov(eij , ekl ) = 0 except when i = k and j = l. The error terms will not be independent since some come from the same subject and some come from different subjects. In other words, these data have a hierarchical structure that we must take into account. One additional complication is that the error terms within subject may not be equally correlated since they are observed over time (see Linear Multilevel Models).
Figure 1 indicates that the intercept, and maybe the slope, will vary across subjects. We can modify the model to accommodate this as

yij = βi,1 + βi,2 agej + eij ,   i = 1, . . . , M   (2)
where βi,1 is the intercept for the ith subject and βi,2 is the slope for age for the ith subject. We still have the problem of the dependence of the error terms plus we have a model where the number of parameters grows as the number of subjects does. Common assumptions for the error terms are normality and conditional independence, that is, within subjects, the errors are independent. We write this as ei ∼ N(0, σ²I) where ei is the vector of all the error terms for the ith subject and I is the identity matrix. Note that the assumption of conditional independence is a working assumption that we know may be violated. We will address this issue later in the Section ‘Parameter estimation’. A popular approach to reducing the number of parameters is to assume that the subjects are a random sample from some underlying population of subjects and that the parameter values for the subjects follow a distribution. A typical choice is

\begin{bmatrix} βi,1 \\ βi,2 \end{bmatrix} ∼ N(µ, D)   (3)

where µ^T = [µ1 , µ2 ] and where µ1 and µ2 are the mean (fixed) intercept and slope values for the population of subjects and where D is a 2 × 2 covariance
matrix. This assumed distribution on the parameters makes this a mixed-effects model. Models of this type are also commonly called hierarchical, two-stage, empirical-Bayes, growth-curve (see Growth Curve Modeling), and multilevel linear models (see Linear Multilevel Models).
Parameter Estimation

Typically, we estimate the fixed effects and variance parameters by computing the marginal distribution of the data by integrating out the random effects. This solves the problem of too many parameters. The resulting distribution has the form

yi ∼ N(Xi µ, Σi ),   where Σi = Zi D Zi^T + σ²I   (4)

and where Zi is the design matrix for the random effects and Xi is the design matrix for the fixed effects (for individual i). In our example they are the same:

Xi = Zi = \begin{bmatrix} 1 & 8.5 \\ 1 & 9.0 \\ 1 & 9.5 \\ 1 & 10.0 \\ 1 & 10.5 \end{bmatrix} for all i   (5)

The fixed effects (µ) and variance components (D and σ) are estimated by maximizing the likelihood of the data under this marginal model (4). The marginal variance–covariance matrix Σi has a rich structure that depends on D and Z and that may be sufficient to provide accurate inference for the fixed effects even if our assumption of conditional independence within subject (the σ²I term in (4)) is not valid. If the term Zi D Zi^T is not sufficient to model the marginal variance–covariance structure, it may be necessary to assume a more complex within-subject correlation model, for example, taking ei to be normal with mean 0 and covariance σ² times a correlation matrix corresponding to an autoregressive model. While this approach requires that D must be specified and estimated, inference about the fixed effects will not be sensitive to misspecification of D if the marginal distribution of yi is rich enough. This should also hold for sensitivity to the assumption of conditional independence within subject. Since the marginal distribution for LME models has a closed form, the theory for finding maximum likelihood and restricted maximum likelihood (see Maximum Likelihood Estimation) is straightforward [3]. In
practice, one would use existing software packages to do the estimation. If estimates for the random effects are desired, the best linear unbiased predictors (BLUPs) are used where ‘Best’ in this case means minimum squared error of prediction (or loss). The BLUPs are also the empirical-Bayes predictors or posterior means of the random effects (see Random Effects in Multivariate Linear Models: Prediction).
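As a minimal sketch (not from the original article) of how a model like (2)–(3) can be fit with the nlme package in R, assume a hypothetical data frame growth with columns subject, age, and height; the variable names are illustrative only.

```r
library(nlme)

## Hypothetical data frame 'growth' with columns: subject, age, height
## (heights of girls measured between 8.5 and 10.5 years, as in Figure 1).
fit <- lme(fixed  = height ~ age,        # fixed intercept and slope (mu_1, mu_2)
           random = ~ age | subject,     # subject-specific intercept and slope, covariance D
           data   = growth,
           method = "REML")              # restricted maximum likelihood

summary(fit)    # estimates of mu, D, and sigma
ranef(fit)      # BLUPs of the subject-specific deviations
```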
Nonlinear Regression Our first task is to define the class of nonlinear regression model. Let us consider linear regression models first (see Multiple Linear Regression). The term linear regression has two common meanings. The first is straight line regression where the model has the form yj = β1 + β2 xj + ej
(6)
The subscript j indexes observations as in equation (1). We reserve the subscript i for subjects used in the next section. The second meaning is a linear model that may, for instance, have additional polynomial terms or terms for groups, for example, yj = β1 + β2 tmtj + β3 xj + β4 xj2 + ej
(7)
where tmtj would be, for example, 0 for observations in the control group and 1 for observations in the treated group. The definition of a linear model is one that can be written in the following form

yj = \sum_{k} βk xkj + ej   (8)
where the βs are the only parameters to be estimated. Equations (6) and (7) can be written in this form and are linear, but the following nonlinear regression model is not linear:

yj = β1 + β2 xj^{β3} + ej
(9)
A second definition of a linear model is a model for which the derivative of the right-hand side of the model equation with respect to any individual parameter contains no parameters at all. Nonlinear models are most commonly used for data that show single or double asymptotes, inflection
points, and other features not well fit by polynomials. Often, there is no mechanistic interpretation for the parameters (other than intercepts, asymptotes, inflection points, etc.), they simply describe the relationship between the predictor and outcome variables via the model function. There are certain nonlinear models where scientists have attempted to interpret the parameters as having physical meaning. One such class of models is compartment models [1]. Equation 9 is a simple (fixed effects only) nonlinear regression model that can be used for data where there is no nested structure but where a linear model does not adequately describe the relationship between the predictors and the outcome variable. The general form of the model for the j th observation is
yj = f(β, xj ) + ej ,   ej ∼ N(0, σ²)   (10)

where f(β, xj ) = β1 + β2 xj^{β3} is the assumed model with parameters β that relates the mean response yj to the predictor variables xj . The error terms are assumed to be independent. Note that xj can contain more than one predictor variable although in this model there is only one. The parameters β are typically estimated via nonlinear least squares [1].
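A minimal sketch (not from the article) of fitting a model of the form (9)/(10) by nonlinear least squares in R; the data frame dat, its variables x and y, and the starting values are all hypothetical.

```r
## Hypothetical data frame 'dat' with a single predictor x and response y.
## Fit y = b1 + b2 * x^b3 + e by nonlinear least squares.
fit <- nls(y ~ b1 + b2 * x^b3,
           data  = dat,
           start = list(b1 = 0, b2 = 1, b3 = 0.5))  # starting values must be supplied
summary(fit)
```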
Nonlinear Mixed-effects Models

Nonlinear mixed-effects (NLME) models are used when the data have a hierarchical structure as described above for LME models but where a nonlinear function is required to adequately describe the relationship between the predictor(s) and the outcome variable.
Example: Tongue Lengths Figure 2 shows measured tongue length from magnetic resonance images for 14 children [6]. Note that in this data set, not every subject has been measured at each age. An important feature of mixed effects models (linear and nonlinear) is that all subjects do not have to be measured at the same time points or the same number of times. One model that fits these data well is the asymptotic regression model. It has many formulations but the one we have chosen is yij = α − β exp(−γ ageij ) + eij
(11)
Figure 2 Tongue length at various ages for 14 children. Each child’s data are connected by a line
In this version, α is the asymptote as age goes to infinity, β is the range between the response at age = 0 and α, and γ controls the rate at which the curve achieves its asymptote. Once again it is sensible to postulate a model that includes parameters that are specific to subject. It turns out that in this case only α needs to vary between subjects. The model then takes on the following form:

yij = αi − β exp(−γ ageij ) + eij
(12)
We model the parameters as random in the population of subjects with a fixed mean µ and assume αi ∼ N(µ, σ2²) and ei ∼ N(0, σ²I). It is also possible to add additional predictors such as gender. If gender is expected to affect the intercept, we can rewrite the model for the intercepts by modifying the model for the random intercepts, αi ∼ N(µ1 + µ2 gender, σ2²), or equivalently by modifying the model for the responses (but not both):

yij = αi + µ2 gender − β exp(−γ ageij ) + eij   (13)
Parameter Estimation

Just as in the LME case, estimation for NLME models is accomplished by finding the maximum likelihood estimates that correspond to the marginal distribution for the data. However, unlike in the linear case, there is usually no closed form expression
for the marginal distribution of an NLME model. Thus, direct maximum likelihood estimation requires either numerical or Monte Carlo integration (see Markov Chain Monte Carlo and Bayesian Statistics). Alternatively, estimation can be accomplished using approximate maximum likelihood or restricted maximum likelihood [3]. We can write down an approximate marginal variance–covariance matrix, which has the form

Σ̂i = Ẑi D̂ Ẑi^T + σ̂²I.   (14)
Here Ẑi is the derivative matrix of the right-hand side of the subject-specific model (12) with respect to the random effects. Unlike in the linear model, it will, in general, depend on the estimated parameter values [3]. The term Ẑi D̂ Ẑi^T may capture the marginal covariance of the data, and assuming independence within subject may still result in a valid inference on the fixed effects. If not (as in this case where only the ‘intercept’ is random and Zi is a column vector of 1s), various more general forms for the within-subject covariance can be assumed (as described under LME models above). The maximum likelihood estimates (and standard errors) for the tongue data main effects are µ̂ = 12.16 (0.65), β̂ = 4.65 (0.57), and γ̂ = 0.023 (0.0069). The estimated variance components are σ̂2 = 0.203 (standard deviation of the intercepts) and σ̂ = 0.478 (within-subject standard deviation). Figure 3 shows the fitted population average and subject-specific curves for the tongue length data. The thick line corresponds to the fitted values when the estimated mean value µ̂ is substituted in for αi . The lighter lines each correspond to a subject and are computed from the BLUP estimates of αi . These estimates were obtained using the NLME package for the R statistical software package [4]. There are other packages that implement NLME models, including the SAS procedure NLMIXED (see Software for Statistical Analyses). Testing in NLME models is a bit more complex than in standard regression or analysis of variance. Simulation studies have found that likelihood ratio tests perform well for testing for the need for variance components (like adding a random effect) even though they are theoretically invalid because the parameter value being tested is on the edge of the parameter space. Methods for testing fixed
[Figure 3  Population average (thick line) and subject level (lighter lines) fitted curves for the tongue length data; tongue length (cm) is plotted against age (months).]
effects are still controversial but it is known that LR tests are much too liberal and approximate F tests [3] seem to do a better job. More evaluation is required.
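The estimates reported above come from the authors’ own analysis; as a rough sketch of how a model like (12) could be fit with the nlme package for R mentioned here, assume a hypothetical data frame tongue with columns child, age, and length (all names and starting values are illustrative, not the authors’ code).

```r
library(nlme)

## Hypothetical data frame 'tongue' with columns: child, age (months), length (cm).
## Model (12): length_ij = alpha_i - beta * exp(-gamma * age_ij) + e_ij,
## with a random (subject-specific) alpha only.
fit <- nlme(length ~ alpha - beta * exp(-gamma * age),
            data   = tongue,
            fixed  = alpha + beta + gamma ~ 1,
            random = alpha ~ 1 | child,
            start  = c(alpha = 9, beta = 4, gamma = 0.02))  # rough starting values

summary(fit)    # fixed effects and variance components
ranef(fit)      # BLUP estimates of the subject-specific deviations in alpha
```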
References

[1] Bates, D.M. & Watts, D.G. (1988). Nonlinear Regression Analysis and its Applications, Wiley, New York.
[2] Laird, N.M. & Ware, J.H. (1982). Random-effects models for longitudinal data, Biometrics 38(4), 963–974.
[3] Pinheiro, J.C. & Bates, D.M. (2000). Mixed Effects Models in S and S-PLUS, Springer, New York.
[4] R Development Core Team. (2004). R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna.
[5] Tuddenham, R.D. & Snyder, M.M. (1954). Physical growth of California boys and girls from birth to eighteen years, University of California Publications in Child Development 1, 183–364.
[6] Vorperian, H.K., Kent, R.D., Lindstrom, M.J., Kalina, C.M., Gentry, L.R. & Yandell, B.S. (2005). Development of vocal tract length during early childhood: a magnetic resonance imaging study, Journal of the Acoustical Society of America 117, 338–350.
(See also Generalized Linear Mixed Models; Marginal Models for Clustered Data) MARY J. LINDSTROM AND HOURI K. VORPERIAN
Nonlinear Models
L. JANE GOLDSMITH AND VANCE W. BERGER
Volume 3, pp. 1416–1419 in Encyclopedia of Statistics in Behavioral Science ISBN-13: 978-0-470-86080-9 ISBN-10: 0-470-86080-4 Editors Brian S. Everitt & David C. Howell John Wiley & Sons, Ltd, Chichester, 2005
Nonlinear Models
where X denotes a vector (perhaps of length one) of independent, or predictor, variables and θ denotes a vector of one or more (unknown) parameters in the model. f (X, θ) expresses the functional relationship between the expectation of Y , denoted E(Y ), and X and θ. That is, E(Y ) = f (X, θ). ε denotes the random error that prevents measurement of E(Y ) directly. ε is assumed to have a Gaussian distribution centered upon 0 and with constant variance for each independent observation of Y and X. That is, ε ∼ N (0, σ 2 ). An example of a nonlinear model is Y =
\frac{θ1 X}{θ2 + X} + ε,
(2)
the so-called Michaelis–Menten model. This model can be transformed to a linear form for E(Y), but the error properties of ε usually dictate the use of nonlinear estimation techniques. The Michaelis–Menten model has been used to model the velocity of a chemical reaction as a function of the underlying concentrations. Thus, a mnemonic representation is given by

V = \frac{θ1 C}{θ2 + C} + ε.   (3)

A graph of this model is presented in Figure 1. Here, we see the nonlinear relationship expected between V and C, with θ1 = 30 and θ2 = 0.5.
An obvious advantage of a nonlinear model is that the methodology allows modeling and prediction of relationships that are more complicated in nature than the familiar linear model. Nonlinear
[Figure 1  A Michaelis–Menten model: V = θ1C/(θ2 + C) = 30C/(0.5 + C), plotted as velocity against concentration.]
modeling also allows for special interpretation of parameters. In some cases, the parameters or functions of them describe special attributes of the nonlinear relationship that are apparent in graphs, such as (a) intercepts, (b) asymptotes, (c) maxima, (d) minima, and (e) inflection points. The following graph (Figure 2) demonstrates a graphical inter θ1 in the Michaelis–Menten pretation of θ = θ2 model: This demonstrates that θ1 is the maximum attainable value, expressed as an asymptote, for V , and θ2 is the so-called ‘half-dose,’ or the concentration C at which V attains half its theoretical maximum. Another nonlinear model is graphed in Figure 3:
Advantages of Nonlinear Models
A nonlinear model expresses a relationship between a response variable and one or more predictor variables as a nonlinear function. The stochastic, or random, model is used to indicate that the expected response is obscured somewhat by a random error element. The nonlinear model may be expressed as follows: Y = f (X, θ) + ε, (1)
[Figure 2  Parameter interpretation in a Michaelis–Menten model, V = 30C/(0.5 + C): V approaches 30 as a maximum asymptote, and 0.5 is the C value at which 1/2 the maximum is obtained.]
Parameter Estimation
[Figure 3  The autocatalytic growth model, H = α/(1 + βe^{−CT}), plotted against T: α is the asymptote, the starting value at T = 0 is α/(1 + β), and the inflection point is at ((ln β)/C, α/2).]
To estimate the parameters in a proposed nonlinear model Y = f(X, θ) + ε, data observations must be made on both the response variable Y and its predictors X. The number of observations made is denoted by n, or the ‘sample size.’ An iterative process is often used to find the parameter estimates θ̂. To find the ‘best’ estimates, or the ones likely to be close to the true, unknown θ, the iterative method tries to find the θ̂ that minimizes the sum of squared differences between the n observed Y’s and the predicted Y’s, the Ŷ’s. That is, the least-squares solution (see Least Squares Estimation) will be the θ̂ that minimizes the sum of squared residuals

\sum_{i=1}^{n} r_i^2 = \sum_{i=1}^{n} (Y_i − Ŷ_i)^2 .   (5)
The ‘autocatalytic growth model’ postulates that height H cannot exceed its maximum α, that it occurs as an S-shaped curve with inflection point ((ln β)/C, α/2), and that the starting height at T = 0 is α/(1 + β). The formula H = α/(1 + βe^{−CT}), with its logistic, S-shaped curve, assures that the curve is skew-symmetric about the inflection point. As we can see, a researcher familiar with data properties through graphs may be able to deduce or find a nonlinear model that reflects desired properties. Another advantage of nonlinear models is that the model can reflect properties based on relevant differential equations. For instance, the autocatalytic growth model is the solution to a specific differential equation. This rate equation states that the growth rate relative to the attained height is proportional to the amount of height yet unattained. In other words, as height approaches its maximum, the growth slows. The differential equation for which H = α/(1 + βe^{−CT}) is a solution is

\frac{dH}{dT} \frac{1}{H} = \frac{C(α − H)}{α} .   (4)

Often nonlinear models arise when researchers can specify a rate-of-change relationship, which can be expressed as either a single differential equation or as a system of differential equations. Choosing a candidate model can be a creative, enjoyable process in which researchers use their knowledge and experience to find the most promising model for their project.
In other words, the n Ŷi’s obtained using the least-squares optimal θ̂ will minimize the sum of squared residuals. There is a vast literature dealing with methods of optimization using computer methods (see Optimization Methods). Some techniques developed to minimize \sum_{i=1}^{n} r_i^2 are the steepest-descent or gradient method, the Newton method, the modified Gauss–Newton method, the Marquardt method, and the multivariate secant method. Most of these methods require an initial starting value for θ̂. This can often be obtained through graphical analysis of the data, using any known relationships between a θ component θi and graphical properties such as asymptotes, intercepts, and so on. If there is a linearizing transformation that is not being used due to error distribution considerations (as was the case for the Michaelis–Menten model above), then the linear least-squares solution for the transformed linear model may provide a good starting θ̂. Another possibility might be to use an initial θ̂ from the literature if the model has been used successfully in a previous study. Since there is not a closed-form solution, a criterion must be chosen to determine when a satisfactory solution has been found and the computer minimization program should stop iterating. Usually, the stopping rule is to abandon searching for a better solution and report the current θ̂ when the reduction of \sum_{i=1}^{n} r_i^2 is small in the most recent step. Sometimes there is difficulty obtaining convergence, with the computer program ‘cycling’ around a potential solution, but not homing in on the best answer due
Confidence Regions In addition to the point estimate θˆ , which is generally the least-squares solution, it is also desirable to report 95% (or some such high confidence level) confidence regions for θ. Researchers often desire confidence intervals for individual θi , the components of θ. Through advanced programming techniques, it is possible to compute exact confidence regions for θ. However, if the sample size is small, then it may be extremely difficult to compute the regions, or the regions themselves may be unsatisfactorily complicated. For example, a 95% exact confidence region for θ may consist of several disjoint, asymmetric areas in R k , where k is the dimension of θ. Approximate confidence ellipsoids for θ can be computed, along with approximate confidence intervals for the θi . The accuracy and precision of these approximate confidence regions depend upon two factors: the estimation properties of the chosen model and the sample size.
Estimation Properties The accuracy (unbiasedness) and precision (small size) of the approximate confidence ellipsoids and confidence intervals mentioned above depend upon an estimation property of a particular nonlinear model termed ‘close-to-linear behavior.’ As it turns out, a model with close-to-linear behavior will tend to produce good confidence regions and confidence intervals with relatively small sample sizes. Conversely, those models without this behavior may not produce similarly desirable confidence intervals. The properties of various types of nonlinear models have been rigorously studied, and researchers seeking an efficient nonlinear model, which can accomplish accurate inference with small sample sizes, can search the literature of nonlinear models to find a model with desirable estimation properties.
Model Validation The nonlinear model validation process is similar in some ways to that for linear regression (see Multiple Linear Regression). Scatter plots of residuals versus predicted values can be used to detect model fit. Small confidence intervals for the θi indicate precision. An important caveat is that R 2 , the familiar coefficient of determination from linear regression, is not meaningful in the context of a nonlinear model.
A Fitted Nonlinear Model

To illustrate these points, simulated data, with a sample size of n = 32, were generated using the Michaelis–Menten model discussed above:

V = \frac{θ1 C}{θ2 + C} + ε,   with θ1 = 30, θ2 = 0.5, and ε ∼ N(0, (1.5)²)
(6)
The nonlinear regression program (SAS version 8.02) converged in five iterations to the following estimates:
Parameter    Estimate    Estimated S.E.    Approximate 95% confidence limits
θ1           29.36       0.66              28.02, 30.70
θ2           0.44        0.05              0.34, 0.54
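The SAS results above are the authors’ own; as a rough, hypothetical re-creation of the same kind of simulation and fit in R (different random numbers, so the estimates will not match the table exactly):

```r
set.seed(123)                       # arbitrary seed; results will differ from the SAS run above
n <- 32
C <- runif(n, 0, 6)                 # concentrations on the range shown in the figures
V <- 30 * C / (0.5 + C) + rnorm(n, sd = 1.5)   # Michaelis-Menten mean plus N(0, 1.5^2) error

## Nonlinear least squares with rough starting values
fit <- nls(V ~ theta1 * C / (theta2 + C),
           start = list(theta1 = 25, theta2 = 1))
summary(fit)$coefficients           # estimates and approximate standard errors
```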
A graph is presented in Figure 4. There are some excellent texts dealing with nonlinear models. See, for example, [1], [2], [3], [4], [5], and [6].
to technical problems with the method chosen or a poorly chosen starting value. Trying again with a new starting θˆ or choosing a new iterative optimization technique are obvious suggestions when convergence difficulties occur. This approach will also help ensure that if convergence is reached, then what was found was a true global minimum, and not just a local one.
20 Theoretical line
15
Estimated line
10
Data points
5 0 0
Figure 4
1
2
3 4 Concentration
A Fitted Michaelis-Menton Model
5
6
4
Nonlinear Models
References [1] [2]
[3]
Bates, D.M. & Watts, D.G. (1988). Nonlinear Regression Analysis and its Applications, Wiley, New York. Dixon, W.J. chief editor. (1992). BMDP Statistical Software Manual, Vol. 2, University of California Press, Berkeley. Draper, N.R. & Smith, H. (1998). Applied Regression Analysis, 3rd Edition, Wiley, New York.
[4] [5] [6]
Ratkowsky, D.A. (1990). Handbook of Nonlinear Regression Models, Marcel Dekker, New York. SAS (Statistical Analysis System). Online Documentation, Release 8.02, (1999). SAS Institute, Cary. Seber, G.A.F. & Wild, C.J. (1989). Nonlinear Regression, Wiley, New York.
L. JANE GOLDSMITH
AND
VANCE W. BERGER
Nonparametric Correlation (rs ) DAVID C. HOWELL Volume 3, pp. 1419–1420 in Encyclopedia of Statistics in Behavioral Science ISBN-13: 978-0-470-86080-9 ISBN-10: 0-470-86080-4 Editors Brian S. Everitt & David C. Howell John Wiley & Sons, Ltd, Chichester, 2005
Nonparametric Correlation (rs )
Spearman’s Correlation for Ranked Data

Let (x1 , x2 , . . . , xn ) and (y1 , y2 , . . . , yn ) be random samples of sizes n from the two random variables, X and Y. The scale of measurement of the two random variables is at least ordinal and, to avoid problems with ties, ought to be strictly continuous. Rank the data within each variable separately and let (rx1 , rx2 , . . . , rxn ) and (ry1 , ry2 , . . . , ryn ) represent the ranked data. Our goal is to correlate the two sets of ranks and obtain a test of significance on that correlation. Spearman provided a formula for this correlation, assuming no ties in the ranks, as

rS = 1 − \frac{6 \sum D_i^2}{N(N^2 − 1)}   (1)

where Di is defined as the set of differences between rxi and ryi . Spearman derived this formula (using such equalities as the sum of the first N integers = N(N + 1)/2) at a time when calculations were routinely carried out by hand. The result of using this formula is exactly the same as applying the traditional Pearson product-moment correlation coefficient to the ranked data, and that is the approach commonly used today. The Pearson formula is correct regardless of whether there are tied ranks, whereas the Spearman formula requires a correction for ties.
Example

The Data and Story Library website (DASL) (http://lib.stat.cmu.edu/DASL/Stories/AlcoholandTobacco.html) provides data on the average weekly spending on alcohol and tobacco products for each of 11 regions in Great Britain. The data follow in Table 1. Columns 2 and 3 contain expenditures in pounds, and Columns 4 and 5 contain the ranked data for their respective expenditures. Though it is not apparent from looking at either the alcohol or tobacco variable alone, in a bivariate plot it is clear that Northern Ireland is a major outlier.

Table 1  Expenditures for alcohol and tobacco by region of Great Britain

Region             Alcohol   Tobacco   Rank A   Rank T
North              6.47      4.03      11       9
Yorkshire          6.13      3.76      9        7
Northeast          6.19      3.77      10       8
East Midlands      4.89      3.34      4        4
West Midlands      5.63      3.47      6        5
East Anglia        4.52      2.92      2        2
Southeast          5.89      3.20      7        3
Southwest          4.79      2.71      3        1
Wales              5.27      3.53      5        6
Scotland           6.08      4.51      8        10
Northern Ireland   4.02      4.56      1        11
Similarly, the distribution of alcohol expenditures is decidedly nonnormal, whereas the ranked data on alcohol, like all ranks, is rectangularly distributed. The Spearman correlation coefficient (rS ) for this sample is 0.373, whereas the Pearson correlation (r) on the raw-score variables is 0.224. There is no generally accepted method for estimation of the standard error of rS for small samples. This makes the computation of confidence limits problematic, but with very small samples the width of the confidence interval is likely to be very large in any event. Kendall [1] has suggested that for sample sizes greater than 10, rS can be tested in the same manner as the Pearson correlation (see Pearson Product Moment Correlation). For these data, because of the extreme outlier for Northern Ireland, the correlation is not significant using the standard approach for Pearson’s r (p = 0.259, two tailed) or using a randomization test (p = 0.252, two tailed). There are alternative approaches to rank correlation. See Nonparametric Correlation (tau) for an alternative method for correlating ranked data.
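In R, these quantities can be reproduced directly from the data in Table 1; note that the p value printed by cor.test may differ slightly from the approximate test described above.

```r
## Expenditure data from Table 1 (pounds per week), in the region order listed there.
alcohol <- c(6.47, 6.13, 6.19, 4.89, 5.63, 4.52, 5.89, 4.79, 5.27, 6.08, 4.02)
tobacco <- c(4.03, 3.76, 3.77, 3.34, 3.47, 2.92, 3.20, 2.71, 3.53, 4.51, 4.56)

cor(alcohol, tobacco)                       # Pearson r  (about 0.224)
cor(alcohol, tobacco, method = "spearman")  # Spearman rS (about 0.373)

## Significance test on rS for these n = 11 pairs
cor.test(alcohol, tobacco, method = "spearman")
```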
Reference

[1] Kendall, M.G. (1948). Rank Correlation Methods, Griffen Publishing, London.
(See also Kendall’s Tau – τ ) DAVID C. HOWELL
Nonparametric Correlation (tau) DAVID C. HOWELL Volume 3, pp. 1420–1421 in Encyclopedia of Statistics in Behavioral Science ISBN-13: 978-0-470-86080-9 ISBN-10: 0-470-86080-4 Editors Brian S. Everitt & David C. Howell John Wiley & Sons, Ltd, Chichester, 2005
Nonparametric Correlation (tau)
Kendall’s Correlation for Ranked Data (τ)

Let (x1 , x2 , . . . , xn ) and (y1 , y2 , . . . , yn ) be random samples of sizes n from the two random variables, X and Y. The scale of measurement of the two random variables is at least ordinal and, to avoid problems with ties, ought to be strictly continuous. Rank the data within each variable separately and let (rx1 , rx2 , . . . , rxn ) and (ry1 , ry2 , . . . , ryn ) represent the ranked data. Our goal is to correlate the two sets of ranks and obtain a test of significance on that correlation. The correlation coefficient is based on the number of concordant and discordant rankings, or, equivalently, the number of inversions in rank. We will rank the data separately for each variable and let C represent the number of concordant pairs of observations, which is the number of pairs of observations whose rankings on the two variables are ordered in the same direction. We will let D represent the number of discordant pairs (pairs where the rankings on the two variables are in the opposite direction), and let S = C − D. Then

τ = \frac{2S}{n(n − 1)} .   (1)

(It is not essential that we actually do the ranking, though it is easier to work with ranks than with raw scores.)

Example

The Data and Story Library website (DASL) (http://lib.stat.cmu.edu/DASL/Stories/AlcoholandTobacco.html) provides data on the average weekly spending on alcohol and tobacco products for each of 11 regions in Great Britain. The data follow. Columns 2 and 3 contain expenditures in pounds, and Columns 4 and 5 contain the ranked data for their respective expenditures as in Table 1. Though it is not apparent from looking at either the Alcohol or Tobacco variable alone, in a bivariate plot it is clear that Northern Ireland is a major outlier.

Table 1  Alcohol and tobacco expenditures in Great Britain, broken down by region

Region             Alcohol   Tobacco   Rank A   Rank T
North              6.47      4.03      11       9
Yorkshire          6.13      3.76      9        7
Northeast          6.19      3.77      10       8
East Midlands      4.89      3.34      4        4
West Midlands      5.63      3.47      6        5
East Anglia        4.52      2.92      2        2
Southeast          5.89      3.20      7        3
Southwest          4.79      2.71      3        1
Wales              5.27      3.53      5        6
Scotland           6.08      4.51      8        10
Northern Ireland   4.02      4.56      1        11

Similarly, the distribution of Alcohol expenditures is decidedly nonnormal, whereas the ranked data on alcohol, like all ranks, are rectangularly distributed. There are n(n − 1)/2 = 11(10)/2 = 55 pairs of rankings. 37 of those pairs are concordant, while 18 are discordant. (For example, Yorkshire ranks lower than the North of England on both Alcohol and Tobacco, so that pair of rankings is concordant. On the other hand, Scotland ranks lower than the North of England on Alcohol expenditures, but higher on Tobacco expenditures, so that pair of rankings is discordant.) Then

C = 37,   D = 18,
S = C − D = 37 − 18 = 19,
τ = \frac{2S}{N(N − 1)} = \frac{38}{110} = 0.345.   (2)
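In R, τ and an accompanying test can be obtained directly from the Table 1 data; cor.test uses an exact p value for small samples without ties, so its result may differ somewhat from the normal approximation worked out below.

```r
## Same expenditure data as in Table 1.
alcohol <- c(6.47, 6.13, 6.19, 4.89, 5.63, 4.52, 5.89, 4.79, 5.27, 6.08, 4.02)
tobacco <- c(4.03, 3.76, 3.77, 3.34, 3.47, 2.92, 3.20, 2.71, 3.53, 4.51, 4.56)

cor(alcohol, tobacco, method = "kendall")   # tau (about 0.345)

## Test of tau for these n = 11 pairs
cor.test(alcohol, tobacco, method = "kendall")
```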
For these data Kendall’s τ = 0.345. Unlike Spearman’s rS , there is an accepted method for estimation of the standard error of Kendall’s τ [1]:

sτ = \sqrt{ \frac{2(2N + 5)}{9N(N − 1)} } .   (3)

Moreover, τ is approximately normally distributed for N ≥ 10. This allows us to approximate the sampling distribution of Kendall’s τ using the normal approximation:

z = \frac{τ}{sτ} = \frac{0.345}{ \sqrt{ 2(27) / [9(11)(10)] } } = \frac{0.345}{0.2335} = 1.48.   (4)

For a two-tailed test p = .139, which is not significant. With a standard error of 0.2335, the confidence limits on Kendall’s τ , assuming normality of τ , would be

CI = τ ± 1.96 sτ = τ ± 1.96 \sqrt{ \frac{2(2N + 5)}{9N(N − 1)} } = τ ± 1.96(0.2335).   (5)

For our example this would produce confidence limits of −0.11 ≤ τ ≤ 0.80.

There are alternative approaches to rank correlation. See Nonparametric Correlation (rs) for an alternative method for correlating ranked data. Kendall’s τ has generally been given preference over Spearman’s rS because it is a better estimate of the corresponding population parameter, and its standard error is known.

Reference

[1] Kendall, M.G. (1948). Rank Correlation Methods, Griffen Publishing, London.
DAVID C. HOWELL
Nonparametric Item Response Theory Models KLAAS SIJTSMA Volume 3, pp. 1421–1426 in Encyclopedia of Statistics in Behavioral Science ISBN-13: 978-0-470-86080-9 ISBN-10: 0-470-86080-4 Editors Brian S. Everitt & David C. Howell John Wiley & Sons, Ltd, Chichester, 2005
Nonparametric Item Response Theory Models Goals of Nonparametric Item Response Theory Nonparametric item response theory (NIRT) models are used for analyzing the data collected by means of tests and questionnaires consisting of J items. The goals are to construct a scale for the ordering of persons and – depending on the application of this scale – for the items. To attain these goals, test constructors and other researchers primarily want to know whether their items measure the same or different traits, and whether the items are of sufficient quality to distinguish people with relatively low and high standings on these traits. These questions relate to the classical issues of validity (see Validity Theory and Applications) and reliability, respectively. Other issues of interest are differential item functioning, person-fit analysis, and skill identification and cognitive modeling. NIRT models are most often used in small-scale testing applications. Typical examples are intelligence and personality testing. Most intelligence tests and personality inventories are applied to individuals only once, each individual is administered the same test, and testing often but not necessarily is individual. Another example is the measurement of attitudes, typical of sociological or political science research. Attitude questionnaires typically consist of, say, 5 to 15 items, and the same questionnaire is administered to each respondent in the sample. NIRT is also applied in educational testing, preference measurement in marketing, and health-related quality-of-life measurement in a medical context. See [16] for a list of applications. NIRT is interesting for at least two reasons. First, because it provides a less demanding framework for test and questionnaire data analysis than parametric item response theory, NIRT is more data-oriented, more exploratory and thus more flexible than parametric IRT; see [7]. Second, because it is based on weaker assumptions than parametric IRT, NIRT can be used as a framework for the theoretical exploration of the possibilities and the boundaries of IRT in general; see, for example, [4] and [6].
Assumptions of Nonparametric IRT Models

The assumptions typical of NIRT models, and often shared with parametric IRT models, are the following:

• Unidimensionality (UD). A unidimensional IRT model contains one latent variable, usually denoted by θ, that explains the variation between tested individuals. From a fitting unidimensional IRT model, it is inferred that performance on the test or questionnaire is driven by one ability, achievement, personality trait, or attitude. Multidimensional IRT models exist that assume several latent variables to account for the data.

• Local independence (LI). Let Xj be the random variable for the score on item j (j = 1, . . . , J); let xj be a realization of Xj ; and let X and x be the vectors containing J item score variables and J realizations, respectively. Also, let P(Xj = xj |θ) be the conditional probability of a score of xj on item j. Then, a latent variable θ, possibly multidimensional, exists such that the joint conditional probability of J item responses can be written as

  P(X = x|θ) = \prod_{j=1}^{J} P(Xj = xj |θ).   (1)

  An implication of LI is that for any pair of items, say j and k, their conditional covariance equals 0; that is, Cov(Xj , Xk |θ) = 0.

• Monotonicity (M). For binary item scores, Xj ∈ {0, 1}, with score zero for an incorrect answer and score one for a correct answer, we define Pj (θ) ≡ P(Xj = 1|θ). This is the item response function (IRF). Assumption M says that the IRF is monotone nondecreasing in θ. For ordered rating scale scores, Xj ∈ {0, . . . , m}, a similar monotonicity assumption can be made with respect to response probability, P(Xj ≥ xj |θ).
Parametric IRT models typically restrict the IRF, Pj (θ), by means of a parametric function, such as the logistic. A well-known example is the threeparameter logistic IRF. Let γj denote the lower asymptote of the logistic IRF for item j , interpreted as the pseudochance probability; let δj denote the location of item j on the θ scale, interpreted as the
difficulty; and let αj (> 0) correspond to the steepest slope of the logistic function, which is located at parameter δj and interpreted as the discrimination. Then the IRF of the three-parameter logistic model is

   Pj(θ) = γj + (1 − γj) · exp[αj(θ − δj)] / {1 + exp[αj(θ − δj)]}.   (2)
Many other parametric IRT models have been proposed; see [20] for an overview. NIRT models only impose order restrictions on the IRF, but refrain from a parametric definition. Thus, assumption M may be the only restriction, so that for any two fixed values θa < θb, we have that

   Pj(θa) ≤ Pj(θb).   (3)
The NIRT model based on assumptions UD, LI, and M is the monotone homogeneity model [8]. Assumption M may be further relaxed by assuming that the mean of the J IRFs is increasing, but not each of the individual IRFs [17]. This mean is the test response function, denoted by T(θ) and defined as

   T(θ) = J^{−1} ∑_{j=1}^{J} Pj(θ), increasing in θ.   (4)

Another relaxation of assumptions is that of strict unidimensionality, defined as assumption UD, to essential unidimensionality [17]. Here, the idea is that one dominant trait drives test performance in particular, but that there are also nuisance traits active, whose influence is minor. In general, one could say that NIRT strives for defining models that are based on relatively weak assumptions while maintaining desirable measurement properties. For example, it has been shown [2] that the assumptions of UD, LI, and M imply that the total score X+ = ∑_{j=1}^{J} Xj stochastically orders latent variable θ; that is, for two values of X+, say x+a < x+b, and any value t of θ, assumptions UD, LI, and M imply that

   P(θ ≥ t | X+ = x+a) ≤ P(θ ≥ t | X+ = x+b).   (5)

Reference [3] calls (5) a stochastic ordering of the latent trait (SOL). SOL implies that for higher X+ values the θs are expected to be higher on average. Thus, (5) defines an ordinal scale for person measurement: if the monotone homogeneity model fits the data, total score X+ can be used for ordering persons with respect to latent variable θ, which by itself is not estimated.
The three-parameter logistic model is a special case of the monotone homogeneity model, because it has monotone increasing logistic IRFs and assumes UD and LI. Thus, SOL also holds for this model. The item parameters, γ, δ, and α, and the latent variable θ can be solved from the likelihood of this model. These estimates can be used to calibrate a metric scale that is convenient for equating, item banking, and adaptive testing in large-scale testing [20]. NIRT models are candidates for test construction, in particular, when an ordinal scale for respondents is sufficient for the application envisaged.
Another class of NIRT models is based on stronger assumptions. For example, to have an ordering of items that is the same for all values of θ, with the possible exception of ties for some θs, it is necessary to assume that the J items have IRFs that do not intersect. This is called an invariant item ordering (IIO, [15]). Formally, J items have an IIO when they can be ordered and numbered such that

   P1(θ) ≤ P2(θ) ≤ · · · ≤ PJ(θ), for all θ.   (6)
A set of items that is characterized by an IIO facilitates the interpretation of results from differential item functioning and person-fit analysis, and provides the underpinnings of the use of individual starting and stopping rules in intelligence testing and the hypothesis testing of item orderings that reflect, for example, ordered developmental stages. The NIRT model based on the assumptions of UD, LI, M, and IIO is the double monotonicity model [8].
The generalization of the SOL and IIO properties from dichotomous-item IRT models to polytomous-item IRT models is not straightforward. Within the class of known polytomous IRT models, SOL can only be generalized to the parametric partial credit model [20] but not to any other model. Reference [19] demonstrated that although SOL is not implied by most models, it is a robust property for most tests in most populations, as simulated in a robustness study. For J polytomously scored items, an IIO is defined as

   E(X1 | θ) ≤ E(X2 | θ) ≤ · · · ≤ E(XJ | θ), for all θ.   (7)

Thus, the ordering of the mean item scores is the same, except for possible ties, for each value of
θ. IIO can only be generalized to the parametric rating scale model [20] and similarly restrictive IRT models. See [13] and [14] for nonparametric models that imply an IIO. Because SOL and IIO are not straightforwardly generalized to polytomous-item IRT models, and because these models are relatively complicated, we restrict further attention mostly to dichotomous-item IRT models. More work on the foundations of IRT through studying NIRT has been done, for example, by [1, 3, 4, 6, 17]. See [8] and [16] for monographs on NIRT.
Evaluating Model-data Fit

Several methods exist for investigating fit of NIRT models to test and questionnaire data. These methods are based mostly on one of two properties of observable variables implied by the NIRT models.
Conditional-association Based Methods

The first observable property is conditional association [4]. Split item score vector X into two disjoint part vectors, X = (Y, Z). Define f1 and f2 to be nondecreasing functions in the item scores from Y, and g to be some function of the item scores in Z. Then UD, LI, and M imply conditional association in terms of the covariance, denoted by Cov, as

   Cov[f1(Y), f2(Y) | g(Z) = z] ≥ 0, for all z.   (8)
Two special cases of (8) constitute the basis of model-data fit methods:

Unconditional inter-item covariances. If function g(Z) selects the whole group, and f1(Y) = Xj and f2(Y) = Xk, then a special case of (8) is

   Cov(Xj, Xk) ≥ 0, all pairs j, k; j < k.   (9)
Negative inter-item covariances give evidence of model-data misfit. Let Cov(Xj, Xk)max be the maximum covariance possible given the marginals of the cross table for the bivariate frequencies on these items. Reference [8] defined coefficient Hjk as

   Hjk = Cov(Xj, Xk) / Cov(Xj, Xk)max.   (10)
Equation (9) implies that 0 ≤ Hjk ≤ 1. Thus, positive values of Hjk found in real data support the monotone homogeneity model, while negative values reject the model. Coefficient Hjk has been generalized to (1) an item coefficient, Hj, which expresses the degree to which item j belongs with the other J − 1 items in one scale; and (2) a scalability coefficient, H, which expresses the degree to which persons can be reliably ordered on the θ scale using total score X+. An item selection algorithm has been proposed [8, 16] and implemented in the computer program MSP5 [9], which selects items from a larger set into clusters that contain items having relatively high Hj values with respect to one another – say, Hj ≥ c, often with c > 0.3 (c user-specified) – while unselected items have Hj values smaller than c. Because, for a set of J items, H ≥ min(Hj) [8], item selection produces scales for which H ≥ c. If c ≥ 0.3, person ordering is at least weakly reliable [16]. Such scales can be used in practice for person measurement, while each scale identifies another latent variable.

Conditional inter-item covariances. First, define a total score – here called a rest score and denoted R – based on X as

   R(−j,−k) = ∑_{h ≠ j,k} Xh.   (11)
Second, define function g(Z) = R(−j,−k), and let f1(Y) = Xj and f2(Y) = Xk. Equation (8) implies that

   Cov[Xj, Xk | R(−j,−k) = r] ≥ 0, all j, k; j < k; all r = 0, 1, . . . , J − 2.   (12)
That is, in the subgroup of respondents who have the same rest score r, the covariance between items j and k must be nonnegative. Equation (12) is the basis of procedures that try to find an item subset structure for the whole test that approximates local independence as well as possible. The optimal solution best represents the latent variable structure of the test data. See the computer programs DETECT and HCA/CCPROX [18] for exploratory item selection, and DIMTEST [17] for confirmatory hypothesis testing with respect to test composition.
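As a concrete illustration of these covariance-based checks, the following Python sketch computes the scalability coefficient Hjk for a pair of binary items and the conditional covariances within rest-score groups. It is an illustrative sketch only; the simulated data and function names are ours, and the code is not taken from MSP5, DETECT, or HCA/CCPROX.

import numpy as np

def scalability_hjk(x_j, x_k):
    # H_jk: observed covariance divided by the maximum covariance attainable
    # given the item marginals (for 0/1 items, max P(Xj=1, Xk=1) = min(p_j, p_k)).
    p_j, p_k = x_j.mean(), x_k.mean()
    cov = np.cov(x_j, x_k, bias=True)[0, 1]
    cov_max = min(p_j, p_k) - p_j * p_k
    return cov / cov_max

def conditional_covariances(x_j, x_k, rest):
    # Covariance of items j and k within each rest-score group R_(-j,-k) = r;
    # under UD, LI, and M these should all be nonnegative, cf. Equation (12).
    out = {}
    for r in np.unique(rest):
        mask = rest == r
        if mask.sum() > 1:
            out[int(r)] = float(np.cov(x_j[mask], x_k[mask], bias=True)[0, 1])
    return out

# Toy example: simulated 0/1 responses of 500 persons to J = 5 items.
rng = np.random.default_rng(1)
theta = rng.normal(size=500)
delta = np.linspace(-1.0, 1.0, 5)
X = (rng.random((500, 5)) < 1.0 / (1.0 + np.exp(-(theta[:, None] - delta)))).astype(int)
j, k = 0, 1
rest = X[:, [h for h in range(5) if h not in (j, k)]].sum(axis=1)
print(scalability_hjk(X[:, j], X[:, k]))
print(conditional_covariances(X[:, j], X[:, k], rest))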
Manifest-monotonicity Based Methods

The second observable property is manifest monotonicity [6]. It can be used to investigate assumption
M. To estimate the IRF for item j, Pj(θ), first a sum score on the J − 1 items excluding item j,

   R(−j) = ∑_{k ≠ j} Xk,   (13)
is used as an estimate of θ, and then the conditional probability P[Xj = 1 | R(−j) = r] is calculated for all values r of R(−j). Given the assumptions of UD, LI, and M, the conditional probability P[Xj = 1 | R(−j)] must be nondecreasing in R(−j); this is manifest monotonicity.
Investigating assumption M. The computer program MSP5 can be used for estimating probabilities, P[Xj = 1 | R(−j)], plotting the discrete response functions for R(−j) = 0, . . . , J − 1, and testing violations of manifest monotonicity for significance. The program TestGraf98 [11, 12] estimates continuous response functions using kernel smoothing, and provides many graphics. These response functions include, for example, the option response functions for each of the response options of a multiple-choice item.
Investigating assumption IIO. To investigate whether the items j and k have intersecting IRFs, the conditional probabilities P[Xj = 1 | R(−j,−k)] and P[Xk = 1 | R(−j,−k)] can be compared for each value R(−j,−k) = r, and the sign of the difference can be compared with the sign of the difference of the sample item means, X̄j and X̄k, for the whole group. Opposite signs for some r values indicate intersection of the IRFs and are tested against the null hypothesis that P[Xj = 1 | R(−j,−k) = r] = P[Xk = 1 | R(−j,−k) = r] – meaning that the IRFs coincide locally – in the population. This method and other methods for investigating an IIO have been discussed and compared by [15] and [16]. MSP5 [9] can be used for investigating IIO.
Many of the methods mentioned have been generalized to polytomous items, but research in this area is still going on. Finally, we mention that methods for estimating the reliability of total score X+ have been developed under the assumptions of UD, LI, M, and IIO, both for dichotomous and polytomous items [16].
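A minimal sketch of the manifest-monotonicity check described above, assuming an n-by-J matrix of 0/1 item scores. It only tabulates the estimated probabilities and flags decreases; the grouping of adjacent rest scores and the significance tests provided by MSP5 and TestGraf98 are omitted.

import numpy as np

def manifest_monotonicity_check(X, j):
    # Estimate P[X_j = 1 | R_(-j) = r] for each rest score r = 0, ..., J-1
    # and report the places where the estimate decreases.
    rest = X.sum(axis=1) - X[:, j]
    probs = np.full(X.shape[1], np.nan)
    for r in range(X.shape[1]):
        mask = rest == r
        if mask.any():
            probs[r] = X[mask, j].mean()
    decreases = [(r, probs[r - 1], probs[r])
                 for r in range(1, X.shape[1])
                 if not np.isnan(probs[r - 1]) and not np.isnan(probs[r])
                 and probs[r] < probs[r - 1]]
    return probs, decreases

# Usage with a 0/1 item-score matrix X (for example the simulated X above):
# probs, decreases = manifest_monotonicity_check(X, j=2)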
Developments, Alternative Models

NIRT developed later than parametric IRT. It is an expanding area, both theoretically and practically. New developments are in adaptive testing, differential item functioning, person-fit analysis, and cognitive modeling. The analysis of the dimensionality of test and questionnaire data has received much attention, using procedures implemented in the programs DETECT, HCA/CCPROX, DIMTEST, and MSP5. Latent class analysis has been used to formulate NIRT models as discrete, ordered latent class models and to define fit statistics for these models. Modern estimation methods such as Markov chain Monte Carlo have been used to estimate and fit NIRT models. Many other developments are ongoing.
The theory discussed so far was developed for analyzing data generated by means of monotone IRFs, that is, data that conform to the assumption that a higher θ value corresponds with a higher expected item score, both for dichotomous and polytomous items. Some item response data reflect a choice process governed by personal preferences for some but not all items or stimuli, and assumption M is not adequate. For example, a marketing researcher may present subjects with J brands of beer, and ask them to pick any number of brands that they prefer in terms of bitterness; or a political scientist may present a sample of voters with candidates for the presidency and ask them to order them with respect to perceived trustworthiness. The data resulting from such tasks require IRT models with single-peaked IRFs. The maximum of such an IRF identifies the item location or an interval on the scale – degree of bitterness or trustworthiness – at which the maximum probability of picking that stimulus is obtained. See [10] and [5] for the theoretical foundation of NIRT models for single-peaked IRFs and methods for investigating model-data fit.
References

[1] Ellis, J.L. & Van den Wollenberg, A.L. (1993). Local homogeneity in latent trait models. A characterization of the homogeneous monotone IRT model, Psychometrika 58, 417–429.
[2] Grayson, D.A. (1988). Two-group classification in latent trait theory: scores with monotone likelihood ratio, Psychometrika 53, 383–392.
[3] Hemker, B.T., Sijtsma, K., Molenaar, I.W. & Junker, B.W. (1997). Stochastic ordering using the latent trait and the sum score in polytomous IRT models, Psychometrika 62, 331–347.
[4] Holland, P.W. & Rosenbaum, P.R. (1986). Conditional association and unidimensionality in monotone latent variable models, The Annals of Statistics 14, 1523–1543.
[5] Johnson, M.S. & Junker, B.W. (2003). Using data augmentation and Markov chain Monte Carlo for the estimation of unfolding response models, Journal of Educational and Behavioral Statistics 28, 195–230.
[6] Junker, B.W. (1993). Conditional association, essential independence and monotone unidimensional item response models, The Annals of Statistics 21, 1359–1378.
[7] Junker, B.W. & Sijtsma, K. (2001). Nonparametric item response theory in action: an overview of the special issue, Applied Psychological Measurement 25, 211–220.
[8] Mokken, R.J. (1971). A Theory and Procedure of Scale Analysis, De Gruyter, Berlin.
[9] Molenaar, I.W. & Sijtsma, K. (2000). MSP5 for Windows. User's Manual, iec ProGAMMA, Groningen.
[10] Post, W.J. (1992). Nonparametric Unfolding Models: A Latent Structure Approach, DSWO Press, Leiden.
[11] Ramsay, J.O. (1991). Kernel smoothing approaches to nonparametric item characteristic curve estimation, Psychometrika 56, 611–630.
[12] Ramsay, J.O. (2000). A Program for the Graphical Analysis of Multiple Choice Test and Questionnaire Data, Department of Psychology, McGill University, Montreal.
[13] Scheiblechner, H. (1995). Isotonic ordinal probabilistic models (ISOP), Psychometrika 60, 281–304.
[14] Sijtsma, K. & Hemker, B.T. (1998). Nonparametric polytomous IRT models for invariant item ordering, with results for parametric models, Psychometrika 63, 183–200.
[15] Sijtsma, K. & Junker, B.W. (1996). A survey of theory and methods of invariant item ordering, British Journal of Mathematical and Statistical Psychology 49, 79–105.
[16] Sijtsma, K. & Molenaar, I.W. (2002). Introduction to Nonparametric Item Response Theory, Sage Publications, Thousand Oaks.
[17] Stout, W.F. (1990). A new item response theory modeling approach with applications to unidimensionality assessment and ability estimation, Psychometrika 55, 293–325.
[18] Stout, W.F., Habing, B., Douglas, J., Kim, H., Roussos, L. & Zhang, J. (1996). Conditional covariance based nonparametric multidimensionality assessment, Applied Psychological Measurement 20, 331–354.
[19] Van der Ark, L.A. (2004). Practical consequences of stochastic ordering of the latent trait under various polytomous IRT models, Psychometrika (in press).
[20] Van der Linden, W.J. & Hambleton, R.K. (eds) (1997). Handbook of Modern Item Response Theory, Springer, New York.
KLAAS SIJTSMA
Nonparametric Regression JOHN FOX Volume 3, pp. 1426–1430 in Encyclopedia of Statistics in Behavioral Science ISBN-13: 978-0-470-86080-9 ISBN-10: 0-470-86080-4 Editors Brian S. Everitt & David C. Howell John Wiley & Sons, Ltd, Chichester, 2005
Nonparametric Regression

Nonparametric regression analysis traces the dependence of a response variable (y) on one or several predictors (xs) without specifying in advance the function that relates the response to the predictors:

   E(yi) = f(x1i, . . . , xpi)   (1)
where E(yi) is the mean of y for the ith of n observations. It is typically assumed that the conditional variance of y, Var(yi | x1i, . . . , xpi), is a constant, and that the conditional distribution of y is normal, although these assumptions can be relaxed. Nonparametric regression is therefore distinguished from linear regression (see Multiple Linear Regression), in which the function relating the mean of y to the xs is linear in the parameters,

   E(yi) = α + β1x1i + · · · + βpxpi   (2)
and from traditional nonlinear regression, in which the function relating the mean of y to the xs, though nonlinear in its parameters, is specified explicitly, E(yi ) = f (x1i , . . . , xpi ; γ1 , . . . , γk )
(3)

In traditional regression analysis, the object is to estimate the parameters of the model – the βs or γs. In nonparametric regression, the object is to estimate the regression function directly. There are many specific methods of nonparametric regression. Most, but not all, assume that the regression function is in some sense smooth. Several of the more prominent methods are described in this article. Moreover, just as traditional linear and nonlinear regression can be extended to generalized linear models and nonlinear regression models that accommodate nonnormal error distributions, the same is true of nonparametric regression. There is a large literature on nonparametric regression analysis, both in scientific journals and in texts. For more extensive introductions to the subject, see in particular Bowman and Azzalini [1], Fox [2, 3], Hastie, Tibshirani, and Friedman [4], Hastie and Tibshirani [5], and Simonoff [6].
The simplest use of nonparametric regression is in smoothing scatterplots (see Scatterplot Smoothers). Here, there is a numerical response y and a single predictor x, and we seek to clarify visually the relationship between the two variables in a scatterplot. Figure 1, for example, shows the relationship between female expectation of life at birth and GDP per capita for 154 nations of the world, as reported in 1998 by the United Nations. Two fits to the data are shown, both employing local-linear regression (described below); the solid line represents a fit to all of the data, while the broken line omits four outlying nations, labelled on the graph, which have values of female life expectancy that are unusually low given GDP per capita. It is clear that although there is a positive relationship between expectation of life and GDP, the relationship is highly nonlinear, leveling off substantially at high levels of GDP.
Three common methods of nonparametric regression are kernel estimation (see Kernel Smoothing), local-polynomial regression (which is a generalization of kernel estimation), and smoothing splines. Nearest-neighbor kernel estimation proceeds as follows (as illustrated for the UN data in Figure 2; a short computational sketch follows the list):
1. Let x0 denote a focal x-value at which f(x) is to be estimated; in Figure 2(a), the focal value is the 80th ordered x-value in the UN data, x(80).
2. Find the m nearest x-neighbors of x0, where s = m/n is called the span of the kernel smoother. In the example, the span was set to s = 0.5, and thus m = 0.5 × 154 = 77. Let h represent the half-width of a window encompassing the m nearest neighbors of x0. The larger the span (and hence the value of h), the smoother the estimated regression function. Define a symmetric unimodal weight function, centered on the focal observation, that goes to zero (or nearly zero) at the boundaries of the neighborhood around the focal value. The specific choice of weight function is not critical; in Figure 2(b), the tricube weight function is used:

   WT(x) = [1 − (|x − x0|/h)^3]^3   for |x − x0|/h < 1,
   WT(x) = 0                        for |x − x0|/h ≥ 1.   (4)

   A Gaussian (normal) density function is another common choice.
3. Using the tricube (or other appropriate) weights, calculate the weighted average of the y-values to obtain the fitted value

   ŷ0 = f̂(x0) = Σi WT(xi) yi / Σi WT(xi),   (5)

   as illustrated in Figure 2(c). Greater weight is thus accorded to observations whose x-values are close to the focal x0.
4. Repeat this procedure at a range of x-values spanning the data – for example, at the ordered observations x(1), x(2), . . . , x(n). Connecting the fitted values, as in Figure 2(d), produces an estimate of the regression function.

Figure 1  Female expectation of life by GDP per capita, for 154 nations of the world. The solid line is for a local-linear regression with a span of 0.5, while the broken line is for a similar fit deleting the four outlying observations that are labeled on the plot
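As a rough sketch of steps 1 through 4, here is a simple Python implementation of the nearest-neighbor tricube kernel smoother. The simulated GDP and life-expectancy values are invented stand-ins for the UN data, and the function names are ours.

import numpy as np

def tricube(u):
    # Tricube weight function W_T, cf. Equation (4).
    w = (1 - np.abs(u) ** 3) ** 3
    return np.where(np.abs(u) < 1, w, 0.0)

def kernel_smooth(x, y, span=0.5):
    # Nearest-neighbor kernel (locally weighted average) estimate of E(y|x),
    # evaluated at the observed x-values, following steps 1-4 above.
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    m = max(2, int(round(span * n)))           # number of nearest neighbors
    fitted = np.empty(n)
    for i, x0 in enumerate(x):                 # step 4: repeat over focal values
        dist = np.abs(x - x0)
        h = np.sort(dist)[m - 1]               # half-width of the neighborhood
        w = tricube(dist / h)                  # step 2: tricube weights
        fitted[i] = np.sum(w * y) / np.sum(w)  # step 3: weighted average, Eq. (5)
    return fitted

# Hypothetical usage with simulated data standing in for the UN example:
rng = np.random.default_rng(0)
gdp = rng.uniform(500, 25000, 154)
life = 45 + 25 * (1 - np.exp(-gdp / 5000)) + rng.normal(0, 3, 154)
smooth = kernel_smooth(gdp, life, span=0.5)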
Figure 2  How the kernel estimator works: (a) A neighborhood including the 77 observations closest to x(80), corresponding to a span of 0.5. (b) The tricube weight function defined on this neighborhood; the points show the weights for the observations. (c) The weighted mean of the y-values within the neighborhood, represented as a horizontal line. (d) The nonparametric regression line connecting the fitted values at each observation. (The four outlying points are excluded from the fit)

Local-polynomial regression is similar to kernel estimation, but the fitted values are produced by locally weighted regression rather than by locally weighted averaging; that is, ŷ0 is obtained in step 3 by the polynomial regression of y on x that minimizes the weighted sum of squared residuals

   Σi WT(xi) (yi − a − b1xi − b2xi^2 − · · · − bkxi^k)^2.   (6)

Most commonly, the order of the local polynomial is taken as k = 1, that is, a local linear fit (as in Figure 1). Local-polynomial regression tends to be less biased than kernel regression, for example, at the boundaries of data – as is evident in the artificial flattening of the kernel estimator at the right of Figure 2(d). More generally, the bias of the local-polynomial estimator declines and the variance increases with the order of the polynomial, but an odd-ordered local-polynomial estimator has the same asymptotic variance as the preceding even-ordered estimator: thus, the local-linear estimator (of order 1) is preferred to the kernel estimator (of order 0), and the local-cubic (order 3) estimator to the local-quadratic (order 2).
Smoothing splines are the solution to the penalized regression problem: find f(x) to minimize

   S(h) = Σi [yi − f(xi)]^2 + h ∫_{x(1)}^{x(n)} [f″(x)]^2 dx.   (7)

Here h is a roughness penalty, analogous to the span in nearest-neighbor kernel or local-polynomial regression, and f″(x) is the second derivative of the
regression function (taken as a measure of roughness). Without the roughness penalty, nonparametrically minimizing the residual sum of squares would simply interpolate the data. The mathematical basis for smoothing splines is more satisfying than for kernel or local polynomial regression, since an explicit criterion of fit is optimized, but spline and localpolynomial regressions of equivalent smoothness tend to be similar in practice. Local regression with several predictors proceeds as follows, for example. We want the fit y0 = f (x0 ) at the focal point x0 = (x10 , . . . , xp0 ) in the predictor space. We need the distances D (xi , x0 ) between the observations on the predictors and the focal point. If the predictors are on the same scale (as, for example,
when they represent coordinates on a map), then measuring distance is simple; otherwise, some sort of standardization or generalized distance metric will be required. Once distances are defined, weighted polynomial fits in several predictors proceed much as in the bivariate case. Some kinds of spline estimators can also be generalized to higher dimensions. The generalization of nonparametric regression to several predictors is therefore mathematically straightforward, but it is often problematic in practice. First, multivariate data are afflicted by the so-called curse of dimensionality: Multidimensional spaces grow exponentially more sparse with the number of dimensions, requiring very large samples to estimate nonparametric regression models with many
predictors. Second, although slicing the surface can be of some help, it is difficult to visualize a regression surface in more than three dimensions (that is, for more than two predictors).
Additive regression models are an alternative to unconstrained nonparametric regression with several predictors. The additive regression model is

   E(yi) = α + f1(x1i) + · · · + fp(xpi)   (8)

where the fj are smooth partial-regression functions, typically estimated with smoothing splines or by local regression. This model can be extended in two directions: (1) to incorporate interactions between (or among) specific predictors, for example,

   E(yi) = α + f1(x1i) + f23(x2i, x3i)   (9)

which is not as general as the unconstrained model E(yi) = α + f(x1i, x2i, x3i); and (2) to incorporate linear terms, as in the model

   E(yi) = α + β1x1i + f2(x2i)   (10)

Such semiparametric models are particularly useful for including dummy regressors or other contrasts derived from categorical predictors.
Returning to the UN data, an example of a simple additive regression model appears in Figures 3 and 4. Here, female life expectancy is regressed on GDP per capita and the female rate of illiteracy, expressed as a percentage. Each term in this additive model is fit as a smoothing spline, using the equivalent of four degrees of freedom. Figure 3 shows the two-dimensional fitted regression surface, while Figure 4 shows the partial-regression functions, which in effect slice the regression surface in the direction of each predictor; because the surface is additive, all slices in a particular direction are parallel, and the two-dimensional surface in three-dimensional space can be summarized by two two-dimensional graphs. The ability to summarize the regression surface with a series of two-dimensional graphs is an even greater advantage when the surface is higher-dimensional.

Figure 3  Fitted regression surface for the additive regression of female expectation of life on GDP per capita and female illiteracy. The vertical lines represent residuals

Figure 4  Partial regression functions from the additive regression of female life expectancy on GDP per capita and female illiteracy; each term is fit by a smoothing spline using the equivalent of four degrees of freedom. The "rug plot" at the bottom of each graph shows the distribution of the predictor, and the broken lines give a point-wise 95-percent confidence envelope around the fit

A central issue in nonparametric regression is the selection of smoothing parameters – such as the span in kernel and local-polynomial regression or the
roughness penalty in smoothing-spline regression (or equivalent degrees of freedom for any of these). In the examples in this article, smoothing parameters were selected by visual trial and error, balancing smoothness against detail. The analogous statistical balance is between variance and bias, and some methods (such as cross-validation) attempt to select smoothing parameters to minimize estimated mean-square error (i.e., the sum of squared bias and variance).
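The article mentions cross-validation as one way to select a smoothing parameter. The sketch below shows how the span of the kernel smoother from the earlier sketch might be chosen by leave-one-out cross-validation; the scheme and function names are our illustration, not a procedure prescribed in the article.

import numpy as np

def kernel_fit_at(x, y, span, x0):
    # Tricube-weighted local average at a single focal value x0.
    m = max(2, int(round(span * len(x))))
    dist = np.abs(x - x0)
    h = np.sort(dist)[m - 1]
    u = dist / h
    w = np.where(u < 1, (1 - u ** 3) ** 3, 0.0)
    return np.sum(w * y) / np.sum(w)

def loo_cv_score(x, y, span):
    # Leave-one-out cross-validation estimate of squared prediction error.
    x, y = np.asarray(x, float), np.asarray(y, float)
    idx = np.arange(len(x))
    errors = [(y[i] - kernel_fit_at(x[idx != i], y[idx != i], span, x[i])) ** 2
              for i in idx]
    return float(np.mean(errors))

# Hypothetical usage: pick the span with the smallest CV score.
# spans = [0.2, 0.3, 0.5, 0.7]
# best = min(spans, key=lambda s: loo_cv_score(x, y, s))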
References

[1] Bowman, A.W. & Azzalini, A. (1997). Applied Smoothing Techniques for Data Analysis: The Kernel Approach with S-Plus Illustrations, Oxford University Press, Oxford.
[2] Fox, J. (2000a). Nonparametric Simple Regression: Smoothing Scatterplots, Sage Publications, Thousand Oaks.
[3] Fox, J. (2000b). Multiple and Generalized Nonparametric Regression, Sage Publications, Thousand Oaks.
[4] Hastie, T., Tibshirani, R. & Friedman, J. (2001). The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer, New York.
[5] Hastie, T.J. & Tibshirani, R.J. (1990). Generalized Additive Models, Chapman & Hall, London.
[6] Simonoff, J.S. (1996). Smoothing Methods in Statistics, Springer, New York.

(See also Generalized Additive Model)

JOHN FOX
Nonrandom Samples Many of the questions that the tools of modern statistics are enlisted to help answer are causal. To what extent does exercise cause a decrease in coronary heart disease? How much of an effect does diet have on cholesterol in the blood? How much of a child’s performance in school is due to home environment? What proportion of a product’s sales can be traced to a new advertisement campaign? How much of the change in national test scores can be attributed to modifications in educational policy? Some of these questions lend themselves to the possibility of experimental study. One could randomly assign a large group of individuals to a program of exercise, command another to a more sedentary life, and then compare the coronary health of the two groups years later (see Clinical Trials and Intervention Studies). This is possible but impractical. Similar gedanken experiments could be imagined on diet, but the likelihood of actually accomplishing any long-term controlled experiment on humans is extremely small. Trying to control the home life of children borders on the impossible. Modern epistemology [3] points out that the randomization associated with a true experimental design is the only surefire way to measure the effects of possible causal agents. The randomization allows us to control for any ‘missing third factor’ because, under its influence, any unknown factor will be evenly distributed (in the limit) in both the treatment and control conditions. Yet, as we have just illustrated, many of modern causal questions do not lend themselves easily to the kind of randomized assignment that is the hallmark of tightly controlled experiments. Instead, we must rely on observational studies – studies in which our data are not randomly drawn. In an observational study, we must examine intact groups – perhaps one group of individuals who have followed a program of regular exercise and another who have not – and compare their cardiovascular health. But it is almost certain that these two groups differ on some other variable as well – perhaps diet or age or weight or
social class. These must be controlled for statistically. Without random assignment, we can never be sure we have controlled for everything, but this is the best that we can do. In fact, thanks to the ingenious efforts of many statisticians (e.g., [1, 2, 8, 9, 10, 11]) in many circumstances, there are enough tools available for us to do very well indeed. Let us now consider three separate circumstances involving inferences from nonrandomly gathered data. In the first two, the obvious inferences are wrong, and in the third, we show an ingenious solution that allows what appears to be a correct inference. In each case, we are overly terse, but provide references for more detailed explanation. Example 1. The most dangerous profession In 1835, the Swiss physician H. C. Lombard [5] published the results of a study on the longevity of various professions. His data were very extensive, consisting of death certificates gathered over more than a half century in Geneva. Each certificate contained the name of the deceased, his profession, and age at death. Lombard used these data to calculate the mean longevity associated with each profession. Lombard’s methodology was not original with him, but, instead, was merely an extension of a study carried out by R. R. Madden, Esq. that was published two years earlier [6]. Lombard found that the average age of death for the various professions mostly ranged from the early 50s to the mid 60s. These were somewhat younger than those found by Madden, but this was expected since Lombard was dealing with ordinary people rather than the ‘geniuses’ in Madden’s (the positive correlation between fame and longevity was well known even then). But Lombard’s study yielded one surprise; the most dangerous profession – the one with the shortest longevity – was that of ‘student’ with an average age of death of only 20.7! Lombard recognized the reason for this anomaly, but apparently did not connect it to his other results (for more details see [12]). Example 2. The twentieth century is a dangerous time In 1997, to revisit Lombard’s methodology, Samuel Palmer and Linda Steinberg (reported in [13]) gathered 204 birth and death dates from the Princeton (NJ) Cemetery. This cemetery opened in the mid 1700s, and has people buried in it born in
Death data from the Princeton Cemetery (smoothed using '53h twice'); axes: birth year by age at death, with 'World War I' and 'Anomaly' annotations
Figure 1
The longevities of 204 people buried in Princeton Cemetery shown as a function of the year of their birth
the early part of that century. Those interred include Grover Cleveland, John Von Neumann, Kurt G¨odel, and John Tukey. When age-at-death was plotted as a function of birth year (after suitable smoothing to make the picture coherent), we see the result shown as Figure 1. The age of death stays relatively constant until 1920, when the longevity of the people in the cemetery begins to decline rapidly. The average age of death decreases from around 70 years of age in the 1900s to as low as 10 in the 1980s. It becomes obvious immediately that there must be a reason for the anomaly in the data (what we might call the ‘Lombard Surprise’), but what? Was it a war or a plague that caused the rapid decline? Has a neonatal section been added to the cemetery? Was it opened to poor people only after 1920? Obviously, the reason for the decline is nonrandom sampling. People cannot be buried in the cemetery if they are not already dead. Relatively few people born in the 1980s are buried in the cemetery and, thus, no one born in the 1980s that we found in Princeton Cemetery could have been older than 17. Examples of situations in which this anomaly arises abound. Four of these are:
1. In 100 autopsies, a significant relationship was found between age-at-death and the length of the lifeline on the palm [7]. Actually, what they discovered was that wrinkles and old age go together.
2. In 90% of all deaths resulting from barroom brawls, the victim was the one who instigated the fight. One questions the wit of the remaining 10% who did not point at the body on the floor when the police asked, 'Who started this?'
3. In March of 1991, the New York Times reported the results of data gathered by the American Society of Podiatry, which stated that 88% of all women wear shoes at least one size too small. Who would be most likely to participate in such a poll?
4. In testimony before a February 1992 meeting of a committee of the Hawaii State Senate, then considering a law requiring all motorcyclists to wear a helmet, one witness declared that, despite having been in several accidents during his 20 years of motorcycle riding, a helmet would not have prevented any of the injuries he received. Who was unable to testify? Why?
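A toy simulation (invented numbers, not the cemetery records) makes the selection effect in the Princeton Cemetery example explicit: when only people who have already died by the year of data collection can be observed, recent birth cohorts necessarily show short average longevity.

import numpy as np

rng = np.random.default_rng(42)
census_year = 1997
birth_year = rng.integers(1700, 1998, size=20000)
lifespan = np.clip(rng.normal(68, 15, size=20000), 0, 105)
death_year = birth_year + lifespan

observed = death_year <= census_year            # only the dead can be "buried"
for start in range(1700, 2000, 50):
    cohort = observed & (birth_year >= start) & (birth_year < start + 50)
    if cohort.any():
        print(start, round(float(lifespan[cohort].mean()), 1))
# The mean observed age at death collapses for cohorts born shortly before
# 1997, even though every cohort's lifespans were generated identically.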
Nonrandom Samples Example 3. Bullet holes and a model for missing data Abraham Wald, in some work he did during World War II [14], was trying to determine where to add extra armor to planes on the basis of the pattern of bullet holes in returning aircraft. His conclusion was to determine carefully where returning planes had been shot and put extra armor every place else! Wald made his discovery by drawing an outline of a plane (crudely shown in Figure 2), and then putting a mark on it where a returning aircraft had been shot. Soon, the entire plane had been covered with marks except for a few key areas. It was at this point that he interposed a model for the missing data, the planes that did not return. He assumed that planes had been hit more or less uniformly, and, hence, those aircraft hit in the unmarked places had been unable to return, and, thus, were the areas that required more armor. Wald’s key insight was his model for the nonresponse. From his observation that planes hit in certain areas were still able to return to base, Wald inferred that the planes that did not return must have been hit somewhere else. Note that if he used a different model analogous to ‘those lying within Princeton Cemetery have the same longevity as those without’ (i.e., that the planes that returned were hit about the same as those that did not return), he would have arrived at exactly the opposite (and wrong) conclusion. To test Wald’s model requires heroic efforts. Planes that did not return must be found and the patterns of bullet holes in them must be recorded. In short, to test the validity of Wald’s model for missing data requires that we sample from the unselected population – we must try to get a random sample,
even if it is a small one. This strategy remains the basis for the only empirical solution to making inferences from nonrandom samples. Nonresponse in surveys (see Nonresponse in Sample Surveys) exhibits many of the same problems seen in observational studies. Indeed, in a mathematical sense, they are formally identical. Yet, in practice, there are important distinctions that can usefully be drawn between them [4]. This essay was aimed at providing a flavor of the depth of the problems associated with making correct inferences from nonrandomly gathered data, and an illustration of the character of one solution – Abraham Wald’s – which effectively models the process that generated the data in the hope of uncovering its secrets.
References [1] [2] [3]
[4] [5]
[6]
[7]
[8] [9]
[10] [11] An outline of a plane
An outline of a plane with shading indicating where other planes had been shot.
Figure 2 A schematic representation of Abraham Wald’s ingenious scheme to investigate where to armor aircraft
3
[12]
[13]
Cochran, W.G. (1982). Contributions to Statistics, Wiley, New York. Cochran, W.G. (1983). Planning and Analysis of Observational Studies, Wiley, New York. Holland, P.W. (1986). Statistics and causal inference, Journal of the American Statistical Association 81, 945–960. Little, R.J.A. & Rubin, D.B. (1987). Statistical Analysis with Missing Data, Wiley, New York. Lombard, H.C. (1835). De l’Influence des professions sur la Dur´ee de la Vie, Annales d’Hygi´ene Publique et de M´edecine L´egale 14, 88–131. Madden, R.R. (1833). The Infirmities of Genius, Illustrated by Referring the Anomalies in Literary Character to the Habits and Constitutional Peculiarities of Men of Genius, Saunders and Otley, London. Newrick, P.G., Affie, E. & Corrall, R.J.M. (1990). Relationship between longevity and lifeline: a manual study of 100 patients, Journal of the Royal Society of Medicine 83, 499–501. Rosenbaum, P.R. (1995). Observational Studies, Springer-Verlag, New York. Rosenbaum, P.R. & Rubin, D.B. (1983). The central role of the propensity score in observational studies for causal effects, Biometrika 70, 41–55. Rubin, D.B. (1987). Multiple Imputation for Nonresponse, Wiley, New York. Wainer, H. (1986). Drawing Inferences from SelfSelected Samples, Springer-Verlag, New York. Reprinted in 2000 by Lawrence Erlbaum Associates: Hillsdale, NJ. Wainer, H. (1999). The most dangerous profession: A note on nonsampling error, Psychological Methods 4(3), 250–256. Wainer, H., Palmer, S.J. & Bradlow, E.T. (1998). A selection of selection anomalies, Chance 11(2), 3–7.
4 [14]
Nonrandom Samples Wald, A. (1980). A Method of Estimating Plane Vulnerability Based on Damage of Survivors, CRC 432, July 1980. (These are reprints of work done by Wald while a member of Columbia’s Statistics Research Group during the period 1942–1945. Copies can be obtained from the
Document Center, Center for Naval Analyses, 2000 N. Beauregard St., Alexandria, VA 22311.)
HOWARD WAINER
Nonresponse in Sample Surveys PAUL S. LEVY
AND
STANLEY LEMESHOW
Volume 3, pp. 1433–1436 in Encyclopedia of Statistics in Behavioral Science ISBN-13: 978-0-470-86080-9 ISBN-10: 0-470-86080-4 Editors Brian S. Everitt & David C. Howell John Wiley & Sons, Ltd, Chichester, 2005
Nonresponse in Sample Surveys Consider a sample survey of n units from a population containing N units. During the data collection phase of the survey, attempts are made to collect information from each of the n sampled units. Generally, for a variety of reasons, there is a subset of units from which it is not possible to collect the information. For example, if the units are patient hospital medical records, the records may be missing. If the unit is a household, and the survey uses telephone interviewing, the household respondent may rarely be at home or may be unwilling to participate in the survey. These are just two of a countless number of scenarios in which information cannot be obtained from a sampled unit (see Missing Data). The nonresponse from a sampled unit is considered to be one of the major threats to the validity of information obtained by the use of sample surveys [6–8].
Effects of Nonresponse The nonresponse of sample units is responsible for a smaller sample size, and, hence, larger variances or standard errors in the resulting estimates than were planned for in the original study design. More importantly, the nonresponse results in bias that is insidious because it cannot be estimated directly from the survey data. For any variable measured in the survey, this bias is equal to the product of the expected proportion of nonrespondents in the population and the expected difference between the level of the variable among respondents and nonrespondents [9, 12]. Thus, if 30% of households in a population are potential nonrespondents, and if 40% of the potential nonrespondents are below the poverty level as opposed to 10% of the potential respondents, then the expected estimated proportion of households below the poverty level from the survey will be 10%, which is a gross underestimate of the true level which is 19% (0.30 × 40% + 0.70 × 10%).
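The 19% figure quoted above follows directly from the bias formula; a short check, using the same made-up proportions, is:

p_nonresp = 0.30        # expected proportion of potential nonrespondents
pov_nonresp = 0.40      # poverty rate among potential nonrespondents
pov_resp = 0.10         # poverty rate among potential respondents

true_rate = p_nonresp * pov_nonresp + (1 - p_nonresp) * pov_resp   # 0.19
survey_estimate = pov_resp                                         # 0.10, respondents only
bias = survey_estimate - true_rate                                 # -0.09 = -0.30 * (0.40 - 0.10)
print(true_rate, survey_estimate, bias)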
the findings of a sample survey. There are many types of response rates that can be reported, each leading to a different interpretation concerning potential biases due to nonresponse. To illustrate, let us suppose that a firm has given special training on the use of a new software package to 1000 employees and, a year later, selects a random sample of 200 for interviews on their satisfaction with the software and the training. Of the 200, they are able to successfully interview 120, which is a response rate of 60% of those given the training. However, of the 80 that were not interviewed, 50 were no longer employed at the firm; 10 refused to be interviewed; and 20 were either ill or on vacation during the period in which interviews occurred. Thus, if one considers only those available during the period of the survey, the response rate would be 120/130 (92.3%). If one considers only those still employed during the survey, the response rate would be 120/150 (80%). Each of these three response rates would lead to a different interpretation with respect to the quality of the data and with respect to interpretations concerning the findings of the survey. Guidelines frequently cited and used by survey professionals concerning the use of different types of response rates are given in articles by the Council of American Survey Research Organizations (CASRO) [3] and the American Association for Public Opinion Research (AAPOR) [1]. The material in these articles can be downloaded from their respective websites: http://www.casro.org/res prates.cfm and http://www.aapor.org. The sampling text by Lohr also discusses the various types of response rates commonly used as well as guidelines on what constitutes an acceptable response rate [12, pp. 281–282].
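The three response rates in the software-training example differ only in the denominator; with the same hypothetical counts:

interviewed = 120
sampled = 200
left_firm, refused, unavailable = 50, 10, 20    # refusals stay in every denominator

overall = interviewed / sampled                                        # 120/200 = 0.600
among_available = interviewed / (sampled - left_firm - unavailable)    # 120/130 = 0.923
among_still_employed = interviewed / (sampled - left_firm)             # 120/150 = 0.800
print(overall, among_available, among_still_employed)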
Methods of Improving Response Rates
Types of Response Rates and Issues in Reporting Them
While 100% response rates are generally unattainable, there are a variety of methods commonly used to improve response rates. Most of these methods can be found in recent textbooks on sampling or survey methodology [9, 12], or in more specialized publications dealing with response issues [4, 6–8, 10, 11]. We now present a brief description of several widely used methods.
Because of the potential biases due to nonresponse, it is very important that response rates be reported in
Endorsements. Response rates can be increased if the survey material contains endorsements by an
agency or organization whose sphere of interest is strongly associated with the subject matter of the survey. This works best if the name of the endorsing organization or individual signing the endorsement letter is well known to the potential respondents. This is especially effective in surveys of business, trade, or professional establishments. Incentives. Monetary or other gifts for participation in the survey have been shown to be effective in obtaining the participation of selected units [6]. Generally, the value of the incentive should be proportional to the extent of the burden to the unit. With the increased attention given to protection of human subjects, there is a tendency for Institutional Review Boards to frown on excessive gifts, fearing that the incentives would be a form of coercion. Refusal Conversion Attempts. Attempts to convert initial refusals often involve the use of interviewers of a different age/gender/ethnicity than that used initially (especially in face-to-face surveys). Changes in the interviewing script with more flexibility permitted for the interviewer to ad lib (especially in telephone surveys), and attempts to interview at a different time from that used initially are also widely used. Although the success of refusal conversion can vary considerably depending on the subject matter and interviewing mode of the survey, success rates are generally in the 15 to 25% range. Subsampling Nonrespondents (Double Sampling). If a sample of n units from a population containing N units results in n1 respondents and n2 = n − n1 nonrespondents, a random subsample of n∗ units is taken from the n2 units that were nonrespondents at the first phase of sampling. Intensive attempts are made to obtain data from this sample of original nonrespondents. If k of the n∗ subsampled original nonrespondents are successfully interviewed, then by weighting these units by the factor n2 /k, we can obtain estimates that partially compensate for the nonresponses. The accuracy of the estimates obtained by this method depends, to a large extent, on obtaining a high response rate from the sample of original nonrespondents. Detailed descriptions of this method are given in [9]. Advance Letters in Telephone Sampling. With the switch in emphasis from the Mitofsky–Waksberg
[13] method to list assisted methods [2] of random digit dialing (RDD) in telephone surveys, it is possible to obtain addresses for a relatively high proportion of the telephone numbers obtained by RDD. By sending advance letters to the households corresponding to these numbers informing them of the nature and importance of the survey, and requesting their support, it has been shown that the response rate can be increased [11]. Multimode Options. In many surveys, where one mode of data collection (e.g., face-to- face personal interview) has not succeeded in obtaining an interview, it is possible that offering the potential target unit an alternative mode (e.g., web or mail) will result in a response. Link et al. [10] has performed an experiment in which this technique has resulted in increased response rates. Differential Effects on Subpopulations. Although the above methods may increase overall response rates, there may be differences among subpopulations with respect to their level of effectiveness, and this may increase the diversity among various population groups with respect to response rates. How this would effect biases in the overall survey findings has become a recent concern as suggested in the recent paper by Eyerman et al. [4].
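A sketch of the estimator implied by the subsampling (double-sampling) scheme described above: values from the k successfully followed-up nonrespondents are weighted up by n2/k and combined with the first-phase respondents. This is our illustration of the idea, not code from the cited texts, and the variable names are placeholders.

import numpy as np

def double_sampling_mean(y_resp, y_followup, n2):
    # y_resp: values from the n1 first-phase respondents
    # y_followup: values from the k converted nonrespondents
    # n2: number of first-phase nonrespondents (each follow-up case
    #     represents n2/k of them)
    y_resp = np.asarray(y_resp, float)
    y_followup = np.asarray(y_followup, float)
    n1, k = len(y_resp), len(y_followup)
    total = y_resp.sum() + (n2 / k) * y_followup.sum()
    return total / (n1 + n2)

# Hypothetical usage: 700 respondents, 300 nonrespondents, 60 followed up.
# est = double_sampling_mean(y_resp, y_followup, n2=300)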
Statistical Methods of Adjusting for Nonresponse In spite of efforts such as those mentioned above to collect data from all units selected in the sample, there will often be some units from whom survey data have not been collected. There are several methods used to adjust for this in the subsequent statistical estimation process. The ‘classic’ method is to construct classes based on characteristics that are thought to be related to the responses to the items in the survey and that are known about the nonrespondent units (often these are ecological or demographic). If a given class has n units in the original sample, but only n∗ respondent units, then the initial sampling weight is multiplied by n/n∗ for every unit in the particular class. This, of course, does not consider the possibility that nonrespondents, even within the same class, might differ from respondents with respect to their levels of variables in the survey. More refined methods of
adjustment (e.g., raking, propensity scores) are now used fairly commonly and are discussed in more recent literature [5].
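A minimal sketch of the 'classic' weighting-class adjustment just described: within each class, respondents' base weights are inflated by n/n*, the ratio of sampled to responding units. The field names are placeholders, not taken from any particular survey system.

from collections import defaultdict

def weighting_class_adjustment(units):
    # units: list of dicts with keys 'wclass', 'base_wt', and 'responded'.
    n = defaultdict(int)        # sampled units per class (n)
    n_resp = defaultdict(int)   # responding units per class (n*)
    for u in units:
        n[u["wclass"]] += 1
        if u["responded"]:
            n_resp[u["wclass"]] += 1
    adjusted = []
    for u in units:
        if u["responded"]:
            factor = n[u["wclass"]] / n_resp[u["wclass"]]
            adjusted.append({**u, "adj_wt": u["base_wt"] * factor})
    return adjusted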
References [1]
[2]
[3]
[4]
[5]
American Association for Public Opinion Research (2004). Standard Definitions: Final Dispositions of Case Codes and Outcome Rates for RDD Telephone Surveys and In-Person Household Surveys, AAPOR, Ann Arbor. Casady, R.J. & Lepkowski, J.M. (1993). Stratified telephone sampling designs, Survey Methodology 19(1), 103–113. Council of American Survey Research Organizations. (1982). On the Definition of Response Rates, A special report of the CASRO task force on completion rates. CASRO, Port Jefferson. Eyerman, J., Link, M., Mokdad, A. & Morton, J. (2004). Assessing the impact of methodological enhancements on different subpopulations in a BRFSS experiment., in Proceedings of the American Statistical Association, Survey Methodology Section, American Statistical Association (CD-ROM), Alexandria. Folsom, R.E. & Singh, A.C. (2000). A generalized exponential model for sampling weight calibration for a unified approach to nonresponse, poststratification, and extreme weight adjustments, in Proceedings of the American Statistical Association, Section on Survey Research Methods 598–603.
[6]
3
Groves, R.M. (1989). Survey Errors and Survey Costs, Wiley, New York. [7] Groves, R.M. & Coupier, M.P. (1998). Nonresponse in Household Interview Surveys, Wiley, New York. [8] Groves, R.M., Dillman, D.A., Eltinge, J.L. & Little, R.J.A. (2002). Survey Nonresponse, Wiley, New York. [9] Levy, P.S. & Lemeshow, S. (1999). Sampling of Populations: Methods and Applications, Wiley, New York. [10] Link, M.W. & Mokdad, A. (2004). Are web and mail modes feasible options for the behavioral risk factor surveillance system? in Proceedings of the 8th Conference on Health Survey Research Method, Centers for Disease Control and Prevention, Atlanta. [11] Link, M.W., Mokdad, A., Town, M., Roe, D. & Weiner, J. (2004). Improving response rates for the BRFSS: use of lead letters and answering machine messages, in Proceedings of the American Statistical Association, Survey Methodology Section, American Statistical Association (CD-ROM), Alexandria. [12] Lohr, S. (1999). Sampling: Design and Analysis, Duxbury Press, Pacific Grove. [13] Waksberg, J. (1978). Sampling methods for random digit dialing, Journal of the American Statistical Association 73, 40–46.
(See also Survey Sampling Procedures) PAUL S. LEVY
AND
STANLEY LEMESHOW
Nonshared Environment HILARY TOWERS
AND JENAE
M. NEIDERHISER
Volume 3, pp. 1436–1439 in Encyclopedia of Statistics in Behavioral Science ISBN-13: 978-0-470-86080-9 ISBN-10: 0-470-86080-4 Editors Brian S. Everitt & David C. Howell John Wiley & Sons, Ltd, Chichester, 2005
Nonshared Environment Nonshared environment refers to the repeated finding that across a very wide range of individual characteristics and behaviors, children raised in the same home are often very different from one another [6]. Most importantly, these behavioral differences are the product of some aspect of the children’s environment – and not to differences in their genes. Nonshared environmental influences may include different peer groups or friendships, differential parenting, differential experience of trauma, different work environments, and different romantic partners or spouses. The difficulty, however, has been in finding evidence of the substantial and systematic impact of any of these variables on sibling differences. Nonshared environment remains a ‘hot topic’ in the field of quantitative genetic research because it has been shown to be important to a wide array of behaviors across the entire life-span, and yet its systematic causes remain elusive [12]. How are we able to discern the extent to which nonshared environmental factors are operating on a behavior? Just as genetic influences on behaviors are inferred from comparisons of different sibling types, so can nonshared environmental outcomes be detected through similar comparisons. For example, monozygotic (MZ, or identical) twins are 100% genetically identical. This means that when members of an MZ twin pair differ, these differences cannot be due to genetic differences – they must be the result of differential effects of factors within each twin’s respective environment. This unique opportunity to separate the effects of genes and environment on behavior has made identical twins very popular among behavioral geneticists (see Twin Designs). Prior to describing the promising results of one recent study of nonshared environmental associations, we wish to point out an important distinction among such studies. While it is clear that withinfamily differences are substantial for most behaviors, until recently the sources of these differences have remained ‘anonymous’. That is, quantitative genetic analyses of individual behaviors at one time point (i.e., univariate, cross-sectional analyses) can only tell us siblings are different because of factors in their environments. For example, if we find substantial differences in levels of depression within MZ twin pairs, nonshared environmental influence
on depression is indicated. We still do not know, however, what factors are causing, or contributing to, these behavioral differences. It is only recently that researchers have begun to assess the causes of withinfamily differences by examining nonshared environmental associations between constructs thought to be environmental (e.g., peer relations) and adjustment (e.g., depression). In the simplest form of these bivariate analyses, MZ twin differences on an environmental factor are correlated with MZ twin differences on an adjustment factor. This correlation between MZ twin differences on two constructs allows the ‘pure’ nonshared environmental association for this association to be estimated. The MZ twin difference method is ideal in its relative simplicity and its stringency, since reduced sample sizes and the general tendency for MZ twins to be more similar rather than less similar to one another than DZ twins decrease the likelihood of finding significant associations. Yet, there have been very few quantitative genetic analyses that have used an MZ differences approach. Of those that have, there has been an almost exclusive reliance on self-reports of both the environment, often retrospective family environment, and current psychological adjustment (e.g. [1, 2]). While this does not diminish the importance of the findings, the use of within-reporter difference correlations (i.e., correlations between two measures reported by the same informant) is subject to the same criticism that applies to standard, twovariable correlations: the correlations are more likely to represent rater bias of some sort than are correlations between ratings from two different reporters (see Rater Bias Models). One recent MZ difference analysis of nonshared environmental influences on the socio-emotional development of preschool children provides evidence for at least moderate, and in some cases substantial, nonshared environmental associations using parent, interviewer, and observer ratings [3]. In this study, across-rater MZ difference correlations between interviewer reports of parental harshness, for example, and parent reports of the children’s problem behaviors, emotionality, and prosocial behavior were .36, .38, and −.27, respectively. Cross-rater MZ difference correlations were also significant for: (a) parental positivity and child positivity, (b) parental positivity and child problem behavior (a negative association), (c) parental negative control and child prosocial behavior (a negative association), and (d) parental
2
Nonshared Environment
harsh discipline and child responsiveness (a negative association). These findings can be interpreted as evidence that the differential treatment of very young siblings living in the same family by their parents matters – and it matters apart from any genetically influenced traits of the child. For example, when a parent is harsh toward one child, and less harsh to the other, the child receiving the harsh discipline is more likely to have behavior problems, be more emotionally labile, and be less prosocial than their sibling. One alternative explanation of the findings reported by Deater–Deckard and colleagues [3] is that differences in the children’s behaviors elicit differences in parenting, or that the relationship is reciprocal. Unfortunately, it is not possible to directly test for this using cross-sectional data. A similar set of preliminary analyses of parent–child data from an adult twin sample, the Twin Moms project [7], found several substantial cross-rater MZ difference correlations between the differential parenting patterns of adult twin women and characteristics of their adolescent children [10]. While it is also not possible to confirm directionality in this cross-sectional sample, such findings in an adult-based sample imply that differences in the children are contributing to differences in (or nonshared environmental variation in) the parenting behavior of the mothers, since the mothers’ genes and environment are the focus of assessment. Regardless of the direction of effect, the associations reported by Deater–Deckard and colleagues [3] are remarkable for at least two reasons. First, withinfamily differences have been shown to increase over time – and to be less easily detected in early childhood, when parents are more likely to treat their children similarly [5, 9]. Finding associations between differential parenting, a family environment variable, and childhood behaviors using a sample of three and a half-year-old children is notable, and has implications for clinical intervention efforts directed toward promoting more ‘equitable’ parenting of siblings. Second, the fact that these associations exist across three different rater ‘perspectives’ lends credence to the validity of the associations – they cannot be attributed entirely to the bias of a single reporter. Third, unlike univariate analyses of nonshared environment, analyses of multivariable associations minimize the amount of measurement error in the estimate of nonshared environment.
In light of this apparent success, one may ask, why all the excitement over nonshared environment? Again, the issue is an inability to isolate systematic sources of within-family differences in behavior, which are so prevalent. In other words – where single studies have succeeded in specifying sources of sibling differences, there has been a general failure to replicate these findings across samples. A recent meta-analysis of the effect produced by candidate sources of nonshared environment on behavior found the average effect size of studies using genetically informed designs to be roughly 2% [12]. Revelations of this kind have stirred discussion about the definition of nonshared environment itself [4, 8]. So far, however, the study of nonshared environmental associations has been limited almost exclusively to analyses of children and adolescents. As adult siblings often go their separate ways – rent their own place, get a job, get married, and so on – it comes as no surprise that nonshared environmental variation may increase with age [5]. It seems a viable possibility, then, that primary sources of nonshared environmental influence on behavior are to be found in adulthood. Indeed, results from the Twin Moms project [7] indicate that a portion of the nonshared environmental variation in women’s marital satisfaction covaries with differences in the women’s spousal characteristics, current family environment, and stressful life events [11]. In other words, the extent to which women are satisfied or dissatisfied with their marriages has been found to vary as a function of spouse characteristics, family climate, and stressors. Importantly, these relationships exist independently of any genetic influence that may be operating on her feelings about her marriage. These findings confirm the potential to isolate nonshared environmental relationships among adult behaviors and relationships, and suggest an important area of exploration for future behavioral genetic work. Finally, studies of nonshared environmental associations are important because they provide information that compliments our understanding of genetic influences on behavior and relationships. Our success in quantifying the relative impact of genetic and environmental factors (and their mutual interplay) on human behavior has implications for therapeutic intervention. For example, the purpose of many types of therapy is to produce changes in individuals’ social environments. How much more effective these therapies may be when based on knowledge that
Nonshared Environment a particular psychosocial intervention has the potential to alter the mechanisms of genetic expression for a particular trait. For instance, intervention research has demonstrated the potential for altering behavioral responses of mothers toward their infants with difficult temperaments [13]. The intent is not only to change mothers’ behaviors or thought processes, but to eliminate the evocative association between the infants’ genetically influenced, irritable behaviors and their mothers’ corresponding expressions of hostility. By identifying specific sources of nonshared environmental influences, we will be better able to focus our attention on the environmental factors that matter in eliciting behavior, as in the intervention study cited above, or that are prime candidates for examination in studies looking for genotype × environment interaction.
[6]
[7]
[8]
[9]
References [10] [1]
[2]
[3]
[4]
[5]
Baker, L.A. & Daniels, D. (1990). Nonshared environmental influences and personality differences in adult twins, Journal of Personality and Social Psychology 58(1), 103–110. Bouchard, T.J. & McGue, M. (1990). Genetic and rearing environmental influences on adult personality: an analysis of adopted twins reared apart, Journal of Personality 58(1), 263–292. Deater-Deckard, K., Pike, A., Petrill, S.A., Cutting, A.L., Hughes, C. & O’Connor, T.G. (2001). Nonshared environmental processes in socio-emotional development: an observational study of identical twin differences in the preschool period, Developmental Science 4(2), F1–F6. Maccoby, E.E. (2000). Parenting and its effects on children: on reading and misreading behavior genetics, Annual Review of Psychology 51, 1–27. McCartney, K., Harris, M.J. & Bernieri, F. (1990). Growing up and growing apart: a developmental metaanalysis of twin studies, Psychological Bulletin 107(2), 226–237.
[11]
[12]
[13]
3
Reiss, D., Plomin, R., Hetherington, E.M., Howe, G.W., Rovine, M., Tryon, A. & Hagan, M.S. (1994). The separate worlds of teenage siblings: an introduction to the study of the nonshared environment and adolescent development, in Separate Social Worlds of Siblings: The Impact of Nonshared Environment on Development, E.M. Hetherington, D. Reiss & R. Plomin, eds, Lawrence Erlbaum, Hillsdale. Reiss, D., Pedersen, N., Cederblad, M., Hansson, K., Lichtenstein, P., Neiderhiser, J. & Elthammer, O. (2001). Genetic probes of three theories of maternal adjustment: universal questions posed to a Swedish sample, Family Process 40(3), 247–259. Rowe, D.C., Woulbroun, J. & Gulley, B.L. (1994). Peers and friends as nonshared environmental influences, in Separate Social Worlds of Siblings: The Impact of Nonshared Environment on Development, E.M. Hetherington, D. Reiss & R. Plomin eds, Lawrence Erlbaum, Hillsdale. Scarr, S. & Weinberg, R.A. (1983). The Minnesota adoption studies: genetic differences and malleability, Child Development 54, 260–267. Towers, H., Spotts, E., Neiderhiser, J., Hansson, K., Lichtenstein, P., Cederblad, M., Pedersen, N.L., Elthammer, O. & Reiss, D. (2001). Nonshared environmental associations between mothers and their children, in Paper Presented at the Biennial Meeting of the Society for Research in Child Development, Minneapolis. Towers, H. (2003). Contributions of current family factors and life events to women’s adjustment: nonshared environmental pathways, Dissertation-AbstractsInternational:-Section-B:-The-Sciences-and-Engineering 63(12-B), 6123. Turkheimer, E. & Waldron, M. (2000). Nonshared environment: a theoretical, methodological, and quantitative review, Psychological Bulletin 126(1), 78–108. van den Boom, D.C. (1994). The influence of temperament and mothering on attachment and exploration: an experimental manipulation of sensitive responsiveness among lower-class mothers with irritable infants, Child Development 65(5), 1457–1477.
HILARY TOWERS
AND JENAE
M. NEIDERHISER
Normal Scores and Expected Order Statistics CLIFFORD E. LUNNEBORG Volume 3, pp. 1439–1441 in Encyclopedia of Statistics in Behavioral Science ISBN-13: 978-0-470-86080-9 ISBN-10: 0-470-86080-4 Editors Brian S. Everitt & David C. Howell John Wiley & Sons, Ltd, Chichester, 2005
Normal Scores and Expected Order Statistics The use of normal scores as a replacement for ranks has been advocated as a way of increasing power in rank-sum tests (see Power Analysis for Categorical Methods). The rationale is that power is compromised by the presence of tied sums in the permutation distribution (see Permutation Based Inference) of rank-based test statistics, for example, 1 + 6 = 2 + 5 = 3 + 4 = 7. The likelihood of such tied sums is reduced when the ranks (integers) are replaced by scores drawn from a continuous distribution. There are three normal score transformations in the nonparametric inference literature. For each, the raw scores are first transformed to ranks. van der Waerden scores [7] are the best known of the normal score transformations. The ranks are replaced with quantiles of the standard normal random variable. In particular, the kth of n ranked score is transformed to the q(k) = [k/(n + 1)] quantile of the N (0, 1) random variable. For example, for k = 3, n = 10, q(k) = 3/11. The corresponding van der Waerden score is that z score below which fall 3/11 of the distribution of the standard normal. The van der Waerden scores can be obtained, of course, from tables of the standard normal. Most statistics packages, however, have normal distribution functions. For example, in the statistical language R (http://www.R-project.org) the qnorm function returns normal quantiles: We can get the set of n van der Waerden scores this way: > n < −8 > qnorm((1 : n)/(n + 1)) [1] −1.22064035 −0.76470967 −0.43072730 −0.13971030 0.13971030 0.43072730 [7] 0.76470967 1.22064035
Expected normal order statistics were proposed by Fisher and Yates [3]. The kth smallest raw score among n is replaced by the expectation of the kth smallest score in a random sample of n observations from the standard normal distribution. Special
tables [3] or [6] often are used for the transformation but the function normOrder in the R package SuppDists makes the transformation somewhat more accessible. > normOrder(8) [1] −1.42363084 −0.85222503 −0.47281401 −0.15255065 0.15255065 0.47281401 [7] 0.85222503 1.42363084
Random normal scores can be used to extend further the support for the permutation distribution. Both van der Waerden and expected normal order statistics are symmetric about the expected value of the normal random variable. Thus, there are a number of equivalent sums involving pairs of the normal scores. By using a set of ordered randomly chosen normal scores to code the ranks, these sum equivalencies are eliminated. The rationale for this approach is further developed in [1]. In R, the function rnorm returns sampled normal scores. Thus, the command > sort(rnorm(8)) [1] −1.287122544 −1.026961817 −0.149881236 −0.052743551 0.373083915 [6] 0.408619154 1.254177021 1.736467480
provides such a sorted set of random normal observations. A characteristic of this approach, of course, is that a second call to the function will produce a different set of random normal scores: > sort(rnorm(8)) [1] −2.00484017 −0.88854417 −0.82976365 −0.75127395 −0.61653373 −0.50975136 [7] −0.14575532 0.57991888
An analysis that is carried out using the first set of scores will differ, perhaps substantially, from an analysis that is carried out using the second set. Is this a drawback? Perhaps, though it reminds us that whenever we replace the original observations, with ranks or with other scores, the outcome of our subsequent analysis will be dependent on our choice of recoding.
2
Normal Scores and Expected Order Statistics
Example The dependence of the analytic outcome on the coding of scores is illustrated by an example adopted from Loftus, G. R. & Loftus, E. F. (1988) Essence of statistics. New York: Knopf. Eight student volunteers watched a filmed simulated automobile accident. Then, the students read what they were told was a newspaper account of the accident, an account that contained misinformation about the accident. Four randomly selected students were told that the article was taken from the New York Daily News, the other four that it appeared in the New York Times. Finally, students were tested on their memory for the details of the accident. The substantive (alternative) hypothesis is that recall of the accident would be more influenced by the Times account than by one attributed to the Daily News. In consequence, the Times group was expected to make greater errors of recall than would the Daily News students. The reported number of errors were (4, 8, 10, 5) for the Daily News group and (9, 13, 15, 7) for the Times group. Permutation tests (see Permutation Based Inference) are used here to test the null hypothesis of equal errors for the two attributions against the alternative. The test statistic used in each instance is the sum of the error score codes for the Times students. The P value for each test is the proportion of the 8!/(4!4!) = 70 permutations of the error score codes for which the Times sum is as large or larger than for the observed randomization of cases. Table 1 shows the variability in the P value over the different codings of the error scores.
The raw scores permutation test is the Pitman test and the rank-based permutation test is the Wilcoxon Mann-Whitney test. The normal codings of the ranked errors for the remaining four permutation tests are the ones reported earlier. All six permutation tests were carried out using the R function perm.test in the exactRankTests package.
Summary The indeterminacy of the random normal scores and the relative difficulty of determining expected normal order statistics have made van der Waerden scores the most common choice for normal scores. Asymptotic inference techniques for normal scores tests are outlined in [2]. More appropriate to small sample studies, however, are inferences based on the exact permutation distribution of the test statistic or a Monte Carlo approximation to that distribution. Such normal scores test are available, for example, in StatXact (http://www.cytel.com) and the general ideas are described in [4] and [5].
References [1]
[2] [3]
[4] [5]
Table 1
P values for different codings of responses
Error score coding Raw scores Ranks van der Waerden scores Expected normal order statistics Random normals, set 1 Random normals, set 2
P value 5/70 = 0.07143 7/70 = 0.10000 6/70 = 0.08571 8/70 = 0.11430 5/70 = 0.07143 7/70 = 0.10000
[6] [7]
Bell, C.B. & Doksum, K.A. (1965). Some new distribution-free statistics, The Annals of Mathematical Statistics 36, 203–214. Conover, W.J. (1999). Practical Nonparametric Statistics, 3rd Edition, Wiley, New York. Fisher, R.A. & Yates, F. (1957). Statistical Tables for Biological, Agricultural and Medical Research, 5th Edition, Oliver & Boyd, Edinburgh. Good, P. (2000). Permutation Tests, 2nd Edition, Springer, New York. Lunneborg, C.E. (2000). Data Analysis by Resampling, Duxbury, Pacific Grove. Owen, D.B. (1962). Handbook of Statistical Tables, Addison-Wesley, Reading. van der Waerden, B.L. (1952). Order tests for the twosample problem and their power, Proceedings Koninklijke Nederlander Akademie van Wetenschappen 55, 453–458.
CLIFFORD E. LUNNEBORG
Nuisance Variables C. MITCHELL DAYTON Volume 3, pp. 1441–1442 in Encyclopedia of Statistics in Behavioral Science ISBN-13: 978-0-470-86080-9 ISBN-10: 0-470-86080-4 Editors Brian S. Everitt & David C. Howell John Wiley & Sons, Ltd, Chichester, 2005
Nuisance Variables Nuisance variables are associated with variation in an outcome (dependent variable) that is extraneous to the effects of independent variables that are of primary interest to the researcher. In experimental comparisons among randomly formed treatment groups, the impact of nuisance variables is to increase experimental error and, thereby, decrease the likelihood that true differences among groups will be detected. Procedures that control nuisance variables, and thereby reduce experimental error, can be expected to increase the power for detecting group differences. In addition to randomization, there are three fundamental approaches to controlling for the effects of nuisance variables. First, cases may be selected to be similar for one or more nuisance variables (e.g., only 6-year-old girls with no diagnosed learning difficulties are included in a study). Second, statistical adjustments may be made using stratification (e.g., cases are stratified by sex and grade in school). Note that the most complete stratification occurs when cases are matched one-to-one prior to application of experimental treatments. Stratification, or matching, on more than a small number of nuisance variables is often not practical. Third, statistical adjustment can be made using regression procedures (e.g., academic ability is used as a covariate in analysis of covariance, ANCOVA). In the social sciences, randomization is considered a sine qua non for experimental research. Thus, even when selection, stratification, and/or covariance adjustment is used, randomization is still necessary for controlling unrecognized nuisance variables. Although randomization to experimental treatment groups guarantees their equivalence in expectation, the techniques of selection, stratification, and/or covariance adjustment are still desirable in order to increase statistical power. For example, if cases are stratified by grade level in school, then differences among grades and the interaction between grades and experimental treatments are no longer confounded with experimental error. In practice, substantial increases in statistical power can be achieved by controlling nuisance variances. Similarly, when
ANCOVA is used, the variance due to experimental error is reduced in proportion to the squared correlation between the dependent variable and the covariate. Thus, the greatest benefit from statistical adjustment occurs when the nuisance variables are strongly related to the dependent variable. In effect, controlling nuisance variables can reduce the sample size required to achieve a desired level of statistical power when making experimental comparisons. While controlling nuisance variables may enhance statistical power in nonexperimental studies, the major impetus for this control is that, in the absence of randomization, comparison groups cannot be assumed to be equivalent in expectation. Thus, in nonexperimental studies, the techniques of matching, stratification, and/or ANCOVA are utilized in an effort to control preexisting differences among comparison groups. Of course, complete control of nuisance variables is not possible without randomization. Thus, nonexperimental studies are always subject to threats to internal validity from unrecognized nuisance variables. For example, in a case-control study, controls may be selected to be similar to the cases and matching, stratification, and/or ANCOVA may be used, but there is no assurance that all relevant nuisance variables have been taken into account. Uncontrolled nuisance variables may be confounded with the effects of treatment in such a way that comparisons are biased. For studies involving the prediction of an outcome rather than comparisons among groups (e.g., multiple regression and logistic regression analysis), the same general concepts that apply to nonexperimental studies are relevant. In regression analysis, the independent contribution of a specific predictor, say X, can be assessed in the context of other predictors in the model. These other predictors may include dummy-coded stratification variables and/or variables acting as covariates (see Dummy Variables). Then, a significance test for the regression coefficient corresponding to X will be adjusted for these other predictors. However, as in the case of any nonexperimental study, this assessment may be biased owing to unrecognized nuisance variables. C. MITCHELL DAYTON
Number of Clusters DAVID WISHART Volume 3, pp. 1442–1446 in Encyclopedia of Statistics in Behavioral Science ISBN-13: 978-0-470-86080-9 ISBN-10: 0-470-86080-4 Editors Brian S. Everitt & David C. Howell John Wiley & Sons, Ltd, Chichester, 2005
Number of Clusters An issue that frequently occurs in cluster analysis is how many clusters are present, or which partition is ‘best’ (see Cluster Analysis: Overview)? This article discusses stopping rules that have been proposed for determining the best number of clusters, and concludes with a pragmatic approach to the problem for social science investigations. In hierarchical classification the investigator is presented with a complete range of clusters, from one (the entire data set), to n, the number of cases (where each cluster has one member). In the early stages of agglomerative clustering, each cluster comprises either a single case or a small homogeneous grouping of cases that are very similar. At the later stages, clusters are combined to form larger – more heterogeneous – groups of cases, and at the penultimate stage there are only two clusters that constitute, in terms of the clustering method, the best two groups. Clusters formed at the later stages therefore exhibit increasingly less homogeneity or greater heterogeneity. Divisive clustering methods work in the opposite direction – at the first stage dividing the sample into the two most homogeneous subgroups, then further subdividing the subgroups, and so on until a stopping rule can be applied or each cluster contains a single case and cannot be further subdivided. The issue of which partition is ‘best’ therefore resolves in some sense as to how much heterogeneity can be tolerated in the latest cluster formed by agglomeration or split by division. Numerous procedures have been proposed for determining the best number of clusters in a data set. In a review of 30 such rules, Milligan and Cooper [8] present a Monte Carlo evaluation of the performance of the rules when applied to the analysis of artificial data sets containing 2, 3, 4, or 5 distinct clusters by four hierarchical clustering methods. The procedures described below were found by Milligan and Cooper to outperform the others covered in their review. Mojena [9] approached the question of when to stop a hierarchical agglomerative process by evaluating the distribution of clustering criterion values that form the clusters at each level. These values normally constitute a series that is increasing or decreasing with each fusion. The approach is to search for an unusually large jump in an increasing series, corresponding to a disproportionate loss of homogeneity
caused by the forced fusion of two disparate clusters, despite their being the most similar of all possible fusion choices. For example, Ward’s method obtains a series of monotonically increasing fusion values α = {α1 , . . . , αn−1 } each of which is the smallest increase in the total within-cluster error sum of squares resulting from the fusion of two clusters. Such a series is shown in Table 1. It is argued that the ‘best’ partition is the one that precedes a significant jump in α, implying that the union of two clusters results in a large loss of homogeneity and should not therefore be combined. From visual inspection of Table 1 the four-cluster partition is a likely candidate for the best partition, because the increase in the error sums of squares, E, resulting from the next fusion to form 3 clusters (4.74) is more than three times the corresponding value (1.5) at which 4 clusters were formed. There is, therefore, a large loss of homogeneity at 3 clusters.
Mojena’s Upper Tail Rule This rule is proposed by Mojena [9]. It looks for the first value αj +1 that lies in the upper tail of the distribution of fusion values α, selecting the partition corresponding to the first stage j satisfying: αj +1 > µα + cσα
(1)
where µα is the mean and σα the standard deviation of the distribution of fusion values α = {α1 , . . . , αn−1 }, and c is a chosen constant, sometimes referred to as the ‘critical score’. The rule states that n − j clusters are best for the first αj +1 that satisfies the rule. If no value for αj +1 satisfies the rule then one cluster is best. When αj +1 satisfies the rule, this means that αj +1 exceeds the mean µα of α by at least c standard deviations σα . It is the first big jump in α, suggesting that partition j is reasonably homogeneous whereas partition j + 1 is not. Partition j is therefore selected as the first best partition. Using Ward’s method, Mojena and Wishart [10] found values of c ≤ 2.5 performed best, whereas Milligan and Cooper [8] suggest c = 1.25. The reason for the discrepancy between these findings is not clear, but it may lie in the choice of proximity measure: for Ward’s method it is necessary to use a proximity matrix of squared Euclidean distances, whereas Milligan and Cooper computed Euclidean
2
Number of Clusters
Table 1 Fusion values in the last 10 stages of hierarchical clustering by Ward’s method. Each value is the increase in error sum of squares, E, resulting from the next fusion, thus 28.821 is the increase in E caused by joining the last 2 clusters Number of clusters Fusion values α
10
9
8
7
6
5
4
3
2
0.457
0.457
0.903
1.172
1.228
1.500
4.740
7.303
28.821
distances. Having evaluated a range of values of c for different artificial data sets, Milligan and Cooper concluded that the best critical score c also varied with the numbers of clusters. As the number of clusters is not known a priori, this finding is not generally useful. The series in Table 1 has a mean of 2 and standard deviation of 5.96, so rule (1) checks for the first fusion value that exceeds 9.45 for a critical score of c = 1.25, or higher values for higher critical score c. This is not very helpful, as the only partition for which αj +1 exceeds 9.45 is at 2 clusters. An alternative approach is to look for the first fusion value that represents a significant departure from the mean µα of α, using the t test statistic: αj +1 − µα √ σα / n − 1
(2)
which has a t distribution with n − 2 degrees of freedom under normality. The first partition for which (2) is significant at the 5% probability level indicates a large loss of homogeneity, and hence the previous partition j is taken as best. This has the advantages that the test takes into account the length of the α series, which determines the degrees of freedom, and eliminates the need to specify a value for the critical score c. It is implemented by Wishart [11] in his ‘Best Cut’ Clustan procedure. For the data in Table 1, which were drawn from a fusion sequence of 24 values on 25 entities, the t statistics that are significant at the 5% level with 23 degrees of freedom are 22.05, 4.36, and 2.25 at the 2, 3, and 4 cluster partitions, respectively. The t test therefore confirms the earlier intuitive conclusion for Table 1 that 4 clusters is the best level of disaggregation for these data. It should be noted that Mojena’s upper tail rule is only defined for a hierarchical classification: it cannot be used to compare two partitions obtained by optimization because it requires a series of fusion values. Mojena [9] also evaluated a moving average quality control rule, and Mojena and Wishart [10] evaluated a double exponential smoothing rule. However, both
of these performed poorly in tests although the latter rule proved to be most accurate in predicting the next value in the α series at each stage. They concluded that the upper tail rule performed best using Ward’s method with artificial data sets.
Calinski and Harabasz’ Variance Ratio ´ Cali´nski and Harabasz [4] propose a variance ratio test statistic: C(k) =
trace(B)/(k − 1) trace(W)/(n − k)
(3)
where W is the within-groups dispersion matrix and B is the between-groups dispersion matrix for a partition of n cases into k clusters. Since trace(W) is the within-groups sum of squares, and trace(B) is the between-groups sum of squares, we have trace(W) + trace(B) = trace(T), the total sum of squares, which is a constant. It follows that C(k) is a weighted function of the within-groups error sum of squares trace(W). Cali´nski and Harabasz suggest that the partition with maximum C(k) is indicated as the best; if C(k) increases monotonically with k then a best partition of the data may not exist. This rule has the advantage that no critical score needs to be chosen. It can be used with both hierarchical classifications and partitioning methods, because it is separately evaluated for each partition of k clusters. Milligan and Cooper [8] found the variance ratio criterion to be the top performer in their tests. Everitt et al. [6] used C(k) to evaluate different partitions of garden flower data, for which 4 clusters were selected.
Beale’s F test Beale [3] proposes a test statistic on the basis of the sum of squared distances between the cases and the cluster means. It is an F test, defined for two
Number of Clusters partitions into k1 and k2 clusters (k2 > k1 ): F (k1 , k2 ) =
(S12 − S22 )/S22 (n − k1 )/(n − k2 ) (k2 /k1 )2/n − 1 (4)
where S12 and S22 are the sum of squared distances for clusters 1 and 2. The statistic is compared with an F distribution with (k2 − k1 ) and (n − k2 ) degrees of freedom. If the F-ratio is significant for any k2 then Beale concludes that the partition k1 is not entirely adequate. Beale’s F test was proposed for the comparison of two k-means partitions generally, but it can also be used with hierarchical methods for all (kj , j = n, . . . , 2) in a dendrogram (see k -means Analysis). In hierarchical agglomerative clustering, for example, the rule stops at the first partition j for which F (kj +1 , kj ) is significant with 1 and (n − kj ) degrees of freedom. This rule has the advantage that no critical score needs to be chosen.
3
or divisive procedures by considering all the test statistics {L(m), m = 1, . . . , k} in a series of fusions or divisions, and stopping when the test statistic exceeds its critical value. Milligan and Cooper [8] conclude that the error ratio performed best with a critical value of 3.3 in their tests.
Hubert and Levin’s C Index For a given partition into k clusters, Hubert and Levin [7] compute the sum Dk of the within-cluster dissimilarities for a series or set of partitions of the same data. The minimum Dmin and maximum Dmax of Dk are found, and their C-index is a standardized Dk : C=
Dk − Dmin Dmax − Dmin
(7)
The minimum value of C across all partitions evaluated is taken to be the best.
Baker and Hubert’s Gamma Index Duda and Hart’s Error Ratio Duda and Hart [5] propose a local criterion that evaluates the division of a cluster m into two subclusters, without reference to the remainder of the data set: Je(2) Je(1)
(5)
where Je(1) is the sum of the squared distances of the cases from the mean of cluster m, and Je(2) is the sum of the squared distances of the cases from the means of the two subclusters following division of cluster m. The null hypothesis that cluster m is homogeneous, and hence the cluster should be subdivided, is rejected if: L(m) = {1 − Je(2)/Je(1) − 2/(πp)} 1/2 × nm p/ 2 1 − 8/(π 2 p)
(6)
exceeds the critical value from a standard normal distribution, where p is the number of variables and nm is the number of cases in cluster m. The rule can be applied to hierarchical agglomerative clustering, when the fusion of two clusters to form cluster m is being considered. Although it is local to the consideration of cluster m alone, the rule can be applied globally to any hierarchical agglomerative
Given the input dissimilarities dij and output ultrametric distances hij corresponding to a hierarchical classification, Baker and Hubert [2] define a Gamma index: S+ − S− γ = (8) S+ + S− where S+ denotes the number of subscript combinations (i, j, k, l) for which gij kl ≡ (dij − dkl ) (hij − hkl ) is positive, and S− is the number that are negative. It is a measure of the concordance of the classification: S+ is the number of consistent comparisons of within-cluster and between-cluster dissimilarities, and S− is the number of inconsistent comparisons. When γ = 1, the correspondence is perfect. It is evaluated at all partitions and the maximum value for γ indicates the best partition (see Measures of Association). It should be noted that γ is only defined for a classification obtained by the transformation of a proximity matrix into a dendrogram. It cannot, therefore, be used with optimization methods.
Wishart’s Bootstrap Validation Wishart [11] compares a dendrogram obtained for a given data set with a series of dendrograms generated
4
Number of Clusters
at random from the same data set or proximity matrix (see Proximity Measures). The method searches for partitions that manifest the greatest departure from randomness. In statistical terms, it seeks to reject the null hypothesis that the structure displayed by a partition is random. The data are rearranged at random within columns (variables) and a dendrogram is obtained from the randomized data. This randomization is then repeated in a series of r random trials, where each trial generates a different dendrogram for the data arranged in a different random order. The mean µj and standard deviation σj are computed for the fusion values obtained at each partition j in r trials, and a confidence interval for the resulting family of dendrograms is obtained. The dendrogram for the given data, before randomization, is then compared with the distribution of dendrograms obtained with the randomized data, looking for significant departures from randomness. The null hypothesis that the given data are random and exhibit no structure is evaluated by a t test on the fusion value αj at each partition j compared with the distribution of randomized dendrograms: αj − µj (9) √ σj / r − 1 which is compared with the t distribution with r − 2 degrees of freedom. The best partition is taken to be the one that exhibits the greatest departure from randomness, if any, as indicated by the largest value for the t statistic (9) that is significant at the 5% probability level. This rule can be used with hierarchical agglomerative and partitioning clustering methods.
Summary A classification is a summary of the data. As such, it is only of value if the clusters make sense and the resulting typology can be usefully applied. They can make sense at different levels of disaggregation – biological organisms are conventionally classified as families, genera and species, and can be further subdivided as subspecies or hybrids. All of these make sense, either in the context of a broad-brush phylogeny or more narrowly in species-specific terms. The same can apply in the behavioral sciences, where people can be classified in an infinite variety of ways. We should never forget that every person is a unique individual. However, it is the stuff of social science
to look for key structure and common features in an otherwise infinitely heterogeneous population. The analysis does not necessarily need to reveal a satisfactory typology for all of the data – it may be enough, for example, to find a cluster of individuals who are sufficiently distinctive to warrant further study. This article has presented various tests for the number of clusters in a data set, to satisfy an expressed need to justify the choice of one partition from several alternatives. Some editors of applied journals regrettably demand proof of ‘significance’ before accepting cluster analysis results for publication. In noting that there is a preoccupation with finding an ‘optimal’ number of groups, and a single ‘best’ classification, Anderberg [1] observes that the possibility of several alternative classifications reflecting different aspects of the data is rarely entertained. A more pragmatic approach is to select the partition into the largest number of clusters that makes sense as a workable summary of the data and can be adequately described. This constitutes a first baseline of greatest disaggregation. Higher-order partitions may then be reviewed with reference to this baseline, where more simplification is indicated. Clusters of particular interest can be drawn out for special attention. If appropriate, a hierarchy of the relationships between the baseline clusters can be presented, and it may be helpful to order this hierarchy optimally so that higher-order clusters are grouped in a meaningful way. A ‘best’ partition may then be drawn out as a level of special interest in the hierarchy, if indeed it does constitute a sensible best summary. It may be helpful to recall the parallel of a biological classification – at the family, genus, and species levels of detail. In practice, the presentation of a classification is no different from writing intelligently and purposefully on any subject, where the degree of technical detail or broad summary depends crucially on what the author aims to communicate to his intended audience. In the final analysis, the crucial test is whether or not the presentation is communicated successfully to the reader and survives peer review and the passage of time.
References [1] [2]
Anderberg, M.R. (1973). Cluster Analysis for Applications, Academic Press, London. Baker, F.B. & Hubert, L.J. (1975). Measuring the power of hierarchical cluster analysis, Journal American Statistical Association 70, 31–38.
Number of Clusters [3] [4]
[5] [6] [7]
[8]
Beale, E.M.L. (1969). Euclidean cluster analysis, Proceedings ISA 43, 92–94. Cali´nski, T. & Harabasz, J. (1974). A dendrite method for cluster analysis, Communications in Statistics 3, 1–27. Duda, R.O. & Hart, P.E. (1973). Pattern Classification and Scene Analysis, Wiley, New York. Everitt, B.S., Landau, S. & Leese, M. (2001). Cluster Analysis, 4th Edition, Edward Arnold, London. Hubert, L.J. & Levin, J.R. (1976). A general statistical framework for assessing categorical clustering in free recall, Psychological Bulletin 83, 1072–1080. Milligan, G.W. & Cooper, M.C. (1985). An examination of procedures for determining the number of clusters in a data set, Psychometrika 50, 159–179.
[9]
5
Mojena, R. (1977). Hierarchical grouping methods and stopping rules: an evaluation, Computer Journal 20, 359–363. [10] Mojena, R. & Wishart, D. (1980). Stopping rules for Ward’s clustering method, in COMPSTAT 1980: Proceedings in Computational Statistics, M.M. Barritt & D. Wishart, eds, Physica-Verlag, Wien, pp. 426–432. [11] Wishart, D. (2004). ClustanGraphics Primer: A Guide to Cluster Analysis, Clustan, Edinburgh.
DAVID WISHART
Number of Matches and Magnitude of Correlation MICHAEL J. ROVINE Volume 3, pp. 1446–1448 in Encyclopedia of Statistics in Behavioral Science ISBN-13: 978-0-470-86080-9 ISBN-10: 0-470-86080-4 Editors Brian S. Everitt & David C. Howell John Wiley & Sons, Ltd, Chichester, 2005
Number of Matches and Magnitude of Correlation The first expression of a correlation coefficient for categorical variables is often attributed to G. Udny Yule For two dichotomous variables, x and y, the 2 × 2 contingency table (see Table 1). Yule’s Q was defined as Q=
ad − bc . ad + bc
(1)
The coefficient functions as an index of the number of observations that occur in the main diagonal of the 2 × 2 table. According to Yule [9], this coefficient has the defining characteristics of a proper correlation coefficient: namely, its maximum values are 1 and −1, and 0 represents no association between the two variables. Coming out of the tradition of Pearson bivariate normal notion of the correlation coefficient, Yule interpreted the coefficient as the size of the association between two random variables much as he would interpret a Pearson product moment correlation coefficient. Pearson [2] emphasized this definition as he described a set of alternative versions of Q (Q1 , Q3 − Q5 renaming Yule’s Q, Q2 ). Pearson’s coefficients were more easily seen as the association of a pair of dichotomized normal variables. Although these were the first coefficients that were described using the word, ‘correlation’, they were preceded by a set of association coefficients (see Measures of Association) developed to estimate the association between the prediction of an event and the occurrence of an event. In an 1884 science paper, the American logician, C. S. Peirce, defined a coefficient of ‘the science of the method’ that he used to assess the accuracy of tornado prediction [5]. His coefficient, i, was defined as Table 1
A 2 × 2 contingency table x 1 0 Total
1
a
b
a+b
0
c
d
c+d
a+c
b+d
N
Y
Total
i=
ad − bc (a + c)(b + d)
(2)
like Yule’s Q and is interesting. The numerator in Peirce’s i is the determinant of a 2 × 2 matrix. This numerator is common to most association coefficients for dichotomous variables. The denominator is a scaling factor that limits the maximum absolute value of the coefficient to 1. For Peirce’s i, the denominator is the product of the marginals. It is interesting to compare Peirce’s i to the ϕ-coefficient, the computational form of Pearson’s product-moment coefficient: ϕ= √
ad − bc . (a + c)(b + d)(a + b)(c + d)
(3)
For equal marginal values (i.e., (a + b) = (a + c) and (b + d) = (c + d)), the two formulas are identical. Where i is an asymmetric coefficient, ϕ is a symmetric coefficient. As an association coefficient, the correlation coefficient had a number of different interpretations [3]. The work of Spearman [7] on the use of the correlation coefficient to assess reliability led to a number of different forms and additional interpretations of the coefficient. Spearman and Thorndike [8] presented papers that brought the correlation coefficient into the area of measurement error. Thorndike suggested the use of a median-ratio coefficient as having a more direct relation to each pair of variable values rather than considering the table as a whole. Cohen [1] suggested an association coefficient based on the idea of inter-rater agreement (see Rater Agreement – Kappa). His κ coefficient defined agreement above and beyond chance. For the 2 × 2 table, κ=
(a + d) − (ae + de ) N − (ae + de )
(4)
where ae and de are the cell values expected by chance. An indicator of effect size (see Effect Size Measures) more directly interpreted in terms of each pair of variable values has been presented by Rosenthal and Rubin [4]. Their Binomial Effect Size Display (BESD) was used to relate a treatment to an outcome. In a 2 × 2 table, the BESD indicates the increase in success rate of the treatment over the outcome. An example appears in Table 2. The correlation coefficient, r = 0.60 compares the proportions of those in the treatment group with good outcomes to those
2
Number of Matches and Magnitude of Correlation
in the control group. It, thus, represents the increase in proportion of positive outcomes expected as one moves from the control group to the treatment group. Rosenthal and Rubin [4] extended the BESD to a continuous outcome variable. Rovine and von Eye [6] approached the correlation coefficient in a similar fashion, considering the correlation coefficient in the case of two continuous variables. They considered the proportion of matches in a bivariate distribution. They began with 100 observations of two uncorrelated variables, x and y, each drawn from a normal distribution. They created one match by replacing the first observation of y with the first observation of x; two matches by replacing the first two observations of y with the first two observations of x; and so on. In this manner, they created 100 new data sets with 1 to 100 matches. They calculated the correlation for each of these data sets. The correlation approximately equaled the proportion of matches in the data set. They extended this definition by defining a match for two standardized variables as the instance in which the values of the y falls within a specific interval, [−A, A] of the value of x. In this case, matches do not have to be exact. They can be targeted so that a match can represent values on two variables falling in the same general location in each distribution as defined by an interval around the value of the variable, x. For matches in which the values for variable x is a, the interval would be a ± A. If the value b on variable y fell within that range, a match would occur. For k matches in n observations where a match is defined as described above, the correlation coefficient one would observe is k n (5) r≈ 1/2 . k A2 1+ n 3
the correlation one could expect to observe. If the interval, A is 0, then a match represents an exact match, and the correlation represents the proportion of matches. As the interval that defines a match increases, it would allow unrelated values of y to more likely qualify as a match. As a result, the observed correlation would tend to be smaller. This result yields a way of interpreting the correlation coefficient that relates back to the notion of a contingency table. For categorical variables, the observed frequencies in a contingency can be used to help interpret the meaning of a correlation coefficient. For continuous variables, the idea of a match allows the investigator to consider the relationship between two continuous variables in a similar fashion. The correlation as a proportion of matches represents another way of interpreting what a correlation coefficient of a particular size can mean.
According to this equation, increasing the interval in which a match occurs decreases the size of
[8]
References [1]
[2]
[3]
[4]
[5]
[6]
[7]
[9]
Table 2
A BESD coefficient Outcome Good Bad Proportion
Treatment y Control
40
10
0.80
10
40
0.20
BESD = 0.80 − 0.20 = 0.60.
Cohen, J.J. (1960). A coefficient of agreement for nominal scales, Educational and Psychological Measurement 20(1), 37–46. Pearson, K. (1901). On the correlation of characters not quantitatively measurable, Philosophical Transactions of the Royal Society of London, Series A 195, 1–47. Rodgers, J.L. & Nicewander, W.A. (1988). Thirteen ways to look at the correlation coefficient, American Statistician 42(1), 59–66. Rosenthal, R. & Rubin, D.B. (1982). A simple, general purpose display of magnitude of experimental effect, Journal of Educational Psychology 74(2), 166–169. Rovine, M.J. & Anderson, D.A. (2004). Peirce and Bowditch: an American contribution to correlation and regression, The American Statistician 59(3), 232–236. Rovine, M.J. & von Eye, A. (1997). A 14th way to look at a correlation coefficient: correlation as a proportion of matches, The American Statistician 51(1), 42–46. Spearman, C. (1907). Demonstration of formulae for true measurement of correlation, American Journal of Psychology XVIII, 160–169. Thorndike, E.L. (1907). Empirical studies in the theory of measurement, New York. Yule, G.U. (1900). On the association of attributes in statistics: with illustrations from the material from the childhood society, and so on, Philosophical Transactions of the Royal Society of London, Series A 194, 257–319.
Further Reading Peirce, C.S. (1884). The numerical measure of the success of predictions, Science 4, 453–454.
MICHAEL J. ROVINE
Number Needed to Treat EMMANUEL LESAFFRE Volume 3, pp. 1448–1450 in Encyclopedia of Statistics in Behavioral Science ISBN-13: 978-0-470-86080-9 ISBN-10: 0-470-86080-4 Editors Brian S. Everitt & David C. Howell John Wiley & Sons, Ltd, Chichester, 2005
Number Needed to Treat The Number Needed to Treat and Related Measures Suppose that in a randomized controlled clinical trial two treatments are administered and that the probabilities of showing a positive response are π1 (new treatment) and π2 (control treatment), respectively. The absolute risk reduction (AR) is defined as π1 − π2 and expresses in absolute terms the benefit (or the harm) of the new over the control treatment. For example, when π1 = 0.10 and π2 = 0.05, then in 100 patients there are on average five more responders in the new treatment group. Other measures to indicate the benefit of the new treatment are the relative risk (RR), the relative risk reduction(RRR) and the odds-ratio (OR). On the other hand, Laupacis et al.[3] suggested the number needed to treat (NNT) as a measure to express the benefit of the new treatment. The NNT is defined as the inverse of the absolute risk reduction, that is, NNT = 1/AR. In our example, NNT is equal to 20 and is interpreted that 20 patients need to be treated with the new treatment to ‘save’ one extra patient if these patients were treated with the control treatment. In practice, the true response rates π1 and π2 are estimated by pj = rj /nj (j = 1, 2), respectively where nj (j = 1, 2) are the number of patients in the new and control treatment, respectively with rj (j = 1, 2) responders. Replacing the true response rates by their estimates in the expressions above yields an estimated value for AR, NNT, RR, and OR, denoted rr and or, respectively. However, in the by ar, NNT, literature the notation NNT is used most often for the estimated number needed to treat.
A Clinical Example In a double-blind, placebo-controlled study, 23 epileptic patients were randomly assigned to topiramate 400 mg/day treatment (as adjunctive therapy) and 24 patients to control treatment [4]. It is customary to measure the effect of an antiepileptic drug by the percentage of patients who show at least 50% reduction in seizure rate at the end of the treatment period. Two patients on placebo showed at least 50% reduction in seizure rate, compared
to eight patients on topiramate (p = 0.036, Fisher’s Exact test). The proportion of patients with at least a 50% reduction is p2 = 2/24 = 0.083 for placebo and p1 = 8/23 = 0.35 for topiramate. The absolute risk reduction ar = p1 − p2 = 0.26 expresses the absolute gain due to the administration of the active (add-on) medication on epileptic patients. The odds-ratio (0.35/0.65)/(0.083/0.917) = 5.87 is more popular in epidemiology and has attractive statistical properties [2], but is difficult to interpret by clinicians. With the relative risk = 4.17, we estimate that the probability of being a 50% responder is about 4 times as high under topiramate as under placebo. The estimated NNT equals 1/0.26 = 3.78, which is interpreted that about four patients must receive topiramate if we are to achieve one additional good outcome compared to placebo.
Controversies on the Number Needed to Treat Proponents of the NNT (often, but not exclusively, clinical researchers) argue that this measure is much better understood by clinicians than the odds-ratio. Further, they argue that relative measures like or and rr cannot express the benefit of the new over the control treatment in daily life. Initially, some false claims were made such as that the NNT can give a clearer picture of the difference between treatment results, see for example [2] for an illustration. Antagonists of the NNT state that it is not clear whether clinicians indeed understand this measure better than the other measures. Further, they point out that clinicians neglect the statistical uncertainty with which NNT is estimated in practice. The sta have been investigated by tistical properties of NNT Lesaffre and Pledger [4]. They indicate the following becomes infinite; problems: (a) when ar = 0, NNT is complicated because (b) the distribution of NNT its behavior around ar = 0; (c) the moments of NNT do not exist; and (d) simple calculations with NNT like addition can give nonsensical results. To take the statistical uncertainty of estimating NNT into account, one could calculate the 100(1 − α)% con fidence interval (CI) for NNT on the basis of NNT. For the above example, the 95% C.I. is equal to [2.01, 24.6]. However, when the 100(1 − α)% C.I. for AR, [arun , arup ], includes 0 the 100(1 − α)% C.I.
2
Number Needed to Treat
for NNT splits up into two disjoint intervals ] − ∞, 1/ar up ] ∪ [1/arun , ∞[. is only reported There is also the risk that NNT is when H0 : AR = 0 is rejected, or when NNT positive. Altman [1] suggested using the term number < 0, needed to treat for harm (NNTH) when NNT implying that it would represent the (average) number of patients needed to treat with the new treatment to harm one extra patient than when these patients were treated with the control treatment. The NNTH has been used for reporting adverse events in a clinical > 0, Altman called it the number trial. When NNT needed to treat for benefit (NNTB).
The Use of the Number Needed to Treat in Evidence-based Medicine Lesaffre and Pledger [4] showed that the number needed to treat deduced from a meta-analysis should definitely not be calculated on the NNT-scale, but on the AR-scale if that scale shows homogeneity. However, there is often heterogeneity on the ARscale and hence also for the NNT, because they both depend on the baseline risk. This also implies that one needs to be very careful when extrapolating the estimated NNT to new patient populations. On the other hand, there is more homogeneity on the ORand the RR-scale. Suppose that the meta-analysis has been done on the relative risk scale then, because NNT = 1/(1 − RR)π2 , one can estimate the NNT where for a patient with baseline risk kπ2 by kNNT, NNT is obtained from the meta-analysis. This result is the basis for developing nomograms for individual patients with risks different from the risks observed in the meta-analysis. Furthermore, the literature clearly shows that NNT has penetrated in all therapeutic areas and even more importantly is a key tool for inference in Evidencebased Medicine and even so for the authorities.
Conclusion Despite the controversy between clinicians and statisticians and the statistical deficiencies of NNT, the number needed to treat is becoming increasingly popular in Evidence-based Medicine and is well appreciated by the authorities. In a certain sense, this popularity is surprising since 100AR is at least as easy to interpret; namely, it is the average number of extra patients that benefit from the new treatment (over the control treatment) when 100 patients were treated. Moreover, ar has better statistical properties than NNT.
References [1] [2]
[3]
[4]
Altman, D.G. (1998). Confidence intervals for the number needed to treat, British Medical Journal 317, 1309–1312. Hutton, J. (2000). Number needed to treat: properties and problems, Journal of the Royal Statistical Society, Series A 163, 403–419. Laupacis, A., Sackett, D.L. & Roberts, R.S. (1988). An assessment of clinically useful measures of the consequences of treatment, New England Journal of Medicine 318, 1728–1733. Lesaffre, E. & Pledger, G. (1999). A note on the number needed to treat, Controlled Clinical Trials 20, 439–447.
Further Reading Chatellier, G., Zapletal, E., Lemaitre, D., Menard, J. & Degoulet, P. (1996). The number of needed to treat: a clinically useful nomogram in its proper context, British Medical Journal 312, 426–429.
EMMANUEL LESAFFRE
Observational Study PAUL R. ROSENBAUM Volume 3, pp. 1451–1462 in Encyclopedia of Statistics in Behavioral Science ISBN-13: 978-0-470-86080-9 ISBN-10: 0-470-86080-4 Editors Brian S. Everitt & David C. Howell John Wiley & Sons, Ltd, Chichester, 2005
Observational Study Observational Studies Defined In the ideal, the effects caused by treatments are investigated in experiments that randomly assign subjects to treatment or control, thereby ensuring that comparable groups are compared under competing treatments [1, 5, 15, 23]. In such an experiment, comparable groups prior to treatment ensure that differences in outcomes after treatment reflect effects of the treatment (see Clinical Trials and Intervention Studies). Random assignment uses chance to form comparable groups; it does not use measured characteristics describing the subjects before treatment. As a consequence, random assignment tends to make the groups comparable both in terms of measured characteristics and characteristics that were not or could not be measured. It is the unmeasured characteristics that present the largest difficulties when randomization is not used. More precisely, random assignment ensures that the only differences between treated and control groups prior to treatment are due to chance – the flip of a coin in assigning one subject to treatment, another to control – so if a common statistical test rejects the hypothesis that the difference is due to chance, then a treatment effect is demonstrated [18, 22]. Experiments with human subjects are often ethical and feasible when (a) all of the competing treatments under study are either harmless or intended and expected to benefit the recipients, (b) the best treatment is not known, and in light of this, subjects consent to be randomized, and (c) the investigator can control the assignment and delivery of treatments. Experiments cannot ethically be used to study treatments that are harmful or unwanted, and experiments are not practical when subjects refuse to cede control of treatment assignment to the experimenter. When experiments are not ethical or not feasible, the effects of treatments are examined in an observational study. Cochran [12] defined an observational study as an empiric comparison of treated and control groups in which: the objective is to elucidate cause-and-effect relationships [. . . in which it] is not feasible to use controlled experimentation, in the sense of being able to impose the procedures or treatments whose effects
it is desired to discover, or to assign subjects at random to different procedures.
When subjects are not assigned to treatment or control at random, when subjects select their own treatments or their environments inflict treatments upon them, differing outcomes may reflect these initial differences rather than effects of the treatments [12, 59]. Pretreatment differences or selection biases are of two kinds: those that have been accurately measured, called overt biases, and those that have not been measured but are suspected to exist, called hidden biases. Removing overt biases and addressing uncertainty about hidden biases are central issues in observational studies. Overt biases are removed by adjustments, such as matching, stratification, or covariance adjustment (see Analysis of Covariance), which are discussed in the Section titled 'Adjusting for Biases Visible in Observed Covariates'. Hidden biases are addressed partly in the design of an observational study, discussed in the Sections titled 'Design of Observational Studies' and 'Elaborate Theories', and partly in the analysis of the study, discussed in the Sections titled 'Appraising Sensitivity to Hidden Bias' and 'Elaborate Theories' (see Quasi-experimental Designs; Internal Validity; and External Validity).
Examples of Observational Studies

Several examples of observational studies are described. Later sections refer to these examples.
Long-term Psychological Effects of the Death of a Close Relative

In an attempt to estimate the long-term psychological effects of bereavement, Lehman, Wortman, and Williams [30] collected data following the sudden death of a spouse or a child in a car crash. They matched 80 bereaved spouses and parents to 80 controls drawn from 7581 individuals who came to renew a driver's license. Specifically, they matched for gender, age, family income before the crash, education level, and number and ages of children. Contrasting their findings with the views of Bowlby and Freud, they concluded:

Contrary to what some early writers have suggested about the duration of the major symptoms
of bereavement . . . both spouses and parents in our study showed clear evidence of depression and lack of resolution at the time of the interview, which was 5 to 7 years after the loss occurred.
Effects on Criminal Violence of Laws Limiting Access to Handguns

Do laws that ban purchases of handguns by convicted felons reduce criminal violence? It would be difficult, perhaps impossible, to study this in a randomized experiment, and yet an observational study faces substantial difficulties as well. One could not reasonably estimate the effects of such a law by comparing the rate of criminal violence among convicted felons barred from handgun purchases to the rate among all other individuals permitted to purchase handguns. After all, convicted felons may be more prone to criminal violence and may have greater access to illegally purchased guns than typical purchasers of handguns without felony convictions. For instance, in his ethnographic account of violent criminals, Athens [3, p. 68] depicts their sporadic violent behavior as consistent with stable patterns of thought and interaction, and writes: ". . . the self-images of violent criminals are always congruent with their violent criminal actions." Wright, Wintemute, and Rivara [68] compared two groups of individuals in California: (a) individuals who attempted to purchase a handgun but whose purchase was denied because of a prior felony conviction, and (b) individuals whose purchase was approved because their prior felony arrest had not resulted in a conviction. The comparison looked forward in time from the attempt to purchase a handgun, recording arrest charges for new offenses in the subsequent three years. Presumably, group (b) is a mixture of some individuals who did not commit the felony for which they were arrested and others who did. If this presumption were correct, group (a) would be more similar to group (b) than to typical purchasers of handguns, but substantial biases may remain.
Effects on Children of Occupational Exposures to Lead

Morton, Saah, Silberg, Owens, Roberts, and Saah [39] asked whether children were harmed by lead brought home in the clothes and hair of parents
who were exposed to lead at work. They matched 33 children whose parents worked in a battery factory to 33 unexposed control children of the same age and neighborhood, and used Wilcoxon’s signed rank test to compare the level of lead found in the children’s blood, finding elevated levels of lead in exposed children. In addition, they compared exposed children whose parents had varied levels of exposure to lead at the factory, finding that parents who had higher exposures on the job in turn had children with more lead in their blood. Finally, they compared exposed children whose parents had varied hygiene upon leaving the factory at the end of the day, finding that poor hygiene of the parent predicted higher levels of lead in the blood of the child.
Design of Observational Studies

Observational studies are sometimes referred to as natural experiments [36, 56] or as quasi-experiments [61] (see Quasi-experimental Designs). These differences in terminology reflect certain differences in emphasis, but a shared theme is that the early stages of planning or designing an observational study attempt to reproduce, as nearly as possible, some of the strengths of an experiment [47]. A treatment is a program, policy, or intervention which, in principle, may be applied to or withheld from any subject under study. A variable measured prior to treatment is not affected by the treatment and is called a covariate. A variable measured after treatment may have been affected by the treatment and is called an outcome. An analysis that does not carefully distinguish covariates and outcomes can introduce biases into the analysis where none existed previously [43]. The effect caused by a treatment is a comparison of the outcome a subject exhibited under the treatment the subject actually received with the potential but unobserved outcome the subject would have exhibited under the alternative treatment [40, 59]. Causal effects so defined are sometimes said to be counterfactual (see Counterfactual Reasoning), in the specific sense that they contrast what did happen to a subject under one treatment with what would have happened under the other treatment. Causal effects cannot be calculated for individuals, because each individual is observed under treatment or under control, but not both. However, in a randomized experiment, the treated-minus-control difference
in mean outcomes is an unbiased and consistent estimate of the average effect of the treatment on the subjects in the experiment. In planning an observational study, one attempts to identify circumstances in which some or all of the following elements are available [47].
Key covariates and outcomes are available for treated and control groups. The most basic elements of an observational study are treated and control groups, with important covariates measured before treatment, and outcomes measured after treatment. If data are carefully collected over time as events occur, as in a longitudinal study (see Longitudinal Data Analysis), then the temporal order of events is typically clear, and the distinction between covariates and outcomes is clear as well. In contrast, if data are collected from subjects at a single time, as in a cross-sectional study (see Cross-sectional Design) based on a single survey interview, then the distinction between covariates and outcomes depends critically on the subjects’ recall, and may not be sharp for some variables; this is a weakness of cross-sectional studies. Age and sex are covariates whenever they are measured, but current recall of past diseases, experiences, moods, habits, and so forth can easily be affected by subsequent events. Haphazard treatment assignment rather than self-selection. When randomization is not used, treated and control groups are often formed by deliberate choices reflecting either the personal preferences of the subjects themselves or else the view of some provider of treatment that certain subjects would benefit from treatment. Deliberate selection of this sort can lead to substantial biases in observational studies. For instance, Campbell and Boruch [10] discuss the substantial systematic biases in many observational studies of compensatory programs intended to offset some disadvantage, such as the US Head Start Program for preschool children. Campbell and Boruch note that the typical study compares disadvantaged subjects eligible for the program to controls who were not eligible because they were not sufficiently disadvantaged. When randomization is not possible, one should try to identify circumstances in which an ostensibly irrelevant event, rather than
deliberate choice, assigned subjects to treatment or control. For instance, in the United States, class sizes in government-run schools are largely determined by the degree of wealth in the local region, but in Israel, a rule proposed by Maimonides in the 12th century still requires that a class of 41 must be divided into two separate classes. In Israel, what separates a class of size 40 from classes half as large is the enrollment of one more student. Angrist and Lavy [2] exploited Maimonides' rule in their study of the effects of class size on academic achievement in Israel. Similarly, Oreopoulos [41] studies the economic effects of living in a poor neighborhood by exploiting the policy of Toronto's public housing program of assigning people to housing in quite different neighborhoods simply based on their position in a waiting list. Lehman et al. [30], in their study of bereavement in the Section titled 'Long-term Psychological Effects of the Death of a Close Relative', limited the study to car crashes for which the driver was not responsible, on the grounds that car crashes for which the driver was responsible were relatively less haphazard events, perhaps reflecting forms of addiction or psychopathology. Random assignment is a fact, but haphazard assignment is a judgment, perhaps a mistaken one; however, haphazard assignments are preferred to assignments known to be severely biased.

Special populations offering reduced self-selection. Restriction to certain subpopulations may diminish, albeit not eliminate, biases due to self-selection. In their study of the effects of adolescent abortion, Zabin, Hirsch, and Emerson [69] used as controls young women who visited a clinic for a pregnancy test, but whose test result came back negative, thereby ensuring that the controls were also sexually active. The use in the Section titled 'Effects on Criminal Violence of Laws Limiting Access to Handguns' of controls who had felony arrests without convictions may also reduce hidden bias.

Biases of known direction. In some settings, the direction of unobserved biases is quite clear even if their magnitude is not, and in certain special circumstances, a treatment effect that overcomes a bias working against it may yield a relatively unambiguous conclusion. For
instance, in the Section titled 'Effects on Criminal Violence of Laws Limiting Access to Handguns', one expects that the group of convicted felons denied handguns contains fewer innocent individuals than does the arrested-but-not-convicted group who were permitted to purchase handguns. Nonetheless, Wright et al. [68] found fewer subsequent arrests for gun and violent offenses among the convicted felons, suggesting that the denial of handguns may have had an effect large enough to overcome a bias working in the opposite direction. Similarly, it is often claimed that payments from disability insurance provided by US Social Security deter recipients from returning to work by providing a financial disincentive. Bound [6] examined this claim by comparing disability recipients to rejected applicants, where the rejection was based on an administrative judgment that the injury or disability was not sufficiently severe. Here, too, the direction of bias seems clear: rejected applicants should be healthier. However, Bound found that even among the rejected applicants, relatively few returned to work, suggesting that even fewer of the recipients would return to work without insurance. Some general theory about studies that exploit biases of known direction is given in Section 6.5 of [49].

An abrupt start to intense treatments. In an experiment, the treated and control conditions are markedly distinct, and these conditions become active at a specific known time. Lehman et al.'s [30] study of the psychological effects of bereavement in the Section titled 'Long-term Psychological Effects of the Death of a Close Relative' resembles an experiment in this sense. The study concerned the effects of the sudden loss of a spouse or a child in a car crash. In contrast, the loss of a distant relative or the gradual loss of a parent to chronic disease might possibly have effects that are smaller, more ambiguous, and more difficult to discern. In a general discussion of studies of stress and depression, Kessler [28] makes this point clearly: ". . . a major problem in interpret[ation] . . . is that both chronic role-related stresses and the chronic depression by definition have occurred for so long that deciding unambiguously which came first is difficult . . . The researcher, however, may focus on stresses that can
be assumed to have occurred randomly with respect to other risk factors of depression and to be inescapable, in which case matched comparison can be used to make causal inferences about long-term stress effects. A good example is the matched comparison of the parents of children having cancer, diabetes, or some other serious childhood physical disorder with the parents of healthy children. Disorders of this sort are quite common and occur, in most cases, for reasons that are unrelated to other risk factors for parental psychiatric disorder. The small amount of research shows that these childhood physical disorders have significant psychiatric effects on the family.” (p. 197)
Additional structural features in quasi-experiments intended to provide information about hidden biases. The term quasi-experiment is often used to suggest a design in which certain structural features are added in an effort to provide information about hidden biases; see Section titled ‘Elaborate Theories’ for detailed discussion. In the Section titled ‘Effects on Children of Occupational Exposures to Lead’, for instance, data were collected for control children whose parents were not exposed to lead, together with data about the level of lead exposure and the hygiene of parents exposed to lead. An actual effect of lead should produce a quite specific pattern of associations: more lead in the blood of exposed children, more lead when the level of exposure is higher, more lead when the hygiene is poor. In general terms, Cook et al. [14] write that: “. . . the warrant for causal inferences from quasi-experiments rests [on] structural elements of design other than random assignment–pretests, comparison groups, the way treatments are scheduled across groups . . . – [which] provide the best way of ruling out threats to internal validity . . . [C]onclusions are more plausible if they are based on evidence that corroborates numerous, complex, or numerically precise predictions drawn from a descriptive causal hypothesis.” (pp. 570-1)
Randomization will produce treated and control groups that were comparable prior to treatment, and it will do this mechanically, with no understanding of the context in which the study is being conducted. When randomization is not used, an understanding of the context becomes much more important.
Context is important whether one is trying to identify what covariates to measure, or to locate settings that afford haphazard treatment assignments or subpopulations with reduced selection biases, or to determine the direction of hidden biases. Ethnographic and other qualitative studies (e.g., [3, 21]) may provide familiarity with context needed in planning an observational study, and moreover qualitative methods may be integrated with quantitative studies [55]. Because even the most carefully designed observational study will have weaknesses and ambiguities, a single observational study is often not decisive, and replication is often necessary. In replicating an observational study, one should seek to replicate the actual treatment effects, if any, without replicating any biases that may have affected the original study. Some strategies for doing this are discussed in [48].
Adjusting for Biases Visible in Observed Covariates

Matched Sampling

Selecting from a Reservoir of Potential Controls. Among methods of adjustment for overt biases, the most direct and intuitive is matching, which compares each treated individual to one or more controls who appear comparable in terms of observed covariates. Matched sampling is most common when a small treated group is available together with a large reservoir of potential controls [57]. The structure of the study of bereavement by Lehman et al. [30] in the Section titled 'Long-term Psychological Effects of the Death of a Close Relative' is typical. There were 80 bereaved spouses and parents and 7581 potential controls, from whom 80 matched controls were selected. Routine administrative records were used to identify and match bereaved and control subjects, but additional information was needed from matched subjects for research purposes, namely, psychiatric outcomes. It is neither practical nor important to obtain psychiatric outcomes for all 7581 potential controls, and instead, matching selected 80 controls who appear comparable to treated subjects. Most commonly, as in both Lehman et al.'s [30] study of bereavement in the Section titled 'Long-term Psychological Effects of the Death of a Close
Relative' and Morton et al.'s [39] study of lead exposure in the Section titled 'Effects on Children of Occupational Exposures to Lead', each treated subject is matched to exactly one control, but other matching structures may yield either greater bias reduction or estimates with smaller standard errors or both. In particular, if the reservoir of potential controls is large, and if obtaining data from controls is not prohibitively expensive, then the standard errors of estimated treatment effects can be substantially reduced by matching each treated subject to several controls [62]. When several controls are used, substantially greater bias reduction is possible if the number of controls is not constant, instead varying from one treated subject to another [37].

Multivariate Matching Using Propensity Scores. In matching, the first impulse is to try to match each treated subject to a control who appears nearly the same in terms of observed covariates; however, this is quickly seen to be impractical when there are many covariates. For instance, with 20 binary covariates, there are 2^20, or about a million, types of individuals, so even with thousands of potential controls, it will often be difficult to find a control who matches a treated subject on all 20 covariates. Randomization produces covariate balance, not perfect matches. Perfect matches are not needed to balance observed covariates. Multivariate matching methods attempt to produce matched pairs or sets that balance observed covariates, so that, in aggregate, the distributions of observed covariates are similar in treated and control groups. Of course, unlike randomization, matching cannot be expected to balance unobserved covariates. The propensity score is the conditional probability (see Probability: An Introduction) of receiving the treatment rather than the control given the observed covariates [52]. Typically, the propensity score is unknown and must be estimated, for instance, using logistic regression [19] of the binary treatment/control indicator on the observed covariates. The propensity score is defined in terms of the observed covariates, even when there are concerns about hidden biases due to unobserved covariates, so estimating the propensity score is straightforward because the needed data are available. For nontechnical surveys of methods using propensity scores, see [7, 27], and see [33] for discussion of propensity scores for doses of treatment.
Matching on one variable, the propensity score, tends to balance all of the observed covariates, even though matched individuals will typically differ on many observed covariates. As an alternative, matching on the propensity score and one or two other key covariates will also tend to balance all of the observed covariates. If it suffices to adjust for the observed covariates – that is, if there is no hidden bias due to unobserved covariates – then it also suffices to adjust for the propensity score alone. These results are Theorems 1 through 4 of [52]. A study of the psychological effects of prenatal exposures to barbiturates balanced 20 observed covariates by matching on an estimated propensity score and sex [54]. One can and should check to confirm that the propensity score has done its job. That is, one should check that, after matching, the distributions of observed covariates are similar in treated and control groups; see [53, 54] for examples of this simple process. Because theory says that a correctly estimated propensity score should balance observed covariates, this check on covariate balance is also a check on the model used to estimate the propensity score. If some covariates are not balanced, consider adding to the logit model interactions or quadratics involving these covariates; then check covariate balance with the new propensity score. Bergstralh, Kosanke, and Jacobsen [4] provide SAS software for an optimal matching algorithm.
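To make the mechanics concrete, here is a hedged Python sketch (with simulated data and covariates, not the software or studies cited above) that estimates a propensity score by logistic regression, forms 1–1 matches on the estimated score, and checks covariate balance with standardized mean differences. Greedy nearest-neighbor matching is used only for brevity; [4, 45] describe optimal matching, which this sketch does not implement.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical data: 100 treated subjects, 1000 potential controls, 5 observed covariates.
n_t, n_c, p = 100, 1000, 5
X = np.vstack([rng.normal(0.3, 1, (n_t, p)), rng.normal(0.0, 1, (n_c, p))])
z = np.r_[np.ones(n_t), np.zeros(n_c)]          # 1 = treated, 0 = control

# Estimate the propensity score by logistic regression of treatment on the covariates.
score = LogisticRegression(max_iter=1000).fit(X, z).predict_proba(X)[:, 1]

# Greedy 1-1 nearest-neighbor matching on the estimated score, without replacement.
treated = np.where(z == 1)[0]
controls = set(np.where(z == 0)[0])
pairs = []
for i in treated:
    j = min(controls, key=lambda c: abs(score[i] - score[c]))
    controls.remove(j)
    pairs.append((i, j))

# Balance check: standardized mean difference of each covariate, before and after matching.
def smd(a, b):
    return (a.mean(0) - b.mean(0)) / np.sqrt((a.var(0) + b.var(0)) / 2)

t_idx = np.array([i for i, _ in pairs])
c_idx = np.array([j for _, j in pairs])
print("SMD before matching:", np.round(smd(X[z == 1], X[z == 0]), 2))
print("SMD after matching: ", np.round(smd(X[t_idx], X[c_idx]), 2))
```

If some covariates remain imbalanced after matching, the text's advice applies directly: add interactions or quadratic terms for those covariates to the logistic model and repeat the balance check.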
Stratification

Stratification is an alternative to matching in which subjects are grouped rather than paired. Cochran [13] showed that five strata formed from a single continuous covariate can remove about 90% of the bias in that covariate. Strata that balance many covariates at once can often be obtained by forming five strata at the quintiles of an estimated propensity score. A study of coronary bypass surgery balanced 74 covariates using five strata formed from an estimated propensity score [53]. The optimal stratification – that is, the stratification that makes treated and control subjects as similar as possible within strata – is a type of matching called full matching in which a treated subject can be matched to several controls or a control can be matched to several treated subjects [45]. An optimal full matching, hence also an optimal stratification, can be determined using network optimization.

Model-based Adjustments

Unlike matched sampling and stratification, which compare treated subjects directly to actual controls who appear comparable in terms of observed covariates, model-based adjustments, such as covariance adjustment, use data on treated and control subjects without regard to their comparability, relying on a model, such as a linear regression model, to predict how subjects would have responded under treatments they did not receive. In a case study from labor economics, Dehejia and Wahba [20] compared the performance of model-based adjustments and matching, and Rubin [58, 60] compared performance using simulation. Rubin found that model-based adjustments yielded smaller standard errors than matching when the model is precisely correct, but model-based adjustments were less robust than matching when the model is wrong. Indeed, he found that if the model is substantially incorrect, model-based adjustments may not only fail to remove overt biases, they may even increase them, whereas matching and stratification are fairly consistent at reducing overt biases. Rubin found that the combined use of matching and model-based adjustments was both robust and efficient, and he recommended this strategy in practice.

Appraising Sensitivity to Hidden Bias

With care, matching, stratification, model-based adjustments, and combinations of these techniques may often be used to remove overt biases accurately recorded in the data at hand, that is, biases visible in imbalances in observed covariates. However, when observational studies are subjected to critical evaluation, a common concern is that the adjustments failed to control for some covariate that was not measured. In other words, the concern is that treated and control subjects were not comparable prior to treatment with respect to this unobserved covariate, and had this covariate been measured and controlled by adjustments, then the conclusions about treatment effects would have been different. This is not a concern in randomized experiments, because randomization balances both observed and unobserved covariates. In an observational study, a sensitivity analysis (see Sensitivity Analysis in Observational Studies) asks how such hidden biases of various magnitudes might alter the conclusions of
the study. Observational studies vary greatly in their sensitivity to hidden bias. Cornfield et al. [17] conducted the first formal sensitivity analysis in a discussion of the effects of cigarette smoking on health. The objection had been raised that smoking might not cause lung cancer, but rather that there might be a genetic predisposition both to smoke and to develop lung cancer, and that this, not an effect of smoking, was responsible for the association between smoking and lung cancer. Cornfield et al. [17] wrote:

. . . if cigarette smokers have 9 times the risk of nonsmokers for developing lung cancer, and this is not because cigarette smoke is a causal agent, but only because cigarette smokers produce hormone X, then the proportion of hormone X-producers among cigarette smokers must be at least 9 times greater than among nonsmokers. (p. 40)
Though straightforward to compute, their sensitivity analysis is an important step beyond the familiar fact that association does not imply causation. A sensitivity analysis is a specific statement about the magnitude of hidden bias that would need to be present to explain the associations actually observed. Weak associations in small studies can be explained away by very small biases, but only a very large bias can explain a strong association in a large study. A simple, general method of sensitivity analysis introduces a single sensitivity parameter Γ that measures the degree of departure from random assignment of treatments. Two subjects with the same observed covariates may differ in their odds of receiving the treatment by at most a factor of Γ. In an experiment, random assignment of treatments ensures that Γ = 1, so no sensitivity analysis is needed. In an observational study with Γ = 2, if two subjects were matched exactly for observed covariates, then one might be twice as likely as the other to receive the treatment because they differ in terms of a covariate not observed. Of course, in an observational study, Γ is unknown. A sensitivity analysis tries out several values of Γ to see how the conclusions might change. Would small departures from random assignment alter the conclusions? Or, as in the studies of smoking and lung cancer, would only very large departures from random assignment alter the conclusions? For each value of Γ, it is possible to place bounds on a statistical inference – perhaps for Γ = 3, the true P value is unknown, but must be between
0.0001 and 0.041. Analogous bounds may be computed for point estimates and confidence intervals. How large must Γ be before the conclusions of the study are qualitatively altered? If for Γ = 9, the P value for testing no effect is between 0.00001 and 0.02, then the results are highly insensitive to bias – only an enormous departure from random assignment of treatments could explain away the observed association between treatment and outcome. However, if for Γ = 1.1, the P value for testing no effect is between 0.01 and 0.3, then the study is extremely sensitive to hidden bias – a tiny bias could explain away the observed association. This method of sensitivity analysis is discussed in detail with many examples in Section 4 of [49] and the references given there. For instance, Morton et al.'s [39] study in the Section titled 'Effects on Children of Occupational Exposures to Lead' used Wilcoxon's signed-ranks test to compare blood lead levels of 33 exposed and 33 matched control children. The pattern of matched pair differences they observed would yield a P value less than 10^−5 in a randomized experiment. For Γ = 3, the range of possible P values is from about 10^−15 to 0.014, so a bias of this magnitude could not explain the higher lead levels among exposed children. In words, if Morton et al. [39] had failed to control by matching a variable strongly related to blood lead levels and three times more common among exposed children, this would not have been likely to produce a difference in lead levels as large as the one they observed. The upper bound on the P value is just about 0.05 when Γ = 4.35, so the study is quite insensitive to hidden bias, but not as insensitive as the studies of heavy smoking and lung cancer. For Γ = 5 and Γ = 6, the upper bounds on the P value are 0.07 and 0.12, respectively, so biases of this magnitude could explain the observed association. Sensitivity analyses for point estimates and confidence intervals for this example are in Section 4.3.4 and Section 4.3.5 of [49]. Several other methods of sensitivity analysis are discussed in [16, 24, 26, 31].
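The bounds just described can be approximated in a few lines. The sketch below implements the usual large-sample upper bound on the one-sided P value of Wilcoxon's signed rank statistic for a given Γ, in the spirit of the approach described in Section 4 of [49]; the matched-pair differences here are simulated, not the lead data of [39].

```python
import numpy as np
from scipy.stats import norm

def signed_rank_upper_p(diffs, gamma):
    """Approximate upper bound on the one-sided P value of Wilcoxon's signed
    rank test when the within-pair odds of treatment may differ by a factor
    of at most gamma (large-sample normal approximation)."""
    d = diffs[diffs != 0]
    ranks = np.argsort(np.argsort(np.abs(d))) + 1      # ranks of |differences| (no ties assumed)
    t_obs = ranks[d > 0].sum()                         # observed signed rank statistic
    p_plus = gamma / (1.0 + gamma)                     # most unfavorable probability of a positive sign
    mean = p_plus * ranks.sum()
    var = p_plus * (1 - p_plus) * (ranks ** 2).sum()
    return 1 - norm.cdf((t_obs - mean) / np.sqrt(var))

rng = np.random.default_rng(1)
diffs = rng.normal(0.8, 1.0, 33)    # simulated treated-minus-control differences for 33 pairs
for gamma in (1, 2, 3, 4):
    print(f"Gamma = {gamma}: upper bound on P value = {signed_rank_upper_p(diffs, gamma):.4f}")
```

At Γ = 1 the bound reproduces the ordinary normal approximation for a randomized experiment; as Γ grows, the upper bound on the P value rises until, at some Γ, it crosses 0.05 and the finding becomes sensitive to a bias of that magnitude.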
Elaborate Theories

Elaborate Theories and Pattern Specificity

What can be observed to provide evidence about hidden biases, that is, biases due to covariates that
were not observed? Cochran [12] summarizes the view of Sir Ronald Fisher, the inventor of the randomized experiment: About 20 years ago, when asked in a meeting what can be done in observational studies to clarify the step from association to causation, Sir Ronald Fisher replied: “Make your theories elaborate.” The reply puzzled me at first, since by Occam’s razor, the advice usually given is to make theories as simple as is consistent with known data. What Sir Ronald meant, as subsequent discussion showed, was that when constructing a causal hypothesis one should envisage as many different consequences of its truth as possible, and plan observational studies to discover whether each of these consequences is found to hold.
Similarly, Cook & Shadish [15] (1994, p. 565): “Successful prediction of a complex pattern of multivariate results often leaves few plausible alternative explanations. (p. 95)” Some patterns of response are scientifically plausible as treatment effects, but others are not [25], [65]. “[W]ith more pattern specificity,” writes Trochim [63], “it is generally less likely that plausible alternative explanations for the observed effect pattern will be forthcoming. (p. 580)”
Example of Reduced Sensitivity to Hidden Bias Due to Pattern Specificity

Morton et al.'s [39] study of lead exposures in the Section titled 'Effects on Children of Occupational Exposures to Lead' provides an illustration. Their elaborate theory predicted: (a) higher lead levels in the blood of exposed children than in matched control children, (b) higher lead levels in exposed children whose parents had higher lead exposure on the job, and (c) higher lead levels in exposed children whose parents practiced poorer hygiene upon leaving the factory. Since each of these predictions was consistent with observed data, to attribute the observed associations to hidden bias rather than an actual effect of lead exposure, one would need to postulate biases that could produce all three associations. In a formal sense, elaborate theories play two roles: (a) they can aid in detecting hidden biases [49], and (b) they can make a study less sensitive to hidden bias [50, 51]. In the Section titled 'Effects on Children of Occupational Exposures to Lead', if exposed children had lower lead levels than controls,
or if higher exposure predicted lower lead levels, or if poor hygiene predicted lower lead levels, then this would be difficult to explain as an effect caused by lead exposure, and would likely be understood as a consequence of some unmeasured bias, some way children who appeared comparable were in fact not comparable. Indeed, the pattern-specific comparison is less sensitive to hidden bias. In detail, suppose that the exposure levels are assigned numerical scores, 1 for a child whose father had either low exposure or good hygiene, 2 for a father with high exposure and poor hygiene, and 1.5 for the several intermediate situations. The sensitivity analysis discussed in the Section titled 'Appraising Sensitivity to Hidden Bias' used the signed rank test to compare lead levels of the 33 exposed children and their 33 matched controls, and it became sensitive to hidden bias at Γ = 4.35, because the upper bound on the P value for testing no effect had just reached 0.05. Instead, using the dose-signed-rank statistic [46, 50] to incorporate the predicted pattern, the comparison becomes sensitive at Γ = 4.75; that is, again, the upper bound on the P value for testing no effect is 0.05. In other words, some biases that would explain away the higher lead levels of exposed children are not large enough to explain away the pattern of associations predicted by the elaborate theory. To explain the entire pattern, noticeably larger biases would need to be present. A reduction in sensitivity to hidden bias can occur when a correct elaborate theory is strongly confirmed by the data, but an increase in sensitivity can occur if the pattern is contradicted [50]. It is possible to contrast competing design strategies in terms of their 'design sensitivity', that is, their ability to reduce sensitivity to hidden bias [51].
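As a purely illustrative sketch, one way to fold the dose scores into the same bounding argument is to weight each pair's rank of the absolute difference by its dose score, which is one reading of the statistics in [46, 50]; the function and data below are hypothetical and are not a transcription of the analysis reported there.

```python
import numpy as np
from scipy.stats import norm

def dose_signed_rank_upper_p(diffs, doses, gamma):
    """Illustrative upper bound on the one-sided P value for a dose-weighted
    signed rank statistic under a hidden bias of at most gamma."""
    d, w = np.asarray(diffs), np.asarray(doses)
    ranks = np.argsort(np.argsort(np.abs(d))) + 1
    scores = w * ranks                                  # dose score times rank of |difference|
    t_obs = scores[d > 0].sum()
    p_plus = gamma / (1.0 + gamma)
    mean = p_plus * scores.sum()
    var = p_plus * (1 - p_plus) * (scores ** 2).sum()
    return 1 - norm.cdf((t_obs - mean) / np.sqrt(var))

rng = np.random.default_rng(2)
doses = rng.choice([1.0, 1.5, 2.0], size=33)            # scores 1, 1.5, 2 as in the text
diffs = rng.normal(0.5, 1.0, 33) * doses                # simulated: larger effects at higher doses
print(dose_signed_rank_upper_p(diffs, doses, gamma=4.75))
```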
Common Forms of Pattern Specificity

There are several common forms of pattern specificity or elaborate theories [44, 49, 61].
Two control groups. Campbell [8] advocated selecting two control groups to systematically vary an unobserved covariate, that is, to select two different groups not exposed to the treatment, but known to differ in terms of certain unobserved covariates. For instance, Card and Krueger [11] examined the common claim among economists that increases in the
minimum wage cause many minimum wage earners to lose their jobs. They did this by looking at changes in employment at fast food restaurants – Burger Kings, Wendy's, KFCs, and Roy Rogers' – when New Jersey increased its minimum wage by about 20%, from $4.25 to $5.05 per hour, on 1 April 1992, comparing employment before and after the increase. In certain analyses, they compared New Jersey restaurants initially paying $4.25 to two control groups: (a) restaurants in the same chains across the Delaware River in Pennsylvania where the minimum wage had not increased, and (b) restaurants in the same chains in affluent sections of New Jersey where the starting wage was at least $5.00 before 1 April 1992. An actual effect of raising the minimum wage should have negligible effects on both control groups. In contrast, one anticipates differences between the two control groups if, say, Pennsylvania Burger Kings were poor controls for New Jersey Burger Kings, or if employment changes in relatively affluent sections of New Jersey are very different from those in less affluent sections. Card and Krueger found similar changes in employment in the two control groups, and similar results in their comparisons of the treated group with either control group. An algorithm for optimal pair matching with two control groups is illustrated with Card and Krueger's study in [32].

Coherence among several outcomes and/or several doses. Hill [25] emphasized the importance of a coherent pattern of associations and of dose-response relationships, and Weiss [66] further developed these ideas. Campbell [9] wrote: ". . . great inferential strength is added when each theoretical parameter is exemplified in two or more ways, each mode being as independent as possible of the other, as far as the theoretically irrelevant components are concerned" (p. 33). Webb [64] speaks of triangulation. The lead example in the Section titled 'Example of Reduced Sensitivity to Hidden Bias Due to Pattern Specificity' provides one illustration and Reynolds and West [42] provide another. Related statistical theory is in [46, 51] and the references given there.
Unaffected outcomes; ineffective treatments. An elaborate theory may predict that certain outcomes should not be affected by the treatment or certain treatments should not affect the outcome; see Section 6 of [49] and [67]. For instance, in a case-crossover study [34], Mittleman et al. [38] asked whether bouts of anger might cause myocardial infarctions or heart attacks, finding a moderately strong and highly significant association. Although there are reasons to think that bouts of anger might cause heart attacks, there are also reasons to doubt that bouts of curiosity cause heart attacks. Mittleman et al. found curiosity was not associated with myocardial infarction, writing: "the specificity observed for anger . . . as opposed to curiosity . . . argue against recall bias." McKillip [35] suggests that an unaffected or 'control' outcome might sometimes serve in place of a control group, and Legorreta et al. [29] illustrate this possibility in a study of changes in the demand for a type of surgery following a technological change that reduced cost and increased safety.
Summary

In the design of an observational study, an attempt is made to reconstruct some of the structure and strengths of an experiment. Analytical adjustments, such as matching, are used to control for overt biases, that is, pretreatment differences between treated and control groups that are visible in observed covariates. Analytical adjustments may fail because of hidden biases, that is, important covariates that were not measured and therefore not controlled by adjustments. Sensitivity analysis indicates the magnitude of hidden bias that would need to be present to alter the qualitative conclusions of the study. Observational studies vary markedly in their sensitivity to hidden bias; therefore, it is important to know whether a particular study is sensitive to small biases or insensitive to quite large biases. Hidden biases may leave visible traces in observed data, and a variety of tactics involving pattern specificity are aimed at distinguishing actual treatment effects from hidden biases. Pattern specificity may aid in detecting hidden bias or in reducing sensitivity to hidden bias.
Acknowledgment
This work was supported by grant SES-0345113 from the US National Science Foundation.

References

[1] Angrist, J.D. (2003). Randomized trials and quasi-experiments in education research, NBER Reporter, Summer, 11–14.
[2] Angrist, J.D. & Lavy, V. (1999). Using Maimonides' rule to estimate the effect of class size on scholastic achievement, Quarterly Journal of Economics 114, 533–575.
[3] Athens, L. (1997). Violent Criminal Acts and Actors Revisited, University of Illinois Press, Urbana.
[4] Bergstralh, E.J., Kosanke, J.L. & Jacobsen, S.L. (1996). Software for optimal matching in observational studies, Epidemiology 7, 331–332. http://www.mayo.edu/hsr/sasmac.html
[5] Boruch, R. (1997). Randomized Experiments for Planning and Evaluation, Sage Publications, Thousand Oaks.
[6] Bound, J. (1989). The health and earnings of rejected disability insurance applicants, American Economic Review 79, 482–503.
[7] Braitman, L.E. & Rosenbaum, P.R. (2002). Rare outcomes, common treatments: analytic strategies using propensity scores, Annals of Internal Medicine 137, 693–695.
[8] Campbell, D.T. (1969). Prospective: artifact and control, in Artifact in Behavioral Research, R. Rosenthal & R. Rosnow, eds, Academic Press, New York.
[9] Campbell, D.T. (1988). Methodology and Epistemology for Social Science: Selected Papers, University of Chicago Press, Chicago, pp. 315–333.
[10] Campbell, D.T. & Boruch, R.F. (1975). Making the case for randomized assignment to treatments by considering the alternatives: six ways in which quasi-experimental evaluations in compensatory education tend to underestimate effects, in Evaluation and Experiment, C.A. Bennett & A.A. Lumsdaine, eds, Academic Press, New York, pp. 195–296.
[11] Card, D. & Krueger, A. (1994). Minimum wages and employment: a case study of the fast-food industry in New Jersey and Pennsylvania, American Economic Review 84, 772–793.
[12] Cochran, W.G. (1965). The planning of observational studies of human populations (with discussion), Journal of the Royal Statistical Society, Series A 128, 134–155.
[13] Cochran, W.G. (1968). The effectiveness of adjustment by subclassification in removing bias in observational studies, Biometrics 24, 295–313.
[14] Cook, T.D., Campbell, D.T. & Peracchio, L. (1990). Quasi-experimentation, in Handbook of Industrial and Organizational Psychology, M. Dunnette & L. Hough, eds, Consulting Psychologists Press, Palo Alto, pp. 491–576.
[15] Cook, T.D. & Shadish, W.R. (1994). Social experiments: some developments over the past fifteen years, Annual Review of Psychology 45, 545–580.
[16] Copas, J.B. & Li, H.G. (1997). Inference for non-random samples (with discussion), Journal of the Royal Statistical Society, Series B 59, 55–96.
[17] Cornfield, J., Haenszel, W., Hammond, E., Lilienfeld, A., Shimkin, M. & Wynder, E. (1959). Smoking and lung cancer: recent evidence and a discussion of some questions, Journal of the National Cancer Institute 22, 173–203.
[18] Cox, D.R. & Reid, N. (2000). The Theory of the Design of Experiments, Chapman & Hall/CRC, New York.
[19] Cox, D.R. & Snell, E.J. (1989). Analysis of Binary Data, Chapman & Hall, London.
[20] Dehejia, R.H. & Wahba, S. (1999). Causal effects in nonexperimental studies: reevaluating the evaluation of training programs, Journal of the American Statistical Association 94, 1053–1062.
[21] Estroff, S.E. (1985). Making it Crazy: An Ethnography of Psychiatric Clients in an American Community, University of California Press, Berkeley.
[22] Fisher, R.A. (1935). The Design of Experiments, Oliver & Boyd, Edinburgh.
[23] Friedman, L.M., DeMets, D.L. & Furberg, C.D. (1998). Fundamentals of Clinical Trials, Springer-Verlag, New York.
[24] Gastwirth, J.L. (1992). Methods for assessing the sensitivity of statistical comparisons used in Title VII cases to omitted variables, Jurimetrics 33, 19–34.
[25] Hill, A.B. (1965). The environment and disease: association or causation? Proceedings of the Royal Society of Medicine 58, 295–300.
[26] Imbens, G.W. (2003). Sensitivity to exogeneity assumptions in program evaluation, American Economic Review 93, 126–132.
[27] Joffe, M.M. & Rosenbaum, P.R. (1999). Propensity scores, American Journal of Epidemiology 150, 327–333.
[28] Kessler, R.C. (1997). The effects of stressful life events on depression, Annual Review of Psychology 48, 191–214.
[29] Legorreta, A.P., Silber, J.H., Costantino, G.N., Kobylinski, R.W. & Zatz, S.L. (1993). Increased cholecystectomy rate after the introduction of laparoscopic cholecystectomy, Journal of the American Medical Association 270, 1429–1432.
[30] Lehman, D., Wortman, C. & Williams, A. (1987). Long-term effects of losing a spouse or a child in a motor vehicle crash, Journal of Personality and Social Psychology 52, 218–231.
[31] Lin, D.Y., Psaty, B.M. & Kronmal, R.A. (1998). Assessing the sensitivity of regression results to unmeasured confounders in observational studies, Biometrics 54, 948–963.
[32] Lu, B. & Rosenbaum, P.R. (2004). Optimal pair matching with two control groups, Journal of Computational and Graphical Statistics 13, 422–434.
[33] Lu, B., Zanutto, E., Hornik, R. & Rosenbaum, P.R. (2001). Matching with doses in an observational study of a media campaign against drug abuse, Journal of the American Statistical Association 96, 1245–1253.
[34] Maclure, M. & Mittleman, M.A. (2000). Should we use a case-crossover design? Annual Review of Public Health 21, 193–221.
[35] McKillip, J. (1992). Research without control groups: a control construct design, in Methodological Issues in Applied Social Psychology, F.B. Bryant, J. Edwards & R.S. Tindale, eds, Plenum Press, New York, pp. 159–175.
[36] Meyer, B.D. (1995). Natural and quasi-experiments in economics, Journal of Business and Economic Statistics 13, 151–161.
[37] Ming, K. & Rosenbaum, P.R. (2000). Substantial gains in bias reduction from matching with a variable number of controls, Biometrics 56, 118–124.
[38] Mittleman, M.A., Maclure, M., Sherwood, J.B., Mulry, R.P., Tofler, G.H., Jacobs, S.C., Friedman, R., Benson, H. & Muller, J.E. (1995). Triggering of acute myocardial infarction onset by episodes of anger, Circulation 92, 1720–1725.
[39] Morton, D., Saah, A., Silberg, S., Owens, W., Roberts, M. & Saah, M. (1982). Lead absorption in children of employees in a lead-related industry, American Journal of Epidemiology 115, 549–555.
[40] Neyman, J. (1923). On the application of probability theory to agricultural experiments: essay on principles, Section 9 (in Polish), Roczniki Nauk Rolniczych, Tom X, 1–51. Reprinted in English with discussion in Statistical Science (1990) 5, 463–480.
[41] Oreopoulos, P. (2003). The long-run consequences of living in a poor neighborhood, Quarterly Journal of Economics 118, 1533–1575.
[42] Reynolds, K.D. & West, S.G. (1987). A multiplist strategy for strengthening nonequivalent control group designs, Evaluation Review 11, 691–714.
[43] Rosenbaum, P.R. (1984a). The consequences of adjustment for a concomitant variable that has been affected by the treatment, Journal of the Royal Statistical Society, Series A 147, 656–666.
[44] Rosenbaum, P.R. (1984b). From association to causation in observational studies, Journal of the American Statistical Association 79, 41–48.
[45] Rosenbaum, P.R. (1991). A characterization of optimal designs for observational studies, Journal of the Royal Statistical Society, Series B 53, 597–610.
[46] Rosenbaum, P.R. (1997). Signed rank statistics for coherent predictions, Biometrics 53, 556–566.
[47] Rosenbaum, P.R. (1999). Choice as an alternative to control in observational studies (with discussion), Statistical Science 14, 259–304.
[48] Rosenbaum, P.R. (2001). Replicating effects and biases, American Statistician 55, 223–227.
[49] Rosenbaum, P.R. (2002). Observational Studies, 2nd Edition, Springer-Verlag, New York.
[50] Rosenbaum, P.R. (2003). Does a dose-response relationship reduce sensitivity to hidden bias? Biostatistics 4, 1–10.
[51] Rosenbaum, P.R. (2004). Design sensitivity in observational studies, Biometrika 91, 153–164.
[52] Rosenbaum, P.R. & Rubin, D.B. (1983). The central role of the propensity score in observational studies for causal effects, Biometrika 70, 41–55.
[53] Rosenbaum, P. & Rubin, D. (1984). Reducing bias in observational studies using subclassification on the propensity score, Journal of the American Statistical Association 79, 516–524.
[54] Rosenbaum, P. & Rubin, D. (1985). Constructing a control group using multivariate matched sampling methods that incorporate the propensity score, American Statistician 39, 33–38.
[55] Rosenbaum, P.R. & Silber, J.H. (2001). Matching and thick description in an observational study of mortality after surgery, Biostatistics 2, 217–232.
[56] Rosenzweig, M.R. & Wolpin, K.I. (2000). Natural "natural experiments" in economics, Journal of Economic Literature 38, 827–874.
[57] Rubin, D. (1973a). Matching to remove bias in observational studies, Biometrics 29, 159–183. Correction: (1974) 30, 728.
[58] Rubin, D.B. (1973b). The use of matched sampling and regression adjustment to remove bias in observational studies, Biometrics 29, 185–203.
[59] Rubin, D.B. (1974). Estimating causal effects of treatments in randomized and nonrandomized studies, Journal of Educational Psychology 66, 688–701.
[60] Rubin, D.B. (1979). Using multivariate matched sampling and regression adjustment to control bias in observational studies, Journal of the American Statistical Association 74, 318–328.
[61] Shadish, W.R., Cook, T.D. & Campbell, D.T. (2002). Experimental and Quasi-Experimental Designs for Generalized Causal Inference, Houghton-Mifflin, Boston.
[62] Smith, H.L. (1997). Matching with multiple controls to estimate treatment effects in observational studies, Sociological Methodology 27, 325–353.
[63] Trochim, W.M.K. (1985). Pattern matching, validity and conceptualization in program evaluation, Evaluation Review 9, 575–604.
[64] Webb, E.J. (1966). Unconventionality, triangulation and inference, Proceedings of the Invitational Conference on Testing Problems, Educational Testing Service, Princeton, pp. 34–43.
[65] Weed, D.L. & Hursting, S.D. (1998). Biologic plausibility in causal inference: current method and practice, American Journal of Epidemiology 147, 415–425.
[66] Weiss, N. (1981). Inferring causal relationships: elaboration of the criterion of 'dose-response', American Journal of Epidemiology 113, 487–490.
[67] Weiss, N.S. (2002). Can the "specificity" of an association be rehabilitated as a basis for supporting a causal hypothesis? Epidemiology 13, 6–8.
[68] Wright, M.A., Wintemute, G.J. & Rivara, F.P. (1999). Effectiveness of denial of handgun purchase to persons believed to be at high risk for firearm violence, American Journal of Public Health 89, 88–90.
[69] Zabin, L.S., Hirsch, M.B. & Emerson, M.R. (1989). When urban adolescents choose abortion: effects on education, psychological status, and subsequent pregnancy, Family Planning Perspectives 21, 248–255.
PAUL R. ROSENBAUM
Odds and Odds Ratios
TAMÁS RUDAS
Volume 3, pp. 1462–1467 in Encyclopedia of Statistics in Behavioral Science
ISBN-13: 978-0-470-86080-9  ISBN-10: 0-470-86080-4
Editors: Brian S. Everitt & David C. Howell
John Wiley & Sons, Ltd, Chichester, 2005
Odds

Odds are related to possible observations (events) associated with an experiment or observational study. The value of the odds tells how many times one event is more likely to occur than another one. The value of the odds is obtained as the ratio of the probabilities of the two events. These probabilities may be the true or estimated probabilities, where the estimated probabilities may be taken from available data or may be based on a statistical model. For example, in a study on unemployment it was found that 40% of the respondents were out of the labor force, 50% were employed, and 10% were unemployed. Then, the value of the odds of being unemployed versus employed in the data is 0.1/0.5 = 0.2. Notice that the same odds could be obtained by using the respective frequencies instead of the probabilities. Odds are only defined for events that are disjoint, that is, events that cannot occur at the same time. For example, the ratio of the probabilities of being unemployed and of not working (that is, being out of the labor force or being unemployed), 0.1/(0.4 + 0.1) = 0.2, is not the value of odds, rather that of a conditional probability (see Probability: An Introduction). However, odds may be used to specify a distribution just as well as probabilities. Saying that the value of the odds of being unemployed versus employed is 0.2 and that of being employed versus being out of the labor force is 1.25 completely specifies the distribution of the respondents in the three categories. Of particular interest is the value of the odds of an event versus all other possibilities, that is, versus its complement, sometimes called the odds of the event. The odds of being out of the labor force in the above example is 0.4/0.6 = 2/3. If p is the probability of an event, the value of its odds is p/(1 − p). The (natural) logarithm of this quantity, log(p/(1 − p)), is the logit of the event. In logistic regression, the effects of certain explanatory variables on a binary response are investigated by approximating the logit of the variable with a function of the explanatory variables (see [1]). The odds are relatively simple parameters of a distribution, and their statistical analysis is usually confined to constructing confidence intervals for the
true value of the odds and to testing hypotheses that the value of the odds is equal to a specified value. To obtain an asymptotic confidence interval for the true value of the odds, one may use the asymptotic variance of the estimated value of the odds. All these statistical tasks can be transformed to involve the logarithm of the odds. If the estimated probabilities of the events are p and q, then the logarithm of the odds is estimated as t = log(p/q), and its asymptotic (approximate) variance is s² = 1/(np) + 1/(nq), where n is the sample size. Therefore, an approximate 95% confidence interval for the true value of the logarithm of the odds is (t − 2s, t + 2s). The hypothesis that the logarithm of the odds is equal to a specified value, say, u (against the alternative that it is not equal to u), is rejected at the 95% level if u lies outside of this confidence interval. The value u = 0 means that the two events have the same probability, that is, p = q. In a two-way contingency table, the odds related to events associated with one variable may be computed within the different categories of the other variable. If, for example, the employment variable is cross-classified by gender, as shown in Table 1, one obtains a 2 × 3 table. Here P(1, 2)/P(1, 1) is the value of the conditional odds of being unemployed versus employed for men. The value of the same conditional odds for women is P(2, 2)/P(2, 1). The same odds, computed for everyone, that is, P(+, 2)/P(+, 1), are called the marginal odds here.
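As a numerical illustration of these formulas, the short Python sketch below computes t = log(p/q), the approximate standard error s, and the interval (t − 2s, t + 2s) for the unemployment example; the sample size n is hypothetical, since the text does not state one.

```python
import math

n = 1000           # hypothetical sample size; not stated in the text
p, q = 0.10, 0.50  # estimated probabilities of 'unemployed' and 'employed'

t = math.log(p / q)                       # estimated log odds, log(0.2)
s = math.sqrt(1 / (n * p) + 1 / (n * q))  # approximate standard error
ci = (t - 2 * s, t + 2 * s)               # approximate 95% confidence interval

print(f"log odds = {t:.3f}, s = {s:.3f}, 95% CI = ({ci[0]:.3f}, {ci[1]:.3f})")
# The hypothesis 'log odds = u' is rejected at the 95% level when u falls outside this interval.
```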
Odds Ratios for 2 × 2 Contingency Tables

Ratios of conditional odds are called odds ratios. In a 2 × 2 table (like the first two columns of Table 1 that cross-classify the employment status of those in the labor force with gender), the value of the odds ratio is (P(1, 1)/P(1, 2))/(P(2, 1)/P(2, 2)). This is the ratio of the conditional odds of the column variable, conditioned on the categories of the row variable.
Table 1   Cross-classification of sex and employment status

          Employed    Unemployed    Out of the labor force    Total
Men       P(1, 1)     P(1, 2)       P(1, 3)                   P(1, +)
Women     P(2, 1)     P(2, 2)       P(2, 3)                   P(2, +)
Total     P(+, 1)     P(+, 2)       P(+, 3)                   P(+, +)
If the ratio of the conditional odds of the row variable, conditioned on the categories of the column variable, is considered, one obtains (P(1, 1)/P(2, 1))/(P(1, 2)/P(2, 2)), and this has the same value as the previous one. The common value, α = (P(1, 1)P(2, 2))/(P(1, 2)P(2, 1)), the odds ratio, is also called the cross product ratio. When the order of the categories of either one of the variables is changed, the odds ratio changes to its reciprocal. When the order of the categories of both variables is changed, α remains unchanged. When the two variables in the table are independent, that is, when the rows and the columns are proportional, the conditional odds of either one of the variables, given the categories of the other one, are equal to each other, and the value of the odds ratio is 1. Therefore, a test of the hypothesis that log α = 0 is a test of independence. A test of this hypothesis, and a test of any hypothesis assuming a fixed value of the odds ratio, may be based on the asymptotic (approximate) variance of the estimated odds ratio, s² = (1/n)(1/P(1, 1) + 1/P(1, 2) + 1/P(2, 1) + 1/P(2, 2)) (see [3]). The more the conditional odds of one variable, given the different categories of the other variable, differ, the further away from 1 is the value of the odds ratio. Therefore, the odds ratio is a measure of how different the conditional odds are from each other, and it may be used as a measure of the strength of association between the two variables forming the table.
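The same calculation can be sketched for the odds ratio, using a hypothetical 2 × 2 table of counts; since the probabilities P(i, j) are estimated by cell proportions, the term (1/n)(1/P(i, j)) reduces to the reciprocal of the corresponding cell count.

```python
import math

# Hypothetical 2 x 2 table of counts: rows are the categories of one variable,
# columns the categories of the other.
a, b = 30, 10
c, d = 20, 40

alpha = (a * d) / (b * c)                     # cross product ratio
log_alpha = math.log(alpha)
s = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # approximate SE of log(alpha)
ci = (log_alpha - 2 * s, log_alpha + 2 * s)

print(f"alpha = {alpha:.2f}, log alpha = {log_alpha:.2f}, s = {s:.2f}")
print(f"approximate 95% CI for log alpha = ({ci[0]:.2f}, {ci[1]:.2f})")
# Independence (log alpha = 0) is rejected at the 95% level if 0 lies outside the interval.
```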
The Odds Ratio as a Measure of Association

Because changing the order of the categories of one variable obviously does not affect the strength of association (see Measures of Association) but changes the value of the odds ratio to its reciprocal, the values α and 1/α refer to the same strength of association. With ordinal variables or with categorical variables having the same categories, the categories of the two variables have a fixed order with respect to each other (both in the same order), and in such cases α and 1/α mean different directions of association (positive and negative) of the same strength. In cases when the order of the categories of one variable does not determine the order of the categories of the other variable, the association does not have a direction. For further discussion of the direction of association, see [6].
An interpretation of the odds ratio, as a measure of the strength of association between two categorical variables, is that it measures how helpful it is to know in which category of one of the variables an observation is in order to guess in which category of the other variable it is. If in a 2 × 2 table P(1, 1)/P(1, 2) = P(2, 1)/P(2, 2), then knowing that the observation is in the first or the second row does not help in guessing in which column it is. In this case, α = 1. If, however, say, P(1, 1)/P(1, 2) is bigger than P(2, 1)/P(2, 2), then a larger fraction of those in the first row will be in the first column and only a smaller fraction of those in the second row. In this case, α has a large value. If any of the entries in a 2 × 2 table is equal to 0, the value of the odds ratio cannot be computed. One may think that if P(1, 1) = 0, α may be computed and its value is 0, but by changing the order of categories one would have to divide by zero in the computation of α, which is impossible. When working with odds ratios as a measure of the strength of association (or with any measure of association, for that matter), it should be kept in mind that having a large observed value of the odds ratio and having a statistically significant one are very different things. Tables 2 and 3 show two different cross-classifications of 40 observations.

Table 2  Cross-classification of 40 observations with α = 1.53

    23     1
    15     1

Table 3  Another cross-classification of 40 observations with α = 1.31

    11    12
     7    10

In Table 2, α = 1.53, log α = 0.427 and its approximate standard deviation is 0.230, therefore α is not significantly different from 1 at the 95% level. On the other hand, in Table 3, α = 1.31, log α = 0.270 and its approximate standard deviation is 0.102. Therefore, in Table 3 the observed (estimated) odds ratio is significantly different from 1, at the 95% level. The larger observed value did not prove significant, while the smaller one did, at the same level and using the same sample size. While some users may find this fact counterintuitive, it is a reflection of the different distributions seen in Tables 2 and 3. The observed distribution is much more uniform in Table 3 than in Table 2 and therefore one may consider it a more reliable estimate of the underlying population distribution than the one in Table 2. Consequently, the estimate of the odds ratio derived from Table 3 is more reliable than the estimate derived from Table 2 and thus one has a smaller variance of the estimate in Table 3 than in Table 2.
Variation Independence of the Odds Ratio and the Marginal Distributions

One particularly attractive property of the odds ratio as a measure of association is its variation independence from the marginal distributions of the table. This means that any pair of marginal distributions may be combined with any value of the odds ratio, and it implies that the odds ratio measures an aspect of association that is not influenced by the marginal distributions. In fact, one can say that the odds ratio represents the information that is contained in the joint distribution in addition to the marginal distributions (see [6]). Variation independence is also important for the interpretation of the value of the odds ratio as a measure of the strength of association: it makes the values of the odds ratio comparable across tables with different marginal distributions. The difficulties associated with measures that are not variationally independent from the marginals are illustrated using the data in Tables 4 and 5. One possible measure of association is β = P(1, 1)/P(+, 1) − P(1, 2)/P(+, 2), that is, the difference between the conditional probabilities of the first row category (e.g., success) in the two categories of the column variable (e.g., treatment and control). The value of β is 0.1 for both sets of data. In spite of this, one would not think that the association has the same strength (e.g., that being in treatment or control has the same effect on success) in the two tables because, given the marginals, for Table 4 the maximum value of β is 0.3 and for Table 5 the maximum value is 0.9. In Table 4, β is one third of its possible maximum, while in Table 5 it is one ninth of its maximum value. The measure β is not variationally independent from the marginals and therefore lacks calibration.
Table 4  A 2 × 2 table with β = 0.1

    10     5
    40    45

Table 5  Another 2 × 2 table with β = 0.1

    25    20
    25    30
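A small sketch of this calibration problem, computing β and its maximum value given the margins for the data in Tables 4 and 5 (the maximum is found here by a simple direct search over tables with the same margins; this is my own illustration, not a formula from the text):

```python
import numpy as np

def beta(t):
    # difference between the conditional probabilities of the first row category
    # in the two categories of the column variable
    t = np.asarray(t, dtype=float)
    return t[0, 0] / t[:, 0].sum() - t[0, 1] / t[:, 1].sum()

def max_beta_given_margins(t):
    # largest beta attainable with the same row and column totals
    t = np.asarray(t, dtype=float)
    r1, c1, c2 = t[0].sum(), t[:, 0].sum(), t[:, 1].sum()
    best = -1.0
    for a in np.arange(0, min(r1, c1) + 1):   # a = count placed in cell (1, 1)
        b = r1 - a                            # remaining row-1 count goes to cell (1, 2)
        if 0 <= b <= c2:
            best = max(best, a / c1 - b / c2)
    return best

table4 = [[10, 5], [40, 45]]
table5 = [[25, 20], [25, 30]]
for t in (table4, table5):
    print(beta(t), max_beta_given_margins(t))   # beta = 0.1 in both; maxima 0.3 and 0.9
```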
I × J Tables

The concept of odds ratio as a measure of the strength of association may be extended beyond 2 × 2 tables. For I × J tables, one possibility is to consider local odds ratios. These are obtained by selecting two adjacent rows and two adjacent columns of the table. Their intersection is a local 2 × 2 subtable of the contingency table and the odds ratio computed here is a local odds ratio. If the two variables forming the table are independent, all the local odds ratios are equal to 1. The number of different possible local odds ratios is (I − 1)(J − 1) and they may be naturally arranged into an (I − 1) × (J − 1) table. In the (k, l) position of this table, one has the odds ratio (P(k, l)P(k + 1, l + 1))/(P(k, l + 1)P(k + 1, l)). Under independence, the table of local odds ratios contains 1s in all positions. For another generalization of the concept of odds ratio to I × J tables, see [6]. In practice, the model of independence often proves too restrictive. In fact, the model of independence specifies, out of the total number of parameters, IJ − 1, the values of (I − 1)(J − 1) (the local odds ratios) and leaves only I + J − 2 parameters (the marginal distributions of the two variables) to estimate. In a 6 × 8 table, for example, this means fixing 35 parameters and leaving 12 free. Therefore, models for two-way tables that are less restrictive than the model of independence and still postulate some kind of a simple structure are of interest. A class of such models may be obtained by restricting the structure of the (I − 1) × (J − 1) table of local odds ratios. Such models are called I × J association models (see [4]). The uniform association model (U-model) assumes that the local odds ratios have a common (unspecified) value. If that common value happens to be 1,
one has independence. Under the U-model, the table of local odds ratios is uniform (all of its entries are equal), but the original table may actually show quite strong association. For example, Table 6 illustrates a distribution with common local odds ratios (all equal to 3).

Table 6  A distribution with all local odds ratios being equal to 3

    10     30      40
    20    180     720
    20    540    6480
    30   2430   87480

The row association model (R-model) assumes that the structure of the table of local odds ratios is such that there is a row effect in the sense that the local odds ratios are constant within the rows but are different in different rows. The column association model (C-model) is defined similarly. There are two models available that assume the presence of both row and column effects on the local odds ratios. The R + C-model assumes that these effects are additive, the RC-model assumes that they are multiplicative on a logarithmic scale (see [4]). All these models are in between assuming the very restrictive model of independence and not assuming any structure at all.
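A small sketch of how the local odds ratios of an I × J table can be computed, using the counts of Table 6 to check that they are all equal to 3:

```python
import numpy as np

def local_odds_ratios(table):
    """(I-1) x (J-1) table of local odds ratios of an I x J table of counts/probabilities."""
    t = np.asarray(table, dtype=float)
    return (t[:-1, :-1] * t[1:, 1:]) / (t[:-1, 1:] * t[1:, :-1])

# Table 6: all local odds ratios equal 3
table6 = [[10, 30, 40],
          [20, 180, 720],
          [20, 540, 6480],
          [30, 2430, 87480]]
print(local_odds_ratios(table6))   # a 3 x 2 array of 3s
```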
Conditional and Marginal Association Measured by the Odds Ratio

Table 7 presents a (hypothetical) cross-classification of respondents according to two binary variables, sex (male or female) and income (high or low).

Table 7  Hypothetical cross-classification according to sex and income

           High income   Low income
Men            300           200
Women          180           200

The odds ratio is equal to 1.67, indicating that the value of the
odds of men to have a high income versus a low one is greater than that of women. Table 8 presents a breakdown of the data according to the educational level of the respondent, also with two categories (high and low). Table 7 is a two-way marginal of Table 8 and the odds ratio computed from that table is the marginal odds ratio of gender and income. The odds ratios computed from the two conditional tables in Table 8, one obtained by conditioning on educational level being low and another one conditioned on educational level being high, are conditional odds ratios of gender and income, conditioned on educational level. The conditional odds ratios measure the strength of association between sex and income, separately for people with low and high levels of education. For a more detailed discussion of marginal and conditional aspects of association, see [2]. Many of us would tend to expect that if, for all people considered together, men have a certain advantage (odds ratio greater than 1), then if the respondents are divided into two groups (according to educational level in the example), men should also have an advantage (odds ratio greater than 1) in at least one of the groups. Table 8 illustrates that this is not necessarily true. The conditional odds ratios computed from Table 8 are 0.75 for people with a low educational level and 0.55 for people with a high educational level. That is, in both groups women have higher odds of having a high salary (versus a low one) than men do. A short (but imprecise) way to refer to this phenomenon is to say that the direction of the marginal association is opposite to the direction of the conditional associations. This is called Simpson's paradox (see Paradoxes). Notice that the reversal of the direction of the marginal association upon conditioning is not merely a mathematical possibility, it does occur in practice (see, for example, [5]). What is wrong with the expectation that if the marginal odds of men for higher income are higher than those of women, the same should be true for men and women with low and also for men and women with high levels of education (or at least in one of
Table 8  Breakdown of the data in Table 7 according to educational level

Low educational level
           High income   Low income
Men             80           160
Women          130           195

High educational level
           High income   Low income
Men            220            40
Women           50             5
these categories)? The higher odds of men would imply higher odds for men in both subcategories if the higher value of the odds were somehow an inherent property of men and not something that is a consequence of other factors. If there were a causal relationship between sex and income, that would have to manifest itself not only for all people together but also among people with low and high levels of education. However, with the present data, the higher value of the odds of men for higher incomes is a consequence of the different compositions of men and women according to their educational levels. The data show an advantage for women in both categories of educational level, and men appear to have an advantage when educational level is disregarded only because a larger fraction of men than of women has a high level of education, and higher education leads to higher income. The paradox (our wrong expectation) occurs because of the routine assumption of causal relationships in cases when there is only an association. When there is unobserved heterogeneity (like the different compositions of the male and female respondents according to educational level), the conditional odds ratios may have directions different from the marginal odds ratio. When the data are collected with the goal of establishing a causal relationship, random assignment into the different categories of the treatment variable is applied to reduce unobserved heterogeneity. Obviously, such a random assignment is not always possible, for example, one cannot assign people randomly into the categories of being a man or a woman.
Higher-order Odds Ratios

In Table 8, the conditional odds ratios of sex and income, given educational level, are different. If these two conditional odds ratios were the same, one would say that educational level has no effect on the strength of association between sex and income. Further, the more different the two conditional odds ratios are, the larger is the effect of education on the strength of association between sex and income. The ratio of the two conditional odds ratios, COR(G, I|E = low)/COR(G, I|E = high) = 0.75/0.55 = 1.36, is a measure of the strength of this effect. Similarly, the conditional odds ratios between educational level and income may be computed for men and women. The ratio of these two quantities is a measure of how strong the effect of sex is on the strength of the relationship between educational level and income. This is COR(E, I|G = men)/COR(E, I|G = women) = 0.091/0.067 = 1.36. Although it has little substantive meaning in the example, one could also define a measure of the strength of the effect of income on the strength of association between sex and educational level, as COR(G, E|I = high)/COR(G, E|I = low), and its value would also be 1.36.

It is not by pure coincidence that the three measures in the previous paragraph have the same value: this is true for all data sets, indicating that the common value does not measure a property related to any grouping of the variables. Rather, it measures association among all three variables and is called the second-order interaction. The second-order interaction is

[P(1, 1, 1)P(1, 2, 2)P(2, 1, 2)P(2, 2, 1)] / [P(2, 2, 2)P(2, 1, 1)P(1, 2, 1)P(1, 1, 2)],

where P(i, j, k) is the probability of an observation in the ith category of the first, jth category of the second, and kth category of the third variable. Values of the second-order odds ratio show how strong the effect of any of the variables is on the strength of association between the other two variables. When the second-order odds ratio is equal to 1, no variable has an effect on the strength of association between the other two variables. Similarly, in a four-way table, conditional second-order odds ratios may be defined and, as the common value of their ratios, one obtains third-order odds ratios. Higher-order odds ratios may be defined in the same way for tables of arbitrary dimension (see [6]).
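A brief sketch that reproduces these quantities from the counts in Table 8 (the coding of the categories below — men, low education, and high income as the first categories — is my own choice for illustration):

```python
import numpy as np

# n[g, e, i]: counts from Table 8, indexed as
# g: 0 = men, 1 = women; e: 0 = low, 1 = high education; i: 0 = high, 1 = low income
n = np.array([[[80, 160], [220, 40]],
              [[130, 195], [50, 5]]], dtype=float)

def cor_gi_given_e(e):
    """Conditional odds ratio of sex and income, given educational level e."""
    return (n[0, e, 0] * n[1, e, 1]) / (n[0, e, 1] * n[1, e, 0])

print(cor_gi_given_e(0), cor_gi_given_e(1))       # 0.75 and 0.55
print(cor_gi_given_e(0) / cor_gi_given_e(1))      # about 1.36

# Second-order interaction: the same value, computed directly from the cells
num = n[0, 0, 0] * n[0, 1, 1] * n[1, 0, 1] * n[1, 1, 0]
den = n[1, 1, 1] * n[1, 0, 0] * n[0, 1, 0] * n[0, 0, 1]
print(num / den)                                  # about 1.36
```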
Odds Ratios and Log-linear Models

Log-linear models (see [1] and [3]) are the most widely used tools to analyze cross-classified data. Log-linear models may be given a simple interpretation using odds ratios. A log-linear model is usually defined by listing subsets of variables that are the maximal allowed interactions (see Interaction Effects). Subsets of these variables are called interactions. Such a specification is equivalent to saying that for any subset of the variables that is not an interaction, the conditional odds ratios of all involved variables, given all possible categories of the variables not in the subset, are equal to 1. That is, these
greater subsets have no (conditional) association or interaction. For example, with variables X, Y, Z, the log-linear model with maximal interactions XY, YZ means that COR(X, Z|Y = j) = 1, for all categories j of Y, and that the second-order odds ratio of X, Y, Z is equal to 1. The first of these conditions is the conditional independence of X and Z, given Y. See [6] and [7] for the details and advantages of this interpretation. The interpretation of log-linear models based on conditional odds ratios also shows that the association structures that may be modeled for discrete data using log-linear models are substantially more complex than those available for continuous data under the assumption of multivariate normality. A multivariate normal distribution (see Catalogue of Probability Density Functions) is entirely specified by its expected values and covariance matrix. The assumption of normality implies that the bivariate marginals completely specify the distribution, and the association structure among several variables cannot be more complex than the pairwise associations. With log-linear models for categorical data, one may model complex association structures involving several variables by allowing higher-order interactions in the model.
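A small numerical sketch of this interpretation (the three-way probabilities below are invented so that the XY, YZ model holds exactly, i.e., P(x, y, z) = P(x, y)P(z | y)):

```python
import numpy as np

# P[x, y, z] for three binary variables, built to satisfy the XY, YZ log-linear model:
# X and Z are conditionally independent given Y.
p_xy = np.array([[0.30, 0.10],          # joint distribution of X and Y
                 [0.20, 0.40]])
p_z_given_y = np.array([[0.7, 0.3],     # P(Z | Y = 0) and P(Z | Y = 1)
                        [0.2, 0.8]])
P = p_xy[:, :, None] * p_z_given_y[None, :, :]

# Conditional odds ratio of X and Z within each category of Y: both equal 1
for y in (0, 1):
    cor = (P[0, y, 0] * P[1, y, 1]) / (P[0, y, 1] * P[1, y, 0])
    print(f"COR(X, Z | Y = {y}) = {cor:.3f}")

# Second-order odds ratio of X, Y, Z: also 1 under this model
num = P[0, 0, 0] * P[0, 1, 1] * P[1, 0, 1] * P[1, 1, 0]
den = P[1, 1, 1] * P[1, 0, 0] * P[0, 1, 0] * P[0, 0, 1]
print("second-order odds ratio:", num / den)
```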
References

[1] Agresti, A. (2002). Categorical Data Analysis, 2nd Edition, Wiley.
[2] Bergsma, W. & Rudas, T. (2003). On conditional and marginal association, Annales de la Faculte des Sciences de Toulouse 11(6), 443–454.
[3] Bishop, Y.M.M., Fienberg, S.E. & Holland, P.W. (1975). Discrete Multivariate Analysis: Theory and Practice, MIT Press.
[4] Clogg, C.C. & Shihadeh, E.S. (1994). Statistical Models for Ordinal Variables, Sage.
[5] Radelet, M. (1981). Racial characteristics and the imposition of the death penalty, American Sociological Review 46, 918–927.
[6] Rudas, T. (1998). Odds Ratios in the Analysis of Contingency Tables, Sage.
[7] Rudas, T. (2002). Canonical representation of log-linear models, Communications in Statistics (Theory and Methods) 31, 2311–2323.

(See also Relative Risk; Risk Perception)

TAMÁS RUDAS
One Way Designs: Nonparametric and Resampling Approaches
ANDREW F. HAYES
Volume 3, pp. 1468–1474
in Encyclopedia of Statistics in Behavioral Science
ISBN-13: 978-0-470-86080-9   ISBN-10: 0-470-86080-4
Editors: Brian S. Everitt & David C. Howell
John Wiley & Sons, Ltd, Chichester, 2005
One Way Designs: Nonparametric and Resampling Approaches

Analysis of variance (ANOVA) is arguably one of the most widely used statistical methods in behavioral science. This no doubt is attributable to the fact that behavioral science research so often involves some kind of comparison between groups that either exist naturally (such as men versus women) or are created artificially by the researcher (such as in an experiment). In a one-way design, one that includes only a single grouping factor with k ≥ 2 levels or groups, analysis of variance is used initially to test the null hypothesis (H0) that the k population means on some outcome variable Y are equal against the alternative hypothesis that at least two of the k population means are different:

H0: µ1 = µ2 = · · · = µk
Ha: at least two µ are different    (1)
The test statistic in ANOVA is F, defined as the ratio of an estimate of variability in Y between groups, sB², to an estimate of the within-group variance in Y, sE² (i.e., F = sB²/sE²). As ANOVA is typically implemented, within-group variance is estimated by deriving a pooled variance estimate, assuming equality of within-group variance in Y across the k groups. Under a true H0, and when the assumptions of analysis of variance are met, the P value for the obtained F ratio, Fobtained, can be derived in reference to the F distribution on dfnum = k − 1 and dfden = N − k, where N is the total sample size. If p ≤ α, H0 is rejected in favor of Ha, where α is the level of significance chosen for the test. As with all null hypothesis testing procedures, the decision about H0 is based on the P value, and, thus, it is important that you can trust the accuracy of the P value. Poor estimation of the P value can produce a decision error. When errors in estimation of the P value occur, this is typically attributable to poor correspondence between the actual sampling distribution of the obtained F ratio, under H0, and the theoretical sampling distribution of that ratio. In the case of one-way ANOVA, the P value can be trusted if the distribution of all possible values of F that you
could have observed, conditioned on H0 being true, does, in fact, follow the F distribution. Unfortunately, there is no way of knowing whether Fobtained does follow the F distribution because you only have a single value of Fobtained describing the results of your study. However, the accuracy of the P value can be inferred indirectly by determining whether the assumptions that gave rise to the use of F as the appropriate sampling distribution are met. Those assumptions are typically described as normality of Y in each population, equality of within-group variance on Y in the k populations, and independence of the observations. Of course, the first two of these assumptions can never be proven true, so there is no way of knowing for certain whether F is the appropriate sampling distribution for computing the P value for Fobtained. With the advent of high-speed, low-cost computing technology, it is no longer necessary to rely on a theoretically derived but assumption-constrained sampling distribution in order to generate the P value for Fobtained (see Permutation Based Inference). It is possible, through the use of a resampling method, to generate a P value for the hypothesis test by generating the null reference distribution empirically. The primary advantages of resampling methods are (a) the assumptions required to trust the accuracy of the P value are usually weaker and/or fewer in number and (b) the test statistic used to quantify the obtained result is not constrained to be one with a theoretical sampling distribution. This latter advantage is especially exciting, because the use of resampling allows you to choose any test statistic that is sensitive to the hypothesis of interest, not just the standard F ratio as used in one-way ANOVA. I focus here on the two most common and widely studied resampling methods: bootstrapping and randomization tests. Bootstrapping and randomization tests differ somewhat in assumptions, purpose, and computation, but they are conceptually similar to all resampling methods in that the P value is generated by comparing the obtained result to some null reference distribution generated empirically through resampling or ‘reshuffling’ of the existing data set in some way. Define θ as some test statistic used to quantify the obtained difference between the groups. In analysis of variance, θ = F , defined as above. But θ could be any test statistic that you invent that suitably describes the obtained result. For example, θ could
be a quantification of group differences using the group medians, trimmed means, or anything you feel best quantifies the obtained result. Of course, θ is a random variable, but in any study, you observe only a single instantiation of θ. A different sample of N total units drawn from a population and then distributed in the same way among k groups would produce a different value of θ. Similarly, a different random reassignment of N available units in an experiment into k groups would also produce a different value of θ. Below, I describe how bootstrapping and randomization tests can be used to generate the desired P value to test a null hypothesis. Toward the end, I discuss some issues involved in the selection of a test statistic, and how this influences what null hypothesis is actually tested. For the sake of illustrating bootstrapping and randomization tests, I will refer to a hypothetical data set. The design and nature of the study will differ for different examples I present. For each study, however, N = 15 units are distributed among k = 3 groups of size n1 = 4, n2 = 6, and n3 = 5. Each unit is measured on an outcome variable Y, and the individual values of Y and the sample mean and standard deviation of Y in each group are those presented in Table 1.

Table 1  A sample of 15 units distributed across three groups measured on variable Y

Group 1 (n1 = 4):  6  5  8  10           Ȳ1 = 7.25, s1 = 2.22
Group 2 (n2 = 6):  4  8  7  6  9  8      Ȳ2 = 7.00, s2 = 1.79
Group 3 (n3 = 5):  2  4  5  2  6         Ȳ3 = 3.80, s3 = 1.79
Bootstrapping the Sampling Distribution

Bootstrapping [6] is typically used in observational studies, where N units (e.g., people) are randomly sampled from k populations, with the k populations distinguished from each other on some (typically, but not necessarily) qualitative dimension.
For example, a political scientist might randomly poll N people from a population by randomly dialing phone numbers, assessing the political knowledge of the respondents in some fashion, and asking them to self-identify as either a Democrat (group 1), a Republican (group 2), or an Independent (group 3). Imagine that a sample of 15 people yielded the data in Table 1, where the outcome variable is the number of questions, out of 12 about current political events, that the respondent answered correctly. Using these data, MSbetween = 18.23, MSwithin = 3.63, and so Fobtained = 5.02. From the F distribution, the P value for the test of the null hypothesis that µDemocrats = µRepublicans = µIndependents is 0.026. So, if the null hypothesis is true, the probability of obtaining an F ratio as large or larger than 5.02 is 0.026. Using α = 0.05, the null hypothesis is rejected. It appears that affiliates of different political groups differ in their political knowledge on average. To generate the P value via bootstrapping, a test statistic θ must first be generated describing the obtained result. To illustrate that any test statistic can be employed, even one without a convenient sampling distribution, I will use θ = Σj nj|Ȳj − Ȳ|. So, the obtained result is quantified as the weighted sum of the absolute mean deviations from the grand mean. In the data in Table 1, θobtained = 22. The traditional F ratio or any alternative test statistic could be used and the procedure described below would be unchanged. The P value is defined, as always, as the proportion of possible values of θ that are as large or larger than θobtained in the distribution of possible values of θ, assuming that the null hypothesis is true. The distribution of possible values of θ is generated by resampling from the N = 15 units with replacement. There are two methods of bootstrap resampling discussed in the bootstrapping literature. They differ in the assumptions made about the populations sampled, and, as a result, they can produce different results. The first approach to bootstrapping assumes that the sample of N units represents a set of k independent samples from k distinct populations. No assumption is made that the distribution of Y has any particular form, or that the distribution is identical in each of the k populations. The usual null hypothesis is that these k populations have a common mean. To accommodate this null hypothesis, the original data must first be transformed before resampling
can occur. This transformation is necessary because the P value is defined as the probability of θobtained conditioned on the null hypothesis being true, but the data in their raw form may be derived from a world where the null hypothesis is false. The transformation recommended by Efron and Tibshirani [6, p. 224] is

Y′ij = Yij − Ȳj + Ȳ,    (2)
where Yij is the value of Y for case i in group j, Y′ij is the transformed score, Ȳj is the mean of Y for group j, and Ȳ is the mean of all N values of Y (in Table 1, Ȳ = 6). Once this transformation is applied, the group means are all the same, Ȳ1 = Ȳ2 = · · · = Ȳk = Ȳ. The result of applying this transformation to the data in Table 1 can be found in Table 2.

Table 2  Data in Table 1 after transformation using Equation (2)

Group 1 (n1 = 4):  4.75  3.75  6.75  8.75                 Ȳ1 = 6.00, s1 = 2.22
Group 2 (n2 = 6):  3.00  7.00  6.00  5.00  8.00  7.00     Ȳ2 = 6.00, s2 = 1.79
Group 3 (n3 = 5):  4.20  6.20  7.20  4.20  8.20           Ȳ3 = 6.00, s3 = 1.79

To generate the bootstrapped sampling distribution of θ, for each of the k groups, a sample of nj scores is randomly taken with replacement from the transformed scores for group j, yielding a new sample of nj values. This is repeated for each of the k groups, resulting in a total sample size of N. In the resulting 'resampled' data set, θ* is computed, where θ* is the same test statistic used to describe the obtained results in the untransformed data; so θ* = Σj nj|Ȳj − Ȳ|, computed in the resampled data. This procedure is repeated, resampling with replacement from the transformed data set (Table 2) a total of b times. The bootstrapped null sampling distribution of θ is defined as the b values of θ*. The P value for θobtained is calculated as the proportion of values of θ* in this distribution that are equal to or larger than θobtained; that is, p = #(θ* ≥ θobtained)/b. Because of the random nature of bootstrapping, p is a random variable. A good estimate of the P value can be derived with values of b as small as 1000, but given the speed of today's computers, there is no reason not to do as many as 10 000 bootstrap resamples. The larger the value of b, the more precise the estimate of p.

The second approach to generating a bootstrap hypothesis test is more nearly like classical ANOVA. We assume that, under the null hypothesis, each of the k samples was drawn from a single common population. We do not assume that this population is normally distributed, only that it is well represented by the sample of N observations. Bootstrap samples for each group can then be taken from this pool of N scores, again by randomly sampling with replacement the requisite number of times. As with the first approach, we build up a bootstrap null sampling distribution by repeating these random samplings a large number, b, of times, computing our group comparison statistic each time. The choice between these two approaches is dictated largely by group sizes. In the first approach, we assume that each of the k sampled populations is well represented by the sample drawn from that population. This assumption makes greater sense the larger the group size. For small group sizes, we may doubt the relevance of the assumption. If group sizes are smaller than, say, 10 or 12, it is preferable to make the additional assumption of a commonness of the sampled populations.

There is a variety of statistical programs on the market that can conduct this test, as well as some available for free on the Internet. No doubt, as bootstrapping becomes more common, more and more software packages will include bootstrapping routines. Even without a statistical package capable of conducting this test, it is not difficult to program in whatever language is convenient. For example, I wrote a GAUSS program [2] to generate the bootstrapped sampling distribution of θ for the data in Table 1 using the first approach. Setting b to 10 000, 5 of the 10 000 values of θ* were equal to or greater than 22, yielding an estimated P value of 5/10 000 = 0.0005.
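A minimal Python sketch of the first bootstrap approach for the Table 1 data (this is my own illustration of the procedure described above, not the author's GAUSS program; with random resamples the estimated P value will vary slightly from run to run):

```python
import numpy as np

rng = np.random.default_rng(1)

groups = [np.array([6, 5, 8, 10], dtype=float),        # Table 1 data
          np.array([4, 8, 7, 6, 9, 8], dtype=float),
          np.array([2, 4, 5, 2, 6], dtype=float)]
grand_mean = np.mean(np.concatenate(groups))            # 6.0

def theta(gs, grand):
    # weighted sum of absolute deviations of the group means from the grand mean
    return sum(len(g) * abs(g.mean() - grand) for g in gs)

theta_obtained = theta(groups, grand_mean)              # 22.0

# Transformation (2): force every group mean to equal the grand mean
transformed = [g - g.mean() + grand_mean for g in groups]

b = 10_000
count = 0
for _ in range(b):
    resampled = [rng.choice(t, size=len(t), replace=True) for t in transformed]
    gm = np.mean(np.concatenate(resampled))             # grand mean of the resampled data
    if theta(resampled, gm) >= theta_obtained:
        count += 1

print("theta_obtained =", theta_obtained, " bootstrap p =", count / b)
```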
The Randomization Test

The bootstrapping approach to generating the P value is based on the idea that each Y value observed in the sample represents only one of perhaps many units in
the population with that same value of Y . No assumption is made about the form of the distribution of Y , as is traditional in ANOVA, where the assumption is that Y follows a normal distribution. So, the bootstrapped sampling distribution of θ is conditioned only on the values of Y that are known to be possible when randomly sampling from the k populations. The randomization test goes even farther by doing a form of sampling without replacement, so each of the N values of Y is used only once. Thus, no assumption is made that each unit in the sample is representative of other units with the same Y score in some larger population. A randomization test [5] would typically be used when the k groups are generated through the random assignment of N units into k groups, as in an experiment. Although it is ideal to make sure that the random assignment mechanism produces k groups of roughly equal size, this is not mathematically necessary. For example, suppose that rather than randomly sampling 15 people from some population, the 15 people in Table 1 were 15 people that were conveniently available to the political scientist, and each was randomly assigned into one of three experimental conditions. Participants assigned to group 1 were given a copy of the New York Times and asked to read it for 30 min. Participants assigned to group 2 were also given a copy of the New York Times, but all advertisements in the paper had been blackened out so they couldn’t be read. For participants assigned to group 3, the advertisements were highlighted with a bright yellow marker, making them stand out on the page. After the 30-minute reading period, the participants were given a 12-item multiple-choice test covering the news of the previous day. Of interest is whether the salience of the advertisements distracted people from the content of the newspaper, and, thereby, affected how much they learned about the previous day’s events. This would typically be assessed by testing whether the three treatment population means are the same. The randomization test begins with the assumption that, under the null hypothesis, each of the N values of Y observed was in a sense ‘preordained.’ That is, the value of Y any sampled unit contributed to the data set is the value of Y that unit would have contributed regardless of which condition in the experiment the unit was assigned to. This is the essence of the claim that the manipulation had no effect on the outcome. Under this assumption,
the obtained test statistic θobtained is only one of many possible values of θ that could have been observed merely through the random assignment of the N units into k groups. A simple combinations formula tells us that there are N !/(n1 !n2 ! . . . nk !) possible ways of randomly assigning N cases (and their corresponding Y values) into k groups of size n1 , n2 , . . . nk . Therefore, under the assumption that the salience of the advertisements had no effect on learning, there are 15!/(4!6!5!) = 630 630 possible values of θ that could be generated with these data, and, thus 630 630 ways this study could have come out merely through the random assignment of sampled units into groups. In a randomization test, the P value for the obtained result, θobtained , is generated by computing every value of θ possible given the k group sample sizes and the N values of Y available. The set of N !/(n1 !n2 ! . . . nk !) values of θ possible is called the randomization distribution of θ. Using this randomization distribution, the P value for θobtained is defined as the proportion of values of θ in the randomization distribution that are at least as large as θobtained , that is, p = #(θ ≥ θobtained )/[N !/(n1 !n2 ! . . . nk !)]. Because θobtained is itself in the randomization distribution of θ, the smallest possible P value is therefore 1/[N !/(n1 !n2 ! . . . nk !)]. Using the data in Table 1, and just for the sake of illustration defining θ as the usual ANOVA F ratio, it has already been determined that θobtained = 5.02. Using one of the several software packages available that conduct randomization tests or writing special code in any programming language, it can be derived that there are 19 592 values of θ in the randomization distribution of θ that are at least 5.02. So the P value for the obtained result is 19 592/630 630 = 0.031. In bootstrapping, the number of values of θ in the bootstrapped sampling distribution is b, a number completely under control of the data analyst and generally small enough to make the computation of the distribution manageable. In a randomization test, the number of values of θ in the randomization distribution is [N !/(n1 !n2 ! . . . nk !)]. So the size of the randomization distribution is governed mostly by the total sample size N and the number of groups k. The size of the randomization distribution can easily explode to something unmanageable, such that it would be computationally impractical if not entirely infeasible to generate every possible value of θ. For example, merely doubling the sample
One Way Designs size and the size of each group in this study yields [30!/(8!12!10!)] or about 3.8 × 1012 possible values of θ in the randomization distribution that must be generated just to derive a single P value. The generation of that many statistics would simply take too long even with today’s (and probably tomorrow’s) relatively fast desktop computers. Fortunately, there is an easy way around this seemingly insurmountable problem. Rather than generating the full randomization distribution of θ, an approximate randomization distribution of θ can be generated by randomly sampling from the possible values of θ in the full randomization distribution. When an approximation of the full randomization distribution is used to generate the P value for θobtained , the test is known as an approximate randomization test. In most real-world applications, the sheer size of the full randomization distribution dictates that you settle for an approximate randomization test. It is fairly easy to randomly sample from the full randomization distribution using available software or published code, for example, [4] and [7], or programming the test yourself. The random sampling is accomplished by randomly reassigning the Y values to the N units and then computing θ after this reshuffling of the data. This reshuffling procedure is repeated b − 1 times. The bth value of θ in the approximate randomization distribution is set equal to θobtained , and the P value for θobtained defined as #(θ ≥ θobtained )/b. Because θobtained is one of the b values in the approximate randomization distribution, the smallest possible P value is 1/b. Of course, the P value is a random variable because of the random nature of the random selection of possible of values of θ. The larger b, the less the P value will vary from the exact P value (from using the full randomization distribution). For most applications, setting b between 5000 and 10 000 is sufficient. Little additional precision is gained using more than 10 000 randomizations. The randomization test is sometimes called a permutation test, although some reserve that term to refer to a slightly different form of the randomization test, where the groups are randomly sampled from k larger and distinct populations. Under the null hypothesis assumption that these k populations are identical, the Y observations are exchangeable among populations (meaning simply that any Y score could be observed in any group with the same probability). Under this
null hypothesis, the P value can be constructed as described above for the randomization test.
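A sketch of an approximate randomization test for the Table 1 data, using the ordinary F ratio as the test statistic (my own illustration; with random reshuffles the P value is itself an estimate and will differ slightly from the exact value of 0.031 reported above):

```python
import numpy as np

rng = np.random.default_rng(1)

y = np.array([6, 5, 8, 10, 4, 8, 7, 6, 9, 8, 2, 4, 5, 2, 6], dtype=float)
sizes = [4, 6, 5]
k, N = len(sizes), len(y)

def f_ratio(values):
    # usual one-way ANOVA F ratio for the given assignment of values to groups
    groups, start = [], 0
    for n in sizes:
        groups.append(values[start:start + n])
        start += n
    grand = values.mean()
    ss_between = sum(len(g) * (g.mean() - grand) ** 2 for g in groups)
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
    return (ss_between / (k - 1)) / (ss_within / (N - k))

f_obtained = f_ratio(y)                  # about 5.02

b = 10_000
count = 1                                # the obtained result counts as one of the b values
for _ in range(b - 1):
    if f_ratio(rng.permutation(y)) >= f_obtained:
        count += 1

print("F =", round(f_obtained, 2), " approximate randomization p =", count / b)
```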
The Kruskal–Wallis Test as a Randomization Test on Ranks

When you are interested in comparing a set of groups measured on some quantitative variable but the data are not normally distributed, many statistics textbooks recommend the Kruskal–Wallis test. The Kruskal–Wallis test is conducted by converting each Y value to its ordinal rank in the distribution of N values of Y. The sums of the ranks within groups are then transformed into a test statistic H. For small N, H is compared to a table of critical values to derive an approximate P value. For large N, the P value is derived in reference to the χ² distribution with k − 1 degrees of freedom [8]. It can be shown that the Kruskal–Wallis test is simply a randomization test as described above, but using the ranks rather than the original Y values. With large N, the P value is derived by capitalizing on the fact that the full randomization distribution of H when analyzing ranks is closely approximated by the χ² distribution.
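A short sketch of this equivalence for the Table 1 data (my own illustration; H is computed here with the usual no-ties formula, and the randomization P value is an estimate based on random reshuffles):

```python
import numpy as np
from scipy.stats import rankdata, chi2

rng = np.random.default_rng(1)

y = np.array([6, 5, 8, 10, 4, 8, 7, 6, 9, 8, 2, 4, 5, 2, 6], dtype=float)
sizes = [4, 6, 5]
N = len(y)
ranks = rankdata(y)                      # midranks; ties are present in these data

def H(r):
    # Kruskal-Wallis statistic (no-ties form) from the ranks assigned to the groups
    groups, start = [], 0
    for n in sizes:
        groups.append(r[start:start + n]); start += n
    return 12 / (N * (N + 1)) * sum(len(g) * (g.mean() - (N + 1) / 2) ** 2 for g in groups)

h_obtained = H(ranks)
p_chi2 = chi2.sf(h_obtained, df=len(sizes) - 1)    # large-sample approximation

b = 10_000
count = 1
for _ in range(b - 1):
    if H(rng.permutation(ranks)) >= h_obtained:
        count += 1

print("H =", round(h_obtained, 2), " chi-square p =", round(p_chi2, 4),
      " randomization p =", count / b)
```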
Assumptions and Selection of Test Statistic

Bootstrapping and randomization tests make fewer and somewhat weaker assumptions than does classical ANOVA about the source of the observed Y scores. Classical ANOVA assumes that the jth group of scores was randomly sampled from a normal distribution. Further, each of the k normal populations sampled is assumed to have a common variance, and, under the null hypothesis, a common mean. The first bootstrapping approach assumes that the jth group of scores was randomly sampled from a population, of unknown form, that is well represented by the sample from that population. The k populations may differ in variability or shape, but under the null hypothesis, they are assumed to have a common mean. The second bootstrapping approach assumes that the k populations sampled potentially differ only in location (i.e., in their means). Under the null hypothesis of a common mean as well, the k samples can be treated as having been sampled from the same population, one well represented by the aggregate of the k samples.
Bootstrapping shares with the ANOVA a random sampling model of chance. The value of our test statistic will vary from one random sample of N units to another. By contrast, the model of chance for the randomization test does not depend on random sampling, only on the randomization to groups of a set of available units. The value of our test statistic now varies with the random allocation of this particular set of N units. The two models of chance differ in their range of statistical inference. Random sampling leads to inferences about the larger populations sampled. Randomization-based inference is limited to the particular set of units that were randomized. However, randomization also leads to causal inference. Establishing causality, even over a limited number of units, can be more important than population inference. And, many behavioral science experiments are conducted using nonrandom collections of units. ANOVA is often applied to the analysis of outcomes that clearly are not normal on the grounds that ANOVA is relatively ‘robust’ to violations of that assumption. For example, much behavioral science research is based on outcome variables that could not possibly be normally distributed, such as counts or responses that have an upper or lower bound such as rating scales. But why make assumptions in your statistical method you don’t need to make in the first place, and that are almost certainly not warranted when there are methods available that work as well, can test the hypothesis of interest, and don’t make those unrealistic assumptions? Large departures from the assumptions, notably the assumption of variance equality, change the null hypothesis tested in ANOVA to equality of the population distributions rather than just equality of their means. So a statistically significant F ratio can result from differences in population variance, shape, mean, or any combination of these factors, depending on which assumption is violated. The problem is that the test statistic typically used in ANOVA based on a ratio of between-group variability to a pooled estimate of within-group variability is sensitive to group differences in variance as well as group differences in mean. For this reason, alternative test statistics such as Welch’s F ratio and a corresponding degrees of freedom adjustment [10] as well as other methods, for example, [1] and [3], have been advocated as alternatives to traditional ANOVA
that allow you to relax the assumption of equality of variance. So, the null hypothesis tested with ANOVA is a function of the test statistic used, and whether the assumptions used in the derivation of the test statistic are met. And the same can be said of resampling methods. If you cannot assume or have reason to reject the assumption of equality of the populations in variance and shape, the use of the traditional F ratio in ANOVA produces a test of equality of population distributions rather than population means. Similarly, the randomization test and the second version of the bootstrap described here using the pooled variance F ratio as the test statistic, or any monotonically related statistic such as the between groups sum of squares or mean square, yields a test of the null hypothesis of equality of the population distributions, unless you make some additional assumptions. In the case of the second approach to bootstrapping described earlier, if you desire a test of mean differences, you must also assume equality of the population distributions on ‘nuisance’ parameters such as skew, kurtosis, and variance. In the case of randomization tests, you must assume that the manipulation does not affect the responses in any way whatsoever in order to test the null hypothesis of equality of the means. For example, if you have reason to believe that the manipulation might affect variability in the outcome variable, then you have no basis for interpreting a significant P value from a randomization test as evidence that the manipulation affects the means, because the F ratio quantifies departures from equality of the response distributions, not just the means. The first bootstrapping method does not suffer near as much from this problem of choice of test statistic, although it is not immune to it. By conditioning the resampling for each group on the observed Y units in that group, the effect of population differences in variance and shape on the validity of the test as a test of group mean differences is reduced considerably. However, this method requires that you have substantial faith in each group’s sample as representing the distribution of the population from which it was derived. The validity of this method of bootstrapping is compromised when one or more of the group sample sizes is small. So even when using a resampling method, if you desire to test a precise null hypothesis about the group means, it is necessary to use a test statistic that is
sensitive only to mean differences. It is not always obvious what a test statistic is sensitive to without some familiarity with the derivation of and research on that statistic. Fortunately, as discussed above, there are many alternative test statistics developed for ANOVA that perform better when population variances differ ([1], [3], or [9]). For example, the numerator of Welch's F ratio [10, also see 9] would be a sensible test statistic for comparing population means, given that it is much less sensitive to population differences in variability than the usual ANOVA F ratio. This test statistic is θ = Σj (nj/sj²)(Ȳj − Ỹ)², where Ỹ is defined as w⁻¹ Σj (nj/sj²)Ȳj and w = Σj nj/sj². But, resampling methods are a relatively new methodological tool, and continued research on their performance through simulation will no doubt clarify which test statistics are most suitable for which null hypothesis under which conditions.
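A small sketch of this heteroscedasticity-tolerant statistic for the Table 1 data, and of how it can be plugged into the same reshuffling scheme used above (my own illustration of the formula just given, not code from the article):

```python
import numpy as np

rng = np.random.default_rng(1)

y = np.array([6, 5, 8, 10, 4, 8, 7, 6, 9, 8, 2, 4, 5, 2, 6], dtype=float)
sizes = [4, 6, 5]

def welch_numerator(values):
    groups, start = [], 0
    for n in sizes:
        groups.append(values[start:start + n]); start += n
    wj = np.array([len(g) / g.var(ddof=1) for g in groups])   # nj / sj^2
    means = np.array([g.mean() for g in groups])
    y_tilde = (wj * means).sum() / wj.sum()                   # variance-weighted grand mean
    return (wj * (means - y_tilde) ** 2).sum()

theta_obtained = welch_numerator(y)

b = 10_000
count = 1
for _ in range(b - 1):
    if welch_numerator(rng.permutation(y)) >= theta_obtained:
        count += 1

print("theta =", round(theta_obtained, 2), " randomization p =", count / b)
```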
References

[1] Alexander, R.A. & Govern, D.M. (1994). A new and simpler approximation for ANOVA under variance heterogeneity, Journal of Educational Statistics 19, 91–101.
[2] Aptech Systems. (2001). GAUSS version 5.2 [computer program], Maple Valley.
[3] Brown, M.B. & Forsythe, A.B. (1974). The small sample behavior of some statistics which test the equality of means, Technometrics 16, 129–132.
[4] Chen, R.S. & Dunlap, W.P. (1993). SAS procedures for approximate randomization tests, Behavior Research Methods, Instruments, & Computers 25, 406–409.
[5] Edgington, E. (1995). Randomization Tests, 3rd Edition, Dekker, New York.
[6] Efron, B. & Tibshirani, R.J. (1991). An Introduction to the Bootstrap, Chapman & Hall, Boca Raton.
[7] Hayes, A.F. (1998). SPSS procedures for approximate randomization tests, Behavior Research Methods, Instruments, & Computers 30, 536–543.
[8] Howell, D. (1997). Statistical Methods for Psychology, 4th Edition, Duxbury, Belmont.
[9] James, G.S. (1951). The comparison of several groups of observations when the ratios of population variances are unknown, Biometrika 38, 324–329.
[10] Welch, B.L. (1951). On the comparison of several mean values: an alternative approach, Biometrika 38, 330–336.

ANDREW F. HAYES
Optimal Design for Categorical Variables
MARTIJN P.F. BERGER
Volume 3, pp. 1474–1479
in Encyclopedia of Statistics in Behavioral Science
ISBN-13: 978-0-470-86080-9   ISBN-10: 0-470-86080-4
Editors: Brian S. Everitt & David C. Howell
John Wiley & Sons, Ltd, Chichester, 2005
Optimal Design for Categorical Variables

Introduction

Optimal designs enable behavioural scientists to obtain efficient parameter estimates and maximum power of their tests, thus reducing the required sample size and the costs of experimentation. Multiple linear regression analysis is one of the most frequently used statistical techniques to analyse the relation between quantitative and qualitative variables. Unlike quantitative variables, qualitative variables do not have a well-defined measurement scale and only distinguish categories, with or without an ordering. Nominal independent variables have no ordering among the categories, and dummy variables are usually used to describe their variation. Ordinal independent variables are usually numbered in accordance with the ordering of the categories and treated as if they are quantitative. The dependent variable can also be quantitative or qualitative. Although optimal design theory is applicable to models with both quantitative and qualitative variables, the results are usually quite different. In this paper, we will briefly consider four cases, namely, optimal designs for linear models and logistic regression models, with and without dummy coding. First, the methodological basics of optimal design theory will be described by means of a linear regression model. Then, optimal designs for each of the four cases will be illustrated with concrete examples. Finally, we will draw some conclusions and provide references for further reading.
Methodological Basics for Optimal Designs

Model. In describing the relationship of Y to the quantitative variables X1 and X2, we consider the linear model for subject i:

Yi = β0 + β1X1i + β2X2i + εi,    (1)

where β0 is the intercept, and β1 and β2 are the usual effect parameters. This model assumes that the errors εi are independently normally distributed with mean zero and variance σ², that is, εi ∼ N(0, σ²).
Optimal Design. In the design stage of a study, we are free to sample subjects with different values for X1 and X2. These choices influence the variance of the estimated effect parameters β = [β1 β2]. Although the intercept may also be of interest, we will restrict ourselves to these two effect parameters. Because we usually want to estimate β1 and β2 as efficiently as possible, an ideal design would be a design with minimum variance of the estimated parameters. Such a design is called an optimal design. The variances and covariance of the estimators β̂1 and β̂2 in model (1) are given by:

var(β̂) =
  [  σ²/(N(1 − r12²)var(X1))             −σ² cov(X1, X2)/(N(1 − r12²)W)  ]
  [  −σ² cov(X1, X2)/(N(1 − r12²)W)       σ²/(N(1 − r12²)var(X2))        ],    (2)
where W = var(X1) × var(X2), and var(X1) and var(X2) are the variances of X1 and X2, respectively, and cov(X1, X2) and r12 are the corresponding covariance and correlation. var(β̂1) and var(β̂2) are on the main diagonal of the matrix in (2), and the covariance between β̂1 and β̂2 is the off-diagonal element. Equation (2) shows that var(β̂1) and var(β̂2) decrease as:

1. The variances of the variables, var(X1) and var(X2), respectively, increase. This can be done by only selecting subjects with the most extreme values for X1 and X2.
2. The correlation r12 between X1 and X2 decreases. For an orthogonal design, the correlation r12 is minimum, that is, r12 = 0.
3. The common error variance σ² decreases, that is, when more precise measurements of Y are obtained.
4. The total number of observations N increases. This is, in fact, the most often applied method to obtain sufficient power for finding real effects, but with extra costs of collecting data.

There is a trade-off among these effects. For example, reduction of the measurement error σ by a factor 1 − (1/√2) ≈ 0.29 has the same effect on var(β̂1) as doubling the total sample size N.
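A tiny numerical check of this trade-off, using arbitrary made-up design values (the point is only that halving σ² and doubling N change var(β̂1) by the same factor):

```python
import math

def var_beta1(sigma2, N, r12, var_x1):
    # diagonal element of (2) for beta_1
    return sigma2 / (N * (1 - r12 ** 2) * var_x1)

base = var_beta1(sigma2=4.0, N=100, r12=0.3, var_x1=1.25)
double_n = var_beta1(sigma2=4.0, N=200, r12=0.3, var_x1=1.25)
# reducing sigma by the factor 1 - 1/sqrt(2) means the new sigma is sigma/sqrt(2)
smaller_sigma = var_beta1(sigma2=4.0 * (1 / math.sqrt(2)) ** 2, N=100, r12=0.3, var_x1=1.25)

print(base, double_n, smaller_sigma)   # the last two are equal: both are base / 2
```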
Optimality Criterion. Although var(β̂1) and var(β̂2) can be minimized separately, model (1) is usually applied because interest is focused on the joint prediction of Y by X1 and X2, and because hypotheses on both the regression parameters in β are tested. Moreover, the overall test for the fit of model (1) is based on all parameters in β, simultaneously. Thus, an optimal design for model (1) should take all the variances and covariance into account, simultaneously. This can be done by means of an optimality criterion. Different optimality criteria have been proposed in the literature. See [4] for a review. We will restrict ourselves to the determinant or D-optimality criterion, because of its advantages. This criterion is proportional to the volume of the simultaneous confidence region for the parameters in β = [β1 β2], thus giving it a similar interpretation as that of a confidence interval for a single parameter. Moreover, a D-optimal design is invariant under linear transformation of the scale of the independent variables, which is generally not true for the other criteria. Consider several designs τ in a design space T, which are distinguished by the number of different levels of the independent variables X1 and X2. A D-optimal design τ* ∈ T is the design among all designs τ ∈ T for which the determinant of var(β̂) is minimized:

Minimize: det[var(β̂)],  or equivalently
Minimize: [σ²]²/(N²(1 − r12²)var(X1)var(X2)),    (3)
Equation (3) shows that det[var(β̂)] is not only a function of N, but also of σ², var(X1), var(X2), and of the correlation r12. Again, we can see that, for example, doubling the sample size N has the same effect on the efficiency as reducing the error variance σ² by 1/2, that is, by decreasing the errors εi by a factor of about 0.29.

Maximum power and efficiency. The 100(1 − α)% confidence ellipsoid for the p regression parameters in β is given by:

(β − β̂)′ [var(β̂)]⁻¹ (β − β̂) ≤ p MSe F(df1, df2, α),    (4)

where MSe is an estimate of the error variance σ² and F(df1, df2, α) is the α point of the F distribution with df1 and df2 degrees of freedom.
Since det[var(β̂)] is proportional to the volume of this ellipsoid, minimizing this criterion will maximize the power of the simultaneous test of the hypothesis β = 0. Thus, a D-optimal design τ* will result in maximum efficiency and power for a given sample size and estimate of the error variance σ².

Relative efficiency. To compare the efficiency of a design τ with that of the optimal design τ*, the following relative efficiency can be used:

Eff(τ) = {det[var(β̂τ*)]/det[var(β̂τ)]}^(1/p),    (5)

where p is the number of independent parameters in the parameter vector β. This ratio is always Eff(τ) ≤ 1, and it indicates how many additional observations are needed to obtain maximum efficiency. If, for example, Eff(τ) = 0.8, then (0.8⁻¹ − 1) × 100% = 25% more observations will be needed for design τ to become as efficient as the optimal design, and when each extra observation costs the same, then a 25% cost reduction is feasible.
Examples

Multiple Regression with Ordinal Independent Variables

Consider, as an example, a study on the acquisition of new vocabulary. The design consists of four samples of pupils from the 8th, 9th, 10th, and 11th grades, respectively. Suppose that vocabulary growth also depends on IQ, and that three IQ classes are considered, namely, Class 1 (IQ range 90–100), Class 2 (IQ range 100–110), and Class 3 (IQ range 110–120). In total, 12 distinct samples of pupils are available. A similar study was reported by [9, pp. 453–456], but with a different design. The vocabulary test score Yi of pupil i can be described by the linear regression model with ordinal independent variables Grade and IQ class:

Yi = β0 + β1 Gradei + β2 IQi + εi,    (6)
where β0 is the intercept, β1 and β2 are the regression parameters for Grade and IQ class, respectively, and εi ∼ N(0, σ²). We will treat the ordinal independent variables as being quantitative. Five different designs for model (1) are displayed in Table 1.

Table 1  Five designs with 12 combinations of the levels of X1 and X2 for the vocabulary study

              Design τ1     Design τ2     Design τ3     Design τ4     Design τ5
Cell          X1    X2      X1    X2      X1    X2      X1    X2      X1    X2
1              8     1       8     1       8     1       8     1       8     1
2              8     2       8     2       8     1       8     1       8     1
3              8     3       8     3       8     3       8     1       8     1
4              9     1       8     1       8     1       8     1       8     3
5              9     2       8     2       8     2       8     1       8     3
6              9     3       8     3       8     3       8     1       8     3
7             10     1      11     1      11     1      11     3      11     1
8             10     2      11     2      11     2      11     3      11     1
9             10     3      11     3      11     3      11     3      11     1
10            11     1      11     1      11     1      11     3      11     3
11            11     2      11     2      11     3      11     3      11     3
12            11     3      11     3      11     3      11     3      11     3
var(X)       1.25  0.6667  2.25  0.6667  2.25  0.8333  2.25   1.00   2.25   1.00
r12             0.0000        0.0000        0.1833         1.0000        0.0000
det[var(β̂)]     1.20          0.6667        0.5517         ∞             0.4444

Note: The D-optimality criterion is computed without N and σ², that is, det[var(β̂)] is divided by σ⁴/N².

These designs are all balanced and based on the 12 combinations of the levels of Grade and IQ, that is, the total sample size is N = 12n, where n is the number of pupils in each combination. Design τ1 is the original design. Design τ5 is the D-optimal design, with minimum value det[var(β̂)] = 0.444. As expected, this design has maximum variances var(X1) and var(X2), and minimum value r12 = 0. The relative efficiency of Design τ1 is Eff(τ1) = 0.6085, indicating that Design τ1 would need about 64% more observations to have maximum efficiency and maximum power. Designs τ2 and τ3 are both more efficient than Design τ1. Notice that Design τ4 is not appropriate for this model because r12 = 1.
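A sketch that reproduces the tabled criterion values and the relative efficiency of Design τ1 from the raw design columns (population variances and the product-moment correlation are used, matching the values shown in Table 1):

```python
import numpy as np

# Design columns from Table 1; tau_4 is omitted because r12 = 1 makes the criterion infinite.
designs = {
    "t1": (np.repeat([8, 9, 10, 11], 3), [1, 2, 3] * 4),
    "t2": (np.repeat([8, 11], 6),        [1, 2, 3] * 4),
    "t3": (np.repeat([8, 11], 6),        [1, 1, 3, 1, 2, 3, 1, 2, 3, 1, 3, 3]),
    "t5": (np.repeat([8, 11], 6),        [1, 1, 1, 3, 3, 3, 1, 1, 1, 3, 3, 3]),
}

def d_criterion(x1, x2):
    # equation (3) without N and sigma^2: 1 / ((1 - r12^2) var(X1) var(X2))
    x1, x2 = np.asarray(x1, float), np.asarray(x2, float)
    r12 = np.corrcoef(x1, x2)[0, 1]
    return 1.0 / ((1 - r12 ** 2) * x1.var() * x2.var())

crit = {name: d_criterion(*d) for name, d in designs.items()}
print(crit)                                     # about 1.20, 0.667, 0.552, 0.444

# relative efficiency of design tau_1 compared with the D-optimal design tau_5 (p = 2)
print((crit["t5"] / crit["t1"]) ** (1 / 2))     # about 0.61
```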
Multiple Regression with Nominal Independent Variables

Suppose that in the vocabulary growth study, the joint effect of Grade and three different Instruction Methods is investigated, and that two dummy variables Dum1 and Dum2 are used to describe the variation of the Instruction Methods. The regression model is:

Yi = β0 + β1 Gradei + β2 Dum1i + β3 Dum2i + εi.    (7)

In Table 2, the D-optimal design for model (7) is presented, with n subjects for each of the six optimal combinations of Grade levels and Instruction Method and minimum value det[var(β̂)] = 1.6875. Three unequal group designs are also presented in Table 2, and it can be seen that efficiency loss is relatively small for these designs because Eff(τ) > 0.9.

Table 2  The effect of unbalancedness on efficiency of a design for regression model (7) with quantitative variable Grade (X1) and qualitative variable Instruction Method (X2)

        Levels               Design weights
Cell    X1   X2         W1         W2         W3         W4
1        8    1       (1/6)N     (1/4)N     (1/4)N     (1/5)N
2        8    2       (1/6)N     (1/6)N     (1/4)N     (1/10)N
3        8    3       (1/6)N     (1/12)N    (1/12)N    (1/5)N
4       11    1       (1/6)N     (1/4)N     (1/6)N     (1/5)N
5       11    2       (1/6)N     (1/6)N     (1/6)N     (1/10)N
6       11    3       (1/6)N     (1/12)N    (1/12)N    (1/5)N
det[var(β̂)]           1.6875     2.25       2.2345     1.9531
Eff(τ)                 1.0000     0.9086     0.9107     0.9524

Note: Eff(τ) is computed with p = 3 variables, that is, X1 and two dummy variables for X2.

Logistic Model with Ordinal Independent Variables

Occasionally, the dependent variable may be dichotomous. For example, the following logistic model with Grade as predictor describes the probability Pi that subject i passes a vocabulary test:

ln[Pi/(1 − Pi)] = β0 + β1 Gradei.    (8)

The parameters β0 and β1 can be estimated by maximum likelihood, and a D-optimal design for model (8) can be found by minimizing the determinant of var(β̂) (see [4], Chapter 18, for details). There is, however, a problem. For nonlinear models, var(β̂) depends on the actual parameter values, and a D-optimal design for one combination of parameter values may not be optimal for other values. This problem is often referred to as the local optimality problem (see [4] for ways to overcome this problem). It can be shown (see [5]) that a D-optimal design for model (8) has two distinct, equally weighted design points, namely, xi = (±1.5437 − β0)/β1. This
means that optimal estimation of β0 and β1 is obtained if N /2 pupils have a Pi = 0.176, and N /2 pupils have a Pi = 0.824 of passing the test. Of course, such a selection is often not feasible in practice. But, for a test designed for the 9th grade, a teacher may be able to select N /2 pupils from the 8th grade with about Pi = 0.2 and N /2 pupils from the 10th grade with about Pi = 0.8 probability of passing the test, respectively. Such a sample would be more efficient than one sample of N pupils from the 9th grade. The well known item response theory (IRT) models in educational and psychological measurement are comparable with model (8), and the corresponding optimal designs were studied by [5] and [12], among others.
where Dum 1 and Dum 2 are two dummy variables needed to describe the variation of the three Instruction Methods. For parameter values (β0 , β1 , β 2 ) ∈ [−10, 10], the D-optimal design for model (9) is a balanced design with an equal number of pupils assigned to each instruction method. It should, however, be emphasized that for other models or parameter values, a D-optimal design may not be balanced. If the researcher expects that one of the instruction methods works better than the others and assigns 70% of the sample to this instruction method and divides the remaining pupils equally over the two other methods, then the Eff(τ ) = 0.752. This means that the sample size would have to be 33% larger to remain as efficient as the optimal design.
Logistic Regression with Nominal Independent Variables
Summary and Further Reading As already mentioned, an optimal design for logistic regression is only locally optimal, that is, for a given set of parameter values. If the probability of passing a vocabulary test is modeled with Instruction Method as only predictor, then the logistic regression model is: Pi (9) = β0 + β1 Dum 1 + β2 Dum 2 , ln 1 − Pi
Although different models have different optimal designs, a rule-of-thumb may be that if the independent variables are treated as being quantitative, efficiency will increase if the variance of the independent variables increases, and if the design becomes more orthogonal. For qualitative independent variables, a design with equal sample sizes for all categories is often optimal. For non-linear models, the
Optimal Design for Categorical Variables optimal designs are only optimal locally, that is, for a given set of parameter values. The number of studies on optimal design problems is increasing rapidly. Elementary reviews are given by [13] and [19]. A more general treatment is given by [4]. Optimal designs for correlated data were studied by [1] and [8]. Studies on optimal designs for random block effects are found in [3, 10], and [11], while [14] and [18] studied random effects models. Optimal designs for multilevel and random effect models with covariates are presented by [15, 16], and [17], among others. Optimal design for IRT models in educational measurement is studied by [2, 5, 6, 7] and [2].
References [1]
[2]
[3]
[4] [5]
[6]
[7]
Abt, M., Liski, E.P., Mandal, N.K. & Sinha, B.K. (1997). Correlated model for linear growth: optimal designs for slope parameter estimation and growth prediction, Journal of Statistical Planning and Inference 64, 141–150. Armstrong, R.D., Jones, D.H. & Kunce, C.S. (1998). IRT test assembly using network-flow programming, Applied Psychological Measurement 22, 237–247. Atkins, J.E. & Cheng, C.S. (1999). Optimal regression designs in the presence of random block effects, Journal of Statistical Planning and Inference 77, 321–335. Atkinson, A.C. & Donev, A.N. (1992). Optimum Experimental Designs, Clarendon Press, Oxford. Berger, M.P.F. (1992). Sequential sampling designs for the two-parameter item response theory model, Psychometrika 57, 521–538. Berger, M.P.F. (1998). Optimal design of tests with dichotomous and polytomous items, Applied Psychological Measurement 22, 248–258. Berger, M.P.F., King, C.Y. & Wong, W.K. (2000). Minimax D-optimal designs for item response theory models, Psychometrika 65, 377–390.
[8]
[9] [10]
[11]
[12] [13] [14]
[15]
[16]
[17]
[18]
[19]
5
Bishoff, W. (1993). On D-optimal designs for linear models under correlated observations with an application to a linear model with multiple response, Journal of Statistical Planning and Inference 37, 69–80. Bock, R.D. (1975). Multivariate Statistical Methods, McGraw-Hill, New York. Cheng, C.S. (1995). Optimal regression designs under random block-effects models, Statistica Sinica 5, 485–497. Goos, P. & Vandebroek, M. (2001). D-optimal response surface designs in the presence of random block effects, Computational Statistics and Data Analysis 37, 433–453. Jones, D.H. & Jin, Z. (1994). Optimal sequential designs for on-line estimation, Psychometrika 59, 59–75. McClelland, G.H. (1997). Optimal design in psychological research, Psychological Methods 2, 3–19. Mentr´e, F., Mallet, A. & Baccar, D. (1997). Optimal design in random effects regression models, Biometrika 84, 429–442. Moerbeek, M., Van Breukelen, G.J.P. & Berger, M.P.F. (2001). Optimal experimental designs for multilevel models with covariates, Communications in Statistics; Theory and Methods 30(12), 2683–2697. Ouwens, J.N.M., Tan, F.E.S. & Berger, M.P.F. (2001). On the maximin designs for logistic random effects models with covariates, in New Trends Models in Statistical Modelling, The Proceedings of the 16 th International Workshop on Statistical Modelling, B. Klein & L. Korshom, ed., Odense, Denmark, pp. 321–328. Ouwens, J.N.M., Tan, F.E.S. & Berger, M.P.F. (2002). Maximin D-optimal designs for longitudinal mixed effects models, Biometrics 58, 735–741. Tan, F.E.S. & Berger, M.P.F. (1999). Optimal allocation of time points for the random effects model, Communications in Statistics: Simulations 28(2), 517–540. Wong, W.K. & Lachenbruch, P.A. (1996). Designing studies for dose response, Statistics in Medicine 15, 343–360.
MARTIJN P.F. BERGER
Optimal Scaling YOSHIO TAKANE Volume 3, pp. 1479–1482 in Encyclopedia of Statistics in Behavioral Science ISBN-13: 978-0-470-86080-9 ISBN-10: 0-470-86080-4 Editors Brian S. Everitt & David C. Howell John Wiley & Sons, Ltd, Chichester, 2005
Optimal Scaling Suppose that (dis)similarity judgments are obtained between a set of stimuli. The dissimilarity between stimuli may be represented by a Euclidean distance model (see Multidimensional Scaling). However, it is rare to find the observed dissimilarity data measured on a ratio scale (see Scales of Measurement). It is more likely that the observed dissimilarity data satisfy only the ordinal scale level. That is, they are only monotonically related to the underlying distances (see Monotonic Regression). In such cases, we may consider transforming the observed data monotonically, while simultaneously representing the transformed dissimilarity data by a distance model. This process of simultaneously transforming the data, and representing the transformed data, is called optimal scaling [2, 8]. Let δij denote the observed dissimilarity between stimuli i and j measured on an ordinal scale. Let dij represent the underlying Euclidean distance between the two stimuli represented as points in an Adimensional Euclidean space. Let xia denote the coordinate of stimulus i on dimension a. Then, dij 1/2 A 2 can be written as dij = . (We a=1 (xia − xj a ) use X to denote the matrix of xia , and sometimes write dij as dij (X) to explicitly indicate that dij is a function of X.) Optimal scaling obtains the best monotonic transformation (m) of the observed dissimilarities (δij ), and the best representation (dij (X)) of the transformed dissimilarity (m(δij )), in such a way that the squared discrepancy between them is as small as possible. Define the least squares criterion, φ = i,j
stimulus configuration. From this, we may deduce that the process mediating confusions among the signals is two-dimensional; one is the total number of components in the signals, and the other is the mixture rate of two kinds (dots and dashes) of components. Signals having more components tend to be located toward the top, and those having more dots tend to be located toward the left of the configuration. Figure 2 displays the optimal inverse monotonic transformation of the confusion probabilities. The derived optimal transformation looks very much like a negative exponential function, pij = a exp(−dij ), or possibly a Gaussian, pij = a exp(−dij2 ), typically found in stimulus generalization data. In the above example, the data involved are dissimilarity data, for which the distance model may be an appropriate choice. Other kinds of data may also be considered for optimal scaling. For example, the data may reflect the joint effects of two or more underlying factors. In this case, an analysis of variance-like additive model may be appropriate. As another example, preference judgments are obtained from a single subject on a set of objects (e.g., cars) characterized by a set of features (size, color, gas efficiency, roominess, etc.) In this case, a regression-like linear model that combines these features to predict the overall preference judgments, may be appropriate. As a third example, preference judgments are obtained from a group of subjects on a set of stimuli. In this case, a vector model of preference may be appropriate, in which subjects are represented as vectors, and stimuli as points in a multidimensional space, and subjects’ preferences are obtained by projecting the stimulus points onto the subject vectors. This leads to a principal component analysis-like bilinear model [7]. Alternatively, subjects’ ideal stimuli may be represented as (ideal) points, and it may be assumed that subjects’ preferences are inversely related to the distances between subjects’ ideal points and actual stimulus points. This is called unfolding (or ideal point) model [1] (see Multidimensional Unfolding). Any one of the models described above may be combined with various types of data transformations, depending on the scale level on which observed data are assumed to be measured. Different levels of measurement scale allow different types of transformations, while preserving the essential properties of the information represented by numbers. In psychology, four major scale levels have traditionally
2
Optimal Scaling s
Dashe
ts
2
3
Do
7
9 J Q
Y
6 4 X 5
P
Z O
C
F B L
V
G K W
H
Number of components
0
1
8
D R
U
M
S
A E
N
I T
Figure 1 A stimulus configuration obtained by nonmetric multidimensional scaling of confusion data [4] among Morse code signals
been distinguished: nominal, ordinal, interval, and ratio [6]. In the nominal scale level, only the identity of numbers is considered meaningful (i.e., x = y or x = y). Telephone numbers and gender (males and females coded as 1’s and 0’s) are examples of this level of measurement. In the nominal scale, any one-to-one transformation is permissible, since it preserves the identity (and nonidentity) between numbers. (This is called admissible transformation.) In the ordinal scale level, an ordering property of numbers is also meaningful (i.e., for x and y such that x = y, either x > y or x < y, but how much larger or smaller is not meaningful). An example of this level of measurement, is the rank number given to participants in a race. In the ordinal scale, any monotonic (or order-preserving) transformation is admissible. In the interval scale level, the difference between two numbers is also meaningful. A difference in temperature
measured on an interval scale can be meaningfully interpreted (e.g., the difference between yesterday’s temperature and today’s is such and such), but because the origin (zero point) in the scale is arbitrary as the temperature is measured in Celsius or Fahrenheit, their ratio is not meaningful. In the interval scale, any affine transformation (x = ax + b) is admissible. In the ratio scale level, a ratio between two numbers is also meaningful (e.g., temperature measured on the absolute scale where −273° C is set as the zero point). In the ratio scale, any similarity transformation (x = ax) is admissible. In optimal scaling, a specific transformation of the observed data is sought, within each class of the admissible transformations consistent with the scale level on which observed data are assumed to be measured. It is assumed that one of these transformations is tacitly applied in a data generation process. For
Optimal Scaling
3
80
Per Cent "Same" Judgments
70 60 50 40 30 20 10
0.0
Figure 2
0.5
1.0
1.5 2.0 Interpoint Distance
[4]
[5]
[6]
[7]
References
[2] [3]
3.0
3.5
An optimal data transformation of the confusion probabilities
example, if observed dissimilarity data are measured on an ordinal scale, the model prediction, dij , is assumed error-perturbed, and then monotonically transformed to obtain the observed dissimilarity data, δij . Optimal scaling reverses this operation by first transforming back δij to the error-perturbed distance by m, which is then represented by the distance model, dij (X).
[1]
2.5
Coombs, C.H. (1964). A Theory of Data, Wiley, New York. Gifi, A. (1990). Nonlinear Multivariate Analysis, Wiley, Chichester. Kruskal, J.B. (1964). Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis, Psychometrika 29, 1–27.
[8]
Rothkopf, E.Z. (1957). A measure of stimulus similarity and errors in some paired associate learning, Journal of Experimental Psychology 53, 94–101. Shepard, R.N. (1963). Analysis of proximities as a technique for the study of information processing in man, Human Factors 5, 19–34. Stevens, S.S. (1951). Mathematics, measurement, and psychophysics, in Handbook of Experimental Psychology, S.S. Stevens, ed., Wiley, New York, pp. 1–49. Tucker, L.R. (1960). Intra-individual and inter-individual multidimensionality, in Psychological Scaling: Theory and Applications, H. Guliksen & S. Messick, eds., Wiley, New York, pp. 155–167. Young, F.W. (1981). Quantitative analysis of qualitative data, Psychometrika 46, 357–388.
YOSHIO TAKANE
Optimization Methods ROBERT MICHAEL LEWIS
AND
MICHAEL W. TROSSET
Volume 3, pp. 1482–1491 in Encyclopedia of Statistics in Behavioral Science ISBN-13: 978-0-470-86080-9 ISBN-10: 0-470-86080-4 Editors Brian S. Everitt & David C. Howell John Wiley & Sons, Ltd, Chichester, 2005
Optimization Methods Introduction The goal of optimization is to identify a best element in a set of competing possibilities. Optimization problems permeate statistics. Consider two familiar examples: 1.
2.
Maximum likelihood estimation. Let y denote a random sample, drawn from a joint probability density function pθ , where θ ∈ . Then, a maximum likelihood estimate of θ is any element of at which the likelihood function, L(θ) = pθ (y), is maximal. Nonlinear regression. Let rθ (·) denote a family of regression functions, where θ ∈ . At input zi , we observe output yi . Then, a least squares estimate of θ is any element of at which the sum of the squared residuals,
2 yi − rθ (zi ) ,
(1)
i
is minimal. Formally, let f : X → denote a real-valued objective function that measures the quality of elements of the set X. We will discuss the minimization of f ; this is completely general, as maximizing f is equivalent to minimizing −f . Let C ⊆ X denote the feasible subset that specifies which possibilities are to be considered. Then the global optimization problem is the following: Find x∗ ∈ C such that f (x∗ ) ≤ f (x) for all x ∈ C. (2) Any such x∗ is a global minimizer, and f (x∗ ) is the global minimum. In most cases, global optimization is extremely difficult. If X is a topological space, that is, if it is meaningful to speak of neighborhoods of x ∈ X, then we can pose the easier local optimization problem: Find x∗ ∈ C and a neighborhood U of x∗ such that f (x∗ ) ≤ f (x) for all x ∈ C ∩ U .
(3)
Any such x∗ is a local minimizer and f (x∗ ) is a local minimum. Figure 1 illustrates local minimizers for functions of one and two variables. Figure 1(a) displays an objective function without constraints, i.e., with feasible subset C = X = . There are two local minimizers, indicated by asterisks. The local minimizer indicated by the asterisk on the right is also a global minimizer. Figure 1(b) displays an objective function with a square feasible subset. At each corner of the square, the function is decreasing, and one could find points with smaller objective function values by leaving the feasible subset. At each corner, however, the objective function is smaller at the corner than at nearby points within the square. Accordingly, each corner is a local constrained minimizer. We will emphasize problems in which X = n , a finite-dimensional Euclidean space, and neighborhoods are defined by Euclidean distance. In this setting, one can apply the techniques of analysis. For example, if f is differentiable and C = X = n , then a local minimizer must solve the stationarity equation ∇f (x) = 0. If f is convex and ∇f (x∗ ) = 0, then x∗ is a local minimizer. Furthermore, every local minimizer of a convex function is a global minimizer. If f is strictly convex, then f has at most one minimizer. The classical method of Lagrange multipliers provides necessary conditions for a local minimizer when C is defined by a finite number of equality constraints, that is, C = {x ∈ X: h(x) = 0}. The Karush–Kuhn–Tucker (KKT) conditions extend the classical necessary conditions to the case of finitely many equality and inequality constraints. These conditions rely on the concept of feasible directions. The feasible subset displayed in Figure 2(a) is the dark-shaded circular region. For the point indicated on the boundary of the region, the feasible directions are all directions that lie strictly inside the half-plane lying to the left of the tangent to the circular region at the indicated point. Along any such direction, it is possible to take a step (perhaps only a very short step) from the indicated point and remain in the feasible subset. To illustrate, let us examine one of the local constrained minimizers in Figure 1(b). In Figure 2(b), the shaded square region in the lower left hand corner is the feasible subset. The contour lines indicate different values of the objective function, extended beyond the feasible subset. The objective function
2
Optimization Methods 0.03
f (x) Local minima
0.02
0.01
0
−0.01
−0.02
−0.03
−0.04 −0.8 (a)
−0.6
−0.4
−0.2
0
0.2
0.4
0.6
0.8
0 −0.5 −1 −1.5 −2 −2.5 −3 −3.5 −4 −1 −0.5
−1 0 0
0.5 (b)
Figure 1
1
−0.5
0.5 1
Unconstrained and constrained local minimizers
is decreasing as one moves from the lower left hand to the upper right hand corner at (1, 1). A local constrained minimizer occurs at (1, 1) because no feasible direction at (1, 1) is a descent direction. This geometric idea is captured algebraically by the KKT conditions, which state that the direction of steepest descent can be written as a nonnegative combination of the normals pointing outward
from the constraints that hold at the local minimizer. In Figure 2(b), the direction of steepest descent is indicated by the solid arrow, while the normals to the binding constraints are indicated by dashed arrows. Occasionally, mathematical characterizations of minimizers yield simple conditions from which minimizers can be deduced analytically. More often
Optimization Methods
3
0.75 0.7 0.65 0.6 0.55 0.5 0.45 0.4 0.35 0.3 0.25 0.5 (a)
0.55
0.6
0.65
0.7
0.75
0.8
0.85
0.9
0.95
1
1.2 1.15 1.1 1.05 1 0.95 0.9 0.85 0.8 0.8 (b)
Figure 2
0.85
0.9
0.95
1
1.05
1.1
1.15
1.2
Feasible directions and the KKT conditions
in practice, the resulting conditions are too complicated for analytical deduction. In such cases, one must rely on iterative methods for numerical optimization.
Example Before reviewing several of the essential ideas that underlie iterative methods for numerical optimization,
4
Optimization Methods
we consider an example that motivates such methods. Bates and Watts [1] fit a nonlinear regression model based on Newton’s law of cooling to data collected by Count Rumford in 1798. This leads to an objective function of the form (1), 2 (4) yi − 60 − 70 exp (−θzi ) f (θ) = i
Squared error
where z measures cooling time and yi is the temperature observed at time zi . The goal is to find the value of the parameter θ ∈ that minimizes f . Negative values of θ are not physically plausible, suggesting that C = [0, ∞) is a natural feasible subset. The objective function (4) and its derivative are displayed in Figure 3. The location of the global minimizer, θ∗ , is indicated by a vertical dotted line. Because θ∗ > 0, that is, the global minimizer lies in the strict interior of the feasible subset, we see that the constraint θ ≥ 0 is not needed. The derivative of the objective function vanishes at θ∗ . In theory, θ∗ can be characterized as the unique solution of the stationarity equation f (θ) = 0; in practice, one cannot solve f (θ) = 0 to obtain a formula for θ∗ . The practical impossibility of deriving such formulae necessitates iterative methods for numerical optimization. How might an iterative method proceed? Suppose that we began by guessing that f might be small at θ0 = 0.03. The challenge is to improve on this guess. To do so, one might try varying
the value of θ in different directions, noting which values of θ result in smaller values of f . For example, f (0.03) = 4931.369. Upon incrementing θ0 by 0.01, we observe f (0.04) = 8463.455 > f (0.03); upon decrementing θ0 by 0.01, we observe f (0.02) = 1735.587 < f (0.03). This suggests replacing θ0 = 0.03 with θ1 = 0.02. The logic is clear, but it is not so clear how to devise an algorithm that is guaranteed to produce a sequence of values of θ that will converge to θ∗ . Alternatively, having guessed θ0 = 0.03, we might compute f (0.03) = 348599.6. Because f (0.03) > 0, to decrease f we should decrease θ. But by how much should we decrease θ? Again, it is not clear how to devise an algorithm that is guaranteed to produce a sequence of values of θ that will converge to θ∗ . One possibility, if θ0 is sufficiently near θ∗ , is to apply Newton’s method for finding roots to the stationarity equation f (θ) = 0. The remainder of this essay considers iterative methods for numerical optimization. Because there are many such methods, we are content to survey several key ideas that underlie unconstrained local optimization, constrained local optimization, and global optimization.
Unconstrained Local Optimization We begin by considering methods for solving (3) when C = X = n . Unless f is convex, there is no
15000
5000 0 −0.01
0.0
0.01
0.02
0.03
0.04
0.02
0.03
0.04
θ
0
−2 × 106
−0.01
0.0
0.01 θ
Figure 3 The objective function and its derivative for fitting a nonlinear regression model based on Newton’s law of cooling to data collected by Count Rumford
Optimization Methods practical way to guarantee convergence to a local minimizer. Instead, we seek algorithms that are guaranteed to converge to a solution of the stationarity equation, ∇f (x) = 0. Because effective algorithms for minimizing f take steps that reduce f (x), they generally do find local minimizers in practice and rarely converge to stationary points that are not minimizers. There are two critical concerns. First, can the algorithm find a solution when started from any point in n ? Algorithms that can are globally convergent. Second, will the algorithm converge rapidly once it nears a solution? Ideally, we prefer globally convergent algorithms with fast local convergence. Global convergence can be ensured through globalization techniques, which place mild restrictions on the steps the algorithm may take. Some methods perform line searches, requiring a step in the prescribed direction to sufficiently decrease the objective function. Other methods restrict the search to a trust region, in which it is believed that the behavior of the objective function can be accurately modeled. Generally, the more derivative information that an algorithm uses, the faster it will converge. If analytic derivatives are not available, then derivatives can be estimated using finite-differences [6] or complex perturbations [19]. Derivatives can also be computed by automatic differentiation techniques that operate directly on computer programs, applying the chain rule on a statement-by-statement basis [15, 8]. Most iterative methods for local optimization progress by constructing a model of the objective function in the neighborhood of a designated point (the current iterate), then using the model to help find a new point (the subsequent iterate) at which the objective function has a smaller value. It is instructive to classify these methods by the types of models that they construct. 0.
1.
Zeroth-order methods do not model the objective function. Instead, they simply compare values of the objective function at different points. Examples of such direct search methods include the Nelder–Mead simplex algorithm and the pattern search method of Hooke and Jeeves. The Nelder-Mead algorithm can be effective, but it can also fail, both in theory and in practice. In contrast, pattern search methods are globally convergent. First-order methods construct local linear models of the objective function. Usually, this is
2.
5
accomplished by using function values and first derivatives to construct the first-order Taylor polynomial at the current iterate. The classic example of a first-order method is the method of steepest descent, in which a line search is performed in the direction of the negative gradient. Second-order methods construct local quadratic models of the objective function. The prototypical example of a second-order method is Newton’s method, which uses function values, first, and second derivatives to construct the second-order Taylor polynomial at the current iterate, then minimizes the quadratic model to obtain the subsequent iterate. (Newton’s method for optimization is equivalent to applying Newton’s method for finding roots to the stationarity equation.) When second derivatives are not available, quasi-Newton methods (also known as variable metric methods or secant update methods) use function values and first derivatives to update second derivative approximations. Various update formulas can be used; the most popular is BFGS (for Broyden, Fletcher, Goldfarb, and Shanno – the researchers who independently suggested it).
Zeroth-order methods are reviewed in [10]. Most surveys of unconstrained local optimization methods, e.g., [3, 4, 13, 14], emphasize first- and second-order methods. The local linear and quadratic models used by these methods are illustrated in Figure 4. In Figure 4(a), the graph of the objective function is approximated by the line that is tangent at x = 1/2. This approximation suggests a descent direction; however, because it has no minimizer, it does not suggest the potential location of a (local) minimizer of the original function. When a linear approximation is used to search for the next iterate, it is necessary to control the length of the step, for example, by performing a line search. In Figure 4(b), the same graph is approximated by the quadratic curve that is tangent at x = 1/2. This approximation has a unique minimizer that suggests itself as the potential location of a (local) minimizer of the original function. However, because the quadratic approximation is only valid near the point where it was constructed (here x = 1/2), it is still necessary to insure that the step suggested by the approximation does, in fact, decrease the value of the objective function.
6
Optimization Methods 0.2
f ( x) Linear approximation Minimum of f (x) 0.15
0.1
0.05
0
−0.05
−0.1 −0.6
−0.4
−0.2
0
0.2
0.4
0.6
0.8
1
(a) 0.2
f (x ) Quadratic approximation Minimum of f (x) Minimum of approximation
0.15
0.1
0.05
0
−0.05
−0.1 −0.6 (b)
Figure 4
−0.4
−0.2
0
0.2
0.4
0.6
0.8
1
Local models of an objective function
Constrained Local Optimization Next we discuss several basic strategies for finding local minimizers when X = n , and the feasible set
C is a closed proper subset of X. In some cases, one can adapt unconstrained methods to accommodate constraints. For example, in seeking to minimize f with current iterate xk , the method of steepest descent
Optimization Methods would search for the next iterate, xk+1 , along the ray {xk − t∇f (xk ) : t ≥ 0} .
(5)
This is a gradient method. But what if the ray in the direction of the negative gradient does not lie in the feasible set? If C is convex, there is an obvious remedy. Given x ∈ X, let Pc (x) denote the point in C that is nearest x, the projection of x into C. Instead of searching in (5), a gradient projection method searches in {Pc (xk − t∇f (xc )) : t ≥ 0} . Because each iterate produced by gradient projection lies in C, this is a feasible-point method. Most algorithms for constrained local optimization proceed by solving a sequence of approximating optimization problems. One generic strategy is to approximate the constrained problem with a sequence of unconstrained problems. For example, suppose that x ∈ C if and only if x satisfies the equality constraints h1 (x) = 0 and h2 (x) = 0. To approximate this constrained optimization problem with an unconstrained problem, we remove the constraints, allowing all x ∈ X, and modify the objective function so that x ∈ C are penalized, e.g., fk (x) = f (x) + ρk (h1 (x) + h2 (x))2 ,
ρk > 0.
The modified objective function is used in searching for the next iterate, xk+1 . Typically, xk+1 ∈ C, so penalty function methods are infeasible-point methods. To ensure that {xk } will converge to a solution of the original constrained problem, ρk must be increased as optimization progresses. Another opportunity to approximate a constrained problem with a sequence of unconstrained problems arises when the feasible set of the constrained problem is defined by a finite number of inequality constraints. An interior point method starts at a feasible point in the strict interior of C, and, thereafter, produces a sequence of iterates that may approach the boundary of C (on which local minimizers typically lie), but remain in its strict interior. This is accomplished by modifying the original objective function, adding a penalty term that grows near the boundary of the feasible region. For example, suppose that x ∈ C if and only if x satisfies the inequality constraints c1 (x) ≥ 0 and c2 (x) ≥ 0. In the logarithmic barrier method, the modified objective function fk (x) = f (x) − µk (log c1 (x) + log c2 (x)) , µk > 0,
7
is used in searching for the next iterate, xk+1 . To ensure that {xk } will converge to a solution of the original constrained problem, µk must be decreased to zero as optimization progresses. In many applications, the feasible region C is defined by finite numbers of equality constraints (h(x) = 0) and inequality constraints (c(x) ≥ 0). There are important special cases of such problems for which specialized algorithms exist. The optimization problem is a linear program (LP) if both the objective function and the constraint functions are linear; otherwise, it is a nonlinear program (NLP). An NLP with a quadratic objective function and linear constraint functions is a quadratic program (QP). An inequality constraint cj (x) ≥ 0 is said to be active at x∗ if cj (x∗ ) = 0. Active set strategies proceed by solving a sequence of approximating equality-constrained optimization problems. The equality constraints in the approximating problem are the equality constraints in the original problem, together with those inequality constraints that are believed to be active at the solution and, therefore, can be treated as equality constraints. The best-known example of an active set strategy is the simplex method for linear programming. Active set strategies often rely on clever heuristics to decide which inequality constraints to include in the working set of equality constraints. Sequential quadratic programming (SQP) algorithms are among the most effective local optimization algorithms in general use. SQP proceeds by solving a sequence of approximating QPs. Each QP is typically defined by a quadratic approximation of the objective function being minimized, and by linear approximations of the constraint functions. The quadratic approximation is either based on exact second derivatives or on versions of the secant update approximations of second derivatives discussed in connection with unconstrained optimization. SQP is particularly well-suited for treating nonlinear equality constraints, and many SQP approaches use an active set strategy to produce equality-constrained QP approximations of the original problem. For further discussion of constrained local optimization techniques, see [4, 5, 6, 13, 14].
Global Optimization So far, we have emphasized methods that exploit local information, e.g., derivatives, to find local
8
Optimization Methods
minimizers. We would prefer to find global minimizers, but global information is often impossible to obtain. Convex programs are the exception: if f is a convex function and C is a convex set, then one can solve (2) by solving (3). Lacking convexity, one should embark on a search for a global minimizer with a healthy suspicion that the search will fail. Why is global optimization so difficult? One often knows that an objective function is well-behaved locally, so that information about f at xk conveys information about f in a neighborhood of xk . However, one rarely knows how to infer behavior in one neighborhood from behavior in another neighborhood. The prototypical global optimization problem is discrete, that is, C is a finite set, and there is no notion of neighborhood in X. Of course, if C is finite then (2) can be solved by direct enumeration: evaluate f at each x ∈ C and find the smallest f (x). This is easy in theory, but C may be so large that it takes centuries in practice. For instance, the example in Figure 1(b) can be generalized to a problem on the unit cube in n variables, and each of the 2n vertices of the cube is a constrained local minimizer. Thus, methods for global optimization tend to be concerned with clever heuristics for speeding up the search. Most of the best-known methods for global optimization were originally devised for discrete optimization, then adapted (with varying degrees of success) for continuous optimization. The claims made on behalf of these methods are potentially misleading. For example, a method may be ‘guaranteed’ to solve (2), but the guarantee usually means something like: if one looks everywhere in C, then one will eventually solve (2); or, if one looks long enough, then one will almost surely find something arbitrarily close to a solution. Such guarantees have little practical value [11]. Because it seems impossible to devise efficient ‘off-the-shelf’ methods for global optimization, we believe that methods customized to exploit the specific structure of individual applications should be used or developed whenever possible. Nevertheless, a variety of useful general strategies for global optimization are available. The most common benchmark for global optimization is the multistart strategy [17]. One defines neighborhoods, chooses a local search strategy, starts local searches from a number of different points, finds corresponding local minimizers, and declares the local minimizer at which the objective function is
smallest to be the putative global minimizer. The success of this strategy is highly dependent on how the initial iterates are selected: if one succeeds in placing an initial iterate in each basin of the objective function, then one should solve (2). When using multistart, it may happen that a number of initial iterates are placed in the same basin, resulting in wasteful duplication as numerous local searches all find the same local minimizer. To reduce such inefficiency, it would seem natural to use the results of previous local searches to guide the placement of the initial iterates for subsequent local searches. Such memory-based strategies are the hallmark of a tabu search, a meta-heuristic that guides local search strategies [7]. The concept of a tabu search unifies various approaches to global optimization. For example, bounding methods begin by partitioning C into regions for which lower and upper bounds on f are available. These bounds are used to eliminate regions that cannot contain the global minimizer; then, the same process is applied to each of the remaining regions. The bounds may be obtained in various ways. Lipschitz methods [9] require knowledge of a global constant K such that |f (x) − f (y)| ≤ K x − y for all x, y ∈ C. Then, having computed f (x), one deduces that f (x) − ε ≤ f (y) ≤ f (x) + ε for all y within distance ε/K of x. Interval methods [16] compute bounds by means of interval arithmetic, which uses information about the range of possible inputs to an arithmetic statement to bound the range of outputs. This bounding process can be performed on a statement-by-statement level inside a computer program. From the perspective of global optimization, one evident difficulty with a local search strategy is its inability to escape the basin in which it finds itself searching. One antidote to this difficulty is to provide an escape mechanism. This is the premise of simulated annealing. Suppose that xk is the current iterate and that the local search strategy has identified a trial iterate, xt , with f (xt ) > f (xk ). Alone, the local search strategy would reject xt and search for another trial iterate. Simulated annealing, gambling that perhaps xt lies in a deeper basin than xk , figuratively tosses a (heavily weighted) coin to decide whether or not to set xk+1 = xt . From a slightly different perspective [20], simulated annealing uses the Metropolis algorithm to construct certain nonstationary Markov chains/processes
Optimization Methods on C, the behavior of which lead (eventually) to solutions of (2). This is a prime example of a random search strategy, in which stochastic simulation is used to sample C. Other well-known examples include genetic algorithms and evolutionary search strategies [18]. The effectiveness of random search strategies depends on how efficiently they exploit information about the problem that they are attempting to solve. Usually, C is extremely large, so that pure random search (randomly sample from C until one gets tired) is doomed to failure. More intelligent random search strategies try to sample from the most promising regions in C.
References [1]
[2]
[3]
[4]
Conclusion The discipline of numerical optimization includes more topics than one can possibly mention in a brief essay. We have emphasized nonlinear continuous optimization, for which a number of sophisticated methods have been developed. The study of nonlinear continuous optimization begins with the general ideas that we have sketched; however, more specialized methods such as the EM algorithm for maximum likelihood (see Maximum Likelihood Estimation) estimation and Gauss–Newton methods for nonlinear least squares will be of particular interest to statisticians. Furthermore, to each assumption that we have made corresponds a branch of numerical optimization dedicated to developing methods that can be used when that assumption does not hold. For example, nonsmooth optimization is concerned with nonsmooth objective functions, multiobjective optimization is concerned with multiple objective functions, and stochastic optimization is concerned with objective functions that cannot be evaluated, only randomly sampled. Each of these subjects has its own literature. The reader who seeks to use the ideas described in this essay may discover that good software for numerical optimization is hard to find and harder to write. One attempt to address this situation is the network-enabled optimization system (NEOS) Server [2], whereby the user submits an optimization problem to be solved by state-of-the-art solvers running on powerful computing platforms. Many of the solvers available through NEOS are described in [12].
9
[5]
[6] [7] [8]
[9]
[10]
[11] [12]
[13] [14] [15]
[16]
[17]
Bates, D.M. & Watts, D.G. (1988). Nonlinear Regression Analysis and its Applications, John Wiley & Sons, New York. Czyzyk, J., Mesnier, M.P. & Mor´e, J.J. (1996). The Network-Enabled Optimization System (NEOS) Server, Preprint MCS-P615-0996, Argonne National Laboratory, Argonne. The URL for the NEOS server is . Dennis, J.E. & Schnabel, R.B. (1983). Numerical Methods for Unconstrained Optimization and Nonlinear Equations, Prentice-Hall, Englewood Cliffs. Republished by SIAM, Philadelphia, in 1996 as Volume 16 of Classics in Applied Mathematics. Fletcher, R. (2000). Practical Methods of Optimization, 2nd Edition, John Wiley & Sons, New York. Forsgren, A., Gill, P.E. & Wright, M.H. (2002). Interior methods for nonlinear optimization, Society for Industrial and Applied Mathematics Review 44(4), 525–597. Gill, P.E., Murray, W. & Wright, M.H. (1981). Practical Optimization, Academic Press, London. Glover, F. & Laguna, M. (1997). Tabu Search, Kluwer Academic Publishers, Dordrecht. Griewank, A. (2000). Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation, SIAM, Philadelphia. Frontiers in Applied Mathematics. Hansen, P. & Jaumard, B. (1995). Lipschitz optimization, in Handbook of Global Optimization, R., Horst & P.M., Pardalos, eds, Kluwer Academic Publishers, Dordrecht, pp. 407–493. Kolda, T.G., Lewis, R.M. & Torczon, V. (2003). Optimization by direct search: new perspectives on some classical and modern methods, Society for Industrial and Applied Mathematics Review 45(3), 385–482. Mathematical Programming A new algorithm for optimization, 3, 124–128. Mor´e, J.J. & Wright, S.J. (1993). Optimization Software Guide, Society for Industrial and Applied Mathematics, Philadelphia. Nash, S.G. & Sofer, A. (1996). Linear and Nonlinear Programming, McGraw-Hill, New York. Nocedal, J. & Wright, S.J. (1999). Numerical Optimization, Springer-Verlag, Berlin. Rall, L.B. (1981). Automatic Differentiation–Techniques and Applications, Vol. 120, of Springer-Verlag Lecture Notes in Computer Science Springer-Verlag, Berlin. Ratschek, H. & Rokne, J. (1995). Interval methods, in Handbook of Global Optimization, R., Horst & P.M., Pardalos, eds, Kluwer Academic Publishers, Dordrecht, pp. 751–828. Rinnooy Kan, A.H.G. & Timmer, G.T. (1989). Global optimization, in Optimization, G.L., Nemhauser, A.H.G., Rinnooy Kan & M.J., Todd, eds, Elsevier, Amsterdam, pp. 631–662.
10 [18]
Optimization Methods
Schwefel, H.-P. (1995). Evolution and Optimum Seeking, John Wiley & Sons, New York. [19] Squire, W. & Trapp, G. (1998). Using complex variables to estimate derivatives of real functions, Society for Industrial and Applied Mathematics Review 40(1), 110–112. [20] Trosset, M.W. (2001). What is simulated annealing?, Optimization and Engineering 2, 201–213.
(See also Markov Chain Monte Carlo and Bayesian Statistics) ROBERT MICHAEL LEWIS AND MICHAEL W. TROSSET
Ordinal Regression Models SCOTT L. HERSHBERGER Volume 3, pp. 1491–1494 in Encyclopedia of Statistics in Behavioral Science ISBN-13: 978-0-470-86080-9 ISBN-10: 0-470-86080-4 Editors Brian S. Everitt & David C. Howell John Wiley & Sons, Ltd, Chichester, 2005
Ordinal Regression Models A number of regression models for analyzing ordinal variables have been proposed [2]. We describe one of the most familiar ordinal regression models: the ordinal logistic model (see Logistic Regression) [6]. The ordinal logistic model is a member of the family of generalized linear models [4].
Generalized Linear Models Oftentimes, our response variable is not continuous. Instead, the response variable may be dichotomous, ordinal, or nominal, or may even be simply frequency counts. In these cases, the standard linear regression model (the General Linear Model ) is not suitable for several reasons [4]. First, heteroscedastic and nonnormal errors are almost certain to occur when the response variable is not continuous. Second, the standard linear regression model (see Multiple Linear Regression) will often predict values that are impossible. For example, if the response variable is dichotomous, the linear model will predict scores that are less than 0 or greater than 1. Third, the functional form specified by the linear model will often be incorrect. For example, it is unlikely that extreme values of an explanatory variable will have the same effect on the response variable than more moderate values. Consider a binary response variable Y (e.g., 0 = no and 1 = yes) and an explanatory variable X, which may be binary or continuous. In order to evaluate whether X has an effect on Y , we examine whether the probability of an event π(x) = P (Y = 1|X = x) is associated with X by means of the generalized linear model f (π(x)) = α + βx,
(1)
where f is a link function relating Y and X, α is an intercept, and β is a regression coefficient for X. Link functions specify the correct relationship between Y and X: linear, logistic, lognormal, as well as many others. Where there are more than k = 1 explanatory variables, the model can be written as f (π(x)) = α + β1 x1 + · · · + βk xk .
(2)
The two most common link functions for analyzing binary and ordinary response variables are the logit link, π f (π) = log , (3) (1 − π) and the complementary log–log link, f (π) = log(− log(1 − π)).
(4)
The logit link represents the logistic regression model since π(x) log = α + βx (5) 1 − π(x) when π(x) =
exp(α + βx) . (1 + exp(α + βx))
(6)
The complementary log–log link function is an extreme-value regression model since log(− log(1 − π(x))) = α + βx
(7)
π(x) = 1 − exp (− exp(α + βx)).
(8)
when
Other link functions include f (π) = µ, the link function for linear regression; f (π) = −1 (µ), the link function for probit regression; and f (π) = log(µ), the link function for Poisson regression (see Generalized Linear Models (GLM)).
The Ordinal Logistic Regression Model Now let Y be a categorical response variables with c + 1 ordered categories. For example, we might have the response variable Y of performance, with the ordered categories of Y as 1 = fail, 2 = marginally pass, 3 = pass, and 4 = pass with honors. Let πk (x) = p(Y = k|X = x) be the probability of observing Y = k given X = x, k = 0, 1, . . . , c. We can then formulate a model for ordinal response variable Y based upon the cumulative probabilities f (γk (x)) = p(Y ≥ k|X = x) = ak + βk x,
k = 1, . . . , c.
(9)
Instead of specifying one equation describing the relationship between Y and X, we specify k model
2
Ordinal Regression Models
equations and k regression coefficients to describe this relationship. Each equation specifies the probability of membership in one category versus membership in all other higher categories. This model can be simplified considerably if we assume that the regression coefficient does not depend on the value of the explanatory variable, in this case k = 1, 2, or 3, the equal slopes assumption of many models, including the ordinary leastsquares regression model and the analysis of covariance (ANCOVA) model. The cumulative probability model with equal slopes is f (γk (x)) = p(Y ≥ k|X = x) = ak + βx,
k = 1, . . . , c.
(10)
Thus, we assume that for a link function f the corresponding regression coefficients are equal for each cut-off point k; parallel lines are fit that are based on the cumulative distribution probabilities of the response categories. Each model has a different α but the same β. Different models result from the use of different link functions. If the logit link function is used, we obtain the ordinal logistic regression model: γk (x) = αk + βx log (1 − γk (x)) ⇔ γk (x) =
exp(αk + βx) . (1 + exp(αk + βx))
order of the categories is reversed. Second, the ordinal logistic regression model has the important property of collapsibility. The parameters (e.g., regression coefficients) of models that have the property of collapsibility do not change in sign or magnitude even if some of the response categories are combined [7]. A third advantage is the relative interpretability of the ordinal logistic regression coefficients: exp(β) is an invariant odds ratio over all category transitions, and thus can be used as a single summary measure to express the effect of an explanatory variable on a response variable, regardless of the value of the explanatory variable. Example Consider the following 2 × 5 contingency table (see Table 1) between party affiliation and political ideology, adapted from Agresti [1]. To model the relationship between party affiliation (the explanatory variable X, scored as Democratic = 0 and Republican = 1) and political ideology (the ordered response variable Y , scored from 1 = very liberal to 5 = very conservative), we used PROC LOGISTIC in SAS [5], obtaining these values for the αk and β: α1 = −2.47 α2 = 1.47 α3 = 0.24
(11)
Some writers refer to the ordinal logistic regression model using the logit link function as the proportional odds model [4]. Alternatively, if the complementary log–log link function is used, we have log(− log(γk (x))) = αk + βx ⇔ γk (x) = 1 − exp(− exp(αk + βx)). (12) Some writers refer to the ordinal logistic regression model using the complementary log–log link as the discrete proportional hazards model [3]. In practice, the results obtained using either the logistic or complementary log–log link functions are substantively the same, only differing in scale. Arguments can be made, however, for preferring the use of the logistic function over the complementary log–log function. First, only the sign – and not the magnitude – of the regression coefficients change if the
α4 = 1.07 β = .98.
(13)
The maximum likelihood estimate of β = .98 means that for any level of political ideology, the estimated odds that a Democrat’s response is in the liberal direction is exp(.98) = 2.65, or more than two and one-half times the estimated odds for Republicans. We can use the parameter values to calculate the cumulative probabilities for political ideology. The cumulative probabilities equal p(Y ≤ k|X = x) =
exp(αk + βx) 1 + exp(αk + βx)
(14)
For example, the cumulative probability for a Republican having a very liberal political ideology is p(Y ≤ 1|X = 0) =
exp(−2.47 + .98(0)) = .08, 1 + exp(−2.47 + .98(0)) (15)
3
Ordinal Regression Models Table 1
Relationship between political ideology and party affiliation Political ideology
Party affiliation Democratic Republican
Very liberal
Slightly liberal
Moderate
Slightly conservative
Very conservative
80 30
81 46
171 148
41 84
55 99
whereas the cumulative probability for a Democrat having a very liberal political ideology is
[3]
exp(−2.47 + .98(1)) = .18. 1 + exp(−2.47 + .98(1)) (16)
[4]
p(Y ≤ 1|X = 1) =
[5] [6]
References [7] [1] [2]
Agresti, A. (1996). An Introduction to Categorical Data Analysis., Wiley, New York. Ananth, C.V. & Kleinbaum, D.G. (1997). Regression models for ordinal responses: a review of methods and applications, International Journal of Epidemiology 26, 1323–1333.
McCullagh, P. (1980). Regression models for ordinal data (with discussion), Journal of the Royal Statistical Society: B 42, 109–142. McCullagh, P. & Nelder, J.A. (1989). Generalized Linear Models., Chapman & Hall, New York. SAS Institute (1992). SAS/STAT User’s Guide, Version 9, Vols. 1 and 2, SAS Institute, Inc, Cary. Scott, S.C., Goldberg, M.S. & Mayo, N.E. (1997). Statistical assessment of ordinal outcomes in comparative studies, Journal of Clinical Epidemiology 50, 45–55. Simonoff, J.S. (2003). Analyzing Categorical Data, Springer, New York.
SCOTT L. HERSHBERGER
Outlier Detection RAND R. WILCOX Volume 3, pp. 1494–1497 in Encyclopedia of Statistics in Behavioral Science ISBN-13: 978-0-470-86080-9 ISBN-10: 0-470-86080-4 Editors Brian S. Everitt & David C. Howell John Wiley & Sons, Ltd, Chichester, 2005
Outlier Detection Dealing with Outliers Outliers are a practical concern because they can distort commonly used descriptive statistics such as the sample mean and variance, as well as least squares regression; they can also result in low power relative to other techniques that might be used. Moreover, modern methods for detecting outliers suggest that they are common and can appear even when the range of possible values is limited. So, a practical issue is finding methods for dealing with outliers when comparing groups and studying associations among variables. Another and perhaps more basic problem is finding good outlier detection techniques. A fundamental criterion when judging any outlier detection rule is that it should not suffer from masking, meaning that the very presence of outliers should not render a method ineffective at finding them. Outlier detection rules based on the mean and variance suffer from masking because the mean, and especially the variance, can be greatly influenced by outliers. What is needed are measures of variation and location that are not themselves affected by outliers. In the univariate case, a boxplot rule is reasonably effective because it uses the interquartile range which is not affected by the smallest and largest 25% of the data. Rules based on the median and a measure of variation called the median absolute deviation are commonly used as well. (If M is the median of X1 , . . . , Xn , the median absolute deviation is the median of the values |X1 − M|, . . . , |Xn − M|.) Letting MAD represent the median absolute deviation, a commonly used method is to declare Xi an outlier if |Xi − M| > 2.24 (1) MAD/0.6745 (e.g., [3] and [4]). Detecting outliers among multivariate data is a substantially more complex problem, but effective methods are available (see Multivariate Outliers). To begin, consider two variables, say X and Y . A simple approach is to use some outlier detection method on the X values that avoids masking (such as a boxplot or the mad-median rule just described), and then do the same with the
Y values. This strategy is known to be unsatisfactory, however, because it does not take into account the overall structure of the data. That is, influential outliers can exist even when no outliers are detected among the X values, ignoring Y , and simultaneously, no outliers are found among the Y values, ignoring X. Numerous methods have been proposed for dealing with this problem, several of which appear to deserve serious consideration [3, 4]. Also see [1]. It is stressed, however, that methods based on the usual means and covariance matrix, including methods based on Mahalanobis distance, are known to perform poorly due to masking (e.g., [2]). One of the best-known methods among statisticians is based on something called the minimum volume ellipsoid (MVE) estimator. The basic strategy is to search for that half of the data that is most tightly clustered together based on its volume. Then the mean and covariance matrix is computed using this central half of the data only, the idea being that this avoids the influence of outliers. Finally, a generalization of Mahalanobis distance is used to check for outliers. A related method is based on what is called the minimum covariance determinant estimator [2, 3, 4]. These methods represent a major advance, but for certain purposes, alternative methods now appear to have advantages. One of these is based on a collection of projections of the data. Note that if all points are projected onto a line, yielding univariate data, a univariate outlier detection could be applied. The strategy is to declare any point an outlier if it is an outlier for any projection of the data. Yet another strategy that seems to have considerable practical value is based on the so-called minimum generalized variance (MGV) method. These latter two methods are computationally intensive but can be easily performed with existing software [3, 4]. Next, consider finding a good measure of location when dealing with a single variable. A fundamental goal is finding a location estimator that has a relatively small standard error. Under normality, the sample mean is optimal, but under arbitrarily small departures from normality, this is no longer true, and in fact the sample mean can have a standard error that is substantially larger versus many competitors. The reason is that the population variance can be greatly inflated by small changes in the tails of a
distribution toward what are called heavy-tailed distributions. Such distributions are characterized by an increased likelihood of outliers relative to the number of outliers found when sampling from a normal distribution. Moreover, the usual estimate of the standard error of the sample mean, √(s²/n), where s² is the usual sample variance and n is the sample size, is extremely sensitive to outliers. Because outliers can inflate the standard error of the sample mean, a strategy when dealing with this problem is to reduce or even eliminate the influence of the tails of a distribution. There are two general approaches. The first is to simply trim some predetermined proportion of the largest and smallest observations from the sample available, that is, use what is called a trimmed mean; the second is to use some measure of location that checks for outliers and removes or downweights them if any are found. In the statistical literature, the best-known example is a so-called M-estimator of location. Both trimmed means and M-estimators can be designed so that under normality, they have standard errors nearly as small as the standard error of the sample mean. But they have the advantage of yielding standard errors substantially smaller than the standard error of the sample mean as we move from a normal distribution toward distributions having heavier tails. In practical terms, power might be substantially higher when using a robust estimator.

An important point is that when testing hypotheses, it is inappropriate to simply apply methods for means to the data that remain after trimming or after outliers have been removed (e.g., [4]). The reason is that once extreme values are removed, the remaining observations are no longer independent under random sampling. When comparing groups based on trimmed means, there is a simple method for dealing with this problem, which is based in part on Winsorizing the data to get a theoretically sound estimate of the standard error (see Winsorized Robust Measures). To explain Winsorizing, suppose 10% trimming is done. So, the smallest 10% and the largest 10% of the data are removed and the trimmed mean is the average of the remaining values. Winsorizing simply means that rather than eliminate the smallest values, their values are reset to the smallest value not trimmed. Similarly, the largest values that were trimmed are now set equal to the largest value not trimmed. The sample variance, applied to the Winsorized values, is called the Winsorized variance, and theory indicates that it should be used when estimating the standard error of a trimmed mean.
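A minimal sketch of this recipe is given below for 20% trimming. It uses a commonly employed estimate of the standard error of a trimmed mean, the Winsorized standard deviation divided by (1 − 2γ)√n with γ the trimming proportion (see, e.g., [4]); the data are artificial and only NumPy is assumed.

import numpy as np

def trimmed_mean_se(x, gamma=0.2):
    """Trimmed mean and its standard error via the Winsorized variance."""
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    g = int(np.floor(gamma * n))            # number trimmed from each tail
    tmean = x[g:n - g].mean()               # trimmed mean
    # Winsorize: reset trimmed values to the most extreme values retained
    w = x.copy()
    w[:g] = x[g]
    w[n - g:] = x[n - g - 1]
    s_w = np.sqrt(w.var(ddof=1))            # Winsorized standard deviation
    se = s_w / ((1 - 2 * gamma) * np.sqrt(n))
    return tmean, se

data = np.array([8, 9, 10, 11, 12, 13, 14, 15, 16, 120])
print(trimmed_mean_se(data))                # the outlier 120 has little effect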
When using M-estimators, estimates of the standard error take on a complicated form, but effective methods (and appropriate software) are available. Unlike methods based on trimmed means, the more obvious methods for testing hypotheses based on M-estimators are known to be unsatisfactory with small to moderate sample sizes. The only effective methods are based on some type of bootstrap method [3, 4]. Virtually all of the usual experimental designs, including one-way, two-way, and repeated measures designs, can be analyzed.

When working with correlations, simple methods for dealing with outliers among the marginal distributions include Kendall's tau, Spearman's rho, and a so-called Winsorized correlation. But these methods can still be distorted by outliers, even when no outliers among the marginal distributions are found [3, 4]. There are, however, various strategies for taking the overall structure of the data into account. One is to simply eliminate (or downweight) any points declared outliers using a good multivariate outlier detection method. These are called skipped correlations. Another is to identify the centrally located points, as is done for example by the minimum volume ellipsoid estimator, and, ignoring the points not centrally located, compute a correlation coefficient based on the centrally located values. A related method is to look for the half of the data that minimizes the determinant of the covariance matrix, which is called the generalized variance. Both of these methods are nontrivial to implement, but software packages such as SAS, S-PLUS, and R have built-in functions that perform the computations (see Software for Statistical Analyses).

In terms of testing the hypothesis of a zero correlation, with the goal of establishing independence, again it is inappropriate to simply apply the usual hypothesis testing methods to the remaining data. This leads to using the wrong standard error, and if the problem is ignored, poor control over the probability of a Type I error results. There are, however, ways of dealing with this problem as well as appropriate software [3, 4]. One approach that seems to be particularly effective is a skipped correlation where a projection-type method is used to detect outliers, any outliers found are removed,
and Pearson's correlation is applied to the remaining data. When there are multiple correlations and the goal is to test the hypothesis that all correlations are equal to zero, replacing Pearson's correlation with Spearman's correlation seems to be necessary in order to control the probability of a Type I error.

As for regression, a myriad of methods has been proposed for dealing with the deleterious effects of outliers. A brief outline is provided here; more details can be found in [2, 3], and [4]. One approach, with many variations, is to replace the sum of squared residuals with some other function that reduces the effects of outliers. If, for example, the squared residuals are replaced by their absolute values, yielding the least absolute deviation estimator, protection against outliers among the Y values is achieved, but outliers among the X values can still cause serious problems. Rather than use absolute values, various M-estimators use functions of the residuals that guard against outliers among both the X and Y values. However, problems due to outliers can still occur [3, 4]. Another approach is to ignore the largest residuals when assessing the fit to data. That is, determine the slopes and intercept so as to minimize the sum of the squared residuals, with the largest residuals simply ignored. To ensure high resistance to outliers, a common strategy is to ignore approximately the largest half of the squared residuals. Replacing squared residuals with absolute values has been considered as well. Yet another strategy, called S-estimators, is to choose values for the parameters that minimize some robust measure of variation applied to the residuals.

There are also two classes of correlation-type estimators. The first replaces Pearson's correlation with some robust analog, which can then be used to estimate the slope and intercept. Consider p predictors, X1, . . . , Xp, and let τj be any correlation between Xj, the jth predictor, and Y − b1 X1 − · · · − bp Xp. The other general class of correlation-type estimators chooses the slope estimates b1, . . . , bp so as to minimize Σj |τ̂j|. Currently, the most common choice for τ is Kendall's tau, which (when p = 1) yields the Theil-Sen estimator. Skipped estimators are based on the strategy of first applying some multivariate outlier detection method, eliminating any points that are flagged
as outliers, and applying some regression estimator to the data that remain. A natural suggestion is to apply the usual least squares estimator, but this approach has been found to be rather unsatisfactory [3, 4]. To achieve a relatively low standard error and high power under heteroscedasticity, a better approach is to apply the Theil-Sen estimator after outliers have been removed. Currently, the two best outlier detection methods appear to be the projection-type method and the MGV (minimum generalized variance) method; see [3, 4]. E-type skipped estimators (where E stands for error term) look for outliers among the residuals based on some preliminary fit, remove (or downweight) the corresponding points, and then compute a new fit to the data. There are many variations of this approach.

Among the many regression estimators that have been proposed, no single method dominates in terms of dealing with outliers and simultaneously yielding a small standard error. However, the method used can make a substantial difference in the conclusions reached. Moreover, least squares regression can perform very poorly, so modern robust estimators would seem to deserve serious consideration. More details about the relative merits of these methods can be found in [3] and [4]. While no single method can be recommended, skipped estimators that combine a projection or MGV outlier detection technique with the Theil-Sen estimator appear to belong to the class of estimators that deserves serious consideration. One advantage of many modern estimators should be stressed: even when the error term has a normal distribution, but there is heteroscedasticity, they can yield standard errors that are tens and even hundreds of times smaller than the standard error associated with ordinary least squares.

Finally, as was the case when dealing with measures of location, any estimator that reduces the influence of outliers requires special hypothesis testing methods. But even when there is heteroscedasticity, accurate confidence intervals can be computed under fairly extreme departures from normality [3, 4]. Currently, the best hypothesis testing methods are based on a certain type of percentile bootstrap method. The computations are long and tedious, but they can be done quickly with modern computers.
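The skipped strategy is easy to sketch in code. The fragment below (Python/NumPy, simulated data) flags outliers, drops them, and then fits the Theil-Sen estimator, the median of all pairwise slopes. For simplicity the outlier check here is the univariate MAD-median rule applied to each variable, which is only a crude stand-in for the projection or MGV methods recommended above.

import numpy as np
from itertools import combinations

def mad_flags(v, cutoff=2.24):
    m = np.median(v)
    mad = np.median(np.abs(v - m))
    return np.abs(v - m) > cutoff * (mad / 0.6745)

def theil_sen(x, y):
    """Theil-Sen estimator: median of all pairwise slopes."""
    slopes = [(y[j] - y[i]) / (x[j] - x[i])
              for i, j in combinations(range(len(x)), 2) if x[j] != x[i]]
    b1 = np.median(slopes)
    b0 = np.median(y - b1 * x)              # intercept from median residual
    return b0, b1

def skipped_theil_sen(x, y):
    keep = ~(mad_flags(x) | mad_flags(y))   # drop flagged points, then fit
    return theil_sen(x[keep], y[keep])

rng = np.random.default_rng(0)
x = rng.normal(size=30)
y = 2 * x + rng.normal(size=30)
y[0] = 40                                   # plant one gross outlier
print(skipped_theil_sen(x, y))              # slope should stay near 2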
References

[1] Barnett, V. & Lewis, T. (1994). Outliers in Statistical Data, 3rd Edition, Wiley, New York.
[2] Rousseeuw, P.J. & Leroy, A.M. (1987). Robust Regression & Outlier Detection, Wiley, New York.
[3] Wilcox, R.R. (2003). Applying Contemporary Statistical Techniques, Academic Press, San Diego.
[4] Wilcox, R.R. (2004). Introduction to Robust Estimation and Hypothesis Testing, 2nd Edition, Academic Press, San Diego.
RAND R. WILCOX
Outliers
RAND R. WILCOX
Volume 3, pp. 1497–1498, in Encyclopedia of Statistics in Behavioral Science, Brian S. Everitt & David C. Howell, eds, John Wiley & Sons, Ltd, Chichester, 2005
Outliers

Outliers are values that are unusually large or small among a batch of numbers. Outliers are a practical concern because they can distort commonly used descriptive statistics such as the sample mean and variance. One result is that when measuring effect size using a standardized difference, large differences can be masked because of even one outlier (e.g., [4]). To give a rough indication of one reason why, 20 observations were randomly generated from two normal distributions with variances equal to 1. The means were 0 and 0.8. Estimating the usual standardized difference yielded 0.84. Then the smallest observation in the first group was decreased from −2.53 to −4, the largest observation in the second group was increased from 2.51 to 3, the result being that the difference between the means increased from 0.89 to 1.09, but the standardized difference decreased to 0.62; that is, what is generally considered to be a large effect size is now a moderate effect size. The reason is that the variances are increasing faster, in a certain sense, than the difference between the means. Indeed, with arbitrarily large sample sizes, large effect sizes (as originally defined by Cohen using a graphical point of view) can be missed when using means and variances, even with very small departures from normality (e.g., [4]). Another concern is that outliers can cause the sample mean to have a large standard error compared to other estimators that might be used. One consequence is that outliers can substantially reduce power when using any hypothesis testing method based on means, and more generally, any least squares estimator. In fact, even a single outlier can be a concern (e.g., [4, 5]). Because modern outlier detection methods suggest that outliers are rather common, as predicted by Tukey [3], methods for detecting and dealing with them have taken on increased importance in recent years. Describing outliers as unusually large or small values is rather vague, but there is no general agreement about how the term should be made more precise. However, some progress has been made in identifying desirable properties for outlier detection methods. For example, if sampling is from a normal distribution, a goal might be that the expected proportion of values declared outliers is relatively small. Another basic criterion is that an
outlier detection method should not suffer from what is called masking, meaning that the very presence of outliers inhibits a method's ability to detect them. Suppose, for example, a point is declared an outlier if it is more than two standard deviations from the mean. Further, imagine that 20 points are sampled from a standard normal distribution and that 2 additional points are added, both having the value 10 000. Then surely the value 10 000 is unusual, but generally these two points are not flagged as outliers using the rule just described. The reason is that the outliers inflate the mean and variance, particularly the variance, the result being that truly unusual values are not detected. Consider, for instance, the values 2, 2, 3, 3, 3, 4, 4, 4, 100 000, and 100 000. Surely, 100 000 is unusual versus the other values, but 100 000 is not declared an outlier using the method just described. The sample mean is X̄ = 20 002.5, the sample standard deviation is s = 42 162.38, and it can be seen that 100 000 − 20 002.5 < 2 × 42 162.38. What is needed are methods for detecting outliers that are not themselves affected by outliers, and such methods are now available (e.g., [1, 2, 5]). In the univariate case, the best-known approach is the boxplot, which uses the interquartile range. Methods based in part on the median are available and offer certain advantages [5]. Detecting outliers in multivariate data (see Multivariate Outliers) is a much more difficult problem. A simple strategy is, for each variable, to apply some univariate outlier detection method, ignoring the other variables that are present. However, based on the combination of values, a point can be very deviant even though its individual scores are not. That is, based on the overall structure of the data, a point might seriously distort the mean, for example, or it might seriously inflate the standard error of the mean, even though no outliers are found when examining the individual variables. However, effective methods for detecting multivariate outliers have been developed [2, 5].
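The masking example above can be reproduced directly; the sketch below (Python, assuming only NumPy) applies the two-standard-deviation rule and then a MAD-median rule to the same ten values.

import numpy as np

x = np.array([2, 2, 3, 3, 3, 4, 4, 4, 100000, 100000], dtype=float)

# Two-standard-deviation rule: masked by the outliers themselves
mean, sd = x.mean(), x.std(ddof=1)
print(np.abs(x - mean) > 2 * sd)               # flags nothing, not even 100000

# MAD-median rule: unaffected by the two extreme values
m = np.median(x)
mad = np.median(np.abs(x - m))
print(np.abs(x - m) > 2.24 * (mad / 0.6745))   # flags both values of 100000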
References

[1] Barnett, V. & Lewis, T. (1994). Outliers in Statistical Data, 3rd Edition, Wiley, New York.
[2] Rousseeuw, P.J. & Leroy, A.M. (1987). Robust Regression and Outlier Detection, Wiley, New York.
[3] Tukey, J.W. (1960). A survey of sampling from contaminated normal distributions, in Contributions to Probability and Statistics, I. Olkin, S.G. Ghurye, W. Hoeffding, W.G. Madow & H.B. Mann, eds, Stanford University Press, Stanford.
[4] Wilcox, R.R. (2001). Fundamentals of Modern Statistical Methods: Substantially Increasing Power and Accuracy, Springer, New York.
[5] Wilcox, R.R. (2004). Introduction to Robust Estimation and Hypothesis Testing, 2nd Edition, Academic Press, San Diego.
RAND R. WILCOX
Overlapping Clusters
MORVEN LEESE
Volume 3, pp. 1498–1500, in Encyclopedia of Statistics in Behavioral Science, Brian S. Everitt & David C. Howell, eds, John Wiley & Sons, Ltd, Chichester, 2005
Overlapping Clusters

Many standard techniques in cluster analysis find partitions that are mutually exclusive (see Hierarchical Clustering; k-means Analysis). In some applications, mutually exclusive clusters would be a natural requirement, but in others, especially in psychological and sociological research, it is plausible that cases should belong to more than one cluster simultaneously. An example might be market research in which an individual person belongs to several consumer groups, or social network research in which people belong to overlapping groups of friends or collaborators. In the latter type of application, the individuals appearing in the intersections of the clusters are often of most interest. The overlapping cluster situation differs from that of fuzzy clustering: in fuzzy clustering, cases cannot be assigned definitely to a single cluster, and cluster membership is described in terms of relative weights or probabilities of membership. However, overlapping and fuzzy clustering are similar in that overlapping clustering relaxes the normal constraint that cluster membership should sum to 1 over clusters, and fuzzy clustering relaxes the constraint that membership should take only the values of 0 or 1.
Clumping and the Bk Technique With unlimited overlap the problem is to achieve both an appropriate level of fit to the data and also a model that is simple to interpret. Two early examples of techniques that can allow potentially unlimited degrees of overlap are clumping and the Bk technique. Clumping [6] divides the data into two groups using a ‘cohesion’ function including a parameter controlling the degree of overlap. In the Bk technique [5], individuals are represented by nodes in a graph; pairs of nodes are connected that have a similarity value above some specified threshold. Each stage in a hierarchical clustering process finds the set of maximal complete subgraphs (the largest sets of individuals for which all pairs of nodes are connected), and these may overlap. The number of clusters can be restricted by choosing k such that a maximum of k − 1 objects belong to the overlap of any pair of clusters. Any clusters that have more than k − 1 objects in common are amalgamated. This method has been implemented in the packages CLUSTAN (version 3) and Clustan/PC.
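The graph-theoretic step underlying the Bk technique can be illustrated with a few lines of Python using the networkx package. This is only a sketch of the 'maximal complete subgraph' idea, not the CLUSTAN implementation; the similarity matrix and threshold below are invented for illustration.

import numpy as np
import networkx as nx

# Hypothetical symmetric similarity matrix for five individuals
S = np.array([[1.0, 0.9, 0.8, 0.2, 0.1],
              [0.9, 1.0, 0.7, 0.3, 0.2],
              [0.8, 0.7, 1.0, 0.6, 0.2],
              [0.1, 0.3, 0.6, 1.0, 0.8],
              [0.1, 0.2, 0.2, 0.8, 1.0]])
threshold = 0.5

G = nx.Graph()
G.add_nodes_from(range(len(S)))
for i in range(len(S)):
    for j in range(i + 1, len(S)):
        if S[i, j] >= threshold:            # connect sufficiently similar pairs
            G.add_edge(i, j)

# Maximal complete subgraphs (cliques); these may overlap
print(list(nx.find_cliques(G)))             # e.g., [[0, 1, 2], [2, 3], [3, 4]]

Here individual 2 belongs to two cliques, which is exactly the kind of overlap the Bk procedure then restricts by amalgamating clusters that share k − 1 or more members.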
Limited Overlap

Limited forms of overlap can be obtained from direct data clustering methods (see Two-mode Clustering) and pyramids. A particular type of direct data clustering method [4], sometimes known as the two-way joining or block method, clusters cases and variables simultaneously by reordering rows and columns of the data matrix so that similar columns and rows appear together and sets of contiguous rows (columns) form the clusters (see the package Systat for example). Overlapping clusters can be defined by extending a set of rows (columns) that forms a cluster, so that parts of adjoining clusters are incorporated. The pyramid is a generalization of the dendrogram (see Hierarchical Clustering) that can be used as a basis for clustering cases, since a cut through the pyramid at any given height gives rise to a set of ordered, overlapping clusters [3]. In both types of method, individuals can belong to at most two clusters.
Hierarchical Classes for Binary Data

A relatively widely used method is hierarchical classes [2], a two-mode method appropriate for binary attribute data that clusters both cases and attributes. It has been implemented as HICLASS software. The theoretical model consists of two hierarchical class structures, one for cases and one for attributes. Cases are grouped into mutually exclusive classes, called 'bundles', which in the underlying model have identical attributes. These can then be placed in a hierarchy reflecting subset/superset relations. Each case bundle has a corresponding attribute bundle. Above the lowest level are combinations of bundles, which may overlap. The model is fitted to data by optimizing a goodness of fit index for a given number of bundles (or rank), which has to be chosen by the investigator. Figure 1 shows a simple illustration of this for the hypothetical data shown in Table 1, where an entry is 1 if the person exhibits the quality and 0 otherwise (adapted from [7]).
Table 1 Hypothetical data matrix of qualities exhibited by various people. Adapted from Rosenberg, S., Van Mechelen, I. & De Boeck, P. (1996) [7]

             Successful  Articulate  Generous  Outgoing  Hardworking  Loving  Warm
Father            1           1          0         0          1          0      0
Boyfriend         0           0          1         0          1          1      1
Uncle             0           0          0         1          0          1      1
Brother           0           0          0         1          0          1      1
Mother            0           0          1         1          1          1      1
Me now            0           0          1         1          1          1      1
Ideal me          1           1          1         1          1          1      1
Figure 1 A hierarchical classes analysis of the hypothetical data in Table 1. The 'pure' clusters, called 'bundles', at the bottom of the hierarchies are joined by zigzag lines. Clusters above that level in the hierarchy may contain lower level clusters and hence overlap, for example, (mother, me now, and boyfriend) and (mother, me now, uncle, brother). [Node labels shown in the figure: Ideal me; Mother, me now; Father; Boyfriend; Uncle, brother; Successful, articulate; Generous; Outgoing; Hardworking; Loving, warm.]
Additive Clustering for Proximity Matrices

Additive clustering [1] fits a model to an observed proximity matrix (see Proximity Measures) such that the theoretical proximity between any pair of cases is the sum of the weights of those clusters containing that pair. The basic model has the reconstructed similarity between cases as

ŝij = Σ (k = 1 to m) wk pik pjk,    (1)

where pik = 1 if case i is in cluster k and 0 otherwise, and wk is a weight representing the 'salience' of cluster k. The goodness of fit is measured by the percentage of variance in the observed proximities explained by the model. Algorithms for fitting the values of wk and pik include the original ADCLUS software, with an improved algorithm MAPCLUS. Figure 2 illustrates the results of additive clustering in an analysis of data from a study of the social structure of an American monastery [1, 8]. The social relations between 18 novice monks were assessed in 'sociograms', in which the monks rated the highest four colleagues on four positive qualities and four negative qualities. The resulting similarity matrix was analyzed using MAPCLUS and plotted on a two-dimensional representation, in which individuals that are similar are situated close together.

Figure 2 Overlapping clusters of monks found by additive clustering of data on interpersonal relations using MAPCLUS, superimposed on a nonmetric multidimensional scaling plot. Numbers with arrows indicate the cluster numbers, ranked in order of their weights. Other numbers denote individual monks. Text descriptions based on external data ('Loyal opposition', 'Young turks', 'Outcasts') have been added. Adapted from Arabie, P. & Carroll, J.D. (1989) [1] and Sampson, S.F. (1968) [8]
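Model (1) is easy to evaluate once cluster memberships and weights are given, as the sketch below shows (Python/NumPy). The membership matrix and the weights are invented purely for illustration; estimating them from an observed proximity matrix is what ADCLUS and MAPCLUS do by maximizing the variance accounted for.

import numpy as np

# P[i, k] = 1 if case i belongs to cluster k (clusters may overlap)
P = np.array([[1, 0],
              [1, 0],
              [1, 1],
              [0, 1],
              [0, 1]])
w = np.array([0.6, 0.4])                    # cluster weights ('saliences')

# Reconstructed similarity: s_hat[i, j] = sum_k w_k * P[i, k] * P[j, k]
S_hat = (P * w) @ P.T
np.fill_diagonal(S_hat, 0)                  # self-similarities are not modeled
print(S_hat)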
Summary

Methods for identifying overlapping clusters are not usually included in standard software packages. Furthermore, their results are often complex to interpret. For these reasons, applications are relatively uncommon. Nevertheless, certain types of research questions demand their use, and for these situations there are several methods available in specialist software.

References

[1] Arabie, P. & Carroll, J.D. (1989). Conceptions of overlap in social structure, in Research Methods in Social Network Analysis, L.C. Freeman, D.R. White & A.K. Romney, eds, George Mason University Press, Fairfax, pp. 367–392.
[2] De Boeck, P. & Rosenberg, S. (1988). Hierarchical classes: model and data analysis, Psychometrika 53, 361–381.
[3] Diday, E. (1986). Orders and overlapping clusters by pyramids, in Multidimensional Data Analysis, J. De Leeuw, W. Heiser, J. Meulman & F. Critchley, eds, DSWO Press, Leiden.
[4] Hartigan, J.A. (1975). Clustering Algorithms, Wiley, New York.
[5] Jardine, N. & Sibson, R. (1968). The construction of hierarchic and non-hierarchic classifications, Computer Journal 11, 117–184.
[6] Needham, R.M. (1967). Automatic classification in linguistics, The Statistician 1, 45–54.
[7] Rosenberg, S., Van Mechelen, I. & De Boeck, P. (1996). A hierarchical classes model: theory and method with applications in psychology and psychopathology, in Clustering and Classification, P. Arabie, L.J. Hubert & G. De Soete, eds, World Scientific, Singapore, pp. 123–155.
[8] Sampson, S.F. (1968). A novitiate in a period of change: an experimental and case study of social relationships, Doctoral dissertation, Cornell University (University Microfilms No. 69-5775).
MORVEN LEESE
P Values
CHRIS DRACUP
Volume 3, pp. 1501–1503, in Encyclopedia of Statistics in Behavioral Science, Brian S. Everitt & David C. Howell, eds, John Wiley & Sons, Ltd, Chichester, 2005
P Values

The P value associated with a test statistic is the probability that a study would produce an outcome as deviant as the outcome obtained (or even more deviant) if the null hypothesis was true. Before the advent of high-speed computers, exact P values were not readily available, so researchers had to compare their obtained test statistics with tabled critical values corresponding to a limited number of P values, such as .05 and .01, the conventional significance levels. This fact may help to explain the cut and dried, reject/do not reject approach to significance testing that existed in the behavioral sciences for so long. According to this approach, a test statistic either was significant or it was not, and results that were not significant at the .05 level were not worthy of publication, while those that reached significance at that level were (see Classical Statistical Inference: Practice versus Presentation).

Statistical packages now produce P values as a matter of course, and this allows behavioral scientists to show rather more sophistication in the evaluation of their data. If one study yields p = .051 and another p = .049, then it is clear that the strength of evidence is pretty much the same in both studies (other things being equal). The arbitrary discontinuity in the P value scale that existed at .05 (and to a lesser extent at .01) is now seen for what it is: an artificial and unnecessary categorization imposed on a continuous measure of evidential support. The first sentence of the guideline of the American Psychological Association's Task Force on Statistical Inference [12] regarding hypothesis tests emphasizes the superiority of P values:

    It is hard to imagine a situation in which a dichotomous accept-reject decision is better than reporting an actual P value, or better still, a confidence interval. (page 599)
The P value associated with a test statistic provides evidence that is relevant to the question of whether a real effect is present or not present in the observed data. Many behavioral scientists seem to believe that the P value represents the probability that the null hypothesis of no real effect is true, given the observed data. This incorrect interpretation has been encouraged by many textbook authors [2, 3, 11].
As outlined above, the P value actually conveys information about the probability of the observed outcome on the assumption that the null hypothesis is true. The fact that the observed outcome of a study would be unlikely if the null hypothesis was true does not, in itself, imply that it is unlikely that the null hypothesis is true. However, the two conditional probabilities are related, and, other things being equal, the smaller the P value, the more doubt is cast on the truth of the null hypothesis [4].
One- and Two-tailed P Values

In the description of a P value above, reference was made to 'an outcome as deviant as the outcome obtained (or even more deviant)'. Here, the deviation is from the expected value of the test statistic under the null hypothesis. But what constitutes an equally deviant or more deviant observation depends on the nature of the alternative hypothesis. If a one-tailed test is being conducted, then the P value must include all the outcomes between the one obtained and the end of the predicted tail. However, if a two-tailed test is being conducted, then the P value must include all the outcomes between the obtained one and the end of the tail in which it falls, plus all the outcomes in the other tail that are equally or more extreme (but in the other direction). Suppose a positive sample correlation, r, is observed. If this is subjected to a one-tailed test in which a positive correlation had been predicted, then the P value would be the probability of obtaining a sample correlation from r to +1 under the null hypothesis that the population correlation was zero. If the same sample correlation is subjected to a two-tailed test, then the P value would be the probability of obtaining a sample correlation from r to +1 or from −r to −1 under the null hypothesis that the population correlation was zero. The two-tailed P value would be twice as large as the one-tailed P value.

The fact that the evidential value of a result appears to rest on a private and unverifiable decision to make a one-tailed prediction has led many behavioral scientists to turn their backs on one-tailed testing. A further argument against one-tailed tests is that, conducted rigorously, they require researchers to treat all results that are highly deviant in the nonpredicted direction as uninformative. There are particular problems associated with the calculation of P values for test statistics with discrete
sampling distributions. One hotly debated issue is how the probability of the observed outcome should be included in the calculation of a P value. It could contribute to the P value entirely or in part (see [1]). Another issue concerns the calculation of two-tailed P values in situations where there are no equally deviant or more deviant outcomes in the opposite tail. Is it reasonable in such circumstances to report the one-tailed P value as the two-tailed one or should the two-tailed P value be doubled as it is in the case of a continuously distributed test statistic [9]? These issues are not resolved to everyone’s satisfaction, as the cited sources testify.
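The relationship between one- and two-tailed P values can be made concrete with a small permutation computation for a sample correlation. The sketch below (Python/NumPy; the data are simulated) builds a null distribution for r by permuting one variable and then reads off both P values.

import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=25)
y = 0.4 * x + rng.normal(size=25)
r_obs = np.corrcoef(x, y)[0, 1]

# Null distribution of r obtained by permuting y
r_null = np.array([np.corrcoef(x, rng.permutation(y))[0, 1]
                   for _ in range(10000)])

p_one_tailed = np.mean(r_null >= r_obs)            # positive correlation predicted
p_two_tailed = np.mean(np.abs(r_null) >= abs(r_obs))
print(r_obs, p_one_tailed, p_two_tailed)           # two-tailed roughly twice one-tailed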
P Values and a False Null Hypothesis

When the null hypothesis is true, the sampling distribution of the P value is uniform over the interval 0 to 1, regardless of the sample size(s) used in the study. However, if the null hypothesis is false, then the obtained P value is largely a function of sample size, with an expected value that gets smaller as sample size increases. A null hypothesis is, by its nature, an exact statement, and, it has been argued, it is not possible for such a statement to be literally true [10]. For example, the probability that the means of any two existing populations are exactly equal, to the last decimal place, as the null hypothesis specifies, must be zero. Given this, some have argued that nothing can be learned from testing such a hypothesis, as any arbitrarily low P value can be obtained just by running more participants. Some defense can be offered against such arguments when true experimental manipulations are involved [7], but the validity of the argument cannot be denied when it is applied to comparisons of preexisting groups or investigations of the correlation between two variables in some population. What information can P values provide in such situations?

The conviction that a particular null hypothesis must be untrue on a priori grounds does not fully specify what the true state of the world must be. The conviction that two conditions have different population means does not in itself identify which has the higher mean. Similarly, the conviction that a population correlation is not zero does not specify whether the relationship is positive or negative. Suppose that a study has been conducted that leads to the rejection of the null hypothesis with some particular P value in favor of a two-tailed alternative hypothesis. In such circumstances, researchers usually go on to conclude that an effect exists in the direction indicated by the sample data. If this procedure is followed, then the probability that the test would lead to a conclusion in one direction given that the actual direction of the effect is in the other (a type 3 error [5, 6, 8]) cannot be as great as half the obtained P value. The P value, therefore, provides a useful index of the strength of evidence against a real effect in the opposite direction to the direction observed in the sample.
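Both claims at the start of this section are easy to check by simulation; the following sketch (Python, assuming NumPy and SciPy's two-sample t Test; the effect size of 0.3 and the sample sizes are arbitrary choices) shows that P is roughly uniform when the null hypothesis is true and that the typical P value shrinks with increasing sample size when it is false.

import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

def pvalues(delta, n, reps=2000):
    """Two-sample t-test P values for a true mean difference of delta."""
    return np.array([stats.ttest_ind(rng.normal(0, 1, n),
                                     rng.normal(delta, 1, n)).pvalue
                     for _ in range(reps)])

print(np.mean(pvalues(0.0, 20) < 0.05))    # about 0.05: P is uniform under H0
for n in (20, 80, 320):
    print(n, np.median(pvalues(0.3, n)))   # median P value falls as n grows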
References

[1] Armitage, P., Berry, G. & Matthews, J.N.S. (2002). Statistical Methods in Medical Research, 4th Edition, Blackwell Science, Oxford.
[2] Carver, R.P. (1978). The case against statistical significance testing, Harvard Educational Review 48, 378–399.
[3] Dracup, C. (1995). Hypothesis testing: what it really is, Psychologist 8, 359–362.
[4] Falk, R. & Greenbaum, C.W. (1995). Significance tests die hard: the amazing persistence of a probabilistic misconception, Theory and Psychology 5, 74–98.
[5] Harris, R.J. (1997). Reforming significance testing via three-valued logic, in What if there were no Significance Tests? L.L. Harlow, S.A. Mulaik & J.H. Steiger, eds, Lawrence Erlbaum Associates, Mahwah.
[6] Kaiser, H.F. (1960). Directional statistical decisions, Psychological Review 67, 160–167.
[7] Krueger, J. (2001). Null hypothesis significance testing: on the survival of a flawed method, American Psychologist 56, 16–26.
[8] Leventhal, L. & Huynh, C.-L. (1996). Directional decisions for two-tailed tests: power, error rates, and sample size, Psychological Methods 1, 278–292.
[9] Macdonald, R.R. (1998). Conditional and unconditional tests of association in 2 × 2 tables, British Journal of Mathematical and Statistical Psychology 51, 191–204.
[10] Nickerson, R.S. (2000). Null hypothesis significance testing: a review of an old and continuing controversy, Psychological Methods 5, 241–301.
[11] Pollard, P. & Richardson, J.T.E. (1987). On the probability of making Type I errors, Psychological Bulletin 102, 159–163.
[12] Wilkinson, L. and the Task Force on Statistical Inference. (1999). Statistical methods in psychology journals: guidelines and explanations, American Psychologist 54, 594–604.
CHRIS DRACUP
Page's Ordered Alternatives Test
SHLOMO SAWILOWSKY AND GAIL FAHOOME
Volume 3, pp. 1503–1504, in Encyclopedia of Statistics in Behavioral Science, Brian S. Everitt & David C. Howell, eds, John Wiley & Sons, Ltd, Chichester, 2005
Page’s Ordered Alternatives Test
Page's Test

Page's procedure [4] is a distribution-free (see Distribution-free Inference, an Overview) test for an ordered hypothesis with k > 2 related samples. It takes the form of a randomized block design, with k columns and n rows. The null hypothesis is

H0: m1 = m2 = · · · = mk.

This is tested against the alternative hypothesis

H1: mi ≤ mj for all i < j, and mi < mj for at least one i < j,

for i, j in 1, 2, . . . , k. The ordering of treatments, of course, is established prior to viewing the results of the study. Neave and Worthington [3] provided a Match test as a powerful alternative to Page's test.

Procedure

The data are ranked from 1 to k for each row. The ranks of each of the k columns are totaled. If the null hypothesis is true, the ranks should be evenly distributed over the columns, whereas if the alternative is true, the rank sums should increase with the column index.

Assumptions

It is assumed that the rows are independent and that there are no tied observations in a row. Average ranks are typically applied to ties.

Test Statistic

Each column rank sum is multiplied by the column index. The test statistic is

L = Σ (i = 1 to k) i Ri,    (1)

where i is the column index, i = 1, 2, 3, . . . , k, and Ri is the rank sum for the ith column. Because the rank sums are expected to increase directly as the column number increases, and the greater rank sums are multiplied by larger numbers, L is expected to be largest under the alternative hypothesis. The null hypothesis is rejected when L ≥ the critical value.

Large Sample Sizes

The mean of L is

µ = nk(k + 1)²/4,    (2)

and the standard deviation is

σ = sqrt[nk²(k + 1)(k² − 1)/144].    (3)

For a given α, the approximate critical region is

L ≥ µ + zσ + 1/2.    (4)

Monte Carlo simulations conducted by Fahoome and Sawilowsky [2] and Fahoome [1] indicated that the large sample approximation requires a minimum sample size of 11 for α = 0.05, and 18 for α = 0.01.

Example

Page's statistic is calculated with Samples 1 to 5 in Table 1, with n1 = n2 = n3 = n4 = n5 = 15. It is computed as a one-tailed test with α = 0.05. The rows are ranked, with average ranks assigned to tied observations (Table 2).

Table 1 Sample data

       Sample 1  Sample 2  Sample 3  Sample 4  Sample 5
  1       20        11         9        34        10
  2       33        34        14        10         2
  3        4        23        33        38        32
  4       34        37         5        41         4
  5       13        11         8         4        33
  6        6        24        14        26        19
  7       29         5        20        10        11
  8       17         9        18        21        21
  9       39        11         8        13         9
 10       26        33        22        15        31
 11       13        32        11        35        12
 12        9        18        33        43        20
 13       33        27        20        13        33
 14       16        21         7        20        15
 15       36         8         7        13        15

Table 2 Within-row ranks of the sample data

       Sample 1  Sample 2  Sample 3  Sample 4  Sample 5
  1        4         3         1         5         2
  2        4         5         3         2         1
  3        1         2         4         5         3
  4        3         4         2         5         1
  5        4         3         2         1         5
  6        1         4         2         5         3
  7        5         1         4         2         3
  8        2         1         3       4.5       4.5
  9        5         3         1         4         2
 10        3         5         2         1         4
 11        3         4         1         5         2
 12        1         2         4         5         3
 13      4.5         3         2         1       4.5
 14        3         5         1         4         2
 15        5         2         1         3         4
Total   48.5      47.0      33.0      52.5      44.0
The column sums are as follows: R1 = 48.5, R2 = 47.0, R3 = 33.0, R4 = 52.5, R5 = 44.0. The statistic L is the sum of iRi, where i = 1, 2, 3, 4, 5:

L = 1 × 48.5 + 2 × 47.0 + 3 × 33.0 + 4 × 52.5 + 5 × 44.0 = 48.5 + 94.0 + 99.0 + 210.0 + 220.0 = 671.5.

The large sample approximation is calculated with µ = 675 and σ = 19.3649. The approximate critical value is 675 + 1.64485(19.3649) + 0.5 = 707.352. Because 671.5 < 707.352, the null hypothesis cannot be rejected on the basis of the evidence from these samples.
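The computations in this example are easy to reproduce from the raw data; a sketch in Python follows, using SciPy's rankdata for the within-row midranks (the function name page_test is ours, not part of any package).

import numpy as np
from scipy.stats import rankdata, norm

def page_test(data, alpha=0.05):
    """Page's L statistic (1) with the large-sample approximation (2)-(4)."""
    data = np.asarray(data, dtype=float)        # n rows (blocks) by k columns
    n, k = data.shape
    ranks = np.apply_along_axis(rankdata, 1, data)   # midranks within each row
    R = ranks.sum(axis=0)                       # column rank sums R_1, ..., R_k
    L = float(np.sum(np.arange(1, k + 1) * R))
    mu = n * k * (k + 1) ** 2 / 4
    sigma = np.sqrt(n * k ** 2 * (k + 1) * (k ** 2 - 1) / 144)
    crit = mu + norm.ppf(1 - alpha) * sigma + 0.5
    return L, crit, L >= crit

data = np.array([[20, 11,  9, 34, 10],
                 [33, 34, 14, 10,  2],
                 [ 4, 23, 33, 38, 32],
                 [34, 37,  5, 41,  4],
                 [13, 11,  8,  4, 33],
                 [ 6, 24, 14, 26, 19],
                 [29,  5, 20, 10, 11],
                 [17,  9, 18, 21, 21],
                 [39, 11,  8, 13,  9],
                 [26, 33, 22, 15, 31],
                 [13, 32, 11, 35, 12],
                 [ 9, 18, 33, 43, 20],
                 [33, 27, 20, 13, 33],
                 [16, 21,  7, 20, 15],
                 [36,  8,  7, 13, 15]])
print(page_test(data))      # L = 671.5, critical value about 707.35: do not reject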
References

[1] Fahoome, G. (2002). Twenty nonparametric statistics and their large-sample approximations, Journal of Modern Applied Statistical Methods 1(2), 248–268.
[2] Fahoome, G. & Sawilowsky, S. (2000). Twenty nonparametric statistics, in Annual Meeting of the American Educational Research Association, SIG/Educational Statisticians, New Orleans.
[3] Neave, H.R. & Worthington, P.L. (1988). Distribution-Free Tests, Unwin Hyman, London.
[4] Page, E.B. (1963). Ordered hypotheses for multiple treatments: a significance test for linear ranks, Journal of the American Statistical Association 58, 216–230.
SHLOMO SAWILOWSKY AND GAIL FAHOOME
Paired Observations, Distribution Free Methods
VANCE W. BERGER AND ZHEN LI
Volume 3, pp. 1505–1509, in Encyclopedia of Statistics in Behavioral Science, Brian S. Everitt & David C. Howell, eds, John Wiley & Sons, Ltd, Chichester, 2005
Paired Observations, Distribution Free Methods

Nonparametric Analyses of Paired Observations

Given two random samples, X1, . . . , Xn and Y1, . . . , Yn, each with the same number of observations, the data are said to be paired if the first observation of the first sample is naturally paired with the first observation of the second sample, the second observation of the first sample is naturally paired with the second observation of the second sample, and so on. That is, there would need to be a natural one-to-one correspondence between the two samples. In behavioral studies, paired samples may occur in the form of a 'before and after' experiment. That is, individuals or some other unit of measurement would be observed on two occasions to determine if there is a difference in the value of some variable of interest. For example, researchers may want to determine if there is a drop in unwanted pregnancies following an intervention that consists of relevant educational materials being disseminated. It is possible to randomize (see Randomization) some communities or schools to receive this intervention and others not to, but an alternative design is one that uses each school or community as its own control. That is, one would make the intervention available to all the schools in the study, and then compare their rate of unwanted pregnancies before the intervention to the corresponding rate after the intervention. In this way, the variability across schools is effectively eliminated. Because comparisons are most precise when the maximum number of sources of extraneous variation is eliminated, it is true in general that paired observations can be used to eliminate those sources of extraneous variation that fall within the realm of variability across subjects. Note that actual differences may not exist between two populations, yet the presence of extraneous sources of sampling variation (confounding) may cause the illusion of such a difference. Conversely, true differences may be masked by the presence of extraneous factors [4].

Consider, for example, two types of experiment that could be conducted to compare two types of sunscreen. One method would be to select two different samples of subjects, one of which uses
sunscreen A, while the other uses sunscreen B. It may turn out, either by chance or for some more systematic reason such as self-selection bias, that most of the individuals who use sunscreen A are naturally less sensitive to sunlight. In such a case, a conclusion that individuals who received sunscreen A had less sun damage would not be attributable to the sunscreen itself. It could just as easily reflect the underlying differences between the groups. Randomization is one method that can be used to ensure the comparability of the comparison groups, but while randomization does eliminate self-selection bias and some other types of biases, it cannot, in fact, ensure that the comparison groups are comparable even in distribution [1]. Another method that can be used to compare the types of sunscreen is to select one sample of subjects and have each subject in this sample receive both sunscreens, but at different times. The unit of measurement would then be the combination of a subject and a time point, as opposed to the more common case in which the unit of measurement is the subject without consideration of the time point. One could, for example, randomly assign half of the subjects to receive sunscreen A first and sunscreen B later, whereas the other half of the subjects is exposed to the sunscreens in the reverse order. This is a classical crossover design, which leads to paired observations. Specifically, the two time points for a given subject are paired and can be compared directly. One concern with this design is the potential for carryover effects (see Carryover and Sequence Effects). Yet another design would consist of again selecting one sample of subjects and having each subject in this sample receive both sunscreens, but now the exposure to the sunscreens is simultaneous. It is not time that distinguishes the exposure of a given individual to a given sunscreen but rather the side of the face. For example, each subject could be randomized to either sunscreen A applied to the left side of the face and sunscreen B applied to the right side of the face, or sunscreen B applied to the left side of the face and sunscreen A applied to the right side of the face. After a specified length of exposure to the sun, the investigator would measure the amount of damage to each half of the face, possibly by recording a lesion count. This design also leads to paired observations, as the two sides of a given individual's face are paired.
If the side of the face to which sunscreen A was applied tended to be less damaged overall, regardless of whether this was the left side or the right side, then one could fairly confidently attribute this result to sunscreen A, and not to preexisting differences between the groups of subjects randomized to the pairings of sides of the face and sunscreens, because both sunscreens were applied to equally pigmented skin [4]. Note that if this design is used in a multinational study, then it might be prudent to stratify the randomization by nation. Not only could different nations have different levels of exposure to the sun (contrast Greenland, for example, to Ecuador), but also some nations mandate that drivers be situated in the left side of the car (for example, the United States), whereas other nations mandate that drivers be situated in the right side of the car (for example, the United Kingdom). Clearly, this will have implications for which side of the face has a greater level of exposure to the sun.

By using paired observations, Burgess [3] conducted a study to determine weight loss, body composition, body fat distribution, and resting metabolic rate in obese subjects before and after 12 weeks of treatment with a very-low-calorie diet (VLCD) and to compare hydrodensitometry with bioelectrical impedance analysis. The women's weights before and after the 12-week VLCD treatment were measured and compared. This is also a type of paired observations study. Likewise, Kashima et al. [8] conducted a study of parents of mentally retarded children in which a media-based program presented, primarily through videotapes and instructional manuals, information on self-help skill teaching. In this research study, paired observations were obtained. Before and after the training program, the Behavioral Vignettes Test was administered to the parents.

We see that paired observations may be obtained in a number of ways [4], as follows.

1. The same subjects may be measured before and after receiving some treatment.
2. Pairs of twins or siblings may be assigned randomly to two treatments in such a way that members of a single pair receive different treatments.
3. Material may be divided equally so that one half is analyzed by one method and one half is analyzed by the other.
The techniques that can be used to analyze paired observations can be classified as parametric or nonparametric on the basis of the assumptions they require for validity. A classical parametric statistical method, for example, is the paired t Test (see Catalogue of Parametric Tests). The nonparametric alternatives include Fisher’s randomization test, the McNemar Chi-square test, the sign test, and the Wilcoxon Signed-Rank test (see Distribution-free Inference, an Overview). The paired t Test is used to determine if there is a significant difference between two samples, but it exploits the natural pairing of the data to reduce the variability. Specifically, instead of considering the variability within the X and Y samples separately, we consider the difference di between the paired observations for the ith subject as the variable of interest and compute the variability of di . This will often lead to a reduction in variability. The paired t Test assumes or requires (for validity) the following. 1. Independence. 2. Normality. Each of these assumptions requires some elaboration. Regarding the first, we note that it is not required that the Xi ’s be independent of the Yi ’s. In fact, if they were, then this would defeat the purpose of the pairing. Rather, it is the differences, or the di ’s, that need to be independent of each other. Regarding the second assumption, we note that the normality of Xi and Yi are not required because the analysis depends on only the di ’s. However, it is required that the di ’s follow the normal distribution. The following example shows a case in which the X’s and Y’s are not normally distributed but the d’s are at least close. A researcher wanted to find out if he could get closer to a squirrel than the squirrel was to the nearest tree before the squirrel would start to run away [11]. The observations follow in Table 1. The normal QQ plots (see Probability Plots) of the X’s, Y’s, and d’s are shown in Figure 1, Figure 2, and Figure 3, respectively. The distribution of the distances from the squirrel to the person (X) appears to be reasonably normally distributed, but the distances from the squirrel to the tree (Y ) are far from being normally distributed. However, the differences of the paired observations (d) appear to meet the required normality condition.
Table 1 Distances (in inches) from person and from tree when squirrel started to run. Reproduced from Witmer, S. (2003). Statistics for the Life Sciences, 3rd Edition, Pearson Education, Inc. [11]
Observation   From person X   From tree Y   Difference d = X − Y
     1              81             137              −56
     2             178              34              144
     3             202              51              151
     4             325              50              275
     5             238              54              184
     6             134             236             −102
     7             240              45              195
     8             326             293               33
     9              60             277             −217
    10             119              83               36
    11             189              41              148
  Mean             190             118               72
    SD              89             101              148

Figure 1 QQ plot of variable X
Figure 2 QQ plot of variable Y
Figure 3 QQ plot of difference d
In general, the null hypothesis can often be expressed as follows:

H0: E[X|i] = E[Y|i].    (1)

That is, given the ith block, the expected values of the two distributions are the same. This conditional null hypothesis can be simplified by considering the differences:

H0: E(d) = 0.    (2)

To test this null hypothesis, we can use the one-sample t Test. The test statistic based on the t distribution is

t = d̄ / (sd/√n),  where  d̄ = (1/n) Σ (i = 1 to n) di  and  sd = sqrt[(1/(n − 1)) Σ (i = 1 to n) (di − d̄)²].    (3)

The rejection region is determined using the t distribution with n − 1 degrees of freedom, where n is the number of pairs.
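Applying (3) to the differences in Table 1 takes only a few lines of Python; SciPy's built-in paired test is shown alongside for comparison.

import numpy as np
from scipy import stats

x = np.array([81, 178, 202, 325, 238, 134, 240, 326, 60, 119, 189])  # from person
y = np.array([137, 34, 51, 50, 54, 236, 45, 293, 277, 83, 41])       # from tree
d = x - y

n = len(d)
t = d.mean() / (d.std(ddof=1) / np.sqrt(n))    # statistic (3)
p = 2 * stats.t.sf(abs(t), df=n - 1)           # two-tailed P value
print(t, p)
print(stats.ttest_rel(x, y))                   # same t statistic and P value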
If the pairing is ignored, then an analysis of differences in means is not valid, because it assumes that the two samples are independent when in fact they are not. Even when the pairing is considered, this t Test approach is not valid if either the normality assumption or the independence assumption is unreasonable, which is often the case. Hence, it is often appropriate to instead use nonparametric analyses.

Fisher introduced the first nonparametric test, the widely used Fisher randomization test, in his analysis of Darwin's data in 1935 [5] (see Randomization Based Tests). Darwin planted 15 pairs of plants, one cross-fertilized and the other self-fertilized, over four pots. The number of plant pairs varied from pot to pot. Darwin's experimental results are shown in Table 2. Fisher argued that under the null hypothesis, the cross- and self-fertilized plants in the ith pair have heights that are random samples from the same distribution, which means that each difference between cross- and self-fertilized plant heights could have appeared with a positive or negative sign with equal probability. There were 15 paired observations, leading to 15 differences, with Σ (i = 1 to 15) di = 314. Also, there were a total of 2^15 = 32 768 possible arrangements of signs with the 15 differences obtained. Assuming that all of these configurations are equally likely, only 863 of these arrangements give a total difference of 314 or more, thus giving a probability of 863/32 768 = 2.634%
Table 2 Darwin's results for the heights of cross- and self-fertilized plants of Zea mays, reported in 1/8ths of an inch. Reproduced from Fisher, R.A. (1935). The Design of Experiments, 1st Edition, Oliver & Boyd, London [5]

 Pot    Cross-fertilized   Self-fertilized   Difference
  I           188                139              49
               96                163             −67
              168                160               8
 II           176                160              16
              153                147               6
              172                149              23
 III          177                149              28
              163                122              41
              146                132              14
              173                144              29
              186                130              56
 IV           168                144              24
              177                102              75
              184                124              60
               96                144             −48
for that one-sided test [7]. That is, the P value is 0.026.

Another nonparametric technique that can be used to analyze paired observations is the Wilcoxon Signed-Rank test (see Wilcoxon–Mann–Whitney Test). This test can be used in any situation in which the d's are independent of each other and come from a symmetric distribution; the distribution need not be normal. The null hypothesis of 'no difference between populations' can be stated as:

H0: E(d) = 0.    (4)
The Wilcoxon Signed-Rank test is conducted as follows. Compute the differences between the two treatments for each experimental subject, rank the differences according to their magnitude (without regard for sign), and then reattach the sign to each rank. Finally, sum the signed ranks to obtain the test statistic W. Because this procedure is based on ranks, it does not require making any assumptions about the nature of the population. If there is no true difference between the means of the units within a pair, then the ranks associated with the positive changes should be similar in both number and magnitude to the ranks associated with the negative changes. So the test statistic W should be close to zero if H0 is true. On the other hand, W will tend to be a large positive number or a large negative number if H0 is not true. This is a nonparametric procedure, which means that it is valid in more situations than the parametric t Test is [2]. It is generally recommended that the Wilcoxon Signed-Rank test be used instead of the paired t Test if the normal distributional assumption does not hold or the sample size is small (for example, n < 30), but the comparison between these two analyses actually differs not only in the reference distribution (exact versus based on the normal distribution) (see Exact Methods for Categorical Data) but also with regard to the test statistic (mean difference versus mean ranks). This means that the usual recommendation is suspect for two reasons. First, given the availability of an exact test, it is hard to imagine justifying an approximation to that exact test on the basis of the goodness of the approximation [2]. Second, the t Test is not an approximation to the Wilcoxon test anyway. So a better recommendation would be to use an exact test whenever it is feasible to do so, regardless of how normally distributed the data appear to be, and to exercise discretion in the selection of the
test statistic. If interest is in the ranks, then use the Wilcoxon test. If the raw data are of interest, then use the exact version of the t Test (i.e., use the t Test statistic, but refer it to its exact permutation reference distribution, as did Fisher). These are not the only reasonable choices, as other scores, including normal scores, could also be used.

The sign test is another robust nonparametric test that can be used as an alternative to the paired t Test. The sign test gets its name from the fact that pluses and minuses, rather than numerical values, are considered, counted, and compared. This test is valid in any situation in which the d's are independent of each other. The sign test focuses on the median rather than the mean. So the null hypothesis is:

H0: the median difference is zero.    (5)

That is,

H0: p(+) = p(−) = 0.5.    (6)

An alternative way of stating the null hypothesis is:

H0: p(Xi > Yi) = p(Xi < Yi) = 0.5.    (7)

So, the sign test is distribution free. Its validity does not depend on any conditions about the form of the population distribution of the d's. The test statistic is the number of plus signs, which follows the binomial distribution (see Catalogue of Probability Density Functions) with parameters n (the sample size excluding ties) and p = 0.5 (the null probability) if H0 is true.

The McNemar Chi-square test studies paired observations on the basis of a dichotomous variable, and is often used in case-control studies. The data can be displayed as a 2 × 2 table as follows.

                        Control outcome
                          +        −
Case outcome    +         a        b
                −         c        d

The test statistic is

McNemar χ² = (|b − c| − 1)² / (b + c).

Under the null hypothesis, this test statistic follows the χ² distribution with one degree of freedom.

Besides these classical techniques, some more modern techniques have been developed on the basis of paired observations. Wei proposed a test for interchangeability with incomplete paired observations [10]. Gross and Lam [6] studied paired observations from a survival distribution and developed hypothesis tests to investigate the equality of mean survival times when the observations come in pairs, for example, the length of a tumor remission when a patient received a standard treatment versus the length of a tumor remission when the same patient, at a later time, received an experimental treatment. Lipscomb and Gray studied a connection between paired data analysis and regression analysis for estimating sales adjustments [9].
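The Fisher randomization test described earlier can be carried out by brute force on Darwin's 15 differences, and the sign test reduces to a binomial calculation; a sketch, assuming NumPy and SciPy, follows.

import numpy as np
from itertools import product
from scipy.stats import binom

d = np.array([49, -67, 8, 16, 6, 23, 28, 41, 14, 29, 56, 24, 75, 60, -48])

# Fisher's randomization test: flip signs in every one of the 2^15 ways
obs = d.sum()                                              # 314
count = sum(1 for signs in product([1, -1], repeat=15)
            if np.dot(signs, np.abs(d)) >= obs)
print(count, count / 2 ** 15)                              # 863, about 0.026

# Sign test: 13 of the 15 differences are positive
n_pos = int((d > 0).sum())
print(binom.sf(n_pos - 1, len(d), 0.5))                    # P(at least 13 of 15)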
References

[1] Berger, V.W. & Christophi, C.A. (2003). Randomization technique, allocation concealment, masking, and susceptibility of trials to selection bias, Journal of Modern Applied Statistical Methods 2(1), 80–86.
[2] Berger, V.W., Lunneborg, C., Ernst, M.D. & Levine, J.G. (2002). Parametric analyses in randomized clinical trials, Journal of Modern Applied Statistical Methods 1, 74–82.
[3] Burgess, N.S. (1991). Effect of a very-low-calorie diet on body composition and resting metabolic rate in obese men and women, Journal of the American Dietetic Association 91, 430–434.
[4] Daniel, W.W. (1999). Biostatistics: A Foundation for Analysis in the Health Sciences, 7th Edition, John Wiley & Sons, New York.
[5] Fisher, R.A. (1935). The Design of Experiments, 1st Edition, Oliver & Boyd, London.
[6] Gross, A.J. & Lam, C.F. (1981). Paired observations from a survival distribution, Biometrics 37(3), 505–511.
[7] Jacquez, J.A. & Jacquez, G.M. (2002). Fisher's randomization test and Darwin's data – a footnote to the history of statistics, Mathematical Biosciences 180, 23–28.
[8] Kashima, K.J., Baker, B.L. & Landen, S.J. (1988). Media-based versus professionally led training for parents of mentally retarded children, American Journal on Mental Retardation 93, 209–217.
[9] Lipscomb, J.B. & Gray, J.B. (1995). A connection between paired data analysis and regression analysis for estimating sales adjustments, The Journal of Real Estate Research 10(2), 175–183.
[10] Wei, L.J. (1983). Test for interchangeability with incomplete paired observations, Journal of the American Statistical Association 78, 725–729.
[11] Witmer, S. (2003). Statistics for the Life Sciences, 3rd Edition, Pearson Education, Inc.
VANCE W. BERGER
AND
ZHEN LI
Panel Study JAAK BILLIET Volume 3, pp. 1510–1511 in Encyclopedia of Statistics in Behavioral Science ISBN-13: 978-0-470-86080-9 ISBN-10: 0-470-86080-4 Editors Brian S. Everitt & David C. Howell John Wiley & Sons, Ltd, Chichester, 2005
Panel Study In a typical cross-section survey, people are interviewed just once. By contrast, in a panel study, exactly the same people are interviewed repeatedly at multiple points in time. Panel surveys are much more suitable for the study of attitude change and for the study of causal effects than repeated crosssections or surveys with retrospective questions. Statistical analysis of turn-over tables that emerge from repeated interviewing of the same respondents may reveal the amount of individual-level attitude change that occurred between the waves of the survey. This individual level change is unobservable in repeated cross-sections [12]. The strength of panel studies is eminently illustrated by the Intergenerational Panel Study of Parents and Children (IPS) of the University of Michigan. This is an eight wave 31-year intergenerational panel study (1961–1993) of young men, young women, and their families [1]. These data are ideally suited to study the relationship between attitudes, family values, and behavior for several reasons. The data are longitudinal and span the entire lives of the children, allowing insights into causation by relying on temporal ordering of attitudes and subsequent behavior. Moreover, these data include measures of children’s attitudes and their mother’s attitudes in multiple domains at crucial moments during their life. Further, the data contain measures of children’s subsequent cohabitation, marriage, and childbearing patterns, as well as their education and work behavior. Finally, the data contain detailed measurement of the families in which the children were raised [2]. Whereas panel studies provide unique opportunities, there are also dark sides. Long-term panel studies are complicated by the expense and difficulty of finding respondents who have moved in intervening years (location of respondents). Sometimes, there is a considerable amount of nonresponse at the first wave, which may rise to over 30% because of the long-term engagement that respondents are unwilling to make. However, more frequently, problems occur during the second or later waves. Drop out from the original sample units in subsequent waves may lead to considerable bias. For example, in three waves of the Belgian Election surveys (1991–1999), it is shown that the nonresponse in the second and third waves of the panel is not random and can be
predicted by some background characteristics, task performance [8], and attitudes of both interviewers and respondents in previous waves [9]. In this panel survey, we found that the less interested and more politically alienated voters were more likely to drop out and that, consequently, the population estimates on the willingness to participate in future elections were strongly biased toward increasing willingness to participate over time. Simple assessment of the direction of nonresponse bias can be done by comparing the first-wave answers given by respondents who were successfully reinterviewed to the first-wave answers of those who drop out. But this is only a preliminary step. In current research, much of the attention is paid to the nature of nonignorable nonresponse [7] and to statistical ways of correcting the estimates [3]. Refreshment of the panel sample with new respondents in subsequent waves [4] and several weighting procedures and imputation methods are proposed in order to adjust the statistical estimates for attrition bias [3, 10] (see Missing Data). The availability of a large amount of information about the nonrespondents in later waves of a panel is certainly a major advantage of the panel over ‘fresh’ nonresponse. Designers of longitudinal surveys should seriously consider adding variables that are useful to predict location, contact difficulty, and cooperation propensity in order to improve the understanding of processes that produce nonresponse in later waves of panel surveys. This may also lead to the reduction of nonresponse rates and to more effective adjustment procedures [7]. High costs for locating respondents who have moved, wrong respondent selection, and the problems related to panel attrition are not the only deficits of panels. Panel interviews are repeated measurements that are not independent at the individual level. The high expectations regarding the methodological strength of panel studies for analyzing change are somewhat tempered by this. Panel studies within quasi-experimental designs, aimed at studying the effects of a specific independent variable (or intervention), are confronted with the presence of unobserved intervening variables, or by the effect of the first observation on subsequent measurement. Some of these challenges, but not all, can be solved by an appropriate research design like the ‘simulated pretest–posttest design’ [5]. In the case of a series of repeated measurements over time, confusion is possible about the time order of a causal effect, if no
stringent assumptions are made about the duration of an effect. The unreliability of the measurements is another problem that needs consideration by those who are interested in change. Even a small amount of random error in repeated measurements may lead to faulty conclusions about substantial change in turnover tables even when nothing has changed [5, 6]. In order to distinguish unreliability (random error) from real (systematic) change in the measurements, at least three wave panels are necessary unless repeated measurements within each wave are used [11].
References [1]
[2]
[3]
[4]
Axinn, W.G. & Arland, T. (1996). Mothers, children and cohabitation: the intergenerational effects of attitudes and behavior, American Sociological Review 58(2), 233–246. Barber, J.S., Axinn, W.G. & Arland, T. (2002). The influence of attitudes on family formation processes, in Meaning and Choice-Value Orientations and Life Course Decisions, Monograph 37, R. Lesthaeghe, ed., NIDI-CBGS Publications, The Hague and Brussels, pp. 45–95. Brownstone, D. (1998). Multiple imputation methodology for missing data, non-random response, and panel attrition, in Theoretical Foundations of Travel Choice Modeling, T. G¨arling, I. Laitila & K. Westin, eds, Elsevier Biomedical, Amsterdam, pp. 421–450. Browstone, D., Thomas, F.G. & Camilla, K. (2002). Modeling nonignorable attrition and measurement error in panel surveys: an application to travel demand modeling, in Survey Nonresponse, R.M. Groves, D.A. Dillman, J.L. Eltinge & R.J.A. Little, eds, John Wiley & Sons, New York, pp. 373–387.
[5]
Campbell, D.T. & Stanley, J.C. (1963). Experimental and Quasi-Experimental Designs for Research, Rand McNally, Chicago. [6] Hagenaars, J.A. (1975),. Categorical Longitudinal Data: Loglinear Panel, Trend, and Cohort Analysis, Sage Publications, Newbury Part. [7] Lepkowski, J.M. & Mick P.C. (2002). Nonresponse in the second wave of longitudinal household surveys, in Survey Nonresponse, R.M. Groves, D.A. Dillman, J.L. Eltinge & R.J.A. Little, eds, John Wiley & Sons, pp. 259–272. [8] Loosveldt, G. & Carton, A. (2000). An empirical test of a limited model for panel refusals, International Journal of Public Opinion Research 13(2), 173–185. [9] Pickery, J., Loosveldt, G. & Carton, A. (2001). The effects of interviewer and respondent characteristics on response behavior in panel surveys, Sociological Methods and Research 29(4), 509–523. [10] Rubin, D.B. & Elaine, Zanutto (2002). Using matched substitutes to adjust for nonignorable nonresponse through multiple imputations, in Survey Nonresponse, R.M. Groves, D.A. Dillman, J.L. Eltinge & R.J.A. Little, eds, John Wiley & Sons, New York, pp. 389–402. [11] Saris, W.E. & Andrews, F.M. (1991). Evaluation of measurement instruments using a structural equation approach, in Measurement Errors in Surveys, P.P. Biemer, R.M. Groves, L.E. Lyberg, N.A. Mathiowetz & S. Sudman, eds, John Wiley & Sons, New York, pp. 575–599. [12] Weinsberg, H.F., Krosnick J.A. & Bowen B.D. (1996). An Introduction to Survey Research, Polling, and Data Analysis, 3rd Edition, Sage Publications, Thousand Oaks.
JAAK BILLIET
Paradoxes VANCE W. BERGER
AND
VALERIE DURKALSKI
Volume 3, pp. 1511–1517 in Encyclopedia of Statistics in Behavioral Science ISBN-13: 978-0-470-86080-9 ISBN-10: 0-470-86080-4 Editors Brian S. Everitt & David C. Howell John Wiley & Sons, Ltd, Chichester, 2005
Paradoxes Although scientists have spent centuries defining mathematical systems, self-contradictory conclusions continue to arise in certain scenarios. Why, for example, do larger data sets sometimes give opposite conclusions from smaller data sets, and could the larger data sets be incorrect? How can a nonsmoker’s risk of heart disease be lower overall if a particular smoker who exercises fares better than a particular nonsmoker who exercises, and a particular smoker who does not exercise fares better than a particular nonsmoker who does not exercise? How is it that various baseline adjustment methods can give different results, or that the adjustment applied to artificially small P values because of multiplicity can result in even smaller P values? How can a data set have a low kappa statistic (see Rater Agreement – Kappa) and yet have high values of observer agreement? How can (almost) each element in a population be above the population mean? Below are some of the more common paradoxes encountered in observational and experimental studies and explanations of why they may occur in actual practice.
Simpson’s Paradox Baker and Kramer [1] provide the following hypothetical example. Among males, there are 120/200 (60%) responses to A and 20/50 (40%) responses to B, while among females there are 95/100 (95%) responses to A and 221/250 (85%) responses to B. Overall, then, there are 215/300 (72%) responses to A and 241/300 (80%) responses to B, so females respond better than males to either treatment, and A is better than B in each gender, yet B appears to be better than A overall. This apparent paradox occurs because gender is associated with both treatment and outcome. Specifically, women have both a higher survival rate than men and a better chance to receive Treatment B, so ignoring gender makes Treatment B look better than Treatment A, when in fact it is not. A reversal of the direction of an association between two variables when a third variable is controlled is referred to as Simpson’s paradox (see Two by Two Contingency Tables) [25], and is commonly found in both observational and experimental studies
that partition data. The implications of this paradox are that different ways of partitioning the data can produce different correlations that appear to be discordant with the initial correlations. Once the invalidity is revealed, it can give evidence to a causal relationship. The Birth To Ten (BTT) Study, conducted in South Africa, provides one example of the potential for intuitive reasoning to be incorrect [20]. A birth cohort was formed to identify cardiovascular risk factors in children living in an urban environment in South Africa. A survey to collect information on healthrelated issues was performed when the children were 5 years of age. One measured association is between the presence/absence of medical aid in the cohort that responded at 5 years and a comparative cohort that had children but was not in the initial survey (‘No Trace’ cohort). The overall contingency table of the presence or absence of medical aid between the two cohorts shows with statistical significance that the probability of the 5-year cohort having medical aid is lower than the probability of the ‘Not Traced’ cohort having medical aid. However, when the population is partitioned by race, the reverse is found. There is a higher probability of having medical aid for the 5-year cohort in each racially homogeneous population, which is contradictory to the conclusion based on the overall population. The contradiction is referred to as the ‘reversal of inequalities’ or ‘Simpson’s paradox’, and appears in the hypothetical data presented by Heydtmann [12], the BBT data set [20], and other scenarios presented in the medical literature [2, 22, 24]. It can also be shown that even if the numbers uniformly increase (retaining all relative proportions), the contradictions observed in these fractions remain constant [17]. This means that the best method for avoiding Simpson’s paradox is not to increase the sample size but rather to carefully research the current literature when designing your study and consult with experts in the therapeutic field to define appropriate outcomes and prognostic variables, and tabulate the data within each level of each key predictor.
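The arithmetic of the Baker and Kramer example can be checked directly. The short Python sketch below simply recomputes the within-gender comparison and the pooled response proportions from the counts quoted above; no new data are introduced.

```python
# responses / totals from the hypothetical Baker & Kramer example
counts = {
    ("male", "A"): (120, 200), ("male", "B"): (20, 50),
    ("female", "A"): (95, 100), ("female", "B"): (221, 250),
}

# within each gender, A has the higher response rate
for sex in ("male", "female"):
    rate_a = counts[(sex, "A")][0] / counts[(sex, "A")][1]
    rate_b = counts[(sex, "B")][0] / counts[(sex, "B")][1]
    print(sex, "A better than B within group:", rate_a > rate_b)

# pooling over gender reverses the comparison (Simpson's paradox)
for trt in ("A", "B"):
    s = sum(counts[(sex, trt)][0] for sex in ("male", "female"))
    n = sum(counts[(sex, trt)][1] for sex in ("male", "female"))
    print("overall", trt, f"{s}/{n} = {s / n:.0%}")   # 215/300 = 72%, 241/300 = 80%
```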
Lord’s Paradox In experimental studies that compare two treatments (e.g., placebo vs. experimental treatment), often the investigator is interested in observing the effect of
an intervention on a specific continuous outcome variable (see Clinical Trials and Intervention Studies). Experimental units (i.e., subjects) are randomly assigned to each treatment group and a comparison of the two groups, while controlling for baseline factors, is conducted. There are different yet related methods for controlling for baseline factors, which under certain situations can yield different conclusions. This occurrence is commonly referred to as Lord’s Paradox [15, 16]. For example, in depression studies, one may define the primary outcome variable as Beck’s Depression Index (BDI). One would be interested in comparing BDI scores between two groups after an intervention. Upon study completion, the two groups may be compared with an unadjusted analysis (i.e., a comparison based on a two-sample t Test of the means). However, as Lord points out in his review of a weight gain study in college students [15], this approach ignores the potential correlation between outcome and baseline variables. Adjustments for baseline are meant to control for pretreatment differences and remove the variation in response on the basis of baseline factors. This is important for obtaining precise estimates and treatment comparisons [15, 25]. Options for adjustment include: (1) subtracting the baseline from the outcome variable (i.e., analyzing the change score, or delta), (2) dividing the outcome variable by the baseline (i.e., analyzing the percent change), or (3) using baseline as a covariate in an analysis of covariance (ANCOVA) [4, 5] or a suitable nonparametric alternative. Approach 1 leads to the descriptive statement ‘On average the outcome decreases X amount for Treatment Group A, whereas it decreases Y amount for Treatment Group B’. Approach 2 concludes ‘On average the outcome decreases X% for Treatment Group A, whereas it decreases Y % for Treatment Group B’, and Approach 3 concludes ‘On average the outcome for Treatment Group A decreases as much after intervention as that of Treatment Group B for those subjects with the same baseline’. Senn [25] discusses the pros and cons of each of these approaches and suggests that ANCOVA is the ‘best’ approach. However, for this discussion we will focus on reasons why related approaches can give different results rather than which approach is most appropriate. Many authors offer insight into why results may vary based on technique of choice [13, 15, 16, 29]. Using pulmonary function test data, Kaiser [13]
demonstrates different statistical inferences on the same data set between percent change and change score results. Wainer [29] illustrates how the three adjustment options can alter the conclusion of heart rate changes between young and old rats and blur the causal effect of the treatment. He concludes that the level of discrepancy between methods depends on different untestable assumptions such as ‘if an intervention was not applied then the response would be similar to baseline’. There are many scenarios where this specific assumption does not hold owing to placebo effects or unobservable variables. For this reason alone, a certain level of judgment is needed when choosing the adjustment approach. The goal should be to choose the method that allows baseline to be independent (or close to independent) of the adjusted response. If dependence is present, then there is a potential for decreased sensitivity in estimating treatment differences and a decrease in useful summary statistics.
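To see how the three adjustment strategies can be compared on the same data, here is a small sketch in Python using NumPy (assumed to be available). The baseline and outcome values are simulated for illustration and are not taken from any study cited here; the point is only that the change-score, percent-change, and ANCOVA-style estimates of the treatment difference need not agree.

```python
import numpy as np

rng = np.random.default_rng(0)

# simulated two-group pre/post data (hypothetical numbers)
n = 50
baseline_a = rng.normal(20, 4, n)
baseline_b = rng.normal(24, 4, n)                     # groups differ at baseline
post_a = 0.8 * baseline_a + rng.normal(2, 2, n)
post_b = 0.8 * baseline_b + rng.normal(4, 2, n)

# approach 1: difference in mean change scores
change = (post_a - baseline_a).mean() - (post_b - baseline_b).mean()

# approach 2: difference in mean percent change
pct = ((post_a - baseline_a) / baseline_a).mean() - ((post_b - baseline_b) / baseline_b).mean()

# approach 3: ANCOVA -- regress outcome on a group indicator and baseline
y = np.concatenate([post_a, post_b])
group = np.concatenate([np.ones(n), np.zeros(n)])      # 1 = group A
baseline = np.concatenate([baseline_a, baseline_b])
X = np.column_stack([np.ones(2 * n), group, baseline])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
ancova = coef[1]                                       # baseline-adjusted A - B difference

print(change, pct, ancova)   # three different answers to the same question
```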
Correlation, Dependence, and the Ecological Fallacy Consider a study of age and income conducted on two groups, one of which was young professionals, less than 5 years from earning their medical or law degree. The other group consists of older unskilled laborers, near to retirement age. It is reasonable to suppose that within each group the trend would be toward increased income with increased age, but any such effect within the groups would be overwhelmed by the association across the groups if the two groups are combined. Overall, then, the older the worker, the more likely the worker would be to fall into the lower income group. Hence, income and age are inversely associated, in contrast to the within-group direction of the association. This is nonsensical correlation due to pooling [11]. Ecologic studies in which the experimental or observational unit is a defined population are conducted in order to collect aggregate measures and provide a summary of the individuals within a group. This design can be found in studies where it is too expensive to collect individual data or primary interest is in population responses (i.e., air pollution studies or intersection car wrecks). Conclusions based on ecologic studies should be interpreted with caution when expressed on an individual level. Otherwise, the
[Figure 1 Mean age and incidence of tumor size ≥6 mm. Axes: risk (0.4–0.8) on the horizontal, mean age (years) on the vertical; figure not reproduced.]
researcher may fall prey to the ecological fallacy, in which an association observed between variables on an aggregate level does not necessarily represent the association that exists at an individual level [27]. Davis [8] discusses an ecological fallacy seen in car-crash rates and the positive correlation with variation in speed. A number of studies show a direct causal relationship between driving speeds and crash rate and conclude that both slower and faster drivers are more likely to have a car wreck than those who drive within the average speed limit. The unit of analysis for these studies is a group of individuals at a specific intersection or a specific highway segment. Davis [8] discusses the potential role of the ecologic fallacy in these studies and illustrates through a series of hypothetical examples that correlations seen in group analyses do not always provide support for conclusions at the individual level. In particular, whether the individual risk is monotonically increasing, decreasing, or Ushaped, the group correlation between the risk of a car crash and variance in speed is positive. A hypothetical data set illustrates what may occur under this fallacy. Suppose that a researcher is observing three inner-city hospitals to understand the relationship between tumor size and age of patient. Five patients are observed at each hospital. The mean ages in the three hospitals are 45.8, 47.0, and 53.0 years, respectively. Respective incidences of tumors ≥6 mm are 0.4, 0.6, and 0.8, indicating a positive correlation between tumor size and age (Figure 1). However, when observing the individual data (Figure 2), it appears that patients under the age of 50 have larger size tumors. Piantadosi et al. [23] expand on this topic and offer a concise summary and reasoning of why the
[Figure 2 Individual age and tumor size. Axes: tumor size (0–1.2) on the horizontal, age (years) on the vertical; figure not reproduced.]
fallacy exists. The conclusion is that despite the fact that aggregate data are somewhat easier to collect (in certain scenarios), correlations based on ecologic data should be interpreted with caution since they may not represent the associations that truly exist at the individual level. This is another way of saying that correlated variables may be conditionally uncorrelated. Another example would be height and weight, which tend to be correlated, but when conditioning on both the individual and the day, the height tends to be fairly constant, and certainly independent of the weight. It is also possible for uncorrelated variables to be conditionally correlated. This could be a variation on Simpson's paradox, with pooling masking a common correlation across strata (but instead of reversing it, simply resulting in no correlation). It could also represent compensation, with the direction of association varying across the strata. Meehl's paradox is also rooted in correlation, except that it involves multiple correlation (see R-squared, Adjusted R-squared). The issue here is that two binary variables, neither one of which predicts a binary outcome at all, can together predict it perfectly. Von Eye [28] offers an example with two binary predictors, each assuming the values true or false, and a binary outcome, taking the values yes or no. There are then 2 × 2 × 2 = 8 possible sets of values, and these eight patterns have frequencies as follows:

Pattern   Frequency      Pattern   Frequency
TTY       20             TFY        0
TTN        0             TFN       20
FTY        0             FFY       20
FTN       20             FFN        0
Neither binary predictor is predictive of the outcome, but the outcome is yes if the two predictors agree, and no otherwise. So in combination, the two predictors predict the outcome perfectly, truly a case of the whole being greater than the sum of the parts.
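Von Eye's frequencies can be checked with a few lines of Python (standard library only); the code below tabulates the marginal and joint classification accuracy implied by the eight pattern frequencies listed above.

```python
# frequency of each (predictor1, predictor2, outcome) pattern from the table above
freq = {
    ("T", "T", "Y"): 20, ("T", "T", "N"): 0,
    ("T", "F", "Y"): 0,  ("T", "F", "N"): 20,
    ("F", "T", "Y"): 0,  ("F", "T", "N"): 20,
    ("F", "F", "Y"): 20, ("F", "F", "N"): 0,
}
total = sum(freq.values())

# each predictor alone: P(outcome = Y) is 0.5 whatever the predictor's value
for which in (0, 1):
    for value in ("T", "F"):
        yes = sum(f for k, f in freq.items() if k[which] == value and k[2] == "Y")
        n = sum(f for k, f in freq.items() if k[which] == value)
        print(f"predictor {which + 1} = {value}: P(Y) = {yes / n:.2f}")

# both predictors together: outcome is Y exactly when they agree
correct = sum(f for k, f in freq.items()
              if (k[2] == "Y") == (k[0] == k[1]))
print("accuracy of the rule 'Y iff predictors agree':", correct / total)  # 1.0
```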
High Proportion of Agreement yet Low Kappa Cohen’s Kappa (see Rater Agreement – Kappa) is a chance-adjusted measure of agreement between two or more observers [10]. Although it is used in a variety of interrater agreement observer studies, there are situations that present relatively low kappa values yet high proportions of overall agreement. This paradox is due to the prevalence of the observed value and the imbalance in marginal totals of the contingency table [7, 9, 14]. If the prevalence of the variable of interest is low, then it should be expected that agreement will be highly skewed. Thus, it is important to take prevalence into consideration when designing your study and defining the sampled population.
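A numerical illustration of this kappa paradox is easy to generate. The following sketch (plain Python; the 2 × 2 table is hypothetical) computes the observed agreement and Cohen's kappa for two raters who agree on 98 of 100 cases but almost never use the rare category.

```python
def cohen_kappa(table):
    """table[i][j] = count of cases rated i by rater 1 and j by rater 2."""
    n = sum(sum(row) for row in table)
    po = sum(table[i][i] for i in range(len(table))) / n
    row = [sum(r) / n for r in table]
    col = [sum(table[i][j] for i in range(len(table))) / n for j in range(len(table))]
    pe = sum(r * c for r, c in zip(row, col))
    return po, (po - pe) / (1 - pe)

# 98% raw agreement, but the '+' category is so prevalent that chance
# agreement is almost as high, so kappa is essentially zero
table = [[98, 1],
         [1, 0]]
po, kappa = cohen_kappa(table)
print(po, round(kappa, 3))   # 0.98 and about -0.01
```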
The Two-sided P value Equals the One-sided P value

One-sided tests are based on one-sided rejection regions. That is, the rejection region consists of the observed value plus either outcomes to the right of (larger values than) the observed value or outcomes to the left of (smaller values than) the observed value, but not both. Two-sided tests, in contrast, allow values to be on either side of the observed value, as long as they are more extreme, or further away from what one might expect under the null hypothesis. The critical values for a two-sided test are derived from the specified type I error rate (α) such that the error rate can be divided in half for a two-sided test (α/2). When dealing with symmetrical distributions, the two-sided P value is two times the one-sided P value, and so for any given alpha level, the two-sided test is the more conservative one. This is generally understood, and the doubling that applies to symmetric distributions appears to be universally accepted (or assumed), because in practice it is common to conduct a one-sided test at a significance level that is half the usual two-sided one. That is, if a two-sided test would be performed at the 0.05 level, then the corresponding one-sided test would be performed at the 0.025 level [3]. Yet, in the grand scheme of things, very few distributions are technically symmetric. Some distributions are, in fact, very far from symmetric. Neuhauser [21] discussed one such hypothetical data set, in which there were seven observations overall, and two treatment groups, denoted A and B. In Group A, the observations were 7, 5, and 5. In Group B, the observations were 4, 3, 3, and 2. This is already the most extreme configuration possible with this set of numbers. That is, any permutation of these numbers will result in a less extreme deviation from the null expected outcome of common means across the groups. As such, the critical region of a permutation test (see Permutation Based Inference) consists of only the observed outcome, with null probability .018, and this is true whether we consider a one-sided test or a two-sided test. So either way, the P value is .018, yet this appears to be much more impressive if the test was planned as two-sided.
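A generic exact permutation test of a difference in means can be coded in a few lines (Python, standard library only). The data below are arbitrary illustrative values, not Neuhauser's; the P value quoted for his example follows from applying the same enumeration to that particular configuration and test statistic.

```python
from itertools import combinations
from statistics import mean

def exact_perm_test(group_a, group_b):
    """Two-sided exact permutation test for a difference in group means:
    enumerate every relabelling of the pooled observations into groups of the
    same sizes and count how many give a mean difference at least as extreme
    as the one observed."""
    pooled = group_a + group_b
    n_a = len(group_a)
    observed = abs(mean(group_a) - mean(group_b))
    count = total = 0
    for idx in combinations(range(len(pooled)), n_a):
        a = [pooled[i] for i in idx]
        b = [pooled[i] for i in range(len(pooled)) if i not in idx]
        total += 1
        if abs(mean(a) - mean(b)) >= observed - 1e-12:
            count += 1
    return count / total

# hypothetical data for illustration only
print(exact_perm_test([9.1, 8.4, 7.9], [6.2, 6.8, 5.9, 6.5]))
```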
Same Data, Different Conclusion
Suppose that one wishes to test to see if a given population, say those who seek help in quitting smoking, are as likely to be male as female, against the alternative hypothesis that males are better represented among this population. But resources are scarce, so the study needs to be as small as possible. One could sample nine members of the population, ascertain the gender of each, and tabulate the number of males. The probability of finding all nine males, if the genders are truly equally likely, is 1/512. The probability of finding eight males, if the genders are truly equally likely, is 9/512. The probability of finding eight or nine males, if the genders are truly equally likely, is 10/512, which is less than the customary significance level of 0.05 (and it is even less than half of it, 0.025, which is relevant as this is a one-sided test), so this is one way to construct a valid test of the hypothesis. Suppose that one team of investigators uses this test, finds eight males, and reports a one-sided P value of 10/512. Another research team decides to stop after six subjects, for an interim analysis. The probability of all six being male, under the null hypothesis, is 1/64. In fact, they found only five males among the first six, and so continued on to nine subjects altogether. Of these nine, eight were males, the same as for the previously mentioned study. But now the P value
Paradoxes has to account for the interim analysis. The critical region consists of the observed outcome, eight of nine males, plus the more extreme outcome, six of six males (as opposed to nine of nine males in the previous experiment) which is still more extreme, but no longer a possible outcome. This probability is then 1/64 + (6/64)(1/8) = 14/512. This means that the same data are observed but a different P value is reported. The reason for this apparent paradox is that a different system was used for ranking the outcomes by extremity. In the first experiment, 8/9 was second place to only 9/9, but in the second experiment 8/9 was behind 6/6, even if the 6/6 had continued and turned into 6/8 or even 6/9. So 6/6 could have turned into 6/9 had there been no interim analysis, and this would certainly have been ranked as less extreme than 8/9, yet an implication of the interim analysis and its associated stopping rule is that 6/6 is ranked more extreme than 8/9, no matter what the unobserved three outcomes would have been.
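The two P values quoted above can be verified by brute force. The sketch below (plain Python) enumerates all 512 equally likely gender sequences of nine subjects and applies each design's rejection rule, reproducing 10/512 for the fixed design and 14/512 for the design with the interim look after six subjects.

```python
from itertools import product

sequences = list(product("MF", repeat=9))   # 512 equally likely orderings

# Design 1: observe all nine subjects, reject for 8 or 9 males
fixed = sum(1 for s in sequences if s.count("M") >= 8)

# Design 2: stop and reject if the first six are all male; otherwise
# continue to nine subjects and reject if eight or more are male
def rejects(s):
    if s[:6].count("M") == 6:
        return True
    return s.count("M") >= 8

sequential = sum(1 for s in sequences if rejects(s))

print(fixed, "/ 512")        # 10 / 512
print(sequential, "/ 512")   # 14 / 512
```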
Curious Adjusted P values Consider a study comparing two active treatments, A and B, and a control group, C. One wishes to compare A to B, and A to C. Suppose that the data are as follows: Three successes out of four (75%) subjects with A, one of four (25%) with B, and zero of 2000 (0%) with C. Without adjusting for multiplicity, the raw P value for comparing A to B is .9857. However, we would want to adjust for the multiple tests, to keep this P value from being artificially low. The adjustment performed by PROC MULTTEST in SAS would produce an adjusted P value of .0076. Keep in mind the purpose for the adjustment and the expectation that the P value would become larger after adjustment. Instead, it became smaller, and in fact so much smaller that it went from being nowhere near significant to highly significant. So what went wrong? As explained by Westfall and Wolfinger [30], the reference distribution is altered by consideration of the third group, and this creates the paradox.
What Does the Mean Mean? It is possible, if a calibration system is obsolete, for every member of a population to be above average, or above the mean. For example, the IQ was initially developed so that an IQ of 100 was
considered average, but it is unlikely that an IQ of 100 represents the average today. Standard ranges may be established for any quantity on the basis of a sample relevant to that quantity, but as time elapses, the mean may shift to the point that each subject is above the mean. Even if no time elapses, it is still possible for almost every subject to be on one side of the mean. For example, the income distribution is skewed by a few very large values. This will tend to drive up the mean to the point that almost nobody has an income that is above average. Finally, the same set of data may give rise to two different yet equally valid values for the mean. Consider, for example, a school with four classes. One class has 30 students and the other three classes have 10 students each. What, then, is the average classroom size? Counting classes gives an average of (30 + 10 + 10 + 10)/4 = 15, but suppose that there is no overlap between students taking any two classes. That is, there are 60 students in all and each takes exactly one class. One could survey these students, asking each one about the size of their class. Now 30 of these students would answer ‘30’, whereas the other 30 would answer ‘10’. The mean would then be [(30)(30) + (10)(10) + (10)(10) + (10)(10)]/60 = 20, to reflect the weighting applied to the class sizes.
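The two class-size averages can be reproduced with one line of arithmetic each; the following snippet (plain Python) makes the weighting explicit.

```python
class_sizes = [30, 10, 10, 10]

# average over classes: each class counts once
per_class = sum(class_sizes) / len(class_sizes)                            # 15.0

# average over students: each student reports his or her own class size
per_student = sum(size * size for size in class_sizes) / sum(class_sizes)  # 20.0

print(per_class, per_student)
```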
Miscellaneous Other Paradoxes The one-sample runs test is designed to detect departures from randomness in a series of observations. It does so by comparing the observed lengths of runs (consecutive observations with the same value) to what would be expected under the null hypothesis of randomness. Any repeating pattern would certainly constitute a deviation from randomness, so one would expect it to be detected. Yet, an observed sequence of runs of length two, AABBAABBAABB. . . , would not lead to the runs test rejecting the null hypothesis of randomness [19]. See also [18]. A triad is a departure from the transitive property, in which A > B > C implies that A > C. So, for example, in the game ‘rock, paper, scissors’, the triad occurs because rock beats scissors, which beats paper, which in turn beats the rock. One might think that transitivity [6] applies to pairs of random variables in the sense that P{A > B} > .5 and P{B > C} > .5 jointly imply that P{A > C} > .5. However, consider three die, fair in the sense that each face occurs with
probability 1/6. Die A has one 1 and five 4s. Die B has one 6 and five 3s. Die C has three 2s and three 5s. Now P{A > B} = 25/36 > .5, P{B > C} = 7/12 > .5, and P{C > A} = 7/12 > .5 [26].
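The intransitive dice probabilities quoted from [26] can be confirmed by enumerating the 36 equally likely face pairs for each comparison (plain Python):

```python
from fractions import Fraction

die_a = [1, 4, 4, 4, 4, 4]   # one 1 and five 4s
die_b = [6, 3, 3, 3, 3, 3]   # one 6 and five 3s
die_c = [2, 2, 2, 5, 5, 5]   # three 2s and three 5s

def p_beats(x, y):
    wins = sum(1 for a in x for b in y if a > b)
    return Fraction(wins, len(x) * len(y))

print(p_beats(die_a, die_b))   # 25/36
print(p_beats(die_b, die_c))   # 7/12
print(p_beats(die_c, die_a))   # 7/12
```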
[15]
References
[17]
[1]
[2]
[3]
[4]
[5]
[6]
[7]
[8]
[9]
[10] [11]
[12]
[13] [14]
Baker, S.G. & Kramer, B.S. (2001). Good for women, Good for men, Bad for people: Simpson’s paradox and the importance of sex-specific analysis in observational studies, Journal of Women’s Health and Gender-Based Medicine 10, 867–872. Berger, V.W. (2004a). Valid adjustment of randomized comparisons for binary covariates, Biometrical Journal 46, 589–594. Berger, V.W. (2004b). On the generation and ownership of alpha in medical studies, Controlled Clinical Trials 25, 613–619. Berger, V.W. (2005). Nonparametric adjustment techniques for binary covariates, Biometrical Journal in press. Berger, V.W., Zhou, Y.Y., Ivanova, A. & Tremmel, L. (2004a). Adjusting for ordinal covariates by inducing a partial ordering, Biometrical Journal 46(1), 48–55. Berger, V.W., Zhou, Y.Y., Ivanova, A. & Tremmel, L. (2004b). Adjusting for ordinal covariates by inducing a partial ordering, Biometrical Journal 46(1), 48–55. Cicchetti, D.V. & Feinstein, A.R. (1990). High agreement but low kappa. II. Resolving the paradoxes, Journal of Clinical Epidemiology 43, 551–558. Davis, G. (2002). Is the claim that ‘variance kills’ an ecological fallacy? Accident Analysis and Prevention 34, 343–346. Feinstein, A.R. & Cicchetti, D.V. (1990). High agreement but low kappa. I. The problems of two paradoxes, Journal of Clinical Epidemiology 43, 543–549. Fleiss, J.L. (1981). Statistical Methods for Rates and Proportions, John Wiley & Sons. Hassler U., Thadewald, T. Nonsensical and biased correlation due to pooling heterogeneous samples, The Statistician 52(3), 367–379. Heydtmann, M. (2002). The nature of truth: Simpson’s paradox and the limits of statistical data, The Quarterly Journal Medicine 95, 247–249. Kaiser, L. (1989). Adjusting for baseline: change or percentage change? Statistics in Medicine 8, 1183–1190. Lantz, C.A. & Nebenzahl, E. (1996). Behavior and Interpretation of the κ Statistic: Resolution of the Two Paradoxes, Journal of Clinical Epidemiology 49(4), 431–434.
[16]
[18]
[19]
[20]
[21] [22]
[23]
[24]
[25] [26] [27] [28] [29]
[30]
Lord, F.M. (1967). A paradox in the interpretation of group comparisons, Psychological Bulletin 69(5), 304–305. Lord, F.M. (1969). Statistical adjustments when comparing preexisting groups, Psychological Bulletin 72, 336–337. Malinas, G. & Bigelow, J. Simpson’s paradox, in The Stanford Encyclopedia of Philosophy (Spring 2004 Edition), Edward N. Zalta, ed., forthcoming URL = . McKenzie, D.P., Onghena, P., Hogenraad, R., Martindale, C. & MacKinnon, A.J. (1999). Detecting patterns by one-sample runs test: paradox, explanation, and a new omnibus procedure, The Journal of Experimental Education 67(2), 167–179. Mogull, R.G. (1994). The one-sample runs test – A category of exception Mogull RG, Journal of Educational and Behavioral Statistics 19(3), 296–303. Morrell, C.H. (1999). Simpson’s paradox: an example from a longitudinal study in South Africa, Journal of Statistics Education 7(3), 1–4. Neuhauser, M. (2004). The choice of alpha for one-sided tests, Drug Information Journal 38, 57–60. Neutel, C.I. (1997). The potential for Simpson’s paradox in drug utilization studies, Annals of Epidemiology 7(7), 517–521. Piantadosi, S., Byar, D. & Green, S. (1988). The ecological fallacy, American Journal of Epidemiology 127(5), 893–904. Reintjes R. de Boer A., van Pelt W. & Mintjes-de Goot J. (2000). Simpson’s paradox: an example from hospital epidemiology, Epidemiology 11(1), 81–83. Senn, S. (1997). Statistical Issues in Drug Development, John Wiley & Sons, England. Szekely, G.J. (2003). Solution to problem #28: irregularly spotted dice, Chance 16(3), 54–55. Szklo, M. & Nieto, F.J. (2000). Epidemiology: Beyond the Basics, Aspen Publishers. Von Eye, A. (2002). Configural Frequency Analysis, Lawrence Erlbaum, Mahwah. Wainer, H. (1991). Adjusting for differential base rates: lord’s paradox again, Psychological Bulletin 109(1), 147–151. Westfall P.H., Wolfinger R.D., Closed Multiple Testing Procedures and PROC MULTTEST, http:// support.sas.com/documentation/ periodicals/obs/obswww23, accessed 2/2/04.
VANCE W. BERGER
AND
VALERIE DURKALSKI
Parsimony/Occham’s Razor STANLEY A. MULAIK Volume 3, pp. 1517–1518 in Encyclopedia of Statistics in Behavioral Science ISBN-13: 978-0-470-86080-9 ISBN-10: 0-470-86080-4 Editors Brian S. Everitt & David C. Howell John Wiley & Sons, Ltd, Chichester, 2005
Parsimony/Occham’s Razor Parsimony or simplicity as a desirable feature of theories and models was first popularized by the Franciscan William Ockham (1285–1347) in the principle ‘Entities are not to be multiplied unnecessarily’. In other words, theoretical explanations should be as simple as possible, evoking the fewest explanatory entities as possible. The nineteenth-century physicist Heinrich Hertz listed parsimony as one of several desirable features for physical theories [4]. However, a rationale for this principle was not given. Later, the philosopher of science Karl Popper suggested it facilitated the falsifiability of theories, but did not provide a clear demonstration [7]. L. L. Thurstone [8], the American factor analyst, advocated parsimony and simplicity in factor analytic theories. Mulaik [6] showed that in models in which estimation of parameters is necessary to provide values for unknown parameters, the number of dimensions in which data is free to differ from the model of the data is given by the degrees of freedom of the model, df = D − m, where D is the number of data elements to fit and m is the number of estimated parameters. (In confirmatory factor analysis and structural equation models, D is the number of nonredundant elements of a variance–covariance matrix to which the model is fit.) Each dimension corresponding to a degree of freedom corresponds to a condition by which the model can fail to fit and thus be disconfirmed. Parsimony is the fewness of parameters estimated relative to the number of data elements, and is expressed as a ratio P = m/D. Zero is the ideal case where nothing is estimated. James, Mulaik, and Brett [3] suggested that an aspect of model quality could be indicated by a ‘parsimony ratio’: P R = df/D = 1 − P . As this ratio approaches unity, the model is increasingly disconfirmable. They further suggested that
this ratio could be multiplied by ‘goodness-of-fit’ indices that range between 0 and 1 to combine fit with parsimony in assessing model quality, Q. For example, in structural equation modeling one can obtain QCFI = P R ∗ CFI or QGFI = P R ∗ GFI , where CFI is the comparative fit index of Bentler [1] and GFI the ‘goodness-of-fit’ index of LISREL [5]. (see Goodness of Fit; Structural Equation Modeling: Software) It can also be adapted to the RMSEA index [2] as QRMSEA = P R ∗ exp(−RMSEA). D in PR for the CFI is p(p − 1)/2 and for the GFI and RMSEA it is p(p + 1)/2, where p is the number of observed variables. A Q > 0.87 indicates a model that is both highly parsimonious and high in degree of fit.
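The indices defined above are simple to compute once p, the number of estimated parameters, and the fit indices are in hand. The sketch below (plain Python) uses made-up values for a hypothetical model with p = 10 observed variables, m = 23 free parameters, CFI = 0.96, and RMSEA = 0.05; these numbers are illustrative only, not results from any cited study.

```python
from math import exp

p = 10                 # observed variables (hypothetical)
m = 23                 # estimated parameters (hypothetical)
cfi, rmsea = 0.96, 0.05

D_cfi = p * (p - 1) / 2        # data elements D used for the CFI-based ratio
D_rmsea = p * (p + 1) / 2      # data elements D for the GFI/RMSEA-based ratios

pr_cfi = (D_cfi - m) / D_cfi         # parsimony ratio PR = df / D = 1 - m/D
pr_rmsea = (D_rmsea - m) / D_rmsea

q_cfi = pr_cfi * cfi                 # QCFI = PR * CFI
q_rmsea = pr_rmsea * exp(-rmsea)     # QRMSEA = PR * exp(-RMSEA)

print(round(pr_cfi, 3), round(q_cfi, 3))
print(round(pr_rmsea, 3), round(q_rmsea, 3))
```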
References [1] [2]
[3]
[4] [5]
[6] [7]
[8]
Bentler, P.M. (1990). Comparative fit indexes in structural models, Psychological Bulletin 107, 238–246. Browne, M.W. & Cudeck, R. (1993). Alternative ways of assessing model fit, in Testing Structural Equation Models, K.H. Bollen & J.S. Long, eds, Sage Publications, Newbury Park, 136–162. James, L.R., Mulaik, S.A. & Brett, J.M. (1982). Causal Analysis: Assumptions, Models and Data, Sage Publications, Beverly Hills. Janik, A. & Toulmin, S. (1973). Wittgenstein’s Vienna, Simon and Schuster, New York. J¨oreskog, K.G. & S¨orbom, D. (1987). LISREL V: Analysis of Linear Structural Relationships by the Method of Maximum Likelihood, National Educational Resources, Chicago. Mulaik, S.A. (2001). The curve-fitting problem: An objectivist view, Philosophy of Science 68, 218–241. Popper, K.R. (1961). The Logic of Scientific Discovery, (translated and revised by the author), Science Editions, (Original work published 1934), New York. Thurstone, L.L. (1947). Multiple Factor Analysis, University of Chicago Press, Chicago.
STANLEY A. MULAIK
Partial Correlation Coefficients BRUCE L. BROWN
AND
SUZANNE B. HENDRIX
Volume 3, pp. 1518–1523 in Encyclopedia of Statistics in Behavioral Science ISBN-13: 978-0-470-86080-9 ISBN-10: 0-470-86080-4 Editors Brian S. Everitt & David C. Howell John Wiley & Sons, Ltd, Chichester, 2005
Partial Correlation Coefficients
Partial correlation answers the question 'what is the correlation coefficient between any two variables with the effects of a third variable held constant?' An example will illustrate. Suppose that one wishes to know how well high-school grades correlate with grades in college, but with potential ability (scholastic aptitude) held constant; that is, for a group of students of equal ability, is there a strong tendency for students who do well in high school to also do well in college? A direct way to do this would be to gather data on a group of students who are all equal in scholastic aptitude, and then examine the correlation coefficient between high school grades and college grades within this group. However, it would obviously not be easy to find a group of students who are equal in scholastic aptitude. Partial correlation is the statistical way of solving this problem by estimating from ordinary data what this correlation coefficient would be. Suppose, in this example, that the correlation coefficient between high school grades (variable X1) and college grades (variable X2) is found to be very high (r12 = 0.92). However, both of these variables are also found to be quite highly correlated with scholastic aptitude (variable X3), with the correlation between aptitude and high school grades being r13 = 0.73, and the correlation between aptitude and college grades being r23 = 0.76. These three correlation coefficients are referred to as zero-order correlations. A zero-order correlation is a correlation coefficient that does not include a control variable (see Correlation; Correlation and Covariance Matrices). Partial correlation that includes one control variable (variable X3 above) is referred to as a first-order correlation. The partial correlation coefficient (a first-order correlation) is a straightforward function of these three zero-order correlations:

r12.3 = (r12 − r13 r23) / √[(1 − r13²)(1 − r23²)]     (1)

The symbol r12.3 is read as 'the partial correlation coefficient between variable X1 and variable X2 with variable X3 held constant'. For the example just described, this calculates as:
r12.3 = [0.92 − (0.73)(0.76)] / [√(1 − 0.5329) √(1 − 0.5776)]
      = (0.92 − 0.5548) / (√0.4671 √0.4224)
      = 0.3652 / [(0.6834)(0.6499)]
      = 0.3652/0.4441 = 0.82     (2)
From this calculation, it can be concluded that the correlation between high school grades and college grades drops from 0.92 to 0.82 when it is adjusted for the effects of scholastic aptitude. In other words, 0.82 is what the correlation between high school grades and college grades would be expected to be for a group of students who are all equal in scholastic aptitude. This partial correlation coefficient removes the effects of scholastic aptitude from the correlation between high school grades and college grades. It therefore represents the determinative influence of those factors not related to scholastic aptitude – perhaps such things as diligence, responsibility, academic motivation, and so forth – that are common to both high school grades and college grades.
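A direct implementation of formula (1) reproduces this value; the small function below (plain Python) takes the three zero-order correlations as input.

```python
from math import sqrt

def partial_r(r12, r13, r23):
    """First-order partial correlation between variables 1 and 2, controlling for 3."""
    return (r12 - r13 * r23) / sqrt((1 - r13 ** 2) * (1 - r23 ** 2))

# zero-order correlations from the grades/aptitude example above
print(round(partial_r(0.92, 0.73, 0.76), 2))   # 0.82
```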
Partial Correlation and Regression It can be demonstrated that partial correlation is closely related to linear regression analysis (see Multiple Linear Regression). In fact, the partial correlation coefficient r12.3 of 0.82 that was just demonstrated is equivalent to the correlation between the two sets of residuals that are produced by regressing X1 onto X3, and X2 onto X3 . This will be illustrated with ‘simplest case’ data; that is, a data set with few observations and simple numerical structure will be used to demonstrate that the partial correlation coefficient (as calculated from the above formula) is equivalent to the correlation coefficient between the two sets of residuals. For this demonstration, only six students for whom there are high school grades, college grades, and scholastic aptitude scores will be used. They are shown in Table 1.
Table 1   High school GPA, college GPA, and SAT scores for six students

Student     High school GPA (X1)   College GPA (X2)   SAT score (X3)
A           3.64                   3.65               661
B           3.43                   3.51               523
C           3.20                   3.38               477
D           2.80                   2.96               569
E           2.17                   3.10               523
F           1.56                   1.99               385
In actual practice, sample sizes would usually be considerably larger. The small sample size here is to make the demonstration clear and understandable. To further simplify the demonstration, these three variables are first converted to standard score (Z score) form. The means and standard deviations of these three variables are

X̄1 = ΣX/n = (3.64 + 3.43 + 3.20 + 2.80 + 2.17 + 1.56)/6 = 16.80/6 = 2.80
σ1 = √{[ΣX² − (ΣX)²/n]/(n − 1)} = √[(50.237 − 16.80²/6)/5] = √[(50.237 − 47.040)/5] = √0.6394 = 0.800

X̄2 = (3.65 + 3.51 + 3.38 + 2.96 + 3.10 + 1.99)/6 = 18.59/6 = 3.10
σ2 = √[(59.399 − 18.59²/6)/5] = √[(59.399 − 57.598)/5] = √0.3601 = 0.600

X̄3 = (661 + 523 + 477 + 569 + 523 + 385)/6 = 3138/6 = 523
σ3 = √[(1683494 − 3138²/6)/5] = √[(1683494 − 1641174)/5] = √8464 = 92     (3)
For each of the three variables, the Z scores are created by subtracting the mean of each from the raw values to get deviation scores, and then dividing each deviation score by the standard deviation. The Z scores for these three variables are shown in Table 2.

Table 2   Standardized high school GPA, college GPA, and SAT scores for six students

Student     Z1          Z2          Z3
A           1.0505      0.9193      1.50
B           0.7879      0.6860      0.00
C           0.5002      0.4694      −0.50
D           0.0000      −0.2305     0.50
E           −0.7879     0.0028      0.00
F           −1.5507     −1.8469     −1.50

Now that the data are in Z score form, it is relatively easy to obtain predicted scores and residual scores on high school grade point average (GPA) (Z1) and college GPA (Z2), as predicted by regression from SAT scores (Z3). The correlation coefficient between high school GPA and college GPA is found from the product of Z scores to be

r12 = ΣZ1Z2/(n − 1)
    = [1.0505 × 0.9193 + 0.7879 × 0.6860 + 0.5002 × 0.4694 + 0.0000 × (−0.2305) + (−0.7879) × 0.0028 + (−1.5507) × (−1.8469)]/5
    = 4.6027/5 = 0.92     (4)
The correlation coefficient between high school GPA and SAT is

r13 = ΣZ1Z3/(n − 1)
    = [1.0505 × 1.50 + 0.7879 × 0.00 + 0.5002 × (−0.50) + 0.0000 × 0.50 + (−0.7879) × 0.00 + (−1.5507) × (−1.50)]/5
    = 3.6517/5 = 0.73     (5)
The correlation coefficient between college GPA and SAT is

r23 = ΣZ2Z3/(n − 1)
    = [0.9193 × 1.50 + 0.6860 × 0.00 + 0.4694 × (−0.50) + (−0.2305) × 0.50 + 0.0028 × 0.00 + (−1.8469) × (−1.50)]/5
    = 3.7993/5 = 0.76     (6)

These three zero-order correlation coefficients are the same ones from which the partial correlation between high school GPA and college GPA with scholastic aptitude test (SAT) scores held constant was found from the partial correlation formula (1) to be 0.82. It can be now shown that this same partial correlation coefficient can be obtained by finding the correlation coefficient between the residual scores of high school GPA predicted from SAT scores, and the residual scores of college GPA predicted from SAT scores. The high school GPA scores predicted from SAT scores are found from the standardized regression function:

Ẑ1.3 = r13 Z3     (7)

In other words, the standardized predicted high school GPA scores that are predicted on the basis of SAT scores (Ẑ1.3) are found by multiplying each of the standardized SAT scores (Z3) by the correlation coefficient between these two variables (r13) (see Standardized Regression Coefficients).

Ẑ1.3 = r13 Z3 = (0.730) × (1.50, 0.00, −0.50, 0.50, 0.00, −1.50)′ = (1.0955, 0.0000, −0.3652, 0.3652, 0.0000, −1.0955)′     (8)

The residual scores for high school GPA are obtained by subtracting these predicted scores from the actual high school GPA standardized scores:

e1.3 = Z1 − Ẑ1.3
     = (1.0505, 0.7879, 0.5002, 0.0000, −0.7879, −1.5507)′ − (1.0955, 0.0000, −0.3652, 0.3652, 0.0000, −1.0955)′
     = (−0.0450, 0.7879, 0.8654, −0.3652, −0.7879, −0.4552)′     (9)

Similarly, the college GPA predicted scores (from SAT) are found to be

Ẑ2.3 = r23 Z3 = (0.760) × (1.50, 0.00, −0.50, 0.50, 0.00, −1.50)′ = (1.1398, 0.0000, −0.3799, 0.3799, 0.0000, −1.1398)′     (10)

and the residual scores for college GPA are

e2.3 = Z2 − Ẑ2.3
     = (0.9193, 0.6860, 0.4694, −0.2305, 0.0028, −1.8469)′ − (1.1398, 0.0000, −0.3799, 0.3799, 0.0000, −1.1398)′
     = (−0.2205, 0.6860, 0.8493, −0.6104, 0.0028, −0.7071)′     (11)
The partial correlation coefficient of 0.82 between high school GPA and college GPA with SAT scores held constant can now be shown to be equivalent to the zero-order correlation coefficient between these two sets of residuals. The covariance between these two sets of residuals is

cov(e1.3, e2.3) = Σe1.3 e2.3/(n − 1)
    = [(−0.0450)(−0.2205) + (0.7879)(0.6860) + (0.8654)(0.8493) + (−0.3652)(−0.6104) + (−0.7879)(0.0028) + (−0.4552)(−0.7071)]/5
    = 1.8280/5 = 0.3656     (12)

Since the Pearson product moment correlation coefficient is equal to the covariance divided by the two standard deviations, the correlation between these two sets of residuals can be obtained by dividing this covariance by the two standard deviations. The residual scores are already in deviation score form, so the standard deviation for the first set of residuals is

σe1 = √[Σe1.3²/(n − 1)] = √[2.3330/(6 − 1)] = √0.4666 = 0.6831     (13)

and the standard deviation for the second set of residuals is

σe2 = √[Σe2.3²/(n − 1)] = √[2.1131/(6 − 1)] = √0.4226 = 0.6501     (14)

The zero-order correlation between the two sets of residuals is therefore found as

r12.3 = cov(e1.3, e2.3)/(σe1 σe2) = 0.3656/[(0.6831)(0.6501)] = 0.3656/0.4441 = 0.82     (15)
which is indeed equivalent to the partial correlation coefficient as calculated by the received formula. It can be demonstrated algebraically that this equivalence holds in general.
The calculations involved in obtaining partial correlations (as shown above) are simple enough to be easily accomplished using a spreadsheet, such as Microsoft Excel, Quattro Pro, ClarisWorks, and so on. They can also be accomplished using computer statistical packages such as SPSS and SAS (see Software for Statistical Analyses). Landau and Everitt [7] clearly outline how to calculate partial correlations using SPSS (pp. 94–96).
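For readers working in Python rather than a spreadsheet or SPSS, the equivalence demonstrated above can be reproduced in a few lines with NumPy (assumed to be available). The data are the six students of Table 1, and both routes give the same value, approximately 0.82, up to rounding.

```python
import numpy as np

x1 = np.array([3.64, 3.43, 3.20, 2.80, 2.17, 1.56])   # high school GPA
x2 = np.array([3.65, 3.51, 3.38, 2.96, 3.10, 1.99])   # college GPA
x3 = np.array([661, 523, 477, 569, 523, 385], float)  # SAT

def r(a, b):
    return np.corrcoef(a, b)[0, 1]

# route 1: the first-order partial correlation formula (1)
r12, r13, r23 = r(x1, x2), r(x1, x3), r(x2, x3)
formula = (r12 - r13 * r23) / np.sqrt((1 - r13 ** 2) * (1 - r23 ** 2))

# route 2: correlate the residuals from regressing X1 on X3 and X2 on X3
def residuals(y, x):
    slope, intercept = np.polyfit(x, y, 1)
    return y - (slope * x + intercept)

via_residuals = r(residuals(x1, x3), residuals(x2, x3))

print(round(formula, 3), round(via_residuals, 3))   # both approximately 0.82
```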
Partial Correlation in Relation to Other Methods Partial correlation is related to, and foundational to many statistical methods of current interest, such as structural equation modeling (SEM) and causal modeling. Using partial correlation as a method to determine the nature of causal relations between observed variables has a long history, as suggested by Lazarsfeld [8], and presented in more detail in by Simon [9] and Blalock [2, 3]. Partial correlation is also closely related, both conceptually and also mathematically, to path analysis (see [5]). Path analysis methods were first developed in the 1930s by Sewall Wright [11–14] (see also [10] ), and they also are fundamental to the developments underlying structural equations modeling. In addition, partial correlation mathematics is central to the computational procedures in traditional factor analytic methods [4] (and see Factor Analysis: Exploratory). They are also central to the algebraic rationale underlying stepwise multiple regression. That is, partial correlation mathematics is the conceptual basis for removing the effects of each predictive variable in a stepwise regression, in order to determine the additional predictive contribution of the remaining variables. One of the more interesting applications of partial correlation methods is in demonstrating mediation [1, 6]. That is, partial correlation methods are used to demonstrate that the causal effects of one variable on another are in fact mediated by yet another variable (a third one) – that without this third variable, the correlative relationship vanishes, or at least is substantially attenuated. Thus, partial correlation is useful not only as a method of statistical control but also as a way of testing theories of causation.
Partial Correlation Coefficients
References [1]
[2]
[3]
[4]
[5]
[6]
[7]
Baron, R.M. & Kenny, D.A. (1986). The moderatormediator variable distinction in social psychological research: conceptual, strategic and statistical considerations, Journal of Personality and Social Psychology 51, 1173–1182. Blalock, H.M. (1961). Causal Inferences in NonExperimental Research, University of North Carolina Press, Chapel Hill. Blalock, H.M. (1963). Making causal inferences for unmeasured variables from correlations among indicators, American Journal of Sociology 69, 53–62. Brown, B.L., Hendrix, S.B. & Hendrix, K.A. (in preparation). Multivariate for the Masses, Prentice-Hall, Upper Saddle River. Edwards, A.L. (1979). Multiple Regression and the Analysis of Variance and Covariance, W. H. Freeman, San Francisco. Judd, C.M. & Kenny, D.A. (1981). Process analysis: estimating mediation in treatment evaluations, Evaluation Review 5, 602–619. Landau, S. & Everitt, B.S. (2003). A Handbook of Statistical Analyses Using SPSS, Chapman and Hall, Boca Raton.
[8]
[9] [10]
[11] [12]
[13]
[14]
5
Lazarsfeld, P.F. (1955). The interpretation of statistical relations as a research operation, in The Language of Social Research, P.F. Lazarsfeld & M. Rosenberg, eds, The Free Press, New York. Simon, H.A. (1957). Models of Man, Wiley, New York. Tukey, J.W. (1954). Causation, regression, and path analysis, in Statistics and Mathematics in Biology, O. Kempthorne, T.A. Bancroft, J.M. Gowen & J.L. Lush, eds, Iowa State College Press, Ames. Wright, S. (1934). The method of path coefficients, Annals of Mathematical Statistics 5, 161–215. Wright, S. (1954). The interpretation of multivariate systems, in Statistics and Mathematics in Biology, O. Kempthorne, T.A. Bancroft, J.M. Gowen & J.L. Lush, eds, Iowa State College Press, Ames. Wright, S. (1960a). Path coefficients and path regressions: alternative or complementary concepts? Biometrics 16, 189–202. Wright, S. (1960b). The treatment of reciprocal interaction, with or without lag, in path analysis, Biometrics 16, 423–445.
BRUCE L. BROWN
AND
SUZANNE B. HENDRIX
Partial Least Squares PAUL D. SAMPSON
AND
FRED L. BOOKSTEIN
Volume 3, pp. 1523–1529 in Encyclopedia of Statistics in Behavioral Science ISBN-13: 978-0-470-86080-9 ISBN-10: 0-470-86080-4 Editors Brian S. Everitt & David C. Howell John Wiley & Sons, Ltd, Chichester, 2005
Partial Least Squares Introduction Partial Least Squares (PLS), as described in this article, is a method of statistical analysis for characterizing and making inferences about the relationship(s) among two or more causally related blocks of variables in terms of pairs of latent variables. The notion of causality in the preceding sentence is meant in its scientific, not its statistical, sense. PLS is not a tool for asserting or testing causation, but for calibrating it (see [10]) in contexts already known to be causal in Pearl’s sense [14]. The informal idea of a ‘block’ of variables is used in the sense of a vector of measurements that together observe a range of relevant aspects of the causal nexus under study. This discussion of PLS is presented in the context of an application in the field of fetal alcohol teratology. It concerns the characterization of ‘alcohol-related brain damage’ as expressed in the relationship of a multivariate characterization of prenatal exposure to alcohol through maternal drinking to a large battery of neurobehavioral performance measures in a prospective study of children of approximately 500 women with a broad range of drinking patterns during pregnancy. That the effect of alcohol on the brain or on behavior is causal is amply known in advance from a plethora of studies in animals and also from human clinical teratology. This PLS approach to defining and making inference about latent variables is quite distinct from other methods of structural equation modeling (SEM) and from multivariate methods such as canonical correlation analysis. In this context, the latent variable models that underlie two-block PLS analysis can be shown to parameterize a range of reducedrank regression models [1] as shown by Wegelin et al. [19]. SEM typically draws inferences about latent variables – without enabling the estimation/computation of explicit scores for these constructs – under strong assumptions about the error structure in measurement models for the observable (‘manifest’) variables representing each latent variable. PLS computes latent variable scores without any measurement model assumptions. As will become clear from the symmetries of the equations following, PLS is an ‘undirected’ technique in the sense of the directed
acyclic graph (DAG) literature (see Graphical Chain Models), without specific tools to assess the direction of causation between blocks its tools show to be associated. Nevertheless, PLS is completely consistent with the critique of SEM that can be mounted from that more coherent point of view [17], in that its reports are consistent with the full range of causal reinterpretations that might model one single data stream. Wegelin et al. [19] show that all feasible directed path diagrams for two-block latent variable models are equivalent or indistinguishable in the sense that they parameterize the same set of covariance matrices. Once causation is presumed, the tie between paired latent variables in PLS is in most ways equivalent to a multivariate version of the E(Y |do(X)) operation of Pearl [14] in the situation where X is a vector of alternate indicators of the same latent score (so that one can’t ‘do’ X1 , X2 , . . . all at the same time). Canonical correlation analysis (and generalizations to multiple blocks), as the name suggests, computes scores to optimize between-block correlation of the composites (linear combinations of the manifest variables), but these bear no interpretation as latent variables (see Canonical Correlation Analysis). PLS must also be distinguished from a number of other related statistical analysis methods reported in the literature under the same name. The twentieth-century Swedish econometrician Herman Wold developed PLS as an ordinary least squares–based approach for modeling paths of causal relationship among any number of ‘blocks’ of manifest variables [9, 20] (see Path Analysis and Path Diagrams). There appear to have been relatively few recent applications of this type of PLS in the behavioral sciences; see [13] and [7]. The largest body of literature on Partial Least Squares is found in the field of chemometrics, with early work associated with Herman Wold’s son, Svante. A perspective on early PLS development was provided by Svante Wold in 2001 in a special issue of a chemometrics journal [21]. The chemometric variants of PLS include the situation where one ‘block’ consists of a single dependent variable, and PLS provides an approach to coping with a multiple regression problem with highly multicollinear predictor blocks. In this context it is often compared with methods such as ridge regression and principal components regression (see Principal Component Analysis).
Increasingly common in applications of PLS in chemometrics, bioinformatics, and the behavioral sciences is the two-block case, using the algebra described below, but where one of the blocks of variables is a design matrix defining different classes to be discriminated. Recent behavioral science applications include the discrimination of dementias [5], the differentiation of fronto-temporal dementia from Alzheimer’s using FDG-PET imaging [8], and the characterization of schizophrenia in terms of the relationship of MRI brain images and neuropsychological measures [11]. PLS has also been applied to examine differences between neurobehavioral experimental design conditions in a study using multivariate event related brain potential data [6]. For recent discussion of applications to microarray data, see [12] (see Microarrays). Barker and Rayens [2] argue that PLS is to be preferred over principal component analysis (PCA) when discrimination is the goal and dimension reduction is needed.
Algebra and Interpretation of two-block PLS The scientist for whom PLS is designed is faced with data for i = 1, . . . , n subjects on two (or more) lists or blocks of variables, Xj , j = 1, . . . , J and Yk , k = 1, . . . , K. In the application below, J = 13 X’s will represent different measures of the alcohol consumption of pregnant women, and K = 9 Y ’s will be a range of measurements of young adult behavior, limited here for purposes of illustration to nine subscales of the WAIS IQ test. Write C for the J × K covariance matrix of the X’s by the Y ’s. In many applications, the variables in a block are incommensurate, and so are scaled to variance 1. If both blocks of variables are standardized to variance 1, this will be a matrix of correlations. Analyses of standardized and unstandardized blocks are in general different; analyses of mean-centered versus uncentered blocks are in most respects exactly the same. The central computation can be phrased in any of several different ways. All involve the production of a vector of coefficients A1 = (A11 , . . . , A1J ), one for each Xj , together with a vector of coefficients B1 = (B11 , . . . , B1K ), one for each Yk . The elements of either vector, A1 or B1 , are scaled so as to have squares that sum to 1. These make up the
first pair of singular vectors of the matrix C from computational linear algebra. Accompanying this pair of vectors is a scalar quantity, d1, the first singular value of C, such that all of the following assertions are true (in fact, all four are exactly algebraically equivalent):

• The scaled outer product d1 A1 B1′ is the best (least-squares) fit of all such matrices to the between-block covariance matrix C. The goodness of this fit serves as a figure of merit for the overall PLS analysis. It is characterized as 'the fraction of summed squared covariance explained' in the model C = d1 A1 B1′ + error.

• The vector A1 is the 'central tendency' of the K columns of C thought of as patterns of covariance across the J variables Xj and can be computed as the first uncentered, unstandardized principal component of the columns of C. Likewise, the vector B1 is the first principal component of the rows of C, the central tendency they share as J patterns of covariance across the K variables Yk.

• The latent variables LV1X = Σj A1j Xj and LV1Y = Σk B1k Yk have covariance d1, and this is the greatest covariance of any pair of such linear combinations for which the coefficient vectors are standardized to norm 1, A1′A1 = B1′B1 = 1. These latent variables are not like those of an SEM (e.g., LISREL or EQS; see Structural Equation Modeling: Software); they are ordinary linear combinations of the observed data, not factors, and they have exact values. The quantity PLS is optimizing is the covariance between these paired LV's, not the variance explained in the course of predictions of the outcomes individually or collectively.

• The elements A1j of the vector A1 are proportional to the covariances of the corresponding X-block variable Xj with the latent variable LV1Y representing the Y's, the other block in the analysis; and, similarly, the elements B1k of the vector B1 are proportional to the covariances of the corresponding Y-block variable Yk with the latent variable LV1X representing the X's. When it is known a priori that a construct that the X's share causes changes in a construct that the Y's share, or vice versa, these coefficients may be called saliences: each A1j is the salience of the variable Xj for the latent variable representing the Y-block, and each B1k is the salience of the variable Yk for the latent variable representing the X-block.

A PLS analysis thus combines the elements of two (or more) blocks of variables into composite scores, usually one per block, that sort variables (by saliences A1j or B1k) and also subjects (by scores LV1X = Σj A1j Xj and LV1Y = Σk B1k Yk). The resulting systematic summary of the pattern of cross-block covariance thus links characterization of subjects with characterization of variables. The computations of PLS closely resemble other versions of cross-block averaging in applied statistics, such as correspondence analysis.

Subsequent pairs of salience or coefficient vectors, A2 and B2, A3 and B3, and so on, and corresponding latent variables, explaining residual covariance structure after that explained by the first pair of singular vectors, can be computed as subsequent pairs of singular vectors from the complete singular-value decomposition (SVD) of the cross-block covariance matrix, C = ADB′, where A = [A1, A2, . . .] is the J × M matrix of orthonormal left singular vectors, B = [B1, B2, . . .] is the K × M matrix of orthonormal right singular vectors, and D is the M × M diagonal matrix of ordered singular values, M = min{J, K}.

There are no distributional assumptions in PLS; hence inference in PLS goes best by applications of bootstrap analysis and permutation testing [4] (see Permutation Based Inference). The primary hypotheses of interest concern the structure of the relationship between the X and Y blocks – the patterning of the rows and columns of C – as these calibrate the assumed causal relationship. The significance of this structure is assessed in terms of the leading singular values (covariances of the latent variable pairs) with respect to a null model of zero covariance between the X and Y blocks. More formally, assuming random sampling and a true population cross-covariance matrix with singular values δ1 ≥ δ2 ≥ · · · ≥ δM, we ask whether the structure of the covariance matrix can be modeled in terms of a small number, p, of singular vector pairs (or latent variables); that is: δ1 ≥ δ2 ≥ · · · ≥ δp > δp+1 = · · · = δM = 0. Without altering the covariance structure within the several blocks separately, one permutes case orders of all the blocks independently and examines sample singular values, or sums of the first few singular values, with respect to the permutation distribution.
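The core computation is just the SVD of the cross-block covariance matrix. The following sketch (Python with NumPy, on simulated data rather than the study data) computes the first singular value d1, the coefficient vectors A1 and B1, the latent variable scores, and the fraction of summed squared covariance explained.

```python
import numpy as np

# A minimal two-block PLS sketch on simulated data (not the study data).
rng = np.random.default_rng(0)
n, J, K = 200, 13, 9
shared = rng.normal(size=(n, 1))                  # a common latent signal
X = shared @ rng.normal(size=(1, J)) + rng.normal(size=(n, J))
Y = -shared @ rng.normal(size=(1, K)) + rng.normal(size=(n, K))

# Standardize each variable to variance 1, as recommended for incommensurate blocks.
Xs = (X - X.mean(0)) / X.std(0, ddof=1)
Ys = (Y - Y.mean(0)) / Y.std(0, ddof=1)

C = Xs.T @ Ys / (n - 1)                           # J x K cross-correlation matrix
U, d, Vt = np.linalg.svd(C, full_matrices=False)  # C = U diag(d) Vt

A1, B1, d1 = U[:, 0], Vt[0, :], d[0]              # first singular vectors and value
LV1X = Xs @ A1                                    # latent variable scores (exact values)
LV1Y = Ys @ B1

print("d1 =", round(d1, 3))
print("fraction of summed squared covariance:", round(d1**2 / np.sum(d**2), 3))
print("cov(LV1X, LV1Y) =", round(np.cov(LV1X, LV1Y, ddof=1)[0, 1], 3))  # equals d1
print("corr(LV1X, LV1Y) =", round(np.corrcoef(LV1X, LV1Y)[0, 1], 3))
```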
When the PLS model for the SVD structure has been found to be statistically significant, one examines the details of the structure, the saliences or singular-value coefficients of A1 and B1 , by computing bootstrap standard errors. These inferential computations effectively take into account the excess of number of variables over sample size in this data set. For instance, all the P values in the example to follow pertain to analysis of all 13 prenatal alcohol measures, all 9 IQ subscale outcomes, or their combination, at once; no further adjustment for multiple comparisons is necessary.
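A minimal sketch of the permutation test for the first singular value and of bootstrap standard errors for the saliences. The blocks are simulated again here so that the block stands alone; the names and sample sizes are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated standardized blocks (stand-ins for the 13 exposure and 9 outcome variables).
n, J, K = 200, 13, 9
latent = rng.normal(size=(n, 1))
Xs = latent @ rng.normal(size=(1, J)) + rng.normal(size=(n, J))
Ys = -latent @ rng.normal(size=(1, K)) + rng.normal(size=(n, K))
Xs = (Xs - Xs.mean(0)) / Xs.std(0, ddof=1)
Ys = (Ys - Ys.mean(0)) / Ys.std(0, ddof=1)

def first_svd(X, Y):
    """First singular value and singular vectors of the cross-block covariance matrix."""
    C = X.T @ Y / (len(X) - 1)
    U, d, Vt = np.linalg.svd(C, full_matrices=False)
    return d[0], U[:, 0], Vt[0, :]

d1, A1, B1 = first_svd(Xs, Ys)

# Permutation test: permuting the case order of one block destroys the cross-block
# association while leaving the within-block covariance structures untouched.
d1_perm = np.array([first_svd(Xs, Ys[rng.permutation(n)])[0] for _ in range(1000)])
print("permutation P value for d1:", np.mean(d1_perm >= d1))

# Bootstrap standard errors for the X-block saliences A1.
boot = np.empty((1000, J))
for b in range(1000):
    idx = rng.integers(0, n, n)                   # resample subjects with replacement
    _, A_b, _ = first_svd(Xs[idx], Ys[idx])
    boot[b] = A_b * np.sign(A_b @ A1)             # fix the arbitrary sign of the singular vector
print("bootstrap standard errors of the saliences:", boot.std(0, ddof=1).round(3))
```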
PLS Applications in Behavioral Teratology Teratologists study the developmental consequences of chemical exposures and other measurable aspects of the intrauterine and developmental environment. In human studies, the magnitude of teratological exposures cannot be experimentally controlled. Analysis by PLS as presented here rewards the investigator who conscientiously observes a multiplicity of alternative exposure measures along with a great breadth of behavioral outcomes. This strategy is appropriate whenever there is a plausible single factor (‘latent brain damage’) underlying a wide range of child outcomes, and a corresponding ranking of children by extent of deficit. Streissguth et al. [18] and Bookstein et al. [3] discuss the PLS methodology in detail as applied to this type of ‘dose-response’ behavioral teratology study. The application here is a simple demonstration relating prenatal alcohol exposure to an assessment of IQ at 21 years of age. See [15] for a similar application using IQ data at 7 years of age. Longitudinal PLS analyses are discussed in [18]. Sample. During 1974–1975, 1529 women were interviewed in their fifth month of pregnancy regarding health habits. Most were married (88%) and had graduated from high school (again 88%). Approximately 500 of these generally low risk families, chosen by participation in prenatal care and by prenatal maternal report of alcohol use during pregnancy, have been followed as part of a longitudinal prospective study of child development [18]. Follow-up examinations of morphological, neurocognitive, motor, and other real life outcomes were conducted at 8 months, 18 months, 4 years, 7 years, 14 years and 21 years. Thirteen measures of prenatal alcohol exposure were computed from the prenatal maternal interviews.
Table 1   Two-block PLS analysis: prenatal alcohol exposure by WAIS IQ

Prenatal alcohol measure    Salience    Standard error
AAP                         0.23        0.06
AAD                         0.10        0.07
MOCCP                       0.13        0.06
MOCCD                       0.04        0.07
DOCCP                       0.42        0.04
DOCCD                       0.33        0.04
BINGEP                      0.33        0.05
BINGED                      0.24        0.05
MAXP                        0.35        0.03
MAXD                        0.26        0.04
QFVP                        0.36        0.04
QFVD                        0.27        0.04
ORDEXC                      0.19        0.05

IQ subscale score    Salience    Standard error
INFOSS               0.34        0.08
DGSPSS               0.29        0.10
ARTHSS               0.52        0.07
SIMSS                0.09        0.12
PICCSS               0.24        0.11
PICASS               0.07        0.13
BLKDSS               0.46        0.07
OBJASS               0.33        0.08
DSYMSS               0.24        0.10

Note: 94% of summed squared covariance in the cross-covariance matrix is explained by this first pair of LV scores. The correlation between the Alcohol and IQ LV scores is −0.22. Salience is a bias-corrected bootstrap mean salience from 1000 bootstrap replications; Standard error is the bootstrap standard error. Prenatal alcohol measures: AA: ave. ounces of alcohol per day; MOCC: ave. monthly drinking occasions; DOCC: ave. drinks per drinking occasion; BINGE: indicator of at least 5 drinks on at least one occasion; MAX: maximum drinks on a drinking occasion; QFV: 5-point ordered quantity-frequency-variability scale; ORDEXC: 5-point a priori ranking of consumption for study design. 'P' and 'D' in alcohol measure names refer to the periods prior to recognition of pregnancy and during pregnancy, respectively.
These measures reflect the level and pattern of drinking, including binge drinking, during two time periods: the month or so prior to pregnancy recognition and mid-pregnancy (denoted by ‘P’ and ‘D’, respectively, in the variable names of Table 1). Each of these scores was gently (monotonically) transformed to approximately linearize their relationships with a net neurobehavioral outcome [16]. For purposes of illustration the analysis here considers just 9 subscales of the WAIS-R IQ test assessed at 21 years of age. A recurring theme in PLS analysis is that it is usually inadvisable to base analyses only on common summary scales, such as full-scale, ‘performance’ or ‘verbal’ IQ, regardless of the established reliability of these scales. The dimensions (or ‘scales’) of the multivariate outcome most relevant for characterizing the cross-block associations are not necessarily well-represented by the conventional summary scores. In this case the nine subscales are represented without further summarization as a block of the nine original raw WAIS scores.
Partial Least Squares for Exposure-behavior Covariances

Because of incommensurabilities among the exposure measures together with differences in variance among
the IQ measures, both the 13-variable prenatal alcohol exposure block Xj , j = 1, . . . , 13 and the 9-variable psychological outcome block Yk , k = 1, . . . , 9 were scaled to unit variance. In other words, this PLS derives from the singular-value decomposition of a 13 × 9 correlation matrix and results in a series of pairs of linear combinations, of which only the first will merit consideration here: (LV1X , LV1Y ). The A1 ’s, the coefficients of LV1X , jointly specify the pattern of maternal drinking behavior during pregnancy with greatest relevance for IQ functioning as a whole, in the sense of maximizing cross-block covariance. At the same time, the profile given by the B1 ’s, the coefficients of LV1Y is the IQ profile of greatest relevance for prenatal alcohol exposure. Table 1 summarizes the PLS analysis. The first singular value, 1.107, represents 94% of the total (summed) squared covariance in the cross-covariance matrix. The structure of this first dimension of the SVD is highly significant as the value 1.107 was not exceeded by the first singular value in any of 1000 random permutations of the rows of the alcohol block with respect to the IQ block. Almost all the prenatal alcohol measures have notable salience, but the two drinking frequency scores (AA: average ounces
of alcohol per day; MOCC: average monthly drinking occasions) have relatively low salience for IQ in comparison with the other binge-related measures, notably DOCC: average drinks per drinking occasion. The measures of drinking behavior prior to pregnancy recognition (P) are also somewhat more salient than the measures reflecting mid-pregnancy (D). The arithmetic (ARTH) and block design (BLKD) scales of the WAIS-R are clearly the most salient reflections of prenatal alcohol exposure. While the LV scores computed from these saliences (LV1Y) are very highly correlated with full-scale IQ (r = 0.978), this IQ LV score is notably more highly correlated with the Alcohol LV score (LV1X) than is full-scale IQ. Figure 1 presents a scatterplot of the Alcohol and IQ LV scores that correlate −0.22.

Figure 1   Scatterplot for the relationship of the IQ LV score and the Alcohol LV score, with linear and smooth regressions

In applications, when the relationship depicted in this figure is challenged by adjustment for other prenatal exposures, demographics, and postnatal environmental factors using conventional multiple regression analysis, alcohol retains a highly significant correlation with the IQ LV after adjustment, although it does
not explain as much variance in the IQ LV as is explained by demographic and socioeconomic factors.
Remarks In other applications, the PLS technique can focus the investigator’s attention on profiles of saliences (causal relevance) across aspects of measurement in multimethod studies (see Multitrait–Multimethod Analyses), can detect individual cases with extreme scores on one or more LV’s, can reveal subgrouping patterns by graphical enhancement of LV scatterplots that were carried out on covariance structures pooled over those subgroupings, and the like. PLS is not useful for theory-testing, as its permutation tests do not meet even the mild strictures of the DAG literature, let alone the standards of causation arising in the neighboring domains from which the multidisciplinary data typical of PLS applications tend to emerge. It takes its place alongside biplots and correspondence analysis in any toolkit of exploratory data analysis wherever empirical theory is weak and point
predictions of case scores or coefficients correspondingly unlikely.

References

[1] Anderson, T.W. (1999). Asymptotic distribution of the reduced rank regression estimator under general conditions, Annals of Statistics 27, 1141–1154.
[2] Barker, M. & Rayens, W. (2003). Partial least squares discrimination, Journal of Chemometrics 17, 166–173.
[3] Bookstein, F.L., Sampson, P.D. & Streissguth, A.P. (1996). Exploiting redundant measurement of dose and developmental outcome: new methods from the behavioral teratology of alcohol, Developmental Psychology 32, 404–415.
[4] Good, P.I. (2000). Permutation Tests: A Practical Guide to Resampling Methods for Testing Hypotheses, Springer, New York.
[5] Gottfries, J., Blennow, K., Wallin, A. & Gottfries, D.G. (1995). Diagnosis of dementias using partial least-squares discriminant-analysis, Dementia 6, 83–88.
[6] Hay, J.F., Kane, K.A., West, K. & Alain, D. (2002). Event-related neural activity associated with habit and recollection, Neuropsychologia 40, 260–270.
[7] Henningsson, M., Sundbom, E., Armelius, B.A. & Erdberg, P. (2001). PLS model building: a multivariate approach to personality test data, Scandinavian Journal of Psychology 42, 399–409.
[8] Higdon, R., Foster, N.L., Koeppe, R.A., DeCarli, C.S., Jagust, W.J., Clark, C.M., Barbas, N.R., Arnold, S.E., Turner, R.S., Heidebrink, J.L. & Minoshima, S. (2004). A comparison of classification methods for differentiating fronto-temporal dementia from Alzheimer's disease using FDG-PET imaging, Statistics in Medicine 23, 315–326.
[9] Jöreskog, K.G. & Wold, H., eds (1982). Systems Under Indirect Observation: Causality, Structure, Prediction, North Holland, New York.
[10] Martens, H. & Naes, T. (1989). Multivariate Calibration, Wiley, New York.
[11] Nestor, P.G., Barnard, J., O'Donnell, B.F., Shenton, M.E., Kikinis, R., Jolesz, F.A., Shen, Z., Bookstein, F.L. & McCarley, R.W. (1997). Partial least squares analysis of MRI and neuropsychological measures in schizophrenia, Biological Psychiatry 41(7 Suppl 1), 16S.
[12] Nguyen, D.V. & Rocke, D.M. (2004). On partial least squares dimension reduction for microarray-based classification, Computational Statistics & Data Analysis 46, 407–425.
[13] Pagès, J. & Tenenhaus, M. (2001). Multiple factor analysis combined with PLS path modeling. Application to the analysis of relationships between physicochemical variables, sensory profiles and hedonic judgements, Chemometrics and Intelligent Laboratory Systems 58, 261–273.
[14] Pearl, J. (2000). Causality: Models, Reasoning, and Inference, Cambridge University Press, New York.
[15] Sampson, P.D., Streissguth, A., Barr, H. & Bookstein, F. (1989). Neurobehavioral effects of prenatal alcohol. Part II. Partial least squares analyses, Neurotoxicology and Teratology 11, 477–491.
[16] Sampson, P.D., Streissguth, A.P., Bookstein, F.L. & Barr, H.M. (2000). On categorizations in analyses of alcohol teratogenesis, Environmental Health Perspectives 108(Suppl. 2), 421–428.
[17] Spirtes, P., Glymour, C. & Scheines, R. (2000). Causation, Prediction and Search, 2nd Edition, MIT Press, Cambridge.
[18] Streissguth, A.P., Bookstein, F.L., Sampson, P.D. & Barr, H.M. (1993). The Enduring Effects of Prenatal Alcohol Exposure on Child Development: Birth Through Seven Years, a Partial Least Squares Solution, The University of Michigan Press, Ann Arbor.
[19] Wegelin, J.A., Packer, A. & Richardson, T.S. (2005). Latent models for cross-covariance, Journal of Multivariate Analysis, in press.
[20] Wold, H. (1985). Partial least squares, in Encyclopedia of Statistical Sciences, Vol. 6, S. Kotz & N.L. Johnson, eds, Wiley, New York, pp. 581–591.
[21] Wold, S. (2001). Personal memories of the early PLS development, Chemometrics and Intelligent Laboratory Systems 58, 83–84.
PAUL D. SAMPSON AND FRED L. BOOKSTEIN
Path Analysis and Path Diagrams
STEVEN M. BOKER AND JOHN J. MCARDLE
Volume 3, pp. 1529–1531 in Encyclopedia of Statistics in Behavioral Science ISBN-13: 978-0-470-86080-9 ISBN-10: 0-470-86080-4 Editors Brian S. Everitt & David C. Howell John Wiley & Sons, Ltd, Chichester, 2005
Path Analysis and Path Diagrams

History

Path analysis was first proposed by Sewall Wright [12] as a method for separating sources of variance in skeletal dimensions of rabbits (see [10, 11], for historical accounts). He extended this method of path coefficients to be a general method for calculating the association between variables in order to separate sources of environmental and genetic variance [14]. Wright was also the first to recognize the principle on which path analysis is based when he wrote, 'The correlation between two variables can be shown to equal the sum of the products of the chains of path coefficients along all of the paths by which they are connected' [13, p. 115]. Path analysis was largely ignored in the behavioral and social sciences until Duncan [3] and later Goldberger [4] recognized and brought this work to the attention of the fields of econometrics and psychometrics. This led directly to the development of the first modern structural equation modeling (SEM) software tools ([5] (see Structural Equation Modeling: Software)). Path diagrams were also first introduced by Wright [13], who used them in the same way as they are used today; albeit using drawings of guinea pigs rather than circles and squares to represent variables (see [6], for an SEM analysis of Wright's data). Duncan argued that path diagrams should be '. . .isomorphic with the algebraic and statistical properties of the postulated system of variables. . .' ([3], p. 3). Modern systems for path diagrams allow a one-to-one translation between a path diagram and computer scripts that can be used to fit the implied structural model to data [1, 2, 9].
Path Diagrams

Path diagrams express the regression equations relating a system of variables using basic elements such as squares, circles, and single- or double-headed arrows (see Figure 1). When these elements are correctly combined in a path diagram the algebraic relationship between the variables is completely specified and the predicted covariance matrix between the measured variables can be unambiguously calculated [8]. The Reticular Action Model (RAM) method is discussed below as a recommended practice for constructing path diagrams.

Figure 1   Graphical elements composing path diagrams. a. Manifest (measured) variable. b. Latent (unmeasured) variable. c. Constant with a value of 1. d. Regression coefficient. e. Covariance between variables. f. Variance (total or residual) of a variable

Figure 2   Bivariate regression expressed as a path diagram (see Multivariate Multiple Regression)

Figure 2 is a path diagram expressing a bivariate regression equation with one outcome variable, y, such that

    y = b0 + b1 x + b2 z + e,   (1)

where e is a residual term with a mean of zero. There are four single-headed arrows pointing into y (with coefficients b0, b1, b2, and 1, the last being the fixed path from the residual e), and likewise four terms are added together on the right-hand side of (1). In a general linear model, one would use a column of ones to allow the estimation of the intercept b0. Here a constant variable is denoted by a triangle that maps onto that column of ones. For simplicity of presentation, the two predictor variables x and z in this example have means of zero. If x and z had nonzero means, then single-headed arrows would be drawn from the triangle to x and from the triangle to z. There are double-headed variance arrows for each of the variables x, z, and e on the right hand side of the equation, representing the variance of the predictor variables, and the residual variance, respectively. The constant has, by convention, a nonzero variance term
fixed at the value 1.0. While this double-headed arrow may seem counterintuitive since it is not formally a variance term, it is required in order to provide consistency to the path tracing rules described below. In addition, there is a double-headed arrow between x and z that specifies the potential covariance, Cxz, between the predictor variables.

Multivariate models expressed as a system of linear equations can also be represented as path diagrams. As an example, a factor model (see Factor Analysis: Confirmatory) with two correlated factors is shown in Figure 3. Each variable with one or more single-headed arrows pointing to it defines one equation in the system of simultaneous linear equations. For instance, one of the six simultaneous equations implied by Figure 3 is

    y = b1 F1 + uy,   (2)

where y is an observed score, b1 is a regression coefficient, F1 is an unobserved common factor score, and uy is an unobserved unique factor. In order to identify the scale for the factors F1 and F2, one path leading from each is fixed to a numeric value of 1. All of the covariances between the variables in path diagrams such as those in Figures 2 and 3 can be calculated using the rules of path analysis.
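As an illustration of how the RAM bookkeeping mentioned above turns a path diagram into an implied covariance matrix, the sketch below encodes the bivariate regression model of Figure 2; the numeric values of b1, b2 and of the variances and the covariance are arbitrary choices made only for the demonstration.

```python
import numpy as np

# RAM specification for the bivariate regression path model of Figure 2.
# Variable order: x, z, e (residual), y.  Means are ignored, so the constant
# (triangle) is omitted; it does not affect covariances.
b1, b2 = 0.5, 0.3                    # illustrative regression weights
Vx, Vz, Ve, Cxz = 2.0, 1.5, 1.0, 0.4 # illustrative variances and covariance

A = np.zeros((4, 4))                 # single-headed arrows: A[i, j] = path from j into i
A[3, 0] = b1                         # y <- x
A[3, 1] = b2                         # y <- z
A[3, 2] = 1.0                        # y <- e (residual path fixed at 1)

S = np.zeros((4, 4))                 # double-headed arrows: variances and covariances
S[0, 0], S[1, 1], S[2, 2] = Vx, Vz, Ve
S[0, 1] = S[1, 0] = Cxz

F = np.zeros((3, 4))                 # filter matrix picking out the manifest variables x, z, y
F[0, 0] = F[1, 1] = F[2, 3] = 1.0

I = np.eye(4)
implied = F @ np.linalg.inv(I - A) @ S @ np.linalg.inv(I - A).T @ F.T

print(np.round(implied, 3))          # implied covariance matrix of (x, z, y)
print("Cov(x, y) =", round(implied[0, 2], 3), "=", round(b1 * Vx + b2 * Cxz, 3))
```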
Figure 3   Simple structure confirmatory factor model expressed as a path diagram
Figure 4   Schematic of the rule for forming a bridge between two variables: exactly one double-headed arrow with zero or more single-headed arrows pointing away from each end towards the selected variable(s)
Path Analysis

The predicted covariance (or correlation) between any two variables v1 and v2 in a path model is the sum of all bridges between the two variables that satisfy the form shown in Figure 4 [7]. Each bridge contains one and only one double-headed arrow. From each end of the double-headed arrow leads a sequence of zero or more single-headed arrows pointing toward the variable of interest at each end of the bridge. All of the regression coefficients from the single-headed arrows in the bridge as well as the one variance or covariance from the double-headed arrow are multiplied together to form the component of covariance associated with that bridge. The sum of these components of covariance from all bridges between any two selected variables v1 and v2 equals the total covariance between v1 and v2. If a variable v is at both ends of the bridge, then each bridge beginning and ending at v calculates a component of the variance of v. The sum of all bridges between a selected variable v and itself calculates the total variance of v implied by the model. As an example, consider the covariance between x and y predicted by the bivariate regression model shown in Figure 2. There are two bridges between x and y. First, if the double-headed arrow is the variance Vx, then there is a length zero sequence of single-headed arrows from one end of Vx and pointing to x and a length one sequence of single-headed arrows leading from the other end of Vx and pointing to y. This bridge is illustrated in Figure 5-a and leads to a covariance component of Vx b1, the product of the coefficients used to form the bridge. Second, if the double-headed arrow is the covariance Cxz between x and z, then there is a length zero sequence of single-headed arrows from one end of Cxz leading to x and a length one sequence of single-headed arrows leading from the other end of Cxz to y. This bridge
is illustrated in Figure 5-b and results in a product Cxz b2. Thus the total covariance Cxy is the sum of the direct effect Vx b1 and the background effect Cxz b2:

    Cxy = b1 Vx + b2 Cxz.   (3)

Figure 5   Two bridges between the variables x and y: (a) the bridge through the variance Vx, (b) the bridge through the covariance Cxz

Path analysis is especially useful in gaining a deeper understanding of the covariance relations implied by a specified structural model. For instance, when a theory specifies mediating variables, one may work out all of the background covariances implied by the theory. Concepts such as suppression effects or the relationship between measurement interval and cross-lag effects can be clearly explained using a path analytic approach but may seem more difficult to understand from merely studying the equivalent algebra.
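The bridge arithmetic of Equation (3) can be checked by simulation; the sketch below uses the same illustrative parameter values as the earlier RAM sketch and compares the sum of the two bridge products with the empirical covariance of x and y.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000                                   # large sample so the empirical covariance is stable
b0, b1, b2 = 1.0, 0.5, 0.3                    # illustrative values, as in the earlier sketch
Vx, Vz, Ve, Cxz = 2.0, 1.5, 1.0, 0.4

# Draw (x, z) with the specified covariance structure, then generate y from (1).
cov_xz = np.array([[Vx, Cxz], [Cxz, Vz]])
x, z = rng.multivariate_normal([0.0, 0.0], cov_xz, size=n).T
e = rng.normal(0.0, np.sqrt(Ve), size=n)
y = b0 + b1 * x + b2 * z + e

bridge_a = b1 * Vx                            # bridge through the variance Vx (Figure 5-a)
bridge_b = b2 * Cxz                           # bridge through the covariance Cxz (Figure 5-b)

print("sum of bridges:     ", round(bridge_a + bridge_b, 3))
print("empirical Cov(x, y):", round(np.cov(x, y, ddof=1)[0, 1], 3))
```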
References

[1] Arbuckle, J.L. (1997). Amos User's Guide, Version 3.6, SPSS, Chicago.
[2] Boker, S.M., McArdle, J.J. & Neale, M.C. (2002). An algorithm for the hierarchical organization of path diagrams and calculation of components of covariance between variables, Structural Equation Modeling 9(2), 174–194.
[3] Duncan, O.D. (1966). Path analysis: sociological examples, The American Journal of Sociology 72(1), 1–16.
[4] Goldberger, A.S. (1971). Econometrics and psychometrics: a survey of communalities, Econometrica 36(6), 841–868.
[5] Jöreskog, K.G. (1973). A general method for estimating a linear structural equation system, in Structural Equation Models in the Social Sciences, A.S. Goldberger & O.D. Duncan, eds, Seminar, New York, pp. 85–112.
[6] McArdle, J.J. & Aber, M.S. (1990). Patterns of change within latent variable structural equation modeling, in New Statistical Methods in Developmental Research, A. von Eye, ed., Academic Press, New York, pp. 151–224.
[7] McArdle, J.J. & Boker, S.M. (1990). Rampath, Lawrence Erlbaum, Hillsdale.
[8] McArdle, J.J. & McDonald, R.P. (1984). Some algebraic properties of the reticular action model for moment structures, The British Journal of Mathematical and Statistical Psychology 87, 234–251.
[9] Neale, M.C., Boker, S.M., Xie, G. & Maes, H.H. (1999). Mx: Statistical Modeling, 5th Edition, Department of Psychiatry, Box 126 MCV, Richmond, 23298.
[10] Wolfle, L.M. (1999). Sewall Wright on the method of path coefficients: an annotated bibliography, Structural Equation Modeling 6(3), 280–291.
[11] Wolfle, L.M. (2003). The introduction of path analysis to the social sciences, and some emergent themes: an annotated bibliography, Structural Equation Modeling 10(1), 1–34.
[12] Wright, S. (1918). On the nature of size factors, The Annals of Mathematical Statistics 3, 367–374.
[13] Wright, S. (1920). The relative importance of heredity and environment in determining the piebald pattern of guinea-pigs, Proceedings of the National Academy of Sciences 6, 320–332.
[14] Wright, S. (1934). The method of path coefficients, The Annals of Mathematical Statistics 5, 161–215.
(See also Linear Statistical Models for Causation: A Critical Review; Structural Equation Modeling: Checking Substantive Plausibility; Structural Equation Modeling: Nontraditional Alternatives)
STEVEN M. BOKER AND JOHN J. MCARDLE
Pattern Recognition
LUDMILA I. KUNCHEVA AND CHRISTOPHER J. WHITAKER
Volume 3, pp. 1532–1535 in Encyclopedia of Statistics in Behavioral Science ISBN-13: 978-0-470-86080-9 ISBN-10: 0-470-86080-4 Editors Brian S. Everitt & David C. Howell John Wiley & Sons, Ltd, Chichester, 2005
Pattern Recognition

Pattern recognition deals with classification problems that we would like to delegate to a machine, for example, scanning for abnormalities in smear test samples, identifying a person by voice and a face image for security purposes, detecting fraudulent credit card transactions, and so on. Each object (test sample, person, transaction) is described by a set of p features and can be thought of as a point in some p-dimensional feature space. A classifier is a formula, algorithm or technique that outputs a class label for any collection of values of the p features submitted to its input. For designing a classifier, also called discriminant analysis, we use a labeled data set, Z, of n objects, where each object is described by its feature values and true class label. The fundamental idea used in statistical pattern recognition is Bayes decision theory [2]. The c classes are treated as random entities that occur with prior probabilities P(ωi), i = 1, . . . , c. The posterior probability of being in class ωi for an observed data point x is calculated using Bayes rule

    P(ωi|x) = p(x|ωi)P(ωi) / Σ(j=1..c) p(x|ωj)P(ωj),   (1)
where p(x|ωi) is the class-conditional probability density function (pdf) of x, given class ωi (see Bayesian Statistics; Bayesian Belief Networks). According to the Bayes rule, the class with the largest posterior probability is selected as the label of x. Ties are broken randomly. The Bayes rule guarantees the minimum misclassification rate. Sometimes the misclassifications cost differently for different classes. Then we can use a loss matrix [λij], where λij is a measure of the loss incurred if we assign class label ωi when the true label is ωj. The minimum risk classifier assigns x to the class with the minimum expected risk

    Rx(ωi) = Σ(j=1..c) λij P(ωj|x).   (2)
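A small numerical illustration of rules (1) and (2); the priors, the class-conditional density values, and the loss matrix are invented numbers chosen only to show the arithmetic, and to show that the two rules can disagree when the losses are asymmetric.

```python
import numpy as np

# Hypothetical three-class problem: priors, class-conditional densities at an
# observed point x, and a loss matrix (rows = assigned label, columns = true label).
priors = np.array([0.5, 0.3, 0.2])            # P(w_i)
densities = np.array([0.8, 1.5, 0.4])         # p(x | w_i) evaluated at the observed x
loss = np.array([[0.0, 1.0, 10.0],
                 [1.0, 0.0, 8.0],
                 [1.0, 1.0, 0.0]])            # lambda_ij: missing class 3 is made very costly

# Bayes rule (1): posterior probabilities P(w_i | x).
posteriors = priors * densities
posteriors /= posteriors.sum()

# Minimum-risk rule (2): expected loss of assigning each label.
risks = loss @ posteriors

print("posteriors:", posteriors.round(3), "-> Bayes label:", posteriors.argmax() + 1)
print("risks:     ", risks.round(3), "-> minimum-risk label:", risks.argmin() + 1)
```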
In general, the classifier output can be interpreted as a set of c degrees of support, one for each class (discriminant scores obtained through discriminant functions). We label x in the class with the largest support.
In practice, the prior probabilities and the class-conditional pdfs are not known. The pdfs can be estimated from the data using either a parametric or nonparametric approach. Parametric classifiers assume the form of the probability distributions and then estimate the parameters from Z. The linear and quadratic discriminant classifiers, which assume multivariate normal distributions as p(x|ωi), are commonly used (see Discriminant Analysis). Nonparametric classifier models include the k-nearest neighbor classifier (k-nn) and kernel classifiers (e.g., Parzen, support vector machines (SVM)). The k-nn classifier assigns x to the class most represented among the closest k neighbors of x. Instead of trying to estimate the pdfs and applying Bayes' rule, some classifier models directly look for the best discrimination boundary between the classes, for example, classification and regression trees, and neural networks. Figure 1 shows a two-dimensional data set with two banana-shaped classes. The dots represent the observed data points, and members of the classes are denoted by their color. Four classifiers are built, each one splitting the feature space into two classification regions. The boundary between the two regions is denoted by the white line. The linear discriminant classifier results in a linear boundary, while the quadratic discriminant classifier results in a quadratic boundary. While both these models are simple, for this difficult data set, neither is adequate. Here, the greater flexibility afforded by the 1-nn and neural network models results in more accurate but complicated boundaries. Also, the class regions found by 1-nn and the neural network may not be connected sets of points, which is not possible for the linear and quadratic discriminant classifiers.

Figure 1   Classification regions for two banana-shaped classes found by four classifiers (a) Linear, (b) Quadratic, (c) 1-nn, (d) Neural Network
Classifier Training and Testing In most real life problems we do not have a ready made classification algorithm. Take for example, a classifier that recognizes expressions of various feelings from face images. We can only provide a rough guidance in a linguistic format and pick out features that we believe are relevant for the task. The classifier has to be trained by using a set of labeled examples. The training depends on the classifier model. The nearest neighbor classifier (1-nn) does not require any training; we can classify a new data point
right away by finding its nearest neighbor in Z. On the other hand, the lack of a suitable training algorithm for neural networks was the cause for their dormant period between 1940s and 1980s (the error backpropagation algorithm revitalized their development). When trained, some classifiers can provide us with an interpretable decision strategy (e.g., tree models and k-nn) whereas other classifiers behave as black boxes (e.g., neural networks). Even when we can verify the logic of the decision making, the ultimate judge of the classifier performance is the classification error. Estimating the misclassification rate of our classifier is done through the training protocol. Part of the data set, Z, is used for training and the remaining part is left for testing. The most popular training/testing protocol is cross-validation. Z is divided into K approximately equal parts, one is left for testing, and the remaining K − 1 are pooled as the training set. This process is repeated K times (K-fold crossvalidation) leaving aside a different part each time.
The error of the classifier is the averaged testing error across the K testing parts.
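A bare-bones version of this K-fold protocol, written with NumPy only and using a 1-nn classifier on simulated two-class data; the data set and the value of K are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulated two-class data set Z: 100 points, 2 features, labels 0/1.
n = 100
X = np.vstack([rng.normal(0, 1, (n // 2, 2)), rng.normal(1.5, 1, (n // 2, 2))])
y = np.repeat([0, 1], n // 2)

def one_nn_predict(X_train, y_train, X_test):
    """1-nn: label each test point by its nearest training point (Euclidean distance)."""
    d = ((X_test[:, None, :] - X_train[None, :, :]) ** 2).sum(-1)
    return y_train[d.argmin(axis=1)]

# K-fold cross-validation: each fold is left out once for testing.
K = 10
folds = np.array_split(rng.permutation(n), K)
errors = []
for test_idx in folds:
    train_idx = np.setdiff1d(np.arange(n), test_idx)
    pred = one_nn_predict(X[train_idx], y[train_idx], X[test_idx])
    errors.append(np.mean(pred != y[test_idx]))

print("estimated error rate:", round(float(np.mean(errors)), 3))
```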
Variable Selection Not all features are important for the classification task. Classifiers may perform better with fewer features. This is a paradox from an information-theoretic point of view. Its explanation lies in the fact that the classifiers that we use and the parameter estimates that we calculate are imperfect; therefore, some of the supposed information is actually noise to our model. Feature selection reduces the original feature set to a subset without adversely affecting the classification performance. Feature extraction, on the other hand, is a dimensionality reduction approach whereby all initial features are used and a small amount of new features are calculated from them (e.g., principal component analysis, projection pursuit, multidimensional scaling).
There are two major questions in feature selection: what criterion should we use to evaluate the subsets? and how do we search among all possible subset-candidates? Since the final goal is to have an accurate classifier, the most natural choice of a criterion is the minimum error of the classifier built on the subset-candidate. Methods using a direct estimate of the error are called wrapper methods. Even with modern computational technology, training a classifier and estimating its error for each examined subset of features might be prohibitive. An alternative class of feature selection methods, where the criterion is indirectly related to the error, is called filter methods. Here the criterion used is a measure of discrimination between the classes, for example, the Mahalanobis distance between the class centroids. For large p, checking all possible subsets is often not feasible. There are various search algorithms, the simplest of which are the sequential forward selection (SFS) and the sequential backward selection (SBS). In SFS, we start with the single best feature (according to the chosen criterion) and add one feature at a time. The second feature to enter the selected subset will be the feature that makes the best pair with the feature already selected. The third feature is chosen so that it makes the best triple containing the already selected two features, and so on. In SBS, we start with the whole set of features and remove the single feature which gives the best remaining subset of p − 1 features. Next we remove the feature that results in the best remaining subset of p − 2 features, and so on. SFS and SBS, albeit simple, have been found to be surprisingly robust and accurate. A modification of these is the floating search feature selection algorithm, which leads to better results at the expense of an expanded search space. Feature selection is an art rather than science as it relies on heuristics, intuition, and domain knowledge. Among many others, genetic algorithms have been applied for feature selection with various degrees of success.
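Sequential forward selection is straightforward to sketch once a criterion is fixed. In the illustration below the criterion is a simple filter-style class-separation score (squared distance between class means over the candidate subset); a wrapper method would substitute a cross-validated error estimate. The data are simulated and the criterion is only one of many possible choices.

```python
import numpy as np

rng = np.random.default_rng(4)

# Illustrative two-class data with p = 8 features, only the first two informative.
n, p = 200, 8
y = np.repeat([0, 1], n // 2)
X = rng.normal(size=(n, p))
X[y == 1, :2] += 1.0

def criterion(feature_subset):
    """Filter criterion: squared distance between class means, summed over the subset."""
    Xa, Xb = X[y == 0][:, feature_subset], X[y == 1][:, feature_subset]
    return float(((Xa.mean(0) - Xb.mean(0)) ** 2).sum())

def sfs(n_select):
    """Sequential forward selection: add the single best feature at each step."""
    selected = []
    while len(selected) < n_select:
        remaining = [j for j in range(p) if j not in selected]
        scores = {j: criterion(selected + [j]) for j in remaining}
        selected.append(max(scores, key=scores.get))
    return selected

print("features chosen by SFS:", sfs(3))
```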
Cluster Analysis

In some problems, the class labels are not defined in advance. Then, the problem is to find a class structure in the data set, if there is any. The number of clusters is usually not specified in advance, which makes the problem even more difficult. If we guess wrongly, we may impose a structure onto a data set that does not
have one or may fail to discover an existing structure. Cluster analysis procedures can be roughly grouped into hierarchical and iterative optimization methods (see Hierarchical Clustering; k -means Analysis).
Classifier Ensembles Instead of using a single classifier, we may combine the outputs of several classifiers in an attempt to reach a more accurate or reliable solution. At this stage, there are a large number of methods, directions, and paradigms in designing classifier ensembles but there is no agreed taxonomy for this relatively young area of research. Fusion and Selection. In classifier fusion, we assume that all classifiers are ‘experts’ across the whole feature space, and therefore their votes are equally important for any x. In classifier selection, first ‘an oracle’ or meta-classifier decides whose region of competence x is in, and then the class label of the nominated classifier is taken as the ensemble decision. Decision Optimization and Coverage Optimization. Decision optimization refers to ensemble construction methods that are primarily concerned with the combination rule assuming that the classifiers in the ensembles are given. Coverage optimization looks at building the individual classifiers assuming a fixed classification rule. Five simple combination rules (combiners) are illustrated below. Suppose that there are 3 classes and 7 classifiers whose outputs for a particular x (discriminant scores) are organized in a 7 by 3 matrix (called a decision profile for x) as given in Table 1. The overall support for each class is calculated by applying a simple operation to the discriminant scores for that class only. The majority vote operates on the label outputs of the seven classifiers. Each possible class label occurs as the final ensemble label in the five shaded cells. This shows the flexibility that we have in choosing the combination rule for the particular problem. We can consider the classifier outputs as new features, disregard their context as discriminant scores, and use these features to build a classifier. We can thus build hierarchies of classifiers (stacked generalization). There are many combination methods
Table 1   An ensemble of seven classifiers, D1, . . . , D7: the decision profile for x and five combination rules

             Support for ω1    Support for ω2    Support for ω3    Label output
D1           0.24              0.44              0.56              ω3
D2           0.17              0.13              0.59              ω3
D3           0.22              0.32              0.86              ω3
D4           0.17              0.40              0.49              ω3
D5           0.27              0.77              0.45              ω2
D6           0.51              0.90              0.06              ω2
D7           0.29              0.46              0.03              ω2
Minimum      0.17              0.13              0.03              ω1
Maximum      0.51              0.90              0.86              ω2
Average      0.27              0.49              0.43              ω2
Product      0.0001            0.0023            0.0001            ω2
Majority     –                 –                 –                 ω3
proposed in the literature that involve various degrees of training. Within the coverage optimization group are bagging, random forests and boosting. Bagging takes L random samples (with replacement) from Z and builds one classifier on each sample. The ensemble decision is made by the majority vote. The success of bagging has been explained by its ability to reduce the variance of the classification error of a single classifier model. Random forests are defined as a variant of bagging such that the production of the individual classifiers depends on a random parameter and independent sampling. A popular version of random forests is an ensemble where each classifier is built upon a random subset of features, sampled with replacement from the initial feature set. A boosting algorithm named AdaBoost has been found to be even more successful than bagging. Instead of drawing random bootstrap samples, AdaBoost designs the ensemble members one at a time, based on the performance of the previous member. A set of weights is maintained across the data points in Z. In the resampling version of AdaBoost, the weights are taken as probabilities. Each training sample is drawn using the weights. A classifier is built on
the sampled training set, and the weights are modified according to its performance. Points that have been correctly recognized get smaller weights and points that have been misclassified get larger weights. Thus difficult-to-classify objects will have more chance to be picked in the subsequent training sets. The procedure stops at a predefined number of classifiers (e.g., 50). The votes are combined by a weighted majority vote, where the classifier weight depends on its error rate. AdaBoost has been proved to reduce the training error to zero. In addition, when the training error does reach zero, which for most classifier models means that they have been overtrained and the testing error might be arbitrarily high, AdaBoost 'miraculously' keeps reducing the testing error further. This phenomenon has been explained by the margin theory but with no claim about a global convergence of the algorithm. AdaBoost has been found to be more sensitive to noise in the data than bagging. Nevertheless, AdaBoost has been declared by Leo Breiman to be the 'most accurate available off-the-shelf classifier' [1]. Pattern recognition is related to artificial intelligence and machine learning. There is renewed interest in this topic as it underpins applications in modern domains such as data mining, document classification, financial forecasting, organization and retrieval of multimedia databases, microarray data analysis (see Microarrays), and many more [3].
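The five simple combiners of Table 1 are column-wise operations on the decision profile; the sketch below reproduces the table's ensemble decisions from the seven classifiers' support values.

```python
import numpy as np

# Decision profile for x from Table 1: rows are classifiers D1..D7, columns are
# the supports for classes w1, w2, w3.
DP = np.array([[0.24, 0.44, 0.56],
               [0.17, 0.13, 0.59],
               [0.22, 0.32, 0.86],
               [0.17, 0.40, 0.49],
               [0.27, 0.77, 0.45],
               [0.51, 0.90, 0.06],
               [0.29, 0.46, 0.03]])

combiners = {
    "Minimum": DP.min(axis=0),
    "Maximum": DP.max(axis=0),
    "Average": DP.mean(axis=0),
    "Product": DP.prod(axis=0),
}
for name, support in combiners.items():
    print(f"{name:8s} supports {support.round(4)} -> class w{support.argmax() + 1}")

# Majority vote operates on the label outputs (each classifier's own argmax).
labels = DP.argmax(axis=1) + 1
votes = np.bincount(labels, minlength=4)[1:]
print(f"Majority vote: labels {labels} -> class w{votes.argmax() + 1}")
```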
References

[1] Breiman, L. (1998). Arcing classifiers, The Annals of Statistics 26(3), 801–849.
[2] Duda, R.O., Hart, P.E. & Stork, D.G. (2001). Pattern Classification, 2nd Edition, John Wiley & Sons, New York.
[3] Jain, A.K., Duin, R.P.W. & Mao, I. (2000). Statistical pattern recognition: a review, IEEE Transactions on Pattern Analysis and Machine Intelligence 22(1), 4–37.
LUDMILA I. KUNCHEVA AND CHRISTOPHER J. WHITAKER
Pearson, Egon Sharpe BRIAN S. EVERITT Volume 3, pp. 1536–1536 in Encyclopedia of Statistics in Behavioral Science ISBN-13: 978-0-470-86080-9 ISBN-10: 0-470-86080-4 Editors Brian S. Everitt & David C. Howell John Wiley & Sons, Ltd, Chichester, 2005
Pearson, Egon Sharpe

Born: August 11, 1895, in Hampstead, London.
Died: June 12, 1980, in Midhurst, Sussex.
Egon Pearson was born in Hampstead, London in 1895, the only son of Karl Pearson, the generally acknowledged founder of biometrics and one of the principal architects of the modern theory of mathematical statistics. The younger Pearson began to read mathematics at Trinity College, Cambridge in 1914, but his undergraduate studies were interrupted first by a severe bout of influenza and then by war work. It was not until 1920 that Pearson received his B.A. after taking the Military Special Examination in 1919, which had been set up to cope with those who had had their studies disrupted by the war. In 1921, Pearson took up a position as lecturer in his father’s Department of Applied Statistics at University College. But it was not until five years later in 1926 that Karl Pearson allowed his son to begin lecturing in the University and only then because his own health prevented him from teaching. During these five years, Egon Pearson did manage to produce a stream of high-quality research publications on statistics and also became an assistant editor of Biometrika. In 1925, Jerzy Neyman joined University College as a Rockefeller Research Fellow and befriended the rather introverted Egon. The result was a joint research project on the problems of hypothesis testing that eventually led to a general approach to statistical and scientific problems known as Neyman–Pearson inference. Egon Pearson also collaborated with W. S. Gossett, whose ideas played a major part in Pearson’s own discussions with
Neyman, and in drawing his attention to the important topic of robustness. In 1933, Karl Pearson retired from the Galton Chair of Statistics and, despite his protests, his Department was split into two separate parts, one (now the Department of Eugenics) went to R. A. Fisher, the other, a Department of Applied Statistics, went to Egon Pearson. Working in the same institute as Fisher, a man who had attacked his father aggressively, was not always a comfortable experience for Egon Pearson. In 1936, Karl Pearson died and his son became managing editor of Biometrika, a position he continued to hold until 1966. During the Second World War, Egon Pearson undertook war work on the statistical analysis of the fragmentation of shells hitting aircraft, work for which he was awarded a CBE in 1946. It was during this period that he became interested in the use of statistical methods in quality control. In 1955 Pearson was awarded the Gold Medal of the Royal Statistical Society and served as its President from 1955 to 1956. His presidential address was on the use of geometry in statistics. In 1966, Egon Pearson was eventually elected Fellow of the Royal Society. In the 1970s, Pearson worked on the production of an annotated version of his father’s lectures on the early history of statistics, which was published in 1979 [1] just before his death in 1980.
Reference

[1] Pearson, E.S., ed. (1979). The History of Statistics in the Seventeenth and Eighteenth Centuries, Macmillan Publishers, New York.
BRIAN S. EVERITT
Pearson, Karl MICHAEL COWLES Volume 3, pp. 1536–1537 in Encyclopedia of Statistics in Behavioral Science ISBN-13: 978-0-470-86080-9 ISBN-10: 0-470-86080-4 Editors Brian S. Everitt & David C. Howell John Wiley & Sons, Ltd, Chichester, 2005
Pearson, Karl

Born: March 27, 1857, in London, UK.
Died: April 27, 1936, in Surrey, UK.
In the dozen or so years that spanned the turn of the nineteenth into the twentieth century, Karl Pearson constructed the modern discipline of statistics. His goal was to make biology and heredity quantitative disciplines and to establish a scientific footing for a eugenic social philosophy. Pearson entered Kings College, Cambridge, in 1876 and graduated in mathematics third in his year (third wrangler) in 1879. In the early 1880s, Carl with a ‘C’ gave way to Karl with a ‘K’. There is no evidence for the claim that the change was due to his espousal of Socialism and in honor of Marx. The spelling of his name had varied since childhood. After Cambridge, he studied in Germany and then began a search for a job. There was a brief consideration of a career in law and he was called to the Bar, after his second attempt at the Bar examinations, in 1882. In 1880, Pearson published, under the pen name ‘Loki’, The New Werther, followed in 1882 by The Trinity, a 19th Century Passion Play – works that showed that he was not a poet! During the first half of the 1880s, he gave lectures on ethics, free thought, socialism, and German history and culture. He was the inspiration and a founder of a club formed to discuss women’s issues and through it met his wife, Maria Sharpe. In 1884, he was appointed Goldsmid Professor of Applied Mathematics and Mechanics at University College, London. Pearson’s meeting his friend and colleague Walter Weldon steered Pearson away from his work in applied mathematics toward a new discipline of biometrics. Later, he became the first professor of National Eugenics. He was the editor of Biometrika, a journal founded with the assistance of Francis Galton, for the whole of his life. A dispute over the degrees of freedom for chisquare is one of the more bizarre controversies that Pearson spent much fruitless time on. It had been pointed out to Pearson that the df to be applied when the expected frequencies in a contingency table were constrained by the marginal totals were not to be calculated in the same way as they are when
the expected frequencies were known a priori as in the goodness of fit test (see Goodness of Fit for Categorical Variables). Raymond Pearl also raised problems about the test, compounding his heresy by doing it in the context of Mendelian theory, a view that Pearson opposed. The df point was also raised by Fisher in 1916. Pearson attempted to argue that the aim was to find the lowest chi-square for a given grouping rather than its p level, thus, to be most charitable, clouding the issue, or, to be much less charitable, misunderstanding his own test. The well-known deviation score formula for the correlation coefficient was shown by Pearson to be the best in a 1896 paper, which includes a section on the mathematical foundations of correlation [2, 3]. Here, he acknowledges the work of Bravais and of Edgeworth. Twenty-five years later, he repudiated these statements, emphasizing his own work and that of Galton. Even in this, his seminal contribution, a row with Fisher over the probability distribution of r ensued. Pearson and his colleagues embarked on the laborious task of calculating the ordinates of the frequency curves for r for various df . Fisher arrived at a much simpler method using his r to t (or r to z) transformation, but Pearson could not or would not accept it. Pearson’s most important contributions were produced before Weldon’s untimely death in 1906. His book The Grammar of Science [1] has not received the attention it deserves. His contribution to statistics and the generous help that he gave many students has to be and will continue to be acknowledged, but his work was weakened by controversy.
References

[1] Pearson, K. (1892). The Grammar of Science, Walter Scott, London.
[2] Pearson, K. (1896). Mathematical contributions to the theory of evolution. III. Regression, heredity and panmixia, Philosophical Transactions of the Royal Society, Series A 187, 253–318.
[3] Pearson, K. (1900). On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling, Philosophical Magazine, 5th series, 50, 157–175.
MICHAEL COWLES
Pearson Product Moment Correlation DIANA KORNBROT Volume 3, pp. 1537–1539 in Encyclopedia of Statistics in Behavioral Science ISBN-13: 978-0-470-86080-9 ISBN-10: 0-470-86080-4 Editors Brian S. Everitt & David C. Howell John Wiley & Sons, Ltd, Chichester, 2005
Pearson Product Moment Correlation
Pearson's product moment correlation, r, describes the strength of the linear relation between two metric (interval or ratio) variables (see Scales of Measurement) such as height and weight. A high magnitude of r indicates that the squared distance of the points from the best fitting straight line on a scatterplot (plot of Y v X or X v Y) is small. The statistic r is a measure of association (see Measures of Association) and does NOT imply causality, in either direction. For valid interpretation of r, the variables should have a bivariate normal distribution (see Catalogue of Probability Density Functions), and a linear relation, as in Panels (a) and (b) of Figure 1. The slopes of linear regressions of standardized Y scores on standardized X scores, and of standardized X scores on standardized Y scores generate r (see Standardized Regression Coefficients). Steeper slopes (on a standardized scale) correspond to larger magnitudes of r. (A variable is standardized by a linear transformation to a new variable with mean zero and standard deviation 1). r² gives the proportion of variance (variability) accounted for by the linear relation (see R-squared, Adjusted R-squared). An r of 0.3, as frequently found between personality measures, only accounts for 10% of the variability.

Figure 1 provides scatterplots of some possible relations between X and Y, and the associated values of r and r². Panel (a) shows a moderate positive linear correlation, while Panel (b) shows a strong negative linear correlation. Panel (c) shows no correlation. In Panel (d), r is also near zero. The relationship is strong, but nonlinear, as might occur between stress and performance. Panel (e) shows a negative nonlinear relation, as might occur if Y is light intensity needed to see and X is time in the dark. The relationship is negative (descending), but nonlinear. Panel (f) has the same data as Panel (c), with an added outlier at (29, 29). r is dramatically changed, from −0.009 to 0.56.

Figure 1   Scatterplots of Y against X for a variety of relationships and correlation values (panel r values: (a) +0.46, (b) −0.98, (c) −0.009, (d) +0.06, (e) −0.82, (f) +0.56). The correlation coefficient, r, should never be interpreted without a scatterplot
Calculation

The population value of r for N pairs of points (X, Y) is given by:

    r = [ΣXY − (ΣX)(ΣY)/N] / √{[ΣX² − (ΣX)²/N] [ΣY² − (ΣY)²/N]}.   (1)

Equation (1), applied to a sample, gives a biased estimate of the population value. The adjusted correlation, radj, gives an unbiased estimate [1] (see R-squared, Adjusted R-squared):

    radj = √{1 − (1 − r²)(N − 1)/(N − 2)}   or   radj = r √{1 − (1 − r²)/[(N − 2) r²]}.   (2)

The value of r overestimates radj, often with large bias (e.g., 75% overestimate for radj = 0.10 with N = 50). So, the value radj should be reported (although reporting r is all too common).
Confidence Limits and Hypothesis Testing
The sampling distribution of r approaches normality only for very large N. Equation (3) gives a statistic that is t-distributed with N − 2 degrees of freedom when the population correlation is zero [1]; it should be used to test the null hypothesis that there is no association:
t = \frac{r\sqrt{N - 2}}{\sqrt{1 - r^2}}.   (3)
It might be argued that r_adj should be used in (3); however, standard statistical packages (e.g., SPSS, JMP-IN) use r. Confidence intervals and hypotheses about values of r other than 0 can be obtained from the Fisher transformation, which gives a normally distributed statistic, z_r, with variance 1/(N − 3) [1, 2]:
z_r = \tfrac{1}{2}\ln\frac{1 + r}{1 - r}.   (4)
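A short sketch of the two procedures just described, using the illustrative values r = 0.30 and N = 52 (these numbers are assumptions, not taken from the article); scipy supplies the t distribution.

```python
import math
from scipy import stats

r, n = 0.30, 52   # illustrative values

# t test of H0: no association, from (3)
t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)
p = 2 * stats.t.sf(abs(t), df=n - 2)

# 95% confidence interval for r via the Fisher transformation (4)
z = 0.5 * math.log((1 + r) / (1 - r))
se = 1 / math.sqrt(n - 3)
ci = tuple(math.tanh(z + c * se) for c in (-1.96, 1.96))  # back-transform to the r scale

print(round(t, 2), round(p, 4), [round(v, 3) for v in ci])
```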
References
[1] Howell, D.C. (2004). Fundamental Statistics for the Behavioral Sciences, 5th Edition, Duxbury Press, Pacific Grove.
[2] Sheskin, D.J. (2000). Handbook of Parametric and Nonparametric Statistical Procedures, 2nd Edition, Chapman & Hall, London.
DIANA KORNBROT
Percentiles DAVID CLARK-CARTER Volume 3, pp. 1539–1540 in Encyclopedia of Statistics in Behavioral Science ISBN-13: 978-0-470-86080-9 ISBN-10: 0-470-86080-4 Editors Brian S. Everitt & David C. Howell John Wiley & Sons, Ltd, Chichester, 2005
Percentiles
By definition, percentiles are quantiles that divide the set of numbers into 100 equal parts. As an example, the 10th percentile is the number that has 10% of the set below it, while the 90th percentile is the number that has 90% of the set below it. Moreover, the 25th percentile is the first quartile (Q1), the 50th percentile is the second quartile (the median, Q2), and the 75th percentile is the third quartile (Q3). The general method for finding percentiles is the same as that for quantiles. However, suppose that we have the reading ages for a class of eight-year-old children, as shown in column 2 of Table 1, and we want to ascertain each child's percentile position or rank. One way of doing this is to find the cumulative frequency for each child's reading score, then convert these to cumulative proportions by dividing each by the sample size, and then convert these proportions into cumulative percentages, as shown in columns 3–5 of Table 1. From the table, we can see that the first three children are in the 20th percentile for reading age. If we had the norms for the test, we would be able to find each child's percentile point for the population. We note too that the median reading age (50th percentile) for the children is between 101 and 103 months.
Certain percentiles such as the 25th, 50th, and 75th (and in some cases the 10th and 90th) are represented on box plots [1].
Table 1 Reading ages in months of a class of eight-year-old children with the associated cumulative frequencies, proportions, and percentages
Child   Reading age (months)   Cumulative frequency   Cumulative proportion   Cumulative percentage
1       84      1      0.07   6.67
2       87      2      0.13   13.33
3       90      3      0.20   20.00
4       94      4.5    0.30   30.00
5       94      4.5    0.30   30.00
6       100     6      0.40   40.00
7       101     7      0.47   46.67
8       103     8      0.53   53.33
9       104     9.5    0.63   63.33
10      104     9.5    0.63   63.33
11      106     11     0.73   73.33
12      107     12     0.80   80.00
13      108     13     0.87   86.67
14      111     14     0.93   93.33
15      112     15     1.00   100.00
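The derived columns of Table 1 can be reproduced with a few lines of code; the sketch below uses the reading ages above and assigns mid-ranks to tied scores.

```python
ages = [84, 87, 90, 94, 94, 100, 101, 103, 104, 104,
        106, 107, 108, 111, 112]
n = len(ages)

for age in ages:
    below = sum(a < age for a in ages)
    ties = sum(a == age for a in ages)
    cum_freq = below + (ties + 1) / 2          # mid-rank for tied scores
    prop = cum_freq / n
    print(age, cum_freq, round(prop, 2), round(100 * prop, 2))
```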
Reference [1]
Cleveland, W.S. (1985). The Elements of Graphing Data, Wadsworth, Monterey, California.
DAVID CLARK-CARTER
Permutation Based Inference CLIFFORD E. LUNNEBORG Volume 3, pp. 1540–1541 in Encyclopedia of Statistics in Behavioral Science ISBN-13: 978-0-470-86080-9 ISBN-10: 0-470-86080-4 Editors Brian S. Everitt & David C. Howell John Wiley & Sons, Ltd, Chichester, 2005
Permutation Based Inference Permutation-based inference had its origins in the 1930s in the seminal efforts of R. A. Fisher [2] and E. J. G. Pitman [6, 7]. Fisher described an alternative to the matched-pairs t Test that made no assumption about the form of the distribution sampled while Pitman titled his articles extending permutation inference to the two-sample, multiple-sample, and correlation contexts as ‘significance tests which may be applied to samples from any population.’ The idea behind permutation tests is best introduced by way of an example. Let y: (18, 21, 25, 30, 31, 32) be a random sample of size six from a distribution, Y , of body-mass-indices (BMI) for women in the freshman class at Cedar Grove High School and z: (19, 20, 22, 23, 28, 35) be a second random sample drawn from the BMI distribution, Z, for men in that class. The null hypothesis is that the two distributions, Y and Z are identical. This is the same null hypothesis underlying the independent-samples t Test, but without the assumption that Y and Z are normally distributed. Under this null hypothesis, the observations in the two samples are said to be exchangeable with one another [3]. For example, the BMI of 18 in the women’s sample is just as likely to have turned up in the sample from the men’s distribution, while the BMI of 35 in the men’s sample is just as likely to have turned up in the sample from the women’s distribution; that is, under the null hypothesis, a women’s sample of y: (21, 25, 30, 31, 32, 35) and a men’s sample of z: (18, 19, 20, 22, 23, 28) are just as likely as the samples actually observed. These, of course, are not the only arrangements of the observed BMIs that would be possible under the null hypothesis. Every permutation of the 12 indices, 6 credited to the women’s distribution and 6
to the men’s distribution, can be attained by a series of pairwise exchanges between the two samples, and, hence, has the same chance of having been observed. For our example, there are [12!/(6! × 6!)] = 924 arrangements of the observed BMIs, each arrangement equally likely under the null hypothesis. A permutation test results when we are able to say whether the observed arrangement of the data points is unlikely under the null hypothesis. To pursue the analogy with the two-sample t Test, we ask whether a difference in sample means as large as that observed, Mean(y) − Mean(z) = 1.66667,
(1)
is unlikely under the null hypothesis. The answer depends on where the value of 1.66667 falls in the reference distribution made up of the differences in mean computed for each of the 924 permutations of the BMIs. In fact, 307 of the 924 mean differences are greater than or equal to 1.6667 and another 307 are smaller than or equal to −1.6667. Taking the alternative hypothesis to be nondirectional, we can assign a P value of (614/924) = 0.6645 to the test. There is no evidence here that the two samples could not have been drawn from identical BMI distributions. The permutation-test decision to reject or not reject the null hypothesis depends only upon a distribution generated by the data at hand. It does not rely upon unverifiable assumptions about the entire population. If the observations are exchangeable, the resulting P value will be exact. Moreover, permutation tests can be applied to data derived from finite as well as infinite populations [6]. The reference distribution of a permutation test statistic is not quite the same thing as the sampling distribution we associate with, say, the normaltheory t Test. The latter is based on all possible pairs of independent samples, six observations each from identical Y and Z distributions, while the former considers only those pairs of samples arising from the permutation of the observed data points. Because of this rooting in the observed data, the permutation test is said to be a conditional test. The P value in our example was computed moreor-less instantaneously by the perm.test function from an R package for exact rank tests (www.Rproject.org). In the 1930s, of course, enumerating the 924 possible permutations and tabulating the resulting test statistics would have been a prodigious if not prohibitive computational task. In fact,
2
Permutation Based Inference
the thrust of [2, 6, 7] was that the permutation test P value could be approximated by the P value associated with a parametric test. Fisher [2] summarized his advice to use the parametric approximation this way: ‘(although) the statistician does not carry out this very tedious process, his conclusions have no justification beyond the fact they could have been arrived at by this very elementary method.’ Today, we have computing facilities, almost universally available, that are fast and inexpensive enough to make it no longer necessary or desirable to substitute an approximate test for an exact one. Permutation tests offer more than distribution-free alternatives to established parametric tests. As noted by Bradley [1], the real strength of permutation-based inference lies in the ability to choose a test statistic best suited to the problem at hand, rather than be limited by the ability of precalculated tables. An excellent example lies in the analysis of k-samples, where permutation-based tests are available for both ordered and unordered categories as well as for a variety of loss functions [3–5]. Permutation-based inference has been applied successfully to correlation (see Pearson Product Moment Correlation), contingency tables, k-sample comparisons, multivariate analysis, with censored,
missing, or repeated measures data, and to situations where parametric tests do not exist. The author gratefully acknowledges the assistance of Phillip Good in the preparation of this article.
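The enumeration described above is straightforward to reproduce; the sketch below lists all 924 reallocations of the 12 BMIs and recomputes the two-sided P value (the article reports 614/924 ≈ 0.66).

```python
from itertools import combinations

y = [18, 21, 25, 30, 31, 32]   # women's BMI sample
z = [19, 20, 22, 23, 28, 35]   # men's BMI sample
pooled = y + z
obs = sum(y) / 6 - sum(z) / 6  # observed difference in means

extreme = total = 0
for idx in combinations(range(12), 6):            # the 924 equally likely splits
    g1 = [pooled[i] for i in idx]
    g2 = [pooled[i] for i in range(12) if i not in idx]
    diff = sum(g1) / 6 - sum(g2) / 6
    total += 1
    extreme += abs(diff) >= abs(obs) - 1e-9       # two-sided comparison

print(total, extreme, round(extreme / total, 4))  # exact permutation P value
```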
References
[1] Bradley, J.V. (1968). Distribution-Free Statistical Tests, Prentice Hall.
[2] Fisher, R.A. (1935). Design of Experiments, Hafner, New York.
[3] Good, P. (2004). Permutation Tests, 3rd Edition, Springer, New York.
[4] Mielke, P.W. & Berry, K.J. (2001). Permutation Methods – A Distance Function Approach, Springer, New York.
[5] Pesarin, F. (2001). Multivariate Permutation Tests, Wiley & Sons, New York.
[6] Pitman, E.J.G. (1937). Significance tests which may be applied to samples from any population, Parts I and II, Journal of the Royal Statistical Society, Supplement 4, 119–130, 225–232.
[7] Pitman, E.J.G. (1938). Significance tests which may be applied to samples from any population. Part III. The analysis of variance test, Biometrika 29, 322–335.
CLIFFORD E. LUNNEBORG
Person Misfit ROB R. MEIJER Volume 3, pp. 1541–1547 in Encyclopedia of Statistics in Behavioral Science ISBN-13: 978-0-470-86080-9 ISBN-10: 0-470-86080-4 Editors Brian S. Everitt & David C. Howell John Wiley & Sons, Ltd, Chichester, 2005
Person Misfit Since the beginning of psychological and educational standardized testing, inaccuracy of measurement has received widespread attention. In this overview, we discuss research methods for determining the fit of individual item score patterns to a test model. In the past two decades, important contributions to assessing individual test performance arose from item response theory (IRT) (see Item Response Theory (IRT) Models for Polytomous Response Data; Item Response Theory (IRT) Models for Rating Scale Data); these contributions are summarized as person-fit research. However, because person-fit research has also been conducted without IRT modeling, approaches outside the IRT framework are also discussed. This review is restricted to dichotomous (0, 1) item scores. For a comprehensive review of person-fit, readers are referred to [9] and [10]. As a measure of a person’s ability level, the total score (or the trait level estimate) may be inadequate. For example, a person may guess some of the correct answers to multiple-choice items, thus raising his/her total score on the test by luck and not by ability, or an examinee not familiar with the test format may due to this unfamiliarity obtain a lower score than expected on the basis of his/her ability level [23]. Inaccurate measurement of the trait level may also be caused by sleeping behavior (inaccurately answering the first questions in a test as a result of, for example, problems of getting started), cheating behavior (copying the correct answers of another examinee), and plodding behavior (working very slowly and methodically, and, as a result, generating item score patterns that are too good to be true given the stochastic nature of a person’s response behavior as assumed by most IRT models). It is important to realize that not all types of aberrant behavior affect individual test scores. For example, a person may guess the correct answers of some of the items but also guess wrong on some of the other items, and, as the result of the stochastic nature of guessing, this process may not result in substantially different test scores under most IRT models to be discussed below. Whether aberrant behavior leads to misfitting item score patterns depends on numerous factors such as the type and the amount of aberrant behavior.
Furthermore, it may be noted that all methods discussed can be used to detect misfitting item score patterns, but that several of these methods do not allow the recovery of the mechanism that created the deviant item score patterns. Other methods explicitly test against specific violations of a test model assumption, or against particular types of deviant item score patterns. The latter group of methods, therefore, may facilitate the interpretation of misfitting item score patterns.
Person-fit Methods Based on Group Characteristics Most person-fit statistics compare an individual’s observed and expected item scores across the items from a test. The expected item scores are determined on the basis of an IRT model or on the basis of the observed item means in the sample. In this section, group-based statistics are considered. In the next section, IRT-based person-fit statistics are considered. Let n persons take a test consisting of k items, and let πg denote the proportion-correct score on item g that can be estimated from the sample by πˆ g = ng /n, where ng is the number of 1 scores. Furthermore, let the items be ordered and numbered according to decreasing proportion-correct score (increasing item difficulty): π1 > π2 > · · · > πk , and let the realization of a dichotomous (0,1) item score be denoted by Xg = xg (g = 1, . . . , k). Examinees are indexed i, with i = 1, . . . , n. Most person-fit statistics are a count of certain score patterns for item pairs (to be discussed shortly), and compare this count with the expectation under the deterministic Guttman [4] model. Let θ be the latent trait (see Latent Variable) known from IRT (to be introduced below), and let δ be the location parameter, which is a value on the θ scale. Pg (θ) is the conditional probability of giving a correct answer to item g. The Guttman model is defined by θ < δg ⇔ Pg (θ) = 0;
(1)
and
θ ≥ δg ⇔ Pg(θ) = 1.   (2)
The Guttman model thus excludes a correct answer on a relatively difficult item h and an incorrect answer on an easier item g by the same examinee:
Xh = 1 and Xg = 0, for all g < h. Item score combinations (0, 1) are called ‘errors’ or ‘inversions’. Item score patterns (1, 0), (0, 0), and (1, 1) are permitted, and known as ‘Guttman patterns’ or ‘conformal’ patterns. Person-fit statistics that are based on group characteristics compare an individual’s item score pattern with the other item score patterns in the sample. Rules-of-thumb have been proposed for some statistics, but these rules-of-thumb are not based on sampling distributions, and, therefore, difficult to interpret. Although group-based statistics may be sensitive to misfitting item score patterns, a drawback is that their null distributions are unknown and, as a result, it cannot be decided on the basis of significance probabilities when a score pattern is unlikely given a nominal Type I error rate. In general, let t be the observed value of a person-fit statistic T . Then, the significance probability or probability of exceedance is defined as the probability under the sampling distribution that the value of the test statistic is smaller than the observed value: p∗ = P (T ≤ t), or larger than the observed value: p∗ = P (T ≥ t), depending on whether low or high values of the statistic indicate aberrant item score patterns. Although it may be argued that this is not a serious problem as long as one is only interested in the use of a person-fit statistic as a descriptive measure, a more serious problem is that the distribution of the numerical values of most group-based statistics is dependent on the total score [2]. This dependence implies that when one critical value is used across total scores, the probability of classifying a score pattern as aberrant is a function of the total score, which is undesirable. To summarize, it can be concluded that the use of group-based statistics has been explorative, and, with the increasing interest in IRT modeling, interest in group-based person-fit has gradually declined.
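As an illustration of the Guttman 'errors' on which many group-based statistics are built, the sketch below counts (0, 1) inversions for hypothetical item score patterns, with items ordered from easiest to most difficult.

```python
def guttman_errors(pattern):
    """Count (0, 1) inversions: an easier item wrong, a harder item right."""
    errors = 0
    for g in range(len(pattern)):
        for h in range(g + 1, len(pattern)):
            if pattern[g] == 0 and pattern[h] == 1:
                errors += 1
    return errors

# items ordered by decreasing proportion-correct (increasing difficulty)
print(guttman_errors([1, 1, 1, 1, 0, 0, 0, 0]))   # a perfect Guttman pattern: 0 errors
print(guttman_errors([0, 1, 0, 1, 1, 0, 1, 0]))   # a scrambled pattern: many inversions
```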
Person-fit Measures Based on Item Response Theory
In IRT, the probability of obtaining a correct answer on item g (g = 1, . . . , k) is a function of the latent trait value (θ) and characteristics of the item such as the location δ [21]. This conditional probability Pg(θ) is the item response function (IRF). Further, we define the vector of item score random variables X = (X1, . . . , Xk), with realization x = (x1, . . . , xk). Pg(θ) often is specified using the 1-, 2-, or 3-parameter logistic model (1-, 2-, 3-PLM). The 3-PLM is defined as
P_g(\theta) = \gamma_g + \frac{(1 - \gamma_g)\exp[\alpha_g(\theta - \delta_g)]}{1 + \exp[\alpha_g(\theta - \delta_g)]},   (3)
where γg is the lower asymptote (γg is the probability of a 1 score for low-ability examinees, that is, θ → −∞); αg is the slope parameter (or item discrimination parameter); and δg is the item location parameter. The 2-PLM can be obtained by fixing γg = 0 for all items, and the 1-PLM, or Rasch model, can be obtained by additionally fixing αg = 1 for all items. A major advantage of IRT models is that the goodness-of-fit of a model to empirical data can be investigated. Compared to group-based person-fit statistics, this provides the opportunity of evaluating the fit of item score patterns to an IRT model. To investigate the goodness-of-fit of item score patterns, several IRT-based person-fit statistics have been proposed. Let wg(θ) be a suitable function for weighting the item scores and adapting person-fit scale scores, respectively. Following [18], a general form in which most person-fit statistics can be expressed is
V = \sum_{g=1}^{k} [X_g - P_g(\theta)]\, w_g(\theta).   (4)
Examples of these kinds of statistics are given in [19] and [23].
Likelihood-based Statistics
Most studies, to be discussed below, have been conducted using some suitable function of the log-likelihood function
l = \sum_{g=1}^{k} \{X_g \ln P_g(\theta) + (1 - X_g)\ln[1 - P_g(\theta)]\}.   (5)
This statistic, first proposed by Levine and Rubin [7], was further developed and applied in a series of articles by Drasgow et al. [2]. Two problems exist when using l as a fit statistic. The first problem is that l is not standardized, implying that
the classification of an item score pattern as normal or aberrant depends on θ. The second problem is that for classifying an item score pattern as aberrant, a distribution of the statistic under the null hypothesis of fitting item scores is needed, and for l, this null distribution is unknown. Solutions proposed for these two problems are the following. To overcome the problem of dependence on θ and the problem of unknown sampling distribution, Drasgow et al. [2] proposed a standardized version lz of l, which was less confounded with θ, and which was purported to be asymptotically standard normally distributed; lz is defined as
l_z = \frac{l - E(l)}{[\mathrm{var}(l)]^{1/2}},   (6)
where E(l) and var(l) denote the expectation and the variance of l, respectively. These quantities are given by
E(l) = \sum_{g=1}^{k} \{P_g(\theta)\ln[P_g(\theta)] + [1 - P_g(\theta)]\ln[1 - P_g(\theta)]\},   (7)
and
\mathrm{var}(l) = \sum_{g=1}^{k} P_g(\theta)[1 - P_g(\theta)]\left\{\ln\frac{P_g(\theta)}{1 - P_g(\theta)}\right\}^2.   (8)
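A small sketch of the standardization in (5)–(8): for a hypothetical set of 2-PLM item parameters and a fixed θ, it computes l, E(l), var(l), and lz for a given item score pattern (in practice θ is replaced by an estimate, with the consequences discussed below).

```python
import math

def p_2pl(theta, a, d):
    """2-PLM response probability (lower asymptote fixed at 0)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - d)))

def lz(pattern, theta, alphas, deltas):
    l = el = var_l = 0.0
    for x, a, d in zip(pattern, alphas, deltas):
        p = p_2pl(theta, a, d)
        l += x * math.log(p) + (1 - x) * math.log(1 - p)       # (5)
        el += p * math.log(p) + (1 - p) * math.log(1 - p)      # (7)
        var_l += p * (1 - p) * math.log(p / (1 - p)) ** 2      # (8)
    return (l - el) / math.sqrt(var_l)                         # (6)

# hypothetical item parameters and two item score patterns
alphas = [1.0, 1.2, 0.8, 1.5, 1.0, 0.9]
deltas = [-1.5, -0.5, 0.0, 0.5, 1.0, 1.5]
print(round(lz([1, 1, 1, 0, 0, 0], 0.0, alphas, deltas), 2))   # Guttman-like pattern
print(round(lz([0, 0, 0, 1, 1, 1], 0.0, alphas, deltas), 2))   # reversed pattern
```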
Molenaar and Hoijtink [11] argued that lz is only standard normally distributed when the true θ values are used, but a problem arises in practice when θ is replaced by the maximum likelihood estimate θ̂. Using an estimate and not the true θ will have an effect on the distribution of a person-fit statistic. These studies showed that when maximum likelihood estimates θ̂ were used, the variance of lz was smaller than expected under the standard normal distribution using the true θ, particularly for tests up to moderate length (say, 50 items or fewer). As a result, the empirical Type I error was smaller than the nominal Type I error. Molenaar and Hoijtink [11] used the statistic M = −Σ_{g=1}^{k} δg Xg and proposed three approximations to the distribution of M under the Rasch model. Snijders [18] derived the asymptotic sampling distribution for a group of person-fit statistics that
have the form given in (4), and for which the maximum likelihood estimate θ̂ was used instead of θ. It can easily be shown that l − E(l) can be written in the form of (4) by choosing
w_g(\theta) = \ln\frac{P_g(\theta)}{1 - P_g(\theta)}.   (9)
Snijders [18] derived expressions for the first two moments of the distribution, E[V(θ̂)] and var[V(θ̂)], and performed a simulation study using maximum likelihood estimation for θ. The results showed that the approximation was satisfactory for α = 0.05 and α = 0.10, but that the empirical Type I error was higher than the nominal Type I error for smaller values of α.
Optimal Person-fit Statistics Levine and Drasgow [6] proposed a likelihood ratio statistic, which provided the most powerful test for the null hypothesis that an item score pattern is normal versus the alternative hypothesis that it is aberrant. The researcher in advance has to specify a model for normal behavior (e.g., the 1-, 2-, or 3PLM) and a model that specifies a particular type of aberrant behavior (e.g., a model in which violations of local independence are specified). Klauer [5] investigated aberrant item score patterns by testing a null model of normal response behavior (Rasch model) against an alternative model of aberrant response behavior. Writing the Rasch model as a member of the exponential family, P (X = x|θ) = µ(θ)h(x) exp[θR(x)],
(10)
where
\mu(\theta) = \prod_{g=1}^{k} [1 + \exp(\theta - \delta_g)]^{-1}, \qquad h(x) = \exp\left(-\sum_{g=1}^{k} x_g \delta_g\right),   (11)
and R(x) is the number-correct score, Klauer [5] modeled aberrant response behavior using the two-parameter exponential family, introducing an extra person parameter η, as
P(X = x|\theta, \eta) = \mu(\theta, \eta)h(x)\exp[\eta T(x) + \theta R(x)],   (12)
where T(x) depends on the particular alternative model considered. Using the exponential family of models, a uniformly most powerful test can be used for testing H0: η = η0 against H1: η ≠ η0. Let a test be subdivided into two subtests A1 and A2. Then, as an example of η, η = θ1 − θ2 was considered, where θ1 is an individual's ability on subtest A1, and θ2 is an individual's ability on subtest A2. Under the Rasch model, it is expected that θ is invariant across subtests and, thus, H0: η = 0 can be tested against H1: η ≠ 0. For this type of aberrant behavior, T(x) is the number-correct score on either one of the subtests. Klauer [5] also tested the H0 of equal item discrimination parameters for all persons against person-specific item discrimination, and the H0 of local independence against violations of local independence. Results showed that the power of these tests depended on the type and the severity of the violations. Violations involving noninvariant ability (tested by H0: η = 0) were found to be the most difficult to detect. What is interesting in both [5] and [6] is that model violations are specified in advance and that tests are proposed to investigate these model violations. This is different from the approach followed in most person-fit studies, where the alternative hypothesis simply says that the null hypothesis is not true. An obvious problem is which alternative models to specify. A possibility is to specify a number of plausible alternative models and then successively test model-conforming item score patterns against these alternative models. Another option is to first investigate which model violations are most detrimental to the use of the test envisaged, and then test against the most serious violations.
The Person-response Function
Trabin and Weiss [20] proposed to use the person response function (PRF) to identify aberrant item score patterns. At a fixed θ value, the PRF specifies the probability of a correct response as a function of the item location δ. In IRT, the IRF often is assumed to be a nondecreasing function of θ, whereas the PRF is assumed to be a nonincreasing function of δ. To construct an observed PRF, Trabin and Weiss [20] ordered items by increasing δ̂ values and then formed subtests of items by grouping items according to δ̂ values. For fixed θ̂, the observed PRF was constructed by determining, in each subtest, the mean probability of a correct response. The expected
PRF was constructed by estimating, according to the 3-PLM, in each subtest, the mean probability of a correct response. A large difference between the expected and observed PRFs was interpreted as an indication of nonfitting responses for that examinee. Sijtsma and Meijer ([17]; see also [3]) and Reise [14] further refined person-fit methodology based on the person-response function.
Person-fit Research in Computer Adaptive Testing Bradlow et al. [1] and van Krimpen-Stoop and Meijer [22] defined person-fit statistics that make use of the property of computer-based testing that a fitting item score pattern will consist of an alternation of correct and incorrect responses, especially at the end of the test when θˆ comes closer to θ. Therefore, a string of consecutive correct or incorrect answers may be the result of aberrant response behavior. Sums of consecutive negative or positive residuals [Xg − Pg (θ)] can be investigated using a cumulative sum procedure [12].
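A minimal sketch of the cumulative-sum idea for the residuals Xg − Pg(θ): the upper and lower sums grow when runs of positive or negative residuals occur. The responses and probabilities are invented; in a real adaptive test the probabilities would come from the fitted IRT model.

```python
def cusum_residuals(responses, probs):
    """Upper/lower cumulative sums of residuals x - P(theta), reset at zero."""
    upper = lower = 0.0
    path = []
    for x, p in zip(responses, probs):
        r = x - p
        upper = max(0.0, upper + r)
        lower = min(0.0, lower + r)
        path.append((round(upper, 2), round(lower, 2)))
    return path

# hypothetical adaptive-test record: a run of incorrect answers near the end
responses = [1, 0, 1, 1, 0, 0, 0, 0]
probs = [0.7, 0.5, 0.6, 0.55, 0.5, 0.5, 0.5, 0.5]
print(cusum_residuals(responses, probs))
```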
Usefulness of IRT-based Person-fit Statistics
Several studies have addressed the usefulness of IRT-based person-fit statistics. In most studies, simulated data were used, and in some studies, empirical data [15] were used. The following topics were addressed:
1. detection rate of fit statistics and comparing fit statistics with respect to several criteria such as distributional characteristics and relation to the total score;
2. influence of item, test, and person characteristics on the detection rate;
3. applicability of person-fit statistics to detect particular types of misfitting item score patterns; and
4. relation between misfitting score patterns and the validity of test scores.
On the basis of this research, the following may be concluded. For many person-fit statistics for short tests and tests of moderate length (say, 10–60 items), due to the use of θ̂ rather than θ for most statistics, the nominal Type I error rate is not in agreement with the empirical Type I error rate. In general, sound statistical methods have been derived for the Rasch model, but because this model is rather restrictive to empirical data, the use of these statistics also is restricted. Furthermore, it may be wise to first investigate possible threats to the fit of individual item score patterns before using a particular person-fit statistic. For example, if violations against local independence are expected, one of the methods proposed in [5] may be used instead of a general statistic such as the M statistic proposed in [11]. Not only are tests against a specific alternative more powerful than general statistics, the type of deviance is also easier to interpret. A drawback of some person-fit statistics is that only deviations against the model are tested. This may result in interpretation problems. For example, item score patterns not fitting the Rasch model may be described more appropriately by means of the 3-PLM. If the Rasch model does not fit the data, other explanations are possible. Because in practice it is often difficult, if not impossible, to substantially distinguish different types of item score patterns and/or to obtain additional information using background variables, a more fruitful strategy may be to test against specific alternatives [8]. Almost all statistics are of the form given in (4), but the weights are different. The question then is, which statistic should be used? From the literature, it can be concluded that the use of a statistic depends on what kind of model is used. Using the Rasch model, the theory presented by Molenaar and Hoijtink and their statistic M is a good choice. Statistic M should be preferred over statistics like lz because the critical values for M are more accurate than those of lz. With respect to the 2-PLM and the 3-PLM, all statistics proposed suffer from the problem that the standard normal distribution is inaccurate when θ̂ is used instead of θ. This seriously reduces the applicability of these statistics. The theory recently proposed in [18] may help the practitioner to obtain the correct critical values.
Conclusions The aim of person-fit measurement is to detect item score patterns that are improbable given an IRT model or given the other patterns in a sample. The first requirement, thus, is that person-fit statistics are sensitive to misfitting item score patterns. After
having reviewed the studies using simulated data, it can be concluded that detection rates are highly dependent on (a) the type of aberrant response behavior, (b) the θ value, and (c) the test length. When item score patterns do not fit an IRT model, high detection rates can be obtained in particular for extreme θs, even when Type I errors are low (e.g., 0.01). The reason is that for extreme θs, deviations from the expected item score patterns tend to be larger than for moderate θs. As a result of this pattern misfit, the bias in θˆ tends to be larger for extreme θs than for moderate θs. The general finding that detection rates for moderate θ tend to be lower than for extreme θs, thus, is not such a bad result and certainly puts the disappointment some authors [13] expressed about low detection rates for moderate θs in perspective. Relatively few studies have investigated the usefulness of person-fit statistics for analyzing empirical data. The few studies that exist have found some evidence that groups of persons with a priori known characteristics, such as test-takers lacking motivation, may produce deviant item score patterns that are unlikely given the model. However, again, it depends on the degree of aberrance of response behavior how useful person-fit statistics really are. We agree with some authors [15] that more empirical research is needed. Whether person-fit statistics can help the researcher in practice depends on the context in which research takes place. Smith [16] mentioned four actions that could be taken when an item score pattern is classified as aberrant: (a) Instead of reporting one ability estimate for an examinee, several ability estimates can be reported on the basis of subtests that are in agreement with the model; (b) modify the item score pattern (for example, eliminate the unreached items at the end) and reestimate θ; (c) do not report the ability estimate and retest a person; or (d) decide that the error is small enough for the impact on the ability to be marginal. This decision can be based on comparing the error introduced by measurement disturbance and the standard error associated with each ability estimate. Which of these actions is taken very much depends on the context in which testing takes place. The usefulness of person-fit statistics, thus, also depends heavily on the application for which it is intended.
References
[1] Bradlow, E.T., Weiss, R.E. & Cho, M. (1998). Bayesian identification of outliers in computerized adaptive testing, Journal of the American Statistical Association 93, 910–919.
[2] Drasgow, F., Levine, M.V. & Williams, E.A. (1985). Appropriateness measurement with polychotomous item response models and standardized indices, British Journal of Mathematical and Statistical Psychology 38, 67–86.
[3] Emons, W.H.M., Meijer, R.R. & Sijtsma, K. (2004). Testing hypotheses about the person-response function in person-fit analysis, Multivariate Behavioral Research 39, 1–35.
[4] Guttman, L. (1950). The basis for scalogram analysis, in Measurement and Prediction, S.A. Stouffer, L. Guttman, E.A. Suchman, P.F. Lazarsfeld, S.A. Star & J.A. Claussen, eds, Princeton University Press, Princeton, pp. 60–90.
[5] Klauer, K.C. (1995). The assessment of person fit, in Rasch Models, Foundations, Recent Developments, and Applications, G.H. Fischer & I.W. Molenaar, eds, Springer-Verlag, New York, pp. 97–110.
[6] Levine, M.V. & Drasgow, F. (1988). Optimal appropriateness measurement, Psychometrika 53, 161–176.
[7] Levine, M.V. & Rubin, D.B. (1979). Measuring the appropriateness of multiple-choice test scores, Journal of Educational Statistics 4, 269–290.
[8] Meijer, R.R. (2003). Diagnosing item score patterns on a test using item response theory-based person-fit statistics, Psychological Methods 8, 72–87.
[9] Meijer, R.R. & Sijtsma, K. (1995). Detection of aberrant item score patterns: a review and new developments, Applied Measurement in Education 8, 261–272.
[10] Meijer, R.R. & Sijtsma, K. (2001). Methodology review: evaluating person fit, Applied Psychological Measurement 25, 107–135.
[11] Molenaar, I.W. & Hoijtink, H. (1990). The many null distributions of person fit indices, Psychometrika 55, 75–106.
[12] Page, E.S. (1954). Continuous inspection schemes, Biometrika 41, 100–115.
[13] Reise, S.P. (1995). Scoring method and the detection of person misfit in a personality assessment context, Applied Psychological Measurement 19, 213–229.
[14] Reise, S.P. (2000). Using multilevel logistic regression to evaluate person fit in IRT models, Multivariate Behavioral Research 35, 543–568.
[15] Reise, S.P. & Waller, N.G. (1993). Traitedness and the assessment of response pattern scalability, Journal of Personality and Social Psychology 65, 143–151.
[16] Smith, R.M. (1985). A comparison of Rasch person analysis and robust estimators, Educational and Psychological Measurement 45, 433–444.
[17] Sijtsma, K. & Meijer, R.R. (2001). The person response function as a tool in person-fit research, Psychometrika 66, 191–208.
[18] Snijders, T.A.B. (2001). Asymptotic null distribution of person-fit statistics with estimated person parameter, Psychometrika 66, 331–342.
[19] Tatsuoka, K.K. (1984). Caution indices based on item response theory, Psychometrika 49, 95–110.
[20] Trabin, T.E. & Weiss, D.J. (1983). The person response curve: fit of individuals to item response theory models, in New Horizons in Testing, D.J. Weiss, ed., Academic Press, New York.
[21] van der Linden, W.J. & Hambleton, R.K., eds (1997). Handbook of Modern Item Response Theory, Springer-Verlag, New York.
[22] van Krimpen-Stoop, E.M.L.A. & Meijer, R.R. (2000). Detecting person-misfit in adaptive testing using statistical process control techniques, in Computerized Adaptive Testing: Theory and Practice, W.J. van der Linden & C.A.W. Glas, eds, Kluwer Academic Publishers, Boston.
[23] Wright, B.D. & Stone, M.H. (1979). Best Test Design: Rasch Measurement, Mesa Press, Chicago.
(See also Hierarchical Item Response Theory Modeling; Maximum Likelihood Item Response Theory Estimation) ROB R. MEIJER
Pie Chart DANIEL B. WRIGHT
AND SIÂN E. WILLIAMS
Volume 3, pp. 1547–1548 in Encyclopedia of Statistics in Behavioral Science ISBN-13: 978-0-470-86080-9 ISBN-10: 0-470-86080-4 Editors Brian S. Everitt & David C. Howell John Wiley & Sons, Ltd, Chichester, 2005
Pie Chart Pie charts are different from most of the common types of graphs and charts. They do not plot data against axes; instead they plot the data in a circle. They are most suited to displaying frequency data of a single categorical variable. Each slice of the pie represents a different category, and the size of the slice represents the proportion of values in the sample within that category. Figure 1 shows an example from a class of undergraduate students who had to think aloud about when they would complete an essay. Their thoughts were categorized, and the resulting frequencies are shown in the figure. Each slice of the pie is labeled. As with all graphs, it is important to include all the information that is necessary to understand the data being presented. For Figure 1, the percentage and/or number of cases could be included. In some circumstances, it may be beneficial to have more than one pie chart to represent the data. The diameter of each chart can be varied to provide an extra dimension of information. For instance, if we had a measure of the number of days it took students to complete the essay we could display this in separate pie charts where the diameter of the chart represented completion time. Most commentators argue that pie charts are inferior for displaying data compared with other
methods such as tables and bar charts [1]. This is because people have difficulty comparing the size of different angles and therefore difficulty knowing the relative magnitudes of the different slices. Pie charts are common in management and business documents but not in scientific documents. It is usually preferable to use a table or a bar chart.
Figure 1 Distribution of students' reported thoughts when estimating the completion time of an academic assignment (slice labels: past experiences, future plans, future problems, other)
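For readers who want to reproduce a chart like Figure 1, a minimal matplotlib sketch follows; the category counts are invented, since the article does not report them.

```python
import matplotlib.pyplot as plt

labels = ["Past experiences", "Future plans", "Future problems", "Other"]
counts = [14, 9, 5, 3]   # invented frequencies

plt.pie(counts, labels=labels, autopct="%1.0f%%")
plt.title("Reported thoughts while estimating completion time")
plt.show()
```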
Reference [1]
Cleveland, W.S. & McGill, R. (1987). Graphical perception: the visual decoding of quantitative information on graphical displays of data, Journal of the Royal Statistical Society, Series A 150, 192–229.
DANIEL B. WRIGHT
AND SIÂN E. WILLIAMS
Pitman Test CLIFFORD E. LUNNEBORG Volume 3, pp. 1548–1550 in Encyclopedia of Statistics in Behavioral Science ISBN-13: 978-0-470-86080-9 ISBN-10: 0-470-86080-4 Editors Brian S. Everitt & David C. Howell John Wiley & Sons, Ltd, Chichester, 2005
Pitman Test The Pitman test [7] is sometimes identified as the Fisher–Pitman test [10] in the continuing but suspect belief that the Tasmanian mathematician E. J. G. Pitman (1897–1993) merely extended or modified an idea of R. A. Fisher’s [3]. While Fisher may have foreseen the population-sampling version of the test, the far more useful randomization-based version clearly is Pitman’s work [2]. In what follows, I adopt the practice of Edgington [2] in distinguishing between a permutation test (see Permutation Based Inference) and a randomization test. A permutation test requires random samples from one or more populations, each population of sufficient size that the sample observations can be regarded as independently and identically distributed (i.i.d.). A randomization test requires only that the members of a nonrandom sample be randomized among treatment groups. The Pitman test comes in both flavors.
The Two-sample Pitman Permutation Test Consider the results of drawing random samples, each of size four, from the (large) fifth-grade school populations in Cedar Grove and Forest Park. For the students sampled from the Cedar Grove population, we record arithmetic test scores of (25, 30, 27, 15), and for those sampled from Forest Park (28, 30, 32, 18). The null hypothesis of the two-sample test is that the two population distributions of arithmetic scores are identical. Thus, the Pitman test is a distribution-free alternative (see Distribution-free Inference, an Overview) to the two-sample t Test. The null hypothesis for the latter test is that the two distributions are not only identical but follow a normal law. The logic of the two-sample Pitman test is this. Under the null hypothesis, any four of the eight observed values could have been drawn from the Cedar Grove population and the remaining four from Forest Park. That is, we are as likely to have obtained, say, (25, 27, 28, 32) from Cedar Grove and (30, 15, 30, 18) from Forest Park as the values actually sampled. In fact, under the null hypothesis, all possible permutations of the eight observed scores,
four from Cedar Grove and four from Forest Park, are equally likely. There are M = 8!/(4!4!) = 70 permutations, and our random sampling gave us one of these equally likely outcomes. By comparing the test statistic computed for the observed random samples with the 69 other possible values, we can decide whether our outcome is consonant with the null hypothesis or not. What test statistic we compute and how we compare its value against this permutation reference distribution depends, of course, on our scientific hypothesis. The most common use of the Pitman twosample test is as a location test, to test for a difference in population means. The test may be one-tailed or two-tailed, depending on whether our alternative hypothesis specifies a direction to the difference in population means. A natural choice of test statistic would be the difference in sample means. For our data, we have as the difference in sample means, Forest Park minus Cedar Grove, s = [(28 + 30 + 32 + 18)/4] − [(25 + 30 + 27 + 15)/4] = 27.0 − 24.25 = 2.75. Is this a large difference, relative to what we would expect under the null hypothesis? Assume our alternative hypothesis is nondirectional (see Directed Alternatives in Testing). We want to know the probability, under the null hypothesis, of observing a mean difference of +2.75 or greater, or a mean difference of −2.75 or smaller. How many of the 70 possible permutations of the data produce mean differences in these two tails? In fact, 36 of the 70 permutations of the data yield mean differences with absolute values of 2.75 or greater. So the resulting P value is 36/70 = 0.514. There is no evidence in these sparse data that the null hypothesis might not be correct. For samples of equal size, the permutation test reference distribution is symmetric. The two-tailed P value is double the one-tailed P value. The Pitman test is exact (see Exact Methods for Categorical Data) in the sense that if we reject the null hypothesis whenever the obtained P value is no greater than α, we are assured that the probability of a Type I error will not exceed α. For small samples, however, the test is conservative. The probability of a Type I error is smaller than α. For our example, if we set α = 0.05, we would reject the null hypothesis if our observed statistic was either the smallest or the largest value in the reference distribution, with
an obtained P value of 2/70 = 0.0286. We could not reject the null hypothesis if our observed statistic was one of the two largest or two smallest, for, then the obtained P value would be 4/70 = 0.0571, greater than α. Software for the Pitman test is available in statistical programs such as StatXact (www.cytel.com), SC (www.mole-soft.demon.co.uk), and R (www. R-project.org) (as part of an add-on package for R) (see Software for Statistical Analyses). The Pitman permutation test is discussed further in [1], [4], [9], [10]. In some references, the test is known as the raw scores permutation test to distinguish it from more recently developed permutation tests requiring transformation of the observations to ranks or to normal scores.
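The 70-element reference distribution for this example can be generated directly; the sketch below recomputes the two-tailed P value (the article reports 36/70 ≈ 0.51) and the smallest two-tailed P value the test can attain (2/70), which is the source of the conservativeness noted above.

```python
from itertools import combinations

scores = [25, 30, 27, 15, 28, 30, 32, 18]   # the eight arithmetic test scores
obs = (28 + 30 + 32 + 18) / 4 - (25 + 30 + 27 + 15) / 4   # +2.75

diffs = []
for idx in combinations(range(8), 4):        # 70 ways to split four and four
    a = [scores[i] for i in idx]
    b = [scores[i] for i in range(8) if i not in idx]
    diffs.append(sum(a) / 4 - sum(b) / 4)

p_two_sided = sum(abs(d) >= abs(obs) - 1e-9 for d in diffs) / len(diffs)
p_smallest = 2 / len(diffs)                  # most extreme attainable two-tailed P
print(round(p_two_sided, 3), round(p_smallest, 4))
```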
The Two-group Pitman Randomization Test In view of the scarcity of random samples, the population sampling permutation test has limited utility to behavioral scientists. Typically, we must work with available cases: patient volunteers, students enrolled in Psychology 101, the animals housed in the local vivarium. By randomizing these available cases among treatments, however, we not only practice good experimental design, we also enable statistical inference that does not depend upon population sampling. To illustrate the Pitman test in the randomizationinference (see Randomization) context, let us reattribute the arithmetic test scores used in the earlier section. Assume now we have eight fifth-grade students, not a random sample from any population. We randomly divide the eight into two groups of four. Students in the first of these treatment groups are given a practice exam one week in advance of the arithmetic test. Those in the second group receive this practice exam the afternoon before the test. Here, again, are the arithmetic test scores: for the Week-ahead treatment (28, 30, 32, 18), and for the Day-before treatment (25, 30, 27, 15). Is there an advantage to the Week-ahead treatment? The observed mean difference, Week-ahead minus Day-before, is +2.75. Is that worthy of note? The randomization test null hypothesis is that any one of the students in this study would have earned exactly the same score on the test if he or she had
been randomized to the other treatment. For example, Larry was randomized to the Week-ahead treatment and earned a test score of 28. The null hypothesis is that he would have earned a 28 if he had been randomized to the Day-before treatment also. There are, again, 70 different ways in which the eight students could have been randomized four apiece to the two treatments. And, under the null hypothesis, we can compute the mean difference for each of those randomizations. The alternative hypothesis is that, across these eight students, there is a tendency for the test score of a student to be higher with Week-ahead availability of the practice exam than with Day-before availability. The mechanics of the randomization test are identical to those of the permutation test, and we would find that the probability of a mean difference of +2.75 or greater under the null hypothesis, the P value for our one-tailed test, to be 18/70 = 0.2571, providing no consequential evidence against the null hypothesis. The randomization test of Pitman is discussed more fully in [2], [5], [6]. The software for the permutation test can be used, of course, for the randomization test as well. The permutation test requires random samples from well-defined populations. In the rare event that we do have random samples, we can extend our statistical inferences to those populations sampled. The statistical inferences based on randomization are limited to the set of cases involved in the randomization. While this may seem a severe limitation, randomization of cases among treatments is a gold standard for causal inference, linking differential response as unambiguously as possible to differential treatment [8]. Much research is intended to demonstrate just such a linkage.
References
[1] Conover, W.J. (1999). Practical Nonparametric Statistics, 3rd Edition, Wiley, New York.
[2] Edgington, E.S. (1995). Randomization Tests, 3rd Edition, Marcel Dekker, New York.
[3] Fisher, R.A. (1935). The Design of Experiments, Oliver & Boyd, Edinburgh.
[4] Good, P. (2000). Permutation Tests, 2nd Edition, Springer-Verlag, New York.
[5] Lunneborg, C.E. (2000). Data Analysis by Resampling, Duxbury, Pacific Grove.
[6] Manly, B.F.J. (1997). Randomization, Bootstrap and Monte Carlo Methods in Biology, 2nd Edition, Chapman & Hall, London.
[7] Pitman, E.J.G. (1937). Significance tests which may be applied to samples from any population, Royal Statistical Society Supplement 4, 119–130.
[8] Rubin, D.B. (1991). Practical applications of modes of statistical inference for causal effects and the critical role of the assignment mechanism, Biometrics 47, 1213–1234.
[9] Sprent, P. (1998). Data Driven Statistical Methods, Chapman & Hall, London.
[10] Sprent, P. & Smeeton, N.C. (2001). Applied Nonparametric Statistical Methods, 3rd Edition, Chapman & Hall/CRC, Boca Raton.
CLIFFORD E. LUNNEBORG
Placebo Effect PATRICK ONGHENA Volume 3, pp. 1550–1552 in Encyclopedia of Statistics in Behavioral Science ISBN-13: 978-0-470-86080-9 ISBN-10: 0-470-86080-4 Editors Brian S. Everitt & David C. Howell John Wiley & Sons, Ltd, Chichester, 2005
Placebo Effect The placebo effect is narrowly defined as the beneficial effect associated with the use of medication that has no intrinsic pharmacological effect (e.g., sugar pills or saline injections). More generally, it refers to the beneficial effect associated with all kinds of interventions that do not include the presumed active ingredients (e.g., sham procedures, mock surgery, attention placebo, or pseudo therapeutic encounters and symbols, such as a white coat) [5, 14, 15]. If the effect is harmful, then it is called a nocebo effect [2]. Both the placebo and the nocebo effects can be operationally defined as the difference between the results of a placebo/nocebo condition and a no-treatment condition in a randomized trial [6]. The use of placebos is commonplace in clinical testing in which the randomized double-blind placebo-controlled trial represents the gold standard of clinical evidence (see Clinical Trials and Intervention Studies). Ironically, this widespread use of placebos in controlled clinical trials has led to more confusion than understanding of the placebo effect because those trials usually do not include a notreatment condition. The two most popular fallacies regarding the interpretation of placebo effects in controlled clinical trials are (a) that all improvement in the placebo condition is due to the placebo effect, and (b) that the placebo effect is additive to the treatment effect. The fallacious nature of the first statement becomes evident if one takes other possible explanations of improvement into account: spontaneous recovery, the natural history of an illness, patient, or investigator bias, and regression artifacts; to name the most important ones [4, 6, 10]. In the absence of a no-treatment condition, both clinicians and researchers have a tendency to mistakenly attribute all improvement to the administration of the placebo itself. However, comparison of an active therapy with an inactive therapy offers control for the active therapy effect but does not separate the placebo effect from other confounding effects; hence the need for a no-treatment control condition to assess and fully understand the placebo effect (see operational definition above). Also, the idea of an additive placebo effect, which can be subtracted from the effect in the treatment condition to arrive at a true treatment effect,
is misguided. There is ample evidence that most often the placebo effect interacts with the treatment effect, operating recursively, and synergistically in the course of the treatment [4, 11]. Because of these methodological issues, and because of the different ways in which a placebo can be defined (narrow or more comprehensive), it should not come as a surprise that the size of the placebo effect and the conditions for its appearance remain controversial. A frequently cited figure, derived from an influential classical paper by Beecher [3], is that about one-third of all patients in clinical trials improve due to the placebo effect. However, in a recent systematic review of randomized trials comparing a placebo condition with a no-treatment condition, little evidence was found that placebos have clinically significant effects, except perhaps for small effects in the treatment of pain [7, 8]. Recently, functional magnetic resonance imaging experiments have convincingly shown that the human brain processes pain differently when participants receive or anticipate a placebo [18], but the clinical importance of placebo effects might be smaller than, and not as general as, once thought. Parallel to this discussion on effect sizes, there is also an interesting debate regarding the explanation and possible working mechanisms of the placebo effect [9, 17]. The two leading models are the classical conditioning model and the response expectancy model. According to a classical conditioning scheme, the pharmacological properties of the medication (or the active ingredients of the intervention) act as an unconditioned stimulus (US) that elicits an unconditioned biological or behavioral response (UR). By frequent pairings of this US with a neutral stimulus (NS), like a tablet, a capsule, or an injection needle, this NS becomes intrinsically associated with the unconditioned stimulus (US), and becomes what is called a conditioned stimulus (CS). This means that, if the CS (e.g., a tablet) is presented in the absence of the US (e.g., the active ingredients), this CS also elicits the same biological or behavioral response, now called the conditioned response (CR). In other words: the CS acts as a placebo [1, 16]. However, several authors have remarked that there are problems with this scheme [4, 9, 13]. Although classical conditioning as an explanatory mechanism seems plausible for some placebo effects, there appears to exist a variety of other placebo
effects for which conditioning can only be part of the story or in which other mechanisms are, at least partly, responsible. For instance, placebo effects are reported in the absence of previous exposure to the active ingredients of the intervention or after a single encounter, and medication placebos can sometimes produce effects opposite to the effects of the active drug [16]. These phenomena suggest that anticipation and other cognitive factors play a crucial role in the placebo response. This is in line with the response expectancy model, which states that bodily changes occur to the extent that the subject expects them to. According to this model, expectancies act as a mediator between the conditioning trials and the placebo effect, and, therefore, verbal information can strengthen or weaken a given placebo effect [9, 13, 17]. Whatever the possible working mechanisms or the presumed size of the placebo effect, from a methodological perspective, it is recommended to distinguish the placebo effect from other validity threats, like history, experimenter bias, and regression, and to restrict its use to a particular type of reactivity effect (or artifact, depending on the context) (see Quasi-experimental Designs; External Validity; Internal Validity). From this perspective, the placebo effect is considered as a specific threat to the treatment construct validity because it arises when participants or patients interpret the treatment setting in a way that makes the actual intervention different from the intervention as planned by the researcher. In fact, the placebo effect, and especially the response expectancy model, reminds us that it is hard to assess the effects of external agents without taking into account the way in which these external agents are interpreted by the participants. To emphasize this subjective factor, Moerman [11, 12] proposed to rename (and reframe) the placebo effect as a ‘meaning response’, which he defined as ‘the physiological or psychological effects of meaning in the treatment of illness’ [11, p. 14].
References
[1] Ader, R. (1997). Processes underlying placebo effects: the preeminence of conditioning, Pain Forum 6, 56–58.
[2] Barsky, A.J., Saintfort, R., Rogers, M.P. & Borus, J.F. (2002). Nonspecific medication side effects and the nocebo phenomenon, Journal of the American Medical Association 287, 622–627.
[3] Beecher, H.K. (1955). The powerful placebo, Journal of the American Medical Association 159, 1602–1606.
[4] Engel, L.W., Guess, H.A., Kleinman, A. & Kusek, J.W. eds (2002). The Science of the Placebo: Toward an Interdisciplinary Research Agenda, BMJ Books, London.
[5] Harrington, A. ed. (1999). The Placebo Effect: An Interdisciplinary Exploration, Harvard University Press, Cambridge.
[6] Hróbjartsson, A. (2002). What are the main methodological problems in the estimation of placebo effects? Journal of Clinical Epidemiology 55, 430–435.
[7] Hróbjartsson, A. & Gøtzsche, P.C. (2001). Is the placebo powerless? An analysis of clinical trials comparing placebo with no treatment, New England Journal of Medicine 344, 1594–1602.
[8] Hróbjartsson, A. & Gøtzsche, P.C. (2004). Is the placebo powerless? Update of a systematic review with 52 new randomized trials comparing placebo with no treatment, Journal of Internal Medicine 256, 91–100.
[9] Kirsch, I. (2004). Conditioning, expectancy, and the placebo effect: comment on Stewart-Williams and Podd (2004), Psychological Bulletin 130, 341–343.
[10] McDonald, C.J., Mazzuca, S.A. & McCabe, G.P. (1983). How much of the placebo 'effect' is really statistical regression? Statistics in Medicine 2, 417–427.
[11] Moerman, D.E. (2002). Medicine, Meaning, and the "Placebo Effect", Cambridge University Press, London.
[12] Moerman, D.E. & Jonas, W.B. (2002). Deconstructing the placebo effect and finding the meaning response, Annals of Internal Medicine 136, 471–476.
[13] Montgomery, G.H. & Kirsch, I. (1997). Classical conditioning and the placebo effect, Pain 72, 107–113.
[14] Shadish, W.R., Cook, T.D. & Campbell, D.T. (2002). Experimental and Quasi-Experimental Designs for Generalized Causal Inference, Houghton Mifflin, Boston.
[15] Shapiro, A. & Shapiro, E. (1997). The Powerful Placebo: From Ancient Priest to Modern Physician, Johns Hopkins University Press, Baltimore.
[16] Siegel, S. (2002). Explanatory mechanisms of the placebo effect: Pavlovian conditioning, in The Science of the Placebo: Toward an Interdisciplinary Research Agenda, L.W. Engel, H.A. Guess, A. Kleinman & J.W. Kusek, eds, BMJ Books, London, pp. 133–157.
[17] Stewart-Williams, S. & Podd, J. (2004). The placebo effect: dissolving the expectancy versus conditioning debate, Psychological Bulletin 130, 324–340.
[18] Wager, T.D., Rilling, J.K., Smith, E.E., Sokolik, A., Casey, K.L., Davidson, R.J., Kosslyn, S.M., Rose, R.M. & Cohen, J.D. (2004). Placebo-induced changes in fMRI in the anticipation and experience of pain, Science 303, 1162–1167.
PATRICK ONGHENA
Point Biserial Correlation DIANA KORNBROT Volume 3, pp. 1552–1553 in Encyclopedia of Statistics in Behavioral Science ISBN-13: 978-0-470-86080-9 ISBN-10: 0-470-86080-4 Editors Brian S. Everitt & David C. Howell John Wiley & Sons, Ltd, Chichester, 2005
Point Biserial Correlation The point biserial correlation, rpb , is the value of Pearson’s product moment correlation when one of the variables is dichotomous, taking on only two possible values coded 0 and 1, and the other variable is metric (interval or ratio). For example, the dichotomous variable might be political party, with left coded 0 and right coded 1, and the metric variable might be income. The metric variable should be approximately normally distributed for both groups.
Calculation

The value of r_pb can be calculated directly from (1) [2]:
$$r_{pb} = \frac{\bar{Y}_1 - \bar{Y}_0}{s_Y}\sqrt{\frac{N_1 N_0}{N(N-1)}},\qquad(1)$$
where Ȳ0 and Ȳ1 are the means of the metric observations coded 0 and 1, respectively; N0 and N1 are the numbers of observations coded 0 and 1, respectively; N is the total number of observations, N0 + N1; and s_Y is the standard deviation of all the metric observations:
$$s_Y = \sqrt{\frac{\sum Y^2 - \frac{\left(\sum Y\right)^2}{N}}{N-1}}.\qquad(2)$$
Equation (1) is generated by using the standard equation for Pearson's product moment correlation, r, with the dichotomous variable coded 0 and 1. Consequently, r_pb can easily be obtained from standard statistical packages as the value of Pearson's r when one of the variables only takes on values of 0 or 1.

Interpretation of r_pb as an Effect Size

The point biserial correlation, r_pb, may be interpreted as an effect size for the difference in means between two groups. In fact, r_pb² is the proportion of variance accounted for by the difference between the means of the two groups. Cohen's d effect size (defined as the difference between means divided by the pooled standard deviation, and also known as Glass's g (see Effect Size Measures)) can be obtained from r_pb via (3) [1]:
$$d = \sqrt{\frac{N(N-2)\,r_{pb}^2}{N_1 N_0\left(1 - r_{pb}^2\right)}}.\qquad(3)$$

Hypothesis Testing and Power for r_pb

A value of r_pb that is significantly different from zero is completely equivalent to a significant difference in means between the two groups. Thus, an independent groups t test with N − 2 degrees of freedom (df) may be used to test whether r_pb is nonzero (or, equivalently, a one-way analysis of variance (ANOVA) with two levels). The relation between the t statistic for comparing two independent groups and r_pb is given by (4) [1, 2]:
$$t = \frac{r_{pb}\sqrt{N-2}}{\sqrt{1 - r_{pb}^2}}.\qquad(4)$$
Power to detect a nonzero r_pb is the same as that for an independent groups t test. For example, a total sample size of N = 52 is needed for a power of 0.80 (an 80% probability of detection) for a 'large' effect size, corresponding to r_pb = 0.38; a 'medium' effect size, corresponding to r_pb = 0.24, requires a considerably larger total sample size (approximately N = 128).
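The equivalences in (1) and (4) are easy to verify numerically. The following sketch (assuming Python with numpy and scipy, which the entry itself does not use; the simulated data and variable names are ours) checks that r_pb from (1) equals Pearson's r on the 0/1 coding and that the t statistic from (4) equals the independent groups t test.

```python
# A minimal sketch verifying the identities stated above on simulated data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
group = np.repeat([0, 1], 30)                    # dichotomous variable, coded 0/1
income = rng.normal(40 + 10 * group, 8)          # hypothetical metric variable

y1, y0 = income[group == 1], income[group == 0]
n1, n0, n = len(y1), len(y0), len(income)
s_y = income.std(ddof=1)

r_pb = (y1.mean() - y0.mean()) / s_y * np.sqrt(n1 * n0 / (n * (n - 1)))   # equation (1)
r_pearson = stats.pearsonr(group, income)[0]
t_from_r = r_pb * np.sqrt(n - 2) / np.sqrt(1 - r_pb ** 2)                 # equation (4)
t_test = stats.ttest_ind(y1, y0).statistic

print(np.allclose(r_pb, r_pearson), np.allclose(t_from_r, t_test))        # True True
```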
Summary

The point biserial correlation is a useful measure of effect size, that is, of the statistical magnitude of the difference in means between two groups. It is based on Pearson's product moment correlation.
References
[1] Howell, D.C. (2004). Fundamental Statistics for the Behavioral Sciences, 5th Edition, Duxbury Press, Pacific Grove.
[2] Sheskin, D.J. (2000). Handbook of Parametric and Nonparametric Statistical Procedures, 2nd Edition, Chapman & Hall, London.
DIANA KORNBROT
Polychoric Correlation SCOTT L. HERSHBERGER Volume 3, pp. 1554–1555 in Encyclopedia of Statistics in Behavioral Science ISBN-13: 978-0-470-86080-9 ISBN-10: 0-470-86080-4 Editors Brian S. Everitt & David C. Howell John Wiley & Sons, Ltd, Chichester, 2005
Polychoric Correlation

Introduction

The polychoric correlation is used to correlate two ordinal variables, X and Y, when it is assumed that underlying each is a continuous, normally distributed latent variable [2, 5]. At least one of the ordinal variables has more than two categories. When both variables are ordinal, but the assumption of a latent, normally distributed variable is not made, we obtain Spearman's rank order correlation (Spearman's rho) [4]. The tetrachoric correlation is the specific form of the polychoric correlation computed between two binary variables [1]. The polyserial correlation is another variant of the polychoric correlation, in which one of the variables is ordinal and the other is continuous [2]. Polychoric correlations between two ordinal variables rest on the interpretation of the ordinal variables as discrete versions of latent continuous 'ideally measured' variables, and on the assumption that these corresponding latent variables are bivariate-normally distributed (see Catalogue of Probability Density Functions). In a bivariate-normal distribution, the distribution of Y is normal at fixed values of X, and the distribution of X is normal at fixed values of Y. Two variables that are bivariate-normally distributed must each be normally distributed. The calculation of the polychoric correlation involves corrections that approximate what the Pearson product-moment correlation would have been if the data had been continuous.
Statistical Definition

Estimating the polychoric correlation assumes that an observed ordinal variable Y with C categories is related to an underlying continuous latent variable Y* having C − 1 thresholds τ_c:
$$Y = c \quad\text{if}\quad \tau_c < Y^* \le \tau_{c+1}, \qquad (1)$$
in which τ0 = −∞ and τC = +∞. Since Y* is assumed to be normally distributed, the probability that Y* ≤ τ_c is
$$p(Y^* \le \tau_c) = \Phi\!\left(\frac{\tau_c - \mu}{\sigma}\right), \qquad (2)$$
where Φ is the cumulative distribution function of the standard normal and Y* is a latent variable with mean µ and standard deviation σ. Thus, p(Y* ≤ τ_c) is the cumulative probability that the latent variable Y* is less than or equal to the cth threshold τ_c. It also follows that
$$p(Y^* > \tau_c) = 1 - \Phi\!\left(\frac{\tau_c - \mu}{\sigma}\right). \qquad (3)$$
If the observed variable Y had only two categories, defining the two probabilities p(Y* ≤ τ_c) and p(Y* > τ_c) would suffice: only a single threshold would be required. However, when the number of categories is greater than two, more than one threshold is required. In general, then, the probability that the observed ordinal variable Y is equal to a given category c is
$$p(Y = c) = p(Y^* \le \tau_{c+1}) - p(Y^* \le \tau_c). \qquad (4)$$
For example, if the ordinal variable has three categories, two thresholds must be defined:
$$\sum_{c=1}^{3} p(Y = c) = p(Y^* \le \tau_1) + p(\tau_1 < Y^* \le \tau_2) + p(\tau_2 < Y^*)$$
$$= \Phi\!\left(\frac{\tau_1 - \mu}{\sigma}\right) + \left[\Phi\!\left(\frac{\tau_2 - \mu}{\sigma}\right) - \Phi\!\left(\frac{\tau_1 - \mu}{\sigma}\right)\right] + \left[1 - \Phi\!\left(\frac{\tau_2 - \mu}{\sigma}\right)\right]. \qquad (5)$$
The thresholds define the cutpoints, or transitions from one category of an ordinal variable to another, on the normal distribution. For a given µ and σ of Y* and the two thresholds τ1 and τ2, the probability that the latent variable Y* is in category 1 is the area under the normal curve between −∞ and τ1; the probability that Y* is in category 2 is the area bounded by τ1 and τ2; and the probability that Y* is in category 3 is the area between τ2 and +∞. The polychoric correlation may also be thought of as the correlation between the normal scores Y_i*, Y_j* as estimated from the category scores Y_i, Y_j. These normal scores are based on the threshold values. In essence, the calculation of the polychoric correlation corrects for the errors that arise from grouping the latent variables into categories, and approximates what the Pearson product-moment correlation would have been if the data had been continuous.
The form of the distribution of the latent Y* variables – bivariate normality (see Catalogue of Probability Density Functions) – has to be assumed in order to specify the likelihood function of the polychoric correlation. For example, the program PRELIS 2 [3] estimates the polychoric correlation by limited information maximum likelihood (see Maximum Likelihood Estimation). In PRELIS 2, the polychoric correlation is estimated in two steps. First, the thresholds are estimated from the cumulative frequencies of the observed Y category scores and the inverse of the standard normal distribution function, assuming µ = 0 and σ = 1. Second, the polychoric correlation is then estimated by restricted maximum likelihood conditional on the threshold values.
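As an illustration of this two-step (limited information) approach, the sketch below estimates the thresholds from the marginal frequencies with the inverse standard normal distribution function and then maximizes the bivariate normal likelihood over the correlation with the thresholds held fixed. It assumes Python with numpy and scipy rather than PRELIS, the function names are ours, and it uses the lamb data from the example that follows.

```python
# A minimal sketch of two-step (limited information) estimation of the
# polychoric correlation, assuming mu = 0 and sigma = 1 for both latent variables.
import numpy as np
from scipy.stats import norm, multivariate_normal
from scipy.optimize import minimize_scalar

def thresholds(margin_counts):
    """Step 1: thresholds from the cumulative marginal proportions."""
    cum = np.cumsum(margin_counts)[:-1] / np.sum(margin_counts)
    return np.concatenate(([-np.inf], norm.ppf(cum), [np.inf]))

def cell_prob(rho, a_lo, a_hi, b_lo, b_hi):
    """P(a_lo < X* <= a_hi, b_lo < Y* <= b_hi) under bivariate normality."""
    def Phi2(x, y):
        if (np.isinf(x) and x < 0) or (np.isinf(y) and y < 0):
            return 0.0
        x, y = min(x, 8.0), min(y, 8.0)   # cap +inf for the cdf call
        return multivariate_normal.cdf([x, y], mean=[0.0, 0.0],
                                       cov=[[1.0, rho], [rho, 1.0]])
    return Phi2(a_hi, b_hi) - Phi2(a_lo, b_hi) - Phi2(a_hi, b_lo) + Phi2(a_lo, b_lo)

def polychoric(table):
    table = np.asarray(table, dtype=float)
    tau_row = thresholds(table.sum(axis=1))
    tau_col = thresholds(table.sum(axis=0))

    def neg_loglik(rho):
        ll = 0.0
        for i in range(table.shape[0]):
            for j in range(table.shape[1]):
                p = max(cell_prob(rho, tau_row[i], tau_row[i + 1],
                                  tau_col[j], tau_col[j + 1]), 1e-12)
                ll += table[i, j] * np.log(p)
        return -ll

    # Step 2: maximize the likelihood over rho with the thresholds held fixed.
    return minimize_scalar(neg_loglik, bounds=(-0.999, 0.999), method="bounded").x

lambs = [[58, 52, 1],    # rows and columns follow Table 1 in the example below
         [26, 58, 3],
         [8, 12, 9]]
print(round(polychoric(lambs), 3))   # close to the limited information estimate 0.420
```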
Example

The table below (Table 1), adapted from Drasgow [2], summarizes the number of lambs born to 227 ewes over two years. Spearman's rho between the two years is .29, while the Pearson product-moment correlation is .32. We used two methods based on maximum likelihood estimation to calculate the polychoric correlation. The first method, joint maximum likelihood, estimates the polychoric correlation and the thresholds at the same time. The second method, limited information maximum likelihood estimation, first estimates the thresholds from the observed category frequencies, and then estimates the polychoric correlation conditional on these thresholds.

Table 1  Number of lambs born to 227 ewes in 1951 and 1952

                         Lambs born in 1952
Lambs born in 1951    None      1      2   Total
None                    58     52      1     111
1                       26     58      3      87
2                        8     12      9      29
Total                   92    122     13     227

The parameter estimates from joint maximum likelihood estimation are
$$r = 0.419,\quad \tau_{1,1952} = -0.242,\quad \tau_{2,1952} = 1.594,\quad \tau_{1,1953} = -0.030,\quad \tau_{2,1953} = 1.133. \qquad (6)$$
The parameter estimates from limited information maximum likelihood estimation are
$$r = 0.420,\quad \tau_{1,1952} = -0.240,\quad \tau_{2,1952} = 1.578,\quad \tau_{1,1953} = -0.028,\quad \tau_{2,1953} = 1.137. \qquad (7)$$
Thus, the results using either maximum likelihood approach are nearly identical.

References

[1] Allen, M.J. & Yen, W.M. (1979). Introduction to Measurement Theory, Wadsworth, Monterey.
[2] Drasgow, F. (1988). Polychoric and polyserial correlations, in Encyclopedia of the Statistical Sciences, Vol. 7, S. Kotz & N.L. Johnson, eds, Wiley, New York, pp. 69–74.
[3] Jöreskog, K.G. & Sörbom, D. (1996). PRELIS User's Manual, Version 2, Scientific Software International, Chicago.
[4] Nunnally, J.C. & Bernstein, I.H. (1994). Psychometric Theory, 3rd Edition, McGraw-Hill, New York.
[5] Olsson, U. (1979). Maximum likelihood estimation of the polychoric correlation coefficient, Psychometrika 44, 443–460.
(See also Structural Equation Modeling: Software) SCOTT L. HERSHBERGER
Polynomial Model L. JANE GOLDSMITH
AND
VANCE W. BERGER
Volume 3, pp. 1555–1557 in Encyclopedia of Statistics in Behavioral Science ISBN-13: 978-0-470-86080-9 ISBN-10: 0-470-86080-4 Editors Brian S. Everitt & David C. Howell John Wiley & Sons, Ltd, Chichester, 2005
Polynomial Model
A polynomial model is a form of linear model in which the predictor variables (the x's) appear as powers or cross-products in the model equation. The highest power (or sum of powers for x's appearing in a cross-product term) is called the order of the model. The simplest polynomial model is the single predictor model of order 1:
$$Y = \beta_0 + \beta_1 x + \varepsilon, \qquad (1)$$
where Y is the dependent variable, x is the predictor variable, and ε is an error term. The errors associated with each observation of Y are assumed to be independent and identically distributed according to the Gaussian law with mean 0 and unknown variance σ²; that is, in commonly used notation, ε ∼ N(0, σ²). Thus, a first-order polynomial model with one predictor is recognizable as the simple linear regression model. See [1] and [2]. With k predictors, a first-order polynomial model is
$$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \varepsilon, \qquad (2)$$
the multiple linear regression model. Of course, parameter estimates can be obtained and hypotheses can be tested using linear model least-squares theory (see Least Squares Estimation). A model of order two with one predictor variable is
$$Y = \beta_0 + \beta_1 x + \beta_2 x^2 + \varepsilon. \qquad (3)$$
This is the well-known formula for a parabola that opens up if β2 > 0 and opens down if β2 < 0. Figure 1 demonstrates a second-order polynomial model with β2 < 0. The line represents the underlying second-order polynomial, Y = 0.75 + 0.5x − 0.25x², and the dots represent observations made according to the ε ∼ N(0, σ²) law, with σ² = 0.01. We see that, according to this model, Y increases to a maximum value at x = 1 and then decreases. A curve like this can be used to model blood concentrations of a drug component from the time of administration (x = 0), for example. Concentration of this chemical increases as the medicine is metabolized, then falls off through excretion from the body. At x = 3, blood concentration is expected to be 0.

Figure 1  An example of a polynomial curve (Y = 0.75 + 0.5x − 0.25x²)

It may seem odd to refer to a second-order polynomial, or even a higher-order polynomial, as a linear model, but the model is linear in the coefficients to be estimated, even if it is not linear in the predictor x. Of course, given the predictor x, one can create another predictor, x², which can be viewed not as the square of the first predictor but rather as a separate predictor. It could even be given a different name, such as w. Now the linear model is, in fact, linear in the predictors as well as in the coefficients. Either way, standard linear regression methods apply. Some authors refer to polynomial models as curvilinear models, thus acknowledging both the curvature and the linearity that apply. A second-order model with k predictors is
$$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \beta_{11} x_1^2 + \beta_{22} x_2^2 + \cdots + \beta_{kk} x_k^2 + \beta_{12} x_1 x_2 + \cdots + \beta_{ij} x_i x_j + \cdots + \beta_{k-1\,k}\, x_{k-1} x_k + \varepsilon. \qquad (4)$$
The notation convention in this second-order multivariate model is that βi is the coefficient of xi and βij is the coefficient of xi xj; thus, in general, βjj is the coefficient of xj². A second-order polynomial model with k predictor variables will contain (k + 1)(k + 2)/2 terms if all terms are retained. Often a simpler model (say, one without cross-product terms) is desired, but terms should not be omitted without careful analysis of scientific and statistical evidence. The second-order polynomial model also admits the use of linear model inferential techniques for parameter estimation and hypothesis testing; that is, the βi's and βij's can be estimated and tested using the usual least-squares regression theory. Model fit can be evaluated through residual analysis by plotting residuals versus the Ŷi's (the predicted values).

By extending the notation convention given above for second-order multivariate polynomial models in a completely natural way, third- and higher-order models can be specified. Third-order models have some use in research: cubic equations can model a dependent variable that rises and falls and rises again (or, with a change of sign, falls, then rises, then falls again). Higher-than-third-order models have rarely been used, however, as their complicated structures have not been found useful in modeling natural phenomena. All polynomial models have the advantage that the inference and analysis methods of linear least-squares regression can be applied if the assumptions of independent observations and normality of error terms are met.
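As a brief illustration of the estimation step, the following sketch (assuming Python with numpy; the variable names and simulated data are ours) fits the second-order model (3) by ordinary least squares to data simulated from the curve shown in Figure 1.

```python
# A minimal sketch of fitting the second-order polynomial model (3) by
# ordinary least squares; data are simulated from Y = 0.75 + 0.5x - 0.25x^2
# with sigma^2 = 0.01, as in Figure 1.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 3.5, 30)
y = 0.75 + 0.5 * x - 0.25 * x ** 2 + rng.normal(0.0, 0.1, size=x.size)

# Design matrix with columns 1, x, x^2: the model is linear in the coefficients.
X = np.column_stack([np.ones_like(x), x, x ** 2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta
residuals = y - y_hat            # plot these against y_hat to assess model fit

print(np.round(beta, 2))         # estimates close to [0.75, 0.5, -0.25]
```

The design matrix makes the point above explicit: once 1, x, and x² are treated as separate predictors, the fit is an ordinary linear regression.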
References

[1] Draper, N.R. & Smith, H. (1998). Applied Regression Analysis, 3rd Edition, John Wiley & Sons, New York.
[2] Ratkowsky, D.A. (1990). Handbook of Nonlinear Regression Models, Marcel Dekker, New York.
L. JANE GOLDSMITH
AND
VANCE W. BERGER
Population Stratification EDWIN J.C.G.
VAN DEN
OORD
Volume 3, pp. 1557–1557 in Encyclopedia of Statistics in Behavioral Science ISBN-13: 978-0-470-86080-9 ISBN-10: 0-470-86080-4 Editors Brian S. Everitt & David C. Howell John Wiley & Sons, Ltd, Chichester, 2005
Population Stratification

A commonly used method to detect disease mutations is to test whether genotype or marker allele frequencies differ between groups of cases and controls. These frequencies will differ if the marker allele is the disease mutation or if it is very closely located to the disease mutation on the chromosome. A problem is that case-control status and marker allele frequencies can also be associated because of population stratification. Population stratification refers to subgroups in the sample, for example, ethnic groups, that differ from each other with respect to the marker allele frequency as well as the disease prevalence. For example, assume that in population 1 the prevalence of disease X is higher than in population 2. Because of different demographic and population genetic histories, the frequency of marker allele A could also be higher in population 1 than in population 2. If the underlying population structure is not taken into account, an association will be observed in the total pooled sample. That is, the cases are more likely to come from population 1 because it has a higher disease prevalence, and the cases are also more likely to have allele A because it has a higher frequency in population 1. Allele A is, however, not a disease mutation, nor does it lie close to a disease mutation on the chromosome. Because it does not contain information that would help to identify the location of a disease mutation, the association is said to be false or spurious. If the subgroups in the population can be identified, such spurious findings can be easily avoided
by performing the tests within each subgroup. In the case of unobserved heterogeneity, one approach to avoiding false positive findings due to population stratification is to use family-based tests. For example, one could count the number of times heterozygous parents transmit allele A rather than allele a to an affected child. Because all family members belong to the same population stratum, a significant result cannot be the result of population stratification. A disadvantage of these family-based tests is that it may be impractical to collect DNA from multiple family members. Another disadvantage is that some families will not be informative (e.g., a homozygous parent will always transmit allele A), resulting in a reduction of sample size. Because of these disadvantages, alternative procedures such as 'genomic control' have been proposed that claim to detect and control for population stratification without the need to genotype family members. The basic idea is that if many unlinked markers are examined, the presence of population stratification should cause differences in a relatively large subset of the markers. An important question involves the extent to which population stratification is a threat to case-control studies. The answer to this question seems to fluctuate over time. The fact is that there are currently not many examples of false positives as a result of population stratification where the subgroups could not be identified easily.

(See also Allelic Association)

EDWIN J.C.G. VAN DEN OORD
Power DAVID C. HOWELL Volume 3, pp. 1558–1564 in Encyclopedia of Statistics in Behavioral Science ISBN-13: 978-0-470-86080-9 ISBN-10: 0-470-86080-4 Editors Brian S. Everitt & David C. Howell John Wiley & Sons, Ltd, Chichester, 2005
Power Power is traditionally defined as the probability of correctly rejecting a false null hypothesis. As such it is equal to 1 − β, where β is the probability of a Type II error – failing to reject a false null hypothesis. Power is a function of the sample size (N ), the effect size (ES ) (see Effect Size Measures), the probability of a Type I error (α), and the specific experimental design. The purpose of the article is to review and elaborate important issues dealing with statistical power, rather than to become engaged in the technical issues of direct calculation. Additional material on calculation can be found in the entry power analysis for categorical methods, and in work by Cohen [5–8]. Following many years in which behavioral scientists paid little attention to power [4, 20], in the last 15 or 20 years there has been an increase in the demand that investigators consider the power of their studies. Funding agencies, institutional review boards, and even some journals expect to see that the investigator made a serious effort to estimate the power of a study before conducting the research. Agencies do not want to fund work that does not have a reasonable chance of arriving at informative conclusions, and review boards do not want to see animals or patients wasted needlessly – either because the study has too little power or because it has far more power than is thought necessary. It is an open question whether such requirements are having the effect of increasing the true power of experiments. Investigators can increase reported power by arbitrarily increasing the effect size they claim to seek until the power calculation for the number of animals they planned to run anyway reaches some adequate level. Senn [21] put it this way: ‘Clinically relevant difference: used in the theory of clinical trials as opposed to cynically relevant difference which is used in the practice (p. 174, italics added)’. Four important variables influence the power of an experiment. These are the size of the effect, the sample size, the probability of a Type I error (α), and the experimental design.
Effect Size (ES)

The basic concepts in the definition of power go back to the development of statistical theory by Neyman and Pearson [17, 18]. Whereas Fisher [11] recognized only a null hypothesis (H0) that could be retained or rejected, Neyman and Pearson argued for the existence of an alternative hypothesis (H1) in addition to the null. Effect size is a measure of the degree to which parameters expected under the alternative hypothesis differ from parameters expected under the null hypothesis (see Effect Size Measures). Different disciplines, and even different subfields within the behavioral sciences, conceptualize effect sizes differently. In medicine, where the units of measurement are more universal and meaningful (e.g., change in mmHg for blood pressure, and proportion of treated patients showing an improvement), it is sensible to express the size of an effect in terms of raw score units. We can look for a change of 10 mmHg in systolic blood pressure as a clinically meaningful difference. In psychology, where the units of measurement are often measures like the number of items recalled from a list, or the number of seconds a newborn will stare at an image, the units of measurement have no consensual meaning. As a result, we often find it difficult to specify the magnitude of the difference between two group means that would be meaningful. For this reason, psychologists often scale the difference by the size of the standard deviation; it is more meaningful to speak of a quarter of a standard deviation change in fixation length than to speak of a 2 sec change in fixation. As an illustration of standardized effect size, consider the situation in which we wish to compare two group means. Our null hypothesis is usually that the corresponding population means are equal, with µ1 − µ2 = 0. Under the alternative hypothesis, for a two-tailed test, µ1 − µ2 ≠ 0. Thus, our measure of effect size should reflect the anticipated difference in the population means under H1. However, the size of the anticipated mean difference needs to be considered in relation to the population variance, and so we will define ES = (µ1 − µ2)/σ, where σ is taken as the (common) size of the population standard deviation. Thus, the effect size is scaled by, or standardized by, the population standard deviation. The effect size measures considered in this article are generally standardized; thus, they are scale free and do not depend on the original units of measurement.

Small, Medium, and Large Effects

Cohen [4] initially put forth several rules of thumb concerning what he considered small, medium, and large values of ES. These values vary with the statistical procedure. For example, when comparing the means of two independent groups, Cohen defined 'small' as ES = 0.20, 'medium' as ES = 0.50, and 'large' as ES = 0.80. Lenth [14] challenged Cohen's provision of small, medium, and large estimates of effect size, as well as his choice of standardized measures. Lenth argued persuasively that the experimenter needs to think very carefully about both the numerator and denominator of ES, and should not simply look at their ratio. Cohen would have been one of the first to side with Lenth's emphasis on the importance of thinking clearly about the difference to be expected in population means, the value of the population correlation, and so on. Cohen put forth those definitions tentatively in the beginning, though he did give them more weight later. The calculation of ES is serious business, and should not be passed over lightly. The issue of standardization is more an issue of reporting effects than of power estimation, because all approaches to power involve standardizing measures at some stage.
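As a small numerical illustration of the standardized effect size just defined, the following sketch (assuming Python with numpy; the data are simulated and the function name is ours) estimates ES from two samples using the pooled standard deviation.

```python
# A small illustration of ES = (mu1 - mu2)/sigma, estimated from two samples
# with the pooled standard deviation (the data below are simulated).
import numpy as np

def standardized_effect_size(x, y):
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * x.var(ddof=1) + (ny - 1) * y.var(ddof=1)) / (nx + ny - 2)
    return (x.mean() - y.mean()) / np.sqrt(pooled_var)

rng = np.random.default_rng(1)
treated = rng.normal(0.5, 1.0, 40)   # population means differ by half a standard deviation
control = rng.normal(0.0, 1.0, 40)
print(round(standardized_effect_size(treated, control), 2))   # sample estimate of ES = 0.5
```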
Sample Size (N)

After the effect size, the most important variable controlling power is the size of our sample(s). It is important both because it has such a strong influence on the power of a test, and because it is the variable most easily manipulated by the investigator. It is difficult, though not always impossible, to manipulate the size of your effect or the degree of variability within the groups, but it is relatively easy to alter the sample size.
Probability of a Type I Error (α) It should be apparent that the power of an experiment depends on the cutoff for α. The more willing we are to risk making a Type I error, the more null hypotheses, true and false, we will reject. And as the number of rejected false null hypotheses increases, so does the power of our study. Statisticians, and particularly those in the behavioral sciences, have generally recognized two traditional levels of α : 0.05 and 0.01. While α can be set at other values, these are the two that we traditionally employ, and these are the two that we are likely to
continue to employ, at least in the near future. But if these are the only levels we feel confident in using, partly because we fear criticism from colleagues and review panels, then we do not have much to play with. We can calculate the power associated with each level of α and make our choice accordingly. Because, all other things equal, power is greater with α = 0.05 than with α = 0.01, there is a tendency to settle on α = 0.05. When we discuss what Buchner, Erdfelder, and Faul [3] call ‘compromise power’, we will come to a situation where there is the possibility of specifying the relative seriousness of Type I and Type II errors and accepting α at values greater than 0.05. This is an option that deserves to be taken seriously, especially for exploratory studies.
Experimental Design We often don’t think about the relevance of the experimental design to determination of power, but changes in the design can have a major impact. The calculation of power assumes a specific design, but it is up to us to specify that design. For example, all other things equal, we will generally have more power in a design involving paired observations than in one with independent groups. When all assumptions are met, standard parametric procedures are usually more powerful than their corresponding distribution-free alternatives, though that power difference is often slight. (See Noether [19] for a treatment of power of some common nonparametric tests.) For a third example, McClelland [15] presents data showing how power can vary considerably depending on how the experiment is designed. Consider an experiment that varies the dosage of a drug with levels of 0.5, 1, 1.5, 2, and 2.5 µg. If observations were distributed evenly over those dosages, it would take a total of 2N observations to have the same power as an experiment in which only N participants were equally divided between the highest and lowest dosages, with no observations assigned to the other three levels. (Of course, any nonlinear trend in response across the five dosage levels would be untestable.) McClelland and Judd [16] illustrate similar kinds of effects for the detection of interaction in factorial designs. Their papers, and one by Irwin and McClelland [13] are important for those contemplating optimal experimental designs.
The Calculation of Power

The most common approach to power calculation is the approach taken by Cohen [5]. Cohen developed his procedures in conjunction with appropriate statistical tables, which helps to explain the nature of the approach. However, the same issues arise if you are using statistical software to carry out your calculations. Again, this section describes the general approach and does not focus on the specifics. An important principle in calculating power and sample size requirements is the separation of our measure of the size of an effect (ES), which does not have anything to do with the size of the sample, from the sample size itself. This allows us to work with the anticipated effect size to calculate the power associated with a given sample size, or to reverse the process and calculate the sample size required for the desired level of power. Cohen's approach first involves the calculation of the effect size for the particular experimental design. Following that, we define a value (here called δ) which is some function of the sample size. For example, when testing the difference between two population means, δ = ES·√(n/2), where n is the number of observations in any one sample. Analogous calculations are involved for other statistical tests. What we have done here is to calculate ES independent of sample size, and then combine ES and sample size to derive power. Once we have δ, the next step simply involves entering the appropriate table or statistical software with δ and α, and reading off the corresponding level of power. If, instead of calculating power for a given sample size, we want to calculate the sample size required for a given level of power, we can simply use the table backwards. We find the value of δ that is required for the desired power, and then use δ and ES to solve for n.
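The procedure just described is easy to automate. The sketch below (assuming Python with scipy; the normal approximation stands in for Cohen's tables and the function name is ours) computes δ = ES·√(n/2) for two independent means and converts it to power.

```python
# A minimal sketch of the delta-based approach described above for two
# independent means; the normal approximation stands in for Cohen's tables.
from scipy.stats import norm

def power_two_means(es, n_per_group, alpha=0.05):
    delta = es * (n_per_group / 2) ** 0.5       # delta = ES * sqrt(n/2)
    z_crit = norm.ppf(1 - alpha / 2)
    # Two-tailed power; the second term (wrong-direction rejections) is tiny.
    return norm.cdf(delta - z_crit) + norm.cdf(-delta - z_crit)

# Medium effect (ES = 0.50) with 25 observations per group:
print(round(power_two_means(0.5, 25), 2))   # about 0.42 (cf. 0.41 from the exact noncentral t)
```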
A Priori, Retrospective, and Compromise Power In general the discussion above has focused on a priori power, which is the power that we would calculate before the experiment is conducted. It is based on reasonable estimates of means, variances, correlations, proportions, and so on, that we believe represent the parameters for our population or populations. This is
what we generally think of when we consider statistical power. In recent years, there has been an increased interest is what is often called retrospective (or post hoc) power. For our purposes retrospective power will be defined as power that is calculated after an experiment has been completed on the basis of the results of that experiment. For example, retrospective power asks the question ‘If the values of the population means and variances were equal to the values found in this experiment, what would be the resulting power?’ One reason why we might calculate retrospective power is to help in the design of future research. Suppose that we have just completed an experiment and want to replicate it, perhaps with a different sample size and a demographically different pool of participants. We can take the results that we just obtained, treat them as an accurate reflection of the population means and standard deviations, and use those values to calculate ES. We can then use that ES to make power estimates. This use of retrospective power, which is, in effect, the a priori power of our next experiment, is relatively noncontroversial. Many statistical packages, including SAS and SPSS, will make these calculations for you (see Software for Statistical Analyses). What is more controversial, however, is to use retrospective power calculations as an explanation of the obtained results. A common suggestion in the literature claims that if the study was not significant, but had high retrospective power, that result speaks to the acceptance of the null hypothesis. This view hinges on the argument that if you had high power, you would have been very likely to reject a false null, and thus nonsignificance indicates that the null is either true or nearly so. But as Hoenig and Heisey [12] point out, there is a false premise here. It is not possible to fail to reject the null and yet have high retrospective power. In fact, a result with p exactly equal to 0.05 will have a retrospective power of essentially 0.50, and that retrospective power will decrease for p > 0.05. The argument is sometimes made that retrospective power tells you more than you can learn from the obtained P value. This argument is a derivative of the one in the previous paragraph. However, it is easy to show that for a given effect size and sample size, there is a 1:1 relationship between p and retrospective power. One can be derived from the
other. Thus, retrospective power offers no additional information. As Hoenig and Heisey [12] argue, rather than focus our energies on calculating retrospective power to try to learn more about what our results have to reveal, we are better off putting that effort into calculating confidence limits on the parameter(s) or the effect size. If, for example, we had a t Test on two independent groups with t (48) = 1.90, p = 0.063, we would fail to reject the null hypothesis. When we calculate retrospective power we find it to be 0.46. When we calculate the 95% confidence interval on µ1 − µ2 we find −1.10 < µ1 − µ2 < 39.1. The confidence interval tells us more about what we are studying than does the fact that power is only 0.46. (Even had the difference been slightly greater, and thus significant, the confidence interval shows that we still do not have a very good idea of the magnitude of the difference between the population means.) Thomas [24] considered an alternative approach to retrospective power that uses the obtained standard deviation, but ignores the obtained mean difference. You can then choose what you consider to be a minimally acceptable level of power, and solve for the smallest effect size that is detectable at that power. We then know if our study, or better yet future studies, had or have a respectable chance of detecting a behaviorally meaningful result. Retrospective power can be a useful tool when evaluating studies in the literature, as in a metaanalysis, or planning future work. But retrospective power it not a useful tool for excusing our own nonsignificant results. A third, and less common, approach to power analysis is called compromise power. For a priori power we compute ES and δ, and then usually determine power when α is either 0.05 or 0.01. There is nothing that says that we need to use those values of α, but they are the ones traditionally used. However, another way to determine power would be to specify alternative values for α by considering the relative importance we assign to Type I and Type II errors. If Type II errors are particularly serious, such as in early exploratory research, we might be willing to let α rise to keep down β. One way to approach compromise power is to define an acceptable ratio of the seriousness of Type II and Type I errors. This ratio is β/α. If the two types of error were equally serious, this ratio would be 1. (Our traditional approach often sets α at 0.05 and β at
0.20, for a ratio of 4.) Given the appropriate software, we could then base our power calculations on our desire to have the appropriate β/α ratio, rather than fixing α at 0.05 and specifying the desired power. There is no practical way that we can calculate compromise power using standard statistical tables, but the program G*Power, to be discussed shortly, will allow us to perform the necessary calculation. For example, suppose that we contemplated a standard experiment with n = 25 participants in each of two groups, and an expected effect size of 0.50. Using a traditional cutoff of α = 0.05, we would have power equal to 0.41, or β = 0.59. In this case, the probability of making a Type II error is approximately 12 times as great as the probability of making a Type I error. But we may think that ratio is not consistent with our view of the seriousness of these two types of errors, and that a ratio of 2:1 would be more appropriate. We can thus set the β/α ratio at 2. For a two-group experiment with 25 participants in each group, where we expect ES = 0.50 (a medium effect size), G*Power shows that letting α = 0.17 will give us a power of 0.65. That is an abnormally high risk of a Type I error, but it is more in line with our subjective belief in the seriousness of the two kinds of errors, and may well be worth the price in exploratory research. This is especially true if our sample size is fixed at some value by the nature of the research, so that we can't vary N to increase our power.
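For readers who prefer to script such analyses, the following sketch (assuming Python with scipy; the search over α is our own way of implementing a compromise analysis, not G*Power's algorithm) reproduces the a priori power of about 0.41 and then finds the α that yields a β/α ratio of 2 for the same design.

```python
# A minimal sketch of a priori and 'compromise' power for a two-group t test
# with n = 25 per group and ES = 0.50, using the noncentral t distribution.
import numpy as np
from scipy import stats, optimize

def t_test_power(d, n_per_group, alpha):
    df = 2 * n_per_group - 2
    ncp = d * np.sqrt(n_per_group / 2)                  # noncentrality parameter
    t_crit = stats.t.ppf(1 - alpha / 2, df)
    return (1 - stats.nct.cdf(t_crit, df, ncp)) + stats.nct.cdf(-t_crit, df, ncp)

d, n = 0.50, 25
print(round(t_test_power(d, n, 0.05), 2))               # about 0.41

# Compromise analysis: choose alpha so that beta/alpha equals a desired ratio q.
q = 2.0
alpha = optimize.brentq(lambda a: (1 - t_test_power(d, n, a)) / a - q, 1e-4, 0.5)
print(round(alpha, 2), round(t_test_power(d, n, alpha), 2))   # near 0.17 and 0.65
```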
Power Analysis Software

There is a wealth of statistical software for power analysis available on the World Wide Web, and even a casual search using www.google.com will find it. Some of this software is quite expensive, and some of it is free, but 'free' should not be taken to mean 'inferior'. Much of it is quite good. Surprisingly, most of the available software was written a number of years ago, as is also the case with reviews of that software. I will only mention a few pieces of software here, because the rest is easy to find on the web. To my mind the best piece of software for general use is G*Power [10], which can be downloaded for free at http://www.psycho.uni-duesseldorf.de/aap/projects/gpower/. This software comes with an excellent manual and is available for both Macintosh and PC computers. It
Power will handle tests on correlations, proportions, means (using both t and F ), and contingency tables. You can also perform excellent power analyses from nQuery Advisor (see http://www.statsol. ie/nquery/nquery.htm). This is certainly not free software, but it handles a wide array of experimental designs and does an excellent job of allowing you to try out different options and plot your results. It will also allow you to calculate sample size required to produce a confidence interval of specified width. A very nice set of Java applets have been written by Russell Lenth, and are available at http://www.stat.uiowa.edu/∼rlenth/ Power/. An advantage of using Java applets for cal-
calculation of power is that you only have to open a web page to do so. A very good piece of software is a program called DataSim by Bradley [2]. It is currently available for Macintosh systems, though there is an older DOS version. It is not very expensive and will carry out many analyses in addition to power calculations. DataSim was originally designed as a data simulation program, at which it excels, but it does a very nice job of doing power calculations. For those who use SAS for data analysis, O'Brien and Muller have written a power module, which is available at http://www.bio.ri.ccf.org/Power/. This module is easy to include in a SAS program. A good review of software for power calculations can be found at http://www.zoology.ubc.ca/~krebs/power.html. This review is old, but in this case that does not seem to be much of a drawback.
Confidence Limits on Effect Size

Earlier in this article I briefly discussed the advantage of setting confidence limits on the parameter(s) being estimated, and argued that this was an important way of understanding our results. It is also possible to set confidence limits on effect sizes. Smithson [22] has argued for the use of confidence limits on (standardized) effect size, on the grounds that confidence limits on unstandardized effects do not allow us to make comparisons across studies using different variables. On the other hand, confidence intervals on standardized effect sizes, which
are dimensionless, can more easily be compared. While one could argue with the requirement of comparability across variables, there are times when it is useful. The increased emphasis on confidence limits has caused some [1] to suggest that sample size calculations should focus solely on the width of the resulting confidence interval, and should disregard the consideration of power for statistical tests. Daly [9], on the other hand, has emphasized that we need to consider both confidence interval width and the minimum meaningful effect size (such as a difference in means) when it comes to setting our sample sizes. Otherwise we risk seriously underestimating the sample size required for satisfactory power. For example, suppose that the difference in means between two treatments is expected to be 5 points, and we calculate the sample size needed for a 95% confidence interval to extend just under 5 points on either side of the mean. If we obtain such an interval and the observed difference was 5, the confidence interval would not include 0 and we would reject our null hypothesis. But a difference of 5 points is an estimate, and sampling error will cause our actual data to be larger or smaller than that. Half the time the obtained interval will come out to be wider than that, and half the time it will be somewhat narrower. For the half the time that it is wider, we would fail to reject the null hypothesis of no difference. So in setting the sample size to yield a mean width of the confidence interval with predefined precision, our experiment has a power coefficient of only 0.50, which we would likely find unacceptable. It is possible to set confidence limits on power as well as on effect sizes, making use of the fact that the sample variance is a random variable. However, our estimates of power, especially a priori power, are already sufficiently tentative that the effort is not likely to be rewarding. You would most likely be further ahead putting the effort into calculating, and perhaps plotting, power for a range of reasonable values of the effect sizes and/or sample sizes. Smithson [23] has argued that one should calculate both power and a confidence interval on the effect size. He notes that the confidence interval does not narrow in width as power increases because of an increased ES, and it is important to anticipate the width of our confidence interval. We might then wish to increase the sample size, even in the presence of high power, so as to decrease
Table 1  Effect size formulae for various standard tests, and the sample sizes required for power = 0.80 with α = 0.05 at several levels of effect size. (The formulae were taken from Cohen [5], and the estimated sample sizes were calculated using G*Power (where appropriate) or tables from Cohen [5].)

For ES = 0.2, 0.4, 0.6, 0.8, and 1.0:
  Difference between two independent means (H0: µ1 − µ2 = 0); ES = (µ1 − µ2)/σe:  394, 100, 45, 26, 17
  Difference between means of paired samples (H0: µ1 − µ2 = µD = 0); ES = µD/σD = (µ1 − µ2)/σD:  394, 100, 45, 26, 17
  One-way analysis of variance (H0: µ1 = µ2 = ··· = µk; sample sizes per group for dfbet = 2); ES = √[Σ(µj − µ)²/k]/σe:  82, 22, 10, 6, –
  Product moment correlation; ES = ρ:  91, 44, 17, 7, –
  Comparing two independent proportions (H0: P1 − P2 = 0), with φ = 2 arcsin √P; ES = φ1 − φ2:  392, 98, 44, 25, 16
  Goodness-of-fit chi-square (sample sizes for df = 3); ES = √[Σ(P1i − P0i)²/P0i]:  331, 68, 30, <25, <25
  Contingency table chi-square (sample size for df = 1); ES = √[Σ(P1i − P0i)²/P0i]:  196, 44, <25, <25, <25

For ES = 0.05, 0.10, 0.20, 0.30, and 0.40:
  Multiple correlation (sample size for 2 predictors); ES = R²/(1 − R²):  196, 100, 52, 36, 28
  Multiple partial correlation; ES = R²(01.2)/(1 − R²(01.2)):  196, 100, 52, 36, 28

For ES = 0.05, 0.10, 0.15, 0.20, and 0.25:
  Single proportion against P = 0.50 (H0: P = 0.50); ES = P1 − 0.50:  783, 194, 85, 49, 30

Note: P1, P2 = population proportions under H1; P0 = population proportion under H0; c = number of cells in the one-way or contingency table (the sums in the chi-square effect sizes run over i = 1, …, c); µ1, µ2, µj = population means; σe = population standard deviation; σD = standard deviation of difference scores; k = number of groups.
the confidence interval on the size of the resulting effect.
Calculating Power for Specific Experimental Designs

Once we have an estimate of our effect size, the calculation of power is straightforward. As discussed above, there are a number of statistical programs available for computing power. In addition, hand calculations using standard tables (e.g., Cohen [5]) are relatively simple and straightforward. Regardless of whether computations are done by software or by hand, we can choose to calculate the power for a given sample size, or the sample size required to produce a desired level of power. Whether we are using software or standard tables, we enter the effect size and the sample size, and then read off the available power. Alternatively, we could enter the effect size and the desired power and read off the required sample size. Table 1, in addition to defining the effect size for each of many different designs, also contains the sample sizes required for power = 0.80 for a representative set of effect sizes.

References

[1] Beal, S.L. (1989). Sample size determination for confidence intervals on the population mean and on the difference between two population means, Biometrics 45, 969–977.
[2] Bradley, D.R. (1997). DATASIM: a general purpose data simulator, in Technology and Teaching, L. Lloyd, ed., Information Today, Medford, pp. 93–117.
[3] Buchner, A., Erdfelder, E. & Faul, F. (1997). How to Use G*Power [WWW document]. URL: http://www.psycho.uni-duesseldorf.de/aap/projects/gpower/how to use gpower.html
[4] Cohen, J. (1962). The statistical power of abnormal-social psychological research: a review, Journal of Abnormal and Social Psychology 65, 145–153.
[5] Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences, 2nd Edition, Lawrence Erlbaum, Hillsdale.
[6] Cohen, J. (1990). Things I have learned (so far), American Psychologist 45, 1304–1312.
[7] Cohen, J. (1992a). A power primer, Psychological Bulletin 112, 155–159.
[8] Cohen, J. (1992b). Statistical power analysis, Current Directions in Psychological Science 1, 98–101.
[9] Daly, L.E. (2000). Confidence intervals and sample size, in Statistics with Confidence, D.G. Altman, D. Machin, T.N. Bryant & M.J. Gardner, eds, BMJ Books, London.
[10] Erdfelder, E., Faul, F. & Buchner, A. (1996). G*Power: a general power analysis program, Behavior Research Methods, Instruments, & Computers 28, 1–11.
[11] Fisher, R.A. (1956). Statistical Methods and Scientific Inference, Oliver and Boyd, Edinburgh.
[12] Hoenig, J.M. & Heisey, D.M. (2001). The abuse of power: the pervasive fallacy of power calculations for data analysis, American Statistician 55, 19–24.
[13] Irwin, J.R. & McClelland, G.H. (2003). Negative consequences of dichotomizing continuous predictor variables, Journal of Marketing Research 40, 366–371.
[14] Lenth, R.V. (2001). Some practical guidelines for effective sample size determination, American Statistician 55, 187–193.
[15] McClelland, G.H. (1997). Optimal design in psychological research, Psychological Methods 2, 3–19.
[16] McClelland, G.H. & Judd, C.M. (1993). Statistical difficulties of detecting interactions and moderator effects, Psychological Bulletin 114, 376–390.
[17] Neyman, J. & Pearson, E.S. (1928). On the use and interpretation of certain test criteria for purposes of statistical inference, Biometrika 20A, 175–240.
[18] Neyman, J. & Pearson, E.S. (1933). On the problem of the most efficient tests of statistical hypotheses, Transactions of the Royal Society of London, Series A 231, 289–337.
[19] Noether, G.E. (1987). Sample size determination for some common nonparametric tests, Journal of the American Statistical Association 82, 645–647.
[20] Sedlmeier, P. & Gigerenzer, G. (1989). Do studies of statistical power have an effect on the power of studies? Psychological Bulletin 105, 309–316.
[21] Senn, S. (1997). Statistical Issues of Drug Development, Wiley, Chichester.
[22] Smithson, M. (2000). Statistics with Confidence, Sage Publications, London.
[23] Smithson, M. (2001). Correct confidence intervals for various regression effect sizes and parameters: the importance of noncentral distributions in computing intervals, Educational and Psychological Measurement 61, 605–632.
[24] Thomas, L. (1997). Retrospective power analysis, Conservation Biology 11, 276–280.
DAVID C. HOWELL
Power Analysis for Categorical Methods EDGAR ERDFELDER, FRANZ FAUL
AND
AXEL BUCHNER
Volume 3, pp. 1565–1570 in Encyclopedia of Statistics in Behavioral Science ISBN-13: 978-0-470-86080-9 ISBN-10: 0-470-86080-4 Editors Brian S. Everitt & David C. Howell John Wiley & Sons, Ltd, Chichester, 2005
Power Analysis for Categorical Methods

In testing statistical hypotheses, two types of error can be made: (a) a type-1 error, that is, a valid null hypothesis (H0) is falsely rejected, and (b) a type-2 error, that is, an invalid H0 is falsely retained when the alternative hypothesis H1 is in fact true. Obviously, a reasonable decision between H0 and H1 requires both the type-1 error probability α and the type-2 error probability β to be sufficiently small. The power of a statistical test is the complement of the type-2 error probability, 1 − β, and it represents the probability of correctly rejecting H0 when H1 holds [4]. As described in this entry, researchers can assess or control the power of statistical tests for categorical data by making use of the following types of power analysis [6]:

1. Post hoc power analysis: compute the power of the test for a given sample size N, an alpha level α, and an effect of size w, that is, a prespecified deviation w of the population parameters under H1 from those specified by H0 (see [4], Chapter 7).
2. A priori power analysis: compute the sample size N that is necessary to obtain a power level 1 − β, given an effect of size w and a type-1 error level α.
3. Compromise power analysis: compute the critical value, and the associated α and 1 − β probabilities, of the test statistic defining the boundary between decisions for H1 and H0, given a sample of size N, an effect of size w, and a desired ratio q = β/α of the error probabilities reflecting the relative seriousness of the two error types.
4. Sensitivity analysis: compute the effect size w that can be detected given a power level 1 − β, a specific α level, and a sample of size N.
Statistical Hypotheses for Categorical Data Statistical tests for categorical data often refer to a sample of N independent observations drawn randomly from a population where each observation can be assigned to one and only one of J observation categories C1 , C2 , . . . , CJ . These categories may correspond to the J possible values of a single discrete
variable (e.g., religion) or to the J cells of a bivariate or multivariate contingency table. For example, for a 2 by 2 table of two dichotomous variables (see Two by Two Contingency Tables) A (say, gender) and B (say, interest in soccer), the number of categories would be J = 2 × 2 = 4. Two types of statistical hypotheses need to be distinguished:

1. Simple null hypotheses assume that the observed frequencies y1, y2, ..., yJ of the J categories are a random sample drawn from a multinomial distribution with category probabilities p(Cj) = π0j, for j = 1, ..., J. That is, each category Cj has a certain probability π0j exactly specified by H0. An example would be a uniform distribution across the categories such that π0j = 1/J for all j. Thus, for J = 3 religion categories (say, C1: Catholics, C2: Protestants, C3: other), the uniform distribution hypothesis predicts H0: π01 = π02 = π03 = 1/3. In contrast, the alternative hypothesis H1 would predict that at least two category probabilities differ from 1/3. Power analyses for simple null hypotheses are described in the section 'Simple Null Hypotheses'.
2. Composite null hypotheses are also based on a multinomial distribution model for the observed frequencies but do not specify exact category probabilities. Instead, the probabilities p(Cj) of the categories Cj, j = 1, ..., J, are assumed to be functions of S unknown real-valued parameters θ1, θ2, ..., θS that need to be estimated from the data. Hence, p(Cj) = fj(θ1, θ2, ..., θS), where the model equations fj define a function mapping the parameter space (the set of all possible parameter vectors θ = (θ1, θ2, ..., θS)) onto a subset Ω0 of the set Ω of all possible category probability vectors. Examples include parameterized multinomial models playing a prominent role in behavioral research, such as loglinear models [1], processing-tree models [2], and signal-detection models (see Signal Detection Theory) [7]. Composite null hypotheses posit that the model defined by the model equations is valid, that is, that the vector π = (p(C1), p(C2), ..., p(CJ)) of category probabilities in the underlying population is consistent with the model equations. Put more formally, H0: π ∈ Ω0 and, correspondingly, H1: π ∉ Ω0. Power analyses for composite null hypotheses
are described in the section 'Composite Null Hypotheses'.
The section on “Joint Multinomial Models” is devoted to the frequent situation that K > 1 samples of Nk observations in sample k, k = 1, . . . , K, are drawn randomly and independently from K different populations, with a multinomial distribution model holding for each sample. Fortunately, the power analysis procedures described in sections “Simple Null Hypotheses” and “Composite Null Hypotheses” easily generalize to such joint multinomial models both for simple and composite null hypotheses.
Simple Null Hypotheses

Any test statistic PDλ of the power divergence family can be used for testing simple null hypotheses for multinomial models [8]:
$$PD_\lambda = \frac{2}{\lambda(\lambda+1)} \sum_{j=1}^{J} y_j\left[\left(\frac{y_j}{N\pi_{0j}}\right)^{\lambda} - 1\right] \quad \text{for any real-valued } \lambda \notin \{-1, 0\},$$
$$PD_{\lambda=-1} = \lim_{\lambda \to -1} \frac{2}{\lambda(\lambda+1)} \sum_{j=1}^{J} y_j\left[\left(\frac{y_j}{N\pi_{0j}}\right)^{\lambda} - 1\right],$$
$$PD_{\lambda=0} = \lim_{\lambda \to 0} \frac{2}{\lambda(\lambda+1)} \sum_{j=1}^{J} y_j\left[\left(\frac{y_j}{N\pi_{0j}}\right)^{\lambda} - 1\right] = 2\sum_{j=1}^{J} y_j \ln\frac{y_j}{N\pi_{0j}}. \qquad (1)$$
As specified above, y_j denotes the observed frequency in category j, N the total sample size, and π_0j the probability assigned to category j by H0, so that Nπ_0j is the expected frequency under H0. Read and Cressie [8] have shown that, if H0 holds and Birch's [3] regularity conditions are met, PDλ is asymptotically chi-square distributed with df = J − 1 for each fixed value of λ. Both the well-known Pearson chi-square statistic (see Contingency Tables)
$$X^2 = \sum_{j=1}^{J} \frac{(y_j - N\pi_{0j})^2}{N\pi_{0j}} = PD_{\lambda=1} \qquad (2)$$
and the often used log-likelihood-ratio chi-square statistic
$$G^2 = 2\sum_{j=1}^{J} y_j \ln\frac{y_j}{N\pi_{0j}} = PD_{\lambda=0} \qquad (3)$$
are special cases of the PDλ family. However, Read and Cressie [8] argued that these statistics may not be optimal. The Cressie–Read statistic PDλ with λ = 2/3, a 'compromise' between G² and X², performs amazingly well in many situations when compared to G², X², and other PDλ statistics.

If H0 is violated and H1 holds with a category probability vector π1 = (π11, π12, ..., π1J), π1 ≠ π0, then the distribution of any PDλ statistic can be approximated by a noncentral chi-square distribution with df = J − 1 and noncentrality parameter
$$\gamma_\lambda = \frac{2}{\lambda(\lambda+1)} \sum_{j=1}^{J} N\pi_{1j}\left[\left(\frac{N\pi_{1j}}{N\pi_{0j}}\right)^{\lambda} - 1\right] \quad \text{for any real-valued } \lambda \notin \{-1, 0\}, \qquad (4)$$
with γ_{λ=−1} and γ_{λ=0} defined as limits as in (1) [8]. Thus, the noncentrality parameter γλ is just PDλ with the expected category frequencies under H1, Nπ_1j, replacing the observed frequencies y_j in (1). Alternatively, one can first compute an effect size index wλ (see Effect Size Measures) measuring the deviation of the H1 parameters from the H0 parameters, uncontaminated by the sample size N:
$$w_\lambda = \sqrt{\frac{2}{\lambda(\lambda+1)} \sum_{j=1}^{J} \pi_{1j}\left[\left(\frac{\pi_{1j}}{\pi_{0j}}\right)^{\lambda} - 1\right]} \quad \text{for any real-valued } \lambda \notin \{-1, 0\}. \qquad (5)$$
In a second step, we derive the noncentrality parameter γλ from wλ and the sample size N as follows:
$$\gamma_\lambda = N w_\lambda^2. \qquad (6)$$
The effect size index for λ = 1, w_{λ=1}, is equivalent to Cohen's [4] effect size parameter w for Pearson's chi-square test:
$$w = \sqrt{\sum_{j=1}^{J} \frac{(\pi_{1j} - \pi_{0j})^2}{\pi_{0j}}} = w_{\lambda=1}. \qquad (7)$$
Cohen ([4], Chapter 7) suggested to interpret effects of sizes w = 0.1, w = 0.3, and w = 0.5 as ‘small’, ‘medium’, and ‘large’, respectively, and it seems reasonable to generalize these effect size conventions to other measures of the wλ family. However, it has been our experience that these labels may be misleading in some applications because they do not always correspond to intuitions about small, medium, and large effects if expressed in terms of parameter differences between H1 and H0 . Therefore, we prefer assessing the power of a test in terms of the model parameters under H1 , not in terms of the w effect size conventions. As an illustrative example, let us assume that we want to test the equiprobability hypothesis for J = 3 religion types (H0 : πo1 = πo2 = πo3 = 1/3) against the alternative hypothesis H1 : πo1 = πo2 = 1/4; πo3 = 1/2 using Pearson’s X 2 statistic (i.e., PDλ=1 ) with df = 3 − 1 = 2. If we have N = 100 observations and select α = .05, what is the power of this test? One way of answering this question would be to compute w for these parameters as defined in (7) and then use (6) to derive the noncentrality parameter γλ=1 . The power of the test is simply the area under this noncentral chi-square distribution to the right of the critical value corresponding to α = .05. The power table 7.3.16 of Cohen ([4], p. 235) can be used for approximating this power value. More convenient and flexible are PC-based power analysis tools such as GPOWER [6]1 . In GPOWER, select ‘Chi-square Test’ and ‘Post hoc’ type of power analysis. Next, click on ‘calc effect size’ (Macintosh version: calc w) to enter the category probabilities under H0 and H1 . ‘Calc&Copy’ provides us with the effect size w = 0.3536 and copies it to the main window of GPOWER. We also need to enter α = .05, N = 100, and df = 2 before a click on ‘calculate’ yields the post hoc power value 1 − β = .8963 for this situation. Assume that we are not satisfied with this state of affairs because the two error probabilities are not exactly balanced (α = .05, β = 1 − .8963 = .1037). To remedy this problem, we select the ‘compromise’ type of power analysis in GPOWER and enter
q = beta/alpha = 1 as the desired ratio of error probabilities. Other things being equal, we obtain a critical value of 5.1791 as an optimal decision criterion between H0 and H1 , corresponding to exactly balanced error probabilities of α = β = .0751. If we are still not satisfied and want to make sure that α = β = .05 for w = 0.3536, we can proceed to an ‘A priori ’ type of power analysis in GPOWER, which yields N = 124 as the necessary sample size required by the input parameters.
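For readers who prefer to script these calculations, the following minimal Python sketch (assuming NumPy and SciPy are available; it is an illustration of equations (6) and (7), not GPOWER itself) reproduces the post hoc power computation for the equiprobability example above.

```python
# Post hoc power for Pearson's X^2 test of equiprobability, J = 3 categories,
# using Cohen's w (eq. 7) and gamma = N * w^2 (eq. 6).
import numpy as np
from scipy.stats import chi2, ncx2

pi0 = np.array([1/3, 1/3, 1/3])                  # H0 category probabilities
pi1 = np.array([1/4, 1/4, 1/2])                  # H1 category probabilities
N, alpha = 100, 0.05
df = len(pi0) - 1

w = np.sqrt(np.sum((pi1 - pi0) ** 2 / pi0))      # Cohen's w = w_{lambda=1}, eq. (7)
gamma = N * w ** 2                               # noncentrality parameter, eq. (6)
crit = chi2.ppf(1 - alpha, df)                   # critical value of the central chi-square
power = ncx2.sf(crit, df, gamma)                 # area beyond crit under the noncentral chi-square

print(round(w, 4), round(power, 4))              # w = 0.3536, power close to .8963
```

The a priori question can be answered with the same ingredients by increasing N until the computed power reaches the desired level.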
Composite Null Hypotheses

As an example of a composite H0, consider the log-linear model without interaction term for the 2 by 2 contingency table of two dichotomous variables A (gender) and B (interest in soccer):

$$ \ln(e_{0ij}) = u + u_{Ai} + u_{Bj}, \quad \text{with } \sum_{i=1}^{2} u_{Ai} = \sum_{j=1}^{2} u_{Bj} = 0. \qquad (8) $$
The logarithms of the expected frequencies for cell (i, j), i = 1, ..., I, j = 1, ..., J, are explained in terms of a grand mean u, a main effect of the i-th level of variable A, and a main effect of the j-th level of variable B. None of these parameters is specified a priori, so that we now have a composite H0 rather than a simple H0. Nevertheless, any Read-Cressie statistic

$$ PD_{\lambda} = \frac{2}{\lambda(\lambda+1)} \sum_{i=1}^{I} \sum_{j=1}^{J} y_{ij} \left[ \left(\frac{y_{ij}}{\hat{e}_{0ij}}\right)^{\lambda} - 1 \right], \quad \text{for any real-valued } \lambda \notin \{-1, 0\}, \qquad (9) $$

with PD_{λ=−1} and PD_{λ=0} defined as limits as in (1), is still asymptotically chi-square distributed under this composite H0, albeit with df = IJ − 1 − S, where S is the number of free parameters estimated from the data [8]. In our example, we have IJ = 2 × 2 cells, and we estimate S = 2 free parameters u_{A1} and u_{B1}; all other parameters are determined by N and the two identifiability constraints Σ_{i=1}^{2} u_{Ai} = Σ_{j=1}^{2} u_{Bj} = 0. Hence, we have df = 4 − 1 − 2 = 1. As before, y_{ij} denotes the observed frequency in cell (i, j), and ê_{0ij} is the corresponding expected cell frequency under the model (8), with the u parameters being estimated
from the observed frequencies by minimizing the PDλ statistic chosen by the researcher [8]. If the H0 model is false and the saturated log-linear model

$$ \ln(e_{1ij}) = u + u_{Ai} + u_{Bj} + u_{AiBj}, \quad \text{with } \sum_{i=1}^{2} u_{Ai} = \sum_{j=1}^{2} u_{Bj} = \sum_{i=1}^{2} u_{AiBj} = \sum_{j=1}^{2} u_{AiBj} = 0, \qquad (10) $$

holds with interaction terms u_{AiBj} ≠ 0, then PDλ as defined in (9) is approximately noncentrally chi-square distributed with df = IJ − 1 − S and noncentrality parameter

$$ \gamma_{\lambda} = \frac{2}{\lambda(\lambda+1)} \sum_{i=1}^{I} \sum_{j=1}^{J} e_{1ij} \left[ \left(\frac{e_{1ij}}{\hat{e}_{0ij}}\right)^{\lambda} - 1 \right], \quad \text{for any real-valued } \lambda \notin \{-1, 0\}. \qquad (11) $$
As before, γλ=−1 and γλ=0 are defined as limits. In (11), e1ij denotes the expected frequency in cell (i, j ) under the H1 model (10) and eˆ 0ij denotes the corresponding expected frequency under the H0 model (8) estimated from u parameters that minimize the noncentrality parameter γλ . Thus, to obtain the correct noncentrality parameter for tests of composite null hypotheses, simply set up an artificial data set containing the expected cell frequencies under H1 as ‘data’. Then fit the null model to these ‘data’ and compute PDλ . The PDλ ‘statistic’ for this artificial data set corresponds to the noncentrality parameter γλ required for computing the power of the PDλ test under H1 . Note that this rule works for any family of parameterized multinomial models (e.g., logit models, ogive models, processing-tree models),
not just for log-linear models designed for two-dimensional contingency tables. As an illustrative example, we will refer to the chi-square test of association for the 2 by 2 table (cf. [4]). Let us assume that the additive null model (8) is wrong, and the alternative model (10) holds with parameters u = 3.0095 and uA1 = uB1 = uA1B1 = 0.5. By computing the natural logarithms of the expected cell frequencies according to (10) and taking the antilogarithms, we obtain the expected cell frequencies e1ij shown on the left side of Table 1. Fitting the null model (8) without interaction term to these expected frequencies using the G² 'statistic' (= PD_{λ=0}) gives us the expected frequencies under the null model shown in the right part of Table 1. What would be the power of the G² test for this H1 in case of a sample size of N = 120 and a type-1 error probability α = .05? In the first step, we calculate the effect size w_{λ=0} measuring the discrepancy between the H0 and the H1 probabilities corresponding to the expected frequencies in Table 1 according to (5). We obtain the rather small effect size w_{λ=0} = 0.1379, which is close to Cohen's effect size measure w = 0.1503 for the same H1. Using (6), we compute the noncentrality parameter γ_{λ=0} = 120 × 0.1379² = 2.282. The area to the right of the critical value corresponding to α = .05 under the noncentral chi-square distribution with df = 4 − 1 − 2 = 1 is the power of the G² test for the composite H0 (i.e., model (8)). We can compute it very easily by using GPOWER's 'Chi-square test' option together with the 'Post hoc' type of power analysis. Selecting α = .05, w = 0.1379, N = 120, df = 1, and clicking on 'calculate' provides us with the very low power value 1 − β = .3269. A 'compromise' power analysis based on the error probability ratio q = β/α = 1 results in disappointing values of α = β = .3068
Table 1. Expected frequencies for a 2 by 2 contingency table under the saturated log-linear model with parameters u = 3.0095, uA1 = uB1 = uA1B1 = 0.5 (H1; left side) and under a log-linear model without interaction term (H0; right side)

(a) Expected frequencies under the saturated model (H1)

                  Interest in soccer
Gender          Yes         No       Total
Male          90.878     12.298     103.176
Female        12.298      4.525      16.832
Total        103.176     16.832     120

(b) Expected frequencies under the additive model (H0)

                  Interest in soccer
Gender          Yes         No       Total
Male          88.705     14.471     103.176
Female        14.471      2.361      16.832
Total        103.176     16.832     120
for the same set of parameters. Because both error probabilities are unacceptably large, we proceed to an 'a priori' power analysis in GPOWER to learn that we need N = 684 to detect an effect of size w_{λ=0} = 0.1379 with a df = 1 likelihood-ratio test at α = .05 and a power of 1 − β = .95.
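The 'artificial data' rule for composite null hypotheses can be sketched in a few lines of Python (SciPy assumed available; this illustrates the rule, it is not the GPOWER procedure itself), starting from the H1 expected frequencies of Table 1.

```python
# Fit the additive (no-interaction) model to the H1 expected frequencies treated as
# 'data' and use the resulting G^2 'statistic' as the noncentrality parameter.
import numpy as np
from scipy.stats import chi2, ncx2

e1 = np.array([[90.878, 12.298],      # H1 expected frequencies (left panel of Table 1)
               [12.298,  4.525]])
N = e1.sum()

# For a two-way table, the G^2-minimizing additive fit is the independence fit
# computed from the margins of the 'data'; this reproduces the right panel of Table 1.
e0 = np.outer(e1.sum(axis=1), e1.sum(axis=0)) / N

gamma = 2 * np.sum(e1 * np.log(e1 / e0))         # gamma_{lambda=0}, about 2.28
df = 2 * 2 - 1 - 2                               # IJ - 1 - S = 1
power = ncx2.sf(chi2.ppf(0.95, df), df, gamma)   # about .33, cf. the text
print(round(gamma, 3), round(power, 4))
```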
Joint Multinomial Models

So far we have discussed the situation of a single sample of N observations drawn randomly from one multinomial distribution (see Catalogue of Probability Density Functions). In behavioral research, we are often faced with the problem of comparing distributions of discrete variables for K > 1 populations (e.g., experimental and control groups, target-present versus target-absent trials, trials with liberal and conservative pay-off schedules, etc.). Fortunately, the power analysis procedures for simple and composite null hypotheses easily generalize to K > 1 samples drawn randomly and independently from K multinomial distributions (i.e., joint multinomial models). Again, we can use any PDλ statistic:

$$ PD_{\lambda} = \frac{2}{\lambda(\lambda+1)} \sum_{k=1}^{K} \sum_{j=1}^{J_k} y_{kj} \left[ \left(\frac{y_{kj}}{N_k \pi_{0kj}}\right)^{\lambda} - 1 \right], \qquad (12) $$

where the special cases PD_{λ=−1} and PD_{λ=0} are defined as limits. In (12), y_{kj} denotes the observed frequency in the j-th category, j = 1, ..., J_k, of the k-th sample, k = 1, ..., K, and N_k is the size of the k-th sample. In case of simple null hypotheses requiring no parameter estimation, π_{0kj} is the probability of the j-th category in population k hypothesized by H0. In case of composite null hypotheses, simply replace N_k π_{0kj} by the expected frequencies under H0 that are closest to the observed frequencies in terms of PDλ. Under H0 and Birch's [3] regularity conditions for joint multinomial models, any PDλ statistic is asymptotically chi-square distributed with

$$ df = \sum_{k=1}^{K} (J_k - 1) - S, \qquad (13) $$

where S is the number of free parameters estimated from the data. By contrast, if H0 is actually false and H1 holds with expected frequencies e_{1kj} in the j-th category of population k, the same statistic is approximately noncentral chi-square distributed with the same degrees of freedom and noncentrality parameter

$$ \gamma_{\lambda} = \frac{2}{\lambda(\lambda+1)} \sum_{k=1}^{K} \sum_{j=1}^{J_k} e_{1kj} \left[ \left(\frac{e_{1kj}}{e_{0kj}}\right)^{\lambda} - 1 \right]. \qquad (14) $$

In (14), e_{1kj} and e_{0kj} denote the expected frequencies under H1 and H0, respectively. In case of composite null hypotheses requiring parameter estimation, the e_{0kj} must be calculated from H0 parameters that minimize γλ. Researchers preferring to perform power analyses in terms of standardized effect size measures need separate effect size measures for each of the K populations. A suitable index for the k-th population (inspired by [5], Footnote 1) is

$$ w_{\lambda(k)} = \sqrt{ \frac{2}{\lambda(\lambda+1)} \sum_{j=1}^{J_k} \frac{e_{1kj}}{N_k} \left[ \left(\frac{e_{1kj}}{e_{0kj}}\right)^{\lambda} - 1 \right] }, \qquad (15) $$

which relates to the total effect size across the K populations as follows:

$$ w_{\lambda} = \sqrt{ \sum_{k=1}^{K} \frac{N_k}{N} \left(w_{\lambda(k)}\right)^{2} }. \qquad (16) $$

If we use (6) to compute the noncentrality parameter from wλ, we obtain the same value γλ as from (14). The power of the PDλ test is simply the area to the right of the critical value determined by α under the noncentral chi-square distribution with df given by (13) and noncentrality parameter given by (14). GPOWER can be used to compute this power value very conveniently.
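As a hedged illustration of (13) and (14) for K independent samples, the sketch below computes the power of a G² homogeneity test. The category probabilities and sample sizes are made-up values for illustration only, not figures from the text.

```python
# Power of a G^2 homogeneity test for K = 2 independent multinomial samples
# (lambda -> 0 case), following equations (13) and (14).
import numpy as np
from scipy.stats import chi2, ncx2

pi1 = np.array([[0.5, 0.3, 0.2],      # assumed H1 category probabilities, group 1
                [0.3, 0.3, 0.4]])     # assumed H1 category probabilities, group 2
Nk = np.array([80, 80])               # assumed group sample sizes
alpha = 0.05

e1 = pi1 * Nk[:, None]                            # expected frequencies under H1
pooled = e1.sum(axis=0) / Nk.sum()                # H0 fit: common category distribution
e0 = np.outer(Nk, pooled)                         # expected frequencies under H0
gamma = 2 * np.sum(e1 * np.log(e1 / e0))          # gamma_{lambda=0}, cf. (14)

K, J = pi1.shape
df = K * (J - 1) - (J - 1)                        # eq. (13) with S = J - 1 free parameters
power = ncx2.sf(chi2.ppf(1 - alpha, df), df, gamma)
print(f"df = {df}, gamma = {gamma:.2f}, power = {power:.3f}")
```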
Acknowledgment

The work on this entry has been supported by grants from the TransCoop Program of the Alexander von Humboldt Foundation and the Otto Selz Institute, University of Mannheim.

Note

1. The most recent versions of GPOWER can be downloaded free of charge for several computer platforms at http://www.psycho.uni-duesseldorf.de/aap/projects/gpower/
References

[1] Agresti, A. (1990). Categorical Data Analysis, Wiley, New York.
[2] Batchelder, W.H. & Riefer, D.M. (1999). Theoretical and empirical review of multinomial process tree modeling, Psychonomic Bulletin & Review 6, 57–86.
[3] Birch, M.W. (1964). A new proof of the Pearson-Fisher theorem, Annals of Mathematical Statistics 35, 817–824.
[4] Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences, 2nd Edition, Lawrence Erlbaum, Hillsdale.
[5] Erdfelder, E. & Bredenkamp, J. (1998). Recognition of script-typical versus script-atypical information: effects of cognitive elaboration, Memory & Cognition 26, 922–938.
[6] Erdfelder, E., Faul, F. & Buchner, A. (1996). GPOWER: a general power analysis program, Behavior Research Methods, Instruments, & Computers 28, 1–11.
[7] Macmillan, N.A. & Creelman, C.D. (1991). Detection Theory: A User's Guide, Cambridge University Press, Cambridge.
[8] Read, T.R.C. & Cressie, N.A.C. (1988). Goodness-of-Fit Statistics for Discrete Multivariate Data, Springer, New York.
EDGAR ERDFELDER, FRANZ FAUL AND AXEL BUCHNER
Power and Sample Size in Multilevel Linear Models
TOM A.B. SNIJDERS
Volume 3, pp. 1570–1573
in Encyclopedia of Statistics in Behavioral Science
ISBN-13: 978-0-470-86080-9    ISBN-10: 0-470-86080-4
Editors: Brian S. Everitt & David C. Howell
John Wiley & Sons, Ltd, Chichester, 2005
Power and Sample Size in Multilevel Linear Models

Power of statistical tests generally depends on sample size and other design aspects; on effect size or, more generally, parameter values; and on the level of significance. In multilevel models (see Linear Multilevel Models), however, there is a sample size for each level, defined as the total number of units observed for this level. For example, in a three-level study of pupils nested in classrooms nested in schools, there might be observations on 60 schools, a total of 150 classrooms, and a total of 3300 pupils. On average, in the data, each classroom then has 22 pupils, and each school contains 2.5 classrooms. What are the relevant sample sizes for power issues? If the researcher has the freedom to choose the sample sizes for a planned study, what are sensible guidelines? Power depends on the parameter being tested, and power considerations are different depending on whether the researcher focuses on, for example, testing a regression coefficient, a variance parameter, or is interested in the size of means of particular groups. In most studies, regression coefficients are of primary interest, and this article focuses on such coefficients. The cited literature gives methods to determine power and required sample sizes also for estimating parameters in the random part of the model. A primary qualitative issue is that, for testing the effect of a level-one variable, the level-one sample size (in the example, 3300) is of main importance; for testing the effect of a level-two variable it is the level-two sample size (150 in the example); and so on. The average cluster sizes (in the example, 22 at level two and 2.5 at level three) are not very important for the power of such tests. This implies that the sample size at the highest level is the main limiting characteristic of the design. Almost always, it will be more informative to have a sample of 60 schools with 3300 pupils than one of 30 schools also with 3300 pupils. A sample of 600 schools with a total of 3300 pupils would even be a lot better with respect to power, in spite of the low average number of students (5.5) sampled per school, but in practice, such a study would of course be much more expensive. A second qualitative issue is that for testing fixed regression
coefficients, small cluster sizes are not a problem. The low average number of 2.5 classrooms per school has in itself no negative consequences for the power of testing regression coefficients. What is limited by this low average cluster size is the power for testing random slope variances at the school level, that is, between-school variances of effects of classroom- or pupil-level variables; and the reliability of estimating those characteristics of individual schools, calculated from classroom variables, that differ strongly between classes. (It may be recalled that the latter characteristics of individual units will be estimated in the multilevel methodology by posterior means, also called empirical Bayes estimates; see Random Effects in Multivariate Linear Models: Prediction.) When quantitative insight is required in power for testing regression coefficients, it often is convenient to consider power as a consequence of the standard error of estimation. Suppose we wish to test the null hypothesis that a regression coefficient γ is 0, and for this coefficient we have an estimate γ̂, which is approximately normally distributed, with standard error s.e.(γ̂). The t-ratio γ̂/s.e.(γ̂) can be tested using a t distribution; if the sample size is large enough, a standard normal distribution can be used. The power will be high if the true value (effect size) of γ is large and if the standard error is small; and a higher level of significance (larger permitted probability of a type I error) will lead to a higher power. This is expressed by the following formula, which holds for a one-sided test, and where the significance level is indicated by α and the type II error probability, which is equal to 1 minus power, by β:

$$ \frac{\gamma}{\text{s.e.}(\hat{\gamma})} \approx z_{1-\alpha} + z_{1-\beta}, \qquad (1) $$
where z_{1−α} and z_{1−β} are the critical points of the standard normal distribution. For example, to obtain at significance level α = 0.05 a fairly high power of at least 1 − β = 0.80, we have z_{1−α} = 1.645 and z_{1−β} = 0.84, so that the ratio of true parameter value to standard error should be at least 1.645 + 0.84 ≈ 2.5. For two-sided tests, the approximate formula is

$$ \frac{\gamma}{\text{s.e.}(\hat{\gamma})} \approx z_{1-(\alpha/2)} + z_{1-\beta}, \qquad (2) $$
but in the two-sided case this formula holds only if the power is not too small (or, equivalently, the effect size is not too small), for example, 1 − β ≥ 0.3.
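As a small illustration of formulas (1) and (2), the following Python sketch (SciPy assumed available; the numerical values are hypothetical) converts a true coefficient value and its standard error into approximate one- and two-sided power.

```python
# Approximate power of a z test of H0: gamma = 0, given the true coefficient
# value and its standard error, following (1) and (2).
from scipy.stats import norm

def power_from_se(gamma, se, alpha=0.05, two_sided=False):
    z_alpha = norm.ppf(1 - alpha / 2) if two_sided else norm.ppf(1 - alpha)
    return norm.cdf(gamma / se - z_alpha)   # 1 - beta = Phi(gamma/se - z_{1-alpha})

print(round(power_from_se(0.25, 0.10), 3))                   # ratio 2.5 gives power ~ 0.80
print(round(power_from_se(0.25, 0.10, two_sided=True), 3))   # two-sided power is lower
```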
For some basic cases, explicit formulas for the estimation variances (i.e., squared standard errors) are given below. This gives a basic knowledge of, and feeling for, the efficiency of multilevel designs. These formulas can be used to compute required sample sizes. The formulas also underpin the qualitative issues mentioned above. A helpful concept for developing this feeling is the design effect, which indicates how the particular design chosen – in our case, the multilevel design – affects the standard error of the parameters. It is defined as

$$ \text{deff} = \frac{\text{squared standard error under this design}}{\text{squared standard error under standard design}}, \qquad (3) $$

where the 'standard design' is defined as a design using a simple random sample with the same total sample size at level one. (The determination of sample sizes under simple random sample designs is treated in the article Sample Size and Power Calculation.) If deff is greater than 1, the multilevel design is less efficient than a simple random sample design (with the same sample size); if it is less than 1, the multilevel design is more efficient. Since squared standard errors are inversely proportional to sample sizes, the required sample size for a multilevel design will be given by the sample size that would be required for a simple random sample design, multiplied by the design effect.

The formulas for the basic cases are given here (also see [11]) for two-level designs, where the cluster size is assumed to be constant, and denoted by n. These are good approximations also when the cluster sizes are variable but not too widely different. The number of level-two units is denoted m, so the total sample size at level one is mn. In all models mentioned below, the level-one residual variance is denoted var(Rij) = σ² and the level-two residual variance by var(U0j) = τ².

1. For estimating a population mean µ in the model

$$ Y_{ij} = \mu + U_{0j} + R_{ij}, \qquad (4) $$

the estimation variance is

$$ \operatorname{var}(\hat{\mu}) = \frac{n\tau^2 + \sigma^2}{mn}. \qquad (5) $$

The design effect is deff = 1 + (n − 1)ρ_I ≥ 1, where ρ_I is the intraclass correlation, defined by ρ_I = τ²/(σ² + τ²).

2. For estimating the regression coefficient γ1 of a level-one variable X1 in the model

$$ Y_{ij} = \gamma_0 + \gamma_1 X_{1ij} + \gamma_2 X_{2ij} + \cdots + \gamma_p X_{pij} + U_{0j} + R_{ij}, \qquad (6) $$

where it is assumed that X1 does not have a random slope and has zero between-group variation, that is, a constant group mean, and that it is uncorrelated with any other explanatory variables Xk (k ≥ 2), the estimation variance is

$$ \operatorname{var}(\hat{\gamma}_1) = \frac{\sigma^2}{mn\, s^2_{X1}}, \qquad (7) $$

where the within-group variance s²_{X1} of X1 also is assumed to be constant. The design effect here is deff = 1 − ρ_I ≤ 1.

3. For estimating the effect of a level-one variable X1 under the same assumptions except that X1 now does have a random slope,

$$ Y_{ij} = \gamma_0 + (\gamma_1 + U_{1j}) X_{1ij} + \gamma_2 X_{2ij} + \cdots + \gamma_p X_{pij} + U_{0j} + R_{ij}, \qquad (8) $$

where the random slope variance is τ₁², the estimation variance of the fixed effect is

$$ \operatorname{var}(\hat{\gamma}_1) = \frac{n\tau_1^2 s^2_{X1} + \sigma^2}{mn\, s^2_{X1}}, \qquad (9) $$

with design effect

$$ \text{deff} = \frac{n\tau_1^2 s^2_{X1} + \sigma^2}{\tau_1^2 s^2_{X1} + \tau^2 + \sigma^2}, \qquad (10) $$

which can be greater than or less than 1.

4. For estimating the regression coefficient of a level-two variable X1 in the model

$$ Y_{ij} = \gamma_0 + \gamma_1 X_{1j} + \gamma_2 X_{2ij} + \cdots + \gamma_p X_{pij} + U_{0j} + R_{ij}, \qquad (11) $$

where the variance of X1 is s²_{X1}, and X1 is uncorrelated with any other variables Xk (k ≥ 2), the estimation variance is

$$ \operatorname{var}(\hat{\gamma}_1) = \frac{n\tau^2 + \sigma^2}{mn\, s^2_{X1}}, \qquad (12) $$

and the design effect is deff = 1 + (n − 1)ρ_I ≥ 1.

This illustrates that multilevel designs sometimes are more, and sometimes less, efficient than simple random sample designs. In case 2, a level-one variable without between-group variation, the multilevel design is always more efficient. This efficiency of within-subject designs is a well-known phenomenon. For estimating a population mean (case 1) or the effect of a level-two variable (case 4), on the other hand, the multilevel design is always less efficient, and more seriously so as the cluster size and the intraclass correlation increase. In cases 2 and 3, the same type of regression coefficient is being estimated, but in case 3, variable X1 has a random slope, unlike in case 2. The difference in deff between these cases shows that the details of the multilevel dependence structure matter for these standard errors, and hence for the required sample sizes. It must be noted that what matters here is not how the researcher specifies the multilevel model, but the true model. If in reality there is a positive random slope variance but the researcher specifies a random intercept model without a random slope, then the true estimation variance still will be given by the formula above of case 3, but the standard error will be misleadingly estimated by the formula of case 2, usually a lower value. For more general cases, where there are several correlated explanatory variables, some of them having random slopes, such clear formulas are not available. Sometimes, a very rough estimate of required sample sizes can still be made on the basis of these formulas. For more generality, however, the program PinT (see below) can be used to calculate standard errors in rather general two-level designs. A general procedure for estimating power and standard errors for any parameters in arbitrary designs is by Monte Carlo simulation of the model and the estimates. This is described in [4, Chapter 10]. For further reading, general treatments can be found in [1, 2], [4, Chapter 10], [11], and [13, Chapter 10]. More specific sample size and design issues, often focusing on two-group designs, are treated in [3, 5, 6, 7, 8, 9, 10].
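The design-effect formulas for the four basic cases are easy to script. The sketch below (a minimal illustration, not PinT; all variance values and the simple-random-sample requirement are made-up assumptions) computes the design effects and applies the rule that the required level-one sample size equals the simple-random-sample requirement multiplied by deff.

```python
# Design effects for the basic two-level cases and the implied sample-size inflation.
def deff_mean_or_level2(n, rho):                 # cases 1 and 4: 1 + (n - 1) * rho_I
    return 1 + (n - 1) * rho

def deff_level1_fixed(rho):                      # case 2: 1 - rho_I
    return 1 - rho

def deff_level1_random_slope(n, tau1_sq, s2_x1, tau_sq, sigma_sq):   # case 3, eq. (10)
    return (n * tau1_sq * s2_x1 + sigma_sq) / (tau1_sq * s2_x1 + tau_sq + sigma_sq)

sigma_sq, tau_sq = 0.8, 0.2                      # assumed residual variances
rho = tau_sq / (sigma_sq + tau_sq)               # intraclass correlation rho_I = 0.2
n = 22                                           # average cluster size from the example above
N_srs = 400                                      # assumed requirement under simple random sampling

d1 = deff_mean_or_level2(n, rho)
print(round(d1, 2), round(N_srs * d1))           # cases 1 and 4: deff > 1, larger N needed
print(round(deff_level1_fixed(rho), 2), round(N_srs * deff_level1_fixed(rho)))
print(round(deff_level1_random_slope(n, 0.1, 1.0, tau_sq, sigma_sq), 2))
```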
Computer Programs

ACluster calculates required sample sizes for various types of cluster randomized designs, not only for continuous but also for binary and time-to-event outcomes, as described in [1]. See website www.update-software.com/.

OD (Optimal Design) calculates power and optimal sample sizes for testing treatment effects and variance components in multisite and cluster randomized trials with balanced two-group designs, and in repeated measurement designs, according to [8, 9, 10]. See website http://www.ssicentral.com/other/hlmod.htm.

PinT (Power in Two-level designs) calculates standard errors of regression coefficients in two-level designs, according to [12]. With extensive manual. See website http://stat.gamma.rug.nl/snijders/multilevel.htm.

RMASS2 calculates the sample size for a two-group repeated measures design, allowing for attrition, according to [3]. See website http://tigger.uic.edu/~hedeker/works.html.
References

[1] Donner, A. & Klar, N. (2000). Design and Analysis of Cluster Randomization Trials in Health Research, Arnold Publishing, London.
[2] Hayes, R.J. & Bennett, S. (1999). Simple sample size calculation for cluster-randomized trials, International Journal of Epidemiology 28, 319–326.
[3] Hedeker, D., Gibbons, R.D. & Waternaux, C. (1999). Sample size estimation for longitudinal designs with attrition: comparing time-related contrasts between two groups, Journal of Educational and Behavioral Statistics 24, 70–93.
[4] Hox, J.J. (2002). Multilevel Analysis, Techniques and Applications, Lawrence Erlbaum Associates, Mahwah.
[5] Moerbeek, M., van Breukelen, G.J.P. & Berger, M.P.F. (2000). Design issues for experiments in multilevel populations, Journal of Educational and Behavioral Statistics 25, 271–284.
[6] Moerbeek, M., van Breukelen, G.J.P. & Berger, M.P.F. (2001a). Optimal experimental designs for multilevel logistic models, The Statistician 50, 17–30.
[7] Moerbeek, M., van Breukelen, G.J.P. & Berger, M.P.F. (2001b). Optimal experimental designs for multilevel models with covariates, Communications in Statistics, Theory and Methods 30, 2683–2697.
[8] Raudenbush, S.W. (1997). Statistical analysis and optimal design for cluster randomized trials, Psychological Methods 2, 173–185.
[9] Raudenbush, S.W. & Liu, X.-F. (2000). Statistical power and optimal design for multisite randomized trials, Psychological Methods 5, 199–213.
[10] Raudenbush, S.W. & Liu, X.-F. (2001). Effects of study duration, frequency of observation, and sample size on power in studies of group differences in polynomial change, Psychological Methods 6, 387–401.
[11] Snijders, T.A.B. (2001). Sampling, in Multilevel Modelling of Health Statistics, Chap. 11, A. Leyland & H. Goldstein, eds, Wiley, Chichester, pp. 159–174.
[12] Snijders, T.A.B. & Bosker, R.J. (1993). Standard errors and sample sizes for two-level research, Journal of Educational Statistics 18, 237–259.
[13] Snijders, T.A.B. & Bosker, R.J. (1999). Multilevel Analysis: An Introduction to Basic and Advanced Multilevel Modeling, Sage Publications, London.
TOM A.B. SNIJDERS
Prediction Analysis of Cross-Classifications
KATHRYN A. SZABAT
Volume 3, pp. 1573–1579
in Encyclopedia of Statistics in Behavioral Science
ISBN-13: 978-0-470-86080-9    ISBN-10: 0-470-86080-4
Editors: Brian S. Everitt & David C. Howell
John Wiley & Sons, Ltd, Chichester, 2005
Prediction Analysis of Cross-Classifications

Prediction analysis [4–6, 8, 11–13, 15] is an approach to the analysis of cross-classified, cross-sectional, or longitudinal (see Longitudinal Data Analysis) data. It is a method for predicting events based on specification of predicted relations among qualitative variables and includes techniques for both stating and evaluating those event predictions. Event predictions are based on hypothesized relations among predictor (independent) and criterion (dependent) variables. Event predictions predict for each case what state of a criterion variable follows from a given state of a predictor variable. States are categories of qualitative variables. Hypotheses can be directed at relations among variables at a given point in time (as in cross-sectional designs) as well as at the extent to which those relations change over time (as in longitudinal designs). Prediction analysis is performed on sets of predictions. It allows for the evaluation of the overall set of hypothesized event predictions as well as the evaluation of each hypothesized event prediction in the set.
Examples

Table 1 reproduces a data set presented by Trautner, Geravi, and Nemeth [9] in their study of appearance-reality (AR) distinction and development of gender constancy (GC) understanding in children. The data can be used to evaluate the prediction that AR distinction is a necessary prerequisite of the understanding of GC. Performance (1 = failed, 2 = passed) on both an appearance-reality distinction task (ART) and a gender constancy test (GCT) was recorded for each child in the study. The prediction can be expressed as performance on the ART predicts performance on the GCT. Trautner et al. report that there was a strong association between ability to distinguish appearance from reality and GC understanding (Cohen kappa = 0.76, p < 0.0001).
Table 1. Performance on two tests: appearance–reality distinction task (ART) and gender constancy test (GCT). Reproduced from Trautner, H.M., Geravi, J. & Nemeth, R. (2003). Appearance-reality distinction and development of gender constancy understanding in children, International Journal of Behavioral Development 27(3), 275–283. Reprinted by permission of the International Society for the Study of Behavioral Development (http://www.tandf.co.uk/journals), Taylor & Francis, & Judit Gervai

                    X = ART
Y = GCT       Failed    Passed    Total
Failed          31         5        36
Passed           4        35        39
Total           35        40        75

(The two off-diagonal cells, Failed/Passed and Passed/Failed, are the shaded error cells referred to in the text.)
Table 2. Level on two tasks: mirror image (MI) and shape–detail (SD). Reproduced from Casey, M.B. (1986). Individual differences in selective attention among prereaders: a key to mirror-image confusions, Development Psychology 22, 58–66. Copyright 1986 by the American Psychological Association. Reprinted with permission of the APA and M.B. Casey

                     X = MI level
Y = SD level      1      2      3    Total
1                 9      1      0      10
2                 3      7      2      12
3                 1      4      1       6
Total            13     12      3      28
Kinsbourne’s maturational theory, the shape-detail problem would be predicted to show equivalent level effects with the mirror-image problem. Casey reports that findings support Kinsbourne’s selective attention theory.
Prediction Analysis: The Hildebrand et al. Approach

Prediction analysis of cross-classifications was formally introduced by Hildebrand, Laing, and Rosenthal [4–6]. Recognizing the limitations of classical cross-classification analysis techniques, Hildebrand et al. developed a prediction analysis approach as an alternative method. The prediction analysis approach is concerned with predicting a type of relation and
then measuring the success of that prediction. The method involves prediction logic, expressing predictions that relate qualitative variables and the del measure used to analyze data from the perspective of the prediction of interest. The prediction analysis method can be applied to one-to-one (one predictor state to one criterion state) and one-to-many (one predictor state to many criterion states) predictions about variables, to bivariate and multivariate settings, to predictions stated a priori (before the data are analyzed) and selected ex post (after the data are analyzed), to single observations (degree-1 predictions) and observation pairs (degree-2 predictions), and to pure (absolute predictions) and mixed (actuarial predictions) strategies. Thus, the prediction analysis approach should cover a wide range of research possibilities. Here, we present prediction analysis for bivariate, pure, and degree-1 predictions stated a priori. For comprehensive, detailed coverage of the Hildebrand et al. approach, readers are referred to the Hildebrand, Laing, and Rosenthal book, Prediction Analysis of Cross-classifications.
Specification of Prediction Analysis Hypotheses

The prediction analysis approach starts with the specification of event predictions about qualitative variables. The language for stating such predictions, called prediction logic, is defined by a formal analogy to elementary logic. A prediction logic proposition is stated precisely by specifying its domain and the set of error events it identifies. The domain of a prediction logic proposition includes all events in the cross-classification of the predictor and criterion qualitative variables. The set of error events contains events that are not consistent with a prediction. von Eye and Brandtstadter [11–13, 15] describe how to use statement calculus, from which prediction logic is derived, when formulating hypotheses in prediction analysis. Prediction hypotheses within a set of predictions are stated in terms of 'if-then' variable relationships. Predictor states are linked to criterion states using an implication operator (→), often read 'implies' or 'predicts.' Multiple predictors or multiple criteria are linked using the conjunction operator (∧), to be read 'and,' or the adjunction operator (∨), to be read 'or.' These logical operators can be applied to both dichotomous (two state) and polytomous (more than two state) variables.
A set of predictions, H, contains one or more hypothesized predictions. Each prediction within a set is called an elementary prediction. Elementary predictions can be made to link (a) a single predictor state to a single criterion state, (b) a pattern of predictor states to a single criterion state, (c) a single predictor state to a pattern of criterion states, or (d) a pattern of predictor states to a pattern of criterion states. Patterns of states can arise from the same variable or from different variables in the case of multiple predictors or multiple criteria [15]. Within a set of predictions, it is not necessary to make a specific prediction for each predictor state or pattern of states. However, predictions, whether elementary or as a set, must be nontautological (must not describe hypotheses that cannot be disconfirmed). In addition, a set of predictions cannot contain contradictory elementary predictions. Recall the Trautner et al. study. The set of elementary predictions for the hypothesized relation between performance on the ART and performance on the GCT, expressed using statement calculus, is as follows. Let X be the predictor, performance on the ART, with two states (1 = failed, 2 = passed). Let Y be the criterion, performance on the GCT, with two states (1 = failed, 2 = passed). Then the set of prediction hypotheses, H, is x1 → y1, x2 → y2. Recall the Casey study. The set of elementary predictions for the hypothesized agreement between the levels of the mirror image and shape-detail tasks, expressed using statement calculus, is as follows. Let X be the predictor, level on the mirror-image task, with three states (1 = nonlearner, 2 = instructed learner, 3 = spontaneous responder). Let Y be the criterion, level on the shape-detail task, with three states (1 = nonlearner, 2 = instructed learner, 3 = spontaneous responder). Then the set of hypothesized predictions, H , is x1 → y1, x2 → y2, x3 → y3.
A Proportionate Reduction in Error Measure for Evaluating Prediction Success

Hildebrand et al. use a PRE model to define del, the measure of prediction success. On the basis of the stated prediction, cells of a resultant cross-classification are designated as either predicted cells (cells that represent, for a given predictor state, the criterion state(s) that are predicted) or error cells (cells that represent, for a given predictor state, the criterion state(s) that are not predicted).
Defining w_{ij} as the error cell indicator (1 if cell ij is an error cell; 0 otherwise), the population measure of prediction success, ∇ (del), is calculated as

$$ \nabla = 1 - \frac{\sum_i \sum_j w_{ij} P_{ij}}{\sum_i \sum_j w_{ij} P_{i.} P_{.j}}, \qquad (1) $$
where P_{i.} and P_{.j} are marginal probabilities indicating the probability that an observation lies in the ith row and jth column, respectively, of the Y × X cross-classification, and P_{ij} is the cell probability representing the probability that an observation belongs to states Y_i and X_j. We define, above, an error weight value of 1 for all error cells. However, in situations where some errors are regarded as more serious or more important than others, differential weighting of errors is allowed. By equation (1), ∇ represents a PRE definition of prediction success. As such, '∇_P measures the PRE in the number of cases observed in the set of error cells for P relative to the number expected under statistical independence, given knowledge of the actual probability structure' [6]. The value of ∇ may vary in the range −∞ to 1. If ∇ is positive, then it measures the PRE afforded by the prediction. A negative value of ∇ indicates that the specified prediction is grossly incorrect about the nature of the relation. If the variables are statistically independent, then ∇ is zero (although the converse does not hold: ∇ = 0 does not imply that two variables are statistically independent). When comparing several competing theoretical predictions to determine the dominance of a particular prediction, one needs to consider precision in addition to the value of ∇. Precision is determined, to a large extent, by the number of criterion states that are predicted for given predictor states. The overall precision of a prediction, U, is defined as

$$ U = \sum_i \sum_j w_{ij} P_{i.} P_{.j}. \qquad (2) $$
Larger values of U represent greater precision (0 ≤ U ≤ 1). One prediction dominates another when it has higher prediction success and at least as great a prediction precision as the other, or higher prediction precision and at least as great a prediction success [6]. A thorough analysis of prediction success would consider the performance of each elementary prediction. The overall success of a prediction, ∇, can be
expressed as the weighted average of the success achieved by the elementary predictions, ∇_{.j}. That is,

$$ \nabla = \sum_j \nabla_{.j} \, \frac{P_{.j} \sum_i w_{ij} P_{i.}}{\sum_j P_{.j} \sum_i w_{ij} P_{i.}}. \qquad (3) $$

By equation (3), 'the greater the precision of the elementary prediction relative to the precision of the other elementary predictions, the greater the contribution of the elementary prediction's success to the overall measure of prediction success' [6]. Recall the Trautner et al. study. To measure the success of the prediction that performance on the ART predicts performance on the GCT, we identify the error cells, indicated as shaded cells in Table 1. Using (1), the calculated measure of overall prediction success, ∇, is 0.76. This value indicates that a PRE of 76.0% is achieved in applying the prediction to children with known ART performance over that which is expected when the prediction is applied to a random selection of children whose ART performance is unknown. Furthermore, the calculated measure of prediction success for the first and second elementary predictions is 0.7804 and 0.7395, respectively. When weighted by their respective precision weights, one sees that each elementary prediction contributes about equally to the overall prediction success (0.37979 and 0.37961, respectively). We acknowledge that the value of del (0.76) is the same as the value of Cohen's kappa (k) reported in the Trautner et al. study. Hildebrand et al. note the equivalence of kappa to del for any square table given the proposition predicts the main diagonal in the Y × X cross-classification. von Eye and Brandtstadter [11] present an alternative measure for evaluating predictions. Their measure of predictive efficiency (PE) focuses on predicted cells as opposed to error cells and, by definition, excludes events for which no prediction was made. 'Del and PE are equivalent as long as a set of predictions contains only nontautological and admissible partial (elementary) predictions' [12].
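The del calculations for the Trautner et al. data can be reproduced with a short NumPy sketch (a hedged illustration of equations (1)–(3), assuming NumPy is available; it is not the authors' software).

```python
# del (overall and elementary) for the 2x2 Trautner et al. data in Table 1:
# rows = GCT (failed, passed), columns = ART (failed, passed).
import numpy as np

counts = np.array([[31, 5],    # GCT failed
                   [ 4, 35]])  # GCT passed
N = counts.sum()
P = counts / N                           # cell proportions P_ij
Pi, Pj = P.sum(axis=1), P.sum(axis=0)    # row and column marginals

# Error-cell indicator for the prediction set {x1 -> y1, x2 -> y2}:
# cells (GCT passed, ART failed) and (GCT failed, ART passed) are error cells.
w = np.array([[0, 1],
              [1, 0]])

expected_err = np.sum(w * np.outer(Pi, Pj))      # denominator of (1); also precision U, eq. (2)
del_overall = 1 - np.sum(w * P) / expected_err   # eq. (1), about 0.76

# Elementary predictions: del for each predictor state (column), cf. eq. (3)
del_elementary = 1 - (w * P).sum(axis=0) / (w * np.outer(Pi, Pj)).sum(axis=0)

print(round(del_overall, 4), np.round(del_elementary, 4))   # ~0.76 and ~(0.78, 0.74)
```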
Statistical Inference In the previous section, we presented ∇ as a population parameter. That is, we evaluated predictions with observations that represent an entire population. In practice, researchers almost always work with a sample taken from the population. As such, inference
is of concern. Earlier, we mentioned that the focus of this presentation is on predictions stated a priori. The distinction between a priori and ex post predictions is critical here in the statistical inference process because the statistical theory is different, dependent on whether the prediction is stated before or after the data are analyzed. We continue to focus on a priori predictions. A discussion on ex post analysis can be found in Hildebrand [6] and Szabat [8]. Hildebrand et al. [6] present bivariate statistical inference under three sample schemes that cover the majority of research applications. Given the specific sampling condition, the sample estimate of del is defined by substituting observed sample proportions for any unknown population probabilities. The sampling distributions of del estimators are based on standard methods of asymptotic theories; the estimators, ∇-hat, are approximately normally distributed. Hildebrand et al. [6] provide formulas for the approximate variance of the del estimators for each of the three sampling schemes. The variance calculations are quite tedious by hand, but can be programmed for computer calculation quite easily (see [14, 10]). Given the sampling distribution of the del estimators, it is possible to calculate P values for hypothesis testing. The form of the test statistic for testing the hypothesis that ∇ > 0 is

$$ Z = \frac{\hat{\nabla}}{[\widehat{\operatorname{var}}(\hat{\nabla})]^{1/2}}. \qquad (4) $$

The standard condition for applying the normal distribution to the binomial distribution, 5 ≤ nU(1 − ∇) ≤ n − 5, is the essential condition for the application of the normal approximation. Recall the Trautner et al. study. Suppose we hypothesize the true value of ∇ to be greater than zero. Under sampling condition S2, the value for the estimate of del, 0.76, yields a test statistic, Z = 10.43, which is compared with the normal probability table value, 1.65. We conclude at the 5% significance level that the true value of ∇ is statistically greater than zero.

Prediction Analysis: The von Eye et al. Approach

von Eye, Brandtstadter, and Rovine [12, 13] present nonstandard log-linear modeling as an alternative to the Hildebrand et al. approach. Rather than measuring prediction success of a set of predictions via a percent reduction in error measure, the von Eye et al. approach models the prediction hypothesis and evaluates the extent to which the model adequately describes the data. This is achieved via nonhierarchical log-linear models that specify the relationship between predictors and criteria (base model) as well as specify the event predictions. This approach to prediction analysis allows for consideration of a wide range of models and prediction hypotheses. Model specification results from an understanding of the relation between logical formulation of variable relationship and base model for estimation of expected cell frequencies. For an extensive coverage of different log-linear base models for different prediction models, the reader is referred to Statistical Analysis of Categorical Data in the Social and Behavioral Sciences [15]. Here, we present prediction analysis for one type of base model, a base model that is saturated in both the predictors and criteria, and, like all base models for prediction analysis, assumes independence between predictors and criteria. This base model corresponds to the standard hierarchical log-linear model that underlies the estimation of expected cell frequencies in the Hildebrand et al. approach. In addition, the model involves one predictor and one criterion.

Specification of Prediction Analysis Hypotheses

As is the case for prediction analysis in the Hildebrand et al. tradition, the von Eye et al. approach starts with the formulation of prediction hypotheses. One uses prediction logic or statement calculus to specify a set of nontautological and noncontradictory elementary predictions. The process is described above.
The Nonstandard Log-linear Model for Evaluating Prediction Success

von Eye et al. use nonstandard log-linear modeling to evaluate prediction success. On the basis of the set of predictions, cells of the resultant cross-classification are identified as predicted or 'hit' cells (cells that confirm the prediction), error cells (cells that disconfirm the prediction), or irrelevant cells (cells that neither confirm nor disconfirm the prediction). A log-linear model is then specified. The form of the log-linear model to test prediction analysis hypotheses is derived by applying special design matrices to the general form of the log-linear model,

$$ \log(F) = X\lambda + e, \qquad (5) $$
where F is the vector of frequencies, X is the indicator matrix, λ is the parameter vector, and e is the residual vector. The specially designed indicator matrix needed for prediction analysis contains both indicator variables required for the model to be saturated in both predictors and criteria, and indicator variables reflecting prediction hypotheses. Indicator variables that reflect the prediction hypotheses are defined as follows: 1 if the event confirms the elementary prediction, −1 if the event disconfirms the elementary prediction, and 0 otherwise. This nonstandard log-linear model approach explicitly contrasts predicted cells and error cells, which allows one to discriminate X to Y and Y to X predictions, a feature not possible with the Hildebrand et al. approach. Once the model is specified, estimation of the nonstandard log-linear model parameters and expected cell frequencies occurs. There are a few major statistical software packages that can assist in this effort, namely S+, SYSTAT, LEM, SPSS, and CDAS (see [15] and Software for Statistical Analyses). The model is then evaluated via statistical testing. Recall the Casey study. The design matrix for nonstandard log-linear modeling is given in Table 3. The design matrix contains two groups of indicator variables. The first four vectors contain two main effect variables each for the predictor and criterion variable. The next three vectors are the three coding variables that result for the three elementary predictions in the set.
Table 3. Design matrix for nonstandard log-linear modeling: Casey study

Cell XY        Design matrix
11         1    0    1    0    1    0    0
12         0    1    1    0   -1    0    0
13        -1   -1    1    0   -1    0    0
21         1    0    0    1    0   -1    0
22         0    1    0    1    0    1    0
23        -1   -1    0    1    0   -1    0
31         1    0   -1   -1    0    0   -1
32         0    1   -1   -1    0    0   -1
33        -1   -1   -1   -1    0    0    1
Statistical Inference

As mentioned above, statistical testing is used to evaluate the model. Statistical evaluation occurs at two levels: (a) statistical evaluation of the overall hypothesis, via a goodness of fit test (likelihood ratio χ² or Pearson χ²; see Goodness of Fit for Categorical Variables), and (b) statistical evaluation of each elementary hypothesis, via a test for a significant parameter effect (z test). According to von Eye et al. [12], if statistical evaluation indicates overall model fit and at least one elementary prediction is successful (statistically significant), then the prediction set is considered successful. If statistical evaluation indicates some elementary predictions are successful, but the model does not fit, then parameter estimation is shaky. If statistical evaluation indicates overall model fit and not one of the elementary predictions is successful, then one assumes that the attempt at parameterization of deviations from independence failed. With this approach, the number of elementary predictions is more severely limited than with the Hildebrand et al. approach. There are no more than c − X − Y + 1 degrees of freedom available for prediction analysis given c cells from X predictor and Y criterion states. Given a 2 × 2 table with 4 cells, one has 4 − 2 − 2 + 1 = 1 degree of freedom left for prediction analysis; 'This excludes these tables from PA using the present approach because inserting an indicator variable implies a saturated model' [12]. Recall the Casey study. Statistical evaluation shows that the log-linear base model does not adequately describe the data (LR χ² = 13.742, p = 0.008 and Pearson χ² = 12.071, p = 0.017). When the parameters for the prediction hypotheses are added to the base model, statistical results indicate a good fit (LR χ² = 0.273, p = 0.601 and Pearson χ² = 0.160, p = 0.690). However, only one of the parameters (the first one) for the prediction analysis part of the model is statistically significant (z = 2.665, p = 0.004). Still, statistical evaluation indicates overall model fit and at least one elementary prediction is successful; therefore, we consider the prediction set successful.
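A hedged sketch of how such a nonstandard log-linear model can be fitted follows, using a Poisson GLM in statsmodels (an assumption; it is not the software used for the entry) with the Table 3 design matrix plus an intercept. The deviances should be close to the goodness-of-fit values quoted above.

```python
# Refit the Casey example: base model (main effects only) versus full model
# (main effects plus the three prediction indicators of Table 3).
import numpy as np
import statsmodels.api as sm

# Observed frequencies for cells XY = 11, 12, 13, ..., 33 (X = MI level, Y = SD level)
y = np.array([9, 3, 1, 1, 7, 4, 0, 2, 1])

design = np.array([
    #  Y1  Y2   X1  X2   p1  p2  p3
    [  1,  0,   1,  0,   1,  0,  0],   # cell 11
    [  0,  1,   1,  0,  -1,  0,  0],   # cell 12
    [ -1, -1,   1,  0,  -1,  0,  0],   # cell 13
    [  1,  0,   0,  1,   0, -1,  0],   # cell 21
    [  0,  1,   0,  1,   0,  1,  0],   # cell 22
    [ -1, -1,   0,  1,   0, -1,  0],   # cell 23
    [  1,  0,  -1, -1,   0,  0, -1],   # cell 31
    [  0,  1,  -1, -1,   0,  0, -1],   # cell 32
    [ -1, -1,  -1, -1,   0,  0,  1],   # cell 33
], dtype=float)

base = sm.GLM(y, sm.add_constant(design[:, :4]), family=sm.families.Poisson()).fit()
full = sm.GLM(y, sm.add_constant(design), family=sm.families.Poisson()).fit()

# Deviances are the likelihood-ratio goodness-of-fit statistics (about 13.7 and 0.27);
# full.pvalues contains the z tests of the individual prediction parameters.
print(round(base.deviance, 3), round(full.deviance, 3))
print(np.round(full.params, 3))
```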
Discussion

Above we present two approaches to prediction analysis of cross-classifications. In both approaches, formulation of prediction hypotheses is the same. Both
approaches evaluate sets of elementary predictions statistically, and both base evaluation of prediction success on deviations from an assumption of some form of independence between predictors and criteria [12]. The two approaches differ, however, with respect to hypotheses tested (that is, the meaning assigned to the link between predictors and criteria), model independence assumption, and method applied for statistical evaluation. Two other prediction approaches warrant mention. Hubert [7] proposed use of matching models, instead of log-linear models, for evaluating prediction success. The approach provides an alternative method for constructing hypothesis tests that may require very cumbersome formulas; therefore, the approach is not presented here. Goodman and Kruskal [2, 3] proposed quasi-independence log-linear modeling for prediction analysis. In this approach, error cells are declared structural zeros, which result in loss of degrees of freedom. This severely limits the type of hypotheses one can test; therefore, the approach is not presented here.
Summary

Researchers in the social and behavioral sciences are often faced with investigations that concern variables measured in terms of discrete categories. The basic research design, whether cross-sectional or longitudinal in nature, for such studies involves the cross-classification of each subject with respect to relevant qualitative variables. Prediction analysis of cross-classifications approaches, as developed by Hildebrand et al. and von Eye et al., provide a data analysis tool that allows researchers to statistically test custom-tailored sets of hypothesized predictions about the relation between such variables. Prediction analysis is a viable, valuable tool that researchers in the social and behavioral sciences can use in a wide range of research settings.

References

[1] Casey, M.B. (1986). Individual differences in selective attention among prereaders: a key to mirror-image confusions, Development Psychology 22, 58–66.
[2] Goodman, L.A. & Kruskal, W.H. (1974a). Empirical evaluation of formal theory, Journal of Mathematical Sociology 3, 187–196.
[3] Goodman, L.A. & Kruskal, W.H. (1974b). More about empirical evaluation of formal theory, Journal of Mathematical Sociology 3, 211–213.
[4] Hildebrand, D.K., Laing, J.D. & Rosenthal, H. (1974a). Prediction logic: a method for empirical evaluation for formal theory, Journal of Mathematical Sociology 3, 163–185.
[5] Hildebrand, D.K., Laing, J.D. & Rosenthal, H. (1974b). Prediction logic and quasi-independence in empirical evaluation for formal theory, Journal of Mathematical Sociology 3, 197–209.
[6] Hildebrand, D.K., Laing, J.D. & Rosenthal, H. (1977). Prediction Analysis of Cross-Classifications, Wiley, New York.
[7] Hubert, L.J. (1979). Matching models in the analysis of cross-classifications, Psychometrika 44, 21–41.
[8] Szabat, K.A. (1990). Prediction analysis, in Statistical Methods in Longitudinal Research, Vol. 2, A. von Eye, ed., Academic Press, San Diego, pp. 511–544.
[9] Trautner, H.M., Geravi, J. & Nemeth, R. (2003). Appearance-reality distinction and development of gender constancy understanding in children, International Journal of Behavioral Development 27(3), 275–283.
[10] von Eye, A. (1997). Prediction analysis program for 32-bit operation systems, Methods for Psychological Research-Online 2, 1–3.
[11] von Eye, A. & Brandtstadter, J. (1988). Formulating and testing developmental hypotheses using statement calculus and nonparametric statistics, in Life-Span Development and Behavior, Vol. 8, P.B. Bates, D. Featherman & R.M. Lerner, eds, Lawrence Erlbaum, Hillsdale, pp. 61–97.
[12] von Eye, A., Brandtstadter, J. & Rovine, M.J. (1993). Models for prediction analysis, Journal of Mathematical Sociology 18, 65–80.
[13] von Eye, A., Brandtstadter, J. & Rovine, M.J. (1998). Models for prediction analysis in longitudinal research, Journal of Mathematical Sociology 22, 355–371.
[14] von Eye, A. & Krampen, G. (1987). BASIC programs for prediction analysis of cross-classifications, Educational and Psychological Measurement 47, 141–143.
[15] von Eye, A. & Niedermeier, K.E. (1999). Statistical Analysis of Longitudinal Categorical Data in the Social and Behavioral Sciences, Lawrence Erlbaum, Hillsdale.
KATHRYN A. SZABAT
Prevalence
HANS GRIETENS
Volume 3, pp. 1579–1580
in Encyclopedia of Statistics in Behavioral Science
ISBN-13: 978-0-470-86080-9    ISBN-10: 0-470-86080-4
Editors: Brian S. Everitt & David C. Howell
John Wiley & Sons, Ltd, Chichester, 2005
Prevalence

Prevalence is defined as the proportion of individuals in a population who have a particular disease. It is computed by dividing the total number of cases by the total number of individuals in the population. For instance, the prevalence rate per 1000 is calculated as follows:

$$ \text{prevalence rate per 1000} = \frac{\text{number of cases ill at one point in time}}{\text{size of the population exposed to risk at that time}} \times 1000. $$

Prevalence rates can be studied in general or in clinical populations and have to be interpreted as morbidity rates. In a general population study, the rates can be considered as base rates or baseline data. Studying the prevalence of diseases is a main aim of epidemiology, a subdiscipline of social medicine [6, 7], which has found its way into the field of psychiatry and behavioral science [5]. In these fields, prevalence refers to the proportion of individuals in a population who have a specified mental disorder (e.g., major depressive disorder) or who manifest specific behavioral problems (e.g., physical aggression). In a prevalence study, individuals are assessed at time T to determine whether they are manifesting a certain outcome (e.g., disease, mental disorder, behavioral problem) at that time [4]. Prevalences can easily be estimated by a single survey. Such surveys are useful to determine the needs for services in a certain area. Further, they permit study of variations in the prevalence of certain outcomes over time and place [2]. Prevalence rates are determined by the incidence and the duration of a disease. If one intends to determine whether an individual is at risk for a certain disease, incidence studies are to be preferred over prevalence studies. Prevalence rates may be distorted by mortality factors, treatment characteristics, and effects of preventive measures. According to the period of time that is taken into account, three different forms of prevalence can be defined: (1) point prevalence is the prevalence of a disease at a certain point in time, (2) period prevalence is the prevalence at any time during a certain time period, and (3) lifetime prevalence refers to the presence of a disease at any time during a person's life. Most prevalence studies on behavioral problems
or mental disorders present period prevalence rates. This is, for instance, the case in all studies using the rating scales of Achenbach’s System of Empirically Based Assessment [1] to measure behavior problems in 1.5-to-18-year old children. Depending on the child’s age, these rating scales cover problem behaviors within a two-to-six month period. Examples of studies on adolescents and adults using lifetime prevalence rates of psychiatric disorders are the Epidemiological Catchment Area Study [8] and the National Comorbidity Survey [3], both conducted in the United States. Prevalences rely on the identification of ‘truly’ disordered individuals. The identification of true cases is only possible by using highly predictive diagnostic criteria. To measure these criteria, one needs to have instruments that are both highly sensitive (with the number of false negatives being as low as possible) and highly specific (with the number of false positives being as low as possible).
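As a purely illustrative arithmetic example with hypothetical figures (not data from any of the cited studies), 150 cases observed at one point in time in a population of 50,000 at risk would give a point prevalence of

$$ \frac{150}{50{,}000} \times 1000 = 3 \text{ cases per } 1000. $$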
References

[1] Achenbach, T.M. & Rescorla, L.E. (2001). Manual for the ASEBA School-Age Forms & Profiles, University of Vermont, Research Center for Children, Youth, & Families, Burlington.
[2] Cohen, P., Slomkowski, C. & Robins, L.N. (1999). Historical and Geographical Influences on Psychopathology, Lawrence Erlbaum, Mahwah.
[3] Kessler, R.C., McGonagle, K.A., Zhao, S., Nelson, C.B., Hughes, M., Eshleman, S., Wittchen, H.U. & Kendler, K.S. (1994). Lifetime and 12-month prevalence of DSM-III-R psychiatric disorders in the United States: results from the national comorbidity survey, Archives of General Psychiatry 51, 8–19.
[4] Kraemer, H.C., Kazdin, A.E., Offord, D.R., Kessler, R.C., Jensen, P.S. & Kupfer, D.J. (1997). Coming to terms with the terms of risk, Archives of General Psychiatry 54, 337–343.
[5] Kramer, M. (1957). A discussion of the concept of incidence and prevalence as related to epidemiological studies of mental disorders, Epidemiology of Mental Disease 47, 826–840.
[6] Lilienfeld, M. & Stolley, P.D. (1994). Foundations of Epidemiology, Oxford University Press, New York.
[7] MacMahon, B. & Trichopoulos, D. (1996). Epidemiology Principles and Methods, Little, Brown and Company, Boston.
[8] Robins, L.N. & Regier, D.A., eds (1991). Psychiatric Disorders in America, Free Press, New York.
HANS GRIETENS
Principal Component Analysis
IAN JOLLIFFE
Volume 3, pp. 1580–1584
in Encyclopedia of Statistics in Behavioral Science
ISBN-13: 978-0-470-86080-9    ISBN-10: 0-470-86080-4
Editors: Brian S. Everitt & David C. Howell
John Wiley & Sons, Ltd, Chichester, 2005
Principal Component Analysis

Large datasets often include measurements on many variables. It may be possible to reduce the number of variables considerably while still retaining much of the information in the original dataset. A number of dimension-reducing techniques exist for doing this, and principal component analysis is probably the most widely used of these. Suppose we have n measurements on a vector x of p random variables, and we wish to reduce the dimension from p to q. Principal component analysis does this by finding linear combinations, a1′x, a2′x, ..., aq′x, called principal components, that successively have maximum variance for the data, subject to being uncorrelated with previous ak′x's. Solving this maximization problem, we find that the vectors a1, a2, ..., aq are the eigenvectors of the covariance matrix, S, of the data, corresponding to the q largest eigenvalues. These eigenvalues give the variances of their respective principal components, and the ratio of the sum of the first q eigenvalues to the sum of the variances of all p original variables represents the proportion of the total variance in the original dataset, accounted for by the first q principal components. This apparently simple idea actually has a number of subtleties, and a surprisingly large number of uses. It was first presented in its algebraic form by Hotelling [7] in 1933, though Pearson [14] had given a geometric derivation of the same technique in 1901. Following the advent of electronic computers, it became feasible to use the technique on large datasets, and the number and varieties of applications expanded rapidly. Currently, more than 1000 articles are published each year with principal component analysis, or the slightly less popular terminology, principal components analysis, in keywords or title. Henceforth, in this article, we use the abbreviation PCA, which covers both forms. As well as numerous articles, there are two comprehensive general textbooks [9, 11] on PCA, and even whole books on subsets of the topic [3, 4].
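A minimal sketch of this eigendecomposition approach follows (Python with NumPy assumed; the data are random placeholder values, not the test scores analyzed in this entry, and in practice a library routine would usually be used).

```python
# PCA via eigendecomposition of the covariance matrix of the data.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 10))           # placeholder data: n = 150 cases, p = 10 variables

Xc = X - X.mean(axis=0)                  # centre each variable
S = np.cov(Xc, rowvar=False)             # p x p covariance matrix
eigvals, eigvecs = np.linalg.eigh(S)     # eigh returns eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

q = 2
scores = Xc @ eigvecs[:, :q]             # scores on the first q principal components
explained = eigvals[:q].sum() / eigvals.sum()   # proportion of total variance accounted for
print(np.round(eigvecs[:, 0], 2), round(explained, 3))

# Standardizing each column of X to zero mean and unit variance before these steps
# gives the correlation-based PCA discussed later in this entry.
```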
An Example As an illustration, we use an example that has been widely reported in the literature, and which is
originally due to Yule et al. [17]. The data consist of scores between 0 and 20 for 150 children aged 4½ to 6 years from the Isle of Wight, on 10 subtests of the Wechsler Pre-School and Primary Scale of Intelligence. Five of the tests were 'verbal' tests and five were 'performance' tests. Table 1 gives the vectors a_1, a_2 that define the first two principal components for these data. The first component is a linear combination of the 10 scores with roughly equal weight (max. 0.36, min. 0.23) given to each score. It can be interpreted as a measure of the overall ability of a child to do well on the full battery of 10 tests, and represents the major (linear) source of variability in the data. On its own, it accounts for 48% of the original variability. The second component contrasts the first five scores on the verbal tests with the five scores on the performance tests. It accounts for a further 11% of the total variability. The form of this second component tells us that once we have accounted for overall ability, the next most important (linear) source of variability in the test scores is between those children who do well on the verbal tests relative to the performance tests, and those children whose test score profile has the opposite pattern.

Table 1  Vectors of coefficients for the first two principal components for data from [17]

Variable   a_1      a_2
x1         0.34     0.39
x2         0.34     0.37
x3         0.35     0.10
x4         0.30     0.24
x5         0.34     0.32
x6         0.27    −0.24
x7         0.32    −0.27
x8         0.30    −0.51
x9         0.23    −0.22
x10        0.36    −0.33
Covariance or Correlation In our introduction, we talked about maximizing variance and eigenvalues/eigenvectors of a covariance matrix. Often, a slightly different approach is adopted in order to avoid two problems. If our p variables are measured in a mixture of units, then it is difficult to interpret the principal components. What do we mean by a linear combination of weight, height, and
temperature, for example? Furthermore, if we measure temperature and weight in degrees Fahrenheit and pounds respectively, we get completely different principal components from those obtained from the same data but using degrees Celsius and kilograms. To avoid this arbitrariness, we standardize each variable to have zero mean and unit variance. Finding linear combinations of these standardized variables that successively maximize variance, subject to being uncorrelated with previous linear combinations, leads to principal components defined by the eigenvalues and eigenvectors of the correlation matrix, rather than the covariance matrix of the original variables. When all variables are measured in the same units, covariance-based PCA may be appropriate, but even here, there can be circumstances in which such analyses are uninformative. This occurs when a few variables have much larger variances than the remainder. In such cases, the first few components are dominated by the high-variance variables and tell us nothing that could not have been deduced by inspection of the original variances. There are certainly circumstances where covariance-based PCA is of interest, but they are not common. Most PCAs encountered in practice are correlation-based. Our example is a case where either approach would be appropriate. The results given above are based on the correlation matrix, but because the variances of all 10 tests are similar, results from a covariance-based analysis would be little different.
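Operationally, a correlation-based PCA is simply a covariance-based PCA applied to the standardized variables, as the following illustrative sketch makes explicit (the function name is ours):

```python
import numpy as np

def correlation_pca(X):
    """PCA of the correlation matrix: standardize each variable to zero mean
    and unit variance, then proceed exactly as with a covariance matrix."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # standardized data
    R = np.corrcoef(X, rowvar=False)                   # correlation matrix
    eigvals, eigvecs = np.linalg.eigh(R)
    order = np.argsort(eigvals)[::-1]                  # largest variance first
    return eigvals[order], eigvecs[:, order], Z @ eigvecs[:, order]
```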
How Many Components?

We have talked about q principal components accounting for most of the variation in the p variables. What do we mean by 'most', and, more generally, how do we decide how many components to keep? There is a large literature on this topic – see, for example, [11, Chapter 6]. Perhaps the simplest procedure is to set a threshold, say 80%, and stop when the first q components account for a percentage of total variation greater than this threshold. In our example, the first two components accounted for only 59% of the variation. We would usually want more than this – 70 to 90% are the usual sort of values, but it depends on the context of the dataset, and can be higher or lower. Other techniques are based on the values of the eigenvalues (for example, Kaiser's rule [13]) or on the differences between consecutive eigenvalues (the scree graph [1]). Some of these simple ideas, as well as more sophisticated ones [11, Chapter 6], have been borrowed from factor analysis. This is unfortunate because the different objectives of PCA and factor analysis (see below for more on this) mean that, typically, fewer dimensions should be retained in factor analysis than in PCA, so the factor analysis rules are often inappropriate. It should also be noted that, although it is usual to discard low-variance principal components, they can sometimes be useful in their own right, for example in finding outliers [11, Chapter 10] and in quality control [9].
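The threshold rule just described is straightforward to apply once the eigenvalues are available. The following is a minimal sketch (the function name and the 80% default are illustrative only):

```python
import numpy as np

def n_components_for_threshold(eigenvalues, threshold=0.80):
    """Smallest q whose first q eigenvalues account for at least `threshold`
    of the total variance (the percentage-of-variation rule described above)."""
    l = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]  # descending
    cumulative = np.cumsum(l) / l.sum()                      # cumulative proportion
    return int(np.searchsorted(cumulative, threshold) + 1)
```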
Normalization Constraints

Given a principal component a_k′x, we can multiply it by any constant and not change its interpretation. To solve the maximization problem that leads to principal components, we need to impose a normalization constraint, a_k′a_k = 1. Having found the components, we are free to renormalize by multiplying a_k by some constant. At least two alternative normalizations can be useful. One that is sometimes encountered in PCA output from computer software is a_k′a_k = l_k, where l_k is the kth eigenvalue (the variance of the kth component). With this normalization, the jth element of a_k is the correlation between the jth variable and the kth component for correlation-based PCA. The normalization a_k′a_k = 1/l_k is less common, but can be useful in some circumstances, such as finding outliers.

Confusion with Factor Analysis

It was noted above that there is much confusion between principal component analysis and factor analysis. This is partially caused by a number of widely used software packages treating PCA as a special case of factor analysis, which it most certainly is not. There are several technical differences between PCA and factor analysis [10], but the most fundamental difference is that factor analysis explicitly specifies a model relating the observed variables to a smaller set of underlying unobservable factors. Although some authors [2, 16] express PCA in the framework of a model, its main application is as a descriptive, exploratory technique, with no thought of an underlying model. This descriptive nature means that distributional assumptions are unnecessary to apply PCA in its usual form. It can be used, although an element of caution may be needed in interpretation, on discrete, and even binary, data, as well as continuous variables. One notable feature of factor analysis is that it is generally a two-stage procedure; having found an initial solution, it is rotated towards simple structure (see Factor Analysis: Exploratory). The purpose of factor rotation is to make the coefficients or loadings relating variables to factors as simple as possible, in the sense that they are either close to zero or far from zero, with few intermediate loadings. This idea can be borrowed and used in PCA; having decided to keep q principal components, we may rotate within the q-dimensional subspace defined by the components in a way that makes the axes as easy as possible to interpret. This is one of a number of techniques that attempt to simplify the results of PCA by postprocessing them in some way, or by replacing PCA with a modified technique [11, Chapter 11].
Uses of Principal Component Analysis The basic use of PCA is as a dimension-reducing technique whose results are used in a descriptive/exploratory manner, but there are many variations on this central theme. Because the ‘best’ two(or three-) dimensional representation of a dataset in a least squares sense (see Least Squares Estimation) is given by a plot of the first two- (or three-) principal components, the components provide a ‘best’ low-dimensional graphical display of the data. A plot of the components can be augmented by plotting variables as well as observations on the same diagram, giving a biplot [6]. PCA is often used as the first step, reducing dimensionality before undertaking another multivariate technique such as cluster analysis (see Cluster Analysis: Overview) or discriminant analysis. Principal components can also be used in multiple linear regression in place of the original variables in order to alleviate problems with multicollinearity [11, Chapter 8]. Several dimension-reducing techniques, such as projection pursuit [12] and independent component analysis [8], which may be viewed as alternatives to PCA, nevertheless, suggest preprocessing the data using PCA in order to reduce dimensionality, before proceeding to the technique of interest. As already noted, there are also occasions when lowvariance components may be of interest.
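As an illustration of the regression use just mentioned, a principal components regression replaces the (possibly collinear) original predictors by the first q component scores. The sketch below is only a rough illustration under stated assumptions (centered predictors, intercept handled separately), not a full treatment of the method:

```python
import numpy as np

def pc_regression(X, y, q):
    """Regress y on the first q principal component scores of X and map the
    coefficients back to the scale of the original (centered) variables."""
    Xc = X - X.mean(axis=0)
    U, L, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = Xc @ Vt[:q].T                               # first q component scores
    gamma, *_ = np.linalg.lstsq(scores, y - y.mean(), rcond=None)
    beta = Vt[:q].T @ gamma                              # coefficients for centered X
    return beta
```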
Extensions to Principal Component Analysis PCA has been extended in many ways. For example, one restriction of the technique is that it is linear. A number of nonlinear versions have, therefore, been suggested. These include the ‘Gifi’ [5] approach to multivariate analysis, and various nonlinear extensions that are implemented using neural networks [3]. Another area in which many variations have been proposed is when the data are time series, so that there is dependence between observations as well as between variables [11, Chapter 12]. A special case of this occurs when the data are functions, leading to functional data analysis [15]. This brief review of extensions is by no means exhaustive, and the list continues to grow – see [9, 11].
References

[1] Cattell, R.B. (1966). The scree test for the number of factors, Multivariate Behavioral Research 1, 245–276.
[2] Caussinus, H. (1986). Models and uses of principal component analysis: a comparison emphasizing graphical displays and metric choices, in Multidimensional Data Analysis, J. de Leeuw, W. Heiser, J. Meulman & F. Critchley, eds, DSWO Press, Leiden, pp. 149–178.
[3] Diamantaras, K.I. & Kung, S.Y. (1996). Principal Component Neural Networks: Theory and Applications, Wiley, New York.
[4] Flury, B. (1988). Common Principal Components and Related Models, Wiley, New York.
[5] Gifi, A. (1990). Nonlinear Multivariate Analysis, Wiley, Chichester.
[6] Gower, J.C. & Hand, D.J. (1996). Biplots, Chapman & Hall, London.
[7] Hotelling, H. (1933). Analysis of a complex of statistical variables into principal components, Journal of Educational Psychology 24, 417–441, 498–520.
[8] Hyvärinen, A., Karhunen, J. & Oja, E. (2001). Independent Component Analysis, Wiley, New York.
[9] Jackson, J.E. (1991). A User's Guide to Principal Components, Wiley, New York.
[10] Jolliffe, I.T. (1998). Factor analysis, overview, in Encyclopedia of Biostatistics, Vol. 2, P. Armitage & T. Colton, eds, Wiley, New York, pp. 1474–1482.
[11] Jolliffe, I.T. (2002). Principal Component Analysis, 2nd Edition, Springer, New York.
[12] Jones, M.C. & Sibson, R. (1987). What is projection pursuit? (including discussion), Journal of the Royal Statistical Society, A 150, 1–38.
[13] Kaiser, H.F. (1960). The application of electronic computers to factor analysis, Educational and Psychological Measurement 20, 141–151.
[14] Pearson, K. (1901). On lines and planes of closest fit to systems of points in space, Philosophical Magazine 2, 559–572.
[15] Ramsay, J.O. & Silverman, B.W. (2002). Applied Functional Data Analysis: Methods and Case Studies, Springer, New York.
[16] Tipping, M.E. & Bishop, C.M. (1999). Probabilistic principal component analysis, Journal of the Royal Statistical Society, B 61, 611–622.
[17] Yule, W., Berger, M., Butler, S., Newham, V. & Tizard, J. (1969). The WPPSI: an empirical evaluation with a British sample, British Journal of Educational Psychology 39, 1–13.
(See also Multidimensional Scaling) IAN JOLLIFFE
Principal Components and Extensions GEORGE MICHAILIDIS Volume 3, pp. 1584–1594 in Encyclopedia of Statistics in Behavioral Science ISBN-13: 978-0-470-86080-9 ISBN-10: 0-470-86080-4 Editors Brian S. Everitt & David C. Howell John Wiley & Sons, Ltd, Chichester, 2005
Principal Components and Extensions

Introduction

Principal components analysis (PCA) belongs to the class of projection methods, where one calculates linear combinations of the original variables with the weights chosen so that some measure of 'interestingness' is maximized. This measure for PCA corresponds to maximizing the variance of the linear combinations, which, loosely speaking, implies that the directions of maximum variability in the data are those of interest. PCA is one of the oldest and most popular multivariate analysis techniques; its basic formulation dates back to Pearson [7] and in its most common derivation to Hotelling [18]. For time series data it corresponds to the Karhunen–Loeve decomposition of Watanabe [30]. In this paper, we primarily discuss extensions of PCA to handle categorical data. We also provide a brief summary of some other extensions that have appeared in the literature over the last two decades. The exposition of the extension of PCA for categorical data takes us into the realm of optimal scoring, and we discuss those connections at some length.

PCA Problem Formulation

The setting is as follows: suppose we have a data matrix X comprising p measurements on N objects. Without loss of generality, it is assumed that the columns have been centered to zero and scaled to have length one. The main idea is to construct d < p linear combinations XW that capture the main characteristics of the data. Let us consider the singular value decomposition of the data matrix

X = U Λ V′,   (1)

with Λ being a p × p diagonal matrix with decreasing nonnegative entries, U an n × p matrix of orthonormal columns and V a p × p orthogonal matrix. Then, the principal component scores for the objects correspond to the columns of UΛ. Some of the properties of this solution are:

• The first d < p columns of the principal components matrix give the linear projection of X into d dimensions with the largest variance.
• Consider the following reconstruction of the data matrix, X̃ = U_d Λ_d V_d′, where only the first d columns/rows of U, Λ, V have been retained. Then, as a consequence of the Eckart–Young theorem [12], X̃ is the best rank-d approximation of X in the least squares sense (see Least Squares Estimation).
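Both properties can be illustrated with a short NumPy sketch based on the truncated singular value decomposition; the names below are illustrative, and the data matrix is assumed to be centered and scaled as described above:

```python
import numpy as np

def best_rank_d_approximation(X, d):
    """Truncated SVD of the data matrix: returns the first d component scores
    and the Eckart-Young optimal rank-d reconstruction of X."""
    U, L, Vt = np.linalg.svd(X, full_matrices=False)   # singular values descending
    scores = U[:, :d] * L[:d]                          # columns of U_d Lambda_d
    X_tilde = scores @ Vt[:d]                          # best rank-d approximation
    return scores, X_tilde
```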
PCA for Nominal Categorical Data

We discuss next a formulation for nominal categorical data that leads to a joint representation in Euclidean space of objects and the categories they belong to. The starting point is a ubiquitous coding of the data matrix. In this setting, data have been collected for N objects (e.g., the sleeping bags described in Appendix 1) on J categorical variables (e.g., price, quality, and fiber composition) with k_j categories per variable (e.g., three categories for price, three for quality, and two for fibers). Let G_j be a matrix with entries G_j(i, t) = 1 if object i belongs to category t, and G_j(i, t) = 0 if it belongs to some other category, i = 1, . . . , N, t = 1, . . . , k_j. The collection of all the indicator matrices G = [G_1 | G_2 | . . . | G_J] corresponds to the super-indicator matrix [22]. In order to be able to embed both the objects and the categories of the variables in Euclidean space, we need to assign to them scores, X and Y_j respectively. The object scores matrix is of order N × d and the category scores matrix for variable j is of order k_j × d, where d is usually 2 or 3. The quality of the scoring of the objects and the variables is measured by the squared differences between the two, namely

σ(X; Y) = J^{−1} Σ_{j=1}^{J} SSQ(X − G_j Y_j).   (2)

The goal is to find object scores X and variable scores Y = [Y_1, Y_2, . . . , Y_J] that minimize these differences. In order to avoid the trivial solution X = Y = 0, we impose the normalization constraint X′X = N I_d, together with a centering constraint u′X = 0, where u is an N × 1 vector comprised of ones.
Some algebra (see [22]) shows that the optimal solution takes the form

Y_j = D_j^{−1} G_j′ X,   (3)

where D_j contains the frequencies of the categories of variable j along its diagonal, and

X = J^{−1} Σ_{j=1}^{J} G_j Y_j.   (4)
Equation (3) is the driving force behind the interpretation of the results, since it says that a category score is located at the center of gravity of the objects that belong to it. This is known in the literature as the first centroid principle [22]. Although (4) implies in theory an analogous principle for the object scores, such a relationship does not hold in practice due to the influence of the normalization constraint X′X = N I_d. Remark. This solution is known in the literature as the homogeneity analysis solution [11]. Its origins can be traced back to the work of Hirschfeld [17], Fisher [10] and especially Guttman [13], although
some ideas go further back to Pearson (see the discussion in de Leeuw [7]). It has been rediscovered many times by considering different starting points, such as analysis of variance considerations [24] or extensions to correspondence analysis [1]. A derivation of the above solution from a graph theoretic point of view is given in [23]. We summarize next some basic properties of the homogeneity analysis solution (for more details see Michailidis and de Leeuw [22]).

• Category quantifications and object scores are represented in a joint space.
• A category point is the centroid of the objects belonging to that category.
• Objects with the same response pattern (identical profiles) receive identical object scores. In general, the distance between two object points is related to the 'similarity' between their profiles.
• A variable is more informative to the extent that its category points are further apart.
• If a category applies uniquely to only a single object, then the object point and that category point will coincide.
• Objects with a 'unique' profile will be located further away from the origin of the joint space, whereas objects with a profile similar to the 'average' one will be located closer to the origin (a direct consequence of the previous property).
• The solution is invariant under rotations of the object scores and of the category quantifications. To see this, suppose we select a different basis for the column space of the object scores X; that is, let X̃ = XR, where R is a rotation matrix satisfying R′R = RR′ = I. We then get from (3) that Ỹ_j = D_j^{−1} G_j′ X̃ = Y_j R.

Figure 1  Graph plot of the homogeneity analysis solution of the sleeping bag data
An application of the technique to the sleeping bags data set is given next. We apply homogeneity analysis to this data set and we provide two views of the solution: (a) a joint map of objects and categories with lines connecting them (the so-called graph plot) (Figure 1) and (b) the maps of the three variables, that is, the categories of the variable and the objects connected to them
(the so-called star plots; see Figures 2, 3, and 4). Notice that objects with similar profiles are mapped to identical points on the graph plot, a property stemming from the centroid principle. That is the reason that fewer than 18 object points appear on the graph plot. Several things become immediately clear. There are good, expensive sleeping bags filled with down fibers, and cheap, bad-quality sleeping bags filled with synthetic fibers. There are also some intermediate sleeping bags in terms of quality and price filled either with down or synthetic fibers. Finally, there are some expensive ones of acceptable quality and some cheap ones of good quality. However, there are no bad expensive sleeping bags.

Figure 2  Star plot of variable price

Figure 3  Star plot of variable fiber

Figure 4  Star plot of variable quality

Remark. Some algebra (see [22]) shows that the homogeneity analysis solution presented above can also be obtained by a singular value decomposition of

J^{−1/2} L G D^{−1/2} = U Λ V′,   (5)

where L is a centering operator that leaves the super-indicator matrix G containing the data in deviations from its column means, and D is a diagonal matrix containing the univariate marginals of all the variables. The above expression shows why homogeneity analysis can be thought of as PCA for categorical data (see Section 'PCA Problem Formulation').
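For readers who want to experiment, the alternating updates (3) and (4), combined with re-centering and re-normalizing the object scores so that u′X = 0 and X′X = N I_d, can be sketched as below. This is only a minimal illustration of the alternating least squares idea, not the homals implementation; the function name and iteration settings are ours, and every category is assumed to occur at least once.

```python
import numpy as np

def homogeneity_analysis(G_list, d=2, n_iter=500, seed=0):
    """Minimal ALS sketch: G_list holds the N x k_j indicator matrices G_j."""
    N, J = G_list[0].shape[0], len(G_list)

    def normalize(X):
        X = X - X.mean(axis=0)       # centering constraint u'X = 0
        Q, _ = np.linalg.qr(X)       # orthonormal basis of the centered scores
        return np.sqrt(N) * Q        # normalization constraint X'X = N I_d

    rng = np.random.default_rng(seed)
    X = normalize(rng.standard_normal((N, d)))
    for _ in range(n_iter):
        # (3): each category point is the centroid of its objects' scores
        Y = [(Gj.T @ X) / Gj.sum(axis=0)[:, None] for Gj in G_list]
        # (4): each object score is the average of its categories' points
        X = normalize(sum(Gj @ Yj for Gj, Yj in zip(G_list, Y)) / J)
    return X, Y
```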
Incorporation of Ordinal Information

In many cases, the optimal scores for the categories of the variables exhibit nonmonotone patterns. However, this may not be a particularly desirable feature, especially when the categories encompass ordinal information. Furthermore, nonmonotone patterns may make the interpretation of the results a rather hard task. One way to overcome this difficulty is to further require

Y_j = q_j β_j′,   (6)

where q_j is a k_j-column vector of single category scores for variable j, and β_j a d-column vector of weights (component loadings). Thus, each score matrix Y_j is restricted to be of rank one, which implies that the scores in d-dimensional space become proportional to each other. As noted in [22], the introduction of the rank-one restrictions allows the existence of multidimensional solutions for object scores with a single quantification (optimal scaling) for the categories of the variables, and also makes it possible to incorporate the measurement level of the variables (ordinal, numerical) into the analysis (see Scales of Measurement). The quality of the solution under such restrictions is still captured by the loss function given in (2), which is decomposed into two parts: the first component captures the loss (lack of fit) due to optimal scaling, while the second component captures the additional loss incurred due to the rank-one restrictions. The optimal quantifications q_j are computed by (given that optimal variable scores Ŷ_j have already been obtained)

q̂_j = Ŷ_j β_j / (β_j′ β_j),   (7)

and the optimal loadings β_j by

β̂_j = Ŷ_j′ D_j q_j / (q_j′ D_j q_j).   (8)

Figure 5  Variable quantifications (light grey points, not labeled) and US cities. The solid lines indicate the projections of the transformed variables. The placement of the variable labels indicates the high crime rate direction (e.g., for the assault variable, category 1 is on the left side of the graph, while category 6 is on the right side)
An application of weighted monotone regression procedure [6], allows one to impose an ordinal constraint on the scores qj . This procedure is known in the literature as the Princals solution (principal components analysis by means of alternating least squares), since an alternating least squares algorithm is used to calculate the optimal scores for the variables and their component loadings.
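The monotone step used to enforce an ordinal constraint on the single category quantifications can be illustrated with a generic weighted pool-adjacent-violators sketch. This is not the Princals code itself; it simply shows the kind of weighted monotone regression referred to above, with the category frequencies playing the role of the weights.

```python
def monotone_regression(y, w):
    """Weighted pool-adjacent-violators: the nondecreasing sequence closest
    to y in the weighted least-squares sense (weights w)."""
    blocks = []                                  # each block: [mean, weight, count]
    for yi, wi in zip(y, w):
        blocks.append([yi, wi, 1])
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, n2 = blocks.pop()            # pool adjacent violating blocks
            m1, w1, n1 = blocks.pop()
            w_new = w1 + w2
            blocks.append([(m1 * w1 + m2 * w2) / w_new, w_new, n1 + n2])
    fitted = []
    for m, _, n in blocks:                       # expand block means back out
        fitted.extend([m] * n)
    return fitted
```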
Figure 6  A schematic illustration of a principal curve in two dimensions

Figure 7  A two-dimensional nonlinear pattern (left panel) becomes more linear when additional dimensions are introduced. In the right panel the same data points are projected onto the x², y² space

We demonstrate next the main features of the technique through an example. The data in this example give crime rates per 100,000 people in seven areas – murder, rape, robbery, assault, burglary, larceny, motor vehicle theft – for 1994 for each of the largest 72 cities in the United States. The data and their categorical coding are given in Appendix 2. In principle, we could have used homogeneity analysis to analyze and summarize the patterns in these data. However, we would like to incorporate into the analysis the underlying monotone structure in the data (higher crime rates are worse for a city), and thus we have treated all the variables as ordinal in a nonlinear principal components analysis. In Figure 5, the component loadings of the seven variables of a two-dimensional solution are shown. If the loadings are of (almost) unit length, then the angle between any two of them reflects the value of the correlation coefficient between the two corresponding quantified variables. It can be seen that the first dimension (component) is a measure of overall crime rate, since all variables exhibit high loadings on it. On the other hand, the second component has high
positive loadings on rape and larceny, and negative ones on murder, robbery, and auto theft. Thus, the second component will distinguish cities with large numbers of incidents involving larceny and rape from cities with high rates of auto thefts, murders, and robberies. Moreover, it can be seen that murder, robbery and auto theft are highly correlated, as are larceny and rape. The assault variable is also correlated, although to a lesser degree, with the first set of three variables and also with burglary. It is interesting to note that not all aspects of violent crimes are highly correlated (i.e., murder, rape, robbery, and assault) and the same holds for property crimes (burglary, larceny and auto thefts).
Software The homogeneity analysis solution and its nonlinear principal components variety can be found in many popular software packages such as SPSS (under the
categories package), and SAS (under the corresponding procedure). A highly interactive version has been implemented in the xlisp-stat language [3], while a version with a wealth of plotting options for the results, as well as different ways of incorporating extraneous information through constraints can be found in R, as the homals package (see Software for Statistical Analyses).
Discussion The main idea incorporated in homogeneity analysis and in nonlinear principal components analysis is that of optimal scoring (scaling) of the categorical variables. An extension of this idea to regression analysis with numerical variables can be found in the ACE methodology of Breiman and Friedman [5] and with categorical variables in the ALSOS system of Young et al. [31].
The concept of optimal scaling has also been used in other multivariate analysis techniques (see Multivariate Analysis: Overview); for example, in canonical correlation analysis [20, 28], in flexible versions of discriminant analysis [16, 15] and in linear dynamical systems [2]. The book by Bekker and de Leeuw [29] deals with many theoretical aspects of optimal scaling (see also [8]).
Some Other Extensions of PCA

In this section, we briefly review some other extensions of PCA. All of the approaches covered try to deal with nonlinearities in the multivariate data.

Principal curves. If the data exhibit nonlinear patterns, then PCA, which relies on linear associations, would not be able to capture them. Hastie and Stuetzle [14] proposed the concept of a principal curve. The data are mapped to the closest point on the curve, or alternatively every point on the curve is the average of all the data points that are projected onto it (a different implementation of a centroid-like principle). An illustration of the principal curves idea is given in Figure 6. It can also be shown that the regular principal components are the only straight lines possessing the above property. Some extensions of principal curves can be found in [9, 19].

Local PCA. If one would like to retain the conceptual simplicity of PCA, together with its algorithmic efficiency, in the presence of nonlinearities in the data, one could resort to applying PCA locally. One possibility is to perform a cluster analysis, and then apply different principal components analyses to the various clusters [4]. Some recent advances on this topic are discussed in [21].

Kernel PCA. The idea behind kernel PCA [27] is that nonlinear patterns present in low-dimensional spaces can be linearized by projecting the data into high-dimensional spaces, at which point classical PCA becomes effective. Although this concept is contrary to the idea of using PCA for data reduction purposes, it has proved successful in some application areas, such as handwritten digit recognition. An illustration of the linearization idea using synthetic data is given in Figure 7. In order to make this idea computationally tractable, kernels (to some extent they can be thought of as generalizations
of a covariance function) are used, that essentially calculate inner products of the original variables. It should be noted that kernels are at the heart of support vector machines, a very popular and successful classification technique [26].

Acknowledgments

The author would like to thank Jan de Leeuw for many useful suggestions and comments. This work was supported in part by NSF under grants IIS-9988095 and DMS-0214171 and NIH grant 1P41RR018627-01.

Appendix A

The sleeping bags data set was used in [25] and [22]. The data set is given in Table 1.

Table 1  Sleeping bags data set

Sleeping bag       Expensive  Down  Good  Acceptable  Not expensive  Synthetic  Cheap  Bad
Foxfire                1       1     1        0            0             0        0     0
Mont Blanc             1       1     1        0            0             0        0     0
Cobra                  1       1     1        0            0             0        0     0
Eiger                  1       1     0        1            0             0        0     0
Viking                 0       1     1        0            1             0        0     0
Climber Light          0       1     1        0            1             0        0     0
Traveler's Dream       0       1     1        0            1             0        0     0
Yeti Light             0       1     1        0            1             0        0     0
Climber                0       1     0        1            1             0        0     0
Cobra Comfort          0       1     0        1            1             0        0     0
Cat's Meow             0       0     1        0            1             1        0     0
Tyin                   0       0     0        1            1             1        0     0
Donna                  0       0     0        1            1             1        0     0
Touch the Cloud        0       0     0        1            1             1        0     0
Kompakt                0       0     0        1            1             1        0     0
One Kilo Bag           0       0     1        0            0             1        1     0
Kompakt Basic          0       0     1        0            0             1        1     0
Igloo Super            0       0     0        0            1             1        0     1
Sund                   0       0     0        0            0             1        1     1
Finmark Tour           0       0     0        0            0             1        1     1
Interlight Lyx         0       0     0        0            0             1        1     1

Appendix B

The data for this example are taken from table No. 313 of the 1996 Statistical Abstract of the United States. The coding of the variables is given next (see Table 2):

Murder: 1: 0-10, 2: 11-20, 3: 21-40, 4: 40+
Rape: 1: 0-40, 2: 41-60, 3: 61-80, 4: 81-100, 5: 100+
Robbery: 1: 0-400, 2: 401-700, 3: 701-1000, 4: 1000+
Assault: 1: 0-300, 2: 301-500, 3: 501-750, 4: 751-1000, 5: 1001-1250, 6: 1251+
Burglary: 1: 0-1000, 2: 1001-1400, 3: 1401-1800, 4: 1801-2200, 5: 2200+
Larceny: 1: 0-3000, 2: 3001-3500, 3: 3501-4000, 4: 4001-4500, 5: 4501-5000, 6: 5001-5500, 7: 5501-7000, 8: 7000+
Motor vehicle theft: 1: 0-500, 2: 501-1000, 3: 1001-1500, 4: 1501-2000, 5: 2000+

Table 2
Crime rates data
City New York (NY) Chicago (IL) Philadelphia (PA) Phoenix (AZ) Detroit (MI) Honolulu (HI) Las Vegas (NV) Baltimore (MD) Columbus (OH) Memphis (TN) El Paso (TX) Seattle (WA) Nashville (TN) Denver (CO) New Orleans (LA) Portland (OR) Long Beach (CA) Kansas City (MO) Atlanta (GA) Sacramento (CA) Tulsa (OK) Oakland (CA) Pittsburgh (PA) Toledo (OH)
City Code NY Chi Phi Pho Det Hon LV Bal Col Mem ElP Sea Nas Den NOr Por LB KS Atl Sac Tul Oak Pit Tol
3134213 3a 46343 3232114 3213464 4546445 1111252 2323333 4445474 2522453 3533534 1213152 2223373 2425373 2312323 4433444 2426375 2133324 3536574 4546585 2223455 2314323 3445454 2322223 2522453
City
City Code
Los Angeles (CA) Houston (TX) San Diego (CA) Dallas (TX) San Antonio (TX) San Jose (CA) San Francisco (CA) Jacksonville (FL) Milwaukee (WI) Washington, DC Boston (MA) Charlotte (NC) Austin (TX) Cleveland (OH) Fort Worth (TX) Oklahoma City (OK) Tucson (AZ) Virginia Beach (VA) Saint Louis (MO) Fresno (CA) Miami (FL) Minneapolis (MN) Cincinatti (OH) Buffalo (NY)
LA Hou SD Dal SAn SJ SF Jac Mil DC Bos Cha Aus Cle FWo Okl Tuc VBe StL Fre Mia Min Cin Buf
3235223 3223323 1113223 3424354 2211362 1213112 2133253 2424462 3322244 4246363 2435245 2325462 1211262 3533314 3423363 2514583 1314384 1111131 4346585 3234455 3246585 2534573 2523351 3445533
Table 2  (continued)
City
City Code
Wichita (KS) Colorado Springs (CO) Santa Ana (CA) Anaheim (CA) Louisville (KY) Newark (NJ) Norfolk (VA) Aurora (CO) Riverside (CA) Rochester (NY) Raleigh (NC) Akron (OH) Note: The
a
Wic Cos SA Ana Lou New Nor Aur Riv Roc Ral Akr
2312463 1311151 3122113 1123223 2222312 3346545 3322252 1215252 2225434 3332562 2113341 2413232
[2] [3] [4]
[5]
[6]
[7] [8]
[9] [10] [11] [12] [13]
City Code
Mesa (AZ) Tampa (FL) Arlington (VA) Corpus Cristi (TX) St Paul (MN) Birmingham (AL) Anchorage (AK) St Petersburg (FL) Lexington (KY) Jersey City (NJ) Baton Rouge (LO) Stockton (CA)
Mes Tam Arl CCr StP Bir Anc SPe Lex JC BRo Sto
1113353 3546585 1213242 1313371 2413332 4536573 1313152 1426462 1213241 2134414 3326584 2224454
for the Rape variable for Chicago denotes a missing observation.
References [1]
City
Benzecri, J.P. (1973). Analyse des Donn´ees, Dunod, Paris. Bijleveld, C.C.J.H. (1989). Exploratory Linear Dynamic Systems Analysis, DSWO Press, Leiden. Bond, J. & Michailidis, G. (1996). Homogeneity analysis in Lisp-Stat, Journal of Statistical Software 1(2), 1–30. Bregler, C. & Omohundro, M. (1994). Surface learning with applications to lipreading, in Advances in Neural Information Processing Systems, S.J. Hanson, J.D. Cowan & C.L. Giles, eds, Morgan Kaufman, San Mateo. Breiman, L. & Friedman, J.H. (1985). Estimating optimal transformations for multiple regression and correlation, Journal of the American Statistical Association 80, 580–619. De Leeuw, J. (1977). Correctness of Kruskal’s algorithms for monotone regression with ties, Psychometrika 42, 141–144. De Leeuw, J. (1983). On the prehistory of correspondence analysis, Statistica Neerlandica 37, 161–164. De Leeuw, J., Michailidis, G. & Wang, D. (1999). Correspondence analysis techniques, in Multivariate Analysis, Design of Experiments and Survey Sampling, S. Ghosh, ed., Marcel Dekker, New York, pp. 523–546. Delicado, P. (2001). Another look at principal curves and surfaces, Journal of Multivariate Analysis 77, 84–116. Fisher, R.A. (1938). The precision of discriminant functions, The Annals of Eugenics 10, 422–429. Gifi, A. (1990). Nonlinear Multivariate Analysis, Wiley, Chichester. Golub, G.H. & van Loan, C.F. (1989). Matrix Computations, Johns Hopkins University Press, Baltimore. Guttman, L. (1941). The quantification of a class of attributes: a theory and a method of scale construction. The Prediction of Personal Adjustment. Horst et al., eds, Social Science Research Council, New York.
[14] [15] [16]
[17]
[18]
[19]
[20]
[21] [22]
[23]
[24]
[25]
[26] [27]
Hastie, T. & Stuetzle, W. (1989). Principal curves, Journal of the American Statistical Association 84, 502–516. Hastie, T., Buja, A. & Tibshirani, R. (1995). Penalized discriminant analysis, Annals of Statistics 23, 73–102. Hastie, T., Tibshirani, R. & Buja, A. (1994). Flexible discriminant analysis with optimal scoring, Journal of the American Statistical Association 89, 1255–1270. Hirschfeld, H.O. (1935). A connection between correlation and contingency, Cambridge Philosophical Society Proceedings 31, 520–524. Hotelling, H. (1933). Analysis of a complex of statistical variables into principal components, Journal of Educational Psychology 24, 417–441 and 498–520. Kegl, B. & Krzyzak, A. (2002). Piecewise linear skeletonization using principal curves, IEEE Transactions on Pattern Analysis and Machine Intelligence 24, 59–74. Koyak, R.A. (1987). On measuring internal dependence in a set of random variables, The Annals of Statistics 15, 1215–1228. Liu, Z.Y. & Xu, L. (2003). Topological local principal component analysis, Neurocomputing 55, 739–745. Michailidis, G. & de Leeuw, J. (1998). The Gifi system of descriptive multivariate analysis, Statistical Science 13, 307–336. Michailidis, G. & de Leeuw, J. (2001). Data visualization through graph drawing, Computational Statistics 16, 435–450. Nishisato, S. (1980). Analysis of Categorical Data: Dual Scaling and Its Applications, Toronto University Press, Toronto. Prediger, S. (1997). Symbolic objects in formal concept analysis, in Proceedings of the Second International Symposium on Knowledge, Retrieval, Use and Storage for Efficiency, G. Mineau & A. Fall, eds. Scholkopf, B. & Smola, A.J. (2002). Learning with Kernels, MIT Press, Cambridge. Scholkopf, B., Smola, A. & Muller, K.R. (1998). Nonlinear component analysis as a kernel eigenvalue problem, Neural Computation 1299–1319.
Principal Components and Extensions [28]
Van der Burg, E., de Leeuw, J. & Verdegaal, R. (1988). Homogeneity analysis with K sets of variables: an alternating least squares method with optimal scaling features, Psychometrica 53, 177–197. [29] Van Rijckevorsel, J. & de Leeuw, J. eds, (1988). Component and Correspondence Analysis, Wiley, Chichester. [30] Watanabe, S. (1965). Karhunen-Loeve expansion and factor analysis theoretical remarks and applications, Transactions of the 4th Prague Conference on Information Theory. [31] Young, F.W., de Leeuw, J. & Takane, Y. (1976). Regression with qualitative variables: an alternating least squares method with optimal scaling features, Psychometrika 41, 505–529.
Further Reading De Leeuw, J. & Michailidis, G. (2000). Graph-layout techniques and multidimensional data analysis, Papers in Honor of T.S. Ferguson, Le Cam, L. & Bruss, F.T. (eds), 219–248 IMS Monograph Series, Hayward, CA. Michailidis, G. & de Leeuw, J. (2000). Multilevel homogeneity analysis with differential weighting Computational Statistics and Data Analysis 32, 411–442.
GEORGE MICHAILIDIS
Probability: An Introduction RANALD R. MACDONALD Volume 3, pp. 1600–1605 in Encyclopedia of Statistics in Behavioral Science ISBN-13: 978-0-470-86080-9 ISBN-10: 0-470-86080-4 Editors Brian S. Everitt & David C. Howell John Wiley & Sons, Ltd, Chichester, 2005
Probability: An Introduction Probabilities are used by people to characterize the uncertainties they encounter in everyday life such as the winning of a sports event, the passing of an exam or even the vagaries of the weather, while in technical fields uncertainties are represented by probabilities in all the physical, biological, and social sciences. Probability theory has given rise to the disciplines of statistics, epidemiology, and actuarial studies; it even has a role in such disparate subjects as law, history, and ethics. Probability is useful for betting, insurance, risk assessment, planning, management, experimental design, interpreting evidence, predicting the future and determining what happened in the past (for more examples see [1, 3]). A probability measures the uncertainty associated with the occurrence of an event on a scale from 0 to 1. When an event is impossible the probability is 0, when it is equally likely to occur or not to occur the probability is 0.5 and the probability is 1 when the event is certain. Probability is a key concept in all statistical inferences (see Classical Statistical Inference: Practice versus Presentation). What follows is intended as a short introduction to what probability is. Accounts of probability are included in most elementary statistics textbooks; Hacking [4] has written an introductory account of probability and Feller [2] is possibly the classic text on probability theory though it is not an introductory one. Set theory can be used to characterize events and their probabilities where a set is taken to be a collection of events. Such representations are useful because new distinctions between events can be introduced which break down what was taken to be an event into more precisely defined events. For example the event of drawing an ace from a pack of cards can be broken down by suit, by player, by time, and by place.
Representing the Probability of a Single Event

In all probability applications there is one large set, consisting of the event in question and all other events that could happen should this event not occur. This is called the sample space and is denoted by S. For instance, S might consist of all instances of drawing a card. In set theory, any event that might occur is represented by a set of some but not all of the possible events. Thus, it is a region of the sample space (a subset of S) and it can be represented as in Figure 1, where the event A might be drawing an ace and not A drawing any other card. In Figure 1, called a Venn diagram, the sample space S is all the area enclosed by the rectangle, event A is the region within the circle, and the event corresponding to the nonoccurrence of A (the complement of A) is represented by the region S − A (all the area within the rectangle except the area within the circle). The Venn diagram corresponds to a probability model where the areas within regions of the diagram correspond to the probabilities of the events they represent. The area within the circle is the probability of event A occurring and the area outside the circle within the diagram is the probability of A not occurring. A and not A are said to be mutually exclusive, as A and not A cannot occur together, and exhaustive, as either A or not A must happen. The area within the rectangle is defined to be 1. Thus,

p(A) + p(notA) = p(S) = 1.

Figure 1  Venn diagram of event A and sample space S. A is the area in the circle and S the area in the rectangle
Representing the Probability of Two Events

A Venn diagram gets more complicated when it is extended to include two events A and B which are not mutually exclusive (see Figure 2). For example, A might be drawing an ace and B drawing a club. (If A and B were mutually exclusive, the circles in Figure 2 would not overlap. This would be the case where A was drawing an ace and B was drawing a King from a pack of cards.)

Figure 2  Venn diagram of events A and B. A is the area in the left circle, B the area in the right circle and A ∩ B the area the circles have in common

Figure 2 distinguishes between four different outcomes, described below.

1. Both A and B occur. This can be written as A ∩ B and it is called the intersection of A and B, or A 'cap' B.
2. A occurs and B does not, written as A ∩ notB.
3. A does not occur and B does, written as notA ∩ B.
4. Neither A nor B occurs, written as notA ∩ notB.

Again the Venn diagram corresponds to a probability model where the four regions correspond to the probabilities of the four possible outcomes. It is also useful to introduce the concept of the union of two sets A and B, written as A ∪ B, which stands for the occurrence of either A or B or both. This is sometimes referred to as A union B or A 'cup' B.

A ∪ B = (A ∩ notB) + (notA ∩ B) + (A ∩ B).   (1)

From Figure 2 it can be seen that the probability of A union B is

p(A ∪ B) = p(A) + p(B) − p(A ∩ B).   (2)
Example 1 Suppose there are a large number of boxes half of which contain a prize while the other half contain nothing. If one were presented with three boxes chosen at random (where each box has the same probability of being chosen) find (a)
p(A) the probability that at least one box contains a prize,
(b) p(B) the probability that at least one box is empty, (c) p(A ∩ B) the probability that at least one contains a prize and that at least one is empty, (d) p(A ∪ B) the probability that at least one contains a prize or that at least one is empty.

The sample space of all distinguishable outcomes is given in Table 1.

Table 1  Distinguishable outcomes when three boxes are randomly selected which either contain a prize (P) or are empty (E)

Box 1:  P  P  P  P  E  E  E  E
Box 2:  P  P  E  E  P  P  E  E
Box 3:  P  E  P  E  P  E  P  E

Because the probability of a box containing a prize is 0.5 and the boxes have been chosen at random, all the patterns of boxes in Table 1 are equally likely and thus each pattern has a probability of 1/8.

(a) The number of patterns of boxes with at least one prize is 7 and the sample space is 8, so the probability p(A) that at least one box contains a prize is 7/8.
(b) Similarly, the probability p(B) that at least one box is empty is 7/8.
(c) The number of patterns of boxes that have at least one box containing a prize and at least one empty is 6 and the sample space is 8, so the probability p(A ∩ B) of at least one box containing a prize and at least one empty box is 6/8 = 3/4.
(d) The probability p(A ∪ B) that at least one box contains a prize or that at least one is empty is 1, as all patterns have either a box containing a prize or an empty box. Alternatively, it could be noted that

p(A ∪ B) = p(A) + p(B) − p(A ∩ B) = 7/8 + 7/8 − 6/8 = 1.   (3)
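The counts used in (a)–(d), and the conditional probability derived later in this example, can be checked by enumerating the eight patterns directly; the short fragment below is purely illustrative.

```python
from itertools import product
from fractions import Fraction

patterns = list(product("PE", repeat=3))        # the 8 equally likely patterns
A = [p for p in patterns if "P" in p]           # at least one prize
B = [p for p in patterns if "E" in p]           # at least one empty box
both = [p for p in patterns if "P" in p and "E" in p]

n = len(patterns)
print(Fraction(len(A), n))                      # p(A)     = 7/8
print(Fraction(len(B), n))                      # p(B)     = 7/8
print(Fraction(len(both), n))                   # p(A and B) = 3/4
print(Fraction(len(set(A) | set(B)), n))        # p(A or B)  = 1
print(Fraction(len(both), len(B)))              # p(A | B)   = 6/7
```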
A key finding that allows us to treat the long run frequencies of events as probabilities is the law of large numbers (see Laws of Large Numbers and [3] for an account of its development). The law states that the average number of times a particular outcome occurs in a number of events (which are assumed to be equivalent with respect to the probability of the outcome in question) will asymptote to the probability of
the outcome as the number of events increases. Hacking [3] noted that subsequent generations ignored the need for equivalence and took the law to be an a priori truth about the stability of mass phenomena, but where equivalence can be assumed the law implies that good estimates of probabilities can be obtained from long-run frequencies. This is how one can infer probabilities from incidence rates. Everitt [1] gives tables of the rate of death per 100,000 from various activities including, with the rates given in brackets, motorcycling (2000), hang gliding (80), and boating (5). From these it follows that the probabilities that someone taken at random from the sampled population dies from motorcycling, hang gliding, and boating are 0.02, 0.0008, and 0.00005 respectively.
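The convergence of relative frequencies described by the law can be illustrated with a small simulation; the probability value and sample sizes below are chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.02                           # e.g., the motorcycling rate quoted above
for n in (10**2, 10**4, 10**6):
    frequency = (rng.random(n) < p).mean()
    print(n, frequency)            # the relative frequency settles near 0.02
```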
Conditional Probabilities

To return to Figure 2, suppose it becomes known that one of the events, say B, has occurred. This causes the sample space to change to the events inside circle B. When the sample space can change in this way it is better to think of the areas in the Venn diagram as relative likelihoods rather than as probabilities. The probability of an event is now given by the ratio of the relative likelihood of the event in question over the relative likelihood of all the events that are considered possible. Thus, if it is known that B has occurred, the only possible regions of the diagram are A ∩ B and notA ∩ B and the probability of A becomes the likelihood of A ∩ B divided by the likelihood of A ∩ B and notA ∩ B. This is known as the conditional probability of A given B and it is written as p(A|B). Thus, the probability of A before anything is known is

p(A) = [p(A ∩ B) + p(A ∩ notB)] / [p(A ∩ B) + p(A ∩ notB) + p(notA ∩ B) + p(notA ∩ notB)].   (4)

The probability of A after B has occurred is

p(A|B) = [p(A ∩ B)] / [p(A ∩ B) + p(notA ∩ B)] = [p(A ∩ B)] / [p(B)].   (5)

One reason why conditional probabilities are useful is that all probabilities are conditional in that every probability depends on a number of assumptions. Nevertheless, most of the time these assumptions are taken for granted and the conditioning is ignored. Conditional probabilities are useful when one is interested in the relationship between the occurrence of one event and the probability of another. For example, conditional probabilities are useful in signal detection theory, where an observer has to detect the presence of a signal on a number of trials. It was found that the probability of being correct on a trial (as measured by the relative frequency) was an unsatisfactory measure of detection ability as it was confounded by response bias. Signal detection theory uses both the conditional probabilities of a hit (detecting a signal given that a signal was present) and of a false alarm (the probability of detecting a signal given that no signal was present) to determine a measure of detection ability that is, at least to a first approximation, independent of response bias (see Signal Detection Theory and [6]).

Independence

Event A is said to be independent of event B when the probability of A remains the same regardless of whether or not B occurs. Thus, p(A) = p(A|B) = p(A|notB), but from the definition of conditional probability

p(A|B) = p(A ∩ B) / p(B).   (6)

Substituting p(A) for p(A|B) and rearranging the equation gives

p(A ∩ B) = p(A) × p(B).   (7)

Similar arguments show that

p(A ∩ notB) = p(A) × p(notB),
p(notA ∩ B) = p(notA) × p(B),
p(notA ∩ notB) = p(notA) × p(notB).   (8)

It also follows that whenever A is independent of B, B is independent of A because

p(B|A) = p(A ∩ B) / p(A) = [p(A) × p(B)] / p(A) (from above) = p(B).   (9)
Table 2 Box 1 Box 2 Box 3
Sample space when at least one box is empty P P E
P E P
P E E
E P P
E P E
E E P
E E E
Example 1 (continued) If three boxes are chosen at random, (e) what is the probability that at least one contains a prize given that at least one of the boxes is empty (p(A|B)) and (f) show that the probabilities of having at least one prize in the three boxes P(A) and that at least one box is empty is P(B) are not independent. (e)
(f)
When at least one box is empty the sample space is reduced to that given in Table 2. As all the possibilities are equally likely the probability p(A|B) of at least one prize given at least one box is empty is 6/7. Alternatively p(A|B) = [p(A ∩ B)]/[p(B)] = [3/4]/[7/8] = 6/7. From (a) p(A) = 7/8 and from (e) p(A|B) = 6/7 so the probability of A (at least one prize) is related to B (the probability of at least one empty box). Alternatively it could be noted from (c) that p(A ∩ B) = 3/4 and that this does not equal p(A)*p(B) which substituting from (a) and (b) is (7/8)2 .
Independence greatly simplifies the problem of specifying the probability of an event as any variable that is independent of the event in question can be ignored. Furthermore, where a set of data can be regarded as composed of a number of independent events their combined probability is simply the product of the probabilities of each of the individual events. In most statistical applications, the joint probability of a set of data can be determined because the observations have been taken to be identically independently distributed and so the joint probability is the product of the probabilities of the individual observations. Little or nothing could be said about the effects of sampling error where the data was supposed to consist of events derived from separate unknown nonindependent probability distributions.
Applying Probability to a Particular Situation In practice, applying probability to real life situations can be difficult because of the difficulty in identifying
the sample space and all the events in it that should be distinguished. The following apparently simple problem, called the Monty Hall problem after a particular quiz master, gave rise to considerable controversy [5] though the problem itself is very simple. In a quiz show, a contestant chooses one of three boxes knowing that there is a prize in one and only one box. The quiz master then reveals that a particular other box is empty and gives the contestant the option of choosing the remaining box instead of the originally chosen one. The problem is, should the contestant switch? The nub of the problem involves noting that if contestants’ first choices are correct they will win by sticking to their original choice; if they are wrong they will win by switching. To spell out the situation, take the sample space to be the nine cells specified in Table 3 where the boxes are labeled A, B & C. When a contestant’s choice becomes known the sample space reduces to one particular column. Suppose a contestant’s first choice is wrong; this implies that the chosen box is empty, so if another identified box is also empty the contestant will win by switching to the third box. On the other hand, if the contestant’s first choice is correct the contestant will lose by switching. Since the probability of the first choice being correct is 1/3, the probability of winning by sticking is 1/3 whereas the probability of winning by switching is 2/3 because the probability of the first choice being wrong is 2/3. Table 3 gives the winning strategy for each cell. This answer was counter intuitive to a number of mathematicians [5]. Actually this is a case in which the world may be more complex than the probability model. The correct strategy should depend on what you think the quiz master is doing. Where the quiz master always makes the offer (which requires the quiz master to know where the prize is) the best strategy is to switch as described above. However, the quiz master might wish to keep the prize and knowing where the prize is, only offer the possibility of switching when the first choice is right. In this case, switching will always be Table 3 Winning strategies on being showed an empty box in the Monty Hall problem
              A chosen    B chosen    C chosen
Prize in A    Stick       Switch      Switch
Prize in B    Switch      Stick       Switch
Prize in C    Switch      Switch      Stick
Probability: An Introduction wrong. On the other hand, the quiz master might want to be helpful and only offer a switch when the first choice is wrong and under these circumstances the contestant will win every time by switching whenever possible. In other words, contestants should switch unless they think that the offer of a second choice is related to whether the first choice is correct in which case the best strategy depends on the size of the relationship between having chosen the correct box and being made the offer. Example 2 In the Monty Hall problem, suppose the quiz master offers a choice of switching with a probability P if the contestant is wrong and kP if the contestant is correct (where k is a constant > 1). How large should k be in order for sticking to be the best policy? Before any offer can be made the probability of the contestant’s first choice being correct (p(FC)) is 1/3 and of being wrong (p(FW)) = 2/3. However, when the quiz master adopts the above strategy being made an offer (O) gives information that p(FC) is greater than 1/3. The position can be formalized in the following way: From the definition of conditional probability: p(FC|O) =
p(O ∩ FC) / p(O).   (10)

Rewriting p(O ∩ FC) as p(O|FC) × p(FC), and noting that p(O) is the sum of the probabilities of O ∩ FC and of O ∩ FW, gives

p(FC|O) = [p(O|FC) × p(FC)] / [p(O ∩ FC) + p(O ∩ FW)];   (11)

expanding the denominator according to the definition of conditional probability gives

p(FC|O) = [p(O|FC) × p(FC)] / [p(O|FC) × p(FC) + p(O|FW) × p(FW)],   (12)

and substituting yields

p(FC|O) = [kP × (1/3)] / [kP × (1/3) + P × (2/3)] = k / (k + 2).   (13)

Thus, when k = 2 and an offer is made the probability that the first choice is correct is a half and it makes no difference whether the contestant sticks or switches. When k is less than 2 the probability that the first choice is correct, given that an offer has been made, is less than a half and so the probability that the first choice is wrong is greater than a half. On being made an offer the contestant should switch because when the first choice is wrong the contestant will win by switching. However, if k is greater than 2 and the contestant is made an offer the probability that the first choice is correct is greater than a half and the contestant should stick as the probability of winning by switching is less than a half. This form of reasoning is called Bayesian (see Bayesian Statistics). It might be pointed out that this analysis could be improved on where, by examining the quiz master's reactions, it was possible to distinguish between offers of switches that were more likely to help as opposed to those offers that were more likely to hinder the contestant. (For a discussion of the subtleties of relating probabilities to naturally occurring events, see Probability: Foundations of; Incompleteness of Probability Models).

References

[1] Everitt, B.S. (1999). Chance Rules, Copernicus, New York.
[2] Feller, W. (1971). An Introduction to Probability Theory, Vol. 1, 3rd Edition, Wiley, New York.
[3] Hacking, I. (1990). The Taming of Chance, Cambridge University Press, Cambridge.
[4] Hacking, I. (2001). Introduction to Probability and Inductive Logic, Cambridge University Press, Cambridge.
[5] Hoffman, P. (1998). The Man who Loved Only Numbers, Fourth Estate, London.
[6] Green, D.M. & Swets, J.A. (1966). Signal Detection Theory and Psychophysics, Wiley, New York.
(See also Bayesian Belief Networks) RANALD R. MACDONALD
Probability Plots
DAVID C. HOWELL
Volume 3, pp. 1605–1608 in Encyclopedia of Statistics in Behavioral Science
ISBN-13: 978-0-470-86080-9  ISBN-10: 0-470-86080-4
Editors: Brian S. Everitt & David C. Howell
John Wiley & Sons, Ltd, Chichester, 2005
Probability Plots
Probability plots are often used to compare two empirical distributions, but they are particularly useful to evaluate the shape of an empirical distribution against a theoretical distribution. For example, we can use a probability plot to examine the degree to which an empirical distribution differs from a normal distribution. There are two major forms of probability plots, known as probability–probability (P-P) plots and quantile–quantile (Q-Q) plots (see Empirical Quantile–Quantile Plots). They both serve essentially the same purpose, but plot different functions of the data.

An Example

It is easiest to see the difference between P-P and Q-Q plots with a simple example. The data displayed in Table 1 are derived from data collected by Compas (personal communication) on the effects of stress on cancer patients. Each of the 41 patients completed the Externalizing Behavior scale of the Brief Symptom Inventory. We wish to judge the normality of the sample data. A histogram for these data is presented in Figure 1 with a normal distribution superimposed. The sample mean is 12.10, and the sample standard deviation is 10.32.
P-P Plots

A P-P plot displays the cumulative probability of the obtained data on the x-axis and the corresponding expected cumulative probability for the reference distribution (in this case, the normal distribution) on the y-axis. For example, a raw score of 15 for our data had a cumulative probability of 0.756. It also had a z score, given the mean and standard deviation of the sample, of 0.28. For a normal distribution, we expect to find 61% of the distribution falling at or below z = 0.28. So, for this observation, we had considerably more scores (75.6%) falling at or below 15 than we would have expected if the distribution were normal (61%). From Table 1, you can see the results of similar calculations for the full data set. These are displayed in columns 4 and 6. If we plot the obtained cumulative percentage on the x-axis and the expected cumulative percentage on the y-axis, we obtain the results shown in Figure 2. Here, a line drawn at 45° is superimposed. If the data had been exactly normally distributed, the points would have fallen on that line. It is clear from the figure that this was not the case. The points deviate noticeably from the line, and, thus, from normality.
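The two coordinates of a P-P plot can be computed directly from the sorted data. The sketch below is ours rather than part of the original entry (the plotting step itself, for example with matplotlib, is omitted); it reproduces the 0.756 versus 0.61 comparison for a score of 15.

```python
import numpy as np
from scipy import stats

def pp_coordinates(sample):
    """Return (empirical, expected) cumulative probabilities for a P-P plot
    against a normal distribution with the sample's own mean and SD."""
    x = np.sort(np.asarray(sample, dtype=float))
    n = len(x)
    empirical = np.arange(1, n + 1) / n        # proportion at or below each value
    z = (x - x.mean()) / x.std(ddof=1)         # z scores based on sample mean and SD
    expected = stats.norm.cdf(z)               # expected cumulative probability
    return empirical, expected

# For the externalizing data, 31 of the 41 scores fall at or below 15, so the
# empirical cumulative probability is 31/41 = 0.756, while the normal CDF at
# z = (15 - 12.10)/10.32 = 0.28 is about 0.61:
print(round(31 / 41, 3), round(stats.norm.cdf((15 - 12.10) / 10.32), 2))
```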
Figure 1  Histogram of externalizing scores with normal curve superimposed (x-axis: Externalizing score; y-axis: Frequency; Std. Dev. = 10.32, Mean = 12.1, N = 41.00)
Table 1  Frequency distribution of externalizing scores with normal probabilities and quantiles

Score   Frequency   Percent   Cumulative percent      z     CDF normal   Normal quantile
  0         3          7.3           7.3            −1.17      0.12          −3.01
  1         2          4.9          12.2            −1.07      0.14          −0.03
  2         2          4.9          17.1            −0.98      0.16           2.19
  4         1          2.4          19.5            −0.78      0.22           3.13
  5         3          7.3          26.8            −0.69      0.25           5.61
  6         2          4.9          31.7            −0.59      0.28           7.08
  7         1          2.4          34.1            −0.49      0.31           7.77
  8         1          2.4          36.6            −0.40      0.35           8.46
  9         6         14.6          51.2            −0.30      0.38          12.31
 10         2          4.9          56.1            −0.20      0.42          13.58
 11         3          7.3          63.4            −0.10      0.46          15.54
 12         2          4.9          68.3            −0.01      0.50          16.92
 14         2          4.9          73.2             0.18      0.57          18.39
 15         1          2.4          75.6             0.28      0.61          19.16
 16         2          4.9          80.5             0.38      0.65          20.87
 19         1          2.4          82.9             0.67      0.75          21.81
 24         1          2.4          85.4             1.15      0.88          22.88
 25         1          2.4          87.8             1.25      0.89          24.03
 28         1          2.4          90.2             1.54      0.94          25.35
 29         1          2.4          92.7             1.64      0.95          27.01
 31         1          2.4          95.1             1.83      0.97          29.08
 40         1          2.4          97.6             2.70      0.99          32.41
 42         1          2.4         100.0             2.90      1.00          36.02
Figure 2  P-P plot of obtained distribution against the normal distribution (x-axis: Cumulative percentage; y-axis: CDF of normal distribution)
Q-Q Plots

A Q-Q plot resembles a P-P plot in many ways, but it displays the observed values of the data on the x-axis and the corresponding expected values from a normal distribution on the y-axis. The expected values are based on the quantiles of the distribution. To take our example of a score of 15 again, we know that 75.6% of the observations fell at or below 15, placing 15 at the 75.6 percentile. If we had a normal distribution with a mean of 12.10 and a standard deviation of 10.32, we would expect the 75.6 percentile to fall at a score of 19.16. We can make similar calculations for the remaining data, and these are shown in column 7 of Table 1. (I treated the final cumulative percentage as 99.9 instead of 100 because the normal distribution runs to infinity and the quantile for 100% would be undefined.)

Figure 3  Q-Q plot of obtained distribution against the normal distribution (x-axis: Externalizing score; y-axis: Normal quantiles)
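The quantile computation for a single score can be sketched as follows; this is our illustration rather than part of the original entry, and the small discrepancy from the tabled 19.16 presumably reflects rounding in the published values.

```python
from scipy import stats

# Normal quantile corresponding to the 75.6th percentile, for a normal
# distribution with mean 12.10 and standard deviation 10.32:
q = stats.norm.ppf(0.756, loc=12.10, scale=10.32)
print(round(q, 2))   # about 19.26; Table 1 reports 19.16
```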
The plot of these values can be seen in Figure 3, where, again, a straight line is superimposed representing the results that we would have if the data were perfectly normally distributed. Again, we can see significant departure from normality.

DAVID C. HOWELL
Probits
CATALINA STEFANESCU, VANCE W. BERGER AND SCOTT L. HERSHBERGER
Volume 3, pp. 1608–1610 in Encyclopedia of Statistics in Behavioral Science
ISBN-13: 978-0-470-86080-9  ISBN-10: 0-470-86080-4
Editors: Brian S. Everitt & David C. Howell
John Wiley & Sons, Ltd, Chichester, 2005
Probits

Probit models have arisen in the context of analysis of dichotomous data. Let $Y_1, \ldots, Y_n$ be n binary variables and let $x_1, \ldots, x_n \in \mathbb{R}^p$ denote corresponding vectors of covariates. The flexible class of Probit models may be obtained by assuming that the response $Y_i$ ($1 \le i \le n$) is an indicator of the event that some unobserved continuous variable, $Z_i$ say, exceeds a threshold, which can be taken to be zero, without loss of generality. Specifically, let $Z_1, \ldots, Z_n$ be latent continuous variables and assume that

$$Y_i = I_{\{Z_i > 0\}}, \quad Z_i = x_i'\beta + \varepsilon_i, \quad \varepsilon_i \sim N(0, \sigma^2), \quad \text{for } i = 1, \ldots, n, \qquad (1)$$

where $\beta \in \mathbb{R}^p$ is the vector of regression parameters. In this formulation, $x_i'\beta$ is sometimes called the index function [11]. The marginal probability of a positive response with covariate vector x is given by

$$p(x) = \Pr(Y = 1; x) = \Pr(x'\beta + \varepsilon > 0) = 1 - \Phi(-x'\beta), \qquad (2)$$

where $\Phi(x)$ is the standard normal cumulative distribution function. Also,

$$\mathrm{Var}(Y; x) = p(x)\{1 - p(x)\} = \{1 - \Phi(-x'\beta)\}\Phi(-x'\beta). \qquad (3)$$
As a way of relating stimulus and response, the Probit model is a natural choice in situations in which an interpretation for a threshold approach is readily available. Examples include attitude measurement, assigning pass/fail gradings for an examination based on a mark cut–off, and categorization of illness severity based on an underlying continuous scale [10]. The Probit models first arose in connection with bioassay [4] – in toxicology experiments, for example, sets of test animals are subjected to different levels x of a toxin. The proportion p(x) of animals surviving at dose x can then be modeled as a function of x, following (2). The surviving proportion is increasing in the dose when β > 0 and it is decreasing in the dose when β < 0. Surveys of the toxicology literature on Probit modeling are included in [7] and [9].
Probit models belong to the wider class of generalized linear models [13]. This class also includes the logit models, arising when the random errors εi in (1) have a logistic distribution. Since the logistic distribution is similar to the normal except in the tails, whenever the binary response probability p belongs to (0.1, 0.9), it is difficult to discriminate between the logit and Probit functions solely on the grounds of goodness of fit. As Greene [11] remarks, ‘it is difficult to justify the choice of one distribution or another on theoretical grounds. . . in most applications, it seems not to make much difference’. Estimation of the Probit model is usually based on maximum likelihood methods. The nonlinear likelihood equations require an iterative solution; the Hessian is always negative definite, so the log-likelihood is globally concave. The asymptotic covariance matrix of the maximum likelihood estimator can be estimated by using an estimate of the expected Hessian [2], or with the estimator developed by Berndt et al. [3]. Windmeijer [18] provides a survey of the many goodness-of-fit measures developed for binary choice models, and in particular, for Probits. The maximum likelihood estimator in a Probit model is sometimes called a quasi-maximum likelihood estimator (QMLE) since the normal probability model may be misspecified. The QMLE is not consistent when the model exhibits any form of heteroscedasticity, nonlinear covariate effects, unmeasured heterogeneity, or omitted variables [11]. In this setting, White [17] proposed a robust ‘sandwich’ estimator for the asymptotic covariance matrix of the QMLE. As an alternative to maximum likelihood estimation, Albert and Chib [1] developed a framework for estimation of latent threshold models for binary data, using data augmentation. The univariate Probit is a special case of this class of models, and data augmentation can be implemented by means of Gibbs sampling. Under this framework, the class of Probit regression models can be extended by using mixtures of normal distributions to model the latent data. There is a large literature on the generalizations of the Probit model to the analysis of a variety of qualitative and limited dependent variables. For example, McKelvey and Zavoina [14] extend the Probit model to the analysis of ordinal dependent variables, while Tobin [16] discusses a class of models in which the dependent variable is limited in range. In particular,
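The maximum likelihood estimation described above is easy to carry out directly, since the log-likelihood implied by (1) and (2) is globally concave. The sketch below is ours, using simulated data and a general-purpose optimizer; in practice one would typically use a packaged Probit routine (for example in statsmodels), but writing out the likelihood makes the link to the latent-variable formulation explicit.

```python
import numpy as np
from scipy import optimize, stats

rng = np.random.default_rng(0)

# Simulate data from the latent-variable model (1): Z = X b + e, Y = 1{Z > 0}.
n = 500
beta_true = np.array([-0.5, 1.0])
X = np.column_stack([np.ones(n), rng.normal(size=n)])   # intercept + one covariate
y = (X @ beta_true + rng.normal(size=n) > 0).astype(float)

def neg_loglik(beta):
    """Negative Probit log-likelihood, with p(x) = 1 - Phi(-x'beta) as in (2)."""
    p = stats.norm.cdf(X @ beta)
    p = np.clip(p, 1e-10, 1 - 1e-10)                     # guard against log(0)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

fit = optimize.minimize(neg_loglik, x0=np.zeros(2), method="BFGS")
print(np.round(fit.x, 2))    # maximum likelihood estimate, close to beta_true
```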
the Probit model specified in (1) can be generalized by allowing the error terms εi to be correlated. This leads to a multivariate Probit model, useful for the analysis of clustered binary data (see Clustered Data). The multivariate Probit focuses on the conditional expectation given the cluster-level random effect, and thus it belongs to the class of cluster-specific approaches for modeling correlated data, as opposed to population-average approaches, of which the most common examples are the generalized estimating equations (GEE)-type methods [19]. The multivariate Probit model has several attractive features that make it particularly suitable for the analysis of correlated binary data. First, the connection to the Gaussian distribution allows for flexible modeling of the association structure and straightforward interpretation of the parameters. For example, the model is particularly attractive in marketing research of consumer choice because the latent correlations capture the cross-dependencies in latent utilities across different items. Also, within the class of cluster-specific approaches, the exchangeable multivariate Probit model is more flexible than other fully specified models (such as the beta-binomial), which use compound distributions to account for overdispersion in the data. This is due to the fact that both underdispersion and overdispersion can be accommodated in the multivariate Probit model through the flexible underlying covariance structure. Finally, due to the underlying threshold approach, the multivariate Probit model has the potential of extensions to the analysis of clustered mixed binary and continuous data or of multivariate binary data [12, 15]. Likelihood methods are one option for inference in the multivariate Probit model (see e.g., [5]), but they are computationally difficult due to the intractability of the expressions obtained by integrating out the latent variables. As an alternative, estimation can be done in a Bayesian framework [6, 8] where generic prior distributions may be employed to incorporate prior information. Implementation is usually done with Markov chain Monte Carlo methods – in particular, the Gibbs sampler is useful in models where some structure is imposed on the covariance matrix (e.g., exchangeability).
References

[1] Albert, J.H. & Chib, S. (1997). Bayesian analysis of binary and polychotomous response data, Journal of the American Statistical Association 88, 669–679.
[2] Amemiya, T. (1981). Qualitative response models: a survey, Journal of Economic Literature 19, 481–536.
[3] Berndt, E., Hall, B., Hall, R. & Hausman, J. (1974). Estimation and inference in nonlinear structural models, Annals of Economic and Social Measurement 3/4, 653–665.
[4] Bliss, C.I. (1935). The calculation of the dosage-mortality curve, Annals of Applied Biology 22, 134–167.
[5] Chan, J.S.K. & Kuk, A.Y.C. (1997). Maximum likelihood estimation for probit-linear mixed models with correlated random effects, Biometrics 53, 86–97.
[6] Chib, S. & Greenberg, E. (1998). Analysis of multivariate probit models, Biometrika 85, 347–361.
[7] Cox, D. (1970). Analysis of Binary Data, Methuen, London.
[8] Edwards, Y.D. & Allenby, G.M. (2003). Multivariate analysis of multiple response data, Journal of Marketing Research 40, 321–334.
[9] Finney, D. (1971). Probit Analysis, Cambridge University Press, Cambridge.
[10] Goldstein, H. (2003). Multilevel Statistical Models, 3rd Edition, Arnold Publishing, London.
[11] Greene, W.H. (2000). Econometric Analysis, 4th Edition, Prentice Hall, Englewood Cliffs.
[12] Gueorguieva, R.V. & Agresti, A. (2001). A correlated probit model for joint modelling of clustered binary and continuous responses, Journal of the American Statistical Association 96, 1102–1112.
[13] McCullagh, P. & Nelder, J.A. (1989). Generalised Linear Models, 2nd Edition, Chapman & Hall, London.
[14] McKelvey, R.D. & Zavoina, W. (1976). A statistical model for the analysis of ordinal level dependent variables, Journal of Mathematical Sociology 4, 103–120.
[15] Regan, M.M. & Catalano, P.J. (1999). Likelihood models for clustered binary and continuous outcomes: application to developmental toxicology, Biometrics 55, 760–768.
[16] Tobin, J. (1958). Estimation of relationships for limited dependent variables, Econometrica 26, 24–36.
[17] White, H. (1982). Maximum likelihood estimation of misspecified models, Econometrica 53, 1–1.
[18] Windmeijer, F. (1995). Goodness of fit measures in binary choice models, Econometric Reviews 14, 101–116.
[19] Zeger, S.L., Liang, K.Y. & Albert, P.S. (1988). Models for longitudinal data: a generalized estimating equations approach, Biometrics 44, 1049–1060.
CATALINA STEFANESCU, VANCE W. BERGER AND SCOTT L. HERSHBERGER
Procrustes Analysis
INGWER BORG
Volume 3, pp. 1610–1614 in Encyclopedia of Statistics in Behavioral Science
ISBN-13: 978-0-470-86080-9  ISBN-10: 0-470-86080-4
Editors: Brian S. Everitt & David C. Howell
John Wiley & Sons, Ltd, Chichester, 2005
Procrustes Analysis

The Ordinary Procrustes Problem

The ordinary Procrustes¹ problem is concerned with fitting a configuration of points X to a fixed target configuration Y as closely as possible. The purpose of doing this is to eliminate nonessential differences between X and Y. Consider an example. Assume that you have two multidimensional scaling (MDS) solutions with coordinate matrices

$$X = \begin{pmatrix} 0.07 & 2.62 \\ 0.93 & 3.12 \\ 1.93 & 1.38 \\ 1.07 & 0.88 \end{pmatrix} \quad\text{and}\quad Y = \begin{pmatrix} 1.00 & 2.00 \\ -1.00 & 2.00 \\ -1.00 & -2.00 \\ 1.00 & -2.00 \end{pmatrix}.$$

These solutions appear to be quite different, but are the differences of the point coordinates due to the data that these configurations represent, or are they a consequence of choosing particular coordinate systems for the MDS spaces? In MDS, it is essentially the ratios of the distances among the points that represent the meaningful properties of the data. Hence, all transformations that leave these ratios unchanged are admissible. We can therefore rotate, reflect, dilate, and translate MDS configurations in any way we like. So, if two or more MDS solutions are to be compared, it makes sense to first eliminate meaningless differences by such transformations. Figure 1 demonstrates that X can indeed be fitted to perfectly match Y. In this simple two-dimensional case, we could possibly see the equivalence of X and Y by carefully studying the plots, but, in general, such 'cosmetic' fittings are needed to avoid seeing differences that are unfounded.

Figure 1  Illustration of fitting X to Y by a similarity transformation (the points of X are labeled x1–x4 and those of Y are labeled y1–y4)

The Orthogonal Procrustes Problem

Procrustean procedures were first introduced in factor analysis because it frequently deals with relatively high-dimensional vector configurations that are hard to compare. Factor analysts almost always are interested in dimensions and, thus, comparing coordinate matrices is a natural research question to them. In addition, Procrustean methods can also be used in a confirmatory manner: one simply checks how well an empirical matrix of loadings X can be fitted to a hypothesized factor matrix Y.

In factor analysis, the group of admissible transformations is smaller than in MDS: only rotations and reflections are admissible, because here the points must be interpreted as end-points of vectors emanating from a fixed origin. The lengths of these vectors and the angles between any two vectors represent the data. Procrustean fittings of X to Y are, therefore, restricted to rotations and reflections of the point configuration X or, expressed algebraically, to orthogonal transformations of the coordinate matrix X. Assume that Y and X are both of order n × m. We want to fit X to Y by picking a best-possible matrix T out of the set of all orthogonal T. By 'best-possible' one typically means a T that minimizes the sum of the squared distances between corresponding points of the fitted X, $\hat{X} = XT$, and the target Y. This leads to the loss function $L = \mathrm{tr}\,[(Y - XT)'(Y - XT)]$ that must be minimized by picking a T that satisfies $T'T = TT' = I$. It can be shown that L is minimal if $T = QP'$, where $P\Phi Q'$ is the singular value decomposition (SVD) of $Y'X$ [8, 9, 12, 14].
Procrustean Similarity Transformations

We now extend the rotation/reflection task by also admitting all transformations that preserve the shape of X. Figure 1 illustrates a case for such a similarity transformation. Algebraically, we here have $\hat{X} = sXT + \mathbf{1}t'$, where s is a dilation factor, t a translation vector, T a rotation/reflection matrix, and 1 a column vector of ones. The steps to find the optimal similarity transformation are [13]:

(1) Compute $C = Y'JX$.
(2) Compute the SVD of $C = P\Phi Q'$.
(3) Compute the optimal rotation matrix as $T = QP'$.
(4) Compute the optimal dilation factor as $s = (\mathrm{tr}\, Y'JXT)/(\mathrm{tr}\, X'JX)$.
(5) Compute the optimal translation vector as $t = n^{-1}(Y - sXT)'\mathbf{1}$.

The matrix J is the 'centering' matrix, that is, $J = I - (1/n)\mathbf{1}\mathbf{1}'$, and 1 is a vector of n ones. To measure the extent of similarity between X and Y, the product-moment correlation coefficient r computed over the corresponding coordinates of the matrices is a proper index. Langeheine [10] provides norms that allow one to assess whether this index is better than can be expected when fitting random configurations.
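These five steps amount to a few lines of linear algebra. The sketch below is ours rather than part of the original entry; it applies the steps to the X and Y matrices of the introductory example, with numpy's singular value decomposition supplying step (2).

```python
import numpy as np

def similarity_procrustes(X, Y):
    """Fit X to Y by a similarity transformation, following steps (1)-(5)."""
    n = X.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n                       # centering matrix
    C = Y.T @ J @ X                                           # step (1)
    P, phi, Qt = np.linalg.svd(C)                             # step (2): C = P Phi Q'
    T = Qt.T @ P.T                                            # step (3): rotation/reflection
    s = np.trace(Y.T @ J @ X @ T) / np.trace(X.T @ J @ X)     # step (4): dilation factor
    t = (Y - s * X @ T).T @ np.ones(n) / n                    # step (5): translation vector
    return s * X @ T + np.outer(np.ones(n), t), s, T, t

X = np.array([[0.07, 2.62], [0.93, 3.12], [1.93, 1.38], [1.07, 0.88]])
Y = np.array([[1.0, 2.0], [-1.0, 2.0], [-1.0, -2.0], [1.0, -2.0]])
X_fitted, s, T, t = similarity_procrustes(X, Y)
print(np.round(X_fitted, 2))   # matches Y up to rounding of the coordinates
print(round(s, 2))             # dilation factor close to 2
```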
Special Issues

In practice, Y and X may not always have the same dimensionality. For example, if Y is a hypothesized loading pattern for complex IQ test items, one may only be able to theoretically predict a few dimensions, while X represents the empirical items in a higher-dimensional representation space. Technically, this poses no problem: all of the above matrix computations work if one appends columns of zeros to Y or X until both matrices have the same column order. A further generalization allows for an X with missing cells [6, 15, 7]. A typical application is one where some points represent previously investigated variables and the remaining variables are 'new' ones. We might then use the configuration from a previous study as a partial target for the present data to check structural replicability. A frequent practical issue is the case where the points of X and Y are not matched one-by-one but rather class-by-class. In this case, one may proceed by a more substantively motivated strategy, first averaging the coordinates of all points that belong to the same substantive class and then fit the matrices of centroids to each other. The transformations derived in this fitting can then be used to fit X to Y [1].

Robust Procrustean Analysis

In some cases, two configurations can be matched closely except for a few points. Using the usual sum-of-squares loss functions (see Least Squares Estimation), these points can cause severe misfits. To deal with this problem, Verboon and Heiser [17] proposed more robust forms of Procrustes analysis. They first decompose the misfit into the contribution of each distance between corresponding points of Y and X to the total loss, $L = \mathrm{tr}\,[(Y - XT)'(Y - XT)] = \sum_{i=1}^{n} (y_i - T'x_i)'(y_i - T'x_i) = \sum_{i=1}^{n} d_i^2$, where $y_i$ and $x_i$ denote rows i of Y and X, respectively. The influence of outliers on T is reduced by minimizing a variant of L, $L_f = \sum_{i=1}^{n} f(d_i)$, where f is a function that satisfies $f(d_i) < d_i^2$. One obvious example for f is simply $|d_i|$, which minimizes the disagreement of two configurations in terms of point-by-point distances rather than in terms of squared distances. Algorithms for minimizing $L_f$ with various robust functions can be found in [16].

Oblique Procrustean Analysis

Within the factor-analytic context, a particular form of Procrustean fitting has been investigated where T is only restricted to be a direction cosine matrix. This amounts to fitting X's column vectors to the corresponding column vectors of Y by individually rotating them about the origin. The main purpose for doing this is to fit a factor matrix to maximum similarity with a hypothesized factor pattern. The solutions proposed by Browne [3], Browne [5, 6], or [4] are formally interesting but not really needed anymore, because such testing can today be done more directly by structural equation modeling (SEM).
More than Two Configurations

Assume we have K different Xk and that we want to eliminate uninformative differences from them by similarity transformations. Expressed in terms of a loss function, this generalized Procrustes analysis amounts to the following: denote the fitted configurations by X̂k, define Z as the average of all X̂k's, $Z = (1/K)\sum_k \hat{X}_k$, and write the loss function as $L = \sum_{k=1}^{K} \mathrm{tr}\,[(\hat{X}_k - Z)'(\hat{X}_k - Z)]$. This function is minimized by iteratively fitting each of the Xk's and successively updating the centroid configuration Z. Geometrically, each of Z's points is the centroid of the corresponding points from the fitted individual configurations. Thus, L is small if these centroids lie somewhere in the middle of a tight cluster of K points, where each single point belongs to a different X̂k.
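A bare-bones version of this alternating scheme is sketched below; it is our illustration, not code from the original entry, and for brevity it restricts the fitting of each configuration to translations and rotations/reflections (the dilation step is omitted).

```python
import numpy as np

def orthogonal_fit(X, Z):
    """Rotate/reflect X to best match Z: T = QP', where Z'X = P Phi Q'."""
    P, _, Qt = np.linalg.svd(Z.T @ X)
    return X @ (Qt.T @ P.T)

def generalized_procrustes(configs, n_iter=20):
    """Iteratively fit each centered configuration to the centroid Z and update Z."""
    Xs = [X - X.mean(axis=0) for X in configs]      # remove translations first
    Z = sum(Xs) / len(Xs)
    for _ in range(n_iter):
        Xs = [orthogonal_fit(X, Z) for X in Xs]     # fit every X_k to the current Z
        Z = sum(Xs) / len(Xs)                       # update the centroid configuration
    loss = sum(np.sum((X - Z) ** 2) for X in Xs)    # L = sum_k tr((X_k - Z)'(X_k - Z))
    return Xs, Z, loss
```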
Procrustean Individual Differences Scaling

A rather obvious idea in Procrustean analysis of MDS spaces is to attempt to explain each Xk by a simple transform of Z (or of a fixed Y). The most common model is the 'dimensional salience' or INDSCAL model (see Three-mode Component and Scaling Methods). If you think of Z, in particular, as the space that characterizes a whole group of K individuals ('group space'), then each individual Xk may result from Z by differentially stretching or shrinking it along its dimensions. Figure 2 illustrates this notion. Here, the group space perfectly explains the two individual configurations if it is stretched by the factor 2 along dimension 2 and dimension 1, respectively. The weighted centroid configuration for individual k can be expressed as ZWk, where Wk is an m × m diagonal matrix of nonzero dimension weights. The Procrustean fitting problem requires one to generate both a dimensionally weighted target Z, ZWk, and the individual configurations Xk (k = 1, . . . , K) that are optimally fitted to this 'elastic target' by similarity transformations. The optimal Wk is easily found by regression methods.

Figure 2  A group space and two individual spaces related to it by dimensional weightings (panels: Group space, Individual space #1, Individual space #2; axes: Dim. 1 and Dim. 2)

Yet, there may be no particular reason for stretching Z along the given dimensions, and stretching it in other directions obviously can result in different shears of the configuration. Hence, in general, what is needed is also a rotation matrix S so that ZSWk optimally explains all K configurations Xk fitted to it. To find S, one may consider the individual case first, where Sk is an idiosyncratic rotation, that is, a different rotation Sk for each individual k. Hence, we want to find a rotation Sk and a set of dimensional weights Wk that transform the group space so that it optimally explains each (admissibly transformed) individual space. A direct solution for this problem is known only for the two-dimensional case (Lingoes & Borg [11]; Commandeur [7]). This solution can be used iteratively for each plane of the space. The average of all ZSk's is then used as a target to solve for ZS. In general, we thus obtain a group space that is uniquely rotated by S. However, this uniqueness is often not very strong in the sense that other rotations are not much worse in terms of fit. This means that one must be careful when interpreting the uniqueness of dimensions delivered by programs that optimize the dimension-weighting model only, not providing fit indices for the unweighted case as benchmarks. Another possibility to fit Z to each (admissibly transformed) Xk is to weight the rows of Z, that is, using the model VkZ. Geometrically, Vk acts on the points of Z, shifting them along their position vectors. Thus, these weights are called 'vector weights'. Except for special cases ('perspective model', see [2]), however, vector weighting is not such a compelling psychological model as dimension weighting. However, it may provide valuable index information. For example, if we find that an optimal fitting of Z to each individual configuration can be done only with weights varying considerably around +1, then it makes little sense to consider the centroid configuration Z as a structure that is common to all individuals. Finding optimal vector weights is simple in the 2D case, but simultaneously to find all transformations (Vk and all those on Xk) in higher-dimensional spaces appears intractable. Hence, to minimize the loss, we have to iterate over all planes of the space [11]. Finally, as in the idiosyncratic rotations in the dimension-weighting model, we can generalize the vector-weighting transformations to a model with an idiosyncratic origin. In other words, rather than fixing the perspective origin externally either at the centroid or at some other more meaningful point, it is also possible to leave it to the model to find an origin that maximizes the correspondence of an individual configuration and a transformed Z. Formally, using both dimension and vector weighting is also possible, but as a model this fitting becomes too complex to be of any use. The various individual differences models can be fitted by the program PINDIS [11], which is also integrated into the NewMDSX package. Langeheine [10] provides norms for fitting random configurations by unweighted Procrustean transformations and by all of the dimension and vector-weighting models.
Note

1. Procrustes is an innkeeper from Greek mythology who would fit his guests to his iron bed by stretching them or cutting their legs off.

References

[1] Borg, I. (1978). Procrustean analysis of matrices with different row order, Psychometrika 43, 277–278.
[2] Borg, I. & Groenen, P.J.F. (1997). Modern Multidimensional Scaling, Springer, New York.
[3] Browne, M.W. (1969a). On oblique Procrustes rotations, Psychometrika 34, 375–394.
[4] Browne, M.W. & Kristof, W. (1969b). On the oblique rotation of a factor matrix to a specified target, British Journal of Mathematical and Statistical Psychology 25, 115–120.
[5] Browne, M.W. (1972a). Oblique rotation to a partially specified target, British Journal of Mathematical and Statistical Psychology 25, 207–212.
[6] Browne, M.W. (1972b). Orthogonal rotation to a partially specified target, British Journal of Mathematical and Statistical Psychology 25, 115–120.
[7] Commandeur, J.J.F. (1991). Matching Configurations, DSWO Press, Leiden, The Netherlands.
[8] Cliff, N. (1966). Orthogonal rotation to congruence, Psychometrika 31, 33–42.
[9] Kristof, W. (1970). A theorem on the trace of certain matrix products and some applications, Journal of Mathematical Psychology 7, 515–530.
[10] Langeheine, R. (1982). Statistical evaluation of measures of fit in the Lingoes-Borg procrustean individual differences scaling, Psychometrika 47, 427–442.
[11] Lingoes, J.C. & Borg, I. (1978). A direct approach to individual differences scaling using increasingly complex transformations, Psychometrika 43, 491–519.
[12] Schönemann, P.H. (1966). A generalized solution of the orthogonal Procrustes problem, Psychometrika 31, 1–10.
[13] Schönemann, P.H. & Carroll, R.M. (1970). Fitting one matrix to another under choice of a central dilation and a rigid motion, Psychometrika 35, 245–256.
[14] Ten Berge, J.M. (1977). Orthogonal procrustes rotation for two or more matrices, Psychometrika 42, 267–276.
[15] Ten Berge, J.M.F., Kiers, H.A.L. & Commandeur, J.J.F. (1993). Orthogonal Procrustes rotation for matrices with missing values, British Journal of Mathematical and Statistical Psychology 46, 119–134.
[16] Verboon, P. (1994). A Robust Approach to Nonlinear Multivariate Analysis, DSWO Press, Leiden.
[17] Verboon, P. & Heiser, W.J. (1992). Resistant orthogonal Procrustes analysis, Journal of Classification 9, 237–256.

Further Reading

Carroll, J.D. & Chang, J.J. (1970). Analysis of individual differences in multidimensional scaling via an N-way generalization of Eckart-Young decomposition, Psychometrika 35, 283–320.
Gower, J.C. (1975). Generalized Procrustes analysis, Psychometrika 40, 33–51.
Horan, C.B. (1969). Multidimensional scaling: combining observations when individuals have different perceptual structures, Psychometrika 34, 139–165.
Kristof, W. & Wingersky, B. (1971). Generalization of the orthogonal Procrustes rotation procedure for more than two matrices, in Proceedings of the 79th Annual Convention of the American Psychological Association, Washington, DC, pp. 89–90.
Mulaik, S.A. (1972). The Foundations of Factor Analysis, McGraw-Hill, New York.

INGWER BORG
Projection Pursuit
WERNER STUETZLE
Volume 3, pp. 1614–1617 in Encyclopedia of Statistics in Behavioral Science
ISBN-13: 978-0-470-86080-9  ISBN-10: 0-470-86080-4
Editors: Brian S. Everitt & David C. Howell
John Wiley & Sons, Ltd, Chichester, 2005
Projection Pursuit

A time-honored method for detecting unanticipated 'structure' – clusters (see Cluster Analysis: Overview), outliers, skewness, concentration near a line or a curve – in bivariate data is to look at a scatterplot, using the ability of the human perceptual system for instantaneous pattern discovery. The question is how to bring this human ability to bear if the data are high-dimensional. Scanning all 45 pairwise scatterplots of a 10-dimensional data set already tests the limits of most observers' patience and attention span, and it is easy to construct examples where there is obvious structure in the data that will not be revealed in any of those plots. This fact is illustrated in Figures 1 and 2. Figure 1 shows a two-dimensional data set consisting of two clearly separated clusters. We added eight independent standard Gaussian 'noise' variables and then rotated the resulting 10-dimensional data set into a random orientation. Visual inspection of all 45 pairwise scatterplots of the resulting 10-dimensional data fails to reveal the clusters; the scatterplot which, subjectively, appears to be most structured is shown in Figure 2. However, we know that there do exist planes for which the projection is clustered; the question is how to find one.
Figure 1  Sample from a bivariate mixture of two Gaussians
Looking for Interesting Projections

The basic idea of Projection Pursuit, suggested by Kruskal [15] and first implemented by Friedman and Tukey [10], is to define a projection index I(u, v) measuring the degree of 'interestingness' of the projection onto the plane spanned by the (orthogonal) vectors u and v and then use numerical optimization to find a plane maximizing the index. A key issue is the choice of the projection index. Probably the most familiar projection index is the variance of the projected data. A plane maximizing this index can be found by linear algebra – it is spanned by the two largest principal components (see Principal Component Analysis). In our example, however, projection onto the largest principal components (Figure 3) does not show any clustering – variance is not necessarily a good measure of 'interestingness'. Instead, a better approach is to first sphere the data (transform it to have zero mean and unit covariance) and then use an index measuring the deviation of the projected data from a standard Gaussian distribution. This choice is motivated by two observations. First, if the data are multivariate Gaussian (see Catalogue of Probability Density Functions), then all projections will be Gaussian and Projection Pursuit will not find any interesting projections. This is good, because a multivariate Gaussian distribution is completely specified by its mean and covariance matrix, and there is nothing more to be found. Second, Diaconis and Freedman [3] have shown that under appropriate conditions most projections of multivariate data are (approximately) Gaussian, which suggests regarding non-Gaussian projections as interesting. Many projection indices measuring deviation from Gaussianity have been devised; see, for example [2, 11–14]. Figure 4 shows the projection of our simulated data onto a plane maximizing the 'holes' index [1]; the clusters are readily apparent.

Figure 2  Most 'structured' pairwise scatterplot

Figure 3  Projection onto largest principal components

Figure 4  Projection onto plane maximizing the 'holes' index

Example: The Swiss Banknote Data

The Swiss Banknote data set [4] consists of measurements of six variables (width of bank note; height on left side; height on right side; lower margin; upper margin; diagonal of inner box) on 100 genuine and 100 forged Swiss bank notes. Figure 5 shows a projection of the data onto the first two principal components. The genuine bank notes, labeled '+', are clearly separated from the false ones. Applying projection pursuit (with a Hermite index of order 7) results in the projection shown in Figure 6 (adapted from [14]). This picture (computed without use of the class labels) suggests that there are two distinct groups of forged notes, a fact that was not apparent from Figure 5.

Figure 5  Projection of Swiss Banknote data onto largest principal components
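A toy version of the search for an interesting projection can be written in a few lines. The sketch below is ours and deliberately crude: it uses the absolute excess kurtosis of a projection as a stand-in for the published indices, and a random search over single directions instead of a numerical optimizer over planes, but it illustrates the sphere-then-maximize-non-Gaussianity recipe described above on data like the simulated example.

```python
import numpy as np

rng = np.random.default_rng(2)

# Two well-separated 2D clusters plus eight Gaussian noise variables,
# rotated into a random orientation.
n = 200
clusters = np.vstack([rng.normal(-3, 1, (n, 2)), rng.normal(3, 1, (n, 2))])
data = np.hstack([clusters, rng.normal(size=(2 * n, 8))])
Q, _ = np.linalg.qr(rng.normal(size=(10, 10)))
data = data @ Q

# Sphere the data: zero mean and (approximately) unit covariance.
centered = data - data.mean(axis=0)
eigval, eigvec = np.linalg.eigh(np.cov(centered, rowvar=False))
sphered = centered @ eigvec / np.sqrt(eigval)

def index(u):
    """Crude non-Gaussianity index: absolute excess kurtosis of the projection."""
    p = sphered @ u
    return abs(np.mean(p ** 4) - 3.0)

best_u, best_val = None, -np.inf
for _ in range(5000):                    # random search over unit directions
    u = rng.normal(size=10)
    u /= np.linalg.norm(u)
    val = index(u)
    if val > best_val:
        best_u, best_val = u, val
print(best_val)   # the winning direction typically separates the two clusters
```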
Figure 6  Projection of Swiss Banknote data onto plane maximizing the 'Hermite 7' index

Projection Pursuit Modeling

In general there may be multiple interesting views of the data, possibly corresponding to multiple local maxima of the projection index. This suggests using multiple starting values for the nonlinear optimization, such as planes in random orientation (see Optimization Methods). A more principled approach is to remove the structure revealed in consecutive solution projections, thereby deflating the corresponding local maxima of the index. In the case where a solution projection shows multiple clusters, structure can be removed by partitioning the data set and recursively applying Projection Pursuit to the individual clusters. The idea of alternating between Projection Pursuit and structure removal was developed into a general projection pursuit paradigm for multivariate analysis by Friedman and Stuetzle [9]. The Projection Pursuit paradigm has been applied to density estimation [6, 8, 12, 13], regression [7], and classification [5].
Software

Projection Pursuit is one of the many tools for visualizing and analyzing multivariate data that together make up the Ggobi Data Visualization System. Ggobi is distributed under an AT&T open source license. A self-installing Windows binary or Linux/Unix versions as well as accompanying documentation can be downloaded from www.ggobi.org.
References

[1] Cook, D., Buja, A. & Cabrera, J. (1993). Projection pursuit indexes based on orthonormal function expansions, Journal of Computational and Graphical Statistics 2(3), 225–250.
[2] Cook, D., Buja, A., Cabrera, J. & Hurley, C. (1995). Grand tour and projection pursuit, Journal of Computational and Graphical Statistics 4(3), 155–172.
[3] Diaconis, P. & Freedman, D. (1984). Asymptotics of graphical projection pursuit, Annals of Statistics 12, 793–815.
[4] Flury, B. & Riedwyl, H. (1981). Graphical representation of multivariate data by means of asymmetrical faces, Journal of the American Statistical Association 76, 757–765.
[5] Friedman, J.H. (1985). Classification and multiple regression through projection pursuit, Technical Report LCS-12, Department of Statistics, Stanford University.
[6] Friedman, J.H. (1987). Exploratory projection pursuit, Journal of the American Statistical Association 82, 249–266.
[7] Friedman, J.H. & Stuetzle, W. (1981). Projection pursuit regression, Journal of the American Statistical Association 76, 817–823.
[8] Friedman, J.H., Stuetzle, W. & Schroeder, A. (1984). Projection pursuit density estimation, Journal of the American Statistical Association 79, 599–608.
[9] Friedman, J.H. & Stuetzle, W. (1982). Projection pursuit methods for data analysis, in Modern Data Analysis, R.L. Launer & A.F. Siegel, eds, Academic Press, New York, pp. 123–147.
[10] Friedman, J.H. & Tukey, J.W. (1974). A projection pursuit algorithm for exploratory data analysis, IEEE Transactions on Computers C-23, 881–890.
[11] Hall, P. (1989). Polynomial projection pursuit, Annals of Statistics 17, 589–605.
[12] Huber, P.J. (1985). Projection pursuit, Annals of Statistics 13, 435–525.
[13] Jones, M.C. & Sibson, R. (1987). What is projection pursuit? (with discussion), Journal of the Royal Statistical Society, Series A 150, 1–36.
[14] Klinke, S. (1995). Exploratory projection pursuit – the multivariate and discrete case, in Proceedings of NTTS '95, W. Kloesgen, P. Nanopoulos & A. Unwin, eds, Bonn, pp. 247–262.
[15] Kruskal, J.B. (1969). Towards a practical method which helps uncover the structure of a set of observations by finding the line transformation which optimizes a new "index of condensation", in Statistical Computation, R.C. Milton & J.A. Nelder, eds, Academic Press, New York, pp. 427–440.
(See also Hierarchical Clustering; k-means Analysis; Minimum Spanning Tree; Multidimensional Scaling)

WERNER STUETZLE
Propensity Score
RALPH B. D'AGOSTINO JR
Volume 3, pp. 1617–1619 in Encyclopedia of Statistics in Behavioral Science
ISBN-13: 978-0-470-86080-9  ISBN-10: 0-470-86080-4
Editors: Brian S. Everitt & David C. Howell
John Wiley & Sons, Ltd, Chichester, 2005
Propensity Score

Observational studies occur frequently in behavioral research. In these studies, investigators have no control over the treatment assignment. Therefore, large differences on observed covariates in the two groups may exist, and these differences could lead to biased estimates of treatment effects. The propensity score for an individual, defined as the conditional probability of being treated given the individual's covariates, can be used to balance the covariates in the two groups, and thus reduce this bias. In a randomized experiment, the randomization of units (i.e., subjects) to different treatments guarantees that on average there should be no systematic differences in observed or unobserved covariates (i.e., bias) between units assigned to the different treatments (see Clinical Trials and Intervention Studies). However, in a nonrandomized observational study, investigators have no control over the treatment assignment, and therefore direct comparisons of outcomes from the treatment groups may be misleading. This difficulty may be partially avoided if information on measured covariates is incorporated into the study design (e.g., through matched sampling (see Matching)) or into estimation of the treatment effect (e.g., through stratification or covariance adjustment). Traditional methods of adjustment (matching, stratification, and covariance adjustment) are often limited since they can only use a limited number of covariates for adjustment. However, propensity scores, which provide a scalar summary of the covariate information, do not have this limitation. Formally, the propensity score for an individual is the probability of being treated conditional on (or based only on) the individual's covariate values. Intuitively, the propensity score is a measure of the likelihood that a person would have been treated using only their covariate scores. The propensity score is a balancing score and can be used in observational studies to reduce bias through the adjustment methods mentioned above.

Definition

With complete data, the propensity score for subject i (i = 1, . . . , N) is the conditional probability of assignment to a particular treatment (Zi = 1) versus control (Zi = 0) given a vector of observed covariates, xi:

$$e(x_i) = \mathrm{pr}(Z_i = 1 \mid X_i = x_i), \qquad (1)$$

where it is assumed that, given the Xs, the Zi are independent:

$$\mathrm{pr}(Z_1 = z_1, \ldots, Z_N = z_N \mid X_1 = x_1, \ldots, X_N = x_N) = \prod_{i=1}^{N} e(x_i)^{z_i}\,\{1 - e(x_i)\}^{1 - z_i}. \qquad (2)$$
The propensity score is the 'coarsest function' of the covariates that is a balancing score, where a balancing score, b(X), is defined as 'a function of the observed covariates X such that the conditional distribution of X given b(X) is the same for treated (Z = 1) and control (Z = 0) units' [2]. For a specific value of the propensity score, the difference between the treatment and control means for all units with that value of the propensity score is an unbiased estimate of the average treatment effect at that propensity score if the treatment assignment is strongly ignorable given the covariates. Thus, matching, subclassification (stratification), or regression (covariance) adjustment on the propensity score tends to produce unbiased estimates of the treatment effects when treatment assignment is strongly ignorable. Treatment assignment is considered strongly ignorable if the treatment assignment, Z, and the response, Y, are known to be conditionally independent given the covariates, X (i.e., when Y ⊥ Z|X). When covariates contain no missing data, the propensity score can be estimated using discriminant analysis or logistic regression. Both of these techniques lead to estimates of probabilities of treatment assignment conditional on observed covariates. Formally, the observed covariates are assumed to have a multivariate normal distribution (conditional on Z) when discriminant analysis is used, whereas this assumption is not needed for logistic regression. A frequent question is, 'Why must one estimate the probability that a subject receives a certain treatment since it is known for certain which treatment was given?' An answer to this question is that if one uses the probability that a subject would have been treated (i.e., the propensity score) to adjust the estimate of the treatment effect, one can create a 'quasirandomized' experiment; that is, if two subjects are found, one in the treated group and one in the control, with the same propensity score, then one could imagine that these two subjects were 'randomly' assigned to each group in the sense of being equally likely to be treated or controlled. In a controlled experiment, the randomization, which assigns pairs of individuals to the treated and control groups, is better than this because it does not depend on the investigator conditioning on a particular set of covariates; rather, it applies to any set of observed or unobserved covariates. Although the results of using the propensity scores are conditional only on the observed covariates, if one has the ability to measure many of the covariates that are believed to be related to the treatment assignment, then one can be fairly confident that approximately unbiased estimates for the treatment effect can be obtained.
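In practice, the estimation step usually amounts to a routine logistic regression of treatment status on the observed covariates. The sketch below is ours (simulated data, scikit-learn used for convenience); it estimates the scores and then checks covariate balance within propensity-score quintiles, in the spirit of the stratification approach discussed below.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

# Simulated observational data: treatment assignment depends on two covariates.
n = 2000
X = rng.normal(size=(n, 2))
p_treat = 1 / (1 + np.exp(-(0.8 * X[:, 0] - 0.5 * X[:, 1])))
z = rng.binomial(1, p_treat)

# Estimate the propensity score e(x) = pr(Z = 1 | X = x) by logistic regression.
e_hat = LogisticRegression().fit(X, z).predict_proba(X)[:, 1]

# Within propensity-score quintiles, treated and control covariate means
# should be roughly balanced.
quintile = np.digitize(e_hat, np.quantile(e_hat, [0.2, 0.4, 0.6, 0.8]))
for q in range(5):
    stratum = quintile == q
    diff = X[stratum & (z == 1)].mean(axis=0) - X[stratum & (z == 0)].mean(axis=0)
    print(q, np.round(diff, 2))
```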
Uses of Propensity Scores

Currently in observational studies, propensity scores are used primarily to reduce bias and increase precision. The three most common techniques that use the propensity score are matching, stratification (also called subclassification), and regression adjustment. Each of these techniques is a way to make an adjustment for covariates prior to (matching and stratification) or while (stratification and regression adjustment) calculating the treatment effect. With all three techniques, the propensity score is calculated the same way, but once it is estimated, it is applied differently. Propensity scores are useful for these techniques because by definition the propensity score is the conditional probability of treatment given the observed covariates, e(X) = pr(Z = 1|X), which implies that Z and X are conditionally independent given e(X). Thus, subjects in treatment and control groups with equal (or nearly equal) propensity scores will tend to have the same (or nearly the same) distributions on their background covariates. Exact adjustments made using the propensity score will, on average, remove all of the bias in the background covariates. Therefore, bias-removing adjustments can be made using the propensity scores rather than all of the background covariates individually.
Summary

Propensity scores are being widely used in statistical analyses in many applied fields. The propensity score methodology appears to produce the greatest benefits when it can be incorporated into the design stages of studies (through matching or stratification). These benefits include providing more precise estimates of the true treatment effects as well as saving time and money. This savings results from being able to avoid recruitment of subjects who may not be appropriate for particular studies. In addition, it is important to note that propensity scores can (and often should) be used in addition to traditional methods of analysis rather than in place of these other methods. The propensity score should be thought of as an additional tool available to the investigators as they try to estimate the effects of treatments in studies. Further references on the propensity score can be found in D'Agostino Jr [1] where applied illustrations are presented and in Rosenbaum and Rubin [2] where the theoretical properties of the propensity score are developed in depth.
References

[1] D'Agostino Jr, R.B. (1998). Propensity score methods for bias reduction in the comparison of a treatment to a non-randomized control group, Statistics in Medicine 17, 2265–2281.
[2] Rosenbaum, P.R. & Rubin, D.B. (1983). The central role of the propensity score in observational studies for causal effects, Biometrika 70, 41–55.
RALPH B. D’AGOSTINO JR
Prospective and Retrospective Studies
JANICE MARIE DYKACZ
Volume 3, pp. 1619–1621 in Encyclopedia of Statistics in Behavioral Science
ISBN-13: 978-0-470-86080-9  ISBN-10: 0-470-86080-4
Editors: Brian S. Everitt & David C. Howell
John Wiley & Sons, Ltd, Chichester, 2005
Prospective and Retrospective Studies

In mathematical or statistical applications to 'real-world' problems, the researcher often classifies variables as either x-variables (explanatory variables, factors, or covariates) or y-variables (response or outcome variables). Often the important research question is if and how an x-variable affects a y-variable. As an example, in a medical study, the x-variable may be smoking status (smoker or nonsmoker), and the y-variable may be lung cancer status (present or absent). Does smoking status affect the probability of acquiring lung cancer? There is a built-in direction between the x- and y-variables; x affects y. Mathematically, y is a function of x. In a prospective study, the values of x are set or specified. Individuals (or perhaps units such as families or experimental animals) are then followed over time, and the value of y is determined. For example, individuals who do not have lung cancer are classified as smokers or nonsmokers (the x-variable) and followed over a period of years. The occurrence of lung cancer (the y-variable) is determined for the smokers and nonsmokers. By contrast, in a retrospective study, the values of y are set or specified. We examine the histories of individuals in the study, and the values of the x-variable are then determined. For example, lung cancer cases (those who have lung cancer) and controls (those who do not have lung cancer) are selected, the histories of these individuals are obtained, and then their smoking status is determined. In short, a prospective study starts with values of the x-variable; a retrospective study starts with values of the y-variable. For both types of studies, we would like to know if x influences y. Prospective studies are sometimes called longitudinal or cohort studies. Retrospective studies are sometimes called case-control studies.

Table 1 represents a hypothetical example of a prospective study. The x-variable has two values, Exposed (x = 1) and Not exposed (x = 0). The study starts with 1000 individuals with the x-variable = 1 (Exposed) and 1000 individuals with the x-variable = 0 (Not exposed). Individuals are followed over time, and at some point the number of individuals who acquire the disease is counted. We determine the number of those with the y-variable = 1 (Disease Present) and y = 0 (Disease Absent).

Table 1  Prospective study

                      Disease present (y = 1)   Disease absent (y = 0)   Total
Exposed (x = 1)                100                        900             1000
Not exposed (x = 0)             10                        990             1000

In a prospective study, the relative risk can be directly calculated. In the example above, the probability that individuals acquire the disease given that they are exposed is 100/1000 = 0.1. The probability that individuals acquire the disease given that they are not exposed is 10/1000 = 0.01. The relative risk is the ratio of these two probabilities. Comparing those exposed with those not exposed, the relative risk is 0.1/0.01 = 10. Thus, those exposed are ten times more likely to acquire the disease than those not exposed.

Table 2 represents a hypothetical example of a retrospective study. The y-variable has two values, Disease Present (y = 1) and Disease Absent (y = 0). Those in the group with Disease Present are called cases, and those in the group with Disease Absent are called controls. The study starts with 110 individuals having y-variable = 1 (the cases), and with 1890 individuals having y-variable = 0 (the controls). We examine the histories of these individuals and determine whether the individual was exposed or not. That is, we count those with the x-variable = 1 (Exposed) and x = 0 (Not exposed). In a retrospective study, the relative risk cannot be directly calculated, but the odds ratio can be calculated. In the example above, for those exposed, the odds of those acquiring the disease relative to those not acquiring the disease are 100/900. For those not exposed, the odds of those acquiring the disease relative to those not acquiring the disease are 10/990.

Table 2  Retrospective study

                      Disease present (y = 1)   Disease absent (y = 0)
Exposed (x = 1)                100                        900
Not exposed (x = 0)             10                        990
Total                          110                       1890
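The two summary measures can be checked with a few lines of arithmetic; the snippet below is simply an illustration of the calculations described above, using the counts from Tables 1 and 2.

```python
# Counts shared by Tables 1 and 2 (exposure by disease status).
exposed_disease, exposed_no_disease = 100, 900
unexposed_disease, unexposed_no_disease = 10, 990

# Prospective study: risks and the relative risk.
risk_exposed = exposed_disease / (exposed_disease + exposed_no_disease)          # 0.10
risk_unexposed = unexposed_disease / (unexposed_disease + unexposed_no_disease)  # 0.01
print("relative risk:", risk_exposed / risk_unexposed)                           # 10.0

# Retrospective study: odds and the odds ratio.
odds_exposed = exposed_disease / exposed_no_disease          # 100/900
odds_unexposed = unexposed_disease / unexposed_no_disease    # 10/990
print("odds ratio:", odds_exposed / odds_unexposed)          # 11.0
```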
The odds ratio is the ratio of these two fractions and is (100/900)/(10/990) or 11. Recall that the same numbers in the table were used for the example of a prospective study, and the relative risk was 10. When the disease is rare, the odds ratio is used as an approximation to the relative risk. In a prospective study, the relative risk is often used to measure the association between x and y. However, if a logistic regression model is used to determine the effect of x on y, the association between x and y can be described by the odds ratio, which occurs naturally in this model. In a retrospective study, the odds ratio is often used to measure the association between x and y. However, the odds ratio is sometimes used as an approximation to the relative risk. Prospective studies can be expensive and time consuming because individuals or units are followed over time to determine the value of the y-variable (for example, whether an individual acquires the disease or not). However, prospective studies provide a direct estimate of relative risk and a direct estimate of the probability of acquiring the disease, given the value of the x-variable (that is, for those exposed and for those not exposed). In a prospective study, many y-variables (diseases) can be examined. Examples of prospective studies in the medical literature are found in [1, 2, 5]. By contrast, a retrospective study is generally less expensive. Retrospective studies are often used to study a rare disease because very few cases of the disease might be found in a prospective study. However, the odds ratio, and not the relative risk, is directly estimated. Only one y-variable can be studied because individuals are selected at the beginning of the study, for say, y = 1 (Disease Present) and y = 0 (Disease Absent). Because the x-variable is determined by looking into the past, there are concerns about the recollections of individuals. Examples of retrospective studies in the medical literature are found in [3, 4]. Related types of studies are cross-sectional studies and randomized clinical trials (RCT). In a cross-sectional study, the values of an x-variable and a
y-variable are measured for individuals or units at a point in time. Individuals are not followed over time, so, for example, the probability of acquiring a disease cannot be measured. Cross-sectional studies are used to determine the magnitude of a problem or to assess the nature of associations between the x- and y-variables. Randomized clinical trials are similar to prospective studies, but individuals are randomly assigned values of the x-variable (for example, an individual is randomly assigned to receive a drug, x = 1, or not, x = 0). Individuals are followed over time to determine whether they acquire a disease (y = 1) or not (y = 0). Randomization generally guarantees that individuals are alike except for the value of their x-variable (drug or no drug).
References

[1] Calle, E.E., Rodriquez, C., Walker-Thurmond, K. & Thun, M.J. (2003). Overweight, obesity, and mortality from cancer in a prospectively studied cohort of U.S. adults, The New England Journal of Medicine 348(17), 1625–1638.
[2] Iribarren, C., Tekawa, I.S., Sidney, S. & Friedman, G.D. (1999). Effect of cigar smoking on the risk of cardiovascular disease, chronic obstructive pulmonary disease, and cancer in men, The New England Journal of Medicine 340(23), 1773–80.
[3] Karagas, M.R., Stannard, V.A., Mott, L.A., Slattery, M.J., Spencer, S.K. & Weinstock, M.A. (2002). Use of tanning devices and risk of basal cell and squamous cell skin cancers, Journal of the National Cancer Institute 94(3), 224–6.
[4] Koepsell, T., McCloskey, L., Wolf, M., Moudon, A.V., Buchner, D., Kraus, J. & Patterson, M. (2002). Crosswalk markings and the risk of pedestrian-motor vehicle collisions in older pedestrians, JAMA 288(17), 2136–2143.
[5] Mukamal, K.J., Conigrave, K.M., Mittleman, M.A., Camargo Jr, C.A., Stampfer, M.J., Willett, W.C. & Rimm, E.B. (2003). Roles of drinking pattern and type of alcohol consumed in coronary heart disease in men, The New England Journal of Medicine 348(2), 109–118.
JANICE MARIE DYKACZ
Proximity Measures
HANS-HERMANN BOCK
Volume 3, pp. 1621–1628 in Encyclopedia of Statistics in Behavioral Science
ISBN-13: 978-0-470-86080-9  ISBN-10: 0-470-86080-4
Editors: Brian S. Everitt & David C. Howell
John Wiley & Sons, Ltd, Chichester, 2005
Proximity Measures
Introduction: Definitions and Examples

Proximity measures characterize the similarity or dissimilarity that exists between the objects, items, stimuli, or persons that underlie an empirical study. In contrast to cases where we distinguish only between 'similar' and 'dissimilar' objects i and j (in a binary, qualitative way), proximities measure the degree of similarity by a real number sij, typically between 0 and 1: the larger the value sij, the larger the similarity between i and j, and sij = 1 means maximum similarity. A dual approach measures the dissimilarity between i and j by a numerical value dij ≥ 0 (with 0 the minimum dissimilarity). In practice, there are many ways to find appropriate values sij or dij, for example,

(a) sij the degree of friendship between two persons i and j,
(b) sij the relative frequency of common descriptors shared by two documents i and j,
(c) sij the (relative) number of symptoms that are shared by two patients i and j,
(d) dij the road distance (or transportation costs) between two cities i and j,
(e) dij the Euclidean distance between two (data) points xi and xj in ℝ^p,
(f) dij the number of nonmatching scores in the results of two test persons i and j.

Given a set X = {1, . . . , n} of n objects, the corresponding similarity matrix S = (sij) of size n × n (Table 1), or its dual, a dissimilarity matrix D = (dij) (Table 2), provides quantitative information on the overall similarity structure of the set of objects, items, and so on. Such information is the basis for various statistical and data analytical techniques, in particular:

– for a detailed analysis of the network of similarities among selected (classes of) individuals,
– for an illustrative visualization of the similarity structure by graphical displays, and
– for clustering the objects of X into homogeneous classes of mutually similar elements.

Table 1  A 6 × 6 similarity matrix

        1.0  0.8  0.4  0.7  0.1  0.3
        0.8  1.0  0.5  0.9  0.2  0.1
S =     0.4  0.5  1.0  0.3  0.7  0.8
        0.7  0.9  0.3  1.0  0.1  0.2
        0.1  0.2  0.7  0.1  1.0  0.5
        0.3  0.1  0.8  0.2  0.5  1.0

Table 2  A 6 × 6 dissimilarity matrix

         0.0   3.0   8.0   1.5  10.5   6.0
         3.0   0.0   5.0   1.5   7.5   3.0
D =      8.0   5.0   0.0   6.5   2.5   2.0
         1.5   1.5   6.5   0.0   9.0   4.5
        10.5   7.5   2.5   9.0   0.0   4.5
         6.0   3.0   2.0   4.5   4.5   0.0
(1)
(iii) d(x, y) = d(y, x) are fulfilled. d is called a pseudometric if, additionally, the triangle inequality (iv) d(x, y) ≤ d(x, z) + d(z, y)
(2)
for all x, y, z ∈ X is fulfilled. A pseudometric with [(v)d(x, y) = 0 holds only for x = y] (definiteness) is called a metric or a distance measure. Note that in practice, the term ‘distance’ is often used for a dissimilarity of whatever type. Analogous conditions may be formulated for characterizing a similarity measure S = (sij ): (i ) 0 ≤ sij ≤ 1
matrix
3.0 0.0 5.0 1.5 7.5 3.0
(ii ) sii = 1
(iii ) sij = sj i (3)
for all i, j .
Commonly Used Proximity Measures Depending on the practical situation and the data, various different proximity measures can be selected.
This selection must match the underlying or intended similarity concept from, for example, psychology, medicine, or documentation. In the following we list some formal definitions which are commonly used in statistics and data analysis (see [2, 11, 10, 9]).
Dissimilarities

If x = (x_1, . . . , x_p) and y = (y_1, . . . , y_p) are two elements of IR^p, for example, two data points with p real-valued components, the Euclidean distance

d(x, y) := ||x − y||_2 := ( Σ_{ℓ=1}^{p} (x_ℓ − y_ℓ)² )^{1/2}    (4)

and the Manhattan distance

d(x, y) := ||x − y||_1 := Σ_{ℓ=1}^{p} |x_ℓ − y_ℓ|    (5)

provide dissimilarity measures that are symmetric in all p components. If the components provide different contributions for the intended similarity concept, we may use weighted versions such as

d(x, y) := ( Σ_{ℓ=1}^{p} w_ℓ (x_ℓ − y_ℓ)² )^{1/2}   and   d(x, y) := Σ_{ℓ=1}^{p} w_ℓ |x_ℓ − y_ℓ|    (6)

with suitable weights w_1, . . . , w_p > 0. The Mahalanobis distance

d_M(x, y) := ||x − y||_{Σ⁻¹} := ( (x − y)′ Σ⁻¹ (x − y) )^{1/2} = ( Σ_{ℓ=1}^{p} Σ_{s=1}^{p} (x_ℓ − y_ℓ) σ^{ℓs} (x_s − y_s) )^{1/2}    (7)

takes into account an eventual dependence between the p components that is summarized by a corresponding covariance matrix Σ = (σ_ij)_{p×p} (theoretical, empirical, within/between classes, etc.) and its inverse Σ⁻¹ = (σ^{ℓs}). In fact, (7) is the Euclidean distance d_M(x, y) = ||x̃ − ỹ||_2 of the linearly transformed data points x̃ = Σ^{−1/2} x and ỹ = Σ^{−1/2} y ∈ IR^p in a space of principal components.

In the case of qualitative data, each component x_ℓ of the data vector x = (x_1, . . . , x_p) refers to a property such as nationality (French, German, Italian, . . .), color (red, blue, yellow, . . .), or material (iron, copper, zinc, . . .) and takes its 'values' in a finite set Z_ℓ of categories or alternatives. In this case we measure the dissimilarity between two data configurations x = (x_1, . . . , x_p) and y = (y_1, . . . , y_p) typically by the number of nonmatching components, that is, the Hamming distance

d_H(x, y) := #{ ℓ | x_ℓ ≠ y_ℓ }.    (8)
From a wealth of other dissimilarity measures we mention just the Levenshtein distance d_L that measures, for example, the dissimilarity of two words (DNA strands, messages, . . .) x = (x_1, . . . , x_p) and y = (y_1, . . . , y_q): Here d_L(x, y) is the minimum number of operations (deletion of a letter, insertion of a letter, substitution of a letter) that is needed in order to transform the word x into the other one y (see [14]).
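The quantitative dissimilarities (4)–(8) are straightforward to compute. A minimal Python/NumPy sketch, with invented example vectors, weights, and covariance matrix (none of them from this entry), illustrates the Euclidean, Manhattan, weighted, Mahalanobis, and Hamming distances.

import numpy as np

x = np.array([2.0, 5.0, 1.0])
y = np.array([4.0, 2.0, 1.5])
w = np.array([1.0, 0.5, 2.0])                       # component weights w_l > 0

d_euclid = np.sqrt(np.sum((x - y) ** 2))            # eq. (4)
d_manhattan = np.sum(np.abs(x - y))                 # eq. (5)
d_weighted = np.sqrt(np.sum(w * (x - y) ** 2))      # eq. (6), quadratic version

# Mahalanobis distance (7): needs a covariance matrix, here an arbitrary example
Sigma = np.array([[2.0, 0.3, 0.0],
                  [0.3, 1.0, 0.2],
                  [0.0, 0.2, 0.5]])
diff = x - y
d_mahalanobis = np.sqrt(diff @ np.linalg.inv(Sigma) @ diff)

# Hamming distance (8) for two categorical vectors: number of nonmatching components
u = np.array(["red", "iron", "French", "blue"])
v = np.array(["red", "zinc", "French", "yellow"])
d_hamming = int(np.sum(u != v))                     # = 2 here

print(d_euclid, d_manhattan, d_weighted, d_mahalanobis, d_hamming)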
Similarity Measures

Quite generally, a similarity measure s can be obtained from a dissimilarity measure d by a decreasing function h such as, for example, s = h(d) = 1 − e^{−d} or s = (d_0 − d)/d_0 (where d_0 is the maximum observed dissimilarity value). However, a range of special definitions has been proposed for binary data where each component takes only two alternatives 1 and 0 (e.g., female/male, yes/no, present/absent) such that Z_ℓ = {0, 1} for all ℓ. In this case, the matching properties of two data vectors x and y are typically summarized in a 2 × 2 table (see Table 3) where the integers a, b, c, d (with a + b + c + d = p) denote the number of components of x and y with x_ℓ = y_ℓ = 0 (negative matching), (x_ℓ, y_ℓ) = (0, 1) and (x_ℓ, y_ℓ) = (1, 0) (mismatches), and x_ℓ = y_ℓ = 1 (positive matching), respectively. An example is given in Table 4.

Table 3   2 × 2 matching table

x\y      0        1
0        a        b        a + b
1        c        d        c + d
         a + c    b + d    p

Table 4   Binary data vectors x, y with n = 10, a = 4, b = 1, c = 2, d = 3

x = (0011011001)
y = (0011010010)

In this notation, the following similarity measures have been proposed (see, e.g., [2]):

•  the simple matching coefficient

   s_M(x, y) := (a + d) / (a + b + c + d)

•  the Tanimoto coefficient

   s_T(x, y) = (a + d) / (a + 2(b + c) + d)

that are both symmetric with respect to the categories 1 and 0. On the other hand, there are applications where the simultaneous absence of an attribute in both x and y bears no similarity information. Then we may use asymmetric measures such as

•  the Jaccard or S coefficient

   s_S(x, y) := d / (d + b + c).

All these measures have the form s(x, y) = (d + τa)/(d + τa + σ(b + c)) with suitable factors σ, τ.
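As an illustration, the following Python sketch (written for this exposition) recomputes the matching counts a, b, c, d for the vectors of Table 4 and evaluates the three coefficients just listed.

import numpy as np

x = np.array([0, 0, 1, 1, 0, 1, 1, 0, 0, 1])
y = np.array([0, 0, 1, 1, 0, 1, 0, 0, 1, 0])

a = int(np.sum((x == 0) & (y == 0)))   # negative matches -> 4
b = int(np.sum((x == 0) & (y == 1)))   # mismatches (0, 1) -> 1
c = int(np.sum((x == 1) & (y == 0)))   # mismatches (1, 0) -> 2
d = int(np.sum((x == 1) & (y == 1)))   # positive matches -> 3
p = a + b + c + d                      # 10

s_M = (a + d) / p                      # simple matching coefficient: 0.7
s_T = (a + d) / (a + 2 * (b + c) + d)  # Tanimoto coefficient: 7/13
s_S = d / (d + b + c)                  # Jaccard (S) coefficient: 0.5

print(a, b, c, d, s_M, s_T, s_S)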
Mixed Data

Problems are faced when considering mixed data where some components of the data vectors x and y are qualitative (categorical or ordinal) and some are quantitative (numerical) (see Scales of Measurement). In this case, weighted averages of the type

d(x, y) := α_1 d_1(x′, y′) + α_2 d_2(x″, y″) + α_3 d_3(x‴, y‴)    (9)

are typically used where d_1(x′, y′), d_2(x″, y″), d_3(x‴, y‴) are partial dissimilarities defined for the quantitative, categorical, and ordinal parts (x′, y′), (x″, y″), (x‴, y‴) of (x, y) whereas α_1, α_2, and α_3 are suitable weights (see [12]). A main problem here is to find a weighting scheme that guarantees the comparability of, and balance between, the three different types of components.

Inequalities for Dissimilarity Measures

All practically relevant definitions of a dissimilarity measure d between elements x, y of a set X share various common-sense properties in analogy to, for example, a geographical distance (in the plane or on the sphere). Some of these properties are expressed by inequalities that relate the dissimilarities of several pairs of objects (such as the triangle inequality (2)) and are motivated by graphical visualizations that are possible just for a special dissimilarity type (see the section on 'Visualization of Dissimilarity Structures').

Ultrametrics and the Ultrametric Inequality

A dissimilarity measure d on a set X is called an ultrametric or an ultrametric distance if and only if it fulfills the ultrametric inequality

d(x, y) ≤ max {d(x, z), d(z, y)}   for all x, y, z in X.    (10)

Ultrametric distances are related to hierarchical classifications (see section on 'Hierarchical Clustering and Ultrametric Distances'), but are also met in the context of number theory and physics. Any ultrametric is a metric since (10) implies (2).

Additive Distances and the Four-point Inequality

A dissimilarity measure d is called an additive or tree distance if and only if it fulfills the four-points inequality

d(x, y) + d(u, v) ≤ max{d(x, u) + d(y, v), d(x, v) + d(y, u)}   for all x, y, u, v in X.    (11)

Such a distance corresponds to a tree-like visualization of the objects (see Figure 4).
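The three inequality conditions introduced so far — the triangle inequality (2), the ultrametric inequality (10), and the four-points inequality (11) — can be checked mechanically for any finite dissimilarity matrix. The brute-force Python sketch below (written for this exposition; the function names are ours, and the matrix of Table 2 is used only as example input) does exactly that over all triples and quadruples of objects.

import itertools
import numpy as np

def is_triangle(D, tol=1e-9):
    # triangle inequality (2)
    n = len(D)
    return all(D[i, j] <= D[i, k] + D[k, j] + tol
               for i, j, k in itertools.product(range(n), repeat=3))

def is_ultrametric(D, tol=1e-9):
    # ultrametric inequality (10)
    n = len(D)
    return all(D[i, j] <= max(D[i, k], D[k, j]) + tol
               for i, j, k in itertools.product(range(n), repeat=3))

def is_additive(D, tol=1e-9):
    # four-points inequality (11)
    n = len(D)
    return all(D[x, y] + D[u, v] <= max(D[x, u] + D[y, v], D[x, v] + D[y, u]) + tol
               for x, y, u, v in itertools.product(range(n), repeat=4))

# Dissimilarity matrix of Table 2
D = np.array([[ 0.0,  3.0,  8.0,  1.5, 10.5,  6.0],
              [ 3.0,  0.0,  5.0,  1.5,  7.5,  3.0],
              [ 8.0,  5.0,  0.0,  6.5,  2.5,  2.0],
              [ 1.5,  1.5,  6.5,  0.0,  9.0,  4.5],
              [10.5,  7.5,  2.5,  9.0,  0.0,  4.5],
              [ 6.0,  3.0,  2.0,  4.5,  4.5,  0.0]])

print(is_triangle(D), is_ultrametric(D), is_additive(D))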
Visualization of Dissimilarity Structures A major approach for interpreting and understanding a similarity structure of n objects i = 1, . . . , n proceeds by visualizing the observed dissimilarity values dij in a graphical display where the objects are represented by n points (vertices) y1 , . . . , yn and the dissimilarity dij of two objects i, j is reproduced or approximated by the ‘closeness’ (in some sense) of the vertices yi , yj in this display.
Euclidean Embedding and Multidimensional Scaling (MDS)

Multidimensional Scaling represents the n objects by n points y_1, . . . , y_n in a low-dimensional Euclidean space IR^s (usually with s = 2 or s = 3) such that their Euclidean distances δ(y_i, y_j) := ||y_i − y_j||_2 approximate as much as possible the given dissimilarities d_ij. An optimum configuration Y := {y_1, . . . , y_n} is typically found by minimizing an optimality criterion such as

STRESS(Y) := Σ_{i=1}^{n} Σ_{j=1}^{n} (d_ij − δ(y_i, y_j))²  →  min_Y    (12)

SSTRESS(Y) := Σ_{i=1}^{n} Σ_{j=1}^{n} (d_ij² − δ(y_i, y_j)²)²  →  min_Y.    (13)

Nonmetric Scaling assumes a linear relationship d = a + bδ or a monotone function d = g(δ) between the d_ij and the Euclidean distances δ(y_i, y_j) and minimizes

Σ_{i=1}^{n} Σ_{j=1}^{n} (d_ij − a − b δ(y_i, y_j))²   or   Σ_{i=1}^{n} Σ_{j=1}^{n} (d_ij² − g(δ(y_i, y_j)))²    (14)

with respect to Y and the coefficients a, b ∈ IR or, respectively, the function g. There are many variants of this approach with suitable software support (see [7]). If the minimum STRESS value in (12) is 0 for some dimension s, this means that there exists a configuration Y = (y_1, . . . , y_n) in IR^s such that δ(y_i, y_j) = d_ij holds for all pairs (i, j): such a matrix D is called a Euclidean dissimilarity matrix since it allows an exact embedding of the objects in a Euclidean space IR^s. A famous theorem states that this case occurs exactly if the row-and-column centered matrix D̃ = (d_ij²) is negative semidefinite with a rank less than or equal to s (see, e.g., [7]).
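A minimal sketch of this embeddability check (our own illustration, using the standard Torgerson double-centering B = −½ J D^(2) J, which is positive semidefinite exactly when the centered matrix of squared dissimilarities mentioned above is negative semidefinite): the eigenvalues of B indicate whether D is Euclidean and how many dimensions an exact embedding needs.

import numpy as np

def euclidean_embedding_check(D, tol=1e-8):
    """Return (is_euclidean, required_dimension) for a symmetric dissimilarity matrix D."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                  # double-centered squared dissimilarities
    eigvals = np.linalg.eigvalsh(B)
    is_euclidean = bool(eigvals.min() >= -tol)   # all eigenvalues (numerically) nonnegative
    rank = int(np.sum(eigvals > tol))            # number of strictly positive eigenvalues
    return is_euclidean, rank

# Example: pairwise Euclidean distances of four points in the plane are, of course, Euclidean
pts = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 4.0], [3.0, 4.0]])
D = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
print(euclidean_embedding_check(D))              # (True, 2)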
Hierarchical Clustering and Ultrametric Distances

Ultrametric distances (see Ultrametric Trees) are intimately related to hierarchical classifications and their visualizations. A hierarchical clustering or dendrogram (H, h) for a set X = {1, . . . , n} of objects is a system H = {A, B, C, . . .} of classes of objects that are hierarchically nested as illustrated in Figure 1 and where each class A is displayed at a (heterogeneity) level h(A) ≥ 0 such that larger classes have a higher level: h(A) ≤ h(B) whenever A ⊂ B. There exist plenty of clustering methods in order to construct a hierarchical clustering of the n objects on the basis of a given dissimilarity matrix D = (d_ij). The resulting dendrogram may then reveal and explain the underlying similarity structure. Inversely, given a dendrogram (H, h), we can measure the dissimilarity of two objects i, j within the hierarchy by the level h(B) of the smallest class B that contains both objects i and j (see Figure 1). If this level is denoted by δ_ij := h(B), we obtain a new dissimilarity matrix Δ = (δ_ij). As a matter of fact, it can be shown that Δ is an ultrametric distance. Even more: It appears that dendrograms and ultrametrics are equivalent insofar as a dissimilarity matrix D = (d_ij)_{n×n} is an ultrametric if and only if there exists a hierarchical clustering (H, h) of the n objects such that the resulting ultrametric Δ is identical to D: d_ij = δ_ij for all i, j. This suggests that in cases when D does not share the ultrametric property, the criterion

φ(Δ) := Σ_{i=1}^{n} Σ_{j=1}^{n} (δ_ij − d_ij)²    (15)

may be used for checking if (H, h) is an adequate classification for the given dissimilarity D, and looking for a good hierarchical clustering for D can also be interpreted as looking for an optimum ultrametric approximation Δ of D in the sense of minimizing φ(Δ) (see [8]).

Figure 1   Dendrogram for n = 9 objects
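To make this concrete (an illustration added for this exposition, not taken from the entry itself): SciPy's hierarchical-clustering routines return exactly the levels h(B), and the induced ultrametric δ_ij is known there as the cophenetic distance, so the approximation criterion (15) can be evaluated in a few lines. The choice of 'average' linkage below is arbitrary.

import numpy as np
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage, cophenet

# Dissimilarity matrix of Table 2, converted to the condensed form SciPy expects
D = np.array([[ 0.0,  3.0,  8.0,  1.5, 10.5,  6.0],
              [ 3.0,  0.0,  5.0,  1.5,  7.5,  3.0],
              [ 8.0,  5.0,  0.0,  6.5,  2.5,  2.0],
              [ 1.5,  1.5,  6.5,  0.0,  9.0,  4.5],
              [10.5,  7.5,  2.5,  9.0,  0.0,  4.5],
              [ 6.0,  3.0,  2.0,  4.5,  4.5,  0.0]])
d_condensed = squareform(D)

Z = linkage(d_condensed, method="average")       # a hierarchical clustering (H, h)
_, delta_condensed = cophenet(Z, d_condensed)    # ultrametric (cophenetic) distances delta_ij

# Criterion (15): squared deviation between the induced ultrametric and the original D
phi = np.sum((squareform(delta_condensed) - D) ** 2)
print(phi)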
Pyramids and Robinsonian Dissimilarities If the elements dij of a dissimilarity matrix D = (dij ) are monotonely increasing when moving away from the diagonal (with zero values) along a row or a column, D is called a Robinsonian dissimilarity (see Table 2 after rearranging the rows and columns in the order 1-4-2-6-3-5). There exists a close relationship between Robinsonian matrices and a special type of nested clusterings (pyramids) that is illustrated in Figure 2: A pyramidal clustering (P, ≺, h) is a system P = {A, B, C, . . .} of classes of objects, together with an order ≺ on the set O of objects such that each class A ∈ P is an interval with respect to ≺: A = {k ∈ O|i ≺ k ≺ j } for some objects i ≺ j . As before, h(A) is the heterogeneity level of a class A. Figure 2 illustrates also the general theorem that (a) the intersection A ∩ B of two classes of A, B ∈ H is either empty or belongs to H as well and (b) that any class C ∈ H can have at most two ‘daughter’ classes A, B from H (see, e.g., [13, 5]). In practice, the corresponding order ≺ provides a seriation of the n objects in a series that may describe, e.g., temporal changes, chronological periods, or geographical gradients. Similarly as in section on ‘Hierarchical Clustering and Ultrametric Distances’, we can define, for a
given pyramid, a dissimilarity r_ij between two objects i, j within the pyramid by the smallest heterogeneity level h(B) where i and j meet in the same class B of P. A major result states essentially (a) that this dissimilarity matrix (r_ij)_{n×n} is Robinsonian and (b) that, inversely, any Robinsonian dissimilarity matrix can be obtained in this way from an appropriate pyramidal classification (P, ≺, h). In this sense, both concepts are equivalent. For example, the pyramid in Figure 2 corresponds to the dissimilarity matrix from Table 2. For details see, e.g., [15]. As an application, we may calculate, for a given arbitrary dissimilarity matrix D = (d_ij), an optimum Robinsonian approximation (r_ij) (in analogy to (15)). Then the corresponding pyramidal clustering provides not only insight into the similarity structure of D, but also suggests what kind of seriation may be appropriate and interpretable for the underlying objects.
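Whether a given matrix is Robinsonian is easy to verify directly. The Python sketch below (our own; the function is a straightforward transcription of the definition) checks the Table 2 matrix both in its original object order and after the seriation 1-4-2-6-3-5 mentioned above.

import numpy as np

def is_robinsonian(D, tol=1e-9):
    """Check that each row of D increases monotonely when moving away from the diagonal."""
    n = D.shape[0]
    for i in range(n):
        left = D[i, :i + 1][::-1]        # values moving left from the diagonal
        right = D[i, i:]                 # values moving right from the diagonal
        if np.any(np.diff(left) < -tol) or np.any(np.diff(right) < -tol):
            return False
    return True

D = np.array([[ 0.0,  3.0,  8.0,  1.5, 10.5,  6.0],
              [ 3.0,  0.0,  5.0,  1.5,  7.5,  3.0],
              [ 8.0,  5.0,  0.0,  6.5,  2.5,  2.0],
              [ 1.5,  1.5,  6.5,  0.0,  9.0,  4.5],
              [10.5,  7.5,  2.5,  9.0,  0.0,  4.5],
              [ 6.0,  3.0,  2.0,  4.5,  4.5,  0.0]])

order = [0, 3, 1, 5, 2, 4]                       # objects 1-4-2-6-3-5 (zero-based)
print(is_robinsonian(D))                         # False in the original order
print(is_robinsonian(D[np.ix_(order, order)]))   # True after the seriation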
Figure 2   Pyramid for n = 6 objects with order ≺

Tree-like Configurations, Additive and Quadripolar Distances

Trivially, any dissimilarity matrix D = (d_ij) may be visualized by a weighted graph G where each of the n objects i is represented by a vertex y_i and each of the n(n − 1)/2 pairs {i, j} of vertices are linked by an edge ij that has the weight (length) d_ij > 0 (Figure 3). For a large number of objects, this graph is too complex for interpretation, therefore we should concentrate on simpler graphs. A major approach considers the interpretation of D = (d_ij) in terms of a weighted tree T with n vertices and n − 1 edges (Figure 4) such that d_ij is the length of the shortest path from i to j in this tree.

Figure 3   Graphical display of Table 2 by a graph

Figure 4   A tree-like visualization T
A theorem of Buneman and Dobson states that, for a given matrix D, there exists such a weighted tree if and only if the four-points inequality (11) is fulfilled (see Figure 4). Then D is called an additive tree or quadripolar distance. If D is not a tree distance, we may find an optimum tree approximation to D by looking for a tree distance (δij ) with minimum deviation from D in analogy to (15) and interpret the corresponding tree T in terms of our application. For this and more general approaches see, e.g., [1], [8], [15].
Probabilistic Aspects

Dissimilarities from Random Data

When data were obtained from a random experiment, all dissimilarities calculated from these data must be considered as random variables with some probability distribution. In some cases, this distribution is known and can be used for finding dissimilarity thresholds, e.g., for distinguishing between 'similar' and 'dissimilar' objects or clusters. We cite three examples:
1. If X_1, . . . , X_{n_1} and Y_1, . . . , Y_{n_2} are independent samples of p-dimensional random vectors X, Y with normal distributions N_p(µ_1, Σ) and N_p(µ_2, Σ), respectively, then the squared Mahalanobis distance d_M²(X̄, Ȳ) := ||X̄ − Ȳ||²_{Σ⁻¹} of the class centroids X̄, Ȳ is distributed as c² χ²_{p,δ²}, where χ²_{p,δ²} is a noncentral χ² with p degrees of freedom and noncentrality parameter δ² := ||µ_1 − µ_2||²_{Σ⁻¹}/c², with the factor c² := n_1⁻¹ + n_2⁻¹.

2. On the other hand, if X_1, . . . , X_n are independent N_p(µ, Σ) variables with estimated covariance matrix Σ̂ := (1/n) Σ_{i=1}^{n} (X_i − X̄)(X_i − X̄)′, then the rescaled Mahalanobis distance d_M²(X_i, X_j) := (X_i − X_j)′ Σ̂⁻¹ (X_i − X_j)/(2n) has a Beta I distribution with parameters p/2 and (n − p − 1)/2, with expectation 2pn/(n − 1) ([2]).

3. Similarly, in the case of two binary data vectors U = (U_1, . . . , U_p) and V = (V_1, . . . , V_p) with 2p independent components, all with values 0 and 1 and the same hitting probability π = P(U_i = 1) = P(V_i = 1) for i = 1, . . . , p, the Hamming distance d_H(U, V) has the binomial distribution Bin(p, 2π(1 − π)) with expectation 2pπ(1 − π).
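The third statement is easy to verify by simulation; the short sketch below (ours, with the arbitrary choices p = 20 and π = 0.3) compares the simulated mean of d_H(U, V) with the theoretical value 2pπ(1 − π).

import numpy as np

rng = np.random.default_rng(1)
p, pi, n_sim = 20, 0.3, 100_000

U = rng.random((n_sim, p)) < pi           # n_sim independent binary vectors U
V = rng.random((n_sim, p)) < pi           # ... and V, all components Bernoulli(pi)
d_H = np.sum(U != V, axis=1)              # Hamming distances

print(d_H.mean(), 2 * p * pi * (1 - pi))  # both close to 8.4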
Dissimilarity Between Probability Distributions

Empirical as well as theoretical investigations often lead to the problem of evaluating the (dis-)similarity of two populations that are characterized by two distributions P and Q, respectively, for a random data vector X. Typically, this comparison is conducted by using a variant of the φ-divergence measure of Csiszár [6]:

D(P; Q) := ∫_X q(x) φ(p(x)/q(x)) dx   resp.   D(P; Q) := Σ_{x∈X} q(x) φ(p(x)/q(x)),    (16)

where p(x), q(x) are the distribution densities (probability functions) of P and Q in the continuous (discrete) distribution case. Here φ(λ) is a convex function of a likelihood ratio λ > 0 (typically with φ(1) = 0) and can be chosen suitably. Special cases include:

(a) the Kullback–Leibler discriminating information distance for φ(λ) := − log λ:

    D_KL(P; Q) := ∫_X q(x) log (q(x)/p(x)) dx ≥ 0    (17)

    with its symmetrized variant for φ̃(λ) := (λ − 1) log λ:

    D̃(P; Q)_KL := D_KL(Q; P) + D_KL(P; Q) = ∫_X q(x) φ̃(p(x)/q(x)) dx    (18)

(b) the variation distance resulting for φ(λ) := |λ − 1|:

    D_var(P; Q) := ∫_X |q(x) − p(x)| dx    (19)

(c) the χ²-distance resulting from φ(λ) := (λ − 1)²:

    D_chi(P; Q) := ∫_X (p(x) − q(x))²/q(x) dx.    (20)

Other special cases include Hellinger and Bhattacharyya distance, Matusita distance, and so on (see [3, 4]). Note that only (18) and (19) yield a symmetric dissimilarity measure.
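For discrete distributions, the special cases (17)–(20) reduce to sums and can be computed directly. The sketch below (our own, for two arbitrary probability vectors p and q with strictly positive entries) follows the conventions of (16)–(20).

import numpy as np

p = np.array([0.2, 0.5, 0.3])   # distribution P
q = np.array([0.4, 0.4, 0.2])   # distribution Q

D_KL = np.sum(q * np.log(q / p))                 # eq. (17), D_KL(P; Q)
D_KL_sym = D_KL + np.sum(p * np.log(p / q))      # eq. (18), symmetrized variant
D_var = np.sum(np.abs(q - p))                    # eq. (19), variation distance
D_chi = np.sum((p - q) ** 2 / q)                 # eq. (20), chi-square distance

print(D_KL, D_KL_sym, D_var, D_chi)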
References

[1] Barthélemy, J.-P. & Guénoche, A. (1988). Les Arbres et Les Représentations Des Proximités, Masson, Paris. English translation: Trees and Proximity Representations, Wiley, New York, 1991.
[2] Bock, H.-H. (1974). Automatische Klassifikation, Vandenhoeck & Ruprecht, Göttingen.
[3] Bock, H.-H. (1992). A clustering technique for maximizing φ-divergence, noncentrality and discriminating power, in Analyzing and Modeling Data and Knowledge, M. Schader, ed., Springer, Heidelberg, pp. 19–36.
[4] Bock, H.-H. (2000). Dissimilarity measures for probability distributions, in Analysis of Symbolic Data, H.-H. Bock & E. Diday, eds, Springer, Heidelberg, pp. 153–160.
[5] Brito, P. (2000). Hierarchical and pyramidal clustering with complete symbolic objects, in Analysis of Symbolic Data, H.-H. Bock & E. Diday, eds, Springer, Heidelberg, pp. 312–341.
[6] Csiszár, L. (1967). Information-type measures of difference of probability distributions and indirect observations, Studia Scientiarum Mathematicarum Hungarica 2, 299–318.
[7] de Leeuw, J. & Heiser, W. (1982). Theory of multidimensional scaling, in Classification, Pattern Recognition, and Reduction of Dimensionality, P.R. Krishnaiah & L.N. Kanal, eds, North Holland, Amsterdam, pp. 285–316.
[8] De Soete, G. & Carroll, J.D. (1996). Tree and other network models for representing proximity data, in Clustering and Classification, P. Arabie, L.J. Hubert & G. De Soete, eds, World Scientific, Singapore, pp. 157–197.
[9] Esposito, F., Malerba, D. & Tamma, V. (2000). Dissimilarity measures for symbolic data, in Analysis of Symbolic Data, H.-H. Bock & E. Diday, eds, Springer, Heidelberg, pp. 165–185.
[10] Everitt, B.S., Landau, S. & Leese, M. (2001). Cluster Analysis, Oxford University Press.
[11] Everitt, B.S. & Rabe-Hesketh, S. (1997). The Analysis of Proximity Data. Kendall's Library of Statistics, Vol. 4, Arnold Publishers, London.
[12] Gower, J.C. (1971). A general coefficient of similarity and some of its properties, Biometrics 27, 857–874.
[13] Lasch, R. (1993). Pyramidale Darstellung Multivariater Daten, Verlag Josef Eul, Bergisch Gladbach – Köln.
[14] Sankoff, D., Kruskal, J. & Nerbonne, J. (1999). Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison, CSLI Publications, Stanford.
[15] Van Cutsem, B., ed. (1994). Classification and Dissimilarity Analysis, Lecture Notes in Statistics no. 93, Springer, New York.
HANS-HERMANN BOCK
Psychophysical Scaling HANS IRTEL Volume 3, pp. 1628–1632 in Encyclopedia of Statistics in Behavioral Science ISBN-13: 978-0-470-86080-9 ISBN-10: 0-470-86080-4 Editors Brian S. Everitt & David C. Howell John Wiley & Sons, Ltd, Chichester, 2005
Psychophysical Scaling Introduction A large part of human cognition is devoted to the development of a mental representation of the physical environment. This is necessary for planned interaction and for successful anticipation of dangerous situations. Survival in a continuously changing physical world will only be possible if the organism’s mental representation of its environment is sufficiently valid. This actually is the case for most of our perceptual abilities. Our senses convey a rather valid view of the physical world, at least as long as we restrict the world to that part of the environment that allows for unaided interaction. And this is the basic idea of psychophysical scaling: Formulate a theory that allows the computation of perceived stimulus properties from purely physical attributes. A problem complex as this requires simplification and reduction in order to create proper experiments and models that can be used to describe, or even explain, the results of these experiments. Thus, most of the experimental examples and most of the theories we will discuss here will be small and simplified cases. Most cases will assume that we have only a single relevant physical independent variable, and a single and unidimensional dependent variable. Simplification like this is appropriate as long as we keep in mind the bigger aim of relating the properties of the physical world to behavior, or, better, to the mental representation of the world that guides behavior.
Subject Tasks Before going into the theoretical foundations of psychophysical scaling it might be useful to look at the experimental conditions which give rise to problems of psychophysical scaling. So we first will look at subject tasks which may be found in experiments involving psychophysical scaling. The most basic tasks involved in psychophysical experiments are detection and discrimination tasks. A detection task involves one or more time intervals during which a stimulus may be presented. The subject’s task is to tell which if any time interval contained the target stimulus. Single time interval tasks usually are called yes/no-tasks while multiple time interval tasks
are called forced choice-tasks. Discrimination tasks are similar to multiple interval detection tasks. The major difference being that there are no empty intervals but the distractor intervals contain a reference or standard stimulus and the subject’s task is to tell, whether the target is different from the reference or which of multiple intervals contains the target. The data of detection and discrimination experiments usually are captured by the probability that the respective target is detected or discriminated from its reference. When scaling is involved, then discrimination frequently involves ordering. In this case, the subject is asked whether the target stimulus has more of some attribute as a second target that may be presented simultaneously or subsequently. These tasks are called paired comparison (see Bradley–Terry Model) tasks, and frequently involve the comparison of stimulus pairs. An example may be a task where the difference between two stimuli x, y with respect to some attribute has to be compared with the difference between two stimuli u, v. The data of these comparisons may also be handled as probabilities for finding a given ordering between pairs. Another type of psychophysical task involves the assignment of numeric labels to stimuli. A simple case is the assignment of single stimuli to a small number of numeric categories. Or subjects may be required to directly assign real number labels to stimuli or pairs of stimuli such that the numbers describe some attribute intensity associated with physical stimulus properties. The most well-known task of this type is magnitude estimation: here the subject may be asked to assign a real number label to a stimulus such that the real number describes the appearance of some stimulus attribute such as loudness, brightness, or heaviness. These tasks usually involve stimuli that show large differences with respect to the independent physical attribute such that discrimination would be certain if two of them were presented simultaneously. In many cases, the numerical labels created by the subjects are treated as numbers such that the data will be mean values of subject responses. The previously described task type may be modified such that the subject does not produce a number label but creates a stimulus attribute such that its appearance satisfies a certain numeric relation to a given reference. An example is midpoint production: here the subject adjusts a stimulus attribute such that it has an intensity that appears to be the midpoint between two given reference stimuli. Methods of this
type are called production methods since the subject produces the respective stimulus attribute. This is in contrast to the estimation methods, where the subjects produce a numerical estimation of the respective attribute intensity.
Discrimination Scaling Discrimination scaling is based on an assumption that dates back to Fechner [3]. His idea was that psychological measurement should be based on discriminability of stimuli. Luce & Galanter [5] used the phrase ‘equally often noticed differences are equal, unless always or never noticed’ to describe what they called Fechner’s Problem: Suppose P (x, y) is the discriminability of stimuli x and y. Does there exist a transformation g of the physical stimulus intensities x and y, such that P (x, y) = F [g(x) − g(y)],
(1)
where F is a strictly increasing function of its argument? Usually, P(x, y) will be the probability that stimulus x is judged to be of higher intensity than stimulus y with respect to some attributes. A solution g to (1) can be considered a psychophysical scale of the respective attribute in the sense that equal differences along the scale g indicate equal discriminability of the respective stimuli. An alternative but empirically equivalent formulation of (1) is

P(x, y) = G(h(x)/h(y)),    (2)

where h(x) = e^{g(x)} and G(x) = F(log x). Response probabilities P(x, y) that have a representation like (1) have to satisfy the quadruple condition: P(x, y) ≥ P(u, v) if and only if P(x, u) ≥ P(y, v). This condition, however, is not easy to test since no statistical methods exist that allow for appropriate decisions based on estimates of P. Response probabilities that have a Fechnerian representation in the sense of (1) allow the definition of a sensitivity function ξ: Let P(x, y) = π and define ξ_π(y) = x. Thus, ξ_π(y) is that stimulus intensity which, when compared to y, results in response probability π. From (1) with h = F⁻¹, we get

ξ_π(y) = g⁻¹[g(y) + h(π)],    (3)

where h(π) is independent of x.
Fechner's idea was that a fixed response probability value π corresponds to a single unit change on the sensation scale g: We take h(π) = 1 and look at the so-called Weber function δ, which is such that ξ_π(x) = x + δ_π(x). The value of the Weber function δ_π(x) is that stimulus increment that has to be added to stimulus x such that the response probability P(x + δ_π(x), x) is π. Weber's law states that δ_π(x) is proportional to x [16]. In terms of response probabilities, this means that P(cx, cy) = P(x, y) for any multiplicative factor c. Weber's law in terms of the sensitivity function ξ means that ξ_π(cx) = c ξ_π(x). A generalization is ξ_π(cx) = c^β ξ_π(x), which has been termed the near-miss-to-Weber's-law [2]. Its empirical equivalent is P(c^β x, cy) = P(x, y) with the representation P(x, y) = G(y/x^{1/β}). The corresponding Fechnerian representation then has the form

P(x, y) = F((1/β) log x − log y).    (4)

In case of β = 1, we get Fechner's law according to which the sensation scale grows with the logarithmic transformation of stimulus intensity.
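As a small numerical illustration (ours, under the assumed choices below): with a logarithmic sensation scale g(x) = log x and a logistic choice of F, the Fechnerian representation (1) reduces to P(x, y) = x/(x + y), and the Weber-law invariance P(cx, cy) = P(x, y) is immediate.

import numpy as np

def F(t):
    """A strictly increasing psychometric function; the logistic is one convenient choice."""
    return 1.0 / (1.0 + np.exp(-t))

def P(x, y):
    # Fechnerian representation (1) with g(x) = log(x): P(x, y) = F(log x - log y) = x / (x + y)
    return F(np.log(x) - np.log(y))

x, y, c = 40.0, 25.0, 3.7
print(P(x, y), P(c * x, c * y))   # identical values: Weber's law P(cx, cy) = P(x, y)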
Operations on the Stimulus Set Discrimination scaling is based on stimulus confusion. Metric information is derived from data that somehow describe a subject’s uncertainty when discriminating two physically distinct stimuli. There is no guarantee that the concatenation of just noticeable differences leads to a scale that also describes judgments about stimulus similarity for stimuli that are never confused. This has been criticized by [14]. An alternative method for scale construction is to create an operation on the set of stimuli such that this operation provides metric information about the appearance of stimulus differences. A simple case is the midpoint operation: the subject’s task is to find that stimulus m(x, y), whose intensity appears to be the midpoint between the two stimuli x and y. For any sensation scale g, this should mean that g[m(x, y)] =
(g(x) + g(y)) / 2.    (5)
The major empirical condition that guarantees that the midpoint operation m may be represented in this way is bisymmetry [11]: m[m(x, y), m(u, v)] = m[m(x, u), m(y, v)],
(6)
which can be tested empirically. If bisymmetry holds, then a representation of the form g[m(x, y)] = pg(x) + qg(y) + r
(7)
is possible. If, furthermore, m(x, x) = x holds, then p + q = 1 and r = 0, and if m(x, y) = m(y, x), then p = q = 1/2. Note that bisymmetry alone does not impose any restriction on the form of the psychophysical function g. However, Krantz [4] has shown that an additional empirically testable condition restricts the possible forms of the psychophysical function strongly. If the sensation scale satisfies (5), and the midpoint operation satisfies the homogeneity condition m(cx, cy) = cm(x, y), then there remain only two possible forms of the psychophysical function g: g(x) = α log x + β, or g(x) = αx β + γ , for two constants α > 0 and β. Falmagne [2] makes clear that the representation of m by an arithmetic mean is arbitrary. A geometric mean would be equally plausible, and the empirical conditions given above do not allow any distinction between these two options. Choosing the geometric mean as a representation for the midpoint operation m, however, changes the possible forms of the psychophysical function g [2]. The midpoint operation aims at the single numerical weight of 1/2 for multiplying sensation scale values. More general cases have been studied by [9] in the context of magnitude estimation, which will be treated later. Methods similar to the midpoint operation have also been suggested by [14] under the label ratio or magnitude production with fractionation and multiplication as subcases. Pfanzagl [11], however, notes that these methods impose much less empirical constraints on the data such that almost any monotone psychophysical function will be admissible.
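To see what these constraints mean in practice, here is a tiny numerical check (ours, under the assumption of a logarithmic scale with arbitrary constants): the represented midpoint m(x, y) = g⁻¹((g(x) + g(y))/2) is then the geometric mean and satisfies the homogeneity condition m(cx, cy) = c·m(x, y).

import numpy as np

alpha, beta = 2.0, 1.5            # arbitrary scale constants, g(x) = alpha*log(x) + beta

def g(x):
    return alpha * np.log(x) + beta

def g_inv(v):
    return np.exp((v - beta) / alpha)

def midpoint(x, y):
    # stimulus whose scale value is halfway between g(x) and g(y), cf. eq. (5)
    return g_inv((g(x) + g(y)) / 2.0)

x, y, c = 9.0, 16.0, 2.5
print(midpoint(x, y), np.sqrt(x * y))              # both 12.0: the geometric mean
print(midpoint(c * x, c * y), c * midpoint(x, y))  # homogeneity m(cx, cy) = c*m(x, y)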
Magnitude Estimation, Magnitude Production, and Cross-modal Matching Magnitude estimation is one of the classical methods proposed by [14] in order to create psychophysical scales that satisfy proper measurement conditions. Magnitude estimation requires the subject to assign numeric labels to stimuli such that the respective numbers are proportional to the magnitude of perceived stimulus intensity. Often, there will be a reference stimulus that is assigned a number label, ‘10’, say, by the experimenter, and the subject is instructed to map the relative sensation of the target with respect
to the reference. It is common practice to take the subjects’ number labels as proper numbers and compute average values from different subjects or from multiple replications with the same subject. This procedure remains questionable as long as no structural conditions are tested that validate the subjects’ proper handling of numeric labels. In magnitude production experiments, the subject is not required to produce number labels but to produce a stimulus intensity that satisfies a given relation on the sensation continuum to a reference, such as being 10 times as loud. A major result of Stevens’ research tradition is that average data from magnitude estimation and production frequently are well described by power functions: g(x) = αx β . Validation of the power law is done by fitting the respective power function to a set of data, and by cross-modal matching. This requires the subject to match a pair of stimuli from one continuum to a second pair of stimuli from another continuum. An example is the matching of a loudness interval defined by a pair of acoustic stimuli to the brightness interval of a pair of light stimuli [15]. If magnitude estimates of each single sensation scale are available and follow the power law, then the exponent of the matching function from one to the other continuum can be predicted by the ratio of the exponents. Empirical evidence for this condition is not unambiguous [2, 7]. In addition, a power law relation between matching sensation scales for different continua is also predicted by logarithmic sensation scales for the single continua [5]. The power law for sensation scales satisfies Stevens’ credo ‘that equal stimulus ratios produce equal subjective ratios’ ([14], p 153). It may, in fact, be shown that this rule implies a power law for the sensation scale g. But, as shown by several authors [2, 5], the notion ‘equal subjective ratios’ has the same theoretical status as Fechner’s assumption that just noticeable stimulus differences correspond to a single unit difference on the sensation scale. Both assumptions are arbitrary as long as there is no independent foundation of the sensation scale. A theoretical foundation of magnitude estimation and cross-modal matching has been developed by [4] based on ideas of [12]. He combined magnitude estimates, ratio estimates, and cross-modal matches, and formulated a set of empirically testable conditions that have to hold if the psychophysical functions are power functions of the corresponding physical attribute. A key assumption of the Shepard–Krantz
theory is to map all sensory attributes to the single sensory continuum of perceived length, which is assumed to behave like physical length. The main assumption, then, is that for the reference continuum of perceived length, the condition L(cx, cy) = L(x, y) holds. Since, furthermore, L(x, y) is assumed to behave like numerical ratios, this form of invariance generates the power law [2]. A more recent theory of magnitude estimation for ratios has been developed by [9]. The gist of this theory is a strict separation of number labels, as they are used by subjects, and mathematical numbers used in the theory. Narens derived two major predictions: The first is a commutativity property. Let ‘p’ and ‘q’ be number labels, and suppose the subject produces stimulus y when instructed to produce p-times x, and then produces z when instructed to produce q-times y. The requirement is that the result is the same when the sequence of ‘p’ and ‘q’ is reversed. The second prediction is a multiplicative one. It requires that the subject has to produce the same stimulus as in the previous sequence when required to produce pq-times x. Empirical predictions like this are rarely tested. Exceptions are [1] or for a slightly different but similar approach [18]. In both cases, the model was only partially supported by the data. While [6] describes the foundational aspects of psychophysical measurement, [8] presents a thorough overview over the current practices of psychophysical scaling in Stevens’ tradition. A modern characterization of Stevens’ measurement theory [13] is given by [10]. For many applications, the power law and the classical psychophysical scaling methods provide a good starting point for the question how stimulus intensity transforms into sensation magnitude. Even models that try to predict sensation magnitudes in rather complex conditions may incorporate the power law as a basic component for constant context conditions. An example is the CIE 1976 (L∗ a ∗ b∗ ) color space [17]. It models color similarity judgments and implements both an adaptation- and an illuminationdependent component. Its basic transform from stimulus to sensation space, however, is a power law.
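A common way to summarize magnitude-estimation data in Stevens' tradition is to fit the power function g(x) = αx^β by linear regression on log–log coordinates. The sketch below (ours; the data are invented purely for illustration) estimates β this way for two continua and shows the predicted exponent of a cross-modal matching function as the ratio of the two exponents.

import numpy as np

# Hypothetical average magnitude estimates for two continua (invented numbers)
intensity = np.array([1.0, 2.0, 4.0, 8.0, 16.0, 32.0])
loudness_est = np.array([1.1, 1.6, 2.1, 3.0, 4.1, 5.9])
brightness_est = np.array([1.0, 1.3, 1.6, 2.0, 2.4, 3.1])

def power_exponent(x, r):
    """Slope of log(r) on log(x), i.e., the exponent beta in r = alpha * x**beta."""
    beta, log_alpha = np.polyfit(np.log(x), np.log(r), 1)
    return beta

beta_loudness = power_exponent(intensity, loudness_est)
beta_brightness = power_exponent(intensity, brightness_est)

# If both scales follow the power law, the exponent of the matching function
# from loudness to brightness is predicted by the ratio of the two exponents.
print(beta_loudness, beta_brightness, beta_brightness / beta_loudness)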
References

[1] Ellermeier, W. & Faulhammer, G. (2000). Empirical evaluation of axioms fundamental to Stevens' ratio-scaling approach: I. Loudness production, Perception & Psychophysics 62, 1505–1511.
[2] Falmagne, J.-C. (1985). Elements of Psychophysical Theory, Oxford University Press, New York.
[3] Fechner, G.Th. (1860). Elemente der Psychophysik, Breitkopf und Härtel, Leipzig.
[4] Krantz, D.H. (1972). A theory of magnitude estimation and cross-modality matching, Journal of Mathematical Psychology 9, 168–199.
[5] Luce, R.D. & Galanter, E. (1963). Discrimination, in Handbook of Mathematical Psychology, Vol. I, R.D. Luce, R.R. Bush & E. Galanter, eds, Wiley, New York, pp. 191–243.
[6] Luce, R.D. & Suppes, P. (2002). Representational measurement theory, in Stevens' Handbook of Experimental Psychology, Vol. 4, 3rd Edition, H. Pashler, ed., Wiley, New York, pp. 1–41.
[7] Marks, L.E. & Algom, D. (1998). Psychophysical scaling, in Measurement, Judgement, and Decision Making, M.H. Birnbaum, ed., Academic Press, San Diego, pp. 81–178.
[8] Marks, L.E. & Gescheider, G.A. (2002). Psychophysical scaling, in Stevens' Handbook of Experimental Psychology, Vol. 4, 3rd Edition, H. Pashler, ed., Wiley, New York, pp. 91–138.
[9] Narens, L. (1996). A theory of ratio magnitude estimation, Journal of Mathematical Psychology 40, 109–129.
[10] Narens, L. (2002). The irony of measurement by subjective estimations, Journal of Mathematical Psychology 46, 769–788.
[11] Pfanzagl, J. (1971). Theory of Measurement, 2nd Edition, Physica-Verlag, Würzburg.
[12] Shepard, R.N. (1981). Psychophysical relations and psychophysical scales: on the status of 'direct' psychophysical measurement, Journal of Mathematical Psychology 24, 21–57.
[13] Stevens, S.S. (1946). On the theory of scales of measurement, Science 103, 677–680.
[14] Stevens, S.S. (1957). On the psychophysical law, The Psychological Review 64, 153–181.
[15] Stevens, J.C. & Marks, L.E. (1965). Cross-modality matching of brightness and loudness, Proceedings of the National Academy of Sciences 54, 407–411.
[16] Weber, E.H. (1834). De Pulsu, Resorptione, Auditu et Tactu. Annotationes Anatomicae et Physiologicae, Köhler, Leipzig.
[17] Wyszecki, G. & Stiles, W.S. (1982). Color Science: Concepts and Methods, Quantitative Data and Formulae, 2nd Edition, Wiley, New York.
[18] Zimmer, K., Luce, R.D. & Ellermeier, W. (2001). Testing a new theory of psychophysical scaling: temporal loudness integration, in Fechner Day 2001. Proceedings of the Seventeenth Annual Meeting of the International Society for Psychophysics, E. Sommerfeld, R. Kompass & T. Lachmann, eds, Pabst, Lengerich, pp. 683–688.
(See also Harmonic Mean) HANS IRTEL
Qualitative Research MICHAEL QUINN PATTON Volume 3, pp. 1633–1636 in Encyclopedia of Statistics in Behavioral Science ISBN-13: 978-0-470-86080-9 ISBN-10: 0-470-86080-4 Editors Brian S. Everitt & David C. Howell John Wiley & Sons, Ltd, Chichester, 2005
Qualitative Research An anthropologist studies initiation rites among the Gourma people of Burkina Faso in West Africa. A sociologist observes interactions among bowlers in their weekly league games. An evaluator participates fully in a leadership training program she is documenting. A naturalist studies Bighorn sheep beneath Powell Plateau in the Grand Canyon. A policy analyst interviews people living in public housing in their homes. An agronomist observes farmers’ Spring planting practices in rural Minnesota. What do these researchers have in common? They are in the field, studying the real world as it unfolds. This is called naturalistic inquiry, and it is the cornerstone of qualitative research. Such qualitative investigations typically begin with detailed narrative descriptions, then constructing in-depth case studies of the phenomenon under study, and, finally, moving to comparisons and the interpretive search for patterns that cut across cases. Qualitative research with human beings involves three kinds of data collection: (a) in-depth, openended interviews; (b) direct observations; and (c) written documents. Interviews yield direct quotations from people about their experiences, opinions, feelings, and knowledge. The data from observations consist of detailed descriptions of people’s activities, behaviors, actions, and the full range of interpersonal interactions and organizational processes that are part of observable human experience. Document analysis includes studying excerpts, quotations or entire passages from organizational, clinical or program records; memoranda and correspondence; official publications and reports; personal diaries; and open-ended written responses to questionnaires and surveys. The data for qualitative research typically come from fieldwork. During fieldwork, the researcher spends time in the setting under study – a program, organization, or community, where change efforts can be observed, people interviewed, and documents analyzed. The researcher makes firsthand observations of activities and interactions, sometimes engaging personally in those activities as a ‘participant observer’. For example, a researcher might participate in all or part of an educational program under study, participating as a student. The qualitative researcher
talks with people about their experiences and perceptions. More formal individual or group interviews may be conducted. Relevant records and documents are examined. Extensive field notes are collected through these observations, interviews, and document reviews. The voluminous raw data in these field notes are organized into readable narrative descriptions with major themes, categories, and illustrative case examples extracted inductively through content analysis. The themes, patterns, understandings, and insights that emerge from research fieldwork and subsequent analysis are the fruit of qualitative inquiry. Qualitative findings may be presented alone or in combination with quantitative data. At the simplest level, a questionnaire or interview that asks both fixed-choice (closed) questions and open-ended questions is an example of how quantitative measurement and qualitative inquiry are often combined. The quality of qualitative data depends to a great extent on the methodological skill, sensitivity, and integrity of the researcher. Systematic and rigorous observation involves far more than just being present and looking around. Skillful interviewing involves much more than just asking questions. Content analysis requires considerably more than just reading to see what is there. Generating useful and credible qualitative findings through observation, interviewing, and content analysis requires discipline, knowledge, training, practice, creativity, and hard work.
Examples of Qualitative Applications Fieldwork is the fundamental method of cultural anthropology. Ethnographic inquiry takes as its central and guiding assumption that any human group of people interacting together for a period of time will evolve a culture. Ethnographers study and describe specific cultures, then compare and contrast cultures to understand how cultures evolve and change. Anthropologists have traditionally studied nonliterate cultures in remote settings, what were often thought of as ‘primitive’ or ‘exotic’ cultures. Sociologists subsequently adapted anthropological fieldwork approaches to study urban communities and neighborhoods. Qualitative research has also been important to management studies, which often rely on case studies of companies. One of the most influential books in organizational development has been In Search
of Excellence: Lessons from Americas’ Best-Run Companies. Peters and Waterman [4] based the book on case studies of highly regarded companies. They visited companies, conducted extensive interviews, and studied corporate documents. From that massive amount of data, they extracted eight qualitative attributes of excellence: (a) a bias for action; (b) close to the customer; (c) autonomy and entrepreneurship; (d) productivity through people; (e) hands-on, value-driven; (f) stick to the knitting; (g) simple form, lean staff; and (h) simultaneous loose–tight properties. Their book devotes a chapter to each theme with case examples and implications. Their research helped launch the quality movement that has now moved from the business world to notfor-profit organizations and government. A different kind of qualitative finding is illustrated by Angela Browne’s book When Battered Women Kill [1]. Browne conducted in-depth interviews with 42 women from 15 states who were charged with a crime in the death or serious injury of their mates. She was often the first to hear these women’s stories. She used one couple’s history and vignettes from nine others, representative of the entire sample, to illuminate the progression of an abusive relationship from romantic courtship to the onset of abuse through its escalation until it was ongoing and eventually provoked a homicide. Her work helped lead to legal recognition of battered women’s syndrome as a legitimate defense, especially in offering insight into the common outsider’s question: Why does not the woman just leave? Getting an insider perspective on the debilitating, destructive, and all-encompassing brutality of battering reveals that question for what it is: the facile judgment of one who has not been there. The effectiveness of Browne’s careful, detailed, and straightforward descriptions and quotations lies in their capacity to take us inside the abusive relationship. Offering that inside perspective powers qualitative reporting. Qualitative methods are often used in program evaluations because they tell the program’s story by capturing and communicating the participants’ stories. Research case studies have all the elements of a good story. They tell what happened when, to whom, and with what consequences. The purpose of such studies is to gather information and generate findings that are useful. Understanding the program’s and participant’s stories is useful to the extent that those stories illuminate the processes and outcomes
of the program for those who must make decisions about the program. The methodological implication of this criterion is that the intended users must value the findings and find them credible. They must be interested in the stories, experiences, and perceptions of program participants beyond simply knowing how many came into the program, how many completed it, and how many did what afterwards. Qualitative findings in evaluation can illuminate the people behind the numbers and put faces on the statistics to deepen understanding [3].
Purposeful Sampling Perhaps, nothing better captures the difference between quantitative and qualitative methods than the different logics that undergird sampling approaches (see Survey Sampling Procedures). Qualitative inquiry typically focuses in-depth on relatively small samples, even single cases (n = 1), selected purposefully. Quantitative methods typically depend on larger samples selected randomly. Not only are the techniques for sampling different, but the very logic of each approach is unique because the purpose of each strategy is different. The logic and power of random sampling derives from statistical probability theory. In contrast, the logic and power of purposeful sampling lies in selecting informationrich cases for study in depth. Information-rich cases are those from which one can learn a great deal about issues of central importance to the purpose of the inquiry, thus the term purposeful sampling. What would be ‘bias’ in statistical sampling, and, therefore, a weakness, becomes the intended focus in qualitative sampling, and, therefore, a strength. Studying information-rich cases yields insights and in-depth understanding rather than empirical generalizations. For example, if the purpose of a program evaluation is to increase the effectiveness of a program in reaching lower-socioeconomic groups, one may learn a great deal more by studying indepth a small number of carefully selected poor families than by gathering standardized information from a large, statistically representative sample of the whole program. Purposeful sampling focuses on selecting information-rich cases whose study will illuminate the questions under study. There are several different strategies for purposefully selecting information-rich cases. The logic of each strategy serves a particular purpose.
Qualitative Research Extreme or deviant case sampling involves selecting cases that are information-rich because they are unusual or special in some way, such as outstanding successes or notable failures. The influential study of high performing American companies published as In Search of Excellence [4] exemplifies the logic of purposeful, extreme group sampling. The sample of 62 companies was never intended to be representative of the US industry as a whole, but, rather, was purposefully selected to focus on innovation and excellence. In the early days of AIDS research, when HIV infections almost always resulted in death, a small number of cases of people infected with HIV who did not develop AIDS became crucial outlier cases that provided important insights into directions researchers should take-in combating AIDS. In program evaluation, the logic of extreme case sampling is that lessons may be learned about unusual conditions or extreme outcomes that are relevant to improving more typical programs. Suppose that we are interested in studying a national program with hundreds of local sites. We know that many programs are operating reasonably well, that other programs verge on being disasters, and that most programs are doing ‘okay’. We know this from knowledgeable sources who have made site visits to enough programs to have a basic idea about what the variation is. If one wanted to precisely document the natural variation among programs, a random sample would be appropriate, one of sufficient size to be representative, and permit generalizations to the total population of programs. However, with limited resources and time, and with the priority being how to improve programs, an evaluator might learn more by intensively studying one or more examples of really poor programs and one or more examples of really excellent programs. The evaluation focus, then, becomes a question of understanding under what conditions programs get into trouble, and under what conditions programs exemplify excellence. It is not
even necessary to randomly sample poor programs or excellent programs. The researchers and intended users involved in the study think through what cases they could learn the most from, and those are the cases that are selected for study. Examples of other purposeful sampling strategies are briefly described below: •
• •
Maximum variation sampling involves purposefully picking a wide range of cases to get variation on dimensions of interest. Such a sample can document variations that have emerged in adapting to different conditions as well as identify important common patterns that cut across variations (cut through the noise of variation). Homogenous sampling is used to bring focus to a sample, reduce variation, simplify analysis, and facilitate group interviewing (focus groups). Typical case sampling is used to illustrate or highlight what is typical, normal, average, and gives greater depth of understanding to the qualitative meaning of a statistical mean.
For a full discussion of these and other purposeful sampling strategies, and the full range of qualitative methods and analytical approaches, see [2, 3].
References [1] [2]
[3] [4]
Browne, A. (1987). When Battered Women Kill, Free Press, New York. Denzin, N. & Lincoln, Y., eds (2000). Handbook of Qualitative Research, 2nd Edition, Sage Publications, Thousand Oaks. Patton, M.Q. (2002). Qualitative Research and Evaluation Methods, 3rd Edition, Sage Publications, Thousand Oaks. Peters, T. & Waterman, R. (1982). In Search of Excellence, Harper & Row, New York.
MICHAEL QUINN PATTON
Quantiles PAT LOVIE Volume 3, pp. 1636–1637 in Encyclopedia of Statistics in Behavioral Science ISBN-13: 978-0-470-86080-9 ISBN-10: 0-470-86080-4 Editors Brian S. Everitt & David C. Howell John Wiley & Sons, Ltd, Chichester, 2005
Quantiles A quantile of a distribution is the value such that a proportion p of the population values are less than or equal to it. In other words, the pth quantile is the value that cuts off a certain proportion p of the area under the probability distribution function. It is sometimes known as a theoretical quantile or population quantile and is denoted by Qt(p). Quantiles for common distributions are easily obtained using routines for the inverse cumulative distribution function (inverse c.d.f.) found in many popular statistical packages. For example, the 0.5 quantile (which we also recognize as the median or 50th percentile) of a standard normal distribution is the value x_{0.5} satisfying P(x ≤ x_{0.5}) = 0.5, that is, x_{0.5} = Φ⁻¹(0.5) = 0, where Φ⁻¹(.) denotes the inverse c.d.f. of that distribution. Equally easily, we locate the 0.5 quantile for an exponential distribution (see Catalogue of Probability Density Functions) with mean of 1 as 0.6931. Other useful quantiles such as the lower and upper quartiles (0.25 and 0.75 quantiles) can be obtained in a similar way. However, what we are more likely to be concerned with are quantiles associated with data. Here, the empirical quantile Q(p) is that point on the data scale that splits the data (the empirical distribution) into two parts so that a proportion p of the observations fall below it and the rest lie above. Although each data point is itself some quantile, obtaining a specific quantile requires a more pragmatic approach. Consider the following 10 observations, which are the times (in milliseconds) between reported reversals of orientation of a Necker cube obtained from a study on visual illusions. The data have been ordered from the smallest to the largest. 274, 302, 334, 430, 489, 703, 978, 1656, 1697, 2745. Imagine that we are interested in the 0.5 and 0.85 quantiles for these data. With 10 data points, the point Q(0.5) on the data scale that has half of the observations below it and half above does not actually correspond to a member of the sample but lies somewhere between the middle two values 489 and 703. Also, we have no way of splitting off exactly 0.85 of the data. So, where do we go from here? Several different remedies have been proposed, each of which essentially entails a modest redefinition of quantile. For most practical purposes, any
differences between the estimates obtained from the various versions will be negligible. The only method that we shall consider here is the one that is also used by Minitab and SPSS (but see [1] for a description of an alternative favored by the Exploratory Data Analysis (EDA) community).

Suppose that we have n ordered observations x_i (i = 1 to n). These split the data scale into n + 1 segments: one below the smallest observation, n − 1 between each adjacent pair of values, and one above the largest. The proportion p of the distribution that lies below the ith observation is then estimated by i/(n + 1). Setting this equal to p gives i = p(n + 1). If i is an integer, the ith observation is taken to be Q(p). Otherwise, we take the integer part of i, say j, and look for Q(p) between the jth and (j + 1)th observations. Assuming simple linear interpolation, Q(p) lies a fraction (i − j) of the way from x_j to x_{j+1} and is estimated by

Q(p) = x_j + (x_{j+1} − x_j) × (i − j).   (1)
A point to note is that we cannot determine quantiles in the tails of the distribution for p < 1/(n + 1) or p > n/(n + 1), since these take us outside the range of the data. If extrapolation is unavoidable, it is safest to define Q(p) for all p values in these two regions as x_1 and x_n, respectively.

We return now to our reversal data and our quest for the 0.5 and 0.85 quantiles. For the 0.5 quantile, i = 0.5 × 11 = 5.5, which is not an integer, so Q(0.5) = 489 + (703 − 489) × (5.5 − 5) = 596, which is exactly half way between the fifth and sixth observations. For the 0.85 quantile, i = 0.85 × 11 = 9.35, so the required value will lie 0.35 of the way between the ninth and tenth observations, and is estimated by Q(0.85) = 1697 + (2745 − 1697) × (9.35 − 9) = 2063.8.

From a plot of the empirical quantiles Q(p), that is, the data, against proportion p as in Figure 1, we can get some feeling for the distributional properties of our sample. For example, we can read off rough values of important quantiles such as the median and quartiles. Moreover, the increasing steepness of the slope on the right-hand side indicates a lower density of data points in that region and thus alerts us to skewness and possible outliers. Note that we have also defined Q(p) = x_1 when p < 1/11 and Q(p) = x_10 when p > 10/11. Clearly, extrapolation would not be advisable, especially at the upper end.
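The rule just described is easy to implement directly. A minimal Python sketch (the function name is ours, not part of any package) applied to the reversal times, reproducing the two worked values:

```python
def quantile(data, p):
    """Empirical quantile Q(p) using the i = p(n + 1) rule with linear interpolation.
    Values of p outside [1/(n+1), n/(n+1)] return the smallest or largest observation."""
    x = sorted(data)
    n = len(x)
    i = p * (n + 1)
    if i <= 1:
        return x[0]
    if i >= n:
        return x[-1]
    j = int(i)                                   # integer part of i
    return x[j - 1] + (x[j] - x[j - 1]) * (i - j)

reversals = [274, 302, 334, 430, 489, 703, 978, 1656, 1697, 2745]
print(quantile(reversals, 0.50))   # 596.0, half way between the fifth and sixth observations
print(quantile(reversals, 0.85))   # approximately 2063.8, 0.35 of the way from the ninth to the tenth
```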
Figure 1  Quantile plot for Necker cube reversal times (the data, that is, the empirical quantiles Q(p), plotted against the proportion p)
It is possible to obtain confidence intervals for quantiles, and information on how these are constructed can be found, for example, in [2]. Routines for these procedures are available within certain statistical packages such as SAS and S-PLUS. Quantiles play a central role within exploratory data analysis. The median (Q(0.5)), upper and lower quartiles (Q(0.75) and Q(0.25)), and the maximum and minimum values, which constitute the so-called five number summary, are the basic elements of a box plot. An empirical quantile–quantile plot (EQQ) is a scatterplot of the paired empirical quantiles of
two samples of data, while the symmetry plot is a scatterplot of the paired empirical quantiles from a single sample, which has been split and folded at the median. Of course, for these latter plots there is no need for the p values to be explicitly identified. Finally, it is worth noting that the term quantile is a general one referring to the class of statistics that split a distribution according to a proportion p that can be any value between 0 and 1. As we have seen, well-known measures that have specific p values, such as the median, quartile, and percentile (on Galton’s original idea of splitting by 100ths [3]), belong to this group. Nowadays, however, quantile and percentile are used almost interchangeably: the only difference is whether p is expressed as a decimal fraction or as a percentage.
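The EQQ idea can be sketched in a few lines. The samples below are simulated and hypothetical; a recent NumPy (1.22 or later) is assumed, since np.quantile's 'weibull' method corresponds to the i/(n + 1) rule described earlier:

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.exponential(1.0, 40)    # hypothetical sample 1
y = rng.exponential(1.3, 60)    # hypothetical sample 2 (different size and scale)

# An EQQ plot pairs Q(p) from the two samples at a common set of proportions p.
p = np.arange(1, 41) / (40 + 1)                 # p = i/(n + 1), as in the text
qx = np.quantile(x, p, method="weibull")        # the interpolation rule described above
qy = np.quantile(y, p, method="weibull")
pairs = np.column_stack([qx, qy])               # plot qy against qx for the EQQ display
print(pairs[:5])
```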
References
[1] Chambers, J.M., Cleveland, W.S., Kleiner, B. & Tukey, P.A. (1983). Graphical Methods for Data Analysis, Duxbury, Boston.
[2] Conover, W.J. (1999). Practical Nonparametric Statistics, 3rd Edition, Wiley, New York.
[3] Galton, F. (1885). Anthropometric percentiles, Nature 31, 223–225.
PAT LOVIE
Quantitative Methods in Personality Research
R. CHRIS FRALEY AND MICHAEL J. MARKS
Volume 3, pp. 1637–1641
Quantitative Methods in Personality Research

The aims of personality psychology are to identify the ways in which people differ from one another and to elucidate the psychological mechanisms that generate and sustain those differences. The study of personality is truly multidisciplinary in that it draws upon data and insights from disciplines as diverse as sociology, social psychology, cognitive science, development, physiology, genetics, clinical psychology, and evolutionary biology. As with other subdisciplines within psychology, quantitative methods play a crucial role in personality theory and research. In this entry, we identify quantitative methods that have facilitated discovery, assessment, and model testing in personality science.
Quantitative Methods as a Tool for Discovery

There are hundreds, if not thousands, of ways that people differ from one another in their thoughts, motives, behaviors, and emotional experiences. Some people are highly creative and prolific, such as Robert Frost, who penned 11 books of poetic works, and J. S. Bach, who produced over 1000 masterpieces during his lifetime. Other people have never produced a great piece of musical or literary art and seem incapable of understanding the difference between a sonnet and sonata. One of the fundamental goals of personality psychology is to map the diverse ways in which people differ from one another.

A guiding theme in this taxonomic work is that the vast number of ways in which people differ can be understood using a smaller number of organizing variables. This theme is rooted in the empirical observation that people who tend to be happier in their personal relationships, for example, also take more risks and tend to experience higher levels of positive affect. The fact that such characteristics – qualities that differ in their surface features – tend to covary positively in empirical samples suggests that there is a common factor or trait underlying their covariation. Quantitative methods such as factor analysis have been used to decompose the covariation among
behavioral tendencies or person descriptors. Cattell [3], Tupes and Christal [14], and other investigators demonstrated that the covariation (see Covariance) among a variety of behavioral tendencies can be understood as deriving from a smaller number of latent factors (see Latent Variable). In contemporary personality research, most investigators focus on five factors derived from factor analytic considerations (extraversion, agreeableness, conscientiousness, neuroticism, openness to experience). These ‘Big Five’ are considered by many researchers to represent the fundamental trait dimensions of personality [8, 16]. Although factor analysis has primarily been used to decompose the structure of variable × variable correlation matrices (i.e., matrices of correlations between variables, computed for N people), factor analytic methods have also been used to understand the structure of person × person correlation matrices (i.e., matrices of correlations between people, computed for k variables). In Q-factor analysis, a person × person correlation matrix is decomposed so that the factors represent latent profiles or ‘prototypes,’ and the loadings represent the influence of the latent prototypes on each person’s profile of responses across variables [17]. Q-factor analysis is used to develop taxonomies of people, as opposed to taxonomies of variables [11]. Two individuals are similar to one another if they share a similar profile of trait scores or, more importantly, if they have high loadings on the same latent prototype. Although regular factor analysis and Q-factor analysis are often viewed as providing different representations of personality structure [11], debates about the extent to which these different approaches yield divergent and convergent sources of information on personality structure have never been fully resolved [2, 13].
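The distinction between the two kinds of correlation matrices can be made concrete with a small sketch. This is a principal-components simplification of the Q-factor idea applied to hypothetical data (NumPy assumed), not the full factor model the entry describes:

```python
import numpy as np

rng = np.random.default_rng(6)
scores = rng.normal(0, 1, size=(100, 8))       # hypothetical: 100 people, 8 trait descriptors

# Regular ("R") analysis starts from the variable x variable correlation matrix;
# Q-factor analysis starts from the person x person correlation matrix, obtained by
# correlating rows (people) rather than columns (variables).
r_matrix = np.corrcoef(scores, rowvar=False)   # 8 x 8 correlations among variables
q_matrix = np.corrcoef(scores, rowvar=True)    # 100 x 100 correlations among people

# Eigen-decomposing the Q matrix: the leading eigenvectors play the role of latent
# person prototypes, and each person's entries are loadings on those prototypes.
eigvals, eigvecs = np.linalg.eigh(q_matrix)
prototypes = eigvecs[:, ::-1][:, :2]           # loadings on the first two prototypes
print(r_matrix.shape, q_matrix.shape, prototypes.shape)
```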
Quantitative Methods in Personality Assessment and Measurement

One of the most important uses of mathematics and statistics in personality research is for measurement. Personality researchers rely primarily upon classical test theory (CTT) (see Classical Test Models) as a general psychometric framework for psychological measurement. Although CTT has served the field well, researchers are slowly moving toward modern test theory approaches such as item response theory (IRT). IRT is a model-based approach that relates
variation in a latent trait to the probability of an item or behavioral response. IRT holds promise for personality research because it assesses measurement precision differentially along different levels of a trait continuum, separates item characteristics from person characteristics, and assesses the extent to which a person’s item response pattern deviates from that assumed by the measurement model [5]. This latter feature of IRT is important for debates regarding the problem of ‘traitedness’ – the degree to which a given trait domain is (or is not) relevant for characterizing a person’s behavior. By adopting IRT procedures, it is possible to scale individuals along a latent trait continuum while simultaneously assessing the extent to which the assumed measurement model is appropriate for the individual in question [10].

Another development in quantitative methodology that is poised to influence personality research is the development of taxometric procedures – methods used to determine whether a latent variable is categorical or continuous. Academic personality researchers tend to conceptualize variables as continua, whereas clinical psychologists tend to treat personality variables as categories. Taxometric procedures developed by Meehl and his colleagues are designed to test whether measurements behave in a categorical or continuous fashion [15]. These procedures have been applied to the study of a variety of clinical syndromes such as depression and to nonclinical variables such as sexual orientation [7].
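A small sketch of one common IRT model, the two-parameter logistic (2PL), may help make the idea of an item response function concrete; the item parameters and the use of matplotlib are our own illustrative choices, not taken from the entry:

```python
import numpy as np
import matplotlib.pyplot as plt

def p_endorse(theta, a, b):
    """2PL item response function: probability of endorsing an item given latent
    trait theta, item discrimination a, and item location (difficulty) b."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

theta = np.linspace(-3, 3, 200)
# Hypothetical items: a discriminating item near the trait mean and a weaker item
# located higher on the trait continuum (precision differs along the continuum).
for a, b in [(1.8, 0.0), (0.7, 1.5)]:
    plt.plot(theta, p_endorse(theta, a, b), label=f"a = {a}, b = {b}")
plt.xlabel("Latent trait (theta)")
plt.ylabel("P(endorse item)")
plt.legend()
plt.show()
```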
Quantitative Methods in Testing Alternative Models of Personality Processes

Given that most personality variables cannot be experimentally manipulated, many personality researchers adopt quasi-experimental and longitudinal approaches (see Longitudinal Data Analysis) to study personality processes. The most common way of modeling such data is to use multiple linear regression. Multiple regression is a statistical procedure for modeling the influence of two or more (possibly correlated) variables on an outcome. It is widely used in personality research because of its flexibility (e.g., its ability to handle both categorical and continuous predictor variables and its ability to model multiplicative terms). One of the most important developments in the use of regression for studying personality processes was
Baron and Kenny’s (1986) conceptualization of moderation and mediation [1]. A variable moderates the relationship between two other variables when it statistically interacts with one of them to influence the other. For example, personality researchers discovered that the influence of adverse life events on depressive symptoms is moderated by cognitive vulnerability, such that people who tend to make negative attributions about their experiences are more likely to develop depressive symptoms following a negative life event (e.g., failing an exam) than people who do not make such attributions [6]. Hypotheses about moderation are tested by evaluating the interaction term in a multiple regression analysis.

A variable mediates the association between two other variables when it provides a causal pathway through which the impact of one variable is transmitted to another. Mediational processes are tested by examining whether or not the estimated effect of one variable on another is diminished when the conjectured mediator is included in the regression equation. For example, Sandstrom and Cramer [12] demonstrated that the moderate association between social status (e.g., the extent to which one is preferred by one’s peers) and the use of psychological defense mechanisms after an interpersonal rejection is substantially reduced when changes in stress are statistically controlled. This suggests that social status has its effect on psychological defenses via the amount of stress that a rejected person experiences. In sum, the use of simple regression techniques to examine moderation and mediation enables researchers to test alternative models of personality processes.

During the past 20 years, an increasing number of personality psychologists began using structural equation modeling (SEM) to formalize and test causal models of personality processes. SEM has been useful in personality research for at least two reasons. First, the process of developing a quantitative model of psychological processes requires researchers to state their assumptions clearly. Moreover, once those assumptions are formalized, it is possible to derive quantitative predictions that can be empirically tested. Second, SEM provides researchers with an improved, if imperfect, way to separate the measurement model (i.e., the hypothesis about how latent variables are manifested via behavior or self-report) from the causal processes of interest (i.e., the causal influences among the latent variables).
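Before turning to applications of SEM, the two regression-based tests just described can be sketched in a few lines. The simulated variables and effect sizes are hypothetical, and statsmodels is assumed to be available; this illustrates the logic only, not the analyses reported in the cited studies:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(5)
n = 400

events = rng.normal(0, 1, n)           # negative life events
vulner = rng.normal(0, 1, n)           # cognitive vulnerability (the moderator)
symptoms = 0.3 * events + 0.2 * vulner + 0.4 * events * vulner + rng.normal(0, 1, n)

rejection = rng.normal(0, 1, n)        # interpersonal rejection
stress = 0.6 * rejection + rng.normal(0, 1, n)     # the conjectured mediator
defense = 0.5 * stress + rng.normal(0, 1, n)       # defense mechanism use

df = pd.DataFrame({"events": events, "vulner": vulner, "symptoms": symptoms,
                   "rejection": rejection, "stress": stress, "defense": defense})

# Moderation: the interaction term events:vulner carries the moderation hypothesis.
moderation = smf.ols("symptoms ~ events * vulner", data=df).fit()
print(moderation.params["events:vulner"], moderation.pvalues["events:vulner"])

# Mediation: the rejection coefficient shrinks toward zero once the mediator is added.
total = smf.ols("defense ~ rejection", data=df).fit()
direct = smf.ols("defense ~ rejection + stress", data=df).fit()
print(total.params["rejection"], direct.params["rejection"])
```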
One of the most widely used applications of SEM in personality is in behavior genetic research with samples of twins (see Twin Designs). Structural equations specify the causal relationships among genetic sources of variation, phenotypic variation, and both shared and nonshared nongenetic sources of variation. By specifying models and estimating parameters with behavioral genetic data, researchers have made progress in testing alternative models of the causes of individual differences [9]. Structural equations are also used in longitudinal research (see Longitudinal Data Analysis) to model and test alternative hypotheses about the way that personality variables influence specific outcomes (e.g., job satisfaction) over time [4].

Hierarchical linear modeling (HLM) (see Hierarchical Models) is used to model personality data that can be analyzed across multiple levels (e.g., within persons or within groups of persons). For example, in ‘diary’ research, researchers may assess people’s moods multiple times over several weeks. Data gathered in this fashion are hierarchical because the daily observations are nested within individuals. As such, it is possible to study the factors that influence variation in mood within a person, as well as the factors that influence mood between people. In HLM, the within-person parameters and the between-person parameters are estimated simultaneously, thereby providing an efficient way to model complex psychological processes (see Linear Multilevel Models).
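A minimal sketch of the diary-style multilevel analysis described above, using a random-intercept mixed model; the data are simulated and hypothetical, and statsmodels is assumed:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)

# Hypothetical diary data: 50 people rate mood on 14 occasions; people differ in their
# average mood (a between-person random intercept) and mood covaries with daily stress.
people = np.repeat(np.arange(50), 14)
stress = rng.normal(0, 1, 50 * 14)
person_mean = rng.normal(0, 1, 50)[people]
mood = 5 + person_mean - 0.5 * stress + rng.normal(0, 1, 50 * 14)
diary = pd.DataFrame({"person": people, "stress": stress, "mood": mood})

# Within-person (stress slope) and between-person (intercept variance) parameters
# are estimated simultaneously.
fit = smf.mixedlm("mood ~ stress", data=diary, groups=diary["person"]).fit()
print(fit.summary())
```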
The Future of Quantitative Methods in Personality Research

As with many areas of behavioral research, the statistical methods used by personality researchers tend to lag behind the quantitative state of the art. To demonstrate this point, we constructed a snapshot of the quantitative methods in contemporary personality research by reviewing 259 articles from the 2000 to 2002 issues of the Journal of Personality and the Personality Processes and Individual Differences section of the Journal of Personality and Social Psychology. Table 1 identifies the frequency of statistical methods used. As can be seen, although newer and potentially valuable methods such as SEM, HLM, and IRT are used in personality research, they are greatly overshadowed by towers of the quantitative past such as ANOVA. It is our hope that future researchers will explore the benefits of newer quantitative methods for understanding the nature of personality.

Table 1  Quantitative methods used in contemporary personality research

Technique                                               Frequency
Correlation (zero-order)                                175
ANOVA                                                   90
t Test                                                  87
Multiple regression                                     72
Factor analysis                                         47
Structural equation modeling/path analysis              41
Mediation/moderation                                    20
Chi-square                                              19
ANCOVA                                                  16
Hierarchical linear modeling and related techniques     14
MANOVA                                                  13
Profile similarity and Q-sorts                          9
Growth curve analysis                                   3
Multidimensional scaling                                3
Item response theory                                    3
Taxometrics                                             2

Note: Frequencies refer to the number of articles that used a specific quantitative method. Some of the 259 articles used more than one method.
References
[1] Baron, R.M. & Kenny, D.A. (1986). The moderator-mediator variable distinction in social psychological research: conceptual, strategic and statistical considerations, Journal of Personality and Social Psychology 51, 1173–1182.
[2] Burt, C. (1937). Methods of factor analysis with and without successive approximation, British Journal of Educational Psychology 7, 172–195.
[3] Cattell, R.B. (1945). The description of personality: principles and findings in a factor analysis, American Journal of Psychology 58, 69–90.
[4] Fraley, R.C. (2002). Attachment stability from infancy to adulthood: meta-analysis and dynamic modeling of developmental mechanisms, Personality and Social Psychology Review 6, 123–151.
[5] Fraley, R.C., Waller, N.G. & Brennan, K.G. (2000). An item response theory analysis of self-report measures of adult attachment, Journal of Personality and Social Psychology 78, 350–365.
[6] Hankin, B.L. & Abramson, L.Y. (2001). Development of gender differences in depression: an elaborated cognitive vulnerability-transactional stress theory, Psychological Bulletin 127, 773–796.
[7] Haslam, N. & Kim, H.C. (2002). Categories and continua: a review of taxometric research, Genetic, Social, and General Psychology Monographs 128, 271–320.
[8] John, O.P. & Srivastava, S. (1999). The big five trait taxonomy: history, measurement, and theoretical perspectives, in Handbook of Personality: Theory and Research, 2nd Edition, L.A. Pervin & O.P. John, eds, Guilford Press, New York, pp. 102–138.
[9] Loehlin, J.C., Neiderhiser, J.M. & Reiss, D. (2003). The behavior genetics of personality and the NEAD study, Journal of Research in Personality 37, 373–387.
[10] Reise, S.P. & Waller, N.G. (1993). Traitedness and the assessment of response pattern scalability, Journal of Personality and Social Psychology 65, 143–151.
[11] Robins, R.W., John, O.P. & Caspi, A. (1998). The typological approach to studying personality, in Methods and Models for Studying the Individual, R.B. Cairns, L. Bergman & J. Kagan, eds, Sage Publications, Thousand Oaks, pp. 135–160.
[12] Sandstrom, M.J. & Cramer, P. (2003). Girls’ use of defense mechanisms following peer rejection, Journal of Personality 71, 605–627.
[13] Stephenson, W. (1952). Some observations on Q technique, Psychological Bulletin 49, 483–498.
[14] Tupes, E.C. & Christal, R.C. (1961). Recurrent personality factors based on trait ratings, Technical Report No. ASD-TR-61-97, U.S. Air Force, Lackland Air Force Base.
[15] Waller, N.G. & Meehl, P.E. (1998). Multivariate Taxometric Procedures: Distinguishing Types from Continua, Sage Publications, Newbury Park.
[16] Wiggins, J.S., ed. (1996). The Five-Factor Model of Personality: Theoretical Perspectives, Guilford Press, New York.
[17] York, K.L. & John, O.P. (1992). The four faces of Eve: a typological analysis of women’s personality at midlife, Journal of Personality and Social Psychology 63, 494–508.
R. CHRIS FRALEY AND MICHAEL J. MARKS
Quartiles
DAVID CLARK-CARTER
Volume 3, pp. 1641–1641
Quartiles

The first quartile (often shown as Q_1) is the number below which lie a quarter of the scores in the set, and the third quartile (Q_3) is the number that has three-quarters of the numbers in the set below it. The second quartile (Q_2) is the median. The first quartile is the same as the 25th percentile and the third quartile is the same as the 75th percentile. As with medians, the ease of calculating quartiles depends on whether the sample can be straightforwardly divided into quarters. The equation for finding the rank of any given quartile is:

Q_x = x × (n + 1) / 4,   (1)
where x is the quartile we are calculating and n is the sample size (see Quantiles for a general formula). As an example, if there are 11 numbers in the set and we want to know the rank of the first quartile, then it is 1 × (11 + 1)/4 = 3. We can then find the third number in the ordered set and it will lie on the first quartile; the third quartile will have the rank 9. On the other hand, if the above equation does not produce an exact rank, that is, when n + 1 is not divisible by 4, then the value of the quartile will be found between the two values whose ranks lie on either side of the quartile that is sought. For example, in the set 2, 4, 6, 9, 10, 11, 14, 16, 17, there are 9
numbers. (n + 1)/4 = 2.5, which does not give us a rank in the set. Therefore, the first quartile lies between the second and third numbers in the set, that is, between 4 and 6. Taking the mean of the two numbers produces a first quartile of 5. Interpolation can be used to find the first and third quartiles when data are grouped into frequencies for given ranges, for example age ranges, in a similar way to that used to find the median in such circumstances. Quartiles have long been used in graphical displays of data, for example, by Galton in 1875 [1]. More recently, their use has been advocated by those arguing for exploratory data analysis (EDA) [2], and they are important elements of box plots. Quartiles are also the basis of measures of spread such as the interquartile range (or midspread) and the semi-interquartile range (or quartile deviation).
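The rank rule in equation (1) is easy to compute directly. A minimal sketch, using linear interpolation between neighbouring ordered values when the rank is fractional (the simple mean used in the example above is the special case of a rank ending in .5); the function name is ours:

```python
def quartile(data, which):
    """Rank-based quartile from equation (1): rank = which * (n + 1) / 4."""
    x = sorted(data)
    n = len(x)
    rank = which * (n + 1) / 4
    j = int(rank)
    if rank == j:                         # the rank points directly at an observation
        return x[j - 1]
    return x[j - 1] + (x[j] - x[j - 1]) * (rank - j)

values = [2, 4, 6, 9, 10, 11, 14, 16, 17]
print(quartile(values, 1))   # 5.0: the first quartile lies midway between 4 and 6
print(quartile(values, 3))   # third quartile, rank 3 * 10 / 4 = 7.5, midway between 14 and 16
```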
References
[1] Galton, F. (1875). Statistics by intercomparison, with remarks on the law of frequency of error, Philosophical Magazine, 4th series, 49, 33–46. Cited in Stigler, S.M. (1986). The History of Statistics: The Measurement of Uncertainty Before 1900, Belknap Press, Cambridge.
[2] Tukey, J.W. (1977). Exploratory Data Analysis, Addison-Wesley, Reading.
DAVID CLARK-CARTER
Quasi-experimental Designs
WILLIAM R. SHADISH AND JASON K. LUELLEN
Volume 3, pp. 1641–1644
Quasi-experimental Designs

Campbell and Stanley [1] described a theory of quasi-experiments that was revised and expanded in Cook and Campbell [2] and Shadish, Cook, and Campbell [4]. Quasi-experiments, like all experiments, test hypotheses about the effects of manipulable treatments. However, quasi-experiments lack the process of random assignment of study units¹ to conditions. Rather, some nonrandom assignment mechanism is used. For example, units might be allowed to choose treatment for themselves, or treatment might be assigned on the basis of need. Many other nonrandom mechanisms are possible (see Randomization).

As background, there are three basic requirements for establishing a causal relationship between a treatment and an effect. First, the treatment must precede the effect. This temporal sequence is met when the experimenter or someone else first introduces the treatment, with the outcome observed afterward. Second, the treatment must be related to the effect. This is usually measured by one of many statistical procedures. Third, aside from treatment, there must be no plausible alternative explanations for the effect. This third requirement is often facilitated by random assignment, which ensures that alternative explanations are no more likely in the treatment than in the control group. Lacking random assignment in quasi-experiments, it is harder to rule out these alternative explanations.

Alternative explanations for the observed effect are called threats to internal validity [1]. They include:

1. Ambiguous temporal precedence: Lack of clarity about which variable occurred first may yield confusion about which variable is cause and which is effect.
2. Selection: Systematic differences over conditions in respondent characteristics that could also cause the observed effect.
3. History: Events occurring concurrently with treatment that could cause the observed effect.
4. Maturation: Naturally occurring changes over time that could mimic a treatment effect.
5. Regression: When units are selected for their extreme scores, they will often have less extreme scores on other measures, which can mimic a treatment effect.
6. Attrition: Loss of respondents to measurement can produce artifactual effects if reasons for attrition vary by conditions.
7. Testing: Exposure to a test can affect scores on subsequent exposures to that test, which could mimic a treatment effect.
8. Instrumentation: The nature of a measure may change over time or conditions in a way that could mimic a treatment effect.
9. Additive and interactive effects of threats to internal validity: The impact of a threat can be added to, or may depend on the level of, another threat.
When designing quasi-experiments, researchers should attempt to identify which of these threats to causal inference are plausible and reduce their plausibility with design elements. Some common designs are presented shortly. Note, however, that simply choosing one of these designs does not guarantee the validity of inferences. Further, the choice of design might improve internal validity but simultaneously hamper external, construct, or statistical conclusion validity (see External Validity) [4].
Some Basic Quasi-experimental Designs

In a common notation system used to present these designs, treatments are represented by the symbol X, observations or measurements by O, and assignment mechanisms by NR for nonrandom assignment or C for cutoff-based assignment. The symbols are ordered from left to right following the temporal sequence of events. Subscripts denote variations in measures or treatments. The descriptions of these designs are brief and represent but a few of the designs presented in [4].
The One-group Posttest Only Design

This simple design involves one posttest observation on respondents (O1) who experienced a treatment (X). We diagram it as:

    X   O1

This is a very weak design. Aside from ruling out ambiguous temporal precedence, nearly all other threats to internal validity are usually plausible.
The Nonequivalent Control Group Design With Pretest and Posttest

This common quasi-experiment adds a pretest and control group to the preceding design:

    NR   O1   X   O2
    ----------------
    NR   O1        O2

The additional design elements prove very useful. The pretest helps determine whether people changed from before to after treatment. The control group provides an estimate of the counterfactual (i.e., what would have happened to the same participants in the absence of treatment). The joint use of a pretest and a control group serves several purposes. First, one can measure the size of the treatment effect as the difference between the treatment and control outcomes. Second, by comparing pretest observations, we can explore potential selection bias. This design allows for various statistical adjustments for pretest group differences when found. Selection is the most plausible threat to this design, but other threats (e.g., history) often apply too (see Nonequivalent Group Design).
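One common statistical adjustment for a pretest difference is to regress the posttest on the pretest and a group indicator (an analysis-of-covariance style adjustment). A minimal sketch on simulated, hypothetical data, assuming statsmodels; it is one of several possible adjustments, not the only defensible analysis for nonequivalent groups:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
n = 200
group = np.repeat([0, 1], n)                     # 0 = control, 1 = treatment
pre = rng.normal(50 + 3 * group, 10)             # groups already differ at pretest (selection)
post = 5 + 0.8 * pre + 4 * group + rng.normal(0, 5, 2 * n)
df = pd.DataFrame({"group": group, "pre": pre, "post": post})

fit = smf.ols("post ~ pre + group", data=df).fit()
print(fit.params["group"])   # pretest-adjusted treatment effect (simulated true value 4)
```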
Interrupted Time Series Quasi-experiments

The interrupted time-series design can be powerful for assessing treatment impact. It uses repeated observations of the same variable over time. The researcher looks for an interruption in the slope of the time series at the point when treatment was implemented. A basic time-series design with one treatment group and 10 observations might be diagrammed as:

    O1 O2 O3 O4 O5 X O6 O7 O8 O9 O10

However, it is preferable to have 100 observations or more, allowing use of certain statistical procedures to obtain better estimates of the treatment effect. History is the most likely threat to internal validity with this design.
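A simple way to look for the interruption is a segmented regression with terms for the change in level and in slope at the intervention point; the sketch below uses simulated data, assumes statsmodels, and ignores the serial dependence that a fuller time-series analysis would model:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)

# Hypothetical series: 60 pre- and 60 post-intervention observations, a gentle trend,
# and a level shift of 4 units when the treatment is introduced.
t = np.arange(120)
after = (t >= 60).astype(int)
y = 10 + 0.05 * t + 4 * after + rng.normal(0, 1, 120)
df = pd.DataFrame({"t": t, "after": after,
                   "time_since": np.where(after == 1, t - 60, 0), "y": y})

# 'after' captures the level change at the interruption; 'time_since' any slope change.
fit = smf.ols("y ~ t + after + time_since", data=df).fit()
print(fit.params[["after", "time_since"]])
```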
The Regression Discontinuity Design

For the regression discontinuity design, treatment assignment is based on a cutoff score on an assignment variable measured prior to treatment. A simple two-group version of this design, where those scoring on one side of the cutoff are assigned to treatment and those scoring on the other side are assigned to control, is diagrammed as:

    OA   C   X   O
    OA   C       O

where the subscript A denotes a pretreatment measure of the assignment variable. Common assignment variables are measures of need or merit. The logic of this design can be illustrated by fitting a regression line to a scatterplot of the relationship between assignment variable scores (horizontal axis) and outcome scores (vertical axis). When a treatment has no effect, the regression line should be continuous. However, when a treatment has an effect, the regression line will show a discontinuity at the cutoff score corresponding to the size of the treatment effect. Differential attrition poses a threat to internal validity with this design, and failure to model the shape of the regression line properly can result in biased results.
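The regression logic just described can be sketched directly: regress the outcome on the centered assignment variable and a treatment indicator, and read the treatment effect off the discontinuity at the cutoff. The data are simulated and hypothetical, and statsmodels is assumed:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)

# Hypothetical setup: assignment variable A (e.g., a need score), cutoff at 50,
# units scoring below the cutoff receive the treatment, true effect = 5.
n = 300
A = rng.uniform(0, 100, n)
treated = (A < 50).astype(int)
outcome = 20 + 0.3 * A + 5 * treated + rng.normal(0, 3, n)
df = pd.DataFrame({"A_c": A - 50, "treated": treated, "outcome": outcome})

fit = smf.ols("outcome ~ A_c + treated", data=df).fit()
print(fit.params["treated"])   # estimated discontinuity, close to the simulated effect of 5
```

Allowing the slope to differ on either side of the cutoff (for example, "outcome ~ A_c * treated") is one way to guard against the misspecification of the regression line that the entry warns about.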
Conclusion

Quasi-experiments are simply a collection of design elements aimed at reducing threats to causal inference. Reynolds and West [3] combined design elements in an exemplary way. They assessed the effects of a campaign to sell lottery tickets using numerous design elements, including untreated matched controls, multiple pretests and posttests, nonequivalent dependent variables, and removed and repeated treatments. Space does not permit us to discuss a number of other design elements and the ways in which they reduce threats to validity (but see [4]). Finally, recent developments in the analysis of quasi-experiments have been made, including propensity score analysis, hidden bias analysis, and selection bias modeling (see [4], Appendix 5.1).

Note
1. Units are often persons but may be animals, plots of land, schools, time periods, cities, classrooms, or teachers, among others.

References
[1] Campbell, D.T. & Stanley, J.C. (1963). Experimental and Quasi-Experimental Designs for Research, Rand McNally, Chicago.
[2] Cook, T.D. & Campbell, D.T. (1979). Quasi-Experimentation: Design and Analysis Issues for Field Settings, Houghton-Mifflin, Boston.
[3] Reynolds, K.D. & West, S.G. (1987). A multiplist strategy for strengthening nonequivalent control group designs, Evaluation Review 11, 691–714.
[4] Shadish, W.R., Cook, T.D. & Campbell, D.T. (2002). Experimental and Quasi-Experimental Designs for Generalized Causal Inference, Houghton-Mifflin, Boston.
WILLIAM R. SHADISH AND JASON K. LUELLEN
Quasi-independence
SCOTT L. HERSHBERGER
Volume 3, pp. 1644–1647
Quasi-independence

Independence Model

One of the most basic questions that can be answered by a contingency table is whether the row and column variables are independent; that is, whether a statistical association is present between the variables [5]. For example, assume that two variables, X and Y (see Table 1), each with two outcomes, are measured on a sample of respondents. In this 2 × 2 table, n_{11} is the number of observations with levels X = 1 and Y = 1, n_{12} is the number of observations with levels X = 1 and Y = 2, n_{21} is the number of observations with levels X = 2 and Y = 1, n_{22} is the number of observations with levels X = 2 and Y = 2, and a ‘·’ in a subscript indicates summation over the variable that corresponds to that position. The cells of this 2 × 2 table can also be expressed as probabilities (see Table 2). If X and Y are independent, then it is true that

p(X = j and Y = k) = p(X = j) × p(Y = k)   (1)

for all j and k. A test of the independence of X and Y would test the following null and alternative hypotheses:

H_0: p_{jk} = p_{j.} p_{.k}  versus  H_1: p_{jk} ≠ p_{j.} p_{.k}.   (2)
Table 1  Observed frequencies in a 2 × 2 contingency table

                    Y
X           1         2         Total
1           n_{11}    n_{12}    n_{1.}
2           n_{21}    n_{22}    n_{2.}
Total       n_{.1}    n_{.2}    n_{..}

Table 2  Observed probabilities in a 2 × 2 contingency table

                    Y
X           1         2         Total
1           p_{11}    p_{12}    p_{1.}
2           p_{21}    p_{22}    p_{2.}
Total       p_{.1}    p_{.2}    p_{..}

When the data are organized in a J × K contingency table, often a simple test, such as the Pearson χ² test of association, will suffice to accept or reject the null hypothesis of independence [1]. The χ² test of association examines the closeness of the observed cell frequencies (n_{jk}) to the cell frequencies expected (n̂_{jk}) under the null hypothesis – the cell frequencies that would be observed if X and Y are independent. Under the null hypothesis of independence, the expected frequencies are calculated as

n̂_{jk} = n_{..} p̂_{j.} p̂_{.k},   (3)

which, when the marginal probabilities are estimated from the observed margins, becomes

n̂_{jk} = n_{j.} n_{.k} / n_{..}.   (4)

The χ² test of association is defined by

χ² = Σ_j Σ_k (n_{jk} − n̂_{jk})² / n̂_{jk},   (5)

with degrees of freedom equal to (J − 1) × (K − 1).

As an example of a test of independence, consider the 5 × 2 table of gambling frequency and male sexual orientation adapted from Hershberger and Bogaert [3] (see Table 3). The expected frequency of each cell is given in parentheses. For example, the expected frequency of male homosexuals who reported ‘little’ gambling is 39.56:

n̂_{31} = n_{3.} n_{.1} / n_{..} = (345)(1051)/9166 = 39.56.   (6)

Table 3  Gambling frequency and male sexual orientation

                            Sexual orientation
Gambling frequency      Homosexual        Heterosexual        Total
None                    714 (679.82)      5215 (5249.20)      5929
Rare                    227 (291.90)      2319 (2254.10)      2546
Little                   33 (39.56)        312 (305.44)        345
Some                     74 (38.30)        260 (295.70)        334
Much                      3 (1.38)           9 (10.62)          12
Total                   1051              8115                9166

The χ² test of association statistic is 59.24, df = 4, p < .0001, indicating that the null hypothesis of independence should be rejected: A relationship
does exist between male sexual orientation and gambling frequency.

As an alternative to the χ² test of association, we can test the independence of two variables using loglinear modeling [6]. The loglinear modeling approach is especially useful for examining the association among more than two variables in contingency tables of three or more dimensions. A loglinear model of independence may be developed from the formula provided earlier for calculating expected cell frequencies in a two-dimensional table:

n̂_{jk} = n_{..} p̂_{j.} p̂_{.k}.   (7)

Taking the natural logarithm of this product transforms it into a loglinear model of independence:

log n̂_{jk} = log n_{..} + log p̂_{j.} + log p̂_{.k}.   (8)

Thus, log n̂_{jk} depends additively on a term based on total sample size (log n_{..}), a term based on the marginal probability of row j (log p̂_{j.}), and a term based on the marginal probability of column k (log p̂_{.k}). A more common representation of the loglinear model of independence for a two-dimensional table is

log n̂_{jk} = λ + λ_j^X + λ_k^Y.   (9)

For a 2 × 2 table, the loglinear model requires the estimation of three parameters: a ‘grand mean’ represented by λ, a ‘main effect’ for variable X (λ_j^X), and a ‘main effect’ for variable Y (λ_k^Y). The number of parameters estimated depends on the number of row and column levels of the variables. In general, if variable X has J levels, then J − 1 parameters are estimated for the X main effect, and if Y has K levels, then K − 1 parameters are estimated for the Y main effect. For example, for the 5 × 2 table shown above, six parameters are estimated: λ, λ_1^X, λ_2^X, λ_3^X, λ_4^X, λ_1^Y. In addition, in order to obtain unique estimates of the parameters, constraints must be placed on the parameter values. A common set of constraints is

Σ_j λ_j^X = 0   (10)

and

Σ_k λ_k^Y = 0.   (11)

Note that the independence model for the 5 × 2 table has four degrees of freedom, obtained from the difference between the number of cells (10) and the number of parameters estimated (6). Returning to the 5 × 2 table of sexual orientation and gambling frequency, the natural logarithms of the expected frequencies are as shown in Table 4. The parameter estimates are

λ = 4.99, λ_1^X = 2.55, λ_2^X = 1.71, λ_3^X = −0.29, λ_4^X = −0.32, λ_1^Y = −1.02.   (12)

Table 4  Gambling frequency and male sexual orientation: natural logarithms of the expected frequencies

                            Sexual orientation
Gambling frequency      Homosexual      Heterosexual
None                    6.52            8.57
Rare                    5.68            7.72
Little                  3.68            5.72
Some                    3.65            5.69
Much                    1.10            2.20
To illustrate how the parameter estimates are related to the expected frequencies, consider the expected frequency for cell (1, 1):

6.52 = λ + λ_1^X + λ_1^Y = 4.99 + 2.55 − 1.02.   (13)

The likelihood ratio statistic can be used to test the fit of the independence model:

G² = 2 Σ_j Σ_k n_{jk} log(n_{jk}/n̂_{jk}) = 52.97.   (14)

The likelihood ratio statistic is asymptotically chi-square distributed. Therefore, with G² = 52.97, df = 4, p < .0001, the independence model is rejected. Note that if we incorporate a term representing the association between X and Y, λ_{jk}^{XY}, four additional degrees of freedom are used, thus leading to a saturated model [2] with df = 0 (which is not a directly testable model). However, from our knowledge that the model without λ_{jk}^{XY} does not fit the data, we know that the association between X and Y must be significant. Additionally, although the fit of the saturated model as a whole is not testable, Pearson χ² goodness-of-fit statistics are computable for the main effect of X, frequency of gambling (χ² = 2843.17, df = 4, p < .0001), implying significant differences in frequency of gambling; for the main effect of Y, sexual orientation (χ² = 158.37, df = 1, p < .0001), implying that homosexual and heterosexual men differ in frequency; and most importantly, for the association between frequency of gambling and sexual orientation (χ² = 56.73, df = 4, p < .0001), implying that homosexual and heterosexual men differ in how frequently they gamble.
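These calculations can be reproduced directly from the counts in Table 3. A minimal sketch assuming SciPy is available; the reported values should match up to rounding:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed counts from Table 3 (rows: None, Rare, Little, Some, Much;
# columns: homosexual, heterosexual men).
observed = np.array([[714, 5215],
                     [227, 2319],
                     [ 33,  312],
                     [ 74,  260],
                     [  3,    9]])

chi2, p, dof, expected = chi2_contingency(observed)
print(round(chi2, 2), dof)          # Pearson statistic of equation (5): 59.24 on df = 4
print(np.round(expected, 2))        # the expected counts shown in parentheses in Table 3

# Likelihood ratio (G^2) version of the same test; compare with equation (14).
g2, p_g2, _, _ = chi2_contingency(observed, lambda_="log-likelihood")
print(round(g2, 2))
```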
Quasi-independence Model

Whether by design or accident, there may be no observations in one or more cells of a contingency table. We can distinguish between two situations in which incomplete contingency tables can be expected [4]:

1. Structural zeros. On the basis of our knowledge of the population, we do not expect one or more combinations of the factor levels to be observed in a sample. By design, we have one or more empty cells (see Structural Zeros).
2. Sampling zeros. Although in the population all possible combinations of factor levels occur, we do not observe one or more of these combinations in our sample. By accident, we have one or more empty cells.

While sampling zeros occur from deficient sample sizes, too many factors, or too many factor levels, structural zeros occur when it is theoretically impossible for a cell to have any observations. For example, let us assume we have two factors, sex (male or female) and breast cancer (yes or no). While it is medically possible to have observations in the cell representing males who have breast cancer (male × yes), the rareness of males who have breast cancer in the population may result in no such cases appearing in our sample. On the other hand, let us say we sample both sex and the frequency of different types of cancers. While the cell representing males who have prostate cancer will have observations, it is impossible to have any observations in the cell representing females who have prostate cancer. Sampling and structural zeros should not be analytically treated the same.
While sampling zeros should contribute to the estimation of the model parameters, structural zeros should not. An independence model that is fit to a table with one or more structural zeros is called a quasi-independence model. The loglinear model of quasi-independence is

log n̂_{jk} = λ + λ_j^X + λ_k^Y + s_{jk} I,   (15)

where I is an indicator variable that equals 1 when a cell has a structural zero and equals 0 when the cell has a nonzero number of observations [1]. Note that structural zeros affect the degrees of freedom of the Pearson goodness-of-fit statistic and the likelihood ratio statistic: in a J × K table, there are only JK − s observations, where s refers to the number of cells with structural zeros.

As an example of a situation where structural zeros and sampling zeros could occur, consider a study that examined the frequency of different types of cancers among men and women (Table 5). Note that the indicator variable takes on the value of 1 for women with prostate cancer and for men with ovarian cancer (both cells are examples of structural zeros; that is, ‘impossible’ observations). However, it takes on the value of 0 even though there are no men with breast cancer, because conceivably such men could have appeared in our sample (this cell is an example of a sampling zero; that is, although rare, it is a ‘possible’ observation). The likelihood ratio statistic G² for the quasi-independence model is 24.87, df = 3, p < .0001, leading to the rejection of the model.

Table 5  Frequency of the different types of cancer among men and women

                            Sex
Type of cancer      Male            Female          Total
Throat              25              20              45
Lung                40              35              75
Prostate            45              structural 0    45
Ovarian             structural 0    10              10
Breast              sampling 0      20              20
Stomach             10              7               17
Total               120             92              212

Thus, even taking into account the impossibility of certain cancers appearing in men or women,
there is a relationship between type of cancer and sex. The importance of treating sampling zeros and structural zeros differently in the analysis is emphasized if we fit the quasi-independence model without distinguishing between the two types of zeros. In this case, the likelihood ratio statistic is .19, df = 2, p = .91, leading us to conclude erroneously that no association exists between type of cancer and sex. Also note that by incorrectly treating the category of men with breast cancer as a structural zero, we have reduced the degrees of freedom from 3 to 2.
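One standard way to fit the quasi-independence model is to fit the independence loglinear model as a Poisson regression with the structural-zero cells excluded, keeping the sampling zero as an observed count of 0. A minimal sketch assuming statsmodels; this is our illustration of the approach, not necessarily the software used for the analysis above:

```python
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Cell counts from Table 5; the two structural-zero cells are omitted entirely,
# while the sampling zero (men with breast cancer) is kept as an observed 0.
cells = pd.DataFrame(
    [("Throat", "Male", 25), ("Throat", "Female", 20),
     ("Lung", "Male", 40), ("Lung", "Female", 35),
     ("Prostate", "Male", 45),                 # (Prostate, Female) is structural: omitted
     ("Ovarian", "Female", 10),                # (Ovarian, Male) is structural: omitted
     ("Breast", "Male", 0), ("Breast", "Female", 20),
     ("Stomach", "Male", 10), ("Stomach", "Female", 7)],
    columns=["cancer", "sex", "count"])

fit = smf.glm("count ~ C(cancer) + C(sex)", data=cells,
              family=sm.families.Poisson()).fit()

# The residual deviance is the likelihood ratio statistic for quasi-independence,
# on 10 - 7 = 3 residual degrees of freedom (compare the G^2 and df reported above).
print(round(fit.deviance, 2), int(fit.df_resid))
```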
References
[1] Agresti, A. (2002). Categorical Data Analysis, 2nd Edition, John Wiley & Sons, New York.
[2] Everitt, B.S. (1992). The Analysis of Contingency Tables, 2nd Edition, Chapman & Hall, Boca Raton.
[3] Hershberger, S.L. & Bogaert, A.F. (in press). Male and female sexual orientation differences in gambling, Personality and Individual Differences.
[4] Koehler, K. (1986). Goodness-of-fit tests for log-linear models in sparse contingency tables, Journal of the American Statistical Association 81, 483–493.
[5] von Eye, A. (2002). Configural Frequency Analysis, Lawrence Erlbaum, Mahwah.
[6] Wickens, T.D. (1989). Multiway Contingency Tables Analysis for the Social Sciences, Lawrence Erlbaum, Hillsdale.
SCOTT L. HERSHBERGER
Quasi-symmetry in Contingency Tables
SCOTT L. HERSHBERGER
Volume 3, pp. 1647–1650
Quasi-symmetry in Contingency Tables

Symmetry Model for Nominal Data

The symmetry model for a square I × I contingency table assumes that the expected cell probability (π) of cell (i, j) is equal to that of cell (j, i). When there is symmetry, π_{i.} = Σ_j π_{ij} = Σ_j π_{ji} = π_{.i}, a condition which also implies marginal homogeneity (see Marginal Independence) [1]. Cell probabilities satisfy marginal homogeneity if π_{i.} = π_{.i}, in which the row and column probabilities for the same classification are equal. We can see the relationship between marginal homogeneity and symmetry by noting that the probabilities in a square table satisfy symmetry if π_{ij} = π_{ji} for all pairs of cells. Thus, a square table is symmetric when the elements on opposite sides of the main diagonal are equal. In 2 × 2 tables, tables that have symmetry must also have marginal homogeneity. However, in larger tables, marginal homogeneity can occur without symmetry [2].
Written as a log-linear model for the expected cell frequencies (e), the symmetry model is

log e_{ij} = λ + λ_i^X + λ_j^Y + λ_{ij}^{XY},   (1)

in which the constraints

λ_i^X = λ_i^Y,  λ_{ij}^{XY} = λ_{ji}^{XY}   (2)

are imposed. The fit of the symmetry model is based on

e_{ij} = e_{ji} = (n_{ij} + n_{ji})/2;   (3)

that is, the expected entry in cell (i, j) is the average of the entries on either side of the diagonal. The diagonal entries themselves are not relevant to the symmetry hypothesis and are ignored in fitting the model [9].

Example of Symmetry Model for Nominal Data

We use as an example for fitting the symmetry model data for which we wish to conduct multidimensional scaling. Multidimensional scaling models that are based on Euclidean distance assume that the distance from object a_i to object a_j is the same as that from a_j to object a_i [5]. If symmetry is confirmed, a multidimensional scaling model based on Euclidean distance can be applied to the data; if symmetry is not confirmed, then a multidimensional scaling model not based on Euclidean distance (e.g., city block metric) should be considered.

Consider the data in Table 1 from [8], in which subjects who did not know Morse code listened to pairs of 10 signals consisting of a series of dots and dashes and were required to state whether the two signals they heard were the same or different. Each number in the table is the percentage of subjects who responded that the row and column signals were the same. In the experiment, the row signal always preceded the column signal.

Table 1  Morse code data

Digit           1    2    3    4    5    6    7    8    9    0
1 (.----)      84   63   13    8   10    8   19   32   57   55
2 (..---)      62   89   54   20    5   14   20   21   16   11
3 (...--)      18   64   86   31   23   41   16   17    8   10
4 (....-)       5   26   44   89   42   44   32   10    3    3
5 (.....)      14   10   30   69   90   42   24   10    6    5
6 (-....)      15   14   26   24   17   86   69   14    5   14
7 (--...)      22   29   18   15   12   61   85   70   20   13
8 (---..)      42   29   16   16    9   30   60   89   61   26
9 (----.)      57   39    9   12    4   11   42   56   91   78
0 (-----)      50   26    9   11    5   22   17   52   81   84

In order to fit the symmetry model, a symmetry variable is created, which takes on the same value for cells (1, 2) and (2, 1), another value for cells (1, 3) and (3, 1), and so forth. Therefore, the symmetry model has [I(I − 1)/2] = [10(10 − 1)/2] = 45 degrees of freedom for goodness-of-fit. The χ² goodness-of-fit statistic for the symmetry model is 192.735, which, at 45 degrees of freedom, suggests that the symmetry model cannot be accepted. The log likelihood G² = 9314.055.

Bowker’s test of symmetry can also be used to test the symmetry model [3]. For Bowker’s test, the null hypothesis is that the probabilities in a square table satisfy symmetry, or that, for the cell probabilities, p_{ij} = p_{ji} for all pairs of table cells. Bowker’s test of symmetry is

B = Σ_{i<j} (n_{ij} − n_{ji})² / (n_{ij} + n_{ji}).   (4)

The test is asymptotically chi-square distributed for large samples, with I(I − 1)/2 degrees of freedom under the null hypothesis of symmetry. For these data, B = 108.222, p < .0001 at 45 degrees of freedom. Thus, Bowker’s test also rejects the symmetry null hypothesis. When I = 2, Bowker’s test is identical to McNemar’s test [7], which is calculated as

M = (n_{12} − n_{21})² / (n_{12} + n_{21}).   (5)

Quasi-symmetry Model for Nominal Data

The symmetry model rarely fits the data well because of the imposition of marginal homogeneity. A generalization of symmetry is the quasi-symmetry model, which permits marginal heterogeneity [4]. Marginal heterogeneity is obtained by permitting the log-linear main effect terms to differ. Under symmetry, the loglinear model is

log e_{ij} = λ + λ_i^X + λ_j^Y + λ_{ij}^{XY},   (6)

with the constraints

λ_i^X = λ_i^Y,  λ_{ij}^{XY} = λ_{ji}^{XY}.   (7)

Now under quasi-symmetry, the constraint λ_i^X = λ_i^Y is removed but the constraint λ_{ij}^{XY} = λ_{ji}^{XY} is retained. Removing the constraint λ_i^X = λ_i^Y permits marginal heterogeneity. The quasi-symmetry model can be best understood by considering the odds ratios implied by the model. The quasi-symmetry model implies that the odds ratios on one side of the main diagonal are identical to corresponding ones on the other side of the diagonal [1]. That is, for i ≠ i′ and j ≠ j′,

α = (π_{ij} π_{i′j′})/(π_{ij′} π_{i′j}) = (π_{ji} π_{j′i′})/(π_{j′i} π_{ji′}).   (8)

Adjusting for differing marginal distributions, there is a symmetric association pattern in the table. Like the symmetry model, in order to fit the quasi-symmetry model, a variable is created which takes on the same value for cells (1, 2) and (2, 1), another value for cells (1, 3) and (3, 1), and so forth. This variable takes up I(I − 1)/2 degrees of freedom. In addition, in order to allow for marginal heterogeneity, the constraint λ_i^X = λ_i^Y is removed, which results in estimating an additional I − 1 parameters. Therefore, the quasi-symmetry model has

I(I − 1)/2 − (I − 1) = (I − 1)(I − 2)/2   (9)

degrees of freedom for goodness-of-fit.

Example of Quasi-symmetry Model for Nominal Data
We return to the Morse code data presented earlier. As we did for the symmetry model, we specify a symmetry variable which takes on the same value for corresponding cells above and below the main diagonal. In addition, we incorporate the Morse code variable (the row and column factor) in order to allow for marginal heterogeneity. The model has [(10 − 1)(10 − 2)/2] = 36 degrees of freedom. The χ 2 goodness-of-fit statistic is 53.492, p < .05, which suggests that the quasi-symmetry model is rejected. The log likelihood G2 = 9382.907. We can test for marginal homogeneity directly by computing twice the difference in log likelihoods between the symmetry and quasi-symmetry models, a difference that is asymptotically χ 2 on I − 1 degrees of freedom. Twice the difference in log likelihoods is 2(9382.907 − 9314.055) = 137.704, df = 9, suggesting that marginal homogeneity is not present.
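Bowker’s statistic (4) is straightforward to compute from any square table of counts. A minimal sketch; the 3 × 3 table at the end is hypothetical and included only to show the call:

```python
import numpy as np
from scipy.stats import chi2

def bowker(n):
    """Bowker's test of symmetry for a square I x I contingency table, equation (4).
    Assumes every off-diagonal pair has n_ij + n_ji > 0."""
    n = np.asarray(n, dtype=float)
    i_up, j_up = np.triu_indices(n.shape[0], k=1)       # all pairs with i < j
    num = (n[i_up, j_up] - n[j_up, i_up]) ** 2
    den = n[i_up, j_up] + n[j_up, i_up]
    b = np.sum(num / den)
    df = len(i_up)                                       # I(I - 1)/2
    return b, df, chi2.sf(b, df)

table = np.array([[20,  8,  4],     # hypothetical counts, not the Morse code data
                  [12, 30,  6],
                  [ 2, 10, 25]])
print(bowker(table))                # statistic, df = 3, and the asymptotic P value
```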
Quasi-symmetry Model for Ordinal Data

A quasi-symmetry model for ordinal data (see Scales of Measurement) can be specified as

log e_{ij} = λ + λ_i^X + λ_j^Y + βu_j + λ_{ij}^{XY},   (10)

where u_j denotes ordered scores for the row and column categories. As in the nominal quasi-symmetry model,

λ_{ij}^{XY} = λ_{ji}^{XY},   (11)

but

λ_i^X ≠ λ_i^Y   (12)

to allow for marginal heterogeneity. In the ordinal quasi-symmetry model,

λ_j^Y − λ_j^X = βu_j,   (13)

which means that the difference in the two main effects has a linear trend across the response categories, and that this difference is a constant multiple β [1]. Setting β = 0 produces the symmetry model. Thus, because an additional parameter β is estimated in the ordinal quasi-symmetry model, the ordinal quasi-symmetry model has one less degree of freedom than the symmetry model, or df = [I(I − 1)/2] − 1.
Example of Quasi-symmetry Model for Ordinal Data

The data in Table 2 are from [6], representing the movement in occupational status from father to son, where the status variable is ordered. The ordinal quasi-symmetry model has [7(7 − 1)/2] − 1 = 20 df. The χ² goodness-of-fit statistic is 29.584, p = .077 (G² = 14,044.363), suggesting that the null hypothesis of quasi-symmetry can be accepted. Removing the constant multiple β parameter from the model results in a test of the symmetry model, with χ² = 38.007, df = 21, p < .05 (G² = 14,040.007). Although quasi-symmetry can be accepted for these data, symmetry should be rejected.

Table 2  Movement in occupational status from father to son

                            Son's status
Father's status     1     2     3     4     5     6     7
1                  50    19    26     8    18     6     2
2                  16    40    34    18    31     8     3
3                  12    35    65    66   123    23    21
4                  11    20    58   110   223    64    32
5                  14    36   114   185   714   258   189
6                   0     6    19    40   179   143    91
7                   0     3    14    32   141    91   106
References
[1] Agresti, A. (2002). Categorical Data Analysis, 2nd Edition, Wiley, New York.
[2] Bhapkar, V.P. (1979). On tests of marginal symmetry and quasi-symmetry in two- and three-dimensional contingency tables, Biometrics 35, 417–426.
[3] Bowker, A.H. (1948). Bowker’s test for symmetry, Journal of the American Statistical Association 43, 572–574.
[4] Caussinus, H. (1965). Contribution à l’analyse statistique des tableaux de corrélation [Contributions to the statistical analysis of correlation matrices], Annales de la Faculté des Sciences de l’Université de Toulouse pour les Sciences Mathématiques et les Sciences Physiques 29, 77–182.
[5] Cox, T.F. & Cox, M.A.A. (1994). Multidimensional Scaling, Chapman & Hall, London.
[6] Goodman, L.A. (1979). Simple models for the analysis of association in cross-classifications having ordered categories, Journal of the American Statistical Association 74, 537–552.
[7] McNemar, Q. (1947). Note on the sampling error of the difference between two correlated proportions, Psychometrika 12, 153–157.
[8] Rothkopf, E.Z. (1957). A measure of stimulus similarity and errors in some paired associate learning tasks, Journal of Experimental Psychology 53, 94–101.
[9] Wickens, T.D. (1989). Multiway Contingency Tables Analysis for the Social Sciences, Lawrence Erlbaum, Hillsdale.
SCOTT L. HERSHBERGER
Quetelet, Adolphe
DIANA FABER
Volume 3, pp. 1650–1651
Quetelet, Adolphe

Born: February 22, 1796, in Ghent, Belgium.
Died: February 17, 1874, in Brussels.
Quételet was born nearly fifty years after Pierre-Simon Laplace, the French mathematician and statistician. Partly thanks to Laplace, the subject of mathematical statistics, and in particular the laws of probability, became increasingly recognized as useful in their application to social phenomena. Quételet came to statistics through the exact science of astronomy, and some of his energy and ambition was channeled into his efforts to encourage the foundation of observatories. In the Belgian uprising in 1830 against Dutch control of the country, the Observatory was used as a defence of Brussels; life became difficult for scientists in the capital and some of them joined the military defence. The Belgian revolution led to the defeat of the Netherlands and the end of its control of the country, but the scientific reputation of France remained paramount and the cultural ties between the two countries survived, to some extent, intact.

In 1819, Quételet gained the first doctorate in science at the University of Ghent, on the subject of a theory of conical sections. He then taught mathematics at the Athenaeum in Brussels. In 1820, he was elected to the Académie des Sciences et Belles-Lettres de Bruxelles. In 1823, he travelled to Paris where he spent three months. Here he learned from Laplace, in an informal way, some observational astronomy and probability theory. He also learned the method of least squares, as a way of reducing astronomical observations. Soon after, he founded the journal ‘Correspondance Mathématique et Physique’, to which he contributed many articles between the years 1825 and 1839. In 1828, Quételet became the founder of the national Observatory and received the title of Royal astronomer and meteorologist. He acted as perpetual secretary of the Royal Academy from 1834 until his death in 1874.

Quételet’s avowed ambition was to become known as the ‘Newton of statistics’. In the body of his
work, he left proposals for the application of statistics to human affairs on which his achievement could be judged. The first published work of Quételet, in 1826, was ‘Astronomie élémentaire’. The second was published as a memoir of the Académie des Sciences in the following year, and reflected his interest in the application of statistics to social phenomena – ‘Recherches sur la population, les naissances, les décès, les prisons, etc’. A work on probability followed in 1828, ‘Instructions populaires sur le calcul des probabilités’. In 1835, ‘Sur l’homme et le développement de ses facultés, ou l’essai de physique sociale’ introduced the notion of ‘social physics’ by analogy with physical and mathematical science. He had already proposed the notion of ‘mécanique sociale’ as a correlate to ‘mécanique céleste’. Both terms pointed to the application of scientific method to analyzing social data. In 1846, Quételet published ‘Lettres’, which discussed the theory of probabilities as applied to moral and political sciences. Statistical laws, and how they are applied to society, was the subject of a work in 1848: ‘Du système social et des lois qui les régissent’. In 1853 appeared a second work on probability: ‘Théorie des probabilités’.

Quételet likened periodic phenomena and regularities as found in nature to changes and periodicity in human societies. He saw in the forces of mankind and social upheavals an analogue to the perturbations in nature. The name of Quételet has become strongly associated with his concept of the average man (‘l’homme moyen’), constructed from his statistics on human traits and actions. It has been widely used since, both as a statistical abstraction and in some cases as a guide, useful or otherwise, for social judgments.
Further Reading
Porter, T.M. (1986). The Rise of Statistical Thinking 1820–1900, Princeton University Press, Princeton.
Stigler, S.M. (1986). The History of Statistics: The Measurement of Uncertainty Before 1900, The Belknap Press of Harvard University Press, Harvard.
DIANA FABER