Understanding Statistics in the Behavioral Sciences
by
Roger Bakeman
Byron F. Robinson
Georgia State University
2005
LAWRENCE ERLBAUM ASSOCIATES, PUBLISHERS Mahwah, New Jersey London
Senior Editor: Debra Riegert
Editorial Assistant: Kerry Breen
Cover Design: Kathryn Houghtaling Lacey
Textbook Production Manager: Paul Smolenski
Text and Cover Printer: Hamilton Printing Company
Camera ready copy for this book was provided by the authors
Copyright © 2005 by Lawrence Erlbaum Associates, Inc. All rights reserved. No part of this book may be reproduced in any form, by photostat, microform, retrieval system, or any other means, without prior written permission of the publisher.
Lawrence Erlbaum Associates, Inc., Publishers
10 Industrial Avenue
Mahwah, New Jersey 07430
www.erlbaum.com
Library of Congress Cataloging-in-Publication Data
Bakeman, Roger.
Understanding statistics in the behavioral sciences / by Roger Bakeman and Byron F. Robinson.
p. cm.
Includes bibliographical references and index.
ISBN 0-8058-4944-0 (casebound : alk. paper)
1. Psychology—Statistical methods—Textbooks. 2. Social sciences—Statistical methods—Textbooks. 3. Psychometrics—Textbooks. I. Robinson, Byron F. II. Title.
BF39.B325 2004
150'.1'5195—dc22
2004056417 CIP
Books published by Lawrence Erlbaum Associates are printed on acid-free paper, and their bindings are chosen for strength and durability. Printed in the United States of America 10 9 8 7 6 5 4 3 2 1
Disclaimer: This eBook does not include the ancillary media that was packaged with the original printed version of the book.
Contents
Preface
1 Preliminaries: How to Use This Book
  1.1 Statistics and the Behavioral Sciences
  1.2 Computing Statistics by Hand and Computer
  1.3 An Integrated Approach to Learning Statistics
2 Getting Started: The Logic of Hypothesis Testing
  2.1 Statistics, Samples, and Populations
  2.2 Hypothesis Testing: An Introduction
  2.3 False Claims, Real Effects, and Power
  2.4 Why Discuss Inferential Before Descriptive Statistics?
3 Inferring From a Sample: The Binomial Distribution
  3.1 The Binomial Distribution
  3.2 The Sign Test
4 Measuring Variables: Some Basic Vocabulary
  4.1 Scales of Measurement
  4.2 Designing a Study: Independent and Dependent Variables
  4.3 Matching Study Designs With Statistical Procedures
5 Describing a Sample: Basic Descriptive Statistics
  5.1 The Mean
  5.2 The Variance
  5.3 The Standard Deviation
  5.4 Standard Scores
6 Describing a Sample: Graphical Techniques
  6.1 Principles of good design
  6.2 Graphical Techniques Explained
7 Inferring From a Sample: The Normal and t Distributions
  7.1 The Normal Approximation for the Binomial
  7.2 The Normal Distribution
  7.3 The Central Limit Theorem
  7.4 The t Distribution
  7.5 Single-Sample Tests
  7.6 Ninety-Five Percent Confidence Intervals
8 Accounting for Variance: A Single Predictor
  8.1 Simple Regression and Correlation
  8.2 What Accounting for Variance Means
9 Bivariate Relations: The Regression and Correlation Coefficients
  9.1 Computing the Slope and the Y Intercept
  9.2 Computing the Correlation Coefficient
  9.3 Detecting Group Differences with a Binary Predictor
  9.4 Graphing the Regression Line
10 Inferring From a Sample: The F Distribution
  10.1 Estimating Population Variance
  10.2 The F Distribution
  10.3 The F Test
  10.4 The Analysis of Variance: Two Independent Groups
  10.5 Assumptions of the F test
11 Accounting for Variance: Multiple Predictors
  11.1 Multiple Regression and Correlation
  11.2 Significance Testing With Multiple Predictors
  11.3 Accounting For Unique Additional Variance
  11.4 Hierarchic MRC and the Analysis of Covariance
  11.5 More Than Two Predictors
12 Single-Factor Between-Subjects Studies
  12.1 Coding Categorical Predictor Variables
  12.2 One-Way Analysis of Variance
  12.3 Trend Analysis
13 Planned Comparisons, Post Hoc Tests, and Adjusted Means
  13.1 Organizing Stepwise Statistics
  13.2 Planned Comparisons
  13.3 Post Hoc Tests
  13.4 Unequal Numbers of Subjects Per Group
  13.5 Adjusted Means and the Analysis of Covariance
14 Studies With Multiple Between-Subjects Factors
  14.1 Between-Subjects Factorial Studies
  14.2 Significance Testing for Main Effects And Interactions
  14.3 Interpreting Significant Main Effects and Interactions
  14.4 Magnitude of Effects and Partial Eta Squared
15 Single-Factor Within-Subjects Studies
  15.1 Within-Subjects or Repeated-Measures Factors
  15.2 Controlling Between-Subjects Variability
  15.3 Modifying the Source Table for Repeated Measures
  15.4 Assumptions of the Repeated Measure ANOVA
16 Two-Factor Studies With Repeated Measures
  16.1 One Between- and One Within-Subjects Factor
  16.2 Two Within-Subjects Factors
  16.3 Explicating Interactions With Repeated Measures
  16.4 Generalizing to More Complex Designs
17 Power, Pitfalls, and Practical Matters
  17.1 Pretest, Posttest: Repeated Measure Or Covariate?
  17.2 Power Analysis: How Many Subjects Are Enough?
References
Glossary of Symbols and Key Terms
Appendix A: SAS exercises
Appendix B: Answers To Selected Exercises
Appendix C: Statistical Tables
  A. Critical Values for the Binomial Distribution, P = 0.5
  B. Areas Under the Normal Curve
  C. Critical Values for the t Distribution
  D.1 Critical Values for the F Distribution, α = .05
  D.2 Critical Values for the F Distribution, α = .01
  E.1 Distribution of the Studentized Range Statistic, α = .05
  E.2 Distribution of the Studentized Range Statistic, α = .01
  F.1 L Values for α = .05
  F.2 L Values for α = .01
Author Index
Subject Index
Preface
There are at least three reasons why you might buy this book:
1. You are a student in the behavioral sciences taking a statistics course and the instructor has assigned this book as the primary text.
2. You are a student taking a statistics course and the instructor has assigned this book for enrichment and for the exercises it provides.
3. You are a behavioral science researcher who feels a need to brush up on your statistics and you find working on exercises at your own pace appealing.
If you are an instructor, your reasons for assigning this book might be similar:
1. You assigned this book as the primary text because its unified and economic approach appeals to you. You also like the idea of presenting to students a book that they can truly master in its entirety.
2. You assigned another text because of the encyclopedic coverage of statistical topics it offers, but you also assigned this book because you thought the straightforward prose and the conceptually based exercises would prove useful to your students.
When sitting in statistics classes or when trying to read and understand statistical material, too many otherwise intelligent and capable students and researchers feel dumb. This book is intended as an antidote. It is designed to make you feel smart and competent. Its approach is conservative in that it attempts to identify and present the essentials of data analysis as developed by statisticians over the last two or three centuries. But the selection and organization of topics and the manner in which they are presented result from a radical rethinking of how basic statistics should be taught to students whose primary concern lies with behavioral science research or clinical practice, and not with the formal, mathematical properties of statistical theory.
This book is designed to develop your conceptual and practical understanding of basic data analysis. It is not a cookbook, although by reading the text and working through the exercises you will gain considerable practical knowledge and the ability to analyze data using multiple regression and the analysis of variance. We assume that your goal is basic statistical literacy—the ability to understand and critique the statistical analyses others perform, and the ability to proceed knowledgeably with your own data analyses. We also assume that you are far more interested in the results and empirical implications of your analyses than in the formal properties of the statistics used—and therefore will welcome the practical, intuitive approach to statistics taken here.
Several sorts of audiences, from advanced undergraduates, to beginning and advanced graduate students, to seasoned researchers, should find this book useful and appealing. The primary audience we had in mind when writing it consisted of graduate students in the behavioral sciences, students who were required to take a statistics course as undergraduates, did reasonably well at the time, but frankly never quite "got it." When reading journal articles and other research reports, they tend to skip over the results section. They see ps and Fs and lots of other numbers and their eyes glaze over. They look forward to the required graduate statistics courses with a mixture of terror and loathing and cannot quite imagine how they will analyze data for their required thesis or dissertation on their own. We hope that such students will find this book something of a relief, perhaps even a delight, and as they read it and work through the exercises their past coursework will suddenly make sense in new ways and they will increasingly feel competent to read journal articles and to pursue data analyses on their own. For such students, the present volume can serve as the basis for a semester or quarter course.
We recommend that as students read the text, they faithfully perform the exercises, and not skip over them as some readers might a results section in a research report. But we also recommend that readers seek out data sets relevant to their area of interest, and use them to practice the statistical techniques presented here. Usually it is not too difficult to borrow a small data set; if you cannot find one relevant to your area, a practice data set is provided on the accompanying CD. Any attempt to apply the principles learned here to new data is helpful. There is something particularly motivating when you create exercises for yourself and when they are relevant to your areas of interest. Similarly, learning is enhanced and consolidated immensely when, in the context of reading journal articles relevant to your own interests, you suddenly recognize that the author knows, and has used, what you have just come to understand. A second audience for which this book should prove useful consists of active and productive researchers. Such researchers, like the graduate students we described earlier, should also find this book a relief and a delight—and for most of the same reasons. As they read the text and work through the exercises, much of the data analyses they have been doing fairly mechanically will suddenly make more sense to them and they will be infused with a new sense of understanding and control. The exercises will serve them especially well. These are designed to be done at the reader's own pace, in the privacy of the office or home as desired. Researchers who read this text in order to upgrade their statistical understanding may miss the help and nudging that fellow students in a class often provide, but the book is intended to work equally well as a self-study guide and a classroom text. For all of these audiences, the radical redefining of the topics required for a basic understanding of statistics that this book exemplifies will matter. 
It is a cliche of modern architecture (usually attributed to Mies van der Rohe) that less is more. Jacob Cohen (1990), whose writings on statistics have had considerable influence on us, argued that the same principle applies equally well when
considering the application of statistics to the behavioral sciences generally. Certainly it is a precept that has guided us throughout the design and writing of this book.
CONTENTS
There are a staggering number of undergraduate statistics books on the market today, several dozen of which are designed for behavioral science majors, and a fair number designed specifically for graduate courses, many of which are encyclopedic in scope. Over the past several decades there has come to be a "standard content" for these texts, a content from which few texts deviate. We have come to believe that many of the topics presented in traditional behavioral science statistics texts are nice but not necessary for a basic understanding of data analysis as practiced by researchers—and that several of the more essential topics could be presented in a far more unified manner than is usual. In writing this book, we began by asking what students and researchers need to know in order to understand and perform most of the basic kinds of data analyses used in the work-a-day world of the behavioral sciences. We then set out to provide a conceptually based but practice-oriented description of the requisite knowledge. As a result, some, but by no means all, of the topics found in traditional texts are covered here. And although students are hardly harmed by exposure to the full array of traditional topics, the "less is more" approach exemplified here has several advantages, not the least of which is efficiency. Here students are exposed to fewer topics, but those presented possess considerable power, generality, and practical importance.
Two relatively recent developments have made this book possible, one conceptual and one technical. Over the past few decades, there has been a growing realization that statistical topics formerly kept in quite separate compartments can be viewed in a unified way.
Thus the analysis of variance is conveniently, and economically, viewed as simply another application of multiple regression. This unified view permits considerable savings. Students learn multiple regression, which is extremely useful in its own right, and at the same time—and with little additional cost—learn how to analyze data associated with the basic analysis-of-variance designs. Moreover, analysis of covariance—a topic that traditional experimental design texts make quite complex—is thrown in almost for free. The unified view has been expressed in journal articles for several decades (e.g., Cohen, 1968) and has greatly influenced general-purpose statistical computer packages. But only now is it beginning to have an influence on textbooks. The result can be a dramatic trimming of traditional topics with little sacrifice in data analysis capability.
The technical development concerns the now almost universal availability of personal computers. We assume that readers will have access to a microcomputer and a spreadsheet program such as Microsoft's Excel and a statistical package such as SPSS or SAS. Statistical packages are powerful tools for rapidly conducting complex analyses, and any serious student or researcher should be familiar with their operation. The ease with which analyses can be carried out in a statistical package, however, can also promote the mechanical application of analytical procedures without the background knowledge necessary to select the correct analysis and interpret the results appropriately. Spreadsheets, on the other hand, make the formulas used in an analysis explicit and allow for the exploration of the conceptual underpinnings of the analyses and their constituent formulas. Spreadsheets relieve the user of laborious computations; this allowed us to develop exercises that use more meaningful
definitional formulas rather than the sometimes opaque computational formulas necessary when problem sets are to be worked by hand. Together, spreadsheets and statistical packages provide a ready means for teaching conceptual knowledge and concrete skills to readers interested in learning how to conduct and interpret analyses appropriately.
For readers who have already had statistics, the first seven chapters should constitute a brief review of some important, introductory topics (hypothesis testing, descriptive statistics, the normal and t distributions). For students new to statistics, chapters 1-7 should provide a sufficient basis for the material to follow. The foundation of the unified view (simple and multiple regression, accounting for variance) is presented in chapters 8-11. Chapters 12-16 present the basic analysis-of-variance designs—single and multifactor factorial designs, including designs with repeated measures—and also discuss post hoc tests. Mastery of these topics should make it possible to understand most of the simpler analyses appearing in journal articles and to perform basic analyses of data. In addition, mastery of this material should adequately prepare readers to move on to more advanced texts and topics, such as the analysis of categorical data. The interested reader is referred to our companion to this text, Understanding Loglinear Analysis with ILOG (Bakeman & Robinson, 1994).
LEARNING TOOLS
This book contains a number of features that should aid learning. Most chapters contain exercises, and answers to selected exercises are provided in an appendix. Many of the exercises use spreadsheets or statistical software. The computer-based exercises, which allow you to learn by doing, are a central and essential feature of this book. The spreadsheet exercises, in particular, promote conceptual understanding. Spreadsheets allow you to perform meaningful analyses quickly and efficiently and in conceptually informative ways. The spreadsheet-based exercises are almost as central to this book as the unified view of basic statistics. In fact, as you will see, the two dovetail remarkably.
Naturally, we wish that all the examples and exercises were relevant to each reader's particular interests and concerns. We could never create enough different examples to cover the breadth of topics readers will bring with them. And so, instead of even attempting this, we have based most of the examples and exercises on just a few paradigmatic studies. Readers will rapidly become familiar with these few studies (the money-cure study, the lie-detection study, etc.), which should allow them to focus more on new statistical concepts as they are presented without being distracted by details associated with a different study.
The spreadsheet exercises are a powerful learning tool and allow you to conduct basic analyses. When data sets are large or more complex analyses are necessary, however, it is useful to know how to use one of the more powerful statistical packages. To this end, the chapters include exercises that walk the reader through the procedures necessary to conduct the analyses using either SPSS or SAS. SPSS exercises are presented in the chapters, with corresponding SAS exercises in an appendix.
Key terms and symbols, which usually are italicized when they first appear in the text, are defined in boxes throughout the chapters and are collected together into a glossary at the end of the book. In addition, most chapters contain considerable graphic and tabular information, including spreadsheet printouts and figures. Finally, a CD is provided that contains all of the spreadsheets used in the book, SPSS and SAS output and data files, and practice data sets.
ACKNOWLEDGEMENTS
Several classes of graduate students at Georgia State University have read and used earlier versions of this book, and we have benefited greatly from their comments and their experience. In fact, without them—without their efforts, their needs, and their anxieties—we would probably never have come to write this book. Among our former students, we would especially like to thank Anna Williams, Kim Deffebach, Robert Casey, Carli Jacobs, and P. Brooke Robertson for their diligence, good humor, and many insightful comments. Carli Jacobs also assisted with creating the answers appendix and the SAS exercises, and Brooke Robertson assisted with the indexes. We would also like to thank our colleagues and friends for their many helpful comments and support (Roger Bakeman would especially like to thank Lauren B. Adamson, Josephine V. Brown, Kenneth D. Clark, Alvin D. Hall, Daryl W. Nenstiel, and James L. Pate, and Byron Robinson would especially like to thank Carolyn B. Mervis, Irwin D. Waldman, and Bronwyn W. Robinson).
Roger Bakeman & Byron F. Robinson
1 Preliminaries: How to Use This Book
In this chapter you will:
1. Be introduced in a broad, general way to the uses of statistics in the behavioral sciences and the scope of this book.
2. Be introduced to a simple tool—a spreadsheet program (e.g., Microsoft's Excel)—that can perform statistical computations in a way that is both conceptually meaningful and practically useful.
3. Be introduced to an integrated approach to basic statistics, one that relies on multiple regression and emphasizes a few powerful ideas common to a wide range of basic statistical analyses.
4. Be introduced to some of the assumptions, notations, and strategies for learning used in this book.
1.1 STATISTICS AND THE BEHAVIORAL SCIENCES
What is the effect of early experience with the mother on an infant's subsequent development? What sort of person makes a good manager? What kinds of appeals are most likely to affect a person's attitudes? Why do people buy a certain kind of soap? Why do some children do better in school than others? Are there programs that can affect children's academic performance? Are there therapies that can affect an adult's mental health? When seeking answers to questions like these, behavioral scientists almost always resort to statistics. Why is this true? Why should science in general, and the behavioral sciences in particular, be so dependent on statistics?
Why Behavioral Scientists Need Statistics
Simplifying considerably, assume for the moment that the goal of any science is to understand a phenomenon well enough so that such characteristics as when it occurs and how often it occurs can be predicted with some precision. Some phenomena studied by scientists (e.g., simple physical events) might arise from only a few causes. If this were so, then it would be relatively easy to include in one study all the important causes, making it at least easier to understand and predict that phenomenon. Other phenomena (especially phenomena studied by
behavioral scientists), however, might result from a multitude of causes. If important causes were not identified, or for any other reason were not included in a study, then prediction would suffer. But prediction could suffer for another reason as well. It may be that at least some complex phenomena are affected by genuinely random processes. For scientists who hold a deterministic view of human behavior, this is a nightmarish possibility, as William Kessen (1979) noted in a speech to the American Psychological Association (Kessen was speaking to child psychologists but what he said also applies to students of human behavior in general):
    To be sure, most expert students of children continue to assert the truth of the positivistic dream—that we have not yet found the underlying structural simplicities that will reveal the child entire, that we have not yet cut nature at the joints—but it may be wise ... to peer into the abyss of the positivistic nightmare—that ... the variety of the child's definition is not the removable error of an incomplete science. (p. 815)
Kessen raised two possibilities that students of human behavior should consider: Predictions of human behavior may be imprecise because the behavioral sciences are not yet mature enough to identify all determining causes, or—and this is the "positivistic nightmare"—precise predictions may be impossible in principle because of random or chaotic elements in human behavior. An ancient and important philosophic question echoes in these two possibilities, but statistics are neutral in this debate. Prediction may fail to be perfect because important causes of a completely determined phenomenon were not taken into account. Or prediction may fail because indeterminate elements shape the phenomenon. In either case, the observed association of the presumed causes (or explanatory factors) with the resultant phenomenon will be less than perfect.
When prediction is essentially perfect, statistics are not necessary. When prediction is not perfect, statistical techniques help us determine whether our observations should be ignored because they probably reflect just a chance occurrence, or whether they should be emphasized because they probably reflect a real effect. In other words, when results are ambiguous, statistical analysis allows us to determine whether or not the effects we observe are statistically significant. Thus, the need for statistical analysis is not a matter of whether it is physical or social phenomena that are under investigation, but rather a matter of how strongly effects are linked to their presumed causes, that is, of how well the phenomenon under study can be predicted. For example, when studying the effect of certain toxins on the body, predictability can approach 100%. If the toxin is introduced into a person's body, that person dies. In such cases, statistics are hardly needed. The effects of psychoactive drugs often appear to occupy a middle ground between complete determinism and complete chaos. Even though there are broad regularities, different drugs in different concentrations affect individuals differently—a state of affairs that can complicate both research and clinical practice. Behavioral regularities of the sort students of human behavior study are typically weaker than the effects of toxins or psychoactive drugs. Usually the rules uncovered by behavioral research suggest probabilities, not certainties, which can be quite frustrating for judges, clinical psychologists, and others who must make decisions in individual cases. The behavioral regularities may be real, but often they are not all that powerful and there will be many exceptions to the general rules. And, from the point of view of the researcher, many behavioral
regularities may not be at all obvious to the naked eye. These regularities are probabilistic and the evidence for them may appear ambiguous, so skeptics will naturally suspect that an invested researcher who claims effects is led more by desire than evidence. Statistical techniques are important, indeed essential, because they provide ways to resolve such ambiguities. Thus, to return to the question posed at the beginning of this section, scientists need statistics when they deal with phenomena that can be predicted only imperfectly. Behavioral scientists make a career of dealing with imperfectly predicted phenomena and thus often have greater need of statistics than other kinds of scientists. But any scientist, when faced with ambiguous results, needs techniques that control for the all-too-human tendency to see what we want or expect to see. Happily, statistical techniques provide ways of making decisions about ambiguous evidence, ways that insulate those decisions from an investigator's potential bias. No wonder, then, that statistics has come to play an especially important role in the thinking and research of behavioral scientists.
How Much Statistical Knowledge Do Behavioral Scientists Need?
Few would deny that behavioral scientists need some knowledge of statistics. However, there is room for a considerable diversity of opinion as to how much knowledge is enough. Certainly they need to be able to read journals in their field and analyze their data. At a minimum, diligent readers of this book, who read the material and faithfully do the exercises, should be able to understand much of what they find in the results sections of journal articles and should be able to perform and present competent data analyses of their own. In no sense is this book a compendium of all the statistical knowledge a behavioral scientist will ever need, but it will provide a basis for learning about more advanced topics like multivariate analysis of variance, advanced multiple regression, advanced experimental design, factor analysis, log-linear analysis, logistic regression, and the like later on.
This book is perhaps best viewed as an advanced-level introductory text. We assume that many readers will have had an undergraduate introductory statistics course but they are not now comfortable applying whatever they learned earlier. In fact, as noted in the preface, that earlier material may be remembered only in the most vague and shadowy of forms and as a result cannot form the basis for confident practice. Reading this book and working its exercises should allow previous knowledge to coalesce with present learning and provide many "moments of recognition" when terms and concepts heard before suddenly make sense in a way they had not previously.
One caveat is necessary: It is as important for us to realize what we do not know as it is to understand what we do know. An attempt is made to present topics in this book in as straightforward and simple a manner as possible. Yet any one of them could be elaborated and, if probed, might reveal subtleties, complexities, and depths not even hinted at here.
Such successive refinement seems central to academic work and to scholarly endeavor generally. We hope this book provides you with a sufficiently elaborated view of basic statistics to allow you to realize what a small corner of the statistical world you have conquered.
1.2 COMPUTING STATISTICS BY HAND AND COMPUTER
We learn new skills—from broad jumping to piano playing to statistics—best by doing. Watching others perform any of these activities may be helpful initially,
but in the end there is no substitute for individual exercise. Thus, watching an instructor perform statistical computations on a blackboard may be helpful at first, but at some point we need to work through computations ourselves, at our own speed. The question then is, what is the best way to practice the required computations? And, once mastered, what is the best way to do the computations required for a real data analysis? Traditionally, pencil and paper (or something similar) have been used. Later, mechanical calculators offered some relief from the tedium and error-prone ways of hand methods. These have now been almost totally replaced by amazingly inexpensive and seemingly ubiquitous hand-held electronic calculators. At the same time, personal computers have become as commonly used by students as typewriters once were, and computer labs have sprouted apparently overnight on most college campuses. It is now reasonable to assume that students will have easy access to microcomputers. In fact, this book assumes that readers will use a personal computer for computations, sometimes when reading the text, certainly when doing the exercises.
In addition to accuracy and speed, personal computers offer advantages not easily realized even with electronic calculators. Historically, considerable space in statistics texts, and considerable student time, was devoted to deriving computational from theoretical formulas. The computational formulas were then used for all subsequent exercises. In other words, the formulas used in exercises were designed for hand computation but were not necessarily conceptually meaningful. The only justification for this state of affairs was practical. Given earlier technology, theoretical formulas were simply too cumbersome to use. But with the advent of personal computers, this has all changed.
It is simply no longer necessary for anyone (other than specialists who develop the general-purpose computer packages the rest of us use) to know computational formulas. For basic work, the definitional methods presented in this book can be used, and for more advanced work, several excellent packages are available, such as SPSS. As a happy result, less rote learning is required and concepts, not mechanics, can be used to solve both the exercises presented in this book and practical problems. The advantages for the reader are considerable. Freed from the need to spend time mastering unnecessary details, readers can reach the same degree of statistical mastery in less time. But it is the personal computer, specifically a personal computer running a spreadsheet program, that makes such an approach practical.
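To make the contrast between definitional and computational formulas concrete, here is a minimal Python sketch. Python is used only for illustration (the book itself works in spreadsheets), and both the choice of statistic (a sum of squared deviations from the mean) and the three scores are our assumptions, not an example taken from the text.

```python
# Hypothetical scores, invented for illustration only.
scores = [21, 12, 19]

def ss_definitional(xs):
    """Sum of squared deviations from the mean: the formula says what it means."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs)

def ss_computational(xs):
    """Hand-calculation shortcut: sum of the squares minus (squared sum) / N."""
    return sum(x * x for x in xs) - sum(xs) ** 2 / len(xs)

print(ss_definitional(scores))
print(ss_computational(scores))
```

Both functions return the same quantity (up to rounding), but only the first reads like the concept it computes. The second was attractive when every subtraction had to be done by hand.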
Spreadsheets: A Basic Computing Tool for Statistics

The computer programs that spurred so many sales of personal computers initially were electronic spreadsheets and today, after word processors, they remain the second most common kind of application run on personal computers. Although designed initially for business applications, it turns out that spreadsheets are extremely well-suited for statistical use as well. The idea behind spreadsheets, like so many innovative ideas, is astonishingly simple. A spreadsheet consists of rows (usually labeled 1, 2, 3, etc.) and columns (usually labeled A, B, C, etc.) that together form a field or matrix of individual cells (identified as A1, A2, B1, etc.; see Fig. 1.1). Among other things, cells can contain formulas or arithmetic rules. For example, a cell might contain a formula for adding together all the numbers in the column above it. That cell would then display the total for the column. Several entries in a row might all contain formulas for column totals and an additional formula in that row might sum the column totals, providing a grand sum of all the numbers in the table.
Rows can represent such diverse items as expense categories for a small business, lines in a tax form, participants in an experiment, or students in a class. Columns can represent months or years, interest or other kinds of rates, or test scores or scores of any kind. The whole might then represent a model for a small business, a tax computation, a statistical analysis, or a class grade book. Whatever is represented, spreadsheets typically consist of linked and interdependent computations, which means that a change in one cell will affect the values of all the other cells that depend on it. For example, if cell B3 in Fig. 1.1 were changed from 12 to 22, the column total in cell B5 would change from 52 to 62, the row total in cell F3 would change from 67 to 77, and the total of the column totals in cell F5 would change from 232 to 242. Electronic spreadsheets ensure that making such changes, whether they result from incorrect entries or simply a desire to experiment with different values, does not necessitate a series of frantic erasures and error-prone hand recalculations. Instead, the effects of a change in the value of one cell can ripple through the spreadsheet seemingly instantaneously as the computer performs the required recalculations. This means that a template (a particular set of labels and formulas) can be developed for a series of calculations (e.g., a particular statistical analysis) and then reused with different data by entering the new data into the cells in place of the old data. As though this were not enough, the use of spreadsheets for statistical analysis offers yet another distinct advantage. For whatever reason, many students encountering statistics for the first (or even second) time find the usual sort of statistical notation—subscripts, superscripts, big and little Greek letters, various squiggles—intimidating or worse. Too often, notation seems to form a barrier that impedes the understanding of too many students. 
Much of the notational burden is carried by the spreadsheet itself because of the graphic way spreadsheets are laid out with rows representing subjects (i.e., participants or, more generally, cases) and columns representing various statistical elements. As a result, in this book the need for elaborate statistical notation is greatly reduced and the notation remaining is relatively straightforward and its meaning easily tied to the spreadsheet layout. Moreover, given mastery of a spreadsheet approach to statistics as presented in this book, readers should then have little trouble understanding and applying more complex notation in subsequent, more advanced courses. Another advantage of a spreadsheet approach to basic data analysis should be mentioned. Although there is already a limited graphic aspect to data laid out
     A        B        C        D        E        F
1    Name     Test 1   Test 2   Test 3   Test 4   Total
2    Alex     21       22       19       23       85
3    Betty    12       16       20       19       67
4    Carlos   19       18       21       22       80
5    Total    52       56       60       64       232
6    N        3        3        3        3        3
7    Mean     17.3     18.7     20.0     21.3     77.3
FIG. 1.1. An example of a simple spreadsheet. For example, cells in row 1 and column A contain labels, cell B3 contains the value 12, cell B5 contains a formula for the sum of the three cells above it, and cell B7 contains a formula for dividing cell B5 by B6 to get the mean.
in a spreadsheet format, other kinds of graphic portrayals can considerably enhance understanding of our data. Fortunately, most spreadsheet programs have some graphing capability, and in subsequent chapters we demonstrate just how useful various graphs produced by spreadsheet programs can be for understanding the concepts as well as the results of data analysis.
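The ripple effect described earlier for Fig. 1.1 (changing cell B3 from 12 to 22) can be imitated in a few lines of Python. This is only a sketch of the idea, not a description of how any spreadsheet program is implemented. The function name recalculate and the dictionary layout are our own assumptions.

```python
# Scores from Fig. 1.1, keyed by spreadsheet row number (rows 2-4, columns B-E).
scores = {
    2: [21, 22, 19, 23],  # Alex
    3: [12, 16, 20, 19],  # Betty
    4: [19, 18, 21, 22],  # Carlos
}

def recalculate(data):
    """Recompute every total from the raw scores, as a spreadsheet does after an edit."""
    row_totals = {row: sum(values) for row, values in data.items()}  # column F
    col_totals = [sum(col) for col in zip(*data.values())]           # row 5, columns B-E
    grand_total = sum(col_totals)                                    # cell F5
    return row_totals, col_totals, grand_total

before = recalculate(scores)
scores[3][0] = 22          # change cell B3 from 12 to 22
after = recalculate(scores)

print(before)
print(after)
```

Comparing the two printed results shows the same three changes the text describes: the Test 1 column total, Betty's row total, and the grand total each rise by 10, while everything else is untouched.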
A Brief Introduction to Spreadsheet Programs

It is characteristic of tools, from simple hammers to high-speed electronic calculators, that advantages are reaped only after an entry price is paid. The advantages of spreadsheets when learning and doing statistics are clear. They automate calculations, reinforce concepts, and simplify notation. Yet to gain these advantages, first the tool needs to be mastered. In the case of spreadsheets, this entry cost is neither unduly high nor completely negligible. It takes some effort to learn how to use spreadsheets, more than for a hand-held calculator, but less than for a typical word processor. But the payoff can be considerable, not just for statistics but for a host of other calculating tasks as well.

Many of you are probably already reasonably expert with spreadsheets, but if you are a complete novice, do not despair. Spreadsheets are not very difficult to master. Almost all spreadsheet programs are remarkably similar, far more similar than word-processing programs. Whichever spreadsheet program you use, your first task is to become familiar with it, a process that is usually aided by the tutorials provided with many programs. Nonetheless, some orienting comments that apply to spreadsheets in general may be helpful. Specific examples are also provided. These examples follow the conventions of Microsoft's Excel, which is one of the most widely available spreadsheets. But even if you use a spreadsheet with somewhat different conventions, this combination of general discussion and specific examples should prove helpful.

As already noted, a spreadsheet consists of a matrix of rows and columns. Columns are identified with letters, rows with numbers. Each cell in the matrix has an address that identifies that cell. For example, A1 is the address for the cell in the upper left-hand corner and cell C5 is over two columns and down four rows from A1. When the program is first invoked, one cell will be highlighted.
The highlight can be moved by pressing the directional arrows. A few moments of experimentation with your spreadsheet program should demonstrate how this works.

Initially, all cells of the spreadsheet are empty, but potentially any cell can contain (1) a value, (2) a label, or (3) a formula. In other words, cells can contain numbers, identifying information, or computational rules. To enter something in a cell, first highlight that cell. Then simply type. If the first character you type is a digit, most spreadsheets assume you mean a value. If the first character you type is a letter (or a quote mark), most spreadsheets assume you mean a label. A label can be any identifying information you want and is usually used to make the spreadsheet readable. Conventions for formulas can vary, but the usual assumption is that if the first character is an equals sign, a formula follows.
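The first-character rule just described can be written down as a short Python sketch. The function name classify_entry is hypothetical, and the rule is deliberately simplified: real spreadsheet programs also look at the rest of the entry (for example, a leading minus sign or decimal point can begin a number), so treat this as an illustration of the convention rather than a specification of any program.

```python
def classify_entry(text):
    """Guess how a typical spreadsheet would treat what was typed into a cell."""
    if text == "":
        return "empty"
    first = text[0]
    if first == "=":
        return "formula"   # an equals sign announces a computational rule
    if first.isdigit():
        return "value"     # a leading digit suggests a number
    return "label"         # letters, quote marks, and anything else

for entry in ["12", "Alex", "=SUM(B2:B4)", '"Test 1']:
    print(repr(entry), "->", classify_entry(entry))
```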
Spreadsheet Formulas and Functions

Whatever the convention used for your spreadsheet, formulas are very general and powerful and it is important to understand how they work. If a number is entered in a cell, that number is displayed on your spreadsheet. For example, in Fig. 1.2 you see the number 12 in cell B3. Similarly, if a label is entered in a cell, that label is displayed on your spreadsheet, for example, "Alex" in cell A2. But if a formula is entered in a cell, the spreadsheet displays the value rather than the formula. For example, in Fig. 1.2 we have indicated the formula that is in cell B5, =SUM(B2:B4), but in fact the value 52 and not the formula would be displayed in cell B5 on your screen (when you highlight a cell that contains a formula, the formula is displayed, usually in a special box at the top of your screen). If the value of any cell used in computing that formula is changed subsequently, the formula will be recomputed and the new value displayed, which, as noted earlier, gives spreadsheets a notable advantage over pencil-and-paper techniques.

A formula can indicate simple arithmetic. For example, =B5/B6 entered in cell B7 divides the cell two above B7 by the cell immediately above it (see Fig. 1.2). A formula can be as simple as a pointer to another cell. For example, =F7
entered in cell F11 would cause cell F11 to display the value of cell F7. Or a formula can combine values in other cells, constants, or both as required. For example, =F6-3 entered in cell G6 would subtract a constant value of three from the value of the cell immediately to the left of G6. Or, =(G5*3.14)/F6 entered in cell G6 would multiply the value of the cell immediately above by a constant value of 3.14 and would then divide the product by the value of the cell
     A        B                C        D        E        F
1    Name     Test 1           Test 2   Test 3   Test 4   Total
2    Alex     21               22       19       23       85
3    Betty    12               16       20       19       67
4    Carlos   19               18       21       22       80
5    Total    =SUM(B2:B4)      56       60       64       232
6    N        =COUNT(B2:B4)    3        3        3        3
7    Mean     =B5/B6           18.7     20.0     21.3     77.3
FIG. 1.2. The spreadsheet shown in Fig. 1.1 with Column B expanded to show the formulas in Cells B5, B6, and B7. Row 1 and Column A contain labels, Cells B2 through E4 contain numbers, and Rows 5-7 and Column F contain formulas.
immediately to the left of G6. In other words, formulas can combine cell addresses and constants with operators. Operators include:
1. + (for addition)
2. - (for subtraction)
3. / (for division)
4. * (for multiplication)
5. ^ (for exponentiation)
Left and right parentheses are used to group addresses and constants as needed to indicate the order of computation. For example, =A7+B7/B8 is not the same as =(A7+B7)/B8. A formula can contain any address except that of its host cell, because this would result in a circular definition.

A formula can also indicate a function, or can combine functions with cell addresses, constants, or both. Functions are predefined and the particular functions included vary somewhat among spreadsheets. Only three functions are required for the exercises in this book:
1. =SUM(range)
2. =COUNT(range)
3. =SQRT(cell)
although others you may find useful include:
4. =AVERAGE(range)
5. =STDEV(range)
(more on the standard deviation function later). For example, =SUM(D2:D4) sums the numbers in column D in rows 2 through 4. And =COUNT(D2:D4) counts the number of cells containing numeric values, which in this case (assuming there is a value in each of the cells in the range indicated) would be 3. Finally, =SQRT(H23) or =H23^.5 computes the square root of whatever value is in cell H23. (Recall that the square root of a number is the same as that number raised to the one-half power. Thus =SQRT(H23) and =H23^.5 give identical results.) One final note: We have shown functions here with capital letters to make them stand out, but function names are not case sensitive. =SUM and =sum have the same effect.

Some functions operate on a single cell, like the square root function; others, like the summation function, operate on a range of cells. For example, C1:C10 specifies a range incorporating 10 cells in column C. Again, D8:H8
specifies a range of five cells in row 8. Similarly,
A1:D5 indicates a rectangular range, four columns across by five rows down, containing 20 cells. The ability to specify a range turns out to be very useful. Not only can a range be specified for a function in a formula, but a range, like an individual cell, can also be copied or moved (more on this later). Further, you do not need to type in the range explicitly. For example, you might point to a cell, select it with a left mouse click, begin typing "=sum(" and then, holding the left mouse button down, "paint" a range of cells. This range (whether a single column, a single row, or a rectangle of cells) will be entered into the formula for you.
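As a check on how ranges and the three required functions behave, the following Python sketch expands a range into its cell addresses and then imitates =SUM, =COUNT, and =SQRT. The helper names expand_range and numeric_values are hypothetical, and the sketch assumes single-letter columns and ranges written from upper left to lower right. It is an illustration, not the way any spreadsheet is actually built.

```python
import math
import re

def expand_range(spec):
    """List the cell addresses in a range such as 'D8:H8' (single-letter columns only)."""
    match = re.fullmatch(r"([A-Z])(\d+):([A-Z])(\d+)", spec.upper())
    if match is None:
        raise ValueError(f"not a range: {spec!r}")
    col1, row1, col2, row2 = match.groups()
    return [
        f"{chr(col)}{row}"
        for col in range(ord(col1), ord(col2) + 1)
        for row in range(int(row1), int(row2) + 1)
    ]

def numeric_values(cells, spec):
    """Values in the range that are numbers; labels and empty cells are skipped."""
    found = (cells.get(address) for address in expand_range(spec))
    return [v for v in found if isinstance(v, (int, float))]

# Column D of Fig. 1.1, including its label in D1.
cells = {"D1": "Test 3", "D2": 19, "D3": 20, "D4": 21}

# Sizes of the three ranges discussed in the text.
print(len(expand_range("C1:C10")), len(expand_range("D8:H8")), len(expand_range("A1:D5")))
print(sum(numeric_values(cells, "D2:D4")))        # like =SUM(D2:D4)
print(len(numeric_values(cells, "D1:D4")))        # like =COUNT(D1:D4); the label is not counted
print(math.isclose(math.sqrt(60), 60 ** 0.5))     # =SQRT(x) and =x^.5 agree
```

Note that Python writes exponentiation as ** rather than ^, and that the count over D1:D4 ignores the label in D1, which mirrors the statement that =COUNT counts only cells containing numeric values.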
Relative Versus Absolute Addressing in Spreadsheets

Spreadsheets distinguish between relative and absolute addresses, for both individual cells and ranges. This distinction may seem rather academic at first, but in fact the flexibility and power of spreadsheets depend in large measure on the way addresses are copied, preserving relative addresses. When a formula is copied from one cell to another, the addresses in the new cell are relocated, preserving the relative, not the absolute address. For example, if the formula in cell B7, which indicates that the value in cell B5 is to be divided by the value in cell B6 (=B5/B6), is copied to cell C7, the addresses are relocated so that the formula in cell C7 is now =C5/C6, exactly as you would like if you wanted to copy this formula for computing the mean (see Fig. 1.2). What is preserved is the relative address (like the old formula, the new formula refers to the two cells immediately above it), not the absolute address (B5 and B6 in the original formula).

This allows for considerable efficiency in defining spreadsheets. For example, a formula that operates on other values in its row can be copied to cells in the same column, and each of those formulas will then operate on the values in the appropriate row. In addition, if a formula in cell C12 indicates summation for cells C1 through C10, and if the formula in C12 is copied to cells D12 through G12, then the formulas in cells D12 through G12 will indicate summation for rows 1 through 10 for that column.

All of this is extremely helpful for statistical analysis. Raw data for behavioral science studies often consist of scores for different subjects. (When discussing spreadsheets, we use the briefer term subjects, although usually participants would be used when writing results for publication.) Each row could represent a different subject. Subject identification numbers might then be entered in column A and raw data for these subjects in column B.
Data for the first participant might be on row 1, for the second participant on row 2, and so forth. Various statistical manipulations of the raw scores could be accomplished in various columns, and summations for all columns could be entered in an appropriate row after the data for the last subject. Formulas need be entered fully only once and then copied. Taking advantage of the way copying works allows spreadsheet users to rapidly create a spreadsheet template that performs the required statistical computations.

Sometimes, of course, we want an address in a formula to be copied absolutely, not relatively, which is easy enough to do. An address, like D13, is always understood to be relative, unless the column letter and row number are each preceded by a dollar sign, like $D$13. For example, the formula, =B1-$B$24
entered in cell D1 indicates that the value in cell B24 is to be subtracted from the value in cell B1, the cell two columns to the left of D1. If this formula is copied to cell D2, the formula in cell D2 will be =B2-$B$24. The address for B1 was relocated (the first term in the initial formula and in the copied formula points to a cell two to the left of the host cell), but the address for B24 was not relocated (because of the dollar signs). In other words, the initial formula in cell D1 indicates relative addressing for the first term but absolute addressing for the second.

So far we have talked about copy, but move is a different matter. When a cell or a range is copied, addresses are relocated, preserving their relative position (unless the formula being copied indicates an absolute address). When a cell or a range is moved, on the other hand, the addresses are not relocated, but the initial addresses are preserved instead. Furthermore, and consistent with what the words copy and move usually mean, after a copy two versions exist, the initial cell or range and the copied cell or range. After a move, however, only the moved cell or range exists; the initial cell or range is erased. Copying is very useful when you want to expand a formula throughout a table, whereas moving is useful when you want to reorganize the layout of your spreadsheet, perhaps to make room for new variables or formulas.

The number of significant digits maintained internally by spreadsheet programs for a particular cell can vary, depending on the particular computer, but typically the internal representation is accurate to several digits. Consequently, accuracy is not usually a concern of spreadsheet users. The number of digits displayed on your screen, however, will depend on the column width you allow and the format you have specified. For example, in Figs. 1.1 and 1.2, we specified a numeric format with one digit after the decimal point for row 7.
Consequently cell C7 displays 18.7, but the number held internally, and used for any computations, is 18.6666... Again, it is important to distinguish between what a cell contains and what a cell displays. If your answers to the exercises in this book do not agree exactly with the answers given in the figures or in the Answers to Selected Exercises section, do not be unduly alarmed. The discrepancy may be due simply to the format of the display.

The brief introduction to spreadsheets contained in the preceding paragraphs is far from exhaustive and most spreadsheets have many more capabilities than those mentioned here. Moreover, we have not discussed how to save a spreadsheet once defined or how to retrieve that spreadsheet for subsequent use, nor have we discussed machine-specific or systems-level matters, for example, how one invokes the spreadsheet program in the first place. Still, we have mentioned most of the spreadsheet capabilities necessary for the exercises and demonstrations contained in this book. As a next step, and as preparation for material presented later, readers unfamiliar with spreadsheets should now spend some time familiarizing themselves with whichever spreadsheet they plan to use.

Exercise 1.1 Using a Spreadsheet Program

This first exercise provides brief and introductory practice in using a spreadsheet program. If you are not sure how to perform any of these operations, ask someone to show you or experiment with the help provided by the program.
1. Invoke your spreadsheet program. Experiment with moving the highlight around using the cursor control keys (marked with arrows).
2. Enter five numbers in rows 2-6 of column B. Enter five different numbers in rows 2-6 of columns C and column D. Enter a label for column B (like "Test 1") in row 1 of column B; likewise for columns C and D. Enter labels in column A as appropriate.
3. Enter a formula in cell B7 that sums the five numbers in B2 through B6. Copy the formula from cell B7 to cells C7 through D7. Are the computed values in row 7 correct?
4. Enter a formula in cell E2 that sums the values in row 2. Copy this formula from cell E2 to cells E3 through E6. Are the totals in these cells correct? Change one of the numbers in the range B2:D6. Are the totals updated correctly?
5. Enter a formula in cell B8 that counts the numbers in B2 through B6. Copy the formula from cell B8 to cells C8 through D8. Are the computed values in row 8 correct? Delete one of the numbers in the range B2:D6. Is the count updated correctly?
6. Insert three blank rows between rows 5 and 6 (this moves row 6 to row 8). Enter numbers in some of the blank cells in what are now rows 5 and 6. Are the sums and counts updated correctly?
7. Print a hard copy of your spreadsheet (assuming that a printer is attached to your microcomputer).
8. Save the template you have created, giving it some meaningful name. Exit the spreadsheet program.
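The copying rule described before Exercise 1.1 (relative addresses are relocated, addresses marked with dollar signs are not) can be sketched in Python as follows. The function name copy_formula is hypothetical, and the sketch assumes single-letter columns and formulas whose function names contain no digits. It illustrates the rule only and does not reproduce any spreadsheet's internals.

```python
import re

# An address is an optional $, a column letter, an optional $, and a row number.
REFERENCE = re.compile(r"(\$?)([A-Z])(\$?)(\d+)")

def copy_formula(formula, source, target):
    """Relocate the addresses in `formula` as if it were copied from `source` to `target`."""
    col_shift = ord(target[0]) - ord(source[0])
    row_shift = int(target[1:]) - int(source[1:])

    def relocate(match):
        col_fixed, col, row_fixed, row = match.groups()
        if not col_fixed:                    # relative column: shift it
            col = chr(ord(col) + col_shift)
        if not row_fixed:                    # relative row: shift it
            row = str(int(row) + row_shift)
        return f"{col_fixed}{col}{row_fixed}{row}"

    return REFERENCE.sub(relocate, formula)

print(copy_formula("=B5/B6", "B7", "C7"))          # the mean formula copied one column right
print(copy_formula("=B1-$B$24", "D1", "D2"))       # only the relative term is relocated
print(copy_formula("=SUM(C1:C10)", "C12", "G12"))  # a column sum copied four columns right
```

A move, by contrast, would leave the formula text exactly as it was, because moved addresses are not relocated.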
A Brief Introduction to SPSS

A spreadsheet is a powerful tool for manipulating data and conducting basic statistical analyses. Spreadsheets also allow students to explore the inner workings of meaningful definitional formulas and avoid the drudgery of hand calculations using opaque computational formulas. All of the computations described in this book can be accomplished by most spreadsheet packages with a regression function. As the analyses you wish to conduct become larger (e.g., contain many cases) and more complex (e.g., analysis of repeated measures), however, it will be to your advantage to learn a computer package dedicated to statistical analysis. To this end we have included exercises using SPSS (and SAS) that parallel and extend the spreadsheet exercises.

SPSS is a system of commands and procedures for data definition, manipulation, and analysis. Recent versions of SPSS use a Windows-based menu and a point-and-click-driven interface. We assume the reader has basic familiarity with such Windows-based program features. SPSS includes a data interface that is similar in many respects to a spreadsheet. Rows represent cases and columns represent different variables. You can manipulate data using cut, paste, and copy commands in a manner similar to spreadsheets. SPSS also provides the ability to import spreadsheet files directly into the SPSS data editor. The SPSS exercises will provide enough information to navigate the commands and procedures necessary to complete each exercise. You should, however, consult the SPSS manuals and tutorial to gain a more thorough background in all of the options and short-cuts available in SPSS. The following exercise will familiarize you with the SPSS interface and basic commands (comparable SAS exercises for this and subsequent SPSS exercises are contained in an appendix).
Exercise 1.2 Using SPSS

This first exercise provides brief and introductory practice in using SPSS. If you are not sure how to perform any of these operations, ask someone to show you or experiment with the help provided by the program. Prior to importing data into SPSS, open the spreadsheet from Exercise 1.1, delete lines 5-7 (leaving just data), and save the file with a new name.
1. Invoke SPSS. Open the spreadsheet file you just created (data from Exercise 1.1). From the main menu, select File->Open->Data. In the Open File dialog box find the Files of Type window and change the file type from an SPSS (.sav) file to an Excel (.xls) file. Find your spreadsheet file and open it by either double clicking on the filename or highlighting the file and clicking Open. In the open data source window, make sure that Read variable names in the first row of data is checked and then click on OK. Confirm that the variable names and data are the same as in your spreadsheet.
2. Change the name of the variable in column A. At the bottom of the SPSS data editor, click on the Variable View tab. In variable view, experiment by changing the name of one of the variables. Note that SPSS variable names are limited to eight characters. You can, however, click on the Label column and provide a more descriptive label that will appear in the output. Labels can be up to 256 characters in length, but be aware that labels that are too long will make the output difficult to read.
3. Create a new variable by entering a new name in the Name column. Designate the number of decimal places you would like to display and give the new variable a meaningful label.
4. Enter data for the new variable. Click on the Data View tab and enter five values in the rows below the new variable name.
5. From the main menu, select Analyze->Descriptive Statistics->Descriptives. In the Descriptives window, highlight all three variables in the left-hand window and move them to the right-hand window by clicking on the right-pointing triangle. Click on the Options button, check the box next to Sum. Click Continue and then OK. This should open the output viewer and display basic descriptive statistics. Confirm that the N (i.e., count) and sum are the same values you obtained in Exercise 1.1.
6. Return to the data editor and change some of the values. Run descriptive statistics again. Were the descriptive statistics updated correctly?
7. Save the output and data files. To save the output file, make sure you are in the SPSS output viewer, and then select File->Save from the main menu. Give the file a meaningful name and click Save. The output will be saved in a file with an .spo extension. Now return to the data editor and save the data using a meaningful name. SPSS data files are saved with a .sav extension. Exit SPSS.
1.3
AN INTEGRATED APPROACH TO LEARNING STATISTICS

To a large extent, statistical practice has developed separately in a number of different academic disciplines or areas, many of which hardly spoke the same language. Often students in some areas would learn only analyses of variance and students in other areas would learn only multiple regression. And even when
students learned about both, the two would typically be presented as quite unrelated topics. As a result, students in different areas learned seemingly separate statistics in different ways. Even introductory statistical texts often presented statistics in a piecemeal fashion, as a series of disparate topics. During the last several decades there has been a growing realization that considerable commonality underlies seemingly different statistical approaches, that much of what constitutes basic statistical analysis can be understood as applications of a general linear model. This has fortunate implications for the student new to statistics. Formerly, for example, multiple regression was typically taught as a predictive technique, appropriate for correlational studies in education and business, whereas the analysis of variance was taught as a way to analyze experimental results in agriculture and psychology. Increasingly, we now understand that both use the same concepts and computations, and analysis of variance is simply a specific application of multiple regression, which is the more general or overarching approach. As Cohen and Cohen (1983) noted in their excellent text on multiple regression, multiple regression/correlation is a general data-analytic system, "a versatile, all-purpose system for analyzing the data of the behavioral, social, and biological sciences" (p. 4). Reflecting this new understanding, correlation, multiple regression, and the analysis of variance for standard experimental designs are treated here in a single unified and integrated way. This results in considerable economy. Students confront fewer separate topics, but those they do encounter possess broad generality and application, which is consistent with the general "less-is-more" philosophy on which this book is predicated. Earlier in this chapter we argued that only theoretical, not computational, statistical formulas need be learned. 
It may not be necessary to learn many of the traditional theoretical formulas either—at least not in an introductory course—because an integrated approach renders many of them unnecessary. Thus, an integrated approach reduces the amount to be learned but does not sacrifice understanding of basic statistical concepts or limit the kinds of analyses that can be performed. This may sound like an extravagant claim to you now (although certainly appealing), but once you understand the material and work the exercises in this book, you should be in a position to judge for yourself.
Multiple Regression: A Second Basic Computing Tool

The conceptual framework for the statistical analyses presented in this book is provided mainly by multiple regression. Both concepts and computations for particular analyses are embodied in spreadsheets. The two make for a happy marriage, almost as though spreadsheets were invented to make a multiple-regression approach to basic statistical analysis comprehensible and practical. As demonstrated repeatedly throughout this book, the layout of spreadsheets allows the logic of multiple regression to be expressed faithfully and, as noted earlier, with a minimum of intimidating notation.

By way of introduction, it seems worthwhile at this point to describe multiple regression in a very brief and general way. In introducing multiple regression/correlation, Cohen and Cohen (1983) wrote that it "is a highly general and therefore very flexible data-analytic system that may be used whenever a quantitative variable (the dependent variable) is to be studied as a function of, or in relationship to, any factors of interest (expressed as independent variables)" (p. 3). An amazing number of research questions in the behavioral sciences fit this simple description. Usually there is something we want to explain (like IQ scores, income, or health), which we want to account for in terms of
various factors. To be sure, a study that attempts to account for a categorical instead of a quantitative dependent variable (these terms are defined in chap. 4) does not fit this description, but techniques for analyzing categorical variables are mentioned in chapter 3. Moreover, many of the same concepts apply whether the dependent variable is categorical or quantitative. Thus the material presented here applies broadly to most simple statistical analytic techniques.

Multiple regression, used as a basic computing tool, can be understood in black-box terms. Without even knowing what is inside the black box, or how it works, we can simply provide the black box with the appropriate information. We would tell it what we want to account for (the dependent variable or DV), what the factors of interest are (the independent variables or the IVs), and what the independent and dependent variable scores are for the subjects in our study. In other words, we would provide the multiple regression program with a rectangular data matrix. Rows would represent subjects (or some other unit of analysis like a dyad or a family), the first column would represent the dependent variable, and subsequent columns would represent independent variables. The matrix would then consist of scores for individual subjects for the various variables. For its part, the multiple-regression program would return to us coefficients or "weights" for each independent variable, indicating the relative importance of each, along with an overall constant and other statistics. The coefficients are then used as part of the basic machinery for carrying out the statistical analyses presented in this book. Multiple-regression statistics are discussed in greater detail in chapter 11.

As a practical matter, and in order to perform the exercises throughout this book, the reader needs some way of determining multiple regression coefficients.
If there is only one independent variable, the regression statistics can be computed directly in the spreadsheet according to formulas given in chapters 8 and 11. If there is more than one, it becomes much easier to let a general-purpose multiple-regression program do the computational work. Happily, most spreadsheet programs have an internal or built-in ability to do multiple regression. (In Excel, specify Tools | Data Analysis, usually after a one-time Tools | Add-ins specifying the Analysis ToolPak.) All you need do is request regression and then specify the range for the dependent variable, the range for the independent variable or variables, and the place in the spreadsheet where you wish the regression output placed. For more advanced cases, and also as a check on the spreadsheet exercises and operations, you will use the multiple-regression procedure in a statistical package such as SPSS. However, do not let the technical details of how, or with which program, you will compute multiple-regression statistics obscure the power and sweep of the argument being made here.

Two basic computing tools have been discussed. Multiple regression provides coefficients. These coefficients are then incorporated in spreadsheets, which complete the necessary statistical computations. As you read this book and come to understand the integrated approach pursued here, you will also be developing specific spreadsheet templates that allow you to carry out the analyses you have learned. In order to perform competently what is presented here, in the future you could use either these templates or a specific statistical package. In fact, when you are done, and as a necessary consequence of doing the exercises in this book, you will have created a library of spreadsheet templates that can be, in effect, your own personal statistical package.
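The black-box description above can be sketched in a few lines of code. The sketch below is written in Python rather than in a spreadsheet, which is an assumption of this illustration and not the approach the book itself uses. The function name `regression_coefficients` and the small data matrix are hypothetical; the scores were constructed so that the dependent variable equals 1 + 2 × IV1 + 3 × IV2 exactly. Each row is a subject, the first column is the DV, and the remaining columns are IVs.

```python
def regression_coefficients(matrix):
    """Return [constant, b1, b2, ...] by ordinary least squares.

    Each row of `matrix` is one subject: DV first, then the IVs.
    (Hypothetical helper written for this illustration.)
    """
    y = [float(row[0]) for row in matrix]
    # Prepend a column of 1s so the first weight is the overall constant.
    X = [[1.0] + [float(v) for v in row[1:]] for row in matrix]
    k = len(X[0])

    # Normal equations (X'X) b = X'y, stored as an augmented matrix.
    A = [[sum(r[i] * r[j] for r in X) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(X, y))]
         for i in range(k)]

    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("independent variables are linearly dependent")
        A[col], A[pivot] = A[pivot], A[col]
        A[col] = [v / A[col][col] for v in A[col]]
        for r in range(k):
            if r != col:
                factor = A[r][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] for i in range(k)]


# Made-up data matrix: columns are DV, IV1, IV2; rows are subjects.
data = [
    [1, 0, 0],
    [3, 1, 0],
    [4, 0, 1],
    [6, 1, 1],
    [8, 2, 1],
]
coefs = regression_coefficients(data)
print("constant and weights:", [round(c, 6) for c in coefs])
```

In practice a spreadsheet's regression tool or a statistical package does this work; the point of the sketch is only that the input is a rectangular matrix and the output is a constant plus one weight per independent variable.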
1.3 AN INTEGRATED APPROACH TO LEARNING STATISTICS
A Note About Statistical Notation As noted earlier, spreadsheets provide a graphic notation of sorts and allow notation to be simplified, but notation is hardly eliminated entirely. Some notation is needed. Yet notation can be a tricky subject that raises passions seemingly out of proportion to its importance. Part of the problem may be prior experience. Once we have learned one set of notation, we are often reluctant to learn another, just as we are likely to favor the first word-processing program we learned and reject subsequent ones as ridiculous or unnatural. This presents problems. What sort of notation should be used here? Is it possible to select symbols that seem useful, clear, consistent, and reasonably natural to most readers? Moreover, for this book we want notation that can easily be used for labeling spreadsheets, which rules out a fair number of nonstandard symbols, including something as simple as a bar over a letter. Consider, for example, one of the most common statistics, the arithmetic mean. When raw scores are symbolized with X, their mean is often symbolized as an X with a bar above it (read "X-bar"), but there is no easy way to enter X-bar as a spreadsheet label. Another symbol often used to designate the mean is M, and because of its typographical simplicity, we decided to use it in this book. Thus MX (read "mean-of-X") indicates the mean of the X scores, My indicates the mean of the Y scores, and so forth. For purposes of labeling spreadsheets, you may use either MX and My (no italics and no subscripts) or MX and My, whichever you find easier with your spreadsheet. Some might prefer X-bar and Y-bar, but what is easy for a pen often becomes complex for a computer. Our use of notation in this book has been guided by a few simple principles. Whenever possible, and recognizing that there is some but by no means universal standardization of statistical notation among textbooks, we have usually opted for forms in common use. 
However, as mentioned in the preceding paragraph, usability as a spreadsheet label has affected our notational choices. We have also tried to avoid complex forms. For example, if no ambiguity is introduced, we often avoid subscripts, and thus X may be used instead of Xij to indicate a particular datum if context (and location in spreadsheet) indicate clearly which datum is under discussion. Above all, we have endeavored to be consistent within this book and to define all notation when first introduced. Statistical symbols and key terms are set off from the text in boxes (labeled Notes) and, for ease of reference, statistical symbols and key terms are collected together in a glossary at the back of the book. Finally, and in order to facilitate comparison with other notation the reader may have learned or may encounter, we usually mention any common alternative forms.
The Importance of Spreadsheet Exercises This book is intended to be one leg of a tripod. The reader and the computer form the other two legs. When using this book, we assume that readers will have their computer at hand and they will use it to check examples and perform all exercises given in the text. Answers for most exercises are provided, either in the text itself, in figures in the text, or in the Answers to Selected Exercises section at the end of the book. This should be understood as a device to provide immediate feedback, not as an invitation to short-circuit the kind of learning that comes only by doing things for yourself. Tripods are useful because their three legs provide stability, but all three legs working together are required. When deprived of one of its legs, a tripod becomes unstable and falls. Likewise, unless you ground your work firmly on the three legs we intend— the material presented in this book, yourself, and your computer— your progress could be unstable and your
understanding might fall short. The moral is clear: You need to read thoughtfully and carefully, do all exercises, and think about how the exercises and the written material relate.

Readers will come to this text with different degrees of spreadsheet experience, including none, which is why detailed instructions are provided for many of the exercises, especially those earlier in the book. For many readers, such instructions will prove welcome. But for readers who begin with more experience, and for readers who become experienced in the course of doing these exercises, detailed instructions can become more of an annoyance than a help. For that reason, general instructions are provided for all exercises, allowing more practiced or venturesome readers the option of working more on their own.

Many of the exercises build on previous problems, modifying previous spreadsheets for new purposes. This is advantageous because it minimizes the tedium of data entry and demonstrates in a clear way how your work, and your learning, are accumulating. Nonetheless, we recommend that you save a copy of each spreadsheet template that you develop, even if only a slight modification of an earlier template. You may want to print a copy of each different template for future reference and you will want to save a copy on disk or other storage medium. To avoid confusion, give each template that you save a name that relates to the exercise, one that allows you to identify and retrieve it later. As you will soon discover, you will often have occasion to refer back to previous spreadsheets and you will want to retrieve earlier attempts for later modification.

Now, on to the work at hand. The next chapter begins where researchers typically begin, with a research question, some research data, and the need to evaluate the hypothesis that stimulated the research in the first place.
2
Getting Started: The Logic of Hypothesis Testing
In this chapter you will:
1. Learn the difference between descriptive and inferential statistics.
2. Learn the difference between samples and populations.
3. Learn the difference between statistics and parameters.
4. Be introduced to the logic of statistical tests.
5. Learn about type I errors (false claims).
6. Learn about type II errors (missed effects).
7. Learn about the power of statistical tests to detect real effects.
Students in the behavioral sciences study statistics for a variety of reasons: the course is required, they need to analyze their thesis data in order to get a degree, and so forth. Thinking positively, let us assume that learning how to analyze, understand, and interpret research data is the paramount motivation. Then, rather than starting with long digressions into descriptive statistics and basic probability theory, it makes sense to begin immediately showing how statistics can be used to answer research questions.

2.1 STATISTICS, SAMPLES, AND POPULATIONS

In discussing our need for statistics, imagine as our starting point a very simple study. Assume that a maverick psychotherapist decides to try an unusual treatment. The next 10 people who show up at her office are included in the study. Each is referred to an outside consultant for an initial evaluation. Then appointments are scheduled weekly for the next 3 months, but instead of talking to the clients, the psychotherapist simply gives them $50 and sends them on their way. After 3 months, the patients are again evaluated by an outside consultant who knows nothing of this unusual treatment. Based on the consultant's reports, the therapist determines that 8 of 10 clients improved. Her data are shown in Fig. 2.1. The question is, should we be impressed by any of this? After all, not all 10 got better. And some would probably have gotten better just by chance alone. Is 8 out of 10 just a chance happening? Or do these results merit some attention?
FIG. 2.1. Data for the money treatment study, listing an Outcome for each of Subjects 1 through 10: + = improved, - = not improved.

Specifically, do they suggest that the "money cure" the therapist attempted is effective? A knowledge of statistics lets us answer questions like these and reach conclusions even when our knowledge about the world is incomplete. In this case, all we know is what happened with a sample of 10 people. But what can we say about the effect of the money cure on people in general?
Generalizing From Sample to Population Ultimately we want to know not just the datum for a single subject (whether the subject improved) and not just the group summary statistic (8 of 10 improved), but what happens to therapeutic clients in general. What we really want to know is, for the population of all possible clients, how many would get better—that is, what is the probability that the money cure would prove effective in general? Assessing all possible clients to find this out is a practical impossibility. Even if we could identify potential clients, the number would be overwhelming. So instead we select a limited but manageable number to study and hope that this sample represents the population at large.
The Importance of a Random Sample

The study sample and how it is selected are the foundation on which the logic of statistical reasoning rests. In the most narrow sense, the results of the money-cure study tell us only about the 10 people actually studied. Yet normally we want to claim that these results generalize to a wider population. Such claims are justified if our sample is truly a random sample from the population of interest. We would first need to define the population of interest (e.g., all potential psychotherapy clients), put their names in a hat, and then draw 10 names. This would be a random sample because each and every name has an equal chance of being selected. In practice, however, the requirement of random sampling is almost never met in behavioral science research. Some survey research studies that use random-digit dialing of respondents come very close. But almost always psychological studies use "samples of convenience" and then argue that the sample, if not randomly selected, at least appears to represent middle-class mothers, college sophomores, happily married couples, and so forth. This rather
casual and after-the-fact approach to sampling is less than ideal from a statistical point of view, and although commonly condoned, it does place a burden on individual researchers to justify any generalizations.
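The names-in-a-hat procedure can be mimicked in code. The following is a minimal Python sketch, assuming a hypothetical population of 500 potential clients with made-up identifiers; the fixed seed is there only so that the draw can be reproduced.

```python
import random

# Hypothetical population of interest: 500 potential clients.
population = [f"client_{i:03d}" for i in range(1, 501)]

rng = random.Random(2005)            # fixed seed only for reproducibility
sample = rng.sample(population, 10)  # 10 names drawn without replacement

print(sample)
```

Because `random.sample` treats every name alike, each one has the same chance of ending up in the sample, which is the defining property of the random sample described above. A sample of convenience has no such guarantee.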
What Are Statistics Anyway? The term statistics is used in two quite different ways. In a narrow sense, a statistic is simply a number that characterizes or summarizes group data. The statistic of interest for the present example is a count or tally— the number of people in the study who improved— but other statistics for the group could be the average age of the subjects or the variability of their income. In a broader sense, statistics, as a body of knowledge and as an academic discipline, comprises concepts and techniques that allow researchers to move beyond the information provided by one study and generalize, or make inferences about, a larger population. Thus statistics, in the broad sense, makes extensive use of statistics, in the narrow sense.
Descriptive and Inferential Statistics

It is common to distinguish between descriptive statistics and inferential statistics. Descriptive statistics are statistics in the narrow sense. Examples include the ratio of men to women in a graduate statistics class, the average age of students at our university, the median income of households in the United States, and the variability of Scholastic Aptitude Test (SAT) scores for 2004. Descriptive statistics are simply summary scores for sets of data and as such characterize various aspects of the units studied. In psychological studies, units are often people (e.g., students in a graduate statistics class or the people who took the SAT in 2004) and are often referred to as participants in research reports. But other kinds of units (e.g., a mother-infant dyad, a family, a school, a city, a year, etc.) are possible.

Inferential statistics, on the other hand, implies using statistics in a broader sense. The term refers to a body of reasoning and techniques, called statistical decision theory or hypothesis testing, that allow researchers to make decisions about populations based on the incomplete information provided by samples. Thus the techniques of inferential statistics are designed to address a practical problem. If we cannot assess all units (individuals, dyads, schools, states, etc.) in the population, how can we still make decisions about various aspects of the population that we believe are important and interesting?
Populations and Parameters, Samples and Statistics Aspects of interest are identified with population parameters. For example, one parameter could be the probability that a client in the population would improve. Another could be the average age of clients in the population. Another could be the strength of the association between clients' age and their income, again in the population, not just in the sample. We assume that population parameters are quantifiable and, in theory at least, knowable in some sense even if not directly observable. Some writers, for example, regard populations as essentially infinite by definition, which means that determination of parameter values by direct measurement would forever elude us. But even if we cannot assess population parameters directly, we can at least estimate values for them from sample data instead. Indeed, statisticians spend considerable care and time deriving appropriate ways to estimate different parameters and demonstrating that those estimates have a variety of desirable properties.
FIG. 2.2. Schematic representation of the relations between statistics, samples, parameters, and populations.

In the sample, aspects of interest are identified with statistics (in the narrow sense). In other words, a statistic is to a sample as a parameter is to a population (see Fig. 2.2). Often (but not always) a sample statistic can be used to estimate a population parameter directly. For example, the population mean is usually represented with μ (the Greek lower case mu), the sample mean is often represented with M (as discussed in the last chapter), and as it turns out the value of μ can be estimated with the sample M. In other words,

$$\mu_{\text{estimated}} = M \tag{2.1}$$
Simplifying considerably, inferential statistics implies:
1. Drawing a random sample from a population.
2. Computing statistics from the sample.
3. Estimating population parameters from the sample statistics.

In the next section we discuss ways of comparing values of population parameters estimated from a sample with values assumed by a particular theory. As you will see, this is the way we test our research hypotheses.
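The three steps just listed can be walked through with a small simulation. This is a sketch under stated assumptions: the population of 10,000 client ages is invented for the illustration, and in real research the population mean would not be available for comparison. The names `mu`, `M`, and `mu_estimated` simply mirror the symbols used in the text.

```python
import random
import statistics

rng = random.Random(1)  # fixed seed only for reproducibility

# Hypothetical population: ages of 10,000 potential clients.
population_ages = [rng.randint(18, 80) for _ in range(10_000)]
mu = statistics.mean(population_ages)          # parameter (normally unknown)

sample_ages = rng.sample(population_ages, 50)  # 1. draw a random sample
M = statistics.mean(sample_ages)               # 2. compute the sample statistic
mu_estimated = M                               # 3. estimate the parameter

print(f"population mean = {mu:.2f}, sample mean M = {M:.2f}")
```

The sample mean will rarely equal the population mean exactly. How far apart the two are likely to be is the kind of question that the sampling distributions discussed later in this chapter address.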
Note 2.1
M: Roman letters are usually used to represent sample statistics. For example, M is often used for the sample mean.
μ: Greek letters are usually used to represent population parameters. For example, μ (lower case Greek mu) is often used for the population mean.
2.2 HYPOTHESIS TESTING: AN INTRODUCTION

With reference to the money-cure study described at the beginning of this chapter, is 8 out of 10 clients improving a significant result or just a chance happening? Just by chance we could have selected 8 out of 10 patients who improved even though in the population from which those subjects were selected the money cure has no effect on therapeutic outcome. If we assume that the money cure has no effect (that patients, left to their own devices, are as likely to improve as not), then the value for the population parameter of interest (the probability of improving) would be .5 (or 1 in 2). This is not a matter of computation, but simply an assumption, based on the theory that the money cure has no effect. But even if the true value of the population parameter were .5, it is still possible, just by chance alone, to draw a sample in which the value of the relevant statistic is .8. It may be unlikely, but it could happen. Even so, if an observed result (.8) is sufficiently unlikely, naturally we question whether or not the assumed value for the population parameter (.5) is reasonable.

Over the course of the past century, a method of inference termed statistical decision making or hypothesis testing has been developed, largely by Fisher but with generous assists from Neyman and Pearson. Hypothesis testing provides a way of deciding whether or not postulated or assumed values for various population parameters are tenable given the evidence provided by the statistics computed from a particular sample. Five elements central to this process, listed in the order in which they need to be addressed during hypothesis testing, are:
1. The null hypothesis, which provides a theoretical basis for constructing a sampling distribution for a particular test statistic.
2. The sampling distribution, which allows us to determine how probable particular values of a test statistic would be if in fact the null hypothesis were true.
3. The alpha level, a probability value that is used for the statistical test.
4. The test statistic, the value for which is computed from the sample.
5. The statistical test, which allows us to determine whether it is probable, or improbable (the exact probability value is the alpha level), that the sample was drawn from a population for which the null hypothesis is true.

The Null Hypothesis

The null hypothesis (usually symbolized H0) is one guess about the true state of affairs in the population of interest. It is contrasted with the alternative hypothesis (usually symbolized H1). The null and alternative hypotheses are usually formulated so that, as a matter of logic, they are mutually exclusive (only one of them can be true) and exhaustive (one of them must be true). In other words, if one is false, the other necessarily must be true. Thus if the null hypothesis states that it is not raining, then the alternative hypothesis states that it is raining. Textbooks in statistics often begin by introducing hypotheses about population means. For example, if the null hypothesis states that the population mean is zero, then an alternative hypothesis could be that the population mean is not zero. Symbolically:

$$H_0: \mu = 0 \qquad H_1: \mu \neq 0$$
In this case, the alternative hypothesis commits to a mean different from zero no matter whether that mean is smaller or larger than zero. Thus this H1 is nondirectional (sometimes called two-tailed). Another null hypothesis could state that the population mean is not greater than zero whereas the alternative hypothesis could be that the population mean is greater than zero. Symbolically:

$$H_0: \mu \le 0 \qquad H_1: \mu > 0$$

Thus this H1 is directional (sometimes called one-tailed) because it commits only to differences from zero that, in this case, are larger than zero.

Usually, the alternative hypothesis corresponds to the investigator's hunch regarding the true state of affairs; that is, it corresponds to what the investigator hopes to demonstrate (for the current example, that the money treatment has an effect). However, it is the tenability of the null hypothesis that is actually tested (that the money treatment has no effect). If the null hypothesis is found untenable, then as a matter of logic the alternative must be tenable, in which case investigators "reject the null" and, for that reason, "accept the alternative."

When encountered for the first time, this way of reasoning often seems indirect and even a bit tortured to many students. Why not simply demonstrate that the alternative hypothesis is tenable in the first place? This is partly a matter of convention, but there is good reason for it. There is a crucial difference between null and alternative hypotheses. In its typical form, a null hypothesis is exact. It commits to a particular value for a population parameter of interest. The alternative hypothesis, on the other hand, usually is inexact. It only claims that the parameter is not the value claimed by the null hypothesis but does not commit to a specific value (although, as we just saw, it may commit to a particular direction). (Exceptions to the foregoing exist, but they are rare.)
In order to construct a sampling distribution for the test statistic, which is the next step in hypothesis testing, an exact value for the population parameter of interest is needed. In most cases, it is provided by the null hypothesis. Thus we test, and perhaps reject, not the hypothesis that embodies our research concern and that postulates an effect, but another hypothesis that postulates exactly no effect and that, from a substantive point of view, is less interesting. According to the no-effect or null hypothesis for the present example, exactly half of the clients should get better just by chance alone. If this null hypothesis were true, then the population parameter for the probability of improving would be .5. This may or may not be the true value of the population parameter, but at least it provides a basis for predicting how probable various values of the test statistic would be, if the true value for the population parameter were indeed .5.
The Sampling Distribution Once the null hypothesis is defined, the sampling distribution appropriate for the test statistic can be specified. Sampling distributions are theoretical constructs and as such are based on logical and formal considerations. They resemble but should not be confused with frequency histograms, which are based on data and indicate the empirical frequency for the scores in a sample. Sampling distributions are important for hypothesis testing because we can use them to derive how likely (i.e., how probable) a particular value of a test statistic would be, in theory at least, if the null hypothesis were true. The binomial, which is described in chapter 3, is the appropriate sampling distribution for the current example.
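To make the idea of a theoretical sampling distribution concrete, the sketch below tabulates the binomial probabilities for the current example: 10 clients, each assumed under the null hypothesis to improve with probability .5. It is a Python illustration, not the spreadsheet treatment given in chapter 3, and the function name `binomial_distribution` is hypothetical.

```python
from math import comb


def binomial_distribution(n, p):
    """P(k of n clients improve) for k = 0..n, each improving with probability p."""
    return [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]


# Null hypothesis for the money-cure example: probability of improving is .5.
dist = binomial_distribution(n=10, p=0.5)
for k, prob in enumerate(dist):
    print(f"{k:2d} improved: {prob:.4f}")
```

Note that nothing in this table comes from data. It follows entirely from the assumed parameter value and the assumption that clients improve independently of one another, which is what distinguishes a sampling distribution from a frequency histogram.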
The phrase "in theory" as it applies to sampling distributions is important. The probabilities provided by a sampling distribution are accurate for the test statistic to the extent that the data and the data collection procedures meet the assumptions used to generate the theoretical sampling distribution in the first place. Assumptions for different tests vary. And in practice many assumptions can be violated without severe consequences. One assumption, however, called independence of measurements (or simply, independence), is basic to most sampling distributions and, if violated, raises serious questions about any conclusions. This key assumption requires that during data collection scores are assigned to each unit in the sample (subject, dyad, family, and so forth) independently. In other words, each score must represent an independent draw from the population. For the present example, this means that the evaluation of one client cannot be linked to, or affected by, the evaluation given another client.

To use the classic example, imagine an urn filled with tickets, each of which has a number printed on it. We shake the urn, reach in, remove one ticket, and note its number. We then replace the ticket and repeat the procedure until we have accumulated N numbers, the size of our sample. (If the population of tickets is large enough, it may not matter much whether we draw with, or without, replacement of tickets previously drawn.) This constitutes an independent sample because presumably tickets pulled on previous draws do not influence which ticket we pull on the next draw. We would not, however, select two tickets at a time because then pairs of tickets would be linked and the total of all tickets drawn would not constitute an independent sample.
And if the assumption of independence is violated, then we can no longer be confident that a probability value derived from a theoretical sampling distribution provides us with the correct probability value for the particular test statistic being examined.
The Alpha Level The alpha level is the probability value used for the statistical test. By convention, it is usually set to .05 or, more stringently, to .01. If the probability of the results observed in the sample occurring by chance alone (given that the null hypothesis is true) is equal to or less than the alpha level, then we declare the null hypothesis untenable.
The Test Statistic In contrast to the sampling distribution, which is based on theory, the value of the test statistic depends on the data collected, and thus in general terms a test statistic is a score computed from sample data. For example, if our null hypothesis involved the age of our clients, the average age might be used as a test statistic. In theory, any of a wide array of summary statistics could be used as test statistics, but in practice, statistical attention has focused on just a few. For the present example, the test statistic is the probability that clients who received the money cure would improve and its value, as determined from the sample of 10 subjects, is .8.
The Statistical Test Given a particular null hypothesis, an appropriate sampling distribution, and a value for the appropriate test statistic, we are in a position to determine whether or not the result we observed in our sample would be probable if the null hypothesis is in fact true. If it turns out that our result would occur only rarely,
given that the null hypothesis is true, we may decide that the null hypothesis is untenable. But how rare is rare? As noted a few paragraphs back, by convention and somewhat arbitrarily, 5% is generally accepted as a reasonable cutoff point. Certainly other percentages could be justified, but in general behavioral scientists are willing to reject the null hypothesis and accept the alternative only if the results actually obtained would occur 5% of the time or less by chance alone if the null hypothesis were true. This process of deciding what level of risk is acceptable and, on that basis, deciding whether or not to reject the null hypothesis constitutes the statistical test. For the present example (testing the effectiveness of the money cure), the appropriate test is called a sign test or a binomial test. In the next chapter we describe this test and demonstrate its use.
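The decision rule just described can be sketched for the money-cure data. The Python code below is an illustration under two assumptions that are not taken from the text: the alternative hypothesis is treated as directional (only results of 8 or more improvements count as evidence), and the function name `upper_tail_probability` is hypothetical. The next chapter's presentation of the sign test remains the authoritative treatment.

```python
from math import comb


def upper_tail_probability(n, k, p):
    """Probability of k or more improvements out of n if each has probability p."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))


alpha = 0.05  # conventional alpha level
p_value = upper_tail_probability(n=10, k=8, p=0.5)  # 8 of 10 improved, null p = .5
reject_null = p_value <= alpha

print(f"P(8 or more of 10 | null true) = {p_value:.4f}; reject null: {reject_null}")
```

Under these assumptions the tail probability is 56/1024, or about .055, which is slightly larger than .05. The sketch shows how close a call the 8-of-10 result is, and why the choice of alpha level and of a directional or nondirectional alternative needs to be made before looking at the data.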
2.3 FALSE CLAIMS, REAL EFFECTS, AND POWER
Type I Error: The Risk of Making a False Claim Earlier in this chapter we claimed that knowledge of statistics allows us to make decisions from incomplete information. Thus we may make decisions about a population based only on a sample selected from the relevant population. For the money-cure example, only 10 subjects were examined—which falls far short of a complete survey of psychotherapy clients. Yet based on this sample of 10 subjects (and pending presentation of the sign test in the next chapter), we might conclude that the money cure affected so many clients in the sample positively that the null hypothesis (the hypothesis that the clients were selected from a population in which the money cure has no effect) is untenable, which leads us to conclude that, yes, the money cure does have a beneficial effect. Basing decisions on incomplete information entails a certain amount of risk. What if, for example, in the population from which our subjects were selected, the money cure had no effect even though, in this one study, we just happened to select a high proportion of clients who got better? In this case, if we claimed an effect based on the particular sample we happened to draw, we would be wrong. We would be making what is called a type I error, which means we would have rejected the null hypothesis when in fact the null hypothesis is true. We would have made a false claim. Given the nature of statistical inference, we can never eliminate type I errors, but at least we can control how likely they are to occur. As noted earlier, the probability cutoff point for rejecting the null hypothesis is called the alpha level. If we set our alpha level to the conventional .05, then the probability that we will reject the null hypothesis wrongly, that is, make a type I error, is also .05. 
After all, by setting the alpha level to .05 for a statistical test we commit ourselves to rejecting the null hypothesis if the results we obtain would occur 5% of the time or less given that the null hypothesis is true. If we did the same experiment again and again, and if in fact there is no effect in the population, over the long run 95% of the time we would correctly claim no effect. But 5% of the time, just by the luck of the draw, we would wrongly claim an effect. As noted earlier, by convention most behavioral scientists find this level of risk acceptable.
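The long-run argument above can be checked by simulation. The sketch below repeats a hypothetical 10-client study 20,000 times in a population where the null hypothesis is true (probability of improving is .5) and counts how often the null hypothesis is rejected anyway. The directional decision rule, the number of repetitions, the seed, and the function names are all assumptions made for this illustration.

```python
import random
from math import comb


def upper_tail(n, k, p=0.5):
    """Probability of k or more improvements out of n under the null hypothesis."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))


def simulate_false_claims(n=10, alpha=0.05, studies=20_000, seed=42):
    """Fraction of simulated studies that reject a TRUE null hypothesis."""
    rng = random.Random(seed)
    false_claims = 0
    for _ in range(studies):
        # Each client improves with probability .5, so there is no real effect.
        improved = sum(rng.random() < 0.5 for _ in range(n))
        if upper_tail(n, improved) <= alpha:
            false_claims += 1  # rejected a true null: a type I error
    return false_claims / studies


rate = simulate_false_claims()
print(f"Proportion of false claims: {rate:.4f}")
```

One detail worth noticing: because a count of improved clients can take only whole-number values, this particular rule rejects only when 9 or 10 clients improve, so the long-run false-claim rate is 11/1024 (about .011), below the nominal .05. With a discrete test statistic, alpha is best read as an upper bound on the type I error rate.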
Type II Error: The Risk of Missing a Real Effect Making a false claim is not the only error that can result from statistical decision making. If, for the population from which our subjects were selected, the money cure indeed has an effect but, based on the particular sample drawn, we claimed that there was none, we would be making what is called a type II error. That is,
we would have failed to reject the null hypothesis when in fact the null hypothesis is false. We would have missed a real effect.

Under most circumstances, we do not know the exact probability of a type II error. The probability of a type II error depends on the actual state of affairs in the population (which we do not know exactly), not on the state of affairs assumed by the null hypothesis (which we define and hence know exactly). A type I error occurs when the magnitude of the effect in the population (indexed by an appropriate population parameter) is zero (or some other specific value) and yet, based on the sample selected, we claim an effect (thereby making a false claim). The probability that this will occur is determined by the alpha level, which we set, and depends on the sampling distribution for the test statistic under the null hypothesis, which we assume. Hence we can specify and control the probability of type I error. In contrast, a type II error occurs when the magnitude of the effect in the population is different from zero (or another specific value) by some unspecified amount and yet, based on the sample selected, we do not claim an effect (thereby missing a real effect). The probability of a type II error can be determined only if the exact magnitude of the effect in the population is known, which means that, under most circumstances, we cannot determine the probability of a type II error exactly.

However, even if we do not know the probability of a type II error, we can affect its magnitude by changing the alpha level. If we select a more stringent alpha level (.01 instead of .05, for example), which has the effect of decreasing the probability of making a false claim, we necessarily increase the probability of missing a real effect. It is a trade-off. Decreasing the probability of a type I error increases the probability of a type II error and vice versa. If we are less likely to make a false claim, we are more likely to miss a real effect.
If we are less likely to miss a real effect, we are more likely to make a false claim.
Power: The Probability of Detecting Real Effects
Perhaps too pessimistically, the preceding paragraphs have focused on the ways statistical decision making can go wrong. But just as there are two ways we can be wrong, there are also two ways we can be right. (All four possibilities are shown schematically in Fig. 2.3.) One way that we can be right occurs when there genuinely is no effect in the population and we claim none. The probability of making a false claim is alpha (α), so the probability of correctly claiming no effect is 1 - alpha. If there is no effect, if our alpha level is .05, and if we conduct study after study, over the long run 95% of the time we will be right when we claim no effect. A second way that we can be right occurs when we claim an effect and there genuinely is one. Just as alpha (α) is the probability of making a false claim (Type I error), so the probability of missing a genuine effect (Type II error) is often symbolized as beta (β). And just as the probability of correctly claiming no effect is 1 - alpha, so the probability of correctly claiming an effect is 1 - beta. This probability, the ability to detect real effects, is called the power of a statistical test.
The power of statistical tests is affected by three factors. First, the magnitude of the effect in the population affects power. Other things being equal, bigger effects are more likely to be detected than smaller effects and hence when effects are large we are more likely to claim an effect. Second, as is discussed in chapter 17, we are more likely to detect an effect of a particular magnitude when sample sizes are larger than when they are smaller. Third, alpha level affects power. If the alpha level is made less stringent (changed from .05 to .10, for example),
GETTING STARTED: LOGIC OF HYPOTHESIS TESTING
FIG. 2.3. Possible outcomes of the statistical decision-making process and their associated probabilities (P symbolizes probability).
power will be increased. We will be more likely to detect real effects if they exist but we will also be more likely to make false claims if H0 is true. Conversely, if we make the alpha level more stringent, power will be decreased. We will be less likely to make false claims if H0 is true but we will also be less likely to detect real effects if they exist. Of these three, only sample size is under our control and so the only practical way to increase power is to increase sample size. Alpha is almost always set at .05 (or less) by convention, so choosing a less stringent alpha level is usually not a practical way to increase power; in any case, such increases would be balanced by the concomitant increase in type I error. And the magnitude of the effect in the population is hardly under our control. In fact, as we noted earlier, normally we do not even know its actual value and thus we cannot compute a value for either beta (the probability of missing a real effect) or power (the probability of detecting a real effect).
Note 2.2 Type I Error
Making a false claim, or incorrectly rejecting the null hypothesis (also called an alpha error). The probability of making a Type I error is alpha.
Type II Error
Missing a real effect, or incorrectly accepting the null hypothesis (also called a beta error). The probability of making a Type II error is beta.
Power
The probability of detecting a real effect, or the probability of correctly rejecting the null hypothesis. The power of a test is one minus the probability of missing a real effect, or 1 - beta.
Alpha, Beta, and Power: A Numerical Example
Under normal circumstances, we do not know the true state of affairs for the population. All we know (because we assume it) is what the sampling distribution for the test statistic would be if the null hypothesis were true. For example, imagine that we have developed a test statistic whose sampling distribution under the null hypothesis is as shown in the first histogram (top) of Fig. 2.4. Altogether there are 20 possible outcomes, each represented by a square. One of the outcomes is 1, which is why there is one square or box above the value 1. Two of the outcomes are 2, which is why there are two boxes above the value 2. Similarly, three of the possible outcomes are 3, four are 4, four are 5, three are 6, two are 7, and one is 8. Area is proportional to probability, and thus the probability that the outcome actually will be 1 on any one trial is 1/20, that it will be 6 is 3/20, and so forth. The most likely outcomes for the test statistic are either 4 or 5. For each, the probability is 4/20 or .2; hence the probability of either a 4 or a 5 is .4 (because the probability that at least one of a set of mutually exclusive events occurs is the sum of their separate probabilities). Symbolically,
P(4 or 5) = P(4) + P(5) = .2 + .2 = .4.
(Recall that P symbolizes probability.) Imagine further that we have decided beforehand that we will reject the null hypothesis only if our study yields a test statistic whose value is larger than expected (this specifies a directional or one-tailed test as opposed to a nondirectional or two-tailed test) and that we have set our alpha level to an unconventionally generous .15. Note that the theoretical probability of a 7 is 2/20 or .10 and the probability of an 8 is 1/20 or .05; hence the probability of either a 7 or 8 (the largest values in this sampling distribution) is the sum of .10 and .05 or .15. Symbolically,
P(7 or 8) = P(7) + P(8) = .10 + .05 = .15.
Thus if we conduct a study and discover that the value of the test statistic is 7 or larger, we would reject the null hypothesis. This is, after all, a relatively unlikely outcome, given the sampling distribution portrayed at the top of Fig. 2.4. If we performed the study hundreds of times, and if the null hypothesis were true, the value for the test statistic would be as big as 7 or 8 only 15% of the time.
In order to demonstrate the relations between alpha, beta, and power, it is convenient (in fact, necessary) to assume that we actually know the true state of affairs for the population. For example, imagine that we know the true value of the appropriate population parameter and it generates a sampling distribution like the one shown in the second histogram (bottom) of Fig. 2.4. Given this information we can compute beta. The sampling distributions shown in Fig. 2.4, like all sampling distributions, indicate the likelihood of various outcomes. For example, the first histogram (top) indicates that if the null hypothesis is true then a test statistic as large as 7 or 8 would occur only 15% of the time or less. Each time we do a study, of course, only one outcome is produced. If that outcome were 6 or less, we would not reject the null hypothesis. After all, outcomes of 6 or less would occur 85% of the time (17 times out of 20; top histogram). In other words:
P(1 or 2 or 3 or 4 or 5 or 6) = P(1) + P(2) + P(3) + P(4) + P(5) + P(6)
= .05 + .10 + .15 + .20 + .20 + .15 = .85
However, if the population probabilities for the test statistic are as indicated in the second histogram (bottom), then outcomes of 6 or less would actually occur 6 times out of 20, or just 30% of the time:
P(4 or 5 or 6) = P(4) + P(5) + P(6) = .05 + .10 + .15 = .30
If we conducted the study repeatedly, over the long run 30% of the time we would not reject the null hypothesis (because the value of the test statistic would be 6 or less). In other words, 30% of the time we would fail to detect the effect. Hence, if the true state of affairs is as indicated in the bottom histogram, and our null hypothesis is based on the top histogram, then the probability of a type II error (of missing the effect) would be .3. However, 70% of the time we would correctly claim an effect. In this case the power of the statistical test would be .7.
FIG. 2.4. Sampling distributions for a hypothetical test statistic assumed by the null hypothesis (top) and actually occurring in the population (bottom).
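The arithmetic of this numerical example can be checked with a short script. What follows is only a minimal sketch in Python, not part of the original presentation. The counts for the top histogram come from the description in the text. The full shape of the bottom histogram is an assumption: the text gives only P(4), P(5), and P(6), so the sketch takes the bottom distribution to be the null distribution shifted three units to the right. The names `null_counts`, `true_counts`, and `tail_probability` are hypothetical.

```python
from fractions import Fraction

# Number of equally likely outcomes (out of 20) at each value of the
# test statistic under the null hypothesis (top histogram, from the text).
null_counts = {1: 1, 2: 2, 3: 3, 4: 4, 5: 4, 6: 3, 7: 2, 8: 1}

# ASSUMPTION: the bottom histogram has the same shape, shifted right by 3.
# The text confirms only P(4) = .05, P(5) = .10, and P(6) = .15.
true_counts = {value + 3: count for value, count in null_counts.items()}

def tail_probability(counts, critical_value):
    """Probability that the test statistic is >= critical_value."""
    total = sum(counts.values())
    in_tail = sum(c for value, c in counts.items() if value >= critical_value)
    return Fraction(in_tail, total)

critical_value = 7  # reject the null hypothesis for values of 7 or more

alpha = tail_probability(null_counts, critical_value)  # P(false claim)
power = tail_probability(true_counts, critical_value)  # P(detecting the effect)
beta = 1 - power                                       # P(missing the effect)

print(f"alpha = {float(alpha):.2f}, beta = {float(beta):.2f}, power = {float(power):.2f}")
```

Changing `critical_value` moves the boundary of the rejection region, which is one way to see the trade-off between alpha and beta directly.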
The Region of Rejection
Sampling distributions like those portrayed in Fig. 2.4 show the relations among alpha, beta, and power graphically. Consider the unshaded and shaded areas. For the null-hypothesis-generated sampling distribution (top), and an alpha level of .15, values in the shaded area to the right would cause us to reject the null hypothesis (in this case, values of 7 or more), whereas values in the unshaded area to the left would not cause us to reject the null hypothesis (in this case, values of 6 or less). Recall that for a sampling distribution, area is equivalent to probability. Thus, for an alpha level of .15, 15% of the area is shaded. This area is called the region of rejection or the critical region. If values computed for our test statistic fall within this rejection region (in this case, if our test statistic is 7 or greater), we would reject the null hypothesis.
In the top panel, 85% of the area under the null-hypothesis-generated sampling distribution is unshaded, to the left. This proportion is 1 - alpha, the probability of correctly making no claim if the null hypothesis is true. But if the null hypothesis is false, and if we know that the true state of affairs is as depicted in the bottom panel, then matters are different. The unshaded 30% of the area to the left represents beta, the probability of missing an effect (the probability of a type II error), whereas the shaded 70% of the area to the right is 1 - beta and represents power. In this case, we have a 70% chance of detecting an effect as significant, given an alpha level of .15.
The shaded area can help us visualize the relation between alpha and beta. Imagine that we reduce the alpha level to .05, reducing the shaded portion to values of 8 or more. This makes alpha (the probability of making a false claim, a type I error) smaller (top panel). But at the same time, beta (the probability of missing an effect, a type II error) becomes larger while 1 - beta (the power of the test or the probability of correctly detecting a genuine effect) necessarily decreases (bottom panel).
Exercise 2.1 Type I Errors, Type II Errors, and Power
The problems in this exercise refer to Fig. 2.4. They provide practice in determining the probability of making false claims (type I error), of missing real effects (type II error), and of detecting real effects (power). All questions assume that if the null hypothesis is true, the test statistic is distributed as shown in the top panel.
1. If the alpha level were .05, instead of .15, what values for the test statistic would fall in the critical region, that is, lead us to reject the null hypothesis?
2. If the alpha level were .30, what values for the test statistic would fall in the critical region?
3. If the alpha level were .05, and the true state of affairs were as depicted in the bottom panel, what is the probability of missing a real effect? What is the probability of correctly claiming an effect?
4. If the alpha level were .30, and the true state of affairs were as depicted in the bottom panel, what is the probability of missing a real effect? What is the probability of correctly claiming an effect?
2.4 WHY DISCUSS INFERENTIAL BEFORE DESCRIPTIVE STATISTICS?
The differences between samples and populations, between statistics and parameters, and between descriptive and inferential statistics were discussed in this chapter. Hypothesis testing, which represents an important application of inferential statistics, was presented and the ways a statistical decision could be either correct or incorrect were noted. Many textbooks spend the first several chapters detailing descriptive statistics, a topic not addressed in earnest in this book until chapter 5. Descriptive statistics usually seem more concrete than inferential statistics (there are no type I and type II errors to puzzle about and no parameters to estimate, for example), and as a result many students find presentations of descriptive statistics less challenging and easier to learn. Nonetheless, we prefer to begin with material that is more abstract and certainly more intellectually exciting (hypothesis testing, for example) because it immediately and clearly demonstrates the usefulness of statistical knowledge and techniques. Continuing in this vein, a common and useful statistical test, the sign test, is presented in chapter 3.
Note 2.3 Alpha Level
Maximum acceptable probability for type I error (claiming an effect based on sample data when there is none in the population). Conventionally set to .05 or .01.
Rejection Region
Designates values for the test statistic whose probability of occurrence is equal to or less than the alpha level, assuming that the null hypothesis is true. If the test statistic assumes any of these values (falls in the region of rejection), the null hypothesis is rejected.
3
Inferring From a Sample: The Binomial Distribution
In this chapter you will:
1. Learn about a simple sampling distribution, the binomial.
2. Learn how to perform a simple statistical test, the binomial or sign test.
This chapter continues the discussion of hypothesis testing begun in chapter 2 and illustrates how a common statistical test, the binomial or sign test, can be used to answer simple research questions. The sign test makes use of one of the easiest distributions to understand, the binomial.
3.1 THE BINOMIAL DISTRIBUTION
Interest in the binomial distribution began with Blaise Pascal in the 17th century, and the distribution has fascinated mathematicians and statisticians ever since. The binomial is easy to understand (at least now, after several brilliant mathematicians have shown us how), easy to generate, and lends itself to a clear and straightforward demonstration of hypothesis testing. Moreover, a simple and useful statistical test, the sign test, is based on the binomial. For all these reasons, the first sampling distribution introduced in this book is the binomial.
The binomial is the appropriate sampling distribution to consider whenever a trial results in one of two events, that is, whenever the results of a single trial (or treatment) can be categorized in one, and only one, of two ways. For example, in the money-cure study described in the last chapter, clients were categorized as either improved or not improved. In more general terms, whenever a subject (or a family, a coin toss, or whatever constitutes the sampling unit) can be assigned to one of two possible categories, the binomial distribution can be used. From it, we can figure out how probable the results for a particular study would be if the null hypothesis (and not the alternative hypothesis) were in fact true.
Consider the money-cure study. If we examine only one subject (symbolized S), then there are only two possible outcomes for the study, namely the events defined for a single trial (because, in fact, this study consists of only one trial):
1. The subject improves.
2. The subject does not improve.
If we assume that both events are equally likely, then the probability that the subject improves is 1/2 or .5. The probabilities for both events must sum to 1, so the probability that the subject does not improve must be 1 minus .5. Symbolically:
P = P(S improves) = .5
Q = P(S does not improve) = 1 - .5 = .5
(P is often used to represent the probability for the first event, Q for its complement; thus Q = 1 - P.) However, if we examine two subjects (S1 and S2), then there are four possible outcomes for the study:
1. Both subjects improve.
2. The first improves but the second does not improve.
3. The first does not improve but the second does improve.
4. Neither subject improves.
The probability that the first subject will improve is .5 and the probability that the second subject will improve is also .5; therefore, because the probability that all of a set of independent events will occur is the product of their separate probabilities, the probability that both subjects will improve is .25. Symbolically (imp = improves),
P(both improve) = P(S1 imp and S2 imp) = P(S1 imp) x P(S2 imp) = 0.5 x 0.5 = 0.25
Similarly, because the probability of a subject not improving is also .5, the probability for the second, third, and fourth outcomes will also be .25. This makes sense, because the probabilities for all outcomes must sum to 1. Although there are four different possible outcomes for a study with two subjects, there are only three outcome classes:
1. Two improve (outcome 1).
2. One improves (outcomes 2 and 3).
3. None improve (outcome 4).
The probability for the first outcome class is .25. This class contains only one outcome, outcome 1, whose probability was computed in the previous paragraph. Likewise, the probability for the third class is also .25, again because this outcome class contains only one outcome, outcome 4, whose probability is .25. But the probability for the second outcome class is .5. This class contains two outcomes, outcomes 2 and 3, both of whose probabilities are .25. Therefore, because the probability of any one of a set of mutually exclusive events occurring is the sum of their separate probabilities, the probability that either will occur is twice .25 or .5. Symbolically,
P(one improves) = P(S1 imp and S2 not) + P(S1 not and S2 imp)
= P(S1 imp) x P(S2 not) + P(S1 not) x P(S2 imp)
= 0.5 x 0.5 + 0.5 x 0.5
= 0.25 + 0.25 = 0.5
This formula illustrates the two basic probability rules we have been using: (a) the and rule, which states that the probability that two independent events will both occur is the product of their individual probabilities; and (b) the or rule, which states that the probability that either of two mutually exclusive events will occur is the sum of their individual probabilities.
Often discussions of the binomial are cast in terms of coins. After all, under normal circumstances a tossed coin must come up either heads (H) or tails (T). Continuing our intuitive introduction to the binomial, if we toss a coin three times (which is equivalent to examining three subjects), then there are eight different possible outcomes for a three-trial study (T = tails, H = heads):
1. TTT
2. TTH
3. THT
4. THH
5. HTT
6. HTH
7. HHT
8. HHH
and four outcome classes:
1. 0 heads (1 above)
2. 1 head (2, 3, and 5 above)
3. 2 heads (4, 6, and 7 above)
4. 3 heads (8 above)
A convenient way to represent all of the possible outcomes is with a branching tree diagram (see Fig. 3.1). The first branch at the top represents the first toss, which could be tails or heads. The second set of branches represents the second toss, and so forth. If you tried to list all possible outcomes without some mnemonic device, you might miss a few. If you use the branching diagram, you should be able to list all without omission.
FIG. 3.1. The eight possible outcomes if a coin is tossed three times.
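A tree diagram is one way to list every outcome without omission; a few lines of code are another. The following is a minimal sketch in Python, offered as an illustration rather than as part of the original text. The function names `enumerate_outcomes` and `group_by_heads` are hypothetical. `itertools.product` generates the sequences in the same order as the numbered list above.

```python
from collections import defaultdict
from itertools import product

def enumerate_outcomes(n_tosses):
    """Return every sequence of T and H of the given length."""
    return ["".join(seq) for seq in product("TH", repeat=n_tosses)]

def group_by_heads(outcomes):
    """Group outcomes into outcome classes keyed by the number of heads."""
    classes = defaultdict(list)
    for outcome in outcomes:
        classes[outcome.count("H")].append(outcome)
    return dict(classes)

outcomes = enumerate_outcomes(3)
classes = group_by_heads(outcomes)

print(outcomes)
for heads in sorted(classes):
    print(heads, "heads:", classes[heads])
```

Calling `enumerate_outcomes` with a larger number of tosses gives a quick check on a hand-drawn tree diagram.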
At this point you may begin to see (or remember) a pattern. With one toss there are two outcomes and two classes. With two tosses there are four outcomes and three classes. With three tosses there are eight outcomes and four classes. And in general, if there are N tosses, there will be N + 1 classes (e.g., for 7 tosses there will be 8 classes: 0 heads, 1 head, 2 heads, ..., 7 heads) and there will be 2^N outcomes (because each additional toss doubles the number of outcomes). Thus for 2 tosses there are 4 outcomes, for 3 tosses, 8 outcomes, for 4 tosses, 16 outcomes, for 5 tosses, 32 outcomes, and so forth (see Fig. 3.2).
Exercise 3.1 Tree Diagrams
This exercise provides practice in generating tree diagrams and in determining theoretical probabilities for different outcomes.
1. Generate a tree diagram for 4 tosses. Count the number of outcomes in each outcome class. Have you generated the correct number of outcomes and outcome classes? If this seems easy, generate a tree diagram for 5 or 6 tosses.
2. How many outcomes, and outcome classes, are there for 7 tosses? For 8 tosses? For 10 tosses? For 20 tosses?
FIG. 3.2. Relation between number of trials, number of outcomes, and number of outcome classes.
The principles described in the preceding several paragraphs form the basis for generating a binomial distribution, which is then used as the sampling distribution for the sign test. If we know the number of subjects (N) in a study, and if it is appropriate to assign each subject to one of two categories, then we know that there are 2^N different outcomes, or 2^N different ways the study might turn out. If according to our null hypothesis the two events for a single subject are equally likely (i.e., if the null hypothesis states that the population parameter for the probability of one of two events is .5, which means that the probability for the other event must also be .5), it follows that each of the 2^N possible outcomes for the study will be equally likely also. For example, if a coin is tossed two times, then there are four possible outcomes (TT, TH, HT, HH), and if the assumed probability for a head is .5, then the probability for any one of the four outcomes is .25.
In general, if the null hypothesis assumes that P = .5, and if there are N trials (subjects, coin tosses, and so forth), then in order to compute probabilities for the different outcome classes we need only know how many of the 2^N outcomes fall in each outcome class. For example, as described previously, with three tosses there are eight possible outcomes. Three of the possible outcomes involve two heads (THH, HTH, HHT); thus the probability of tossing two out of three heads is 3/8 or .375. One of the possible outcomes involves three heads (HHH); thus the probability of tossing all three heads is 1/8 or .125. Putting these together, four of the outcomes involve two or more heads (THH, HTH, HHT, HHH) and so the probability of getting at least two heads in three tosses is 4/8 or .5. But remember, these are all theoretical probabilities based on the null hypothesis that the coin is fair.
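The counting argument for three tosses can be written out as a short calculation. This is a minimal sketch in Python, assuming a fair coin (P = .5) so that all 2^N outcomes are equally likely. The function name `class_probabilities` is hypothetical, and exact fractions are used so that the results can be compared directly with 3/8, 1/8, and 4/8.

```python
from fractions import Fraction
from itertools import product

def class_probabilities(n_tosses):
    """P(k heads) for k = 0..n_tosses, found by counting equally likely outcomes."""
    total = 2 ** n_tosses                 # number of possible outcomes
    counts = [0] * (n_tosses + 1)         # one slot per outcome class
    for seq in product("TH", repeat=n_tosses):
        counts[seq.count("H")] += 1
    return [Fraction(count, total) for count in counts]

probs = class_probabilities(3)
p_two_heads = probs[2]             # three of the eight outcomes
p_three_heads = probs[3]           # one of the eight outcomes
p_at_least_two = sum(probs[2:])    # or rule: add the two classes

print(p_two_heads, p_three_heads, p_at_least_two)
```

Listing all outcomes becomes slow for large N, which is one reason a shortcut for the class counts is useful.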
Pascal's Triangle
For just five or six subjects, you could generate a list of all the simple outcomes using a tree diagram and count the number of outcomes in each class, but with larger numbers of subjects this becomes unwieldy. Fortunately, there is a simple way to compute the number of outcomes in each outcome class. Named for Blaise Pascal, this device is called Pascal's triangle, and although its role in the history of mathematics has been profound, our use of it here is quite simple.
Exercise 3.2 Pascal's Triangle
This is the second exercise to require a spreadsheet program. The purpose is to develop a spreadsheet template that generates values for Pascal's triangle. Due to its triangular shape, this spreadsheet is unique, unlike any other in the book. The formulas in the cells of the triangle are essentially alike. Instead of laboriously and tediously entering similar formulas in different cells, you should use this exercise to acquaint yourself with the use and utility of the spreadsheet copy command (see chap. 1).
General Instructions
1. Let columns represent different numbers of trials. Use values beginning with 1 and continuing through 10. Label the columns appropriately.
2. Indicate the number of simple outcomes that would occur for each number of trials.
3. Normally Pascal's triangle is represented with the top of the triangle pointing up. In this case, lay the triangle on its side, letting the "top" point to the left. The 1-trial column will have two entries (because there are two outcome classes), the 2-trials column will have three entries (because there are three outcome classes), and so forth. Entries in each column will be staggered, which means that no cell will have an entry immediately to its left or right. Instead, entries will be located on what are, in effect, diagonal lines.
4. Enter formulas that will generate the correct values for Pascal's triangle. Remember that the value of a target entry is the sum of the two entries to the left, the one diagonally above and the one diagonally below the target entry. Once all formulas are correctly entered (which can be done by copying formulas to the appropriate cells), enter the "seed value" of 1 in the apex or top cell of the triangle. Observe the way the values for Pascal's triangle ripple through your spreadsheet.
5. Sum the entries (number of simple outcomes per outcome class) in each column of the triangle. The sum of the entries in each column should be exactly the same as the number of simple outcomes you computed previously in step 2. Why?
6. The numbers in the 10-trials column represent the number of outcomes per outcome class (0 heads, 1 head, 2 heads, ..., 10 heads). Compute the probability for each outcome class.
7. Sum the probabilities for these 11 outcome classes. This sum must be 1. Why?
Detailed Instructions
1. We want to display columns A-L, so set the default column width to 5. Set the width for column L to 8.
2. Label cell A1 "N" for number of trials. Label cell A2 "N + 1" for number of classes. Label cell A3 (and A27) "2^N" for number of outcomes.
3. Put the formula "=B1+1" in cell C1. (This is the Excel convention. Other spreadsheets may have different conventions for entering formulas.) This causes C1 to display a number one larger than the number in the cell immediately to its left. Copy this formula from cell C1 to cells D1 through K1. Now put the value 1 in cell B1 and watch the results "ripple" through the first row. Cells B1-K1 should now display the values 1, 2, 3, ..., 10.
4. Put the formula "=B1+1" in cell B2. This causes B2 to display a number one greater than the value contained in B1. Copy this formula from cell B2 to cells C2 through K2.
5. Put the formula "=B3*2" in cell C3. This causes C3 to display a number double the number in the cell immediately to its left. Copy this formula from cell C3 to cells D3 through K3. Now "seed" cell B3 with the value 2. Cells C3-K3 should now display the values 4, 8, 16, ..., 1024.
6. Put the formula "=A13+A15" in cell B14. This causes cell B14 to display the sum of the cells diagonally up to the left and diagonally down to the left. Copy this formula to cell B16. When formulas are copied, the cells pointed to are relative to the cell copied. Therefore cell B16 will display the sum of A15 and A17. And because no values have yet been put in these A-column cells, B14 and B16 will, for now at least, display zeros.
7. Copy the "sum of up and down left diagonal cells" formula to cells C13, C15, C17 (omitting C14 and C16); to D12, D14, D16, D18 (omitting D13, D15, D17); to E11 through E19 (omitting even rows); to F10 through F20 (omitting odd rows); ...; to K5 through K25 (omitting even rows). At this point, your spreadsheet should display a triangle of zeros. The triangle is lying on its side. Its base spans cells K5 through K25 and its apex, cell A15, has yet to be filled in.
8. Put the value 1 in cell A15. Observe the way the values for Pascal's triangle ripple through your spreadsheet.
9. Put a formula for the sum of cells B5-B25 in cell B27. Copy this formula to cells C27-L27. Rows 3 and 27 should now display the same values. Why?
10. In cells L5, L7, L9, ..., L25, put a formula for dividing the cell immediately to the left by cell K27 (absolute). This gives the probabilities for each outcome class when there are 10 trials. (You may want to format column L to have four decimal places.) The value in cell L27 should be 1. Why?
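For readers who would like to check their spreadsheet against an independent calculation, the same "sum of the two neighboring entries" rule can be sketched in code. This is only an illustrative sketch in Python, not a replacement for the spreadsheet exercise; the function name `pascal_rows` is hypothetical. Each list corresponds to one column of the spreadsheet (one number of trials), and the seed value 1 plays the role of cell A15.

```python
def pascal_rows(max_trials):
    """Return {N: list of coefficients} for N = 1..max_trials."""
    rows = {}
    previous = [1]  # the seed value at the apex of the triangle
    for n in range(1, max_trials + 1):
        # Pad with zeros so that every new entry is the sum of the two
        # entries diagonally adjacent to it in the previous column.
        padded = [0] + previous + [0]
        current = [padded[i] + padded[i + 1] for i in range(len(padded) - 1)]
        rows[n] = current
        previous = current
    return rows

rows = pascal_rows(10)

# Each column should have N + 1 entries that sum to 2^N.
for n, coefficients in rows.items():
    print(n, len(coefficients), sum(coefficients), coefficients)

# Probabilities for the 11 outcome classes when there are 10 trials.
probabilities = [c / 2 ** 10 for c in rows[10]]
print([round(p, 4) for p in probabilities])
```

The padding with zeros mirrors the empty cells around the triangle in the spreadsheet, which contribute nothing to the sums.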
At this point your spreadsheet should look like the one displayed in Fig. 3.3. Each column provides you with the number of outcomes in each outcome class for a particular number of trials. These are the binomial coefficients for that number of trials. For this exercise, you stopped at 10 trials, but in principle you could keep generating more of the triangle forever, simply by copying columns to the right. In practice, of course, you would be limited by the maximum dimensions allowed by your spreadsheet program. But in theory, you should now understand how to generate the number of outcomes in each outcome class for a binomial sampling distribution involving any number of trials.
For example, given five trials and a fair coin, your spreadsheet should indicate there are 32 different outcomes and six outcome classes (column F in Fig. 3.3). The binomial coefficients in this column (1, 5, 10, 10, 5, 1) represent the number of outcomes in each class; thus there is 1 outcome that contains zero heads, 5 that contain one head, 10 for two heads, 10 again for three heads, 5 for four heads, and 1 for all five heads. The probability, given a fair coin (P = .5), of getting five heads is 1/32 (.03125), of getting four heads is 5/32 (.15625), of getting four or more heads is 6/32 (.1875 or .15625 + .03125), and of getting either no heads or all heads is 2/32 (.0625 or .03125 + .03125). This sampling distribution for five trials is displayed graphically on the left side of Fig. 3.4.

[Spreadsheet for Fig. 3.3, listed here with one line per spreadsheet column rather than as a triangle lying on its side. In the spreadsheet, row 1 holds N, row 2 holds N + 1, rows 3 and 27 hold 2^N, columns B through K hold the coefficients in rows 5 through 25, and column L holds the probabilities for 10 trials.]

Column  N   N+1  2^N   Outcomes per outcome class (0 heads, 1 head, ..., N heads)
B       1   2    2     1 1
C       2   3    4     1 2 1
D       3   4    8     1 3 3 1
E       4   5    16    1 4 6 4 1
F       5   6    32    1 5 10 10 5 1
G       6   7    64    1 6 15 20 15 6 1
H       7   8    128   1 7 21 35 35 21 7 1
I       8   9    256   1 8 28 56 70 56 28 8 1
J       9   10   512   1 9 36 84 126 126 84 36 9 1
K       10  11   1024  1 10 45 120 210 252 210 120 45 10 1

Column L (probabilities for 10 trials): 0.0010, 0.0098, 0.0439, 0.1172, 0.2051, 0.2461, 0.2051, 0.1172, 0.0439, 0.0098, 0.0010 (sum = 1)

FIG. 3.3. Pascal's triangle: number of outcomes in each outcome class for 1 to 10 trials.

Exercise 3.3 Binomial Probabilities
The problems in this exercise refer to Fig. 3.3. They provide practice in computing theoretical probabilities using the binomial.
1. The sampling distribution given a fair coin for six tosses is portrayed on the right side of Fig. 3.4. (The coefficients are listed in column G of Fig. 3.3.) What is the probability of getting all heads? Of getting either all heads or all tails? Of getting five or more heads? Of getting either two or three or four heads? Of getting five or more all the same (either five or more heads or five or more tails)?
2. Draw a sampling distribution for eight tosses. What is the probability of eight heads? Of either eight heads or eight tails? Of seven or more heads? Of six or more heads? Of seven or more all the same? Of six or more all the same?
3. Given N trials, there are N ways of getting just one head. Why does this make sense?
4. If your spreadsheet program can produce graphs, use it to graph the binomial sampling distributions for 8, 9, and 10 trials. You may want to produce both bar and line graph versions. Which is more correct? Why?
FIG. 3.4. Binomial sampling distributions for five (left) and six (right) tosses, given a fair coin.
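The five-toss probabilities quoted earlier (1/32, 5/32, 6/32, and 2/32) can be reproduced from the binomial coefficients. The following is a minimal sketch in Python, assuming a fair coin and Python 3.8 or later (for `math.comb`, which returns the same coefficients as Pascal's triangle). The function names `fair_coin_probability` and `at_least` are hypothetical.

```python
from fractions import Fraction
from math import comb

def fair_coin_probability(n_tosses, heads):
    """P(exactly `heads` heads in n_tosses) when P = .5."""
    return Fraction(comb(n_tosses, heads), 2 ** n_tosses)

def at_least(n_tosses, heads):
    """P(`heads` or more heads in n_tosses): sum over the outcome classes in the tail."""
    return sum(fair_coin_probability(n_tosses, k) for k in range(heads, n_tosses + 1))

p_five = fair_coin_probability(5, 5)
p_four = fair_coin_probability(5, 4)
p_four_or_more = at_least(5, 4)
p_none_or_all = fair_coin_probability(5, 0) + fair_coin_probability(5, 5)

for label, p in [("five heads", p_five), ("four heads", p_four),
                 ("four or more heads", p_four_or_more),
                 ("no heads or all heads", p_none_or_all)]:
    print(f"{label}: {p} = {float(p)}")
```

The same two functions can be used to check answers to Exercise 3.3 after working them by hand.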
Binomial Parameters
The binomial distribution can be thought of as a family of possible distributions, only two of which are displayed in Fig. 3.4. The binomial is often specified with three parameters: P is the probability of the first outcome, Q the probability of the second outcome, and N is the number of trials. However, only two parameters are necessary. Because Q = 1 - P necessarily (the probability of the second outcome must be one minus the probability of the first outcome), a particular instance of the binomial is specified completely by P and N. In other words, two parameters are associated with the binomial distribution. Given values for these parameters, a particular member of the family can be generated.
The examples given in this chapter have all assumed P = .5 because this is the usual case. But imagine that P = .2 and N = 3. If according to our null hypothesis the probability of a head is .2 (and thus the probability of a tail is .8), then how likely are we to toss three heads, three tails, or any of the other possible outcomes? The probabilities for each of the eight possible outcomes are:
P(TTT) = P(T) x P(T) x P(T) = .8 x .8 x .8 = .512
P(TTH) = P(T) x P(T) x P(H) = .8 x .8 x .2 = .128
P(THT) = P(T) x P(H) x P(T) = .8 x .2 x .8 = .128
P(THH) = P(T) x P(H) x P(H) = .8 x .2 x .2 = .032
P(HTT) = P(H) x P(T) x P(T) = .2 x .8 x .8 = .128
P(HTH) = P(H) x P(T) x P(H) = .2 x .8 x .2 = .032
P(HHT) = P(H) x P(H) x P(T) = .2 x .2 x .8 = .032
P(HHH) = P(H) x P(H) x P(H) = .2 x .2 x .2 = .008
Hence the probabilities for the four outcome classes are:
P(0 heads) = P(TTT) = 1 x .512 = .512
P(1 head) = P(TTH) + P(THT) + P(HTT) = 3 x .128 = .384
P(2 heads) = P(THH) + P(HTH) + P(HHT) = 3 x .032 = .096
P(3 heads) = P(HHH) = 1 x .008 = .008
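The same figures can be reproduced programmatically. The following is only a minimal sketch in Python, not part of the original spreadsheet approach; the function names outcome_probabilities and class_probabilities are illustrative assumptions. It multiplies probabilities along each sequence of tosses and then collapses the sequences into outcome classes using binomial coefficients.

```python
from itertools import product
from math import comb, isclose

def outcome_probabilities(p_head, n_trials):
    """Probability of every ordered sequence of tosses (e.g., 'TTH').

    Each sequence's probability is the product of the probabilities
    of its individual tosses.
    """
    probs = {}
    for seq in product("TH", repeat=n_trials):
        prob = 1.0
        for toss in seq:
            prob *= p_head if toss == "H" else 1.0 - p_head
        probs["".join(seq)] = prob
    return probs

def class_probabilities(p_head, n_trials):
    """Probability of each outcome class: 0, 1, ..., N heads."""
    return [
        comb(n_trials, k) * p_head**k * (1.0 - p_head) ** (n_trials - k)
        for k in range(n_trials + 1)
    ]

outcomes = outcome_probabilities(0.2, 3)
classes = class_probabilities(0.2, 3)
for k, prob in enumerate(classes):
    print(f"P({k} heads) = {prob:.3f}")
```

Changing the two arguments (P and N) generates any other member of the binomial family.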
Note that the probabilities for the eight outcomes and for the four outcome classes add to one, as they must. Note also that you are far more likely to toss three tails than three heads, assuming that the true probability of tossing one head is .2. Given this one example, you should be able to generate binomial distributions for various values of P and N (for additional examples and formulas, see Loftus & Loftus, 1988). More to the point, you should have a sense for how binomial distributions with various values for their P and N parameters are generated in the first place.

3.2 THE SIGN TEST
This simple and common statistical test is called the sign test because, as commonly presented, it requires that a sign (usually a + or - sign, as in Fig. 2.1) be assigned to subjects on the basis of some assessment procedure. It is based on the binomial distribution; consequently, it is also called the binomial test. It is useful for determining if there are more (or fewer) plus signs than would be expected if subjects were sampled from a population in which the probability of assigning a + were some specified value, usually .5. Consider the money-cure study. There were 10 subjects in the sample, and 8 improved. The null hypothesis purports that subjects were as likely to improve as
not (which is analogous to assuming that a coin is fair) and thus the values for the relevant population parameters are P = .5 and N = 10. This would generate the sampling distribution with binomial coefficients as shown in column K, and probabilities as shown in column L, of the spreadsheet portrayed in Fig. 3.3.
One-Tailed and Two-Tailed Tests For the sign test (and for certain other statistical tests too), it is important to distinguish between one- and two-tailed tests. If a coin is fair, we would expect the number of heads to be one-half the number of trials— a value in the center of the distribution. Any extreme score for the test statistic— for example, one head and nine tails or nine heads and one tail— would suggest that the coin might not be fair. Note that extreme values fall at either the left or the right ends of the distribution (see, e.g., Fig. 3.4), under one or the other of the distribution's two "tails." If we reject the null hypothesis whenever the test statistic is either considerably less or considerably more than expected, then the test is called nondirectional or two-tailed. Symbolically:
H0: P = .5
H1: P ≠ .5
On the other hand, if we think that only more heads than expected, or more clients improving than expected, is of interest, and so choose to reject the null hypothesis only when values of the test statistic fall at one end of the distribution but not the other, then the test is called directional or one-tailed. Symbolically:
H0: P ≤ .5
H1: P > .5
Note that null and alternative hypotheses are formulated somewhat differently for one- and two-tailed tests. For example, based on the values shown in Fig. 3.3 for N = 10 trials and P = .5:
1. The probability of all heads is .001 (one-tailed) but the probability of either all heads or all tails is .002 (two-tailed).
2. The probability of nine or more heads is .011 (one-tailed) but the probability of nine or more heads or nine or more tails is .021 (two-tailed).
3. The probability of eight or more heads is .055 (one-tailed) but the probability of eight or more heads or eight or more tails is .109 (two-tailed).
These probabilities were computed using the or rule, which means that individual probabilities were summed. Thus .001 + .010 + .044 = .055. If you try to recreate these values on a hand-held calculator from those displayed in Fig. 3.2, and you round at intermediate steps, your answers may vary slightly from those given here. Numeric results given here and throughout the text were computed with a spreadsheet and intermediate results were not rounded. Only the final scores are rounded and then usually three significant digits are retained.
If the alpha level were .05, if the test were two-tailed, and if the value of the test statistic were 0, 1, 9, or 10, we would reject the null hypothesis. These outcomes, which are termed critical values and constitute the region of rejection or the critical region (see chap. 2), would occur 2.1% of the time or less by chance
alone. If the alpha level were again .05, but the test one-tailed, again we would reject if the value of the test statistic were 9 or 10. (Arbitrarily, and just as a convenient convention, if a test is one-tailed we assume values that might cause rejection are in the right-hand tail unless explicitly stated otherwise.) These outcomes would occur 1.1% of the time by chance alone, whereas values of 8 or higher would occur 5.5% of the time—just a little too often for a value of 8 to be included in the rejection region for an alpha level of .05, one-tailed. Although it is true that the probability of tossing exactly 8 heads with a fair coin is .044, regions of rejection are constructed by accumulating probabilities from the extremities of the tails inward. If the alpha level were .10, however, critical values for a one-tailed test would be 8, 9, or 10. These outcomes would occur 5.5% of the time by chance alone. Critical values for a two-tailed test would be 0, 1, 9, or 10. These outcomes would occur 2.1% of the time, as noted earlier. The critical values would not include 2 and 8. The probability of tossing 8 or more heads or 8 or more tails is .109, which is too high for values of 2 and 8 to be included in the critical region for an alpha level of .10, two-tailed. Finally we can answer the question posed at the beginning of chapter 2: Is 8 out of 10 clients improving worthy of attention? If our alpha level were .05, it would not matter whether the test was one- or two-tailed—we would not reject the null hypothesis. However, if our alpha level were .10, we would reject for a one- but not for a two-tailed test. A test statistic as high as 8 (one-tailed) would occur 5.5%, and a test statistic as extreme as 8 (two-tailed) would occur 10.9% of the time if the null hypothesis were true. This is not strong enough evidence to reject the null hypothesis at an alpha level of .05, one- or two-tailed, or at an alpha level of .10, two-tailed. 
But a value of 8 falls in the critical region for an alpha = .10, one-tailed sign test. Note, however, that an alpha level of .10 is used in this paragraph largely for illustrative purposes. In research reports published in scientific journals, alpha levels are almost always .05 or less.

Exercise 3.4 Critical Values for the Sign Test
This exercise provides practice in determining critical values for the sign test.
1. Verify that all probabilities and percentages given in the previous several paragraphs are correct (if you were not already doing this).
2. Using Pascal's triangle, compute the number of outcomes in each class, along with their associated probabilities, for N = 11 and N = 12.
3. Construct a table of critical values for the sign test. (A critical value is one that causes you to reject the null hypothesis.) Include nine rows, one each for number of trials = 4 through 12. Include four columns, two for one-tailed and two for two-tailed tests. Within each kind of test, include a column for alpha = .05 and for alpha = .10. Indicate with an "x" combinations for which no values exist that would justify rejecting the null hypothesis. For example, the N = 10 row would look like this:

        One-tailed                  Two-tailed
N       alpha = .05   alpha = .10   alpha = .05   alpha = .10
...
10      9-10          8-10          0-1, 9-10     0-1, 9-10
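The N = 10 row can be checked with a short program. What follows is a sketch only, assuming a fair-coin null hypothesis (P = .5) and the convention described earlier of accumulating probabilities from the extremities of the tails inward; the function name critical_values is an illustrative assumption, not something defined in the text.

```python
from math import comb

def critical_values(n_trials, alpha, tails):
    """Critical values for the sign test when P = .5.

    Probabilities are accumulated from the tail extremities inward.
    For a one-tailed test only the right-hand tail is used; for a
    two-tailed test values are added in symmetric pairs (k and N - k).
    Returns the sorted critical values and their combined probability.
    """
    probs = [comb(n_trials, k) * 0.5**n_trials for k in range(n_trials + 1)]
    critical, total = [], 0.0
    # Walk from the right extreme toward, but not into, the center.
    for k in range(n_trials, n_trials // 2, -1):
        step = probs[k] if tails == 1 else probs[k] + probs[n_trials - k]
        if total + step > alpha:  # combined probability may not exceed alpha
            break
        total += step
        critical.append(k)
        if tails == 2:
            critical.append(n_trials - k)
    return sorted(critical), total

for tails in (1, 2):
    for alpha in (0.05, 0.10):
        values, prob = critical_values(10, alpha, tails)
        print(f"tails={tails} alpha={alpha:.2f} critical={values} p={prob:.3f}")
```

An empty list corresponds to an "x" in the table, that is, no value of the test statistic would justify rejecting the null hypothesis.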
Using the Sign Test
The table you have just constructed (Exercise 3.4, part 3) works as long as there are no more than 12 subjects in a study. But if the need arose, you know enough so that, in principle at least, you could lengthen this table. More importantly, the process of computing values for this table should have helped you gain insight into how such tables of critical values for sampling distributions are constructed in general. In practice, of course, investigators rarely compute critical values. Instead, they consult tables constructed by others and indeed such a table is provided here for the sign test (see Table A in the statistical tables section).
When contemplating a statistical test, the first step is to decide which test is appropriate. The sign test is appropriate if a binomial (literally, bi = two, nomin = name) code can be meaningfully assigned to each subject (e.g., assigning each subject a + or a - based on some assessment procedure) and if the assessments are made independently so that the outcome of one cannot affect the outcomes of others. Assuming that a sign test is appropriate, the next steps require that you commit to an alpha level, decide on an appropriate null and alternative hypothesis (which will imply certain values for the relevant population parameters), and (as a related matter) decide whether a one- or two-tailed test is appropriate. The final step is the statistical decision. From the appropriate table, determine which values of the test statistic would occur rarely (their combined probabilities cannot exceed your preselected alpha level) if your null hypothesis were true. If the test statistic derived from your sample is one of those rare values (i.e., if it falls in the critical region) then you reject the null and necessarily accept the alternative hypothesis.

Exercise 3.5 Applying the Sign Test
This exercise provides practice in applying the sign test. You will need to refer to Table A in the Statistical Tables section.
1. An expanded study testing the efficacy of the money cure contains 40 subjects. By chance, you expect 20 to respond to treatment, but in fact 30 do respond. You decide beforehand that you will reject the null hypothesis that treatment has no effect only if more subjects improved than expected and only if the specific result would occur less than 5% of the time if the null hypothesis were in fact true. Do you reject the null hypothesis? Why or why not?
2. What if only 26 improved? Would you reject the null hypothesis? Why or why not?
3. What if 30 improved, but you had decided beforehand to reject the null hypothesis if the number of subjects was different from expected, either more improving or more worsening than expected? Keeping the alpha level at .05, would you reject the null hypothesis now? Why or why not? Given these same circumstances, what would your decision be if 26 improved? Explain your decision.
4. Given 50 subjects, an alpha level of .05, and a two-tailed test, what is the smallest number of subjects who, even if they all improved, would not allow you to reject the null hypothesis? What is the largest number of subjects who, even if they all improved, would not allow you to reject the null hypothesis?
5. Given 50 subjects, an alpha level of .05, but a one-tailed test, what is the largest number who, even if they all improved, would not allow you to reject the null hypothesis? What is the smallest number that would allow you to reject the null hypothesis?
6. Now answer question 4, but for an alpha level of .01.
7. Now answer question 5, but for an alpha level of .01.

At this point you are able to use a simple test, the sign test, and the logic of hypothesis testing in order to answer research questions like the one posed at the beginning of chapter 2: How many patients would need to get better before we could claim statistical significance for the result? The next exercise will instruct you in how to conduct a sign test using SPSS.

Exercise 3.6 The Sign Test in SPSS
This exercise provides practice in using the sign test in SPSS.
1. Invoke SPSS and create two new variables in the Variable View window. Name the first variable "outcome" and the second variable "freq." Set the number of decimal places for both variables to 0.
2. Provide appropriate value labels for the outcome variable. In row one, click on the values cell and then click again on the grey box that appears to the right of the cell. This will open the Value Labels dialog box. Enter "1" and "improved" in the Value and Value Label windows, respectively. Then click on Add to enter the information in the bottom window. Now do the same, but this time enter "0" and "no improvement". After you enter labels for "0" and "1", click on OK. Now, the labels improved and no improvement will appear in the SPSS output, instead of the rather cryptic codes 1 and 0.
3. Weight the cases by the frequency variable. Select Data->Weight Cases from the main menu. In the Weight Cases dialog box, select the Weight cases by radio button and move the freq variable to the Frequency variable window. Click on OK. This command tells SPSS to weight each value for outcome (i.e., improved or no improvement) by the number indicated in the freq variable. You could simply enter a 1 or 0 for each case in the outcome column, but doing so would become tedious when N is large.
4. In Data View enter a 1 in the first row of the outcome column and a 0 in the second row. In the freq column enter 30 in the first row and 10 in the second row. Because you weighted the cases by the freq variable in step 3, this is the same as entering 30 ones and 10 zeros in the outcome column.
5. Select Analyze->Nonparametric Tests->Binomial from the main menu. Move the outcome variable to the Test Variable List box. Note that the Test Proportion window reads .50 by default. This indicates that you expect 50% of the outcomes to be positive. Click on OK.
6. Look at the output. Is the N correct? Look at the Observed Prop. column. What proportion improved? What was the expected proportion based on the Test Prop. column? Now look at the final column labeled Asymp. Sig (two-tailed). This tells you the probability of finding a result at least as extreme as 30 out of 40 positive outcomes if you expected only 20. If the value is less than alpha, then you reject the null hypothesis that treatment has no effect.
7. Change the values in the freq column to reflect that 26 improved and 14 did not. Would you reject the null hypothesis with an alpha of .05?
8. What if 30 improved, but you expected 80% to improve? Keeping the alpha level at .05, would you reject the null hypothesis?

You are now able to conduct a simple sign test by hand and using SPSS. In subsequent chapters other tests for statistical significance are described (those based on the normal, t, and F distributions instead of the binomial), but first some matters of basic terminology (chap. 4) and descriptive statistics (chap. 5) need to be discussed.
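For readers without access to SPSS, the exact probabilities behind a sign test can also be computed directly. The sketch below is an assumption-laden illustration rather than a replacement for Table A or SPSS: it handles only the fair-coin null hypothesis (P = .5), the name sign_test_p is hypothetical, and the two-tailed value is obtained by doubling the smaller tail, which matches the way two-tailed probabilities were formed earlier in this chapter.

```python
from math import comb

def sign_test_p(n_plus, n_total, two_tailed=False):
    """Exact sign-test probability under the null hypothesis P = .5.

    One-tailed: probability of n_plus or more plus signs (right tail).
    Two-tailed: twice the smaller tail, capped at 1.
    """
    probs = [comb(n_total, k) * 0.5**n_total for k in range(n_total + 1)]
    upper = sum(probs[n_plus:])        # n_plus or more
    if not two_tailed:
        return upper
    lower = sum(probs[: n_plus + 1])   # n_plus or fewer
    return min(1.0, 2.0 * min(upper, lower))

# The money-cure study: 8 of 10 improved.
print(f"{sign_test_p(8, 10):.3f}")                   # one-tailed
print(f"{sign_test_p(8, 10, two_tailed=True):.3f}")  # two-tailed

# The expanded study: 30 of 40, then 26 of 40.
for improved in (30, 26):
    one = sign_test_p(improved, 40)
    two = sign_test_p(improved, 40, two_tailed=True)
    print(f"{improved}/40: one-tailed {one:.4f}, two-tailed {two:.4f}")
```

Comparing each printed probability with the chosen alpha level gives the statistical decision. Values reported by statistical packages may differ slightly if an approximation rather than the exact binomial is used.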
Note 3.1 Critical values
Values of a test statistic that would occur 5% of the time or less (assuming an alpha level of .05) if the null hypothesis were true. Depending on the test statistic used and how the null hypothesis is stated, these values could fall only at the extreme end of one tail of a distribution or at the extreme ends of both tails. If a test statistic assumes a critical value, the null hypothesis is rejected. For that reason, critical values define the critical region or the region of rejection of a sampling distribution.
4
Measuring Variables: Some Basic Vocabulary
In this chapter you will:
1. Learn the difference between qualitative and quantitative variables.
2. Learn the difference between nominal, ordinal, interval, and ratio scales.
3. Learn how to describe a study design in terms of independent and dependent variables.
4. Begin to learn how to match study designs with statistical procedures.

In talking with others, whether describing your study to fellow researchers or attempting to gain statistical advice, it is important to use relatively standard, clear vocabulary. This chapter describes some basic terms and concepts used to talk about research studies. Some of these terms deal with the variables under study and how they are measured, and others deal with how the variables are related to each other in the study design.

4.1 SCALES OF MEASUREMENT
All studies, including simple studies like the money-cure study used as a running example in chapters 2 and 3, incorporate one or more variables. The variables (so named because, unlike constants, they can take on various values at different times) identify the concepts the investigator thinks important for a given study. The variable of interest for the money-cure study was outcome: Did patients get better or not? The design included only one kind of treatment (and so this was constant), but an expanded study might well have defined two different kinds or levels of treatment (e.g., the money cure and a standard talk cure, or the money cure and no treatment). In this case, two variables would be identified, type of treatment and outcome, and the research question would involve the connection, if any, between the two.
An important attribute of variables is their scale of measurement. Knowing the kind of scale used to assign values to variables helps the investigator decide which statistical procedures are appropriate. Following a scheme first introduced by Stevens (1946), four kinds of scales are identified:
1. Nominal (or categorical).
2. Ordinal.
3. Interval.
4. Ratio.
Nominal or Categorical Scales
Nominal variables assume named values, that is, their permissible values are qualitative instead of quantitative. And although variables can be assigned different categories or levels, there is no inherent reason to order the categories in any particular way. Sex (female/male), religion (Catholic/Jew/Muslim/Protestant/Other), and outcome (improved/did not improve) are examples.
Ordinal Scales
Values for ordinal variables are also named. In other words, measurement results in assigning a particular name, level, or category to the unit (subject, family, trial, etc.) under consideration. But unlike nominal variables, for ordinal variables there is some rationale for ordering the categories in a particular way. Examples are freshman/sophomore/junior/senior and first/second/third. There is no reason, however, to assume that the distance between the first and second categories is in any way equivalent to the distance between the second and third, or between any other adjacent pair of categories. In sum, ordinal-scale values represent degree, whereas nominal-scale values represent only kind. Ordinal data can be regarded as quantitative (in a relatively loose sense), but nominal data remain forever qualitative.
Interval and Ratio Scales
Like ordinal scales, both interval and ratio scales reflect an underlying and ordered continuum, but unlike ordinal, these scales are characterized by equal intervals. That is, an interval anywhere on the scale (e.g., the interval between 4 and 5) is assumed equivalent to any other interval (e.g., the one between 12 and 13). Interval and ratio scales differ in only one respect. For an interval scale the placement of zero is arbitrary, whereas for a ratio scale zero indicates truly none of the quantity measured. Thus if the ratio of two numbers is 3:1, the first value represents three times as much of the quantity measured only for ratio data, not for interval data. Examples of interval scales are IQ, scores on most attitude and personality tests, and temperature measured on the Fahrenheit or Celsius scales. Examples of ratio scales are age, weight, and temperature measured on the Kelvin scale (for which zero implies no molecular movement).
The scale of measurement can itself be viewed as a nominal (or, arguably, ordinal) variable. It can assume any of four values and the value for a particular case can be determined by asking, at most, three yes/no questions (later on, in chap. 10, we will learn to say that three degrees of freedom are associated with the four categories). These questions or distinctions, and their permissible values, can be represented with a tree diagram (see Fig. 4.1).
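The idea that three yes/no questions suffice to identify the scale can be sketched as a small decision function. This is an illustration only: the three questions used here (Is there a rationale for ordering the values? Are the intervals equal? Does zero mean truly none?) are inferred from the definitions in the text rather than copied from the tree diagram, and the function name is hypothetical.

```python
def scale_of_measurement(ordered, equal_intervals=False, true_zero=False):
    """Classify a variable's scale by asking at most three yes/no questions."""
    if not ordered:            # Question 1: any rationale for ordering values?
        return "nominal"
    if not equal_intervals:    # Question 2: are intervals equal everywhere?
        return "ordinal"
    if not true_zero:          # Question 3: does zero mean truly none?
        return "interval"
    return "ratio"

# Examples taken from the text.
print(scale_of_measurement(ordered=False))                    # religion
print(scale_of_measurement(ordered=True))                     # class year
print(scale_of_measurement(True, equal_intervals=True))       # Fahrenheit
print(scale_of_measurement(True, True, true_zero=True))       # Kelvin
```

Note that a nominal variable is settled after one question, whereas distinguishing interval from ratio requires all three.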
FIG. 4.1. Tree diagram for the four types of measurement scales as represented by three binary questions.

In practice, the distinction between nominal (qualitative) and other (quantitative) scales of measurement is more important than the distinctions among ordinal, interval, and ratio scales. For most statistical procedures, the difference between interval and ratio variables is not consequential. In addition, under many circumstances it is regarded as acceptable to apply statistical procedures that were developed for equal-interval data to ordinal data. However, it is never acceptable to apply statistical procedures designed for equal-interval data to any numbers arbitrarily associated with nominal categories (with the possible exception of binary categorical variables, if one category can be regarded as representing more of some attribute than the other).
Note 4.1
Nominal
The levels of a nominal or categorical scale are category names, like Catholic | Jew | Muslim | Other for religion or male | female for sex. Their order is arbitrary; there is no obviously correct way to order the names.
Ordinal
The values or levels of an ordinal scale are named but also have an obvious order, like first | second | third or freshman | sophomore | junior | senior. However, there is no obvious way to quantify the distance between levels or ranks.
Interval
The intervals of an interval scale are equal, no matter where they fall on the measurement continuum. The placement of zero, however, is arbitrary, like zero degrees Fahrenheit.
Ratio
The intervals of a ratio scale are also equal, but zero indicates truly none of the quantity, like zero degrees Kelvin. Thus the ratio of two numbers is meaningful for numbers measured on a ratio but not an interval scale.
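A brief numerical illustration may make the interval versus ratio distinction concrete. The sketch below is not from the original text; the temperatures and weights are arbitrary example values, and the conversion factor for pounds is approximate. It shows that a ratio computed on an interval scale changes when the arbitrary zero changes, whereas a ratio computed on a ratio scale survives a change of units.

```python
from math import isclose

def celsius_to_fahrenheit(c):
    return c * 9.0 / 5.0 + 32.0

def celsius_to_kelvin(c):
    return c + 273.15

# Interval scales: the "3 to 1" ratio depends on where zero is placed.
warm_c, cool_c = 30.0, 10.0
ratio_celsius = warm_c / cool_c
ratio_fahrenheit = celsius_to_fahrenheit(warm_c) / celsius_to_fahrenheit(cool_c)
# Kelvin has a true zero, so only this ratio describes the quantity itself.
ratio_kelvin = celsius_to_kelvin(warm_c) / celsius_to_kelvin(cool_c)

# Ratio scale: changing units (kg to lb) leaves the ratio intact.
KG_TO_LB = 2.20462  # approximate conversion factor
heavy_kg, light_kg = 90.0, 30.0
ratio_kg = heavy_kg / light_kg
ratio_lb = (heavy_kg * KG_TO_LB) / (light_kg * KG_TO_LB)

print(f"Celsius {ratio_celsius:.2f}, Fahrenheit {ratio_fahrenheit:.2f}, "
      f"Kelvin {ratio_kelvin:.2f}")
print(f"Kilograms {ratio_kg:.2f}, pounds {ratio_lb:.2f}")
```

Thirty degrees Celsius is therefore not "three times as hot" as ten degrees in any meaningful sense, but 90 kg is three times as heavy as 30 kg regardless of the unit used.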
4.2 DESIGNING A STUDY: INDEPENDENT AND DEPENDENT VARIABLES
In addition to the distinction between qualitative and quantitative variables, a second distinction, likewise important because it helps the investigator decide which statistical procedures are appropriate, is between independent and dependent variables. Whether a variable is regarded as independent or dependent depends on the thinking of the investigator. Thus, a particular variable might be viewed as independent in one study but dependent in another. The dependent variable (also called the criterion or response variable) identifies the variable the investigator wants to account for or explain, whereas the independent variable (also called the predictor or explanatory variable) identifies the variable (or variables) that the investigator thinks may account for, or affect, the different values the dependent variable assumes.
Strictly speaking, it is not always necessary that variables in a study be segregated into these two classes. In some cases an investigator may be interested only in the association among a set of variables and may not think of any variables as prior in some way to others. More typically, however, investigators are concerned with explanation. Reflecting this concern, the studies used as examples in this book all contain variables that can be regarded either as independent (i.e., logically prior and so independent of other variables named in the study) or as dependent (i.e., their values presumably depend on values for the independent variable or variables).
The distinction between independent and dependent variables is more important than the actual words used to express it. We could just as well refer to explanatory and response variables instead. These terms have the merit of being more general but are less used (except when discussing log-linear analyses).
Or, we could refer to predictor and criterion variables, which are terms commonly used in the multiple regression literature. Still, the terms independent variable (IV) and dependent variable (DV) have the merit of being the most widely used— and, as used by most writers, no longer have the once exclusively experimental connotation.
Experimental and Observational Studies
In traditional usage the terms independent variable and dependent variable have been reserved for true experimental studies. That is, values or levels (usually called treatment conditions) for the independent variable would be determined (manipulated) by the experimenter, subjects would be randomly assigned to a particular level or condition, and then the subjects' scores for the dependent variable would be measured. As students in introductory statistics courses learn, experimental studies support causal conclusions in a different way than merely correlational or observational studies. In correlational or observational studies, variation among variables is merely observed; there is no manipulation of independent variables and no random assignment of subjects to treatments. As a result, if some association is observed between independent and dependent variables, there is always the suspicion that their covariation reflects, not a causal link from one variable to the other, but the influence of a third variable, unmeasured in the study, that affects them both. In an experimental study, on the other hand, because values for any possibly influential third variables should be distributed by random assignment equally among the experimental conditions, the third variable alternative to a causal interpretation is rendered unlikely.
Note 4.2
Dependent
The variable a researcher wants to explain or account for is called the dependent variable or DV. It is also called the criterion variable or the response variable.
Independent
Variables a researcher thinks account for or affect the values of the dependent variable are called independent variables or IVs. They are also called predictor variables or explanatory variables.
In contemporary usage, the terms independent and dependent variable indicate not necessarily experimental manipulation, but simply how investigators conceptualize relations among their variables. Such usage, however, should not lull us into making stronger causal claims than our procedures justify. In general, only experimental studies can justify strong causal claims.

4.3 MATCHING STUDY DESIGNS WITH STATISTICAL PROCEDURES
The qualitative versus quantitative and independent versus dependent distinctions are used when describing a study's design. The question "What is the design of your study?" means: What variables do you think are important for this study? Which do you regard as dependent, which independent? What is their scale of measurement? Answering these questions, along with one or two more involving how independent variables are related and whether or not dependent variables involve repeated measurements, both specifies the design and aids in selecting statistical procedures appropriate for that design. In the following paragraphs, a number of common design considerations are described and appropriate statistical procedures named. Almost all of the remaining chapters in this book then describe how the more basic of these procedures are put into practice.
For example, the two variables identified as important for the money-cure study were treatment and outcome. Outcome is viewed as the dependent variable, treatment as the independent variable. Both are nominal. There were two values for outcome: improved or not. However, there was only one value for the treatment; that is, all subjects received the same treatment. As we saw, for such a design (a single group, a binary dependent variable) a sign test was appropriate. However, this is a relatively weak design. A stronger design would include a control or comparison group. For example, the independent variable could be represented by two levels: the money treatment and a standard talk treatment. This would result in a design involving qualitative independent and dependent variables, both scored with more than one level or category.
Data for such designs are the counts associated with the cells of the design—that is, the number of subjects who received the standard treatment who improved, the number who did not, the number who received the money treatment who improved, and the number who did not—resulting in a two by two table of counts. These counts are typically analyzed with chi-square statistics for two-dimensional tables (like the present example) and with loglinear approaches (Bakeman & Robinson, 1994; Bishop, Fienberg, & Holland, 1975) when tables include two or more dimensions (each categorical variable defines a dimension of the cross-classification or contingency table). As noted in
the preface, such analyses, which require computational tools not provided by spreadsheets, are discussed in Bakeman & Robinson (1994).
Perhaps the most common design, insofar as there is one in the behavioral sciences, involves a quantitative dependent variable and one or more qualitative independent variables. Such data are analyzed with the analysis of variance (ANOVA), a statistical approach that has received considerable attention and has been much used in the behavioral sciences. If the independent variables were quantitative, correlation or multiple regression would be used instead. And if the independent variables were a mixture of qualitative and quantitative, again multiple regression would be used. In fact, as noted in chapter 1, the analysis of variance can be presented as one straightforward application of multiple regression, although not all texts emphasize this simplifying fact.
For completeness, logistic regression and discriminant function analysis should be mentioned because they are appropriate techniques to use when the dependent variable is qualitative and the independent variable quantitative. Multivariate analysis of variance (MANOVA) and multivariate regression should also be mentioned in recognition of the fact that some designs involve multiple dependent quantitative measures. These can be important techniques, but their use is regarded as an advanced topic in statistics and hence beyond the scope of this book. For a clear explication of these and other multivariate techniques, see Tabachnick and Fidell (2001).
The preceding discussion (summarized in Fig. 4.2) has emphasized the scale of measurement for the independent and dependent variables (and has mentioned the number of dependent variables), but other distinctions are important and are often made. For example, it is common to distinguish between parametric and nonparametric tests.
Parametric tests require assumptions about the distributions underlying the data, whereas nonparametric tests do not require such assumptions. In practice, nonparametric tests are typically applied to categorical or ordinal data, and parametric tests, which are emphasized here, to quantitative data. The traditional reference for a number of nonparametric techniques is Siegel (1956). Another distinction commonly made concerns the nature of the research question. Many authorities (e.g., Tabachnick & Fidell, 2001) find it convenient to distinguish questions that ask about relations among variables (for which correlation and regression are appropriate), from questions that ask about the significance of group differences (for which analysis of variance is appropriate), from questions that attempt to predict group membership (for which discriminant function analysis is appropriate), from questions that ask about the underlying structure of the data (for which factor analysis is appropriate). Explication of the techniques required for the first and second kinds of questions constitutes the bulk of this book, but in practice— because of the integrated multiple regression approach adopted here— the distinction between the two is somewhat blurred. Technically, however, the assumptions required for tests of association and tests of group differences are somewhat different (again see Tabachnick & Fidell, 2001). As noted previously, the statistical techniques required for the third and fourth kinds of questions constitute an advanced topic in statistical analysis and are not covered here.
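The matching of designs to procedures discussed in this section (and summarized in Fig. 4.2) amounts to a lookup keyed on the kind of independent variables and the kind of dependent variable. The following sketch merely restates those pairings in code; the dictionary and function names are illustrative assumptions, and the mapping is a rough guide, not a substitute for checking a procedure's assumptions.

```python
# (kind of independent variables, kind of dependent variable) -> technique,
# following the pairings summarized in Fig. 4.2.
TECHNIQUES = {
    ("single categorical", "categorical"): "chi-square",
    ("single categorical", "quantitative"): "ANOVA, t test",
    ("multiple categorical", "categorical"): "log-linear analysis",
    ("multiple categorical", "quantitative"): "ANOVA",
    ("single quantitative", "categorical"): "discriminant function",
    ("single quantitative", "quantitative"): "simple correlation",
    ("multiple quantitative", "categorical"): "discriminant function",
    ("multiple quantitative", "quantitative"): "multiple regression",
    ("categorical and quantitative", "categorical"): "logistic regression",
    ("categorical and quantitative", "quantitative"): "multiple regression",
}

def suggest_technique(independent, dependent):
    """Return the technique listed for a design, ignoring letter case."""
    return TECHNIQUES[(independent.lower(), dependent.lower())]

# The expanded money-cure study: treatment (two levels) and outcome.
print(suggest_technique("Single categorical", "Categorical"))
```

A design not listed in the table raises a KeyError, which is a reasonable signal that the choice of procedure needs more thought than a lookup can provide.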
4.3 MATCHING DESIGNS WITH STATISTICAL PROCEDURES
Independent variables           Dependent variable
                                Categorical               Quantitative
Single categorical              Chi-square                ANOVA, t test
Multiple categorical            Log-linear analysis       ANOVA
Single quantitative             Discriminant function     Simple correlation
Multiple quantitative           Discriminant function     Multiple regression
Categorical and quantitative    Logistic regression       Multiple regression
FIG. 4.2. Appropriate statistical techniques for studies with categorical and quantitative dependent variables and various kinds and numbers of independent variables.

Techniques appropriate for analyzing quantitative dependent variables are emphasized in this book. One technique that usually appears in introductory texts but is not discussed here in any detail is Student's t test. The t test is appropriate for designs with a quantitative dependent variable and a single binary independent variable. Thus it can be used to answer questions like, "Is the mean IQ at 3 years of age different for preterm and full-term infants?" The t test was developed earlier than the analysis of variance and is limited to two groups. The analysis of variance can analyze for differences between two groups and more besides. In the interest of economy (why learn two different tests when only one is needed?), we have decided to present only the more general ANOVA technique for analyzing problems that earlier would have been subjected to a t test. As Keppel and Saufley (1980) noted:

The t test is a special case of the F test. ... If you were to conduct a t test and an F test on the data from the same two-group experiment, you would obtain exactly the same information. ... The two statistical tests are algebraically equivalent, that is, [$F = t^2$].
Due to this equivalency, we decided not to develop the t test. . . . The F test can be applied to almost any situation in which the t test can be used, but it can also be applied in situations where the t test cannot be used. (p. 109) Do not misunderstand. We are not claiming that the t test is of no use and should not be used. Moreover, in order to understand much of the literature, its application needs to be understood. And pedagogically, there is much to be gained by a thorough explication of the t test and its variants. In a longer volume, it would have a place. Our intent in writing this book, however, was to present basic statistics in as economical and efficient a manner as possible, which means that concepts and techniques that have wide application (like multiple regression
and analysis of variance) are emphasized and other topics with more limited application (like chi-square and the different kinds of t tests) are omitted. The purpose of this chapter was first to introduce some basic vocabulary that will be used in this book, and second, to introduce the names of several useful statistical techniques with which you should be familiar, even though explication of these techniques is beyond the scope of this book. It is important, of course, to know how to apply particular statistical techniques, but it is even more important to know which techniques should be applied in the first place, and in order to do that, readers need to at least be aware of a wide range of possibilities. The importance of matching statistical procedures to study designs has been emphasized in the last several paragraphs. It is a key issue that will occupy us throughout the remainder of this book. In the next chapter, however, we discuss some simple and basic ways to describe data.
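The equivalence of the t test and the F test for two groups, quoted earlier in this section, can be checked numerically. The following is a minimal sketch, not a procedure from this book: the two sets of IQ scores are hypothetical values invented for illustration, and the function names are our own. It computes the pooled-variance t statistic and the one-way ANOVA F ratio by hand and compares F with t squared.

```python
from math import sqrt


def mean(scores):
    return sum(scores) / len(scores)


def sum_of_squares(scores):
    """Sum of squared deviations of the scores from their own mean."""
    m = mean(scores)
    return sum((y - m) ** 2 for y in scores)


def t_statistic(group1, group2):
    """Student's t for two independent groups, using the pooled variance."""
    n1, n2 = len(group1), len(group2)
    pooled_var = (sum_of_squares(group1) + sum_of_squares(group2)) / (n1 + n2 - 2)
    std_error = sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean(group1) - mean(group2)) / std_error


def f_statistic(group1, group2):
    """One-way analysis of variance F ratio for the same two groups."""
    n1, n2 = len(group1), len(group2)
    grand_mean = mean(group1 + group2)
    ss_between = (n1 * (mean(group1) - grand_mean) ** 2
                  + n2 * (mean(group2) - grand_mean) ** 2)
    ss_within = sum_of_squares(group1) + sum_of_squares(group2)
    ms_between = ss_between / 1              # df between = 2 groups - 1
    ms_within = ss_within / (n1 + n2 - 2)    # df within = N - 2
    return ms_between / ms_within


# Hypothetical IQ scores, invented for this illustration only.
preterm = [98.0, 102.0, 95.0, 101.0, 99.0]
full_term = [104.0, 107.0, 100.0, 109.0, 105.0]

t = t_statistic(preterm, full_term)
F = f_statistic(preterm, full_term)
print(f"t = {t:.4f}, t squared = {t ** 2:.4f}, F = {F:.4f}")
```

Whatever two groups are supplied, the printed values of t squared and F should agree apart from rounding error.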
5
Describing a Sample: Basic Descriptive Statistics
In this chapter you will:

1. Learn how to summarize a sample (a particular set of scores) with a single number or statistic, the arithmetic mean.
2. Learn what it means to say that the mean is the best fit, in a least-squares sense, to the scores in the sample.
3. Learn how to describe the variability in a sample and in a population using the variance and the standard deviation.
4. Learn how to compare scores from different samples or populations using standard scores (Z scores).
5. Learn how to identify scores so extreme that they might possibly be measuring or copying mistakes.

If you never wanted to generalize beyond a sample, inferential statistical techniques like hypothesis testing would be unnecessary. But very likely you would still want to describe what your sample was like. Basic descriptive statistical techniques, like those described in this chapter, allow you to do exactly that.

In the last chapter we distinguished among variables measured on nominal, ordinal, interval, and ratio scales. Descriptive statistics for nominal data, as the running example of the money-cure study demonstrates, are extremely simple, consisting of little more than counts (or proportions), for example, the number (or percentage) of people who improved. Descriptive statistics for ordinal data (unless we think it justified to regard differences between ranks as equal) are no more complex. Again, we simply report the numbers and proportions of subjects who fall in the different categories. Describing interval or ratio variables offers more possibilities. In this chapter we discuss the mean, the variance, and the standard deviation, which are basic descriptive statistics appropriate for summarizing quantitative data.
5.1 THE MEAN

For economy of presentation, if for no other reason, it is often useful to characterize an entire set of scores with just one typical score. There are several different ways to compute this typical, or average, score. For example, the mode, the median, the arithmetic mean, and the geometric mean all involve different computations and often result in different values. Each of these typical scores has its uses, as students in introductory statistics courses learn, but by far the most commonly used average score in statistical work is the arithmetic mean. In fact, whenever the word mean is used in this book, it is safe to assume that it means the arithmetic mean.
The Mean Defined

Almost certainly you learned how to compute the arithmetic mean long ago. The rule is (a) add the scores together and (b) divide their sum by the number of scores. Symbolically (using Σ, the Greek upper case sigma, to indicate summation, Y to represent a raw score, and N to indicate the number of scores), the mean for the Y scores is

$$M_Y = \frac{\sum Y_i}{N} \qquad (5.1)$$
We could leave matters there, but there is something to be learned if the definition of the mean is probed a bit more. You may find this probing rather overwrought for something so basic, but our purpose is to introduce an important concept (the method of least squares) within the simplest context possible. In developing this definition of the mean as a best-fit statistic in a least-squares sense, we make use of data derived from a simple (imaginary) study, which is described next and is used as an example in this and subsequent chapters.
An Example: The Lie Detection Study

Imagine that we invite 10 subjects into our laboratory for a polygraph examination. We provide them with the script of the questions the examiner will ask and we instruct them to lie when answering certain of those questions. Different subjects provide false answers to different questions, but all subjects tell the same number of lies. Then we ask the examiner to identify untruthful responses. Thus the dependent variable (DV) for this study is the number of lies detected by the examiner. The data for this study are given in Fig. 5.1 and, as you can see, the number of lies detected ranged from two to nine.

The next exercise shows you how to manipulate these data using a spreadsheet. Although this exercise uses the data from the lie detection study, it could easily be adapted for use with data from other studies as well.
Subject    No. lies detected
1          3
2          2
3          4
4          6
5          6
6          4
7          5
8          7
9          7
10         9

FIG. 5.1. Data for the lie detection study.

Exercise 5.1 The Mean

The spreadsheet template developed during this exercise computes basic descriptive statistics for the lie detection study. Moreover, it will form the basis for subsequent exercises. Give some attention to the notation used and defined in this exercise. It is used throughout the book.

General Instructions
1. Set up and label five columns. Each row represents a subject and the first column contains that subject's number (1-10).
2. The second column contains values for the dependent variable, which is the number of lies detected for each subject as given in Fig. 5.1. Label this Y, which represents a generic DV.
3. The third column contains predicted scores. Label this Y' (read as "Y-prime") and point each cell in this column to a cell that contains 5 (use an absolute address). You could enter the value 5 (which is an arbitrary value selected for this exercise) into the cells in this column directly, but if you point the cells to another, single cell you have the option of later changing the predicted values for all subjects by just changing one cell.
4. In the fourth column, enter a formula for the differences between the raw and the predicted scores. This difference score is symbolized with a lower case y, and is often called a deviation or residual score. Label this column Y-Y'.
5. In the fifth column, enter a formula for the square of the difference scores. Label it y*y (where y = Y-Y').
6. On separate rows, enter formulas that compute the sum, the count, and the mean for the raw, the predicted, the difference, and squared difference scores. You may use Fig. 5.2 as a guide if you wish.

Detailed Instructions
1. In row 2, label the columns as follows:

   Label   Column   Meaning
   s       A        The subject number or index. In this case it ranges from 1 to 10. In general, the subject could be an individual, a dyad, a family, or any other appropriate unit of analysis.
   Y       B        The number of lies detected, actually observed. In general, Y indicates an observed score. For the present example, six lies were detected for the fifth subject.
   Y'      C        The number of lies detected, predicted by you (read as "Y-prime"). Usually the basis for prediction will be clear from context.
   Y-Y'    D        The difference between observed and predicted. This is the amount left when the predicted score is subtracted from the observed score. This difference score is symbolized with a lower case y, and is often called a deviation or residual score.
   y*y     E        The square of this difference, that is, Y-Y' multiplied by Y-Y'. The sum of the entries in this column is called a sum of squares (symbolized as SS) because the difference scores are squared and then summed.

2. To remind ourselves that the sum of the column E entries is the sum of the Y deviation scores squared, enter the label "SSy" in cell E1. In addition, enter the labels "Lies" in cell B1 and "y=" in cell D1.
3. In column A, label rows 13-16 as follows:

   Label   Row   Meaning
   Sum=    13    The sum of the scores in rows 3-12.
   N=      14    The number of scores in rows 3-12.
   Mean=   15    The mean of those scores.
   a=      16    The predicted score.

4. Enter the values 1 through 10 in column A, rows 3-12.
5. Enter the observed scores in column B, rows 3-12.
6. Put a value in cell B16. For now, make that value 5.
7. Point the cells in the Y' column (cells C3-C12) to the predicted cell (B16), that is, put the address B16 in C3-C12. Cells C3-C12 and B16 should now all display the same value.
8. Enter a formula for subtracting Y' (the column C entry) from Y (the column B entry) in cells D3-D12. These are the deviation scores or the residuals.
9. Enter a formula for multiplying Y-Y' (the column D entry) by itself in cells E3-E12. These are the deviations squared.
10. In cells B13-E13, enter a function that sums the entries in rows 3-12.
11. In cells B14 and C14, enter a function that counts the entries in rows 3-12.
12. In cells B15 and C15, enter a formula that computes the mean (the sum divided by N) for that column. At this point, your spreadsheet should look like the one portrayed in Fig. 5.2.
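For readers who want to check their spreadsheet against an independent calculation, the columns of Exercise 5.1 can be mirrored in a few lines of code. This is only a sketch under our own naming (the function `describe` and its dictionary keys are not part of the exercise); the data are the ten scores from Fig. 5.1 and the predicted value is the arbitrary 5 used in the exercise.

```python
# Number of lies detected for subjects 1-10 (Fig. 5.1).
lies = [3, 2, 4, 6, 6, 4, 5, 7, 7, 9]


def describe(scores, predicted):
    """Mirror the Exercise 5.1 columns for one constant predicted value."""
    deviations = [y - predicted for y in scores]   # column D: Y-Y'
    squares = [d * d for d in deviations]          # column E: y*y
    return {
        "sum": sum(scores),                # cell B13
        "n": len(scores),                  # cell B14
        "mean": sum(scores) / len(scores), # cell B15
        "sum_dev": sum(deviations),        # cell D13
        "ss": sum(squares),                # cell E13, the sum of squares
    }


result = describe(lies, predicted=5)
for label, value in result.items():
    print(f"{label:8s} {value}")
```

With a predicted value of 5, the sum of the deviations is 3 and the sum of squares is 41, matching the entries in row 13 of the exercise spreadsheet.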
Predicting the Mean

The spreadsheet portrayed in Fig. 5.2 shows the raw data, the arithmetic mean for those data, and the sum of the squared deviation scores, where a deviation consists of the difference between the raw score and the value 5. You probably wonder why we chose 5 as the predicted value in Exercise 5.1. We took a quick look at the numbers and noted that they ranged from 2 to 9. Then hurriedly and somewhat arbitrarily, we decided to guess that a typical value for this set of 10 numbers was 5. This was only a first guess, selected for purposes of exposition; we will revise it shortly. Actually, we had in mind a formal model for these scores:

$$Y_i = \alpha + \epsilon_i \qquad (5.2)$$

This particular model states that the score for the ith subject in the population, $Y_i$, consists of some population parameter, represented with $\alpha$ (lower case Greek alpha), plus an error component, $\epsilon_i$ (lower case Greek epsilon), specific to that subject. This is a very simple model. It suggests that the value of a score is determined by a single parameter and nothing more. Not all scores will have exactly the same value as the parameter, which is what the error component signals, but if asked to predict the value for any score, our prediction would always be $\alpha$, the value of the parameter. If Y' (read Y-prime, for Y-predicted) represents the predicted value for Y, and if a represents the estimated value for the population parameter $\alpha$, then the prediction equation associated with this model is:

$$Y_i' = a \qquad (5.3)$$

For Exercise 5.1, we guessed that a was 5; thus the specific prediction equation became $Y_i' = 5$ (or, omitting the subscript, simply Y' = 5). For this prediction model, characteristics of individual subjects are not considered. The same prediction is applied to all, which is why all the predicted values in column C (see Fig. 5.2) are the same.

In general, a model should indicate concisely how we think a particular dependent or criterion variable is generated. In that sense, it is an idealization of some presumed underlying reality, and presumably it reflects the mechanism that produces the criterion variable. Usually, a model contains one or more population parameters (like the $\alpha$ in Equation 5.2). In addition, and unlike Equation 5.2, it often contains one or more independent or predictor variables as well and so indicates how we think those predictor variables are related to the criterion variable under consideration.
        A        B       C       D       E
 1               Lies            y=      SSy
 2      s        Y       Y'      Y-Y'    y*y
 3      1        3       5       -2      4
 4      2        2       5       -3      9
 5      3        4       5       -1      1
 6      4        6       5       1       1
 7      5        6       5       1       1
 8      6        4       5       -1      1
 9      7        5       5       0       0
10      8        7       5       2       4
11      9        7       5       2       4
12      10       9       5       4       16
13      Sum=     53      50      3       41
14      N=       10      10
15      Mean=    5.3     5
16      a=       5

FIG. 5.2. Spreadsheet for the lie detection study using Y' = 5 as the prediction equation.
The Equation 5.2 model makes sense because we have not yet really thought about the lie detection study. We have not yet considered how characteristics of individual subjects or the circumstances surrounding their testing sessions might have affected their responses. In other words, we have yet to identify research factors (independent variables) that we want to investigate. Thus, this simple model represents a starting point. Later on we will ask whether other, more complex models allow us to make better predictions than the initial model posed here, and it will prove useful then to have this simple model as a basis for comparison. But even if we provisionally accept this simple model and its associated prediction equation, have we found the best estimate for the parameter? The value 5 was after all a rather offhand guess. Can we do better? Is there a best value? The answer is yes, if we first define what we mean by best. A common criterion for best, one that is often used in statistics, is the least-squares criterion. According to this guide, whatever value provides us with the lowest (or least) possible value for the sum of the squared error scores (or squares) is best. The Equation 5.2 model, rewritten using estimates instead of parameters, is: Yi = a + ei
Substituting $Y_i'$ for a (because $Y_i' = a$, Equation 5.3), this can be rewritten as:

$$Y_i = Y_i' + e_i$$

Rearranging terms, this becomes:

$$e_i = Y_i - Y_i'$$

In other words, the error score for each subject is the difference between that subject's actual score and the predicted score for that subject. In the last exercise the error scores (also called deviation scores or residuals because they indicate what is left after the predicted is subtracted from the observed) were symbolized with a lower case y (omitting subscripts, y = Y - Y').

When we used 5 as our predicted value, the sum of the squared error scores, which is the quantity we wish to minimize, was 41 (see Fig. 5.2). If no other value yields a sum of squares (as the sum of the squared residuals is usually called) lower than this, then we would regard 5 as the best estimate for the parameter, in the least-squares sense.

As you have probably already guessed, 5 is not the best estimate. For pedagogic purposes, we deliberately chose a number that was close but not best. The best estimate, based on our sample of 10 scores, is 5.3, which is the mean value for these scores. Statisticians have proved to their satisfaction that the mean provides the best fit, in the least-squares sense, to a set of numbers. When the mean is used to predict scores (when $Y_i' = M$), the sum of the squared deviations between observed and predicted scores will assume its lowest possible value. This is easy to demonstrate, as the next exercise shows.

Exercise 5.2 The Sum of Squares

The purpose of this exercise is to demonstrate that the minimum value for the sum of the squared deviations is obtained when the mean is used to form deviation scores.
1. Using the spreadsheet from the last exercise, enter different values in the cell that now contains 5, the predicted value or the value for the parameter a. For example, enter 3.5, 4, 4.5, 5, 5.5, 6, and so forth, and note the different values that are computed for the sum of squares, that is, the sum of the squared deviation scores. You should observe that the minimum value is achieved only when the mean (in this case, the value 5.3) is used.
2. Verify that using the mean indeed yields the smallest value for the sum of squares by entering values like 5.31, 5.32, 5.29, 5.28, and so forth. When the value 5.3 is used to form deviation scores, the sum of squares (usually abbreviated SS) should be 40.1, which is its lowest possible value for these data, and your spreadsheet should look like the one given in Fig. 5.3.

The principle of least squares (which, once presented, may seem like an obvious way to minimize errors of prediction) was first articulated in 1805 by a French mathematician, Adrien Marie Legendre. He wrote, "We see, therefore, that the method of least squares reveals, in a manner of speaking, the center around which the results of observations arrange themselves, so that the deviations from that center are as small as possible" (cited in Stigler, 1986).

The presentation here may have seemed like a long and involved way to present something as seemingly simple as the arithmetic mean, but consider what you have learned. On the practical side, you have had practice setting up and using a spreadsheet for a simple study. You have also learned to view the mean as the value that best represents or fits a set of numbers, using the least-squares criterion. This concept will serve you well later when the least-squares criterion is used in more interesting and more complex contexts.
And, as you will discover in the next two sections, the groundwork required for computing the variance and the standard deviation is already insinuated in the spreadsheet you have just developed.
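The trial-and-error search of Exercise 5.2 can also be sketched in code. The candidate values below are the ones suggested in the exercise plus the mean itself; the function name `sum_of_squares` is our own choice. This is an illustration of the least-squares idea, not a general minimization routine.

```python
# Number of lies detected for subjects 1-10 (Fig. 5.1).
lies = [3, 2, 4, 6, 6, 4, 5, 7, 7, 9]


def sum_of_squares(scores, predicted):
    """Sum of squared deviations of the scores from one predicted value."""
    return sum((y - predicted) ** 2 for y in scores)


# Candidate predicted values taken from Exercise 5.2.
candidates = [3.5, 4, 4.5, 5, 5.28, 5.29, 5.3, 5.31, 5.32, 5.5, 6]

for a in candidates:
    print(f"a = {a:<5} SS = {sum_of_squares(lies, a):.4f}")

# The candidate with the smallest sum of squares.
best = min(candidates, key=lambda a: sum_of_squares(lies, a))
print("smallest SS at a =", best)
```

Among these candidates the smallest sum of squares occurs at 5.3, the mean, and values on either side of it (5.29 and 5.31, for example) give slightly larger sums.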
        A        B       C       D       E
 1               Lies            y=      SSy
 2      s        Y       Y'      Y-Y'    y*y
 3      1        3       5.3     -2.3    5.29
 4      2        2       5.3     -3.3    10.89
 5      3        4       5.3     -1.3    1.69
 6      4        6       5.3     0.7     0.49
 7      5        6       5.3     0.7     0.49
 8      6        4       5.3     -1.3    1.69
 9      7        5       5.3     -0.3    0.09
10      8        7       5.3     1.7     2.89
11      9        7       5.3     1.7     2.89
12      10       9       5.3     3.7     13.69
13      Sum=     53      53      0       40.1
14      N=       10      10
15      Mean=    5.3     5.3
16      a=       5.3

FIG. 5.3. Spreadsheet for the lie detection study using Y' = 5.3 for the prediction equation.
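The difference between the sums of squares in Figs. 5.2 and 5.3 can also be checked algebraically. The following is a brief sketch, using only the definitions given so far, of why any constant prediction a other than the mean M must give a larger sum of squares. It relies on the fact that deviations from the mean sum to zero.

```latex
\begin{align*}
\sum_{i=1}^{N} (Y_i - a)^2
  &= \sum_{i=1}^{N} \bigl[(Y_i - M) + (M - a)\bigr]^2 \\
  &= \sum_{i=1}^{N} (Y_i - M)^2 + 2(M - a)\sum_{i=1}^{N} (Y_i - M) + N(M - a)^2 \\
  &= \sum_{i=1}^{N} (Y_i - M)^2 + N(M - a)^2
  \qquad \text{because } \sum_{i=1}^{N} (Y_i - M) = 0 .
\end{align*}
```

The second term is zero only when a = M and is positive otherwise. For the lie detection data with a = 5, it equals 10 × (5.3 - 5)² = 0.9, which is exactly the difference between 41 (Fig. 5.2) and 40.1 (Fig. 5.3).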
Note 5.1

Y       An upper case Y is used to indicate a generic raw score, usually for the dependent variable. Thus Y represents any score in the set generally, and $Y_i$ indicates the score for the ith subject.

Y'      Y' (read Y-prime) indicates a predicted score (think of "prime" as standing for "predicted"). This is relatively standard regression notation, although sometimes a Y with a circumflex or "hat" (i.e., a ^) over it is used instead, $\hat{Y}$. Usually the basis for prediction will be clear from the context. Individual predicted scores are symbolized $Y_i'$, but often the subscript is omitted.

Y-Y'    The difference or deviation between an observed score and a predicted score is called a residual or error score. Again the subscripts are often omitted. Often lower case letters are used to represent residuals. For example, in Fig. 5.3 a lower case y represents the deviation between a raw Y score and the mean for the Y scores.

SS      The sum of squares is formed by first squaring each residual, and then summing the resulting squares; thus it is the sum of the squared residuals. In other words, $SS = \sum (Y_i - Y_i')^2$, where i = 1, ..., N.

5.2 THE VARIANCE

The mean is a point estimate because it identifies one typical score, whereas variability indicates range, which is an area within which scores typically fall. Consider the kind of practical problem that first caused mathematicians to ponder variability of scores. Imagine you are a ship's navigator at sea attempting to read the angle of the North Star above the horizon from a moving deck. In a desperate attempt at accuracy, you take successive readings, but each one is somewhat different from the others. Still, you reason, the best estimate you can make for the true value is the mean of your various measurements. You know there is only one true value, so you regard any deviation between that value and one of your measurements as error. Under these circumstances, variability might not be your major concern, unless, of course, you wanted to compare the variability of your measurements with someone else's.

But now imagine you are concerned with various anthropometric measurements, for example, the heights of military recruits who come from different geographic areas. In this case, you might well find it interesting to see if the height of Scotsmen, for example, is more variable than that of Japanese. Or, given the lie detection study described earlier, you might want to know if the number of lies detected for the first five subjects (who might have been treated differently in some way) was more variable than the number for the last five.

No matter whether scores are several measurements of the same subject, as for the navigation example, or single measurements of different subjects, as for the anthropometric or the lie detection study example, it can be useful to quantify the amount of variability in a set of scores. One common measure of variability is the variance. The variance for a sample of scores is simply the mean squared deviation, that is, the average of the squared residuals. In other words, to compute the variance for a sample of scores:

1. Subtract the mean from each score.
2. Square each difference.
3. Sum the squared deviation scores.
4. Divide that sum by the number of scores.

Symbolically (letting $M_Y$ represent the mean for the Y scores) the variance for the Y scores is:

$$VAR_Y = \frac{\sum (Y_i - M_Y)^2}{N} \qquad (5.4)$$
Most of the necessary computations are already contained in the spreadsheet shown in Fig. 5.3.

Exercise 5.3 The Sample Variance

The purpose of this exercise is to modify the current template so that, in addition to the mean, it computes the variance for the sample data as well.

General Instructions
1. Make sure that the predicted score is always the mean. In other words, point the cell that contains the predicted score to the cell that contains the mean of the raw scores. This allows you to change the raw scores, and hence the mean, ensuring that deviation scores will remain deviations from the mean.
2. Add a formula for the sample variance to the current template. Provide an appropriate label.
3. Selectively change the values for the raw scores and note how these changes affect the value of the variance. First, change the raw scores so that they are all quite similar to each other. Then, make just one value very different from the others (e.g., most scores might be 3s and 4s whereas one score might be 94). Finally, select values that are somewhat different, and then a lot different, from each other. What do you observe?

Detailed Instructions
1. Put the label "VAR=" for variance in cell D15.
2. Enter 10 (or a counting function) in cell E14. In cell E15 enter a formula for dividing the sum of the squared deviation scores (cell E13) by the number of scores (cell E14). This is the variance.
3. Ensure that the predicted value will always be the mean of the raw scores, even if you change the raw scores (i.e., point cell B16 to cell B15).
4. The more scores are spread out around the mean, the greater the variance. This is easy to demonstrate. First replace the most extreme numbers of lies with numbers nearer the mean (in column B). For example, replace the 2 with a 4, the 7s with 6s, and the 9 with a 7. What is the variance for this new set of scores? How does it compare with the original variance of 4.01?
5. Next undo the changes you just made (restore the original raw scores) and replace numbers near the mean with more extreme scores. For example, replace the 4 with a 1, the 5 with a 2, and the two 7s with 9s. What is the variance now? Why should it be larger than the initial 4.01?
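The comparisons asked for in Exercise 5.3 can be sketched as follows. The function `sample_variance` is our own name for Equation 5.4 (sum of squares divided by N). The two altered data sets follow the detailed instructions; because the original data contain two 4s, replacing the first of them in the second set is our assumption.

```python
def sample_variance(scores):
    """Sum of squared deviations from the mean, divided by N (Equation 5.4)."""
    n = len(scores)
    m = sum(scores) / n
    return sum((y - m) ** 2 for y in scores) / n


original = [3, 2, 4, 6, 6, 4, 5, 7, 7, 9]      # Fig. 5.1
less_spread = [3, 4, 4, 6, 6, 4, 5, 6, 6, 7]   # 2 -> 4, 7s -> 6s, 9 -> 7
more_spread = [3, 2, 1, 6, 6, 4, 2, 9, 9, 9]   # first 4 -> 1, 5 -> 2, 7s -> 9s

for label, scores in [("original", original),
                      ("less spread", less_spread),
                      ("more spread", more_spread)]:
    print(f"{label:12s} VAR = {sample_variance(scores):.2f}")
```

The variance for the scores pulled toward the mean is smaller than the original 4.01, and the variance for the scores pushed away from the mean is larger.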
Sample and Population Variance

For the last exercise you computed a quantity called the sample variance. According to Equation 5.4, this is the sum of squares divided by N. There is a second definition for the variance, which divides the sum of squares not by N, but by N - 1 instead. Symbolically:

$$VAR_Y' = \frac{\sum (Y_i - M_Y)^2}{N - 1} \qquad (5.5)$$
Many writers call this the population variance. It is an estimate of the true value of the population variance, based on sample data, and is symbolized here as VAR' to distinguish it from VAR (the sample statistic). As a general rule, in this book addition of the apostrophe or prime indicates an estimate; many texts use the circumflex (i.e., the ^ or hat) over the symbol for the same purpose. Generations of students have been perplexed at first by the difference between the sample and population variance (and the sample and population standard deviation), which is not surprising once you realize that different writers define matters differently. One question concerns the N -1 divisor: Why is it N - 1 instead of N? For now accept that one of the divisors should be N -1 and wait until chapter 10, where we offer an intuitive explanation for the N -1. A second question concerns which divisor is used for the sample, which for the population variance: For example, should N or N-1 be used for the sample variance? Here we have defined the sample variance as the one with the N divisor and the population variance as the one with the N -1 divisor, but some texts and programs do just the opposite. The VAR function in Excel, for example, returns the sums of squares divided by N -1, whereas the VARP function (P for population) returns the SS divided by N. On one point there is no confusion: If one wants to describe the variance for a group, and if scores are available for all members of the group, then one divides the sum of squares by N. Some writers, usually those of a more empirical bent, think primarily in terms of samples, assume they can never know all the scores in the population, identify the group with a sample, and therefore call the SS divided by N the sample variance. If they divide the same SS by N -1, they call it the population variance, presumably because it is an estimate of population variance based on sample data. 
Other writers, usually those of a more formal or mathematical bent, think primarily in terms of populations, assume they can know all scores in the population, identify the group with the population, and therefore call the SS divided by N the population variance. If for some reason they do not have access to all scores in the population, they divide the SS by N -1 and call this the sample variance, presumably because it is an estimate of the population variance based on sample data. In this book, reflecting our preoccupation with empirical data analysis as opposed to formal mathematics, we side with the empiricists and call the SS divided by N the sample variance, and the SS divided by N -1 the population variance, or more correctly, the estimated population variance. But once you are aware of this distinction, and understand why some writers define matters one way whereas other writers reverse the definition, you should be able to understand different writers and use different computer routines correctly.
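The same naming issue arises outside spreadsheets. As an illustration (this example is ours, not part of the original text), Python's standard `statistics` module provides `pvariance`, which divides the sum of squares by N, and `variance`, which divides by N - 1. Its labels therefore follow the second group of writers described above, so the quantity this book calls the sample variance (VAR) corresponds to `pvariance`, and the estimated population variance (VAR') corresponds to `variance`.

```python
import statistics

# Number of lies detected for subjects 1-10 (Fig. 5.1).
lies = [3, 2, 4, 6, 6, 4, 5, 7, 7, 9]

var_n = statistics.pvariance(lies)          # SS / N, this book's VAR
var_n_minus_1 = statistics.variance(lies)   # SS / (N - 1), this book's VAR'

print(f"SS / N       = {var_n:.4f}")
print(f"SS / (N - 1) = {var_n_minus_1:.4f}")
```

Whatever the tool, checking which divisor a function uses is more reliable than trusting its name.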
5.3 THE STANDARD DEVIATION

A second useful measure of variability used in statistics is the standard deviation. To compute the standard deviation for a sample of scores, first compute (a) the variance and then compute (b) its square root. Symbolically, the sample standard deviation for the Y scores is:

$$SD_Y = \sqrt{VAR_Y} = \sqrt{\frac{\sum (Y_i - M_Y)^2}{N}} \qquad (5.6)$$
The variance is an average sum of squares, so the units for variance are the units for the initial scores squared. For example, if the initial scores were feet, then variance would be measured in square feet and analogous to a measurement of area. Taking the square root of the variance, which is how the standard deviation is computed, means the units for the standard deviation are again the same as those used initially. Thus the standard deviation is like an average error, expressed in the same units as those used for the raw scores. The larger the deviations from the mean are (i.e., the more scores are spread out instead of clustering near the mean), the larger is the standard deviation. You may wonder why the square root of the variance has come to be called the standard deviation; why it is used to represent an average error? Certainly the average of the absolute values of the deviation scores (column D in the Fig. 5.3 spreadsheet) would be a logical candidate for a measure of the typical deviation or error. The reason the standard deviation has become standard has to do with its technical statistical properties. These properties are not shared by the mean absolute deviation. For now, and until you read further, accept this on faith but know that there are reasons for this that experts find acceptable.
Sample and Population Standard Deviation

Equation 5.6 defines the sample standard deviation. The formula for the population standard deviation, estimated from sample data, is:

$$SD_Y' = \sqrt{VAR_Y'} = \sqrt{\frac{\sum (Y_i - M_Y)^2}{N - 1}} \qquad (5.7)$$
Again, as with the sample and population variance, you should be aware that some writers (and spreadsheets) reverse the definitions used here.

Exercise 5.4 The Sample Standard Deviation

The purpose of this exercise is to modify the current template so that, in addition to the mean and the sample variance, it computes the sample standard deviation as well.

General Instructions
1. Add a formula for the sample standard deviation to the current template. Provide an appropriate label.
2. Again, selectively change the values for the raw scores and note how these changes affect the value of the standard deviation. What do you observe?
3. Finally, restore the raw scores to their initial values as given in Fig. 5.1.

Detailed Instructions
1. Put the label "SD=" for standard deviation in cell D16. Then in cell E16, enter a formula for the square root of the variance (cell E15). This is the standard deviation.
2. Now change the raw data (cells B3-B12) and note the effect on the standard deviation (cell E16). First try the two alternative sets given in Exercise 5.3. Recall that one set was less spread out, and one more, than the initial data. Then try some unusually distributed data sets. For example, make one score 24, one 35, and all the rest 2s and 3s. Then make half the scores 3s and the other half 12s. What do you observe?
3. Finally, restore the original data as given in Fig. 5.1.
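The quantities discussed in this section can be computed side by side for the lie detection data. This is a minimal sketch with variable names of our own choosing. It also computes the mean absolute deviation mentioned earlier as a logical alternative, purely for comparison.

```python
from math import sqrt

# Number of lies detected for subjects 1-10 (Fig. 5.1).
lies = [3, 2, 4, 6, 6, 4, 5, 7, 7, 9]

n = len(lies)
mean = sum(lies) / n
ss = sum((y - mean) ** 2 for y in lies)    # sum of squares

sd_sample = sqrt(ss / n)                   # SD, Equation 5.6
sd_population_est = sqrt(ss / (n - 1))     # SD', Equation 5.7
mean_abs_dev = sum(abs(y - mean) for y in lies) / n

print(f"SD  = {sd_sample:.4f}")
print(f"SD' = {sd_population_est:.4f}")
print(f"mean absolute deviation = {mean_abs_dev:.4f}")
```

All three values are expressed in the original units (number of lies), and SD' is slightly larger than SD because its divisor is N - 1 rather than N.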
Note 5.2 M
The sample mean, which is the sum of the scores divided by N. The mean for the Y scores is often symbolized as Y with a bar above it (read Y-bar), the mean for the X scores as X with a bar above, and so forth. In this book, to avoid symbols difficult to portray in spreadsheets, My represents the mean of the Y scores, Mx the mean of the X scores, and so forth.
VAR
The sample variance, which is the sum of squares divided by N. Often it is symbolized S² or s², which makes sense because variance is a squared measure. If not clear from context, VARy indicates the variance for the Y scores, VARx the variance for the X scores, and so forth.
VAR'
The population variance as estimated from sample data, which is the sum of squares divided by N - 1. Often it is symbolized with a circumflex (a ^ or hat) above s².
SD
The sample standard deviation, which is the square root of the sample variance. Often it is symbolized S or s. If not clear from context, SDy indicates the standard deviation for the Y scores, SDx the standard deviation for the X scores, and so forth.
SD'
The population standard deviation as estimated from sample data, which is the square root of the estimated population variance. Often it is symbolized with a circumflex (a ^ or hat) above s.
' (prime)
In this book, a prime or apostrophe after a symbol indicates an estimate, for example, VAR' and SD'. Some texts use the circumflex (the ^ or hat) for this purpose.
5.3 THE STANDARD DEVIATION
After you have restored the raw data to the initial values, your spreadsheet should look like the one given in Fig. 5.4. There are two additional points concerning this spreadsheet. First, because Y', the predicted value, is the mean of the raw scores, the mean of the predicted scores will of course be the same as the raw score mean. Second, given that the predicted score is the mean, the sum of the deviation scores must be zero. This follows from the way the mean is defined. However, due to rounding errors, entries that should sum exactly to zero, like the sum of the deviation scores, may sum to an extremely small number instead, like 4E-16, which in scientific notation means 4 divided by a 1 followed by 16 zeros. Such values may appear if a general (G) format is used for a cell. If you specify a format with, for example, two decimal places, then the value ".00" will be displayed instead.
Exercise 5.5 SPSS Descriptive Statistics
The purpose of this exercise is to familiarize you with the Descriptives command in SPSS.
1. Invoke SPSS. Create two variables, s and Y. Give the Y variable the label "Lies". Enter the data from the lie detection study. You could do this by hand, or you could cut and paste from your spreadsheet as displayed in Fig. 5.4.
2. Select Analyze -> Descriptive Statistics -> Descriptives from the main menu. In the Descriptives window, move the Lies variable to the right-hand window. Click on the Options button and check the box next to Variance. Click Continue and then OK.
        A       B       C       D       E
 1              Lies    y=              SSy
 2      s       Y       Y'      Y-Y'    y*y
 3      1       3       5.3     -2.3    5.29
 4      2       2       5.3     -3.3    10.89
 5      3       4       5.3     -1.3    1.69
 6      4       6       5.3     0.7     0.49
 7      5       6       5.3     0.7     0.49
 8      6       4       5.3     -1.3    1.69
 9      7       5       5.3     -0.3    0.09
10      8       7       5.3     1.7     2.89
11      9       7       5.3     1.7     2.89
12      10      9       5.3     3.7     13.69
13      Sum=    53      53      0       40.1
14      N=      10      10              10
15      Mean=   5.3     5.3     VAR=    4.01
16      a=      5.3             SD=     2.0025
FIG. 5.4. Spreadsheet for the lie detection study after variance and standard deviation calculations have been added.
3. Examine the output. Do the values you obtained for N, the mean, variance, and standard deviation agree with the results from the spreadsheet you created in Exercise 5.4?
4. Save the SPSS data file.
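The spreadsheet values in Fig. 5.4 can also be checked outside Excel or SPSS. The following is a minimal sketch in Python, a language this book does not otherwise use, relying only on the standard library; the variable names are illustrative. Note that Python's statistics module follows the naming convention opposite to the one used in this book: pstdev divides the sum of squares by N (SD here), whereas stdev divides by N - 1 (SD' here).

```python
import math
import statistics

# Number of lies detected for the 10 subjects (column B of Fig. 5.4).
lies = [3, 2, 4, 6, 6, 4, 5, 7, 7, 9]

n = len(lies)
mean = sum(lies) / n                      # M
deviations = [y - mean for y in lies]     # Y - Y', with Y' set to the mean
ss = sum(d * d for d in deviations)       # sum of squares, about 40.1

var = ss / n                              # VAR: sum of squares divided by N
sd = math.sqrt(var)                       # SD
var_est = ss / (n - 1)                    # VAR': sum of squares divided by N - 1
sd_est = math.sqrt(var_est)               # SD'

# In exact arithmetic the deviations sum to zero; in floating point the
# result may instead be a tiny number such as 4E-16.
deviation_sum = sum(deviations)

# Cross-check against the standard library (note the reversed naming).
assert math.isclose(statistics.pstdev(lies), sd)
assert math.isclose(statistics.stdev(lies), sd_est)

print(f"N={n} M={mean:.2f} VAR={var:.2f} SD={sd:.4f}")
print(f"VAR'={var_est:.4f} SD'={sd_est:.4f}")
print("sum of deviations:", deviation_sum)
```

Dividing by N - 1 instead of N gives somewhat larger values for the variance and standard deviation, which is worth keeping in mind when comparing spreadsheet results with output from a statistical package.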
5.4 STANDARD SCORES
Adolphe Quetelet, a Belgian 19th-century statistician, is noted for his work with the way various measurements are distributed. Among the scores he worked with were the chest circumferences of Scottish soldiers (see Stigler, 1986). Assume the mean is 40 inches and one particular soldier has a chest circumference of 45. Thus, this soldier's chest measured 5 inches more than the mean. Now from our lie detection study recall that the mean number of lies detected was about five (5.3) and for the 10th subject nine lies were detected, which means that about four more lies than average were detected for that subject. It makes little sense to ask who is more extreme, the subject with about four detected lies above the mean or the soldier whose chest circumference is 5 inches more than the mean. The two scales, number of lies and inches, are as alike as goats and apples. What is needed is a common or standard scale of measurement. Then both scores could be rescaled and, once the transformed scores were expressed in the same units, these new scores could be compared. The common scale is provided by the standard deviation, and the transformed scores are called standard scores or Z scores. To compute a standard score, (a) subtract the mean from an individual score and (b) divide that difference by the standard deviation for the set of scores. Symbolically (letting My and SDy represent the mean and standard deviation for the Y scores), the Z score corresponding to the Y score for the ith subject is:

$$Z_i = \frac{Y_i - M_Y}{SD_Y} \qquad (5.8)$$

Equation 5.8 is expressed in terms of sample statistics. The corresponding definition expressed in terms of population parameters is:

$$Z_i = \frac{Y_i - \mu}{\sigma} \qquad (5.9)$$

Note that both the deviation (Yi - My or Yi - μ) and the standard deviation (SDy or σ) are measured in the units used for the initial scores. Thus when one is divided by the other, the units cancel, resulting in a unit-free score, neither lies nor inches. Actually, because deviation scores are divided by the appropriate standard deviation, we can regard the resulting standard scores as being expressed in standard deviation units. A standard score of 1.0, then, means the corresponding raw score is exactly one standard deviation above its mean, whereas a standard score of -1.0 indicates a raw score one standard deviation below the mean, a standard score of 2.5 implies a raw score two and a half standard deviations above the mean, and so forth.
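As a worked illustration, the standard score for the 10th subject in the lie detection study can be computed from the values in Fig. 5.4 (a mean of 5.3 and a sample standard deviation of about 2.0025). The arithmetic below is a sketch added for clarity, with the result rounded to two decimal places:

```latex
Z_{10} = \frac{Y_{10} - M_Y}{SD_Y} = \frac{9 - 5.3}{2.0025} = \frac{3.7}{2.0025} \approx 1.85
```

That is, nine detected lies is a little less than two standard deviations above the mean.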
Exercise 5.6 Standard Scores
This exercise modifies the current spreadsheet so that, in addition to the mean, variance, and standard deviation, it now computes standard scores as well.
General Instructions
1. Add two columns to the current spreadsheet. In the first column enter formulas for Z scores and in the second column enter formulas for the square of the Z scores.
2. Enter formulas for the sum, count, and mean of the Z scores. Enter formulas for the sum, count, mean, and square root of the mean for the squared Z scores. What is the mean Z score? Why must it be zero? What are the variance and standard deviation for the Z scores? Why must they be one?
Detailed Instructions
1. Label column F (cell F2) with "Z" (for Z or standard score) and column G (cell G2) with "z*z."
2. Enter a formula in cells F3-F12 for dividing the residual (Y-Y' in column D) by the standard deviation (cell E16). This is the standard score.
3. Enter a formula in cells G3-G12 for multiplying the standard score (Z, the column F entry) by itself. (You may want to format columns F and G so that only two or three places after the decimal point are given.)
4. Copy the formulas from cells E13-E16 to F13-F16 and to G13-G16.
5. What is the mean Z score? Why must it be zero? What are the variance and standard deviation for the Z scores? Why must they be one?
At this point, your spreadsheet should look like the one given in Fig. 5.5. Note that the mean standard or Z score is zero. This follows from the fact that deviations from the mean must sum to zero. Note also that the standard deviation for the standard scores (the square root of the average squared Z score) is 1. This follows from the fact that differences were divided by the standard deviation; hence, a raw score that initially was a standard deviation above the mean would have been transformed into a standard score of 1.
Exercise 5.7 More Standard Scores
This exercise provides additional practice in computing standard scores. The current template is used, but with different data. This time chest circumference measurements are used.
1. Using the spreadsheet for the lie detection study as a base, create a new spreadsheet for the chest circumference of Scottish soldiers. All that needs to be changed is some labeling (e.g., "Chest" for chest circumference, instead of "Lies") and the raw data. Replace the number of lies with the values 34, 38, 39, 39, 40, 40, 41, 41, 43, and 45, respectively. These represent the chest measurements, in inches, for 10 Scottish soldiers.
2. Nothing else needs to be done. All formulas for computing residuals, means, variances, standard deviations, and standard scores are already in place.
Your new spreadsheet should look like the one shown in Fig. 5.6. You are now in a position to compare scores in the two distributions. Note that the standard deviation for lies was 2.00 (Fig. 5.5) and the standard deviation for chest measurements was 2.79 (both rounded to two decimal places). This means that the chest measurements were somewhat more widely distributed about their mean than the number of lies were around their mean. Now examine the standard or Z scores. As you can see, the standard scores associated with 9 lies and 45 inches were 1.85 (Fig. 5.5) and 1.79 (Fig. 5.6), respectively, almost the same. This means that the two highest scores from these two distributions are almost equally extreme. Note also that many standard scores were between -1 and +1 and almost no scores were less than -2 or greater than +2. However, the soldier with the smallest chest circumference (34 inches) was more extreme in his distribution (Z score = -2.15) than was the subject who had only two lies detected (Z score = -1.65).
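The comparison above can be reproduced with a short script. This is a minimal sketch in Python, which is not a tool the book uses; the function name z_scores is an illustrative choice, and the standard deviation divides the sum of squares by N, matching the sample standard deviation (SD) as defined in this chapter.

```python
import math

def z_scores(scores):
    """Return standard scores using the SD that divides by N."""
    n = len(scores)
    mean = sum(scores) / n
    sd = math.sqrt(sum((y - mean) ** 2 for y in scores) / n)
    return [(y - mean) / sd for y in scores]

lies = [3, 2, 4, 6, 6, 4, 5, 7, 7, 9]               # Fig. 5.5
chests = [34, 38, 39, 39, 40, 40, 41, 41, 43, 45]   # Fig. 5.6, in inches

z_lies = z_scores(lies)
z_chests = z_scores(chests)

# Highest and lowest scores in each distribution, now on a common scale.
print("9 lies:   ", round(z_lies[lies.index(9)], 2))
print("45 inches:", round(z_chests[chests.index(45)], 2))
print("2 lies:   ", round(z_lies[lies.index(2)], 2))
print("34 inches:", round(z_chests[chests.index(34)], 2))
```

Because both sets of scores are expressed in standard deviation units, the printed values can be compared directly even though the raw scales (lies and inches) cannot.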
Exercise 5.8 Obtaining Standard Scores in SPSS
Open the data file you created in Exercise 5.5. Run Descriptive statistics for the Lies variable as you did in the previous exercise, but this time check the "Save standardized values as variables" box in the Descriptives window. After you run the Descriptives command, return to the Data editor. Notice that SPSS created a new variable called "zy." Why do these z scores differ slightly from the scores you calculated with your spreadsheet?
        A       B       C       D       E         F        G
 1              Lies    y=              SSy
 2      s       Y       Y'      Y-Y'    y*y       Z        z*z
 3      1       3       5.3     -2.3    5.29      -1.15    1.3192
 4      2       2       5.3     -3.3    10.89     -1.65    2.7157
 5      3       4       5.3     -1.3    1.69      -0.65    0.4214
 6      4       6       5.3     0.7     0.49      0.35     0.1222
 7      5       6       5.3     0.7     0.49      0.35     0.1222
 8      6       4       5.3     -1.3    1.69      -0.65    0.4214
 9      7       5       5.3     -0.3    0.09      -0.15    0.0224
10      8       7       5.3     1.7     2.89      0.85     0.7207
11      9       7       5.3     1.7     2.89      0.85     0.7207
12      10      9       5.3     3.7     13.69     1.85     3.4140
13      Sum=    53      53      0       40.1      0        10
14      N=      10      10      10      10        10       10
15      Mean=   5.3     5.3     VAR=    4.01      0        1
16      a=      5.3             SD=     2.0025    0        1
FIG. 5.5. Spreadsheet for the lie detection study including standard or Z-score calculations.
Identifying Outliers Transforming or rescaling raw scores into standard scores is useful, not only because it allows us to compare scores in different distributions, but also because it can reveal comforting or disturbing information about a particular sample of scores. Sometimes a data recording instrument malfunctions or sometimes a person makes a copying or data entry mistake. If such mistakes result in a data point quite discrepant from other, legitimate data, then its standard score will be extreme. Opinions vary as to how extreme a score needs to be before it should be regarded as an outlier, a possibly illegitimate data point that should be modified or deleted. A common rule of thumb suggests that any data point whose standard score is less than -3 or greater than +3 should, at the very least, be subjected to scrutiny. Scores this extreme may be important and valid observations, but they also may indicate that something (a subject, a piece of recording apparatus, a procedure) may have malfunctioned, or the subject does not legitimately belong to the population from which the investigator intended to sample. Outliers can exercise undue weight on the statistics computed (recall Exercise 5.3) and can render suspect the assumptions required for parametric statistical tests. As a matter of course, then, it is good practice always to compute and examine standard scores and to consider whether any outliers represent influential and legitimate data or unwanted noise. The mean of standard scores will always be zero and their standard deviation and variance will always be 1 because of the way standard scores are defined. Note, however, that the shape of the distribution of the scores is unaffected by being standardized. If we begin with a sample of scores, and use the sample mean and standard deviation to compute standard scores, the shape of the
        A       B       C       D       E       F        G
 1      Scots   Chests  y=              SSy
 2      s       Y       Y'      Y-Y'    y*y     Z        z*z
 3      1       34      40      -6      36      -2.15    4.62
 4      2       38      40      -2      4       -0.72    0.51
 5      3       39      40      -1      1       -0.36    0.13
 6      4       39      40      -1      1       -0.36    0.13
 7      5       40      40      0       0       0.00     0.00
 8      6       40      40      0       0       0.00     0.00
 9      7       41      40      1       1       0.36     0.13
10      8       41      40      1       1       0.36     0.13
11      9       43      40      3       9       1.07     1.15
12      10      45      40      5       25      1.79     3.21
13      Sum=    400     400     0       78      0        10
14      N=      10      10      10      10      10       10
15      Mean=   40      40      VAR=    7.80             1
16      a=      5               SD=     2.79             1
FIG. 5.6. Spreadsheet giving the chest circumferences in inches and their associated standard scores for 10 Scottish soldiers.
distribution of the transformed scores will be identical to the shape of the distribution for the raw scores. If the raw scores are highly skewed, for example, the standardized scores will also be skewed. Many students come to a course in statistics believing that standardizing scores somehow makes them normally distributed. This is simply not so; the shape of their distribution is unaffected by standardizing.
Exercise 5.9 Outliers
For this exercise you again use the current template and modify the data. The purpose is to demonstrate the effect different distributions of raw scores, especially those that include outliers, can have on their means, standard deviations, and standardized scores.
1. Change the raw scores for chest measurements so that the scores are more spread out from the mean. Then change the raw scores to represent a skewed distribution, for example, one with many more low than high scores. In each case, how are the standard scores affected?
2. Change one or two of the raw scores for chest measurements to a very large number, like 88 (likely an outlier). Then change one or two of the raw scores to a very small number (an outlier in the other direction). In each case, how are the mean and standard deviation affected? How are the standard scores affected?
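The claim that standardizing leaves the shape of a distribution unchanged can be checked numerically. The sketch below, in Python, replaces the largest chest measurement with 88 (the value suggested in Exercise 5.9) and compares a simple skewness index, the mean cubed standard score, before and after standardizing. The helper names and the choice of skewness index are assumptions made for illustration; they are not taken from the book.

```python
import math

def mean_and_sd(scores):
    """Mean and SD (sum of squares divided by N) of a list of scores."""
    n = len(scores)
    mean = sum(scores) / n
    sd = math.sqrt(sum((y - mean) ** 2 for y in scores) / n)
    return mean, sd

def standardize(scores):
    """Convert raw scores to standard (Z) scores."""
    mean, sd = mean_and_sd(scores)
    return [(y - mean) / sd for y in scores]

def skewness(scores):
    """Mean cubed standard score: zero for a symmetric set of scores."""
    z = standardize(scores)
    return sum(v ** 3 for v in z) / len(z)

chests = [34, 38, 39, 39, 40, 40, 41, 41, 43, 45]
with_outlier = chests[:-1] + [88]        # 45 replaced by a likely outlier

for label, data in (("original", chests), ("with 88", with_outlier)):
    mean, sd = mean_and_sd(data)
    z = standardize(data)
    print(f"{label}: M={mean:.2f} SD={sd:.2f} largest Z={max(z):.2f}")
    print(f"  skewness raw={skewness(data):.3f} standardized={skewness(z):.3f}")
```

The two skewness values on each line agree: rescaling changes the mean to 0 and the standard deviation to 1, but a skewed set of scores stays exactly as skewed as it was.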
Descriptive Statistics Revisited This chapter has been concerned primarily with description, with statistics that tell us something about the scores in a sample (or in a population, if all scores in the population are available to us). Specifically we have learned that the mean number of lies detected for the 10 subjects in the lie detection study was 5.3, and if the mean were used as the best guess for the number of lies detected for each subject, a typical or average error (i.e., the standard deviation) would be 2.00. Likewise, we have learned that the average chest circumference for a sample of 10 Scotsmen was 40 inches, and if the mean were used as the best guess for each Scotsman's chest circumference, the standard deviation or error would be 2.79. We have also learned that the mean can be thought of as the typical score for a set of scores, whereas the variance and standard deviation indicate whether scores are widely dispersed or clustered more narrowly about that mean. The standard deviation, moreover, is used when computing standard scores, which provides a way of determining whether scores in different distributions are equally extreme. This may also aid in identifying measurement or data transcription errors. Descriptive statistics such as these can indicate to others exactly what we observed and measured; consequently, they are interesting and valuable in their own right. In addition, they can also be used as a basis for inference, as we begin to demonstrate in the next chapter.
Note 5.3
z
A standardized or z score. It is the difference between the raw score and the mean divided by the standard deviation.
6
Describing a Sample: Graphical Techniques
In this chapter you will: 1. Learn basic principles of good graphical design. 2. Learn how to create histograms, stem-and-leaf plots, and box plots. In the previous chapter you learned basic statistics designed to describe the characteristics of central tendency (e.g., the mean) and variability (e.g., the variance and standard deviation). These statistics help define two important aspects of a distribution. Indeed, they form the foundation for the analysis of variance. Before applying inferential statistics, however, it is always important to know the subtleties inherent in any data set. The mean and standard deviation do not provide a comprehensive view of the data. The application of good graphical techniques allows you to determine the shape of your data distribution, discover outliers, and uncover subtle patterns in your data that may not be readily detectable when viewing only numerical values. As Edward Tufte summarized in the epilogue of his classic text, The Visual Display of Quantitative Information (1983, p. 191), "What is to be sought in designs for the display of information is the clear portrayal of complexity... the task of the designer is to give visual access to the subtle and the difficult—that is, the revelation of the complex." At this point we should differentiate two purposes of graphing data. Most of you are likely familiar with the bar charts typically found in professional journals. The purpose of such figures is to present the reader with the results of an analysis and they are therefore designed to tell a specific story; perhaps to the exclusion of unimportant information. 
The graphical techniques in this chapter, on the other hand, are more in the spirit of exploratory data analysis described by John Tukey (1977). These graphical techniques will allow you to get to know your data before you conduct confirmatory or inferential statistics to assure that you have selected the right models to test, and to help you determine if your data meet the assumptions (e.g., normality) for each of the inferential analyses you plan to conduct. These techniques can also be used at any time, before or after confirmatory analysis, to simply explore the data in search of new and interesting patterns that you may not have considered during study design. The techniques are also critical to data cleanup. Strange patterns in your data or outliers may become evident in your graphs. These may be due to true variability in the population, or they may be errors. Therefore, it is always important to check.
6.1
PRINCIPLES OF GOOD DESIGN Before proceeding immediately to the different types of graphs used in exploratory data analysis, it is helpful to present a few principles that will help you organize your data in the most efficient manner possible. Graphical techniques are useful for showing large amounts of data in one readily interpretable unit. What would take you many words and paragraphs to describe in written text can be presented in a single well-designed figure. Thus, it is important to produce graphs that present your data in a coherent format. At the same time, however, you should avoid covering up patterns in your data or, worse yet, distorting the data in ways that would lead to erroneous conclusions. If you follow a few simple principles, the graphs you produce should give you vital insight into your data that will help you make appropriate decisions concerning the more complex inferential analyses you will learn about shortly.
Have a Clear Goal
Good expository writing always begins with a clear thesis and then uses the literary style most suitable for presenting that thesis. Similarly, when producing a graph, you should have a clear goal in mind, and then select the best technique for accomplishing that goal. For instance, a histogram may be the best format for getting an overall feel for the way univariate data are distributed, while a stem-and-leaf plot may be better when trying to uncover outliers or unusual clusters of values. When comparing two or more groups, box plots may be the best graphical form. In your initial attempts to explore a data set, you will need to use various different graphical techniques with different goals in mind. As your analyses progress, however, the goal of each figure should become clearer. Eventually, when presenting your data to others, you should ensure that the figures are clearly integrated with the text and written interpretations of your analyses. In later chapters we present graphs that are designed to explicate the results of your analyses.
Show the Data The preceding principle does not imply that a good figure can have only one point. In fact, as the "show the data" principle suggests, a single graph can, and should, present as many aspects of the data as can be viewed in a coherent fashion. Thus, a single graph could provide the viewer with information concerning central tendency, variability, the shape of the distribution, and even individual data points. By doing so, for instance, it is possible to provide information concerning group comparisons, but also how valid the comparisons are, and how to best interpret any differences and similarities between the groups. Consider the typical bar chart ubiquitous in many professional publications. If two groups are compared, then the figure conveys only two pieces of information—the mean of each group. In some cases you may also be provided with error bars that give you some idea of variability within the groups. If a graph is to be worth a thousand words, however, then it needs to contain more than four values (i.e., two means and two standard errors). The stem-and-leaf diagram (see the next section) is an excellent example of a graphical technique that
presents large amounts of data (in some cases all of the raw data itself) in a format that allows the quick apperception of summary characteristics.
Keep It Simple
The explosion of relatively simple-to-use software packages such as Excel and PowerPoint in the 1990s gave great power to the user to conduct statistical analyses and prepare presentation quality graphics. Along with that power, however, comes responsibility. Just as you would not conduct a statistical analysis in SPSS or Excel simply because it was available to you as a menu item, you should not add graphical decoration to your figures just because the option is available. Charts in Excel frequently default to the use of garish color schemes, grid lines, and background color fills that, at best, direct attention away from the main point of the figure, and, at worst, obscure or distort the data. Some authors refer to this as "chartjunk" (Tufte, 1983; Wainer, 1984). Chartjunk should be kept to a minimum and information to a maximum. You can use some of the design options, such as bolding one aspect of a figure to highlight an important point, but you should always ask yourself before selecting an option, such as three-dimensional shadows for a bar chart, if it adds any informational value to the figure.
6.2 GRAPHICAL TECHNIQUES EXPLAINED
The following section will present three graphical techniques, the stem-and-leaf plot, the histogram, and the box plot, that will help you get to know your data. Using a combination of these techniques will allow you to uncover characteristics of your data such as the shape of the distribution (i.e., skew), central tendency (the mean, median, and/or mode), and spread (i.e., variance). Some of the techniques are also useful for detecting gaps and clusters within the data and for finding possible outliers. It is also possible to adapt these techniques to aid in the comparison of distributions.
The Stem-and-Leaf Plot
The stem-and-leaf plot, or stem plot for short, embodies the principle of "show the data." It is actually a hybrid of a text table and a true figure. The stem-and-leaf plot organizes individual numerical data in ways that highlight summary features of the distribution. Consider the following set of numbers: 40, 49, 47, 46, 64, 64, 46, 66, 48, 46, 46, 38, 64, 63, 45, 65, 34, 60, 57, 55, 50, 41, 65, 47, 49, 63, 47, 46, 56, 40, 49, 60, 19, 47, 42, 46, 63, 61, 63, 37. Try to quickly identify the characteristics of this data set. What is the minimum value? What is the maximum value? Are there any clusters in the data? Are there any outliers? An unordered case listing of the data is not very helpful. A first step to organizing the data is to sort it in numerical order: 19, 34, 37, 38, 40, 40, 41, 42, 45, 46, 46, 46, 46, 46, 46, 47, 47, 47, 47, 48, 49, 49, 49, 50, 55, 56, 57, 60, 60, 61, 63, 63, 63, 63, 64, 64, 64, 65, 65, 66. It is now readily apparent that the minimum value is 19 and the maximum is 66. Upon closer inspection there is a large number of values at 46. The careful reader may have also noticed that 19 may be an outlier. Thus, when data sets are relatively small (say, fewer than 50 or so cases), simply sorting your data in Excel or a statistical package and then scrolling through the values will give you a hint about some of the important features of your data. The shape of the distribution, however, is hard to extract from this sort of presentation.
We could also group the data into a frequency table. To do so, we simply create class intervals and then count the number of values within each interval (see Fig. 6.1). When viewing the figure it is obvious that there also are two modes, one between 40 and 49 and the other between 60 and 69. The distribution also appears asymmetrical: there are fewer numbers in the lower ranges and more in the higher ranges (i.e., a negative skew). By creating a frequency table we have gained a better understanding of the shape of the data, but we have lost access to the original data. We don't know, for instance, if the minimum value is 10 or 19. We also don't know if the mode is due to the existence of 19 scores of 40, or if the 19 scores are distributed evenly throughout the 40-49 range. The stem-and-leaf plot provides the opportunity to organize the data into a frequency table, yet it maintains access to the original values. The plot is made up of stems, which represent the class intervals from a frequency table. Each leaf represents an individual value. To construct a stem plot you separate each number at a place value. The stem becomes the value to the left of the place value, and the leaf becomes the number to the right of the place value. In the previous example, because the values range between 10 and 100, it makes sense to create stems in the tens and let each leaf represent the ones place holder. Typically stems and leaves are represented by one-digit numbers, but depending on the range of your data, it is sometimes useful to have two-digit stems or leaves. To read the plot presented in Fig. 6.2, you combine the stem with the leaf to recreate each data point. The first value is 1 for the stem and 9 for the leaf; thus the 9 represents the value 19. Similarly, on the third line we combine a stem of 3 with the leaves 4, 7, and 8 to create 34, 37, and 38.
The frequency column is optional, but provides a good check to make sure you have the correct N for your data set. The stem plot provides the same information as the frequency table, but retains the original data values. From this plot, we can still see that the data clusters between 40 and 50, but we also know that 46 is the true mode. An added advantage is that the stem plot is more graphical in nature. The shape of the distribution (remember the negative skew) is visually perceptible in this format. It should be noted that it is necessary to use a fixed font, such as Courier, rather than a proportional font, to create an accurate visual impression of horizontal bars extending from each leaf. The stem-and-leaf plot can be adapted in a number of ways. First, if your data are in a different range, say from 100 to 999, you could round or truncate the values so that the stem represents the hundreds place holder and each leaf would represent the tens. For instance, a value of 324 would have a stem of 3 and a leaf of 2. Of course, given this method, you could not differentiate between 324 and 327. If you desired to retain the full resolution of each number, you could use
FIG. 6.1. A frequency table.
FIG. 6.2. A stem-and-leaf plot.
a stem of 3 (hundred) and leaves of 24 and 27. If you have a large number of scores clustered on one stem, you could alter the number of stems by splitting each one in half. For instance, Fig. 6.2 could be recreated with each stem split into those values below five and those five and above. By doing so, the shape of the distribution may sometimes become more obvious (see Fig. 6.3).
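The construction described above can be sketched in a few lines of code. The following is a minimal illustration in Python, which is not a tool used in this book; the function name stem_and_leaf and the printed layout are assumptions made for this example, and the function handles only the simple case of non-negative whole numbers with tens as stems and ones as leaves.

```python
def stem_and_leaf(values):
    """Map each tens-digit stem to a string of its ones-digit leaves, in order."""
    ordered = sorted(values)
    stems = range(ordered[0] // 10, ordered[-1] // 10 + 1)
    plot = {stem: "" for stem in stems}      # keep empty stems so gaps show
    for value in ordered:
        plot[value // 10] += str(value % 10)
    return plot

scores = [40, 49, 47, 46, 64, 64, 46, 66, 48, 46,
          46, 38, 64, 63, 45, 65, 34, 60, 57, 55,
          50, 41, 65, 47, 49, 63, 47, 46, 56, 40,
          49, 60, 19, 47, 42, 46, 63, 61, 63, 37]

plot = stem_and_leaf(scores)
print("Frequency  Stem  Leaf")
for stem, leaves in plot.items():
    # The number of leaves on a stem is that interval's frequency.
    print(f"{len(leaves):9d}  {stem:4d}  {leaves}")
```

The length of each leaf string reproduces the frequency table (for example, 19 scores on the stem for 40 to 49), while the leaves themselves preserve the original values. As noted above, the output should be viewed in a fixed-width font so that the rows of leaves give an accurate impression of the shape of the distribution.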
FIG. 6.3. A stem-and-leaf plot with each stem split in two.
The Histogram
The stem plot is an excellent graphical technique for displaying all of the data in a fashion that allows the viewer to perceive the distributional characteristics with little effort. The histogram does not retain information about individual scores, but as it is more graphical in nature, it is particularly good for providing a feel for the distribution of data at a quick glance. Given the design principle of show the data, the stem plot is the preferred technique, but most statistical programs do not provide the flexibility in selection of interval sizes and the manipulation of the number of digits in the stem and/or leaves necessary to take full advantage of the features of the stem plot. It is for this reason that we also present the histogram. A histogram is a graphical presentation of a frequency table. Class intervals are plotted on the x axis and frequency is plotted on the y axis. Individual bars indicate the number of values contained within each class interval. A histogram looks very much like the stem plot turned on its side. Compare the stem plot in Fig. 6.3 and the histogram of the same data presented in Fig. 6.4. The two modes are clearly visible, and so is the outlier. We can also see the negative skew in the shape of the distribution. What is missing is the ability to access individual data points.
FIG. 6.4. An example of a histogram.
When creating a histogram it is important to be aware of the number and size of the bins (i.e., the class intervals). In Fig. 6.4 there are 10 bins, which seems a reasonable number given the range of the data. Fig. 6.5 shows the same data as Fig. 6.4, but with only five bins. The two modes are still apparent in this figure, but the outlier is no longer readily apparent and the gap between the two modes is not as dramatic. Some consider the optimum width of each bin or class interval to be the one that makes the histogram most closely resemble the probability distribution of the data. Scott (1979) suggested that this can be accomplished by calculating the width of each bin as
$$W = 3.49 \times SD \times N^{-1/3} \qquad (6.1)$$
where W is the bin size, SD is the standard deviation, and N is the sample size. Given Formula 6.1 and using the estimated standard deviation (10.64 for these 40 scores), the appropriate width of the bins for the data is 10.85. Given the range of the data this would produce a histogram very similar to the one in Fig. 6.5. Using Scott's formula will often result in noninteger bin widths, which might be confusing to some viewers. Therefore, histograms based on Formula 6.1 provide a good exploratory tool for determining the overall shape of a distribution, but may not be as well suited for presenting your data to others. Moreover, such histograms may sometimes omit detailed information, such as a small number of outliers that do not have great influence on the overall distribution. It is important to play with the size of the bins to give yourself the opportunity to explore all aspects of a distribution. This can be easily accomplished in most statistical packages.
FIG. 6.5. A histogram with only five bins.
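The bin width from Formula 6.1 is easy to compute directly. The sketch below uses Python's standard library, which the book itself does not use; the function name scott_bin_width is hypothetical, and the standard deviation is the one estimated with N - 1 in the denominator, as the text specifies.

```python
import statistics

def scott_bin_width(values):
    """Bin width W = 3.49 * SD * N^(-1/3), using the N - 1 (estimated) SD."""
    n = len(values)
    sd_est = statistics.stdev(values)    # divides the sum of squares by N - 1
    return 3.49 * sd_est * n ** (-1 / 3)

scores = [40, 49, 47, 46, 64, 64, 46, 66, 48, 46,
          46, 38, 64, 63, 45, 65, 34, 60, 57, 55,
          50, 41, 65, 47, 49, 63, 47, 46, 56, 40,
          49, 60, 19, 47, 42, 46, 63, 61, 63, 37]

width = scott_bin_width(scores)
data_range = max(scores) - min(scores)
print(f"W = {width:.2f}")
print(f"range = {data_range}, range / W = {data_range / width:.1f}")
```

Dividing the range of the data by the bin width shows that a little more than four bins of this width are needed to span the scores, which is consistent with the five-bin histogram in Fig. 6.5.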
The Box-and-Whiskers Plot

The box-and-whiskers plot, or box plot, contains less information than the histogram or stem plot, but provides a graphical representation of important descriptive statistics related to the distribution of a data set. The box plot is also a good graphical technique for comparing two or more distributions. It is usually composed of a box, a line within the box, and whiskers, which are two lines emanating from either side of the box. These features of the box plot describe five characteristics of a distribution: the 25th and 75th percentiles, the maximum and minimum values (excluding outliers), and the 50th percentile (i.e., the median). When oriented vertically, the lower and upper edges of the box represent the 25th and 75th percentiles. These two values define the interquartile range (IQR), which is the 75th percentile minus the 25th percentile, that is, the range spanned by the middle 50% of scores. The whiskers extend to the maximum and minimum values, excluding outliers. A line within the box is also drawn to indicate the position of the median.

The box plot uses the median instead of the mean as a measure of central tendency, and the IQR instead of the standard deviation as a measure of variability. This is because the median and IQR are more resistant to the effects of individual scores, especially extreme scores, than the mean and standard deviation. If we changed the 66 to 166 in the data set, the median and IQR would remain unchanged at 48.5 and 16.5, respectively. The mean and standard deviation, on the other hand, would change from 50.85 to 53.35 and from 10.64 to 21.00, respectively. Thus, when first exploring your data, the median and IQR are more useful. Later, after you have cleaned your data and determined the appropriate inferential statistics, the mean and standard deviation become more important.

Fig. 6.6 presents a box plot for the data set. It is readily apparent that 50% of values fall within the box between the mid 40s and the low 60s.
The median is around 48, and the minimum and maximum values are in the mid 30s and 60s. Skew is also apparent in these data as the bottom whisker is longer than the top, and a larger number of scores within the box are found above the median. The dot at the bottom of the plot represents an outlier. In chapter 5 you learned that an outlier could be defined as any value that was greater than three standard deviations away from the mean. In fact, this outlier is 19 and has a z score of -2.99, so it falls just under the threshold of 3.00 absolute. The definition of an outlier as ±3.00 standard deviations is just a rule of thumb, however, and any value approaching 3 should probably be examined carefully. In the box plot, outliers are instead defined as values that lie 1.5 IQRs or more beyond the edges of the box (i.e., below the 25th percentile or above the 75th percentile). SPSS also identifies extreme values as those that are 3 IQRs or more beyond the edges of the box. If there were any extreme values in these data, SPSS would identify them with an asterisk. Not all statistical packages define outliers in the same manner, so it is important that you check.

Occasionally, you may come across a box plot that uses the terms hinges and H spread. Hinges are approximations of the 25th and 75th percentiles designed to be easier to calculate. The H spread is the equivalent of the IQR and is defined as the upper hinge minus the lower hinge. Hinges were useful before the days of
DESCRIBING A SAMPLE: GRAPHICAL TECHNIQUES
FIG. 6.6. A box-plot representation of the data.

statistical packages implemented on a computer. Today they are rarely seen, but if you do come across a box plot that uses hinges, they can be interpreted in the same way as plots based on percentiles and the IQR.
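The quantities a box plot displays can be computed in a few lines. The following is a minimal sketch, not the procedure any particular package uses: it assumes the common convention of flagging values 1.5 IQRs or more beyond the edges of the box, it uses the "inclusive" percentile method of Python's `statistics.quantiles` (packages differ in how they interpolate percentiles), and the scores are hypothetical rather than the chapter's data set. The last lines illustrate the resistance of the median and IQR by replacing the largest score, 66, with 166.

```python
import statistics


def box_plot_summary(scores):
    """Return the quantities a box plot displays, using 1.5-IQR fences."""
    data = sorted(scores)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    # Whiskers end at the most extreme scores that are not outliers.
    inside = [x for x in data if low_fence < x < high_fence]
    outliers = [x for x in data if x <= low_fence or x >= high_fence]
    return {
        "q1": q1, "median": median, "q3": q3, "iqr": iqr,
        "whisker_low": inside[0], "whisker_high": inside[-1],
        "outliers": outliers,
    }


# Hypothetical scores, not the data set used in the chapter.
scores = [19, 36, 40, 42, 44, 45, 46, 46, 47, 47,
          48, 49, 52, 55, 58, 60, 62, 63, 64, 66]
summary = box_plot_summary(scores)
print(summary)

# Resistance check: replace the largest score (66) with 166.
changed = scores[:-1] + [166]
changed_summary = box_plot_summary(changed)
print("median:", summary["median"], "->", changed_summary["median"])
print("IQR:   ", summary["iqr"], "->", changed_summary["iqr"])
print("mean:  ", statistics.mean(scores), "->", statistics.mean(changed))
print("SD:    ", round(statistics.stdev(scores), 2), "->",
      round(statistics.stdev(changed), 2))
```

In this sketch the median and IQR are identical before and after the change, whereas the mean and standard deviation both shift noticeably.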
Comparing Distributions Graphically

Thus far we have discussed the graphical representation of univariate data, or data based on a single sample. Often, however, it is useful to evaluate the distributions of two or more groups of data on the same figure. In such cases, the box plot allows for the quick comparison of distributions. The median appears prominently in a box plot and allows one to quickly perceive differences between groups with reference to central tendency. At the same time, however, information concerning variability and the shape of each group's distribution is maintained, making it a much better representation of group differences than a text table of means and standard deviations. The box plot is therefore an excellent graphical technique to use when your eventual goal is to apply inferential statistics that test group differences.

In Fig. 6.7, Group 1 is the same sample we have been working with throughout the chapter. Compare Group 1 to Group 2. It is easy to see that the two groups have a similar median value, but there is more variability in Group 2. Group 2 also seems to be more symmetrical and has an outlier in the 120 range that should be checked. Group 3, on the other hand, has a median much higher than the other two groups. As with Group 1, the third group is negatively skewed (the bottom whisker is longer than the top whisker), but the skew is not as large as the first group.

The previous example demonstrates the usefulness of side-by-side box plots for comparison of the distributional characteristics of different samples. The box plot for comparing groups can also be easily created in most statistical packages. Ease of computation is important, but you should also not limit yourself to only those graphical techniques that have already been implemented by the common statistical packages. For instance, you could use many small histograms in one figure to compare distributions. This is what Tufte (1983) referred to as the small multiple.
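A side-by-side comparison can also be tabulated numerically before any plot is drawn. The sketch below is illustrative only: the three groups are hypothetical (they are not the groups in Fig. 6.7), the helper name `describe_group` is invented for this example, and the percentile method is the "inclusive" option of `statistics.quantiles`.

```python
import statistics


def describe_group(scores):
    """Median and IQR (resistant) alongside mean and SD for one group."""
    q1, median, q3 = statistics.quantiles(sorted(scores), n=4, method="inclusive")
    return {
        "median": median,
        "iqr": q3 - q1,
        "mean": statistics.mean(scores),
        "sd": statistics.stdev(scores),
    }


# Hypothetical groups chosen to mimic the pattern described in the text.
groups = {
    "Group 1": [30, 42, 45, 46, 47, 48, 50, 55, 60],
    "Group 2": [20, 30, 38, 44, 47, 52, 60, 70, 120],
    "Group 3": [55, 66, 70, 74, 76, 78, 80, 82, 84],
}
table = {name: describe_group(scores) for name, scores in groups.items()}

print(f"{'Group':<9}{'Median':>8}{'IQR':>8}{'Mean':>8}{'SD':>8}")
for name, row in table.items():
    print(f"{name:<9}{row['median']:>8.1f}{row['iqr']:>8.1f}"
          f"{row['mean']:>8.1f}{row['sd']:>8.1f}")
```

Here Groups 1 and 2 share a median but differ in spread, and the single large score in Group 2 pulls its mean and standard deviation upward while leaving its median alone.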
It is also possible to adapt the stem plot to grouped data. Compare the side-by-side stem plot in Fig. 6.8 to the box plot in Fig. 6.7. From both plots we can immediately determine that the central tendency of the two distributions is about the same. The mode of both is in the 40s. With small sample sizes it is also relatively easy to find the median of the two groups by counting leaves in order
FIG. 6.7. A box plot comparing the distribution of three groups.

until you reach the midpoint of the sample. In this case, because there is an even number of scores, it is necessary to take the mean of the two scores closest to the center of the plot. From this figure, we can also see that Group 2's distribution is more symmetrical than Group 1's distribution, if you exclude the outliers. Finally, even though this figure may be a little busier than the box plot, it retains information concerning individual data points that allow you to extract characteristics of the samples, such as the outliers and the large number of 46s and 47s in Group 1.

These variations of the graphical techniques can easily be created by hand, on a word processor (as with Fig. 6.8), or in common draw programs that allow you to manipulate the output from statistical packages such as SPSS. Although they may take a little more effort, such graphs give you insight into your data and uncover subtle patterns that may not be readily apparent using the graphical techniques already implemented in software packages. Remember the quote from Edward Tufte: "the task of the designer is to give visual access to the subtle and the difficult—that is, the revelation of the complex." Your ability to reveal the complex patterns in your data is limited only by your creativity and skill in
FIG. 6.8. The side-by-side stem plot.
applying good principles of exploratory data analysis and the armamentarium of graphical techniques that you have mastered.

Exercise 6.1 Graphical Data Exploration

This exercise provides practice in the creation and interpretation of stem plots, histograms, and box plots. You will use the data provided in the table here. These data come from a study of children with Williams syndrome, a syndrome that often results in mild mental retardation (Robinson, Mervis, & Robinson, 2003). Each child was given a vocabulary test (Voc) and a test of grammar (Gram). In this exercise you will explore the univariate distributions of the vocabulary and the grammar scores. You will also generate a box plot to explore possible differences between the vocabulary and grammar scores.

Sub    1   2   3   4   5   6   7   8   9  10
Voc   84  70  75  62  93  83  98  59  73  67
Gram  79  79  69  68  91  75  94  61  67  57

Sub   11  12  13  14  15  16  17  18  19  20
Voc   80  92  60  70  75  35  31  64  69  83
Gram  67  76  64  75  76  55  55  63  67  82

Sub   21  22  23  24  25  26  27  28  29  30
Voc   56  45  68  66  54  44  89  80  70  80
Gram  66, 66, 82, 63, 67, 72, 75, 67 [only eight grammar scores appear for these ten subjects; which two subjects have missing scores cannot be determined from this copy]

1. For the vocabulary data, sort them in numerical order. Without calculating any statistics, what are your first impressions of the distribution? What are the minimum and maximum values? Are there any possible outliers? Do the scores seem to cluster in any particular way?
2. By hand, using graph paper, create a stem plot of the data. Were your first impressions concerning central tendency, variability, and outliers correct? What about the shape of the distribution? Is it skewed? If so, describe the skew.
3. Now create a histogram of the vocabulary data using eight equally spaced bins. Compare and contrast the histogram to the stem plot. Which do you find more informative and why?
4. Create another histogram by calculating the width according to Formula 6.1. What is the width? Compare this histogram to the one you created in part 3. How do your impressions of the data differ based on the histograms with differing bin widths?
5. Redo parts 1-4 using the grammar data. You may do this by hand, or on a computer, if you are familiar with the graphing functions in Excel, SPSS (see the next exercise), or another statistical package.
6. Create side-by-side box plots of the vocabulary and grammar data. How do the two sets of scores differ with respect to central tendency, variability, and the shapes of their distributions?
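If you would like to check a hand-drawn histogram on a computer, the sketch below counts scores in equally spaced bins and prints a rough text histogram. It is one possible approach under stated assumptions: the bins are assumed to span exactly from the minimum to the maximum score (other starting points are equally defensible), and the function name `bin_counts` is invented for this example. The scores are the 30 vocabulary scores from the table above.

```python
def bin_counts(scores, n_bins):
    """Count scores in n_bins equal-width intervals spanning min to max."""
    low, high = min(scores), max(scores)
    if high == low:
        raise ValueError("all scores are identical; bins would have zero width")
    width = (high - low) / n_bins
    counts = [0] * n_bins
    for x in scores:
        # The maximum falls on the last edge, so clamp it into the top bin.
        index = min(int((x - low) / width), n_bins - 1)
        counts[index] += 1
    return low, width, counts


# Vocabulary scores from the Exercise 6.1 table.
voc = [84, 70, 75, 62, 93, 83, 98, 59, 73, 67,
       80, 92, 60, 70, 75, 35, 31, 64, 69, 83,
       56, 45, 68, 66, 54, 44, 89, 80, 70, 80]

low, width, counts = bin_counts(voc, 8)
for i, count in enumerate(counts):
    start = low + i * width
    print(f"{start:6.2f} to {start + width:6.2f} | {'*' * count}")
```

Changing the second argument of `bin_counts` is a quick way to see how the apparent shape of the distribution depends on the number of bins.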
The preceding exercise gave you valuable experience in creating graphs by hand. In the next exercise, you will be introduced to some of the graphs available in the Explore procedure of SPSS. Statistical packages provide a quick and efficient means of summarizing data graphically, but it is often the case that graphing data by hand gives you a better "feel" for the data. Moreover, most statistical packages do not provide options for creating some of the graphs presented in this chapter (e.g., the side-by-side stem plot), nor do they provide you with the flexibility to change all of the parameters in any particular graph (e.g., the stem size of a stem plot). For these reasons you should not limit yourself to any one statistical package. Often, using a statistical package to create summary statistics that you then graph by hand or with a draw program is the most flexible option available for exploratory data analysis.

Exercise 6.2 Graphing in SPSS

In this exercise you will learn to use the Explore command to obtain descriptive statistics, histograms, stem-and-leaf plots, and box plots.

1. Open SPSS, create variables, and enter the data from Exercise 6.1.
2. Select Analyze->Descriptive Statistics->Explore from the main menu.
3. Move the vocabulary and grammar variables to the Dependent List box. Click on Plots and select the histogram option in the Explore: Plots dialog box. Click Continue.
4. Click on Options and select the exclude cases pairwise option and click Continue. Note that some SPSS commands give you the option of excluding cases listwise or pairwise. Selecting listwise exclusion will cause SPSS to delete any case that is missing data on at least one variable. Pairwise exclusion, on the other hand, will exclude only those cases that have missing data on variables involved in the current analysis.
5. Click on OK to run the Explore command.
6. Examine the output. What are the Ns for the vocabulary and grammar variables? Why are they different? What would they be if you selected listwise exclusion?
7. Examine the Descriptives output. Are the mean, median, standard deviation, interquartile range, and skewness statistics what you would expect given your results from Exercise 6.1?
8. Examine the histogram for the vocabulary scores. Double click on the histogram to open the chart editor. In the chart editor you can change the bin widths. Select Chart->Axis from the main menu. In the intervals box, select custom and click on Define. You can now select either the number of intervals you would like to display, or the interval width. Change the number of bins and/or the bin width. Observe how doing so affects the shape of the histogram. Do this with the gram variable as well.
9. Examine the stem-and-leaf plots and the box plots. Do they look like the plots you generated by hand?
10. Change some of the values in the data and rerun the Explore command. See if you can predict how the plots will look based on the changed values.
11. Return the numbers you changed to their original values and save the data file.
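The difference between listwise and pairwise exclusion is easy to see with a few cases. The sketch below is a plain Python illustration of the idea, not SPSS syntax or SPSS's implementation; the records and the function names `pairwise_n` and `listwise_n` are hypothetical, and `None` stands for a missing score.

```python
# Hypothetical records; None marks a missing score.
records = [
    {"voc": 84, "gram": 79},
    {"voc": 70, "gram": None},
    {"voc": 75, "gram": 69},
    {"voc": 62, "gram": None},
    {"voc": 93, "gram": 91},
]
variables = ["voc", "gram"]


def pairwise_n(records, variables):
    """N per variable: a case is dropped only for the variable it is missing."""
    return {v: sum(r[v] is not None for r in records) for v in variables}


def listwise_n(records, variables):
    """N per variable: a case missing any variable is dropped everywhere."""
    complete = [r for r in records if all(r[v] is not None for v in variables)]
    return {v: len(complete) for v in variables}


pairwise = pairwise_n(records, variables)
listwise = listwise_n(records, variables)
print("pairwise:", pairwise)
print("listwise:", listwise)
```

With pairwise exclusion the two variables can have different Ns; with listwise exclusion every variable is reduced to the number of complete cases.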
7
Inferring From a Sample: The Normal and t Distributions
In this chapter you will:

1. Learn how the normal can be used as an approximation for the binomial distribution if N (the number of trials) is large.
2. Learn what the normal curve is, how it arose historically, and what kind of circumstances produce it.
3. Learn what the central limit theorem is and how the normal distribution can be used as an approximation for the distribution of sample means if N (the sample size) is large.
4. Be introduced to the t distribution and learn how it can be used for the distribution of sample means if N is small.
5. Be introduced to the standard error of the mean and learn how to perform a single-sample test, determining whether a sample was probably drawn from a population with specified parameters.
6. Learn how to determine 95% confidence intervals for the population mean.

In chapter 3 you were asked to derive critical values for the binomial sampling distribution for larger and larger numbers of trials. This was not too difficult to do for small values like 10 or even 15, but you can see how tedious and error-prone this could become for larger numbers. This concerned at least some 18th-century English and French mathematicians. Lacking today's high-speed electronic computers, they set out to reduce tedium in a different way, and so sought approximations for the binomial. The goal was to find a way for computing probability values that did not require deriving probabilities for all the separate outcomes of series of different numbers of binomial trials.

The binomial is a discrete distribution. No matter the number of trials, the number of outcomes that contain, for example, no heads, 1 head, 2 heads, ..., N heads are always whole numbers. As a result, the distribution, portrayed correctly, will always be jagged and will not be described by a smooth, continuous line. Recall the binomial distributions graphed in chapter 3, but now imagine the distributions for ever larger values of N. As N becomes large, the discreteness matters less.
The graphs for the distributions begin to look almost smooth, which suggests that as N approaches infinity, the distribution might approach a smooth line, one that could be represented with a continuous function. And if a function could be found that gave frequencies for the various outcome classes, then the entire distribution could be generated relatively easily. Desired is a function of X, symbolized as Y = f(X), that would generate all possible values of Y. The function would be supplied with the binomial parameters and produce the appropriate binomial distribution, or at least a close approximation. It is doubtful that the function would be as simple as Y = 2 + .5X, which would generate a straight line, or Y = X², a quadratic equation that would generate a parabola (or U-shaped curve). But whatever the correct function, once defined, it can be viewed as a generating device. Supplied with continuously changing values of X, it would produce (i.e., generate) the corresponding values of Y as defined by the particular function.
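The tedium that motivated the search for an approximation is easy to appreciate when the exact binomial probabilities are generated one outcome at a time. The sketch below does this with the standard binomial formula; the function name `binomial_distribution` is invented for this illustration, and P = .5 is simply the default value used here.

```python
from math import comb


def binomial_distribution(n, p=0.5):
    """Probabilities of 0, 1, ..., n successes in n independent trials."""
    q = 1 - p
    return [comb(n, k) * p**k * q**(n - k) for k in range(n + 1)]


# A rough text plot for N = 10: one row per possible number of heads.
probs_10 = binomial_distribution(10)
for heads, prob in enumerate(probs_10):
    print(f"{heads:2d} heads  {prob:.4f}  {'*' * round(prob * 100)}")

# As N grows, the largest single probability shrinks, so the steps between
# neighboring outcomes become small relative to the whole distribution.
for n in (10, 100, 1000):
    print(n, round(max(binomial_distribution(n)), 4))
```

Each call enumerates every possible outcome, which is exactly the work a smooth generating function would make unnecessary.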
7.1 THE NORMAL APPROXIMATION FOR THE BINOMIAL

The person usually credited with first describing a functional approximation for the binomial distribution is a French mathematician, De Moivre, although it is doubtful if he fully appreciated the implications of his discovery (Stigler, 1986). Writing in the 1730s, De Moivre defined a curve that has come to be called the normal. If the number of trials is large (over 50, for example, but an exact rule of thumb is discussed subsequently), then the discrete binomial can be approximated with the continuously varying curve specified by De Moivre.

This has an immediate and practical implication. As you may have noted, Table A in the statistical tables appendix (Critical Values for the Binomial Distribution) does not give values for N greater than 50. The reason is that, for values over 50, the normal approximation is usually quite sufficient. The normal distribution is well known and values for the standard normal distribution (this is defined in a few paragraphs) are widely available (see Table B, Areas Under the Normal Curve, in the statistical tables appendix). Thus it seems easier and more economical to use the normal distribution for any large-sample sign or binomial tests.

The first step is to standardize the test statistic, which is the number of trials characterized by the first of the two possible outcomes (e.g., the number of coin tosses that came up heads) and is symbolized here as T. To standardize T, we need to know (or assume) values for the three parameters that define its binomial distribution:

1. N, the total number of trials conducted.
2. P, the probability for the first outcome according to the null hypothesis under investigation (e.g., that the coin is fair, which means that P, the probability of a head, would be .5).
3. Q, the probability for the second outcome (e.g., the probability of a tail), which necessarily equals one minus P (i.e., Q = 1 - P).
In standardizing T, we compare T against all possible outcomes represented by the binomial distribution whose parameters are the values given for N and P, hence we are dealing with a population. Recall from the chapter before last (Equation 5.9) that, for a population, the definition of a standard score (omitting subscripts, and understanding μ and σ to be the mean and standard deviation for the T scores) is

$$Z = \frac{T - \mu}{\sigma} \tag{7.1}$$
This is the standardized value for T, or the binomial test Z score. In order to compute Z, we need to know both μ and σ. For the binomial distribution, the mean is

$$\mu = NP \tag{7.2}$$
This makes intuitive sense. If N were 6 and P were .5, then the mean or expected value would be 3, meaning three heads out of six tosses. Similarly, if N were 5 and P were .5, the mean value would be 2.5, or if N were 6 and P were .33, the mean value would be about 2. The standard deviation for the binomial distribution is

$$\sigma = \sqrt{NPQ} \tag{7.3}$$
This makes less intuitive sense. For now, accept that it is correct, but if you would like proof, check a mathematically oriented statistics text (e.g., Marascuilo & Serlin, 1988). In any case, we can now rewrite Equation 7.1, the binomial test Z score, as follows:

$$Z = \frac{T - NP}{\sqrt{NPQ}} \tag{7.4}$$
If N is large, statisticians have proved that the Z defined by Equation 7.4 will be distributed approximately normally. That is, if the population parameters are as claimed, then the sampling distribution for Z should be the normal curve first described by De Moivre. The parameter N is determined by our procedures, Q is determined by P, and P is determined by the null hypothesis we wish to test. Now, and this is the point of the foregoing discussion, we have a way of determining if the null hypothesis is tenable, given the value actually observed for T. For example, if N = 86 and P = .5, we would expect that half of the trials, or 43, would result in successes. If the actual number of successes were 53 then, applying Equation 7.4:

$$Z = \frac{53 - 43}{\sqrt{86 \times .5 \times .5}} = \frac{10}{4.64} = 2.16$$
Five percent of the area under the normal curve is demarcated with a Z value of 1.96 (two-tailed), and 2.16 is bigger than that; thus we conclude that a value as extreme as 53 would occur by chance alone less than 5% of the time, if the present sample were drawn from a population whose value for P is .5. (The most common null hypothesis for binary trials assumes that P = .5 and for that reason our discussion of the sign or binomial test in chapter 3 was confined to such cases. Some null hypotheses may demand other values, however, and equation 7.4 can easily accommodate them.) As already noted, the normal approximation for Z requires a large value of N. But how large is large? It depends not just on N, but on P as well. The usual rule of thumb is provided by Siegel (1956). He suggested that if the value of NPQ is at least 9, then the approximation provided by the normal will fit the true binomial distribution adequately enough for statistical hypothesis testing. This means that
if P = .5, then N could be as low as 36 (36 × .5 × .5 = 9) but that if P = .8, then N should be at least 57 (57 × .8 × .2 = 9.12). This rule of thumb, however, only establishes minimum standards; the approximation becomes better of course for larger values of N.

The preceding paragraphs have introduced some useful material. If a study consists of a series of trials (e.g., if several different individuals are assessed), and if there are only two possible outcomes for each trial (e.g., a patient gets better or fails to get better), then the tenability of a particular value for P can be evaluated. This value predicts how many subjects should improve according to the null hypothesis. The actual number of subjects who improved can then be compared against this predicted value. The discrepancy between observed and predicted is standardized (i.e., divided by its standard deviation) and the resulting Z score is compared against the normal distribution. The assumption that such Z scores are approximately normally distributed, however, is reasonable only if N is sufficiently large (specifically, the product NPQ should be at least 9).

If the computed Z score is extreme, that is, if it falls in the region of rejection as defined by the alpha level and type of test (one- versus two-tailed), then the null hypothesis is rejected. We conclude that the sample of study subjects was probably drawn from a population whose value for P is not the value we derived from the null hypothesis and then used to compute the Z score. As our alternative hypothesis suggests, probably the true population P is a different value. To return to an example used earlier, if 86 patients were tested, and if 53 got better, we would conclude that this excess of 10 over the expected 43 was not likely a chance happenstance.
Similarly, if only 33 got better, 10 fewer than expected, we would again conclude that this was likely not just chance (assuming a two-tailed test). It seems reasonable to conclude, therefore, that the treatment made a difference.

Exercise 7.1 The Binomial Test Using the Normal Approximation

This exercise provides practice in using the large-sample normal approximation for the binomial test.

1. Draw a graph, using your spreadsheet program if you like (this would be what spreadsheet programs usually call an XY graph). Label the X axis with values of P from 0 to 1. Label the Y axis N. Using Siegel's rule of thumb, compute enough values of N for various values of P so you can draw the graph. There are an infinite number of values for P; you need only select enough to give you confidence that you have accurately portrayed the shape of the graph. In drawing lines between the points you plotted, what assumptions did you make? You probably drew a smooth graph; is this justified? Should it be jagged instead? Why? What is the shape of your graph? It should be symmetric; why must it be?
2. The critical values for the normal distribution are:
   Two-tailed, alpha = .05: Z = ±1.96
   Two-tailed, alpha = .01: Z = ±2.58
   Memorize these values; you will need them later as well as for this and the next few exercises. Now, if N = 40, P = .5, and T = 26, what is the value of Z? Would you reject the null hypothesis, alpha = .05, two-tailed? Why or why not? If T = 28? If T = 30? Again, why or why not?
3. Now answer the same questions as in part 2, but for an alpha level of .01.
4. If N = 40, P = .75, and T = 25, would you reject the null hypothesis, alpha = .05, two-tailed? Why or why not? If T = 35? If T = 23? If T = 37?
5. If P = .5, alpha = .05, two-tailed, and N = 80, what values of T would allow you to reject the null hypothesis? What values would not allow you to reject?
6. Now answer part 5, but for N = 100 and N = 200.
7. If P = .2, alpha = .05, two-tailed, and N = 80, what values of T would allow you to reject the null hypothesis? What values would not allow you to reject?
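The binomial test Z score and Siegel's rule of thumb described in this section can be sketched in a few lines. This is an illustration under stated assumptions rather than a definitive implementation: the function names `binomial_z` and `siegel_min_n` are invented here, the Z score is computed as (T - NP) divided by the square root of NPQ, and the rule of thumb is taken to mean that NPQ should be at least 9. The example values (N = 86, T = 53, P = .5) are the ones used in the text.

```python
from math import ceil, sqrt


def binomial_z(t, n, p):
    """Binomial test Z score: (T - NP) / sqrt(NPQ), with Q = 1 - P."""
    q = 1 - p
    return (t - n * p) / sqrt(n * p * q)


def siegel_min_n(p, threshold=9):
    """Smallest N for which NPQ reaches the threshold (Siegel's rule of thumb)."""
    q = 1 - p
    # round() guards against floating-point noise such as 100.00000000000003.
    return ceil(round(threshold / (p * q), 9))


z = binomial_z(53, 86, 0.5)
print(f"Z = {z:.2f}")
print("Reject at alpha = .05, two-tailed:", abs(z) > 1.96)
print("Minimum N when P = .5:", siegel_min_n(0.5))
print("Minimum N when P = .8:", siegel_min_n(0.8))
```

Because the rule depends on the product PQ, which is largest when P = .5, the minimum N grows as P moves toward 0 or 1.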
7.2 THE NORMAL DISTRIBUTION

In the preceding section we referred to the normal distribution and showed one way it could be used, but we have yet to describe it in any detail. There is some historical justification. Only early in the 19th century did Pierre Simon Laplace and Carl Friedrich Gauss, in a series of seminal papers, integrate De Moivre's earlier work, Legendre's method of least squares, Laplace's central limit theorem (discussed in the next section), and the normal curve into a synthesis that provided the basis for modern statistics.
Historical Considerations

Primarily problems of astronomy and geodesy (the geological science concerned with the size and shape of the earth), in addition to sheer intellectual pleasure, motivated these early statistical theorists. One practical problem was how to combine several observations of the same astronomical or geodetic phenomenon, and from consideration of this problem both the method of least squares (discussed in chap. 5) and the realization that errors were distributed in a particular way resulted. This distribution later became called the normal or Gaussian distribution. Intended at first as a description for errors of observation, only later in the 19th century did investigators realize how common, or normal, it was.

Adolphe Quetelet, in particular, became entranced by the normal curve and demonstrated repeatedly that it could be found throughout nature. His fame in statistics, Stigler (1986) wrote, "depends primarily on two hypnotic ideas: the concept of the average man, and the notion that all naturally occurring distributions of properly collected and sorted data follow a normal curve" (p. 201). As noted briefly in chapter 5, Quetelet analyzed chest measurements for Scottish soldiers. These were normally distributed, which Quetelet interpreted as evidence that nature was striving for an ideal type, the average man. The ideal is the mean; deviations from it can only be due to accidental causes, to "errors." Quetelet is hardly the first person in the history of science to become enamored of one simple idea to the detriment of others, and the notion that the mean has some mystical force or moral imperative can still be found. Fortunately, the normal curve can be understood without recourse to underlying mystical principles.
The Normal Function

The normal curve is a particular curve. It is not any bell-shaped curve, as students sometimes think, but a curve defined by a particular function. The function is not especially simple. It involves pi (π), which equals approximately 3.1416, the base of the natural logarithm (e), which equals approximately 2.7183,
and a negative exponent. Like the binomial, the normal curve is actually a family of curves. The parameters for the normal curve are mu (μ) and sigma (σ), the population mean and standard deviation, respectively, and each pair of parameters generates a particular instance of a normal curve. The function that defines the normal curve is:

$$Y = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-(X - \mu)^2 / (2\sigma^2)} \tag{7.5}$$
(X can range from minus to plus infinity.) One instance of the normal curve is so useful and is used so often that it has its own name. When the mean is zero and the standard deviation is one, the resulting curve is called the standard normal. Substituting μ = 0 and σ = 1 in Equation 7.5 simplifies it somewhat and yields the definition of the standard normal curve:

$$Y = \frac{1}{\sqrt{2\pi}} \, e^{-Z^2 / 2} \tag{7.6}$$
(Again, Z can range from minus to plus infinity.) Any normal distribution can be transformed into the standard normal just by rescaling the X axis, in effect forcing the mean to be zero and the standard deviation one. To standardize the values on the X axis, the difference between each value and the mean would be divided by the standard deviation. By definition, standardizing produces a distribution whose mean is zero and whose standard deviation is one. The curve defined by Equation 7.6 (i.e., the standard normal curve) is depicted in Fig. 7.1.

The function for the normal distribution may appear complex or even intimidating. In practice, however, this formula is almost never used. Tables giving the area under the standard normal curve to the left and/or right of various values of Z are used instead. If you look at the standard normal table provided here (see Table B in the statistical tables appendix), you will find that the area to the left of Z = -1 is .1587 and to the left of Z = +1 is .8413. This means that the area to the right of Z = +1 is .1587, that the area either less than -1 or greater than +1 is .3174, and that the area between -1 and +1 is .6826. In other words, if scores are normally distributed, a little over two-thirds of them will be within a standard deviation of the mean, and just under one-third will be greater than a standard deviation from the mean.
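Although tables are the usual tool, the areas quoted above can be checked with a short computation. The sketch below assumes that the normal function referred to in this section is the usual normal density, with the standard normal as the special case of mean 0 and standard deviation 1, and it obtains areas from the error function in Python's `math` module rather than from Table B. The function names are invented for this illustration.

```python
from math import erf, exp, pi, sqrt


def normal_density(x, mu=0.0, sigma=1.0):
    """Height of the normal curve at x; defaults give the standard normal."""
    return exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * sqrt(2 * pi))


def area_left_of(z):
    """Area under the standard normal curve to the left of z."""
    return 0.5 * (1 + erf(z / sqrt(2)))


left_minus1 = area_left_of(-1)
left_plus1 = area_left_of(1)
between = left_plus1 - left_minus1

print(f"Height at Z = 0:        {normal_density(0):.4f}")
print(f"Area left of Z = -1:    {left_minus1:.4f}")
print(f"Area left of Z = +1:    {left_plus1:.4f}")
print(f"Area between -1 and +1: {between:.4f}")
```

The area between -1 and +1 computed this way may differ from the .6826 in the text in the fourth decimal place, because the text's figure comes from subtracting two rounded table entries.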
μ σ
Note 7.1 As noted earlier, the population mean is usually represented with a lower case Greek mu and the sample mean with M or X-bar. The population standard deviation is usually represented with a lower case Greek sigma and the sample standard deviation with S or SD.
FIG. 7.1. The standard normal distribution.
Underlying Circumstances

The underlying circumstances that produce normally distributed numbers are not difficult to understand. Whenever the thing measured (e.g., the chest circumference of Scottish soldiers, the height of college freshmen, the length of bats' wings, the IQ of military recruits) is the result of multiple, independent, random causes, the scores will be normally distributed. That is, their distribution will fit the form of the function described in the preceding paragraph. This is not mysterious, but simply a straightforward mathematical consequence. These circumstances necessarily produce normally distributed scores. And because so many phenomena are indeed the result of multiple, independent, random genetic and environmental influences, normally distributed scores are frequently encountered.

Occasionally students think that, for example, IQ scores are normally distributed because of the way IQ tests are constructed. This is not true. Although the mean and standard deviation can be manipulated, the shape of the distribution cannot be changed. As noted in chapter 5, standard scores are distributed the same way the scores were distributed before standardization. Standardization only changes the scale markers on the X axis, replacing the mean of the raw scores with 0 (100 for IQ scores) and labeling one standard deviation below and above the mean -1 and +1 respectively (85 and 115 for IQ scores). To reiterate, standardization does not change the shape of a distribution.

Moreover, we should remember that not all scores are normally distributed. For example, the distribution for individual annual income is quite skewed: Many people make very little, and only a few people make a lot. However, when circumstances are right, that is, when a phenomenon is the result of many independent, random influences, then scores will be normally distributed. Of course, this is a theoretical statement. The distribution of a small sample of scores might not
appear normal, just due to the luck of that particular draw or sampling error, but if enough scores were sampled, the empirical distribution would closely approximate the theoretically expected normal distribution.
Normal and Rectangular Distributions Compared Why is any of this useful? Playing a bit with popular imagination, imagine we have developed a test to detect aliens from outer space, something like the tests used by British intelligence during World War II to trip up German spies who were attempting to pass themselves off as English. Assume we know the theoretical distribution of scores for terrestrials, that is, we know the shape of the distribution as well as its mean and standard deviation. Now, even if we do not know how aliens perform, confronted with a single score, we can at least determine how likely that score is if the person tested is a terrestrial. Assume, for example, that possible scores for the test are 80, 81, 82, ..., 120, and each of the 41 possible scores is equally likely. This represents not a normal but a rectangular distribution. In this case, the probability for a score as high as 120 would be 1/41 or .0244 (one-tailed) and the probability for scores as extreme as 80 or 120 would be 2/41 or .0488 (two-tailed). The mean for this distribution is 100 and its standard deviation is 11.8. Thus the standard score for a raw score of 120 is 1.69. If the scores for the test had been normally distributed instead, the probability of a standard or Z score as high as 1.69 would have been .0455 (one-tailed) and the probability of a Z score as extreme as 1.69 would have been .0909 (two-tailed; see Table B in the statistical tables appendix). This example was introduced for three reasons. First, it demonstrates that a standard score by itself does not tell us how probable the corresponding raw score is: We need to know the shape of the distribution as well. In addition, it provides yet another demonstration of the logic of hypothesis testing, in which the probability for a particular value of a test statistic is determined with reference to the appropriate theoretical sampling distribution.
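The comparison just described can be checked numerically. The following is a minimal Python sketch, not part of the original text: it uses only the standard library, the variable names are illustrative, and the normal-curve areas come from `statistics.NormalDist` rather than from Table B, so the last digit may differ slightly from a table lookup.

```python
from statistics import NormalDist, mean, pstdev

# Rectangular distribution: each score 80, 81, ..., 120 is equally likely.
scores = list(range(80, 121))

m = mean(scores)       # mean of the 41 possible scores
sd = pstdev(scores)    # standard deviation, dividing by N (not N - 1)
z = (120 - m) / sd     # standard score for a raw score of 120

# Rectangular case: count the scores at least as extreme as 120 (or 80).
p_rect_one = 1 / len(scores)
p_rect_two = 2 / len(scores)

# Normal case: area under the standard normal curve beyond z.
p_norm_one = 1 - NormalDist().cdf(z)
p_norm_two = 2 * p_norm_one

print(f"M = {m}, SD = {sd:.1f}, Z = {z:.2f}")
print(f"rectangular: one-tailed {p_rect_one:.4f}, two-tailed {p_rect_two:.4f}")
print(f"normal:      one-tailed {p_norm_one:.4f}, two-tailed {p_norm_two:.4f}")
```

The same standard score of about 1.69 corresponds to quite different probabilities under the two shapes, which is the first of the three points made above.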
Finally, it lays some groundwork for the next exercise and for discussion of the central limit theorem, which is presented in the next section.

Exercise 7.2 The Normal Distribution
Primarily, this exercise provides practice using the table of areas under the normal curve (Table B in the statistical tables appendix).
1. As in the example just presented, assume M = 100, S = 11.8, but that scores are normally distributed. What proportion of scores are greater than 105? Greater than 115? Less than 80? What proportion differ from the mean by more than 3?
2. For normally distributed scores, what proportion are greater than two standard deviations above the mean? Greater than three standard deviations? More extreme than two standard deviations from the mean? More extreme than three standard deviations?
3. Assuming normal distributions, what are the critical values of Z for alpha = .05 and for alpha = .01 for one-tailed tests? For two-tailed tests?
4. Why might it make sense to regard any standardized scores less than -3 or greater than 3 as outliers?
5. For the rectangular distribution just presented, M = 100 and SD = 11.8. What proportion of the scores are greater than .678 of a standard deviation above the mean? Are within .678 of a standard deviation from the mean?
6. (Optional) For this rectangular distribution, whose scores ranged from 80 to 120, why does SD = 11.8?
7.3
THE CENTRAL LIMIT THEOREM What has come to be called the central limit theorem (central in this context means fundamental, not middle) was first described by Laplace in 1810 and represented a major advance over De Moivre's work a century earlier. Laplace proved the following theorem: Any sum or mean (not just the number of successes in N trials) will be approximately normally distributed if the number of terms is large. The larger N is, the better the approximation. In theory, as the limit approaches infinity, the approximation approaches perfection. But long before that, the approximation becomes good enough to be useful. The implications for statistical analysis are profound. If we draw a single score at random, unless we know exactly how those scores are distributed in the population, we have no basis for determining the probability of drawing that score. But if we draw a sample of scores and compute the mean for the sample, we now know how sample means (if not individual scores) should be distributed. According to the central limit theorem, no matter how the raw scores in a given population are distributed, sample means will be distributed normally. For example, no matter whether the distribution of raw scores is rectangular, or bimodal, or whatever, if we drew thousands of samples of size N, computed the mean for each, and graphed the distribution, the distribution of the sample means would approximate a normal distribution. Given what you learned about the normal distribution in the preceding section, this makes sense. Earlier we defined the mean as the sum of the raw scores divided by N (Equation 5.1). But this is simply an economical way of saying that the mean is a weighted summary score. Each raw score is multiplied (or weighted) by a weighting coefficient, which for the mean is 1 divided by N for all scores, and the resulting products are summed. Symbolically, the weighted summary score formulation for the mean is:
$$ M = \frac{1}{N}X_1 + \frac{1}{N}X_2 + \cdots + \frac{1}{N}X_N = \sum_{i=1}^{N} \frac{1}{N} X_i $$
When drawing a sample of N scores, each X can be regarded as representing an independent, random influence. If this is so, then of course the summary score (or mean) for these multiple scores will be distributed normally. After all, the mean simply combines into one score each of the presumably independent, random separate scores.
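The claim that sample means drift toward a normal distribution regardless of how the raw scores are distributed can be explored with a small simulation. The sketch below is an illustration only, written in Python with the standard library. It assumes the rectangular population of scores 80 to 120 from the preceding section; the seed, the number of samples, and the function names are arbitrary choices, not anything prescribed by the text.

```python
import random
from statistics import mean, pstdev

random.seed(7)  # arbitrary seed so that a run can be repeated

population = list(range(80, 121))  # rectangular: 80..120 equally likely
pop_mean = mean(population)
pop_sd = pstdev(population)

def sample_means(n, draws=5000):
    """Means of `draws` random samples of size n, drawn with replacement."""
    return [mean(random.choices(population, k=n)) for _ in range(draws)]

def share_within_one_sd(values):
    """Proportion of values lying within one SD of their own mean."""
    m, sd = mean(values), pstdev(values)
    return sum(abs(v - m) <= sd for v in values) / len(values)

for n in (1, 2, 8, 32):
    means = sample_means(n)
    print(f"N = {n:2d}: mean of means = {mean(means):6.2f}, "
          f"SD of means = {pstdev(means):5.2f}, "
          f"within 1 SD = {share_within_one_sd(means):.3f}")
```

For a normal distribution roughly 68% of values lie within one standard deviation of the mean, whereas for the rectangular raw scores the share is noticeably lower. As N grows, the share for the simulated means should move toward the normal figure while the spread of the means shrinks.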
7.4
THE t DISTRIBUTION As noted earlier in this chapter, for only 4, 5, or even 10 trials, the normal is not a very good approximation for the binomial distribution. However, as the number of trials increases, the approximation becomes increasingly better. As a general rule of thumb, and assuming the value of P is not extremely small or extremely large, the normal approximation for the binomial is usually good enough to be practically useful if N is greater than 50, certainly if N is greater than 100. And in theory, as N approaches infinity, the binomial comes ever closer to being exactly like the normal. The same is true for the distribution of sample means. According to the central limit theorem, sample means are distributed approximately normally, and the approximation becomes better as the sample size becomes larger. The general rule of thumb is the same. The normal approximation for the distribution of sample means is good enough to be practically useful if the sample size (or N) is greater than 50. And as with the binomial, as N approaches infinity, the distribution of sample means comes ever closer to being exactly like the normal. For smaller numbers of trials, sampling distributions can be constructed for the binomial, which is exactly what you did for an exercise in chapter 3. The distribution of sample means for smaller sample sizes is likewise known. The distribution is called the t distribution and has been extremely important in the history of statistics. It can be applied to many problems and tests, and indeed many introductory texts in statistics spend considerable time developing the many applications of t tests. As noted at the end of chapter 4, we have elected to emphasize other, more general techniques that accomplish many of the same ends, but nonetheless students should be familiar with the t distribution. 
In any case, when dealing with small samples (i.e., samples for which N is less than 50, certainly less than 30), it is essential to use the t distribution for solving the two kinds of problems described in the next two sections: single-sample tests and 95% confidence intervals. The t distribution itself looks very much like the normal distribution, only flatter. Imagine the normal distribution is a tent, held up with a single wire. As the sample size decreases, the wire is lowered and the tent spreads out, assuming the shape of the t distribution for smaller and smaller N. The tails for the t distribution extend further than the tails for the normal distribution, so critical values for t (values that demarcate the most extreme 5% or 1% of the area under the curve) are larger than the corresponding values would be for the normal distribution; they become larger as N becomes smaller. For example, the alpha .05 nondirectional critical value for the normal distribution is 1.96. The corresponding value for a t distribution with a sample size of 30 is 2.05, for N = 20 it is 2.09, and for N = 10 it is 2.26. (See Table C, Critical Values for the t Distribution, in the statistical tables appendix. In these cases, degrees of freedom are N - 1, so for N = 30, df = 29; for N = 20, df = 19; and for N = 10, df = 9. What degrees of freedom are and how they are determined is explained in chapter 10.)
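Readers without Table C at hand can approximate these critical values directly. The following Python sketch is an illustration under stated assumptions, not the method used to build the table: it integrates the standard t density numerically (Simpson's rule) and searches for the cutoff by bisection, using only the standard library. The function names are hypothetical, and because `math.gamma` overflows for very large arguments the sketch is meant for small-sample degrees of freedom only.

```python
import math

def t_pdf(x, df):
    """Height of the t distribution with df degrees of freedom at x."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def area_zero_to(x, df, steps=2000):
    """Area under the t curve between 0 and x (Simpson's rule, even step count)."""
    h = x / steps
    total = t_pdf(0, df) + t_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return total * h / 3

def t_critical(df, alpha=0.05):
    """Two-tailed critical value: the area between -t and +t equals 1 - alpha."""
    target = (1 - alpha) / 2  # area between 0 and the critical value
    lo, hi = 0.0, 100.0
    for _ in range(60):  # bisection on the cutoff
        mid = (lo + hi) / 2
        if area_zero_to(mid, df) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

for n in (10, 20, 30):
    print(f"N = {n}, df = {n - 1}: critical t = {t_critical(n - 1):.2f}")
```

Printing the values for N = 10, 20, and 30 gives a quick check against the figures quoted above, and calling `t_critical(df, alpha=0.01)` gives the corresponding .01 cutoffs.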
7.5
SINGLE-SAMPLE TESTS An understanding of the central limit theorem and the normal and t distributions allows you to perform what are usually called single-sample tests, or tests that determine if a single sample was likely drawn from a particular population. Consider the lie detection study. The only data available to us are the number of lies detected for the 10 subjects in the lie detection study sample. From these sample data, we can compute the sample mean (M = 5.3) and standard deviation (SD = 2.00). Imagine our null hypothesis states that the sample is drawn from a population whose mean is 7 (so H0: μ = 7). Is the sample mean of 5.3 sufficiently different from the assumed population mean of 7 to render the null hypothesis that the sample was drawn from this particular population untenable? The statistic we have in hand is a single sample mean, M. What we want to know is whether or not M is especially deviant, compared to the distribution of sample means. In other words, if we drew thousands of samples of 10 scores each from a population whose mean score is 7, computed the mean for each sample, and examined the distribution of these sample means, would a value of 5.3 be near the center of this distribution of values for sample means or far out in one of the tails? We do know from the central limit theorem that, no matter how the raw scores were distributed in the population, the distribution of sample means will be approximately normal, or in this case distributed as t with N - 1 or 9 degrees of freedom. Hence, if we standardize the sample mean, we can evaluate the resulting standard score using the t distribution, which will tell us how likely a mean score of 5.3 is, if the population raw score mean is 7. The sample mean is standardized as follows:

$$ Z_M = \frac{M - \mu_M}{\sigma_M} \qquad (7.8) $$
It is important to remember that we are dealing with means, not with raw scores. Thus ZM indicates a standard score for means, not raw scores. And in order to compute it we need to know the mean for the distribution of sample means (symbolized μM), which conceptually is not the same as the mean for the scores in the population (symbolized μ), and we need to know the standard deviation for the distribution of sample means (symbolized σM), which is not the same as the standard deviation for the scores in the population (which is symbolized σ). If we know the mean and standard deviation for scores in the population, then the mean and standard deviation for the distribution of sample means can be determined. It seems reasonable to assume (and statisticians can prove) that the mean for the distribution of sample means is the same as the mean for the scores in the population. In other words,

$$ \mu_M = \mu \qquad (7.9) $$
In the present case, because μ is assumed to be 7 (by our null hypothesis), we can assume that μM is also 7. In contrast, we would expect the standard deviation for sample means to be less than the standard deviation for the population scores. If we draw a sample of 10 scores, for example, some will likely be less than the mean, and some will be larger, in effect canceling each other out. As a result, the values for sample means will tend to cluster more tightly around the mean than the raw scores, often by quite a bit.
The Standard Error of the Mean What is needed is a way of estimating the standard deviation for sample means (σM), which is called the standard error of the mean. We know that the standard error of the mean depends on the sample size. If we draw hundreds of samples of size 20, for example, their means will cluster more tightly around the population mean than if we had drawn samples of size 10. We also know that the standard error of the mean will be less than the population standard deviation (σ). Statisticians have proven to their satisfaction that an accurate estimate of the standard error of the mean is the population standard deviation divided by the square root of the sample size:

$$ \sigma_M = \frac{\sigma}{\sqrt{N}} \qquad (7.10) $$
However, how do we determine the value of σ in the first place? For single-sample tests, the value of μ is assumed (determined by the null hypothesis) but the value of σ is typically estimated from the sample. We already know how to compute the sample standard deviation (see Equation 5.6). It is the square root of the quotient of the sum of the deviation scores squared (symbolized SS for sum of squares) divided by N (the sample size):

$$ SD = \sqrt{\frac{SS}{N}} \qquad (7.11) $$
The population standard deviation, as estimated from the sample, however, is not quite the same (see Equation 5.7). It is the square root of the sum of squares divided by N - 1:

$$ SD' = \sqrt{\frac{SS}{N - 1}} \qquad (7.12) $$
Because N - 1 is smaller than N, the estimated value for the population standard deviation will be somewhat larger than the value computed for the sample standard deviation. This is explained in chapter 10. For now, note that although the sample standard deviation or SD is 2.00 (the square root of 40.1 divided by 10) for the 10 data points from the lie detection study, the estimated population standard deviation or SD' is 2.11 (the square root of 40.1 divided by 9). Now we can compute the standard deviation for the distribution of sample means. From Equation 7.10, the standard error of the mean (σM) equals σ (estimated as 2.11) divided by the square root of N. Thus, letting SD'M represent the estimated value for the standard error of the mean,

$$ SD'_M = \frac{SD'}{\sqrt{N}} \qquad (7.13) $$
For the data from the lie detection study,

$$ SD'_M = \frac{SD'}{\sqrt{N}} = \frac{2.11}{\sqrt{10}} = .667 $$
(computed retaining unshown significant digits for SD'). Or, combining Equations 7.10 and 7.12 and simplifying algebraically:

$$ SD'_M = \sqrt{\frac{SS}{N(N - 1)}} \qquad (7.14) $$
Again, the result for our current example is .667 (the square root of the quotient of 40.1 divided by 90). This formula (Equation 7.14) for the standard error of the mean is convenient when the sum of the squared deviation scores is readily available. Finally we can answer the question posed at the beginning of this section: Is the sample whose mean is 5.3 likely drawn from a population whose mean is 7? The standard score representing the deviation of the sample mean from the assumed population mean (see Equation 7.8) is 5.3 minus 7 divided by 0.667, the standard error of the mean (recall that SD'M is the estimated value for σM). Hence:

$$ Z_M = \frac{M - \mu_M}{SD'_M} = \frac{5.3 - 7}{.667} = -2.55 $$
The sample size is small (N = 10); thus ZM will be distributed as t in the preceding equation, so we refer to Table C, not Table B, in the statistical tables appendix. According to this table, the critical value for t (alpha = .05, two-tailed) with nine degrees of freedom is 2.26. Therefore a sample mean as small as 5.3, whose standardized value is -2.55 (and is distributed as t with nine degrees of freedom), would occur less than 5% of the time if the population mean were in fact seven. We conclude that the sample is likely not drawn from the population assumed by the null hypothesis. As a second example of a single-sample test, let us return to our earlier problem of alien identification, assuming this time that instead of sampling just one individual, we sample groups of eight. In this case, we know that the population from which we are sampling is homogeneous, either all aliens or all terrestrials. As before, we know that the population mean and standard deviation for terrestrials are 100 and 11.8, respectively (in this case, the population standard deviation is given, not estimated from a sample). If the mean for the sample is 94, should we reject the null hypothesis (alpha = .05, two-tailed) that the sample is drawn from a population of terrestrials? The mean for the distribution of sample means is

$$ \mu_M = \mu = 100 $$
and the standard deviation for the distribution of sample means is

$$ \sigma_M = \frac{\sigma}{\sqrt{N}} = \frac{11.8}{\sqrt{8}} = 4.17 $$
Thus the standardized value for the deviation of the sample mean from the population mean is:

$$ Z_M = \frac{M - \mu_M}{\sigma_M} = \frac{94 - 100}{4.17} = -1.44 $$
Again, because the sample size is small, ZM is distributed as t (and we could have replaced ZM with t in the equation). According to Table C in the statistical tables appendix, the critical value for t (alpha = .05, two-tailed) with N - 1 or seven degrees of freedom is 2.37. Thus a value of -1.44 is not sufficient to reject the null hypothesis. Probably the sample consists of terrestrials.
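The arithmetic for the two single-sample examples above can be retraced in a few lines. This Python sketch is only an illustration: the sums, means, and critical values are the ones quoted in the text (the critical values are copied from Table C rather than computed), and the function and variable names are hypothetical.

```python
import math

def standardized_mean(sample_mean, pop_mean, std_error):
    """Deviation of a sample mean from the assumed population mean, in standard-error units."""
    return (sample_mean - pop_mean) / std_error

# Example 1: lie detection sample; population SD estimated from the sample.
n1, m1, ss1, mu1 = 10, 5.3, 40.1, 7
sd1 = math.sqrt(ss1 / n1)                  # sample standard deviation, SD
sd_est1 = math.sqrt(ss1 / (n1 - 1))        # estimated population SD, SD'
se1 = sd_est1 / math.sqrt(n1)              # standard error from SD'
se1_alt = math.sqrt(ss1 / (n1 * (n1 - 1))) # same standard error computed from SS
t1 = standardized_mean(m1, mu1, se1)

# Example 2: alien identification; population SD is given.
n2, m2, mu2, sigma2 = 8, 94, 100, 11.8
se2 = sigma2 / math.sqrt(n2)
t2 = standardized_mean(m2, mu2, se2)

# Two-tailed critical values for alpha = .05 as quoted from Table C.
critical = {9: 2.26, 7: 2.37}

print(f"Example 1: SD = {sd1:.2f}, SD' = {sd_est1:.2f}, SE = {se1:.3f}, t = {t1:.2f}, "
      f"reject H0: {abs(t1) > critical[n1 - 1]}")
print(f"Example 2: SE = {se2:.2f}, t = {t2:.2f}, "
      f"reject H0: {abs(t2) > critical[n2 - 1]}")
```

Computing the standard error both ways (`se1` and `se1_alt`) is a small check that the two formulas for the estimated standard error agree.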
Note 7.2
M      The sample mean. Computed by summing the scores in the sample and dividing by N.
μ      The population mean. It can be estimated by the sample mean, or its value can be assumed based on theoretical considerations.
μM     The mean of the distribution of sample means. It can be estimated by the population mean.
SD     The sample standard deviation. Computed by taking the square root of the sum of the squared deviation scores divided by N.
σ      The population standard deviation. It can be estimated by taking the square root of the quotient of the sum of the squared deviation scores divided by N - 1. This estimate is symbolized SD'.
σM     The standard error of the mean, or the standard deviation for the distribution of sample means. It can be estimated by dividing σ by the square root of N, or by dividing SD' by the square root of N, or by taking the square root of the quotient of the SS divided by N times N - 1. This estimate is symbolized SD'M.
t      A test statistic similar to Z but for small samples. The shape of the t distribution is similar to the normal distribution but flatter. As a result, critical values for t are larger than corresponding critical values for Z.
Exercise 7.3 Single-sample tests
This exercise provides practice in single-sample tests and in computing the standard error of the mean.
1. If the alien identification sample size were 100 instead of 8, what would be the value for the standard error of the mean? What is the critical value for t (alpha = .01, two-tailed)? Would you conclude that the sample consisted of aliens or terrestrials?
2. Assume the sample size for the lie detection study is sufficiently large that you decide to use the standard normal (Table B) instead of the t table (Table C). Assume further that the standard error of the mean retains the value .667, so that ZM retains the value -2.55 computed earlier. What percentage of the time would a sample mean as low as 5.3 occur if the population mean really were 7?
3. A sample consists of six scores: 32, 25, 45, 38, 52, and 29. How likely is it that this sample was drawn from a population whose mean is 26? Assume a two-tailed test, alpha = .05. What would your decision be if alpha were .01?
4. A second sample likewise consists of six scores: 32, 25, 97, 38, 52, and 29. How likely is it that this sample was drawn from a population whose mean is 26 (alpha = .05, two-tailed)?
5. As in the last example given in the text, assume the population mean is 100 and the population standard deviation is 11.8. If the sample size were 16 (instead of 8), what values for the sample mean would allow you to reject the null hypothesis (alpha = .05, two-tailed)? If the sample size were 32? 64? As sample size gets larger, what happens to the value of the standard error of the mean?
6. The standard error of the mean is often used in conjunction with bar graphs to indicate variability. Imagine you have selected two samples, one of five boys and one of seven girls, and have determined their weights. The mean weight is 72 pounds for the boys and 64 pounds for the girls. The population standard deviations, estimated from the sample data, are 9 for the boys and 8 for the girls. Compute standard errors of the mean for both the boys and the girls. Then create a bar graph. One bar will represent the boys, one the girls; the Y axis will indicate pounds. From the middle of the top of each bar draw two lines, one extending up and one extending down, each marking off exactly one standard error of the mean. Draw small horizontal lines to demarcate the top and bottom of these error bars. Adding error bars to a conventional bar graph is desirable because of the graphic way they indicate variability and remind the viewer that the top of the bar is just the mean obtained from one sample, but other samples might locate the mean differently. (See Fig. 7.2 for an example.)
7. (Optional) Using the functional definition for the standard normal distribution (Equation 7.6), set up a spreadsheet to compute Y values for values of X that range from -4.0 to +4.0 in steps of 0.2. Now graph this normal distribution (using your spreadsheet's graphing capability, if it has one). Your graph should look like Fig. 7.1. Note that probability is associated with the area under the curve; the exact height of the curve at various points is not especially meaningful in this context.
8. (Optional) Starting with Equation 7.10, provide an algebraic derivation for Equation 7.14.
9. (Optional) A third definition for the estimated standard error of the mean or SD'M (in addition to Equations 7.13 and 7.14) is the sample standard deviation divided by the square root of N - 1. Again starting with Equation 7.10, provide an algebraic derivation for this formula.
At this point, a confession is in order. In actual practice, single-sample tests are rarely used. They were introduced here, as they typically are in statistical texts, because of the particularly clear way they demonstrate principles of statistical inference and the use of the normal and t distributions. The standard error of the mean, however, is often used in conjunction with bar graphs to indicate variability, as is demonstrated in the next exercise.
FIG. 7.2. A bar graph of the group means described in Exercise 7.3 showing error bars. The distance the error bars extend above and below a particular group mean represents an estimate of the standard error of the mean computed from data for that group.
7.6
NINETY-FIVE PERCENT CONFIDENCE INTERVALS Usually error bars indicate one standard error of the mean above and below the sample mean, but other intervals could be used. A particularly useful one is the 95% confidence interval. The interval is based on data from a single sample, but if we drew sample after sample repeatedly, 95% of the time the true population mean should fall within the indicated confidence intervals. Confidence intervals, whether or not displayed graphically, are valuable for the same reason that error bars are valuable: They serve to remind us that information gleaned from sample data provides probabilistic and not absolute information about the population. We may never know the value of the population mean with absolute certainty, but at least we can say that there is a 95% chance that the true population mean falls within the interval indicated. You know from the central limit theorem that sample means are distributed approximately normally and means for small samples follow the t distribution. You also know that the critical values for t, alpha = .05, two-tailed, given in Table C in the statistical tables appendix, demarcate the 5% of the area under the curve in the tails from the 95% of the area in the center under the hump, and that for large samples, this critical value is 1.96 (from Table B, the standard normal table). The t values (or Z values, for large samples) are standardized, which is why you divided by the standard error of the mean when computing them. In order to compute 95% confidence intervals, you need to reverse the process, multiplying (instead of dividing) by the standard error of the mean. Remember that t or Z scores are in standard units. By definition, the number of raw score units that correspond to one standard unit is the value of the standard error of the mean. 
Thus the upper 95% confidence limit, expressed in whatever units were used for the raw scores in the first place, is the mean plus the standard error of the mean multiplied by 1.96 for large samples, or multiplied by the appropriate value from Table C for small samples; the lower 95% confidence limit is the mean minus the standard error of the mean times the appropriate value. Symbolically (where CI stands for confidence interval),

$$ 95\%\ \mathrm{CI} = M \pm SD'_M \times t_{(df),\ .05,\ \text{two-tailed}} $$

Thus
95% error bar = SD'M × t(df).05,two-tailed
Lower 95% confidence limit = M - SD'M × t(df).05,two-tailed
Upper 95% confidence limit = M + SD'M × t(df).05,two-tailed

As an example, consider the samples of boys and girls shown in Fig. 7.2. The standard error of the mean, estimated from sample data, was 4.02 for the boys and 3.02 for the girls, whereas the mean weights were 72 pounds for the five boys, 64 for the seven girls. The confidence intervals for the boys are:
95% error bar = SD'M × t(4) = 4.02 × 2.78 = 11.2
Lower 95% confidence limit = M - SD'M × t(4) = 72 - 11.2 = 60.8
Upper 95% confidence limit = M + SD'M × t(4) = 72 + 11.2 = 83.2
For the girls the confidence intervals are:
95% error bar = SD'M × t(6) = 3.02 × 2.45 = 7.4
Lower 95% confidence limit = M - SD'M × t(6) = 64 - 7.4 = 56.6
Upper 95% confidence limit = M + SD'M × t(6) = 64 + 7.4 = 71.4

Often researchers want to know whether the means for two groups are significantly different. For example, is the average weight for the girls different from the average weight for the boys? Techniques designed to answer such questions are introduced in chapter 10. For now, it is worth noting that because the 95% confidence interval for the boys includes within its range the mean value for the girls (and the interval for the girls, whose upper limit is 71.4, falls only just short of the boys' mean of 72), it is unlikely that the difference between their mean weights is statistically significant (at the .05 level).

Exercise 7.4 Ninety-five Percent Confidence Intervals
This exercise provides practice in computing 95% confidence intervals. Fig. 7.3 illustrates the results.
1. Compute the 95% confidence interval for the population mean using the sample of six scores from part 3 in the last exercise.
2. Compute the 95% confidence interval for the population mean using the sample of six scores from part 4 in the last exercise.
3. Do the results of the single-sample tests performed for the last exercise agree with the 95% confidence intervals computed for this exercise?
FIG. 7.3. A bar graph of the group means described in Exercise 7.4, showing 95% confidence intervals.
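The confidence limits worked out earlier in this section for the samples of boys and girls can be reproduced with a short calculation. The Python sketch below is illustrative only: the means, estimated standard deviations, and sample sizes are those given in the text, the t values are copied from Table C as quoted there, and the function name is hypothetical.

```python
import math

def confidence_limits(sample_mean, sd_est, n, t_crit):
    """Return (standard error, lower limit, upper limit) for a confidence interval."""
    se = sd_est / math.sqrt(n)   # estimated standard error of the mean
    half_width = se * t_crit     # length of the error bar
    return se, sample_mean - half_width, sample_mean + half_width

# (mean, estimated population SD, N, t for alpha = .05 two-tailed with N - 1 df)
groups = {
    "boys": (72, 9, 5, 2.78),
    "girls": (64, 8, 7, 2.45),
}

results = {}
for name, (m, sd_est, n, t_crit) in groups.items():
    results[name] = confidence_limits(m, sd_est, n, t_crit)
    se, lower, upper = results[name]
    print(f"{name}: SE = {se:.2f}, 95% CI = {lower:.1f} to {upper:.1f}")
```

Swapping in a different critical value (for example, the .01 value from Table C) gives a wider interval from the same standard error.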
Exercise 7.5 SPSS: Single-sample tests and confidence intervals
This exercise provides practice in using SPSS to conduct single-sample tests. You will use the data from Exercise 7.3, parts 3 and 4, to determine whether the sample is drawn from a population with a mean of 20.
1. Open SPSS and create a variable called scores. Enter the data from Exercise 7.3, part 3.
2. Select Analyze->Compare Means->One-Sample T Test from the main menu. Move the scores variable to the Test Variable(s): window. Enter 20 in the Test Value window. Click on OK.
3. Examine the output. Are the values for the standard deviation and the standard error the same as the values you calculated in Exercise 7.3? What is the significance level of the t test? Do you conclude that the sample was drawn from a population with a mean of 20 at the alpha = .05 level? What would your decision be if alpha were .01?
4. Conduct a single-sample test using the data from Exercise 7.3, part 4. Why is this test not significant at the .05 level even though the mean is higher than the mean for the data used in step 3?
5. To compute confidence intervals for the scores variable, select Analyze->Descriptive Statistics->Explore from the main menu. Move the scores variable to the Dependent List window and click OK. The 95% confidence interval for the mean will appear in the Descriptives box. If you would like to calculate confidence intervals based on a value other than 95%, click the Statistics button in the Explore dialog box and change the 95 to another value in the Confidence Interval for Mean window.
Some of the material presented in this chapter is more important conceptually than practically. Practically, knowing how to present figures with error bars representing the standard error of the mean, or 95% confidence intervals, allows you to present your data visually. It is often useful to know how to approximate a binomial with a normal distribution. It is also important, as a matter of basic statistical literacy, to have some familiarity with the t distribution. Single-sample tests, on the other hand, are probably more important conceptually than practically. Still, their principles nonetheless provide a particularly clear foundation for the material to follow. This chapter, in fact, represents something of a turning point. Having finished with basic topics like hypothesis testing and simple descriptive statistics, we now turn to the topic that will occupy us for the remainder of the book: accounting for and analyzing variance.
8
Accounting for Variance: A Single Predictor
In this chapter you will: 1. Learn how to determine empirically the best-fit line (the line that, when used as a basis for predicting one variable from another, minimizes the error sum of squares). 2. Learn what the phrase accounting for variance means. 3. Learn to distinguish measures that indicate the exact nature of the relation between two variables from measures that indicate the strength of their relation. In this chapter we begin laying the groundwork required for most of the analytic techniques described in the remainder of this book. Many attributes of interest to behavioral scientists are measured with interval, ratio, or at least ordinal scales. In such cases, almost all specific research questions can be reduced to a single general form. The general question is, if the values of some quantifiable attribute (a dependent or criterion variable measured on a quantitative scale) vary from subject to subject, can that variability be explained, or accounted for, by research factors identified by the investigator (independent variables measured on either qualitative or quantitative scales)? In other words, does the independent variable have an effect on the dependent variable? Is the number of lies the expert detects affected by the subject's mood, for example, or by whether or not the subject received a drug? The techniques described in this chapter explain how such questions can be answered. 8.1
SIMPLE REGRESSION AND CORRELATION Historically, both simple regression and correlation (one criterion and one predictor variable) and multiple regression and correlation (one criterion and multiple predictor variables; often abbreviated as MRC) have been taught as techniques for nonexperimental data exclusively. Indeed, the term correlational has come to signify a nonexperimental or observational study. The only reason for the segregation of MRC into the nonexperimental world, and the dominance of analysis of variance or ANOVA in the experimental world, however, is
historical precedent and tradition. There is no compelling statistical basis for this bifurcation (Cohen, 1968). In fact, there is much to be gained by a more unitary approach. From the point of view of the general linear model, analysis of variance is simply a subset of multiple regression and correlation. Giving prominent play to this fact, as this book does, simplifies learning considerably. As noted in chapter 1, there are fewer topics to master initially and those topics possess considerable generality and power. The unified approach lets you see the forest as a whole without being tripped up by individual trees. Two numeric examples are presented in this chapter and the next. In both cases, the quantitative dependent variable accounted for is number of lies detected. In one case, the independent variable is quantitative (mood, used in this chapter), and in the other, qualitative (drug vs. no drug, used in chap. 9). These two examples are used first to demonstrate the basic concepts required to analyze such data, and second to demonstrate that no matter whether the predictor variable is quantitative (mood score) or qualitative (drug group), its effect can be analyzed using multiple-regression procedures. But remember, whether qualitative or quantitative, throughout this chapter we assume just a single predictor. Methods for dealing with two or more predictors are deferred until chapter 11.
Finding the Best-Fit Line

In chapter 5, you learned how a set of scores (like the number of lies detected for the 10 subjects) could be described. In particular, you learned how, using the method of least squares, a single point (the mean) could be fit to the set, and how the variability of the scores around the mean could be described (using the variance and the standard deviation). Research questions rarely involve just one variable, however. Typically we want to know, at the very least, how much one variable (the independent or predictor variable) affects another (the dependent or criterion variable).

For example, imagine that before determining the number of lies detected we had asked our subjects to fill out a mood scale (higher scores indicate a better mood). We do this because we want to find out whether subjects who are in a good mood (and who perhaps are less guarded) are more likely to have their lies detected. The question now is, what relation, if any, is there between mood and number of lies detected? In other words, if we know a subject's mood, can we predict the number of lies detected more accurately than if we knew only the average number of lies detected, irrespective of mood?

In order to answer this question, we would first postulate a prediction equation that ignores mood, and a second one that takes mood into account, and then compare the two equations. If mood affects the number of lies detected, then the second equation should provide a better fit to the data (i.e., should make more accurate predictions) than the first. However, if mood has no effect, then there should be little difference between the predictions made by the two equations. In chapter 5 (Equation 5.3), we described a prediction or regression equation that consisted of a single constant:

$$Y' = a \tag{8.1}$$
This prediction equation contains only a constant and no variables, so the predicted scores will be the same for all subjects. Of particular interest is the equation for which a is set equal to the mean:

$$Y' = M_Y \tag{8.2}$$
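As a quick numerical check of the constant model, the following minimal sketch computes the sum of squared differences for a few candidate values of a. The lie scores are those listed in column B of Fig. 8.3; the function name `ss_constant` and the particular candidate values are illustrative assumptions, not part of the original exercises.

```python
# Number of lies detected for the 10 subjects (column B of Fig. 8.3).
lies = [3, 2, 4, 6, 6, 4, 5, 7, 7, 9]


def ss_constant(scores, a):
    """Sum of squared differences between each score and the constant prediction a."""
    return sum((y - a) ** 2 for y in scores)


mean_lies = sum(lies) / len(lies)

# Try a few constants, including the mean itself (candidates chosen arbitrarily).
for a in (4.0, 5.0, mean_lies, 6.0):
    print(f"a = {a:.1f}   SS = {ss_constant(lies, a):.2f}")
```

Among the constants tried, the sum of squares is smallest (40.1) when a equals the mean of 5.3, which is the total SS discussed in the text.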
When the mean for a group of subjects serves as the predicted score for each subject, then the sum of the squared deviation scores (SS, or the sum of the squared differences between observed and predicted scores for each subject) will be the minimum value possible for that group's data. This is the total SS, which is defined as the sum of the squared differences between each raw score and the group's mean. Similarly, the variance for the sample (or the total variance) is defined as the total SS divided by the sample size or N. The preceding sentences are nothing more than a restatement of material first presented in chapter 5. Recall, for example, that for the lie detection study the mean number of lies was 5.3, the SS was 40.1, the number of subjects was 10, and so the sample variance was 4.01. It is this variance, the SS divided by N, to which accounting for variance refers, and it is this variance or at least portions of it that we hope to account for with our research factor (or factors).

The first prediction equation (Equation 8.2) considered only the mean. A second prediction equation, one that takes the values of a single predictor variable into account and assumes a linear relation between the predictor and the criterion variable, is

$$Y' = a + bX \tag{8.3}$$
Following the usual convention, Y is used for the DV or criterion variable and X for the IV or predictor variable. This regression equation generates an expected or predicted score for each subject equal to the sum of a (which is called the regression constant) and the product of b (which is called the regression coefficient) times the value of the variable X for that subject; thus, the value of the predictor variable is multiplied or "weighted" by b. Graphically, Equation 8.3 is an equation for the straight line defined by a and b. The regression constant, a, is the Y intercept of the line—that is, it is the value of Y at the point the line crosses the Y axis. The regression coefficient, b, is the slope of the line—that is, it is the ratio of rise (a vertical distance) to run (a horizontal distance) for the line. If values of Y tend to increase as values of X increase, the line will tilt up to the right and the value of b will be positive. If values of Y tend to decrease with increases in X, however, then the line will tilt down to the right and the value of b will be negative. The mood scores for the 10 subjects in the lie detection study are given in Fig. 8.1. An obvious question at this point is, how do we even know if an equation that attempts to fit a straight line to these data is reasonable? The best advice is, first graph the data. Prepare a simple scattergram like that shown in Fig. 8.2 and decide, on the basis of visual inspection, whether the data appear to reflect a roughly linear relationship—that is, as a general rule, are the increases (or decreases) along the Y axis about the same anywhere along the X axis? (The graph in Fig. 8.2 was prepared with Excel. If your spreadsheet has a similar graphing capability, your next exercise should be to prepare a similar graph.) If number of lies detected were perfectly predicted by mood, then these points would all fall on a straight line. 
In this case (and in most cases encountered in the behavioral sciences) they do not form a straight line. On the other hand, there appears to be some orderliness to these data, a general tendency for higher mood scores to be associated with more lies detected; thus, it makes some sense to assume linearity and proceed with Equation 8.3. We could even attempt to fit a line to these data points visually. As an exercise, you should draw a straight line on Fig. 8.2 that you think provides the best fit for these data.
FIG. 8.1. Mood data for the lie detection study (the same scores appear in columns B and C of Fig. 8.3).

Subject   Lies (Y)   Mood (X)
   1         3         5.5
   2         2         2
   3         4         4.5
   4         6         3
   5         6         1.5
   6         4         6.5
   7         5         3.5
   8         7         7
   9         7         6
  10         9         9

FIG. 8.2. Scattergram for predicting number of lies detected from mood.

Once you have drawn a straight line, you can figure out the values of a (the Y intercept) and b (the slope) directly from the graph and then use these values and Equation 8.3 to compute Y', the number of lies predicted for each subject based on that subject's mood score. Alternatively, you could read the predicted values directly from the graph. To do so, you would locate each subject's mood score on the X axis, draw a vertical line from it to your fitted line, and then extend a horizontal line from that point over to the Y axis. The predicted number of lies is the Y value at the point where the horizontal line crosses the Y axis.

And now that you know both observed and predicted scores for each subject, you can compute each subject's error or residual score, which is the difference between the number of detected lies actually observed and the number predicted by the equation. For example, three lies were detected for the first subject. This individual's mood score was 5.5. The predicted Y value (number of lies) will be the height of your visually fitted line above the X axis at the point on the X axis where X (mood score) equals 5.5. If this height is 5, then the error of prediction (or the residual) for the first subject would be 3 (the observed score) minus 5 (the predicted score), which equals -2.

Earlier you computed a total sum of squares (the sum of the squared differences between observed scores and the group mean). You can now compute an error sum of squares. As you would expect, it is the sum of the squared differences between observed scores and the scores predicted by the Equation 8.3 regression line. (And, as demonstrated in the next exercise, the model sum of squares is the sum of the squared differences between the scores predicted by the regression line and the group mean.) In your attempt to find the best-fit line, you might draw several slightly different lines. Each would be uniquely specified by the values for its intercept and slope, but only one line would be best, in the sense that the sum of the squared deviations or errors (the error sum of squares) would be minimized. This best-fit line could be determined empirically, which is the purpose of the next spreadsheet exercise.
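Reading a and b off a hand-drawn line, and then computing a predicted score and a residual, can be sketched as follows. The two points used here are hypothetical readings from an imagined hand-drawn line (they are not values from the book), and the function names are likewise assumptions made for illustration.

```python
def line_from_points(p1, p2):
    """Return (a, b): the Y intercept and slope of the line through two points."""
    (x1, y1), (x2, y2) = p1, p2
    b = (y2 - y1) / (x2 - x1)   # slope = rise / run
    a = y1 - b * x1             # intercept = value of Y where X is 0
    return a, b


def predict(a, b, x):
    """Predicted score Y' = a + bX."""
    return a + b * x


# Hypothetical readings: the drawn line passes through (0, 2.5) and (4, 4.5).
a, b = line_from_points((0.0, 2.5), (4.0, 4.5))

# Subject 1: mood score 5.5, three lies detected.
predicted = predict(a, b, 5.5)
residual = 3 - predicted        # observed minus predicted

print(a, b, predicted, residual)
```

With these two assumed points the line has a = 2.5 and b = 0.5, so the first subject's predicted score is 5.25 and the residual is -2.25. A different hand-drawn line would of course give different values.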
Note 8.1
a
The regression constant. When Y, the criterion variable, is regressed on X, a single predictor variable, a is the Y intercept of the best-fit line.
b
The simple regression coefficient. Given two variables, b is the slope of the best-fit line.
Exercise 8.1 The Error Sum of Squares

The template developed for this exercise computes the error sum of squares (as well as the total and model sums of squares) for specified values of the regression constant a and the regression coefficient b. This allows you to determine the best values for these two regression statistics. You may find it convenient to begin with the spreadsheet shown earlier in Fig. 5.4 and modify it according to the following instructions.

General Instructions
1. Establish columns (columns like these will be used repeatedly in the exercises to follow), labeled appropriately and containing the appropriate data or formulas, for:
   a. The subject number.
   b. The number of lies detected (Y).
   c. The mood score (X).
   d. The predicted score (Y' = a + bX).
   e. The deviation between Y and the Y mean (y = Y - MY).
   f. The deviation between Y' and the Y mean (m = Y' - MY).
   g. The deviation between Y and Y' (e = Y - Y').
   h. y*y or the square of y; their sum is SS total.
   i. m*m or the square of m; their sum is SS model.
   j. e*e or the square of e; their sum is SS error.
2. Establish rows that compute sums, counts, and means for the columns as appropriate. Enter a formula for the SDY. Reserve space for the regression statistics a and b.
3. Experiment with various values for a and b and note the effect on the various deviation scores and sums of squares. What exactly do the deviation scores represent? Finally, enter 2.5 for a and 0.5 for b. Do you get the values shown in Fig. 8.3?

Detailed Instructions
1. In row 2, label the columns as follows:

   Label    Column   Meaning
   s        A        The subject number or index; ranges from 1-10.
   Y        B        The observed number of lies detected (the DV).
   X        C        The mood score (the IV).
   Y'       D        The predicted number of lies detected.
   Y-My     E        The difference between the observed Y score and the mean Y score (represented here with a lower case y).
   Y'-My    F        The difference between the predicted Y score and the mean Y score (represented with m for model).
   Y-Y'     G        The difference between the observed Y and the predicted Y score (represented with e for error or residual).
   y*y      H        The square of the difference between the observed raw score and the mean.
   m*m      I        The square of the difference between the score predicted by the model and the mean.
   e*e      J        The square of the difference between the raw score and the one predicted by the model.

2. To remind ourselves what the sums of the squares in columns H-J represent, in row 1 label columns H-J as follows:

   Label    Column   Meaning
   SStot    H        Total sum of squares: the differences between the raw scores and the mean, squared and summed. (On some previous spreadsheets this same quantity was labeled SSY.)
   SSmod    I        Model sum of squares or SS explained by the regression equation (or model): the differences between the predicted scores and the mean, squared and summed.
   SSerr    J        Error or residual sum of squares or SS unexplained by the regression equation: the differences between observed and predicted scores, squared and summed.

3. In addition, enter the labels "Lies" in cell B1, "Mood" in cell C1, "y=" in cell E1, "m=" in cell F1, and "e=" in cell G1.
4. In column A, label rows 13-16 as follows:

   Label    Row      Meaning
   Sum=     13       The sum of the scores in rows 3-12.
   N=       14       The number of scores in rows 3-12.
   Mean=    15       The mean of those scores.
   a,b=     16       The values for the regression statistics (a in column B; b in column C).

5. Enter the observed lie and mood scores in columns B and C, rows 3-12.
6. For the time being, enter a value of 2.5 for a (the Y intercept) in cell B16 and a value of 0.5 for b (the slope) in cell C16.
7. Enter a formula for the predicted value in column D, cells 3-12. The predicted value is a (cell B16) plus b (cell C16) times this subject's mood score (X, in column C).
8. In cells E3-E12, enter a formula for subtracting the mean number of lies detected (MY, in cell B15) from the observed Y score (Y, in column B).
9. In cells F3-F12, enter a formula for subtracting MY from the predicted Y score (Y', in column D).
10. In cells G3-G12, enter a formula for subtracting Y' from the raw score observed for Y (column B).
11. In cells H3-H12, enter a formula for the total sum of squares (Y-MY, in column E, multiplied by itself).
12. In cells I3-I12, enter a formula for the model or regression sum of squares (Y'-MY, in column F, multiplied by itself).
13. In cells J3-J12, enter a formula for the error or residual sum of squares (Y-Y', in column G, multiplied by itself).
14. In cells B13-J13, enter a function that sums the entries in rows 3-12.
15. In cells B14-J14, enter a function that counts the entries in rows 3-12.
16. Enter the label "VAR=" in cell G15 and the label "SD=" in cell G16.
17. In cells B15-D15 and H15-J15, enter a formula that computes the mean for the column. In cell H16, enter the formula for the standard deviation for the Y scores (the square root of the Y variance). The means for the SStotal, SSmodel, and SSerror (columns H-J) are the total, model, and error variances, respectively.

At this point, your spreadsheet should look like the one shown in Fig. 8.3. Of particular interest in this spreadsheet are the errors, which are given in column G. These are the deviations of the observed values from those predicted by the line whose intercept is 2.5 and whose slope is 0.5. This particular straight line, however, was only an initial guess. It may or may not be the best-fit line. One way to find out is by trial and error. Try other values of a and b until you find those that yield a lower error sum of squares (cell J13) than the others. These values then define the best-fit line.

Exercise 8.2 Finding the Best Fit Line Empirically

The template shown in Fig. 8.3, which was developed during the last exercise, is used again for this exercise. Only the values for the parameters a and b are changed. The purpose is to find the best-fit line by trial and error.
1. By a process of trial and error, and guided by an inspection of the scattergram in Fig. 8.2, replace values for a and b in the previous spreadsheet until you find values that give the smallest value you can discover for the sum of the squared differences between observed and predicted values (SSerror or error sum of squares). For the time being, limit your search to numbers that have only two significant digits.
2. Now set a to 3.0 (if this was not the value you ended up with in number 1) and try to find a value for b that minimizes the error sum of squares. Is this sum smaller than the one you found before?
3. Now set a to 3.0 and b to 0.47 (if this was not the value you ended up with in number 2). Are you able to find other parameter values that yield a smaller sum of squares than these?
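For readers who prefer a script to a spreadsheet, the template of Exercise 8.1 can be sketched in a few lines. This is only an illustrative equivalent under stated assumptions: the function name `sums_of_squares` is invented here, and the lie and mood scores are those in columns B and C of Fig. 8.3.

```python
# Observed scores for the 10 subjects (columns B and C of the spreadsheet).
lies = [3, 2, 4, 6, 6, 4, 5, 7, 7, 9]
mood = [5.5, 2, 4.5, 3, 1.5, 6.5, 3.5, 7, 6, 9]


def sums_of_squares(y_scores, x_scores, a, b):
    """Return (SStotal, SSmodel, SSerror) for the line Y' = a + bX."""
    mean_y = sum(y_scores) / len(y_scores)
    predicted = [a + b * x for x in x_scores]                         # column D
    ss_total = sum((y - mean_y) ** 2 for y in y_scores)               # column H
    ss_model = sum((p - mean_y) ** 2 for p in predicted)              # column I
    ss_error = sum((y - p) ** 2 for y, p in zip(y_scores, predicted))  # column J
    return ss_total, ss_model, ss_error


# The two trial lines discussed in the text.
for a, b in [(2.5, 0.5), (3.0, 0.47)]:
    ss_total, ss_model, ss_error = sums_of_squares(lies, mood, a, b)
    print(f"a = {a}, b = {b}: SStotal = {ss_total:.2f}, "
          f"SSmodel = {ss_model:.2f}, SSerror = {ss_error:.2f}")
```

Changing the values of a and b in the final loop plays the same role as retyping cells B16 and C16 in the spreadsheet; the two trial lines shown give error sums of squares of about 30.31 and 28.86.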
At this point your spreadsheet should look like the one shown in Fig. 8.4. The values computed for the various sums of squares (total, model, and error) are based on values of 3.0 and 0.47 for a and b, and although these values are used for discussion purposes in the next several paragraphs, they are not quite correct. As becomes evident, more than two significant digits are required to express the exact values for these regression statistics.

Still, several aspects of this last spreadsheet deserve comment. As you would expect, the sums and means of the raw Y scores and the predicted Y scores are almost the same. Also as you would expect, the sums of the deviation scores based on the predicted score (predicted minus mean, raw minus predicted) are close to zero. If we had found the best values for a and b, not just values that were close, the means for the raw and the predicted scores would be identical and the sums of these two deviation scores would be exactly zero. An error sum of squares of 28.862625 (Fig. 8.4) is better than 30.3125 (Fig. 8.3), but it is still not the minimum possible. These are the more precise values (given to considerably more precision than is needed), not the rounded values of 28.86 and 30.31 displayed in the figures. Remember that the values displayed by your spreadsheet depend on the format you specify and the width of the column you select, but that values as accurate as your computer allows are retained and used by the spreadsheet program.

Also note that, for each subject, the raw minus the predicted (the error deviation score or Yi - Yi') and the predicted minus the mean (the model deviation score or Yi' - My) sum to the total deviation score (Yi - My), as they must and as you can prove algebraically. That is, for rows 3-12, the score in column E is the sum of the scores in columns F and G. This will be true no matter what the values of a and b are; it follows from the way the deviation scores are defined.
        A      B      C      D       E       F       G       H       I       J
  1            Lies   Mood           y=      m=      e=      SStot   SSmod   SSerr
  2     s      Y      X      Y'      Y-My    Y'-My   Y-Y'    y*y     m*m     e*e
  3     1      3      5.5    5.25    -2.3    -0.05   -2.25   5.29    0.002   5.063
  4     2      2      2      3.5     -3.3    -1.8    -1.5    10.89   3.24    2.25
  5     3      4      4.5    4.75    -1.3    -0.55   -0.75   1.69    0.303   0.563
  6     4      6      3      4       0.7     -1.3    2       0.49    1.69    4
  7     5      6      1.5    3.25    0.7     -2.05   2.75    0.49    4.203   7.563
  8     6      4      6.5    5.75    -1.3    0.45    -1.75   1.69    0.203   3.063
  9     7      5      3.5    4.25    -0.3    -1.05   0.75    0.09    1.103   0.563
 10     8      7      7      6       1.7     0.7     1       2.89    0.49    1
 11     9      7      6      5.5     1.7     0.2     1.5     2.89    0.04    2.25
 12     10     9      9      7       3.7     1.7     2       13.69   2.89    4
 13     Sum=   53     48.5   49.25   0       -3.75   3.75    40.1    14.16   30.31
 14     N=     10     10     10      10      10      10      10      10      10
 15     Mean=  5.3    4.85   4.925                   VAR=    4.01    1.416   3.031
 16     a,b=   2.5    0.5                            SD=     2.002

FIG. 8.3. Spreadsheet for finding the best-fit line by trial and error: results when a = 2.5 and b = 0.5.
However, only for the best-fit line will the model sum of squares and the error sum of squares add up exactly to the total sum of squares. That is, for the best-fit line,

$$SS_{\text{total}} = SS_{\text{model}} + SS_{\text{error}} \tag{8.4}$$
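Why the partition is exact for the best-fit line but not for other lines can be seen by squaring and summing the identity that relates the three deviation scores. The short derivation below is a sketch added for illustration; it uses only the symbols already defined (Y, Y', and M_Y), with sums taken over subjects.

```latex
\begin{aligned}
Y - M_Y &= (Y' - M_Y) + (Y - Y') \\
\sum (Y - M_Y)^2 &= \sum (Y' - M_Y)^2 + \sum (Y - Y')^2 + 2\sum (Y' - M_Y)(Y - Y') \\
SS_{\text{total}} &= SS_{\text{model}} + SS_{\text{error}} + 2\sum (Y' - M_Y)(Y - Y')
\end{aligned}
```

For the least-squares line the residuals sum to zero and are uncorrelated with X, so the cross-product sum vanishes and the model and error sums of squares add up exactly to the total. For other lines the cross-product sum is generally not zero. With the trial values a = 3.0 and b = 0.47, for instance, it works out to about -0.019, so the model and error sums of squares together overshoot 40.1 by about 0.038.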
The total sum of squares for the present example is 40.1 (see Fig. 8.4). The model (or regression, or explained) sum of squares and the error (or residual, or unexplained) sum of squares would add up exactly to 40.1 if we had found the best-fit values for a and b. In this case, they sum to 40.13825 (using nonrounded values), a value just slightly over 40.1. That means we are close but still not exactly on target with our guesses for a and b.

Exercise 8.3 Deviation Scores

The purpose of this exercise is to demonstrate graphically how the various deviation scores are formed and how they are related.
1. On a piece of graph paper, graph the mood/lies data, that is, reproduce the scattergram shown in Fig. 8.2. Be sure to label the X and Y axes appropriately. By convention, the dependent variable is usually graphed on the Y axis.
2. Draw the line whose Y intercept is 3.0 and whose slope is 0.47. Two points uniquely define a straight line; consequently, an easy way to do this is to compute the values of Y' associated with two arbitrarily chosen X values using the formula Y' = 3.0 + 0.47X.
        A      B      C      D       E       F       G       H       I       J
  1            Lies   Mood           y=      m=      e=      SStot   SSmod   SSerr
  2     s      Y      X      Y'      Y-My    Y'-My   Y-Y'    y*y     m*m     e*e
  3     1      3      5.5    5.585   -2.3    0.285   -2.59   5.29    0.081   6.682
  4     2      2      2      3.94    -3.3    -1.36   -1.94   10.89   1.85    3.764
  5     3      4      4.5    5.115   -1.3    -0.19   -1.12   1.69    0.034   1.243
  6     4      6      3      4.41    0.7     -0.89   1.59    0.49    0.792   2.528
  7     5      6      1.5    3.705   0.7     -1.6    2.295   0.49    2.544   5.267
  8     6      4      6.5    6.055   -1.3    0.755   -2.06   1.69    0.57    4.223
  9     7      5      3.5    4.645   -0.3    -0.66   0.355   0.09    0.429   0.126
 10     8      7      7      6.29    1.7     0.99    0.71    2.89    0.98    0.504
 11     9      7      6      5.82    1.7     0.52    1.18    2.89    0.27    1.392
 12     10     9      9      7.23    3.7     1.93    1.77    13.69   3.725   3.133
 13     Sum=   53     48.5   52.8    0       -0.2    0.205   40.1    11.28   28.86
 14     N=     10     10     10      10      10      10      10      10      10
 15     Mean=  5.3    4.85   5.28                    VAR=    4.01    1.128   2.886
 16     a,b=   3      0.47                           SD=     2.002

FIG. 8.4. Spreadsheet for finding the best-fit line by trial and error: results when a = 3 and b = 0.47.
3. Draw arrows representing each deviation score. There are three kinds of deviation scores: total (Y - MY), model (Y' - MY), and error (Y - Y'). Arrows will begin at points where Y = MY for the total and model deviations and at points where Y = Y' for the error deviations. They will go up (+) or down (-) depending on the sign of the deviation scores. Their associated mood scores will determine how far away from the Y axis they are. For example, for subject 3 the mood score is 4.5 and the deviation between raw score and mean is 4 - 5.3, which equals -1.3. Therefore draw an arrow beginning at the point on the graph where X = 4.5 and Y = 5.3 and ending with an arrowhead at the point where X = 4.5 and Y = 4. Note that this arrow, which represents a deviation of -1.3, extends 1.3 units and points down. You will probably want to use three different colors (or some other stylistic device) to represent the three kinds of deviations.

The three kinds of deviations (total, model, and error) and their associated sums of squares are basic to an understanding of much of the material presented in this book. It is worthwhile to ponder the figure you produced for the previous exercise (see Fig. 8.5) and think about what it represents. What do total deviations represent? Model deviations? Error deviations? Based on the figure, how strong is the relation between mood and number of lies detected? Why?
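The start and end points of the arrows described in step 3 can also be listed by a short script before they are drawn. The sketch below is only an illustration: the function name `deviation_arrows` and the dictionary layout are assumptions, and the line used is the trial line with a = 3.0 and b = 0.47.

```python
lies = [3, 2, 4, 6, 6, 4, 5, 7, 7, 9]
mood = [5.5, 2, 4.5, 3, 1.5, 6.5, 3.5, 7, 6, 9]


def deviation_arrows(y_scores, x_scores, a, b):
    """For each subject, return the (start, end) points of the three arrows."""
    mean_y = sum(y_scores) / len(y_scores)
    arrows = []
    for x, y in zip(x_scores, y_scores):
        predicted = a + b * x
        arrows.append({
            "total": ((x, mean_y), (x, y)),          # from the mean to the raw score
            "model": ((x, mean_y), (x, predicted)),  # from the mean to the line
            "error": ((x, predicted), (x, y)),       # from the line to the raw score
        })
    return arrows


arrows = deviation_arrows(lies, mood, a=3.0, b=0.47)

# Subject 3 (index 2): the total-deviation arrow runs from Y = 5.3 down to Y = 4.
print(arrows[2]["total"])
```

Each arrow is vertical because the start and end points share the subject's mood score; the sign of the deviation is the end Y value minus the start Y value.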
Note 8.2

Y - MY
The total deviation score, that is, the difference between each subject's observed score and the group mean. (Subscripts indicating subject are not shown.)

Y' - MY
The model (or regression) deviation score, that is, the difference between the score predicted by the regression equation (or model) for each subject and the group mean. (Subscripts indicating subject are not shown.)

Y - Y'
The error (or residual) deviation score, that is, the difference between each subject's observed score and each subject's predicted score. (Subscripts indicating subject are not shown.)

SStotal
The total sum of squares, or Y - MY for each subject, squared and summed. This can also be symbolized SSY.

SSmodel
The model sum of squares, or Y' - MY for each subject, squared and summed. This can also be symbolized SSreg, for regression sum of squares, or SSexp, for sum of squares explained by the model.

SSerror
The error sum of squares, or Y - Y' for each subject, squared and summed. This can also be symbolized SSres, for residual sum of squares, or SSunexp, for sum of squares left unexplained by the model.
8.2 WHAT ACCOUNTING FOR VARIANCE MEANS

From a best-fit line (and remembering that we have not yet determined the absolute best-fit line for the lie detection data, although we will do so in the next chapter), two important facts can be determined.
1. The exact nature of the linear relation between predictor and criterion variable.
2. The strength of the linear relation between predictor and criterion variable.

The nature of the linear relation is indicated by the regression statistics for the best-fit line, whereas the strength of the linear relation is indicated by the ratio of variability explained by the regression model to total variability. Exactly what this means is explained in the next several paragraphs.
The Exact Nature of the Relation

Assume the values for a and b given in Fig. 8.4 are accurate. (We know they are accurate to two significant digits.) The value of a indicates that, for mood scores of zero, 3.0 lies would be detected (Y' = 3.0 + 0.47X = 3.0 when X = 0). This statement is meaningful only if mood scores of zero are meaningful. Although we can imagine any best-fit line stretching to infinity in both directions, any actual line we draw will be based on a particular sample of X values, values that will necessarily be confined to a particular range. In this case, mood scores ranged from 1.5 to 9, not minus to plus infinity, so predictions of number of lies based on mood scores outside this range (like zero) may not make much sense.

More interesting is the information conveyed by the value for b. It indicates that, for each increase of one mood point, the number of lies detected increases by 0.47. Numbers less than 1 often seem difficult for many people to grasp, thus in this case we might instead say that for each increase of 2.13 mood points, the number of lies detected increases by 1. (The algebra is as follows: b = rise/run = 0.47/1; setting 0.47/1 = 1/X gives 0.47X = 1, so X = 1/0.47 = 2.13 rounded.)

It is important to remember that the exact nature of the relation between a predictor and criterion variable (the change in value for the criterion expected for a given change in the predictor) is an idealization based on the model. It is an ideal statement about an ideal world, one in which error is ignored and the model is assumed to be true, when in fact it is better viewed only as a probabilistic guide. In the behavioral sciences, observed values often deviate strikingly from those predicted by a particular model, even though that model is based on the best possible fit to the data. Thus it becomes important to assess, not just what the best-fit model is (i.e., what are the best-fit values for a and b), but also how well that model fits a particular set of data.
In other words, although we may have found a best-fit line, the question remains, how good is that fit?
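The reading of a and b given in this section can be sketched in a few lines. The sketch assumes the trial values a = 3.0 and b = 0.47 and the range of mood scores found in the spreadsheet data (1.5 to 9); the function name `predict_lies` and the idea of returning a range flag are illustrative choices, not something the text prescribes.

```python
A, B = 3.0, 0.47                 # trial regression constant and coefficient
MOOD_MIN, MOOD_MAX = 1.5, 9.0    # range of mood scores in the sample


def predict_lies(mood_score):
    """Return (predicted lies, whether the mood score lies inside the sampled range)."""
    in_range = MOOD_MIN <= mood_score <= MOOD_MAX
    return A + B * mood_score, in_range


# b read directly: lies gained per additional mood point.
lies_per_mood_point = B
# b read the other way round: mood points needed for one additional lie.
mood_points_per_lie = 1 / B

print(predict_lies(0))             # a prediction outside the sampled range
print(round(mood_points_per_lie, 2))
```

A prediction flagged as out of range, such as the one for a mood score of zero, is the kind of extrapolation the text advises treating with caution.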
The Strength of the Relation

The strength of the relation (or association) between a particular predictor variable and its criterion, or the goodness of fit, can be assessed by the proportion of total criterion variance that is accounted for when predictions take the predictor variable into account, that is, when predictions are based on the regression equation (Equation 8.3) instead of just on the group mean (Equation 8.2). In effect, two variances are compared. First is total variance or the extent to which raw scores deviate from the mean (see Fig. 8.5, top). Second is error variance or the extent to which raw scores deviate from the best-fit line (see Fig. 8.5, bottom). Error variance is represented graphically by the dispersion of points around the best-fit line, whereas total variance is represented by the dispersion of points around the horizontal line representing the Y mean (recall Exercise 8.3).

When the relation between X and Y is weak, the dispersion of data points around the best-fit line is almost as great as the dispersion around the mean line. When the relation between X and Y is strong, however, the data points are clustered far more tightly around the best-fit line than around the mean line; basing predictions on the linear model (Yi' = a + bXi) instead of the mean model (Yi' = My) results in a clear reduction in variability. For the present example, the total variance for the mean model (the SStotal divided by N) was 4.01, whereas the error variance (the SSerror divided by N) was 2.89 (see Fig. 8.4).

We are now in a position to answer the question posed earlier in this chapter: How much do we gain by stepping up from the one-parameter mean model (which requires a value only for My) to the two-parameter linear model (which requires values for a and b)? Recall that for the best-fit linear model, the total sum of squares (SStotal) can be divided (or partitioned) into two pieces, the sum of squares associated with or explained by the regression model (SSmodel) and the error or residual sum of squares (SSerror), which is the sum of squares left unexplained by the model (Equation 8.4). The total variance, which is simply the total sum of squares divided by N, can also be partitioned in the same way:

$$VAR_{\text{total}} = VAR_{\text{model}} + VAR_{\text{error}} \tag{8.5}$$
The mean model generates the same value for all predicted scores. As a result, there is no variability in predicted scores and no variability to be associated with the model. In the case of the mean model, total variance is all error variance, portions of which we hope to account for by more complex models. Usually the error variance for the linear model will be less than the error (i.e., total) variance for the mean model. The amount by which it is less is the variance due to the model and represents the effect of the additional information (i.e., each subject's values for the predictor variable) used in the linear model as compared to the mean model.

For the present example, the error variance for the mean model is 4.01 and the error variance for the linear model is 2.89 (assuming a = 3.0 and b = 0.47). This represents an improvement of approximately 4.01 minus 2.89, or an improvement of 1.12 (rounded). This improvement corresponds to the variance accounted for by the linear model (VARmodel), which according to Fig. 8.4 is 1.13 (rounded; actually, 1.1275625). The two figures would agree exactly if we had used values for a and b that produced the absolute best-fit line. In any case, compared to the mean model, the linear model reduces error variance by approximately 28% (1.13 divided by 4.01 times 100). In other words, for the present example the linear model accounts for approximately 28% of the variance in criterion variable scores.

The formula for proportion of variance accounted for can be expressed in terms of either variances or sums of squares. In terms of variances,

$$\text{proportion of variance accounted for} = \frac{VAR_{\text{model}}}{VAR_{\text{total}}} = \frac{VAR_{\text{total}} - VAR_{\text{error}}}{VAR_{\text{total}}} \tag{8.6}$$
This formulation clearly expresses the notion that what is being computed is the proportion of total variance accounted for by the model. The second variant is given because sometimes total and error variances are more readily available than model variance, and clearly, if VARtotal = VARmodel + VARerror then VARmodel = VARtotal - VARerror. The various variances given in Equation 8.6 are all the appropriate sums of squares divided by N, the sample size, and therefore all Ns implicit in Equation 8.6 cancel, leaving

$$\text{proportion of variance accounted for} = \frac{SS_{\text{model}}}{SS_{\text{total}}} = \frac{SS_{\text{total}} - SS_{\text{error}}}{SS_{\text{total}}} \tag{8.7}$$
FIG. 8.5. Total (top), model (middle), and error (bottom) deviations; when squared and summed, the result is the sum of squares total, for the model, and for error.
This formulation bypasses the divisions by N required for computing the variances in Equation 8.6 and for that reason may occasionally be more convenient. But both formulas yield identical results. As an additional aid to understanding the concept of accounting for variance, it can be helpful to consider extreme, even degenerate circumstances. For example, imagine that prediction were perfect, that every data point fell on the best-fit line. As always, the total sum of the deviation scores squared (deviations of raw scores from the mean) would reflect variability for the criterion variable, the very thing we want to account for. But in this case, each error deviation would be zero (because every point is perfectly on the line), so the sum of the squared residuals would be zero. As a result, total and model sum of squares would be identical. With no error variance, all of the total variance is accounted for by the model. In other words, the SSmodel divided by the SStotal would equal exactly 1. When prediction is perfect, 100% of the variability in criterion scores is accounted for by the model; there is no error variability. Now imagine there is no association between predictor and criterion variables. In this case, the slope for the best-fit line would equal zero, indicating no relation, and the Y intercept would equal the mean of the Y scores. That is, the best-fit line would be horizontal (zero slope) and would intercept the Y axis at the Y mean. Given no relation, the best prediction for a Y score is simply the Y mean; there is no way to improve it. As a result, the sum of squares explained by the model is zero (because the model predicts the mean, the deviations between predicted and mean scores are zero). With no model variance, none of the total variance can be due to the model. In other words, the SSmodel divided by the SStotal would equal exactly zero. When prediction is nil, 0% of the variability in criterion scores is accounted for by the model. 
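The two extreme cases just described can be checked with small made-up data sets. Everything in the sketch below is hypothetical: the scores are invented so that one set lies exactly on a line and the other has a best-fit slope of zero, and the function name `proportion_accounted_for` is an illustrative choice.

```python
def proportion_accounted_for(y_scores, x_scores, a, b):
    """SSmodel divided by SStotal for the line Y' = a + bX."""
    mean_y = sum(y_scores) / len(y_scores)
    predicted = [a + b * x for x in x_scores]
    ss_total = sum((y - mean_y) ** 2 for y in y_scores)
    ss_model = sum((p - mean_y) ** 2 for p in predicted)
    return ss_model / ss_total


# Perfect prediction: hypothetical scores lying exactly on Y = 1 + 2X.
x_perfect = [1, 2, 3, 4, 5]
y_perfect = [3, 5, 7, 9, 11]
perfect = proportion_accounted_for(y_perfect, x_perfect, a=1, b=2)

# No association: hypothetical scores whose best-fit line is flat,
# so the intercept equals the Y mean (5) and the slope is 0.
x_none = [1, 2, 3, 4, 5]
y_none = [4, 6, 5, 6, 4]
none = proportion_accounted_for(y_none, x_none, a=5, b=0)

print(perfect, none)
```

The first ratio is 1 because every residual is zero; the second is 0 because the flat line predicts the mean for everyone, leaving no model deviations at all.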
Normally, of course, the proportion of variance accounted for falls somewhere between 0 and 1. For the present example (Fig. 8.4), the total sum of squares for the number of lies detected is 40.10. The sum of squares explained by the model is 11.28 and the error sum of squares is 28.86 (rounding to four significant digits). Thus the proportion of variability in number of lies detected that is accounted for by mood scores is .281 (11.28 divided by 40.10). Remember, however, that this is based on values for a and b determined by trial and error and accurate only to two significant digits, which is why the present .281 differs from the more accurate .280 we compute in the next chapter.
The purpose of the past several paragraphs has been to develop a clear understanding of what it means to say that a given proportion of criterion variable variance is accounted for by a particular predictor variable. Many readers will already have recognized that the proportion of variance accounted for is an important and widely used descriptive statistic, known as the coefficient of determination or r2 (read r-squared). It indicates the degree or strength of association between two variables, it can vary from 0 to 1, and it is symbolized and defined as follows:

$$r^2 = \frac{VAR_{model}}{VAR_{total}} = \frac{SS_{model}}{SS_{total}}$$
Historically, another statistic, the correlation coefficient or r, was developed before r2. In traditional statistics texts, r is usually introduced first and receives more attention than r2. We prefer to emphasize r2, partly because it has such a clear and concrete meaning (proportion of variance accounted for) and partly because accounting for variance is a general and powerful concept, one not limited to cases involving single predictor variables. In any case, and ignoring for a moment the matter of sign, you can always compute the value of r simply by taking the square root of r2.
In this chapter you have been introduced to four important descriptive statistics: the regression constant or a, the regression coefficient or b, the correlation coefficient or r, and the coefficient of determination or r2. The regression coefficient indexes the exact nature of the relation between two variables, r indexes the strength of association between the predictor and the criterion variables, and r2 indicates the proportion of criterion variable variance accounted for by the model that includes the predictor variable. You have also learned how to determine the value of a and b by trial and error and how to use them to compute first the SStotal, SSmodel, and SSerror, and then r2. At this point, you should have a good understanding of accounting for variance. In chapter 9 you will learn how to compute the exact, best-fit values for a and b.
Note 8.3 VARtotal
The variance for the scores in a sample. It indicates the variability of raw scores relative to the group mean. To compute, divide SStotal by N. The subscript indicates the group of scores in question, e.g., VARx for the X scores, VARy for the Y scores, and so forth.
VARmodel
The model variance. It indicates the variability of scores predicted by a specified regression equation or model relative to the group mean. To compute, divide SSmodel by N.
VARerror
The error variance. It indicates the variability of raw scores relative to scores predicted by a specified regression equation or model. To compute, divide SSerror by N.
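The three variances defined in this note can be reproduced outside a spreadsheet. The sketch below assumes Python and uses the lie and mood scores from the running example together with the trial-and-error values a = 3.0 and b = .47; the variable names are our own illustrative choices.

```python
lies = [3, 2, 4, 6, 6, 4, 5, 7, 7, 9]            # Y, number of lies detected
mood = [5.5, 2, 4.5, 3, 1.5, 6.5, 3.5, 7, 6, 9]  # X, mood scores
a, b = 3.0, 0.47                                 # trial-and-error guesses

n = len(lies)
mean_y = sum(lies) / n
predicted = [a + b * x for x in mood]            # Y' = a + bX

# Sums of squared deviations: total, model, and error.
ss_total = sum((y - mean_y) ** 2 for y in lies)
ss_model = sum((p - mean_y) ** 2 for p in predicted)
ss_error = sum((y - p) ** 2 for y, p in zip(lies, predicted))

# Each variance is the corresponding sum of squares divided by N.
var_total = ss_total / n
var_model = ss_model / n
var_error = ss_error / n

r2 = var_model / var_total                       # same as ss_model / ss_total

print(round(ss_total, 2), round(ss_model, 2), round(ss_error, 2), round(r2, 3))
# prints: 40.1 11.28 28.86 0.281
```

Because the guessed values for a and b are not exactly the best-fit values, the model and error sums of squares here do not add up exactly to the total sum of squares.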
9
Bivariate Relations: The Regression and Correlation Coefficients
In this chapter you will:
1. Learn how to compute the slope and Y intercept for the best-fit line.
2. Learn how to compute and interpret the correlation coefficient.
3. Learn how to account for variance with a single predictor, either quantitative or binary.
4. Learn how to graph the regression line for both a quantitative and a binary predictor variable.

In the preceding chapter, as a way of describing the relation between two variables, you determined the best-fit line by trial and error. You also learned that the slope of that line (the regression coefficient) indicates the exact nature of the linear relation between the independent and dependent variable. Topics introduced in the last chapter are discussed further in this chapter. The focus remains on descriptive statistics, and the major question remains: How can the strength of the relation between a quantitative dependent variable and an independent variable (either quantitative or qualitative) be assessed? Other ways to state this question are: How can we describe the effect of one variable on another, and how much variability in one variable can be accounted for by another?
In the previous chapter, the numeric example used a quantitative independent or predictor variable (mood). In this chapter, after describing procedures for computing the regression and correlation coefficients, we demonstrate how the methods learned in the last chapter also apply when the predictor variable is binary. This is useful because researchers often compare two groups and in such cases the variable indicating group membership will be binary.

9.1 COMPUTING THE SLOPE AND THE Y INTERCEPT
As you probably know, or certainly have guessed, in practice the slope and Y intercept are computed using formulas developed by statisticians, not discovered
empirically by refined guesswork as demonstrated in the previous chapter. Still, it is instructive to learn the trial-and-error method first, partly because it is so simple and it does work, but mainly because it helps develop a clear and intuitive sense of a best-fit line. Statisticians can prove that the formulas for the slope and Y intercept given in this section do indeed result in best-fit lines. We do not reproduce their arguments here, preferring to leave such discussion for advanced and more mathematically oriented statistics texts. However, you can use the spreadsheets developed in the last chapter to demonstrate to yourself that the computed values for a and b really do describe a best-fit line. You will not find any other values for a and b that yield a smaller value for the error sum of squares. Keep in mind, however, that the computations for the Y intercept and slope presented here apply only when a criterion variable is being accounted for by a single predictor variable. Procedures used when more than one predictor variable is under consideration are presented in subsequent sections.
If we let a lower case y represent the total deviation score for a Y score, so that yi = Yi - MY, then the sum of squares for Y is:

$$SS_Y = \sum y_i^2 = \sum (Y_i - M_Y)^2$$
In other words, the sum of squares for Y is the sum of the squared differences between the raw Y scores and the Y mean. Likewise, if we let a lower case x represent a total deviation score for an X score, so that xi = Xi - MX, then the sum of squares for X is:

$$SS_X = \sum x_i^2 = \sum (X_i - M_X)^2$$

As you already know, the sum of squares for Y indicates the variability of the Y scores. Similarly, the sum of squares for X indicates how variable the X scores are. Now we can define a new sum of squares, the XY sum of squares, that indicates the extent to which the X and Y scores covary:

$$SS_{XY} = \sum x_i y_i = \sum (X_i - M_X)(Y_i - M_Y)$$
Note that SSx and SSy are always positive (because deviations are multiplied by themselves) but SSxy can be negative as well as positive. If X and Y covary in a positive direction (that is, if small values of Y tend to be paired with small values of X and large values of Y tend to be paired with large values of X), then most of the products of the paired x and y deviation scores will be positive (because most paired deviations will have the same sign), resulting in a large, positive value for SSxy. On the other hand, if X and Y covary in a negative direction (that is, if large values of Y tend to be paired with small values of X, and small values of Y tend to be paired with large values of X), then most of the products of the paired x and y deviation scores will be negative (because most paired deviations will have opposite signs) and thus the value of SSxy will be large and negative. There is a third possibility, of course: X and Y may not covary at all. In that case, about half of the xy products would be positive, half negative, and SSxy would be some value close to zero.
You already know how to compute the variance for the X and Y scores in a sample. These variances are:

$$VAR_X = \frac{SS_X}{N} \qquad VAR_Y = \frac{SS_Y}{N}$$

Similarly, the sample covariance is the XY sum of squares divided by the sample size:

$$COV_{XY} = \frac{SS_{XY}}{N}$$
We remind you about variances, and introduce the covariance, because the slope for the best-fit line can be defined in terms of these statistics. If Y is the criterion variable and X the predictor variable, then the slope for the best-fit line is the ratio of the covariance for X and Y to the variance for X:

$$b = \frac{COV_{XY}}{VAR_X} \tag{9.7}$$

(The Ns used to compute the covariance and variance cancel, so the slope could also be expressed as the ratio of SSxy to SSx.) If the covariance is positive, then the slope is positive (and tilts up to the right); if the covariance is negative, then the slope is negative (and tilts down to the right).
In terms of units, Equation 9.7 makes sense. If Y is lies and X is mood, then the numerator, which is the product of lie and mood deviation scores, is measured in lie-mood units (or lies times mood). Similarly, the denominator is measured in mood-squared units (like area is measured in square feet). When lie-mood units are divided by mood-squared units, the mood units cancel, and the units for the resulting quotient are lies/mood (or lies per mood), which is exactly what the slope is: a ratio of rise to run or the increase in Y (number of lies) per unit increase in X (mood). If you remember this, you should have no trouble remembering that the slope is the covariance for X (run) and Y (rise) divided by the variance for X (run-squared).
Given the slope, and the mean values for X and Y, the Y intercept can be determined by geometry. The formula for the Y intercept is:

$$a = M_Y - b\,M_X \tag{9.8}$$
By definition, the best-fit line must go through the point where X = MX and Y = MY. Its slope, or the ratio of rise to run, is the increase in Y units for each X unit. At the point where the line intercepts the Y axis, X equals zero. In running from X = 0 to X = MX, a distance of MX X units, we rise b times MX Y units. In order to get back to the Y value corresponding to X = 0 (the Y intercept), we in effect run backward, subtracting the total rise (b times MX) from the mean for Y, as indicated by Equation 9.8. If this seems unclear, try making a simple figure that depicts Equation 9.8.

Exercise 9.1 Computing the Y Intercept and Slope
The template developed for this exercise adds the ability to compute the exact values for the regression constant (the Y intercept or a) and the regression coefficient (the slope or b) to the template shown in Fig. 8.4.

General Instructions
1. Add three columns, labeled appropriately and containing the appropriate formulas, to your Exercise 8.4 spreadsheet. These columns are for:
   a. The deviation between X and the X mean (x = X - MX).
   b. The square of x; their sum is SSx.
   c. The product of x and y; their sum is SSxy.
2. Establish rows that compute sums, counts, and means for these columns as appropriate. Enter a formula for the SDx.
3. Now that your spreadsheet contains all the elements needed, enter formulas for a and b, the Y intercept and slope. Do your computed values agree with those shown in Fig. 9.1? How does the current value for the error sum of squares compare with the value computed for the last exercise?

Detailed Instructions
1. Begin with the spreadsheet shown in Fig. 8.4. Insert a blank column between columns G and H, or, alternatively, move the block H1-J16 over one column, to I1-K16. This opens up column H, which will be used for labeling.
2. Move the VAR= and SD= labels in cells G15-G16 to H15-H16.
3. In row 2, add the following labels to columns L-N:

   Label   Column   Meaning
   X-Mx    L        The difference between the observed mood score and its mean (symbolized x).
   x*x     M        The product of the mood deviation scores. The sum is SSx and indicates variance in X scores.
   x*y     N        The cross product of the mood and lie deviation scores. The sum is SSxy and indicates covariance in X and Y scores.

4. In addition, enter the labels "x=" in L1, "SSx" in M1, and "SSxy" in N1.
5. In column L (cells 3-12), enter a formula for subtracting the mean mood score (Mx, in cell C15) from the observed X score (in column C).
6. In column M (cells 3-12), enter a formula for squaring the mood deviation score (in column L).
7. In column N (cells 3-12), enter a formula for the cross product of the mood deviation scores (in column L) and the lie deviation scores (in column E).
8. In cells L13-N13, enter a function that sums the entries in rows 3-12.
9. In cells L14-N14, enter a function that counts the entries in rows 3-12.
10. In cells M15-N15, enter a formula that computes the mean for the column. The column M mean is the variance for X and the column N mean is the XY covariance.
11. In cell M16, enter a formula for the standard deviation for the mood or X scores.
12. In cell C16, enter the formula for the slope. This is the covariance (cell N15) divided by the X variance (cell M15).
13. In cell B16, enter a formula for the Y intercept. This is the product of the slope (cell C16) and the X mean (cell C15) subtracted from the Y mean (cell B15).
At this point, your spreadsheet should look like the one shown in Fig. 9.1. Now we can see that the computed values for a and b (rounded to five significant digits) are 3.0235 and 0.46938 respectively. (Remember, the number of digits displayed depends on the format and column width you select; the number of digits stored internally is not affected by the display.)
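For readers who would like to check the spreadsheet against a short program, the same computations can be sketched in Python. This is only an illustration under our own naming assumptions (`x_dev`, `ss_xy`, and so on); it follows the spreadsheet columns step by step rather than calling any statistics library.

```python
lies = [3, 2, 4, 6, 6, 4, 5, 7, 7, 9]            # Y, column B
mood = [5.5, 2, 4.5, 3, 1.5, 6.5, 3.5, 7, 6, 9]  # X, column C

n = len(lies)
mean_x = sum(mood) / n                           # cell C15
mean_y = sum(lies) / n                           # cell B15

x_dev = [x - mean_x for x in mood]               # column L: x = X - Mx
y_dev = [y - mean_y for y in lies]               # column E: y = Y - My

ss_x = sum(x * x for x in x_dev)                 # cell M13: SSx
ss_xy = sum(x * y for x, y in zip(x_dev, y_dev)) # cell N13: SSxy

var_x = ss_x / n                                 # cell M15: X variance
cov_xy = ss_xy / n                               # cell N15: XY covariance

b = cov_xy / var_x                               # slope
a = mean_y - b * mean_x                          # Y intercept

print(round(a, 4), round(b, 5))
# prints: 3.0235 0.46938
```

Because N appears in both the covariance and the variance, dividing `ss_xy` by `ss_x` directly would give the same slope.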
The guesses we used in the previous chapter for a and b, 3.0 and .47 (see Fig. 8.4), were close but not the best possible. Now that our estimates for a and b are more accurate (Fig. 9.1), the error sum of squares (cell K13) is 28.85840274,
    A      B     C     D      E      F      G      H     I      J      K      L      M      N
1          Lies  Mood         y=     m=     e=           SStot  SSmod  SSerr  x=     SSx    SSxy
2   s      Y     X     Y'     Y-My   Y'-My  Y-Y'         y*y    m*m    e*e    X-Mx   x*x    x*y
3   1      3     5.5   5.605  -2.3   0.305  -2.61        5.29   0.093  6.787  0.65   0.423  -1.5
4   2      2     2     3.962  -3.3   -1.34  -1.96        10.89  1.79   3.851  -2.85  8.123  9.405
5   3      4     4.5   5.136  -1.3   -0.16  -1.14        1.69   0.027  1.29   -0.35  0.123  0.455
6   4      6     3     4.432  0.7    -0.87  1.568        0.49   0.754  2.46   -1.85  3.423  -1.3
7   5      6     1.5   3.728  0.7    -1.57  2.272        0.49   2.472  5.164  -3.35  11.22  -2.35
8   6      4     6.5   6.074  -1.3   0.774  -2.07        1.69   0.6    4.303  1.65   2.723  -2.15
9   7      5     3.5   4.666  -0.3   -0.63  0.334        0.09   0.402  0.111  -1.35  1.823  0.405
10  8      7     7     6.309  1.7    1.009  0.691        2.89   1.018  0.477  2.15   4.623  3.655
11  9      7     6     5.84   1.7    0.54   1.16         2.89   0.291  1.346  1.15   1.323  1.955
12  10     9     9     7.248  3.7    1.948  1.752        13.69  3.794  3.07   4.15   17.22  15.36
13  Sum=   53    48.5  53     0      0      0      Sum=  40.1   11.24  28.86  0      51.03  23.95
14  N=     10    10    10     10     10     10     N=    10     10     10     10     10     10
15  Mean=  5.3   4.85  5.3                         VAR=  4.01   1.124  2.886         5.103  2.395
16  a,b=   3.024 0.469                             SD=   2.002                       2.259
FIG. 9.1. Spreadsheet for finding the Y intercept and slope of a best-fit line predicting number of lies detected from mood scores.
which is very slightly less than the 28.862625 from Fig. 8.4 (when we adjust the column width to see more significant digits than the 28.86 displayed in the figures). This value from Fig. 9.1 should be the smallest value possible. Moreover, the model and error deviation scores (columns F-G) now sum to zero, as they should (or else to a number like -2E-15, which in scientific notation is -2 divided by a 1 followed by 15 zeros and which is essentially zero). And the SSmodel and the SSerror (cells J13-K13) sum exactly to SStotal (values, rounded to four significant digits, for SSmodel and SSerror are 11.24 and 28.86, which sum exactly to 40.10).
Given a single predictor, then, the spreadsheet shown in Fig. 9.1 can be used to compute the regression constant (a or the Y intercept) and the regression coefficient (b or the slope for the best-fit line). In the next section, we show how to compute the correlation coefficient (r) as well.

Note 9.1
SSxy
The XY sum of squares, or Σ (Xi - MX)(Yi - MY), where i = 1,N. Thus SSxy is the sum of the cross products of the X and Y deviation scores.
COV
The covariance. For the X and Y scores, it is SSxy divided by N. Thus COVxy is the average cross product of the X and Y deviation scores. It indicates the direction (plus or minus) and the extent to which the X and Y scores covary, and is related to both the correlation and the regression coefficients.
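The claim that no other values for a and b give a smaller error sum of squares can be spot-checked. The Python sketch below is not a proof; it simply compares the computed best-fit line with the earlier guesses (3.0 and .47) and with a few nearby lines. The helper name `sums_of_squares` and the nudge size of 0.01 are arbitrary choices of ours.

```python
lies = [3, 2, 4, 6, 6, 4, 5, 7, 7, 9]            # Y
mood = [5.5, 2, 4.5, 3, 1.5, 6.5, 3.5, 7, 6, 9]  # X
n = len(lies)
mean_x = sum(mood) / n
mean_y = sum(lies) / n

def sums_of_squares(a, b):
    """Return (SSmodel, SSerror) for the line Y' = a + bX."""
    predicted = [a + b * x for x in mood]
    ss_model = sum((p - mean_y) ** 2 for p in predicted)
    ss_error = sum((y - p) ** 2 for y, p in zip(lies, predicted))
    return ss_model, ss_error

ss_total = sum((y - mean_y) ** 2 for y in lies)

# Computed slope (SSxy / SSx) and Y intercept (MY - b * MX).
ss_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(mood, lies))
ss_x = sum((x - mean_x) ** 2 for x in mood)
b_best = ss_xy / ss_x
a_best = mean_y - b_best * mean_x

ss_model_best, ss_error_best = sums_of_squares(a_best, b_best)
_, ss_error_guess = sums_of_squares(3.0, 0.47)

# Error sums of squares for eight lines slightly off the best-fit line.
nudged_errors = [
    sums_of_squares(a_best + da, b_best + db)[1]
    for da in (-0.01, 0.0, 0.01)
    for db in (-0.01, 0.0, 0.01)
    if (da, db) != (0.0, 0.0)
]

print(round(ss_error_best, 4), round(ss_error_guess, 4))
print(all(ss_error_best < e for e in nudged_errors))
print(abs(ss_model_best + ss_error_best - ss_total) < 1e-9)
```

Every nudged line yields a larger error sum of squares than the computed line, and only for the computed line do SSmodel and SSerror add up to SStotal.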
9.2 COMPUTING THE CORRELATION COEFFICIENT
The Pearson product-moment correlation coefficient, which is symbolized by r, is an historically important and widely used index of association usually applied to quantitative variables. It was first defined by Karl Pearson in 1895. Although Pearson approached the matter somewhat differently, we have already noted that r2 is the proportion of variance accounted for in a criterion variable when values for the predictor variable are taken into account. Thus, one way to compute r is simply to take the square root of r2:

$$r = \sqrt{r^2} = \sqrt{\frac{VAR_{model}}{VAR_{total}}} \tag{9.9}$$
This gives the absolute value of r; in order to determine its sign (whether the relation is positive or negative) you would need to look at the scattergram or the sign of the covariance. For the present example, relating number of lies detected to mood:

$$r = +\sqrt{\frac{1.124}{4.01}} = +\sqrt{.280} = .529$$
The values for VARmodel, VARtotal, and COVxy (which is positive) are taken from Fig. 9.1. The Ns in the variance formula cancel, so we could have defined r2 as the SSmodel divided by SStotal instead (11.24 divided by 40.1), which gives the same result.
The correlation coefficient can vary from -1 (a perfect negative relation) to zero (no relation) to +1 (a perfect positive relation), whereas r2 can vary only from zero (no variance accounted for) to +1 (all variance accounted for). If you compute r as the square root of r2, you will need to supply the appropriate sign. As noted earlier, the covariance (cell N15 in Fig. 9.1) indicates the direction of the relation, so its sign can be used if you compute r as the square root of r2. In addition, there are several other ways to compute the standard Pearson correlation coefficient. One of these is demonstrated in the next exercise.

Exercise 9.2 Computing the Correlation Coefficient Using Z Scores
The purpose of this exercise is to add the ability to compute the correlation coefficient (r) to the template shown in Fig. 9.1. The method used for computing the correlation coefficient is based on Z or standardized scores.

General Instructions
1. Add three columns, labeled appropriately and containing the appropriate formulas, to your Exercise 9.1 spreadsheet. These columns are for:
   a. Zy, the standardized Y score.
   b. Zx, the standardized X score.
   c. Zy Zx, the product of Zy and Zx.
2. Establish rows that compute sums, counts, and means for these columns as appropriate.
3. Provide labels and formulas for r and r2. Note that r is the average cross product of the Z scores. Compute r2 by dividing the variance accounted for by the model by the total variance. Is the square root of this value the same as the average cross product of the Z scores? Would it be if the average cross product were negative?

Detailed Instructions
1. Begin with the spreadsheet shown in Fig. 9.1. In row 2, add the following labels to columns O-Q:

   Label   Column   Meaning
   Zy      O        Standard or Z scores for number of lies detected.
   Zx      P        Standard or Z scores for mood.
   Zy*Zx   Q        The cross product of Zy and Zx.

2. In column O, enter the formula for a standardized lie score. (Use the standard deviation for Y in cell I16.) If you do not remember the formula, reread the section on standard scores in Chapter 5.
3. In column P, enter the formula for a standardized mood score. (Use the standard deviation for X in cell M16.)
4. In column Q, enter a formula for the cross product of the Zy and Zx scores.
5. In cells O13-Q13, enter a function for summing the data entries in the column. In cells O14-Q14, enter the function for counting the number of data entries in columns O to Q. In cells O15-Q15, enter a formula for the mean of the data entries in columns O to Q. The value in cell Q15 is the mean of the cross products of the Z scores.
6. Enter the label "r=" (for correlation coefficient) in cell A17 and the label "R2=" (for r2 or r-squared) in cell H17.
7. Point cell B17 to cell Q15 (the mean cross product of the standard scores).
8. In cell I17, enter a formula for dividing model variance (cell J15) by total variance (cell I15). This is r2.
At this point, your spreadsheet should look like the one shown in Fig. 9.2. You already know from Equation 9.9 that the value of the correlation coefficient for our current running example is 0.529 (the square root of VARmodel divided by VARtotal). An examination of the exercise just completed should reveal a second way to define and compute the correlation coefficient. As you can see, the correlation coefficient is the average cross product of the standard or Z scores:

$$r = \frac{\sum Z_{X_i} Z_{Y_i}}{N}$$

This formulation emphasizes the fact that the correlation coefficient is an average statistic that, like the mean, summarizes data for a group. For the present example (and rounding the correlation coefficient to two significant digits),

$$r = \frac{5.295}{10} = .53$$

There is another way to define and compute the correlation coefficient. Again, the values needed are given in the spreadsheet you just prepared. This definition, like the one for the regression coefficient given in Equation 9.7, uses the variances and covariances for the two variables involved. It is:

$$r = \frac{COV_{XY}}{\sqrt{VAR_X \, VAR_Y}} \tag{9.11}$$
    A      B      H     I      O      P      Q
1          Lies         SSY
2   s      Y            y*y    Zy     Zx     Zy*Zx
3   1      3            5.29   -1.15  0.288  -0.33
4   2      2            10.89  -1.65  -1.26  2.079
5   3      4            1.69   -0.65  -0.15  0.101
6   4      6            0.49   0.35   -0.82  -0.29
7   5      6            0.49   0.35   -1.48  -0.52
8   6      4            1.69   -0.65  0.73   -0.47
9   7      5            0.09   -0.15  -0.6   0.09
10  8      7            2.89   0.849  0.952  0.808
11  9      7            2.89   0.849  0.509  0.432
12  10     9            13.69  1.848  1.837  3.395
13  Sum=   53     Sum=  40.1   0      2E-15  5.295
14  N=     10     N=    10     10     10     10
15  Mean=  5.3    VAR=  4.01   0      2E-16  0.529
16  a,b=   3.024  SD=   2.002
17  r=     0.529  R2=   0.28
FIG. 9.2. Spreadsheet for computing the correlation between number of lies detected and mood scores. Columns that are the same as Fig. 9.1 are not shown.
For the present example,

$$r = \frac{2.395}{\sqrt{5.103 \times 4.01}} = \frac{2.395}{4.524} = .529$$
(The values for these variances and covariances are given in cells I15, M15, and N15 in Fig. 9.1.) In other words, the correlation coefficient can be defined as the ratio of the covariation of X and Y, relative to the square root of the variation of X and Y considered separately. Both variance and covariance are group summary statistics, reflecting average variability about group means, which again serves to emphasize that r is an average, group-based statistic. It does not indicate the relation between two variables for an individual, but rather indicates the average relation between those variables for a group of individuals. However defined or computed, the correlation coefficient is by far the most common index of association for two quantitative variables. Moreover, the same identical computations can be applied when both variables are ranked (ordinal) data, in which case the statistic is called the Spearman rank-order correlation coefficient; when one of the variables is quantitative (i.e., interval-scaled) and the other is binary, in which case the statistic is called a point-biserial correlation; and when both variables are binary data, in which case the statistic is called the phi coefficient. Historically, special computational formulas have been developed for each of these cases, but they are merely computationally convenient derivatives of the formulas presented here. Unfortunately, the separate formulas have led generations of students to think these variants of the correlation coefficient were somehow different things. Because of their widespread use, it is important for you to know the different names given these correlation coefficients, but at the same time to understand that they are fundamentally the same. One final comment: We think it often makes more sense to report r2 instead of r. The units of the correlation coefficient have no particular meaning. 
An r of .4, for example, is not twice as much correlation as an r of .2, and the difference between an r of .8 and an r of .9 is not the same as the difference between an r of .4 and an r of .5. The units of r2, however, do have a more intuitively grasped meaning: the proportion of variance accounted for.
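The three definitions of r given in this section can be compared side by side. The following Python sketch, using only the standard library and variable names of our own choosing, computes r for the lie and mood scores each way; it is meant as a check on the spreadsheet, not as a replacement for it.

```python
import math

lies = [3, 2, 4, 6, 6, 4, 5, 7, 7, 9]            # Y
mood = [5.5, 2, 4.5, 3, 1.5, 6.5, 3.5, 7, 6, 9]  # X
n = len(lies)
mean_x, mean_y = sum(mood) / n, sum(lies) / n

var_x = sum((x - mean_x) ** 2 for x in mood) / n
var_y = sum((y - mean_y) ** 2 for y in lies) / n   # VARtotal
cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(mood, lies)) / n

# Way 1: square root of r2 (VARmodel / VARtotal), signed like the covariance.
b = cov_xy / var_x
a = mean_y - b * mean_x
var_model = sum((a + b * x - mean_y) ** 2 for x in mood) / n
r_from_r2 = math.copysign(math.sqrt(var_model / var_y), cov_xy)

# Way 2: average cross product of the Z scores.
sd_x, sd_y = math.sqrt(var_x), math.sqrt(var_y)
r_from_z = sum(
    ((x - mean_x) / sd_x) * ((y - mean_y) / sd_y) for x, y in zip(mood, lies)
) / n

# Way 3: covariance divided by the square root of the product of the variances.
r_from_cov = cov_xy / math.sqrt(var_x * var_y)

print(round(r_from_r2, 3), round(r_from_z, 3), round(r_from_cov, 3))
# prints: 0.529 0.529 0.529
```

Only the first way needs the sign supplied separately; the other two carry the sign of the covariance automatically.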
The Correlation and Regression Coefficients Compared
The correlation coefficient (r) and the regression coefficient (b, or the slope of the best-fit line) appear very similar. According to Equation 9.11, one definition for r is:

$$r = \frac{COV_{XY}}{\sqrt{VAR_X \, VAR_Y}}$$

And according to Equation 9.7, the definition for b is:

$$b = \frac{COV_{XY}}{VAR_X}$$

The similarity between the two equations is made more evident if we substitute the square root of the X variance squared for VARx in the denominator:

$$b = \frac{COV_{XY}}{\sqrt{VAR_X \, VAR_X}}$$
This superficial similarity between r and b, however, masks a very real difference. As noted earlier, the regression coefficient indicates the exact nature of the relation between X and Y. Its value indicates how much Y increases (or decreases, if b is negative) for each unit change in X. Its units are Y units per X unit (e.g., lies per mood point). The correlation coefficient, on the other hand, is "unit free" (numerator and denominator units cancel). It is a pure index of the strength of the association between X and Y. Both correlation and regression coefficients convey important, but different, pieces of information.
When reporting the results of the lie detection study, it would be traditional to say that mood and number of lies detected correlate .529. As noted earlier, our preference is to report r2, not r, because we think it is more informative to say that 28.0% of the variance in number of lies detected is accounted for when mood scores are taken into account. Either way, r or r2 provides information about the strength of the relation. In addition, in order to provide information about the exact nature of the relation, you would report that for each increase of 1 mood point, 0.469 more lies were detected on average. Alternatively, because decimals less than 1 often seem difficult to grasp, you might report instead that for each increase of 2.13 mood points, one more lie is detected (recall the algebra used in Chap. 8).
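One way to see that b carries units whereas r does not is to change the units of X and recompute both. In the Python sketch below, the rescaled mood scores (each score multiplied by 10) are a hypothetical variation introduced only for this illustration, and `slope_and_r` is our own helper name.

```python
import math

def slope_and_r(xs, ys):
    """Return (b, r) for predicting ys from xs, using variances and covariance."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs) / n
    var_y = sum((y - mean_y) ** 2 for y in ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    b = cov_xy / var_x                        # lies per X unit
    r = cov_xy / math.sqrt(var_x * var_y)     # unit free
    return b, r

lies = [3, 2, 4, 6, 6, 4, 5, 7, 7, 9]
mood = [5.5, 2, 4.5, 3, 1.5, 6.5, 3.5, 7, 6, 9]

b_points, r_points = slope_and_r(mood, lies)

# Hypothetical rescaling: the same mood scores expressed in tenths of a point.
mood_tenths = [10 * x for x in mood]
b_tenths, r_tenths = slope_and_r(mood_tenths, lies)

print(round(b_points, 3), round(b_tenths, 4))   # slope shrinks by a factor of 10
print(round(r_points, 3), round(r_tenths, 3))   # correlation is unchanged
print(round(1 / b_points, 2))                   # mood points per additional lie
```

The slope changes with the measurement scale because it is expressed in lies per unit of X; the correlation stays the same because the units cancel.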
Note 9.2
b
The simple regression coefficient. Given a single predictor variable, b is the slope of the best-fit line. It indicates the exact nature of the relation between the predictor and criterion variables. Specifically, b is the change in the criterion variable for each one unit change in the predictor variable. To compute (assuming that Y is the criterion or dependent variable), divide the XY covariance by the X variance (b = COVxy / VARx).
a
The regression constant. When Y, the criterion variable, is regressed on X, a single predictor variable, a is the Y intercept of the best-fit line. To compute, subtract the product of b (the simple regression coefficient) and the mean of X from the mean of Y (a = MY - b MX).
r
The correlation coefficient or, more formally, the Pearson product-moment correlation coefficient. It is an index of the strength of the relation between two variables, and its values can range from -1 to +1. To compute, find the average cross product of the Zx and Zy scores (r = Σ Zxi Zyi / N, where i = 1,N), or take the square root of r2 and assign it the same sign as the covariance.
r2
The coefficient of determination, or r-squared. It indicates the proportion of criterion variable variance that can be accounted for given knowledge of the predictor variable. To compute, divide model variance by total variance (r2 = VARmodel / VARtotal).
9.3 DETECTING GROUP DIFFERENCES WITH A BINARY PREDICTOR
Perhaps one of the most common research questions asked is, do two groups differ? The groups may differ naturally in some way or they may have been formed by random assignment and then treated differently, but in either the correlational or experimental case, the investigator wants to know whether the average of some measure of interest is different for the two groups. Imagine, for example, that 5 of the 10 subjects in the lie detection study had received a drug that made them less physiologically reactive than normal whereas the other 5 had received a placebo. Such a study design could be motivated by the hypothesis that a tranquilizing drug will make lies more difficult to detect. A question like this can be answered easily with only slight modification to the template shown in Fig. 9.2, as you will see in the next exercise. Then, in the next chapter, you will learn how the statistical significance of this information can be assessed. You can then use the template you are about to develop whenever subjects fall into two groups.
But is the difference between the means for the two groups statistically significant? The procedure developed here gives results identical with those obtained using Student's t test, a test that was developed early in the twentieth century by Gosset (who used the pen name Student). However, as we mentioned in chapter 4 and discuss again in the next chapter, we prefer to emphasize a more general approach, one appropriate not only for situations in which the historically much-used t test could be applied but for many other situations as well. This general approach uses regression computations to perform what is called a one-way analysis of variance and requires that a subject's group membership (in the present case, whether the subject received a drug or placebo) be treated as a predictor variable.
Exactly how a categorical variable like group membership can be coded so that its qualitative information is represented quantitatively is an important topic discussed in some detail in chapter 12. For present purposes, and without further justification, we use a single predictor variable to represent group membership. Arbitrarily, values for this variable are set to 1 for subjects who received the drug and to 0 for subjects who received the placebo.

Exercise 9.3 Predicting Lies Detected From Drug Group
The template developed during this exercise can be used whenever you want to describe the association between a predictor and a criterion variable. The predictor variable can be binary (indicating membership in one of two groups, for example) or quantitative (like mood score).
1. Begin with the spreadsheet shown in Fig. 9.2. Change the formula for r from the average cross products of the Z scores to the covariance divided by the square root of the product of the X and Y variances. This eliminates the need for the columns containing Zy, Zx, and their cross product. You can erase these columns.
2. Replace the mood scores with drug group scores (and change labels appropriately). Enter 1 for the first five subjects (indicating that they belong to the drug group) and 0 for the last five (indicating that they belong to the placebo group).
3. You are now done. All other formulas carried over from the previous template should be in place and correct.
4. As a final exercise, verify that Equation 9.9 gives the same value for r as Equation 9.11. The value for r currently displayed is based on Equation 9.11 (r = the XY covariance divided by the square root of the product of the X and Y variances). In another cell, enter a formula based on Equation 9.9 (r = the square root of the quotient resulting from dividing the model variance by total variance). Except for one detail, the two values you just computed should be identical. In what way are they different? Why?

At this point, your spreadsheet should look like the one shown in Fig. 9.3. From it you know that knowledge of drug group (drug or placebo) allows you to account for 30% of the variance in number of lies detected. This is a reasonable amount, and a bit higher than the 28% accounted for by mood scores (see Fig. 9.2). You also know that the value of the correlation coefficient describing the association between drug group and number of lies detected is -.55 (rounded to two significant digits).
Drug group is a binary categorical variable. When the predictor variable is quantitative, the sign of the correlation coefficient indicates the direction of the relation (a positive sign indicates that increases in the IV tend to be associated with increases in the DV; a negative sign indicates that increases in the IV tend to be associated with decreases in the DV) and the Y intercept and slope have the meanings described in chapter 8 and in the previous section of this chapter. When the predictor variable is categorical, however, the sign of the correlation coefficient and the values for Y intercept and slope need to be interpreted in light of the arbitrary codes assigned to the two levels of the binary variable. In this case the drug group was coded 1 and the placebo group was coded 0.
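The exercise above can also be sketched in a few lines of Python, again only as an illustration with our own variable names. The lie scores are those of the running example, and the drug codes follow the exercise: 1 for the first five subjects, 0 for the last five.

```python
import math

lies = [3, 2, 4, 6, 6, 4, 5, 7, 7, 9]   # Y, number of lies detected
drug = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]   # X: 1 = drug group, 0 = placebo group
n = len(lies)

mean_x, mean_y = sum(drug) / n, sum(lies) / n
var_x = sum((x - mean_x) ** 2 for x in drug) / n
var_y = sum((y - mean_y) ** 2 for y in lies) / n
cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(drug, lies)) / n

b = cov_xy / var_x                       # slope (regression coefficient)
a = mean_y - b * mean_x                  # Y intercept (regression constant)
r = cov_xy / math.sqrt(var_x * var_y)    # correlation, as in Equation 9.11
r2 = r ** 2

# Observed group means, for comparison with the predicted scores.
drug_mean = sum(y for x, y in zip(drug, lies) if x == 1) / 5
placebo_mean = sum(y for x, y in zip(drug, lies) if x == 0) / 5

print(round(a, 2), round(b, 2), round(r, 2), round(r2, 2))
print(round(a + b * 1, 2), drug_mean)      # predicted score for code 1 vs. drug mean
print(round(a + b * 0, 2), placebo_mean)   # predicted score for code 0 vs. placebo mean
```

With 0 and 1 as codes, the predicted score for the group coded 0 is simply a, and the predicted score for the group coded 1 is a + b, which is why the predictions coincide with the two group means.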
There are only two values for the predictor variable; thus there are only two values for the predicted scores. The predicted score for all subjects in the drug group was 4.2, which is the mean of the observed scores for the five drug group subjects, whereas the predicted score for all subjects in the placebo group was 6.4, which is the mean of the observed scores for the five placebo group subjects (see Fig. 9.3). You may want to ponder why, using a binary predictor variable, predicted values based on the best-fit line are the mean scores for the two groups. For now, however, note that because the lower code (zero for the placebo group) was assigned to the group with the higher mean (M = 6.4), the best-fit line (which you will graph during the next exercise) tilts upward to the left and as a result the correlation coefficient is negative (-0.55). If codes assigning a higher number to the placebo than to the drug group were chosen, the correlation coefficient would have been positive instead. Just as the sign of the correlation coefficient depends on the codes chosen for the levels of the binary variable, so do the values computed for the Y intercept and slope. In this case, the Y intercept (or regression constant) is 6.4, which is the mean for the group coded zero (the placebo group). The slope (or regression coefficient) is -2.2, which means that for a unit increase in drug group, the number of lies detected decreases by 2.2. The drug group was coded 1, which is a unit increase from the code for the placebo group, so this means that the mean number of lies for the drug group is 2.2 less than the mean for the placebo group, which it is (6.4 - 4.2 = 2.2). However, because the values computed for a and b are affected by the codes used, the exact nature of the relation between a binary predictor and a quantitative criterion variable is best conveyed by simply reporting the group means in the first place, by saying that the mean number of
lies detected was 4.2 for subjects who had the drug and 6.4 for those who received the placebo. Whether the predictor variable is qualitative or quantitative, r and r2 are
                                                         SStot   SSmod   SSerr             SSX    SSXY
s       Lies Y  Drug X      Y'  y=Y-My m=Y'-My  e=Y-Y'     y*y     m*m     e*e  x=X-Mx     x*x     x*y
1            3       1     4.2    -2.3    -1.1    -1.2    5.29    1.21    1.44     0.5    0.25   -1.15
2            2       1     4.2    -3.3    -1.1    -2.2   10.89    1.21    4.84     0.5    0.25   -1.65
3            4       1     4.2    -1.3    -1.1    -0.2    1.69    1.21    0.04     0.5    0.25   -0.65
4            6       1     4.2     0.7    -1.1     1.8    0.49    1.21    3.24     0.5    0.25    0.35
5            6       1     4.2     0.7    -1.1     1.8    0.49    1.21    3.24     0.5    0.25    0.35
6            4       0     6.4    -1.3     1.1    -2.4    1.69    1.21    5.76    -0.5    0.25    0.65
7            5       0     6.4    -0.3     1.1    -1.4    0.09    1.21    1.96    -0.5    0.25    0.15
8            7       0     6.4     1.7     1.1     0.6    2.89    1.21    0.36    -0.5    0.25   -0.85
9            7       0     6.4     1.7     1.1     0.6    2.89    1.21    0.36    -0.5    0.25   -0.85
10           9       0     6.4     3.7     1.1     2.6   13.69    1.21    6.76    -0.5    0.25   -1.85
Sum=        53       5      53       0       0       0    40.1    12.1      28       0     2.5    -5.5
N=          10      10      10      10      10      10      10      10      10      10      10      10
Mean=      5.3     0.5     5.3
VAR=                                                      4.01    1.21     2.8            0.25   -0.55
SD=                                                      2.002                             0.5
a = 6.4, b = -2.2, r = -0.55, R2 = 0.302 (the x*y entry in the VAR row is the XY covariance)
FIG. 9.3. Spreadsheet for predicting lies detected from drug used. Drug is treated as a binary categorical variable: 0 = no drug, 1 = drug.
important statistics to report because they indicate the strength of the association. In addition, for a quantitative predictor, the slope of the best-fit line indicates the exact nature of the relation, whereas for a qualitative predictor, the same information is conveyed by group means. In either case, whenever you want to know if one variable affects another, the template developed for the last exercise provides the basic descriptive statistics. For now, we consider cases for which the single independent variable, if qualitative, has only two values or levels (drug vs. placebo group, for example), although later we show how to deal with more than two levels. How those two levels are coded is not especially critical, as the next exercise demonstrates.
Exercise 9.4 Codes for a Two-Group Predictor Variable This exercise uses the template developed for the last exercise. Its purpose is to demonstrate some interesting properties of coded variables. 1. Begin with the spreadsheet shown in Fig. 9.3. Arbitrarily we had decided to code the drug group 1 and the placebo 0. Change the codes to -1 for the drug group and +1 for the placebo. How does this affect the values for a, b, and r2? 2. Imagine that the drug variable is quantitative, not qualitative, and that two levels of drug were selected for study. The first five subjects received 4 mg, the last five, 8 mg of the drug. Change the data in column C. What are the values of a, b, and r2 now? 3. Select one other pair of codes (in addition to 1/0 and -1/+1) for the two groups and recompute a, b, and r2. What do you conclude about the effect of the values selected to represent the two groups on the values computed for a, b, and r2? 4. Now turn your attention to Y', the predicted values. How are they affected by different ways of coding drug group? When a single binary predictor variable is used, what will the predicted scores always be?
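For readers who want to check their spreadsheet against an independent computation, the recoding in this exercise can be sketched in Python. This is a minimal sketch, not the book's template: the helper names fit_line and recode are hypothetical, and the variances and covariance divide by N, as in the spreadsheet.

```python
def fit_line(x, y):
    """Return (a, b, r2) for the least-squares line Y' = a + bX.

    Variances and the covariance divide by N (descriptive statistics).
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    var_x = sum((xi - mx) ** 2 for xi in x) / n
    var_y = sum((yi - my) ** 2 for yi in y) / n
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / n
    b = cov / var_x              # slope (regression coefficient)
    a = my - b * mx              # Y intercept (regression constant)
    r2 = cov ** 2 / (var_x * var_y)
    return a, b, r2


# Number of lies detected for the 10 subjects in Fig. 9.3
# (first five: drug group; last five: placebo group).
lies = [3, 2, 4, 6, 6, 4, 5, 7, 7, 9]


def recode(first_five, last_five):
    """Build the predictor column from a pair of codes (hypothetical helper)."""
    return [first_five] * 5 + [last_five] * 5


results = {}
for codes in [(1, 0), (-1, 1), (4, 8)]:
    x = recode(*codes)
    a, b, r2 = fit_line(x, lies)
    # The distinct predicted scores, one per group.
    predicted = sorted({round(a + b * xi, 6) for xi in x})
    results[codes] = (a, b, r2, predicted)
    print(codes, round(a, 3), round(b, 3), round(r2, 3), predicted)
```

Comparing the printed rows shows which quantities depend on the arbitrary codes and which do not.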
9.4 GRAPHING THE REGRESSION LINE
As mentioned earlier, if the independent variable is quantitative, the regression coefficient has a clear interpretation. It indicates the change in DV units expected for each one unit change in the IV. If the independent variable is qualitative, as in the last two exercises, the interpretation of the slope is less clear. This is because the numeric codes assigned the two categories of a binary nominal variable are arbitrary. Still, graphing the relation between a quantitative criterion variable (like number of lies detected) and a qualitative predictor variable (like drug group) can be meaningful, as the next exercise demonstrates.
Exercise 9.5 Graphing the Regression Line For this exercise, you graph two regression lines. One shows the relation between number of lies detected and mood score, the other shows the relation between number of lies detected and drug group. 1. Graph the lies/mood data given in Fig. 9.2. (This will look like the scattergram shown in Fig. 8.2.) Now, using the parameters given in Fig. 9.2, graph the regression line. If you are using your spreadsheet to draw this "xy" graph, graph the raw data using symbols (this gives the data points) and then graph the predicted scores using lines (this gives the regression line; Excel terms, x = mood scores, a = raw data, b = predicted data). 2. Note the mean value for the mood scores on the X axis. Draw a vertical line through it. Now note the mean value for number of lies detected and draw a horizontal line through it. If you have done this correctly, all three lines (the regression line and the two mean lines) cross at a common point. Why must this be so? 3. Using the graph, estimate about how many more lies are detected if mood increases from 4 to 6. (Draw a vertical line from mood = 4 to the regression line, then draw a horizontal line from that point to the Y axis. Do the same for mood = 6.) If you did not want to use this graphic method, which is only approximate, how could you have computed the value exactly? 4. Now graph the lies/drug data given in Fig. 9.3. (Use 1 to indicate the drug and 0 the placebo group.) In this case the slope is negative. How do you interpret this? 5. What is the predicted number of lies detected when drug = 0? When drug = 1? (Draw a line vertically from drug = 0 to the regression line and then horizontally to the Y axis. Do the same for drug = 1.) How else might you have determined these values? 6. In this case, what concrete meaning can you give the Y intercept and slope? 
As you can see, whether a predictor variable is quantitative (like mood score) or binary (like drug group), the regression line superimposed on the raw data provides a way to visualize the relation between the independent and dependent variables (see Figs. 9.4 and 9.5). The slope indicates the exact nature of the relation between criterion and predictor variables. It indicates the change in the criterion variable occasioned by a unit change in the predictor. If it is positive, increases in the predictor are associated with increases in the criterion variable; if negative, with decreases in the criterion variable. For a binary predictor, the regression line passes through the means for the two groups and its slope indicates the amount of difference between them. No matter whether the predictor is quantitative or qualitative, r2 indicates the strength of the relation between criterion and predictor variables. In other words, it indicates the proportion of variance accounted for when predictions for the dependent variable are made, based not just on the mean, but with knowledge of the value of the independent variable as well.
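The properties just described can be checked numerically for the lies/drug data of Fig. 9.3. The following is a minimal sketch with illustrative variable names (none of them come from the book's spreadsheet): it fits the line, confirms that it passes through the point defined by the two means and through the two group means, and computes r2 as the model sum of squares divided by the total sum of squares.

```python
# Data from Fig. 9.3: 1 = drug group, 0 = placebo group.
lies = [3, 2, 4, 6, 6, 4, 5, 7, 7, 9]
drug = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

n = len(lies)
mean_x = sum(drug) / n
mean_y = sum(lies) / n

# Least-squares slope and intercept from the covariance and the X variance.
cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(drug, lies)) / n
var_x = sum((x - mean_x) ** 2 for x in drug) / n
b = cov_xy / var_x
a = mean_y - b * mean_x

predicted = [a + b * x for x in drug]

# Partition of the total sum of squares into model and error pieces.
ss_total = sum((y - mean_y) ** 2 for y in lies)
ss_model = sum((p - mean_y) ** 2 for p in predicted)
ss_error = sum((y - p) ** 2 for y, p in zip(lies, predicted))
r_squared = ss_model / ss_total

# Height of the regression line above the mean of X.
line_at_mean_x = a + b * mean_x

print("a =", round(a, 2), "b =", round(b, 2))
print("line at mean X:", round(line_at_mean_x, 2), "mean Y:", round(mean_y, 2))
print("predicted for drug = 1:", round(a + b * 1, 2))
print("predicted for drug = 0:", round(a + b * 0, 2))
print("r squared:", round(r_squared, 3))
```

Because a is defined as the mean of Y minus b times the mean of X, the line necessarily passes through the point where the two mean lines cross.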
FIG. 9.4. Regression line for predicting number of lies detected from mood scores: a = 3.024, b = 0.469, r = .529.
FIG. 9.5. Regression line for predicting number of lies detected from drug group: a = 6.40, b = -2.20, r = -.549. On average, 4.2 lies were detected for drug-group subjects compared to 6.4 for placebo-group subjects.
Exercise 9.6 More Accounting for Variance This exercise provides additional practice in variance accounting. You can easily modify the last spreadsheet in order to do these computations. 1. Sixteen infants were recruited for a study. The number of different words each infant spoke at 18 months of age was 32, 27, 48, 34, 33, 30, 39, 23, 24, 25, 36, 31, 19, 28, 32, and 22 for the 1st through the 16th infant, respectively. The first 7 infants all had no older siblings. The remaining 9 infants (the 8th through the 16th) had 2, 1, 4, 1, 1, 3, 2, 1, and 5 older siblings, respectively. What is the correlation between number of words spoken and number of older siblings? How do you interpret this correlation? How much variance in number of words spoken is accounted for by the number of older siblings?
2. What if you had made a data entry error and had entered 4 instead of 48 for the 3rd subject's number of words? What would r and r2 be then? What if you had entered 77 instead of 22 for the last subject's number of words? Or 222 instead of 22? Or 10 instead of 0 for the 7th subject's number of older siblings? What do you infer about the potential effect of a single incorrect datum? 3. What if you had divided the infants into two groups, those with no older siblings (coded 0) and those with one or more older siblings (coded 1)? What is the correlation between the number of words spoken and the binary code for none versus one or more older siblings? How do you interpret this correlation? How much variance in number of words spoken is accounted for by knowing whether infants had at least one as opposed to no older siblings? 4. How many words are predicted for the 7 infants who had no older siblings? For the 9 who did? Do these values surprise you?
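As a cross-check on the spreadsheet, the correlations asked for in this exercise, including the effect of a single mistyped datum, can be sketched in Python. The function names pearson_r and with_error are hypothetical helpers introduced for illustration only.

```python
import math


def pearson_r(x, y):
    """Pearson correlation: SSXY / sqrt(SSX * SSY)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    ss_xy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    ss_x = sum((a - mx) ** 2 for a in x)
    ss_y = sum((b - my) ** 2 for b in y)
    return ss_xy / math.sqrt(ss_x * ss_y)


def with_error(values, index, wrong_value):
    """Return a copy of values with one entry replaced (a simulated typo)."""
    copy = list(values)
    copy[index] = wrong_value
    return copy


# Data from Exercise 9.6 (infants 1 through 16).
words = [32, 27, 48, 34, 33, 30, 39, 23, 24, 25, 36, 31, 19, 28, 32, 22]
siblings = [0] * 7 + [2, 1, 4, 1, 1, 3, 2, 1, 5]

r_correct = pearson_r(siblings, words)

# Each scenario changes exactly one datum, as in item 2 of the exercise.
scenarios = {
    "4 instead of 48 (3rd infant, words)": pearson_r(siblings, with_error(words, 2, 4)),
    "77 instead of 22 (16th infant, words)": pearson_r(siblings, with_error(words, 15, 77)),
    "222 instead of 22 (16th infant, words)": pearson_r(siblings, with_error(words, 15, 222)),
    "10 instead of 0 (7th infant, siblings)": pearson_r(with_error(siblings, 6, 10), words),
}

print("correct data: r =", round(r_correct, 3), "r2 =", round(r_correct ** 2, 3))
for label, r in scenarios.items():
    print(label, ": r =", round(r, 3), "r2 =", round(r ** 2, 3))
```

Running the scenarios side by side makes the leverage of one extreme value easy to see.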
Exercise 9.7 Regression in SPSS In this exercise you will use SPSS to conduct a regression analysis and create a graph of the lies and mood data. 1. Invoke SPSS. Create variables and enter the Lies and Mood data, or copy the data from the spreadsheet you last used in Exercise 9.2. 2. Select Analyze->Regression->Linear from the main menu. Move Lies to the Dependent window and Mood to the Independent(s) window. Click on Save and check the Unstandardized boxes under Predicted Values and Residuals. Click on Continue and then OK. 3. Examine the Model Summary in the output. Do the values for R and R2 agree with your spreadsheet? Now look at the ANOVA box. The Sums of Squares Regression, Residual, and Total should agree with the values from your spreadsheet for SSmod, SSerr, and SStot, respectively. 4. Examine the coefficients in the output. Can you find the values for a and b? 5. Finally return to the Data Editor. In step two you instructed SPSS to create two new variables. The variable pre_1 contains the Y' predicted values and res_1 contains the residuals (Y-Y'). Do these values agree with your spreadsheet? 6. Select Graphs->Scatter from the main menu. Click on Simple and then Define. Move Lies to the Y axis window and Mood to the X axis window. Click on OK. 7. To create a scatter plot and regression line for the lies and mood data, double click on the scatter plot to open the chart editor. Select Chart->Options from the main menu and check Total under Fit Line. 8. Save the lies and mood data. 9. For additional practice you should try running the SPSS Regression procedure and creating scatter plots for the lies and drug data. Remember to save the data in a separate file.
Both the exact nature and the strength of a relation or association are important descriptive topics. In the next chapter we consider a third and different topic: the statistical significance of an observed association. There we ask the key inferential question: Is the association observed between two variables in a sample large enough to merit attention? In other words, is the observed association statistically significant: Is it likely a real effect and not just a chance result?
10 Inferring From a Sample: The F Distribution
In this chapter you will:
1. Learn how to estimate the population variance from the value computed for the sample.
2. Be introduced to the concept of degrees of freedom.
3. Learn about the F distribution.
4. Learn how to determine whether the proportion of variance accounted for in a sample by a single predictor is statistically significant.
5. Learn how to do a one-way analysis of variance with two independent groups.
In chapter 5 you learned how to compute the variance for a sample of scores, and in the last chapter you learned how a portion of that sample variance could be accounted for, using the best prediction equation (in the least-squares sense) and values for a single independent variable. These topics belong to the realm of descriptive statistics. In this chapter, on the other hand, we discuss ways of moving beyond the sample, inferring facts and relations that probably characterize the population from which the sample was drawn. Thus the material presented here, like that in chapter 7, concerns inferential statistics.
10.1 ESTIMATING POPULATION VARIANCE
Equation 5.4 in chapter 5 defined the sample variance as the sum of squares divided by the sample size. That is:
$$\mathrm{VAR} = \frac{SS}{N} \qquad (10.1)$$
Subscripts such as "X" or "Y" or "total" can be added to VAR and SS to identify the group of scores or the sample in question, but in general terms, the variance for a sample of scores is the average squared deviation from the mean for scores in that sample. It is computed by subtracting the sample mean from each score, squaring each difference, summing the squares, and dividing the sum by the
number of scores in the sample. This is a perfectly reasonable way to describe the extent to which scores in a sample are dispersed about the mean, but as an estimate of the value of the population variance (σ²), the formula for the sample variance has one undesirable characteristic. It is biased.
Biased and Unbiased Estimates
Not all estimates are biased. For example, imagine that we drew 100 samples and computed the mean for each. Each sample mean would be an unbiased estimate of the population mean. Rarely would a particular sample mean exactly equal the value for the population mean. Some values might be too high, some too low, but overall there should be no pattern to the direction of the errors (for example, no tendency for the sample means to be smaller than the population mean). A statistic that is biased, on the other hand, will tend to give values that are either too low, on average, or else too high: The values will be consistently off in a particular direction. For example, if we computed the variance for each of the same 100 samples, many more of the values computed for the sample variances would underestimate the population variance than would overestimate it. Thus the sample variance is a biased estimate of the population variance because it consistently underestimates the true population value. This makes intuitive sense. When drawing samples from a population, we are likely to miss some of the rarer extreme scores, so it is likely that the variability in the sample will be less than the variability in the population. Statisticians have proved to their satisfaction that the following formula provides an unbiased estimate of the population variance (recall from chapter 5 that the apostrophe or prime after a symbol indicates an estimate):
$$\mathrm{VAR}' = \frac{SS}{N - 1} \qquad (10.2)$$
This formula was first introduced in chapter 5 (Equation 5.5), but without explaining that the purpose of dividing by N - 1 was to provide an unbiased estimate. When first encountered, this formula may seem too neat. Why should dividing the sum of the squared deviation scores by N - 1 instead of N produce an unbiased estimate? Accept for now that proof is possible (e.g., see Hays, 1981), that the sample variance consistently underestimates the population value, and that in order to provide an unbiased estimate a correction factor is needed, some number that will remove the bias from the sample statistic. This correction factor must be greater than 1 (because the population variance is larger than the sample variance) and must become smaller as the sample size becomes larger (because bias decreases with sample size). Statisticians agree that this correction factor is exactly N divided by N - 1. Hence:
$$\mathrm{VAR}' = \frac{N}{N - 1}\,\mathrm{VAR} \qquad (10.3)$$
This equation reduces to Equation 10.2 because VAR equals SS divided by N, and because the Ns cancel. In other words:
$$\mathrm{VAR}' = \frac{N}{N - 1} \times \frac{SS}{N} = \frac{SS}{N - 1}$$
Similarly, the sample standard deviation (SD) is a biased estimate of the true population standard deviation (σ). Like the sample variance, the sample standard deviation underestimates the true population value. A better estimate of the population standard deviation is the square root of the estimated population variance or the square root of the quotient of the sum of squares divided by N - 1:
$$\mathrm{SD}' = \sqrt{\mathrm{VAR}'} = \sqrt{\frac{SS}{N - 1}} \qquad (10.4)$$
Technically, a second correction factor is necessary to make this a truly unbiased estimate. However, for sample sizes larger than 10 the amount of bias is small, and thus in the interests of simplicity (and remembering that the analysis of variance is based on variances, not standard deviations) the correction is usually ignored (again, see Hays, 1981). Indeed, many texts ignore this nicety and state without qualification that Equation 10.4 provides an unbiased estimate of σ.
Exercise 10.1 The Estimated Population Variance The purpose of this exercise is to examine the correction factor, N divided by N - 1, and to provide practice computing the estimated population variance. 1. What is the correction factor for a sample size of 5? 10? 20? 40? 100? 2. Graph sample size (on the X axis) against the correction factor (on the Y axis). Compute the correction factor for sufficient sample sizes in the range 1-200 so that the shape of the curve is accurately portrayed. As a practical matter, beginning with what sample size does it make sense to ignore the correction? 3. What is the estimated population variance for the number of lies detected (see Exercise 9.5)? 4. What is the estimated population variance for the number of words spoken (see Exercise 9.6)?
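The correction factor and the two routes to the estimated population variance can also be sketched outside the spreadsheet. The sketch below uses Python's standard statistics module, in which pvariance divides the sum of squares by N and variance divides it by N - 1; the function name correction_factor is an illustrative assumption, not a name from the book.

```python
import statistics


def correction_factor(n):
    """N divided by N - 1, the factor that converts VAR into VAR'."""
    return n / (n - 1)


# The factor shrinks toward 1 as the sample size grows.
for n in (5, 10, 20, 40, 100):
    print(n, round(correction_factor(n), 4))

# Number of lies detected (the 10 scores used in chapter 9).
lies = [3, 2, 4, 6, 6, 4, 5, 7, 7, 9]
n = len(lies)

var_sample = statistics.pvariance(lies)              # VAR  = SS / N
var_estimate = var_sample * correction_factor(n)     # VAR' = VAR * N / (N - 1)
var_direct = statistics.variance(lies)               # VAR' = SS / (N - 1)
sd_estimate = var_direct ** 0.5                      # SD'  = square root of VAR'

print("VAR =", round(var_sample, 3))
print("VAR' (corrected) =", round(var_estimate, 3))
print("VAR' (direct) =", round(var_direct, 3))
print("SD' =", round(sd_estimate, 3))
```

Both routes to VAR' should agree, which is the algebraic point made above about the Ns canceling.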
Note 10.1
VAR'  The estimated population variance. It is an unbiased estimate of σ², the true population value. It can be estimated by multiplying the sample variance, VAR, by the quotient of N divided by N - 1, or by dividing the sum of squares by N - 1. Often it is symbolized with a circumflex (a ^ or hat) above σ² or S².
SD'  The estimated standard deviation. It is almost an unbiased estimate (especially if N > 10) of σ, the true population value. It can be estimated by taking the square root of VAR', or by taking the square root of the quotient of the sum of squares divided by N - 1. Often it is symbolized with a circumflex (a ^ or hat) above σ or S.
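The claim that VAR is biased whereas VAR' is not can be illustrated with a small simulation. This is only a sketch under stated assumptions: the population is taken to be normal with a standard deviation of 1, samples contain 5 scores, and the seed, sample size, and number of replications are arbitrary choices made for illustration.

```python
import random


def sample_variance(scores):
    """VAR: sum of squared deviations divided by N."""
    m = sum(scores) / len(scores)
    return sum((s - m) ** 2 for s in scores) / len(scores)


def estimated_variance(scores):
    """VAR': sum of squared deviations divided by N - 1."""
    m = sum(scores) / len(scores)
    return sum((s - m) ** 2 for s in scores) / (len(scores) - 1)


rng = random.Random(2005)      # fixed seed so the run is repeatable
POP_SD = 1.0                   # assumed population standard deviation
N = 5                          # scores per sample
REPS = 20000                   # number of samples drawn

biased, unbiased = [], []
for _ in range(REPS):
    sample = [rng.gauss(0.0, POP_SD) for _ in range(N)]
    biased.append(sample_variance(sample))
    unbiased.append(estimated_variance(sample))

mean_biased = sum(biased) / REPS
mean_unbiased = sum(unbiased) / REPS
# Share of samples whose VAR falls below the true population variance.
share_too_low = sum(v < POP_SD ** 2 for v in biased) / REPS

print("population variance:", POP_SD ** 2)
print("average VAR :", round(mean_biased, 3))
print("average VAR':", round(mean_unbiased, 3))
print("share of samples with VAR below the population value:", round(share_too_low, 3))
```

With these settings the average VAR should settle near (N - 1)/N of the population variance, whereas the average VAR' should settle near the population variance itself.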
Mean Squares and Degrees of Freedom
There is another more general way to view the equation defining the estimated population variance:
$$\mathrm{VAR}' = \frac{SS}{df} \qquad (10.5)$$
Here N – 1 is replaced with df, which stands for degrees of freedom. In the analysis of variance tradition, an estimate of variance is usually called a mean square because it is a sum of squares divided by its appropriate degrees of freedom. Hence Equation 10.5 is often written as:
$$MS = \frac{SS}{df} \qquad (10.6)$$
Subscripts such as X or Y or total (and, as you will see, model and error) can be added to MS (and SS) to identify a particular variance estimate. But in general, a mean square is a particular sum of squares divided by its appropriate degrees of freedom. Mean squares, and the degrees of freedom used to compute them, are basic to inferential statistics. To begin with, and as described in the next section, the ratio of two mean squares, called the F ratio, plays an important role in the analysis of variance.
10.2 THE F DISTRIBUTION
Judging from articles in professional journals, perhaps the sampling distribution behavioral scientists use most often in their day-to-day work is the F distribution. It was developed primarily by R. A. Fisher during the first few decades of the 20th century and was named in his honor by another important 20th-century statistician, George Snedecor. The F statistic, or F ratio, is the ratio of two independent variance estimates or mean squares. Both the numerator and the denominator of the F ratio are mean squares, that is, sums of squares divided by their appropriate degrees of freedom. Thus, in general:
$$F = \frac{MS_{\text{numerator}}}{MS_{\text{denominator}}} = \frac{SS_{\text{numerator}} / df_{\text{numerator}}}{SS_{\text{denominator}} / df_{\text{denominator}}} \qquad (10.7)$$
Because sums of squares cannot be negative, neither can the F statistic. Its possible values range from zero to infinity. The F distribution is in fact a family of distributions. Each member of the family is characterized by its numerator and denominator degrees of freedom. Thus df might equal 2,8 for one distribution (meaning 2 df associated with the numerator SS and 8 df with the denominator SS) and 6,24 for another. The actual shape of the F distribution depends on the values for the numerator and denominator degrees of freedom (see Fig. 10.1). With two degrees of freedom in the numerator, the distribution looks like a diagonal line that sagged; values are high near zero and steadily decline as F increases. With higher degrees of freedom in the numerator, the distribution rises sharply from F = 0, rapidly reaches a peak, and then falls back more slowly as F increases. And, as you might guess from chapter 7, as the numerator and denominator degrees of freedom
FIG. 10.1. F distributions for 2 and 8 (lighter line) and for 6 and 24 (heavier line) degrees of freedom. Critical values (marked with arrows) are 4.46 for F(2,8) and 2.51 for F(6,24).
become extremely large, the shape of the F distribution approaches that of the normal. In practice, null hypotheses are usually formulated so that if the null hypothesis is true then the two independent variance estimates under consideration will be equal. And if two variances are equal, their ratio will be 1, which is the null-hypothesis expected value for the F ratio. Moreover, the mean squares selected for the numerator and the denominator of the F ratio usually require large values (certainly greater than 1) of F to disconfirm the null hypothesis. In other words, in practice the F test is always and inherently one-tailed. Although values for the F ratio less than 1 can and do occur, they usually indicate deviant data that likely violate assumptions required for an appropriate use of the F test in the first place. We could overstate the matter as follows: There are two kinds of Fs, big Fs (that allow us to reject a particular null hypothesis) and small Fs (that do not). Of primary interest to the researcher is the region of rejection, that 5% (if alpha = .05) or 1% (if alpha = .01) of the area under the sampling distribution curve that falls in the right-hand tail of the distribution and is demarcated by the critical value of F. If the computed F ratio falls in this area, that is, if the computed F ratio exceeds the critical value, then the null hypothesis is rejected. Thus the result usually desired by researchers is an F ratio big enough to reject the null hypothesis. Critical values for various members of the F distribution family, as defined by various pairs of values for the numerator and denominator degrees of freedom, are given in Table D in the statistical tables appendix. The next exercise gives you practice in using this table. At this point you know enough to answer a simple and common preliminary question.
Imagine that you have two sets of scores (e.g., one from a group of subjects who received a drug and one from another group of subjects who received a placebo) and you want to know if the variability of the scores in one
group is significantly different from the variability of the other group. (This is useful to know before we ask whether the group means differ significantly.) First you would compute estimated variances (mean squares) for each group separately; then you would divide the larger by the smaller, computing an F ratio. According to the null hypothesis, the variances for the two groups are equal (presumably because the groups were sampled from the same and not different populations). If in fact the variances are roughly equal, the F ratio will be small, not much larger than 1. But if it exceeds the critical value, as determined by the numerator and denominator degrees of freedom, then you would reject the null hypothesis. For example, if the estimated variance or mean square (VAR' or MS) is 9.62 for a group containing 16 subjects and 5.17 for a second group containing 19 subjects, the F ratio (with 15 degrees of freedom in the numerator and 18 in the denominator) is 1.86. The critical value for F(15,18) is 2.27 (alpha = .05); thus we would not reject the null hypothesis. (By convention, numbers in parentheses after an F ratio indicate degrees of freedom for the numerator and denominator, respectively.)
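The worked example above can be retraced in a few lines of Python. This is a minimal sketch; the function name variance_ratio is a hypothetical helper, and the critical value is simply the one quoted in the text rather than something the code computes.

```python
def variance_ratio(ms_a, n_a, ms_b, n_b):
    """F ratio for two variance estimates, larger one in the numerator.

    Returns (F, numerator df, denominator df); each df is group size - 1.
    """
    if ms_a >= ms_b:
        return ms_a / ms_b, n_a - 1, n_b - 1
    return ms_b / ms_a, n_b - 1, n_a - 1


# Values from the example: MS = 9.62 (16 subjects) and MS = 5.17 (19 subjects).
f_ratio, df_num, df_den = variance_ratio(9.62, 16, 5.17, 19)

CRITICAL_F_15_18 = 2.27   # critical value for F(15,18), alpha = .05, as given in the text
reject_null = f_ratio > CRITICAL_F_15_18

print("F(%d,%d) = %.2f" % (df_num, df_den, f_ratio))
print("reject the null hypothesis of equal variances:", reject_null)
```

Putting the larger estimate in the numerator guarantees an F ratio of at least 1, which is why only the right-hand tail is consulted.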
Exercise 10.2 Statistical Significance of the F statistic The purpose of this exercise is to provide practice in using Table D in the statistical tables appendix to determine critical values for the F statistic or F ratio. 1. What is the critical value for the F ratio if alpha = .05 and degrees of freedom = 3,24? If df = 6,24? 2. Would you reject the null hypothesis if F(1,8) = 5.24 and alpha = .05? Why or why not? Note that F(1,8) indicates an F statistic with 1 df in the numerator, 8 in the denominator. 3. If degrees of freedom = 2,30, what is the critical value for alpha = .05? For alpha = .01? Why must the latter be a larger number than the former? 4. If your computed F were 4.42 and the degrees of freedom were 1 for the numerator and N - 2 for the denominator, what is the smallest sample size (N) that would allow you to reject the null hypothesis, alpha = .05? 5. Can an F ratio of 3.8 with one degree of freedom in the numerator ever be significant at the .05 level? Why or why not? 6. Reaction time scores (in milliseconds) for subjects who received an experimental drug are 348, 237, 681, 532, 218, 823, 798, 644, 734, 583 and for subjects who received a placebo are 186, 463, 248, 461, 379, 436, 304, 212. Are the variances for the two groups significantly different?
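Table lookups like those in this exercise can be cross-checked numerically. The sketch below is not how Table D was produced; it is one standard approach, assuming the F cumulative distribution is evaluated through the regularized incomplete beta function (computed here with a continued fraction) and the critical value is then found by bisection. All function names are illustrative.

```python
import math


def _beta_cf(a, b, x, max_iter=300, eps=1e-12):
    """Continued fraction for the incomplete beta function (modified Lentz)."""
    tiny = 1e-300
    qab, qap, qam = a + b, a + 1.0, a - 1.0
    c = 1.0
    d = 1.0 - qab * x / qap
    d = 1.0 / (d if abs(d) > tiny else tiny)
    h = d
    for m in range(1, max_iter + 1):
        m2 = 2 * m
        # Even step of the recurrence.
        aa = m * (b - m) * x / ((qam + m2) * (a + m2))
        d = 1.0 + aa * d
        d = 1.0 / (d if abs(d) > tiny else tiny)
        c = 1.0 + aa / c
        c = c if abs(c) > tiny else tiny
        h *= d * c
        # Odd step of the recurrence.
        aa = -(a + m) * (qab + m) * x / ((a + m2) * (qap + m2))
        d = 1.0 + aa * d
        d = 1.0 / (d if abs(d) > tiny else tiny)
        c = 1.0 + aa / c
        c = c if abs(c) > tiny else tiny
        delta = d * c
        h *= delta
        if abs(delta - 1.0) < eps:
            break
    return h


def reg_inc_beta(a, b, x):
    """Regularized incomplete beta function I_x(a, b)."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    log_front = (math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                 + a * math.log(x) + b * math.log1p(-x))
    front = math.exp(log_front)
    if x < (a + 1.0) / (a + b + 2.0):
        return front * _beta_cf(a, b, x) / a
    return 1.0 - front * _beta_cf(b, a, 1.0 - x) / b


def f_cdf(f, df_num, df_den):
    """Probability that an F(df_num, df_den) variable is at most f."""
    if f <= 0.0:
        return 0.0
    x = df_num * f / (df_num * f + df_den)
    return reg_inc_beta(df_num / 2.0, df_den / 2.0, x)


def f_critical(alpha, df_num, df_den):
    """Value of F that cuts off the upper alpha tail, found by bisection."""
    lo, hi = 0.0, 1.0
    while 1.0 - f_cdf(hi, df_num, df_den) > alpha:
        hi *= 2.0
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if 1.0 - f_cdf(mid, df_num, df_den) > alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


# Cross-check against critical values quoted in the chapter.
for df_num, df_den in [(2, 8), (6, 24), (15, 18)]:
    print((df_num, df_den), round(f_critical(0.05, df_num, df_den), 2))
```

A statistics package would normally supply these functions; the point of writing them out is only to show that the tabled critical values follow from the distribution and the chosen alpha.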
10.3 THE F TEST
In the preceding chapter we found that 28% of the sample variance in number of lies detected could be accounted for by knowing subjects' mood scores and 30% could be accounted for by knowing whether subjects were in the drug or placebo group (see Figs. 9.2 and 9.3). That was straightforward description. But how can we decide whether or not the proportion of variance accounted for is statistically significant? For example, if drug has no effect on the number of lies detected in the population from which a sample is drawn (as the null hypothesis claims), how often, just by the luck of the draw, would we select a sample in which we were able to account for 30% of the variance? Is the probability low enough so that we
can reject the null hypothesis? The statistic used to answer these and similar questions is the F ratio, which was introduced in the previous section (Equation 10.7), and a test that uses the F ratio is called an F test. Recall that in chapter 8 we noted that the total sum of squares could be divided, or partitioned, into two pieces, the sum of squares associated with the model and the error sum of squares (Equation 8.3). Again, this is simply a descriptive statement. But now, given our present inferential concern (Were the 10 subjects in the lie detection study drawn from a population in which drug has no effect on number of lies detected?), it is useful to examine the mean squares, or variance estimates, associated with the model and error sums of squares. As you might guess, the ratio of these two mean squares will be distributed as F if the null hypothesis is true. Thus if the F we compute is large, we have a basis for rejecting the null hypothesis and can conclude that in general, when people who have taken this drug lie, their lies are less easily detected by experts.
Degrees of Freedom for the Total Sum of Squares
As you already know, mean squares are sums of squares divided by their appropriate degrees of freedom (Equation 10.6). Determining the correct degrees of freedom for different sums of squares is relatively straightforward and a basic statistical skill you must master. There are two common approaches that can be understood on a relatively nontechnical level. The first emphasizes how many scores are free to vary, whereas the second, which is used somewhat more in this book, emphasizes the number of parameter estimates used in computing the particular sum of squares. For simplicity of presentation and ease of understanding, these two approaches are demonstrated first for total degrees of freedom and then for model and error degrees of freedom, even though it is the model and error mean squares that are needed to compute the F ratio. Assume a sample size of 10. If taking the first approach, we accept the sample size and the sample mean as fixed. If we did not know the actual scores, we could specify any number we like for the first score (in that sense it is free to vary); likewise for the second score, the third, and so forth. However, only nine scores are free to vary in this manner. Once the first nine scores are in place, the 10th is determined. There is one value, and only one value, that will ensure that the mean of the 10 numbers is the value specified initially. Thus, because nine scores are free to vary, there are nine degrees of freedom. If taking the second approach, we note that an error sum of squares is computed by summing the squares of the deviations of the raw scores from the predicted scores. For the total sum of squares, the predicted scores are always the group's mean. Symbolically:
$$SS_{\text{total}} = \sum (Y_i - Y_i')^2$$
The first element of the deviation (Yi) represents a raw score and because there are 10 of them, we begin with 10 degrees of freedom. From them we subtract the predicted scores. The equation for a predicted score (which first appeared in chap. 5) is
$$Y_i' = a$$
There is just one parameter estimate, a, which for the total sum of squares is the sample mean (a = MY). Thus when computing SStotal, we begin with 10 degrees of freedom and use 1 estimating a parameter, which leaves 9 degrees of freedom. In general,
$$df_{\text{total}} = N - 1$$
This formula applies whenever the sum of squares represents deviations from the sample mean. Moreover, the total degrees of freedom, like the total sum of squares, can be partitioned into two pieces. In other words,
$$df_{\text{total}} = df_{\text{model}} + df_{\text{error}}$$
But do not overextend this notion. Although the constituent sums of squares add up to the total sum of squares, and although the constituent degrees of freedom add up to the total degrees of freedom, the constituent mean squares do not sum to the total mean square.
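The "free to vary" argument can be made concrete with a few lines of Python. This is only a sketch: the nine scores below are hypothetical values chosen for illustration (they are not the lie detection data), and the fixed mean of 5.3 is the sample mean used in this chapter's example.

```python
# With N and the mean fixed, only N - 1 scores are free to vary.
N = 10
fixed_mean = 5.3  # sample mean for number of lies detected

# Nine scores chosen freely (hypothetical values, not the study data).
free_scores = [3, 8, 5, 6, 4, 7, 2, 9, 5]

# The 10th score must bring the total up to N * mean; no other value will do.
tenth_score = fixed_mean * N - sum(free_scores)
all_scores = free_scores + [tenth_score]

print(f"10th score is forced to be {tenth_score:.1f}")
print(f"mean of all {N} scores = {sum(all_scores) / N:.1f}")
print(f"degrees of freedom = {N - 1}")
```

Changing any of the nine free scores changes the forced 10th score, but the number of scores you were free to choose stays at nine.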
Degrees of Freedom for the Error Sum of Squares
The error sum of squares is computed by summing the squares of the deviations of the raw scores from the predicted scores. Symbolically:
$$SS_{error} = \sum (Y_i - Y_i')^2$$
Again, the first element of the deviation (Yi) represents a raw score and because there are 10 of them, we begin with 10 degrees of freedom. From them we again subtract the predicted scores. In this case, the prediction equation for Yi' is
$$Y_i' = a + bX_i$$
There are two parameter estimates, a and b, which are computed using regression procedures (Equations 8.7 and 8.8). Thus when computing SSerror, we begin with 10 degrees of freedom and use 2 estimating parameters, which leaves 8 degrees of freedom. In general,
$$df_{error} = N - 2$$
This formula applies whenever a single predictor variable is used. One degree of freedom is "lost" to the regression constant (a), the other to the regression coefficient (b) for the single predictor variable. More generally, the degrees of freedom for error will always be N minus 1 (for the constant) minus the number of predictor variables. Alternatively, and more traditionally, we would accept that there are five scores in each of two groups and the means for the two groups are fixed. If we did not know the actual scores, we could specify any number we like for the first score in the first group— in that sense, it is free to vary— and likewise for the second, third, and fourth, but not the fifth score in the first group. The fifth score is constrained by the mean for the first group. The same logic applies to the second group. Thus, because eight scores are free to vary (four in each group), there are eight degrees of freedom.
Degrees of Freedom for the Model Sum of Squares
The model sum of squares is computed by summing the squares of the deviations of the predicted scores from the mean. Symbolically:
$$SS_{model} = \sum (Y_i' - M_Y)^2$$
In this case, the first element of the deviation (Yi') represents a predicted score, which as we already know (from Equation 10.11) requires two parameter estimates; thus we begin with two degrees of freedom. From the predicted scores we subtract the mean, which requires one estimate (see Equation 10.8). Thus when computing SSmodel, we begin with two degrees of freedom and lose one, which leaves one degree of freedom. In general,
$$df_{model} = 1$$
This formula applies whenever a single predictor variable is used. More generally, the degrees of freedom for the model will always be the number of predictor variables. Alternatively, we would accept that there are two groups whose means are fixed, and an overall or grand mean, which is also fixed. If we did not know the actual group means, we could specify any number we like for the first group's mean (in that sense, it is free to vary). The mean for the second group, however, is constrained (or determined) by the grand mean. Thus, because only one group mean is free to vary, there is only one degree of freedom.
For a sample size of 10 with a single predictor variable, we have shown that the total degrees of freedom is 9, the degrees of freedom for error is 8, and the degrees of freedom for the model is 1. Thus for the present example, degrees of freedom for error and for the model add up to the total degrees of freedom, as they should (Equation 10.10).
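The three rules for degrees of freedom can be collected in one small Python function. This is a minimal sketch; the function name and argument names are ours, not part of any statistics package.

```python
def degrees_of_freedom(n, k):
    """Return (df_total, df_model, df_error) for n scores and k predictor variables."""
    df_total = n - 1        # one estimate (the mean) is used
    df_model = k            # one degree of freedom per predictor variable
    df_error = n - 1 - k    # N minus 1 (for the constant) minus the number of predictors
    return df_total, df_model, df_error


df_total, df_model, df_error = degrees_of_freedom(n=10, k=1)
print(df_total, df_model, df_error)

# The constituent degrees of freedom add up to the total.
assert df_model + df_error == df_total
```

For the present example (N = 10, one predictor) the function returns 9, 1, and 8, the values derived in the text.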
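Model and Error Mean Squares
In general terms, a mean square is a sum of squares divided by its appropriate degrees of freedom. For a single predictor variable and a total sample size of N, the formulas for the model and error mean squares are:
$$MS_{model} = \frac{SS_{model}}{df_{model}} = \frac{SS_{model}}{1}$$
$$MS_{error} = \frac{SS_{error}}{df_{error}} = \frac{SS_{error}}{N - 2}$$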
Do not confuse the SSmodel and SSerror, which are descriptive statistics and sum to SStotal, with MSmodel and MSerror, which are estimates of population parameters and do not sum to MStotal. But how are these estimates of variance used? What, exactly, do they estimate? As it turns out, their use in inferential statistics depends on what they mean when the null hypothesis is true. Let us now consider each in turn.
Mean squares are population estimates based on sample data. We know that such estimates are not perfect reflections of a population but reflect some degree
of sampling error. If the null hypothesis is true, MSmodel and MSerror provide two different and independent ways to estimate that sampling error, that is, the population error variance. If the null hypothesis is not true, then the mean square for the model reflects variability between groups in addition to sampling error:
MSmodel = effect of predictor variable + sampling error
However, if the means for the two groups are equal as claimed by the null hypothesis (i.e., if the effect of the predictor variable is nil), then the model mean square is a pure estimate of sampling error. Again, do not confuse SStotal, which is a descriptive statistic and can be decomposed into sums of squares based on between group and within group deviations, with MSmodel, which is a population estimate and is affected by sampling error as well as by variability associated with the model if any.
No matter whether or not the null hypothesis is true, the mean square for error is a pure estimate of sampling error:
MSerror = sampling error
It is not influenced by the effect of the predictor variable. Subtracting predicted scores from raw scores removes any such effect from the within group deviations used to compute the error mean square. Both MSmodel and MSerror provide independent estimates of sampling variance when the null hypothesis is true, so the ratio of MSmodel to MSerror should be distributed according to the sampling distribution for F, which provides us with a way to evaluate the null hypothesis, as described in the next section. (For a formal development of this argument, see Winer, 1971.)
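The claim that both mean squares estimate the same error variance when the null hypothesis is true can be checked with a small simulation. The sketch below is illustrative only: the population (normal, mean 0, variance 1), the number of replications, and the random seed are all our assumptions, and the function names are hypothetical.

```python
import random


def mean_squares_two_groups(group1, group2):
    """Return (MS_model, MS_error) for two independent groups."""
    n1, n2 = len(group1), len(group2)
    m1, m2 = sum(group1) / n1, sum(group2) / n2
    grand = (sum(group1) + sum(group2)) / (n1 + n2)
    # Model SS: deviations of group means (the predicted scores) from the grand mean.
    ss_model = n1 * (m1 - grand) ** 2 + n2 * (m2 - grand) ** 2
    # Error SS: deviations of raw scores from their own group mean.
    ss_error = sum((y - m1) ** 2 for y in group1) + sum((y - m2) ** 2 for y in group2)
    df_model = 1
    df_error = n1 + n2 - 2
    return ss_model / df_model, ss_error / df_error


def simulate_null(replications=5000, n_per_group=5, sigma=1.0, seed=1):
    """Average MS_model and MS_error when both groups come from one population."""
    rng = random.Random(seed)
    total_model = total_error = 0.0
    for _ in range(replications):
        g1 = [rng.gauss(0.0, sigma) for _ in range(n_per_group)]
        g2 = [rng.gauss(0.0, sigma) for _ in range(n_per_group)]
        ms_model, ms_error = mean_squares_two_groups(g1, g2)
        total_model += ms_model
        total_error += ms_error
    return total_model / replications, total_error / replications


avg_ms_model, avg_ms_error = simulate_null()
print(f"average MS_model = {avg_ms_model:.3f}")
print(f"average MS_error = {avg_ms_error:.3f}")
```

With a population variance of 1, both averages should come out close to 1. Adding a constant to every score in one group (a nonzero effect) would inflate the average MS_model but leave the average MS_error essentially unchanged.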
The F Ratio
The F ratio that tests the null hypothesis that claims no effect for the predictor variable (or no difference between group means) is
$$F = \frac{MS_{model}}{MS_{error}}$$
Theoretically MSmodel should be equal to or larger than MSerror because MSmodel estimates variability associated with the effect of the predictor on the criterion variable, if any, plus sampling error, whereas MSerror estimates only sampling error. Therefore if the null hypothesis is true, F should equal 1. But if it is not true, F should be larger than 1 (although as noted later it sometimes may be less than 1). Values of F that exceed the critical value associated with dfmodel and dferror are grounds for rejecting the null hypothesis.
For the lie detection study, one null hypothesis suggests that treatment group (drug versus placebo, represented with a single predictor variable) has no effect on number of lies detected. If this null hypothesis were true, and if the sample data reflected it exactly, then the proportion of criterion variance accounted for by the predictor variable would be zero (and the regression coefficients, correlation coefficients, and r2s would all be zero). Moreover, again if the null hypothesis were true, the F ratio (the ratio of MSmodel to MSerror) would be distributed as F with one and eight degrees of freedom (because N = 10). According to the null hypothesis, MSmodel and MSerror are equal (they both estimate the same population parameter); thus the expected value for F is 1.
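As a worked sketch, the arithmetic of the F ratio can be written out in Python using the sums of squares reported for the drug-group analysis in Fig. 10.2. The critical value 5.32 is the tabled value of F for alpha = .05 with 1 and 8 degrees of freedom; the variable names are ours.

```python
# Sums of squares for the drug-group analysis (Fig. 10.2).
N = 10
ss_model, ss_error = 12.1, 28.0
ss_total = ss_model + ss_error

df_model = 1          # one predictor variable
df_error = N - 2      # N minus 1 (constant) minus 1 (predictor)
df_total = N - 1

ms_model = ss_model / df_model
ms_error = ss_error / df_error
ms_total = ss_total / df_total
f_ratio = ms_model / ms_error

# Tabled critical value of F for alpha = .05 with 1 and 8 degrees of freedom.
F_CRITICAL_05 = 5.32

print(f"MS_model = {ms_model:.3f}, MS_error = {ms_error:.3f}, MS_total = {ms_total:.3f}")
print(f"F({df_model}, {df_error}) = {f_ratio:.3f}")
print("reject null hypothesis" if f_ratio > F_CRITICAL_05 else "do not reject null hypothesis")
```

Note that MS_model + MS_error (15.6) is not equal to MS_total (about 4.456): sums of squares and degrees of freedom partition, but mean squares do not.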
Note 10.2
MSmodel: Mean square for the model. Computed by dividing the sum of squares associated with the model by the degrees of freedom for the model. It is also called mean square between groups or mean square due to regression.
dfmodel: Degrees of freedom for the model. It is equal to the number of predictor variables.
MSerror: Mean square for error. Computed by dividing the error sum of squares by the error degrees of freedom. It is also called the mean square within groups or the residual mean square.
dferror: Degrees of freedom for error. It is equal to N minus 1 (for the regression constant) minus the number of predictor variables.
F ratio: Usually the ratio of MSmodel to MSerror; more generally, the ratio of two variances. If the null hypothesis is true, it will be distributed as F with the degrees of freedom associated with the numerator and denominator sums of squares.
Needless to say, the ratio of MSmodel to MSerror is rarely exactly 1. A computed F may occasionally even be less than 1, which suggests that the F test was probably not appropriate for those data in the first place. However, because the ratio of two theoretically equal variances is distributed as F with the appropriate degrees of freedom, the statistical significance of a computed F ratio can be determined, and if the F ratio is big enough, the null hypothesis can be rejected. In the next two exercises you are asked to use the F test to evaluate, first, the null hypothesis that drug does not affect number of lies detected and, second, the null hypothesis that mood does not affect number of lies detected.
Exercise 10.3 The Significance of a Single Binary Predictor
The template developed for this exercise allows you to evaluate the statistical significance of a single predictor variable. The predictor variable examined is drug group and the null hypothesis states that drug has no effect on number of lies detected. As you will see, only a few modifications to earlier spreadsheets are required.
General Instructions
1. A spreadsheet you developed earlier computed the proportion of variance in number of lies detected that is accounted for by knowing the subject's drug group (see Fig. 9.3). Modify this spreadsheet so that it computes MStotal, MSmodel, MSerror, and the F statistic that evaluates whether the proportion of variance is statistically significant. You may use Fig. 10.2 as a guide.
2. What is the value of the F ratio you computed? What is the critical value for this F (alpha = .05)? Would you reject the null hypothesis? Why or why not? Is the effect of drug treatment (drug or placebo) on the number of lies detected significant at the .05 level or better?
Detailed Instructions
1. Begin with the spreadsheet shown in Fig. 9.3. In order to make room for the inferential statistics, move the summary descriptive statistics from the sum-of-squares columns (I-K) to the deviation-score columns (E-G). Specifically, move (use the cut and paste functions, rather than the copy function) the labels and the formulas for N, the variance, the standard deviation, and R2 from block H14–K17 to block D14–G17. This will erase any formulas currently in cells D14-G15.
2. Check to make sure that all function references and formulas are correct. Due to the way spreadsheets execute a move (as opposed to a copy), the count functions in cells E14-G14 will refer to columns I-K, the variance formulas in cells E15-G15 will divide the sums of squares in cells I13-K13 by the Ns now in cells E14–G14, and the formulas for the standard deviation (cell E16), R2 (cell E17), and r (cell B17) will likewise point to the correct cells. Other formulas (predicted values, deviation scores, sums of squares, parameter values) should still be correct from before.
3. Provide labels for the inferential statistics. In column H, label rows 13–16 as indicated:
   Label   Row   Meaning
   SS=     13    The sum of squares for this column.
   df=     14    The degrees of freedom for this SS.
   MS=     15    The mean square (estimated population variance).
   SD'=    16    The estimated population standard deviation.
4. Enter the correct degrees of freedom for the SStotal, SSmodel, and SSerror in cells I14–K14, respectively.
5. Enter formulas for the mean squares (the sums of squares divided by the appropriate degrees of freedom) in cells I15–K15.
6. Enter the formula for the estimated population standard deviation in cell I16.
7. Enter the label "F=" in cell J17 and enter the formula for the F ratio (the MSmodel divided by the MSerror) in cell K17.
8. What is the value of the F ratio you computed? What is the critical value for this F (alpha = .05)? Would you reject the null hypothesis? Why or why not? Is the effect of drug treatment (drug or placebo) on the number of lies detected significant at the .05 level or better?
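If you would like to check your spreadsheet against an independent computation, the same steps can be sketched in Python. The function below mirrors the spreadsheet columns (predicted scores, deviations, sums of squares, degrees of freedom, mean squares, F). The function name is hypothetical, and the six scores in the usage example are made-up values for illustration only, not the lie detection data.

```python
def single_predictor_f_test(y, x):
    """Regress y on a single predictor x and return the statistics behind the F ratio."""
    n = len(y)
    mean_y = sum(y) / n
    mean_x = sum(x) / n

    # Regression coefficient and constant from the sums of squares and cross products.
    ss_x = sum((xi - mean_x) ** 2 for xi in x)
    ss_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = ss_xy / ss_x
    a = mean_y - b * mean_x
    predicted = [a + b * xi for xi in x]

    # Sums of squares for the total, model, and error deviations.
    ss_total = sum((yi - mean_y) ** 2 for yi in y)
    ss_model = sum((pi - mean_y) ** 2 for pi in predicted)
    ss_error = sum((yi - pi) ** 2 for yi, pi in zip(y, predicted))

    # Degrees of freedom and mean squares for a single predictor.
    df_total, df_model, df_error = n - 1, 1, n - 2
    ms_model = ss_model / df_model
    ms_error = ss_error / df_error

    return {
        "a": a, "b": b,
        "ss_total": ss_total, "ss_model": ss_model, "ss_error": ss_error,
        "df": (df_total, df_model, df_error),
        "ms_total": ss_total / df_total, "ms_model": ms_model, "ms_error": ms_error,
        "r2": ss_model / ss_total,
        "F": ms_model / ms_error,
    }


# Hypothetical data: a binary predictor (0 or 1) and six made-up criterion scores.
example_x = [0, 0, 0, 1, 1, 1]
example_y = [1, 2, 3, 3, 4, 5]
result = single_predictor_f_test(example_y, example_x)
print(f"a = {result['a']:.2f}, b = {result['b']:.2f}, "
      f"R2 = {result['r2']:.2f}, F = {result['F']:.2f}")
```

Because the function makes no assumption about the coding of x, the same sketch applies to a quantitative predictor such as mood.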
At this point your spreadsheet should look like the one shown in Fig. 10.2. You have now produced a template that allows you to determine the significance of a single predictor variable and can easily be modified for other data, for example, the mood score data, as we demonstrate in the next exercise. Finally, in the next section, we discuss how to interpret the results of both the drug group and mood score analyses.
Exercise 10.4 The Significance of a Single Quantitative Predictor
The template you developed for the last exercise is used for this exercise too. Only the data are different. The predictor variable examined is mood and the null hypothesis suggests that mood has no effect on number of lies detected.
1. Change the data entered in the previous spreadsheet (see Fig. 10.2) from a binary code for drug group to mood scores (the mood scores were last shown in Fig. 9.1). All other formulas from the previous spreadsheet should be in place and correct.
2. What is the value of the F ratio you computed? What is the critical value for this F (alpha = .05)? Would you reject the null hypothesis? Why or why not? Is the effect of mood on the number of lies detected significant at the .05 level or better?
10.4 THE ANALYSIS OF VARIANCE: TWO INDEPENDENT GROUPS
At this point your spreadsheet should look like the one shown in Fig. 10.3. It is worth reflecting for a moment on what these last two exercises have accomplished. The analysis performed for Exercise 10.3 (and shown in Fig. 10.2) is an analysis of variance with two independent groups. This analysis allows you to determine whether the means for two different groups of subjects are sufficiently different so that, in all likelihood, subjects in the two groups were sampled from two different populations, not one. In practical terms, this analysis allows you to determine whether there is a significant difference between the means computed for the two groups. The analysis performed in Exercise 10.4 (and shown in Fig. 10.3) evaluates the significance of a regression or correlation coefficient. This analysis allows you to determine whether the observed association is sufficiently great so that, in all likelihood, subjects were not sampled from a population in which there is no association. In practical terms, this analysis allows you to determine whether the two variables are significantly related.
It should now be evident to you that both kinds of questions can be answered with the same analysis. The overarching question concerns the proportion of variance accounted for by a single predictor variable. Is it sufficiently greater than zero so that, in all likelihood, subjects were sampled from a population in which that predictor has an effect, and not sampled from a population in which the predictor has no effect as the null hypothesis claims? In other words, does the predictor variable matter? Can we predict individual outcome significantly better if we know how subjects scored on the predictor variable?
Columns A-G
Row   A       B        C        D      E        F         G
1-2           Lies Y   Drug X   Y'     y=Y-My   m=Y'-My   e=Y-Y'
13    Sum=    53       5        53     0        4E-15     0
14    N=      10       10       N=     10       10        10
15    Mean=   5.3      0.5      VAR=   4.01     1.21      2.8
16    a,b=    6.4      -2.2     SD=    2.002
17    r=      -0.55             R2=    0.302
Columns H-N
Row   H       I        J        K        L        M        N
1-2           SStot    SSmod    SSerr    x=X-Mx   SSX      SSXY
              y*y      m*m      e*e               x*x      x*y
13    SS=     40.1     12.1     28       0        2.5      -5.5
14    df=     9        1        8        10       10       10
15    MS=     4.456    12.1     3.5               0.25     -0.55
16    SD'=    2.111                               0.5
17                     F=       3.457
FIG. 10.2. Spreadsheet for computing the F ratio that evaluates the effect of drug treatment on number of lies detected. Rows 3-12 are the same as Fig. 9.3 so are not shown.
Statistical Significance and Effect Size
You computed two F ratios for the last two exercises, one evaluating the contribution of drug group and one evaluating mood. Based on the sample data provided, neither mood nor drug group significantly affected the number of lies experts could detect. Although not finding effects significant may seem disappointing from a research point of view, remember that our example included data for only 10 subjects, which is not very many. Still, the magnitude of the effect was not trivial: In this sample, drug group accounted for 30% and mood scores accounted for 28% of the variance in number of lies detected. Thus the data from the lie detection study rather dramatically demonstrate the difference between real-world significance, or the magnitude of the effect, and statistical significance. In fact, with five additional subjects, both effects would have been statistically significant at the .05 level (assuming that the additional subjects did not alter the values for the SSmodel and SSerror).
Often, perhaps all too often, researchers pay attention only to statistical significance and ignore effect sizes (see Cohen, 1990; Rosnow & Rosenthal, 1989; Wilkinson et al., 1999). But effect sizes provide important descriptive information and should always be reported (using, for example, an index of effect size such as r2, which was discussed in the previous chapter and used in this one, or R2, which is discussed in the next chapter). There is no reason to be particularly dazzled by statistical significance or depressed by its absence. As the present example demonstrates, when the sample size is small even a relatively hefty effect may not be statistically significant. Worse, the reverse is also true. Given a large enough sample size, even a minuscule effect can achieve statistical significance. You now can understand why throughout this book r2 (and its multivariate extension, R2) are emphasized. Reflecting as they do the size of real-world effects, they provide an excellent correction to what might otherwise be an exclusive focus on the statistical significance of F ratios.
Columns A-G
Row   A       B        C        D      E        F         G
1-2           Lies Y   Mood X   Y'     y=Y-My   m=Y'-My   e=Y-Y'
13    Sum=    53       48.5     53     0        0         0
14    N=      10       10       N=     10       10        10
15    Mean=   5.3      4.85     VAR=   4.01     1.124     2.886
16    a,b=    3.024    0.469    SD=    2.002
17    r=      0.529             R2=    0.28
Columns H-N
Row   H       I        J        K        L        M        N
1-2           SStot    SSmod    SSerr    x=X-Mx   SSX      SSXY
              y*y      m*m      e*e               x*x      x*y
13    SS=     40.1     11.24    28.86    0        51.03    23.95
14    df=     9        1        8        10       10       10
15    MS=     4.456    11.24    3.607             5.103    2.395
16    SD'=    2.111                               2.259
17                     F=       3.116
FIG. 10.3. Spreadsheet for computing the F ratio that evaluates the effect of mood on number of lies detected. Rows 3-12 are the same as Fig. 9.1 so are not shown.
Exercise 10.5 The Significance of a Single Predictor: More Examples
The purpose of this exercise is to provide additional practice in determining the statistical significance of the proportion of variance accounted for.
1. Recall Exercise 9.6 for which you computed the proportion of variance in the number of words 18-month-old infants spoke that was accounted for by knowing the number of older siblings. This proportion was 43.1%. Would you reject the null hypothesis that this proportion is really zero at the .05 level of significance? At the .01 level?
2. You also computed the proportion of variance accounted for by knowing whether an infant had (a) no older siblings or (b) one or more older siblings. This proportion was 32.6%. Would you reject the null hypothesis that this proportion is really zero at the .05 level of significance? At the .01 level?
3. The 32.6% in question 2 is a proportion of variance accounted for by a binary predictor. If significant, it means that the means for the two groups differ significantly. What is the name of the test that, especially in older literature, is often used to determine if the means for two groups differ?
4. For the exercises in chapters 8 and 9, you computed the variances and covariance for X and Y and used them to compute the regression coefficient and constant. The regression statistics were then used to compute predicted values for Y, which in turn allowed you to compute the model and error sum of squares and R2. Now use the multiple regression data analysis procedure in your spreadsheet to verify the values you have computed for a, b, and R2.
5. For these data, the mean number of words spoken by infants with no older siblings was significantly different from the mean number for infants with one or more older siblings. Compute the standard error of the mean for these two groups and prepare a bar graph (similar to Fig. 7.2) with error bars extending one standard error of the mean above and below the top of each bar. Now compute 95% confidence intervals for the two means (and graph them). Do the 95% confidence intervals suggest that the means are significantly different from each other?
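The point made earlier in this section, that the same proportion of variance can be nonsignificant with 10 subjects yet significant with 15, follows directly from rewriting the F ratio in terms of R2 and the degrees of freedom. The Python sketch below uses the sums of squares from Figs. 10.2 and 10.3; the critical values 5.32 and 4.67 are tabled values of F at alpha = .05 for 1 and 8, and 1 and 13, degrees of freedom. The function name is ours.

```python
def f_from_r2(r2, n, k=1):
    """F ratio implied by a proportion of variance r2, sample size n, and k predictors."""
    df_model = k
    df_error = n - 1 - k
    return (r2 / df_model) / ((1 - r2) / df_error)


# Proportions of variance from the lie detection study (SSmodel / SStotal).
r2_drug = 12.1 / 40.1
r2_mood = 11.24 / 40.1

# Tabled critical values of F (alpha = .05), keyed by error degrees of freedom.
CRITICAL_F_05 = {8: 5.32, 13: 4.67}

for n in (10, 15):
    df_error = n - 2
    for label, r2 in (("drug", r2_drug), ("mood", r2_mood)):
        f_ratio = f_from_r2(r2, n)
        verdict = "significant" if f_ratio > CRITICAL_F_05[df_error] else "not significant"
        print(f"N = {n}, {label}: R2 = {r2:.3f}, F(1, {df_error}) = {f_ratio:.3f}, {verdict}")
```

Holding R2 fixed, F grows in proportion to the error degrees of freedom, which is why effect size and statistical significance need to be reported separately. The same function can be applied to the proportions in Exercise 10.5 once you supply the sample size from Exercise 9.6.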
The figure you prepared for part 5 in the last exercise shows how to present the results from a simple analysis of variance of two groups graphically (see Fig. 10.4). The top of each bar indicates the mean value for one of the two groups of infants. But remember, the means obtained for these two groups are derived from sample data. They only represent an estimate for the true population mean, which need not be exactly the sample mean. The error bars serve to remind us of the probabilistic nature of sample data, and so serve as some protection against taking our results too absolutely. Usually error bars indicate standard errors of the mean, but as Fig. 10.4 shows, it is also possible to have error bars represent 95% confidence intervals instead. In the context of an analysis of variance of two independent groups, 95% confidence intervals do have one advantage: They suggest statistical significance. Usually, if the difference between two means is statistically significant (at the .05 level), then neither mean will fall within the 95% confidence interval for the other; but to verify the suggestion you need to do the formal statistical test.
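The two kinds of error bars can be computed with a few lines of Python. This is a sketch under stated assumptions: the five scores are hypothetical (they are not the infant vocabulary data), the critical t value must be looked up for the group's own degrees of freedom (2.776 is the tabled two-tailed .05 value for 4 degrees of freedom), and the function name is ours.

```python
import math
import statistics


def mean_with_interval(scores, t_critical):
    """Return the mean, its standard error, and a confidence interval for the mean."""
    n = len(scores)
    mean = statistics.mean(scores)
    # statistics.stdev divides by n - 1, giving the estimated population standard deviation.
    sd_estimate = statistics.stdev(scores)
    standard_error = sd_estimate / math.sqrt(n)
    half_width = t_critical * standard_error
    return mean, standard_error, (mean - half_width, mean + half_width)


# Hypothetical scores for one group of five infants.
hypothetical_scores = [1, 2, 3, 4, 5]
T_CRITICAL_DF4 = 2.776  # tabled t for alpha = .05 (two-tailed), df = 4

mean, se, (low, high) = mean_with_interval(hypothetical_scores, T_CRITICAL_DF4)
print(f"mean = {mean:.2f}, standard error = {se:.3f}")
print(f"95% confidence interval: {low:.2f} to {high:.2f}")
```

The confidence interval is always wider than the one-standard-error bar, by a factor equal to the critical t value.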
10.5 ASSUMPTIONS OF THE F TEST
Three assumptions need to be met before you can be confident that the probability value of your observed F ratio is valid. The first is that observations are independent and that scores across groups are independent. If you randomly sample participants from the population and then randomly assign them to groups, this assumption will be satisfied. The second assumption is that the Y values are normally distributed within groups. You can check this assumption using the graphical techniques described in chapter 6. You can also examine the skew statistic provided in the Descriptives procedure of SPSS. A general rule of thumb is that if the skew divided by its standard error is greater than 2.00, then you may have violated the normality assumption. The final assumption requires that the variances of the groups be roughly equal. This assumption, known as homogeneity of variances, can be tested using the F test described earlier in this chapter. SPSS also provides a test, described in the next exercise, that examines the homogeneity assumption. It should be noted that, when the groups are of equal size, analysis of variance (ANOVA) is fairly robust to minor violations of normality and homogeneity of variances. If, however, the ns are not equal or you violate both assumptions simultaneously, then corrective actions may be necessary.
FIG. 10.4. A bar graph showing the mean number of words infants with no older siblings, and infants with one or more older siblings, spoke at 18 months of age. The error bars on the left represent standard errors of the mean for the two groups. The error bars on the right represent 95% confidence intervals for the means of the two groups.
Exercise 10.6 The F Test in SPSS Regression and One-Way ANOVA
This will show you how to find the significance of the amount of variance accounted for by Drug in Lies Scores.
1. Open the Mood and Drug data file you created in Exercise 9.7 and rerun the Regression procedure.
2. Examine the ANOVA box in the output. Does this value agree with the F you calculated in Exercise 10.3?
3. Select Analyze->Compare Means->One-Way ANOVA from the main menu. Move lies to the Dependent List and Drug to the Factor window. Click on Options and then, under Statistics, check the Descriptive and Homogeneity of variance boxes. Click on Continue and then OK.
4. Examine the output. In the Descriptives box, make sure the N, means, and standard deviations agree with your spreadsheet. You will find the sums of squares and the F value reported in the ANOVA table. The statistics should be identical to the ANOVA output from the regression procedure with the exception that the regression statistics are now termed between groups and the residual statistics are called within groups. Thus, you may analyze a single-factor study with a categorical independent variable using either the Regression or One-Way ANOVA procedures.
5. The One-Way ANOVA procedure does, however, also provide a test of the homogeneity of variances. Find Levene's statistic in the box labeled Test of Homogeneity of Variances.
Levene's statistic provides a test that the variances of groups formed by the categorical independent variable are equal. This test is similar to the F test for equal variances presented earlier in this chapter. Levene's, however, is less likely to be biased by departures from normality.
Considerable basic and important material has been presented in this chapter. You have been introduced to degrees of freedom, a topic that is straightforward and yet often seems perplexing to students at first. If you remember that degrees of freedom relate to the deviations used to compute sums of squares and are determined by the total number of scores analyzed as constrained by predictor variables (and if you memorize all the formulas when presented), then you should have little difficulty. You have also been introduced to the F distribution and have learned how to perform an F test to determine whether the means (or variances) of two groups are significantly different from each other. More importantly, you have learned to distinguish between real-world importance and statistical significance. You now know that nontrivial effects may not be found statistically significant in small samples, but that trivial effects can be found significant if the sample size is large enough. Clearly, when reporting research results, it is important to describe both the magnitude of any effects investigated as well as their statistical significance. Just as clearly, when planning any research, it is important to determine exactly how large a sample should be in order to detect as significant effects the researcher regards as big enough to be nontrivial. This determination is called power analysis and is discussed in chapter 17.
At this point you now know how to perform an analysis of variance for two independent groups (which is equivalent to a t test for independent groups). First you determine the proportion of variance accounted for by a single predictor variable, and then you determine its statistical significance using an F test. This approach, as you will see in the next chapter, is easily generalized to situations that require more than one predictor variable. However, no matter whether one or more than one predictor variables are considered, the central descriptive issue is, how much variance do they account for (i.e., what is the magnitude of the effect)? And the central inferential question remains, is that amount sufficiently different from zero so that it is unlikely to have occurred in our sample just by chance, given that the null hypothesis is true (i.e., is the effect statistically significant)?
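For readers curious about what lies behind the Levene's statistic mentioned in Exercise 10.6, one common form of it is simply the F ratio of this chapter applied to absolute deviations from each group's mean rather than to the raw scores. The Python sketch below implements that mean-based form; whether a particular program centers on group means or on medians is an assumption you should check in its documentation. The function name and the two groups of scores are hypothetical.

```python
def levene_statistic(*groups):
    """Mean-based Levene statistic: an F ratio computed on absolute deviations."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)

    # Replace each score by its absolute deviation from its own group mean.
    abs_devs = []
    for g in groups:
        group_mean = sum(g) / len(g)
        abs_devs.append([abs(y - group_mean) for y in g])

    dev_means = [sum(d) / len(d) for d in abs_devs]
    grand_mean = sum(sum(d) for d in abs_devs) / n_total

    # Between-groups and within-groups sums of squares of the absolute deviations.
    ss_between = sum(len(d) * (m - grand_mean) ** 2 for d, m in zip(abs_devs, dev_means))
    ss_within = sum((v - m) ** 2 for d, m in zip(abs_devs, dev_means) for v in d)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return ms_between / ms_within


# Two hypothetical groups; the second is twice as spread out as the first.
group_a = [1, 2, 3, 4, 5]
group_b = [2, 4, 6, 8, 10]
w = levene_statistic(group_a, group_b)
print(f"Levene W = {w:.3f} with 1 and 8 degrees of freedom")
```

The resulting value is evaluated against the F distribution with k - 1 and N - k degrees of freedom, exactly like the F ratios computed earlier in the chapter.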
11
Accounting for Variance: Multiple Predictors
In this chapter you will:
1. Be introduced to multiple regression and learn how to interpret the basic multiple-regression statistics.
2. Learn how to estimate R2 for a population from the R2 computed for a particular sample.
3. Learn how to determine the amount of variance two predictor variables account for together and whether that amount is statistically significant.
4. Learn how to determine the amount of additional variance a second predictor variable accounts for, above and beyond the portion of variance accounted for by the first, and whether that additional amount is statistically significant.
5. Be introduced in a basic way to concepts underlying the analysis of covariance.
6. Learn how to generalize the techniques discussed in this chapter from two to more than two predictor variables.
Beginning in chapter 8, and continuing in chapters 9 and 10, our discussion was limited to simple, as opposed to multiple, regression and correlation. Simple correlation is concerned with the degree of association between two variables and is typically indexed with the simple correlation coefficient, r (whose values can range from -1 to +1). Simple regression is concerned with the exact relation between a predictor and criterion variable, as indexed by the regression coefficient b (whose values are affected by values for the two variables involved and, in theory at least, can range from minus to plus infinity), and with the proportion of criterion variance that can be accounted for, given knowledge of the predictor variable, as indexed by r2 (whose values can range from 0 to +1).
Simple regression and correlation are conveniently viewed as a single topic. As noted previously, definitional formulas for r and b (Equations 8.11 and 8.12) appear quite similar. In fact, one way to compute r is as follows:
$$r = b \, \frac{SD_X}{SD_Y}$$
This serves to remind us that r is a standardized version of b. The regression coefficient is expressed in Y units per unit change in X. Multiplying b by the ratio of the X to Y standard deviations cancels the units and provides a unit-free, standardized measure of the association between the two variables. In other words, correlation coefficients can be compared for different pairs of variables (similar values indicate similar degrees of relation), whereas regression coefficients cannot be compared in this way. Their values and units reflect the raw-score scales of the variables involved.
11.1 MULTIPLE REGRESSION AND CORRELATION
Multiple regression and correlation (MRC) is a straightforward extension of simple regression and correlation. Instead of being concerned with only one predictor variable, MRC is concerned with the effects of two or more predictor variables, working in concert. The techniques of MRC provide answers to questions like, how much criterion variance is accounted for by knowledge of a set of predictor variables working together? And, what is the unique contribution of a particular predictor, that is, how much additional criterion variance is accounted for when that predictor variable is added to the existing set? If mood affects the number of lies detected, for example, does knowing the drug treatment, in addition to knowing the mood score, increase the proportion of variance accounted for significantly? More generally, the techniques of MRC allow answers to many of the simpler research questions typically asked by behavioral scientists. Analysis of variance (ANOVA) and analysis of covariance (ANCOVA), for example, can be understood as particular instances or applications of multiple regression. Thus, whenever investigators wish to account for or explain a given quantitative criterion variable in terms of a few predictor variables, a basic understanding of MRC, as presented in this book, is often all that is required.
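Before turning to multiple predictors, the relation between r and b noted at the start of this chapter (r is b multiplied by the ratio of the X to Y standard deviations) can be checked numerically. The short Python sketch below uses the values reported for the mood analysis in Fig. 10.3; the variable names are ours.

```python
# Values for the mood analysis (Fig. 10.3).
b = 0.469      # regression coefficient: lies detected per unit of mood
sd_x = 2.259   # standard deviation of the mood scores
sd_y = 2.002   # standard deviation of number of lies detected

# Standardizing b: the units of X and Y cancel, leaving a unit-free index.
r = b * sd_x / sd_y

print(f"r = {r:.3f}")
print(f"r squared = {r ** 2:.2f}")
```

The result agrees, within rounding, with the r of 0.529 and the R2 of 0.28 shown in the figure.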
Multiple Regression Parameters
In chapter 8 we noted that the regression equation for a single predictor was:
$$Y' = a + bX$$
We showed that the parameters for this equation, a and b, could be determined by trial and error, but we also showed a way to compute the values for the parameters. As you might guess, the equation for two predictors is
$$Y' = a + b_1 X_1 + b_2 X_2$$
This equation has three parameters, a, b1, and b2. More generally, the equation for K predictors is
$$Y' = a + b_1 X_1 + b_2 X_2 + \cdots + b_K X_K$$
In chapter 8 we also interpreted the parameters graphically, as the Y intercept and slope of a straight line, in the belief that such visualization can aid understanding. Although it is possible to visualize the two-predictor case spatially in three dimensions, the more than three dimensions required for three or more predictors are difficult, if not impossible, for most of us to visualize. Hence from now on we rely on the more general Equation 11.4 and drop the spatial metaphor, which only works for Equation 11.2.
The trial-and-error method for determining parameter values, however, remains appropriate, or at least theoretically possible, no matter the number of predictors. The best values for the parameters (in the least-squares sense) remain those that minimize the sum of the squared errors (the deviations between predicted and observed scores, squared and summed). And, in theory at least, they could be found by trying different combinations of values for the parameters and noting their effect on the error sum of squares. However, as the number of variables increases, the concomitant increase in the number of combinations becomes astronomical, which renders the trial-and-error method tedious if not completely impractical. Similarly, computing the values for the multiple-regression parameters with simple spreadsheet formulas is easy enough with one predictor, not much more difficult with two, but becomes increasingly impractical as the number of predictors increases.
Happily, there are general methods for computing the best-fit parameters. Learning such methods (which involve matrix algebra) is beyond the scope of this book, but the methods are embodied in widely available multiple-regression computing programs. Indeed, as noted in chapter 1, most spreadsheet programs include built-in routines for computing multiple-regression statistics. For the exercises in the remainder of this book, instead of computing multiple-regression statistics directly, you will use a multiple-regression routine. (In Excel this is done by entering Tools/Data Analysis/Regression, although the first time you may need to specify Tools/Add-Ins/Analysis ToolPak.) You then specify the range for the dependent variable and the range for the independent variable or variables, indicate where the output should be placed, and instruct the routine to compute the statistics. The routine then computes and places in your spreadsheet values for various regression statistics, including values for the constant (a) and the regression coefficients (b1, b2, etc.). These values can then be referenced by the formula you define to compute predicted values for Y.
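For readers who would like to see what such a routine does behind the scenes, the following is a minimal sketch in Python rather than a spreadsheet. It is not the routine used in this book: it assumes the standard normal-equations approach to least squares, uses only the Python standard library, and the function names solve and fit_least_squares are hypothetical names chosen for this illustration. The data are the lies, mood, and drug values of the running example as listed in Fig. 11.2.

```python
# Data from the running example (values as listed in Fig. 11.2).
lies = [3, 2, 4, 6, 6, 4, 5, 7, 7, 9]              # Y, number of lies detected
mood = [5.5, 2, 4.5, 3, 1.5, 6.5, 3.5, 7, 6, 9]    # X1, mood score
drug = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]              # X2, 1 = drug, 0 = placebo


def solve(matrix, rhs):
    """Solve a small linear system by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        # Move the row with the largest entry in this column into pivot position.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        # Eliminate this column from every other row.
        for row in range(n):
            if row != col:
                factor = aug[row][col] / aug[col][col]
                aug[row] = [x - factor * p for x, p in zip(aug[row], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def fit_least_squares(y, *predictors):
    """Return [a, b1, ..., bK], the values that minimize the sum of squared errors."""
    # A leading column of 1s carries the regression constant a.
    columns = [[1.0] * len(y)] + [[float(v) for v in p] for p in predictors]
    # Normal equations: (X'X) times the coefficients equals X'y.
    xtx = [[sum(u * v for u, v in zip(ci, cj)) for cj in columns] for ci in columns]
    xty = [sum(u * v for u, v in zip(ci, y)) for ci in columns]
    return solve(xtx, xty)


a, b1, b2 = fit_least_squares(lies, mood, drug)
print(round(a, 3), round(b1, 3), round(b2, 3))
```

The three printed values should agree, within rounding, with the constant and coefficients that a spreadsheet regression routine reports for the same data. In everyday work a tested library routine would normally be preferred to a hand-written solver like this one.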
Partial Regression Coefficients
Perhaps too informally, we have referred to the b1s, b2s, and so forth, of multiple regression as regression coefficients. This is to deprive them of their full and correct name, which is partial regression coefficients. Only in the case of a single predictor variable is it correct to speak of a regression coefficient. Whenever there is more than one predictor variable, it is important to remember the qualifying partial because it serves to remind us that each partial regression coefficient is just one member of a family and we cannot forget the family. Each bk (where K is the number of predictor variables and k = 1, ..., K) describes the relation between a particular predictor variable and the criterion, not for that predictor variable in isolation (the simple regression coefficient does that) but when the other predictor variables in the set are taken into account. In statistical terms, the matter is often put as follows: The partial regression coefficient describes the change in the criterion variable per unit change in the predictor variable when other variables in the set are held constant. It is important to keep in mind that the partial regression coefficients of multiple regression (the b1s, b2s, and so forth) cannot be treated as if they described a simple bivariate relation between a single predictor variable and the criterion. They describe instead the relation between a predictor variable and the criterion when that predictor is part of, or works in concert with, the other predictors in the set.
Multiple regression routines typically compute values for a number of other statistics in addition to the a, b1, b2, and so forth, needed to compute predicted scores, and so it makes some sense to describe them briefly. The output produced by a typical spreadsheet multiple-regression routine is shown in Fig. 11.1. The data provided as input were from our running example (Y = number of lies detected, X1 = mood score, X2 = drug group, coded 1 for drug and 0 for placebo). Some of the statistics shown in Fig. 11.1 are already familiar to you. For example, the regression constant, labeled Intercept, and the regression coefficients, labeled X Variable 1 and X Variable 2, appear on the last three lines. But others, described in subsequent paragraphs, may be new to you.
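The difference between a simple and a partial regression coefficient can be made concrete with the running example. The sketch below is an illustration only (the function name simple_slope is hypothetical): it computes the simple regression coefficient for mood alone from the cross products and squared deviations, and sets it beside the partial coefficient for mood reported in Fig. 11.1, which is entered here as a constant rather than recomputed.

```python
lies = [3, 2, 4, 6, 6, 4, 5, 7, 7, 9]              # Y
mood = [5.5, 2, 4.5, 3, 1.5, 6.5, 3.5, 7, 6, 9]    # X1


def simple_slope(x, y):
    """Simple regression coefficient b: sum of cross products over sum of squares for X."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cross_products = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    ss_x = sum((xi - mean_x) ** 2 for xi in x)
    return cross_products / ss_x


b_simple = simple_slope(mood, lies)   # mood considered in isolation
b_partial = 0.256                     # mood with drug group in the equation (Fig. 11.1)

print(round(b_simple, 3), b_partial)
```

The simple coefficient comes out noticeably larger than the partial one, which shows why a partial coefficient should not be read as if it described the bivariate relation between mood and lies detected.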
SUMMARY OUTPUT
Regression Statistics
Multiple R           0.588
R Square             0.346
Adjusted R Square    0.159
Standard Error       1.936
Observations         10
ANOVA
              df    SS       MS       F        Significance F
Regression    2     13.86    6.932    1.849    0.227
Residual      7     26.24    3.748
Total         9     40.1
                Coefficients    Standard Error    t Stat    P-value
Intercept       4.764           2.537             1.878     0.102
X Variable 1    0.256           0.373             0.686     0.515
X Variable 2    -1.41           1.683             -0.84     0.431
FIG. 11.1. Typical output from a spreadsheet multiple-regression routine.
R, R2, and Adjusted R2
Multiple R, on the first line after Regression Statistics, is the multiple correlation coefficient, the correlation between a criterion variable and a set of predictor variables. R Square, on the second line, which is often written R2, is similar to r2, which was introduced earlier (Equation 7.8). When only one predictor variable is used, the proportion of variance accounted for is written r2 (lower case) but when more than one predictor variable is used, this is written R2 (upper case). But the definition remains the same: R2, like r2, is the proportion of total variance accounted for in a sample by a particular model. That is:
R2 = SSmodel / SStotal (11.5)
The multiple correlation coefficient squared (R2), like the coefficient of determination (r2), can assume values ranging from 0 to 1. Before proceeding further we should point out that no matter the value or significance of R2, its use as a descriptive statistic has limitations. The multiple correlation coefficient squared is a sample statistic and, like the sample variance and standard deviation, it provides a biased estimate of the population value. But unlike the sample variance and standard deviation, the sample R2 overestimates the population value. This is because it capitalizes on any chance variation in the sample. If R2 were .5, for example, claiming that the predictor variables accounted for half of the variance would likely overstate the true state of affairs, if one meant this to apply to the population generally. A formula for R2, adjusted to provide an unbiased estimate of the population value, is
R2adj = R2 - (1 - R2) K / (N - K - 1) (11.6)
The adjustment appears on the right-hand side of the equation after the minus sign. For constant numbers of predictor variables and subjects, the adjustment becomes smaller as R2 becomes larger. And for constant values of R2, the adjustment becomes smaller as the ratio of predictor variables to subjects becomes smaller (a ratio of 1:12 is smaller than a ratio of 1:3). In other words, the more subjects there are for each variable, the smaller is the adjustment. If there are few subjects for each predictor variable, that is, if the ratio of subjects to variables is small (3:1 is smaller than 12:1), the adjustment can be quite substantial, especially if R2 is not very large. The adjustment can even be greater than the sample R2, which results in an adjusted R2 less than zero. If that should happen, however, report the value as zero: A negative R2 makes no sense.
A question is, which value should you report, R2 or adjusted R2? Often only R2 is reported, but if you wish your results to characterize the population from which your sample was drawn, it makes more sense to report the adjusted R2 and not the sample R2. If you have any doubts, you can always report both.
At this point, a brief comment about notation is in order. Throughout this text, an apostrophe or prime is used to indicate an estimated value (e.g., Y', VAR', SD', SD'M, and so forth). Thus it would make sense to apply the same convention to the adjusted R2. However, perhaps because R2' or R'2 looks somewhat awkward, R2adjusted or R2adj is conventional, so we use it here.
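The behavior of the adjustment is easy to explore numerically. The sketch below assumes the form of Equation 11.6 described above (the sample R2 minus an adjustment that depends on R2, the number of predictors K, and the number of subjects N) and follows the advice to report a negative result as zero. The function name adjusted_r2 is a hypothetical choice for this illustration.

```python
def adjusted_r2(r2, n_subjects, n_predictors):
    """Adjusted R2: sample R2 minus (1 - R2) * K / (N - K - 1), floored at zero."""
    adjustment = (1 - r2) * n_predictors / (n_subjects - n_predictors - 1)
    return max(0.0, r2 - adjustment)


# Running example: SSmodel = 13.86, SStotal = 40.1, N = 10, K = 2 (Fig. 11.1).
r2_sample = 13.86 / 40.1
print(round(r2_sample, 3), round(adjusted_r2(r2_sample, 10, 2), 3))

# Same R2 and one predictor, but different numbers of subjects per predictor:
print(adjusted_r2(0.5, 12, 1))   # many subjects per predictor, small adjustment
print(adjusted_r2(0.5, 4, 1))    # few subjects per predictor, large adjustment

# A small R2 with few subjects would go negative, so zero is reported instead.
print(adjusted_r2(0.05, 10, 3))
```

For the running example the first line should reproduce the R Square and Adjusted R Square values in Fig. 11.1.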
The Standard Error of Estimate
The fourth line after Regression Statistics in Fig. 11.1 is Standard Error. More fully, this is the estimated standard error of estimate, which may be new to you but is not new in concept. For the present sample of 10 scores, the total sum of squares is 40.1. Therefore the standard deviation for raw scores in the population as estimated from sample data is
SD' = √(SStotal / (N - 1)) = √(40.1 / 9) = 2.11
(See Equation 5.7 and Fig. 10.3.) This quantity, which has the same units as those used to measure Y (in this case, number of lies), can be thought of as the average error—for the population—when guessing the number of lies knowing just the mean.
Similarly, the standard error of estimate is the square root of the error sum of squares divided by N, and the estimated standard error of estimate (the population value estimated from sample data) is the square root of the sum of squares for error divided by its degrees of freedom. Subscripting SD' with Y-Y' indicates that SD'Y-Y' represents the estimated standard deviation for the deviations between raw and predicted scores (instead of between raw scores and the mean as for SD'):
SD'Y-Y' = √(SSerror / dferror) = √(SSerror / (N - K - 1)) (11.7)
This quantity can be thought of as the average error, for the population, when guessing the number of lies using the prediction equation. In this text, standard error of estimate refers to the sample statistic and estimated standard error of estimate refers to the population estimate. Some texts drop the "estimated" and use standard error of estimate to refer to the quantity defined in Equation 11.7. As you will see in the next exercise, the SSerror when prediction is based on two variables is 26.24. Thus for the lie detection study the estimated standard error of estimate, when predicting number of lies detected from both mood and drug group, is
SD'Y-Y' = √(26.24 / 7) = √3.748 = 1.94
Note that 1.94, the estimated standard error of estimate or the average error in the population when prediction is based on two variables, is less than 2.11, the estimated value for the population standard deviation or the average error when prediction is based only on the mean number of lies detected. In other words, when using mood and drug group to predict number of lies, and not just the mean number of lies, the average error decreases from 2.11 to 1.94, a reduction of about 8.3%. The arithmetic, using values carried to three decimals, is as follows:
(2.111 - 1.936) / 2.111 = 0.175 / 2.111 = .083
Thus SD'Y-Y', especially when compared with SD', gives a sense of how important a particular set of predictor variables is in general. The more effective a set of predictors is in improving prediction and hence reducing error, the smaller the estimated standard error of estimate will be relative to the estimated standard deviation. For some reason, perhaps historical, perhaps reflecting researchers' urge to generalize, most general-purpose regression routines print only SD'Y-Y' (and call it the standard error of estimate or simply the standard error) and not SDY-Y' (the square root of the SSerror divided by N). At the same time, as exemplified by Fig. 11.1, often they print only the sample R2 and not the adjusted R2.
The degrees of freedom (df) are also given in Fig. 11.1. For the present example, three parameters (a, b1, and b2) were used to compute the predicted scores. One degree of freedom, for the regression constant (a), is subtracted from N, the number of scores, which gives 9 degrees of freedom Total. The model then contains two further parameters (b1 and b2), hence 2 degrees of freedom for the Regression, which leaves 7 degrees of freedom for error or Residual.
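The comparison between SD' and SD'Y-Y' can be checked with a few lines of arithmetic. The sketch below is only an illustration, with variable names chosen for this example. It takes the sums of squares from Fig. 11.1, divides each by its degrees of freedom, takes square roots, and then expresses the drop in average error as a percentage of SD'.

```python
import math

ss_total = 40.1    # total sum of squares (Fig. 11.1)
ss_error = 26.24   # error (residual) sum of squares (Fig. 11.1)
n = 10             # number of scores
k = 2              # number of predictors (mood and drug group)

df_total = n - 1        # one df used by the regression constant
df_error = n - k - 1    # K more used by the predictors

sd_est = math.sqrt(ss_total / df_total)   # SD', average error using only the mean
se_est = math.sqrt(ss_error / df_error)   # SD'Y-Y', average error using the equation

reduction = (sd_est - se_est) / sd_est    # proportional reduction in average error

print(round(sd_est, 3), round(se_est, 3), round(100 * reduction, 1))
```

The first two printed values should match the SD' entries in the spreadsheet for this example, and the third is the percentage reduction in average error.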
A second brief comment about notation is now in order and concerns SD', the estimated standard deviation, and SD'M, the standard error of the mean, as compared with SD'Y-Y', the estimated standard error of estimate. Each of these three represents a population standard deviation, that is, the average deviation of scores from a specified reference point. For SD'Y-Y', the deviations are the regression residuals (the deviations of raw scores from predicted ones) as reflected by the subscript Y-Y'. Yet when deviations are the Y scores from their mean, SD'Y is used and not SD'Y-M. Similarly, when deviations are sample means from the population mean, SD'M is used and not SD'M-μ. This usage both is conventional and makes sense. When the reference point for a deviation is a mean, and when the particular mean used is clear from context, notation for standard deviations of any sort typically omits the reference point from the subscript.
Exercise 11.1
The Standard Error of Estimate and Adjusted R2
This exercise adds the capability to compute the estimated standard error of estimate and the adjusted R2 to the spreadsheets shown in Figs. 9.2 and 9.3.
1. Add a formula for the estimated standard error of estimate to the spreadsheets shown in Figs. 9.2 and 9.3. Note the similarity between this formula and the formula for the estimated standard deviation for the population.
2. What is the estimated standard error of estimate when drug treatment alone is used to predict number of lies detected? What percentage reduction in average error, beyond using just the mean as a predictor, does this represent? What is the estimated standard error of estimate when mood scores alone are used as the single predictor variable? Again, what percentage reduction does this represent?
3. Add a label and a formula for the adjusted R2 to the spreadsheets shown in Figs. 9.2 and 9.3. For each spreadsheet, what is the value for R2adj? How does each compare with the corresponding value for R2?
Standardized Partial Regression Coefficients
In common with the adjusted R2, standardized partial regression coefficients (usually symbolized as lower case Greek betas or βs) are not computed by most spreadsheet multiple-regression routines. (Do not confuse this use of beta with its use as a symbol for the probability of type II error.) Like correlation coefficients, and unlike unstandardized partial regression coefficients or bs, βs are standardized, so their values can be compared even when variables within a study have quite different scales. For that reason they figure prominently in most discussions and many applications of multiple regression. For the analyses presented in this book, however, the change in R2 when additional variables or sets of variables are added to a multiple-regression equation is emphasized. Little attention is given to the βs and only some to the bs (which are used to compute Y'), because for almost all of the analyses described in this book the only multiple-regression statistic actually needed is R2. Nonetheless, you should be aware of other multiple-regression statistics so that you recognize them when you encounter them, and you should also be aware that there is much to learn about multiple regression other than the brief and introductory material presented in this book.
Other Regression Statistics
We have yet to comment further on the last few lines in Fig. 11.1. These include the standard errors for the X coefficients (the unstandardized partial regression coefficients or the bs). They are useful because they can be used to compute a t score, which is also shown in Fig. 11.1 and which can be used to determine whether or not the individual partial regression coefficients are significantly different from zero. For present purposes, the statistical significance of regression coefficients is not emphasized. Instead, the analyses developed here utilize the significance of the increase in R2 that occurs when variables (or sets of variables) are added, step by step, to the regression equation. Exactly how this works is demonstrated later in this chapter, but first you need to know how to determine the amount of variance two predictors, working in concert, account for, and how to evaluate whether that amount of variance is statistically significant.
Exercise 11.2
Significance of Multiple R2
The template developed for this exercise allows you to evaluate the statistical significance of, and describe the proportion of criterion variance accounted for by, two predictor variables acting together. The predictor variables examined are mood and drug group, and the null hypothesis suggests that these two variables together have no effect on number of lies detected. In addition, using information from this and the previous two spreadsheets (Figs. 9.2 and 9.3), the proportion of variance accounted for uniquely by each predictor variable, separate from the other, can be computed.
General Instructions
1. This spreadsheet will have one column for the dependent variable (number of lies detected) and two for the independent variables (mood and drug group). Beginning with the spreadsheet shown in Fig. 10.3, insert a column for drug group and enter the appropriate data (1 for the first five subjects, 0 for the last five).
2. Beginning with this spreadsheet, you will use a multiple-regression routine to compute regression statistics. Run the routine and ensure that the cells for a, b1, and b2 display the values computed by it. Then correct the formula for Y' so that it is based on the linear combination of both predictor variables (Y' = a + b1X1 + b2X2). Also check that the value for R2 computed by the spreadsheet (as proportion of total variance due to the model) is the same as the R2 computed by the program.
3. From now on we will usually be concerned with more than one predictor variable, so change the label for the correlation coefficient from r (for simple correlation) to R (for multiple correlation). Compute R as the square root of R2. This eliminates the need for the three columns containing the deviation scores for X, their square, and the cross products for the X and Y deviation scores. They can be erased.
4. Finally, enter the correct degrees of freedom. At this point, all statistics, including the F ratio, should be displayed correctly, based on formulas already in place.
Detailed Instructions
1. Begin with the spreadsheet shown in Fig. 10.3. Insert a new column between columns C and D. This has the effect of moving the old columns D–K to columns E–L, opening up a new (and for the moment blank) column D. Alternatively, you may want to move the block D1-K17 to E1-L17, which also has the effect of opening up column D.
2. Label the new column D "Drug" (in cell D1). Enter the label "X1" (instead of "X") in cell C2 and the label "X2" in cell D2. Then enter the coded data for drug group in cells D3–D12 (1 for the first five subjects, 0 for the last five).
3. Enter the value for the parameter a (given in Fig. 11.1) in cell B16. Enter the values for the parameters b1 and b2 (again from Fig. 11.1) in cells C16 and D16. Alternatively, you may want to invoke whatever multiple-regression routine you plan to use and verify these values yourself.
4. From now on, we will usually be concerned with more than one predictor variable, which means we will compute multiple R, not r. Therefore enter the label "R=" (instead of "r=") in cell A17. In addition, replace the formula currently in cell B17 with a function that computes the square root of R2 (cell F17). The formula for R2, however, remains the same.
5. The R is computed from R2, and the regression statistics are computed by a multiple-regression routine; thus we no longer need the information in columns M–O (which were columns L-N in Fig. 10.3). These columns should be deleted.
6. Enter a formula for the equation, Y' = a + b1X1 + b2X2, in cells E3-E12, using the parameter values in cells B16-D16.
7. Enter the correct degrees of freedom in cells J14-L14. At this point, all of the statistics indicated in the spreadsheet should be computed correctly, using formulas already in place from the previous spreadsheet.
The spreadsheet you just completed should look like the one shown in Fig. 11.2. If everything was done correctly, then the R2 computed as the proportion of total variance accounted for by the model should be the same as the R2 in Fig. 11.1. Similarly, the square root of the mean square for error, which is the standard error of estimate, should have the value given in Fig. 11.1.
Multiple R Revisited
Multiple R is the correlation between Y and Y', between the observed scores for Y and the predicted scores, which are the optimally weighted sum of a set of predictor variables, X1, X2, and so forth. Earlier in this chapter, multiple correlation was defined as the relation between a criterion variable and a set of variables working in concert. What was meant by that phrase was the optimally weighted sum of a set of predictor variables, Y', as computed by Equation 11.4, the general prediction equation:
Y' = a + b1X1 + b2X2 + ... + bKXK
The weights used to compute Y' (that is, the regression coefficients b1, b2, and so forth) are selected by multiple-regression computations so that the error sum of squares will be the smallest value possible given the data, which is why the weighted sum is called optimal. As a consequence of minimizing the error sum of squares, the model sum of squares and hence R2 (and multiple R) will be maximized, that is, will assume the largest values possible for the data. As noted earlier, multiple regression inherently capitalizes on chance variation in data, producing the highest possible values of R and R2 for a given data set. This bias can be corrected by adjusting R2 (Equation 11.6), but even so readers should be aware that multiple-regression statistics are especially vulnerable to chance fluctuations in sampling and, especially with small data sets, results seen in one sample may not replicate in others.
For spreadsheet purposes, the value of multiple R can be computed in one of two ways: as the square root of R2 (as in the last exercise) or as the correlation between Y and Y' (although this could require that deviation scores and their squares and cross products be computed for both Y and Y'). The correlation method is demonstrated in the next exercise.
Row  A       B      C      D      E      F      G       H
1            Lies   Mood   Drug          y=     m=      e=
2    s       Y      X1     X2     Y'     Y-My   Y'-My   Y-Y'
3    1       3      5.5    1      4.762  -2.3   -0.54   -1.76
4    2       2      2      1      3.868  -3.3   -1.43   -1.87
5    3       4      4.5    1      4.507  -1.3   -0.79   -0.51
6    4       6      3      1      4.123  0.7    -1.18   1.877
7    5       6      1.5    1      3.74   0.7    -1.56   2.26
8    6       4      6.5    0      6.426  -1.3   1.126   -2.43
9    7       5      3.5    0      5.659  -0.3   0.359   -0.66
10   8       7      7      0      6.553  1.7    1.253   0.447
11   9       7      6      0      6.298  1.7    0.998   0.702
12   10      9      9      0      7.064  3.7    1.764   1.936
13   Sum=    53     48.5   5      53     0      0       0
14   N=      10     10     10     N=     10     10      10
15   Mean=   5.3    4.85   0.5    VAR=   4.01   1.386   2.624
16   a,b=    4.764  0.256  -1.41  SD=    2.002
17   R=      0.588                R2=    0.346
Row  I        J      K      L
1             SStot  SSmod  SSerr
2             y*y    m*m    e*e
13   SS=      40.1   13.86  26.24
14   df=      9      2      7
15   MS=      4.456  6.932  3.748
16   SD'=     2.111         1.936
17   R2adj=   0.159  F=     1.849
FIG. 11.2. Spreadsheet for computing the F ratio that evaluates the effect of mood and drug group together on number of lies detected. Rows 3-12 for columns I-L are not shown.
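The two ways of obtaining multiple R described above can also be compared outside a spreadsheet. The following sketch is an illustration under stated assumptions: the predicted scores are copied from the Y' column of Fig. 11.2 (so they are rounded to three decimals), the sums of squares are taken from the same figure, and pearson_r is a hypothetical helper written for this example.

```python
import math

lies = [3, 2, 4, 6, 6, 4, 5, 7, 7, 9]                     # observed Y
predicted = [4.762, 3.868, 4.507, 4.123, 3.74,
             6.426, 5.659, 6.553, 6.298, 7.064]           # Y' from Fig. 11.2


def pearson_r(x, y):
    """Correlation from deviation scores, their squares, and their cross products."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cross_products = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    ss_x = sum((xi - mean_x) ** 2 for xi in x)
    ss_y = sum((yi - mean_y) ** 2 for yi in y)
    return cross_products / math.sqrt(ss_x * ss_y)


r_multiple = pearson_r(lies, predicted)   # R as the correlation between Y and Y'
r_from_r2 = math.sqrt(13.86 / 40.1)       # R as the square root of SSmodel / SStotal

# Both values should come out near the Multiple R reported in Fig. 11.1.
print(round(r_multiple, 3), round(r_from_r2, 3))
```

Because the Y' values are rounded, the two results may differ slightly in the later decimal places.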
Exercise 11.3
The Correlation Between Observed and Predicted Scores
This exercise shows that multiple R is the correlation between Y and Y' scores.
1. Beginning with the spreadsheet shown in Fig. 11.2, compute the correlation between Y and Y'. There are two ways to do this: You could add columns for the appropriate deviations and their squares and cross products, and compute R according to Equation 9.11, or you could regress Y on Y' and compute the square root of the R2 provided by the multiple-regression routine. In other words, regressing Y on either Y', or on X1 and X2, should produce the same R2. Does it?
Note 11.1
R: Multiple R. Just as r is the correlation between a criterion variable (Y) and a single predictor variable (X), so R is the correlation between a criterion variable and an optimally weighted sum of a set of predictor variables (X1, X2, and so forth).
R2: Multiple R squared. It is the proportion of criterion variable variance accounted for by a set of predictor variables, working in concert.
R2adj: Adjusted multiple R squared. It is the population value for R2 estimated from a sample, which will be somewhat smaller than R2. Consistent with the notation used in this text, it could also be symbolized as R2' with the prime indicating an estimated value, but the adjusted subscript is used far more frequently in multiple-regression texts.
SD'Y-Y': The estimated standard error of estimate. It is the estimated population standard deviation for the regression residuals, that is, the differences between raw and predicted scores. It can be regarded as an estimate of the average error made in the population when prediction is based on a particular regression equation. Sometimes it is called the standard error of estimate, leaving off the qualifying estimated. Often it is symbolized with a circumflex or ^ above the SD and above the second Y subscript instead of a prime after them.
11.2 SIGNIFICANCE TESTING WITH MULTIPLE PREDICTORS
Exercise 11.2 should have convinced you how easy it is to generalize from one predictor to more than one predictor variable. All you did was add the data for a second variable and change the prediction equation and degrees of freedom to take the second variable into account. All other computations you needed were held over from the earlier spreadsheet, including the statistics needed for significance testing.
In chapter 10 you learned that the F ratio is:
F = MSmodel / MSerror = (SSmodel / dfmodel) / (SSerror / dferror) (11.8)
For the current example, the value for this F ratio is 1.85 (see Fig. 11.2), and it has 2 and 7 degrees of freedom. The critical value for F(2,7), alpha = .05, is 4.74, so this result fails to reach a conventional level of significance. We conclude that we cannot reject the null hypothesis that mood and drug acting together have no effect on number of lies detected, at least not with a sample of 10 subjects. Not all regression routines print SSmodel and SSerror, but all give R2. For that reason, an alternate but equivalent formula for the F ratio that often proves useful is
F = (R2model / dfmodel) / (R2error / dferror) (11.9)
The R2 for error is itself rarely printed but it is easily computed. The R2model is the proportion of variance accounted for by the model; thus R2error, the proportion unaccounted for, is:
R2error = 1 - R2model
And so we could rewrite Equation 11.9 as follows:
F = (R2model / dfmodel) / ((1 - R2model) / dferror) (11.10)
All we need to remember is
dfmodel = K
and
dferror = N - 1 - K
(N scores initially; one df used by the regression constant, the rest by the predictor variables). Thus, using Equation 11.10, you can test whether two (or more) variables together account for a statistically significant amount of criterion score variance.
Exercise 11.4
Computing the F Ratio Using R2
This exercise provides practice in using the R2 formulation for the F ratio.
1. Demonstrate that, for the data shown in Fig. 11.2, the F ratios computed using Equations 11.8 (SS) and 11.9 (R2) are identical.
2. Demonstrate algebraically that the two equations must give equivalent results. (Optional)
3. Assume that a single predictor variable accounts for 40% of the variance. What is the minimum number of subjects that would need to be included in a study in order to conclude that this amount of variance is significant at the .05 level? This exercise is more challenging than most. One way to determine the answer to this question is to set up a spreadsheet. Columns would indicate number of subjects, degrees of freedom, R2, and F. Rows would indicate successively greater numbers of subjects. In this way, the fewest number of subjects that would nonetheless yield a significant F ratio can be found.
4. Now determine the minimum number of subjects required for significance when the proportion of variance accounted for is 30%, 20%, 10%, and 5%. Note that if a statistical table does not contain an entry for the exact number of degrees of freedom you need, use the entry for the next fewer degrees of freedom. It is acceptable (and conservative) to claim fewer degrees of freedom than you have, but it is not acceptable to claim more.
5. Based on your answers to parts 3 and 4, what appears to be the relation between proportion of variance and number of subjects needed to find that proportion significant?
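As a numeric companion to the two formulations of the F ratio in this section, the sketch below computes F once from sums of squares and once from R2, using the values in Fig. 11.2. It is an illustration, not part of the spreadsheet template, and the function names f_from_ss and f_from_r2 are hypothetical.

```python
def f_from_ss(ss_model, ss_error, df_model, df_error):
    """F as the ratio of mean squares: (SSmodel / dfmodel) / (SSerror / dferror)."""
    return (ss_model / df_model) / (ss_error / df_error)


def f_from_r2(r2_model, n_subjects, n_predictors):
    """F from R2 alone: (R2 / dfmodel) / ((1 - R2) / dferror)."""
    df_model = n_predictors                    # one df per predictor
    df_error = n_subjects - 1 - n_predictors   # N scores, less the constant and predictors
    return (r2_model / df_model) / ((1 - r2_model) / df_error)


# Values from Fig. 11.2: SSmodel = 13.86, SSerror = 26.24, SStotal = 40.1, N = 10, K = 2.
f_ss = f_from_ss(13.86, 26.24, 2, 7)
f_r2 = f_from_r2(13.86 / 40.1, 10, 2)

critical_f = 4.74   # F(2,7), alpha = .05, as given in the text
print(round(f_ss, 3), round(f_r2, 3), f_ss > critical_f)
```

The two F values agree because dividing both sums of squares by the same total leaves their ratio unchanged, and neither exceeds the critical value quoted in the text.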
11.3 ACCOUNTING FOR UNIQUE ADDITIONAL VARIANCE
From the lie detection study analyses conducted in this chapter and in chapter 10 we have learned that, with respect to the DV, number of lies detected:
1. 28.0% of the variance is accounted for when mood scores alone are considered (see Fig. 10.3).
2. 30.2% of the variance is accounted for when drug group alone is considered (see Fig. 10.2).
3. 34.6% of the variance is accounted for by drug group and mood scores considered together in concert (see Fig. 11.2).
We could organize these three analyses into two different hierarchic series. For the first, we would begin with mood and then add drug group. From it we would learn that drug group accounts uniquely for 6.5% of the variance above and beyond that already accounted for by mood:
.346 - .280 = .065
(We used numbers accurate to five digits for this computation, and then rounded the result.) For the second, we would begin with drug group and then would add mood. From it we would learn that mood accounts uniquely for 4.4% of the variance above and beyond that already accounted for by drug group:
.346 - .302 = .044
Not surprisingly, the unique contribution of each predictor variable is less than its contribution alone. The two predictor variables are correlated, and as a result, part of their influence is joint; that is, it cannot be assigned uniquely to one variable or the other. This overlap is easy to compute. It must be 23.6%, which is the difference between the total variance and the unique variance accounted for by a predictor:
.280 - .044 = .236
Likewise (again rounding only after computing with five-digit values):
.302 - .065 = .236
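The unique and joint proportions can be recomputed from the raw scores, which also shows where the five-digit values come from. The sketch below is an illustration with one imported assumption: for the two-predictor R2 it uses the standard identity that expresses R2 in terms of the three pairwise correlations, which is not a formula presented in this book. The helper name pearson_r is hypothetical.

```python
import math

lies = [3, 2, 4, 6, 6, 4, 5, 7, 7, 9]              # Y
mood = [5.5, 2, 4.5, 3, 1.5, 6.5, 3.5, 7, 6, 9]    # X1
drug = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]              # X2


def pearson_r(x, y):
    """Simple correlation computed from deviation scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cross_products = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    ss_x = sum((xi - mean_x) ** 2 for xi in x)
    ss_y = sum((yi - mean_y) ** 2 for yi in y)
    return cross_products / math.sqrt(ss_x * ss_y)


r_y1 = pearson_r(mood, lies)   # mood with lies
r_y2 = pearson_r(drug, lies)   # drug group with lies
r_12 = pearson_r(mood, drug)   # the two predictors with each other

r2_mood = r_y1 ** 2
r2_drug = r_y2 ** 2
# Assumed identity for two predictors (standard, but not from this text).
r2_both = (r2_mood + r2_drug - 2 * r_y1 * r_y2 * r_12) / (1 - r_12 ** 2)

unique_drug = r2_both - r2_mood   # added by drug beyond mood
unique_mood = r2_both - r2_drug   # added by mood beyond drug
joint = r2_mood - unique_mood     # shared by the two predictors
unexplained = 1 - r2_both         # left unaccounted for

for label, value in [("unique drug", unique_drug), ("unique mood", unique_mood),
                     ("joint", joint), ("unexplained", unexplained)]:
    print(label, round(100 * value, 1))
```

Because the four proportions are rounded separately, the printed percentages need not sum to exactly 100.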
In other words, given two predictor variables, variability associated with the criterion variable can be divided into four pieces (see Fig. 11.3): 1. Variance accounted for uniquely by the first predictor (in this case 4.4% for mood scores, represented by the nonoverlapped part of the mood circle). 2. Variance accounted for uniquely by the second predictor (in this case 6.5% for drug group, represented by the nonoverlapped part of the drug circle). 3. Variance accounted for jointly by the two predictors (in this case 23.6%, represented by the overlap between the mood and drug circles). 4. Variance left unaccounted for by the predictor variables (in this case 65.5%, represented by the area outside the mood and drug group circles). Occasionally, the optimally weighted sum of two predictors will produce a higher R2 than the sum of the r2s for the two predictors separately. This pattern occurs when the relation between two predictor variables hides or suppresses their real relation with the criterion variable, as can happen when the correlation between two predictor variables is negative but both correlate positively with the criterion. In the presence of such a suppressor effect the overlap between the two predictor variables is negative, which makes drawing a figure like Fig. 11.3 untenable. For further discussion of suppression see Cohen and Cohen (1983). Even in the (relatively rare) presence of suppression, determining the unique contribution of a particular variable—the additional proportion of variance accounted for when that variable is added to the equation—is straightforward. The variable is added to the regression equation and a new R2 is computed. Its contribution is the difference between the new .R2 and the R2 for the previous equation, the one that did not include the new variable. If we call the new equation the larger model and the previous equation the smaller model, then the change in .R2 due to the new variable added is
and the degrees of freedom associated with this increase are
\[ df_{change} = df_{larger} - df_{smaller} \tag{11.12} \]
The Significance of the Second Predictor
Determining the statistical significance associated with this new variable is also straightforward. Again, an F ratio is used, but this F ratio tests the increase in R2 (dfchange is the numerator df and dferror is the denominator df):
\[ F = \frac{R^2_{change} / df_{change}}{(1 - R^2_{larger}) / df_{error}} \tag{11.13} \]
11.3 ACCOUNTING FOR UNIQUE ADDITIONAL VARIANCE
FIG. 11.3. Venn diagram partitioning variance for number of lies detected into four portions: variance accounted for uniquely by drug, uniquely by mood, by drug and mood jointly, and unaccounted or error variance. In this case,
Equation 11.13 is simply a more general version of Equation 11.10. It has the advantage of reminding us that an addition or change in R2 is being tested. An alternative version of Equation 11.13, formed by expanding R2change and dfchange (Equations 11.11-11.12) and dferror, is
\[ F = \frac{(R^2_{larger} - R^2_{smaller}) / (df_{larger} - df_{smaller})}{(1 - R^2_{larger}) / (N - 1 - df_{larger})} \tag{11.14} \]
This formulation has the advantage of bypassing the intermediate steps represented by Equations 11.11 and 11.12. When the effect of a model is compared to no model (as in Equation 11.10), the numerator degrees of freedom are the number of predictor variables associated with that model. Similarly, when the effect of a larger model is compared to a smaller model (as in Equations 11.13 and 11.14), the numerator degrees of freedom for the change in R2 are the number of variables added to the equation, which is the difference in degrees of freedom between the larger and smaller models. If variables are added one at a time, as they have been in the examples presented so far, dfchange = 1, but as you can see, Equations 11.11-11.14 apply when more than one variable is added to a preexisting set as well. We mention this matter again at the end of the chapter.
For the present example, the F ratio used to evaluate the significance of the unique contribution of drug group, above and beyond any contribution already made by mood, is
\[ F(1, 7) = \frac{0.065 / 1}{(1 - 0.346) / 7} = 0.70 \]
The critical value of F with 1 and 7 degrees of freedom, alpha = .05, is 5.59; thus we conclude that the unique contribution of drug group observed for this sample would occur by chance more than 5% of the time. With a sample size of 10 and two predictor variables it is not rare enough to reject the null hypothesis that the unique contribution of drug group is zero.
It is helpful to organize the results of analyses like these as shown in Fig. 11.4. Each row represents a step and indicates the proportion of variance accounted for by all variables in the equation as of that step (R2total), as well as the unique proportion accounted for by the variable (or variables) added at that step (R2change). In addition, the table shows the F ratios and their associated degrees of freedom for the various R2s.
It may seem disappointing that none of the results for our running example reached a conventional level of significance. Actually, this is quite realistic. For convenience, the current example includes a relatively small number of subjects (N = 10) and, as we discuss in chapter 17, statistically significant results are found for samples this small only when effects are very strong. Still, the proportion of variance accounted for uniquely by drug group, above and beyond the contribution of mood, was a nontrivial 6.5%. With a larger sample size, an effect of this magnitude would be statistically significant. As the present example demonstrates, once again, there is no reason to be overimpressed with statistical significance or to be depressed by the lack of it, and there is every reason to pay attention to the magnitude of effects.
The previous analysis evaluated the unique contribution of drug group given mood scores, but different underlying theoretical concerns might lead us to evaluate the unique contribution of mood given drug group.
The F ratio used to evaluate the significance of the unique contribution of mood, above and beyond the contribution already made by drug group, is
Again this F ratio is not significant, so we decide that, in addition to knowledge of a subject's drug group, knowledge of mood scores would not allow us to make significantly better predictions for the number of lies detected. Statistical significance aside, there is much in the way of technique to be gained from the current analyses. For each variable (or set of variables) added to a multiple-regression equation, you now understand how to determine its contribution to variance accounting, its R2change, and you know how to decide whether or not that contribution is statistically significant. Moreover, using Fig. 11.4 as a model, you now know how to organize the results of such hierarchic regression analyses. Such analyses are very general, as you will soon see.
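The four-way partition of variance described earlier in this section follows from three quantities: the r2 for each predictor alone and the R2 for both together. The sketch below is only an illustration; the function name partition_variance is hypothetical, and the value .301 for drug group alone is an assumption implied by the percentages quoted in the text (6.5% unique plus 23.6% joint), not a figure the text states directly.

```python
def partition_variance(r2_first, r2_second, r2_both):
    """Split criterion variance into the four pieces of a two-predictor Venn diagram."""
    unique_first = r2_both - r2_second       # gain when the first predictor is added last
    unique_second = r2_both - r2_first       # gain when the second predictor is added last
    joint = r2_first + r2_second - r2_both   # overlap; negative when suppression is present
    unaccounted = 1.0 - r2_both              # error variance
    return unique_first, unique_second, joint, unaccounted


# Proportions implied by the text: mood alone, drug alone (assumed), both together
pieces = partition_variance(0.280, 0.301, 0.345)
labels = ("mood only", "drug only", "joint", "unaccounted")
for label, value in zip(labels, pieces):
    print(f"{label:12s} {value:.3f}")
```

A negative value for the joint piece would signal the suppression pattern discussed above, which is why the Venn diagram cannot be drawn in that case.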
                         Total                     Change
Step  Variable added     R2      df     F          R2      df     F
1     Mood               0.280   1,8    3.12       0.280   1,8    3.12
2     Drug               0.346   2,7    1.85       0.065   1,7    0.70
FIG. 11.4. Predicting lies detected: Adding drug to mood.
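The entries in a hierarchic table such as Fig. 11.4 can be rebuilt from the R2 values alone. The following is a minimal sketch, assuming that error degrees of freedom equal N - 1 minus the degrees of freedom of the (larger) model, which matches the df shown in the figure; the function names are hypothetical. Because the R2 values entered here are already rounded to three places, the F ratios may differ from the figure in the second decimal.

```python
def f_for_model(r2, df_model, n):
    """F ratio for a model compared with no model."""
    df_error = n - 1 - df_model
    return (r2 / df_model) / ((1 - r2) / df_error)


def f_for_change(r2_larger, r2_smaller, df_larger, df_smaller, n):
    """F ratio for the increase in R2 when a larger model replaces a smaller one."""
    df_change = df_larger - df_smaller
    df_error = n - 1 - df_larger
    return ((r2_larger - r2_smaller) / df_change) / ((1 - r2_larger) / df_error)


n = 10  # subjects in the lie-detection example
steps = [("Mood", 0.280, 1), ("Drug", 0.346, 2)]  # (variable added, total R2, model df)

previous_r2, previous_df = 0.0, 0
for name, r2, df in steps:
    total_f = f_for_model(r2, df, n)
    change_f = f_for_change(r2, previous_r2, df, previous_df, n)
    print(f"{name:5s} total R2={r2:.3f} F={total_f:.2f}  "
          f"change R2={r2 - previous_r2:.3f} F={change_f:.2f}")
    previous_r2, previous_df = r2, df

# The same R2 values with a hypothetical larger sample
print(f"N=63: F for change = {f_for_change(0.346, 0.280, 2, 1, 63):.2f}")
```

The last line illustrates the point made in the text about sample size: with the R2 values held fixed and a hypothetical N of 63, the F ratio for the change is about 6, which would exceed the tabled critical value of 4.00 for 1 and 60 degrees of freedom.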
Exercise 11.5 Significance of Increases in R2
For this exercise you are asked to organize the hierarchic multiple-regression results for two previous examples.
1. Fig. 11.4 shows a hierarchic analysis that adds the drug variable to mood. Prepare a table like the one shown in Fig. 11.4, but for the analysis that adds mood to drug group. How much additional variance in number of lies detected can you account for if you know mood scores in addition to drug group membership? Is this amount statistically significant?
2. Recall the example that examined the effect of number of older siblings on number of words infants spoke (Exercises 9.6 and 10.5). First we regressed number of words on the actual number of older siblings; then in a separate analysis, number of words was regressed on a binary variable representing simply whether or not infants had at least one older sibling. The number of siblings accounted for more variance than the binary variable, but these were two separate analyses and so we do not know whether knowing the actual number of siblings, in addition to knowing simply whether or not infants had older sibling(s), yields a significant increase in variance accounted for. To find out, we would regress number of words, first on the binary variable, then on both the binary and the quantitative variable together, and evaluate the increase in R2. Do this analysis and prepare a table modeled after Fig. 11.4 that presents the results. How do you interpret these results?
11.4 HIERARCHIC MRC AND THE ANALYSIS OF COVARIANCE
The way MRC was used in the last section provides an example of what is called hierarchic multiple regression. In the case of hierarchic multiple regression, variables (or sets of variables) are added to the equation one step at a time, and ordered in a way that makes sense to the investigator. It would make sense to call this procedure stepwise regression. Unfortunately, that term has been preempted and usually refers to an approach whereby variables that account for more variance in a particular data set are selected automatically by the computer program and are added to the regression equation before variables that account for less variance, no matter their meaning or interpretability. This kind of automatic stepwise selection capitalizes on chance variation and, especially in small data sets, may lead to results that are difficult to interpret or replicate. It is well suited to certain technical predictive tasks, but is usually not very useful if you are interested in explanation or substantive interpretation of your predictor variables. Still, because the term does occur with some frequency, it is important for you to know what stepwise regression usually means, even though it is not used here.
With hierarchic multiple regression, the researcher and not the computer determines the order in which variables are entered into the multiple-regression equation. Thus, variables regarded as antecedent in some way are added before variables regarded as consequent, and the significance of each new variable (or set of variables) added is tested in turn. This approach is very general and powerful (indeed, it forms the basis for most of the analyses presented in this book), but one particular use to which this approach can be put is traditionally called the analysis of covariance.
Whether a predictor variable is regarded as a covariate depends not on the particular variable, but rather on the researcher's view of the matter. Within the experimental tradition, a covariate is a variable that is thought to be associated with (i.e., to covary with) the dependent variable but whose influence is not of primary concern. It is more of a nuisance, obscuring a clear evaluation of the effect of the experimental independent variable on the dependent variable. The purpose of the analysis of covariance is to control for the effect of the covariate statistically, in effect neutralizing it. As you can see, the hierarchic regression analysis discussed in the last section does exactly that.
Imagine, for example, that our major concern is the effect of drug on number of lies detected. However, we have reason to think that mood might also have an effect; that is, we think of mood as a covariate. And even though subjects were randomly assigned to the two treatment conditions (drug and placebo), we still note higher mood scores in one of them. In this case we would add mood scores to the regression equation first (step 1; see Fig. 11.4). Then the increment in R2 that occurs when drug group is added to the equation (step 2) gives us the effect of drug, above and beyond any effect of mood, on the number of lies detected.
In this way, the effect of mood is controlled statistically and we can evaluate the effect of the drug manipulation uncontaminated by variability in mood scores. In the present case, we see that, in addition to the 28.0% of the variance in number of lies detected accounted for by mood scores, an additional 6.5% is accounted for uniquely by drug group. The example just presented assumed an experimental study (presumably, subjects were randomly assigned to the two drug groups), but the need to control for background variables, or other covariates, arises in nonexperimental studies as well. In fact, the need is usually greater because nonexperimental studies, lacking experimental control, must rely more on statistical control. Whether or not studies are experimental, however, the approach is the same, which demonstrates the generality of a hierarchic multiple-regression framework.
An Example: The Button-Pushing Study
In chapter 13 we return to the analysis of covariance in greater detail, but first a new study is introduced. This study is used to exemplify both an analysis of covariance and other analytic strategies discussed in this and subsequent chapters. It has more subjects and more research factors than the lie detection study we have been using up to now for our running example.
Imagine we are interested in how people with different experiences perceive infants. In particular, we want to know if parents see, that is, detect, more communicative attempts on the part of preverbal infants than nonparents. For this study we prepare videotapes of preverbal infants playing. We show them to some participants who are parents and some who are not, and we ask the subjects to push a button every time they believe the infant has done something the adult participant views as communicative. The button is attached to a microcomputer that records the number of presses.
The dependent variable for this study, then, is the number of button pushes for each subject. One independent variable is parental status (parent vs. nonparent), and one research hypothesis purports that parents will push the button more than nonparents. Another variable is the subject's age. Not surprisingly, after recruiting subjects we note that parents tend to be older than the nonparents and we worry that older subjects may push the button more than younger subjects, introducing a confound into our study. If parents are more frequent button pushers than nonparents, it might be due to their age and not their parental status.
One solution to the age-parental status confound is to analyze the data using hierarchical multiple regression. First the subject's age would be entered into the regression equation (step 1), then parental status (step 2). In this way the effect of parental status on the number of button pushes, above and beyond any effect of age, can be evaluated. This is, in effect, an analysis of covariance in which age serves as the covariate. The data for this study are given in Fig. 11.5 and the necessary computations for the analysis of covariance just described are performed during the course of the next exercise.
Exercise 11.6 An Analysis of Covariance
This exercise uses the template shown in Fig. 11.2, but incorporates data from the button-pushing instead of the lie detection study. The predictor variables are age and parental status. These are examined hierarchically, first subject's age (step 1), then parental status (step 2). Age can be regarded as a covariate and the entire analysis, an analysis of covariance. The null hypothesis suggests that parental status does not account for a significant increase in R2, above any variance already accounted for by the subject's age.
Subject   No. of pushes   Subject's age   Parental status
1         102             35              1
2         125             38              1
3         95              40              1
4         130             32              1
5         79              29              1
6         93              44              1
7         75              26              1
8         69              36              1
9         43              27              0
10        82              26              0
11        69              18              0
12        66              22              0
13        101             31              0
14        94              21              0
15        84              27              0
16        69              28              0
FIG. 11.5. Data for the button-pushing study: 1 = parent, 0 = nonparent.
General Instructions
1. Modify the spreadsheet shown in Fig. 11.2 to accommodate data from the button-pushing study. This involves relabeling the data columns, inserting rows for additional subjects, and entering new data. The required formulas are either already in place or can be copied into new rows.
2. Do step 1 of the hierarchic regression analysis. Regress number of button pushes on age and note its R2 and associated F ratio.
3. Do step 2 of the hierarchic regression analysis. Regress number of button pushes on both age and parental status. Note both the final R2 and the change in R2.
4. To complete the analysis, organize your results in a stepwise table modeled after Fig. 11.4. Then answer the questions posed in part 10 of the detailed instructions.
Detailed Instructions
1. Begin with the spreadsheet shown in Fig. 11.2. Relabel columns B–D, row 2, as follows:
   Label   Column   Meaning
   Y       B        The number of button pushes. In general, "Y" is used to indicate the dependent variable.
   X       C        Subject's age. In general, "X" is used to indicate an independent variable or covariate. (In Fig. 11.2 this column was labeled "X1.")
   A       D        Parental status. (Often, A, B, C, and so forth, are used to indicate categorical independent between-subjects variables, although in Fig. 11.2 this column was labeled "X2.")
2. Enter the labels "#BPs" for number of button pushes in B1, "Age" in C1, and "PSt" for parental status in D1.
3. Insert six rows before row 13 (or move the block A13-L18 to A19-L24). This makes room for the 16 (instead of 10) subjects used for the button-pushing study.
4. Extend the subject numbers from 10 to 16 in column A and enter the data from Fig. 11.5 (number of pushes, age, parental status) in columns B, C, and D respectively.
5. Using a multiple-regression routine, find the values for a and b for the step 1 model, #BPs = a + b Age, ignoring parental status. Enter the values for the parameters a and b in cells B22 and C22 respectively. Enter the predictive formula, Y' = a + bX, in cells E3–E18.
6. Copy the formulas in row 12, columns F–L, to the new rows 13-18. Rows 3–12, columns F–L, should already contain the correct formulas from the previous spreadsheet, as should rows 19–23, columns A–L.
7. Enter the correct values for the total, model, and error degrees of freedom in cells J20-L20 respectively. At this point, the correct R2 for the step 1 model (the one using only age as a predictor) should be displayed in cell F23. This completes step 1 of the hierarchic regression analysis. At this point, your spreadsheet should look like the one given in Fig. 11.6.
8. For step 2 of the analysis, the spreadsheet is modified as follows. Using a multiple-regression routine, find the values for a, b1, and b2 for the model, #BPs = a + b1 Age + b2 PSt. Enter the values for the parameters a, b1, and b2 in cells B22, C22, and D22 respectively. Enter the predictive formula Y' = a + b1X1 + b2X2 in cells E3–E18.
9. Now enter the correct values for the total, model, and error degrees of freedom for this model in cells J20-L20 respectively. At this point, the correct R2 for the step 2 model (the one using both age and status as predictors) should be displayed in cell F23 and your step 2 spreadsheet should look like the one shown in Fig. 11.7.
10. To complete the exercise, organize your results in a hierarchic regression table organized like Fig. 11.4. In this case, what is the critical value of F, alpha = .05, for one predictor? Is age a significant predictor for the number of button pushes? What is the critical value of F, alpha = .05, for two predictors? Do age and status together account for a significant amount of variance? Finally, does parental status make a significant unique contribution to prediction, above any contribution made by age?
After step 1, your spreadsheet should look like Fig. 11.6, and after step 2, like Fig. 11.7. The hierarchic results from the last exercise should look like those shown in Fig. 11.8. Even though age accounted for 21.5% of the variance, with 16 subjects this amount approached, but did not reach, conventional levels of significance. The critical value for F(1,14), alpha = .05, is 4.60 and the obtained F was 3.83. If the effect of age had been our primary concern, we would have learned from this study that an N of 16 is simply inadequate to find significant an effect that accounts for approximately 20% of the sample variance. Age and parental status together accounted for 25.0% of the variance (F(2,13) = 2.16, NS), and parental status accounted uniquely for 3.5% of the variance above that accounted for by age alone (F(1,13) = 0.61, NS). Again, neither effect was statistically significant. Still, there is an important lesson to be learned here.
It is important, if we are not to waste time and resources, to decide before any study begins how large an effect we think is important and to ensure that sufficient subjects are studied so that we have a reasonable chance of detecting effects of that size as statistically significant. How to do this is explained in chapter 17. Otherwise we are left, as in the present case, knowing that it is more probable than we would like (and more probable than journal editors usually allow) that the size of the effects we see could be just a lucky draw from a population in which the true values of the R2s under investigation are zero.
Exercise 11.7 Hierarchical Regression in SPSS
In this exercise you will learn how to conduct a hierarchical regression in SPSS.
1. Open the Lies and Drug data file you created in Exercise 10.6. Create a new variable for mood and enter the data.
2. Select Analyze -> Regression -> Linear from the main menu. Move Lies to the Dependent window and Mood to the Independent(s) window. Click on Next under Block 1 of 1 and move Drug to the Independent(s) window. Click on Statistics and check the R squared change box. Click Continue and then OK.
3. Examine the Model Summary in the output. On the left-hand side you will find the statistics for the total model. Thus Model 1 represents the first step that includes only mood scores. Model 2 includes both mood and drug as predictors. In the right-hand box of the Model Summary you will find the R2, F, df, and significance of the change. Thus Model 2 represents the amount of variance and significance of adding drug to mood scores. Make sure that these values agree with Fig. 11.4.
4. In the ANOVA table you will find the test of each model. Thus, Model 1 tests if mood alone is a significant predictor of lies, and Model 2 represents the significance of mood and drug together as predictors of lies. Finally, in the Coefficients table you will find the values of the unstandardized and standardized regression coefficients.
5. As additional practice, use the SPSS Regression procedure to reanalyze the button-pushing study presented in Exercise 11.6.
[Spreadsheet not reproduced; only its summary values are recoverable. Step 1 model (#BPs regressed on Age): a = 42.79, b = 1.44, R = 0.463, R2 = 0.215, R2adj = 0.159. SStotal = 7438 (df = 15), SSmodel = 1597 (df = 1), SSerror = 5841 (df = 14); MSmodel = 1597, MSerror = 417.2, F = 3.829.]
FIG. 11.6. Spreadsheet for evaluating the effect of age on number of button pushes. Rows 3-18 for columns I-L are not shown.
[Spreadsheet not reproduced; only its summary values are recoverable. Step 2 model (#BPs regressed on Age and PSt): a = 55.12, b1 = 0.835, b2 = 11.65, R = 0.5, R2 = 0.25, R2adj = 0.134. SStotal = 7438 (df = 15), SSmodel = 1858 (df = 2), SSerror = 5580 (df = 13); MSmodel = 929, MSerror = 429.2, F = 2.164.]
FIG. 11.7. Spreadsheet for evaluating the effect of age and parental status together on number of button pushes. Rows 3-18 for columns I-L are not shown.
                         Total                     Change
Step  Variable added     R2      df      F         R2      df      F
1     Age                .215    1,14    3.83      .215    1,14    3.83
2     Status             .250    2,13    2.16      .035    1,13    <1
FIG. 11.8. Predicting number of button pushes: adding parental status to age. The value of the F ratio is actually 0.608, but usually values less than 1 are indicated simply as <1.
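The hierarchic results in Fig. 11.8 can also be checked outside a spreadsheet or SPSS. The sketch below is one possible pure-Python version, not the procedure used in the text: it fits each model by ordinary least squares through the normal equations, using the data from Fig. 11.5. The helper names solve_linear and fit are hypothetical.

```python
def solve_linear(a, b):
    """Solve a small linear system by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit(y, predictors):
    """Least-squares fit of y on an intercept plus predictor columns; returns (coefficients, R2)."""
    rows = [[1.0] + [float(col[i]) for col in predictors] for i in range(len(y))]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    coefs = solve_linear(xtx, xty)
    mean_y = sum(y) / len(y)
    ss_total = sum((yi - mean_y) ** 2 for yi in y)
    ss_error = sum((yi - sum(c * v for c, v in zip(coefs, r))) ** 2
                   for yi, r in zip(y, rows))
    return coefs, 1 - ss_error / ss_total


# Data from Fig. 11.5
pushes = [102, 125, 95, 130, 79, 93, 75, 69, 43, 82, 69, 66, 101, 94, 84, 69]
age = [35, 38, 40, 32, 29, 44, 26, 36, 27, 26, 18, 22, 31, 21, 27, 28]
parent = [1] * 8 + [0] * 8
n = len(pushes)

coefs1, r2_step1 = fit(pushes, [age])          # step 1: age only
coefs2, r2_step2 = fit(pushes, [age, parent])  # step 2: age plus parental status

f_step1 = (r2_step1 / 1) / ((1 - r2_step1) / (n - 1 - 1))
f_step2 = (r2_step2 / 2) / ((1 - r2_step2) / (n - 1 - 2))
f_change = ((r2_step2 - r2_step1) / 1) / ((1 - r2_step2) / (n - 1 - 2))

print(f"Step 1: R2={r2_step1:.3f}, F(1,14)={f_step1:.2f}")
print(f"Step 2: R2={r2_step2:.3f}, F(2,13)={f_step2:.2f}")
print(f"Change: R2={r2_step2 - r2_step1:.3f}, F(1,13)={f_change:.3f}")
```

The printed values should agree with Fig. 11.8 to the precision shown there, including the F ratio for the change that the figure reports simply as less than 1.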
11.5 MORE THAN TWO PREDICTORS
The discussion and examples used in this chapter have been cast in terms of just two predictor variables. In chapter 8 we learned how to compute the proportion of criterion variance accounted for by one predictor variable, and in chapter 10 we learned how to determine if that proportion of variance was statistically significant. In this chapter we learned how to compute the proportion of criterion variance accounted for by two predictor variables considered together as a set. We also learned how to compute the proportion of criterion variance uniquely accounted for by the second predictor variable when added to the first. Finally, we learned how to evaluate the statistical significance of the R2 accounted for by the set of two variables and the increment in R2 that occurred when the second predictor was added to the first.
These concepts and computations apply with only slight modification when three or more predictor variables are under consideration. For each additional predictor, another column is added to the spreadsheet. The partial regression coefficient for that predictor is computed and incorporated into the predictive equation. The computation for R2 (the proportion of total variance accounted for by the model), with the evaluation of its significance, is the same whether two or more predictors are used. No matter the number of predictor variables, the F ratio (to restate Equation 11.10) is
\[ F = \frac{R^2 / df_{model}}{(1 - R^2) / df_{error}} \]
Similarly, hierarchic regression is easily extended beyond two variables. Moreover, a step can consist of adding either a single variable or a set of variables. Thus, at each step, an additional variable, or a set of variables, is added. When evaluating the significance of that variable (or set), the smaller model is the previous model, the larger model is the new one, and the increase in R2 is the difference between R2larger and R2smaller. Thus, for example, the larger model for step 2 becomes the smaller model for step 3, and so forth. As noted earlier, the F ratio for evaluating the significance of the variable (or variables) entered at each step (to restate Equation 11.14) is
\[ F = \frac{(R^2_{larger} - R^2_{smaller}) / (df_{larger} - df_{smaller})}{(1 - R^2_{larger}) / (N - 1 - df_{larger})} \]
Again as noted earlier, the two F ratios given in this and the preceding paragraph are really the same. The first is simply a special and more limited case of the second.
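To see why the first F ratio is a special case of the second, one can treat "no model" as a smaller model that contains no predictors. The short derivation below is a sketch in the notation of this chapter; writing the error degrees of freedom as N - 1 - df_larger is an assumption, chosen because it agrees with the degrees of freedom shown in Figs. 11.4 and 11.8.

```latex
\[
F = \frac{(R^2_{larger} - R^2_{smaller}) / (df_{larger} - df_{smaller})}
         {(1 - R^2_{larger}) / (N - 1 - df_{larger})}
\]
% A smaller model with no predictors accounts for no variance and uses no df:
% R^2_{smaller} = 0 and df_{smaller} = 0, so the ratio reduces to
\[
F = \frac{R^2_{larger} / df_{larger}}
         {(1 - R^2_{larger}) / (N - 1 - df_{larger})}
  = \frac{R^2 / df_{model}}{(1 - R^2) / df_{error}}
\]
```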
The material summarized in the previous three paragraphs may seem clear and simple enough. However, unless one has had some experience with data analysis, the stunning simplicity and power of these concepts and computations may not be apparent. Essentially, whenever the criterion or dependent variable is quantitative, the procedures just outlined apply. In other words, almost all of the basic sorts of analyses used by behavioral scientists can be understood, and performed, based on the material presented in this chapter. It only remains to define the independent variables, or the research factors of interest. In substantive terms, this means identifying those factors or variables that the investigators think might affect or account for variation in the variable of interest (i.e., in the criterion or dependent variable).
There are caveats, of course. For example, the independent variables should be well measured and reasonably distributed. And, certainly for nonexperimental studies, these variables should not exclude variables that strongly influence the criterion being studied. Such cautions are important and usually are amply discussed in multiple-regression texts (see, e.g., Cohen & Cohen, 1983). We do not discuss these cautions further here but urge anyone using multiple-regression techniques to be aware that there are pitfalls for unwary users, and any seemingly nonsensical results should be discussed with knowledgeable colleagues.
The usefulness of defining several independent variables is probably self-evident. Typically researchers are interested in more than one factor: Variables other than just age may warrant investigation, for example. Moreover, because not all independent variables are quantitative, the usefulness of evaluating categorical as well as quantitative independent variables is probably also self-evident. What may not yet be evident is how productively these two strategies can be combined.
As we will see in the next chapter, categorical independent variables (other than binary variables) are represented with more than one predictor variable. Such sets of coded variables, each set representing a categorical or nominal variable, when combined with multiple-regression techniques, provide a straightforward and general way to analyze data from a variety of different studies. In the next chapter we describe how to perform a one-way analysis of variance, analyzing data from a single-factor between-subjects study.
12
Single-Factor Between-Subjects Studies
In this chapter you will:
1. Learn how to render a categorical variable that has more than two categories as a set of coded variables. The set can then be used to represent the categorical variable in multiple-regression equations.
2. Learn two different ways of coding categorical variables: dummy coding and contrast coding.
3. Learn how to perform a one-way analysis of variance using coded variables. This allows you to analyze data from single-factor between-subjects studies when the single factor encompasses more than two levels or groups.
This chapter describes how to analyze data from single-factor between-subjects studies. A between-subjects analysis (as opposed to a within-subjects analysis, which is discussed in chap. 15) is appropriate whenever a score for the dependent variable is determined once, and only once (as opposed to repeatedly), for each subject. In the interests of reliability, several measurements might be made and then combined into one score, but such data would still require a between-subjects analysis. For example, in the button-pushing study, the total number of times the button was pushed was tallied once for each subject. This total number was then analyzed to see if it was affected by the subject's age or parental status.
A single-factor study, as the name implies, is concerned with the effect of a single research factor on the dependent variable. As such, single-factor studies are among the simplest researchers use. For example, if we confine our attention solely to parental status, then the button-pushing study could be an example of a single-factor study. Moreover, it would be the simplest single-factor between-subjects study possible, one involving only two groups or two levels of the factor. We have already discussed in chapters 8 and 9 how to analyze data from single-factor between-subjects studies when only two groups are involved, but studies involving more than two groups are common.
For example, a researcher might want to investigate the effect of three or four different kinds of treatment on patients' neuroticism or the effect of religious background on individual attitudes toward abortion.
In general terms, the analysis of data from a single-factor study allows us to determine whether knowing the group to which a subject belongs (which could be a self-assigned natural group or an experimenter-assigned experimental group) allows us to make better predictions than otherwise as to how that individual will respond. In statistical terms, the question is whether the proportion of variance accounted for by the research factor is significantly different from zero. The null hypothesis, which claims no effect for the research factor, would predict an R2 of zero. A sample will almost always yield a value of R2 higher than that, but the key statistical question is whether the obtained value of R2 would occur less than 5% of the time just due to the luck of the draw, if the population value for R2 really were zero. If the F ratio associated with R2 exceeds the appropriate critical value, we declare the null hypothesis untenable and label our obtained value statistically significant: In other words, we claim a statistically significant effect.
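The idea that a sample R2 is almost always above zero even when the population value is zero can be illustrated with a small simulation. This is a sketch added for illustration, not an analysis from the text: it assumes 16 subjects in two groups of 8 (as in the button-pushing study), draws scores that are unrelated to group membership, and counts how often the sample R2 exceeds the value that corresponds to the critical F(1,14) of 4.60. The seed and the number of draws are arbitrary choices.

```python
import random


def r_squared(x, y):
    """Squared correlation between x and y (R2 for a single predictor)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)


random.seed(1)                    # arbitrary seed so the run is repeatable
group = [1] * 8 + [0] * 8         # binary research factor, 8 subjects per group
f_critical, df_error = 4.60, 14   # critical F(1,14), alpha = .05
# For one predictor, F = R2 / ((1 - R2) / df_error), so R2 = F / (F + df_error)
r2_critical = f_critical / (f_critical + df_error)

# Scores drawn with no relation to group: the population R2 is zero
draws = [r_squared(group, [random.gauss(0, 1) for _ in group]) for _ in range(4000)]

mean_r2 = sum(draws) / len(draws)
share_significant = sum(r2 > r2_critical for r2 in draws) / len(draws)
print(f"mean sample R2 under the null: {mean_r2:.3f}")
print(f"share of samples exceeding the critical R2 of {r2_critical:.3f}: {share_significant:.3f}")
```

The share of simulated samples that exceed the critical value should be close to the 5% that the alpha level implies, while the typical sample R2 stays above zero.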
12.1 CODING CATEGORICAL PREDICTOR VARIABLES
In order to test the significance of a single research factor, we need some way to represent the group to which each subject belongs. If the single research factor were a quantitative variable, like maternal age, then one predictor variable would suffice and for each subject the value for this variable would be that subject's age. This variable could be entered into a multiple-regression equation, where it would account for one degree of freedom. Similarly, if the single research factor were a binary qualitative variable, like parental status (parent vs. nonparent), again one predictor variable would suffice. Parents would be assigned one value, nonparents a second value. As we saw in Exercise 9.4, the values selected to represent the two groups are not critical; the two values must be different and that is all. A binary research factor, then, is represented with one coded variable and hence accounts for one degree of freedom.
Qualitative research factors comprising more than two groups or levels are handled somewhat differently. The general rule is, a factor with G groups or levels must be represented with G – 1 coded variables. There are several ways to do this, but we emphasize two here: dummy variable coding, because it is so simple, and contrast coding, because contrasts can be interpreted as planned comparisons.
Dummy Variable Coding Dummy variable coding reduces the information concerning group membership into a series of binary distinctions (see Fig. 12.1). It is particularly appropriate when one group is regarded as a comparison or control group and other groups are compared to it. For example, individuals with Catholic, Jewish, or Protestant religious backgrounds might be compared to all others. In this case, religious background would be represented with four levels, the three just listed plus a fourth category for Other. Yet group membership can be represented unambiguously with just three (G –1) predictor variables. Values for the first variable, X1 would be 1 = Catholic and 0 = Non-Catholic. Similarly, values for the second variable, X2, would be 1 = Jewish and 0 = Non-Jewish, and the values for the third variable, X3, would be 1 = Protestant and 0 = Non-Protestant. Each variable represents a religious group and subjects receive a code of one for the group to which they belong. Subjects belonging to the Other category would be assigned a code of zero for all three predictor variables.
Dummy-coded variables for 2-5 groups

                  2 groups   3 groups   4 groups    5 groups
Group  Subject    X1         X1 X2      X1 X2 X3    X1 X2 X3 X4
G1     S1         1          1  0       1  0  0     1  0  0  0
G1     .          1          1  0       1  0  0     1  0  0  0
G2     .          0          0  1       0  1  0     0  1  0  0
G2     .          0          0  1       0  1  0     0  1  0  0
G3     .                     0  0       0  0  1     0  0  1  0
G3     .                     0  0       0  0  1     0  0  1  0
G4     .                                0  0  0     0  0  0  1
G4     .                                0  0  0     0  0  0  1
G5     .                                            0  0  0  0
G5     SN                                           0  0  0  0
FIG. 12.1. Values for dummy-coded variables for research factors comprising two, three, four, and five levels or groups. Two subjects per group are shown, although usually groups would have more subjects.
As you can see from Fig. 12.1, three binary-coded predictor variables allow for four patterns (100, 010, 001, and 000). A fourth predictor variable (presumably coded 1 for Other) would be redundant and including it in our set of predictor variables would result in an error message from the multiple-regression routine. Likewise, two binary-coded predictor variables allow for three patterns, four allow for five patterns, and so forth. In general, G - 1 predictor variables are required to indicate to which of G groups a subject belongs.

In chapter 10 we noted that the degrees of freedom associated with a categorical variable were one less than the number of categories (or levels or groups) and argued that, once the means for all groups but the last one had been determined, the last was not free to vary. Here the same argument is made in a somewhat different way. In order to unambiguously represent group membership for G groups, G - 1 predictor variables are needed. Each predictor variable represents a degree of freedom, so again we see that the degrees of freedom associated with a categorical variable are one less than the number of categories.

Occasionally a student will ask, since a quantitative variable like maternal age requires only one predictor variable, why should a qualitative variable like religious background require more than one? Why not code religious background 1 = Catholic, 2 = Jewish, 3 = Protestant, and 4 = Other and use only one predictor variable and hence one degree of freedom? To do so would violate what a categorical variable means and would deny the amount of unique information a categorical variable conveys. If we indeed scaled religious background (that is, assigned values 1-4 to the four categories), we would be saying, in effect, that religious background could be quantified and Protestants had more of it than Jews who had more of it than Catholics. Not only would such scaling be controversial, we would still fail to code group membership.
We would be confusing qualitative measurement with quantitative measurement, in effect forcing an interval (or ratio, or ordinal) scale of measurement on a categorical scale inappropriately (recall the discussion of scales of measurement in chap. 4). The information conveyed by a qualitative as compared to a quantitative variable is of kind, not degree, and, as it turns out, requires G – 1 independent coded variables to represent the distinctions among G groups.
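The G - 1 rule can be checked mechanically. The following is a minimal sketch, not part of the original exercises: the helper name dummy_codes and the convention that the last group is the comparison group are our own assumptions for illustration. It builds the codes for the religious-background example and confirms that a fourth indicator for Other would carry no new information.

```python
def dummy_codes(group, n_groups):
    """Return the G - 1 dummy codes for a 1-based group index.

    Assumption for this sketch: the last group is the comparison
    group and therefore receives zeros on every variable.
    """
    return [1 if group == g else 0 for g in range(1, n_groups)]


# Labels taken from the religious-background example in the text.
labels = ["Catholic", "Jewish", "Protestant", "Other"]
codes = {label: dummy_codes(i, len(labels))
         for i, label in enumerate(labels, start=1)}

for label, row in codes.items():
    print(f"{label:<10} {row}")

# Three variables give four distinct patterns, one per group.
patterns = {tuple(row) for row in codes.values()}
assert len(patterns) == len(labels)

# A fourth indicator (1 = Other) is fully determined by the first
# three: it always equals 1 - (X1 + X2 + X3), so it is redundant.
for row in codes.values():
    x4 = 1 if sum(row) == 0 else 0
    assert x4 == 1 - sum(row)
```

Because the fourth column is an exact linear function of the other three (plus the constant), a regression routine given all four cannot find a unique solution, which is the source of the error message mentioned above.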
Imagine, for example, that the 16 subjects from the button-pushing study (see Fig. 11.5) were divided into the following four groups representing what we will call want/has children status:

1. Has no desire for children.
2. Has none but would like children.
3. Has one child.
4. Has more than one child.
Because the categorical variable of children status consists of four levels or four groups, it can be represented with a set of three dummy-coded predictor variables. If we wanted to know whether the mean numbers of button pushes for these four groups were significantly different from each other, we would perform an analysis of variance, first regressing the number of button pushes on these three dummy-coded variables taken together as a set, and then testing whether the computed R2 was significantly different from zero. The next exercise gives you the opportunity to do this analysis.

Exercise 12.1 A One-Way Analysis of Variance Using Dummy-Coded Variables

The template that results from this exercise allows you to analyze data from a single-factor between-subjects study, performing a one-way analysis of variance. The effect of a single factor, want/has children status, is analyzed. This factor is represented with four groups or levels and hence with three predictor variables. The data and many of the computations needed are already present in the spreadsheet shown in Fig. 11.7, which is modified for this exercise.

1. Modify the spreadsheet shown in Fig. 11.7 to accommodate a third predictor variable. The three predictor variables for this spreadsheet represent X1, X2, and X3, the dummy-coded variables for factor A, want/has children status. There is no need to compute the means or counts for coded variables, so these formulas can be erased.
2. Enter values for the dummy-coded variables. Assume that subjects 1-4, 5-8, 9-12, and 13-16 belong to the four groups listed in the previous paragraph. Guided by Fig. 12.1, enter the appropriate values for the predictor variables.
3. Regress the dependent variable, Y or number of button pushes, on the dummy-coded variables X1, X2, and X3.
4. Enter the correct formula for Y'. Remember that the exercise shown in Fig. 11.7 involved two predictor variables whereas the present exercise involves three predictor variables. How does this change the prediction equation? The degrees of freedom for the model sum of squares? For the error sum of squares?
5. What are the values for the predicted scores? Why are there only four different values for the predicted scores?
6. What is the value of F? Would you reject the null hypothesis at the .05 level of significance? Why or why not?
7. How would the values of R2 and F be changed if you had mistakenly entered 1 instead of 102 for subject 1? If there had been a 17th subject with no desire for children who pushed the button 200 times?
The summary statistics for this one-way ANOVA (one-way because only a single factor is involved) are provided by the spreadsheet just constructed. This spreadsheet is given in Fig. 12.2. From it we know that 66% of the variance in number of button pushes is accounted for by children status group. The critical value for F, for 3 and 12 degrees of freedom, alpha = .05, is 3.49. The obtained value was 7.63; consequently we can say there is a significant difference, p < .05, in the mean values for number of button pushes for the four groups. (The p < .05 means that the obtained F exceeded the alpha .05 critical value.) However, pending further analyses, we cannot yet say exactly which groups account for this significant difference or exactly how the groups differ.

      A      B      C     D     E     F      G       H       I
 1           #BPs                            y=      m=      e=
 2    s      Y      X1    X2    X3    Y'     Y-My    Y'-My   Y-Y'
 3    1      102    1     0     0     113    16      27      -11
 4    2      125    1     0     0     113    39      27      12
 5    3      95     1     0     0     113    9       27      -18
 6    4      130    1     0     0     113    44      27      17
 7    5      79     0     1     0     79     -7      -7      0
 8    6      93     0     1     0     79     7       -7      14
 9    7      75     0     1     0     79     -11     -7      -4
10    8      69     0     1     0     79     -17     -7      -10
11    9      43     0     0     1     65     -43     -21     -22
12    10     82     0     0     1     65     -4      -21     17
13    11     69     0     0     1     65     -17     -21     4
14    12     66     0     0     1     65     -20     -21     1
15    13     101    0     0     0     87     15      1       14
16    14     94     0     0     0     87     8       1       7
17    15     84     0     0     0     87     -2      1       -3
18    16     69     0     0     0     87     -17     1       -18
19    Sum=   1376   4     4     4     1376   0       0       0
20    N=     16     16    16    16    N=     16      16      16
21    Mean=  86     0.25  0.25  0.25  VAR=   464.9   305     159.9
22    a,b=   87     26    -8    -22   SD=    21.56
23    R=     0.81                     R2=    0.656

      J        K       L       M
 1             SStot   SSmod   SSerr
 2             y*y     m*m     e*e
19    SS=      7438    4880    2558
20    df=      15      3       12
21    MS=      495.9   1627    213.2
22    SD'=     22.27           14.6
23    R2adj=   0.57    F=      7.631

FIG. 12.2. Spreadsheet for determining the effect of want/has children status on number of button pushes using dummy-coded variables. Rows 3-18 for columns J-M are not shown.

Such a general or overall test is usually called an omnibus test (omnibus means "for all" in Latin). If we want to know whether groups differ, but lack specific predictions concerning exactly how they differ, then first we would perform an overall or omnibus test. Instead of regressing the dependent variable on each predictor in turn, we regress it on a set of predictor variables all at once, in a single step (as in Exercise 12.1), which is usually called a simultaneous regression. If the R2 change associated with this step is significant, then we can claim that at least some of the group means are different from others. We are then justified in probing differences among the group means further, a topic discussed under the heading of post hoc tests in the next chapter, and in reporting the means broken down separately for the different groups, as is done in Fig. 12.3 for the present example. However, if the omnibus test is not significant, then we should report only the mean for all subjects (pooled over all groups) and not report separate means for the different groups. After all, if they do not differ significantly then there usually is no point in reporting them. In the present case, if the F ratio had not been significant, we would simply report that subjects pushed the button 86 times on average.

When more than two groups are under consideration, omnibus F tests (which necessarily have more than one degree of freedom in the numerator) are the rule. Occasionally, however, researchers have a specific rationale for predicting an exact pattern of differences among groups.
Those predictions (each of which is associated with a particular predictor variable and hence one degree of freedom) can be tested directly, proceeding step by step. This matter is discussed in the section on planned comparisons presented in the next chapter.

An examination of the present spreadsheet (see Fig. 12.2), along with the means for the four groups (see Fig. 12.3), tells us something about how a one-way analysis of variance works with dummy-coded variables. Note that a, the regression constant, is 87, which is the mean for group 4, the comparison group (the group whose values for all three dummy-coded variables were zero). Note also that the regression coefficients represent deviations from the comparison group. The first variable is coded 1 for group 1; its regression coefficient is 26, which is 113 (the group 1 mean) minus 87 (the group 4 mean). Similarly, the second variable codes group 2; its regression coefficient is -8, which is 79 (the group 2 mean) minus 87. Likewise, the regression coefficient for the third variable is -22, or 65 minus 87. Also note that Y' or the predicted value for a score is always the mean of the group in which that score lies. This will be true no matter how categorical variables are coded or how many subjects are in each group, as subsequent exercises demonstrate. Thus the error scores for a one-way ANOVA represent deviations of observed scores from their group's mean (see the Y - Y' column). That is why error variance for a single-factor between-subjects study is often called error within groups.
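For readers who prefer code to a spreadsheet, the regression on dummy-coded variables can be sketched as follows. This is only an illustration under our own assumptions: the helper least_squares is a hypothetical stand-in for the spreadsheet's regression routine, solving the normal equations with exact fractions so that the coefficients can be compared directly with the values reported in Fig. 12.2.

```python
from fractions import Fraction

# Button-push scores from Fig. 12.2, four subjects per group.
Y = [102, 125, 95, 130, 79, 93, 75, 69, 43, 82, 69, 66, 101, 94, 84, 69]
groups = [g for g in (1, 2, 3, 4) for _ in range(4)]
# Dummy codes X1-X3; group 4 is the comparison group (all zeros).
X = [[1 if g == k else 0 for k in (1, 2, 3)] for g in groups]


def least_squares(X, Y):
    """Solve the normal equations exactly; returns [a, b1, b2, ...]."""
    rows = [[Fraction(1)] + [Fraction(v) for v in x] for x in X]
    p = len(rows[0])
    # Augmented matrix [A'A | A'Y], where A is X plus a constant column.
    M = [[sum(r[i] * r[j] for r in rows) for j in range(p)]
         + [sum(r[i] * y for r, y in zip(rows, Y))]
         for i in range(p)]
    # Gauss-Jordan elimination; redundant predictors leave no pivot
    # and make next() raise StopIteration.
    for c in range(p):
        pivot = next(r for r in range(c, p) if M[r][c] != 0)
        M[c], M[pivot] = M[pivot], M[c]
        pivot_value = M[c][c]
        M[c] = [v / pivot_value for v in M[c]]
        for r in range(p):
            if r != c:
                factor = M[r][c]
                M[r] = [v - factor * w for v, w in zip(M[r], M[c])]
    return [row[-1] for row in M]


coef = least_squares(X, Y)
a, b = coef[0], coef[1:]
Y_pred = [a + sum(bk * xk for bk, xk in zip(b, x)) for x in X]

mean_y = Fraction(sum(Y), len(Y))
ss_total = sum((y - mean_y) ** 2 for y in Y)
ss_model = sum((yp - mean_y) ** 2 for yp in Y_pred)
ss_error = sum((y - yp) ** 2 for y, yp in zip(Y, Y_pred))
df_model, df_error = len(b), len(Y) - 1 - len(b)
r2 = ss_model / ss_total
F = (ss_model / df_model) / (ss_error / df_error)

print("a, b =", [int(c) for c in coef])                # [87, 26, -8, -22]
print(f"R2 = {float(r2):.3f}, F = {float(F):.2f}")     # R2 = 0.656, F = 7.63
```

The constant equals the comparison-group mean and each coefficient equals a group mean minus that constant, so the predicted scores take only four distinct values, one per group.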
Want or have children                Number of pushes
Has no desire for children (ND)      113
Has none but would like (CO)         79
Has 1 child (C1)                     65
Has more than 1 child (C2)           87

FIG. 12.3. Mean number of button pushes, computed separately for each want/has children status group. Each group contains four subjects.

Contrast Coding

Contrast coding is a second method for representing categorical information. Like dummy variable coding, it reduces information about group membership to a series of binary distinctions, but the distinctions or contrasts are structured hierarchically, in a way that can be represented with a tree diagram. This method is particularly appropriate whenever a researcher has some a priori basis for arranging groups according to a series of yes-no questions or contrasts. In fact, contrast coding automatically provides for the analysis of what are usually called planned comparisons or a priori tests. (In Latin, a priori means "from the previous" and implies deduction from a hypothesis or cause.) Both planned comparisons and post hoc tests are described in chapter 13.

What a contrast is, and what constitutes a set of contrast-coded predictor variables, is perhaps best conveyed by example. Recall the four groups defined earlier in this chapter:

1. ND, has no desire for children.
2. CO, has none but would like.
3. C1, has 1 child.
4. C2, has more than 1 child.
One possible set of three contrasts could be:

1. ND and CO versus C1 and C2.
2. ND versus CO.
3. C1 versus C2.

In other words, first we would contrast subjects who have no children (ND and CO) with those that do (C1 and C2), then those that have no desire with those that do among the former (ND vs. CO), and finally those that have one and those that have more than one child among the latter (C1 vs. C2).

A second set of possible contrasts is:

1. ND versus CO, C1, and C2.
2. CO versus C1 and C2.
3. C1 versus C2.

In other words, first we would contrast subjects with no desire (ND) with those who desire children (CO, C1, and C2), then subjects who have no children (CO) with those that do (C1 and C2), and finally those that have one and those that have more than one child (C1 vs. C2).

Speaking generally, given G groups, several different sets of G - 1 contrasts are possible. For example, with three groups one set of contrasts is group 1 versus groups 2, 3 and group 2 versus 3; another set is groups 1, 2 versus group 3 and group 1 versus 2; yet another is group 2 versus groups 1, 3 and group 1 versus 3. As you can see, for a simple three-group analysis, several different sets of contrasts could be defined. But for any one analysis, the researcher must commit to just one set. For an analysis involving three groups, a set will consist of two contrasts. The general rule is, one contrast is allowed for each degree of freedom, just as with dummy-coded variables. More would be redundant because G - 1 contrasts exhaust information about group membership (assuming G groups).
Is there any reason to prefer contrast to dummy codes? It depends. If you have decided beforehand to compare all groups to an explicit comparison group, then use dummy-coded variables. If the increase in R2 associated with a particular predictor variable (when predictors are entered one at a time) is significant, then you can conclude that the mean for the group associated with that particular predictor variable is significantly different from the mean for the comparison group.

Alternatively, if you have decided for theoretical reasons to limit your investigation to G - 1 particular contrasts, then use contrast codes. With contrast codes, each predictor variable is associated with a particular contrast (or question, or planned comparison) and hence, if that predictor variable is associated with a significant increase in R2 (when added to the regression equation), you can conclude that the means for the two sets of groups defined by the contrast differ significantly (e.g., all subjects with no desire for children compared to all others, or all subjects who desire children but have none compared to all subjects with one or more children).

However, if you only want to know whether the dependent variable is affected by the research factor and so regress the dependent variable on the G - 1 predictor variables required to represent that categorical independent variable simultaneously, then it does not matter which coding scheme you choose for the categorical independent variable. You may use either dummy or contrast codes, and if you choose contrast codes, you may use any of the possible sets of G - 1 contrasts. The increase in the value of R2 and the value of the omnibus F test when the G - 1 predictor variables are entered in the equation will be the same no matter what kind or which set of coded predictor variables is used.
(A third way to code categorical variables is called effects coding—see Cohen & Cohen, 1983— but in the interest of simplicity, only dummy and contrast coding are described here.) Rules for forming contrast codes are relatively simple. It is helpful to begin by organizing a set of contrasts into hierarchic tree structures like those shown in Fig. 12.4. Consider our current four-group example. We might ask first whether a subject has children, second whether those who have no children want children, and third whether those who have children have one or more than one (see top of Fig. 12.4). In other words, the first contrast is between subjects who have no children and those who do, the second between those with no children who do and do not desire children, and the third between those who have one or more than one child. However, given a different set of theoretic concerns, we might instead ask first whether a subject wanted children, then if yes ask whether the subject had children, and again if yes whether the subject had more than one (see bottom of Fig. 12.4). Other sets of contrasts could also be defined, but no one set would be more correct than the other. The correct set is always the one that most faithfully reflects the theoretical concerns of the study. However, whichever set is selected, it is desirable to depict that set with a tree diagram. Properly constructed, this insures that the number of (binary or yes-no) questions asked and the degrees of freedom are the same. It also insures that the questions are independent of each other (or orthogonal) and not confounded. Again, we should emphasize that if the dependent variable is regressed on all G - 1 predictor variables in a single step, and if contrast codes instead of dummy codes are used, then the particular set of contrasts used does not matter. It only matters that the coded predictor variables in the set are orthogonal, that is, reflect a tree structure as exemplified in Fig. 12.4.
FIG. 12.4. Tree diagrams indicating two different ways of contrasting the four groups that represent the want/has children factor.

Each predictor variable in a set of orthogonal contrast codes is associated with a node of the tree diagram, which represents a particular contrast or question. Consider the set of contrasts at the top of Fig. 12.4. The first question asks whether a subject has children. Thus for the first predictor variable, subjects in the two groups without children will be assigned one code (i.e., one contrast coefficient) and subjects in the two groups with children, another code. The second question asks whether subjects with no children desire them. Thus for the second predictor variable, subjects with no desire would be assigned one code, subjects who desire children would be assigned another code, and subjects in groups not included in this contrast would be assigned a third code, usually zero. Similarly, for the third predictor variable subjects with one child would be assigned one code, subjects with more than one child another code, and subjects in the groups not included in this contrast would be assigned zero.
Rules for Forming Orthogonal Contrasts

Two rules guide the selection of codes used to represent orthogonal contrasts. First, the codes selected for each contrast must sum to zero groupwise. That is, codes are not summed across all subjects (because there might be uneven numbers of subjects per group) but instead only one code per group is included in the sum. An example should clarify. Consider the first contrast in the set at the top of Fig. 12.4, which is between subjects who have no children and subjects who do. If we assigned a contrast code of -2 to subjects in groups without children and a code of +2 to subjects in groups with children, then the coefficients for the four groups would be -2, -2, +2, and +2, which sum to zero as required.

There are an equal number of groups in each contrasting condition for the contrast just described (two groups without children, two with children); consequently any pair of equal numbers would sum to zero (e.g., -1, +1) and could be used as contrast coefficients. The coefficients -2 and +2 are suggested here because an appropriate contrast coefficient for the groups included in one contrasting condition is always the number of groups included in the other. This ensures that the codes selected for a contrast sum to zero as required. Other general rules for selecting contrast codes are possible, but this one has the merit of avoiding fractions, which many people find more confusing than integers.

As a second example of the contrasts-must-sum-to-zero rule, consider the first contrast in the set at the bottom of Fig. 12.4, which is between subjects who do and do not want children. If we assigned a contrast code of -3 to those who do not desire children and a code of +1 to subjects in the other three groups (who we are assuming want children), then the coefficients for the four groups would be -3, +1, +1, and +1, which again sum to zero as required.

Groups not included in a contrast are assigned a code of zero. For example, consider the second contrast in the set at the top of Fig. 12.4, which is between subjects without children who do and do not desire them. If we assigned a code of -1 to subjects who have no children and no desire for them and a code of +1 to subjects who have no children but who want them, then the coefficients for the four groups would be -1, +1, 0, and 0, which again sum to zero as required. As an exercise, you should now determine the codes for the remaining contrasts associated with the two tree diagrams in Fig. 12.4. Your codes should agree with the contrast codes given in Fig. 12.5.
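The coefficient rule just described (each side of a contrast receives the number of groups on the other side, with opposite signs, and uninvolved groups receive zero) is simple enough to express in a few lines. The sketch below is illustrative only; the function name contrast_codes and the use of 0-based group positions (ND = 0, CO = 1, C1 = 2, C2 = 3) are our own assumptions.

```python
def contrast_codes(first, second, n_groups):
    """Codes for one contrast between two sets of 0-based group positions.

    Groups in `first` get minus the number of groups in `second`;
    groups in `second` get plus the number of groups in `first`;
    groups in neither set get zero.
    """
    codes = [0] * n_groups
    for g in first:
        codes[g] = -len(second)
    for g in second:
        codes[g] = len(first)
    return codes


# Tree at the top of Fig. 12.4: has children? wants children? more than one?
set_1 = [
    contrast_codes([0, 1], [2, 3], 4),   # ND, CO versus C1, C2
    contrast_codes([0], [1], 4),         # ND versus CO
    contrast_codes([2], [3], 4),         # C1 versus C2
]

# Tree at the bottom of Fig. 12.4: wants children? has children? more than one?
set_2 = [
    contrast_codes([0], [1, 2, 3], 4),   # ND versus CO, C1, C2
    contrast_codes([1], [2, 3], 4),      # CO versus C1, C2
    contrast_codes([2], [3], 4),         # C1 versus C2
]

for name, codes in (("Set I", set_1), ("Set II", set_2)):
    print(name, codes)
    # The rule guarantees that every contrast sums to zero across groups.
    assert all(sum(c) == 0 for c in codes)
```

The sum is zero by construction: a contrast with p groups on one side and q on the other contributes p times -q plus q times +p.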
Orthogonal contrast codes are also constrained by a second rule, specifically that the cross products for all possible pairs of contrasts must sum to zero groupwise. This ensures that the contrasts will represent independent comparisons. (Technically it means that the correlations between pairs of contrast coefficients will be zero, considered groupwise.) For example, consider the group cross products between variables X1 and X2 in Set I of Fig. 12.5. The cross products for the four groups are +2, -2, 0, and 0, which sum to zero as required:

-2 x -1 = +2
-2 x +1 = -2
+2 x  0 =  0
+2 x  0 =  0
                                     Coded Variables
Group                                X1     X2     X3
Set I
  Has no desire for children         -2     -1      0
  Has none but would like            -2     +1      0
  Has 1 child                        +2      0     -1
  Has more than 1 child              +2      0     +1
Set II
  Has no desire for children         -3      0      0
  Has none but would like            +1     -2      0
  Has 1 child                        +1     +1     -1
  Has more than 1 child              +1     +1     +1
FIG. 12.5. Two sets of contrast codes for the want/has children factor.
In the next exercise, you are asked to verify for yourself that the cross products for the other possible pairs of contrasts shown in Fig. 12.5 sum to zero as the second rule for forming contrast codes requires. (See Fig. 12.6 for a restatement of the two rules for forming contrast codes.)

One final point regarding contrast codes should be mentioned. When defining a set of orthogonal contrast codes, it is important to keep in mind the distinction between individuals and the groups to which they belong. Ultimately the codes or contrast coefficients determine the values assigned to the predictor variables for individuals within groups (see Figs. 12.7 and 12.8). However, first the contrast coefficients are determined for the groups, no matter whether or not the numbers of subjects in the various groups are equal. It is these group-level contrast coefficients that must sum to zero and whose cross products must sum to zero. In other words, a set of contrast codes is first determined for groups and then applied to individuals in those groups.

Exercise 12.2 Applying Rules for Forming a Set of Contrast Coefficients

This exercise provides practice in determining whether contrast coefficients and their cross products sum to zero as required.

1. For the contrast coefficients in Set I, Fig. 12.5, verify that the cross products for X1 and X3 and for X2 and X3 sum to zero.
2. For the contrast coefficients in Set II, Fig. 12.5, verify that the coefficients for each of the three contrasts sum to zero. Also verify that the three possible cross products sum to zero.
3. Label the 4th group, 2 children, and add a 5th group, 3 or more children. Draw a tree diagram showing one set of contrasts that could be used for such a study. Indicate the contrast coefficients that would be used and verify that these coefficients and their cross products all sum to zero. Now draw a tree diagram for a second set of contrasts and indicate the contrast coefficients that you would use for this second set.
4. Describe a study with six groups. Name the groups and provide a rationale for a particular set of contrasts. Indicate the contrast codes you would use and verify that they and their cross products sum to zero.

It is important to know how to form contrast coefficients correctly. They can be used to analyze planned comparisons, as described in the next chapter, and they can be used instead of dummy-coded variables to perform one-way analyses of variance (and other, more complex analyses, as you will see in subsequent chapters). Dummy-coded variables were used in Exercise 12.1 to analyze the number of button pushes. In the next exercise, you will use contrast codes to perform a similar analysis.

Rules for forming orthogonal contrast codes

Rule 1: The codes selected for each contrast must sum to zero across groups.
Rule 2: The cross products for all possible pairs of contrasts must sum to zero across groups.
FIG. 12.6. Rules for forming orthogonal contrast codes.
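The two rules lend themselves to a mechanical check. The following sketch is one way to do it, with a helper name (is_orthogonal_set) of our own choosing; it tests the group-level coefficients from Fig. 12.5 and, for comparison, a set of dummy codes, which are a legitimate coding scheme but do not satisfy the rules for orthogonal contrasts.

```python
from itertools import combinations

# Group-level coefficients from Fig. 12.5 (group order: ND, CO, C1, C2).
SET_I = [[-2, -2, 2, 2], [-1, 1, 0, 0], [0, 0, -1, 1]]
SET_II = [[-3, 1, 1, 1], [0, -2, 1, 1], [0, 0, -1, 1]]
# Dummy codes for the same four groups, included as a counterexample.
DUMMY = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]]


def is_orthogonal_set(contrasts):
    """Apply Rule 1 and Rule 2 to group-level contrast coefficients."""
    # Rule 1: each contrast sums to zero across groups.
    rule_1 = all(sum(c) == 0 for c in contrasts)
    # Rule 2: cross products of every pair sum to zero across groups.
    rule_2 = all(sum(x * y for x, y in zip(c1, c2)) == 0
                 for c1, c2 in combinations(contrasts, 2))
    return rule_1 and rule_2


for name, contrasts in (("Set I", SET_I), ("Set II", SET_II), ("Dummy", DUMMY)):
    print(f"{name:<7} orthogonal contrasts? {is_orthogonal_set(contrasts)}")
```

The checks are applied to one code per group, not one code per subject, in keeping with the point made above that contrast coefficients are determined at the group level first.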
Exercise 12.3 A One-Way Analysis of Variance Using Contrast Codes

This exercise uses the template developed for Exercise 12.1. Again, the effect of a single factor, want/has children status, on number of button pushes is analyzed, but this time, instead of dummy-coded variables, the two sets of contrast codes given in Fig. 12.5 are used as predictor variables.

1. Begin with the spreadsheet shown in Fig. 12.2. Replace the dummy-coded variables used there with the contrast codes for set I, Fig. 12.5. Remember that subjects 1-4 belong to the first group, subjects 5-8 to the second, and so forth.
2. Regress the number of button pushes on the three predictor variables that code for group membership. What proportion of variance is accounted for? What is its associated F ratio?
3. Next replace the contrast codes with those shown for set II, Fig. 12.5. Again regress the number of button pushes on these three new predictor variables. What proportion of variance is accounted for? What is its associated F ratio?
4. Why are the predicted values the same no matter which set of contrast codes is used?
5. If there had been a 17th no-desire (ND) subject who pushed the button 200 times, what proportion of variance would be accounted for? What is its associated F ratio?

After completing the last exercise, your spreadsheets for the two sets of contrast codes should look like those shown in Figs. 12.7 and 12.8. For both the last exercise and for Exercise 12.1 earlier in the chapter, you determined that 66% of the variance in number of button pushes was accounted for by group membership. No matter whether you used dummy-coded variables or contrast codes to represent group membership, and no matter which set of contrast codes you used, SS, R, R2, and the F ratio were the same. We concluded that the difference among the means for these four groups is significant at the .01 level. In the next section, we show how information from these spreadsheets can be organized in an ANOVA source table. And in the next chapter, we show how, if the predictor variables had been added one at a time instead of all at once, information from the separate steps can be used to analyze planned comparisons.

      A      B      C     D     E
 1           #BPs
 2    s      Y      X1    X2    X3
 3    1      102    -2    -1    0
 4    2      125    -2    -1    0
 5    3      95     -2    -1    0
 6    4      130    -2    -1    0
 7    5      79     -2    1     0
 8    6      93     -2    1     0
 9    7      75     -2    1     0
10    8      69     -2    1     0
11    9      43     2     0     -1
12    10     82     2     0     -1
13    11     69     2     0     -1
14    12     66     2     0     -1
15    13     101    2     0     1
16    14     94     2     0     1
17    15     84     2     0     1
18    16     69     2     0     1
19    Sum=   1376   0     0     0
20    N=     16     16    16    16
21    Mean=  86     0     0     0
22    a,b=   86     -5    -17   11
23    R=     0.81

FIG. 12.7. Spreadsheet for determining the effect of want/has children status on number of button pushes using contrast code set I. Columns F-M are the same as columns F-M in Fig. 12.2, so are not repeated here.
      A      B      C     D     E
 1           #BPs
 2    s      Y      X1    X2    X3
 3    1      102    -3    0     0
 4    2      125    -3    0     0
 5    3      95     -3    0     0
 6    4      130    -3    0     0
 7    5      79     1     -2    0
 8    6      93     1     -2    0
 9    7      75     1     -2    0
10    8      69     1     -2    0
11    9      43     1     1     -1
12    10     82     1     1     -1
13    11     69     1     1     -1
14    12     66     1     1     -1
15    13     101    1     1     1
16    14     94     1     1     1
17    15     84     1     1     1
18    16     69     1     1     1
19    Sum=   1376   0     0     0
20    N=     16     16    16    16
21    Mean=  86     0     0     0
22    a,b=   86     -9    -1    11
23    R=     0.81
FIG. 12.8. Spreadsheet for determining the effect of want/has children status on number of button pushes using contrast code set II. Columns F-M are the same as columns F-M in Fig. 12.2, so are not repeated here.
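One way to see why both contrast sets yield the same omnibus result is to work from the group means alone. The sketch below rests on assumptions that hold for this example but not in general: equal numbers of subjects per group and orthogonal codes. Under those conditions each regression coefficient is the sum of code-times-mean divided by the sum of squared codes, and each contrast accounts for its own separate share of the model sum of squares. This shortcut is not the spreadsheet procedure used in the exercises; it is offered only as a cross-check on the coefficients in Figs. 12.7 and 12.8.

```python
from fractions import Fraction

means = [113, 79, 65, 87]   # group means from Fig. 12.3 (ND, CO, C1, C2)
n = 4                       # subjects per group (equal n is assumed)

SETS = {
    "I": [[-2, -2, 2, 2], [-1, 1, 0, 0], [0, 0, -1, 1]],
    "II": [[-3, 1, 1, 1], [0, -2, 1, 1], [0, 0, -1, 1]],
}


def contrast_stats(codes):
    """Return (b, SS) for one contrast, given equal n and orthogonal codes."""
    psi = sum(c * m for c, m in zip(codes, means))   # weighted sum of means
    norm = sum(c * c for c in codes)                 # sum of squared codes
    return Fraction(psi, norm), Fraction(n * psi * psi, norm)


for name, contrasts in SETS.items():
    stats = [contrast_stats(c) for c in contrasts]
    b = [int(s[0]) for s in stats]
    ss = [int(s[1]) for s in stats]
    print(f"Set {name}: b = {b}, SS per contrast = {ss}, SS model = {sum(ss)}")
# Set I: b = [-5, -17, 11], SS per contrast = [1600, 2312, 968], SS model = 4880
# Set II: b = [-9, -1, 11], SS per contrast = [3888, 24, 968], SS model = 4880
```

The two sets divide the model sum of squares differently, but the pieces always add up to the same 4880, which is why R2 and the omnibus F do not depend on the set chosen.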
12.2
ONE-WAY ANALYSIS OF VARIANCE

For a one-way analysis of variance, the G groups must be identified with a unique set of values for G - 1 predictor variables. Regressing the dependent measure on the coded predictor variables yields an R2. If this R2 is significantly different from zero, then we say that there is a significant effect of group, that the group means differ in some way. No matter whether dummy-coded variables, contrast codes, or some other scheme is used to distinguish the groups, the various statistics (sums of squares, mean squares, F ratios, and R2s) are identical, as demonstrated by the analyses shown in Figs. 12.2, 12.7, and 12.8. In all three cases (dummy-coded variables, contrast code set I, contrast code set II) the value of the F ratio was 7.63. Only the parameter values appear affected by the way the predictor variables are coded, a matter discussed in the next chapter.

Traditionally, results of an analysis of variance are organized into an ANOVA source table, so called because it lists the sources of variability examined and notes the portions of variance associated with each (see Fig. 12.9). A typical ANOVA source table gives, first, the sources of variance. For the present example, total variance is divided into a between-groups component and a within-groups component. The between-group (or treatment, or model) sum of squares, the within-group (or error, or residual) sum of squares, and the total sum of squares are computed by the spreadsheet. For the present example, with its four groups and three predictor variables, Y predicted is computed as a + b1X1 + b2X2 + b3X3. Note that the sums of squares for the model and for error add up to the total sum of squares (4880 + 2558 = 7438), as they should. To complete the ANOVA source table, sums of squares are divided by their associated degrees of freedom, which gives mean squares (variance estimates) for the various effects.
Finally, the significance of the overall (or omnibus) group effect is evaluated with the appropriate F ratio, dividing the between-group mean square by the mean square for error (i.e., the within-group mean square). As before, the df model is the number of predictor variables (which in this case is 3) and the df error is N minus 1 minus the number of predictor variables (which in this case is 16 - 1 - 3 = 12). General formulas for the degrees of freedom for a one-way analysis of variance are given in Fig. 12.10.
1. The total degrees of freedom between subjects is N - 1, the number of subjects minus one. You can regard that one degree of freedom as consumed by the grand mean.
2. The degrees of freedom for the between-groups effect (the A main effect) is a - 1, where a symbolizes the number of levels or groups for the between-subjects factor A. This is the same as G - 1 (the number of groups minus one or the number of predictor variables) used earlier in this chapter.
3. The degrees of freedom within groups, or the degrees of freedom for error, is N - a, and is symbolized here as S/G or S/A, which is read "subjects within groups" or "subjects within A."
Source                    SS     df    MS     F
Between groups            4880    3    1627   7.63
Within groups             2558   12     213
TOTAL between subjects    7438   15
FIG. 12.9. Analysis of variance source table for contrast code Set I.
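The arithmetic behind the source table can be sketched in a few lines of Python. This is only an illustration: the function name and dictionary keys are hypothetical, and the inputs are the sums of squares, number of subjects, and number of groups reported for the button-pushing example.

```python
def anova_source_table(ss_between, ss_within, n_subjects, n_groups):
    """Complete a one-way ANOVA source table from its sums of squares."""
    df_between = n_groups - 1            # a - 1
    df_within = n_subjects - n_groups    # N - a
    ms_between = ss_between / df_between
    ms_within = ss_within / df_within
    return {
        "df_between": df_between,
        "df_within": df_within,
        "df_total": df_between + df_within,   # should equal N - 1
        "ms_between": ms_between,
        "ms_within": ms_within,
        "F": ms_between / ms_within,
        "R2": ss_between / (ss_between + ss_within),
    }


# Values taken from the source table for the button-pushing example.
table = anova_source_table(ss_between=4880, ss_within=2558,
                           n_subjects=16, n_groups=4)
for name, value in table.items():
    print(f"{name:>10}: {value:.3f}")
```

Rounded, the mean squares and F ratio agree with the entries in Fig. 12.9.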
Source                    Degrees of Freedom
A main effect             a - 1
S/A, subjects within A    N - a
TOTAL between subjects    N - 1
FIG. 12.10. Degrees of freedom for a one-way analysis of variance. The number of levels for between-subjects factor A is symbolized with a and the number of subjects is symbolized with N.
There are two ways to explain algebraically why the degrees of freedom for error is N - a. First, recall that
$$df_{total} = df_{between} + df_{error}$$
Therefore,
$$df_{error} = df_{total} - df_{between} = (N - 1) - (a - 1) = N - 1 - a + 1 = N - a$$
In addition, recall that for each of the a groups, one degree of freedom is lost within groups estimating the group mean. Assuming equal number of subjects per group,
$$df_{error} = a(N/a - 1) = N - a$$
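As a quick check, both routes can be worked through with the numbers of the running example (N = 16 subjects, a = 4 groups), under the equal-group-size assumption just stated:

$$df_{error} = (N - 1) - (a - 1) = 15 - 3 = 12, \qquad df_{error} = a(N/a - 1) = 4(16/4 - 1) = 4 \times 3 = 12$$

Both give the 12 error degrees of freedom used for the omnibus F ratio.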
Within each group there are N/a subjects and N/a - 1 degrees of freedom (one consumed by the group mean). The degrees of freedom for error pooled over all groups is the N/a - 1 for each group multiplied by a, the number of groups. After routine algebraic manipulation, this yields N - a.
Exercise 12.4 A One-Way ANOVA: Effect of Sibling Status on Number of Words
The purpose of this exercise is to provide additional practice in performing a one-way analysis of variance.
1. Recall the example study that examined the effect of number of older siblings on number of words infants spoke. In Exercise 9.6 you determined that 32.6% of the variance in number of words spoken at 18 months of age could be accounted for by knowing whether infants had just one or more older siblings, and in Exercise 10.5 you determined that this R2 was significant at the .01 level. The previous analysis was, in effect, a one-way analysis of variance with two groups. For this exercise, categorize infants in three rather than two groups: those with no older siblings, those with just one older sibling, and those with more than one older sibling.
2. Define and use contrast codes. The first predictor variable contrasts infants who have no older siblings with those who have one or more older siblings, and the second contrasts infants who have just one older sibling with those who have more than one older sibling.
3. What proportion of variance is accounted for by knowing to which of these three groups infants belong? Is the effect of group membership statistically significant?
Your spreadsheet for the previous exercise should look like the one shown in Fig. 12.11. Three facts are worth noting. First, for an analysis of variance, the numbers of subjects in the various groups need not be equal. In this case, there were seven, four, and five subjects in the three groups, respectively. Second, the order of the subjects in the spreadsheet is immaterial. Often it is convenient to place subjects who are in the same group together, and in this case all subjects who had no older siblings were in fact listed together. But it is not necessary to group subjects, and in this case subjects in the second and third groups were mingled together. Values for the predictor variables must accurately reflect the group to which a subject belongs, but that is all. Finally, recall that contrast codes are required to sum to zero at the group, but not at the individual, level. In fact, when the number of subjects per group is unequal, contrast codes will not sum to zero at the individual level, as the numbers in Fig. 12.11 demonstrate.
Exercise 12.5 Analyzing a Single-Factor Between-Subjects Study in SPSS
In this exercise you will analyze the button-pushing study using the four groups defined in Fig. 12.3. In SPSS you could analyze these data using the Regression procedure and dummy or contrast codes. If using dummy codes, you would run a regression and enter all of the coded vectors, X1 through X3, in a single block, and then examine the output to determine if the model containing all three predictor variables is significant. If you want to test individual contrast codes, you would enter each of the coded predictor variables in separate blocks and check the R squared change option under Statistics. The resulting hierarchical analysis (see Exercise 11.7) will allow you to determine the significance of each contrast. Finally, as presented in this exercise, you could use the One-Way ANOVA procedure to determine if group is a significant predictor of number of pushes.
1. Invoke SPSS. Create a variable called Pushes and enter, or cut and paste, the data from Fig. 12.2.
2. When using the One-Way ANOVA procedure, you do not need to use dummy or contrast coding. Instead, create one variable called group and enter 0s for all the cases in the ND group, 1s for the C0 group, 2s for the C1 group, and 3s for the C2 group. The output will be more readable if you create value labels for each of these groups.
3. Select Analyze->Compare Means->One-Way ANOVA from the main menu. Enter pushes in the Dependent List window and group in the Factor window. Click on Options and check the Descriptive and Homogeneity of Variance Test boxes. Click Continue and OK.
4. Examine the Descriptives output and confirm that the Ns, means, and standard deviations are correct. Next look at Levene's test. Is the assumption of equal variances met? Finally look at the output in the ANOVA table. Do the sums of squares and df agree with values you calculated using the spreadsheets? Based on the F and significance values, do you reject the null hypothesis that the number of button pushes in each group is equal?
12.3 TREND ANALYSIS
As you have seen in this chapter, with G groups, you can define G - 1 coded variables or contrasts. The examples we have given of contrast codes (e.g., Fig. 12.4) ordered the groups in a hierarchic or tree structure. But there is another
 s   #words   #sibs   0vs>0   1vs>1
       Y                X1      X2      Y'     y=Y-My   m=Y'-My   e=Y-Y'
 1     32       0       -2       0     34.71    1.813    4.527    -2.71
 2     27       0       -2       0     34.71   -3.19     4.527    -7.71
 3     48       0       -2       0     34.71   17.81     4.527    13.29
 4     34       0       -2       0     34.71    3.813    4.527    -0.71
 5     33       0       -2       0     34.71    2.813    4.527    -1.71
 6     30       0       -2       0     34.71   -0.19     4.527    -4.71
 7     39       0       -2       0     34.71    8.813    4.527     4.286
 8     23       2        1       1     23.4    -7.19    -6.79     -0.4
 9     24       1        1      -1     30.75   -6.19     0.563    -6.75
10     25       4        1       1     23.4    -5.19    -6.79      1.6
11     36       1        1      -1     30.75    5.813    0.563     5.25
12     31       1        1      -1     30.75    0.813    0.563     0.25
13     19       3        1       1     23.4   -11.2     -6.79     -4.4
14     28       2        1       1     23.4    -2.19    -6.79      4.6
15     32       1        1      -1     30.75    1.813    0.563     1.25
16     22       5        1       1     23.4    -8.19    -6.79     -1.4
Sum=  483      20       -5       1
N=     16      16       16      16
Mean= 30.19    1.25    -0.31    0.063
a,b=  29.62            -2.55   -3.68
        SStot    SSmod    SSerr
SS=     782.4    375.1    407.4
df=     15       2        13
MS=     52.16    187.5    31.34
F=               5.984
VAR= 48.9   SD= 6.993   SD'= 7.222
R2= 0.479   R= 0.692   adjusted R2= 0.399
FIG. 12.11. Spreadsheet for determining the effect of sibling group (no older siblings, one older sibling, more than one older sibling) on number of words spoken at 18 months of age.
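For readers who want to check the spreadsheet outside a spreadsheet program, here is a minimal Python sketch of the same one-way analysis. It is an illustration rather than the book's method of computation: it uses the fact that, whatever codes are chosen, the predicted score for each subject is that subject's group mean. The words and older-sibling counts are those listed in Fig. 12.11; the variable and function names are hypothetical.

```python
# Number of words (Y) and number of older siblings for the 16 infants.
words = [32, 27, 48, 34, 33, 30, 39, 23, 24, 25, 36, 31, 19, 28, 32, 22]
sibs = [0, 0, 0, 0, 0, 0, 0, 2, 1, 4, 1, 1, 3, 2, 1, 5]


def sibling_group(n_sibs):
    """Classify an infant as having no, one, or more than one older sibling."""
    if n_sibs == 0:
        return "none"
    return "one" if n_sibs == 1 else "more"


# Collect the scores for each group; order of subjects does not matter.
groups = {}
for y, s in zip(words, sibs):
    groups.setdefault(sibling_group(s), []).append(y)

grand_mean = sum(words) / len(words)
ss_total = sum((y - grand_mean) ** 2 for y in words)
# Model SS: squared deviation of each group mean from the grand mean,
# weighted by group size (group sizes need not be equal).
ss_model = sum(len(ys) * (sum(ys) / len(ys) - grand_mean) ** 2
               for ys in groups.values())
ss_error = ss_total - ss_model

df_model = len(groups) - 1            # a - 1
df_error = len(words) - len(groups)   # N - a
f_ratio = (ss_model / df_model) / (ss_error / df_error)
r2 = ss_model / ss_total

print({name: len(ys) for name, ys in groups.items()})
print(f"SSmodel={ss_model:.1f} SSerror={ss_error:.1f} SStotal={ss_total:.1f}")
print(f"R2={r2:.3f}  F({df_model},{df_error})={f_ratio:.3f}")
```

The sums of squares, R2, and F produced this way should match the summary values in the figure to rounding.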
way to define contrast codes that turns out to be very useful. Contrast codes can also be used to define trends. Often groups are ordered in some way. For example, imagine you have assembled three groups of children and that the three groups comprise children who are within a month of their 3rd, 4th, and 5th birthdays, respectively. You have collected motor dexterity scores for the children and, not surprisingly, you expect older children to score higher. You could form contrast codes that compare, first, 3-year-olds with the combined group of 4- and 5-year-olds and, second, 4- with 5-year-olds. But the more relevant question may be, of the variance accounted for by group membership, what proportion is due to the linear trend of the group means and what proportion to the remaining variance. Variables that code the various trends possible with G groups let you do just that. With G groups, you can define G - 1 trends, each one of which corresponds to a coded predictor variable. With three groups, you can define linear (i.e., a straight line) and quadratic (i.e., a parabola or a curve with one bend) trends. With four groups you can define linear, quadratic, and cubic (an S-shaped curve with two bends) trends. With five groups you can define linear, quadratic, cubic, and quartic (a W-shaped curve with three bends) trends, and so forth. The values for the predictor variables that code for these trends are called orthogonal polynomials, and advanced texts often have tables of them for various numbers of groups (e.g., Kirk, 1982). Values for three, four, and five groups are given in Fig. 12.12. Orthogonal polynomials obey the rules we have already defined for orthogonal contrast codes generally (Fig. 12.6). You should now verify for yourself that the values given in Fig. 12.12 obey these rules.
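The verification suggested above can also be done mechanically. The sketch below is illustrative only. It assumes the rules amount to two checks at the group level (each set of codes sums to zero, and the cross-products of every pair of sets sum to zero), and it uses the orthogonal polynomial coefficients commonly tabled for three, four, and five equally spaced groups, which are assumed here to be the values shown in Fig. 12.12. The dictionary and function names are hypothetical.

```python
from itertools import combinations

# Commonly tabled orthogonal polynomial coefficients, one value per group.
# Assumed to match Fig. 12.12, which is not reproduced here.
ORTHOGONAL_POLYNOMIALS = {
    3: {"linear": [-1, 0, 1],
        "quadratic": [1, -2, 1]},
    4: {"linear": [-3, -1, 1, 3],
        "quadratic": [1, -1, -1, 1],
        "cubic": [-1, 3, -3, 1]},
    5: {"linear": [-2, -1, 0, 1, 2],
        "quadratic": [2, -1, -2, -1, 2],
        "cubic": [-1, 2, 0, -2, 1],
        "quartic": [1, -4, 6, -4, 1]},
}


def obeys_contrast_rules(code_sets):
    """Check that each code set sums to zero and all pairs are orthogonal."""
    sums_to_zero = all(sum(codes) == 0 for codes in code_sets.values())
    orthogonal = all(
        sum(x * y for x, y in zip(first, second)) == 0
        for first, second in combinations(code_sets.values(), 2)
    )
    return sums_to_zero and orthogonal


for n_groups, code_sets in ORTHOGONAL_POLYNOMIALS.items():
    print(n_groups, "groups:", len(code_sets), "trends, rules obeyed:",
          obeys_contrast_rules(code_sets))
```

Note that each set of G groups yields exactly G - 1 trends, as stated in the text.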
If you use orthogonal polynomials for your contrast codes instead of some other set of contrast codes, you will still account for exactly the same amount of variance: The value of all of your summary statistics (R2, F ratio, etc.) will be the same (as they would be if you used any other set of coded variables including dummy codes). However, if you use trend contrasts (i.e., orthogonal polynomials), then, with three groups, you can say what proportion of variance was accounted for by a linear and what proportion by a quadratic trend; with four groups, what proportion of variance was accounted for by a linear trend, what proportion by a quadratic trend, and what proportion by a cubic trend; and so forth. When research hypotheses predict that most of the variance should be accounted for by a particular trend (often a linear trend), this can be very useful.
Exercise 12.6 A Trend Analysis: Effect of Sibling Status on Number of Words
The purpose of this exercise is to modify the last exercise for a trend analysis.
1. Replace values for the contrast codes used in the last exercise with values for a linear and a quadratic contrast. For the linear trend, code subjects with no sibs -1, those with one sib 0, and those with more than one sib +1. For the quadratic trend, code subjects with no sibs +1, those with one sib -2, and those with more than one sib +1.
2. Verify that all summary statistics are the same as for Exercise 12.4.
3. What proportion of variance is accounted for by knowing only the linear trend? How much additional variance is accounted for by the quadratic trend?
FIG. 12.12. Values for coded variables for a trend analysis comprising three, four or five groups. Two subjects per group are shown, although usually groups would have more subjects.
In this chapter you have learned how to analyze data from single-factor between-subjects studies. The single factor defines levels or groups, and the statistical question is, do the means for the various groups differ significantly? In other words, is the proportion of variance accounted for when the dependent variable is regressed on coded variables representing group membership (the research factor) significantly different from zero? Can significantly better predictions for the value of the criterion or dependent measure be made when subjects' group membership is taken into account? Most succinctly, does group membership matter? If the answer is yes, that is, if the F ratio is statistically significant, the next question is, exactly how do the means for the groups differ? The next chapter addresses this and related questions concerning how results of analyses like those presented in this chapter should be described.
13
Planned Comparisons, Post Hoc Tests, and Adjusted Means
In this chapter you will: 1. Learn how to use contrast codes to analyze the significance of any group comparisons you have planned prior to analysis. 2. Learn how to use a post hoc test to determine, after the fact, exactly which group means differ significantly from each other, given that an omnibus F test was significant. 3. Learn how to adjust group means when reporting results, given that an analysis of covariance yielded significant results. 4. Learn how to test whether the assumption of homogeneity of regression, which is required for an analysis of covariance, is warranted. Analyzing variance, or determining which factors account for significant proportions of criterion variance, whether performing a one-way analysis of variance as in the last chapter or an analysis of covariance as in chapter 11, is only half of the data analytic task. The task is not complete until the specific results have been described. What should be described, and how it should be described, is the topic of this chapter. First we consider one-way ANOVAs. As mentioned earlier, if the test for a factor is not significant, only the grand mean for the dependent variable need be given. If the test is significant, however, then the means for the different groups analyzed should be presented. If there are only two groups, no further analysis is required. We know that the means for those two groups differ significantly. However, if there are more than two groups—and planned comparisons or trend analyses were not defined—then post hoc tests should be performed in order to explicate the exact nature of the differences among the groups. Post hoc tests (tests performed after overall significance has been established) examine each possible pair of group means and tell you which pairs of means are significantly different from each other. 
You might ask, why bother with this combination of an omnibus F test (regressing the dependent variable on a set of G - 1 predictor variables that represents the categorical variable) followed by post hoc tests? Why not proceed directly to pairwise tests instead? For each test, you would select a different pair of groups and would then determine whether the single coded variable required to represent the distinction between those two groups accounted for a significant proportion of variance. For example, given four groups, six separate tests would be required to represent the possible pairwise comparisons: 1 versus 2, 1 versus 3, 1 versus 4, 2 versus 3, 2 versus 4, and 3 versus 4. In other words, there are six combinations of four things taken two at a time. The information provided by this pairwise approach seems to be the same as that provided by post hoc tests; that is, we find out whether or not the means in each pair are significantly different from each other. Pairwise testing, however, is not recommended; in fact, it is usually roundly condemned. An omnibus test represents a single test with a set alpha level. The exhaustive pairwise approach multiplies the number of separate tests of significance, which increases the probability of Type I error (the exact amount of the increase is defined in the section on post hoc tests later in this chapter). Thus the reason we perform an omnibus F test first, and only proceed to post hoc tests if the omnibus test is significant, is to provide ourselves with some protection from Type I error.
There is an alternative to the omnibus followed by post hoc test strategy. (Remember, the present discussion applies to one-way analyses of variance involving more than two groups.) In the last chapter we mentioned briefly that there are two ways an analysis of group effects can proceed. Although it is relatively uncommon to do so, the investigator may commit beforehand to particular contrasts or a priori tests. As described in the previous chapter, as many contrasts as there are degrees of freedom are permitted, and each contrast is represented with a coded predictor variable.
In this case, each predictor variable is added to the regression equation one step at a time, and the increase in R2 associated with each step is tested for significance. In this way, the investigator tests the significance of each (single degree of freedom) contrast directly, instead of first asking whether all G - 1 coded predictor variables together account for a significant proportion of variance. The contrasts of a planned comparison approach are more powerful (i.e., more likely to be statistically significant) than the comparable post hoc comparisons, which is the main advantage of this approach. However, the number of comparisons is limited to G - 1 and they must be specified beforehand, which is the main disadvantage. Given four groups, for example, only three planned comparisons are permitted, but post hoc tests could examine difference between all six possible pairs of groups. Typically investigators do not commit to particular contrasts before analyzing their data but first perform an omnibus F test as described in the last chapter. Contrast codes can still be used for the G - 1 predictor variables, of course—and in this book, often will be—but the dependent variable is regressed on all of them in one step. Thus an omnibus F test necessarily involves more than one predictor variable. If the R2 for the final model (e.g., the one including all three of the predictor variables that code group membership given four groups) is significant, you would conclude that somehow the group means differ among themselves. But in order to determine exactly how they differ, you would then perform what are called post hoc tests, or tests performed after overall significance has been
established. One common post hoc test, the Tukey test, is described later in this chapter. Given a significant group effect, you would then present group means separately, as we mentioned earlier. Normally, these would be the simple arithmetic means for the groups. However, imagine that variables representing two research factors are involved, the first is a covariate, as discussed in chapter 11 (like subject's age), and the second is a group membership variable (like children status). In this case, you would perform the analysis in two steps. The first would test for the significance of the covariate and the second for the significance of the group membership variable. Note that, depending on the number of groups, the group membership variable might be represented with one or more coded predictor variables. If the group effect were significant (determined by testing the significance of the R2 change for step 2), it would mean that group membership affects the dependent variable even when the covariate is taken into account. In this case you would report not the simple group means, but the group means adjusted for the effect of the covariate. At the end of this chapter we show how such an adjustment is effected.
13.1 ORGANIZING STEPWISE STATISTICS
In chapter 11 we showed how results of a hierarchic multiple-regression analysis can be portrayed. In this chapter we discuss hierarchic analyses further and show how, for an analysis of planned comparisons, each step represents a different contrast. And in the next chapter we will see that the hierarchic approach has even broader application still, that successive steps may each represent a different research factor. In chapter 12 the number of button pushes for 16 subjects (data we have been using as a running example) was regressed on different sets of predictor variables. For Exercise 12.3, you were asked to perform an omnibus F test, regressing number of button pushes on two different sets of three predictor variables. But you could have proceeded step by step, determining the significance of the increase in R2 associated with each predictor variable as it was added to the regression equation in turn. If you had proceeded step by step, the results after the third step would be identical with those shown in Figs. 12.7 and 12.8; that is, you would account for 66% of the variance with three predictor variables. But the results after the first and second step, and comparisons of each step with the previous step, would provide us with information about the three predictor variables used. You already know how to organize the results of such an analysis in a step-by-step table (see Fig. 11.8) and how to test the significance of the change in R2 at each step. The F ratio used to test the significance of an increase in R2 was introduced in chapter 11 (Equation 11.13). It is
$$F = \frac{(R^2_{large} - R^2_{small}) / (df_{large} - df_{small})}{(1 - R^2_{large}) / (N - 1 - df_{large})} \qquad (13.1)$$
According to this formulation, the error term (the denominator in Equation 13.1) is different at each step. It is the proportion of variance left unaccounted for at each step (i.e., 1 - R2large) divided by the number of subjects minus one minus the number of predictor variables in the large model used in that step (i.e., dferror = N - 1 - dflarge). Thus, R2large refers to the proportion of variance accounted for at each step, step by step.
For our current running example (predicting number of button pushes from children status for 16 subjects) three predictor variables are defined. Entering each predictor variable one at a time, the error terms for the first, second, and third steps are 1 minus the R2 for that step divided by 14, 13, and 12 error degrees of freedom, respectively (see Fig. 13.1). However, using a different error term at each step makes sense only if we regard each step as representing a separate and progressively less important question. More often, as is the case with our running example, we conceive of the predictor variables as a set (together they represent the group variable), and regard all as important. The overarching or omnibus question is, do the groups differ? This question is answered by the proportion of variance accounted for by all three variables. In such cases, it both makes sense and is traditional to use the final error term (one minus the step 3 or final R2 divided by its degrees of freedom) to assess not just the change in R2 from step 2 to 3 but the change in R2 from no model to step 1, and from step 1 to 2 as well. Reflecting these considerations, Equation 13.1 could be rewritten as follows:
$$F = \frac{(R^2_{large} - R^2_{small}) / (df_{large} - df_{small})}{(1 - R^2_{final}) / (N - 1 - df_{final})}$$
According to this formulation, although which model is large and which is small varies with each step, the error term is the same for all steps. It is the proportion of variance left unaccounted for after the final step divided by the number of subjects minus one minus the number of predictor variables in the final model. Traditional analyses of variance use a single final, not a progressively changing, error term. For that reason, the final error term approach is used in this book whenever increments in proportions of variance accounted for are tested for statistical significance. Still, there are occasions when the progressive approach may be justified and interested readers should consult Cohen and Cohen (1983).
The stepwise statistics for the first set of contrast codes used in Exercise 12.3, using the final error term approach as described in the preceding paragraph, are organized and presented in Fig. 13.1. Note that the error degrees of freedom associated with the F ratios used to test the significance of the changes in R2 are all 12. As an exercise, you should now verify that the numbers given in Fig. 13.1 are correct. The next exercise provides you with an opportunity to organize the results from the second set of contrast codes into a similar table. And in the next section, we discuss how the results given in Fig. 13.1 (and in the table you will prepare for Exercise 13.1) can be interpreted.
                    Total                  Change
Step  Contrast      R2     df     F        R2     df     F
1     Children?     .215   1,14   3.84     .215   1,12   7.51
2     Want?         .526   2,13   7.21     .311   1,12   10.85
3     >1?           .656   3,12   7.63     .130   1,12   4.54
FIG. 13.1. Stepwise statistics for a planned comparison analysis using contrast code Set I.
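One way to verify the entries in Fig. 13.1 is sketched below in Python. This is an illustration under stated assumptions, not the book's spreadsheet: the function names are hypothetical, the total F at each step is assumed to be the usual test of that step's R2 against its own error term, and the change F uses the final error term as described in the text. Because the R2 values are entered as rounded in the figure, the results can differ from the tabled F ratios in the second decimal place.

```python
N = 16                              # subjects in the button-pushing example
R2_BY_STEP = [0.215, 0.526, 0.656]  # total R2 after steps 1, 2, and 3


def f_total(r2, n_predictors, n_subjects):
    """F for the total R2 at a step, using that step's own error term."""
    df_error = n_subjects - 1 - n_predictors
    return (r2 / n_predictors) / ((1 - r2) / df_error)


def f_change(r2_large, r2_small, r2_final, df_final, n_subjects, df_change=1):
    """F for an R2 increment, using the final (single) error term."""
    error_term = (1 - r2_final) / (n_subjects - 1 - df_final)
    return ((r2_large - r2_small) / df_change) / error_term


r2_final = R2_BY_STEP[-1]
df_final = len(R2_BY_STEP)
rows = []
previous_r2 = 0.0
for step, r2 in enumerate(R2_BY_STEP, start=1):
    rows.append((step,
                 f_total(r2, step, N),
                 r2 - previous_r2,
                 f_change(r2, previous_r2, r2_final, df_final, N)))
    previous_r2 = r2

for step, total_f, r2_change, change_f in rows:
    print(f"step {step}: total F = {total_f:.2f}, "
          f"R2 change = {r2_change:.3f}, change F = {change_f:.2f}")
```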
Exercise 13.1 Presenting Results of Planned Comparisons For this exercise you are asked to organize a stepwise results table like that shown in Fig. 13.1 but for the second contrast set used in Exercise 12.3. 1. Prepare a table like the one shown in Fig. 13.1, but for contrast set II (Fig. 12.5). You have already completed step 3 for this analysis (see Fig. 12.8). Return now to your previous spreadsheet and perform steps 1 and 2 of a planned comparison analysis and compute the stepwise statistics.
13.2 PLANNED COMPARISONS
From the one-way analysis of variance performed in chapter 12, you know that the mean number of button pushes differs for the four parental-marital status groups. The critical value of F(3,12) is 3.49 at the .05 level and 5.95 at the .01 level. Thus, no matter whether we had selected the .05 or the .01 level of significance, because F(3,12) = 7.63 for our running example, we would have claimed statistical significance for these results. (Recall from chap. 10 that numbers in parentheses after an F ratio indicate degrees of freedom for the numerator and denominator respectively.) The overall ANOVA, however, only tells us that the mean number of button pushes differs for the different groups but it does not tell us how. We could examine the group means (see Fig. 11.3) but this would be merely impressionistic; far better to fortify our impressions with statistical tests. As mentioned earlier, two approaches are possible. If we have no particular theoretical rationale for suspecting how the groups might differ beforehand, that is, if we are unwilling to commit to G - 1 specific contrasts, then we should perform post hoc tests as described in the next section. (Post hoc means "after this" or "after the event" in Latin.) Otherwise, although it is done less frequently, we can analyze the contrasts or comparisons as planned. A contrast is regarded as significant if R2 increases significantly when its associated predictor variable is added to the equation. Consider the first set of contrast codes we have been using. The critical value of F(1,12) = 4.75, alpha = .05, so both the first (F(1,12) = 7.51) and the second (F(1,12) = 10.85) contrasts are significant, but the third is not significant (F(1,12) = 4.54; see Fig. 13.1). This indicates that the mean number of button presses for the groups of subjects without children is significantly different from the groups of subjects with children (the first contrast).
In other words, 96—the mean of the first two group means—is significantly different from 76—the mean of the last two group means. Also, the mean number of presses for the group of subjects with no desire for children was significantly different from the group who desired children (the second contrast). In other words, 113 differed significantly from 79. However, the means for the group of subjects with one and more than one child, which were 65 and 87, respectively, did not differ significantly.
Exercise 13.2 Interpreting Planned Comparisons
This exercise provides practice in interpreting the results of planned comparisons.
1. Using the table you prepared for the last exercise, describe and interpret the results of the planned comparison analysis using the second set of contrast codes. Be sure to describe the group means or means of group means, as appropriate (i.e., as embodied in the significant contrasts).
2. Recall the study examining number of words infants spoke at 18 months of age, last used in Exercise 12.4. Prepare a stepwise table for a planned comparison analysis using the contrasts shown in Fig. 12.11. Then describe and interpret the results.
3. Organize the results of the trend analysis from Exercise 12.6 in a stepwise table. Then describe and interpret the results.
13.3 POST HOC TESTS
Planned comparisons are disciplined by forethought and limited in number by the degrees of freedom. But lacking an a priori basis for specific comparisons (which, judging from the research literature, is the usual case), a significant analysis of variance omnibus result can be, and should be, explicated with post hoc tests. Typically, post hoc tests determine which of all possible pairs of group means differ significantly. For the present four-group example, this generates six pairs: groups 1 vs. 2, groups 1 vs. 3, groups 1 vs. 4, groups 2 vs. 3, groups 2 vs. 4, and groups 3 vs. 4. In order to determine significance, we could treat each pair of groups as a separate two-group analysis of variance, but this is problematic because the more such comparisons we make, the more likely we are to make type I errors if the null hypothesis is true. Assume an alpha level of .05. Then, although the probability of a type I error per comparison is .05, the familywise error rate for a set of c independent tests or comparisons is
$$\alpha_{FW} = 1 - (1 - \alpha)^c$$
The present example requires six comparisons, thus
$$\alpha_{FW} = 1 - (1 - .05)^6 = 1 - .735 = .265$$
Clearly, a familywise probability for type I error that equals .265 is a far cry from the .05 that an unwary investigator might (mistakenly) expect. Over the years, statisticians have suggested a number of post hoc procedures, all designed to control the type I error rate. The number of different procedures, and the arguments pro and con for each, can seem confusing to the novice (and to many seasoned investigators as well). The Tukey test, however, seems the one test most often preferred, primarily because it is computationally straightforward and appears to strike the best balance between type I and type II errors. Still, other tests have their partisans. For simplicity, the only post hoc test presented here is the Tukey, although the reader is strongly urged to read Keppel's (1982, chap. 9) excellent and informative discussion concerning multiple comparisons.
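To see how quickly the familywise error rate discussed above grows, the calculation can be sketched in Python. This is only an illustration: the function name is hypothetical, and the formula assumes independent comparisons, as in the text.

```python
from math import comb


def familywise_error_rate(alpha, n_comparisons):
    """Probability of at least one type I error in c independent tests."""
    return 1 - (1 - alpha) ** n_comparisons


n_groups = 4
n_pairs = comb(n_groups, 2)   # four things taken two at a time
rate = familywise_error_rate(0.05, n_pairs)
print(f"{n_pairs} pairwise comparisons -> familywise rate = {rate:.3f}")

# The rate climbs steadily as comparisons are added.
for c in range(1, n_pairs + 1):
    print(c, round(familywise_error_rate(0.05, c), 3))
```

With a single comparison the rate is simply alpha; with all six pairwise comparisons it is about .265, the value given in the text.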
The Tukey Test
The Tukey test can be decomposed into four essential steps. First, order the group means from smallest to largest. Consider the button-pushing study (the means for the four groups were last given in Fig. 12.8). The smallest mean is 65 for the has one child or C1 group, the next smallest is 79 for the desire but have no children or C0 group, the next is 87 for the has more than one child or C2 group, and the largest mean is 113 for the no desire for children or ND group. Second, compute differences between pairs of means. Arrange these differences in a G x G table, where G is the number of groups (see Fig. 13.2). Differences on the diagonal are necessarily zero. There is no reason to display differences below the diagonal because they necessarily mirror the differences above the diagonal. If the difference between a pair of group means exceeds a specified critical value (symbolized here as TCD for Tukey critical difference), then the means for those two groups are said to differ significantly. Third, compute the value for the Tukey critical difference. Its formula is
$$TCD = q(G, df_{error}) \sqrt{\frac{MS_{error}}{n}}$$
In this formula q represents the Studentized Range Statistic, G is the number of groups, dferror is from the omnibus F test, MSerror is computed for the omnibus F test, and n is the number of subjects per group. (Do not confuse n with N, the number of subjects total). Values for the studentized range statistic for various values of G and dferror, for alpha = .05 and .01, are given in Table E in the statistical tables appendix. For the present example, G = 4 and dferror = 12; hence, assuming the usual .05 level of significance, q(4,12).05 = 4.20 (from Table D), so
$$TCD = 4.20 \sqrt{\frac{213}{4}} = 30.7$$
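| Group | Label | Mean |
|---|---|---|
| Has 1 child | C1 | 65 |
| Has none but would like | CO | 79 |
| Has more than 1 child | C2 | 87 |
| Has no desire for children | ND | 113 |

| Group (mean) | C1 (65) | CO (79) | C2 (87) | ND (113) |
|---|---|---|---|---|
| C1 (65) | 0 | 14 | 22 | 48 |
| CO (79) | | 0 | 8 | 34 |
| C2 (87) | | | 0 | 26 |
| ND (113) | | | | 0 |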
FIG. 13.2. Differences between ordered group means for four want/have child groups. Group means are ordered from smallest to largest.
Having computed the TCD, it is a simple matter to move on to the fourth step: Determine which differences exceed the critical value. If the difference between the means for two specific groups exceeds the value of the Tukey critical difference, then those means are significantly different. For the present example, only two differences exceed 30.7, the difference between the means for the C1 and ND groups (48) and between the means for the CO and ND groups (34). All other differences are not significant. (These four steps are summarized in Fig. 13.3.)

It is conventional to identify group means that do not differ significantly with a common subscript. Thus, means that do differ significantly will not share a common subscript and can be readily identified. Fig. 13.4 demonstrates a graphic way to determine which subscripts should be applied to which means. First, draw a diagonal line through the cells that contain zero. Second, draw vertical lines between cells so as to separate differences that exceed the Tukey critical difference from those that do not (here, between 22 and 48, and between 8 and 34). Third, likewise draw horizontal lines between cells so as to separate differences that exceed the Tukey critical difference from those that do not (here, between 34 and 26). Fourth, draw a horizontal line above the first row of differences from the top of the first vertical line back to the diagonal line, and likewise extend any other horizontal lines back to the diagonal.

These lines accomplish two things. First, differences that exceed the TCD are segregated from those that do not (here, 34 and 48 exceed 30.7). Second, the horizontal lines identify groups whose means do not differ significantly. Label each such horizontal line with a different letter, and attach these letters as subscripts to the means above them. Here the two lines are labeled a and b.
Arbitrarily, we labeled the first line, which includes groups C1, CO, and C2, with a b, and the second line, which includes groups C2 and ND, with an a, but only because we knew that later we would table these means and list the ND group mean first. The important point is that the post hoc analysis identified two groups: The means for groups C1, CO, and C2 do not differ among themselves; likewise, the means for C2 and ND do not differ between themselves. Of particular interest are horizontal group lines that do not overlap. Their associated group means do not share a common subscript and so differ significantly. Here, the means for both the C1 and CO groups (65 and 79) differ significantly from the mean for the ND group (113). According to these post hoc test results, persons with no desire for children noted significantly more communicative acts (i.e., pushed the button more), on the average, than persons who had one child and persons who had no children but desired them—but not more than persons who had two or more children.
Steps for the Tukey post hoc test

Step 1. Order the group means from smallest to largest.
Step 2. Compute differences between pairs of means.
Step 3. Compute the value for the Tukey critical difference.
Step 4. Determine which differences exceed the critical value.
FIG. 13.3. Steps for the Tukey post hoc test.
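The four steps can also be traced in a few lines of code. The following is only a minimal sketch, not part of the spreadsheet templates used in this book: it takes the group means, MSerror = 213.6, n = 4, and q = 4.20 as reported in the text, and the variable names are chosen here purely for illustration.

```python
from itertools import combinations
from math import sqrt

# Values reported in the text for the button-pushing study.
means = {"C1": 65, "CO": 79, "C2": 87, "ND": 113}
ms_error = 213.6   # MSerror from the omnibus F test
n_per_group = 4    # subjects per group (n, not N)
q = 4.20           # q(4, 12) at alpha = .05, as read from the table cited in the text

# Step 1: order the group means from smallest to largest.
ordered = sorted(means.items(), key=lambda item: item[1])

# Step 2: differences between all pairs of ordered means (larger minus smaller).
differences = {
    (low[0], high[0]): high[1] - low[1]
    for low, high in combinations(ordered, 2)
}

# Step 3: the Tukey critical difference.
tcd = q * sqrt(ms_error / n_per_group)

# Step 4: pairs whose difference exceeds the TCD.
significant = {pair: d for pair, d in differences.items() if d > tcd}

print(f"TCD = {tcd:.1f}")
for pair, d in differences.items():
    print(pair, d, "significant" if pair in significant else "not significant")
```

With these inputs only the C1 versus ND and CO versus ND differences exceed the critical difference, in agreement with the hand computation.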
FIG. 13.4. A graphic way for displaying results of a post hoc analysis. The gray area indicates differences between group means that exceed the TCD. The solid horizontal lines underscore means that do not differ significantly.
The mean number of button pushes by persons who had two or more children did not differ significantly from the means noted for the other three groups. Thus, for these data, we now understand not just that the means for the four groups differed (R² = 0.66, F(3,12) = 7.63, p < .01) but exactly how they differed as well.

The lines of Fig. 13.4 are useful as a first step in understanding the pattern of differences revealed by a post hoc test. In a fairly mechanical way, they quickly identify which means do not differ among themselves and should therefore be identified with common subscripts. More formally, post hoc results are often presented in written reports (articles, dissertations, and so forth) as shown in Fig. 13.5. In such a table, the means can be ordered in any way that makes conceptual sense. They need not be ordered from smallest to largest. For consistency, the order here is the same as the one used in chapter 11 when the button-pushing study was first introduced. Nonetheless, as noted earlier, groups that differ significantly can be readily identified by their lack of a common subscript.

Exercise 13.3 Presenting Results of Post Hoc Tests

This exercise provides practice in interpreting the results of post hoc tests.

1. Prepare a bar graph with error bars (like the one shown in Fig. 10.4) for the means given in Fig. 13.5. Prepare a second graph showing 95% confidence intervals, as discussed at the end of chapter 10. (See Fig. 13.6.)
2. Given different data for the button-pushing study, the results of the post hoc analyses could have been quite different. Assume, for example, that the MSerror is 92.5 instead of 213.6 but that all other details remain the same. Prepare a figure like Fig. 13.4 and a table like Fig. 13.5 for this post hoc analysis. Explain how these results would be interpreted.
3. Now do the same, assuming this time that the MSerror is 23.2.
| Variable | Group ND | Group CO | Group C1 | Group C2 |
|---|---|---|---|---|
| Number of button pushes | 113a | 79b | 65b | 87ab |
FIG. 13.5. Mean number of button pushes for four want/has child status groups. Means that do not differ significantly per a Tukey post hoc test, p < .05, share a common subscript.
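The graphic procedure of Fig. 13.4 amounts to finding the longest runs of ordered means whose largest difference does not exceed the TCD, and then giving each run its own letter. The sketch below is one possible way to mechanize that idea; the function names are hypothetical, it assumes a single TCD applies to every pair (equal group sizes), and the choice to letter the run containing the largest means first simply mirrors the arbitrary labeling used in the text.

```python
from string import ascii_lowercase

def homogeneous_subsets(means, tcd):
    """Return maximal runs of ordered means whose range does not exceed tcd."""
    ordered = sorted(means, key=means.get)
    subsets = []
    for start in range(len(ordered)):
        end = start
        # Extend the run while largest-minus-smallest stays within the TCD.
        while (end + 1 < len(ordered)
               and means[ordered[end + 1]] - means[ordered[start]] <= tcd):
            end += 1
        run = ordered[start:end + 1]
        # Keep the run only if it is not contained in a run already found.
        if not any(set(run) <= set(previous) for previous in subsets):
            subsets.append(run)
    return subsets

def subscripts(means, tcd):
    """Attach one letter per homogeneous subset to each group in it."""
    labels = {group: "" for group in means}
    # Letter the subsets starting with the one holding the largest means
    # (an arbitrary choice, made here to match the labeling in the text).
    for letter, run in zip(ascii_lowercase, reversed(homogeneous_subsets(means, tcd))):
        for group in run:
            labels[group] += letter
    return labels

means = {"ND": 113, "CO": 79, "C1": 65, "C2": 87}
labels = subscripts(means, tcd=30.7)
for group, mean in means.items():
    print(f"{group}: {mean}{labels[group]}")
```

Groups whose labels share no letter differ significantly; groups that share a letter do not.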
FIG. 13.6. A bar graph showing the mean number of button pushes for the four groups; error bars indicate 95% confidence intervals.
13.4 UNEQUAL NUMBERS OF SUBJECTS PER GROUP

Traditional analysis of variance texts often discuss at some length solutions to what is usually called the unequal n problem, noting adjustments required when groups contain different numbers of subjects, which is a common situation in behavioral science research (n is used here to symbolize number of subjects per group, whereas N indicates the total number of subjects in all groups). With multiple regression, such adjustments are handled automatically and so no computational problems arise. Our running example involves four subjects per group. For generality, we probably should have had unequal numbers of subjects in each group, but none of the procedures or computations would have been changed by that, as you have demonstrated in a number of the exercises.

Widely discrepant ns can affect interpretation of results, however. For one thing, the discrepant proportions of subjects in different groups can affect correlations between the criterion and coded variables and can limit the maximum value for a number of multiple-regression statistics. There is probably little reason to worry if group ns are only somewhat discrepant, but if they are widely different it makes sense to consult with more knowledgeable colleagues because interpretation may require some qualification. If you want to know more about this matter, read Cohen and Cohen's (1983, pp. 186-189) discussion concerning the relationships of dummy variables to Y.
Unequal ns and Post Hoc Tests

Before leaving this chapter, one detail remains to be addressed. Recall that the formula for the Tukey critical difference used for post hoc tests includes n, the number of subjects per group. But what if the ns for the groups are different? What value should be used then? One obvious choice might seem the arithmetic mean of the group ns, but Winer (1971) recommended that the harmonic mean be used instead. The harmonic mean is

$$n_{harmonic} = \frac{G}{\dfrac{1}{n_1} + \dfrac{1}{n_2} + \cdots + \dfrac{1}{n_G}}$$

where G is the number of groups and n_1 through n_G are the numbers of subjects in the individual groups. For example, imagine four groups with 3, 4, 4, and 5 subjects, respectively. Their arithmetic mean would be 4.00, but the harmonic mean would be

$$n_{harmonic} = \frac{4}{\dfrac{1}{3} + \dfrac{1}{4} + \dfrac{1}{4} + \dfrac{1}{5}} = \frac{4}{1.033} = 3.87$$
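The example of four groups with 3, 4, 4, and 5 subjects is easy to check numerically. This short sketch uses the `harmonic_mean` function from Python's standard `statistics` module and, as a cross-check, the definition written out directly (number of groups divided by the sum of the reciprocals of the group sizes).

```python
from statistics import harmonic_mean, mean

group_sizes = [3, 4, 4, 5]

arithmetic_n = mean(group_sizes)
harmonic_n = harmonic_mean(group_sizes)

# The same value from the definition: G divided by the sum of 1/n over groups.
by_definition = len(group_sizes) / sum(1 / n for n in group_sizes)

print(f"arithmetic mean = {arithmetic_n:.2f}")
print(f"harmonic mean   = {harmonic_n:.2f}")
```

The harmonic mean is never larger than the arithmetic mean, so using it in the TCD formula gives a slightly larger (more conservative) critical difference when group sizes differ.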
Exercise 13.4 Post Hoc Tests with Unequal Sized Groups

This exercise provides practice in performing a one-way analysis of variance and post hoc tests when groups contain unequal numbers of subjects.

1. Begin with either the spreadsheet shown in Fig. 12.2 or 11.7. For this exercise you will perform an omnibus test followed by post hoc tests, so it does not matter which set of coded variables you use to represent group membership. Omit subjects 4 (#BPs = 130) and 14 (#BPs = 94). Perform a one-way analysis of variance on the remaining 14 subjects. Note that groups 1 and 4 now contain three subjects. Prepare a source table for this analysis, modeled on the one shown in Fig. 12.9.
2. Now perform a post hoc test analysis and organize your results as shown in Figs. 13.4 and 13.5. Explain how these results would be interpreted.
3. The mean number of words spoken by three groups of infants was analyzed in Exercise 12.4 and the group effect was found to be significant. Now perform a post hoc analysis and present and interpret your results.
Exercise 13.5 Post Hoc Tests in SPSS

In this exercise you will learn how to use SPSS to conduct post hoc tests for the button-pushing study presented in chapter 12.

1. Open the data file for the button-pushing study you created in Exercise 12.5.
2. Redo the one-way ANOVA analysis. This time, however, click on the Post Hoc button and check the Tukey box before you click on OK.
3. Selecting the Post Hoc option under the one-way procedure produces two tables in addition to the typical One-way ANOVA output. The first displays all possible comparisons between group means and their significance levels. The second table, called Homogeneous Subsets, groups together all means that do not differ from each other at the selected alpha level. This grouping should make it easy for you to create the subscripts necessary to present the results of your post hoc tests in the format of Fig. 13.5. Note that SPSS always uses the harmonic mean of the group sample sizes.
4. To create a bar graph of the means, select Graphs->Interactive->Bar from the main menu. Move the pushes variable to the y axis window and the group variable to the x axis window. Click on the Error bars tab and select either Standard Error of Mean or Confidence Interval of Mean from the pull-down menu under Confidence Interval. If you select the Confidence Interval of Mean, you should also input the size of the confidence interval (e.g., 95%, 99% or 90%).
13.5 ADJUSTED MEANS AND THE ANALYSIS OF COVARIANCE

In chapter 9 we learned that if a criterion variable is regressed on a single quantitative predictor variable, the relation between them is described by reporting the proportion of variance accounted for by the predictor (r²) and the change in the criterion per unit change in the predictor (b). We also learned that if a criterion is regressed on a binary categorical variable, the relation is described by reporting the means for the two groups (as for the logically equivalent t test). And in this chapter we have learned how to report results from a one-way analysis of variance when more than two groups are involved, that is, when a criterion is regressed on a set of predictor variables that together represent a multilevel categorical variable. Specifically, we have learned that either we report group means lumped to reflect the significant contrasts (planned comparison analysis) or else we report all group means and identify those that differ significantly with different subscripts (post hoc analysis).

A single study, however, may include some quantitative and some qualitative research variables. This set of circumstances presents no problems for a multiple-regression-based analysis, and, in fact, such an analysis was first demonstrated in chapter 11. Within the experimental tradition, the name given an analysis involving mixed quantitative and categorical predictors is the analysis of covariance. In the simplest instance, an analysis of covariance involves two predictor variables, a quantitative covariate and a binary group variable (assuming just two groups). Of primary concern is whether group membership (defined by the categorical variable) affects the dependent measure. However, if the groups vary with respect to some other attribute (the covariate), which may also affect the dependent measure, then it may be desirable to control for the effect of that other variable statistically.
An Example: The Gender Smiling Study

Imagine, for example, that we want to test the hypothesis that during routine caregiving male infants smile more than female infants. We obtain a sample of 10 male and 10 female infants, observe each infant for 15 minutes during routine caregiving in the home, and count each time the infant smiles during that time. Examination of our raw data (see Fig. 13.7) reveals that the mean number of smiles is indeed higher for male infants. But this, of course, could mean nothing. We need to determine whether the difference between the mean number of smiles for male and female infants is statistically significant.

Procedures first described in chapter 9 are appropriate for these data. First we regress the number of smiles on the (binary categorical) variable coded for sex. For the data in Fig. 13.7, the r² is .150 and the corresponding F(1,18) is 3.19. The critical value for F(1,18), alpha = .05, is 4.41, so we cannot reject the null hypothesis. Based on this analysis, we could not claim that the difference between a mean of 6.4 smiles for males and a mean of 4.1 smiles for females is significant.

However, we note two interesting and disturbing aspects of our data. First, for some reason, the female infants were older than the male infants. Their mean age was 16.1 months compared to an average age of 10.9 months for the males. Second, it appears that older infants may have smiled somewhat more than younger infants. This set of circumstances suggests that an analysis of covariance might be warranted. Age may be related to the number of smiles, and the mean age for the two groups of infants appears somewhat different; consequently, before testing for a sex effect perhaps we should first remove (or control) statistically any effect age might have on number of smiles. This you already know how to do, as you are asked to demonstrate in the next exercise.
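| A: s | B: Smiles (Y) | C: Age (X) | D: Sex (A) |
|---|---|---|---|
| 1 | 5 | 5 | 0 |
| 2 | 7 | 5 | 0 |
| 3 | 4 | 5 | 0 |
| 4 | 8 | 6 | 0 |
| 5 | 6 | 9 | 0 |
| 6 | 3 | 13 | 0 |
| 7 | 4 | 14 | 0 |
| 8 | 11 | 17 | 0 |
| 9 | 12 | 17 | 0 |
| 10 | 4 | 18 | 0 |
| 11 | 1 | 4 | 1 |
| 12 | 1 | 8 | 1 |
| 13 | 5 | 13 | 1 |
| 14 | 2 | 17 | 1 |
| 15 | 1 | 18 | 1 |
| 16 | 5 | 18 | 1 |
| 17 | 5 | 20 | 1 |
| 18 | 8 | 20 | 1 |
| 19 | 7 | 21 | 1 |
| 20 | 6 | 22 | 1 |
| Sum = | 105 | 270 | |
| N = | 20 | 20 | |
| Mean = | 5.25 | 13.5 | |
| a, b = | | | |
| R = | | | |
| Mean = | 6.4 | 10.9 | males |
| Mean = | 4.1 | 16.1 | females |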
FIG. 13.7. Spreadsheet showing raw data and group means for the gender smiling study. Male infants are coded 0 and female infants 1.
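The first analysis described for these data, regressing number of smiles on the coded variable for sex, can be reproduced from the raw scores in Fig. 13.7. The following is a small sketch using only sums of squares and cross-products, in the spirit of the spreadsheet computations; the variable names are illustrative.

```python
# Raw data from Fig. 13.7: subjects 1-10 are male (0), 11-20 female (1).
smiles = [5, 7, 4, 8, 6, 3, 4, 11, 12, 4,
          1, 1, 5, 2, 1, 5, 5, 8, 7, 6]
sex = [0] * 10 + [1] * 10

n = len(smiles)
mean_y = sum(smiles) / n
mean_a = sum(sex) / n

# Sums of squares and cross-products of deviations from the means.
ss_total = sum((y - mean_y) ** 2 for y in smiles)
ss_sex = sum((a - mean_a) ** 2 for a in sex)
sp = sum((a - mean_a) * (y - mean_y) for a, y in zip(sex, smiles))

# Proportion of variance accounted for by sex, and its F with 1 and N - 2 df.
r2 = sp ** 2 / (ss_sex * ss_total)
f_ratio = r2 / ((1 - r2) / (n - 2))

male_mean = sum(y for y, a in zip(smiles, sex) if a == 0) / sex.count(0)
female_mean = sum(y for y, a in zip(smiles, sex) if a == 1) / sex.count(1)

print(f"means: males {male_mean}, females {female_mean}")
print(f"r2 = {r2:.3f}, F(1,{n - 2}) = {f_ratio:.2f}")
```

The computed F falls short of the critical value of 4.41 quoted in the text, so the raw means do not differ significantly.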
Exercise 13.6 Analyzing Variance Using Age as a Covariate

The template that results from this exercise allows you to perform an analysis of covariance for the gender smiling study. The question is, once differences in age are statistically controlled, do male infants smile significantly more than female infants?

1. Modify one of your previous spreadsheets (e.g., Fig. 12.8 or Fig. 12.11) to accommodate the gender smiling study. Enter data (shown in Fig. 13.7) for the number of smiles (Y), the infant's age (X), and the infant's sex (A, a coded predictor variable). In addition to the covariate (age) and the group variable (sex), reserve space for a third predictor variable, which represents the interaction between age and sex. The interaction variable will not be used until Exercise 13.7. Exactly what interaction means is discussed in greater detail in chapter 14.
2. Extend the spreadsheet to accommodate the 20 subjects in the gender smiling study and fill in all formulas as appropriate.
3. Do step 1. Regress number of smiles on age. Note the R² and its corresponding F. You will need them later.
4. Do step 2. Regress number of smiles on age and the variable coded for sex together. Organize your results into a stepwise table like Fig. 13.1.
5. Does sex account for a significant unique proportion of variance, beyond that already accounted for by age?

The table you prepared should look like the one shown in Fig. 13.8. As you can see, the increase in R² when sex is added to the equation is .290 and its corresponding F is 7.46. The critical value of F(1,17), alpha = .05, is 4.45, so this effect is significant and we conclude that, controlling statistically for the effect of age, sex of infant significantly affects number of smiles. But how do we describe this effect? It does not make sense to report the mean number of smiles for male and female infants, which were 6.4 and 4.1, respectively.
We know from an earlier analysis that those means do not differ significantly. But what means do? An analysis of covariance adjusts the individual raw scores for the effect of the covariate. That is, any effect the covariate might have on the criterion measure is removed (step 1) and then, in effect, these adjusted scores are analyzed to see if they are affected by the categorical variable of interest (step 2). This is simply another way to state the function of a hierarchic regression analysis. A change in R2, after all, tells us how much influence the last variable entered has, after any effects of variables already in the equation have been removed from the criterion variable. Thus, if the effect of the categorical variable entered at step 2 is significant, the means for the scores adjusted for the step 1 covariate, the scores on which the effect is based, differ significantly and thus it is the adjusted and not the raw means that should be reported.
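| Step | Variable added | Total R² | Total df | Total F | Change R² | Change df | Change F |
|---|---|---|---|---|---|---|---|
| 1 | age | .051 | 1,18 | <1 | .051 | 1,17 | 1.31 |
| 2 | sex | .340 | 2,17 | 4.39 | .290 | 1,17 | 7.46 |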
FIG. 13.8. Stepwise results for an analysis of covariance of the gender smiling study.
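The stepwise values in Fig. 13.8 can be checked directly from the raw data in Fig. 13.7. The sketch below is an illustration rather than the spreadsheet template itself: it solves the two-predictor normal equations from sums of squares and cross-products, and the helper name `sp` is a choice made here. The F for the increase in R² is tested against the step 2 error term, which is why its degrees of freedom are 1 and 17.

```python
# Raw data from Fig. 13.7: subjects 1-10 are male (0), 11-20 female (1).
smiles = [5, 7, 4, 8, 6, 3, 4, 11, 12, 4, 1, 1, 5, 2, 1, 5, 5, 8, 7, 6]
age = [5, 5, 5, 6, 9, 13, 14, 17, 17, 18, 4, 8, 13, 17, 18, 18, 20, 20, 21, 22]
sex = [0] * 10 + [1] * 10
n = len(smiles)

def sp(u, v):
    """Sum of cross-products of deviations from the means (sum of squares if u is v)."""
    mean_u, mean_v = sum(u) / len(u), sum(v) / len(v)
    return sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))

ss_y = sp(smiles, smiles)

# Step 1: regress smiles on age alone.
r2_step1 = sp(age, smiles) ** 2 / (sp(age, age) * ss_y)

# Step 2: regress smiles on age and sex together (two-predictor normal equations).
det = sp(age, age) * sp(sex, sex) - sp(age, sex) ** 2
b_age = (sp(age, smiles) * sp(sex, sex) - sp(age, sex) * sp(sex, smiles)) / det
b_sex = (sp(sex, smiles) * sp(age, age) - sp(age, sex) * sp(age, smiles)) / det
r2_step2 = (b_age * sp(age, smiles) + b_sex * sp(sex, smiles)) / ss_y

# Increase in R2 due to sex, tested against the step 2 error term.
df_error = n - 2 - 1
r2_change = r2_step2 - r2_step1
f_change = r2_change / ((1 - r2_step2) / df_error)

print(f"step 1 R2 = {r2_step1:.3f}")
print(f"step 2 R2 = {r2_step2:.3f}, change = {r2_change:.3f}")
print(f"F(1,{df_error}) for the change = {f_change:.2f}")
```

The same pattern extends to any hierarchical step: subtract the earlier R² from the later one and divide by the later model's error variance per degree of freedom.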
The statistically nonsignificant difference between the means of the raw scores for female and male infants can be portrayed graphically (see Fig. 13.9). The top horizontal line represents the mean number of smiles for male infants. Its equation is Y = 6.4 (the mean number of smiles for male infants), which means that the predicted scores for male infants would all fall on this line. The bottom horizontal line represents the mean number of smiles for female infants. Its equation is Y = 4.1 (the mean number of smiles for female infants), which means that the predicted scores for female infants would all fall on this line. The distance between the two lines (2.3 smiles) indicates the (nonsignificant) difference between the means for males and females. Fig. 13.9 also graphically portrays aspects about the data that we noted earlier, specifically, that females are older on average than males (the four oldest infants are female whereas four of the five youngest infants are male) and there is a tendency, albeit somewhat weak, for older infants to smile more than younger infants.

The statistically significant difference between the means of the scores for female and male infants that take age into account can also be portrayed graphically (see Fig. 13.10). First, we regress the number of smiles on age and sex using the prediction equation

$$Y'_i = a + b_1\,\text{age}_i + b_2\,\text{sex}_i \qquad (13.6)$$
Then we compute the predicted values for each infant and graph them. These predicted values will fall on two sloping parallel lines, representing the predicted number of smiles for males and females adjusted for age (see Fig. 13.10). The top sloping line represents males, who smiled more on average, and the bottom sloping line represents females, who smiled less. Note that the difference between the parallel lines representing predicted scores for males and females is greater in Fig. 13.10 (which takes the infant's age into account) than in Fig. 13.9 (which does not take infant's age into account).

FIG. 13.9. Graphic presentation of observed scores for the gender smiling study. Triangles represent observed number of smiles for males, circles, the number for females. The top horizontal line (black) represents the mean number of smiles for males (Y = 6.4), the bottom horizontal line (gray), the mean number of smiles for females (Y = 4.1).
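To see why the two sloping lines are parallel and farther apart than the raw means, it can help to evaluate the prediction equation at a few ages. This is a brief sketch that assumes the rounded coefficients reported for these data in Fig. 13.11 (a = 3.807, b1 = 0.2379, b2 = -3.54); the function name is illustrative.

```python
# Rounded regression coefficients as reported in Fig. 13.11 (an assumption of this sketch).
a, b_age, b_sex = 3.807, 0.2379, -3.54

def predicted_smiles(age, sex):
    """Predicted number of smiles; sex is coded 0 for males and 1 for females."""
    return a + b_age * age + b_sex * sex

gaps = []
for age in (4, 13.5, 22):
    male = predicted_smiles(age, 0)
    female = predicted_smiles(age, 1)
    gaps.append(male - female)
    print(f"age {age:>4}: male {male:.2f}, female {female:.2f}, gap {male - female:.2f}")

raw_gap = 6.4 - 4.1  # difference between the unadjusted group means
```

At every age the gap between the lines equals the size of the sex coefficient, about 3.54 smiles, compared with 2.3 smiles between the unadjusted means.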
FIG. 13.10. Graphic presentation of predicted scores for the gender smiling study. Triangles represent observed number of smiles for males, circles, the number for females. The horizontal gray line represents the mean number of smiles for all infants (Y = 5.25). Predicted scores for males and for females taking age and sex into account fall on the top (black) and bottom (gray) sloping parallel lines, respectively. The middle sloping line (light gray) represents predicted scores that take age into account when sex is not specified; scores are adjusted by the difference between it and the horizontal mean line.

Adjusting Individual Scores

The top and bottom sloping lines in Fig. 13.10 represent predicted scores that take age and sex into account. These lines slope up to the right, indicating that on average older infants smiled more than younger infants. Earlier we said that if we wanted to describe the difference between male and female scores we should use adjusted and not raw means. But how should adjusted scores be computed? The task is to remove statistically the effect of age. In effect, we ask, if all infants were the same age (in this case, the mean age for the sample or 13.5 months), how many times would they smile? Given the upward sloping line in Fig. 13.10, this means that we will adjust younger infants' number of smiles upward and older infants' downward, and that larger adjustments will be made for the youngest and oldest infants than for those whose age is nearer the mean.

In fact, the size of the adjustment for an infant of a specified age will be the distance at that age between the light gray horizontal line (representing the sample mean) and the light gray sloping line midway between the male and female upward slanting lines (representing predicted scores for infants whose sex is not specified) in Fig. 13.10. This distance can be computed with the regression equation (Equation 13.6). We use the regression coefficients we computed for age and sex (b1 and b2), and, when computing the adjustment for a particular infant, we use that infant's age (agei). However, because we want to compute an adjustment for the average infant of that age, we do not supply that particular infant's sex (sexi, in this case 0 for males and 1 for females) but instead the average of the codes we used for sex (Msex, in this case 0.5):

$$Y''_i = a + b_1\,\text{age}_i + b_2\,M_{sex} \qquad (13.7)$$

The symbol Y" is used to remind us that the prediction equation is for an average subject, not one who belongs to a specific group, and thus the individual variable (sexi, in this case) is replaced with a constant that is the average of the codes used. As noted earlier, adjustments to individual infants' scores are represented by differences between the middle sloping line in Fig. 13.10 and the horizontal line representing the mean. In other words,

$$\text{adjustment}_i = Y''_i - M_Y \qquad (13.8)$$

As you can see from Fig. 13.10, this adjustment is negative for the youngest ages, becomes smaller with increasing age, is zero at 13.5 months (the mean age for the infants in this study), and then becomes positive and progressively larger for older infants. The adjusted score (the individual's raw score adjusted for the covariate) is then the initial score from which this adjustment is subtracted:

$$Y_{adj,i} = Y_i - (Y''_i - M_Y) \qquad (13.9)$$
Thus in the present case the number of smiles is increased for younger infants (a negative adjustment is subtracted from their initial scores) and decreased for older infants (a positive adjustment is subtracted from their initial scores). You may wonder, why define the adjustment so that you end up subtracting negative adjustments, which really means adding? Why not define the adjustment as MY - Yi" in the first place, in which case the adjusted score would be Yi + MY - Yi", which can be derived algebraically from Equation 13.9 in any case? The answer lies with convention, which defines any deviation from a mean as the mean subtracted from a score and hence Yi" - MY is used.

In any event, you should now have grasped a key and central concept of the analysis of covariance: In terms of the present example, controlling for age means, in effect, adjusting raw scores statistically so that the adjusted scores now reflect what the scores would be if all infants had been observed at 13.5 months of age, the mean age for the sample. This idea should become clearer to you during the course of the next exercise, which asks you to compute adjusted scores.

Exercise 13.7 Computing Adjusted Means for an Analysis of Covariance

The template developed for this exercise adds the ability to compute adjusted scores to the template developed for the last exercise. It allows you to compute the mean number of smiles for male and female infants adjusted for age, which is necessary information to describe the significant sex effect revealed by your last analysis.

1. Modify the spreadsheet developed for the last exercise. Add three new columns to it. These columns are for the predicted number of smiles based on the truncated prediction equation, or Y"; the difference between Y" and the mean number of smiles for all infants, Y" - MY, which is the adjustment for the raw score; and the adjusted score, Y - d, where d = Y" - MY.
2. The correct values for the parameters a, b1, and b2 should be left over from the previous exercise. Now enter correct formulas in the three new columns. Do the Y" scores fall on the middle sloped line in Fig. 13.10? Are the adjusted compared to the raw scores larger for younger infants and smaller for older infants, as they should be?
3. Finally, compute the adjusted means for the male and female infants. Do these means make sense?
4. Your spreadsheet should contain a column for the age x sex interaction. Enter a formula in this column that multiplies age by the coded value for sex. This value represents the age x sex interaction and will be used in the next exercise.

At this point your spreadsheet should look like the one shown in Fig. 13.11. It is instructive to compare the adjusted scores, which are portrayed graphically in Fig. 13.12, with the unadjusted scores shown in Figs. 13.9 and 13.10. In general, scores for younger infants have increased and scores for older infants have decreased. As noted earlier, for a given age, the amount of change is the difference between the middle sloped line (Yi" = a + b1 agei + b2 Msex) and the horizontal line (Yi = MY) shown in Fig. 13.10. More of the younger infants were male, and more of the older female; thus the males' scores were more often increased and the females' scores more often decreased. Moreover, the mean for the males' scores was higher than the mean for the females' to begin with, so as a result the means for the adjusted scores are even further apart, as indicated by the horizontal lines in Fig. 13.12. Indeed, as we know from the analysis of covariance conducted earlier (Exercise 13.6), the difference between the means for the adjusted scores is statistically significant (R² change = .290, F(1,17) = 7.46, p < .05), and it is these adjusted means (males = 7.02, females = 3.48) that we would report.
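The adjustment just described (Equations 13.7 through 13.9 in the text) can be sketched in code as follows. This is only an illustration: it assumes the raw data of Fig. 13.7 and the rounded coefficients reported in Fig. 13.11 (a = 3.807, b1 = 0.2379, b2 = -3.54), so results agree with the spreadsheet only to about two decimal places, and all variable names are chosen here for readability.

```python
# Raw data from Fig. 13.7: subjects 1-10 are male (0), 11-20 female (1).
smiles = [5, 7, 4, 8, 6, 3, 4, 11, 12, 4, 1, 1, 5, 2, 1, 5, 5, 8, 7, 6]
age = [5, 5, 5, 6, 9, 13, 14, 17, 17, 18, 4, 8, 13, 17, 18, 18, 20, 20, 21, 22]
sex = [0] * 10 + [1] * 10

# Rounded coefficients as reported in Fig. 13.11 (an assumption of this sketch).
a, b_age, b_sex = 3.807, 0.2379, -3.54

mean_y = sum(smiles) / len(smiles)      # 5.25
mean_sex_code = sum(sex) / len(sex)     # 0.5, the average of the codes for sex

adjusted = []
for y, x in zip(smiles, age):
    y_average = a + b_age * x + b_sex * mean_sex_code  # prediction for an average infant of this age
    adjustment = y_average - mean_y                    # negative for younger, positive for older infants
    adjusted.append(y - adjustment)                    # adjusted score

male_adjusted = [v for v, s in zip(adjusted, sex) if s == 0]
female_adjusted = [v for v, s in zip(adjusted, sex) if s == 1]
male_adjusted_mean = sum(male_adjusted) / len(male_adjusted)
female_adjusted_mean = sum(female_adjusted) / len(female_adjusted)

print(f"adjusted means: males {male_adjusted_mean:.2f}, females {female_adjusted_mean:.2f}")
```

Because more of the younger infants are male, the males' scores are mostly adjusted upward and the females' mostly downward, which is why the adjusted means lie further apart than the raw means of 6.4 and 4.1.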
Homogeneity of Regression

In the course of performing the analysis of covariance just described, we made an assumption. By imposing parallel best fit or regression lines on the two groups (see Fig. 13.10) we assumed homogeneity of regression; that is, we assumed that the regression lines relating age to outcome had the same slope for both groups. Thus we proceeded to use the same age coefficient when adjusting both males' and females' scores (the b1 in Equation 13.7; in experimental texts this b1 is often called the average within-groups regression coefficient). However, if the regression lines relating age to outcome were quite different for males and females, it would not be appropriate to make a common adjustment. In traditional analysis of covariance terms, if the slopes relating the covariate to outcome varied across the groups analyzed, the assumption of homogeneity of regression would not be warranted. Instead of a model (e.g., Equation 13.6) that includes just terms for the covariate and for group membership, a more complex model that includes terms reflecting different slopes across groups would be required.

It is an easy matter to test whether the assumption of homogeneity of regression is warranted. First, a third variable is formed, one that represents the interaction between the covariate and the categorical variable (this will be a set of variables if more than two groups are under consideration). We have more to say about interaction in the next chapter. For now, it is sufficient to know that a variable representing the interaction between two other variables is formed by multiplying the values for those two variables together. A covariate is continuous; thus, one variable is always sufficient to represent it. For the present example, the categorical variable is binary, and hence both it and the interaction can be represented with one variable each. In fact, you have already formed values for the interaction variable in the current spreadsheet by multiplying values for age and the coded values for sex together. Now the question is, does the interaction variable account for a significant increase in R² when added to age and sex, the two variables already in the equation?
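| A: s | B: Smiles (Y) | C: Age (X) | D: Sex (A) | E: X×A | F: Y" | G: Y" - MY | H: Yadj = Y - d |
|---|---|---|---|---|---|---|---|
| 1 | 5 | 5 | 0 | 0 | 3.228 | -2.02 | 7.022 |
| 2 | 7 | 5 | 0 | 0 | 3.228 | -2.02 | 9.022 |
| 3 | 4 | 5 | 0 | 0 | 3.228 | -2.02 | 6.022 |
| 4 | 8 | 6 | 0 | 0 | 3.466 | -1.78 | 9.784 |
| 5 | 6 | 9 | 0 | 0 | 4.18 | -1.07 | 7.07 |
| 6 | 3 | 13 | 0 | 0 | 5.131 | -0.12 | 3.119 |
| 7 | 4 | 14 | 0 | 0 | 5.369 | 0.119 | 3.881 |
| 8 | 11 | 17 | 0 | 0 | 6.083 | 0.833 | 10.17 |
| 9 | 12 | 17 | 0 | 0 | 6.083 | 0.833 | 11.17 |
| 10 | 4 | 18 | 0 | 0 | 6.32 | 1.07 | 2.93 |
| 11 | 1 | 4 | 1 | 4 | 2.99 | -2.26 | 3.26 |
| 12 | 1 | 8 | 1 | 8 | 3.942 | -1.31 | 2.308 |
| 13 | 5 | 13 | 1 | 13 | 5.131 | -0.12 | 5.119 |
| 14 | 2 | 17 | 1 | 17 | 6.083 | 0.833 | 1.167 |
| 15 | 1 | 18 | 1 | 18 | 6.32 | 1.07 | -0.07 |
| 16 | 5 | 18 | 1 | 18 | 6.32 | 1.07 | 3.93 |
| 17 | 5 | 20 | 1 | 20 | 6.796 | 1.546 | 3.454 |
| 18 | 8 | 20 | 1 | 20 | 6.796 | 1.546 | 6.454 |
| 19 | 7 | 21 | 1 | 21 | 7.034 | 1.784 | 5.216 |
| 20 | 6 | 22 | 1 | 22 | 7.272 | 2.022 | 3.978 |
| Sum = | 105 | 270 | 10 | 161 | 105 | 7E-15 | 105 |
| N = | 20 | 20 | 20 | 20 | 20 | 20 | 20 |
| Mean = | 5.25 | 13.5 | 0.5 | | 5.25 | 4E-16 | 5.25 |
| a, b = | 3.807 | 0.2379 | -3.54 | | | | |
| R = | 0.583 | | | | | | |
| Mean (males) = | 6.4 | 10.9 | | | | MMadj | 7.018 |
| Mean (females) = | 4.1 | 16.1 | | | | FMadj | 3.482 |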
FIG. 13.11. Spreadsheet for computing analysis of covariance adjusted means for the gender smiling study. Columns I-P are not shown.
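The test of homogeneity of regression is one more hierarchical step: form the product of age and the coded sex variable, add it to the equation, and test the increase in R². The sketch below is one way to do this without a spreadsheet. The helper `r_squared` is a hypothetical name introduced here; it fits ordinary least squares by solving the normal equations with Gauss-Jordan elimination, which is adequate for a small, well-conditioned problem like this one.

```python
def r_squared(predictors, y):
    """R2 from a least-squares fit of y on the predictor columns plus an intercept."""
    n = len(y)
    columns = [[1.0] * n] + [[float(v) for v in p] for p in predictors]
    k = len(columns)
    # Normal equations (X'X) b = X'y, stored as an augmented matrix.
    m = [[sum(ci * cj for ci, cj in zip(columns[i], columns[j])) for j in range(k)]
         + [sum(ci * yi for ci, yi in zip(columns[i], y))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(k):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    b = [m[i][k] / m[i][i] for i in range(k)]
    predicted = [sum(bi * c[row] for bi, c in zip(b, columns)) for row in range(n)]
    mean_y = sum(y) / n
    ss_total = sum((v - mean_y) ** 2 for v in y)
    ss_error = sum((v - p) ** 2 for v, p in zip(y, predicted))
    return 1 - ss_error / ss_total

# Raw data from Fig. 13.7: subjects 1-10 are male (0), 11-20 female (1).
smiles = [5, 7, 4, 8, 6, 3, 4, 11, 12, 4, 1, 1, 5, 2, 1, 5, 5, 8, 7, 6]
age = [5, 5, 5, 6, 9, 13, 14, 17, 17, 18, 4, 8, 13, 17, 18, 18, 20, 20, 21, 22]
sex = [0] * 10 + [1] * 10

# The interaction variable is the product of the covariate and the coded group variable.
interaction = [x * s for x, s in zip(age, sex)]

r2_step2 = r_squared([age, sex], smiles)
r2_step3 = r_squared([age, sex, interaction], smiles)
r2_change = r2_step3 - r2_step2
df_error = len(smiles) - 3 - 1
f_change = r2_change / ((1 - r2_step3) / df_error)

print(f"step 2 R2 = {r2_step2:.3f}, step 3 R2 = {r2_step3:.3f}")
print(f"change = {r2_change:.3f}, F(1,{df_error}) = {f_change:.2f}")
```

An F below 1 for the interaction step indicates that separate slopes for the two groups fit no better than a common slope, so a common age adjustment is reasonable for these data.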
FIG. 13.12. Graphic presentation of adjusted scores for the gender smiling study. Triangles represent the adjusted number of smiles for males; circles, the adjusted number for females. The top horizontal line represents the mean of the adjusted number of smiles for males (Y = 7.02); the bottom horizontal line, the mean of the adjusted number of smiles for females (Y = 3.48).

Exercise 13.8 Testing Homogeneity of Regression
This exercise uses the template developed for the last exercise to test whether the assumption of homogeneity of regression, which is required for an analysis of covariance, is warranted for the gender smiling study.
1. This exercise can be viewed as an extension of Exercise 13.6. For that exercise, steps 1 and 2 consisted of adding first age and then the coded variable for sex to the regression equation.
2. Now do step 3. Regress number of smiles on age, the coded variable for sex, and the variable representing the age x sex interaction and make other appropriate changes. What is the value of R² for this equation? What is the change in R² from step 2 to step 3? Is this increase in R² significant?

As you can see, for the current example the assumption of homogeneity of regression is warranted. The increase in R² due to the interaction variable was not significant (R² = .019, F(1,16) < 1, NS), which means that the regression lines for the two groups do not differ significantly. If the increase in R² had been significant, reporting adjusted means would not have been informative. In such a case, we would have instead described the (significantly different) slopes for the two groups separately.

The categorical variable used in the present analysis of covariance example was sex, which is a binary variable. But even if the categorical variable were more complex, requiring more than one coded predictor variable, the analysis still proceeds hierarchically. In general, if a categorical variable comprises G levels, then G - 1 predictor variables are required to represent it, as discussed in the last chapter. In order to test the significance of any categorical variable, the covariate would be added to the regression equation at step 1 and the set of predictor variables representing the categorical variable, however many are required, would be entered at step 2. Similarly, G - 1 predictor variables would be required to represent the interaction. They would be formed by multiplying each predictor variable required for the categorical variable by age. These predictor variables would then be entered at step 3, as a set.

If the increase in R² for step 3, however many interaction variables are involved, is not significant, then you can conclude that the assumption of homogeneity of regression is warranted. Then, if the increase in R² for step 2 is significant, you would claim a significant group effect, compute and report adjusted means, and explicate the pattern of group differences with a Tukey post hoc test based on the adjusted means. However, if the increase in R² for step 3 were significant, a more complex explication that takes the interaction into account is called for. In such cases, more knowledgeable colleagues should be consulted.

Exercise 13.9 ANCOVA in SPSS
In this exercise you will learn how to conduct an analysis of covariance in SPSS.
1. Create a new SPSS data file from the spreadsheet you created in Exercise 13.6.
2. Conduct a hierarchical regression as you did in Exercise 11.7. Enter smiles as the dependent variable and age in the first block and sex in the second block. Make sure you select R squared change statistics as an option. The model summary output should correspond to the values found in Fig. 13.8.
3. Test the homogeneity of regression assumption by running a second analysis and entering age, sex, and the age by sex interaction term in the third block.
4. You could also conduct an ANCOVA using the General Linear Model (GLM) procedure. To do this select Analyze->General Linear Model->Univariate from the main menu. Move smiles to the Dependent Variable window, sex to the Fixed Factor(s) window, and age to the Covariate(s) window.
5. Click on Model and select Type I from the Sums of Squares pull down menu. Click on Continue. Click on Options and move the sex variable to the Display Means for window. Click on Continue and then OK.
6. Examine the output. The statistics for age, sex, error, and total should be identical to the regression you ran in step 2.
7. To test the homogeneity of regression assumption, you must create a custom model. Rerun the GLM procedure, but this time click on Model and check the Custom button under Specify Model. Click on the age variable in the Factor(s) and Covariate(s) window to highlight it, then click on the arrow button to move the variable to the Model window. Do the same with sex. Then click on both age and sex while holding down the Ctrl key, so that both are highlighted, and then click on the arrow button. This moves the age * sex interaction term to the Model window.
8. Examine the Tests of Between-Subjects Effects box in the output. Check the F value and significance of the interaction term. Do these statistics agree with the values you calculated in part 3?
9. Examine the Estimated Marginal Means box in the output from the model that does not include the interaction. SPSS automatically calculates the adjusted group means. Do they agree with the values you calculated in Exercise 13.7?
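The hierarchical test of homogeneity of regression described above can also be sketched outside a spreadsheet or SPSS. The following Python fragment is only an illustration under stated assumptions: the data are randomly generated stand-ins (not the gender smiling data), the variable names are hypothetical, and the helper functions `r_squared` and `f_for_r2_change` are written for this sketch rather than taken from any package. It enters age at step 1, the coded sex variable at step 2, and the age x sex product at step 3, and then tests the step 3 increase in R².

```python
import numpy as np

def r_squared(predictors, y):
    """R squared from a least-squares fit of y on the predictors plus an intercept."""
    X = np.column_stack([np.ones(len(y))] + list(predictors))
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    ss_error = np.sum((y - X @ coef) ** 2)
    ss_total = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_error / ss_total

def f_for_r2_change(r2_before, r2_after, k_before, k_after, n):
    """F ratio and degrees of freedom for the increase in R squared between two steps."""
    df_change = k_after - k_before      # predictors added at this step
    df_error = n - k_after - 1          # residual df for the larger model
    f = ((r2_after - r2_before) / df_change) / ((1.0 - r2_after) / df_error)
    return f, df_change, df_error

# Hypothetical data: 20 subjects, 10 per group (NOT the data from the text).
rng = np.random.default_rng(0)
n = 20
age = rng.uniform(6, 18, size=n)                 # covariate
sex = np.repeat([-1.0, 1.0], n // 2)             # contrast-coded group variable
smiles = 2.0 + 0.3 * age + 1.5 * sex + rng.normal(0, 1.5, size=n)
age_x_sex = age * sex                            # interaction predictor

r2_step1 = r_squared([age], smiles)                      # step 1: covariate
r2_step2 = r_squared([age, sex], smiles)                 # step 2: group effect
r2_step3 = r_squared([age, sex, age_x_sex], smiles)      # step 3: interaction

f3, df1, df2 = f_for_r2_change(r2_step2, r2_step3, 2, 3, n)
print(f"R2 change at step 3 = {r2_step3 - r2_step2:.3f}, F({df1},{df2}) = {f3:.2f}")
```

With 20 subjects and three predictors the error term has 16 degrees of freedom, the same F(1,16) form reported in the text. A nonsignificant F at step 3 is what licenses reporting adjusted means.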
Single- and Multifactor Studies
This has been an important skill-building chapter, filled with "And after the initial analyses, what then?" kinds of considerations. It is also something of a bridge chapter, standing midway between discussion of single-factor (chap. 12) and multiple-factor (chap. 14) between-subjects studies. The chapter began with a discussion of planned comparisons, which probably seemed simply a straightforward application of hierarchic multiple regression. But as you will see in the next and subsequent chapters, the ideas used in an analysis of planned comparisons apply to analyses of multifactor studies as well.

Significance testing, which is usually embodied by big Fs, is greatly emphasized in our present statistical tradition. And although it is important to determine whether apparently impressive (or even puny) results might be just chance happenings, it is equally important to describe results once significance tests have given us license to do so. Post hoc tests are especially important in this regard. Whenever more than two groups are included in a one-way analysis of variance, for example, a single significant omnibus F is insufficient to tell us exactly which groups differ from which others. Analogously, in the presence of a significant analysis of covariance result, the raw score means do not provide an accurate picture of how the groups vary once scores have been adjusted for the covariate. Only the adjusted means do that.

You now know how to compute adjusted means, which ordinarily is regarded as an advanced topic but is rendered quite simple, even elementary, by the integrated multiple-regression approach of this book. You also know how to perform one widely used post hoc test, the Tukey, and you should be comfortable interpreting post hoc test results. This is an important skill. As you will see in the next and subsequent chapters, post hoc analysis applies not just to the single-factor studies discussed so far but to all of the other analysis of variance designs you will encounter.
14
Studies With Multiple Between-Subjects Factors
In this chapter you will:
1. Learn what between-subjects factorial studies are and what advantages they offer.
2. Learn how to construct predictor variables for analyzing data from between-subjects factorial studies.
3. Learn how to compute the degrees of freedom associated with the main effects and interactions of these studies.
4. Learn how to determine whether the main effects and interactions (i.e., conditional relationships) of these studies are statistically significant and how to characterize the magnitude of such effects.
5. Learn how to interpret any statistically significant main effects and interactions.

In chapter 10 you learned how to decide if two groups differ significantly and in chapter 12 you learned how to perform an analysis of variance with more than two groups. As you now know, no matter the number of groups, you need only determine whether a single between-subjects factor significantly increases predictability. The single factor in such studies indicates group membership: for example, whether subjects were assigned to an experimental or a control group, or to which of four different want/have children status groups subjects belonged. Subjects in these studies are assigned or belong to one, and only one, of the groups; that is, the groups in the studies are formed independently. Such studies can be visualized as a single row of cells, with each cell representing a group and containing the subjects belonging to that group (see Fig. 14.1).

When subjects belong to (or are assigned to) one of two or more groups (a single-factor study), the usual research question is, are the groups different in some way? With respect to some measure of interest, are men different from women, or are people who received one treatment different from those who received another treatment (or treatments)? In statistical terms, are the criterion score means for the various groups so discrepant from one another that it is unlikely that the subjects were sampled from a population in which there is no association between group membership and the criterion variable? As you know, the usual analytic technique for detecting such group differences is called a one-way analysis of variance. In chapter 12 you learned how to conduct a one-way ANOVA using multiple regression and coded predictor variables, and in the last chapter you learned how to describe significant results.

Researchers' interests, however, are rarely limited to just a single independent factor or variable. In this chapter we discuss a common and more general situation, one for which a two-way, three-way, or even higher way analysis of variance is appropriate. This situation can be characterized as follows:
1. More than one research factor is of interest (and for the time being we assume that all factors define groups, i.e., are categorical).
2. The factors operate between subjects (i.e., subjects serve in and hence contribute criterion scores to only one group, not repeatedly to more than one group).
3. The factors are completely crossed (i.e., each level of a factor is represented at all levels of the other factors).
FIG. 14.1. Schematic for a single-factor between-subjects study. The factor is symbolized A and there are G levels of the factor, hence groups are labeled A1 through AG. Each of the N subjects would be assigned to one, and only one, of the groups.

14.1 BETWEEN-SUBJECTS FACTORIAL STUDIES
Some writers use the term factorial broadly for any study involving more than one factor. Others use the term more narrowly just for studies with completely crossed factors. (Some alternatives to completely crossed factors are named at the end of the next section.) Usage in this book adheres to the narrow tradition: If a study is called factorial, its factors are understood to be completely crossed (see Fig. 14.2). In this chapter those factors are understood to be between subjects as well, although in the next chapter the discussion will be extended to studies involving repeated factors (i.e., factors that operate within subjects).

Factorial studies, in the sense just defined, are deservedly one of the most common arrangements used in behavioral science research. Their hallmark is a factorial or crossed arrangement of the treatments (the levels of the independent variables), which means that groups of subjects receiving every possible combination of the levels of the independent variables are included in the study. For example, if factor A were sex (with two levels, male and female) and factor B were instruction set (again two levels, set I and set II), then the corresponding 2 x 2 factorial study would include four groups: males exposed to set I, males to set II, females to set I, and females to set II. More generally, if factor A includes a levels and factor B includes b levels, then the study includes a times b groups.

As is conventional, we indicate the first categorical between-subjects variable with A, the second with B, the third with C, and so forth. Then the first level for factor A would be A1, the second level would be A2, and so forth, whereas the first level for factor B would be B1, the second level would be B2, and so forth. If there were only two factors, and if each factor had two levels each (as in the male/female, instruction set I/set II example just given), then the four groups for that study (male and I, male and II, female and I, female and II) would be designated A1B1, A1B2, A2B1, and A2B2 (see top, Fig. 14.2). Similarly, if there were two factors, but if factor A had three levels and factor B had four, then the study would include 12 groups (see bottom, Fig. 14.2).

As you can see, the number of groups included in a factorial study is determined by multiplying the number of levels for each factor together. We just saw that 2 x 2 = 4 groups, 2 x 3 = 6 groups, and 3 x 4 = 12 groups. Additional examples are:
1. If factors A, B, and C are each represented with 2 levels, then the complete 2 x 2 x 2 study includes 8 groups.
2. If factors A, B, and C contain 3, 4, and 5 levels respectively, then the complete 3 x 4 x 5 study includes 60 groups.

In some circumstances not all groups might be necessary to answer the research question of interest. Moreover, not all research situations match the factorial study described here. Readers should be aware that, in addition to a factorial arrangement, there are other ways to combine two or more factors in a study. Designs for such studies (e.g., nested or hierarchical designs, Latin square designs, and other kinds of incomplete designs) are regarded as an advanced topic in statistical analysis, beyond the scope of this book, but interested readers should consult authorities like Hays (1981), Keppel (1982), Kirk (1982), and Winer (1971).
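The rule that the number of groups is the product of the numbers of levels, and the A1B1-style labeling of groups, can be illustrated with a few lines of Python. This is only a sketch; the function name `group_labels` is an assumption made for this illustration.

```python
from itertools import product
from math import prod

def group_labels(levels_per_factor):
    """Labels such as A1B1, A1B2, ... for a completely crossed design."""
    factor_names = "ABCDEFGH"[:len(levels_per_factor)]
    level_ranges = [range(1, k + 1) for k in levels_per_factor]
    return ["".join(f"{name}{level}" for name, level in zip(factor_names, combo))
            for combo in product(*level_ranges)]

print(group_labels([2, 2]))        # the four groups of a 2 x 2 study
print(prod([3, 4]))                # number of groups in a 3 x 4 study
print(prod([3, 4, 5]))             # number of groups in a 3 x 4 x 5 study
```

Because the factors are completely crossed, the list of labels always has as many entries as the product of the level counts.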
FIG. 14.2. Schematics for three different two-factor between-subjects studies: a 2 x 2 (top), a 2 x 3 (middle), and a 3 x 4 (bottom) two-factor factorial study.
226
MULTIPLE BETWEEN-SUBJECTS FACTORS
Advantages of Factorial Studies
There are two major reasons why factorial studies are so popular. First, the effect of more than one variable can be investigated in an economical way that requires no more data. For example, imagine that the sex (male/female) by instruction (set I/set II) 2 x 2 factorial design mentioned a few paragraphs ago were used for the button-pushing study described in chapter 11. Ignoring instruction set, we could ask whether men and women differed in how often they pushed the button. In addition, this time ignoring sex, we could ask whether the instructions given affected how often subjects pushed the button. In analysis of variance terms, we are asking whether there is a main effect for sex and also whether there is a main effect for instruction set.

As appealing as this first advantage is (allowing us, in effect, to address two questions for the price of one), the second advantage of factorial studies may be the more interesting. Factorial studies allow us to test for conditional relations or what in analysis of variance terms are called statistical interactions. It may be, for example, that the effect of instruction set is conditional on (i.e., depends on) the sex of the subject. Perhaps instruction set only affects men, not women (or vice versa). Factorial studies let us test for such interesting possibilities in a simple and straightforward way.
Coding Predictor Variables for Factorial Studies
Between-subjects single-factor and multifactor factorial studies are alike in that both consist of a number of cells or groups to which subjects are assigned. And for both, the number of predictor variables required to code for group membership is G - 1, one less than the total number of groups. In chapter 12 we learned how to use contrast (and dummy) coding to create predictor variables for single-factor studies. We also learned that the G - 1 predictors could be coded in a number of different ways (as long as they satisfied certain rules), that the particular set selected did not affect the overall variance accounted for, and that, other things being equal, it makes sense to select a set that reflects the research questions and the design of the study. In this chapter we learn how to form sets of predictor variables appropriate for factorial studies.

First we define predictor variables that code for factor A. This is done exactly as it was done in chapter 12 for single-factor studies. Two groups would require one predictor variable, three groups would require two predictor variables, and so forth. Moreover, the subset of predictor variables that code for factor A should follow the two rules defined in chapter 12. That is:
1. The codes selected for each contrast (i.e., each predictor variable) must sum to zero.
2. The cross products for all possible pairs of contrasts must likewise sum to zero.

Next the predictor variables that code for factor B are formed in exactly the same way, just as though they also derived from a single-factor study. The same is true for any other factors. In other words, each factor is associated with a subset of predictor variables. For each factor, the number of predictor variables is one less than the number of levels for that factor.

Finally, predictor variables that code for interactions are formed. Interactions exhaustively (i.e., completely) combine the factors. For example, a two-factor factorial would include a main effect for A, a main effect for B, and an AB interaction, whereas a three-factor factorial would include main effects for A, B, and C, two-way (or first-order) interactions for AB, AC, and BC, and a three-way (or second-order) interaction for ABC. Main effects and interactions for two-, three-, and four-way factorial studies are shown in Fig. 14.3. At this point, the reader should grasp the logic of factorial designs and should understand how to list all the higher order interaction terms for any factorial design. Exactly what those interactions mean in substantive terms will become clear later.

As noted a few paragraphs earlier, coded predictor variables for main effects are formed as though each factor were the single factor in a single-factor study. Consider a simple 2 x 2 factorial (two factors, each represented with two levels). Such a study consists of four groups, labeled as follows (see Fig. 14.2):
A1B1 A1B2 A2B1 A2B2
There are four groups, so three predictor variables are required. The first codes for factor A: Using contrast codes, subjects in the first two or A1 groups could be coded -1 and subjects in the last two or A2 groups could be coded +1 (see Fig. 14.4). The second predictor variable codes for factor B: Subjects in the first and third or B1 groups could be coded -1 and subjects in the second and fourth or B2 groups could be coded +1. The third predictor variable codes for the AB interaction. The code for the subjects in each group would be formed by multiplying the A and B codes for that group. Thus the codes for the first, second, third, and fourth groups would be +1, -1, -1, and +1, respectively (again, see Fig. 14.4). When codes are formed in this way, the entire set of three predictor variables will obey the two formation rules described in chapter 12: the codes selected for each predictor variable will sum to zero and the cross products for all possible pairs of contrasts will also sum to zero.

Once the predictor variables are formed, analysis of the 2 x 2 factorial is straightforward. Three steps are required. Each step adds a predictor variable, first the one that codes for the A main effect, then the B main effect, and finally the AB interaction. The significance for the A main effect, for the B main effect, and for the AB interaction is the significance of the increase in R² associated with step 1, 2, and 3, respectively. Thus analysis of a 2 x 2 factorial is identical with a planned comparison analysis of a study involving four groups. For any planned comparison analysis, selection of the contrasts used requires some thought and justification. In the case of a 2 x 2 factorial (or any other factorial) the contrasts to be used are determined by the factorial design. Again as for any planned comparison analysis, interest lies with the increases in R² between successive steps and so it makes no sense to regress the dependent measure on all three predictor variables at once in a single step as for an omnibus F test.

                               Number of Factors
Effects                        2       3            4
Main effects:                  A B     A B C        A B C D
First-order interactions:      AB      AB AC BC     AB AC AD BC BD CD
Second-order interactions:             ABC          ABC ABD ACD BCD
Third-order interactions:                           ABCD

FIG. 14.3. Main effects and interactions for factorial studies with two, three, and four factors.

FIG. 14.4. Contrast codes for a 2 x 2 factorial study.

As a second example, consider a 2 x 3 factorial. The six groups are shown in Fig. 14.5, and again contrast codes are used. For factor A (predictor variable X1), subjects in the first three or A1 groups are coded -1 and subjects in the last three or A2 groups are coded +1. Factor B has three levels and hence two predictor variables are required (X2 and X3). The first contrasts level B1 with levels B2 and B3, and the second contrasts level B2 with level B3.
For predictor variable X2, subjects in the first and fourth or B1 groups are coded -2 and all other subjects are coded +1. For predictor variable X3, subjects in the second and fifth or B2 groups are coded -1 and subjects in the third and sixth or B3 groups are coded +1. Factor A is represented with one and factor B with two predictor variables, thus the AB interaction requires two predictor variables as well (one times two). Predictor variable X4 is formed by multiplying values for X1 and X2 together, and predictor variable X5 is formed by multiplying values for X1 and X3. As an exercise you should verify that the X4 and X5 products shown in Fig. 14.5 are correct.

Analysis of this 2 x 3 factorial, like all two-way analyses, requires three steps. In step 1, the first coded predictor variable is entered. Factor A has only two levels, so only one predictor variable is required to represent it. In step 2, the two predictor variables required to code for the three levels of factor B are added. Thus step 2 constitutes what is in effect an omnibus F test for factor B. Finally, in step 3, the two predictor variables representing the AB interaction are added. Again, the significance for the increase in R² for steps 1, 2, and 3 is the significance for the A main effect, the B main effect, and the AB interaction respectively. Assuming the predictor variables representing factor A follow contrast code rules, and those for factor B do too, and the codes for the interaction are formed by multiplication, then the entire set of five predictor variables will also fulfill contrast code requirements, as you can easily verify. Analysis of the factorial design requires a hierarchical procedure, just like a planned comparison analysis, although more than one predictor variable may be added on some steps. Specifically, if a main effect or any component factor of an interaction involves more than two groups, then more than one predictor variable will be required to represent it.

FIG. 14.5. Contrast codes for a 2 x 3 factorial study.

As a third example, consider a 2 x 2 x 2 factorial. The eight groups are as shown in Fig. 14.6. The first predictor variable codes for factor A, the second for factor B, and the third for factor C. The codes for the AB interaction are formed by multiplying the A and B codes. Similarly, multiplying the A and C codes yields the codes for the AC interaction and multiplying the B and C codes gives the codes for the BC interaction. Finally, multiplying A, B, and C codes together gives the codes for the ABC interaction.
As you can see, a three-way factorial analysis of variance considers seven effects, hence seven steps are required. In this case, because the A, the B, and the C factor all comprised only two levels, each of the seven steps is represented by only one predictor variable.
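For readers who prefer to check codes programmatically, the 2 x 3 example discussed above can be sketched in Python. The group order (A1B1, A1B2, A1B3, A2B1, A2B2, A2B3) and the codes for X1, X2, and X3 follow the text; the value 0 for the B1 groups on X3 is an assumption made here because the text does not state it, although it is the value the sum-to-zero rule requires. The helper `obeys_formation_rules` is a name invented for this sketch.

```python
import numpy as np
from itertools import combinations

# One code per group, in the order A1B1, A1B2, A1B3, A2B1, A2B2, A2B3.
x1 = np.array([-1, -1, -1, 1, 1, 1])    # factor A: A1 vs. A2
x2 = np.array([-2, 1, 1, -2, 1, 1])     # factor B: B1 vs. B2 and B3
x3 = np.array([0, -1, 1, 0, -1, 1])     # factor B: B2 vs. B3 (0 for B1 is assumed)
x4 = x1 * x2                            # AB interaction, first predictor
x5 = x1 * x3                            # AB interaction, second predictor

codes = {"X1": x1, "X2": x2, "X3": x3, "X4": x4, "X5": x5}

def obeys_formation_rules(code_vectors):
    """True if every code vector sums to zero and every pair has zero cross products."""
    sums_ok = all(v.sum() == 0 for v in code_vectors.values())
    pairs_ok = all(np.dot(u, v) == 0
                   for u, v in combinations(code_vectors.values(), 2))
    return sums_ok and pairs_ok

print("X4:", x4.tolist())
print("X5:", x5.tolist())
print("Formation rules satisfied:", obeys_formation_rules(codes))
```

The same check applies unchanged to the 2 x 2 and 2 x 2 x 2 examples once their code vectors are typed in, which is one way to approach the verification asked for in the exercises.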
FIG. 14.6. Contrast codes for a 2 x 2 x 2 factorial study.

Exercise 14.1 Coding Predictor Variables for Factorial Studies I
This exercise provides preliminary practice with coding predictor variables for factorial studies.
1. Verify that the values for the coded variables for the first- and second-order interactions (AB, AC, BC, and ABC) given in Fig. 14.6 are correct.
2. Verify that the values for the coded predictor variables given in Figs. 14.4 and 14.5 obey the two formation rules for such variables (group contrast codes and all possible pairs of their cross products sum to zero).

One final example should help clarify further how contrast codes are formed for factorial studies. Two of the examples just presented, the 2 x 2 and the 2 x 2 x 2 factorial, were relatively simple. For both, all factors were represented with only two levels; therefore only one predictor variable was required for each main effect and interaction. When factors are represented by more than two levels, however, additional predictor variables are required. Specifically, as we already know from the one-way case, if factor A consists of a levels, then a - 1 predictor variables are required to code the A main effect. Similarly, if factor B consists of b levels, then b - 1 predictor variables are required for the B main effect, and so forth. That is why two predictor variables were required to represent the B factor for the 2 x 3 example presented in Fig. 14.5.

Furthermore, because codes for interaction terms are formed by multiplying the values for the predictor variables representing the constituent main effect codes together, the number of predictor variables required to represent the AB interaction is a - 1 times b - 1. For example, in Fig. 14.4, one predictor variable was required to represent the AB interaction for the 2 x 2 example, and in Fig. 14.5 two predictor variables were required for the AB interaction for the 2 x 3 example. As a further and more complex example, consider a 3 x 4 factorial study. Two predictor variables are required for the A main effect and three for the B main effect. This means that six predictor variables are required to code the AB interaction:
(a - 1)(b - 1) = 2 x 3 = 6

Thus 11 predictor variables in all are needed for this 12-group study. Predictor variables X1 and X2 would code factor A, predictor variables X3, X4, and X5 would code factor B, and predictor variables X6 through X11 would code for the AB interaction (see Fig. 14.7). Specifically, variables X6 through X11 would be formed as the six possible products of a factor A predictor (X1 or X2) with a factor B predictor (X3, X4, or X5).
Again, as an exercise you should verify that the values given in Fig. 14.7 are correct. This example, with its 12 groups and 11 predictor variables, may seem somewhat cumbersome. But imagine, for example, a 3 x 4 x 5 analysis of variance, which would have 60 different groups and hence 59 predictor variables! True, studies with so many groups are quite rare, but in any case, once the codes for the predictor variables representing the main effects have been determined, the codes for the predictor variables associated with the different interactions can be generated easily using a spreadsheet, which is exactly how codes for predictor variables X6 through X11 shown in Fig. 14.7 were computed.
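The text generates interaction codes with a spreadsheet; an equivalent sketch in Python follows. The main-effect contrasts used here are hypothetical choices made for this illustration and need not match those shown in Fig. 14.7, and the order in which the six products are assigned to X6 through X11 is likewise an assumption.

```python
import numpy as np

# Hypothetical contrast codes for the levels of each factor (not taken from Fig. 14.7).
a_contrasts = [np.array([-2, 1, 1]), np.array([0, -1, 1])]           # 3 levels -> 2 codes
b_contrasts = [np.array([-3, 1, 1, 1]), np.array([0, -2, 1, 1]),
               np.array([0, 0, -1, 1])]                               # 4 levels -> 3 codes

n_a, n_b = 3, 4
# Expand level codes to the 12 groups, ordered A1B1, A1B2, ..., A3B4.
a_columns = [np.repeat(c, n_b) for c in a_contrasts]    # X1, X2
b_columns = [np.tile(c, n_a) for c in b_contrasts]      # X3, X4, X5
# Each interaction predictor is the product of one A column and one B column.
ab_columns = [a * b for a in a_columns for b in b_columns]   # X6 ... X11

design = np.column_stack(a_columns + b_columns + ab_columns)
print(design.shape)              # 12 groups by 11 predictor variables

# Check the two formation rules for the whole set of 11 predictors.
column_sums = design.sum(axis=0)
cross_products = design.T @ design
off_diagonal = cross_products - np.diag(np.diag(cross_products))
print("all sums zero:", bool(np.all(column_sums == 0)))
print("all cross products zero:", bool(np.all(off_diagonal == 0)))
```

Any main-effect contrasts that satisfy the two rules within their own factor could be substituted; with the groups completely crossed, the products then satisfy the rules automatically.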
FIG. 14.7. Possible contrast codes for a 3 x 4 factorial study. For this example, factor A comprises three and factor B four levels. Therefore factor A is represented with two, factor B with three, and the AB interaction with six predictor variables.
Exercise 14.2 Coding Predictor Variables for Factorial Studies II
This exercise provides additional practice with coding predictor variables for factorial studies.
1. Think of a plausible research application for a 3 x 4 factorial design. What are the research factors? What are the levels for those factors? How do you plan to code the factors and what is your rationale for the coding you selected? (Your coding should be different from that used for Fig. 14.7.) Following the format used in Fig. 14.7, indicate the groups included in the study and show how the predictor variables would be coded for each group. Verify that these predictor variables obey the two formation rules for contrast codes.
Degrees of Freedom for Factorial Studies
In this chapter the discussion of one-way or single-factor between-subjects studies begun in chapter 12 has been extended and generalized. You have learned how to generate the higher order interaction terms implied by multifactor studies, and you have learned how to code the predictor variables associated with the various main effects and interactions of such studies. In this subsection, you will learn how to determine degrees of freedom for the various effects (the main effects and interactions) of factorial studies. In order to perform tests of significance, the topic of the next section, you will need to know the correct degrees of freedom for these effects.

Recall from chapter 10 that, given N scores derived from N subjects, the degrees of freedom associated with the total sum of squares was N - 1. Recall further that the total degrees of freedom, like the total sum of squares, can be partitioned into two parts: the portion due to the model and the portion remaining, or residual, due to error. In other words,

df_total = df_model + df_error

Finally, recall that the degrees of freedom for the model were simply the number of predictor variables included in the model whereas the degrees of freedom for error were, as the term residual implies, the degrees of freedom left over or remaining. Thus, for a factorial study that includes a total of G groups, the following is true:

df_total = N - 1
df_model = G - 1
df_error = N - G    (14.4)

The degrees of freedom for the model is G - 1 because, given G groups, G - 1 predictor variables are required to code group membership. The degrees of freedom for error, then, can be determined by simple algebraic manipulation:

df_error = df_total - df_model = (N - 1) - (G - 1) = N - G

This agrees with Equation 14.4.

Just as the total degrees of freedom can be partitioned into model and error components, so too the degrees of freedom due to the model can be further subdivided. A consideration of Fig. 14.7 and material presented earlier suggests some general principles. Let a symbolize the number of levels of A, b the number of levels of B, and so forth. Then the degrees of freedom associated with the A main effect are a - 1 (because a - 1 predictor variables are required to code for factor A) and b - 1 degrees of freedom are associated with the B main effect. Then, because a - 1 times b - 1 predictor variables are required to code the AB interaction, the degrees of freedom associated with the AB interaction are a - 1 times b - 1, the degrees of freedom for A times the degrees of freedom for B. For the example portrayed in Fig. 14.7:

df_A = a - 1 = 3 - 1 = 2
df_B = b - 1 = 4 - 1 = 3
df_AB = (a - 1)(b - 1) = 2 x 3 = 6

In general, the degrees of freedom for any interaction will be the product of the degrees of freedom of its constituents. For example, if df_A = 2, df_B = 3, and df_C = 4, then

df_AB = df_A x df_B = 2 x 3 = 6
df_AC = df_A x df_C = 2 x 4 = 8
df_BC = df_B x df_C = 3 x 4 = 12
df_ABC = df_A x df_B x df_C = 2 x 3 x 4 = 24
Generalized degree of freedom computations for any two- or three-way factorial study are given in Figs. 14.8 and 14.9.

Source                        Degrees of freedom
A main effect                 a - 1
B main effect                 b - 1
AB interaction                (a - 1)(b - 1)
S/AB, subjects within AB      N - ab
TOTAL between subjects        N - 1

FIG. 14.8. Degrees of freedom for a two-way factorial study. The number of levels for factors A and B is symbolized with a and b respectively and the number of subjects is symbolized with N. For other comments, see text.

As noted earlier, the total degrees of freedom between N subjects is N - 1. (Recall from chapter 10 that one degree of freedom is lost when scores are constrained by the overall or grand mean.) These N - 1 degrees of freedom between subjects can be divided into two parts: those concerned with between-group variability and those concerned with how subjects vary within groups:

N - 1 = (G - 1) + (N - G)
The between-groups degrees of freedom are associated with the main effects and interactions of the factorial study and add up to G - 1, the number of predictor variables. (G = ab for a two-way study, abc for a three-way study, etc.) The remaining N - G degrees of freedom are associated with how subjects differ within groups. For that reason, the residual or error terms in Figs. 14.8 and 14.9 are symbolized S/group (the virgule or slash is read as within). S/group (read, subjects within groups) indicates subjects within the groups or cells defined by the factorial study: for example, S/AB and S/ABC for a two- and three-way factorial study, respectively.

Exercise 14.3 Degrees of Freedom for Factorial Studies
This exercise provides practice with computing degrees of freedom for between-subjects factorial studies.
1. Create a table, modeled after the ones shown in Figs. 14.8 and 14.9, but for a four-way factorial study.
2. Based on this table, and assuming that a = 3, b = 2, c = 4, d = 2, and N = 240, compute the degrees of freedom for each main effect and interaction.
3. Compute degrees of freedom for the 3 x 4 factorial study you described in the last exercise. Assume that N = 108. Organize your results in a table like that shown in Fig. 14.8, but instead of the symbols give the computed degrees of freedom. Label the effects with the names you supplied for the last exercise. Verify that the degrees of freedom for the main effects and interactions sum to G - 1 and that all degrees of freedom sum to N - 1.
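The degrees of freedom rules in this subsection lend themselves to a short computational check. The sketch below is illustrative only: the function name `df_table` is invented here, and the sample size of 120 is a hypothetical value chosen for a 3 x 4 x 5 design (the design whose main effects have 2, 3, and 4 degrees of freedom, as in the example above). It lists every main effect and interaction, multiplies the constituent degrees of freedom, and appends the within-groups and total rows.

```python
from itertools import combinations
from math import prod

def df_table(levels, n_subjects):
    """Degrees of freedom for every effect in a completely crossed between-subjects design."""
    names = "ABCDEFGH"[:len(levels)]
    main_df = {name: k - 1 for name, k in zip(names, levels)}
    table = {}
    # Main effects first, then first-order interactions, and so on.
    for order in range(1, len(levels) + 1):
        for combo in combinations(names, order):
            table["".join(combo)] = prod(main_df[f] for f in combo)
    n_groups = prod(levels)
    table["S/" + names] = n_subjects - n_groups     # subjects within groups
    table["TOTAL"] = n_subjects - 1
    return table

# Hypothetical 3 x 4 x 5 study with N = 120 subjects.
table = df_table([3, 4, 5], 120)
for source, df in table.items():
    print(f"{source:8s}{df:5d}")

effects_df = sum(df for source, df in table.items()
                 if source != "TOTAL" and not source.startswith("S/"))
print("effects sum to G - 1:", effects_df == prod([3, 4, 5]) - 1)
```

The final line confirms the bookkeeping described in the text: the degrees of freedom for the effects sum to G - 1, and adding the within-groups term gives N - 1.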
Source                        Degrees of freedom
A main effect                 a - 1
B main effect                 b - 1
C main effect                 c - 1
AB interaction                (a - 1)(b - 1)
AC interaction                (a - 1)(c - 1)
BC interaction                (b - 1)(c - 1)
ABC interaction               (a - 1)(b - 1)(c - 1)
S/ABC, subjects within ABC    N - abc
TOTAL between subjects        N - 1
FIG. 14.9. Degrees of freedom for a three-way factorial study. The number of levels for factors A, B, and C is symbolized with a, b, and c, respectively, and the number of subjects is symbolized with N. For other comments, see text.
14.2 SIGNIFICANCE TESTING FOR MAIN EFFECTS AND INTERACTIONS
The statistical significance of the main effects and interactions appearing in factorial studies can be tested using the hierarchical techniques first presented in chapter 11 and elaborated in chapter 13. Consider first a simple 2 x 2 study. Any 2 x 2 study, like the button-pushing study we have been using as an example, comprises four groups and for that reason will require three predictor variables. Two sets of predictor variables, each representing a different way to contrast the four groups for a planned comparison analysis, were presented in chapter 13. In Fig. 14.4, we presented a third set of predictor variables, one appropriate for a 2 x 2 factorial study. In this case, the three contrast codes represent an A main effect, a B main effect, and an AB interaction. There is little new here. In fact, as noted earlier, a multifactorial study can be viewed simply as a special case of a single-factor planned-comparison study. For both, G - 1 coded predictor variables are required. The planned-comparison study allows some latitude as to which contrasts are selected, whereas the multifactorial study essentially dictates how contrasts are formed, but that is the only difference.
For single-factor planned comparison analyses and for analyses of multifactorial studies, procedures for significance testing are the same. The predictor variables are arranged in a hierarchy of steps; a new predictor variable (or set of variables) is added at each step; and the increase in variance accounted for at each step is tested for significance. A two-way factorial analysis of variance requires three steps. The predictor variable or variables that represent factor A are entered on the first step, those for factor B on the second step, and the variable or variables representing the AB interaction on the third step. The next exercise demonstrates how data would be analyzed for a simple 2 x 2 factorial.
Exercise 14.4 Analysis of a 2 x 2 Factorial Study
The template that results from this exercise allows you to analyze data from a 2 x 2 factorial study.
The data are from the button-pushing study, but for this exercise you assume that the four groups are formed by crossing subject's gender and instruction set. The resulting analysis tells you whether any of the effects (the gender main effect, the instruction main effect, or the gender x instruction interaction) are statistically significant.
1. For this exercise, you will modify the spreadsheet shown in Fig. 12.8. Assume that the data shown there resulted from a 2 x 2 factorial study. Let factor A be gender of subject (male or female) and factor B the instructions given (set I or set II). Thus the resulting four groups could be symbolized M-I, M-II, F-I, and F-II. Assume that subjects 1-4 are in the M-I, subjects 5-8 in the M-II, 9-12 in the F-I, and 13-16 in the F-II group. Label columns appropriately.
2. Enter contrast codes for gender (-1 = male, +1 = female), instruction set (-1 = set I, +1 = set II), and their interaction (gender x instruction) in the appropriate columns.
3. Do step 1 (the A main effect) of the hierarchic analysis. That is, regress number of button pushes on the coded predictor variable representing gender. Determine the values for the parameters a and b1, and use them to compute predicted scores. What are the values for the predicted scores? Why do they have these values? What are the values of R² and F for step 1?
4. Do step 2 (adding the B main effect), that is, regress number of button pushes on the predictor variables representing both gender and instruction. Correct the prediction equation so that it now takes both the A and B main effects into account. Now what are the values for the predicted scores? Again, why do they have these values? What are the values of R² and F for step 2?
5. Do step 3 (adding the AB interaction), that is, regress number of button pushes on predictor variables representing the gender and instruction main effects and the gender x instruction interaction. Again, correct the prediction equation so that it now takes the A and B main effects and the AB interaction into account. Now what are the values for the predicted scores and why do they assume these values? What are the values of R² and F for step 3?
6. Summarize the results of all three steps of this hierarchic analysis in a table, organized like that shown in Fig. 13.1, showing R² change and its significance for each step.
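The three hierarchic steps of Exercise 14.4 can also be sketched outside a spreadsheet. The Python below is an illustration under stated assumptions, not the authors' template: the helper names `solve` and `regress` are hypothetical, the scores and contrast codes are those of Fig. 14.10, and each R² change is tested against the error term of the final (step 3) model, as the text describes.

```python
def solve(matrix, rhs):
    """Solve a small linear system by Gauss-Jordan elimination with pivoting."""
    n = len(rhs)
    aug = [row[:] + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * p for x, p in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def regress(y, predictors):
    """Return (R2, coefficients) for y regressed on an intercept plus predictors."""
    columns = [[1.0] * len(y)] + [[float(v) for v in p] for p in predictors]
    # Normal equations: (X'X) b = X'y
    xtx = [[sum(u * v for u, v in zip(ci, cj)) for cj in columns] for ci in columns]
    xty = [sum(u * v for u, v in zip(ci, y)) for ci in columns]
    coefs = solve(xtx, xty)
    predicted = [sum(b * c[i] for b, c in zip(coefs, columns)) for i in range(len(y))]
    mean_y = sum(y) / len(y)
    ss_total = sum((v - mean_y) ** 2 for v in y)
    ss_error = sum((v - p) ** 2 for v, p in zip(y, predicted))
    return 1 - ss_error / ss_total, coefs


# Number of button pushes for subjects 1-16 (Fig. 14.10).
y = [102, 125, 95, 130, 79, 93, 75, 69, 43, 82, 69, 66, 101, 94, 84, 69]
gender = [-1] * 8 + [1] * 8                 # -1 = male, +1 = female
instruction = ([-1] * 4 + [1] * 4) * 2      # -1 = set I, +1 = set II
interaction = [g * i for g, i in zip(gender, instruction)]

steps = [
    ("A, gender", [gender]),
    ("B, instruction", [gender, instruction]),
    ("AB, gender x instruction", [gender, instruction, interaction]),
]

n = len(y)
r2_final, coefs = regress(y, steps[-1][1])
df_error = n - 1 - len(steps[-1][1])        # N - 1 - K for the final model

results = []
previous_r2, previous_k = 0.0, 0
for label, predictors in steps:
    r2, _ = regress(y, predictors)
    df_step = len(predictors) - previous_k  # predictors added at this step
    r2_change = r2 - previous_r2
    f_ratio = (r2_change / df_step) / ((1 - r2_final) / df_error)
    results.append((label, r2, r2_change, f_ratio))
    previous_r2, previous_k = r2, len(predictors)

for label, r2, r2_change, f_ratio in results:
    print(f"{label:26s} R2 = {r2:.3f}  change = {r2_change:.3f}  F = {f_ratio:.2f}")
```

The final coefficients are the intercept a and the three regression weights; with these contrast codes they reproduce the four cell means as predicted scores.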
The spreadsheet resulting from step 3 of the last exercise is given in Fig. 14.10. Spreadsheets for steps 1 and 2 differ from this one in the proportion of variance accounted for and in the predicted values (ask yourself why the predicted values at each step make sense). This analysis of a 2 x 2 factorial study and the planned comparison analysis pursued in the last chapter are comparable in a number of ways. For both, predictor variables are added to the regression equation step by step, increases in R² are computed, and the increases are tested for statistical significance using the error term associated with the last step (the final model). Only the predictor variables (how they are formed and what they mean) differentiate the two analyses. For the 2 x 2 factorial analysis, the three contrasts represent the A main effect, the B main effect, and the AB interaction, and the significance of each is determined by whether its associated R² change is significantly different from zero. Note that an omnibus test, one testing the significance of all three predictor variables entered in one step, is not performed. The questions of interest for this 2 x 2 factorial are embodied in the three separate predictor variables.
14.3 INTERPRETING SIGNIFICANT MAIN EFFECTS AND INTERACTIONS
The results of the two-way ANOVA you just performed reveal a significant main effect for subject's gender (F(1,12) = 7.51, p < .05) and a significant gender x instruction interaction (F(1,12) = 14.71, p < .01). (The critical value for F(1,12) is 4.75 at the .05 level and 9.33 at the .01 level.) The mean number of button pushes for all subjects was 86; means for men and women separately were 96 and 76, whereas means for instruction sets I and II were 89 and 83. The analysis of variance results indicate that the means for men and women were significantly different but the means for instruction were not significantly different. The main effect for subject's gender, however, is qualified by an interaction with instruction and, as a general rule, such main effects should not be emphasized, or even discussed, until the qualifying interaction is understood.
S      #BPS   Sex   Inst   AB     Y'     Y-My    Y'-My   Y-Y'
       Y      X1    X2     X3
1      102    -1    -1      1     113     16      27     -11
2      125    -1    -1      1     113     39      27      12
3       95    -1    -1      1     113      9      27     -18
4      130    -1    -1      1     113     44      27      17
5       79    -1     1     -1      79     -7      -7       0
6       93    -1     1     -1      79      7      -7      14
7       75    -1     1     -1      79    -11      -7      -4
8       69    -1     1     -1      79    -17      -7     -10
9       43     1    -1     -1      65    -43     -21     -22
10      82     1    -1     -1      65     -4     -21      17
11      69     1    -1     -1      65    -17     -21       4
12      66     1    -1     -1      65    -20     -21       1
13     101     1     1      1      87     15       1      14
14      94     1     1      1      87      8       1       7
15      84     1     1      1      87     -2       1      -3
16      69     1     1      1      87    -17       1     -18
Sum=  1376     0     0      0    1376      0       0       0
N=      16    16    16     16      16     16      16      16
Mean=   86     0     0      0
VAR=                                    464.9   305.0   159.9
SD=                                     21.56
a,b=    86   -10    -3     14

         SStot    SSmod    SSerr
SS=       7438     4880     2558
df=         15        3       12
MS=      495.9     1627    213.2
SD'=     22.27
R = 0.81   R² = 0.656   adjusted R² = 0.57   F = 7.631
FIG. 14.10. Spreadsheet for determining the effect of gender, instruction, and their interaction on number of button pushes after step 3. Rows 3-18 for columns J-M are not shown.
After all, a significant interaction indicates that the means for the groups formed by crossing the constituent factors differ among themselves in ways that cannot be described by invoking the significant main effects alone. An interaction signals a conditional relation, which signifies that means vary among themselves with respect to one factor in ways that depend on the level of the other factor. For example, instead of women scoring higher than men in all circumstances (a main effect), they might score higher only in some circumstances (a gender x circumstance interaction). The problem, then, is to understand exactly how the groups identified by the significant interaction—the four groups formed by crossing gender and instruction set—differed. This is not an entirely new problem for us. Whether groups are identified by a significant omnibus F test in a one-way analysis of variance, as in chapter 12, or by a significant interaction, as in the last exercise, differences among them can be described using the Tukey post hoc test detailed in chapter 13. In fact, because the data used to demonstrate the Tukey post hoc test were used in the last exercise as well, the computations needed to understand the significant gender x instruction interaction detected in the last exercise have already been done. It is only a matter of relabeling the want/has children groups used in chapter 13 (see Fig. 13.5) to conform with the gender by instruction design used in this chapter. Group means, labeled for the present example and subscripted to indicate Tukey post hoc results, are given in Fig. 14.11. These results could be interpreted as follows. Apparently men were particularly responsive to instruction set I. The instruction set used did not affect how often women pushed the button (65 was not significantly different from 87), nor did gender affect button-pushing for those instructed with set II (79 was not significantly different from 87). 
However, men instructed with set I pushed the button 113 times, on average, which was significantly different from the means for women instructed with set I (M = 65) and for men instructed with set II (M = 79). Given these results, it would be misleading to emphasize the significant gender main effect. True, the mean number of button presses was greater for men than women, but this effect was confined to subjects who were exposed to instruction set I. There was no significant gender difference for subjects exposed to instruction set II.
Exercise 14.5 Interpreting Interaction in a 2 x 2 Factorial Study
This exercise provides practice in interpreting significant interaction results for a 2 x 2 factorial study.
1. If the MSerror were 92.5 instead of 213.2, the post hoc results would have been different, as you demonstrated when doing Exercise 13.3. If the groups were those defined by the present gender by instruction factorial, not the marital/parental groups of chapter 13, how would you interpret the post hoc results?
2. Now interpret the post hoc results for the significant gender x instruction interaction if the MSerror were 23.2.
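The pairwise comparisons behind Fig. 14.11 can be sketched as follows. This is a hedged illustration, not the procedure as printed in chapter 13: it assumes the usual Tukey critical difference, q times the square root of MSerror / n, and the studentized range value `Q_CRITICAL` (alpha = .05, four groups, 12 error degrees of freedom) is an assumption read from a standard table, not a value given in this section. The function name `tukey_pairs` is hypothetical.

```python
from itertools import combinations
from math import sqrt

# Cell means for the four gender x instruction groups (four subjects each).
means = {"M-I": 113, "M-II": 79, "F-I": 65, "F-II": 87}
n_per_group = 4
ms_error = 2558 / 12  # SSerror / df_error from the step 3 analysis

# ASSUMPTION: studentized range value for alpha = .05, 4 groups, 12 error df,
# taken from a standard table rather than from this chapter.
Q_CRITICAL = 4.20


def tukey_pairs(means, ms_error, n_per_group, q_critical):
    """Return the critical difference and, per pair, whether the means differ."""
    critical_difference = q_critical * sqrt(ms_error / n_per_group)
    outcome = {}
    for g1, g2 in combinations(means, 2):
        outcome[(g1, g2)] = abs(means[g1] - means[g2]) > critical_difference
    return critical_difference, outcome


critical_difference, outcome = tukey_pairs(means, ms_error, n_per_group, Q_CRITICAL)
print(f"critical difference = {critical_difference:.1f}")
for pair, differs in outcome.items():
    print(pair, "differ" if differs else "do not differ")
```

Changing `ms_error` to the values named in Exercise 14.5 shows how a smaller error term shrinks the critical difference and changes which pairs of means are declared different.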
              Instruction
Gender     Set I    Set II
Male       113a     79b
Female     65b      87ab
FIG. 14.11. Post hoc analysis for the gender x instruction interaction. Scores are means based on four subjects each. Means that do not differ significantly according to the Tukey test, alpha = .05, share a common subscript.
In the last several paragraphs we have considered how to explicate significant interactions detected by an analysis of variance. Let us return for a moment to the previous problem: the initial analysis of the main effects and interactions of a factorial study. For Exercise 14.4 you were asked to organize the results in a table showing R² change and its significance for each step. It may have occurred to you, as you performed the necessary computations, that such work could be done more easily with a spreadsheet. For the next exercise, you are asked to create a template that displays and summarizes the results of the gender by instruction analysis. This new format merges the best features of the hierarchic table we have been using (e.g., Fig. 13.1 and the table you created for Exercise 14.4) with a typical analysis of variance source table (e.g., Fig. 12.9) and is used from now on to summarize all ANOVA results.
14.4
MAGNITUDE OF EFFECTS AND PARTIAL ETA SQUARED
When using multiple-regression computations to analyze the effects of a factorial design, the change in R² at each step (or the change in the sum of squares) is of primary interest because it indicates the variance accounted for by the main effects and interactions associated with each step. In earlier tables, we displayed the total R² at each step along with its degrees of freedom and associated F ratio, and also the change in R² at each step along with the degrees of freedom and the associated F ratio for R² change. We did this as a learning device, to show how hierarchic regression worked. The analysis of variance source tables we now introduce are more economical and more conventional. They also add a new and useful magnitude-of-effect statistic, partial η² (a Greek lowercase eta, squared). These new tables retain the total R² as of each step (the proportion of criterion variance accounted for by all variables in the equation as of this step), but otherwise the statistics displayed refer to the step; that is, they characterize the source of variance (A main effect, B main effect, AB interaction, etc.) whose predictor variables were added at that step.
The statistics in the source table include each step's R² change, in part because this is a magnitude of effect statistic with which you have become familiar but also because it is used to compute the SS values associated with the step. SS are traditional in an analysis of variance source table and are computed by multiplying R² change for the step by the total SS for the criterion variable. The degrees of freedom for each step are the number of predictor variables entered at that step, and the residual or error degrees of freedom for the last step are, as usual, the number of cases minus 1 minus the number of predictor variables (N - 1 - K). Finally, the traditional ANOVA mean square (MS) for each step is computed by dividing the SS by its degrees of freedom, and the F ratio for each effect is computed by dividing the effect MS by the error MS.
A statistic new to you is η² (eta squared). In the traditional analysis of variance literature it is defined as
$$\eta^2 = \frac{SS_{\text{effect}}}{SS_{\text{total}}}$$
In other words, it is identical to R² change, which is why we have not introduced it earlier. (The estimated population value, analogous to adjusted R², is symbolized with a Greek lowercase omega squared, ω²; see Hays, 1973, and Tabachnick & Fidell, 2001.) Thus we could label the same statistic either R² change, reflecting a multiple-regression heritage, or η², reflecting an analysis of variance heritage; the value would be the same. More useful is a variant of η², the partial η², which is defined as
$$\text{partial } \eta^2 = \frac{SS_{\text{effect}}}{SS_{\text{effect}} + SS_{\text{error}}}$$
As Tabachnick and Fidell (2001) noted, the value of η² for a particular variable depends on the number and significance of other variables in the model, whereas partial η² isolates the effect for a particular variable more. Further, partial η² makes more sense than η² or R² change as a magnitude of effect statistic in the context of repeated-measures designs (see next two chapters). Thus partial η² lends itself to comparison within and across studies in a way η² does not, and for that reason we recommend its use and incorporate it in our analysis of variance source tables (where we label it pη²). However, be aware that partial η² may not be the best statistic when comparing effects of a particular variable across studies that use different designs; the recently introduced generalized η² may be better, especially for repeated measures designs (see Olejnik & Algina, 2003, for details).
One comment about η² is in order: SPSS and other statistical packages optionally provide values for partial η². Nonetheless, when investigators report those values in research reports, they are often (incorrectly) labeled η². If you see a value labeled η² in an article, chances are very good it is actually a partial η². From here on, we incorporate partial η², sums of squares, mean squares, and F ratios in a more traditional analysis of variance source table, which is the point of the next exercise.
Exercise 14.6 ANOVA Source Table for a 2 x 2 Factorial Study
The template developed for this exercise adds an analysis of variance source table to the template you developed for Exercise 14.4 to analyze a 2 x 2 factorial. It provides a summary of the results for the button-pushing study, assuming the four groups were formed by crossing the two factors, gender and instruction set.
1. Add a source table to the spreadsheet shown in Fig. 14.10. You may use Fig. 14.12 as a guide if you wish.
2. Establish cells that contain the number of subjects and the number of levels for factors A and B. Enter formulas, not values, in cells displaying the degrees of freedom. Thus if your design changes, you can change just a few cells and the appropriate degrees of freedom for your new design will be computed and displayed automatically.
3. If the multiple-regression output from steps 1, 2, and 3 is still in your spreadsheet, then the cells in your source table that display the total R² for each step can point to the appropriate cells of the multiple-regression output and you will not need to reenter these values.
4. Earlier you only did three steps but there was an implicit fourth step, which you should add now. Remember that R²total = R²model + R²error and SStotal = SSmodel + SSerror. If at step 4 you had added coded predictor variables for the error term (one for each degree of freedom associated with the error term, a matter explained in the next chapter), you would have accounted for all of the variance. Accordingly, you can enter 1 (all variance accounted for) as the step 4 entry in the total (accumulative) R² column. Likewise, you can enter the overall sum of squares (or a pointer to it) as the step 4 entry in the total (accumulative) SS column.
5. For each step, enter the appropriate formulas for total sums of squares, for changes in R², for changes in sums of squares, and for partial η². Do the changes in R² and SS sum to 1 and SStotal as they should?
6. Finally, enter formulas for mean squares and formulas for the F ratios that test the significance of the mean squares.
At this point, your spreadsheet should look like the one shown in Fig. 14.12. From it you can determine significance for the A and B main effects and the AB interaction for a two-way factorial, along with the magnitude of each of these effects as assessed with a partial η². Significance testing for the main effects and interactions of more complex factorial studies follows the same strategy demonstrated here. These procedures can be summarized as follows. When completely crossed, the levels of the factors of a factorial study define G groups. G - 1 predictor variables, formed following the rules and principles presented earlier, are required to code the information implied by such designs. For a 2 x 2 (or any factorial with only two levels per factor), main effects and interactions are associated with one predictor variable each.
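The source-table arithmetic described above (SS from R² change, MS from SS and degrees of freedom, F from the error MS, and partial η² from the effect and error SS) can be sketched in Python. This is an illustration only, not the authors' spreadsheet; the function name `source_table` and the dictionary keys are hypothetical. The cumulative R² values are written as fractions of the total SS (7438) so that rounding does not accumulate.

```python
def source_table(steps, ss_total, n_subjects):
    """Build ANOVA source-table rows from cumulative R2 values.

    steps: list of (label, cumulative R2, df added at this step), in order of entry
    """
    k = sum(df for _, _, df in steps)          # predictor variables in final model
    df_error = n_subjects - 1 - k              # N - 1 - K
    ss_error = (1 - steps[-1][1]) * ss_total
    ms_error = ss_error / df_error
    rows, previous_r2 = [], 0.0
    for label, r2, df in steps:
        r2_change = r2 - previous_r2
        ss = r2_change * ss_total              # SS = R2 change x total SS
        ms = ss / df
        rows.append({
            "source": label, "R2": r2, "R2_change": r2_change,
            "SS": ss, "df": df, "MS": ms,
            "F": ms / ms_error,
            "partial_eta2": ss / (ss + ss_error),
        })
        previous_r2 = r2
    # Implicit last step: the error term accounts for the remaining variance.
    rows.append({
        "source": "error", "R2": 1.0, "R2_change": 1 - steps[-1][1],
        "SS": ss_error, "df": df_error, "MS": ms_error,
    })
    return rows


SS_TOTAL = 7438
steps = [
    ("A, gender", 1600 / SS_TOTAL, 1),
    ("B, instruction", 1744 / SS_TOTAL, 1),
    ("AB, gend x inst", 4880 / SS_TOTAL, 1),
]
table = source_table(steps, SS_TOTAL, n_subjects=16)
for row in table:
    print(row["source"], round(row["SS"]), row["df"], round(row["MS"], 1))
```

Summing the `R2_change` entries over all rows, including the error row, should give 1, which mirrors the check requested in step 5 of Exercise 14.6.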
Exercise 14.7 SPSS Analysis of a 2 x 2 Factorial Study
In this exercise you will learn how to use the General Linear Model procedure in SPSS to conduct a 2 x 2 analysis of variance for the button-pushing study.
1. Create a new SPSS data file containing variables for the number of button pushes (bps), gender (sex), and instruction set (inst). Enter the data from Fig. 14.10. You should create value labels for sex and instruction set to make the output more readable.
2. Select Analyze->General Linear Model->Univariate from the main menu. Move bps to the Dependent Variable window, and sex and inst to the Fixed Factor(s) window.
3. Click on Options and check the boxes next to Descriptive statistics, Estimates of effect size, and Homogeneity tests. Also, in the Estimated Marginal Means box move [Overall], sex, inst, and sex*inst to the window labeled Display means for. Click Continue and then OK.
4. Examine the Descriptive Statistics box, where means are displayed for each of the cells and marginals. Make sure these values agree with your spreadsheets. Now scroll down to the boxes under Estimated Marginal Means. The values for the grand mean, sex and instruction set main effects, and the interaction should be the same as those found in the Descriptive Statistics box. In the case of an unbalanced design (i.e., one in which cells differ in size), the descriptive statistics would provide traditional weighted means, whereas the estimated marginal means would be unweighted. Typically, when cells are unequal due to subject attrition or random factors, you would want to report the unweighted means for any significant effects resulting from your analysis.
5. Scroll back up to the box labeled Levene's Test of Equality of Error Variances. Notice that this test is not statistically significant, indicating that the assumption of equal variances is met.
6. Finally, examine the box labeled Tests of Between-Subjects Effects. Look at the lines for SEX, INST, SEX*INST, Error, and Corrected Total. Check that the sums of squares, df, mean square, F, and partial eta-squared values correspond to your spreadsheet calculations.
7. For additional practice you should try reanalyzing the data from Exercise 14.8 using SPSS. Do all of the relevant statistics agree with your spreadsheet analysis?
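The distinction in step 4 between weighted and unweighted marginal means can be made concrete with a small sketch. The scores below are HYPOTHETICAL and are not taken from the text; the function names are likewise illustrative. A weighted marginal mean averages all scores at a level, so larger cells count more; an unweighted marginal mean averages the cell means, so each cell counts equally.

```python
# HYPOTHETICAL scores for an unbalanced 2 x 2 design (not from the text).
cells = {
    ("male", "set I"): [10, 12, 14],
    ("male", "set II"): [20],
    ("female", "set I"): [8, 10],
    ("female", "set II"): [16, 18],
}


def mean(values):
    return sum(values) / len(values)


def weighted_marginal(cells, factor_index, level):
    """Mean of all scores at this level; each cell counts in proportion to its size."""
    scores = [s for key, vals in cells.items() if key[factor_index] == level for s in vals]
    return mean(scores)


def unweighted_marginal(cells, factor_index, level):
    """Mean of the cell means at this level; each cell counts equally."""
    cell_means = [mean(vals) for key, vals in cells.items() if key[factor_index] == level]
    return mean(cell_means)


for level in ("male", "female"):
    print(level, weighted_marginal(cells, 0, level), unweighted_marginal(cells, 0, level))
```

For the male subjects the two means differ because the cells hold three scores and one score; for the female subjects the cells are equal in size, so the two means coincide.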
Factors with more than two levels are associated, not with a single predictor, but with a set of predictor variables instead. Similarly, interactions involving such factors will also be associated with a set of predictor variables. In such cases the question remains, how much additional variance is accounted for when the set of variables associated with a particular main effect or interaction is added to the regression equation? Increases in R² are tested exactly the same way for individual variables or sets. As always, the degrees of freedom in the numerator of the F ratio reflect the number of predictor variables added, whether one or more than one (see Equation 13.2), whereas degrees of freedom in the denominator reflect the residual degrees of freedom for the final model (N - 1 minus the number of predictor variables used for the final step). Results from any two-way factorial analysis of variance can be summarized and presented as shown in Fig. 14.12.
Before any significant main effects are emphasized, any qualifying interactions should first be analyzed and understood. Significant interactions can be analyzed using the Tukey post hoc test procedures described in chapter 13. This is not the only approach, of course. Just as there are several variants of post hoc tests, so too there are several approaches to analyzing interactions. The Tukey is emphasized here because of its generality, general acceptability, and simplicity. (For additional approaches to analyzing significant interactions see Winer, 1971, and Keppel, 1982.) Main effects require post hoc tests only if more than two groups are involved. Interactions, however, necessarily require post hoc explication. After all, even if factors A and B both involve only two groups (as in the previous exercise), their interaction defines four groups, and hence a post hoc test is required to understand the nature of the interaction. For the next exercise, you will analyze a 2 x 3 factorial study and will be asked to explain a significant main effect involving three groups.
Step  Source            R²      R² change   SS     df   MS      F       pη²
1     A, gender         0.215   0.215       1600   1    1600    7.506   0.385
2     B, instruction    0.234   0.019       144    1    144     0.676   0.053
3     AB, gend x inst   0.656   0.422       3136   1    3136    14.71   0.551
4     S/AB, error       1       0.344       2558   12   213.2
      TOTAL btwn Ss             1           7438
FIG. 14.12. Spreadsheet showing an analysis of variance source table for a 2 x 2 factorial study analyzing the effect of gender, instruction, and their interaction on number of button pushes.
Exercise 14.8 Analysis of a 2 x 3 Factorial Study
The exercise provides additional practice in a two-way analysis of variance. In this case, factor A has two levels and factor B has three levels. This exercise also provides additional practice in interpreting post hoc results.
1. For this exercise, you reanalyze the data from the gender smiling study last shown in Fig. 13.11. Retain the 20 subjects and the number of smiles shown for each infant but ignore age. This study involves two factors, gender (factor A) and partner (factor B). The first 10 subjects are males and the last 10 females, as before. However, this time assume that the first three male (subjects 1-3) and the first four female infants (subjects 11-14) interacted with their mother, that the second four male (subjects 4-7) and the second three female infants (subjects 15-17) interacted with their father, and the rest interacted with a stranger (subjects 8-10 and 18-20). Set up contrast codes to represent the A main effect, the B main effect, and the AB interaction.
2. Analyze the data for this 2 x 3 factorial study. You will want to incorporate an analysis of variance source table like that shown in Fig. 14.12, so you may find it easier to modify the spreadsheet used in the last exercise rather than modifying the spreadsheet shown in Fig. 13.11. Or, if you are becoming adept with your spreadsheet program, you may combine the two spreadsheets. Or, for practice and to assure yourself that you understand all the required formulas, you could create this spreadsheet from scratch. However you do it, the goal is to determine whether the gender effect, the partner effect, and their interaction are statistically significant, organizing your results in a source table like that shown in Fig. 14.12.
3. If you have performed the analysis correctly, you should have found that the gender effect was not significant (it approached but did not reach the .05 level of significance), that the partner effect was significant, and that their interaction was not significant. Perform a post hoc test on the three means representing the number of smiles seen when interacting with mothers, fathers, and strangers and interpret your results. Think about this carefully. This exercise is a stringent, but quite realistic, test of your understanding of post hoc analysis.
From the previous exercise you should have gained an understanding of the general analysis of variance approach to factorial studies. To begin with, you create contrast codes for each main effect and interaction. Then, using hierarchic multiple-regression procedures, you determine the statistical significance and magnitude of effect for each main effect and interaction. Your first task is to understand and interpret any significant interactions. In order to do this you will need to use post hoc tests. The second task is to explicate any significant main effects that are not qualified by higher order interactions. You will need post hoc tests for this only if a main effect involves more than two groups.
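When a factor has three levels, its main effect needs two coded predictor variables, and its interaction with a two-level factor needs two more, each formed as a product of main-effect codes. The sketch below illustrates this for a 2 x 3 design. It is an assumption-laden example, not the coding printed in the book: the particular contrast values in `A_CODES` and `B_CODES` are one conventional choice among several, and the function name `design_codes` is hypothetical.

```python
from itertools import product

# ASSUMPTION: one conventional set of contrast codes; other sets are possible.
A_CODES = {"male": [-1], "female": [1]}
B_CODES = {
    "mother": [2, 0],      # contrast 1: mother vs. father and stranger
    "father": [-1, 1],     # contrast 2: father vs. stranger
    "stranger": [-1, -1],
}


def design_codes(a_codes, b_codes):
    """Return {cell: A codes + B codes + AB product codes} for a crossed design."""
    rows = {}
    for (a_level, a), (b_level, b) in product(a_codes.items(), b_codes.items()):
        interaction = [x * y for x in a for y in b]  # products of main-effect codes
        rows[(a_level, b_level)] = a + b + interaction
    return rows


codes = design_codes(A_CODES, B_CODES)
for cell, row in codes.items():
    print(cell, row)
```

Each of the six cells receives five codes (one for A, two for B, two for AB), which equals G - 1. Across the six cells these columns sum to zero and are mutually orthogonal; with unequal cell sizes, as in Exercise 14.8, the same columns need not be uncorrelated across subjects.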
Analyses of variance appropriate for studies including more than one between-subjects factor have been described in this chapter. You have learned how to generate higher order terms for two-way, three-way, and so forth, factorial studies, how to represent main effects and interactions with coded predictor variables, how to compute the appropriate degrees of freedom for each effect, and how to test the statistical significance of each effect using hierarchic multiple-regression procedures. In addition, you have also learned how to interpret significant conditional (interactive) effects and when post hoc tests should be used to explicate significant main effects and interactions. At this point, you should be able to analyze completely crossed factorial studies of any order and understand others' analyses. Factorial studies represent by far the most commonly used analysis of variance design. As noted earlier in this chapter, there are other more complex possibilities (e.g., various kinds of incomplete designs), but these are regarded as advanced topics beyond the scope of this book. All of these analyses are alike in that factors are between subjects. However, there is another fairly common alternative. Factors can also be within subjects; that is, subjects can contribute repeated measurements to a study. How data from such studies are analyzed constitutes the topic of the next two chapters.
15
Single-Factor Within-Subjects Studies
In this chapter you will:
1. Learn about repeated measures or within-subjects factors.
2. Learn about studies that include within-subjects factors and when they should be used.
3. Learn how to analyze data from a single-factor within-subjects study.
The statistical procedures discussed in the previous chapters are appropriate for assessing either the effect of a single factor on a dependent measure of interest or the effects of multiple factors (their main effects and interactions) combined in a completely crossed design. A restriction has been that factors must operate between subjects; that is, subjects may contribute one, and only one, score to the analysis. Thus, for these between-subjects studies, no subject is represented at more than one level of any factor. In other words, each subject appears in exactly one group or cell of the study.
As suggested earlier, there is another possibility. A subject might be assessed repeatedly and the repeated assessments could represent levels of a within-subjects (or repeated measures) factor. Such factors would allow us to investigate, for example, changes over time, either those occurring naturally or perhaps with an intervening experimental treatment. This chapter explores how within-subjects factors can be incorporated into factorial studies, and how their effects can be analyzed.
15.1 WITHIN-SUBJECTS OR REPEATED-MEASURES FACTORS
A factor is said to be within subjects when the subjects (or dyads, families, or whatever sampling unit is used) in a study are assessed more than once and when those repeated assessments form the levels of the factor. For example, the within-subjects factor could be time and the levels might represent time 1 and time 2. Such repeated assessments are rightfully popular because they allow researchers to address questions like, do individuals tend to score higher at time 2, after a particular treatment (or event)? Or, the within-subjects factor could be setting and the levels might be laboratory and home, or day and night. Again, the purpose is to determine if the factor has any effect on the dependent measure. For example, we might want to determine if scores are systematically higher at night, in the home, and so forth.
The unit of analysis (or sampling unit) is not always an individual. If married couples were studied, for example, a husband's and wife's scores would form repeated assessments. The factor would be spouse, whose two levels are husband and wife. Marital satisfaction scores, assessed separately for husbands and wives, would then represent repeated measures for the couple. Analysis of these scores would let us determine whether husbands and wives differed significantly with respect to marital satisfaction.
When selecting an appropriate design for analysis, there may occasionally be some question as to whether a particular factor is between or within subjects. Such questions can usually be resolved by considering how units are sampled. For example, a single-factor two-group study might consist of husbands and wives. If the wife group is formed by selecting wives at random and the husband group in a similar way, with no restriction that the husbands be matched to the wives already selected, then the spouse factor is between subjects. The spouse factor is within subjects, however, if husbands are linked to wives, that is, if the husband group consists of the husbands of the wives previously selected.
As you can see from the preceding paragraph, factors are not inherently between or within subjects. Depending on the sampling design, the same factor can be between subjects in one study, within subjects in another. Moreover, several factors (some between subjects, some within subjects) can be combined in a single factorial study. In the previous chapter we discussed factorial studies and assumed that all factors were between subjects.
In this chapter and the next we consider the possibility that some or all of the factors in a factorial study can be within subjects instead.
Factorial Studies With Within-Subjects Factors
A within-subjects factor, as already noted, is also called a repeated-measures factor. Similarly, a study including any within-subjects factors is called a repeated-measures study. Such studies could consist solely of repeated-measures factors, or could be mixed, containing both between- and within-subjects factors. Possibilities include a one-between, one-within two-factor study; a no-between, two-within two-factor study; a two-between, one-within three-factor study; a one-between, two-within three-factor study; and so forth. The general case, then, is a u-between, v-within factorial study, where u represents the number of between-subjects and v the number of within-subjects factors.
Consider the 2 x 2 between-subjects study analyzed in the previous chapter. The two dimensions were gender of subject (male/female) and instruction (set I/set II). The male and female subjects were not linked in any way, and different subjects received different instructions, so both factors were between subjects. This is a straightforward example of a two-factor between-subjects (two-between, no-within) factorial study. However, if the male and female subjects received two treatments, first one instruction set and then the other, instruction would be a within-subjects factor. The number of button pushes would be assessed after each instruction set, and each subject would contribute two scores to the analysis. This is an example of a mixed two-factor (one-between, one-within) factorial study: The gender of subject is the between-subjects factor and the instruction set is the within-subjects factor.
There is still another variant of the basic 2 x 2 study. Instead of sampling men and women randomly, we might instead select married couples. If husband and wife were each exposed to instruction set I and instruction set II, then both gender (husband/wife) and instruction (set I/set II) would be within-subjects factors and each couple would contribute four scores to the analysis (number of button pushes for both spouses for both instruction sets). This is an example of a two-factor, within-subjects (no-between, two-within) factorial study.
For simplicity, single-factor within-subjects studies are emphasized in this chapter, leaving discussion of multifactor studies that include repeated factors, such as those described in the preceding paragraphs, until chapter 16. Still, you should be aware that the more complex designs are quite common. For example, when studies include a treatment like instruction set as a within-subjects factor, order of presentation is commonly included as a between-subjects factor because, if it is not, the results are difficult to interpret unambiguously. Imagine that in a single-factor within-subjects study instruction set I was always presented first, instruction set II second, and the analysis of variance detected a main effect for instruction set. Then we would not know whether subjects always pushed the button more during the first session or whether they were reacting in particular to instruction set I. In such a case, we would say that order and instruction set were confounded, combined in such a way that we cannot disentangle their separate effects. The solution is to add a between-subjects factor of order to the study design. For the present example, half of the males and half of the females would be exposed to instruction set I first and then set II (order 1), whereas the remaining half would be exposed first to instruction set II and then set I (order 2), thus counterbalancing instruction set.
Then if there is a main effect for instruction set, we know that subjects are reacting differently to the different instruction sets, no matter which is presented first. On the other hand, if subjects are always more reactive to whichever instructions are presented first, then there would be a main effect for order. In the next chapter we describe how to analyze data from mixed two-factor studies such as this, but for now we focus on the simpler single-factor case.
Advantages of Within-Subjects Factors
Studies including within-subjects factors have one notable strength: Subjects assessed repeatedly serve as their own controls. This means that variability between subjects is irrelevant to testing within-subject effects. Hence variance due to between-subject variability is removed from consideration at the outset, which reduces the residual or error term used for tests of within-subject effects considerably. A smaller MSerror usually results in a larger F ratio, so, as a general rule, we are more likely to find an effect significant if tests involve within-subjects instead of between-subjects variables. In other words, for the same number of subjects, tests involving within-subjects instead of between-subjects factors are usually more powerful (see chap. 16).
This set of circumstances presents an interesting opportunity. The same power afforded by a test of a between-subjects factor can be achieved with considerably fewer subjects if that factor reasonably can be assessed within instead of between subjects. Fewer subjects typically means greater economy of effort, so researchers are well-advised to consider whether repeated-measures procedures might be appropriate for a particular study.
Unfortunately, repeated assessments are not always possible. There are two major stumbling blocks to their exclusive use. First, some tests are reactive (subjects react differently the second time, not due to different circumstances, but due to having been assessed previously), and if subsequent scores are indeed affected by earlier administrations of the same test, repeated measures are ruled out. For example, imagine we wanted to study the effect of three different drugs on memory. We could use either a between-subjects design (one-third of the subjects receive drug A, one-third B, one-third C) or a within-subjects design (all subjects receive all three drugs at different times and are tested after each). Given the same number of subjects, the within-subjects design is usually more powerful. But if subjects remember elements of the test, and so subsequent scores are affected by previous scores, the investigator may have no choice but to use a between-subjects design. If tests are reactive, meaning that subsequent scores are contaminated by previous scores, fresh subjects may be needed.
Further, some factors, by their very nature, can only be between subjects, which is a second stumbling block to the exclusive use of within-subjects factors. Under usual conditions, families are either lower income or middle income (assuming this is a reasonable categorization), and usually we assume this status is relatively enduring. A repeated-measures design would not be possible. Or, if it were, it would study only that (perhaps atypical) subset of families who change status and would undoubtedly focus on specific questions concerning social mobility. Other examples of inherently between-subjects factors are relationship and parental status. At a given time, a person either is in a committed relationship or not, and either has children or not, so study of either of these factors requires a between-subjects design. Of course, one could select subjects and wait until they had children, in which case parental status could be a within-subjects factor. Such a study would take considerable time, although if the investigator were interested specifically in questions concerning the transition to parenthood, the time might be justified.
As we have seen, factors can be either between or within subjects, and any number of factors of either kind can be combined in a factorial design. The advantages of factorial designs that include within-subjects factors (either all-within or mixed designs) are the same as the advantages of purely between-subjects designs: Any effects the different factors have on the dependent variable, either singly or in interaction with other factors, can be evaluated. Sometimes logical or practical considerations require that a particular factor be between subjects, sometimes that it be within subjects. Given a choice, though, it usually makes sense to opt for the within-subjects version of a factor. As noted earlier, the reasons are largely economic. If an effect (of a particular magnitude) exists, usually it will be detected with fewer subjects if the statistical tests involve within-subjects instead of between-subjects factors.
Partitioning Variance Between and Within Subjects
The analysis of studies including repeated measures requires no major new concepts, only an extension and application of ideas and techniques already presented in previous chapters. Two ideas in particular are central to an understanding of any analysis of variance involving within-subjects factors:
1. The first idea concerns how variance is partitioned into between-subjects and within-subjects components.
2. The second conceptualizes subject as a control variable or covariate so that between-subjects variance, which is irrelevant to analyzing within-subjects factors, can be removed from within-subject error terms.
Beginning in chapter 8, we learned that the total sum of squares (or variance) could be partitioned into two portions, one part due to the best-fitting model and the other due to error (the residual sum of squares). In other words,
$$SS_{\text{total}} = SS_{\text{model}} + SS_{\text{error}} \qquad (15.1)$$
In earlier chapters, we ignored the possibility of within-subjects factors, but now we can recognize that Equation 15.1 tells only part of the story. It applies only to studies consisting solely of between-subjects factors. An expanded and more accurate formulation for Equation 15.1 would be:
$$SS_{\text{total between subjects}} = SS_{\text{between-subjects model}} + SS_{\text{between-subjects error}} \qquad (15.2)$$
(Note that what is termed "between-subjects error" in Equation 15.2 has also been termed "error within groups" in earlier chapters because they are the same thing.) Purely between-subjects studies have N subjects and N scores, so the total sum of squares consists only of variability between subjects. We have always referred to this as the total sum of squares before, although it would have been more accurate to call it the total sum of squares between subjects. This serves to remind us that the SStotal in Equation 15.1 is total only if no within-subjects factors are present. Studies including within-subjects factors, however, have more than N scores, so, as you might guess, their total sum of squares is greater than the sum of squares due to variability between subjects. Variability within subjects also contributes. Specifically, for repeated-measures studies:
$$SS_{\text{total (between + within)}} = SS_{\text{total between subjects}} + SS_{\text{total within subjects}} \qquad (15.3)$$
Consider for a moment how you would compute the total (between + within) sum of squares. If three assessments were made for each of 12 subjects, you would compute the mean of the 36 scores, subtract that mean from each score, square the deviation, and sum the 36 squared deviations. But as you know from Equation 15.3, this sum of squares can be partitioned into two pieces, one representing variability between subjects and one representing variability within subjects. Moreover, each of these two sums of squares (SStotal between subjects and SStotal within subjects) can be subdivided further. From chapter 14 you know that the total between-subjects sum of squares can be partitioned into a series of between-subjects main effects and interactions and a between-subjects (or within-groups) error term. Similarly, the total within-subjects sum of squares can be partitioned into sums of squares and error terms that allow us to evaluate the effects, if any, of within-subjects variables on the criterion measure.
It is useful to view a repeated-measures analysis as consisting of two (or more) separate analyses. First, there is the between-subjects analysis, which is associated with the sum of squares total between subjects and which includes only between-subjects factors and their interactions with each other. Second, there is the within-subjects analysis (or analyses, if there is more than one repeated factor), which is associated with the sum of squares total within subjects and includes within-subjects factors, their interactions with each other, and their interactions with the between-subjects factor or factors. Exactly how repeated-measures analyses are ordered and organized will become clearer when we describe in detail how the total sum of squares and degrees of freedom are partitioned for specific repeated-measures studies.
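The partition in Equation 15.3 can be checked directly with a few lines of code. The sketch below is only an illustration, not part of the spreadsheet exercises: it uses the five-subject, two-condition lie detection scores as they can be read from Fig. 15.3 (an assumption of this sketch) rather than the 12-subject, three-assessment example, and the function name partition_ss is hypothetical.

```python
# Scores as read from Fig. 15.3 (an assumption of this sketch):
# one row per subject; columns are the drug and placebo conditions.
scores = [
    [3, 4],
    [2, 5],
    [4, 7],
    [6, 7],
    [6, 9],
]


def partition_ss(rows):
    """Split the total sum of squares into between- and within-subjects parts."""
    all_scores = [y for row in rows for y in row]
    grand_mean = sum(all_scores) / len(all_scores)

    # Total (between + within): squared deviations of every score from the grand mean.
    ss_total = sum((y - grand_mean) ** 2 for y in all_scores)

    ss_between = 0.0
    ss_within = 0.0
    for row in rows:
        subject_mean = sum(row) / len(row)
        # Between subjects: each of the subject's scores carries the
        # deviation of the subject's mean from the grand mean.
        ss_between += len(row) * (subject_mean - grand_mean) ** 2
        # Within subjects: deviations of the scores from the subject's own mean.
        ss_within += sum((y - subject_mean) ** 2 for y in row)
    return ss_total, ss_between, ss_within


ss_total, ss_between, ss_within = partition_ss(scores)
print(round(ss_total, 2), round(ss_between, 2), round(ss_within, 2))
```

For these scores the three values come to 40.1, 25.6, and 14.5, so the between and within pieces sum to the total, as Equation 15.3 requires.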
15.2 CONTROLLING BETWEEN-SUBJECTS VARIABILITY
The second basic idea required for an understanding of repeated-measures analysis of variance is based on the realization that subject, like age or group, can itself be a research factor and can be treated as a covariate for repeated-measures analyses. In the previous section we noted that the total (between + within) sum of squares for 36 scores, derived from three assessments of 12 subjects, is easily computed. But how would we compute the sum of squares between subjects? You could compute a mean for each subject's three scores and then compute a sum of squares for the 12 mean scores. But there is another way to compute the sum of squares between subjects. It requires that you apply concepts learned in previous chapters and has direct application to the analysis of repeated-measures factors.
Just as G groups can be represented with G - 1 predictor variables (see chap. 12), so too N subjects can be represented with N - 1 predictor variables. After all, as we learned in chapter 10, N scores are associated with N - 1 degrees of freedom. Typically, dummy-coded variables are used. Thus the first subject would be coded one for the first predictor variable, zero otherwise, the second subject would be coded one for the second variable, zero otherwise, and so forth, until the final subject would be coded zero for all N - 1 predictor variables.
Once subjects are represented with predictor variables, we can determine the proportion of total variability accounted for by the subject factor. Again consider the N = 12 example. We would regress all 36 scores on the 11 predictor variables for subject. The resulting R2 is the proportion of variance accounted for by the subject factor. Then we can compute the sum of squares between subjects.
It is the total (between + within) sum of squares multiplied by the proportion accounted for by between-subject variability, that is, the total SS multiplied by the R2 just computed.
The analysis of studies involving repeated measures proceeds hierarchically, just like the factorial studies described in the previous chapter. If there are no between-subjects factors, then the first step consists of regressing criterion scores on coded predictor variables representing subjects. But if there are between-subjects factors, then some of the N - 1 predictor variables representing total between-subject variability will be identified with the between-subjects factor or factors and their interactions (just as in chap. 14). Thus the total set of N - 1 between-subjects predictor variables allows us to evaluate the significance of any between-subjects factors. Exactly how this works is demonstrated in chapter 16. In any event, once the N - 1 between-subjects predictor variables are entered (whether that takes one or more than one step), the next steps involve adding predictor variables associated with the repeated-measures factor (or factors).
Note the overall strategy. First we control for between-subject variability, entering coded variables representing subjects, exactly as though the subject factor were a covariate (see chaps. 10 and 12). Then, after removing purely between-subject variance, which is irrelevant to the analysis of within-subject effects, we proceed to analyze the variance remaining, the variance associated with within-subject effects. Whether an analysis of variance includes one or more repeated measures, or includes between-subjects factors as well, the basic steps are the same, although some details vary. Exact procedures are demonstrated in the remainder of this chapter and in the next.
Specific topics include the way variables (including subject variables) are coded, the sequencing of the multiple-regression steps required, the way the total sum of squares (and the total degrees of freedom) is
partitioned among the various effects, and the way the various effects are tested for significance. The next section illustrates a no-between, one-within study, whereas a one-between, one-within and a no-between, two-within study are described in the next chapter. An understanding of these three exemplars should make it possible for you to understand how to analyze other, more complex repeated-measures designs.
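The idea of treating subject as a dummy-coded factor can be sketched in code. This is a rough illustration under stated assumptions, not the spreadsheet procedure used in the exercises: the scores are those readable from Fig. 15.3 (five subjects, two scores each), NumPy's least-squares solver stands in for the regression routine, and all variable names are hypothetical.

```python
import numpy as np

# Scores as read from Fig. 15.3 (an assumption of this sketch):
# five subjects, drug rows first, then placebo rows.
y = np.array([3, 2, 4, 6, 6, 4, 5, 7, 7, 9], dtype=float)
subject = np.array([1, 2, 3, 4, 5, 1, 2, 3, 4, 5])
n_subjects = 5

# Dummy codes S1..S4: column j is 1 for subject j, 0 otherwise.
# Subject 5 is 0 on every column (the comparison subject).
S = np.column_stack([(subject == j).astype(float) for j in range(1, n_subjects)])
X = np.column_stack([np.ones(len(y)), S])  # leading column of 1s for the constant

# Regress the scores on the N - 1 subject predictors.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
predicted = X @ coef

ss_total = np.sum((y - y.mean()) ** 2)
r2_subjects = np.sum((predicted - y.mean()) ** 2) / ss_total

# SS between subjects = total SS times the R2 for the subject factor.
ss_between = r2_subjects * ss_total

# The other route: deviations of the subject means from the grand mean,
# counted once for each of the two scores a subject contributes.
subject_means = np.array([y[subject == s].mean() for s in range(1, n_subjects + 1)])
ss_between_direct = 2 * np.sum((subject_means - y.mean()) ** 2)

print(np.round(coef, 3), round(r2_subjects, 3), round(ss_between, 2))
```

Both routes give the same sum of squares between subjects. Note that the constant equals the mean for the comparison subject and each coefficient is the difference between a subject's mean and the comparison subject's mean.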
A Study With a Single Within-Subjects Factor
Recall the lie detection study first introduced in chapter 5. There were 10 subjects and the dependent variable was the number of lies an expert detected. Beginning in chapter 9, we imposed a single-factor between-subjects design on these data. We divided the 10 subjects into a drug group and a placebo group and represented the single factor with a dummy-coded variable. There were 10 scores and 9 degrees of freedom total. We would now say that there were 9 degrees of freedom between subjects and these were divided into 1 degree of freedom for the between-subjects model (one predictor variable) with 8 remaining for between-subjects error. The partitioning for the total sum of squares and degrees of freedom was:
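$$SS_{\text{total between subjects}} = SS_{\text{model}} + SS_{\text{error}}$$
and
$$df_{\text{total between subjects}} = df_{\text{model}} + df_{\text{error}}$$
Because dfbetween subjects = N - 1,
$$N - 1 = 1 + (N - 2)$$
Finally, the F test for the between-subjects drug effect was:
$$F(1, N - 2) = \frac{MS_{\text{model}}}{MS_{\text{error}}}$$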
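A minimal sketch of this between-subjects test follows. It assumes the 10 scores readable from Fig. 15.3, treated here as coming from 10 independent subjects; because a single dummy-coded predictor simply predicts each group's mean, the model sum of squares is computed from group means rather than by running a regression. The variable names are hypothetical.

```python
# Scores as read from Fig. 15.3, treated here as 10 independent subjects
# (an assumption of this sketch): 5 in the drug group, 5 in the placebo group.
drug = [3, 2, 4, 6, 6]
placebo = [4, 5, 7, 7, 9]


def mean(xs):
    return sum(xs) / len(xs)


n = len(drug) + len(placebo)
grand_mean = mean(drug + placebo)

# Total SS between subjects: every score is a different subject.
ss_total = sum((y - grand_mean) ** 2 for y in drug + placebo)
# Model SS: each score's predicted value is its group mean.
ss_model = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in (drug, placebo))
# Error SS: what is left (error within groups).
ss_error = ss_total - ss_model

df_model = 1        # one predictor variable for two groups
df_error = n - 2    # N - 1 total, minus 1 for the model
f_between = (ss_model / df_model) / (ss_error / df_error)

print(round(ss_model, 2), round(ss_error, 2), round(f_between, 2))
```

The degrees of freedom follow the partition above: 9 total, 1 for the model, and 8 for error.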
But now imagine this was a within-subjects study instead and there were only five subjects and each subject received both drug treatments (the actual drug and a placebo). There would still be 10 scores total but now each subject would contribute two scores. If those scores were the same as those given earlier (Fig. 5.1), and if we assume the scores initially given for subjects 6-10 are now scores for subjects 1-5 after the second drug treatment (the placebo condition), the total sum of squares for the 10 scores would still be 40.1 (the value shown in Fig. 15.3) and the total degrees of freedom would still be 9, the same values we computed earlier (see Fig. 10.3). After all, we have not changed the scores for this example, only the number of subjects and the fact that each subject now contributes two scores.
Although the total sum of squares and degrees of freedom would remain the same, the way they are partitioned would be quite different. Five subjects require four predictor variables. The first step would regress the number of lies detected on the coded variables representing the subject factor. This would give us the R2 between subjects and (multiplying by the total sum of squares for lies detected) the total between-subjects sum of squares, which is associated with 4 degrees of freedom (because five subjects require four predictor variables). Having removed or accounted for between-subjects variance (step 1), we would now proceed to
analyze the remaining within-subjects variance, which is associated with 5 degrees of freedom (9 initially minus 4 for the subject factor). On the second step, a coded variable for drug treatment would be entered. This gives us the increase in R2 (and the increase in SS) due to drug and is associated with 1 degree of freedom (because two drug groups require 1 predictor variable). The residual or error variance is associated with 4 degrees of freedom (5 for total within-subject variability minus 1 for drug). In this case the partitioning for the total sum of squares and degrees of freedom is:
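$$SS_{\text{total}} = SS_{\text{between subjects}} + SS_{\text{drug}} + SS_{\text{residual}}$$
and
$$df_{\text{total}} = df_{\text{between subjects}} + df_{\text{drug}} + df_{\text{residual}}$$
Because dftotal = 2N - 1 (there are two scores per subject),
$$2N - 1 = (N - 1) + 1 + (N - 1)$$
Finally, the F test for the within-subjects drug effect is:
$$F(1, N - 1) = \frac{MS_{\text{drug}}}{MS_{\text{residual}}}$$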
If the obtained F ratio exceeds its critical value, we would say that the drug effect was statistically significant. The next exercise demonstrates in detail how this works, so do not worry if the procedure is not yet completely clear. For the moment, focus on the overall strategy. First total variability is partitioned into two pieces: variability between subjects and variability within subjects. Then variability within subjects is subdivided into variability due to the repeated factor and residual variability. Finally, the significance of the repeated factor is determined by comparing a mean square (variance estimate) for the repeated factor with the mean square for the residual.
In the preceding paragraphs, the error mean square for the between-subjects F test was labeled MSerror and the corresponding error term for the within-subjects test was labeled MSresidual. This was done to signal that the "proper" error terms for these two tests are somewhat different, although both represent residual error and could quite correctly be called either MSerror or MSresidual. In mathematically oriented texts, formal justification is offered for the error terms used for various F tests. For present purposes and in the more informal spirit of this book, it is sufficient to know that statisticians can prove to their satisfaction that the error terms given here are correct. However, it is useful to consider some ways in which the mean squares for the between-subjects and within-subjects F tests described in the previous several paragraphs differ.
The proportion of total variance associated with the mean square for the within-subjects F test is smaller than the proportion associated with the between-subjects F test. There is an easy way to visualize this situation. Consider the drug versus placebo independent groups version of the lie detection study. There are 10 subjects, 10 scores, and 9 degrees of freedom total.
Therefore total variability can be represented with nine predictor variables, one representing drug group and the remaining eight representing residual error between subjects or error within groups. Until now we have formed only the first predictor variable, the one that represents drug group (labeled X in Fig. 10.2), but it is possible to complete the set, forming the remaining eight predictor variables (see Fig. 15.1).
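As a rough numerical illustration of how the two error terms differ, they can be worked out for the lie detection scores. The values below are assumptions of this sketch: they are derived from the scores and sums of squares readable from Fig. 15.3 (total SS = 40.1, SS between subjects = 25.6) together with a sum of squares for drug of 12.1.

$$\text{Between-subjects design:}\quad SS_{\text{error}} = 40.1 - 12.1 = 28.0, \qquad MS_{\text{error}} = \frac{28.0}{8} = 3.5, \qquad \frac{28.0}{40.1} \approx .70$$

$$\text{Within-subjects design:}\quad SS_{\text{residual}} = 40.1 - 25.6 - 12.1 = 2.4, \qquad MS_{\text{residual}} = \frac{2.4}{4} = 0.6, \qquad \frac{2.4}{40.1} \approx .06$$

With the same 10 scores, the error term shrinks from 3.5 to 0.6 once between-subjects variability is set aside, so the same drug effect yields a considerably larger F ratio in the within-subjects analysis.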
In Fig. 15.1 the predictor variable representing drug group is labeled A and the eight predictor variables representing subjects are labeled S1-S8. Predictor variable A is coded 1 for the drug subjects and 0 for the placebo subjects. Five subjects are nested within each group. Predictor variables S1-S4 represent the subjects nested within the drug group, and predictor variables S5-S8 represent the subjects nested within the placebo group. Variables S1-S4 are coded as though the first five subjects each represented different groups. Variable S1 is coded 1 for subject one, variable S2 is coded 1 for subject two, S3 is coded 1 for subject three, and S4 is coded 1 for subject four. Subject five, whose values for S1-S4 are all 0, represents, in effect, a comparison subject for the drug group. Similarly, variable S5 is coded 1 for subject six (the first subject in the placebo group), S6 is coded 1 for subject seven, S7 is coded 1 for subject eight, and S8 is coded 1 for subject nine. Subject ten, like subject five, represents a comparison subject but for the placebo group. However, subjects five and ten differ in their value for variable A. Thus a different pattern of predictor variable values is associated with each of the 10 scores.
It might seem economical to use only four predictor variables for subject, repeating the S1-S4 pattern for both drug and placebo subjects. Codes for variable A distinguish between drug and placebo groups, so this scheme would also assure a different pattern of predictor variable values for each of the 10 scores. But it would not reflect the reality of the situation. Variable S1, for example, would be coded 1 for both subjects one and six only if they were the same subject, and for the independent groups design they are not.
When analyzing for the between-subjects effect of group, first the criterion scores are regressed on the predictor variable representing drug group (step 1).
We do not bother to perform what is in effect a phantom step 2, regressing the criterion on all of the N - 1 predictor variables, for two reasons. First, we do not need to. We know that the additional variance accounted for by step 2 must be 1 - R2, one minus the variance accounted for by step 1, because that is what is left. Second, multiple-regression routines do not allow N scores to be regressed on N - 1 predictor variables. In such cases they return an error message.

 s    A   S1  S2  S3  S4  S5  S6  S7  S8
 1    1    1   0   0   0   0   0   0   0
 2    1    0   1   0   0   0   0   0   0
 3    1    0   0   1   0   0   0   0   0
 4    1    0   0   0   1   0   0   0   0
 5    1    0   0   0   0   0   0   0   0
 6    0    0   0   0   0   1   0   0   0
 7    0    0   0   0   0   0   1   0   0
 8    0    0   0   0   0   0   0   1   0
 9    0    0   0   0   0   0   0   0   1
10    0    0   0   0   0   0   0   0   0

FIG. 15.1. Predictor variables representing all nine degrees of freedom for a single-factor between-subjects study. There are 10 subjects, the between-subjects predictor variable is labeled A, and variables S1 through S8 code for subjects nested within groups.

Now consider the drug versus placebo repeated-measures version of the lie detection study. There are 5 subjects, 10 scores, and 9 degrees of freedom total. Again, total variability can be represented with nine predictor variables. This time, however, only four represent subject. Variable S1 is coded 1 for subject one, variable S2 is coded 1 for subject two, S3 is coded 1 for subject three, and S4 is coded 1 for subject four. As before, subject five, whose values for S1-S4 are all 0, represents a comparison subject (see Fig. 15.2). Again, subject is treated as a factor and each subject represents a level of that factor, which is why N - 1 predictor variables are required for the subject factor. In this case, however, the pattern of codes for S1-S4 is repeated because each subject contributes two scores. For example, the codes for S1-S4 are the same for the first and sixth rows because those rows represent the same subject.
The fifth predictor variable, labeled P in Fig. 15.2, represents drug group. Again, it is coded 1 for drug group and 0 for placebo, but now it represents a within-subjects variable. At this point, we have coded two factors, subject and drug. Next, and as you would expect from the factorial designs discussed in the previous chapter, is the interaction term. It is represented with four predictor variables formed by multiplying each predictor variable for subject with the predictor variable for drug group (see Fig. 15.2). At this point, it is not possible to form further predictor variables. The four associated with the S x P (subject x within-subjects drug group) interaction have exhausted the nine original degrees of freedom. In effect, there are no degrees of freedom within groups (i.e., subjects), which is a consequence of having "groups" that contain a single subject. Thus it should not be surprising that the proper error term for analysis of a within-subjects factor is the subject x within-subjects factor interaction.
The analysis of the within-subjects effect of group would again proceed hierarchically. First the criterion scores are regressed on the predictor variables representing subject (step 1), and then on the predictor variables representing subject and drug group (step 2).
Again, we can regard the step that adds the predictor variables representing the drug by subject interaction as a phantom third step, one that completes the present analysis. The R2 after this step necessarily must be one: with 9 predictor variables and 9 degrees of freedom, all variance is accounted for. Similarly, the final sum of squares must be 40.1, the total sum of squares for the 10 scores (see Fig. 15.3). The next exercise asks you to complete steps 1 and 2 for this analysis of a single within-subjects factor.
 s   S1  S2  S3  S4    P   PS1  PS2  PS3  PS4
 1    1   0   0   0    1    1    0    0    0
 2    0   1   0   0    1    0    1    0    0
 3    0   0   1   0    1    0    0    1    0
 4    0   0   0   1    1    0    0    0    1
 5    0   0   0   0    1    0    0    0    0
 1    1   0   0   0    0    0    0    0    0
 2    0   1   0   0    0    0    0    0    0
 3    0   0   1   0    0    0    0    0    0
 4    0   0   0   1    0    0    0    0    0
 5    0   0   0   0    0    0    0    0    0

FIG. 15.2. Predictor variables representing all nine degrees of freedom for a single-factor within-subjects study. There are 5 subjects and 10 scores, variables S1 through S4 code for subject, the within-subjects predictor variable is labeled P, and variables PS1 through PS4 represent the P x S interaction.
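The claim that the nine predictor variables of Fig. 15.2 exhaust the available degrees of freedom can be checked numerically. The sketch below is an illustration only. It assumes the scores readable from Fig. 15.3 and uses NumPy's general least-squares solver, which, unlike the dedicated regression routines mentioned earlier, will fit a model that has as many parameters as scores. Variable names are hypothetical.

```python
import numpy as np

# Scores as read from Fig. 15.3 (an assumption of this sketch), in the row
# order of Fig. 15.2: subjects 1-5 under the drug, then 1-5 under the placebo.
y = np.array([3, 2, 4, 6, 6, 4, 5, 7, 7, 9], dtype=float)
subject = np.array([1, 2, 3, 4, 5, 1, 2, 3, 4, 5])
P = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0], dtype=float)  # 1 = drug, 0 = placebo

# S1..S4: dummy codes for subject (subject 5 is the comparison subject).
S = np.column_stack([(subject == j).astype(float) for j in range(1, 5)])
# PS1..PS4: each subject code multiplied by the drug code.
PS = S * P[:, None]

# Constant plus all nine predictors: 10 columns for 10 scores.
X = np.column_stack([np.ones(len(y)), S, P, PS])

rank = np.linalg.matrix_rank(X)
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
predicted = X @ coef

ss_total = np.sum((y - y.mean()) ** 2)
r2_full = np.sum((predicted - y.mean()) ** 2) / ss_total

print(rank, round(r2_full, 6))
```

Because the 10 columns are linearly independent, every score is reproduced exactly, R2 is one, and nothing is left over for a separate within-groups error term.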
Exercise 15.1
A Single-Factor Repeated-Measures Study
The template that results from this exercise allows you to analyze data from a single-factor within-subjects study. You will use data from the lie detection study but will assume that five subjects were tested twice.
1. Modify the spreadsheet shown in Fig. 11.2. Add columns for the dummy-coded subject variables, enter values for the dummy codes, and change labels as appropriate. In particular, rows for subjects should be labeled 1-5, not 1-10. Each subject occupies two rows (because there are two scores for each subject) and the dummy codes for each subject's two rows are the same (it is the same subject, after all). What distinguishes each of the two rows is the level of the repeated factor (drug versus placebo). Consistent with our earlier spreadsheet, the drug factor can be dummy coded as well.
2. Do step 1. Regress number of lies on the four coded subject variables. Enter the correct formula for Y'. What is the value of R2 and SS for this model? The significance of this covariate (the subject factor) is not especially important. If R2 is large, it only means that subjects behaved fairly consistently over trials.
3. Do step 2. Regress number of lies on the four coded subject variables plus the coded variable for drug treatment. Enter the correct formula for Y'. What is the value of R2 and SS for this model?
4. What is the increase in R2 and SS from step 1 to step 2? How many degrees of freedom does the error (residual) term have? What is the significance of the change in R2?
5. Examine the predicted scores after steps 1 and 2. How do they appear to be computed? How do the values for the regression constant and coefficients appear to be computed?
After steps 1 and 2 your spreadsheets should look like those shown in Figs. 15.3 and 15.4. Careful scrutiny of these spreadsheets can help you understand exactly how an analysis of a within-subjects factor proceeds.
The primary question, of course, is whether drug matters— if we know whether a person received the drug or a placebo, can we more accurately predict the number of lies the expert will detect? If we do not know which treatment the subject received, we would simply guess the mean number of lies detected for that subject, averaging over both treatment conditions. And indeed, these are the predicted scores at step 1, when only coded variables for subjects are in the equation. Note that the predicted scores are the same for each subject across both treatment conditions, and that the predicted scores for each subject are the number of lies detected for that subject, averaged over both conditions. Thus the predicted score for subject 1, whose scores were 3.0 and 4.0 for the two treatment conditions, is 3.5; the predicted score for subject 4, whose scores were 6.0 and 7.0, is 6.5; and so forth (see Fig. 15.3). These predictions are not perfect, of course, but predictions made with knowledge of the particular subject involved allow us to account for 63.8% of the criterion variance. In other words, 63.8% of the variability in number of lies detected is due to variation between subjects. We already know that the mean score for subjects exposed to the drug is 4.2 and for subjects exposed to the placebo is 6.4. In other words, subjects who received the drug had 1.1 less, and subjects who received the placebo had 1.1 more, lies detected than average (the grand mean is 5.3). Thus, if predicting
scores knowing both the individual involved and the treatment condition, instead of just the individual as in the last paragraph, we would refine our previous predictions by subtracting 1.1 from the mean for the drug condition and adding 1.1 for the placebo condition, which is exactly how the predicted scores in Fig. 15.4 were formed. Thus instead of predicting 3.5 for subject 1, we predict 2.4 for the drug and 4.6 for the placebo condition; instead of predicting 6.5 for subject 4, we predict 5.4 for the drug and 7.6 for the placebo condition, and so forth (see Fig. 15.4). This allows us to account for 94% of criterion variability, which represents a statistically significant increase of 30.2% (F(1,4) = 20.17, p < .05). Thus drug matters. In addition to knowing the subjects involved, if we also know whether they received a drug or a placebo, our predicted scores for the number of lies detected will improve significantly.
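The way the predicted scores are formed at each step can be retraced in a few lines of Python. This is only a sketch: the ten scores are read off Figs. 15.3 and 15.4, the variable names are illustrative, and the predictions are built from subject and condition means (which is what the regression yields for this balanced design) rather than by calling a regression routine.

```python
# Scores as read from Figs. 15.3 and 15.4 (subjects 1-5, same order in both lists).
drug = [3, 2, 4, 6, 6]
placebo = [4, 5, 7, 7, 9]

scores = drug + placebo
n_subjects = len(drug)
grand_mean = sum(scores) / len(scores)

# Step 1: with only the subject codes in the equation, the predicted score
# is each subject's own mean over both conditions.
subject_means = [(d + p) / 2 for d, p in zip(drug, placebo)]
pred_step1 = subject_means + subject_means

# Step 2: adding the drug code shifts each prediction by the condition effect
# (condition mean minus grand mean: -1.1 for drug, +1.1 for placebo).
drug_effect = sum(drug) / n_subjects - grand_mean
placebo_effect = sum(placebo) / n_subjects - grand_mean
pred_step2 = ([m + drug_effect for m in subject_means]
              + [m + placebo_effect for m in subject_means])


def r_squared(observed, predicted):
    """Proportion of criterion variance accounted for by the predictions."""
    ss_total = sum((y - grand_mean) ** 2 for y in observed)
    ss_error = sum((y - yp) ** 2 for y, yp in zip(observed, predicted))
    return 1 - ss_error / ss_total


r2_step1 = r_squared(scores, pred_step1)
r2_step2 = r_squared(scores, pred_step2)

# F for the increase in R2: 1 df for drug, 4 df for the PS error term.
df_change, df_error = 1, 4
f_change = ((r2_step2 - r2_step1) / df_change) / ((1 - r2_step2) / df_error)

print(round(r2_step1, 3), round(r2_step2, 3), round(f_change, 2))
```

Run as written, the sketch should reproduce the values reported in the text: R² of about .638 at step 1, about .94 at step 2, and F(1,4) of about 20.17 for the change.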
 A     B      C    D    E    F    G      H      I        J         K
 s     Lies   S1   S2   S3   S4   Drug   Y'     y =      m =       e =
       Y                          P             Y-My     Y'-My     Y-Y'
 1     3      1    0    0    0    1      3.5    -2.3     -1.8      -0.5
 2     2      0    1    0    0    1      3.5    -3.3     -1.8      -1.5
 3     4      0    0    1    0    1      5.5    -1.3      0.2      -1.5
 4     6      0    0    0    1    1      6.5     0.7      1.2      -0.5
 5     6      0    0    0    0    1      7.5     0.7      2.2      -1.5
 1     4      1    0    0    0    0      3.5    -1.3     -1.8       0.5
 2     5      0    1    0    0    0      3.5    -0.3     -1.8       1.5
 3     7      0    0    1    0    0      5.5     1.7      0.2       1.5
 4     7      0    0    0    1    0      6.5     1.7      1.2       0.5
 5     9      0    0    0    0    0      7.5     3.7      2.2       1.5
 Sum=  53                                53      0        0         0
 N=    10
 Mean= 5.3

 a = 7.5; b = -4, -4, -2, -1 (S1-S4); R = 0.799; R² = 0.638; R²adj = 0.349; SD = 2.002

          SStot (y*y)   SSmod (m*m)   SSerr (e*e)
 VAR=     4.01          2.56          1.45
 SS=      40.1          25.6          14.5
 df=      9             4             5
 MS=      4.456         6.4           2.9
 SD'=     2.111                       1.703
 F=                     2.207

FIG. 15.3. Spreadsheet for analyzing the effect of drug (within-subjects factor P) on number of lies told. Only predictor variables for subject have been entered (step 1).
It is both helpful and illuminating to view the analysis of the variance accounted for by a within-subjects factor as an analysis of covariance. In chapter 11 we analyzed whether drug contributed significantly to the variance in lies detected, once the effect of mood was taken into account (i.e., controlled statistically). In chapter 13 we analyzed whether gender of infant contributed significantly to variance in number of smiles, once the age of the infant was taken into account. And in this chapter, we analyzed whether drug contributed significantly to the variance in lies detected, once the individual subject was taken into account. When a within-subjects factor is under consideration, the fact that individuals vary among themselves is not of interest. That is why between-subject variability is controlled or, in other words, removed from consideration at the outset.
 A     B      C    D    E    F    G      H      I        J         K
 s     Lies   S1   S2   S3   S4   Drug   Y'     y =      m =       e =
       Y                          P             Y-My     Y'-My     Y-Y'
 1     3      1    0    0    0    1      2.4    -2.3     -2.9       0.6
 2     2      0    1    0    0    1      2.4    -3.3     -2.9      -0.4
 3     4      0    0    1    0    1      4.4    -1.3     -0.9      -0.4
 4     6      0    0    0    1    1      5.4     0.7      0.1       0.6
 5     6      0    0    0    0    1      6.4     0.7      1.1      -0.4
 1     4      1    0    0    0    0      4.6    -1.3     -0.7      -0.6
 2     5      0    1    0    0    0      4.6    -0.3     -0.7       0.4
 3     7      0    0    1    0    0      6.6     1.7      1.3       0.4
 4     7      0    0    0    1    0      7.6     1.7      2.3      -0.6
 5     9      0    0    0    0    0      8.6     3.7      3.3       0.4
 Sum=  53                                53      0        0         0
 N=    10
 Mean= 5.3

 a = 8.6; b = -4, -4, -2, -1 (S1-S4), -2.2 (P); R = 0.97; R² = 0.94; R²adj = 0.865; SD = 2.002

          SStot (y*y)   SSmod (m*m)   SSerr (e*e)
 VAR=     4.01          3.77          0.24
 SS=      40.1          37.7          2.4
 df=      9             5             4
 MS=      4.456         7.54          0.6
 SD'=     2.111                       0.775
 F=                     12.57

FIG. 15.4. Spreadsheet for analyzing the effect of drug (within-subjects factor P) on number of lies told. Predictor variables for subject and drug have been entered (step 2).
But note that in all these cases the analytic strategy is the same. First the criterion scores are regressed on the covariate, whether that covariate is a quantitative variable like age or a categorical variable like subject (step 1). Then variables representing the research factor of interest (e.g., gender of infant or drug group) are added to the equation (step 2) and the increase in R² is tested for significance.

15.3 MODIFYING THE SOURCE TABLE FOR REPEATED MEASURES

The general layout used for the template in the last exercise (and in a number of previous exercises) has served us well. But now, especially as we increasingly emphasize analyses consisting of several steps, and are primarily concerned with the significance of the increases in R² associated with different steps, some modifications are in order. After all, neither of the spreadsheets shown in Figs. 15.3 and 15.4 gives us directly the information we want for the present analysis, which is whether or not the drug effect is significant. Thus it makes sense to incorporate a table giving step-by-step statistics into any template used for an analysis of variance, just as we did for the two-factor between-subjects factorial study in Exercise 14.7 (see Fig. 14.12) and as we will do for the present single-factor within-subjects study in the next exercise.

The analysis of variance templates we have developed in the last several exercises have used a multiple-regression routine to compute the regression constant and coefficients and then have used those statistics to compute the predicted values for Y (column H in Figs. 15.3 and 15.4). The predicted value of Y (i.e., Y') was then used to compute the SSmodel and the SSerror. These, in turn, were used to compute R², the proportion of variance accounted for by the variables whose regression coefficients were included in the calculation of Y'. All of this was not only useful but seemed pedagogically justified.

However, there is a shortcut to R². Almost all multiple-regression programs give its value. We could use this value directly, rather than computing R² ourselves as we have been doing. This would simplify the procedure. After identifying the dependent variable, we would invoke the multiple-regression procedure repeatedly, specifying the independent variables and the output range for each step. Computing Y', its associated statistics, and R² after each step could be bypassed.
Thus values of R² can be taken directly from the multiple-regression routine, and from them the R² change for each step can be computed. Analysis of variance source tables, however, traditionally give the sum of squares for each step, not the R² change. As we have already noted, the SS change is easy to compute, because the model sum of squares for each step is easily computed from R², as we noted in the previous chapter. Recall that:

$$R^2 = \frac{SS_{\text{model}}}{SS_{\text{total}}}$$

Therefore, by simple algebra:

$$SS_{\text{model}} = R^2 \times SS_{\text{total}}$$

The SStotal depends only on the deviations of Y from its mean, not on Y'; thus it does not vary from step to step. For example, SStotal = 40.1 in both Figs. 15.3 and 15.4. Thus, given the value of R² for each step, it is a simple matter to compute both the corresponding model sum of squares for that step and the change in sum of
squares from the previous step. It is not necessary to compute predicted values at each step in order to derive the corresponding model sum of squares.

The major elements of an ANOVA source table are SSs, dfs, MSs, and Fs. Sums of squares we have just discussed, and MSs and Fs are easily computed given the SSs and their corresponding dfs. Moreover, spreadsheets are such a versatile computing tool that we might as well let them compute degrees of freedom instead of entering values explicitly as we have been doing. And we might as well make the formulas used as general as possible. In chapter 14 we gave general formulas for degrees of freedom for two- and three-way between-subjects factorial studies (Figs. 14.8 and 14.9). Now, in Fig. 15.5, we provide general formulas for degrees of freedom for a single-factor within-subjects study. These are useful, not only for the template developed during the course of the next exercise, but also as a basis for understanding how the single-factor within-subjects design accounts for variance.

Earlier we introduced A, B, C to symbolize between-subjects factors and a, b, c to symbolize the number of levels for each of those factors. Now we introduce P, Q, R to symbolize within-subjects factors and p, q, r to symbolize the number of levels for those factors (see Fig. 15.5). As always, N symbolizes the number of subjects. For a single-factor within-subjects study, the total (between + within) degrees of freedom is the number of scores minus one (N subjects each contribute p scores so there are a total of Np scores):

$$df_{\text{total}} = Np - 1$$
As before, the degrees of freedom between subjects is the number of subjects minus one:

$$df_{\text{between subjects}} = N - 1$$
There are three ways to derive the degrees of freedom within subjects. Taking the scores free to vary approach, we would note that for a single subject p - 1 scores can vary (because there are p scores for each subject), and because there are N subjects, the total degrees of freedom within subjects is N times p - 1:

$$df_{\text{within subjects}} = N(p - 1) = Np - N$$

Or, taking the number of parameters approach, we would note that there are Np scores initially but that in computing deviations we need to estimate means for N subjects, and hence there are Np - N degrees of freedom left. Finally, we could determine the degrees of freedom within subjects by subtraction:

$$df_{\text{within subjects}} = df_{\text{total}} - df_{\text{between subjects}} = (Np - 1) - (N - 1) = Np - N$$
 Source                        Degrees of freedom
 S, subjects                   N - 1
 TOTAL between subjects        N - 1
 P main effect                 p - 1
 PS interaction                (p - 1)(N - 1)
 TOTAL within subjects         Np - N
 TOTAL (between + within)      Np - 1
FIG. 15.5. Degrees of freedom for a single-factor within-subjects study. The number of levels for within-subjects factor P is symbolized with p and the number of subjects is symbolized with N.
Degrees of freedom for the components of total within-subjects variance (P and PS) also make sense. The degrees of freedom for p levels is p - 1, the number of predictor variables:

$$df_{P} = p - 1$$

And the degrees of freedom for the PS interaction (the error term) is simply the degrees of freedom for P multiplied by the degrees of freedom for S (the degrees of freedom between subjects):

$$df_{PS} = (p - 1)(N - 1)$$
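The degrees-of-freedom formulas of Fig. 15.5 can be collected into one small function. The sketch below is illustrative only (the function name and dictionary keys are not from the text); it also checks that the parts sum to the totals, as they should.

```python
def within_subjects_df(n_subjects, p_levels):
    """Degrees of freedom for a single-factor within-subjects study (Fig. 15.5)."""
    between = n_subjects - 1                          # S, subjects
    p_main = p_levels - 1                             # P main effect
    ps_error = (p_levels - 1) * (n_subjects - 1)      # PS interaction (error)
    within = n_subjects * p_levels - n_subjects       # TOTAL within subjects
    total = n_subjects * p_levels - 1                 # TOTAL (between + within)

    # The components must add up to the totals.
    assert p_main + ps_error == within
    assert between + within == total

    return {"between": between, "P": p_main, "PS": ps_error,
            "within": within, "total": total}


# Lie detection study: N = 5 subjects, p = 2 levels (drug, placebo).
lie_df = within_subjects_df(5, 2)
# Button-pushing study: N = 4 subjects, p = 4 levels (infant's diagnosis).
button_df = within_subjects_df(4, 4)

print(lie_df)
print(button_df)
```

For the lie detection study this gives 4, 1, and 4 degrees of freedom for subjects, P, and PS, summing to 9 in all.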
For the next exercise, these formulas are incorporated into an analysis of variance source table.

Exercise 15.2 Source Table for a Single-Factor Repeated-Measures Study

The template developed for this exercise creates an analysis of variance source table for the template you developed for Exercise 15.1 to analyze data from a single-factor within-subjects study. It provides a summary of the results for the lie detection study, assuming that five subjects were each tested twice, once with a drug and once with a placebo.

General Instructions
1. On a new worksheet, add a source table that summarizes the analysis in Exercise 15.1. You may want to use Fig. 15.6 as a guide. Use general formulas for degrees of freedom. Experiment with different values for N, a, and p before you enter the correct values for the present example. Do all degrees of freedom sum as they should?
2. Invoke the multiple-regression routine in order to compute step 1 and 2 values for R². Provide formulas to compute other sums of squares, changes in R² and SS, the required mean squares, and the F ratio.
3. Answer question 9 of the detailed instructions.

Detailed Instructions
1. Enter labels in cells A1 through I1 as indicated in Fig. 15.6. Label the steps in cells A2 through A4 and the sources of variance in cells B2 through B5. In cells A7, A8, and A9 enter the labels "N=", "a=", and "p=", respectively.
2. Enter the number of subjects in B7 and the number of levels for the single within-subjects variable in B9. There is no between-subjects variable, hence no levels for it, so enter zero in cell B8.
3. Enter the appropriate formulas for degrees of freedom in column F (see Fig. 15.5). If done correctly, df for between subjects, the P main effect, and the PS error term should sum to the df for the Total (between + within).
4. At this point, you have a template that partitions degrees of freedom for any no-between, one-within study. Experiment with different values for N and p (cells B7 and B9) and note the effect on degrees of freedom. Now replace N and p with 5 and 2, the values for this exercise.
5. Do step 1. Invoke the multiple-regression routine using the data from your Exercise 15.1 spreadsheet, specifying lies (column B) as the dependent variable and the coded variables for subjects (columns C-F) as the independent variables. Point the R² for step 1 (cell C2) to the R² computed for this step by the multiple-regression routine. This is the proportion of total variance accounted for by the subject factor.
6. Do step 2. Again invoke the multiple-regression program, this time specifying subject plus drug variables (columns C-G) as the independent variables. Point the R² for step 2 (cell C3) to the R² just computed. This is the proportion of total variance accounted for by the subject and drug factors together.
7. Do step 3. As previously noted, actually performing the regression implied by the implicit or phantom step 3 would exhaust the degrees of freedom. Therefore simply enter 1 (all variance accounted for) in cell C4. This is the value of R² for step 3. Then point cell E4 to cell O13 on the Exercise 15.1 spreadsheet. This is the total sum of squares, which is the SS for step 3.
8. The change in R² and SS for each step can now be computed. Enter appropriate formulas in columns D and E. As a check, enter summation formulas in cells D5 and E5. Do the changes in R² and SS sum to 1 and SStotal as they should?
9. Finally, enter the appropriate formulas for MSs, F, and pη² in cells G3, G4, H3, and I3. What is the critical value for this F ratio? Is it significant? What is pη²? How do you interpret the results?
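The arithmetic behind the source table of this exercise can be sketched in Python. The sketch assumes the R² values have already been obtained from a regression routine; the values used here are unrounded versions of the .638 and .94 shown in the figures (25.6/40.1 and 37.7/40.1), and all variable names are illustrative.

```python
ss_total = 40.1                 # total sum of squares (Figs. 15.3 and 15.4)
n_subjects, p_levels = 5, 2

# R2 after steps 1, 2, and the phantom step 3 (all variance accounted for).
r2_steps = [0.63840, 0.94015, 1.0]
sources = ["TOTAL btwn Ss", "P (Drug)", "PS (error)"]
dfs = [n_subjects - 1, p_levels - 1, (p_levels - 1) * (n_subjects - 1)]

rows = []
previous_r2 = 0.0
for source, r2, df in zip(sources, r2_steps, dfs):
    r2_change = r2 - previous_r2
    ss = r2_change * ss_total   # SS for a step = R2 change x SStotal
    rows.append({"source": source, "r2": r2, "r2_change": r2_change,
                 "ss": ss, "df": df, "ms": ss / df})
    previous_r2 = r2

effect, error = rows[1], rows[2]
f_ratio = effect["ms"] / error["ms"]
partial_eta_sq = effect["ss"] / (effect["ss"] + error["ss"])

for row in rows:
    print(f'{row["source"]:<14} R2={row["r2"]:.3f} change={row["r2_change"]:.3f} '
          f'SS={row["ss"]:.1f} df={row["df"]} MS={row["ms"]:.2f}')
print(f"F({effect['df']},{error['df']}) = {f_ratio:.2f}, "
      f"partial eta squared = {partial_eta_sq:.3f}")
```

Because every sum of squares is derived from an R² change, predicted scores never need to be computed, which is the shortcut this section describes.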
It is instructive to compare the analysis of variance results when drug is a between-subjects factor (Fig. 10.2) with those produced when it is within subjects (Fig. 15.6). Although the same 10 scores are analyzed, the F ratios for the drug effect are dramatically different. The F ratio for the between-subjects study is 3.46 whereas the F ratio for the within-subjects study is 20.17. The total sum of squares (SStotal = 40.1) and the sum of squares associated with the drug effect (SSmodel = SSP main effect = 12.1) are the same in both cases, of course. But for the between-subjects study, with its 10 different subjects, the remaining sum of squares (SSerror = 28) is all within-group or between-subjects error. For the within-subjects study, on the other hand, with its five subjects assessed twice, the sum of squares between subjects (whose value is 25.6) is irrelevant and is removed at the outset, during step 1. Once the drug effect sum of squares (whose value is 12.1) is also removed, the sum of squares remaining (the error or PS interaction sum of squares) is only 2.4. There are fewer degrees of freedom for the within-subjects as compared to the between-subjects error term (4 vs. 8); still, because the within-subjects error sum of squares is so much smaller, the mean square for error is considerably smaller for the within-subjects study (0.6 vs. 3.5). So what was a nonsignificant drug effect with a between-subjects (or independent groups) study becomes, given identical data but half as many subjects, a significant drug effect with a within-subjects (or repeated-measures) study. The analyses of these particular data dramatically demonstrate the advantage of using a repeated-measures factor.

      A      B                C       D           E       F    G      H       I
 1    Step   Source           R²      R² change   SS      df   MS     F       pη²
 2    1      TOTAL btwn Ss    0.638   0.638       25.6    4
 3    2      P (Drug)         0.94    0.302       12.1    1    12.1   20.17   0.834
 4    3      PS (error)       1       0.06        2.4     4    0.6
 5           TOTAL (B+W)              1           40.1    9

FIG. 15.6. Spreadsheet showing an analysis of variance source table for a single-factor within-subjects study analyzing the effect of drug on number of lies detected.
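The contrast between the two analyses can be checked directly from the ten scores. The following sketch (scores read from Figs. 15.3 and 15.4; names illustrative) partitions the total sum of squares both ways and computes the two F ratios.

```python
# Same ten scores analyzed two ways (subjects 1-5 in the same order in each list).
drug = [3, 2, 4, 6, 6]
placebo = [4, 5, 7, 7, 9]
scores = drug + placebo
n = len(drug)

grand_mean = sum(scores) / len(scores)
ss_total = sum((y - grand_mean) ** 2 for y in scores)

# Drug effect: identical in both analyses.
ss_drug = n * ((sum(drug) / n - grand_mean) ** 2
               + (sum(placebo) / n - grand_mean) ** 2)

# Between-subjects view: 10 different subjects, so everything that is
# not drug is error (df = 10 - 2 = 8).
ss_error_between = ss_total - ss_drug
f_between = (ss_drug / 1) / (ss_error_between / 8)

# Within-subjects view: 5 subjects measured twice, so variability between
# subjects is removed first and only the PS interaction is error (df = 4).
subject_means = [(d + p) / 2 for d, p in zip(drug, placebo)]
ss_subjects = 2 * sum((m - grand_mean) ** 2 for m in subject_means)
ss_error_within = ss_total - ss_subjects - ss_drug
f_within = (ss_drug / 1) / (ss_error_within / 4)

print(round(ss_error_between, 1), round(f_between, 2))
print(round(ss_error_within, 1), round(f_within, 2))
```

The drug sum of squares is the same in both cases; only the error term changes, from 28 on 8 degrees of freedom to 2.4 on 4.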
Exercise 15.3 A Single-Factor Repeated-Measures Study in SPSS

In this exercise you will analyze the lie-detection study, last described in Exercise 15.2, using the repeated measures option under the General Linear Model procedure of SPSS.

1. Create a new SPSS data file with three variables. Create one variable for subject number (s), one variable for number of lies detected in the drug condition (drug), and a third variable for number of lies detected in the placebo condition (placebo). Give each variable a meaningful label.
2. Enter 1-5 in the subject number column. In the drug column, enter the number of lies detected for subjects 1-5 in the drug condition. Do the same in the placebo column. Thus, the SPSS file set up for a repeated-measures study will have a single row for each subject with the scores from each level of the repeated-measures variable in separate columns.
3. Select Analyze->General Linear Model->Repeated Measures from the main menu. In the Repeated Measures Define Factor(s) window, type in a name to describe the repeated-measures factor in the Within-Subject Factor Name box (e.g., group). Enter 2 for the number of levels of the factor and click Add to move this factor definition to the lower box. Click the Define button.
4. After the Repeated Measures window opens, move the drug and placebo variables to the right-hand Within-Subjects Variables box. Click on Options and check the Estimates of Effect Size box. Click Continue and then OK.
5. Examine the output. For the moment, ignore the Multivariate Tests and Mauchly's Test of Sphericity. Look at the Sphericity Assumed lines in the Tests of Within-Subjects Effects box. The SS, df, MS, F, and pη² values should agree with your spreadsheet results. Do they?
A Second Study With a Single Within-Subjects Factor

The previous example consisted of a study with a single within-subjects factor measured on two levels. The single factor was drug group and the two levels were drug or placebo treatments. In the interest of generality, a second example is presented. This study, like the previous one, includes a single within-subjects factor but it is measured on four instead of two levels. Recall the button-pushing study first described in chapter 11. There it was presented as a single-factor between-subjects study: Sixteen subjects categorized into four groups were
exposed to videotapes and each time they thought the infant had done something communicative they pushed a button. But imagine instead that the study includes only four subjects and each subject is exposed to four different videotapes, each portraying an infant with a different diagnosis (Down syndrome, fetal alcohol syndrome, very low birth weight, and a no diagnosis comparison group). Again subjects are asked to push a button whenever they think the infant has done something communicative. For the previous example, the between-subjects factor was want/have children status. For the present example, the within-subjects factor is infant's diagnosis. But the research question remains the same: Is the mean number of button pushes significantly different for different levels of the factor? The analysis required to answer this question, assuming that infant's diagnosis is a within-subjects factor, is accomplished in the next exercise.

Exercise 15.4 A Second Single-Factor Repeated-Measures Study

This exercise provides additional practice in analyzing data from a single-factor within-subjects study. The single factor, infant's diagnosis, comprises four levels. You will use data from the button-pushing study but will assume that four subjects were each exposed to four videotapes, each videotape portraying an infant with a different diagnosis.

General Instructions
1. Analyze the 16 scores shown in Fig. 11.5. Assume that four subjects contributed four scores each, that the 1st, 5th, 9th, and 13th scores were contributed by one subject, the 2nd, 6th, 10th, and 14th by a second subject, and so forth. Assume further that the first score for each subject represents the number of button pushes when viewing a Down syndrome infant, the second when viewing a fetal alcohol syndrome infant, the third a very low birth weight infant, and the fourth a no diagnosis comparison infant. Use dummy-coded variables to represent subjects and type of infant.
2. First regress number of button pushes on the coded variables for subjects, then on the coded variables for subjects plus the coded variables for diagnostic group. Enter results, including those for the phantom third step, in a source table like that shown in Fig. 15.6. At this point your spreadsheet should look like the ones shown in Figs. 15.7 and 15.8.

Again, it is instructive to compare the analysis of variance results when group (want/have children status) is a between-subjects factor (Fig. 12.2) with those produced when group (infant's diagnosis) is within subjects (Fig. 15.8). The total sum of squares (SStotal = 7438) and the sum of squares associated with the group effect (SSmodel = SSP main effect = 4880) are the same in both cases. However, in contrast to the lie detection example, the variability between subjects for the repeated-measures button-pushing study was not especially large (SSbetween subjects = 850.5), which means that little variability in button pushing can be accounted for by knowing the particular subject involved. As a result, when the group factor was redefined from between- to within-subjects, there was a dramatic reduction in the error sum of squares for the lie detection study but not for the button-pushing study. For the between-subjects version of the button-pushing study (16 subjects categorized by status group), the error term (the mean square for error) was
2558/12 = 213.2 and for the within-subjects version (4 subjects exposed to videotapes of 4 infants representing different diagnostic groups), it was 1707.5/9 = 189.7. The F ratio was somewhat higher for the within- compared to the between-subjects version (8.57 vs. 7.63), but the difference is slight. The partitioning of the total sum of squares for the two versions of the button-pushing study, both of which analyze the same data, is compared explicitly in Fig. 15.9. In general it is true that within-subjects factors result in more powerful tests and so, other things being equal, are to be preferred. However, as the present comparison illustrates, the extent of the advantage depends on the nature of the data analyzed. Specifically, if the correlation between the repeated measures is small, and if little variance is accounted for by the subject factor, then the usual advantage is diminished.

The research question (Are there group differences?) is the same, of course, no matter whether a between-subjects or a within-subjects factor is analyzed. For the button-pushing data the group effect was significant no matter whether group was a between-subjects or a within-subjects factor, that is, no matter whether subjects belonged to different want/have children status groups or infants from different diagnostic groups were viewed by the same subjects. No matter whether effects are between or within subjects, if more than two groups are defined and if the group effect is significant, next we would want to know exactly how the groups differed among themselves.

 s     #BPs   S1   S2   S3   P1   P2   P3   SStot
       Y                                    y*y
 1     102    1    0    0    1    0    0    256
 2     125    0    1    0    1    0    0    1521
 3     95     0    0    1    1    0    0    81
 4     130    0    0    0    1    0    0    1936
 1     79     1    0    0    0    1    0    49
 2     93     0    1    0    0    1    0    49
 3     75     0    0    1    0    1    0    121
 4     69     0    0    0    0    1    0    289
 1     43     1    0    0    0    0    1    1849
 2     82     0    1    0    0    0    1    16
 3     69     0    0    1    0    0    1    289
 4     66     0    0    0    0    0    1    400
 1     101    1    0    0    0    0    0    225
 2     94     0    1    0    0    0    0    64
 3     84     0    0    1    0    0    0    4
 4     69     0    0    0    0    0    0    289
 Sum=  1376                                 7438
 N=    16
 Mean= 86

FIG. 15.7. Spreadsheet for analyzing the effect of infant's diagnosis (within-subjects factor P) on number of button pushes (coded predictor variables).

      A      B                C       D           E        F    G       H       I
 1    Step   Source           R²      R² change   SS       df   MS      F       pη²
 2    1      TOTAL btwn Ss    0.114   0.114       850.5    3    283.5
 3    2      P (Diagnosis)    0.77    0.656       4880.0   3    1627    8.574   0.741
 4    3      PS (error)       1       0.23        1707.5   9    189.7
 5           TOTAL (B+W)              1           7438     15   495.9

FIG. 15.8. Spreadsheet for analyzing the effect of infant's diagnosis (within-subjects factor P) on number of button pushes (source table).

Post hoc tests for the between-subjects version of the button-pushing study were illustrated in chapter 13. For the within-subjects version, the Tukey test is again applied. Recall from chapter 13 (Equation 13.4) that the formula for the Tukey critical difference is

$$TCD = q \sqrt{\frac{MS_{\text{error}}}{n}}$$
The number of groups or G is still 4, as is the number of scores per group or n. The degrees of freedom for error is now 9, not 12, which changes the value of q, and the MSerror is now 189.7. Thus the value of TCD in this case is 30.4, which happens to be not much different from the 30.7 we computed earlier for the between-subjects version. For these data, it happens that the post hoc results are the same for both between- and within-subjects versions (see Fig. 15.10). Specifically, for the within-subjects study, subjects pushed the button significantly more for Down syndrome infants (M = 113) than for either fetal alcohol syndrome (M = 79) or very low birth weight infants (M = 65). However, the number of button pushes for these three groups did not differ significantly from the number for the no diagnosis comparison group (M = 87).

In earlier chapters you learned how to decide whether mean scores for the levels or groups defined by a research factor differed significantly, assuming that different subjects were assessed at each level (i.e., within each group). In this chapter you have learned how to decide whether mean scores for the groups defined by levels of a research factor differ, assuming instead that subjects are assessed repeatedly, that is, that each subject contributes a score to each level of the research factor. The general research question, however, remains the same: Can predictions concerning a subject's criterion score be significantly improved if they are made with knowledge of the particular level of the research factor associated with that score? Does the research factor matter?
 Analysis
 1-between (Fig. 12.9)          1-within (Fig. 15.8)
 Source   SS      df            Source   SS      df
                                S        851     3
 A        4880    3             P        4880    3
 S/A      2558    12            PxS      1708    9

FIG. 15.9. Partitioning total sums of squares and degrees of freedom for a one-between and a one-within study with the same data.
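The two partitions compared in Fig. 15.9 can be reproduced from the 16 scores of Fig. 15.7. This is a sketch under the exercise's assumption that the i-th score at each level comes from the same subject; the dictionary keys and variable names are illustrative.

```python
# Button pushes by infant's diagnosis; position i in each list is subject i + 1.
scores = {
    "Down syndrome":          [102, 125, 95, 130],
    "fetal alcohol syndrome": [79, 93, 75, 69],
    "very low birth weight":  [43, 82, 69, 66],
    "no diagnosis":           [101, 94, 84, 69],
}
levels = list(scores.values())
p = len(levels)        # number of levels (or groups)
n = len(levels[0])     # scores per level

all_scores = [y for level in levels for y in level]
grand_mean = sum(all_scores) / len(all_scores)
ss_total = sum((y - grand_mean) ** 2 for y in all_scores)

# Group (A) or diagnosis (P) effect: the same in both analyses.
ss_p = n * sum((sum(level) / n - grand_mean) ** 2 for level in levels)
ms_p = ss_p / (p - 1)

# One-between: 16 different subjects, error is S/A with p(n - 1) df.
ss_s_within_a = ss_total - ss_p
f_between = ms_p / (ss_s_within_a / (p * (n - 1)))

# One-within: 4 subjects, subjects removed first, error is PxS with (p - 1)(n - 1) df.
subject_means = [sum(level[i] for level in levels) / p for i in range(n)]
ss_s = p * sum((m - grand_mean) ** 2 for m in subject_means)
ss_ps = ss_total - ss_p - ss_s
f_within = ms_p / (ss_ps / ((p - 1) * (n - 1)))

print(ss_total, ss_p, ss_s, ss_s_within_a, ss_ps)
print(round(f_between, 2), round(f_within, 2))
```

Because the subject sum of squares is small here (850.5 of 7438), removing it shrinks the error term only a little, which is why the two F ratios are so close.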
                       Analysis
 Post hoc statistic    1-between (Fig. 12.9)    1-within (Fig. 15.8)
 G                     4                        4
 dferror               12                       9
 q                     4.20                     4.41
 MSerror               213.1                    189.7
 n                     4                        4
 TCD                   30.7                     30.4
FIG. 15.10. Statistics needed for a post hoc test for a one-between and a one-within study with the same data.
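The Tukey critical differences in Fig. 15.10 follow from q, MSerror, and n. In the sketch below the q values are simply copied from the figure (they are table look-ups, not computed), the means are those reported in the text, and the function name is illustrative.

```python
import math
from itertools import combinations


def tukey_critical_difference(q, ms_error, n):
    """TCD = q * sqrt(MSerror / n), with q taken from a studentized range table."""
    return q * math.sqrt(ms_error / n)


tcd_between = tukey_critical_difference(q=4.20, ms_error=213.1, n=4)
tcd_within = tukey_critical_difference(q=4.41, ms_error=189.7, n=4)

# Mean button pushes for each diagnosis, as reported in the text.
means = {
    "Down syndrome": 113,
    "fetal alcohol syndrome": 79,
    "very low birth weight": 65,
    "no diagnosis": 87,
}

# A pair of means differs significantly if the difference exceeds the TCD.
significant_pairs = [
    (a, b) for a, b in combinations(means, 2)
    if abs(means[a] - means[b]) > tcd_within
]

print(round(tcd_between, 1), round(tcd_within, 1))
for a, b in significant_pairs:
    print(f"{a} vs. {b}: difference = {abs(means[a] - means[b])}")
```

With these inputs only the Down syndrome mean differs significantly from the fetal alcohol syndrome and very low birth weight means, as described in the text.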
Within-subjects as opposed to between-subjects studies have certain advantages. Other things being equal, analysis of within-subjects factors is more powerful, which means that fewer subjects are required to decide that effects of a specified size are statistically significant. But not all factors are good candidates for within-subjects factors. Some factors are too reactive. For example, a subject may remember a test and so react differently at a later assessment only because it is a later assessment and not because the second assessment occurs after an experimental treatment. And some factors, like socioeconomic status, describe individual differences and are inherently between subjects. Nonetheless, studies can often use within-subjects factors to good advantage. Levels of commonly used within-subjects factors can represent different instructions, different forms of a test, or different settings (night vs. day, home vs. laboratory, and so forth). They can also distinguish between pre- and posttreatment assessments (but see chap. 17) and can represent age in longitudinal studies (e.g., 9-, 12-, and 15-month-old infants).
15.4 ASSUMPTIONS OF THE REPEATED-MEASURES ANOVA

Repeated-measures ANOVA requires that the typical assumptions of random sampling, normality, and equal variances are met. Given that there are the same number of observations at each treatment level and the data come from the same subjects, the homogeneity of variances assumption is unlikely to be violated, and often is not even tested. When you have more than two levels of the repeated-measures factor, however, an additional assumption, sphericity, is required to ensure that the F ratio is unbiased. Sphericity holds when the variances of the difference scores between all levels of a factor are the same. A difference score is simply a subject's score at one level subtracted from his or her score at another level. For a four-level factor there are six possible pairs of levels: 1-2, 1-3, 1-4, 2-3, 2-4, and 3-4. The variances of these six sets of difference scores should be roughly equal to satisfy the sphericity assumption.

Sphericity is estimated by a statistic called epsilon (ε). If sphericity is met perfectly, then ε will have a value of 1. Epsilon has a lower bound of 1/(k - 1), where k is the number of levels of the repeated-measures factor. Thus, any value below 1, down to the lower bound, indicates that sphericity is violated to some extent. There are two generally accepted methods for calculating ε. The first is the Greenhouse and Geisser (1959) estimate, which is the more conservative approach. The second is more liberal and is known as the Huynh and Feldt (1976) estimate. When violations of sphericity are not extreme, the Huynh-Feldt estimate is considered more accurate. As the severity of the violation increases, however, the Greenhouse-Geisser is the more appropriate estimate. The
calculations for the two estimates of ε are beyond the scope of this book, but appear in the output of most major statistical packages. Many statistical packages also provide a statistical test of sphericity. SPSS calculates Mauchly's W. Such tests have very little power when sample sizes are small. For this reason, most researchers assume that sphericity is violated and proceed as described in the next subsection.
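What the sphericity assumption asks of the data can be seen by computing the variance of each set of difference scores. The sketch below does this for the button-pushing scores of Fig. 15.7, using the short level names suggested for the SPSS exercise (ds, fas, lbw, cntrl). It only lists the six variances; it does not compute ε or Mauchly's W.

```python
from itertools import combinations
from statistics import variance

# Button pushes for subjects 1-4 at each level of infant's diagnosis.
scores = {
    "ds":    [102.0, 125.0, 95.0, 130.0],   # Down syndrome
    "fas":   [79.0, 93.0, 75.0, 69.0],      # fetal alcohol syndrome
    "lbw":   [43.0, 82.0, 69.0, 66.0],      # very low birth weight
    "cntrl": [101.0, 94.0, 84.0, 69.0],     # no diagnosis comparison
}

# One set of difference scores, and one variance, for each pair of levels.
diff_variances = {}
for a, b in combinations(scores, 2):
    diffs = [x - y for x, y in zip(scores[a], scores[b])]
    diff_variances[(a, b)] = variance(diffs)   # sample variance (n - 1)

for pair, var in diff_variances.items():
    print(pair, round(var, 1))
```

If sphericity held exactly in the population, these six variances would estimate the same value; with only four subjects they can differ a good deal by chance alone.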
The Modified Univariate Approach

The spreadsheet calculations for a repeated-measures design hold when sphericity is assumed. They also provide a nice demonstration of the logic behind a repeated-measures design. However, when sphericity is violated (and it often is to some extent), then it is necessary to make some adjustments to your calculations. When sphericity is violated, the F ratio becomes positively biased. Therefore one way to correct for sphericity violations is to adjust your degrees of freedom based on the estimate of ε. To do this, Box (1954) suggested multiplying the df numerator and df denominator by ε, and then evaluating the significance of F using these new values. Another, extremely conservative, approach is to evaluate F at the lower bound. In other words, set df for the numerator equal to 1 and df for the denominator equal to N - 1, the values that result when ε equals its lower bound of 1/(k - 1).

Given that there are two methods of estimating ε, there are four possible levels of significance for the repeated-measures factor. One assumes sphericity and is the method presented in the spreadsheet; the other three, just described, involve corrections to the degrees of freedom. It is possible to select one method based on the importance of a Type I error relative to a Type II error, but most researchers use a strategy known as the modified univariate approach. This strategy consists of three steps:

1. Evaluate the significance of F assuming sphericity. If the F is not significant then retain the null hypothesis. If the F is significant then proceed to step 2.
2. Evaluate the significance of F at the lower bound. If the F is significant, then the null hypothesis is rejected. If the F is not significant, then proceed to step 3.
3. Evaluate the significance of F using the Box correction (i.e., multiply df by ε). You can use either the Greenhouse-Geisser or Huynh-Feldt estimate.
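The degrees of freedom used at each of the three steps can be sketched as follows. The function name is illustrative, and the value ε = 0.6 in the last call is a made-up example rather than an estimate from any data set in this chapter; critical values for F would still need to be looked up for each pair of degrees of freedom.

```python
def adjusted_df(k_levels, n_subjects, epsilon=1.0):
    """Box-corrected degrees of freedom for a repeated-measures F ratio.

    epsilon = 1 assumes sphericity; epsilon = 1 / (k - 1) is the lower bound.
    """
    lower_bound = 1 / (k_levels - 1)
    if not lower_bound <= epsilon <= 1:
        raise ValueError("epsilon must lie between 1/(k - 1) and 1")
    df_effect = (k_levels - 1) * epsilon
    df_error = (k_levels - 1) * (n_subjects - 1) * epsilon
    return df_effect, df_error


k, n = 4, 4   # button-pushing study: 4 levels, 4 subjects

step1_df = adjusted_df(k, n)                # step 1: sphericity assumed
step2_df = adjusted_df(k, n, 1 / (k - 1))   # step 2: lower bound
step3_df = adjusted_df(k, n, 0.6)           # step 3: hypothetical epsilon estimate

print(step1_df, step2_df, step3_df)
```

At the lower bound the degrees of freedom reduce to 1 and N - 1, which is why that test is the most conservative of the three.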
The Multivariate Approach

An alternative to the modified univariate approach is to treat each set of difference scores as a separate dependent variable and then conduct a multivariate analysis of variance. Sphericity is not an assumption of the multivariate tests, but they tend to be less powerful than the univariate approach. A general rule of thumb is that if there appears to be some violation of sphericity (e.g., ε < .80 or so) and N > 30 plus the number of levels of the repeated-measures factor, then the multivariate approach is recommended. When N is small and the multivariate tests are not significant, then the modified univariate approach may be more powerful.
Exercise 15.5 SPSS Analysis of a Repeated-Measures Study with Four Levels
In this exercise you will analyze the data for the four-level repeated-measures study presented in Exercise 15.4.
1. Create a new SPSS data file for the button pushes study that is set up for a repeated-measures analysis. You will need five variables, one for the subject number (s), and four for the levels of the infant diagnosis factor: Down syndrome (ds), fetal alcohol syndrome (fas), low birth weight (lbw), and no diagnosis comparison (cntrl). Give each variable a meaningful label.
2. Enter 1-4 in the s column and the appropriate number of button presses for each of the four cases in the remaining columns.
3. Select Analyze->General Linear Model->Repeated Measures from the main menu. In the Repeated Measures Define Factor(s) window, type in a name to describe the repeated-measures factor in the Within-Subject Factor Name box (e.g., diag for diagnosis). Enter 4 for the number of levels of the factor and click Add to move this factor definition to the lower box. Click the Define button.
4. Move each of the diagnosis levels to the Within-Subjects Variables box, and check the Descriptive Statistics and Estimates of Effect Size boxes under Options. Run the analysis.
5. Check your descriptive statistics. Do they agree with your spreadsheet? Examine the Sphericity Assumed lines in the output. Do the SS, df, MS, F, and pη² values agree with your spreadsheet?
6. Look in the box labeled Mauchly's Test of Sphericity. Mauchly's test is not significant, indicating that the sphericity assumption has been met. Remember, however, that Mauchly's test is underpowered when the sample size is small. It would therefore be prudent, in this case, to assume that there is at least some violation of sphericity. Note that the lower bound is .33, that is, 1/(k - 1) = 1/(4 - 1). Also note that the Greenhouse-Geisser and Huynh-Feldt estimates of ε differ by a large amount. This is due to the small number of cases in the study. Typically the two estimates would be closer.
7. Examine the Multivariate Tests. None are significant, but remember that N is small and the multivariate tests are not very powerful under these circumstances. Apply the modified univariate approach. What is your statistical decision based on this approach?
From this chapter, you have learned specifically how to evaluate the effect of a single within-subjects factor. Little new was needed conceptually, just additional applications of the hierarchic approach emphasized throughout this text, an understanding that variability due to subjects can be regarded as a covariate and so can be dispensed with at step 1, and an understanding that subject can serve as a factor and can be represented with dummy-coded variables. The next chapter extends your understanding of repeated-measures studies. Studies with more than one factor, at least one of which is repeated, are described and additional post hoc tests for within-subjects studies are demonstrated.
16
Two-Factor Studies With Repeated Measures
In this chapter you will:
1. Learn how to analyze data from a two-factor study when factors are mixed, one between subjects and one within subjects.
2. Learn how to analyze data from a two-factor study when both factors are within subjects.
3. Learn how to compute the degrees of freedom associated with the main effects and interactions for these two kinds of repeated-measures studies.
4. Learn how to perform post hoc tests for analyses involving repeated measures.
The two studies presented in the last chapter included a single repeated measure or within-subjects factor. Subjects were assessed more than once and the levels of the within-subjects factor formed the repeated assessments or groups for analysis. Thus for the last example, subjects were presented with videotapes of four infants, each with a different diagnosis. The purpose of the analysis was to determine whether the mean number of times subjects pushed the button was significantly different for infants with different diagnoses. Each subject saw videotapes representing all four diagnostic groups, so our only concern was with variability within subjects. Variability between subjects was irrelevant to the question at hand, which involved within-subject effects, and for that reason it was removed from consideration, or statistically controlled, at the outset.
In this chapter we consider more complex possibilities. Two studies are examined. Both involve two factors, at least one of which is within subjects. But as you will see, the general approach developed in the last chapter applies to these more complex repeated-measures studies as well. The common feature is controlling for between-subject variability, exactly as though the subject factor were a covariate.
16.1 ONE BETWEEN- AND ONE WITHIN-SUBJECTS FACTOR
In this section we consider one example of a mixed study. This is a study that has, in addition to a repeated-measures factor, a between-subjects factor as well.
We again use data from the button-pushing study but change the assumed procedural arrangements to reflect a mixed two-factor study. In chapter 14 we used these data to exemplify a 2 × 2 between-subjects factorial. The 16 subjects were arranged in four groups and the four groups represented males and females (factor A) given either instruction set I or II (factor B). In this and the following section, two new examples are presented. All three of the examples represent 2 × 2 factorials, and for all three, the four groups are the same (i.e., group 1 consists of males exposed to set I; group 2, males given set II; etc.). However, this section and the following section demonstrate how these data would be analyzed if one or both of the factors were within subjects.
For the mixed example presented in this section, imagine there were only 8 subjects instead of 16, that 4 were male and 4 female, and all subjects were tested twice, once after receiving instruction set I, once after set II. Thus, for the present example, we are assuming the study consists of a single between-subjects factor, gender of subject, and a single within-subjects factor, instruction set.
For the two single-factor repeated-measures studies presented in the last chapter, the subjects constituted a single group and were not subdivided further. Dummy-coded variables were defined for subjects and each subject received a unique code (in effect, constituting each subject as a group of one), but the subjects were not grouped in any other way. In other words, there were no between-subjects factors as such, only the subject factor nested within the single group, which was used to represent between-subject variability. Here, however, subjects are divided into a male and a female group, which introduces gender as a between-subjects factor.
For all repeated-measures studies, part of the total variance is between subjects (recall Equation 15.3):
SStotal (between + within) = SStotal between subjects + SStotal within subjects
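As a quick numerical check of this partition, the following sketch splits the total sum of squares for the 16 button-push scores used in this chapter's mixed example (the scores are those listed in the chapter's spreadsheet figure). The function name and the data layout, one (set I, set II) pair per subject, are assumptions made for illustration.

```python
# One (set I, set II) pair of button-push counts per subject.
# Subjects 1-4 are male, subjects 5-8 are female.
scores = {
    1: (102, 79), 2: (125, 93), 3: (95, 75), 4: (130, 69),
    5: (43, 101), 6: (82, 94), 7: (69, 84), 8: (66, 69),
}


def partition_total_ss(scores_by_subject):
    """Split SS total into between-subjects and within-subjects parts."""
    all_scores = [y for pair in scores_by_subject.values() for y in pair]
    grand_mean = sum(all_scores) / len(all_scores)
    ss_total = sum((y - grand_mean) ** 2 for y in all_scores)

    ss_between = 0.0
    ss_within = 0.0
    for pair in scores_by_subject.values():
        subject_mean = sum(pair) / len(pair)
        # Each subject mean stands in for len(pair) scores.
        ss_between += len(pair) * (subject_mean - grand_mean) ** 2
        ss_within += sum((y - subject_mean) ** 2 for y in pair)
    return ss_total, ss_between, ss_within


ss_total, ss_between, ss_within = partition_total_ss(scores)
print(ss_total, ss_between, ss_within)
```

The two parts add up to the total, which is the point of the equation above.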
The degrees of freedom associated with the between-subjects variance is N - 1. For a single-factor within-subjects study, no further subdividing of total between-subjects variance is possible (see Fig. 15.5). However, for a mixed design like the current example, total between-subjects variance is partitioned into components representing the between-subjects main effect (or effects), their interactions (if there is more than one between-subjects factor), and the residual, which is between-subjects error. The present example includes one between-subjects factor (not counting the nested subject factor). In this case, total between-subjects variance is partitioned into two portions (see Fig. 16.1). One part (symbolized A) is associated with the between-subjects factor (in this case, gender of subject) and the rest with residual between-subjects error (subjects within groups, symbolized S/A).
The other part of total variance for repeated-measures studies is within subjects. The degrees of freedom associated with the within-subjects variance is Np - N (where p is the number of levels for the within-subjects factor P). For the single-factor repeated-measures studies presented in the last chapter, one portion (symbolized P) was associated with the effect of the repeated measure and the remaining portion, the error term, with the repeated factor by subjects interaction (symbolized PS, i.e., P × S). For the present example, the effect of the repeated measure is still symbolized P, but now an additional portion of within-subjects variance is associated with the interaction of the between- and within-subjects factors. This is usually symbolized as AP, but in this book often PA is used instead because it makes the partitioning of variance easier to follow (see Fig. 16.1).
The residual or error variance is again the interaction of the repeated factor with subjects or, more correctly, the interaction of the repeated factor with subjects within groups (symbolized PS/A, i.e., P × S/A). For repeated-measures studies, subject is treated as a factor nested within groups. For the single-group repeated-measures studies described in the last chapter, within-subjects error variance was symbolized as PS. With only one group, it did not seem necessary to specify that subjects were nested within the one group, although we could have used a notation like PS/G instead of PS. But for the present example with its two independent groups (males and females), within-subject error is associated with the repeated measure by subjects within groups interaction. As just noted, this is symbolized as PS/A and signals the presence of a single between-subjects factor, A, in addition to the within-subjects factor, P.
The point of all this, of course, is to be able to determine whether the subject's gender, the instructions given, and/or the gender × instruction interaction affect the number of times subjects believe they have noted infants engaging in communicative acts. As noted in chapter 14 when discussing factorial designs generally, significant interactions are often informative. This is as true for interactions involving repeated measures as for interactions involving only between-subjects factors. For example, in the present case we might learn that the effect of the instructions given was conditional on the gender of the person receiving the instruction. However, in order to test the various main effects and interactions for significance, we need to know first how to partition variance and degrees of freedom for mixed two-factor studies, and second, which error terms are used to test which effects. The preceding discussion is summarized in Fig.
16.1, which lists sources of variance and general formulas for their associated degrees of freedom for a mixed two-factor study. It is worth noting several points of similarity with the corresponding table for a single-factor within-subjects study (see Fig. 15.5). As before, the total degrees of freedom is the total number of scores minus one, Np - 1. Likewise, the total degrees of freedom between subjects is N - 1 and the total degrees of freedom within subjects is N(p - 1) or Np - N. The subdivisions of total between and total within variance, however, are somewhat different for the mixed two-factor design. The N - 1 degrees of freedom between subjects are divided into a - 1 degrees of freedom for the A main effect (a - 1 predictor variables), which leaves N - a degrees of freedom for subjects within groups (S/A). This is exactly how degrees of freedom were computed earlier for the between-subjects error term of a single-factor between-subjects study (see Fig. 12.10). Further, the Np - N degrees of freedom within subjects are divided into p - 1 degrees of freedom for the P main effect, p - 1 times a - 1 degrees of freedom for the PA interaction, and p - 1 times N - a degrees of freedom for the P by subjects within groups interaction. The value of p - 1 (degrees of freedom for the repeated measure) times N - a (degrees of freedom for subjects within groups) is the degrees of freedom for the error term used to evaluate the significance of the P main effect and the PA interaction.
Source                       Degrees of freedom
A main effect                a - 1
S/A, subjects within A       N - a
TOTAL between subjects       N - 1
P main effect                p - 1
PA interaction               (p - 1)(a - 1)
PS/A interaction             (p - 1)(N - a)
TOTAL within subjects        Np - N
TOTAL (between + within)     Np - 1
FIG. 16.1. Degrees of freedom for a mixed two-factor study. The number of levels for between-subjects factor A is symbolized with a, and for within-subjects factor P, with p. The number of subjects is symbolized with N.
In the previous chapter, variance for a single-factor repeated-measures study was partitioned into three components, symbolized as S (for subjects within groups), P, and PS (see Fig. 15.5), and each was associated with a step in the regression procedure used to analyze criterion variance. For the current mixed two-factor study, variance is partitioned into five components, symbolized as A, S/A, P, PA, and PS/A (see Fig. 16.1):
1. A, between-subjects main effect.
2. S/A, between-subjects error term.
3. P, within-subjects main effect.
4. PA, between × within interaction.
5. PS/A, error term for P and PA effects.
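The degrees-of-freedom formulas for these five components translate directly into code. The sketch below is a minimal illustration (the function name and the dictionary layout are assumptions) that also checks the bookkeeping: the between components must sum to N - 1, the within components to Np - N, and everything together to Np - 1.

```python
def mixed_design_df(n_subjects, a, p):
    """Degrees of freedom for a one-between (a levels), one-within (p levels) design."""
    between = {"A": a - 1, "S/A": n_subjects - a}
    within = {
        "P": p - 1,
        "PA": (p - 1) * (a - 1),
        "PS/A": (p - 1) * (n_subjects - a),
    }
    # The components must exhaust the totals stated in the text.
    assert sum(between.values()) == n_subjects - 1
    assert sum(within.values()) == n_subjects * p - n_subjects
    total = n_subjects * p - 1
    return between, within, total


# Values for this chapter's example: 8 subjects, 2 genders, 2 instruction sets.
print(mixed_design_df(8, 2, 2))
```

Trying other values of N, a, and p shows how the apportioning changes while the totals continue to balance.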
As before, each of these five components is associated with a multiple-regression step. The R² and SS are determined for the first four steps; the last or phantom step exhausts the degrees of freedom and accounts for all variance. The steps and exact computations required for analysis of the present mixed two-factor example are demonstrated in the next exercise.
Before beginning the next exercise, you need to know how to code the N - 1 between-subjects predictor variables when there is a between-subjects factor. This is an important although somewhat technical detail. For the single-factor repeated-measures studies described in the previous chapter this was easy enough. Subjects were dummy coded. The first subject was coded one for the first predictor variable and zero otherwise, the second subject was coded one for the second predictor variable and zero otherwise, and so forth, and the last subject was coded zero for all between-subjects predictor variables (see Figs. 15.3, 15.4, and 15.7). Subjects constituted a single group. In other words, there was no between-subjects factor (other than the nested subject factor) and hence no need for predictor variables that coded group membership.
In the present case, there are eight subjects, so seven predictor variables are required to represent the total between-subjects effect. There are, however, two groups of subjects (male and female), so one of those seven predictor variables has to code for the between-subjects factor, A or gender. The remaining six predictor variables code for the between-subjects error term, S/A (whose degrees of freedom are, of course, six). These are dummy coded, but nested within group. What this means should become clear in the course of the next exercise, if not the next paragraph.
Recall that the first predictor variable codes for gender. Using contrast codes, males might be -1 and females +1 (see Fig. 16.2).
The second predictor variable is coded one for the first male and zero otherwise, the third predictor variable is coded one for the second male and zero otherwise, and the fourth predictor is coded one for the third male and zero otherwise. The fourth and final male in his group receives codes of zero for all between-subjects predictor variables except the first (which is -1 indicating a male). Similarly, the fifth predictor variable is coded one for the first female and zero otherwise, the sixth predictor variable is coded one for the second female and zero otherwise, and the
seventh predictor is coded one for the third female and zero otherwise. The fourth and final female in her group receives codes of zero for all between-subjects predictor variables except the first (which is +1, indicating a female).
Sometimes students ask, why not use just three predictor variables, in addition to the one coding for gender, to indicate the S/A effect? The first variable could be coded -1 for male and +1 for female. Then the second variable could be coded one for the first subject in the group (for both the first male and the first female) and zero otherwise, the third variable could be coded one for the second subject in the group and zero otherwise, and so forth. Such a coding, however, implies that the subject factor is crossed instead of nested and that some subjects were randomly assigned to receive the "male" treatment, others the "female" treatment. However, gender is usually believed to be a relatively enduring attribute not so easily subject to experimental manipulation. In fact, we define a group (male or female) and then select subjects from within each group, which is what nesting implies. The between-subjects predictor variables reflect this reality, which is why the first predictor variable codes group membership, the second three indicate males within the male group, and the final three indicate females within the female group, which reflects, of course, the division of the seven between-subjects degrees of freedom into one for the effect and six for error. The next exercise requires that you form predictor variables according to these principles.
Exercise 16.1 Analysis of a One-Between, One-Within Two-Factor Study
The template developed for this exercise allows you to analyze data from a mixed two-factor study. You will use data from the button-pushing study, but will assume that a group of four men and a second group of four women were each exposed to two different sets of instructions.
General Instructions
1. Analyze data from the button-pushing study assuming two factors, one between and one within subjects. Assume the first four scores are for males exposed to instruction set I and the second four scores are for the same males exposed to set II. Likewise, assume scores 9-12 are for females exposed to set I and scores 13-16 are for the same females exposed to set II.
2. Establish seven predictor variables for the between-subjects variables. One codes gender (use contrast codes) and the remaining six code for subjects nested within gender. Use dummy codes as described in the last paragraph before this exercise. You may also want to examine Fig. 16.2 to see an example of how the nesting of subjects within groups works. Establish two additional predictor variables, one for the within-subjects variable (use contrast codes) and one for its interaction with the dummy-coded variable for gender.
3. Guided by Fig. 16.1, create a source table appropriate for this study using general formulas for degrees of freedom. Compute the total R² for each of the four steps. Provide correct formulas for all steps, including the phantom fifth step. Check that all formulas are correct and then answer the questions in item 19 of the detailed instructions.
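The nesting of subject codes within groups may be easier to see in code than in prose. The following sketch builds the nine predictor codes for each of the 16 scores. The function names are hypothetical, and the sketch follows the 0/1 dummy coding for gender that the detailed instructions and the spreadsheet figure use, rather than the -1/+1 contrast codes mentioned above; it therefore assumes exactly two groups.

```python
def nested_subject_codes(group_sizes):
    """Dummy codes for subjects nested within groups.

    Returns one list of N - a codes per subject, ordered group by group.
    The last subject in each group is coded zero on every subject predictor.
    """
    n_predictors = sum(size - 1 for size in group_sizes)
    codes = []
    column = 0
    for size in group_sizes:
        for position in range(size):
            row = [0] * n_predictors
            if position < size - 1:
                row[column] = 1
                column += 1
            codes.append(row)
    return codes


def design_rows(group_sizes, p_codes=(-1, 1)):
    """Predictor rows in spreadsheet order: group 1 at each level of P, then group 2."""
    subject_codes = nested_subject_codes(group_sizes)
    rows = []
    first = 0
    for group, size in enumerate(group_sizes):  # group index doubles as the 0/1 code
        for p in p_codes:
            for s in range(first, first + size):
                # A, S1..S6, P, and the A x P product
                rows.append([group] + subject_codes[s] + [p, group * p])
        first += size
    return rows


for row in design_rows([4, 4]):
    print(row)
```

Each subject's codes are identical on both rows where that subject appears, and only the P and interaction columns change between the two instruction sets.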
Detailed Instructions
1. Begin with the spreadsheet shown in Fig. 15.7. Insert three columns after column H. The present example requires nine predictor variables whereas the previous example required six, so three additional columns are needed.
2. Enter the labels "A," "S1," "S2," "S3," "S4," "S5," "S6," "P," and "AP" in cells C2-K2. These columns will contain the coded variable for the between-subjects variable gender, the remaining six coded subject variables (for a total of seven), the coded variable for instruction group, and the variable for the gender × instruction interaction. Enter the label "Sex" in cell C1 and the label "Inst" in cell J1.
3. Enter the subject numbers in column A, cells 3-18. Recall that for the 2 × 2 factorial example using these same data (Fig. 15.12) subjects 1-8 were male and 9-16 were female; subjects 1-4 and 9-12 were exposed to instruction set I and 5-8 and 13-16 to set II. For this example, we want to keep the characteristics (gender of subject, instruction set) of the four groups the same, at the same time changing instruction to a within-subjects variable. Thus use the numbers 1-4 for the male subjects, who appear in rows 3-6 and again in rows 7-10, and use the numbers 5-8 for the female subjects, who appear in rows 11-14 and again in rows 15-18.
4. There are 8 subjects, so there are 7 (N - 1) coded between-subjects variables in all. Enter dummy codes for the first between-subjects predictor variable, which represents gender of subject, in column C. Use 0 for males, 1 for females. Remember that the first eight scores are for the males (subjects 1-4) and the last eight scores are for the females (subjects 5-8).
5. Next enter dummy-coded variables for the remaining six between-subjects variables, S1-S6, in columns D-I. Remember that the codes used for subject 1 must be the same everywhere subject 1 appears; likewise for the other subjects. (In this case, lines 3 and 7, 4 and 8, ..., 14 and 18, and so forth, are the same; see Fig. 16.2.) Note how the codes are formed within groups (recall that the subject factor is nested within groups). All codes are set to zero initially. The first variable is coded one for the first subject in the first group. The next variable is coded one for the second subject, and so forth, up to but not including the last subject within the group. The next variable after that is coded one for the first subject in the second group, and so forth. This ensures that the appropriate number of predictor variables (N - a) are coded for between-subjects error (subjects within groups or S/A).
6. Enter contrast codes for instruction set in column J. Use -1 for set I and +1 for set II. Remember that the first four scores for the male and female subjects represent the number of button pushes for instruction set I.
7. In column K enter a formula for the gender × instruction interaction (the code for gender multiplied by the code for instruction set, or the code in column C times the one in column J). At this point, values for the nine coded predictor variables for all 16 scores should be in place.
8. On a separate sheet in the workbook, create a source table for the results of your hierarchical regression. In the first row, enter the labels as shown in Fig. 16.3. In column A enter the labels "N=", "p=", and "a=", in rows 9, 10, and 11, respectively. In column B, enter the number of subjects, and levels for the within and between factors.
9. Label the sources of variance in column B (rows 2-7) as shown in Fig. 16.3.
10. In column F enter the appropriate degrees of freedom using the values you entered in cells B9-B11; these formulas are given in Fig. 16.1. If done correctly, the degrees of freedom for the between components should sum to
total between, the within components should sum to total within, and the total between and total within should sum to the grand total (between + within). At this point, you have a spreadsheet that partitions degrees of freedom for any mixed two-factor design. Experiment with different values for N, a, and p and note the effect on how degrees of freedom are apportioned. Now replace N, a, and p with 8, 2, and 2, the values for this exercise.
11. Do step 1. Invoke the multiple-regression routine, specifying number of button pushes (column B) as the dependent variable and gender (column C) as the independent variable. Point step 1 RSQ in the source table to the R² output by the regression program for this step.
12. Do step 2. Again invoke the multiple-regression program, this time specifying all between-subjects variables (columns C-I) as the independent variables. Point step 2 RSQ to the R² just computed.
13. Do step 3. This time the independent variables are the between-subjects variables plus instruction set (columns C-J). Point step 3 RSQ to the R² just computed.
14. Do step 4. The independent variables are the between-subjects variables, instruction set, and the gender × instruction interaction (columns C-K). Point step 4 RSQ to the R² just computed.
15. The phantom step 5 would exhaust the degrees of freedom. Enter 1 (all variance accounted for) in cell C5.
16. Enter the SStotal in cell D7 of the source table. As a check, you should enter the regression statistics from the last step in cells B22-K22, correct the prediction equation in column L, and see if the R² computed in cell M23 agrees with the value computed by the regression program.
17. Complete the source table. The SSs for each step and changes in R²s and SSs for each step can now be computed. Enter formulas for SSs in column E. Enter the appropriate change or summation formulas in column D.
18. Finally, ensure that the appropriate formulas for MSs, Fs, and η²s are entered in columns G, H, and I.
19. What are the critical values for the three F ratios? Which are significant? How do you interpret these results?
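The source table produced by the exercise can be cross-checked without a regression routine. The sketch below is an illustration, not the spreadsheet method itself: it computes each sum of squares from group, subject, and cell means, which for a balanced design like this one gives the same increments as the hierarchical regression steps. The function names and the data layout (one tuple of scores per subject, grouped by gender) are assumptions.

```python
# (set I, set II) button-push counts; four males, then four females.
males = [(102, 79), (125, 93), (95, 75), (130, 69)]
females = [(43, 101), (82, 94), (69, 84), (66, 69)]


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def mixed_anova(groups):
    """Source-table entries for a balanced one-between, one-within design."""
    a = len(groups)
    p = len(groups[0][0])
    subjects = [s for g in groups for s in g]
    n = len(subjects)
    grand = mean(y for s in subjects for y in s)

    ss_total = sum((y - grand) ** 2 for s in subjects for y in s)
    ss_between = p * sum((mean(s) - grand) ** 2 for s in subjects)
    ss_a = sum(len(g) * p * (mean(y for s in g for y in s) - grand) ** 2
               for g in groups)
    ss_p = n * sum((mean(s[j] for s in subjects) - grand) ** 2
                   for j in range(p))
    ss_cells = sum(len(g) * (mean(s[j] for s in g) - grand) ** 2
                   for g in groups for j in range(p))

    ss = {
        "A": ss_a,
        "S/A": ss_between - ss_a,
        "P": ss_p,
        "PA": ss_cells - ss_a - ss_p,
    }
    ss["PS/A"] = ss_total - ss_between - ss["P"] - ss["PA"]  # phantom step
    df = {"A": a - 1, "S/A": n - a, "P": p - 1,
          "PA": (p - 1) * (a - 1), "PS/A": (p - 1) * (n - a)}
    ms = {key: ss[key] / df[key] for key in ss}
    f = {"A": ms["A"] / ms["S/A"],      # between effect vs. between error
         "P": ms["P"] / ms["PS/A"],     # within effects vs. within error
         "PA": ms["PA"] / ms["PS/A"]}
    return ss, df, ms, f


ss, df, ms, f = mixed_anova([males, females])
for source in ("A", "S/A", "P", "PA", "PS/A"):
    print(source, round(ss[source], 1), df[source], round(ms[source], 1))
print({key: round(value, 3) for key, value in f.items()})
```

Note that A is tested against S/A, whereas P and PA are both tested against PS/A, which is why the function needs two different error mean squares.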
Your results should look like those given in Figs. 16.2 and 16.3.
Step  Source             R²     R² change  SS    df  MS     F      pη²
1     A (Gender)         0.215  0.215      1600  1   1600   8.496  0.586
2     S/A (error)        0.367  0.152      1130  6   188.3
3     P (Instruction)    0.386  0.019      144   1   144    0.605  0.092
4     PA (Gend x Inst)   0.808  0.422      3136  1   3136   13.18  0.687
5     PS/A (error)       1      0.192      1428  6   238
      TOTAL (B+W)                          7438  15
FIG. 16.2. Spreadsheet for analyzing the effect of gender (between-subjects factor A) and instruction set (within-subjects factor P) on number of button pushes (source table).
s   #BPs (Y)  A   S1  S2  S3  S4  S5  S6  P    PA   y*y
1   102       0   1   0   0   0   0   0   -1   0    256
2   125       0   0   1   0   0   0   0   -1   0    1521
3   95        0   0   0   1   0   0   0   -1   0    81
4   130       0   0   0   0   0   0   0   -1   0    1936
1   79        0   1   0   0   0   0   0   1    0    49
2   93        0   0   1   0   0   0   0   1    0    49
3   75        0   0   0   1   0   0   0   1    0    121
4   69        0   0   0   0   0   0   0   1    0    289
5   43        1   0   0   0   1   0   0   -1   -1   1849
6   82        1   0   0   0   0   1   0   -1   -1   16
7   69        1   0   0   0   0   0   1   -1   -1   289
8   66        1   0   0   0   0   0   0   -1   -1   400
5   101       1   0   0   0   1   0   0   1    1    225
6   94        1   0   0   0   0   1   0   1    1    64
7   84        1   0   0   0   0   0   1   1    1    4
8   69        1   0   0   0   0   0   0   1    1    289
Sum = 1376, N = 16, Mean = 86, SStot = 7438
FIG. 16.3. Spreadsheet for analyzing the effect of gender (between-subjects factor A) and instruction set (within-subjects factor P) on number of button pushes (predictor variables).
It is worth comparing the results of the two-factor one-between, one-within analysis of variance just performed with the results of the two-between factorial analysis performed in chapter 14 (see Fig. 14.12). The sums of squares associated with the gender (SSA = 1600) and instruction (SSB = SSp = 144) main effects and the gender × instruction interaction (SSAB = SSAp = 3136) remain the same. But the error sums of squares change. The error term for the two-factor between-subjects study (SSs/AB = 2558) is, in effect, split into two pieces, an error term for the between-subjects factor (SSs/A = 1130) and an error term for the within-subjects factor (SSps/A = 1428). For these data, it happens that the two different analyses did not give dramatically different results. For both, the gender effect and the gender × instruction interaction are significant. But note that the one-between, one-within study required half as many subjects as the two-factor between-subjects study.
As noted in the previous paragraph, for the present example both the gender main effect (F(1,6) = 8.50, p < .05) and the gender × instruction interaction (F(1,6) = 13.2, p < .05) are significant. This suggests that males and females responded differently to the different instructions. The gender main effect is qualified by a higher order interaction; consequently, the nature of the interaction needs to be understood before we interpret the gender main effect, and that requires a post hoc test. In the last chapter we analyzed these same scores, assuming that the four groups defined here by a 2 × 2 factorial (one-between, one-within) represented four levels of a single within-subjects factor (infant's diagnosis, see Fig. 15.7). The within-subjects group effect was significant but we delayed interpretation of the results, which likewise required a post hoc test, until this chapter. In a subsequent section, the appropriate post hoc tests for both these studies are described, but first a third exemplar of a 2 × 2 factorial study, one that includes only within-subjects factors, is presented.
Equality of Covariances
When a study includes between- and within-subjects factors, an additional assumption, the homogeneity of covariances, is required. This assumption states that the covariances of the levels of the within-subjects factor are the same for each group. In the present case, this would mean that the covariance between set I scores and set II scores is the same for males and females. A test of this assumption, as described in the next exercise, is provided by SPSS.
Exercise 16.2 SPSS Analysis of a One-Between, One-Within Two-Factor Study
This exercise walks you through an SPSS analysis for a mixed two-factor study. You will use data from Exercise 16.1.
1. Create a new SPSS data file, or adapt one from Exercise 15.1. The data file should contain variables for subject number (s), gender (gen), instruction set I (set1), and instruction set II (set2).
2. Enter 1 through 8 for the 8 subjects in the s column. Enter 0 for males and 1 for females in the gen column. Finally, enter the number of button pushes for each subject in the appropriate column for instruction set. Create appropriate variable labels and value labels for the variables.
3. Select Analyze->General Linear Model->Repeated Measures from the main menu. In the Repeated Measures Define Factor(s) window, type in a name to describe the repeated-measures factor in the Within-Subject Factor Name box (e.g., instruct for instruction set). Enter 2 for the number of levels of the factor and click Add to move this factor definition to the lower box. Click the Define button.
4. Move set1 and set2 to the Within-Subjects Variables box. Move gen to the Between-Subjects Factor(s) box. Check the Descriptive statistics, Estimates of effect size, and Homogeneity tests boxes under Options. Run the analysis.
5. Check the descriptive statistics to ensure that you set up the variables and entered the data correctly. Next, check Box's M to determine if there is homogeneity of covariances. Usually, you would then check the sphericity assumption, but because there are only two levels of the repeated measure in this design, ε = 1 and Mauchly's test of sphericity does not apply. Finally, check Levene's test to make sure the homogeneity of variances assumption holds for the between-subjects factor.
6. Examine the SS, df, MS, F, significance levels, and pη² of the within-subjects effects. Do they agree with your spreadsheet? Because there are only two levels of the within-subjects factor, the multivariate tests, sphericity assumed tests, lower bound, and ε corrected tests will all yield the same F ratios and significance levels.
7. Examine the test of the between-subject effect for gender. Are the SS, MS, and df correct for the sex factor and the error term? What about the F ratio and pη²?
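To make the homogeneity-of-covariances assumption concrete, the sketch below computes the sample covariance between set I and set II scores separately for the males and the females in this chapter's example. It is purely descriptive: it does not implement Box's M or any other significance test, the function name is an assumption, and with only four subjects per group such estimates are very unstable.

```python
def sample_covariance(x, y):
    """Unbiased sample covariance of two equally long lists of scores."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)


# Button-push counts under instruction sets I and II, one entry per subject.
male_set1, male_set2 = [102, 125, 95, 130], [79, 93, 75, 69]
female_set1, female_set2 = [43, 82, 69, 66], [101, 94, 84, 69]

cov_males = sample_covariance(male_set1, male_set2)
cov_females = sample_covariance(female_set1, female_set2)
print(round(cov_males, 2), round(cov_females, 2))
```

The assumption concerns whether these two population covariances are equal; whether an observed difference is larger than chance would produce is what the SPSS homogeneity test addresses.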
278 16.2
TWO-FACTOR STUDIES WITH REPEATED MEASURES TWO WITHIN-SUBJECTS FACTORS In addition to studies with two between-subjects factors (presented in chap. 13), and mixed studies with one between- and one within-subjects factor (presented in the previous section), there is a third possibility for studies with two factors: Both factors could be within subjects. The mixed study discussed in the last section consisted of two groups, one of males and one of females, and we assumed subjects in both groups were randomly selected. But now, imagine that instead of separate individuals, we selected male-female couples instead. The sampling unit would be the couple, not the individual. When couples rather than individuals are sampled, gender of spouse becomes a within-subjects factor. Each spouse would be exposed separately to instruction set I and II, and thus each couple would contribute four scores to the analysis: his scores for the two instruction sets and her scores for the two instruction sets. In this case, the appropriate analysis would treat both gender and instruction set as within-subjects variables. It is worthwhile to consider for a moment how total variance is partitioned for three kinds of repeated-measures studies: one involving a single within-subjects factor, one involving both a between- and a within-subjects factor, and one involving two within-subjects factors. For a study involving one within-subjects factor, like the examples in the last chapter, total variance is partitioned into three components: The first component is associated with variability between subjects, the second with the P main effect, and the third with the interaction between P and subject (see Fig. 15.5). Mean squares are formed for these components and the mean square for P is tested against the mean square for the PS interaction. 
For a study involving one within- and one between-subjects factor, like the example in the last section, total variance is partitioned into five components (see Fig. 16.1). The first two are associated with variability between subjects. The first is associated with the A main effect, the second with variability between subjects within A, and the mean square for A is tested against the mean square for S/A. The last three are associated with variability within subjects. The third is associated with the P main effect, the fourth with the P by A (or A by P) interaction, and the fifth with the P by subjects within A interaction. Mean squares for P and for PA are tested against the mean square for the PS/A interaction.

For a study involving two within-subjects factors, as for the present example, total variance is partitioned into seven components (see Fig. 16.4). The first is associated with variability between subjects and, as with all repeated-measures studies, can be viewed as an irrelevant source of variance, dispensed with (or controlled for) at the outset. Because there are no between-subjects factors, it is not subdivided further. The remaining six components are associated with the P and Q main effects and the PQ, PS, QS, and PQS interactions:

1. S, between-subjects error.
2. P, first within-subjects main effect.
3. PS, error term for P effect.
4. Q, second within-subjects main effect.
5. QS, error term for Q effect.
6. PQ, P by Q interaction.
7. PQS, error term for P by Q interaction.
As you might guess, generalizing from the previous examples, the mean square for P is tested against the mean square for the PS interaction, the MS for Q against the MS for the QS interaction, and the MS for the PQ interaction against the MS for the PQS interaction.

For a PQ design with N subjects (p levels for within-subjects factor P, q levels for within-subjects factor Q), there are a total of Npq criterion scores (each subject or sampling unit contributes p times q scores). Hence the total number of degrees of freedom associated with the criterion scores is the number of scores minus one:

$$df_{\text{total}} = Npq - 1$$

As always, the degrees of freedom between subjects is N - 1 and, as noted previously, for a two-factor repeated-measures study it is not subdivided further:

$$df_{\text{between}} = df_{S} = N - 1$$

The degrees of freedom within subjects is the number of scores minus the number of subjects, which for the PQ design is Npq - N:

$$df_{\text{within}} = Npq - N$$

The partitioning of these Npq - N within-subjects degrees of freedom into separate components is straightforward, as long as you remember that the error terms for each effect consist of an interaction with subject. Thus the degrees of freedom associated with the P main effect are p - 1 and the degrees of freedom for the PS interaction are p - 1 times N - 1:

$$df_{P} = p - 1, \qquad df_{PS} = (p - 1)(N - 1)$$

Similar reasoning applies to the Q, QS, PQ, and PQS components:

$$df_{Q} = q - 1, \qquad df_{QS} = (q - 1)(N - 1)$$

$$df_{PQ} = (p - 1)(q - 1), \qquad df_{PQS} = (p - 1)(q - 1)(N - 1)$$
Source                        Degrees of freedom
S, subjects                   N - 1
TOTAL between subjects        N - 1
P main effect                 p - 1
PS interaction                (p - 1)(N - 1)
Q main effect                 q - 1
QS interaction                (q - 1)(N - 1)
PQ interaction                (p - 1)(q - 1)
PQS interaction               (p - 1)(q - 1)(N - 1)
TOTAL within subjects         Npq - N
TOTAL (between + within)      Npq - 1
FIG. 16.4. Degrees of freedom for a two-factor within-subjects study. The number of levels for within-subjects factors P and Q are symbolized with p and q respectively. The number of subjects is symbolized with N.
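The partition in Fig. 16.4 can be checked numerically. The following is a minimal Python sketch, not part of the spreadsheet template; the function name `pq_within_df` is hypothetical and chosen only for illustration.

```python
def pq_within_df(N, p, q):
    """Degrees of freedom for a study with two within-subjects factors (Fig. 16.4)."""
    df = {
        "S": N - 1,
        "P": p - 1,
        "PS": (p - 1) * (N - 1),
        "Q": q - 1,
        "QS": (q - 1) * (N - 1),
        "PQ": (p - 1) * (q - 1),
        "PQS": (p - 1) * (q - 1) * (N - 1),
    }
    # Total within = every component except the between-subjects component S.
    df["within"] = sum(v for k, v in df.items() if k != "S")
    df["total"] = df["S"] + df["within"]
    return df


# Values for the button-pushing example: 4 couples, 2 genders, 2 instruction sets.
example = pq_within_df(N=4, p=2, q=2)
for source, value in example.items():
    print(f"{source:7s} {value}")
```

The sums work out because each effect and its error term combine neatly: (p - 1) + (p - 1)(N - 1) = N(p - 1), and likewise for Q and PQ, so the six within components total N[(p - 1) + (q - 1) + (p - 1)(q - 1)] = N(pq - 1) = Npq - N.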
With a little bit of algebraic manipulation, you should be able to convince yourself that the degrees of freedom for the within-subjects components do indeed sum to Npq - N, as they should.

The predictor variables required for the hierarchic analysis of a two-factor repeated-measures study are formed using the same techniques we have developed for earlier studies. First the between-subjects predictor variables are formed (step 1). Then predictor variables representing the within-subjects effects and their interactions with subjects are entered, following a P, PS, Q, QS, PQ, PQS order (steps 2-7). Exactly how this works is perhaps best conveyed with an example.

Exercise 16.3 Analysis of a Two-Within Two-Factor Study

The template developed for this exercise allows you to analyze data from a two-factor study when both factors are within subjects. You use data from the button-pushing study, but assume that husband and wife pairs were each exposed separately to two different sets of instructions.

General Instructions
1. Analyze data from the button-pushing study assuming that both factors are within subjects. Assume the first four scores are for husbands exposed to instruction set I and the second four scores are for the same husbands exposed to set II. Likewise, assume scores 9-12 are for their wives exposed to set I and scores 13-16 are for the same wives exposed to set II.
2. There are four couples; therefore, establish three dummy-coded predictor variables to identify which couple is associated with each of the 16 scores. Establish 9 additional predictor variables: 1 codes the gender of the person associated with the score, 3 the gender x subject (couple) interaction, 1 the instruction set associated with the score, 3 the instruction x subject interaction, and the final predictor codes the gender x instruction interaction. It is not necessary to form the 3 (and final) predictor variables associated with the gender x instruction x subject interaction explicitly. The R² associated with them is the residual R² when all the other 12 predictor variables have been entered.
3. Guided by Fig. 16.4, create a source table appropriate for this study using general formulas for degrees of freedom. Compute the total R² for each of the six steps. Provide correct formulas for all steps, including the phantom seventh step. Check that all formulas are correct and then answer the questions in part 22 of the detailed instructions.

Detailed Instructions
1. Begin with the spreadsheet shown in Fig. 16.2. Insert three columns after column K. The present example requires 12 predictor variables whereas the previous required 9.
2. Enter the labels "S1," "S2," "S3," "P," "S1P," "S2P," "S3P," "Q," "S1Q," "S2Q," "S3Q," and "PQ" in cells C2-N2. These columns will contain the coded variables for the subject factor, for gender and its interaction with subject, for instruction set and its interaction with subject, and for the gender x instruction interaction. Move the label "Sex" from cell C1 to F1. Leave the label "Inst" in cell J1.
3. There are only four subjects (husband-wife pairs) for this example. Enter appropriate subject numbers in column A.
4. There are 4 subjects; thus there are 3 (N - 1) coded between-subjects variables. Enter dummy-coded variables for the subject codes, S1-S3, in columns C-E. Remember the codes used for subject 1 must be the same everywhere subject 1 appears; likewise for the other subjects. (In this case, lines 3, 7, 11, and 15, and so forth, are the same; see Fig. 16.6.)
5. Enter contrast codes for gender (spouse) in column F. Use -1 for husbands and +1 for wives. Assume the first eight scores given (rows 3-10) are for husbands.
6. In columns G-I enter formulas for the subject by gender interaction (the code for each subject variable multiplied by the code for gender). The code in column G is the product of columns C and F, the code in column H is the product of columns D and F, and the code in column I is the product of columns E and F.
7. Enter contrast codes for instruction set in column J. Use -1 for Set I and +1 for Set II. Assume the first four scores for the husbands and the first four for the wives represent the number of button pushes for instruction Set I. (The contrast codes currently in place from the last exercise should be correct.)
8. In columns K-M enter formulas for the subject by instruction interaction (the code for subject multiplied by the code for instruction set). The code in column K is the product of columns C and J, the code in column L is the product of columns D and J, and the code in column M is the product of columns E and J.
9. In column N enter a formula for the gender x instruction interaction (the code for gender multiplied by the code for instruction set, or the product of columns F and J). At this point, values for the 12 coded predictor variables for all 16 scores should be in place.
10. Update the source table you created in the last exercise. Change the label "a=" in cell A11 to "p=".
11. Extend and modify the 1-between, 1-within source table so that it is appropriate for the present 0-between, 2-within example. Label the sources of variance as shown in Fig. 16.5.
12. In column F enter the appropriate formulas for degrees of freedom; these formulas are given in Fig. 16.4. If done correctly, the degrees of freedom for the within components should sum to total within, and the total between and total within should sum to the grand total (between + within).
13. At this point, you have a spreadsheet that partitions degrees of freedom for any PQ design (both factors within subjects). Experiment with different values for N, p, and q and note the effect on how degrees of freedom are apportioned. Now replace N, p, and q with 4, 2, and 2, the values for this exercise.
14. Do step 1. Invoke the multiple-regression routine, specifying number of button pushes (column B) as the dependent variable and the coded variables for subject (columns C-E) as the independent variables. Point step 1 R² in the source table to the R² computed by the regression program for this step.
15. Do step 2. Again invoke the multiple-regression program, this time adding the coded variable for gender (column F) to the independent variable list. Point step 2 R² to the R² just computed.
16. Do step 3. This time add the variables for the gender x subject interaction (columns G-I) to the equation. Point step 3 R² to the R² just computed.
17. Do step 4. Add the variable for instruction set (column J) to the equation. Point step 4 R² to the R² just computed.
18. Do step 5. Add the variables for the instruction set by subject interaction (columns K-M) to the equation. Point step 5 R² to the R² just computed.
19. Do step 6. Add the variable for the gender x instruction set interaction (column N) to the equation. Point step 6 R² to the R² just computed.
20. The phantom step 7 would exhaust the degrees of freedom. Enter 1.00 (all variance accounted for) in cell C8. Enter the SStotal in cell E9.
21. Complete the source table. The SSs for each step and changes in R²s and SSs for each step can now be computed. Enter formulas for SSs in column E. Enter the appropriate change or summation formulas in column D.
22. Finally, ensure that the appropriate formulas for MSs, Fs, and partial η²s are entered in columns G through I. What are the critical values for the three F ratios? Which are significant? How do you interpret these results?

At this point, your spreadsheet should look like the one shown in Figs. 16.5 and 16.6. Now you can compare results of three different analyses of the same 2 x 2 factorial data: the two-between, no-within (Fig. 14.12), the one-between, one-within (Fig. 16.3), and the no-between, two-within (Fig. 16.6) arrangements. For these data, the interaction effect was quite strong and, as a result, it was significant no matter the kind of factors employed. But remember, the first study consisted of 16, the second of 8, and the third of just 4 subjects. This demonstrates the greater economy often afforded by studies that include within-subjects factors. In this case, we were able to detect effects present in the data with fewer subjects when we construed the study as one including repeated measures.
     A     B                    C       D        E        F    G        H       I
1    Step  Source               R²      R²       SS       df   MS       F       partial η²
                                        change
2    1     TOTAL btwn. Ss       0.114   0.114    850.5     3
3    2     P (gender)           0.329   0.215    1600      1   1600     17.17   0.851
4    3     PxS (error)          0.367   0.038    279.5     3   93.17
5    4     Q (instruction)      0.386   0.019    144       1   144      0.389   0.115
6    5     QxS                  0.536   0.149    1110      3   369.8
7    6     PQ (gender x inst)   0.957   0.422    3136      1   3136     29.54   0.908
8    7     PxQxS (error)        1       0.043    318.5     3   106.2
9          TOTAL (B+W)                           7438     15

FIG. 16.5. Source table for analyzing the effect of gender of spouse and instruction set (within-subjects factors P and Q) on number of button pushes.
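The hierarchic steps behind this source table can also be sketched outside a spreadsheet. The Python sketch below is only an illustration, not the template the exercise uses. It assumes the scores and codes described in Exercise 16.3, and in place of a regression routine it orthogonalizes each new predictor against those already entered (a Gram-Schmidt procedure), which yields the same step-by-step increase in the regression sum of squares. The helper names (`dot`, `times`, `hierarchic_ss`) are hypothetical.

```python
# Button-pushing scores in the order assumed by Exercise 16.3.
scores = [102, 125, 95, 130, 79, 93, 75, 69,
          43, 82, 69, 66, 101, 94, 84, 69]
subject = [1, 2, 3, 4] * 4
P = [-1] * 8 + [1] * 8            # gender: husbands -1, wives +1
Q = ([-1] * 4 + [1] * 4) * 2      # instruction set: I -1, II +1


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def times(a, b):
    return [x * y for x, y in zip(a, b)]


# Dummy codes S1-S3 for the four couples.
S = [[1 if s == k else 0 for s in subject] for k in (1, 2, 3)]
steps = [
    ("S", S),
    ("P", [P]),
    ("PS", [times(s, P) for s in S]),
    ("Q", [Q]),
    ("QS", [times(s, Q) for s in S]),
    ("PQ", [times(P, Q)]),
]


def hierarchic_ss(y, steps):
    """Return SStotal and the SS added at each hierarchic step."""
    n = len(y)
    mean = sum(y) / n
    yc = [v - mean for v in y]
    ss_total = dot(yc, yc)
    basis = [[1.0] * n]           # start with the intercept
    table = []
    for name, columns in steps:
        ss_step = 0.0
        for col in columns:
            v = [float(x) for x in col]
            for b in basis:       # remove what earlier predictors explain
                coef = dot(v, b) / dot(b, b)
                v = [vi - coef * bi for vi, bi in zip(v, b)]
            norm = dot(v, v)
            if norm > 1e-9:
                ss_step += dot(yc, v) ** 2 / norm
                basis.append(v)
        table.append((name, ss_step))
    return ss_total, table


ss_total, table = hierarchic_ss(scores, steps)
ss_residual = ss_total - sum(ss for _, ss in table)
r2 = 0.0
for name, ss in table:
    r2 += ss / ss_total
    print(f"{name:4s} SS = {ss:8.1f}  cumulative R2 = {r2:.3f}")
print(f"PQS  SS = {ss_residual:8.1f}  (residual, phantom step 7)")
```

The residual left after step 6 plays the role of the phantom seventh step: it is the PQS error term against which the PQ interaction is tested.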
Exercise 16.4 SPSS Analysis of a Two-Within Two-Factor Study

This exercise walks you through an SPSS analysis for a two-factor study in which both factors are within subjects. You will use data from Exercise 16.1.
1. Create a new SPSS data file, or adapt the one from Exercise 16.2. The data file should contain variables for subject number (s), and button pushes for the four repeated-measures conditions: husbands in instruction set I (set1h), wives in instruction set I (set1w), husbands in instruction set II (set2h), and wives in instruction set II (set2w).
2. Enter 1 through 4 for the four sets of matched scores. Enter the number of button pushes in the appropriate column for the instruction set and gender combinations. Create appropriate variable labels and value labels for the variables.
3. Select Analyze->General Linear Model->Repeated Measures from the main menu. In the Repeated Measures Define Factor(s) window, type in a name to describe the first repeated-measures factor, instruction set (e.g., instruct for instruction set), in the Within-Subject Factor Name box. Enter 2 for the number of levels of the factor and click Add to move this factor definition to the lower box. Do the same for the gender variable. Click the Define button.
      A     B      C    D    E    F     G     H     I     J      K     L     M     N     O
 1          #BPs   Ss             Sex                     Inst                     IxS   SStot
 2    s     Y      S1   S2   S3   P     S1P   S2P   S3P   Q      S1Q   S2Q   S3Q   PQ    y*y
 3    1     102    1    0    0    -1    -1    0     0     -1     -1    0     0     1     256
 4    2     125    0    1    0    -1    0     -1    0     -1     0     -1    0     1     1521
 5    3     95     0    0    1    -1    0     0     -1    -1     0     0     -1    1     81
 6    4     130    0    0    0    -1    0     0     0     -1     0     0     0     1     1936
 7    1     79     1    0    0    -1    -1    0     0     1      1     0     0     -1    49
 8    2     93     0    1    0    -1    0     -1    0     1      0     1     0     -1    49
 9    3     75     0    0    1    -1    0     0     -1    1      0     0     1     -1    121
10    4     69     0    0    0    -1    0     0     0     1      0     0     0     -1    289
11    1     43     1    0    0    1     1     0     0     -1     -1    0     0     -1    1849
12    2     82     0    1    0    1     0     1     0     -1     0     -1    0     -1    16
13    3     69     0    0    1    1     0     0     1     -1     0     0     -1    -1    289
14    4     66     0    0    0    1     0     0     0     -1     0     0     0     -1    400
15    1     101    1    0    0    1     1     0     0     1      1     0     0     1     225
16    2     94     0    1    0    1     0     1     0     1      0     1     0     1     64
17    3     84     0    0    1    1     0     0     1     1      0     0     1     1     4
18    4     69     0    0    0    1     0     0     0     1      0     0     0     1     289
19    Sum=  1376                                                                         7438
20    N=    16
21    Mean= 86

FIG. 16.6. Spreadsheet for analyzing the effect of gender of spouse and instruction set (within-subjects factors P and Q) on number of button pushes.
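The coded columns of this spreadsheet follow mechanically from the rules given in the detailed instructions of Exercise 16.3. As a rough Python sketch (an illustration only; the variable names are hypothetical), each row can be generated from its position in the list of scores:

```python
scores = [102, 125, 95, 130, 79, 93, 75, 69,
          43, 82, 69, 66, 101, 94, 84, 69]
n_subjects = 4

rows = []
for i, y in enumerate(scores):
    s = i % n_subjects + 1                            # subject (couple) number, column A
    dummies = [1 if s == k else 0 for k in (1, 2, 3)]  # S1-S3, columns C-E
    p = -1 if i < 8 else 1                            # gender: husbands first, column F
    q = -1 if (i // 4) % 2 == 0 else 1                # instruction set, column J
    row = ([s, y] + dummies
           + [p] + [d * p for d in dummies]           # P and S1P-S3P, columns F-I
           + [q] + [d * q for d in dummies]           # Q and S1Q-S3Q, columns J-M
           + [p * q])                                 # PQ, column N
    rows.append(row)

mean = sum(scores) / len(scores)
ss_total = sum((y - mean) ** 2 for y in scores)       # sum of column O

for row in rows:
    print(" ".join(f"{value:4d}" for value in row))
print("Sum =", sum(scores), " N =", len(scores), " Mean =", mean, " SStotal =", ss_total)
```

Because every interaction code is just the product of two codes already in the row, an error in the P or Q column propagates to the interaction columns, which is one reason to check the simple codes first.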
4. Move set1h, set1w, set2h, and set2w to the correct places in the Within-Subjects Variables box. Be careful that you move the variables to the correct location; note the order SPSS expects for the variables above the window. For example, if you defined instruction set as the first variable in the previous step, then SPSS expects you to enter all levels of set I first, followed by all levels of set II. Check the Descriptive Statistics and Estimates of Effect Size boxes under Options. Run the analysis.
5. Check the descriptive statistics to ensure that you set up the variables and entered the data correctly.
6. Examine the SS, df, MS, F, significance levels, and partial η² of the within-subjects effects. Do they agree with your spreadsheet?
16.3 EXPLICATING INTERACTIONS WITH REPEATED MEASURES

For all three two-factor designs analyzing number of button pushes discussed in this and previous chapters, the gender main effect and the gender x instruction interaction were significant. In chapter 14 you were cautioned that significant main effects should not be discussed until their qualifying interactions are understood. It does not matter whether factors involved in the interaction are between or within subjects; the same caution applies. Moreover, for all three designs, the same test, the Tukey post hoc test described in chapter 13, can be used to understand the nature of any significant interactions. There is one qualification. As explained shortly, the MSerror used for the post hoc analysis of the two-within-subjects study is not the same as the MSerror from its source table.

The partitioning of the total sum of squares for these three different 2 x 2 factorial studies is shown in Fig. 16.7, which should clarify exactly how variance components are partitioned, depending on whether the two factors of a 2 x 2 study are both between subjects, mixed, or both within subjects. Fig. 16.8 gives the various statistics required for a post hoc test of a significant interaction for each of the three studies. The post hoc analysis for the two-between, no-within study was presented earlier (see Fig. 14.13). Here we see that the number of groups or G is the same for all three analyses, as is the number of scores per group or n. However, the degrees of freedom for error varies, as does the MSerror.
2-between, 0-within        1-between, 1-within        0-between, 2-within
(Fig. 14.12)               (Fig. 16.3)                (Fig. 16.6)
Source     SS     df       Source     SS     df       Source     SS     df
A          1600    1       A          1600    1       S           851    3
B           144    1       S/A        1130    6       P          1600    1
AB         3136    1       P           144    1       PxS         280    3
S/AB       2558   12       PA         3136    1       Q           144    1
                           PxS/A      1428    6       QxS        1109    3
                                                      PQ         3136    1
                                                      PQxS        319    3

FIG. 16.7. Partitioning total sums of squares and degrees of freedom for a two-between, no-within; a one-between, one-within; and a no-between, two-within study with the same data; sums of squares may not add exactly, due to rounding.
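One way to read Fig. 16.7 is to check how the error terms of the three analyses relate to one another. The short Python sketch below uses the rounded sums of squares reported in Figs. 16.5 and 16.7; the comparison itself is offered only as an arithmetic illustration, and the dictionary names are hypothetical.

```python
# Sums of squares as reported (rounded) in the source tables.
two_between = {"A": 1600, "B": 144, "AB": 3136, "S/AB": 2558}
one_between_one_within = {"A": 1600, "S/A": 1130, "P": 144,
                          "PA": 3136, "PxS/A": 1428}
two_within = {"S": 850.5, "P": 1600, "PxS": 279.5, "Q": 144,
              "QxS": 1110, "PQ": 3136, "PQxS": 318.5}

totals = {
    "2-between": sum(two_between.values()),
    "1-between, 1-within": sum(one_between_one_within.values()),
    "2-within": sum(two_within.values()),
}

# Error terms of the finer analyses, pooled to compare with the coarser ones.
pooled_s_a = two_within["S"] + two_within["PxS"]          # compare with S/A
pooled_ps_a = two_within["QxS"] + two_within["PQxS"]      # compare with PxS/A
pooled_s_ab = (one_between_one_within["S/A"]
               + one_between_one_within["PxS/A"])         # compare with S/AB

print(totals)
print("S + PxS      =", pooled_s_a, "vs S/A   =", one_between_one_within["S/A"])
print("QxS + PQxS   =", pooled_ps_a, "vs PxS/A =", one_between_one_within["PxS/A"])
print("S/A + PxS/A  =", pooled_s_ab, "vs S/AB  =", two_between["S/AB"])
```

Within rounding, each analysis accounts for the same total of about 7438, and the effect sums of squares (1600, 144, 3136) never change. What changes from one design to the next is only how the remaining variability is divided among error terms.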
The dferror for the two-between study is 12, which makes sense when we recall that 1 degree of freedom each is lost for the A main effect, the B main effect, and the AB interaction; the MSerror is then read directly from the source table (Fig. 14.12). Similarly, dferror for the one-between, one-within study is 6 because 1 degree of freedom each is lost for the A main effect, the P main effect, and the PA interaction and an additional 6 are lost for S/A; again the MSerror can be read directly from the source table (Fig. 16.3).

Matters are somewhat different for the two-within study. Here, the significant PQ interaction tells us that the means of the four groups defined by crossing the P and Q factors differ. The two-within analysis separates the error terms into three, but to perform the Tukey post hoc test, we need just one error term for the four groups. That is, instead of analyzing these four groups as a two-factor, PQ design, we need to redo the analysis as a four-groups analysis, one that has 3 degrees of freedom for the four groups and 9 degrees of freedom for the PS error term. In fact, we have already performed this four-group analysis (see Fig. 15.8). Thus we know that the MSerror for the Tukey post hoc test for the PQ interaction is 189.7, as shown in Fig. 16.8.

For these data, the Tukey critical differences (TCD) for the one-between, one-within and for the no-between, two-within studies are 37.8 and 30.4, respectively, and, as a result, the post hoc analysis for the one-between, one-within study is somewhat different from the post hoc results for the two-between, no-within study presented earlier. The next exercise provides you with the opportunity to describe exactly how the results of these different analyses vary.

Exercise 16.5 Post Hoc Tests for Repeated-Measures Studies

This exercise provides practice in interpreting post hoc results for studies that include repeated measures.
1. For the two studies analyzed in this chapter, demarcate differences between group means that do and do not exceed the Tukey critical difference. Display your results as in Figs. 13.4 and 13.5. How do you interpret these results?
                          Analysis
Post hoc       2-between,      1-between,      0-between,
statistic      0-within        1-within        2-within
               (Fig. 14.12)    (Fig. 16.3)     (Fig. 16.6)
G              4               4               4
dferror        12              6               9
q              4.20            4.90            4.41
MSerror        213.2           238.0           189.7
n              4               4               4
TCD            30.7            37.8            30.4
FIG. 16.8. Statistics needed for a post hoc test of a significant interaction for a two-between, no-within; a one-between, one-within; and a no-between, two-within study with the same data.
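The TCD values in Fig. 16.8 can be recomputed from the other entries in the table. The Python sketch below assumes the usual form of the Tukey critical difference, TCD = q x sqrt(MSerror / n); that formula is not restated in this section, but it does reproduce the tabled values. The function name is hypothetical.

```python
from math import sqrt


def tukey_critical_difference(q, ms_error, n):
    """Assumed form: studentized range statistic times the SE of a group mean."""
    return q * sqrt(ms_error / n)


# (q, MSerror, n) for each analysis, as listed in Fig. 16.8.
analyses = {
    "2-between, 0-within": (4.20, 213.2, 4),
    "1-between, 1-within": (4.90, 238.0, 4),
    "0-between, 2-within": (4.41, 189.7, 4),
}

tcd = {name: tukey_critical_difference(*values)
       for name, values in analyses.items()}
for name, value in tcd.items():
    print(f"{name}: TCD = {value:.1f}")
```

Note that the one-between, one-within analysis has the largest critical difference: it has the fewest error degrees of freedom (hence the largest q) and the largest MSerror.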
16.4 GENERALIZING TO MORE COMPLEX DESIGNS
At this point you should be able to generalize what you have learned about repeated-measures studies in this and the last chapter to more complex situations, which is the point of the next exercise.

Exercise 16.6 Degrees of Freedom for Three-Factor Repeated-Measures Studies

For this exercise you are asked to generalize what you have learned about partitioning total sums of squares and degrees of freedom to more complex studies involving repeated measures.
1. Create a table, modeled on the one shown in Fig. 16.4, showing sources of variance and general formulas for degrees of freedom for a two-between, one-within design. Compute the degrees of freedom if N = 96, a = 2, b = 4, and p = 3.
2. Now create a table showing sources of variance and general formulas for degrees of freedom for a one-between, two-within design. Compute degrees of freedom assuming N = 45, a = 3, p = 2, and q = 4.
3. Describe studies for which these designs would be appropriate. Provide plausible names and descriptions for each of the factors and its levels.
It is easy to imagine even more complex designs than those listed in the last exercise. Interpreting the results of the appropriate analyses, however, is less easy. Studies that include many factors also include many possible interactions between factors, some of which may turn out to be statistically significant, and making sense of unexpected but significant second- and third-order interactions can challenge even the most seasoned investigator. If interactions have been predicted, there is no problem, of course. Interpretation follows from whatever theorizing formed the basis for the prediction. But unexpected interactions are another matter, and often there may be no obvious or plausible explanation for a particular pattern of results. Investigators often feel compelled to lavish considerable interpretive energy on such unexpected and seemingly difficult to explain results. However, because such results may reflect nothing more than a chance result, one unlikely to replicate, whatever interpretation is offered should in any case be labeled clearly for what it is: post hoc speculation, a tale woven after the fact. There is no reason to pay much attention to such findings unless and until they reappear in subsequent studies.

Exercise 16.7 Analysis of Number of Hours Spent Studying

This exercise gives you an opportunity to analyze data from a new two-factor study that includes a repeated measure. In addition, this exercise provides additional practice with a post hoc analysis of a repeated-measures factor and introduces a trend analysis as well.
1. A class consists of five males and six females. They are asked to keep careful records of the number of hours they spend studying statistics on each of three successive Wednesdays (the data are provided in Fig. 16.9).
Subject   1   2   3   4   5   6   7   8   9   10   11
Gender    M   F   F   F   M   M   M   F   F   F    M
Number of hours Week 2 Week 3 Weekl 5 5 3 6 6 5 7 4 6 5 8 5 5 6 3 5 7 4 5
7 6 6 6 5 7 5
6 7 6 7 8 6 5
FIG. 16.9. Data for sex x week, repeated-measures, trend analysis.
2. Analyze these data using contrast codes for gender (M = -1, F = +1) and linear (P1) and quadratic (P2) contrast codes for week (see Fig. 16.10). Verify that the contrast codes for week obey contrast code formation rules. Use dummy-coded variables for subject. It is not necessary to reorder the subjects, putting all males and all females together, but you must pay attention to which subjects are nested within which gender group. 3. Perform an analysis of variance for these data using the appropriate design. Is there a gender effect on number of hours? (Do males study significantly more, or fewer, hours than females?) Is there an effect for week? (Do students study more some weeks than others?) Do gender and week interact? (Is the pattern of differences among weeks different for males and females?) If there is a week effect, or a gender x week interaction, perform the appropriate post hoc analysis to elucidate the exact nature of the effect. 4. The first within-subjects predictor variable, P1, can be interpreted as a linear trend. Note, values increase from -1, to 0, to +1 for weeks 1-3 respectively. The second within-subjects predictor variable, P2, can be interpreted as a quadratic trend. Note that the values change from +1 to -2 and then back to +1 for weeks 1-3 respectively, tracing a "U-shaped" relation. (Contrast codes like these are called orthogonal polynomials.) For the previous analysis, in order to evaluate the significance of the within-subjects variable, you entered P1 and P2 in one step. Now enter each one separately and evaluate the significance of each. This is like a planned comparison analysis, except that P1 represents a linear trend and P2 a quadratic trend. In this case, which kind of trend accounts for the differences among weeks?
            Coded variable
Week        P1      P2
Week 1      -1       1
Week 2       0      -2
Week 3       1       1

FIG. 16.10. Coded variables representing linear and quadratic trend for week.
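Step 2 of the exercise asks you to verify that the week codes obey the contrast code formation rules. A minimal Python check is sketched below. It assumes the rules are the usual ones (each set of codes sums to zero, and the cross-products of any two sets sum to zero); the three-week record of hours at the end is hypothetical and not taken from Fig. 16.9.

```python
P1 = [-1, 0, 1]    # linear trend codes for weeks 1-3
P2 = [1, -2, 1]    # quadratic trend codes for weeks 1-3


def sums_to_zero(codes):
    return sum(codes) == 0


def are_orthogonal(a, b):
    return sum(x * y for x, y in zip(a, b)) == 0


print("P1 sums to zero:", sums_to_zero(P1))
print("P2 sums to zero:", sums_to_zero(P2))
print("P1 and P2 orthogonal:", are_orthogonal(P1, P2))

# Hypothetical record for one student whose hours rise steadily over the weeks.
hours = [5, 6, 7]
linear_score = sum(c * h for c, h in zip(P1, hours))
quadratic_score = sum(c * h for c, h in zip(P2, hours))
print("linear:", linear_score, "quadratic:", quadratic_score)
```

For the hypothetical student, the steady increase shows up entirely in the linear score, and the quadratic score is zero because the middle week lies exactly on the line joining the first and third weeks.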
Limitations of Repeated-Measures Analyses With Spreadsheets

As mentioned earlier, factorial studies, both with and without repeated-measures factors, often offer interesting economies: Several research factors can be investigated in the context of a single study. However, as discussed in the previous paragraphs, when too many factors are included, the results may become so complex that interpretation is strained. This limitation on the number of factors is purely conceptual. In addition, depending on the computer program used to analyze data, the number of factors can be limited by technical considerations as well. But more than that, the sheer number of dummy-coded predictor variables that a multiple-regression approach to repeated measures requires to represent subjects, to say nothing of the other factors, will almost always cause you to use programs designed specifically for repeated-measures analyses and not multiple regression. Standard statistical packages (e.g., SAS and SPSS) limit the number of variables that can be included in their multiple-regression routines, but typically the limit is so large that it rarely if ever becomes a consideration. Multiple-regression routines incorporated in spreadsheets historically were quite constrained, but currently are not, so such constraints are no longer a barrier to using spreadsheet programs for repeated-measures analyses.

Studies with repeated measures, as emphasized throughout this chapter and the previous one, have a number of advantages. A thorough understanding of analysis of variance designs that include repeated measures is thus an important skill to include in your statistical armamentarium. And indeed, given the specific examples described here in detail, you should now be able to generalize what you have learned to other, more complex studies.
The advantage of the approach to repeated measures adopted here (regarding subject as an explicit research factor, representing it with appropriately coded predictor variables, and treating it as a covariate) is the ease with which studies with and without repeated measures can all be integrated into a common hierarchic multiple-regression approach. True, when adopting a multiple-regression approach, studies with repeated measures can generate an inordinate number of predictor variables. For that reason, use of specific programs in standard packages (such as SPSS GLM for repeated measures, as described in Exercise 16.4), which require only that the number of between and within factors and the number of levels for each be specified, is especially attractive on practical grounds. Nonetheless, once you understand how to approach studies including repeated measures as a series of hierarchic steps, you should have a clear grasp of the logic of these analyses. In particular, you should understand the way total variance and degrees of freedom are partitioned among the various effects. This understanding will be helpful to you no matter whether you use the hierarchic multiple-regression procedures described here or come to rely on specific repeated-measures programs in standard statistical packages.

As a practical matter, however, now that you understand the logic of repeated-measures analyses, you will almost always use a program in a statistical package to perform your analyses. In practice, the number of dummy-coded predictor variables that a reasonable number of subjects requires is simply too cumbersome for routine use of multiple regression for repeated-measures analyses.

A final comment: Some situations that seem appropriate for a repeated-measures analysis may, in fact, be better handled in a more typical multiple-regression way.
The next chapter describes one commonly encountered situation for which this is true and also describes power analytic methods for determining the minimum number of subjects that should be included in a study.
17
Power, Pitfalls, and Practical Matters
In this chapter you will:
1. Learn how to analyze data from pretest, posttest repeated-measures studies using analysis of covariance.
2. Learn how to do a power analysis. A power analysis helps you decide how many subjects should be included in a study. If too few subjects are included, significant effects may not be detected.

In this final chapter two important but somewhat disparate topics are discussed. The first topic is quite specific. Given what you have just learned about analyzing studies that include within-subjects factors, you might be tempted to apply such analyses to studies including pretest and posttest measures because such measures are clearly repeated. Actually, data from such studies are analyzed more easily as an analysis of covariance and an example is presented here. Some readers may be able to apply this example directly to their own work, but for all readers this final example serves as a way to summarize and exemplify the general analytic principles emphasized throughout this book.

The second topic is more general. Before conducting any study it is important to determine how many subjects should be included. If too few subjects are studied, the probability of finding significant results may be unacceptably low. On the other hand, if too many subjects are included, the investigator may spend more time and effort on the study than is warranted. Some general methods for determining what constitutes an appropriate sample size have been developed and are presented here.
17.1 PRETEST, POSTTEST: REPEATED MEASURE OR COVARIATE?

The 16 scores for the button-pushing study have been used to illustrate a number of different analyses throughout this book. In chapter 14, for example, we assumed that each score was contributed by a different person, and that the first eight subjects were male and the second eight were female; the first four male and female subjects received instruction set I whereas the last four received instruction set II. Accordingly, these data were analyzed as a 2 x 2 between-subjects design (two-between). Then, in chapter 16, we showed how to analyze these same data, first assuming that instruction set was a within-subjects factor (one-between, one-within), then assuming that both gender (of spouse) and instruction set were within-subjects factors (two-within).

In this section, a somewhat different set of assumptions is made but again the same scores are used as an example. Assume that eight people have been randomly selected to serve as subjects and four of them have been randomly assigned to a training group whereas the second four are assigned to a no-training control group. Thus treatment (training vs. no training) for the present example, like gender in the first two of the previous three examples, is a between-subjects factor or variable. Assume further that subjects are exposed to videotapes and are asked to push a button whenever the infants portrayed in the tapes do something communicative. The number of button pushes constitutes the pretest or the pretreatment assessment. Next, the four treatment-group subjects receive training concerning common infant cues. Finally, all subjects are again asked to push a button whenever infants do something communicative. The number of times the button is pushed this time is the posttest or posttreatment assessment.

As with all such studies involving a pretest, a posttest, and different treatments given to different subjects, the research question is: Does the treatment matter? Are the treated subjects' scores, on average, different from the untreated subjects' scores? There are two ways we might seek to answer this question, given the situation described in the preceding paragraph, but before we do that it is useful to discuss an even simpler situation. Imagine that we did not have a control group, and that all subjects were first given a pretest, then a treatment, and finally a posttest.
Such a single-group procedure is not recommended as a way to demonstrate unambiguously that a particular treatment has a desired effect. Subjects might score higher on the posttest, for example, simply because of some effect of elapsed time, and not because of the treatment. Still, it is useful for present purposes to consider how data from such a study would be analyzed. Pretest and posttest scores constitute repeated measures, two scores from each subject. If this single-group study had been presented in chapter 9, you might have been tempted to correlate pretest and posttest scores. Certainly you could compute regression and correlation coefficients and these would tell you how, and how well, you could predict the posttreatment from the pretreatment assessment. But this would not tell you whether the pretreatment and posttreatment assessments were different. This is a different question and requires a different analysis. In order to determine if the mean pretest score is different from the mean posttest score, you would perform an analysis of variance. Time of test (pre versus post) would be the single within-subjects factor. If the P main effect were significant, you would conclude that the mean pretest and posttest scores were significantly different. Do not confuse these two very different analyses. Given two repeated measures, such as a pretest and a posttest: 1. The correlation coefficient tells you how well the second score can be predicted from the first. 2. The analysis of variance, on the other hand, tells you whether the mean score for the first test is different from the mean score for the second test.
These are two separate but important pieces of information. But note that there is no necessary relation between their significance: One, neither, or both might be significant.

The single-factor study just described, like the somewhat more complex two-factor study described a few paragraphs earlier, assesses subjects both before and after treatment, but the more complex study, in addition, assigns some subjects to a treatment group and the remaining subjects to a no-treatment control group. For this reason, your first thought might be to analyze the data from the two-factor study as a one-between, one-within, 2 x 2 design. The between-subjects factor is treatment (training vs. no training) and the within-subjects factor is time of testing (before vs. after training). A main effect for training (factor A) would indicate that, considering both pretest and posttest scores, one group scored higher than the other. A main effect for time of testing (factor P) would indicate that, considering both treatment conditions, scores tended to be higher at one time compared to the other. The result the investigator most likely desires, however, is a significant interaction. If the treatment affects the posttest score, then there should be a difference between pretest and posttest scores only for treated subjects. In other words, the effect of time of testing on score should be conditional on (that is, should interact with) the level of treatment. It is a significant interaction between treatment level and time of testing that allows the investigator to conclude that subjects were affected by the treatment offered in the treatment condition.
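The interaction just described can be illustrated with a short computation. The sketch below is not the regression procedure used in this book; it relies instead on a standard equivalence for a 2 x 2 one-between, one-within design, namely that the interaction F ratio equals the squared t statistic from comparing the two groups on their gain scores (posttest minus pretest). The scores are assumed to be those shown in Fig. 17.2, and all variable names are chosen here for illustration only.

```python
from math import sqrt

# Posttest and pretest scores, assumed to be laid out as in Fig. 17.2:
# the first four subjects were trained, the last four were not.
post = [102, 125, 95, 130, 43, 82, 69, 66]
pre = [79, 93, 75, 69, 101, 94, 84, 69]
trained = [True] * 4 + [False] * 4


def mean(xs):
    return sum(xs) / len(xs)


def sum_sq_dev(xs):
    """Sum of squared deviations from the mean."""
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs)


# Gain score for each subject: change from pretest to posttest.
gains = [y - x for y, x in zip(post, pre)]
gain_trained = [g for g, t in zip(gains, trained) if t]
gain_control = [g for g, t in zip(gains, trained) if not t]

# Independent-groups t test on the gain scores (pooled variance).
df_error = len(gains) - 2
pooled_var = (sum_sq_dev(gain_trained) + sum_sq_dev(gain_control)) / df_error
se_diff = sqrt(pooled_var * (1 / len(gain_trained) + 1 / len(gain_control)))
t = (mean(gain_trained) - mean(gain_control)) / se_diff

# In a 2 x 2 mixed design the treatment-by-time interaction F equals t squared.
f_interaction = t ** 2

print("mean gain, trained:", mean(gain_trained))
print("mean gain, control:", mean(gain_control))
print("interaction F (1, %d df): %.2f" % (df_error, f_interaction))
```

A large F here reflects groups that changed by different amounts from pretest to posttest, which is exactly what an interaction between treatment level and time of testing means.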
Thus one way to test the effectiveness of a treatment, given the present example, is to analyze the data as a one-between (factor A = treatment level), one-within (factor P = time of assessment) study, in which case a significant AP interaction might indicate that scores for the treated subjects, but not the untreated subjects, were significantly different at the second assessment.

There is a second way to analyze these same data. It is somewhat simpler and more straightforward than the mixed two-factor strategy just described and, in fact, it is usually preferred by experts. Simply put, the question of interest concerns whether the treatment matters, that is, whether it has an effect on a particular outcome. Treatment is the research factor of interest and, as noted previously, it is a between-subjects factor. This does not change. However, rather than treat the repeated measures as levels of a within-subjects factor, it makes sense to regard the second measure as the outcome variable and the first measure as the covariate. This reduces a one-between, one-within design to a one-between design with a covariate. Quite likely the second score can be predicted to some extent by the first one, but the question of interest is, above and beyond this expected effect (the effect of the covariate), does knowledge of whether or not a subject received training (the effect of the between-subjects factor) allow us to make significantly better predictions for the second score? In other words, when accounting for criterion or outcome score variance, in addition to whatever effect the covariate may have, does the training factor matter?

As the discussion in the preceding paragraph makes clear, the present example can be analyzed with an analysis of covariance, exactly as described in chapters 10 and 12. Recall that 16 scores were collected for the button-pushing study.
Assume that scores 1-4 are posttreatment scores for the training group subjects and scores 5-8 are their pretreatment scores; similarly, assume scores 9-12 are posttreatment scores for the no-training control group and scores 13-16 are their pretreatment scores. In other words, the preceding between-subjects variable of gender is now the between-subjects variable of training. However, scores that were previously ascribed to set I for the instruction factor (scores 1-4
and 9-12) are now scores for the criterion variable, whereas scores that were previously ascribed to set II for the instruction factor (scores 5-8 and 13-16) are now scores for the covariate.

The analysis of covariance for the present example proceeds in three steps. First the posttest scores (representing the dependent or criterion variable) are regressed on the pretest scores (representing the covariate). Then, in order to test whether training has an effect, the coded variable for group membership (training vs. no training) is added to the regression equation. Finally, in order to test for homogeneity of regression, a variable representing the covariate by group interaction is added. Computations for this analysis are carried out during the course of the next exercise. Note that because there are no within-subjects factors for this analysis, it is not necessary to code subjects as a factor. In this case, the covariate is the pretest score, not the subject factor as it was for the analyses described in chapters 14 and 15.

Exercise 17.1 Analysis of Covariance of a Pretest, Posttest Study
The template that results from this exercise allows you to perform a simple analysis of covariance for pretest and posttest scores. Data from the button-pushing study will again be used. This time you will assume that subjects were tested twice and half of the subjects received sensitivity training between the two assessments.
1. You will probably want to modify an earlier spreadsheet, for example, the one shown in Fig. 14.10. However, for this analysis there are only eight subjects and three predictor variables. The first four scores from the button-pushing study are now the values for the dependent variable for the first four subjects (the subjects who receive training) and scores 5-8 are now the values for the covariate for the first four subjects (their pretest scores).
Scores 9-12 are now the values for the dependent variable for the second four subjects (the subjects who did not receive training) and scores 13-16 are now the values for the covariate for the second four subjects.
2. Analyze these data and enter the appropriate information into a summary table. First, do the analysis of covariance, regressing the posttest score on the pretest score (step 1) and then on both the pretest score and a contrast-coded variable for treatment group (step 2). Second, check for homogeneity of regression, following the preceding two steps with a third step that regresses the posttest score on the pretest score, a contrast-coded variable for treatment group, and a third variable representing their interaction.
3. Is the effect of the covariate statistically significant? Is the effect of the sensitivity training statistically significant? Is the assumption of the homogeneity of regression warranted?

The spreadsheet that results from the last exercise is shown in Figs. 17.1 and 17.2. The pretest by training interaction is not significant (F(1,4) < 1, NS); thus homogeneity of regression is not violated and we can conclude that these are appropriate data for an analysis of covariance. The effect of pretraining on posttraining scores is not significant (F(1,5) = 2.5, NS), but its significance, or lack thereof, is not critical. The major question concerns the effect of training and it is important to note that, controlling for pretest scores, the treatment variable (training versus no training) significantly affects posttest scores (F(1,5) = 11.6, p < .05).
Step  Source            R²      R² change   SS      df   MS      F       pη²
1     X, pretest        0.132   0.132       826     1    826     2.514   0.335
2     A, training       0.738   0.607       3807    1    3807    11.58   0.699
      error                     0.262       1643    5    328.6
      TOTAL btwn Ss             1           6276    7

Step  Source            R²      R² change   SS      df   MS      F       pη²
1     X, pretest        0.132   0.132       826     1    826     2.105   0.345
2     A, training       0.738   0.607       3807    1    3807    9.703   0.708
3     XA, interaction   0.75    0.012       73.74   1    73.74   0.188   0.045
      error                     0.25        1569    4    392.3
      TOTAL btwn Ss             1           6276    7
FIG. 17.1. Source table for analyzing the effect of training (between-subjects factor A) on posttreatment number of button pushes. Pretreatment number of button pushes (labeled X) is a covariate.

One final set of computations remains. We could report raw means, noting that after training the mean number of button pushes for the treatment group was 113 whereas the corresponding number for the no-treatment control group was 65. In addition, however, and in line with the usual reporting of analysis of covariance results, we should also report the adjusted means, computed as described in chapter 13.
     Post   Pre   Tr                    y =     m =      e =     sstot
s    Y      X     A     XA     Y'       Y-My    Y'-My    Y-Y'    y*y
1    102    79    -1    -79    113      13      24       -11     169
2    125    93    -1    -93    110.7    36      21.66    14.34   1296
3    95     75    -1    -75    113.7    6       24.67    -18.7   36
4    130    69    -1    -69    114.8    41      25.67    15.33   1681
5    43     101   1     101    62.66    -46     -26.3    -19.7   2116
6    82     94    1     94     63.83    -7      -25.2    18.17   49
7    69     84    1     84     65.5     -20     -23.5    3.498   400
8    66     69    1     69     68.01    -23     -21      -2.01   529

Sum = 712                      Sum = 0, -0, 0 (y, m, e)          SS = 6276
N = 8                          N = 8, 8, 8                       df = 7
Mean = 89                      VAR = 784.5, 579.1, 205.4         MS = 896.6
a, b = 102.9, -0.17, -23.3     SD = 28.01                        SD' = 29.94
R = 0.859                      R² = 0.738                        R² adj = 0.633
FIG. 17.2. Spreadsheet for analyzing the effect of training (between-subjects factor A) on posttreatment number of button pushes (labeled Post). Pretreatment number of button pushes (labeled Pre) is a covariate.
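For readers who prefer a script to a spreadsheet, the three hierarchical steps behind Fig. 17.1 can be sketched as follows. This is only an illustration under stated assumptions: the scores are taken from Fig. 17.2, each step adds exactly one predictor, and each F ratio is formed against the error term of the last model in the series (which is how the values in Fig. 17.1 appear to have been computed). The helper names `solve`, `r_squared`, and `source_table` are hypothetical and are not part of any package.

```python
def solve(a, b):
    """Solve the small linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Partial pivoting keeps the elimination numerically stable.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def r_squared(y, predictors):
    """R squared from regressing y on the predictors plus an intercept."""
    n = len(y)
    cols = [[1.0] * n] + [[float(v) for v in p] for p in predictors]
    xtx = [[sum(u * v for u, v in zip(ci, cj)) for cj in cols] for ci in cols]
    xty = [sum(u * v for u, v in zip(ci, y)) for ci in cols]
    coef = solve(xtx, xty)
    pred = [sum(c * col[i] for c, col in zip(coef, cols)) for i in range(n)]
    my = sum(y) / n
    ss_total = sum((v - my) ** 2 for v in y)
    ss_error = sum((v - p) ** 2 for v, p in zip(y, pred))
    return 1 - ss_error / ss_total


def source_table(y, steps):
    """Rows of (label, R2, R2 change, F); assumes one new predictor per step."""
    r2s = [r_squared(y, preds) for _, preds in steps]
    df_error = len(y) - len(steps[-1][1]) - 1
    error_per_df = (1 - r2s[-1]) / df_error
    rows, previous = [], 0.0
    for (label, _), r2 in zip(steps, r2s):
        change = r2 - previous
        rows.append((label, r2, change, change / error_per_df))
        previous = r2
    return rows


post = [102, 125, 95, 130, 43, 82, 69, 66]   # Y, posttest
pre = [79, 93, 75, 69, 101, 94, 84, 69]      # X, pretest (covariate)
group = [-1, -1, -1, -1, 1, 1, 1, 1]         # A, contrast code for training
inter = [x * a for x, a in zip(pre, group)]  # XA, covariate by group product

steps = [("X, pretest", [pre]),
         ("A, training", [pre, group]),
         ("XA, interaction", [pre, group, inter])]

ancova = source_table(post, steps[:2])   # analysis of covariance
homogeneity = source_table(post, steps)  # adds the homogeneity check

for table in (ancova, homogeneity):
    for label, r2, change, f in table:
        print("%-16s R2 = %.3f  change = %.3f  F = %.3f" % (label, r2, change, f))
    print()
```

Solving the normal equations directly is adequate for a problem this small; for larger or more nearly collinear sets of predictors a dedicated least-squares routine would be the safer choice.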
Exercise 17.2 Adjusted Scores for a Pretest, Posttest Study
This exercise provides additional practice in computing adjusted scores as required for an analysis of covariance.
1. Modify the spreadsheet shown in Fig. 17.1 to compute posttest scores, adjusted for the effect of pretest scores. If you have forgotten how to do this, refer to Exercise 13.7.
2. What are the adjusted posttest means for the treated and untreated groups?
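One way to check a spreadsheet answer to this exercise is sketched below. It assumes the standard analysis of covariance adjustment, in which each group's posttest mean is shifted by the pooled within-group regression slope times the distance of that group's pretest mean from the grand pretest mean; this is presumed, not confirmed, to agree with the procedure of chapter 13. The scores are those of Fig. 17.2 and the function names are illustrative only.

```python
post = [102, 125, 95, 130, 43, 82, 69, 66]   # posttest scores (Fig. 17.2)
pre = [79, 93, 75, 69, 101, 94, 84, 69]      # pretest scores (covariate)
groups = {"trained": range(0, 4), "control": range(4, 8)}


def mean(xs):
    return sum(xs) / len(xs)


def cross_products(xs, ys):
    """Sum of cross-products of deviations from the two means."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys))


# Pooled within-group slope: within-group cross-products over
# within-group sums of squares for the covariate.
sxy_within = 0.0
sxx_within = 0.0
for idx in groups.values():
    x = [pre[i] for i in idx]
    y = [post[i] for i in idx]
    sxy_within += cross_products(x, y)
    sxx_within += cross_products(x, x)
b_within = sxy_within / sxx_within

# Adjusted mean = raw posttest mean - slope * (group pretest mean - grand pretest mean).
grand_pre = mean(pre)
adjusted = {}
for name, idx in groups.items():
    raw_mean = mean([post[i] for i in idx])
    pre_mean = mean([pre[i] for i in idx])
    adjusted[name] = raw_mean - b_within * (pre_mean - grand_pre)

for name, value in adjusted.items():
    print("%s: adjusted posttest mean = %.2f" % (name, value))
```

Because the two groups differ only modestly on the pretest and the pooled slope is small, the adjustment moves each mean by less than one button push.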
The example just described illustrates the general analytic strategy emphasized throughout this book, which can be summarized as follows:
1. A quantitative dependent or criterion variable (in this case, number of posttreatment button pushes) is identified and measured.
2. Independent or predictor variables are identified and hierarchically ordered. Predictor variables may be quantitative, like age or number of pretreatment button pushes, or they may be categorical, like treatment group. Categorical variables distinguishing among G groups are represented with G - 1 coded predictor variables. Predictor variables representing an interaction are formed by multiplying their component predictors together.
3. The criterion variable is regressed on the first predictor variable, or set of predictor variables. Then the second variable or set (if there is one) is added to the regression equation, then the third, fourth, and so forth. The increase in the proportion of criterion variance accounted for at each step is noted and indicates the strength of the effect associated with the variable (or variables) added at that step. Normally predictor variables that serve as covariates, like number of pretreatment button pushes or coded variables for subjects when analyzing within-subjects factors, are added to the regression equation first, whereas predictor variables that represent interactions between other variables are added last.
4. An F ratio is computed for the increase or change in R² at each step and is tested for significance. This test indicates whether or not the effect associated with that step is statistically significant.
5. The magnitude of the effect can be represented with either the increase in R² or the partial η².

The approach just summarized is amazingly general.
Using only a few basic and easily understood principles and techniques, it incorporates under one roof, so to speak, not only standard regression analyses, but also standard analyses of variance and covariance—and does so in a way that remains remarkably faithful to the investigator's primary concern with identifying the strength and significance of research factors of interest. Still, unless enough subjects are studied, effects strong enough to be interesting may not be detected as statistically significant. How investigators can guard against this undesirable outcome is discussed in the next section.
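Steps 4 and 5 of the strategy summarized in this section can be written compactly. The expressions below are a sketch, assuming (as the values in Fig. 17.1 suggest) that each step is tested against the error term of the final model; the subscripts change, final, effect, and error are labels chosen here for illustration. The second line checks both expressions against the training effect reported in Fig. 17.1.

```latex
F = \frac{R^2_{\text{change}} / df_{\text{change}}}
         {\left(1 - R^2_{\text{final}}\right) / df_{\text{error}}},
\qquad
\text{partial } \eta^2 = \frac{SS_{\text{effect}}}{SS_{\text{effect}} + SS_{\text{error}}}

F_{\text{training}} = \frac{.607 / 1}{.262 / 5} = 11.58,
\qquad
\text{partial } \eta^2_{\text{training}} = \frac{3807}{3807 + 1643} = .699
```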
17.2 POWER ANALYSIS: HOW MANY SUBJECTS ARE ENOUGH?

Power, as you learned in chapter 2, is the probability that a real effect will be detected as significant by a particular statistical test. Again as you learned in chapter 2, if the alpha level is made less stringent, power is increased. However, because the alpha level regarded as appropriate is set not by individual investigators acting unilaterally but by social consensus, modifying the alpha level is usually not an acceptable way to gain power.

Some tests are more powerful than others. For example, as a general rule, tests that rely on quantitative data are more powerful than tests that make use of categorical data and, as discussed in the last two chapters, usually tests involving within-subjects factors are more powerful than tests involving between-subjects factors. Under most circumstances, however, the nature of the data and the design appropriate for analyzing those data are limited by the nature of the problem and cannot easily be changed.

By far, the easiest way to gain power is to increase the number of subjects studied. How this works with the F distribution is easy to demonstrate. Look at Table D in the statistical tables appendix. Notice that as the degrees of freedom for the denominator increase, critical values of F decrease. In other words, other things being equal, as sample size increases, the bulk of the sampling distribution shifts to the center, leaving a thinner tail. As a result, the value of F that demarcates 5% of the area becomes steadily smaller, and an F ratio that was not deemed significant may become significant if it is based on data from more subjects. Thus very large effects may be found significant even with quite small sample sizes, whereas minuscule effects may be found significant if sample size is large enough.

A practical question to address before beginning any study is, how many subjects are enough?
If too few subjects are included in a study, there may be little hope of finding significant results. Both time and money will have been wasted in a futile effort. On the other hand, if too many subjects are included, effects too small to be of either theoretical or practical interest may be detected. This too can be viewed as a waste of resources.

Happily, there are fairly straightforward procedures for determining an appropriate sample size. The standard reference is Jacob Cohen's Statistical Power Analysis for the Behavioral Sciences (1988), which should be consulted for situations that do not fit the basic guidelines given here. The procedures presented in this chapter are adapted from Cohen and Cohen's Applied Multiple Regression/Correlation Analysis for the Behavioral Sciences (1983). As such, they apply to multiple-regression analyses and so are appropriate for almost all of the analyses described in this book (the exception is the sign test).

A power analysis can be decomposed into six constituent steps (Fig. 17.3). First, select an alpha level. This step requires only that you state your alpha level, which is part of the planning for any study. Usually it will be .05 or .01. Second, set power. Decide what level of power you prefer or find acceptable. A common choice is .9, but under some circumstances you may be willing to accept only a .8 or less chance of detecting effects as significant. Third, count predictors. The number of predictors is symbolized with K and is the number of predictors uniquely associated with the effect. In other words, K is df change, the number of predictor variables added to the regression equation at the step associated with the effect. Fourth, find the L score. L is a statistic whose values Cohen has computed and tabled (Cohen & Cohen, 1983; see Tables F.1 and F.2 in the statistical tables
appendix). It will be used in the sixth step to compute the number of subjects. Note that the values from the first three steps are used to determine the value of L; thus for the fourth step you scan Table F.1 or F.2 (depending on the alpha level you selected) for the value of L that corresponds to the power you desire and the number of predictor variables you have.

The fifth step requires that you determine the size of the effect of interest. This requires somewhat more judgment than the first four steps. Along with L, the effect size will be used in the sixth step to compute the number of subjects. Following Cohen and Cohen (1983), the effect size is symbolized as f² and is defined as follows:

$$ f^2 = \frac{R^2_{\text{large}} - R^2_{\text{small}}}{1 - R^2_{\text{final}}} \qquad (17.1) $$
The large, small, and final R²s assume that we are testing effects hierarchically. Thus R² change (i.e., R² large - R² small) refers to the proportion of variance we think will be accounted for uniquely by the effect of interest, whereas R² final refers to the proportion of variance we think will be accounted for by all the effects tested by a particular design. (Recall the discussion of larger, smaller, and final models in chap. 12.) It is important to note that these are population, not sample, R²s. Thus f² is a population statistic that reflects the size we expect for a given effect not in a sample but in the population. This is hardly surprising. Presumably determining an appropriate sample size is something we do before selecting a sample, so we would hardly expect that procedure to rely on sample data.

But how are the population R²s of Equation 17.1 determined? There are several possibilities. Unfortunately, none are simply mechanical and all require that the investigator exercise some judgment. The investigator could be guided by previous results in the field, arguing that because effects of a certain size have typically been found in the past, it seems reasonable to think that effects of a similar size would be found in the future. Or, if there is little in the way of precedent, an investigator could simply argue that only effects of a certain size are of interest, for either theoretical or practical reasons. In any case, in order to proceed with a power analysis, it is necessary to commit to some expected effect size. And it is highly desirable, of course, to have some rationale for the size selected.

This is not quite as difficult as it sounds. Even though we do not know the true population values for particular R²s, for most areas of research there is usually some consensus as to what constitutes a small, medium, or large effect. And if no other rationale is available, one can use Cohen's guidelines. For the behavioral sciences, Cohen (1988) wrote, an R² of .01 is regarded as small, .09 as medium, and .25 as large (which translate into correlation coefficients of .1, .3, and .5, respectively). Not all investigators will think that accounting for 25% of the criterion variance constitutes a large effect, but nonetheless to proceed with a power analysis some desired effect size must be selected.

Steps required for a power analysis
Step 1  Select an alpha level.
Step 2  Set power.
Step 3  Count predictors.
Step 4  Find the L score.
Step 5  Select an effect size.
Step 6  Compute the number of subjects required.

FIG. 17.3. Steps required for a power analysis.

Having defined the alpha level, the power desired, the number of predictors, the value of L, and the effect size, it is now possible to perform the desired computation; thus for the sixth step, compute the number of subjects required. The formula, where n* indicates number of subjects, is:

$$ n^* = \frac{L}{f^2} + K + 1 $$
This is the smallest sample size you should consider. If an effect is real and of the strength claimed, then the probability of detecting it as significant, with K predictor variables at the stated alpha level, is the probability you selected for power at step 2, which is why a value of .9 for power was recommended. Under most circumstances it does not seem desirable to have, for example, only a 50% chance of finding significant results of a strength you believe important.

The lie detection study described earlier can be used to exemplify a power analysis. As presented in chapter 10, this study included only 10 subjects and, not surprisingly given the small number of subjects, did not find a significant effect of drug on number of lies detected (see Fig. 10.2). The number of subjects, although convenient for exposition, was unrealistically small. Still, the R² for this sample, .19, does not seem trivial. This raises a question: In a real study, how many subjects would be needed to detect a similarly sized effect as significant?

The .19 is a sample R². Its adjusted value (see chap. 11), which is a reasonable estimate for the population value, is .09, exactly what Cohen called medium. (As an exercise, verify that .09 is the population value estimated from the sample value of .19.) The conventional value for alpha is .05 and a commonly selected value for power is .9. In this case, there is one predictor variable, the coded variable that indicates whether or not the subject received a drug. The value for L is 10.51 (K = 1, power = .90). The estimated population value for R² is .09, and the effect is associated with the first and only step (which means that R² small = 0 and R² large = R² final); consequently, for this example,

$$ f^2 = \frac{.09 - 0}{1 - .09} = \frac{.09}{.91} = .0989 $$

Finally,

$$ n^* = \frac{10.51}{.0989} + 1 + 1 = 108.27 $$
Always round up rather than to the nearest whole number, because this calculation determines a minimum number of subjects. In other words, in order to have a 9 in 10 chance (power = .9) of finding significant at the .05 level an effect of drug that accounts for 9% of the variance in number of lies detected in the population, at least 109 subjects would be needed. Approximately half would receive the drug treatment and the remaining subjects would be assigned to a no-drug control group.
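The worked example above can be reproduced, and roughly checked by simulation, with a few lines of code. This is a sketch under stated assumptions, not the procedure of Cohen and Cohen (1983) itself: the value L = 10.51 is taken from the text, the critical value 3.93 for F with 1 and 107 degrees of freedom at the .05 level is an approximate table value supplied here, scores are simulated as normally distributed with equal variances in two groups of nearly equal size, and every function name is hypothetical.

```python
import math
import random


def cohen_f2(r2_change, r2_final):
    """Effect size f^2 = R2 change / (1 - R2 final), as in Equation 17.1."""
    return r2_change / (1.0 - r2_final)


def required_n(l_value, f2, k):
    """Minimum sample size n* = L / f^2 + K + 1, always rounded up."""
    return math.ceil(l_value / f2 + k + 1)


def one_way_f(y, group):
    """F ratio (1 and n - 2 df) for a two-group comparison of the scores in y."""
    n = len(y)
    grand = sum(y) / n
    ss_between = 0.0
    ss_within = 0.0
    for g in (0, 1):
        scores = [v for v, code in zip(y, group) if code == g]
        m = sum(scores) / len(scores)
        ss_between += len(scores) * (m - grand) ** 2
        ss_within += sum((v - m) ** 2 for v in scores)
    return ss_between / (ss_within / (n - 2))


def simulated_power(n, f2, f_crit, reps=2000, seed=1):
    """Proportion of simulated two-group studies with F greater than f_crit."""
    rng = random.Random(seed)
    group = [1] * (n // 2) + [0] * (n - n // 2)
    # With equal-sized groups and unit error variance, a mean difference of
    # 2 * sqrt(f2) yields the desired population f^2.
    delta = 2.0 * math.sqrt(f2)
    hits = 0
    for _ in range(reps):
        y = [delta * code + rng.gauss(0.0, 1.0) for code in group]
        if one_way_f(y, group) > f_crit:
            hits += 1
    return hits / reps


f2 = cohen_f2(0.09, 0.09)          # single step: R2 change equals R2 final
n_star = required_n(10.51, f2, 1)  # L for K = 1, power = .90, alpha = .05
power = simulated_power(n_star, f2, f_crit=3.93)

print("f squared = %.4f" % f2)
print("n* =", n_star)
print("simulated power = %.2f" % power)
```

The simulated proportion should land close to the .9 selected at step 2, which is a useful sanity check whenever L is read from a table by hand.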
Power analysis is important primarily because it provides a rational way to determine the number of subjects needed before a study begins. It prevents you from being embarrassed when, after the study is over and you are bemoaning the lack of significant results, someone points out you only had a 40% (or whatever) chance anyway of finding significant effects you thought important, given the number of subjects.

In addition, power analysis provides a way to interpret negative results after the study is over. It is never correct to claim that negative results "prove" the null hypothesis. And certainly, given the results of the lie detection study, it would be misleading to even say that they support the notion that drug has no effect on number of lies detected. After all, 19% of the variance in the sample was accounted for by drug group, an effect that would be statistically significant with a larger sample. Imagine, however, that we redo the study with 109 subjects, the number of subjects we computed based on power analysis considerations, and again find no significant effect. We cannot claim that this proves that drug has no effect, of course, but we can say that we had a 90% chance of detecting a drug effect that accounts for 9% of the population variance as significant at the .05 level, and we failed to do so. We do not know the true magnitude of the drug effect in the population, but we do know that if it really does account for 9% of the variance in number of lies detected, then 90% of the studies conducted with a sample size of 109 should find the drug effect significant (at the .05 level). Such a statement helps to put negative results in perspective and, coupled with description and discussion of the size of the effect we actually found, seems considerably more informative than simply stating that an effect achieved, or did not achieve, statistical significance.
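The remark that a sample R² of .19 would be significant with a larger sample is easy to illustrate. The sketch below holds R² fixed at .19 (treating the rounded value as exact) with one predictor, recomputes the F ratio for several sample sizes, and compares it with critical values of F at the .05 level. The critical values are approximate figures for 1 numerator degree of freedom supplied here as assumptions; Table D remains the authority.

```python
# Approximate critical values of F at alpha = .05 with 1 numerator df,
# keyed by denominator (error) df. Assumed values; consult Table D.
F_CRIT_05 = {8: 5.32, 20: 4.35, 40: 4.08, 60: 4.00, 120: 3.92}


def f_ratio(r2, n, k=1):
    """F for an R squared based on n subjects and k predictors."""
    return (r2 / k) / ((1 - r2) / (n - k - 1))


rows = []
for df_error, crit in F_CRIT_05.items():
    n = df_error + 2  # one predictor, so df error = n - 2
    f = f_ratio(0.19, n)
    rows.append((n, f, crit, f > crit))

for n, f, crit, significant in rows:
    verdict = "significant" if significant else "not significant"
    print("N = %3d  F = %6.2f  critical F = %.2f  %s" % (n, f, crit, verdict))
```

Two things change as N grows: the F ratio for a fixed R² rises in proportion to the error degrees of freedom, and the critical value it must exceed falls.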
Exercise 17.3 A Power Analysis This exercise provides practice in power analysis. 1. Assume you want at least a .9 chance of detecting an effect as significant at the .05 level. Assume further that you believe the effect, which is coded using two predictor variables, should account for at least 36% of the variance in the population. What sample size should you use? 2. Assume you want at least a .9 chance of detecting an effect as significant at the .05 level. Assume further a 2 x 2 between-subjects factorial design. You believe that the A, B, and AB effects together account for 30% and you want to detect a B main effect that accounts for 22% of the variance in the population. What sample size should you use? 3. Assume you want at least a .9 chance of detecting an effect as significant at the .05 level. Assume a covariate accounts for 7% and you want to detect an effect for the between-subjects factor, which is coded using three predictor variables and accounts for an additional 17% of the variance in the population. What sample size should you use?
Note 17.1
L     A statistic used in power analysis calculations. Its value is determined by the power desired and the number of predictor variables used. See Table F in the statistical tables section.
f²    A measure of effect size. Technically, the ratio of variance accounted for by an effect to residual or error variance. Used in power analysis calculations.
n*    The number of subjects needed in a study in order to detect effects of a specified size as significant. Determined by power analysis calculations.
The approach to power analysis described here—which requires you to identify the sources of variance, arrange them hierarchically, and identify an increase in R2 for each step—has been automated. If you provide this information to BWPower, a computer program written by Bakeman and McArthur (1999), the program computes power for various sample sizes (see Fig. 17.4). BWPower can be downloaded from: www.gsu.edu/~psyrab/BakemanPrograms.htm.
FIG. 17.4. Program BWPower showing answer for Part 2 from Exercise 17.3.
A Final Comment

At this point, after 17 chapters, you should have a firm, conceptual grasp of how variation in an outcome of interest (a dependent or criterion variable) can be explained or accounted for in terms of research factors (independent or predictor variables). It is worthwhile to reflect on what you have gained. You have learned exactly what it means to say that an effect accounts for a stated proportion of criterion variance, and you have learned how to describe the magnitude of a given effect and to test that magnitude for statistical significance. You have also learned that statistical significance is important (you want to know that observed results are probably not just chance happenings) but that it is not all important. After all, any effect that is not zero will be statistically significant if the sample size is large enough. But now you know how to decide on an appropriate sample size before launching a study.

From the examples and exercises throughout this book you have gained an appreciation for statistical hypothesis testing as it has come to be conventionally understood and practiced, and presumably you see it as a useful tool in data interpretation, not as a cult idol to be adhered to without question. If you wish to deepen your understanding of hypothesis testing and the seemingly arbitrary .05 cutoff point, two essays, one by Cohen (1990) and one by Rosnow and Rosenthal (1989), are especially recommended and serve as a superb complement to the material presented here.

Along the way you have accumulated an impressive array of skills. Coding categorical predictor variables and arranging both quantitative and qualitative predictor variables into a series of hierarchic steps for multiple regression should by now be second nature.
Given the integrated approach adopted in this book, you should be able to perform simple multiple-regression analyses, one-way and factorial analyses of variance with and without repeated measures, and analyses of covariance, and you should also be adept at describing the results of such analyses, using post hoc tests and adjusted means. One topic not discussed here is the analysis of categorical dependent variables using chi-square and log-linear techniques. As mentioned in the preface, such analyses are described in a companion volume (Bakeman & Robinson, 1994). Otherwise this book has provided you with an integrated understanding of most of the basic sorts of data analyses used by behavioral scientists.

You have reason to be pleased with your accomplishment but you should not take this as license to rest on your laurels. William Shakespeare has Richard II say: "But what e'er I be,/ nor I, nor any man that but man is/ with nothing shall be pleased till he be eased/ with being nothing" (Act 5, Scene 5, lines 38-41). This is not a somber thought, but simply suggests the restless quality of an active person with an active mind; it suggests that anyone too satisfied is no longer fully alive. It seems a fitting moral with which to end this book. The diligent reader will have learned a great deal about statistics, at an introductory level and as usually applied to behavioral science data. But it is only a foundation, a beginning. We hope that the material presented throughout this book will provide a useful base, not just for your understanding of the simpler statistical analyses found in the research reports you read, and not just for the basic analyses you may need to perform on your own data, but also for all of your future learning about statistics and statistical data analysis.
References
Bakeman, R., & McArthur, D. (1999). Determining the power of multiple regression analyses both with and without repeated measures. Behavior Research Methods, Instruments, and Computers, 31, 150-154.
Bakeman, R., & Robinson, B. F. (1994). Understanding log-linear analysis with ILOG: An interactive approach. Hillsdale, NJ: Lawrence Erlbaum Associates.
Bishop, Y. M. M., Fienberg, S. E., & Holland, P. W. (1975). Discrete multivariate analysis. Cambridge, MA: MIT Press.
Cohen, J. (1968). Multiple regression as a general data-analytic system. Psychological Bulletin, 70, 426-443.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum Associates.
Cohen, J. (1990). Things I have learned (so far). American Psychologist, 45, 1304-1312.
Cohen, J., & Cohen, P. (1983). Applied multiple regression/correlation analysis for the behavioral sciences. Hillsdale, NJ: Lawrence Erlbaum Associates.
Greenhouse, S. W., & Geisser, S. (1959). On methods in the analysis of profile data. Psychometrika, 24, 95-112.
Hays, W. L. (1981). Statistics. New York: Holt, Rinehart & Winston.
Huynh, H., & Feldt, L. S. (1976). Estimation of the Box correction for degrees of freedom from sample data in randomized block and split-plot designs. Journal of Educational Statistics, 1, 69-82.
Keppel, G. (1982). Design and analysis: A researcher's handbook. Englewood Cliffs, NJ: Prentice Hall.
Keppel, G., & Saufley, W. H., Jr. (1980). Introduction to design and analysis: A student's handbook. New York: W. H. Freeman.
Kessen, W. (1979). The American child and other cultural inventions. American Psychologist, 34, 815-820.
Kirk, R. E. (1982). Experimental design: Procedures for the behavioral sciences. Belmont, CA: Brooks/Cole.
Loftus, G. R., & Loftus, E. F. (1988). Essence of statistics. New York: Knopf.
Marascuilo, L. A., & Serlin, R. C. (1988). Statistical methods for the social and behavioral sciences. New York: W. H. Freeman.
Olejnik, S., & Algina, J. (2003). Generalized eta and omega squared statistics: Measures of effect size for some common research designs. Psychological Methods, 8, 434-447.
Robinson, B. F., Mervis, C. B., & Robinson, B. W. (2003). The roles of verbal short-term memory and working memory in the acquisition of grammar by children with Williams syndrome. Developmental Neuropsychology, 23, 13-32.
Rosnow, R. L., & Rosenthal, R. (1989). Statistical procedures and the justification of knowledge in psychological science. American Psychologist, 44, 1276-1284.
Scott, D. W. (1979). On optimal and data-based histograms. Biometrika, 66, 605-610.
Siegel, S. (1956). Nonparametric statistics for the behavioral sciences. New York: McGraw-Hill.
Stevens, S. S. (1946). On the theory of scales of measurement. Science, 103, 677-680.
Stigler, S. M. (1986). The history of statistics: The measurement of uncertainty before 1900. Cambridge, MA: Harvard University Press.
Tabachnick, B. G., & Fidell, L. S. (2001). Using multivariate statistics (4th ed.). Boston: Allyn and Bacon.
Tufte, E. R. (1983). The visual display of quantitative information. Cheshire, CT: Graphics Press.
Tukey, J. W. (1977). Exploratory data analysis. Reading, MA: Addison-Wesley.
Wainer, H. (1984). How to display data badly. American Statistician, 38, 137-147.
Wilkinson, L., & the Task Force on Statistical Inference, American Psychological Association Board of Scientific Affairs (1999). Statistical methods in psychology journals: Guidelines and explanations. American Psychologist, 54, 594-604.
Winer, B. J. (1971). Statistical principles in experimental design. New York: McGraw-Hill.
Glossary of Symbols and Key Terms
a  The regression constant. When Y, the criterion variable, is regressed on X, a single predictor variable, a is the Y intercept of the best-fit line. To compute, subtract the product of b (the simple regression coefficient) and the mean of X from the mean of Y (a = MY - bMX). Also, the number of levels for the between-subjects factor A.
A  The first between-subjects factor in a factorial study.
α (alpha)  The probability of making a false claim (Type I error).
Alpha Level  Maximum acceptable probability for Type I error (claiming an effect based on sample data when there is none in the population). Conventionally set to .05 or .01.
b  The simple regression coefficient. Given a single predictor variable, b is the slope of the best-fit line. It indicates the exact nature of the relation between the predictor and criterion variables. Specifically, b is the change in the criterion variable for each one-unit change in the predictor variable. To compute (assuming that Y is the criterion or dependent variable), divide the XY covariance by the X variance (b = COVXY/VARX). Also, the number of levels for the between-subjects factor B.
B  The second between-subjects factor in a factorial study.
β (beta)  The probability of missing a real effect (Type II error).
c  The number of levels for the between-subjects factor C.
C  The third between-subjects factor in a factorial study.
COV  The covariance. For the X and Y scores, it is SSXY divided by N. Thus COVXY is the average cross product of the X and Y deviation scores. It indicates the direction (plus or minus) and the extent to which the X and Y scores covary, and is related to both the correlation and the regression coefficients.
Critical Values  Values of a test statistic that would occur 5% of the time or less (assuming an alpha level of .05) if the null hypothesis were true. Depending on the test statistic used and how the null hypothesis is stated, these values could fall only at the extreme end of one tail of a distribution or at the extreme ends of both tails. If a test statistic assumes a critical value, the null hypothesis is rejected. For that reason, critical values define the critical region or the region of rejection of a sampling distribution.
df  Degrees of freedom.
dferror  Degrees of freedom for error. It is equal to N minus 1 (for the regression constant) minus the number of predictor variables.
dfmodel  Degrees of freedom for the model. It is equal to the number of predictor variables.
DV  The variable a researcher wants to explain or account for is called the dependent variable or DV. It is also called the criterion variable or the response variable.
F ratio  Usually the ratio of MSmodel to MSerror; more generally, the ratio of two variances. If the null hypothesis is true, it will be distributed as F with the degrees of freedom associated with the numerator and denominator sums of squares.
f²  A measure of effect size. Technically, the ratio of variance accounted for by an effect to residual or error variance. Used in power analysis calculations.
G  The number of groups or cells included in a study.
IV  Variables a researcher thinks account for or affect the values of the dependent variable are called independent variables or IVs. They are also called predictor variables or explanatory variables.
Interval Scale  The intervals of an interval scale are equal, no matter where they fall on the measurement continuum. The placement of zero, however, is arbitrary, like zero degrees Fahrenheit.
K  The number of predictor variables.
L  A statistic used in power analysis calculations. Its value is determined by the power desired and the number of predictor variables used. See Table F in the statistical tables section.
M  The sample mean, which is the sum of the scores divided by N. The mean for the Y scores is often symbolized as Y with a bar above it (read Y-bar), the mean for the X scores as X with a bar above, and so forth. In this book, to avoid symbols difficult to portray in spreadsheets, MY represents the mean of the Y scores, MX the mean of the X scores, and so forth.
MX  The mean of the X scores.
MY  The mean of the Y scores.
μ (mu)  The population mean. It can be estimated by the sample mean, or its value can be assumed based on theoretical considerations.
μM  The mean of the distribution of sample means. It can be estimated by the population mean.
MSerror  Mean square for error. Computed by dividing the error sum of squares by the error degrees of freedom. It is also called the mean square within groups or the residual mean square.
MSmodel  Mean square for the model. Computed by dividing the sum of squares associated with the model by the degrees of freedom for the model. It is also called mean square between groups or mean square due to regression.
n  The number of subjects within one group or cell of a study.
N  The total number of subjects (sampling units) included in a study.
n*  The number of subjects needed in a study in order to detect effects of a specified size as significant. Determined by power analysis calculations.
Nominal Scale  The levels of a nominal or categorical scale are category names, like Catholic | Jew | Muslim | Other for religion or male | female for sex. Their order is arbitrary; there is no obviously correct way to order the names.
Ordinal Scale  The values or levels of an ordinal scale are named but also have an obvious order, like first | second | third. Another example is freshman | sophomore | junior | senior. However, there is no obvious way to quantify the distance between levels or ranks.
p  The number of levels for the within-subjects factor P.
P  The first within-subjects factor in a factorial study.
Power  The probability of detecting a real effect, or the probability of correctly rejecting the null hypothesis. The power of a test is one minus the probability of missing a real effect, or 1 - β.
' (prime)  In this book, a prime or apostrophe after a symbol indicates an estimate, for example, VAR' and SD'. Some texts use the circumflex (the ^ or hat) for this purpose.
q  The number of levels for the within-subjects factor Q. Also, the Studentized Range Statistic used to compute the Tukey critical difference used in post hoc tests.
Q  The second within-subjects factor in a factorial study. Also, the complement of a probability, or 1 - P.
r  The correlation coefficient or, more formally, the Pearson product-moment correlation coefficient. It is an index of the strength of the relation between two variables, and its values can range from -1 to +1. To compute, find the average cross product of the ZX and ZY scores (r = Σ ZXi ZYi / N, where i = 1,N), or take the square root of r² and assign it the same sign as the covariance.
R  Multiple R. Just as r is the correlation between a criterion variable (Y) and a single predictor variable (X), so R is the correlation between a criterion variable and an optimally weighted sum of a set of predictor variables (X1, X2, and so forth).
r²  The coefficient of determination, or r-squared. It indicates the proportion of criterion variable variance that can be accounted for given knowledge of the predictor variable. To compute, divide model variance by total variance (r² = VARmodel/VARtotal).
R²  Multiple R squared. It is the proportion of criterion variable variance accounted for by a set of predictor variables, working in concert.
R²adj  Adjusted multiple R squared. It is the population value for R² estimated from a sample, which will be somewhat smaller than the sample R². Consistent with the notation used in this text, it could also be symbolized as R²' with the prime indicating an estimated value, but the adjusted subscript is used far more frequently in multiple-regression texts.
Ratio Scale  The intervals of a ratio scale are equal (as for an interval scale), but zero indicates truly none of the quantity, like zero degrees Kelvin. Thus the ratio of two numbers is meaningful for numbers measured on a ratio but not an interval scale.
Rejection Region  Designates values for the test statistic whose probability of occurrence is equal to or less than the alpha level, assuming that the null hypothesis is true. If the test statistic assumes any of these values (falls in the region of rejection), the null hypothesis is rejected.
S  The sample standard deviation. Alternatives are SD (used in this book) and s.
σ (sigma)  The population standard deviation. It can be estimated by taking the square root of the quotient of the sum of the squared deviation scores divided by N - 1. This estimate is symbolized SD'.
S²  The sample variance. Alternatives are VAR (used in this book) and s². Computed by dividing the sum of squares by N.
σ²  The population variance. It can be estimated by the sum of the squared deviation scores divided by N - 1. This estimate is symbolized VAR'.
SD  The sample standard deviation, which is the square root of the sample variance. Often it is symbolized S or s. If not clear from context, SDY indicates the standard deviation for the Y scores, SDX the standard deviation for the X scores, and so forth.
SD'  The estimated standard deviation. It is almost an unbiased estimate (especially if N > 10) of σ, the true population value. It can be estimated by taking the square root of VAR', or by taking the square root of the quotient of the sum of squares divided by N - 1. Often it is symbolized with a circumflex (a ^ or hat) above σ or s.
SD'Y-Y'  The estimated standard error of estimate. It is the estimated population standard deviation for the regression residuals, that is, the differences between raw and predicted scores. It can be regarded as an estimate of the average error made in the population when prediction is based on a particular regression equation. Sometimes it is called the standard error of estimate, leaving off the qualifying estimated. Often it is symbolized with a circumflex or ^ above the SD and above the second Y subscript instead of a prime after them.
SD'M  The estimated standard error of the mean.
… 1,N).
SSerror  The error sum of squares, or Y - Y' for each subject, squared and summed. This can also be symbolized SSres, for residual sum of squares, or SSunexp, for sum of squares left unexplained by the model.
SSmodel  The model sum of squares, or Y' - MY for each subject, squared and summed. This can also be symbolized SSreg, for regression sum of squares, or SSexp, for sum of squares explained by the model.
SStotal  The total sum of squares, or Y - MY for each subject, squared and summed. This can also be symbolized SSY.
SSXY  The XY sum of squares, or Σ (Xi - MX)(Yi - MY), where i = 1,N. Thus SSXY is the sum of the cross products of the X and Y deviation scores.
t  A test statistic similar to Z but for small samples. The shape of the t distribution is similar to the normal distribution but flatter and more spread out. As a result, critical values for t are larger than corresponding critical values for Z.
TCD  Tukey Critical Difference. Computed based on the Studentized Range Statistic and used in post hoc tests. If differences between means exceed the TCD, then the difference between those means is statistically significant.
Type I Error  Making a false claim, or incorrectly rejecting the null hypothesis (also called an alpha error). The probability of making a Type I error is alpha.
Type II Error  Missing a real effect, or incorrectly accepting the null hypothesis (also called a beta error). The probability of making a Type II error is beta.
VAR  The sample variance, which is the sum of squares divided by N. Often it is symbolized S² or s², which makes sense because variance is a squared measure. If not clear from context, VARY indicates the variance for the Y scores, VARX the variance for the X scores, and so forth.
VAR'  The estimated population variance. It is an unbiased estimate of σ², the true population value. It can be estimated by multiplying the sample variance, VAR, by the quotient of N divided by N - 1, or by dividing the sum of squares by N - 1. Often it is symbolized with a circumflex (a ^ or hat) above σ² or s².
VARerror  The error variance. It indicates the variability of raw scores relative to scores predicted by a specified regression equation or model. To compute, divide SSerror by N.
VARmodel  The model variance. It indicates the variability of scores predicted by a specified regression equation or model relative to the group mean. To compute, divide SSmodel by N.
VARtotal  The variance for the scores in a sample. It indicates the variability of raw scores relative to the group mean. To compute, divide SStotal by N. The subscript indicates the group of scores in question, e.g., VARX for the X scores, VARY for the Y scores, and so forth.
X  An upper case X is used to indicate a generic raw score, usually for the independent or predictor variable. If there is more than one, the first is indicated X1, the second X2, and so forth.
Y  An upper case Y is used to indicate a generic raw score, usually for the dependent variable. Thus Y represents any score in the set generally, and Yi indicates the score for the ith subject.
Y'  Y' (read Y-prime) indicates a predicted score (think of "prime" as standing for "predicted"). This is relatively standard regression notation, although sometimes a Y with a circumflex or "hat" (i.e., a ^) over it is used instead, Ŷ. Usually the basis for prediction will be clear from the context. Individual predicted scores are symbolized Yi', but often the subscript is omitted.
Y' - MY  The model (or regression) deviation score, that is, the difference between the score predicted by the regression equation (or model) for each subject and the group mean. (Subscripts indicating subject are not shown.)
Y - Y'  The difference or deviation between an observed score and a predicted score is called a residual or error score. Again the subscripts are often omitted. Often lower case letters are used to represent residuals. For example, in Fig. 5.3 a lower case y represents the deviation between a raw Y score and the mean for the Y scores.
Z  A standardized or z score. It is the difference between the raw score and the mean divided by the standard deviation.
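As an informal check on how these symbols fit together, the definitional formulas can be evaluated on a small data set. The Python sketch below is an illustration only and is not one of the book's spreadsheet or SAS exercises; the five X and Y scores are hypothetical, and the function name regression_summary is an assumption made for this example. Every variance and covariance uses the divisor N, as in the glossary entries.

```python
import math

def regression_summary(x, y):
    """Evaluate glossary quantities for one predictor (hypothetical helper)."""
    n = len(x)
    mx = sum(x) / n                       # MX
    my = sum(y) / n                       # MY
    ss_xy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))  # SSXY
    cov_xy = ss_xy / n                    # COVXY = SSXY / N
    var_x = sum((xi - mx) ** 2 for xi in x) / n                 # VARX
    b = cov_xy / var_x                    # b = COVXY / VARX
    a = my - b * mx                       # a = MY - bMX
    predicted = [a + b * xi for xi in x]  # Y'
    ss_total = sum((yi - my) ** 2 for yi in y)
    ss_model = sum((pi - my) ** 2 for pi in predicted)
    ss_error = sum((yi - pi) ** 2 for yi, pi in zip(y, predicted))
    r2 = ss_model / ss_total              # same as VARmodel / VARtotal
    r = math.copysign(math.sqrt(r2), cov_xy)  # r takes the sign of COV
    df_model = 1                          # one predictor variable
    df_error = n - 1 - df_model
    f_ratio = (ss_model / df_model) / (ss_error / df_error)
    return {"a": a, "b": b, "r": r, "r2": r2,
            "ss_total": ss_total, "ss_model": ss_model,
            "ss_error": ss_error, "F": f_ratio}

# Hypothetical scores, not data from the book.
x_scores = [1, 2, 3, 4, 5]
y_scores = [2, 4, 5, 4, 5]
results = regression_summary(x_scores, y_scores)
for name, value in results.items():
    print(name, round(value, 4))
```

For these hypothetical scores the sketch gives b = 0.6, a = 2.2, r² = .60, and F(1, 3) = 4.5, and SSmodel plus SSerror equals SStotal (3.6 + 2.4 = 6.0).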
Appendix A: SAS Exercises
A BRIEF INTRODUCTION TO SPSS AND SAS

A spreadsheet is a powerful tool for manipulating data and conducting basic statistical analyses. Spreadsheets also allow students to explore the inner workings of meaningful definitional formulas and avoid the drudgery of hand calculations using opaque computational formulas. All of the computations described in this book can be accomplished by most spreadsheet packages with a regression function. As the analyses you wish to conduct become larger (i.e., contain many cases) and more complex (e.g., analysis of repeated measures), however, it will be to the student's advantage to learn a computer package dedicated to statistical analysis. To this end we have included exercises using both SPSS and SAS that parallel and extend the spreadsheet exercises.

SPSS is a system of commands and procedures for data definition, manipulation, and analysis. Recent versions of SPSS use a Windows-based menu and a point-and-click driven interface. We will assume the reader has basic familiarity with such Windows-based program features. SPSS includes a data interface that is similar in many respects to a spreadsheet. Rows represent cases and columns represent different variables. You can manipulate data using cut, paste, and copy commands in a manner similar to spreadsheets. SPSS also provides the ability to import spreadsheet files directly into the data editor.

Although similar to SPSS in its data analytic capabilities, SAS relies much less on a Windows-based menu to perform operations. Rather, it requires that the user become familiar with the SAS programming language to execute commands. Nonetheless, data may be imported from spreadsheet files for further analysis in SAS.

The SPSS and SAS exercises will provide enough information to navigate the commands and procedures necessary to complete each exercise. You should, however, consult the SPSS and SAS manuals and tutorials to gain a more thorough background in all of the options and short-cuts available in these programs. The following exercise will familiarize you with the SPSS and SAS interfaces and basic commands, respectively.
Exercise 1.2 Using SAS
Now you will complete this first exercise using SAS. If you are not sure how to perform any of these operations, ask someone to show you or experiment with the help provided by the program.
1. Invoke SAS. Import the spreadsheet file you created in Exercise 1.1. Since all SAS files have part of their names associated with a location on your computer, use the LIBNAME command to generate the library under which all of your work will be saved. For example, type (in the Program Editor window) and run the following command:
    LIBNAME sasuser 'c:\SASExercises';
Any spreadsheets you wish to import should be saved to this directory on your computer. Spreadsheets containing the data for each exercise are provided on the CD. You may want to copy all of them to the appropriate directory. If you would like experience creating data files, you could create your own spreadsheets, using the files on the CD as a guide. Now import your data using this next command:
    PROC IMPORT DATAFILE = 'c:\SASExercises\ex1pt2.xls' OUT = sasuser.ex1pt2;
    PROC PRINT DATA = sasuser.ex1pt2;
    TITLE 'Exercise 1.2.1';
    RUN;
Look in the Output window and confirm that the variable names and data are the same as your spreadsheet.
2. Change the name of the variable in the first column. As with the majority of SAS programs, you will use a DATA command to tell SAS which data and variables you want to work with and a PROC command to inform SAS of the operations you would like performed with these data. In this case, you would type:
    DATA sasuser.ex1pt2 (RENAME = (test1 = exam1));
    SET sasuser.ex1pt2;
    RUN;
    PROC PRINT DATA = sasuser.ex1pt2;
    TITLE 'Exercise 1.2.2';
    RUN;
in order to change variable one's name from Test 1 to Exam 1. Note that, while there is a character maximum of 32 for variable names, you may use the LABEL command to provide a more descriptive label that will appear in the output.
3. Create a new variable by writing the following program:
    DATA sasuser.ex1pt2;
    SET sasuser.ex1pt2;
    test4 = .;
    RUN;
Use the LABEL command to give the variable a meaningful description.
4. Enter data for the new variable. One means of doing this is to open the SAS data file that contains the original imported data and to switch from the BROWSE to the EDIT mode by clicking the appropriate toggle icon in the upper right hand corner of the toolbar. Once you are in edit mode, you are free to hand enter five hypothetical values for your new variable, namely Test 4.
5. To display basic descriptive statistics for these data, write and run the following code:
    PROC MEANS DATA = sasuser.ex1pt2 N MEAN STDDEV MIN MAX RANGE SUM;
    TITLE 'Descriptives';
    RUN;
This will open the Output Viewer and provide you with summary statistics for all of the variables in the spreadsheet, unless otherwise specified. Confirm that the N (i.e., count) and sum are the same values you obtained in Ex. 1.1.
6. You may return to the data file/editor and change some of the values. Run the descriptive statistics code again (e.g., PROC MEANS). Were the descriptive statistics updated correctly?
7. Save the output and data files. To save the output file, just make it the active window and then select File->Save As from the main menu. Give the file a meaningful name and click Save. The output will be saved in a file with a .lst extension. Since using the LIBNAME command gets SAS to create a permanent data file, there is no need to specify where you want your data to be saved. Now return to the data editor and save the actual data using a meaningful name. SAS data files are saved with a .sas7bdat extension. Exit SAS.

Exercise 3.6 The Sign Test in SAS
This exercise provides practice in using the sign test in SAS.
1. Invoke SAS and create two new variables. Name the first variable "Outcome" and the second variable "Frequency". Add value labels, 'no improvement' for '0' and 'improved' for '1', using the PROC FORMAT command. The code for these steps is as follows:
    PROC FORMAT;
    VALUE Outcome 1 = 'improved' 0 = 'no improvement';
    DATA sasuser.ex3pt6;
    Outcome = .;
    Frequency = .;
    FORMAT Outcome Outcome.;
    OUTPUT;
    RUN;
2. Next, write the code that enters the frequency of cases in each of the outcome classes. The frequency associated with outcome 1 is 30, while the frequency associated with outcome 0 is 10.
    DATA sasuser.ex3pt6;
    Outcome = 1; Frequency = 30; OUTPUT;
    Outcome = 0; Frequency = 10; OUTPUT;
    RUN;
3. In your code, you will ask SAS to weight the cases by the Frequency variable, which is essentially the same as entering 30 ones and 10 zeros in the Outcome column. You could simply enter a 1 or 0 for each case in the Outcome column, but doing so would become tedious when N is large. Now, request a binomial sign test using the following code:
    PROC FREQ;
    WEIGHT Frequency;
    TABLE Outcome / BINOMIAL;
    RUN;
4. Note that the Test Proportion is set at .50 by default. This indicates that you expect 50% of the outcomes to be positive. To change the expected proportion, enter the correct value in parentheses after the BINOMIAL statement:
    BINOMIAL (p=.3);
5. Look at the output. Is the N correct? What proportion improved? What was the expected proportion based on the Test Prop. column? Look at the '1 sided Pr > z' row towards the bottom of the output. This tells you the probability of finding 30 out of 40 positive outcomes if you expected only 20. If the value is less than alpha, then you reject the null hypothesis that the treatment has no effect.
6. Re-run the program so that the values in the Frequency column reflect that 26 improved and 14 did not. Would you reject the null hypothesis with an alpha of .05? What if 30 improved, but you expected 80% to improve? Keeping the alpha level at .05, would you reject the null hypothesis?
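The probabilities this exercise asks about can also be computed directly from the binomial distribution. The following Python sketch is an illustration under stated assumptions, not part of the SAS exercise: the function name binomial_upper_tail is hypothetical, and the result is the exact upper-tail probability. The SAS row label refers to Z, which suggests a normal approximation, so the value SAS prints may differ slightly from the exact one.

```python
from math import comb

def binomial_upper_tail(successes, n, p=0.5):
    """Exact P(X >= successes) when X is binomial with n trials and probability p."""
    return sum(comb(n, k) * p ** k * (1 - p) ** (n - k)
               for k in range(successes, n + 1))

# Counts taken from the exercise text: 30 (or 26) improved out of 40.
p_30_of_40 = binomial_upper_tail(30, 40)   # expected proportion .50
p_26_of_40 = binomial_upper_tail(26, 40)   # expected proportion .50

print("P(30 or more of 40 improve | P = .5):", round(p_30_of_40, 5))
print("P(26 or more of 40 improve | P = .5):", round(p_26_of_40, 5))
```

Each printed probability is compared with alpha exactly as in step 5: if it is smaller than alpha, the null hypothesis is rejected. Passing a different p (for example, p=0.8) corresponds to changing the expected proportion as in step 4.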
Exercise 5.5 SAS Descriptive Statistics
The purpose of this exercise is to familiarize you with requesting various descriptive statistics in SAS.
1. Invoke SAS. Import part of the data from the lie detection study, as follows:
    PROC IMPORT DATAFILE = 'c:\SASExercises\ex5pt5.xls' OUT = sasuser.ex5pt5;
    RUN;
Label your two variables using the following code:
    DATA sasuser.ex5pt5;
    SET sasuser.ex5pt5;
    LABEL s = 'Participant' y = 'Lies';
View your data in the output window:
    PROC PRINT DATA = sasuser.ex5pt5;
    TITLE 'Exercise 5.5';
    RUN;
2. The most common command for requesting descriptive statistics is the PROC MEANS command. To run descriptives for the imported data, type and run:
    PROC MEANS DATA = sasuser.ex5pt5 N MEAN STDDEV MIN MAX RANGE VAR;
    VAR y;
    TITLE 'Descriptives for 5.5';
    RUN;
Consult the output window for your results.
3. Examine the output. Do the values you obtained for N, the mean, variance, and standard deviation agree with the results from the spreadsheet you created in Exercise 5.4?
4. Note: A permanent SAS data file will be automatically saved under your LIBNAME directory.
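When comparing the SAS output with spreadsheet values, keep in mind that PROC MEANS by default uses the divisor N - 1 for the variance and standard deviation (the VARDEF=DF setting), which corresponds to VAR' and SD' in this book's notation, not to VAR and SD. Whether your spreadsheet from Exercise 5.4 used N or N - 1 is an assumption you should check. The short Python sketch below, using hypothetical scores and not the lie detection data, shows the two versions side by side.

```python
import math
import statistics

# Hypothetical scores for illustration only.
scores = [3, 5, 4, 6, 2]
n = len(scores)
mean = statistics.mean(scores)

var_sample = statistics.pvariance(scores)    # VAR: sum of squares / N
var_estimated = statistics.variance(scores)  # VAR': sum of squares / (N - 1)
sd_sample = math.sqrt(var_sample)            # SD
sd_estimated = math.sqrt(var_estimated)      # SD'

print("N =", n, " M =", mean)
print("VAR  =", var_sample, "  SD  =", round(sd_sample, 4))
print("VAR' =", var_estimated, "  SD' =", round(sd_estimated, 4))

# VAR' can also be obtained as VAR multiplied by N / (N - 1).
print("VAR * N/(N-1) =", var_sample * n / (n - 1))
```

If the SAS variance matches VAR' but not VAR, the two sets of results agree once the divisor is taken into account.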
Exercise 5.8 Obtaining Standard Scores in SAS
1. Open the data file you created in exercise 5.5. To do this via the menu system in SAS, select View->Explorer from the main menu. Highlight sasuser and double click on the ex5pt5 data file. Run your program for requesting descriptive statistics for the Lies variable as you did in the previous exercise, but this time add a command that requests standardized values to be saved as variables. The following code will allow you to do so:
    PROC REG DATA = sasuser.ex5pt5;
    MODEL Y = S / r cli clm;
    RUN;
2. Although your program asked SAS to run an inferential data analytic procedure, what you should be most interested in is the list of z-scores provided for you towards the end of the output. Do these agree with the scores you calculated in exercise 5.6?

Exercise 6.2 Graphing in SAS
In this exercise you will learn how to request graphs in SAS.
1. Import the data from exercise 6.1 into SAS:
    PROC IMPORT DATAFILE = 'c:\SASExercises\ex6pt2.xls' OUT = sasuser.ex6pt2;
    PROC PRINT DATA = sasuser.ex6pt2;
    TITLE 'Vocab & Grammar Data';
    RUN;
2. By typing just one procedural command, namely:
    DATA sasuser.ex6pt2;
    SET sasuser.ex6pt2;
    PROC UNIVARIATE PLOT DATA = sasuser.ex6pt2;
    VAR vocab grammar;
    TITLE 'Vocab & Grammar Var Descriptions';
    RUN;
SAS will generate a stem-and-leaf plot, normal probability plot, and box plot of your data in addition to several descriptive statistics.
3. Are the mean, median, standard deviation, interquartile range, and skewness statistics what you would expect given your results from Ex. 6.1?
4. Once graphs and charts are generated in SAS, sometimes the option is available to double click on an image and modify it. However, this is not always the case and you may need to write additional code in order to edit the characteristics of a chart or graph (e.g., the bin widths). The interested reader is referred to the SAS documentation.
5. Examine the stem-and-leaf plots and the box plots. Do they look like the plots you generated by hand?
6. Change some of the values in the data and rerun your program. See if you can predict how the plots will look based on the changed values.
7. Return the numbers you changed to their original values and save your work.

Exercise 7.5 SAS: Single-Sample Tests and Confidence Intervals
This exercise provides practice in using SAS to conduct single-sample tests. You will use the data from exercise questions 7.3.3 and 7.3.4 to determine if the sample is drawn from a population with a mean of 20.
1. Open SAS and import the spreadsheet data:
    PROC IMPORT DATAFILE = 'c:\SASExercises\ex7pt5.xls' OUT = sasuser.ex7pt5;
    PROC PRINT DATA = sasuser.ex7pt5;
    TITLE 'Samples 1 & 2';
    RUN;
2. Run the following code to analyze the data for Sample 1:
    PROC TTEST DATA = sasuser.ex7pt5 H0=20;
    VAR sample1;
    RUN;
3. Examine the output. Are the values for the standard deviation and the standard error the same as the values you calculated in exercise 7.3? What is the significance level of the t-test? Do you conclude that the sample was drawn from a population with a mean of 20 at the alpha = .05 level? What would your decision be if alpha were .01?
4. Conduct a single sample test using the data from question number four by running the following code:
    PROC TTEST DATA = sasuser.ex7pt5 H0=20;
    VAR sample2;
    RUN;
Why is this test not significant at the .05 level even though the mean is higher than the data from question number three?
5. To compute confidence intervals for the scores variable, run:
    PROC MEANS DATA = sasuser.ex7pt5 MEAN ALPHA = .05 CLM;
    VAR sample1 sample2;
    TITLE '95% CIs';
    RUN;
If you would like to calculate confidence intervals based on a value other than 95%, change the value after 'ALPHA ='.

Exercise 9.7 Regression in SAS
In this exercise you will use SAS to conduct a regression analysis and create a graph of the lies and mood data.
1. Invoke SAS. Import the Lies and Mood data you last used in exercise 9.2:
    PROC IMPORT DATAFILE = 'c:\SASExercises\ex9pt7.xls' OUT = sasuser.ex9pt7;
    PROC PRINT DATA = sasuser.ex9pt7;
    TITLE 'Mood and Lies Data';
    RUN;
2. To conduct a regression analysis, you will use the PROC REG command. For these data, you would write:
    PROC REG DATA = sasuser.ex9pt7;
    MODEL lies = mood;
    OUTPUT OUT=sasuser.ex9pt7 PREDICTED = predict RESIDUAL = resid;
    PLOT lies * mood;
    TITLE 'Simple Linear Regression';
    RUN;
3. To view the output, run:
    PROC PRINT DATA = sasuser.ex9pt7;
    TITLE 'Unstandardized Predicted Values & Residuals';
    RUN;
4. Examine the Model Summary in the output. Do the values for R and R² agree with your spreadsheet?
5. Now look at the results for the ANOVA. The Sums of Squares Model, Error, and Total should agree with the values from your spreadsheet for SSmod, SSerr, and SStot, respectively.
6. Examine the coefficients in the output. Can you find the values for a and b?
7. Finally, scroll down and observe the predicted values of Y (Y') and the residuals (Y - Y'). Do these values agree with your spreadsheet?
8. As part of the code you wrote above, a request was made for a scatterplot of Lies and Mood (PLOT lies * mood;). Also as part of the program, the lies and mood data have been saved under your chosen directory (sasuser.ex9pt7).
9. For additional practice you should try running the SAS Regression procedure and creating scatter plots for the lies and drug data.

Exercise 10.6 The F test in SAS Regression and One-Way ANOVA
This will show you how to find the significance of the amount of variance accounted for by Drug in Mood Scores. 1. Import the Lies and Drug spreadsheet (Note you could also open the data file you created in Exercise 9.6.6): PROC IMPORT DATAFILE = 'c:\SASExercises\ex10pt6.xls' OUT = sasuser.ex10pt6; PROC PRINT DATA = sasuser.ex10pt6; TITLE 'Drug and Lies Data'; RUN; Rerun the Regression procedure: PROC REG DATA - sasuser.ex10pt6; MODEL lies = drug; OUTPUT OUT=sasuser.ex10pt6 PREDICTED = predict RESIDUAL = resid; PLOT lies * drug; TITLE 'Simple Linear Regression'; RUN;
316
APPENDIX A 2.
Examine the ANOVA output. Does this value agree with the F you calculated in exercise 10.3? 3. The general command for a one-way ANOVA in SAS is PROC ANOVA. Run code that conducts a one-way ANOVA on drug (IV) and lies (DV) that requests descriptive statistics and Homogeneity of Variance: PROC ANOVA DATA = sasuser.ex10pt6; CLASS drug; MODEL lies = drug; MEANS drug / HOVTEST = LEVENE; TITLE 'ANOVA Output Exercise 10.6'; RUN; PROC SORT DATA = sasuser.ex10pt6; BY drug; PROC MEANS DATA = sasuser.ex10pt6 N MEAN STDDEV; BY drug; VAR lies; TITLE 'Summary of Lies by Drug'; RUN; 4. Examine the output. Make sure the N, means and standard deviations agree with your spreadsheet. You will find the sums of squares and the F value reported in the ANOVA table. The statistics should be identical to the ANOVA output from the regression procedure with the exception that the regression statistics are now termed between groups and the residual statistics are called within groups. Thus, you may analyze a single factor study with a categorical independent variable using either the Regression or One-Way ANOVA procedures. 5. The One-way ANOVA procedure does, however, also provide a test of the homogeneity of variances. Find Levene's statistic in your output. Levene's statistic provides a test that the variances of groups formed by the categorical independent variable are equal. This test is similar to the Ftest for equal variances presented earlier in this chapter. Levene's, however, is less likely to be biased by departures from normality. Exercise 11.7 Hierarchical Regression in SAS In this exercise you will learn how to conduct a hierarchical regression in SAS. 1. Import the ex11pt7 spreadsheet provided on the CD. For practice, you could open the Lies and Mood SAS data file you created in Ex. 10.6, create a new variable for the drug data, and dummy code the two groups. 2. 
For the hierarchical regression analysis, write and run the following code:
   PROC REG DATA = sasuser.ex11pt7;
      MODEL lies = mood drug / SCORR1(TESTS);
      TITLE 'Hierarchical Regression in SAS';
   RUN;
3. Examine the Model Summary in the output (under the Analysis of Variance heading). Notice that the model builds sequentially, such that the first level describes the model with only mood entered. The next level provides information about the model when the variable drug is added as a predictor. If you take the cumulative R2 for any step in the model and subtract the R2 from the previous step, you will get the R2 change associated with adding that variable to the model. Make sure that the values for R2, F, df, and significance agree with Figure 11.4.
4. Explore the ANOVA table to see if the overall model is significant. You can look at the Parameter Estimates table to find the values of the unstandardized and standardized partial regression coefficients.
5. As additional practice, use the SAS Regression procedure (PROC REG) to reanalyze the button pushing study presented in exercise 11.6.
Exercise 12.5 Analyzing a single-factor between-subjects study in SAS
In this exercise you will analyze the button pushing study using the four groups defined in Figure 12.3. In SAS you could analyze these data using the Regression procedure and dummy or contrast codes. If using dummy codes, you would run a regression notifying SAS of all of the coded vectors (X1 through X3) to be analyzed in a single block. You would then examine the output to determine if the model containing all three predictor variables is significant. If you want to test individual contrast codes, you would request that each of the coded predictor variables be analyzed in separate blocks, ultimately calculating the R2 change associated with each step. The resulting hierarchical analysis (see Ex. 11.7) will allow you to determine the significance of each contrast. Finally, as presented in this exercise, you could run the One-Way ANOVA procedure to determine if group is a significant predictor of number of pushes.
1. In Excel, create a variable called Pushes and enter the data from Figure 12.2. Next, import these data into SAS by running:
   PROC IMPORT DATAFILE = 'c:\SASExercises\ex12pt5.xls'
      OUT = sasuser.ex12pt5;
   PROC PRINT DATA = sasuser.ex12pt5;
      TITLE 'Group Predicting Button Pushes Data';
   RUN;
2. When using the One-Way ANOVA procedure, you do not need to use dummy or contrast coding.
Instead, create one variable called group and enter 0s for all the cases in the ND group, 1s for the C0 group, 2s for the C1 group, and 3s for the C2 group. The output will be more readable if you create value labels for each of these groups using the FORMAT procedure:
   PROC FORMAT;
      VALUE Group 0="ND" 1="C0" 2="C1" 3="C2";
   RUN;
3. To conduct your ANOVA (requesting Levene's test and descriptives), run the following code. The data are sorted by group first because PROC MEANS with a BY statement expects sorted input:
   PROC ANOVA DATA = sasuser.ex12pt5;
      FORMAT group group.;
      CLASS group;
      MODEL pushes = group;
      MEANS group / HOVTEST = LEVENE;
      TITLE 'ANOVA Output Exercise 12.5';
   RUN;
   PROC SORT DATA = sasuser.ex12pt5;
      BY group;
   PROC MEANS DATA = sasuser.ex12pt5 N MEAN STDDEV;
      FORMAT group group.;
      BY group;
      VAR pushes;
      TITLE 'Summary of Pushes by Group';
   RUN;
4. Examine the Descriptives output and confirm that the Ns, means, and standard deviations are correct. Next look at Levene's test. Is the assumption of equal variances met? Finally look at the output in the ANOVA table. Do the sums of squares and df agree with the values you calculated using the spreadsheets? Based on the F and significance values, do you reject the null hypothesis that the number of button pushes in each group is equal?
Exercise 13.5 Post-hoc tests in SAS
In this exercise you will learn how to use SAS to conduct post-hoc tests for the button pressing study presented in Chapter 12.
1. Open the data file for the button pushing study you created in exercise 12.5. By the nature of the code used to import these data, SAS created a permanent file under your LIBNAME (e.g., sasuser) directory.
2. Redo the One-Way ANOVA analysis. This time, however, request Tukey post-hoc analyses, as follows:
   PROC ANOVA DATA = sasuser.ex12pt5;
      CLASS group;
      MODEL pushes = group;
      MEANS group / HOVTEST = LEVENE TUKEY;
      TITLE 'ANOVA Output Exercise 13.5';
   RUN;
3. Requesting the post-hoc tests produces a table that groups together all means that do not differ from each other at the selected alpha level. This grouping should make it easy for you to create the subscripts necessary to present the results of your post-hoc tests in the format of Figure 13.5.
4. To create a bar graph of the means that contains the standard error of the means, run the following code:
   PROC GCHART DATA = sasuser.ex12pt5;
      VBAR group / SUMVAR = pushes AXIS = axis1 ERRORBAR = bars
         WIDTH = 5 GSPACE = 2 DISCRETE TYPE = mean
         CFRAME = ligr COUTLINE = blue CERROR = black;
   RUN;
   If you choose to use the Confidence Interval of Mean, you must include the size of the confidence interval (e.g., 95%, 99% or 90%).
Exercise 13.9 ANCOVA in SAS
In this exercise you will learn how to conduct an analysis of covariance in SAS.
1.
Create a new SAS data file by importing the spreadsheet data from exercise 13.5:
   PROC IMPORT DATAFILE = 'c:\SASExercises\ex13pt9.xls'
      OUT = sasuser.ex13pt9 REPLACE;
   PROC PRINT DATA = sasuser.ex13pt9;
      TITLE 'Age, Sex, and Smiles Data';
   RUN;
2. Conduct a hierarchical regression as you did in Exercise 11.7. The code is as follows:
   PROC REG DATA = sasuser.ex13pt9;
      MODEL smiles = age sex / SCORR1(TESTS);
      TITLE 'ANCOVA in SAS';
   RUN;
   DATA sasuser.ex13pt9;
      SET sasuser.ex13pt9;
      agexsex = age * sex;
   RUN;
   The model summary output should correspond to the values found in Figure 13.8.
3. Test the homogeneity of regression assumption by running a second analysis that enters age, sex, and then, in the third block, the age by sex interaction term. Similar to the code above, run the program:
   PROC REG DATA = sasuser.ex13pt9;
      MODEL smiles = age sex agexsex / SCORR1(TESTS);
      TITLE 'ANCOVA in SAS';
   RUN;
4. You could also conduct an ANCOVA using the General Linear Model (GLM) procedure (PROC GLM). To do this, run the code:
   PROC GLM DATA = sasuser.ex13pt9;
      CLASS sex;
      MODEL smiles = age sex age*sex;
   RUN;
5. Note that both Type I and Type III Sums of Squares are provided in the output. For this analysis, attend to the Type I Sums of Squares.
6. Examine the output. The statistics for age, sex, error, and total should be identical to the regression you ran in step 2.
7. When you use PROC GLM, the homogeneity of regression assumption is automatically tested because the testing of the interaction between the covariate and independent variable is built into the command. Whereas a custom model must be created in SPSS, SAS conveniently runs this step for you.
8. Examine the Parameter Estimates. Do they agree with the values you calculated in Exercise 13.7?
Exercise 14.7 SAS Analysis of a 2 x 2 Factorial Study
In this exercise you will learn how to use the General Linear Model procedure in SAS to conduct a 2 x 2 Analysis of Variance for the button-pushing study.
1. Create a new SAS data file containing variables for the number of button pushes (bps), gender (gen), and instruction set (inst). Enter the data from Figure 14.10. You can do this by entering the data directly into the Table Editor in SAS or by importing the spreadsheet from the CD:
   PROC IMPORT DATAFILE = 'c:\SASExercises\ex14pt7.xls'
      OUT = sasuser.ex14pt7;
   PROC PRINT DATA = sasuser.ex14pt7;
      TITLE 'Gender, Instruction, and BPs';
   RUN;
2. Using the PROC GLM command, request the following:
   DATA sasuser.ex14pt7;
      SET sasuser.ex14pt7;
      instxsex = sex * inst;
   RUN;
   PROC GLM DATA = sasuser.ex14pt7;
      CLASS sex inst;
      MODEL pushes = sex inst sex * inst;
3. For options, you will want SAS to generate estimates of effect size and homogeneity tests. You will also want Estimated Marginal Means. To get these, continue the program above by writing:
      MEANS sex inst sex * inst / HOVTEST=LEVENE WELCH;
      LSMEANS sex inst / PDIFF STDERR;
      LSMEANS sex * inst / SLICE = inst;
   RUN;
   Run the entire program.
4. Examine the means for each of the cells and marginals. Make sure these values agree with your spreadsheets.
5. Now look at the Estimated Marginal Means (generated by LSMEANS). The values for the grand mean, sex and instruction set main effects, and the interaction should be the same as those generated by the descriptive statistics command. In the case of an unbalanced design (i.e., one or more of the cells were of different size), the descriptive statistics would provide traditional weighted means while the estimated marginal means would be unweighted. Typically, when cells are unequal due to subject attrition or random factors, you would want to report the unweighted means for any significant effects resulting from your analysis.
6. Look at Levene's Test of the Equality of Error Variances. Notice that this test is not statistically significant, indicating that the assumption of equal variances is met.
7. Finally examine the box labeled The GLM Procedure. Look at the lines for SEX, INST, SEX*INST, Error, and corrected total. Check that the sums of squares, df, mean square, F, and partial eta-squared values correspond to your spreadsheet calculations. For additional practice you should try reanalyzing the data from exercise 14.8 using SAS. Do all of the relevant statistics agree with your spreadsheet analysis?
Exercise 15.3 A Single-Factor Repeated Measures Study in SAS In this exercise you will analyze the lie-detection study, last described in Exercise 15.2, using the repeated measures PROC GLM of SAS. 1. Create a new SAS data file with three variables. Create one variable for subject number (s), one variable for number of lies detected in the drug condition (drug), and a third variable for number of lies detected in the placebo condition (placebo). Give each variable a meaningful label. 2. Enter 1 - 5 in the subjects column. In the drug column, enter the number of lies detected for subjects 1 - 5 in the drug condition. Do the same in the placebo condition. Thus, the SAS file set up for a repeated measures study
will have a single row for each subject with the scores from each level of the repeated measures variable in separate columns. You could also import the spreadsheet from the CD where steps 1 and 2 are already complete:
   PROC IMPORT DATAFILE = 'c:\SASExercises\ex15pt3.xls'
      OUT = sasuser.ex15pt3;
   PROC PRINT DATA = sasuser.ex15pt3;
      TITLE 'Lies, Drug/Placebo - Repeated Measures';
   RUN;
3. To run this analysis, run the following code:
   PROC GLM DATA = sasuser.ex15pt3;
      CLASS;
      MODEL drug placebo = / NOUNI;
      REPEATED condition 2 / PRINTE;
   RUN;
   In this case, the PRINTE option will tell SAS to generate the various statistics of interest to us, such as Mauchly's Test of Sphericity.
4. Examine the output. For the moment, ignore the Multivariate Tests and Mauchly's Test of Sphericity. Look at the Sphericity values for your Within-Subjects Effects. The SS, df, MS, F, and partial eta-squared values should agree with your spreadsheet results. Do they?
Exercise 15.5 SAS Analysis of a Repeated Measures Study with Four Levels
In this exercise you will analyze the data for the four-level repeated measures study presented in Ex. 15.4.
1. Create a new SAS data file for the button pushes study that is set up for a repeated measures analysis. You will need five variables: one for the subject number (s), and one for each of the four levels of the infant diagnosis factor: Down syndrome (ds), fetal alcohol syndrome (fas), low birth weight (lbw), and no diagnosis comparison (cntrl). Give each variable a meaningful label (either by hand in the Table Editor or by running the LABEL command).
2. Enter 1 - 4 in the subjects column and the appropriate number of button presses for each of the four cases in the remaining columns. Alternatively, you can import an Excel file from the CD with the data by typing:
   PROC IMPORT DATAFILE = 'c:\SASExercises\ex15pt5.xls'
      OUT = sasuser.ex15pt5;
   PROC PRINT DATA = sasuser.ex15pt5;
      TITLE 'Repeated Measures Study w/ 4 Levels';
   RUN;
3. To run the primary analysis, use the code:
   PROC GLM DATA = sasuser.ex15pt5;
      CLASS;
      MODEL ds fas lbw cntrl = / NOUNI;
      REPEATED dx 4 / PRINTE;
   RUN;
4. Check your descriptive statistics. Do they agree with your spreadsheet? Examine the Sphericity values in the output. Do the SS, df, MS, F, and partial eta-squared values agree with your spreadsheet?
5. Look in the table labeled Mauchly's Test of Sphericity. Mauchly's test is not significant, indicating that the sphericity assumption has been met. Remember, however, that Mauchly's test is underpowered when the sample
size is small. It would therefore be prudent, in this case, to assume that there is at least some violation of sphericity. Note that the lower bound for epsilon is .33, that is, 1/(k - 1) = 1/(4 - 1). Also note that the Greenhouse-Geisser and Huynh-Feldt estimates of epsilon differ by a large amount. This is due to the small number of cases in the study. Typically the two estimates would be closer.
6. Examine the Multivariate Tests. None are significant, but remember that N is small and the multivariate tests are not very powerful under these circumstances. Apply the modified univariate approach. What is your statistical decision based on this approach?
Exercise 16.2 SAS Analysis of a One-between, One-within Two-factor Study
This exercise walks you through a SAS analysis for a mixed two-factor study. You will use data from Ex. 16.1.
1. Create a new SAS data file. The data file should contain variables for subject number (s), gender (gen), instruction set I (set1), and instruction set II (set2).
2. Enter 1 through 8 for the 8 subjects in the s column. Enter 0 for males and 1 for females in the gen column. Finally, enter the number of button pushes for each subject in the appropriate column for instruction set. Create appropriate variable labels and value labels for the variables. To import a file from the CD, run the following program:
   PROC IMPORT DATAFILE = 'c:\SASExercises\ex16pt2.xls'
      OUT = sasuser.ex16pt2;
   PROC PRINT DATA = sasuser.ex16pt2;
      TITLE '1 B/W & 1 W/I';
   RUN;
3. Run the analysis, remembering to request the appropriate statistics, as follows:
   PROC GLM DATA = sasuser.ex16pt2;
      CLASS gen;
      MODEL set1 set2 = gen / NOUNI;
      REPEATED inst 2 / PRINTE;
   RUN;
4. Check the descriptive statistics to ensure that you set up the variables and entered the data correctly. Next, determine if there is homogeneity of covariances.
Usually, you would then check the sphericity assumption, but because there are only two levels of the repeated measure in this design, epsilon = 1 and Mauchly's test of sphericity does not apply. Finally, check Levene's test to make sure the homogeneity of variances assumption holds for the between-subjects factor.
5. Examine the SS, df, MS, F, significance levels, and partial eta-squared of the within-subjects effects. Do they agree with your spreadsheet? Because there are only two levels of the within-subjects factor, the multivariate tests, sphericity assumed tests, lower bound, and epsilon-corrected tests will all yield the same F-ratios and significance levels.
6. Examine the test of the between-subjects effect for gender. Are the SS, MS, and df correct for the sex factor and the error term? What about the F-ratio and partial eta-squared?
Exercise 16.4 SAS Analysis of a Two-Within Two-Factor Study
This exercise walks you through a SAS analysis for a two-within two-factor study. You will use data from Ex. 16.1.
1. Create a new SAS data file, or adapt one from Ex. 16.2. The data file should contain variables for subject number (s) and button pushes for the 4 repeated-measures conditions: husbands in instruction set I (set1h), wives in instruction set I (set1w), husbands in instruction set II (set2h), and wives in instruction set II (set2w).
2. Enter 1 through 4 for the four sets of matched scores. Enter the number of button pushes in the appropriate column for the instruction set and gender combinations. Create appropriate variable labels and value labels for the variables. To import a spreadsheet from the CD, run the following:
   PROC IMPORT DATAFILE = 'c:\SASExercises\ex16pt4.xls'
      OUT = sasuser.ex16pt4;
   PROC PRINT DATA = sasuser.ex16pt4;
      TITLE '2 W/I';
   RUN;
3. This analysis and its relevant statistics can be obtained by running the following program:
   PROC GLM DATA = sasuser.ex16pt4;
      CLASS;
      MODEL set1h set1w set2h set2w = / NOUNI;
      REPEATED spouse 4 / PRINTE;
   RUN;
4. Check the descriptive statistics (using the PROC MEANS command) to ensure that you set up the variables and entered the data correctly.
5. Examine the SS, df, MS, F, significance levels, and partial eta-squared of the within-subjects effects. Do they agree with your spreadsheet?
Appendix B: Answers to Selected Exercises
2. GETTING STARTED: AN INTRODUCTION TO HYPOTHESIS TESTING
2.1.1
8 or higher.
2.1.2
6 or higher.
2.1.3
P(type II) = P(4|5|6|7) = (1+2+3+4)/20 = .5; power = 1 - β = 1 - .5 = .5.
2.1.4
P(type II) = P(4|5) = (1+2)/20 = .15; power = 1 - β = 1 - .15 = .85.
3. INFERRING FROM A SAMPLE: THE BINOMIAL DISTRIBUTION
3.1.1 and 3.1.2
# tosses     4    5    6    7     8     10      20
# outcomes   16   32   64   128   256   1,024   1,048,576
# classes    5    6    7    8     9     11      21
3.3.1
P(6 heads) = 1/64 = .0156
P(0|6 heads) = (1+1)/64 = .0313
P(5|6 heads) = (6+1)/64 = .109
P(2|3|4 heads) = (15+20+15)/64 = .781
P(0|1|5|6 heads) = (1+6+6+1)/64 = .219
3.3.2
P(8 heads) = 1/256 = .00391
P(0|8 heads) = (1+1)/256 = .00781
P(7|8 heads) = (8+1)/256 = .0352
P(6|7|8 heads) = (28+8+1)/256 = .145
P(0|1|7|8 heads) = (1+8+8+1)/256 = .0703
P(0|1|2|6|7|8 heads) = (1+8+28+28+8+1)/256 = .289
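The probabilities in 3.3.1 and 3.3.2 can be checked by counting outcomes. The short Python sketch below is not part of the original exercises; the function name prob_heads is purely illustrative, and the code assumes a fair coin, so that all 2^N sequences of tosses are equally likely.

```python
from math import comb

def prob_heads(n, heads):
    """Probability that the number of heads in n fair tosses falls in `heads`."""
    favorable = sum(comb(n, k) for k in heads)  # outcomes in the listed classes
    return favorable / 2 ** n                   # 2**n equally likely outcomes

# Two of the answers above, recomputed from the counts of outcomes
print(prob_heads(6, [2, 3, 4]))     # (15 + 20 + 15) / 64
print(prob_heads(8, [0, 1, 7, 8]))  # (1 + 8 + 8 + 1) / 256
```

Each class of outcomes (0 heads, 1 head, and so on) contributes its binomial coefficient to the numerator, which is the same counting done by hand in the answers.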
3.3.3
A head can appear once in each serial position—the first, the second, and so forth, up to the nth.
3.4.2
              # Trials = 11                # Trials = 12
# Heads/      # Outcomes                   # Outcomes
Class         in Class      Probability    in Class      Probability
0             1             0.00049        1             0.00024
1             11            0.00537        12            0.00293
2             55            0.02686        66            0.01611
3             165           0.08057        220           0.05371
4             330           0.16113        495           0.12085
5             462           0.22559        792           0.19336
6             462           0.22559        924           0.22559
7             330           0.16113        792           0.19336
8             165           0.08057        495           0.12085
9             55            0.02686        220           0.05371
10            11            0.00537        66            0.01611
11            1             0.00049        12            0.00293
12                                         1             0.00024
# Classes     12                           13
# Outcomes    2048                         4096
3.4.3
       One-tailed                  Two-tailed
N      alpha = .05   alpha = .10   alpha = .05   alpha = .10
4      X             4             X             X
5      5             5             X             0, 5
6      6             6             0, 6          0, 6
7      7             6-7           0, 7          0, 7
8      7-8           7-8           0, 8          0-1, 7-8
9      8-9           7-9           0-1, 8-9      0-1, 8-9
10     9-10          8-10          0-1, 9-10     0-1, 9-10
11     9-11          9-11          0-1, 10-11    0-2, 9-11
12     10-12         9-12          0-2, 10-12    0-2, 10-12
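The critical values in 3.4.3 follow from accumulating binomial probabilities from the tail inward until alpha would be exceeded. The following Python sketch illustrates that procedure under the assumption of a fair coin (P = .5); the function names are illustrative and do not come from the text. A result of None corresponds to an X in the table, meaning that no outcome is extreme enough to reject.

```python
from math import comb

def upper_tail(n, k):
    """P(X >= k) when X is the number of heads in n fair tosses."""
    return sum(comb(n, j) for j in range(k, n + 1)) / 2 ** n

def one_tailed_critical(n, alpha):
    """Smallest k such that P(X >= k) <= alpha; the critical region is k..n."""
    for k in range(n + 1):
        if upper_tail(n, k) <= alpha:
            return k
    return None  # no outcome is rare enough

def two_tailed_critical(n, alpha):
    """Return (low, high): reject for 0..low or high..n, splitting alpha evenly."""
    for k in range(n + 1):
        if 2 * upper_tail(n, k) <= alpha:  # the two tails are symmetric when P = .5
            return n - k, k
    return None

print(one_tailed_critical(12, 0.05))  # start of the upper critical region for N = 12
print(two_tailed_critical(11, 0.10))  # bounds of the two critical regions for N = 11
```

The same functions can be applied to the larger samples in Exercises 3.5.1 through 3.5.7, where counting by hand is no longer practical.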
3.5.1
Critical values for N = 40, alpha = .05, one-tailed, are 26-40. You would reject the null hypothesis because 30 falls in this range.
3.5.2
Again, you would reject the null hypothesis because 26 falls in the range 26-40.
3.5.3
Critical values for N = 40, alpha = .05, two-tailed, are 0-13 and 27-40. You would reject the null hypothesis if 30 improved because 30 is in this range, but you would not reject the null hypothesis if only 26 improved. For a two-tailed test, 26 is not in the critical range.
3.5.4
Critical values for N = 50, alpha = .05, two-tailed, are 0-17 and 33-50. If the number who improved were any number 0-17 you would reject the null hypothesis, thus 18 is the smallest number who, even if they all improved, would not allow you to reject the null hypothesis. The largest number who, even if they all improved, would not allow you to reject the null hypothesis is 32.
3.5.5
Critical values for N = 50, alpha = .05, one-tailed, are 32-50. In this case, the largest number who, even if they all improved, would not allow you to reject the null hypothesis is 31. The smallest number that would allow you to reject the null hypothesis is 32.
3.5.6
Critical values for N = 50, alpha = .01, two-tailed, are 0-15 and 35-50. If the number who improved were any number 0-15 you would reject the null hypothesis, thus 16 is the smallest number who, even if they all improved, would not allow you to reject the null hypothesis. The largest number who, even if they all improved, would not allow you to reject the null hypothesis is 34.
3.5.7
Critical values for N = 50, alpha = .01, one-tailed, are 34-50. In this case, the largest number who, even if they all improved, would not allow you to reject the null hypothesis is 33. The smallest number that would allow you to reject the null hypothesis is 34.
5. DESCRIBING A SAMPLE: BASIC DESCRIPTIVE STATISTICS
5.9.2
The mean is pulled in the direction of the outlier and the standard deviation increases noticeably. The standard score for the outlier will be quite large, probably greater than three.
7. INFERRING FROM A SAMPLE: THE NORMAL AND t DISTRIBUTIONS
7.1.2
T = 26, Z = 1.90. Would not reject: 1.90 < 1.96. T = 28, Z = 2.53. Would reject: 2.53 > 1.96. T = 30, Z = 3.16. Would reject: 3.16 > 1.96.
7.1.3
T = 26, Z = 1.90. Would not reject: 1.90 < 2.58. T = 28, Z = 2.53. Would not reject: 2.53 < 2.58. T = 30, Z = 3.16. Would reject: 3.16 > 2.58.
7.1.4
T = 25, Z = -1.83. Would not reject: -1.83 > -1.96. T = 35, Z = +1.83. Would not reject: +1.83 < +1.96. T = 23, Z = -2.56. Would reject: -2.56 < -1.96. T = 37, Z = +2.56. Would reject: +2.56 > +1.96.
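The Z scores in 7.1.2 through 7.1.4 come from the normal approximation to the binomial, Z = (T - NP) / sqrt(NPQ). This appendix does not restate N and P for these exercises, so the values used in the Python sketch below (N = 40 with P = .5 for 7.1.2 and 7.1.3, and N = 40 with P = .75 for 7.1.4) are assumptions inferred from the answers themselves, and the function name is illustrative.

```python
from math import sqrt

def z_for_count(t, n, p):
    """Z score for an observed count t under the normal approximation to Binomial(n, p)."""
    mean = n * p                 # expected count, NP
    sd = sqrt(n * p * (1 - p))   # standard deviation, sqrt(NPQ)
    return (t - mean) / sd

# Assumed N = 40, P = .5 (the sample size matches Exercise 3.5.1)
for t in (26, 28, 30):
    print(t, round(z_for_count(t, 40, 0.5), 2))

# Assumed N = 40, P = .75
for t in (25, 35, 23, 37):
    print(t, round(z_for_count(t, 40, 0.75), 2))
```

Each Z is then compared with the critical value for the chosen alpha (1.96 or 2.58 for two-tailed tests), as in the answers above.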
7.1.5
N = 80, reject if T is in the range 0-31 or 49-80.
7.1.6
N = 100, reject if T is in the range 0-40 or 60-100. N = 200, reject if T is in the range 0-86 or 114-200.
7.1.7
N = 80, reject if T is in the range 0-8 or 24-80.
7.2.1
Only two digits after the decimal point are given for Z scores in the table, so you will need to interpolate.
M = 100, S = 11.8, X = 105, Z = 0.424, 33.5% > 105.
M = 100, S = 11.8, X = 115, Z = 1.271, 10.2% > 115.
M = 100, S = 11.8, X = 80, Z = -1.695, 4.50% < 80.
M = 100, S = 11.8, X = 97|103, Z = ±0.254, 80.0% < 97 or > 103.
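As a check on the table lookups in 7.2.1, the same areas can be computed directly rather than interpolated. This Python sketch is an illustration only; it uses the complementary error function from the standard library, and the function names are not from the text. Results computed this way may differ from the tabled answers in the last digit because the table requires interpolation.

```python
from math import erfc, sqrt

def z_score(x, mean, sd):
    """Standard score for a raw score x."""
    return (x - mean) / sd

def upper_tail_area(z):
    """P(Z > z) for a standard normal variable."""
    return 0.5 * erfc(z / sqrt(2))

z_115 = z_score(115, 100, 11.8)
z_80 = z_score(80, 100, 11.8)
print(round(z_115, 3), upper_tail_area(z_115))  # proportion scoring above 115
print(round(z_80, 3), upper_tail_area(-z_80))   # proportion scoring below 80, by symmetry
```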
7.2.2
2.28% > 2S above mean. 0.13% > 3S above mean. 4.56% > +2S or < -2S from mean. 0.26% > +3S or < -3S from mean.
7.2.3
Z(critical, .05) = 1.65, one-tailed. Z(critical, .01) = 2.33, one-tailed. Z(critical, .05) = 1.96, two-tailed. Z(critical, .01) = 2.58, two-tailed.
7.2.5
Z = 0.678, S = 11.8, X = 108, 12/41 = 29.3% > 108. Z = ±0.678, S = 11.8, X = 92|108, 17/41 = 41.5% < 92 or > 108.
7.3.1
σ = 11.8, N = 100, σM = 1.18. Critical value for t, df = 99, alpha = .01, two-tailed, is ±2.66 (if the degrees of freedom you want are not tabled, always use the next lowest value in the table, in this case the value for df = 60). μ = 100, M = 94, ZM = -5.08. A sample mean this deviant from the population mean would occur less than 1% of the time if the sample really were drawn from a population of terrestrials. Therefore the null hypothesis is rejected. The sample likely consists of aliens.
7.3.2
Standardized normal scores less than -2.63 occur less than 0.43% of the time.
7.3.3
M = 36.83, SD = 9.33, SD' = 10.23, SD'M = 4.17, t = 2.60. Critical value for t, df = 5, alpha = .05, two-tailed, is ±2.57. Because 2.60 > 2.57, reject null hypothesis that sample is drawn from a population whose mean is 26. Critical value for alpha = .01 is 4.03. Because 2.60 < 4.03, you would not reject the null hypothesis if you had selected an alpha level of .01.
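The t ratio in 7.3.3 can be reproduced from the summary statistics alone. The Python sketch below is illustrative rather than part of the exercise; it assumes N = 6 (implied by df = 5) and takes SD' as the estimated population standard deviation given in the answer. Because it does not round the standard error before dividing, its t can differ from the printed 2.60 in the second decimal place.

```python
from math import sqrt

def one_sample_t(mean, sd_est, n, mu):
    """Return (t, standard error) for a one-sample t test from summary statistics."""
    se = sd_est / sqrt(n)        # estimated standard error of the mean, SD'M
    return (mean - mu) / se, se

t, se = one_sample_t(mean=36.83, sd_est=10.23, n=6, mu=26)
print(round(se, 2), round(t, 2))
```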
7.3.4
M = 45.5, SD = 24.58, SD' = 26.93, SD'M = 10.99, t = 1.77. Critical value for t is ±2.57. Because 1.77 < 2.57, you would not reject the null hypothesis. Based on means alone, the evidence that the population mean is not 26 seems stronger for the second set of data (M = 45.5) than for the first set (M = 36.83), yet you reject the null hypothesis for the first but not for the second set. This occurs because the data are much more variable for the second set than for the first (SD = 24.58 compared to 9.33). Due to the greater raw score variability, the standard error of the mean estimated from the second set is larger than for the first, which offsets the larger deviation between sample and population mean and results in a smaller standardized value for that deviation (t = 1.77 compared to 2.60 for the first set).
7.3.5
N = 16, σM = 2.95, t(15)critical = 2.13. Reject if M < 93 or > 107. N = 32, σM = 2.09, t(31)critical = 2.04 (use df = 30). Reject if M < 95 or > 105. N = 64, σM = 1.48, t(63)critical = 2.00 (use df = 60). Reject if M < 97 or > 103. The larger the sample size, the smaller the standard error of the mean. As sample size increases, the sample means (for samples of that size) are distributed more tightly around the population mean.
7.3.6
SD'M(boys) = 4.02, SD'M(girls) = 3.02. See Fig. 6.2 in text.
7.4.1
lower confidence limit = 26.10, upper confidence limit = 47.56.
7.4.2
lower confidence limit = 17.24, upper confidence limit = 73.76.
7.4.3
Yes. The first confidence interval (26.10 to 47.56) does not include the null hypothesis population mean of 26, whereas the second interval (17.24 to 73.76) does include it. Therefore we reject the null hypothesis that μ = 26 for the first set of data, but not for the second set.
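The confidence limits in 7.4.1 and 7.4.2 are the sample mean plus or minus the critical t times the estimated standard error of the mean. The following Python sketch illustrates that arithmetic using the rounded values reported in 7.3.3 and 7.3.4; the function name is hypothetical, and the results may differ from the printed limits by a unit or two in the second decimal place because of rounding.

```python
def confidence_interval(mean, se, t_critical):
    """Return (lower, upper) confidence limits for a mean."""
    margin = t_critical * se   # half-width of the interval
    return mean - margin, mean + margin

first = confidence_interval(mean=36.83, se=4.17, t_critical=2.57)
second = confidence_interval(mean=45.5, se=10.99, t_critical=2.57)

# An interval that excludes 26 corresponds to rejecting the null hypothesis
for low, high in (first, second):
    print(round(low, 2), round(high, 2), low <= 26 <= high)
```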
9. BIVARIATE RELATIONS: THE REGRESSION AND CORRELATION COEFFICIENTS
9.3.4
The square root is always positive and so in this case the negative sign for the correlation is lost.
9.4.4
The predicted score will be the mean of the group to which the raw score belongs.
9.5.6
The Y intercept is the number of lies detected for the no-drug group; the slope is the difference in number of lies detected between the no-drug and drug group.
9.6.1
r = -0.657, r2 = 0.431. The more older siblings an infant has, the fewer words he or she is likely to speak at 18 months of age. The predicted number of words for an infant with no older siblings is 34.0 (Y intercept). For each additional sibling, an infant is likely to speak 3.02 fewer words, on average (slope).
9.6.2
If number of words = 4 for subject 3, r = -0.291, r2 = 0.0845. If number of words = 77 for subject 16, r = 0.298, r2 = 0.0888. If number of words = 222 for subject 16, r = 0.561, r2 = 0.315. If number of sibs = 10 for subject 7, r = -0.082, r2 = 0.0067. Note how a single data entry error can reduce a large correlation to a small one, or can even change a large negative correlation to a large positive one.
9.6.3
r = -0.571, r2 = 0.326. The mean number of words spoken by the group with no older siblings (M = 34.7) is higher than the mean number spoken by the group with one or more older siblings (M = 26.7).
9.6.4
The predicted score for each subject is the mean score for the group to which that subject belongs, thus the predicted score for infants with no siblings is 34.7 (the mean number of words spoken by those infants) and the predicted score for infants with one or more siblings is 26.7. If the number of subjects in the two groups had been equal initially, and two subjects had been "lost" from the no-sibling group, the multiple regression analysis would have yielded what in older texts is called an unweighted means analysis for studies with unequal cell sizes. The moral is, what is a problem for traditional analyses of variances (unequal cell size) is routine within the multiple regression approach. In general, unequal cell sizes present no computational problems.
10. INFERRING FROM A SAMPLE: THE F DISTRIBUTION
10.1.1
1.250, 1.111, 1.053, 1.026, 1.010.
10.1.2
About 40 or 50.
10.2.1
F(3,24)critical,.05 = 3.01. F(6,24)critical,.05 = 2.51.
10.2.2
F(1,8)critical,.05 = 5.32. 5.24 is not big enough to reject.
10.2.3
F(2,30)critical,.05 = 3.32. F(2,30)critical,.01 = 5.39. The larger number, 5.39, demarcates a smaller area under the curve, in this case 1%.
10.2.4
N = 20, because F(1,18)critical,.05 = 4.41 and 4.42 > 4.41.
10.2.5
No, because the smallest critical value for F with one degree of freedom in the numerator and an infinite number of degrees of freedom in the denominator is 3.84 and 3.8 is less than that.
10.2.6
MSplacebo = 12,892 and N = 8, MSdrug = 49,481 and N = 10. The null hypothesis is rejected because F(9,7)critical,.05 = 3.68 is less than F(9,7) = 3.84.
10.3.8
F(1,8)computed = 3.46; F(1,8)critical,.05 = 5.32. Do not reject; computed F not big enough. For this sample size and these data, the effect of drug treatment on number of lies detected is not statistically significant.
10.4.3
F(1,8)computed = 3.12; F(1,8)critical,.05 = 5.32. Do not reject; computed F not big enough. For this sample size and these data, the effect of mood on number of lies detected is not statistically significant.
10.5.1
F(1,14) = 10.62; F(1,14)critical,.01 = 8.86. Reject at both .05 and .01 levels.
10.5.2
F(1,14) = 6.77; F(1,14)critical,.05 = 4.60. Reject at the .05 but not the .01 level.
10.5.3
Student's t test (for independent groups).
10.5.5
See Fig. 10.4 in text. For no sibs: N = 7, M = 34.7, SD'M = 2.62, M - SD'M = 32.1, M + SD'M = 37.3, t(6)critical,.05 = 2.45, lower 95% confidence limit = 28.3, upper 95% confidence limit = 41.1. For one or more sibs: N = 9, M = 26.7, SD'M = 1.83, M - SD'M = 24.8, M + SD'M = 28.5, t(8)critical,.05 = 2.31, lower 95% confidence limit = 22.5, upper 95% confidence limit = 30.9. The mean for the no sib group (34.7) does not fall within the 95% confidence interval for the sib group (22.5-30.9), nor does the mean for the sib group (26.7) fall within the 95% confidence interval for the no sib group (28.3-41.1). Therefore the two means are probably significantly different (at the .05 level), but this would need to be verified with a formal statistical test.
11. ACCOUNTING FOR VARIANCE: TWO OR MORE PREDICTORS
11.1.2
Estimated standard error of estimate (predicting lies from drug) = 1.87. Estimated standard deviation (predicting lies from mean) = 2.002. Percentage reduction = 6.6%. Estimated standard error of estimate (predicting lies from mood) = 1.899. Estimated standard deviation (predicting lies from mean) = 2.002. Percentage reduction = 5.29%.
11.1.3
Drug: R2 = 0.30, R2adj = 0.21. Mood: R2 = 0.28, R2adj = 0.19.
11.4.3
R2 = .40, N = 10, F(1,8) = 5.33, F(1,8)critical,.05 = 5.32
11.4.4
R2 = .30, N = 14, F(1,12) = 5.14, F(1,12)critical,.05 = 4.75
11.4.5
R2 = .20, N = 20, F(1,18) = 4.50, F(1,18)critical,.05 = 4.41
R2 = .10, N = 40, F(1,38) = 4.22, F(1,30)critical,.05 = 4.17
R2 = .05, N = 80, F(1,78) = 4.00, F(1,60)critical,.05 = 4.00
If a proportion of variance is just barely significant, and if you want to be able to claim that half that amount is statistically significant, then you would need to double the number of subjects included in the study.
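The F ratios in these answers can be recovered from R2 and N alone, because F = (R2 / dfmodel) / ((1 - R2) / dferror), with dferror = N - k - 1 for k predictors. The Python sketch below illustrates this for the single-predictor case; the function name is an assumption for illustration and is not from the text.

```python
def f_from_r2(r2, n, k=1):
    """Return (F, (df_model, df_error)) for a regression with k predictors and n cases."""
    df_error = n - k - 1
    f = (r2 / k) / ((1 - r2) / df_error)  # variance accounted for over residual variance
    return f, (k, df_error)

f, df = f_from_r2(0.40, n=10)
print(df, round(f, 2))
f, df = f_from_r2(0.20, n=20)
print(df, round(f, 2))
```

Halving R2 while doubling N keeps F close to the critical value, which is the pattern the last answer describes.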
11.5.1
Predicting lies detected: Adding mood to drug:
                          Total                  Change
Step   Variable added     R2     df    F         R2     df    F
1      Drug               .302   1,8   3.46      .302   1,8   3.46
2      Mood               .346   2,7   1.85      .044   1,7   <1
11.5.2
Predicting words spoken: In addition to knowing whether or not an infant has an older sibling, does knowing the exact number matter?
                          Total                  Change
Step   Variable added     R2     df    F         R2     df    F
1      0 vs >0            .326   1,14  6.77      .326   1,14  6.77
2      #Sibs              .450   2,13  5.32      .124   1,13  2.94
Infants who have no older siblings use more words than infants who have one or more older siblings. The binary distinction between infants who have no, and who have one or more, older siblings accounts for 32.6% of the variance in number of words used (F(1,14) = 6.77, p < .05). An additional 12.4% is accounted for by knowing the actual number of older siblings, but this is not significant (F(1,13) = 2.94, NS).
11.6.10
See text.
12. SINGLE-FACTOR BETWEEN SUBJECTS STUDIES
12.1.4
Y' = a + b1X1 + b2X2 + b3X3, dfmodel = 3, dferror = 12.
12.1.5
Predicted scores are the means for the four groups.
12.1.6
F(3,12) = 7.63, F(3,12)critical,.05 = 3.49, therefore reject.
12.1.7
R2 = 0.097, F(3,12) = 0.431, NS. R2 = 0.562, F(3,13) = 5.562, p < .05.
12.2.3
One possible set of contrast codes for five groups is as follows:
                               Coded Variable
Group                          X1    X2    X3    X4
Has no desire for children     -4    0     0     0
Has none but would like        1     -3    0     0
Has 1 child                    1     1     -2    0
Has 2 children                 1     1     1     -1
Has >2 children                1     1     1     1
12.3.2
R2 = 0.656, F(3,12) = 7.63, p < .01.
12.3.3
R2 = 0.656, F(3,12) = 7.63, p < .01.
12.3.4
As long as the subjects in each group and their scores on the DV do not change, the G - 1 predictor variables coding for group membership will always account for the same proportion of variance. The particular way the predictors code for group membership does not matter.
12.3.5
R2 = 0.369, F(3,13) = 2.53, NS.
12.4.3
R2 = 0.562, F(3,13) = 5.562, p < .05.
13. PLANNED COMPARISONS, POST HOC TESTS, AND ADJUSTED MEANS
13.1.1 Stepwise results for a planned comparison analysis using contrast code set II:
                        Total                 Change
Step  Variable added    R2     df    F        R2     df    F
1     Want children?    .523   1,14  15.33    .523   1,12  18.24
2     Have children?    .526   2,13  7.21     .003   1,12  <1
3     >1 child?         .656   3,12  7.63     .130   1,12  4.53
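The Change F column in a stepwise table tests each increment in R2 against the error term of the final model. A minimal sketch, assuming F change = (R2 change / df change) / ((1 - R2 final) / df error); the name f_change is hypothetical.

```python
def f_change(r2_change, df_change, r2_final, df_error):
    """F ratio for an increment in R2, tested against the final-model error term."""
    return (r2_change / df_change) / ((1.0 - r2_final) / df_error)


# Increments from the 13.1.1 table; the final model has R2 = .656 and df error = 12.
steps = [("Want children?", 0.523), ("Have children?", 0.003), (">1 child?", 0.130)]
for label, r2_change in steps:
    print(f"{label:15s} F(1,12) = {f_change(r2_change, 1, 0.656, 12):.2f}")
```

The second step comes out well below 1, which is why the table simply reports <1.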
13.2.1 There is a significant difference in the mean number of button pushes for subjects in the no desire for children group (M = 113) and subjects in the want/have children group (M = 77). However, there is no difference in mean button pushes for those who have no children but would like to (M = 79) and those who already have one or more children (M = 76). There is also no difference in mean number of button pushes for those who have only one child (M = 65) and those with more than one child (M = 87).
13.2.2
Stepwise results for a planned comparison analysis, comparing the seven infants with no older siblings with the nine infants with older siblings (contrast 1) and the four infants with one with the five infants with more than one older sibling (contrast 2):
                        Total                 Change
Step  Variable added    R2     df    F        R2     df    F
1     0 vs >0           .326   1,14  6.77     .326   1,13  8.14
2     1 vs >1           .479   2,13  5.98     .153   1,13  3.83
F(1,13)critical,.05 = 4.67, thus only the first contrast is significant. The mean number of words spoken by the seven infants with no older siblings (M = 34.7) differs significantly from the mean number of words spoken by the nine infants with one or more older siblings (M = 26.7). The mean number of words spoken by the four infants with only one older sibling (M = 30.8) does not differ significantly from the mean number of words spoken by the five infants with more than one older sibling (M = 23.4), and hence normally these last two means would not be reported.
13.3.1 See Fig. 13.6 in text.
13.3.2 Post hoc analysis: MSerror = 92.5, df = 12, n = 4, q(4,12) = 4.20, therefore TCD = 20.2.
Differences between pairs of group means:
                      C1        C0        C2        ND
Groups                M = 65    M = 79    M = 87    M = 113
C1 (1 child)          0         14        22        48
C0 (none, desire)               0         8         34
C2 (> 1 child)                            0         26
ND (no desire)                                      0
Means: ND 113a, C0 79b,c, C1 65c, C2 87b
13.3.3 Post hoc analysis: MSerror = 23.2, df = 12, n = 4, q(4,12) = 4.20, therefore TCD = 10.1.
Differences between pairs of group means:
                      C1        C0        C2        ND
Groups                M = 65    M = 79    M = 87    M = 113
C1 (1 child)          0         14        22        48
C0 (none, desire)               0         8         34
C2 (> 1 child)                            0         26
ND (no desire)                                      0
Means: ND 113a, C0 79b, C1 65c, C2 87b
13.4.1 Source table for want/has child status ANOVA: 14 subjects:
Source                    SS       df   MS       F
Between groups            3137.5   3    1045.8   4.96
Within groups             2107.3   10   210.7
TOTAL between subjects    5244.9   13
The critical value for F(3,10), alpha = .05, is 3.71 (see Table D.1) and for alpha = .01, 6.55. Thus the group effect would be significant if alpha were set to .05 but it would not be significant if alpha were .01.
13.4.2
Post hoc analysis: MSerror = 210.7, df = 10, n = 3.43 (harmonic mean), q(4,10) = 4.33, therefore TCD = 33.9.
Differences between pairs of group means:
                      C1        C0        C2        ND
Groups                M = 65.0  M = 79.0  M = 84.7  M = 107.3
C1 (1 child)          0         14.0      19.7      42.3
C0 (none, desire)               0         5.7       28.3
C2 (> 1 child)                            0         22.7
ND (no desire)                                      0
Means: ND 107.3a, C0 79.0ab, C1 65.0b, C2 84.7ab
13.4.3 Post hoc analysis: MSerror = 31.34, df = 13, n = 5.06 (harmonic mean), q(3,13) = 3.73, therefore TCD = 9.28.
Differences between pairs of group means:
            >1 sib    =1 sib    =0 sib
Groups      (23.4)    (30.8)    (34.7)
> 1 sib     0         7.35      11.31
= 1 sib               0         3.96
= 0 sib                         0
Means: = 0 sib 34.7a, = 1 sib 30.8ab, > 1 sib 23.4b
The post hoc analysis reveals that the mean number of words spoken by infants who have no older siblings (M = 34.7) is significantly different from the mean for infants with more than one older sibling (M = 23.4), but that the mean number of words spoken by infants with only one older sibling (M = 30.8) is not significantly different from either of the other two means. Note that a post hoc analysis compares all possible pairs of means for the groups involved. It does not allow some groups to be lumped together, as a planned comparison analysis does.
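The Tukey critical differences in 13.3.2 through 13.4.3 can be reproduced with a few lines of code. This sketch assumes the usual form TCD = q x sqrt(MSerror / n), with n replaced by the harmonic mean of the group sizes when they are unequal; the function names are illustrative, and the q values are simply the ones quoted in the answers.

```python
import math


def harmonic_mean(group_sizes):
    """Harmonic mean of the group sizes, used when the groups are unequal."""
    return len(group_sizes) / sum(1.0 / n for n in group_sizes)


def tcd(q, ms_error, n):
    """Tukey critical difference for a studentized range value q."""
    return q * math.sqrt(ms_error / n)


def differing_pairs(means, critical):
    """Pairs of groups whose means differ by more than the critical difference."""
    labels = sorted(means, key=means.get)
    return [(a, b) for i, a in enumerate(labels) for b in labels[i + 1:]
            if abs(means[a] - means[b]) > critical]


# 13.3.2: four groups of four subjects each.
critical = tcd(4.20, 92.5, 4)
pairs = differing_pairs({"ND": 113, "C0": 79, "C1": 65, "C2": 87}, critical)
print(round(critical, 1), pairs)

# 13.4.3: group sizes of 7, 4, and 5 infants.
n_h = harmonic_mean([7, 4, 5])
print(round(n_h, 2), round(tcd(3.73, 31.34, n_h), 2))
```

Pairs that are not returned by differing_pairs are the ones that share a subscript in the tables of means.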
13.8.2 For step 3, the Age x Sex interaction, R2total = 0.359, F(3,16) = 2.99, NS, and R2change = 0.019, F(1,16) = 0.471, NS.
14. STUDIES WITH MORE THAN ONE BETWEEN-SUBJECTS FACTOR
14.3.1 Formulas for degrees of freedom for a four-way factorial study (numbers at the right give values if a = 3, b = 2, c = 4, d = 2, and N = 240):
Source                    df formulas                 df
A main effect             a - 1                       2
B main effect             b - 1                       1
C main effect             c - 1                       3
D main effect             d - 1                       1
AB interaction            (a - 1)(b - 1)              2
AC interaction            (a - 1)(c - 1)              6
AD interaction            (a - 1)(d - 1)              2
BC interaction            (b - 1)(c - 1)              3
BD interaction            (b - 1)(d - 1)              1
CD interaction            (c - 1)(d - 1)              3
ABC interaction           (a - 1)(b - 1)(c - 1)       6
ABD interaction           (a - 1)(b - 1)(d - 1)       2
ACD interaction           (a - 1)(c - 1)(d - 1)       6
BCD interaction           (b - 1)(c - 1)(d - 1)       3
ABCD interaction          (a - 1)(b - 1)(c - 1)(d - 1)  6
S/ABCD, subjects/ABCD     N - abcd                    192
TOTAL between subjects    N - 1                       239
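The pattern in the table above (each effect's df is the product of "levels minus one" over the factors it involves) is easy to generate mechanically. The following is a small sketch of that rule; factorial_df is a hypothetical helper, not something defined in the text.

```python
from itertools import combinations
from math import prod


def factorial_df(levels, n_subjects):
    """Degrees of freedom for every effect in a between-subjects factorial.

    levels maps factor names to their number of levels, e.g. {"A": 3, "B": 2}.
    """
    names = sorted(levels)
    table = {}
    for size in range(1, len(names) + 1):
        for combo in combinations(names, size):
            # Main effect or interaction: product of (levels - 1).
            table["".join(combo)] = prod(levels[f] - 1 for f in combo)
    # Error term: subjects nested within cells.
    table["S/" + "".join(names)] = n_subjects - prod(levels.values())
    return table


four_way = factorial_df({"A": 3, "B": 2, "C": 4, "D": 2}, 240)
for source, df in four_way.items():
    print(f"{source:8s} {df}")
print("TOTAL   ", sum(four_way.values()))
```

The same call with {"A": 3, "B": 4} and N = 108 gives the 3 x 4 values listed in 14.3.3.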
14.3.3 Degrees of freedom for a 3 x 4 factorial study, N = 108:
Source                     df
A main effect              2
B main effect              3
AB interaction             6
S/AB, subjects within AB   96
TOTAL between subjects     107
14.4.3 The mean for males is 96, for females 76, which represents a deviation of 10 (b1) from the grand mean of 86 (a). The prediction equation includes a term for sex only (Y' = a + b1X1). Accordingly, the predicted scores are 96 for males (86 + 10) and 76 for females (86 - 10).
14.4.4 The mean for set I is 89, for set II 83, which represents a deviation of 3 (b2) from the grand mean. The prediction equation includes a term for sex and for instruction set (Y' = a + b1X1 + b2X2). Accordingly, the predicted scores are 99 for set I males (86 + 10 + 3), 93 for set II males (86 + 10 - 3), 79 for set I females (86 - 10 + 3), and 73 for set II females (86 - 10 - 3).
14.4.5 The four separate group means are 113, 79, 65, and 87; after the sex and instruction set terms are taken into account, each deviates by 14 (b3) from its predicted value. (In the case of a 2 x 2, the deviations for all four groups are always the same.) The prediction equation includes a term for sex, for instruction set, and for their interaction (Y' = a + b1X1 + b2X2 + b3X3). Accordingly, the predicted scores are 113 for set I males (86 + 10 + 3 + 14), 79 for set II males (86 + 10 - 3 - 14), 65 for set I females (86 - 10 + 3 - 14), and 87 for set II females (86 - 10 - 3 + 14).
14.4.6 Stepwise results for a 2 x 2 factorial analyzing sex and instruction set:
                        Total                 Change
Step  Variable added    R2     df    F        R2     df    F
1     Sex               .215   1,14  3.84     .215   1,12  7.51
2     Inst              .234   2,13  1.99     .019   1,12  0.68
3     Sex x Inst        .656   3,12  7.63     .422   1,12  14.71
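The arithmetic in 14.4.3 through 14.4.5 amounts to evaluating the prediction equation with contrast codes of +1 and -1. The sketch below assumes males and set I are coded +1, females and set II are coded -1, and the interaction code is the product of the two; that coding is inferred from the sums shown in the answers.

```python
# Regression constant and coefficients quoted in 14.4.3-14.4.5.
A, B1, B2, B3 = 86, 10, 3, 14


def predict(sex_code, set_code):
    """Y' = a + b1*X1 + b2*X2 + b3*X3, where X3 is the product of X1 and X2."""
    return A + B1 * sex_code + B2 * set_code + B3 * sex_code * set_code


cells = {
    "set I males": (1, 1),
    "set II males": (1, -1),
    "set I females": (-1, 1),
    "set II females": (-1, -1),
}
for label, (sex_code, set_code) in cells.items():
    print(f"{label:15s} {predict(sex_code, set_code)}")
```

With all three terms in the equation, the predicted scores are exactly the four cell means.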
14.5.1 Instruction affects males and females differently. Males push the button more when exposed to set I, females more when exposed to set II. Moreover, instruction set I, but not II, differentiates males and females. When exposed to set I, males push the button more than females, but when exposed to set II, the mean number of button pushes is not significantly different for males and females. Nor is the difference between set I females and set II males significant.
          Instruction
Sex       Set I    Set II
Male      113a     79b,c
Female    65b      87c
14.5.2 The interpretation is essentially the same as for 14.5.1. Now the difference between set I females and set II males is significant, but the key differences mentioned in the preceding answer remain the same.
          Instruction
Sex       Set I    Set II
Male      113a     79b
Female    65c      87b
14.8.2 Your spreadsheet for this exercise, including the source table, should look as follows (columns beyond K not shown):
[Spreadsheet columns: s, Y (Smiles), A (Gender), B1 and B2 (Partner), AB1 and AB2 (GxP), Y', Y-My, Y'-My, e = Y-Y'; 20 data rows.]
Summary rows: N = 20, Sum = 105, M = 5.25, VAR = 8.788 (Y), 4.746 (Y'), 4.042 (e), SD = 2.964, R = 0.735, RSQ = 0.540
a,b = 5.417, 1.111, -0.813, -1.771, 0.215, -0.104
Source table:
                            R2                 
Step  Source                total   change   SS       df   MS      F       pη2
1     A - Gender            0.150   0.150    26.45    1    26.45   4.581   0.247
2     B - Partner           0.528   0.378    66.42    2    33.21   5.752   0.451
3     AB - GxP              0.540   0.012    2.04     2    1.02    0.177   0.025
4     S/AB - error          1.000   0.460    80.83    14   5.77
5     Total between Ss              1.000    175.75
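The MS, F, and pη2 columns of the source table follow directly from the SS and df columns. A minimal sketch, assuming MS = SS / df, F = MS / MSerror, and partial eta squared = SS / (SS + SSerror); the function name source_table is illustrative only.

```python
def source_table(effects, ss_error, df_error):
    """Compute MS, F, and partial eta squared for each effect.

    effects maps an effect name to its (SS, df) pair.
    """
    ms_error = ss_error / df_error
    rows = {}
    for name, (ss, df) in effects.items():
        ms = ss / df
        rows[name] = {
            "MS": ms,
            "F": ms / ms_error,
            "partial_eta2": ss / (ss + ss_error),
        }
    return rows


# SS and df from the 14.8.2 source table.
rows = source_table(
    {"Gender": (26.45, 1), "Partner": (66.42, 2), "GxP": (2.04, 2)},
    ss_error=80.83,
    df_error=14,
)
for name, row in rows.items():
    print(f"{name:8s} MS = {row['MS']:.2f}  F = {row['F']:.3f}  "
          f"partial eta2 = {row['partial_eta2']:.3f}")
```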
14.8.3 Post hoc analysis: MSerror = 5.77, df = 14, n = 6.63 (harmonic mean), q(3,14) = 3.70, therefore TCD = 3.45:
Differences between pairs of group means:
            Mother    Father    Stranger
Groups      (3.57)    (4.57)    (8.00)
Mother      0         1.00      4.43
Father                0         3.43
Stranger                        0
Means: Mother (N = 7) 3.57a, Father (N = 7) 4.57ab, Stranger (N = 6) 8.00b
According to this analysis, the sex by partner effect is not significant, therefore we are free to examine the main effects. The sex main effect is also not significant, although we might note that the difference between the mean number of smiles for males (6.4) and females (4.1) approached, but did not reach, the conventional .05 level of significance, F(1,14)critical,.05 = 4.60. The partner effect, however, was significant, F(2,14) = 5.75, p < .05. Infants smiled more to strangers (M = 8.00) than to mothers (M = 3.57) and this difference was significant (p < .05) according to the Tukey post hoc test. The difference between the number of smiles to mothers and to fathers, and the difference between the number of smiles to fathers and strangers, however, were not significant.
15. SINGLE-FACTOR WITHIN SUBJECTS STUDIES
15.1.2 R2 = 0.638, SS = 25.6.
15.1.3 R2 = 0.940, SS = 37.7.
15.1.4 R2change = 0.302, SSchange = 12.1, dferror = 4, F(1,4) = 20.17, p < .05.
15.1.5
After step 1, predicted scores represent the mean score for the subject. After step 2, they take into account the mean difference between time 1 and time 2 scores as well. Regression coefficients for dummy-coded variables reflect deviations from the comparison group, the group coded all zeros. Thus the first subject's mean score (3.5) minus the last subject's mean score (7.5) is - 4.0, and so forth, and the mean score for the drug group (4.2) minus the mean score for the placebo group (6.4) is - 2.2.
15.2.9 F(1,4)computed = 20.17; F(1,4)critical,.05 = 7.71. You reject the null hypothesis because 20.17 > 7.71. For this sample size and these repeated measures data, the effect of drug treatment on number of lies detected is statistically significant.
16. TWO-FACTOR STUDIES INVOLVING REPEATED MEASURES
16.1.19 In this case, F(1,6)critical,.05 = 5.99 for all three effects tested. The sex main effect and the sex by instruction interaction are significant. The main effect is qualified by an interaction, so first that interaction should be understood and interpreted.
16.2.22 In this case, F(1,3)critical,.05 = 10.13 for all three effects tested. The sex main effect and the sex by instruction interaction are significant. The main effect is qualified by an interaction, so first that interaction should be understood and interpreted.
16.5.1 The TCD for the one-between, one-within is 37.8 and for the no-between, two-within is 30.4. Thus for these data it happens that the post hoc results for the two designs are the same:
            #3-F1     #2-M2     #4-F2     #1-M1
Groups      (M = 65)  (M = 79)  (M = 87)  (M = 113)
#3-F1       0         14.0      19.7      42.3
#2-M2                 0         5.7       28.3
#4-F2                           0         22.7
#1-M1                                     0
For both the one-between, one-within and the no-between, two-within analyses, means are based on four scores each and are displayed in the following table. Means that do not differ significantly according to the Tukey test, alpha = .05, share a common subscript.
          Instruction
Sex       Set I    Set II
Male      113a     79ab
Female    65b      87ab
Neither males nor females were affected by the instructions they received; the difference between the mean number of button pushes for males who received set I versus II was not significant (113 versus 79), nor was the corresponding difference significant for females (65 versus 87). Instruction set I, however, but not set II, distinguished between males and females; the difference between the mean number of button pushes for males versus females exposed to set I was significant (113 versus 65), but the corresponding difference for males versus females exposed to set II was not significant (79 versus 87).
16.6.1 Degrees of freedom for a three-factor, two-between, one-within design (numbers at the right give values if a = 2, b = 4, p = 3, and N = 96):
Source                   df formulas               df
A, main effect           a - 1                     1
B, main effect           b - 1                     3
AB interaction           (a - 1)(b - 1)            3
S/AB, subjects w/i AB    N - ab                    88
TOTAL b/w subjects       N - 1                     95
P main effect            p - 1                     2
PA interaction           (p - 1)(a - 1)            2
PB interaction           (p - 1)(b - 1)            6
PAB interaction          (p - 1)(a - 1)(b - 1)     6
PxS/AB interaction       (p - 1)(N - ab)           176
TOTAL w/i subjects       Np - N                    192
TOTAL (b/w + w/i)        Np - 1                    287
16.6.2 Degrees of freedom for a three-factor, one-between, two-within design (numbers at the right give values if a = 3, p = 2, q = 4, and N = 45):
Source                   df formulas               df
A, main effect           a - 1                     2
S/A, subjects w/i A      N - a                     42
TOTAL b/w subjects       N - 1                     44
P main effect            p - 1                     1
PA interaction           (p - 1)(a - 1)            2
PxS/A interaction        (p - 1)(N - a)            42
Q main effect            q - 1                     3
QA interaction           (q - 1)(a - 1)            6
QxS/A interaction        (q - 1)(N - a)            126
PQ interaction           (p - 1)(q - 1)            3
PQA interaction          (p - 1)(q - 1)(a - 1)     6
PQxS/A interaction       (p - 1)(q - 1)(N - a)     126
TOTAL w/i subjects       Npq - N                   315
TOTAL (b/w + w/i)        Npq - 1                   359
16.7.1 Predictor variables for this one-between (sex), one-within (week) 2 x 3 factorial are as follows:
     #Hr  Sex  S/A                                   Week     Wk x Sx
s    Y    A    S1  S2  S3  S4  S5  S6  S7  S8  S9    P1  P2   AP1  AP2
1    5    -1   1   0   0   0   0   0   0   0   0     -1   1    1   -1
2    6     1   0   0   0   0   0   0   0   1   0     -1   1   -1    1
3    4     1   0   0   0   0   0   0   0   0   1     -1   1   -1    1
4    5     1   0   0   0   0   0   0   0   0   0     -1   1   -1    1
5    5    -1   0   1   0   0   0   0   0   0   0     -1   1    1   -1
6    6    -1   0   0   1   0   0   0   0   0   0     -1   1    1   -1
7    3    -1   0   0   0   1   0   0   0   0   0     -1   1    1   -1
8    5     1   0   0   0   0   1   0   0   0   0     -1   1   -1    1
9    7     1   0   0   0   0   0   1   0   0   0     -1   1   -1    1
10   4     1   0   0   0   0   0   0   1   0   0     -1   1   -1    1
11   5    -1   0   0   0   0   0   0   0   0   0     -1   1    1   -1
1    3    -1   1   0   0   0   0   0   0   0   0      0  -2    0    2
2    5     1   0   0   0   0   0   0   0   1   0      0  -2    0   -2
3    6     1   0   0   0   0   0   0   0   0   1      0  -2    0   -2
4    5     1   0   0   0   0   0   0   0   0   0      0  -2    0   -2
5    7    -1   0   1   0   0   0   0   0   0   0      0  -2    0    2
6    6    -1   0   0   1   0   0   0   0   0   0      0  -2    0    2
7    6    -1   0   0   0   1   0   0   0   0   0      0  -2    0    2
8    6     1   0   0   0   0   1   0   0   0   0      0  -2    0   -2
9    5     1   0   0   0   0   0   1   0   0   0      0  -2    0   -2
10   7     1   0   0   0   0   0   0   1   0   0      0  -2    0   -2
11   5    -1   0   0   0   0   0   0   0   0   0      0  -2    0    2
1    5    -1   1   0   0   0   0   0   0   0   0      1   1   -1   -1
2    6     1   0   0   0   0   0   0   0   1   0      1   1    1    1
3    7     1   0   0   0   0   0   0   0   0   1      1   1    1    1
4    8     1   0   0   0   0   0   0   0   0   0      1   1    1    1
5    6    -1   0   1   0   0   0   0   0   0   0      1   1   -1   -1
6    7    -1   0   0   1   0   0   0   0   0   0      1   1   -1   -1
7    6    -1   0   0   0   1   0   0   0   0   0      1   1   -1   -1
8    7     1   0   0   0   0   1   0   0   0   0      1   1    1    1
9    8     1   0   0   0   0   0   1   0   0   0      1   1    1    1
10   6     1   0   0   0   0   0   0   1   0   0      1   1    1    1
11   5    -1   0   0   0   0   0   0   0   0   0      1   1   -1   -1
16.7.2 The spreadsheet for the analysis of variance source table is as follows:
1-between, 1-within        Total            Change
Step  Source               RSQ     SS       RSQ     SS       df   MS     F
1     A (main effect)      0.065   3.06     0.065   3.06     1    3.06   2.676
2     S/A (error term)     0.282   13.33    0.217   10.28    9    1.14
      TOTAL between Ss                      0.282   13.33    10
3     P (main effect)      0.533   25.21    0.251   11.88    2    5.94   5.167
4     AP (interaction)     0.563   26.64    0.030   1.43     2    0.72   0.623
5     S/AxP (error term)   1.000   47.33    0.437   20.69    18   1.15
      TOTAL within Ss                       0.718   34.00    22
      TOT (between + within)                1.000   47.33    32
Only the effect for weeks was significant (F(2,18) = 5.2, p < .05). The post hoc analysis is as follows: MSerror = 1.149, df = 18, n = 11, q(3,18) = 3.61, therefore TCD = 1.167:
Differences between pairs of group means:
            Week 1    Week 2    Week 3
Groups      (5.00)    (5.55)    (6.45)
Week 1      0         0.55      1.45
Week 2                0         0.91
Week 3                          0
Means: Week 1 (N = 11) 5.00a, Week 2 (N = 11) 5.55ab, Week 3 (N = 11) 6.45b
For these data, the number of hours males and females studied did not differ significantly, but the mean number of hours studied the first week (M = 5.00 hours) was significantly less than the mean number of hours studied the third week (M = 6.45 hours). The mean number of hours studied the second week (M = 5.55) was between the means for the first and third weeks and did not differ significantly from either of them.
16.7.3 The spreadsheet for the analysis of variance source table for a trend analysis is as follows (the two planned comparisons are a linear and a quadratic trend):
1-between, 1-within        Total            Change
Step  Source               RSQ     SS       RSQ     SS       df   MS      F
1     A (main effect)      0.065   3.06     0.065   3.06     1    3.06    2.676
2     S/A (error term)     0.282   13.33    0.217   10.28    9    1.14
      TOTAL between Ss                      0.282   13.33    10
3     Linear               0.528   24.97    0.246   11.64    1    11.64   10.124
4     Quadratic            0.533   25.21    0.005   0.24     1    0.24    0.211
5     AP (interaction)     0.563   26.64    0.030   1.43     2    0.72    0.623
      S/AxP (error term)   1.000   47.33    0.437   20.69    18   1.15
      TOTAL within Ss                       0.718   34.00    22
      TOT (between + within)                1.000   47.33    32
By committing to a trend analysis, the two degrees of freedom within weeks are partitioned into a linear and a quadratic component, each of which is tested for significance. An omnibus test for weeks is not performed, nor are post hoc tests. In this case, only the linear trend was significant (F(1,18) = 10.1, p < .01), indicating that the monotonic increase from 5.00, to 5.55, to 6.45 is significant. In other words, we can account for an additional 24.6% of the variance (a significant increase) in number of hours studied, above and beyond that accounted for by knowing the particular student's mean for all three scores (R2 = .282), if we fit a straight line to each subject's week 1, week 2, and week 3 scores.
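The linear and quadratic sums of squares can be checked from the three weekly means alone. This sketch assumes the standard contrast formula SS = n x (sum of c x M)^2 / (sum of c^2) with linear weights (-1, 0, 1) and quadratic weights (1, -2, 1); the scores are the Y values listed in 16.7.1, and the function name is illustrative.

```python
def contrast_ss(means, weights, n):
    """Sum of squares for a contrast among group means, n scores per mean."""
    psi = sum(w * m for w, m in zip(weights, means))
    return n * psi ** 2 / sum(w * w for w in weights)


# Hours studied by the 11 students in weeks 1, 2, and 3 (from 16.7.1).
week1 = [5, 6, 4, 5, 5, 6, 3, 5, 7, 4, 5]
week2 = [3, 5, 6, 5, 7, 6, 6, 6, 5, 7, 5]
week3 = [5, 6, 7, 8, 6, 7, 6, 7, 8, 6, 5]
means = [sum(week) / len(week) for week in (week1, week2, week3)]

ss_linear = contrast_ss(means, (-1, 0, 1), n=11)
ss_quadratic = contrast_ss(means, (1, -2, 1), n=11)
ms_error = 20.69 / 18  # S/AxP error term from the source table

print(f"linear:    SS = {ss_linear:.2f}, F(1,18) = {ss_linear / ms_error:.2f}")
print(f"quadratic: SS = {ss_quadratic:.2f}, F(1,18) = {ss_quadratic / ms_error:.2f}")
```

The two contrast sums of squares add up to the SS for the week main effect in 16.7.2, which is what partitioning the two degrees of freedom means.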
17. POWER, PITFALLS, AND PRACTICAL MATTERS
17.1.3 See text.
17.2.1 Spreadsheet that computes posttest scores adjusted for the effect of the pretest scores (labeled Yadj).
     A       B       C       D     E      F       G         H
1            Post    Pre     Tr                   d =       Yadj
2    s       Y       X       A     XxA    Y"      Y"-My     Y-d
3    1       102     79      -1    -79    89.66   0.669     101.3
4    2       125     93      -1    -93    87.32   -1.67     126.6
5    3       95      75      -1    -75    90.33   1.339     93.66
6    4       130     69      -1    -69    91.34   2.343     127.6
7    5       43      101     1     101    85.98   -3.01     46.01
8    6       82      94      1     94     87.15   -1.84     83.84
9    7       69      84      1     84     88.83   -0.16     69.16
10   8       66      69      1     69     91.34   2.343     63.65
11   Sum=    712                          712     0         712
12   N=      8                            8       8         8
13   Mean=   89
14   a,b=    102.8   -0.16
15   R=      0.859
-3.3
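The adjustment carried out in the spreadsheet can be sketched in code. The version below assumes that the slope used for Y" is the pooled within-group regression of posttest on pretest, which reproduces the Y" and Yadj columns to rounding; that reading, and the helper names, are assumptions rather than something the answer states.

```python
post = [102, 125, 95, 130, 43, 82, 69, 66]   # Y, posttest scores
pre = [79, 93, 75, 69, 101, 94, 84, 69]      # X, pretest scores
group = [-1, -1, -1, -1, 1, 1, 1, 1]         # A, training code


def mean(values):
    return sum(values) / len(values)


def pooled_within_slope(x, y, g):
    """Slope of y on x computed from deviations around each group's own means."""
    sxy = sxx = 0.0
    for level in set(g):
        xs = [xi for xi, gi in zip(x, g) if gi == level]
        ys = [yi for yi, gi in zip(y, g) if gi == level]
        mx, my = mean(xs), mean(ys)
        sxy += sum((xi - mx) * (yi - my) for xi, yi in zip(xs, ys))
        sxx += sum((xi - mx) ** 2 for xi in xs)
    return sxy / sxx


b = pooled_within_slope(pre, post, group)
grand_pre = mean(pre)
# d is the part of each score predicted by the pretest; Yadj removes it.
adjusted = [y - b * (x - grand_pre) for x, y in zip(pre, post)]

adj_first = mean(adjusted[:4])    # group coded -1
adj_second = mean(adjusted[4:])   # group coded +1
print(round(b, 3), round(adj_first, 1), round(adj_second, 1))
```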
17.2.2 The adjusted means for the trained and untrained groups are 112.3 and 65.7 respectively (unadjusted means = 113 and 65). For these data the adjustment is not great because the correlation between pretest and posttest scores is not strong (R2 = 0.132, F(1,5) = 2.5, NS).
17.3.1 L = 12.65, f2 = .36/.64 = .563, n* = 12.65/.563 + 3 = 26.
17.3.2 L = 10.51, f2 = .22/.70 = .314, n* = 10.51/.314 + 2 = 36.
17.3.3 L = 14.17, f2 = .17/.76 = .224, n* = 14.17/.224 + 4 = 68.
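The three sample size answers follow one pattern: divide L by f2, add a constant, and round up to the next whole subject. A minimal sketch using only the numbers in the answers; the argument names are hypothetical, and rounding up is an assumption that happens to reproduce 26, 36, and 68.

```python
import math


def n_star(L, numerator, denominator, constant):
    """Required number of subjects: L / f2 + constant, rounded up."""
    f2 = numerator / denominator
    return math.ceil(L / f2 + constant)


# (L, f2 numerator, f2 denominator, added constant) for 17.3.1-17.3.3.
problems = [(12.65, 0.36, 0.64, 3), (10.51, 0.22, 0.70, 2), (14.17, 0.17, 0.76, 4)]
for L, numerator, denominator, constant in problems:
    print(n_star(L, numerator, denominator, constant))
```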
Appendix C: Statistical Tables
Table A Critical Values for the Binomial Distribution, P = 0.5
The first number in each row, N, indicates the number of trials that are categorized either plus or minus. The remaining numbers represent unlikely values for the number of pluses. The column in which the value falls indicates the probability of that or a smaller number of pluses occurring (lower tail), or the probability of that or a larger number of pluses occurring (upper tail), assuming the null hypothesis value of P = 0.5. Probability values for the columns represent selected standard values.
1. alpha = .05, two-tailed: Reject the null hypothesis if the number of pluses is equal to or less than the number in the .025 column for the lower tail, or if the number of pluses is equal to or greater than the number in the .025 column for the upper tail.
2. alpha = .01, two-tailed: Reject the null hypothesis if the number of pluses is equal to or less than the number in the .005 column for the lower tail, or if the number of pluses is equal to or greater than the number in the .005 column for the upper tail.
3. alpha = .05, one-tailed: If the alternative hypothesis predicts a lower-tail value, reject the null hypothesis if the number of pluses is equal to or less than the number in the .050 column for the lower tail. If the alternative hypothesis predicts an upper-tail value, reject the null hypothesis if the number of pluses is equal to or greater than the number in the .050 column for the upper tail.
4. alpha = .01, one-tailed: If the alternative hypothesis predicts a lower-tail value, reject the null hypothesis if the number of pluses is equal to or less than the number in the .010 column for the lower tail. If the alternative hypothesis predicts an upper-tail value, reject the null hypothesis if the number of pluses is equal to or greater than the number in the .010 column for the upper tail.
Table A Critical Values for the Binomial Distribution, P = 0.5 (continued)
      Lower Tail                  Upper Tail
N     .005  .010  .025  .050      .050  .025  .010  .005
5     -     -     -     0         5     -     -     -
6     -     -     0     0         6     6     -     -
7     -     0     0     0         7     7     7     -
8     0     0     0     1         7     8     8     8
9     0     0     1     1         8     8     9     9
10    0     0     1     1         9     9     10    10
11    0     1     1     2         9     10    10    11
12    1     1     2     2         10    10    11    11
13    1     1     2     3         10    11    12    12
14    1     2     2     3         11    12    12    13
15    2     2     3     3         12    12    13    13
16    2     2     3     4         12    13    14    14
17    2     3     4     4         13    13    14    15
18    3     3     4     5         13    14    15    15
19    3     4     4     5         14    15    15    16
20    3     4     5     5         15    15    16    17
21    4     4     5     6         15    16    17    17
22    4     5     5     6         16    17    17    18
23    4     5     6     7         16    17    18    19
24    5     6     6     7         17    18    18    19
25    5     6     7     7         18    18    19    20
26    6     6     7     8         18    19    20    20
27    6     7     7     8         19    20    20    21
28    6     7     8     9         19    20    21    22
29    7     7     8     9         20    21    22    22
30    7     8     9     10        20    21    22    23
31    7     8     9     10        21    22    23    24
32    8     8     10    10        22    22    24    24
33    8     9     10    11        22    23    24    25
34    9     9     10    11        23    24    25    25
35    9     10    11    12        23    24    25    26
36    9     10    11    12        24    25    26    27
37    10    11    12    13        24    25    26    27
38    10    11    12    13        25    26    27    28
39    11    11    12    13        26    27    28    28
40    11    12    13    14        26    27    28    29
41    11    12    13    14        27    28    29    30
42    12    13    14    15        27    28    29    30
43    12    13    14    15        28    29    30    31
44    13    13    15    16        28    29    31    31
45    13    14    15    16        29    30    31    32
46    13    14    15    16        30    31    32    33
47    14    15    16    17        30    31    32    33
48    14    15    16    17        31    32    33    34
49    15    15    17    18        31    32    34    34
50    15    16    17    18        32    33    34    35
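Entries in Table A can be checked by summing binomial probabilities directly. The sketch below assumes each lower-tail entry is the largest number of pluses whose cumulative probability does not exceed the column value, and that the upper-tail entry is N minus the lower-tail entry (which holds because the distribution is symmetric when P = 0.5). The function names are illustrative.

```python
from fractions import Fraction
from math import comb


def lower_critical(n, alpha):
    """Largest k with P(X <= k) <= alpha for X ~ Binomial(n, 0.5), else None."""
    # Work with exact counts: P(X <= k) = (sum of C(n, j) for j <= k) / 2**n.
    limit = Fraction(str(alpha)) * 2 ** n
    total, best = 0, None
    for k in range(n + 1):
        total += comb(n, k)
        if total > limit:
            break
        best = k
    return best


def upper_critical(n, alpha):
    """Smallest k with P(X >= k) <= alpha, using the symmetry of P = 0.5."""
    k = lower_critical(n, alpha)
    return None if k is None else n - k


for n in (8, 20, 50):
    print(n, lower_critical(n, 0.025), upper_critical(n, 0.025))
```

A result of None corresponds to a cell with no entry in the table, where no outcome is unlikely enough to reach that probability level.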
Table B Areas Under the Normal Curve
The first number in each row gives a Z score accurate to the first decimal digit. Subsequent columns provide the second decimal digit. The numbers in the table give the proportion of area under the normal curve to the left of the indicated Z score. For example, the proportion of area to the left of a Z score of -1.96 is .0250 and the proportion of area to the left of a Z score of +2.58 is .9951.
1. What is the probability of obtaining a Z score less than a particular value (under the curve to the left)? For both negative and positive Z scores, this probability can be read directly from the table. For example, the probability of a Z score being less than +1.00 is .8413.
2. What is the probability of obtaining a Z score greater than a particular value (under the curve to the right)? For both negative and positive Z scores, subtract the tabled probability from 1. For example, the probability of a Z score being greater than +1.64 is 1 minus .9495, which is .0505, and the probability of a Z score being greater than +1.65 is 1 minus .9505, which is .0495.
3. What is the probability of obtaining a Z score less than the absolute value of a particular value (under the curve in the center)? Double the tabled value for the negative Z score and subtract it from 1. For example, the probability of a Z score being less than 1.00 absolute is twice .1587 or .3174 subtracted from 1, which is .6826.
4. What is the probability of obtaining a Z score greater than the absolute value of a particular value (under the curve in the tails)? It is twice the tabled probability for the negative Z score. For example, the probability of a Z score being greater than 2.57 absolute is twice .0051, which is .0102, and the probability of a Z score being greater than 2.58 absolute is twice .0049, which is .0098.
Table B Areas Under the Normal Curve (continued)
z      .00    .01    .02    .03    .04    .05    .06    .07    .08    .09
0.0    .5000  .5040  .5080  .5120  .5160  .5199  .5239  .5279  .5319  .5359
0.1    .5398  .5438  .5478  .5517  .5557  .5596  .5636  .5675  .5714  .5753
0.2    .5793  .5832  .5871  .5910  .5948  .5987  .6026  .6064  .6103  .6141
0.3    .6179  .6217  .6255  .6293  .6331  .6368  .6406  .6443  .6480  .6517
0.4    .6554  .6591  .6628  .6664  .6700  .6736  .6772  .6808  .6844  .6879
0.5    .6915  .6950  .6985  .7019  .7054  .7088  .7123  .7157  .7190  .7224
0.6    .7257  .7291  .7324  .7357  .7389  .7422  .7454  .7486  .7517  .7549
0.7    .7580  .7611  .7642  .7673  .7704  .7734  .7764  .7794  .7823  .7852
0.8    .7881  .7910  .7939  .7967  .7995  .8023  .8051  .8078  .8106  .8133
0.9    .8159  .8186  .8212  .8238  .8264  .8289  .8315  .8340  .8365  .8389
1.0    .8413  .8438  .8461  .8485  .8508  .8531  .8554  .8577  .8599  .8621
1.1    .8643  .8665  .8686  .8708  .8729  .8749  .8770  .8790  .8810  .8830
1.2    .8849  .8869  .8888  .8907  .8925  .8944  .8962  .8980  .8997  .9015
1.3    .9032  .9049  .9066  .9082  .9099  .9115  .9131  .9147  .9162  .9177
1.4    .9192  .9207  .9222  .9236  .9251  .9265  .9279  .9292  .9306  .9319
1.5    .9332  .9345  .9357  .9370  .9382  .9394  .9406  .9418  .9429  .9441
1.6    .9452  .9463  .9474  .9484  .9495  .9505  .9515  .9525  .9535  .9545
1.7    .9554  .9564  .9573  .9582  .9591  .9599  .9608  .9616  .9625  .9633
1.8    .9641  .9649  .9656  .9664  .9671  .9678  .9686  .9693  .9699  .9706
1.9    .9713  .9719  .9726  .9732  .9738  .9744  .9750  .9756  .9761  .9767
2.0    .9772  .9778  .9783  .9788  .9793  .9798  .9803  .9808  .9812  .9817
2.1    .9821  .9826  .9830  .9834  .9838  .9842  .9846  .9850  .9854  .9857
2.2    .9861  .9864  .9868  .9871  .9875  .9878  .9881  .9884  .9887  .9890
2.3    .9893  .9896  .9898  .9901  .9904  .9906  .9909  .9911  .9913  .9916
2.4    .9918  .9920  .9922  .9925  .9927  .9929  .9931  .9932  .9934  .9936
2.5    .9938  .9940  .9941  .9943  .9945  .9946  .9948  .9949  .9951  .9952
2.6    .9953  .9955  .9956  .9957  .9959  .9960  .9961  .9962  .9963  .9964
2.7    .9965  .9966  .9967  .9968  .9969  .9970  .9971  .9972  .9973  .9974
2.8    .9974  .9975  .9976  .9977  .9977  .9978  .9979  .9979  .9980  .9981
2.9    .9981  .9982  .9982  .9983  .9984  .9984  .9985  .9985  .9986  .9986
3.0    .9987  .9987  .9987  .9988  .9988  .9989  .9989  .9989  .9990  .9990
Note. Abridged from Table 1, Biometrika Tables for Statisticians (Vol. 1, 3rd ed.) edited by E. S. Pearson and H. O. Hartley, 1970, New York: Cambridge University Press. Adapted by permission of the Biometrika Trustees.
Table B Areas Under the Normal Curve (continued)
z      .00    .01    .02    .03    .04    .05    .06    .07    .08    .09
-0.0   .5000  .4960  .4920  .4880  .4840  .4801  .4761  .4721  .4681  .4641
-0.1   .4602  .4562  .4522  .4483  .4443  .4404  .4364  .4325  .4286  .4247
-0.2   .4207  .4168  .4129  .4090  .4052  .4013  .3974  .3936  .3897  .3859
-0.3   .3821  .3783  .3745  .3707  .3669  .3632  .3594  .3557  .3520  .3483
-0.4   .3446  .3409  .3372  .3336  .3300  .3264  .3228  .3192  .3156  .3121
-0.5   .3085  .3050  .3015  .2981  .2946  .2912  .2877  .2843  .2810  .2776
-0.6   .2743  .2709  .2676  .2643  .2611  .2578  .2546  .2514  .2483  .2451
-0.7   .2420  .2389  .2358  .2327  .2296  .2266  .2236  .2206  .2177  .2148
-0.8   .2119  .2090  .2061  .2033  .2005  .1977  .1949  .1922  .1894  .1867
-0.9   .1841  .1814  .1788  .1762  .1736  .1711  .1685  .1660  .1635  .1611
-1.0   .1587  .1562  .1539  .1515  .1492  .1469  .1446  .1423  .1401  .1379
-1.1   .1357  .1335  .1314  .1292  .1271  .1251  .1230  .1210  .1190  .1170
-1.2   .1151  .1131  .1112  .1093  .1075  .1056  .1038  .1020  .1003  .0985
-1.3   .0968  .0951  .0934  .0918  .0901  .0885  .0869  .0853  .0838  .0823
-1.4   .0808  .0793  .0778  .0764  .0749  .0735  .0721  .0708  .0694  .0681
-1.5   .0668  .0655  .0643  .0630  .0618  .0606  .0594  .0582  .0571  .0559
-1.6   .0548  .0537  .0526  .0516  .0505  .0495  .0485  .0475  .0465  .0455
-1.7   .0446  .0436  .0427  .0418  .0409  .0401  .0392  .0384  .0375  .0367
-1.8   .0359  .0351  .0344  .0336  .0329  .0322  .0314  .0307  .0301  .0294
-1.9   .0287  .0281  .0274  .0268  .0262  .0256  .0250  .0244  .0239  .0233
-2.0   .0228  .0222  .0217  .0212  .0207  .0202  .0197  .0192  .0188  .0183
-2.1   .0179  .0174  .0170  .0166  .0162  .0158  .0154  .0150  .0146  .0143
-2.2   .0139  .0136  .0132  .0129  .0125  .0122  .0119  .0116  .0113  .0110
-2.3   .0107  .0104  .0102  .0099  .0096  .0094  .0091  .0089  .0087  .0084
-2.4   .0082  .0080  .0078  .0075  .0073  .0071  .0069  .0068  .0066  .0064
-2.5   .0062  .0060  .0059  .0057  .0055  .0054  .0052  .0051  .0049  .0048
-2.6   .0047  .0045  .0044  .0043  .0041  .0040  .0039  .0038  .0037  .0036
-2.7   .0035  .0034  .0033  .0032  .0031  .0030  .0029  .0028  .0027  .0026
-2.8   .0026  .0025  .0024  .0023  .0023  .0022  .0021  .0021  .0020  .0019
-2.9   .0019  .0018  .0018  .0017  .0016  .0016  .0015  .0015  .0014  .0014
-3.0   .0013  .0013  .0013  .0012  .0012  .0011  .0011  .0011  .0010  .0010
Note. Abridged from Table 1, Biometrika Tables for Statisticians (Vol. 1, 3rd ed.) edited by E. S. Pearson and H. O. Hartley, 1970, New York: Cambridge University Press. Adapted by permission of the Biometrika Trustees.
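The four look-up rules given for Table B can also be evaluated without the table. This sketch computes the area to the left of Z from the error function in Python's standard library; the helper names are illustrative. Results can differ from the worked examples in the fourth decimal place because the table entries are themselves rounded.

```python
import math


def area_left(z):
    """Proportion of the normal curve to the left of z (what Table B tabulates)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def area_right(z):
    """Rule 2: subtract the tabled probability from 1."""
    return 1.0 - area_left(z)


def area_center(z):
    """Rule 3: 1 minus twice the tabled value for the negative Z score."""
    return 1.0 - 2.0 * area_left(-abs(z))


def area_tails(z):
    """Rule 4: twice the tabled value for the negative Z score."""
    return 2.0 * area_left(-abs(z))


print(f"{area_left(1.00):.4f}")
print(f"{area_right(1.64):.4f}")
print(f"{area_center(1.00):.4f}")
print(f"{area_tails(2.57):.4f}")
```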
Table C Critical Values for the t Distribution
       Two-tailed Nondirectional Test     One-tailed Directional Test
df     0.05     0.01     0.001            0.05     0.01     0.001
1      12.71    63.66    636.58           6.314    31.82    318.29
2      4.303    9.925    31.600           2.920    6.965    22.33
3      3.182    5.841    12.924           2.353    4.541    10.214
4      2.776    4.604    8.610            2.132    3.747    7.173
5      2.571    4.032    6.869            2.015    3.365    5.894
6      2.447    3.707    5.959            1.943    3.143    5.208
7      2.365    3.499    5.408            1.895    2.998    4.785
8      2.306    3.355    5.041            1.860    2.896    4.501
9      2.262    3.250    4.781            1.833    2.821    4.297
10     2.228    3.169    4.587            1.812    2.764    4.144
11     2.201    3.106    4.437            1.796    2.718    4.025
12     2.179    3.055    4.318            1.782    2.681    3.930
13     2.160    3.012    4.221            1.771    2.650    3.852
14     2.145    2.977    4.140            1.761    2.624    3.787
15     2.131    2.947    4.073            1.753    2.602    3.733
16     2.120    2.921    4.015            1.746    2.583    3.686
17     2.110    2.898    3.965            1.740    2.567    3.646
18     2.101    2.878    3.922            1.734    2.552    3.610
19     2.093    2.861    3.883            1.729    2.539    3.579
20     2.086    2.845    3.850            1.725    2.528    3.552
21     2.080    2.831    3.819            1.721    2.518    3.527
22     2.074    2.819    3.792            1.717    2.508    3.505
23     2.069    2.807    3.768            1.714    2.500    3.485
24     2.064    2.797    3.745            1.711    2.492    3.467
25     2.060    2.787    3.725            1.708    2.485    3.450
26     2.056    2.779    3.707            1.706    2.479    3.435
27     2.052    2.771    3.689            1.703    2.473    3.421
28     2.048    2.763    3.674            1.701    2.467    3.408
29     2.045    2.756    3.660            1.699    2.462    3.396
30     2.042    2.750    3.646            1.697    2.457    3.385
40     2.021    2.704    3.551            1.684    2.423    3.307
60     2.000    2.660    3.460            1.671    2.390    3.232
80     1.990    2.639    3.416            1.664    2.374    3.195
120    1.980    2.617    3.373            1.658    2.358    3.160
∞      1.960    2.576    3.290            1.645    2.326    3.090
Note. Abridged from Table 12, Biometrika Tables for Statisticians (Vol. 1, 3rd ed.) edited by E. S. Pearson and H. O. Hartley, 1970, New York: Cambridge University Press. Adapted by permission of the Biometrika Trustees.
Table D.1 Critical Values for the F-Distribution, alpha = .05 V1 V2
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 40 60 120 OO
1
2
3
4
5
6
7
8
9
10
12
15
20
24
30
40
60
120
00
161.4 18.51 10.13 7.71 6.61 5.99 5.59 5.32 5.12 4.96 4.84 4.75 4.67 4.60 4.54 4.49 4.45 4.41 4.38 4.35 4.32 4.30 4.28 4.26 4.24 4.23 4.21 4.20 4.18 4.17 4.08 4.00 3.92 3.84
199.5 19.00 9.55 6.94 5.79 5.14 4.74 4.46 4.26 4.10 3.98 3.89 3.81 3.74 3.68 3.63 3.59 3.55 3.52 3.49 3.47 3.44 3.42 3.40 3.39 3.37 3.35 3.34 3.33 3.32 3.23 3.15 3.07 3.00
215.7 19.16 9.28 6.59 5.41 4.76 4.35 4.07 3.86 3.71 3.59 3.49 3.41 3.34 3.29 3.24 3.20 3.16 3.13 3.10 3.07 3.05 3.03 3.01 2.99 2.98 2.96 2.95 2.93 2.92 2.84 2.76 2.68 2.60
224.6 19.25 9.12 6.39 5.19 4.53 4.12 3.84 3.63 3.48 3.36 3.26 3.18 3.11 3.06 3.01 2.96 2.93 2.90 2.87 2.84 2.82 2.80 2.78 2.76 2.74 2.73 2.71 2.70 2.69 2.61 2.53 2.45 2.37
230.2 19.30 9.01 6.26 5.05 4.39 3.97 3.69 3.48 3.33 3.20 3.11 3.03 2.96 2.90 2.85 2.81 2.77 2.74 2.71 2.68 2.66 2.64 2.62 2.60 2.59 2.57 2.56 2.55 2.53 2.45 2.37 2.29 2.21
234.0 19.33 8.94 6.16 4.95 4.28 3.87 3.58 3.37 3.22 3.09 3.00 2.92 2.85 2.79 2.74 2.70 2.66 2.63 2.60 2.57 2.55 2.53 2.51 2.49 2.47 2.46 2.45 2.43 2.42 2.34 2.25 2.17 2.10
236.8 19.35 8.89 6.09 4.88 4.21 3.79 3.50 3.29 3.14 3.01 2.91 2.83 2.76 2.71 2.66 2.61 2.58 2.54 2.51 2.49 2.46 2.44 2.42 2.40 2.39 2.37 2.36 2.35 2.33 2.25 2.17 2.09 2.01
238.9 19.37 8.85 6.04 4.82 4.15 3.73 3.44 3.23 3.07 2.95 2.85 2.77 2.70 2.64 2.59 2.55 2.51 2.48 2.45 2.42 2.40 2.37 2.36 2.34 2.32 2.31 2.29 2.28 2.27 2.18 2.10 2.02 1.94
240.5 19.38 8.81 6.00 4.77 4.10 3.68 3.39 3.18 3.02 2.90 2.80 2.71 2.65 2.59 2.54 2.49 2.46 2.42 2.39 2.37 2.34 2.32 2.30 2.28 2.27 2.25 2.24 2.22 2.21 2.12 2.04 1.96 1.88
241.9 19.40 8.79 5.96 4.74 4.06 3.64 3.35 3.14 2.98 2.85 2.75 2.67 2.60 2.54 2.49 2.45 2.41 2.38 2.35 2.32 2.30 2.27 2.25 2.24 2.22 2.20 2.19 2.18 2.16 2.08 1.99 1.91 1.83
243.9 19.41 8.74 5.91 4.68 4.00 3.57 3.28 3.07 2.91 2.79 2.69 2.60 2.53 2.48 2.42 2.38 2.34 2.31 2.28 2.25 2.23 2.20 2.18 2.16 2.15 2.13 2.12 2.10 2.09 2.00 1.92 1.83 1.75
245.9 19.43 8.70 5.86 4.62 3.94 3.51 3.22 3.01 2.85 2.72 2.62 2.53 2.46 2.40 2.35 2.31 2.27 2.23 2.20 2.18 2.15 2.13 2.11 2.09 2.07 2.06 2.04 2.03 2.01 1.92 1.84 1.75 1.67
248.0 19.45 8.66 5.80 4.56 3.87 3.44 3.15 2.94 2.77 2.65 2.54 2.46 2.39 2.33 2.28 2.23 2.19 2.16 2.12 2.10 2.07 2.05 2.03 2.01 1.99 1.97 1.96 1.94 1.93 1.84 1.75 1.66 1.57
249.1 19.45 8.64 5.77 4.53 3.84 3.41 3.12 2.90 2.74 2.61 2.51 2.42 2.35 2.29 2.24 2.19 2.15 2.11 2.08 2.05 2.03 2.01 1.98 1.96 1.95 1.93 1.91 1.90 1.89 1.79 1.70 1.61 1.52
250.1 19.46 8.62 5.75 4.50 3.81 3.38 3.08 2.86 2.70 2.57 2.47 2.38 2.31 2.25 2.19 2.15 2.11 2.07 2.04 2.01 1.98 1.96 1.94 1.92 1.90 1.88 1.87 1.85 1.84 1.74 1.65 1.55 1.46
251.1 19.47 8.59 5.72 4.46 3.77 3.34 3.04 2.83 2.66 2.53 2.43 2.34 2.27 2.20 2.15 2.10 2.06 2.03 1.99 1.96 1.94 1.91 1.89 1.87 1.85 1.84 1.82 1.81 1.79 1.69 1.59 1.50 1.39
252.2 19.48 8.57 5.69 4.43 3.74 3.30 3.01 2.79 2.62 2.49 2.38 2.30 2.22 2.16 2.11 2.06 2.02 1.98 1.95 1.92 1.89 1.86 1.84 1.82 1.80 1.79 1.77 1.75 1.74 1.64 1.53 1.43 1.32
253.3 19.49 8.55 5.66 4.40 3.70 3.27 2.97 2.75 2.58 2.45 2.34 2.25 2.18 2.11 2.06 2.01 1.97 1.93 1.90 1.87 1.84 1.81 1.79 1.77 1.75 1.73 1.71 1.70 1.68 1.58 1.47 1.35 1.22
254.3 19.50 8.53 5.63 4.36 3.67 3.23 2.93 2.71 2.54 2.40 2.30 2.21 2.13 2.07 2.01 1.96 1.92 1.88 1.84 1.81 1.78 1.76 1.73 1.71 1.69 1.67 1.65 1.64 1.62 1.51 1.39 1.25 1.00
Note: v1 = df numerator; v2 = df denominator. Abridged from Table 18, Biometrika Tables for Statisticians (Vol. 1, 3rd ed.) edited by E. S. Pearson and H. O. Hartley, 1970, New York: Cambridge University Press. Adapted by permission of the Biometrika Trustees.
Table D.2 Critical Values for the F-Distribution, alpha = .01
Each line below gives the critical values for one value of v1; within a line, values follow the order of v2.
v1 (one line each, in order): 1 2 3 4 5 6 7 8 9 10 12 15 20 24 30 40 60 120 ∞
v2 (order within each line): 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 40 60 120 ∞
4052 98.50 34.12 21.20 16.27 13.75 12.25 11.26 10.56 10.04 9.65 9.33 9.07 8.86 8.68 8.53 8.40 8.29 8.18 8.10 8.02 7.95 7.88 7.82 7.77 7.72 7.68 7.64 7.60 7.56 7.31 7.08 6.85 6.63
4999.5 99.00 30.82 18.00 13.27 10.92 9.55 8.65 8.02 7.56 7.21 6.93 6.70 6.51 6.36 6.23 6.11 6.01 5.93 5.85 5.78 5.72 5.66 5.61 5.57 5.53 5.49 5.45 5.42 5.39 5.18 4.98 4.79 4.61
5403 99.17 29.46 16.69 12.06 9.78 8.45 7.59 6.99 6.55 6.22 5.95 5.74 5.56 5.42 5.29 5.18 5.09 5.01 4.94 4.87 4.82 4.76 4.72 4.68 4.64 4.60 4.57 4.54 4.51 4.31 4.13 3.95 3.78
5625 99.25 28.71 15.98 11.39 9.15 7.85 7.01 6.42 5.99 5.67 5.41 5.21 5.04 4.89 4.77 4.67 4.58 4.50 4.43 4.37 4.31 4.26 4.22 4.18 4.14 4.11 4.07 4.04 4.02 3.83 3.65 3.48 3.32
5764 99.30 28.24 15.52 10.97 8.75 7.46 6.63 6.06 5.64 5.32 5.06 4.86 4.69 4.56 4.44 4.34 4.25 4.17 4.10 4.04 3.99 3.94 3.90 3.85 3.82 3.78 3.75 3.73 3.70 3.51 3.34 3.17 3.02
5859 99.33 27.91 15.21 10.67 8.47 7.19 6.37 5.80 5.39 5.07 4.82 4.62 4.46 4.32 4.20 4.10 4.01 3.94 3.87 3.81 3.76 3.71 3.67 3.63 3.59 3.56 3.53 3.50 3.47 3.29 3.12 2.96 2.80
5928 99.36 27.67 14.98 10.46 8.26 6.99 6.18 5.61 5.20 4.89 4.64 4.44 4.28 4.14 4.03 3.93 3.84 3.77 3.70 3.64 3.59 3.54 3.50 3.46 3.42 3.39 3.36 3.33 3.30 3.12 2.95 2.79 2.64
5982 99.37 27.49 14.80 10.29 8.10 6.84 6.03 5.47 5.06 4.74 4.50 4.30 4.14 4.00 3.89 3.79 3.71 3.63 3.56 3.51 3.45 3.41 3.36 3.32 3.29 3.26 3.23 3.20 3.17 2.99 2.82 2.66 2.51
6022 99.39 27.35 14.66 10.16 7.98 6.72 5.91 5.35 4.94 4.63 4.39 4.19 4.03 3.89 3.78 3.68 3.60 3.52 3.46 3.40 3.35 3.30 3.26 3.22 3.18 3.15 3.12 3.09 3.07 2.89 2.72 2.56 2.41
6056 99.40 27.23 14.55 10.05 7.87 6.62 5.81 5.26 4.85 4.54 4.30 4.10 3.94 3.80 3.69 3.59 3.51 3.43 3.37 3.31 3.26 3.21 3.17 3.13 3.09 3.06 3.03 3.00 2.98 2.80 2.63 2.47 2.32
6106 99.42 27.05 14.37 9.89 7.72 6.47 5.67 5.11 4.71 4.40 4.16 3.96 3.80 3.67 3.55 3.46 3.37 3.30 3.23 3.17 3.12 3.07 3.03 2.99 2.96 2.93 2.90 2.87 2.84 2.66 2.50 2.34 2.18
6157 99.43 26.87 14.20 9.72 7.56 6.31 5.52 4.96 4.56 4.25 4.01 3.82 3.66 3.52 3.41 3.31 3.23 3.15 3.09 3.03 2.98 2.93 2.89 2.85 2.81 2.78 2.75 2.73 2.70 2.52 2.35 2.19 2.04
6209 99.45 26.69 14.02 9.55 7.40 6.16 5.36 4.81 4.41 4.10 3.86 3.66 3.51 3.37 3.26 3.16 3.08 3.00 2.94 2.88 2.83 2.78 2.74 2.70 2.66 2.63 2.60 2.57 2.55 2.37 2.20 2.03 1.88
6235 99.46 26.60 13.93 9.47 7.31 6.07 5.28 4.73 4.33 4.02 3.78 3.59 3.43 3.29 3.18 3.08 3.00 2.92 2.86 2.80 2.75 2.70 2.66 2.62 2.58 2.55 2.52 2.49 2.47 2.29 2.12 1.95 1.79
6261 99.47 26.50 13.84 9.38 7.23 5.99 5.20 4.65 4.25 3.94 3.70 3.51 3.35 3.21 3.10 3.00 2.92 2.84 2.78 2.72 2.67 2.62 2.58 2.54 2.50 2.47 2.44 2.41 2.39 2.20 2.03 1.86 1.70
6287 99.47 26.41 13.75 9.29 7.14 5.91 5.12 4.57 4.17 3.86 3.62 3.43 3.27 3.13 3.02 2.92 2.84 2.76 2.69 2.64 2.58 2.54 2.49 2.45 2.42 2.38 2.35 2.33 2.30 2.11 1.94 1.76 1.59
6313 99.48 26.32 13.65 9.20 7.06 5.82 5.03 4.48 4.08 3.78 3.54 3.34 3.18 3.05 2.93 2.83 2.75 2.67 2.61 2.55 2.50 2.45 2.40 2.36 2.33 2.29 2.26 2.23 2.21 2.02 1.84 1.66 1.47
6339 99.49 26.22 13.56 9.11 6.97 5.74 4.95 4.40 4.00 3.69 3.45 3.25 3.09 2.96 2.84 2.75 2.66 2.58 2.52 2.46 2.40 2.35 2.31 2.27 2.23 2.20 2.17 2.14 2.11 1.92 1.73 1.53 1.32
6366 99.50 26.13 13.46 9.02 6.88 5.65 4.86 4.31 3.91 3.60 3.36 3.17 3.00 2.87 2.75 2.65 2.57 2.49 2.42 2.36 2.31 2.26 2.21 2.17 2.13 2.10 2.06 2.03 2.01 1.80 1.60 1.38 1.00
Note: v1 = df numerator; v2 = df denominator. Abridged from Table 18, Biometrika Tables for Statisticians (Vol. 1, 3rd ed.) edited by E. S. Pearson and H. O. Hartley, 1970, New York: Cambridge University Press. Adapted by permission of the Biometrika Trustees.
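Two entries in the last row (v2 = ∞) of each F table can be checked by hand, which is a useful guard against transcription errors. The sketch below relies on two standard facts rather than anything specific to the Biometrika tables: as v2 grows without bound, v1 times F approaches a chi-square variable with v1 degrees of freedom; a chi-square with 1 df is a squared standard normal, and a chi-square with 2 df has upper-tail probability exp(-x/2). The function names are illustrative only.

```python
import math
from statistics import NormalDist

def f_critical_v1_1_limit(alpha):
    """Critical F for v1 = 1 as v2 -> infinity: the squared two-tailed z value."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return z * z

def f_critical_v1_2_limit(alpha):
    """Critical F for v1 = 2 as v2 -> infinity.

    The chi-square(2) critical value is -2 ln(alpha); dividing by v1 = 2
    leaves -ln(alpha).
    """
    return -math.log(alpha)

if __name__ == "__main__":
    for alpha in (0.05, 0.01):
        print(alpha,
              round(f_critical_v1_1_limit(alpha), 2),
              round(f_critical_v1_2_limit(alpha), 2))
```

Rounded to two decimals, these limits agree with the tabled values in the v2 = ∞ position of the v1 = 1 and v1 = 2 lines of Tables D.1 and D.2.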
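Table E.1 Critical Values for the Studentized Range Statistic, alpha = .05
Each line below gives the critical values for one value of G; within a line, values follow the order of v.
G (one line each, in order): 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
v (order within each line): 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 24 30 40 60 120 ∞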
17.97 6.08 4.50 3.93 3.64 3.46 3.34 3.26 3.20 3.15 3.11 3.08 3.06 3.03 3.01 3.00 2.98 2.97 2.96 2.95 2.92 2.89 2.86 2.83 2.80 2.77
26.98 8.33 5.91 5.04 4.60 4.34 4.16 4.04 3.95 3.88 3.82 3.77 3.73 3.70 3.67 3.65 3.63 3.61 3.59 3.58 3.53 3.49 3.44 3.40 3.36 3.31
32.82 9.80 6.82 5.76 5.22 4.90 4.68 4.53 4.41 4.33 4.26 4.20 4.15 4.11 4.08 4.05 4.02 4.00 3.98 3.96 3.90 3.85 3.79 3.74 3.68 3.63
37.08 10.88 7.50 6.29 5.67 5.30 5.06 4.89 4.76 4.65 4.57 4.51 4.45 4.41 4.37 4.33 4.30 4.28 4.25 4.23 4.17 4.10 4.04 3.98 3.92 3.86
40.41 11.74 8.04 6.71 6.03 5.63 5.36 5.17 5.02 4.91 4.82 4.75 4.69 4.64 4.59 4.56 4.52 4.49 4.47 4.45 4.37 4.30 4.23 4.16 4.10 4.03
43.12 12.44 8.48 7.05 6.33 5.90 5.61 5.40 5.24 5.12 5.03 4.95 4.88 4.83 4.78 4.74 4.70 4.67 4.65 4.62 4.54 4.46 4.39 4.31 4.24 4.17
45.40 13.03 8.85 7.35 6.58 6.12 5.82 5.60 5.43 5.30 5.20 5.12 5.05 4.99 4.94 4.90 4.86 4.82 4.79 4.77 4.68 4.60 4.52 4.44 4.36 4.29
47.36 13.54 9.18 7.60 6.80 6.32 6.00 5.77 5.59 5.46 5.35 5.27 5.19 5.13 5.08 5.03 4.99 4.96 4.92 4.90 4.81 4.72 4.63 4.55 4.47 4.39
49.07 13.99 9.46 7.83 6.99 6.49 6.16 5.92 5.74 5.60 5.49 5.39 5.32 5.25 5.20 5.15 5.11 5.07 5.04 5.01 4.92 4.82 4.73 4.65 4.56 4.47
50.59 14.39 9.72 8.03 7.17 6.65 6.30 6.05 5.87 5.72 5.61 5.51 5.43 5.36 5.31 5.26 5.21 5.17 5.14 5.11 5.01 4.92 4.82 4.73 4.64 4.55
51.96 14.75 9.95 8.21 7.32 6.79 6.43 6.18 5.98 5.83 5.71 5.61 5.53 5.46 5.40 5.35 5.31 5.27 5.23 5.20 5.10 5.00 4.90 4.81 4.71 4.62
53.20 15.08 10.15 8.37 7.47 6.92 6.55 6.29 6.09 5.93 5.81 5.71 5.63 5.55 5.49 5.44 5.39 5.35 5.31 5.28 5.18 5.08 4.98 4.88 4.78 4.68
54.33 15.38 10.35 8.52 7.60 7.03 6.66 6.39 6.19 6.03 5.90 5.80 5.71 5.64 5.57 5.52 5.47 5.43 5.39 5.36 5.25 5.15 5.04 4.94 4.84 4.74
55.36 15.65 10.52 8.66 7.72 7.14 6.76 6.48 6.28 6.11 5.98 5.88 5.79 5.71 5.65 5.59 5.54 5.50 5.46 5.43 5.32 5.21 5.11 5.00 4.90 4.80
56.32 15.91 10.69 8.79 7.83 7.24 6.85 6.57 6.36 6.19 6.06 5.95 5.86 5.79 5.72 5.66 5.61 5.57 5.53 5.49 5.38 5.27 5.16 5.06 4.95 4.85
Note: G = number of groups, v = df error. Abridged from Table 29, Biometrika Tables for Statisticians (Vol. 1, 3rd ed.) edited by E. S. Pearson and H. O. Hartley, 1970, New York: Cambridge University Press. Adapted by permission of the Biometrika Trustees.
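Table E.2 Critical Values for the Studentized Range Statistic, alpha = .01
Each line below gives the critical values for one value of G; within a line, values follow the order of v.
G (one line each, in order): 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
v (order within each line): 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 24 30 40 60 120 ∞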
90.03 14.04 8.26 6.51 5.70 5.24 4.95 4.75 4.60 4.48 4.39 4.32 4.26 4.21 4.17 4.13 4.10 4.07 4.05 4.02 3.96 3.89 3.82 3.76 3.70 3.64
135.0 19.02 10.62 8.12 6.98 6.33 5.92 5.64 5.43 5.27 5.15 5.05 4.96 4.89 4.84 4.79 4.74 4.70 4.67 4.64 4.55 4.45 4.37 4.28 4.20 4.12
164.3 22.29 12.17 9.17 7.80 7.03 6.54 6.20 5.96 5.77 5.62 5.50 5.40 5.32 5.25 5.19 5.14 5.09 5.05 5.02 4.91 4.80 4.70 4.59 4.50 4.40
185.6 24.72 13.33 9.96 8.42 7.56 7.01 6.62 6.35 6.14 5.97 5.84 5.73 5.63 5.56 5.49 5.43 5.38 5.33 5.29 5.17 5.05 4.93 4.82 4.71 4.60
202.2 26.63 14.24 10.58 8.91 7.97 7.37 6.96 6.66 6.43 6.25 6.10 5.98 5.88 5.80 5.72 5.66 5.60 5.55 5.51 5.37 5.24 5.11 4.99 4.87 4.76
215.8 28.20 15.00 11.10 9.32 8.32 7.68 7.24 6.91 6.67 6.48 6.32 6.19 6.08 5.99 5.92 5.85 5.79 5.73 5.69 5.54 5.40 5.26 5.13 5.01 4.88
227.2 29.53 15.64 11.55 9.67 8.61 7.94 7.47 7.13 6.87 6.67 6.51 6.37 6.26 6.16 6.08 6.01 5.94 5.89 5.84 5.69 5.54 5.39 5.25 5.12 4.99
237.0 30.68 16.20 11.93 9.97 8.87 8.17 7.68 7.33 7.05 6.84 6.67 6.53 6.41 6.31 6.22 6.15 6.08 6.02 5.97 5.81 5.65 5.50 5.36 5.21 5.08
245.6 31.69 16.69 12.27 10.24 9.10 8.37 7.86 7.49 7.21 6.99 6.81 6.66 6.54 6.44 6.35 6.27 6.20 6.14 6.09 5.92 5.76 5.60 5.45 5.30 5.16
253.2 32.59 17.13 12.57 10.48 9.30 8.55 8.03 7.65 7.36 7.13 6.94 6.79 6.66 6.55 6.46 6.38 6.31 6.25 6.19 6.02 5.85 5.69 5.53 5.37 5.23
260.0 33.40 17.53 12.84 10.70 9.48 8.71 8.18 7.78 7.49 7.25 7.06 6.90 6.77 6.66 6.56 6.48 6.41 6.34 6.28 6.11 5.93 5.76 5.60 5.44 5.29
266.2 34.13 17.89 13.09 10.89 9.65 8.86 8.31 7.91 7.60 7.36 7.17 7.01 6.87 6.76 6.66 6.57 6.50 6.43 6.37 6.19 6.01 5.83 5.67 5.50 5.35
271.8 34.81 18.22 13.32 11.08 9.81 9.00 8.44 8.03 7.71 7.46 7.26 7.10 6.96 6.84 6.74 6.66 6.58 6.51 6.45 6.26 6.08 5.90 5.73 5.56 5.40
277.0 35.43 18.52 13.53 11.24 9.95 9.11 8.55 8.13 7.81 7.56 7.36 7.19 7.05 6.93 6.82 6.73 6.65 6.58 6.52 6.33 6.14 5.96 5.78 5.61 5.45
281.8 36.00 18.81 13.73 11.40 10.08 9.24 8.66 8.23 7.91 7.65 7.44 7.27 7.13 7.00 6.90 6.81 6.73 6.66 6.59 6.39 6.20 6.02 5.84 5.66 5.49
Note: G = number of groups, v = df error. Abridged from Table 29, Biometrika Tables for Statisticians (Vol. 1, 3rd ed.) edited by E. S. Pearson and H. O. Hartley, 1970, New York: Cambridge University Press. Adapted by permission of the Biometrika Trustees.
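The studentized range tables are typically used to compute a Tukey critical difference. The sketch below assumes the usual form of that computation, q times the square root of the error mean square divided by the per-group n; the text's own formula is not reproduced in this appendix, so treat the function as an illustration. It also includes a sanity check for the G = 2, v = ∞ entries: the range of two standard normal scores is sqrt(2) times the absolute value of a standard normal, so that entry equals sqrt(2) times the two-tailed z. The function names and the numbers in the usage lines are hypothetical.

```python
import math
from statistics import NormalDist

def q_two_groups_limit(alpha):
    """Studentized range critical value for G = 2 as v -> infinity."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return math.sqrt(2) * z

def tukey_critical_difference(q, ms_error, n_per_group):
    """Smallest difference between two group means judged significant.

    q           : tabled studentized range value for G groups and v = df error
    ms_error    : error mean square from the ANOVA source table
    n_per_group : number of scores per group (assumed equal across groups)
    """
    return q * math.sqrt(ms_error / n_per_group)

if __name__ == "__main__":
    print(round(q_two_groups_limit(0.05), 2), round(q_two_groups_limit(0.01), 2))
    # Hypothetical study: G = 3 groups, df error = 30, so q = 3.49 from Table E.1;
    # the error mean square (4.0) and group size (16) are made up for illustration.
    print(tukey_critical_difference(3.49, ms_error=4.0, n_per_group=16))
```

With unequal group sizes a single n_per_group no longer applies; one common remedy is to substitute the harmonic mean of the group sizes.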
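Table F.1 L Values for alpha = .05
Each line below gives the L values for one level of power; within a line, values follow the order of k.
Power (one line each, in order): .10 .30 .50 .60 .70 .75 .80 .85 .90 .95 .99
k (order within each line): 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 18 20 22 24 28 32 36 40 50 60 70 80 90 100
.43 .62 .78 .91 1.03 1.13 1.23 1.32 1.40 1.49 1.56 1.64 1.71 1.78 1.84 1.90 2.03 2.14 2.25 2.36 2.56 2.74 2.91 3.08 3.46 3.80 4.12 4.41 4.69 4.95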
2.06 2.78 3.30 3.74 4.12 4.46 4.77 5.06 5.33 5.59 5.83 6.06 6.29 6.50 6.71 6.91 7.29 7.65 8.00 8.33 8.94 9.52 10.06 10.57 11.75 12.81 13.79 14.70 15.56 16.37
3.84 4.96 5.76 6.42 6.99 7.50 7.97 8.41 8.81 9.19 9.56 9.90 10.24 10.55 10.86 11.16 11.73 12.26 12.77 13.02 14.17 15.02 15.82 16.58 18.31 19.88 21.32 22.67 23.93 25.12
4.90 6.21 7.15 7.92 8.59 9.19 9.73 10.24 10.71 11.15 11.58 11.98 12.36 12.73 13.09 13.43 14.09 14.71 15.30 15.87 16.93 17.91 18.84 19.71 21.72 23.53 25.20 26.75 28.21 29.59
6.17 7.70 8.79 9.68 10.45 11.14 11.77 12.35 12.89 13.40 13.89 14.34 14.80 15.22 15.63 16.03 16.78 17.50 18.17 18.82 20.04 21.17 22.23 23.23 25.53 27.61 29.52 31.29 32.96 34.54
6.94 8.59 9.77 10.72 11.55 12.29 12.96 13.59 14.17 14.72 15.24 15.74 16.21 16.67 17.11 17.53 18.34 19.11 19.83 20.53 21.83 23.04 24.18 25.25 27.71 29.94 31.98 33.88 35.67 37.36
7.85 9.64 10.90 11.94 12.83 13.62 14.35 15.02 15.65 16.24 16.80 17.34 17.85 18.34 18.81 19.27 20.14 20.96 21.74 22.49 23.89 25.19 26.41 27.56 30.20 32.59 34.79 36.83 38.75 40.56
8.98 10.92 12.30 13.42 14.39 15.26 16.04 16.77 17.45 18.09 18.70 19.28 19.83 20.36 20.87 21.37 22.31 23.20 24.04 24.85 26.36 27.77 29.09 30.33 33.19 35.77 38.14 40.35 42.14 44.37
10.51 12.65 14.17 15.41 16.47 17.42 18.28 19.08 19.83 20.53 21.20 21.83 22.44 23.02 23.58 24.13 25.16 26.13 27.06 27.94 29.60 31.14 32.58 33.94 37.07 39.89 42.48 44.89 47.16 49.29
13.00 15.44 17.17 18.57 19.78 20.86 21.84 22.74 23.59 24.39 25.14 25.86 26.55 27.20 27.84 28.45 29.62 30.72 31.77 32.76 34.64 36.37 38.00 39.59 43.07 46.25 49.17 51.89 54.44 56.85
18.37 21.40 23.52 25.24 26.73 28.05 29.25 30.36 31.39 32.37 33.29 34.16 35.00 35.81 36.58 37.33 38.76 40.10 41.37 42.59 44.87 46.98 48.96 50.83 55.12 58.98 62.53 65.83 68.92 71.84
Note: k = number of variables added. From Applied multiple regression/correlation for the behavioral sciences (p. 526) by J. Cohen and P. Cohen, 1983, Hillsdale (NJ): Lawrence Erlbaum Associates. Copyright 1983 by Lawrence Erlbaum Associates. Reprinted by permission of the publisher.
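Table F.2 L Values for alpha = .01
Each line below gives the L values for one level of power; within a line, values follow the order of k.
Power (one line each, in order): .10 .30 .50 .60 .70 .75 .80 .85 .90 .95 .99
k (order within each line): 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 18 20 22 24 28 32 36 40 50 60 70 80 90 100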
1.67 2.30 2.76 3.15 3.49 3.79 4.08 4.34 4.58 4.82 5.04 5.25 5.45 5.65 5.84 6.02 6.37 6.70 7.02 7.32 7.89 8.42 8.92 9.39 10.48 11.46 12.37 13.22 14.01 14.76
4.21 5.37 6.22 6.92 7.52 8.07 8.57 9.03 9.47 9.88 10.27 10.64 11.00 11.35 11.67 12.00 12.61 13.19 13.74 14.27 15.26 16.19 17.06 17.88 19.77 21.48 23.05 24.51 25.89 27.19
6.64 8.19 9.31 10.23 11.03 11.79 12.41 13.02 13.59 14.13 14.64 15.13 15.59 16.04 16.48 16.90 17.70 18.45 19.17 19.86 21.15 22.35 23.48 24.54 27.00 29.21 31.25 33.15 34.93 36.62
8.00 9.75 11.01 12.04 12.94 13.74 14.47 15.15 15.79 16.39 16.96 17.51 18.03 18.53 19.01 19.48 20.37 21.21 22.01 22.78 24.21 25.55 26.80 27.99 30.72 33.18 35.45 37.55 39.53 41.41
9.61 11.57 12.97 14.12 15.12 16.01 16.83 17.59 18.30 18.97 19.60 20.21 20.78 21.34 21.88 22.40 23.39 24.32 25.21 26.06 27.65 29.13 30.52 31.84 34.86 37.59 40.10 42.43 44.62 46.70
10.57 12.64 14.12 15.34 16.40 17.34 18.20 19.00 19.75 20.46 21.13 21.77 22.38 22.97 23.53 24.08 25.12 26.11 27.05 27.94 29.62 31.19 32.65 34.04 37.23 40.10 42.75 45.21 47.52 49.70
11.68 13.88 15.46 16.75 17.87 18.87 19.79 20.64 21.43 22.18 22.89 23.56 24.21 24.83 25.43 26.01 27.12 28.16 29.15 30.10 31.88 33.53 35.09 36.55 39.92 42.96 45.76 48.36 50.80 53.11
13.05 15.40 17.09 18.47 19.66 20.73 21.71 22.61 23.46 24.25 25.01 25.73 26.42 27.09 27.72 28.34 29.52 30.63 31.69 32.69 34.59 36.35 38.00 39.56 43.14 46.38 49.35 52.11 54.71 57.16
14.88 17.43 19.25 20.74 22.03 23.18 24.24 25.21 26.12 26.98 27.80 28.58 29.32 30.03 30.72 31.39 32.66 33.85 34.99 36.07 38.11 40.01 41.78 43.46 47.31 50.79 53.99 56.96 59.75 62.38
17.81 20.65 22.67 24.33 25.76 27.04 28.21 29.29 30.31 31.26 32.16 33.02 33.85 34.64 35.40 36.14 37.54 38.87 40.12 41.32 43.58 45.67 47.63 49.49 53.74 57.58 61.11 64.39 67.47 70.37
24.03 27.42 29.83 31.80 33.50 35.02 36.41 37.69 38.89 40.02 41.09 42.11 43.09 44.03 44.93 45.80 47.46 49.03 50.51 51.93 54.60 57.07 59.39 61.57 66.59 71.12 75.27 79.13 82.76 86.18
Note: k = number of variables added. From Applied multiple regression/correlation for the behavioral sciences (p. 526) by J. Cohen and P. Cohen, 1983, Hillsdale (NJ): Lawrence Erlbaum Associates. Copyright 1983 by Lawrence Erlbaum Associates. Reprinted by permission of the publisher.
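For readers who want to see how an L value turns into a sample size, here is a hedged sketch. Two assumptions are made and neither is stated in this appendix. First, the required sample size is taken to follow the form given by Cohen and Cohen (1983), n* = L / f² + k + 1, with f² = R² / (1 − R²) as the effect size. Second, for k = 1 and power of about .30 or higher, L is approximated by the square of the sum of two standard normal values (the two-tailed z for alpha and the z for the desired power). The approximation ignores the far tail and is only a rough check on the tabled values, not the method used to generate them. Function names are illustrative.

```python
from statistics import NormalDist

def approx_L_one_variable(alpha, power):
    """Large-sample approximation to L for k = 1 (an assumption, see above)."""
    z = NormalDist()
    return (z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)) ** 2

def required_n(L, f2, k):
    """Approximate sample size n* = L / f2 + k + 1 (assumed Cohen & Cohen form).

    L  : tabled value for the chosen alpha, power, and k
    f2 : effect size, R2 / (1 - R2) for the variables being tested
    k  : number of variables added
    """
    return L / f2 + k + 1

if __name__ == "__main__":
    # Compare with the k = 1 entries of Tables F.1 and F.2 at power = .80.
    print(round(approx_L_one_variable(0.05, 0.80), 2))
    print(round(approx_L_one_variable(0.01, 0.80), 2))
    # Hypothetical planning example: alpha = .05, power = .80, k = 1, f2 = .15.
    print(required_n(7.85, f2=0.15, k=1))
```

In practice the result of required_n is rounded up to the next whole participant.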
Author Index
A
Algina, J., 239, 302

B
Bakeman, R., 49, 50, 299, 300, 301
Bishop, Y. M. M., 49, 301

C
Cohen, J., 13, 104, 151, 168, 179, 188, 204, 210, 295, 296, 297, 300, 301
Cohen, P., 13, 168, 179, 188, 204, 210, 295, 296, 301

F
Fidell, L. S., 50, 239, 302
Fienberg, S. E., 49, 301

H
Hays, W. L., 138, 139, 225, 301
Holland, P. W., 49, 301

K
Keppel, G., 51, 207, 225, 241, 301
Kessen, W., 2, 301
Kirk, R. E., 225, 301

L
Loftus, G. R., 39, 301
Loftus, E. F., 39, 301

M
Marascuilo, L. A., 85, 301
McArthur, D., 299, 301
Mervis, C. B., 80, 302

O
Olejnik, S., 239, 302

R
Robinson, B. F., 49, 50, 80, 300, 301, 302
Robinson, B. W., 80, 302
Rosenthal, R., 151, 300, 302
Rosnow, R. L., 151, 300, 302

S
Saufley, W. H., Jr., 51, 301
Scott, D. W., 76, 302
Serlin, R. C., 85, 301
Siegel, S., 50, 85, 302
Stevens, S. S., 45, 302
Stigler, S. M., 59, 66, 84, 87, 302

T
Tabachnick, B. G., 50, 239, 302
Tufte, E. R., 71, 73, 78, 79, 302
Tukey, J. W., 77, 302

W
Wainer, H., 73, 302
Wilkinson, L., 151, 302
Winer, B. J., 146, 211, 225, 241, 302
Subject Index
A
A priori tests, see Planned comparisons
Accounting for variance, 105, 113-117, 167-168
  and statistical significance, 140, 168-171
Adjusted means, see Analysis of covariance (ANCOVA)
Adjusted R², see R squared, adjusted
Alpha error, see Type I error
Alpha level, 21, 23, 24, 25-26, 30
  and power, 295, 296
Alternative hypothesis, 21-22
Analysis of covariance (ANCOVA), 156, 171-172
  and adjusted means, 212, 293
  and adjusted individual scores, 216-217, 294
  and homogeneity of regression, 218-219, 292
  and pretest-posttest studies, 289-292
Analysis of variance (ANOVA), 12-13, 50, 51, 103-104, 290
  and multiple regression, see Multiple regression
  and unequal numbers of subjects per group, 210
  for two independent groups, 149-150
  one-way, 185-186, 194
  source table, 194, 241
  source table for repeated measures, 258-259, 261
  two-way, 235-238
And rule, see Probability
Arithmetic mean, see Mean

B
Best fit line, 104-107, 111, 113, 114, 116
Beta error, see Type II error
Between groups, see Degrees of freedom, between groups
Between-subjects factors, 181-182, 223-224
Between-subjects studies, see Designs
Biased estimates, 138
Binary predictor variables, see Predictor variables, binary
Binomial coefficients, 37
Binomial distribution, 31, 34, 39, 83-84
  normal approximation for, 84-86
Binomial parameters, 39
Binomial test, see Sign test
Box-and-whiskers plot, 77-78, 79
Button-pushing study, introduced, 172-173
  between-subjects 2 x 2 factorial version, 233-234
  mixed between- within-subjects 2 x 2 factorial version, 246, 270
  single-factor within subjects version, 262-263
  within-subjects 2 x 2 factorial version, 247, 278

C
Categorical scales, see Nominal scales
Categorical variables, see Coding categorical variables
Central limit theorem, 91
Chi-square analysis, 49, 51
Coding categorical variables, 182
  contrast coding, 186-189
  dummy variable coding, 182-184, 250, 253, 254
  for factorial studies, 226-231
  orthogonal contrasts, see Contrast coefficients
Coefficient of determination (r²), 116, 128
Conditional relations, see Interactions
Confidence intervals, 98-99, 152
Contrast coding, see Coding categorical variables
Contrast coefficients, 189-191, 228, 230, 231
Correlation, 50, 103-104
Correlation coefficient (r), 116-117, 124-128, 156, 290
Correlational studies, see Observational studies
Covariance, 121, 124, 126-127
Criterion variable, see Dependent variable
Critical region, see Region of rejection
Critical values, 40-41, 44

D
De Moivre, Abraham, 84, 85, 87, 91
Degrees of freedom, 140
  between groups, 233
  error, 144, 147
  for one-way ANOVA, 194-195
  for single-factor within-subjects studies, 259-260
  for mixed two-factor studies, 271
  for two-factor between-subjects studies, 233
  for two-factor within-subjects studies, 279
  for three-factor between-subjects studies, 234
  model, 145, 147
  total, 143-144
  within groups, 232-233
Dependent variable, 13-14, 48-49
Descriptive statistics, see Statistics
Designs
  and statistical procedures, 49-51
  single-factor between-subjects, 181-182, 224
  single-factor within-subjects, 246-247
  mixed between within, 246
  multi-factor between-subjects, 224-225, 246
Deviation scores, see Residuals
Directional test, see One-tailed test
Dummy variable coding, see Coding categorical variables

E
Effect size, 150-151, 296, 299
Error bars, 98, 152
Error sum of squares, see Sum of squares, error
Estimated standard deviation, see Standard deviation, estimated
Estimated standard error of estimate, see Standard error of estimate, estimated
Estimated standard error of the mean, see Standard error of the mean, estimated
Estimated variance, see Variance, estimated
Excel, 1, 6, 14, 62, 73, 105, 157
Experimental studies, 48-49
Explanatory variables, see Predictor variables

F
F distribution, 140-142
F ratio, 140, 146-147
  for model, 166
  for unique additional variance, 168-169
  using final error term, 204
F test, 142-143
Factorial studies, 224-225
  advantages of, 226
Fisher, Ronald A., 21, 140

G
Gauss, Carl Friedrich, 87
Gender smiling study, introduced, 212-213

H
Harmonic mean, 211
Hierarchic multiple regression, 171-172, 203, 214, 229, 280
Histogram, 75-77
Homogeneity of regression, see Analysis of covariance (ANCOVA)
Hypothesis testing, 20-22

I
Independence
  assumption of, 23
Independent variables, 13-14, 48-49, see also Predictor variables
Inferential statistics, see Statistics
Interactions, 218-219, 226
  interpreting significance of, 235-237
Interval scales, 46, 47

L
Laplace, Pierre Simon, 87, 91
Least squares, method of, 54, 58
Legendre, Adrien Marie, 59, 87
Lie detection study, introduced, 54
  within subjects version, 251
Linear relation, 105
  exact nature of, 113
  strength of, 113-115
Log linear analysis, 49, 51

M
Magnitude of effect, see Effect size
Main effects, 226
  interpreting significance of, 235-237
Mean, 15, 54
  population, 19-20, 88, 93, 96
  of sample means, 96
  sample, 20, 64, 91-92, 96
  standard error of, see Standard error of the mean
  weighted sum, 91
Mean square, 140
  error, 145-146, 147
  model, 145-146, 147
Model sum of squares, see Sum of squares, model
Money cure study, introduced, 17-18
Multiple correlation coefficient (R), see Multiple R
Multiple R, 158, 163-164, 165
Multiple R squared, see R squared
Multiple regression, 12-14, 50, 51, 156
  and ANOVA, unified view, 12-13, 103-104
  and spreadsheets, 14, 288
Multivariate analysis of variance (MANOVA), 50

N
Neyman, Jerzy, 21
Nominal scales, 46, 49
Non-directional test, see Two-tailed test
Nonparametric tests, 50
Normal curve, see Normal distribution
Normal distribution, 87, 90
  functional definition, 87-88
  historical considerations, 87
  underlying circumstances, 89-90
Notation, statistical, see Statistical notation
Null hypothesis, 21-22

O
Observational studies, 48
Omnibus test, 186, 201-202
One-tailed test, 22, 27, 40-41
One-way analysis of variance, see Analysis of variance (ANOVA)
Or rule, see Probability
Ordinal scales, 46, 47
Orthogonal contrasts, see Coding categorical variables
Outcomes, see Probability
Outcome class, see Probability
Outliers, 69

P
Parameter, population, 19-20
Parametric tests, 50
Partial regression coefficients (bi), 157-158
  standardized (βi), 161
Partitioning variance
  between and within groups, 114, 194
  between and within subjects, 248-249
Pascal's triangle, 35, 37
Pascal, Blaise, 31
Pearson, Karl, 21, 124
Pearson product-moment correlation coefficient (r), see Correlation coefficient
Phi coefficient, 127
Planned comparisons, 186, 201-203, 205, 234
Point estimate, 60
Population, 18, 19-20
Population mean, see Mean
Population parameter, see Parameter, population
Population standard deviation, see Standard deviation
Population variance, see Variance
Post hoc tests, 186, 201-202, 206-207, 237, 242-243
  and unequal numbers of subjects per group, 210-211
  for within subjects studies, 264-266, 284-285
Power, 25-26
Power analysis, 154, 295-299
Predicting the mean, 56-58
Predictor variables, see also Independent variables
  binary, 129, 130, 133
  for a single-factor between-subjects study, 253
  for a single-factor within-subjects study, 254, 264
  for multi-factor between-subjects studies, 228, 230, 231
Pretest-posttest studies, see Analysis of covariance
Probability, 32-33
  and rule, 33
  or rule, 33, 40
  outcomes, 31, 33-34, 35, 37, 39
  outcome class, 32, 33-34, 35, 39

Q
Qualitative scales, see Nominal scales
Qualitative variables, see Coding categorical variables
Quantitative variables, 13, 46-47, 183
Quetelet, Adolphe, 66, 87

R
R squared, 158-159, 165
  adjusted, 159, 165
Random sample, 18
Ratio scales, 46, 47
Region of rejection, 29, 30, 40-41
Regression, 103-104, 156
Regression coefficient (b), 105, 121, 127-128, 144, 218
  partial, see Partial regression coefficient
Regression constant (a), 105, 121, 128
Regression line, 132, 134
Rejecting the null hypothesis, 22, 24
Rejection region, see Region of rejection
Repeated-measures factors, see Within-subjects factors
Residuals, 58, 60, 110-111, 112
Response variable, see Dependent variable

S
Sample, 18, 19-20
  random, see Random sample
Sample mean, see Mean
Sample standard deviation, see Standard deviation
Sample statistics, see Statistics
Sample variance, see Variance
Sampling distribution, 21, 22-23
Scales of measurement, 45-47
Scattergram, 105, 106
Sign test, 24, 34, 39-41, 42, 49
Significance testing, 140-142, 146-147
  for main effects and interactions, 233-234
  for model, 166
  for unique additional variance, 168, 204
Single sample tests, 93-96
Slope, see Regression coefficient
Snedecor, George, 140
Source table, see Analysis of variance (ANOVA)
Spearman rank-order correlation coefficient, 127
Spreadsheets, 4-10
  defined, 4-5
  elements of, 6-10
  formulas, 7-9
  and multiple regression, see Multiple regression
  operators for, 8
Standard deviation, 63
  estimated, 139, 161
  population, 63, 64, 94, 96, 139
  sample, 63, 64, 96, 139
Standard error of estimate, 160
  estimated, 159-161, 165
Standard error of the mean, 94-96
  estimated, 94, 161
  population, 95, 96
Standard normal distribution, 88
Standard scores, see Z scores
Standardized partial regression coefficient, see Partial regression coefficient, standardized
Statistical interactions, see Interactions
Statistical notation, 14-15
Statistical test
  and hypothesis testing, 21, 23-24
Statistics
  descriptive, 19, 30, 70
  inferential, 19, 30
  sample, 19-20
Stem-and-leaf plot, 73-75, 79
Step-wise regression, 171
Student's t test, see t test
Sum of squares, 58, 60
  error, 107, 110-111, 145
  model, 107, 110-111, 145
  total, 105, 110-111
  within and between subjects, 249
  X, 120
  XY, 120, 124
  Y, 120
Suppressor effect, 168

T
t distribution, 92, 96, 97
t statistic, 96
t test, 51, 92, 129, 154
Test statistic
  and hypothesis testing, 21, 23
Tree diagram, 33, 189
Tukey critical difference, 207-208, 265
Tukey test, 207-209, 284-285
Two-tailed test, 21, 27, 40-41
Type I error, 24, 26, 29, 202, 206-207
Type II error, 24-25, 26, 29, 207

U
Unbiased estimates, 138

V
Variance, 60-61
  error, 114-116, 117
  estimated, 137-138, 139
  model, 114-116, 117
  population, 62, 64, 138
  sample, 62, 64, 137, 138
  total, 114-116, 117

W
Within groups, see Degrees of freedom, within groups
Within-subjects factorial studies, see Designs
Within-subjects factors, 245-246, 270-271
  advantages of, 247-248
  controlling between subject variability for, 250-251

Y
Y-intercept, see Regression constant

Z
Z scores, 66-67, 69, 70, 85, 126