Nonparametric Statistics FOR THE BEHAVIORAL SCIENCES
McGraw-Hill Series in Psychology
CLIFFORD T. MORGAN, Consulting Editor
BARKER, KOUNIN, AND WRIGHT · Child Behavior and Development
BARTLEY · Beginning Experimental Psychology
BLUM · Psychoanalytic Theories of Personality
BROWN · The Psychodynamics of Abnormal Behavior
BROWN AND GHISELLI · Scientific Method in Psychology
CATTELL · Personality
CRAFTS, SCHNEIRLA, ROBINSON, AND GILBERT · Recent Experiments in Psychology
DEESE · The Psychology of Learning
DOLLARD AND MILLER · Personality and Psychotherapy
DORCUS AND JONES · Handbook of Employee Selection
FERGUSON · Personality Measurement
GHISELLI AND BROWN · Personnel and Industrial Psychology
GRAY · Psychology Applied to Human Affairs
GRAY · Psychology in Industry
GUILFORD · Fundamental Statistics in Psychology and Education
GUILFORD · Psychometric Methods
HAIRE · Psychology in Management
HIRSH · The Measurement of Hearing
HURLOCK · Adolescent Development
HURLOCK · Child Development
HURLOCK · Developmental Psychology
JOHNSON · Essentials of Psychology
KARN AND GILMER · Readings in Industrial and Business Psychology
KRECH AND CRUTCHFIELD · Theory and Problems of Social Psychology
LEWIN · A Dynamic Theory of Personality
LEWIN · Principles of Topological Psychology
MAIER AND SCHNEIRLA · Principles of Animal Psychology
MILLER · Experiments in Social Process
MILLER · Language and Communication
MISIAK AND STAUDT · Catholics in Psychology: A Historical Survey
MOORE · Psychology for Business and Industry
MORGAN AND STELLAR · Physiological Psychology
PAGE · Abnormal Psychology
REYMERT · Feelings and Emotions
RICHARDS · Modern Clinical Psychology
SEASHORE · Psychology of Music
SEWARD · Sex and the Social Order
SHAFFER AND LAZARUS · Fundamental Concepts in Clinical Psychology
SIEGEL · Nonparametric Statistics: For the Behavioral Sciences
STAGNER · Psychology of Personality
TOWNSEND · Introduction to Experimental Method
VINACKE · The Psychology of Thinking
WALLEN · Clinical Psychology: The Study of Persons
ZUBEK AND SOLBERG · Human Development
John F. Dashiell was Consulting Editor of this series from its inception in 1931 until January 1, 1950.
Nonparametric Statistics FOR THE BEHAVIORAL SCIENCES

SIDNEY SIEGEL
Associate Professor of Statistics and Social Psychology
The Pennsylvania State University

McGRAW-HILL BOOK COMPANY, INC.
New York  Toronto  London
1956
NONPARAMETRIC STATISTICS: FOR THE BEHAVIORAL SCIENCES

Copyright © 1956 by the McGraw-Hill Book Company, Inc. Printed in the United States of America. All rights reserved. This book, or parts thereof, may not be reproduced in any form without permission of the publishers.

Library of Congress Catalog Card Number 56-8185

THE MAPLE PRESS COMPANY, YORK, PA.
To Jay
PREFACE
I believe that the nonparametric techniques of hypothesis testing are uniquely suited to the data of the behavioral sciences. The two alternative names which are frequently given to these tests suggest two reasons for their suitability. The tests are often called "distribution-free," one of their primary merits being that they do not assume that the scores under analysis were drawn from a population distributed in a certain way, e.g., from a normally distributed population. Alternatively, many of these tests are identified as "ranking tests," and this title suggests their other principal merit: nonparametric techniques may be used with scores which are not exact in any numerical sense, but which in effect are simply ranks. A third advantage of these techniques, of course, is their computational simplicity. Many believe that researchers and students in the behavioral sciences need to spend more time and reflection in the careful formulation of their research problems and in collecting precise and relevant data. Perhaps they will turn more attention to these pursuits if they are relieved of the necessity of computing statistics which are complicated and time-consuming. A final advantage of the nonparametric tests is their usefulness with small samples, a feature which should be helpful to the researcher collecting pilot study data and to the researcher whose samples must be small because of their very nature (e.g., samples of persons with a rare form of mental illness, or samples of cultures).
To date, no source is available which presents the nonparametric techniques in usable form and in terms which are familiar to the behavioral scientist. The techniques are presented in various mathematics and statistics publications. Most behavioral scientists do not have the mathematical sophistication required for consulting these sources. In addition, certain writers have presented summaries of these techniques in articles addressed to social scientists. Notable among these are Blum and Fattu (1954), Moses (1952a), Mosteller and Bush (1954), and Smith (1953). Moreover, some of the newer texts on statistics for social scientists have contained chapters on nonparametric methods. These include the texts by Edwards (1954), McNemar (1955), and Walker and Lev (1953). Valuable as these sources are, they have typically either
been highly selective in the techniques presented or have not included the tables of significance values which are used in the application of the various tests. Therefore I have felt that a text on the nonparametric methods would be a desirable addition to the literature formed by the sources mentioned.
In this book I have presented the tests according to the research design for which each is suited. In discussing each test, I have attempted to indicate its "function," i.e., to indicate the sort of data to which it is applicable, to convey some notion of the rationale or proof underlying the test, to explain its computation, to give examples of its application in behavioral scientific research, and to compare the test to its parametric equivalent, if any, and to any nonparametric tests of similar function.
The reader may be surprised at the amount of space given to examples of the use of these tests, and even astonished at the repetitiousness which these examples introduce. I may justify this allocation of space by pointing out that (a) the examples help to teach the computation of the test, (b) the examples illustrate the application of the test to research problems in the behavioral sciences, and (c) the use of the same six steps in every hypothesis test demonstrates that identical logic underlies each of the many statistical techniques, a fact which is not well understood by many researchers. Since I have tried to present all the raw data for each of the examples, I was not able to draw these from a catholic group of sources. Research publications typically do not present raw data, and therefore I was compelled to draw upon a rather parochial group of sources for most examples: those sources from which raw data were readily available. The reader will understand that this is an apology for the frequency with which I have presented in the examples my own research and that of my immediate colleagues. Sometimes I have not found appropriate data to illustrate the use of a test and therefore have "concocted" data for the purpose.
In writing this book, I have become acutely aware of the important influence which various teachers and colleagues have exercised upon my thinking. Professor Quinn McNemar gave me fundamental training in statistical inference and first introduced me to the importance of the assumptions underlying various statistical tests. Professor Lincoln Moses has enriched my understanding of statistics, and it was he who first interested me in the literature of nonparametric statistics. My study with Professor George Polya yielded exciting insights in probability theory. Professors Kenneth J. Arrow, Albert H. Bowker, Douglas H. Lawrence, and the late J. C. C. McKinsey have each contributed significantly to my understanding of statistics and experimental design. My comprehension of measurement theory was deepened by my research
collaboration with Professors Donald Davidson and Patrick Suppes. This book has benefited enormously from the stimulating and detailed suggestions and criticisms which Professors James B. Bartoo, Quinn McNemar, and Lincoln Moses gave me after each had read the manuscript. I am greatly indebted to each of them for their valuable gifts of time and knowledge. I am also grateful to Professors John F. Hall and Robert E. Stover, who encouraged my undertaking to write this book and who contributed helpful critical comments on some of the chapters. Of course, none of these persons is in any way responsible for the faults which remain; these are entirely my responsibility, and I should be grateful if any readers who detect errors and obscurities would call my attention to them.
Much of the usefulness of this book is due to the generosity of the many authors and publishers who have kindly permitted me to adapt or reproduce tables and other material originally presented by them. I have mentioned each source where the materials appear, and I also wish to mention here my gratitude to Donovan Auble, Irvin L. Child, Frieda Swed Cohn, Churchill Eisenhart, D. J. Finney, Milton Friedman, Leo A. Goodman, M. G. Kendall, William Kruskal, Joseph Lev, Henry B. Mann, Frank J. Massey, Jr., Edwin G. Olds, George W. Snedecor, Helen M. Walker, W. Allen Wallis, John E. Walsh, John W. M. Whiting, D. R. Whitney, and Frank Wilcoxon, and to the Institute of Mathematical Statistics, the American Statistical Association, Biometrika, the American Psychological Association, Iowa State College Press, Yale University Press, the Institute of Educational Research at Indiana University, the American Cyanamid Company, Charles Griffin & Co., Ltd., John Wiley & Sons, Inc., and Henry Holt and Company, Inc. I am indebted to Professor Sir Ronald A. Fisher, Cambridge, to Dr. Frank Yates, Rothamsted, and to Messrs. Oliver and Boyd, Ltd., Edinburgh, for permission to reprint Tables No. III and IV from their book Statistical Tables for Biological, Agricultural, and Medical Research.

My greatest personal indebtedness is to my wife, Dr. Alberta Engvall Siegel, without whose help this book could not have been written. She has worked closely with me in every phase of its planning and writing. I know it has benefited not only from her knowledge of the behavioral sciences but also from her careful editing, which has greatly enhanced any expository merits the book may have.

SIDNEY SIEGEL
CONTENTS
PREFACE  vii
GLOSSARY OF SYMBOLS  xv

CHAPTER 1. INTRODUCTION  1

CHAPTER 2. THE USE OF STATISTICAL TESTS IN RESEARCH  6
  i. The Null Hypothesis  7
  ii. The Choice of the Statistical Test  8
  iii. The Level of Significance and the Sample Size  8
  iv. The Sampling Distribution  11
  v. The Region of Rejection  13
  vi. The Decision  14
  Example  14

CHAPTER 3. CHOOSING AN APPROPRIATE STATISTICAL TEST  18
  The Statistical Model  18
  Power-Efficiency  20
  Measurement  21
  Parametric and Nonparametric Statistical Tests  30

CHAPTER 4. THE ONE-SAMPLE CASE  35
  The Binomial Test  36
  The χ² One-Sample Test  42
  The Kolmogorov-Smirnov One-Sample Test  47
  The One-Sample Runs Test  52
  Discussion  59

CHAPTER 5. THE CASE OF TWO RELATED SAMPLES  61
  The McNemar Test for the Significance of Changes  63
  The Sign Test  68
  The Wilcoxon Matched-Pairs Signed-Ranks Test  75
  The Walsh Test  83
  The Randomization Test for Matched Pairs  88
  Discussion  92

CHAPTER 6. THE CASE OF TWO INDEPENDENT SAMPLES  95
  The Fisher Exact Probability Test  96
  The χ² Test for Two Independent Samples  104
  The Median Test  111
  The Mann-Whitney U Test  116
  The Kolmogorov-Smirnov Two-Sample Test  127
  The Wald-Wolfowitz Runs Test  136
  The Moses Test of Extreme Reactions  145
  The Randomization Test for Two Independent Samples  152
  Discussion  156

CHAPTER 7. THE CASE OF k RELATED SAMPLES  159
  The Cochran Q Test  161
  The Friedman Two-Way Analysis of Variance by Ranks  166
  Discussion  173

CHAPTER 8. THE CASE OF k INDEPENDENT SAMPLES  174
  The χ² Test for k Independent Samples  175
  The Extension of the Median Test  179
  The Kruskal-Wallis One-Way Analysis of Variance by Ranks  184
  Discussion  193

CHAPTER 9. MEASURES OF CORRELATION AND THEIR TESTS OF SIGNIFICANCE  195
  The Contingency Coefficient: C  196
  The Spearman Rank Correlation Coefficient: rS  202
  The Kendall Rank Correlation Coefficient: τ  213
  The Kendall Partial Rank Correlation Coefficient: τxy.z  223
  The Kendall Coefficient of Concordance: W  229
  Discussion  238

REFERENCES  241

APPENDIX  245
  A. Table of Probabilities Associated with Values as Extreme as Observed Values of z in the Normal Distribution  247
  B. Table of Critical Values of t  248
  C. Table of Critical Values of Chi Square  249
  D. Table of Probabilities Associated with Values as Small as Observed Values of x in the Binomial Test  250
  E. Table of Critical Values of D in the Kolmogorov-Smirnov One-Sample Test  251
  F. Table of Critical Values of r in the Runs Test  252
  G. Table of Critical Values of T in the Wilcoxon Matched-Pairs Signed-Ranks Test  254
  H. Table of Critical Values for the Walsh Test  255
  I. Table of Critical Values of D (or C) in the Fisher Test  256
  J. Table of Probabilities Associated with Values as Small as Observed Values of U in the Mann-Whitney Test
  K. Table of Critical Values of U in the Mann-Whitney Test
  L. Table of Critical Values of KD in the Kolmogorov-Smirnov Two-Sample Test (Small Samples)  278
  M. Table of Critical Values of D in the Kolmogorov-Smirnov Two-Sample Test (Large Samples: Two-Tailed Test)  279
  N. Table of Probabilities Associated with Values as Large as Observed Values of χr² in the Friedman Two-Way Analysis of Variance by Ranks  280
  O. Table of Probabilities Associated with Values as Large as Observed Values of H in the Kruskal-Wallis One-Way Analysis of Variance by Ranks  282
  P. Table of Critical Values of rS, the Spearman Rank Correlation Coefficient  284
  Q. Table of Probabilities Associated with Values as Large as Observed Values of S in the Kendall Rank Correlation Coefficient  285
  R. Table of Critical Values of s in the Kendall Coefficient of Concordance  286
  S. Table of Factorials  287
  T. Table of Binomial Coefficients  288
  U. Table of Squares and Square Roots  289

INDEX  303
GLOSSARY OF SYMBOLS
A  Upper left-hand cell in a 2 × 2 table; the number of cases observed in that cell.
α  Alpha. The level of significance = the probability of a Type I error.
B  Upper right-hand cell in a 2 × 2 table; the number of cases observed in that cell.
β  Beta. The probability of a Type II error; the power of the test = 1 − β.
C  Lower left-hand cell in a 2 × 2 table; the number of cases observed in that cell.
C  Contingency coefficient.
χ²  Chi square. A random variable which follows the chi-square distribution, certain values of which are shown in Table C of the Appendix. A statistic whose value is computed from observed data.
χr²  The statistic in the Friedman two-way analysis of variance by ranks.
d  A difference score, used in the case of matched pairs, obtained for any pair by subtracting the score of one member from that of the other.
df  Degrees of freedom.
D  Lower right-hand cell in a 2 × 2 table; the number of cases observed in that cell.
D  The maximum difference between the two cumulative distributions in the Kolmogorov-Smirnov test.
Eij  Under H0, the expected number of cases in the ith row and the jth column in a χ² test.
f  Frequency, i.e., number of cases.
F  The F test: the parametric analysis of variance.
F0(X)  Under H0, the proportion of cases in the population whose scores are equal to or less than X. This is a statistic in the Kolmogorov-Smirnov test.
g  In the Moses test, the amount by which an observed value of s′h exceeds nC − 2h, where nC − 2h is the minimum span of the ranks of the control cases.
Gj  In the Cochran Q test, the total number of "successes" in the jth column (sample).
h  In the Moses test, the predetermined number of extreme control ranks which are dropped from each end of the span of control ranks before s′h is determined.
H  The statistic used in the Kruskal-Wallis one-way analysis of variance by ranks.
H0  The null hypothesis.
H1  The alternative hypothesis, the operational statement of the research hypothesis.
i  A variable subscript, usually denoting rows.
j  A variable subscript, usually denoting columns.
K  In the Kolmogorov-Smirnov test, the number of observations which are equal to or less than X.
KD  In the Kolmogorov-Smirnov test, the numerator of D.
Li  In the Cochran Q test, the total number of "successes" in the ith row.
μ  Mu. The population mean.
μ0  The population mean under H0.
μ1  The population mean under H1.
n  The number of independently drawn cases in a single sample.
N  The total number of independently drawn cases used in a statistical test.
Oij  The observed number of cases in the ith row and the jth column in a χ² test.
p  The probability associated with the occurrence under H0 of a value as extreme as or more extreme than the observed value.
P  In the binomial test, the proportion of "successes."
Q  In the binomial test, 1 − P.
Q  The statistic used in the Cochran test.
r  The number of runs.
r  The Pearson product-moment correlation coefficient.
r  The number of rows in a k × r table.
Rj  The sum of the ranks in the jth column or sample.
rS  The Spearman rank correlation coefficient.
r̄S  The mean of several rS's.
s  In the Kendall W, the sum of the squares of the deviations of the Rj from the mean value of Rj.
s′  In the Moses test, the span or range of the ranks of the control cases.
s′h  In the Moses test, the span or range of the ranks of the control cases after h cases have been dropped from each extreme of that range.
S  A statistic in the Kendall τ.
SN(X)  In the Kolmogorov-Smirnov test, the observed cumulative step function of a random sample of N observations.
σ  Sigma. The standard deviation of the population. When a subscript is given, the standard error of a sampling distribution; for example, σU = the standard error of the sampling distribution of U.
σ²  The variance of the population.
Σ  Summation of.
t  Student's t test, a parametric test.
t  The number of observations in any tied group.
T  In the Wilcoxon test, the smaller of the sums of like-signed ranks.
T  A correction factor for ties.
τ  Tau. The Kendall rank correlation coefficient.
τxy.z  The Kendall partial rank correlation coefficient.
U  The statistic in the Mann-Whitney test.
U′  U = n1n2 − U′, a transformation in the Mann-Whitney test.
W  The Kendall coefficient of concordance.
x  In the binomial test, the number of cases in one of the groups.
X  Any observed score.
X̄  The mean of a sample of observations.
z  Deviation of the observed value from μ0 when σ = 1; z is normally distributed. Probabilities associated under H0 with the occurrence of values as extreme as various z's are given in Table A of the Appendix.
(N x)  The binomial coefficient: (N x) = N!/[x!(N − x)!]. Table T of the Appendix gives binomial coefficients for N from 1 to 20.
!  Factorial. N! = N(N − 1)(N − 2) · · · 1, and 0! = 1. For example, 5! = (5)(4)(3)(2)(1) = 120. Table S of the Appendix gives factorials for N from 1 to 20.
|X − Y|  The absolute value of the difference between X and Y, that is, the numerical value of the difference regardless of sign. For example, |5 − 3| = |3 − 5| = 2.
X > Y  X is greater than Y.
X < Y  X is less than Y.
X = Y  X is equal to Y.
X ≥ Y  X is equal to or greater than Y.
X ≤ Y  X is equal to or less than Y.
X ≠ Y  X is not equal to Y.
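The factorial, binomial-coefficient, and absolute-value definitions above are simple to verify directly. The following sketch in Python (a modern addition for illustration; the particular values come from the glossary's own examples) checks them against the definitions just given:

```python
from math import comb, factorial

# Factorial: N! = N(N - 1)(N - 2) ... 1, with 0! defined as 1 (Table S).
assert factorial(5) == 5 * 4 * 3 * 2 * 1 == 120
assert factorial(0) == 1

# Binomial coefficient: (N x) = N! / (x! (N - x)!) (Table T).
assert comb(5, 2) == factorial(5) // (factorial(2) * factorial(3)) == 10

# |X - Y|: the numerical value of the difference regardless of sign.
assert abs(5 - 3) == abs(3 - 5) == 2
```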
CHAPTER 1

INTRODUCTION
The student of the behavioral sciences soon grows accustomed to using familiar words in initially unfamiliar ways. Early in his study he learns that when a behavioral scientist speaks of society he is not referring to that leisured group of persons whose names appear in the society pages of our newspapers. He knows that the scientific denotation of the term personality has little or nothing in common with the teen-ager's meaning. Although a high school student may contemptuously dismiss one of his peers for having "no personality," the behavioral scientist can scarcely conceive such a condition. The student has learned that culture, when used technically, encompasses far more than aesthetic refinement. And he would not now be caught in the blunder of saying that a salesman "uses" psychology in persuading a customer to purchase his wares.
Similarly, the student has discovered that the field of statistics is quite different from the common conception of it. In the newspapers and in other journals of popular thinking, the statistician is represented as one who collects large amounts of quantitative information and then abstracts certain representative numbers from that information. We are all familiar with the notion that the determination of the average hourly wage in an industry or of the average number of children in urban American families is the statistician's job. But the student who has taken even one introductory course in statistics knows that description is only one function of statistics.
A central topic of modern statistics is statistical inference. Statistical inference is concerned with two types of problems: estimation of population parameters and tests of hypotheses. It is with the latter type, tests of hypotheses, that we shall be primarily concerned in this book. Webster tells us that the verb "to infer" means "to derive as a consequence, conclusion, or probability." When we see a woman who wears no ring on the third finger of her left hand, we may infer that she is unmarried.
In statistical inference, we are concerned with how to draw conclusions about a large number of events on the basis of observations of a portion of them. Statistics provides tools which formalize and standardize our
procedures for drawing conclusions. For example, we might wish to determine which of three varieties of tomato sauce is most popular with American homemakers. Informally, we might gather information on this question by stationing ourselves near the tomato sauce counter at a grocery store and counting how many cans of each variety are purchased in the course of a day. Almost certainly the numbers of purchases of the three varieties will be unequal. But can we infer that the one most frequently chosen on that day in that store by that day's customers is really the most popular among American homemakers? Whether we can make such an inference must depend on the margin of popularity held by the most frequently chosen brand, on the representativeness of the grocery store, and also on the representativeness of the group of purchasers whom we observed.

The procedures of statistical inference introduce order into any attempt to draw conclusions from the evidence provided by samples. The logic of the procedures dictates some of the conditions under which the evidence must be collected, and statistical tests determine how large the observed differences must be before we can have confidence that they represent real differences in the larger group from which only a few events were sampled.
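To make the grocery-store question concrete, suppose (the counts here are invented purely for illustration) that of 100 cans bought in one day, 45, 30, and 25 were of the three varieties. One way to ask whether so uneven a split could easily arise by chance is the χ² one-sample test treated in Chap. 4; a minimal sketch in Python:

```python
# Hypothetical one-day purchase counts for three tomato-sauce varieties.
observed = [45, 30, 25]
n = sum(observed)

# Under the null hypothesis of equal popularity, each variety is
# expected to account for one third of all purchases.
expected = [n / 3] * 3

# Chi-square statistic: the sum over categories of (O - E)^2 / E.
chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# The critical value of chi square for df = 2 at the .05 level is 5.99
# (Table C of the Appendix); a larger obtained value leads us to reject
# the hypothesis of equal popularity.
print(round(chi_sq, 2), chi_sq > 5.99)   # -> 6.5 True
```

With these invented counts the obtained χ² of 6.5 exceeds 5.99, so the unevenness is larger than chance alone would readily produce.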
A common problem for statistical inference is to determine, in terms of probability, whether observed differences between two samples signify that the populations sampled are themselves really different. Now whenever we collect two groups of scores by random methods we are likely to find that the scores differ to some extent. Differences occur simply because of the operations of chance. Then how can we determine in any given case whether the observed differences are merely due to chance or not? The procedures of statistical inference enable us to determine, in terms of probability, whether the observed difference is within the range which could easily occur by chance or whether it is so large that it signifies that the two samples are probably from two different populations. Another common problem is to determine whether it is likely that a sample of scores is from some specified population. Still another is to decide whether we may legitimately infer that several groups differ among themselves. We shall be concerned with each of these tasks for statistical inference in this book.
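That chance alone produces differences between randomly drawn samples can be demonstrated by simulation (a modern convenience, not a method of the text): draw two samples from one and the same population, many times over, and compare their means. A sketch in Python, with an arbitrary population of scores:

```python
import random

random.seed(0)  # fixed seed so the demonstration is repeatable

# A single population of scores 1 through 10; BOTH samples are drawn
# from it, so any difference between the two sample means is due to
# the operations of chance alone.
population = range(1, 11)
chance_diffs = []
for _ in range(1000):
    a = random.choices(population, k=20)
    b = random.choices(population, k=20)
    chance_diffs.append(abs(sum(a) / 20 - sum(b) / 20))

# The two means practically never agree exactly, and sizable gaps occur
# now and then; a statistical test is what tells us when an observed
# difference is too large to be such a chance difference.
print(sum(d > 0 for d in chance_diffs) / 1000)  # proportion of trials that differ
print(max(chance_diffs))                        # largest pure-chance gap seen
```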
In the development of modern statistical methods, the first techniques of inference which appeared were those which made a good many assumptions about the nature of the population from which the scores were drawn. Since population values are "parameters," these statistical techniques are called parametric. For example, a technique of inference may be based on the assumption that the scores were drawn from a normally distributed population. Or the technique of inference may be based on
the assumption that both sets of scores were drawn from populations having the same variance (σ²) or spread of scores. Such techniques produce conclusions which contain qualifiers, e.g., "If the assumptions regarding the shape of the population(s) are valid, then we may conclude that . . . ."

More recently we have seen the development of a large number of techniques of inference which do not make numerous or stringent assumptions about parameters. These newer "distribution-free" or nonparametric techniques result in conclusions which require fewer qualifications. Having used one of them, we may say that "Regardless of the shape of the population(s), we may conclude that . . . ." It is with these newer techniques that we shall be concerned in this book.
Some nonparametric techniques are often called "ranking tests" or "order tests," and these titles suggest another way in which they differ from the parametric tests. In the computation of parametric tests, we add, divide, and multiply the scores from the samples. When these arithmetic processes are used on scores which are not truly numerical, they naturally introduce distortions in those data and thus throw in doubt any conclusions from the test. Thus it is permissible to use the parametric techniques only with scores which are truly numerical. Many nonparametric tests, on the other hand, focus on the order or ranking of the scores, not on their "numerical" values, and other nonparametric techniques are useful with data for which even ordering is impossible (i.e., with classificatory data). Whereas a parametric test may focus on the difference between the means of two sets of scores, the equivalent nonparametric test may focus on the difference between the medians. The computation of the mean requires arithmetic manipulation (addition and then division); the computation of the median requires only counting. The advantages of order statistics for data in the behavioral sciences (in which "numerical" scores may be precisely numerical in appearance only!) should be apparent. We shall discuss this point at greater length in Chap. 3, in which the parametric and nonparametric tests are contrasted.
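The contrast between the two computations can be made concrete with a small sketch in Python (the scores are hypothetical). The median is recovered by ordering and counting alone, so it is untouched when an extreme score is numerical in appearance only:

```python
# Hypothetical scores; suppose the 96 really means only "96 or more,"
# i.e., it is exact as a rank but not as a number.
scores = [3, 5, 4, 7, 96]

# The mean requires arithmetic on the scores: addition, then division.
mean = sum(scores) / len(scores)

# The median requires only ordering the scores and counting to the middle.
ordered = sorted(scores)
median = ordered[len(ordered) // 2]   # middle score of an odd-sized sample

print(mean)    # -> 23.0  pulled far upward by the dubious extreme score
print(median)  # -> 5     depends only on the order of the scores
```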
Of the nine chapters contained in this book, six are devoted to the presentation of the various nonparametric statistical tests. The tests are assigned to chapters according to the research design for which they are appropriate. One chapter contains those tests which may be used when one wishes to determine whether a single sample is from a specified sort of population. Two chapters contain those tests which may be used when one wishes to compare the scores yielded by two samples; one of these chapters considers tests for two related samples, while the other considers tests for two independent samples. Similarly, two chapters
are devoted to significance tests for k (3 or more) samples; one of these presents tests for k related samples and the other presents tests for k independent samples. The final chapter gives nonparametric measures of association, and the tests of significance which are useful with some of these.
Before the reader comes to these chapters, however, he will be confronted with two others in addition to the present one. The first of these, Chap. 2, is devoted to a general discussion of tests of hypotheses. Because this discussion is really little more than a summary statement about the elementary aspects of hypothesis testing, and because much of its vocabulary may be unfamiliar, some readers, especially those with little or no acquaintance with the theory of statistical inference, may find Chap. 2 a difficult one. We suggest that such persons would do well to turn to the references cited there for a more comprehensive treatment of the notions discussed. We hope, however, that for most readers Chap. 2 will provide sufficient background for understanding the balance of the book. The notions and vocabulary introduced in Chap. 2 are employed frequently and even repetitiously throughout the book, and therefore should become more familiar and meaningful as the reader repeatedly encounters them in succeeding chapters.

Chapter 3 discusses the choice of that statistical technique which is best suited for analyzing a given batch of data. This discussion includes a comparison of parametric and nonparametric statistical tests, and introduces the reader to the theory of measurement. Again the reader may find that he is facing much new material in a few pages. And again we suggest that the new material will become increasingly familiar as he progresses through the succeeding chapters.

We have tried to make this book fully comprehensible to the reader who has had only introductory work in statistics. It is presumed that the reader will have a speaking acquaintance with descriptive statistics (means, medians, standard deviations, etc.), with parametric correlational methods (particularly the Pearson product-moment correlation r), and with the basic notions of statistical
inference and their use in the t test and in the analysis of variance. The reader who has had even limited experience with these statistics and statistical tests should find the references to them comprehensible.
Moreover, we have tried to make the book completely intelligible to the reader whose mathematical training is limited to elementary algebra. This orientation has precluded our presenting many derivations. Where possible we have tried to convey an "intuitive" understanding of the
rationale underlying a test, and have thought that this understanding will be more useful than an attempt to follow the derivation.
The more
mathematically sophisticated reader will want to pursue the topics of this book by turning to the sources to which we have made reference.

Readers whose mathematical training is limited, and especially readers whose educational experience has been such that they have developed negative emotional responses to symbols, often find statistics books difficult because of the extensive use of symbols. Such readers may find that much of this difficulty will disappear if they read more slowly than is their custom. It is not to be expected that a reader schooled in the behavioral sciences can maintain the same fast clip in reading a statistics book that he maintains in reading a book on, say, personality or on intergroup hostility or on the role of geography in cultural differences. Statistical writing is more condensed than most social scientific writing; we use symbols for brevity as well as for exactness, and therefore it requires slower reading. The reader who finds symbols difficult may also be aided by the glossary which is included. That glossary summarizes the meanings of the various symbols used in the book.

The reader with limited mathematical training may also find the examples especially helpful: an example is given of the use in research of every statistical test. One reason that the extensive use of symbols makes material more difficult may be that symbols are general or abstract terms, which acquire a variety of specific meanings in a variety of specific cases. Thus when we speak of k samples we mean any number of samples, 3 or 4 or 8 or 5 or any other number. In the examples, of course, the symbols each acquire a specific numerical value, and thus the examples may serve to "concretize" the discussion for the reader.

The examples also serve to illustrate the role and importance of statistics in the research of the behavioral scientist. This may be their most useful function, for we have addressed this book to the researcher whose primary interest is in the substance or topical fields of the social sciences rather than in their methodology. The examples demonstrate the intimate interrelation of substance and method in the behavioral sciences.
CHAPTER 2

THE USE OF STATISTICAL TESTS IN RESEARCH

In the behavioral sciences we conduct research in order to determine the acceptability of hypotheses which we derive from our theories of behavior. Having selected a certain hypothesis which seems important in a certain theory, we collect empirical data which should yield direct information on the acceptability of that hypothesis. Our decision about
Our decision about
the meaning of the data may lead us to retain, revise, or reject the hypothesis and the theory which was its source.
In order to reach an objective decision as to whether a particular hypothesis is confirmed by a set of data, we must have an objective procedure for either rejecting or accepting that hypothesis. Objectivity is emphasized because one of the requirements of the scientific method is that one should arrive at scientific conclusions by methods which are public and which may be repeated by other competent investigators. This objective procedure should be based on the information we obtain in our research, and on the risk we are willing to take that our decision with respect to the hypothesis may be incorrect.
The procedure usually followed involves several steps. Here we list these steps in their order of performance; this and the following chapter are devoted to discussing each in some detail.
i. State the null hypothesis (H0).
ii. Choose a statistical test (with its associated statistical model) for testing H0. From among the several tests which might be used with a given research design, choose that test whose model most closely approximates the conditions of the research (in terms of the assumptions which qualify the use of the test) and whose measurement requirement is met by the measures used in the research.
iii. Specify a significance level (α) and a sample size (N).
iv. Find (or assume) the sampling distribution of the statistical test under H0.
v. On the basis of (ii), (iii), and (iv) above, define the region of rejection.
vi. Compute the value of the statistical test, using the data obtained from the sample(s). If that value is in the region of rejection, the decision is to reject H0; if that value is outside the region of rejection, the decision is that H0 cannot be rejected at the chosen level of significance.
A number of statistical tests are presented in this book. In most presentations, one or more examples of the use of the test in research are given. Each example follows the six steps given above. An understanding of the reason for each of these steps is central to an understanding of the role of statistics in testing a research hypothesis.

i. THE NULL HYPOTHESIS
The first step in the decision-making procedure is to state the null hypothesis (H0). The null hypothesis is a hypothesis of no differences. It is usually formulated for the express purpose of being rejected. If it is rejected, the alternative hypothesis (H1) may be accepted. The alternative hypothesis is the operational statement of the experimenter's research hypothesis. The research hypothesis is the prediction derived from the theory under test. When we want to make a decision about differences, we test H0 against H1. H1 constitutes the assertion that is accepted if H0 is rejected.
Suppose a certain social scientific theory would lead us to predict that two specified groups of people differ in the amount of time they spend in reading newspapers. This prediction would be our research hypothesis. Confirmation of that prediction would lend support to the social scientific theory from which it was derived. To test this research hypothesis, we state it in operational form as the alternative hypothesis, H1. H1 would be that μ1 ≠ μ2, that is, that the mean amount of time spent in reading newspapers by the members of the two populations is unequal. H0 would be that μ1 = μ2, that is, that the mean amount of time spent in reading newspapers by the members of the two populations is the same. If the data permit us to reject H0, then H1 can be accepted, and this would support the research hypothesis and its underlying theory.
The nature of the research hypothesis determines how H1 should be stated. If the research hypothesis simply states that two groups will differ with respect to means, then H1 is that μ1 ≠ μ2. But if the theory predicts the direction of the difference, i.e., that one specified group will have a larger mean than the other, then H1 may be either that μ1 > μ2 or that μ1 < μ2 (where > means "greater than" and < means "less than").

ii. THE CHOICE OF THE STATISTICAL TEST
The field of statistics has developed to the extent that we now have, for almost all research designs, alternative statistical tests which might be used in order to come to a decision about a hypothesis. Having alternative tests, we need some rational basis for choosing among them. Since this book is concerned with nonparametric statistics, the choice among (parametric and nonparametric) statistics is one of its central topics. Therefore the discussion of this point is reserved for a separate chapter; Chap. 3 gives an extended discussion of the bases for choosing among the various tests applicable to a given research design.

iii. THE LEVEL OF SIGNIFICANCE AND THE SAMPLE SIZE
When the null hypothesis and the alternative hypothesis have been stated, and when the statistical test appropriate to the research has been selected, the next step is to specify a level of significance (α) and to select a sample size (N).
In brief, this is our decision-making procedure: In advance of the data collection, we specify the set of all possible samples that could occur when H0 is true. From these, we specify a subset of possible samples which are so extreme that the probability is very small, if H0 is true, that the sample we actually observe will be among them. If in our research we then observe a sample which was included in that subset, we reject H0.
Stated differently, our procedure is to reject H0 in favor of H1 if a statistical test yields a value whose associated probability of occurrence under H0 is equal to or less than some small probability symbolized as α. That small probability is called the level of significance. Common values of α are .05 and .01. To repeat: if the probability associated with the occurrence under H0, i.e., when the null hypothesis is true, of the particular value yielded by a statistical test is equal to or less than α, we reject H0 and accept H1, the operational statement of the research hypothesis.¹
¹ In contemporary statistical decision theory, the procedure of adhering rigidly to an arbitrary level of significance, say .05 or .01, has been rejected in favor of the procedure of making decisions in terms of loss functions, utilizing such principles as the minimax principle (the principle of minimizing the maximum loss). For a discussion of this approach, the reader may turn to Blackwell and Girshick (1954), Savage (1954), or Wald (1950). Although the desirability of such a technique for arriving at decisions is clear, its practicality in most research in the behavioral sciences at present is dubious, because we lack the information which would be basic to the use of loss functions.
A common practice, which reflects the notion that different investigators and readers may hold different views as to the "losses" or "gains" involved in implementing a social scientific finding, is for the researcher simply to report the probability level associated with his finding, indicating that the null hypothesis may be rejected at that level.
From the discussion of significance levels which is given in this book, the reader should not infer that the writer believes in a rigid or hard-and-fast approach to the setting of significance levels. Rather, it is for heuristic reasons that significance levels are emphasized; such an exposition seems the best method of clarifying the role which the information contained in the sampling distribution plays in the decision-making procedure.
It can be seen, then, that α gives the probability of mistakenly or falsely rejecting H0. This interpretation of α will be amplified when the Type I error is discussed. Since the value of α enters into the determination of whether H0 is or is not rejected, the requirement of objectivity demands that α be set in advance of the collection of the data. The level at which the researcher chooses to set α should be determined by his estimate of the importance or possible practical significance of his findings. In a study of the possible therapeutic effects of brain surgery, for example, the researcher may well choose to set a rather stringent level of significance, for the dangers of rejecting the null hypothesis improperly (and therefore unjustifiably advocating or recommending a drastic clinical technique) are great indeed.
In reporting his findings, the researcher should indicate the actual probability level associated with his findings, so that the reader may use his own judgment in deciding whether or not the null hypothesis should be rejected. A researcher may decide to work at the .05 level, but a reader may refuse to accept any finding not significant at the .01, .005, or .001 levels, while another reader may be interested in any finding which reaches, say, the .08 or .10 levels. The researcher should give his readers the information they require by reporting, if possible, the probability level actually associated with the finding.
There are two types of errors which may be made in arriving at a decision about H0. The first, the Type I error, is to reject H0 when in fact it is true. The second, the Type II error, is to accept H0 when in fact it is false.
The probability of committing a Type I error is given by α. The larger is α, the more likely it is that H0 will be rejected falsely, i.e., the more likely it is that the Type I error will be committed. The probability of committing a Type II error is usually represented by β. α and β will be used here to indicate both the type of error and the probability of making that error. That is,

p(Type I error) = α
p(Type II error) = β
Ideally, the specific values of both α and β would be specified by the experimenter before he began his research. These values would determine the size of the sample (N) he would have to draw for computing the statistical test he had chosen. In practice, however, it is usual for α and N to be specified in advance. Once α and N have been specified, β is determined. Inasmuch as there is an inverse relation between the likelihood of making the two types of errors, a decrease in α will increase β for any given N. If we wish to reduce the possibility of both types of errors, we must increase N.
It should be clear that in any statistical inference a danger exists of committing one of the two alternative types of errors, and that the
experimenter should reach some compromise which optimizes the balance between the probabilities of making the two errors. The various statistical tests offer the possibility of different balances. It is in achieving this balance that the notion of the power function of a statistical test is relevant.
The power of a test is defined as the probability of rejecting H0 when it is in fact false. That is,

Power = 1 − p(Type II error) = 1 − β
The curves in Fig. 1 show that the probability of committing a Type II error (β) decreases as the sample size (N) increases, and thus that power increases with the size of N. Figure 1 illustrates the increase in power of the two-tailed test of the mean which comes with increasing sample sizes: N = 4, 10, 20, 50, and 100. These samples are taken from normal populations with variance σ². The mean under the null hypothesis is symbolized here as μ0.

Fig. 1. Power curves of the two-tailed test at α = .05 with varying sample sizes.
Figure 1 also shows that when H0 is true, i.e., when the true mean = μ0, the probability of rejecting H0 = .05. This is as it should be, inasmuch as α = .05, and α gives the probability of rejecting H0 when it is in fact true.
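The relation between N and power pictured in Fig. 1 can be checked numerically. The sketch below is an added illustration, not part of the original text: it estimates by simulation the power of the two-tailed z test at α = .05 when the true mean is shifted half a standard deviation away from μ0 (the shift and the function name are illustrative assumptions, not values from the text), for the sample sizes used in Fig. 1.

```python
import random
import statistics

def power_two_tailed_z(true_shift, n, alpha=0.05, trials=2000, seed=1):
    """Estimate by simulation the power of the two-tailed z test of
    H0: mu = mu0, when the true mean is shifted `true_shift` sigma units."""
    rng = random.Random(seed)
    crit = statistics.NormalDist().inv_cdf(1 - alpha / 2)  # 1.96 for alpha = .05
    rejections = 0
    for _ in range(trials):
        sample = [rng.gauss(true_shift, 1) for _ in range(n)]
        z = statistics.fmean(sample) * n ** 0.5  # (xbar - mu0) / (sigma/sqrt(n))
        if abs(z) > crit:
            rejections += 1
    return rejections / trials

for n in (4, 10, 20, 50, 100):
    print(n, power_two_tailed_z(0.5, n))
```

With the shift held fixed, the estimated rejection rate climbs toward 1 as N grows, mirroring the curves in the figure; with a zero shift (H0 true), the rejection rate stays near α = .05.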
From this discussion it is important that the reader understand the following five points, which summarize what we have said about the selection of the level of significance and of the sample size:
1. The significance level α is the probability that a statistical test will yield a value under which the null hypothesis will be rejected when in fact it is true. That is, the significance level indicates the probability of committing the Type I error.
2. β is the probability that a statistical test will yield a value under which the null hypothesis will be accepted when in fact it is false. That is, β gives the probability of committing the Type II error.
3. The power of a test, 1 − β, tells the probability of rejecting the null hypothesis when it is false (and thus should be rejected).
4. Power is related to the nature of the statistical test chosen.¹
5. Generally the power of a statistical test increases with an increase in N.

iv. THE SAMPLING DISTRIBUTION
When an experimenter has chosen a certain statistical test to use with his data, he must next determine what is the sampling distribution of the test statistic. The sampling distribution is a theoretical distribution. It is that distribution we would get if we took all possible samples of the same size from the same population, drawing each randomly. Another way of saying this is to say that the sampling distribution is the distribution, under H0, of all possible values that some statistic (say, the sample mean, X̄) can take when that statistic is computed from randomly drawn samples of equal size.
The sampling distribution of a statistic shows the probabilities under H0 associated with various possible numerical values of the statistic. The probability "associated with" the occurrence of a particular value of the statistic under H0 is not the exact probability of just that value. Rather, "the probability associated with the occurrence under H0" is here used to refer to the probability of a particular value plus the probabilities of all more extreme possible values. That is, the "associated probability" or "the probability associated with the occurrence under H0" is the probability of the occurrence under H0 of a value as extreme as or more extreme than the particular value of the test statistic. In this book we shall have frequent occasion to use the above phrases, and in each case they shall carry the meaning given above.
Suppose we were interested in the probability that three heads would land up when three "fair" coins were tossed simultaneously. The sampling distribution of the number of heads could be drawn from the list of all possible results of tossing three coins, which is given in Table 2.1. The total number of possible events (possible combinations of H's and T's, heads and tails) is eight, only one of which is the event in which we are interested: the simultaneous occurrence of three H's. Thus the probability of the occurrence under H0 of three heads on the toss of three coins is 1/8. Here H0 is the assertion that the coins are "fair," which means that for each coin the probability of a head occurring is equal to the probability of a tail occurring. Thus the sampling distribution of all possible events has shown us the probability of the occurrence under H0 of the event with which we are concerned.

TABLE 2.1. POSSIBLE OUTCOMES OF THE TOSS OF THREE COINS

Outcome:   1   2   3   4   5   6   7   8
Coin 1:    H   H   H   T   H   T   T   T
Coin 2:    H   H   T   H   T   H   T   T
Coin 3:    H   T   H   H   T   T   H   T

¹ Power is also related to the nature of H1. If H1 has direction, a one-tailed test is called for. A one-tailed test is more powerful than a two-tailed test. This should be clear from the definition of power.
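The enumeration behind Table 2.1 can be mechanized. The sketch below is an added illustration (not from the original text): it lists all 2³ equally likely outcomes of three fair coins and builds the sampling distribution of the number of heads under H0.

```python
from itertools import product
from collections import Counter

# All 2^3 = 8 equally likely outcomes of tossing three fair coins (Table 2.1).
outcomes = list(product("HT", repeat=3))

# Sampling distribution of the number of heads under H0 (the coins are "fair").
dist = Counter(toss.count("H") for toss in outcomes)

for heads in sorted(dist, reverse=True):
    print(heads, "heads:", dist[heads], "of", len(outcomes),
          "-> p =", dist[heads] / len(outcomes))
```

Only one of the eight outcomes is HHH, so the probability associated with three heads under H0 is 1/8 = .125, as in the discussion of Table 2.1.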
It is obvious that it would be essentially impossible for us to use this method of imagining all possible results in order to write down the sampling distributions for even moderately large samples from large populations. This being the case, we rely on the authority of statements of "proved" mathematical theorems. These theorems invariably involve assumptions, and in applying the theorems we must keep the assumptions in mind. Usually these assumptions concern the distribution of the population and/or the size of the sample. An example of such a theorem is the central-limit theorem.
When a variable is normally distributed, its distribution is completely characterized by the mean and the standard deviation. This being the case, we know, for example, that the probability that an observed value of such a variable will differ from the mean by more than 1.96 standard deviations is less than .05. (The probabilities associated with any difference in standard deviations from the mean of a normally distributed variable are given in Table A of the Appendix.)
Suppose then we want to know, before the sample is drawn, the probability associated with the occurrence of a particular value of X̄ (the arithmetic mean of the sample), i.e., the probability under H0 of the occurrence of a value at least as large as a particular value of X̄, when the sample is randomly drawn from some population whose mean μ and standard deviation σ we know. One version of the central-limit theorem states that:

If a variable is distributed with mean = μ and standard deviation = σ, and if random samples of size N are drawn, then the means of these samples, the X̄'s, will be approximately normally distributed with mean μ and standard deviation σ/√N for N sufficiently large.
In other words, if N is sufficiently large, we know that the sampling distribution of X̄ (a) is approximately normal, (b) has a mean equal to the population mean μ, and (c) has a standard deviation which is equal to the population standard deviation divided by the square root of the sample size, that is, σX̄ = σ/√N.
For example, suppose we know that in the population of American college students, some psychological attribute, as measured by some test, is distributed with μ = 100 and σ = 16. Now we want to know the probability of drawing a random sample of 64 cases from this population and finding that the mean score in that sample, X̄, is as large as 104. The central-limit theorem tells us that the sampling distribution of X̄'s of all possible samples of size 64 will be approximately normally distributed and will have a mean equal to 100 (μ = 100) and a standard deviation equal to σ/√N = 16/√64 = 2. We can see that 104 differs from 100 by two standard errors.¹ Reference to Table A reveals that the probability associated with the occurrence under H0 of a value as large as such an observed value of X̄, that is, of an X̄ which is at least two standard errors above the mean (z ≥ 2.0), is p = .023.
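The arithmetic in this example is easy to reproduce. The sketch below is an added illustration using the example's values (μ = 100, σ = 16, N = 64, X̄ = 104): it computes the standard error, the z score, and the associated upper-tail probability from the normal distribution, in place of looking it up in Table A.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 100, 16, 64   # population mean and sd, sample size (from the example)
xbar = 104                   # observed sample mean

se = sigma / sqrt(n)         # standard error of the mean: 16/8 = 2
z = (xbar - mu) / se         # 104 lies two standard errors above 100
p = 1 - NormalDist().cdf(z)  # P(Xbar >= 104 | H0), the upper-tail probability

print(se, z, round(p, 3))    # 2.0 2.0 0.023
```

The computed tail probability, .023, agrees with the value read from the normal table.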
It should be clear from this discussion and this example that by knowing the sampling distribution of some statistic we are able to make probability statements about the occurrence of certain numerical values of that statistic. The following sections will show how we use such a probability statement in making a decision about H0.

v. THE REGION OF REJECTION
The region of rejection is a region of the sampling distribution. The sampling distribution includes all possible values a test statistic can take under H0; the region of rejection consists of a subset of these possible values, and is defined so that the probability under H0 of the occurrence of a test statistic having a value which is in that subset is α. In other words, the region of rejection consists of a set of possible values which are so extreme that when H0 is true the probability is very small (i.e., the probability is α) that the sample we actually observe will yield a value which is among them. The probability associated with any value in the region of rejection is equal to or less than α.
The location of the region of rejection is affected by the nature of H1. If H1 indicates the predicted direction of the difference, then a one-tailed test is called for. If H1 does not indicate the direction of the predicted difference, then a two-tailed test is called for. One-tailed and two-tailed tests differ in the location (but not in the size) of the region of rejection. That is, in a one-tailed test the region of rejection is entirely at one end
¹ The standard deviation of a sampling distribution is usually called a standard error.
(or tail) of the sampling distribution. In a two-tailed test, the region of rejection is located at both ends of the sampling distribution.
The size of the region of rejection is expressed by α, the level of significance. If α = .05, then the size of the region of rejection is 5 per cent of the entire space included under the curve in the sampling distribution. One-tailed and two-tailed regions of rejection for α = .05 are illustrated in Fig. 2. Observe that these two regions differ in location but not in total size.

Fig. 2. Regions of rejection for one-tailed and two-tailed tests. (A: darkened area shows the one-tailed region of rejection when α = .05. B: darkened area shows the two-tailed region of rejection when α = .05.)

vi. THE DECISION

If the statistical test yields a value which is in the region of rejection, we reject H0. The reasoning behind this decision process is very simple. If the probability associated with the occurrence under the null hypothesis of a particular value in the sampling distribution is very small, we may explain the actual occurrence of that value in two ways: first, we may explain it by deciding that the null hypothesis is false, or second, we may explain it by deciding that a rare and unlikely event has occurred. In the decision process, we choose the first of these explanations. Occasionally, of course, the second may be the correct one. In fact, the probability that the second explanation is the correct one is given by α, for rejecting H0 when in fact it is true is the Type I error.
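For the normal sampling distribution, the two kinds of regions of rejection illustrated in Fig. 2 correspond to different critical values of z. The sketch below is an added illustration (not from the original text) that finds both critical values at α = .05 with the inverse normal CDF.

```python
from statistics import NormalDist

alpha = 0.05
nd = NormalDist()

one_tailed = nd.inv_cdf(1 - alpha)       # z >= 1.645: all 5% in the upper tail
two_tailed = nd.inv_cdf(1 - alpha / 2)   # |z| >= 1.96: 2.5% in each tail

print(round(one_tailed, 3), round(two_tailed, 3))
```

Either way, the total probability of the region of rejection under H0 is .05; only its location differs, which is why the one-tailed critical value is the smaller of the two.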
When the probability associated with an observed value of a statistical test is equal to or less than the previously determined value of α, we conclude that H0 is false. Such an observed value is called "significant." H0, the hypothesis under test, is rejected whenever a "significant" result occurs. A "significant" value is one whose associated probability of occurrence under H0 (as shown by the sampling distribution) is equal to or less than α.

EXAMPLE
In the discussions of the various nonparametric statistical tests, many examples of statistical decisions will be given in this book. Here we shall give just one example of how a statistical decision is reached, to illustrate the points made in this chapter.
Suppose we suspect a particular coin of being biased. Our suspicion is that the coin is biased to land with head up. To test this suspicion (which we here may dignify by calling it a "research hypothesis"), we decide to toss the coin 12 times and to observe the frequency with which head occurs.
i. Null Hypothesis. H0: p(H) = p(T) = ½. That is, for this coin there is no difference between the probability of the occurrence of a head, that is, p(H), and the probability of the occurrence of a tail, that is, p(T); the coin is "fair." H1: p(H) > p(T).
ii. Statistical Test. The statistical test which is appropriate to test this hypothesis is the binomial test, which is based on the binomial expansion. (This test is presented fully in Chap. 4.)
iii. Significance Level. In advance we decide to use α = .01 as our level of significance. N = 12 = the number of independent tosses.
iv. Sampling Distribution. The sampling distribution which gives the probability of obtaining x heads and N − x tails under the null hypothesis (the hypothesis that the coin is in fact fair) is given by the binomial distribution function:

p(x) = [N!/(x!(N − x)!)] P^x Q^(N−x),   x = 0, 1, 2, ..., N

Table 2.2 shows the sampling distribution of x, the number of heads.

TABLE 2.2. SAMPLING DISTRIBUTION OF x (NUMBER OF HEADS) FOR 2^12 SAMPLES OF SIZE N = 12

Number of heads   Sampling distribution (expected frequency of occurrence if 2^12 samples of 12 tosses were taken)
12 ............ 1
11 ............ 12
10 ............ 66
 9 ............ 220
 8 ............ 495
 7 ............ 792
 6 ............ 924
 5 ............ 792
 4 ............ 495
 3 ............ 220
 2 ............ 66
 1 ............ 12
 0 ............ 1
Total ......... 2^12 = 4,096

This sampling distribution shows that the most likely outcome of tossing a coin 12 times is to obtain 6 heads and 6 tails. Obtaining 7 heads and 5 tails is somewhat less likely but still quite probable. But the occurrence of 12 heads on 12 tosses is very unlikely indeed. The occurrence of 0 heads (12 tails) is equally unlikely.
v. Region of Rejection. Since H1 has direction, a one-tailed test will be used, and thus the region of rejection is entirely at one end of the sampling distribution. The region of rejection consists of all values of x (number of heads) so large that the probability associated with their occurrence under H0 is equal to or less than α = .01.
The probability of obtaining 12 heads is 1/4,096 = .00024. Since p = .00024 is smaller than α = .01, clearly the occurrence of 12 heads would be in the region of rejection.
The probability of obtaining either 11 or 12 heads is 1/4,096 + 12/4,096 = 13/4,096 = .0032. Since p = .0032 is smaller than α = .01, the occurrence of 11 heads would also be in the region of rejection.
The probability of obtaining 10 heads (or a value more extreme: 11 or 12 heads) is 1/4,096 + 12/4,096 + 66/4,096 = 79/4,096 = .019. Since p = .019 is larger than α = .01, the occurrence of 10 heads would not be in the region of rejection. That is, if 10 or fewer heads turn up in our sample of 12 tosses we cannot reject H0 at the α = .01 level of significance.
vi. Decision. Suppose in our sample of tosses we obtain 11 heads. The probability associated with an occurrence as extreme as this one is p = .0032. Inasmuch as this p is smaller than our previously set level of significance (α = .01), our decision is to reject H0 in favor of H1. We conclude that the coin is biased to land head up.
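The tail probabilities in this example come straight from the binomial coefficients in Table 2.2. The sketch below is an added illustration (not from the original text) that recomputes them and applies the decision rule at α = .01.

```python
from math import comb

N = 12          # number of independent tosses
alpha = 0.01    # significance level chosen in advance

def p_at_least(x, n=N):
    """P(x or more heads in n tosses) under H0: the coin is fair."""
    return sum(comb(n, k) for k in range(x, n + 1)) / 2 ** n

print(round(p_at_least(12), 5))  # 1/4,096  = 0.00024 -> in the region of rejection
print(round(p_at_least(11), 4))  # 13/4,096 = 0.0032  -> in the region of rejection
print(round(p_at_least(10), 3))  # 79/4,096 = 0.019   -> not in the region

observed = 11
print("reject H0" if p_at_least(observed) <= alpha else "do not reject H0")
```

With 11 observed heads the associated probability, .0032, falls below α, so the decision is to reject H0, exactly as in step vi.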
This chapter has discussed the procedure for making a decision as to whether a particular hypothesis, as operationally defined, should be accepted or rejected in terms of the information yielded by the research. Chapter 3 completes the general discussion by going into the question of how one may choose the most appropriate statistical test for use with one's research data. (This choice is step ii in the procedure outlined above.) The discussion in Chap. 3 clarifies the conditions under which the parametric tests are optimum and indicates the conditions under which nonparametric tests are more appropriate.
The reader who wishes to gain a more comprehensive or fundamental understanding of the topics summarized in bare outline in the present chapter may refer to Dixon and Massey (1950, chap. 14) for an unusually clear introductory discussion of power functions and of the two types of errors, and to Anderson and Bancroft (1952, chap. 11) or Mood (1950, chap. 12) for more advanced discussions of the theory of testing hypotheses.
CHAPTER 3

CHOOSING AN APPROPRIATE STATISTICAL TEST
When alternative statistical tests are available for a given research design, as is very often the case, it is necessary to employ some rationale for choosing among them. In Chap. 2 we presented one criterion to use in choosing among alternative statistical tests: the criterion of power. In this chapter other criteria will be presented.
The reader will remember that the power of a statistical analysis is partly a function of the statistical test employed in the analysis. A statistical test is a good one if it has a small probability of rejecting H0 when H0 is true, but a large probability of rejecting H0 when H0 is false. Suppose we find two statistical tests, A and B, which have the same probability of rejecting H0 when it is true. It might seem that we should simply select the one that has the larger probability of rejecting H0 when it is false.
However, there are considerations other than power which enter into the choice of a statistical test. In this choice we must consider the manner in which the sample of scores was drawn, the nature of the population from which the sample was drawn, and the kind of measurement or scaling which was employed in the operational definitions of the variables involved, i.e., in the scores. All these matters enter into determining which statistical test is optimum or most appropriate for analyzing a particular set of research data.

THE STATISTICAL MODEL
When we have asserted the nature of the population and the manner of sampling, we have established a statistical model. Associated with every statistical test is a model and a measurement requirement; the test is valid under certain conditions, and the model and the measurement requirement specify those conditions. Sometimes we are able to test whether the conditions of a particular statistical model are met, but more often we have to assume that they are met. Thus the conditions of the statistical model of a test are often called the "assumptions" of the test. All decisions arrived at by the use of any statistical test must carry with them this qualification: "If the model used was correct, and if the measurement requirement was satisfied, then . . . ."
It is obvious that the fewer or weaker are the assumptions that define a particular model, the less qualifying we need to do about our decision arrived at by the statistical test associated with that model. That is, the fewer or weaker are the assumptions, the more general are the conclusions.
However, the most powerful tests are those which have the strongest or most extensive assumptions. The parametric tests, for example the t or F tests, have a variety of strong assumptions underlying their use. When those assumptions are valid, these tests are the most likely of all tests to reject H0 when H0 is false. That is, when research data may appropriately be analyzed by a parametric test, that test will be more powerful than any other in rejecting H0 when it is false. Notice, however, the requirement that the research data must be appropriate for the test. What constitutes such appropriateness? What are the conditions that are associated with the statistical model and the measurement requirement underlying, say, the t test? The conditions which must be satisfied to make the t test the most powerful one, and in fact before any confidence can be placed in any probability statement obtained by the use of the t test, are at least these:
1. The observations must be independent. That is, the selection of any one case from the population for inclusion in the sample must not bias the chances of any other case for inclusion, and the score which is assigned to any case must not bias the score which is assigned to any other case.
2. The observations must be drawn from normally distributed populations.
3. These populations must have the same variance (or, in special cases, they must have a known ratio of variances).
4. The variables involved must have been measured in at least an interval scale, so that it is possible to use the operations of arithmetic (adding, dividing, finding means, etc.) on the scores.
In the case of the analysis of variance (the F test), another condition is added to those already given:
5. The means of these normal and homoscedastic populations must be linear combinations of effects due to columns and/or rows. That is, the effects must be additive.
All the above conditions [except (4), which states the measurement requirement] are elements of the parametric statistical model. With the possible exception of the assumption of homoscedasticity (equal variances), these conditions are ordinarily not tested in the course of the performance of a statistical analysis. Rather, they are presumptions
which are accepted, and their truth or falsity determines the meaningfulness of the probability statement arrived at by the parametric test.
When we have reason to believe that these conditions are met in the data under analysis, then we should certainly choose a parametric statistical test, such as t or F, for analyzing those data. Such a choice is optimum because the parametric test will be most powerful for rejecting H0 when it should be rejected.
But what if these conditions are not met? What happens when the population is not normally distributed? What happens when the measurement is not so strong as an interval scale? What happens when the populations are not equal in variance?
When the assumptions constituting the statistical model for a test are in fact not met, or when the measurement is not of the required strength, then it is difficult if not impossible to say what is really the power of the test. It is even difficult to estimate the extent to which a probability statement about the hypothesis in question is meaningful when that probability statement results from the unacceptable application of a test. Although some empirical evidence has been gathered to show that slight deviations in meeting the assumptions underlying parametric tests may not have radical effects on the obtained probability figure, there is as yet no general agreement as to what constitutes a "slight" deviation.

POWER-EFFICIENCY
We have already noticed that the fewer or weaker are the assumptions that constitute a particular model, the more general are the conclusions derived from the application of the statistical test associated with that model, but the less powerful is the test of H0. This assertion is generally true for any given sample size. But it may not be true in the comparison of two statistical tests which are applied to two samples of unequal size. That is, if N = 30 in both instances, test A may be more powerful than test B. But the same test B may be more powerful with N = 30 than is test A with N = 20. In other words, we can avoid the dilemma of having to choose between power and generality by selecting a statistical test which has broad generality and then increasing its power to that of the most powerful test available by enlarging the size of the sample.
The concept of power-efficiency is concerned with the amount of increase in sample size which is necessary to make test B as powerful as test A. If test A is the most powerful known test of its type (when used with data which meet its conditions), and if test B is another test for the same research design which is just as powerful with Nb cases as is test A with Na cases, then

Power-efficiency of test B = (100)(Na/Nb) per cent

For example, if test B requires a sample of N = 25 cases to have the same power as test A has with N = 20 cases, then test B has power-efficiency of (100)(20/25) per cent, i.e., its power-efficiency is 80 per cent. A power-efficiency of 80 per cent means that in order to equate the power of test A and test B (when all the conditions of both tests are met, and when test A is the more powerful) we need to draw 10 cases for test B for every 8 cases drawn for test A.
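The computation can be expressed as a minimal sketch (the function name is ours, for illustration only):

```python
def power_efficiency(n_a, n_b):
    """Power-efficiency of test B, in per cent, where test B needs n_b
    cases to match the power that test A attains with n_a cases."""
    return 100.0 * n_a / n_b

# The example above: test B needs N = 25 to match test A with N = 20.
print(power_efficiency(20, 25))  # prints 80.0
```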
Thus we can avoid having to meet some of the assumptions of the most powerful tests, the parametric tests, without losing power by simply choosing a different test and drawing a larger N. In other words, by choosing another statistical test with fewer assumptions in its model and thus with greater generality than the t and F tests, and by enlarging our N, we can avoid having to make assumptions 2, 3, and 5 above, and still retain equivalent power to reject H0.
Two other conditions, 1 and 4 above, underlie parametric statistical tests. Assumption 1, that the scores are independently drawn from the population, is an assumption which underlies all statistical tests, parametric or nonparametric. But assumption 4, which concerns the strength of measurement required for parametric tests (measurement must be at least in an interval scale), is not shared by all statistical tests. Different tests require measurement of different strengths. In order to understand the measurement requirements of the various statistical tests, the reader should be conversant with some of the basic notions in the theory of measurement. The discussion of measurement which occupies the next few pages gives the required information.

MEASUREMENT
When a physical scientist talks about measurement, he usually means the assigning of numbers to observations in such a way that the numbers are amenable to analysis by manipulation or operation according to certain rules. This analysis by manipulation will reveal new information about the objects being measured. In other words, the relation between the things being observed and the numbers assigned to the observations is so direct that by manipulating the numbers the physical scientist obtains new information about the things. For example, he may determine how much a homogeneous mass of material would weigh if cut in half by simply dividing its weight by 2.
The social scientist, taking physics as his model, usually attempts to
do likewise in his scoring or measurement of social variables. But in his scaling the social scientist very often overlooks a fundamental fact in measurement theory. He overlooks the fact that in order for him to be able to make certain operations with numbers that have been assigned to observations, the structure of his method of mapping numbers (assigning scores) to observations must be isomorphic to some numerical structure which includes these operations. If two systems are isomorphic, their structures are the same in the relations and operations they allow. For example, if a researcher collects data made up of numerical scores and then manipulates these scores by, say, adding and dividing (which are necessary operations in finding means and standard deviations), he is assuming that the structure of his measurement is isomorphic to that numerical structure known as arithmetic. That is, he is assuming that he has attained a high level of measurement.
The theory of measurement consists of a set of separate or distinct theories, each concerning a distinct level of measurement. The operations allowable on a given set of scores are dependent on the level of measurement achieved. Here we will discuss four levels of measurement (nominal, ordinal, interval, and ratio) and will discuss the operations and thus the statistics and statistical tests that are permitted with each level.

The Nominal or Classificatory Scale
Definition. Measurement at its weakest level exists when numbers or other symbols are used simply to classify an object, person, or characteristic. When numbers or other symbols are used to identify the groups to which various objects belong, these numbers or symbols constitute a nominal or classificatory scale.
Examples. The psychiatric system of diagnostic groups constitutes a nominal scale. When a diagnostician identifies a person as "schizophrenic," "paranoid," "manic-depressive," or "psychoneurotic," he is using a symbol to represent the class of persons to which this person belongs, and thus he is using nominal scaling.
The numbers on automobile license plates constitute a nominal scale. If the assignment of plate numbers is purely arbitrary, then each plated car is a member of a unique subclass. But if, as is common in the United States, a certain number or letter on the license plate indicates the county in which the car owner resides, then each subclass in the nominal scale consists of a group of entities: all owners residing in the same county. Here the assignment of numbers must be such that the same number (or letter) is given to all persons residing in the same county and that different numbers (or letters) are given to people residing in different counties. That is, the number or letter on the license plate must clearly indicate to which of a set of mutually exclusive subclasses the owner belongs.
Numbers on football jerseys and social-security numbers are other examples of the use of numbers in nominal scaling.
Formal properties. All scales have certain formal properties. These properties provide fairly exact definitions of the scale's characteristics, more exact definitions than we can give in verbal terms. These properties may be formulated more abstractly than we have done here by a set of axioms which specify the operations of scaling and the relations among the objects that have been scaled.
In a nominal scale, the scaling operation is partitioning a given class into a set of mutually exclusive subclasses. The only relation involved is that of equivalence. That is, the members of any one subclass must be equivalent in the property being scaled. This relation is symbolized by the familiar sign: =. The equivalence relation is reflexive, symmetrical, and transitive.¹
Admissible operations. Since in any nominal scale the classification may be equally well represented by any set of symbols, the nominal scale is said to be "unique up to a one-to-one transformation." The symbols designating the various subclasses in the scale may be interchanged, if this is done consistently and completely. For example, when new license plates are issued, the license number which formerly stood for one county can be interchanged with that which had stood for another county. Nominal scaling would be preserved if this change-over were performed consistently and thoroughly in the issuing of all license plates. Such one-to-one transformations are sometimes called "the symmetric group of transformations."
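This invariance is easy to demonstrate in a short sketch (the category symbols and the relabeling are invented for illustration): frequency counts, and hence the mode, are untouched by any one-to-one transformation of a nominal scale's symbols.

```python
from collections import Counter

# County symbols on license plates: a nominal scale.
owners = ["A", "B", "A", "C", "A", "B"]

# A one-to-one relabeling of the symbols, applied consistently.
relabel = {"A": "7", "B": "4", "C": "9"}
relabeled = [relabel[s] for s in owners]

# The frequency distribution, and so the mode, is preserved.
print(Counter(owners).most_common())     # [('A', 3), ('B', 2), ('C', 1)]
print(Counter(relabeled).most_common())  # [('7', 3), ('4', 2), ('9', 1)]
```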
Since the symbols which designate the various groups on a nominal scale may be interchanged without altering the essential information in the scale, the only kinds of admissible descriptive statistics are those which would be unchanged by such a transformation: the mode, frequency counts, etc. Under certain conditions, we can test hypotheses regarding the distribution of cases among categories by using the nonparametric statistical test χ², or by using a test based on the binomial expansion. These tests are appropriate for nominal data because they focus on frequencies in categories, i.e., on enumerative data. The most common measure of association for nominal data is the contingency coefficient, C, a nonparametric statistic.

The Ordinal or Ranking Scale
Definition. It may happen that the objects in one category of a scale are not just different from the objects in other categories of that scale, but that they stand in some kind of relation to them. Typical relations among classes are: higher, more preferred, more difficult, more disturbed, more mature, etc. Such relations may be designated by the carat (>) which, in general, means "greater than." In reference to particular scales, > may be used to designate is preferred to, is higher than, is more difficult than, etc. Its specific meaning depends on the nature of the relation that defines the scale.

¹ Reflexive: x = x for all values of x. Symmetrical: if x = y, then y = x. Transitive: if x = y and y = z, then x = z.
Given a group of equivalence classes (i.e., given a nominal scale), if the relation > holds between some but not all pairs of classes, we have a partially ordered scale. If the relation > holds for all pairs of classes so that a complete rank ordering of classes arises, we have an ordinal scale.
Examples. Socioeconomic status, as conceived by Warner and his associates,¹ constitutes an ordinal scale. In prestige or social acceptability, all members of the upper middle class are higher than (>) all members of the lower middle class. The lower middles, in turn, are higher than the upper lowers. The = relation holds among members of the same class, and the > relation holds between any pair of classes.
The system of grades in the military services is another example of an ordinal scale. Sergeant > corporal > private.
Many personality inventories and tests of ability or aptitude result in scores which have the strength of ranks. Although the scores may appear to be more precise than ranks, generally these scales do not meet the requirements of any higher level of measurement and may properly be viewed as ordinal.
Formal properties. Axiomatically, the fundamental difference between a nominal and an ordinal scale is that the ordinal scale incorporates not only the relation of equivalence (=) but also the relation "greater than" (>). The latter relation is irreflexive, asymmetrical, and transitive.²
Admissible operations. Since any order-preserving transformation does not change the information contained in an ordinal scale, the scale is said to be "unique up to a monotonic transformation." That is, it does not matter what numbers we give to a pair of classes or to members of those classes, just as long as we give a higher number to the members of the class which is "greater" or "more preferred." (Of course, one may use the lower numbers for the "more preferred" grades. Thus we usually refer to excellent performance as "first-class," and to progressively inferior performances as "second-class" and "third-class." So long as we are consistent, it does not matter whether higher or lower numbers are used to denote "greater" or "more preferred.")

¹ Warner, W. L., Meeker, M., and Eells, K. 1949. Social Class in America. New York: Science Research Associates.
² Irreflexive: it is not true for any x that x > x. Asymmetrical: if x > y, then y > x does not hold. Transitive: if x > y and y > z, then x > z.
For example, a corporal in the army wears two stripes on his sleeve and a sergeant wears three. These insignia denote that sergeant > corporal. This relation would be equally well expressed if the corporal wore four stripes and the sergeant wore seven. That is, a transformation which does not change the order of the classes is completely admissible because it does not involve any loss of information. Any or all the numbers applied to classes in an ordinal scale may be changed in any fashion which does not alter the ordering (ranking) of the objects.
The statistic most appropriate for describing the central tendency of scores in an ordinal scale is the median, since the median is not affected by changes of any scores which are above or below it as long as the number of scores above and below remains the same. With ordinal scaling, hypotheses can be tested by using that large group of nonparametric statistical tests which are sometimes called "order statistics" or "ranking statistics." Correlation coefficients based on rankings (e.g., the Spearman rS or the Kendall τ) are appropriate.
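As a concrete sketch (the anxiety and adjustment scores are invented, and the Spearman coefficient is computed from first principles rather than with a statistics library), note that rS depends only on the ranks, so any order-preserving transformation of the scores leaves it unchanged:

```python
def ranks(xs):
    """Average ranks (1-based), assigning midranks to tied scores."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1  # extend over a block of tied scores
        midrank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = midrank
        i = j + 1
    return r

def spearman_rs(x, y):
    """Spearman rS: the Pearson correlation computed on the ranks."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5

anxiety = [3, 1, 4, 2, 5]
adjustment = [2, 1, 5, 3, 4]
# Squaring is a monotonic transformation here (all scores positive),
# so it changes the numbers but not the ranks, and rS is unchanged.
print(spearman_rs(anxiety, adjustment))                  # 0.8
print(spearman_rs([a * a for a in anxiety], adjustment)) # 0.8
```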
The only assumption made by some ranking tests is that the scores we observe are drawn from an underlying continuous distribution. Parametric tests also make this assumption. An underlying continuous variate is one that is not restricted to having only isolated values. It may have any value in a certain interval. A discrete variate, on the other hand, is one which can take on only a finite number of values; a continuous variate is one which can (but may not) take on a continuous infinity of values.
For some nonparametric techniques which require ordinal measurement, the requirement is that there be a continuum underlying the observed scores. The actual scores we observe may fall into discrete categories. For example, the actual scores may be either "pass" or "fail" on a particular item. We may well assume that underlying such a dichotomy there is a continuum of possible results. That is, some individuals who were categorized as failing may have been closer to passing than were others who were categorized as failing. Similarly, some passed only minimally, whereas others passed with ease and dispatch. The assumption is that "pass" and "fail" represent a continuum dichotomized into two intervals.
Similarly, in matters of opinion those who are classified as "agree" and "disagree" may be thought to fall on a continuum. Some who score as "agree" are actually not very concerned with the issue, whereas others are strongly convinced of their position. Those who "disagree" include those who are only mildly in disagreement as well as die-hard opponents.
Frequently the grossness of our measuring devices obscures the underlying continuity that may exist. If a variate is truly continuously distributed, then the probability of a tie is zero. However, tied scores frequently occur. Tied scores are almost invariably a reflection of the lack of sensitivity of our measuring instruments, which fail to distinguish the small differences which really exist between the tied observations. Therefore even when ties are observed it may not be unreasonable to assume that a continuous distribution underlies our gross measures.
At the risk of being excessively repetitious, the writer wishes to emphasize here that parametric statistical tests, which use means and standard deviations (i.e., which require the operations of arithmetic on the original scores), ought not to be used with data in an ordinal scale. The properties of an ordinal scale are not isomorphic to the numerical system known as arithmetic. When only the rank order of scores is known, means and standard deviations found on the scores themselves are in error to the extent that the successive intervals (distances between classes) on the scale are not equal. When parametric techniques of statistical inference are used with such data, any decisions about hypotheses are doubtful. Probability statements derived from the application of parametric statistical tests to ordinal data are in error to the extent that the structure of the method of collecting the data is not isomorphic to arithmetic. Inasmuch as most of the measurements made by behavioral scientists culminate in ordinal scales (this seems to be the case except in the field of psychophysics, and possibly in the use of a few carefully standardized tests), this point deserves strong emphasis.
Since this book is addressed to the behavioral scientist, and since the scales used by behavioral scientists typically are at best no stronger than ordinal, the major portion of this book is devoted to those methods which are appropriate for testing hypotheses with data measured in an ordinal scale. These methods, which also have much less circumscribing or restrictive assumptions in their statistical models than have parametric tests, make up the bulk of the nonparametric tests.

The Interval Scale
Definition. When a scale has all the characteristics of an ordinal scale, and when in addition the distances between any two numbers on the scale are of known size, then measurement considerably stronger than ordinality has been achieved. In such a case measurement has been achieved in the sense of an interval scale. That is, if our mapping of several classes of objects is so precise that we know just how large are the intervals (distances) between all objects on the scale, then we have achieved interval measurement. An interval scale is characterized by a common and constant unit of measurement which assigns a real number to all pairs of objects in the ordered set. In this sort of measurement, the ratio of any two intervals is independent of the unit of measurement and of the zero point. In an interval scale, the zero point and the unit of measurement are arbitrary.
Examples. We measure temperature on an interval scale. In fact, two different scales, centigrade and Fahrenheit, are commonly used. The unit of measurement and the zero point in measuring temperature are arbitrary; they are different for the two scales. However, both scales contain the same amount and the same kind of information. This is the case because they are linearly related. That is, a reading on one scale can be transformed to the equivalent reading on the other by the linear transformation

F = (9/5)C + 32

where F = number of degrees on Fahrenheit scale
      C = number of degrees on centigrade scale

It can be shown that the ratios of temperature differences (intervals) are independent of the unit of measurement and of the zero point. For instance, "freezing" occurs at 0 degrees on the centigrade scale, and "boiling" occurs at 100 degrees. On the Fahrenheit scale, "freezing" occurs at 32 degrees and "boiling" at 212 degrees. Some other readings of the same temperature on the two scales are:

Centigrade:  0   10   30   100
Fahrenheit: 32   50   86   212

Notice that the ratio of the differences between temperature readings on one scale is equal to the ratio between the equivalent differences on the other scale. For example, on the centigrade scale the ratio of the differences between 30 and 10, and 10 and 0, is (30 - 10)/(10 - 0) = 2. For the comparable readings on the Fahrenheit scale, the ratio is (86 - 50)/(50 - 32) = 2. The ratio is the same in both cases: 2. In an interval scale, in other words, the ratio of any two intervals is independent of the unit used and of the zero point, both of which are arbitrary.
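The arithmetic above can be checked directly; a minimal sketch:

```python
def c_to_f(c):
    """The linear transformation relating the two scales: F = (9/5)C + 32."""
    return 9 / 5 * c + 32

celsius = [0, 10, 30, 100]
fahrenheit = [c_to_f(t) for t in celsius]  # [32.0, 50.0, 86.0, 212.0]

# Ratios of intervals survive any linear transformation f(x) = ax + b.
ratio_c = (celsius[2] - celsius[1]) / (celsius[1] - celsius[0])
ratio_f = (fahrenheit[2] - fahrenheit[1]) / (fahrenheit[1] - fahrenheit[0])
print(ratio_c, ratio_f)  # prints 2.0 2.0
```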
Most behavioral scientists aspire to create interval scales, and on infrequent occasions they succeed. Usually, however, what is taken for success comes because of the untested assumptions the scale maker is willing to make. One frequent assumption is that the variable being scaled is normally distributed in the individuals being tested. Having made this assumption, the scale maker manipulates the units of the scale until the assumed normal distribution is recovered from the individuals' scores. This procedure, of course, is only as good as the intuition of the investigator when he hits upon the distribution to assume.
Another assumption which is often made in order to create an apparent interval scale is the assumption that the person's answer of "yes" on any one item is exactly equivalent to his answering affirmatively on any other item. This assumption is made in order to satisfy the requirement that an interval scale have a common and constant unit of measurement. In ability or aptitude scales, the equivalent assumption is that giving the correct answer to any one item is exactly equivalent (in amount of ability shown) to giving the correct answer to any other item.
Formal properties. Axiomatically, it can be shown that the operations and relations which give rise to the structure of an interval scale are such that the differences in the scale are isomorphic to the structure of arithmetic. Numbers may be associated with the positions of the objects on an interval scale so that the operations of arithmetic may be meaningfully performed on the differences between these numbers.
In constructing an interval scale, one must not only be able to specify equivalences, as in a nominal scale, and greater-than relations, as in an ordinal scale, but one must also be able to specify the ratio of any two intervals.
Admissible operations. Any change in the numbers associated with the positions of the objects measured in an interval scale must preserve not only the ordering of the objects but also the relative differences between the objects. That is, the interval scale is "unique up to a linear transformation." Thus the information yielded by the scale is not affected if each number is multiplied by a positive constant and then a constant is added to this product, that is, f(x) = ax + b. (In the temperature example, a = 9/5 and b = 32.)
We have already noticed that the zero point in an interval scale is arbitrary. This is inherent in the fact that the scale is subject to transformations which consist of adding a constant to the numbers making up the scale.
The interval scale is the first truly quantitative scale that we have encountered. All the common parametric statistics (means, standard deviations, Pearson correlations, etc.) are applicable to data in an interval scale, as are the common parametric statistical tests (t test, F test, etc.).
If measurement in the sense of an interval scale has in fact been achieved, and if all of the assumptions in the statistical model (given on page 19) are adequately met, then the researcher should utilize parametric statistical tests. In such a case, nonparametric methods usually would not take advantage of all the information contained in the research data.

The Ratio Scale
Definition. When a scale has all the characteristics of an interval scale and in addition has a true zero point as its origin, it is called a ratio scale. In a ratio scale, the ratio of any two scale points is independent of the unit of measurement.
Example. We measure mass or weight in a ratio scale. The scale of ounces and pounds has a true zero point. So does the scale of grams. The ratio between any two weights is independent of the unit of measurement. For example, if we determine the weights of two different objects not only in pounds but also in grams, we would find that the ratio of the two pound weights is identical to the ratio of the two gram weights.
Formal properties. The operations and relations which give rise to the numerical values in a ratio scale are such that the scale is isomorphic to the structure of arithmetic. Therefore the operations of arithmetic are permissible on the numerical values assigned to the objects themselves, as well as on the intervals between numbers, as is the case in the interval scale.
Ratio scales, most commonly encountered in the physical sciences, are achieved only when all four of these relations are operationally possible to attain: (a) equivalence, (b) greater than, (c) known ratio of any two intervals, and (d) known ratio of any two scale values.
Admissible operations. The numbers associated with the ratio scale values are "true" numbers with a true zero; only the unit of measurement is arbitrary. Thus the ratio scale is "unique up to multiplication by a positive constant." That is, the ratios between any two numbers are preserved when the scale values are all multiplied by a positive constant, and thus such a transformation does not alter the information contained in the scale.
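A brief sketch of this uniqueness property (the two object weights are invented; the grams-per-pound factor is the standard conversion constant):

```python
GRAMS_PER_POUND = 453.59237  # standard pounds-to-grams conversion factor

weights_lb = [2.0, 5.0]
weights_g = [w * GRAMS_PER_POUND for w in weights_lb]

# Changing units multiplies every scale value by a positive constant,
# so the ratio of any two weights is the same in pounds and in grams.
print(weights_lb[1] / weights_lb[0])  # prints 2.5
print(weights_g[1] / weights_g[0])
```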
Any statistical test is usable when ratio measurement has been achieved. In addition to using those previously mentioned as being appropriate for use with data in interval scales, with ratio scales one may use such statistics as the geometric mean and the coefficient of variation, statistics which require knowledge of the true zero point.

Summary
Measurement is the process of mapping or assigning numbers to objects or observations. The kind of measurement which is achieved is a function of the rules under which the numbers were assigned. The operations and relations employed in obtaining the scores define and limit the manipulations and operations which are permissible in handling the scores; the manipulations and operations must be those of the numerical structure to which the measurement is isomorphic.
Four of the most general scales were discussed: the nominal, ordinal, interval, and ratio scales. Nominal and ordinal measurement are the most common types achieved in the behavioral sciences. Data measured by either nominal or ordinal scales should be analyzed by the nonparametric methods. Data measured in interval or ratio scales may be analyzed by parametric methods, if the assumptions of the parametric statistical model are tenable.
Table 3.1 summarizes the information in our discussion of various levels of measurement and of the kinds of statistics and statistical tests which are appropriate to each level when the assumptions of the tests' statistical models are satisfied.
TABLE 3.1. FOUR LEVELS OF MEASUREMENT AND THE STATISTICS APPROPRIATE TO EACH LEVEL

Scale     Defining relations                Examples of appropriate statistics    Appropriate statistical tests

Nominal   (1) Equivalence                   Mode                                  Nonparametric statistical tests
                                            Frequency
                                            Contingency coefficient

Ordinal   (1) Equivalence                   Median                                Nonparametric statistical tests
          (2) Greater than                  Percentile
                                            Spearman rS
                                            Kendall τ
                                            Kendall W

Interval  (1) Equivalence                   Mean                                  Nonparametric and parametric
          (2) Greater than                  Standard deviation                    statistical tests
          (3) Known ratio of any            Pearson product-moment correlation
              two intervals                 Multiple product-moment correlation

Ratio     (1) Equivalence                   Geometric mean                        Nonparametric and parametric
          (2) Greater than                  Coefficient of variation              statistical tests
          (3) Known ratio of any
              two intervals
          (4) Known ratio of any
              two scale values
The reader may find other discussions of measurement in Bergmann and Spence (1944), Coombs (1950; 1952), Davidson, Siegel, and Suppes (1955), Hempel (1952), Siegel (1956), and Stevens (1946; 1951).

PARAMETRIC AND NONPARAMETRIC STATISTICAL TESTS
A parametric statistical test is a test whose model specifies certain conditions (given on page 19) about the parameters of the population from which the research sample was drawn. Since these conditions are not ordinarily tested, they are assumed to hold. The meaningfulness of the results of a parametric test depends on the validity of these assumptions. Parametric tests also require that the scores under analysis result from measurement in the strength of at least an interval scale.
A nonparametric statistical test is a test whose model does not specify conditions about the parameters of the population from which the sample was drawn. Certain assumptions are associated with most nonparametric statistical tests, i.e., that the observations are independent and that the variable under study has underlying continuity, but these assumptions are fewer and much weaker than those associated with parametric tests. Moreover, nonparametric tests do not require measurement so strong as that required for the parametric tests; most nonparametric tests apply to data in an ordinal scale, and some apply also to data in a nominal scale.
In this chapter we have discussed the various criteria which should be considered in the choice of a statistical test for use in making a decision about a research hypothesis. These criteria are (a) the power of the test, (b) the applicability of the statistical model on which the test is based to the data of the research, (c) power-efficiency, and (d) the level of measurement achieved in the research. It has been stated that a parametric statistical test is most powerful when all the assumptions of its statistical model are met and when the variables under analysis are measured in at least an interval scale. However, even when all the parametric test's assumptions about the population and requirements about strength of measurement are satisfied, we know from the concept of power-efficiency that by increasing the sample size by an appropriate amount we can use a nonparametric test rather than the parametric one and yet retain the same power to reject H0.
Because the power of any nonparametric test may be increased by simply increasing the size of N, and because behavioral scientists rarely achieve the sort of measurement which permits the meaningful use of parametric tests, nonparametric statistical tests deserve an increasingly prominent role in research in the behavioral sciences. This book presents a variety of nonparametric tests for the use of behavioral scientists. The use of parametric tests in research has been presented well in a variety of sources¹ and therefore we will not review those tests here.

¹ Among the many sources on parametric statistical tests are Fisher (1934; 1935), McNemar (1955), Mood (1950), Snedecor (1946), and Walker and Lev (1953).

In many of the nonparametric statistical tests to be presented, the data are changed from scores to ranks or even to signs. Such methods
may arouse the criticism that they "do not use all of the information in the sample" or that they "throw away information." The answer to this objection is contained in the answers to these questions: (a) Of the methods available, parametric and nonparametric, which uses the information in the sample most appropriately? (b) How important is it that the conclusions from the research apply generally rather than only to populations with normal distributions?
The answer to the first question depends on the level of measurement achieved in the research and on the researcher's knowledge of the population. If the measurement is weaker than that of an interval scale, by using parametric tests the researcher would "add information" and thereby create distortions which may be as great and as damaging as those introduced by the "throwing away of information" which occurs when scores are converted to ranks. Moreover, the assumptions which must be made to justify the use of parametric tests usually rest on conjecture and hope, for knowledge about the population parameters is almost invariably lacking. Finally, for some population distributions a nonparametric statistical test is clearly superior in power to a parametric one (Whitney, 1948).
The answer to the second question can be given only by the investigator as he considers the substantive aspects of the research problem.
The relevance of the discussion of this chapter to the choice between parametric and nonparametric statistical tests may be sharpened by the summary below, which lists the advantages and disadvantages of nonparametric statistical tests.
Advantages of Nonparametric Statistical Tests
1. Probability statements obtained from most nonparametric statistical tests are exact probabilities (except in the case of large samples, where excellent approximations are available), regardless of the shape of the population distribution from which the random sample was drawn. The accuracy of the probability statement does not depend on the shape of the population, although some nonparametric tests may assume identity of shape of two or more population distributions, and some others assume symmetrical population distributions. In certain cases, the nonparametric tests do assume that the underlying distribution is continuous, an assumption which they share with parametric tests.
2. If sample sizes as small as N = 6 are used, there is no alternative to using a nonparametric statistical test unless the nature of the population distribution is known exactly.
3. There are suitable nonparametric statistical tests for treating samples made up of observations from several different populations. None of the parametric tests can handle such data without requiring us to make seemingly unrealistic assumptions.
4. Nonparametric statistical tests are available to treat data which are
inherently in ranks as well as data whose seemingly numerical scores have the strength of ranks. That is, the researcher may only be able to say of his subjects that one has more or less of the characteristic than another, without being able to say how much more or less. For example, in studying such a variable as anxiety, we may be able to state that subject A is more anxious than subject B without knowing at all exactly how much more anxious A is. If data are inherently in ranks, or even if they can only be categorized as plus or minus (more or less, better or worse), they can be treated by nonparametric methods, whereas they cannot be treated by parametric methods unless precarious and perhaps unrealistic assumptions are made about the underlying distributions.
5. Nonparametric methods are available to treat data which are simply classificatory, i.e., are measured in a nominal scale. No parametric technique applies to such data.
6. Nonparametric statistical tests are typically much easier to learn and to apply than are parametric tests.

Disadvantages of Nonparametric Statistical Tests
1. If all the assumptions of the parametric statistical model are in fact met in the data, and if the measurement is of the required strength, then nonparametric statistical tests are wasteful of data. The degree of wastefulness is expressed by the power-efficiency of the nonparametric test. (It will be remembered that if a nonparametric statistical test has power-efficiency of, say, 90 per cent, this means that when all conditions of the parametric test are satisfied the appropriate parametric test would be just as effective with a sample which is 10 per cent smaller than that used in the nonparametric analysis.)
2. There are as yet no nonparametric methods for testing interactions in the analysis of variance model, unless special assumptions are made about additivity. (Perhaps we should disregard this distinction, because parametric statistical tests are also forced to make the assumption of additivity. However, the problem of higher-ordered interactions has yet to be dealt with in the literature of nonparametric methods.)*
* After this book had been set in type, a nonparametric test was presented which contributes to the solution of this problem. See Wilson, K. V. 1956. A distribution-free test of analysis of variance hypotheses. Psychol. Bull., 53, 96-101.
Another objection that has been entered against nonparametric methods is that the tests and their accompanying tables of significant values have been widely scattered in various publications, many highly
specialized, and they have therefore been comparatively inaccessible to the behavioral scientist. In preparing this book, the writer's intention has been to rob that objection of its force. This book attempts to present all the nonparametric techniques of statistical inference and measures of association that the behavioral scientist is likely to need, and it gives all of the tables necessary for the use of these techniques. Although this text is not exhaustive in its coverage of nonparametric tests (it could not be without being excessively redundant), enough tests are included in the chapters which follow to give the behavioral scientist wide latitude in choosing a nonparametric technique appropriate to his research design and useful for testing his research hypothesis.
CHAPTER 4

THE ONE-SAMPLE CASE
In this chapter we present those nonparametric statistical tests which may be used to test a hypothesis which calls for drawing just one sample. The tests tell us whether the particular sample could have come from some specified population. These tests are in contrast to the two-sample tests, which may be more familiar, which compare two samples and test whether it is likely that the two came from the same population.
The one-sample test is usually of the goodness-of-fit type. In the typical case, we draw a random sample and then test the hypothesis that this sample was drawn from a population with a specified distribution. Thus the one-sample test can answer questions like these: Is there a significant difference in location (central tendency) between the sample and the population? Is there a significant difference between the observed frequencies and the frequencies we would expect on the basis of some principle? Is there a significant difference between observed and expected proportions? Is it reasonable to believe that this sample has been drawn from a population of a specified shape or form (e.g., normal, rectangular)? Is it reasonable to believe that this sample is a random sample from some known population?
In the one-sample case a common parametric technique is to apply a t test to the difference between the observed (sample) mean and the expected (population) mean. The t test, strictly speaking, assumes that the observations or scores in the sample have come from a normally distributed population. The t test also requires that the observations be measured at least in an interval scale.
There are many sorts of data to which the t test may be inapplicable. The experimenter may find that (a) the assumptions and requirements of the t test are unrealistic for his data, (b) it is preferable to avoid making the assumptions of the t test and thus to gain greater generality for the conclusions, (c) the data of his research are inherently in ranks and thus not amenable to analysis by the t test, (d) the data may be simply classificatory or enumerative and thus not amenable to analysis by the t test, or (e) he is not interested only in differences in location but rather wishes to expose any kind of difference whatsoever. In such instances
the experimenter may choose to use one of the one-sample nonparametric statistical tests presented in this chapter. Four tests for the one-sample case will be presented. The chapter concludes with a comparison and contrast of these tests, which may aid the researcher in selecting the test best suited to his research hypothesis and to his data.

THE BINOMIAL TEST

Function and Rationale
There are populations which are conceived as consisting of only two classes. Examples of such classes are: male and female, literate and illiterate, member and nonmember, in-school and out-of-school, married and single, institutionalized and ambulatory. For such cases, all the possible observations from the population will fall into either one or the other of the two discrete classifications.
For any population of two classes, if we know that the proportion of cases in one class is P, then we know that the proportion in the other class must be 1 - P. Usually the symbol Q is used for 1 - P. Although the value of P may vary from population to population, it is fixed for any one population. However, even if we know (or assume) the value of P for some population, we cannot expect that a random sample of observations from that population will contain exactly proportion P of cases in one class and proportion Q of cases in the other. Random effects of sampling will usually prevent the sample from exactly duplicating the population values of P and Q. For example, we may know from the official records that the voters in a certain county are evenly split between the Republican and Democratic parties in registration. But a random sample of the registered voters of that county might contain 47 per cent Democrats and 53 per cent Republicans, or even 56 per cent Democrats and 44 per cent Republicans. Such differences between the observed and the population values arise because of chance. Of course, small differences or deviations are more probable than large ones.
The binomial distribution is the sampling distribution of the proportions we might observe in random samples drawn from a two-class population. That is, it gives the various values which might occur under H0, where H0 is the hypothesis that the population value is P. Therefore when the "scores" of a research are in two classes, the binomial distribution may be used to test H0. The test is of the goodness-of-fit type. It tells us whether it is reasonable to believe that the proportions (or frequencies) we observe in our sample could have been drawn from a population having a specified value of P.
Method
The probability of obtaining x objects in one category and N - x objects in the other category is given by

p(x) = C(N,x) P^x Q^(N-x)     (4.1)

where P = proportion of cases expected in one of the categories
      Q = 1 - P = proportion of cases expected in the other category
      C(N,x) = N!/[x!(N - x)!] = the number of ways x objects may be chosen from N
A simple illustration will clarify formula (4.1). Suppose a fair die is rolled five times. What is the probability that exactly two of the rolls will show "six"? In this case, N = the number of rolls = 5; x = the number of sixes = 2; P = the expected proportion of sixes = 1/6 (since the die is fair and therefore each side may be expected to show equally often); and Q = 1 - P = 5/6. The probability that exactly two of the five rolls will show six is given by formula (4.1):

p(2) = C(5,2) (1/6)^2 (5/6)^3 = .16

The application of the formula to the problem shows us that the probability of obtaining exactly two "sixes" when rolling a fair die five times is p = .16.
Now when we do research our question is usually not "What is the probability of obtaining exactly the values which were observed?" Rather, we usually ask, "What is the probability of obtaining the observed values or values even more extreme?" To answer questions of this type, the sampling distribution of the binomial is

p(x ≤ k) = Σ (i = 0 to k) C(N,i) P^i Q^(N-i)     (4.2)

* N! is N factorial, which means N(N - 1)(N - 2) ··· (2)(1). For example, 4! = (4)(3)(2)(1) = 24. Table S of the Appendix gives factorials for values of N through 20. Table T of the Appendix gives binomial coefficients C(N,x) for values of N through 20.
In other words, we sum the probability of the observed value with the probabilities of values even more extreme.
Suppose now that we want to know the probability of obtaining two or fewer "sixes" when a fair die is rolled five times. Here again N = 5, x = 2, P = 1/6, and Q = 5/6. Now the probability of obtaining 2 or fewer "sixes" is p(x ≤ 2). The probability of obtaining 0 "sixes" is p(0). The probability of obtaining 1 "six" is p(1). The probability of obtaining 2 "sixes" is p(2). We know from formula (4.2) above that

p(x ≤ 2) = p(0) + p(1) + p(2)

That is, the probability of obtaining two or fewer "sixes" is the sum of the three probabilities mentioned above. If we use formula (4.1) to determine each of these probabilities, we have:

p(0) = C(5,0) (1/6)^0 (5/6)^5 = .40
p(1) = C(5,1) (1/6)^1 (5/6)^4 = .40
p(2) = C(5,2) (1/6)^2 (5/6)^3 = .16

and thus

p(x ≤ 2) = p(0) + p(1) + p(2) = .40 + .40 + .16 = .96
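The arithmetic in this example is easy to check by machine. The following minimal Python sketch (the helper name binomial_prob is ours, not a routine from any statistics library) evaluates formulas (4.1) and (4.2) for the die example:

```python
from math import comb

def binomial_prob(x, n, p):
    # Formula (4.1): p(x) = C(N, x) * P**x * Q**(N - x)
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

# Five rolls of a fair die, P("six") = 1/6:
p2 = binomial_prob(2, 5, 1 / 6)
print(round(p2, 2))  # 0.16

# Formula (4.2): probability of two or fewer "sixes"
p_le_2 = sum(binomial_prob(i, 5, 1 / 6) for i in range(3))
print(round(p_le_2, 2))  # 0.96
```

The exact values are .1608 and .9645; the text's .16 and .96 are these figures rounded to two places.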
We have determined that the probability under H0 of obtaining two or fewer "sixes" when a fair die is rolled five times is p = .96.

Small samples. In the one-sample case, when a two-category classification is used, a common situation is for P to equal 1/2. Table D of the Appendix gives the one-tailed probabilities associated with the occurrence of various values as extreme as x under the null hypothesis that P = Q = 1/2. When referring to Table D, let x = the smaller of the observed frequencies. This table is useful when N is 25 or smaller. Its use obviates the necessity for using formula (4.2). When P ≠ Q, formula (4.2) should be used.
Table D gives the probabilities associated with the occurrence of various values as small as x for various N's (from 5 to 25). For example, suppose we observe that 7 cases fall in one category while the other 3 fall in another. Here N = 10 and x = 3. Table D shows that the one-tailed probability of occurrence under H0 of x = 3 or fewer when N = 10 is p = .172. (Notice that the decimal points have been omitted from the p's given in the body of the table.)
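The Table D entry used in this example is simply a binomial tail sum with P = Q = 1/2, so it can be recomputed directly; a brief Python check:

```python
from math import comb

# One-tailed p for x = 3 or fewer in one category when N = 10 and
# P = Q = 1/2 (each of the 2**N outcomes is equally likely):
n, x = 10, 3
p_one_tailed = sum(comb(n, i) for i in range(x + 1)) / 2 ** n
print(round(p_one_tailed, 3))      # 0.172, the Table D entry
print(round(2 * p_one_tailed, 3))  # 0.344, the two-tailed value
```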
The p's given in Table D are one-tailed. A one-tailed test is used when
we have predicted in advance which of the categories will contain the smaller number of cases. When the prediction is simply that the two frequencies will differ, a two-tailed test is used. For a two-tailed test, the p yielded by Table D is doubled. Thus for N = 10 and x = 3, the two-tailed probability associated with the occurrence under H0 of such an extreme value of x is p = 2(.172) = .344. The example which follows illustrates the use of the binomial test in a research where P = Q = 1/2.

Example
In a study of the effects of stress,* an experimenter taught 18 college students 2 different methods to tie the same knot. Half of the subjects (randomly selected from the group of 18) learned method A first, and half learned method B first. Later, at midnight, after a 4-hour final examination, each subject was asked to tie the knot. The prediction was that stress would induce regression, i.e., that the subjects would revert to the first-learned method of tying the knot. Each subject was categorized according to whether he used the knot-tying method he learned first or the one he learned second, when asked to tie the knot under stress.
i. Null Hypothesis. H0: p1 = p2 = 1/2. That is, there is no difference between the probability of using the first-learned method under stress (p1) and the probability of using the second-learned method under stress (p2); any difference between the frequencies which may be observed is of such a magnitude that it might be expected in a sample from the population of possible results under H0. H1: p1 > p2.
ii. Statistical Test. The binomial test is chosen because the data are in two discrete categories and the design is of the one-sample type. Since methods A and B were randomly assigned to being first-learned and second-learned, there is no reason to think that the first-learned method would be preferred to the second-learned under H0, and thus P = Q = 1/2.
iii. Significance Level. Let α = .01. N = the number of cases = 18.
iv. Sampling Distribution. The sampling distribution is given in formula (4.2) above. However, when N is 25 or smaller, and when P = Q = 1/2, Table D gives the probabilities associated with the occurrence under H0 of observed values as small as x, and thus obviates the necessity for using the sampling distribution directly in the employment of this test.
* Barthol, R. P., and Ku, Nani D. 1955. Specific regression under a nonrelated stress situation. Amer. Psychologist, 10, 482. (Abstract)

v. Rejection Region. The region of rejection consists of all values
of x (where x = the number of subjects who used the second-learned method under stress) which are so small that the probability associated with their occurrence under H0 is equal to or less than α = .01. Since the direction of the difference was predicted in advance, the region of rejection is one-tailed.
vi. Decision. In the experiment, all but two of the subjects used the first-learned method when asked to tie the knot under stress (late at night after a long final examination). These data are shown in Table 4.1.

TABLE 4.1. KNOT-TYING METHOD CHOSEN UNDER STRESS

In this case, N = the number of independent observations = 18, and x = the smaller frequency = 2. Table D shows that for N = 18, the probability associated with x ≤ 2 is p = .001. Inasmuch as this p is smaller than α = .01, the decision is to reject H0 in favor of H1. We conclude that p1 > p2, that is, that persons under stress revert to the first-learned of two methods.
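The Table D entry used in this decision can likewise be recomputed as a binomial tail sum; a short Python sketch:

```python
from math import comb

# Knot-tying study: N = 18 subjects, of whom x = 2 used the
# second-learned method.  Under H0, P = Q = 1/2, so the one-tailed
# probability of 2 or fewer is the binomial tail sum that Table D
# tabulates.
n, x = 18, 2
p = sum(comb(n, i) for i in range(x + 1)) / 2 ** n
print(round(p, 3))  # 0.001, which is less than alpha = .01
```

The exact value is .00066; Table D reports it rounded to three places as .001.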
Large samples. Table D cannot be used when N is larger than 25. However, it can be shown that as N increases, the binomial distribution tends toward the normal distribution. This tendency is rapid when P is close to 1/2, but slow when P is near 0 or 1. That is, the greater the disparity between P and Q, the larger N must be before the approximation is usefully close. When P is near 1/2, the approximation may be used for a statistical test for N > 25. When P is near 0 or 1, a rule of thumb is that NPQ must equal at least 9 before the statistical test based on the normal approximation is applicable. Within these limitations, the sampling distribution of x is approximately normal, with mean μ = NP and standard deviation σ = √(NPQ), and therefore H0 may be tested by

z = (x - μ)/σ = (x - NP)/√(NPQ)     (4.3)

z is approximately normally distributed with zero mean and unit variance. The approximation becomes an excellent one if a correction for continuity is incorporated. The correction is necessary because the normal
distribution is for a continuous variable, whereas the binomial distribution involves a discrete variable. To correct for continuity, we regard the observed frequency x of formula (4.3) as occupying an interval, the lower limit of which is half a unit below the observed frequency while the upper limit is half a unit above the observed frequency. The correction for continuity consists of reducing, by .5, the difference between the observed value of x and the expected value, μ = NP. Therefore when x < μ, we add .5 to x, and when x > μ, we subtract .5 from x. That is, the observed difference is reduced by .5. Thus z becomes

z = (x ± .5 - NP)/√(NPQ)     (4.4)

where x + .5 is used when x < NP, and x - .5 is used when x > NP.
The value of z obtained by the application of formula (4.4) may be considered to be normally distributed with zero mean and unit variance. Therefore the significance of an obtained z may be determined by reference to Table A of the Appendix. That is, Table A gives the one-tailed probability associated with the occurrence under H0 of values as extreme as an observed z. (If a two-tailed test is required, the p yielded by Table A should be doubled.)
To show how good an approximation this is when P = 1/2 even for N < 25, we can apply it to the knot-tying data discussed earlier. In that case, N = 18, x = 2, and P = Q = 1/2. For these data, x < NP, that is, 2 < 9, and, by formula (4.4),

z = [(2 + .5) - (18)(.5)]/√((18)(.5)(.5)) = -3.07

Table A shows that a z as extreme as -3.07 has a one-tailed probability associated with its occurrence under H0 of p = .0011. This is essentially the same probability we found by the other analysis, which used a table of exact probabilities.
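Formula (4.4) and the resulting tail probability can be reproduced numerically; the Python sketch below uses the error function to evaluate the normal-curve area that Table A tabulates:

```python
from math import erf, sqrt

def normal_cdf(z):
    # Standard normal cumulative distribution via the error function
    return 0.5 * (1 + erf(z / sqrt(2)))

# Knot-tying data under the normal approximation, formula (4.4):
# N = 18, x = 2, P = Q = 1/2; here x < NP, so we add the .5 correction.
n, x, p_ = 18, 2, 0.5
z = (x + 0.5 - n * p_) / sqrt(n * p_ * (1 - p_))
print(round(z, 2))  # -3.06 (the text, which rounds sqrt(NPQ) to
                    # 2.12 before dividing, reports -3.07)
print(round(normal_cdf(z), 4))  # 0.0011, matching the exact table
```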
Summary of procedure. In brief, these are the steps in the use of the binomial test:
1. Determine N = the total number of cases observed.
2. Determine the frequencies of the observed occurrences in each of the two categories.
3. The method of finding the probability of occurrence under H0 of the observed values, or values even more extreme, varies:
a. If N is 25 or smaller, and if P = Q = 1/2, Table D gives the one-tailed probabilities under H0 of various values as small as an observed x.
A one-tailed test is used when the researcher has predicted which category will have the smaller frequency. For a two-tailed test, double the p shown in Table D.
b. If P ≠ Q, determine the probability of the occurrence under H0 of the observed value of x, or of an even more extreme value, by substituting the observed values in formula (4.2). Table T is helpful in this computation; it gives binomial coefficients C(N,x) for N ≤ 20.
c. If N is larger than 25, and P is close to 1/2, test H0 by using formula (4.4). Table A gives the probability associated with the occurrence under H0 of values as large as an observed z yielded by that formula. Table A gives one-tailed p's; for a two-tailed test, double the p it yields.
4. If the p associated with the observed value of x, or an even more extreme value, is equal to or less than α, reject H0.

Power-Efficiency
Inasmuch as there is no parametric technique applicable to data measured in a nominal scale, it would be meaningless to inquire about the power-efficiency of the binomial test when used with nominal data. If a continuum is dichotomized and the binomial test used on the resulting data, that test may be wasteful of data. In such cases, the binomial test has power-efficiency (in the sense defined in Chap. 3) of 95 per cent for N = 6, decreasing to an eventual (asymptotic) efficiency of 2/π = 63 per cent (Mood, 1954). However, if the data are basically dichotomous, even though the variable has an underlying continuous distribution, the binomial test may have no more powerful alternative.

References
For other discussions of the binomial test, the reader may turn to Clopper and Pearson (1934), David (1949, chaps. 3, 4), McNemar (1955, pp. 42-49), and Mood (1950, pp. 54-58).

THE χ² ONE-SAMPLE TEST

Function
Frequently research is undertaken in which the researcher is interested in the number of subjects, objects, or responses which fall in various categories. For example, a group of patients may be classified according to their preponderant type of Rorschach response, and the investigator may predict that certain types will be more frequent than others. Or children may be categorized according to their most frequent modes of play, to test the hypothesis that these modes will differ in frequency. Or
persons may be categorized according to whether they are "in favor of," "indifferent to," or "opposed to" some statement of opinion, to enable the researcher to test the hypothesis that these responses will differ in frequency.
The χ² test is suitable for analyzing data like these. The number of categories may be two or more. The technique is of the goodness-of-fit type in that it may be used to test whether a significant difference exists between an observed number of objects or responses falling in each category and an expected number based on the null hypothesis.

Method
In order to be able to compare an observed with an expected group of frequencies, we must of course be able to state what frequencies would be expected. The null hypothesis states the proportion of objects falling in each of the categories in the presumed population. That is, from the null hypothesis we may deduce what the expected frequencies are. The χ² technique tests whether the observed frequencies are sufficiently close to the expected ones to be likely to have occurred under H0. The null hypothesis may be tested by

χ² = Σ (i = 1 to k) (Oi - Ei)²/Ei     (4.5)

where Oi = observed number of cases categorized in ith category
      Ei = expected number of cases in ith category under H0
      Σ (i = 1 to k) directs one to sum over all k categories

Thus formula (4.5) directs one to sum over the k categories the squared difference between each observed and expected frequency divided by the corresponding expected frequency.
If the agreement between the observed and expected frequencies is close, the differences (Oi - Ei) will be small and consequently χ² will be small. If the divergence is large, however, the value of χ² as computed from formula (4.5) will also be large. Roughly speaking, the larger χ² is, the more likely it is that the observed frequencies did not come from the population on which the null hypothesis is based.
It can be shown that the sampling distribution of χ² under H0, as computed from formula (4.5), follows the chi-square* distribution with df = k - 1. (df refers to degrees of freedom; these are discussed below.)
* To avoid confusion, the symbol χ² will be used for the quantity which is calculated from the observed data [using formula (4.5)] when a χ² test is performed. The words "chi square" will refer to a random variable which follows the chi-square distribution, certain values of which are shown in Table C.
Table C of the Appendix is taken from the sampling distribution of chi square, and gives certain critical values. At the top of each column in Table C are given the associatedprobabilities of occurrence(two-tailed) under Ho. The values in any column are the values of chi square which have the associated probability of occurrenceunder Ho given at the top of that column. There is a diferent value of chi square for each df.
There are a number of diferent sampling distributions for chi square, one for each value of df. The size of df reflects the number of observations that are free to vary after certain restrictions have been placed on the data.
These restrictions are not arbitrary, but rather are inherent in
the organization of the data. For example, if the data for 50 casesare classified in two categories, then as soon as we know that, say, 35 cases fall in one category, we also know that 15 must fall in the other. For this example, df = 1, becausewith two categoriesand any fixed value of
N, assoonas the numberof casesin onecategoryis ascertainedthen the number of casesin the other category is determined.
In general,for the one-samplecase,whenHo fully specifiesthe E<'s, df = k 1,wherek standsfor the numberof categories in the classification. To use x' in testing a hypothesis in the one-samplecase, cast each observation into one of k cells.
The total number of such observations
shouldbe N, the numberof casesin your sample. That is, eachobservation must be independent of every other; thus one may not make several
observationson the samepersonand count eachasindependent. To do
so produces an "inflated N." For eachof the k cells,the expected fre quencymustalsobeentered. If Hois that theproportionof cases in each categoryis the same,thenE; = N/k. With the variousvaluesof E; and 0; known,onemaycomputethe valueof g' by theapplicationof formula (4.5). The significanceof this obtainedvalue of g' may be determined
by reference to TableC. If the probabilityassociated with the occurrence under Ho of the obtained g' for df = k
1 is equal to or lessthan
the previouslydeterminedvalue of a, then Ho may be rejected. If not, Ho will be accepted. Example Horse-racing fans often maintain that in a race around a circular track significant advantagesaccrue to the horsesin certain post positions. Any horse's post position is his assignedpost in the starting
line-up. Position 1 is closestto the rail on the insideof the track; position8 is on the outside,farthest from the rail in an 8-horserace. We may test the eKect of post position by analyzing the race results,
given according to post position, for the first month of racing in the 1955 season at a particular circular track.*
i. Null Hypothesis. H0: there is no difference in the expected number of winners starting from each of the post positions, and any observed differences are merely chance variations to be expected in a random sample from the rectangular population where f1 = f2 = ··· = f8. H1: the frequencies f1, f2, …, f8 are not all equal.
ii. Statistical Test. Since we are comparing the data from one sample with some presumed population, a one-sample test is appropriate. The χ² test is chosen because the hypothesis under test concerns a comparison of observed and expected frequencies in discrete categories. (The categories are the eight post positions.)†
iii. Significance Level. Let α = .01. N = 144, the total number of winners in 18 days of racing.
iv. Sampling Distribution. The sampling distribution of χ², as computed from formula (4.5), follows the chi-square distribution with df = k - 1.
v. Rejection Region. H0 will be rejected if the observed value of χ² is such that the probability associated with its occurrence under H0 for df = 7 is equal to or less than α = .01.
vi. Decision. The sample of 144 winners yielded the data shown in Table 4.2.

TABLE 4.2. WINS ACCRUED ON A CIRCULAR TRACK BY HORSES FROM EIGHT POST POSITIONS

The observed frequencies of wins are given in the center of each cell; the expected frequencies are given in italics in the corner of each cell. For example, 29 wins accrued to horses in position 1, whereas under H0 only 18 wins would have been expected. And only 11 wins accrued to horses in position 8, whereas under H0 18 would have been expected.
* The data are given in the New York Post, Aug. 30, 1955, p. 42.
† The χ² test may not be the most appropriate one for these data, since there seems to be some question of order involved, and the χ² test is insensitive to the effect of order. The example is presented because it illustrates the use and computation of the χ² test. Later in this chapter we shall present one-sample tests which may be more appropriate for such data.
The computation of χ² is straightforward:

χ² = Σ (Oi - Ei)²/Ei
   = (29 - 18)²/18 + (19 - 18)²/18 + (18 - 18)²/18 + (25 - 18)²/18 + (17 - 18)²/18 + (10 - 18)²/18 + (15 - 18)²/18 + (11 - 18)²/18
   = 16.3
Table C shows that χ² ≥ 16.3 for df = 7 has probability of occurrence between p = .05 and p = .02. That is, .05 > p > .02. Inasmuch as that probability is larger than the previously set level of significance, α = .01, we cannot reject H0 at that significance level.
We notice that the null hypothesis could have been rejected at α = .05. It would seem that more data are necessary before any definite conclusions concerning H1 can be made.
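The computation can be reproduced by machine; in the Python sketch below, the two critical values are taken from a standard chi-square table (as in Table C) rather than computed:

```python
# Observed wins by post position (Table 4.2) and the expected count
# under H0: 144 winners spread evenly over 8 positions, 144/8 = 18.
observed = [29, 19, 18, 25, 17, 10, 15, 11]
expected = [18] * 8

chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(round(chi_sq, 1))  # 16.3, as computed in the text

# Chi-square critical values for df = 7, read from a standard table:
# 14.07 at p = .05 and 18.48 at p = .01.
print(chi_sq > 14.07)  # True: significant at .05 ...
print(chi_sq > 18.48)  # False: ... but not at the preset alpha = .01
```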
Small expected frequencies. When df = 1, that is, when k = 2, each expected frequency should be at least 5. When df > 1, that is, when k > 2, the χ² test for the one-sample case should not be used when more than 20 per cent of the expected frequencies are smaller than 5 or when any expected frequency is smaller than 1 (Cochran, 1954). Expected frequencies sometimes can be increased by combining adjacent categories. This is desirable only if combinations can meaningfully be made (and of course if there are more than two categories to begin with).
For example, a sample of persons may be categorized according to whether their response to a statement of opinion is strongly support, support, indifferent, oppose, or strongly oppose. To increase the Ei's, adjacent categories could be combined, and the persons categorized as support, indifferent, or oppose, or possibly as support, indifferent, oppose, and strongly oppose.
If one starts with but two categories and has an expected frequency of less than 5, or if after combining adjacent categories one ends up with but two categories and still has an expected frequency of less than 5, then the binomial test (pages 36 to 42) should be used rather than the χ² test to determine the probability associated with the occurrence of the observed frequencies under H0.
Summary of procedure. In this discussion of the method for using the χ² test in the one-sample case, we have shown that the procedure for using the test involves these steps:
1. Cast the observed frequencies in the k categories. The sum of the frequencies should be N, the number of independent observations.
2. From H0, determine the expected frequencies (the Ei's) for each of the k cells. Where k > 2, if more than 20 per cent of the Ei's are smaller than 5, combine adjacent categories, where this is reasonable, thereby reducing the value of k and increasing the values of some of the Ei's. Where k = 2, the χ² test for the one-sample case may be used appropriately only if each expected frequency is 5 or larger.
3. Using formula (4.5), compute the value of χ².
4. Determine the value of df. df = k - 1.
5. By reference to Table C, determine the probability associated with the occurrence under H0 of a value as large as the observed value of χ² for the observed value of df. If that p is equal to or less than α, reject H0.

Power
The literature does not contain much information about the power function of the χ² test. Inasmuch as this test is most commonly used when we do not have a clear alternative available, we are usually not in a position to compute the exact power of the test.
When nominal measurement is used, or when the data consist of frequencies in inherently discrete categories, then the notion of power-efficiency of the χ² test is meaningless, for in such cases there is no parametric test that is suitable. If the data are such that a parametric test is available, then the χ² test may be wasteful of information.
It should be noted that when df > 1, χ² tests are insensitive to the effects of order, and thus when a hypothesis takes order into account, χ² may not be the best test. For methods that strengthen the common χ² tests when H0 is tested against specific alternatives, see Cochran (1954).

References

Useful discussions of this χ² test are contained in Cochran (1952; 1954), Dixon and Massey (1951, chap. 13), Lewis and Burke (1949), and McNemar (1955, chap. 13).

THE KOLMOGOROV-SMIRNOV ONE-SAMPLE TEST

Function and Rationale
The Kolmogorov-Smirnov one-sample test is a test of goodness of fit. That is, it is concerned with the degree of agreement between the distribution of a set of sample values (observed scores) and some specified theoretical distribution. It determines whether the scores in the sample can reasonably be thought to have come from a population having the theoretical distribution.
Briefly, the test involves specifying the cumulative frequency distribution which would occur under the theoretical distribution and comparing that with the observed cumulative frequency distribution. The theoretical distribution represents what would be expected under H0. The point at which these two distributions, theoretical and observed, show the greatest divergence is determined. Reference to the sampling distribution indicates whether such a large divergence is likely on the basis of chance. That is, the sampling distribution indicates whether a divergence of the observed magnitude would probably occur if the observations were really a random sample from the theoretical distribution.

Method
Let F0(X) = a completely specified cumulative frequency distribution function, the theoretical cumulative distribution under H0. That is, for any value of X, the value of F0(X) is the proportion of cases expected to have scores equal to or less than X.
And let SN(X) = the observed cumulative frequency distribution of a random sample of N observations. Where X is any possible score, SN(X) = k/N, where k = the number of observations equal to or less than X.
Now under the null hypothesis that the sample has been drawn from the specified theoretical distribution, it is expected that for every value of X, SN(X) should be fairly close to F0(X). That is, under H0 we would expect the differences between SN(X) and F0(X) to be small and within the limits of random errors.
The Kolmogorov-Smirnov test focuses on the largest of the deviations. The largest value of |F₀(X) − S_N(X)| is called the maximum deviation, D:

D = maximum |F₀(X) − S_N(X)|     (4.6)
The sampling distribution of D under H₀ is known. Table E of the Appendix gives certain critical values from that sampling distribution. Notice that the significance of a given value of D depends on N. For example, suppose one found by formula (4.6) that D = .325 when N = 15. Table E shows that D ≥ .325 has an associated probability of occurrence (two-tailed) between p = .10 and p = .05. If N is over 35, one determines the critical values of D by the divisions indicated in Table E. For example, suppose a researcher uses N = 43 cases and sets α = .05. Table E shows that any D equal to or greater than 1.36/√N will be significant. That is, any D, as defined by formula (4.6), which is equal to or greater than 1.36/√43 = .207 will be significant at the .05 level (two-tailed test).
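This large-sample division is easy to compute directly. Below is a minimal Python sketch; the coefficients for the .10 and .01 levels (1.22 and 1.63) are the standard asymptotic values and are assumed here, since the text quotes only the .05 figure:

```python
import math

def ks_critical_value(n, alpha=0.05):
    """Two-tailed critical value of D for the Kolmogorov-Smirnov
    one-sample test when N is over 35 (large-sample divisions).
    The 1.36 coefficient is the alpha = .05 value quoted in the text;
    the .10 and .01 coefficients are the standard asymptotic values
    (assumed, not quoted in this section)."""
    coefficients = {0.10: 1.22, 0.05: 1.36, 0.01: 1.63}
    return coefficients[alpha] / math.sqrt(n)

# The text's example: N = 43, alpha = .05
print(round(ks_critical_value(43), 3))  # 0.207
```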
Critical values for one-tailed tests have not as yet been adequately tabled. For a method of finding associated probabilities for one-tailed tests, the reader may refer to Birnbaum and Tingey (1951) and Goodman (1954, p. 166).

Example
Suppose a researcher were interested in confirming by experimental means the sociological observation that American Negroes seem to have a hierarchy of preferences among shades of skin color.¹ To test how systematic Negroes' skin-color preferences are, our fictitious researcher arranges to have a photograph taken of each of ten Negro subjects. The photographer develops these in such a way that he obtains five copies of each photograph, each copy differing slightly in darkness from the others, so that the five copies can reliably be ranked from darkest to lightest skin color. The picture showing the darkest skin color for any subject is ranked as 1, the next darkest as 2, and so on, the lightest being ranked as 5. Each subject is then offered a choice among the five prints of his own photograph. If skin shade is unimportant to the subjects, the photographs of each rank should be chosen equally often except for random differences. If skin shade is important, as we hypothesize, then the subjects should consistently favor one of the extreme ranks.
i. Null Hypothesis. H₀: there is no difference in the expected number of choices for each of the five ranks, and any observed differences are merely chance variations to be expected in a random sample from the rectangular population where f₁ = f₂ = ··· = f₅. H₁: the frequencies f₁, f₂, …, f₅ are not all equal.
ii. Statistical Test. The Kolmogorov-Smirnov one-sample test is chosen because the researcher wishes to compare an observed distribution of scores on an ordinal scale with a theoretical distribution.
iii. Significance Level. Let α = .01. N = the number of Negroes who served as subjects in the study = 10.
iv. Sampling Distribution. Various critical values of D from the sampling distribution are presented in Table E, together with their associated probabilities of occurrence under H₀.
v. Rejection Region. The region of rejection consists of all values of D [computed by formula (4.6)] which are so large that the probability associated with their occurrence under H₀ is equal to or less than α = .01.
vi. Decision. In this hypothetical study, each Negro subject chooses one of five prints of the same photograph. Suppose one subject chooses print 2 (the next-to-darkest print), five subjects choose
¹ Warner, W. L., Junker, B. H., and Adams, W. A. 1941. Color and human nature. Washington: American Council on Education.
print 4 (the next-to-lightest print), and four choose print 5 (the lightest print). Table 4.3 shows these data and casts them in the form appropriate for applying the Kolmogorov-Smirnov one-sample test.

TABLE 4.3. HYPOTHETICAL SKIN-COLOR PREFERENCES OF 10 NEGRO SUBJECTS

                          Rank of photo chosen (1 is darkest skin color)
                             1       2       3       4       5
f (number choosing)          0       1       0       5       4
F₀(X)                      2/10    4/10    6/10    8/10   10/10
S₁₀(X)                     0/10    1/10    1/10    6/10   10/10
|F₀(X) − S₁₀(X)|           2/10    3/10    5/10    2/10      0
Notice that F₀(X) is the theoretical cumulative distribution under H₀, where H₀ is that each of the 5 prints would receive ⅕ of the choices. S₁₀(X) is the cumulative distribution of the observed choices of the 10 Negro subjects. The bottom row of Table 4.3 gives the absolute deviation of each sample value from its paired expected value. Thus the first absolute deviation is 2/10, which is obtained by subtracting 0 from 2/10.
Inspection of the bottom row of Table 4.3 quickly reveals that the D for these data is 5/10, which is .500. Table E shows that for N = 10, D ≥ .500 has an associated probability under H₀ of p < .01. Inasmuch as the p associated with the observed value of D is smaller than α = .01, our decision in this fictitious study is to reject H₀ in favor of H₁. We conclude that our subjects show significant preferences among skin colors.
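The arithmetic of this example can be reproduced in a few lines of Python; the counts (0, 1, 0, 5, 4) restate the choices described above, and exact fractions keep the result in the same form as the text:

```python
from fractions import Fraction

# Choices among the five ranks (1 = darkest ... 5 = lightest), from the example
observed = [0, 1, 0, 5, 4]
N = sum(observed)   # 10 subjects
k = len(observed)   # 5 prints

# Theoretical cumulative distribution under H0: each print gets 1/5 of choices
F0 = [Fraction(i + 1, k) for i in range(k)]

# Observed cumulative step function S_N(X)
cum, SN = 0, []
for f in observed:
    cum += f
    SN.append(Fraction(cum, N))

# D = maximum |F0(X) - S_N(X)|, formula (4.6)
D = max(abs(a - b) for a, b in zip(F0, SN))
print(D)  # 1/2, i.e. D = .500
```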
Summary of procedure. In the computation of the Kolmogorov-Smirnov test, these are the steps:
1. Specify the theoretical cumulative step function, i.e., the cumulative distribution expected under H₀.
2. Arrange the observed scores in a cumulative distribution, pairing each interval of S_N(X) with the comparable interval of F₀(X).
3. For each step on the cumulative distributions, subtract S_N(X) from F₀(X).
4. Using formula (4.6), find D.
5. Refer to Table E to find the probability (two-tailed) associated with the occurrence under H₀ of values as large as the observed value of D. If that p is equal to or less than α, reject H₀.

Power
The Kolmogorov-Smirnov one-sample test treats individual observations separately and thus, unlike the χ² test for one sample, need not lose information through the combining of categories. When samples are small, and therefore adjacent categories must be combined before χ² may properly be computed, the χ² test is definitely less powerful than the Kolmogorov-Smirnov test. Moreover, for very small samples the χ² test is not applicable at all, but the Kolmogorov-Smirnov test is. These facts suggest that the Kolmogorov-Smirnov test may in all cases be more powerful than its alternative, the χ² test.
A reanalysis by the χ² test of the data given in the example above will highlight the superior power of the Kolmogorov-Smirnov test. In the form in which the data are presented in Table 4.3, χ² could not be computed, because the expected frequencies are only 2 when N = 10 and k = 5. We must combine adjacent categories in order to increase the expected frequency per cell. By doing that we end up with the two-category breakdown shown in Table 4.4. Any subject's choice is simply classified as being for a light or a dark skin color; finer gradations must be ignored.

TABLE 4.4. HYPOTHETICAL SKIN-COLOR PREFERENCES OF 10 NEGRO SUBJECTS

                          Darker prints     Lighter prints
                          (ranks 1-2)       (ranks 3-5)
f (observed)                   1                  9
E (expected under H₀)          4                  6
For these data, χ² (uncorrected for continuity) = 3.75. Table C shows that the probability associated with the occurrence under H₀ of such a value when df = k − 1 = 1 is between .10 and .05. That is, .10 > p > .05. This value of p does not enable us to reject H₀ at the .01 level of significance. Notice that the p we found by the Kolmogorov-Smirnov test is smaller than .01, while that found by the χ² test is larger than .05. This difference gives some indication of the superior power of the Kolmogorov-Smirnov test.
References
The reader may find other discussions of the Kolmogorov-Smirnov test in Birnbaum (1952; 1953), Birnbaum and Tingey (1951), Goodman (1954), and Massey (1951a).

THE ONE-SAMPLE RUNS TEST

Function and Rationale
If an experimenter wishes to arrive at some conclusion about a population by using the information contained in a sample from that population, then his sample must be a random one. In recent years, several techniques have been developed to enable us to test the hypothesis that a sample is random. These techniques are based on the order or sequence in which the individual scores or observations originally were obtained.
The technique to be presented here is based on the number of runs which a sample exhibits. A run is defined as a succession of identical symbols which are followed and preceded by different symbols or by no symbols at all. For example, suppose a series of plus or minus scores occurred in this order:

+ + − − − + − − − − + + − +

This sample of scores begins with a run of 2 pluses. A run of 3 minuses follows. Then comes another run which consists of 1 plus. It is followed by a run of 4 minuses, after which comes a run of 2 pluses, etc. We can group these scores into runs by numbering each succession of identical symbols:

+ +   − − −   +   − − − −   + +   −   +
 1       2    3       4       5    6   7

We observe 7 runs in all: r = number of runs = 7.
The total number of runs in a sample of any given size gives an indication of whether or not the sample is random. If very few runs occur, a time trend or some bunching due to lack of independence is suggested. If a great many runs occur, systematic short-period cyclical fluctuations seem to be influencing the scores.
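Counting runs is mechanical enough to sketch in Python; the sequence below reproduces the run structure just described (2 pluses, 3 minuses, 1 plus, 4 minuses, 2 pluses, and two final runs of one symbol each):

```python
def count_runs(sequence):
    """Count the runs in a sequence: each maximal succession of
    identical symbols counts as one run."""
    runs = 0
    previous = object()  # sentinel that equals no symbol
    for symbol in sequence:
        if symbol != previous:
            runs += 1
            previous = symbol
    return runs

print(count_runs("++---+----++-+"))     # 7 runs, as in the text
print(count_runs("H" * 10 + "T" * 10))  # 2 runs
```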
For example, suppose a coin were tossed 20 times and the following sequence of heads (H) and tails (T) was observed:

H H H H H H H H H H T T T T T T T T T T

Only two runs occurred in 20 tosses. This would seem to be too few for a "fair" coin (or a fair tosser!). Some lack of independence in the events
is suggested. On the other hand, suppose the following sequence occurred:

H T H T H T H T H T H T H T H T H T H T

Here too many runs are observed. In this case, with r = 20 when N = 20, it would also seem reasonable to reject the hypothesis that the coin is "fair." Neither of the above sequences seems to be a random series of H's and T's.
Notice that our analysis, which is based on the order of the events, gives us information which is not indicated by the frequency of the events. In both of the above cases, 10 tails and 10 heads occurred. If the scores were analyzed according to their frequencies, e.g., by use of the χ² test or the binomial test, we would have no reason to suspect the "fairness" of the coin. It is only a runs test, focusing on the order of the events, which reveals the striking lack of randomness of the scores and thus the possible lack of "fairness" in the coin.
The sampling distribution of the values of r which we could expect from repeated random samples is known. Using this sampling distribution, we may decide whether a given observed sample has more or fewer runs than would probably occur in a random sample.

Method
Let n₁ = the number of elements of one kind, and n₂ = the number of elements of the other kind. That is, n₁ might be the number of heads and n₂ the number of tails; or n₁ might be the number of pluses and n₂ the number of minuses. N = the total number of observed events = n₁ + n₂.
To use the one-sample runs test, first observe the n₁ and n₂ events in the sequence in which they occurred and determine the value of r, the number of runs.
Small samples. If both n₁ and n₂ are equal to or less than 20, then Table F of the Appendix gives the critical values of r under H₀ for α = .05. These are critical values from the sampling distribution of r under H₀. If the observed value of r falls between the critical values, we accept H₀. If the observed value of r is equal to or more extreme than one of the critical values, we reject H₀.
Two tables are given: F_I and F_II. Table F_I gives values of r which are so small that the probability associated with their occurrence under H₀ is p = .025. Table F_II gives values of r which are so large that the probability associated with their occurrence under H₀ is p = .025. Any observed value of r which is equal to or less than the value shown in Table F_I or is equal to or larger than the value shown in Table F_II is in the region of rejection for α = .05.
For example, in the first tossing of the coin discussed above, we observed two runs: one run of 10 heads followed by one run of 10 tails. Here n₁ = 10, n₂ = 10, and r = 2. Table F shows that for these values of n₁ and n₂, a random sample would be expected to contain more than 6 runs but less than 16. Any observed r of 6 or less or of 16 or more is in the region of rejection for α = .05. The observed r = 2 is smaller than 6, so at the .05 significance level we reject the null hypothesis that the coin is producing a random series of heads and tails.
If a one-tailed test is called for, i.e., if the direction of the deviation from randomness is predicted in advance, then only one of the two tables need be examined. If the prediction is that too few runs will be observed, Table F_I gives the critical values of r. If the observed r under such a one-tailed test is equal to or smaller than that shown in Table F_I, H₀ may be rejected at α = .025. If the prediction is that too many runs will be observed, Table F_II gives the critical values of r which are significant at the .025 level.
For example, take the case of the second sequence of coin tosses reported above. Suppose we had predicted in advance, for some reason this writer cannot imagine, that this coin would produce too many runs. We observe that r = 20 for n₁ = 10 and n₂ = 10. Since our observed value of r is equal to or larger than that shown in Table F_II, we may reject H₀ at α = .025, and conclude that the coin is "unfair" in the predicted direction.
Example for Small Samples
In a study of the dynamics of aggression in young children, the experimenter observed pairs of children in a controlled play situation.¹ Most of the 24 children who served as subjects in the study came from the same nursery school and thus played together daily. Since the experimenter was able to arrange to observe but two children on any day, she was concerned that biases might be introduced into the study by discussions between those children who had already served as subjects and those who were to serve later. If such discussions had any effect on the level of aggression in the play sessions, this effect might show up as a lack of randomness in the aggression scores in the order in which they were collected. After the study was completed, the randomness of the sequence of scores was tested by converting each child's aggression score to a plus or minus, depending on whether it fell above or below the group median, and then applying the one-sample runs test to the observed sequence of pluses and minuses.
¹ Siegel, Alberta E. 1955. The effect of film-mediated fantasy aggression on strength of aggressive drive in young children. Unpublished doctor's dissertation, Stanford University.
i. Null Hypothesis. H₀: the pluses and minuses occur in random order. H₁: the order of the pluses and minuses deviates from randomness.
ii. Statistical Test. Since the hypothesis concerns the randomness of a single sequence of observations, the one-sample runs test is chosen.
iii. Significance Level. Let α = .05. N = the number of subjects = 24. Since the scores will be characterized as plus or minus depending on whether they fall above or below the middlemost score in the group, n₁ = 12 and n₂ = 12.
iv. Sampling Distribution. Table F gives the critical values of r from the sampling distribution.
v. Rejection Region. Since H₁ does not predict the direction of the deviation from randomness, a two-tailed test is used. H₀ will be rejected at the .05 level of significance if the observed r is either equal to or less than the appropriate value in Table F_I or is equal to or larger than the appropriate value in Table F_II. For n₁ = 12 and n₂ = 12, Table F shows that the region of rejection consists of all r's of 7 or less and all r's of 19 or more.
vi. Decision. Table 4.5 shows the aggression scores for each child in the order in which those scores occurred. The median of this set of scores is 24.5. All scores falling below that median are designated as minus in Table 4.5; all above that median are designated as plus.

TABLE 4.5. AGGRESSION SCORES IN ORDER OF OCCURRENCE
[Table body not recoverable from the source: the 24 aggression scores in order of occurrence, each marked + or − according to its position relative to the median of 24.5]

From the column showing the sequence of +'s and −'s the reader can readily observe that 10 runs occurred in this series, that is, r = 10.
Reference to Table F reveals that r = 10 for n₁ = 12 and n₂ = 12 does not fall in the region of rejection, and thus our decision is that the null hypothesis that the sample of scores occurred in random order is acceptable.
Large samples. If either n₁ or n₂ is larger than 20, Table F cannot be used. For such large samples, a good approximation to the sampling distribution of r is the normal distribution, with

Mean = μᵣ = 2n₁n₂/(n₁ + n₂) + 1

Standard deviation = σᵣ = √[2n₁n₂(2n₁n₂ − n₁ − n₂) / ((n₁ + n₂)²(n₁ + n₂ − 1))]

Therefore, when either n₁ or n₂ is larger than 20, H₀ may be tested by

z = (r − μᵣ)/σᵣ = [r − (2n₁n₂/(n₁ + n₂) + 1)] / √[2n₁n₂(2n₁n₂ − n₁ − n₂) / ((n₁ + n₂)²(n₁ + n₂ − 1))]     (4.7)

Since the values of z which are yielded by formula (4.7) under H₀ are approximately normally distributed with zero mean and unit variance, the significance of any observed value of z computed from this formula may be determined by reference to the normal curve table, Table A of the Appendix. That is, Table A gives the one-tailed probabilities associated with the occurrence under H₀ of values as extreme as an observed z.
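Formula (4.7) translates directly into code. A minimal Python sketch, checked against the queue example that follows (30 males, 20 females, 35 runs):

```python
import math

def runs_test_z(n1, n2, r):
    """Normal approximation to the runs-test sampling distribution,
    formula (4.7), for use when n1 or n2 exceeds 20."""
    n = n1 + n2
    mean = 2 * n1 * n2 / n + 1
    variance = (2 * n1 * n2 * (2 * n1 * n2 - n1 - n2)) / (n ** 2 * (n - 1))
    return (r - mean) / math.sqrt(variance)

# The queue example below: n1 = 30, n2 = 20, r = 35
print(round(runs_test_z(30, 20, 35), 2))  # 2.98
```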
The large-sample example which follows uses this normal curve approximation to the sampling distribution of r.

Example for Large Samples
The writer was interested in ascertaining whether the arrangement of men and women in the queue in front of the box office of a motion-picture theater was a random arrangement. The data were obtained by simply tallying the sex of each of a succession of 50 persons as they approached the box office.
i. Null Hypothesis. H₀: the order of males and females in the queue was random. H₁: the order of males and females was not random.
ii. Statistical Test. The one-sample runs test was chosen because the hypothesis concerns the randomness of a single group of events.
iii. Significance Level. Let α = .05. N = 50 = the number of persons observed. The values of n₁ and n₂ will be determined only after the data are collected.
iv. Sampling Distribution. For large samples, the values of z which are computed from formula (4.7) under H₀ are approximately normally distributed. Table A gives the one-tailed probability associated with the occurrence under H₀ of values as extreme as an observed z.
v. Rejection Region. Since H₁ does not predict the direction of the deviation from randomness, a two-tailed region of rejection is used. It consists of all values of z, as computed from formula (4.7), which are so extreme that the probability associated with their occurrence under H₀ is equal to or less than α = .05. Thus the region of rejection includes all values of z equal to or more extreme than ±1.96.
vi. Decision. The males (M) and females (F) were queued in front of the box office in the order shown in Table 4.6.

TABLE 4.6. ORDER OF 30 MALES (M) AND 20 FEMALES (F) IN QUEUE AT THE THEATER BOX OFFICE
(Runs are indicated by underlining)
[The sequence of 30 M's and 20 F's forming 35 runs is not reliably recoverable from the source; its line breaks have obscured the run boundaries.]

The reader will observe that there were 30 males and 20 females in this sample. By inspection of the data in Table 4.6, he may also readily determine that r = 35 = the number of runs.
To determine whether r = 35 might readily have occurred under H₀, we compute the value of z as defined by formula (4.7). Let n₁ = the number of males = 30, and n₂ = the number of females = 20. Then

z = [r − (2n₁n₂/(n₁ + n₂) + 1)] / √[2n₁n₂(2n₁n₂ − n₁ − n₂) / ((n₁ + n₂)²(n₁ + n₂ − 1))]     (4.7)
  = [35 − (2(30)(20)/(30 + 20) + 1)] / √[2(30)(20)[2(30)(20) − 30 − 20] / ((30 + 20)²(30 + 20 − 1))]
  = 2.98
Table A shows that the probability of occurrence under H₀ of z ≥ 2.98 is p = 2(.0014) = .0028. (The probability is twice the p given in the table because a two-tailed test is called for.) Inasmuch as the probability associated with the observed occurrence, p = .0028, is less than the level of significance, α = .05, our decision is to reject the null hypothesis in favor of the alternative hypothesis. We conclude that in the queue the order of males and females was not random.
Summary of procedure. These are the steps in the use of the one-sample runs test:
1. Arrange the n₁ and n₂ observations in their order of occurrence.
2. Count the number of runs, r.
3. Determine the probability under H₀ associated with a value as extreme as the observed value of r. If that probability is equal to or less than α, reject H₀. The technique for determining the value of p depends on the size of the n₁ and n₂ groups:
a. If n₁ and n₂ are both 20 or less, refer to Table F. For a two-tailed test, the region of rejection at α = .05 consists of both tabled values of r and all values more extreme. For a one-tailed test, the region of rejection at α = .025 consists of the tabled value of r in the predicted direction (either too small or too large) and all values more extreme.
b. If either n₁ or n₂ is larger than 20, determine the value of z as computed from formula (4.7). Table A shows the one-tailed probability associated with the occurrence under H₀ of values as extreme as an observed z. For a two-tailed test, double the p given in that table. If the p associated with the observed value of r is equal to or less than α, reject H₀.

Power-Efficiency
Because there are no parametric tests for the randomness of a sequence of events in a sample, the concept of power-efficiency is not meaningful in the case of the one-sample runs test.

References
For other discussions of this test, the reader is referred to Freund (1952, chap. 11), Moore and Wallis (1943), and Swed and Eisenhart (1943).
DISCUSSION
We have presented four nonparametric statistical tests of use in a one-sample design. Three of these tests are of the goodness-of-fit type, and the fourth is a test of the randomness of the sequence of events in a sample. This discussion, which compares and contrasts these tests, may aid the reader in choosing the test which will best handle the data of a given study.
In testing hypotheses about whether a sample was drawn from a population with a specified distribution, the investigator may use one of three goodness-of-fit tests: the binomial test, the χ² one-sample test, or the Kolmogorov-Smirnov one-sample test. His choice among these three tests should be determined by (a) the number of categories in his measurement, (b) the level of measurement used, (c) the size of the sample, and (d) the power of the statistical test.
measurement,(b) the level of measurementused, (c) the size of the sample, and (d) the power of the statistical test. The binomial test may be used when there are just two categories in
the classification of the data. It is uniquely useful when the samplesize is so small that the y' test is inapplicable.
The x' test shou/d be used when the data are in discrete categories and when the expectedfrequenciesare sufficiently large. When k = 2, each E; should be 5 or larger.
When k > 2, no more than 20 per cent of
the E s should be smaller than 5 and none should be smaller than 1.
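These rules of thumb are simple to encode. The helper below is hypothetical (not from the text) but implements exactly the stated criteria:

```python
def chi_square_applicable(expected):
    """Rule of thumb for the chi-square one-sample test: when k = 2,
    every expected frequency must be at least 5; when k > 2, no more
    than 20 per cent of the expected frequencies may fall below 5,
    and none may fall below 1."""
    k = len(expected)
    if k == 2:
        return all(e >= 5 for e in expected)
    below_5 = sum(1 for e in expected if e < 5)
    return below_5 <= 0.2 * k and all(e >= 1 for e in expected)

# The skin-color example: k = 5 categories, each E = 2, so chi square is not usable
print(chi_square_applicable([2, 2, 2, 2, 2]))  # False
```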
Both the binomial test and the χ² test may be used with data measured in either a nominal or an ordinal scale. χ² tests are insensitive to the effects of order when df > 1, and thus χ² may not be the best test when a hypothesis takes order into account.
The Kolmogorov-Smirnov test should be used when one can assume that the variable under consideration has a continuous distribution.
However, if this test is used when the population distribution, F₀(X), is discontinuous, the error which occurs in the resulting probability statement is in the "safe" direction (Goodman, 1954). That is, if the tables which assume that F₀(X) is continuous are used to test a hypothesis about a discontinuous variable, the test is a conservative one: if H₀ is rejected by that test we can have real confidence in that decision.
It has already been mentioned that the Kolmogorov-Smirnov test treats individual observations separately and thus does not lose information because of grouping, as the χ² test sometimes must. With a continuous variable, if the sample is small and therefore adjacent categories must be combined for the χ² test, the χ² test is definitely less powerful than the Kolmogorov-Smirnov test. It would seem that in all cases where it is applicable the Kolmogorov-Smirnov test is the most powerful goodness-of-fit test of those presented.
The distribution of D is not known for the case when certain parameters of the population have been estimated from the sample. However, Massey (1951a, p. 73) gives some evidence which indicates that if the Kolmogorov-Smirnov test is applied in such cases (e.g., for testing goodness of fit to a normal distribution with mean and standard deviation estimated from the sample), the use of Table E will lead to a conservative test. That is, if the critical value of D (as shown in Table E) is exceeded by the observed value in these circumstances, we may with considerable confidence reject H₀ (e.g., that the population is normal) and safely conclude that there is a significant difference. In cases where parameters must be estimated from the sample, the χ² test is easily modified for use by a reduction of the number of degrees of freedom. The Kolmogorov-Smirnov test has no such known modifications.
The one-sample runs test is concerned with the randomness of the temporal occurrence or sequence of the scores in a sample. No general statement about the efficiency of tests of randomness based on runs can be meaningful; in this case the question of efficiency has meaning only in the context of a specific problem.
CHAPTER 5

THE CASE OF TWO RELATED SAMPLES
The two-sample statistical tests are used when the researcher wishes to establish whether two treatments are different, or whether one treatment is "better" than another. The "treatment" may be any of a diverse variety of conditions: injection of a drug, training, acculturation, propaganda, separation from family, surgical alteration, changed housing conditions, intergroup integration, climate alteration, introduction of a new element into the economy, etc. In each case, the group which has undergone the treatment is compared with one which has not, or which has undergone a different treatment.
In such comparisons of two groups, sometimes significant differences are observed which are not the results of the treatment. For instance, a researcher may attempt to compare two teaching methods by having one group of students taught by one method and a different group taught by another. Now if one of the groups has abler or more motivated students, the performance of the two groups after the different learning experiences may not accurately reflect the relative effectiveness of the two teaching methods at all, because other variables are creating differences in performance.
One way to overcome the difficulty imposed by extraneous differences between groups is to use two related samples in the research. That is, one may "match" or otherwise relate the two samples studied. This matching may be achieved by using each subject as his own control, or by pairing subjects and then assigning the two members of each pair to the two conditions. When a subject "serves as his own control," he is exposed to both treatments at different times. When the pairing method is used, the effort is to select for each pair subjects who are as much alike as possible with respect to any extraneous variables which might influence the outcome of the research. In the example mentioned above, the pairing method would require that a number of pairs of students be selected, each pair composed of two students of substantially equal ability and motivation. One member of each pair, chosen from the two
by random means, would be assigned to the class taught by one of the methods and his matched "partner" would be assigned to the class taught by the other method.
Wherever it is feasible, the method of using each subject as his own control (and counterbalancing the order in which the treatments are assigned) is preferable to the pairing method. The reason for this is that we are limited in our ability to match people by our ignorance of the relevant variables which determine behavior. Moreover, even when we do know what variables are important and therefore should be controlled by the pairing process, our tools for measuring these variables are rather gross or inexact, and thus our pairing based on our measures may be faulty. A matching design is only as good as the experimenter's ability to determine how to match the pairs, and this ability is frequently very limited. This problem is circumvented when each subject is used as his own control; no more precise matching is possible than that achieved by identity.
The usual parametric technique for analyzing data from two related samples is to apply a t test to the difference scores. A difference score may be obtained from the two scores of the two members of each matched pair, or from the two scores of each subject under the two conditions. The t test assumes that these difference scores are normally and independently distributed in the population from which the sample was drawn, and requires that they be measured on at least an interval scale.
In a number of instances, the t test is inapplicable. The researcher may find that (a) the assumptions and requirements of the t test are unrealistic for his data, (b) he may prefer to avoid making the assumptions or testing the requirements and thus give greater generality to his conclusions, (c) his differences between matched pairs are not represented as scores but rather as "signs" (that is, he can tell which member of any pair is "greater than" the other, but cannot tell how much greater), or (d) his scores are simply classificatory: the two members of the matched pair can either respond in the same way or in entirely different ways which do not stand in any order or quantitative relation. In these instances, the experimenter may choose from one of the nonparametric statistical tests for two related samples which are presented in this chapter. In addition to being suitable for the cases mentioned above, these tests have the further advantage that they do not require that all pairs be drawn from the same population. Five tests are presented; the discussion at the close of the chapter indicates the special features and uses of each. This discussion may aid the reader in selecting that technique which would be most appropriate to use in a particular research.
THE McNEMAR TEST FOR THE SIGNIFICANCE OF CHANGES
Function
The McNemar test for the significance of changes is particularly applicable to those "before and after" designs in which each person is used as his own control and in which measurement is in the strength of either a nominal or ordinal scale. Thus it might be used to test the effectiveness of a particular treatment (meeting, newspaper editorial, mailed pamphlet, personal visit, etc.) on voters' preferences among various candidates. Or it might be used to test, say, the effects of farm-to-city moves on people's political affiliations. Notice that these are studies in which people could serve as their own controls and in which nominal measurement would be used to assess the "before to after" change.

Rationale and Method
To test the significance of any observed change by this method, one sets up a fourfold table of frequencies to represent the first and second sets of responses from the same individuals. The general features of such a table are illustrated in Table 5.1, in which + and − are used to signify different responses.

TABLE 5.1. FOURFOLD TABLE FOR USE IN TESTING SIGNIFICANCE OF CHANGES

                After
                −     +
Before    +     A     B
          −     C     D

Notice that those cases which show changes between the first and second response appear in cells A and D. An individual is tallied in cell A if he changed from + to −. He is tallied in cell D if he changed from − to +. If no change is observed, he is tallied in either cell B (+ responses both before and after) or cell C (− responses both before and after).
Since A + D represents the total number of persons who changed, the expectation under the null hypothesis would be that ½(A + D) cases changed in one direction and ½(A + D) cases changed in the other. In other words, ½(A + D) is the expected frequency under H₀ in both cell A and cell D.
It will be remembered from Chap. 4 that

χ² = Σᵢ₌₁ᵏ (Oᵢ − Eᵢ)²/Eᵢ
where Oᵢ = observed number of cases in ith category
      Eᵢ = expected number of cases under H₀ in ith category
In the McNemar test for the significance of changes, we are interested only in cells A and D. Therefore, if A = the observed number of cases in cell A, D = the observed number of cases in cell D, and ½(A + D) = the expected number of cases in both cell A and cell D, then

χ² = [A − ½(A + D)]²/[½(A + D)] + [D − ½(A + D)]²/[½(A + D)]

Expanding and collecting terms, we have

χ² = (A − D)²/(A + D)     with df = 1     (5.1)

That is, the sampling distribution under H₀ of χ² as yielded by formula (5.1) is distributed approximately as chi square¹ with df = 1.
Correction for continuity. The approximation by the chi-square distribution of the sampling distribution of formula (5.1) becomes an excellent one if a correction for continuity is performed. The correction is necessary because a continuous distribution (chi square) is used to approximate a discrete distribution. When all expected frequencies are small, that approximation may be a poor one. The correction for continuity (Yates, 1934) is an attempt to remove this source of error. With the correction for continuity included,

    χ² = (|A − D| − 1)²/(A + D)        with df = 1        (5.2)
This expression directs one to subtract 1 from the absolute value of the difference between A and D (that is, from that difference irrespective of sign) before squaring. The significance of any observed value of χ², as computed from formula (5.2), is determined by reference to Table C of the Appendix, which gives various critical values of chi square for df's from 1 to 30. That is, if the observed value of χ² is equal to or greater than that shown in Table C for a particular significance level at df = 1, the implication is that a "significant" effect was demonstrated in the "before" and "after" responses.

The example which follows illustrates the application of this test.

¹ See footnote on page 43.
THE MCNEMAR TEST FOR THE SIGNIFICANCE OF CHANGES
Example

Suppose a child psychologist is interested in children's initiation of social contacts. He has observed that children who are new in a nursery school usually initiate interpersonal contacts with adults rather than with other children. He predicts that with increasing familiarity and experience, children will increasingly initiate social contacts with other children rather than with adults. To test this hypothesis, he observes 25 new children on each child's first day at nursery school, and he categorizes their first initiation of social contact according to whether it was directed to an adult or to another child. He then observes each of the 25 children after each has attended nursery school for a month, making the same categorization. Thus his data are cast in the form shown in Table 5.2.
His test of the hypothesis follows; this discussion reports artificial data.

TABLE 5.2. FORM OF FOURFOLD TABLE TO SHOW CHANGES IN CHILDREN'S OBJECTS OF INITIATION

                                      Object of initiation on thirtieth day
                                      Child      Adult
    Object of initiation   Adult       A          B
    on first day           Child       C          D
i. Null Hypothesis. H0: for those children who change, the probability that any child will change his object of initiation from adult to child (that is, PA) is equal to the probability that he will change his object of initiation from child to adult (that is, PD), and each equals one-half. That is, PA = PD = ½. H1: PA > PD.
ii. Statistical Test. The McNemar test for the significance of changes is chosen because the study uses two related samples, is of the before-and-after type, and uses nominal (classificatory) measurement.

iii. Significance Level. Let α = .05. N = 25, the number of children observed on the first and thirtieth days at school.

iv. Sampling Distribution. Table C gives critical values of chi square for various levels of significance. The sampling distribution of χ² as computed by formula (5.2) is very closely approximated by the chi-square distribution with df = 1.
v. Rejection Region. Since H1 specifies the direction of the predicted difference, the region of rejection is one-tailed. It consists of all values of χ² (computed from data in which A > D) which are so large that they have a one-tailed probability associated with their occurrence under H0 of .05 or less.²

² The statement of H0 suggests a straightforward application of the binomial test (pages 36 to 42). The relation of the McNemar test to the binomial test is shown in the discussion of small expected frequencies (below).
vi. Decision. The artificial data of this study are shown in Table 5.3. It shows that A = 14 = the number of children whose objects changed from adult to child, and D = 4 = the number of children whose objects changed from child to adult. B = 4 and C = 3 represent those children whose objects were in the same category on both occasions.

TABLE 5.3. CHILDREN'S OBJECTS OF INITIATION ON FIRST AND THIRTIETH DAYS IN NURSERY SCHOOL
(Artificial data)

                                      Object of initiation on thirtieth day
                                      Child      Adult
    Object of initiation   Adult       14          4
    on first day           Child        3          4

We are interested in the children who showed change: those represented in cells A and D. For these data,
    χ² = (|A − D| − 1)²/(A + D)        (5.2)
       = (|14 − 4| − 1)²/(14 + 4)
       = 9²/18
       = 4.5
Reference to Table C reveals that when χ² ≥ 4.5 and df = 1, the probability of occurrence under H0 is p < ½(.05), which is p < .025. (The probability value given in Table C is halved because a one-tailed test is called for and the table gives two-tailed values.) Inasmuch as the probability under H0 associated with the occurrence we observed is p < .025 and is less than α = .05, the observed value of χ² is in the region of rejection, and thus our decision is to reject H0 in favor of H1. With these artificial data we conclude that children show a significant tendency to change their objects of initiation from adults to children after 30 days of nursery school experience.
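The arithmetic of step vi can be reproduced directly; a minimal sketch of formula (5.2) in Python, applied to the Table 5.3 counts (the function name is illustrative):

```python
def mcnemar_corrected(a, d):
    """McNemar chi square with Yates's correction for continuity, formula (5.2)."""
    return (abs(a - d) - 1) ** 2 / (a + d)

# Table 5.3: A = 14 (adult to child), D = 4 (child to adult)
chi_square = mcnemar_corrected(14, 4)  # (|14 - 4| - 1)^2 / 18 = 81/18
assert chi_square == 4.5
```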
Small expected frequencies. If the expected frequency, that is, ½(A + D), is very small (less than 5), the binomial test (Chap. 4) should be used rather than the McNemar test. For the binomial test, N = A + D, and x = the smaller of the two observed frequencies, either A or D. Notice that we could have tested the data in Table 5.3 with the binomial test. The null hypothesis would be that the sample of N = A + D cases came from a binomial population where P = Q = ½. For the above data, N = 18 and x = 4, the smaller of the two frequencies observed. Table D of the Appendix shows that the probability under H0 associated with such a small value is p = .015, which is essentially the same p yielded by the McNemar test. The difference between the two p's is due mainly to the fact that the chi-square table does not include all values between p = .05 and p = .01.

Summary of procedure. These are the steps in the computation of the McNemar test:
1. Cast the observed frequencies in a fourfold table of the form illustrated in Table 5.1.
2. Determine the expected frequencies in cells A and D: E = ½(A + D). If the expected frequencies are less than 5, use the binomial test rather than the McNemar test.
3. If the expected frequencies are 5 or larger, compute the value of χ² using formula (5.2).
4. Determine the probability under H0 associated with a value as large as the observed value of χ² by referring to Table C. If a one-tailed test is called for, halve the probability shown in that table. If the p shown by Table C for the observed value of χ² with df = 1 is equal to or less than α, reject H0 in favor of H1.

Power-Efficiency
When the McNemar test is used with nominal measures, the concept of power-efficiency is meaningless, inasmuch as there is no alternative with which to compare the test. However, when the measurement and other aspects of the data are such that it is possible to apply the parametric t test, the McNemar test, like the binomial test, has power-efficiency of about 95 per cent for A + D = 6, and the power-efficiency declines as the size of A + D increases, to an eventual (asymptotic) efficiency of about 63 per cent.

References

Discussions of this test are presented by Bowker (1948) and McNemar (1947; 1955, pp. 228-231).
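The summary procedure above, including the binomial fallback for small expected frequencies, can be sketched as one routine; a Python sketch with illustrative function names, using `math.comb` for the binomial coefficients:

```python
import math

def mcnemar_chi_square(a, d):
    """Formula (5.2): chi square with correction for continuity, df = 1."""
    return (abs(a - d) - 1) ** 2 / (a + d)

def binomial_p(n, x):
    """One-tailed binomial probability of x or fewer with P = Q = 1/2."""
    return sum(math.comb(n, k) for k in range(x + 1)) / 2 ** n

def mcnemar_test(a, d):
    """Steps 1 to 4: fall back to the binomial test when E = (A + D)/2 < 5."""
    if (a + d) / 2 < 5:
        return ("binomial", binomial_p(a + d, min(a, d)))
    return ("chi-square", mcnemar_chi_square(a, d))

# Table 5.3 counts: A = 14, D = 4 -> chi square = 4.5
assert mcnemar_test(14, 4) == ("chi-square", 4.5)
# The binomial cross-check for the same data: N = 18, x = 4 -> p = .015
assert round(binomial_p(18, 4), 3) == 0.015
```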
THE SIGN TEST
Function
The sign test gets its name from the fact that it uses plus and minus signs rather than quantitative measures as its data. It is particularly useful for research in which quantitative measurement is impossible or infeasible, but in which it is possible to rank the two members of each pair with respect to each other.

The sign test is applicable to the case of two related samples when the experimenter wishes to establish that two conditions are different. The only assumption underlying this test is that the variable under consideration has a continuous distribution. The test does not make any assumptions about the form of the distribution of differences, nor does it assume that all subjects are drawn from the same population. The different pairs may be from different populations with respect to age, sex, intelligence, etc.; the only requirement is that within each pair the experimenter has achieved matching with respect to the relevant extraneous variables. As was noted before, one way of accomplishing this is to use each subject as his own control.

Method
The null hypothesis tested by the sign test is that

    p(XA > XB) = p(XA < XB) = ½

where XA is the judgment or score under one of the conditions (or after the treatment) and XB is the judgment or score under the other condition (or before the treatment). That is, XA and XB are the two "scores" for a matched pair. Another way of stating H0 is: the median difference is zero.
In applying the sign test, we focus on the direction of the differences between every XAi and XBi, noting whether the sign of the difference is plus or minus. Under H0, we would expect the number of pairs which have XA > XB to equal the number of pairs which have XA < XB. That is, if the null hypothesis were true, we would expect about half of the differences to be negative and half to be positive. H0 is rejected if too few differences of one sign occur.

Small samples. The probability associated with the occurrence of a particular number of +'s and −'s can be determined by reference to the binomial distribution with P = Q = ½, where N = the number of pairs. If a matched pair shows no difference (i.e., the difference, being zero, has no sign), it is dropped from the analysis and N is thereby reduced. Table D of the Appendix gives the probabilities associated with the
occurrence under H0 of values as small as x for N ≤ 25. To use this table, let x = the number of fewer signs.

For example, suppose 20 pairs are observed. Sixteen show differences in one direction (+) and the other four show differences in the other (−). Here N = 20 and x = 4. Reference to Table D reveals that the probability of this distribution of +'s and −'s, or an even more extreme one, under H0 is p = .006 (one-tailed).

The sign test may be either one-tailed or two-tailed. In a one-tailed test, the advance prediction states which sign, + or −, will occur more frequently. In a two-tailed test, the prediction is simply that the frequencies with which the two signs occur will be significantly different. For a two-tailed test, double the values of p shown in Table D.

Example for Small Samples
In a study of the effects of father-absence upon the development of children,¹ 17 married couples who had been separated by war and whose first child was born during the father's absence were interviewed, husbands and wives separately. Each was asked to discuss various topics concerning the child whose first year had been spent in a fatherless home. Each parent was asked to discuss the father's disciplinary relations with the child in the years after his return from war. These statements were extracted from the recorded interviews, and a psychologist who knew each family was asked to rate the statements on the degree of insight which each parent showed in discussing paternal discipline. The prediction was that the mother, because of her longer and closer association with the child and because of a variety of other circumstances typically associated with father-separation because of war, would have greater insight into her husband's disciplinary relations with their child than he would have.

i. Null Hypothesis. H0: the median of the differences is zero. That is, there are as many husbands whose insight into their own disciplinary relations with their children is greater than their wives' as there are wives whose insight into paternal discipline is greater than their husbands'. H1: the median of the differences is positive.

ii. Statistical Test. The rating scale used in this study constitutes at best a partially ordered scale. The information contained in the ratings is preserved if the difference between each couple's two ratings is expressed by a sign. Each married couple in this study constitutes a matched pair; they are matched in the sense that each

¹ Engvall, Alberta. 1954. Comparison of mother and father attitudes toward war-separated children. In Lois M. Stolz et al., Father relations of war-born children. Stanford, Calif.: Stanford Univer. Press. Pp. 149-180.
discussed the same child and the same family situation in the material rated. The sign test is appropriate for measures of the strength indicated, and of course is appropriate for a case of two related samples.

iii. Significance Level. Let α = .05. N = 17, the number of war-separated couples. (N may be reduced if ties occur.)

iv. Sampling Distribution. The associated probability of occurrence of values as small as x is given by the binomial distribution for P = Q = ½. The associated probabilities are given in Table D.

v. Rejection Region. Since H1 predicts the direction of the differences, the region of rejection is one-tailed. It consists of all values of x (where x = the number of minuses, since the prediction is that pluses will predominate, and x = the number of fewer signs) whose one-tailed associated probability of occurrence under H0 is equal to or less than α = .05.
vi. Decision. The statements of each parent were rated on a five-point rating scale. On this scale, a rating of 1 represents high insight.

TABLE 5.4. WAR-SEPARATED PARENTS' INSIGHT INTO PATERNAL DISCIPLINE
[Rating on insight* into paternal discipline for each couple (pseudonym): mother's rating, father's rating, direction of difference, and sign.]
* A rating of 1 represents great insight; a rating of 5 represents little or no insight.

Table 5.4 shows the ratings assigned to each mother (M) and father (F) among the 17 war-separated couples. The signs of the differences
between each couple are shown in the final column. Observe that 3 couples (the Holmans, Mathewses, and Soules) showed differences in the opposite direction from that predicted, i.e., in each case XF < XM, and thus each of these 3 received a minus. For 3 couples (the Harlows, Marstons, and Wagners), there was no difference between the two ratings, that is, XF = XM, and thus these couples received no sign. The remaining 11 couples showed differences in the predicted direction.

For the data in Table 5.4, x = the number of fewer signs = 3, and N = the number of matched pairs who showed differences = 14. Table D shows that for N = 14, an x ≤ 3 has a one-tailed probability of occurrence under H0 of p = .029. This value is in the region of rejection for α = .05; thus our decision is to reject H0 in favor of H1. We conclude that war-separated wives show greater insight into their husbands' disciplinary relations with their war-born children than do the husbands themselves.
Ties. For the sign test, a "tie" occurs when it is not possible to discriminate between a matched pair on the variable under study, or when the two scores earned by any pair are equal. In the case of the war-separated couples, three ties occurred: the psychologist rated three couples as having equal insight into paternal discipline. All tied cases are dropped from the analysis for the sign test, and the N is correspondingly reduced. Thus N = the number of matched pairs whose difference score has a sign. In the example, 14 of the 17 couples had difference scores with a sign, so for that case N = 14.
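The Table D lookup is simply a binomial tail probability with P = Q = ½; a Python sketch reproducing the p = .029 found for N = 14 and x = 3 (the function name is illustrative):

```python
import math

def sign_test_p(n, x):
    """One-tailed probability of x or fewer of the rarer sign among n signed pairs."""
    return sum(math.comb(n, k) for k in range(x + 1)) / 2 ** n

assert round(sign_test_p(14, 3), 3) == 0.029   # the war-separated couples
assert round(sign_test_p(20, 4), 3) == 0.006   # the earlier 16-to-4 illustration
```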
Relation to the binomial expansion. In the study just discussed, we should expect under H0 that the frequency of pluses and minuses would be the same as the frequency of heads and tails in a toss of 14 unbiased coins. (More exactly, the analogy is to the toss of 17 unbiased coins, 3 of which rolled out of sight and thus could not be included in the analysis.) The probability of getting as extreme an occurrence as 3 heads and 11 tails in a toss of 14 coins is given by the binomial distribution as

    p = Σ (N x) Pˣ Q^(N−x)        summed over x = 0, 1, 2, 3

where N = total number of coins, x = observed number of heads, and the binomial coefficient

    (N x) = N!/[x!(N − x)!]
In the case of 3 or fewer heads when 14 coins are tossed, this is

    p = (1/2¹⁴)[(14 0) + (14 1) + (14 2) + (14 3)]
      = (1 + 14 + 91 + 364)/16,384
      = .029
The probability value found by this method is of course identical to that found by the method used in the example: p = .029.

Large samples. If N is larger than 25, the normal approximation to the binomial distribution can be used. This distribution has

    Mean = μ = NP = ½N
    Standard deviation = σ = √(NPQ) = ½√N

That is, the value of z is given by

    z = (x − ½N)/(½√N)        (5.3)

This expression is approximately normally distributed with zero mean and unit variance.
The approximation becomes an excellent one when a correction for continuity is employed. The correction is effected by reducing the difference between the observed number of pluses (or minuses) and the expected number, i.e., the mean under H0, by .5. (See pages 40 to 41 for a more complete discussion of this point.) That is, with the correction for continuity,

    z = (x ± .5 − ½N)/(½√N)        (5.4)

where x + .5 is used when x < ½N, and x − .5 is used when x > ½N. The value of z obtained by the application of formula (5.4) may be considered to be normally distributed with zero mean and unit variance. Therefore the significance of an obtained z may be determined by reference to Table A in the Appendix. That is, Table A gives the one-tailed probability associated with the occurrence under H0 of values as extreme as an observed z. (If a two-tailed test is required, the p yielded by Table A should be doubled.)

Example for Large Samples
Suppose an experimenter were interested in determining whether a certain film about juvenile delinquency would change the opinions of the members of a particular community about how severely juvenile
delinquents should be punished. He draws a random sample of 100 adults from the community, and conducts a "before and after" study, having each subject serve as his own control. He asks each subject to take a position on whether more or less punitive action against juvenile delinquents should be taken than is taken at present. He then shows the film to the 100 adults, after which he repeats the question.

i. Null Hypothesis. H0: the film has no systematic effect. That is, of those whose opinions change after seeing the film, just as many change from more to less as change from less to more, and any difference observed is of a magnitude which might be expected in a random sample from a population on which the film would have no systematic effect. H1: the film has a systematic effect.

ii. Statistical Test. The sign test is chosen for this study of two related groups because the study uses ordinal measures within matched pairs, and therefore the differences may appropriately be represented by plus and minus signs.

iii. Significance Level. Let α = .01. N = the number of subjects (out of 100) who show an opinion change in either direction.

iv. Sampling Distribution. Under H0, z as computed from formula (5.4) is approximately normally distributed for N > 25. Table A gives the probability associated with the occurrence of values as extreme as an obtained z.
v. Rejection Region. Since H1 does not state the direction of the predicted differences, the region of rejection is two-tailed. It consists of all values of z which are so extreme that their associated probability of occurrence under H0 is equal to or less than α = .01.
vi. Decision. The results of this hypothetical study of the effects of a film upon opinion are shown in Table 5.5.

TABLE 5.5. ADULT OPINIONS CONCERNING WHAT SEVERITY OF PUNISHMENT IS DESIRABLE FOR JUVENILE DELINQUENCY
(Artificial data)

                                        Amount of punishment favored after film
                                        Less      More
    Amount of punishment    More         59         7
    favored before film     Less          8        26
Did the film have any effect? The data show that there were 15 adults (8 + 7) who were unaffected and 85 who were. The hypothesis of the study applies only to those 85. If the film had no systematic effect, we would expect about half of those whose opinions changed from before to after to have changed from more to less, and about half to have changed from less to more. That is, we would expect about 42.5 subjects to show each of the two kinds of change. Now we observe that 59 changed from more to less, while 26 changed
from less to more. We may determine the associated probability under H0 of such an extreme split by using formula (5.4). For these data, x > ½N, that is, 59 > 42.5.

    z = (x − .5 − ½N)/(½√N)        (5.4)
      = (59 − .5 − 42.5)/(½√85)
      = 16/4.61
      = 3.47
Reference to Table A reveals that the probability under H0 of z ≥ 3.47 is p = 2(.0003) = .0006. (The p shown in the table is doubled because the tabled values are for a one-tailed test, whereas the region of rejection in this case is two-tailed.) Inasmuch as p = .0006 is smaller than α = .01, the decision is to reject the null hypothesis in favor of the alternative hypothesis. We conclude from these fictitious data that the film had a significant systematic effect on adults' opinions regarding the severity of punishment which is desirable for juvenile delinquents.
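The z of step vi follows formula (5.4); a Python sketch applying it to the 85 changers of Table 5.5, using the complementary error function for the normal tail (the function names are illustrative):

```python
import math

def sign_test_z(x, n):
    """Formula (5.4): normal approximation with correction for continuity."""
    correction = 0.5 if x < n / 2 else -0.5
    return (x + correction - 0.5 * n) / (0.5 * math.sqrt(n))

def one_tailed_p(z):
    """P(Z >= z) for the standard normal."""
    return 0.5 * math.erfc(z / math.sqrt(2))

z = sign_test_z(59, 85)            # (59 - .5 - 42.5) / (.5 * sqrt(85))
assert round(z, 2) == 3.47
assert 2 * one_tailed_p(z) < .01   # two-tailed p falls well below alpha = .01
```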
This example was included not only because it demonstrates a useful application of the sign test, but also because data of this sort are often analyzed incorrectly. It is not too uncommon for researchers to analyze such data by using the row and column totals as if they represented independent samples. This is not the case; the row and column totals are separate but not independent representations of the same data.

This example could also have been analyzed by the McNemar test for the significance of changes (discussed on pages 63 to 67). With the data shown in Table 5.5,
    χ² = (|A − D| − 1)²/(A + D)        (5.2)
       = (|59 − 26| − 1)²/(59 + 26)
       = 32²/85
       = 12.05

Table C shows that χ² ≥ 12.05 with df = 1 has a probability of occurrence under H0 of p < .001. This finding is not in conflict with that yielded by the sign test. The difference between the two findings is due to the limitations of the chi-square table used.
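The McNemar cross-check is a single line of arithmetic; a minimal sketch with the changers of Table 5.5 (A = 59, D = 26):

```python
# Formula (5.2) with correction for continuity: (|A - D| - 1)^2 / (A + D)
chi_square = (abs(59 - 26) - 1) ** 2 / (59 + 26)  # 32^2 / 85
assert round(chi_square, 2) == 12.05
```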
Summary of procedure. These are the steps in the use of the sign test:
1. Determine the sign of the difference between the two members of each pair.
2. By counting, determine the value of N = the number of pairs whose differences show a sign.
3. The method for determining the probability associated with the occurrence under H0 of a value as extreme as the observed value of x depends on the size of N:
a. If N is 25 or smaller, Table D shows the one-tailed p associated with a value as small as the observed value of x = the number of fewer signs. For a two-tailed test, double the value of p shown in Table D.
b. If N is larger than 25, compute the value of z, using formula (5.4). Table A gives one-tailed p's associated with values as extreme as various values of z. For a two-tailed test, double the value of p shown in Table A.
If the p yielded by the test is equal to or less than α, reject H0.

Power-Efficiency
The power-efficiency of the sign test is about 95 per cent for N = 6, but it declines as the size of the sample increases, to an eventual (asymptotic) efficiency of 63 per cent. For discussions of the power-efficiency of the sign test for large samples, see Mood (1954) and Walsh (1946).

References

For other discussions of the sign test, the reader is directed to Dixon and Massey (1951, chap. 17), Dixon and Mood (1946), McNemar (1955, pp. 357-358), Moses (1952a), and Walsh (1946).

THE WILCOXON MATCHED-PAIRS SIGNED-RANKS TEST
Function
The test we have just discussed, the sign test, utilizes information simply about the direction of the differences within pairs. If the relative magnitude as well as the direction of the differences is considered, a more powerful test can be made. The Wilcoxon matched-pairs signed-ranks test does just that: it gives more weight to a pair which shows a large difference between the two conditions than to a pair which shows a small difference. The Wilcoxon test is a most useful test for the behavioral scientist. With behavioral data, it is not uncommon that the researcher can (a) tell which member of a pair is "greater than" which, and (b) rank the differences between pairs in order of
absolute size. That is, he can make the judgment of "greater than" between any pair's two performances, and also can make that judgment between any two difference scores arising from any two pairs. With such information,¹ the experimenter may use the Wilcoxon test.

Rationale and Method
Let di = the difference score for any matched pair, representing the difference between the pair's scores under the two treatments. Each pair has one di. To use the Wilcoxon test, rank all the di's without regard to sign: give the rank of 1 to the smallest di, the rank of 2 to the next smallest, etc. When one ranks scores without respect to sign, a di of −1 is given a lower rank than a di of either −2 or +2. Then to each rank affix the sign of the difference. That is, indicate which ranks arose from negative di's and which ranks arose from positive di's.

Now if treatments A and B are equivalent, that is, if H0 is true, we should expect to find some of the larger di's favoring treatment A and some favoring treatment B. That is, some of the larger ranks would come from positive di's while others would come from negative di's. Thus, if we summed the ranks having a plus sign and summed the ranks having a minus sign, we would expect the two sums to be about equal under H0. But if the sum of the positive ranks is very much different from the sum of the negative ranks, we would infer that treatment A differs from treatment B, and thus we would reject H0. That is, we reject H0 if either the sum of the ranks for the negative di's or the sum of the ranks for the positive di's is too small.
Ties. Occasionally the two scores of any pair are equal. That is, no difference between the two treatments is observed for that pair, so that d = 0. Such pairs are dropped from the analysis. This is the same practice that we follow with the sign test. N = the number of matched pairs minus the number of pairs whose d = 0.

Another sort of tie can occur. Two or more d's can be of the same size. We assign such tied cases the same rank. The rank assigned is the average of the ranks which would have been assigned if the d's had differed slightly. Thus three pairs might yield d's of −1, −1, and +1. Each pair would be assigned the rank of 2, for (1 + 2 + 3)/3 = 2. Then the next d in order would receive the rank of 4, because ranks 1, 2, and

¹ To require that the researcher have ordinal information not only within pairs but also concerning the differences between pairs seems to be tantamount to requiring measurement in the strength of an ordered metric scale. In strength, an ordered metric scale lies between an ordinal scale and an interval scale. For a discussion of ordered metric scaling, see Coombs (1950) and Siegel (1956).
3 have already been used. If two pairs had yielded d's of −1, both would receive the rank of 1.5, and the next largest d would receive the rank of 3. The practice of giving tied observations the average of the ranks they would otherwise have gotten has a negligible effect on T, the statistic on which the Wilcoxon test is based. For applications of these principles for the handling of ties, see the example for large samples, later in this section.

Small samples. Let T = the smaller sum of like-signed ranks. That
is, T is either the sum of the positive ranks or the sum of the negative ranks, whichever sum is smaller. Table G of the Appendix gives various values of T and their associated levels of significance. That is, if an observed T is equal to or less than the value given in the body of Table G under a particular significance level for the observed value of N, the null hypothesis may then be rejected at that level of significance.

Table G is adapted for use with both one-tailed and two-tailed tests. A one-tailed test may be used if in advance of examining the data the experimenter predicts the sign of the smaller sum of ranks. That is, as is the case with all one-tailed tests, he must predict in advance the direction of the differences. For example, if T = 3 were the sum of the negative ranks when N = 9, one could reject H0 at the α = .02 level if H1 had been that the two groups would differ, and one could reject H0 at the α = .01 level if H1 had been that the sum of negative ranks would be the smaller sum.

Example for Small Samples
Suppose a child psychologist wished to test whether nursery school attendance has any effect on children's social perceptiveness. He scores social perceptiveness by rating children's responses to a group of pictures which depict a variety of social situations, asking a standard group of questions about each picture. By this method he obtains a score between 0 and 100 for each child.

Although the experimenter is confident that a higher score represents higher social perceptiveness than a lower score, he is not sure that the scores are sufficiently exact to be treated numerically. That is, he is not willing to say that a child whose score is 60 is twice as socially perceptive as a child whose score is 30, nor is he willing to say that the difference between scores of 60 and 40 is exactly twice as large as the difference between scores of 40 and 30. However, he is confident that the difference between a score of, say, 60 and one of 40 is greater than the difference between a score of 40 and one of 30. That is, he cannot assert that the differences are numerically exact, but he does maintain that they are sufficiently meaningful that they may appropriately be ranked in order of absolute size.
To test the effect of nursery school attendance on children's social perceptiveness scores, he obtains 8 pairs of identical twins to serve as subjects. At random, 1 twin from each pair is assigned to attend nursery school for a term. The other twin in each pair is to remain out of school. At the end of the term, the 16 children are each given the test of social perceptiveness.
i. Null Hypothesis. H0: the social perceptiveness of "home" and "nursery school" children does not differ. In terms of the Wilcoxon test, the sum of the positive ranks = the sum of the negative ranks. H1: the social perceptiveness of the two groups of children differs, i.e., the sum of the positive ranks ≠ the sum of the negative ranks.

ii. Statistical Test. The Wilcoxon matched-pairs signed-ranks test is chosen because the study employs two related samples and it yields difference scores which may be ranked in order of absolute magnitude.

iii. Significance Level. Let α = .05. N = the number of pairs (8) minus any pairs whose d is zero.

iv. Sampling Distribution. Table G gives critical values from the sampling distribution of T for N ≤ 25.

v. Rejection Region. Since the direction of the difference is not predicted, a two-tailed region of rejection is appropriate. The region of rejection consists of all values of T which are so small that the probability associated with their occurrence under H0 is equal to or less than α = .05 for a two-tailed test.
vi. Decision. In this fictitious study, the 8 pairs of "home" and "nursery school" children are given the test in social perceptiveness after the latter have been in school for one term. Their scores are given in Table 5.6. The table shows that only 2 pairs of twins, c and g, showed differences in the direction of greater social perceptiveness in the "home" twin. And these difference scores are among the smallest: their ranks are 1 and 3.

The smaller of the sums of the like-signed ranks = 1 + 3 = 4 = T. Table G shows that for N = 8, a T of 4 allows us to reject the null hypothesis at α = .05 for a two-tailed test. Therefore we reject H0 in favor of H1 in this fictitious study, concluding that nursery school experience does affect the social perceptiveness of children.

It is worth noting that the data in Table 5.6 are amenable to treatment by the sign test (pages 68 to 75), a less powerful test. For that test, x = 2 and N = 8. Table D gives the probability associated with such an occurrence under H0 as p = 2(.145) = .290 for a two-tailed test. With the sign test, therefore, our decision would be to accept H0 when α = .05,
TABLE 5.6. SOCIAL PERCEPTIVENESS SCORES OF "HOME" AND "NURSERY SCHOOL" CHILDREN
(Artificial data)
whereas the Wilcoxon test enabled us to reject H0 at that level. This difference is not surprising, for the Wilcoxon test utilizes more of the information in the data. Notice that the Wilcoxon test takes into consideration the fact that the 2 minus d's are among the smallest d's observed, whereas the sign test is unaffected by the relative magnitude of the d's.

Large samples. When N is larger than 25, Table G cannot be used. However, it can be shown that in such cases the sum of the ranks, T, is
practically normally distributed, with

    Mean = μT = N(N + 1)/4

and

    Standard deviation = σT = √[N(N + 1)(2N + 1)/24]

Therefore

    z = (T − μT)/σT = [T − N(N + 1)/4]/√[N(N + 1)(2N + 1)/24]        (5.5)
is approximately normally distributed with zero mean and unit variance. Thus Table A of the Appendix gives the probabilities associated with the occurrence under H0 of various values as extreme as an observed z computed from formula (5.5).
To show what an excellent approximation this is, even for small samples, we shall treat the data given in Table 5.6, where N = 8, by this large-sample approximation. In that case, T = 4. Inserting the values
in formula (5.5), we have

z = [4 − (8)(9)/4] / sqrt[(8)(9)(17)/24] = −14/sqrt(51) = −1.96
Reference to Table A reveals that the probability associated with the occurrence under H0 of a z as extreme as −1.96 is p = 2(.025) = .05 for a two-tailed test. This is the same p we found by using Table G for the same data.
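The large-sample check just performed is easy to reproduce. The sketch below (ours, written for this illustration; the function name is not from the text) computes z from formula (5.5) for the Table 5.6 data, where T = 4 and N = 8:

```python
import math

def wilcoxon_z(T, N):
    """z approximation of formula (5.5): (T - mu_T) / sigma_T."""
    mu = N * (N + 1) / 4                                # mean of T under H0
    sigma = math.sqrt(N * (N + 1) * (2 * N + 1) / 24)   # s.d. of T under H0
    return (T - mu) / sigma

z = wilcoxon_z(4, 8)
print(round(z, 2))  # -1.96, agreeing with the Table G result at alpha = .05
```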
Example for Large Samples
Inmates in a federal prison served as subjects in a decision-making study.¹ First the prisoners' utility (subjective value) for cigarettes was measured individually, cigarettes being negotiable in prison society. Using each subject's utility function, the experimenter then attempted to predict the decisions the man would make in a game in which he repeatedly had to choose between two different (varying) gambles, and in which cigarettes might be won or lost. The first hypothesis tested was that the experimenter could predict the subjects' decisions by means of their utility functions better than he could by assuming that their utility for cigarettes was equal to the cigarettes' objective value and therefore predicting the "rational" choice in terms of objective value. This hypothesis was confirmed. However, as was expected, some responses were not predicted successfully by this hypothesis of maximization of expected utility. Anticipating this outcome, the experimenter had hypothesized that such errors in prediction would be due to the indifference of the subjects between the two gambles offered. That is, a prisoner might find two gambles either equally attractive or equally unattractive, and therefore be indifferent in the choice between them. Such choices would be difficult to predict. But in such choices, it was reasoned that the subject might vacillate considerably before stating a decision. That is, the latency time between the offer of the gamble and his statement of a decision would be high. The second hypothesis, then, was that the latency times for those choices which would not be predicted successfully by maximization of expected utility would be longer than the latency times for those choices which would be successfully predicted.
i. Null Hypothesis. H0: there is no difference between the latency times of incorrectly predicted and correctly predicted decisions. H1: the latency times of incorrectly predicted decisions are longer than the latency times of correctly predicted decisions.
ii. Statistical Test. The Wilcoxon matched-pairs signed-ranks test is selected because the data are difference scores from two related samples (correctly predicted choices and incorrectly predicted choices made by the same prisoners), where each subject is used as his own control.
iii. Significance Level. Let α = .01. N = 30 = the number of prisoners who served as subjects. (This N will be reduced if any prisoner's d is zero.)

¹ Hurst, P. M., and Siegel, S. 1956. Prediction of decisions from a higher ordered metric scale of utility. J. exp. Psychol., 52, 138-144.
iv. Sampling Distribution. Under H0, the values of z as computed from formula (5.5) are normally distributed with zero mean and unit variance. Thus Table A gives the probability associated with the occurrence under H0 of values as extreme as an obtained z.
v. Rejection Region. Since the direction of the difference is predicted, the region of rejection is one-tailed. If the difference is in the predicted direction, T, the smaller of the sums of the like-signed ranks, will be the sum of the ranks of those prisoners whose d's are in the opposite direction from that predicted. The region of rejection consists of all z's (obtained from data with such T's) which are so extreme that the probability associated with their occurrence under H0 is equal to or less than α = .01.
vi. Decision. A difference score (d) was obtained for each subject by subtracting his median time in coming to correctly predicted decisions from his median time in coming to incorrectly predicted decisions. Table 5.7 gives these values of d for the 30 prisoners, and gives the other information necessary for computing the Wilcoxon test. A minus d indicates that the prisoner's median time in coming to correctly predicted decisions was longer than his median time in coming to incorrectly predicted decisions.
For the data in Table 5.7, T = 53.0, the smaller of the sums of the like-signed ranks. We apply formula (5.5):

z = [T − N(N + 1)/4] / sqrt[N(N + 1)(2N + 1)/24]
  = [53.0 − (26)(27)/4] / sqrt[(26)(27)(53)/24]
  = −3.11
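This computation can be checked mechanically. The sketch below (ours, not part of the original text) applies formula (5.5) with T = 53.0 and N = 26, and obtains the one-tailed normal probability from the error function rather than from Table A:

```python
import math

def wilcoxon_z(T, N):
    # Formula (5.5): z = (T - N(N+1)/4) / sqrt(N(N+1)(2N+1)/24)
    mu = N * (N + 1) / 4
    sigma = math.sqrt(N * (N + 1) * (2 * N + 1) / 24)
    return (T - mu) / sigma

z = wilcoxon_z(53.0, 26)
p_one_tailed = 0.5 * math.erfc(abs(z) / math.sqrt(2))  # upper-tail normal area
print(round(z, 2), round(p_one_tailed, 4))  # -3.11 0.0009
```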
TABLE 5.7. DIFFERENCE IN MEDIAN TIME BETWEEN PRISONERS' CORRECTLY AND INCORRECTLY PREDICTED DECISIONS
(Columns: Prisoner; d; Rank of d; Rank with less frequent sign. Sum of ranks with the less frequent sign: T = 53.0)
Notice that we have N = 26, for 4 of the prisoners' median times were the same for both correctly and incorrectly predicted decisions and thus their d's were 0. Notice also that our T is the sum of the ranks of those prisoners whose d's are in the opposite direction from predicted, and therefore we are justified in proceeding with a one-tailed test. Table A shows that a z as extreme as −3.11 has a one-tailed probability associated with its occurrence under H0 of p = .0009. Inasmuch as this p is less than α = .01 and thus the value of z is in the region of rejection, our decision is to reject H0 in favor of H1. We conclude that the prisoners' latency times for incorrectly predicted decisions were significantly longer than their latency times for correctly predicted decisions. This conclusion lends some support to the idea that the incorrectly predicted decisions concerned gambles which were equal, or approximately equal, in expected utility to the subjects.
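The procedure summarized below can be sketched in a few lines of Python. This is our own illustration, not code from the text: it drops zero d's, assigns average ranks to tied |d|'s, and returns T together with the reduced N. Applied to the Table 5.6 differences, it reproduces T = 4 with N = 8 (the two negative d's, −1 and −4, receive ranks 1 and 3):

```python
def wilcoxon_T(diffs):
    """Steps 1-5 of the Wilcoxon matched-pairs signed-ranks procedure."""
    d = [x for x in diffs if x != 0]          # drop zero differences
    n = len(d)
    order = sorted(range(n), key=lambda i: abs(d[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:                              # average ranks for tied |d|
        j = i
        while j + 1 < n and abs(d[order[j + 1]]) == abs(d[order[i]]):
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    pos = sum(r for r, x in zip(ranks, d) if x > 0)
    neg = sum(r for r, x in zip(ranks, d) if x < 0)
    return min(pos, neg), n                   # T and the usable N

print(wilcoxon_T([19, 27, -1, 6, 7, 13, -4, 3]))  # (4.0, 8)
```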
Summary of procedure. These are the steps in the use of the Wilcoxon matched-pairs signed-ranks test:
1. For each matched pair, determine the signed difference (di) between the two scores.
2. Rank these d's without respect to sign. With tied d's, assign the average of the tied ranks.
3. Affix to each rank the sign (+ or −) of the d which it represents.
4. Determine T = the smaller of the sums of the like-signed ranks.
5. By counting, determine N = the total number of d's having a sign.
6. The procedure for determining the significance of the observed value of T depends on the size of N:
a. If N is 25 or less, Table G shows critical values of T for various sizes of N. If the observed value of T is equal to or less than that given in the table for a particular significance level and a particular N, H0 may be rejected at that level of significance.
b. If N is larger than 25, compute the value of z as defined by formula (5.5). Determine its associated probability under H0 by referring to Table A. For a two-tailed test, double the p shown. If the p thus obtained is equal to or less than α, reject H0.
Power-Efficiency
When the assumptions of the parametric t test (see page 19) are in fact met, the asymptotic efficiency near H0 of the Wilcoxon matched-pairs signed-ranks test compared with the t test is 3/π = 95.5 per cent (Mood, 1954). This means that 3/π is the limiting ratio of sample sizes necessary for the Wilcoxon test and the t test to attain the same power. For small samples, the efficiency is near 95 per cent.
References
The reader may find other discussions of the Wilcoxon matched-pairs signed-ranks test in Mood (1954), Moses (1952a), and Wilcoxon (1947; 1949).
THE WALSH TEST
Function
If the experimenter can assume that the difference scores he observes in two related samples are drawn from symmetrical populations, he may use the very powerful test developed by Walsh. Notice that the assumption is not that the d's are from normal populations (which is the assumption of the parametric t test), and notice that the d's do not even have to be from the same population. What the test does assume is that the populations are symmetrical, so that the mean is an accurate representation of central tendency and is equal to the median. The Walsh test requires measurement in at least an interval scale.
Method
To use the Walsh test, one first obtains difference scores (d's) for each of the N pairs. These d's are then arranged in order of size, with the sign of each d taken into consideration in this arrangement. Let d1 = the lowest difference score (this may well be a negative d), d2 = the next lowest difference score, etc. Thus d1 ≤ d2 ≤ d3 ≤ d4 ≤ · · · ≤ dN.
H0 is that the d's are drawn from a population whose median = 0 (or from a group of populations whose common median = 0). In a symmetrical distribution, the mean and the median coincide. The Walsh test assumes that the d's are from populations with symmetrical distributions. Therefore H0 is that the average of the difference scores (μd) is zero. For a two-tailed test, H1 is that μd ≠ 0. For a one-tailed test, H1 may be either that μd > 0 or that μd < 0.
Table H of the Appendix is used to determine the significance of various results under Walsh's test. To use this table, one must know the observed value of N (the number of pairs), the nature of H1, and the numerical values of every di.
Table H gives significant values for both one-tailed and two-tailed tests. The two right-hand columns give the values which permit rejecting H0 at the stated significance level. If H1 is that μd ≠ 0, then the null hypothesis may be rejected if either of the tabled values is observed. If H1 is directional, then the null hypothesis may be rejected if the values tabled under that H1 are observed. The left-hand column shows various values of N, from 4 to 15. Next to that column are two columns which show the significance levels associated with the tabled values.
Since Table H is somewhat more complicated than most, we will give several examples of its use.
Suppose N = 5. For a two-tailed test, we may reject H0 at the α = .125 level if ½(d4 + d5) is less than zero or if ½(d1 + d2) is larger than zero. And we may reject H0 at α = .062 if d5 is less than zero or if d1 is larger than zero.
Now suppose that N = 5 and that we had predicted in advance that our difference scores would be larger than zero. Then if ½(d1 + d2) is larger than zero, we can reject H0 at α = .062. And if d1 is larger than zero, we can reject H0 at α = .031.
On the other hand, suppose we had predicted in advance that our difference scores would be negative. That is, H1 is that μd < 0. Then, if N = 5, if ½(d4 + d5) were less than zero we could reject H0 at α = .062.
And if d5 were less than zero, we could reject H0 at α = .031.
Now for larger values of N, Table H is somewhat more complicated. The two right-hand columns give alternative values, the alternatives being separated by a comma. Accompanying these are "max" and "min." "Max" means that we should select that alternative which is larger; "min" means that we should select that alternative which is smaller. For example, at N = 6, suppose H1 is that μd < 0. We may reject H0 in favor of that H1 at α = .047 if d5 or ½(d4 + d6), whichever is larger, is less than zero.
In the example of the application of this test given below, the use of Table H is illustrated again.
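The N = 5 criteria quoted above from Table H can be expressed directly in code. This sketch is ours, for illustration only; it hard-codes just the N = 5 rows as quoted in the text, not the full table:

```python
def walsh_n5_two_tailed(diffs):
    """Two-tailed Table H criteria for N = 5, as quoted in the text."""
    assert len(diffs) == 5
    d = sorted(diffs)  # d[0] is d1 (the lowest), d[4] is d5 (the highest)
    # alpha = .125: reject if (d4 + d5)/2 < 0 or (d1 + d2)/2 > 0
    reject_125 = (d[3] + d[4]) / 2 < 0 or (d[0] + d[1]) / 2 > 0
    # alpha = .062: reject if d5 < 0 or d1 > 0
    reject_062 = d[4] < 0 or d[0] > 0
    return reject_125, reject_062

print(walsh_n5_two_tailed([2, 5, 1, 4, 3]))   # (True, True): all d's positive
print(walsh_n5_two_tailed([-3, 2, 5, 1, 4]))  # (False, False)
```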
Example
In a study designed to induce repression, Lowenfeld¹ had his fifteen subjects learn 10 nonsense syllables. He then attempted to associate negative affect to 5 of these (selected at random from the 10) by giving the subjects an electric shock whenever any one of the 5 syllables was exposed tachistoscopically. After a lapse of 48 hours, the subjects were brought back to the experimental room and asked to recall the list of nonsense syllables. The prediction was that they would recall more of the nonshock syllables than the shock syllables.
i. Null Hypothesis. H0: the median difference between the number of nonshock syllables remembered and the number of shock syllables remembered is zero. That is, subjects will recall the two groups of syllables equally well. H1: the number of nonshock syllables remembered is larger than the number of shock syllables remembered. That is, the median difference will be larger than zero.
ii. Statistical Test. The Walsh test was chosen because the study uses two related samples (each subject serving as his own control), and because the assumption that the numerical difference scores came from symmetrical populations seemed tenable.
iii. Significance Level. Let α = .05. N = 15 = the number of subjects who served in the study, each being exposed to both shock and nonshock syllables.
iv. Sampling Distribution. Table H gives the associated probability of occurrence under H0 for various values of the statistical test when N ≤ 15.

¹ Lowenfeld, J. 1955. An experiment relating the concepts of repression, subception, and perceptual defense. Unpublished doctor's dissertation, The Pennsylvania State University.
v. Rejection Region. Since the direction of the differences was predicted in advance, a one-tailed region of rejection will be used. Since H1 is that μd > 0, H0 will be rejected if any of the values given in the right-hand column of the table for N = 15 should occur, since the levels of significance for all of the values tabled for N = 15 are less than α = .05.
vi. Decision. The number of shock and nonshock syllables recalled by each subject after 48 hours is given in Table 5.8, which

TABLE 5.8. NUMBER OF SHOCK AND NONSHOCK SYLLABLES RECALLED AFTER 48 HOURS
(Columns: Subject; Number of nonshock syllables recalled; Number of shock syllables recalled; d)
also gives the d for each. Thus subject a recalled 5 of the nonshock syllables but only 2 of the shock syllables; his d = 5 − 2 = 3.
Notice that the smallest d is −1. Thus d1, the lowest d taking sign into consideration, = −1. Five of the d's are −1's; therefore d1 = −1, d2 = −1, d3 = −1, d4 = −1, and d5 = −1. The next smallest d's are 1's. Three subjects (h, j, and o) have d's of 1. Therefore d6 = 1, d7 = 1, and d8 = 1. Three of the d's are 2's. Thus d9 = 2, d10 = 2, and d11 = 2. The largest d's are 3's. There are four of them. Thus d12 = 3, d13 = 3, d14 = 3, and d15 = 3.
Now Table H shows that for N = 15, the one-tailed test for the H1 that μd > 0 at α = .047 is

Minimum [½(d1 + d12), ½(d2 + d11)] > 0
The "minimum" means that we should choose the smaller of the two values given, in terms of our observed values of d. That is, if ½(d1 + d12) or ½(d2 + d11), whichever is smaller, is larger than zero, then we may reject H0 at α = .047.
As we have shown, d1 = −1, d12 = 3, d2 = −1, and d11 = 2. Substituting these values, we have

Minimum [½(−1 + 3), ½(−1 + 2)] = minimum [½(2), ½(1)] = ½(1)

We see that for our data the smaller of these two values is ½(1) = ½. Since this value is larger than zero, we can reject H0 at α = .047.
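The same criterion can be checked mechanically. This sketch is ours; it hard-codes only the N = 15, α = .047 row of Table H quoted above, and applies it to the difference scores of the worked example:

```python
def walsh_n15_greater(diffs):
    """One-tailed (H1: mu_d > 0) Table H criterion for N = 15 at alpha = .047."""
    assert len(diffs) == 15
    d = sorted(diffs)                 # d[0] = d1, ..., d[14] = d15
    stat = min((d[0] + d[11]) / 2,    # (d1 + d12)/2
               (d[1] + d[10]) / 2)    # (d2 + d11)/2
    return stat, stat > 0

# five -1's, three 1's, three 2's, four 3's, as in the worked example
diffs = [-1] * 5 + [1] * 3 + [2] * 3 + [3] * 4
print(walsh_n15_greater(diffs))  # (0.5, True): reject H0 at alpha = .047
```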
under Ho associated with the values which
pccurred is lessthan a = .05, we decide to reject Ho in favor of H~.* We concludethat the number of nonshocksyllables rememberedwas significantly larger than the number of shock syllables remembered, a conclusion which supports the theory that negative affect induces repression. Summary of procedure. These are the steps in the use of the Walsh test:
1. Determinethe signeddifferencescore(d;) for eachmatchedpair. 2. Determine N, the number of matched pairs. 3. Arrange the d s in order of increasingsize,from di to d~. Take the
signof the d into accountin this ordering. Thus d~is the largestnegative d, and dNis the largestpositived. 4. Consult Table H to determine whether Ho may be rejected in favor
of Hy with the observedvaluesof d>,d2,d3,..., gN. The techniquepf using Table H is explainedaboveat somelength. Power-Efficiency
Power-Efficiency
When compared with the most powerful test, the parametric t test, the Walsh test has power-efficiency (in the sense defined in Chap. 3) of 95 per cent for most values of N and α. Its power-efficiency is as high as 99 per cent (for N = 9 and α = .01, one-tailed test) and is nowhere lower than 87.5 per cent (for N = 10 and α = .06, one-tailed test). For more information on its power-efficiency, see Walsh (1949b).
References
For other discussions of the Walsh test, the reader is referred to Dixon and Massey (1951, chap. 17) and to Walsh (1949a; 1949b).

* Using the nonparametric Wilcoxon matched-pairs signed-ranks test, Lowenfeld came to the same decision.
THE RANDOMIZATION TEST FOR MATCHED PAIRS
Function
Randomization tests are nonparametric tests which not only have practical value in the analysis of research data but also have heuristic value in that they help expose the underlying nature of nonparametric tests in general. With a randomization test, we can obtain the exact probability under H0 associated with the occurrence of our observed data, and we can do this without making any assumptions about normality or homogeneity of variance. Randomization tests, under certain conditions, are the most powerful of the nonparametric techniques, and may be used whenever measurement is so precise that the values of the scores have numerical meaning.
Rationale and Method
Consider the small-sample example to which we earlier applied the Wilcoxon matched-pairs signed-ranks test (discussed on pages 77 to 78). In that study, we had 8 matched pairs, and one member of each pair was randomly assigned to each condition: one twin attended nursery school while the other stayed at home. The research hypothesis predicted differences between these two groups in "social perceptiveness" because of the different treatment conditions. The null hypothesis was that the two conditions produced no difference in social perceptiveness. It will be remembered that the two members of any matched pair were assigned to the conditions by some random method, say by tossing a coin.
For this discussion, let us assume that in the fictitious research under discussion measurement was achieved in the sense of an interval scale. Now if the null hypothesis that there is no treatment effect were really true, we would have obtained the same social perceptiveness scores if both groups had attended the nursery school or if both groups had stayed at home. That is, under H0 these children would have scored as they did regardless of the conditions. We may not know why the children differ among themselves in social perceptiveness, but under H0 we do know how the signs of the difference scores arose: they resulted from the random assignment of the children to the two conditions. For example, for the two twins in pair a we observed a difference of 19 points between their two scores in social perceptiveness. Under H0, we presume that this d was +19 rather than −19 simply because we happened to assign to the nursery school group that twin who would have been higher in social perceptiveness anyway. The d was +19 rather than −19 simply because when we were assigning the twins to treatments our coin fell on head rather than on tail. By this reasoning, under H0 every difference score we observed could equally likely have had the opposite sign.
The difference scores that we observed in our sample in that study happened to be

+19  +27  −1  +6  +7  +13  −4  +3

Under H0, if our coin tosses had been different, they might just as probably have been

−19  −27  +1  −6  −7  −13  +4  −3

or if the coins had fallen still another way they would have been

+19  −27  +1  −6  −7  −13  −4  +3

As a matter of fact, if the null hypothesis is true, then there are 2^N = 2^8 equally likely outcomes, and the one which we observe depends entirely on how the coin landed for each of the 8 tosses when we assigned the twins to the two groups. This means that associated with the sample of scores we observed there are many other possible ones, the total possible combinations being 2^8 = 256. Under H0, any one of these 256 possible outcomes was just as likely to occur as the one which did occur.
For each of the possible outcomes there is a sum of the differences: Σdi. Now many of the 256 Σdi are near zero, about what we should expect if H0 were true. A few Σdi are far from zero. These are for those combinations in which nearly all of the signs are plus or are minus. It is such combinations which we should expect if the population mean under one of the treatments exceeds that under the other, that is, if H0 is false.
If we wish to test H0 against some H1, we set up a region of rejection consisting of the combinations whose Σdi is largest. Suppose α = .05. Then the region of rejection consists of that 5 per cent of the possible combinations which contains the most extreme values of Σdi. In the example under discussion, 256 possible outcomes are equally likely under H0. The region of rejection consists of the 12 most extreme possible outcomes, for (.05)(256) = 12.8. Under the null hypothesis, the probability that we will observe one of these 12 extreme outcomes is p = 12/256 = .047. If we actually observe one of those extreme outcomes which is included in the region of rejection, we reject H0 in favor of H1.
When a one-tailed test is called for, the region of rejection consists of the same number of samples. However, it consists of that number of the most extreme possible outcomes in one direction, either positive or negative, depending on the direction of the prediction in H1. When a two-tailed test is called for, as is the case in the example under discussion, the region of rejection consists of the most extreme possible outcomes at both the positive and the negative ends of the distribution of possible Σdi. That is, in the example, the 12 outcomes in the region of rejection would include the 6 yielding the largest positive Σdi and the 6 yielding the largest negative Σdi.
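The reasoning just described is short enough to carry out by brute force. This sketch (our illustration, not code from the text) enumerates all 2^8 = 256 sign assignments for the observed differences and counts how many yield a Σdi at least as extreme, in either direction, as the observed +70; exactly 12 do, giving p = 12/256 = .047:

```python
from itertools import product

def randomization_p_two_tailed(diffs):
    """Exact two-tailed randomization test for matched-pair differences."""
    n = len(diffs)
    observed = sum(diffs)
    # every one of the 2^n sign assignments is equally likely under H0
    sums = [sum(s * d for s, d in zip(signs, diffs))
            for signs in product((1, -1), repeat=n)]
    extreme = sum(1 for s in sums if abs(s) >= abs(observed))
    return extreme / 2 ** n

p = randomization_p_two_tailed([19, 27, -1, 6, 7, 13, -4, 3])
print(p)  # 0.046875, i.e. 12/256
```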
Example
i. Null Hypothesis. H0: the two treatments are equivalent. That is, there is no difference in social perceptiveness under the two conditions (attendance at nursery school or staying at home). In social perceptiveness, all 16 observations (8 pairs) are from a common population. H1: the two treatments are not equivalent.
ii. Statistical Test. The randomization test for matched pairs is chosen because of its appropriateness to this design (two related samples, N not cumbersomely large) and because for these (artificial) data we are willing to consider that its requirement of measurement in at least an interval scale is met.
iii. Significance Level. Let α = .05. N = the number of pairs = 8.
iv. Sampling Distribution. The sampling distribution consists of the permutation of the signs of the differences to include all possible (2^N) occurrences of Σdi. In this case, 2^N = 2^8 = 256.
v. Rejection Region. Since H1 does not predict the direction of
the differences, a two-tailed test is used. The region of rejection consists of those 12 outcomes which have the most extreme Σd's, 6 being the most extreme positive Σd's and 6 being the most extreme negative Σd's.
vi. Decision. The data of this study are shown in Table 5.6. The d's observed were:

+19  +27  −1  +6  +7  +13  −4  +3

For these d's, Σdi = +70.
Table 5.9 shows the 6 possible outcomes with the most extreme

TABLE 5.9. THE SIX MOST EXTREME POSSIBLE POSITIVE OUTCOMES FOR THE d's SHOWN IN TABLE 5.6
(These constitute one tail of the rejection region for the randomization test when α = .05)

Outcome   (1)    (2)    (3)    (4)    (5)    (6)*
          +19    +19    +19    +19    +19    +19
          +27    +27    +27    +27    +27    +27
           +1     −1     +1     +1     −1     −1
           +6     +6     +6     +6     +6     +6
           +7     +7     +7     +7     +7     +7
          +13    +13    +13    +13    +13    +13
           +4     +4     +4     −4     +4     −4
           +3     +3     −3     +3     −3     +3
Σdi        80     78     74     72     72     70
Σd's at the positive end of the sampling distribution. These 6 outcomes constitute one tail of the two-tailed region of rejection for N = 8. Outcome (6) (with an asterisk) is the outcome we actually observed. The probability of its occurrence or a set more extreme under H0 is p = .047. Since this p is less than α = .05, our decision in this fictitious study is to reject the null hypothesis of no condition differences.
Large samples. If the number of pairs exceeds, say, N = 12, the randomization test becomes unwieldy. For example, if N = 13, the number of possible outcomes is 2^13 = 8,192. Thus the region of rejection for α = .05 would consist of (.05)(8,192) = 409.6 possible extreme outcomes. The computation necessary to specify the region of rejection would therefore be quite tedious.
Because of the computational cumbersomeness of the randomization test when N is at all large, it is suggested that the Wilcoxon matched-pairs signed-ranks test be used in such cases. In the Wilcoxon test, ranks are substituted for numbers. It provides a very efficient alternative to the randomization test because it is in fact a randomization test on the ranks.¹ Even if we did not have the use of Table G, it would not be too tedious to compute the test by permuting the signs (+ and −) on the set of ranks
in all possible ways and then tabulating the upper and lower significance points for a given sample size.
If N is larger than 25, and if the differences show little variability, another alternative is available. If the di are all about the same size, so that d²max (the square of the largest observed difference) is small relative to Σdi², then the central-limit theorem (see Chap. 2) may be expected to hold (Moses, 1952a). Under these conditions, we can expect Σdi to be approximately normally distributed with

Mean = 0
and
Standard deviation = sqrt(Σdi²)

and therefore

z = Σdi / sqrt(Σdi²)        (5.6)

is approximately normally distributed with zero mean and unit variance. Table A of the Appendix gives the probability associated with the occurrence under H0 of values as extreme as any z obtained through the application of formula (5.6).

¹ In this randomization test on ranks, all 2^N permutations of the signs of the ranks are considered, and the most extreme possible constitute the region of rejection. For the data shown in Table 5.6, there are 2^8 = 256 possible and equally likely combinations of signed ranks under H0. The curious reader can satisfy himself that the sample of signed ranks observed is among the 12 most extreme possible outcomes and thus leads us to reject H0 at α = .05, which was our decision based on Table G. By this randomization method, Table G, the table of significant values of T, can be reconstructed.
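Formula (5.6) is a one-liner in code. The sketch below (ours) applies it, purely mechanically, to the Table 5.6 differences; note that those d's in fact vary considerably, so the little-variability condition the formula requires is not really met there and the result is illustrative only:

```python
import math

def randomization_z(diffs):
    # Formula (5.6): z = sum(d_i) / sqrt(sum(d_i^2))
    return sum(diffs) / math.sqrt(sum(d * d for d in diffs))

z = randomization_z([19, 27, -1, 6, 7, 13, -4, 3])
print(round(z, 2))  # 1.89, i.e. 70 / sqrt(1370)
```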
However, the requirement that the d's show little variability, i.e., that d²max be small relative to Σdi², is not too commonly met. For this reason, and also because the efficiency of the Wilcoxon test (approximately 95 per cent for large samples) is very likely to be superior to that of this large-sample approximation to the randomization test when nonnormal populations are involved, it would seem that the Wilcoxon test is the better alternative when N's are cumbersomely large.
Summary of procedure. When N is small and when measurement is
in at least an interval scale, the randomization test for matched pairs may be used. These are the steps:
1. Observe the values of the various d's and their signs.
2. Determine the number of possible outcomes under H0 for these values: 2^N.
3. Determine the number of possible outcomes in the region of rejection: (α)(2^N).
4. Identify those possible outcomes which are in the region of rejection by choosing from the possible outcomes those with the largest Σd's. For a one-tailed test, the outcomes in the region of rejection are all in one direction (either positive or negative). For a two-tailed test, half of the outcomes in the region of rejection are those with the largest positive Σd's and half are those with the largest negative Σd's.
5. Determine whether the observed outcome is one of those in the region of rejection. If it is, reject H0 in favor of H1.
When N is large, the Wilcoxon matched-pairs signed-ranks test is recommended for use rather than the randomization test. When N is 25 or larger and when the data meet certain specified conditions, an approximation [formula (5.6)] may also be used.
Power-Efficiency
The randomization test for matched pairs, because it uses all of the information in the sample, has power-efficiency of 100 per cent.
References
Discussions of the randomization method are contained in Fisher (1935), Moses (1952a), Pitman (1937a; 1937b; 1937c), Scheffé (1943), and Welch (1937).
DISCUSSION
In this chapter we have presented five nonparametric statistical tests for the case of two related samples (the design in which matched pairs are
used). The comparison and contrast of these tests which are presented below may aid the reader in choosing from among these tests that one which will be most appropriate to the data of a particular experiment.
All the tests but the McNemar test for the significance of changes assume that the variable under consideration has a continuous distribution underlying the scores. Notice that there is no requirement that the measurement itself be continuous; the requirement concerns the variable of which the measurement gives some gross or approximate representation.
The McNemar test for the significance of changes may be used when one or both of the conditions under study has been measured only in the sense of a nominal scale. For the case of two related samples, the McNemar test is unique in its suitability for such data. That is, this test should be used when the data are in frequencies which can only be classified by separate categories which have no relation to each other of the "greater than" type. No assumption of a continuous variable need be made, because this test is equivalent to a test by the binomial distribution with P = Q = ½, where N = the number of changes.
If ordinal measurement within pairs is possible (i.e., if the score of one member of a pair can be ranked as "greater than" the score of the other member of the same pair), then the sign test is applicable. That is, this test is useful for data on a variable which has underlying continuity but which can be measured in only a very gross way. When the sign test is applied to data which meet the conditions of the parametric alternative (the t test), it has power-efficiency of about 95 per cent for N = 6, but its power-efficiency declines as N increases to about 63 per cent for very large samples.
When the measurement is in an ordinal scale both within
and between pairs, the Wilcoxon test should be used. That is, it is applicable when the researcher can meaningfully rank the differences observed for the various matched pairs. It is not uncommon for behavioral scientists to be able to rank difference scores in the order of their absolute size without being able to give truly numerical scores to the observations in each pair. When the Wilcoxon test is used for data which in fact meet the conditions of the t test, its power-efficiency is about 95 per cent for large samples and not much less than that for smaller samples.
If the experimenter can assume that the populations from which he has sampled are both symmetrical and continuous, then the Walsh test is applicable when N is 15 or less. This test requires measurement in at least an interval scale. It has power-efficiency (in the sense previously defined) of about 95 per cent for most values of N and α.
The randomization test should be used whenever N is sufficiently small to make it computationally feasible and when the measurement of the variable is at least in an interval scale. The randomization test uses all
the information in the sample and thus is 100 per cent efficient on data which may properly be analyzed by the t test.
Of course none of these nonparametric tests makes the assumption of normality which is made by the comparable parametric test, the t test.
In summary, we conclude that the McNemar test for the significance of changes should be used for both large and small samples when the measurement of at least one of the variables is merely nominal. For the crudest of ordinal measurement, the sign test should be used. For more refined measurement, the Wilcoxon matched-pairs signed-ranks test may be used in all cases. For N's of 15 or fewer, the Walsh test may be used. If interval measurement is achieved, the randomization test should be used when the N is not so large as to make its computation cumbersome.
CHAPTER 6

THE CASE OF TWO INDEPENDENT SAMPLES
In studying differences between two groups, we may use either related or independent groups. Chapter 5 offered statistical tests for use in a design having two related groups. The present chapter presents statistical tests for use in a design having two independent groups. Like those presented in Chap. 5, the tests presented here determine whether differences in the samples constitute convincing evidence of a difference in the processes applied to them.

Although the merits of using two related samples in a research design are great, to do so is frequently impractical. Frequently the nature of the dependent variable precludes using the subjects as their own controls, as is the case when the dependent variable is length of time in solving a particular unfamiliar problem. A problem can be unfamiliar only once. It may also be impossible to design a study which uses matched pairs, perhaps because of the researcher's ignorance of useful matching variables, or because of his inability to obtain adequate measures (to use in selecting matched pairs) of some variable known to be relevant, or finally because good "matches" are simply unavailable.

When the use of two related samples is impractical or inappropriate, one may use two independent samples. In this design the two samples may be obtained by either of two methods: (a) they may each be drawn at random from two populations, or (b) they may arise from the assignment at random of two treatments to the members of some sample whose origins are arbitrary. In either case it is not necessary that the two samples be of the same size.

An example of random sampling from two populations would be the drawing of every tenth Democrat and every tenth Republican from an alphabetical list of registered voters. This would result in a random sample of registered Democrats and Republicans from the voting area covered by the list, and the number of Democrats would equal the number of Republicans only if the registration of the two parties happened to be substantially equal in that area. Another example would be the drawing of every eighth upperclassman and every twelfth lowerclassman from a list of students in a college.
An example of the random assignment of treatments might occur in a study of the effectiveness of two instructors in teaching the same course. A registration card might be collected from every student enrolled in the course, and at random one half of these cards would be assigned to one instructor and one half to the other.

The usual parametric technique for analyzing data from two independent samples is to apply a t test to the means of the two groups. The t test assumes that the scores (which are summed in the computing of the means) are independent observations from normally distributed populations with equal variances. This test, because it uses means and other statistics arrived at by arithmetical computation, requires that the observations be measured on at least an interval scale.

For a given research problem, the t test may be inapplicable for a variety of reasons. The researcher may find that (a) the assumptions of the t test are unrealistic for his data, (b) he prefers to avoid making the assumptions and thus to give his conclusions greater generality, or (c) his "scores" may not be truly numerical and therefore fail to meet the measurement requirement of the t test. In instances like these, the researcher may choose to analyze his data with one of the nonparametric statistical tests for two independent samples which are presented in this chapter. The comparison and contrast of these tests in the discussion at the conclusion
of the chapter may aid him in choosing from among the tests presented that one which is best suited for the data of his study.

THE FISHER EXACT PROBABILITY TEST
Function
The Fisher exact probability test is an extremely useful nonparametric technique for analyzing discrete data (either nominal or ordinal) when the two independent samples are small in size. It is used when the scores from two independent random samples all fall into one or the other of two mutually exclusive classes. In other words, every subject in both groups obtains one of two possible scores. The scores are represented by frequencies in a 2 × 2 contingency table, like Table 6.1. Groups I and II might be any two independent groups, such as experimentals and controls,

TABLE 6.1. 2 × 2 Contingency Table

                 +        -       Total
    Group I      A        B       A + B
    Group II     C        D       C + D
    Total      A + C    B + D       N
males and females, employed and unemployed, Democrats and Republicans, fathers and mothers, etc. The column headings, here arbitrarily indicated as plus and minus, may be any two classifications: above and below the median, passed and failed, science majors and arts majors, agree and disagree, etc. The test determines whether the two groups differ in the proportion with which they fall into the two classifications. For the data in Table 6.1 (where A, B, C, and D stand for frequencies) it would determine whether Group I and Group II differ significantly in the proportion of pluses and minuses attributed to them.

Method
The exact probability of observing a particular set of frequencies in a 2 × 2 table, when the marginal totals are regarded as fixed, is given by the hypergeometric distribution

    p = [(A + C)! / (A! C!)] [(B + D)! / (B! D!)] / [N! / ((A + B)! (C + D)!)]

and thus

    p = (A + B)! (C + D)! (A + C)! (B + D)! / (N! A! B! C! D!)        (6.1)

That is, the exact probability of the observed occurrence is found by taking the ratio of the product of the factorials of the four marginal totals to N! multiplied by the product of the factorials of the four cell frequencies. (Table S of the Appendix may be helpful in these computations.)

To illustrate the use of formula (6.1), suppose we observe the data
shown in Table 6.2.

TABLE 6.2

                 +     -    Total
    Group I     10     0     10
    Group II     4     5      9
    Total       14     5     19

In that table, A = 10, B = 0, C = 4, and D = 5. The marginal totals are A + B = 10, C + D = 9, A + C = 14, and B + D = 5. N, the total number of independent observations, is 19. The exact probability that these 19 cases should fall in the four cells as
they did may be determined by substituting the observed values in formula (6.1):

    p = (10! 9! 14! 5!) / (19! 10! 0! 4! 5!) = .0108
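Formula (6.1) is easy to evaluate mechanically. The short Python sketch below (the function name and argument layout are illustrative, not from the text) reproduces the probability just computed for Table 6.2.

```python
from math import factorial

def fisher_point_probability(a, b, c, d):
    """Formula (6.1): exact probability of one particular 2 x 2 table
    (cells a, b, c, d) when the marginal totals are regarded as fixed."""
    n = a + b + c + d
    marginals = (factorial(a + b) * factorial(c + d)
                 * factorial(a + c) * factorial(b + d))
    cells = factorial(a) * factorial(b) * factorial(c) * factorial(d)
    return marginals / (factorial(n) * cells)

# Table 6.2: A = 10, B = 0, C = 4, D = 5
print(round(fisher_point_probability(10, 0, 4, 5), 4))  # 0.0108
```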
We determine that the probability of such a distribution of frequencies under H0 is p = .0108. Now the above example was a comparatively simple one to compute because one of the cells (cell B) had a frequency of 0. But if none of the cell frequencies is zero, we must remember that more extreme deviations from the distribution under H0 could occur with the same marginal totals, and we must take into consideration these possible "more extreme" deviations, for a statistical test of the null hypothesis asks: What is the probability under H0 of such an occurrence or of one even more extreme?

For example, suppose the data from a particular study were those given in Table 6.3.

TABLE 6.3

                 +     -    Total
    Group I      1     6      7
    Group II     4     1      5
    Total        5     7     12

With the marginal totals unchanged, a more extreme
occurrence would be that shown in Table 6.4.

TABLE 6.4

                 +     -    Total
    Group I      0     7      7
    Group II     5     0      5
    Total        5     7     12

Thus, if we wish to apply
a statistical test of the null hypothesis to the data given in Table 6.3, we must sum the probability of that occurrence with the probability of the more extreme possible one (shown in Table 6.4). We compute each p by using formula (6.1). Thus we have

    p = (7! 5! 5! 7!) / (12! 1! 6! 4! 1!) = .04419

and

    p = (7! 5! 5! 7!) / (12! 0! 7! 5! 0!) = .00126
Thus the probability of the occurrence in Table 6.3 or of an even more extreme occurrence (shown in Table 6.4) is

    p = .04419 + .00126 = .04545

That is, p = .04545 is the value of p which we use in deciding whether the data in Table 6.3 permit us to reject H0.

The reader can readily see that if the smallest cell value in the contingency table is even moderately large, the Fisher test becomes computationally very tedious. For example, if the smallest cell value is 2, then three exact probabilities must be determined by formula (6.1) and then summed; if the smallest cell value is 3, then four exact probabilities must be found and summed, etc. If the researcher is content to use significance levels rather than exact
values of p, Table I of the Appendix may be used. It eliminates the necessity for the tedious computations illustrated above. Using it, the researcher may determine directly the significance of an observed set of values in a 2 × 2 contingency table. Table I is applicable to data where N is 30 or smaller, and where neither of the totals in the right-hand margin is larger than 15. That is, neither A + B nor C + D may be larger than 15. (The researcher may find that the bottom marginal totals in his data meet this requirement but the right-hand totals do not. Obviously, in that case he may meet the requirement by simply recasting the data, i.e., by shifting the labels at the top of the contingency table to the left margin, and vice versa.)
gin,andviceversa.) Because of its very size, Table I is somewhat more difficult to use th
are most tables of significance values. Therefore we include detailed directions for its use. These are the steps in the use of Table I: 1. Determine the values of A + B and C + D in the data.
2. Find the observedvalue of A + B in Table I under the heing "Totals in Right Margin." 3. In that section of the table, locate the observed value of C+ D
under the sameheading. 4. For the observedvalue of C + D, several possiblevalues of B*
1istedin the table. Find the observedvalueof B amongth~ 5. Now observeyour value of D.
o ibBit
If the observedvalue of D
to or lessthan the valuegivenin the table underyour level of sig 'fi then the observeddata are significantat that level. It should be noted that the significancelevels given in T bl I
approximate.And theyerr on the conservative side. Thusth
probability ofsome datamaybep = 007butTabl
I the observed valueof B isnot includedamongthem,usetheob
d al
A inst i. If A @~din PlsceofB, thenCia usedin phceofDm te 6.
t of
THE
CASE
OF
TWO
INDEPENDENT
SAMPI.ES
cant at a = .01. If the reader requires exact probabilities rather than
significancelevels,he may find thesein Finney (1948,pp. 145-156)or he may compute them by using formula (6.1) in the manner described earlier.
Notice also that the levels of significancegiven in Table I are for one-
tailed regionsof rejection. If a two-tailedrejectionregionis calledfor, double the significancelevel given in Table I.
The reader's understanding of the use of Table I may be aided by an example. We recur to the data given in Table 6.3, for which we have already determined the exact probability by using formula (6.1). For Table 6.3, A + B = 7 and C + D = 5. The reader may find the appropriate section in Table I for such right marginal totals. In that section he will find that three alternative values of B (7, 6, and 5) are tabled. Now in Table 6.3, B = 6. Therefore the reader should use the middle of the three lines of values, that in which B = 6. Now observe the value of D in our data: D = 1 in Table 6.3. Table I shows that D = 1 is significant at the .05 level (one-tailed). This agrees with the exact probability we computed: p = .045.

For a two-tailed test we would double the observed significance level, and conclude that the data in Table 6.3 permit us to reject H0 at the α = 2(.05) = .10 level.
Example

In a study of the personal and social backgrounds of the leaders of the Nazi movement, Lerner and his collaborators¹ compared the Nazi elite with the established and respected elite of the older German society. One such comparison concerned the career histories of the 15 men who constituted the German Cabinet at the end of 1934. These men were categorized in two groups: Nazis and non-Nazis. To test the hypothesis that Nazi leaders had taken political party work as their careers while non-Nazis had come from other, more stable and conventional, occupations, each man was categorized according to his first job in his career. The first job of each was classified as either "stable occupation" or as "party administration and communication." The hypothesis was that the two groups would differ in the proportion with which they were assigned to these two categories.

i. Null Hypothesis. H0: Nazis and non-Nazis show equal proportions in the kind of "first jobs" they had. H1: a greater proportion of Nazis' "first jobs" were in party administration and communication than were the "first jobs" of non-Nazi politicians.

ii. Statistical Test. This study calls for a test to determine the significance of the difference between two independent samples.

¹ Lerner, D., Pool, I. de S., and Schueller, G. K. 1951. The Nazi elite. Stanford, Calif.: Stanford Univer. Press. The data cited in this example are given on p. 101.
Since the measures are both dichotomous and since N is small, the Fisher test is selected.

iii. Significance Level. Let α = .05. N = 15.

iv. Sampling Distribution. The probability of the occurrence under H0 of an observed set of values in a 2 × 2 table may be found by the use of formula (6.1). However, for N ≤ 30 (which is the case with these data), Table I may be used. It gives critical values of D for various levels of significance.

v. Rejection Region. Since H1 predicts the direction of the difference, the region of rejection is one-tailed. H0 will be rejected if the observed cell values differ in the predicted direction and if they are of such magnitude that the probability associated with their occurrence under H0 is equal to or less than α = .05.

vi. Decision. The information concerning the "first jobs" of each member of the German Cabinet late in 1934 is given in Table 6.5.

TABLE 6.5. Field of First Job of 1934 Members of German Cabinet

                 Stable occupations        Party administration
                 (law and civil service)   and communication       Total
    Nazis                  1                        8                9
    Non-Nazis              6                        0                6
    Total                  7                        8               15

For this table, A + B = 9 and C + D = 6. Reference to Table I reveals that with these marginal totals, and with B = 8, the observed D = 0 has a one-tailed probability of occurrence under H0 of p < .005. Since this p is smaller than our level of significance, α = .05, our decision is to reject H0 in favor of H1. We conclude that Nazi and non-Nazi political leaders did differ in the fields of their first jobs.²
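As a cross-check on the Table I decision, the exact one-tailed probability for Table 6.5 can be computed directly from the hypergeometric form of formula (6.1). This is a sketch, not part of the original text; since Table 6.5 is already the most extreme arrangement in the predicted direction (D = 0), a single term suffices.

```python
from math import comb

# Table 6.5: A = 1, B = 8 (Nazis); C = 6, D = 0 (non-Nazis)
# p = C(A+C, A) * C(B+D, B) / C(N, A+B)
p = comb(7, 1) * comb(8, 8) / comb(15, 9)
print(round(p, 4))  # 0.0014, consistent with the tabled p < .005
```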
Tocher's modification. In the literature of statistics, there has been considerable discussion of the applicability of the Fisher test to various kinds of data, inasmuch as there seems to be something arbitrary or improper about considering the marginal totals fixed, for the marginal totals might easily vary if we actually drew repeated samples of the same size by the same method from the same population. Fisher (1934) recommends the test for all types of dichotomous data, but this recommendation has been questioned by others.

² Lerner et al. come to the same conclusion, although they do not report any statistical test of these data.
However, Tocher (1950) has proved that a slight modification of the Fisher test provides the most powerful one-tailed test for data in a 2 × 2 table. We will illustrate this modification by giving Tocher's example. Table 6.6 shows some observed frequencies (in a) and shows the two more extreme distributions of frequencies which could occur with the same marginal totals (b and c).

TABLE 6.6. Tocher's Example
(a) Observed data; (b) and (c) more extreme outcomes with the same marginal totals

        (a)              (b)              (c)
     2   5 |  7       1   6 |  7       0   7 |  7
     3   2 |  5       4   1 |  5       5   0 |  5
     5   7 | 12       5   7 | 12       5   7 | 12

Given the observed data (a), we wish to test H0 at α = .05. Applying formula (6.1) to the data in each of the three tables, we have

    p_a = (7! 5! 5! 7!) / (12! 2! 5! 3! 2!) = .26515
    p_b = (7! 5! 5! 7!) / (12! 1! 6! 4! 1!) = .04419
    p_c = (7! 5! 5! 7!) / (12! 0! 7! 5! 0!) = .00126

The probability associated with the occurrence of values as extreme as the observed scores (a) under H0 is given by adding these three p's:

    .26515 + .04419 + .00126 = .31061
Thus p = .31061 is the probability we would find by the Fisher test. Tocher's modification first determines the probability of all the cases more extreme than the observed one, and not including the observed one. Thus in this case one would sum only p_b and p_c:

    .04419 + .00126 = .04545

Now if this probability of the more extreme outcomes is larger than α, we cannot reject H0. But if this probability is less than α while the probability yielded by the Fisher test is greater than α (as is the case with these data), then Tocher recommends computing this ratio:

    [α - (p of more extreme cases)] / (p of observed case)        (6.2)
For the data shown in Table 6.6, this would be [α - (p_b + p_c)] / p_a, which is

    (.05 - .04545) / .26515 = .01716

Now we go to a table of random numbers and at random draw a number between 0 and 1. If this random number is smaller than our ratio above (i.e., if it is smaller than .01716), we reject H0. If it is larger, we cannot reject H0. Of course in this case it is highly unlikely that the randomly drawn number will be sufficiently small to permit us to reject H0. But this added small probability of rejecting H0 makes the Fisher test slightly less conservative.

Perhaps the reader will gain an intuitive understanding of the logic and power of Tocher's modification by considering what a one-tailed test at α = .05 really is for the data given in Table 6.6. Suppose we reject H0 only when case b or c occurs. Then we are actually working at α = .04545. In order to move to exactly the α = .05 level, we also declare as significant (by Tocher's modification) a proportion (.01716) of the cases when a occurs in the sampling distribution. Whether we may consider our observed case as one of those in the proportion is determined by a table of random numbers.
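Tocher's decision rule can be sketched as follows; the table of random numbers is replaced by Python's random module, and the helper name is illustrative rather than from the text.

```python
import random
from math import factorial

def point_prob(a, b, c, d):
    # Formula (6.1) for a single 2 x 2 table
    n = a + b + c + d
    return (factorial(a + b) * factorial(c + d) * factorial(a + c)
            * factorial(b + d)) / (factorial(n) * factorial(a)
                                   * factorial(b) * factorial(c)
                                   * factorial(d))

alpha = 0.05

# Tocher's example (Table 6.6)
p_observed = point_prob(2, 5, 3, 2)                               # case a
p_more_extreme = point_prob(1, 6, 4, 1) + point_prob(0, 7, 5, 0)  # cases b, c

if p_more_extreme + p_observed <= alpha:
    reject = True        # the ordinary Fisher test already rejects
elif p_more_extreme > alpha:
    reject = False       # even the more extreme cases alone exceed alpha
else:
    # Formula (6.2): reject for a random proportion of the observed cases
    ratio = (alpha - p_more_extreme) / p_observed  # roughly .017 here
    reject = random.random() < ratio
```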
Summary of procedure. These are the steps in the use of the Fisher test:

1. Cast the observed frequencies in a 2 × 2 table.
2. Determine the marginal totals. Each set of marginal totals sums to N, the number of independent cases observed.
3. The method of deciding whether or not to reject H0 depends on whether or not exact probabilities are required:
   a. For a test of significance, refer to Table I.
   b. For an exact probability, the recursive use of formula (6.1) is required.
   In either case, the value yielded will be for a one-tailed test. For a two-tailed test, the significance level shown by Table I or the p yielded by the use of formula (6.1) must be doubled.
4. If the significance level shown by Table I or the p yielded by the use of formula (6.1) is equal to or less than α, reject H0.
5. If the observed frequencies are insignificant but all more extreme possible outcomes with the same marginal totals would be significant, use Tocher's modification to determine whether or not to reject H0 for a one-tailed test.
Power

With Tocher's modification, the Fisher test is the most powerful of one-tailed tests (in the sense of Neyman and Pearson) for data of the kind for which the test is appropriate (Cochran, 1952).

References

Other discussions of the Fisher test may be found in Barnard (1947), Cochran (1952), Finney (1948), Fisher (1934, sec. 21.02), McNemar (1955, pp. 240-242), and Tocher (1950).

THE χ² TEST FOR TWO INDEPENDENT SAMPLES

Function

When the data of research consist of frequencies in discrete categories, the χ² test may be used to determine the significance of differences between two independent groups. The measurement involved may be as weak as nominal scaling.

The hypothesis under test is usually that the two groups differ with respect to some characteristic and therefore with respect to the relative frequency with which group members fall in several categories. To test this hypothesis, we count the number of cases from each group which fall in the various categories, and compare the proportion of cases from one group in the various categories with the proportion of cases from the other group. For example, we might test whether two political groups differ in their agreement or disagreement with some opinion, or we might test whether the sexes differ in the frequency with which they choose certain leisure-time activities, etc.

Method
The null hypothesis may be tested by

    χ² = Σ (i = 1 to r) Σ (j = 1 to k) (O_ij - E_ij)² / E_ij        (6.3)

where O_ij = observed number of cases categorized in the ith row of the jth column
      E_ij = number of cases expected under H0 to be categorized in the ith row of the jth column

and where the double summation directs one to sum over all r rows and all k columns, i.e., to sum over all cells.
The values of χ² yielded by formula (6.3) are distributed approximately as chi square with df = (r - 1)(k - 1), where r = the number of rows and k = the number of columns in the contingency table.

To find the expected frequency for each cell (E_ij), multiply the two marginal totals common to a particular cell, and then divide this product by the total number of cases, N.
and short personsdier with respectto leadershipqualities. Table 6.7 Tmm
6.7. HEIGET aND LEaDEEsHIp (Artificial data) Short
Tall
Total
Leader Follower
36
Unclassifiable
15
Total
43
52
95
showsthe frequencies with which43 shortpeopleand 52 tall peopleare categorized as "leaders," "followers," and as "unclassifiable."
Now the null hypothesis would be that height is independent of leader-follower position, i.e., that the proportion of tall people who are leaders is the same as the proportion of short people who are leaders, that the proportion of tall people who are followers is the same as the proportion of short people who are followers, etc. With such a hypothesis, we may determine the expected frequency for each cell by the method indicated. In each case we multiply the two marginal totals common to a particular cell, and then divide this product by N to obtain the expected frequency.

TABLE 6.8. Height and Leadership: Observed and Expected Frequencies
(Artificial data; each expected frequency is given in parentheses beside the observed frequency)

                        Short          Tall       Total
    Leader            12 (19.9)     32 (24.1)      44
    Follower          22 (16.3)     14 (19.7)      36
    Unclassifiable     9 (6.8)       6 (8.2)       15
    Total                43            52          95
Thus, for example, the expected frequency for the lower right-hand cell in Table 6.7 is

    E₃₂ = (52)(15)/95 = 8.2

Table 6.8 shows the expected frequencies for each of the six cells, together with the various observed frequencies, for the data shown in Table 6.7.
Now if the observed frequencies are in close agreement with the expected frequencies, the differences (O_ij - E_ij) will of course be small, and consequently the value of χ² will be small. With a small value of χ² we may not reject the null hypothesis that the two sets of characteristics are independent of each other. However, if some or many of the differences are large, then the value of χ² will also be large. The larger χ² is, the more likely it is that the two groups differ with respect to the classifications.
The sampling distribution of χ² as defined by formula (6.3) can be shown to be approximated by a chi-square¹ distribution with df = (r - 1)(k - 1). The probabilities associated with various values of chi square are given in Table C of the Appendix. If an observed value of χ² is equal to or greater than the value given in Table C for a particular level of significance, at a particular df, then H0 may be rejected at that level of significance.

Notice that there is a different sampling distribution for every value of df. That is, the significance of any particular value of χ² depends on the number of degrees of freedom in the data from which it was computed. The size of df reflects the number of observations that are free to vary after certain restrictions have been placed on the data. (Degrees of freedom are discussed in Chap. 4.)

The degrees of freedom for an r × k contingency table may be found by

    df = (r - 1)(k - 1)

where r = number of classifications (rows)
      k = number of groups (columns)

For the data in Table 6.8, r = 3 and k = 2, for we have 3 classifications (leader, follower, and unclassifiable) and 2 groups (tall and short). Thus df = (3 - 1)(2 - 1) = 2.
¹ To avoid confusion, the symbol χ² is used for the quantity in formula (6.3) which is computed from the observed data when a χ² test is performed. The words "chi square" refer to a random variable which follows the chi-square distribution, tabled in Table C.
The computation of χ² for the data in Table 6.8 is straightforward:

    χ² = (12 - 19.9)²/19.9 + (32 - 24.1)²/24.1 + (22 - 16.3)²/16.3
         + (14 - 19.7)²/19.7 + (9 - 6.8)²/6.8 + (6 - 8.2)²/8.2
       = 3.14 + 2.59 + 1.99 + 1.65 + .71 + .59
       = 10.67

To determine the significance of χ² = 10.67 when df = 2, we turn to Table C. The table shows that this value of χ² is significant beyond the .01 level. Therefore we could reject the null hypothesis of no differences at α = .01.
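The expected-frequency rule and formula (6.3) are mechanical enough to sketch in a few lines of Python (the function name is illustrative). Note that keeping the E's at full precision gives about 10.71 rather than the 10.67 obtained above with E's rounded to one decimal; the conclusion is unchanged.

```python
def chi_square(observed):
    """Formula (6.3) for an r x k table of observed frequencies;
    expected frequencies come from the marginal totals."""
    r, k = len(observed), len(observed[0])
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(observed[i][j] for i in range(r)) for j in range(k)]
    n = sum(row_totals)
    chi2 = 0.0
    for i in range(r):
        for j in range(k):
            e = row_totals[i] * col_totals[j] / n   # E_ij
            chi2 += (observed[i][j] - e) ** 2 / e
    return chi2, (r - 1) * (k - 1)

# Table 6.7: rows are leader / follower / unclassifiable; columns short / tall
chi2, df = chi_square([[12, 32], [22, 14], [9, 6]])
print(round(chi2, 2), df)  # 10.71 2
```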
2 × 2 contingency tables. Perhaps the most common of all uses of the χ² test is the test of whether an observed breakdown of frequencies in a 2 × 2 contingency table could have occurred under H0. We are familiar with the form of such a table; an example is Table 6.1. When applying the χ² test to data where both r and k equal 2, formula (6.4) should be used:

    χ² = N (|AD - BC| - N/2)² / [(A + B)(C + D)(A + C)(B + D)]        (6.4)

This formula is somewhat easier to apply than formula (6.3), inasmuch as only one division is necessary in the computation. Moreover, it lends itself readily to machine computation. It has the additional advantage of incorporating a correction for continuity which markedly improves the approximation of the distribution of the computed χ² by the chi-square distribution.

Example
Adams studied the relation of vocational interests and curriculum choice to rate of withdrawal from college by bright students.¹ Her subjects were students who scored at or above the 90th percentile in college entrance tests of intelligence, and who changed their majors following matriculation. She compared those bright students whose curriculum change was in the direction indicated as desirable by their

¹ Adams, Lois. 1955. A study of intellectually gifted students who withdrew from the Pennsylvania State University. Unpublished master's thesis, Pennsylvania State University.
scores on the Strong Vocational Interest Test (such a change was called "positive") with those bright students whose curriculum change was in a direction contrary to that suggested by their tested interests. Her hypothesis was that those who made positive curricular changes would more frequently remain in school.

i. Null Hypothesis. H0: there is no difference between the two groups (positive curriculum changers and negative curriculum changers) in the proportion of members who remain in college. H1: a greater proportion of students who make positive curriculum changes remain in college than is the case with those who make negative curriculum changes.

ii. Statistical Test. The χ² test for two independent samples is chosen because the two groups (positive and negative curriculum changers) are independent, and because the "scores" under study are frequencies in discrete categories (withdrew and remained).

iii. Significance Level. Let α = .05. N = the number of students in the sample = 80.

iv. Sampling Distribution. χ² as computed from formula (6.4) has a sampling distribution which is approximated by the chi-square distribution with df = 1. Critical values of chi square are given in Table C.

v. Rejection Region. The region of rejection consists of all values of χ² which are so large that the probability associated with their occurrence is equal to or less than α = .05. Since H1 predicts the direction of the difference between the two groups, the region of rejection is one-tailed. Table C shows that for a one-tailed test, when df = 1, a χ² of 2.71 or larger has probability of occurrence under H0 of p = ½(.10) = .05. Therefore the region of rejection consists of all χ² ≥ 2.71 if the direction of the results is that predicted by H1.

vi. Decision. Adams' findings are presented in Table 6.9. This table shows that of the 56 bright students who made positive curriculum changes, 10 withdrew and 46 remained in college. Of the

TABLE 6.9. Curriculum Change and Withdrawal from College among Bright Students

                   Direction of curriculum change
                   Positive    Negative    Total
    Withdrew          10          11         21
    Remained          46          13         59
    Total             56          24         80
24 who made negative changes, 11 withdrew from college and 13 remained. The value of χ² for these data is

    χ² = N (|AD - BC| - N/2)² / [(A + B)(C + D)(A + C)(B + D)]        (6.4)
       = 80 (|(10)(13) - (11)(46)| - 80/2)² / [(21)(59)(56)(24)]
       = 80(336)² / 1,665,216
       = 5.42

The probability of occurrence under H0 for χ² ≥ 5.42 with df = 1 is p < ½(.02), i.e., p < .01. Inasmuch as this p is less than α = .05, the decision is to reject H0 in favor of H1. We conclude that bright students who make "positive" curriculum changes remain in college more frequently than do bright students who make "negative" curriculum changes.
Small expected frequencies. The χ² test is applicable to data in a contingency table only if the expected frequencies are sufficiently large. The size requirements for expected frequencies are discussed below. When the expected frequencies do not meet these requirements, one may increase their values by combining cells, i.e., by combining adjacent classifications and thereby reducing the number of cells. This may be properly done only if such combining does not rob the data of their meaning. In our fictitious "study" of height and leadership, of course, any combining of categories would have rendered the data useless for testing our hypothesis. The researcher may usually avoid this problem by planning in advance to collect a fairly large number of cases relative to the number of classifications he wishes to use in his analysis.

Summary of procedure. These are the steps in the use of the χ² test for two independent samples:

1. Cast the observed frequencies in a k × r contingency table, using the k columns for the groups and the r rows for the conditions. For this test, k = 2.
2. Determine the expected frequency for each cell by finding the product of the marginal totals common to it and dividing this by N. (N is the sum of each group of marginal totals. It represents the total number of independent observations. Inflated N's invalidate the test.) Step 2 is unnecessary if the data are in a 2 × 2 table and thus formula (6.4) is to be used.
3. For a 2 × 2 table, compute χ² by formula (6.4). When r is larger than 2, compute χ² by formula (6.3).
4. Determine the significance of the observed χ² by reference to Table C. For a one-tailed test, halve the significance level shown. If the probability given by Table C is equal to or smaller than α, reject H0 in favor of H1.

When to Use the χ² Test
As we have already noted, the χ² test requires that the expected frequencies (E_ij) in each cell should not be too small. When they are smaller than minimal, the test may not be properly or meaningfully used. Cochran (1954) makes these recommendations:

The 2 × 2 case. If the frequencies are in a 2 × 2 contingency table, the decision concerning the use of χ² should be guided by these considerations:

1. When N > 40, use χ² corrected for continuity, i.e., use formula (6.4).
2. When N is between 20 and 40, the χ² test [formula (6.4)] may be used if all expected frequencies are 5 or more. If the smallest expected frequency is less than 5, use the Fisher test (pages 96 to 104).
3. When N < 20, use the Fisher test in all cases.

Contingency tables with df larger than 1. When k is larger than 2 (and thus df > 1), the χ² test may be used if fewer than 20 per cent of the cells have an expected frequency of less than 5 and if no cell has an expected frequency of less than 1. If these requirements are not met by the data in the form in which they were originally collected, the researcher must combine adjacent categories in order to increase the expected frequencies in the various cells. Only after he has combined categories to meet the above requirements may he meaningfully apply the χ² test.
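Cochran's working rules for the 2 × 2 case can be captured in a small helper; the function name and return strings below are illustrative, not from the text.

```python
def recommended_test_2x2(expected_frequencies, n):
    """Cochran's (1954) recommendations for a 2 x 2 table,
    as summarized in the text."""
    if n < 20:
        return "Fisher exact test"
    if n <= 40 and min(expected_frequencies) < 5:
        return "Fisher exact test"
    return "chi-square corrected for continuity, formula (6.4)"

# Expected frequencies for Table 6.9 (N = 80): all are well above 5
print(recommended_test_2x2([14.7, 6.3, 41.3, 17.7], 80))
# chi-square corrected for continuity, formula (6.4)
```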
When df > 1, χ² tests are insensitive to the effects of order, and thus when a hypothesis takes order into account, χ² may not be the best test. The reader may consult Cochran (1954) for methods that strengthen the common χ² tests when H0 is tested against specific alternatives.

Power
When the χ² test is used, there is usually no clear alternative, and thus the exact power of the test is difficult to compute. However, Cochran (1952) has shown that the limiting power distribution of χ² tends to 1 as N becomes large.

References

For other discussions of the χ² test, the reader may refer to Cochran (1952; 1954), Dixon and Massey (1951, chap. 13), Edwards (1954,
chap. 18), Lewis and Burke (1949), McNemar (1955, chap. 13), and Walker and Lev (1953, chap. 4).

THE MEDIAN TEST
Function
The median test is a procedure for testing whether two independent groups differ in central tendencies. More precisely, the median test will give information as to whether it is likely that two independent groups (not necessarily of the same size) have been drawn from populations with the same median. The null hypothesis is that the two groups are from populations with the same median; the alternative hypothesis may be that the median of one population is different from that of the other (two-tailed test) or that the median of one population is higher than that of the other (one-tailed test). The test may be used whenever the scores for the two groups are in at least an ordinal scale.

Rationale and Method
To perform the median test, we first determine the median score for the combined group (i.e., the median for all scores in both samples). Then we dichotomize both sets of scores at that combined median, and cast these data in a 2 × 2 table like Table 6.10.

TABLE 6.10. MEDIAN TEST: FORM FOR DATA

                                         Group I   Group II   Total
    No. of scores above combined median     A          B       A + B
    No. of scores below combined median     C          D       C + D
    Total                                 A + C      B + D     N = n₁ + n₂
Now if both group I and group II are samples from populations whose median is the same, we would expect about half of each group's scores to be above the combined median and about half to be below. That is, we would expect frequencies A and C to be about equal, and frequencies B and D to be about equal. It can be shown (Mood, 1950, pp. 394-395) that if A is the number of cases in group I which fall above the combined median, and if B is the number of cases in group II which fall above the combined median, then the sampling distribution of A and B under the null hypothesis (H₀ is that A = ½n₁ and B = ½n₂) is the hypergeometric distribution

    p = [ C(A+C, A) × C(B+D, B) ] / C(n₁+n₂, A+B)

where C(m, k) denotes the number of combinations of m things taken k at a time.
THE CASE OF TWO INDEPENDENT SAMPLES
Therefore if the total number of cases in both groups (n₁ + n₂) is small, one may use the Fisher test (pages 96 to 104) to test H₀. If the total number of cases is sufficiently large, the χ² test with df = 1 (page 107) may be used to test H₀.
When analyzing data split at the median, the researcher should be guided by these considerations in choosing between the Fisher test and the χ² test:
1. When n₁ + n₂ is larger than 40, use χ² corrected for continuity, i.e., use formula (6.4).
2. When n₁ + n₂ is between 20 and 40 and when no cell has an expected frequency of less than 5, use χ² corrected for continuity [formula (6.4)]. If the smallest expected frequency is less than 5, use the Fisher test.
3. When n₁ + n₂ is less than 20, use the Fisher test.
One difficulty may arise in the computation of the median test: several scores may fall right at the combined median. If this happens, the researcher has two alternatives: (a) if n₁ + n₂ is large, and if only a few cases fall at the combined median, those few cases may be dropped from the analysis, or (b) the groups may be dichotomized as those scores which exceed the median and those which do not. In this case, the troublesome scores would be included in the second category.
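The whole procedure can be sketched as a short function. This is our own illustration under the decision rules above (the name median_test and the example data are ours); note that scipy also ships a ready-made scipy.stats.median_test which may be used as a cross-check:

```python
import numpy as np
from scipy import stats

def median_test(group1, group2):
    """Two-sample median test; returns the p value of the appropriate test."""
    g1, g2 = np.asarray(group1), np.asarray(group2)
    grand_median = np.median(np.concatenate([g1, g2]))
    # Dichotomize at the combined median; scores that do not exceed it are
    # counted in the "below" category (alternative (b) above).
    A, B = (g1 > grand_median).sum(), (g2 > grand_median).sum()
    C, D = len(g1) - A, len(g2) - B
    table = np.array([[A, B], [C, D]])
    n = len(g1) + len(g2)
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / n
    if n > 40 or (n >= 20 and expected.min() >= 5):
        return stats.chi2_contingency(table, correction=True)[1]
    return stats.fisher_exact(table)[1]   # small samples: Fisher test

p = median_test([18, 12, 11, 15, 17, 16, 14], [5, 6, 7, 8, 9, 10, 13])
print(round(p, 3))   # 0.029
```

With n₁ + n₂ = 14 < 20, the sketch correctly falls through to the Fisher test.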
Example
In a cross-cultural test of some behavior theory hypotheses adapted from psychoanalytic theory,² Whiting and Child studied the relation between child-rearing practices and customs related to illness in various nonliterate cultures. One hypothesis of their study, derived from the notion of negative fixation, was that oral explanations of illness would be used in societies in which the socialization of oral drives is such as to produce anxiety. Typical oral explanations of illness are these: illness results from eating poison, illness results from drinking certain liquids, illness results from verbal spells and incantations performed by others. Judgments of the typical oral socialization anxiety in any society were based on the rapidity of oral socialization, the severity of oral socialization, the frequency of punishment typical in oral socialization, and the severity of emotional conflict typically evidenced by the children during the period of oral socialization. Excerpts from ethnological reports of nonliterate cultures were used in the collection of the data. Using only excerpts concerning
¹ The method for computing expected frequencies is given on pages 105 and 106.
² Whiting, J. W. M., and Child, I. L. 1953. Child training and personality. New Haven: Yale Univer. Press.
customs relating to illness, judges classified the societies into two groups: those with oral explanations of illness present and those with oral explanations absent. Other judges, using the excerpts concerning child-rearing practices, rated each society on the degree of oral socialization anxiety typical in its children. For the 39 societies for which judgments of the presence or absence of oral explanations were possible, these ratings ranged from 6 to 17.
i. Null Hypothesis. H₀: there is no difference between the median oral socialization anxiety in societies which give oral explanations of illness and the median oral socialization anxiety in societies which do not give oral explanations of illness. H₁: the median oral socialization anxiety in societies with oral explanations present is higher than the median in societies with oral explanations absent.
ii. Statistical Test. The ratings constitute ordinal measures at best; thus a nonparametric test is appropriate. This choice also eliminates the necessity of assuming that oral socialization anxiety is normally distributed among the cultures sampled, as well as eliminating the necessity of assuming that the variances of the two groups sampled are equal. For the data from the two independent groups of societies, the median test may be used to test H₀.
iii. Significance Level. Let α = .01. N = 39 = the number of societies for which ethnological information on both variables was available. n₁ = 16 = the number of societies with oral explanations absent; n₂ = 23 = the number of societies with oral explanations present.
iv. Sampling Distribution.
Since we cannot at this time state which test (Fisher test or χ² test) will be used for the scores split at the median, since n₁ + n₂ = 39 is between 20 and 40 and therefore our choice must be determined by the size of the smallest expected frequency, we cannot state the sampling distribution.
v. Rejection Region. Since H₁ predicts the direction of the difference, the region of rejection is one-tailed. It consists of all outcomes in a median-split table which are in the predicted direction and which are so extreme that the probability associated with their occurrence under H₀ (as determined by the appropriate test) is equal to or less than α = .01.
vi. Decision. Table 6.11 shows the ratings assigned to each of the 39 societies. These are divided at the combined median for the n₁ + n₂ ratings. (We have followed Whiting and Child in calling 10.5 the median of the 39 ratings.)
Table 6.12 shows these data cast in the form for the median test. Since none of the expected frequencies is less than 5, and since n₁ + n₂ > 20, we may use the χ² test to test H₀:
    χ² = N(|AD − BC| − N/2)² / [(A + B)(C + D)(A + C)(B + D)]          (6.4)

       = 39(|(3)(6) − (17)(13)| − 39/2)² / [(20)(19)(16)(23)]

       = 9.39

TABLE 6.11. ORAL SOCIALIZATION ANXIETY AND ORAL EXPLANATIONS OF ILLNESS
(The name of each society is preceded by its rating on oral socialization anxiety)

                        Societies with oral        Societies with oral
                        explanations absent        explanations present

Societies above         13 Lapp                    17 Marquesans
median on oral          12 Chamorro                16 Dobuans
socialization           12 Samoans                 15 Baiga
anxiety                                            15 Kwoma
                                                   15 Thonga
                                                   14 Alorese
                                                   14 Chagga
                                                   14 Navaho
                                                   13 Dahomeans
                                                   13 Lesu
                                                   13 Masai
                                                   12 Lepcha
                                                   12 Maori
                                                   12 Pukapukans
                                                   12 Trobrianders
                                                   11 Kwakiutl
                                                   11 Manus

Societies below         10 Arapesh                 10 Chiricahua
median on oral          10 Balinese                10 Comanche
socialization           10 Hopi                    10 Siriono
anxiety                 10 Tanala                   8 Bena
                         9 Paiute                   8 Slave
                         8 Chenchu                  6 Kurtatchi
                         8 Teton
                         7 Flathead
                         7 Papago
                         7 Venda
                         7 Warrau
                         7 Wogeo
                         6 Ontong-Javanese
* Reproduced from Table 4 of Whiting, J. W. M., and Child, I. L. 1953. Child training and personality. New Haven: Yale Univer. Press, p. 1M, with the kind permission of the authors and the publisher.
TABLE 6.12. ORAL SOCIALIZATION ANXIETY AND ORAL EXPLANATIONS OF ILLNESS

                                 Societies with oral   Societies with oral
                                 explanations absent   explanations present   Total
    Societies above median on
    oral socialization anxiety            3                     17              20
    Societies below median on
    oral socialization anxiety           13                      6              19
    Total                                16                     23              39
Reference to Table C shows that χ² ≥ 9.39 with df = 1 has probability of occurrence under H₀ of p < ½(.01), that is, p < .005 for a one-tailed test. Thus our decision is to reject H₀ for α = .01.¹ We conclude that the median oral socialization anxiety is higher in societies with oral explanations of illness present than is the median in societies with oral explanations absent.
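The arithmetic of formula (6.4) can be checked directly; a minimal sketch, with the cell frequencies taken from Table 6.12:

```python
# chi-square from formula (6.4), with the Table 6.12 frequencies
A, B, C, D = 3, 17, 13, 6
N = A + B + C + D                      # 39 societies
chi2 = N * (abs(A * D - B * C) - N / 2) ** 2 / (
    (A + B) * (C + D) * (A + C) * (B + D))
print(round(chi2, 2))                  # 9.39
```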
Summary of procedure. These are the steps in the use of the median test:
1. Determine the combined median of the n₁ + n₂ scores.
2. Split each group's scores at that combined median. Enter the resultant frequencies in a table like Table 6.10. If many scores fall at the combined median, split the scores into these categories: those which exceed the median and those which do not.
3. Find the probability of the observed values by either the Fisher test or the χ² test, choosing between these according to the criteria given above.
4. If the p yielded by that test is equal to or smaller than α, reject H₀.

Power-Efficiency
Mood (1954) has shown that when the median test is applied to data measured in at least an interval scale from normal distributions with common variance (i.e., data that might properly be analyzed by the parametric t test), it has the same power-efficiency as the sign test. That is, its power-efficiency is about 95 per cent for n₁ + n₂ as low as 6. This power-efficiency decreases as the sample sizes increase, reaching an eventual asymptotic efficiency of 2/π = 63 per cent.
¹ This decision agrees with that reached by Whiting and Child. Using the parametric t test on these data, they found that t = 4.05, p < .0005.
References

Discussions of the median test are contained in Brown and Mood (1951), Mood (1950, pp. 394-395), and Moses (1952a).

THE MANN-WHITNEY U TEST
Function
When at least ordinal measurement has been achieved, the Mann-Whitney U test may be used to test whether two independent groups have been drawn from the same population. This is one of the most powerful of the nonparametric tests, and it is a most useful alternative to the parametric t test when the researcher wishes to avoid the t test's assumptions, or when the measurement in the research is weaker than interval scaling.
Suppose we have samples from two populations, population A and population B. The null hypothesis is that A and B have the same distribution. The alternative hypothesis, H₁, against which we test H₀, is that A is stochastically larger than B, a directional hypothesis. We may accept H₁ if the probability that a score from A is larger than a score from B is greater than one-half. That is, if a is one observation from population A, and b is one observation from population B, then H₁ is that p(a > b) > ½. If the evidence supports H₁, this implies that the "bulk" of population A is higher than the bulk of population B.
Of course, we might predict instead that B is stochastically larger than A. Then H₁ would be that p(a > b) < ½. Confirmation of this assertion would imply that the bulk of B is higher than the bulk of A. For a two-tailed test, i.e., for a prediction of differences which does not state direction, H₁ would be that p(a > b) ≠ ½.

Method
Let n₁ = the number of cases in the smaller of two independent groups, and n₂ = the number of cases in the larger. To apply the U test, we first combine the observations or scores from both groups, and rank these in order of increasing size. In this ranking, algebraic size is considered, i.e., the lowest ranks are assigned to the largest negative numbers, if any.
Now focus on one of the groups, say the group with n₁ cases. The value of U (the statistic used in this test) is given by the number of times that a score in the group with n₂ cases precedes a score in the group with n₁ cases in the ranking.
For example, suppose we had an experimental group of 3 cases and a control group of 4 cases. Here n₁ = 3 and n₂ = 4. Suppose these were the scores:

    E scores:  9  11  15
    C scores:  6   8  10  13

To find U, we first rank these scores in order of increasing size, being careful to retain each score's identity as either an E or C score:

    6  8  9  10  11  13  15
    C  C  E   C   E   C   E

Now consider the control group, and count the number of E scores that precede each score in the control group. For the C score of 6, no E score precedes. This is also true for the C score of 8. For the next C score (10), one E score precedes. And for the final C score (13), two E scores precede. Thus U = 0 + 0 + 1 + 2 = 3. The number of times that an E score precedes a C score is 3 = U.
The sampling distribution of U under H₀ is known, and with this knowledge we can determine the probability associated with the occurrence under H₀ of any U as extreme as an observed value of U.
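The counting definition of U translates directly into code; a brute-force sketch (the helper name count_U is ours, and the sketch ignores tied scores, which are treated later in this section):

```python
def count_U(first, second):
    """Number of times a score in `first` precedes (is smaller than)
    a score in `second` in the combined ranking."""
    return sum(f < s for f in first for s in second)

E = [9, 11, 15]       # n1 = 3 experimental scores
C = [6, 8, 10, 13]    # n2 = 4 control scores
print(count_U(E, C))  # 3, the U obtained above
```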
Very small samples. When neither n₁ nor n₂ is larger than 8, Table J of the Appendix may be used to determine the exact probability associated with the occurrence under H₀ of any U as extreme as an observed value of U. The reader will observe that Table J is made up of six separate subtables, one for each value of n₂, from n₂ = 3 to n₂ = 8. To determine the probability under H₀ associated with his data, the researcher need know only n₁ (the size of the smaller group), n₂, and U. With this information he may read the value of p from the subtable appropriate to his value of n₂. In our example, n₁ = 3, n₂ = 4, and U = 3. The subtable for n₂ = 4 in Table J shows that U ≤ 3 has probability of occurrence under H₀ of p = .200.
The probabilities given in Table J are one-tailed. For a two-tailed test, the value of p given in the table should be doubled.
Now it may happen that the observed value of U is so large that it does not appear in the subtable for the observed value of n₂. Such a value arises when the researcher focuses on the "wrong" group in determining U. We shall call such a too-large value U′. For example, suppose that in the above case we had counted the number of C scores preceding each E score rather than counting the number of E scores preceding each C score. We would have found that U = 2 + 3 + 4 = 9. The subtable for n₂ = 4 does not go up to U = 9. We therefore denote our observed value as U′ = 9. We can transform any U′ to U by

    U = n₁n₂ − U′                                                      (6.6)

since p(U ≥ U′) = p(U ≤ n₁n₂ − U′).
In our example, by this transformation U = (3)(4) − 9 = 3. Of course this is the U we found directly when we counted the number of E scores preceding each C score.
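Transformation (6.6) is easy to check numerically (count_U is again our own brute-force helper):

```python
def count_U(first, second):
    return sum(f < s for f in first for s in second)

E, C = [9, 11, 15], [6, 8, 10, 13]
U_prime = count_U(C, E)          # C scores preceding E scores: 2 + 3 + 4 = 9
U = len(E) * len(C) - U_prime    # formula (6.6): U = n1*n2 - U'
print(U_prime, U)                # 9 3
```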
Example for Very Small Samples

Solomon and Coles¹ studied whether rats would generalize learned imitation when placed under a new drive and in a new situation. Five rats were trained to imitate leader rats in a T maze. They were trained to follow the leaders when hungry, in order to attain a food incentive. Then the 5 rats were each transferred to a shock-avoidance situation, where imitation of leader rats would have enabled them to avoid electric shock. Their behavior in the shock-avoidance situation was compared to that of 4 controls who had had no previous training to follow leaders. The hypothesis was that the 5 rats who had already been trained to imitate would transfer this training to the new situation, and thus would reach the learning criterion in the shock-avoidance situation sooner than would the 4 control rats. The comparison is in terms of how many trials each rat took to reach a criterion of 10 correct responses in 10 trials.
i. Null Hypothesis. H₀: the number of trials to the criterion in the shock-avoidance situation is the same for rats previously trained to follow a leader to a food incentive as for rats not previously trained. H₁: rats previously trained to follow a leader to a food incentive will reach the criterion in the shock-avoidance situation in fewer trials than will rats not previously trained.
ii. Statistical Test. The Mann-Whitney U test is chosen because this study employs two independent samples, uses small samples, and uses measurement (number of trials to criterion as an index to speed of learning) which is probably at most in an ordinal scale.
iii. Significance Level. Let α = .05. n₁ = 4 control rats, and n₂ = 5 experimental rats.
iv. Sampling Distribution. The probabilities associated with the occurrence under H₀ of values as small as an observed U for n₁, n₂ ≤ 8 are given in Table J.
v. Rejection Region. Since H₁ states the direction of the predicted difference, the region of rejection is one-tailed. It consists of all values of U which are so small that the probability associated with their occurrence under H₀ is equal to or less than α = .05.
vi. Decision. The number of trials to criterion required by the E
¹ Solomon, R. L., and Coles, M. R. 1954. A case of failure of generalization of imitation across drives and across situations. J. Abnorm. Soc. Psychol., 49, 7-13. Only two of the groups studied by these investigators are included in this example.
and C rats were:

    E rats:  78  64  75  45  82
    C rats: 110  70  53  51

We arrange these scores in the order of their size, retaining the identity of each:

    45  51  53  64  70  75  78  82  110
    E   C   C   E   C   E   E   E    C

We obtain U by counting the number of E scores preceding each C score: U = 1 + 1 + 2 + 5 = 9.
In Table J, we locate the subtable for n₂ = 5. We see that U ≤ 9 when n₁ = 4 has a probability of occurrence under H₀ of p = .452. Our decision is that the data do not give evidence which justifies rejecting H₀ at the previously set level of significance. The conclusion is that these data do not support the hypothesis that previous training to imitate will generalize across situations and across drives.¹
n₂ between 9 and 20. If n₂ (the size of the larger of the two independent samples) is larger than 8, Table J may not be used. When n₂ is between 9 and 20, significance tests may be made with the Mann-Whitney test by using Table K of the Appendix, which gives critical values of U for significance levels .001, .01, .025, and .05 for a one-tailed test. For a two-tailed test, the significance levels given are .002, .02, .05, and .10.
Notice that this set of tables gives critical values of U, and does not give exact probabilities (as does Table J). That is, if an observed U for a particular n₁ ≤ 20 and n₂ between 9 and 20 is equal to or less than the value given in the table, H₀ may be rejected at the level of significance indicated at the head of that table.
For example, if n₁ = 6 and n₂ = 13, a U of 12 enables us to reject H₀ at α = .01 for a one-tailed test, and to reject H₀ at α = .02 for a two-tailed test.
Computing the value of U. For fairly large values of n₁ and n₂, the counting method of determining the value of U may be rather tedious. An alternative method, which gives identical results, is to assign the
¹ Solomon and Coles report the same conclusion. The statistical test which they utilized is not disclosed.
rank of 1 to the lowest score in the combined (n₁ + n₂) group of scores, assign rank 2 to the next lowest score, etc. Then

    U = n₁n₂ + n₁(n₁ + 1)/2 − R₁                                       (6.7a)

or, equivalently,

    U = n₁n₂ + n₂(n₂ + 1)/2 − R₂                                       (6.7b)
where R₁ = sum of the ranks assigned to the group whose sample size is n₁, and R₂ = sum of the ranks assigned to the group whose sample size is n₂.
For example, we might have used this method in finding the value of U for the data given in the example for small samples above. The E and C scores for that example are given again in Table 6.13, with their ranks.

TABLE 6.13. TRIALS TO CRITERION OF E AND C RATS

    E Score   Rank        C Score   Rank
       78       7            110      9
       64       4             70      5
       75       6             53      3
       45       1             51      2
       82       8
            R₂ = 26                R₁ = 19

For those data, R₁ = 19 and R₂ = 26, and it will be remembered that n₁ = 4 and n₂ = 5. By applying formula (6.7b), we have

    U = (4)(5) + (5)(6)/2 − 26 = 9
U = 9 is of course exactly the value we found earlier by counting.
Formulas (6.7a) and (6.7b) yield different U's. It is the smaller of these that we want; the larger value is U′. The investigator should check whether he has found U′ rather than U by applying the transformation

    U = n₁n₂ − U′                                                      (6.6)

The smaller of the two values, U, is the one whose sampling distribution is the basis for Table K. Although this value can be found by computing both formulas (6.7a) and (6.7b) and choosing the smaller of the two results, a simpler method is to use only one of those formulas and then find the other value by formula (6.6).
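The rank-sum formulas can be verified against the rat data above; a sketch using scipy.stats.rankdata for the ranking (the variable names are ours):

```python
from scipy.stats import rankdata

C = [110, 70, 53, 51]       # n1 = 4 control rats
E = [78, 64, 75, 45, 82]    # n2 = 5 experimental rats
ranks = rankdata(C + E)     # ranks 1..9 over the combined group
R1, R2 = ranks[:4].sum(), ranks[4:].sum()
n1, n2 = len(C), len(E)
U_a = n1 * n2 + n1 * (n1 + 1) / 2 - R1   # formula (6.7a) -- here the larger, U'
U_b = n1 * n2 + n2 * (n2 + 1) / 2 - R2   # formula (6.7b) -- here the smaller, U
print(R1, R2, min(U_a, U_b))             # 19.0 26.0 9.0
```

Note that (6.7a) happens to yield U′ = 11 for these data, and (6.7b) yields U = 9; transformation (6.6) connects them, since (4)(5) − 11 = 9.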
Large samples (n₂ larger than 20). Neither Table J nor Table K is usable when n₂ > 20. However, it has been shown (Mann and Whitney, 1947) that as n₁ and n₂ increase in size, the sampling distribution of U rapidly approaches the normal distribution, with

    Mean = μ_U = n₁n₂/2

    Standard deviation = σ_U = √[n₁n₂(n₁ + n₂ + 1)/12]

That is, when n₂ > 20 we may determine the significance of an observed value of U by

    z = (U − μ_U)/σ_U = (U − n₁n₂/2) / √[n₁n₂(n₁ + n₂ + 1)/12]        (6.8)

which is practically normally distributed with zero mean and unit variance. That is, the probability associated with the occurrence under H₀ of values as extreme as an observed z may be determined by reference to Table A of the Appendix.
When the normal approximation to the sampling distribution of U is used in a test of H₀, it does not matter whether formula (6.7a) or (6.7b) is used in the computation of U, for the absolute value of z yielded by formula (6.8) will be the same if either is used. The sign of the z depends on whether U or U′ was used, but the value does not.
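Formula (6.8) in code, as a sketch, checked here with the values from the Whiting and Child example that follows (n₁ = 16, n₂ = 23, U = 304); the function name z_for_U is ours:

```python
import math
from scipy.stats import norm

def z_for_U(U, n1, n2):
    """z of formula (6.8) for the normal approximation to U."""
    mu = n1 * n2 / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return (U - mu) / sigma

z = z_for_U(304, 16, 23)
# one-tailed probability from the normal distribution (the role of Table A)
print(round(abs(z), 2), round(norm.sf(abs(z)), 4))   # 3.43 0.0003
```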
Example for Large Samples

For our example, we will reexamine the Whiting and Child data which we have already analyzed by the median test (on pages 112 to 115).
i. Null Hypothesis. H₀: oral socialization anxiety is equally severe in both societies with oral explanations of illness present and societies with oral explanations absent. H₁: societies with oral explanations of illness present are (stochastically) higher in oral socialization anxiety than societies which do not have oral explanations of illness.
ii. Statistical Test. The two groups of societies constitute two independent groups, and the measure of oral socialization anxiety (rating scale) constitutes an ordinal measure at best. For these reasons the Mann-Whitney U test is appropriate for analyzing these data.
iii. Significance Level. Let α = .01. n₁ = 16 = the number of societies with oral explanations absent; n₂ = 23 = the number of societies with oral explanations present.
iv. Sampling Distribution. For n₂ > 20, formula (6.8) yields values of z. The probability associated with the occurrence under H₀ of values as extreme as an observed z may be determined by reference to Table A.
v. Rejection Region. Since H₁ predicts the direction of the difference, the region of rejection is one-tailed. It consists of all values of z (from data in which the difference is in the predicted direction) which are so extreme that their associated probability under H₀ is equal to or less than α = .01.
vi. Decision. The ratings assigned to each of the 39 societies are shown in Table 6.14, together with the rank of each in the combined

TABLE 6.14. ORAL SOCIALIZATION ANXIETY AND ORAL EXPLANATIONS OF ILLNESS

    Societies with oral explanations absent    Societies with oral explanations present
    Society           Rating    Rank           Society           Rating    Rank
    Lapp                13      29.5           Marquesans          17      39
    Chamorro            12      24.5           Dobuans             16      38
    Samoans             12      24.5           Baiga               15      36
    Arapesh             10      16             Kwoma               15      36
    Balinese            10      16             Thonga              15      36
    Hopi                10      16             Alorese             14      33
    Tanala              10      16             Chagga              14      33
    Paiute               9      12             Navaho              14      33
    Chenchu              8       9.5           Dahomeans           13      29.5
    Teton                8       9.5           Lesu                13      29.5
    Flathead             7       5             Masai               13      29.5
    Papago               7       5             Lepcha              12      24.5
    Venda                7       5             Maori               12      24.5
    Warrau               7       5             Pukapukans          12      24.5
    Wogeo                7       5             Trobrianders        12      24.5
    Ontong-Javanese      6       1.5           Kwakiutl            11      20.5
                             R₁ = 200.0        Manus               11      20.5
                                               Chiricahua          10      16
                                               Comanche            10      16
                                               Siriono             10      16
                                               Bena                 8       9.5
                                               Slave                8       9.5
                                               Kurtatchi            6       1.5
                                                                        R₂ = 580.0
group. Notice that tied ratings are assigned the average of the tied ranks.
For these data, R₁ = 200.0 and R₂ = 580.0. The value of U may be found by substituting the observed values in formula (6.7a):

    U = n₁n₂ + n₁(n₁ + 1)/2 − R₁                                       (6.7a)

      = (16)(23) + (16)(17)/2 − 200

      = 304

Knowing that U = 304, we may find the value of z by substituting in formula (6.8):

    z = (U − n₁n₂/2) / √[n₁n₂(n₁ + n₂ + 1)/12]

      = (304 − (16)(23)/2) / √[(16)(23)(16 + 23 + 1)/12]

      = 3.43
Reference to Table A reveals that z ≥ 3.43 has a one-tailed probability under H₀ of p < .0003. Since this p is smaller than α = .01, our decision is to reject H₀ in favor of H₁.* We conclude that societies with oral explanations of illness present are (stochastically) higher in oral socialization anxiety than societies with oral explanations absent.
It is important to notice that for these data the Mann-Whitney U test exhibits greater power to reject H₀ than the median test. Testing a similar hypothesis about these data, the median test yielded a value which permitted rejection of H₀ at the p < .005 level (one-tailed test), whereas the Mann-Whitney test yielded a value which permitted rejection of H₀ at the p < .0003 level (one-tailed test). The fact that the Mann-Whitney test is more powerful than the median test is not surprising, inasmuch as it considers the rank value of each observation rather than simply its location with respect to the combined median, and thus uses more of the information in the data.
Ties. The Mann-Whitney test assumes that the scores represent a distribution which has underlying continuity. With very precise measurement of a variable which has underlying continuity, the probability of a tie is zero. However, with the relatively crude measures which we typically employ in behavioral scientific research, ties may well occur.
* As we have already noted, Whiting and Child reached the same decision on the basis of the parametric t test. They found that t = 4.05, p < .0005.
We assume that the two observations which obtain tied scores are really different, but that this difference is simply too refined or minute for detection by our crude measures. When tied scores occur, we give each of the tied observations the average of the ranks they would have had if no ties had occurred.
If the ties occur between two or more observations in the same group, the value of U is not affected. But if ties occur between two or more observations involving both groups, the value of U is affected. Although the effect is usually negligible, a correction for ties is available for use with the normal curve approximation which we employ for large samples.
The effect of tied ranks is to change the variability of the set of ranks. Thus the correction for ties must be applied to the standard deviation of the sampling distribution of U. Corrected for ties, the standard deviation becomes

    σ_U = √[ (n₁n₂ / N(N − 1)) ( (N³ − N)/12 − ΣT ) ]

where N = n₁ + n₂
      T = (t³ − t)/12 (where t is the number of observations tied for a given rank)
      ΣT is found by summing the T's over all groups of tied observations

With the correction for ties, we find z by

    z = (U − n₁n₂/2) / √[ (n₁n₂ / N(N − 1)) ( (N³ − N)/12 − ΣT ) ]    (6.9)

It may be seen that if there are no ties, the above expression reduces directly to that given originally for z [formula (6.8)].
The use of the correction for ties may be illustrated by applying that correction to the data in Table 6.14. For those data,

    n₁ + n₂ = 16 + 23 = 39 = N

We observe these tied groups:
    2 scores of 6
    5 scores of 7
    4 scores of 8
    7 scores of 10
    2 scores of 11
    6 scores of 12
    4 scores of 13
    3 scores of 14
    3 scores of 15
Thus we have t's of 2, 5, 4, 7, 2, 6, 4, 3, and 3. To find ΣT, we sum the values of (t³ − t)/12 for each of the tied groups:

    ΣT = (2³ − 2)/12 + (5³ − 5)/12 + (4³ − 4)/12 + (7³ − 7)/12 + (2³ − 2)/12
         + (6³ − 6)/12 + (4³ − 4)/12 + (3³ − 3)/12 + (3³ − 3)/12

       = .5 + 10.0 + 5.0 + 28.0 + .5 + 17.5 + 5.0 + 2.0 + 2.0

       = 70.5
Thus for the data in Table 6.14, n₁ = 16, n₂ = 23, N = 39, U = 304, and ΣT = 70.5. Substituting these values in formula (6.9), we have

    z = (U − n₁n₂/2) / √[ (n₁n₂ / N(N − 1)) ( (N³ − N)/12 − ΣT ) ]    (6.9)

      = (304 − (16)(23)/2) / √[ ((16)(23)/(39)(38)) ( (39³ − 39)/12 − 70.5 ) ]

      = 3.45
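The tie correction can be checked in a few lines (a sketch; the tie lengths are those listed above and the variable names are ours):

```python
import math

t_lengths = [2, 5, 4, 7, 2, 6, 4, 3, 3]           # lengths of the tied groups
sum_T = sum((t**3 - t) / 12 for t in t_lengths)   # the Sigma-T of formula (6.9)

n1, n2, U = 16, 23, 304
N = n1 + n2
# corrected standard deviation and z of formula (6.9)
sigma = math.sqrt(n1 * n2 / (N * (N - 1)) * ((N**3 - N) / 12 - sum_T))
z = (U - n1 * n2 / 2) / sigma
print(sum_T, round(z, 2))   # 70.5 3.45
```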
The value of z when corrected for ties is a little larger than that found earlier when the correction was not incorporated. The difference between z ≥ 3.43 and z ≥ 3.45, however, is negligible insofar as the probability given by Table A is concerned. Both z's are read as having an associated probability of p < .0003 (one-tailed test).
As this example demonstrates, ties have only a slight effect. Even when a large proportion of the scores are tied (this example had over 90 per cent of its observations involved in ties), the effect is practically negligible. Observe, however, that the magnitude of the correction factor, ΣT, depends importantly on the length of the various ties, i.e., on the size of the various t's. Thus a tie of length 4 contributes 5.0 to ΣT in this example, whereas two ties of length 2 contribute together only 1.0 (that is, .5 + .5) to ΣT. And a tie of length 6 contributes 17.5, whereas two of length 3 contribute together only 2.0 + 2.0 = 4.0.
When the correction is employed, it tends to increase the value of z slightly, making it more significant. Therefore when we do not correct for ties our test is "conservative" in that the value of p will be slightly inflated. That is, the value of the probability associated with the observed data under H₀ will be slightly larger than that which would be found were the correction employed. The writer's recommendation is
that one should correct for ties only if the proportion of ties is quite large, if some of the t's are large, or if the p which is obtained without the correction is very close to one's previously set value of α.
Summary of procedure. These are the steps in the use of the Mann-Whitney U test:
1. Determine the values of n₁ and n₂. n₁ = the number of cases in the smaller group; n₂ = the number of cases in the larger group.
2. Rank together the scores for both groups, assigning the rank of 1 to the score which is algebraically lowest. Ranks range from 1 to N = n₁ + n₂. Assign tied observations the average of the tied ranks.
3. Determine the value of U either by the counting method or by applying formula (6.7a) or (6.7b).
4. The method for determining the significance of the observed value of U depends on the size of n₂:
a. If n₂ is 8 or less, the exact probability associated with a value as small as the observed value of U is shown in Table J. For a two-tailed test, double the value of p shown in that table. If your observed U is not shown in Table J, it is U′ and should be transformed to U by formula (6.6).
b. If n₂ is between 9 and 20, the significance of any observed value of U may be determined by reference to Table K. If your observed value of U is larger than n₁n₂/2, it is U′; apply formula (6.6) for a transformation.
c. If n₂ is larger than 20, the probability associated with a value as extreme as the observed value of U may be determined by computing the value of z as given by formula (6.8), and testing this value by referring to Table A. For a two-tailed test, double the p shown in that table. If the proportion of ties is very large or if the obtained p is very close to α, apply the correction for ties, i.e., use formula (6.9) rather than (6.8).
5. If the observed value of U has an associated probability equal to or less than α, reject H₀ in favor of H₁.

Power-Efficiency
If the Mann-Whitney test is applied to data which might properly be analyzed by the most powerful parametric test, the t test, its power-efficiency approaches 3/π = 95.5 per cent as N increases (Mood, 1954), and is close to 95 per cent even for moderate-sized samples. It is therefore an excellent alternative to the t test, and of course it does not have the restrictive assumptions and requirements associated with the t test. Whitney (1948, pp. 51-56) gives examples of distributions for which the U test is superior to its parametric alternative, i.e., for which the U test has greater power to reject H₀.
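As a modern cross-check on the hand computations in this section: scipy.stats.mannwhitneyu implements the same test, applying the normal approximation with the tie correction for samples of this size (the ratings below are transcribed from Table 6.14; scipy reports the statistic for its first sample, which here equals the 304 computed above, and applies a continuity correction by default):

```python
from scipy.stats import mannwhitneyu

absent  = [13, 12, 12, 10, 10, 10, 10, 9, 8, 8, 7, 7, 7, 7, 7, 6]
present = [17, 16, 15, 15, 15, 14, 14, 14, 13, 13, 13, 12, 12, 12, 12,
           11, 11, 10, 10, 10, 8, 8, 6]
# H1: the "present" societies are stochastically higher in anxiety
res = mannwhitneyu(present, absent, alternative="greater",
                   method="asymptotic")
print(res.statistic, round(res.pvalue, 4))   # 304.0 0.0003
```

The one-tailed p value agrees with the p < .0003 obtained above from formula (6.9).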
References

For discussions of the Mann-Whitney test,¹ the reader may refer to Auble (1953), Mann and Whitney (1947), Whitney (1948), and Wilcoxon (1945).

THE KOLMOGOROV-SMIRNOV TWO-SAMPLE TEST

Function and Rationale
The Kolmogorov-Smirnov two-sample test is a test of whether two independent samples have been drawn from the same population (or from populations with the same distribution). The two-tailed test is sensitive to any kind of difference in the distributions from which the two samples were drawn: differences in location (central tendency), in dispersion, in skewness, etc. The one-tailed test is used to decide whether or not the values of the population from which one of the samples was drawn are stochastically larger than the values of the population from which the other sample was drawn, e.g., to test the prediction that the scores of an experimental group will be "better" than those of the control group.
Like the Kolmogorov-Smirnov one-sampletest (pages47 to 52), th two-sample test is concernedwith the agreementbetweentwo cumulative distributions.
The one-sample test is concerned with the agreement
between the distribution of a set of sample values and some specified theoretical
distribution.
The two-sample test is concerned with
the
agreementbetweentwo setsof samplevalues. If the two sampleshave in fact been drawn from the same population distribution, then the cumulative distributions of both samples may be
expectedto be fairly closeto eachother, inasmuchas they both should show only random deviations from the population distribution.
If the two sample cumulative distributions are "too far apart" at any point, this suggests that the samples come from different populations. Thus a large enough deviation between the two sample cumulative distributions is evidence for rejecting H₀.

¹ Two nonparametric statistical tests which are essentially equivalent to the Mann-Whitney U test have been reported in the literature and should be mentioned here. The first of these is due to Festinger (1946). He gives a method for calculating exact probabilities and gives a two-tailed table for the .05 and .01 levels of significance for n₁ + n₂ ≤ 40 when n₁ ≤ 12. In addition, for n₁ from 13 to 15, values are given up to n₁ + n₂ = 30. The second test is due to White (1952), who gives a method essentially the same as the Mann-Whitney test except that rather than U it employs S (the sum of the ranks of one of the groups) as its statistic. White offers two-tailed tables for the .05, .01, and .001 levels of significance for n₁ + n₂ ≤ 30. Inasmuch as these tests are linearly related to the Mann-Whitney test (and therefore will yield the same results in the test of H₀ for any given batch of data), it was felt that inclusion of complete discussions of them in this text would introduce unnecessary redundancy.

THE CASE OF TWO INDEPENDENT SAMPLES

Method
To apply the Kolmogorov-Smirnov two-sample test, we make a cumulative frequency distribution for each sample of observations, using the same intervals for both distributions. For each interval, then, we subtract one step function from the other. The test focuses on the largest of these observed deviations.

Let Sn₁(X) = the observed cumulative step function of one of the samples, that is, Sn₁(X) = K/n₁, where K = the number of scores equal to or less than X. And let Sn₂(X) = the observed cumulative step function of the other sample, that is, Sn₂(X) = K/n₂. Now the Kolmogorov-Smirnov two-sample test focuses on

    D = maximum [Sn₁(X) − Sn₂(X)]        (6.10a)

for a one-tailed test, and on

    D = maximum |Sn₁(X) − Sn₂(X)|        (6.10b)

for a two-tailed test. The sampling distribution of D is known (Smirnov, 1948; Massey, 1951), and the probabilities associated with the occurrence of values as large as an observed D under the null hypothesis (that the two samples have come from the same distribution) have been tabled.

Notice that for a one-tailed test we find the maximum value of D in the predicted direction [by formula (6.10a)], and that for a two-tailed test we find the maximum absolute value of D [by formula (6.10b)], i.e., we find the maximum deviation irrespective of direction. This is because in the one-tailed test H₁ is that the population values from which one of the samples was drawn are stochastically larger than the population values from which the other sample was drawn, whereas in the two-tailed test H₁ is simply that the two samples are from different populations.

In the use of the Kolmogorov-Smirnov test on data for which the size and number of the intervals are arbitrary, it is well to use as many intervals as are feasible. When too few intervals are used, information may be wasted. That is, the maximum vertical deviation D of the two cumulative step functions may be obscured by casting the data into too few intervals.
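The definitions above translate directly into code. The sketch below is illustrative only (the function name is our own, and it works from the raw scores rather than from grouped intervals, which avoids the interval-choice problem just discussed):

```python
def ks_two_sample_d(sample1, sample2, two_tailed=True):
    """Largest deviation between two empirical cumulative step functions.

    Sn1(X) = K/n1, where K = number of scores in sample1 that are <= X;
    similarly for Sn2(X).  D is evaluated at every observed score X.
    """
    n1, n2 = len(sample1), len(sample2)
    points = sorted(set(sample1) | set(sample2))
    diffs = []
    for x in points:
        s1 = sum(v <= x for v in sample1) / n1   # Sn1(X)
        s2 = sum(v <= x for v in sample2) / n2   # Sn2(X)
        diffs.append(s1 - s2)
    if two_tailed:
        return max(abs(d) for d in diffs)   # formula (6.10b)
    return max(diffs)                        # formula (6.10a)
```

For instance, ks_two_sample_d([1, 2, 3, 4], [3, 4, 5, 6]) returns 0.5, while the one-tailed value in the opposite direction, ks_two_sample_d([3, 4, 5, 6], [1, 2, 3, 4], two_tailed=False), is 0.0, illustrating why the direction of subtraction matters for the one-tailed test.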
For instance, in the example presented below for the case of small samples, only 8 intervals were used, in order to simplify the exposition. As it happens, 8 intervals were sufficient, in this case, to yield a D which enabled us to reject H₀ at the predetermined level of significance. If it had happened that with these 8 intervals the observed D had not been large enough to permit us to reject H₀, before we could accept H₀ it would be necessary for us to increase the number of intervals, in order to ascertain whether the maximum deviation D had been obscured by the use of too few intervals. It is well then to use as many intervals as are feasible to start with, so as not to waste the information inherent in the data.
Small samples. When n₁ = n₂, and when both n₁ and n₂ are 40 or less, Table L of the Appendix may be used in the test of the null hypothesis. The body of this table gives various values of K_D, which is defined as the numerator of the largest difference between the two cumulative distributions, i.e., the numerator of D. To read Table L, one must know the value of N (which in this case is the value of n₁ = n₂) and the value of K_D. Observe also whether H₁ calls for a one-tailed or a two-tailed test. With this information, one may determine the significance of the observed data. For example, in a one-tailed test where N = 14, a K_D of 8 or larger permits us to reject the null hypothesis at the α = .01 level.
Example for Small Samples

Lepley¹ compared the serial learning of 10 seventh-grade pupils with the serial learning of 10 eleventh-grade pupils. His hypothesis was that the primacy effect should be less prominent in the learning of the younger subjects. The primacy effect is the tendency for the material learned early in a series to be remembered more efficiently than the material learned later in the series. He tested this hypothesis by comparing the percentage of errors made by the two groups in the first half of the series of learned material, predicting that the older group (the eleventh-graders) would make relatively fewer errors in repeating the first half of the series than would the younger group.

i. Null Hypothesis. H₀: there is no difference in the proportion of errors made in recalling the first half of a learned series between eleventh-grade subjects and seventh-grade subjects. H₁: eleventh-graders make proportionally fewer errors than seventh-graders in recalling the first half of a learned series.
ii. Statistical Test. Since two small independent samples of equal size are being compared, the Kolmogorov-Smirnov two-sample test may be applied to the data.
iii. Significance Level. Let α = .01. n₁ = n₂ = N = the number of subjects in each group = 10.

¹ Lepley, W. M. 1934. Serial reactions considered as conditioned reactions. Psychol. Monogr., 46, No. 205.
iv. Sampling Distribution. Table L gives critical values of K_D for n₁ = n₂ when n₁ and n₂ are 40 or less.
v. Region of Rejection. Since H₁ predicts the direction of the difference, the region of rejection is one-tailed. H₀ will be rejected if the value of K_D for the largest deviation in the predicted direction is so large that the probability associated with its occurrence under H₀ is equal to or less than α = .01.
vi. Decision. Table 6.15 gives the percentage of each subject's errors which were committed in the recall of the first half of the serially learned material.

TABLE 6.15. PERCENTAGE OF TOTAL ERRORS IN FIRST HALF OF SERIES

    Eleventh-grade subjects    Seventh-grade subjects
           35.2                      39.1
           39.2                      41.2
           40.9                      45.2
           38.1                      46.2
           34.4                      48.4
           29.1                      48.7
           41.8                      55.0
           24.3                      40.6
           32.4                      52.1
           32.6                      47.2

For analysis by the Kolmogorov-Smirnov test, these data were cast in two cumulative frequency distributions, shown in Table 6.16. Here n₁ = 10 eleventh-graders, and n₂ = 10 seventh-graders.
TABLE 6.16. DATA IN TABLE 6.15 CAST FOR KOLMOGOROV-SMIRNOV TEST

Per cent of total errors in first half of series:

    Interval              24-27  28-31  32-35  36-39  40-43  44-47  48-51  52-55
    S₁₀(X), eleventh       1/10   2/10   6/10   8/10  10/10  10/10  10/10  10/10
    S₁₀(X), seventh           0      0      0   1/10   3/10   6/10   8/10  10/10
    Difference             1/10   2/10   6/10   7/10   7/10   4/10   2/10      0
Observe that the largest discrepancy between the two series is 7/10. K_D = 7, the numerator of this largest difference. Reference to Table L reveals that when N = 10, a value of K_D = 7 is significant at the α = .01 level for a one-tailed test. Inasmuch as the probability associated with the occurrence of a value as large as the observed value of K_D under H₀ is at most equal to the previously set level of significance, our decision is to reject H₀ in favor of H₁.* We conclude that eleventh-graders make proportionally fewer errors than seventh-graders in recalling the first half of a learned series.

* Using a parametric technique, Lepley reached the same decision. He used the critical ratio technique, and rejected H₀ at α = .01.

Large samples: two-tailed test. When both n₁ and n₂ are larger than 40, Table M of the Appendix may be used for the Kolmogorov-Smirnov two-sample test. When this table is used, it is not necessary that n₁ = n₂. To use this table, determine the value of D for the observed data, using formula (6.10b). Then compare that observed value with the critical one which is obtained by entering the observed values of n₁ and n₂ in the expression given in Table M. If the observed D is equal to or larger than that computed from the expression in the table, H₀ may be rejected at the level of significance (two-tailed) associated with that expression.

For example, suppose n₁ = 55 and n₂ = 60, and that a researcher wishes to make a two-tailed test at α = .05. In the row in Table M for α = .05, he finds the expression for the value of D which his observation must equal or exceed in order for him to reject H₀. By computation, he finds that his D must be .254 or larger for H₀ to be rejected, for

    1.36 √[(n₁ + n₂)/(n₁n₂)] = 1.36 √[(55 + 60)/((55)(60))] = .254
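This critical-value arithmetic is easy to check mechanically. The sketch below is ours rather than the text's; 1.36 is the α = .05 multiplier quoted above, and other significance levels would use the other multipliers given in Table M:

```python
import math

def ks_critical_d(n1, n2, coefficient=1.36):
    """Large-sample two-tailed critical D: coefficient * sqrt((n1 + n2)/(n1 * n2)).

    coefficient = 1.36 corresponds to alpha = .05; the other rows of
    Table M supply the multipliers for other significance levels.
    """
    return coefficient * math.sqrt((n1 + n2) / (n1 * n2))

print(round(ks_critical_d(55, 60), 3))   # reproduces the .254 in the text
```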
Large samples: one-tailed test. When n₁ and n₂ are large, and regardless of whether or not n₁ = n₂, we may make a one-tailed test by using

    D = maximum [Sn₁(X) − Sn₂(X)]        (6.10a)

We test the null hypothesis that the two samples have been drawn from the same population against the alternative hypothesis that the values of the population from which one of the samples was drawn are stochastically larger than the values of the population from which the other sample was drawn. For example, we may wish to test not simply whether an experimental group is different from a control group but whether the experimental group is "higher" than the control group. It has been shown (Goodman, 1954) that

    χ² = 4D² (n₁n₂)/(n₁ + n₂)        (6.11)

has a sampling distribution which is approximated by the chi-square distribution with df = 2. That is, we may determine the significance of an observed value of D, as computed from formula (6.10a), by solving formula (6.11) for the observed values of D, n₁, and n₂, and referring to the chi-square distribution with df = 2 (Table C of the Appendix).
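Because the chi-square distribution with df = 2 has the closed-form survival function exp(−χ²/2), formula (6.11) yields a p value without consulting a table. The following sketch is illustrative and not part of the original text:

```python
import math

def ks_one_tailed_p(d, n1, n2):
    """Approximate one-tailed p for an observed one-tailed D, via formula (6.11).

    chi2 = 4 * D**2 * n1 * n2 / (n1 + n2) is referred to the chi-square
    distribution with df = 2, whose survival function is exp(-chi2 / 2).
    """
    chi2 = 4 * d**2 * (n1 * n2) / (n1 + n2)
    p = math.exp(-chi2 / 2)   # P(chi-square with df = 2 >= chi2)
    return chi2, p
```

For the authoritarianism example worked later in this section, ks_one_tailed_p(0.406, 44, 54) gives χ² ≈ 15.99 and p < .001; the text's 15.97 differs only through rounding in intermediate steps.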
Example for Large Samples: One-tailed Test

In a study of correlates of authoritarian personality structure,¹ one hypothesis was that persons high in authoritarianism would show a greater tendency to possess stereotypes about members of various national and ethnic groups than would those low in authoritarianism. This hypothesis was tested with a group of 98 randomly selected college women. Each subject was given 20 photographs and asked to "identify" those whose nationality she recognized, by matching the appropriate photograph with the name of the national group. Subjects were free to "identify" (by matching) as many or as few photographs as they wished. Since, unknown to the subjects, all photographs were of Mexican nationals (either candidates for the Mexican legislature or winners in a Mexican beauty contest), and since the matching list of 20 different national and ethnic groups did not include "Mexican," the number of photographs which any subject "identified" constituted an index of that subject's tendency to stereotype.

Authoritarianism was measured by the well-known F scale of authoritarianism,² and the subjects were grouped as "high" and "low" scorers. "High" scorers were those who scored at or above the median on the F scale; "low" scorers were those who scored below the median. The prediction was that these two groups would differ in the number of photographs they "identified."

i. Null Hypothesis. H₀: women at this university who score low in authoritarianism stereotype as much ("identify" as many photographs) as women who score high in authoritarianism. H₁: women who score high in authoritarianism stereotype more ("identify" more photographs) than women who score low in authoritarianism.
ii. Statistical Test. Since the low scorers and the high scorers constitute two independent groups, a test for two independent samples was chosen. Because the number of photographs "identified" by a subject cannot be considered more than an ordinal measure of that subject's tendency to stereotype, a nonparametric test is desirable. The Kolmogorov-Smirnov two-sample test compares the two sample cumulative frequency distributions and determines whether the observed D indicates that they have been drawn from two populations, one of which is stochastically larger than the other.

¹ Siegel, S. 1954. Certain determinants and correlates of authoritarianism. Genet. Psychol. Monogr., 49, 187-229.
² Presented in Adorno, T. W., Frenkel-Brunswik, Else, Levinson, D. J., and Sanford, R. N. The authoritarian personality. New York: Harper, 1950.
iii. Significance Level. Let α = .01. The sizes of n₁ and n₂ may be determined only after the data are collected, for subjects will be grouped according to whether they score at or above the median on the F scale or score below the median on the F scale.
iv. Sampling Distribution. The sampling distribution of

    χ² = 4D² (n₁n₂)/(n₁ + n₂)

[i.e., formula (6.11)], where D is computed from formula (6.10a), is approximated by the chi-square distribution with df = 2. The probability associated with an observed value of D may be determined by computing χ² from formula (6.11) and referring to Table C.
v. Rejection Region. Since H₁ predicts the direction of the difference between the low and high F scorers, a one-tailed test is used. The region of rejection consists of all values of χ², as computed from formula (6.11), which are so large that the probability associated with their occurrence under H₀ for df = 2 is equal to or less than α = .01.
vi. Decision. Of the 98 college women, 44 obtained F scores below the median. Thus n₁ = 44. The remaining 54 women obtained scores at or above the median: n₂ = 54. The number of photographs "identified" by each of the subjects in the two groups is given in Table 6.17.

TABLE 6.17. NUMBER OF LOW AND HIGH AUTHORITARIANS "IDENTIFYING" VARIOUS NUMBERS OF PHOTOGRAPHS

To apply the Kolmogorov-Smirnov test, we recast these data into two cumulative frequency distributions, as in Table 6.18.

TABLE 6.18. DATA IN TABLE 6.17 CAST FOR KOLMOGOROV-SMIRNOV TEST

For ease of computation, the fractions shown in Table 6.18 may be converted to decimal values; these values are shown in Table 6.19.

TABLE 6.19. DECIMAL EQUIVALENTS OF DATA IN TABLE 6.18

By simple subtraction, we find the differences between the two sample distributions at the various intervals. The largest
of these differences in the predicted direction is .406. That is,

    D = maximum [S₄₄(X) − S₅₄(X)]        (6.10a)
      = .406

With D = .406, we compute the value of χ² as defined by formula (6.11):

    χ² = 4D² (n₁n₂)/(n₁ + n₂)        (6.11)
       = 4(.406)² [(44)(54)/(44 + 54)]
       = 15.97

Reference to Table C reveals that the probability associated with χ² ≥ 15.97 for df = 2 is p < .001 (one-tailed test). Since this value is smaller than α = .01, we may reject H₀ in favor of H₁.* We conclude that women who score high on the authoritarianism scale stereotype more ("identify" more photographs) than do women who score low on the scale.

It is interesting to notice that the chi-square approximation may also be used with small samples, but in this case it leads to a conservative test. That is, the error in the use of the chi-square approximation with small samples is always in the "safe" direction (Goodman, 1954, p. 168). In other words, if H₀ is rejected with the use of the chi-square approximation with small samples, we may surely have confidence in the decision. When this approximation is used for small samples, it is not necessary that n₁ and n₂ be equal.

* Using a parametric test, Siegel made the same decision. He found that t = 3.65, p < .001 (one-tailed test).
To show how well the chi-square approximation works even for small samples, let us use it on the data presented in the example for small samples (above). In that case, n₁ = n₂ = 10, and D, as computed from formula (6.10a), was 7/10. The chi-square approximation:

    χ² = 4D² (n₁n₂)/(n₁ + n₂)        (6.11)
       = 4(7/10)² [(10)(10)/(10 + 10)]
       = 9.8

Table C shows that χ² = 9.8 with df = 2 is significant at the .01 level.
This is the same result as that which was obtained for these data by the use of Table L, which is based on exact computations.

Summary of procedure. These are the steps in the use of the Kolmogorov-Smirnov two-sample test:
1. Arrange each of the two groups of scores in a cumulative frequency distribution, using the same intervals (or classifications) for both distributions. Use as many intervals as are feasible.
2. By subtraction, determine the difference between the two sample cumulative distributions at each listed point.
3. By inspection, determine the largest of these differences; this is D. For a one-tailed test, D is the largest difference in the predicted direction.
4. The method for determining the significance of the observed D depends on the size of the samples and the nature of H₁:
a. When n₁ = n₂ = N, and when N ≤ 40, Table L is used. It gives critical values of K_D (the numerator of D) for various levels of significance, for both one-tailed and two-tailed tests.
b. For a two-tailed test, when n₁ and n₂ are both larger than 40, Table M is used. In such cases it is not necessary that n₁ = n₂. Critical values of D for any given large values of n₁ and n₂ may be computed from the expressions given in the body of Table M.
c. For a one-tailed test when n₁ and n₂ are large, the value of χ² with df = 2 which is associated with the observed D is computed from formula (6.11). The significance of the resulting value of χ² with df = 2 may be determined by reference to Table C. This chi-square approximation is also useful for small samples with n₁ ≠ n₂, but in that application the test is conservative.
If the observed value is equal to or larger than that given in the appropriate table for a particular level of significance, H₀ may be rejected at that level of significance.

Power-Efficiency
When compared with the t test, the Kolmogorov-Smirnov test has high power-efficiency (about 96 per cent) for small samples (Dixon, 1954). It would seem that as the sample size increases the power-efficiency would tend to decrease slightly. The Kolmogorov-Smirnov test seems to be more powerful in all cases than either the χ² test or the median test. The evidence seems to indicate that whereas for very small samples the Kolmogorov-Smirnov test is slightly more efficient than the Mann-Whitney test, for large samples the converse holds.

References
For other discussions of the Kolmogorov-Smirnov two-sample test, the reader may consult Birnbaum (1952; 1953), Dixon (1954), Goodman (1954), Kolmogorov (1941), Massey (1951a; 1951b), and Smirnov (1948).

THE WALD-WOLFOWITZ RUNS TEST

Function
The Wald-Wolfowitz runs test is applicable when we wish to test the null hypothesis that two independent samples have been drawn from the same population against the alternative hypothesis that the two groups differ in any respect whatsoever. That is, with sufficiently large samples the Wald-Wolfowitz test can reject H₀ if the two populations differ in any way: in central tendency, in variability, in skewness, or whatever. Thus it may be used to test a large class of alternative hypotheses. Whereas many other tests are addressed to particular sorts of differences between two groups (e.g., the median test determines whether the two samples have been drawn from populations with the same median), the Wald-Wolfowitz test is addressed to any sort of difference.

Rationale and Method

The Wald-Wolfowitz test assumes that the variable under consideration has an underlying distribution which is continuous. It requires that the measurement of that variable be in at least an ordinal scale.
To apply the test to data from two independent samples of size n₁ and n₂, we rank the n₁ + n₂ scores in order of increasing size. That is,
we cast the scores of all subjects in both groups into one ordering. Then we determine the number of runs in this ordered series. A run is defined as any sequence of scores from the same group (either group 1 or group 2).
For example, suppose we observed these scores from group A (consisting of 3 cases, n₁ = 3) and group B (consisting of 4 cases, n₂ = 4):

    Scores for group A:  11  16  20
    Scores for group B:   6   6   8  12

When these 7 scores are cast in one ordered series, we have:

     6   6   8  11  12  16  20
     B   B   B   A   B   A   A

Notice that we retain the identity of each score by accompanying that score with the sign of the group to which it belongs. We then observe the order of the occurrence of these signs (A's and B's) to determine the number of runs. Four runs occurred in this series: the 3 lowest scores were all from group B and thus constituted 1 run of B's; the next highest score is a run of a single A; another run constituted by 1 B follows; and the two highest scores are both from group A and constitute the final run.
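Counting runs can be mechanized in a few lines. The sketch below is illustrative rather than part of the original text, and it assumes there are no ties between scores of the two groups (ties are taken up later in this section):

```python
def count_runs(sample_a, sample_b):
    """Number of runs r in the combined ordered series of two samples.

    Each score keeps the label of its group; a run ends wherever the
    group label changes.  Assumes no cross-group ties, so the ordering
    (and hence r) is unique.
    """
    labeled = [(x, "A") for x in sample_a] + [(x, "B") for x in sample_b]
    labeled.sort(key=lambda pair: pair[0])
    runs = 1
    for (_, prev), (_, cur) in zip(labeled, labeled[1:]):
        if cur != prev:
            runs += 1
    return runs
```

For instance, count_runs([11, 16, 20], [6, 6, 8, 12]) returns 4, the value obtained by inspection above.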
Now we may reason that if the two samples are from the same population (that is, if H₀ is true), then the scores of the A's and the B's will be well mixed. In that case r, the number of runs, will be relatively large. It is when H₀ is false that r is small.
For example, r will be small if the two samples were drawn from populations having different medians. Suppose the population from which the A cases were drawn had a higher median than the population from which the B cases were drawn. In the ordered series of scores from the two samples, we would expect a long run of B's at the lower end of the series and a long run of A's at the upper end, and consequently an r which is relatively small. Again, suppose the samples were drawn from populations which differed in variability. If the population from which the A cases were drawn was highly dispersed, whereas the population from which the B cases were drawn was homogeneous or compact, we would expect a long run of A's at each end of the ordered series and thus a relatively small value of r.
Similar arguments can be presented to show that when the populations from which the n₁ and n₂ cases were drawn differ in skewness or kurtosis, then the size of r will also be "too small," i.e., small relative to the sizes of n₁ and n₂.
In general, then, we reject H₀ if r = the number of runs is "too small." The sampling distribution of r arises from the fact that when two different kinds of objects (say n₁ of one kind and n₂ of the other) are arranged in a single line, the total number of different possible arrangements is the binomial coefficient

    C(n₁ + n₂, n₁) = (n₁ + n₂)! / (n₁! n₂!)

From this it can be shown (Stevens, 1939; Mood, 1950, pp. 392-393) that the probability of getting an observed value of r or an even smaller value is

    p(r ≤ r′) = [1/C(n₁ + n₂, n₁)] Σ(r=2 to r′) 2 C(n₁ − 1, r/2 − 1) C(n₂ − 1, r/2 − 1)        (6.12a)

when r is an even number. When r is an odd number, r = 2k − 1, that probability is given by

    p(r ≤ r′) = [1/C(n₁ + n₂, n₁)] Σ(r=2 to r′) [C(n₁ − 1, k − 1) C(n₂ − 1, k − 2) + C(n₁ − 1, k − 2) C(n₂ − 1, k − 1)]        (6.12b)

(In computing the cumulative probability, each even term of the sum is counted by the summand in (6.12a) and each odd term by the summand in (6.12b).)
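Formulas (6.12a) and (6.12b) can be summed directly to obtain the exact tail probability on which the small-sample table is based. The sketch below is illustrative only (the function name is ours), using Python's math.comb for the binomial coefficients:

```python
from math import comb

def runs_exact_p(r_obs, n1, n2):
    """Exact P(r <= r_obs) under H0, from the run-count formulas.

    The number of arrangements with exactly r runs is
      2 * C(n1-1, k-1) * C(n2-1, k-1)                    for even r = 2k,
      C(n1-1, k-1)*C(n2-1, k-2) + C(n1-1, k-2)*C(n2-1, k-1)
                                                         for odd  r = 2k - 1.
    """
    total = comb(n1 + n2, n1)
    count = 0
    for r in range(2, r_obs + 1):
        if r % 2 == 0:
            k = r // 2
            count += 2 * comb(n1 - 1, k - 1) * comb(n2 - 1, k - 1)
        else:
            k = (r + 1) // 2
            count += (comb(n1 - 1, k - 1) * comb(n2 - 1, k - 2)
                      + comb(n1 - 1, k - 2) * comb(n2 - 1, k - 1))
    return count / total
```

As a check, with n₁ = n₂ = 2 all six arrangements can be listed by hand, giving P(r ≤ 2) = 2/6 and P(r ≤ 3) = 4/6, which the function reproduces; and runs_exact_p(4, 12, 12) is far below .05, consistent with the small-samples example below.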
Small samples. Tables of critical values of r, based on formulas (6.12a) and (6.12b), have been constructed. Table F₁ of the Appendix presents critical values of r for n₁, n₂ ≤ 20. These values are significant at the .05 level. That is, if an observed value of r is equal to or less than the value tabled for the observed values of n₁ and n₂, H₀ may be rejected at the .05 level of significance. If the observed value of r is larger than that shown in Table F₁, we can only conclude that, in terms of the total number of runs observed, the null hypothesis cannot be rejected at α = .05.

Example for Small Samples
Twelve four-year-old boys and twelve four-year-old girls were observed during two 15-minute play sessions, and each child's play during both periods was scored for incidence and degree of aggression.¹ With these scores, it is possible to test the hypothesis that there are sex differences in the amount of aggression shown.

¹ Siegel, Alberta E. 1956. Film-mediated fantasy aggression and strength of aggressive drive. Child Develpm., 27, 365-378.

i. Null Hypothesis. H₀: incidence and degree of aggression are the same in four-year-olds of both sexes. H₁: four-year-old boys
and four-year-old girls display differences in incidence and degree of aggression.
ii. Statistical Test. Since the data are in an ordinal scale, and since the hypothesis concerns differences of any kind between the aggression scores of two independent groups (boys and girls), the Wald-Wolfowitz runs test is chosen.
iii. Significance Level. Let α = .05. n₁ = 12 = the number of boys, and n₂ = 12 = the number of girls.
iv. Sampling Distribution. From the sampling distribution of r, critical values have been tabled in Table F₁ for n₁, n₂ ≤ 20. (Although n₁ = n₂ in this example, this is not necessary for the use of the runs test.)
v. Rejection Region. The region of rejection consists of all values of r which (for n₁ = 12 and n₂ = 12) are so small that the probability associated with their occurrence under H₀ is equal to or less than α = .05.
vi. Decision. Each child's score for his total aggression in both sessions was obtained. These scores are given in Table 6.20.

TABLE 6.20. AGGRESSION SCORES OF BOYS AND GIRLS IN FREE PLAY

    Boys:   41  45  50  65  69  72  79  86  104  113  118  141
    Girls:  15  16  16  20  22  26  36  40   55   58   65  [illegible]
Now if we combine the scores of the boys (B's) and girls (G's) in a single ordered series, we may determine the number of runs of G's and B's. This ordered series is shown in Table 6.21. Each run is marked, and we observe that r = 4. Reference to Table F₁ reveals that for n₁ = 12 and n₂ = 12, an r of 7 is significant at the .05 level. Since our value of r is smaller than that tabled, we may reject H₀ at α = .05.* We conclude that boys and girls display differences in aggression in the free-play situation.

* Using the Mann-Whitney U test for these data, the investigator rejected H₀ at p < .0002.
TABLE 6.21. DATA IN TABLE 6.20 CAST FOR RUNS TEST

    Run 1 (G):  15  16  16  20  22  26  36  40
    Run 2 (B):  41  45  50
    Run 3 (G):  55  58  65
    Run 4 (B):  65  69  72  79  86  104  113  118  141
Large samples. When either n₁ or n₂ is larger than 20, Table F₁ cannot be used. However, for such large samples the sampling distribution under H₀ for r is approximately normal, with

    Mean = μᵣ = 2n₁n₂/(n₁ + n₂) + 1

and

    Standard deviation = σᵣ = √[ 2n₁n₂(2n₁n₂ − n₁ − n₂) / ((n₁ + n₂)²(n₁ + n₂ − 1)) ]

That is, the expression

    z = (r − μᵣ)/σᵣ = [r − (2n₁n₂/(n₁ + n₂) + 1)] / √[ 2n₁n₂(2n₁n₂ − n₁ − n₂) / ((n₁ + n₂)²(n₁ + n₂ − 1)) ]        (6.13)

is approximately normally distributed with zero mean and unit variance. Thus Table A of the Appendix, which gives the probability associated with the occurrence under H₀ of values as extreme as an observed z, may be used with large samples to determine the significance of an observed value of r.

A correction for continuity should be used when n₁ + n₂ is not very large. The correction is required because the distribution of empirical values of r must of necessity be discrete, whereas with large samples we approximate that sampling distribution by the normal curve, a continuous curve. This approximation can be improved by correcting for continuity. The correction is achieved by subtracting .5 from the absolute difference between r and μᵣ, that is, by using |r − μᵣ| − .5 in the numerator. Thus to compute the value of z with the correction for continuity incorporated, we use formula (6.14):

    z = [ |r − (2n₁n₂/(n₁ + n₂) + 1)| − .5 ] / √[ 2n₁n₂(2n₁n₂ − n₁ − n₂) / ((n₁ + n₂)²(n₁ + n₂ − 1)) ]        (6.14)

Computation of formula (6.14) will yield a z whose associated tabular value (in Table A) gives the probability under H₀ of a value as small as the observed value of r. If the z obtained from the use of formula (6.14) has an associated probability, p (read directly from Table A), which is equal to or less than α, then H₀ may be rejected at the α level of significance.
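Formulas (6.13) and (6.14) translate directly into code; the sketch below is illustrative rather than part of the original text:

```python
import math

def runs_z(r, n1, n2, continuity=True):
    """Normal approximation for the runs test, formulas (6.13) and (6.14)."""
    n = n1 + n2
    mean = 2 * n1 * n2 / n + 1                                   # mu_r
    sd = math.sqrt(2 * n1 * n2 * (2 * n1 * n2 - n1 - n2)
                   / (n * n * (n - 1)))                          # sigma_r
    if continuity:
        return (abs(r - mean) - 0.5) / sd   # formula (6.14)
    return (r - mean) / sd                   # formula (6.13)
```

For the rat-learning example worked below, runs_z(6, 8, 21) gives z ≈ 2.91; the text's 2.92 reflects rounding in intermediate steps.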
Example for Large Samples

In a study which tested the equipotentiality theory,¹ Ghiselli compared the learning (in a brightness-discrimination task) of 21 normal rats with the relearning of 8 postoperative rats with cortical lesions. That is, the number of trials to relearning required postoperatively by the 8 E rats was compared with the number of trials to learning required by the 21 C rats.

¹ Ghiselli, E. E. 1938. Mass action and equipotentiality of the cerebral cortex in brightness discrimination. J. Comp. Psychol., 25, 273-290.

i. Null Hypothesis. H₀: there is no difference between normal rats and postoperative rats with cortical lesions with respect to rate of learning (or relearning) in the brightness-discrimination task. H₁: the two groups of rats differ with respect to rate of learning (or relearning).
ii. Statistical Test. The Wald-Wolfowitz test was chosen to provide an over-all test for differences between the two groups. Since n₂ > 20, the normal curve approximation will be used. And since n₁ + n₂ = 29 is fairly small, the correction for continuity will be employed, i.e., formula (6.14) will be used.
iii. Significance Level. Let α = .01. n₁ = 8 postoperative rats and n₂ = 21 normal rats.
iv. Sampling Distribution. Table A gives the probability associated with the occurrence under H₀ of a value as extreme as any z computed from formula (6.14).
v. Rejection Region. The region of rejection consists of all values of z which are so extreme that the probability associated with their occurrence under H₀ is equal to or less than α = .01.
vi. Decision. Table 6.22 gives the number of trials to relearning required by the 8 postoperative animals and the number of trials to
TABLE 6.22. TRIALS TO LEARNING (RELEARNING) REQUIRED BY E AND C RATS

    E rats:  23  24  31  45  55  56  75  86
    C rats:   6   8  14  14  15  15  15  15  15  15  15  16  18  20  21  21  22  23  24  24  29
learning required by the 21 normal animals. From these scores we may determine the number of runs; the runs are shown in Table 6.23. We see that r = 6.
TABLE 6.23. DATA IN TABLE 6.22 CAST FOR RUNS TEST

    Run 1 (C):  6  8  14  14  15  15  15  15  15  15  15  16  18  20  21  21  22  23
    Run 2 (E):  23
    Run 3 (C):  24
    Run 4 (E):  24
    Run 5 (C):  24  29
    Run 6 (E):  31  45  55  56  75  86
To determine the probability under H₀ of such a small value or an even smaller value of r, we compute the value of z, substituting our observed values (r = 6, n₁ = 8, and n₂ = 21) in formula (6.14):

    z = [ |r − (2n₁n₂/(n₁ + n₂) + 1)| − .5 ] / √[ 2n₁n₂(2n₁n₂ − n₁ − n₂) / ((n₁ + n₂)²(n₁ + n₂ − 1)) ]        (6.14)
      = [ |6 − ((2)(8)(21)/(8 + 21) + 1)| − .5 ] / √[ (2)(8)(21)[(2)(8)(21) − 8 − 21] / ((8 + 21)²(8 + 21 − 1)) ]
      = 2.92

Reference to Table A indicates that z ≥ 2.92 has probability of occurrence under H₀ of p = .0018. Since this value of p is smaller than α = .01, our decision is to reject H₀ in favor of H₁.* We conclude that the two groups of animals differ significantly in their rate of learning (relearning).
Ties. Ideally no ties shouldoccur in the scoresusedfor a runs test
inasmuchas the populationsfrom whichthe samplesweredrawn are
assumed to becontinuous distributions.In practice,however, inaccurate or insensitivemeasurementresults in the occasionaloccurrenceof
Whentiesoccurbetween members of the differentgroups,then the sequence of scores is not unique. That is, suppose threesubjects
obtaintiedscores.Twoof theseareA's andoneis a B. In making the orderedseriesof scores,how shouldwe groupthesethree? If we gioup them asA B A, then we will havea differentnumberof runs than if we group them as A A B or (alternatively) as B A A.
If all tiesarewithinthesamesample, thenthenumber of runs(r) is unaffectedand thereforethe obtainedsignificance levelis unaffected. gut if observations from onesamplearetied with observations from the other sample,we cannotobtain a uniqueorderedseriesand therefore Usuallycannotobtaina unique valueof r, aswehavejust shown. This problemoccurredin the examplejust presented.Three rats required24 trials to learnto the criterion. In Table6.23we ordered
thesecases asC EC. Wemightjust aswellhaverankedthemE CC. ps it happens, no matterwhatorderwehadused,in this case,r would
havebeen6 orsmaller, andthusourdecision wouldhavebeento reject H, in any case. For this reasonties presented no majorproblemin reachinga statisticaldecisionconcerningthosedata.
THE CASE OF TWO INDEPENDENT SAMPLES

In other sets of data, they might. Our procedure with ties is to break the ties in all possible ways and observe the resulting values of r. If all these values are significant with respect to the previously set value of α, then ties present no major problem, although they do increase the tedium of computation. If the various possible ways of breaking up ties lead to some values of r which are significant and some which are not, the decision is more difficult. In this case, we suggest that the researcher determine the probability of occurrence associated with each possible value of r and take the average of these p's as his obtained probability for use in deciding whether to accept or reject H0.

* Using a parametric test, Ghiselli reached the same decision. He reported a critical ratio of 3.95, which would allow him to reject H0 at α = .00005.
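The mechanics of this test, counting the runs r and applying the usual large-sample normal approximation (the mean and variance of r under H0, with an optional correction for continuity), can be sketched in Python. The function and variable names here are ours, not the book's; this is an illustrative sketch, not the book's tabled procedure.

```python
from math import sqrt

def count_runs(labels):
    """Number of runs r in an ordered series of group labels."""
    runs = 1
    for prev, cur in zip(labels, labels[1:]):
        if cur != prev:
            runs += 1
    return runs

def runs_z(labels, continuity=True):
    """Large-sample z for the Wald-Wolfowitz runs test.

    Uses the mean and variance of r under H0; with continuity=True a
    0.5 correction for continuity is applied, as is appropriate when
    n1 + n2 is not very large. Too few runs give a negative z.
    """
    kinds = sorted(set(labels))
    n1 = sum(1 for x in labels if x == kinds[0])
    n2 = len(labels) - n1
    r = count_runs(labels)
    mean_r = 2.0 * n1 * n2 / (n1 + n2) + 1
    var_r = (2.0 * n1 * n2 * (2 * n1 * n2 - n1 - n2)
             / ((n1 + n2) ** 2 * (n1 + n2 - 1)))
    diff = r - mean_r
    if continuity:
        diff = max(abs(diff) - 0.5, 0.0) * (1 if diff > 0 else -1)
    return diff / sqrt(var_r)
```

For tied observations between samples, one would call `count_runs` on each possible ordering of the tied block and proceed as described above.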
If the number of ties between scores in the two different samples is large, r is essentially indeterminate. In such cases, the Wald-Wolfowitz test is inapplicable.
Summary of procedure. These are the steps in the use of the Wald-Wolfowitz runs test:

1. Arrange the n1 + n2 scores in a single ordered series.
2. Determine r = the number of runs.
3. The method for determining the significance of the observed value of r depends on the size of n1 and n2:
a. If both n1 and n2 are 20 or smaller, Table F gives critical values of r at the .05 level of significance. If the observed value of r is equal to or smaller than that tabled for the observed values of n1 and n2, then H0 may be rejected at α = .05.
b. If either n1 or n2 is larger than 20, formula (6.13) or (6.14) may be used to compute the value of z whose associated probability under H0 may be determined by reading the p associated with that z, as given in Table A. Choose formula (6.14) if n1 + n2 is not very large and thus a correction for continuity is desirable. If the p is equal to or less than α, reject H0.
4. If ties occur between scores from the two different samples, follow the procedure suggested above in the discussion of ties.

Power-Efficiency
Little is known about the power-efficiency of the Wald-Wolfowitz test. Moses (1952a) points out that statistical tests which test H0 against many alternatives simultaneously (and the runs test is such a test) are not very good at guarding against accepting H0 erroneously with respect to any one particular alternative.
For instance, if we were interested simply in testing whether two samples come from populations with the same location, the Mann-Whitney U test would be a more powerful test than the runs test because it is specifically designed to disclose differences of this type, whereas the runs test is designed to disclose differences of any type and is thus less powerful in disclosing any particular kind. This difference was illustrated in the example for small samples shown above. The investigator was interested in sex differences in location of aggression scores, and therefore used the U test. We tested the data for differences of any sort, using the runs test. Both tests rejected H0, but the Mann-Whitney U test did so at a much more extreme level of significance.
Mood (1954) points out that when the Wald-Wolfowitz test is used to test H0 against specific alternatives regarding location or variability, it has theoretic asymptotic efficiency of zero. However, Lehmann (1953) discusses whether it is proper to apply the notion of asymptotic normality to the runs test.
Smith (1953) states that empirical evidence indicates that the power-efficiency of the Wald-Wolfowitz test is about 75 per cent for sample sizes near 20.

References

The reader may find discussions of the runs test in Lehmann (1953), Moses (1952a), Smith (1953), Stevens (1939), and Swed and Eisenhart (1943).

THE MOSES TEST OF EXTREME REACTIONS
Function and Rationale
In the behavioral sciences, we sometimes expect that an experimental condition will cause some subjects to show extreme behavior in one direction while it causes others to show extreme behavior in the opposite direction. Thus we may think that economic depression and political instability will cause some people to become extremely reactionary and others to become extremely "left wing" in their political opinions. Or we may expect environmental unrest to create extreme excitement in some mentally ill people while it creates extreme withdrawal in others. In psychological research utilizing the perception-centered approach to personality, there are theoretical reasons to predict that "perceptual defense" may manifest itself in either an extremely rapid "vigilant" perceptual response or an extremely slow "repressive" perceptual response.

The Moses test is specifically designed for use with data (measured in at least an ordinal scale) collected to test such hypotheses. It should be used when it is expected that the experimental condition will affect some subjects in one way and others in the opposite way. In studies of perceptual defense, for example, we expect the control subjects to evince "medium" or "normal" responses, while we expect the experimental
subjects to give either "vigilant" or "repressive" responses, thus getting either high or low scores in comparison to those of the controls.

In such studies, statistical tests addressed to differences in central tendency will shield rather than reveal group differences. They lead to acceptance of the null hypothesis when it should be rejected, because when some of the experimental subjects show "vigilant" responses and thus obtain very low latency scores while others show "repressive" responses and thus obtain very high latency scores, the average of the scores of the experimental group may be quite close to the average score of controls (all of whom may have obtained scores which are "medium").

Although the Moses test is specifically designed for the sort of data mentioned above, it is also applicable when the experimenter expects that one group will score low and the other group will score high. However, Moses (1952b) points out that in such cases a test based on medians or on mean ranks, e.g., the Mann-Whitney U test, is more efficient and is therefore to be preferred to the Moses test. The latter test is uniquely valuable when there exist a priori grounds for believing that the experimental condition will lead to extreme scores in either direction.
The Moses test focuses on the span or spread of the control cases. That is, if there are nC control cases and nE experimental cases, and the nE + nC scores are arranged in order of increasing size, and if the null hypothesis (that the E's and C's come from the same population) is true, then we should expect that the E's and C's will be well mixed in the ordered series. We should expect under H0 that some of the extremely high scores will be E's and some C's, that some of the extremely low scores will be E's and some C's, and that the middle range of scores would include a mixture of E's and C's. However, if the alternative hypothesis (that the E scores represent defensive responses) is true, then we would expect that (a) most of the E scores will be low, i.e., "vigilant," or (b) most of the E scores will be high, i.e., "repressive," or (c) a considerable proportion of the E's will score low and another considerable proportion will score high, i.e., some E responses will be "vigilant" while others are "repressive." In any of these three cases, the scores of the C's will be unduly congested and consequently their span will be relatively small. If situation (a) holds, then the C's will be congested at the high end of the series, if (b) holds the C's will be congested at the low end of the series, and if (c) holds the C's will be congested in the middle of the ordered series. The Moses test determines whether the C scores are so closely compacted or congested relative to the nE + nC scores as to call for rejecting the null hypothesis that both E's and C's come from the same population.
Method
To compute the Moses test, combine the scores from the E and C groups, and arrange these scores in a single ordered series, retaining the group identity of each score.

Then determine the span of the C scores by noting the lowest and the highest C scores and counting the number of cases between them, including both extremes. That is, the span, symbolized as s', is defined as the smallest number of consecutive scores in an ordered series necessary to include all the C scores. For ease of computation, we may rank each score and determine s' from the ordered series of the ranks assigned to the nE + nC cases.
For example, suppose scores are obtained for nC = 6 and nE = 7. When these 13 cases are ranked together, we have this series:

Rank:  1  2  3  4  5  6  7  8  9  10  11  12  13
Group: E  E  C  E  C  E  C  C  C  E   C   E   E
The span of the C scores in this case extends over 9 ranks (from 3 to 11 inclusive) and thus s' = 9. Notice that in general s' is equal to the difference between the extreme C ranks plus 1. In the present case, s' = 11 - 3 + 1 = 9.

The Moses test determines whether the observed value of s' is too small a value to be thought to have reasonably arisen by chance if the E's and C's are from the same population. That is, the sampling distribution of s' under the null hypothesis is known (Moses, 1952b) and may be used for tests of significance.
The reader will have observed that s' is essentially the range of the C scores, and he may object that the well-known instability of the range makes s' an unreliable index to the actual spread or compactness of the C scores. Moses points out that it is usually necessary to modify s' in order to take care of just this problem. The modification is especially important when nC is large, because especially in this case is the range (span) of the C's an inefficient index to the spread of the group, due to possible sampling fluctuations.

The modification suggested by Moses is that the researcher, in advance of collecting his data, arbitrarily select some small number, h. After the data are collected, he may subtract h control scores from both extremes of the range of control scores. The span is found for those scores which remain. That is, the span is found after h control scores have been dropped from each extreme of the series.

For example, in the data given earlier, the experimenter might have
decided in advance that h = 1. Then he would have dropped ranks 3 and 11 from the C scores before determining the span. In that case, the "truncated span," symbolized as s_h, would be s_h = 9 - 5 + 1 = 5. This is written: s_h = 5, h = 1. Thus s_h is defined as the smallest number of consecutive ranks necessary to include all the control scores except the h least and the h greatest of them. Notice that s_h can never be smaller than nC - 2h and can never be larger than nC + nE - 2h. The sampling distribution, then, should tell us the probability under H0 of observing an s_h which exceeds the minimum value (nC - 2h) by any specified amount.
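The span s' and the truncated span s_h can be read off directly from the ordered series of group labels. The following sketch (function name and example series are ours; the series is chosen to be consistent with the worked example in the text, where s' = 9 and s_h = 5):

```python
def span(labels, control="C", h=0):
    """Smallest number of consecutive ranks covering the control cases.

    With h = 0 this is the raw span s'; with h > 0 it is the truncated
    span s_h, found after dropping the h most extreme control ranks at
    each end of the series.
    """
    positions = [i for i, g in enumerate(labels, start=1) if g == control]
    kept = positions[h:len(positions) - h] if h > 0 else positions
    return kept[-1] - kept[0] + 1

# A series consistent with the worked example in the text
# (nC = 6, nE = 7; the C's occupy ranks 3, 5, 7, 8, 9, and 11):
series = list("EECECECCCECEE")
```

Calling `span(series)` gives the raw span of 9, and `span(series, h=1)` gives the truncated span of 5 after ranks 3 and 11 are dropped.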
If we use g to represent the amount by which an observed value of s_h exceeds nC - 2h, we may determine the probability under H0 of observing a particular value of s_h or less as

$$p\left(s_h \le n_C - 2h + g\right) = \frac{\displaystyle\sum_{i=0}^{g} \binom{i + n_C - 2h - 2}{i}\binom{n_E + 2h + 1 - i}{n_E - i}}{\displaystyle\binom{n_C + n_E}{n_C}} \qquad (6.15)$$
Thus for any observed values of nC and nE and a given previously set value of h, one first finds the minimum possible truncated span: nC - 2h. Then one finds the value of g = the amount that the observed s_h exceeds the value of (nC - 2h). The probability of the occurrence of the observed value of s_h or less under H0 is found by cumulating the terms in the numerator of formula (6.15). If g = 1, then one must sum the numerator terms for i = 0 and i = 1. If g = 2, then one must sum three numerator terms: for i = 0, i = 1, and i = 2. The computations called for by formula (6.15) are illustrated in the following example of the use of the Moses test.
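The cumulation called for by formula (6.15) is easy to carry out mechanically. A minimal sketch in Python (the function name is ours); with nC = nE = 9, h = 1, and an observed truncated span of 9, it reproduces the p = .077 of the example below:

```python
from math import comb

def moses_p(s_h, n_c, n_e, h):
    """P(truncated span <= s_h) under H0, by cumulating formula (6.15)."""
    g = s_h - (n_c - 2 * h)      # excess over the minimum possible span
    total = comb(n_c + n_e, n_c)
    num = sum(comb(i + n_c - 2 * h - 2, i) * comb(n_e + 2 * h + 1 - i, n_e - i)
              for i in range(g + 1))
    return num / total
```

Each pass through the sum adds one numerator term, so g = 2 sums the three terms for i = 0, 1, and 2, exactly as described above.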
Example

In a pilot study of the perception of interpersonal hostility in film dramas, the experimenter* compared the amount of hostility perceived by two groups of female subjects. The E group were women whose personality test data revealed that they had difficulty in handling their own aggressive impulses. The C group were women whose personality tests revealed that they had little or no disturbance in the area of aggression and hostility. Each of the 9 E subjects and the 9 C subjects was shown a filmed drama and asked to rate the amount of aggression and hostility shown by the characters in the drama.

* This example cites unpublished pilot study data made available to the author through the courtesy of the experimenter, Dr. Ellen Tessman.
The hypothesis was that the E subjects would either underattribute or overattribute hostility to the film characters. Underattribution is indicated by a low score, whereas overattribution is indicated by a high score. It was predicted that the C subjects' scores would be more moderate than those of the E subjects, i.e., that the C's would evince less distortion in their perception of interpersonal hostility.

i. Null Hypothesis. H0: women who have personal difficulty in handling aggressive impulses do not differ from women with relatively little disturbance in this area in the amount of hostility that they attribute to the film characters. H1: women who have personal difficulty in handling aggressive impulses are more extreme than others in their judgments of hostility in film characters; some underattribute and others overattribute.

ii. Statistical Test. Since defensive (extreme) reactions are being predicted, and since the study employs two independent groups, the Moses test is appropriate for an analysis of the research data. In advance of collecting the data, the researcher set h at 1.

iii. Significance Level. Let α = .05. nE = 9 and nC = 9.

iv. Sampling Distribution. The probability associated with the occurrence under H0 of any value as small as an observed s_h is given by formula (6.15).

v. Rejection Region. The region of rejection consists of all values of s_h which are so small that the probability associated with their occurrence under H0 is equal to or less than α = .05.
vi. Decision. The scores for attribution of aggression by the E and C subjects are given in Table 6.24, which also shows the rank of

TABLE 6.24. ATTRIBUTION OF AGGRESSION TO CHARACTERS IN FILM

* When ties occur between two members of the same group, the value of s_h is unaffected and thus the use of tied ranks is unnecessary. For a discussion of the problem of ties in the Moses test, see the section following this example.
each. When these ranks are ordered in a single series, we have the data shown in Table 6.25.

TABLE 6.25. DATA IN TABLE 6.24 CAST FOR MOSES TEST

Rank:  1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18
Group: E  C  E  C  C  C  C  C  C  E   E   C   E   E   C   E   E   E
Since h = 1, the most extreme rank at each end of the C range is dropped; these are ranks 2 and 15. Without these two ranks, the truncated span of the C scores is 9. That is, s_h = 9 with h = 1.
Now the minimum possible s_h would be nC - 2h = 9 - 2 = 7. Thus the amount by which the observed s_h exceeds the minimum possible is 9 - 7 = 2. Thus g = 2. To determine the probability of occurrence under H0 of s_h ≤ 9 when nC = 9, nE = 9, and g = 2, we substitute these values into formula (6.15):

$$p\left(s_h \le 9\right) = \frac{\displaystyle\sum_{i=0}^{2} \binom{i + 9 - 2 - 2}{i}\binom{9 + 2 + 1 - i}{9 - i}}{\displaystyle\binom{9 + 9}{9}} = \frac{(1)(220) + (6)(165) + (21)(120)}{48{,}620} = .077$$
* For any positive integers, say a and b, $\binom{a}{b} = 0$ if b > a. Table T of the Appendix gives numerical values for binomial coefficients $\binom{N}{X}$ for N ≤ 20.
Since p = .077 is larger than α = .05, the data do not permit us to reject H0 at our previously set level of significance. We conclude that, on the basis of these data, we cannot say at α = .05 that the E subjects differ significantly from the C subjects in their attribution of aggression to the film characters. The p is sufficiently small, however, to be considered "promising" in pilot study data such as these.
Ties. Tied observations between two or more members of the same group do not affect the value of s_h. When there are tied observations between members of the two different groups, however, there may be more than one value of s_h, depending on how the tie is broken. When this is the case, the researcher should break the ties in his data in all possible ways, find the probability under H0 associated with each such break [by using formula (6.15) for each], and take the average of these probabilities as the one to use in making his decision about H0. If the number of ties between groups is large, then the Moses test is inapplicable.

Summary of procedure. These are the steps in the use of the Moses test:

1. In advance of the collection of data, specify the value of h.
2. When the scores have been collected, rank them in a single series, retaining the group identity of each rank.
3. Determine the value of s_h, the span of the control ranks after the h most extreme C ranks at each end of the series have been dropped.
4. Determine the value of g, the amount by which the observed value of s_h exceeds nC - 2h.
5. Determine the probability associated with the observed data by computing the value of p as given by formula (6.15). If ties occur between groups, break them in all possible ways and find the p for each such break; the average of these p's is used as the p in the decision.
6. If p is equal to or smaller than α, reject H0.

Power
The power of the Moses test has not been reported. However, when the test is used for its special purpose (i.e., for testing the hypothesis that the members of one group will be extreme with respect to the members of another group), it is more efficient than tests that are sensitive only to shifts in location (central tendency) or in dispersion. Of course, as we have pointed out earlier, if the hypothesis under test deals specifically with central tendencies, then a test based on medians or mean ranks, e.g., the Mann-Whitney U test, will make more efficient use of the information in the data.
References
Further information on this test is contained in Moses (1952b).

THE RANDOMIZATION TEST FOR TWO INDEPENDENT SAMPLES
Function
The randomization test for two independent samples is a useful and powerful nonparametric technique for testing the significance of the difference between the means of two independent samples when n1 and n2 are small. The test employs the numerical values of the scores, and therefore requires at least interval measurement of the variable being studied. With the randomization test we can determine the exact probability under H0 associated with our observations, and can do so without assuming normal distributions or homogeneity of variance in the populations involved (which must be assumed if the parametric equivalent, the t test, is used).

Rationale and Method

Consider the case of two small independent samples, either drawn at random from two populations or arising from the random assignment of two treatments to the members of a group whose origins are arbitrary. Group A includes 4 subjects; n1 = 4. Group B includes 5 subjects; n2 = 5.
We observe the following scores:

Scores for group A: 0 11 12 20
Scores for group B: 16 19 22 24 29
With these scores,* we wish to test the null hypothesis of no difference between the means against the alternative hypothesis that the mean of the population from which group A was drawn is smaller than the mean of the population from which group B was drawn. Now under the null hypothesis, all n1 + n2 observations are from the same population. That is, it is merely a matter of chance that certain scores are labeled A and others are labeled B. The assignment of the labels A and B to the scores in the particular way observed may be conceived as one of many equally likely accidents if H0 is true. Under H0, the labels could have been assigned to the scores in any of 126 equally likely ways:

$$\binom{n_1 + n_2}{n_1} = \binom{4 + 5}{4} = 126$$

* This example is taken from Pitman, E. J. G. 1937a. Significance tests which may be applied to samples from any populations. Supplement to J. Royal Statist. Soc., 4, 122.
( + .) (4+a) Under Hp, only oncein 126 trials would it happen that the four smallest
scoresof the nine wouldall acquirethe label A, while the flve largest acquired the label B.
Now if just sucha result shouldoccurin an actual single-trialexperiment, we could reject Hp at the p = ~ = .008 level of significance, applying the reasoningthat if the two groupswerereally from a common population,i.e., if Hp werereally true, there is no goodreasonto think that the most extremeof 126possibleoutcomesshouldoccuron just the trial that constitutes our experiment. That is, we would decide that
there is little likelihood that the observedevent could occurunder Hp, and therefore we would reject, Hp when the event did occur. This is part of the familiar logic of statistical inference.
The randomization test specifies a number of the most extreme possible outcomes which could occur with n1 + n2 scores, and designates these as the region of rejection. When we have $\binom{n_1+n_2}{n_1}$ equally likely occurrences under H0, for some of these the difference between ΣA (the sum of group A's scores) and ΣB (the sum of group B's scores) will be extreme. The cases for which these differences are largest constitute the region of rejection.

If α is the significance level, then the region of rejection consists of the $\alpha\binom{n_1+n_2}{n_1}$ most extreme of the possible occurrences. That is, the number of possible outcomes constituting the region of rejection is $\alpha\binom{n_1+n_2}{n_1}$. The particular outcomes chosen to constitute that number are those outcomes for which the difference between the mean of the A's and the mean of the B's is largest. These are the occurrences in which the difference between ΣA and ΣB is greatest. Now if the sample we obtain is among those cases listed in the region of rejection, we reject H0 at significance level α.
In the example of 9 scores given above, there are $\binom{n_1+n_2}{n_1} = 126$ possible differences between ΣA and ΣB. If α = .05, then the region of rejection consists of $\alpha\binom{n_1+n_2}{n_1} = .05(126) = 6.3$ extreme outcomes. Since the alternative hypothesis is directional, the region of rejection consists of the 6 most extreme possible outcomes in the specified direction. Under the alternative hypothesis that μA < μB, the 6 most extreme
possible outcomes constituting the region of rejection at α = .05 (one-tailed test) are those given in Table 6.26.

TABLE 6.26. THE SIX MOST EXTREME POSSIBLE OUTCOMES IN THE PREDICTED DIRECTION
(These constitute the region of rejection for the randomization test when α = .05)
* The sample obtained.

The third of these possible
extreme outcomes, the one with an asterisk, is the sample we obtained. Since our observed scores are in the region of rejection, we may reject H0 at α = .05. The exact probability (one-tailed) of the occurrence of the observed scores or a set more extreme under H0 is p = 3/126 = .024.
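With samples this small, the full enumeration can be carried out directly. A sketch in Python (function name ours; group A's last score, partly illegible in this reproduction, is taken here as 20, which is consistent with the stated p values):

```python
from itertools import combinations

def randomization_p(a, b):
    """One-tailed randomization test: the proportion of all equally
    likely relabelings whose 'group A' sum is as small as the one
    observed."""
    pooled = a + b
    observed = sum(a)
    sums = [sum(c) for c in combinations(pooled, len(a))]
    return sum(1 for s in sums if s <= observed) / len(sums)

group_a = [0, 11, 12, 20]   # last score is our reading of the text
group_b = [16, 19, 22, 24, 29]
```

Here `randomization_p(group_a, group_b)` counts 3 of the 126 labelings at least as extreme as the observed one, giving p = 3/126 = .024.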
Now if the alternative hypothesis had not predicted the direction of the difference, then of course a two-tailed test of H0 would have been in order. In that case, the 6 sets of possible outcomes in the region of rejection would consist of the 3 most extreme possible outcomes in one direction and the 3 most extreme possible outcomes in the other direction. It would include the 6 possible outcomes whose difference between ΣA and ΣB was greatest in absolute value. For illustrative purposes, the 6 most extreme possible outcomes for a two-tailed test at α = .05 for the 9 scores presented earlier are given in Table 6.27. With our observed scores H0 would have been rejected in favor of the alternative hypothesis that μA ≠ μB, because the obtained sample (shown with asterisk in Table 6.27) is one of the 6 most extreme of the possible outcomes in either direction. The exact probability (two-tailed) associated with the occurrence under H0 of a set as extreme as the one observed is p = 6/126 = .048.

Large samples. When n1 and n2 are large, the computations necessary for the randomization test may be extremely tedious. However, they may be avoided, for Pitman has shown (1937a) that for n1 and n2 large, if the kurtosis of the combined samples is small and if the ratio of n1 to n2 lies between 1/5 and 5, that is, if the larger sample is not more than five
TABLE 6.27. THE SIX MOST EXTREME POSSIBLE OUTCOMES IN EITHER DIRECTION
(These constitute the two-tailed region of rejection for the randomization test when α = .05)
* The sample obtained.
times larger than the smaller sample, then the randomization distribution of the $\binom{n_1+n_2}{n_1}$ possible outcomes is closely approximated by the t distribution. That is, if the above-mentioned two conditions (small kurtosis and $\tfrac{1}{5} \le n_1/n_2 \le 5$) are satisfied, then

$$t = \frac{\bar{A} - \bar{B}}{\sqrt{\dfrac{\sum (A - \bar{A})^2 + \sum (B - \bar{B})^2}{n_1 + n_2 - 2}\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}} \qquad (6.16)$$

has approximately the Student t distribution with df = n1 + n2 - 2. Therefore the probability associated with the occurrence under H0 of any value as extreme as an observed t may be determined by reference to Table B of the Appendix.
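Formula (6.16) needs no statistical library to compute; a minimal sketch (the function name is ours):

```python
from math import sqrt

def t_formula_616(a, b):
    """The t of formula (6.16), referred to n1 + n2 - 2 df when the
    kurtosis and sample-size-ratio conditions are met."""
    n1, n2 = len(a), len(b)
    mean_a, mean_b = sum(a) / n1, sum(b) / n2
    ss = (sum((x - mean_a) ** 2 for x in a)
          + sum((x - mean_b) ** 2 for x in b))
    return (mean_a - mean_b) / sqrt(ss / (n1 + n2 - 2) * (1 / n1 + 1 / n2))
```

The resulting value would then be looked up in a table of the t distribution (Table B of the Appendix) with n1 + n2 - 2 degrees of freedom.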
* The barred symbols, for example, B̄, stand for means.

The reader should note that even though formula (6.16) is the ordinary t test, the test is not used in this case as a parametric statistical test, for the assumption that the populations are normally distributed with common variance is not necessary. However, its use requires not only that the two conditions mentioned above be met, but also that the scores represent measurement in at least an interval scale.

When n1 and n2 are large, another alternative to the randomization test is the Mann-Whitney U test, which may be regarded as a randomization test applied to the ranks of the observations and which thus constitutes a good approximation to the randomization test. It can be shown (Whitney, 1948) that there are situations under which the Mann-
Whitney U test is more powerful than the t test and thus is the better alternative.
Summary of procedure. These are the steps in the use of the randomization test for two independent samples:

1. Determine the number of possible outcomes in the region of rejection: $\alpha\binom{n_1+n_2}{n_1}$.
2. Specify as belonging to the region of rejection that number of the most extreme possible outcomes. The extremes are those which have the largest difference between ΣA and ΣB. For a one-tailed test, all of these are in the predicted direction. For a two-tailed test, half of the number are the most extreme possible outcomes in one direction and half are the most extreme possible outcomes in the other direction.
3. If the observed scores are one of the outcomes listed in the region of rejection, reject H0 at the α level of significance.

For samples which are so large that the enumeration of the possible outcomes in the region of rejection is too tedious, formula (6.16) may be used as an approximation if the conditions for its use are met by the data. An alternative, which need not meet such conditions and thus may be more satisfactory, is the Mann-Whitney U test.

Power-Efficiency

Because it uses all the information in the samples, the randomization test for two independent samples has power-efficiency, in the sense defined, of 100 per cent.

References
The reader may find discussions of the randomization test for two independent samples in Moses (1952a), Pitman (1937a; 1937b; 1937c), Scheffé (1943), Smith (1953), and Welch (1937).

DISCUSSION
In this chapter we have presented eight statistical tests which are useful in testing for the "significance of the difference" between two independent samples. In his choice among these tests, the researcher may be aided by the discussion which follows, in which any unique advantages of the tests are pointed out and the contrasts among them are noted.

All the nonparametric tests for two independent samples test whether it is likely that the two independent samples came from the same population. But the various tests we have presented are more or less sensitive
to different kinds of differences between samples. For example, if one wishes to test whether two samples represent populations which differ in location (central tendency), these are the tests which are most sensitive to such a difference and therefore should be chosen: the median test (or the Fisher test when N is small), the Mann-Whitney U test, the Kolmogorov-Smirnov two-sample test (for one-tailed tests), and the randomization test. On the other hand, if the researcher is interested in determining whether his two samples are from populations which differ in any respect at all, i.e., in location or dispersion or skewness, etc., he should choose one of these tests: the χ² test, the Kolmogorov-Smirnov test (two-tailed), or the Wald-Wolfowitz runs test. The remaining technique, the Moses test, is uniquely suitable for testing whether an experimental group is exhibiting extreme or defensive reactions in comparison to the reactions exhibited by an independent control group.

The choice among the tests which are sensitive to differences in location is determined by the kind of measurement achieved in the research and by the size of the samples. The most powerful test of location is the
randomization test. However, this test can be used only when the sample sizes are small and when we have some confidence in the numerical nature of the measurement obtained. With larger samples or weaker measurement (ordinal measurement), the suggested alternative is the Mann-Whitney U test, which is almost as powerful as the randomization test. If the samples are very small, the Kolmogorov-Smirnov test is slightly more efficient than the U test. If the measurement is such that it is meaningful only to dichotomize the observations as above or below the combined median, then the median test is applicable. This test is not as powerful as the Mann-Whitney U test in guarding against differences in location, but it is more appropriate than the U test when the data are observations which cannot be completely ranked. If the combined sample sizes are very small, when applying the median test the researcher should make the analysis by the Fisher test.
The choice among the tests which are sensitive to all kinds of differences (the second group listed above) is predicated on the strength of the measurement obtained, the size of the two samples, and the relative power of the available tests. The χ² test is suitable for data which are in nominal or stronger scales. When the N is small and the data are in a 2 × 2 contingency table, the Fisher test should be used rather than χ². In many cases the χ² test may not make efficient use of all the information in the data. If the populations of scores are continuously distributed, we may choose either the Kolmogorov-Smirnov test (two-tailed) or the Wald-Wolfowitz runs test in preference to the χ² test. Of all tests for any kind of difference, the Kolmogorov-Smirnov test is the most powerful. If it is used with data which do not meet the assumption of
continuity, it is still suitable but it operates more conservatively (Goodman, 1954), i.e., the obtained value of p in such cases will be slightly higher than it should be, and thus the probability of a Type II error will be slightly increased. If H0 is rejected with such data, we can surely have confidence in the decision. The runs test also guards against all kinds of differences, but it is not as powerful as the Kolmogorov-Smirnov test.
Two points should be emphasized about the use of the second group of tests. First, if one is interested in testing the alternative hypothesis that the groups differ in central tendency, e.g., that one population has a larger median than the other, then one should use a test specifically designed to guard against differences in location, one of the tests in the first group listed above. Second, when one rejects H0 on the basis of a test which guards against any kind of difference (one of the tests in the second group), one can then assert that the two groups are from different populations, but one cannot say in what specific way(s) the populations differ.
CHAPTER 7

THE CASE OF k RELATED SAMPLES
In previous chapters we have presented statistical tests for (a) testing for significant differences between a single sample and some specified population, and (b) testing for significant differences between two samples, either related or independent. In this and the following chapters, procedures will be presented for testing for the significance of differences among three or more groups. That is, statistical tests will be presented for testing the null hypothesis that k (3 or more) samples have been drawn from the same population or from identical populations. This chapter will present tests for the case of k related samples; the following chapter will present tests for the case of k independent samples.

Circumstances sometimes require that we design an experiment so that more than two samples or conditions can be studied simultaneously. When three or more samples or conditions are to be compared in an experiment, it is necessary to use a statistical test which will indicate whether there is an over-all difference among the k samples or conditions before one picks out any pair of samples in order to test the significance of the difference between them.
If we wished to use a two-sample statistical test to test for differences among, say, 5 groups, we would need to compute, in order to compare each pair of samples, 10 statistical tests. (Five things taken 2 at a time = $\binom{5}{2}$ = 10.) Such a procedure is not only tedious, but it may lead to fallacious conclusions as well because it capitalizes on chance. That is, suppose we wish to use a significance level of, say, α = .05. Our hypothesis is that there is a difference among k = 5 samples. If we test that hypothesis by comparing each of the 5 samples with every other sample, using a two-sample test (which would require 10 comparisons in all), we are giving ourselves 10 chances rather than 1 chance to reject H0. Now when we set .05 as our level of significance, we are taking the risk of rejecting H0 erroneously (making the Type I error) 5 per cent of the time. But if we make 10 statistical tests of the same hypothesis, we increase the probability of the Type I error. It can be shown that, for 5 samples, the probability that a two-sample statistical test will find
THE CASE OF k RELATED SAMPLES
one or more "significant" differences,when a = .05, is p = .40. That is, the actualsignificance levelin sucha procedure becomes n = .40. Caseshave been reported in the researchliterature (McNemar, 1955,
p. 234)in whichanover-alltestof fivesamples yieldsinsignificant results
(leadsto theacceptance of Hp)but two-sample testsof thelargerdifferencesamongthe fivesamples yieldsignificantfindings. Sucha posteriori selectiontends to capitalizeon chance,and thereforewe can have no confidencein a decisioninvolving k samplesin which the analysisconsistedonly of testing two samplesat a time.
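The inflation described above can be checked with a short calculation (an illustration, not part of the original text; independence of the ten pairwise tests is assumed, which makes the figure approximate):

```python
from math import comb

# Number of pairwise two-sample tests needed to compare k = 5 samples.
k = 5
n_tests = comb(k, 2)  # "five things taken 2 at a time" = 10

# If the 10 tests were independent, each run at alpha = .05, the chance
# of at least one spurious "significant" difference would be
alpha = 0.05
p_at_least_one = 1 - (1 - alpha) ** n_tests  # 1 - 0.95**10, about 0.40
```

The result, about .40, is in line with the p = .40 cited in the text; because the ten comparisons share samples they are not truly independent, so this is only a rough account of the inflation.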
It is only when an over-all test (a k-sample test) allows us to reject the null hypothesis that we are justified in employing a procedure for testing for significant differences between any two of the k samples. (For such a procedure, see Cochran, 1954; and Tukey, 1949.)
The parametric technique for testing whether several samples have come from identical populations is the analysis of variance or F test. The assumptions associated with the statistical model underlying the F test are these: that the scores or observations are independently drawn from normally distributed populations; that the populations all have the same variance; and that the means in the normally distributed populations are linear combinations of "effects" due to rows and columns, i.e., that the effects are additive. In addition, the F test requires at least interval measurement of the variables involved.
If a researcher finds such assumptions unrealistic for his data, if he
finds that his scoresdo not meetthe measurement requirement,or if he
wishesto avoidmakingthe assumptions in orderto increase the gener-
ality of hisfindings, hemayuseoneof thenonparametric statisticaltests
presented in thisandthefollowing chapter.In addition to avoiding the
assumptions andrequirements mentioned, thesenonparametric k-samp]e tests have the further advantageof enablingdata which are inherently
onlyclassificatory or in ranksto beexamined for significance. There are two basic designsfor comparingk groups. In the first
design, thek samples of equalsizearematched according to somecriterion or criteria which may affect the values of the observations. In some
cases, the matchingis achieved by comparing the sameindividualsor casesunder all k conditions. Or each of N groups may be measured under all k conditions. For such designs, the statistical tests for.k
relatedsamples(presented in this chapter)shouldbeused. The second
designinvolvesk independent randomsamples, not necessarily of the samesize, one samplefrom each population. For that design,the statistical testsfor k independentsamples(presentedin Chap.8) should be employed. The above distinction is, of course,exactly that made in the parametric case. The first design is known as the two-way analysis of variance,
sometimes called "the randomized blocks design."¹ The second design is known as the one-way analysis of variance. The distinction is similar to that we made between the case of two related samples (discussed in Chap. 5) and the case of two independent samples (discussed in Chap. 6).
This chapter will present nonparametric statistical tests which parallel the two-way analysis of variance. We will present a test suitable for use with data measured in a nominal scale and another suitable for use with data measured in at least an ordinal scale. At the conclusion of this chapter we shall compare and contrast these tests for k related samples, offering further guidance to the researcher in his selection of the test suitable for his data.

THE COCHRAN Q TEST
Function
The McNemar test for two related samples, presented in Chap. 5, can be extended for use in research having more than two samples. This extension, the Cochran Q test for k related samples, provides a method for testing whether three or more matched sets of frequencies or proportions differ significantly among themselves. The matching may be based on relevant characteristics of the different subjects, or on the fact that the same subjects are used under different conditions. The Cochran test is particularly suitable when the data are in a nominal scale or are dichotomized ordinal information.
One may imagine a wide variety of research hypotheses for which the data might be analyzed by the Cochran test. For example, one might test whether the various items on a test differ in difficulty by analyzing data consisting of pass-fail information on k items for N individuals. In this design, the k groups are considered "matched" because each person answers all k items.
On the other hand, we might have only one item to be analyzed, and wish to compare the responses of N subjects under k different conditions. Here again the "matching" is achieved by having the same subjects in every group, but now the groups differ in that each is under a different condition, which may have its own effect on the subjects' responses to the item. For example, one might ask each member of a panel of voters which of two candidates they prefer at k = 5 times during the election season: before the campaign, at the peak of Smith's campaign, at the peak of Miller's campaign, immediately before the balloting, and immediately after the results are announced. The Cochran test would determine whether these conditions have a significant effect on the voters' preferences between the candidates.

¹ The term "randomized blocks" derives from agricultural experimentation, in which plots of land may be used as experimental units. A "block" consists of adjacent plots of land, and it is assumed that plots of land adjacent to each other are more alike (i.e., are better matched) than are plots remote from each other. The k treatments, for example, k varieties of fertilizer, or k varieties of seed, are assigned at random one to each of the k plots in a block; this is done with independent random assignment in each block.
Again, we might compare the responses to one item from N sets having k matched persons in each set. That is, we would have responses from k matched groups.

Method
If the data from researches like those exemplified above are arranged in a two-way table consisting of N rows and k columns, it is possible to test the null hypothesis that the proportion (or frequency) of responses of a particular kind is the same in each column, except for chance differences. Cochran (1950) has shown that if the null hypothesis is true, i.e., if there is no difference in the probability of, say, "success" under each condition (which is to say that the "successes" and "failures" are randomly distributed in the rows and columns of the two-way table), then, if the number of rows is not too small,
    Q = \frac{k(k-1)\sum_{j=1}^{k}(G_j - \bar{G})^2}{k\sum_{i=1}^{N} L_i - \sum_{i=1}^{N} L_i^2}        (7.1)

is distributed approximately as chi square with df = k - 1, where
    G_j = total number of "successes" in jth column
    \bar{G} = mean of the G_j
    L_i = total number of "successes" in ith row
A formula which is equivalent to and easily derivable from (7.1) but which simplifies computation is

    Q = \frac{(k-1)\left[k\sum_{j=1}^{k} G_j^2 - \left(\sum_{j=1}^{k} G_j\right)^2\right]}{k\sum_{i=1}^{N} L_i - \sum_{i=1}^{N} L_i^2}        (7.2)
Inasmuch as the sampling distribution of Q is approximated by the chi-square distribution with df = k - 1, the probability associated with the occurrence under H0 of values as large as an observed Q may be determined by reference to Table C of the Appendix. If the observed value of Q, as computed from formula (7.2), is equal to or greater than that
shown in Table C for a particular significance level and a particular value of df = k - 1, the implication is that the proportion (or frequency) of "successes" differs significantly among the various samples. That is, H0 may be rejected at that particular level of significance.

Example
Suppose we were interested in the influence of interviewer friendliness upon housewives' responses in an opinion survey. We might train an interviewer to conduct three kinds of interviews: interview 1, showing interest, friendliness, and enthusiasm; interview 2, showing formality, reserve, and courtesy; and interview 3, showing disinterest, abruptness, and harsh formality. The interviewer would be assigned to visit 3 groups of 18 houses, and told to use interview 1 with one group, interview 2 with another, and interview 3 with the third. That is, we would have obtained 18 sets of housewives, with 3 matched housewives (equated on relevant variables) in each set. For each set, the 3 members would randomly be assigned to the 3 conditions (types of interviews). Thus we would have 3 matched samples (k = 3) with 18 members in each (N = 18). We could then test whether the gross differences between the three styles of interviews influenced the number of "yes" responses given to a particular item by the three matched groups. Using artificial data, a test of this hypothesis follows.
i. Null Hypothesis. H0: the probability of a "yes" response is the same for all three types of interviews. H1: the probabilities of "yes" responses differ according to the style of the interview.
ii. Statistical Test. The Cochran Q test is chosen because the data are for more than two related groups (k = 3), and are dichotomized as "yes" and "no."
iii. Significance Level. Let α = .01. N = 18 = the number of cases in each of the k matched sets.
iv. Sampling Distribution. Under the null hypothesis, Q [as yielded by formula (7.1) or (7.2)] is distributed approximately as chi square with df = k - 1. That is, the probability associated with the occurrence under H0 of any value as large as an observed value of Q may be determined by reference to Table C.
v. Rejection Region. The region of rejection consists of all values of Q which are so large that the probability associated with their occurrence under H0 is equal to or less than α = .01.
vi. Decision. In this artificial study, we will represent "yes" responses by 1's and "no" responses by 0's. The data of the study are given in Table 7.1. Also shown in that table are the values of L_i (the total number of yeses for each row) and the values of L_i². For example, in the first matched set all the housewives responded negatively, regardless of the interview style. Thus L_1 = 0 + 0 + 0 = 0, and thus L_1² = 0² = 0. In the second matched set of 3 housewives, the responses to interviews 1 and 2 were affirmative but the response to interview 3 was negative, so that L_2 = 1 + 1 + 0 = 2, and thus L_2² = 2² = 4.
As is the practice, the scores have been arranged in k = 3 columns and N = 18 rows.

TABLE 7.1. YES (1) AND NO (0) RESPONSES BY HOUSEWIVES UNDER THREE TYPES OF INTERVIEWS
(Artificial data)
We observe that G_1 = 13 = the total number of yeses in response to interview 1; G_2 = 13 = the total number of yeses in response to interview 2; and G_3 = 3 = the total number of yeses in response to interview 3. The total number of yeses in all three interviews is ΣG_j = 13 + 13 + 3 = 29. Observe that ΣL_i = 29 also (the sum of the column of row totals). The sum of the squares of the row totals is ΣL_i² = 63, the sum of the final column.
By entering these values in formula (7.2) we have

    Q = \frac{(3-1)\{3[(13)^2 + (13)^2 + (3)^2] - (29)^2\}}{(3)(29) - 63} = 16.7

Reference to Table C reveals that Q >= 16.7 has probability of occurrence under H0 of p < .001 when df = k - 1 = 3 - 1 = 2. This probability is smaller than the significance level, α = .01. Thus the value of Q is in the region of rejection, and therefore our decision is to reject H0 in favor of H1. On the basis of these artificial data, we conclude that the probabilities of "yes" responses are different under interviews 1, 2, and 3.
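This computation can be sketched in Python (an illustration, not part of the original text):

```python
import math

def cochran_q(table):
    """Cochran Q for an N x k table of 0/1 scores, via formula (7.2).

    table: N rows (matched sets), each a list of k zeros and ones.
    The result is referred to the chi-square distribution with
    df = k - 1 (Table C in the text).
    """
    k = len(table[0])
    G = [sum(row[j] for row in table) for j in range(k)]  # column totals
    L = [sum(row) for row in table]                       # row totals
    num = (k - 1) * (k * sum(g * g for g in G) - sum(G) ** 2)
    den = k * sum(L) - sum(l * l for l in L)
    return num / den

# Formula (7.2) applied to the totals reported for Table 7.1:
# G = (13, 13, 3), sum of L_i = 29, sum of L_i^2 = 63, k = 3.
G, sum_L, sum_L2, k = [13, 13, 3], 29, 63, 3
Q = (k - 1) * (k * sum(g * g for g in G) - sum(G) ** 2) / (k * sum_L - sum_L2)
# Q = 400/24, about 16.7, matching the worked example; with df = 2 the
# chi-square survival function is exp(-Q/2), so p is about .0002 (p < .001).
p = math.exp(-Q / 2)
```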
It should be noted that Q is distributed as chi square with df = k - 1 only approximately, and the approximation is adequate only when the number of rows is not too small.
Summary of procedure. These are the steps in the use of the Cochran Q test:
1. For dichotomous data, assign a score of 1 to each "success" and a score of 0 to each "failure."
2. Cast the scores in a k × N table, using k columns and N rows (N = the number of cases in each of the k groups).
3. Determine the value of Q by substituting the observed values in formula (7.2).
4. The significance of the observed value of Q may be determined by reference to Table C, for Q is distributed approximately as chi square with df = k - 1. If the probability associated with the occurrence under H0 of a value as large as the observed value of Q is equal to or less than α, reject H0.
Power and Power-Efficiency
The power of the Cochran test is not known exactly. The notion of power-efficiency is meaningless when the Cochran test is used for nominal or naturally dichotomous data, for parametric tests are not applicable to such data. When the Cochran test is used with data that are not nominal or naturally dichotomous, it may be wasteful of information.

References
The reader may find discussions of the Cochran test in Cochran (1950) and McNemar (1955, pp. 232-233).

THE FRIEDMAN TWO-WAY ANALYSIS OF VARIANCE BY RANKS
Function
When the data from k matched samples are in at least an ordinal scale, the Friedman two-way analysis of variance by ranks is useful for testing the null hypothesis that the k samples have been drawn from the same population. Since the k samples are matched, the number of cases is the same in each of the samples. The matching may be achieved by studying the same group of subjects under each of k conditions. Or the researcher may obtain several sets, each consisting of k matched subjects, and then randomly assign one subject in each set to the first condition, one subject in each set to the second condition, etc. For example, if one wished to study the differences in learning achieved under four teaching methods, one might obtain N sets of k = 4 pupils, each set consisting of children who are matched on the relevant variables (age, previous learning, intelligence, socioeconomic status, motivation, etc.), and then at random assign one child from each of the N sets to teaching method A, another from each set to B, another from each set to C, and the fourth to D.

Rationale and Method
For the Friedman test, the data are cast in a two-way table having N rows and k columns. The rows represent the various subjects or matched sets of subjects, and the columns represent the various conditions. If the scores of subjects serving under all conditions are under study, then each row gives the scores of one subject under the k conditions.
The data of the test are ranks. The scores in each row are ranked separately. That is, with k conditions being studied, the ranks in any row range from 1 to k. The Friedman test determines whether it is likely that the different columns of ranks (samples) came from the same population.
For example, suppose we wish to study the scores of 3 groups under 4 conditions. Here k = 4 and N = 3. Each group contains 4 matched
subjects, one being assigned to each of the 4 conditions. Suppose our scores for this study were those given in Table 7.2.

TABLE 7.2. SCORES OF THREE MATCHED GROUPS UNDER FOUR CONDITIONS

To perform the Friedman test on these data, we first rank the scores in each row. We may give the lowest score in each row the rank of 1, the next lowest score in each row the rank of 2, etc.¹ By doing this we obtain the data shown in Table 7.3. Observe that the ranks in each row of Table 7.3 range from 1 to k = 4.

TABLE 7.3. RANKS OF THREE MATCHED GROUPS UNDER FOUR CONDITIONS
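Ranking within rows can be sketched as follows (an illustration with made-up scores, since Table 7.2's entries are not reproduced here; ties are ignored in this sketch, though the test itself handles them by averaging tied ranks):

```python
def rank_rows(scores):
    """Rank each row separately, 1 for the lowest score up to k for the
    highest. Ties are not handled here; the text's convention assigns
    tied observations the average of the tied ranks."""
    ranked = []
    for row in scores:
        order = sorted(range(len(row)), key=lambda j: row[j])
        ranks = [0] * len(row)
        for r, j in enumerate(order, start=1):
            ranks[j] = r
        ranked.append(ranks)
    return ranked

# Hypothetical scores for N = 3 groups under k = 4 conditions:
scores = [[9, 4, 1, 7],
          [6, 5, 2, 8],
          [9, 1, 2, 6]]
ranks = rank_rows(scores)  # each row now holds the ranks 1..4
```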
Now if the null hypothesis (that all the samples, i.e., columns, came from the same population) is in fact true, then the distribution of ranks in each column would be a matter of chance, and thus we would expect the ranks of 1, 2, 3, and 4 to appear in all columns with about equal frequency. This would indicate that for any group it is a matter of chance under which condition the highest score occurs and under which condition the lowest occurs, which would be the case if the conditions really did not differ.
If the subjects' scores were independent of the conditions, the set of ranks in each column would represent a random sample from the discontinuous rectangular distribution of 1, 2, 3, and 4, and the rank totals for the various columns would be about equal. If the subjects' scores were dependent on the conditions (i.e., if H0 were false), then the rank totals would vary from one column to another. Inasmuch as the columns all contain an equal number of cases, an equivalent statement would be that under H0 the mean ranks of the various columns would be about equal. The Friedman test determines whether the rank totals (R_j) differ significantly. To make this test, we compute the value of a statistic which Friedman denotes as χr².

¹ It is immaterial whether the ranking is from lowest to highest scores or from highest to lowest scores.
When the number of rows and/or columns is not too small, it can be shown (Friedman, 1937) that χr² is distributed approximately as chi square with df = k - 1, where

    \chi_r^2 = \frac{12}{Nk(k+1)} \sum_{j=1}^{k} (R_j)^2 - 3N(k+1)        (7.3)

where N = number of rows
      k = number of columns
      R_j = sum of ranks in jth column
      \sum_{j=1}^{k} directs one to sum the squares of the sums of ranks over all k conditions
Inasmuch as the sampling distribution of χr² is approximated by the chi-square distribution with df = k - 1, the probability associated with the occurrence under H0 of values as large as an observed χr² is shown in Table C of the Appendix. If the value of χr² as computed from formula (7.3) is equal to or larger than that given in Table C for a particular level of significance and a particular value of df = k - 1, the implication is that the sums of ranks (or, equivalently, the mean ranks, R_j/N) for the various columns differ significantly (which is to say that the size of the scores depends on the conditions under which the scores were obtained), and thus H0 may be rejected at that level of significance.
Notice that χr² is distributed approximately as chi square with df = k - 1 only when the number of rows and/or columns is not too small. When the number of rows or columns is less than minimal, exact probability tables are available, and these should be used rather than Table C. Table N of the Appendix gives exact probabilities associated with values as large as an observed χr² for k = 3, N = 2 to 9, and for k = 4, N = 2 to 4. When N and k are larger than the values included in Table N, χr² may be considered to be distributed as chi square, and thus Table C may be used for testing H0.
To illustrate the computation of χr² and the use of Table N, we may test for significance the data shown in Table 7.3. By referring to that
table, the reader may see that the various sums of ranks, R_j, were 11, 5, 4, and 10. The number of conditions = k = 4. The number of rows = N = 3. We may compute the value of χr² for the data in Table 7.3 by substituting these values in formula (7.3):

    \chi_r^2 = \frac{12}{(3)(4)(4+1)} [(11)^2 + (5)^2 + (4)^2 + (10)^2] - (3)(3)(4+1) = 7.4
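For N and k this small, the exact null distribution of χr² can be obtained by direct enumeration, which is essentially how tables of exact probabilities such as Table N are constructed. The sketch below (an illustration, not part of the original text) assumes that under H0 each row's ranks are an independent, equally likely permutation of 1 to k:

```python
from itertools import permutations, product

def exact_friedman_tail(k, N, observed):
    """Exact P(chi_r^2 >= observed) under H0, by enumerating all (k!)**N
    equally likely assignments of the ranks 1..k within each of N rows."""
    row_perms = list(permutations(range(1, k + 1)))
    hits = total = 0
    for rows in product(row_perms, repeat=N):
        R = [sum(col) for col in zip(*rows)]  # column rank sums
        chi = 12.0 / (N * k * (k + 1)) * sum(r * r for r in R) - 3 * N * (k + 1)
        total += 1
        if chi >= observed - 1e-9:
            hits += 1
    return hits / total

# For k = 4, N = 3 there are (4!)**3 = 13824 arrangements to enumerate;
# the tail probability of chi_r^2 >= 7.4 should agree with the exact
# table value quoted in the text.
p_exact = exact_friedman_tail(k=4, N=3, observed=7.4)
```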
We may determine the probability of occurrence under H0 of χr² >= 7.4 by turning to Table N, which gives the exact probability associated with values as large as an observed χr² for k = 4. Table N shows that the probability associated with χr² >= 7.4 when k = 4 and N = 3 is p = .033. With these data, therefore, we could reject the null hypothesis that the four samples were drawn from the same population with respect to location (mean ranks) at the .033 level of significance.

Example for N and k Large
In a study of the effect of three different patterns of reinforcement
upon extent of discrimination learning in rats,¹ three matched samples (k = 3) of 18 rats (N = 18) were trained under three patterns of reinforcement. Matching was achieved by the use of 18 sets of littermates, 3 in each set. Although all the 54 rats received the same quantity of reinforcement (reward), the patterning of the administration of reinforcement was different for each of the groups. One group was trained with 100 per cent reinforcement (RR), a matched group was trained with partial reinforcement in which each sequence of trials ended with an unreinforced trial (RU), and the third matched group was trained with partial reinforcement in which each sequence of trials ended with a reinforced trial (UR).
After this training, the extent of learning was measured by the speed at which the various rats learned an "opposing" habit: whereas they had been trained to run to white, the rats now were rewarded for running to black. The better the initial learning, the slower this transfer of learning should be. The prediction was that the different patterns of reinforcement used would result in differential learning as exhibited by ability to transfer.

¹ Grosslight, J. H., and Radlow, R. 1956. Studies in partial reinforcement: I. Patterning effect of the nonreinforcement-reinforcement sequence in a discrimination situation. J. Comp. Physiol. Psychol., in press.
i. Null Hypothesis. H0: the different patterns of reinforcement have no differential effect. H1: the different patterns of reinforcement have a differential effect.
ii. Statistical Test. Because number of errors in transfer of learning is probably not an interval measure of strength of original learning, the nonparametric two-way analysis of variance was chosen rather than the parametric. Moreover, the use of the parametric analysis of variance was also precluded because the scores exhibited possible lack of homogeneity of variance, and thus the data suggested that one of the basic assumptions of the F test was probably untenable.
iii. Significance Level. Let α = .05. N = 18 = the number of rats in each of the 3 matched groups.
iv. Sampling Distribution. As computed by formula (7.3), χr² is distributed approximately as chi square with df = k - 1 when N and/or k are large. Thus the probability associated with the occurrence under H0 of a value as large as the observed value of χr² may be determined by reference to Table C.
v. Rejection Region. The region of rejection consists of all values of χr² which are so large that the probability associated with their occurrence under H0 is equal to or less than α = .05.
vi. Decision. The number of errors committed by each rat in the transfer of learning situation was determined, and these scores were ranked for each of the 18 sets of 3 matched rats. These ranks are given in Table 7.4. Observe that the sum of ranks for the RR group is 39.5, the sum of ranks for the RU group is 42.5, and the sum of ranks for the UR group is 26.0. A low rank signifies a high number of errors in transfer, i.e., signifies strong original learning. We may compute the value of χr² by substituting our observed values in formula (7.3):

    \chi_r^2 = \frac{12}{(18)(3)(3+1)} [(39.5)^2 + (42.5)^2 + (26.0)^2] - (3)(18)(3+1) = 8.4
Reference to Table C indicates that χr² = 8.4 with df = k - 1 = 3 - 1 = 2 is significant at between the .02 and .01 levels. Since p < .02 is less than our previously established significance level of α = .05, the decision is to reject H0. The conclusion is that rats' scores on transfer of learning depend on the pattern of reinforcement in the original learning trials.
TABLE 7.4. RANKS OF EIGHTEEN MATCHED GROUPS IN TRANSFER AFTER TRAINING UNDER THREE CONDITIONS OF REINFORCEMENT
(One rank from 1 to 3 within each of the 18 groups, under each type of reinforcement: RR, RU, UR)

Sums of ranks R_j: RR = 39.5, RU = 42.5, UR = 26.0*

* In group 15, the RR and the RU animals obtained equal scores and thus were tied for ranks 2 and 3. Both were given ranks 2.5, the average of the tied ranks. Friedman (1937, p. 681) states that the substitution of the average rank for tied values does not affect the validity of the χr² test.
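Formula (7.3) reduces to a one-line computation once the column rank sums are in hand; the sketch below (an illustration, not part of the original text) reproduces the χr² = 7.4 obtained for Table 7.3:

```python
def friedman_chi_r2(rank_sums, N):
    """chi_r^2 from the k column rank sums R_j and the number of rows N,
    per formula (7.3). Refer the result to chi square with df = k - 1,
    or to an exact table when N and k are small."""
    k = len(rank_sums)
    return (12.0 / (N * k * (k + 1))) * sum(R * R for R in rank_sums) \
        - 3 * N * (k + 1)

# The small example of Table 7.3: R_j = 11, 5, 4, 10 with N = 3, k = 4.
chi_small = friedman_chi_r2([11, 5, 4, 10], N=3)  # 7.4
```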
Summary of procedure. These are the steps in the use of the Friedman two-way analysis of variance by ranks:
1. Cast the scores in a two-way table having k columns (conditions) and N rows (subjects or groups).
2. Rank the scores in each row from 1 to k.
3. Determine the sum of the ranks in each column: R_j.
4. Compute the value of χr², using formula (7.3).
5. The method for determining the probability of occurrence under H0 associated with the observed value of χr² depends on the sizes of N and k:
a. Table N gives exact probabilities associated with values as large as an observed χr² for k = 3, N = 2 to 9, and for k = 4, N = 2 to 4.
b. For N and/or k larger than those shown in Table N, the associated probability may be determined by reference to the chi-square distribution (given in Table C) with df = k - 1.
6. If the probability yielded by the appropriate method in step 5 is equal to or less than α, reject H0.

Power
The exact power of the χr² test is not reported in the literature. However, Friedman (1937, p. 686) has reported the results of 56 independent analyses of data which were suitable for analysis by the parametric F test and which were analyzed both by that test and by the nonparametric χr² test. The results give a good idea of the efficiency of the χr² test as compared to the most powerful k-sample parametric test (under these conditions): the F test. They are given in Table 7.5. The reader can
TABLE 7.5. COMPARISON OF RESULTS OF THE F TEST AND THE χr² TEST ON 56 SETS OF DATA WHICH MET THE ASSUMPTIONS AND REQUIREMENTS OF THE F TEST
(Each of the 56 analyses is cross-classified by the probability level yielded by the F test and by the χr² test: greater than .05, between .05 and .01, or less than .01.)
see from the information
in that table that it would be difficult or even
or even
impossible to say which is the more powerful test. In no casedid one of the tests yield a probability of less than .01, while the other yielded a probability greater than .05. In 45 of the 56 cases,the probability levels yielded by the two tests were essentially the same. For the 56 sets of data, the g,' test would have rejected Ho 26 times at a = .05, while the F test would have rejected Hs 24 times at that significancelevel. References
The reader may find discussions of the Friedman two-way analysis of variance by ranks in Friedman (1937; 1940) and in Kendall (1939; 1948a, chaps. 6, 7).
DISCUSSION
Two nonparametric statistical tests for testing H0 in the case of k related samples were presented in this chapter. The first, the Cochran Q test, is useful when the measurement of the variable under study is in a nominal or dichotomized ordinal scale. This test determines whether it is likely that the k related samples could have come from the same population with respect to proportion or frequency of "successes" in the various samples. That is, it is an over-all test of whether the k samples exhibit significantly different frequencies of "successes."
The second statistical test presented, the Friedman χr² test, is useful when the measurement of the variable is in at least an ordinal scale. It tests whether the k related samples could probably have come from the same population with respect to mean ranks. That is, it is an over-all test of whether the size of the scores depends on the conditions under which they were yielded.
Very little is known about the power of either test. However, the empirical study by Friedman which was cited earlier has shown very favorable results for the χr² test as compared with the most powerful parametric test, the F test. This would suggest that the Friedman test should be used in preference to the Cochran test whenever the data are appropriate (i.e., whenever the scores are in at least an ordinal scale). The χr² test has the further advantage of having tables of exact probabilities for very small samples, whereas the Cochran test should not be used when N (the number of rows) is too small.
CHAPTER 8

THE CASE OF k INDEPENDENT SAMPLES
In the analysis of research data, the investigator often needs to decide whether several independent samples should be regarded as having come from the same population. Sample values almost always differ somewhat, and the problem is to determine whether the observed sample differences signify differences among populations or whether they are merely the chance variations that are to be expected among random samples from the same population.
In this chapter, procedureswill be presented for testing for the significance of differences among three or more independent groups or samples. That is, statistical techniques will be presented for testing the null hypothesis that k independent samples have been drawn from the same population or from k identical populations. In the introduction to Chap. 7, we attempted to distinguish between
two sorts of k-sample tests. The first sort of test is useful for analyzing data from k matched samples, and two nonparametric tests of this sort
were presentedin Chap. 7. The secondkind of k-sample test is useful for analyzing data from k independentsamples. Such tests will be presented in this chapter.
The usual parametric technique for testing whether several independent samples have come from the same population is the one-way analysis of variance or F test. The assumptions associated with the statistical model underlying the F test are that the observations are independently drawn from normally distributed populations, all of which have the same variance. The measurement requirement of the F test is that the research must achieve at least interval measurement of the variable involved.
If a researcher finds such assumptions are unrealistic for his data, or if his measurement is weaker than interval scaling, or if he wishes to avoid making the restrictive assumptions of the F test and thus to increase the generality of his findings, he may use one of the nonparametric statistical tests for k independent samples which are presented in this chapter. These nonparametric tests have the further advantage of enabling data
which are inherently only classificatory (in a nominal scale) or in ranks (in an ordinal scale) to be examined for significance. We shall present three nonparametric tests for the case of k independent samples, and shall conclude the chapter with a discussion of the comparative uses of these tests.

THE χ² TEST FOR k INDEPENDENT SAMPLES
Function
When frequencies in discrete categories (either nominal or ordinal) constitute the data of research, the χ² test may be used to determine the significance of the differences among k independent groups. The χ² test for k independent samples is a straightforward extension of the χ² test for two independent samples, which is presented in Chap. 6. In general, the test is the same for both two and k independent samples.

Method
The method of computing the χ² test for k independent samples will be presented briefly here, together with an example of the application of the test. The reader will find a fuller discussion of this test in Chap. 6.
To apply the χ² test, one first arranges the frequencies in a k × r table. The null hypothesis is that the k samples of frequencies or proportions have come from the same population or from identical populations. This hypothesis, that the k samples do not differ among themselves, may be tested by applying formula (6.3):
    \chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{k} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}        (6.3)

where O_ij = observed number of cases categorized in ith row of jth column
      E_ij = number of cases expected under H0 to be categorized in ith row of jth column, as determined by the method presented on page 105
      \sum_{i=1}^{r} \sum_{j=1}^{k} directs one to sum over all cells
Under H0, the sampling distribution of χ² as computed from formula (6.3) can be shown to be approximated by a chi-square distribution with df = (k - 1)(r - 1), where k = the number of columns and r = the number of rows. Thus, the probability associated with the occurrence under H0 of values as large as an observed χ² is given in Table C of the Appendix. If an observed value of χ² is equal to or larger than that given in Table C for a particular level of significance and for df = (k - 1)(r - 1), then H0 may be rejected at that level of significance.
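As a sketch of formula (6.3) in Python (an illustration, not part of the original text), with expected frequencies computed from the marginal totals, E_ij = (row i total)(column j total)/N:

```python
def chi_square_rxk(table):
    """Chi-square for an r x k table of observed frequencies (formula 6.3).

    Expected frequencies come from the marginal totals:
    E_ij = (row i total) * (column j total) / N.
    Returns (chi2, df) with df = (k - 1)(r - 1); refer chi2 to the
    chi-square distribution (Table C in the text).
    """
    r, k = len(table), len(table[0])
    row_tot = [sum(row) for row in table]
    col_tot = [sum(table[i][j] for i in range(r)) for j in range(k)]
    N = sum(row_tot)
    chi2 = 0.0
    for i in range(r):
        for j in range(k):
            expected = row_tot[i] * col_tot[j] / N
            chi2 += (table[i][j] - expected) ** 2 / expected
    return chi2, (k - 1) * (r - 1)

# A small hypothetical 2 x 2 check: every expected frequency is 15,
# each cell deviates by 5, so chi2 = 4 * 25/15 (about 6.67) with df = 1.
chi2, df = chi_square_rxk([[10, 20], [20, 10]])
```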
THE CASE OF k INDEPENDENT SAMPLES
Example
In an investigation of the nature and consequences of social stratification in a small Middle Western community,* Hollingshead found that members of the community divided themselves into five social classes: I, II, III, IV, and V. His research centered on the correlates of this stratification among the youth of the community. One of his predictions was that adolescents in the different social classes would enroll in different curriculums (college preparatory, general, commercial) at the Elmtown high school. He tested this by identifying the social class membership of 390 high school students and determining the curriculum enrollment of each.

i. Null Hypothesis. H0: the proportion of students enrolled in the three alternative curriculums is the same in all classes. H1: the proportion of students enrolled in the three alternative curriculums differs from class to class.
ii. Statistical Test. Since the groups under study are independent and number more than two, a statistical test for k independent samples is called for. Since the data are in discrete categories, the χ² test is an appropriate one.

iii. Significance Level. Let α = .01. N = 390, the number of students whose enrollment and class status were observed.

iv. Sampling Distribution. Under the null hypothesis, χ² as computed from formula (6.3) is distributed approximately as chi square with df = (k − 1)(r − 1). The probability associated with the occurrence under H0 of values as large as an observed value of χ² is shown in Table C.

v. Rejection Region. The region of rejection consists of all values of χ² which are so large that the probability associated with their occurrence under H0 is equal to or less than α = .01.

vi. Decision. Table 8.1 gives the curricular enrollment of the
390 high school students in Elmtown who were studied by Hollingshead. Social classes I and II were grouped together by Hollingshead because of the small number of youths belonging to these two classes, particularly to class I. Table 8.1 also shows in italics the number of youths who might be expected under H0 to have enrolled in each of the three curriculums, i.e., the expected enrollments if there were really no difference in enrollment among the various social classes. (These expected values were determined from the marginal totals by the method presented on page 105.) For example, whereas 23 of the class I and II students were enrolled in the college preparatory curriculum, under H0 we would expect only 7.3 to have enrolled

* Hollingshead, A. B. 1949. Elmtown's Youth: The Impact of Social Classes on Adolescents. New York: Wiley.
in that curriculum. And whereas only 1 of the class I and II students enrolled in the commercial curriculum, under H0 we would expect 9.1 to have enrolled in that curriculum. Of the 26 class V youths, only 2 were enrolled in the college preparatory curriculum, whereas under H0 we would expect to find 5.4 in that curriculum.

TABLE 8.1. FREQUENCY OF ENROLLMENT OF ELMTOWN YOUTHS FROM FIVE SOCIAL CLASSES IN THREE ALTERNATIVE HIGH SCHOOL CURRICULUMS*
(expected frequencies, shown in italics in the original, given here in parentheses)

  Curriculum             I & II       III          IV            V          Total
  College preparatory   23 (7.3)     40 (30.3)    16 (38.0)     2 (5.4)      81
  General               11 (18.6)    75 (77.5)   107 (97.1)    14 (13.8)    207
  Commercial             1 (9.1)     31 (38.2)    60 (47.9)    10 (6.8)     102
  Total                 35          146          183           26           390

* Adapted from Table X of Hollingshead, A. B. 1949. Elmtown's Youth. New York: Wiley, p. 462, with the kind permission of John Wiley & Sons, Inc.
The size of χ² reflects the magnitude of the discrepancy between the observed and the expected values in each of the cells. We may compute χ² for the values in Table 8.1 by the application of formula (6.3):

    χ² = Σi Σj (Oij − Eij)²/Eij        (6.3)

       = (23 − 7.3)²/7.3 + (40 − 30.3)²/30.3 + (16 − 38.0)²/38.0 + (2 − 5.4)²/5.4
         + (11 − 18.6)²/18.6 + (75 − 77.5)²/77.5 + (107 − 97.1)²/97.1 + (14 − 13.8)²/13.8
         + (1 − 9.1)²/9.1 + (31 − 38.2)²/38.2 + (60 − 47.9)²/47.9 + (10 − 6.8)²/6.8

       = 33.8 + 3.1 + 12.7 + 2.1 + 3.1 + .08 + 1.0 + .003 + 7.3 + 1.4 + 3.1 + 1.5

       = 69.2
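The hand computation can be replayed in a short program (an editorial sketch, not part of the original text). Note that it computes the expected frequencies exactly rather than rounding them to one decimal as in the hand computation, which is why it yields 69.4 rather than 69.2; the conclusion is unchanged.

```python
# Sketch: chi-square test for k independent samples, applied to the observed
# frequencies of Table 8.1 (rows = curriculums, columns = social classes).

observed = [
    [23, 40, 16, 2],    # college preparatory
    [11, 75, 107, 14],  # general
    [1, 31, 60, 10],    # commercial
]

def chi_square_k_samples(table):
    r, k = len(table), len(table[0])
    row_totals = [sum(row) for row in table]
    col_totals = [sum(table[i][j] for i in range(r)) for j in range(k)]
    n = sum(row_totals)
    chi2 = 0.0
    for i in range(r):
        for j in range(k):
            e = row_totals[i] * col_totals[j] / n  # expected frequency under H0
            chi2 += (table[i][j] - e) ** 2 / e
    return chi2, (r - 1) * (k - 1)

chi2, df = chi_square_k_samples(observed)
print(round(chi2, 1), df)  # 69.4 with df = 6
```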
We observe that for the data in Table 8.1, χ² = 69.2 with

    df = (k − 1)(r − 1) = (4 − 1)(3 − 1) = 6
Reference to Table C reveals that such a value of χ² is significant far beyond the .001 level. Since p < .001 is less than our previously set level of significance, α = .01, our decision is to reject H0. We conclude that high school students' curricular enrollment is not independent of social class membership among Elmtown's youth.

Summary of procedure. These are the steps in the use of the χ² test for k independent samples:

1. Cast the observed frequencies in a k × r contingency table, using the k columns for the groups.

2. Determine the expected frequency under H0 for each cell by finding the product of the marginal totals common to the cell and dividing this product by N. (N is the sum of each group of marginal totals. It represents the total number of independent observations. Inflated N's invalidate the test.)
3. Compute χ² by using formula (6.3). Determine df = (k − 1)(r − 1).

4. Determine the significance of the observed value of χ² by reference to Table C. If the probability given for the observed value of χ² for the observed value of df is equal to or smaller than α, reject H0 in favor of H1.

When to Use the χ² Test
The χ² test requires that the expected frequencies (Eij's) in each cell should not be too small. When this requirement is violated, the results of the test are meaningless. Cochran (1954) recommends that for χ² tests with df larger than 1 (that is, when either k or r is larger than 2), fewer than 20 per cent of the cells should have an expected frequency of less than 5, and no cell should have an expected frequency of less than 1. If these requirements are not met by the data in the form in which they were originally collected, the researcher must combine adjacent categories so as to increase the Eij's in the various cells. Only after he has combined the categories so that fewer than 20 per cent of the cells have expected frequencies of less than 5 and no cell has an expected frequency of less than 1 can the researcher meaningfully apply the χ² test. Of course he will be limited in his combining by the nature of his data. That is, the results of the statistical test may not be interpretable if the combining of categories has been capricious. The adjacent categories which are combined must have some common property or mutual
identity if interpretation of the outcome of the test after the combining is to be possible. The researcher will guard against the necessity of combining categories if he uses a sufficiently large N in his research.

χ² tests are insensitive to the effects of order when df > 1. Thus when a hypothesis takes order into account, χ² may not be the best test. Cochran (1954) has presented methods that strengthen the common χ² tests when H0 is tested against specific alternatives.

Power

There is usually no clear alternative to the χ² test when it is used, and thus the exact power of the χ² test usually cannot be computed. However, Cochran (1952, pp. 323-324) has shown that the limiting power distribution of χ² tends to 1 as N becomes large.

References
For other discussions of the χ² test, the reader is referred to Cochran (1952; 1954), Dixon and Massey (1951, chap. 13), Edwards (1954, chap. 18), Lewis and Burke (1949), McNemar (1955, chap. 13), and Walker and Lev (1953, chap. 4).
THE EXTENSION OF THE MEDIAN TEST
Function
The extension of the median test determines whether k independent groups (not necessarily of equal size) have been drawn from the same population or from populations with equal medians. It is useful when the variable under study has been measured in at least an ordinal scale.

Method
To apply the extension of the median test, we first determine the median score for the combined k samples of scores, i.e., we find the common median for all scores in the k groups. We then replace each score by a plus if the score is larger than the common median and by a minus if it is smaller than the common median. (If it happens that one or more scores fall at the common median, then the scores may be dichotomized by assigning a plus to those scores which exceed the common median and a minus to those which fall at the median or below.)

We may cast the resulting dichotomous sets of scores into a k × 2 table, with the numbers in the body of the table representing the frequencies of pluses and minuses in each of the k groups. Table 8.3, shown below, is an example of such a table.
To test the null hypothesis that the k samples have come from the same population with respect to medians, we compute the value of χ² from formula (6.3):

    χ² = Σi Σj (Oij − Eij)²/Eij        (6.3)
where Oij = observed number of cases categorized in the ith row of the jth column
      Eij = number of cases expected under H0 to be categorized in the ith row of the jth column
and Σi Σj directs one to sum over all cells.
It can be shown that the sampling distribution under H0 of χ² as computed from formula (6.3) is approximated by the chi-square distribution with df = (k − 1)(r − 1), where k = the number of columns and r = the number of rows. In the median test, r = 2, and thus

    df = (k − 1)(r − 1) = (k − 1)(2 − 1) = k − 1
The probability associated with the occurrence under H0 of values as large as an observed value of χ² is given in Table C of the Appendix. If the observed value of χ² is equal to or larger than that given in Table C for the previously set level of significance and for the observed value of df = k − 1, then H0 may be rejected at that level of significance.

If it is possible to dichotomize the scores exactly at the common median, then each Eij is one-half of the marginal total for its column. When the scores are dichotomized as those which do and do not exceed the common median, the method for finding expected frequencies which is presented on page 105 should be used.

Once the data have been categorized as pluses and minuses with respect to the common median, and the resulting frequencies have been cast in a k × 2 table, the computation procedures for this test are exactly the same as those for the χ² test for k independent samples, presented in the previous section of this chapter. They will be illustrated in the example which follows.
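The dichotomization step can be sketched in a few lines (an editorial sketch; the group names and scores below are hypothetical, not from the text):

```python
# Sketch: dichotomizing k groups of scores at their common median, the first
# step of the extension of the median test. Groups and scores are hypothetical.

groups = {"A": [4, 7, 9, 3], "B": [5, 8, 2], "C": [10, 6, 1]}

pooled = sorted(s for scores in groups.values() for s in scores)
n = len(pooled)
common_median = (pooled[(n - 1) // 2] + pooled[n // 2]) / 2

# k x 2 table: (pluses above the median, minuses at or below it) for each group
table = {name: (sum(s > common_median for s in scores),
                sum(s <= common_median for s in scores))
         for name, scores in groups.items()}
print(common_median, table)  # 5.5 {'A': (2, 2), 'B': (1, 2), 'C': (2, 1)}
```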
Example
Suppose an educational researcher wishes to study the influence of amount of education upon mothers' degree of interest in their children's schooling. He takes the highest school grade completed by each mother as an index to her amount of education; as an index to degree of interest in the child's schooling he takes the number of voluntary visits which each mother makes to the school during one school year, e.g., to class plays, to parent meetings, to self-initiated conferences with teachers and administrators, etc. By drawing
every tenth name from the file of the names of the 440 children enrolled in the school, he obtains the names of 44 mothers, who constitute his sample. His hypothesis is that mothers' numbers of visits will vary according to the number of years the mothers have completed in school.

i. Null Hypothesis. H0: there is no difference in frequency of school visits among mothers with different amounts of education, i.e., frequency of maternal visits to school is independent of amount of maternal education. H1: the frequency of school visits by mothers differs for varying amounts of maternal education.
ii. Statistical Test. Since the groups of mothers of various educational levels are independent of each other and since several groups are anticipated, a significance test for k independent samples is in order. Since number of years of school constitutes at best an ordinal measure of degree of education, and since number of visits to school constitutes at best an ordinal measure of degree of interest in one's child's schooling, the extension of the median test is appropriate for testing the hypothesis concerning differences in central tendencies.
iii. Significance Level. Let α = .05. N = 44 = the number of mothers in the sample.

iv. Sampling Distribution. Under the null hypothesis, χ² as computed from formula (6.3) is distributed approximately as chi square with df = k − 1 when r = 2. The probability associated with the occurrence under H0 of values as large as an observed χ² is shown in Table C.

v. Rejection Region. The region of rejection consists of all values of χ² which are so large that the probability associated with their occurrence under H0 is equal to or less than α = .05.
vi. Decision. In our fictitious example, the researcher collects the data presented in Table 8.2. The common median for these 44 scores is 2.5. That is, half of the mothers visited the school two or fewer times during the school year, and half visited three or more times. If we split each group of scores at that combined median, we obtain the data shown in Table 8.3, which gives the number of mothers in each educational level who fall above or below the common median in number of visits to school. Of those mothers whose education was limited to 8 years, for example, five visited the school three or more times and five visited two or fewer times. Of those mothers who had attended some years of college, three visited the school three or more times, and one visited two or fewer times. Given in italics in Table 8.3 are the expected numbers under H0. Observe that, with the scores dichotomized
TABLE 8.2. NUMBER OF VISITS TO SCHOOL BY MOTHERS OF VARIOUS EDUCATIONAL LEVELS
(Artificial data)

TABLE 8.3. VISITS TO SCHOOL BY MOTHERS OF VARIOUS EDUCATIONAL LEVELS
(Artificial data)
exactly at the median, the expected frequency in each cell is just one-half of the total for the column in which the cell is located. Examining the data, the reader will notice that in this form the data are not amenable to a χ² analysis, for more than 20 per cent of the cells have expected frequencies of less than 5. (See the discussion of when to use the χ² test, on pages 178 and 179.) Observe that those categories with the unacceptably small expected frequencies all concern mothers who have attended college for various amounts of time: those who had some college, those who graduated from college, and those who attended graduate school. We may combine the three categories into one: College (one or more years). By doing so, we obtain the data shown in Table 8.4. Observe that in this

TABLE 8.4. VISITS TO SCHOOL BY MOTHERS OF VARIOUS EDUCATIONAL LEVELS
(Artificial data; education groups as in Table 8.2, with the three college categories combined; expected frequencies under H0 in parentheses)

  No. of mothers whose visits were more
  frequent than the common median:      5 (5)    4 (5.5)    7 (6.5)    6 (5)     22
  No. of mothers whose visits were less
  frequent than the common median:      5 (5)    7 (5.5)    6 (6.5)    4 (5)     22
  Total:                               10       11         13         10         44
table we have data which are amenable to a χ² analysis. We may compute the observed value of χ² by substituting the data in Table 8.4 into formula (6.3):

    χ² = Σi Σj (Oij − Eij)²/Eij        (6.3)

       = (5 − 5)²/5 + (4 − 5.5)²/5.5 + (7 − 6.5)²/6.5 + (6 − 5)²/5
         + (5 − 5)²/5 + (7 − 5.5)²/5.5 + (6 − 6.5)²/6.5 + (4 − 5)²/5

       = 0 + .409 + .0385 + .2 + 0 + .409 + .0385 + .2

       = 1.295
By this computation we determine that χ² = 1.295, and we know that df = k − 1 = 4 − 1 = 3. Reference to Table C reveals that under H0 a χ² equal to or greater than 1.295 for df = 3 has probability of occurrence between .80 and .70. Since this p is larger than our previously set level of significance, α = .05, our decision must be that on the basis of these (fictitious) data, we cannot reject the null hypothesis that the number of school visits made by mothers is independent of amount of maternal education.

Summary of procedure. These are the steps in the use of the extension of the median test:
1. Determine the common median of the scores in the k groups.

2. Assign pluses to all scores above that median and minuses to all scores below, thereby splitting each of the k groups of scores at the combined median. Cast the resulting frequencies in a k × 2 table.

3. Using the data in that table, compute the value of χ² as given by formula (6.3). Determine df = k − 1.

4. Determine the significance of the observed value of χ² by reference to Table C. If the associated probability given for values as large as the observed value of χ² is equal to or smaller than α, reject H0 in favor of H1.
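The computation for the example above reduces to a few lines once the k × 2 table is in hand (an editorial sketch, not part of the original text); with the scores split exactly at the common median, each expected frequency is half its column total:

```python
# Sketch: chi-square for the median-test frequencies of Table 8.4.

above = [5, 4, 7, 6]  # mothers above the common median of visits, by education group
below = [5, 7, 6, 4]  # mothers at or below it

chi2 = 0.0
for a, b in zip(above, below):
    e = (a + b) / 2  # dichotomized exactly at the median: E = half the column total
    chi2 += (a - e) ** 2 / e + (b - e) ** 2 / e

df = len(above) - 1
print(round(chi2, 3), df)  # 1.295 with df = 3
```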
As we have mentioned, the extension of the median test is in essence a χ² test for k samples. For information concerning the conditions under which the test may properly be used, and the power of the test, the reader is referred to the discussions of these topics on pages 178 and 179.

References
Discussions relevant to this test are contained in Cochran (1954) and Mood (1950, pp. 398-399).

THE KRUSKAL-WALLIS ONE-WAY ANALYSIS OF VARIANCE BY RANKS
Function
The Kruskal-Wallis one-way analysis of variance by ranks is an extremely useful test for deciding whether k independent samples are from different populations. Sample values almost invariably differ somewhat, and the question is whether the differences among the samples signify genuine population differences or whether they represent merely chance variations such as are to be expected among several random samples from the same population. The Kruskal-Wallis technique tests the null hypothesis that the k samples come from the same population or from identical populations with respect to averages. The test assumes
that the variable under study has an underlying continuous distribution. It requires at least ordinal measurement of that variable.

Rationale and Method
In the computation of the Kruskal-Wallis test, each of the N observations is replaced by a rank. That is, all of the scores from all of the k samples combined are ranked in a single series. The smallest score is replaced by rank 1, the next to smallest by rank 2, and the largest by rank N, where N = the total number of independent observations in the k samples. When this has been done, the sum of the ranks in each sample (column) is found. The Kruskal-Wallis test determines whether these sums of ranks are so disparate that they are not likely to have come from samples which were all drawn from the same population.

It can be shown that if the k samples actually are from the same population or from identical populations, that is, if H0 is true, then H [the statistic used in the Kruskal-Wallis test and defined by formula (8.1) below] is distributed as chi square with df = k − 1, provided that the sizes of the various k samples are not too small. That is,

    H = 12/[N(N + 1)] × Σj (Rj²/nj) − 3(N + 1)        (8.1)

where k = number of samples
      nj = number of cases in jth sample
      N = Σnj, the number of cases in all samples combined
      Rj = sum of ranks in jth sample (column)
and Σj directs one to sum over the k samples (columns); H is distributed approximately as chi square with df = k − 1, for sample sizes (nj's) sufficiently large.
When there are more than 5 cases in the various groups, that is, nj > 5, the probability associated with the occurrence under H0 of values as large as an observed H may be determined by reference to Table C of the Appendix. If the observed value of H is equal to or larger than the value of chi square given in Table C for the previously set level of significance and for the observed value of df = k − 1, then H0 may be rejected at that level of significance.

When k = 3 and the number of cases in each of the 3 samples is 5 or fewer, the chi-square approximation to the sampling distribution of H is not sufficiently close. For such cases, exact probabilities have been tabled from formula (8.1). These are presented in Table O of the Appendix. The first column in that table gives the number of cases in the 3 samples, i.e., gives various possible values of n1, n2, and n3. The
second gives various values of H, as computed from formula (8.1). The third gives the probability associated with the occurrence under H0 of values as large as an observed H. For example, if H ≥ 5.8333 when the three samples contain 4, 3, and 1 cases, Table O shows that the null hypothesis may be rejected at the .021 level of significance.

Example for Small Samples
Suppose an educational researcher wishes to test the hypothesis that school administrators are typically more authoritarian than classroom teachers. He knows, however, that his data for testing this hypothesis may be contaminated by the fact that many classroom teachers are administration-oriented in their professional aspirations. That is, many teachers take administrators as a reference group. To avoid this contamination, he plans to divide his 14 subjects into 3 groups: teaching-oriented teachers (classroom teachers who wish to remain in a teaching position), administration-oriented teachers (classroom teachers who aspire to become administrators), and administrators. He administers the F scale* (a measure of authoritarianism) to each of the 14 subjects. His hypothesis is that the three groups will differ with respect to averages on the F scale.
i. Null Hypothesis. H0: there is no difference among the average F scores of teaching-oriented teachers, administration-oriented teachers, and administrators. H1: the three groups of educators are not the same in their average F scores.†

ii. Statistical Test. Since three independent groups are under study, a test for k independent samples is called for. Since F-scale scores may be considered to represent at least ordinal measurement of authoritarianism, the Kruskal-Wallis test is appropriate.

iii. Significance Level. Let α = .05. N = 14 = the total number of educators studied. n1 = 5 = the number of teaching-oriented teachers. n2 = 5 = the number of administration-oriented teachers. n3 = 4 = the number of administrators.
iv. Sampling Distribution. For k = 3 and nj's small, Table O gives the probability associated with the occurrence under H0 of values as large as an observed H.

* Presented in Adorno, T. W., Frenkel-Brunswik, Else, Levinson, D. J., and Sanford, R. N. 1950. The Authoritarian Personality. New York: Harper.

† If X stands for the score of a teaching-oriented teacher, Y stands for the score of an administration-oriented teacher, and Z stands for the score of an administrator, then H0, more properly stated, is that p(X > Y) = p(X > Z) = p(Y > Z); H1 would then call for inequality at least once.
v. Rejection Region. The region of rejection consists of all values of H which are so large that the probability associated with their occurrence under H0 is equal to or less than α = .05.
vi. Decision. For this fictitious study, the F scores for the various educators are shown in Table 8.5. If we rank these 14 F scores

TABLE 8.5. AUTHORITARIANISM SCORES OF THREE GROUPS OF EDUCATORS
(Artificial data)

from lowest to highest, we obtain the ranks shown in Table 8.6.
These ranks are summed for the three groups to obtain R1 = 22, R2 = 37, and R3 = 46, as shown in Table 8.6.

TABLE 8.6. AUTHORITARIANISM RANKS OF THREE GROUPS OF EDUCATORS
(Artificial data)

  Teaching-oriented teachers   Administration-oriented teachers   Administrators
          R1 = 22                         R2 = 37                    R3 = 46
Now with these data we may compute the value of H from formula (8.1):

    H = 12/[N(N + 1)] × Σj (Rj²/nj) − 3(N + 1)        (8.1)

      = 12/[14(14 + 1)] × [(22)²/5 + (37)²/5 + (46)²/4] − 3(14 + 1)

      = 6.4
Reference to Table O discloses that when the nj's are 5, 5, and 4, H ≥ 6.4 has probability of occurrence under the null hypothesis of p < .049. Since this probability is smaller than α = .05, our decision in this fictitious study is to reject H0 in favor of H1. We conclude that the specified three groups of educators differ in degree of authoritarianism.
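Formula (8.1) is easy to verify directly from the rank sums of Table 8.6 (an editorial sketch, not part of the original text):

```python
# Sketch: H from formula (8.1), given the rank sums and group sizes.

def kruskal_wallis_h(rank_sums, sizes):
    n = sum(sizes)
    return 12.0 / (n * (n + 1)) * sum(
        r * r / m for r, m in zip(rank_sums, sizes)) - 3 * (n + 1)

h = kruskal_wallis_h([22, 37, 46], [5, 5, 4])
print(round(h, 2))  # 6.41, reported as 6.4 in the text
```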
Tied observations. When ties occur between two or more scores, each score is given the mean of the ranks for which it is tied. Since the value of H is somewhat influenced by ties, one may wish to correct for ties in computing H. To correct for the effect of ties, H is computed by formula (8.1) and then divided by

    1 − ΣT/(N³ − N)        (8.2)

where T = t³ − t (where t is the number of tied observations in a tied group of scores)
      N = number of observations in all k samples together, that is, N = Σnj
and ΣT directs one to sum over all groups of ties. Thus a general expression for H corrected for ties is

    H = {12/[N(N + 1)] × Σj (Rj²/nj) − 3(N + 1)} / [1 − ΣT/(N³ − N)]        (8.3)
The effect of correcting for ties is to increase the value of H and thus to make the result more significant than it would have been if uncorrected. Therefore if one is able to reject H0 without making the correction [i.e., by using formula (8.1) for computing H], one will be able to reject H0 at an even more stringent level of significance if the correction is used.

In most cases, the effect of the correction is negligible. If no more than 25 per cent of the observations are involved in ties, the probability associated with an H computed without the correction for ties, i.e., by formula (8.1), is rarely changed by more than 10 per cent when the correction for ties is made, that is, if H is computed by formula (8.3), according to Kruskal and Wallis (1952, p. 587). In the example which follows, H is first computed by formula (8.1) and then is corrected for ties. Notice that even though there are 13 groups of ties, involving 47 of the 56 observations, the change in H which results from applying the correction for ties is merely from H = 18.464 to H = 18.566.
As usual, the magnitude of the correction factor depends on the length of the ties, i.e., on the values of t, as well as on the percentage of the observations involved. This point is discussed on page 125.

Example for Large Samples

An investigator determined the birth weights of the members of eight different litters of pigs, in order to determine whether birth weight is affected by litter size.*
i. Null Hypothesis. H0: there is no difference in the average birth weights of pigs from litters of different sizes. H1: the average birth weights of pigs from different litter sizes are not all equal.

ii. Statistical Test. Since the eight litters are independent, a statistical test for k independent samples is appropriate. Although the measurement of weight in pounds is measurement in a ratio scale, we choose the nonparametric one-way analysis of variance rather than the equivalent parametric test in order to avoid making the assumptions concerning normality and homogeneity of variance associated with the parametric F test and to increase the generality of our findings.
iii. Significance Level. Let α = .05. N = 56 = the total number of infant pigs under study.

iv. Sampling Distribution. As computed by formula (8.1), H is distributed approximately as chi square with df = k − 1. Thus the probability associated with the occurrence under H0 of values as large as an observed H may be determined by reference to Table C.

v. Rejection Region. The region of rejection consists of all values of H which are so large that the probability associated with their occurrence under H0 for df = k − 1 = 7 is equal to or less than α = .05.
vi. Decision. The birth weights of the 56 infant pigs belonging to 8 litters are given in Table 8.7. If we rank these 56 weights, we obtain the ranks shown in Table 8.8. Observe that we have ranked the 56 scores in a single series, as is required by this test. The smallest infant pig, the final member of litter 1, weighed 1.1 pounds

* This example uses an adaptation of the data presented in tables 10.16.1, 10.17.1, 10.20.1, and 10.29.3 of the 5th edition of Statistical Methods by George W. Snedecor (1956) with the kind permission of the author and the publisher, the Iowa State College Press. Although these data are not from the behavioral sciences, and although the Kruskal-Wallis test may not be as efficient as a regression analysis in extracting the relevant information in the data, the example is chosen for this illustration because of the large number of ties contained in the observations and because the groups are of unequal size. This latter feature is rare in research data available in complete form in the current literature of the behavioral sciences. Kruskal and Wallis (1952) use the same illustrative data in the presentation of their test.
and is given the rank of 1. The heaviest infant pig, also in litter 1, weighed 4.4 pounds; this weight earned the rank of 56. Also shown in Table 8.8 are the rank sums for each litter, the Rj's.

TABLE 8.7. BIRTH WEIGHTS IN POUNDS OF EIGHT LITTERS OF POLAND CHINA PIGS, SPRING 1919
  Litter 1:  2.0  2.8  3.3  3.2  4.4  3.6  1.9  3.3  2.8  1.1
  Litter 2:  3.5  2.8  3.2  3.5  2.3  2.4  2.0  1.6
  Litter 3:  3.3  3.6  2.6  3.1  3.2  3.3  2.9  3.4  3.2  3.2
  Litter 4:  3.2  3.3  3.2  2.9  3.3  2.5  2.6  2.8
  Litter 5:  2.6  2.6  2.9  2.0  2.0  2.1
  Litter 6:  3.1  2.9  3.1  2.5
  Litter 7:  2.6  2.2  2.2  2.5  1.2  1.2
  Litter 8:  2.5  2.4  3.0  1.5
With the data in Table 8.8, we may compute the value of H, uncorrected for ties, by formula (8.1):

    H = 12/[N(N + 1)] × Σj (Rj²/nj) − 3(N + 1)        (8.1)

      = 12/[56(56 + 1)] × [(317)²/10 + (216.5)²/8 + (414)²/10 + (277.5)²/8
        + (105.5)²/6 + (122)²/4 + (71.5)²/6 + (72)²/4] − 3(56 + 1)

      = 12/3192 × (10,048.9 + 5,859.031 + 17,139.6 + 9,625.781 + 1,855.042
        + 3,721.0 + 852.042 + 1,296.0) − 171

      = 18.464
Reference to Table C indicates that an H ≥ 18.464 with df = k − 1 = 7 has probability of occurrence under H0 of p < .02.

To correct for ties, we must first determine how many groups of ties occurred and how many scores were tied in each group. The first tie occurred between two pigs in litter 7 (who both weighed 1.2
pounds). Both were assigned the rank of 2.5. Here t = the number of tied observations = 2. For this occurrence,

    T = t³ − t = 8 − 2 = 6

The next tie occurred between four pigs who were assigned the tied rank of 8.5. Here t = 4, and T = t³ − t = 64 − 4 = 60.
TABLE 8.8. RANKS OF BIRTH WEIGHTS OF EIGHT LITTERS OF PIGS

  Litter 1:  8.5  27.5  47.5  41.0  56.0  54.5   6.0  47.5  27.5   1.0    R1 = 317.0
  Litter 2: 52.5  27.5  41.0  52.5  14.0  15.5   8.5   5.0                R2 = 216.5
  Litter 3: 47.5  54.5  23.0  36.0  41.0  47.5  31.5  51.0  41.0  41.0    R3 = 414.0
  Litter 4: 41.0  47.5  41.0  31.5  47.5  18.5  23.0  27.5                R4 = 277.5
  Litter 5: 23.0  23.0  31.5   8.5   8.5  11.0                            R5 = 105.5
  Litter 6: 36.0  31.5  36.0  18.5                                        R6 = 122.0
  Litter 7: 23.0  12.5  12.5  18.5   2.5   2.5                            R7 =  71.5
  Litter 8: 18.5  15.5  34.0   4.0                                        R8 =  72.0
Continuing through the data in Table 8.8 in this way, we find that 13 groups of ties occurred. We may count the number of observations in each tied group to determine the various values of t, and we may compute the value of T = t³ − t in each case. Our count will result in the findings below:

    t:   2    4    2    2    4    5    4    4    3    7    6    2    2
    T:   6   60    6    6   60  120   60   60   24  336  210    6    6

Observe that for any particular value of t, the value of T is a constant. Now, using formula (8.2), we may compute the total correction for ties:

    1 − ΣT/(N³ − N) = 1 − (6 + 60 + 6 + 6 + 60 + 120 + 60 + 60 + 24 + 336 + 210 + 6 + 6)/[(56)³ − 56]        (8.2)

                    = 1 − 960/175,560

                    = .9945
Now this value becomes the denominator of formula (8.3), and the value we have already computed from formula (8.1) is the numerator. Thus we need make only one additional operation to compute the value of H corrected for ties:

    H = {12/[N(N + 1)] × Σj (Rj²/nj) − 3(N + 1)} / [1 − ΣT/(N³ − N)]        (8.3)

      = 18.464/.9945

      = 18.566
Reference to Table C discloses that the probability associated with the occurrence under H0 of a value as large as H = 18.566, df = 7, is p < .01. Since this probability is smaller than our previously set level of significance, α = .05, our decision is to reject H0.* We conclude that the birth weight of pigs varies significantly with litter size.
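The whole large-sample computation, including the tie correction of formulas (8.2) and (8.3), can be replayed from the rank sums and tie counts above (an editorial sketch, not part of the original text; exact arithmetic gives 18.565 where the rounded intermediate value .9945 gives 18.566):

```python
# Sketch: H for the pig-litter data, uncorrected and corrected for ties.

rank_sums = [317.0, 216.5, 414.0, 277.5, 105.5, 122.0, 71.5, 72.0]
sizes = [10, 8, 10, 8, 6, 4, 6, 4]
tie_lengths = [2, 4, 2, 2, 4, 5, 4, 4, 3, 7, 6, 2, 2]  # the 13 groups of ties

n = sum(sizes)  # 56
h = 12.0 / (n * (n + 1)) * sum(
    r * r / m for r, m in zip(rank_sums, sizes)) - 3 * (n + 1)

correction = 1 - sum(t**3 - t for t in tie_lengths) / (n**3 - n)  # formula (8.2)
h_corrected = h / correction                                      # formula (8.3)
print(round(h, 3), round(correction, 4), round(h_corrected, 3))  # 18.464 0.9945 18.565
```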
Summary of procedure. These are the steps in the use of the Kruskal-Wallis one-way analysis of variance by ranks:

1. Rank all of the observations for the k groups in a single series, assigning ranks from 1 to N.

2. Determine the value of R (the sum of the ranks) for each of the k groups of ranks.

3. If a large proportion of the observations are tied, compute the value of H from formula (8.3). Otherwise use formula (8.1).

4. The method for assessing the significance of the observed value of H depends on the size of k and on the size of the groups:
   a. If k = 3 and if n1, n2, n3 ≤ 5, Table O may be used to determine the associated probability under H0 of an H as large as that observed.
   b. In other cases, the significance of a value as large as the observed value of H may be assessed by reference to Table C, with df = k − 1.

5. If the probability associated with the observed value of H is equal to or less than the previously set level of significance, α, reject H0 in favor of H1.
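The steps above can be combined into one self-contained routine (an editorial sketch, not the book's notation): it ranks all observations in a single series, assigns midranks to tied scores, and applies the tie correction of formula (8.3), which reduces to formula (8.1) when there are no ties.

```python
from collections import Counter

# Sketch: the full Kruskal-Wallis procedure from raw scores.

def kruskal_wallis(*samples):
    pooled = sorted(x for s in samples for x in s)
    n = len(pooled)
    # Step 1: rank all N observations in a single series; ties get midranks.
    rank = {}
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j] == pooled[i]:
            j += 1
        rank[pooled[i]] = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        i = j
    # Steps 2-3: rank sums per group, then H by formula (8.1).
    h = 12.0 / (n * (n + 1)) * sum(
        sum(rank[x] for x in s) ** 2 / len(s) for s in samples) - 3 * (n + 1)
    # Correction for ties, formulas (8.2)-(8.3).
    sum_t = sum(t**3 - t for t in Counter(pooled).values() if t > 1)
    return h / (1 - sum_t / (n**3 - n)) if sum_t else h

print(round(kruskal_wallis([1, 2, 3], [4, 5], [6, 7, 8]), 2))  # 6.25 (no ties)
```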
Power-Efficiency

Compared with the most powerful parametric test, the F test, under conditions where the assumptions associated with the statistical model of the F test are met, the Kruskal-Wallis test has asymptotic efficiency of 3/π = 95.5 per cent (Andrews, 1954). The Kruskal-Wallis test is more efficient than the extension of the median test because it utilizes more of the information in the observations, converting the scores into ranks rather than simply dichotomizing them as above and below the median.

* The parametric analysis of variance yields an F = 2.987, which for df's of 7 and 48 corresponds with a probability of .011.

References
The reader will find discussions of the one-way analysis of variance by ranks in Kruskal and Wallis (1952) and in Kruskal (1952).
DISCUSSION
Three nonparametric statistical tests for analyzing data from k independent samples were presented in this chapter. The first of these, the χ² test for k independent samples, is useful when the data are in frequencies and when measurement of the variables under study is in a nominal scale or in discrete categories of an ordinal scale. It tests whether the proportions or frequencies in the various categories are independent of the condition (sample) under which they were observed. That is, it tests the null hypothesis that the k samples have come from the same population or from identical populations with respect to the proportion of cases in the various categories.
The second test presented, the extension of the median test, requires at least ordinal measurement of the variable under study. It tests whether k independent samples of scores on that variable could have been drawn from the same population or identical populations with respect to the median.
The Kruskal-Wallis one-way analysis of variance by ranks, the third test discussed, requires at least ordinal measurement of the variable. It tests whether k independent samples could have been drawn from the same continuous population.
We have no choice among these tests if our data are in frequencies rather than scores, i.e., if we have enumeration data, and if the measurement is no stronger than nominal. The χ² test for k independent samples is uniquely useful for such data.
The extension of the median test and the Kruskal-Wallis test may both be applied to the same data, i.e., they have similar requirements for the data under test. When the data are such that either test might be used, the Kruskal-Wallis test will be found to be more efficient because it uses more of the information in the observations. It converts the scores to ranks, whereas the extension of the median test converts them
simply to either pluses or minuses. Thus the Kruskal-Wallis test preserves the magnitude of the scores more fully than does the extension of the median test. For this reason it is usually more sensitive to differences among the k samples of scores.
The Kruskal-Wallis test seems to be the most efficient of the nonparametric tests for k independent samples. It has power-efficiency of 3/π = 95.5 per cent, when compared with the F test, the most powerful parametric test.
There are at least four other nonparametric tests for k independent samples. These four are rather specialized in their usefulness and therefore have not been presented here. However, the reader might find one of them most valuable in meeting certain specific statistical requirements. The first of these tests, the Whitney extension to the Mann-Whitney test (Whitney, 1951), is a significance test for three samples. It differs from the more general Kruskal-Wallis test in application in that it is designed to test the prediction that the three averages will occur in a specific order. The second of these tests is Mosteller's k-sample test of slippage (Mosteller, 1948; Mosteller and Tukey, 1950). The third is a k-sample runs test (Mood, 1940). Jonckheere (1954) presented the fourth, which is a k-sample test against ordered alternatives, i.e., is designed to test the prediction that the k averages will occur in a specific order.
CHAPTER 9
MEASURES OF CORRELATION AND THEIR TESTS OF SIGNIFICANCE
In research in the behavioral sciences, we frequently wish to know whether two sets of scores are related, or to know the degree of their relation. Establishing that a correlation exists between two variables may be the ultimate aim of a research, as in some studies of personality dynamics, trait clusters, intragroup similarities, etc. Or establishing a correlation may be but one step in a research having other ends, as is the case when we use measures of correlation to test the reliability of our observations.
This chapter will be devoted to the presentation of nonparametric measures of correlation, and the presentation of statistical tests which determine the probability associated with the occurrence of a correlation as large as the one observed in the sample under the null hypothesis that the variables are unrelated in the population. That is, in addition to presenting measures of association we shall present statistical tests which determine the "significance" of the observed association. The problem of measuring degree of association between two sets of scores is quite different in character from that of testing for the existence of an association in some population. It is, of course, of some interest to be able to state the degree of association between two sets of scores from a given group of subjects. But it is perhaps of greater interest to be able to say whether or not some observed association in a sample of scores indicates that the variables under study are most probably associated in the population from which the sample was drawn. The correlation coefficient itself represents the degree of association. Tests of the significance of that coefficient determine, at a stated level of probability, whether the association exists in the population from which a sample was drawn to yield the data from which the coefficient was computed.
In the parametric case, the usual measure of correlation is the Pearson product-moment correlation coefficient r. This statistic requires scores which represent measurement in at least an equal-interval scale. If we wish to test the significance of an observed value of r, we must not only
meet the measurement requirement but we must also assume that the scores are from a bivariate normal population.
If, with a given set of data, the measurement requirement of r is not met or the normality assumption is unrealistic, then use may be made of one of the nonparametric correlation coefficients and associated significance tests presented in this chapter. Nonparametric measures of correlation are available for both nominal and ordinal data. The tests make no assumption about the shape of the population from which the scores were drawn. Some assume that the variables have underlying continuity, while others do not even make this assumption. Moreover, the researcher will find that, especially with small samples, the computation of nonparametric measures of correlation and tests of significance is easier than the computation of the Pearson r. The uses and limitations of each measure will be discussed as the measure is presented. A discussion comparing the merits and uses of the various measures will be offered at the close of the chapter.
THE CONTINGENCY COEFFICIENT: C
Function
The contingency coefficient C is a measure of the extent of association or relation between two sets of attributes. It is uniquely useful when we have only categorical (nominal scale) information about one or both sets of these attributes. That is, it may be used when the information about the attributes consists of an unordered series of frequencies.
To use the contingency coefficient, it is not necessary that we be able to assume underlying continuity for the various categories used to measure either or both sets of attributes. In fact, we do not even need to be able to order the categories in any particular way. The contingency coefficient, as computed from a contingency table, will have the same value regardless of how the categories are arranged in the rows and columns.
Method
To compute the contingency coefficient between scores on two sets of categories, say A1, A2, A3, ..., Ak, and B1, B2, B3, ..., Br, we arrange the frequencies in a contingency table like Table 9.1. The data may consist of any number of categories. That is, one may compute a contingency coefficient from a 2 × 2 table, a 2 × 5 table, a 4 × 4 table, a 3 × 7 table, or any k × r table.
In such a table, we may enter expected frequencies for each cell (the Eij's) by determining what frequencies would occur if there were no association
or correlation between the two variables. The larger is the discrepancy between these expected values and the observed cell values, the larger is the degree of association between the two variables and thus the higher is the value of C.
TABLE 9.1. FORM OF THE CONTINGENCY TABLE FROM WHICH C IS COMPUTED
(the general k × r table of observed cell frequencies with row and column totals)
The degree of association between two sets of attributes, whether orderable or not, and irrespective of the nature of the variable (it may be either continuous or discrete) or of the underlying distribution of the attribute (the population distribution may be normal or any other shape), may be found from a contingency table of the frequencies by

C = √( χ² / (N + χ²) )    (9.1)

where

χ² = Σ(i=1 to r) Σ(j=1 to k) (Oij − Eij)² / Eij    (6.3)

and where χ² is computed by the method presented earlier (pages 104 to 111).
In other words, in order to compute C, one first computes the value of χ² by formula (6.3), and then inserts that value into formula (9.1) to get C.
Example
This computation may be illustrated by reference to data which were first presented in Chap. 8, in the discussion of the χ² test for k independent samples. The reader will remember that Hollingshead tested whether the high school curriculums chosen by the youth of Elmtown were dependent on the social class of the youths. Observe
that this is a question of the association between frequencies from an unordered series (high school curriculum) and frequencies from an ordered series (social class status). Hollingshead's data are repeated
in Table 9.2, a 3 × 4 contingency table. For the data in this table,

TABLE 9.2. FREQUENCY OF ENROLLMENT OF ELMTOWN YOUTHS FROM FIVE SOCIAL CLASSES IN THREE ALTERNATIVE HIGH SCHOOL CURRICULUMS*

Curriculum             Class I and II   III    IV    V    Total
College preparatory          23          40    16    2      81
General                      11          75   107   14     207
Commercial                    1          31    60   10     102
Total                        35         146   183   26     390

* Adapted from Table X of Hollingshead, A. B. 1949. Elmtown's youth. New York: Wiley, p. 462, with the kind permission of John Wiley & Sons, Inc.
χ² = 69.2. (The computation of χ² for these data is given on page 177.) Knowing this, we may determine the value of C by using formula (9.1):

C = √( χ² / (N + χ²) )    (9.1)
  = √( 69.2 / (390 + 69.2) )
  = .39

We determine that the correlation, expressed by a contingency coefficient, between social class position and choice of high school curriculum in Elmtown is C = .39.
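The arithmetic of formulas (6.3) and (9.1) can be sketched in Python (this is our illustration, not part of the text; the cell frequencies are taken from Table 9.2, and the result agrees with the book's χ² = 69.2 and C = .39 to rounding):

```python
# Chi-square (formula 6.3) and contingency coefficient C (formula 9.1)
# for the Elmtown data of Table 9.2.
import math

table = [  # rows: curriculums; columns: classes I-II, III, IV, V
    [23, 40, 16, 2],    # college preparatory
    [11, 75, 107, 14],  # general
    [1, 31, 60, 10],    # commercial
]

n = sum(sum(row) for row in table)                 # N = 390
row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]

# formula (6.3): chi-square = sum over cells of (O - E)^2 / E,
# with E = (row total)(column total)/N under independence
chi_sq = sum(
    (obs - row_totals[i] * col_totals[j] / n) ** 2
    / (row_totals[i] * col_totals[j] / n)
    for i, row in enumerate(table)
    for j, obs in enumerate(row)
)

# formula (9.1): C = sqrt(chi-square / (N + chi-square))
c = math.sqrt(chi_sq / (n + chi_sq))
```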
Testing the Significance of the Contingency Coefficient
The scores or observations with which we deal in research are frequently from individuals in whom we are interested because they constitute a random sample from a population in which we are interested. If we observe a correlation between two sets of attributes in the sample, we may wish to determine whether it is plausible for us to conclude that they are associated in the population which is represented by the sample.
If a group of subjects constitutes a random sample from some population, we may determine whether the association that exists between two sets of scores from that sample indicates that an association exists in the population by testing the association for "significance." In testing the significance of a measure of association, we are testing the null hypothesis that there is no correlation in the population, i.e., that the observed value of the measure of association in the sample could have arisen by chance in a random sample from a population in which the two variables were not correlated.
In order to test the null hypothesis, we usually ascertain the sampling distribution of the statistic (in this case, the measure of association) under H0. We then use an appropriate statistical test to determine whether our observed value of that statistic can reasonably be thought to have arisen under H0, referring to some predetermined level of significance. If the probability associated with the occurrence under H0 of a value as large as our observed value of the statistic is equal to or less than our level of significance, that is, if p ≤ α, then we decide to reject H0 and we conclude that the observed association in our sample is not a result of chance but rather represents a genuine relation in the population. However, if the statistical test reveals that it is likely that our observed value might have arisen under H0, that is, if p > α, then we decide that our data do not permit us to conclude that there is a relation between the variables in the population from which the sample was drawn. This method of testing hypotheses should by now be thoroughly familiar to the reader. A fuller discussion of the method is given in Chap. 2, and illustrations of its use occur throughout this book.
Now the reader may know that the Pearson product-moment correlation coefficient r may be tested for significance by exactly the method described above. As he reads further in this chapter, he will discover that various nonparametric measures of association are tested for significance by just such a method. As it happens, however, the contingency coefficient is a special case. One reason that we do not refer to the sampling distribution of C in order to test an observed C for significance is that the mathematical complexities of such a procedure are considerable.
A better reason, however, is that in the course of computing the value of C we compute a statistic which itself provides a simple and adequate indication of the significance of C. This statistic, of course, is χ². We may test whether an observed C differs significantly from chance simply by determining whether the χ² for the data is significant.
For any k × r contingency table, we may determine the significance of the degree of association (the significance of C) by ascertaining the probability associated with the occurrence under H0 of values as large as the observed value of χ², with df = (k − 1)(r − 1). If that probability is equal to or less than α, the null hypothesis may be rejected at that
level of significance. Table C gives the probability associated with the occurrence under H0 of values as large as an observed χ². If the χ² for the sample values is significant, then we may conclude that in the population the association between the two sets of attributes is not zero.
Example
We have shown that in Elmtown the relation between adolescents' social class status and their curriculum choice is C = .39. In the course of computing C, we determined that χ² = 69.2. Now if we consider the adolescents in Hollingshead's group to be a random sample from some population, we may test whether social class status is related to curriculum choice in that population by testing χ² = 69.2 for significance. By referring to Table C, we may determine that χ² ≥ 69.2 with df = (k − 1)(r − 1) = (4 − 1)(3 − 1) = 6 has probability of occurrence under H0 of less than .001. Thus we could reject H0 at the .001 level of significance, and conclude that social class status and high school curriculum choice are related in the population of which the Elmtown youth are a sample. That is, we conclude that C = .39 is significantly different from zero.
Summary of Procedure
These are the steps in the use of the contingency coefficient:
1. Arrange the observed frequencies in a k × r contingency table, like Table 9.1, where k = the number of categories on which one variable is "scored" and r = the number of categories on which the other variable is "scored."
2. Determine the expected frequency under H0 for each cell by multiplying the two marginal totals common to that cell and then dividing this product by N, the total number of cases. If more than 20 per cent of the cells have expected frequencies of less than 5, or if any cell has an expected frequency of less than 1, combine categories to increase the expected frequencies which are deficient.
3. Using formula (6.3), compute the value of χ² for the data.
4. With this value of χ², compute the value of C, using formula (9.1).
5. To test whether the observed value of C indicates that there is an association between the two variables in the population sampled, determine the associated probability under H0 of a value as large as the observed χ² with df = (k − 1)(r − 1) by referring to Table C. If that probability is equal to or less than α, reject H0 in favor of H1.
Limitations of the Contingency Coefficient
The wide applicability and relatively easy computation of C may seem to make it an ideal all-round
measure of association. This is not the case, because of several limitations or deficiencies of the statistic.
In general, we may say that it is desirable for correlation coefficients to show at least the following characteristics: (a) where there is a complete lack of any association, the coefficient should vanish, i.e., should equal zero, and (b) when the variables show complete dependence on each other, i.e., are perfectly correlated, the coefficient should equal unity, or 1. The contingency coefficient has the first but not the second characteristic: it equals zero when there is no association, but it cannot attain unity. This is the first limitation of C.
The upper limit for the contingency coefficient is a function of the
number of categories. When k = r, the upper limit for C, that is, the C which would occur for two perfectly correlated variables, is

√( (k − 1)/k )

For instance, the upper limit of C for a 2 × 2 table is √(1/2) = .707. For a 3 × 3 table, the maximum value which C can attain is √(2/3) = .816. The fact that the upper limit of C depends on the sizes of k and r creates the second limitation of C. Two contingency coefficients are not comparable unless they are yielded by contingency tables of the same size.
A third limitation of C is that the data must be amenable to the computation of χ² before C may appropriately be used. The reader will remember that the χ² test can properly be used only if fewer than 20 per cent of the cells have an expected frequency of less than 5 and no cell has an expected frequency of less than 1.
A fourth limitation of C is that C is not directly comparable to any other measure of correlation, e.g., the Pearson r, the Spearman rS, or the Kendall τ.
In spite of these limitations, the contingency coefficient is an extremely useful measure of association because of its wide applicability. The contingency coefficient makes no assumptions about the shape of the population of scores, it does not require underlying continuity in the variables under analysis, and it requires only nominal measurement (the least refined variety of measurement) of the variables. Because of this freedom from assumptions and requirements, C may often be used to indicate the degree of relation between two sets of scores to which none of the other measures of association which we have presented is applicable.
Power
Because of its nature and its limitations, we should not expect the contingency coefficient to be very powerful in detecting a relation in the population. However, its ease of computation and its complete freedom from restrictive assumptions recommend its use where other measures of correlation may be inapplicable. Because C is a function of χ², its limiting power distribution, like that of χ², tends to 1 as N becomes large (Cochran, 1952).
References
For other discussions of the contingency coefficient, the reader is referred to Kendall (1948b, chap. 13) and McNemar (1955, pp. 203-207).
THE SPEARMAN RANK CORRELATION COEFFICIENT: rS
Function
Of all the statistics based on ranks, the Spearman rank correlation coefficient was the earliest to be developed and is perhaps the best known today. This statistic, sometimes called rho, is here represented by rS. It is a measure of association which requires that both variables be measured in at least an ordinal scale so that the objects or individuals under study may be ranked in two ordered series.
Rationale
Suppose N individuals are ranked according to two variables. For example, we might arrange a group of students in the order of their scores on the college entrance test and again in the order of their scholastic standing at the end of the freshman year. If the ranking on the entrance test is denoted as X1, X2, X3, ..., XN, and the ranking on scholastic standing is represented by Y1, Y2, Y3, ..., YN, we may use a measure of rank correlation to determine the relation between the X's and the Y's.
We can see that the correlation between entrance test ranks and scholastic standing would be perfect if and only if Xi = Yi for all i's. Therefore, it would seem logical to use the various differences,

di = Xi − Yi

as an indication of the disparity between the two sets of rankings. Suppose Mary McCord received the top score on the entrance examination but placed fifth in her class in scholastic standing. Her d would be −4. John Stanislowski, on the other hand, placed tenth on the entrance examination but leads the class in grades. His d is 9. The magnitude of these various d's gives us an idea of how close is the relation between entrance examination scores and scholastic standing. If the relation between the two sets of ranks were perfect, every d would be zero. The larger the d's, the less perfect must be the association between the two variables.
Now in computing a correlation coefficient it would be awkward to use the d's directly. One difficulty is that the negative d's would
cancel out the positive ones when we tried to determine the total magnitude of the discrepancy. However, if di² is employed rather than di, this difficulty is circumvented. It is clear that the larger are the various di's, the larger will be the value of Σdi².
The derivation of the computing formula for rS is fairly simple. We shall present it here because it may help to expose the nature of the coefficient, and also because the derivation will reveal other forms by which the formula may be expressed. One of these alternative forms will be used later when we find it necessary to correct the coefficient for the
presence of tied scores.
If x = X − X̄, where X̄ is the mean of the scores on the X variable, and if y = Y − Ȳ, then a general expression for a correlation coefficient is (Kendall, 1948a, chap. 2)

r = Σxy / √(Σx²Σy²)    (9.2)

in which the sums are over the N values in the sample. Now when the X's and Y's are ranks, r = rS. The sum of the N integers, 1, 2, ..., N, is

ΣX = N(N + 1)/2

and the sum of their squares, 1², 2², ..., N², can be shown to be

ΣX² = N(N + 1)(2N + 1)/6

Therefore

Σx² = Σ(X − X̄)² = ΣX² − (ΣX)²/N
    = N(N + 1)(2N + 1)/6 − N(N + 1)²/4
    = (N³ − N)/12

and similarly

Σy² = (N³ − N)/12    (9.3)

Now, since d = x − y,

Σd² = Σ(x − y)² = Σx² + Σy² − 2Σxy

so that

Σxy = (Σx² + Σy² − Σd²)/2

But formula (9.2) states that Σxy = rS √(Σx²Σy²) when the observations are ranked. Therefore

rS = (Σx² + Σy² − Σd²) / (2√(Σx²Σy²))    (9.4)
With X and Y in ranks, we may substitute Σx² = Σy² = (N³ − N)/12 into formula (9.4), getting

rS = [ (N³ − N)/12 + (N³ − N)/12 − Σd² ] / [ 2(N³ − N)/12 ]
   = 1 − 6Σd² / (N³ − N)    (9.6)

Inasmuch as di = xi − yi = (Xi − X̄) − (Yi − Ȳ) = Xi − Yi, since X̄ = Ȳ in ranks, we may write

rS = 1 − 6 Σ(i=1 to N) di² / (N³ − N)    (9.7)

Formula (9.7) is the most convenient formula for computing the Spearman rS.
Method
To compute rS, make a list of the N subjects. Next to each subject's entry, enter his rank for the X variable and his rank for the Y variable. Determine then the various values of di = the difference between the two ranks. Square each di, and then sum all values of di² to obtain Σ(i=1 to N) di². Then enter this value and the value of N (the number of subjects) directly in formula (9.7).
Example
As part of a study of the effect of group pressures for conformity upon an individual in a situation involving monetary risk, the researchers* administered the well-known F scale,† a measure of authoritarianism, and a scale designed to measure social status strivings‡ to 12 college students. Information about the correlation between the scores on authoritarianism and those on social status strivings was desired. (Social status strivings were indicated by agreement with such statements as "People shouldn't marry below their social level," "For a date, attending a horse show is better than going to a baseball game," and "It is worthwhile to trace back your family tree.") Table 9.3 gives each of the 12 students' scores on the two scales.

TABLE 9.3. SCORES ON AUTHORITARIANISM AND SOCIAL STATUS STRIVINGS
(each of the 12 students' scores on the two scales)

In order to compute the Spearman rank correlation between these two sets of scores, it was necessary to rank them in two series. The ranks of the scores given in Table 9.3 are shown in Table 9.4, which also shows the various values of di and di². Thus, for example, Table 9.4 shows that the student (student J) who showed the most authoritarianism (on the F scale) also showed the most extreme social status strivings, and thus was assigned a rank of 12 on both variables. The reader will observe that no student's rank on one

* Siegel, S., and Fagan, Joen. The Asch effect under conditions of risk. Unpublished study. The data reported here are from a pilot study.
† Presented in Adorno, T. W., Frenkel-Brunswik, Else, Levinson, D. J., and Sanford, R. N. 1950. The authoritarian personality. New York: Harper.
‡ Siegel, Alberta E., and Siegel, S. An experimental test of some hypotheses in reference group theory. Unpublished study.
TABLE 9.4. RANKS ON AUTHORITARIANISM AND SOCIAL STATUS STRIVINGS
(each student's ranks on the two variables, with di and di² for each; Σdi² = 52)
variable was more than three ranks distant from his rank on the other variable, i.e., the largest di is 3.
From the data shown in Table 9.4, we may compute the value of rS by applying formula (9.7):

rS = 1 − 6 Σ(i=1 to N) di² / (N³ − N)    (9.7)
   = 1 − 6(52) / ((12)³ − 12)
   = .82

We observe that for these 12 students the correlation between authoritarianism and social status strivings is rS = .82.
Tied observations. Occasionally two or more subjects will receive the same score on the same variable. When tied scores occur, each of them is assigned the average of the ranks which would have been assigned had no ties occurred, our usual procedure for assigning ranks to tied observations.
If the proportion of ties is not large, their effect on rS is negligible, and formula (9.7) may still be used for computation. However, if the proportion of ties is large, then a correction factor must be incorporated in the computation of rS.
The effect of tied ranks in the X variable is to reduce the sum of squares, Σx², below the value of (N³ − N)/12; that is,

Σx² < (N³ − N)/12

when there are tied ranks in the X variable. Therefore it is necessary to correct the sum of squares, taking ties into account. The correction factor is T:

T = (t³ − t)/12

where t = the number of observations tied at a given rank. When the sum of squares is corrected for ties, it becomes

Σx² = (N³ − N)/12 − ΣT

where ΣT indicates that we sum the various values of T for all the various groups of tied observations. When a considerable number of ties are present, one uses formula (9.4) (page 203) in computing rS:

rS = (Σx² + Σy² − Σd²) / (2√(Σx²Σy²))    (9.4)

where

Σx² = (N³ − N)/12 − ΣTx
Σy² = (N³ − N)/12 − ΣTy
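The tie correction can be sketched as follows (our illustration, not part of the text; the function names are ours). Each variable's sum of squares is reduced by T = (t³ − t)/12 for every group of t tied ranks, and the corrected sums are then entered in formula (9.4):

```python
# Spearman r_S corrected for ties, via formula (9.4).
import math

def sum_sq_corrected(ranks, n):
    """(N^3 - N)/12 minus the tie correction T = (t^3 - t)/12 per tie group."""
    counts = {}
    for r in ranks:
        counts[r] = counts.get(r, 0) + 1
    sum_t = sum((t**3 - t) / 12 for t in counts.values() if t > 1)
    return (n**3 - n) / 12 - sum_t

def spearman_rs_ties(x_ranks, y_ranks):
    n = len(x_ranks)
    sx2 = sum_sq_corrected(x_ranks, n)
    sy2 = sum_sq_corrected(y_ranks, n)
    d2 = sum((x - y) ** 2 for x, y in zip(x_ranks, y_ranks))
    # formula (9.4)
    return (sx2 + sy2 - d2) / (2 * math.sqrt(sx2 * sy2))
```

For the yielding data discussed below (three pairs of tied ranks in X, none in Y, N = 12), `sum_sq_corrected` gives Σx² = 141.5 and Σy² = 143, the values obtained in the worked example.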
Example with Ties
In the study cited in the previous example, each student was individually observed in the well-known group pressures situation developed by Asch.* In this situation, a group of subjects are asked individually to state which of a group of alternative lines is the same length as a standard line. All but one of these subjects are confederates of the experimenter, and on certain trials they unanimously choose an incorrect match. The naive subject, who is so seated that he is the last asked to report his judgment, has the choice of standing alone in selecting the true match (which is unmistakable to people in situations where no contradictory group pressures are involved) or of "yielding" to group pressures by stating that the incorrect line is the match.
The modification which Siegel and Fagan introduced into this

* Asch, S. E. 1952. Social psychology. New York: Prentice-Hall, pp. 451-476.
experiment was to agree to pay each subject 50 cents for every correct judgment, and to penalize him 50 cents for every incorrect one. The subjects were given $2 at the start of the experiment, and they understood that they could keep all moneys in their possession at the end of the session. So far as the naive subject knew, this agreement held with all members of the group making the judgments. Each naive subject participated in 12 "crucial" trials, i.e., in 12 trials in which the confederates unanimously chose the wrong line as the match. Thus each naive subject could "yield" up to 12 times.

TABLE 9.5. SCORES ON YIELDING AND SOCIAL STATUS STRIVINGS
(each naive subject's number of yieldings and score on the status strivings scale)
As part of the study, the experimenters wanted to know whether yielding in this situation is correlated with social status strivings, as measured by the scale described previously. This was determined by computing a Spearman rank correlation between the scores of each of the 12 naive subjects on the social status strivings scale and the number of times that each yielded to the group pressures. The data on these two variables are presented in Table 9.5. Observe that two of the naive subjects did not yield at all (these were students A and B), whereas one (student L) yielded on every crucial trial.
The scores presented in Table 9.5 are ranked in Table 9.6. Observe that there are 3 sets of tied observations on the X variable (number of yieldings). Two subjects tied at 0; both were given ranks of 1.5. Two tied at 1; both were given ranks of 3.5. And two tied at 8; both were given ranks of 10.5.
Because of the relatively large proportion of tied observations in the X variable, it might be felt that formula (9.4) should be used in
TABLE 9.6. RANKS FOR YIELDING AND SOCIAL STATUS STRIVINGS
(ranks of the 12 naive subjects on the two variables, with di and di² for each)
computing the value of rS. To use that formula, we must first determine the values of Σx² and Σy². Now with 3 sets of tied observations on the X variable, where t = 2 in each set, we have

Σx² = (N³ − N)/12 − ΣTx
    = ((12)³ − 12)/12 − [ (2³ − 2)/12 + (2³ − 2)/12 + (2³ − 2)/12 ]
    = 143 − 1.5
    = 141.5

That is, corrected for ties, Σx² = 141.5. We find Σy² by a comparable method:

Σy² = (N³ − N)/12 − ΣTy

But inasmuch as there are no ties in the Y scores (the scores on social status strivings), ΣTy = 0, and thus

Σy² = ((12)³ − 12)/12 = 143

Corrected for ties, Σx² = 141.5 and Σy² = 143. From the addition
shown in Table 9.6, we know that Σdi² = 109.5. Substituting these values in formula (9.4), we have

rS = (Σx² + Σy² − Σd²) / (2√(Σx²Σy²))    (9.4)
   = (141.5 + 143 − 109.5) / (2√((141.5)(143)))
   = .616

Corrected for ties, the correlation between amount of yielding and degree of social status strivings is rS = .616. If we had computed rS from formula (9.7), i.e., if we had not corrected for ties, we would have found rS = .617. This illustrates the relatively insignificant effect of ties upon the value of the Spearman rank correlation. Notice, however, that the effect of ties is to inflate the value of rS. For this reason, the correction should be used where there is a large proportion of ties in either or both the X and Y variables.
Testing the Significance of rS
If the subjects whose scores were used in computing rS were randomly drawn from some population, we may use those scores to determine whether the two variables are associated in the population. That is, we may wish to test the null hypothesis that the two variables under study are not associated in the population and that the observed value of rS differs from zero only by chance.
Small samples. Suppose that the null hypothesis is true. That is, supposethat there is no relation in the populatiori betweenthe X and Y variables. Now if a sample of X and Y scoresis randomly drawn from that population, for a given rank order of the Y scoresany rank order of the X scores is just as likely as any other rank order of the X scores. And for any given order of the X scores, all possible orders of the Y scoresare equally likely. For N subjects, there are N! possiblerankings of X scores which may occur in associatio»with any give» ranking of Y scores. Since these are equally likely, the probability of the occurrence
of any particular ranking of the X scores with a given ranking of the Y 1
scores is N!
For each of the possible rankings of Y there will be associated a value of
r8. The probability of the occurrenceunder llo of any particular value of r8 is thus proportional to the number of permutations giving rise to that value.
Using formula (9.7), the computation formula for rs, we find that for
N = 2, only two values of rs are possible: +1 and -1. Each of these has probability of occurrence under H0 of 1/2. For N = 3, the possible values of rs are -1, -1/2, +1/2, and +1. Their respective probabilities under H0 are 1/6, 1/3, 1/3, and 1/6. Table P of the Appendix gives critical values of rs which have been
arrived at by a similar method. For N from 4 to 30, the table gives the value of rs which has an associated probability under H0 of p = .05 and the value of rs which has an associated probability under H0 of p = .01. This is a one-tailed table, i.e., the stated probabilities apply when the observed value of rs is in the predicted direction, either positive or negative. If an observed value of rs equals or exceeds the value tabled, that observed value is significant (for a one-tailed test) at the level indicated.
Example*

We have already found that for N = 12 the correlation between authoritarianism and social status strivings is rs = .82. Table P shows that a value as large as this is significant at the p < .01 level (one-tailed test). Thus we could reject H0 at the α = .01 level, concluding that in the population of students from which the sample was drawn, authoritarianism and social status strivings are associated.
* In testing a measure of association for significance, we follow the same six steps which we have followed in all other statistical tests throughout this book. That is, (i) the null hypothesis is that the two variables are unrelated in the population, whereas H1 is that they are related or associated in the population; (ii) the statistical test is the significance test appropriate for the measure of association; (iii) the level of significance is specified in advance, and may be any small probability, for example, α = .05 or α = .01, etc., while the N is the number of cases which have yielded scores on both variables; (iv) the sampling distribution is the theoretical distribution of the measure under H0, exact probabilities or critical values from which are given in the tables used to test the measure for significance; (v) the region of rejection consists of all values of the measure of association which are so extreme that the probability associated with their occurrence under H0 is equal to or less than α (and a one-tailed region of rejection is used when the sign of the association is predicted in H1); and (vi) the decision consists of determining the observed value of the measure of association and then determining the probability under H0 associated with such an extreme value; if and only if that probability is equal to or less than α, the decision is to reject H0 in favor of H1.

Because the same sets of data are repeatedly used for illustrative material in the discussions of the various measures of association, in order to highlight the differences and similarities among these measures, the constant repetition of the six steps of statistical inference in the examples would lead to unnecessary redundancy. Therefore we have chosen not to include these six steps in the presentation of the examples in this chapter. We mention here that they might well have been included in order to point out to the reader that the decision-making procedure used in testing the significance of a measure of association is identical to the decision-making procedure used in other sorts of statistical tests.
We have also seen that the relation between social status strivings and amount of yielding is rs = .62 in our group of 12 subjects. By referring to Table P, we can determine that rs ≥ .62 has probability of occurrence under H0 between p = .05 and p = .01 (one-tailed). Thus we could conclude, at the α = .05 level, that these two variables are associated in the population from which the sample was drawn.
Large samples. When N is 10 or larger, the significance of an obtained rs under the null hypothesis may be tested by (Kendall, 1948a, pp. 47-48)

t = rs √[(N - 2) / (1 - rs²)]     (9.8)

That is, for N large, the value defined by formula (9.8) is distributed as Student's t with df = N - 2. Thus the associated probability under H0 of any value as extreme as an observed value of rs may be determined by computing the t associated with that value, using formula (9.8), and then determining the significance of that t by referring to Table B of the Appendix.

Example*
We have already determined that the relation between social status strivings and amount of yielding is rs = .62 for N = 12. Since this N is larger than 10, we may use the large-sample method of testing this rs for significance:

t = .62 √[(12 - 2) / (1 - (.62)²)] = 2.49

Table B shows that for df = N - 2 = 12 - 2 = 10, a t as large as 2.49 is significant at the .025 level but not at the .01 level for a one-tailed test. This is essentially the same result we obtained previously by using Table P. We could reject H0 at α = .05, concluding that social status strivings and amount of yielding are associated in the population of which the 12 students were a sample.

Summary of Procedure
These are the steps in the use of the Spearman rank correlation coefficient:

1. Rank the observations on the X variable from 1 to N. Rank the observations on the Y variable from 1 to N.

* See footnote, page 211.
2. List the N subjects. Give each subject's rank on the X variable and his rank on the Y variable next to his entry.
3. Determine the value of di for each subject by subtracting his Y rank from his X rank. Square this value to determine each subject's di². Sum the di²'s for the N cases to determine Σdi².
4. If the proportion of ties in either the X or the Y observations is large, use formula (9.4) to compute rs. In other cases, use formula (9.7).
5. If the subjects constitute a random sample from some population, one may test whether the observed value of rs indicates an association between the X and Y variables in the population. The method for doing so depends on the size of N:
a. For N from 4 to 30, critical values of rs for the .05 and .01 levels of significance (one-tailed test) are shown in Table P.
b. For N > 10, the significance of a value as large as the observed value of rs may be determined by computing the t associated with that value [using formula (9.8)] and then determining the significance of that value of t by referring to Table B.

Power-Efficiency
The efficiency of the Spearman rank correlation when compared with the most powerful parametric correlation, the Pearson r, is about 91 per cent (Hotelling and Pabst, 1936). That is, when rs is used with a sample to test for the existence of an association in the population, and when the assumptions and requirements underlying the proper use of the Pearson r are met, that is, when the population has a bivariate normal distribution and measurement is in the sense of at least an interval scale, then rs is 91 per cent as efficient as r in rejecting H0. If a correlation between X and Y exists in that population, with 100 cases rs will reveal that correlation at the same level of significance which r attains with 91 cases.

References
For other discussions of the Spearman rank-order correlation, the reader may turn to Hotelling and Pabst (1936), Kendall (1948a; 1948b, chap. 16), and Olds (1949).

THE KENDALL RANK CORRELATION COEFFICIENT: τ
Function
The Kendall rank correlation coefficient, τ (tau), is suitable as a measure of correlation with the same sort of data for which rs is useful. That is, if at least ordinal measurement of both the X and Y variables has been achieved, so that every subject can be assigned a rank on both X and Y,
then τ will give a measure of the degree of association or correlation between the two sets of ranks. The sampling distribution of τ under the null hypothesis is known, and therefore τ, like rs, is subject to tests of significance.

One advantage of τ over rs is that τ can be generalized to a partial correlation coefficient. This partial coefficient will be presented in the section following this one.

Rationale
Suppose we ask judge X and judge Y to rank four objects. For example, we might ask them to rank four essays in order of quality of expository style. We represent the four essays as a, b, c, and d. The obtained rankings are these:

Essay      a   b   c   d
Judge X    3   1   4   2
Judge Y    3   2   1   4

If we rearrange the order of the essays so that judge X's ranks appear in natural order (i.e., 1, 2, ..., N), we get

Judge X    1   2   3   4
Judge Y    2   4   3   1
We are now in a position to determine the degree of correspondence between the judgments of X and Y. Judge X's rankings being in their natural order, we proceed to determine how many pairs of ranks in judge Y's set are in their correct (natural) order with respect to each other.

Consider first all possible pairs of ranks in which judge Y's rank 2, the rank farthest to the left in his set, is one member. The first pair, 2 and 4, has the correct order: 2 precedes 4. Since the order is "natural," we assign a score of +1 to this pair. Ranks 2 and 3 constitute the second pair. This pair is also in the correct order, so it also earns a score of +1. Now the third pair consists of ranks 2 and 1. These ranks are not in "natural" order: 2 precedes 1. Therefore we assign this pair a score of -1. For all pairs which include the rank 2, we total the scores:

(+1) + (+1) + (-1) = +1
Now we consider all possible pairs of ranks which include rank 4 (which is the rank second from the left in judge Y's set) and one succeeding rank. One pair is 4 and 3; the two members of the pair are not in the natural order, so the score for that pair is -1. Another pair is 4 and 1; again a score of -1 is assigned. The total of these scores is (-1) + (-1) = -2.

When we consider rank 3 and succeeding ranks, we get only this pair: 3 and 1. The two members of this pair are in the wrong order; therefore this pair receives a score of -1. The total of all the scores we have assigned is

(+1) + (-2) + (-1) = -2
Now what is the maximum possible total we could have obtained for the scores assigned all the pairs in judge Y's ranking? The maximum possible total would have been yielded if the rankings of judges X and Y had agreed perfectly, for then, when the rankings of judge X were arranged in their natural order, every pair of judge Y's ranks would also be in the correct order and thus every pair would receive a score of +1. The maximum possible total, then, the one which would occur in the case of perfect agreement between X and Y, would be the number of pairs of four things taken two at a time: ½(4)(3) = 6.
The degree of relation between the two sets of ranks is indicated by the ratio of the actual total of +1's and -1's to the maximum possible total. The Kendall rank correlation coefficient τ is that ratio:

τ = actual total / maximum possible total = -2/6 = -.33

That is, τ = -.33 is a measure of the agreement between the ranks assigned to the essays by judge X and those assigned by judge Y. One may think of τ as a function of the minimum number of inversions or interchanges between neighbors which is required to transform one ranking into another. That is, τ is a sort of coefficient of disarray.

Method
We have seen that

τ = actual score / maximum possible score

In general, the maximum possible score will be the number of pairs that can be formed from N objects, which can be expressed as ½N(N - 1). Thus this last expression may be the denominator of the formula for τ. For the numerator, let us denote the observed sum of the +1 and -1 scores for all pairs as S. Then

τ = S / [½N(N - 1)]     (9.9)

where N = the number of objects or individuals ranked on both X and Y.

The calculation of S may be shortened considerably from the method shown above in the discussion of the logic of the measure.
When the ranks of judge X were in the natural order, the corresponding ranks of judge Y were in this order:

Judge Y:   2   4   3   1
We can determine the value of S by starting with the first number on the left and counting the number of ranks to its right which are larger. We then subtract from this the number of ranks to its right which are smaller. If we do this for all ranks and then sum the results, we obtain S. Thus, for the above set of ranks, to the right of rank 2 are ranks 3 and 4, which are larger, and rank 1, which is smaller. Rank 2 thus contributes (+2 - 1) = +1 to S. For rank 4, no ranks to its right are larger but two (ranks 3 and 1) are smaller. Rank 4 thus contributes (0 - 2) = -2 to S. For rank 3, no rank to its right is larger but one (rank 1) is smaller, so rank 3 contributes (0 - 1) = -1 to S. These contributions total

(+1) + (-2) + (-1) = -2 = S
Knowing S, we may use formula (9.9) to compute the value of τ for the ranks assigned by the two judges:

τ = S / [½N(N - 1)] = -2 / [½(4)(4 - 1)] = -.33     (9.9)
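The counting rule for S translates directly into code (a sketch with names of our own choosing; it assumes the X ranks have already been rearranged into natural order):

```python
def kendall_S(y_ranks):
    """For each Y rank, count the larger ranks to its right minus the
    smaller ranks to its right, and sum the results."""
    s = 0
    for i, yi in enumerate(y_ranks):
        s += sum(1 for yj in y_ranks[i + 1:] if yj > yi)
        s -= sum(1 for yj in y_ranks[i + 1:] if yj < yi)
    return s

def kendall_tau(y_ranks):
    """Formula (9.9): tau = S / (N(N - 1)/2), for untied ranks."""
    n = len(y_ranks)
    return kendall_S(y_ranks) / (n * (n - 1) / 2)

print(kendall_S([2, 4, 3, 1]))              # -2, as in the judges example
print(round(kendall_tau([2, 4, 3, 1]), 2))  # -0.33
```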
Example

We have already computed the Spearman rs for 12 students' scores on authoritarianism and on social status strivings. The scores of the 12 students are presented in Table 9.3, and the ranks of these scores are presented in Table 9.4. We may compute the value of τ for the same data.
The two sets of ranks to be correlated (shown in Table 9.4) are these:

Subject                  A   B   C   D   E   F   G   H   I   J   K   L
Status strivings rank    3   4   2   1   8  11  10   6   7  12   5   9
Authoritarianism rank    2   6   5   1  10   9   8   3   4  12   7  11
To compute τ, we shall rearrange the order of the subjects so that the rankings on social status strivings occur in the natural order:

Subject                  D   C   A   B   K   H   I   E   L   G   F   J
Status strivings rank    1   2   3   4   5   6   7   8   9  10  11  12
Authoritarianism rank    1   5   2   6   7   3   4  10  11   8   9  12
Having arranged the ranks on variable X in their natural order, we determine the value of S for the corresponding order of ranks on variable Y:

S = (11 - 0) + (7 - 3) + (9 - 0) + (6 - 2) + (5 - 2) + (6 - 0) + (5 - 0) + (2 - 2) + (1 - 2) + (2 - 0) + (1 - 0) = 44

The authoritarianism rank which is farthest to the left is 1. This rank has 11 ranks which are larger to its right, and 0 ranks which are smaller, so its contribution to S is (11 - 0). The next rank is 5. It has 7 ranks to its right which are larger and 3 to its right which are smaller, so that its contribution to S is (7 - 3). By proceeding in this way, we obtain the various values shown above, which we have summed to yield S = 44.

Knowing that S = 44 and N = 12, we may use formula (9.9) to compute τ:

τ = S / [½N(N - 1)] = 44 / [½(12)(12 - 1)] = .67     (9.9)
τ = .67 represents the degree of relation between authoritarianism and social status strivings shown by the 12 students.

Tied observations. When two or more observations on either the X or the Y variable are tied, we turn to our usual procedure in ranking tied scores: the tied observations are given the average of the ranks they would have received if there were no ties.
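The averaging rule for tied ranks can be sketched as follows (a small helper of our own devising, not from the text):

```python
def rank_with_ties(scores):
    """Assign ranks 1..N, giving tied scores the average of the ranks
    they would have received had they not been tied."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole group of equal scores.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        average = ((i + 1) + (j + 1)) / 2  # average of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = average
        i = j + 1
    return ranks

print(rank_with_ties([0, 0, 1, 1, 2]))  # [1.5, 1.5, 3.5, 3.5, 5.0]
```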
The effect of ties is to change the denominator of our formula for τ. In the case of ties, τ becomes

τ = S / {√[½N(N - 1) - Tx] √[½N(N - 1) - Ty]}     (9.10)

where Tx = ½Σt(t - 1), t being the number of tied observations in each group of ties on the X variable
and Ty = ½Σt(t - 1), t being the number of tied observations in each group of ties on the Y variable

The computations required by formula (9.10) are illustrated in the example which follows.

Example with Ties
Again we shall repeat an example which was first presented in the discussion of the Spearman rs. We correlated the scores of 12 subjects on a scale measuring social status strivings with the number of times that each yielded to group pressures in judging the length of lines. The data for this pilot study are presented in Table 9.5. These scores are converted to ranks in Table 9.6.
The two sets of ranks to be correlated (first presented in Table 9.6) are these:

Subject                  A    B    C    D    E    F    G    H    I    J     K     L
Status strivings rank    3    4    2    1    8   11   10    6    7   12     5     9
Yielding rank           1.5  1.5  3.5  3.5  5    6    7    8    9   10.5  10.5   12
As usual, we first rearrange the order of the subjects, so that the ranks on the X variable occur in natural order:

Subject                  D    C    A    B    K    H    I    E    L    G    F    J
Status strivings rank    1    2    3    4    5    6    7    8    9   10   11   12
Yielding rank           3.5  3.5  1.5  1.5 10.5   8    9    5   12    7    6  10.5

Then we compute the value of S in the usual way:

S = (8 - 2) + (8 - 2) + (8 - 0) + (8 - 0) + (1 - 5) + (3 - 3) + (2 - 3) + (4 - 0) + (0 - 3) + (1 - 1) + (1 - 0) = 25
Having determined that S = 25, we now determine the values of Tx and Ty. There are no ties among the scores on social status strivings, i.e., in the X ranks, and thus Tx = 0.
On the Y variable (yielding), there are three sets of tied ranks. Two subjects are tied at rank 1.5, two are tied at 3.5, and two are tied at 10.5. In each of these cases, t = 2, the number of tied observations. Thus Ty may be computed:

Ty = ½Σt(t - 1) = ½[2(2 - 1) + 2(2 - 1) + 2(2 - 1)] = 3

With Tx = 0, Ty = 3, S = 25, and N = 12, we may determine the value of τ by using formula (9.10):

τ = S / {√[½N(N - 1) - Tx] √[½N(N - 1) - Ty]} = 25 / {√[½(12)(12 - 1) - 0] √[½(12)(12 - 1) - 3]} = .39     (9.10)
If we had not corrected the above coefficient for ties, i.e., if we had used formula (9.9) in computing τ, we would have found τ = .38. Observe that the effect of correcting for ties is relatively small.

Comparison of τ and rs

In two cases we have computed both τ and rs for the same data. The reader will have noted that the numerical values of τ and rs are not identical when both are computed from the same pair of rankings. For the relation between authoritarianism and social status strivings, rs = .82 whereas τ = .67. For the relation between social status strivings and number of yieldings to group pressures, rs = .62 and τ = .39.

These examples illustrate the fact that τ and rs have different underlying scales, and numerically they are not directly comparable to each other. That is, if we measure the degree of correlation between the variables A and B by using rs, and then do the same for A and C by using τ, we cannot then say whether A is more closely related to B or to C, for we shall be using two noncomparable measures of correlation.

However, both coefficients utilize the same amount of information in the data, and thus both have the same power to detect the existence of association in the population. That is, the sampling distributions of τ and rs are such that with a given set of data both will reject the null hypothesis (that the variables are unrelated in the population) at the same level of significance. This should become clearer after the following discussion on testing the significance of τ.
Testing the Significance of τ

If a random sample is drawn from some population in which X and Y are unrelated, and the members of the sample are ranked on X and Y, then for any given order of the X ranks all possible orders of the Y ranks are equally likely. That is, for a given order of the X ranks, any one possible order of the Y ranks is just as likely to occur as any other possible order of the Y ranks. Suppose we order the X ranks in natural order, i.e., 1, 2, 3, ..., N. For that order of the X ranks, all the N! possible orders of the Y ranks are equally probable under H0. Therefore any particular order of the Y ranks has probability of occurrence under H0 of 1/N!.

For each of the N! possible rankings of Y, there will be associated a value of τ. These possible values of τ will range from +1 to -1, and they can be cast in a frequency distribution. For instance, for N = 4 there are 4! = 24 possible arrangements of the Y ranks, and each has an associated value of τ. Their frequency of occurrence under H0 is shown in Table 9.7.

TABLE 9.7. PROBABILITIES OF τ UNDER H0 FOR N = 4
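A table like Table 9.7 can be regenerated by brute-force enumeration (our own sketch, not part of the original):

```python
from fractions import Fraction
from itertools import permutations

def S_of(y_ranks):
    """Sum of +1/-1 pair scores for a Y ordering (X ranks in natural order)."""
    return sum((yj > yi) - (yj < yi)
               for i, yi in enumerate(y_ranks) for yj in y_ranks[i + 1:])

n = 4
freq = {}
for y in permutations(range(1, n + 1)):
    tau = Fraction(S_of(y), n * (n - 1) // 2)
    freq[tau] = freq.get(tau, 0) + 1

# For N = 4 the 4! = 24 orderings give tau = -1, -2/3, -1/3, 0, 1/3, 2/3, 1
# with frequencies 1, 3, 5, 6, 5, 3, 1.
for tau in sorted(freq):
    print(tau, freq[tau])
```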
We could compute similar tables of probabilities for other values of N, but of course as N increases this method becomes increasingly tedious. Fortunately, for N > 8, the sampling distribution of τ is practically indistinguishable from the normal distribution (Kendall, 1948a, pp. 38-39). Therefore, for N large, we may use the normal curve table (Table A) for determining the probability associated with the occurrence under H0 of any value as extreme as an observed value of τ. However, when N is 10 or less, Table Q of the Appendix may be used to determine the exact probability associated with the occurrence (one-tailed) under H0 of any value as extreme as an observed S. (The sampling distributions of S and τ are identical, in a probability sense.
Inasmuch as τ is a function of S, either might be tabled. It is more convenient to tabulate S.) For such small samples, the significance of an observed relation between two samples of ranks may be determined by simply finding the value of S and then referring to Table Q to determine the probability (one-tailed) associated with that value. If that p ≤ α, H0 may be rejected. For example, suppose N = 8 and S = 10. Table Q shows that an S ≥ 10 for N = 8 has probability of occurrence under H0 of p = .138.

When N is larger than 10, τ may be considered to be normally distributed with

mean = μτ = 0

standard deviation = στ = √[2(2N + 5) / 9N(N - 1)]

That is,

z = (τ - μτ) / στ = τ / √[2(2N + 5) / 9N(N - 1)]     (9.11)

is approximately normally distributed with zero mean and unit variance. Thus the probability associated with the occurrence under H0 of any value as extreme as an observed τ may be determined by computing the value of z as defined by formula (9.11) and then determining the significance of that z by reference to Table A of the Appendix.

Example for N > 10*
We have already determined that among 12 students the correlation between authoritarianism and social status strivings is τ = .67. If we consider these 12 students to be a random sample from some population, we may test whether these two variables are associated in that population by using formula (9.11):

z = τ / √[2(2N + 5) / 9N(N - 1)] = .67 / √{2[(2)(12) + 5] / (9)(12)(12 - 1)} = 3.03     (9.11)

By referring to Table A, we see that z ≥ 3.03 has probability of occurrence under H0 of p = .0012. Thus we could reject H0 at

* See footnote, page 211.
level of significance α = .0012, and conclude that the two variables are associated in the population from which this sample was drawn.

We have already mentioned that τ and rs have identical power to reject H0. That is, even though τ and rs are numerically different for the same set of data, their sampling distributions are such that with the same data H0 would be rejected at the same level of significance by the significance tests associated with both measures. In the present case, τ = .67. Associated with this value is z = 3.03, which permits us to reject H0 at α = .0012. When the Spearman coefficient was computed from the same data, we found rs = .82. When we apply to that value the significance test for rs [formula (9.8)], we arrive at t = 4.53 with df = 10. Table B shows that t ≥ 4.53 with df = 10 has probability of occurrence under H0 of slightly higher than .001. Thus τ and rs for the same set of data have significance tests which reject H0 at essentially the same level of significance.

Summary of Procedure

These are the steps in the use of the Kendall rank correlation coefficient:
1. Rank the observations on the X variable from 1 to N. Rank the observations on the Y variable from 1 to N.
2. Arrange the list of N subjects so that the X ranks of the subjects are in their natural order, that is, 1, 2, 3, ..., N.
3. Observe the Y ranks in the order in which they occur when the X ranks are in natural order. Determine the value of S for this order of the Y ranks.
4. If there are no ties among either the X or the Y observations, use formula (9.9) in computing the value of τ. If there are ties, use formula (9.10).
5. If the N subjects constitute a random sample from some population, one may test whether the observed value of τ indicates the existence of an association between the X and Y variables in that population. The method for doing so depends on the size of N:
a. For N of 10 or less, Table Q shows the associated probability (one-tailed) of a value as large as an observed S.
b. For N > 10, one may compute the value of z associated with τ by using formula (9.11). Table A shows the associated probability of a value as large as an observed z.
If the p yielded by the appropriate method is equal to or less than α, H0 may be rejected in favor of H1.
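The whole Kendall procedure, tie correction and large-sample test included, might be sketched as follows (the function names are ours; the formulas are (9.10) and (9.11) from the text):

```python
import math
from collections import Counter

def kendall_S(x_ranks, y_ranks):
    """Sum over all pairs of the product of the order signs on X and Y;
    pairs tied on either variable contribute 0 to S."""
    s, n = 0, len(x_ranks)
    for i in range(n):
        for j in range(i + 1, n):
            x_sign = (x_ranks[j] > x_ranks[i]) - (x_ranks[j] < x_ranks[i])
            y_sign = (y_ranks[j] > y_ranks[i]) - (y_ranks[j] < y_ranks[i])
            s += x_sign * y_sign
    return s

def tie_term(ranks):
    """T = (1/2) * sum of t(t - 1) over the groups of tied ranks."""
    return sum(t * (t - 1) for t in Counter(ranks).values()) / 2

def kendall_tau(x_ranks, y_ranks):
    """Formula (9.10); with no ties it reduces to formula (9.9)."""
    half = len(x_ranks) * (len(x_ranks) - 1) / 2
    tx, ty = tie_term(x_ranks), tie_term(y_ranks)
    return kendall_S(x_ranks, y_ranks) / math.sqrt((half - tx) * (half - ty))

def tau_z(tau, n):
    """Formula (9.11): z = tau / sqrt(2(2N + 5) / 9N(N - 1)), for N > 10."""
    return tau / math.sqrt(2 * (2 * n + 5) / (9 * n * (n - 1)))

# Check against the text: tau = .67 with N = 12 gives z of about 3.03.
print(round(tau_z(0.67, 12), 2))  # 3.03
```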
Power-Efficiency

The Spearman rs and the Kendall τ are equally powerful in rejecting H0, inasmuch as they make equivalent use of the information in the data. When used on data to which the Pearson r is properly applicable, both τ and rs have efficiency of 91 per cent. That is, τ is approximately as sensitive a test of the existence of association between two variables in a bivariate normal population with a sample of 100 cases as is the Pearson r with 91 cases (Hotelling and Pabst, 1936; Moran, 1951).

References
The reader will find other discussions of the Kendall τ in Kendall (1938; 1945; 1947; 1948a; 1948b; 1949).

THE KENDALL PARTIAL RANK CORRELATION COEFFICIENT: τxy.z

Function
When correlation is observed between two variables, there is always the possibility that this correlation is due to the association between each of the two variables and a third variable. For example, among a group of school children of diverse ages, one might find a high correlation between size of vocabulary and height. This correlation may not reflect any genuine or direct relation between these two variables, but rather may result from the fact that both vocabulary size and height are associated with a third variable, age.

Statistically, this problem may be attacked by methods of partial correlation. In partial correlation, the effects of variation by a third variable upon the relation between the X and Y variables are eliminated. In other words, the correlation between X and Y is found with the third variable Z kept constant.
In designing an experiment, one has the alternative of either introducing experimental controls in order to eliminate the influence of the third variable or using statistical methods to eliminate its influence.
For example, one may wish to study the relation between memorization ability and ability to solve certain sorts of problems. Both of these skills may be related to intelligence; therefore in order to determine their direct relation to each other the influence of differences in intelligence must be controlled. To effect experimental control, we might choose subjects with equal intelligence. But if experimental controls are not feasible, then statistical controls can be applied. By the technique of partial
correlation we could hold constant the effect of intelligence on the relation between memorization ability and ability to solve problems, and thereby determine the extent of the direct or uncontaminated relation between these two skills.

In this section we shall present a method of statistical control which may be used with the Kendall rank correlation τ. To use this nonparametric method of partial correlation, we must have data which are measured in at least an ordinal scale. No assumptions about the shape of the population of scores need be made.

Rationale
Suppose we obtain ranks of 4 subjects on 3 variables: X, Y, and Z. We wish to determine the correlation between X and Y when Z is partialled out (held constant). The ranks are:

Subject      a   b   c   d
Rank on Z    1   2   3   4
Rank on X    3   1   2   4
Rank on Y    2   1   3   4
Now if we consider the possible pairs of ranks on any variable, we know that there are ½(4)(3) = 6 possible pairs, four things taken two at a time. Having arranged the ranks on Z in natural order, let us observe every possible pair in the X ranks, the Y ranks, and the Z ranks. We shall assign a + to each of those pairs in which the lower rank precedes the higher, and a - to each pair in which the higher rank precedes the lower:
lower: Pair
(a,b)
(a,c)
(a,d)
(b,c)
(b,d)
(c,d)
That is, for variable X the score for the pair (a,b) is a - because the ranks for a and b, 3 and 1, occur in the "wrong" order: the higher rank precedes the lower. For variable X, the score for the pair (a,c) is also a - because the a rank, 3, is higher than the c rank, 2. For variable Y, the pair (a,c) receives a + because the a rank, 2, is lower than the c rank, 3. Now we may summarize the information we have obtained by casting
it in a 2 × 2 table, Table 9.8.
Consider first the three signs under (a,b) above. For that set of paired ranks, both X and Y are assigned a -, whereas Z is assigned a +. Thus we say that both X and Y "disagree" with Z. We summarize that information by casting pair (a,b) in cell D of Table 9.8. Consider next the pair (a,c). Here Y's sign agrees with Z's sign, but X's sign disagrees with Z's sign. Therefore pair (a,c) is assigned to cell C in Table 9.8. In each case of the remaining pairs, both Y's sign and X's sign agree with Z's sign; thus these 4 pairs are cast in cell A of Table 9.8.

TABLE 9.8

                                              Y pairs whose sign      Y pairs whose sign        Total
                                              agrees with Z's sign    disagrees with Z's sign
X pairs whose sign agrees with Z's sign                4                        0                  4
X pairs whose sign disagrees with Z's sign             1                        1                  2
Total                                                  5                        1                  6

TABLE 9.9. FORM FOR CASTING DATA FOR COMPUTATION BY FORMULA (9.12)

                                              Y pairs whose sign      Y pairs whose sign        Total
                                              agrees with Z's sign    disagrees with Z's sign
X pairs whose sign agrees with Z's sign                A                        B                A + B
X pairs whose sign disagrees with Z's sign             C                        D                C + D
Total                                                A + C                    B + D
In general, for three sets of rankings of N objects, we can use the method illustrated above to derive the sort of table for which Table 9.9 is a model. The Kendall partial rank correlation coefficient, τxy.z (read: the correlation between X and Y with Z held constant), is computed from such a table. It is defined as

τxy.z = (AD - BC) / √[(A + B)(C + D)(A + C)(B + D)]     (9.12)
In the case of the 4 objects we have been considering, i.e., in the case of the data shown in Table 9.8,

τxy.z = [(4)(1) - (0)(1)] / √[(4)(2)(5)(1)] = .63

The correlation between X and Y with the effect of Z held constant is expressed by τxy.z = .63. If we had computed the correlation between X and Y without considering the effect of Z, we would have found τ = .67. This suggests that the relations between X and Z and between Y and Z are only slightly influencing the observed relation between X and Y.
This kind of inference, however, must be made with reservations unless there are relevant prior grounds for expecting whatever effect is observed.

Formula (9.12) is sometimes called the "phi coefficient," and it can be shown to be related to χ² for the fourfold table. The presence of χ² in the expression suggests that τxy.z measures the extent to which X and Y agree independently of their agreement with Z.

Method
Although the method which we have shown for computing τxy.z is useful in revealing the nature of the partial correlation, as N gets larger this method rapidly becomes more tedious because of the rapid increase in the number of pairs, ½N(N - 1). Kendall (1948a, p. 108) has shown that

τxy.z = (τxy - τxz τyz) / √[(1 - τxz²)(1 - τyz²)]     (9.13)*

Formula (9.13) is computationally easier than formula (9.12). To use it, one first must find the correlations (τ's) between X and Y, X and Z, and Y and Z. Having these values, one may use formula (9.13) to find τxy.z. For the X, Y, and Z ranks we have been considering, τxy = .67, τxz = .33, and τyz = .67. Inserting these values in formula (9.13), we have

τxy.z = [.67 - (.67)(.33)] / √{[1 - (.67)²][1 - (.33)²]} = .63

* This formula is directly comparable to that used in finding the parametric partial product-moment correlation. Kendall (1948a, p. 103) states that the similarity seems to be merely coincidental.
Using formula (9.13), we arrive at the same value of τxy.z we have already arrived at by using formula (9.12).
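Formula (9.13) reduces to a one-line computation (a sketch; the argument names are ours):

```python
import math

def partial_tau(tau_xy, tau_xz, tau_yz):
    """Formula (9.13):
    tau_xy.z = (tau_xy - tau_xz*tau_yz) / sqrt((1 - tau_xz²)(1 - tau_yz²))."""
    return (tau_xy - tau_xz * tau_yz) / math.sqrt(
        (1 - tau_xz**2) * (1 - tau_yz**2))

# The 4-subject illustration, using the exact values 2/3 and 1/3
# rather than the rounded .67 and .33:
print(round(partial_tau(2/3, 1/3, 2/3), 2))  # 0.63
```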
Example

We have already seen that in the data collected by Siegel and Fagan, the correlation between scores on authoritarianism and scores on social status strivings is τ = .67. However, we have also observed that there is a correlation between social status strivings and amount of conformity (yielding) to group pressures: τ = .39. This may make us wonder whether the first-mentioned correlation

TABLE 9.10. RANKS ON AUTHORITARIANISM, SOCIAL STATUS STRIVINGS, AND CONFORMITY
Subject   Social status strivings rank   Authoritarianism rank   Conformity (yielding) rank
A                      3                           2                        1.5
B                      4                           6                        1.5
C                      2                           5                        3.5
D                      1                           1                        3.5
E                      8                          10                        5.0
F                     11                           9                        6.0
G                     10                           8                        7.0
H                      6                           3                        8.0
I                      7                           4                        9.0
J                     12                          12                       10.5
K                      5                           7                       10.5
L                      9                          11                       12.0
simply represents the operation of a third variable: conformity to group pressures. That is, it may be that the subjects' need to conform affects their responses to both the authoritarianism scale and the social status strivings scale, and thus the correlation between the scores on these two scales may be due to an association between each variable and need to conform. We may check whether this is true by computing a partial correlation between authoritarianism and social status strivings, partialling out the effect of need to conform, as indicated by amount of yielding in the Asch situation. The scores for the 12 subjects on each of the three variables are shown in Tables 9.3 and 9.5. The three sets of ranks are shown in Table 9.10. Observe that the variable whose effect we wish to partial out, conformity, is the Z variable.
CORRELATION AND TESTS OF SIGNIFICANCE
We have already determined that the correlation between social status strivings (the X variable) and authoritarianism (the Y variable) is τ_xy = .67. We have also already determined that the correlation between social status strivings and conformity is τ_xz = .39 (this value is corrected for ties). From the data presented in Table 9.10, we may readily determine, using formula (9.10), that the correlation between conformity and authoritarianism is τ_yz = .36 (this value is corrected for ties). With that information, we may determine the value of τ_xy·z by using formula (9.13):

τ_xy·z = (τ_xy - τ_xz τ_yz) / sqrt[(1 - τ_xz²)(1 - τ_yz²)]        (9.13)

= [.67 - (.39)(.36)] / sqrt{[1 - (.39)²][1 - (.36)²]} = .62
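The arithmetic of formula (9.13) can be checked with a short Python sketch. This is illustrative only, not code from the text; the function name is ours:

```python
import math

def kendall_partial_tau(t_xy, t_xz, t_yz):
    """Kendall partial rank correlation, formula (9.13):
    tau_xy.z = (tau_xy - tau_xz * tau_yz)
               / sqrt((1 - tau_xz**2) * (1 - tau_yz**2))
    """
    return (t_xy - t_xz * t_yz) / math.sqrt((1 - t_xz ** 2) * (1 - t_yz ** 2))

# Values from the example: tau_xy = .67, tau_xz = .39, tau_yz = .36
print(round(kendall_partial_tau(0.67, 0.39, 0.36), 2))  # 0.62
```

Note that the formula is symmetric in τ_xz and τ_yz, so interchanging the last two arguments gives the same result.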
We have determined that when conformity is partialled out, the correlation between social status strivings and authoritarianism is τ_xy·z = .62. Since this value is not much smaller than τ_xy = .67, we might conclude that the relation between social status strivings and authoritarianism (as measured by these scales) is relatively independent of the influence of conformity (as measured in terms of amount of yielding to group pressures).

Summary of Procedure. These are the steps in the use of the Kendall partial rank correlation coefficient:
1. Let X and Y be the two variables whose relation is to be determined, and let Z be the variable whose effect on X and Y is to be partialled out or held constant.
2. Rank the observations on the X variable from 1 to N. Do the same for the observations on the Y and Z variables.
3. Using either formula (9.9) or formula (9.10) (the latter is to be used when ties have occurred in either of the variables being correlated), determine the observed values of τ_xy, τ_xz, and τ_yz.
4. With those values, compute the value of τ_xy·z using formula (9.13).

Test of Significance. Unfortunately, the sampling distribution of the Kendall partial rank correlation coefficient is not as yet known, and therefore no tests of the significance of an observed τ_xy·z are now possible. It might be thought that a χ² test could be used. This is not so because the entities in cells A,
B, C, and D of a table like Table 9.9 are not independent (their sum is N(N - 1)/2 rather than N), and a χ² test may properly and meaningfully be made only on independent observations.

References. The reader may find other discussions of this statistic in Kendall (1948a, chap. 8) and in Moran (1951).
THE KENDALL COEFFICIENT OF CONCORDANCE: W
Function
In the previous sections of this chapter, we have been concerned with measures of the correlation between two sets of rankings of N objects or individuals. Now we shall consider a measure of the relation among several rankings of N objects or individuals. When we have k sets of rankings, we may determine the association among them by using the Kendall coefficient of concordance W. Whereas r_s and τ express the degree of association between two variables measured in, or transformed to, ranks, W expresses the degree of association among k such variables. Such a measure may be particularly useful in studies of interjudge or intertest reliability, and also has applications in studies of clusters of variables.

Rationale
As a solution to the problem of ascertaining the over-all agreement among k sets of rankings, it might seem reasonable to find the r_s's (or τ's) between all possible pairs of the rankings and then compute the average of these coefficients to determine the over-all association. In following such a procedure, we would need to compute k(k - 1)/2 rank correlation coefficients. Unless k were very small, such a procedure would be extremely tedious.

The computation of W is much simpler, and W bears a linear relation to the average r_s taken over all groups. If we denote the average value of the Spearman rank correlation coefficients between the k(k - 1)/2 possible pairs of rankings as r_s(av), then it has been shown (Kendall, 1948a, p. 81) that

r_s(av) = (kW - 1) / (k - 1)        (9.14)
Another approach would be to imagine how our data would look if there were no agreement among the several sets of rankings, and then to imagine how they would look if there were perfect agreement among the several sets. The coefficient of concordance would then be an index of the divergence of the actual agreement shown in the data from the maximum possible (perfect) agreement. Very roughly speaking, W is just such a coefficient.

Suppose three company executives are asked to interview six job applicants and to rank them separately in their order of suitability for a job opening. The three independent sets of ranks given by executives X, Y, and Z to applicants a through f might be those shown in Table 9.11.

TABLE 9.11. RANKS ASSIGNED TO SIX JOB APPLICANTS BY THREE COMPANY EXECUTIVES
(Artificial data)

Executive    a    b    c    d    e    f
    X        1    6    3    2    5    4
    Y        1    5    6    4    2    3
    Z        6    3    2    5    4    1
   R_j       8   14   11   11   11    8
The bottom row of Table 9.11, labeled R_j, gives the sums of the ranks assigned to each applicant. Now if the three executives had been in perfect agreement about the applicants, i.e., if they had each ranked the six applicants in the same order, then one applicant would have received three ranks of 1, and thus his sum of ranks, R_j, would be 1 + 1 + 1 = 3 = k. The applicant whom all executives designated as the runner-up would have R_j = 2 + 2 + 2 = 6 = 2k. The least promising applicant would have R_j = 6 + 6 + 6 = 18 = Nk. In fact, with perfect agreement among the executives, the various sums of ranks, R_j, would be these: 3, 6, 9, 12, 15, 18, though not necessarily in that order. In general, when there is perfect agreement among k sets of rankings, we get, for the R_j, the series: k, 2k, 3k, ..., Nk.

On the other hand, if there had been no agreement among the three executives, then the various R_j's would be approximately equal. From this example, it should be clear that the degree of agreement
among the k judges is reflected by the degree of variance among the N sums of ranks. W, the coefficient of concordance, is a function of that degree of variance.

Method
To compute W, we first find the sum of ranks, R_j, in each column of a k × N table. Then we sum the R_j and divide that sum by N to obtain the mean value of the R_j. Each of the R_j may then be expressed as a deviation from the mean value. (We have shown above that the larger are these deviations, the greater is the degree of association among the k sets of ranks.) Finally, s, the sum of squares of these deviations, is found. Knowing these values, we may compute the value of W:

W = s / [(1/12) k² (N³ - N)]        (9.15)

where s = sum of squares of the observed deviations from the mean of the R_j, that is,

s = Σ [R_j - (ΣR_j)/N]²

k = number of sets of rankings, e.g., the number of judges
N = number of entities (objects or individuals) ranked
(1/12) k² (N³ - N) = maximum possible sum of the squared deviations, i.e., the sum s which would occur with perfect agreement among k rankings
For the data shown in Table 9.11, the rank totals were 8, 14, 11, 11, 11, and 8. The mean of these values is 10.5. To obtain s, we square the deviation of each rank total from that mean value, and then sum those squares:

s = (8 - 10.5)² + (14 - 10.5)² + (11 - 10.5)² + (11 - 10.5)² + (11 - 10.5)² + (8 - 10.5)² = 25.5
Knowing the observed value of s, we may find the value of W for the data in Table 9.11 by using formula (9.15):

W = 25.5 / [(1/12)(3)²(6³ - 6)] = .16

W = .16 expresses the degree of agreement among the three fictitious executives in ranking the six job applicants.
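Formula (9.15) lends itself to a short Python sketch. This is an illustrative implementation, not code from the text; the function name is ours, and the rank matrix below reproduces the Table 9.11 ranks, which are consistent with the rank totals 8, 14, 11, 11, 11, and 8 given in the text:

```python
def concordance_w(ranks):
    """Kendall coefficient of concordance W, formula (9.15), without
    tie correction.  `ranks` holds k lists, one per judge, each
    ranking the same N entities from 1 to N."""
    k, n = len(ranks), len(ranks[0])
    totals = [sum(judge[i] for judge in ranks) for i in range(n)]  # R_j
    mean = sum(totals) / n
    s = sum((r - mean) ** 2 for r in totals)   # observed sum of squares
    return s / (k ** 2 * (n ** 3 - n) / 12)    # divide by its maximum

# Ranks given by executives X, Y, and Z to applicants a through f
table_9_11 = [
    [1, 6, 3, 2, 5, 4],   # X
    [1, 5, 6, 4, 2, 3],   # Y
    [6, 3, 2, 5, 4, 1],   # Z
]
print(round(concordance_w(table_9_11), 2))  # 0.16
```

With perfect agreement (all judges ranking identically) the function returns 1.0, since s then equals its maximum possible value.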
With the same data, we might have found r_s(av) by either of two methods. One way would be first to find the values of r_s(XY), r_s(XZ), and r_s(YZ);
these three values could be averaged. For the data in Table 9.11, r_s(XY) = .31, r_s(XZ) = -.54, and r_s(YZ) = -.54. The average of these values is

r_s(av) = [.31 + (-.54) + (-.54)] / 3 = -.26
Another way to find r_s(av) would be to use formula (9.14):

r_s(av) = (kW - 1) / (k - 1)        (9.14)

= [3(.16) - 1] / (3 - 1) = -.26

Both methods yield the same value: r_s(av) = -.26.
As is shown above, this value bears a linear relation to the value of W.
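Formula (9.14) is a one-liner; a minimal sketch (function name ours) using the values from this example:

```python
def avg_spearman_from_w(w, k):
    """Average Spearman r_s over the k(k - 1)/2 pairs of rankings,
    obtained from W by formula (9.14): r_s(av) = (kW - 1)/(k - 1)."""
    return (k * w - 1) / (k - 1)

print(round(avg_spearman_from_w(0.16, 3), 2))  # -0.26
```

Note that when W = 1 the formula gives r_s(av) = 1 for any k, as it should under perfect agreement.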
One difference between the W and the r_s(av) methods of expressing agreement among k rankings is that r_s(av) may take values between -1 and +1, whereas W may take values only between 0 and +1. The reason that W cannot be negative is that when more than two sets of ranks are involved, the rankings cannot all disagree completely. For example, if judge X and judge Y are in disagreement, and judge X is also in disagreement with judge Z, then judges Y and Z must agree. That is, when more than two judges are involved, agreement and disagreement are not symmetrical opposites. k judges may all agree, but they cannot all disagree completely. Therefore W must be zero or positive.

The reader should notice that W bears a linear relation to r_s but seems to bear no orderly relation to τ. This reveals one of the advantages which r_s has over τ.

Example
Example
Twenty mothers and their deaf preschool children attended a summer camp designed to give introductory training in the treatment and handling of deaf children. A staff of 13 psychologists and speech correctionists worked with the mothers and children during the 2-week camp session. At the end of that period, the 13 staff members were asked to rank the 20 mothers on how likely it was that each mother would rear her child in such a way that the child would suffer personal maladjustment.* These rankings are shown in Table 9.12.

* This example cites unpublished data from research conducted at the 1955 Camp Easter Seal Speech and Hearing Program, Laurel Hill State Park, Pa. The data were made available to the author through the courtesy of the researcher, Dr. J. E. Gordon.
A coefficient of concordance was computed to determine the agreement among the staff members. The mean of the various R_j is 136.5. The deviation of each R_j from that mean, and the square of that deviation, are shown in Table 9.12. The sum of these squares = 64,899 = s. k = 13 = the number of judges. N = 20 = the number of mothers who were ranked.

TABLE 9.12. RANKS ASSIGNED TO 20 MOTHERS BY 13 STAFF MEMBERS
[For each of the 13 judges, the table gives the ranks assigned to mothers 1 through 20, together with each mother's rank sum R_j, its deviation from the mean of the R_j, and the square of that deviation.]
With this information, we may compute W:

W = s / [(1/12) k² (N³ - N)]        (9.15)

= 64,899 / {(1/12)(13)²[(20)³ - 20]} = .577

The agreement among the 13 staff members is expressed by W = .577.
Tied Observations. When tied observations occur, the observations are each assigned the average of the ranks they would have been assigned had no ties occurred, our usual procedure in ranking tied scores.
The effect of tied ranks is to depress the value of W as found by formula (9.15). If the proportion of ties is small, that effect is negligible, and thus formula (9.15) may still be used. If the proportion of ties is large, a correction may be introduced which will increase slightly the value of W over what it would have been if uncorrected. That correction factor is the same one used with the Spearman r_s:

T = Σ(t³ - t) / 12

where t = number of observations in a group tied for a given rank
Σ directs one to sum over all groups of ties within any one of the k rankings

With the correction for ties incorporated, the Kendall coefficient of concordance is

W = s / [(1/12) k² (N³ - N) - k ΣT]        (9.16)

where ΣT directs one to sum the values of T for all the k rankings.
Example with Ties. Kendall (1948a, p. 83) has given an example in which 10 objects are each ranked on 3 different variables: X, Y, and Z. The ranks are shown in Table 9.13, which also shows the values of R_j.

TABLE 9.13. RANKS RECEIVED BY TEN ENTITIES ON THREE VARIABLES

Entity:   1     2     3     4     5     6     7     8     9    10
  X       1    4.5    2    4.5    3    7.5    6     9    7.5   10
  Y      2.5    1    2.5   4.5   4.5    8     9    6.5   10    6.5
  Z       2     1    4.5   4.5   4.5   4.5    8     8     8    10
 R_j     5.5   6.5    9   13.5   12    20    23   23.5  25.5  26.5

The mean of the R_j is 16.5. To obtain s, we sum the squared deviations of each R_j from this mean:

s = (5.5 - 16.5)² + (6.5 - 16.5)² + (9 - 16.5)² + (13.5 - 16.5)² + (12 - 16.5)² + (20 - 16.5)² + (23 - 16.5)² + (23.5 - 16.5)² + (25.5 - 16.5)² + (26.5 - 16.5)² = 591
Since the proportion of ties in the ranks is large, we should correct for ties in computing the value of W.

In the X rankings, there are two sets of ties: 2 objects are tied at 4.5 and 2 are tied at 7.5. For both groups, t = the number of observations tied for a given rank = 2. Thus

T_X = Σ(t³ - t)/12 = [(2³ - 2) + (2³ - 2)]/12 = 1

In the Y rankings, there are three sets of ties, and each set contains two observations. Here t = 2 in each case, and

T_Y = [(2³ - 2) + (2³ - 2) + (2³ - 2)]/12 = 1.5

In the Z rankings, there are two sets of ties. One set, tied at 4.5, consists of 4 observations: here t = 4. The other set, tied at rank 8, consists of 3 observations: t = 3. Thus

T_Z = [(4³ - 4) + (3³ - 3)]/12 = 7

Knowing the values of T for the X, Y, and Z rankings, we may find their sum:

ΣT = 1 + 1.5 + 7 = 9.5
With the above information, we may compute W corrected for ties:

W = s / [(1/12) k² (N³ - N) - k ΣT]        (9.16)

= 591 / {(1/12)(3)²[(10)³ - 10] - 3(9.5)} = .828
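The tie correction can be derived mechanically from the rank lists themselves. In the Python sketch below (function names ours, not the book's), the three rankings are consistent with the rank sums R_j and the tied groups described in the text, and each T is computed by counting tied groups with a `Counter` rather than by hand:

```python
from collections import Counter

def tie_correction(ranking):
    """T = sum of (t**3 - t)/12 over groups of tied ranks in one ranking."""
    return sum(t ** 3 - t for t in Counter(ranking).values() if t > 1) / 12

def concordance_w_ties(ranks):
    """Kendall W corrected for ties, formula (9.16)."""
    k, n = len(ranks), len(ranks[0])
    totals = [sum(judge[i] for judge in ranks) for i in range(n)]  # R_j
    mean = sum(totals) / n
    s = sum((r - mean) ** 2 for r in totals)
    sum_t = sum(tie_correction(judge) for judge in ranks)
    return s / (k ** 2 * (n ** 3 - n) / 12 - k * sum_t)

x = [1, 4.5, 2, 4.5, 3, 7.5, 6, 9, 7.5, 10]
y = [2.5, 1, 2.5, 4.5, 4.5, 8, 9, 6.5, 10, 6.5]
z = [2, 1, 4.5, 4.5, 4.5, 4.5, 8, 8, 8, 10]
print(tie_correction(x), tie_correction(y), tie_correction(z))  # 1.0 1.5 7.0
print(round(concordance_w_ties([x, y, z]), 3))  # 0.828
```

When no ties are present, every tie group has t = 1, each T is zero, and the function reduces to formula (9.15).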
If we had disregarded the ties, i.e., if we had used formula (9.15) in computing W, we would have found W = .796 rather than W = .828. This difference illustrates the slightly depressing effect which ties, when uncorrected, exert on the value of W.

Testing the Significance of W
Small samples. We may test the significance of any observed value of W by determining the probability associated with the occurrence under H0 of a value as large as the s with which it is associated. If we obtain the sampling distribution of s by permuting the N ranks in all possible ways in each of the k rankings, we will have (N!)^k sets of possible ranks.
Using these, we may test the null hypothesis that the k sets of rankings are independent by taking from this distribution the probability associated with the occurrence under H0 of a value as large as an observed s. By this method, the distribution of s under H0 has been worked out and certain critical values have been tabled. Table R of the Appendix gives values of s for W's significant at the .05 and .01 levels. This table is applicable for k from 3 to 20, and for N from 3 to 7. If an observed s is equal to or greater than that shown in Table R for a particular level of significance, then H0 may be rejected at that level of significance.

For example, we saw that when k = 3 fictitious executives ranked N = 6 job applicants, their agreement was W = .16. Reference to Table R reveals that the s associated with that value of W (s = 25.5) is not significant. For the association to have been significant at the .05 level, s would have had to be 103.9 or larger.

Large samples. When N is larger than 7, the expression given in formula (9.17) is approximately distributed as chi square with df = N - 1:

χ² = s / [(1/12) k N (N + 1)]        (9.17)
That is, the probability associated with the occurrence under H0 of any value as large as an observed W may be determined by finding χ² by formula (9.17) and then determining the probability associated with so large a value of χ² by referring to Table C of the Appendix. Observe that

s / [(1/12) k N (N + 1)] = k(N - 1)W

and therefore

χ² = k(N - 1)W        (9.18)

Thus one may use formula (9.18), which is computationally simpler than formula (9.17), with df = N - 1, to determine the probability associated with the occurrence under H0 of any value as large as an observed W.
If the value of χ² as computed from formula (9.18) [or, equivalently, from formula (9.17)] equals or exceeds that shown in Table C for a particular level of significance and a particular value of df = N - 1, then the null hypothesis that the k rankings are unrelated may be rejected at that level of significance.

Example.* In the study of ratings by staff persons of the mother-child relations of 20 mothers with their deaf young children, k = 13, N = 20,

* See footnote, page 211.
and we found that W = .577. We may determine the significance of this relation by applying formula (9.18):

χ² = k(N - 1)W        (9.18)

= 13(20 - 1)(.577) = 142.5

Referring to Table C, we find that χ² ≥ 142.5 with df = N - 1 = 20 - 1 = 19 has probability of occurrence under H0 of p < .001. We can conclude with considerable assurance that the agreement among the 13 judges is higher than it would be by chance. The very low probability under H0 associated with the observed value of W enables us to reject the null hypothesis that the judges' ratings are unrelated to each other.
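Formula (9.18) reduces the large-sample test to a single multiplication. A minimal sketch (function name ours) with the values from this example; the critical value of χ² would still be looked up in Table C:

```python
def chi_square_for_w(k, n, w):
    """Large-sample significance test of W, formula (9.18):
    chi-square = k(N - 1)W, referred to df = N - 1."""
    return k * (n - 1) * w

chi2 = chi_square_for_w(13, 20, 0.577)
print(round(chi2, 1), "df =", 20 - 1)  # 142.5 df = 19
```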
Summary of Procedure. These are the steps in the use of W, the Kendall coefficient of concordance:
1. Let N = the number of entities to be ranked, and let k = the number of judges assigning ranks. Cast the observed ranks in a k × N table.
2. For each entity, determine R_j, the sum of the ranks assigned to that entity by the k judges.
3. Determine the mean of the R_j. Express each R_j as a deviation from that mean. Square these deviations, and sum the squares to obtain s.
4. If the proportion of ties in the k sets of ranks is large, use formula (9.16) in computing the value of W; otherwise use formula (9.15).
5. The method for determining whether the observed value of W is significantly different from zero depends on the size of N:
a. If N is 7 or smaller, Table R gives critical values of s associated with W's significant at the .05 and .01 levels.
b. If N is larger than 7, either formula (9.17) or formula (9.18) (the latter is easier) may be used to compute a value of χ² whose significance, for df = N - 1, may be tested by reference to Table C.
Interpretation of W. A high or significant value of W may be interpreted as meaning that the observers or judges are applying essentially the same standard in ranking the N objects under study. Often their pooled ordering may serve as a "standard," especially when there is no relevant external criterion for ordering the objects.
It should be emphasized that a high or significant value of W does not mean that the orderings observed are correct. In fact, they may all be incorrect with respect to some external criterion. For example, the 13 staff members of the camp agreed well in judging which mothers and their children were headed for difficulty, but only time can show whether their judgments were sound. It is possible that a variety of judges can agree in ordering objects because all employ the "wrong" criterion. In this case, a high or significant W would simply show that all more or less agree in their use of a "wrong" criterion. To state the point another way, a high degree of agreement about an order does not necessarily mean that the order which was agreed upon is the "objective" one. In the behavioral sciences, especially in psychology, "objective" orderings and "consensual" orderings are often incorrectly thought to be synonymous.
Kendall (1948a, p. 87) suggests that the best estimate of the "true" ranking of the N objects is provided, when W is significant, by the order of the various sums of ranks, R_j. If one accepts the criterion which the various judges have agreed upon (as evidenced by the magnitude and significance of W) in ranking the N entities, then the best estimate of the "true" ranking of those entities according to that criterion is provided by the order of the sums of ranks. This "best estimate" is associated, in a certain sense, with least squares. Thus our best estimate would be that either applicant a or f (see Table 9.11) should be hired for the job opening, for in both of their cases R_j = 8, the lowest value observed. And our best estimate would be that, of the 20 mothers of the deaf children, mother 6 (see Table 9.12), whose R_j = 57 is the smallest of the R_j, is the mother who is most likely to rear a well-adjusted child. Mother 2 is the next most likely, and mother 20 is the mother who, by consensus, is the one most likely to rear a maladjusted child.

References
Discussions of the Kendall coefficient of concordance are contained in Friedman (1940), Kendall (1948a, chap. 6), and Willerman (1955).

DISCUSSION
In this chapter we have presented five nonparametric techniques for measuring the degree of correlation between variables in a sample. For each of these, except the Kendall partial correlation coefficient, tests of the significance of the observed association were presented.

One of these techniques, the coefficient of contingency, is uniquely applicable when the data are in a nominal scale. That is, if the measurement is so crude that the classifications
involved are unrelated within any set and thus cannot be meaningfully ordered, then the contingency coefficient is a meaningful measure of the degree of association in the data. For other suitable measures, see Kruskal and Goodman (1954).
If the variables under study have been measured in at least an ordinal scale, the contingency coefficient may still be used, but an appropriate method of rank correlation will utilize more of the information in the data and therefore is preferable.
For the bivariate case two rank correlation coefficients, the Spearman r_s and the Kendall τ, were presented. The Spearman r_s is somewhat easier to compute, and has the further advantage of being linearly related to the coefficient of concordance W. However, the Kendall τ has the advantages of being generalizable to a partial correlation coefficient and of having a sampling distribution which is practically indistinguishable from a normal distribution for sample sizes as small as 9.
Both r_s and τ have the same power-efficiency (91 per cent) in testing for the existence of a relation in the population. That is, with data which meet the assumptions of the Pearson r, both r_s and τ are as powerful as r for rejecting the null hypothesis when r_s and τ are based on 10 observations for every 9 observations used in computing r.
(on which the associationbetweenX and Y might logically depend),is held constant. r~., is the nonparametric equivalent of the partial
product moment r. However,no test of the significanceof partial r is as yet available. The Kendall coefBcient of concordance W measures the extent of asso-
ciation among several (k) sets of rankings of N entities. It is useful in determining the agreement among several judges or the association
amongthreeor morevariables. It hasspecialapplicationsin providing a standard method of ordering entities accordingto consensuswhen there available no objective order of the entities.
REFERENCES
Anderson, R. L., and Bancroft, T. A. 1952. Statistical theory in research. New York: McGraw-Hill.
Andrews, F. C. 1954. Asymptotic behavior of some rank tests for analysis of variance. Ann. Math. Statist., 25, 724-736.
Auble, D. 1953. Extended tables for the Mann-Whitney statistic. Bull. Inst. Educ. Res. Indiana Univer., 1, No. 2.
Barnard, G. A. 1947. Significance tests for 2 × 2 tables. Biometrika, 34, 123-138.
Bergman, G., and Spence, K. W. 1944. The logic of psychological measurement. Psychol. Rev., 51, 1-24.
Birnbaum, Z. W. 1952. Numerical tabulation of the distribution of Kolmogorov's statistic for finite sample values. J. Amer. Statist. Ass., 47, 425-441.
Birnbaum, Z. W. 1953. Distribution-free tests of fit for continuous distribution functions. Ann. Math. Statist., 24, 1-8.
Birnbaum, Z. W., and Tingey, F. H. 1951. One-sided confidence contours for probability distribution functions. Ann. Math. Statist., 22, 592-596.
Blackwell, D., and Girshick, M. A. 1954. Theory of games and statistical decisions. New York: Wiley.
Blum, J. R., and Fattu, N. A. 1954. Nonparametric methods. Rev. Educ. Res., 24, 467-487.
Bowker, A. H. 1948. A test for symmetry in contingency tables. J. Amer. Statist. Ass., 43, 572-574.
Brown, G. W., and Mood, A. M. 1951. On median tests for linear hypotheses. Proceedings of the second Berkeley symposium on mathematical statistics and probability. Berkeley, Calif.: Univer. of Calif. Press. Pp. 159-166.
Clopper, C. J., and Pearson, E. S. 1934. The use of confidence or fiducial limits illustrated in the case of the binomial. Biometrika, 26, 404-413.
Cochran, W. G. 1950. The comparison of percentages in matched samples. Biometrika, 37, 256-266.
Cochran, W. G. 1952. The χ² test of goodness of fit. Ann. Math. Statist., 23, 315-345.
Cochran, W. G. 1954. Some methods for strengthening the common χ² tests. Biometrics, 10, 417-451.
Coombs, C. H. 1950. Psychological scaling without a unit of measurement. Psychol. Rev., 57, 145-158.
Coombs, C. H. 1952. A theory of psychological scaling. Engng. Res. Inst. Bull., Univer. of Michigan, No. 34.
David, F. N. 1949. Probability theory for statistical methods. New York: Cambridge Univer. Press.
Davidson, D., Siegel, S., and Suppes, P. 1955. Some experiments and related theory on the measurement of utility and subjective probability. Rep. 4, Stanford Value Theory Project.
Dixon, W. J. 1954. Power under normality of several non-parametric tests. Ann. Math. Statist., 25, 610-614.
Dixon, W. J., and Massey, F. J. 1951. Introduction to statistical analysis. New York: McGraw-Hill.
Dixon, W. J., and Mood, A. M. 1946. The statistical sign test. J. Amer. Statist. Ass., 41, 557-566.
Edwards, A. L. 1954. Statistical methods for the behavioral sciences. New York: Rinehart.
Festinger, L. 1946. The significance of differences between means without reference to the frequency distribution function. Psychometrika, 11, 97-105.
Finney, D. J. 1948. The Fisher-Yates test of significance in 2 × 2 contingency tables. Biometrika, 35, 145-156.
Fisher, R. A. 1934. Statistical methods for research workers. (5th Ed.) Edinburgh: Oliver & Boyd.
Fisher, R. A. 1935. The design of experiments. Edinburgh: Oliver & Boyd.
Freund, J. E. 1952. Modern elementary statistics. New York: Prentice-Hall.
Friedman, M. 1937. The use of ranks to avoid the assumption of normality implicit in the analysis of variance. J. Amer. Statist. Ass., 32, 675-701.
Friedman, M. 1940. A comparison of alternative tests of significance for the problem of m rankings. Ann. Math. Statist., 11, 86-92.
Goodman, L. A. 1954. Kolmogorov-Smirnov tests for psychological research. Psychol. Bull., 51, 160-168.
Goodman, L. A., and Kruskal, W. H. 1954. Measures of association for cross classifications. J. Amer. Statist. Ass., 49, 732-764.
Hempel, C. G. 1952. Fundamentals of concept formation in empirical science. Int. Encycl. Unif. Sci., 2, No. 7. (Univer. of Chicago Press.)
Hotelling, H., and Pabst, Margaret R. 1936. Rank correlation and tests of significance involving no assumption of normality. Ann. Math. Statist., 7, 29-43.
Jonckheere, A. R. 1954. A distribution-free k-sample test against ordered alternatives. Biometrika, 41, 133-145.
Kendall, M. G. 1938. A new measure of rank correlation. Biometrika, 30, 81-93.
Kendall, M. G. 1945. The treatment of ties in ranking problems. Biometrika, 33, 239-251.
Kendall, M. G. 1947. The variance of τ when both rankings contain ties. Biometrika, 34, 297-298.
Kendall, M. G. 1948a. Rank correlation methods. London: Griffin.
Kendall, M. G. 1948b. The advanced theory of statistics. Vol. 1. (4th Ed.) London: Griffin.
Kendall, M. G. 1949. Rank and product-moment correlation. Biometrika, 36, 177-193.
Kendall, M. G., and Smith, B. B. 1939. The problem of m rankings. Ann. Math. Statist., 10, 275-287.
Kolmogorov, A. 1941. Confidence limits for an unknown distribution function. Ann. Math. Statist., 12, 461-463.
Kruskal, W. H. 1952. A nonparametric test for the several sample problem. Ann. Math. Statist., 23, 525-540.
Kruskal, W. H., and Wallis, W. A. 1952. Use of ranks in one-criterion variance analysis. J. Amer. Statist. Ass., 47, 583-621.
Latscha, R. 1953. Tests of significance in a 2 × 2 contingency table: Extension of Finney's table. Biometrika, 40, 74-86.
Lehmann, E. L. 1953. The power of rank tests. Ann. Math. Statist., 24, 23-43.
Lewis, D., and Burke, C. J. 1949. The use and misuse of the chi-square test. Psychol. Bull., 46, 433-489.
McNemar, Q. 1946. Opinion-attitude methodology. Psychol. Bull., 43, 289-374.
McNemar, Q. 1947. Note on the sampling error of the difference between correlated proportions or percentages. Psychometrika, 12, 153-157.
McNemar, Q. 1955. Psychological statistics. (2nd Ed.) New York: Wiley.
Mann, H. B., and Whitney, D. R. 1947. On a test of whether one of two random variables is stochastically larger than the other. Ann. Math. Statist., 18, 50-60.
Massey, F. J., Jr. 1951a. The Kolmogorov-Smirnov test for goodness of fit. J. Amer. Statist. Ass., 46, 68-78.
Massey, F. J., Jr. 1951b. The distribution of the maximum deviation between two sample cumulative step functions. Ann. Math. Statist., 22, 125-128.
Mood, A. M. 1940. The distribution theory of runs. Ann. Math. Statist., 11, 367-392.
Mood, A. M. 1950. Introduction to the theory of statistics. New York: McGraw-Hill.
Mood, A. M. 1954. On the asymptotic efficiency of certain non-parametric two-sample tests. Ann. Math. Statist., 25, 514-522.
Moore, G. H., and Wallis, W. A. 1943. Time series significance tests based on signs of differences. J. Amer. Statist. Ass., 38, 153-164.
Moran, P. A. P. 1951. Partial and multiple rank correlation. Biometrika, 38, 26-32.
Moses, L. E. 1952a. Non-parametric statistics for psychological research. Psychol. Bull., 49, 122-143.
Moses, L. E. 1952b. A two-sample test. Psychometrika, 17, 239-247.
Mosteller, F. 1948. A k-sample slippage test for an extreme population. Ann. Math. Statist., 19, 58-65.
Mosteller, F., and Bush, R. R. 1954. Selected quantitative techniques. In G. Lindzey (Ed.), Handbook of social psychology. Vol. 1. Theory and method. Cambridge, Mass.: Addison-Wesley. Pp. 289-334.
Mosteller, F., and Tukey, J. W. 1950. Significance levels for a k-sample slippage test. Ann. Math. Statist., 21, 120-123.
Olds, E. G. 1949. The 5% significance levels for sums of squares of rank differences and a correction. Ann. Math. Statist., 20, 117-118.
Pitman, E. J. G. 1937a. Significance tests which may be applied to samples from any populations. Supplement to J. R. Statist. Soc., 4, 119-130.
Pitman, E. J. G. 1937b. Significance tests which may be applied to samples from any populations. II. The correlation coefficient test. Supplement to J. R. Statist. Soc., 4, 225-232.
Pitman, E. J. G. 1937c. Significance tests which may be applied to samples from any populations. III. The analysis of variance test. Biometrika, 29, 322-335.
Savage, I. R. 1953. Bibliography of nonparametric statistics and related topics. J. Amer. Statist. Ass., 48, 844-906.
Savage, L. J. 1954. The foundations of statistics. New York: Wiley.
Scheffé, H. 1943. Statistical inference in the non-parametric case. Ann. Math. Statist., 14, 305-332.
Siegel, S. 1956. A method for obtaining an ordered metric scale. Psychometrika, 21, 207-216.
Smirnov, N. 1948. Table for estimating the goodness of fit of empirical distributions. Ann. Math. Statist., 19, 279-281.
Smith, K. 1953. Distribution-free statistical methods and the concept of power efficiency. In L. Festinger and D. Katz (Eds.), Research methods in the behavioral sciences. New York: Dryden. Pp. 536-577.
Snedecor, G. W. 1946. Statistical methods. (4th Ed.) Ames, Iowa: Iowa State College Press.
Stevens, S. S. 1946. On the theory of scales of measurement. Science, 103, 677-680.
Stevens, S. S. 1951. Mathematics, measurement, and psychophysics. In S. S. Stevens (Ed.), Handbook of experimental psychology. New York: Wiley. Pp. 1-49.
Stevens, W. L. 1939. Distribution of groups in a sequence of alternatives. Ann. Eugenics, 9, 10-17.
Swed, Frieda S., and Eisenhart, C. 1943. Tables for testing randomness of grouping in a sequence of alternatives. Ann. Math. Statist., 14, 66-87.
Tocher, K. D. 1950. Extension of the Neyman-Pearson theory of tests to discontinuous variates. Biometrika, 37, 130-144.
Tukey, J. W. 1949. Comparing individual means in the analysis of variance. Biometrics, 5, 99-114.
Wald, A. 1950. Statistical decision functions. New York: Wiley.
Walker, Helen M., and Lev, J. 1953. Statistical inference. New York: Holt.
Walsh, J. E. 1946. On the power function of the sign test for slippage of means. Ann. Math. Statist., 17, 358-362.
Walsh, J. E. 1949a. Some significance tests for the median which are valid under very general conditions. Ann. Math. Statist., 20, 64-81.
Walsh, J. E. 1949b. Applications of some significance tests for the median which are valid under very general conditions. J. Amer. Statist. Ass., 44, 342-356.
Welch, B. L. 1937. On the z-test in randomized blocks and Latin squares. Biometrika, 29, 21-52.
White, C. 1952. The use of ranks in a test of significance for comparing two treatments. Biometrics, 8, 33-41.
Whitney, D. R. 1948. A comparison of the power of non-parametric tests and tests based on the normal distribution under non-normal alternatives. Unpublished doctor's dissertation, Ohio State Univer.
Whitney, D. R. 1951. A bivariate extension of the U statistic. Ann. Math. Statist., 22, 274-282.
Wilcoxon, F. 1945. Individual comparisons by ranking methods. Biometrics Bull., 1, 80-83.
Wilcoxon, F. 1947. Probability tables for individual comparisons by ranking methods. Biometrics, 3, 119-122.
Wilcoxon, F. 1949. Some rapid approximate statistical procedures. Stamford, Conn.: American Cyanamid Co.
Wilks, S. S. 1948. Order statistics. Bull. Amer. Math. Soc., 54, 6-50.
Willerman, B. 1955. The adaptation and use of Kendall's coefficient of concordance (W) to sociometric-type rankings. Psychol. Bull., 52, 132-133.
Yates, F. 1934. Contingency tables involving small numbers and the χ² test. Supplement to J. R. Statist. Soc., 1, 217-235.
LIST OF TABLES

Table                                                                    Page
A. Table of Probabilities Associated with Values as Extreme as Observed Values of z in the Normal Distribution .... 247
B. Table of Critical Values of t .... 248
C. Table of Critical Values of Chi Square .... 249
D. Table of Probabilities Associated with Values as Small as Observed Values of x in the Binomial Test .... 250
E. Table of Critical Values of D in the Kolmogorov-Smirnov One-sample Test .... 251
F. Table of Critical Values of r in the Runs Test .... 252
G. Table of Critical Values of T in the Wilcoxon Matched-pairs Signed-ranks Test .... 254
H. Table of Critical Values for the Walsh Test .... 255
I. Table of Critical Values of D (or C) in the Fisher Test .... 256
J. Table of Probabilities Associated with Values as Small as Observed Values of U in the Mann-Whitney Test .... 271
K. Table of Critical Values of U in the Mann-Whitney Test .... 274
L. Table of Critical Values of KD in the Kolmogorov-Smirnov Two-sample Test (Small Samples) .... 278
M. Table of Critical Values of D in the Kolmogorov-Smirnov Two-sample Test (Large Samples: Two-tailed Test) .... 279
N. Table of Probabilities Associated with Values as Large as Observed Values of χr² in the Friedman Two-way Analysis of Variance by Ranks .... 280
O. Table of Probabilities Associated with Values as Large as Observed Values of H in the Kruskal-Wallis One-way Analysis of Variance by Ranks .... 282
P. Table of Critical Values of rS, the Spearman Rank Correlation Coefficient .... 284
Q. Table of Probabilities Associated with Values as Large as Observed Values of S in the Kendall Rank Correlation Coefficient .... 285
R. Table of Critical Values of s in the Kendall Coefficient of Concordance .... 286
S. Table of Factorials .... 287
T. Table of Binomial Coefficients .... 288
U. Table of Squares and Square Roots .... 289
APPENDIX

TABLE A. TABLE OF PROBABILITIES ASSOCIATED WITH VALUES AS EXTREME AS OBSERVED VALUES OF z IN THE NORMAL DISTRIBUTION

The body of the table gives one-tailed probabilities under H0 of z. The left-hand marginal column gives various values of z to one decimal place. The top row gives various values to the second decimal place. Thus, for example, the one-tailed p of z ≥ .11 or z ≤ −.11 is p = .4562.

[Table body not reproduced: one-tailed probabilities for z = .00 to 4.00 in steps of .01, running from .5000 at z = .00 down to .00003 at z = 4.00.]
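The entries of Table A need not be read from a printed table at all; they can be regenerated from the standard normal distribution. A minimal sketch using only the Python standard library (the function name `one_tailed_p` is mine, not the book's):

```python
from math import erfc, sqrt

def one_tailed_p(z):
    """One-tailed probability under H0 of a value as extreme as z:
    P(Z >= |z|) for a standard normal Z -- the body of Table A."""
    return 0.5 * erfc(abs(z) / sqrt(2))

# Reproduces the worked example in the headnote:
print(round(one_tailed_p(0.11), 4))  # 0.4562
```

The same call gives any other entry; for instance `one_tailed_p(1.96)` returns .0250, the familiar two-tailed .05 point.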
TABLE B. TABLE OF CRITICAL VALUES OF t*

The columns give levels of significance for a one-tailed test (.10, .05, .025, .01, .005, .0005) and for a two-tailed test (.20, .10, .05, .02, .01, .001); the rows give critical values of t for df = 1 to 30, 40, 60, 120, and ∞.

[Table body not reproduced; e.g., for df = 1 the values run 3.078, 6.314, 12.706, 31.821, 63.657, 636.619, and for df = ∞ they are 1.282, 1.645, 1.960, 2.326, 2.576, 3.291.]

* Table B is abridged from Table III of Fisher and Yates: Statistical tables for biological, agricultural, and medical research, published by Oliver and Boyd Ltd., Edinburgh, by permission of the authors and publishers.
TABLE C. TABLE OF CRITICAL VALUES OF CHI SQUARE*

The body of the table gives values of chi square; the column headings give the probability under H0 that chi square equals or exceeds the tabled value (.99, .98, .95, .90, .80, .70, .50, .30, .20, .10, .05, .02, .01, .001), for df = 1 to 30.

[Table body not reproduced; e.g., for df = 1 the .05 value is 3.84 and the .01 value is 6.64; for df = 30 the .05 value is 43.77.]

* Table C is abridged from Table IV of Fisher and Yates: Statistical tables for biological, agricultural, and medical research, published by Oliver and Boyd Ltd., Edinburgh, by permission of the authors and publishers.
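For even degrees of freedom the chi-square upper tail has an exact closed form, exp(−x/2) Σ_{k=0}^{df/2−1} (x/2)^k / k!, so entries of Table C with even df can be checked without special tables. A sketch (the function name is mine):

```python
from math import exp

def chi_square_upper_tail(x, df):
    """P(chi square >= x) for EVEN df, via the closed-form Poisson sum
    exp(-x/2) * sum_{k=0}^{df/2 - 1} (x/2)^k / k!"""
    if df % 2 != 0 or df <= 0:
        raise ValueError("this closed form holds for even df only")
    term, total = 1.0, 1.0
    for k in range(1, df // 2):
        term *= (x / 2) / k  # builds (x/2)^k / k! incrementally
        total += term
    return exp(-x / 2) * total

print(round(chi_square_upper_tail(5.99, 2), 3))  # the .05 entry for df = 2
```

For odd df no such elementary sum exists; one would fall back on the incomplete gamma function or a published table.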
APPENDIX

TABLE D. TABLE OF PROBABILITIES ASSOCIATED WITH VALUES AS SMALL AS OBSERVED VALUES OF x IN THE BINOMIAL TEST*

Given in the body of this table are one-tailed probabilities under H0 for the binomial test when P = Q = ½. To save space, decimal points are omitted in the p's.

[Table body not reproduced: cumulative binomial probabilities for N = 5 to 25 and x = 0 to 15; e.g., for N = 5, x = 0 the entry is 031, and for N = 10, x = 1 it is 011.]

* Adapted from Table IV, B, of Walker, Helen, and Lev, J. 1953. Statistical inference. New York: Holt, p. 458, with the kind permission of the authors and publisher.
† 1.0 or approximately 1.0.
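Because Table D assumes P = Q = ½, each entry is just a cumulative binomial sum, Σ_{i=0}^{x} C(N, i)/2^N. A short sketch (the function name is mine):

```python
from math import comb

def binomial_p_le(x, n):
    """One-tailed p for the binomial (sign) test with P = Q = 1/2:
    probability of x or fewer successes in n trials under H0."""
    return sum(comb(n, i) for i in range(x + 1)) / 2 ** n

# Reproduces table entries (decimal points omitted in the table):
print(round(binomial_p_le(0, 5), 3))   # 0.031, the "031" entry for N = 5
print(round(binomial_p_le(1, 10), 3))  # 0.011, the "011" entry for N = 10
```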
APPENDIX

TABLE E. TABLE OF CRITICAL VALUES OF D IN THE KOLMOGOROV-SMIRNOV ONE-SAMPLE TEST*

The table gives, for sample size N, critical values of D = maximum |F0(X) − SN(X)| at significance levels .20, .15, .10, .05, and .01. For N = 1 the values are .900, .925, .950, .975, and .995; they shrink as N grows (for N = 20 they are .231, .246, .264, .294, and .356), and for N over 35 they are approximated by 1.07/√N, 1.14/√N, 1.22/√N, 1.36/√N, and 1.63/√N respectively.

[Table body for N = 2 to 35 not reproduced.]

* Adapted from Massey, F. J., Jr. 1951. The Kolmogorov-Smirnov test for goodness of fit. J. Amer. Statist. Ass., 46, 70, with the kind permission of the author and publisher.
APPENDIX

TABLE F. TABLE OF CRITICAL VALUES OF r IN THE RUNS TEST*

Given in the bodies of Table FI and Table FII are various critical values of r for various values of n1 and n2. For the one-sample runs test, any value of r which is equal to or smaller than that shown in Table FI or equal to or larger than that shown in Table FII is significant at the .05 level. For the Wald-Wolfowitz two-sample runs test, any value of r which is equal to or smaller than that shown in Table FI is significant at the .05 level.

[Tables FI and FII not reproduced: lower and upper critical values of r for n1, n2 = 2 to 20.]

* Adapted from Swed, Frieda S., and Eisenhart, C. 1943. Tables for testing randomness of grouping in a sequence of alternatives. Ann. Math. Statist., 14, 83-86, with the kind permission of the authors and publisher.
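The critical values in Tables FI and FII come from the exact sampling distribution of the number of runs, which has a simple combinatorial form. A sketch that tabulates it (the function name is mine; the formula is the standard one underlying Swed and Eisenhart's tables):

```python
from math import comb

def runs_distribution(n1, n2):
    """Exact null distribution of r, the number of runs, in a random
    arrangement of n1 objects of one kind and n2 of another."""
    total = comb(n1 + n2, n1)
    dist = {}
    for r in range(2, n1 + n2 + 1):
        k = r // 2
        if r % 2 == 0:  # even number of runs
            ways = 2 * comb(n1 - 1, k - 1) * comb(n2 - 1, k - 1)
        else:           # odd number of runs
            ways = (comb(n1 - 1, k - 1) * comb(n2 - 1, k)
                    + comb(n1 - 1, k) * comb(n2 - 1, k - 1))
        if ways:
            dist[r] = ways / total
    return dist

# With n1 = n2 = 2 there are 6 arrangements: r = 2 (AABB, BBAA),
# r = 3 (ABBA, BAAB), r = 4 (ABAB, BABA), each with probability 2/6.
print(runs_distribution(2, 2))
```

Summing the tail of this distribution for given n1 and n2 recovers the .05 cutoffs printed in the tables.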
APPENDIX

TABLE G. TABLE OF CRITICAL VALUES OF T IN THE WILCOXON MATCHED-PAIRS SIGNED-RANKS TEST*

[Table body not reproduced.]

* Adapted from Table I of Wilcoxon, F. 1949. Some rapid approximate statistical procedures. New York: American Cyanamid Company, p. 13, with the kind permission of the author and publisher.
APPENDIX

TABLE H. TABLE OF CRITICAL VALUES FOR THE WALSH TEST*

For each N (4 to 15) the table lists one-tailed and two-tailed significance levels together with the test: two-tailed, accept μ ≠ 0 if either the tabled maximum is less than 0 or the tabled minimum is greater than 0; one-tailed, accept μ < 0 if the tabled maximum is less than 0, or accept μ > 0 if the tabled minimum is greater than 0.

[Table body not reproduced: for each N, the critical combinations of the ordered differences di (maxima and minima of the di and of averages ½(di + dj)) and their exact significance levels.]

* Adapted from Walsh, J. E. 1949. Applications of some significance tests for the median which are valid under very general conditions. J. Amer. Statist. Ass., 44, 343, with the kind permission of the author and the publisher.
TABLE I. TABLE OF CRITICAL VALUES OF D (OR C) IN THE FISHER TEST*,†

[Table body not reproduced: for marginal totals A + B and C + D, with B (or A) entered in the middle column, the table gives critical values of D (or C) at significance levels .05, .025, .01, and .005.]

* Adapted from Finney, D. J. 1948. The Fisher-Yates test of significance in 2 × 2 contingency tables. Biometrika, 35, 149-154, with the kind permission of the author and the publisher.
† When B is entered in the middle column, the significance levels are for D. When A is used in place of B, the significance levels are for C.
APPENDIX

TABLE J. TABLE OF PROBABILITIES ASSOCIATED WITH VALUES AS SMALL AS OBSERVED VALUES OF U IN THE MANN-WHITNEY TEST*

[Table body not reproduced: for n2 = 3 to 8 and n1 = 1 to n2, exact values of P(U ≤ u); e.g., for n1 = 2, n2 = 4, P(U ≤ 0) = .067. For n2 = 8 the table also gives the normal-approximation t value for each u.]

* Reproduced from Mann, H. B., and Whitney, D. R. 1947. On a test of whether one of two random variables is stochastically larger than the other. Ann. Math. Statist., 18, 52-54, with the kind permission of the authors and the publisher.
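Entries of Table J can be recomputed exactly: under H0 every assignment of joint ranks to the two samples is equally likely, so P(U ≤ u) follows from brute-force enumeration for small samples. A sketch (the function name is mine):

```python
from itertools import combinations
from math import comb

def mann_whitney_p_le(u, n1, n2):
    """Exact P(U <= u) under H0 for group sizes n1 and n2, enumerating
    all C(n1 + n2, n1) equally likely rank assignments to group 1."""
    count = 0
    for ranks in combinations(range(1, n1 + n2 + 1), n1):
        big_u = sum(ranks) - n1 * (n1 + 1) // 2  # rank-sum form of U
        if big_u <= u:
            count += 1
    return count / comb(n1 + n2, n1)

print(round(mann_whitney_p_le(0, 2, 4), 3))  # 0.067, as in Table J
```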
APPENDIX

TABLE K. TABLE OF CRITICAL VALUES OF U IN THE MANN-WHITNEY TEST*

Table KI. Critical Values of U for a One-tailed Test at α = .001 or for a Two-tailed Test at α = .002
Table KII. Critical Values of U for a One-tailed Test at α = .01 or for a Two-tailed Test at α = .02
Table KIII. Critical Values of U for a One-tailed Test at α = .025 or for a Two-tailed Test at α = .05
Table KIV. Critical Values of U for a One-tailed Test at α = .05 or for a Two-tailed Test at α = .10

[Table bodies not reproduced: critical values of U for n1 = 1 to 20 and n2 = 9 to 20.]

* Adapted and abridged from Tables 1, 3, 5, and 7 of Auble, D. 1953. Extended tables for the Mann-Whitney statistic. Bulletin of the Institute of Educational Research at Indiana University, 1, No. 2, with the kind permission of the author and the publisher.
APPENDIX

TABLE L. TABLE OF CRITICAL VALUES OF KD IN THE KOLMOGOROV-SMIRNOV TWO-SAMPLE TEST
(Small samples)

[Table body not reproduced: for N = 3 to 40, critical values of KD for a one-tailed test* at α = .05 and α = .01 and for a two-tailed test† at α = .05 and α = .01.]

* Abridged from Goodman, L. A. 1954. Kolmogorov-Smirnov tests for psychological research. Psychol. Bull., 51, 167, with the kind permission of the author and the American Psychological Association.
† Derived from Table 1 of Massey, F. J., Jr. 1951. The distribution of the maximum deviation between two sample cumulative step functions. Ann. Math. Statist., 22, 126-127, with the kind permission of the author and the publisher.
APPENDIX

TABLE M. TABLE OF CRITICAL VALUES OF D IN THE KOLMOGOROV-SMIRNOV TWO-SAMPLE TEST
(Large samples: two-tailed test)*

Level of significance   Value of D so large as to call for rejection of H0 at the indicated level of significance, where D = maximum |Sn1(X) − Sn2(X)|

.10     1.22 √((n1 + n2)/(n1 n2))
.05     1.36 √((n1 + n2)/(n1 n2))
.025    1.48 √((n1 + n2)/(n1 n2))
.01     1.63 √((n1 + n2)/(n1 n2))
.005    1.73 √((n1 + n2)/(n1 n2))
.001    1.95 √((n1 + n2)/(n1 n2))

* Adapted from Smirnov, N. 1948. Table for estimating the goodness of fit of empirical distributions. Ann. Math. Statist., 19, 280-281, with the kind permission of the publisher.
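Table M reduces to a single formula: reject at level α when the observed D exceeds c(α)·√((n1 + n2)/(n1 n2)). A sketch (the coefficient dictionary restates the table; the function name is mine):

```python
from math import sqrt

# Two-tailed large-sample coefficients c(alpha) from Table M.
KS_COEFFICIENTS = {.10: 1.22, .05: 1.36, .025: 1.48, .01: 1.63, .005: 1.73, .001: 1.95}

def ks_two_sample_critical_d(alpha, n1, n2):
    """Smallest D = max|S_n1(X) - S_n2(X)| calling for rejection of H0
    at the given two-tailed level, for large n1 and n2."""
    return KS_COEFFICIENTS[alpha] * sqrt((n1 + n2) / (n1 * n2))

print(round(ks_two_sample_critical_d(.05, 50, 50), 3))  # 0.272
```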
APPENDIX
TABLE N. TABLEoF PRQBABILITIE8AssoclhTEDwITH VALUE8h8 LARGEhs OBsERVED VALUEs oF xpe IN THE FRIEDMAN Two-whY
ANALY818 oF VARIANcE BY RANKS+
Table NL k 3
N
N~7
N~6
x.'
1.00 1.33 2.38 3.00 4.00 4.33 5.38 6.33 7.00 8.33 9.00 9.33 10.33
12.00
1.000 .956 .740 .570 .430 .252 .184 .142 .072 .052 .029 .012 .0081 .0055 .0017 .00013
. 000 . 286 .857 1.143 2.000 2.571 3.429 3.714 4.571 5.429 6.000 7.148 7.714 8.000 8.857 10.286 10.571 11.148 12.286 14.000
8
.237 .192 .112 .085 .052 .027 .021 .016 .0036 .0027 .0012 .00032 .000021
.25 ,75 1.00 1.75 2.25 3.00 3.25 4.00 4.75 5.25 6.25 6.75 7.00 7.75 9.00 9.25 9.75 10.75 12.00 12.25 13.00 14.25 16.00
9
x.'
Xt 1. 000 .964 .768 .620 .486
N
1.000 . 967 .794 .654 .531 .355 .285 .236 .149 .120 .079 .047 .038 .030 .018 .0080
.0048 .0024 .0011 .00086 .00026 . 000061 .0000036
.000 . 222 . 667 . 889 1.556 2.000 2.667 2.889 3.556 4.222 4.667 5.556 6.000 6.222 6.889 8.000 8.222 8,667 9.556 10.667 10.889 11.556 12.667 13.556 14.000 14.222 14.889 16.222 18.000
1.000 .971 .814 .865 .569 .398 .328 .278 .187 .154 .107 .069 .057 .048 .031 .019 .016 .010 .0035 .0029 .0018 .00066 .00035 .00020 .000097
* Adapted from Friedman, M. 1937. The use of ranks to avoid the assumption of normality implicit in the analysis of variance. J. Amer. Statist. Ass., 32, 688-689, with the kind permission of the author and the publisher.
TABLE N. TABLE OF PROBABILITIES ASSOCIATED WITH VALUES AS LARGE AS OBSERVED VALUES OF χr² IN THE FRIEDMAN TWO-WAY ANALYSIS OF VARIANCE BY RANKS* (Continued)
Table NII. k = 4
2
N
x.'
Xr .0
N3
1. 000
xe 1.000
.0 .3
4 x.'
1.000
5.7
.141
6.0
.105
6.3
.094
6.6
.077
.958
.6
.958
1.2
,834
1.0
.910
.6
1.8
.792
1.8
.727
.9
2.4
.625
2.2
.608
1.2
B.9
.524
.928
3.0
.542
2.6
1.5
.754
7.2
.054
3.6
.458 .375 .208 .167 .042
3.4
1.8
.677
7.5
.062
3.8
2.1
.649
7.8
.036
.524
8.1
.033
8.4
.019
4.2 4.8 54 6.0
4.2
.300
2.4
6.0
.207
5.4
.176
2.7 3.0
.432
8.7
.014
5.8
.148
3.3
.389
9.3
.012
6.6
.075
3.B
.355
9.6
.0069
7.0
.054
3.9
.324
9.9
.0062
7.4
.033
4.5
10.2
.0027
8.2
.017
4.8
10.8
.0016
9.0
.0017
5.1
.190
11.1
.00094
5.4
.158
12.0
.000072
TABLE O. TABLE OF PROBABILITIES ASSOCIATED WITH VALUES AS LARGE AS OBSERVED VALUES OF H IN THE KRUSKAL-WALLIS ONE-WAY ANALYSIS OF VARIANCE BY RANKS*
Sample sizes
Samplesizes n<
ns
21
ns 1
22
32
32
ns
ns 6.4444
.008
6.3000
.011
5.4444
.046
5.4000
.051
4.5111
.098
4.4444
.102
6.7455
.010
. 100
6.7091
.013
3. 8571
.133
5.7909
.046
5.7273
.050
5.3572
.029
4.7091
.092
4.7143
.048
4.7000
.101
4.5000
.067
4.4643
.105
6.6667
.010
5. 1429
.043
4.9667
.048
4.5714
.100
4.8667
.054
4.0000
.129
4.1667
.082
4.0667
.102
2. 7000
.500
43
2
1 22
31
ns
4.5714
.067
3.7143
.200 .300
1
1
2
4.2857
44
1
6.1667 33
33
33
41 42
42
43
1
2
3
1 1
2
1
6. 2500
,011
5.3611
.032
5.1389
.061
4.5556
.100
4.2500
.121
44
2
7.0364 6.8727
.011
5.4545 5.2364
.052
.004
4.5545
.098
6. 4889
.011
4.4455
.103
5.6889
.029
5.6000
.050
7.1439
.010
5.0667
.086
7.1364
.011
4.6222
.100
5.5985
.049
5.5758
.051
4.5455
.099
4.4773
.102
7.6538
.008
7.5385
.011
5.6923
.049
5.6538
.054
4.6539
.097
4.5001
.104
7.2000
44
3
3.5714 4.8214
.057
4.5000
.076
4.0179
.114
6. 0000
.014
5. 3333
.033
5. 1250
.052
4. 4M3
.100
4.1667
.105
51
1
3. 8571
.143
5.8333
.021
52
1
5. 2500
.036
5.2083
.050
5.0000
.048
5.0000
.057
4. 4500
.071
4.0556
.093
4. 2000
.095
3.8889
.129
4.0500
.119
44
4
TABLE O. TABLE OF PROBABILITIES ASSOCIATED WITH VALUES AS LARGE AS OBSERVED VALUES OF H IN THE KRUSKAL-WALLIS ONE-WAY ANALYSIS OF VARIANCE BY RANKS* (Continued)
Sample sizes
n1
ns
52
n3
2
53
Sample sizes n'
5 ' 6308
~050
6.1333
~013
4 ' 5487
~099
5 ' 1600
~034
4.5231
~103
5 ~0400
~056
4 ' 3733
~090
~009
4.2933
~122
7 ' 7604 7 ' 7440 5 ' 6571
~049
~012
5 ~6176
~050
~048
4 ' 6187
~100
4 ' 8711
~052
4.5527
~102
4 ~9600
2
63
3
64
1
42
43
ne
~008
1
53
ni
6.5333
4.0178
~095
3 ' 8400
~123
6 ~9091
~009
6.8218
~010
5.2509
~049
5 ' 1055
~052
4 ' 6509
~091
4.4945
~101
7 ' 0788
~009
6.9818
~011
5.6485
~049
5.5152
~051
4 ' 5333
~097
4 ' 4121
~109
6 ~9545
~008
6 ' 8400
~011
4 ' 9855
~044
4 ' 8600
~056
3 ' 9873
~098
3.9600
~102
55
55
55
1
2
3
~011
7 ' 3091
~009
6.8364
~011
5 ' 1273
~046
4 ' 9091
~053
4 ' 1091
.086
4.0364
~105
7.3385
~010
7 ' 2692
~010
5 ' 3385
~047
5.2462
F051
4 ' 6231
~097
4 ' 5077
~100
7 ' 5780 7 ' 5429 5 ~7055
~010
5 ' 6264
~051
4 ' 5451
~100
4.5363
~102
7.8229 7 ' 7914
~010
5 ~6657
,049
~010 ~046
~010
7 ' 2045
~009
5 ' 6429
~050
7 ' 1182
~010
4 ' 5229
~099
5 ~2727
~049
4 ' 5200
~101
5 ' 2682
~050
4 ' 5409
~098 ~101
8 ~0000 7 ~9800
~009
4 ' 5182
5. 7&00
~049
55
5
.010
7 ' 4449
~010
5 ~6600
~051
7 ' 3949
~011
4 ~5600
~100
5 ' 6564
~049
4 ~5000
~102
* Adapted and abridged from Kruskal, W. H., and Wallis, W. A. 1952. Use of ranks in one-criterion variance analysis. J. Amer. Statist. Ass., 47, 614-617, with the kind permission of the authors and the publisher. (The corrections to this table given by the authors in Errata, J. Amer. Statist. Ass., 48, 910, have been incorporated.)
TABLE P. TABLE OF CRITICAL VALUES OF rS, THE SPEARMAN RANK CORRELATION COEFFICIENT*
* Adapted from Olds, E. G. 1938. Distributions of sums of squares of rank differences for small numbers of individuals. Ann. Math. Statist., 9, 133-148, and from Olds, E. G. 1949. The 5% significance levels for sums of squares of rank differences and a correction. Ann. Math. Statist., 20, 117-118, with the kind permission of the author and the publisher.
TABLE Q. TABLE OF PROBABILITIES ASSOCIATED WITH VALUES AS LARGE AS OBSERVED VALUES OF S IN THE KENDALL RANK CORRELATION COEFFICIENT*
Values of N          Values of N
10
0 42 .625
.592
. 548
.375
.408
.452
.167
.242
.360
.381
.117
.274
.306
.042
.199
.238
10
.138
.179
12
68
.042
1
. 500
.500
. 360
.386
.481
. 285
.281
.364
.136
191
.300 .242
39 7 5
.500
.068
.119
11
.028
.068
.190
.089
13
.0088
.035
.146
14
.054
15
.0014
.015
.108
16
.081
17
.0054
.078
19
.0014
.054
21
.00020
.036
18
.016
20
.0071
22
.0028
24
.00087
26
.00019
28
.000025
.012
28
.023
25
.014
27
.0088
.0012
29
.0046
31
.0028
32
.00012
33
.0011
34
.000025
85
.00047
36
.0000028
37
.00018
39
.000058
41
.000015
43
.0000028
45
.00000028
30
* Adapted by permission from Kendall, M. G., Rank correlation methods, Charles Griffin & Company, Ltd., London, 1948, Appendix Table 1, p. 141.
TABLE R. TABLE OF CRITICAL VALUES OF s IN THE KENDALL COEFFICIENT OF CONCORDANCE*
Values at the .05 level of significance
34
56 8 10
54. 0
64. 4
103.9
157.3
9
49.5
88.4
143.3
217.0
12
71. 9
62.6
112.3
182.4
276.2
14
83. 8
75.7
136.1
221.4
335.2
16
95.8
453.1
18
107.7
571.0
48. 1
101.7
183. 7
299.0
60.0
127.8
231.2
376.7
15
89.8
192.9
349.8
570.5
20
119.7
258.0
468.5
764.4
864.9
1,158.7
Values at the .01 level of significance
* Adapted from Friedman, M. 1940. A comparison of alternative tests of significance for the problem of m rankings. Ann. Math. Statist., 11, 86-92, with the kind permission of the author and the publisher. † Notice that additional critical values of s for N = 3 are given in the right-hand column of this table.
TABLE S. TABLE OF FACTORIALS
N      N!
1      1
2      2
3      6
4      24
5      120
6      720
7      5040
8      40320
9      362880
10     3628800
11     39916800
12     479001600
13     6227020800
14     87178291200
15     1307674368000
16     20922789888000
17     355687428096000
18     6402373705728000
19     121645100408832000
20     2432902008176640000
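Table S can be reproduced (and extended) with Python's standard library; a quick spot-check of a few entries:

```python
import math

# Spot-check Table S: N! for selected N.
for n, value in [(9, 362880), (15, 1307674368000), (20, 2432902008176640000)]:
    assert math.factorial(n) == value

print(math.factorial(20))  # 2432902008176640000
```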
TABLE T. TABLE OF BINOMIAL COEFFICIENTS
Entries are (N k), the number of combinations of N things taken k at a time. Since (N k) = (N N-k), the table stops at k = 10.

 N   (N 0) (N 1) (N 2) (N 3) (N 4)  (N 5)  (N 6)  (N 7)  (N 8)  (N 9)  (N 10)
 1     1     1
 2     1     2     1
 3     1     3     3     1
 4     1     4     6     4     1
 5     1     5    10    10     5      1
 6     1     6    15    20    15      6      1
 7     1     7    21    35    35     21      7      1
 8     1     8    28    56    70     56     28      8      1
 9     1     9    36    84   126    126     84     36      9      1
10     1    10    45   120   210    252    210    120     45     10      1
11     1    11    55   165   330    462    462    330    165     55     11
12     1    12    66   220   495    792    924    792    495    220     66
13     1    13    78   286   715   1287   1716   1716   1287    715    286
14     1    14    91   364  1001   2002   3003   3432   3003   2002   1001
15     1    15   105   455  1365   3003   5005   6435   6435   5005   3003
16     1    16   120   560  1820   4368   8008  11440  12870  11440   8008
17     1    17   136   680  2380   6188  12376  19448  24310  24310  19448
18     1    18   153   816  3060   8568  18564  31824  43758  48620  43758
19     1    19   171   969  3876  11628  27132  50388  75582  92378  92378
20     1    20   190  1140  4845  15504  38760  77520 125970 167960 184756
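Table T's entries are likewise available directly: math.comb (Python 3.8+) reproduces them, and the symmetry that lets the table stop at k = 10 is easy to verify:

```python
import math

# Spot-check Table T.
assert math.comb(10, 5) == 252
assert math.comb(16, 8) == 12870
assert math.comb(20, 10) == 184756

# C(N, k) == C(N, N - k): the reason columns beyond k = 10 are omitted.
assert all(math.comb(18, k) == math.comb(18, 18 - k) for k in range(19))

print(math.comb(20, 10))  # 184756
```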
TABLE U. TABLE OF SQUARES AND SQUARE ROOTS*
Square root
Number
1
1.0000
16 81
6.4031
4 9
1.4142
17 64
6.4807
1.7321
18 49
6.5574
16
2.0000
19 36
6.6332
25
2.2361
20 25
6.7082
36
2.4495
21 16
6.7823
49
2.6458
22 09
6.8557
64
2.8284
23 04
6.9282
81
3.0000
24 01
7.0000
1
23 5 6 87 9
Square root
10
100
3.1623
25 00
7.0711
11
121
3.3166
26 01
12
144
3.4641
27 04
13
169
28 09
7.1414 7.2111 7.2801 7.3485 7.4162 7.4833 7.5498 7.6158
14
196
3.6056 3. 741'7
15
225
3.8730
16 17
256
4.0000
289
4.1231
30 25 31 36 32 49
18
324
4.2426
33 64
29 16
19
361
4.3589
34 81
7.6811
20
400
4.4721
36 00
7.7460
21
441
4. 5826
37 21
7.8102
22
484
4.6904
38 44
7. 8740
23
4.7958
39 69
7.9373
40 96
8.0000
25
529 576 625
5.0000
42 25
8.0623
26
676
5.0990
43 56
27
729
5.1962
44 89
8.1240 8.1854
24
28
784
5.2915
46 24
8.2462
29
841
5.3852
47 61
8.3066
30
900
5.4772
49 00
8.3666 8.4261 8.4853 8.5440 8.6023 8.6603 8.7178 8.7750 8.8318 8.8882 8.9443
31
961
5.5678
32
10 24
5.6569
33
10 89
5.7446
34
11 56
5.8310
35 36 37 38 39 40
12 25 12 96
5.9161 6.0000
13 69
6.0828
1444
6.1644
50 41 51 84 53 29 54 76 56 25 57 76 59 29 60 84
15 21
6.2450
62 41
1600
6.3246
64 00
* By permission from Statistics for students of psychology and education, by H. Sorenson. Copyright 1936, McGraw-Hill Book Company, Inc.
81
Square
Square root
9.0000
146 41
11.0000 11.0454
Square root
Number
65 61
82
67 24
9.0554
148 84
83
68 89
9.1104
1 51 29
11.0905
84
70 56
9.1652
1 53 76
11.1355
85
72 25
9.2195
1 5625
11.1803
86
73 96
9.2736
1 58 76
11.2250
87
75 69
9.3274
1 61 29
11.2694
88
77 44
9.3&08
1 63 84
11.3137
89
79 21
9.4340
16641
11.3578 11.4018
90
81 00
9.4868
1 6900
91
82 81
9.5394
1 71 61
11.4455
92
84 64
9.5917
1 74 24
11.4891
93
86 49
9.6437
17689
11.5326
94
88 36
9.6954
17956
11.5758
95
90 25
9.7468
1 82 25
11.6190
96
92 16
9.7980
18496
11.6619
97
94 09
9.8489
1 87 69
11.7047
98
96 04
9.8995
1 9044
11.7473
9.9499
1 93 21
11.7898
19600
11.8322
99
98 01
100
10000
10.0000
101
10201
10.0499
19881
11.8743
102
10404
10.0995
20164
11.9164
103
10609
10.1489
20449
11.9583
104
1 08 16
10.1980
20736
12.0000
105
11025
10.2470
21025
12.0416
106
1 12 36
10.2956
2 13 16
12.0830
107
11449
10.3441
2 1609
12.1244
108
11664
10.3923
2 1904
12.1655
109
11881
10.4403
22201
12.2066
110
1 2100
10.4881
22500
12.2474
ill
1 23 21
10.5357
2 2801
12.2882 12.3288 12.3693 12.4097 12.4499 12.4900 12.5300 12.5698 12.6095 12.6491
112
12544
10.5830
2 3104
113
12769
10.6301
2 3409
114
12996
10.6771
23716
115
10.7238
24025
10.7703
243 36
117
13225 13456 13689
10.8167
24649
118
1 3924
10.8628
24964
25281 2 5600
116
119
14161
10.9087
120
14400
10.9545
161
2 5921
12.68&6
201
40401
14.1774
162
26244
12.7279
40804
14.2127
163
26569
12.7671
202 203
41209
14.2478
164
12.8062
204
41616
14.2829
165
26896 27225
12.8452
205
14.3178
166
275 56
12.8841
206
4 2025 42436
167
27889
12.9228
207
42849
14.3875
168
28224
12.9615
208
43264
14.4222
169
28561
13.0000
209
43681
14.4568
170
2 8900
13.0384
210
441 00
)4.4914
171
29241
13.0767
211
445 21
)4.5258
172
295 84
13.1149
21?
44944
14.5602
173
29929
13.1529
213
453 69
14.5945
174
302 76
13.1909
214
45796
14.6287
175 176
30625 30976
13.2288
462 25
13.2665
215 216
177
3 13 29
13.3041
217
47089
178
3 )684
13.3417
218
475 24
179
3 2041
13.3791
219
4 7961
1&0
3 2400
13.4164
220
4 8400
14.6629 14.6969 14.7309 14.7648 14.7986 14.8324
181
3 2761
13.4536
221
4 8841
14. &661
182
3 31 24
13.4907
222
492 &4
)4.8997
183
3 3489
13.5277
223
497 29
184
3 3856
13.5647
224
501 76 50625 5 1076 51529 5 1984 5 2441 5 2900
14.9332 14.9666 15.0000 15.0333
185
34225
13.6015
225
186
34596
13.6382
226
187
3 4969
13.6748
227
188
3 53 44
13.7113
228
189 190
3 5721 36100
13.7477 13.7840
229
191
3 6481
13.8203
231
192
3 6864
13.8564
232
193 194 195
3 7249
13.8924
233
3 7636 3 8025 3 8416 3 8809 3 9204 3 9601 40000
13.9284
234
13.9642 14.0357
235 236 237
)4.0712
238
14.1067
239
14.1421
240
196 197 198 199
200
14.0000
230
46656
5 33 61 5 3824 54289 54756 5 5225 5 5696 56) 69 5 6644 5 71 21 5 7600
14.3527
15.0665 15.0997
15.1327 15.1658 15.1987 15.2315 15.2643 15.2971 15.3297
15.3623 15.3948 15.4272 15.4596 15.4919
241
5 8081
15.5242
281
78961
16.7631
242
58564
15.5563
282
795 24
16.7929
249
59049
283
800 89
16.8226
244
595 36
284
806 56
16.8523
245
60025
285
81225
16.8819
246
605
15.5885 15.6205 15.6525 15.6844
286
81796
16.9115
247
6 1009
15.7162
287
823 69
16.9411
248
61504
15.74&0
288
8 2944
16.9706
249
6 2001
15.7797
289
835 21
17.0000
250
625 00
15.8114
290
84100
17.0294
251
69001
15.8490
291
84681
252
635 04
292
85264
253
64009
15.8745 15.9060
293
85849
254
645
15.9374
294
864 36
255 256 257
650 25 655 36
15.9687
295
87025
16.0000
296
87616
66049
16.0312
297
88209
17.0587 17.0880 17.1172 17.1464 17.1756 17.2047 17.2337
258
665 64
16.0624
298
88804
17.2627
259
67081
16.0935
299
89401
17.2916
260
6 7600
16.1245
300
90000
17.3205
261
681
16.1555
90601
17.3494
16
16
262
68644
16.1864
301 302
91204
17.3781
269
69169
16.2179
303
91809
17.4069
264
69696
16.2481
304
924
17.4356
265
702 25
16.2788
305
93025
17.4642
266
707 56
16.3095
306
93636
267
712 89
16.3401
307
942 49
268
71824
16.3707
308
94864
269
729 61 7 2900
16.4012
309
16.4917
310
95481 961 00
1'7.4929 17.5214 17.5499 17.5784 17.6068
16.4621
311
16.4924
312
16.5227
313
275
73441 73984 745 29 75076 75625
276 277
270 271 272
273 274
278 279 280
21
16.5529
314
16.5831
315
761 76
16.6132
316
76729 '7 72 84
16.6433
317
16.6799
318
16.7039
319
16.7392
320
77841 7 8400
16
96/21 973 44 97969 985 96 992 25 998 56 1004 89 10 11 24 1017 61 10 24 00
17.6352 17.6635 17.6918 1'/.7200 17.7482 17.7764
17.8045 1/.8326 17.8606 17.8885
Squareroot 19.0000
321
10 30 41
17.9165
361
13 03 21
322
10 36 84
17.9444
362
13 1044
19.0263
323
1043 29
17.9722
363
13 1769
19.0526
324
10 49 76
18.0000
364
13 2496
19.0788
325
10 56 25
18.0278
365
13 3225
326
10 62 76
18.0555
366
13 3956
19.1050 19.1311
327
10 69 29
18.0831
367
13 46 89
19.1572
328
10 75 84
18.1108
368
13 54 24
19.1833
329
10 82 41
18.1384
369
330
10 89 00
18.1659
370
13 61 61 13 6900
19.2354
331
1095 61
18.1934
371
13 7641
19.2614
332
11 02 24
18.2209
372
13 83 84
19.2873
333
1108 89
18.2483
373
13 91 29
19.3132
334
ll
15 56
13 98 76
ll
22 25
375
14 06 25
336
ll
28 96
18.2757 18.3030 18.3303
374
335
376
337
11 35 69
18.3576
377
1413 76 1421 29
338
11 42 44
18.3848
378
1428
339
11 49 21
18.4120
379
14 36 41
340
11 5600
18.4391
380
14 44 00
19.3391 19.3649 19.3907 19.4165 19.4422 19.4679 19.4936
341
11 62 81
18.4662
381
14 51 61
342
11 69 64
18.4932
382
14 59 24
343
ll
76 49
18.5203
383
1466
344
ll
83 36
18.5472
384
14 74 56 14 82 25
84
19.2094
345
11 90 25
18.5742
385
346
11 97 16
18.6011
386
14 89 96
347
12 04 09
18.6279
387
348
12 ll
14 97 69 15 05 44 15 13 21 15 2100
19.5192 19.5448 19.5704 19.5959 19.6214 19.6469 19.6723 19.6977 19.7231 19.7484
15 28 81 15 36 64 15 44 49 15 52 36 15 60 25 15 68 16 15 7609 15 8404 15 92 01 1600 00
19.7737 19.7990 19.8242 19.8494 19.8746 19.8997 19.9249 19.9499 19.9750 20.0000
04
18.6548
388
18.6815 18.7083
389
349
12 18 01
350
12 25 00
351
12 3201 12 3904
18.7350 18.7617
391
352 353
12 46 09
18.7883
393
354
12 53 16
18.8149
394
355
12 60 25
18.8414 18.8680
395
12 74 49
18.8944
12 81 64
18.9209 18.9473 18.9737
397 398
356 357 358 359 360
12 67 36
12 88 81
12 96 00
390
392
396
399 400
89
401
16 08 01
20.0250
441
19 44 81
21.0000
402
16 16 04
20.0499
442
19 53 64
21.0238
409
16 24 09
20.0749
443
19 62 49
'21.0476
404
16 32 16
20.0998
444
19 71 36
21.0713
405
1640 25
20.1246
445
19 80 25
21.0950
406
1648 36
20.1494
446
19 89 16
21.1187
16 56 49
20.1742
447
19 98 09
21.1424
2007 04
21.1660
407 408
16 64 64
20.1990
448
409
1672 81
20.2237
449
20 16 01
21.1896
410
16 81 00
20.2485
450
20 25 00
21.2132
411
16 89 21
20.2731
451
20 34 01
21.2368
412
16 97 44
20.2978
452
2049 04
413
1705 69
20.3224
453
20 52 09
21.2603 21.2838
414
17 13 96
20.3470
454
20 61 16
21.3073
415
17 22 25
20.3715
455
20 70 25
21.3307
416
17 30 56
20.3961
456
20 79 36
417
17 38 89
20.4206
457
20 88 49
21.3542 21.3776
418
17 47 24
20.4450
458
20 97 64
21.4009
419
1755 61
20.4695
459
21 06 81
21.4243
420
17 64 00
20.4939
460
21 1600
21.4476
421
17 72 41
20. 5189
461
21 25 21
21.4709
422
17 &0 84
20.5426
462
21 3444
21.494?
423
17 89 29
20.5670
463
21 43 69
21.5174
424
17 97 76
20.5913
464
21 52 96
21.5407
18 06 25
20.6155
465
21 62 25
21.5639
426
18 14 76
20.6398
466
21 71 56
21.5870
427
1823 29
20.6640
467
21 &089
21.6102
428
18 31 84
20.6882
468
21 90 24
21.6333
429
18 40 41
20.7123
469
21 99 61
21.6564
430
18 49 00
20.7364
470
22 09 00
21.6795
425
431
18 57 61
20.7605
471
22 1841
432
18 66 24
20.7846
472
22 27 84
433
18 74 89
20.8087
473
22 37 29
434
1883 56
20.8327
474
22 46 76
20.8567
475
22 5625
21.7025 21.7256 21.7486 21.7715 21.7945
435
1892 25
436
19 00 96
20.8&06
476
2265
76
21.8174
497
1909 69
20.9045
477
22 75 29
21.8403
438
19 18 44
20.9284
478
22 84 84
21.8632
499
1927
21
20.9523
479
22 9441
21.8861
440
19 36 00
20.9762
4&0
23 0400
21.9089
481
23 13 61
482
23 23 24
21.9317
521
522
27 1441 27 24 84
22.8254
21.9545
483
23 32 89
21.9773
523
2735
29
22.8692
484
23 42 56
22.0000
524
22.8910
485
23 5225
22.0227
525
486
23 61 96
22.0454
526
27 45 76 27 56 25 27 66 76
22.9347
487
23 71 69
22.0681
527
27 77 29
22.9565
488
23 81 44
22.0907
528
27 87 84
22.9783
489
23 91 21
22.1133
529
490
24 01 00
22.1359
530
27 98 41 28 09 00
23.0000 23.0217 23.0434 23.0651
491
24 10 81
22.1585
531
492
24 20 64
22.1811
532
493
24 30 49
22.2036
533
494
24 40 36
22.2261
534
495
24 50 25
22.2486
535
22.84/3
22.9129
496
24 60 16
22.2711
536
28 19 61 28 30 24 2840 89 28 51 56 28 62 25 28 72 96
497
24 7009
22.2935
537
28 83 69
498
24 80 04
22.3159
538
28 94 44
499
24 90 01
22.3383
539
2905
500
25 0000
22.3607
540
29 16 00
501
25 10 01
22.3830
541
29 26 81
502
25 2004
22.4054
542
29 37 64
503
25 3009
22.4277
543
504
25 40 16
22.4499
544
29 48 49 29 59 36
505
25 5025
22.4722
545
29 70 25
506
25 60 36
22.4944
546 547
29 81 16 29 92 09 3003 04 30 14 01 30 25 00
23.2594 23.2809 23.3024 23.3238 23.3452 23.3666 23.38&0 23.4094 23.4307 23.4521
30 3601 30 47 04 30 58 09 30 69 16 30 80 25 3091 36 31 02 49 31 13 64 31 24 81 31 3600
23.4734 23.4947 23.5160 23.5372 23.5584 23.5797 23.6008 23.6220 23.6432 23.6643
507
25 7049
22.5167
508
25 &064
22.5389
548
509 510
25 90 81
22.5610
549
26 01 00
22.5832
550
511
2611
21
22.6053
512
26 21 44
22.6274
551 552
513 514 515 516 517 51$ 519 520
26 31 69
22.6495
553
26 41 96 2652 25
22.6716
554
26 72 89
22.6936 22.7156 22.7376
2683
24
22.7596
2693 61 27 0400
22.8035
555 556 557 558 559 560
2662
56
22.7816
21
23.0&68
23.1084 23.1301 23.1517 23.1733 23.1948 23.2164 23.2379
by
Square root
Number
561
3147
21
Number
Square
Square root
23. 6854
601
36 12 01
24.5153
36 2404
24.5357
562
31 5844
23.7065
602
563
31 69 69
23.7276
603
36 3609
24.5561
564
31 8096
23.7487
604
36 48 16
24.5764
565
31 92 25
23.7697
605
36 60 25
24.5967
566
32 03 56
23.7908
606
3672
36
24.6171
567
32 14 89
23.8118
607
36 84 49
24.6374
32 26 24
23.8328
608
36 96 64
24.6577
23.8537
609
37 08 81
24.6779
37 2100
24.6982
568 569
32 37 61
570
32 4900
23.8747
610
571 572 573
32 60 41
23.8956
611
3733 21
24.7184
32 71 84
23.9165
612
37 45 44
24.7385
32 83 29
23.9374
613
37 5769
24.7588
3294
23.9583
614
37 69 96
24.7790
33 0625
23.9792
615
37 82 25
24.7992
576
33 1776
24.0000
616
3794
24.8193
577
33 2929
24.0208
617
3806 89
24.8395
578
33 40 84
24.0416
618
38 19 24
24.8596
579
33 5241
24.0624
619
38 31 61
24.8797
580
33 6400
24.0832
620
38 44 00
24.8998
574 575
76
56
581
33 75 61
24.1039
621
38 5641
24.9199
582
33 8724
24.1247
622
38 68 84
24.9399
583
33 98 89
24.1454
623
38 81 29
24.9600
584
34 10 56
24.1661
624
3893 76
24.9800
585
34 22 25
24.1868
625
3906 25
25.0000
626
3918 76
25.0200
586
34 33 96
24.2074
587
34 45 69
24.2281
627
39 31 29
25.0400
588
34 57 44
24.2487
628
3943 84
25.0599
589
34 69 21
24.2693
629
39 5641
25.0799
590
34 81 00
24.2899
630
39 69 00
25.0998
591
3492
24.3105
631
39 81 61
25.1197
592
35 0464
24.3311
632
3994 24
593
35 1649
24.3516
633
40 06 89
25,1396 25.1595
81
594
35 2836
24.3721
634
40 19 56
25.1794
595
35 4025
24.3926
635
4032
25.1992
35 52 16
24.4131
636
40 44 96
25.2190
637
4057
69
25.2389
596
25
597
35 6409
24.4336
598
35 7604
24.4540
638
40 70 44
25.2587
599
35 8801
24.4745
639
40 83 21
600
36 00 00
24.4949
640
40 96 00
25.2784 25.2982
641
41 08 81
25.3180
642
41 21 64
643
Square
Square root
681
46 37 61
26.0960
25.3377
682
46 51 24
26.1151
41 34 49
25.3574
683
46 64 89
26.1343
644
41 47 36
25.3772
684
46 78 56
26.1534
645
41 6025
25.3969
685
4692
25
26.1725
646
41 73 16
25.4165
686
47 05 96
26.1916
647
41 86 09
25.4362
687
47 19 69
26.2107
648
41 99 04
25.4558
688
26.2298
649
42 12 01
25.4755
650
42 2500
25.4951
690
47 33 44 4747 21 4'7 61 00
26.2679
651
42 3801
25.514F
691
4F /481
26.2869
652
42 51 04
25.5343
692
26.3059
653
42 64 09
25.5539
693
47 88 64 48 02 49
654
42 77 16
25.5734
694
48 16 36
26.3439
655
42 90 25
25.5930
695
48 30 25
26.3629
656
43 03 36
25.6125
696
48 44 16
26.3818
657
43 )649
25.6320
697
48 58 09
26.4008
658
43 2964
25.6515
698
48 72 04
26.4197
659
43 42 81
699
48 8601
26.4386
660
43 5600
25.6710 25.6905
700
49 00 00
26.4575
661
43 69 21
25.7099
701
49 14 01
26.4764
662
43 8244
25.7294
702
49 28 04
26.4953
663
43 95 69
25.7488 25.7682
703
49 42 09
26.5141
/04
26.5330 26.5518 26.5707 26.5895 26.6083 26.6271
Number
Number
664
44 08 96
665
25.7876
705
666
44 22 25 4435 56
25.8070
706
49 56 16 49 70 25 49 84 36
667
44 48 89
25.8263
707
49 98 49
668
44 62 24
25.8457
708
669
44 75 61
25.8650
670
44 8900
25.8844
710
50 12 64 50 26 81 50 41 00
671
45 02 41
25.9037
711
672
45 15 84
25.9230
712
673
45 29 29
25.9422
713
674
45 45 45 45 45 46 46
25.9615
714
25.9808 26.0000
715 716
26.0192
717
26.0384
718
26.0576
719
26.0768
720
675 676 677
678 679
680
42 76 56 25 69 76 83 29 96 84 10 41 24 00
.
709
50 55 21 50 69 44 50 83 69 50 97 96 51 12 25 51 26 56 51 40 89 51 55 24 51 69 61 51 8400
26.2488
26.3249
26.6458
26.6646 26.6833 26.7021 26.7208 26.7395 26.7582 26.7769 26.7955 26.8142 26.8328
721
51 98 41
26.8514
761
5791 21
27.5862
722
52 12 84
26.8701
762
58 06 44
27.6043
723
52 2729
26.8887
763
58 21 69
27.6225
724
52 41 76
26.9072
764
58 3696
27.6405
725
52 56 25
26.9258
765
58 52 25
27.6586
726
52 70 76
26.9444
766
5867
27.6767
727
52 85 29
26.9629
767
58 82 89
27.6948
728
5299
26.9815
768
58 98 24
27.7128
729
53 1441
27,0000
769
59 13 61
27.7308
730
53 2900
27.0185
770
59 29 00
27.7489
731
53 43 61
27.0370
771
59 44 41
2'/.7669
732
53 5824
27.0555
772
59 59 84
27.7849
53 72 89
27.0740
773
5975
29
27.8029
734
53 87 56
27.0924
774
59 90 76
27.8209
735
5402 25
27.1109
775
6006
736
54 16 96
27.1293
776
60 21 76
27.8568
737
54 31 69
27.1477
777
60 37 29
27.8747
738
54 46 44
27.1662
778
60 52 84
27.8927
739
54 61 27
27.1846
779
60 68 41
27.9106
60 84 00
27.9285
Number
733
84
56
25
27.8388
740
54 76 00
27.2029
780
741
54 90 81
27.2213
781
60 99 61
27.9464
742
5505 64
27.2397
782
61 15 24
27.9643
743
55 2049
27.2580
783
61 30 89
27.9821
744
55 35 36
27.2764
61 46 56
28.0000
745
55 5025
27.2947
784 '785
61 62 25
28.0179
746
55 65 16
27.3130
786
61 77 96
28.0357
747
55 8009
27.3313
787
61 93 69
28.0535
748
55 95 04
27.3496
788
62 09 44
28.0713
749
56 10 01
27.3679
789
62 25 21
750
56 25 00
27.3861
790
62 41 00
28.0891 28.1069
751
56 40 01
27.4044
791
62 56 81
28.1247
752
56 55 04
27.4226
/92
62 72 64
28.1425
753
56 70 09
27.4408
793
62 88 49
28.1603
754
56 85 16
27.4591
794
63 04 36
755
5700 25 57 15 36
27.4773
795
63 20 25
27.4955
796
63 36 16
28.1780 28.1957 28.2135
757 758
57 3049
27.5136
797
63 5209
28.2312
5745 64
27.5318
798
63 6804
28.2489
759
57 60 81
27.5500
799
63 8401
28.2666
760
57 7600
27.5681
800
64 0000
28.2843
756
801
64 16 01
28.3019
802
64 32 04
28.3196
803
64 48 09
28.3373
804
64 64 16
28.3549
805
64 $025
28.3725
806
64 96 36
807
65 12 49
808
65 28 64
28.4253
65 44 81
28.4429
810
65 61 00
28.4605
$11
65 77 21
28.4781
812
65 93 44
28.4956
813
66 09 69
28.5132
814
6625 96
28.5307
815
66 42 25
28.5482
70 72 81 70 89 64 71 0649 7123 36 71 40 25
29.0000 29.0172 29.0345
28.3901
71 57 16
29.0861
2$.4077
71 Fl 72 72
29,1033
7409 9104 0801 25 00
29.051F 29.0689
29.1204 29.1376 29.1548
816
66 58 56
2$.5657
72 42 01 72 5904 72 7609 72 93 16 73 1025 73 27 36
817
66 74 89
28.5832
73 44 49
818
66 91 24
28.6007
73 61 64
819
67 07 61
2$.6082
820
67 2400
28.6356
73 7881 73 9600
29.2404 29.2575 29.2746 29.2916 29.3087 29.3258
821
67 40 41
28.6531
7413 21
29.3428
822
675684
2$.6705
74 30 44
29.3598
823
67 73 29
28.6880
29.3769
824
67 89 76
28.7054
825
6806
25
28.7228
826
6822
76
28.7402
827
68 3929
28.7576
82$ $30
68 55 84 68 72 41 68 8900
28.7750 28.7924 28.8097
74 47 69 74 64 96 74 82 25 7499 56 75 16 89 75 34 24 75 5161 'FS 69 00
29.4958
831
69 05 61
832
69 22 24
28.8271 28.8444 28.8617
FS 8641 F603 84 76 21 29 76 38 76 76 56 25 76 73 76 76 91 29 77 08 84 77 26 41 77 44 00
29.5127 29.5296 29.5466 29.5635 29.5804 29.5973 29.6142 29.6311 29.6479 29.664$
829
833
69 38 89
834
28.8791
$35
6955 56 69 72 25
836
69 8896
28.9137
837
7005 69
28.9310
838
70 22 44
28.9482
839
/0 39 21
2$.9655
840
70 56 00
28.9828
28.8964
29.1719 29.1890 29.2062 29.2233
29.3939 29.4109
29.4279 29.4449 29.461$ 29.478$
881
77 61 61
29.6816
921
84 &241
30.3480
882
77 79 24
29.6985
922
85 00 84
30.3645
883
77 96 89
29.7153
923
85 1929
30.3809
884
78 14 56
29.7321
924
85 3776
30.3974
885
78 32 25
29.7489
925
85 5625
30.4138
78 49 96
29.7658
926
85 74 76
30.4302
85 93 29
30.4467 30.4631
886 887
78 67 69
29.7825
927
888
78 85 44
29.7993
928
86 11 84
889
7903 21
29.8161
929
86 30 41
30.4795
890
79 21 00
29.8329
930
86 49 00
30.4959
891
7938
81
29.8496
931
86 67 61
30.5123
892
79 56 64
29.8664
932
86 86 24
30.5287
893
79 74 49
29.8831
933
87 04 89
30.5450
894
7992
36
29.8998
934
8723 56
30.5614
895
80 10 25
29.9166
935
8742 25
30.5778
896
8028 16
29.9333
936
87 60 96
30,5941
897
80 46 09
29.9500
937
87 7969
30.6105
898
&0 64 04
29.9666
938
879& 44
30.6268
899
80 &201
29.9833
939
88 17 21
30.6431
900
81 00 00
30.0000
940
88 3600
30.6594
901
81 18 01
30.0167
941
88 54 81
30.6757
902
81 3604
30.0333
942
88 73 64
30.6920
903
81 54 09
30.0500
943
88 92 49
30.7083
904
81 72 16
30.0666
944
89 11 36
30.7246
905
81 90 25
30.0832
945
89 30 25
30.7409
82 08 36
30.0998
946
8949
16
30.7571
907
82 2649
30.1164
947
89 68 09
30.7734
908
82 44 64
30.1330
948
89 87 04
90.7896
909
82 62 81
30.1496
949
90 06 01
30.8058
910
82 81 00
30.1662
950
90 25 00
30.8221
911
82 99 21
30.1828
951
90 44 01
30.8989
912
83 1744
30.1993
952
9063
913
83 9569
30.2159
953
90 82 09
30.8707
914
83 53 96 83 72 25 83 90 56
30.2324
954
91 01 16
30.8869
30.2490
955
91 2025
30.9031
30.2655
956
91 39 36
30.9192 30.9354
915 916
04
30.8545
917
84 08 89
30.2820
957
91 5849
918
84 27 24
30.2985
958
91 77 64
30.9516
919
84 45 61
959
91 96 81
30.9677
920
84 64 00
30.3150 30.3315
960
92 16 00
30.9839
961
92 35 21
31.0000
981
96 23 61
31. 3209
962
92 54 44
31.0161
982
9643
24
31.3369
963
92 73 69
31.0322
983
96 62 89
31.3528
964
92 92 96
31.0483
984
96 82 56
31.3688
965
93 1225
31.0644
985
97 02 25
91.3847
966
93 31 56
31.0805
9&6
97 21 96
31.4006
967
93 5089
31.0966
987
97 41 69
31.4166-
968
93 70 24
31.1127
988
97 61 44
31.4325
969
93 8961
31.1288
989
97 81 21
31.4484
970
94 09 00
31.1448
990
98 0100
31.4643
971
94 28 41
31. 1609
991
98 20 81
31.4802
972
9447 84
31. 1769
992
98 40 64
31.4960
973
94 67 29
31.1929
993
98 60 49
31.5119
974
94 86 76
31.2090
994
98 N36
31.5278
975
95 06 25
31.2250
995
976
95 25 76
31.2410
996
9900 25 99 20 16
31.5496 31.5595
977
95 45 29
31.2570
997
99 4009
31.5753
978
95 64 84
31.2730
998
99 6004
31.5911
979
95 8441
31.2890
999
99 8001
31.6070
9N
96 04 00
31.3050
1000
Number
Number
1000000
31.6228
Binomial distribution, table of associated probabilities, 250
Adams, L., 107 Adorno, T. W., 132n., 186n., 205n.
Alpha (α), definition of, 9
use of (see Binomial test; Sign test) Binomial test, 36-42
and Type I error, 9
(See also Significance level) Alternative hypothesis (H1), definition of, 7
compared with other one-sample tests, 59
function and rationale, 36 McNemar test, relation to, 65n., 66-67
and location of rejection region, 13-14
(See also One-tailed test; Two-tailed test)
Analysis of variance, nature of, 159-161, 174-175
nonparametric, 159-194 interactions in, 33
parametric (see F test) Anderson, R. L., 17, 31n. Andrews, F. C., 193 Arithmetic mean (see Mean, arithmetic) Associated probability, definition of, 11 and rejection region, 13 and sampling distribution, 11 and significance level, 8
Assumptions, additivity, 133 as conditions of statistical model, 18-20
method, 37-42 for large samples, correction for continuity, 40-41 normal distribution approximation, 40 one-tailed and two-tailed tests, 41 for small samples, 38-40 example, 15-16, 39-40
one-tailed test, 38-39 two-tailed test, 39 power-efficiency, 42
table of associated probabilities, 250 Birnbaum, Z. W., 49, 52, 136 Blackwell, D., 8n. Bowker, A. H., 67 Brown, G. W., 116 Buford, H. J., 49n.
Burke, C. J., 47, 111, 179
in measurement, 27-28
of parametric statistical tests, 2-3, 19-20, 25, 30-31
as qualifiers of research conclusions, 19 and sampling distribution, 12
(See also Statistical model)
C (see Contingency coefficient) Central-limit theorem, 12-13
χ² test, contingency coefficient, use in significance test, 197-200
Auble, D., 127, 274n.-277n.
"inflated N" in, 44, 109, 228-229
Average (see Mean; Median)
for k independent samples, 175-179 compared with other tests for k independent samples, 193-194
Bancroft, T. A., 17, 31n. Barthol, R. P., 30n. Bergmann, G., 30
function, 175 method, 175-178
Barnard, G. A., 104
ordered hypothesis, test of, 179
example, 176-178
Beta (β), definition of, 9 (See also Type II error) Binomial coefficients, table of, 288 Binomial distribution, 15, 36-38
power, 179
requirements for use, 178-179
small expected frequencies, 178-179
normal approximation to, 40-41
(See also Median test, extension of)
304
INDEX
z~ test, nominaldata, usewith, 23 one-sample test, 42-47
165
comparedwith other one-sample tests, 59 function, 42 43 method, 43 47 degreesof freedom, 44 example, 44-46 small expected frequencies, 46 ordered hypothesis, test of, 45n., 47 power, 47 compared with Kolmogorov-
Smirnovone-sampletest, 51 table of critical values, 249 for two independent samples, 104-111 compared with other tests for two independent samples, 156-158 function, 104
as mediantest, 112
power and power-efficiency, 165-166
Coefficient, of concordance (see Kendall coefficient of concordance)
of contingency (see Contingency coefficient)
of variation, use with data in ratio scale, 29, 30
Coles, M. R., 118, 119n.
Consensual ordering, use of Kendall W to obtain, 237-238
Contingency coefficient (C), 196-202
compared with other measures of association, 238-240
function, 196
limitations of, 200-201
method, 196-198, 200
example, 197-198
method, 104-110
degrees of freedom, 106
expected frequencies, 105-106, 109
2 × 2 contingency tables, 107-109
correction for continuity in, 107
ordered hypothesis, test of, 110
power, 110
compared with Kolmogorov-Smirnov two-sample test, 136
requirements for use, 110
χr² (see Friedman two-way analysis of variance by ranks)
Chi-square distribution, 43n., 106n.
approximation to, in Cochran Q test, 162-163
nominal data, use with, 23, 30
power, 201-202
significance test, 198-200
example, 200
Continuity, correction for (see Correction for continuity)
Continuous variable, assumption in statistical tests, 25
and tied scores, 25-26
Coombs, C. H., 30, 76n.
Correction for continuity, in binomial test, 40-41
in χ² test of 2 × 2 table, 107
in McNemar test, 64
in sign test, 72
in Friedman two-way analysis of variance by ranks, 168
in Kendall coefficient of concordance, 236
in Kendall partial rank correlation coefficient, 226, 228-229
in Kolmogorov-Smirnov two-sample test, 131-135
in Kruskal-Wallis one-way analysis of variance by ranks, 185
in McNemar test, 64
table of critical values, 249
(See also χ² test)
Cochran Q test, method, example, 163-165
Child, I. L., 112-115, 121, 123n.
Classificatory scale (see Nominal scale)
Clopper, C. J., 42
Cochran, W. G., 46, 47, 104, 110, 160, 162, 166, 179, 184, 202
Cochran Q test, 161-166
compared with other tests for k related samples, 173
function, 161-162
method, 162-165
in Wald-Wolfowitz runs test, 140-141
Cumulative frequency distribution, in Kolmogorov-Smirnov one-sample test, 47-52
in Kolmogorov-Smirnov two-sample test, 127-136
Cyclical fluctuations and one-sample runs test, 52
David, F. N., 42
Davidson, D., 80
Decision, statistical, theory, 8n.
in statistical inference, 6-7, 14
Degrees of freedom, 44
Design of research, before and after, 63
correlational, 195-196
k independent samples, 174-175
k related samples, 160-161
single sample, 35-36
two independent samples, 95-96
two related samples, 159-161
Disarray, τ as coefficient of, 215
Discrete variate, 25
Distribution-free statistical tests, 3
(See also Nonparametric statistical tests)
Dixon, W. J., 17, 31n., 47, 75, 87, 110, 136, 179
Fisher exact probability test, method, associated probability of data, one-tailed test, 99-100
two-tailed test, 100
exact probability of data, 98-100
Tocher's modification, 101-103
power, 104
table of critical values, 256-270
Edwards, A. L., 31n., 110, 179
Eells, K., 24n.
Eisenhart, C., 58, 145, 252n.-253n.
Engvall, A., 69n.
Equated groups, and analysis of variance, 160-161
sensitivity of, 62
and two-sample tests, 61-62
Equivalence classes, in interval scale, 28, 30
in nominal scale, 23, 30
in ordinal scale, 24, 30
in ratio scale, 29-30
Estimation, 1
Expected frequencies, in χ² one-sample test, 43-44
in χ² two-sample test, 105-106
in contingency coefficient, 196-197
Frenkel-Brunswik, E., 132n., 186n., 205n.
Frequency counts, use with nominal data, 23, 30
Freund, J. E., 58
Friedman, M., 168, 171n., 172, 238, 280n., 281n., 286n.
Friedman two-way analysis of variance by ranks, 166-172
compared with other tests for k related samples, 173
function, 166
method, 166-172
example, 169-172
small samples, 168-169
power, 172
rationale, 166-168
table of associated probabilities, 280-281
small, 46, 109, 178-179
Extension of median test (see Median test)
Extreme reactions, test for (see Moses test of extreme reactions)
F test, assumptions, 19-20, 160
interval scale, use with data in, 28
for k independent samples, 174
for k related samples, 160
power, compared with Friedman two-way analysis of variance, 172
Tukey's procedure, 160
Factorials, table of, 287
Fagan, J., 205n., 207, 227
Festinger, L., 127n.
Finney, D. J., 104, 256n.-270n.
Fisher, R. A., 31n., 92, 101, 104, 248n., 249n.
Fisher exact probability test, 96-104
χ² test, use of Fisher test as alternative, 110
compared with other tests for two independent samples, 156-158
function, 96-97
as median test, 112
method, 97-103
associated probability of data, 98-100
example, 100-101
Generalization from parametric and nonparametric tests, 18-20
Geometric mean, 29, 30
Ghiselli, E. E., 141, 143n.
Girshick, M. A., 8n.
Goodman, L. A., 49, 52, 59, 131, 135, 136, 158, 239, 278n.
Goodness of fit, binomial test of, 36-42
χ² test of, 42-47
Kolmogorov-Smirnov test of, 47-52
and one-sample case, 35, 59-60
Gordon, J. E., 232n.
Grosslight, J. H., 169n.
H test (see Kruskal-Wallis one-way analysis of variance by ranks)
Hempel, C. G., 30
Hollingshead, A. B., 176, 177n., 197, 198, 200
Homoscedasticity, assumption of, in t and F tests, 19
definition of, 19
Hotelling, H., 213, 223
Hurst, P. M., 80n.
Hypotheses, derived from theory, 6
errors in testing, 8-11
operational statements of, 7
procedure in testing, 6-7
Hypotheses, tests of, 1
(See also Alternative hypothesis; Null hypothesis; Research hypothesis; Statistical tests)
Kendall rank correlation coefficient (τ), comparison of τ and rS, 219
in numerical values, 219
in power, 219, 222, 223, 239
in uses, 213-214
"Inflated N" in χ² test, 44, 109, 228-229
Interactions in analysis of variance, 33
Interval scale, 26-28, 30
admissible operations, 28, 30
definition of, 26-27
examples of, 27-28
formal properties of, 28, 30
unit of measurement, 27
zero point, 26-28
Isomorphism, 22
Jonckheere, A. R., 194
k samples (see Design of research)
Kendall, M. G., 172, 202, 203, 212, 213, 220-223, 226, 229, 234, 238, 285n.
Kendall coefficient of concordance (W), 229-238
compared with other measures of association, 238-239
function, 229
interpretation of, 237-238
method, 231-235, 237
example, 232-233
tied observations, 233-235
assignment of ranks to, 233
correction for, 234-235
effect of, 234
example with, 234-235
ordinal data, use with, 30
rationale, 229-231
significance test, 235-237
large samples, 236-237
chi-square approximation, 236
example, 236-237
small samples, 235-236
table of critical values, 286
Kendall partial rank correlation coefficient (τxy.z), 223-229
compared with other measures of association, 238-239
function, 223-224
method, 226-228
example, 227-228
rationale, 224-226
significance test, 228-229
Kendall rank correlation coefficient (τ), 213-223
compared with other measures of association, 238-239
function, 213-214
method, 215-219, 222
example, 216-217
tied observations, 217-219
assignment of ranks to, 217
correction for, 218-219
effect of, 219
example with, 218-219
ordinal data, use with, 25, 30
power-efficiency, 223
rationale, 214-215
significance test, 220-222
large samples, 220-222
example, 221-222
normal distribution approximation, 221-222
small samples, 220-221
table of associated probabilities, 285
Kolmogorov, A., 136
Kolmogorov-Smirnov test, for one sample, 47-52
compared with other one-sample tests, 59-60
cumulative frequency distribution in, 47-52
function and rationale, 47-48
method, 48-51
example, 49-50
one-tailed test, 49
two-tailed test, 48
power, 51
compared with χ² test, 51
table of critical values, 251
when parameter values are estimated from sample, 60
for two samples, 127-136
compared with other tests for two independent samples, 156-158
cumulative frequency distribution in, 127-136
function, 127
method, 128-136
large samples, 131-135
one-tailed test, 131-135
two-tailed test, 131
small samples, 129-131
example, 129-131
power-efficiency, 136
rationale, 127-128
tables for, 278-279
Kruskal, W. H., 188, 189n., 193, 239, 283n.
Kruskal-Wallis one-way analysis of variance by ranks, 184-193
compared with other tests for k independent samples, 193-194
function, 184-185
method, 185-192
large samples, 185
example, 189-192
small samples, 185-188
example, 186-188
tied observations, 188-192
assignment of ranks to, 188
correction for, 188-192
effect of, 188-189
example with, 189-192
power-efficiency, 192-193
rationale, 185
table of associated probabilities, 282-283
Lehmann, E. L., 145
Lepley, W. M., 129, 130n.
Lerner, D., 100, 101n.
Lev, J., 31n., 111, 179, 250n.
Level of significance (see Significance level)
Levinson, D. J., 132n., 186n., 205n.
Lewis, D., 47, 111, 179
Lowenfeld, J., 85, 87n.
Mann-Whitney U test, method, computation of U, 119-120
counting method, 116-117
U′, 117-118, 120, 121
large samples, 120-123
example, 121-123
normal distribution approximation, 121
small samples, 117-120
n1, n2 ≤ 8, 117-119
n2 between 9 and 20, 119
tied observations, 123-126
assignment of ranks to, 124
correction for, 124-126
effect of, 124-126
example with, 124-125
power, compared with Kolmogorov-Smirnov two-sample test, 136
compared with median test, 123
compared with Moses test, 146, 151
compared with Wald-Wolfowitz runs test, 144-145
power-efficiency, 126, 155-156
as randomization test on ranks, 155
tables for, 271-277
Massey, F. J., Jr., 17, 31n., 47, 52, 60, 75, 87, 110, 128, 136, 179, 251n., 278n.
Matching design (see Equated groups)
Mean, arithmetic, interval scale, use with data in, 28, 30
sampling distribution
of, 12-13
standard error of, 13
geometric, use with data in ratio scale, 29, 30
McNemar, Q., 31n., 42, 47, 67, 75, 104, 111, 160, 166, 179, 202
McNemar test for significance of changes, 63-67
Measurement, 21-30
in behavioral science, 21-22, 26-28
compared with other tests for two related samples, 92-94
function, 63
method, 63-67
binomial test, relation to, 65n., 66-67
correction for continuity, 64
example, 65-66
one-tailed test, 67
small expected frequencies, 66-67
two-tailed test, 67
power-efficiency, 67
rationale, 63-64
sign test, relation to, 74
Mann, H. B., 120, 127, 271n.-273n.
Mann-Whitney U test, 116-127
compared with other tests for two independent samples, 156-158
function, 116
method, 116-126
as criterion in choice of statistical test, 31
formal properties of scales, 23
isomorphism, 22
levels of, 22
interval scale, 26-28
nominal scale, 22-23
ordered metric scale, 76n.
ordinal scale, 23-26
ratio scale, 28-29
and nonparametric statistical tests, 3
parametric statistical model, requirement associated with, 19-20
in physical science, 21
and statistics, 30
Median, use with ordinal data, 25, 30
Median test, extension of, for k independent samples, 179-184
compared with other tests for k
independent samples, 193-194
function, 179
Median test, extension of, for k independent samples, method, 179-184
example, 180-184
Nonparametric statistical tests, in behavioral science, 31
power-efficiency, 184
compared with Kruskal-Wallis test, 193
for two independent samples, 111-116
compared with other tests for two independent samples, 156-158
function, 111
method, 111-115
example, 112-115
power, compared with Kolmogorov-Smirnov two-sample test, 136
compared with Mann-Whitney U test, 123
power-efficiency, 115
rationale, 111-112
Meeker, M., 24n.
Minimax principle, 8n.
Mode, use with nominal data, 23, 30
Model, statistical (see Statistical model)
Mood, A. M., 17, 31n., 42, 75, 88, 115, 116, 126, 145, 184, 194
Moore, G. H., 58
Moran, P. A. P., 223, 229
Moses, L. E., 75, 88, 92, 116, 144-147, 152, 156
Moses test of extreme reactions, 145-152
compared with other tests for two independent samples, 156-158
function and rationale, 145-146
method, 147-151
example, 148-151
tied observations, 151
effect of, 151
procedure with, 151
power, 151
range in, 146
Mosteller, F., 194
Multiple product-moment correlation, use with data in interval scale, 30
conclusions from, generality of, 3
interval scale, use with data in, 28
measurement requirements, 3, 30, 33
parametric statistical tests, comparison with, 30-34
(See also Contents for list of nonparametric tests)
Normal distribution, approximation to, in binomial test, 40-41
in Mann-Whitney U test, 120-123
in one-sample runs test, 56-58
in randomization test for two related samples, 91-92
in sign test, 72-74
in significance test for Kendall τ, 221-222
in Wald-Wolfowitz runs test, 140-143
in Wilcoxon matched-pairs signed-ranks test, 79-83
assumption of, in interval scaling, 27
in t and F tests, 19
table of, 247
Null hypothesis (H0), definition of, 7
statement of, in steps of hypothesis testing, 6
Olds, E. G., 213, 284n.
One-tailed test, and nature of H1, 7
power of, 11
rejection region of, 13-14
Order tests, 3
use with ordinal data, 25
Ordered metric scale, 76n.
Ordinal scale, 23-26, 30
admissible operations, 24-25
definition of, 23-24
examples of, 24
formal properties of, 24
statistics and tests appropriate to, 25-26, 30
New York Post, 45n.
Neyman, J., 104
Nominal scale, 22-23, 30
ties, occurrence in, 25-26
definition of, 22
Pabst, M. R., 213, 223
Parameters, assumed in parametric tests,
examples of, 22-23
formal properties of, 23
statistics and tests appropriate to, 23, 30
definition of, 2
Parametric statistical
admissible operations, 23
Nonnormality and parametric tests, 20, 126
Nonparametric statistical tests, assumptions underlying, 25, 31, 32
30-32
tests, 2-3
interval scale, use with data in, 28
measurement requirements, 19-20, 30
nonparametric statistical tests, comparison with, 30-34
ordinal scale, use with data in, 26
parametric statistical tests, ratio scale, use with data in, 29
statistical model, 19-20
underlying continuity, assumption of, 25
for various research designs, correlational, 195-196
k independent samples, 174-175
k related samples, 160-161
one sample, 35-36
two independent samples, 96
two related samples, 62
Partial rank correlation, 223-229
Partially ordered scale, 24
Power-efficiency, of nonparametric tests, randomization test, for matched pairs, 92
for two independent samples, 156
sign test, 75
Spearman rS, 213, 219
Wald-Wolfowitz runs test, 144-145
Walsh test, 87
Wilcoxon matched-pairs signed-ranks test, 83
and sample size, 20-21
Q test (see Cochran Q test)
Pearson, E. S., 42, 104
Pearson product-moment correlation coefficient (r), interval scale, use with data in, 28, 30
measure of association, 195-196
power, compared with rS, 213
compared with τ, 223
Percentile, use with ordinal data, 30
Phi coefficient, 226
Pitman, E. J. G., 92, 152n., 154, 156
Pool, I. de S., 100n.
Power of statistical test, 10-11
curves, test of the mean, 10
and one-tailed and two-tailed tests, 11n.
sample size, relation to, 10-11, 20-21
and statistical test, choice of, 18, 31
and Type II error, 10-11
Power-efficiency, 20-21
as criterion in choice of statistical test, 31
definition of, 20-21, 33
of nonparametric tests, 33
binomial test, 42
χ² test, 47, 110, 179
Cochran Q test, 165-166
contingency coefficient, 201-202
extension of median test, 184
Fisher test, 104
Friedman two-way analysis of variance by ranks, 172
Kendall τ, 219, 223
Kolmogorov-Smirnov one-sample test, 51
Kolmogorov-Smirnov two-sample test, 136
Kruskal-Wallis one-way analysis of variance by ranks, 192-193
r (see Pearson product-moment correlation coefficient)
rS (see Spearman rank correlation coefficient)
Radlow, R., 169n.
Randomization test, for matched pairs, 88-92
compared with other tests for two related samples, 92-94
function, 88
method, 88-92
small samples, 88-91
example, 90-91
large samples, 91-92
normal distribution approximation, 91-92
Wilcoxon test as alternative, 91-92
power-efficiency, 92
rationale, 88-89
for two independent samples, 152-156
compared with other tests for two independent samples, 156-158
function, 152
method, 152-156
large samples, 154-156
Mann-Whitney test as alternative, 155-156
t distribution approximation, 154-156
small samples, 152-154
power-efficiency, 156
rationale, 152-154
Randomized blocks, 161
Randomness, test for, 52-58
Range in Moses test, 146
McNemar test, 67
Mann-Whitney U test, 126
median test, 115
Moses test, 151
one-sample runs test, 58
Ranking scale (see Ordinal scale)
Ranking tests, 3
use with ordinal data, 25
Ratio scale, 28-30
admissible operations, 29-30
Ratio scale, definition of, 28-29
example of, 29
formal properties of, 29, 30
zero point in, 28-29
Sample size (N), and power, 9-11
Reflexive, defined, 23n.
Region of rejection (see Rejection region)
Sampling distribution, 11-13
and power-efficiency, 20-21
specification of, in steps in hypothesis testing, 6
Rejection region, 13-14
definition of, 13
illustration of, 14 (Fig. 2)
location of, and alternative hypothesis, 13
size, and significance level, 14
specification of, in steps of hypothesis testing, 6
Schueller, G. K., 100n.
Research, design of (see Design of research)
and statistics, 1-2, 6
and theory, 6
Research hypothesis, definition of, 7
operational statement of, 7
Rho (see Spearman rank correlation coefficient)
Run, definition of, 52, 137
Runs test, k-sample, 194
one-sample, 52-58
compared with other one-sample tests, 60
function and rationale, 52-53
method, 53-58
large samples, 56-58
example, 56-58
normal distribution approximation, 56
small samples, 53-56
example, 54-56
power-efficiency, 58
two-sample (Wald-Wolfowitz), 136-145
compared with other tests for two independent samples, 156-158
function, 136
method, 136-144
large samples, 140-143
correction for continuity, 140-141
example, 141-143
normal distribution approximation, 140
small samples, 138-140
example, 138-140
tied observations, 143-144
effect of, 143
procedure with, 143-144
power, compared with Mann-Whitney U test, 144-145
power-efficiency, 144-145
rationale, 136-138
table of critical values, 252-253
Sampling distribution, definition of, 11
identification of, in steps in hypothesis testing, 6
of mean, 12-13
rejection region of, 13-14
Sanford, R. N., 132n., 186n., 205n.
Savage, L. J., 8n.
Scheffé, H., 92, 156
Scientific method, objectivity in, 6
Siegel, A. E., 54n., 138n., 205n.
Siegel, S., 30, 76n., 80n., 132n., 134n., 205n., 207, 227
Sign test, 68-75
compared with other tests for two related samples, 92-94
function, 68
McNemar test, relation to, 74
method, 68-75
large samples, 72-74
correction for continuity, 72
example, 72-74
normal distribution approximation, 72
one-tailed and two-tailed tests, 72
small samples, 68-71
example, 69-71
one-tailed and two-tailed tests, 69
tied observations, procedure with, 71
power, compared to Wilcoxon test, 78-79
power-efficiency, 75
Significance level (α), 8-11
definition of, 8
and rejection region, 14
specification of, in steps in hypothesis testing, 6
and statistical decision theory, 8n.
and Type I error, 9-10
"Significant" value of a statistic, 14
Smirnov, N. V., 128, 136, 279n.
(See also Kolmogorov-Smirnov test)
Smith, K., 145, 156
Snedecor, G. W., 31n., 189n.
Solomon, R. L., 118, 119n.
Sorenson, H., 289n.-301n.
Span in Moses test, 146
Spearman rank correlation coefficient (rS), 202-213
compared with other measures of association, 238-239
compared with τ, 219
in numerical values, 219
in power, 219, 222, 223, 239
in uses, 213-214
function, 202
method, 204-210, 212-213
example, 204-206
tied observations, 206-210
assignment of ranks to, 206
correction for, 207
effect of, 206, 210
example with, 207-210
power-efficiency, 213
and Kendall W, 229-232
rationale and derivation, 202-204
significance test, 210-213
large samples, 212
t distribution approximation, 212
small samples, 210-212
example, 211-212
table of critical values, 284
Spence, K. W., 30
Squares and square roots, table of, 289-301
Standard deviation, use with data in interval scale, 28, 30
Standard error, definition of, 13n.
Statistical inference, definition of, 1-3
and estimation, 1
and parametric statistics, 2-3
procedure of, 6-34, 199, 211n.
and tests of hypotheses, 1
and Type I and Type II errors, 9-11
Statistical model, 18-20
as criterion in choice of statistical test, 31
of F test, 19-20
of t test, 19-20
Statistical test, 7-8
assumptions underlying, 18-20
choice of, power-efficiency as criterion in, 31
power of (see Power of statistical test)
in steps of hypothesis testing, 6, 18, 34
Statistics, function, in statistical inference, 6
nature of, 1
Stevens, S. S., 30
Stevens, W. L., 138, 145
Stolz, L. M., 69n.
Student's t (see t distribution; t test)
Suppes, P., 30
Swed, F. S., 58, 145, 252n.-253n.
Symmetrical, definition of, 23n.
t distribution, table of associated probabilities, 248
use in significance test for rS, 212
(See also t test)
t test, assumptions underlying, 19-20
interval scale, use with data in, 28
as nonparametric test, 154-156
in one-sample case, 35
table of critical values, 248
for two independent samples, 96
for two related samples, 62
Tau (see Kendall rank correlation coefficient)
Tessman, E., 148n.
Test of significance (see Statistical test)
Tied observations, effect of length of, 125
procedure with and correction for, in Kendall coefficient of concordance, 233-235
in Kendall rank correlation coefficient, 217-219
in Kruskal-Wallis test, 188-192
in Mann-Whitney U test, 123-126
in Moses test, 151
in sign test, 71
in Spearman rank correlation coefficient, 206-210
in Wald-Wolfowitz runs test, 143-144
in Wilcoxon test, 76-77
and underlying continuity, 25-26, 123-124
Tingey, F. H., 49, 52
Tocher, K. D., 102, 104
Transformation, linear, in interval scale, 28
monotonic, in ordinal scale, 24-25
multiplication by positive constant, in ratio scale, 29
one-to-one, in nominal scale, 23
Transitive, defined, 23-24
Tukey, J. W., 160, 194
Two-tailed test, and nature of H1, 7
power of, 11
rejection region of, 13-14
Type I error, in a decision, 14
definition of, 9
and significance level, 9-10
Type II error, definition of, 9
and power of a test, 10-11
and sample size, 9-10
and significance level, 9-10
U test (see Mann-Whitney U test)
W (see Kendall coefficient of concordance)
Wald, A., 8n.
Wald-Wolfowitz runs test (see Runs test)
Walker, H. M., 31n., 111, 179, 250n.
Wallis, W. A., 58, 188, 189n., 193, 283n.
(See also Kruskal-Wallis one-way analysis of variance by ranks)
Walsh, J. E., 75, 87, 255n.
Walsh test, 83-87
assumptions, 83-84
compared with other tests for two related samples, 92-94
method, 84-87
example, 85-87
one-tailed and two-tailed tests, 84-85
power-efficiency, 87
table of critical values, 255
Walter, A. A., 49n.
Warner, W. L., 24n., 49n.
Welch, B. L., 92, 156
White, C., 127
(See also Mann-Whitney U test)
Whiting, J. W. M., 112-115, 121, 123n.
Whitney, D. R., 120, 126, 127, 155, 271n.-273n.
Wilcoxon, F., 83, 127, 254n.
Wilcoxon matched-pairs signed-ranks test, 75-83
compared with other tests for two related samples, 92-94
function, 75-76
method, 76-83
large samples, 79-83
example, 80-83
normal distribution approximation, 79-80
small samples, 77-79
example, 77-79
tied observations, 76-77
assignment of ranks to, 76
effect of, 77
procedure with, 76-77
power-efficiency, 83
as randomization test on ranks, 91
rationale, 76
sign test, relation to, 78-79
table of critical values, 254
Willerman, B., 238
Wilson, K. V., 33n.
Wolfowitz, J. (see Runs test)
Yates, F., 64, 248n., 249n.
Zero point, in interval scale, 26-28
in ratio scale, 28-29