Statistics Explained An Introductory Guide for Life Scientists Second Edition
An understanding of statistics and experimental design is essential for life science studies, but many students lack a mathematical background and some even dread taking an introductory statistics course. Using a refreshingly clear and encouraging reader-friendly approach, this book helps students understand how to choose, carry out, interpret and report the results of complex statistical analyses, critically evaluate the design of experiments and proceed to more advanced material. Taking a straightforward conceptual approach, it is specifically designed to foster understanding, demystify difficult concepts and encourage the unsure. Even complex topics are explained clearly, using a pictorial approach with a minimum of formulae and terminology. Examples of tests included throughout are kept simple by using small data sets. In addition, end-of-chapter exercises, new to this edition, allow self-testing. Handy diagnostic tables help students choose the right test for their work and remain a useful refresher tool for postgraduates.
Steve McKillup is an Associate Professor of Biology in the School of Medical and Applied Sciences at Central Queensland University, Rockhampton. He has received several tertiary teaching awards, including the Vice-Chancellor’s Award for Quality Teaching and a 2008 Australian Learning and Teaching Council citation ‘For developing a highly successful method of teaching complex physiological and statistical concepts, and embodying that method in an innovative international textbook’. He is the author of Geostatistics Explained: An Introductory Guide for Earth Scientists (Cambridge, 2010).
Statistics Explained An Introductory Guide for Life Scientists SECOND EDITION
Steve McKillup Central Queensland University, Rockhampton
CAMBRIDGE UNIVERSITY PRESS
Cambridge, New York, Melbourne, Madrid, Cape Town, Singapore, São Paulo, Delhi, Tokyo, Mexico City
Cambridge University Press
The Edinburgh Building, Cambridge CB2 8RU, UK
Published in the United States of America by Cambridge University Press, New York
www.cambridge.org
Information on this title: www.cambridge.org/9781107005518
© S. McKillup 2012
This publication is in copyright. Subject to statutory exception and to the provisions of relevant collective licensing agreements, no reproduction of any part may take place without the written permission of Cambridge University Press.
First published 2012
Printed in the United Kingdom at the University Press, Cambridge
A catalogue record for this publication is available from the British Library
Library of Congress Cataloguing in Publication data
ISBN 978-1-107-00551-8 Hardback
ISBN 978-0-521-18328-4 Paperback
Additional resources for this publication at www.cambridge.org/9781107005518
Cambridge University Press has no responsibility for the persistence or accuracy of URLs for external or third-party internet websites referred to in this publication, and does not guarantee that any content on such websites is, or will remain, accurate or appropriate.
Contents

Preface

1 Introduction
1.1 Why do life scientists need to know about experimental design and statistics?
1.2 What is this book designed to do?

2 Doing science: hypotheses, experiments and disproof
2.1 Introduction
2.2 Basic scientific method
2.3 Making a decision about an hypothesis
2.4 Why can’t an hypothesis or theory ever be proven?
2.5 ‘Negative’ outcomes
2.6 Null and alternate hypotheses
2.7 Conclusion
2.8 Questions

3 Collecting and displaying data
3.1 Introduction
3.2 Variables, experimental units and types of data
3.3 Displaying data
3.4 Displaying ordinal or nominal scale data
3.5 Bivariate data
3.6 Multivariate data
3.7 Summary and conclusion

4 Introductory concepts of experimental design
4.1 Introduction
4.2 Sampling – mensurative experiments
4.3 Manipulative experiments
4.4 Sometimes you can only do an unreplicated experiment
4.5 Realism
4.6 A bit of common sense
4.7 Designing a ‘good’ experiment
4.8 Reporting your results
4.9 Summary and conclusion
4.10 Questions

5 Doing science responsibly and ethically
5.1 Introduction
5.2 Dealing fairly with other people’s work
5.3 Doing the experiment
5.4 Evaluating and reporting results
5.5 Quality control in science
5.6 Questions

6 Probability helps you make a decision about your results
6.1 Introduction
6.2 Statistical tests and significance levels
6.3 What has this got to do with making a decision about your results?
6.4 Making the wrong decision
6.5 Other probability levels
6.6 How are probability values reported?
6.7 All statistical tests do the same basic thing
6.8 A very simple example – the chi-square test for goodness of fit
6.9 What if you get a statistic with a probability of exactly 0.05?
6.10 Statistical significance and biological significance
6.11 Summary and conclusion
6.12 Questions

7 Probability explained
7.1 Introduction
7.2 Probability
7.3 The addition rule
7.4 The multiplication rule for independent events
7.5 Conditional probability
7.6 Applications of conditional probability

8 Using the normal distribution to make statistical decisions
8.1 Introduction
8.2 The normal curve
8.3 Two statistics describe a normal distribution
8.4 Samples and populations
8.5 The distribution of sample means is also normal
8.6 What do you do when you only have data from one sample?
8.7 Use of the 95% confidence interval in significance testing
8.8 Distributions that are not normal
8.9 Other distributions
8.10 Other statistics that describe a distribution
8.11 Summary and conclusion
8.12 Questions

9 Comparing the means of one and two samples of normally distributed data
9.1 Introduction
9.2 The 95% confidence interval and 95% confidence limits
9.3 Using the Z statistic to compare a sample mean and population mean when population statistics are known
9.4 Comparing a sample mean to an expected value when population statistics are not known
9.5 Comparing the means of two related samples
9.6 Comparing the means of two independent samples
9.7 One-tailed and two-tailed tests
9.8 Are your data appropriate for a t test?
9.9 Distinguishing between data that should be analysed by a paired sample test and a test for two independent samples
9.10 Reporting the results of t tests
9.11 Conclusion
9.12 Questions

10 Type 1 error and Type 2 error, power and sample size
10.1 Introduction
10.2 Type 1 error
10.3 Type 2 error
10.4 The power of a test
10.5 What sample size do you need to ensure the risk of Type 2 error is not too high?
10.6 Type 1 error, Type 2 error and the concept of biological risk
10.7 Conclusion
10.8 Questions

11 Single-factor analysis of variance
11.1 Introduction
11.2 The concept behind analysis of variance
11.3 More detail and an arithmetic example
11.4 Unequal sample sizes (unbalanced designs)
11.5 An ANOVA does not tell you which particular treatments appear to be from different populations
11.6 Fixed or random effects
11.7 Reporting the results of a single-factor ANOVA
11.8 Summary
11.9 Questions

12 Multiple comparisons after ANOVA
12.1 Introduction
12.2 Multiple comparison tests after a Model I ANOVA
12.3 An a posteriori Tukey comparison following a significant result for a single-factor Model I ANOVA
12.4 Other a posteriori multiple comparison tests
12.5 Planned comparisons
12.6 Reporting the results of a posteriori comparisons
12.7 Questions

13 Two-factor analysis of variance
13.1 Introduction
13.2 What does a two-factor ANOVA do?
13.3 A pictorial example
13.4 How does a two-factor ANOVA separate out the effects of each factor and interaction?
13.5 An example of a two-factor analysis of variance
13.6 Some essential cautions and important complications
13.7 Unbalanced designs
13.8 More complex designs
13.9 Reporting the results of a two-factor ANOVA
13.10 Questions

14 Important assumptions of analysis of variance, transformations, and a test for equality of variances
14.1 Introduction
14.2 Homogeneity of variances
14.3 Normally distributed data
14.4 Independence
14.5 Transformations
14.6 Are transformations legitimate?
14.7 Tests for heteroscedasticity
14.8 Reporting the results of transformations and the Levene test
14.9 Questions

15 More complex ANOVA
15.1 Introduction
15.2 Two-factor ANOVA without replication
15.3 A posteriori comparison of means after a two-factor ANOVA without replication
15.4 Randomised blocks
15.5 Repeated-measures ANOVA
15.6 Nested ANOVA as a special case of a single-factor ANOVA
15.7 A final comment on ANOVA – this book is only an introduction
15.8 Reporting the results of two-factor ANOVA without replication, randomised blocks design, repeated-measures ANOVA and nested ANOVA
15.9 Questions

16 Relationships between variables: correlation and regression
16.1 Introduction
16.2 Correlation contrasted with regression
16.3 Linear correlation
16.4 Calculation of the Pearson r statistic
16.5 Is the value of r statistically significant?
16.6 Assumptions of linear correlation
16.7 Summary and conclusion
16.8 Questions

17 Regression
17.1 Introduction
17.2 Simple linear regression
17.3 Calculation of the slope of the regression line
17.4 Calculation of the intercept with the Y axis
17.5 Testing the significance of the slope and the intercept
17.6 An example – mites that live in the hair follicles
17.7 Predicting a value of Y from a value of X
17.8 Predicting a value of X from a value of Y
17.9 The danger of extrapolation
17.10 Assumptions of linear regression analysis
17.11 Curvilinear regression
17.12 Multiple linear regression
17.13 Questions

18 Analysis of covariance
18.1 Introduction
18.2 Adjusting data to remove the effect of a confounding factor
18.3 An arithmetic example
18.4 Assumptions of ANCOVA and an extremely important caution about parallelism
18.5 Reporting the results of ANCOVA
18.6 More complex models
18.7 Questions

19 Non-parametric statistics
19.1 Introduction
19.2 The danger of assuming normality when a population is grossly non-normal
19.3 The advantage of making a preliminary inspection of the data

20 Non-parametric tests for nominal scale data
20.1 Introduction
20.2 Comparing observed and expected frequencies: the chi-square test for goodness of fit
20.3 Comparing proportions among two or more independent samples
20.4 Bias when there is one degree of freedom
20.5 Three-dimensional contingency tables
20.6 Inappropriate use of tests for goodness of fit and heterogeneity
20.7 Comparing proportions among two or more related samples of nominal scale data
20.8 Recommended tests for categorical data
20.9 Reporting the results of tests for categorical data
20.10 Questions

21 Non-parametric tests for ratio, interval or ordinal scale data
21.1 Introduction
21.2 A non-parametric comparison between one sample and an expected distribution
21.3 Non-parametric comparisons between two independent samples
21.4 Non-parametric comparisons among three or more independent samples
21.5 Non-parametric comparisons of two related samples
21.6 Non-parametric comparisons among three or more related samples
21.7 Analysing ratio, interval or ordinal data that show gross differences in variance among treatments and cannot be satisfactorily transformed
21.8 Non-parametric correlation analysis
21.9 Other non-parametric tests
21.10 Questions

22 Introductory concepts of multivariate analysis
22.1 Introduction
22.2 Simplifying and summarising multivariate data
22.3 An R-mode analysis: principal components analysis
22.4 Q-mode analyses: multidimensional scaling
22.5 Q-mode analyses: cluster analysis
22.6 Which multivariate analysis should you use?
22.7 Questions

23 Choosing a test
23.1 Introduction

Appendix: Critical values of chi-square, t and F
References
Index
Preface
If you mention ‘statistics’ or ‘biostatistics’ to life scientists, they often look nervous. Many fear or dislike mathematics, but an understanding of statistics and experimental design is essential for graduates, postgraduates and researchers in the biological, biochemical, health and human movement sciences. Since this understanding is so important, life science students are usually made to take some compulsory undergraduate statistics courses. Nevertheless, I found that a lot of graduates (and postgraduates) were unsure about designing experiments and had difficulty knowing which statistical test to use (and which ones not to!) when analysing their results. Some even told me they had found statistics courses ‘boring, irrelevant and hard to understand’. It seemed there was a problem with the way many introductory biostatistics courses were presented, which was making students disinterested and preventing them from understanding the concepts needed to progress to higher-level courses and more complex statistical applications. There seemed to be two major reasons for this problem and as a student I encountered both. First, a lot of statistics textbooks take a mathematical approach and often launch into considerable detail and pages of daunting looking formulae without any straightforward explanation about what statistical testing really does. Second, introductory biostatistics courses are often taught in a way that does not cater for life science students, who may lack a strong mathematical background. When I started teaching at Central Queensland University, I thought there had to be a better way of introducing essential concepts of biostatistics and experimental design. It had to start from first principles and develop an understanding that could be applied to all statistical tests. It had to demystify what these tests actually did and explain them with a minimum of formulae and terminology. It had to relate statistical concepts to experimental design. 
And, finally, it had to build a strong understanding to help the student progress to more complex material. I tried this approach with
my undergraduate classes and the response from a lot of students, including some postgraduates who sat in on the course, was ‘Hey Steve, you should write an introductory stats book!’
Ward Cooper suggested I submit a proposal for this sort of book to Cambridge University Press. The reviewers of the initial proposal and the subsequent manuscript made most appropriate suggestions for improvement. Ruth McKillup read, commented on and reread several drafts, provided constant encouragement and tolerated my absent-mindedness. My students, especially Steve Dunbar, Kevin Strychar and Glenn Druery, encouraged me to start writing and my friends and colleagues, especially Dearne Mayer and Sandy Dalton, encouraged me to finish.
I sincerely thank the users and reviewers of the first edition for their comments and encouragement. Katrina Halliday from CUP suggested an expanded second edition. Ruth McKillup remained a tolerant, pragmatic, constructive and encouraging critic, despite having read many drafts many times. The students in my 2010 undergraduate statistics class, especially Deborah Fisher, Michael Rose and Tara Monks, gave feedback on many of the explanations developed for this edition; their company and cynical humour were a refreshing antidote.
1 Introduction
1.1 Why do life scientists need to know about experimental design and statistics?
If you work on living things, it is usually impossible to get data from every individual of the group or species in question. Imagine trying to measure the length of every anchovy in the Pacific Ocean, the haemoglobin count of every adult in the USA, the diameter of every pine tree in a plantation of 200 000 or the individual protein content of 10 000 prawns in a large aquaculture pond. The total number of individuals of a particular species present in a defined area is often called the population. But because a researcher usually cannot measure every individual in the population (unless they are studying the few remaining members of an endangered species), they have to work with a very carefully selected subset containing several individuals (often called sampling units or experimental units) that they hope is a representative sample from which they can infer the characteristics of the population. You can also think of a population as the total number of artificial sampling units possible (e.g. the total number of 1 m² plots that would cover a whole coral reef) and your sample being the subset (e.g. 20 plots) you have to work upon.
The best way to get a representative sample is usually to choose a number of individuals from the population at random – without bias, with every possible individual (or sampling unit) within the population having an equal chance of being selected.
The unavoidable problem with this approach is that there are often great differences among sampling units from the same population. Think of the people you have seen today – unless you have met some identical twins (or triplets etc.), no two would have been the same. This can even apply to species made up of similar looking individuals (like flies or cockroaches or snails) and causes problems when you work with samples.
First, even a random sample may not be a good representative of the population from which it has been taken (Figure 1.1). For example, you may choose students for an exercise experiment who are, by chance, far less (or far more) physically fit than the student population of the college they represent. A batch of seed chosen at random may not represent the variability present in all seed of that species, and a sample of mosquitoes from a particular place may have very different insecticide resistance than the same species occurring elsewhere.
Figure 1.1 Even a random sample may not necessarily be a good representative of the population from which it has been taken. Two samples, each of five individuals, have been taken at random from the same population. By chance sample 1 contains a group of relatively large fish, while those in sample 2 are relatively small.
Figure 1.2 Samples selected at random from very different populations may not necessarily be different. Simply by chance the samples from populations 1 and 2 are similar, so you might mistakenly conclude the two populations are also similar.
Therefore, if you take a random sample from each of two similar populations, the samples may be different to each other simply by chance. On the basis of your samples, you might mistakenly conclude that the two populations are very different. You need some way of knowing if a difference between samples is one you would expect by chance or whether the populations they have been taken from really do seem to be different. Second, even if two populations are very different, randomly chosen samples from each may be similar and give the misleading impression the populations are also similar (Figure 1.2).
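The effect of chance on random samples can be sketched with a short simulation. This is only an illustration under invented assumptions: the population of 10 000 fish lengths, its mean of 30 cm and its spread of 5 cm are hypothetical and do not come from any data in this book, and the function name `draw_sample` is made up for the example. It takes two random samples of five from the same population, as in Figure 1.1, and then repeats the exercise many times to see how far apart two sample means can be by chance alone.

```python
import random
import statistics

def draw_sample(population, size, rng):
    """Choose `size` individuals at random, each with an equal chance of selection."""
    return rng.sample(population, size)

rng = random.Random(1)  # fixed seed so the run can be repeated

# Hypothetical population: lengths (cm) of 10 000 fish, invented for illustration.
population = [rng.gauss(30.0, 5.0) for _ in range(10_000)]

# Two samples of five individuals taken at random from the SAME population.
sample_1 = draw_sample(population, 5, rng)
sample_2 = draw_sample(population, 5, rng)

print("population mean:", round(statistics.mean(population), 2))
print("sample 1 mean:  ", round(statistics.mean(sample_1), 2))
print("sample 2 mean:  ", round(statistics.mean(sample_2), 2))

# Repeat the pair of samples many times and record the difference between means.
differences = []
for _ in range(1000):
    a = draw_sample(population, 5, rng)
    b = draw_sample(population, 5, rng)
    differences.append(abs(statistics.mean(a) - statistics.mean(b)))

print("typical chance difference:", round(statistics.median(differences), 2))
print("largest chance difference:", round(max(differences), 2))
```

Because both samples always come from one population, every difference printed here is due to chance. Deciding whether an observed difference is larger than this sort of chance variation is the job of a statistical test.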
Figure 1.3 Two samples were taken from the same population and deliberately matched so that six equal-sized individuals were initially present in each group. Those in the treatment group were fed a vitamin supplement for 300 days and those in the untreated control group were not. This caused each fish in the treatment group to grow about 10% longer than it would have without the supplement, but this difference is small compared to the variation in growth among individuals, which may obscure any effect of treatment.
Finally, natural variation among individuals within a sample may obscure any effect of an experimental treatment (Figure 1.3). There is often so much variation within a sample (and a population) that an effect of treatment may be difficult or impossible to detect. For example, what would you conclude if you found that a sample of 50 people given a newly synthesised drug showed an average decrease in blood pressure, but when you looked more closely at the group you found that blood pressure remained unchanged for 25, decreased markedly for 15 and increased slightly for the remaining ten? Has the drug really had an effect? What if
tomato plants treated with a new fertiliser yielded from 1.5 kg to 9 kg of fruit per plant compared to 1.5 kg to 7.5 kg per plant in an untreated group? Could you confidently conclude there was a meaningful difference between these two samples? This uncertainty is usually unavoidable when you work with samples, and means that a researcher has to take every possible precaution to ensure their samples are likely to be representative of the population as a whole. Researchers need to know how to sample. They also need a good understanding of experimental design, because a good design will take natural variation into account and also minimise additional unwanted variability introduced by the experimental procedure itself. They also need to take accurate and precise measurements to minimise other sources of error. Finally, considering the variability among samples described above, the results of an experiment may not be clear cut. It is therefore often difficult to make a decision about a difference between samples from different populations or from different experimental treatments. Is it the sort of difference you would expect by chance or are the populations really different? Is the experimental treatment having an effect? You need something to help you decide, and that is what statistical tests do by calculating the probability of a particular difference among samples. Once you know that probability, the decision is up to you. So you need to understand how statistical tests work!
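The way natural variation can hide a real treatment effect, as in Figure 1.3, can also be sketched in code. Treat this as a rough illustration only: the starting length of 10 cm and the range of growth among individuals are invented, and I have assumed that 'about 10% longer' means the final length is multiplied by 1.10. Only the group size of six and the 10% effect come from the figure.

```python
import random
import statistics

rng = random.Random(3)  # fixed seed so the run can be repeated

N_FISH = 6               # fish per group, as in Figure 1.3
START_LENGTH_CM = 10.0   # hypothetical matched starting length
TREATMENT_FACTOR = 1.10  # assumed meaning of "about 10% longer"

def unsupplemented_final_length(rng):
    """Final length without the supplement; the wide range is an invented
    stand-in for natural variation in growth among individuals."""
    return START_LENGTH_CM * rng.uniform(1.2, 2.0)

# Control fish simply grow naturally.
control = [unsupplemented_final_length(rng) for _ in range(N_FISH)]

# Treatment fish: what each would have reached without the supplement, then the 10% effect.
without_supplement = [unsupplemented_final_length(rng) for _ in range(N_FISH)]
treatment = [length * TREATMENT_FACTOR for length in without_supplement]

def describe(name, lengths):
    """Print the mean and range of one group."""
    print(f"{name}: mean {statistics.mean(lengths):.1f} cm, "
          f"range {min(lengths):.1f} to {max(lengths):.1f} cm")

describe("control  ", control)
describe("treatment", treatment)
print(f"difference in means: {statistics.mean(treatment) - statistics.mean(control):.1f} cm")
```

Every treated fish here really is 10% longer than it would otherwise have been, yet with only six fish per group the two ranges can overlap so much that the effect is hard to see. Changing the seed shows how much the apparent difference moves about from one run to the next.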
1.2 What is this book designed to do?
A good understanding of experimental design and statistics is important for all life scientists (e.g. entomologists, biochemists, environmental scientists, parasitologists, physiologists, genetic engineers, medical scientists, microbiologists, nursing professionals, taxonomists and human movement scientists), so most life science students are made to take a general introductory statistics course. Many of these courses take a detailed mathematical approach that a lot of life scientists find difficult, irrelevant and uninspiring. This book is an introduction that does not assume a strong mathematical background. Instead, it develops a conceptual understanding of how statistical tests actually work by using pictorial explanations where possible and a minimum of formulae.
If you have read other texts or already done an introductory course, you may find that the way this material is presented is unusual, but I have found that non-statisticians find this approach very easy to understand and sometimes even entertaining. If you have a background in statistics, you may find some sections a little too explanatory, but at the same time they are likely to make sense. This book most certainly will not teach you everything about the subject areas, but it will help you decide what sort of statistical test to use and what the results mean. It will also help you understand and criticise the experimental designs of others. Most importantly, it will help you design and analyse your own experiments, understand more complex experimental designs and move on to more advanced statistical courses.
2 Doing science: hypotheses, experiments and disproof
2.1 Introduction
Before starting on experimental design and statistics, it is important to be familiar with how science is done. This is a summary of a very conventional view of scientific method.
2.2 Basic scientific method
These are the essential features of the ‘hypothetico-deductive’ view of scientific method (see Popper, 1968). First, a person observes or samples the natural world and uses all the information available to make an intuitive, logical guess, called an hypothesis, about how the system functions. The person has no way of knowing if their hypothesis is correct – it may or may not apply. Second, a prediction is made on the assumption the hypothesis is correct. For example, if your hypothesis were that ‘Increased concentrations of carbon dioxide in the atmosphere in the future will increase the growth rate of tomato plants’, you could predict that tomato plants will grow faster in an experimental treatment where the carbon dioxide concentration was higher than a second treatment set at the current atmospheric concentration of this gas. Third, the prediction is tested by taking more samples or doing an experiment. Fourth, if the results are consistent with the prediction, then the hypothesis is retained. If they are not, it is rejected and a new hypothesis will need to be formulated (Figure 2.1). The initial hypothesis may come about as a result of observations, sampling and/or reading the scientific literature. Here is an example from ecological entomology.
Figure 2.1 The process of hypothesis formulation and testing.
The Portuguese millipede Ommatoiulus moreleti was accidentally introduced into southern Australia from Portugal in the 1950s. This millipede lives in leaf litter and grows to about four centimetres long. In the absence of natural enemies from its country of origin (especially European hedgehogs, which eat a lot of millipedes), its numbers rapidly increased to plague proportions in South Australia. Although it causes very little damage to agricultural crops, O. moreleti is a serious ‘nuisance’ pest because it invades houses. In heavily infested areas of South Australia during the late 1980s, it used to be common to find over 1000 millipedes invading a moderate-sized house in just one night. When you disturb one of these millipedes, it ejects a smelly yellow defensive secretion. Once inside the house, the millipedes would crawl across the floor, up the walls and over the ceiling from where they even fell into food and into the open mouths of sleeping people. When accidentally crushed underfoot, they stained carpets and floors, and smelt. The problem was so great that almost half a million dollars (which was a lot of money in the 1980s) was spent researching how to control this pest.
While working on ways to reduce the nuisance caused by the Portuguese millipede, I noticed that householders who reported severe problems had well-lit houses with large and often uncurtained windows. In contrast, nearby neighbours whose houses were not so well lit and who closed their curtains at night reported far fewer millipedes inside. The numbers of O. moreleti per square metre were similar in the leaf litter around both types of houses. From these observations and very limited sampling of fewer than ten houses, I formulated the hypothesis, ‘Portuguese millipedes are attracted to visible light at night.’ I had no way of knowing whether this very simple hypothesis was the reason for home invasions by millipedes, but it could explain my observations and seemed logical because other arthropods are also attracted to light at night.
Figure 2.2 Arrangement of a 2 × 5 grid of lit and unlit tiles across a field where millipedes were abundant. Filled squares indicate unlit tiles and open squares indicate lit tiles.
From this hypothesis it was straightforward to predict ‘At night, in a field where Portuguese millipedes are abundant, more will be present in areas illuminated by visible light than in unlit areas.’ This prediction was tested by doing a simple and inexpensive manipulative field experiment with two treatments – lit areas and a control treatment of unlit areas. Because any difference in millipede numbers between only one lit and one unlit area might occur just by chance or some other unknown factor(s), the two treatments were each replicated five times. I set up ten identical white ceramic floor tiles in a two row × five column rectangular grid in a field where millipedes were abundant (Figure 2.2). For each column of two tiles, I tossed a coin to decide which of each pair was going to be lit. The
other tile was left unlit. This ensured that replicates of both the treatment and control were dispersed across the field instead of having all the treatment tiles clustered together and was also a precaution in case the number of millipedes per square metre varied across the field. The coin tossing also eliminated any likelihood that I might subconsciously place the lit tile of each pair in an area where millipedes were more common. I hammered a thin two-metre long wooden stake vertically into the ground next to each tile. For every one of the lit tiles, I attached a pocket torch to its stake and made sure the light shone on the tile. I started the experiment at dusk by turning on the torches and went back three hours later to count the numbers of millipedes on all tiles.
From this experiment, there were at least four possible outcomes:
(1) No millipedes were present on the unlit tiles, but lots were present on each of the lit tiles. This result is consistent with the hypothesis, which has survived this initial test and can be retained.
(2) High and similar numbers of millipedes were present on both the lit and unlit tiles. This is not consistent with the hypothesis, which can probably be rejected since it seems light has no effect.
(3) No (or very few) millipedes were present on any tiles. It is difficult to know if this has any bearing on the hypothesis – there may be a fault with the experiment (e.g. the tiles were themselves repellent or perhaps too slippery, or millipedes may not have been active that night). The hypothesis is neither rejected nor retained.
(4) More millipedes were present on the unlit tiles than on the lit ones. This is a most unexpected outcome that is not consistent with the hypothesis, which is extremely likely to be rejected.
These are the four simplest outcomes.
A more complicated and much more likely one is that you find some millipedes on each of the tiles in both treatments, and that is what happened – see McKillup (1988) for more details. This sort of outcome is a problem because you need to decide if light is having an effect on the millipedes or whether the difference in numbers between lit and unlit treatments is simply happening by chance. Here statistical testing is extremely useful and necessary because it helps you decide whether a difference between treatments is meaningful.
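The coin-tossing step of the millipede experiment can be sketched in a few lines of Python. This is only an illustration of randomly assigning one lit and one unlit tile within each of the five columns; the function name `assign_treatments`, the row labels 0 and 1 and the use of a seed are assumptions made for the example, not part of the original study.

```python
import random

def assign_treatments(n_columns=5, seed=None):
    """Randomly choose which tile of each two-tile column is lit (hypothetical helper)."""
    rng = random.Random(seed)  # a seed only makes the illustration repeatable
    layout = []
    for column in range(n_columns):
        # The "coin toss": row 0 or row 1 of this column gets the torch.
        lit_row = rng.choice([0, 1])
        layout.append({"column": column, "lit_row": lit_row, "unlit_row": 1 - lit_row})
    return layout

layout = assign_treatments(n_columns=5, seed=1)
for pair in layout:
    print(f"column {pair['column']}: lit tile in row {pair['lit_row']}, "
          f"unlit tile in row {pair['unlit_row']}")
```

Because the choice is made separately for every column, each column always contains exactly one lit and one unlit tile, so both treatments are spread across the field.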
2.3
Making a decision about an hypothesis
Once you have the result of the experimental test of an hypothesis, two things can happen: Either the results of the experiment are consistent with the hypothesis, which is retained. Or the results are inconsistent with the hypothesis, which may be rejected. If the hypothesis is rejected, it is likely to be wrong and another will need to be proposed. If the hypothesis is retained, withstands further testing and has some very widespread generality, it may progress to become a theory. But a theory is only ever a very general hypothesis that has withstood repeated testing. There is always a possibility it may be disproven in the future.
2.4
Why can’t an hypothesis or theory ever be proven?
No hypothesis or theory can ever be proven because one day there may be evidence that rejects it and leads to a different explanation (which can include all the successful predictions of the previous hypothesis). So we can only falsify or disprove hypotheses and theories – we can never ever prove them. Cases of disproof and a subsequent change in thinking are common. Here are three examples. A classic historical case was how a person can contract cholera, which is a life-threatening gastrointestinal disease. In nineteenth-century London, it was widely believed that cholera was contracted by breathing foul air, but in 1854, the physician and anaesthesiologist John Snow noticed that the majority of victims of the outbreak had drunk water from a well at the corner of Broad Street and Cambridge Street in Soho, London. At this time, much of London’s drinking water was hand-pumped from shallow wells that were in close proximity to cesspits – open pits in the ground into which untreated human excrement was discarded. There was no sewerage system. Snow hypothesised that cholera was contracted by drinking water contaminated by the excrement of cholera sufferers. This hypothesis was first tested by simply removing the handle from the Broad Street pump, thereby forcing people to get their water from other wells, and the outbreak ceased
shortly thereafter. The result of this experiment was consistent with Snow’s hypothesis that drinking contaminated water was the cause of cholera, and after more research it eventually replaced the hypothesis about foul air. A more recent case concerns gastric ulcers in humans: medical researchers used to believe that excess stomach acidity was responsible for the majority of them. There was a radical change in thinking when many ulcers healed following antibiotic therapy designed to reduce numbers of the bacterium Helicobacter pylori in the stomach wall. Finally, there have been at least three theories of how the human kidney produces a concentrated solution of urine and the latest may not necessarily be correct.
2.5
‘Negative’ outcomes
People are often quite disappointed if the outcome of an experiment is not what they expected and their hypothesis is rejected. But there is nothing wrong with this – the rejection of an hypothesis is still progress in the process of understanding how a system functions. Therefore, a ‘negative’ outcome that causes you to reject a cherished hypothesis is just as important as a ‘positive’ one that causes you to retain it. Unfortunately, some researchers tend to be very possessive and protective of their hypotheses and there have been cases where results have been falsified in order to allow an hypothesis to survive. This does not advance our understanding of the world and is likely to be detected when other scientists repeat the experiments or do further work based on these false conclusions. There will be more about this in Chapter 5, which is about doing science responsibly and ethically.
2.6
Null and alternate hypotheses
It is scientific convention that when you test an hypothesis you state it as two hypotheses, which are essentially alternates. For example, the hypothesis ‘Portuguese millipedes are attracted to visible light at night’ is usually stated in combination with ‘Portuguese millipedes are not attracted to visible light at night.’ The latter includes all cases not included by the first hypothesis (e.g. no response, or avoidance of visible light).
These hypotheses are called the alternate (which some texts call ‘alternative’) and null hypotheses respectively. Importantly, the null hypothesis is always stated as the hypothesis of ‘no difference’ or ‘no effect’. So, looking at the two hypotheses above, the second ‘are not’ hypothesis is the null and the first is the alternate. This is a tedious but very important convention (because it clearly states the hypothesis and its alternative) and there will be several reminders in this book.
Box 2.1 Two other views about scientific method
Popper’s hypothetico-deductive philosophy of scientific method – where an hypothesis is tested and always at risk of being rejected – is widely accepted. In reality, however, scientists may do things a little differently.
Kuhn (1970) argued that scientific enquiry does not necessarily proceed with the steady testing and survival or rejection of hypotheses. Instead, hypotheses with some generality and which have survived considerable testing become well-established theories or ‘paradigms’ that are relatively immune to rejection even if subsequent testing finds some evidence against them. A few negative results are used to refine the paradigm to make it continue to fit all available evidence. It is only when the negative evidence becomes overwhelming that the paradigm is rejected and replaced by a new one.
Lakatos (1978) also argued that a strict hypothetico-deductive process of scientific investigation does not necessarily occur. Instead, fields of enquiry called ‘research programmes’ are based on a set of ‘core’ theories that are rarely questioned or tested. The core is surrounded by a protective ‘belt’ of theories and hypotheses that are tested. A successful research programme is one that accumulates more and more theories that have survived testing within the belt, which provides increasing protection for the core. But if many of the belt theories are rejected, doubt will eventually be cast on the veracity of the core and of the research programme itself and it is likely eventually to be replaced by a more successful one.
These two views and Popper’s hypothetico-deductive one are not irreconcilable. In all cases, observations and experiments provide
evidence either for or against an hypothesis or theory. In the hypothetico-deductive view, science proceeds by the orderly testing and survival or rejection of individual hypotheses, while the other two views reflect the complexity of theories required to describe a research area and emphasise that it would be foolish to reject immediately an already well-established theory on the basis of a small amount of evidence against it.
2.7
Conclusion
There are five components to an experiment: (1) formulating an hypothesis, (2) making a prediction from the hypothesis, (3) doing an experiment or sampling to test the prediction, (4) analysing the data and (5) deciding whether to retain or reject the hypothesis. The description of scientific method given here is extremely simple and basic and there has been an enormous amount of philosophical debate about how science really is done. For example, more than one hypothesis might explain a set of observations and it may be difficult to test these by progressively considering each alternate hypothesis against its null. For further reading, Chalmers (1999) gives a very readable and clearly explained discussion of the process and philosophy of scientific discovery.
2.8
Questions
(1) Why is it important to collect data from more than one experimental unit or sampling unit when testing an hypothesis?
(2) Describe the ‘hypothetico-deductive’ model of how science is done, including an example of a null and alternate hypothesis, the concept of disproof and why a negative outcome is just as important as a positive one.
3
Collecting and displaying data
3.1
Introduction
One way of generating hypotheses is to collect data and look for patterns. Often, however, it is difficult to see any underlying trend or feature from a set of data which is just a list of numbers. Graphs and descriptive statistics are very useful for summarising and displaying data in ways that may reveal patterns. This chapter describes the different types of data you are likely to encounter and discusses ways of displaying them.
3.2
Variables, experimental units and types of data
The particular attributes you measure when you collect data are called variables (e.g. body temperature, the numbers of a particular species of beetle per broad bean pod, the amount of fungal damage per leaf or the numbers of brown and albino mice). These data are collected from each experimental or sampling unit, which may be an individual (e.g. a human or a whale) or a defined item (e.g. a square metre of the seabed, a leaf or a lake). If you only measure one variable per experimental unit, the data set is univariate. Data for two variables per unit are bivariate. Data for three or more variables measured on the same experimental unit are multivariate. Variables can be measured on four scales – ratio, interval, ordinal or nominal.
A ratio scale describes a variable whose numerical values truly indicate the quantity being measured.
* There is a true 0 point below which you cannot have any data. For example, if you are measuring the length of lizards, you cannot have a lizard of negative length.
* An increase of the same numerical amount indicates the same quantity across the range of measurements. For example, a 2 cm and a 40 cm long lizard will have grown by the same amount if they both increase in length by 10 cm.
* A particular ratio holds across the range of the variable. For example, a 40 cm long lizard is twenty times longer than a 2 cm lizard and a 100 cm long lizard is also twenty times longer than a 5 cm lizard.
An interval scale describes a variable that can be less than 0 (zero).
* The 0 point is arbitrary (e.g. temperature measured in degrees Celsius has a 0 point at which water freezes), so negative values are possible. The true 0 point for temperature, where there is a complete absence of heat, is 0 kelvin (about –273°C), so unlike the Celsius scale the kelvin scale is a ratio scale.
* An increase of the same numerical amount indicates the same quantity across the range of measurements. For example, a 2°C increase indicates the same increase in heat whatever the starting temperature.
* Since the 0 point is arbitrary, a particular ratio does not hold across the range of the variable. For example, although 6°C compared to 1°C and 60°C compared to 10°C are both numerically 6 : 1, they do not represent the same ratio of heat. The two ratios in terms of the kelvin scale are 279 : 274 K and 333 : 283 K.
An ordinal scale applies to data where values are ranked, which is when they are given a value that simply indicates their relative order. Therefore, the ranks do not necessarily indicate constant differences. For example, five children of ages from birth of 2, 7, 9, 10 and 16 years have been aged on a ratio scale. If, however, you rank these ages in order from the youngest to the oldest, which will give them ranks of 1 to 5, the data have been reduced to an ordinal scale. Child 2 is not necessarily twice as old as child 1.
* An increase of the same numerical amount of ranks does not necessarily indicate the same quantity across the range of the variable.
A nominal scale applies to data where the values are classified according to an attribute. For example, if there are only two possible forms of coat colour in mice, then a sample of mice can be subdivided into the numbers within each of these two attributes. The first three types of data described above can include either continuous or discrete data. Nominal scale data (since they are attributes) can only be discrete.
Continuous data can have any value within a range. For example, any value of temperature is possible within the range from 10°C to 20°C (e.g. 15.3°C or 17.82°C). Discrete data can only have fixed numerical values within a range. For example, the number of offspring produced increases from one fixed whole number to the next because you cannot have a fraction of an offspring. It is important that you know what type of data you are dealing with because it will help determine your choice of statistical test.
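The difference between an interval and a ratio scale, and the loss of information when data are ranked, can be checked numerically. The following is a minimal sketch in Python using the temperatures and children's ages from the examples in this section; the helper name `celsius_to_kelvin` and the offset of 273.15 (the text rounds this to 273) are assumptions made for the illustration.

```python
def celsius_to_kelvin(celsius):
    """Convert degrees Celsius to kelvin (assumes the usual 273.15 offset)."""
    return celsius + 273.15

# Interval scale: the Celsius numbers give the same numerical ratio ...
ratio_low_celsius = 6 / 1
ratio_high_celsius = 60 / 10

# ... but on the kelvin (ratio) scale the two comparisons differ.
ratio_low_kelvin = celsius_to_kelvin(6) / celsius_to_kelvin(1)
ratio_high_kelvin = celsius_to_kelvin(60) / celsius_to_kelvin(10)

print("Celsius ratios:", ratio_low_celsius, ratio_high_celsius)
print("Kelvin ratios: ", round(ratio_low_kelvin, 3), round(ratio_high_kelvin, 3))

# Ordinal scale: ranking the ages keeps their order but not their spacing.
ages = [2, 7, 9, 10, 16]
ranks = {age: rank for rank, age in enumerate(sorted(ages), start=1)}
print("Ranks by age:", ranks)
```

The ranks increase in steps of one even though the gaps between the ages are 5, 2, 1 and 6 years, which is why ranked data cannot be treated as if they were on a ratio scale.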
3.3
Displaying data
A list of data may reveal very little, but a pictorial summary might show a pattern that can help you generate or test hypotheses.
3.3.1
Histograms
Here is a list of the number of visits made to a medical doctor during the previous six months by a sample of 60 students chosen at random from a first-year university biostatistics class of 600. These data are univariate, ratio scaled and discrete: 1, 11, 2, 1, 10, 2, 1, 1, 1, 1, 12, 1, 6, 2, 1, 2, 2, 7, 1, 2, 1, 1, 1, 1, 1, 3, 1, 2, 1, 2, 1, 4, 6, 9, 1, 2, 8, 1, 9, 1, 8, 1, 1, 1, 2, 2, 1, 2, 1, 2, 1, 1, 8, 1, 2, 1, 1, 1, 1, 7. It is difficult to see any pattern from this list of numbers, but you could summarise and display the data as a histogram. To do this you separately count the number (the frequency) of cases for students who visited a medical doctor never, once, twice, three times, through to the maximum number of visits. These totals are plotted as a series of rectangles on a graph, with the X axis showing the number of visits and the Y axis the number of students in each. Figure 3.1 shows a histogram for the data. This visual summary is useful. The distribution is skewed to the right – most students make few visits to a medical doctor, but there is a long ‘tail’ (and perhaps even a separate group) who have made six or more visits. Incidentally, looking at the graph you may be a little suspicious because every student made at least one visit. When the students were asked about this, they said that all first years had to have a compulsory medical examination, so these data are somewhat misleading in terms of indicating the health of the group.
Figure 3.1 The number of visits made to a medical doctor during the past six months for 60 students chosen at random from a first-year biostatistics class of 600. (Histogram: X axis, number of visits to a medical doctor; Y axis, number of students.)
You may be tempted to draw a line joining the midpoints of the tops of each bar to indicate the shape of the distribution, but this implies that the data on the X axis are continuous, which is not the case because visits are discrete whole numbers.
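The counting step behind a histogram of discrete data can be sketched as follows. This Python example tallies the 60 doctor-visit values listed above and prints a rough text version of the histogram; it uses only the standard library, and the text-based display is simply an illustrative stand-in for a proper graph.

```python
from collections import Counter

# Number of visits to a medical doctor for the 60 students (from the text).
visits = [
    1, 11, 2, 1, 10, 2, 1, 1, 1, 1, 12, 1, 6, 2, 1, 2, 2, 7, 1, 2,
    1, 1, 1, 1, 1, 3, 1, 2, 1, 2, 1, 4, 6, 9, 1, 2, 8, 1, 9, 1,
    8, 1, 1, 1, 2, 2, 1, 2, 1, 2, 1, 1, 8, 1, 2, 1, 1, 1, 1, 7,
]

# Frequency of each number of visits; values that never occur count as 0.
frequency = Counter(visits)

for n_visits in range(0, max(visits) + 1):
    count = frequency[n_visits]
    print(f"{n_visits:>2} | {'#' * count} {count}")
```

The zero count for 'no visits' is the feature that prompted the question about the compulsory medical examination.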
3.3.2
Frequency polygons or line graphs
If the data are continuous, it is appropriate to draw a line linking the midpoint of the tops of each bar in a histogram. Here is an example of some continuous data that can be summarised either as a histogram or as a frequency polygon (often called a line graph). The time a person takes to respond to a stimulus is called their reaction time. This can be easily measured in the laboratory by getting them to press a button as soon as they see a light flash: the reaction time is the time elapsing between the instant of the flash and when the button is pressed. A researcher suspected that an abnormally long reaction time might be a useful way of making an early diagnosis of certain neurological diseases, so they chose a random sample of 30 students from a first-year biomedical science class and measured their reaction time in seconds. These data are as follows: 0.70, 0.50, 1.20, 0.80, 0.30, 0.34, 0.56, 0.41, 0.30, 1.20, 0.40, 0.64, 0.52, 0.38, 0.62, 0.47, 0.24, 0.55, 0.57, 0.61, 0.39, 0.55, 0.49, 0.41, 0.72, 0.71, 0.68, 0.49, 1.10, 0.59. Here, too, nothing is very obvious from this list.
Table 3.1 Summary of the data for the reaction times in seconds of 30 students chosen at random from a first-year biomedical class.
Interval range    Number of students
0.20–0.29         1
0.30–0.39         5
0.40–0.49         6
0.50–0.59         7
0.60–0.69         4
0.70–0.79         3
0.80–0.89         1
0.90–0.99         0
1.00–1.09         0
1.10–1.19         1
1.20–1.29         2
Because the data are continuous, they are not as easy to summarise as the discrete data in Figure 3.1. To display a histogram for continuous data, you need to subdivide the data into the frequency of cases within a series of intervals of equal width. First, you need to look at the range of the data, which is from a minimum of 0.24 through to a maximum of 1.20 seconds, and decide on an interval width that will give you an informative display of the data. Here the chosen width is 0.099. Starting from 0.2, this will give 11 intervals with the first being 0.20–0.29 seconds. The chosen interval width needs to show the shape of the distribution: there would be no point in choosing an interval that included all the data in two intervals because you would only have two bars on the histogram. Nor would there be any point in choosing more than 20 intervals because most would only contain a few data. Once you have decided on an appropriate interval width, you need to count the number of students with a response time that falls within each interval (Table 3.1) and plot these frequencies on the Y axis against the midpoints of each interval on the X axis. This has been done in Figure 3.2(a). Finally, the midpoints of the tops of each rectangle have been joined by a line to give a frequency polygon or line graph (Figure 3.2(b)). Most students have short reaction times, but there is a distinct group of three who took a
relatively long time to respond and who may be of further interest to the researcher.
Figure 3.2 Data for the reaction time in seconds of 30 biomedical students displayed as (a) a histogram and (b) a frequency polygon or line graph. The points on the frequency polygon (b) correspond to the midpoints of the bars on (a). (Both panels: X axis, reaction time (seconds); Y axis, frequency.)
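Grouping continuous data into intervals, as was done for Table 3.1, can be sketched in Python as follows. The function name `bin_counts` is hypothetical, and the sketch assumes that each interval covers one tenth of a second (for example 0.20 up to, but not including, 0.30), which is one reasonable reading of the interval limits in the table. Values are converted to whole hundredths of a second so that floating-point rounding cannot push a value into the wrong interval.

```python
# Reaction times in seconds for the 30 students (from the text).
reaction_times = [
    0.70, 0.50, 1.20, 0.80, 0.30, 0.34, 0.56, 0.41, 0.30, 1.20,
    0.40, 0.64, 0.52, 0.38, 0.62, 0.47, 0.24, 0.55, 0.57, 0.61,
    0.39, 0.55, 0.49, 0.41, 0.72, 0.71, 0.68, 0.49, 1.10, 0.59,
]

def bin_counts(values, start_hundredths=20, width_hundredths=10, n_bins=11):
    """Count the values falling in each equal-width interval (hypothetical helper)."""
    counts = [0] * n_bins
    for value in values:
        hundredths = round(value * 100)  # e.g. 0.29 -> 29
        index = (hundredths - start_hundredths) // width_hundredths
        if not 0 <= index < n_bins:
            raise ValueError(f"{value} lies outside the chosen intervals")
        counts[index] += 1
    return counts

counts = bin_counts(reaction_times)
for i, count in enumerate(counts):
    lower = 20 + 10 * i  # lower limit of the interval in hundredths of a second
    print(f"{lower / 100:.2f}-{(lower + 9) / 100:.2f} | {'#' * count} {count}")
```

Changing `width_hundredths` and `n_bins` is a quick way to see how the choice of interval width alters the apparent shape of the distribution.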
3.3.3
Cumulative graphs
Often it is useful to display data as a histogram of cumulative frequencies. This is a graph that displays the progressive total of cases (starting at 0 or 0% and finishing at the sample size or 100%) on the Y axis against the increasing value of the variable on the X axis. Table 3.2 and Figure 3.3 give an example for the grouped data from Table 3.1. A cumulative frequency graph can never decrease.
Table 3.2 Data for the reaction time in seconds of 30 biomedical science students listed as frequencies and cumulative frequencies.
Interval range    Number of students    Cumulative total    Cumulative percent
0.20–0.29         1                     1                   3.3
0.30–0.39         5                     6                   20
0.40–0.49         6                     12                  40
0.50–0.59         7                     19                  63.3
0.60–0.69         4                     23                  76.7
0.70–0.79         3                     26                  86.7
0.80–0.89         1                     27                  90
0.90–0.99         0                     27                  90
1.00–1.09         0                     27                  90
1.10–1.19         1                     28                  93.3
1.20–1.29         2                     30                  100
Figure 3.3 The cumulative frequency histogram for the reaction time of 30 students. (X axis, reaction time (seconds); Y axis, cumulative frequency.)
Although I have given the rather tedious manual procedures for constructing histograms, you will find that most statistical software packages have excellent graphics programs for displaying your data. These will automatically select an interval width, summarise the data and plot the graph of your choice.
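As a minimal sketch of what such software does for a cumulative graph, the running totals and percentages in Table 3.2 can be reproduced from the interval counts in Table 3.1 with a few lines of Python. Only the standard library is used; the variable names are chosen for the illustration.

```python
from itertools import accumulate

# Number of students in each interval, 0.20-0.29 through 1.20-1.29 (Table 3.1).
counts = [1, 5, 6, 7, 4, 3, 1, 0, 0, 1, 2]

# Progressive total of cases, then the same totals as a percentage of the sample.
cumulative = list(accumulate(counts))
total = cumulative[-1]
percent = [100 * c / total for c in cumulative]

for count, running, pct in zip(counts, cumulative, percent):
    print(f"{count:>2}  {running:>2}  {pct:5.1f}")
```

Because every count is zero or more, each running total is at least as large as the one before it, which is why a cumulative frequency graph can never decrease.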
Table 3.3 The numbers of four species (A–D) of weevil expressed as their proportions of the total sample of 140. Each proportion is multiplied by 360 to give its width of the pie diagram in degrees, which is shown in Figure 3.4.
Species      Number    Proportion of total number    Degrees
Species A    63        0.45                          162.0
Species B    41        0.29                          105.4
Species C    10        0.07                          25.7
Species D    26        0.19                          66.9
Total        140       1.00                          360
Figure 3.4 A pie diagram of the data in Table 3.3.
3.3.4
Pie diagrams
Data for the relative frequencies in two or more categories that sum to a total of 1 or 100% can be displayed as a pie diagram, which is a circle within which each of the categories is displayed as a ‘slice’, the size of which
(in degrees) is proportional to its value. For example, a sample containing equal numbers of four different species would be displayed as four equal 90° slices. Pie diagrams for only a few categories are very easy to interpret, but when there are more than ten the display will appear cluttered, especially when the slices are distinguished by black, white and shades of grey. Categories with a very small proportion of total cases will appear very narrow and may be overlooked. It is easy to draw a pie diagram. The data for each category are listed, summed to give a total and expressed as the proportion of the total (i.e. as the relative frequency). Each proportion is then multiplied by 360 to give the width of the slice in degrees (Table 3.3), which is used to draw the slices on the pie diagram (Figure 3.4).
Table 3.4 The number of basal cell carcinomas detected and removed from eight locations on the body for 400 males aged from 40 to 50 years during 12 months at a skin cancer clinic in Brisbane, Australia.
Location                   Number of basal cell carcinomas
Head (H)                   211
Neck and shoulders (NS)    103
Arms (A)                   74
Legs (L)                   49
Upper back (UB)            94
Lower back (LB)            32
Chest (C)                  21
Lower abdomen (LA)         12
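The arithmetic for the pie diagram of the weevil data in Table 3.3 can be sketched in Python as follows. The dictionary names are chosen for the illustration; the numbers are those given in the table.

```python
# Numbers of each weevil species in the sample (Table 3.3).
weevils = {"Species A": 63, "Species B": 41, "Species C": 10, "Species D": 26}

total = sum(weevils.values())

# For each species: proportion of the total, then slice width in degrees.
slices = {}
for species, number in weevils.items():
    proportion = number / total
    slices[species] = (proportion, proportion * 360)

for species, (proportion, degrees) in slices.items():
    print(f"{species}: {proportion:.2f} of total, {degrees:.1f} degrees")
```

Note that the degrees are calculated from the unrounded proportions. Multiplying the rounded value 0.29 by 360 would give 104.4° rather than 105.4° for Species B.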
3.4
Displaying ordinal or nominal scale data
When you display data for ordinal or nominal scale variables, you need to modify the form of the graph slightly because the categories are unlikely to be continuous and so the bars need to be separated to indicate this clearly. Here is an example of some nominal scale data. Table 3.4 gives the location of 596 basal cell carcinomas (a form of skin cancer that is most common on sun-exposed areas of the body) detected and removed from 400 males aged
from 40 to 50 years treated over 12 months at a skin cancer clinic in Brisbane, Australia. The locations have been defined as (a) head, (b) neck and shoulders, (c) arms, (d) legs, (e) upper back, (f) lower back, (g) chest and (h) lower abdomen. These can be displayed on a bar graph with the categories in any order along the X axis and the number of cases on the Y axis (Figure 3.5(a)). It often helps to rank the data in order of magnitude to aid interpretation (Figure 3.5(b)).
Figure 3.5 (a) The numbers of basal cell carcinomas detected and removed by location on the body during 12 months at a skin cancer clinic in Brisbane, Australia. (b) The same data but with the numbers of cases for each location ranked in order from most to least. Location codes are given in Table 3.4. (Both panels: X axis, location of basal cell carcinoma; Y axis, number of cases.)
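Ranking nominal categories by their frequency before plotting, as in Figure 3.5(b), is a simple sorting step. The sketch below uses the counts and location codes from Table 3.4; the text-based bars (one character per ten cases) are only an illustrative substitute for a bar graph.

```python
# Number of basal cell carcinomas by location code (Table 3.4).
carcinomas = {"H": 211, "NS": 103, "A": 74, "L": 49,
              "UB": 94, "LB": 32, "C": 21, "LA": 12}

# Sort the categories from the most to the fewest cases.
ranked = sorted(carcinomas.items(), key=lambda item: item[1], reverse=True)

for code, cases in ranked:
    # Separate text bars, one '#' per ten cases (rounded down).
    print(f"{code:<2} | {'#' * (cases // 10)} {cases}")
```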
3.5
Bivariate data
Data where two variables have been measured on each experimental unit can often reveal patterns that may suggest hypotheses or be useful for testing them. Table 3.5 gives two lists of bivariate data for the number of dental caries, which are the holes that develop in decaying teeth, and ages for 20 children between one and nine years old from each of the cities of Hale and Yarvard. Looking at these data, nothing stands out apart from an increase in the number of caries with age. If you calculate descriptive statistics, such as the
average age and average number of dental caries for each of the two groups (Table 3.6), they are not very informative either. You have probably calculated the average (which is often called the mean) for a set of data and this procedure will be described in Chapter 8, but the average is the sum of all the values divided by the sample size. Table 3.6 shows that the children from Yarvard had slightly more dental caries on average than those from Hale, but this is not surprising because the Yarvard sample was also, on average, one year older. But if you graph these data, patterns emerge. One very useful way of displaying bivariate data is to graph them as a two-dimensional plot with increasing values of one variable on the horizontal (or X axis) and increasing values of the second on the vertical (or Y axis). Figures 3.6(a) and (b) show both sets of data with tooth decay (Y axis) plotted against child age (X axis) for each city. These graphs show that tooth decay increases with age, but the pattern differs between cities – in Hale the increase is fairly steady, but in Yarvard it remains low in children up to the age of seven and then suddenly increases. This might suggest hypotheses about the reasons why, or stimulate further investigation (perhaps a child dental care programme or water fluoridation has been in place in Yarvard for the past eight years compared to no action on decay in Hale). Of course, there is always the possibility that the samples are different due to chance, so the first step in any further investigation might be to repeat the sampling using much larger numbers of children from each city.
Table 3.5 The number of dental caries and the age in years of 20 children chosen at random from each of the two cities of Hale and Yarvard.
Hale caries    Hale age    Yarvard caries    Yarvard age
1              3           10                9
1              2           1                 5
4              4           12                9
4              3           1                 2
5              6           1                 2
6              5           11                9
2              3           2                 3
9              9           14                9
4              5           2                 6
2              1           8                 9
7              8           1                 1
3              4           4                 7
9              8           1                 1
11             9           1                 5
1              2           7                 8
1              4           1                 7
3              7           1                 6
1              1           1                 4
1              1           2                 6
6              5           1                 2
Table 3.6 The average number of dental caries and age of 20 children chosen at random from each of the two cities of Hale and Yarvard.
                       Hale    Yarvard
Average caries         4.05    4.10
Average age (years)    4.50    5.50
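The averages in Table 3.6, and the pattern that only appears when caries are considered together with age, can be reproduced with a short Python sketch. The data are those in Table 3.5, assuming each child's caries and age appear in the same position in the two lists; the helper name `mean_caries_by_age` is hypothetical and is used here only as a numerical stand-in for looking at the graphs.

```python
from statistics import mean

# Caries and ages for the 20 children from each city (Table 3.5).
hale_caries = [1, 1, 4, 4, 5, 6, 2, 9, 4, 2, 7, 3, 9, 11, 1, 1, 3, 1, 1, 6]
hale_age = [3, 2, 4, 3, 6, 5, 3, 9, 5, 1, 8, 4, 8, 9, 2, 4, 7, 1, 1, 5]
yarvard_caries = [10, 1, 12, 1, 1, 11, 2, 14, 2, 8, 1, 4, 1, 1, 7, 1, 1, 1, 2, 1]
yarvard_age = [9, 5, 9, 2, 2, 9, 3, 9, 6, 9, 1, 7, 1, 5, 8, 7, 6, 4, 6, 2]

def mean_caries_by_age(ages, caries):
    """Average number of caries for each age present in the sample (hypothetical helper)."""
    by_age = {}
    for age, n in zip(ages, caries):
        by_age.setdefault(age, []).append(n)
    return {age: mean(values) for age, values in sorted(by_age.items())}

# The overall averages: sum of the values divided by the sample size.
print("Hale:   ", mean(hale_caries), mean(hale_age))
print("Yarvard:", mean(yarvard_caries), mean(yarvard_age))

# Averages within each age reveal the difference between the cities.
hale_by_age = mean_caries_by_age(hale_age, hale_caries)
yarvard_by_age = mean_caries_by_age(yarvard_age, yarvard_caries)
print("Hale by age:   ", hale_by_age)
print("Yarvard by age:", yarvard_by_age)
```

The overall means are almost identical for the two cities, whereas the averages within each age show the sudden increase in the oldest Yarvard children.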
3.6
Multivariate data
Often life scientists have data for three or more variables measured on the same experimental unit. For example, a taxonomist might have
data for attributes that help distinguish among different species (e.g. wing length, eye shape, body length, body colour, number of bristles), while a marine ecologist might have data for the numbers of several species of marine invertebrates present in areas experiencing different levels of pollution.
Figure 3.6 The number of dental caries plotted against the age of 20 children chosen at random from each of the two cities of (a) Hale and (b) Yarvard. (Both panels: X axis, age (years); Y axis, number of caries.)
Results for three variables can be shown as three-dimensional graphs, but for more than this number of variables a direct display is difficult to interpret. Some relatively new statistical techniques have made it possible to condense and summarise multivariate data in a two-dimensional display and these are described in Chapter 22.
3.7
Summary and conclusion
Graphs may reveal patterns in data sets that are not obvious from looking at lists or calculating descriptive statistics and can therefore provide an easily understood visual summary of a set of results. In later chapters, there will be discussion of displays such as boxplots and probability plots, which can be used to decide whether the data set is suitable for a particular analysis. Most modern statistical software packages have easy to use graphics options that produce high-quality graphs and figures. These are very useful for life scientists who are writing assignments, reports or scientific publications.
4
Introductory concepts of experimental design
4.1
Introduction
To generate hypotheses you often sample different groups or places (which is sometimes called a mensurative experiment because you usually measure something, such as height or weight, on each sampling unit) and explore these data for patterns or associations. To test hypotheses, you may do mensurative experiments, or manipulative experiments where you change a condition and observe the effect of that change on each experimental unit (like the experiment with millipedes and light described in Chapter 2). Often you may do several experiments of both types to test a particular hypothesis. The quality of your sampling and the design of your experiment can have an effect upon the outcome and therefore determine whether your hypothesis is rejected or not, so it is absolutely necessary to have an appropriate experimental design.
First, you should attempt to make your measurements as accurate and precise as possible so they are the best estimates of actual values. Accuracy is the closeness of a measured value to the true value. Precision is the ‘spread’ or variability of repeated measures of the same value. For example, a thermometer that consistently gives a reading corresponding to a true temperature (e.g. 20°C) is both accurate and precise. Another that gives a reading consistently higher (e.g. +10°C) than a true temperature is not accurate, but it is very precise. In contrast, a thermometer that gives a reading that fluctuates around a true temperature is not precise and will usually be inaccurate except when it occasionally happens to correspond to the true temperature. Inaccurate and imprecise measurements or a poor or unrealistic sampling design can result in the generation of inappropriate hypotheses. Measurement errors or a poor experimental design can give a false or misleading outcome that may result in the incorrect retention or rejection of an hypothesis. The following is a discussion of some important essentials of sampling and experimental design.
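The distinction between accuracy and precision in the thermometer example can be made concrete with a small Python sketch. All of the readings below are invented for illustration. Accuracy is summarised here as the difference between the average reading and the true temperature, and precision as the standard deviation of the repeated readings; these are common summaries, but they are assumptions of this sketch rather than definitions given in the text.

```python
from statistics import mean, stdev

true_temperature = 20.0  # degrees Celsius

# Hypothetical repeated readings from three thermometers.
readings = {
    "accurate and precise": [20.0, 20.1, 19.9, 20.0, 20.0],
    "precise but not accurate": [30.0, 30.1, 29.9, 30.0, 30.0],
    "not precise": [17.5, 23.0, 19.0, 21.5, 16.0],
}

summary = {}
for label, values in readings.items():
    bias = mean(values) - true_temperature  # closeness to the true value
    spread = stdev(values)                  # variability of repeated measures
    summary[label] = (bias, spread)
    print(f"{label}: average error {bias:+.2f}, spread {spread:.2f}")
```

The second thermometer has a large average error but a small spread, while the third has a much larger spread than either of the others.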
4.2
Sampling – mensurative experiments
Mensurative experiments are often a good way of generating or testing predictions from hypotheses (an example of the latter is ‘I think millipedes are attracted to light at night. So if I sample 50 well-lit houses and 50 that are not well-lit, those in the first group should, on average, contain more millipedes than the second’). You have to be careful when interpreting the results of mensurative experiments because you are sampling an existing condition, rather than manipulating conditions experimentally, so there may be some other difference between your groups. For example, well-lit houses may have a design which makes it easier for millipedes to get inside, and light may not be important at all.
4.2.1
Confusing a correlation with causality
First, a correlation is often mistakenly interpreted as indicating causality. A correlation between two variables means they vary together. A positive correlation means that high values of one variable are associated with high values of the other, while a negative correlation means that high values of one variable are associated with low values of the other. For example, the graph in Figure 4.1 shows a positive correlation between the population
Figure 4.1 An example of a positive correlation. The numbers of mice and the
weight of wheat plants per square metre increase together.
Figure 4.2 The involvement of a third variable, soil moisture, which
determines the number of mice and kilograms of wheat per square metre. Even though there is no causal relationship between the number of mice and weight of wheat, these two variables are positively correlated.
density of mice per square metre and the weight of wheat plants in kilograms per square metre at ten sites chosen at random within a large field. It seems logical that the amount of wheat might be the cause of differences in the numbers of mice (which may be eating the wheat or using it for shelter), but even if there is a very obvious correlation between any two variables, it does not necessarily show that one is responsible for the other. The correlation may have occurred by chance, or a third unmeasured variable might determine the numbers of the two variables studied. For example, soil moisture may determine both the number of mice and the weight of wheat (Figure 4.2). Therefore, although there is a causal relationship between soil moisture and each of the other two variables, they are not causally related themselves.
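The role of a third variable can be sketched with a small simulation. Everything here is an assumption made for illustration: the numbers, the linear dependence on soil moisture, and the use of 200 simulated sites (rather than ten) so that the pattern is stable. Mice and wheat are each generated from soil moisture only, with no link between them, and the correlation is then recalculated after allowing for moisture using the standard first-order partial correlation formula.

```python
import math
import random

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def partial_r(x, y, z):
    """Correlation between x and y after removing the linear effect of z."""
    rxy, rxz, ryz = pearson_r(x, y), pearson_r(x, z), pearson_r(y, z)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))

rng = random.Random(1)  # fixed seed so the sketch is repeatable

# Hypothetical sites: soil moisture drives both variables independently
moisture = [rng.uniform(10, 40) for _ in range(200)]
mice = [2 + 0.5 * m + rng.gauss(0, 1.5) for m in moisture]      # mice per m²
wheat = [0.2 + 0.05 * m + rng.gauss(0, 0.15) for m in moisture]  # kg per m²

r_raw = pearson_r(mice, wheat)
r_given_moisture = partial_r(mice, wheat, moisture)

print(f"correlation between mice and wheat:   {r_raw:.2f}")
print(f"same, after allowing for moisture:    {r_given_moisture:.2f}")
```

In this sketch the raw correlation is strongly positive even though neither variable affects the other, and it largely disappears once moisture is taken into account. In a real study the third variable is often unmeasured, so this check is not available.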
4.2.2
The inadvertent inclusion of a third variable: sampling confounded in time or space
Occasionally, researchers have no choice but to sample different populations of the same species, or different habitats, at different times. These results should be interpreted with great caution because changes occurring over time may contribute to differences (or the lack of them) among samples. The sampling is said to be confounded in that more than one variable may be having an effect on the results. Here is an example of sampling that is confounded in time. An ecologist hypothesised that the density of above-ground vegetation might affect the population density of earthworms, and therefore sampled several different areas for these two variables. The work was very
time-consuming because the earthworms had to be sampled by taking cores of soil, and unfortunately the ecologist had no help. Therefore, areas of low vegetation density were sampled in January, low to moderate density in February, moderate density in March and high density in April. The sampling showed a negative correlation between vegetation density and earthworm density. Unfortunately, however, the density of earthworms was the same in all areas but decreased as the year progressed (and the ecologist did not know this), so the negative correlation between earthworm density and vegetation density was an artefact of the sampling of different places being confounded in time. This is an example of a common problem, and you are likely to find similar cases in many published scientific papers and reports. Sampling can also be spatially confounded. For example, samples of aquatic plants and animals are often taken at several locations increasingly distant from a point source of pollution, such as a sewage outfall. A change in species diversity, especially if it consistently increases or decreases with distance from the point source, is often interpreted as evidence for the pollutant having an effect. This may be so, but the sampling is spatially confounded and some feature of the environment apart from the concentration of the pollutant may be responsible for the difference. Here it would be very useful to have data for the same sites before the pollution had occurred.
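The earthworm example can be reproduced with a few lines of Python. The densities are hypothetical: earthworm density is set to be identical in every area within a month and to fall from January to April. Sampling one vegetation class per month then produces a perfect negative correlation, whereas sampling every vegetation class in every month (a design the ecologist could not manage alone) shows none.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical earthworm density (per m²): same in all areas, declining over time
worms_by_month = {"Jan": 50, "Feb": 40, "Mar": 30, "Apr": 20}

# Vegetation density scores: 1 = low, 2 = low to moderate, 3 = moderate, 4 = high
vegetation_scores = [1, 2, 3, 4]

# Confounded design: each vegetation class sampled in a different month
schedule = {1: "Jan", 2: "Feb", 3: "Mar", 4: "Apr"}
worms_confounded = [worms_by_month[schedule[v]] for v in vegetation_scores]
r_confounded = pearson_r(vegetation_scores, worms_confounded)

# Unconfounded design: every vegetation class sampled in every month
veg_crossed, worms_crossed = [], []
for v in vegetation_scores:
    for month in worms_by_month:
        veg_crossed.append(v)
        worms_crossed.append(worms_by_month[month])
r_crossed = pearson_r(veg_crossed, worms_crossed)

print(f"confounded in time: r = {r_confounded:.2f}")
print(f"sampled in all months: r = {r_crossed:.2f}")
```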
4.2.3
The need for independent samples in mensurative experiments
Often researchers sample the numbers, or population density, of a species in relation to an environmental gradient (such as depth in a lake) to see if there is any correlation between density of the species and the gradient of interest. There is an obvious need to replicate the sampling – that is, to independently estimate density more than once. For example, consider sampling Dark Lake, Wisconsin, to investigate the population density of freshwater prawns in relation to depth. If you only sampled at one place (Figure 4.3(a)), the results would not be a good indication of changes in the population density of prawns with depth in the lake. The sampling needs to be replicated, but there is little value in repeatedly sampling one small area (e.g. by taking several samples
Figure 4.3 Variation in the number of freshwater prawns per cubic metre of
water at two different depths in Dark Lake, Wisconsin. (a) An unreplicated sample taken at only one place (*) would give a very misleading indication of differences in the population density of prawns with depth within the entire lake. (b) Several replicates taken from only one place (*****) would still give a very misleading indication. (c) Several replicates taken at random across the lake (*) would give a much better indication.
under ***** in Figure 4.3(b)) because this still will not give an accurate indication of changes in population density with depth in the whole lake (although it may give a very accurate indication of conditions in that particular part of the lake). This sort of sampling is one aspect of what Hurlbert (1984) called pseudoreplication and is still a very common flaw in a lot of scientific research. The replicates are ‘pseudo’ – sham or unreal – because they are unlikely to truly describe what is occurring across the entire area being discussed (in this case the lake). A better design would be to sample at several places chosen at random within the lake as shown in Figure 4.3(c).
This type of inappropriate sampling is very common. Here is another example. A researcher sampled a large coral reef by dropping a 1 m² square frame, subdivided into a grid of 100 equal-sized squares, at random in one place only and then took one sample from each of the smaller squares. Although these 100 replicates may very accurately describe conditions within the sampling frame, they may not necessarily describe the remaining 9999 m² of the reef and would be pseudoreplicates if the results were interpreted in this way. A more appropriate design would be to sample 100 replicates chosen at random across the whole reef.
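The difference between the two reef designs can be sketched as a simulation. The reef is modelled, purely as an assumption, as a 100 m × 100 m grid in which density increases steadily from west to east; the frame position, the density function and the measurement error are all hypothetical.

```python
import random
from statistics import mean

SIDE = 100  # the reef is modelled as a 100 m x 100 m grid of 1 m² cells
rng = random.Random(42)  # fixed seed so the sketch is repeatable

def true_density(x, y):
    """Hypothetical density (organisms per m²), increasing from west to east."""
    return 10 + 0.2 * x

def measure(x, y):
    """One sample from cell (x, y), with a small random measurement error."""
    return true_density(x, y) + rng.gauss(0, 1)

true_mean = mean(true_density(x, y) for x in range(SIDE) for y in range(SIDE))

# Design 1: the frame lands in one place and all 100 samples come from it
frame = (5, 20)
clustered = [measure(*frame) for _ in range(100)]

# Design 2: 100 samples from cells chosen at random across the whole reef
cells = [(x, y) for x in range(SIDE) for y in range(SIDE)]
spread_out = [measure(x, y) for x, y in rng.sample(cells, 100)]

clustered_error = abs(mean(clustered) - true_mean)
random_error = abs(mean(spread_out) - true_mean)

print(f"true mean density of the reef: {true_mean:.1f}")
print(f"error with one frame:          {clustered_error:.1f}")
print(f"error with random sampling:    {random_error:.1f}")
```

The 100 samples within the frame agree closely with each other, so they look convincing, but they describe only that square metre. How misleading they are for the whole reef depends on where the frame happens to land.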
4.2.4
The need to repeat sampling on several occasions and elsewhere
In the example described above, the results of sampling Dark Lake can only confidently be discussed in relation to that particular lake on that day. Therefore, when interpreting such results you need to be cautious. Sampling the same lake on several different occasions will strengthen the findings and may be sufficient if you are only interested in that lake. Sampling more than one lake will make the results more able to be generalised. Inappropriate generalisation is another example of pseudoreplication because data from one location may not hold in the more general case. At the same time, however, even if your study is limited, you can still make more general predictions from your findings, provided these are clearly identified as such.
4.3
Manipulative experiments
4.3.1
Independent replicates
It is essential to have several independent replicates of any treatment used in an experiment. I mentioned this briefly when describing the millipedes and light experiment in Chapter 2 and said if there were only one lit and one unlit tile, any difference between them could have simply been due to chance or some other unknown factor(s). As the number of randomly chosen independent replicates increases, so does the likelihood that a difference between the experimental group and the control group is a result of the experimental treatment. The following example is deliberately absurd, because I will use it later in this chapter to discuss a lack of replication that is not so obvious.
Imagine you were asked to test the hypothesis that vitamin C caused guinea pigs to grow more rapidly. You obtained two six-week old guinea pigs of the same sex and weight, caged them separately and offered one an unlimited amount of commercial rodent food plus 20 mg of vitamin C per day, while the other was only offered an unlimited amount of commercial rodent food. The two guinea pigs were re-weighed after three months and the results were obvious – the one that received vitamin C was 40% heavier than the other. This result is consistent with the hypothesis but there is an obvious flaw in the experiment – with only one guinea pig in each treatment, any differences between them may be due to some inherent difference between the guinea pigs, their treatment cages or both. For example, the slow-growing guinea pig may, by chance, have been heavily infested with intestinal parasites. There is a need to replicate this experiment and the replicates need to be truly independent – it is not sufficient to have ten ‘vitamin C’ guinea pigs together in one cage and ten control guinea pigs in another, because any differences between treatments may still be caused by some difference between the cages. There will be more about this shortly.
4.3.2
Control treatments
Control treatments are needed because they allow the experimenter to isolate the reason why something is occurring in an experiment by comparing two treatments that differ by only one factor. Frequently the need for a rigorous experimental design makes it necessary to have several different treatments, with more than one being considered as controls. Here is an example. Herbivorous species of marine snails are often common in rock pools on the shore, where they eat algae that grow on the sides of the pools. Very occasionally these snails are seen being attacked and eaten by carnivorous species of intertidal snails, which also occur in the pools. An ecologist was surprised that such attacks occurred so infrequently and hypothesised that it was because the herbivorous snails showed ‘avoidance’ by climbing out of the water in response to water-borne odours from their predators. The null hypothesis is ‘Herbivorous snails will not avoid their predators’ and the alternate hypothesis is ‘Herbivorous snails will avoid their predators.’ One prediction that might distinguish between these hypotheses is
Table 4.1 Breakdown of three treatments into their effects upon herbivorous snails.
Predator       Control for disturbance    Control for time
predator       -                          -
disturbance    disturbance                -
time           time                       time
‘Herbivorous snails will crawl out of a pool to which a predatory snail has been added.’ It could be tested by dropping a predatory snail into a rock pool and seeing how many herbivorous snails crawled out during the next five minutes.
Unfortunately, this experiment is not controlled. By adding a predator and waiting for five minutes, several things have happened to the herbivorous snails in the pool. Certainly, you are adding a predator. But the pool is also being disturbed, simply by adding something (the predator) to it. Also, the experiment is not well controlled in terms of time because five minutes have elapsed while the experiment is being done. Therefore, even if all the herbivorous snails did crawl out of the pool, the experimenter could not confidently attribute this behaviour to the addition of the predator – the snails may have been responding to the disturbance, to the pool having warmed up in the sun or to many other factors.
One improvement to the experiment would be a control for the disturbance associated with adding a predator. This is often done by including another pool into which a small stone about the size of the predator is dropped as ‘something added to the pool’. Another important improvement would be to include a control pool to which nothing was added. At this stage, by incorporating these improvements, you would have three treatments. Table 4.1 lists what each is doing to the snails.
For such a simple hypothesis, ‘Herbivorous snails will avoid their predators’, the experiment has already expanded to three treatments. But many ecologists are likely to say that even this design is not adequate because the ‘predator’ treatment is the only one in which a snail has been added. Therefore, even if all or most snails crawled out of the pools in the treatment with the predator, but remained submerged in the other two, the response may have been only a response to the addition of any living snail rather than
Table 4.2 Breakdown of four treatments into their effects upon herbivorous snails.
Predator       Control for snail    Control for disturbance    Control for time
predator       herbivore            -                          -
disturbance    disturbance          disturbance                -
time           time                 time                       time
a predator. Ideally, a fourth treatment should be included where an herbivorous snail is added to control for this (Table 4.2).
By now you may well be thinking that the above design is far too finicky. Nevertheless, experiments do have to have appropriate controls so that the effects of each potentially contributing factor can be isolated. Furthermore, the design would have to be replicated – you could not just do it once using four pools because any difference among treatments may result from some difference among the pools rather than the actual treatments applied. I have done this experiment (McKillup and McKillup, 1993) and included all the treatments listed in Table 4.2 with six replicates, using 24 pools altogether.
It is often difficult to work out what control treatments you need to incorporate into a manipulative experiment. One way to clarify these is to list all of the things you are actually doing in an experimental treatment and make sure you have appropriate controls for each.
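The advice to list everything you are doing in each treatment can be made concrete by writing each column of Table 4.2 as a set and comparing treatments pair by pair. This is only a bookkeeping sketch; the factor names are taken from the table and the helper function is hypothetical.

```python
# Each treatment is the set of things done to the pool (columns of Table 4.2)
treatments = {
    "predator": {"predator", "disturbance", "time"},
    "control for snail": {"herbivore", "disturbance", "time"},
    "control for disturbance": {"disturbance", "time"},
    "control for time": {"time"},
}

def differences(first, second):
    """Factors present in one treatment but not in the other."""
    a, b = treatments[first], treatments[second]
    return a - b, b - a

comparisons = [
    ("predator", "control for snail"),
    ("control for snail", "control for disturbance"),
    ("control for disturbance", "control for time"),
]

for first, second in comparisons:
    only_first, only_second = differences(first, second)
    print(f"{first} vs {second}:")
    print(f"  only in {first}: {sorted(only_first)}")
    print(f"  only in {second}: {sorted(only_second)}")
```

Each comparison isolates one thing: the kind of snail added, the addition of a snail at all, and the disturbance itself. A comparison that differed by several factors at once would signal a missing control.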
4.3.3
Other common types of manipulative experiments where treatments are confounded with time
Many experiments confound treatments with time. For example, experiments designed to evaluate the effect of a particular treatment or intervention often measure some variable (e.g. blood pressure) of the same group of experimental subjects before and after administration of an experimental drug. Any change is attributed to the effect of the drug. Here, however, several different things have been done to the treatment group. I will use blood pressure as an example, but the concept applies to any ‘before and after’ experiment.
First, time has elapsed, but blood pressure can change over a matter of minutes or hours in response to many factors, including fluctuations in room temperature. Second, the group has been given a drug, but studies have shown that administration of even an empty capsule or an injection of saline (these are called placebo treatments) can affect a person’s blood pressure. Third, each person in the group has had their blood pressure measured twice. Many people are ‘white coat hypertensive’ – their blood pressure increases substantially at the sight of a physician approaching with the inflatable cuff and pressure gauge used to measure blood pressure. An improvement to this experiment would at least include a group that was treated in exactly the same way as the experimental one, except that the subjects were given an appropriate placebo. This would at least isolate the effect of the drug from the other ways in which both groups had been disturbed. Consequently, well-designed medical experiments often include ‘sham operations’ where the control subjects are treated in the same way as the experimental subjects, but do not receive the experimental manipulation. For example, early experiments to investigate the function of the parathyroid glands, which are small patches of tissue within the thyroid, included an experimental treatment where the parathyroids were completely removed from several dogs. A control group of dogs had their thyroids exposed and cut, but the parathyroids were left in place.
4.3.4
Pseudoreplication
One of the nastiest pitfalls is having a manipulative experimental design which appears to be replicated when it really is not. This is another aspect of ‘pseudoreplication’ described by Hurlbert (1984) who invented the word – before then it was just called ‘bad design’. Here is an example that relates back to the discussion about the need for replicates. An aquacultural scientist hypothesised that a diet which included additional vitamin A would increase the growth rate of prawns. He was aware of the need to replicate the experiment, so set up two treatment ponds, each containing 1000 prawns of the same species and of similar weight and age from the same hatchery. One pond was chosen at random and the 1000 prawns within it were fed commercial prawn food plus vitamin A, while the 1000 prawns in the second pond were only fed standard
commercial prawn food. After six months, the prawns were harvested and weighed. The prawns that had received vitamin A were twice as heavy, on average, as the ones that had not. The scientist was delighted – an experiment with 1000 replicates of each treatment had produced a result consistent with the hypothesis.
Unfortunately, there are not 1000 truly independent replicates in each pond. All prawns receiving the vitamin A supplement were in pond 1 and all those receiving only standard food were in pond 2. Therefore, any difference in growth may, or may not, have been due to the vitamin – it could have been caused by some other (perhaps unknown) difference between the two ponds. The experimental replicates are the ponds and not the prawns, so the experiment has no effective replication at all and is essentially the same as the absurd unreplicated experiment with only two guinea pigs described earlier in this chapter.
An improvement to the design would be to run each treatment in several ponds. For example, an experiment with five ponds within each treatment, each of which contains 200 prawns, has been replicated five times. But here too it is still necessary to have truly independent replicates – you should not subdivide two ponds into five enclosures and run one treatment in each pond. That would be one case of apparent replication. Here are four further examples.
(a) Even if you have several separate replicates of each treatment (e.g. five treatment aquaria and five control aquaria), the arrangement of these can lead to a lack of independence. For convenience and to reduce the risk of making a mistake, you might have the treatment aquaria in a group at one end of a laboratory bench and the control aquaria at the other. But there may be some known or unknown feature of the laboratory (e.g. light levels, ventilation, disturbance) that affects one group of aquaria differently to the other (Figure 4.4(a)).
(b) Replicates could be placed alternately. If you decided to get around the clustering problem described above by placing treatments and controls alternately (i.e. by placing, from left to right, treatment #1, control #1, treatment #2, control #2, treatment #3 etc.), there can still be problems. Just by chance all the treatment aquaria (or all the controls) might be under regularly placed laboratory ceiling lights, windows or subject to some other regular feature you are not even aware of (Figure 4.4(b)).
Figure 4.4 Three cases of apparent replication. (a) Clustering of replicates
means that there is no independence among controls or treatments. (b) A regular arrangement of treatments and controls may, by chance, correspond to some feature of the environment (here the very obvious ceiling lights) that might affect the results. (c) Clustering of temperature treatments within particular incubators.
(c) Because of a shortage of equipment, you may have to have all of your replicates of one temperature treatment in only one controlled temperature cabinet and all replicates of another temperature treatment in only one other. Unfortunately, if there is something peculiar about one cabinet, in addition to temperature, then either the experimental or control treatment may be affected. This pattern is called ‘isolative segregation’ (Figure 4.4(c)).
(d) The final example is more subtle. Imagine you decided to test the hypothesis that ‘Water with a high nitrogen content increases the growth of freshwater mussels.’ You set up five control aquaria and five experimental aquaria, which were placed on the bench in a completely randomised pattern to prevent problems (a) and (b) above. All tanks had to have water constantly flowing through them, so you set up one storage tank of high nitrogen water and one of low nitrogen water. Water from each storage tank was piped into five aquaria as shown in Figure 4.5.
Figure 4.5 The positions of the treatment tanks are randomised, but all tanks
within a treatment share water from one supply tank.
This looks fine, but unfortunately all five aquaria within each treatment are sharing the same water. All in the ‘high nitrogen’ treatment receive water from storage tank A. Similarly, all aquaria in the control receive water from storage tank B. So any difference in mussel growth between treatments may be due either to the nitrogen or some other feature of the storage tanks and this design is little better than the case of isolative segregation in (c) above. Ideally, each aquarium should have its own separate and independent water supply. Finally, the allocation of replicate tanks to treatments should be done using a method that removes any possibility of unintentional bias by the experimenter. For example, the toss of a coin was used to allocate pairs of tiles to lit and unlit treatments in the experiment with millipedes and light described in Section 2.2.
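A coin toss works well for pairs. For a larger set of tanks, one way to remove experimenter bias is to let a random number generator decide which bench positions receive which treatment. The following is a minimal sketch; the function name, the seed and the numbering of bench positions are assumptions made for illustration.

```python
import random

def allocate(units, treatments, seed=None):
    """Randomly assign equal numbers of units to each treatment."""
    if len(units) % len(treatments) != 0:
        raise ValueError("units cannot be divided equally among treatments")
    rng = random.Random(seed)
    shuffled = list(units)
    rng.shuffle(shuffled)  # random order removes any choice by the experimenter
    per_group = len(shuffled) // len(treatments)
    return {
        treatment: sorted(shuffled[i * per_group:(i + 1) * per_group])
        for i, treatment in enumerate(treatments)
    }

# Ten aquaria identified by their position along the bench (1 = far left)
bench_positions = list(range(1, 11))
allocation = allocate(bench_positions, ["high nitrogen", "control"], seed=2011)

for treatment, positions in allocation.items():
    print(f"{treatment}: bench positions {positions}")
```

Random allocation guards against the clustered and regular arrangements in (a) and (b). It does nothing about shared cabinets or shared water supplies, as in (c) and (d), which have to be dealt with in the physical set-up.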
4.4
Sometimes you can only do an unreplicated experiment
Although replication is desirable in any experiment, there are some cases where it is not possible. For example, when doing large-scale mensurative or manipulative experiments on systems such as rivers there may only be one polluted river available to study. Although you cannot attribute the reason for any difference or the lack of it to the treatment (e.g. a polluted versus a relatively unpolluted river), because you only have one replicate, the results are still useful. First, they are still evidence for or against your hypothesis and can be cautiously discussed while noting the lack of replication. Second,
it may also be possible to achieve replication by analysing your results in conjunction with those from similar studies (e.g. comparisons of other polluted and unpolluted rivers) done elsewhere by other researchers. This is called a meta-analysis. Finally, the results of a large-scale but unreplicated experiment may suggest smaller-scale experiments that can be done with replication so that you can continue to test the hypothesis.
4.4.1
Lack of replication in environmental impact assessment
Quite often an event likely to have an effect upon a particular habitat or ecosystem and therefore cause an environmental impact is unreplicated (e.g. one nuclear reactor meltdown, a spill from one oil tanker or heavy metal contamination downstream from only one mine). Furthermore, data may not be available for the impact site before the event (e.g. a stretch of remote coastline before a spill from an oil tanker). One way of assessing data that are only available after an event at one impact site is to compare it to several control sites chosen at random with the constraint that all have similar environmental conditions (e.g. for an impact site in a north-facing bay it is desirable to have control sites in similar bays rather than on the exposed coast). If the impact and control sites are found to be similar, it is some evidence for no measurable impact, but if the impact site is grossly different to the controls (which are relatively similar), it is some evidence for an impact. Furthermore, if all the sites are repeatedly monitored after the event and the impact site becomes more similar to the controls and therefore recovers over time, this is stronger evidence for an impact having occurred. When planning a development likely to have an environmental impact it would be far better to have data for both control and ‘impact’ sites before and after construction, plus an ongoing sampling programme. These are called ‘before-after-control-impact’ (abbreviated to ‘BACI’) designs and a very clear introduction is given by Manly (2001).
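The logic of a before-after-control-impact comparison can be sketched as simple arithmetic: the change at the impact site is compared with the change at the control sites over the same period. The numbers below are invented, and the sketch shows only the calculation of this contrast, not a statistical test of it.

```python
from statistics import mean

# Hypothetical numbers of species per sample (made-up data)
samples = {
    ("control", "before"): [21, 19, 20, 22],
    ("control", "after"): [18, 17, 19, 18],
    ("impact", "before"): [20, 21, 19, 20],
    ("impact", "after"): [9, 11, 10, 10],
}

def baci_contrast(data):
    """Change at the impact site minus the change at the control sites."""
    change_impact = mean(data[("impact", "after")]) - mean(data[("impact", "before")])
    change_control = mean(data[("control", "after")]) - mean(data[("control", "before")])
    return change_impact - change_control

contrast = baci_contrast(samples)
print(f"BACI contrast: {contrast:+.1f} species per sample")
```

A contrast near zero means both kinds of site changed by about the same amount, which suggests a general change over time rather than an impact. A large contrast is consistent with an impact, although with only one impact site the caution about lack of replication still applies.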
4.5
Realism
Even an apparently well-designed mensurative or manipulative experiment may still suffer from a lack of realism. Here are two examples.
The first is a mensurative experiment on the incidence of testicular torsion, which can occur in males when the testicular artery that supplies the testis with oxygenated blood becomes twisted. This can restrict or cut off the blood supply and thereby damage or kill the testis. Apparently, it is an extremely painful condition and usually requires surgery to either restore blood flow or remove the damaged testis. Testes retract closer to the body as temperature decreases, so a physician hypothesised that the likelihood of torsion would be greater during winter compared to summer. Their alternate hypothesis was ‘Retraction of the testis during cold weather increases the incidence of testicular torsion’. The null hypothesis was ‘Retraction of the testis during cold weather does not increase the incidence of testicular torsion.’ The physician found that the incidence of testicular torsion was twice as high during winter compared to summer in a sample from a small town in Alaska. Unfortunately, there were very few affected males (six altogether) in the sample so this difference may have occurred simply by chance, making it impossible to distinguish between these hypotheses. A few years later, another researcher obtained data from a much larger sample of 96 affected males from hospital records in north Queensland, Australia. Although they found no difference in the incidence of testicular torsion between summer and winter this may not have been a realistic test of the hypothesis, because even Alaskan summers are considerably cooler than tropical north Queensland winters. Second, an experiment to investigate factors affecting the selection of breeding sites by the mosquito Anopheles farauti offered adult females a choice of salinities ranging from 0, 5, 10, 15, 20, 25, 30 to 35 ‰. Eggs were laid in all but the two highest salinities (30 ‰ and 35 ‰). The conclusion was that salinity significantly affects the choice of breeding sites by mosquitoes. 
Unfortunately, the salinity in the habitat where the mosquitoes occurred never exceeded 10 ‰, again making the choice of treatments unrealistic.
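The point about the small Alaskan sample can be checked with a short calculation. Two assumptions are made here that are not stated in the text: that ‘twice as high’ with six cases means four in winter and two in summer, and that, if season had no effect, each case would be equally likely to fall in either of the two seasons being compared.

```python
from math import comb

def prob_at_least(k, n, p=0.5):
    """Probability of k or more 'winter' cases out of n when each has chance p."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k, n + 1))

# Assumed split of the six Alaskan cases: four in winter, two in summer
p_small_sample = prob_at_least(4, 6)

# Hypothetical comparison: the same 2:1 ratio in a sample of 96 (64 winter, 32 summer)
p_large_sample = prob_at_least(64, 96)

print(f"4 or more of 6 in winter by chance:   {p_small_sample:.3f}")
print(f"64 or more of 96 in winter by chance: {p_large_sample:.5f}")
```

Under these assumptions a split at least as uneven as four to two would arise by chance about one time in three, so it cannot distinguish between the hypotheses. The same ratio in a much larger sample would be very unlikely to occur by chance alone.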
4.6
A bit of common sense
By now, you may be quite daunted by the challenge of being able to design a good experiment, but provided you have appropriate controls, replicates
and have thought about any obvious problems of pseudoreplication and realism, you are well on the way to a good design. Furthermore, the desire for a near-perfect design has to be balanced against financial constraints as well as the space and time available to do the experiment. Often it is not possible to have more than two incubators or as many replicates as you would like. It also depends on the type of science. For example, many microbiologists working with organisms they grow on agar plates, where conditions can be strictly controlled, would never be concerned about clustering of replicates or isolative segregation because they would be confident that conditions did not vary in different parts of the laboratory and their incubators only differed in relation to temperature. Most of the time they may be right, but considerations about experimental design need to be borne in mind by all scientists. Also, you may not have the resources to do a large manipulative field experiment at more than one site. Although, strictly speaking, the results cannot be generalised to other sites, they may nevertheless apply and careful interpretation and discussion of results can include predictions that are more general. For example, the ‘millipede and light’ experiment described in Chapter 2 was initially done during one night at one site. It was repeated on the following night at the same site in the presence of some colleagues (who were quite sceptical until the experiment had been running for 20 minutes) and later repeated at two other sites as well as in the laboratory. All the results were consistent with the hypothesis, so I concluded ‘Portuguese millipedes are attracted to visible light at night.’ Nevertheless, the hypothesis may not apply to all populations of O. moreleti, but, to date, there has been no evidence to the contrary.
4.7
Designing a ‘good’ experiment
Designing a well-controlled, appropriately replicated and realistic experiment has been described by some researchers as an ‘art’. It is not, but there are often several different ways to test the same hypothesis and therefore several different experiments that could be done. Because of this, it is difficult to give a guide to designing experiments beyond an awareness of the general principles discussed in this chapter.
4.7.1
Good design versus the ability to do the experiment
It has often been said ‘There is no such thing as a perfect experiment.’ One inherent problem is that as a design gets better and better the cost in time and equipment also increases, but the ability actually to do the experiment decreases (Figure 4.6). An absolutely perfect design may be impossible to carry out. Therefore, every researcher must choose a design that is good enough, but still practical. There are no set rules for this: the decision on design is made by the researcher and will eventually be judged by their colleagues, who examine any report from the work.
4.8
Reporting your results
It has been said ‘Your science is of little use unless other people know about it.’ Even if you have an excellent experimental design, the results are unlikely to make a difference to our understanding of the world unless other researchers know about them. The conventional and widely accepted way of doing this is to write up the experiment as a report that is published as a paper in a scientific journal. These journals have procedures for assessing the quality of submitted manuscripts and are unlikely to publish work below a certain standard. Manuscripts are usually assessed by ‘peer review’ where the editor of the journal reads either the whole paper, or perhaps just the title and abstract, and sends it to one or more specialists in that particular field for advice. Peer review is usually anonymous: the author does not know the identity of the reviewers and some journals even withhold the name of the author(s) from the reviewers. The comments and recommendations from the reviewers are read by the editor and used to help decide whether to (a) reject the manuscript, (b) recommend revision and resubmission or (c) accept it with little or no change. The intention is to help ensure quality work is published, but the process is definitely not flawless and will be discussed further in Chapter 5.
Figure 4.6 An example of the trade-off between the cost and the ability to do an experiment. As the quality of the experimental design increases so does the cost of the experiment (solid line), but the ability to do the experiment decreases (dashed line). Your design usually has to be a compromise between one that is practicable, affordable and of sufficient rigour. Its quality might be anywhere along the X axis.
4.9
Summary and conclusion
The above discussion only superficially covers some important aspects of experimental design. Considering how easy it is to make a mistake, you will probably not be surprised that a lot of published scientific papers have serious flaws in design or interpretation that could have been avoided. Work with major problems in the design of experiments is still being done and, quite alarmingly, many researchers are not aware of these problems. As an example, after teaching the material in this chapter I often ask my students to find a published paper, review and criticise the experimental design and then offer constructive suggestions for improvement. Many have later reported that it was far easier to find a flawed paper than they expected.
4.10
Questions
(1)
The first test of the hypothesis that cholera could be contracted by drinking contaminated water (see Section 2.4) was to compare the number of cases of cholera in a South London neighbourhood during the few weeks (a) before and (b) after removal of the handle from a pump that drew water from a suspect well. Was this experiment well designed? Please comment.
(2)
A biologist decided to change the textbook they used for the course ‘Introductory Biostatistics’. The average mark gained by the students in this course every year from 2004 to 2009 was very similar, but increased by 12% after introduction of the new text in 2010. The biologist said ‘The increase may have been due to the textbook, but you can never really be sure about that.’ What did the biologist mean?
(3)
Give an example of confusing a correlation with causality.
(4)
Name and give examples of two types of ‘apparent replication’.
(5)
A researcher did a well-controlled and appropriately replicated experiment and was very surprised indeed when one of their colleagues said ‘The design is fine but the experiment is unrealistic so the results and your conclusion have little or no biological significance.’ What did the researcher’s colleague mean?
5
Doing science responsibly and ethically
5.1
Introduction
By now you are likely to have a very clear idea about how science is done. Science is the process of rational enquiry, which seeks explanations for natural phenomena. Scientific method was discussed in a very prescriptive way in Chapter 2 as the proposal of an hypothesis from which predictions are made and tested by doing experiments. Depending on the results, which may have to be analysed statistically, the decision is made to either retain or reject the hypothesis. This process of knowledge by disproof advances our understanding of the natural world and seems extremely impartial and hard to fault. Unfortunately, this is not necessarily the case, because science is done by human beings who sometimes do not behave responsibly or ethically. For example, some scientists fail to give credit to those who have helped propose a new hypothesis. Others make up, change, ignore or delete results so their hypothesis is not rejected, omit details to prevent the detection of poor experimental design or deal unfairly with the work of others. Most scientists are not taught about responsible behaviour and are supposed to learn a code of conduct by example, but this does not seem to be a good strategy considering the number of cases of scientific irresponsibility that have recently been exposed. This chapter is about the importance of behaving responsibly and ethically when doing science.
5.2
Dealing fairly with other people’s work
5.2.1
Plagiarism
Plagiarism is the theft and use of techniques, data, words or ideas without appropriate acknowledgement. If you are using an experimental technique or procedure devised by someone else, or data owned by another person, you must acknowledge this. If you have been reading another person’s work, it is easy inadvertently to use some of their phrases, but plagiarism is the repeated and excessive use of text without acknowledgement. As a reviewer for scientific journals, I have reported manuscripts that contain substantial blocks of text taken from the published papers of other authors. It is very easy indeed to cut text from electronic copies of journal articles and paste it into your own manuscript or postgraduate thesis with the intention of rearranging and rewriting the material in your own words. This has been called ‘patchwriting’ by Howard (1993) and there is evidence that academics do it (Roig, 2001), but often the patchwriter is so careless that sentences and even whole paragraphs survive unchanged. Detected plagiarism can severely affect your credibility and your career. The increasing availability of electronic content, together with programs and web-based services that can be used to check this for plagiarism, means that the risk of detection is likely to increase in the future.
5.2.2
Acknowledging previous work
Previous studies can be extremely valuable because they may add weight to an hypothesis and even suggest other hypotheses to test. There is a surprising tendency for scientists to fail to acknowledge previous published work by others in the same area, sometimes to the extent that experiments done two or three decades ago are repeated and presented as new findings. This can be an honest mistake in that the researcher is unaware of previous work, but it is now far easier to search the scientific literature than it used to be. When you submit your work to a scientific journal, it can be very embarrassing to be told that something similar has been done before. Even if a reviewer or editor of a journal does not notice, others may and are likely to say so in print.
5.2.3
Fair dealing
Some researchers cite the work done by others in the same field, but downplay or even distort it. Although it appears that previous work in the field has been acknowledged because the publication is listed in the citations at the back of the paper or report, the researcher has nevertheless been somewhat dishonest. I have found this in about 5% of the papers I have reviewed, but it may be more common because it is quite hard to detect unless you are very familiar with the work. Often the problem seems to arise because the writer has only read the abstract of a paper, which can be misleading. It is important to carefully read and critically evaluate previous work in your field because it will improve the quality of your own research.
5.2.4
Acknowledging the input of others
Often hypotheses may arise from discussions with colleagues or with your supervisor. This is an accepted aspect of how science is done. If, however, the discussion has been relatively one-sided in that someone has suggested a useful and novel hypothesis, then you should seriously think about acknowledgement. One of my colleagues once said bitterly ‘My suggestions become someone else’s original thoughts in a matter of seconds.’ Acknowledgement can be a mention (in a section headed ‘Acknowledgements’) at the end of a report or paper, or you may even consider including the person as an author. It is not surprising that disputes about the authorship of papers often arise between supervisors and their postgraduate students. Some supervisors argue that they have facilitated all of the student’s work and therefore expect their name to be included on all papers from the research. Others recognise the importance of the student having some single-authored papers and do not insist on this. The decision depends on the amount and type of input and rests with the principal author of the paper, but it is often helpful to clarify the matter of authorship and acknowledgement with your supervisor(s) at the start of a postgraduate programme or new job.
5.3
Doing the experiment
5.3.1
Approval
You are likely to need prior permission or approval to do some types of research, or to work in a national park or reserve. Research on endangered species is very likely to need a permit (or permits) and you will have to give a convincing argument for doing the work, including its likely advantages and disadvantages. In many countries, there are severe penalties for breaches of permits or doing research without one.
5.3.2
Ethics
Ethics are moral judgements where you have to decide if something is right or wrong, so different scientists can have different ethical views. Ethical issues include honesty and fair dealing, but they also extend to whether experimental procedures can be justified – for example, procedures which kill, mutilate or are thought to cause pain or suffering to animals. Some scientists think it is right to test cosmetic products on vertebrate animals because it will reduce the likelihood of harming or causing pain to humans, while others think it is wrong because it may cause pain and suffering to the animals. Both groups would probably find it odd if someone said it was unethical to do experiments on insects or plants. Importantly, however, none of these three views can be considered the best or most appropriate, because ethical standards are not absolute. Provided a person honestly believes, for any reason, that it is right to do what they are doing, they are behaving ethically (Singer, 1992) and it is up to you to decide what is right. The remainder of this section is about the ethical conduct of research, rather than whether a research topic or procedure is considered ethical. Research on vertebrates, which appear to feel pain, is likely to require approval by an animal ethics committee in the organisation where you are working. The committee will consider the likely advantages and disadvantages of the research, the number of animals used, possible alternative procedures and the likelihood the animals will experience pain and suffering, so your research proposal may not necessarily be approved. Taking a wider view, research on any living organism has the potential to affect that species and others, so all life scientists should think carefully about their experimental procedures and should try to minimise disturbance, death and possible suffering. Most research organisations also have strict ethical guidelines on using humans in experiments. 
Any procedures need to be considered carefully in terms of the benefits, disadvantages and possible pain and suffering, together with the need to maintain privacy and confidentiality, before being submitted to the human ethics committee for approval. Once again, the committee may not approve the research. There are usually strict reporting requirements and severe penalties for breaches of the procedure specified by the permit.
5.4
Evaluating and reporting results
Once you have the results of an experiment, you need to analyse them and discuss the results in terms of rejection or retention of your hypothesis. Unfortunately, some scientists have been known to change the results of experiments to make them consistent with their hypothesis, which is grossly dishonest. I suspect this is more common than reported and may even be fostered by assessment procedures in universities and colleges, where marks are given for the correct outcomes of practical experiments. I once asked an undergraduate statistics class how many people had ever altered their data to fit the expectations of their biology practical assignments and got a lot of very guilty looks. I know two researchers who were dishonest. The first had a regression line that was not statistically significant, so they changed the data until it was. The second made up entire sets of data for sampling that had never been done. Both were found out and neither is still doing science. It has been suggested that the tendency to alter results is at least partly because people become attached to their hypotheses and believe they are true, which goes completely against science proceeding by disproof. Some researchers are quite downcast when results are inconsistent with their hypothesis, but you need to be impartial about the results of any experiment and remember that a negative result is just as important as a positive one, because our understanding of the natural world has progressed in both cases. Another cause of dishonesty is that scientists are often under extraordinary pressure to provide evidence for a particular hypothesis. There are often career rewards for finding solutions to problems or suggesting new models of natural processes. Competition among scientists for jobs, promotion and recognition is intense and can foster dishonesty. The problem with scientific dishonesty is that the person has not reported what is really occurring. 
Science aims to describe the real world, so if you fail to reject an hypothesis when a result suggests you should, you will report a false and misleading view of the process under investigation. Future hypotheses and research based on these findings are likely to produce results inconsistent with your findings. There have been some spectacular cases where scientific dishonesty has been revealed and these have only served to undermine the credibility of the scientific process and the scientific community.
5.4.1
Pressure from peers or superiors
Sometimes inexperienced, young or contract researchers have been pressured by their superiors to falsify or give a misleading interpretation of their results. It is far better to be honest than risk being associated with work that may subsequently be shown to be flawed. One strategy for avoiding such pressure is to keep good records.
5.4.2
Record keeping
Some research groups, especially in the biomedical sciences, are so concerned about honesty that they have a code of conduct where all researchers have to keep records of their ideas, hypotheses, methods and results in a hard-bound laboratory book with numbered pages that are signed and dated on a daily or weekly basis by the researcher and their supervisor. Not only can this be scrutinised if there is any doubt about the work (including who thought of something first), but it also encourages good data management and sequential record keeping. Results kept on pieces of loose paper with no reference to the methods used can be quite hard to interpret when the work is written up for publication.
5.5
Quality control in science
Publication in a refereed journal usually ensures your work has been scrutinised by at least one referee who is a specialist in the research area. Nevertheless, this process is more likely to detect obvious and inadvertent mistakes than deliberate dishonesty, so it is not surprising that many journal editors have admitted that some of the work they publish is likely to be flawed (LaFollette, 1992). The peer review of manuscripts submitted to journals (Section 4.8) can easily be abused. There is even evidence that some (anonymous) reviewers deliberately delay their assessment of a manuscript in order to hold up the publication of work by others in the same field.
Reviewers are not usually paid and do the work simply to return a favour to the scientific community whose members have reviewed their manuscripts. A critical review with careful justification and constructive suggestions for improvement can take a lot of time and it is far easier to write a few lines of fairly non-committal approval. The number of people doing science is rapidly increasing and so is the pressure to publish, so many journals are being overwhelmed by submissions and their editors are having difficulty finding reviewers. I recently found out that the editor of a journal for which I review sometimes sends a manuscript to only one referee. Considering there is often a great disparity between the opinions of reviewers, this does not seem very fair. The accessibility of the internet has seen the creation of an enormous number of new journals, and academics and researchers often receive unsolicited emails from these inviting them to become an editor or a member of an editorial board. Unfortunately, many new journals are publishing work of poor quality and the final published papers may even contain gross typographical and grammatical errors. It is up to you to make sure you submit your manuscript to a journal whose standard you consider suitable. Institutional strategies for quality control of the scientific process are becoming more common and many have rules about the storage and scrutiny of data. Many institutions also need explicit guidelines and policies about the penalties for misconduct, together with mechanisms for handling alleged cases reported by others. The responsibility for doing good science is often left to the researcher and applies to every aspect of the scientific process, including devising logical hypotheses, doing well-designed experiments and using and interpreting statistics appropriately, together with honesty, responsible and ethical behaviour, and fair dealing.
5.6
Questions
(1)
A college lecturer said ‘For the course “Biostatistical Methods”, the percentage a student gets for the closed-book exam has always been fairly similar to the one they get for the assignment, give or take about 15%. I am a busy person, so I will simply copy the assignment percentage into the column marked “exam” on the spreadsheet and not bother to grade the exams at all. It’s really fair of me, because students get stressed during the exam anyway and may not perform as well as they should.’ Please discuss.
(2)
An environmental scientist said ‘I did a small pilot experiment with two replicates in each treatment and got the result we hoped for. I didn’t have time to do a bigger experiment but that didn’t matter – if you get the result you want with a small experiment, the same thing will happen if you run it with many more replicates. So when I published the result I said I used twelve replicates in each treatment.’ Please comment thoroughly and carefully on all aspects of this statement.
6
Probability helps you make a decision about your results
6.1
Introduction
Most science is comparative. Researchers often need to know if a particular experimental treatment has had an effect or if there are differences among a particular variable measured at several different locations. For example, does a new drug affect blood pressure, does a diet high in vitamin C reduce the risk of liver cancer in humans or is there a relationship between vegetation cover and the population density of rabbits? But when you make these sorts of comparisons, any differences among treatments or among areas sampled may be real, or they may just be the sort of variation that occurs by chance among samples from the same population.
Here is an example using blood pressure. A biomedical scientist was interested in seeing if the newly synthesised drug, Arterolin B, had any effect on blood pressure in humans. A group of six humans had their systolic blood pressure measured before and after administration of a dose of Arterolin B. The average systolic blood pressure was 118.33 mm Hg before and 128.83 mm Hg after being given the drug (Table 6.1). The average change in blood pressure from before to after administration of the drug is quite large (an increase of 10.5 mm Hg), but by looking at the data you can see there is a lot of variation among individuals: blood pressure went up in three cases, down in two and stayed the same for the remaining person. Even so, the scientist might conclude that a dose of Arterolin B increases blood pressure. But there is a problem (apart from the poor experimental design that has no controls for time, or the disturbing effect of having one’s blood pressure measured). How do you know that the effect of the drug is meaningful or significant? Perhaps this change occurred by chance and the drug had no effect. Somehow, you need a way of helping you make a decision about your results. This led to the development of statistical tests and a commonly agreed upon level of statistical significance.
Table 6.1 The systolic blood pressure in mm Hg for six people before and after being given the experimental drug Arterolin B.
Person      Before    After
1           100       108
2           120       120
3           120       150
4           140       135
5           80        120
6           150       140
Average:    118.33    128.83
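The averages and the individual changes quoted for Table 6.1 are easy to check for yourself. The short Python sketch below is only an illustration (the variable names are my own and are not part of the original study); it recalculates the two averages and counts how many people showed an increase, a decrease or no change.

```python
# Systolic blood pressure (mm Hg) for the six people in Table 6.1.
before = [100, 120, 120, 140, 80, 150]
after = [108, 120, 150, 135, 120, 140]

mean_before = sum(before) / len(before)
mean_after = sum(after) / len(after)

# Change for each person (after minus before).
changes = [a - b for b, a in zip(before, after)]
went_up = sum(c > 0 for c in changes)
went_down = sum(c < 0 for c in changes)
unchanged = sum(c == 0 for c in changes)

print(f"Mean before: {mean_before:.2f} mm Hg")
print(f"Mean after: {mean_after:.2f} mm Hg")
print(f"Mean change: {mean_after - mean_before:.2f} mm Hg")
print(f"Up: {went_up}, down: {went_down}, unchanged: {unchanged}")
```

Listing the six individual changes next to the average makes the point of the example clearer: the average rise is large, but it is not shared by every person.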
6.2
Statistical tests and significance levels
Statistical tests are just a way of working out the probability of obtaining the observed, or an even more extreme, difference among samples (or between an observed and expected value) if a specific hypothesis (usually the null of no difference) is true. Once the probability is known, the experimenter can make a decision about the difference, using criteria that are uniformly used and understood. Here is a very easy example where the probability of every possible outcome can be calculated. If you are unsure about probability, Chapter 7 gives an explanation of the concepts you will need for this book. Imagine you have a large sack containing 5000 white and 5000 black beads that are otherwise identical. All of these beads are well mixed together. They are a population of 10 000 beads. You take one bead out at random, without looking in the sack. Because there are equal numbers of black and white in the population, the probability of getting a black one is 50%, or 1/2, which is also the probability of getting a white one. The chance of getting either a black or a white bead is the sum of these probabilities: (1/2 + 1/2) which is 1 (or 100%) since there are no other colours.
Now consider what happens if you take out a sample of six beads in sequence, one after the other, without looking in the sack. Each bead is replaced after it is drawn and the contents of the sack remixed before taking out the next, so these are independent events. Here are all of the possible outcomes. You may get six black beads or six white ones (both outcomes are very unlikely); five black and one white, or one black and five white (which is more likely); four black and two white, or two black and four white (which is even more likely), or three black and three white (which is very likely because the proportions of beads in the sack are 1:1). The probability of getting six black beads in sequence is the probability of getting one black one (1/2) multiplied by itself six times, which is 1/2 × 1/2 × 1/2 × 1/2 × 1/2 × 1/2 = 1/64. The probability of getting six white beads is also 1/64. The probability of five black and one white is greater because there are six ways of getting this combination (WBBBBB or BWBBBB or BBWBBB or BBBWBB or BBBBWB or BBBBBW) giving 6/64. There is the same probability (6/64) of getting five white and one black. The probability of four black and two white is even greater because there are 15 ways of getting this combination (WWBBBB, BWWBBB, BBWWBB, BBBWWB, BBBBWW, WBWBBB, WBBWBB, WBBBWB, WBBBBW, BWBWBB, BWBBWB, BWBBBW, BBWBWB, BBWBBW, BBBWBW) giving 15/64. There is the same probability (15/64) of getting four white and two black. Finally, the probability of three black and three white (there are 20 ways of getting this combination) is 20/64. You can summarise all of the outcomes as a table of probabilities (Table 6.2) and they are shown as a histogram in Figure 6.1. Note that the distribution is symmetrical with a peak corresponding to the case where half the beads in the sample will be black and half will be white. 
Therefore, if you were given a sack containing 50% black and 50% white beads, from which you drew six, you would have a very high probability of drawing a sample that contained both colours. It is very unlikely you would get only six black or six white (the probability of each is 1/64, so the probability of either six black or six white is the sum of these, which is only 2/64, or 0.03125 or 3.125%).
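If you would rather not count the orderings by hand, the same probabilities can be calculated directly. The sketch below is one possible way of doing it in Python, assuming six independent draws with a probability of 1/2 of black on each draw; the variable names are illustrative only. It uses the number of ways of arranging k black beads among six draws, which is what the lists of combinations above were counting.

```python
from fractions import Fraction
from math import comb

n = 6                 # beads drawn, with replacement
p = Fraction(1, 2)    # probability that any one bead is black

# P(k black) = (number of orderings with k black) * p^k * (1 - p)^(n - k)
probabilities = {k: comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)}

for k in range(n, -1, -1):
    prob = probabilities[k]
    print(f"{k} black, {n - k} white: {comb(n, k)}/{2**n} = {float(prob) * 100:.2f}%")

# The seven outcomes are the only ones possible, so their probabilities sum to 1.
print("Total:", sum(probabilities.values()))
```

Working with exact fractions rather than decimals keeps the results in the same form as the hand calculation (1/64, 6/64, 15/64 and 20/64).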
Table 6.2 The probabilities of obtaining all possible combinations of black and white beads in samples of six from a large population containing equal numbers of black and white beads.
Number of black   Number of white   Probability of this outcome   Percentage of cases likely to give this result
6                 0                 1/64                          1.56
5                 1                 6/64                          9.38
4                 2                 15/64                         23.44
3                 3                 20/64                         31.25
2                 4                 15/64                         23.44
1                 5                 6/64                          9.38
0                 6                 1/64                          1.56
Total:                              64/64                         100
Figure 6.1 The expected numbers of each possible mixture of colours when drawing six beads independently with replacement on 64 different occasions from a large population containing 50% black and 50% white beads. The most likely outcome of three black and three white corresponds to the proportions of black and white in the population. (Histogram: horizontal axis, number of black beads in a sample of 6, from 0 to 6; vertical axis, expected number in a sample of 64, from 0 to 20.)
6.3
What has this got to do with making a decision about your results?
In 1925, the statistician Sir Ronald Fisher proposed that if the probability of the observed outcome and any more extreme departures possible from the expected outcome (the null hypothesis discussed in Chapter 2) is less than 5%, then it is appropriate to conclude that the observed difference is statistically significant (Fisher, 1925). There is no biological or scientific reason for the choice of 5%, which is the same as 1/20 or 0.05. It is the probability that many researchers use as a standard ‘statistically significant level’. Using the example of the beads in the sack, if your null hypothesis specified that there were equal numbers of black and white beads in the population, you could test it by drawing out a sample of six beads as described above. If all were black, the probability of this result or any more extreme departures (in this case there are no possibilities more extreme than six black beads) from the null hypothesis is only 1.56% (Table 6.2), so the difference between the outcome for the sample and the expected result has such a low probability it would be considered statistically significant. A researcher would reject the null hypothesis and conclude that the sample did not come from a population containing equal numbers of black and white beads.
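The decision described above can be written as a short calculation. The following Python sketch is an illustration only: the function name is hypothetical, and it assumes, as the example does, that departures are counted in the direction of more black beads than expected. It adds up the probability of the observed outcome and anything more extreme, then compares the total with the 5% level.

```python
from fractions import Fraction
from math import comb

def prob_at_least(k_black, n=6):
    """P(k_black or more black beads in n draws) if black and white are equally common."""
    return sum(Fraction(comb(n, k), 2**n) for k in range(k_black, n + 1))

ALPHA = Fraction(5, 100)  # the conventional 5% significance level

for k_black in (6, 5):
    p = prob_at_least(k_black)
    verdict = "significant" if p < ALPHA else "not significant"
    print(f"{k_black} or more black: P = {p} = {float(p):.4f} ({verdict} at 5%)")
```

Six black beads give 1/64, which is below 5%. Five or more black beads give 1/64 + 6/64 = 7/64, which is well above 5%, so a sample of five black and one white would not lead to rejection of the null hypothesis.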
6.4
Making the wrong decision
If the proportions of black and white beads in the sack really were equal, then most of the time a sample of six beads would contain both colours. But if the beads in the sample were all only black or all only white, a researcher would decide the sack (the population) did not contain 50% black and 50% white. Here they would have made the wrong decision, but this would not happen very often because the probability of either of these outcomes is only 2/64. The unavoidable problem with using probability to help you make a decision is that there is always a chance of making a wrong decision and you have no way of telling when you have done this. As described above, if a researcher drew out a sample of six of one colour, they would decide that the population (the contents of the bag) was not 50% black and 50% white when really it was. This type of mistake, where the null hypothesis is inappropriately rejected, is called a Type 1 error. There is another problem too. Sometimes an unknown population is different to the expected (e.g. it may contain 90% white beads and 10% black ones), but the sample taken (e.g. four white and two black) is not significantly different to the expected outcome predicted by the hypothesis of 50:50. In this case, the researcher would decide the composition of the population was the one expected under the null hypothesis (50:50), even though in reality it was not. This type of mistake, when the alternate hypothesis holds but is inappropriately rejected, is called a Type 2 error. Every time you do a statistical test, you run the risk of a Type 1 or Type 2 error. There will be more discussion of these errors in Chapter 10, but they are unavoidably associated with using probability to help you make a decision.
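To get a feel for how often each type of error might occur, you can calculate the two risks for the bead example. The sketch below is a rough illustration under stated assumptions: the decision rule is ‘reject the 50:50 hypothesis only when all six beads are the same colour’, the alternative population is the 90% white and 10% black sack mentioned above, and the function name is my own.

```python
from fractions import Fraction

def prob_all_one_colour(p_white, n=6):
    """P(n draws with replacement are all white or all black)."""
    p_black = 1 - p_white
    return p_white**n + p_black**n

# Type 1: the sack really is 50:50, but the sample is all one colour,
# so the 50:50 hypothesis is rejected.
type_1 = prob_all_one_colour(Fraction(1, 2))

# Type 2: the sack is really 90% white, but the sample contains both colours,
# so the 50:50 hypothesis is retained.
type_2 = 1 - prob_all_one_colour(Fraction(9, 10))

print(f"Type 1 error probability: {type_1} = {float(type_1):.5f}")
print(f"Type 2 error probability: {float(type_2):.4f}")
```

With these assumptions the Type 1 risk is 2/64 (printed in its simplest form, 1/32), while the Type 2 risk is close to 47%. A sample of only six beads will therefore often fail to detect even a very uneven population.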
6.5
Other probability levels
Sometimes, depending on the hypothesis being tested, a researcher may decide that the ‘less than 5%’ significance level (with its 5% chance of inappropriately rejecting the null hypothesis) is too risky. Here is a medical example. Malaria is caused by a parasitic protozoan carried by certain species of mosquito. When an infected mosquito bites a person, the protozoans are injected into the person’s bloodstream, where they reproduce inside red blood cells. A small proportion of malarial infections progress to cerebral malaria, where the parasite causes severe inflammation of the person’s brain and often results in death. A biomedical scientist was asked to test a new and extremely expensive drug that was hoped to reduce mortality in people suffering from cerebral malaria. A large experiment was done, where half of cerebral malaria cases chosen at random received the new drug and the other half did not. The survival of both groups over the next month was compared. The alternate hypothesis was ‘There will be increased survival of the drug-treated group compared to the control.’ In this case, the prohibitive cost of the drug meant that the manufacturer had to be very confident that it was of use before recommending and marketing it. Therefore, the risk of a Type 1 error (significantly greater survival in the experimental group compared to the control occurring
simply by chance) when using the 5% significance level might be considered too risky. Instead, the researcher might decide to reduce the risk of Type 1 error by using the 1% (or even the 0.1%) significance level and only recommend the drug if the reduction in mortality was so marked that it was significant at this level. Here is an example of the opposite case. Before releasing any new pharmaceutical product on the market, it has to be assessed for side effects. There were concerns that the new sunscreen ‘Bayray Blockout 2020’ might cause an increase in pimples among frequent users. A pharmaceutical scientist ran an experiment using 200 high school students during their summer holiday. Each student was asked to apply Bayray Blockout 2020 to their left cheek and the best-selling but boringly named ‘Sensible Suncare’ to their right cheek every morning and then spend the next hour with both cheeks exposed to the sun. After six weeks, the numbers of pimples per square cm on each cheek were counted and compared. The alternate hypothesis was ‘Bayray Blockout 2020 causes an increase in pimple numbers compared to Sensible Suncare.’ An increase could be disastrous for sales, so the scientist decided on a significance level of 10% rather than the conventional 5%. Even though there was a 10% chance (double the usual risk) of a Type 1 error, the company could not take the risk that Bayray Blockout 2020 increased the incidence of pimples. The most commonly used significance level is 5%, which is 0.05. If you decide to use a different level in an analysis, the decision needs to be made, justified and clearly stated before the experiment is done. For a significant result, the actual probability is also important. For example, a probability of 0.04 is not very much less than 0.05. In contrast, a probability of 0.002 is very much less than 0.05. 
Therefore, even though both are significant, the result with the lower probability gives much stronger evidence for rejecting the null hypothesis.
6.6
How are probability values reported?
The symbol used for the chosen significance level (e.g. 0.05) is the Greek α (alpha). Often you will see the probability reported as: P < 0.05 or P < 0.01 or P < 0.001. These mean respectively: ‘The probability is less than 0.05’ or ‘The probability is less than 0.01’ or ‘The probability is less than 0.001.’
N.S. means ‘not significant’, which is when the probability is 0.05 or more (P ≥ 0.05). Of course, as noted above, if you have specified a significance level of 0.05 and get a result with a probability of less than 0.001, this is far stronger evidence for your alternate hypothesis than a result with a probability of 0.04.
6.6.1
Reporting the results of statistical tests
One of the most important and often neglected aspects of doing an experiment is the amount of detail given for the results of statistical tests. Scientific journals often have guidelines (under ‘Instructions for authors’ or ‘For authors’ on their websites), but these can vary considerably even among journals from the same publisher. Many authors give insufficient detail, so it is not surprising that submitted manuscripts are often returned to the authors with a request for more. The results of a statistical test are usually reported as the value of the statistic, the number of degrees of freedom and the probability. A nonsignificant result is usually reported as ‘N.S.’ or ‘P > 0.05’. When the probability of a result is very low (e.g. < 0.001), it is usually reported as P < 0.001 instead of P < 0.05 to emphasise how unlikely it is. I have often been asked ‘Does this mean the experimenter has used a probability level of 0.001?’ It does not, unless the experimenter has clearly specified that they decided to use that probability level and you should expect a reason to be given. Sometimes the exact probability is given (e.g. a non-significant probability: P = 0.34, or a significant probability: P = 0.03). Here it often helps to check the journal guidelines for the appropriate format required.
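The reporting conventions described above can be sketched as a small helper. This is an illustration only: the function name `report_p` and its default thresholds are assumptions, and individual journals may require a different format (for example, the exact probability).

```python
def report_p(p, alpha=0.05, thresholds=(0.001, 0.01, 0.05)):
    """Return a conventional label for probability p at significance level alpha."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("a probability must lie between 0 and 1")
    if p >= alpha:
        return "N.S."  # not significant: P is alpha or more
    # Candidate reporting levels below alpha, smallest first, then alpha itself.
    levels = sorted(t for t in thresholds if t < alpha) + [alpha]
    for level in levels:
        if p < level:
            return f"P < {level}"
    # Not reached: p < alpha always matches the last level.


for p in (0.34, 0.05, 0.04, 0.002, 0.0004):
    print(p, report_p(p))
```

Note that a label such as P < 0.001 describes the result; it does not mean the experimenter chose 0.001 as the significance level.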
6.7
All statistical tests do the same basic thing
In the example where beads were drawn from a sack, all of the possible outcomes were listed and the probability of each was calculated directly. Some statistical tests do this, but most use a formula to produce a number called a statistic. The probability of getting each possible value of the statistic has been previously calculated, so you can use the formula to get the numerical value of the statistic, look up the probability of that value and make your decision to retain the null hypothesis if it has a
probability of ≥ 0.05 or reject it if it has a probability of < 0.05. Most statistical software packages now available will give the probability as well as the statistic.
6.8
A very simple example – the chi-square test for goodness of fit
This example illustrates the concepts discussed above, using one of the simplest statistical tests. The chi-square test for goodness of fit compares observed ratios to expected ratios for nominal scale data. Imagine you have done a genetics experiment on pelt colour in guinea pigs, where you expect a 3:1 ratio of brown to albino offspring. You have obtained 100 offspring altogether, so ideally you would expect the numbers in the sample to be 75 brown to 25 albino. But even if the null hypothesis of 3:1 were to apply, in many cases the effects of chance are likely to give you values that are not exactly 75:25, including a few that are quite dissimilar to this ratio. For example, you might actually get 86 brown and 14 albino offspring. This difference from the expected frequencies might be due to chance, it may be because your null hypothesis is incorrect or a combination of both. You need to decide whether this result is significantly different from the one expected under the null hypothesis. This is the same concept developed for the example of the sack of beads. The chi-square test for goodness of fit generates a statistic (a number) that allows you to easily estimate the probability of the observed (or any greater) deviation from the expected outcome. It is so simple, you can do it on a calculator. To calculate the value of chi-square, which is symbolised by the Greek χ², you take each expected value away from its equivalent observed value, square the difference and divide this by the expected value. These separate values (of which there will be two in the case above) are added together to give the chi-square statistic. First, here is the chi-square statistic for an expected ratio that is the same as the observed (observed numbers 75 brown and 25 albino, expected 75 brown and 25 albino):
$$\chi^2 = \frac{(75 - 75)^2}{75} + \frac{(25 - 25)^2}{25} = 0.0.$$
The value of chi-square is 0 when there is no difference between the observed and expected values. As the difference between the observed and expected values increases, so does the value of chi-square. Here the observed numbers are 74 and 26. The value of chi-square can only be positive because you always square the difference between the observed and expected values:
$$\chi^2 = \frac{(74 - 75)^2}{75} + \frac{(26 - 25)^2}{25} = 0.0533.$$
For the observed numbers of 70:30, the chi-square statistic is:
$$\chi^2 = \frac{(70 - 75)^2}{75} + \frac{(30 - 25)^2}{25} = 1.333.$$
When you take samples from a population in a ‘category’ experiment, you are, by chance, unlikely to always get perfect agreement to the ratio in the population. For example, even when the ratio in the population is 75:25, some samples will have that ratio, but you are also likely to get 76:24, 74:26, 77:23, 73:27 etc. The range of possible outcomes among 100 offspring goes all the way from 0:100 to 100:0. So the distribution of the chi-square statistic generated by taking samples in two categories from a population, in which there really is a ratio of 75:25, will look like the one in Figure 6.2, and the most unlikely 5% of outcomes will generate values of the statistic that will be greater than a critical value determined by the number of independent categories in the analysis. Going back to the result of the genetic experiment given above, the expected numbers are 75 and 25 and the observed numbers are 86 brown and 14 albino. So to get the value of chi-square, you calculate:
$$\chi^2 = \frac{(86 - 75)^2}{75} + \frac{(14 - 25)^2}{25} = 6.453.$$
The critical 5% value of chi-square for an analysis of two independent categories is 3.841. This means that only the most extreme 5% of departures from the expected ratio will generate a chi-square statistic greater than this value. There is more about the chi-square test in Chapter 20 – here I have just given the critical value.
Figure 6.2 The distribution of the chi-square statistic generated by taking samples from a population containing only two categories in a known ratio. Most of the samples will have the same ratio as the expected and thus generate a chi-square statistic of 0, but the remainder will differ from this by chance and therefore give positive values of chi-square. The most extreme 5% departures from the expected ratio will generate statistics greater than the critical value of chi-square.
Because the actual value of chi-square is 6.453, the observed result is significantly different to the result expected under the null hypothesis. The researcher would conclude that the ratio in the population sampled is not 3:1 and therefore reject their null hypothesis.
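The four hand calculations in this section can be checked with a short Python sketch. The function names are illustrative. The upper-tail probability uses the complementary error function, which is an addition not found in the text and is valid only for the two-category case (one degree of freedom); the critical value 3.841 is the one quoted above.

```python
import math


def chi_square_gof(observed, expected):
    """Sum of (observed - expected)^2 / expected over all categories."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def upper_tail_p_two_categories(chi2):
    """P(chi-square >= chi2) for two categories (1 degree of freedom only)."""
    return math.erfc(math.sqrt(chi2 / 2.0))


CRITICAL_5_PERCENT = 3.841  # critical value quoted in the text
expected = [75, 25]         # 3:1 ratio among 100 offspring

for observed in ([75, 25], [74, 26], [70, 30], [86, 14]):
    chi2 = chi_square_gof(observed, expected)
    verdict = "significant" if chi2 > CRITICAL_5_PERCENT else "not significant"
    print(observed, round(chi2, 3), round(upper_tail_p_two_categories(chi2), 4), verdict)
```

Only the 86:14 result exceeds the critical value, which matches the conclusion reached above.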
6.9
What if you get a statistic with a probability of exactly 0.05?
Many statistics texts do not mention this and students often ask ‘What if you get a probability of exactly 0.05?’ Here the result would be considered not significant since significance has been defined as a probability of less than 0.05 (< 0.05). Some texts define a significant result as one where the probability is 5% or less (≤ 0.05). In practice, this will make very little difference, but because Fisher proposed the ‘less than 0.05’ definition, which is also used by most scientific publications, it will be used here. More importantly, many researchers would be uneasy about any result with a probability close to 0.05 and would be likely to repeat the experiment. If the null hypothesis applies, then there is a 0.95 probability of a
non-significant result on any trial, so you would be unlikely to get a similarly marginal result if you did repeat the experiment.
Box 6.1 Why do we use P < 0.05?
There is no mathematical or biological reason why the significance level of 0.05 has become the accepted criterion. In the early 1900s, Karl Pearson and others published tables of probabilities for several statistical tests, including chi-square, with values that included 0.1, 0.05, 0.02 and 0.01. When Fisher published the first edition of his Statistical Methods for Research Workers (Fisher, 1925), it included tables of probabilities such as 0.99, 0.95, 0.90, 0.80, 0.70, 0.50, 0.30, 0.20, 0.10, 0.05, 0.02 and 0.01 for statistics such as chi-square. Nevertheless, Fisher wrote about the probability of 0.05 ‘It is convenient to take this point in judging whether a deviation is to be considered significant or not’ (Fisher, 1925) and only gave critical values of 0.05 for his recently developed technique of analysis of variance (ANOVA). Stigler (2008) has suggested that the complete set of tables for this test would have been so lengthy that it was necessary for Fisher to decide upon a level of significance. Nevertheless, discussion of what constituted ‘significance’ had occurred since the 1800s (Cowles and Davis, 1982) and Fisher’s choice was consistent with previous ideas, but ‘drew the line’ and thereby provided a set criterion for use by researchers.
6.10
Statistical significance and biological significance
It is important to realise that a statistically significant result may not necessarily have any biological significance. Here is an example. A study of male college students aged 21 was used to compare the sperm counts of 5000 coffee drinkers with 5000 non-coffee drinkers. Results showed that the coffee drinkers had fewer viable sperm per millilitre of semen than noncoffee drinkers and this difference was significant at P < 0.05. Nevertheless, a
follow-up study of the same males over the next 15 years showed no difference in their effective fertility, as measured by the number of children produced by the partners of each group. Therefore, in terms of fertility, the difference was not biologically significant. If you get a significant result, you need to ask yourself ‘What does it mean biologically?’ This is another aspect of realism, which was first discussed in relation to experimental design in Chapter 4.
Box 6.2 Improbable events do occur: left-handed people and cancer clusters
Scientists either reject or retain hypotheses on the basis of whether the probability is less than 0.05, but if you take a large number of samples from a population, you will inevitably get some with an extremely unlikely probability. Here is a real example. The probability a person is left-handed is about 1/10, so the probability a person is right-handed is about 9/10. When I was a first-year university student, my tutorial group contained 13 left-handed students and one right-handed tutor. Ignoring the tutor, the probability of a sample containing 13 left-handed students is (1/10)¹³ = 0.0000000000001, which is an extremely improbable event indeed. The students were assigned to their tutorial groups at random, so when we calculated the probability our response was ‘It has happened by chance.’ Unfortunately, whenever you take a lot of random samples from a population, such improbable events will occasionally occur for conditions that are far more serious than being left-handed. For example, there are many hundreds of thousands of workplaces within a large country. Each person in that country has a relatively low probability of suffering from cancer, but sometimes a high number of cancer cases may occur within a particular workplace. These are often called ‘cancer clusters’ and are of great concern to health authorities because the unusually high proportion of cases may have occurred by chance or it may not – it may have been caused by some feature of the workplace itself. Therefore, when a workplace cancer cluster is reported a lot of
effort is put into measuring levels of known carcinogens (e.g. background radiation, soil contamination and air pollution), but often no plausible explanation is found. Nevertheless, it can never be confidently concluded that the cluster has occurred due to chance, because there may be some unidentified cause. This has even resulted in buildings being abandoned and demolished because they may contain some (unknown) threat to human health. But if you examine data for workplace disease, you will also find some where the incidence of cancer is extremely and improbably low. Here the significant lack of cancers may have occurred by chance or because of some feature of the workplace environment that protects the workers from cancer, but these events do not usually stimulate further investigation because human health is not at risk.
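The point of Box 6.2, that improbable events become likely somewhere when enough samples are taken, can be illustrated numerically. In the sketch below the left-handedness figure comes from the box, but the workplace numbers (a 1-in-1000 event examined across 10 000 independent workplaces) are purely hypothetical values chosen for illustration, and the function name is an assumption.

```python
from fractions import Fraction

# Probability that 13 students chosen at random are all left-handed,
# using P(left-handed) = 1/10 from the text.
p_all_left = Fraction(1, 10) ** 13
print(float(p_all_left))  # 1e-13


def prob_at_least_once(p_event, n_samples):
    """Chance that an event of probability p_event occurs in at least one of n independent samples."""
    return 1 - (1 - p_event) ** n_samples


# Hypothetical numbers for illustration only: an event with a 1-in-1000 chance
# per workplace, examined across 10 000 independent workplaces.
print(prob_at_least_once(0.001, 10_000))
```

With these made-up numbers, at least one such 'cluster' is almost certain to appear by chance alone.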
6.11
Summary and conclusion
All statistical tests are a way of obtaining the probability of a particular outcome. The probability is either calculated directly as shown for the example of beads drawn from a sack, or a test that gives a statistic (e.g. the chi-square test) is applied to the data. A test statistic is just a number that usually increases as the difference between an observed and expected value (or between samples) also increases. As the value of the statistic becomes larger and larger, the probability of an event generating that statistic gets smaller and smaller. Once the probability of that event and any more extreme departures from the null hypothesis is less than 5%, it is concluded that the outcome is statistically significant. A range of tests will be covered in the rest of this book, but they are all just methods for obtaining the probability of an outcome to help you make a decision about your hypothesis. Nevertheless, it is important to realise that the probability of the result does not make a decision for you, and even a statistically significant result may not necessarily have any biological significance – the result has to be considered in relation to the system you are investigating.
6.12
Questions
(1) Why would many scientists be uneasy about a probability of 0.06 for the result of a statistical test?
(2) Define a Type 1 error and a Type 2 error.
(3) Discuss the use of the 0.05 significance level in terms of assessing the outcome of hypothesis testing. When might you use the 0.01 significance level instead?
7
Probability explained
7.1
Introduction
This chapter gives a summary of the essential concepts of probability needed for the statistical tests covered in this book. More advanced material that will be useful if you go on to higher-level statistics courses is in Section 7.6.
7.2
Probability
The probability of any event can only vary between zero (0) and one (1) (which correspond to 0% and 100%). If an event is certain to occur, it has a probability of 1. If an event is certain not to occur, it has a probability of 0. The probability of a particular event is the number of outcomes giving that event, divided by the total number of possible outcomes. For example, when you toss a coin, there are only two possible outcomes – a head or a tail. These two events are mutually exclusive – you cannot get both simultaneously. Consequently, the probability of a head is 1 divided by 2 = 1/2 (and thus the probability of a tail is also 1/2). Probability is usually symbolised as P, so the previous sentence could be written as P (head) = 1/2 and P (tail) = 1/2. Similarly, if you roll a six-sided die numbered from 1 to 6, the probability of a particular number (e.g. the number 1) is 1/6 (Figures 7.1 (a) and (b)).
7.3
The addition rule
The probability of getting either a head or a tail is the sum of the two probabilities, which for a coin is 1/2 + 1/2 = 1, or P (head) + P (tail) = 1. For a six-sided die, the probability of any number between 1 and 6 inclusive is 6/6 = 1. This is an example of the addition rule: when several outcomes are
Figure 7.1 The probability of an event is the number of outcomes giving that event, divided by the total number of possible outcomes. (a) For a two-sided coin, there are two outcomes, so the probability of a head (shaded) is 1/2 (0.5). (b) For a six-sided die, the probability of the number ‘2’ is 1/6 (0.1667).
Figure 7.2 The addition rule. The probability of getting any one of two or more mutually exclusive events is the sum of their probabilities. Therefore, the probability of getting 1, 2, 3 or 4 when rolling a six-sided die is 4/6 (0.6667).
mutually exclusive (meaning they cannot occur simultaneously), the probability of getting any of these is the sum of their separate probabilities. Therefore, the probability of getting a 1, 2, 3 or 4 when rolling a six-sided die is 4/6 (Figure 7.2).
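The definition of probability in Section 7.2 and the addition rule can both be checked by counting outcomes. The following is a minimal sketch, assuming every outcome of the coin or die is equally likely; the function name `probability` is illustrative.

```python
from fractions import Fraction


def probability(event, outcomes):
    """Number of outcomes giving the event divided by the total number of outcomes."""
    outcomes = set(outcomes)
    return Fraction(len(set(event) & outcomes), len(outcomes))


coin = {"head", "tail"}
die = {1, 2, 3, 4, 5, 6}

p_head = probability({"head"}, coin)   # 1/2
p_one = probability({1}, die)          # 1/6

# Addition rule: the faces 1, 2, 3 and 4 are mutually exclusive,
# so their separate probabilities are simply added.
p_1_to_4 = sum(probability({face}, die) for face in (1, 2, 3, 4))

print(p_head, p_one, p_1_to_4)  # 1/2 1/6 2/3
```

The result 2/3 is the 4/6 given in the text, reduced to lowest terms.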
7.4
The multiplication rule for independent events
When the occurrence of one event has no effect on the occurrence of the second, the events are independent. For example, if you toss two coins (either simultaneously or one after the other), the outcome (H or T) for the first coin will have no influence on the outcome for the second (and vice versa). To calculate the joint probability of two or more independent events such as two heads occurring when two coins are tossed simultaneously,
Box 7.1 The concept of risk and its relationship to probability
Reports often give the relative risk, the percentage relative risk or the increased percentage in relative risk associated with a particular diet, activity or exposure to a toxin. Relative risk is the probability of occurrence in an ‘exposed’ or ‘treatment’ group divided by the probability of occurrence in an ‘unexposed’ or ‘control’ group:
$$\text{Relative risk} = \frac{P(\text{in treatment group})}{P(\text{in control group})}.$$
Here is an example using tickets in a lottery. If you have one ticket in a lottery with ten million tickets, your probability of winning is one in ten million (1/10 000 000), so winning is extremely unlikely. If you had ten tickets in the same lottery, your probability of winning would improve to one in a million (10/10 000 000 = 1/1 000 000). Expressed as relative risk, a ten ticket holder has ten times the relative risk of winning the lottery compared to a person who has only one ticket:
$$\text{Relative risk} = \frac{P(\text{ten ticket holder wins})}{P(\text{one ticket holder wins})} = 10.0.$$
This may sound impressive, ‘My chance of winning the lottery is ten times more than yours’, but winning is still very unlikely. The percentage relative risk is just relative risk expressed as a percentage, so a relative risk of 1 is equivalent to 100%. The increased percentage in relative risk is the percentage by which the relative risk exceeds 100% (i.e. percentage relative risk minus 100). For example, the ten ticket holder described above has ten times the relative risk of winning the lottery compared to a single ticket holder, or 1000%, which is a 900% increase compared to the single ticket holder. The decreased percentage in relative risk is the percentage by which relative risk is less than 100% (i.e. 100 minus the percentage relative risk). Statistics for relative risk, especially the percentage increase or decrease in relative risk, are particularly common in popular reports on health. For example, ‘Firefighters aged in their 40s have a 28% increased relative risk
of developing prostate cancer compared to other males in this age group’ needs to be considered in relation to the incidence of prostate cancer, which is about 1 per 1000 males in their 40s. Therefore, to have an increased relative risk of 28% (i.e. 1.28) the incidence among firefighters in their 40s must be 1.28 × 1/1000 = 1.28 per 1000. A relative risk of 1 (i.e. 100%) shows there is no difference in incidence between two groups. An increased incidence in the treatment group will give a relative risk of more than 1 compared to the population, and a reduced incidence in the treatment group will give a relative risk between 0 and 1.
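The relative-risk arithmetic in Box 7.1 can be sketched in Python as follows. The function names are illustrative and not from the text; the lottery and firefighter figures are the ones given in the box.

```python
from fractions import Fraction


def relative_risk(p_treatment, p_control):
    """Probability in the treatment group divided by probability in the control group."""
    return p_treatment / p_control


def percentage_relative_risk(rr):
    """Relative risk expressed as a percentage (a relative risk of 1 is 100%)."""
    return rr * 100


def increased_percentage(rr):
    """Percentage by which the relative risk exceeds 100%."""
    return percentage_relative_risk(rr) - 100


# Lottery example: ten tickets versus one ticket among ten million.
rr_lottery = relative_risk(Fraction(10, 10_000_000), Fraction(1, 10_000_000))
print(rr_lottery, percentage_relative_risk(rr_lottery), increased_percentage(rr_lottery))
# 10 1000 900

# Firefighter example: a 28% increase on a baseline of 1 per 1000.
baseline = Fraction(1, 1000)
incidence_firefighters = Fraction(128, 100) * baseline
print(float(incidence_firefighters * 1000))  # 1.28 (cases per 1000)
```

The second example shows why a relative risk should always be read alongside the baseline incidence: a 28% increase moves the incidence only from 1 to 1.28 per 1000.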
Figure 7.3 (a) The multiplication rule. The probability of getting two or more independent events is the product of their probabilities. Therefore, the probability of two heads when tossing two coins is 1/2 × 1/2, which is 1/4 (0.25). (b) The combination of the multiplication and addition rule. A head and a tail can occur in two ways: H and then T or T and then H, giving a total probability of 1/2 (0.5).
which would be written as P (head, head), you simply multiply the independent probabilities together. Therefore, the probability of getting two heads with two coins is P (head) × P (head), which is 1/2 × 1/2 = 1/4. The chance of a head and a tail with two coins is 1/2, because there are two ways of obtaining this: coin 1 = H, coin 2 = T, or coin 1 = T and coin 2 = H (Figures 7.3 (a) and (b)).
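The two-coin results can be confirmed by listing every ordered outcome. This is a minimal sketch assuming two fair, independent coins; the variable names are illustrative.

```python
from fractions import Fraction
from itertools import product

p = {"H": Fraction(1, 2), "T": Fraction(1, 2)}

# Multiplication rule: the coins are independent, so the probability of each
# ordered pair is the product of the two separate probabilities.
joint = {(a, b): p[a] * p[b] for a, b in product("HT", repeat=2)}

p_two_heads = joint[("H", "H")]

# Addition rule on top: H then T and T then H are mutually exclusive.
p_head_and_tail = joint[("H", "T")] + joint[("T", "H")]

print(p_two_heads, p_head_and_tail)  # 1/4 1/2
```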
7.5
Conditional probability
If two events are not independent (for example, a single roll of a six-sided die, with the first event being a number in the range from 1 to 3 inclusive and the second event that the number is even), the multiplication rule also applies, but you have to multiply the probability of one event by the conditional probability of the second. When rolling a die, the independent probability of a number from 1 to 3 is 3/6 = 1/2, and the independent probability of any even number is also 1/2 (the even numbers are 2, 4 or 6, giving three of six possible outcomes). If, however, you have already rolled a number from 1 to 3, the probability of an even number within (and therefore conditional upon) that restricted set of outcomes is 1/3 (because ‘2’ is the only even number possible in the three outcomes). Therefore, the probability of both related
Figure 7.4 Conditional probabilities. The probability of two related events is the probability of the first multiplied by the conditional probability of the second. (a) The probability of a number from 1 to 3 that is also even is 1/2 × 1/3, which is 1/6. (b) The probability of a number from 1 to 3 that is also odd is 2/3 × 1/2, which is 2/6 (1/3).
Table 7.1 The probability of obtaining an even number from 1 to 3 when rolling a six-sided die can be obtained in two ways: (a) by multiplying the probability of obtaining an even number by the conditional probability that an even number is from 1 to 3, or (b) by multiplying the probability of obtaining a number from 1 to 3 by the conditional probability that a number from 1 to 3 is even. Case (b) is illustrated in Figure 7.4(a).
(a) Number from 1 to 3 that is even
    First event: Even number. P (even) = 3/6 = 1/2
    Second event: Number from 1–3 provided the number is even. P (1–3|even) = 1/3
    Product: P (even) × P (1–3|even) = 1/6
(b) Even number that is from 1 to 3
    First event: Number from 1–3. P (1–3) = 3/6 = 1/2
    Second event: Even number provided the number is from 1–3. P (even|1–3) = 1/3
    Product: P (1–3) × P (even|1–3) = 1/6
events is 1/2 × 1/3 = 1/6 (Figure 7.4(a)). You can work out this probability the other way – when rolling a die the probability of an even number is 1/2 (you would get numbers 2, 4 or 6) and the probability of one of these numbers being in the range from 1 to 3 is 1/3 (the number 2 out of these three outcomes). Therefore, the probability of both related events is again 1/2 × 1/3 = 1/6. The conditional probability of an event (e.g. an even number provided a number from 1 to 3 has already been rolled) occurring is written as P (A|B), which means ‘the probability of event A provided event B has already occurred’. For the example with the die, the probability of an even number, provided a number from 1 to 3 has been rolled, is written as P (even|1–3). Therefore, the calculations in Figure 7.4(a) and Table 7.1 can be formally written as:
$$P(A,B) = P(A) \times P(B \mid A) = P(B) \times P(A \mid B). \tag{7.1}$$
Or for the specific case of a six-sided die:
$$P(\text{even}, 1\text{–}3) = P(\text{even}) \times P(1\text{–}3 \mid \text{even}) = P(1\text{–}3) \times P(\text{even} \mid 1\text{–}3). \tag{7.2}$$
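Equation (7.2) can be verified by counting faces of the die. The sketch below assumes a fair six-sided die; the helper `prob` and the set names are illustrative.

```python
from fractions import Fraction

die = {1, 2, 3, 4, 5, 6}
even = {2, 4, 6}
low = {1, 2, 3}          # "a number from 1 to 3"


def prob(event, given=None):
    """P(event | given), counting equally likely faces; given defaults to all faces."""
    given = die if given is None else given
    return Fraction(len(event & given), len(given))


p_joint_direct = prob(even & low)                 # only the face 2 qualifies
p_via_low = prob(low) * prob(even, given=low)     # P(1-3) x P(even|1-3)
p_via_even = prob(even) * prob(low, given=even)   # P(even) x P(1-3|even)

print(p_joint_direct, p_via_low, p_via_even)  # 1/6 1/6 1/6
```

All three routes give 1/6, as the equation requires.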
7.6
Applications of conditional probability
An understanding of the applications of conditional probability is not needed for the topics in this book, but will be useful if you go on to take more advanced courses. The calculation of the probability of two events by multiplying the probability of the first by the conditional probability of the second (Section 7.5) is an example of Bayes’ theorem. Put formally, the probability of the conditional events A and B occurring together can be obtained in two ways: the probability of event A multiplied by the probability B will occur provided event A has already occurred, or the probability of event B multiplied by the probability A will occur provided event B has already occurred:
$$P(A,B) = P(A) \times P(B \mid A) = P(B) \times P(A \mid B). \tag{7.3}$$
For example, as described in Section 7.5, the overall probability of an even number and a number from 1 to 3 in a single roll of a six-sided die is symbolised as P (even,1–3) and can be obtained from either:
$$P(1\text{–}3 \mid \text{even}) \times P(\text{even}), \quad \text{or} \quad P(\text{even} \mid 1\text{–}3) \times P(1\text{–}3).$$
7.6.1
Using Bayes’ theorem to obtain the probability of simultaneous events
Bayes’ theorem is often used to obtain P (A,B). Here is an example. In central Queensland, many rural property owners have a well drilled in the hope of accessing underground water, but there is a risk of not striking sufficient water (i.e. a maximum flow rate of less than 200 litres per hour is considered insufficient) and also a risk that the water is unsuitable for human consumption (i.e. it is not potable). It would be very helpful to know the probability of the combination of events of striking sufficient water that is also potable: P (sufficient, potable). Obtaining P (sufficient) is easy, because drilling companies keep data for the numbers of sufficient and insufficient wells they have drilled. Unfortunately, they do not have records of whether the water is potable, because that is only established later by a laboratory analysis paid for by the
property owner. Furthermore, analyses of samples from new wells are usually only done on those that yield sufficient water because there would be little point in assessing an insufficient well. Therefore, data from laboratory analyses for potability only gives the conditional probability of potable water provided the well has yielded sufficient water: P (potable|sufficient). Nevertheless, from these two known probabilities, the chance of striking sufficient and potable water can be calculated as:
$$P(\text{sufficient}, \text{potable}) = P(\text{sufficient}) \times P(\text{potable} \mid \text{sufficient}). \tag{7.4}$$
From drilling company records, the likelihood of striking sufficient water in central Queensland, P (sufficient), is 0.95 (so it is not surprising that one company charges about 5% more than its competitors, but guarantees to refund most of the drilling fee for any well that does not strike sufficient water). Laboratory records for water sample analyses in central Queensland show that only 0.30 of known sufficient wells yield potable water: P (potable|sufficient). Therefore, the probability of the two events sufficient and potable water occurring together is 0.95 × 0.30 = 0.285, which means that the chance of this occurring is slightly more than 1/4. If you were a central Queensland property owner with a choice of two equally expensive alternatives of (a) installing additional rainwater storage tanks or (b) having a well drilled, what would you decide on the basis of this probability?
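The arithmetic of equation (7.4) can be checked with a few lines of Python. This is only a minimal sketch: the function name `joint_probability` and the variable names are illustrative assumptions, while the two probabilities are the ones quoted above.

```python
def joint_probability(p_first, p_second_given_first):
    """Return P(A,B) = P(A) x P(B|A) for two conditional events."""
    for p in (p_first, p_second_given_first):
        if not 0.0 <= p <= 1.0:
            raise ValueError("a probability must lie between 0 and 1")
    return p_first * p_second_given_first


# Values quoted in the text for central Queensland wells.
p_sufficient = 0.95                # P(sufficient), from drilling records
p_potable_given_sufficient = 0.30  # P(potable|sufficient), from laboratory records

p_sufficient_and_potable = joint_probability(p_sufficient, p_potable_given_sufficient)
print(round(p_sufficient_and_potable, 3))  # 0.285
```

The range check is there only to catch a percentage (e.g. 95) being entered where a proportion (0.95) is expected.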
7.6.2
Using Bayes’ theorem to estimate conditional probabilities that cannot be directly determined
As explained in Table 7.1, the probability of two conditional events A and B occurring together, P (A,B), can be obtained in two ways:

$$P(A,B) = P(A) \times P(B \mid A) = P(B) \times P(A \mid B) \tag{7.5, copied from 7.3}$$

This formula can be rearranged and used to estimate conditional probabilities that cannot be obtained directly. For example, by rearrangement of equation (7.5) the conditional probability of P (A|B) is:

$$P(A \mid B) = \frac{P(A) \times P(B \mid A)}{P(B)} \tag{7.6}$$
As an example, for the six-sided die where P (A) is the probability of rolling an even number and P (B) is the probability of a number from 1 to 3, the probability of an even number, provided that number is from 1 to 3, is:

$$P(\text{even} \mid 1\text{–}3) = \frac{P(\text{even}) \times P(1\text{–}3 \mid \text{even})}{P(1\text{–}3)} = \frac{1/2 \times 1/3}{1/2} = 1/3$$
This has widespread applications and two examples are described below.
The test for occult blood

Cancer of the large intestine, which is often called bowel cancer, can be life threatening because it may eventually spread to other parts of the body. Bowel cancer tumours often bleed, so blood is present in an affected person’s faeces, but the very small amounts released during the early stages of the disease when the tumour is small are often not obvious, which is why it is called occult (meaning ‘hidden’) blood. Recently an extremely sensitive test for haemoglobin, the protein present in red blood cells, has been developed and is being used to detect occult blood. In several countries, a home sampling kit is now available for the user to take a very small sample from their faeces and mail it to a testing laboratory. Unfortunately, this test is not 100% accurate: it sometimes gives a false positive result (i.e. ‘blood present’) for a person who does not have bowel cancer and a false negative result (i.e. ‘blood absent’) for a person who does. The likelihood of bowel cancer increases as you get older and large-scale trials on randomly chosen groups of people (who therefore may or may not have bowel cancer) aged 55 show the test gives a positive result, symbolised by P (positive test), in about 30/1000 cases. Trials of the test on 55-year-olds who are known to have bowel cancer have shown it gives a positive result, symbolised by P (positive test|has cancer), of 99/100 (i.e. it can detect 99 of 100 known cases). Hospital records show that the incidence of bowel cancer in 55-year-olds (P (has cancer)) is about 2/1000. It is likely to be of great interest to a person who had just been told their occult blood test was positive to know the conditional probability that they do have cancer, given their test is positive (P (has cancer|positive test)), but this was not known when the test was developed. Intuitively, you might
say ‘But surely this will be the detection rate of 99/100?’, but that is the probability of a positive test for people who are known to have cancer. The conditional probability P (has cancer|positive test) can be estimated using the extension of Bayes’ theorem:

$$P(\text{has cancer}, \text{positive test}) = P(\text{has cancer}) \times P(\text{positive test} \mid \text{has cancer}) = P(\text{positive test}) \times P(\text{has cancer} \mid \text{positive test})$$

Therefore, by rearrangement:

$$P(\text{has cancer} \mid \text{positive test}) = \frac{P(\text{has cancer}) \times P(\text{positive test} \mid \text{has cancer})}{P(\text{positive test})} = \frac{2/1000 \times 99/100}{30/1000} = 0.066 \ (\text{or } 6.6\%)$$
This shows that only 6.6% of people who tested positive in the survey of 55-year-olds are likely to actually have bowel cancer. Here you may be wondering why this conditional probability is so low. It is because the test gives 30 positive results per 1000 people tested, yet the true number of cases is only two per 1000. Despite this over-reporting, the test is frequently used because it does detect some people who are unaware they have bowel cancer, which can often be successfully removed before it has spread.
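The rearranged form of Bayes’ theorem used for the occult blood test can be reproduced as a short Python sketch. The helper name `bayes_posterior` is an assumption made for illustration; the three input probabilities are the ones given in the text, held as exact fractions so that no rounding occurs.

```python
from fractions import Fraction


def bayes_posterior(p_a, p_b_given_a, p_b):
    """Return P(A|B) = P(A) x P(B|A) / P(B), as in equation (7.6)."""
    if p_b == 0:
        raise ValueError("P(B) must be greater than 0")
    return p_a * p_b_given_a / p_b


p_cancer = Fraction(2, 1000)                # incidence in 55-year-olds
p_positive_given_cancer = Fraction(99, 100) # detection rate in known cases
p_positive = Fraction(30, 1000)             # positive tests in random trials

p_cancer_given_positive = bayes_posterior(
    p_cancer, p_positive_given_cancer, p_positive
)
print(p_cancer_given_positive, float(p_cancer_given_positive))  # 33/500 0.066
```

Changing `p_positive` while leaving the other two values alone shows how strongly the answer depends on the overall rate of positive tests.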
7.6.3
Using Bayes’ theorem to reassess existing probabilities in the light of new information
A widely used and increasingly popular application of Bayes’ theorem is to estimate the probabilities of two or more mutually exclusive events, with the intention of making a decision about which is most likely to have occurred. Here you may be thinking that this sounds similar to the classic Popperian view of hypothesis testing described in Chapter 2 and you would be right, but the application of Bayes’ theorem is different. It is not used to reject or retain an hypothesis in relation to a critical probability such as 0.05 as described in Chapter 6. Instead, it gives a way of estimating the probability of each of two or more mutually exclusive events or hypotheses and updating these probabilities as more information becomes available. Here are two examples.
Which patch of forest?

A wildlife ecologist set some traps for small mammals along a road running between two patches of forest. Forest A1, on the western side of the road, had an area of 28 km² and Forest A2, on the eastern side, was only 7 km² in area. One morning, the ecologist found what appeared to be a new (i.e. undescribed) species of forest possum in one of the traps. The ecologist was extremely keen to trap more individuals, but needed to decide where to set the traps. Assuming the possum must have come from either Forest A1 or Forest A2, these possibilities can be thought of as two mutually exclusive hypothetical events: ‘From Forest A1’ or ‘From Forest A2’. If the ecologist knew which hypothesis was most likely, effort could be concentrated on trapping in that forest. You could give hypotheses A1 and A2 equal probabilities of 0.5, but this would not be helpful in deciding where to trap. Therefore, the ecologist assumed the likelihood of the possum having come from Forest A1 or A2 was proportional to the relative area of each, so the probability of the hypothesis ‘From Forest A1’ was 28/(28 + 7), which is 80% (0.8), while the probability of ‘From Forest A2’ was only 7/(28 + 7), which is 20% (0.2). Note that these two mutually exclusive events are the only ones being considered, so their probabilities sum to 1. Such probabilities, which are given their values on the basis of prior knowledge, are often called priors or prior probabilities. Nevertheless, they may not necessarily be appropriate (e.g. the area of the forest may not have any effect on the occurrence of the possum). They are also subjective: another ecologist might have assigned prior probabilities on the basis of some other aspect of the two forests. The obvious choice, on the basis of the very limited information about the relative areas of each patch, would be to concentrate further trapping in Forest A1.
There are no conditional probabilities involved – the two mutually exclusive hypotheses simply have the following probabilities:

$$P(\text{Forest } A_1) = 28/35 = 80\% \quad \text{and} \quad P(\text{Forest } A_2) = 7/35 = 20\%$$
At this stage, the possum unexpectedly disgorged its stomach contents. These were excitedly scooped up and closely examined by the ecologist who found they consisted entirely of chewed fruit of the Brown Fig tree, Fiscus browneii. From previous aerial surveys of the region, the Brown Fig
was known to occur in both forests, but was relatively uncommon in Forest A1 (only 10% of the area) and very common in Forest A2 (94% of the area). This additional information that the possum had eaten Brown Figs can be used in conjunction with Bayes’ theorem in an attempt to improve the accuracy of the probabilities that the possum was from Forest A1 or Forest A2. The conditional probabilities that the Brown Fig (B) will be encountered within each forest type are simply its percentage occurrence in each:

$$P(B \mid \text{Forest } A_1) = 10/100 \qquad P(B \mid \text{Forest } A_2) = 94/100$$
These conditional probabilities (based on the apparent diet of the possum) can be used, together with the earlier prior probabilities (based only on forest area) of P (Forest A1) and P (Forest A2), in Bayes’ theorem to update the likelihood that the possum has come from each forest type. This is done by calculating the two conditional probabilities that the possum has come from Forest A1, or Forest A2, provided it is a Brown Fig eater. The probabilities are P (Forest A1|B) and P (Forest A2|B) and are called posterior probabilities because they have been obtained after the event of obtaining some new information (which in this case was provided by the possum). The arithmetic is not difficult. First, you already have P (Forest A1) and P (Forest A2), and the two conditional probabilities for the occurrence of Brown Fig trees in each forest: P (B|Forest A1) and P (B|Forest A2). The intention is to use Bayes’ theorem to obtain the conditional probabilities that (a) the possum is from Forest A1 provided it is a Brown Fig eater: P (Forest A1|B), and (b) the conditional probability that the possum is from Forest A2 provided it is a Brown Fig eater: P (Forest A2|B). The standard form of Bayes’ theorem, where only one conditional probability is being estimated is:

$$P(A \mid B) = \frac{P(A) \times P(B \mid A)}{P(B)} \tag{7.7, copied from 7.6}$$
This formula is duplicated to give separate conditional probabilities for Forests A1 and A2:

$$P(A_1 \mid B) = \frac{P(A_1) \times P(B \mid A_1)}{P(B)} \tag{7.8}$$

and:

$$P(A_2 \mid B) = \frac{P(A_2) \times P(B \mid A_2)}{P(B)} \tag{7.9}$$
Here you need an estimate of P (B). This is straightforward and relies on a property of conditional probabilities. For equations (7.8) and (7.9) above, the denominator gives P (B), which is the probability of occurrence of the Brown Fig across both patches of forest. But because we are only considering these two patches, the Brown Fig can only occur in Forests A1 and A2, and therefore only occur as the conditional probabilities of P (B|A1) and P (B|A2). So the total overall probability P (B) across both forests is the conditional probability of Brown Fig trees in Forest A1 multiplied by the relative area of Forest A1, plus the conditional probability of Brown Fig trees in Forest A2 multiplied by the relative area of Forest A2:

$$P(B) = P(B \mid A_1) \times P(A_1) + P(B \mid A_2) \times P(A_2) \tag{7.10}$$
Therefore, you substitute the right-hand side of equation (7.10) for P (B) in equations (7.8) and (7.9). These probabilities can be used to calculate the conditional probability that the possum is from Forest A1 provided it is a Brown Fig eater, P (A1|B):

$$P(A_1 \mid B) = \frac{P(A_1) \times P(B \mid A_1)}{P(A_1) \times P(B \mid A_1) + P(A_2) \times P(B \mid A_2)} = \frac{28/35 \times 10/100}{28/35 \times 10/100 + 7/35 \times 94/100} = 0.299 \tag{7.11}$$
Second, the conditional probability that the possum is from Forest A2 provided it is a Brown Fig eater, P (A2|B), is:

$$P(A_2 \mid B) = \frac{P(A_2) \times P(B \mid A_2)}{P(B \mid A_1) \times P(A_1) + P(B \mid A_2) \times P(A_2)} = \frac{7/35 \times 94/100}{28/35 \times 10/100 + 7/35 \times 94/100} = 0.701 \tag{7.12}$$
This process is shown pictorially in Figure 7.5. In summary, before the possum delivered the additional dietary information, the hypothesis ‘From Forest A1’ had an 80% prior probability and the hypothesis ‘From Forest A2’ had only a 20% prior probability. The new information, which assumes the possum is a frequent or obligate Brown Fig eater, gives a conditional posterior probability of only 30% for ‘From Forest A1|Brown Fig eater’ and a conditional posterior probability of 70% for ‘From Forest A2|Brown Fig eater’. On the basis of these posterior probabilities, the ecologist decided to concentrate the trapping effort in Forest A2.

Figure 7.5 A pictorial explanation of the use of Bayes’ theorem. (a), (b) The prior probabilities of possum capture within Forests A1 and A2 are only based on the relative area of each forest. (c), (d) New information on the likely diet of the trapped possum provides conditional probabilities for the occurrence of Brown Figs (the possum’s food) within each forest. (e), (f) These conditional probabilities, together with the priors, are used in Bayes’ theorem to give posterior (conditional) probabilities for the likelihood of trapping this possum species within each forest, provided it is a Brown Fig eater.
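The whole forest calculation, from priors based on area through equation (7.10) to the two posterior probabilities, can be followed step by step in Python. This is a sketch under the same assumptions as the text (area-based priors and fig cover as the conditional probability); the dictionary and variable names are illustrative only.

```python
from fractions import Fraction

# Data from the example: forest areas and the proportion of each forest
# occupied by Brown Fig trees.
area_km2 = {"Forest A1": 28, "Forest A2": 7}
p_fig_given_forest = {
    "Forest A1": Fraction(10, 100),
    "Forest A2": Fraction(94, 100),
}

# Priors: proportional to relative area, so they sum to 1.
total_area = sum(area_km2.values())
priors = {forest: Fraction(area, total_area) for forest, area in area_km2.items()}

# Equation (7.10): overall probability of Brown Fig across both forests.
p_fig = sum(p_fig_given_forest[forest] * priors[forest] for forest in priors)

# Equations (7.11) and (7.12): posterior probability for each forest.
posteriors = {
    forest: priors[forest] * p_fig_given_forest[forest] / p_fig
    for forest in priors
}

for forest in priors:
    print(
        f"{forest}: prior {float(priors[forest]):.2f}, "
        f"posterior {float(posteriors[forest]):.3f}"
    )
# Forest A1: prior 0.80, posterior 0.299
# Forest A2: prior 0.20, posterior 0.701
```

Because both posteriors share the same denominator, they necessarily sum to 1, just as the priors do.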
Rolling numbers with a six-sided die

The following example is for the six-sided die used earlier in this chapter. Imagine you are in a contest to guess whether the number rolled with a six-sided die is either from 1 to 3 or from 4 to 6. These are prior mutually exclusive hypotheses A1 and A2, with an equal probability of 0.5: P (A1) = 0.5, P (A2) = 0.5. On the basis of these two prior probabilities, you would be equally likely to win the guessing contest if you chose outcome A1 or A2. The die is rolled. You are not told the exact outcome, but you are told that it is an even number and then given the opportunity to change your initial guess. The knowledge that the number is even is new information that can be used to generate posterior probabilities that the number is from 1 to 3 or from 4 to 6, provided it is even: P (1–3|even) and P (4–6|even). First, you need the conditional probabilities. For a six-sided die, the conditional probability of an even number from 1 to 3: P (even|1–3) is only 1/3, but the conditional probability of an even number from 4 to 6: P (even|4–6) is 2/3. Second, you need the overall probability of an even number when rolling a six-sided die. This example is so simple that it is obviously 3/6 or 1/2, but here is the application of equation (7.10): an even number can only be obtained by rolling a number either from 1 to 3 or 4 to 6, so the overall probability of an even number is:

$$P(\text{even}) = P(\text{even} \mid 1\text{–}3) \times P(1\text{–}3) + P(\text{even} \mid 4\text{–}6) \times P(4\text{–}6) = (1/3 \times 1/2) + (2/3 \times 1/2) = 3/6 = 1/2$$
This is all you need to calculate the posterior conditional probabilities of:

$$P(1\text{–}3 \mid \text{even}) = \frac{P(\text{even} \mid 1\text{–}3) \times P(1\text{–}3)}{P(\text{even} \mid 1\text{–}3) \times P(1\text{–}3) + P(\text{even} \mid 4\text{–}6) \times P(4\text{–}6)} = 1/3$$

and:

$$P(4\text{–}6 \mid \text{even}) = \frac{P(\text{even} \mid 4\text{–}6) \times P(4\text{–}6)}{P(\text{even} \mid 1\text{–}3) \times P(1\text{–}3) + P(\text{even} \mid 4\text{–}6) \times P(4\text{–}6)} = 2/3$$
Your best strategy would be to guess that the number was from 4 to 6, now that you have the additional information that an even number has been rolled. Each of the examples above is for only two mutually exclusive hypotheses. More generally, when estimating P (B) for two or more inclusive conditional events that can give rise to event B:

$$P(B) = \sum_{i=1}^{n} P(B \mid A_i) \times P(A_i) \tag{7.13}$$
which is an example of the addition rule. Therefore, you can calculate the probabilities of each separate event that contributes to event B by substituting equation (7.13) for P (B) in equation (7.7), which gives:

$$P(A_i \mid B) = \frac{P(A_i) \times P(B \mid A_i)}{\sum_{i=1}^{n} P(B \mid A_i) \times P(A_i)} \tag{7.14}$$
Furthermore, new posterior probabilities can be calculated if more information becomes available (e.g. a subsequent analysis of pollen from the pelt of the possum in the first example above).
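Equations (7.13) and (7.14) apply to any number of mutually exclusive hypotheses, and the posteriors from one calculation can serve as the priors for the next. The sketch below illustrates both points with the six-sided die, treating each face as a separate hypothesis. The function name `update` and the second piece of information (‘the number is greater than 3’) are assumptions added for illustration and are not part of the examples above.

```python
from fractions import Fraction


def update(priors, likelihoods):
    """Apply equation (7.14) to every hypothesis A_i.

    priors      -- dict mapping each hypothesis to P(A_i)
    likelihoods -- dict mapping each hypothesis to P(B|A_i)
    """
    # Equation (7.13): overall probability of the new information B.
    p_b = sum(likelihoods[a] * priors[a] for a in priors)
    if p_b == 0:
        raise ValueError("event B is impossible under these hypotheses")
    return {a: priors[a] * likelihoods[a] / p_b for a in priors}


faces = range(1, 7)
priors = {face: Fraction(1, 6) for face in faces}

# First piece of information: the number rolled is even.
p_even_given_face = {face: Fraction(int(face % 2 == 0)) for face in faces}
after_even = update(priors, p_even_given_face)

# Hypothetical second piece of information: the number is greater than 3.
p_over_3_given_face = {face: Fraction(int(face > 3)) for face in faces}
after_both = update(after_even, p_over_3_given_face)

print(after_even[2], after_even[4], after_even[6])  # 1/3 1/3 1/3
print(after_even[4] + after_even[6])                # 2/3, i.e. P(4-6|even)
print(after_both[4], after_both[6])                 # 1/2 1/2
```

Reusing posteriors as priors in this way is only appropriate when, as here, the second piece of information depends on nothing but which hypothesis is true.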
8
Using the normal distribution to make statistical decisions
8.1
Introduction
Scientists do mensurative or manipulative experiments to test hypotheses. The result of an experiment will differ from the expected outcome under the null hypothesis because of two things: (a) chance and (b) any effect of the experimental condition or treatment. This concept was illustrated with the chi-square test for nominal scale data in Chapter 6. Although life scientists work with nominal scale variables, most of the data they collect are measured on a ratio, interval or ordinal scale and are often summarised by calculating a statistic such as the mean (which is also called the average: see Section 3.5). For example, you might have the mean blood pressure of a sample of five astronauts who had spent the previous six months in space and need to know if it differs significantly from the mean blood pressure of the population on Earth. An agricultural scientist might need to know if the mean weight of tomatoes differs significantly between two or more fertiliser treatments. If you knew the range of values within which 95% of the means of samples taken from a particular population were likely to occur, then a sample mean within this range would be considered non-significant and one outside this range would be considered significant. This chapter explains how a common property of many variables measured on a ratio, interval or ordinal scale can be used for significance testing.
8.2
The normal curve
Statisticians began collecting data for variables measured on ratio, interval or ordinal scales in the nineteenth century and were surprised to find that the distributions often had a very consistent and predictable shape. For example, if you measure the height of the entire adult female population of a large city and plot the frequency of individuals against their height, the distribution is bell shaped and symmetrical about the mean. This is called the normal distribution (Figure 8.1).

Figure 8.1 An example of a normally distributed population. The shape of the distribution is symmetrical about the mean and the majority of values are close to this, with upper and lower ‘tails’ of relatively tall and relatively short people respectively.

The normal distribution has been found to apply to an enormous number of variables in nature (e.g. the number of erythrocytes per ml of blood, resting heart rate, reaction time, skull diameter, the maximum speed at which people can run, the initial growth rate of colonies of the mould Aspergillus niger on laboratory agar plates, the shell length of many species of marine snails, the number of abalone per square kilometre of seagrass or the number of sap-sucking bugs per tomato plant). A normal distribution is likely when several different factors each make a small contribution to the value of a variable. For example, human adult height is affected by several genes, as well as a person’s nutrition during their childhood, and each of these factors will have a small additive effect upon adult height. Even the distribution of the number of black beads in samples of six (Figure 6.1), which is only affected by six events (whether beads 1, 2, 3, 4, 5 and 6 are black or white), resembles the normal distribution. A normally distributed variable has a predictable shape that can be used to calculate the percentage of individuals or sampling units with values greater than or less than a particular value. This has been used to develop a wide range of statistical tests for ratio, interval and ordinal scale data. These are called parametric tests because they are for data with a known distribution and are straightforward, powerful and easy to apply. To use them you have to be sure your data are reasonably ‘normal’, and methods to assess
this will be described later. For data that are not normal, and for nominal scale data, non-parametric tests have been developed and are covered later in this book.
8.3
Two statistics describe a normal distribution
Only two descriptive statistics – the mean and the standard deviation – are needed to describe a normal distribution. To understand tests based on the normal distribution, you need to be familiar with these statistics and some of their properties.
8.3.1
The mean of a normally distributed population
First, the mean (the average), symbolised by the Greek μ, describes the location of the centre of the normal distribution. It is the sum of all the values (X1, X2 etc.) divided by the population size (N). The formula for the mean is:

$$\mu = \frac{\sum_{i=1}^{N} X_i}{N} \tag{8.1}$$
This needs some explanation. It contains some common standard abbreviations and symbols. First, the symbol Σ means ‘The sum of.’ Second, the symbol Xi means ‘All the X values specified by the restrictions listed below and above the Σ symbol.’ The lowest value of i is specified underneath Σ (here it is 1, meaning the first value in the data set for the population) and the highest is specified above Σ (here it is N, which means the last value in the data set for the population). Third, the horizontal line means that the quantity above this line is divided by the quantity below it. Therefore, you add up all the values (X1 to XN) and then divide this number by the size of the population (N). Some textbooks use Y instead of X. From Chapter 3, you will recall that some data can be expressed as two-dimensional graphs with an X and a Y axis. Here I will use X and show distributions with a mean on the X axis, but later in this book you will meet cases of data that can be thought of as values of Y with distributions on the Y axis.
Figure 8.2 Calculation of the variance of a population consisting of only four individuals (■) with shell lengths of 6, 7, 9 and 10 mm. The vertical line shows the mean μ. Horizontal arrows show the difference between each value and the mean. The numbers in brackets are the magnitude of each difference. The contents of the box show these differences squared, their sum and the variance obtained by dividing the sum of the squared differences by the population size.
Here is a quick example of the calculation of a mean, for a population of only four snails (N = 4) with shell lengths of 6, 7, 9 and 10 mm. The mean, μ, is the sum of these lengths divided by four: 32 ÷ 4 = 8 mm.
8.3.2
The variance of a population
Two populations can have the same mean but very different dispersions around their means. For example, a population of four snails with shell lengths of 1, 2, 9 and 10 mm will have the same mean, but greater dispersion, than another population of four with shell lengths of 5, 5, 6 and 6 mm. There are several ways of indicating dispersion. The range, which is just the difference between the lowest and highest value in the population, is sometimes used. However, the variance, symbolised by the Greek σ², provides a lot of information about the normal distribution that can be used in statistical tests. To calculate the population variance, you first calculate the population mean μ. Then, by subtraction, you calculate the difference between each value (X1. . .XN) and μ. Each of these differences is squared (to convert them to a positive quantity) and these values added together to get the sum of the squares, which is then divided by the population size. This is similar to the way the average is calculated, but here you have an average value for the dispersion. It is shown pictorially in Figure 8.2 for the population of only four snails with shell lengths of 6, 7, 9 and 10 mm. The formula for the procedure is straightforward:

$$\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N} \tag{8.2}$$
If there is no dispersion at all, the variance will be 0 (every value of X will be the same and equal to μ, so the top line in the equation above will be 0). The variance increases as the dispersion of the values about the mean increases.
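The steps of equations (8.1) and (8.2) can be followed in Python for the population of four snails. This is a minimal sketch with illustrative variable names; the cross-check uses `statistics.pvariance` from the Python standard library, which computes a population variance.

```python
import statistics

shell_lengths_mm = [6, 7, 9, 10]  # the whole population of four snails
N = len(shell_lengths_mm)

# Equation (8.1): the population mean.
mu = sum(shell_lengths_mm) / N

# Equation (8.2): square each difference from the mean, sum, divide by N.
squared_differences = [(x - mu) ** 2 for x in shell_lengths_mm]
sigma_squared = sum(squared_differences) / N

print(mu)                   # 8.0
print(squared_differences)  # [4.0, 1.0, 1.0, 4.0]
print(sigma_squared)        # 2.5

# The two populations described earlier have the same mean (5.5 mm)
# but very different variances.
print(statistics.pvariance([1, 2, 9, 10]))  # 16.25
print(statistics.pvariance([5, 5, 6, 6]))   # 0.25
```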
8.3.3
The standard deviation of a population
The importance of the variance is apparent when you obtain the standard deviation, which is symbolised for a population by σ and is just the square root of the variance. For example, if the variance is 64, the standard deviation is 8. The standard deviation is important because the mean of a normally distributed population, plus or minus one standard deviation, includes 68.27% of the values within that population. Even more importantly, 95% of the values in the population will be within ± 1.96 standard deviations of the mean. This is especially useful because the remaining 5% of the values will be outside this range and therefore furthest away from the mean (Figures 8.3(a) and (b)). Remember from Chapter 6 that 5% is the commonly used significance level. These two statistics are all you need to describe the location and width of a normal distribution and can be used to determine the proportion of the population that is less than or more than a particular value. There is an example in Box 8.1.
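The two percentages quoted above can be checked numerically with `NormalDist` from the Python standard library. In this sketch the variance of 64 is the example from the text, but the mean of 100 is an arbitrary value assumed for illustration; the proportions do not depend on which mean or standard deviation is chosen.

```python
import math
from statistics import NormalDist

variance = 64
sigma = math.sqrt(variance)  # standard deviation = 8.0
mu = 100.0                   # hypothetical mean, chosen only for illustration

population = NormalDist(mu=mu, sigma=sigma)


def proportion_within(k):
    """Proportion of the population within +/- k standard deviations of the mean."""
    return population.cdf(mu + k * sigma) - population.cdf(mu - k * sigma)


print(sigma)                             # 8.0
print(f"{proportion_within(1.0):.4f}")   # 0.6827
print(f"{proportion_within(1.96):.4f}")  # 0.9500
```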
Figure 8.3 Illustration of the proportions of the values in a normally distributed population. (a) 68.27% of values are within the range of ± 1 standard deviation from the mean. (b) 95% of values are within the range of ± 1.96 standard deviations from the mean. These percentages correspond to the shaded area of the distribution enclosed by the two vertical lines.

Box 8.1 Use of the standard normal distribution
For a normally distributed population with a mean height of 170 cm and a standard deviation of 10 cm, 95% of the people in that population will have heights within the range of 170 ± (1.96 × 10) (which is from 150.4 to 189.6 cm). You only have a 5% chance of finding someone who is either taller than 189.6 cm or shorter than 150.4 cm (Figure 8.4).

Figure 8.4 For a normally distributed population with a mean height of 170 cm and a standard deviation of 10 cm, 95% of the people in that population will have heights within the range of 170 ± (1.96 × 10) cm.

8.3.4
The Z statistic
The proportions of the normal distribution described in the previous section can be expressed in a different and more useful way. For a normal distribution, the difference between any value and the mean, divided by the standard deviation, gives a ratio called the Z statistic that is also normally distributed but with a mean of 0 and a standard deviation of 1. This is called the standard normal distribution:

$$Z = \frac{X_i - \mu}{\sigma} \tag{8.3}$$
Consequently, the value of the Z statistic specifies the number of standard deviations it is from the mean. For the example in Box 8.1, a value of 189.6 cm is:

$$\frac{189.6 - 170}{10} = 1.96 \text{ standard deviations away from the mean.}$$

In contrast, a value of 175 cm is:

$$\frac{175 - 170}{10} = 0.5 \text{ standard deviations away from the mean.}$$
Once the Z statistic is greater than +1.96 or less than −1.96 the probability of obtaining that value of X is less than 5%. The Z statistic will be discussed again later in this chapter.
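The Z statistic for the heights in Box 8.1 can be calculated with the short Python sketch below. The function names `z_statistic` and `two_tailed_probability` are assumptions made for illustration; `NormalDist()` with no arguments is the standard normal distribution (mean 0, standard deviation 1) from the Python standard library.

```python
from statistics import NormalDist


def z_statistic(x, mu, sigma):
    """Equation (8.3): number of standard deviations x lies from the mean."""
    return (x - mu) / sigma


def two_tailed_probability(z):
    """Probability of a value at least |z| standard deviations from the mean,
    in either direction, for a normally distributed population."""
    return 2 * (1 - NormalDist().cdf(abs(z)))


mu, sigma = 170, 10  # mean height and standard deviation (cm) from Box 8.1

# The range expected to contain 95% of the population.
lower, upper = mu - 1.96 * sigma, mu + 1.96 * sigma
print(round(lower, 1), round(upper, 1))  # 150.4 189.6

z_tall = z_statistic(189.6, mu, sigma)
z_typical = z_statistic(175, mu, sigma)
print(round(z_tall, 2), z_typical)       # 1.96 0.5

print(round(two_tailed_probability(z_tall), 3))     # 0.05
print(round(two_tailed_probability(z_typical), 3))  # 0.617
```

A height of 175 cm is therefore unremarkable, whereas a height of 189.6 cm sits right on the boundary of the outer 5% of the population.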
8.4
Samples and populations
Life scientists usually work with samples. The equations for the mean, variance and standard deviation given above are for a population – the case where you have obtained data for every individual present. For a population the values of µ, σ² and σ are called parameters or population statistics and are true values for that population (assuming no mistakes in measurement or calculation). When you take a sample from a population and calculate the sample mean, sample variance and sample standard deviation, they are true values for the sample, but only estimates of the population statistics µ, σ² and σ. Because of this, sample statistics are given different symbols (the Roman X̄, s² and s respectively). But remember – because they are only estimates, they may not be accurate measures of the true population statistics.
8.4.1
The sample mean
First, the procedure for calculating a sample mean is the same as for the population mean, except (as mentioned above) the sample mean is symbolised by X̄ because it is only an estimate of µ. The sample mean is:

$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} \tag{8.4}$$
Note that the lower case n is used to indicate the sample size, compared to the capital N used to indicate the population size in equation (8.1).
8.4.2
The sample variance
When you calculate the sample variance, s², this estimate of σ² is also likely to be subject to error. Small sample size also introduces a consistent bias, but this can be compensated for by a modification to equation (8.2). For a population, the variance is:

$$\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N} \tag{8.5, copied from 8.2}$$

In contrast, the sample variance is estimated using the following formula:

$$s^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1} \tag{8.6}$$
Note that the sum of squares is divided by n – 1, when you would expect it to be divided by n. This is to reduce a bias caused by small sample size and is easily explained by an example. Imagine you wanted to estimate the population variance of the height of all adult females in a population of 10 000 by sampling only 100. This small sample is unlikely to include a sufficient proportion of people who are in either the upper or lower extremes within that population (the really short and really tall people), because there are relatively few of them. These will, nevertheless, make a large contribution to the true population variance because they are so far from the mean that the value of (Xi − μ)² will be a large quantity for every one of those individuals. So the sample variance will tend to underestimate the population variance and needs to be corrected. To illustrate this, I ask my students to look around the lecture room and ask themselves ‘Are there any extremely tall or very short people present?’ The answer so far has been ‘No’, but one day, depending on who shows up to my classes, I may have to choose a different variable. To make s² the best possible estimate of σ², you divide the sum of squares by n – 1, not n. This correction will make the sample variance (and sample standard deviation) larger. Note that this correction will have a considerable effect when n is small (imagine dividing by 3 instead of 4) but less effect as sample size increases (imagine dividing by 999 instead of 1000). Less correction is needed as sample size increases because larger samples are more likely to include individuals in the extremes of the variable you are measuring. Here you may be thinking ‘Why don’t I have to correct the mean in this way as well?’ This is not necessary because you are equally likely to miss out on sampling both the positive and negative extremes of the population.
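The effect of dividing by n – 1 instead of n can be seen in a small simulation. The sketch below makes several assumptions that are not in the text: heights are drawn from a hypothetical normal population with a mean of 170 cm and a standard deviation of 10 cm (so the true variance is 100), each sample contains only four individuals, and the random seed is fixed so the run is repeatable.

```python
import random
import statistics

rng = random.Random(1)            # fixed seed (arbitrary) for repeatability
TRUE_MEAN, TRUE_SD = 170.0, 10.0  # hypothetical population; variance = 100
SAMPLE_SIZE = 4
NUMBER_OF_SAMPLES = 20_000

divided_by_n = []
divided_by_n_minus_1 = []
for _ in range(NUMBER_OF_SAMPLES):
    sample = [rng.gauss(TRUE_MEAN, TRUE_SD) for _ in range(SAMPLE_SIZE)]
    sample_mean = sum(sample) / SAMPLE_SIZE
    sum_of_squares = sum((x - sample_mean) ** 2 for x in sample)
    divided_by_n.append(sum_of_squares / SAMPLE_SIZE)
    divided_by_n_minus_1.append(sum_of_squares / (SAMPLE_SIZE - 1))  # eq. (8.6)

# Average each estimate over all the samples and compare with 100.
mean_uncorrected = statistics.fmean(divided_by_n)
mean_corrected = statistics.fmean(divided_by_n_minus_1)
print(round(mean_uncorrected, 1), round(mean_corrected, 1))
```

Under these assumptions the average of the uncorrected estimates should fall well below the true variance of 100 (near 75 for samples of four), while the average of the corrected estimates should lie close to 100.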
8.5 The distribution of sample means is also normal
As discussed earlier, when you do an experiment and measure a ratio, interval or ordinal scale variable on a sample from a population, two things will affect the value of the mean of that sample. It may differ from the population mean by chance, but it may also be affected by the experimental treatment. Therefore, if you knew the range around the population mean within which 95% of the sample means would be expected to occur when the null hypothesis applies (i.e. there is only the effect of chance and no effect of treatment), you could use this known range to decide whether the mean of an experimental group was significantly different (or not) to the population mean. Statisticians initially investigated the range of the sample means expected by chance by taking several random samples from a large population and found that the distribution of the means of these samples was also predictable. For a lot of samples of a certain size (n) taken at random from a population, the sample means are unlikely to all be the same: they will be dispersed around the population mean μ. Statisticians have shown that the distribution of these sample means is also normal with its own mean (which is also μ), variance and standard deviation. The standard deviation of the distribution of sample means is a particularly important statistic. It is called the standard error of the mean (or the standard error, or abbreviated as SEM or SE) and given the symbol $\sigma_{\bar{X}}$ to distinguish it from the sample standard deviation (s) and the population standard deviation (σ). As sample size increases, the standard
Figure 8.5 The distribution of sample means and the effect of sample size. The heavy line shows the distribution of a population with a known mean μ. The lighter line and shaded area show the distribution of the means of 200 independent samples, each of which has a sample size of (a) 2, (b) 20 and (c) 200. Note that the distribution of the sample means is normal with a mean of μ and that its expected range decreases as sample size increases. The double-headed arrow shows the range within which 95% of the sample means are expected to occur.
error of the mean decreases and therefore the accuracy of any single estimate of the population mean is likely to improve. This is shown in Figure 8.5. When you take a lot of samples, each of size n, from a population whose parametric statistics are known (as illustrated in Figures 8.5(a),
Table 8.1 A numerical example of the effect of sample size on the accuracy and precision of values of $\bar{X}$ obtained by taking random samples of size 2, 20 or 200 from a population with a known variance of 600. As sample size increases, the values of the sample means become much closer to the population mean. Precision improves and therefore the sample means will tend to be more accurate estimates of μ.

| Population parameter $\sigma^2$ | Population parameter $\sigma$ | Sample size (n) | $\sqrt{n}$ | Standard error of the mean ($\sigma/\sqrt{n}$) |
|---|---|---|---|---|
| 600 | 24.49 | 2 | 1.41 | 17.32 |
| 600 | 24.49 | 20 | 4.47 | 5.48 |
| 600 | 24.49 | 200 | 14.14 | 1.73 |
(b) and (c)), the standard error of the mean can be estimated by dividing the standard deviation of the population by the square root of the sample size:

$$\mathrm{SEM} = \sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}} \qquad (8.7)$$
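Equation (8.7) is simple enough to check directly. The sketch below applies it to a population variance of 600 and sample sizes of 2, 20 and 200; the function name is an assumption made for illustration. Because nothing is rounded until the final step, the last digit may differ slightly from a hand calculation that uses rounded intermediate values.

```python
import math

def standard_error_of_mean(sigma, n):
    """Equation (8.7): population standard deviation divided by sqrt(n)."""
    return sigma / math.sqrt(n)

population_variance = 600
sigma = math.sqrt(population_variance)  # about 24.49

for n in (2, 20, 200):
    sem = standard_error_of_mean(sigma, n)
    print(f"n = {n:3d}   sqrt(n) = {math.sqrt(n):5.2f}   SEM = {sem:5.2f}")
```

The standard error shrinks in proportion to the square root of n, so a hundredfold increase in sample size reduces it only tenfold.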
A numerical example is given in Table 8.1, which clearly illustrates that the means of larger samples are likely to be relatively close to the population mean. The standard error of the mean can be used to calculate the range within which a particular percentage of the sample means will occur. Because the sample means are normally distributed with a mean of μ, then μ ± 1 SEM will include 68.27% of the sample means and μ ± 1.96 SEM will include 95% of the sample means. This can also be expressed as a ratio. The difference between any sample mean, $\bar{X}$, and the population mean, μ, divided by the standard error of the mean:

$$\frac{\bar{X} - \mu}{\sigma_{\bar{X}}} \qquad (8.8)$$

will give the Z statistic already discussed in Section 8.3.4, which always has a mean of 0 and a standard deviation of 1. As the difference between $\bar{X}$ and μ
Figure 8.6 Distribution of the Z statistic (the ratio $(\bar{X} - \mu)/\mathrm{SEM}$ obtained by taking the means of a large number of small samples from a normal distribution). By chance 95% of the sample means will be within the range –1.96 to +1.96 (shown by the black horizontal bar), with the remaining 5% outside this range.
Box 8.2 Use of the Z statistic
The known population value of μ is 100 and σ is 36. You take a sample of 16 individuals and obtain a sample mean of 81. What is the probability that this sample is from the population?
μ = 100, σ = 36, n = 16, so $\sqrt{n} = 4$ and the SEM $= \sigma/\sqrt{n} = 36/4 = 9$.
Therefore the value of:

$$\frac{\bar{X} - \mu}{\mathrm{SEM}} \quad \text{is} \quad \frac{81 - 100}{9} = -2.11$$

The ratio is outside the range of ± 1.96, so the probability that the sample has come from a population with a mean of 100 is less than 0.05. The sample mean is significantly different to the population mean.
increases, the value of Z will become increasingly positive (if $\bar{X}$ is greater than μ) or increasingly negative (if $\bar{X}$ is less than μ). Once the value of Z is less than –1.96, or greater than +1.96, the probability of getting that difference between the sample mean and the known population mean is less than 5% (Figure 8.6). This formula can be used to test hypotheses about the means of samples when population parameters are known. Box 8.2 gives a worked example.
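The calculation in Box 8.2 can be reproduced in a few lines. This is a minimal sketch rather than part of the original text: the function names are assumptions, and the two-tailed probability is obtained from the standard normal distribution in Python's `statistics` module instead of from printed tables.

```python
import math
from statistics import NormalDist

def z_statistic(sample_mean, mu, sigma, n):
    """Difference between sample mean and population mean, in units of SEM."""
    sem = sigma / math.sqrt(n)
    return (sample_mean - mu) / sem

def two_tailed_probability(z):
    """Probability of a Z value at least this far from 0 in either direction."""
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Values from Box 8.2
z = z_statistic(sample_mean=81, mu=100, sigma=36, n=16)
p = two_tailed_probability(z)

print(f"Z = {z:.2f}")
print(f"two-tailed probability = {p:.3f}")
print("significant at the 5% level" if abs(z) > 1.96 else "not significant")
```

Comparing |Z| with 1.96 and comparing the two-tailed probability with 0.05 are equivalent ways of making the same decision.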
8.6 What do you do when you only have data from one sample?
As shown above, the standard error of the mean is particularly important for hypothesis testing because it can be used to predict the range around µ within which 95% of means of a certain sample size will occur. Unfortunately, a researcher usually does not know the true values of the population parameters μ and σ because they only have a sample, and statistical decisions have to be made from the limited information provided by that sample. Here, too, knowing the standard error of the mean would be extremely useful. If you only have data from a sample, you can still calculate the sample mean ($\bar{X}$), the sample variance ($s^2$) and sample standard deviation (s). These are your best estimates of the population statistics μ, $\sigma^2$ and σ. You can use s to estimate the standard error of the mean by substituting s for σ in equation (8.7). This is also called the standard error of the mean and abbreviated as ‘SEM’, where s is the sample standard deviation and n is the sample size:

$$s_{\bar{X}} = \frac{s}{\sqrt{n}} \qquad (8.9)$$

Note from equation (8.9) that the sample SEM estimated in this way has a different symbol to the SEM estimated from the population statistics (it is $s_{\bar{X}}$ instead of $\sigma_{\bar{X}}$). This estimate of the standard error of the mean of the population, made from your sample, can be used to predict the range around any hypothetical value of μ within which 95% of the means of all samples of size n taken from that population will occur. It is called the 95% confidence interval and the upper and lower values for its range are called the 95% confidence limits (Figure 8.7). Therefore, in terms of making a decision about whether your sample mean differs significantly from an expected value of µ, the formula:

$$\frac{\bar{X} - \mu_{\text{expected}}}{s_{\bar{X}}} \qquad (8.10)$$
corresponds to equation (8.8), but with $s_{\bar{X}}$ used instead of $\sigma_{\bar{X}}$ as the SEM. Here it seems logical that once this ratio is less than –1.96 or greater than +1.96, the difference between the sample mean and the expected value
Figure 8.7 If you only have one sample (the heavy curve), you can calculate the sample standard deviation, s, which is your only estimate of the population standard deviation σ. From this you can also estimate the standard error of the mean of the population by dividing the sample standard deviation by the square root of the sample size (Formula 8.9). The shaded distribution is the expected distribution of sample means. The black horizontal bar and the two vertical lines show the range within which 95% of the means of all samples of size n taken from a population with an hypothetical mean of μ would be expected to occur.
would be considered statistically significant at the 5% level. This is an appropriate procedure, but a correction is needed, especially for samples of fewer than 100, which are very prone to sampling error and therefore likely to give poor estimates of the population mean, standard deviation and standard error of the mean. For small samples, the distribution of the ratio given by equation (8.10) is wider and flatter than the distribution obtained by calculating the standard error of the mean from the (known) population standard deviation. As sample size increases, the distribution gets closer and closer to the one shown in Figure 8.6, as shown in Figures 8.8(a), (b) and (c). Therefore the use of equation (8.10) is appropriate, but for small samples the range within which 95% of the values of the ratio will occur is wider (e.g. for a sample size of only four, the adjusted range within which 95% of values would be expected to occur is from –3.182 to +3.182). Using this correction, you can test hypotheses about your sample mean $\bar{X}$ without knowing the population statistics. The shape of this wider and flatter distribution of the expected ratio for small samples was established by W.S. Gosset, who published his work under the pseudonym of ‘Student’ (see Student, 1908). This is why the distribution is often called the ‘Student’ distribution or ‘Student’s t’ distribution. Examples of the distribution of t are shown in Figure 8.8 and Table 8.2.
Table 8.2 The range of the 95% confidence interval for the t statistic in relation to sample size. (a) n = 4, (b) n = 60, (c) n = 200, (d) n = 1000 and (e) n = ∞. Note that the 95% confidence interval decreases as the sample size increases. Values of t were calculated using the equations given by Zelen and Severo (1964).

| | Formula | Statistic | Sample size | 95% confidence interval |
|---|---|---|---|---|
| (a) | $(\bar{X} - \mu)/s_{\bar{X}}$ | t | 4 | ± 3.182 |
| (b) | $(\bar{X} - \mu)/s_{\bar{X}}$ | t | 60 | ± 2.001 |
| (c) | $(\bar{X} - \mu)/s_{\bar{X}}$ | t | 200 | ± 1.972 |
| (d) | $(\bar{X} - \mu)/s_{\bar{X}}$ | t | 1000 | ± 1.962 |
| (e) | $(\bar{X} - \mu)/s_{\bar{X}}$ | t | ∞ | ± 1.96 |
Figure 8.8 The distribution of the t statistic obtained when the sample statistic, s, is used as an estimate of σ: (a) n = 4, (b) n = 60, (c) n = ∞.
As sample size increases, the 95% critical value of the t statistic decreases and becomes closer and closer to 1.96, which is the value for a sample of infinite size.
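The way the 95% range of t narrows towards ± 1.96 can be checked numerically. The sketch below is an illustration only and is not the method of Zelen and Severo (1964): it integrates the density of Student's t distribution with Simpson's rule and uses bisection to find the value of t that encloses the central 95% of the distribution, taking n – 1 degrees of freedom for a sample of size n. All function names, the number of integration steps and the search limits are assumptions.

```python
import math

def t_density(x, df):
    """Probability density of Student's t distribution with df degrees of freedom."""
    log_coeff = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                 - 0.5 * math.log(df * math.pi))
    return math.exp(log_coeff - (df + 1) / 2 * math.log1p(x * x / df))

def central_area(t, df, steps=1000):
    """Area under the t density between -t and +t (Simpson's rule, even steps)."""
    h = t / steps
    total = t_density(0.0, df) + t_density(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_density(i * h, df)
    # The density is symmetrical, so double the area from 0 to t.
    return 2 * total * h / 3

def critical_t(df, central=0.95):
    """Value of t such that the range -t to +t holds the central proportion."""
    low, high = 0.0, 100.0
    for _ in range(50):  # bisection
        mid = (low + high) / 2
        if central_area(mid, df) < central:
            low = mid
        else:
            high = mid
    return (low + high) / 2

for n in (4, 60, 200, 1000):
    df = n - 1
    print(f"n = {n:4d}   df = {df:3d}   95% range = ±{critical_t(df):.3f}")
```

Under these assumptions the printed values should be close to those listed for the corresponding sample sizes in Table 8.2.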
8.7 Use of the 95% confidence interval in significance testing
Sample statistics like the mean, variance, standard deviation and especially the standard error of the mean are estimates of population statistics that can be used to predict the range within which 95% of the means of a particular sample size will occur. Knowing this, you can use a parametric test to estimate the probability a sample has been taken from a population with a known or expected mean, or the probability that two samples are from the same population. These tests are described in Chapter 9. Here you may well be thinking ‘These statistical methods have the potential to be very prone to error! My sample mean may be an inaccurate estimate of μ and then I’m using the sample standard deviation (i.e. s) to infer the standard error of the mean.’ This is true and unavoidable when you extrapolate from only one sample, but the corrections described in this chapter and knowledge of how the sample mean is likely to become a more accurate estimate of μ as sample size increases help ensure that the best possible estimates are obtained.
8.8 Distributions that are not normal
Some variables do not have a normal distribution. Nevertheless, statisticians have shown that even when a population does not have a normal distribution and you take repeated samples of size 25 or more, the distribution of the means of these samples will have an approximately normal distribution with a mean µ and standard error of the mean $\sigma/\sqrt{n}$, just as they do when the population is normal (Figures 8.9(a) and (b)). Furthermore, for populations that are approximately normal, this even holds for samples as small as five. This property, which is called the central limit theorem, makes it possible to use tests based on the normal distribution provided you have a reasonable-sized sample.
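The central limit theorem is easy to explore by simulation. The following sketch is not from the original text: it assumes a strongly right-skewed population (an exponential distribution with mean 1 and standard deviation 1), takes 2000 random samples of n = 25 and compares the mean and standard deviation of the sample means with µ and $\sigma/\sqrt{n}$. The population, the number of samples, the seed and the function name are all assumptions chosen for illustration.

```python
import random
import statistics

def sample_means(draw, n, number_of_samples):
    """Means of repeated random samples of size n; draw() returns one value."""
    return [statistics.fmean([draw() for _ in range(n)])
            for _ in range(number_of_samples)]

rng = random.Random(42)  # fixed seed so the run is repeatable

# Assumed non-normal population: exponential, with mean 1 and SD 1.
mu, sigma, n = 1.0, 1.0, 25
means = sample_means(lambda: rng.expovariate(1.0), n, number_of_samples=2000)

print(f"mean of the sample means: {statistics.fmean(means):.3f}  (mu = {mu})")
print(f"SD of the sample means:   {statistics.stdev(means):.3f}  "
      f"(sigma / sqrt(n) = {sigma / n ** 0.5:.3f})")
```

Plotting a histogram of `means` would show a roughly symmetrical, bell-shaped distribution even though the individual values come from a skewed population.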
Figure 8.9 An example of the central limit theorem. (a) The distribution of a population that is not normally distributed, with mean μ and standard deviation σ. The means of samples of 25 or more from this population will have an approximately normal distribution with mean μ and standard error of $\sigma/\sqrt{n}$. (b) The distribution of the means of 200 samples, each of n = 25 taken at random from the population, is approximately normal with a mean of μ and standard error of $\sigma/\sqrt{n}$.
8.9 Other distributions
Not all data are normally distributed. Sometimes a frequency distribution may resemble a normal distribution and be symmetrical, but is much flatter (compare Figures 8.10(a) and (b)). This is a platykurtic distribution. In contrast, a distribution that resembles a normal distribution but has too many values around the mean and in the tails is leptokurtic (Figure 8.10(c)). A distribution similar to a normal one but asymmetrical, in that one tail extends further than the other, is skewed. If the upper tail is longer, the distribution has a positive skew (Figure 8.10(d)), and if the lower tail is longer, it has a negative skew. Other distributions include the binomial distribution and the Poisson distribution. The binomial distribution was used in Chapter 6. If the sampling units in a population can be partitioned into two categories (e.g.
Figure 8.10 Distributions that are similar to the normal distribution. (a) A normal distribution, (b) a platykurtic distribution, (c) a leptokurtic distribution, (d) positive skew.
black and white beads in a sack), then the probability of sampling a particular category will be its proportion in the population (e.g. 0.5 for a population where half the beads are black and half are white). The proportions of each of the two categories in samples containing two or more individuals will follow a pattern called the binomial distribution. Table 6.2 gave the expected distribution of the proportions of two colours in samples where n = 6 from a population containing 50% black and 50% white beads. The Poisson distribution applies when you sample something by examining randomly chosen patches of a certain size, within which there is a very low probability of finding what you are looking for, so most of your data will be the value of 0. For example, the koala (an Australian arboreal leaf-eating mammal sometimes erroneously called a ‘koala bear’) is extremely uncommon in most parts of Queensland and you can walk through some areas of forest for weeks without even seeing one. If you sample a large number of randomly chosen 1 km² patches of forest, you will generally record no koalas. Sometimes you will find one koala, even more rarely two and, very rarely indeed, three or more. This will generate a Poisson distribution where most values are ‘0’, a few are ‘1’ and even fewer are ‘2’ and ‘3’ etc.
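Both distributions can be calculated from their standard formulae. The sketch below computes the binomial probabilities for the number of black beads in a sample of six from a population that is 50% black, and Poisson probabilities for counts per patch. The mean of 0.1 koalas per patch is a hypothetical value chosen only to illustrate the shape of the distribution, and the function names are assumptions.

```python
import math

def binomial_probability(k, n, p):
    """Probability of exactly k individuals of one category in a sample of n."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

def poisson_probability(k, mean_per_patch):
    """Probability of exactly k occurrences in a patch, given the mean per patch."""
    return math.exp(-mean_per_patch) * mean_per_patch ** k / math.factorial(k)

print("Black beads in a sample of 6 (population 50% black):")
for k in range(7):
    print(f"  {k} black: {binomial_probability(k, 6, 0.5):.4f}")

assumed_mean = 0.1  # hypothetical mean number of koalas per 1 km² patch
print("Koalas per patch (assumed mean of 0.1):")
for k in range(4):
    print(f"  {k} koalas: {poisson_probability(k, assumed_mean):.4f}")
```

With a low mean per patch, almost all of the Poisson probability falls on a count of 0, which matches the pattern described for the koala survey.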
8.10 Other statistics that describe a distribution
8.10.1 The median
The median is the middle value of a set of data listed in order of magnitude. For example, a sample with the values 1, 6, 3, 9, 4, 11 and 16 is ranked in order as 1, 3, 4, 6, 9, 11, 16 and the middle value is 6. You can calculate the location of the value of the median using the formula:

$$M = X_{(n+1)/2} \qquad (8.11)$$

which means ‘The median is the value of X whose numbered position in an ordered sequence corresponds to the sample size plus one, and then divided by two.’ For the sample of seven listed above the median is the fourth value, $X_4$, which is 6. For sample sizes that are an even number, the median will lie between two values (e.g. $X_{5.5}$ for a sample of ten) in which case it is the average of the value below and the value above. The procedure becomes more complex when there are tied values, but most statistical packages will calculate the median of a set of data.
8.10.2 The mode
The mode is defined as the most frequently occurring value in a set of data, so the normal distribution, which has a single peak, is unimodal (Figure 8.11(a)). Sometimes, however, a distribution may have two or more clearly separated peaks, in which case it is bimodal (Figure 8.11(b)) or multimodal.
8.10.3 The range
The range is the difference between the largest and smallest value in a sample or population. The range of the set of data in Section 8.10.1 is 16 – 1 = 15.
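The three statistics in this section can be calculated with Python's standard `statistics` module. In this sketch the first data set is the sample from Section 8.10.1. The second data set, used for the mode, is hypothetical, because every value in the first sample occurs only once.

```python
import statistics

data = [1, 6, 3, 9, 4, 11, 16]            # sample from Section 8.10.1
ordered = sorted(data)

position = (len(ordered) + 1) / 2          # equation (8.11): (n + 1) / 2
median = statistics.median(data)
data_range = max(data) - min(data)

hypothetical = [2, 3, 3, 5, 7, 7, 7, 9]    # hypothetical data with repeated values
modes = statistics.multimode(hypothetical)  # list of the most frequent value(s)

print(f"ordered data: {ordered}")
print(f"median is at position {position:g}: {median}")
print(f"range: {data_range}")
print(f"mode(s) of the hypothetical data: {modes}")
```

For an even sample size `statistics.median` returns the average of the two middle values, as described in Section 8.10.1.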
Figure 8.11 (a) A unimodal distribution and (b) a bimodal distribution.
8.11 Summary and conclusion
The mean and standard deviation are all that are needed to describe any normal distribution. Importantly, the distribution of the means of samples from a normal population is also normal, with a mean of µ and a standard error of the mean of $\sigma/\sqrt{n}$. The range within which 95% of the sample means are expected to occur is μ ± 1.96 × SEM and this can be used to decide whether a particular sample mean can be considered significantly different (or not) to the population mean. Even if you do not know the population statistics, the standard error of the mean can be estimated from the sample statistics as $s/\sqrt{n}$. Here too you can also use the properties of the normal distribution and the appropriate value of t to predict the range around $\bar{X}$ (your best and only estimate of μ) within which 95% of the means of all samples of size n taken from that population will occur. Even more usefully, provided you have a sample size of about 25 or more these properties of the distribution of sample means apply even when the population they have been taken from is not normal, provided it is not grossly non-normal (e.g. bimodal). Therefore, you can often use a parametric test to make decisions about sample means even when the population you have sampled is not normally distributed.
8.12 Questions
(1) It is known that a population of the snail Calcarus porosis on Kangaroo Island, South Australia, has a mean shell length of 100 mm and a standard deviation of 10 mm. An ecologist measured one snail from this population and found it had a shell length of 75 mm. The ecologist said ‘This is an impossible result.’ Please comment on what was said, including whether you agree or disagree and why.
(2) Why does the variance calculated from a sample have to be corrected to give a realistic indication of the variance of the population from which it has been taken?
(3) A sample of 16 adult weaver rats, Rattus weaveri, found in a storage freezer in a museum and only labelled ‘Expedition to North and South Keppel Islands, 1984: all from site 3’ had a mean body weight of 875 grams. The mean body weight for the population of adult weaver rats on North Keppel Island is 1000 grams (1 kg), with a standard deviation of 400 grams and the population on South Keppel Island has a mean weight of 650 grams and a standard deviation of 400 grams.
(a) From these population statistics, calculate the SEM and the range within which you would expect 95% of the means for samples of 16 rats from each population.
(b) Which of the two islands is the sample most likely to have come from?
(c) Please discuss.
9 Comparing the means of one and two samples of normally distributed data
9.1 Introduction
This chapter explains how some parametric tests for comparing the means of one and two samples actually work. The first test is for comparing a single sample mean to a known population mean. The second is for comparing a single sample mean to an hypothesised value. These are followed by a test for comparing two related samples and a test for two independent samples.
9.2 The 95% confidence interval and 95% confidence limits
In Chapter 8 it was described how 95% of the means of samples of a particular size, n, taken from a population with a known mean, μ, and standard deviation, σ, would be expected to occur within the range of μ ± 1.96 × SEM. This range is called the 95% confidence interval, and the actual numbers that show the limits of that range (μ ± 1.96 × SEM) are called the 95% confidence limits. If you only have data for one sample, the sample standard deviation, s, is your best estimate of σ and can be used with the appropriate t statistic to calculate the 95% confidence interval around an expected or hypothesised value of μ. You have to use the formula μ ± t × SEM, because the population statistics are not known. This will give a wider confidence interval than ± 1.96 × SEM because the value of t for a finite sample size is always greater than 1.96, and can be very large for a small sample (Chapter 8).
9.3 Using the Z statistic to compare a sample mean and population mean when population statistics are known
The Z test gives the probability that a sample mean has been taken from a population with a known mean and standard deviation. From the
Figure 9.1 The 95% confidence interval obtained by taking the means of a large number of small samples from a normally distributed population with known statistics is indicated by the black horizontal bar enclosed within μ ± 1.96 × SEM. The remaining 5% of sample means are expected to be further away from μ. Therefore, a sample mean that lies inside the 95% confidence interval will be considered to have come from the population with a mean of μ, but a sample mean that lies outside the 95% confidence interval will be considered to have come from a population with a mean significantly different to μ, assuming an α of 0.05.
population statistics µ and σ you can calculate the expected standard error of the mean, $\sigma/\sqrt{n}$, for a sample of size n and therefore the 95% confidence interval (Figure 9.1), which is the range within μ ± 1.96 × SEM. If your sample mean, $\bar{X}$, occurs within this range, then the probability it has come from the population is 0.05 or greater, so the mean of the population from which the sample has been taken is not significantly different to the known population mean. If, however, your sample mean occurs outside the confidence interval, the probability it has been taken from the population is less than 0.05, so the mean of the population from which the sample has been taken is significantly different to the known population mean. If you decide on a probability level other than 0.05, you simply need to use a different value than 1.96 (e.g. for the 99% confidence interval you would use 2.576). This is a very straightforward test (Figure 9.1). Although you could calculate the 95% confidence limits every time you made this type of comparison, it is far easier to calculate the ratio $Z = (\bar{X} - \mu)/\mathrm{SEM}$ as described in Sections 8.3.4 and 8.5. All this formula does is divide the distance between the sample mean and the known population mean by the standard error, so once the value of Z is less than –1.96 or greater than +1.96, the mean of the population from which the sample has been taken is
Figure 9.2 For a Z test, 95% of the sample means will be expected to be within the range of (μ ± 1.96 × SEM) (black bar). Therefore, once the difference between the sample mean and the population mean (dark grey bar) divided by the standard error of the mean (pale grey short bar which is 1 SEM long) is (a) greater than +1.96 or (b) less than −1.96, it will be significant.
considered significantly different to the known population mean, assuming an α of 0.05 (Figures 9.2(a) and (b)). Here you may be wondering if a population mean could ever be known, apart from small populations where every individual has been censused. Sometimes, however, researchers have so many data for a particular variable that they can assume the sample statistics indicate the true values of population statistics. For example, many physiological variables such as the number of red (or white) cells per millilitre of blood, fasting blood glucose levels and resting body temperature have been measured on several million healthy people. This sample is so large it can be considered to give extremely accurate estimates of the population statistics. Remember, as sample size increases, $\bar{X}$ becomes closer and closer to the
true population mean and the correction of n − 1 used to calculate the standard deviation also becomes less and less important. There is an example of the comparison between a sample mean and a ‘known’ population mean in Box 9.1.
Box 9.1 Comparison between a sample mean and a known population mean when population statistics are known
The mean number of white blood cells per ml of blood in healthy adults is 7500 per ml, with a standard deviation of 1250. These statistics are from a sample of over one million people and are therefore considered to be the population statistics µ and σ. Ten astronauts who had spent six months in space had their white cell counts measured as soon as they returned to Earth. The data are shown below. What is the probability that the sample mean $\bar{X}$ has been taken from the healthy population?
The white cell counts are: 7120, 6845, 7055, 7235, 7200, 7450, 7750, 7950, 7340 and 7150 cells/ml.
The population statistics for healthy human adults are µ = 7500 and σ = 1250
The sample size n = 10
The sample mean $\bar{X}$ = 7310.5
The standard error of the mean $= \sigma/\sqrt{n} = 1250/\sqrt{10} = 395.3$
Therefore, 1.96 × SEM = 1.96 × 395.3 = 774.76 and the 95% confidence interval for the means of samples of n = 10 is 7500 ± 774.76, which is from 6725.24 to 8274.76.
Since the mean white cell count of the ten astronauts lies within the range in which 95% of means with n = 10 would be expected to occur by chance, the sample mean does not differ significantly from the mean µ of the healthy population.
Expressed as a formula:

$$Z = \frac{\bar{X} - \mu}{\mathrm{SEM}} = \frac{7310.5 - 7500}{395.3} = \frac{-189.5}{395.3} = -0.4794$$

Here, too, because the Z value lies within the range of ± 1.96, the mean of the population from which the sample has been taken does not differ significantly from the mean of the healthy population. The negative value is caused by the sample mean being less than the population mean.
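The steps in Box 9.1 can be followed in code. This sketch starts from the summary values stated in the box (µ, σ, n and the reported sample mean) rather than from the individual counts, and the variable names are assumptions. The two-tailed probability is obtained from the standard normal distribution in Python's `statistics` module.

```python
import math
from statistics import NormalDist

mu, sigma = 7500, 1250     # population statistics given in Box 9.1
n = 10                     # number of astronauts
sample_mean = 7310.5       # sample mean as reported in Box 9.1

sem = sigma / math.sqrt(n)                       # standard error of the mean
lower, upper = mu - 1.96 * sem, mu + 1.96 * sem  # 95% confidence limits
z = (sample_mean - mu) / sem                     # Z statistic
p = 2 * (1 - NormalDist().cdf(abs(z)))           # two-tailed probability

print(f"SEM = {sem:.1f}")
print(f"95% confidence interval: {lower:.2f} to {upper:.2f}")
print(f"Z = {z:.4f}, two-tailed probability = {p:.3f}")
print("inside the interval" if lower <= sample_mean <= upper
      else "outside the interval")
```

Checking whether the sample mean lies inside the confidence interval and checking whether Z lies within ± 1.96 always lead to the same decision.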
9.3.1 Reporting the result of a Z test
For a Z test, the Z statistic, sample size and probability are usually reported. Often more detail is given, including the population and sample mean. For example, when reporting the comparison of the sample in Box 9.1 to the population mean of 7500 you could write ‘The mean white blood cell count of 7310.5 cells/ml for the sample of ten astronauts did not differ significantly from the mean of the healthy population of 7500 (Z = −0.4794, NS).’ If you had not already specified the value of α earlier in the report (e.g. when giving details of materials and methods or the statistical tests used), you might write ‘The mean white blood cell count of 7310.5 cells/ml for the sample of ten astronauts did not differ significantly from the mean of the healthy population of 7500 (α = 0.05, Z = −0.4794, NS).’
9.4 Comparing a sample mean to an expected value when population statistics are not known
The single sample t test (which is often called the one sample t test) compares a single sample mean to an expected value of the population mean. When population statistics are not known, the sample standard deviation s is your best and only estimate of σ for the population from which it has been taken. You can still use the 95% confidence interval of the mean, estimated from the sample standard deviation, to predict the range around an expected value of µ within which 95% of the means of samples of size n taken from that population will occur. Here, too, once the sample mean lies outside the 95% confidence interval, the probability of it being from a population with a mean of $\mu_{\text{expected}}$ is less than 0.05 (Figure 9.3). Expressed as a formula, as soon as the ratio $t = (\bar{X} - \mu_{\text{expected}})/\mathrm{SEM}$ is less than the critical 5% value of –t or greater than +t, then the sample mean is considered to have come from a population with a mean significantly different to $\mu_{\text{expected}}$ (Figures 9.4(a) and (b)).
9.4.1 Degrees of freedom and looking up the appropriate critical value of t
The appropriate critical value of t for a particular sample size is easily found in a set of statistical tables. A selection of values is given in Table 9.1 and a
Figure 9.3 The 95% confidence interval, estimated from one sample of size n by using the t statistic, is indicated by the black horizontal bar showing μ ± t × SEM. Therefore, 5% of the means of samples of size n from the population would be expected to lie outside this range. If $\bar{X}$ lies inside the confidence interval, it is considered to have come from a population with a mean the same as $\mu_{\text{expected}}$, but if it lies outside the confidence interval, it is considered to have come from a population with a significantly different mean, assuming an α of 0.05.
Figure 9.4 For a single sample t test, 95% of the sample means will be expected to occur within the range of ($\mu_{\text{expected}}$ ± t × SEM) (the horizontal black bar). Therefore, the difference between the sample mean and the expected population mean (dark grey bar) divided by the standard error of the mean (pale grey short bar) will be significant once it is (a) greater than +t or (b) less than –t.
Table 9.1 Critical values of the distribution of t. The column on the far left gives the number of degrees of freedom (ν). The remaining columns give the critical value of t. For example, the third column, headed α(2) = 0.05, gives the 5% critical values. Note that the 5% probability value of t for a sample of infinite size (the last row) is 1.96 and thus equal to the 5% probability value for the Z distribution. Finite critical values were calculated using the methods given by Zelen and Severo (1964). A more extensive table is given in the Appendix (Table A2).

| Degrees of freedom (ν) | α(2) = 0.10 or α(1) = 0.05 | α(2) = 0.05 or α(1) = 0.025 | α(2) = 0.02 or α(1) = 0.01 | α(2) = 0.01 or α(1) = 0.005 |
|---|---|---|---|---|
| 1 | 6.314 | 12.706 | 31.821 | 63.657 |
| 2 | 2.920 | 4.303 | 6.965 | 9.925 |
| 3 | 2.353 | 3.182 | 4.541 | 5.841 |
| 4 | 2.132 | 2.776 | 3.747 | 4.604 |
| 5 | 2.015 | 2.571 | 3.365 | 4.032 |
| 6 | 1.943 | 2.447 | 3.143 | 3.707 |
| 7 | 1.895 | 2.365 | 2.998 | 3.499 |
| 8 | 1.860 | 2.306 | 2.896 | 3.355 |
| 9 | 1.833 | 2.262 | 2.821 | 3.250 |
| 10 | 1.812 | 2.228 | 2.764 | 3.169 |
| 15 | 1.753 | 2.131 | 2.602 | 2.947 |
| 30 | 1.697 | 2.042 | 2.457 | 2.750 |
| 50 | 1.676 | 2.009 | 2.403 | 2.678 |
| 100 | 1.660 | 1.984 | 2.364 | 2.626 |
| 1000 | 1.646 | 1.962 | 2.330 | 2.581 |
| ∞ | 1.645 | 1.960 | 2.326 | 2.576 |
more extensive table in the Appendix (Table A2). Look at Table 9.1. First, you need to find the chosen probability level along the top line of the table. Here I am using an α of 0.05, so you need the column headed α(2) = 0.05. (There is an explanation for α(1) in Section 9.7.) Note that several probability levels are given in Table 9.1, including 0.10, 0.05 and 0.01, corresponding respectively to the 10%, 5% and 1% levels discussed in Chapter 6. The column on the far left gives the number of degrees of freedom, which needs explanation. If you have a sample of size n and the mean of that sample is a specified value, then all of the values within the sample except
one are free to be any number at all, but the final value is fixed because the sum of the values in the sample, divided by n, must equal the mean. For example, if you have a specified sample mean of 4.25 and n = 2, then the first value in the sample is free to be any value at all, but the second must be one that gives a mean of 4.25, so it is a fixed number. Thus, the number of degrees of freedom for a sample of n = 2 is 1. For n = 100 and a specified mean (e.g. 4.25), 99 of the observations can be any value at all, but the final measurement is also determined by the requirement for the mean to be 4.25. Therefore, the number of degrees of freedom is 99. The number of degrees of freedom determines the critical value of the t statistic. For a single sample t test, if your sample size is n, you need to use the t value that has n −1 degrees of freedom. For a sample size of 10, the degrees of freedom are 9 and the critical value of the t statistic for an α of 0.05 is 2.262 (see Table 9.1). Therefore, if your calculated value of t is less than −2.262 or more than +2.262, the expected probability of that outcome is < 0.05 and considered significant. From now on, the appropriate t value will have a subscript to show the degrees of freedom (e.g. t7 indicates 7 degrees of freedom).
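The idea that only n − 1 values are free to vary can be checked with a few lines of code. The following is a minimal sketch in Python and is not part of the original text: the function name `forced_last_value` and the freely chosen first value of 3.0 are illustrative assumptions, while the specified mean of 4.25 and n = 2 come from the example above.

```python
def forced_last_value(free_values, specified_mean):
    """Return the single value that is fixed once the other n - 1 values are chosen."""
    n = len(free_values) + 1
    # The sum of all n values must equal n times the specified mean.
    return n * specified_mean - sum(free_values)

# n = 2 with a specified mean of 4.25: the first value is free, the second is not.
first = 3.0  # hypothetical free choice
second = forced_last_value([first], 4.25)
print(second)  # 5.5
```

Whatever number you choose for `first`, only one value of `second` will give a mean of 4.25, which is why a sample of n = 2 has only 1 degree of freedom.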
9.4.2
The application of a single sample t test
Here is an example of the use of a single sample t test. Many agricultural crops such as wheat and barley have an optimal water content for harvesting by machine. If the crop is too dry, it may catch fire while being harvested. If it is too wet, it may clog and damage the harvester. The optimal desired mean water content at harvest of the rather dubious sounding crop ‘Panama Gold’ is 50 g/kg. Many growers sample their crop to establish whether the water content is significantly different to a desired value before making a decision on whether to harvest. A grower took a sample of nine 1 kilogram replicates chosen at random over a widely dispersed area of their crop of Panama Gold and measured the water content of each. The data are given in Box 9.2. Is the sample likely to have come from a population where μ = 50 g/kg? The calculations are in Box 9.2 and are straightforward. If you analyse these data with a statistical package, the results will usually include the value of the t statistic and the probability, making it unnecessary to use a table such as Table A2 in the Appendix.
Box 9.2 Comparison between a sample mean and an expected value when population statistics are not known
The water content of nine samples of Panama Gold taken at random from within a large field is 44, 42, 43, 49, 43, 47, 45, 46 and 43 g/kg. The null hypothesis is that this sample is from a population with a mean water content of 50 g/kg. The alternate hypothesis is that this sample is from a population with a mean water content that is not 50 g/kg.
The mean of this sample is: 44.67
The standard deviation s = 2.29
The standard error of the mean is $s/\sqrt{n} = 2.29/3 = 0.764$
Therefore
$$t_8 = \frac{\bar{X} - \mu_{\text{expected}}}{\text{SEM}} = \frac{44.67 - 50}{0.764} = -6.98$$
Although the mean of the sample is less than the desired mean value of 50, is the difference significant? The calculated value of t8 is −6.98. The critical value of t8 for an α of 0.05 is ±2.306. Therefore, the probability that the sample mean is from a population with a mean water content of 50 g/kg is < 0.05. The grower concluded that the mean moisture content of the crop was significantly different to that of a population with a mean of 50 g/kg and did not attempt to harvest on that day.
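If you would like to check the arithmetic in Box 9.2 yourself, the calculation can be sketched in a few lines of Python using only the standard library. This is an illustrative sketch rather than the output of any particular statistical package: the variable names are assumptions, and the critical value of 2.306 is simply copied from Table 9.1 instead of being computed.

```python
import math
import statistics

water = [44, 42, 43, 49, 43, 47, 45, 46, 43]  # g/kg, data from Box 9.2
mu_expected = 50.0                            # desired mean water content

n = len(water)
mean = statistics.mean(water)
s = statistics.stdev(water)        # sample standard deviation (divisor n - 1)
sem = s / math.sqrt(n)             # standard error of the mean
t = (mean - mu_expected) / sem
df = n - 1

critical_t8 = 2.306                # two-tailed 5% critical value for 8 df (Table 9.1)
significant = abs(t) > critical_t8

print(f"mean = {mean:.2f}, s = {s:.2f}, SEM = {sem:.3f}, t{df} = {t:.2f}")
# mean = 44.67, s = 2.29, SEM = 0.764, t8 = -6.98
print("significant at alpha = 0.05:", significant)  # True
```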
9.5
Comparing the means of two related samples
The paired sample t test is designed for cases where you have measured the same variable twice on each experimental unit under two different conditions. Some common applications of this type of comparison are measurements taken before and after drug treatments, and performance tests on athletes. Here is an example. A sports psychologist hypothesised that athletes may perform either more or less efficiently than usual when they are first introduced to unfamiliar surroundings. The psychologist called this the ‘familiarity effect’. For example, sprinters taken to an unfamiliar stadium may consistently run either faster or slower on the second day compared to the first. Even if the familiarity effect is small, it could be vital in achieving a world record or winning a race. There is, however, a lot of variation in running speed among
Table 9.2 The time taken, in seconds, for ten athletes to sprint the same distance on their first and second day of running in an unfamiliar environment. The column headed ‘Difference’ gives the race time for Day 2 minus Day 1 for each athlete, and the sample statistics are for this column of data.

                  Race time (seconds)
Athlete number    Day 1    Day 2    Difference
1                 13.5     13.6     +0.1
2                 14.6     14.6      0.0
3                 12.7     12.6     −0.1
4                 15.5     15.7     +0.2
5                 11.1     11.1      0.0
6                 16.4     16.6     +0.2
7                 13.2     13.2      0.0
8                 19.3     19.5     +0.2
9                 16.7     16.8     +0.1
10                18.4     18.7     +0.3

X̄ = 0.100, s = 0.1247, n = 10, SEM = 0.0394
individuals, so a comparison of two independent groups of different athletes might obscure any difference in speed between the first and second days. Instead, the psychologist decided to measure the running time of the same ten athletes over a fixed distance on both the first and second day after arriving at a new stadium. The results are in Table 9.2. Here the two groups are not independent because the same individuals are in each group. Nevertheless, you can generate a single independent value for each individual by taking their ‘Day 1’ reading away from their ‘Day 2’ reading. This will give a single column of differences for the ten experimental subjects, which will have its own mean and standard deviation (Table 9.2). The null hypothesis is that there is no difference between the running times of each athlete on both days. If the null hypothesis were true, you would expect the population of values for the difference for each athlete to have a mean of 0 and a standard error that can be estimated from the sample of differences
by $s/\sqrt{n}$. This is just another case of a single sample t test (Section 9.4), but here the expected population mean is 0. Consequently, all you need to do is calculate the ratio $(\bar{X} - 0)/\text{SEM}$ and see if this statistic lies within or outside the region where 95% of the means of this sample size would be expected to occur around an expected population mean of 0. This has been done in Box 9.3.
Box 9.3 A worked example of a paired sample t test using the data from Table 9.2
$\bar{X} = 0.100$, $s = 0.12472$, $n = 10$, SEM = 0.03944
Therefore, $t_9 = 0.100 / 0.03944 = 2.5355$
From Table 9.1 the critical value of t9 is 2.262. Therefore, the value of t lies outside the range within which you would expect 95% of t statistics generated by samples of n = 10 from a population where μ = 0, so it was concluded that the mean of the population of the differences in race time was significantly different (P < 0.05) to an expected mean of 0. Interestingly, the athletes took significantly longer to run the same distance on the second day. Although this is a poor experimental design in that many other factors (including fatigue, differences in air temperature or wind speed) may have confounded the results, it is consistent with the alternate hypothesis and is likely to lead to further investigation.
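The calculation in Box 9.3 can be reproduced from the raw race times in Table 9.2. The sketch below uses only the Python standard library; the variable names are assumptions made for illustration, and the critical value of 2.262 is taken from Table 9.1 rather than computed.

```python
import math
import statistics

# Race times in seconds for the same ten athletes (Table 9.2)
day1 = [13.5, 14.6, 12.7, 15.5, 11.1, 16.4, 13.2, 19.3, 16.7, 18.4]
day2 = [13.6, 14.6, 12.6, 15.7, 11.1, 16.6, 13.2, 19.5, 16.8, 18.7]

# One difference per athlete (Day 2 minus Day 1) turns two related samples into one.
differences = [second - first for first, second in zip(day1, day2)]

n = len(differences)
mean_diff = statistics.mean(differences)
sem = statistics.stdev(differences) / math.sqrt(n)
t = (mean_diff - 0) / sem          # expected mean difference under the null is 0
df = n - 1

critical_t9 = 2.262                # two-tailed 5% critical value for 9 df (Table 9.1)
print(f"t{df} = {t:.2f}, significant: {abs(t) > critical_t9}")
```

Because the test works on the column of differences, the large variation in speed among athletes never enters the calculation, which is the whole point of the paired design.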
9.6
Comparing the means of two independent samples
The t test for two independent samples (which is often called an independent samples t test) is used to compare two randomly chosen independent samples such as a control and an experimental group, each containing different experimental units. Here the question is ‘Have the two sample means been drawn from the same population?’ This test is an extremely straightforward extension of the single sample t test. For a single sample t test, a significant value of t will occur as soon as the
Figure 9.5 For two independent samples taken from the same population, 95% of the values of each sample mean would be expected to be within the range (μexpected ± t × SEM) (horizontal black bar). Therefore, the difference between the two sample means will only be significant once it exceeds this range (shown by the dark grey bar), so the difference has to be divided by twice the SEM (i.e. the sum of the two short pale grey bars) to give the appropriate value of the t statistic.
sample mean is greater than the upper 95% confidence limit, or less than the lower 95% confidence limit (Figure 9.4). But when you have taken two samples from the same population, 95% of the values of each would be expected to be within the 95% confidence interval surrounding the expected population mean (Figure 9.5). So the range of possible non-significant differences between the two sample means will be the entire width of this confidence interval and the greatest possible non-significant difference between the sample means will occur when the first mean is equal to the lower 95% confidence limit and the second mean is equal to the upper 95% confidence limit or vice versa. Therefore, to generate an appropriate value of t for the comparison of the means of two independent samples, the formula for a single sample t test has to be modified. First, you calculate the difference between the two sample means. Second, you divide this difference by the standard error of the mean of sample A plus the standard error of the mean of sample B (equation (9.1)). This will give a denominator that is twice the SEM, which will therefore give the appropriate value of t to compare to those in the table of critical values (Figure 9.5):
$$t = \frac{\bar{X}_A - \bar{X}_B}{\text{SEM}_A + \text{SEM}_B} \qquad (9.1)$$
The SEM from a sample is $\sqrt{s^2/n}$, so to get the best estimate of the combined standard errors you use the following formula (which takes into account any
differences in sample size and uses the two independent estimates of the population variance from samples A and B):
$$\text{SEM} = \sqrt{\frac{s_A^2}{n_A} + \frac{s_B^2}{n_B}} \qquad (9.2)$$
Therefore, to obtain the t statistic for the differences between the two means you divide $\bar{X}_A - \bar{X}_B$ by this estimate of the SEM:
$$t = \frac{\bar{X}_A - \bar{X}_B}{\sqrt{\frac{s_A^2}{n_A} + \frac{s_B^2}{n_B}}} \qquad (9.3)$$
Here the number of degrees of freedom is (n(A) − 1) + (n(B) − 1), which is usually put as (n(A) + n(B) −2). This is because you have calculated the standard error using two independent samples, both of which have n −1 degrees of freedom. You have lost a degree of freedom from each sample. There is a worked example of an independent samples t test in Box 9.4.
Box 9.4 A worked example of a t test for two independent samples
A freshwater ecologist sampled the shell length of 15 freshwater clams in each of two lakes to see if these samples were likely to have come from populations with the same mean. The data are shown below:
Lake A: 25, 40, 34, 37, 38, 35, 29, 32, 35, 44, 27, 33, 37, 38, 36
Lake B: 45, 37, 36, 38, 49, 47, 32, 41, 38, 45, 33, 39, 46, 47, 40
$n_A = 15$, $n_B = 15$, $\bar{X}_A = 34.67$, $\bar{X}_B = 40.87$, $s_A^2 = 24.67$, $s_B^2 = 28.69$
therefore
$$t_{28} = \frac{34.67 - 40.87}{\sqrt{\frac{24.67}{15} + \frac{28.69}{15}}} = \frac{-6.2}{\sqrt{1.645 + 1.913}} = -3.287$$
Note that the value of t is negative because the mean for Lake B is greater than Lake A. From Table A2 in the Appendix the critical value of t28 for an α of 0.05 is 2.048, so the two sample means have less than 5% probability of being from the same population.
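Equation (9.3) and the worked example in Box 9.4 can be sketched as a short Python function. This is only an illustration using the standard library: the function name `independent_t` is an assumption, and the critical value of 2.048 is the one quoted in Box 9.4 rather than being calculated.

```python
import math
import statistics

# Shell lengths of 15 clams from each lake (Box 9.4)
lake_a = [25, 40, 34, 37, 38, 35, 29, 32, 35, 44, 27, 33, 37, 38, 36]
lake_b = [45, 37, 36, 38, 49, 47, 32, 41, 38, 45, 33, 39, 46, 47, 40]

def independent_t(sample_a, sample_b):
    """Return the t statistic from equation (9.3) and its degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Combined standard error from equation (9.2)
    sem = math.sqrt(statistics.variance(sample_a) / n_a
                    + statistics.variance(sample_b) / n_b)
    t = (statistics.mean(sample_a) - statistics.mean(sample_b)) / sem
    return t, n_a + n_b - 2

t, df = independent_t(lake_a, lake_b)
critical_t28 = 2.048               # two-tailed 5% critical value quoted in Box 9.4
print(f"t{df} = {t:.3f}, significant: {abs(t) > critical_t28}")
# t28 = -3.287, significant: True
```

Swapping the order of the two samples changes only the sign of t, not its size, so the conclusion of a two-tailed test is unaffected.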
You may never have to manually calculate a t statistic, because statistical packages have excellent programs for doing this. But the simple worked examples in this chapter explain how t tests work and will be very helpful as you continue through this book.
9.7
One-tailed and two-tailed tests
All of the alternate hypotheses dealt with so far in this chapter have not specified anything other than ‘The mean of the population from which the sample has been drawn is different to an expected value’ or ‘The two samples are from populations with different means.’ Therefore, these are two-tailed hypotheses because nothing is specified about the direction of the difference. The null hypothesis could be rejected by a difference in either a positive or negative direction. Sometimes, however, you may have an alternate hypothesis that specifies a direction. For example, ‘The mean of the population from which the sample has been taken is greater than an expected value’ or ‘The mean of the population from which sample A has been taken is less than the mean of the population from which sample B has been taken.’ These are called one-tailed hypotheses. If you have an alternate hypothesis that is directional, the null hypothesis will not just be one of no difference. For example, if the alternate hypothesis states that the mean of the population from which the sample has been taken will be less than an expected value, then the null should state ‘The mean of the population from which the sample has been taken will be no different to, or more than, the expected value.’ You need to be very cautious, however, because a directional hypothesis will affect the location of the region where the most extreme 5% of outcomes will occur. Here is an example using a single sample test where the true population mean is known. For any two-tailed hypothesis, the 5% rejection region is split equally into two areas of 2.5% on the negative and positive side of μ (Figure 9.6(a)).
If, however, the hypothesis specifies that your sample is from a population with a mean that is expected to be only greater (or only less) than the true value, then in each case the most extreme 5% of possible outcomes that you would be interested in are restricted to one side or one tail of the distribution (Figure 9.6(b)). So if you have a one-tailed
Figure 9.6 (a) The 5% of most extreme outcomes under a two-tailed hypothesis are equally distributed as two areas, each of 2.5%, on both the positive and negative sides of the true population mean. (b) For a one-tailed hypothesis, the rejection region of 5% occurs only on one side of the true population mean. Here it is on the right (upper) side because the alternate hypothesis specifies that the sample mean is taken from a population with a larger mean than μ.
hypothesis, you need to do two things to make sure you make an appropriate decision. First, you need to examine your results to see if the difference is in the direction expected under the alternate hypothesis. If it is not, then the value of the t statistic is irrelevant – the null hypothesis will stand and the alternate hypothesis will be rejected (Figure 9.7). Second, if the difference is in the expected direction, then you need to choose an appropriate critical value to ensure that 5% of outcomes are concentrated in one tail of the distribution. This is easy. For the Z or t statistics, the critical two-tailed probability of 5% is not appropriate for a one-tailed test, because it only specifies the region where 2.5% of the values will occur within each tail. So to get the critical 5% value for a one-tailed test, you need to use the 10% critical value for a two-tailed test. This is why the column headed α(2) = 0.10 in Table 9.1 also includes the heading α(1) = 0.05, and you would use the critical values in this column if you were doing a one-tailed test. It is necessary to specify your null and alternate hypotheses – and therefore decide whether a one or two-tailed test is appropriate – before you do an experiment, because the critical values are different. For example, for an α = 0.05, the two-tailed critical value for t10 is ±2.228 (Table 9.1), but if the
Figure 9.7 An example of the rejection region for a one-tailed test. If the alternate hypothesis states that the sample mean will be more than μ, then the null hypothesis is retained unless the sample mean lies in the region to the right where the most extreme 5% of values would be expected to occur. Even though the sample mean shown here is much less than μ, it is still consistent with the null hypothesis.
test were one-tailed, the critical value would be either +1.812 or −1.812, depending on the direction specified by the alternate hypothesis. So a t value of 2.0 in the correct direction would be significant for a one-tailed test, but not for a two-tailed test. Many statistical packages only give the calculated value of t (not the critical value) and its probability for a two-tailed test. In this case, all you have to do is halve the two-tailed probability to get the appropriate one-tailed probability (e.g. a two-tailed probability of P = 0.08 is equivalent to a one-tailed probability of P = 0.04, provided the difference is in the expected direction).
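The two steps for a one-tailed test (check the direction first, then use the one-tailed critical value or probability) can be sketched in Python as follows. The function names, the `expected_sign` argument and the dictionary of critical values are assumptions made for illustration; the critical values for 10 degrees of freedom are copied from Table 9.1.

```python
# Critical values of t for 10 degrees of freedom at alpha = 0.05 (Table 9.1)
CRITICAL_T10 = {"two-tailed": 2.228, "one-tailed": 1.812}

def is_significant(t, tails, expected_sign=+1):
    """Decide significance for t with 10 df; expected_sign is +1 or -1."""
    if tails == "two-tailed":
        return abs(t) > CRITICAL_T10["two-tailed"]
    # One-tailed: only a difference in the predicted direction can be significant.
    return t * expected_sign > CRITICAL_T10["one-tailed"]

def one_tailed_p(two_tailed_p, t, expected_sign=+1):
    """Convert the two-tailed P reported by a package into a one-tailed P."""
    half = two_tailed_p / 2
    # In the predicted direction the probability is halved; in the opposite
    # direction it is 1 - P/2, which can never be less than 0.5.
    return half if t * expected_sign > 0 else 1 - half

print(is_significant(2.0, "two-tailed"))    # False
print(is_significant(2.0, "one-tailed"))    # True
print(is_significant(-2.0, "one-tailed"))   # False: wrong direction
print(one_tailed_p(0.08, 2.0))              # 0.04
```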
9.7.1
How appropriate are one-tailed tests?
There has been considerable discussion about the appropriateness of one-tailed tests because the decision to propose a directional hypothesis implies that an outcome in the opposite direction is of absolutely no interest to either the researcher or science, but often this is not true. For example, if the psychologist interested in the familiarity effect (Section 9.5) had hypothesised that athletic performance would improve on the second day of running a race in a new stadium and did a one-tailed test, this would ignore a marked decrease in performance that would nevertheless be of considerable importance and interest. Similarly, if a biomedical scientist was testing a newly synthesised compound that was expected to decrease blood pressure and therefore only applied a one-tailed test they would ignore any increase in blood pressure, but this unanticipated outcome might help
understand what affects blood pressure in humans. Consequently, it has been suggested that one-tailed tests should only be applied in the rare circumstances where a one-tailed hypothesis is truly appropriate because there is absolutely no interest in the opposite outcome, such as evaluation of a new type of fine particle filter in relation to existing products, where you would only ever be looking for an improvement in performance (see Lombardi and Hurlbert, 2009). Finally, there have been cases of unscrupulous researchers who have obtained a result with a non-significant two-tailed probability (e.g. P = 0.065) and realised this would be significant if a one-tailed test were applied (P = 0.0325), so they have subsequently modified their initial hypothesis. This is neither appropriate nor ethical as discussed in Chapter 5.
9.8
Are your data appropriate for a t test?
The use of a t test makes three assumptions. The first is that the data are normally distributed. The second is that each sample has been taken at random from its respective population and the third is that, for an independent samples test, the variances are the same. It has, however, been shown that t tests are actually very ‘robust’ – they will still generate statistics that approximate the t distribution and give realistic probabilities even when the data show considerable departure from normality and when sample variances are dissimilar.
9.8.1
Assessing normality
First, if you already know that the population from which your sample has been taken is normally distributed (perhaps you have data for a variable that has been studied before), you can assume the distribution of sample means from this population will also be normally distributed. Second, the central limit theorem discussed in Chapter 8 states that the distribution of the means of samples of about 25 or more taken from any population will be approximately normal, provided its population is not grossly non-normal (e.g. a population that is bimodal). Therefore, provided your sample size is sufficiently large, you can usually do a parametric test. Finally, you can examine your sample. Although there are statistical tests for normality, many statisticians (see Quinn and Keough, 2002) have
cautioned that these tests often indicate the sample is significantly non-normal even when a t test will still give reliable results. Some authors (e.g. Zar, 1999; Quinn and Keough, 2002) suggest plotting the cumulative frequency distribution of the sample. The easiest way to do this is to use a statistics package to give you a probability plot (often called a P-P plot). This graphs the actual cumulative frequency against the expected cumulative frequency assuming the data are normally distributed. If they are, the P-P plot will be a straight line. Any gross departures from this should be analysed cautiously and perhaps a non-parametric test used. Most statistical packages will draw a P-P plot for a sample.
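To see what a package does when it draws a P-P plot, here is a minimal sketch in Python that calculates the points without plotting them. It makes two assumptions that are not from the text: the observed cumulative proportion of the value with rank i is taken as (i − 0.5)/n, which is only one of several conventions in use, and the expected proportion comes from a normal distribution fitted with the sample mean and standard deviation. The function name `pp_plot_points` is hypothetical, and the water content data from Box 9.2 are used purely as an example.

```python
import statistics
from statistics import NormalDist

def pp_plot_points(sample):
    """Return (expected, observed) cumulative proportions for a normal P-P plot."""
    ordered = sorted(sample)
    n = len(ordered)
    # Normal distribution with the same mean and standard deviation as the sample
    fitted = NormalDist(statistics.mean(ordered), statistics.stdev(ordered))
    points = []
    for rank, value in enumerate(ordered, start=1):
        observed = (rank - 0.5) / n      # assumed plotting position
        expected = fitted.cdf(value)     # proportion expected if the data are normal
        points.append((expected, observed))
    return points

water = [44, 42, 43, 49, 43, 47, 45, 46, 43]  # g/kg, data from Box 9.2
points = pp_plot_points(water)
for expected, observed in points:
    print(f"expected {expected:.3f}   observed {observed:.3f}")
```

If the sample is close to normal, the two columns will be similar and the points will fall near a straight line when plotted against each other.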
9.8.2
Have the sample(s) been taken at random?
This is really just a case of having an appropriate experimental design. For a single sample test, the sample needs to have been selected at random in order to appropriately represent the population from which it has been taken. For an independent samples test, both samples need to have been selected at random.
9.8.3
Are the sample variances equal?
One easy test of whether sample variances are equal is to divide the largest by the smallest. If the samples have equal variances, this ratio will be 1. As the variances become more and more unequal, the value of this statistic (which is called the F statistic after the statistician Sir Ronald A. Fisher) will increase. The F statistic and tests for equality of variances are discussed in Chapters 12 and 14. Even if the variances of two samples are significantly different, you can often still apply a t test.
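As an illustration, the variance ratio for the two samples of clams in Box 9.4 can be calculated as follows. This is a sketch only: the function name `variance_ratio` is an assumption, it presumes that neither sample has a variance of zero, and it gives the ratio without testing whether it is significant (see Chapters 12 and 14).

```python
import statistics

def variance_ratio(sample_a, sample_b):
    """Divide the larger sample variance by the smaller one."""
    var_a = statistics.variance(sample_a)
    var_b = statistics.variance(sample_b)
    # Assumes both variances are greater than zero.
    return max(var_a, var_b) / min(var_a, var_b)

# Shell lengths from Box 9.4
lake_a = [25, 40, 34, 37, 38, 35, 29, 32, 35, 44, 27, 33, 37, 38, 36]
lake_b = [45, 37, 36, 38, 49, 47, 32, 41, 38, 45, 33, 39, 46, 47, 40]

ratio = variance_ratio(lake_a, lake_b)
print(f"F = {ratio:.3f}")  # F = 1.163
```

A ratio this close to 1 suggests the two sample variances are very similar.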
9.9
Distinguishing between data that should be analysed by a paired sample test and a test for two independent samples
As a researcher, or reviewer of another person’s work, you may have to decide if an experimental outcome should be analysed as a paired sample test or a test for two independent samples. The way to do this is to ask ‘Are the experimental units in the two samples related or are they independent?’ Here are some examples.
Table 9.3 Data for the systolic blood pressure of four people in response to Drug A and Drug B.

Person    Drug A    Drug B
1         120       130
2         150       140
3         170       150
4         110       120
Table 9.4 Data for the systolic blood pressure of four people given Drug A and four people given Drug B.

Person    Drug A    Drug B
1         120
2         150
3         160
4         130
5                   135
6                   160
7                   120
8                   140
First, Table 9.3 gives data for two samples that are related – two measurements of systolic blood pressure have been made on each member of the same group of four people given two different drugs. Each experimental unit (person) in Table 9.3 experiences both drugs, so you would do a paired sample test. An independent example is the measurement of blood pressure of two different groups of individuals, with each group receiving a different drug. In Table 9.4, the experimental units 1 to 4 only receive Drug A, while 5 to 8 only receive Drug B. The samples are obviously independent. You would do an independent samples t test.
9.10
Reporting the results of t tests
For a single sample t test, the minimum reported information is the value of the t statistic, its degrees of freedom and probability, but it is also helpful if
you give the sample mean and expected value. For example, the comparison between the water content of a sample and a population with a desired water content of 50 g/kg in Box 9.2 could be reported as ‘The mean water content of the sample (44.67 g/kg) was significantly less than expected from a population with a mean of 50 g/kg (single sample t test: t8 = −6.98, P < 0.001).’ Note the degrees of freedom are given as a subscript to ‘t’. Similarly, for a paired sample t test the comparison between the time taken to run 100 m on two consecutive days (Box 9.3) could be reported as ‘The mean time for the ten athletes to run 100 m was significantly longer on day 2 compared to day 1 (paired sample t test: t9 = 2.54, P < 0.05).’ For an independent samples t test, the comparison between the samples of freshwater clams from two different lakes in Box 9.4 could be reported as ‘The mean shell length of freshwater clams differed significantly between Lake A and Lake B (independent samples t test: t28 = −3.287, P < 0.01).’ You could also give more detail: ‘The mean shell length of freshwater clams was significantly less in Lake A compared to Lake B (independent samples t test: t28 = −3.287, P < 0.01).’ Manuscripts are often returned to authors because t statistics have been given without the number of degrees of freedom, which some reviewers even use to check on the experimental design. For example, if a single sample t test has been done, then 29 degrees of freedom requires a sample size of 30. A reviewer is likely to ask for more detail if the experimental design is not consistent with this: if the methods specified that blood samples were taken from ten rats but 30 degrees of freedom were given for a single sample t test, it suggests that more than one measurement was taken from each individual, so the data may be pseudoreplicates (Chapter 4).
9.11
Conclusion
This chapter explains how the Z test and t tests for one and two samples actually work. The concepts will help you make decisions about which test to use for a particular set of data and also be very useful when you work through the material in later chapters. They will also help you understand the results given by statistical packages.
9.12
Questions
(1)
The data below are for the time taken (in minutes) for people to evacuate a multi-storey building in response to a fire alarm that was tested on two consecutive days. In both cases, there was no fire. A human movement scientist hypothesised that experiencing a false alarm on the first day would affect the time a person took to evacuate the building on the second.

Person number    Day 1    Day 2
1                1.5      3.7
2                1.9      4.6
3                2.7      2.7
4                3.7      5.7
5                1.1      1.3
6                1.4      2.6
7                1.2      3.1
8                4.3      4.5
9                2.7      6.8
10               1.4      8.5

(a) What sort of statistical analysis is appropriate for this hypothesis?
(b) Is the hypothesis one- or two-tailed?
(c) Is the result of the analysis significant?
(d) What does this suggest about the behaviour of people in response to fire alarm testing?
(2)
This is a useful exercise to help understand how statistical tests actually work. It can be done by hand using the instructions in Box 9.2, but you can do it very easily indeed if you have access to a statistical package. The body lengths of nine locusts are 6.2, 5.6, 5.3, 6.8, 6.9, 5.3, 6.3, 6.2 and 5.4 cm. The mean of this sample is exactly 6. Use a statistical package to run a single sample t test where the expected mean is set at 6.0. This will give no difference between the observed and expected mean. (a) What would you expect the value of the t statistic to be? Run the analysis to
check on this. (b) Now, modify the expected value of the mean. Make it 5.90 and run the analysis again. Then make it 5.80, 5.75, 5.50 etc. What happens to the value of t as the difference between the observed and expected values increases? What happens to the probability?
10
Type 1 error and Type 2 error, power and sample size
10.1
Introduction
Every time you make a decision to retain or reject an hypothesis on the basis of the probability of a particular result, there is a risk that this decision is wrong. There are two sorts of mistakes you can make and these are called Type 1 error and Type 2 error.
10.2
Type 1 error
A Type 1 error or false positive occurs when you decide the null hypothesis is false when in reality it is not. Imagine you have taken a sample from a population with known statistics of μ and σ and exposed the sample to a particular experimental treatment. Because the population statistics are known, you could test whether the sample mean was significantly different to the population mean by doing a Z test (Section 9.3). If the treatment had no effect, the null hypothesis would apply and your sample would simply be equivalent to one drawn at random from the population. Nevertheless, 5% of the means of samples of size n would lie outside the 95% confidence interval of μ ± 1.96 × SEM, so 5% of the time you would incorrectly reject the null hypothesis of no difference between your sample mean and the population mean (Figure 10.1) and accept the alternate hypothesis. This is a Type 1 error. It is important to realise that a Type 1 error can only occur when the null hypothesis applies. There is absolutely no risk of Type 1 error if the null hypothesis is false. Unfortunately, you are most unlikely to know if the null hypothesis applies or not – if you did, you would not be doing an experiment to test it! If the null hypothesis applies, the risk of Type 1 error is the same as the probability level
C:/ITOOLS/WMS/CUP-NEW/2647576/WORKINGFOLDER/MCKI/9781107005518C10.3D
131 [130–139] 27.8.2011 11:30AM
10.3 Type 2 error
131
Figure 10.1 Illustration of Type 1 error. The known population mean is 7500
and the 95% confidence interval for the mean is shown as the black horizontal bar. There is no effect of treatment, so the distribution of sample means from the experimental population will be the same as those from the untreated population. Nevertheless, 5% of your sample means will, by chance, lie within the tails of the distribution outside the 95% confidence interval. Whenever a sample mean occurs in either of these areas, you will incorrectly reject the null hypothesis and make a Type 1 error. This risk is unavoidable when the null hypothesis applies, but can be controlled by the chosen value of α. An α of 0.05 will have a 5% probability of Type 1 error, but an α of 0.01 will only have a 1% probability of Type 1 error.
you have chosen. Here, therefore, you may be thinking ‘Then why do we usually set α at 0.05? Surely, an α of 0.01 or even 0.001 would reduce the risk of Type 1 error?’ It will, but it will affect the likelihood of Type 2 error.
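The claim that the risk of Type 1 error equals the chosen α can be checked with a small simulation. The sketch below is only an illustration: it borrows the population statistics used later in this chapter (μ = 7500, σ = 1250), and the sample size of 30, the number of simulated experiments and the random seed are arbitrary assumptions.

```python
import random
from statistics import NormalDist, fmean

MU, SIGMA = 7500, 1250   # known population statistics (borrowed from Section 10.3.1)
N = 30                   # sample size: an arbitrary choice for this sketch
TRIALS = 20_000          # number of simulated experiments (assumption)


def type1_error_rate(alpha, n=N, trials=TRIALS, seed=1):
    """Fraction of no-effect experiments in which a two-tailed Z test rejects H0."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96 when alpha = 0.05
    sem = SIGMA / n ** 0.5
    rejections = 0
    for _ in range(trials):
        # The treatment has no effect, so each sample comes from the population itself.
        sample_mean = fmean(rng.gauss(MU, SIGMA) for _ in range(n))
        if abs(sample_mean - MU) > z_crit * sem:
            rejections += 1
    return rejections / trials


rate_05 = type1_error_rate(0.05)
rate_01 = type1_error_rate(0.01)
print(f"alpha = 0.05: H0 rejected in {rate_05:.1%} of experiments")
print(f"alpha = 0.01: H0 rejected in {rate_01:.1%} of experiments")
```

Every rejection here is a Type 1 error, because the null hypothesis applies in every simulated experiment. The two proportions should come out close to 5% and 1%.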
10.3
Type 2 error
A Type 2 error or false negative occurs when you retain the null hypothesis, even though it is false. For the example above, this would occur when the treatment had a real effect but your experiment and analysis did not detect it. Here is an example using a single sample two-tailed Z test where the population statistics are known.
10.3.1 An example The population mean of the number of white blood cells per millilitre in healthy adults is 7500, with a standard deviation of 1250. These statistics are from more than a million people, so are considered to be the population statistics μ and σ. They were first used in Box 9.1.
Figure 10.2 The concept of effect size displacing the population mean. The
population mean, μ, is 7500 white blood cells/ml, but the drug leucoxifen increases this value by 700 to 8200 cells/ml.
Here you need to consider the case where the new and untested experimental drug ‘leucoxifen’ causes an increase of 700 white blood cells per millilitre, so the mean of the population treated with leucoxifen is 700 cells/ml more than the mean of the untreated population. This change is often called the effect size of the treatment. But because leucoxifen is a new drug, it is not known if there is an effect size or not. A researcher was asked to investigate the effect of this drug, so they administered it to a sample of several healthy people and then compared the mean white cell count of this sample to the known population mean (Figure 10.2).
First, consider the case where you take a sample of n = 5 from each population. The expected standard error of the mean will be $\sigma/\sqrt{n} = 1250/\sqrt{5} = 559.02$. Therefore, the range within which you would expect 95% of sample means from the untreated population to occur would be μ ± 1.96 × SEM, which is 7500 ± 1.96 × 559.02. This range is from 6404.33 to 8595.67. With an effect size of 700, the range around μtreated within which you would expect 95% of sample means from the experimental population is 8200 ± 1095.67, which is from 7104.33 to 9295.67. These two ranges are shown in Figure 10.3(a). Importantly, they overlap considerably, with most of the means of samples from the treated population falling within the expected range of the means of samples from the untreated population. Therefore, if you were to treat five people with leucoxifen, there is a very high probability that your sample mean from the treatment group will fall within the 95% confidence interval of the untreated population and thus would not be considered significantly different to it. Even though there is a real effect of this drug, your sample size is too small to detect it very often, so you will frequently make a Type 2 error.
Now, consider the case where you have ten people in your experimental group. As sample size increases, the standard error of the mean, and therefore the 95% confidence interval of the mean, will reduce. For a sample size of ten, the standard error of the mean is $\sigma/\sqrt{n} = 1250/\sqrt{10} = 395.3$. Therefore, for samples where n = 10, the 95% confidence interval for the distribution of values of the mean in the untreated population is 7500 ± 774.76 (which is from 6725.24 to 8274.76) and the distribution around μtreated is 8200 ± 774.76 (which is from 7425.24 to 8974.76). These two ranges are shown in Figure 10.3(b). The confidence intervals have been reduced, but the majority of the sample means from the treated population still lie within the range expected from the untreated population, so the risk of Type 2 error is still very high.
Finally, for a sample size of 30 the standard error will be greatly reduced at $1250/\sqrt{30} = 228.22$. Therefore, the 95% confidence interval for the means of samples of size 30 will be μ ± 447.3, which is from 7052.7 to 7947.3 for the untreated population and from 7752.7 to 8647.3 for the treated population (Figure 10.3(c)). There is less overlap between the 95% confidence intervals of both groups, so you are less likely to make a Type 2 error.
Figure 10.3 The expected distributions of the means of samples taken from two populations with the same variance, one of which has a μ of 7500 and the other which has a μ of 8200. The range within which 95% of the sample means from the untreated population are expected to occur is shown as a horizontal black bar. (a) When n = 5, the sample means are expected to occur within a relatively wide range around each population mean and most of the means from the treated population (shaded distribution) are within the range of 95% of means from the untreated one. (b) When n = 10, the sample means are expected to occur within a narrower range. (c) When n = 30, the sample means are expected to occur within a much narrower range and far fewer means from the treated population (shaded distribution) are within the range of 95% of means from the untreated one.
Even when the sample size is 30, there is still a considerable risk of failing to reject the null hypothesis that μ = 7500, because about 25% of the possible values of the sample mean from the treated population are still within the region expected if the mean of 7500 is correct (Figure 10.4). The probability of Type 2 error is symbolised by β and is the probability of failing to reject the null hypothesis when it is false. Therefore, as shown in Figure 10.4, the value of β is the black area of the treated distribution lying to the left of the upper 95% confidence limit for μ.
Figure 10.4 The probability of a Type 2 error is the black area to the left of the vertical line marking the upper 95% confidence limit of μ for the untreated population. The risk of Type 2 error is considerable, but it will be even greater if the sample size is smaller.
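The arithmetic for the three sample sizes can be reproduced with a short script. This is a minimal sketch that assumes the sample means are normally distributed and that σ is known, as in the Z test used here; the function name `beta_for` is made up for the illustration.

```python
from statistics import NormalDist

MU_UNTREATED = 7500   # population mean (cells/ml), from the text
SIGMA = 1250          # population standard deviation, from the text
EFFECT = 700          # effect size of the drug, from the text
Z_95 = 1.96           # two-tailed critical value for alpha = 0.05


def beta_for(n):
    """Return (SEM, lower limit, upper limit, beta) for samples of size n."""
    sem = SIGMA / n ** 0.5
    lower = MU_UNTREATED - Z_95 * sem
    upper = MU_UNTREATED + Z_95 * sem
    # Means of samples from the treated population: Normal(mu + effect, SEM).
    treated = NormalDist(MU_UNTREATED + EFFECT, sem)
    # Type 2 error: a treated sample mean lands inside the untreated 95% range.
    beta = treated.cdf(upper) - treated.cdf(lower)
    return sem, lower, upper, beta


for n in (5, 10, 30):
    sem, lower, upper, beta = beta_for(n)
    print(f"n = {n:2d}: SEM = {sem:6.2f}, range {lower:.2f} to {upper:.2f}, "
          f"beta = {beta:.2f}")
```

Under these assumptions β comes out at roughly 0.76, 0.57 and 0.13 for n = 5, 10 and 30. The value for n = 30 is lower than the approximate 25% quoted above, so treat both as rough guides and rerun the calculation with your own numbers.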
10.4
The power of a test
The power of a test is the probability of making the correct decision and rejecting the null hypothesis when it is false. Therefore, power is the grey shaded area of the treated distribution to the right of the vertical line in Figure 10.4. If you know β, you can calculate power as 1 – β. An 80% power is considered desirable. That is, there is only a 20% chance of a Type 2 error and an 80% chance of not making a Type 2 error when the null hypothesis is false.
10.4.1 What determines the power of a test? The power of a test depends on several things, only some of which can be controlled by the researcher. The uncontrollable factors are effect size and the variance of the population. As effect size increases, power will increase and become closer and closer to 100% as the two distributions get further and further apart (Figure 10.5(a)). Samples from populations with a relatively small variance will have a smaller standard error of the mean, so overlap between the untreated and treated distributions will be less than for samples from populations with a larger variance (Figure 10.5(b)). The controllable factors are the sample size and your chosen value of α. As sample size increases, the standard error of the mean decreases, so power increases and the risk of Type 2 error decreases. This has already been described in Figure 10.3. As the chosen value of α decreases (e.g. from 0.10 to 0.05 to 0.01 to 0.001), the risk of Type 1 error decreases, but the risk of a Type 2 error increases. This is shown in Figures 10.6(a), (b) and (c). There is a trade-off between the risks of Type 1 and Type 2 error.
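The influence of the four factors on power can be explored numerically. The sketch below assumes a two-tailed, single-sample Z test with known σ, as in the leucoxifen example. The function name `power_z_test` and the alternative values (an effect of 1000, a σ of 900, a sample of 60) are hypothetical and chosen only for illustration.

```python
from statistics import NormalDist


def power_z_test(effect, sigma, n, alpha=0.05):
    """Power of a two-tailed, single-sample Z test with known sigma."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    shift = effect / (sigma / n ** 0.5)      # effect size in units of SEM
    # beta: probability that the treated sample mean falls inside the
    # range expected under the null hypothesis.
    beta = std_normal.cdf(z_crit - shift) - std_normal.cdf(-z_crit - shift)
    return 1 - beta


base = power_z_test(effect=700, sigma=1250, n=30)
print(f"baseline (effect 700, sigma 1250, n 30, alpha 0.05): {base:.2f}")
print(f"larger effect (1000):   {power_z_test(1000, 1250, 30):.2f}")
print(f"smaller sigma (900):    {power_z_test(700, 900, 30):.2f}")
print(f"larger sample (n = 60): {power_z_test(700, 1250, 60):.2f}")
print(f"stricter alpha (0.01):  {power_z_test(700, 1250, 30, alpha=0.01):.2f}")
```

The first three changes raise power above the baseline, while the stricter α lowers it, which is the trade-off between Type 1 and Type 2 error.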
10.5
What sample size do you need to ensure the risk of Type 2 error is not too high?
Without compromising the risk of Type 1 error, the only way a researcher can reduce the risk of Type 2 error to an acceptable level and therefore ensure sufficient power is to increase the sample size. Every researcher has to ask themselves ‘What sample size do I need to ensure the risk of Type 2 error is low and therefore power is high?’ This is an important question because samples are usually costly to take, so there is no point in increasing sample size past the point where power reaches an acceptable level. For example, if a sample size of 35 gave 99% power, there is no point in taking any more than this number of samples. Unfortunately, the only way to estimate the appropriate minimum sample size needed in an experiment is to know (or have good estimates of) the effect size and standard deviation of the population(s). Often the only way to obtain these statistics is to do a pilot experiment and estimate them from a sample. For most tests, there are formulae that use these statistics to give the appropriate sample size needed to have a desired power. Some statistical packages will calculate the power of a test as part of the analysis.
Figure 10.5 Uncontrollable factors affecting power. (a) Effect size will determine power and if the effect size is large enough, power will approach 100%. The arrows show effect size. (b) With a fixed effect size, a test comparing the distribution of sample means from a population with a relatively small variance (the pair of graphs on the left) will have greater power than if the population variance is large (the pair of graphs on the right).
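As an illustration of such a formula, the sketch below uses a common approximation for a two-tailed, single-sample Z test with known σ: n ≈ ((z for α/2 + z for the desired power) × σ / effect size)². It is not necessarily the formula a given statistical package uses, and it ignores the negligible chance of rejecting in the wrong tail. The function name and the second effect size of 350 are hypothetical.

```python
from math import ceil
from statistics import NormalDist

STD_NORMAL = NormalDist()


def approx_sample_size(effect, sigma, alpha=0.05, power=0.80):
    """Approximate minimum n for a two-tailed, single-sample Z test."""
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha / 2)   # about 1.96 for alpha = 0.05
    z_power = STD_NORMAL.inv_cdf(power)           # about 0.84 for 80% power
    return ceil(((z_alpha + z_power) * sigma / effect) ** 2)


n_needed = approx_sample_size(effect=700, sigma=1250)
print(f"effect 700, sigma 1250: about {n_needed} people for 80% power")
print(f"effect 350, sigma 1250: about {approx_sample_size(350, 1250)} people")
```

Halving the effect size roughly quadruples the sample size needed, which is why a good estimate of effect size from a pilot experiment matters so much.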
10.6
Type 1 error, Type 2 error and the concept of biological risk
The commonly used α of 0.05 sets the risk of Type 1 error at 5%, while 20% is considered an acceptable risk of Type 2 error. Nevertheless, these risks have to be considered in relation to the consequences of an incorrect decision about the null or alternate hypotheses. This was discussed in Section 6.5 and the same considerations apply to the risk of Type 2 error. For example, a test that has a 20% chance of incorrectly retaining the null hypothesis of no effect may be considered inappropriate if you are testing for the undesirable side effects of a new drug, or evaluating whether the release of sewage into a river is affecting the number of bacteria pathogenic to humans in a lake downstream. Every time you run a statistical test, you have to consider not only the risk of Type 1 and Type 2 error, but also the consequences of these risks.
Figure 10.6 The trade-off between Type 1 and Type 2 error. The probability of a Type 2 error is shown in black. (a) α set at 0.10. (b) Decreasing α to 0.05 will reduce the risk of Type 1 error, but will increase the risk of Type 2 error. (c) Decreasing α to 0.01 will further decrease the risk of Type 1 error, but greatly increase the risk of Type 2 error.
10.7
Conclusion
Whenever you make a decision based on the probability of a result, there is a risk of either a Type 1 or a Type 2 error. There is only a risk of Type 1 error when the null hypothesis applies, and the risk is the chosen probability level α. There is only a risk of Type 2 error when the null hypothesis is false. Here the risk of Type 2 error, β, is affected by several factors, but the most controllable is sample size. As sample size increases, the risk of Type 2 error decreases. Power is the complement of Type 2 error. Power is 1 – β and is the ability of the test to reject the null hypothesis when it is false. There are formulae for calculating the appropriate sample size to ensure that the risk of Type 2 error is acceptable (e.g. 20%) and thereby ensure acceptable power, but these calculations rely on an estimate of effect size and the standard deviation of the sample or population. Finally, the risks of Type 1 and Type 2 error need to be considered in terms of biological risk – depending on the consequences of making each type of error, you may decide an α of 5%, or a β of 20%, is unacceptable.
10.8
Questions
(1)
Comment on the following: ‘Depending on sample size, a nonsignificant result in a statistical test may not necessarily be correct.’
(2)
Explain the following: ‘I did an experiment with only 10% power (therefore β was 90%), but the null hypothesis was rejected so the low power does not matter and I can trust the result.’
C:/ITOOLS/WMS/CUP-NEW/2648663/WORKINGFOLDER/MCKI/9781107005518C11.3D
140 [140–156] 29.8.2011 12:16PM
11
Single-factor analysis of variance
11.1
Introduction
So far, this book has only covered tests for one and two samples. But often you are likely to have univariate data for three or more samples and need to test whether there are differences among them. For example, you might have data for the concentration of cholesterol in the blood of seven adult humans in each of five different dietary treatments and need to test whether there are differences among these treatments. The null hypothesis is that these five treatment samples have come from the same population. You could test it by doing a lot of independent samples t tests (Chapter 9) to compare all of the possible pairs of means (e.g. mean 1 compared to mean 2, mean 1 compared to mean 3, mean 2 compared to mean 3 etc.), but this causes a problem. Every time you do a two-sample test and the null hypothesis applies (so the samples are from the same population), you run a 5% risk of a Type 1 error. As you do more and more tests on two samples from the same population, the risk of a Type 1 error rises rapidly. Put simply, doing a statistical test on samples taken from the same population is like having a ticket in a lottery where the chances of winning are 5% – the more tickets you have, the more likely you are to win. Here, however, to ‘win’ is to make the wrong decision about your results and reject the null hypothesis. If you have five samples, there are ten possible pairwise comparisons among them and the risk of a Type 1 error when using an α of 0.05 is 40%, which is extremely high (Box 11.1). Obviously, there is a need for a test that compares three or more samples simultaneously, but only has a Type 1 error the same as your chosen value of α. This is where analysis of variance (ANOVA) can often be used.
A lot of scientists make decisions on the results of ANOVA without knowing how it works. But it is very important to understand how ANOVA does work so you can appreciate its uses and limitations. Analysis of variance was developed by the statistician Sir Ronald A. Fisher from 1918 onwards. It is a very elegant technique that can be applied to numerous and very complex experimental designs. This book introduces the simpler ANOVAs, because an understanding of these makes the more complex ones very easy to understand. The following is a pictorial explanation like the ones used to explain t tests in Chapter 9. It is remarkably simple and does represent what happens. By contrast, a look at the equations in many statistics texts makes ANOVA seem very confusing indeed.
Box 11.1 The probability of a Type 1 error increases when you make several pairwise comparisons
Every time you do a statistical test where the null hypothesis applies, the risk of a Type 1 error is your chosen value of α. If α is 0.05, then the probability of not making a Type 1 error is (1 – α) or 0.95. If you have three treatment means and therefore make three pairwise comparisons (1 versus 2, 2 versus 3 and 1 versus 3), the probability of no Type 1 errors is $0.95^3 = 0.86$. The probability of at least one Type 1 error is 0.14 or 14%. For four treatment means there are six possible comparisons, so the probability of no Type 1 errors is $0.95^6 = 0.74$. The probability of at least one Type 1 error is 0.26 or 26%. For five treatment means, there are ten possible comparisons, so the probability of no Type 1 errors is $0.95^{10} = 0.60$. The probability of at least one Type 1 error is 0.40 or 40%. These risks are unacceptably high. You need a test that compares more than two treatment means with a Type 1 error the same as α.
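The arithmetic in Box 11.1 can be sketched in a few lines. Like the box, this treats the pairwise comparisons as independent, which is a simplifying assumption; the function name `familywise_error` is made up for the illustration.

```python
from math import comb


def familywise_error(n_means, alpha=0.05):
    """Return (number of pairwise comparisons, P(at least one Type 1 error))."""
    n_comparisons = comb(n_means, 2)                 # e.g. 5 means -> 10 pairs
    return n_comparisons, 1 - (1 - alpha) ** n_comparisons


for n_means in (3, 4, 5):
    k, risk = familywise_error(n_means)
    print(f"{n_means} means: {k} comparisons, risk = {risk:.2f}")
```

The printed risks are 0.14, 0.26 and 0.40, matching the box. Trying larger numbers of treatments shows how quickly the risk climbs.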
11.2
The concept behind analysis of variance
The concept behind analysis of variance is straightforward. This test generates a statistic, which increases in size as the difference between two or more means increases.
Figure 11.1 No effect of any treatment. The values for the four individuals (black
squares) are dispersed around their treatment mean (short horizontal lines) because of unavoidable individual variation, called ‘error’. There is no effect of any treatment, so the treatment means are all quite close to the grand mean (the long horizontal line).
Imagine you are interested in assessing the effects of two experimental drugs on the growth of brain tumours in humans. Many of these tumours cannot be surgically removed because the brain would be badly damaged in the process. A growing tumour can cause fatal damage by compressing and replacing neural tissue, so there is great medical interest in drugs that affect tumour growth. You have been assigned 12 consenting experimental subjects, each of whom has a brain tumour of the same size and type. Four are allocated at random to an untreated control group, four are treated with the drug ‘Tumostat’ and four more with the drug ‘Inhibin 4’. After two months of treatment, the diameter of each tumour is remeasured. Your null hypothesis is that ‘There is no difference in mean tumour diameter among the three treatments.’ The alternate hypothesis is ‘There is a difference in mean tumour diameter among the treatments.’ The factor being tested is ‘drugs’ (with treatments of Tumostat, Inhibin 4 and a control) and the variable being measured, which is often called the response variable, is the diameter of each tumour. First, consider the case where neither of the drugs has any effect on tumour growth, so the three treatments are just three samples taken at random from the same population. Tumour size for each individual is shown in Figure 11.1, together with the mean for each treatment and the grand mean which is the mean of all 12 individuals (and also the mean of the three treatment means). Note that there is some variation among individuals within each treatment and also some variation among the three treatment means.
Now, ask yourself ‘Why is there variation among the four individuals within each treatment? Why isn’t the value for each individual the same as its treatment mean?’ The answer is ‘Because of individual variation that we can’t do anything about.’ Individuals rarely have exactly the same value for many variables, including tumour growth, and this inherent and uncontrollable variation is often called ‘error’. Next, ask yourself ‘Why is there variation among the three treatment means? Why isn’t each treatment mean the same as the grand mean?’ In this case we know that neither of the drugs has any effect on tumour growth, so here too the displacement of each treatment mean from the grand mean is also only ‘Because of individual variation that we can’t do anything about.’ If all the individuals in the population had identical tumour growth, the three treatment means would be exactly the same. You can quantify the amount of variation around a mean by calculating the variance (Chapter 8). But here you can calculate two separate variances: (a) The variance of the individuals around their respective treatment means will give a number that indicates only individual variation or ‘error’ (Figure 11.2(a)). (b) The variance of the three treatment means around the grand mean will also give a number, and in this particular case (where there is no effect of any treatment) the number will also indicate error because it is the only thing affecting the positions of the three treatment means (Figure 11.2(b)). Therefore, you would expect the variances from (a) and (b) to be similar. At this stage you might be thinking ‘Yes but why are these two variances so important?’ This is explained below. Second, consider the case where the two drugs do have an effect on tumour growth. In Figure 11.3 it is clear that Tumostat and Inhibin 4 reduce tumour growth because the means for these two treatments are much less than the mean for the control. 
The four individuals within each treatment are still scattered around their respective treatment means, but the three treatment means are no longer similar and close to the grand mean – instead they have been displaced from the grand mean because the drugs have had an effect. Now, again ask yourself why the values for the four individuals are scattered around each treatment mean. The answer is the same as before: ‘Because of uncontrolled individual variation that we can’t do anything about, which is called error’ (Figure 11.4(a)). Next, ask yourself why the three treatment means are scattered around the grand mean (Figure 11.4(b)). Here, the difference between each treatment mean and the grand mean is not just caused by error – it is also caused by the effects of the drugs. Therefore, the variance of the treatment means around the grand mean will be much larger than the previous example because it will be caused by the drug treatments plus error. This gives a very easy way to calculate a statistic that shows how much difference there is among treatment means, over and above that expected because of natural variation (i.e. error).
Figure 11.2 No effect of any treatment. (a) The displacement of each individual (black squares) from its respective treatment mean is shown as an arrow and is only due to individual variation, called ‘error’. (b) The displacement of each treatment mean (short horizontal lines) from the grand mean is shown as an arrow. When there is no effect of treatment, this displacement is also only due to individual variation and therefore ‘error’.
Figure 11.3 An effect of treatment. There are relatively large differences among the three treatment means (short horizontal lines), so they are further from the grand mean.
Figure 11.4 An effect of treatment. There are relatively large differences among some of the three treatment means (short horizontal lines), so they are further from the grand mean. This will make the among group variance relatively large. (a) The displacement of each individual (black squares) from its treatment mean (arrows) is only due to individual variation, called ‘error’. (b) The displacement of each treatment mean (short horizontal lines) from the grand mean (arrows) is due to the effect of treatment plus error.
The variance of the treatment means around the grand mean indicates treatment + error and is called the among group variance. The variance of the individuals around their respective treatment means indicates only error and is called the within group variance. So if you divide the among group variance by the within group variance, it will give a number that will get larger as the differences among treatment means increase:
$$\frac{\text{Among group variance (treatment + error)}}{\text{Within group variance (error)}} \qquad (11.1)$$
This is the essential concept of analysis of variance. When there is no effect of any treatment, both the among group variance and the within group variance will be very similar. Therefore, the among group variance divided by the within group variance will give a value of about 1. In contrast, if the treatment means are not all similar (i.e. one or more of the treatments has an effect), then the among group variance will be relatively large. Therefore, the among group variance, divided by the within group variance, will give a large number that increases as the treatment means become increasingly dissimilar. Fisher called this statistic the variance ratio, but it was later renamed the F statistic or F ratio in his honour. Fisher’s method for comparing means is called analysis of variance because it uses the ratio of two variances to give a statistic showing the relative differences among two or more means. For this example, Tumostat and Inhibin 4 appear to have an inhibitory effect on growth compared to the control (in which the tumours have grown larger), but is the effect significant, or is it just the sort of difference that might occur by chance among samples taken from the same population? Once an F statistic has been calculated, its significance can be assessed by looking up the critical value of F under the null hypothesis that the treatment means are all from the same population. Just like the example of the chi-square statistic discussed in Chapter 6 and the Z and t statistics in Chapters 8 and 9, even when two or more samples are taken from the same population (i.e. when there is no effect of any of the treatments) the value of the F statistic will, just by chance, be larger than the critical value in 5% of cases and therefore be considered statistically significant.
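The behaviour of the F ratio when no treatment has an effect can be checked by simulation. In the sketch below the population mean of 10 and standard deviation of 2 are hypothetical, as are the number of simulated experiments and the seed. The value 4.26 is the tabulated 5% critical value of F for 2 and 9 degrees of freedom, which corresponds to three treatments of four replicates.

```python
import random
from statistics import fmean


def f_ratio(groups):
    """Among group variance divided by within group variance."""
    all_values = [x for g in groups for x in g]
    grand_mean = fmean(all_values)
    ss_among = sum(len(g) * (fmean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - fmean(g)) ** 2 for g in groups for x in g)
    df_among = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_among / df_among) / (ss_within / df_within)


F_CRITICAL = 4.26          # tabulated 5% critical value of F for 2 and 9 df
TRIALS = 20_000            # number of simulated experiments (assumption)
rng = random.Random(42)
significant = 0
for _ in range(TRIALS):
    # Three "treatments" of four individuals, all drawn from one population,
    # so the null hypothesis applies in every simulated experiment.
    groups = [[rng.gauss(10, 2) for _ in range(4)] for _ in range(3)]
    if f_ratio(groups) > F_CRITICAL:
        significant += 1

false_positive_rate = significant / TRIALS
print(f"F exceeded the critical value in {false_positive_rate:.1%} of experiments")
```

The proportion should be close to 5%, which is the Type 1 error rate expected when α is 0.05.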
Table 11.1 The diameter in mm of 12 brain tumours after three months of either (a) no treatment, (b) treatment with Neurohib or (c) treatment with Mitostop.
Control    Neurohib    Mitostop
7          4           1
8          5           2
10         7           4
11         8           5
11.3
More detail and an arithmetic example
This section expands on the concept behind ANOVA by giving a pictorial explanation and arithmetic example for the calculation of separate estimates of the within group and among group variances. It also explains a third estimate of variance called the total variance. I am using a simple set of data for tumour diameter (in mm) for four replicates of each of two more drug treatments and a control (Table 11.1). Two possible sources of variation will contribute to the displacement of each tumour from the grand mean:
$$\text{tumour diameter} = \text{treatment} + \text{error} \qquad (11.2)$$
To do a single-factor ANOVA, all you have to do is calculate the among group (treatment + error) variance and divide this by the within group (error) variance to get the F statistic. The procedure is shown pictorially in Figures 11.5 to 11.7.
11.3.1 Preliminary steps First, you calculate the grand mean by taking the sum of all the values (which in this example is 72) and dividing this by n (which in this example is 12). The grand mean is shown as the heavy line in Figure 11.5 and its numerical value (which in this example is 6) is in the box on the right of the line. Second, you calculate each treatment mean by taking the sum of the values in each treatment and dividing by the appropriate sample size (here, in each case it is 4). These values are in the boxes to the right of the lines for each treatment mean. These are all you need to calculate the three different variances and Figures 11.6, 11.7 and 11.8 show the calculation of the variances for: (a) error, (b) treatment + error and (c) the total variance. The general formula for any sample variance is:
$$\frac{\sum (X_i - \bar{X})^2}{n - 1} \qquad (11.3)$$
and the variances have been calculated in two steps. First, the difference between each value and the appropriate mean is squared and all of these squared values are added together to give the numerator of the equation above, which is called the sum of squares. Second, the sum of squares is divided by the appropriate degrees of freedom (the denominator of the equation above) to give the variance, which is often called the mean square.
Figure 11.5 Pictorial representation of the diameter of human brain tumours in volunteers either left untreated (control) or treated with the experimental drugs Neurohib or Mitostop. Tumour diameter increases up the page. The heavy horizontal line shows the grand mean and the shorter lighter lines show the three treatment means. The diameter of each replicate is shown as ▪. Boxes show the values of the treatment means and the grand mean.
Figure 11.6 Calculation of the within group (error) sum of squares and variance. This has been done in two steps. First, the displacement of each point from its treatment mean has been squared and these values added together to get the sum of squares. Second, this total has been divided by the number of degrees of freedom to give the mean square, which is the within group (error) variance.
Figure 11.7 Calculation of the among group (treatment + error) sum of squares and variance. This has been done in two steps. First, the displacement of each treatment mean from the grand mean has been squared. This value has to be multiplied by the sample size within each treatment to get the total effect for the replicates within that treatment, because the displacement is the average for the treatment. These three values are then added together to give the sum of squares. Second, this value has been divided by the number of degrees of freedom to give the mean square, which is the among group (treatment + error) variance. Note that one of the treatment means happens to be the same as the grand mean, but this will not always occur.
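As a small worked instance of the two steps, take the four control tumours in Table 11.1 (7, 8, 10 and 11 mm, with a mean of 9 mm) on their own. This single-sample calculation is only an illustration of the general variance formula and is not one of the three ANOVA variances, although it uses the same displacements that the control group contributes to the error sum of squares.

```latex
\[
\text{sum of squares} = (7-9)^2 + (8-9)^2 + (10-9)^2 + (11-9)^2 = 4 + 1 + 1 + 4 = 10
\]
\[
\text{mean square} = \frac{\text{sum of squares}}{n - 1} = \frac{10}{4 - 1} \approx 3.33
\]
```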
11.3.2 Calculation of the within group variance (error) This has been done in two steps in Figure 11.6. First, you calculate the sum of squares for error. The distance between each replicate and its treatment mean is the error associated with that replicate. Each of these values is squared and added together to get the sum of squares. Second, you calculate the mean square (which is the within group or error variance) by dividing the within group sum of squares by the appropriate number of degrees of freedom. Here you need to take one away from the number of replicates within each treatment and then sum these numbers. Because each treatment contains four replicates, the number of degrees of freedom is 3 + 3 + 3 = 9.
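The within group calculation can be sketched with the data from Table 11.1. The variable names are made up for the illustration.

```python
from statistics import fmean

# Tumour diameters in mm from Table 11.1.
treatments = {
    "Control": [7, 8, 10, 11],
    "Neurohib": [4, 5, 7, 8],
    "Mitostop": [1, 2, 4, 5],
}

ss_within = 0.0
df_within = 0
for name, values in treatments.items():
    mean = fmean(values)
    # Error for each replicate = its displacement from its own treatment mean.
    squared_errors = [(x - mean) ** 2 for x in values]
    ss_within += sum(squared_errors)
    df_within += len(values) - 1      # one less than the replicates per treatment
    print(f"{name}: mean = {mean:g}, squared displacements = {squared_errors}")

ms_within = ss_within / df_within     # within group (error) variance
print(f"sum of squares = {ss_within:g}, df = {df_within}, "
      f"mean square = {ms_within:.2f}")
```

This gives a sum of squares of 30 with 9 degrees of freedom, so the within group (error) variance is 3.33.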
11.3.3 Calculation of the among group variance (treatment + error) This has been done in two steps in Figure 11.7. First, you calculate the sum of squares for treatment. Here, however, the distance between each of the three treatment means and the grand mean is the average effect of that treatment. Therefore, to get the total effect for all the replicates within each treatment, this value has to be squared and then multiplied by the number of replicates in that treatment and these values added together to give the sum of squares for treatment. Second, you calculate the mean square (which is the among group or treatment variance) by dividing the sum of squares by the appropriate degrees of freedom, which is n – 1 where n is the number of treatments. Because there are three treatments, there are only two degrees of freedom.
Figure 11.8 Calculation of the total sum of squares and total variance. This has been done in two steps. First, the displacement of each point from the grand mean has been squared, and these values added together to give the sum of squares. Second, this value has been divided by the number of degrees of freedom to give the mean square, which is the total variance.

11.3.4 Calculation of the total variance First, you calculate the total sum of squares by taking the displacement of each point from the grand mean, squaring it and adding these together for all replicates. This gives the total sum of squares. Dividing by the appropriate number of degrees of freedom (there are n – 1 degrees of freedom and for this example n = 12) gives the mean square (which is the total variance). This has been done in two steps in Figure 11.8.

Finally, to obtain the F ratio, which compares the effect of treatment to the effect of error, you simply divide the among group (treatment + error) variance (36) (Figure 11.7) by the within group (error) variance (3.33) (Figure 11.6), which gives an F ratio of 36/3.33 = 10.8. Table 11.2 gives the results of this analysis in a similar format to the one provided by most statistical packages.

Here you may be wondering why the total sum of squares and total variance have also been calculated (Figure 11.8) as they are not needed for the F ratio given above. They have been included to illustrate the additivity of the sums of squares and degrees of freedom. Note from Table 11.2 that the total sum of squares (102) is the sum of the among groups (i.e. treatment + error) (72) plus the within groups (i.e. error) (30) sums of squares. Note also that the total degrees of freedom (11) is the sum of the among groups
(i.e. treatment + error) degrees of freedom (2) plus the within groups (i.e. error) degrees of freedom (9). This additivity of sums of squares and degrees of freedom will be used when discussing more complex ANOVAs in Chapters 13 and 15.

Now, all you need is the critical value of the F ratio. This used to be a tedious procedure because there are two values of the degrees of freedom to consider – the one associated with the treatment mean square and the one associated with the error mean square – and you had to look up the critical value in a large set of tables. Now you can use a statistics program to run this analysis, generate the F ratio and obtain the probability. The F ratio is always written with the number of degrees of freedom for the numerator and denominator in order as a subscript, immediately after the symbol ‘F’. Therefore, the F ratio for the among group mean square divided by the within group mean square from Table 11.2 is written as F2,9, because there are two degrees of freedom for the among group variance and nine degrees of freedom for the within group variance. Incidentally, the F ratio is significant – the probability is 0.004 (Table 11.2).

Table 11.2 Results of the calculations in Figures 11.6 to 11.8. The results have been formatted as a typical summary table for a single-factor ANOVA provided by most software packages.

Source of variation                 Sum of squares   df   Mean square   F ratio   Probability
Among groups (treatment + error)    72               2    36            10.8      0.004
Within groups (error)               30               9    3.3
Total                               102              11
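The arithmetic in Sections 11.3.2 to 11.3.4 can be sketched in a few lines of Python. This is an illustration rather than the output of any particular statistics package. It assumes the replicate values are the tumour diameters listed in Table 12.2, which give the same sums of squares as Table 11.2. The probability line relies on a shortcut that holds only when the numerator has exactly two degrees of freedom; a statistics program would normally supply this value.

```python
# Single-factor ANOVA by hand: three treatments, four replicates each.
# Assumption: the replicate values are the tumour diameters (mm) of Table 12.2.
data = {
    "Control": [7, 8, 10, 11],
    "Neurohib": [4, 5, 7, 8],
    "Mitostop": [1, 2, 4, 5],
}


def mean(values):
    return sum(values) / len(values)


all_values = [x for group in data.values() for x in group]
grand_mean = mean(all_values)

# Within group (error): displacement of each replicate from its treatment mean.
ss_within = sum((x - mean(g)) ** 2 for g in data.values() for x in g)
df_within = sum(len(g) - 1 for g in data.values())

# Among group (treatment + error): squared displacement of each treatment mean
# from the grand mean, multiplied by the number of replicates in that treatment.
ss_among = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in data.values())
df_among = len(data) - 1

# Total: displacement of each replicate from the grand mean.
ss_total = sum((x - grand_mean) ** 2 for x in all_values)
df_total = len(all_values) - 1

ms_among = ss_among / df_among
ms_within = ss_within / df_within
f_ratio = ms_among / ms_within

# Assumption: closed-form upper-tail probability of F, valid only when the
# numerator has exactly 2 degrees of freedom (as it does here).
p_value = (1 + 2 * f_ratio / df_within) ** (-df_within / 2)

print(f"Among:  SS = {ss_among:.0f}, df = {df_among}, MS = {ms_among:.1f}")
print(f"Within: SS = {ss_within:.0f}, df = {df_within}, MS = {ms_within:.1f}")
print(f"Total:  SS = {ss_total:.0f}, df = {df_total}")
print(f"F = {f_ratio:.1f}, P = {p_value:.3f}")  # F = 10.8, P = 0.004
```

The sums of squares and the degrees of freedom for the among group and within group rows add up to the totals, which is the additivity described above.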
11.4
Unequal sample sizes (unbalanced designs)
The example described above has equal numbers in each sample. If they are not equal, the method for calculating the F statistic will still work but the means and variance for each sample will not be estimated with the same precision (Chapter 8). For example, the mean of a relatively small sample is likely to be less accurate than that of a larger one, so the conclusion from a
comparison of means may be misleading. You should, wherever possible, aim to have equal numbers in each sample, especially when sample sizes are relatively small.
11.5
An ANOVA does not tell you which particular treatments appear to be from different populations
Although a significant result of a single-factor ANOVA indicates that the treatment means are unlikely to have come from the same population, it has not shown where the differences actually lie. In the example given above, a significant effect might be caused by one or both experimental drugs actually enhancing tumour growth compared to the control! You will almost certainly want to know how each of the two drugs affects tumour growth, so you will need to compare the treatment means. A procedure for making these multiple comparisons is described in the next chapter.
11.6
Fixed or random effects
This is an important concept. There are two types of single-factor ANOVA, which are called Model I and Model II. An understanding of the difference between them is necessary, especially when doing two-factor and higher ANOVAs (Chapter 13). A Model I or fixed effects ANOVA applies when the treatments (e.g. the experimental drugs) have been specifically chosen. For example, you may only be interested in the effect of a particular set of four drugs and the null hypothesis reflects this: ‘There is no difference between the effects of drugs A, B, C and D on tumour growth.’ A Model II or random effects ANOVA applies to hypotheses that are more general. For example, instead of examining the effects of specific drugs your hypothesis might be ‘There is no difference among drugs, in general, on tumour growth.’ Therefore, the drugs chosen and used in the experiment are merely random representatives of the wider range of drugs available, even though your random selection might happen to be drugs A, B, C and D. For a single-factor ANOVA, the actual computations for both models are the same. But if you have done a Model II ANOVA, you would not go further and make multiple comparisons among treatments, because you
would not be interested in knowing which ones were different. This is discussed in more detail in the next chapter.
11.7
Reporting the results of a single-factor ANOVA
The essential statistic from a single-factor ANOVA is the F ratio for (treatment + error) / (error). Because the F ratio always has two separate degrees of freedom – the one for the numerator (treatment + error) and the one for the denominator (error) – both should be reported. For example, the results of the ANOVA in Table 11.2 could be reported as ‘A single-factor ANOVA showed a significant difference among the three treatments (an untreated control and the drugs Neurohib and Mitostop): F2,9 = 10.8, P < 0.01.’ You might give the actual value of the probability as P = 0.004. Common omissions and mistakes include reporting an F ratio without any degrees of freedom, or giving the degrees of freedom for the total variance instead of those for error. Often a reviewer of your work will check the degrees of freedom against the experimental design. The numerator degrees of freedom will be one less than the number of treatments (so the three treatments in this example will give two degrees of freedom). The denominator degrees of freedom is the sum of the number of replicates within each of the treatments, minus the number of treatments. For example, for three treatments, each containing four replicates, the error degrees of freedom will be (4 + 4 + 4) – 3 = 9, while for an unbalanced design with three treatments containing five, five and nine replicates, the error degrees of freedom are (5 + 5 + 9) – 3 = 16. If you have done a Model II ANOVA, you are not interested in which treatments are different (or not), so you would only need to report the appropriate F ratio and probability.
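The check a reviewer might make on the degrees of freedom can be written as a small helper. This is a minimal sketch; the function name `anova_df` is hypothetical and simply encodes the two rules given above.

```python
def anova_df(replicates_per_treatment):
    """Return (numerator df, denominator df) for a single-factor ANOVA.

    Numerator df: number of treatments minus one.
    Denominator (error) df: total number of replicates minus number of treatments.
    """
    treatments = len(replicates_per_treatment)
    total_replicates = sum(replicates_per_treatment)
    return treatments - 1, total_replicates - treatments


print(anova_df([4, 4, 4]))  # balanced example from the text: (2, 9)
print(anova_df([5, 5, 9]))  # unbalanced example from the text: (2, 16)
```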
11.8
Summary
A single-factor ANOVA gives a statistic, called the F ratio, which quantifies the amount of variation among two or more treatment means. The value of a variable for each individual will only be displaced from its treatment mean by uncontrollable individual variation, called error. In contrast, the displacement of each treatment mean from the grand mean will be caused
by two sources of variation: any effect of treatment, plus error. These displacements can be quantified by calculating the variance of the individuals around their treatment means (error only) and the variance of the treatment means around the grand mean (treatment + error). The F ratio is the among group (treatment + error) variance divided by the within group (error) variance. When there is no effect of treatment (i.e. the samples have all been taken from the same population), the F ratio will be about 1, but when there is an effect of one or more treatment(s), it will be much larger. The calculated value of F can be compared to the range within which 95% of F ratios will occur when samples are taken at random from the same population. If F is greater than this range, it indicates a significant difference among the treatment means for α = 0.05. A single-factor design can be Model I or Model II. A Model I analysis is done when the treatments being compared have been specifically chosen because you are only interested in them (e.g. three specifically chosen drugs with potential anti-cancer properties). In contrast, a Model II analysis is done when treatments have been chosen at random from the wider range of those available because you are only interested if there is variation among treatments in general (e.g. three drugs chosen at random from 100 thought to have anti-cancer properties). The calculation of F is the same in both cases, but a significant result for a Model I analysis would need to be followed by further testing to identify where the differences occurred among the treatment means (e.g. because you are specifically interested in the relative anti-cancer merits of three specifically chosen drugs). A significant result for a Model II analysis would not be followed by further testing because your hypothesis is more general (e.g. you are only interested if there is variation in anti-cancer properties among drugs).
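The behaviour of the F ratio when there is no effect of treatment can be explored with a small simulation. The sketch below is only illustrative: the population (normal, mean 10, standard deviation 2), the number of runs and the random seed are all arbitrary assumptions, and the 5% critical value uses a shortcut that applies only when the numerator has two degrees of freedom.

```python
import random


def f_ratio(groups):
    """Among group mean square divided by within group mean square."""
    values = [x for g in groups for x in g]
    grand_mean = sum(values) / len(values)
    means = [sum(g) / len(g) for g in groups]
    ss_among = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_among = len(groups) - 1
    df_within = len(values) - len(groups)
    return (ss_among / df_among) / (ss_within / df_within)


random.seed(1)  # fixed seed so the run is repeatable
runs = 20000

# Three samples of four replicates, all drawn from the SAME population.
ratios = [
    f_ratio([[random.gauss(10, 2) for _ in range(4)] for _ in range(3)])
    for _ in range(runs)
]

mean_f = sum(ratios) / runs

# Assumption: closed-form 5% critical value of F for 2 numerator df.
df_within = 9
critical_f = (df_within / 2) * (0.05 ** (-2 / df_within) - 1)
share_above = sum(r > critical_f for r in ratios) / runs

print(f"Average F ratio with no treatment effect: {mean_f:.2f}")
print(f"5% critical value of F for 2 and 9 df: {critical_f:.2f}")
print(f"Proportion of F ratios above the critical value: {share_above:.3f}")
```

With two and nine degrees of freedom the long-run average of F is 9/7 (about 1.3) rather than exactly 1, and roughly 5% of the simulated F ratios should exceed the critical value even though every sample came from the same population.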
11.9
Questions
(1) Why would you expect an F ratio of about 1, and departures from this only due to chance, if there was no significant difference among treatments analysed by a single-factor ANOVA?

(2) The following simple set of data is for three ‘treatments’ each of which contains four replicates: Treatment A: 1, 2, 3, 4; Treatment B: 2, 3, 4, 5; Treatment C: 3, 4, 5, 6. The means of the three samples are similar and there is some within group (error) variance around each treatment mean. Use a statistical package to run a single-factor ANOVA on these data. (a) Is there a significant difference among treatments? (b) What are the within group (error) sum of squares and mean square? Change the values for Treatment C to 21, 22, 23 and 24, and run the analysis again. (c) Is there a significant difference among groups? (d) Have the within group (error) sum of squares and mean square changed from the analysis in (a)? Can you explain why?

(3) Which of the following experimental designs may be suitable for analysis as a Model I ANOVA? (a) A fisheries biologist was interested in testing the general hypothesis that the mean size of adult trout varied among lakes in Canterbury, New Zealand. They selected three lakes (Lake Veronica, Lake Michael and Lake Monica) at random from a total of 21 lakes, and took a sample of ten adult trout from each of the three lakes. (b) A fisheries biologist was interested in testing the specific hypothesis that the mean size of adult trout varied among Lake Veronica, Lake Michael and Lake Monica, so they took a sample of ten adult trout from within each of the three lakes.

(4) An ecologist did a single-factor ANOVA and obtained a treatment F ratio of 0.99. They said ‘That F ratio isn’t significant. There isn’t even a need to look up a table of probability values.’ One of their colleagues was very worried by that comment and said ‘I think you had better look up the probability. You can’t be so sure.’ Who was right? Why?
12
Multiple comparisons after ANOVA
12.1
Introduction
When you use a single-factor ANOVA to compare the means in an experiment with three or more treatments, a significant result only indicates that one or more appear to be from different populations. It does not identify which particular treatments appear to be the same or different. For example, a significant difference among the means of three treatments, A, B and C can occur in several ways. Mean A may be greater (or less) than B and C, mean B may be greater (or less) than A and C, mean C may be greater (or less) than A and B and finally means A, B and C may all be different to each other. If the treatments have been chosen as random representatives of all those available (i.e. the factor is random so you have done a Model II ANOVA), you will not be interested in knowing which particular treatments are different and which are not, because your hypothesis is more general. A significant result will reject the null hypothesis and show a difference, but that is all you will want to know. In contrast, if the treatments have been specifically chosen (i.e. the factor is fixed so you have done a Model I ANOVA), you are likely to be very interested in knowing which treatments are different and which are not. Several multiple comparison tests have been designed to do this.
12.2
Multiple comparison tests after a Model I ANOVA
Multiple comparison tests are used to assign three or more means to groups that appear to be from the same population. These tests are usually done after a Model I ANOVA has shown a significant difference among treatments. They are called a posteriori or post hoc tests, both of which mean ‘after the event’, where the ‘event’ is a significant result of the ANOVA.
A lot of multiple comparison tests have been developed, but all of them work in essentially the same way. Here is an example using the Tukey test, which is analogous to the two sample t test described in Chapter 9. The t statistic is calculated by dividing the difference between two means by the standard error of that difference. The Tukey statistic, q, is calculated by dividing the difference between two means by the standard error of the mean. The smaller mean is always taken away from the larger, therefore giving a positive number:

$$q = \frac{\bar{X}_A - \bar{X}_B}{\mathrm{SEM}} \qquad (12.1)$$

This procedure is first used to compare the largest mean to the smallest. If the difference is significant, testing continues by comparing the largest with the second smallest and so on. If a non-significant difference is found, all treatments with means in the range between the non-significant pair are assigned to the same population. Next, the procedure is repeated, starting with the second largest mean and the smallest. Here too it continues until a non-significant difference is found, and the treatments with means within that range are assigned to the same population. Again, the procedure is repeated, starting with the third largest mean and the smallest, and so on. Eventually all treatments will be assigned to one or more significantly different groups (Figure 12.1). For the example in Figure 12.1, treatments A, B and C have been assigned to one population and D and E to another. The analysis has revealed two distinct groups.

To calculate the Tukey statistic, you need the SEM and the best way to obtain this is from the error mean square of the ANOVA because this is an estimate of the population variance, σ², calculated from the displacement of all the replicates in the experiment from their respective treatment means. Therefore, because the standard error of a mean is:

$$\mathrm{SEM} = \frac{\sigma}{\sqrt{n}} \quad \text{or} \quad \sqrt{\frac{\sigma^2}{n}} \qquad (12.2)$$

the standard error of the mean estimated from an ANOVA is:

$$\mathrm{SEM} = \sqrt{\frac{\mathrm{MS\ error}}{n}} \qquad (12.3)$$
Figure 12.1 General procedure for a Tukey a posteriori test. The treatment means (A to E) are displayed in order of magnitude from the smallest (E) to the largest (A). (a) First the largest mean is compared to the smallest (A–E). If the difference is significant, the largest is then compared to the second smallest (A–D) and so on, until a non-significant difference (here, as an example, A–C) is found or there are no more pairs of means left to compare. All treatments with means in the range between A–C (i.e. A, B and C) are assigned to the same population. (b) Testing continues using the same procedure, but starting with the second largest mean and comparing it to the smallest (B–E). (c) Next, the third largest mean (C) is compared to D and E. (d) Finally, the fourth largest (D) is compared to E. This difference is not significant so treatments D and E appear to be from the same population, which has a different mean to the one from which A, B and C have been taken.
where n is the sample size within each treatment. If the sample sizes differ among treatments, you use the formula:

$$\mathrm{SEM} = \sqrt{\frac{\mathrm{MS\ error}}{2}\left(\frac{1}{n_A} + \frac{1}{n_B}\right)} \qquad (12.4)$$

Then you calculate the Tukey statistic, q, for each pair of means by using equation (12.1) and the procedure in Figure 12.1. The value of q will be 0 when there is no difference between the two sample means and will increase as the difference increases. If q exceeds the critical value, the hypothesis that the treatments are from the same population is rejected. The critical value of q depends on your chosen value of α, the number of degrees of freedom for the MS error and the number of means being tested. Here I deliberately have not given a table of critical values, because most statistical packages will do multiple comparisons and even generate a display assigning the treatment means to groups that appear to be from the same population. Section 12.3 gives two examples and also illustrates that ambiguous results are possible.
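Equation (12.4) gives the same value as equation (12.3) when both samples are the same size, which the short sketch below illustrates. The function name `tukey_sem`, the MS error of 3.33 and the unequal sample sizes of 4 and 6 are illustrative assumptions, not values prescribed by the text.

```python
from math import sqrt


def tukey_sem(ms_error, n_a, n_b):
    """Standard error for the Tukey statistic, following equation (12.4)."""
    return sqrt((ms_error / 2) * (1 / n_a + 1 / n_b))


ms_error = 3.33  # illustrative error mean square

equal = tukey_sem(ms_error, 4, 4)    # equal sample sizes
simple = sqrt(ms_error / 4)          # equation (12.3) with n = 4
unequal = tukey_sem(ms_error, 4, 6)  # hypothetical sample sizes of 4 and 6

print(f"Equation (12.4), n = 4 and 4: {equal:.4f}")
print(f"Equation (12.3), n = 4:       {simple:.4f}")
print(f"Equation (12.4), n = 4 and 6: {unequal:.4f}")
```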
12.3
An a posteriori Tukey comparison following a significant result for a single-factor Model I ANOVA
12.3.1 Example 1 – The effects of dietary supplements on pig growth The data in Table 12.1 are for the amount of weight gained in kilograms by piglets of the same age and initial weight after a year of feeding with four different dietary supplements: ‘Bigpiggy’, ‘Sowgrow’, ‘Baconbuster II’ and ‘Fatsow III’. This is a Model I ANOVA because the researcher is only interested in the effects of these four supplements on pig growth. A single-factor ANOVA will show a significant difference among treatments (F3,16 = 74.01, P < 0.001), so some appear to be from different populations. If you do an a posteriori Tukey test, you will find that each of the treatment means appears to be from a different population.

Table 12.1 The weight gain, in kg, of piglets fed four different dietary supplements for one year.

        Bigpiggy   Sowgrow   Baconbuster II   Fatsow III
        16         19        25               12
        14         20        30               10
        16         22        26               12
        17         20        27               13
        18         24        28               9
Mean    16.2       21.0      27.2             11.2
12.3.2 Example 2 – Growth of brain tumours treated with experimental drugs The data in Table 12.2 are the diameter of brain tumours after three months of either (a) no treatment, (b) the drug Neurohib or (c) the drug Mitostop. A single-factor ANOVA is significant (F2,9 = 10.8, P = 0.004), so the three treatments do not appear to be from the same population. If, however, you run an a posteriori Tukey comparison of the means, it will assign the control and Neurohib treatments to the same population, but the Neurohib and Mitostop treatments to another (Figure 12.2).

This result is obviously ambiguous. The a posteriori analysis has separated the data into two groups, but the mean of the Neurohib treatment cannot be distinguished from the mean of either the control or Mitostop. At the same time, the mean of the control can be distinguished from the mean for Mitostop. The mean for the Neurohib treatment cannot be from both populations, so a Type 2 error has occurred somewhere. This problem is not uncommon when a posteriori testing is done after ANOVA. Sometimes an ANOVA may even give a significant result, especially one that is only ‘just significant’ (e.g. P = 0.04), but subsequent a posteriori testing cannot distinguish between any treatments. This is because a posteriori tests are generally less powerful than ANOVA (see Section 12.5). More practically, the analysis has shown that the effect of the drug Neurohib cannot be distinguished from the control, but Mitostop appears to significantly reduce tumour growth compared to the control.

Table 12.2 The diameter of 12 brain tumours in mm after three months of either (a) no treatment, (b) treatment with Neurohib or (c) Mitostop. All tumours were the same size at the start of the experiment.

        Control   Neurohib   Mitostop
        7         4          1
        8         5          2
        10        7          4
        11        8          5
Mean    9.0       6.0        3.0

Figure 12.2 Summary of the results of a posteriori Tukey testing comparing the means of the three samples in Table 12.2. Treatments connected by vertical lines are not significantly different.
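The ambiguous result for the tumour data can be reproduced by calculating q for each pair of means with equations (12.1) and (12.3). This is a hedged sketch, not the output of a statistics package. In particular, the critical value of 3.95 is an assumption taken from published tables of the studentized range (α = 0.05, three means, nine error degrees of freedom); it is not given in the text and should be checked against your own tables or software.

```python
from itertools import combinations
from math import sqrt

# Treatment means from Table 12.2 (mm), four replicates per treatment.
means = {"Control": 9.0, "Neurohib": 6.0, "Mitostop": 3.0}
n = 4
ms_error = 30 / 9  # error sum of squares / error df from Table 11.2

# Assumption: critical q for alpha = 0.05, 3 means and 9 error df,
# taken from standard studentized range tables (not from the text).
Q_CRITICAL = 3.95

sem = sqrt(ms_error / n)  # equation (12.3)

results = {}
for a, b in combinations(means, 2):
    q = abs(means[a] - means[b]) / sem  # equation (12.1), always positive
    results[(a, b)] = (q, q > Q_CRITICAL)
    verdict = "different" if q > Q_CRITICAL else "not distinguishable"
    print(f"{a} vs {b}: q = {q:.2f} ({verdict})")
```

Under these assumptions only the control and Mitostop means are separated, which matches the grouping described above: Neurohib cannot be distinguished from either of the other two treatments.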
12.4
Other a posteriori multiple comparison tests
There are many other multiple comparison tests. These include the LSD (which means ‘least significant difference’), Bonferroni, Scheffé and Student–Newman–Keuls. The most commonly used are the Tukey and Student–Newman–Keuls (Zar, 1999). Most statistical packages offer a wide choice of a posteriori tests and their relative merits are described in more advanced texts.
12.5
Planned comparisons
Chapter 11 began with a discussion about the danger of an increased probability of Type 1 error when making numerous pairwise comparisons among three or more means. Here, however, the a posteriori method for identifying which treatments appear to be from the same population also uses numerous pairwise comparisons. Therefore, you may well be thinking this will also give an increased risk of Type 1 error.
First, however, unplanned a posteriori comparisons are usually only done if the ANOVA has detected a significant difference among the treatment means. Second, a posteriori tests are specifically designed to take into account the number of treatments being compared and have a much lower risk of Type 1 error than the same number of t tests, but unfortunately this makes multiple comparison tests relatively low in power. As noted above, an ANOVA may detect a significant difference among treatments, but subsequent a posteriori testing may not distinguish between any of them. Instead of making a large number of indiscriminate unplanned a posteriori comparisons, a better approach may often be to make a small number of planned (a priori, which means ‘before the event’) ones. For example, your hypotheses might be that each of the two experimental drugs in Example 2 above will affect the growth of brain tumours compared to the control. An ANOVA will test for differences among treatment means with an α of 0.05 and also give a good estimate of the sample variance from the MS error because this has been calculated from all the individuals used in the experiment. Next, however, instead of making a large number of unplanned comparisons, you could carry out two planned t tests comparing the mean growth of tumours in each drug treatment and the control. If you make only one a priori comparison, the probability of Type 1 error is an acceptable 0.05. If you make several a priori comparisons that really have been planned for particular reasons before the experiment (e.g. to test the hypotheses ‘Mitostop will affect tumour growth compared to the untreated control’ and ‘Neurohib will affect tumour growth compared to the untreated control’), then each is a distinct and different hypothesis so the risk of a Type 1 error is still an acceptable 0.05. 
It is only when you make indiscriminate comparisons that the risk of Type 1 error increases and you should consider using one of the a posteriori tests described previously, which maintains an α of 0.05. To make a planned comparison after a single-factor ANOVA, the formula for an independent samples t test (Chapter 9) is slightly modified by using the ANOVA mean square error as the best estimate of s²:

$$t_{n_A + n_B - 2} = \frac{\bar{X}_A - \bar{X}_B}{\sqrt{\mathrm{MS\ error}\left(\frac{1}{n_A} + \frac{1}{n_B}\right)}} \qquad (12.5)$$

which reduces to equation (12.6) when there are equal numbers in both treatment groups.

$$t_{n_A + n_B - 2} = \frac{\bar{X}_A - \bar{X}_B}{\sqrt{\frac{2\,\mathrm{MS\ error}}{n}}} \qquad (12.6)$$

Here is an example, using the data from Example 2 in Section 12.3. The planned comparison is for the mean tumour size in the Mitostop treatment compared to the control. From the ANOVA, the mean square for error is 3.333. The mean of the Mitostop treatment is 3 mm and the control is 9 mm. Therefore:

$$t_6 = \frac{3.00 - 9.00}{\sqrt{\frac{2 \times 3.33}{4}}} = -4.65$$

From Table A2 in the Appendix, the critical two-tailed 5% value for t is ±2.447. The two means appear to be from different populations. The planned comparison to test the effect of Neurohib compared to the control is:

$$t_6 = \frac{6.00 - 9.00}{\sqrt{\frac{2 \times 3.33}{4}}} = -2.325$$

Here the calculated value of t6 is within the range between the critical two-tailed 5% values of ±2.447, so these two means do not appear to be from different populations. Only Mitostop appears to suppress tumour growth compared to the control.
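The two planned comparisons can be checked with a few lines of Python. This is a minimal sketch of equation (12.5); the function name `planned_t` is hypothetical, and the MS error is rounded to 3.33 as in the worked arithmetic above.

```python
from math import sqrt


def planned_t(mean_a, mean_b, ms_error, n_a, n_b):
    """t statistic for a planned comparison, following equation (12.5)."""
    return (mean_a - mean_b) / sqrt(ms_error * (1 / n_a + 1 / n_b))


ms_error = 3.33     # error mean square, rounded as in the worked example
T_CRITICAL = 2.447  # two-tailed 5% critical value quoted in the text for 6 df

t_mitostop = planned_t(3.0, 9.0, ms_error, 4, 4)  # Mitostop vs control
t_neurohib = planned_t(6.0, 9.0, ms_error, 4, 4)  # Neurohib vs control

for name, t in (("Mitostop", t_mitostop), ("Neurohib", t_neurohib)):
    outcome = "significant" if abs(t) > T_CRITICAL else "not significant"
    print(f"{name} vs control: t = {t:.3f} ({outcome})")
```

The sign of t only reflects which mean is subtracted from which; the comparison with the critical value uses its absolute size.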
12.6
Reporting the results of a posteriori comparisons
The way in which a posteriori testing is reported depends on whether the result is ambiguous and the complexity of the experimental design. The unambiguous result for the comparison of the four pig food treatments in Section 12.3.1 could simply be reported as ‘A single-factor ANOVA showed a significant difference among treatments (F3,16 = 74.01, P < 0.001) and an a posteriori Tukey test showed that all of the treatment means were significantly different at α = 0.05, with Baconbuster II giving the greatest increase in weight.’ Results that are more complex are often presented as a table of means, with identical superscripts indicating the ones that appear to be from the same population. For example, Table 12.3 gives a summary for an a posteriori analysis of the data in Table 12.2. Sometimes a graph showing the means with letters indicating those from each population is given instead (e.g. Figure 12.3).

Table 12.3 The mean diameter of brain tumours (mm) after treatment for three months with the drugs Neurohib, Mitostop or an untreated control. Superscripts show means that are not significantly different when compared with an a posteriori Tukey test.

Treatment       Mean tumour diameter (mm)
Control^a       9.0
Neurohib^a,b    6.0
Mitostop^b      3.0

Figure 12.3 The mean diameter of brain tumours (mm) after treatment for three months with the drugs Neurohib, Mitostop or an untreated control. Bars show ± one standard error of the mean. Superscripts show means that are not significantly different when compared using an a posteriori Tukey test. [Bar chart: y-axis ‘Tumour diameter (mm)’; x-axis ‘Treatment’ (Control, Neurohib, Mitostop); bars labelled a, ‘a, b’ and b respectively.]
12.7
Questions
(1) An agricultural scientist was given the following data for the number of kilograms of fruit produced by eight randomly chosen four-year-old trees of three specifically chosen experimental varieties of mango and asked to identify the variety that was producing the greatest weight of fruit. (a) Does a single-factor ANOVA show a significant difference among varieties? (b) Is a posteriori testing needed? If so, which variety is yielding the greatest weight of fruit and is it significantly different to the other two?

Variety 1   Variety 2   Variety 3
15.7        18.3        14.9
18.4        17.4        19.3
14.3        18.2        12.9
13.5        19.9        14.6
16.3        18.3        12.6
10.3        16.8        13.2
15.2        19.3        14.3
12.9        16.2        15.4
(2)
In relation to the data in Question 1, the agricultural scientist expected that variety 1 would yield a different weight of fruit than variety 3. Use a t test and the error MS from the ANOVA to make a planned comparison between these two varieties only. Is the result significant?
(3) An urban entomologist did an experiment to assess the effectiveness of five different brands of mosquito repellent and an untreated control. Thirty volunteers were assigned at random to six groups of five. Each individual had repellent applied to their left arm, which was then inserted into a chamber containing 200 starved female mosquitoes. The numbers of bites on each person’s arm after one hour of exposure are given below.

Control   Slap   Go-Way   Outdoors   Holiday   Clear Off
25        14     25       22         39        13
24        17     30       29         41        16
27        15     26       28         38        16
26        19     27       30         37        17
27        16     28       31         42        16

(a) Is the analysis Model I or Model II? (b) Run a single-factor ANOVA on the data. Is there a significant difference among treatments? (c) Run an a posteriori Tukey test. Which repellent would you recommend? Are there any you definitely would not recommend?
13
Two-factor analysis of variance
13.1
Introduction
A single-factor ANOVA (Chapter 11) is used to analyse univariate data from samples exposed to different levels or aspects of only one factor. For example, it could be used to compare the oxygen consumption of a species of intertidal crab (the response variable) at two or more temperatures (the factor), the growth of brain tumours (the response variable) exposed to a range of drugs (the factor), or the insecticide resistance of a moth (the response variable) from several different locations (the factor). Often, however, life scientists obtain univariate data for a response variable in relation to more than one factor. Examples of two-factor experiments are the oxygen consumption of an intertidal crab at several combinations of temperature and humidity, the growth of brain tumours exposed to a range of drugs and different levels of radiation therapy, or the insecticide resistance of an agricultural pest from different locations and different host plants. It would be very useful to have an analysis that gave separate F ratios (and therefore the probability that the treatment means had come from the same population) for each of the two factors. That is what two-factor ANOVA does.
13.1.1 Why do an experiment with more than one factor?
Experiments that simultaneously include the effects of two or more factors on a particular variable can be far more revealing than looking at each factor separately, because you may detect certain combinations of the two factors that have a synergistic effect. Also, by examining several factors at once there may be significant savings in time and resources compared to doing a series of separate experiments and separate analyses.
Here is an example of the advantage of a two-factor experiment. It also illustrates a synergistic effect – what statisticians call interaction – which occurs when the effect of one factor varies across the levels of the other. Cockroaches are a serious public health risk, especially in tropical and subtropical cities with poor sanitation. These insects often live in sewers and drains, but forage widely, and frequently infest areas where food is stored and prepared. Urban cockroaches have a broad diet, which often includes excrement and other wastes, so they can contaminate food and thereby cause disease in humans. An urban entomologist investigating ways of controlling cockroaches was interested in the effects of both temperature and humidity on the activity of the cockroach Periplaneta americana. The entomologist devised a method of measuring cockroach activity by placing these insects individually in open-topped cylindrical glass jars with a continuous recording camera above each, from which they obtained data for the amount of movement per cockroach per hour. The entomologist set up an experiment with cockroaches in all six combinations of three temperatures (20, 30 and 40°C) and two humidity levels (33 and 66%). There were 20 cockroaches in each treatment combination, so 120 were used altogether. After two hours of acclimation, activity was recorded for one hour. This type of design, where there is a treatment for every combination of the levels of each factor, is called a ‘fully orthogonal’ design or an ‘orthogonal’ design (Table 13.1). If one of the treatment combinations, which are often called cells, were not included (for example 33% humidity with 30°C), the design would not be orthogonal.
Table 13.1 Example of an orthogonal two-factor design. There are three levels of Factor A (temperature) and two levels of Factor B (humidity) with experimental units (cockroaches) in each of the six possible combinations of the 3 × 2 treatment levels.

                Temperature (°C)
Humidity (%)    20               30               40
33              20 cockroaches   20 cockroaches   20 cockroaches
66              20 cockroaches   20 cockroaches   20 cockroaches
The results of the experiment can be displayed as a graph of the means for each of the six combinations (which are often called cell means), with temperature on the X axis, the response variable of cockroach activity on the Y axis and lines joining the three means within each of the two levels of humidity. If you wanted, you could show humidity on the X axis and have lines joining each of the three temperatures, but the data for the response variable are easier to visualise when the greatest number of treatment levels are on the X axis. Figure 13.1(a) shows a set of cell means where there is no interaction – the change in humidity from 33% to 66% (or from 66% to 33%) has the same effect on activity at each temperature because in all cases an increase in humidity increases activity by about the same amount. Similarly, the effect of an increase in temperature from 20°C through to 40°C (or vice versa) is the same at each humidity. In contrast, Figure 13.1(b) shows interaction. A change in humidity from 33% to 66% does not have the same effect on activity at each of the three temperatures and a change in temperature from 20°C through to 40°C does not have the same effect on activity at each humidity. That is all interaction is. When there is a complete lack of interaction (Figure 13.1(a)) the lines joining the treatment means always run exactly parallel to each other (even though both lines move up, they move up in parallel). In contrast, when there is interaction (Figure 13.1(b)) the lines are not always parallel. As the amount of interaction increases, the lines become less and less parallel and will eventually reach a point where the interaction is considered significant. Interaction between two or more factors is often of great interest to life scientists. It may be very helpful to know that a response to one factor is not uniform across the range of a second factor, or that it is uniform. 
For example, if you found that cockroaches were only extremely active when both humidity and temperature are high (Figure 13.1(b)), you might save considerable resources and time by only implementing cockroach control measures when this particular combination of weather conditions was forecast.
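The idea of parallel and non-parallel lines can be checked numerically from a table of cell means. The following is a minimal sketch only: the cell means are invented for illustration, and the function names `humidity_effects` and `lines_are_parallel` are hypothetical rather than part of any statistical package. It calculates the change in mean activity from 33% to 66% humidity at each temperature and asks whether that change is the same everywhere.

```python
# Hypothetical cell means (activity units are made up for illustration).
# Keys are temperatures (°C); values are (mean at 33% humidity, mean at 66% humidity).
no_interaction = {20: (10.0, 15.0), 30: (14.0, 19.0), 40: (18.0, 23.0)}
interaction = {20: (10.0, 11.0), 30: (14.0, 16.0), 40: (18.0, 40.0)}


def humidity_effects(cell_means):
    """Change in mean activity from 33% to 66% humidity at each temperature."""
    return {temp: high - low for temp, (low, high) in cell_means.items()}


def lines_are_parallel(cell_means, tolerance=1e-9):
    """True when the humidity effect is the same at every temperature."""
    effects = list(humidity_effects(cell_means).values())
    return max(effects) - min(effects) <= tolerance


print(humidity_effects(no_interaction), lines_are_parallel(no_interaction))
print(humidity_effects(interaction), lines_are_parallel(interaction))
```

With real data the sample means will almost never be exactly parallel, so a check like this is only a description of the cell means. Deciding whether the departure from parallel is larger than expected from error alone is what the interaction term of the two-factor ANOVA is for.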
13.2
What does a two-factor ANOVA do?
Here you need to remember that a single-factor ANOVA partitions the total variation into two components – the variation among groups (treatment + error) and the variation within groups (error) – and examines whether there is a significant treatment effect by dividing the among groups mean square by the within groups mean square. This gives an F ratio and probability that all the treatment means have come from the same population. A two-factor ANOVA works in a similar way, but partitions the total variation within a set of data into four components: the variation due to (a) Factor A + error, (b) Factor B + error, (c) Interaction + error and (d) error. The way the analysis works is a straightforward extension of the concept developed to explain single-factor ANOVA, and can also be shown pictorially. I will use the simplest case of a two-factor design with two levels only of each factor, both of which are fixed.

Figure 13.1 Interaction in a two-factor experiment. (a) No interaction between the two factors temperature and humidity on the activity of cockroaches. A change in humidity from 33% to 66% has the same effect on cockroach activity at each of the three temperatures and a change in temperature from 20°C through to 40°C has the same effect at each humidity. (b) An interaction between temperature and humidity on the activity of cockroaches. A change in humidity from 33% to 66% does not have the same effect on activity at each of the three temperatures and a change in temperature from 20°C through to 40°C does not have the same effect on activity at each humidity.

Figure 13.2 Some of the possible outcomes of an orthogonal two-factor experiment. Only the means for each treatment combination are shown. (a) No effect of temperature or humidity and no interaction. All treatment means are the same and the lines joining the means within each humidity are also the same. (b) An effect of humidity but no effect of temperature and no interaction. The two treatment means for 66% humidity are consistently higher than the two for 33% humidity. (c) An effect of temperature but no effect of humidity and no interaction. The two treatment means for 20°C are consistently higher than the two for 30°C. (d) An effect of temperature and humidity but no interaction. All treatment means are different but the change in growth in relation to a change in humidity from 33% to 66% is the same at each temperature and vice versa. (e) An effect of temperature and humidity and some interaction. The change in growth between 66% and 33% humidity is not the same at each temperature and vice versa. Note that all lines joining the treatments within the same humidity are parallel except for example (e) where there is some interaction between temperature and humidity.

First, here are some examples of the outcomes you might get from a two-factor experiment. The urban entomologist mentioned above was also interested in the effects of temperature and humidity on the growth of cockroaches. They did an experiment with two levels of temperature and two of humidity in an orthogonal design with four newly hatched cockroaches in each of the four treatment combinations and therefore a total of 16 cockroaches in the experiment. Each cockroach was the same starting weight. They were offered food to excess and reweighed after four weeks. Several different outcomes are shown in Figures 13.2(a)–(e). The mean weight gained within each treatment combination of the cockroaches kept
at 66% humidity are indicated by ▪, while those kept at 33% humidity are indicated by ●.
13.3
A pictorial example
This assumes you are familiar with the explanation for a single-factor ANOVA in Chapter 11. I am using a two-factor experiment with two levels of each factor, therefore giving four treatment combinations, each of which contains four replicates. The design is summarised in Table 13.2. Both factors are fixed – the researcher is only interested in these specific temperatures and humidities. To start, consider the final weight of each cockroach. It will be displaced from the grand mean by four sources of variation – that associated with Factor A, plus Factor B, plus interaction, plus error. This is called the total variation in the experiment. Put formally, the position on the Y axis of each replicate in relation to the grand mean will be determined by the following formula:

\[ \text{Growth} = \text{Factor A} + \text{Factor B} + \text{interaction} + \text{error}. \tag{13.1} \]

Here you may wish to contrast this with the much simpler equation for the total variation within a single-factor experiment from Chapter 11:

\[ \text{Tumour diameter} = \text{treatment} + \text{error}. \tag{13.2, copied from 11.2} \]
Table 13.2 The orthogonal design used to explain how two-factor ANOVA works in Figures 13.3 to 13.6. There are four combinations of the two temperatures and two humidities, with four experimental units (cockroaches) in each.

                Temperature (°C)
Humidity (%)    20              30
33              4 cockroaches   4 cockroaches
66              4 cockroaches   4 cockroaches
Figure 13.3 The estimation of the within group (error) variance in a two-factor ANOVA of the growth of cockroaches exposed to four different combinations of temperature and humidity. Each cockroach is shown as a symbol: ▪ = cockroaches at 66% humidity, ● = cockroaches at 33% humidity. Horizontal lines indicate the grand mean and each cell mean. The displacement of each replicate from its cell mean (arrows) will be caused by error only.
Just as for a single-factor ANOVA, the variation within a two-factor experiment can be partitioned into several additive components. These are shown in Figures 13.3 to 13.6.

First, the final weight of each cockroach will be displaced from its respective cell mean by error only. This is estimated in just the same way as for a single-factor ANOVA and also called the within group variance or error (Figure 13.3). The distances between each replicate and its cell mean are squared and added together to give the within group (error) sum of squares, which is divided by the appropriate degrees of freedom (here there are 3 + 3 + 3 + 3 = 12) to give the within group (error) mean square.

Second, each replicate will be displaced from the grand mean by all sources of variation in the experiment – the effect of Factor A, plus Factor B, plus interaction, plus error. This is called the total variance in the experiment. In Figure 13.4, the distance displaced is shown for all replicates. These distances are squared and added together to give the total sum of squares for the experiment. Again, this is the same as the procedure for calculating the total variance for a single-factor ANOVA.
Figure 13.4 The total variance in the experiment on the growth of cockroaches. Each cockroach is shown as a symbol: ▪ = cockroaches at 66% humidity, ● = cockroaches at 33% humidity. The heavy horizontal line indicates the grand mean and the four shorter horizontal lines indicate each cell mean. The displacement of each replicate from the grand mean (arrows) will be caused by the total variation in the experiment.
So far, this is the same procedure used to calculate the within group (error) variance and total variance for a single-factor ANOVA.
(a) The within group variance (Figure 13.3), which is due to error only, can be calculated from the dispersion of the points around each of their respective cell means.
(b) The total variance (Figure 13.4) will estimate the total variation in the experiment (i.e. the within group (error) variance plus Factor A, Factor B, plus interaction) and can be calculated from the dispersion of all the points around the grand mean.
At this stage, you still need separate effects for Factor A (temperature + error), Factor B (humidity + error) and A×B (interaction + error).
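The two quantities calculated so far can also be written in symbols. The notation is an assumption introduced here for illustration and is not taken from the text: $X_{ijk}$ is replicate $k$ in the cell for level $i$ of Factor A and level $j$ of Factor B, $\bar{X}_{ij}$ is the mean of that cell, $\bar{X}$ is the grand mean, and there are $a$ levels of Factor A, $b$ levels of Factor B and $n$ replicates in every cell.

```latex
\begin{align*}
SS_{\text{error}} &= \sum_{i=1}^{a}\sum_{j=1}^{b}\sum_{k=1}^{n}\left(X_{ijk}-\bar{X}_{ij}\right)^{2},
  & df_{\text{error}} &= ab(n-1),\\
SS_{\text{total}} &= \sum_{i=1}^{a}\sum_{j=1}^{b}\sum_{k=1}^{n}\left(X_{ijk}-\bar{X}\right)^{2},
  & df_{\text{total}} &= abn-1.
\end{align*}
```

For the design in Table 13.2, $a = b = 2$ and $n = 4$, which gives the 12 error degrees of freedom mentioned above and 15 degrees of freedom in total.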
13.4
How does a two-factor ANOVA separate out the effects of each factor and interaction?
Two-factor ANOVA separates out the effects of each factor and interaction in a very elegant way. After having done the preliminary calculations in Figures 13.3 and 13.4, the data are considered in relation to each of the two factors in turn. This is done by first ignoring the different levels within Factor B and considering the data only in relation to Factor A (temperature), after which the same is done for Factor B (humidity). These procedures are shown in Figures 13.5 and 13.6 and allow you to calculate separate sums of squares for temperature + error, and humidity + error. They are called the main effects because they examine each factor in isolation from the other.

First, humidity is ignored and the data treated as though they are the results of a single-factor experiment on temperature only. Here, therefore, you will have eight replicates within each of the two levels of temperature and you can calculate a mean for each level. These new means, calculated from all eight replicates within each temperature treatment, will only be displaced from the grand mean by the average effect of temperature plus error. Therefore, the displacement of the new treatment means from the grand mean can be used to calculate the sum of squares and mean square for Factor A (temperature) only (Figure 13.5) just as in a single-factor ANOVA.

Figure 13.5 The effect of Factor A (temperature + error) only on the growth of cockroaches. Each cockroach is shown as a symbol: ▪ = cockroaches at 66% humidity, ● = cockroaches at 33% humidity. These data have been pooled for each temperature, ignoring humidity, thereby generating the two new treatment means shown by the short horizontal lines. The displacement of each treatment mean from the grand mean is an estimate of the average effect of temperature plus error. The sum of squares is obtained by squaring each displacement, multiplying it by the number of replicates in that treatment and adding the results. The mean square is the sum of squares divided by n – 1 degrees of freedom where n is the number of pooled treatments (here n = 2).

Second, temperature is ignored and the data are treated as though they are only the results of a single-factor experiment on humidity. Here, too, you will have eight replicates within each of the two levels of humidity and you can calculate a mean for each level. These new means, calculated from all eight replicates within each humidity treatment, will only be displaced from the grand mean by the average effect of humidity plus error (Figure 13.6). Therefore, the displacement of the treatment means from the grand mean can be used to calculate the sum of squares and mean square for Factor B (humidity) only, just as in a single-factor ANOVA.

Figure 13.6 The effect of Factor B (humidity + error) only on the growth of cockroaches. Each cockroach is shown as a symbol: ▪ = cockroaches at 66% humidity, ● = cockroaches at 33% humidity. These data have been pooled for each humidity, ignoring temperature, thereby generating the two new treatment means shown by the horizontal lines. The displacement of each treatment mean from the grand mean is an estimate of the average effect of humidity plus error, so the sum of squares for humidity is obtained by squaring each displacement, multiplying it by the number of replicates in that treatment and adding the results. The mean square is the sum of squares divided by n – 1 degrees of freedom where n is the number of pooled treatments (here n = 2).
At this stage, you have sums of squares for the following:
(a) The total variance (the combined effects of Factor A, Factor B, A×B, and error) (Figure 13.4)
(b) The effects of Factor A (temperature + error) (Figure 13.5)
(c) The effects of Factor B (humidity + error) (Figure 13.6)
(d) Error (Figure 13.3)
The only other sum of squares you still need is the one for interaction plus error. Since the sums of squares are additive and the total sum of squares is the combined effects of all the factors in the ANOVA (Section 11.3.4), you can calculate the sum of squares for interaction by subtraction. This is done by taking away the sums of squares for Factor A, Factor B and error from the total sum of squares ((a) above minus (b) and (c) and (d)). Now you have the following sums of squares:
* The total variance (the combined effects of Factor A, Factor B, A×B and error) (Figure 13.4)
* The effects of Factor A (temperature + error) (Figure 13.5)
* The effects of Factor B (humidity + error) (Figure 13.6)
* The effects of interaction (A × B + error) (by subtraction)
* Error (Figure 13.3)
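The whole partition, including the subtraction step for interaction, can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the procedure of any particular statistical package: the weight gains are invented to fit the 2 × 2 design of Table 13.2, and the names `cells`, `sum_sq`, `pooled_ss` and `partition` are hypothetical.

```python
from statistics import mean

# Hypothetical weight gains (arbitrary units) for the 2 x 2 design of Table 13.2.
# Keys are (temperature in °C, humidity in %); values are the four replicates.
cells = {
    (20, 33): [2, 3, 3, 4],
    (20, 66): [5, 6, 6, 7],
    (30, 33): [6, 7, 7, 8],
    (30, 66): [11, 12, 12, 13],
}


def sum_sq(values, centre):
    """Sum of squared displacements of values from centre."""
    return sum((v - centre) ** 2 for v in values)


def pooled_ss(cells, position, grand_mean):
    """Sum of squares for one factor, pooling over (ignoring) the other.

    position is 0 for the first factor in each key and 1 for the second.
    """
    pooled = {}
    for key, values in cells.items():
        pooled.setdefault(key[position], []).extend(values)
    # Squared displacement of each pooled mean, times the replicates behind it.
    return sum(len(v) * (mean(v) - grand_mean) ** 2 for v in pooled.values())


def partition(cells):
    """Return the sums of squares for a two-factor design as a dictionary."""
    everything = [v for values in cells.values() for v in values]
    grand_mean = mean(everything)
    ss = {
        "total": sum_sq(everything, grand_mean),                        # Figure 13.4
        "factor_a": pooled_ss(cells, 0, grand_mean),                    # Figure 13.5
        "factor_b": pooled_ss(cells, 1, grand_mean),                    # Figure 13.6
        "error": sum(sum_sq(v, mean(v)) for v in cells.values()),       # Figure 13.3
    }
    # Interaction is what remains once the other components are taken away.
    ss["interaction"] = ss["total"] - ss["factor_a"] - ss["factor_b"] - ss["error"]
    return ss


ss = partition(cells)
for source, value in ss.items():
    print(f"{source:12s} {value:7.2f}")
```

For these invented data the total sum of squares is 176, made up of 100 for Factor A, 64 for Factor B, 8 for error and therefore 4 for interaction. The subtraction step relies on every cell having the same number of replicates, as in Table 13.2.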
Once you have these, dividing by the appropriate degrees of freedom will give you mean square values, just as for a single-factor ANOVA. The effect of each factor can be estimated by dividing the factor mean square by the error mean square to get an F ratio. If the F ratio is significant, the factor is considered to have an effect. The F ratios for the effects of interaction, Factor A and Factor B are summarised in Table 13.3. Most statistical packages will give an analysis of variance summary table that has all of these sums of squares, degrees of freedom, mean square values, F ratios and probabilities.

Table 13.3 Variation estimated by each mean square term and the appropriate division to estimate the effect of each factor when Factor A and Factor B are both fixed.

Source of variation    Calculation of F ratio
Factor A               Mean square for Factor A / Mean square error
Factor B               Mean square for Factor B / Mean square error
Interaction (A × B)    Mean square for interaction / Mean square error
13.5
An example of a two-factor analysis of variance
These data are for the growth of cockroaches after two weeks of being fed ad libitum in an orthogonal experiment with three levels of temperature and three of humidity (Table 13.4). As an initial step, you might draw a graph of the cell means (Figure 13.7). Which factors might you expect to be significant? Would you expect a significant interaction? Why? Next, if you use a statistical package to run a two-factor ANOVA on these data, your results will be similar to Table 13.5, which gives the F ratio and probability for each of the two factors and their interaction. The interaction term is temperature × humidity. Note that the F ratios for temperature and humidity are significant at P < 0.001, but there is no significant interaction (P = 0.852). It seems the samples have come from different populations in relation to the levels of temperature and also humidity, but there is no interaction between these factors. This result is not surprising considering the plot of the treatment means in Figure 13.7.

Table 13.4 The length, in mm, for 27 cockroaches after two weeks of being fed ad libitum in an orthogonal experiment with nine different combinations of temperature and humidity.

                Temperature (°C)
Humidity (%)    20            30            40
33              1, 2, 3       5, 6, 7       9, 10, 11
66              9, 10, 11     13, 14, 15    17, 18, 19
99              17, 18, 19    21, 22, 23    25, 26, 27

Table 13.5 Results of a two-factor ANOVA on the data in Table 13.4 as an example of the format given by a statistical package.

Source of variation       Sum of squares   df   Mean square   F ratio   Significance
Temperature               312.66           2    156.33        156.33    < 0.001
Humidity                  1200.66          2    600.33        600.33    < 0.001
Temperature × Humidity    1.33             4    0.33          0.33      0.852
Error                     18.00            18   1.00

Figure 13.7 The mean length of cockroaches after two weeks of being fed ad libitum in an orthogonal experiment with three levels of temperature (20, 30 and 40°C) and humidity (clear fill: 33%, grey fill: 66%, black fill: 99% humidity). Vertical bars show ± one SEM. Data are in Table 13.4. [Axes: length (0 to 30) on the Y axis, temperature (°C) on the X axis.]
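The mean square and F ratio columns of Table 13.5 follow directly from its sums of squares and degrees of freedom. The sketch below retraces that arithmetic, assuming both factors are fixed so that every F ratio uses the error mean square as its denominator (Table 13.3). The function names are hypothetical, and the probabilities are not computed because they require the F distribution.

```python
# Sums of squares and degrees of freedom as printed in Table 13.5.
table = {
    "Temperature": (312.66, 2),
    "Humidity": (1200.66, 2),
    "Temperature x Humidity": (1.33, 4),
    "Error": (18.00, 18),
}


def mean_squares(table):
    """Divide each sum of squares by its degrees of freedom."""
    return {source: ss / df for source, (ss, df) in table.items()}


def f_ratios(ms, error_key="Error"):
    """Divide every non-error mean square by the error mean square."""
    return {
        source: value / ms[error_key]
        for source, value in ms.items()
        if source != error_key
    }


ms = mean_squares(table)
for source, f in f_ratios(ms).items():
    print(f"{source:24s} MS = {ms[source]:8.2f}   F = {f:8.2f}")
```

Because the error mean square in this example is 1.00, each F ratio has the same value as the mean square above it, which is why the two columns of Table 13.5 match.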
13.6
Some essential cautions and important complications
There are some essential cautions and important complications associated with two-factor and more complex ANOVAs of which you must be aware. First, a significant effect of a factor does not reveal where differences occur if you have examined more than two levels of that factor.
Second, a significant interaction can make the F ratios for Factor A or Factor B misleading. Third, if either or both of Factors A and B are random, you need to use a different procedure for calculating the F ratios for one or both of them. These cautions and complications are explained below.
13.6.1 A posteriori testing is needed when there is a significant effect of a fixed factor
First, just as for a single-factor ANOVA, a significant F ratio does not reveal where differences occur among three or more levels of that factor. For example, if you did a two-factor ANOVA with four levels of Factor A and six of Factor B, and found a significant effect of Factor A, it will not identify which levels of Factor A appear to come from populations with the same, or different, means. Here, just as for a single-factor analysis, you need to carry out a posteriori testing. This is straightforward if there is no significant interaction. If the interaction is not significant, a posteriori testing can be done for each factor that has a significant effect. This uses the MS error from the ANOVA and compares the mean values for the pooled data (e.g. the means in Figures 13.5 and 13.6) in just the same way as an a posteriori test after a single-factor ANOVA. For example, if you were to use a Tukey test, the formula is the same as the one given in Chapter 12:

\[ q = \frac{\bar{X}_A - \bar{X}_B}{\text{SEM}}. \tag{13.3, copied from 12.1} \]

To calculate the standard error of the mean from the ANOVA statistics, you use:

\[ \text{SEM} = \sqrt{\frac{\text{MS error}}{n}} \tag{13.4} \]

where n is the sample size of each pooled group. If the sample sizes are different, you need to use a slight modification of the formula (which reduces to the one above when $n_A$ is the same size as $n_B$):

\[ \text{SEM} = \sqrt{\frac{\text{MS error}}{2}\left(\frac{1}{n_A} + \frac{1}{n_B}\right)}. \tag{13.5} \]
Then you simply calculate the Tukey q statistic for each pair of means and look up the critical value of q using the degrees of freedom for the MS within groups (error). If the calculated q value is greater than the critical value, the hypothesis that the means are from the same population is rejected. The value of q will be 0 when the two sample means are the same and will increase as the difference between them increases. Once again, many statistical packages will do Tukey tests and assign the treatments to groups that are significantly different to each other. Finally, just as for a single-factor experiment, a priori planned comparisons can also be made between particular cell means, but only if these comparisons have been specified beforehand (see Section 12.5). If the interaction is significant, you need to be extremely cautious because a significant interaction can obscure a main effect. This is explained in Section 13.6.2.
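Equations (13.3) to (13.5) are simple enough to check with a short calculation. The sketch below is illustrative only: the means, error mean square and sample sizes are invented, the function names are hypothetical, and taking the absolute difference between the two means (so that q is never negative) is an assumption of this sketch. The critical value of q still has to be looked up in tables or obtained from a statistical package.

```python
import math


def sem_from_anova(ms_error, n_a, n_b=None):
    """Standard error of the mean from the ANOVA error mean square.

    With one sample size this is equation (13.4); with two different
    sample sizes it is equation (13.5).
    """
    if n_b is None:
        n_b = n_a
    return math.sqrt((ms_error / 2.0) * (1.0 / n_a + 1.0 / n_b))


def tukey_q(mean_a, mean_b, ms_error, n_a, n_b=None):
    """Tukey q statistic for one pair of means (equation 13.3)."""
    return abs(mean_a - mean_b) / sem_from_anova(ms_error, n_a, n_b)


# Hypothetical pooled means, error mean square and sample sizes.
print(round(tukey_q(14.0, 10.0, ms_error=1.0, n_a=9), 3))         # equal sample sizes
print(round(tukey_q(14.0, 10.0, ms_error=1.0, n_a=9, n_b=6), 3))  # unequal sample sizes
```

Writing the calculation with equation (13.5) throughout shows why it reduces to equation (13.4): when the two sample sizes are equal, the term in brackets becomes 2/n and the 2 cancels.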
13.6.2 An interaction can obscure a main effect
The two-factor analysis described in Section 13.5 gave mean squares for the main effects of Factor A (temperature), Factor B (humidity), interaction and error. The effect of each factor is estimated by dividing the factor mean square by the error mean square. This is appropriate, but there can be a complication. A significant interaction means that the effect of one factor (e.g. humidity) is not constant across the levels of the second factor (e.g. temperature). Therefore, if there is a significant interaction, the conclusion of a non-significant main effect (because of a non-significant F ratio for that factor) may not be correct.

Here is an extreme example which clearly illustrates the problem. Imagine an experiment designed to investigate the effects of two factors on cockroach activity, with three levels of Factor A (temperature) and two of Factor B (humidity). Figures 13.8(a), (b) and (c) show the results of this experiment. Although there is obviously an effect of both temperature and humidity on cockroach activity, the response to temperature at 66% humidity is the opposite of that at 33% humidity. When these results are analysed by a two-factor ANOVA, the total sum of squares will be large because the replicates will be well dispersed from the grand mean (Figure 13.8(a)). There will also be some error because the replicates are dispersed about their treatment means (Figure 13.8(a)). But when the ANOVA partitions the sums of squares among the separate factors of temperature and humidity, the results are extremely misleading.

Figure 13.8 An illustration of how interaction can obscure main effects in a two-factor ANOVA. (a) As temperature increases, activity decreases at 33% humidity but increases at 66% humidity. (b) When humidity is ignored the cell means for the three levels of temperature only are shown as short horizontal lines. Note they all lie on the grand mean, so the sum of squares for temperature will be 0. (c) When temperature is ignored the cell means for the two levels of humidity are shown as two short horizontal lines. Note they both lie on the grand mean, so the sum of squares for humidity will also be 0. [Axes: cockroach activity on the Y axis in all panels, with a horizontal line marking the grand mean; temperature (20, 30 and 40°C) on the X axis in (a) and (b); humidity (33 and 66%) on the X axis in (c).]
First, consider the pooled analysis for temperature. The new cell means for each of the three levels of temperature (ignoring humidity) will all lie on the grand mean. Consequently, there will be no overall effect of temperature and the sum of squares for temperature will be 0 (Figure 13.8(b)), even though there is obviously an effect of temperature within each level of humidity. Second, consider the pooled analysis for the two levels of humidity. The new cell means for each of the two levels of humidity (ignoring temperature) will also lie on the grand mean, so the sum of squares for humidity will also be 0 (Figure 13.8(c)), even though there is an effect of humidity within each temperature. Finally, the sum of squares for interaction will be realistic and very large. Therefore, when there is a significant interaction it is not appropriate to trust the F ratios for the effects of Factors A and B. This caution is particularly important because most statistical packages calculate F ratios for main effects regardless of whether the interaction is significant or not, but the solution is straightforward. A graph of the cell means such as Figure 13.8(a) is a useful first step as it will give you a visual indication of the positions of each cell mean. The next step is statistical. You need to look at the effects of each factor across all levels of the second factor using an a posteriori test. This procedure is a little fiddly, but quite easy to do. Here is a pictorial explanation for the cockroach data. First, you compare the two cell means within each of the three levels of temperature (Figure 13.9(a)). Second, you compare the three cell means within the first level of humidity (Figure 13.9(b)). Finally, you compare the
C:/ITOOLS/WMS/CUP-NEW/2648140/WORKINGFOLDER/MCKI/9781107005518C13.3D
186
186 [168–195] 27.8.2011 2:10PM
Two-factor analysis of variance
Figure 13.9 Illustration of the comparisons required for full a posteriori testing
of a two-factor ANOVA when there is a significant interaction. Short horizontal lines show cell means. (a) Double-headed arrows show comparisons between the means for the levels of Factor B (humidity) within each level of Factor A (temperature). (b) Double-headed arrows show comparisons between the means for the three levels of Factor A (temperature) within the first level of Factor B (33% humidity). (c) Double-headed arrows show comparisons between the means for the three levels of Factor A (temperature) within the second level of Factor B (66% humidity).
three cell means within the second level of humidity (Figure 13.9(c)). Here, too, for a Tukey test, you simply use the formula:

$$q = \frac{\bar{X}_A - \bar{X}_B}{\mathrm{SEM}} \qquad (13.6, \text{ copied from } 13.3)$$

and:

$$\mathrm{SEM} = \sqrt{\frac{\mathrm{MS}_{\mathrm{error}}}{n}} \qquad (13.7, \text{ copied from } 13.4)$$
where n is the sample size within each cell. Again, the modification to the formula shown in equation (13.5) applies if there are different numbers in each cell. This rather long but extremely important example emphasises that when there is a significant interaction you need to examine all possible combinations of treatments and that conclusions from F ratios for main effects may not hold. Some statistical programs will not do a posteriori tests for the combinations of cell means given above, so it may be necessary for you to make these using a spreadsheet or a calculator. This procedure and the statistical tables necessary to decide whether each difference is significant are covered in more advanced texts such as Zar (1999).
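The two formulae above can be sketched in a few lines of Python. This is only an illustration for the balanced case (the same n in every cell): the function name and the numbers passed to it are hypothetical, and the modification in equation (13.5) for unequal cell sizes is not included.

```python
import math

def tukey_q(mean_a, mean_b, ms_error, n):
    """Return (q, SEM) for one pair of cell means, following equations 13.6 and 13.7.

    Assumes a balanced design with n replicates in every cell.
    """
    sem = math.sqrt(ms_error / n)        # equation 13.7
    q = abs(mean_a - mean_b) / sem       # equation 13.6, as a positive value
    return q, sem

# Hypothetical values, chosen only to keep the arithmetic easy to follow.
q, sem = tukey_q(mean_a=12.0, mean_b=8.0, ms_error=4.0, n=4)
print(f"SEM = {sem:.2f}, q = {q:.2f}")
```

The calculated q still has to be compared with the critical value from a table of the q statistic, which this sketch does not attempt.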
13.6.3 Fixed and random factors
The final complication applies to two-factor and more complex ANOVAs that include random factors. The concept of fixed and random factors was discussed in Section 11.6, but here is a reminder.
A fixed factor is one where the treatments (e.g. levels of temperature) have been specifically chosen. You are only interested in those particular treatments and the null hypothesis reflects this – for example, ‘There is no difference in cockroach activity at 20°C and 30°C.’ A random factor is one where the treatments are used as random representatives of the full set of possible treatments within that factor. Therefore, the null hypothesis is more general. Instead of comparing specific temperatures, the hypothesis is ‘There is no difference in cockroach activity at different temperatures.’ The levels of temperature chosen and used in the experiment are merely random representatives of the wider range of temperatures that cockroaches may experience.

For a two-factor ANOVA, both factors could be fixed, one could be random and the other fixed or both could be random. If a two-factor experiment contains two fixed factors, the methods for calculating the F ratios for the main effects (Factor A and Factor B) are those given in Table 13.3 and repeated in Table 13.6(b). The mean square for each factor estimates the effect of that factor plus error, and an F ratio is obtained by dividing the factor mean square by the within groups (error) mean square.

If, however, the analysis contains two random factors, the sum of squares and mean square for each of the two factors will be inflated by the inclusion of any additional variation caused by interaction. Therefore, the variation estimated by the mean square for each main effect will be the effect of that factor, plus interaction, plus error. Most importantly, to realistically estimate the F ratios for each random factor you need to divide the factor mean squares by the interaction MS (which estimates interaction plus error) rather than the error MS (Table 13.6).

Finally, if the ANOVA has one fixed and one random factor, it is even more complicated.
Most authors recommend that, if Factor A is fixed and Factor B is random, the F ratio for Factor A is obtained by dividing the Factor A MS by the interaction MS, but the F ratio for Factor B is obtained by dividing the Factor B MS by the error MS (Table 13.6). In all cases, the F ratio for interaction is obtained by dividing the interaction MS by the error MS. Importantly, many statistical packages do not give appropriate F ratios when random factors are included in an analysis, so you have to do these calculations yourself by dividing by the appropriate mean squares.
Table 13.6 (a) Sources of variation contributing to the mean squares (MS) for Factor A, Factor B, interaction and error, when both A and B are fixed, A is fixed and B is random, and both A and B are random. (b) Divisions required to obtain the appropriate F ratio for each factor.

(a) Sources of variation

| Source of variation | Both factors fixed | Factor A fixed, B random | Both factors random |
|---|---|---|---|
| Factor A | Factor A + error | Factor A + interaction + error | Factor A + interaction + error |
| Factor B | Factor B + error | Factor B + error | Factor B + interaction + error |
| Interaction | Interaction + error | Interaction + error | Interaction + error |
| Error | Error | Error | Error |

(b) Divisions required to obtain appropriate F ratio

| Source of variation | Both factors fixed | Factor A fixed, B random | Both factors random |
|---|---|---|---|
| Factor A | MS A / MS error | MS A / MS (interaction + error) | MS A / MS (interaction + error) |
| Factor B | MS B / MS error | MS B / MS error | MS B / MS (interaction + error) |
| Interaction | MS interaction / MS error | MS interaction / MS error | MS interaction / MS error |
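Because many packages do not choose the denominator for you, the divisions in Table 13.6(b) can be sketched as a small Python helper. The function name, its arguments and the mean squares used below are hypothetical. It encodes the rule used in this section: a main effect is divided by the error MS when the other factor is fixed, and by the interaction MS when the other factor is random. Since some authors recommend different divisions, treat this as an illustration of the table rather than a definitive procedure.

```python
def f_ratios(ms_a, ms_b, ms_interaction, ms_error, a_random=False, b_random=False):
    """Return F ratios for Factor A, Factor B and interaction, following Table 13.6(b)."""
    # The denominator for each main effect depends on whether the OTHER factor is random.
    denominator_a = ms_interaction if b_random else ms_error
    denominator_b = ms_interaction if a_random else ms_error
    return {
        "Factor A": ms_a / denominator_a,
        "Factor B": ms_b / denominator_b,
        # The interaction is always tested against the error mean square.
        "Interaction": ms_interaction / ms_error,
    }

# Hypothetical mean squares from a two-factor ANOVA table.
ms = dict(ms_a=60.0, ms_b=90.0, ms_interaction=30.0, ms_error=5.0)

both_fixed = f_ratios(**ms)
a_fixed_b_random = f_ratios(**ms, b_random=True)
both_random = f_ratios(**ms, a_random=True, b_random=True)

for label, result in [("Both fixed", both_fixed),
                      ("A fixed, B random", a_fixed_b_random),
                      ("Both random", both_random)]:
    print(label, result)
```

Whichever divisions you use, report them with your results so that readers can see how each F ratio was obtained.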
Here is a conceptual explanation for the different ways of estimating main effects in a two-factor ANOVA, depending on whether the other factor is fixed or random. In all cases, the fixed factor of interest is Factor A. Imagine the hypothetical case where the only levels of Factor A and B that exist in the world are A1 and A2, and B1, B2, B3 and B4. For example, A1 and A2 may be the only two dietary supplements available for feeding to four species of farmed catfish (B1 to B4) and you are interested in the effects of dietary supplements on the growth of these species.
Figure 13.10 A pictorial explanation for why the F ratio for a main effect is calculated differently, depending on whether the other factor is fixed or random. Cell means are indicated by symbols and pooled treatment means are indicated by the two short horizontal lines. (a) All the possible levels of Factor A and Factor B, together with all possible combinations of these. Note that there is considerable interaction, but overall there is no effect of Factor A (when Factor B is ignored, the pooled treatment means for A1 and A2 are identical). (b) When Factor B is fixed and only a subset of B is considered (B2 and B4), the interaction will contribute to the difference between the pooled means of A1 and A2, but this variation is a relevant addition within the deliberately restricted levels of each factor being compared. (c) When Factor B is random, the interaction will contribute unrealistic additional variation to the difference between the pooled means of A1 and A2. It will not indicate the true lack of change from A1 to A2 across the entire set of the levels of B and therefore needs to be excluded.
Figure 13.10(a) shows growth for all eight possible combinations of Factors A and B. Note that there is no effect of Factor A when averaged over all possible levels of Factor B because the means for levels A1 and A2 are the same, but there is considerable interaction between the two factors.
Both factors fixed
First, consider the case where both factors are fixed, and you are only interested in the four combinations of A1 and A2 with B2 and B4.
Because both factors are fixed, you are not interested in whether any differences in growth between A1 and A2 within this very restricted comparison also reflect those averaged over all possible levels of Factor B. The comparisons between A1 and A2 are shown in Figure 13.10(b). Cell means have been copied from the appropriate part of Figure 13.10(a). Although the means of treatments A1 and A2 (ignoring B) are affected by the interaction, you are only interested in treatment A1 compared to A2 within the two fixed levels of B2 and B4. Therefore, to get a realistic effect of Factor A within this limited and fixed comparison, the variation due to the interaction is a necessary and appropriate additional component of Factor A and you calculate the F ratio for Factor A by dividing its treatment mean square by error only.
Factor A fixed, Factor B random
Second, consider the case where Factor A is fixed and Factor B is random. You are interested in the comparison between A1 and A2 across all possible levels of B, from which B2 and B4 have been chosen as random representatives. The results of the experiment on the combinations of A1, A2 and B2, B4 are shown in Figure 13.10(c). Here, too, the pooled means of treatments A1 and A2 are affected by the interaction, but the difference within the experiment does not reflect the lack of change between A1 and A2 averaged over all possible levels of Factor B in Figure 13.10(a). Therefore, because the interaction has contributed additional variation to the sum of squares and mean square for Factor A, it is appropriate to exclude it by dividing the Factor A mean square by the interaction + error mean square to get a more realistic estimate of Factor A averaged over all four possible levels of B.

For any two-factor ANOVA, the effect of a particular factor (e.g. Factor A) is estimated by dividing by the mean square for error only if the other factor is fixed, but by the mean square for interaction (i.e. interaction + error) if the other factor is random. Therefore, if both factors are random, you divide the mean squares of each by the interaction mean square.

Finally, although I have specified the procedure for obtaining realistic F ratios when one or both factors are random, there is still some disagreement about this (e.g. Quinn and Keough, 2002). Some authors recommend dividing the mean square for Factor A and also Factor B by the mean square
for interaction + error when either, or both, is random. Most importantly, if you have an analysis involving one or more random factors, you should clearly specify how you calculated the F ratio for each factor.
13.7 Unbalanced designs
The caution about unbalanced designs (when the sample size is not the same in each treatment) in relation to single-factor ANOVA in Chapter 11 also applies to more complex models. Whenever possible you should try to ensure that sample sizes are equal in each treatment combination, especially when sample sizes are relatively small, because small, unequal samples may not give good estimates of cell means and can result in misleading conclusions.
13.8 More complex designs
Once you understand the concept of single- and two-factor analyses of variance, extension to three or more factors and other designs is relatively easy. A two-factor ANOVA breaks the analysis down into two main factors (which are each analysed like a single-factor ANOVA) and generates an interaction term by subtraction. A three-factor ANOVA does the same thing, but the analysis and ANOVA table are more complex because there are three main factors (Factors A, B and C), plus several interactions among these (A × B, A × C, B × C, A × B × C) and error. More advanced texts give rules for obtaining the appropriate F ratios with designs that are more complex, where there can be several combinations of fixed and random factors. If you continue on to use ANOVA a lot, you will realise that this chapter is very introductory. There are nested ANOVAs, two-factor ANOVAs without replication, ANOVAs for split plot designs, unbalanced designs and many more. This book does not attempt to cover all of these – instead it provides you with a general conceptual view that will help you work with more complex designs. Perhaps the best advice if you have to do a complex experiment requiring a complicated ANOVA is to find a good textbook (e.g. Quinn and Keough, 2002; Zar, 1999; Sokal and Rohlf, 1995) and perhaps talk to a statistician before you design the experiment.
13.9 Reporting the results of a two-factor ANOVA
For a two-factor ANOVA, you need to give the F ratios and their degrees of freedom for both main effects and their interaction. Just as for a single-factor ANOVA (Section 11.7), it is important to include the degrees of freedom for the numerator and the denominator of the F ratio as subscripts. For example, the output table for the ANOVA (Table 13.5) on the data for cockroach growth in Table 13.4 could be reported as ‘A two-factor ANOVA showed there was a significant effect of temperature ($F_{2,18}$ = 156.33, P < 0.001) and humidity ($F_{2,18}$ = 600.33, P < 0.001) but no significant interaction between these factors ($F_{4,18}$ = 0.33, NS) on cockroach growth.’ Importantly, the results of the ANOVA do not show which treatment combinations have the highest or lowest growth, so a summary table of means or a graph such as Figure 13.7 is also needed. For a significant fixed factor with three or more treatments, you would need to do a posteriori testing to identify which treatments appeared to be from the same or different populations.

When there is a significant interaction, a graph showing the results of full a posteriori testing is particularly helpful. Table 13.7 gives data for the percentage of exoskeleton damage in mud crabs exposed to two different levels of copper and three of zinc. A two-factor ANOVA of these data shows a significant effect of both copper ($F_{1,18}$ = 18.04, P < 0.001) and zinc ($F_{2,18}$ = 27.46, P < 0.001) and a significant interaction between them ($F_{2,18}$ = 24.05, P < 0.001). The six cell means are displayed in Figure 13.11, with superscripts indicating those from the same population. It is clear that high levels of exoskeleton damage only occurred in the treatment combination with high concentrations of copper and zinc.

Table 13.7 The percentage exoskeleton damage in mud crabs exposed to two levels of copper (0 and 10 μg/l) and three levels of zinc (0, 10 and 20 μg/l) in an orthogonal experiment with four mud crabs in each treatment combination.

| Copper concentration (μg/l) | 0 | 0 | 0 | 10 | 10 | 10 |
|---|---|---|---|---|---|---|
| Zinc concentration (μg/l) | 0 | 10 | 20 | 0 | 10 | 20 |
| | 2 | 3 | 3 | 3 | 2 | 12 |
| | 3 | 2 | 5 | 0 | 6 | 14 |
| | 1 | 3 | 4 | 5 | 2 | 14 |
| | 6 | 5 | 2 | 1 | 2 | 17 |

Figure 13.11 The mean percentage damage to the exoskeleton of mud crabs exposed to 0 and 10 μg/l of copper (grey and black bars respectively), in orthogonal combination with 0, 10 and 20 μg/l of zinc. Vertical bars show ± one SEM. Superscripts of the same letter indicate means that are not significantly different when compared with an a posteriori Tukey test. Significantly greater exoskeleton damage only occurred when there were high levels of both copper and zinc.
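If you would like to see where the three F ratios for the mud crab data come from, the following Python sketch works through the sums of squares for a balanced two-factor ANOVA with both factors fixed. The data are those in Table 13.7; the function and variable names are illustrative assumptions and the code is not taken from any statistical package. It does not calculate probabilities, which still need to be obtained from tables or software.

```python
# Table 13.7: percentage exoskeleton damage, keyed by (copper, zinc) in ug/l.
crabs = {
    (0, 0): [2, 3, 1, 6],   (0, 10): [3, 2, 3, 5],   (0, 20): [3, 5, 4, 2],
    (10, 0): [3, 0, 5, 1],  (10, 10): [2, 6, 2, 2],  (10, 20): [12, 14, 14, 17],
}

def mean(values):
    return sum(values) / len(values)

def two_factor_anova(cells):
    """F ratios for a balanced two-factor ANOVA with both factors fixed.

    cells maps (level of A, level of B) to the replicates in that cell.
    Returns {term: (F, numerator df, denominator df)}.
    """
    levels_a = sorted({a for a, _ in cells})
    levels_b = sorted({b for _, b in cells})
    n = len(next(iter(cells.values())))          # replicates per cell
    grand = mean([x for values in cells.values() for x in values])

    def pooled_mean(position, level):
        # Mean for one level of a factor, ignoring the other factor.
        return mean([x for key, values in cells.items()
                     if key[position] == level for x in values])

    ss_a = n * len(levels_b) * sum((pooled_mean(0, a) - grand) ** 2 for a in levels_a)
    ss_b = n * len(levels_a) * sum((pooled_mean(1, b) - grand) ** 2 for b in levels_b)
    ss_cells = n * sum((mean(v) - grand) ** 2 for v in cells.values())
    ss_interaction = ss_cells - ss_a - ss_b      # interaction obtained by subtraction
    ss_error = sum((x - mean(v)) ** 2 for v in cells.values() for x in v)

    df_a, df_b = len(levels_a) - 1, len(levels_b) - 1
    df_interaction = df_a * df_b
    df_error = len(levels_a) * len(levels_b) * (n - 1)
    ms_error = ss_error / df_error

    return {
        "A": (ss_a / df_a / ms_error, df_a, df_error),
        "B": (ss_b / df_b / ms_error, df_b, df_error),
        "A x B": (ss_interaction / df_interaction / ms_error, df_interaction, df_error),
    }

results = two_factor_anova(crabs)   # A is copper, B is zinc
for term, (f, df1, df2) in results.items():
    print(f"{term}: F({df1},{df2}) = {f:.2f}")
```

The F ratios and degrees of freedom printed by this sketch can be checked against those reported in the text for copper, zinc and their interaction.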
13.10 Questions

(1) Constructing your own data set will help you understand how a two-factor ANOVA works and what the F ratios and probability values for each term mean. Use a simple design with three levels of Factor A and two of Factor B. Assume the ANOVA is Model I. First, make all the cell means identical by using the following data:

| Factor A | A1 | A1 | A2 | A2 | A3 | A3 |
|---|---|---|---|---|---|---|
| Factor B | B1 | B2 | B1 | B2 | B1 | B2 |
| | 1 | 1 | 1 | 1 | 1 | 1 |
| | 2 | 2 | 2 | 2 | 2 | 2 |
| | 3 | 3 | 3 | 3 | 3 | 3 |
| | 4 | 4 | 4 | 4 | 4 | 4 |

(a) Analyse these data with a two-factor, Model I ANOVA. What are the F ratios and probabilities for each factor and the interaction?
(b) Now, deliberately change the data so you would expect a significant effect of Factor B, no effect of Factor A and no interaction. Rerun the two-factor analysis. What are the F ratios and probabilities for each factor and the interaction?
(c) Finally, deliberately change the data so you would expect a significant effect of Factors A and B, but no interaction, and rerun the analysis. What are the F ratios and probabilities for each factor and the interaction? It will help if you start this problem by drawing a rough graph of the cell means like the one in Figure 13.7.

(2) In the previous question, you examined a simple design with three levels of Factor A and two of Factor B. Change the data in the table given in Question 1 so you would expect a significant effect of Factor A and Factor B as well as a significant interaction. Run the two-factor analysis.
(a) What are the F ratios and probabilities for each factor and the interaction? Here too, a rough graph of the cell means like the one in Figure 13.7 will help visualise the data.
14 Important assumptions of analysis of variance, transformations, and a test for equality of variances

14.1 Introduction
Parametric analysis of variance assumes the data are from normally distributed populations with the same variance and there is independence, both within and among treatments. If these assumptions are not met, an ANOVA may give you an unrealistic F statistic and therefore an unrealistic probability that several sample means are from the same population. Therefore, it is important to know how robust ANOVA is to violations of these assumptions and what to do if they are not met. In some cases it may be possible to transform the data to make variances more homogeneous or give distributions that are better approximations to the normal curve. This chapter discusses the assumptions of ANOVA, followed by three frequently used transformations. Finally, there are descriptions of two tests for the homogeneity of variances.
14.2 Homogeneity of variances
The first and most important assumption is that the data for each treatment (or treatment combination in the case of two-factor and more complex ANOVA designs) have come from populations that have the same variance. Equality of variances is called homogeneity of variances or homoscedasticity, while unequal variances show heterogeneity of variances or heteroscedasticity. Nevertheless, statisticians have found that ANOVA is relatively robust to departures from homoscedasticity, and there has been considerable discussion about whether it is necessary to apply tests which assess this before doing an ANOVA, because they may be too sensitive when sample sizes are large, or too insensitive when sample sizes are small
(e.g. Quinn and Keough, 2002). Many authors suggest preliminary testing for homoscedasticity is not necessary, provided as a very general rule that the ratio of the largest to smallest variance does not exceed 4:1. Some cases of heteroscedasticity can be reduced by transforming the data (Section 14.5). Consequently, it is often useful to plot the data or calculate the variance within each treatment, or treatment combination, to see if there is a trend. For example, biological data often show an increase in variance as the mean increases, so transforming the data by taking the square root of each value may reduce heteroscedasticity (Section 14.5). There are several tests designed to assess heteroscedasticity and these have more uses than just checking whether data are suitable for parametric analysis. Sometimes you may be interested in an hypothesis about the variances rather than the means of different treatments. For example, you might hypothesise that a drug treatment increases the variance of systolic blood pressure in humans, so you would need to analyse your data with a test that compares variances among treatments. The Levene test for heteroscedasticity is described in Section 14.7.
14.3 Normally distributed data
The second assumption is that the data are from normally distributed populations. Nevertheless, it has been shown that ANOVA is quite robust in terms of minor departures from normality. As previously described in Section 9.8.1, drawing a P-P plot, where the actual cumulative frequency of the data is plotted against the expected cumulative frequency assuming the data are normally distributed, can be used to assess normality. You should only be cautious about proceeding with a parametric analysis if a P-P plot shows gross departures from linearity, such as sharp kinks.
14.3.1 Skew and outliers
A box and whiskers plot (Tukey, 1977) is a way of visually summarising the distribution of a sample (Figure 14.1) so it can be assessed for skew (Section 8.9) and whether there are values in the data set which are unusually distant from the median. These are called outliers. Construction of a box and whiskers plot is straightforward.
For a sample containing an odd number of values, you need to find the median, which is the middle value of this set of data (Section 8.10). Next, divide the data into two sets, the first of which contains all the values less than the median and the second of which contains all the values more than the median. Include the median in each set. Then, find the median of each of the lower and upper sets. These new medians are called the lower quartile and upper quartile, which are used to draw the lower and upper limits (which are also called the hinges) of the box. The distance between these quartiles or hinges is the interquartile range. Twenty-five per cent of the values in the sample will be larger than the upper quartile, 50% will lie between the two quartiles and 25% will be smaller than the lower quartile. Finally, you need to add the whiskers to the box. Each whisker can extend outwards for a maximum distance of 1.5 times the interquartile range from each end of the box, but is only drawn to the most extreme value of the set of data that is within that potential range. This will give you a plot with a box running from the lower to upper quartiles and whiskers extending vertically out from each end of the box (Figure 14.1). For a data set with an even number of values, the procedure is almost the same except that after finding the median you divide the data into two sets, the first of which contains all the values less than the median and the
Figure 14.1 The features of a box and whiskers plot.
second of which contains all the values more than the median (so neither will contain the median).
14.3.2 A worked example of a box and whiskers plot
This example uses a sample with an odd number of values (n = 9): 1, 3, 4, 6, 7, 9, 10, 12, 25. The median of this sample is 7, so the data are divided into two groups, where the lower group contains 1, 3, 4, 6 and 7, while the upper group contains 7, 9, 10, 12 and 25. The median of the lower group is 4, which becomes the lower quartile. The median of the upper group is 10, which becomes the upper quartile. These are the limits of the ends of the box (called the hinges). The interquartile range is 10 – 4 = 6 units. From this you can draw the rectangular box in Figure 14.2(a). The maximum potential length of each whisker is 1.5 times the interquartile range and thus 1.5 × 6 = 9. This is shown in Figure 14.2(b). Each whisker can extend out to a maximum of nine units from its hinge, but since each is only drawn to the most extreme value of the
Figure 14.2 The three steps in drawing a box and whiskers plot, using the data in Section 14.3.2. (a) Drawing the box. (b) Establishing the maximum potential length of each whisker. (c) Drawing the actual length of each whisker. Note the asterisk indicating that the value of 25 is an outlier.
data within its potential range, the lower whisker will only extend down to 1, while the upper will only extend up to 12. The outlier of 25, indicated by an asterisk, lies outside the range of the box and its whiskers (Figure 14.2(c)).

The shape of the box and whiskers plot indicates whether the distribution is skewed. If the distribution of the data is symmetrical about the mean, then the box and whiskers plot will have a median equidistant from the hinges and whiskers of similar length. As the distribution becomes increasingly skewed, the median will become less equidistant from the hinges and the whiskers will have different lengths (Figure 14.3).

Any values outside the range of the whiskers are called outliers and should be scrutinised carefully. In some cases, they may be obvious mistakes caused by incorrect data entry or recording, faulty equipment or inappropriate methodology (e.g. a human body temperature of 50°C or a negative number of individuals) in which case they can justifiably be deleted. Outliers that appear to be real are of great interest because they may indicate something unusual, especially if they are present in some samples or treatments and not others. Importantly, however, when there are outliers you should be cautious about using a parametric test. One or two extreme values can greatly affect the variance of a sample since the formula for the variance uses the square of the difference between each value and the mean, so the assumption of variances that are not grossly dissimilar among treatments or samples can be easily violated.
Figure 14.3 Examples of box and whiskers plots for (a) normally distributed data and (b) data with a gross positive skew. Outliers are shown as asterisks.
14.4 Independence
Finally, the data must be independent of each other, both within and among groups. This important assumption needs very little explanation because it is really just a matter of good experimental design. For example, you need to ensure each experimental unit within each treatment or treatment combination is chosen independently and all possible experimental units within the population have an equal likelihood of being selected. You also need to ensure that independence applies among treatment groups as well.
14.5 Transformations
Transformations are a way of reducing heteroscedasticity or making data more closely resemble a normal distribution. There are many transformations available and three commonly used ones are described below. Most spreadsheet and statistical packages include transformations.
14.5.1 The square root transformation
If the variance of the data increases as the mean increases, a square root transformation will make these data more homoscedastic. There is an example in Table 14.1.

Table 14.1 An example of the effect of a square root transformation on data where the variance increases as the mean increases. Data are given for the growth of tumours in three drug treatments. For the original data, the largest variance is 10.92 and the smallest is 0.67, giving a ratio of largest to smallest of 16.38:1. A square root transformation reduces this ratio to 2.05:1.

| | Control: original | Control: square root | Tumostat: original | Tumostat: square root | Inhibin 4: original | Inhibin 4: square root |
|---|---|---|---|---|---|---|
| | 19 | 4.36 | 7 | 2.65 | 3 | 1.73 |
| | 15 | 3.87 | 5 | 2.24 | 2 | 1.41 |
| | 14 | 3.74 | 4 | 2.00 | 2 | 1.41 |
| | 11 | 3.32 | 3 | 1.73 | 1 | 1.00 |
| s² | 10.92 | 0.18 | 2.92 | 0.15 | 0.67 | 0.09 |
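The variance ratios quoted for Table 14.1 can be checked with a short Python sketch. It uses the sample variance from the standard library; the dictionary and function names are illustrative assumptions.

```python
from math import sqrt
from statistics import variance   # sample variance, dividing by n - 1

# Original tumour growth data from Table 14.1.
tumours = {
    "Control": [19, 15, 14, 11],
    "Tumostat": [7, 5, 4, 3],
    "Inhibin 4": [3, 2, 2, 1],
}

def variance_ratio(groups):
    """Ratio of the largest to the smallest sample variance among treatments."""
    variances = [variance(values) for values in groups.values()]
    return max(variances) / min(variances)

# Apply the same square root transformation to every value.
transformed = {name: [sqrt(x) for x in values] for name, values in tumours.items()}

ratio_original = variance_ratio(tumours)
ratio_sqrt = variance_ratio(transformed)
print(f"Original data: {ratio_original:.2f}:1")
print(f"Square root transformed: {ratio_sqrt:.2f}:1")
```

The first ratio is well above the 4:1 guideline given in Section 14.2, while the second falls below it, which is the pattern described in the table caption.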
14.5.2 The logarithmic transformation
If the data have a gross positive skew, a logarithmic transformation will give a distribution that better approximates the normal distribution. When the data set includes any values of 0, you need to use the logarithm of X + 1 because the logarithm of 0 is –∞. Biological data often show a positive skew, and Figures 14.4(a) and (b) show the effect of a logarithmic transformation on a positively skewed distribution.

Figure 14.4 The effect of a logarithmic transformation on data for the number of fruit produced by tomato plants. The X axis shows the number of fruit produced per plant and the Y axis shows the number of plants bearing each number of fruit. (a) Untransformed data show a positive skew. (b) After transformation to log10. Note that the distribution in (b) is far more symmetrical than (a).

Figure 14.5 Restriction of the normal distribution when percentage data are close to 0% or 100%.
14.5.3 The arc-sine transformation
The arc-sine transformation can be useful for percentage data with an absolute minimum of 0% and an absolute maximum of 100%. Any set of data with a mean close to either of these extremes is unlikely to have a normal distribution because it will cease at these values (Figure 14.5). An arc-sine transformation will give these distributions a far more normal shape.
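The three transformations in this section can be sketched in Python as shown below. The function names and example values are hypothetical. The text does not give a formula for the arc-sine transformation, so the version here assumes the commonly used form, the arc-sine of the square root of the proportion, expressed in degrees. Check which form your own statistical package uses.

```python
import math

def sqrt_transform(x):
    """Square root transformation (Section 14.5.1)."""
    return math.sqrt(x)

def log_transform(x):
    """Logarithm (base 10) of X + 1, so that values of 0 can be included (Section 14.5.2)."""
    return math.log10(x + 1)

def arcsine_transform(percentage):
    """Arc-sine of the square root of the proportion, in degrees.

    Assumed form of the transformation; expects a percentage between 0 and 100.
    """
    return math.degrees(math.asin(math.sqrt(percentage / 100)))

# Hypothetical counts of fruit per plant, including a zero.
fruit = [0, 1, 2, 5, 20]
print([round(log_transform(x), 2) for x in fruit])

# Hypothetical percentages, including both extremes.
percentages = [0, 5, 50, 95, 100]
print([round(arcsine_transform(p), 1) for p in percentages])
```

Whichever transformation you choose, apply it to every value in the data set in the same way.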
14.6 Are transformations legitimate?
Here you may be thinking that transforming data to make them more suitable for parametric statistical analysis sounds like cheating or altering the data to get the result you want. It is not, for two reasons. First, transformations are applied to the entire data set so that each value is treated in the same way. Second, there is no scientific necessity to use the linear base ten scale that we are so familiar with. Many biological relationships between two variables (e.g. the metabolic rate of an invertebrate and temperature, the number of eggs per female toad versus the length of that female) are logarithmic, or follow squares or cubes, rather than being linear. The apparently linear pH scale is actually logarithmic – a pH of 4 indicates a ten-fold difference from pH 5 and a 100-fold difference from pH 6. Therefore, in many cases it is actually more appropriate to transform the data so they reflect the underlying relationship.
Importantly, if you transform a set of data, you also need to transform your null and alternate hypotheses. For example, if you were to hypothesise that ‘Drugs A, B and C have no effect on the blood glucose concentration in humans’, but carried out a logarithmic transformation on your data before analysis, your original hypothesis would also have to be transformed to ‘Drugs A, B and C have no effect on the logarithm of blood glucose concentration in humans.’
14.7 Tests for heteroscedasticity
There are several tests designed to examine whether two or more samples appear to have come from populations with the same variance. As mentioned earlier, if you are only interested in whether the data are suitable for a parametric analysis, the general rule that the ratio of the largest variance to the smallest should not exceed 4:1 can be used. If the ratio is greater than 4:1, it may be useful to examine the data and see where the differences occur since it may be possible to apply a transformation and proceed with a parametric analysis. If you are interested in testing an hypothesis about the variance of two or more samples, you can use the Levene test, which also gives an F ratio. Remember, however, that a significant result for the Levene test may not mean the data are unsuitable for analysis by ANOVA, which is quite robust to heteroscedasticity. Levene’s original test calculates the absolute difference between each replicate and its treatment mean and then does a single-factor ANOVA on these differences. The absolute difference is the difference between any two numbers expressed as a positive value. (For example, 3 – 6 = –3 while 6 – 3 = +3, but in both cases the absolute difference is 3.) Figures 14.6 and 14.7 give a pictorial explanation of the Levene test. Two examples are shown, using an experiment on the growth of brain tumours in three different treatments. First, if the variances within all treatments are similar, then the set of absolute differences between the replicates and their sample means will also be similar for each treatment. Figure 14.6 shows the absolute differences for three samples that all have the same variance. Note that the means of the absolute differences in 14.6(b) are the same, even though the treatment
Figure 14.6 The Levene test examines whether two or more variances are likely to have come from the same population by doing a single-factor ANOVA on the absolute differences between the replicates and their treatment means or cell means. (a) Arrows show the difference between each replicate and its treatment mean. Note that some differences are positive and some are negative. (b) The absolute differences are listed under each treatment. Every value of the absolute difference between each replicate and its sample mean will be positive. In this case, the means of the absolute differences are the same for each treatment, and a single-factor ANOVA comparing these will not be significant, thereby indicating the variances are homoscedastic.
means in 14.6(a) are not. A single-factor ANOVA comparing the means of the absolute differences will not be significant. Second, if the variances differ among treatments (Figure 14.7(a)), then so will the values of the absolute differences (Figure 14.7(b)). Note that the set of absolute differences for the control treatment has a mean that is much larger than the other two, and a single-factor ANOVA comparing these means is likely to be significant. The Levene test is available in most statistical packages.
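The procedure described in this section can be sketched in a few lines of Python. This is a minimal illustration rather than a packaged implementation: the function names `single_factor_f` and `levene_f` are hypothetical, and the tumour-growth values are invented purely to show a case where one treatment is far more variable than the others. The sketch returns only the F ratio and its degrees of freedom, so the probability still has to come from a table of F or from a statistical package.

```python
def single_factor_f(groups):
    """Return (F ratio, df among treatments, df error) for a single-factor ANOVA."""
    values = [x for group in groups for x in group]
    grand_mean = sum(values) / len(values)
    means = [sum(group) / len(group) for group in groups]
    # Among-treatment sum of squares: squared displacement of each treatment
    # mean from the grand mean, multiplied by the number of replicates.
    ss_among = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Error sum of squares: squared displacement of each replicate from its own mean.
    ss_error = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_among = len(groups) - 1
    df_error = len(values) - len(groups)
    return (ss_among / df_among) / (ss_error / df_error), df_among, df_error


def levene_f(groups):
    """Levene's original test: single-factor ANOVA on the absolute differences
    between each replicate and its treatment mean."""
    abs_diffs = [[abs(x - sum(g) / len(g)) for x in g] for g in groups]
    return single_factor_f(abs_diffs)


# Hypothetical values, invented for illustration only (not from the book).
control = [10, 30, 50, 70, 90]   # widely scattered around its mean
drug_a = [44, 47, 50, 53, 56]    # tightly clustered
drug_b = [60, 62, 64, 66, 68]    # tightly clustered

f_ratio, df_among, df_error = levene_f([control, drug_a, drug_b])
print(f"Levene F({df_among},{df_error}) = {f_ratio:.2f}")
```

If SciPy is available, `scipy.stats.levene` with `center='mean'` carries out the same mean-centred version of the test and also reports the probability; its default setting centres on the median instead.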
14.8
Reporting the results of transformations and the Levene test
If the data have been transformed, you should give the transformation, the reason for it and whether it was successful. For example ‘There was marked
Figure 14.7 An example of the Levene test where there is heteroscedasticity. (a) Arrows show the difference between each replicate and its sample mean. (b) The absolute differences between each replicate and its sample mean are listed under each treatment. Since the absolute differences for the control are much greater than the other two treatments, a single-factor ANOVA comparing the means of the values in (b) will show the variances are significantly heteroscedastic.
heteroscedasticity in that the ratio of the largest to smallest variance exceeded 4:1 and the variance increased as the mean increased, so a square root transformation was applied, which reduced the ratio to 2.4:1.’ The Levene test is simply a single-factor ANOVA of the absolute values of the displacement of each replicate from its cell mean. A significant result for an experiment with four treatments could be reported as ‘A Levene test showed the variance differed significantly among treatments: $F_{3,16}$ = 5.91, P < 0.05.’ If you were interested in which treatments had the greatest and least variance, the statistics for the variance can be given and an a posteriori Tukey test could be carried out on the means of the absolute values.
14.9
Questions
(1)
Why can a transformation be useful when analysing data with parametric tests, especially ANOVA?
(2)
The following set of data is for the number of mosquito bites on the left arm of 16 volunteers in a trial of three specifically chosen insect repellents and an untreated control. (a) Do the data appear to require transformation before using a single-factor ANOVA to compare the four treatments? (b) What transformation would you recommend? (c) Carry out the transformation. Has it made the data suitable for analysis? (d) Carry out the ANOVA. Is there a significant difference among treatments? (e) Is a posteriori testing required? (f) Carry out an a posteriori test. Which repellent(s) would you recommend?
Control   Bugroff   Bitefree   Nobite
15        4         2          7
9         2         4          9
12        3         3          6
18        1         5          10
(3)
The table below gives data for the systolic blood pressure, in mm Hg, of 20 individuals who had systolic blood pressure between 145 and 165 mm Hg. Ten were given the experimental drug Angiocalm and ten were left untreated as a control. The initial report was ‘The ratio of the largest to smallest variance of the groups did not exceed 4:1, so the data were analysed by a Model I single-factor ANOVA which showed Angiocalm significantly reduced systolic blood pressure compared to the control ($F_{1,18}$ = 6.21, P = 0.023).’ A physiologist who examined the results was concerned that the effect of Angiocalm appeared to be extremely variable among individuals and requested a
Levene test on the data. Use a statistical package or the procedure in Section 14.7 to do this test. What can you conclude?

Angiocalm   Control
125         150
155         145
120         165
135         160
165         150
125         145
140         160
130         155
160         165
145         145
15
More complex ANOVA
15.1
Introduction
This chapter is an introduction to four slightly more complex analyses of variance often used by life scientists: two-factor ANOVA without replication, ANOVA for randomised block designs, repeated-measures ANOVA and nested ANOVA. An understanding of these is not essential if you are reading this book as an introduction to biostatistics. If, however, you need to use models that are more complex, the explanations given here are straightforward extensions of the pictorial descriptions in Chapters 11 and 13 and will help with many of the models used to analyse designs that are more complex.
15.2
Two-factor ANOVA without replication
This is a special case of the two-factor ANOVA described in Chapter 13. Sometimes an orthogonal experiment with two independent factors has to be done without replication because there is a shortage of experimental subjects or the treatments are very expensive to administer. The simplest case of ANOVA without replication is a two-factor design. You cannot do a single-factor ANOVA without replication. The data in Table 15.1 are for a preliminary trial of the effects of the two experimental drugs, ‘Proshib’ and ‘Testosblock’, and an untreated control, in an orthogonal combination with three levels (high, medium and low) of radiation therapy, upon the growth of solid tumours of the prostate. The researcher had only nine consenting volunteers with advanced prostate cancer, so an orthogonal design was only possible without replication. This causes a problem. There is no way to directly estimate error from the dispersion of replicates around their respective cell means (as was done for a
Table 15.1 The increase in volume (in mm³) of prostate tumours in nine males after three months of treatment with nine different combinations of radiation therapy and drug treatments.

                  Drug
Radiation level   Proshib   Testosblock   Control
Low               81        76            79
Medium            45        46            45
High              28        27            27
Figure 15.1 The growth of prostate tumours in nine combinations of three levels of drug treatment and three levels of radiation. There is only one replicate within each treatment combination. The increased volume of each tumour is shown as a symbol: ♦ = tumours receiving low radiation, • = tumours receiving medium radiation and ▪ = tumours receiving high radiation. The heavy horizontal line shows the grand mean and the nine shorter horizontal lines show each cell mean.
single-factor ANOVA in Chapter 11 and two-factor ANOVA with replication in Chapter 13) because there is only one value in each treatment combination, so this will be the same as the cell mean. A two-factor ANOVA without replication uses a different way of estimating error, which has to assume there is no interaction between the factors. Figures 15.1 to 15.4 give a pictorial explanation of how a two-factor ANOVA without replication estimates three sources of variation and uses these to isolate the effects of the two factors. The data in Table 15.1 are graphed in Figure 15.1.
Figure 15.2 The total variance within the experiment on the growth of prostate tumours. The heavy horizontal line indicates the grand mean and the nine shorter horizontal lines indicate each cell mean. The displacement of each point from the grand mean (arrows) will be caused by the total variation within the experiment.
First, the total variation within the experiment is estimated (Figure 15.2). Each point will be displaced from the grand mean by the effects of Factor A, Factor B, any interaction and error. These displacements are squared and added together to give the total sum of squares. This is divided by the appropriate degrees of freedom, which is one less than the number of experimental subjects, to give the mean square or total variance. The calculation is the same as for the total variance in a two-factor ANOVA with replication (Chapter 13). Second, the effect of Factor A is estimated by ignoring Factor B and calculating a new mean for each of the levels within Factor A. The displacement of each of these new treatment means from the grand mean will be caused by the average effect of Factor A, plus error (Figure 15.3). Each of the displacements is squared, multiplied by the number of replicates within each treatment and added together to give the sum of squares for Factor A. The number of degrees of freedom is one less than the number of treatments in ‘A’ and dividing the sum of squares by this value will give the mean square for Factor A. Finally, the effect of Factor B is estimated by ignoring Factor A and calculating a new mean for each treatment level of Factor B. The displacement of each of these new treatment means from the grand mean will be caused by the effect of Factor B, plus error (Figure 15.4). Here, too, the displacements are
Figure 15.3 Estimation of the effect of Factor A. Factor B (radiation) is ignored. The displacement of each new treatment mean from the grand mean (arrows) will be caused by the effect of Factor A (here drugs), plus error.
Figure 15.4 Estimation of the effect of Factor B. Factor A (drugs) is ignored. The displacement of each treatment mean from the grand mean (arrows) will be caused by the effect of Factor B (here radiation), plus error.
squared, multiplied by the number of replicates within each treatment and added together to give the sum of squares for Factor B. The number of degrees of freedom is one less than the number of treatments in ‘B’ and dividing by this value will give the mean square for Factor B. At this stage, you have estimates for the following sources of variation:
(a) The total variance in the experiment (the combined effects of Factor A, Factor B, A×B and error) (Figure 15.2)
(b) The effects of Factor A (drug treatment + error) (Figure 15.3)
(c) The effects of Factor B (radiation level + error) (Figure 15.4).
However, because there is only one replicate within each treatment combination, there is no way to estimate error separately. Therefore, unlike a two-factor ANOVA with replication, it is not possible to estimate the sum of squares for the effect of any interaction by subtracting the sums of squares for Factor A, Factor B and error from the total sum of squares. Two-factor ANOVA without replication does the next best thing. The sums of squares and degrees of freedom in an ANOVA are additive and in Chapter 11 it was explained how the total sum of squares and total degrees of freedom in a single-factor ANOVA were the sums of those for Factor A, plus error. Therefore, by subtracting the sums of squares for Factor A plus Factor B from the total sum of squares, you are left with the sum of squares for the remaining variation in the experiment, which will include error and any effect of interaction. This sum of squares, which is the only possible estimate of error for this unreplicated design, is divided by the remaining degrees of freedom to give the best estimate of the mean square for error. If there is an interaction, this mean square will be inflated, but this is unavoidable and undetectable if you do a two-factor ANOVA without replication. The results of a two-factor ANOVA without replication will include the sums of squares and mean squares for Factor A, Factor B and the remaining variation (used as the best estimate of error), together with the F ratios and probabilities for Factors A and B. For the example given above, the results of the analysis are in Table 15.2.
Table 15.2 Results of a two-factor ANOVA without replication on the data in Table 15.1. There is a significant effect of radiation but no significant effect of the drugs on the growth of prostate tumours.

Source of variation   Sum of squares   df   Mean square   F         P
Radiation             4070.222         2    2035.111      832.545   < 0.001
Drugs                    4.222         2       2.111        0.864     0.488
Remainder                9.778         4       2.444
Total                 4084.222         8
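The partitioning described in Section 15.2 can be checked with a short Python sketch. It is offered as an illustration under stated assumptions, not as the procedure used by any particular package: the function name `two_factor_without_replication` and the dictionary keys are hypothetical, and the data are those of Table 15.1 with radiation levels as rows and drugs as columns. It should reproduce the sums of squares and mean squares given in Table 15.2.

```python
def two_factor_without_replication(table):
    """Partition the variation in a two-factor design with one value per cell.

    table[i][j] is the single value for level i of the row factor and
    level j of the column factor.
    """
    n_rows, n_cols = len(table), len(table[0])
    values = [x for row in table for x in row]
    grand_mean = sum(values) / len(values)
    ss_total = sum((x - grand_mean) ** 2 for x in values)

    row_means = [sum(row) / n_cols for row in table]
    col_means = [sum(row[j] for row in table) / n_rows for j in range(n_cols)]
    # Each row mean is based on n_cols values and each column mean on n_rows values.
    ss_rows = n_cols * sum((m - grand_mean) ** 2 for m in row_means)
    ss_cols = n_rows * sum((m - grand_mean) ** 2 for m in col_means)

    # The remainder (error plus any interaction) is obtained by subtraction.
    ss_rem = ss_total - ss_rows - ss_cols
    df_rows, df_cols = n_rows - 1, n_cols - 1
    df_rem = (len(values) - 1) - df_rows - df_cols
    ms_rem = ss_rem / df_rem
    return {
        "ss_total": ss_total,
        "ss_rows": ss_rows,
        "ss_cols": ss_cols,
        "ss_rem": ss_rem,
        "df_rem": df_rem,
        "ms_rem": ms_rem,
        "f_rows": (ss_rows / df_rows) / ms_rem,
        "f_cols": (ss_cols / df_cols) / ms_rem,
    }


# Table 15.1: rows are radiation levels, columns are Proshib, Testosblock, Control.
tumour_growth = [
    [81, 76, 79],  # low radiation
    [45, 46, 45],  # medium radiation
    [28, 27, 27],  # high radiation
]

result = two_factor_without_replication(tumour_growth)
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

Because the remainder is found by subtraction, the sketch cannot separate error from interaction, which is exactly the limitation described above.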
15.3
A posteriori comparison of means after a two-factor ANOVA without replication
If a two-factor ANOVA without replication shows a significant effect of a fixed treatment factor (e.g. the three radiation levels being specifically compared in Section 15.2), you are likely to want to know which treatments appear to be from the same or different populations. The procedure for a posteriori testing is a modification of that for a single-factor ANOVA except that the MS error for the remainder (estimated by subtraction as described above) is used as the best estimate of error. For a Tukey test, each factor is examined separately using the formula:

$$q = \frac{\bar{x}_A - \bar{x}_B}{\mathrm{SEM}}$$ (15.1 copied from 12.1)

with the standard error of the mean estimated from:

$$\mathrm{SEM} = \sqrt{\frac{\text{MS error}}{n}}$$ (15.2 copied from 12.3)

where the MS error is the ‘remainder’ MS from the ANOVA (see Table 15.2) and n is the number of data within each group (for example, there are three values within each of the three radiation levels when the drug treatments are ignored and vice versa).
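As a worked illustration of equations (15.1) and (15.2), the sketch below computes the Tukey statistic for each pair of radiation levels in Table 15.1, using the ‘remainder’ mean square of 2.444 from Table 15.2. The variable names are hypothetical. The sketch stops at the q statistic: each value still has to be compared with the critical value of q for the appropriate number of means and error degrees of freedom, taken from a table or a statistical package.

```python
import math
from itertools import combinations

ms_error = 2.444  # 'remainder' mean square from Table 15.2

# Table 15.1 with the drug treatments ignored: three values per radiation level.
radiation = {
    "low": [81, 76, 79],
    "medium": [45, 46, 45],
    "high": [28, 27, 27],
}

n = 3  # number of data within each radiation level
sem = math.sqrt(ms_error / n)  # equation (15.2)

means = {level: sum(values) / len(values) for level, values in radiation.items()}

# Equation (15.1) for every pair of levels; the absolute difference is used
# so that q is positive whichever mean is larger.
q_values = {
    (a, b): abs(means[a] - means[b]) / sem
    for a, b in combinations(means, 2)
}

print(f"SEM = {sem:.4f}")
for (a, b), q in q_values.items():
    print(f"{a} vs {b}: q = {q:.2f}")
```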
15.4
Randomised blocks
Two-factor ANOVA without replication can also be used to analyse results from a randomised block experimental design. Many agricultural experiments on the productivity of commercial crops are done on a large scale, and involve treatments applied to particular plots within a field or paddock. When each plot is harvested, you usually only get one value (e.g. the total weight of the crop harvested from that plot). Unfortunately, factors including soil fertility, wind exposure, drainage and soil type may vary across fields and paddocks. Therefore, if you use an experimental design with several replicates of each treatment allocated at random within such a large area, you may get a lot of variation among replicates of the same treatment simply due to their different locations.
Table 15.3 An example of a randomised block experimental design. (a) The matrix represents a paddock that has been subdivided into several rows, each of which is treated as a separate block and subdivided into three equal-sized plots. One replicate of each treatment (A–C) is assigned at random to a plot within each block. The data for each treatment are shown in brackets. (b) Results of a two-factor ANOVA without replication on the data in (a). There is a significant difference among treatments A–C. The significant difference among the random factor ‘blocks’ shows there is variation across the paddock but it is unlikely to be of further interest in this comparison of fertiliser treatments.

(a)
Block number   Treatments running across each block
1              A (44)   C (27)   B (62)
2              C (45)   A (60)   B (84)
3              C (52)   B (88)   A (67)
4              B (41)   A (30)   C (12)

(b)
Source of variation   Sum of squares   df   Mean square   F        P
Treatments            2418.50          2    1209.25       142.73   < 0.001
Blocks                3170.67          3    1056.89       124.75   < 0.001
Remainder               50.83          6       8.47
Total                 5640             11
A randomised block design gives a way of isolating treatment effects from this type of confounding spatial variation. An area of land is subdivided into several rows (which are called blocks). Each block is then subdivided into several plots of equal size, usually with one plot for each treatment type. One replicate of each treatment is assigned at random within each block, as shown in Table 15.3(a). This design can be analysed as a two-factor ANOVA without replication, using treatments as the first factor and blocks as the second. The table of results will give sums of squares and mean squares for the total variance, ‘treatment’ and ‘block’ with the remaining variation, calculated by
subtraction, used as the error term (Table 15.3(b)). Here, too, when there are more than two levels of a fixed factor of interest (e.g. ‘treatment’) a significant result would need to be followed by an a posteriori test to determine which levels are different, and can be done using the procedure in Section 15.3. Two-factor ANOVA without replication is often used to analyse experiments on animals, where you might expect great variation among the offspring of different parents. For example, small mammals usually only have relatively few individuals per litter (e.g. cats have an average of about three kittens per litter) and there is likely to be considerable genetic variation among litters from different females. In such cases, the best approach is often to treat litters as one (random) factor and your treatments as the second factor with each treatment applied to one individual per litter. Here the factor ‘litters’ is equivalent to the blocks in the previous example.
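To show why blocking helps, the following sketch analyses the data in Table 15.3(a) twice: once as a randomised block design (two-factor ANOVA without replication) and once, purely for comparison, as if the same twelve values had come from a single-factor design with the blocks ignored. The second analysis is an illustrative assumption that is not part of the original example, and the variable names are hypothetical. The layout of plots within blocks is read from Table 15.3(a).

```python
# Table 15.3(a): yield for each treatment within each block.
plots = {
    1: {"A": 44, "C": 27, "B": 62},
    2: {"C": 45, "A": 60, "B": 84},
    3: {"C": 52, "B": 88, "A": 67},
    4: {"B": 41, "A": 30, "C": 12},
}

treatments = sorted({t for block in plots.values() for t in block})
values = [y for block in plots.values() for y in block.values()]
grand_mean = sum(values) / len(values)
ss_total = sum((y - grand_mean) ** 2 for y in values)

# Treatment means ignore blocks; block means ignore treatments.
treatment_means = {
    t: sum(block[t] for block in plots.values()) / len(plots) for t in treatments
}
block_means = {b: sum(block.values()) / len(block) for b, block in plots.items()}

ss_treatments = len(plots) * sum((m - grand_mean) ** 2 for m in treatment_means.values())
ss_blocks = len(treatments) * sum((m - grand_mean) ** 2 for m in block_means.values())

df_total = len(values) - 1
df_treatments = len(treatments) - 1
df_blocks = len(plots) - 1
ms_treatments = ss_treatments / df_treatments

# Randomised block analysis: the remainder, found by subtraction, is the error term.
ss_remainder = ss_total - ss_treatments - ss_blocks
df_remainder = df_total - df_treatments - df_blocks
f_blocked = ms_treatments / (ss_remainder / df_remainder)

# Comparison only: blocks ignored, so variation among blocks stays in the error term.
f_unblocked = ms_treatments / ((ss_total - ss_treatments) / (df_total - df_treatments))

print(f"F for treatments, blocks isolated: {f_blocked:.2f}")
print(f"F for treatments, blocks ignored:  {f_unblocked:.2f}")
```

When the variation among blocks is left in the error term the F ratio for treatments is far smaller, which is the confounding spatial variation the randomised block design is intended to isolate.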
15.5
Repeated-measures ANOVA
Often life scientists do experiments where several treatments are applied sequentially to a sample of individuals and the same response variable (e.g. blood pressure) is measured during each treatment. These are called repeated-measures experiments because more than one measurement of the same response variable is taken from each individual. There are two main reasons for using a repeated-measures design instead of a fully independent one. First, it may not be possible to obtain a large number of participants, as discussed for the experiment on prostate tumours in Section 15.2. Second, if there is a lot of variation among individuals, an experiment with independent samples of different individuals within each treatment may have such a large error variance that differences among treatments are obscured (Figure 15.5(a)). Therefore, although the data are not independent and error cannot be measured directly, a repeated-measures design can be more powerful, especially if the differences among subjects are relatively consistent within each level of treatment (Figure 15.5(b)). Nevertheless, a disadvantage of experiments in which different treatments are sequentially administered to the same individual is the risk of carryover effects, where the response to the second or subsequent treatments is affected by those previously administered. This is why a recovery period of hours, days or even weeks is usually included between treatments.
Figure 15.5 (a) If there is a lot of variation among individuals, the use of independent samples in each treatment may give a very large error term that can obscure differences among treatments. Experimental units (individuals) are numbered from 1 to 18 and horizontal lines show the three treatment means. (b) A repeated-measures design. Experimental units are numbered from 1 to 6. Even though there is a lot of variation, the response to each treatment is relatively consistent for each individual. This repeated-measures design is likely to be more powerful because error is estimated from the variation within (not among) individuals. In both panels the vertical axis is systolic blood pressure (mm Hg) and the horizontal axis shows the Control, Drug A and Drug B treatments.
Table 15.4 The rate of expiratory airflow of four asthmatics before and after being treated sequentially with the two experimental drugs Adrein B and Broklar.

Subject number   Before   Adrein B   Broklar
1                64       81         76
2                77       82         84
3                63       70         65
4                71       84         80
Here is an example of a repeated-measures design. Four humans who suffered from severe asthma were each sequentially treated with two experimental drugs designed to increase the internal diameter of their airways. The maximum expiratory airflow of each participant was measured on a scale of 0 (no airflow) to 100 (the peak expected airflow for a healthy person of that age and height) at the start of the experiment and then immediately after being treated with each drug. A four-hour recovery period was included between treatments to minimise carryover effects. The data are in Table 15.4. At this stage, you are likely to be thinking that this experimental design appears extremely similar to the one for randomised blocks in the previous section, where treatments were applied to different parts of each of four blocks. The two designs are almost the same except that each treatment has been applied sequentially to the same individual for the repeated-measures design. Nevertheless, the data can be analysed in the same way as a two-factor ANOVA without replication (Section 15.4), with subjects as a random factor and drugs as a fixed factor. The calculations of the sums of squares for (a) the total variance, (b) drug treatments and (c) subjects are shown in Figures 15.6 (a)–(c) and the results of this analysis are in Table 15.5. Here too the error term cannot be calculated directly, so the remaining variation, which will include any interaction, has to be calculated by subtraction as described previously for a two-factor ANOVA without replication. Most statistical software will run a repeated-measures analysis and the results will be the same as the randomised block design described previously. A significant effect for a fixed factor of interest (e.g. drugs) with three or more levels can be examined by a posteriori testing using the procedure in Section 15.3.
Figure 15.6 The sources of variation in a repeated-measures design. Symbols show lung function index before treatment (▪), after treatment with Adrein B (▲) and after treatment with Broklar (▼). (a) The total variance in the experiment: Factor A (treatment + error), plus Factor B (subjects + error), plus any interaction (interaction + error) and error. (b) The effect of Factor A (treatments + error) only. The displacement of each treatment mean from the grand mean (arrows) will be caused by the effect of treatment (here drugs), plus error. (c) The effect of Factor B (subjects + error) only. The displacement of each subject mean from the grand mean will be caused by the effect of each subject plus error. Note that only (a), (b) and (c) are needed to obtain the remaining variation (interaction + error) by subtraction, and these sources of variation are the same as for a two-factor ANOVA without replication. (d) The table of results for a repeated-measures ANOVA often includes a line for variation within subjects. The displacement of each point from the mean for each subject is due to the effect of Factor A (treatment + error), plus any interaction (interaction + error) between the treatment and subject and therefore shows the combined total of both sources of within-subject variation.
Sometimes the results of a repeated-measures ANOVA are presented in more detail than in Table 15.5, but this is only to clearly specify (a) inherent variation among subjects and (b) variation due to the treatments applied within each subject. If you compare Tables 15.5 and 15.6, you will see the second has an additional source of variation called ‘Within subjects’. This is simply the combined sum of squares (and degrees of freedom) for treatments and the remaining variation in the experiment, given separately in the following two lines of Table 15.6. Variation within subjects is the displacement of the value for each treatment from the mean of each subject and will be due to the effect of the treatment (e.g. drugs + error), plus any interaction between the treatments and each individual (interaction + error) (see Figure 15.6(d)).
Table 15.5 Results of a two-factor ANOVA repeated-measures analysis on the data in Table 15.4. There is a significant difference in airflow among treatments, so you are likely to want to run an a posteriori test to distinguish which of these are significantly different.

Source of variation   Sum of squares   df   Mean square   F       P
Subjects              388.92           3    129.64        13.11   < 0.01
Treatments            234.00           2    117.00        11.83   < 0.05
Remainder              59.33           6      9.89
Total                 682.25           11
Table 15.6 Results of a two-factor ANOVA without replication on the data in Table 15.4 in the format where variation (a) among subjects and (b) within subjects are clearly distinguished. Note that the within subjects sums of squares and degrees of freedom are the sum of those for ‘treatments’ and ‘remainder’ shown below it and that no F ratio or probability are given for the combined quantity.

Source of variation    Sum of squares   df   Mean square   F       P
(a) Subjects           388.92           3    129.64        13.11   < 0.01
(b) Within subjects    293.33           8
    Treatments         234.00           2    117.00        11.83   < 0.05
    Remainder           59.33           6      9.89
Total                  682.25           11
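The relationship between the ‘Within subjects’ line and the lines for treatments and remainder can be checked directly from the data in Table 15.4. The sketch below is a minimal illustration with hypothetical variable names: it calculates the within-subjects sum of squares from the displacement of each value from its own subject mean, and separately obtains the treatment and remainder sums of squares as for a two-factor ANOVA without replication.

```python
# Table 15.4: expiratory airflow for each subject (before, Adrein B, Broklar).
airflow = {
    1: (64, 81, 76),
    2: (77, 82, 84),
    3: (63, 70, 65),
    4: (71, 84, 80),
}

n_subjects = len(airflow)
n_treatments = 3
values = [y for row in airflow.values() for y in row]
grand_mean = sum(values) / len(values)
ss_total = sum((y - grand_mean) ** 2 for y in values)

subject_means = {s: sum(row) / n_treatments for s, row in airflow.items()}
treatment_means = [
    sum(row[j] for row in airflow.values()) / n_subjects for j in range(n_treatments)
]

# Among subjects: displacement of each subject mean from the grand mean.
ss_subjects = n_treatments * sum((m - grand_mean) ** 2 for m in subject_means.values())
# Treatments: displacement of each treatment mean from the grand mean.
ss_treatments = n_subjects * sum((m - grand_mean) ** 2 for m in treatment_means)
# Remainder (error plus any interaction), by subtraction.
ss_remainder = ss_total - ss_subjects - ss_treatments

# Within subjects: displacement of each value from the mean of its own subject.
ss_within = sum(
    (y - subject_means[s]) ** 2 for s, row in airflow.items() for y in row
)

print(f"Subjects:        {ss_subjects:.2f}")
print(f"Within subjects: {ss_within:.2f}")
print(f"  Treatments:    {ss_treatments:.2f}")
print(f"  Remainder:     {ss_remainder:.2f}")
print(f"Total:           {ss_total:.2f}")
```

The within-subjects sum of squares calculated directly equals the sum of the treatment and remainder values, which is why no separate F ratio is given for the combined line.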
Repeated-measures designs are often used in biomedical experiments. For example, you might take hourly measurements of the blood pressure of the same ten individuals over the course of a day to see how this variable changes with time. Here the individuals are the subjects, time is the within-subjects factor (equivalent to ‘treatments’ in Tables 15.5 and 15.6) and blood pressure is the response variable. More complex repeated-measures designs include two or more levels of a factor among the subjects (e.g. males and females). These analyses include additional interactions, but are easy to apply in many statistical packages. More advanced texts such as Quinn and Keough (2002) give an excellent discussion.
15.6
Nested ANOVA as a special case of a single-factor ANOVA
An experimental design that compares the means of two or more levels of the same factor (e.g. different levels of salinity or different drugs) can be analysed by a single-factor ANOVA, as described in Chapter 11. Sometimes, however, researchers do an experiment with two or more levels of a particular factor, but also have two or more subgroups nested within each level. Here is an example. Large-scale experiments in aquaculture are often constrained by the number of ponds available to the researcher. For example, only nine ponds were available for an investigation into the effects of two different vitamin supplements (vitamin A and vitamin B) on the growth of prawns. Each pond was stocked with 100 prawns of the same species, age and weight. Three ponds were allocated at random to each of the two vitamin treatments and the remaining three ponds used as a control. After six months, the 100 prawns in each pond were recaptured and weighed. This is called a nested or hierarchical design. Three ponds, each containing several replicates (in this case 100), are nested within each treatment (Figure 15.7). This design is not appropriate for analysis using a single-factor ANOVA with diet as the factor and the 300 prawns within each treatment as the number of replicates because this ignores the ponds, which may contribute to the variation within the experiment. You may also be thinking that the design appears pseudoreplicated in that the real level of replication within each treatment is the number of ponds rather than the number of
Figure 15.7 Example of a nested or hierarchical design. Each pond contains 100 replicates and three ponds are nested within each treatment. Open circles indicate the control (prawn food only), grey circles treatment 1 (prawn food plus vitamin A) and black circles treatment 2 (prawn food plus vitamin B).
prawns. This is true and the nested analysis described below takes it into account. The design is also unsuitable for analysis as a two-factor ANOVA with diet as the first factor and ponds as the second, because the three ponds are simply random subgroups nested within each treatment, which do not intentionally contain different treatment levels of a second factor. For example, the ‘first’ pond in treatment 1 does not share an exclusive property with the ‘first’ pond in treatments 2 and 3 (Table 15.7).

Table 15.7 An hierarchical design should not be analysed as an independent factor design. (a) Correct hierarchical plan for the nested experimental design described in Figure 15.7. This design has one factor nested within the other. The ponds have been chosen at random and are nested within each treatment. (b) Incorrect orthogonal two-factor plan. There is nothing exclusively shared within any of the rows of ponds across treatments, so it is incorrect to treat the three rows as three different levels of the factor ‘pond’ because the ponds do not contain different levels of a second factor.

(a)
Prawn food   Prawn food + vitamin A   Prawn food + vitamin B
Pond C       Pond B                   Pond A
Pond E       Pond F                   Pond D
Pond I       Pond G                   Pond H

(b)
                               Treatment
Pond                           Prawn food            Prawn food + vitamin A   Prawn food + vitamin B
First within each treatment    100 prawns (Pond C)   100 prawns (Pond B)      100 prawns (Pond A)
Second within each treatment   100 prawns (Pond E)   100 prawns (Pond F)      100 prawns (Pond D)
Third within each treatment    100 prawns (Pond I)   100 prawns (Pond G)      100 prawns (Pond H)
When one factor (e.g. Factor B) is nested within another (e.g. Factor A), it is often written as Factor B(Factor A). For the nested design above, where Factor A is diet and Factor B is the ponds, the following will contribute to the final weight of each prawn:

$$\text{Growth} = \text{Factor A} + \text{Factor B(Factor A)} + \text{error}$$ (15.3)
This is the same as equation (13.1) for a single-factor ANOVA apart from the additional source of variation from the ponds nested within each diet. There is no interaction term because the design is not orthogonal. A nested ANOVA isolates the effects of treatments and subgroups within these treatments and gives an F ratio for both factors. The way this analysis works is described in Section 15.6.1.
15.6.1
A pictorial explanation of nested ANOVA
For simplicity, the following example has two treatments and two ponds nested within each treatment, with only four prawns in each pond. The data are in Table 15.8 and Figure 15.8. Diet is Factor A and the ponds are Factor B(A). First, error is estimated. The value for each replicate is displaced from its cell mean by error only (Figure 15.8). The sum of squares for error is obtained by squaring each displacement and adding these together. This quantity is divided by the appropriate degrees of freedom (the sum of one less than the number of replicates within each of the cells) to give the mean square for error.
Table 15.8 Data for the weight in grams of prawns after six weeks of feeding with (a) standard prawn food plus vitamin A and (b) standard prawn food only. Two ponds are nested within each treatment.
Prawn food + vitamin A    Prawn food only
Pond 1    Pond 2          Pond 3    Pond 4
30        60              80        110
35        65              85        115
45        75              95        125
50        80              100       130
Figure 15.8 Arrows show the displacement of each replicate from its cell mean, which is the variation due to error only. The number of degrees of freedom is the sum of one less than the number within each of the cells. In this example there are 12 degrees of freedom.
Second, the subgroups (in this case the ponds) are ignored and new means are calculated by combining all of the replicates within each treatment (in this case diet) (Figure 15.9). This will give the effect of treatment, but for a nested ANOVA each treatment mean will be displaced from the grand mean because of the effect of treatment plus the subgroups nested within each treatment, plus error. This seems inconsistent with the explanation given for an orthogonal two-factor ANOVA, where ignoring a factor (e.g. Factor B) removed it as a source of variation, thereby allowing the effect of the other (e.g. Factor A) to be estimated. For a two-factor orthogonal design, all levels of Factor A are present within every level of Factor B and vice versa, so each of the two factors can be ignored in turn and the effect of each factor separately estimated. But for a nested design, the effects of Factor B (the subgroups) cannot be excluded in this way because different subgroups (here different ponds) are present and may contribute very different amounts of variation within each of the levels of Factor A. The displacements of each treatment mean from the grand mean are squared, multiplied by the number of replicates within their respective treatment and added together to give the sum of squares for Factor A, which will include treatment, plus subgroups(treatment), plus error. The number of degrees of freedom is one less than the number of treatments and dividing the sum of squares by this number will give the mean square for Factor A (i.e. treatment, plus subgroups(treatment), plus error).
Figure 15.9 Estimation of the effects of Factor A (treatment). The displacement of each combined treatment mean from the grand mean shown by the arrows is caused by the average effects of that treatment, plus ponds nested within each treatment, plus error. The number of degrees of freedom will be one less than the number of treatments, so in this example with two treatments there is one degree of freedom.
Third, a mean is also calculated for Factor B(A), which is the variation contributed by each subgroup (in this case each pond) (Figure 15.10). Each subgroup mean will only be displaced from its respective treatment mean by the effect of the subgroups plus error. The displacements are squared, multiplied by the number of replicates within their respective subgroups and added together to give the Factor B(A) sum of squares. The number of degrees of freedom will be the sum of one less than the number of subgroups within each treatment. Dividing the sum of squares by this number will give the mean square for Factor B(A) (i.e. subgroups plus error). The procedures shown in Figures 15.8 to 15.10 give three separate sums of squares and mean squares:
Figure 15.10 Estimation of the effect of Factor B(A). The displacement of each cell mean from its treatment mean is shown by each arrow and is caused by the average effect of that subgroup (each pond), plus error. The number of degrees of freedom will be the sum of one less than the number of ponds within each treatment. In this example there are two degrees of freedom.
(a) Factor A: treatments + subgroups(treatment) + error (Figure 15.9)
(b) Factor B(A): subgroups(treatment) + error (Figure 15.10)
(c) error (Figure 15.8)
No other mean squares are needed to isolate the effects of the treatments from the subgroups nested within each treatment. First, to isolate the effect of treatment only, the MS for treatment + subgroups(treatment) + error is divided by the MS for subgroups(treatment) + error. Second, to isolate the variation due to subgroups(treatment), the MS for subgroups(treatment) + error is divided by the MS error (Table 15.9).
Table 15.9 The appropriate division and components of each mean square term used to estimate the effect of each factor when Factor B is nested within Factor A.
Source of variation                                     Calculation of F ratio    Components of each mean square
Factor A (treatment)                                    MS A / MS B(A)            (Factor A + Factor B(A) + error) / (Factor B(A) + error)
Factor B(A) (subgroups nested within each treatment)    MS B(A) / MS error        (Factor B(A) + error) / error
In the example shown in Figures 15.8–15.10, the F ratio for the effect of Factor A will only have one and two degrees of freedom, despite there being 16 prawns in the experiment. This is appropriate because the level of replication for this comparison is the ponds and not the prawns within each pond.
Most statistical packages will do a nested ANOVA and the results will be in a similar format to Table 15.10, which gives the results for the data in Table 15.8.
Table 15.10 Results of a nested ANOVA on the data in Table 15.8. Note that the F ratio for diet has been obtained by dividing the MS for diet by the MS for pond(diet).
Source of variation    Sum of squares    df    Mean square    F         P
Diet                   10000.0           1     10000.0        5.556     0.143
Pond(Diet)             3600.0            2     1800.0         21.600    0.000
Error                  1000.0            12    83.3
If the treatment factor is fixed and significant, you are likely to want to carry out a posteriori testing to examine which treatment means are significantly different. The Tukey test (equation (15.1)) can be used, but when comparing among treatments the appropriate ‘MS error’ to use in equation (15.1) is the MS for subgroups(treatments) instead of the error. I suggest you use a more advanced text (e.g. Sokal and Rohlf (1995) or Zar (1999)) if you need to do a posteriori testing after a nested ANOVA.
This example is the simplest case of a nested or hierarchical design. More complex designs can include several levels of nesting, and nested factors in combination with two- and higher-factor ANOVAs. If you need to use more complex designs, it is important to read an advanced text or talk to a statistician before doing the experiment.
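The three steps of the pictorial explanation in Section 15.6.1 can be sketched in a few lines of Python. This is only an illustrative sketch using the data of Table 15.8; the function name `nested_anova` and the dictionary layout are assumptions made here, not part of any statistical package, and the sketch does not calculate probabilities.

```python
from statistics import mean

# Weights (g) from Table 15.8: two ponds nested within each diet.
data = {
    "Prawn food + vitamin A": {"Pond 1": [30, 35, 45, 50], "Pond 2": [60, 65, 75, 80]},
    "Prawn food only": {"Pond 3": [80, 85, 95, 100], "Pond 4": [110, 115, 125, 130]},
}


def nested_anova(data):
    """Sums of squares, df, mean squares and F ratios for Factor B nested in Factor A."""
    all_values = [v for ponds in data.values() for pond in ponds.values() for v in pond]
    grand_mean = mean(all_values)

    ss_a = ss_b = ss_error = 0.0
    df_b = df_error = 0
    for ponds in data.values():
        treatment_values = [v for pond in ponds.values() for v in pond]
        treatment_mean = mean(treatment_values)
        # Factor A: displacement of each treatment mean from the grand mean.
        ss_a += len(treatment_values) * (treatment_mean - grand_mean) ** 2
        df_b += len(ponds) - 1
        for pond in ponds.values():
            pond_mean = mean(pond)
            # Factor B(A): displacement of each pond mean from its treatment mean.
            ss_b += len(pond) * (pond_mean - treatment_mean) ** 2
            # Error: displacement of each replicate from its pond (cell) mean.
            ss_error += sum((v - pond_mean) ** 2 for v in pond)
            df_error += len(pond) - 1

    df_a = len(data) - 1
    ms_a, ms_b, ms_error = ss_a / df_a, ss_b / df_b, ss_error / df_error
    return {
        "SS": (ss_a, ss_b, ss_error),
        "df": (df_a, df_b, df_error),
        "MS": (ms_a, ms_b, ms_error),
        "F_A": ms_a / ms_b,         # diet tested against pond(diet)
        "F_B(A)": ms_b / ms_error,  # pond(diet) tested against error
    }


result = nested_anova(data)
for key, value in result.items():
    print(key, value)
```

The sums of squares, degrees of freedom and F ratios produced this way should agree with Table 15.10. Note that the F ratio for diet uses the mean square for pond(diet) as its denominator, not the error mean square.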
15.7 A final comment on ANOVA – this book is only an introduction
Even though this book has five chapters about analysis of variance, it is only an introduction to an enormous and diverse topic. Hopefully, the introduction developed here will make it easier for you to understand more complex designs described in advanced texts.
15.8 Reporting the results of two-factor ANOVA without replication, randomised blocks design, repeated-measures ANOVA and nested ANOVA
For a two-factor ANOVA without replication, the F ratios, degrees of freedom and probabilities for both factors are reported. The analysis of the data in Table 15.1 could be reported as ‘A two-factor ANOVA without replication showed a significant effect of radiation ($F_{2,4}$ = 835.55, P < 0.001) but no effect of drugs ($F_{2,4}$ = 0.864, NS).’ Note that the degrees of freedom given for each F ratio are the treatment and remainder (because the latter is the only available estimate of error). For a significant difference between two levels of a fixed factor, you also need to give the mean for each level to identify which is the highest or lowest and, for three or more levels, you might also report on the results of an a posteriori test among treatments. For the data in Table 15.1, you could report ‘Subsequent a posteriori testing showed the mean increase in tumour volume differed significantly among all three radiation levels and decreased as the radiation level increased.’
For a randomised blocks design, the F ratio for treatment is given, but usually not the one for blocks unless variation among this factor is of interest. The analysis shown in Table 15.3 could be reported as ‘A two-factor ANOVA without replication applied to a randomised blocks design showed a significant difference among treatments A, B and C ($F_{2,6}$ = 142.73, P < 0.001).’ Here, too, for a fixed factor, you might also report on the results of an a posteriori test among treatments as ‘Subsequent a posteriori testing showed treatment B yielded significantly more than treatment A, which yielded significantly more than treatment C.’ For more complex designs with additional factors and interactions, it is usually clearer to give a table of the results of the ANOVA, and a graph showing cell means can be extremely helpful.
For a repeated-measures ANOVA, the F ratio for treatment is given, but usually not the one for subjects unless variation among this randomly chosen factor is of interest. The analysis shown in Table 15.5 could be reported as ‘A repeated-measures ANOVA showed a significant difference among the three treatments ($F_{2,6}$ = 11.83, P < 0.05).’ Here, too, you might also report on the results of an a posteriori test among treatments as ‘Subsequent a posteriori testing showed no significant difference between the two drug treatments, but both resulted in significantly greater airway diameter than the untreated control.’ Here, too, a graph of the treatment means is useful. For more complex designs with additional factors and interactions, it is usually necessary to also give a table showing the results of the ANOVA.
For a nested ANOVA it is particularly important to ensure you give the appropriate degrees of freedom for the numerator and denominator of the F ratio. For example, the F ratio for the effect of diet in Table 15.10 is the ‘diet’ MS divided by the MS for Pond(Diet), so the F ratio for diet is reported as ‘A nested ANOVA showed no significant difference between diets ($F_{1,2}$ = 5.556, NS).’ For a significant fixed factor with three or more levels, you are likely to report on a posteriori testing. The F ratio for the nested factor is usually only reported if it is of interest.
15.9 Questions
(1) The table below gives the concentration of total polycyclic aromatic hydrocarbons (PAHs) in three benthic sediment cores taken 3 km, 2 km and 1 km from an oil refinery situated on the edge of an estuary. (a) Analyse the data as a two-factor ANOVA without replication, using depth and distance as factors. Is there a significant effect of distance? Is there a significant effect of depth in the sediment? (b) The environmental scientist who collected these data mistakenly analysed them with a single-factor ANOVA comparing the three different cores but ignoring depth (i.e. the table below was simply taken as three columns of independent data for three samples). Repeat this incorrect analysis. Is the result significant? What might be the implications, in terms of the conclusion drawn about the concentrations of PAHs and distance from the refinery, if the incorrect single-factor analysis were done?
             Distance
Depth (m)    3 km    2 km    1 km
1            1.11    1.25    1.28
2            0.84    0.94    0.95
3            2.64    2.72    2.84
4            0.34    0.38    0.39
5            4.21    4.20    4.23
(2) A fisheries biologist who had to estimate the number of prawns in two aquaculture ponds chose three locations at random within each pond and deployed four traps at each, using a total of 24 traps. This design is summarised below.
Location                     Number of traps
First location in pond 1     4 traps
Second location in pond 1    4 traps
Third location in pond 1     4 traps
First location in pond 2     4 traps
Second location in pond 2    4 traps
Third location in pond 2     4 traps
The biologist said ‘I have a two-factor design, where the ponds are one factor and the trap location is the second, so I will use a two-factor ANOVA with replication.’ (a) Is this appropriate? What analysis would you use for this design?
(3) The table below gives data for the percentage area of a person’s teeth coated with plaque (a colourless bacterial biofilm that forms on hard surfaces in the mouth) after using one of four different toothpastes for four successive fortnights. (a) What analysis would you use for this design? (b) Use a statistical package to analyse the data. Is there a significant difference among toothpastes? (c) Use an a posteriori test to identify the most effective toothpaste(s). Which would you recommend? (d) Are there any you would not recommend?
Person    Plarkoff    Abrade    Whiteup    ScrubOff
1         2           21        17         11
2         8           14        12         8
3         15          25        27         18
4         9           20        13         10
5         15          23        20         12
6         26          24        17         21
16 Relationships between variables: correlation and regression
16.1 Introduction
Often life scientists obtain data for two or more variables measured on the same set of subjects or experimental units because they are interested in whether these variables are related and, if so, the type of functional relationship between them. If two variables are related, they vary together – as the value of one variable increases or decreases, the other also changes in a consistent way. If two variables are functionally related, they vary together and the value of one variable can be predicted from the value of the other.
To detect a relationship between two variables, both are measured on each of several subjects or experimental units and these bivariate data examined to see if there is any pattern. A scatter plot with one variable on the X axis and the other on the Y axis (Chapter 3) can reveal patterns, but does not show whether two variables are significantly related or have a significant functional relationship. This is another case where you have to use a statistical test, because an apparent relationship between two variables may only have occurred by chance in a sample from a population where there is no relationship. A statistic will indicate the strength of the relationship, together with the probability of getting that particular result or an outcome even more extreme in a sample from a population where there is no relationship between the two variables.
Two parametric methods for statistically analysing relationships between variables are linear correlation and linear regression, both of which can be used on data measured on a ratio, interval or ordinal scale. Correlation and regression have very different uses, and there have been many cases where correlation has been inappropriately used instead of regression and vice versa. After contrasting correlation and regression, this chapter explains correlation analysis. Regression analysis is explained in Chapter 17.
16.2 Correlation contrasted with regression
Correlation is an exploratory technique used to examine whether the values of two variables are significantly related, which means that the values of both variables change together in a consistent way. For example, an increase in one may be accompanied by a decrease in the other. There is no expectation that the value of one variable can be predicted from the other or that there is any causal relationship between them. In contrast, regression analysis is used to describe the functional relationship between two variables so that the value of one can be predicted from the other. A functional relationship means that the value of one variable (called the dependent variable or the response variable) can be determined by the value of the second (the independent variable), but the reverse is not true. For example, the amount of tooth wear in koalas, a mammal which feeds on leaves, is likely to be determined by age because older koalas will have spent more time chewing. The opposite is not true – the age of a koala is not determined by how worn its teeth are. Nevertheless, although tooth wear is determined by age, it is not caused by age – it is actually caused by chewing. This is an important point. Regression analysis can be used provided there is a good reason to hypothesise that one variable (the dependent one) can be determined by another (the independent one), but it does not necessarily have to be caused by it. Regression analysis provides an equation that describes the functional relationship between two variables and which can be used to predict values of the dependent variable from the independent one. The very different uses of correlation and regression are summarised in Table 16.1.
Table 16.1 The applications of correlation and regression.
Correlation
    Exploratory: are two variables significantly related?
    Neither Y nor X has to be dependent upon the other variable.
    Neither variable has to be determined by the other.
Regression
    Definitive: what is the functional relationship between variable Y and variable X and is it significant?
    Predictive: what is the value of Y given a particular value of X?
    Variable Y is dependent upon X.
    It must be plausible that Y is determined by X, but Y does not necessarily have to be caused by X.
16.3 Linear correlation
The Pearson correlation coefficient, symbolised by ρ (the Greek letter rho) for a population and by r for a sample, is a statistic that indicates the extent to which two variables are linearly related and can be any value from −1 to +1. Usually the population statistic ρ is not known, so it is estimated by the sample statistic r. An r of +1, which shows a perfect positive linear correlation, will only be obtained when the values of both variables increase together and lie along a straight line (Figure 16.1(a)). Similarly, an r of −1, which shows a perfect negative linear correlation, will only be obtained when the value of one variable decreases as the other increases and the points also lie along a straight line (Figure 16.1(b)). In contrast, an r of 0 shows no relationship between two variables and Figure 16.1(c) gives one example of this where the points lie along a straight line parallel to the X axis. When the points are more scattered, but both variables tend to increase together, the values of r will be between 0 and +1 (Figure 16.1(d)), while if one variable tends to decrease as the other increases, the value of r will be between 0 and −1 (Figure 16.1(e)). If there is no relationship and considerable scatter (Figure 16.1(f)), the value of r will be close to 0. Finally, it is important to remember that linear correlation will only detect a linear relationship between variables – even though the two variables shown in Figure 16.1(g) are obviously related, the value of r will be close to 0.
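The cases described above can be checked numerically with a short Python sketch. The data here are contrived for illustration only (they are not the data plotted in Figure 16.1), and the helper `pearson_r` is a hypothetical name; it calculates r as the sum of the products of the Z scores divided by n − 1, following equation (16.4).

```python
from statistics import mean, stdev


def pearson_r(xs, ys):
    """Sample Pearson r: sum of the products of Z scores divided by n - 1."""
    mx, my = mean(xs), mean(ys)
    sx, sy = stdev(xs), stdev(ys)  # stdev divides by n - 1
    products = (((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys))
    return sum(products) / (len(xs) - 1)


x = [1, 2, 3, 4, 5, 6, 7]

r_pos = pearson_r(x, [2 * v + 1 for v in x])        # points on a rising straight line
r_neg = pearson_r(x, [20 - 3 * v for v in x])       # points on a falling straight line
r_curve = pearson_r(x, [(v - 4) ** 2 for v in x])   # obvious but symmetric curved relationship

print(round(r_pos, 6), round(r_neg, 6), round(r_curve, 6))
```

The curved example shows why a scatter plot is worth drawing first: Y is completely determined by X, yet the linear correlation is essentially zero because the rising and falling halves of the curve cancel out.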
Figure 16.1 Some examples of the value of the correlation coefficient r. (a) A perfect linear relationship where r = 1. (b) A perfect linear relationship where r = −1. (c) No relationship (r = 0). (d) A positive linear relationship with 0 < r < 1. (e) A negative linear relationship where −1 < r < 0. (f) No linear relationship (r is close to 0). (g) An obvious relationship, but one that will not be detected by linear correlation (r will be close to 0).
16.4 Calculation of the Pearson r statistic
A statistic for correlation needs to reliably describe the strength of a linear relationship for any bivariate data set, even when the two variables have been measured on very different scales. For example, the values of one variable might range from 0 to 10 and the other from 0 to 1000. To obtain a statistic that always has a value between +1 and −1, with these maximum and minimum values indicating a perfect positive and negative linear relationship respectively, you need a way of standardising the data. This is straightforward and is done by transforming the values of both variables to Z scores, as described in Chapter 8. To transform a set of data to Z scores, the mean is subtracted from each value and the result divided by the standard deviation. This will give a distribution that always has a mean of 0 and a standard deviation (and variance) of 1. For a population, the equation for Z is:
$$Z = \frac{X_i - \mu}{\sigma} \tag{16.1, copied from 8.3}$$
and for a sample, it is:
$$Z = \frac{X_i - \bar{X}}{s}. \tag{16.2}$$
Figure 16.2 For any set of data, dividing the distance between each value and the mean by the standard deviation will give a mean of 0 and a standard deviation (and variance) of 1. The scales on which X and Y have been measured are very different for cases (a) and (b) above, but transformation of both variables gives the distribution shown in (c), where both Zx and Zy have a mean of 0 and a standard deviation of 1.
Figure 16.2 shows the effect of transforming bivariate data measured on different scales to their Z scores.
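As a minimal sketch of the transformation in equation (16.2), the Python below converts two hypothetical samples measured on very different scales to Z scores. The values of `x` and `y` and the helper name `z_scores` are assumptions chosen only for illustration.

```python
from statistics import mean, stdev


def z_scores(values):
    """Transform a sample to Z scores: subtract the mean, divide by s (equation 16.2)."""
    m, s = mean(values), stdev(values)  # stdev uses n - 1 in the denominator
    return [(v - m) / s for v in values]


x = [2, 4, 6, 8, 10]            # hypothetical variable on a scale of 0 to 10
y = [150, 420, 610, 790, 980]   # hypothetical variable on a scale of 0 to 1000

zx, zy = z_scores(x), z_scores(y)
for name, z in (("Zx", zx), ("Zy", zy)):
    print(name, [round(v, 3) for v in z],
          "mean:", round(mean(z), 6), "sd:", round(stdev(z), 6))
```

Whatever the original units, both sets of Z scores end up with a mean of 0 and a standard deviation of 1, which is what allows the two variables to be compared directly.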
Once the data for both variables have been converted to their Z scores, it is easy to calculate a statistic that indicates the strength of the relationship between them. If the two increase together, large positive values of Zx will always be associated with large positive values of Zy and large negative values of Zx will also be associated with large negative values of Zy (Figure 16.3(a)). If there is no relationship between the variables, as in the example where all the points lie along a line parallel to the X axis, all of the values of Zy will be 0 (Figure 16.3(b)). Finally, if one variable decreases as the other increases, large positive values of Zx will be consistently associated with large negative values of Zy and vice versa (Figure 16.3(c)). This gives a way of calculating a comparative statistic that indicates the extent to which the two variables are related. If the Zx and Zy scores for each of the experimental units are multiplied together and then summed (equation (16.3)), data with a positive correlation will give a total with a positive value and data with a negative correlation will give a total with a negative one. In contrast, data for two variables that are not related will give a total close to 0:
$$\sum_{i=1}^{n} (Z_{x_i} Z_{y_i}). \tag{16.3}$$
Importantly, the largest possible positive value of $\sum_{i=1}^{n} (Z_{x_i} Z_{y_i})$ will be obtained when each pair of data has exactly the same Z scores for both variables (Figure 16.3(a)), and the largest possible negative value will be obtained when the Z scores for each pair of data are the same number but opposite in sign (Figure 16.3(c)). If the pairs of scores do not vary together completely in either a positive or negative way, the total will be a smaller positive (Figure 16.3(d)) or negative number (Figure 16.3(f)). This total will increase as the size of the sample increases, so dividing by the degrees of freedom (N for a population and n − 1 for a sample) will give a statistic that has been ‘averaged’, just as the equations for the standard deviation and variance of a sample are averaged and corrected for sample size by dividing by n − 1. The statistic given by equation (16.4) is the Pearson correlation coefficient r:
$$r = \frac{\sum_{i=1}^{n} (Z_{x_i} Z_{y_i})}{n - 1}. \tag{16.4}$$
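Equations (16.3) and (16.4) can be followed step by step in a short Python sketch. The three data sets below are contrived four-point examples (they are not the values plotted in Figure 16.3), and the function names are hypothetical.

```python
from statistics import mean, stdev


def z_scores(values):
    """Z scores for a sample (equation 16.2)."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


def sum_of_products_and_r(xs, ys):
    """Return the total from equation (16.3) and r from equation (16.4)."""
    zx, zy = z_scores(xs), z_scores(ys)
    total = sum(a * b for a, b in zip(zx, zy))  # equation (16.3)
    return total, total / (len(xs) - 1)         # equation (16.4)


x = [1, 2, 3, 4]
examples = {
    "perfect positive": [10, 20, 30, 40],
    "perfect negative": [40, 30, 20, 10],
    "scattered positive": [10, 30, 20, 40],
}

results = {label: sum_of_products_and_r(x, y) for label, y in examples.items()}
for label, (total, r) in results.items():
    print(f"{label}: sum of products = {total:.3f}, r = {r:.3f}")
```

With four pairs of data, the perfect relationships give totals of n − 1 = 3 and −(n − 1) = −3, so dividing by n − 1 gives r = 1 and r = −1. The scattered example gives a smaller total and therefore an r between 0 and 1.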
Figure 16.3 Examples of raw scores and Z scores for data with (a) a perfect positive linear relationship (all points lie along a straight line), (b) no relationship, (c) a perfect negative linear relationship (all points lie along a straight line), (d) a positive relationship, (e) no relationship and (f) a negative relationship. Note that the largest positive and negative values for the sum of the products of the two Z scores for each point occur when there is a perfect positive or negative relationship, and that these values (+3 and −3) are equivalent to n − 1 and −(n − 1) respectively.
More importantly, equation (16.4) gives a statistic that will only ever be between −1 and +1. This is easy to show. In Chapter 8, it was described how the Z distribution always has a mean of 0 and a standard deviation (and variance) of 1. If you were to calculate the variance of the Z scores for only one variable, you would use the equation:
$$s^2 = \frac{\sum_{i=1}^{n} (Z_i - \bar{Z})^2}{n - 1} \tag{16.5}$$
but since $\bar{Z}$ is 0, this equation becomes:
$$s^2 = \frac{\sum_{i=1}^{n} Z_i^2}{n - 1} \tag{16.6}$$
and since $s^2$ is always 1 for the Z distribution, the numerator of equation (16.6) is always equal to n − 1. Therefore, for a set of bivariate data where the two Z scores within each experimental unit are exactly the same in magnitude and sign, the equation for the correlation between the two variables:
$$r = \frac{\sum_{i=1}^{n} (Z_{x_i} Z_{y_i})}{n - 1} \tag{16.7}$$
will be equivalent to:
$$r = \frac{\sum_{i=1}^{n} Z_{x_i}^2}{n - 1} \quad \text{or} \quad \frac{n - 1}{n - 1} = 1. \tag{16.8}$$
Consequently, when there is perfect agreement between Zx and Zy for each point, the value of r will be 1. If the Z scores generally increase together but not all the points lie along a straight line, the value of r will be between 0 and 1 because the numerator of equation (16.8) will be less than n − 1. Similarly, if every Z score for the first variable is the exact negative equivalent of the other, the numerator of equation (16.8) will be the negative equivalent of n − 1, so the value of r will be −1. If one variable decreases while the other increases but not all the points lie along a straight line, the value of r will be between 0 and −1.
Finally, for a set of points along any line parallel to the X axis, all of the Z scores for the Y variable will be 0, so the value of the numerator of equation (16.7) and r will also be 0.
16.5 Is the value of r statistically significant?
Once you have obtained the value of r, you need to establish whether it is significantly different to 0. Statisticians have found that the distribution of r for random samples taken from a population where there is no correlation (i.e. ρ = 0) between two variables is normally distributed with a mean of 0. Both positive and negative values of r will be generated by chance and 5% of these will be greater than a positive critical value or less than its negative equivalent. The critical value will depend on the size of the sample. Statistical packages will calculate r and give the probability the sample has been taken from a population where ρ = 0. As sample size increases, the value of r is likely to become closer to the value of ρ. Importantly, the number of degrees of freedom for the value of r is n − 2, where n is the number of pairs of bivariate data. For example, the correlation coefficient between the variables of height and weight of 20 individuals will have 18 degrees of freedom. Here you may expect the number of degrees of freedom to be n − 1, but one degree of freedom is lost when calculating the mean of the first variable, X, and another when calculating the mean of the second variable Y, in order to obtain r.
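Statistical packages do this test for you, but the role of the n − 2 degrees of freedom can be seen in a small sketch. One commonly used approach, which is not described in this chapter and is named here as an assumption, converts r to a t statistic, t = r√((n − 2)/(1 − r²)), and compares it with the critical value of t for n − 2 degrees of freedom. The values of `r` and `n` below are hypothetical, and the critical value of 2.101 is the two-tailed 5% value of t for 18 degrees of freedom taken from standard tables.

```python
import math


def t_from_r(r, n):
    """Convert a sample correlation to a t statistic with n - 2 degrees of freedom."""
    df = n - 2  # one df lost for the mean of X and one for the mean of Y
    return r * math.sqrt(df / (1 - r ** 2)), df


r, n = 0.5, 20        # hypothetical correlation from 20 pairs of bivariate data
t, df = t_from_r(r, n)

T_CRIT_18 = 2.101     # two-tailed 5% critical value of t for 18 df (from tables)
significant = abs(t) > T_CRIT_18

print(f"t = {t:.3f} with {df} df; significant at the 5% level: {significant}")
```

The same value of r from a smaller sample would have fewer degrees of freedom and a larger critical value, which is why the probability reported with r depends on the number of pairs of data.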
16.6 Assumptions of linear correlation
Linear correlation analysis assumes that the data are random representatives taken from the larger population of values for each of the two variables, which are normally distributed and have been measured on ratio, interval or ordinal scales. A scatter plot of these variables will have what is called a bivariate normal distribution. If the data are not normally distributed or the relationship does not appear to be linear, they may be able to be analysed by a non-parametric test for correlation (Chapter 21).
16.7 Summary and conclusion
Correlation is an exploratory technique used to test whether two variables are related. It is often useful to draw a scatter plot of the data to see if there is any pattern before calculating the correlation coefficient, because the variables may be related in a non-linear way. The Pearson correlation coefficient, r, is a statistic that shows the extent to which two variables are linearly related, and can have a value between −1 and +1, with these extremes showing a perfect negative linear relationship and perfect positive linear relationship respectively, while an r of 0 shows no relationship. The sign of r (i.e. positive or negative) indicates the way in which the variables are related, but the probability of getting a particular r value is needed to decide whether the correlation is statistically significant.
16.7.1 Reporting the results of a correlation analysis
A correlation analysis is reported as the value of the Pearson correlation coefficient with the number of degrees of freedom given as a subscript and followed by the probability. Importantly, as noted in Section 16.5, the number of degrees of freedom for r is n − 2, where n is the number of pairs of bivariate data. For example, a significant positive linear correlation between the variables of age and weight measured on each of 20 blowfly larvae could be reported as ‘There was a significant positive correlation between the age and weight of blowfly larvae: $r_{18}$ = 0.93, P < 0.001.’ For a negative correlation, you might report ‘There was a significant negative correlation between activity and age of blowfly larvae: $r_{18}$ = −0.73, P < 0.001.’
16.8 Questions
(1) (a) Add appropriate words to the following sentence to specify a regression analysis. ‘I am interested in finding out whether the shell length of the snail Littoraria articulata.................................................. shell length.’ (b) Add appropriate words to the following sentence to specify a correlation analysis. ‘I am interested in finding out whether the shell length of the snail Littoraria articulata..................... shell length.’
(2) A useful way to understand how correlation works is to modify a set of contrived data. Run a correlation analysis on the following set of ten bivariate data, given as the values of (X,Y) for each unit: (1,5) (2,6) (3,4) (4,5) (5,5) (6,4) (7,6) (8,5) (9,6) (10,4). (a) What is the value of the correlation coefficient? (You might draw a scatter plot of the data to help visualise the relationship.) (b) Next, modify some of the Y values only to give a highly significant positive correlation between X and Y. Here a scatter plot might help you decide how to do this. (c) Finally, modify some of the Y values only to give a highly significant negative correlation between X and Y.
17
Regression
17.1
Introduction
This chapter explains how simple linear regression analysis describes the functional (linear) relationship between a dependent and an independent variable, followed by an introduction to some more complex regression analyses which include fitting curves to non-linear relationships and the use of multiple linear regression to simultaneously model the relationship between a dependent variable and two or more independent ones. If you are using this book for an introductory course, you may only need to study the sections on simple linear regression, but the material in the latter part will be a very useful bridge to more advanced courses and texts. The uses of correlation and regression were contrasted in Chapter 16. Correlation examines if two variables are related. Regression describes the functional relationship between a dependent variable (which is often called the response variable) and an independent variable (which is often called the predictor variable).
17.2
Simple linear regression
Linear regression analysis is often used by life scientists. For example, the equation for the regression of one variable on another may suggest hypotheses about why they are functionally related. More practically, regression can be used in situations where the dependent variable is difficult, expensive or impossible to measure, but its values can be predicted from another easily measured variable to which it is functionally related. Here is an example. There is considerable variation in the height of adult humans, and the parents of a child who is relatively short for its age often become
Figure 17.1 An example of the use of regression. The additional height by which a child will grow (the dependent or response variable) can be accurately predicted from the length of uncalcified bone remaining in the fingers at the age of six years (the independent or predictor variable). The additional height is determined by and easy to predict from the independent variable, but is not caused by it.
concerned their child will become a relatively short adult. It has been shown that the amount of extra height by which a person will grow can be predicted from the length of uncalcified cartilaginous bone remaining in the bones of their fingers, which can be accurately measured from an X-ray of their hands at quite a young age (e.g. six years). Additional growth is therefore dependent upon (but not caused by) the length of uncalcified bone remaining in the fingers and can be predicted from it by using a regression line (Figure 17.1) and used to estimate the child’s adult height. This is another deliberate example where the dependent variable (additional height) is not caused by the independent variable (the length of uncalcified bone in a six-year-old child’s fingers) but is plausibly dependent on it. Incidentally, it is easy to make a person grow taller by administering human growth hormone, but this treatment becomes less and less effective after the age of ten and ineffective after about the age of 17, and has to be used with caution because even small amounts of hormone can cause a considerable increase in growth. If the predicted height is considered unacceptably short by the parents, they may ask for their child to be given growth hormone. A linear regression analysis gives an equation for a line that describes the functional relationship between two variables and tests whether the statistics that describe this line are significantly different to 0.
The simplest functional relationship between a dependent and independent variable is a straight line. Only two statistics, the intercept a (which is the value of Y when X is 0 and therefore the point where the regression line intercepts the Y axis) and the slope of the line b, are needed to uniquely describe where that line occurs on a graph. These statistics are called the regression coefficients. The position of any point on a straight line can be described by the equation:
$$Y_i = a + bX_i \qquad (17.1)$$
where ‘a’ is the value of Y when X = 0 and b is the slope of the line. For example, the equation Y = 6 + 0.5X means ‘The Y value is six units plus half the value of X.’ Therefore, for this line, when X = 0, Y = 6, and when X = 10, Y = 11. Simple linear regression analysis gives an equation for a straight line that is the ‘best fit’ through a set of data in a two-dimensional scatter plot. It is very easy to obtain a and b if all the points lie on a straight line. When the points are scattered, which they usually are for biological data, the method for obtaining these statistics is also straightforward.
17.3
Calculation of the slope of the regression line
The slope of the regression line is the amount by which the value of Y increases in relation to an increase in the value of X. For example, if an increase in the value of X by one unit is also accompanied by a one unit increase in the value of Y, it will give a regression line with a slope of 1. If, however, the value of Y decreases by three units for every one unit increase in X, the line will have a slope of −3. If all points lie along a straight line, you can calculate the slope by taking any two points and using the equation:
$$b = \frac{Y_2 - Y_1}{X_2 - X_1} \qquad (17.2)$$
which divides the relative change in Y by the relative change in X (Figure 17.2). Equation (17.2) will not work for a set of points that are scattered. To calculate the slope of the line of best fit running through a set of scattered
Figure 17.2 Calculation of the slope when all points lie along a straight line. The vertical arrow shows the relative change in Y from $Y_1$ to $Y_2$ that occurs with an increase in X from $X_1$ to $X_2$ (the horizontal arrow). For any two points, $Y_2 - Y_1$ divided by $X_2 - X_1$ will give the slope, which in this case is positive because Y increases as X increases and vice versa.
points, a procedure is needed that gives the average slope, which takes into account the values for all of the points. The equation for calculating b, the slope of the regression line, is:
$$b = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})(X_i - \bar{X})} \qquad (17.3)$$
This is an extension of equation (17.2). Instead of calculating the change in X and Y from any two points, equation (17.3) calculates an average slope using every point in the data set. First, the means of X and Y are separately calculated. Then, for each point, the value of X minus its mean is multiplied by the value of Y minus its mean and these values summed. This is the numerator of equation (17.3), which is then divided by the sum, over all points, of the squared difference between each value of X and its mean. It is easy to see how this equation will give an appropriate average value for the slope. The first examples are for points that lie on straight lines. For a line with a slope of +1, as X increases by one unit from its mean the value of Y will also increase by one unit from its mean (and vice versa if X decreases). The difference between any value of X and its mean will always be the same as the difference between any value of Y and its mean, so the
Figure 17.3 Examples of the use of equation (17.3) to obtain the slope of a regression line. Vertical arrows show $Y_i - \bar{Y}$ and horizontal arrows show $X_i - \bar{X}$. The dashed lines show $\bar{X}$ and $\bar{Y}$. (a) For every point along a line with a slope of 1, $Y_i - \bar{Y}$ will be the same magnitude and sign as $X_i - \bar{X}$, so equation (17.3) will give a value of 1. (b) For every point along a line with a slope of 3, $Y_i - \bar{Y}$ will be the same sign but three times greater than $X_i - \bar{X}$ so equation (17.3) will give a value of 3. (c) For every point along a line with a slope of −1, $Y_i - \bar{Y}$ will be the same magnitude but the opposite sign of $X_i - \bar{X}$, so equation (17.3) will give a value of −1. (d) For a slope of 0, each value of $Y_i - \bar{Y}$ will be 0, so equation (17.3) will give a value of 0.
numerator and denominator of equation (17.3) will be the same and therefore give a b of 1 (Figure 17.3(a)). For a line with a slope of +3, as X increases by one unit from its mean the value of Y will increase by three units from its mean (and vice versa if X decreases). Therefore, the value of the numerator of equation (17.3) will always be three times the size of the denominator no matter how many points are included, thus giving a b of 3 (Figure 17.3(b)).
For a line with a slope of −1, as X increases by one unit from its mean the value of Y will decrease by one unit from its mean (and vice versa if X decreases). Therefore, the numerator of equation (17.3) will give a total that is the same magnitude but the negative of the denominator, thus giving a b of −1 (Figure 17.3(c)). For a line with a slope of −3, as X increases by one unit from its mean the value of Y will decrease by three units from its mean (and vice versa if X decreases), so the numerator of equation (17.3) will always have a negative sign and be three times the value of the denominator, thus giving a b of −3. Finally, for a line running parallel to the X axis, every value of $Y_i - \bar{Y}$ will be 0, so the total of the numerator of equation (17.3) will also be 0, thus giving a b of 0 (Figure 17.3(d)). When the data are scattered, equation (17.3) will also give the average change in Y in relation to the increase in X. Figure 17.4 gives an example. First, Figures 17.4(a), (b) and (c) show three lines, each of which has been drawn through two data points. These lines have slopes of 3, 2 and 1 respectively, and the calculation of each value of b is given in the box under the graph. In Figure 17.4(d), the six data points have been combined. Intuitively, this group of six scattered points should have a slope of 2 because this is the average of the slopes of the three lines shown in (a), (b) and (c). Equation (17.3) gives this value.
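The cases described above can be checked numerically. The sketch below applies equation (17.3) to points on lines of slope 1, 3, −1 and 0, and then to six combined points. All of the data values are hypothetical and are not those plotted in Figures 17.3 and 17.4; in particular, the three pairs in the combined case were chosen to share the same two X values, which is what makes the combined slope equal the simple average of 3, 2 and 1 here.

```python
def slope(xs, ys):
    """Slope b of the line of best fit, following equation (17.3)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Numerator: sum of (X - mean of X) * (Y - mean of Y) over all points.
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Denominator: sum of (X - mean of X) squared over all points.
    denominator = sum((x - mean_x) ** 2 for x in xs)
    return numerator / denominator

xs = [1, 2, 3, 4, 5]                      # hypothetical X values
print(slope(xs, [x for x in xs]))         # points on a line of slope +1
print(slope(xs, [3 * x for x in xs]))     # points on a line of slope +3
print(slope(xs, [-x for x in xs]))        # points on a line of slope -1
print(slope(xs, [4 for _ in xs]))         # line parallel to the X axis

# Hypothetical scattered case: three two-point lines of slope 3, 2 and 1,
# each spanning X = -1 to X = 1, combined into one set of six points.
combined_x = [-1, 1, -1, 1, -1, 1]
combined_y = [-3, 3, -2, 2, -1, 1]
print(slope(combined_x, combined_y))
```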
17.4
Calculation of the intercept with the Y axis
The intercept of the regression line with the Y axis when X = 0 is easy to calculate using an extension of the regression equation. Since:
$$Y_i = a + bX_i \qquad (17.4,\ \text{copied from } 17.1)$$
then:
$$\bar{Y} = a + b\bar{X} \qquad (17.5)$$
which can be rearranged to give the value of a from:
$$a = \bar{Y} - b\bar{X} \qquad (17.6)$$
and statistical packages will do this as part of a regression analysis.
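As an illustration of equation (17.6), the sketch below recovers the intercept for points generated from the line Y = 6 + 0.5X used as an example in Section 17.2. The choice of X values (0 to 10) and the variable names are assumptions made for the illustration only.

```python
# Points lying exactly on the example line Y = 6 + 0.5X.
xs = list(range(0, 11))            # assumed X values 0, 1, ..., 10
ys = [6 + 0.5 * x for x in xs]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Slope from equation (17.3).
b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
     / sum((x - mean_x) ** 2 for x in xs))

# Intercept from equation (17.6): a = mean of Y minus b times mean of X.
a = mean_y - b * mean_x

print(f"b = {b:.3f}, a = {a:.3f}")
```

Because the points lie exactly on the line, the sketch should return the slope and intercept that were used to generate them.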
Figure 17.4 (a), (b) and (c) show three lines of slope 3, 2 and 1 respectively, with two data points on each. (d) The six points have been combined and the line of best fit through these would be expected to have a slope of 2. Equation (17.3) gives this value for b.
17.5
Testing the significance of the slope and the intercept
Although the equation for a regression line describes the functional relationship between X and Y, it does not show whether the slope of the line and the intercept are significantly different to 0. For a population, the equation for the line of best fit is:
$$Y_i = \alpha + \beta X_i \qquad (17.7)$$
but because life scientists usually only have data for a sample, the population statistics α and β are only estimated by the sample statistics a and b. Therefore, you need to test the null hypotheses that a and b are from a population where α and β are 0. Please note that you will find different symbols for the intercept and slope in some texts. Introductory texts generally use a and b (and for a population α and β), but more advanced texts
use $b_0$ for the intercept and $b_1$ for the slope (and $\beta_0$ and $\beta_1$ for the equivalent population statistics). Here I have used the same symbols as most introductory texts for clarity.
17.5.1 Testing the hypothesis that the slope is significantly different to 0
One method for testing whether the slope of a regression line is significantly different to a slope of 0 is very similar to the way a single-factor ANOVA tests for a difference among means (Chapter 11). A pictorial explanation is given in Figures 17.5(a)–(d) and 17.6(a)–(d).
Figure 17.5 (a) For a population where there is no relationship between X and Y, the slope, β, of the regression line will be 0 and the same as the line showing the grand mean $\bar{Y}$. Note that not all points are exactly on the regression line because of individual variation or ‘error’. (b) to (d) Three random samples, each of n = 4, have been taken from the population shown in (a). For each sample not all points fit exactly on the regression line because of individual variation or ‘error’ and this is also why the regression lines for the three samples also have slopes, b, that are not exactly 0.
The amount by which the regression line is tilted from the horizontal can be detected in the same way a single-factor ANOVA detects whether several treatment means are all similar to the grand mean, or whether any are significantly displaced from it. In Chapter 11, it was described how a single-factor ANOVA calculates an F ratio by dividing the mean square for treatment (i.e. treatment + error) by the mean square for error only. If treatment has no effect, the treatment means will be the same or close to the grand mean, so the F ratio will be close to 1. The test for whether the slope of a regression line is significantly different to the horizontal line showing $\bar{Y}$ is done in a similar way. First, consider the case where there is no relationship between X and Y, so the slope, β, for the population of bivariate data is 0 (Figure 17.5(a)). If you took several separate samples from that population, you would expect
most to have slopes of, or close to, 0, but a small proportion would, simply by chance, have marked positive or negative slopes. Three samples, each of n = 4, that have been taken from that population are shown in Figure 17.5(b)–(d). Now, ask yourself ‘Why is there variation among the individuals in the population and the four individuals within each of the three samples shown in Figure 17.5(b)–(d)? Why isn’t the value for each individual exactly on the regression line?’ The answer is ‘Because of individual variation that we can’t do anything about, called error.’ Individuals rarely have exactly the same value for a variable and this inherent and uncontrollable variation is often called ‘error’. Next, ask yourself ‘Why is the slope of the regression line for each of the three sample scatter plots in Figures 17.5(b)–(d) slightly different to 0? Why isn’t the slope exactly 0 as it is for the population?’ Here too, the answer is ‘Because of individual variation within each sample that we can’t do anything about, called error.’ If all individuals had exactly the same value for the dependent variable, the slope of the regression line for a sample would be 0. Second, consider a case where there is a strong relationship between X and Y, so the regression line for the population has a marked (positive or negative) slope. Random samples taken from that population are likely to have similar slopes, with some variation from the ‘true’ slope of the population. Figure 17.6 shows the scatter plot for the population and for three random samples, each of n = 4, taken from it. The four individuals in each sample are still scattered around the regression line, but each of these lines is now strongly tilted. Now, again ask yourself why the values for the individuals are scattered around the regression lines for the population and for each sample.
The answer is the same as before ‘Because of individual variation that we can’t do anything about, which is called error.’ Next, think about why the regression line for the population and those for each sample are tilted from the horizontal dashed line showing the grand mean, Y, which has a slope of 0. Here, the slope of each regression line is not just caused by error – it is also affected by the relationship between X and Y. This gives a very easy way to calculate a statistic that shows how much the tilt of the regression line from one of 0 slope is affected by the relationship between X and Y, over and above that expected because of natural variation (i.e. error).
Figure 17.6 (a) For a population where there is a significant relationship between X and Y, the slope, β, of the regression line will be different to 0 and therefore tilted from the line showing the grand mean $\bar{Y}$. This example shows a positive slope. Note that not all points are exactly on the regression line because of individual variation or ‘error’. (b) – (d) Three random samples, each of n = 4, have been taken from the population. For each sample, not all points fit exactly on the regression line because of individual variation or ‘error’. Here, however, the regression lines for the three samples also have positive slopes, b, and are tilted from the horizontal line showing $\bar{Y}$ because of the effects of both the relationship between X and Y, plus error.
First, each of the points in the scatter plot will be displaced upwards or downwards from the regression line because of the unexplained variation (error only). Second, the regression line will be tilted from the line showing $\bar{Y}$ because of the variation explained by the regression plus the unexplained variation (regression plus error). It is easy to calculate the sums of squares and mean squares for these two separate sources of variation. Figures 17.7(a) and (b) show scatter plots for two sets of data. The first regression line (17.7(a)) has a slope close to 0 and the second (17.7(b)) has a positive slope. The horizontal dashed line on each graph shows $\bar{Y}$.
Figure 17.7 (a) The diagonal solid line shows the regression through a scatter plot of six points, and the dashed horizontal line shows $\bar{Y}$. The vertical arrow shows the displacement of one point, symbolised by a filled circle instead of a square, from $\bar{Y}$. The distance between the point and the Y average $(Y - \bar{Y})$ is the total variation, which can be partitioned into variation explained by the regression line and unexplained variation or error. The heavy part of the vertical line $(\hat{Y} - \bar{Y})$ shows the displacement explained by the regression line (regression plus error) and the remainder $(Y - \hat{Y})$ is unexplained variation (error). (a) When the slope is close to 0, the explained component is very small. (b) When the slope is large, the explained component is also very large.
Here you need to think about the vertical displacement of each point from the line showing $\bar{Y}$. To illustrate this, the point at the top far right of each scatter plot in Figures 17.7(a) and (b) has been identified by a circle instead of a square.
The vertical arrow running up from $\bar{Y}$ to each of the circled points $(Y - \bar{Y})$ indicates the total variation or displacement of that point from $\bar{Y}$. This distance can be partitioned into two sources of variation. The first is the amount of displacement explained by the regression line (which is affected by both the regression plus error) and is the distance $(\hat{Y} - \bar{Y})$ shown by the heavy part of the vertical arrow in Figures 17.7(a) and (b). The second is the distance $(Y - \hat{Y})$ shown by the lighter vertical part of the arrow in Figures 17.7(a) and (b). This is unexplained variation, or error, and often called the residual variation because it is the amount of variation remaining between the data points and $\hat{Y}$ that cannot be explained by the regression line. This gives a way of calculating an F ratio that indicates how much of the variation can be accounted for by the regression. First, you calculate the sum of squares for the variation explained by the regression line (regression plus error) by squaring the vertical distance between the regression line and $\bar{Y}$ for each point $(\hat{Y} - \bar{Y})$ and adding these together. Dividing this sum of squares by the appropriate number of degrees of freedom will give the mean square due to the explained variation (regression plus error). Second, you calculate the sum of squares for the unexplained variation (error) by squaring the vertical distance between each point and the regression line $(Y - \hat{Y})$ and adding these together. Dividing this sum of squares by the appropriate number of degrees of freedom will give the mean square due to unexplained variation or ‘error’. At this stage, you have sums of squares and mean squares for two sources of variation that will be very familiar from the explanation for single-factor ANOVA in Chapter 11:
(a) The variation explained by the regression line (regression plus error).
(b) The unexplained residual variation (error only).
Therefore, to get an F ratio that shows the proportion of the variation explained by the regression line over and above the unexplained variation due to error, you divide the mean square for (a) by the mean square for (b):
$$F_{1,\,n-2} = \frac{\text{MS regression}}{\text{MS residual}} \qquad (17.8)$$
If the regression line has a slope close to 0 (e.g. Figure 17.7(a)), both the numerator and denominator of equation (17.8) will be similar, so the value of the F statistic will be approximately 1. As the slope of the line increases (Figure 17.7(b)), the numerator of equation (17.8) will become larger, so the value of F will also increase. As F increases, the probability that the data have been taken from a population where the slope, β, of the regression line is 0 will decrease and will eventually be less than 0.05. Most statistical packages will calculate the F statistic and give the probability. Finally, you may have noticed that the F ratio in equation (17.8) has degrees of freedom of 1 and n – 2. Box 17.1 gives an explanation for the number of degrees of freedom for the F ratio in a regression analysis.
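The partitioning described in this section can be sketched in a few lines of Python. The example below uses the age and mite counts from Table 17.1 (Section 17.6) and follows equations (17.3), (17.6) and (17.8). The variable names are illustrative, and the sketch stops at the F ratio because obtaining the probability needs an F distribution from a statistical package.

```python
# Data from Table 17.1: age in years and number of mites.
ages = [3, 6, 9, 12, 15, 18, 21, 24, 27, 30]
mites = [5, 13, 16, 14, 18, 23, 20, 32, 29, 28]

n = len(ages)
mean_x = sum(ages) / n
mean_y = sum(mites) / n

# Slope (equation 17.3) and intercept (equation 17.6).
b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(ages, mites))
     / sum((x - mean_x) ** 2 for x in ages))
a = mean_y - b * mean_x

# Predicted value on the regression line for each X.
fitted = [a + b * x for x in ages]

# (a) Explained variation: squared distance from the line to the mean of Y.
ss_regression = sum((f - mean_y) ** 2 for f in fitted)
# (b) Unexplained variation: squared distance from each point to the line.
ss_residual = sum((y - f) ** 2 for y, f in zip(mites, fitted))

ms_regression = ss_regression / 1        # 1 degree of freedom
ms_residual = ss_residual / (n - 2)      # n - 2 degrees of freedom
f_ratio = ms_regression / ms_residual    # equation (17.8)

print(f"SS regression = {ss_regression:.3f}")
print(f"SS residual   = {ss_residual:.3f}")
print(f"F(1, {n - 2}) = {f_ratio:.3f}")
```

The sums of squares and F ratio printed here can be compared with the values reported in Table 17.3.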
17.5.2 Testing whether the intercept of the regression line is significantly different to 0
The value for the intercept, a, calculated from a sample is only an estimate of the population statistic α. Consequently, a positive or negative value of a might be obtained in a sample from a population where α is 0. The standard deviation of the points scattered around the regression line can be used to calculate the 95% confidence interval for a, so a single sample t test can be used to compare it to 0 or any other expected value. Once again, most statistical packages include a test of whether a differs significantly from 0.
17.5.3 The coefficient of determination $r^2$
The coefficient of determination, symbolised by $r^2$, is a statistic that shows the proportion of the total variation that is explained by the regression line. Regression analysis calculates sums of squares and mean squares for two sources of variation:
(a) The variation explained by the regression line (regression plus error).
(b) The unexplained residual variation (error only).
The coefficient of determination is the regression sum of squares divided by the total sum of squares:
$$r^2 = \frac{\text{Sum of squares explained by the regression ((a) above)}}{\text{Total sum of squares ((a) + (b) above)}} \qquad (17.9)$$
which will only ever be a number from 0 to 1. If the points all lie along the regression line and it has a slope that is different to 0, the unexplained component (quantity (b)) will be 0 and $r^2$ will be 1. If the explained sum of squares is small in relation to the unexplained, $r^2$ will be a small number.
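As a rough worked check of equation (17.9), the sums of squares reported for the mite example in Table 17.3 (Section 17.6) can be substituted directly. The arithmetic below is only an illustration using those published values:

$$r^2 = \frac{539.648}{627.600} \approx 0.860, \qquad 1 - r^2 = \frac{87.952}{627.600} \approx 0.140$$

so about 86% of the total variation is explained by the regression line and about 14% is unexplained.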
17.6
An example – mites that live in the hair follicles
The mite Demodex folliculorum is less than a millimetre long and lives in the hair follicles of humans, including those of the eyelashes. A fascinating account of the ecology of these mites, including illustrations, can be found in Andrews (1976) who notes that most adult humans have D. folliculorum living in the hair follicles of their ‘chin, nose, forehead or scalp’. These mites are acquired after birth and prefer follicles where a relatively large amount of sebum (the waxy material produced by the sebaceous gland within the follicle) is produced. A biomedical scientist hypothesised that the number of mites could be predicted by a person’s age. To test this they obtained ten volunteers, plucked 25 eyelashes at random from each eye and counted the number of mites. These bivariate data for age in years and the number of mites are given in Table 17.1. A statistical package will give values for the equation for the regression line, plus a test of the hypotheses that the intercept, a, and
Table 17.1 Data for the age of a person and the total number of mites found on 50 of their eyelashes.
Age (years)    Number of mites
 3              5
 6             13
 9             16
12             14
15             18
18             23
21             20
24             32
27             29
30             28
Table 17.2 An example of the table of results from a regression analysis. The value of the intercept a (5.733) is given in the first row, labelled ‘(Constant)’ under the heading ‘Value’. The slope b (0.853) is given in the second row (labelled as the independent variable ‘Age’) under the heading ‘Value’. The final two columns give the results of t tests comparing a and b to 0. These show the intercept, a, is significantly different to 0 (P = 0.035) and the slope b is also significantly different to 0 (P < 0.01).
Model       Value    Std error    t        Significance
Constant    5.733    2.265        2.531    0.035
Age         0.853    0.122        7.006    0.001
Table 17.3 An example of the results of an analysis of the slope of a regression, using the data in Table 17.1. The significant F ratio shows the slope is significantly different to 0.
              Sum of Squares    df    Mean square    F         Significance
Regression    539.648           1     539.648        49.086    <0.001
Residual       87.952           8      10.994
Total         627.600           9
slope, b, are from a population where α and β are 0. The output will be similar in format to Table 17.2, which gives the equation for the regression line: mites = 5.733 + 0.853 × age. The slope is significantly different to 0 (in this case, it is positive) and so is the intercept. You could use the regression equation to predict the number of mites on a person of any age between 3 and 30. Most statistical packages will also give results of an ANOVA which tests whether the slope of the regression line is significantly different to 0. For the data in Table 17.1, there is a significant relationship between mite numbers and age (Table 17.3). Finally, the value of $r^2$ is also given. Sometimes there are two values: $r^2$, which is the statistic for the sample and one called ‘Adjusted $r^2$’ which is an estimate for the population from which the sample has been taken. The $r^2$ value is usually the one reported in the results of the regression. For the example above, you would get the following values:
$$r = 0.927; \quad r^2 = 0.860; \quad \text{adjusted } r^2 = 0.842.$$
This shows that 86% of the variation in mite numbers is explained by the regression of mite numbers on age.
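For readers who want to see where the numbers in Table 17.2 and the values above come from, here is a minimal sketch using the data in Table 17.1. The standard error and adjusted $r^2$ formulas are the usual ones for simple linear regression and are an assumption here, since the text leaves these calculations to the statistical package. The variable names are illustrative only.

```python
import math

# Data from Table 17.1: age in years and number of mites.
ages = [3, 6, 9, 12, 15, 18, 21, 24, 27, 30]
mites = [5, 13, 16, 14, 18, 23, 20, 32, 29, 28]

n = len(ages)
mean_x = sum(ages) / n
mean_y = sum(mites) / n
sxx = sum((x - mean_x) ** 2 for x in ages)
syy = sum((y - mean_y) ** 2 for y in mites)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(ages, mites))

b = sxy / sxx                  # slope, equation (17.3)
a = mean_y - b * mean_x        # intercept, equation (17.6)

# Residual mean square (unexplained variation on n - 2 degrees of freedom).
ms_residual = (syy - b * sxy) / (n - 2)

# Assumed standard formulas for the standard errors of b and a.
se_b = math.sqrt(ms_residual / sxx)
se_a = math.sqrt(ms_residual * (1 / n + mean_x ** 2 / sxx))
t_b = b / se_b                 # t statistic comparing b to 0
t_a = a / se_a                 # t statistic comparing a to 0

r_squared = (b * sxy) / syy    # equation (17.9)
r = math.copysign(math.sqrt(r_squared), b)
# Assumed formula for adjusted r squared with one predictor.
adjusted_r_squared = 1 - (1 - r_squared) * (n - 1) / (n - 2)

print(f"a = {a:.3f} (SE {se_a:.3f}, t = {t_a:.3f})")
print(f"b = {b:.3f} (SE {se_b:.3f}, t = {t_b:.3f})")
print(f"r = {r:.3f}, r2 = {r_squared:.3f}, adjusted r2 = {adjusted_r_squared:.3f}")
```

The probabilities in the ‘Significance’ column are not reproduced because they require the t distribution with n − 2 degrees of freedom, which a statistical package provides.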
17.7
Predicting a value of Y from a value of X
Since the regression line gives the average slope through a set of scattered points, the predicted value of Y is only the average expected for a given value of X. If the $r^2$ value is 1, the value of Y will be predicted without error because all the data points will lie on the regression line, but usually they will be scattered around it. More advanced texts (e.g. Sokal and Rohlf, 1995; Zar, 1999) describe how you can calculate the 95% confidence interval for a value of Y and thus predict its likely range.
17.8
Predicting a value of X from a value of Y
Often you might want to estimate a value of the independent variable X from the dependent variable Y. Here is an example. The concentration of sugar in fruit can only be directly measured by damaging the fruit, which makes it unsuitable for sale. Sugar content varies among fruit from the same plant and from the same farm. Fruit relatively high in sugar usually taste sweeter and can often be sold for a higher price, so it would be advantageous to identify individual pieces of fruit with the highest sugar concentration without damaging them before sale. It has been shown that the amount of infra-red light reflected from the surface of certain fruit such as melons and tomatoes is significantly dependent on the sugar concentration. Therefore, if sugar concentration could be predicted from the amount of infra-red light reflected from the fruit it could be estimated without damage. In this case, it is not appropriate to designate sugar concentration as the dependent variable and calculate a regression equation because it clearly does not depend on the amount of infra-red light reflected from the fruit, so one of the assumptions of regression would be violated. Predicting X from Y can be done by rearranging the regression equation for any point from:
$$Y_i = a + bX_i \qquad (17.10)$$
to:
$$X_i = \frac{Y_i - a}{b} \qquad (17.11)$$
Box 17.1 A note on the number of degrees of freedom in an ANOVA of the slope of the regression line
The example in Section 17.5.1 includes an ANOVA table with an F statistic and probability for the significance of the slope of the regression line. Note that the ‘regression’ mean square, which is equivalent to the ‘treatment’ mean square in a single-factor ANOVA, has only one degree of freedom. This is the case for any regression analysis, irrespective of the sample size. In contrast, for a single-factor ANOVA, the number of degrees of freedom for the treatment MS is one less than the number of treatments. This difference needs explaining. For a single-factor ANOVA, all but one of the treatment means are free to vary, but the value of the ‘final’ one is constrained because the grand mean is a set value. Therefore, the number of degrees of freedom for the treatment mean square is always one less than the number of treatments. In contrast, for any regression line, every value of $\hat{Y}$ must (by definition) lie on the line. Therefore, for a regression line of known slope, once the first value of $\hat{Y}$ has been plotted the remainder are no longer free to vary since they must lie on the line, so the regression mean square has only one degree of freedom. The number of degrees of freedom for error in a single-factor ANOVA is the sum of one less than the number within each of the treatments. Since a degree of freedom is lost for every treatment, if there are a total of n replicates (the sum of the replicates in all treatments) and k treatments, the error degrees of freedom are n – k. In contrast, the degrees of freedom for the residual (error) variation in a regression analysis of bivariate data are always n – 2. This is because the regression line, which only ever has one degree of freedom, is always only equivalent to an experiment with two treatments.
C:/ITOOLS/WMS/CUP-NEW/2649240/WORKINGFOLDER/MCKI/9781107005518C17.3D
262
262 [244–283] 29.8.2011 4:16PM
Regression
Figure 17.8 It is risky to use a regression line to extrapolate values of Y beyond the
measured range of X. (a) The regression line based on the data for values of X ranging from 1 to 5 does not necessarily give an accurate prediction (b) of the values of Y beyond that range. (c) The regression line for a population of 19 individuals. (d) A regression line extrapolated from only a few points in a small part of the range may not give a good estimate of the regression line for the population.
but here too the 95% confidence interval around the estimated value of X must also be calculated because the measurement of Y is likely to include some error. Methods for doing this are given in more advanced texts (e.g. Sokal and Rohlf, 1995).
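The rearrangement in equation (17.11) is a one-line calculation, and a minimal Python sketch may make it concrete. The function name `predict_x` is hypothetical, and the coefficients are those of the mite regression quoted in Section 17.10.1 (mites = 5.733 + 0.853 × age), used here only as an illustration. The sketch gives a point estimate only and does not attempt the 95% confidence interval, for which the more advanced texts should be consulted.

```python
def predict_x(y, a, b):
    """Estimate X from an observed Y by rearranging Y = a + bX (equation 17.11)."""
    if b == 0:
        raise ValueError("slope b is 0, so X cannot be estimated from Y")
    return (y - a) / b


# Illustrative coefficients only: mites = 5.733 + 0.853 * age
a, b = 5.733, 0.853
estimated_age = predict_x(10.0, a, b)  # age at which the line predicts 10 mites
print(round(estimated_age, 2))
```

Substituting the estimate back into Y = a + bX should return the original value of Y, which is a quick check that the rearrangement has been applied correctly.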
17.9 The danger of extrapolation
Although regression analysis draws a line of best fit through a set of data, it is dangerous to make predictions beyond the measured range of X. Figures 17.8(a)–(d) illustrate that a predicted regression line may not correctly estimate the value of Y outside this range.
Table 17.4 Original data and calculated values of Ŷ for the regression equation Y = 10.824 + 0.901X. The value of the residual for each point is its vertical displacement from the regression line. Each residual is plotted on the Y axis against the original value of X for that point to give a graph showing the spread of the points about a line of 0 slope and intercept.

Original data       Calculated value of Ŷ        Data for the plot of residuals
X      Y            from regression equation     Value of X     Value of (Y − Ŷ)
1      13           11.725                       1               1.275
3      12           13.527                       3              −1.527
4      14           14.428                       4              −0.428
5      17           15.329                       5               1.671
6      17           16.230                       6               0.770
7      15           17.131                       7              −2.131
8      17           18.032                       8              −1.032
9      21           18.933                       9               2.067
10     20           19.834                       10              0.166
11     19           20.735                       11             −1.735
12     21           21.636                       12             −0.636
14     25           23.438                       14              1.562
17.10 Assumptions of linear regression analysis

The procedure for linear regression analysis described in this chapter is often called a Model I regression, and makes several assumptions.

First, the values of Y are assumed to be from a population that is normal and evenly distributed about the regression line, with no gross heteroscedasticity. One easy way to check for this is to plot a graph showing the residuals. For each data point, its vertical displacement on the Y axis either above or below the fitted regression line is the amount of residual variation that cannot be explained by the regression line, as described in Section 17.5.1. The residuals are calculated by subtraction (Table 17.4) and plotted on the Y axis against the values of X for each point, and will always give a plot where the regression line is re-expressed as a horizontal line with an intercept of 0. If the data points are uniformly scattered about the original regression line, the scatter plot of the residuals will be evenly dispersed in a band above and below 0 (Figures 17.9(a)–(d)). If there is heteroscedasticity, the band will vary in width as X increases or decreases. Most statistical packages will give a plot of the residuals for a set of bivariate data.

Second, it is assumed the independent variable X is measured without error. This is often difficult and many texts note that X should be measured with little error. For example, variables such as air temperature or the circumference of a tree trunk one metre above the ground can usually be measured with very little error indeed. In contrast, the estimated volume of the canopy of a tree or the wet weight of a live frog are likely to be measured with a great deal of error. When the independent variable is subject to appreciable error, a different analysis called Model II regression is appropriate. Again, this is described in more advanced texts.

Third, it is assumed that the dependent variable is determined by the independent variable. This was discussed when regression analysis was introduced and contrasted with correlation in Section 16.2.

Fourth, the relationship between X and Y is assumed to be linear and it is important to be confident of this before carrying out the analysis. A scatter plot of the data should be drawn to look for any obvious departures from linearity. In some cases it may be possible to transform the Y variable (see Chapter 14) to give a linear relationship and proceed with a simple regression analysis on the transformed data. Another possibility is to attempt to fit a more complex regression equation to the data (see Section 17.11).
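The residual check described under the first assumption can be sketched in a few lines of Python. This is a minimal illustration, assuming the data in Table 17.4 and an ordinary least-squares fit; the helper name `fit_line` is hypothetical and only the standard library is used.

```python
# Data from Table 17.4
x = [1, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 14]
y = [13, 12, 14, 17, 17, 15, 17, 21, 20, 19, 21, 25]


def fit_line(x, y):
    """Return intercept a and slope b of the least-squares line Y = a + bX."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


a, b = fit_line(x, y)
predicted = [a + b * xi for xi in x]                    # points on the line
residuals = [yi - pi for yi, pi in zip(y, predicted)]   # vertical displacements

print(f"Y = {a:.3f} + {b:.3f} X")
for xi, ri in zip(x, residuals):
    print(f"X = {xi:2d}  residual = {ri:6.3f}")
```

Plotting `residuals` against `x` with any graphics package should give a graph of the kind shown in Figure 17.9(b). Small differences from the third decimal place of Table 17.4 can arise because the table uses the rounded coefficients.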
17.10.1 Reporting the results of a linear regression

The results of a regression analysis include whether the slope of the regression line is significantly different to 0, the regression equation, the value of $r^2$ and, if of interest, the result of a test for whether the intercept differs from 0. The example in Section 17.6 could be reported as ‘The relationship between the number of mites found on a person’s eyelashes and their age appeared to be linear. Regression analysis showed mite numbers increased significantly with age ($F_{1,8}$ = 49.09, P < 0.001); mites = 5.733 + 0.853 × age; $r^2$ = 0.86.’ A non-significant relationship could be reported as ‘Linear regression analysis showed no significant relationship between the number of mites found on a person’s eyelashes and their age ($F_{1,8}$ = 1.13, NS).’ It can be helpful to include a graph showing the relationship.

Figure 17.9 (a) Plot of original data in Table 17.4 with fitted regression line Y = 10.824 + 0.901X. (b) The plot of the residuals (Y − Ŷ) against the value of X for each data point shows a relatively even scatter about the horizontal line. (c) General form of residual plot for data that are homoscedastic. (d) Residual plot showing one example of heteroscedasticity, where the variance of the residuals decreases with increasing X.
17.11 Curvilinear regression

Often the relationship between the dependent variable and independent variable is not linear. One way of modelling a more complex relationship is to expand the simple linear regression, equation (17.4), by adding additional constants and powers of X. These equations are called polynomials of increasing degree. The simple linear regression equation is a first-degree polynomial that gives a straight line relationship:

$$Y = a + b_1 X \qquad (17.12,\ \text{modified from 17.4})$$

This can be expanded to a second-degree (quadratic) polynomial, which gives a line with one change of direction, by adding a second constant, $b_2$, that is multiplied by the square of X:

$$Y = a + b_1 X + b_2 X^2 \qquad (17.13)$$

and a third-degree (cubic) polynomial that gives a line with two changes of direction, by adding a third constant multiplied by the cube of X:

$$Y = a + b_1 X + b_2 X^2 + b_3 X^3 \qquad (17.14)$$

and a fourth-degree (quartic) polynomial that gives a line with three changes in direction:

$$Y = a + b_1 X + b_2 X^2 + b_3 X^3 + b_4 X^4 \qquad (17.15)$$
and so on, for additional constants (b) and increasing powers of X. As the number of constants and powers of X increases, the regression line will become a better and better fit to the data. Eventually, when the degree of the polynomial is one less than the number of points in the data set (and each point has a different value of X), the equation will run through all of them (and thus be a perfect fit), but this is unlikely to be useful for anything except a very small data set, because the equation will be extremely long and complex. Often, however, a good
approximation of a relationship can be achieved by using only a second- or third-degree polynomial, which can be tested for significance, easily visualised and used for interpolation and cautious prediction.

Here is an example of how a more complex regression equation can give a better fit to a set of data. Naturally occurring dead bodies of terrestrial vertebrates (e.g. wild birds, mice, deer, pigs) usually contain enormous numbers of maggots, which are the larvae of flies that have laid eggs on the cadaver. The maggots feed on decomposing flesh and when fully grown they leave the cadaver and pupate. There is great scientific interest in maggots found in cadavers because the species and their developmental stage can often be used by forensic entomologists to estimate how long a human cadaver has been exposed to flies, which may help estimate the date of death or how long a body has been in a particular place. Table 17.5 and Figures 17.10(a) and (b) give the number of larvae of the central Queensland cadaver fly, Sarcophaga cadaverus, in 20 rat carcasses for up to ten days after death.

Table 17.5 The number of maggots of Sarcophaga cadaverus in 20 rat carcasses and the number of days since death.

Days since death    Number of maggots
1                   8
1                   24
2                   123
2                   153
3                   254
3                   272
4                   345
4                   340
5                   372
5                   376
6                   370
6                   362
7                   334
7                   327
8                   216
8                   201
9                   91
9                   84
10                  0
10                  1

Figure 17.10 The number of maggots of Sarcophaga cadaverus in 20 rat carcasses in relation to the number of days since death. (a) A simple linear regression line is clearly not a good fit to the points. (b) The residuals are not evenly distributed about the regression line (which is such a poor fit that the shape of the residual plot is very similar to the original data).

The relationship is obviously not linear, so it is not surprising that a linear model using equation (17.12) does not show a significant relationship between the number of larvae and time (ANOVA: $F_{1,18}$ = 0.192, NS) and the regression line, Y (maggots) = 239.63 − 4.91 X (time), is an extremely poor fit (Figure 17.10(a)) with an $r^2$ of only 0.011. A plot of the residuals (Figure 17.10(b)) confirms the use of linear regression is inappropriate because the data do not occur in a band around the horizontal line for 0 residual variation.

In contrast, a quadratic model: $Y = -168.74 + 199.28X - 18.56X^2$ fitted to the data is highly significant ($F_{2,17}$ = 361.33, P < 0.001), and a far better fit than the linear model, with an $r^2$ of 0.977 (Figure 17.11(a)). A plot of the residuals (Figure 17.11(b)) shows that the regression appears to be a very good fit to the data.

Figure 17.11 The number of maggots of Sarcophaga cadaverus in 20 rat carcasses in relation to the number of days since death. (a) A quadratic regression (heavy line) is a good fit to the data. (b) The residuals are fairly evenly dispersed about the line.

Finally, a cubic model: $Y = -209.27 + 235.23X - 26.36X^2 + 0.47X^3$ is also significant ($F_{3,16}$ = 270.68, P < 0.001) with an $r^2$ of 0.981, indicating a slight improvement over the quadratic and a very good fit to the data. The residuals are consistent with this (Figures 17.12(a) and (b)).
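The three fits discussed above can be reproduced with a short Python sketch. It assumes the data in Table 17.5 and ordinary least squares solved through the normal equations; the helper names `solve` and `fit_polynomial` are hypothetical, and a statistical package would normally be used instead of hand-written code like this.

```python
# Data from Table 17.5: days since death and number of maggots per carcass
days = [1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10, 10]
maggots = [8, 24, 123, 153, 254, 272, 345, 340, 372, 376,
           370, 362, 334, 327, 216, 201, 91, 84, 0, 1]


def solve(matrix, rhs):
    """Solve a small linear system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [r] for row, r in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    coef = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(aug[r][c] * coef[c] for c in range(r + 1, n))
        coef[r] = (aug[r][n] - known) / aug[r][r]
    return coef


def fit_polynomial(x, y, degree):
    """Least-squares polynomial fit; returns [a, b1, b2, ...] and r squared."""
    powers = range(degree + 1)
    # Normal equations (X'X) coef = X'y, where the columns of X are 1, x, x^2, ...
    xtx = [[sum(xi ** (i + j) for xi in x) for j in powers] for i in powers]
    xty = [sum((xi ** i) * yi for xi, yi in zip(x, y)) for i in powers]
    coef = solve(xtx, xty)
    mean_y = sum(y) / len(y)
    ss_total = sum((yi - mean_y) ** 2 for yi in y)
    fitted = [sum(c * xi ** p for p, c in enumerate(coef)) for xi in x]
    ss_error = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    return coef, 1 - ss_error / ss_total


results = {}
for name, degree in [("linear", 1), ("quadratic", 2), ("cubic", 3)]:
    coef, r_squared = fit_polynomial(days, maggots, degree)
    results[name] = (coef, r_squared)
    print(name, [round(c, 2) for c in coef], round(r_squared, 3))
```

The normal equations are adequate for a low-degree polynomial over a small range of X such as this one, but they become numerically unreliable for high degrees, which is one more reason to prefer a statistical package for real analyses.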
Figure 17.12 The number of maggots of Sarcophaga cadaverus in 20 rat carcasses in relation to the number of days since death. (a) A cubic regression is a good fit to the points, but appears little better than the quadratic in Figure 17.11. (b) The residuals are fairly evenly dispersed about the line.
In summary, both the quadratic and cubic polynomials are very good fits to the data. There is a significant non-linear relationship between the number of maggots and time since death. Even though the regression line is a good fit to the data, it cannot be used to predict maggot numbers more than ten days after death, because the fitted curve gives negative values there and negative numbers of maggots cannot exist. I have deliberately chosen this example to emphasise the danger of predicting beyond the measured limits of a set of data, and the caution applies even when the predicted values seem realistic. The linear model, with an $r^2$ of only 0.011, is a very poor fit to the data. In contrast, the quadratic is a good fit ($r^2$ = 0.977) and the cubic model is a slight improvement over the quadratic ($r^2$ = 0.981). There is no point in
using a more complex higher-order polynomial if it does not give a significantly improved fit over a simpler one, and the relative improvement can be tested for significance by a straightforward extension of the ANOVA used to assess the significance of a regression described in Section 17.5.1.

Most statistical packages give a table of results for the ANOVA that tests for departure of the regression line from one having a slope of 0 and also includes the sum of squares, degrees of freedom and mean squares for the regression. Table 17.6(a) gives these for the linear (Figures 17.10(a) and (b)), quadratic (Figures 17.11(a) and (b)) and cubic (Figures 17.12(a) and (b)) models fitted to the data in Table 17.5. Each expansion of the polynomial is additive (e.g. equations (17.12) to (17.15)) and so are their sums of squares. Therefore, the sum of squares for any improvement of the quadratic compared to the linear regression can be obtained by subtracting the sum of squares for the linear model from the sum of squares for the quadratic, giving the sum of squares for the difference (SS difference). The number of degrees of freedom for the difference (df difference) is also calculated by subtraction (Table 17.6(b)). The mean square for the difference is (SS difference/df difference), and the F statistic is calculated by dividing this quantity by the error mean square of the higher polynomial. The same method is used to assess whether the cubic model is an improvement compared to the quadratic. In the example in Table 17.6, the additional variation explained by the quadratic over the linear model is highly significant, but the cubic model is not a significant improvement over the quadratic, so the quadratic is used to describe the relationship.
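The subtraction described above can be written as a short calculation. The sketch below is only an illustration: the function name `f_for_improvement` and the dictionary layout are hypothetical, and the sums of squares and degrees of freedom are those listed in Table 17.6(a).

```python
def f_for_improvement(higher, lower):
    """F ratio for the extra variation explained by the higher-degree polynomial."""
    ss_difference = higher["ss"] - lower["ss"]
    df_difference = higher["df"] - lower["df"]
    ms_difference = ss_difference / df_difference
    # Error mean square of the higher polynomial is the denominator
    ms_error = higher["ss_error"] / higher["df_error"]
    return ms_difference / ms_error, df_difference, higher["df_error"]


# Regression and error sums of squares with their degrees of freedom (Table 17.6(a))
linear = {"ss": 3971.46, "df": 1, "ss_error": 372515.09, "df_error": 18}
quadratic = {"ss": 367833.58, "df": 2, "ss_error": 8652.97, "df_error": 17}
cubic = {"ss": 369211.71, "df": 3, "ss_error": 7274.84, "df_error": 16}

f_quadratic, df1_q, df2_q = f_for_improvement(quadratic, linear)
f_cubic, df1_c, df2_c = f_for_improvement(cubic, quadratic)

print(f"Quadratic minus linear: F({df1_q},{df2_q}) = {f_quadratic:.2f}")
print(f"Cubic minus quadratic:  F({df1_c},{df2_c}) = {f_cubic:.2f}")
```

The sketch stops at the F ratio. Each ratio still has to be compared with the critical value of F for its degrees of freedom, from tables or a statistical package, to decide whether the improvement is significant.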
17.11.1 Reporting the results of curvilinear regression

A curvilinear regression, such as the one in Section 17.11, could be reported as ‘The relationship between the number of maggots in a carcass and time was not linear (Figure 17.11(a)). A quadratic model gave a highly significant relationship ($F_{2,17}$ = 361.33, P < 0.001); $Y = -168.74 + 199.28X - 18.56X^2$, $r^2$ = 0.977.’ Often it is also noted that the next higher polynomial does not provide a significantly better fit to the data: ‘A cubic model was not a significant improvement over the quadratic ($F_{1,16}$ = 3.03, NS).’ The relationship is often very hard to visualise from the equation, so a graph (e.g. Figure 17.12(a)) is often included.
Table 17.6 The amount of additional variation explained by progressive polynomial expansions can be assessed by subtracting the sum of squares for the lower polynomial from the next higher one to give the sum of squares for the difference (SS difference). The number of degrees of freedom for the difference (df difference) is also obtained by subtraction. The mean square for the difference is (SS difference/df difference), and the F statistic is calculated by dividing the resultant MS difference by the error mean square of the higher polynomial. (a) ANOVA statistics for each of the three regression models. (b) ANOVA table for the relative importance of each additional expansion of the regression equation. In this example, the additional variation explained by the quadratic model is a highly significant improvement over the linear one, but there is no significant improvement of the cubic model over the quadratic.

(a)
Model              Sum of squares    df    Mean square
(a) Linear               3971.46      1        3971.46
    Error              372515.09     18       20695.28
(b) Quadratic          367833.58      2      183916.79
    Error                8652.97     17         509.00
(c) Cubic              369211.71      3      123070.57
    Error                7274.84     16         454.68

(b)
Model                     Sum of squares    df    Mean square of difference    Mean square error (df)    F ratio
Quadratic minus linear
    (b) Quadratic              367833.58     2
    (a) Linear                   3971.46     1
    Difference                 363862.12     1    363862.12                    509.00 (17)               F1,17 = 714.86, P < 0.001
Cubic minus quadratic
    (c) Cubic                  369211.71     3
    (b) Quadratic              367833.58     2
    Difference                   1378.13     1    1378.13                      454.68 (16)               F1,16 = 3.03, NS
17.12 Multiple linear regression

Simple linear regression examines the relationship between one dependent and one independent variable, but often a dependent variable may be affected by a combination of several independent variables. For example, you might have sampled 100 tomato plants and obtained data for the number of fruit produced per plant (the dependent or response variable) and the concentration of nitrogen, phosphorus and potassium (three independent variables) in the soil where each plant was growing. If you knew the relationship between fruit produced and the concentrations of these three elements, you may be able to improve crop yield. This is where multiple linear regression can often be used to estimate the effect of each of the independent variables in the analysis.

Multiple linear regression is another straightforward extension of simple linear regression. The simple linear regression equation:

$$Y_i = a + bX_i \qquad (17.16,\ \text{copied from 17.1})$$

examines the relationship between the value of one dependent variable Y (e.g. fruit yield) and one independent variable X (e.g. soil nitrogen). If you have two independent variables (e.g. soil nitrogen and phosphorus), the regression equation can be extended:

$$Y_i = a + b_1 X_{1i} + b_2 X_{2i} \qquad (17.17)$$

which is just equation (17.16) plus a second independent variable with its own coefficient ($b_2$) and values ($X_{2i}$). You will notice that there is now a double subscript after the two values of X, in order to specify the first variable (e.g. soil nitrogen) as $X_1$ and the second (e.g. soil phosphorus) as $X_2$. Equation (17.17) can be further extended to include additional independent variables (e.g. soil potassium and iron etc.):

$$Y_i = a + b_1 X_{1i} + b_2 X_{2i} + b_3 X_{3i} + b_4 X_{4i} + \dots \qquad (17.18)$$
The mathematics of multiple linear regression is complex, but the concept of how it works is very straightforward. The following explanation is for only two independent variables and therefore the simplest case of multiple linear regression.
Table 17.7 The number of fruit produced per tomato plant and the soil concentrations of nitrogen and phosphorus.

Number of fruit    Nitrogen    Phosphorus
18                 4           2
14                 1           6
10                 3           1
12                 2           2
21                 5           3
19                 2           6
10                 2           1
18                 5           2
22                 6           1
5                  1           1
21                 5           2
19                 2           7
When you only have one independent variable, a simple linear regression gives a two-dimensional straight line of best fit running through a scatter plot of points in two dimensions (e.g. Figure 17.9(a)). When you have two independent variables, the line of best fit becomes a plane of best fit running through a scatter plot in three-dimensional space.
17.12.1 An example for two independent variables, both of which are significant

Table 17.7 gives data for the number of fruit produced per tomato plant and the concentrations of nitrogen and phosphorus in the soil where each plant was grown. These data have been plotted in three dimensions in Figure 17.13. A multiple linear regression analysis will give an equation for the effects of both nitrogen and phosphorus on fruit yield, in the form of $Y_i = a + b_1 X_{1i} + b_2 X_{2i}$ (where Y is the number of fruit, a is the intercept when both $X_1$ and $X_2$ are 0, and $X_1$ and $X_2$ are nitrogen and phosphorus respectively), together with F ratios for the significance of the slope of these two independent variables (Table 17.8). The overall regression equation: Y (fruit) = 0.953 + 3.089 X1 (nitrogen) + 1.770 X2 (phosphorus) is significant (Table 17.8(a)) with an $r^2$ of 0.947. In this example, there is a significant effect of both nitrogen and phosphorus on tomato yields (Table 17.8(b)). The values of the slopes, $b_1$ and $b_2$, give the effect of each independent variable upon Y when the other variable is held constant. Note that the intercept does not differ significantly from 0. In Figure 17.13, the slope of the relationship between fruit yield and the independent variable nitrogen is the lower side of the parallelogram (and has a positive slope) and the slope of the relationship for phosphorus is the shorter side of the parallelogram (which also has a positive slope).

Table 17.8 Results of a multiple linear regression of the data in Table 17.7. (a) An analysis of variance tests whether the plane of the regression differs from one that is flat. In this case, the relationship is significant. (b) Values of a, b1 and b2, together with results of tests of significance comparing each of these to 0. Note that there is an effect of both nitrogen and phosphorus on fruit yield, but the intercept does not differ significantly from 0.

(a)
Source of variation    Sum of squares    df    Mean square    F         Probability
Regression                    307.125     2        153.563    80.705    < 0.001
Residual                       17.125     9          1.903
Total                         324.250    11

(b)
Model              Value    Probability
Constant (a)       0.953    N.S.
Nitrogen (b1)      3.089    < 0.001
Phosphorus (b2)    1.770    < 0.001

Figure 17.13 Illustration of the plane of best fit to the data in Table 17.7. The vertical Y axis shows the number of fruit produced per plant. The first X axis (X1) for the concentration of soil nitrogen is in the approximate plane of the page. The second X axis (X2, arrowed) for the concentration of phosphorus extends back into the page. The shaded parallelogram shows the plane of best fit to the scatter plot of the 12 points.

Importantly, the multiple linear regression equation gives the predicted value of Y for the combined contribution of each independent variable included in the analysis. Therefore, if you were to take the set of data in Table 17.7 and run a separate simple linear regression for the dependent variable and only one of the independent variables (e.g. nitrogen), the values of the coefficients and their significance are likely to be very different to the analysis where both independent variables have been included, because the contribution of the other variable has been ignored. For example, a simple linear regression on the data for fruit yield in relation to only the data for concentration of nitrogen in Table 17.7 gives a significant intercept and a different coefficient for the slope (Y (fruit) = 8.743 + 2.213 X1 (nitrogen): $F_{1,10}$ = 10.343, P < 0.05) and an $r^2$ of only 0.508, compared to a coefficient for nitrogen of 3.089 and an $r^2$ of 0.947 for the equation that included both independent variables (Table 17.8).
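The contrast between the two-variable fit and the nitrogen-only fit can be checked with a short Python sketch. It assumes the data in Table 17.7 and uses the closed-form least-squares solution for exactly two independent variables; the function names are hypothetical, and the sketch gives the coefficients and $r^2$ only, not the significance tests in Table 17.8.

```python
# Data from Table 17.7
fruit = [18, 14, 10, 12, 21, 19, 10, 18, 22, 5, 21, 19]
nitrogen = [4, 1, 3, 2, 5, 2, 2, 5, 6, 1, 5, 2]
phosphorus = [2, 6, 1, 2, 3, 6, 1, 2, 1, 1, 2, 7]


def mean(values):
    return sum(values) / len(values)


def sum_of_products(u, v):
    """Sum of products of deviations from the means (a sum of squares if u is v)."""
    mean_u, mean_v = mean(u), mean(v)
    return sum((ui - mean_u) * (vi - mean_v) for ui, vi in zip(u, v))


def fit_two_predictors(y, x1, x2):
    """Least-squares plane Y = a + b1*X1 + b2*X2; returns a, b1, b2 and r squared."""
    s11 = sum_of_products(x1, x1)
    s22 = sum_of_products(x2, x2)
    s12 = sum_of_products(x1, x2)
    s1y = sum_of_products(x1, y)
    s2y = sum_of_products(x2, y)
    syy = sum_of_products(y, y)
    det = s11 * s22 - s12 ** 2   # 0 if X1 and X2 are perfectly correlated
    b1 = (s1y * s22 - s12 * s2y) / det
    b2 = (s2y * s11 - s12 * s1y) / det
    a = mean(y) - b1 * mean(x1) - b2 * mean(x2)
    ss_regression = b1 * s1y + b2 * s2y
    return a, b1, b2, ss_regression / syy


a, b1, b2, r_squared = fit_two_predictors(fruit, nitrogen, phosphorus)
print(f"Both variables: Y = {a:.3f} + {b1:.3f} X1 + {b2:.3f} X2, r2 = {r_squared:.3f}")

# Simple linear regression on nitrogen alone, ignoring phosphorus
slope_n = sum_of_products(nitrogen, fruit) / sum_of_products(nitrogen, nitrogen)
intercept_n = mean(fruit) - slope_n * mean(nitrogen)
print(f"Nitrogen only:  Y = {intercept_n:.3f} + {slope_n:.3f} X1")
```

The coefficient for nitrogen changes when phosphorus is left out because the two independent variables in this sample are not independent of each other, so part of the effect of phosphorus is absorbed by the nitrogen slope and the intercept.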
17.12.2 An example where one variable is not significant

The following example is for one significant and one non-significant independent variable. Table 17.9 gives the number of fruit produced per chilli plant and soil nitrogen and phosphorus. The overall regression equation: Y (fruit) = 2.84 + 0.001 X1 (nitrogen) + 1.387 X2 (phosphorus), $r^2$ = 0.967 is significant (Table 17.10(a)), but there is a significant effect of phosphorus and no significant effect of nitrogen on chilli yields (Table 17.10(b)). Here too the slopes $b_1$ and $b_2$ give the effect of each independent variable upon Y when the other variable is held constant. In Figure 17.14, the slope of the relationship between fruit yield and the independent variable nitrogen is that of the lower side of the parallelogram (which as you might expect has a slope of almost 0) and the slope of the relationship for phosphorus is the (steeper) shorter side of the parallelogram.

Table 17.9 The number of fruit produced per chilli plant and the soil concentrations of nitrogen and phosphorus.

Number of fruit    Nitrogen    Phosphorus
5                  5           1
10                 10          5
8                  10          4
12                 20          6
5                  8           2
15                 6           9
13                 4           7
6                  21          2
14                 8           8
7                  16          4

Table 17.10 Results of a multiple linear regression of the data in Table 17.9. (a) An analysis of variance tests whether the plane of the regression differs from one that is flat. The overall relationship is significant. (b) Values of a, b1 and b2, together with results of tests of significance comparing each of these to 0. Note that there is no effect of nitrogen upon fruit yield.

(a)
Source of variation    Sum of squares    df    Mean square    F          Probability
Regression                    126.236     2         63.117    103.587    < 0.001
Residual                        4.265     7          0.609
Total                         130.500     9

(b)
Model              Value    Probability
Constant (a)       2.837    < 0.01
Nitrogen (b1)      0.001    NS
Phosphorus (b2)    1.387    < 0.001

The use of multiple linear regression with more than two independent variables (e.g. fruit yield in relation to nitrogen, phosphorus, potassium and iron) is a straightforward extension of the conceptual explanation for the two independent variables and three dimensions given here.
17.12.3 Non-significant variables in multiple linear regression

If the initial analysis shows that an independent variable has no significant effect on Y, then the variable can be removed from the equation and the analysis rerun. This process of refining the model can be repeated until only significant independent variables remain, thereby giving the most parsimonious possible model for predicting Y. There are several procedures for refining, but the one most frequently recommended is to initially include all independent variables, run the analysis and then scrutinise the results for variables with non-significant values of b. The least significant of these is removed and the analysis rerun. This process, which is most appropriately called backward elimination, is repeated until only significant variables remain. For the example in Section 17.12.2, the analysis could be rerun without the non-significant independent variable nitrogen.
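Backward elimination can be sketched in Python as a loop that repeatedly drops the weakest variable. This is an illustration under stated assumptions rather than the procedure of any particular package: the helper names are hypothetical, each variable is judged by the F ratio for the extra variation it explains when added last, and the 5% critical values of F for 1 and 7 or 8 degrees of freedom (5.59 and 5.32) are taken from standard tables and supplied by hand. The data are those in Table 17.9.

```python
# Data from Table 17.9 (chilli plants)
fruit = [5, 10, 8, 12, 5, 15, 13, 6, 14, 7]
predictors = {
    "nitrogen": [5, 10, 10, 20, 8, 6, 4, 21, 8, 16],
    "phosphorus": [1, 5, 4, 6, 2, 9, 7, 2, 8, 4],
}
# Assumed 5% critical values of F with 1 and df_error degrees of freedom
critical_f = {7: 5.59, 8: 5.32}


def solve(matrix, rhs):
    """Solve a small linear system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [r] for row, r in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    coef = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(aug[r][c] * coef[c] for c in range(r + 1, n))
        coef[r] = (aug[r][n] - known) / aug[r][r]
    return coef


def error_ss(y, columns):
    """Residual sum of squares for a least-squares fit of y on the given columns."""
    design = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    k = len(design[0])
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(design, y)) for i in range(k)]
    coef = solve(xtx, xty)
    fitted = [sum(c * v for c, v in zip(coef, row)) for row in design]
    return sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))


def backward_elimination(y, predictors, critical_f):
    """Remove the least significant variable until all remaining ones are significant."""
    kept = dict(predictors)
    while kept:
        full_ss = error_ss(y, list(kept.values()))
        df_error = len(y) - len(kept) - 1
        partial_f = {}
        for name in kept:
            others = [col for other, col in kept.items() if other != name]
            extra_ss = error_ss(y, others) - full_ss  # explained only when `name` is included
            partial_f[name] = extra_ss / (full_ss / df_error)
        weakest = min(partial_f, key=partial_f.get)
        if partial_f[weakest] >= critical_f[df_error]:
            break
        del kept[weakest]
    return list(kept)


remaining = backward_elimination(fruit, predictors, critical_f)
print("Variables retained:", remaining)
```

Supplying the critical values as a small table keeps the sketch free of external libraries. With a statistical package the probability for each coefficient would be read directly from the output instead.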
Figure 17.14 Illustration of the plane of best fit to the data in Table 17.9. The Y axis shows the number of fruit produced per chilli plant. The first X axis (X1) for the concentration of soil nitrogen is in the approximate plane of the page and the second X axis (X2, arrowed) for the concentration of phosphorus extends back into the page. The shaded parallelogram shows the plane of best fit to the scatter plot of the ten points. Note that the slope of the relationship for nitrogen is close to 0.
Table 17.11 An example of marked multicollinearity. These data are taken from the population that has the plane of best fit for the number of fruit produced per tomato plant and the soil concentration of nitrogen and phosphorus shown in Figure 17.15(a). Here, however, the restricted set of data in the sample happens by chance to show a marked correlation between the measured values of the two independent variables nitrogen and phosphorus.

Number of fruit    Nitrogen    Phosphorus
13                 23          6
8                  7           3
11                 15          4.5
5                  1           0.33
13                 21          6
8                  9           3
17.12.4 Multicollinearity: a note of caution

Multiple linear regression analysis can sometimes give misleading results when the data for a sample show marked multicollinearity. This means that the data for two (or more) independent variables are highly correlated. For example, the set of data for the number of tomato fruit in relation to soil nitrogen and phosphorus in Table 17.11 has been taken from the population of points giving the plane of best fit shown in Figure 17.15(a). Unfortunately, by chance, this small sample of data shows a high correlation between the two independent variables of nitrogen and phosphorus. This does not affect the data for the dependent variable (fruit per plant), which is still related to soil nitrogen and phosphorus as described by the plane shown in Figure 17.15(a), but a plot of the data for the sample (Figure 17.15(b)) shows the points all lie very close to a straight line running diagonally through the parallelogram that shows the plane of best fit for the population.

Therefore, if you only had this small sample to analyse, which only gives quite restricted information for the part of the plane close to the line running through the three-dimensional space, the calculated plane of best fit may not be a good estimate of the one for the population. Any point in the very restricted sample that is even slightly displaced (by error) above or below the true plane for the population is likely to have a major effect upon the predicted orientation of the estimated plane (Figure 17.15(b)) and therefore contribute to a misleading regression equation for the population, although it will be appropriate for the data in the sample. This is analogous to attempting to estimate the true slope of a two-dimensional regression line by extrapolation from a small number of points within a very narrow range, as previously discussed in Section 17.9 and Figure 17.8(d): the estimated regression line will be appropriate for that sample, but not necessarily for the population.

Figure 17.15 Illustration of the possible effect of pronounced multicollinearity. (a) The population of data for the relationship between the dependent variable, fruit per plant, and the two independent variables of soil nitrogen and phosphorus. The plane of best fit for the population is shown as a shaded parallelogram. (b) A sample of data from the population shows pronounced multicollinearity between soil nitrogen and phosphorus, giving a scatter plot that lies very close to the diagonal line. This very restricted sample may give a misleading estimate of the orientation of the plane of best fit extending out into the population because of random (error) variation of sample points lying slightly above and below the true plane.

The solution to this problem is to inspect the matrix of data (e.g. Table 17.11), including perhaps running some correlation analyses among the independent variables to see if there is multicollinearity. If two or more variables are markedly (either positively or negatively) correlated, you could exclude one or more before running the analysis. Sometimes you may even expect a correlation between two or more variables, so the inclusion of data for more than one of these only adds redundant information. For example, data for the weight and volume of an object are likely to be positively correlated, so one of these redundant variables need not be included in the analysis.
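A correlation check of this kind takes only a few lines of Python. The sketch below assumes the nitrogen and phosphorus values in Table 17.11; the function name `pearson_r` is hypothetical, and the cut-off of 0.9 is an arbitrary illustrative choice, not a rule given in the text.

```python
import math

# Independent variables from Table 17.11
nitrogen = [23, 7, 15, 1, 21, 9]
phosphorus = [6, 3, 4.5, 0.33, 6, 3]


def pearson_r(u, v):
    """Pearson correlation coefficient between two lists of equal length."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    suv = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    suu = sum((a - mean_u) ** 2 for a in u)
    svv = sum((b - mean_v) ** 2 for b in v)
    return suv / math.sqrt(suu * svv)


r = pearson_r(nitrogen, phosphorus)
print(f"r between nitrogen and phosphorus = {r:.3f}")

# Illustrative cut-off only: flag pairs of variables that are markedly correlated
if abs(r) > 0.9:
    print("Marked correlation: consider excluding one of these variables")
```

With more than two independent variables the same function could be applied to every pair, which amounts to the correlation matrix produced by most statistical packages.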
17.12.5 Reporting the results of a multiple linear regression
Results of a multiple linear regression such as Table 17.8 could be reported as ‘The overall relationship between fruit yield and nitrogen and phosphorous was significant: Y (fruit) = 0.953 + 3.089 X1 (nitrogen) + 1.770 X2 (phosphorous); F2,9 = 80.71, P < 0.001, r² = 0.947, and both nitrogen and phosphorous contributed significantly to fruit yield (P < 0.001 in each case).’ A three-dimensional scatter plot would help visualise the relationship.
For the case where one of the independent variables is not significant (Section 17.12.2), it could be reported as ‘The overall relationship between fruit yield and nitrogen and phosphorous was significant: Y (fruit) = 2.84 + 0.001 X1 (nitrogen) + 1.387 X2 (phosphorous); F2,7 = 103.59, P < 0.001, r² = 0.967. However, only the effect of phosphorous was significant, so the variable of nitrogen concentration was removed by backward elimination. This gave the relationship between fruit yield and phosphorous as Y (fruit) = 2.84 + 1.387 X (phosphorous); F1,8 = 236.77, P < 0.001, r² = 0.967.’
17.13 Questions
(1)
An easy way to help understand simple linear regression is to work through a contrived example. The set of data below gives a regression with a slope of 0 and an intercept of 10, so the line will have the equation Y = 10 + 0X.
X    Y
0    10
0    9
0    11
1    10
1    9
1    11
2    10
2    9
2    11
(a) Use a statistical package to run the regression. What is the value of r² for this relationship? Is the slope of the regression significant? (b) Next, modify the data to give an intercept of 20, but with a slope that is still 0. (c) Finally, modify the data to give a negative slope that is significant.
(2)
The table below gives data for the weight of raw opioids recovered from different volumes of foliage of the poppy Papaver somniferum.
Volume of foliage processed (m³)    Weight of opioids recovered (grams)
0.1                                 25
0.2                                 42
0.3                                 71
0.4                                 103
0.5                                 111
0.6                                 142
0.7                                 164
0.8                                 191
0.9                                 220
(a) Run a regression analysis, where the volume of foliage is the independent variable. What is the value of r²? Is the relationship significant? (b) What is the equation for the relationship between the weight of raw opioids recovered and the volume of foliage processed? (c) Does the intercept of the regression line differ significantly from 0? Would you expect it to? Why?
(3)
The table below gives the number of fruit produced per tomato plant in relation to soil nitrogen, phosphorous and nickel (which is toxic to plants, but occurs naturally in some soils).
Number of fruit    Nitrogen    Phosphorous    Nickel
34                 1           5              3
17                 4           2              8
10                 4           1              8
38                 8           3              2
17                 3           1              1
55                 3           8              2
34                 7           3              2
24                 6           1              3
17                 1           2              1
2                  1           2              12
28                 4           2              1
25                 5           3              6
27                 9           1              6
4                  2           0              6
Run a multiple linear regression on the data. (a) What is the regression equation? (b) What is the value of r²? (c) Which independent variables are significant? Is the intercept significantly different from 0?
18 Analysis of covariance
18.1 Introduction
Sometimes when testing whether two or more samples have come from the same population, you may suspect that the value of the response variable is also affected by a second factor. For example, you might need to compare the concentration of lead in the tissues of rats from a contaminated mine site and several uncontaminated sites. You could do this by trapping 20 rats at the mine and 20 from each of five randomly chosen uncontaminated sites. These data for the response variable ‘lead concentration’ and the factor ‘site’ could be analysed by a single-factor ANOVA (Chapter 11), but there is a potential complication. Heavy metals such as lead often accumulate in the bodies of animals, so the concentration in an individual’s tissues will be determined by (a) the amount of lead it has been exposed to at each site and (b) its age. Therefore, if you have a wide age range of rats in each sample, the variance in lead concentration within each is also likely to be large. This will give a relatively large standard error of the mean and a large error term in ANOVA (Chapter 11). An example for a comparison between a mine site and a relatively uncontaminated control site is given in Figures 18.1(a) and (b). The lead concentration at a particular age is always higher in a rat from the mine site compared to one from the control, so there appears to be a difference between sites, but a t test or single-factor ANOVA will not show a significant difference because the data are confounded by the wide age range of rats trapped at each. The second confounding factor is called the covariate.
Here are two ways of dealing with samples that appear to be suitable for comparison by ANOVA, but may be confounded by a second factor. First, you could repeat the sampling in the hope of getting enough individuals
Figure 18.1 Graphical example of the confounding effect of a covariate.
Symbols show lead concentration in rats from a contaminated mine site (■) and a control (•). (a) At any particular age, a rat from the mine site has a higher tissue lead concentration than one from the control, but lead levels also increase with age. (b) The data for lead in relation to site only. A single-factor ANOVA comparing the means at the two sites (and therefore ignoring age) will have a large error term and show no significant difference.
within the same narrow age range at each site and only compare these data, but this may be impossibly time-consuming or expensive. Second, if the only data available were confounded by age, it would be very helpful to have some way of adjusting them to remove the effect of this confounding factor and thereby make a more realistic comparison among samples from different sites. This is where analysis of covariance (ANCOVA) can often be used.
18.2 Adjusting data to remove the effect of a confounding factor
This explanation is introductory and conceptual. I am assuming you have read Chapter 11 on single-factor ANOVA and Chapter 17 on regression. From Figure 18.1, it is clear that tissue lead concentrations at each site are confounded by age. If you were confident that the regression lines for lead concentration in relation to age had the same slope at each site, you could use this relationship to estimate the lead concentration of every rat at an identical age and thereby remove the confounding effect of the covariate. A graphical explanation is given in Figures 18.2(a)–(f), where lead concentration is plotted against age for two sites (Figure 18.2(a)).
Figure 18.2 A conceptual explanation for the adjustment of data to remove the effect of a confounding covariate. Symbols show tissue lead concentrations in rats from a mine site (■) and a control site (•). (a) The data for lead appear to be affected by both site and age (left-hand graph). When these data are plotted for site only, there is little difference in mean lead concentration and a relatively large error term (the vertical arrows on the right-hand graph). (b) The regression line fitted through all the data gives the common slope. (c) Separate regression lines, with the constraint that each must have the common slope, are fitted to the data for each site. (d) Vertical arrows show the value of the residual (Yi − Ŷi) for each datum. (e) Selection of a standard value of age, X(standard). In this case, the grand mean age for all rats has been chosen. (f) The predicted value of Ŷ(standard) is obtained from the regression line at each site. The residuals are used to calculate the adjusted lead concentration for each rat, (Yi(adjusted)), at the standard age, using the appropriate value of Ŷ(standard) in the equation Yi(adjusted) = Ŷ(standard) + residuali. Note that the error for each sample (arrows) is much smaller than for the unadjusted data shown in (a).
First, the overall slope of the regression line through all the data is calculated (Figure 18.2(b)). Second, separate regression lines are fitted to the data for each site, with the constraint that each of these lines is given the same (overall) slope (Figure 18.2(c)). Third, the residuals (Chapter 17) are calculated. These are the vertical displacement of the value (in this case lead concentration) for each individual above or below the appropriate regression line for each site (Figure 18.2(d)) and will include both positive and negative quantities. Fourth, the grand mean for the covariate (here age) for the total number of individuals is calculated and used as the standard value (Figure 18.2(e)). This could be any value within the age range because the fitted regression lines have the same slope and will always be parallel and therefore the same distance apart, but usually the grand mean of all the data for the covariate is used. Fifth, for each individual the data for the response variable (here lead) is adjusted and expressed in relation to the standard value of the covariate by adding the appropriate residual to the value of Ŷ for the standard age at each site: (Ŷ(standard) + residual) (Figure 18.2(f)).
This process, which is a slightly more complex example of interpolation using regression (Chapter 17), gives a set of adjusted data from which the confounding effect of the covariate (here age) has been removed and can therefore be used to make a more realistic comparison of the response variable among the levels of the factor of interest (here site). Note from Figure 18.2(f) that the variances are smaller than for the unadjusted data in Figure 18.2(a).
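The five steps can be written compactly as equations. The subscripts are notation introduced here for illustration only: i indexes the individual, j the site, b is the common slope and a_j is the intercept of the constrained line at site j.

```latex
\begin{align*}
\hat{Y}_{ij} &= a_j + b\,X_{ij}
  && \text{(line for site $j$ with the common slope $b$)}\\
\text{residual}_{ij} &= Y_{ij} - \hat{Y}_{ij}\\
X(\text{standard}) &= \bar{X}
  && \text{(grand mean of the covariate)}\\
\hat{Y}_j(\text{standard}) &= a_j + b\,X(\text{standard})\\
Y_{ij}(\text{adjusted}) &= \hat{Y}_j(\text{standard}) + \text{residual}_{ij}
  = Y_{ij} - b\,\bigl(X_{ij} - X(\text{standard})\bigr)
\end{align*}
```

The last equality follows by substitution and shows that the adjustment simply slides each datum along a line of slope b to the standard value of the covariate.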
18.3 An arithmetic example
Table 18.1 gives data for age and tissue lead concentration in rats from two sites. If age is ignored, a single-factor ANOVA on the raw data for tissue lead (the second column from the left in Table 18.1) shows no significant difference between sites (ANOVA: F1,18 = 2.74, NS), but this may be because of the confounding effect of age. Table 18.1 also gives
the residuals for each value of lead in relation to the regression line for each site. Each residual has been added to the value for tissue lead at the standard age (estimated from the appropriate regression line at each site) to give a set of adjusted values for the response variable. A single-factor ANOVA on these data is highly significant (F1,18 = 71.92, P < 0.001) and shows the advantage of removing the effect of a confounding covariate. This is essentially what ANCOVA does. The analysis is mathematically complex, but extremely easy to run with most statistical packages (where you use the commands for ANOVA, but specify the appropriate factor as a covariate). A typical output for the data in Table 18.1 is in Table 18.2, which shows a significant effect of both site and age. If the effect of age were not significant, there would be no need to include it as a covariate.
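The arithmetic of this example can be retraced with a short script. It is a sketch rather than a full ANCOVA: the data are typed in from Table 18.1, the common slope is taken as the pooled within-site slope (an assumption made here because it reproduces the 4.051756 quoted in Table 18.1), and the helper names are illustrative.

```python
def mean(values):
    return sum(values) / len(values)


def one_way_f(groups):
    """F ratio for a single-factor ANOVA on a list of samples."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


# Table 18.1: age (months) and tissue lead for ten rats per site
age = {"control": [10, 12, 6, 6, 11, 13, 12, 10, 9, 8],
       "mine": [8, 10, 4, 4, 9, 11, 10, 8, 7, 6]}
lead = {"control": [63, 67, 45, 42, 73, 69, 73, 57, 56, 59],
        "mine": [71, 75, 53, 50, 81, 77, 81, 65, 64, 67]}

# Common slope, pooled within sites (assumption: see the note above)
sxy = sxx = 0.0
for site in age:
    mx, my = mean(age[site]), mean(lead[site])
    sxy += sum((x - mx) * (y - my) for x, y in zip(age[site], lead[site]))
    sxx += sum((x - mx) ** 2 for x in age[site])
slope = sxy / sxx

# Standard value of the covariate: grand mean age of all 20 rats
standard_age = mean(age["control"] + age["mine"])

# Adjusted value = predicted lead at the standard age + residual
adjusted = {}
for site in age:
    intercept = mean(lead[site]) - slope * mean(age[site])
    y_standard = intercept + slope * standard_age
    residuals = [y - (intercept + slope * x)
                 for x, y in zip(age[site], lead[site])]
    adjusted[site] = [y_standard + r for r in residuals]

f_raw = one_way_f(list(lead.values()))
f_adjusted = one_way_f(list(adjusted.values()))
print(f"slope = {slope:.6f}, standard age = {standard_age:.1f} months")
print(f"F(raw) = {f_raw:.2f}, F(adjusted) = {f_adjusted:.2f}")
```

The two F ratios should agree with the values of 2.74 and 71.92 reported in the text for the raw and adjusted data.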
18.4 Assumptions of ANCOVA and an extremely important caution about parallelism
The assumptions of ANCOVA include those for both ANOVA (Chapter 14) and linear regression (i.e. homoscedasticity, normality, independence and a linear relationship between the dependent variable and the covariate). It is also assumed that the slopes of the regression lines for the relationship between the dependent variable (e.g. lead concentration) and the covariate (e.g. age) are the same for all levels of the factor being investigated. This is called parallelism and is extremely important. If the regression lines are not parallel, a comparison of the difference between them at only one selected (standard) value will not represent what is occurring throughout the range of the covariate (Figure 18.3). You should not proceed with the ANCOVA if there is a significant lack of parallelism. First, as a preliminary, it is very useful to plot a graph of the data (e.g. Figure 18.3) as it may show such a gross difference in the slopes of the regression lines that it would be inappropriate to continue. Second, for most statistical packages it is easy to specify a preliminary test for a lack of parallelism in the ANCOVA. For a set of data with two or more levels of
Table 18.1 The calculation of adjusted values for the concentration of lead in ten rats from a mine site and another ten from an uncontaminated control site. Raw data are given in the first two columns. The regression line for each site is used to calculate the predicted value of Ŷ for the age of each rat. Next, the (positive or negative) residual for each datum is added to the predicted value of Ŷ for each site at the same standard age of 8.7 months (which is the grand mean of all 20 rats in the two samples combined). This will give a set of adjusted data from which the confounding effect of age has been removed.
Age    Lead    Predicted value of      Residual    Adjusted value of lead, Yi(adjusted), at the
Xi     Yi      Ŷ from regression       Yi − Ŷ      standard age of 8.7 months: Ŷ(standard) + residual
Control site
10     63      61.62                   1.38        57.73
12     67      69.72                   −2.72       53.63
6      45      45.41                   −0.41       55.94
6      42      45.41                   −3.41       52.94
11     73      65.67                   7.33        63.68
13     69      73.77                   −4.77       51.58
12     73      69.72                   3.28        59.63
10     57      61.62                   −4.62       51.73
9      56      57.56                   −1.56       54.78
8      59      53.51                   5.49        61.84
Regression: Y (lead) = 21.09797 + 4.051756 X (age)
Ŷ(standard) for 8.7 months = 56.347
Mine site
8      71      69.62                   1.38        73.84
10     75      77.72                   −2.72       69.73
4      53      53.41                   −0.41       72.04
4      50      53.41                   −3.41       69.04
9      81      73.67                   7.33        79.78
11     77      81.77                   −4.77       67.68
10     81      77.72                   3.28        75.73
8      65      69.62                   −4.62       67.84
7      64      65.56                   −1.56       70.89
6      67      61.51                   5.49        77.94
Regression: Y (lead) = 37.20148 + 4.051756 X (age)
Ŷ(standard) for 8.7 months = 72.452
Table 18.2 A typical output for the results of an ANCOVA on the data in Table 18.1. There is a significant effect of both site and age upon lead concentration in rats. No interaction term is included because one of the assumptions of the analysis is that the slope of the relationship between lead and the covariate (age) is the same for each level of the factor of interest (site).
Source    Sum of squares    df    Mean square    F         Probability
Age       1776.290          1     1776.290       93.054    < 0.001
Site      1094.335          1     1094.335       57.329    < 0.001
Error     324.510           17    19.089
Total     2420.800          19
Box 18.1 ANCOVA, ANOVA and regression analysis
The contrived data in Table 18.1 that give two exactly parallel regression lines and the sums of squares for the results of ANCOVA in Table 18.2 illustrate how ANCOVA is related to both ANOVA and regression analysis. The statistics from a single-factor ANOVA on the unadjusted data for lead concentration and site (i.e. the two columns to the far left of Table 18.1) are given in the table immediately below. The total sum of squares in the ANCOVA (2420.8 in Table 18.2) is the same as for a single-factor ANOVA (the bottom line in the table below). This is because both quantities are calculated from the displacement of each point from the grand mean of all the data for the dependent variable.
Source    Sum of squares    df    Mean square    F       Probability
Site      320               1     320            2.74    0.115
Error     2100.8            18    116.71
Total     2420.8            19
For the contrived example in Table 18.1, the two regression lines differ only in their intercepts. The ANOVAs for the slope of each line will have the same sums of squares, mean squares and degrees of freedom, so only one table of results is given below. The sum of squares for the effect of age in the ANCOVA (1776.29: Table 18.2) is the same as the combined sum of squares for the two regressions (i.e. regression + error) and therefore double the value of 888.15 given for each. This is because the regression sum of squares is an estimate of the effect of the regression + error (i.e. age + error) at each site.
Source        Sum of squares    df    Mean square    F        Probability
Regression    888.15            1     888.15         43.79    < 0.001
Residual      162.26            8     20.28
Total         1050.40           9
The sum of squares for error (324.51: Table 18.2) in the ANCOVA is also the combined total of the residual (i.e. error) sums of squares for the two regressions (in this case double 162.26 in the table above).
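The relationships among the sums of squares in this box can be checked numerically. The sketch below re-enters the raw data from Table 18.1 and uses an illustrative helper, `sums`, written for this example; the shortcut of adding the two regression sums of squares to obtain the ANCOVA age term is only valid here because the two contrived lines are exactly parallel.

```python
def mean(values):
    return sum(values) / len(values)


def sums(x, y):
    """Return (Sxx, Sxy, Syy): corrected sums of squares and cross-products."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    syy = sum((b - my) ** 2 for b in y)
    return sxx, sxy, syy


# Raw data from Table 18.1
age = {"control": [10, 12, 6, 6, 11, 13, 12, 10, 9, 8],
       "mine": [8, 10, 4, 4, 9, 11, 10, 8, 7, 6]}
lead = {"control": [63, 67, 45, 42, 73, 69, 73, 57, 56, 59],
        "mine": [71, 75, 53, 50, 81, 77, 81, 65, 64, 67]}

# Separate regression at each site:
# regression SS = Sxy^2 / Sxx, residual SS = Syy - regression SS
regression_ss, residual_ss = {}, {}
for site in age:
    sxx, sxy, syy = sums(age[site], lead[site])
    regression_ss[site] = sxy ** 2 / sxx
    residual_ss[site] = syy - regression_ss[site]

# Single-factor ANOVA on lead, ignoring age
all_age = age["control"] + age["mine"]
all_lead = lead["control"] + lead["mine"]
total_ss = sums(all_age, all_lead)[2]
within_ss = sum(sums(age[s], lead[s])[2] for s in age)
site_ss_anova = total_ss - within_ss

# ANCOVA terms as described in Box 18.1 (parallel lines only)
age_ss_ancova = sum(regression_ss.values())
error_ss_ancova = sum(residual_ss.values())

print(f"ANOVA: site = {site_ss_anova:.2f}, error = {within_ss:.2f}, "
      f"total = {total_ss:.2f}")
for site in age:
    print(f"{site}: regression = {regression_ss[site]:.2f}, "
          f"residual = {residual_ss[site]:.2f}")
print(f"ANCOVA: age = {age_ss_ancova:.2f}, error = {error_ss_ancova:.2f}")
```

The printed totals should match the sums of squares in Table 18.2 and the two tables in this box to the rounding shown.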
Figure 18.3 The effect of a lack of parallelism in ANCOVA. The
relationship between tissue lead and age of rats differs at each site, so the regression lines are not parallel. A comparison between the two sites at only one standard value of the covariate (e.g. an age of 8 months) will not be representative of the difference across the entire range of the covariate.
Factor A and a covariate (Factor B), the lack of parallelism between the regression lines is the same concept as interaction in relation to two-factor and higher ANOVA (Chapter 13), where the effect of one factor varies across the levels of the second. Therefore, to run a test for parallelism prior to ANCOVA, you identify the response variable (e.g. lead), the factor (e.g. site) and the covariate (e.g. age), but specify a model which also includes the interaction between the factor and the covariate. The table of results from this analysis will include the F ratio and probability for the interaction. If the interaction is not significant, you can continue and specify the ANCOVA model without an interaction term. There is an example in Section 18.4.1. When the regression lines show a significant lack of parallelism you can still run separate regression analyses on the data for each level of the factor of interest (e.g. lead concentration and age at each site) and compare the slopes and intercepts. Furthermore, a lack of parallelism is likely to be of interest as it shows a different relationship between the dependent variable and the covariate at each level of the factor you are investigating. Here you may be thinking ‘Then why use ANCOVA even when the regression lines are reasonably parallel? Why not simply run a regression at each level of the factor of interest and compare the intercepts of these lines?’ ANCOVA is used because it estimates error and main effects from the entire data set, thereby giving a more powerful analysis.
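One common way to express the preliminary test for parallelism is as a comparison of two error sums of squares. The notation below is introduced here for illustration and is not the book's: k is the number of levels of the factor, N the total number of individuals, SS(common) the error sum of squares when every level is given the same slope, and SS(separate) the error sum of squares when each level has its own slope.

```latex
F_{k-1,\;N-2k} \;=\;
\frac{\bigl(SS_{\text{error(common)}} - SS_{\text{error(separate)}}\bigr)/(k-1)}
     {SS_{\text{error(separate)}}/(N-2k)}
```

With two levels and 20 individuals this gives 1 and 16 degrees of freedom. A large F ratio indicates that allowing separate slopes reduces the error substantially, which is a significant factor × covariate interaction and therefore a lack of parallelism.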
18.4.1 An example
Table 18.3 gives data for the systolic blood pressure of ten males and ten females. Blood pressure tends to increase with age, so a comparison between males and females is likely to be confounded by the wide age range within each of these two samples. A graph of the data confirms there is an increase in blood pressure with age (Figure 18.4), but a preliminary analysis for parallelism shows no significant difference in the slopes of the lines for the relationship between blood pressure and age for each gender (Table 18.4). The subsequent ANCOVA (Table 18.5) shows no effect of gender but a significant effect of age (the covariate). These results could be reported as ‘A preliminary analysis for parallelism showed no significant difference in the slopes of the lines for blood
Table 18.3 Data for age in years and systolic blood pressure for ten males and ten females. Both samples include individuals of a relatively wide age range.
Females                    Males
Age    Blood pressure      Age    Blood pressure
38     112                 35     115
40     110                 37     115
45     120                 41     114
50     123                 47     116
54     118                 52     120
58     124                 53     118
59     126                 55     130
67     128                 60     125
70     127                 61     128
75     132                 65     130
Figure 18.4 Graph of the data in Table 18.3 for systolic blood pressure in relation to age (years) and gender. ■ = males, • = females. It is clear that blood pressure increases with age.
pressure and age (gender × age: F1,16 = 0.073, NS) and the subsequent ANCOVA showed no effect of gender (F1,17 = 1.59, NS), but a significant effect of age (F1,17 = 73.31, P < 0.001) upon systolic blood pressure.’
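The F ratios in this report can be retraced from the raw data in Table 18.3. The following is a minimal sketch, not the output of any particular statistical package: each F ratio is obtained by comparing the error sums of squares of two nested models, the helper `sums` is illustrative, and the 5% critical values are quoted from standard F tables rather than from this chapter.

```python
def mean(values):
    return sum(values) / len(values)


def sums(x, y):
    """Return (Sxx, Sxy, Syy): corrected sums of squares and cross-products."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    syy = sum((b - my) ** 2 for b in y)
    return sxx, sxy, syy


# Table 18.3: age (years) and systolic blood pressure
age = {"females": [38, 40, 45, 50, 54, 58, 59, 67, 70, 75],
       "males": [35, 37, 41, 47, 52, 53, 55, 60, 61, 65]}
bp = {"females": [112, 110, 120, 123, 118, 124, 126, 128, 127, 132],
      "males": [115, 115, 114, 116, 120, 118, 130, 125, 128, 130]}

k = len(age)                                   # levels of the factor
n_total = sum(len(v) for v in age.values())    # total individuals

within = [sums(age[g], bp[g]) for g in age]
pooled_sxx = sum(s[0] for s in within)
pooled_sxy = sum(s[1] for s in within)
pooled_syy = sum(s[2] for s in within)

# Error SS when each gender has its own slope (model with interaction)
sse_separate = sum(syy - sxy ** 2 / sxx for sxx, sxy, syy in within)

# Error SS when both genders share one slope (the ANCOVA model)
ss_age = pooled_sxy ** 2 / pooled_sxx
sse_common = pooled_syy - ss_age

# Preliminary test for parallelism (gender x age interaction)
ss_interaction = sse_common - sse_separate
f_parallel = (ss_interaction / (k - 1)) / (sse_separate / (n_total - 2 * k))

# ANCOVA: age, and gender adjusted for age
ms_error = sse_common / (n_total - k - 1)
f_age = ss_age / ms_error
all_age = age["females"] + age["males"]
all_bp = bp["females"] + bp["males"]
sxx_t, sxy_t, syy_t = sums(all_age, all_bp)
sse_one_line = syy_t - sxy_t ** 2 / sxx_t      # one line, gender ignored
ss_gender = sse_one_line - sse_common
f_gender = (ss_gender / (k - 1)) / ms_error

F_CRIT_1_16 = 4.49  # 5% critical value from standard F tables
F_CRIT_1_17 = 4.45  # 5% critical value from standard F tables
print(f"parallelism: F1,16 = {f_parallel:.3f} (critical {F_CRIT_1_16})")
print(f"gender: F1,17 = {f_gender:.2f}, age: F1,17 = {f_age:.2f} "
      f"(critical {F_CRIT_1_17})")
```

The interaction and gender F ratios fall below their critical values while the F ratio for age greatly exceeds its critical value, which is the pattern reported in Tables 18.4 and 18.5.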
Table 18.4 Results of a preliminary analysis for parallelism on the data in Table 18.3. The lack of a significant interaction between gender and age shows that the regression lines for the relationship between blood pressure and age for each gender are relatively parallel.
Source          Sum of squares    df    Mean square    F           Probability
Gender          0.002             1     0.002          0.000206    0.989
Age             655.750           1     655.750        67.55       < 0.001
Gender × age    0.709             1     0.709          0.073       0.790
Error           155.327           16    9.708
Total           832.950           19
Table 18.5 Results of an ANCOVA on the data in Table 18.3. There is no effect of gender. The covariate, age, has a significant effect on blood pressure.
Source    Sum of squares    df    Mean square    F         Probability
Age       672.865           1     672.865        73.308    < 0.001
Gender    14.558            1     14.558         1.586     0.225
Error     156.035           17    9.179
Total     832.950           19
18.5 Reporting the results of ANCOVA
The results of an ANCOVA on the (extremely) contrived data for lead levels in rats in Table 18.1 could be reported as ‘A preliminary analysis for parallelism showed no significant difference between the slopes of the lines for lead concentration in relation to age (site × age: F1,16 = 0.00, NS). The subsequent ANCOVA showed a significant effect of site (F1,17 = 57.329, P < 0.001) as well as a significant effect of the covariate
(age) (F1,17 = 93.054, P < 0.001). Rats from the mine site had higher levels of lead than those from the control (Figure 18.1(a)).’ A significant lack of parallelism for the comparison of blood pressure between males and females (Figure 18.3) could be reported as ‘The data were not suitable for analysis by ANCOVA because a preliminary analysis for parallelism showed a significant difference in the slopes of the lines for blood pressure in relation to age (gender × age: F1,16 = 15.043, P < 0.001).’
18.6 More complex models
The conceptual introduction in this chapter is for the simplest ANCOVA and, for clarity, has only two levels of the factor being investigated. If you have more than two levels of a significant fixed factor (e.g. three or more sites in the example in Table 18.1), you will probably need to carry out a posteriori tests (Chapter 12) to distinguish which levels are different. Explanations and instructions are given in more advanced texts. More complex models can include more than one factor of interest (e.g. a two-factor ANCOVA examining the effects of gender and several experimental drugs on blood pressure with age as a covariate) and more than one covariate. Here, however, you need to be particularly careful when analysing for parallelism because you need to examine the interaction between each factor and the covariate, as well as among both factors and the covariate. There is a very clear explanation and discussion in Quinn and Keough (2002).
18.7 Questions
(1)
Why does the use of ANCOVA rely upon the assumption that the regression lines for the relationship between the dependent variable and the covariate for all levels of the factor being investigated are reasonably parallel?
(2)
The data given below are for the systolic blood pressure of ten adult males in an untreated control and ten more males given the experimental drug Systolsyn B.
Control                    Systolsyn B
Age    Blood pressure      Age    Blood pressure
38     112                 35     115
40     110                 37     110
45     120                 41     114
50     123                 47     108
54     118                 52     113
58     124                 53     110
59     126                 55     120
67     128                 60     114
70     127                 61     108
75     132                 65     120
(a) Draw a graph of the data for blood pressure in relation to age for both groups. Does an ANCOVA seem appropriate for examining whether there is a difference between the experimental group and the control? Why? (b) Use a statistical package to run a preliminary analysis for parallelism on these data by specifying the interaction between treatment and age. Does this show a significant lack of parallelism? (c) How might you proceed with this analysis?
19 Non-parametric statistics
19.1 Introduction
Parametric tests are designed for analysing data from a known distribution, and most of these tests assume the distribution is normal. Although parametric tests are quite robust to departures from normality and a transformation can often be used to normalise data, there are some cases where the population is so grossly non-normal that parametric testing is unwise. In these cases, a powerful analysis can often still be done by using a non-parametric test. Non-parametric tests are not just alternatives to the parametric procedures for analysing ratio, interval and ordinal data described in Chapters 9 to 18. Often life scientists measure data on a nominal scale. For example, Table 3.4 gave the numbers of basal cell carcinomas detected and removed from different areas of the human body. This is a sample containing frequencies in several discrete and mutually exclusive categories and there are non-parametric tests for analysing this type of data (Chapter 20).
19.2 The danger of assuming normality when a population is grossly non-normal
Most parametric tests have been specifically designed for analysing data from populations having distributions shaped like a bell that is symmetrical about the mean, with 68.27% of values occurring within μ ± 1 standard deviation and 95% within μ ± 1.96 standard deviations (Chapter 8). This predictable distribution is used to determine the range within which 95% of the values of the sample mean, X̄, will occur when samples of a particular size are taken from a population. If X̄ occurs outside the range of μ ± 1.96 SEM,
Figure 19.1 Illustration of how the range within which the means of samples from a grossly non-normal population occur does not correspond to the range expected if the population is assumed to be normally distributed. (a) Distribution of a bimodal population. (b) Actual shape of the distribution of means of sample size n = 30 from the population. (c) Shape of the distribution of means calculated from the standard error when n = 30 assuming the population is normally distributed. Horizontal arrows show the range within which 95% of means would be expected to occur. Note that the expected range in (c) is much wider than the true range in (b).
the probability the sample has come from that population is less than 5%. If the population is not normally distributed, the range occupied by 95% of the values of the mean may be either wider or narrower than assumed, in which case judgements about statistical significance made on the basis of the normal distribution will be misleading. An example is shown in Figures 19.1(a)–(c). The population is bimodal and the range within which 95% of the values of the means of samples of size
n = 30 from this population actually occur is narrower than the range predicted if the population is assumed to be normally distributed.
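The proportions quoted for a normal distribution at the start of this section, and the range expected for sample means, can be checked with Python's standard library. This is a sketch only: the population mean, standard deviation and sample size passed to `expected_mean_range` are hypothetical values chosen for illustration, and the function name is not from the text.

```python
from math import sqrt
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)


def central_proportion(z):
    """Proportion of a normal distribution within z standard deviations of the mean."""
    return standard_normal.cdf(z) - standard_normal.cdf(-z)


def expected_mean_range(mu, sigma, n, z=1.96):
    """Range expected to contain 95% of sample means if the population is normal."""
    sem = sigma / sqrt(n)  # standard error of the mean
    return mu - z * sem, mu + z * sem


within_1_sd = central_proportion(1.0)
within_1_96_sd = central_proportion(1.96)
print(f"within 1 SD: {within_1_sd:.4f}, within 1.96 SD: {within_1_96_sd:.4f}")

# Hypothetical population (mean 50, SD 10) sampled with n = 30
lower, upper = expected_mean_range(mu=50.0, sigma=10.0, n=30)
print(f"expected 95% range of sample means: {lower:.2f} to {upper:.2f}")
```

The calculated range only describes where sample means will fall if the assumption of normality is reasonable, which is the point made by Figure 19.1.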
19.3 The advantage of making a preliminary inspection of the data
It has already been emphasised that parametric tests for comparing means can often be applied to data from populations that are not normal because the distribution of the means of samples from most populations will usually be relatively normal (Chapter 8). Once again, however, the example in Section 19.2 emphasises the advantage of graphing the data to inspect it for normality and homoscedasticity before doing a statistical analysis. The next two chapters describe tests for analysing nominal scale data, followed by some non-parametric alternatives to the parametric tests for independent and related samples described in Chapters 8–15, as well as a non-parametric test for correlation.
20 Non-parametric tests for nominal scale data
20.1 Introduction
Life scientists often collect data that can be assigned to two or more discrete and mutually exclusive categories. For example, a sample of 20 humans can be partitioned into two categories of ‘right-handed’ or ‘left-handed’ (because even people who claim to be ambidextrous still perform a greater proportion of actions with one hand and can be classified as having a dominant right or left hand). These two categories are discrete, because there is no intermediate state, and mutually exclusive, because a person cannot be assigned to both. They also make up the entire set of possible outcomes within the sample and are therefore contingent upon each other, because for a fixed sample size a decrease in the number in one category must be accompanied by an increase in the number in the other and vice versa. These are nominal scale data (Chapter 3).
The questions researchers ask about these data are the sort asked about any sample(s) from a population. First, you may need to know the probability a sample has been taken from a population having a known or expected proportion within each category. For example, the proportion of left-handed people in the world is close to 0.1 (10%), which can be considered the proportion in the population because it is from a sample of several million people. A biomedical scientist, who suspected the proportion of left- and right-handed people showed some variation among occupations, sampled 20 statisticians and found that four were left-handed and 16 right-handed. The question is whether the proportions in the sample were significantly different from the expected proportions of 0.1 and 0.9 respectively. The difference between the population and the sample might be solely due to chance or also reflect career choice.
Second, you may want to know the probability that two or more samples are from the same population. For example, a geneticist noticed that the
children of male deep-sea divers seemed to be predominantly female. The geneticist sampled 100 male divers and 100 male deckhands within a similar age range and who worked on the same dive boats. For each individual, they recorded the gender of their first-born child. The offspring of the divers were 67 females and 33 males, while the offspring of the deckhands were 53 females and 47 males. Here, too, the difference between the two samples might be due to chance or also occupation. For both of these examples, a method is needed that gives the probability of obtaining the observed outcome under the null hypothesis of no difference. This chapter describes some tests for analysing samples of categorical data.
20.2 Comparing observed and expected frequencies: the chi-square test for goodness of fit
The chi-square test for goodness of fit compares the observed frequencies in a sample to the expected frequencies in a population, and the following example may be familiar to you from an introductory biology course. The genes that control pelt colour in guinea pigs are described as ‘dominant’ and ‘recessive’, with the gene for a lack of pigment being recessive to the gene for brown pelt. This is because the dominant gene codes for a protein that makes brown pigment, but the recessive gene does not code for any pigment. Therefore, an individual with two copies of the recessive gene will be albino, but a heterozygote with one copy of the brown gene and a homozygote with two copies will be brown. Consequently, you would expect the proportions of three brown to one albino among the offspring from a cross between two heterozygotes. To test this, a geneticist crossed several guinea pigs heterozygous for pelt colour and obtained 100 offspring altogether. Under the null hypothesis, the expected numbers are 75 brown and 25 albino, but the sample contained 86 brown and 14 albino offspring. This difference from the expected frequencies might be due to chance or because the null hypothesis is incorrect. The chi-square test calculates the probability that a sample has come from a population with the expected proportions in each category. To calculate the chi-square statistic, the appropriate expected frequency is taken away from the observed frequency. Each of these differences is
squared and divided by the expected frequency, and the resulting quotients added together:

$$\chi^2 = \sum_{i=1}^{n} \frac{(o_i - e_i)^2}{e_i} \qquad (20.1)$$

This is sometimes written as:

$$\chi^2 = \sum_{i=1}^{n} \frac{(f_i - \hat{f}_i)^2}{\hat{f}_i} \qquad (20.2)$$
where $f_i$ is the observed frequency and $\hat{f}_i$ is the expected frequency. It does not matter whether the difference between the observed and expected frequencies is positive or negative because the square of any difference will be positive. If there is perfect agreement between every observed and expected frequency, the value of chi-square will be 0. Nevertheless, even if the null hypothesis applies, samples are unlikely to always contain the exact proportions present in the population. By chance, small departures are likely and larger departures will also occur, all of which will generate positive values of chi-square. The most extreme 5% of departures from the expected ratio are considered statistically significant and will exceed a critical value of chi-square. Table 20.1 gives a worked example for the sample of left- and right-handed statisticians mentioned above.

Table 20.1 A worked example comparing the observed frequencies in a sample to those expected from the proportions in the population. The observed frequencies in a sample of 20 are 4:16 left- to right-handed people and the expected frequencies are 2:18.

                                      Left-handed   Right-handed
Observed                                   4             16
Expected                                   2             18
Observed − Expected                        2             −2
(Observed − Expected)²                     4              4
(Observed − Expected)² / Expected          2            0.22

$$\chi^2 = \sum_{i=1}^{n} \frac{(o_i - e_i)^2}{e_i} = 2.22$$
The value of chi-square in Table 20.1 has one degree of freedom. This is because the total number of individuals in both categories combined is fixed, so as soon as the number in one of the two categories is set, the other is no longer free to vary. The 5% critical value of chi-square with one degree of freedom is 3.84, so the proportions of left- and right-handed people in the sample are not significantly different to the expected proportions of 0.1 to 0.9. The chi-square test for goodness of fit can be extended to any number of categories and the degrees of freedom will be k − 1 (where k is the number of categories). Statistical packages will calculate the value of chi-square and its probability.
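The arithmetic in Table 20.1 can be checked with a few lines of code. The following is only a minimal sketch in Python, not the output of any particular statistical package; the function names `chi_square_gof` and `p_value_df1` are invented for this illustration, and the probability formula used applies only when there is one degree of freedom.

```python
import math

def chi_square_gof(observed, expected_proportions):
    """Return (chi-square, degrees of freedom) for a goodness of fit test."""
    n = sum(observed)
    # Expected frequency = expected proportion x sample size
    expected = [p * n for p in expected_proportions]
    # Equation (20.1): sum of (observed - expected)^2 / expected
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return chi2, len(observed) - 1

def p_value_df1(chi2):
    """Upper-tail probability for chi-square with ONE degree of freedom only."""
    return math.erfc(math.sqrt(chi2 / 2.0))

# Table 20.1: 4 left-handed and 16 right-handed statisticians,
# expected proportions 0.1 and 0.9.
chi2, df = chi_square_gof([4, 16], [0.1, 0.9])
print(round(chi2, 2), df)   # 2.22 1
print(chi2 > 3.84)          # False: below the 5% critical value for 1 df
print(p_value_df1(chi2))    # probability of a departure at least this large
```

For more than one degree of freedom the probability would have to come from a chi-square table or a statistical package.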
20.2.1 Small sample sizes

When expected frequencies are small, the calculated value of chi-square tends to be too large and will therefore indicate a lower than appropriate probability, thereby increasing the risk of Type 1 error. It used to be recommended that no expected frequency in a goodness of fit test should be less than five, but this has been relaxed somewhat in the light of more recent research, and it is now recommended that no more than 20% of expected frequencies should be less than five.

An entirely different method which is not subject to bias when sample size is small can be used to analyse such data. It is an example of a group of procedures called randomisation tests that will be discussed further in Chapter 21. Instead of calculating a statistic that is used to estimate the probability of an outcome, a randomisation test uses a computer program to simulate the repeated random sampling of an hypothetical population containing the expected proportions in each category. These samples will often contain the same proportions as the population, but departures from this will occur by chance. The simulated sampling is iterated, meaning it is repeated, several thousand times and this artificially generated distribution of the statistic is used to identify the most extreme 5% of departures from the expected proportions. Finally, the actual proportions in the real sample are compared to the artificial distribution. If the sample statistic falls within the region where the most extreme 5% of departures from the expected occur, the sample is considered to be significantly different to the population.

The repeated random sampling of an hypothetical population is an example of a more general procedure called the Monte Carlo method
that uses the properties of the sample, or the expected properties of a population, and takes a large number of simulated random samples to create a distribution that would apply under the null hypothesis. For the data in Table 20.1, where the sample size is 20 and the expected proportions are 0.1 left-handers to 0.9 right-handers, a randomisation test works by taking several thousand random samples of size 20 from an hypothetical population containing these proportions. This will generate a distribution of outcomes similar to the one shown in Figure 20.1, which is for 10 000 simulated samples. If the procedure is repeated another 10 000 times, the outcome is unlikely to be exactly the same, but nevertheless will be very similar to Figure 20.1 because so many samples have been taken. It is clear from Figure 20.1 that the likelihood of a sample containing four or more people who are left-handed is greater than 0.05.

Figure 20.1 An example of the distribution of outcomes from a Monte Carlo simulation where 10 000 samples of size 20 are taken at random from a population containing 0.1 left-handed and 0.9 right-handed people. Note that the probability of obtaining four or more left-handed people in a sample of 20 is greater than 0.05. (Horizontal axis: number of left-handed people, 0 to 20; vertical axis: frequency, 0.0 to 0.3.)
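The simulation described above is straightforward to sketch in code. The following Python example is an illustration only: the function name `simulate_left_handers`, the seed and the use of Python's built-in random number generator are assumptions made here, not details from the text. It takes 10 000 samples of size 20 from a population with 0.1 left-handers and reports the proportion of samples containing four or more.

```python
import random

def simulate_left_handers(n_samples=10_000, sample_size=20, p_left=0.1, seed=1):
    """Count how many simulated samples contain 0, 1, 2, ... left-handers."""
    rng = random.Random(seed)            # fixed seed so the run is repeatable
    counts = [0] * (sample_size + 1)
    for _ in range(n_samples):
        # Each individual is left-handed with probability p_left
        k = sum(1 for _ in range(sample_size) if rng.random() < p_left)
        counts[k] += 1
    return counts

counts = simulate_left_handers()
# Proportion of simulated samples with four or more left-handers
p_four_or_more = sum(counts[4:]) / sum(counts)
print(p_four_or_more)
print(p_four_or_more > 0.05)   # True: the observed sample is not in the extreme 5%
```

Changing the seed will change the result slightly, which is the behaviour the text describes for repeated runs of 10 000 samples.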
20.3 Comparing proportions among two or more independent samples
Life scientists often need to compare the proportions in categories among two or more samples to test the null hypothesis that they have come from
the same population. Unlike the previous example, there are no expected proportions; instead, these tests examine whether the proportions in each category are heterogeneous among samples.

Table 20.2 Data for samples of 20 cane toads taken at each of three locations in Queensland, Australia and dissected to see if they were infected with intestinal parasites.

               Rockhampton   Bowen   Mackay
Infected            12          7      14
Uninfected           8         13       6
20.3.1 The chi-square test for heterogeneity

Here is an example for three samples, each containing two mutually exclusive categories. The cane toad, Bufo marinus, was deliberately introduced to Australia in an attempt to control insect pests of sugar cane and has since become extremely abundant in Queensland. Unfortunately, the cane toad is now classified as a pest because it preys on a wide variety of small native species and can poison animals that attack it. The population density of cane toads appears to have peaked, and subsequently decreased in some areas of Queensland, so conservation biologists are sampling these in an attempt to find out if the toads are being affected by parasites or pathogens that might be useful as biological control agents.

A researcher decided to test the hypothesis that the proportion of cane toads with intestinal parasites was the same in three different areas of Queensland, so they sampled 20 toads from each area, dissected them and categorised each toad as being either infected or uninfected (in that none were detected) with intestinal parasites. The researcher did not have a preconceived hypothesis about the expected proportions of infected and uninfected toads; they simply wanted to compare the three samples.

The data are in Table 20.2. This format is often called a contingency table. For this example, it has three columns (the sample locations) and two rows (the numbers of infected and uninfected toads), thereby giving a table of six cells with a number in each. These data are used to calculate an expected frequency for each of the six cells. This is done by first calculating the row and column totals (Table 20.3(a)),
which are often called the marginal totals and are the overall proportions of infected and uninfected toads in the entire sample (shown in the right-hand column of Table 20.3(a)). Therefore, under the null hypothesis of no difference in the proportions among locations, each will have the same proportion of infected toads. To obtain the expected frequency for any cell, the column total and the row total corresponding to that cell are multiplied together and divided by the grand total. For example, in Table 20.3(b) the expected frequency of infected toads in a sample of 20 from Rockhampton is (20 × 33) ÷ 60 = 11 and the expected frequency of uninfected toads from Mackay is (20 × 27) ÷ 60 = 9. After the expected frequencies have been calculated for all cells, equation (20.1) is used to calculate the chi-square statistic. The number of degrees of freedom for this analysis is one less than the number of columns multiplied by one less than the number of rows because all but one of the values within each column and each row are free to vary, but the final one is not because of the fixed marginal total. Here, therefore, the number of degrees of freedom is 2 × 1 = 2. The smallest contingency table possible has two rows and two columns (this is called a 2 × 2 table), which will give a chi-square statistic with only one degree of freedom.

Table 20.3 (a) The marginal totals for the data in Table 20.2. To obtain the expected frequency for any cell, its row and column total are multiplied together and divided by the grand total. (b) Note that the expected frequencies at each location (11:9) are the same and also correspond to the proportions of the marginal totals (33:27).

(a) Observed frequencies and marginal totals

                 Rockhampton   Bowen   Mackay   Row totals
Infected              12          7      14         33
Uninfected             8         13       6         27
Column totals         20         20      20     Grand total: 60

(b) Expected frequencies calculated from the marginal totals

                 Rockhampton   Bowen   Mackay   Row totals
Infected              11         11      11         33
Uninfected             9          9       9         27
Column totals         20         20      20     Grand total: 60
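The calculation of expected frequencies from the marginal totals, followed by equation (20.1), can be sketched as below. This is an illustrative Python example rather than a reference implementation; the function names `expected_frequencies` and `chi_square_heterogeneity` are hypothetical, and the table is assumed to be a list of rows with no empty row or column.

```python
def expected_frequencies(table):
    """Expected frequency of each cell = row total x column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]

def chi_square_heterogeneity(table):
    """Return (chi-square, degrees of freedom) for a contingency table."""
    expected = expected_frequencies(table)
    chi2 = sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(table, expected)
               for o, e in zip(obs_row, exp_row))
    # (rows - 1) x (columns - 1)
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df

# Table 20.2: rows are infected and uninfected toads;
# columns are Rockhampton, Bowen and Mackay.
toads = [[12, 7, 14],
         [8, 13, 6]]
print(expected_frequencies(toads))   # [[11.0, 11.0, 11.0], [9.0, 9.0, 9.0]]
chi2, df = chi_square_heterogeneity(toads)
print(round(chi2, 2), df)            # 5.25 2
```

The statistic would then be compared with the critical value of chi-square for two degrees of freedom (5.99 at the 5% level).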
20.3.2 The G test or log-likelihood ratio

The G test or log-likelihood ratio is another way of estimating the chi-square statistic. The formula for the G statistic is:

$$G = 2 \sum_{i=1}^{n} f_i \ln\left(\frac{f_i}{\hat{f}_i}\right) \qquad (20.3)$$

This means ‘The G statistic is twice the sum, over all cells, of each observed frequency multiplied by the natural logarithm of that observed frequency divided by its expected frequency’. The formula will give a statistic of 0 when each expected frequency is equal to its observed frequency, but any discrepancy will give a positive value of G. The G statistic has the same degrees of freedom as χ², and is so similar that the critical values for the latter can be used. Some statisticians recommend the G test and others recommend the chi-square test. There is a summary of tests recommended for categorical data in Section 20.8.
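As an illustration, equation (20.3) can be applied to the cane toad data of Table 20.2, using the expected frequencies of 11 infected and 9 uninfected toads at each location. This is a sketch only; the function name `g_statistic` is invented here, and treating a cell with an observed frequency of zero as contributing nothing to the sum is an assumption of this example (it is the usual convention, but the text does not discuss it).

```python
import math

def g_statistic(observed, expected):
    """Equation (20.3): twice the sum of f * ln(f / f_hat) over all cells."""
    # Cells with an observed frequency of 0 are skipped (assumed to contribute 0)
    return 2.0 * sum(o * math.log(o / e)
                     for o, e in zip(observed, expected) if o > 0)

# Cane toad data, listed cell by cell: infected then uninfected
observed = [12, 7, 14, 8, 13, 6]
expected = [11, 11, 11, 9, 9, 9]
print(round(g_statistic(observed, expected), 2))   # 5.32

# Perfect agreement between observed and expected gives a statistic of 0
print(g_statistic(expected, expected))             # 0.0
```

For these data the chi-square statistic from equation (20.1) is about 5.25, so the two statistics are very similar, as the text states.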
20.3.3 Randomisation tests for contingency tables

A randomisation test procedure similar to the one discussed in Section 20.2.1 for goodness of fit tests can be used for any contingency table. First, the marginal totals of the table are calculated and give the expected proportions when there is no difference among samples. Then the Monte Carlo method is used to repeatedly ‘sample’ an hypothetical population containing these proportions with the constraint that both the column and row totals are fixed. Randomisation tests are available in some statistical packages.
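One common way to keep both sets of marginal totals fixed is to shuffle the category labels among the individuals while leaving each individual in its original sample. The Python sketch below takes that approach; it is an assumption about how such a test might be implemented, not a description of any particular package, and the function names, seed and number of iterations are chosen only for illustration.

```python
import random

def chi_square(table):
    """Chi-square statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    return sum((table[i][j] - r * c / grand) ** 2 / (r * c / grand)
               for i, r in enumerate(row_totals)
               for j, c in enumerate(col_totals))

def randomisation_test(table, n_iter=5000, seed=1):
    """Proportion of shuffled tables at least as extreme as the observed one."""
    rng = random.Random(seed)
    # One entry per individual, in the same cell-by-cell order in both lists
    cols = [j for row in table for j, n in enumerate(row) for _ in range(n)]
    rows = [i for i, row in enumerate(table) for n in row for _ in range(n)]
    observed = chi_square(table)
    hits = 0
    for _ in range(n_iter):
        rng.shuffle(rows)   # reassign categories; row and column totals stay fixed
        sim = [[0] * len(table[0]) for _ in table]
        for i, j in zip(rows, cols):
            sim[i][j] += 1
        if chi_square(sim) >= observed - 1e-9:   # tolerance for rounding error
            hits += 1
    return hits / n_iter

toads = [[12, 7, 14], [8, 13, 6]]   # Table 20.2
print(randomisation_test(toads))    # estimated probability under the null hypothesis
```

Because the result is a simulation, it will vary slightly with the seed and the number of iterations.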
20.4 Bias when there is one degree of freedom
When there is only one degree of freedom and the total sample size is less than 200, the calculated value of chi-square has been found to be inaccurate because it is too large. The inflated value of the statistic will give a
probability that is smaller than appropriate and therefore increase the risk of Type 1 error. This bias increases as sample size decreases, so the following formula, called Yates’ correction or the continuity correction, was designed to improve the accuracy of the chi-square statistic for small samples with one degree of freedom. Yates’ correction removes 0.5 from the absolute difference between each observed and expected frequency. The absolute difference is used because it converts all differences to positive numbers, which will be reduced by subtracting 0.5. (Otherwise, any negative values of $o_i - e_i$ would have to be increased by 0.5 to make their absolute size and the square of that smaller.) The absolute value is the positive of any number and is indicated by enclosing the number or its symbol by two vertical bars (e.g. |−6| = 6). The subscript ‘adj’ after the value of chi-square means it has been adjusted by Yates’ correction.

$$\chi^2_{adj} = \sum_{i=1}^{n} \frac{(|o_i - e_i| - 0.5)^2}{e_i} \qquad (20.4)$$
From equation (20.4), it is clear that the compensatory effect of Yates’ correction will become less and less as sample size increases. Some authors (e.g. Zar, 1999) recommend that Yates’ correction is applied to all chi-square tests having only one degree of freedom, but others suggest it is unnecessary for large samples and recommend the use of the Fisher Exact Test (see Section 20.4.1 below) for smaller ones.
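To see the size of the adjustment, equation (20.4) can be applied to the handedness data of Table 20.1 (observed 4:16, expected 2:18). This short Python sketch uses a function name, `yates_chi_square`, that is made up for the illustration.

```python
def yates_chi_square(observed, expected):
    """Equation (20.4): subtract 0.5 from each absolute difference before squaring."""
    return sum((abs(o - e) - 0.5) ** 2 / e for o, e in zip(observed, expected))

def chi_square(observed, expected):
    """Equation (20.1), for comparison."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

observed = [4, 16]   # left-handed, right-handed statisticians
expected = [2, 18]
print(round(chi_square(observed, expected), 2))   # 2.22
print(yates_chi_square(observed, expected))       # 1.25
```

The corrected statistic is smaller than the uncorrected one, which is the intended effect of the correction for a small sample with one degree of freedom.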
20.4.1 The Fisher Exact Test for 2 × 2 tables

The Fisher Exact Test accurately calculates the probability that two samples, each containing two categories, are from the same population. This test is not subject to bias and is recommended when sample sizes are small or when more than 20% of expected frequencies are less than five, but it can be used for any 2 × 2 contingency table. The Fisher Exact Test is unusual in that it does not calculate a statistic that is used to estimate the probability of a departure from the null hypothesis. Instead, the probability is calculated directly.

The easiest way to explain the Fisher Exact Test is with an example. Table 20.4 gives data for the incidence of motion sickness for ten students
making their first flight in a small aircraft with seating for only ten passengers. Five of the students were chosen at random and warned they were likely to experience motion sickness. The remaining five were not warned. These frequencies are too small for accurate analysis using a chi-square test. If there were no effect of being warned about motion sickness, you would expect, under the null hypothesis, that the incidence within each treatment would be the same as the marginal totals (Table 20.5), with any departures being due to chance. The Fisher Exact Test uses the following procedure to calculate the probability of an outcome equal to or more extreme than the one observed, which can be used to decide whether it is statistically significant.

Table 20.4 Data for the reported incidence of motion sickness for ten students, five of whom had been warned about motion sickness and five of whom had not. The marginal totals show that four students were affected but six were not.

                 Warned about       Not warned about    Row totals
                 motion sickness    motion sickness
Sick                   4                  0                 4
Not sick               1                  5                 6
Column totals          5                  5                10

Table 20.5 Under the null hypothesis that there is no effect of being warned about motion sickness, the expected proportions of affected and unaffected students in each group (2:3 and 2:3) will correspond to the marginal totals for the two rows (4:6). The proportions of students warned and not warned (2:2) and (3:3) will also correspond to the marginal totals for the two columns (5:5).

                 Warned about       Not warned about    Row totals
                 motion sickness    motion sickness
Sick                   2                  2                 4
Not sick               3                  3                 6
Column totals          5                  5                10
Table 20.6 The total set of possible outcomes for the number of students suffering or not suffering from motion sickness subject to the constraint that there are five in each of the two groups of warned and not warned, and four students were sick but six were not. The most likely outcome, where the proportions are the same in both groups, is shown in the central box (c). The actual observed outcome is case (a).

             (a) Observed    (b)             (c) Expected    (d)             (e)
             Warned   Not    Warned   Not    Warned   Not    Warned   Not    Warned   Not
Sick            4      0        3      1        2      2        1      3        0      4
Not sick        1      5        2      4        3      3        4      2        5      1
First, the four marginal totals are calculated, as shown in Table 20.5. Second, all of the possible ways in which the data can be arranged within the four cells of the 2 × 2 table are listed, subject to the constraint that the marginal totals must remain unchanged. This is the total set of possible outcomes for the combined sample. For these marginal totals, the most likely outcome under the null hypothesis of no difference between the samples is shown in Table 20.5 and identified as (c) in Table 20.6. For a sample of ten students, five who had been warned about motion sickness and five who had not, together with the constraint that four were subsequently sick and six were not, there are five possible outcomes (Table 20.6). To obtain these, you start with the outcome expected under the null hypothesis (c), choose one of the four cells (it does not matter which) and add one to that cell. Next, adjust the values in the other three cells so the marginal totals do not change. Continue with this procedure until the number within the cell you have chosen cannot be increased any further without affecting the marginal totals. Then go back to the expected outcome under the null hypothesis and repeat the procedure by subtracting one from the same cell until the number in it cannot decrease any further without affecting the marginal totals (Table 20.6). Third, the actual outcome is identified within the total set of possible outcomes. For this example, it is case (a) in Table 20.6. The probability of this outcome, together with any more extreme departures in the same direction from the one expected under the null hypothesis (here there are none more extreme than (a)) can be calculated from the probability of
getting this particular arrangement within the four cells by sampling a set of ten students, four of which are sick and six of which are unaffected, with five in each group. This is similar to the example of drawing beads from a sack that was used to introduce hypothesis testing in Chapter 6. Here, however, a very small group is sampled without replacement, so the initial probability of selecting a student who is sick is 4/10, but if one of these is drawn, the probability of drawing another sick student is now 3/9 (and 6/9 unaffected). I deliberately have not given this calculation because it is long and tedious, and most statistical packages do it as part of the Fisher Exact Test. The calculation gives the exact probability of getting the observed outcome or a more extreme departure in the same direction from that expected under the null hypothesis. This is a one-tailed probability because the outcomes in the opposite direction (e.g. on the right of (c) in Table 20.6) have been ignored. For a two-tailed hypothesis, you simply double the probability. If the probability is less than 0.05, the observed outcome is considered statistically significant.
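Although the calculation is tedious by hand, it is short in code. The sketch below assumes the standard hypergeometric formula for the probability of a single 2 × 2 table with fixed marginal totals, which the text does not give, and the function names are invented for this illustration. It computes the one-tailed probability for case (a) of Table 20.6.

```python
from math import comb

def table_probability(a, b, c, d):
    """Probability of the table [[a, b], [c, d]] given its marginal totals."""
    return comb(a + b, a) * comb(c + d, c) / comb(a + b + c + d, a + c)

def fisher_one_tailed(a, b, c, d):
    """Probability of the observed table or a more extreme one in the same direction."""
    row1, col1, n = a + b, a + c, a + b + c + d
    expected_a = row1 * col1 / n
    smallest = max(0, row1 + col1 - n)   # limits set by the marginal totals
    largest = min(row1, col1)
    if a >= expected_a:
        values = range(a, largest + 1)
    else:
        values = range(smallest, a + 1)
    return sum(table_probability(x, row1 - x, col1 - x, n - row1 - col1 + x)
               for x in values)

# Table 20.4: sick (4 warned, 0 not warned), not sick (1 warned, 5 not warned)
p_one_tailed = fisher_one_tailed(4, 0, 1, 5)
print(round(p_one_tailed, 4))       # 0.0238
print(round(2 * p_one_tailed, 4))   # 0.0476, doubling for a two-tailed hypothesis
```

The five outcomes in Table 20.6 have probabilities that sum to 1, which is a useful check on the calculation.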
20.5 Three-dimensional contingency tables
The contingency tables described in this chapter are two-dimensional, but three-dimensional tables can also be analysed. For example, if you had two or more samples within which two categorical variables have been measured on each individual (e.g. a person’s gender and whether they are left- or right-handed), these would give a contingency table consisting of a three-dimensional block of cells with one column and two rows. Three-dimensional chi-square analyses are described in more advanced texts.
20.6 Inappropriate use of tests for goodness of fit and heterogeneity
Tests for goodness of fit and contingency tables assume that the data are mutually exclusive and contingent upon one another. It is also assumed that the categories are the entire set possible within each sample. Occasionally, these tests are misused and the most common case is when samples are incorrectly considered as categories, as shown in the following example.
Table 20.7 Data for the number of N. subtidalis found in five traps after six hours, four of which contained different baits and one left empty as a control. The numbers in each trap are not all the possible mutually exclusive categories, nor are they contingent upon the numbers within any other, so the data are unsuitable for analysis by a goodness of fit test.

Crab   Oyster   Fish   Prawn   Control
 14       1      16     17        2
Marine ecologists often use small traps baited with dead or damaged fish or crustaceans to sample benthic scavengers, such as whelks. A researcher was interested in comparing the numbers of the scavenging whelk Nassarius subtidalis in traps baited with four different baits (crab, oyster, fish and prawn), so they placed one trap containing each bait, plus an empty trap as a control, on the seabed in an area where N. subtidalis was common. The traps were left for six hours, retrieved and the number of N. subtidalis inside them counted (Table 20.7). Fifty whelks were trapped. The data were analysed using a chi-square test for goodness of fit, with the null hypothesis that equal numbers of whelks (in this case ten, since 50 were caught in total and there were five traps) would be expected in each trap. Unfortunately, these data are not suitable for a goodness of fit test because the five treatments are not all the possible mutually exclusive categories and are not contingent upon each other. This is clear if you consider that a whelk that did not enter a particular trap would not have to enter any other. The numbers in each trap are actually single samples from each treatment. In contrast, if the whelks caught within each trap were subdivided into the mutually exclusive categories of male and female, it would be appropriate to use a test for heterogeneity to test the (very different) hypothesis that the sex ratio does not vary among treatments, because the two sexes are mutually exclusive and contingent categories within each treatment (Table 20.8). To avoid the pitfall of confusing categories and samples, you need to ask yourself ‘Do I have data for categories that are mutually exclusive and contingent, or are my “categories” really separate independent samples?’
Table 20.8 Data for the number of N. subtidalis found in five traps after six hours. The categories of male and female whelks are mutually exclusive and contingent so these data are suitable for a contingency table analysis comparing the proportions of each sex among bait types.

          Crab   Oyster   Fish   Prawn   Control
Male        6       1       7      7        0
Female      8       0       9     10        2
20.7 Comparing proportions among two or more related samples of nominal scale data

Table 20.9 The background occupied by 12 geckos before and after being exposed to the silhouette of a predatory bird. B = black background, W = white background. These two samples are not independent because they contain the same 12 individuals.

Gecko number   1   2   3   4   5   6   7   8   9   10   11   12
Before         B   B   W   B   W   W   B   W   B   W    W    W
After          B   B   B   B   B   W   B   B   B   B    B    B
If you have measured the same variable more than once on each experimental or sampling unit, the samples are not independent and need to be analysed using a test for related samples. Table 20.9 gives an example of two related samples of nominal scale data from a laboratory experiment where 12 individually numbered geckos of the same species were placed in an arena with a background tiled as an alternating ‘checkerboard’ pattern of large black and white squares. One hour later the background type (black or white) occupied by each gecko was recorded. Next, the silhouette of a predatory bird known to eat geckos was displayed above the arena and the background occupied by each individual recorded a second time. The null hypothesis was that geckos would show no change in the background occupied before and after the sight of the predator, while the alternate hypothesis was that they would change to either a darker or lighter background. It is not appropriate to analyse these data with a test that compares two independent samples.

The McNemar test for the significance of changes compares two related samples of nominal scale data in two categories. The data in Table 20.9 are summarised in a 2 × 2 table giving the number of individuals in all four possible combinations of categories and samples. These are (a) black before and after, (b) black before and white after, (c) white before and black after and (d) white before and after (Table 20.10). The null hypothesis predicts that there will be no difference in the proportions of geckos on each background between the two samples, while the alternate predicts there will be a difference. Therefore, under the null hypothesis, the sample of geckos that did change backgrounds (which are cases (b) and (c) above) would be expected to consist of equal numbers that changed from black to white and from white to black, so you would expect cells (b) and (c) of Table 20.10 to contain equal frequencies. If, however, the background preference differed before and after exposure to the silhouette, the frequencies in these two cells would be expected to be unequal.

Table 20.10 The numbers of geckos on black and white backgrounds before and after exposure to the silhouette of a predatory bird. Two cells show the individuals whose background preference changed and these are (b) from black to white and (c) from white to black.

                After
Before          Black     White
Black           (a) 5     (b) 0
White           (c) 6     (d) 1
In this example, six geckos changed backgrounds, so three would be expected to change from black to white and vice versa. The McNemar test ignores categories (a) and (d) where no change has occurred and compares only the observed and expected frequencies in cells (b) and (c) using a goodness of fit test such as the chi-square, exact or randomisation test for two mutually exclusive categories discussed earlier in this chapter, or the exact probability calculated from the binomial distribution discussed in Chapter 6. If there is a statistically significant difference between the numbers in each of these two categories, it indicates a change between the two samples. For three or more related samples of nominal scale data in two categories, the Cochran Q test is an extension of the McNemar test. These tests are also included in most statistical packages.
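For the gecko data, the comparison of cells (b) and (c) can be sketched as follows. The example shows two of the options mentioned above: the uncorrected chi-square statistic for the two cells, and an exact two-tailed probability from the binomial distribution. The function names are hypothetical, and doubling the smaller tail (capped at 1) is one common way of obtaining a two-tailed exact probability, assumed here for illustration.

```python
from math import comb

def mcnemar_chi_square(b, c):
    """Chi-square for cells (b) and (c), each expected to hold half the changes."""
    expected = (b + c) / 2
    return (b - expected) ** 2 / expected + (c - expected) ** 2 / expected

def mcnemar_exact(b, c):
    """Two-tailed exact probability from the binomial distribution with p = 0.5."""
    n = b + c
    k = min(b, c)
    # Probability of k or fewer changes in the less common direction
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Table 20.10: (b) 0 geckos changed from black to white,
# (c) 6 changed from white to black.
print(mcnemar_chi_square(0, 6))   # 6.0
print(mcnemar_exact(0, 6))        # 0.03125
```

With only six individuals changing background, the expected frequencies are small, so the exact probability is likely to be the safer of the two.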
20.8 Recommended tests for categorical data
Several tests have been developed for data that are frequencies in mutually exclusive and contingent categories. The following are broad recommendations. When comparing the frequencies in two or more categories within a single sample to their expected proportions, Yates’ corrected chi-square can be used where no more than 20% of expected frequencies are less than five. A randomisation test can be used for any sized sample. For 2 × 2 contingency tables, the Fisher Exact Test will give an unbiased probability and is available in most statistical packages. For contingency tables with more than two rows and columns, the chisquare or G test can be used if no more than 20% of expected frequencies are less than five. A randomisation test will give an unbiased probability for any sized sample. For related samples, a McNemar test (two samples) or a Cochran Q test (three or more samples) can be used.
20.9 Reporting the results of tests for categorical data
For a single sample chi-square test, the results are usually reported as the statistic, with the degrees of freedom as a subscript, and the probability. For example, the chi-square test in Table 20.1 could be reported as ‘A single sample chi-square test showed no significant difference in the proportion of
left- to right-handed people in the sample of 20 statisticians compared to the known proportions in the population (χ²₁ = 2.22, NS)’. For a significant result, you are likely to include details about how the proportions differed between the sample and the population such as ‘The proportion of left-handed people was significantly greater in the sample of statisticians compared to the population.’

A chi-square test on the proportions in a contingency table could be reported as ‘A chi-square test showed a significant difference in the proportions of left- and right-handed people in the sample of engineers compared to the sample of boilermakers (2 × 2 contingency table: χ²₁ = 6.43, P < 0.01).’ Here too, for a significant result you might give more detail such as ‘A chi-square test showed the sample of engineers contained a significantly greater proportion of left-handed individuals compared to the sample of boilermakers (2 × 2 contingency table: χ²₁ = 6.43, P < 0.01)’ or include the contingency table in the results. If you used Yates’ correction, you might write ‘A chi-square test showed a significant difference in the proportion of left- to right-handed people in the two groups (2 × 2 contingency table: Yates’ corrected χ²₁ = 6.43, P < 0.01).’ Results of a G test could be reported as ‘A G test showed a significant difference in the proportions of left- and right-handed people between the two groups (2 × 2 contingency table: χ²₁ = 6.43, P < 0.01).’

For a Fisher Exact Test only the probability is given. For example, ‘A two-tailed Fisher Exact Test showed a significant difference in the proportion of left- and right-handed people in the two groups (P = 0.002).’ The result of a randomisation test (with 10 000 iterations) could be reported as ‘A two-tailed randomisation test using the Monte Carlo method and 10 000 iterations showed a significant difference in the proportion of left- and right-handed people in the two groups: P = 0.002.’ Here too, for a significant result you might include information about which group contained the greater proportion of left-handers.

For a related-samples test, you might report ‘A McNemar test showed a significant change in the proportions of geckos on the two backgrounds before and after exposure to the silhouette of a predatory bird (χ²₁ = 6.43, P < 0.01)’, but it would be more helpful for your reader if you gave more information: ‘A McNemar test showed a significant increase in the proportion of geckos on the black background after exposure to the silhouette of a predatory bird (χ²₁ = 6.43, P < 0.01).’
20.10 Questions
(1) The ratio of left-handed to right-handed people in the human population is about 1:9. My first-year university tutorial group of 14 contained 13 left-handers and one right-hander. Students had been assigned to tutorial groups at random. (a) How could you analyse these data? (b) Is this departure from the expected proportion of left-handers significant? What could you conclude about this occurrence?
(2) To help understand how the value of the chi-square statistic is related to departures from expected proportions, it is useful to use a statistical package to compare a sample of contrived data to the expected proportions for a population. Use expected proportions of 1:1 (e.g. males and females) and a sample size of 100. (a) Initially, set the observed numbers in each category as 50:50 and run a single sample chi-square test. What would you expect the value of chi-square to be? (b) Now, change the sample numbers to 45 and 55 and rerun the test. Repeat this for increasing departures from the expected ratio (e.g. 40 and 60, 35 and 65, 30 and 70). What happens to the value of chi-square? What happens to the probability?
(3) A dental researcher examined the teeth of two identical adult twins. The first had seven filled teeth and the second had 15. The researcher compared these numbers using a chi-square test and the expected numbers of 11:11 filled teeth. Was this test appropriate? Please discuss.
(4) (a) Do you think the experiment described in Section 20.7 was well designed? (b) Might there be other possible reasons for the change in behaviour of the 12 geckos after exposure to the silhouette? (c) How might you improve the experimental design to take other possible influences into account?
21 Non-parametric tests for ratio, interval or ordinal scale data
21.1 Introduction
This chapter describes some non-parametric tests for ratio, interval or ordinal scale univariate data. These tests do not use the predictable normal distribution of sample means, which is the basis of most parametric tests, to assess whether samples are from the same population. Because of this, non-parametric tests are generally not as powerful as their parametric equivalents, but if the data are grossly non-normal and cannot be satisfactorily improved by transformation it is necessary to use one.
Non-parametric tests are often called ‘distribution free tests’, but most nevertheless assume that the samples being analysed are from populations with the same distribution. They should not be used where there are gross differences in distribution (including the variance) among samples, and the general rule that the ratio of the largest to smallest sample variance should not exceed 4:1, discussed in Chapter 14, also applies.
Many non-parametric tests for ratio, interval or ordinal data calculate a statistic from a comparison of two or more samples and work in the following way. First, the raw data are converted to ranks. For example, the lowest value is assigned the rank of ‘1’, the next highest ‘2’ etc. This transforms the data to an ordinal scale (see Chapter 3) with the ranks indicating only their relative order. Under the null hypothesis that the samples are from the same population, you would expect a similar range of ranks within each, with differences among samples only occurring by chance. Second, a statistic that reflects any differences in the ranks among samples is calculated and its value compared to the expected distribution of this statistic when all the samples have been taken from the same population. If the calculated value falls within the range generated by the most extreme 5% of departures from the null hypothesis, the result is considered statistically significant. Most statistical packages give the value of the test statistic and the probability of that outcome.
Randomisation and exact tests can be used to compare two or more samples of ratio, interval or ordinal scale data and are also described in this chapter.
21.2 A non-parametric comparison between one sample and an expected distribution
The Kolmogorov–Smirnov one-sample test can be used to compare the distribution of a single sample to a known or expected distribution. This test compares the shapes of both distributions, so a significant difference between a sample and the population may not necessarily indicate the two have different means. Therefore, when you do get a significant result you may need to inspect the two distributions (e.g. by graphing) to see how they differ. The principle behind the Kolmogorov–Smirnov one-sample test is extremely straightforward and uses a comparison between the cumulative relative frequency distributions (Section 3.3.3) of a sample and a known or expected distribution. The cumulative frequency distribution is a graph of the progressive total of cases (starting at 0 and finishing at the sample size) plotted on the Y axis against the increasing value of the variable on the X axis (see Section 3.3.2). These data can be expressed as the cumulative relative frequency distribution by dividing each of the progressive totals by the sample size, thereby giving a plot with the data on the Y axis having a minimum of 0 and maximum of 1. If the distribution of a sample is exactly the same as an expected distribution, then the plots of their cumulative relative frequencies will be identical. If the two distributions are different, their cumulative relative frequency plots will also be different. When samples are taken at random from a population, most will have cumulative relative frequency distributions that are the same or very similar to that of the population, but there will be departures due to chance. The Kolmogorov–Smirnov one-sample test calculates a statistic, D(max), that increases as the discrepancy between the two cumulative distributions increases and which can be used to obtain the probability that the sample has come from the expected population. 
Figures 21.1(a), (b) and (c) give an example for a comparison between the known longevity of the western chilli moth (which can be considered the data for the population) and the longevity of a sample of 5000 moths reared on a newly developed variety of chilli.
Figure 21.1 The Kolmogorov–Smirnov one-sample test compares an observed (sample) distribution to the expected distribution by finding the maximum difference, D(max), between the two lines for the cumulative relative frequencies. The two distributions show (a) the known (population) longevity in days for the western chilli moth and (b) the longevity of a sample of 5000 moths reared on a newly developed variety of chilli. (c) The double-headed arrow shows the maximum difference between the two cumulative distributions. The absolute value of the difference is the D(max) statistic, which will be 0 when the two distributions are identical and increase as they become increasingly different.
First, the data for both the sample and the expected distribution are expressed as cumulative relative frequencies, which will have a minimum value of 0 and increase to a maximum of 1 (Figures 21.1(a) and (b)). Second, these two distributions are plotted on the same graph. Third, the two distributions are compared and the greatest difference between them is identified and expressed as its absolute value (Figure 21.1(c)), which can only be a number of 0 or greater. This is the D(max) statistic. If there is no difference between the two distributions, D(max) will be 0. As the difference between the distributions increases, the value of D(max) will increase and eventually exceed the appropriate critical value for the chosen level of α. A statistical package will calculate D(max) and give the probability. If the sample size is larger than 35, the critical value of D for a two-tailed α of 0.05 is:
$$D_{\text{critical}} = \frac{1.36}{\sqrt{n}} \qquad (21.1)$$
where n is the number of data in the sample only (not the population). If you have a small sample, its cumulative relative frequency distribution is likely to be a markedly ‘stepped’ plot (Figure 21.2), but this will not affect the calculation of D(max).
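As a rough worked illustration of equation (21.1), assuming the sample of 5000 moths described for Figure 21.1 (a sample size well above 35), the critical value would be approximately:

$$D_{\text{critical}} = \frac{1.36}{\sqrt{5000}} \approx \frac{1.36}{70.71} \approx 0.019$$

so, under that assumption, a D(max) larger than about 0.019 would be significant at the two-tailed 5% level. The formula should not be applied to samples of 35 or fewer.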
21.2.1 Comparison with a particular distribution
One of the most common uses of the Kolmogorov–Smirnov one-sample test is a comparison of a sample distribution with an expected distribution that is completely uniform (i.e. flat). Here is an example. A wildlife ecologist obtained data for the numbers of the southern hairy-nosed wombat, Lasiorhinus latifrons (a bulky ground-dwelling mammal weighing up to about 30 kg) accidentally hit and killed by motor cars in 2010 along a 10 km stretch of road near the town of Keith, South Australia. (Unlike most of the examples in this book, the somewhat implausible-sounding hairy-nosed wombat and the town of Keith are real.) The ecologist was interested in whether the distribution of wombat kills was significantly different to a uniform distribution along the road (Figure 21.2(a)), because this might help to decide where to put warning signs and speed restrictions in an attempt to reduce both wombat mortality and vehicle damage. Data for the distance in kilometres from the beginning of the road to the location of each kill are given in Table 21.1. These data were compared to the expected (uniform) distribution of kills along the length of the road (Figure 21.2(a)).
Figure 21.2 The Kolmogorov–Smirnov one-sample test for the data in Table 21.1 for wombat deaths along a 10 km stretch of road. (a) For a uniform occurrence of mortality along the road, the graph of the relative frequencies will be flat. (b) The uniform distribution plotted as the cumulative relative frequency. (c) Comparison between the observed and expected distributions. The relatively small sample size gives a distribution that is ‘stepped’. The vertical arrow shows the value of D(max).
Table 21.1 The locations, in kilometres from the beginning of a 10 km long road running north, of 24 accidental kills of the southern hairy-nosed wombat in 2010.
Distance from beginning (kilometres)
0.34    4.02    6.33
1.01    4.46    6.31
1.33    6.05    6.38
1.81    6.07    6.40
2.42    6.10    6.41
2.73    6.15    6.42
3.15    6.21    6.43
3.67    6.27    9.70

A Kolmogorov–Smirnov one-sample test on these data shows a significant difference between a uniform distribution and the one for wombat fatalities (D(max) = 0.308, n = 24, P < 0.05). By inspection of the three graphs in Figures 21.2(a), (b) and (c) this is because mortality is significantly greater than expected at one quite distinct location about seven km along the road where the line showing the cumulative relative frequency for the sample is above the line for the expected distribution. Importantly, however, greatly reduced mortality could also give a significant result (and, by definition, a positive value of D(max)), but the line for the sample would be below that for the even distribution. This emphasises the need to examine the distributions whenever a significant difference is found, because it would be most inappropriate to simply identify the ‘significant point’ and erect warning signs in an area where fewer than expected deaths had occurred.
Furthermore, the distribution in Figures 21.2(a), (b) and (c) shows deaths were less than expected up to about 6 km from the start of the road, but then very high for the next kilometre, so the most appropriate place for warning signs appears to be the section about 6–7 km from the beginning of the road and not just the 7 km point where the distributions were significantly different. This also emphasises the importance of examining the distributions.
Another common use of the Kolmogorov–Smirnov one-sample test is for assessing whether a sample is normally distributed. It uses the same procedure as described above, except that the cumulative relative frequency curve for a normal distribution is used as the expected distribution. The comparison is available in most statistical packages and some give the option of comparing a sample to a normal, uniform or Poisson distribution.
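The calculation of D(max) for the wombat data in Table 21.1 can be sketched in a few lines of Python. This is only an illustration: the function name `ks_d_max` is hypothetical, and the range of the expected uniform distribution is an assumption. The reported D(max) of 0.308 is matched when the uniform distribution is taken to run from the smallest to the largest observed distance (0.34 to 9.70 km), which is not stated in the text; taking the full 0 to 10 km road gives a slightly larger value.

```python
# Distances (km) of the 24 wombat kills listed in Table 21.1.
kills_km = [
    0.34, 1.01, 1.33, 1.81, 2.42, 2.73, 3.15, 3.67,
    4.02, 4.46, 6.05, 6.07, 6.10, 6.15, 6.21, 6.27,
    6.33, 6.31, 6.38, 6.40, 6.41, 6.42, 6.43, 9.70,
]


def ks_d_max(sample, low, high):
    """Largest absolute gap between the sample's cumulative relative
    frequency and that of a uniform distribution on [low, high]."""
    values = sorted(sample)
    n = len(values)
    d_max = 0.0
    for i, x in enumerate(values, start=1):
        # Expected cumulative proportion under the uniform distribution.
        expected = min(max((x - low) / (high - low), 0.0), 1.0)
        # The sample curve steps from (i - 1)/n up to i/n at x, so the
        # gap is checked on both sides of the step.
        d_max = max(d_max, i / n - expected, expected - (i - 1) / n)
    return d_max


# ASSUMPTION: the expected distribution spans the observed range.
d_observed_range = ks_d_max(kills_km, min(kills_km), max(kills_km))
# Alternative assumption: the expected distribution spans the whole road.
d_full_road = ks_d_max(kills_km, 0.0, 10.0)

print(round(d_observed_range, 3))  # 0.308
print(round(d_full_road, 3))       # 0.315
```

In both cases the largest gap occurs at 6.43 km, where 23 of the 24 kills have already been recorded, which is consistent with the cluster of deaths between 6 and 7 km described above.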
21.2.2 Reporting the results of a Kolmogorov–Smirnov one-sample test
Results of a Kolmogorov–Smirnov one-sample test are usually reported as the sample size, the value of D(max) and its probability. The test on the data in Table 21.1 could be reported as ‘The distribution of road kills was significantly different to that expected if they occurred evenly along the road (D(max) = 0.308, n = 24, P < 0.05). By inspection, the difference was due to significantly greater than expected mortality in a section between 6 and 7 km from the beginning of the road.’

Table 21.2 The height in centimetres of palm seedlings germinated and grown for six weeks in either clay or sandy soil. Ranks are shown in the two right-hand columns, together with the rank sums (R1 and R2) for each treatment.
Height in clay soil    Height in sandy soil    Rank for clay soil    Rank for sandy soil
24                     22                      7                     6
41                     6                       9                     2
17                     11                      5                     3
38                     15                      8                     4
                       4                                             1
n1 = 4                 n2 = 5                  R1 = 29               R2 = 16
21.3 Non-parametric comparisons between two independent samples
21.3.1 The Mann–Whitney test
The Mann–Whitney test is used to obtain the probability that two independent samples are from the same population. First, the values are ranked over both samples as shown in Table 21.2, which gives data for the height of nine palm seedlings grown in two different soil types. The smallest value is given the rank of 1, the next largest the rank of 2, etc., so the largest will have the rank of n1 + n2 (which is the sum of the number of cases in both samples). For the data in Table 21.2, the largest possible rank is 9. If most of the seedlings grew taller in one soil type than the other, the ranks would differ between treatments. In contrast, if the seedlings grew to a similar height in both soil types, the ranks within each treatment would also be similar.
If two or more values are the same (that is, they are tied), each is given the average of the ranks assigned to that many values. For example, if the data in Table 21.2 contained two 4 cm high seedlings and these were the smallest, each would be given the average of ranks 1 and 2, which is 1.5. The ranks are summed separately for each sample (these are R1 and R2 in Table 21.2) and the two Mann–Whitney statistics U and U′ calculated:
$$U = n_1 n_2 + \frac{n_1(n_1 + 1)}{2} - R_1 \qquad (21.2)$$
and:
$$U' = n_1 n_2 + \frac{n_2(n_2 + 1)}{2} - R_2 \qquad (21.3)$$
where n1 and n2 are the size of each sample. These formulae may appear complex, but are easily explained by separating them into three components as shown for U in equation (21.2).
$$U = \underbrace{n_1 n_2}_{\text{component A}} + \underbrace{\frac{n_1(n_1 + 1)}{2}}_{\text{component B}} - \underbrace{R_1}_{\text{component C}} \qquad (21.4)$$
Component A will increase with the size of both samples. Component B will only increase as the size of sample 1 increases. In contrast, component C will be affected by the way the ranks are distributed between the two samples. A lot of low ranks in sample 1 will give a relatively small value of R1 and vice versa. Therefore, because U is calculated by taking component C away from the sum of components A and B, it will be large compared to U′ when sample 1 contains mainly low ranks. In contrast, if sample 1 contains mainly high ranks, the value of U will be small compared to U′. Finally, if both samples contain similar ranks, then neither U nor U′ will be relatively large or small. When both samples are from the same population, most values of U and U′ will be similar, but differences between them will occur by chance and the most extreme 5% of discrepancies will give values of U or U′ that will be equal to, or exceed, a critical value. For a two-tailed test, if either of the U statistics exceeds the critical value, then the probability the samples are from the same population is less than 5%.
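The ranking step and equations (21.2) and (21.3) can be sketched in Python as follows, using the seedling heights from Table 21.2. This is a minimal illustration rather than a replacement for a statistical package; the function names `average_ranks` and `mann_whitney_u` are hypothetical, and the sketch does not look up a critical value or probability.

```python
def average_ranks(values):
    """Rank values from smallest (rank 1) upward; tied values each get
    the average of the ranks they jointly occupy."""
    ordered = sorted(values)
    rank_of = {}
    for v in set(ordered):
        first = ordered.index(v) + 1          # lowest rank occupied by v
        last = first + ordered.count(v) - 1   # highest rank occupied by v
        rank_of[v] = (first + last) / 2
    return [rank_of[v] for v in values]


def mann_whitney_u(sample1, sample2):
    """Return (U, U') as defined in equations (21.2) and (21.3)."""
    n1, n2 = len(sample1), len(sample2)
    ranks = average_ranks(list(sample1) + list(sample2))
    r1 = sum(ranks[:n1])   # rank sum for sample 1
    r2 = sum(ranks[n1:])   # rank sum for sample 2
    u = n1 * n2 + n1 * (n1 + 1) / 2 - r1
    u_prime = n1 * n2 + n2 * (n2 + 1) / 2 - r2
    return u, u_prime


clay = [24, 41, 17, 38]       # sample 1 in Table 21.2
sandy = [22, 6, 11, 15, 4]    # sample 2 in Table 21.2
u, u_prime = mann_whitney_u(clay, sandy)
print(u, u_prime)  # 1.0 19.0
```

Because the clay sample holds mostly high ranks (R1 = 29), U is small and U′ is large. The two statistics always sum to n1 × n2, which is 20 here.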
21.3.2 Randomisation tests for two independent samples
Another way of comparing two independent samples without assuming they are from a normal distribution is to use a randomisation test. These tests were first discussed in relation to samples of categorical data in Chapter 20. If two independent samples are taken from the same population, then the values within each should differ only by chance. A randomisation test takes the combined set of ranks from both samples, which will be a group of size n1 + n2, and samples this at random to give two simulated groups of size n1 and n2. This simulated sampling is iterated several thousand times and used to generate the expected distribution of U and U′ from the data set and therefore identify the most extreme 5% of departures from the outcome expected under the null hypothesis. Finally, the U statistics for the actual outcome are compared to these distributions and if the probability is less than 5%, the result is statistically significant (Figures 21.3(a)–(c)).
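The resampling procedure can be sketched in Python as below, using the ranks from Table 21.2. Several details are assumptions made for illustration only: the function names are hypothetical, the number of iterations and the random seed are arbitrary, and the two-tailed probability is estimated as the proportion of simulated outcomes in which the larger of U and U′ is at least as large as the larger observed statistic.

```python
import random


def u_statistics(ranks1, ranks2):
    """Return (U, U') from the ranks already assigned to each sample."""
    n1, n2 = len(ranks1), len(ranks2)
    u = n1 * n2 + n1 * (n1 + 1) / 2 - sum(ranks1)
    u_prime = n1 * n2 + n2 * (n2 + 1) / 2 - sum(ranks2)
    return u, u_prime


def randomisation_test(ranks1, ranks2, iterations=10_000, seed=1):
    """Estimate the two-tailed probability by repeatedly reshuffling
    the pooled ranks into two groups of the original sizes."""
    rng = random.Random(seed)          # fixed seed so runs are repeatable
    observed = max(u_statistics(ranks1, ranks2))
    pool = list(ranks1) + list(ranks2)
    n1 = len(ranks1)
    as_extreme = 0
    for _ in range(iterations):
        rng.shuffle(pool)              # random reallocation of the ranks
        simulated = max(u_statistics(pool[:n1], pool[n1:]))
        if simulated >= observed:
            as_extreme += 1
    return as_extreme / iterations


clay_ranks = [7, 9, 5, 8]
sandy_ranks = [6, 2, 3, 4, 1]
p_estimate = randomisation_test(clay_ranks, sandy_ranks)
print(p_estimate)  # an estimate that varies slightly with the seed
```

For these data only four of the 126 possible allocations are as extreme as the observed one, so the estimate should settle close to 4/126, or about 0.03.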
21.3.3 Exact tests for two independent samples
A comparison between two samples can also be made with an exact test that works in a very similar way to the Fisher Exact Test described in Chapter 20. An exact test for two independent samples calculates the probability of the actual difference (or values of statistics such as U and U′) between the ranks of the samples, together with any more extreme differences from the outcome expected under the null hypothesis. This gives the one-tailed probability of the outcome. Here is an example for two independent samples with three data in each. The values range from 1 to 6, and the total set of ways in which they can be distributed between two samples is shown in Table 21.3. I have deliberately made the values the same as their ranks and used a simple comparison between the rank sums of the samples. For this example, there are only two combinations that will give the greatest difference between the rank sums. These are when the first sample contains the three lowest (1, 2 and 3) and the second the three highest (4, 5 and 6) ranks and vice versa, giving an absolute difference of nine. Less extreme differences can be obtained from several combinations and are therefore more likely (Table 21.3).
Figure 21.3 Illustration of a randomisation procedure that gives distributions of the two Mann–Whitney statistics U and U′ from simulated sampling. (a) The actual outcome of the experiment. (b) The ranks from both samples are combined in a common pool. (c) The pool of ranks is resampled at random
For example, you may wish to calculate the probability of the observed outcome and any more extreme departures from that expected under the null hypothesis when one sample contains the ranks 1, 2 and 5 (and therefore the other contains ranks 3, 4 and 6). The observed difference between the sums of the ranks is −5. You will find this outcome in the third line from the top of Table 21.3. There are two more extreme differences (−7 and −9) in the same direction (that is, with increasingly negative values) from the outcome expected under the null hypothesis. The probability of each outcome is calculated directly by sampling a set of six values without replacement (e.g. the chance of rank 1 is 1/6, but the chance of then selecting rank 2 is 1/5 etc.). Once calculated, these probabilities are summed to give the one-tailed probability of the observed outcome and any more extreme departures from the null hypothesis. It is one-tailed because differences of the same absolute size in the opposite direction (i.e. the last three lines of Table 21.3, showing differences of 5, 7 and 9) have been ignored, so it has to be doubled to get the two-tailed probability.
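The same probability can be obtained by listing every possible allocation, as in the following Python sketch. It assumes, as in the example, that all 20 ways of choosing three ranks from six are equally likely under the null hypothesis; the helper name `rank_sum_difference` is hypothetical.

```python
from fractions import Fraction
from itertools import combinations

ranks = [1, 2, 3, 4, 5, 6]
observed_sample1 = (1, 2, 5)   # so the other sample holds 3, 4 and 6


def rank_sum_difference(sample1):
    """R1 - R2 when sample 1 holds these ranks and sample 2 the rest."""
    r1 = sum(sample1)
    r2 = sum(ranks) - r1
    return r1 - r2


observed = rank_sum_difference(observed_sample1)

# Every way of allocating three of the six ranks to sample 1.
allocations = list(combinations(ranks, 3))

# Outcomes equal to, or more extreme than, the observed difference
# in the same (negative) direction.
as_extreme = [s for s in allocations if rank_sum_difference(s) <= observed]

one_tailed = Fraction(len(as_extreme), len(allocations))
two_tailed = 2 * one_tailed

print(observed)    # -5
print(one_tailed)  # 1/5
print(two_tailed)  # 2/5
```

Four allocations qualify (one each with differences of −9 and −7, and two with a difference of −5), giving a one-tailed probability of 4/20 = 0.2 and a two-tailed probability of 0.4. Doubling is appropriate here because the distribution of differences is symmetrical.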
21.3.4 Recommended non-parametric tests for two independent samples
Most statistical software includes the Mann–Whitney test. If you have a package with an exact test or randomisation test for two independent samples, either of these is recommended in preference to the Mann–Whitney test.
Caption for Figure 21.3 (cont.) to give two more simulated samples of size n1 and n2, which will thereby generate two new values of R1 and R2. (d) Steps (b) and (c) are repeated several thousand times. Each time, two more values of R1 and R2 are generated. (e) The simulated sampling gives the distributions of U and U′ for two samples taken at random from the same group. By chance, there will often be differences between samples and as these increase so will U or U′. The largest 5% of the values of U and U′ are shown as the filled areas on the right of each graph. Finally, the U statistics from the actual outcome (a) are compared to these distributions. If the probability of getting either U or U′ is less than 5% (i.e. either statistic falls within the filled area), the null hypothesis that the samples in (a) are from the same population is rejected.
Table 21.3 The set of ways in which six ranks can be distributed between two samples of three. Note that the most extreme differences (of −9 and 9) between the sums of the ranks of two samples can only be obtained when one contains ranks 1, 2 and 3 and the other 4, 5 and 6, so these outcomes have a relatively low probability of occurring compared to less extreme differences (e.g. 1 and −1), which can be obtained in several different ways. Within each line, the combination labelled (a), (b) or (c) for Sample 1 is paired with the combination carrying the same label for Sample 2.
Sample 1                            R1    R1 − R2    R2    Sample 2
(a) 1 2 3                           6     −9         15    (a) 4 5 6
(a) 1 2 4                           7     −7         14    (a) 3 5 6
(a) 1 2 5  (b) 1 3 4                8     −5         13    (a) 3 4 6  (b) 2 5 6
(a) 1 2 6  (b) 1 3 5  (c) 2 3 4     9     −3         12    (a) 3 4 5  (b) 2 4 6  (c) 1 5 6
(a) 1 3 6  (b) 1 4 5  (c) 2 3 5     10    −1         11    (a) 2 4 5  (b) 2 3 6  (c) 1 4 6
(a) 1 4 6  (b) 2 3 6  (c) 2 4 5     11    1          10    (a) 2 3 5  (b) 1 4 5  (c) 1 3 6
(a) 1 5 6  (b) 2 4 6  (c) 3 4 5     12    3          9     (a) 2 3 4  (b) 1 3 5  (c) 1 2 6
(a) 2 5 6  (b) 3 4 6                13    5          8     (a) 1 3 4  (b) 1 2 5
(a) 3 5 6                           14    7          7     (a) 1 2 4
(a) 4 5 6                           15    9          6     (a) 1 2 3
21.3.5 Reporting the results
The result of a two-tailed Mann–Whitney test is usually reported as the larger value of the two statistics U or U′, the two sample sizes and the probability. The comparison between the height of palm seedlings in two soil types in Table 21.2 could be reported as ‘A two-tailed Mann–Whitney test showed palm seedlings were significantly taller when grown in clay compared to sandy soil: U = 19, n = 4, 5, P < 0.05.’ (Equations (21.2) and (21.3) give U = 1 and U′ = 19 for these data, and neither statistic can exceed n1 × n2 = 20.) For a one-tailed test, you need to check whether the difference between the samples is in the appropriate direction before concluding that a U statistic which exceeds the critical value is necessarily significant. The result of a randomisation or exact test could be reported as ‘A two-tailed randomisation test for two independent samples showed that palm seedlings were significantly taller when grown in clay compared to sandy soil: P < 0.05’ or ‘A two-tailed exact test for two independent samples showed that palm seedlings were significantly taller when grown in clay compared to sandy soil: P < 0.05.’
21.4 Non-parametric comparisons among three or more independent samples
The most frequently used non-parametric test for more than two independent samples is the Kruskal–Wallis test. It is also called the Kruskal–Wallis single-factor analysis of variance by ranks but this is misleading because it does not use analysis of variance to compare samples. Instead, the Kruskal–Wallis test is an extension of the Mann–Whitney test that can be applied to three or more samples.
21.4.1 The Kruskal–Wallis test
For a Kruskal–Wallis test, the data are ranked in the same way as for a Mann–Whitney test, starting by assigning the lowest rank to the smallest value. Here is an example for the number of sandfly bites on the arms of 16 college students after a two-hour class field trip to a Florida mangrove swamp (Table 21.4). The students were classified into three samples according to their natural hair colour and the null hypothesis was ‘There is no difference in the number of sandfly bites on the arms of people with different hair colour.’

Table 21.4 The number of sandfly bites on both arms of 16 students after spending two hours in a Florida mangrove swamp. Five had black hair, six had brown hair and five were blonde. The totals are the rank sums within each sample.
         Number of sandfly bites                  Number of sandfly bites ranked from the least to most
         Black hair   Brown hair   Blonde hair    Black hair   Brown hair   Blonde hair
         25           31           22             9            13           8
         14           20           4              4            7            1
         35           29           11             14           11           3
         41           15           18             16           5            6
         28           40           8              10           15           2
                      30                                       12
Total                                             R1 = 53      R2 = 63      R3 = 20

It is clear that a marked difference in the number of sandfly bites among samples will also result in a difference in the ranks and rank sums. The rank sums for each sample are used in the following formula for the Kruskal–Wallis statistic H:
$$H = \frac{12}{N(N + 1)} \sum_{i=1}^{k} \frac{R_i^2}{n_i} - 3(N + 1) \qquad (21.5)$$
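where N is the total sample size, k is the number of groups or samples, and R_i and n_i are the rank sum and size of sample i. Although this formula looks complex, it is straightforward when considered as three components:
$$H = \underbrace{\frac{12}{N(N + 1)}}_{\text{component A}} \; \underbrace{\sum_{i=1}^{k} \frac{R_i^2}{n_i}}_{\text{component B}} - \underbrace{3(N + 1)}_{\text{component C}} \qquad (21.6)$$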
Components A and C depend only on the total sample size: as N increases, component A becomes smaller and component C becomes larger. Component B is the sum of the squares of all the rank totals divided by their respective sample sizes. If all Ri values are relatively similar, then component B (and therefore H) will be smaller than when some are large and others small because of the effect of squaring relatively large numbers (Box 21.1). The distribution of H for samples taken at random from the same population has been established and used to identify the 5% most extreme departures from the null hypothesis of no difference. For large samples, or where the number of groups or samples is more than five, the value of H is a close approximation to the chi-square statistic with (k − 1) degrees of freedom (where k is the number of samples) and many statistical packages only give this statistic (and its probability) for the result of a Kruskal–Wallis test.

Box 21.1 The effect of an unequal allocation of ranks on $\sum_{i=1}^{k} R_i^2 / n_i$ in equation 21.6
This example uses three groups with two values in each. Only the ranks of the values are shown. First, the rank sums are identical among groups.
Group A    Group B    Group C
1          2          3
6          5          4
R1 = 7     R2 = 7     R3 = 7
$$\sum_{i=1}^{k} \frac{R_i^2}{n_i} = 3 \times \frac{49}{2} = 73.5$$
Second, the rank sums are different among groups and this gives a larger sum of the squared rank sums.
Group A    Group B    Group C
1          3          5
2          4          6
R1 = 3     R2 = 7     R3 = 11
$$\sum_{i=1}^{k} \frac{R_i^2}{n_i} = \frac{3^2}{2} + \frac{7^2}{2} + \frac{11^2}{2} = 89.5$$
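Equation (21.5) is simple to evaluate once the rank sums are known. The Python sketch below applies it to the rank sums in Table 21.4 and to the two cases in Box 21.1; the function name `kruskal_wallis_h` is hypothetical, and the sketch does not apply any correction for tied values or calculate a probability.

```python
def kruskal_wallis_h(rank_sums, sample_sizes):
    """Kruskal-Wallis H from equation (21.5)."""
    total_n = sum(sample_sizes)
    component_a = 12 / (total_n * (total_n + 1))
    component_b = sum(r ** 2 / n for r, n in zip(rank_sums, sample_sizes))
    component_c = 3 * (total_n + 1)
    return component_a * component_b - component_c


# Rank sums and sample sizes from Table 21.4 (black, brown, blonde hair).
h_sandfly = kruskal_wallis_h([53, 63, 20], [5, 6, 5])
print(round(h_sandfly, 4))  # 6.4985

# The two cases in Box 21.1: identical and unequal rank sums.
h_equal = kruskal_wallis_h([7, 7, 7], [2, 2, 2])
h_unequal = kruskal_wallis_h([3, 7, 11], [2, 2, 2])
print(round(h_equal, 4), round(h_unequal, 4))  # 0.0 4.5714
```

The value for the sandfly data agrees with the H reported for Table 21.4 to within rounding. When the rank sums are identical, component B is at its minimum and H is zero; as the rank sums become more unequal, H increases.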
21.4.2 Exact tests and randomisation tests for three or more independent samples
Randomisation and exact tests on the ranks of three or more independent samples are extensions of the methods described for two independent samples in Section 21.3.
21.4.3 A posteriori comparisons after a non-parametric test
A non-parametric comparison can detect a significant difference among three or more samples, but it cannot show which appear to be from the same or different populations. This was discussed in Chapter 12 in relation to a single-factor parametric ANOVA. If the effect of the variable you are examining is considered fixed, you need to use a non-parametric a posteriori test to compare among samples and these are described in more advanced texts (e.g. Sprent, 1993).
21.4.4 Rank transformation followed by single-factor ANOVA
Another way of analysing data that are grossly non-normal is to run a parametric single-factor ANOVA on the data ranked across all samples (e.g. Table 21.4). This is not a true non-parametric test, but has the advantage of easy a posteriori comparisons when an effect is fixed and the initial ANOVA has shown a significant difference among samples. It is as powerful as applying a Kruskal–Wallis test.
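As an illustration, the F ratio for a single-factor ANOVA on the ranks in Table 21.4 can be calculated with the short Python sketch below. The function name `single_factor_f` is hypothetical, and the sketch only computes the F ratio and its degrees of freedom; obtaining the probability would still require an F table or a statistical package.

```python
def single_factor_f(groups):
    """F ratio (among-group mean square divided by within-group mean
    square) and its degrees of freedom for a single-factor ANOVA."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    group_means = [sum(g) / len(g) for g in groups]
    ss_among = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    )
    ss_within = sum(
        (v - m) ** 2 for g, m in zip(groups, group_means) for v in g
    )
    df_among = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_ratio = (ss_among / df_among) / (ss_within / df_within)
    return f_ratio, df_among, df_within


# Ranks from Table 21.4, assigned across all 16 students.
rank_groups = [
    [9, 4, 14, 16, 10],        # black hair
    [13, 7, 11, 5, 15, 12],    # brown hair
    [8, 1, 3, 6, 2],           # blonde hair
]
f_ratio, df_among, df_within = single_factor_f(rank_groups)
print(round(f_ratio, 2), df_among, df_within)  # 4.97 2 13
```

The only change from an ordinary single-factor ANOVA is that the ranks, rather than the original counts of bites, are supplied as the data.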
21.4.5 Recommended non-parametric tests for three or more independent samples
Most statistical packages include the Kruskal–Wallis test, which is up to 95% as powerful as the equivalent parametric single-factor ANOVA described in Chapter 11. If you have a statistical package that includes an exact or randomisation test, these are recommended in preference to the Kruskal–Wallis test. Several texts recommend using a parametric ANOVA after rank transformation, but it is important to note that this is not a true non-parametric comparison.
21.4.6 Reporting the results of tests for three or more independent samples
The result of a Kruskal–Wallis test is usually reported as the value of H, the size of each sample and the probability. For example, the data in Table 21.4 could be reported as ‘A Kruskal–Wallis test showed there was a significant difference in the number of sandfly bites on the arms of people with different hair colour: H = 6.498, n = 6, 5, 5, P < 0.05.’ If you were interested in specific treatments (e.g. which hair colour recorded the lowest and highest number of bites), you might want to do further a posteriori testing. For sample sizes of more than about eight or more than five samples, the distribution of H is very close to chi-square with degrees of freedom of one less than the number of samples (k − 1) and is usually reported as the chi-square statistic, degrees of freedom and probability.
For a randomisation or an exact test, the data in Table 21.4 could be reported as ‘A randomisation test (or exact test) for three independent samples showed a significant difference in the number of sandfly bites on the arms of people with different hair colour: P < 0.05.’ A single-factor ANOVA applied to the ranks of the data could be reported as ‘A single-factor ANOVA applied to the data ranked across all samples showed a significant difference in the number of sandfly bites on the arms of people with different hair colour: F2,13 = 4.97, P < 0.05.’
21.5
Non-parametric comparisons of two related samples
Related samples were first discussed in Chapter 9. Some examples are when a variable is measured twice (and usually under different conditions) on the same experimental unit, or when the experimental units within one sample or treatment are somehow related to those in a second (e.g. an experiment with two treatments, where a pair of rats is taken from each of several litters and one in each pair assigned to different treatments). There are several non-parametric tests for determining the probability that two related samples have been taken from the same population. These include the Wilcoxon paired-sample test, randomisation tests and exact tests for this statistic.
21.5.1 The Wilcoxon paired-sample test
The Wilcoxon paired-sample test is the non-parametric equivalent of the paired-sample t test. The following example is for two samples taken from each of ten sampling units. In humans the trachea branches into two bronchi which lead to different lungs. The bronchus leading to the right lung is wider, shorter and angled more closely to the vertical than the one leading to the left lung. Not surprisingly, it has been found that inhaled objects are more likely to lodge in the right bronchus, so a pathologist hypothesised the right lung may also receive a greater proportion of inhaled airborne particles, such as dust and smoke, and therefore be more prone to damage. To test this hypothesis the pathologist counted the number of lesions found during post mortem examination of the left and right lungs of ten males who had died of natural causes. The data are in Table 21.5.
For the Wilcoxon test the difference between each pair of related samples is first calculated. This is expressed as the absolute difference and these values are ranked (Table 21.5). Finally, the ranks associated with negative and positive differences are summed separately to give the Wilcoxon statistics T+ and T−. For the data in Table 21.5, the ranks of the positive differences sum to 25 (specimens 1, 2, 4, 5, 6 and 9), while the ranks of the negative differences sum to 30 (specimens 3, 7, 8 and 10).
Under the null hypothesis of no effect of bronchial structure on the number of lesions in each lung, any difference between each pair of related values (and therefore between T+ and T−) would only be expected by chance. If, however, there were an effect of bronchial structure, it would contribute to differences between these two statistics. The values of T+ and T− can be compared to their expected distributions when related samples are taken at random from a population. For a two-tailed test, the null hypothesis is rejected if either T+ or T− is less than a critical value, but for a one-tailed test, the null hypothesis is only rejected if the appropriate T statistic is less than a critical value. For example, if it were hypothesised there were more lesions in the right lung than the left, you would expect a reduction in the number of negative ranks so the null hypothesis would only be rejected if T− were less than the critical value. For large samples, the distributions of both T statistics approximate the standard normal curve, so statistical packages often give the value of the Z statistic and probability as the result of the Wilcoxon test.
Table 21.5 Data for the number of separate lesions found during post mortem examination of the left and right lungs of ten males who died from natural causes (columns b and c). For a Wilcoxon test, the difference between each pair of related data is calculated (column d), expressed as the absolute difference (column e) and the absolute values ranked (column f). The ranks associated with positive differences (column h) and negative differences (column i) are separately summed to give the statistics T+ and T−.
(a)        (b)     (c)    (d)             (e)         (f) Rank of  (g) Sign    (h) Ranks    (i) Ranks
Specimen   Right   Left   Difference      Absolute    absolute     of the      of positive  of negative
number     lung    lung   (right − left)  difference  difference   difference  differences  differences
1          19      15       4              4           2           +           2
2          24      14      10             10           6           +           6
3          16      22      −6              6           4           −                        4
4          28      28       0              0           1           +           1
5          19      11       8              8           5           +           5
6          26       9      17             17           8           +           8
7          16      38     −22             22          10           −                        10
8          27      42     −15             15           7           −                        7
9          18      13       5              5           3           +           3
10         18      37     −19             19           9           −                        9
Total                                                                          T+ = 25      T− = 30
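The calculation summarised in Table 21.5 can be sketched in a few lines of Python. This is only an illustration using the data from the table; the function and variable names are hypothetical. Note that the sketch follows the table in giving the single zero difference (specimen 4) a rank and a positive sign, whereas many statistical packages discard zero differences before ranking, so their output may not match these totals exactly.

```python
# Lesion counts from Table 21.5 (ten specimens).
right = [19, 24, 16, 28, 19, 26, 16, 27, 18, 18]
left = [15, 14, 22, 28, 11, 9, 38, 42, 13, 37]

def average_ranks(values):
    """Rank values from lowest (1) to highest, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2.0 + 1.0   # positions are 0-based
        for pos in range(start, end + 1):
            ranks[order[pos]] = mean_rank
        start = end + 1
    return ranks

differences = [r - l for r, l in zip(right, left)]
ranks = average_ranks([abs(d) for d in differences])

# As in Table 21.5, a zero difference is counted with the positive ranks.
t_plus = sum(rank for d, rank in zip(differences, ranks) if d >= 0)
t_minus = sum(rank for d, rank in zip(differences, ranks) if d < 0)

print("Differences:", differences)
print("Ranks of absolute differences:", ranks)
print("T+ =", t_plus, " T- =", t_minus)
```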
21.5.2 Exact tests and randomisation tests for two related samples
The procedures for randomisation and exact tests on the ranks of two related samples are conceptually similar to the analyses for two independent samples described in Section 21.3.
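The details are in Section 21.3, so the following Python sketch is only one common way of building an exact test for two related samples, assumed here rather than taken from the text. Under the null hypothesis each ranked difference is equally likely to be positive or negative, so every possible assignment of signs to the ten ranks is enumerated and the two-tailed probability is the proportion of assignments giving a T statistic at least as extreme as the one observed. The ranks and the values T+ = 25 and T− = 30 are those of Table 21.5; the variable names are illustrative.

```python
from itertools import product

ranks = list(range(1, 11))         # ranks of the ten absolute differences (Table 21.5)
observed_smaller_t = min(25, 30)   # the smaller of the observed T+ and T-
total_rank_sum = sum(ranks)

as_extreme = 0
n_assignments = 0
for signs in product((0, 1), repeat=len(ranks)):   # 1 = positive, 0 = negative
    t_plus = sum(rank for rank, sign in zip(ranks, signs) if sign)
    t_minus = total_rank_sum - t_plus
    n_assignments += 1
    if min(t_plus, t_minus) <= observed_smaller_t:
        as_extreme += 1

p_two_tailed = as_extreme / n_assignments
print(f"{as_extreme} of {n_assignments} sign assignments are as extreme; "
      f"P = {p_two_tailed:.3f}")
```

A randomisation test would follow the same logic but draw a large random sample of sign assignments instead of listing all of them, which becomes necessary when the number of pairs is too large to enumerate.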
21.6
Non-parametric comparisons among three or more related samples
Tests for three or more related samples include the Friedman test, together with randomisation and exact tests for this statistic.
21.6.1 The Friedman test
The Friedman test is often called the Friedman two-way (or two-factor) analysis of variance by ranks, but this is misleading because it is not equivalent to the two-factor ANOVA discussed in Chapter 13. The Friedman test cannot detect interaction and only examines differences among the levels of one factor so is really analogous to the two-factor ANOVA without replication applied to the randomised block experimental design described in Chapter 15.
The results of an experiment designed to compare the effects of two different antibiotics and an untreated control on the growth of pigs are in Table 21.6. Considerable differences in growth can occur among pigs from different litters so each treatment was assigned one piglet from each of six litters, in a randomised block design with three treatments and six blocks. Piglets in the control treatment were offered unlimited food, while those in the other two treatments were offered unlimited food laced with either antibiotic A or antibiotic B. Data for the increase in weight of each piglet during the next two months are in Table 21.6.
First, ranks are assigned within each block and therefore within each row of Table 21.6. The lowest value in each row is given the rank of ‘1’, the next highest ‘2’, etc., and the highest rank cannot exceed the number of treatments. If the treatments are from the same population, the range of ranks (and the rank sums) for each should also be similar, with any variation due to chance. If, however, there is any effect of either treatment, the ranks and their sums will also differ. For the example in Table 21.6, antibiotic treatment A contains all but one of the highest ranks and antibiotic treatment B contains all but two of the lowest.
Table 21.6 The increase in weight, in kilograms, for piglets from six litters assigned to three different treatments. Piglets in the control treatment were offered unlimited food, while those in treatments A and B were offered unlimited food plus antibiotic A and B respectively.
Litter   Control   Antibiotic A   Antibiotic B   Rank of control   Rank of A   Rank of B
1        2.5       2.7            2.1            2                 3           1
2        1.8       1.9            2.0            1                 2           3
3        4.4       4.7            4.1            2                 3           1
4        2.4       2.6            2.3            2                 3           1
5        5.1       5.3            5.2            1                 3           2
6        1.7       1.9            1.6            2                 3           1
Totals                                           R1 = 10           R2 = 17     R3 = 9
Second, the total of the squared rank sums is calculated. The size of this will depend on the relative size of the rank sums (Box 21.1) with a set of similar rank sums giving a smaller total than a set of dissimilar ones. Finally, the following formula is used to calculate the Friedman statistic $\chi^2_r$:
$$\chi^2_r = \frac{12}{ba(a+1)} \sum_{i=1}^{a} R_i^2 - 3b(a+1) \tag{21.7}$$
where a is the number of treatments or groups and b is the number of blocks. This appears complex, but can be split into three components as shown in equation (21.8) below. The Friedman statistic is obtained by multiplying components A and B together and then subtracting component C:
$$\chi^2_r = \underbrace{\frac{12}{ba(a+1)}}_{\text{component A}} \; \underbrace{\sum_{i=1}^{a} R_i^2}_{\text{component B}} \; - \; \underbrace{3b(a+1)}_{\text{component C}} \tag{21.8}$$
As sample sizes and the number of samples increase, component C will increase and component A (which has these quantities in its denominator) will decrease. If the rank sums are very similar among treatments,
component B will be relatively small so the value of the Friedman statistic will also be small. As the differences among the rank sums increase, component B will increase and therefore give a larger value of the Friedman statistic. Once the statistic exceeds the critical value, beyond which fewer than 5% of values occur when samples are taken from the same population, the outcome is considered statistically significant. This analysis can be up to 95% as powerful as the equivalent two-factor ANOVA without replication for randomised blocks. For designs with more than about 10–15 blocks, the distribution of $\chi^2_r$ is close to the distribution of chi-square with a − 1 (i.e. the number of treatments − 1) degrees of freedom.
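The three components can be checked against the piglet data with a short Python sketch. The data are those of Table 21.6; the function and variable names are hypothetical, and the ranking helper assumes there are no tied values within a litter (true for these data). The final probability uses the chi-square approximation, which for two degrees of freedom has the closed form exp(−x/2); with only six blocks it should be regarded as approximate.

```python
import math

# Increase in weight (kg) for each litter (block) from Table 21.6.
# Each row holds: control, antibiotic A, antibiotic B.
weight_gain = [
    [2.5, 2.7, 2.1],
    [1.8, 1.9, 2.0],
    [4.4, 4.7, 4.1],
    [2.4, 2.6, 2.3],
    [5.1, 5.3, 5.2],
    [1.7, 1.9, 1.6],
]

def rank_within_block(row):
    """Rank one block from lowest (1) to highest; assumes no tied values."""
    ordered = sorted(row)
    return [ordered.index(value) + 1 for value in row]

ranked = [rank_within_block(row) for row in weight_gain]
b = len(weight_gain)       # number of blocks (litters)
a = len(weight_gain[0])    # number of treatments

rank_sums = [sum(block[j] for block in ranked) for j in range(a)]

component_a = 12 / (b * a * (a + 1))
component_b = sum(r ** 2 for r in rank_sums)
component_c = 3 * b * (a + 1)
friedman = component_a * component_b - component_c

# Chi-square approximation; exp(-x / 2) is the upper tail only when a - 1 == 2.
p_approx = math.exp(-friedman / 2)

print("Rank sums:", rank_sums)
print(f"A = {component_a:.4f}, B = {component_b}, C = {component_c}")
print(f"Friedman statistic = {friedman:.2f}, approximate P = {p_approx:.3f}")
```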
21.6.2 Exact tests and randomisation tests for three or more related samples
The procedures for randomisation and exact tests on the ranks of three or more related samples are extensions of the methods for two independent samples.
21.6.3 A posteriori comparisons for three or more related samples
If the Friedman test shows a significant difference among treatments and the factor is fixed, you are likely to want to know which treatments are significantly different (see Section 21.3.3). A posteriori testing can be done and instructions are given in more advanced texts such as Zar (1999).
21.6.4 Reporting the results of tests for two or more related samples
Results of a Wilcoxon paired-sample test can be reported as the value of T, the sample size and the probability. For example, the results of a Wilcoxon paired-sample test on the data in Table 21.5 could be reported as ‘A two-tailed Wilcoxon paired-sample test showed no significant difference in the number of lesions in the right and left lungs: T+ = 25, T− = 30, n = 10, NS.’ Results of a Friedman test on the data in Table 21.6 could be reported as ‘A two-tailed Friedman test showed a significant difference in the growth of piglets in the three feeding treatments: $\chi^2_r$ = 6.33, df = 2, P < 0.05.’
21.7
Analysing ratio, interval or ordinal data that show gross differences in variance among treatments and cannot be satisfactorily transformed
Some data show gross differences in variance among treatments that cannot be improved by transformation and are therefore unsuitable for parametric or non-parametric analysis. A periodontologist was asked to assess the effects of a dental hygiene programme upon the incidence of caries among 14–19-year-old adolescent males. Thirty-two males aged 14 years were chosen at random within a large high school. Sixteen were assigned at random to a ‘hygiene’ treatment and regularly encouraged to eat only three meals a day, carefully clean and floss their teeth after every meal and reduce their consumption of carbonated sugary drinks. The 16 students in the other (control) treatment received no encouragement about dental hygiene. Members of both treatments had regular dental examinations and the number of new cases of dental caries was recorded during the next five years for each individual (Table 21.7). It is clear there are gross differences in variance between the two groups. Furthermore, the distribution of the control group is bimodal, but the hygiene group is not. One solution is to transform the data to a nominal scale and reclassify both samples into two mutually exclusive categories of ‘no new caries’ and ‘new caries’ (Table 21.8), which can be compared using a test for two independent samples of categorical data (Chapter 20).
Table 21.7 The number of cases of new dental caries occurring in two groups of males between the ages of 14 and 19 years. The members of the ‘hygiene’ treatment were encouraged to undertake a rigorous dental hygiene programme and those in the control received no encouragement about dental hygiene.
Control: 4, 7, 4, 10, 2, 7, 1, 9, 3, 9, 12, 7, 5, 4, 15, 10
Hygiene: 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0
Table 21.8 Transformation of the ratio data in Table 21.7 to a nominal scale where students were classified as those with, or without, new caries. The proportions in these two mutually exclusive categories could be compared between treatments with a test for a 2 × 2 contingency table (e.g. a Fisher Exact Test).
                                         Control   Hygiene
Number of students with new caries       16        2
Number of students without new caries    0         14
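To show what such a comparison involves, here is a minimal Python sketch of a Fisher Exact Test on the counts in Table 21.8, written with the standard library only. The function and variable names are hypothetical. The two-tailed probability is computed by one common convention, assumed here: add up the probabilities of every possible table (with the same row and column totals) that is no more likely than the observed one.

```python
from math import comb

# 2 x 2 table from Table 21.8: rows are caries status, columns are treatments.
with_caries = (16, 2)       # control, hygiene
without_caries = (0, 14)    # control, hygiene

row1 = sum(with_caries)                      # students with new caries
row2 = sum(without_caries)                   # students without new caries
col1 = with_caries[0] + without_caries[0]    # students in the control treatment
total = row1 + row2

def table_probability(x):
    """Hypergeometric probability that x control students have new caries."""
    return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

p_observed = table_probability(with_caries[0])

# Two-tailed P: sum over all tables that are no more likely than the observed one.
lowest = max(0, col1 - row2)
highest = min(row1, col1)
p_two_tailed = sum(
    table_probability(x)
    for x in range(lowest, highest + 1)
    if table_probability(x) <= p_observed * (1 + 1e-9)
)

print(f"P(observed table) = {p_observed:.3g}, two-tailed P = {p_two_tailed:.3g}")
```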
21.8
Non-parametric correlation analysis
Correlation analysis was introduced in Chapter 16 as an exploratory technique used to examine whether two variables are related. Importantly, for correlation, there is no expectation that the numerical value of one variable can be predicted from the other, nor is it necessary that either variable is determined by the other. The parametric test for correlation gives a statistic that varies between +1 and −1, with these extremes indicating a perfect positive and a perfect negative linear relationship respectively, while values around 0 show no linear relationship. Although parametric correlation analysis is powerful, it can only detect linear relationships and also assumes that both the X and Y variables are normally distributed. When normality of both variables cannot be assumed, or the relationship between the two variables does not appear to be linear and cannot be remedied by transformation, it is not appropriate to use a parametric test for correlation. The most commonly used non-parametric test for correlation is Spearman’s rank correlation.
21.8.1 Spearman’s rank correlation
This test is extremely straightforward. The two variables are ranked separately, from lowest to highest, and the (parametric) Pearson correlation coefficient calculated for the ranked values. This gives a statistic called Spearman’s rho, which for a population is symbolised by ρs and by rs for a sample. Spearman’s rs and Pearson’s r will not always be the same for the same set of data. For Pearson’s r, the correlation coefficients of 1 or −1 are only obtained when there is a perfect positive or negative straight-line relationship between two variables. In contrast, Spearman’s rs will give a value of 1 or −1 whenever the ranks for the two variables are in perfect agreement or disagreement, which occurs in more cases than a straight-line relationship (Figures 21.4(a)–(f)). The probability of the value of rs can be obtained by comparing it to the expected distribution of this statistic. Most statistical packages will give rs together with its probability.
Figure 21.4 Examples of raw scores, ranks and the Spearman rank correlation coefficient for data with (a) a perfect positive relationship (all points lie along a straight line), (b) no relationship, (c) a perfect negative relationship (all points lie along a straight line), (d) a positive relationship which is not a straight line, but all pairs of bivariate data have the same ranks, (e) a positive relationship with only half the pairs of bivariate data having equal ranks and (f) a positive relationship with no pairs of bivariate data having equal ranks. Note that in (d) the value of rs is 1 even though the raw data do not show a straight-line relationship.
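The relationship between the two coefficients can be sketched in Python by ranking each variable and then applying the Pearson formula to the ranks, exactly as described above. The data are hypothetical (chosen so that Y always increases with X, but along a curve), and the function names are illustrative only.

```python
import math

def average_ranks(values):
    """Rank values from lowest (1) to highest, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = (start + end) / 2.0 + 1.0
        start = end + 1
    return ranks

def pearson_r(x, y):
    """Pearson correlation coefficient for two equal-length lists."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def spearman_rs(x, y):
    """Spearman's rs: the Pearson coefficient calculated on the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))

# Hypothetical data: y rises with x, but not along a straight line.
x = [1, 2, 3, 4, 5, 6]
y = [value ** 3 for value in x]

r = pearson_r(x, y)
rs = spearman_rs(x, y)
print(f"Pearson r = {r:.3f}, Spearman rs = {rs:.3f}")
```

Because the ranks of the two variables agree perfectly, rs reaches 1 while Pearson’s r stays below 1, which is the situation shown in Figure 21.4(d).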
21.8.2 Reporting the results of Spearman’s rank correlation
A Spearman’s rank correlation coefficient is reported as the value of rs, the sample size (which is the number of paired data) and the probability. For example, a Spearman’s rank correlation on the data in Table 17.1 for the number of mites on a person’s eyelashes and age could be reported as ‘There was a significant relationship between the number of mites on a person’s eyelashes and their age: Spearman’s rs = 0.928, n = 10, P < 0.05.’
21.9
Other non-parametric tests
This chapter is an introduction to some non-parametric tests for two or more samples of independent and related data. Other non-parametric tests are described in more specialised but nevertheless extremely well-explained texts, such as Siegel and Castellan (1988).
21.10 Questions
(1) An easy way to understand the process of ranking and the tests that use this procedure is to work with a contrived data set. The following two independent samples will have very similar rank sums.
Group 1: 4, 7, 8, 11, 12, 15, 16, 19, 20
Group 2: 5, 6, 9, 10, 13, 14, 17, 18, 21
(a) Rank the data across both samples and calculate the rank sums.
(b) Use a statistical package to run a Mann–Whitney test on the data. Is there a significant difference between the samples?
(c) Now change the data so you would expect a significant difference between groups. Run the Mann–Whitney test again. Was the difference significant?
(2) The following data are for the number of solar keratoses (these are lesions found on sun-exposed skin) on the hands of two samples of 50-year-old males living in Brisbane, Australia. The first sample contained 23 individuals who worked indoors and the second contained 23 who worked outdoors.
(a) Do the samples appear to have similar distributions?
(b) How might you compare the number of solar keratoses between these samples?
(c) Use your suggested method to test the hypothesis that there is a difference in the incidence of solar keratoses in these groups:
Indoor workers: 0, 0, 1, 1, 2, 1, 1, 1, 2, 0, 1, 0, 1, 0, 1, 1, 2, 2, 2, 3, 2, 2, 0
Outdoor workers: 1, 1, 0, 1, 1, 0, 5, 6, 7, 6, 8, 4, 7, 6, 0, 10, 8, 6, 0, 7, 8, 8, 6
22
Introductory concepts of multivariate analysis
22.1
Introduction
Often life scientists collect samples of multivariate data – where more than two variables have been measured on each sample – because univariate or bivariate data are unlikely to give enough detail to realistically describe the object, location or ecosystem being investigated. For example, when comparing three polluted and three unpolluted lakes, it is best to collect data for as many species of plants and animals as possible, because the pollutant may affect some species but not others. Data for only a few species are unlikely to realistically estimate the effect upon an aquatic community of several hundred species. If all six lakes were relatively similar in terms of the species present and their abundance, it suggests the pollutant has had little or no effect. In contrast, if the three polluted lakes were relatively similar to each other but very different to the three unpolluted ones, it suggests the pollutant has had an effect. It would be very useful to have a way of assessing similarity (or its converse dissimilarity) among samples of multivariate data.
As described in Chapter 3, univariate data can easily be visualised and compared by summary statistics such as the mean and standard deviation. Bivariate data can be displayed as a two-dimensional graph with one axis for each variable. Even data for three variables can be displayed as a three-dimensional graph. But when you have four or more variables, the visualisation of these in a multidimensional space and comparison among samples becomes increasingly difficult. For example, Table 22.1 gives data for the numbers of five species of common birds (the variables) at four woodland sites (the samples) of equal area. Although this is only a small data set, it is difficult to assess which sites are most similar or dissimilar. Incidentally, you may be thinking the experimental design is poor because the data for each species are unreplicated at each site, but I have used a simple example for clarity.
Table 22.1 The numbers of five species of bird at four woodland sites of equal area in Kent, England. From these raw data, it is difficult to evaluate which sites are most similar or dissimilar.
Species        Beckenham   Herne Hill   Croydon   Penge East
Blue tit       12          43           26        21
Feral pigeon   11          40           28        19
Blackbird      46          63           26        21
Great tit      32          5            19        7
Carrion crow   6           40           21        38
Life scientists need ways of simplifying and summarising multivariate data to compare samples. Because univariate data are so easy to visualise, the comparison among the four sites in Table 22.1 would be greatly simplified if the data for the five species could somehow be reduced to a single summary statistic or measure. Multivariate methods do this by reducing the complexity of the data sets while retaining as much information as possible about each sample. The following explanations are simplified and conceptual, but they describe how multivariate methods work.
22.2
Simplifying and summarising multivariate data
The methods for simplifying and comparing multivariate data can be divided into two groups.
(a) The first group of analyses works on the variables. They reduce the number of variables by identifying the ones that have the most influence upon the observed differences among samples so that relationships among the samples can be summarised and visualised more easily. These ‘variable-oriented’ methods are often called R-mode analyses.
(b) The second group of analyses works on the samples. They often summarise multivariate data by calculating a single measure, or statistic, that helps to quantify differences among samples. These ‘sample-oriented’ methods are often called Q-mode analyses.
This chapter describes an example of an R-mode analysis, followed by two Q-mode ones.
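To give a feel for what ‘a single measure’ of the difference between two samples might look like, the following Python sketch calculates the straight-line (Euclidean) distance between each pair of sites in Table 22.1. Euclidean distance is used here purely as an assumed example of a sample-oriented measure; it is not necessarily the one used by the Q-mode methods described later, and the function and variable names are hypothetical.

```python
import math
from itertools import combinations

# Bird counts from Table 22.1, in the order: blue tit, feral pigeon,
# blackbird, great tit, carrion crow.
sites = {
    "Beckenham": [12, 11, 46, 32, 6],
    "Herne Hill": [43, 40, 63, 5, 40],
    "Croydon": [26, 28, 26, 19, 21],
    "Penge East": [21, 19, 21, 7, 38],
}

def euclidean_distance(u, v):
    """Straight-line distance between two samples in multivariate space."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

distances = {
    (first, second): euclidean_distance(sites[first], sites[second])
    for first, second in combinations(sites, 2)
}

# List the pairs from most similar (shortest distance) to least similar.
for pair, distance in sorted(distances.items(), key=lambda item: item[1]):
    print(f"{pair[0]} vs {pair[1]}: {distance:.1f}")
```

Each pair of sites is reduced to one number, so the six pairwise comparisons can be read at a glance, which is much harder to do from the twenty raw counts.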
22.3
An R-mode analysis: principal components analysis
Principal components analysis (PCA) (which is called ‘principal component analysis’ in some texts) is one of the oldest multivariate techniques. The mathematical procedure of PCA is complex and uses matrix algebra, but the concept of how it works is very easy to understand. The following explanation uses the data in Table 22.1 for four sites (the samples) and five variables and only assumes an understanding of correlation (Chapter 16). By inspection, it is not easy to see which of the sites (that for PCA are often called objects) in Table 22.1 are most similar or dissimilar to each other. Often, however, a set of multivariate data shows a lot of redundancy, which means that two or more variables are highly correlated. For example, if you look at the data in Table 22.1, it is apparent that the numbers of blue tits, feral pigeons and carrion crows are strongly and positively correlated (when there are relatively high numbers of blue tits, there are also relatively high numbers of feral pigeons and carrion crows and vice versa). Furthermore, the numbers of each of these three species are also strongly correlated with the number of great tits, but I have deliberately made these correlations negative (when there are relatively high numbers of great tits, there are relatively low numbers of blue tits, feral pigeons and carrion crows, and vice versa) because negative correlations are just as important as positive ones. The correlation coefficients are in Table 22.2. These examples of redundancy mean that any of the highly (positively or negatively) correlated variables give essentially the same information about the differences among the sites. Therefore, you could use only one of these variables (either blue tits, feral pigeons, carrion crows or great tits) plus blackbirds with little loss of information about the inter-site differences, even though the number of variables has been reduced from five to two (Table 22.3).
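The redundancy described above can be checked directly by calculating the Pearson correlation coefficient for every pair of species in Table 22.1. The following Python sketch does this with the standard library only; the names are hypothetical, and the printed values should agree with those in Table 22.2 to two decimal places.

```python
import math
from itertools import combinations

# Counts at Beckenham, Herne Hill, Croydon and Penge East (Table 22.1).
counts = {
    "Blue tit": [12, 43, 26, 21],
    "Feral pigeon": [11, 40, 28, 19],
    "Blackbird": [46, 63, 26, 21],
    "Great tit": [32, 5, 19, 7],
    "Carrion crow": [6, 40, 21, 38],
}

def pearson_r(x, y):
    """Pearson correlation coefficient for two equal-length lists."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

correlations = {
    (first, second): pearson_r(counts[first], counts[second])
    for first, second in combinations(counts, 2)
}

for (first, second), r in correlations.items():
    print(f"{first} vs {second}: r = {r:+.2f}")
```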
Table 22.2 The correlations between each pair of variables in Table 22.1. The numbers of blue tits, feral pigeons and carrion crows are all strongly and positively correlated with each other. These three variables are also strongly and negatively correlated with the numbers of great tits. None of the four is strongly correlated with the numbers of blackbirds (for which correlations range from −0.04 to 0.53).
               Blue tit   Feral pigeon   Blackbird   Great tit
Feral pigeon   0.99
Blackbird      0.53       0.46
Great tit      −0.75      −0.70          −0.04
Carrion crow   0.73       0.68           0.05        −1.00
Table 22.3 The numbers of blue tits, feral pigeons, carrion crows and great tits are strongly correlated. Therefore, you only need data for one of these (e.g. the feral pigeon), plus the number of blackbirds, to describe the differences among the four sites.
Species        Beckenham   Herne Hill   Croydon   Penge East
Feral pigeon   11          40           28        19
Blackbird      46          63           26        21
22.3.1 How does a PCA combine two or more variables into one?
This is a straightforward example where data for two variables are combined to give one new variable and I am using a simplified version of the conceptual explanation given by Davis (2002). Imagine you need to assess the spatial variation in water quality within a large reservoir. You only have data for the concentration (in numbers per ml) of the bacterium Escherichia coli (which occurs in the intestine of warm-blooded animals and is often used as an indicator of faecal contamination) and the turbidity of the water (which is a measure of the amount of suspended solids, so highly turbid water is usually of poor quality) at ten sites. It would be helpful to know which sites were most similar and which were most dissimilar, because this information might be used to improve water quality in the reservoir. The data for the ten sites have been plotted in Figure 22.1, which shows a (positive) correlation between the concentration of E. coli and turbidity. This strong relationship between two variables can be used to construct a
Figure 22.1 The concentration of the bacterium Escherichia coli and turbidity at ten sites in a reservoir.
Figure 22.2 An ellipse drawn around the data for the concentration of Escherichia coli and turbidity at ten sites. The elliptical boundary is analogous to the 95% confidence interval for this bivariate distribution.
single, combined variable to help make comparisons among the ten sites. Note that you are not interested in whether the variables are positively or negatively correlated – you only want to compare the sites. First, the bivariate distribution of points for these two highly correlated variables can be enclosed by a boundary. This is analogous to the way a set of univariate data has a 95% confidence interval (Chapter 8). For bivariate data the boundary will be two-dimensional, and because the variables are strongly correlated it will be elliptical as shown in Figure 22.2. An ellipse is symmetrical and its relative length and width can be described by the length of the longest line that can be drawn through it (which is called the major axis), and the length of a second line (which is called the minor axis) drawn halfway down and perpendicular to the major
Figure 22.3 The long major axis and shorter minor axis give the dimensions of the ellipse that encloses the set of data.
axis (Figure 22.3). The relative lengths of the two axes describing the ellipse will depend upon the strength of the correlation between the two variables. Highly correlated data like those in Figure 22.3 will be enclosed by a long and narrow ellipse, but for weakly correlated data the ellipse will be far more circular. At present the ten sites are described by two variables – the concentration of E. coli and turbidity. But these two variables are highly correlated, so all the sites will be quite close to the major axis of the ellipse. Therefore, most of the variation among them can be described by the major axis (Figure 22.3), which can be thought of as a new single variable that is a good indication of most of the differences among the sites. So instead of using two variables to describe the ten sites, the information can be combined into just one. The two axes of an ellipse are called eigenvectors and their relative lengths within the ellipse are their eigenvalues. Once the longest eigenvector of the ellipse has been drawn, it is rotated (in the case of Figure 22.3, this will simply be clockwise by about 45°) so that it becomes the new X axis (Figure 22.4). This new, artificially constructed principal component explains most of the variation among the ten sites and has no name except principal component number 1 (PC1). It is important to realise that PC1 is a new variable, which in this case is a combination of the two variables ‘the concentration of E. coli’ and ‘turbidity’. The plot of the points in relation to PC1 in Figure 22.4 only shows the sites in terms of this new variable and there is nothing about E. coli or turbidity in the
Figure 22.4 The long axis of the ellipse has been drawn through the set of highly correlated data for the concentration of E. coli and turbidity (Figure 22.3), and then rotated to give a new X axis for the artificial variable called principal component number 1. This new variable explains most of the variation among sites.
Figure 22.5 The values for PC1 are expressed in relation to the midpoint of the principal eigenvector, which is assigned the value of 0.
graph. The new X axis, PC1, is rescaled to assign its midpoint the value of 0, so some sites will have positive and some will have negative values of PC1 (Figure 22.5). In this example the points are all close to the major axis so principal component 1 explains most of the variation among the sites and can be used to easily assess similarities among them. From Figures 22.4 and 22.5, it is clear that sites A, I and F are more similar to each other than A is to E because the distances among the former three are much shorter. Because there are two variables in the initial data set, a principal components analysis also constructs a second component that is completely independent and uncorrelated with principal component 1. The second axis is called principal component 2 (PC2) and is simply the minor axis of the ellipse shown in Figure 22.3, which after the rotation described above will be a vertical line perpendicular to PC1. Here, too, the eigenvalue for PC2 corresponds to its relative length and its midpoint is given the value of 0. It is clear that PC2 does not explain very much of the variation among the sites because the objects are quite widely dispersed around this eigenvector, which is therefore relatively short (Figure 22.6). Most of the variation is
22.3 An R-mode analysis: principal components analysis
Figure 22.6 Principal component 2 is the short axis of the ellipse and is constructed by drawing a line perpendicular to the line showing PC1. Note that PC2 explains very little of the variation among sites.
Figure 22.7 (a) Highly correlated data. The long axis of the ellipse is a good
indication of variation among samples. (b) Uncorrelated data. The major and minor axes of the ellipse are both similar in length, so neither can be used as a good single indication of the variation among samples.
described by PC1, so the analysis has effectively reduced the number of variables from two to one.
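The rotation described above can be sketched numerically. The following is only an illustration, assuming NumPy is available: the ten pairs of values are hypothetical (they are not the data plotted in Figure 22.3) and the variable names are invented for the example. The eigenvectors of the covariance matrix give the directions of the major and minor axes of the ellipse, and the eigenvalues give their relative lengths.

```python
import numpy as np

# Hypothetical values for ten sites (NOT the data used in the text).
e_coli = np.array([12.0, 18.0, 25.0, 31.0, 38.0, 44.0, 52.0, 60.0, 67.0, 75.0])
turbidity = np.array([10.0, 21.0, 22.0, 35.0, 36.0, 47.0, 50.0, 63.0, 64.0, 78.0])

data = np.column_stack([e_coli, turbidity])

# Centre each variable so that the midpoint of every new axis is 0.
centred = data - data.mean(axis=0)

# Eigenvectors of the covariance matrix are the axes of the ellipse.
covariance = np.cov(centred, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(covariance)

# eigh returns eigenvalues in ascending order; put the longest axis (PC1) first.
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Percentage of the variation among sites explained by each component.
percent_variation = 100 * eigenvalues / eigenvalues.sum()

# Rotate the sites onto the new axes: column 0 is PC1, column 1 is PC2.
scores = centred @ eigenvectors

print(percent_variation)
print(scores[:, 0])
```

Because the two hypothetical variables are highly correlated, almost all of the variation falls on PC1. The sign of each eigenvector is arbitrary, so a plot of the scores may be mirrored without changing the distances among sites.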
22.3.2 What happens if the variables are not highly correlated?
As described above, if two variables are highly correlated, the ellipse enclosing the data will be long and narrow. Therefore, the first eigenvector will be relatively long with a large eigenvalue and the second will be relatively short with a small eigenvalue. The new combined variable of the first eigenvector will be a good indicator of differences among the samples (in this case the sites). In contrast, if the two variables are not correlated, the ellipse will be more circular and the first and second eigenvectors will have similar eigenvalues (Figures 22.7(a) and (b)). Therefore, neither can be used as a good indication of the differences among samples.
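A small numerical check may help here. This sketch assumes the two variables have been standardised, so that the analysis works on their 2 × 2 correlation matrix; the function name is invented for the illustration. For a correlation of r the two eigenvalues are 1 + r and 1 − r, so they are very unequal when r is close to 1 and identical when r is 0.

```python
import numpy as np

def correlation_eigenvalues(r):
    """Eigenvalues (largest first) of the 2 x 2 correlation matrix for correlation r."""
    matrix = np.array([[1.0, r], [r, 1.0]])
    # eigvalsh returns ascending order, so reverse it to list the major axis first.
    return np.linalg.eigvalsh(matrix)[::-1]

highly_correlated = correlation_eigenvalues(0.95)  # long, narrow ellipse
uncorrelated = correlation_eigenvalues(0.0)        # circular outline

print(highly_correlated)
print(uncorrelated)
```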
22.3.3 PCA for more than two variables
Principal components analysis is particularly useful when you have data for three or more variables. If you have n variables, a PCA will calculate n eigenvectors (with n eigenvalues) that give the dimensions of an n-dimensional object in an n-dimensional space. This sounds daunting, but it is easy to visualise for only three variables where the three eigenvectors will give the dimensions of a three-dimensional object in three-dimensional space. It will be spherical for three variables with no correlations and therefore no redundancy. For three highly correlated variables, it will be a very elongated three-dimensional hyperellipsoid. For three or more variables, the PCA procedure is a straightforward extension of the explanation given for two variables in Section 22.3.1. First, the longest axis of the object is found and rotated so that it becomes the X axis lying horizontally to the viewer in a two-dimensional plane with its flat surface facing the viewer (like the page you are reading at the moment). If there are many variables and therefore many dimensions, the rotation is likely to be complex – even an eigenvector in three dimensions may have to be rotated in both the transverse and the horizontal plane. The eigenvector for the longest axis becomes principal component 1. After this, the other eigenvectors are drawn. For example, if you have measured three variables, the three-dimensional boundary enclosing the data points will have three eigenvectors that give its length, breadth and depth, which will be at 90° to each other. In many cases, several variables may be highly correlated so the hyperellipsoid will be relatively simple and may even describe most of the variation among samples in just one or two dimensions. Here is an example.
An environmental scientist sampled sediments along a 100 kilometre section of coastline, including five estuaries (A−E) that received storm water runoff from urban areas and five control estuaries (F−J) that did not. At each site they obtained data for the concentrations of copper, lead, chromium, nickel, cadmium, aluminium, mercury, zinc, total polycyclic aromatic hydrocarbons (ΣPAHs) and total polychlorinated biphenyls (ΣPCBs). These ten variables were subject to principal components analysis and re-expressed as ten principal components, thereby giving the shape of a ten-dimensional hyperellipsoid. But because several of the original variables were highly correlated, the first principal component, PC1, explained 70% of
Figure 22.8 When several variables are highly correlated, they can be re-expressed as a hyperellipsoid with one very long axis (PC1), a shorter one (PC2) and a very short one (PC3). Most of the variation is explained by PC1, followed by PC2. The third component, PC3, accounts for very little variation and could be ignored.
the variation among estuaries. The second, PC2, explained 15% more of the variation and the third, PC3, only 5% of the variation (Table 22.4). Therefore, in this case 85% of the variation among samples could be described by a two-dimensional ellipse with axes of PC1 and PC2. Furthermore, 90% of the variation could be described by a three-dimensional hyperellipsoid with axes of PC1, PC2 and PC3 and would be a very elongated, not very wide and even less thick object in three-dimensional space (Figure 22.8), with the remaining seven dimensions making little contribution to its shape. The output from a PCA usually includes a table of eigenvalues and the percentage variation explained by each principal component. Therefore, you could take only PC1 and PC2 and plot them as a two-dimensional ellipse from which it is easy to visualise the relationships among the sites. These two principal components explain 85% of the variation, so the closeness of the objects in two dimensions will give a realistic indication of their similarities (Figure 22.9). The analysis shows two relatively distinct clusters that correspond to the five urban and the five control estuaries, which is consistent with urban storm water runoff having a relatively consistent effect (although you need to bear in mind that this is only a mensurative experiment).
22.3.4 The contribution of each variable to the principal components
It is often useful to know which of the original variables contribute most to each principal component. For example, if you knew that most of the
Table 22.4 Typical output table for only the first three components of a PCA. PC1 explains most (70%) of the variation in the data set and thus has the largest eigenvalue.
Principal component   Eigenvalue   Percentage variation
1                     3.54         70
2                     1.32         15
3                     0.64          5
Figure 22.9 A plot of only PC1 and PC2 can explain most of the variation
among sites A−J. Note that the five urban estuaries are clustered to the right of the plot and the five control estuaries are clustered to the left.
variation along PC1 was contributed by the variables ΣPAHs and ΣPCBs, it might suggest ways of reducing the effects of urban development upon estuaries. The results of a PCA include the relative contribution of every variable to each of the principal components, shown as correlation coefficients. A variable that is highly (and either positively or negatively) correlated with a particular principal component has made a greater contribution to that component than a variable with a correlation closer to 0. Table 22.5 gives an example for the ten variables measured in the estuarine study described above. The highest correlations (either positive or negative) in each column have been highlighted. It is clear that principal component 1 is mainly composed of variables 3 and 6, which are chromium and aluminium (the two highest positive and negative correlations). In contrast, principal
Table 22.5 Typical output table from a PCA. The far left-hand column lists the original variables (in this case, variables 1−10). The next three columns represent the first three principal components and the values in these columns are the correlations between the new components and the original variables. Note that PC1 is primarily composed of variables 3 and 6 (the two largest values for the correlation coefficients and shown in bold), while PC2 is primarily composed of variables 1, 2 and 10 (also in bold). The variables that contribute most to PC3 are 4 and 5.
Original variable   Principal component 1   Principal component 2   Principal component 3
1 Copper             0.01                    0.60                    0.22
2 Lead               0.24                    0.61                    0.37
3 Chromium           0.91                    0.26                   −0.06
4 Nickel            −0.18                    0.32                    0.57
5 Cadmium            0.15                    0.05                    0.52
6 Aluminium         −0.87                   −0.22                    0.44
7 Mercury            0.42                    0.19                    0.37
8 Zinc               0.30                   −0.02                   −0.22
9 ΣPAHs             −0.17                    0.21                   −0.06
10 ΣPCBs             0.05                   −0.71                    0.32
component 2 is largely composed of variables 1, 2 and 10, which are copper, lead and ΣPCBs. Which two variables make the most contribution to principal component 3? You need to look for the highest correlations, irrespective of their signs. (They are nickel and cadmium.) The signs of the correlations are also useful. For example, for principal component 1 (Table 22.5) the correlation coefficient for variable 3 (chromium) is positive and the one for variable 6 (aluminium) is negative. This means that as PC1 increases, chromium concentration also increases, but the concentration of aluminium decreases. In summary, a PCA has the potential to express multivariate data in a form we can more easily understand by reducing the number of dimensions so that the data can be plotted on a two- or three-dimensional graph. It also gives a good indication of the variables that contribute most to the differences among samples.
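Ranking the variables by the size of their correlations, irrespective of sign, is easy to automate. The sketch below uses the correlations from Table 22.5; the dictionary layout and the function name are assumptions made only for this illustration.

```python
# Correlations between each original variable and PC1, PC2, PC3 (Table 22.5).
loadings = {
    "Copper":    (0.01, 0.60, 0.22),
    "Lead":      (0.24, 0.61, 0.37),
    "Chromium":  (0.91, 0.26, -0.06),
    "Nickel":    (-0.18, 0.32, 0.57),
    "Cadmium":   (0.15, 0.05, 0.52),
    "Aluminium": (-0.87, -0.22, 0.44),
    "Mercury":   (0.42, 0.19, 0.37),
    "Zinc":      (0.30, -0.02, -0.22),
    "ΣPAHs":     (-0.17, 0.21, -0.06),
    "ΣPCBs":     (0.05, -0.71, 0.32),
}

def top_contributors(component, how_many):
    """Names of the variables with the largest absolute correlations.

    component is 0 for PC1, 1 for PC2 and 2 for PC3.
    """
    ranked = sorted(loadings, key=lambda name: abs(loadings[name][component]), reverse=True)
    return ranked[:how_many]

pc1_top = top_contributors(0, 2)
pc2_top = top_contributors(1, 3)
pc3_top = top_contributors(2, 2)
print(pc1_top, pc2_top, pc3_top)
```

The absolute value is used for ranking only. The sign of each correlation still needs to be read from the table to see whether a variable increases or decreases along the component.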
Figure 22.10 (a) A plot of PC1 and PC2 for six sites increasingly distant (site
A = closest, site F = most distant) from a petrochemical plant. The analysis shows a clear gradation through sites A to F. (b) A plot of PC1 and PC2 for the same six sites after procedures were put in place to minimise spillage and discharge from the plant.
22.3.5 An example of the practical use of principal components analysis
A marine toxicologist was asked to compare the hydrocarbons in sediments at six sampling sites, each one kilometre apart, running south along the shore and increasingly distant from a petrochemical plant to (a) see if there were differences in hydrocarbon levels among the sites and (b) if so, to find out which compounds might be the best indicators of pollution. The toxicologist sampled ten hydrocarbons at each of the six sites (A−F). A principal components analysis showed that only two hydrocarbons, 1 and 6 (combined as PC1), contributed to most of the variation among sites and were negatively correlated with PC1, followed by 5 and 9 (combined as PC2). When plotted on a graph of PC1 and PC2, there was a clear pattern (Figure 22.10(a)) in that the rank order of the sites, running from left to right, corresponded to their distance from the petrochemical plant. The toxicologist concluded that the concentrations of only two hydrocarbons could explain most of the variation among sites and noted that the concentrations of both were highest near the petrochemical plant.
Following the study, the owners of the petrochemical plant took steps to reduce accidental spillage and discharges. Three years later, the sites were resampled. The concentrations of all hydrocarbons were lower and a PCA showed sites A and B were closely clustered and distinct from C−F, which suggested the effects of pollution had been reduced to the two sites within two kilometres of the plant (Figure 22.10(b)).
22.3.6 How many principal components should you plot?
There are several ways of deciding how many principal components to include in a plot. Sometimes only one or two are needed, but this will only occur if they account for almost all the percentage variation among samples. Generally, however, you should not use components with eigenvalues of 1 or less because this is the level of variation expected by chance when there are no strong correlations among variables so that they all contribute equally to a component.
22.3.7 How much variation must a PCA explain before it is useful?
If the first two or three components describe more than 70% of the variation among samples, a PCA will give a plot in two or three dimensions that is reasonably realistic. Sometimes, however, it may be useful to know that none of the components can explain very much of the variation among samples. For example, a PCA on indicators of air pollution (nitrogen dioxide, sulphur dioxide, ozone, ammonia and the concentration of fine particles per cubic metre of air) at sites throughout a city, including the centre and the fringes of the outer suburbs, showed no component with an eigenvalue greater than 0.9 and none explained more than 16% of the variation among sites. The two-dimensional plot of the data was almost circular and the three-dimensional plot was spheroidal. It was concluded that there was no obvious difference in air quality (at least not in relation to these five indicators) across the city.
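The two rules of thumb in Sections 22.3.6 and 22.3.7 can be applied directly to an output table such as Table 22.4. This is a minimal sketch using the values from that table; the variable names are chosen only for the illustration.

```python
from itertools import accumulate

# Values from Table 22.4 for the first three principal components.
eigenvalues = [3.54, 1.32, 0.64]
percent_variation = [70, 15, 5]

# Rule of thumb 1: only use components with eigenvalues greater than 1.
retained = [pc for pc, value in enumerate(eigenvalues, start=1) if value > 1]

# Rule of thumb 2: the plotted components should explain more than 70%.
cumulative = list(accumulate(percent_variation))
enough_for_2d_plot = cumulative[1] > 70

print(retained)            # components worth plotting
print(cumulative)          # running total of percentage variation
print(enough_for_2d_plot)
```

For the estuarine example both rules agree: PC1 and PC2 are retained and together explain 85% of the variation, so a two-dimensional plot should be reasonably realistic.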
22.3.8 Summary and some cautions and restrictions on use of PCA
PCA is a way of reducing the complexity of a multivariate data set, but it can only do this if some variables are highly correlated. These are combined to
form principal components that are used to display the samples in two or three dimensions. The contribution of each original variable to the principal components is also given. PCA will not give a realistic reduction in complexity when the original data include a lot of 0 values (e.g. no individuals of several species). This restriction can be thought of in terms of the PCA constructing new axes from highly correlated variables. If the data include a lot of 0 values for each variable and a few cases of larger numbers, the PCA is likely to overestimate redundancy, just as several values of 0,0 and a few larger ones in a bivariate plot can overestimate the strength of a correlation. The plot provided by a PCA is also sensitive to the scale on which each variable is measured. For example, data for ten species might include the number of trees per hectare and the number of beetles per square metre. This will affect the shape of the hyperellipsoid and if the data are rescaled (e.g. all expressed as number per m²), the PCA plot will stretch or shrink to reflect this. This is why many PCA programs normalise the data by converting each value to its Z score (Chapter 8). For each variable, the mean is subtracted from every datum and the difference divided by the standard deviation, which always gives a distribution with a mean of 0 and a standard deviation of 1. The same procedure is used to standardise data for a correlation analysis (Chapter 16, equation (16.2)).
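The effect of converting to Z scores can be seen in a short sketch. The counts below are hypothetical and the function name is invented for the illustration; NumPy is assumed, and the sample standard deviation (n − 1 in the denominator) is an assumption about which form of the standard deviation is used.

```python
import numpy as np

# Hypothetical counts for five sites (not data from the text).
trees_per_hectare = np.array([120.0, 340.0, 95.0, 410.0, 275.0])
beetles_per_m2 = np.array([3.0, 8.0, 2.0, 11.0, 6.0])

def z_scores(values):
    """Each value minus the mean, divided by the sample standard deviation."""
    return (values - values.mean()) / values.std(ddof=1)

z_trees = z_scores(trees_per_hectare)
z_beetles = z_scores(beetles_per_m2)

# Re-expressing trees per hectare as trees per square metre
# (1 hectare = 10 000 square metres) leaves the Z scores unchanged.
z_trees_per_m2 = z_scores(trees_per_hectare / 10_000)

print(z_trees)
print(z_beetles)
```

After standardisation both variables have a mean of 0 and a standard deviation of 1, so neither dominates the shape of the hyperellipsoid simply because of its units.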
22.3.9 Reporting the results of a principal components analysis
The results of a PCA, especially when one or two principal components explain most of the variation among objects, are usually presented as a plot showing the objects in relation to PC1 and PC2 together with a table of correlations between the original variables and the principal components. For example, the data for the estuaries in Section 22.3.4 could be reported as ‘The first principal component accounted for 70% of the variation among sites (Table 22.4) and a plot of PC1 and PC2 separated the five urban estuaries from the five controls. PC1 was positively correlated with chromium and negatively correlated with aluminium, while PC2 was positively correlated with copper and lead but negatively correlated with ΣPCBs (Table 22.5). In summary, the urban estuaries were characterised by high levels of chromium and low levels of aluminium compared to the controls.’
If a PCA explains very little variation among samples, a plot need not be given. The example in Section 22.3.7 could be reported as ‘A PCA of nitrogen dioxide, sulphur dioxide, ozone, ammonia and the concentration of fine particles per cubic metre of air at sites throughout the city, including the centre and the fringes of the outer suburbs, showed no component with an eigenvalue greater than 0.9 and none explained more than 16% of the variation among sites. Therefore, there was no marked difference in air quality in relation to these five indicators across the city.’
22.4 Q-mode analyses: multidimensional scaling
Q-mode analyses are similar to R-mode ones in that they also reduce the effective number of variables in a data set, but they do it in a different way. Q-mode analyses examine the similarity among samples so that they can be displayed on a graph with fewer dimensions than the number of original variables. Multidimensional scaling (MDS) does this by taking the original set of samples and calculating a single measure of the dissimilarity between each of the possible pairs of these. These dissimilarity data, which are univariate, are then used to draw a plot in two- or three-dimensional space. Here is a very straightforward explanation. Conservation ecologists are often very interested in patches of natural habitat remaining in areas that have been developed for agriculture, because many species of plants and animals can only persist in these remnants. Furthermore, the distances among patches may affect the long-term survival of a species because if a population in a particular patch becomes extinct, the likelihood of it being recolonised often depends on the distance to the nearest occupied patch. If you were to take four remnant patches within a large area of farmland and measure the distances between every possible pair of these (sites 1–2, 1–3, 1–4, 2–3, 2–4, 3–4), then you could construct the matrix shown in Table 22.6. These summary data indicate the dissimilarity between the patches in terms of their distance apart. Patches very close together will have a low score and those further apart will have a higher one. Knowing the dissimilarity values from the matrix, you could draw at least one map showing the positions of the patches in two dimensions. Not all of the maps would match the actual positions of the patches on the surface of the Earth when viewed from above, but they would be a
Figure 22.11 Two equally plausible plots (a) and (b) showing the relationships between
four patches of remnant vegetation in terms of their distances apart. Both maps correctly show all the dissimilarities among the four patches and are therefore equally applicable, even though only one (in this case (b)) corresponds to the actual position of these on the surface of the Earth when viewed from above.
convenient way of visualising the relationships among them. Two examples are shown in Figures 22.11(a) and (b). This is what multidimensional scaling does for the full set of variables measured on each sample. The example using four locations is very simple, but it shows how a univariate matrix of dissimilarities among samples can be used to position them in two dimensions and easily visualise how closely they are related.
22.4.1 How is a univariate measure of dissimilarity among samples extracted from multivariate data?
Univariate measures such as the Euclidian distance can be used to quantify the dissimilarity between samples of multivariate data. The Euclidian distance is just the distance between any two samples in two-, three-, four-, or higher-dimensional space. Here is an example for only two dimensions and therefore two variables. The length of the hypotenuse of a triangle is the square root of the sum of the squared lengths of the two other sides (Figure 22.12). For example, for two points (A and B) in two-dimensional space with axes of Y1 and Y2 and coordinates for point A of (Y1 = 6, Y2 = 11) and (Y1 = 9, Y2 = 13) for point B,
Figure 22.12 The Euclidian distance between two points, A and B, plotted in
two dimensions.
the distance between them will be the hypotenuse of a triangle which has sides of three units long (9 − 6) on axis Y1, by two units high (13 − 11) on axis Y2. Therefore, the length of the hypotenuse is the square root of (9 + 4), which is 3.61 units. The general formula for the Euclidian distance between any two points with known coordinates in p dimensions is:
$$d_e = \sqrt{\sum_{i=1}^{p} \left(Y_{iA} - Y_{iB}\right)^2} \qquad (22.1)$$
where Y1, Y2, Y3 . . . Yp are the p dimensions (one axis for each variable). For example, for only two dimensions, Y1 and Y2 correspond to the X and Y axes of a typical two-dimensional scatter plot. The ‘Y’ terminology is used because the number of dimensions (and therefore the number of axes and variables) can be two or more. Equation (22.1) gives a single value for the dissimilarity between the two points, just like the distances between the four locations in Table 22.6. If the values for all variables are identical for the two points, the Euclidian distance (and dissimilarity) will be 0. Table 22.7 gives an example for three variables.
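Equation (22.1) translates almost directly into code. The sketch below checks it against the two worked examples in this section: points A and B in two dimensions, and the three snail species in Table 22.7. The function name is chosen only for this illustration.

```python
import math

def euclidean_distance(sample_a, sample_b):
    """Square root of the sum of squared differences over all p variables."""
    if len(sample_a) != len(sample_b):
        raise ValueError("both samples need a value for every variable")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(sample_a, sample_b)))

# Points A (Y1 = 6, Y2 = 11) and B (Y1 = 9, Y2 = 13): square root of (9 + 4).
two_variables = euclidean_distance((6, 11), (9, 13))

# Samples A and B from Table 22.7: square root of (144 + 4 + 676).
three_variables = euclidean_distance((24, 33, 121), (12, 31, 95))

print(round(two_variables, 2), round(three_variables, 2))
```

The same function works for any number of variables, which is why a single univariate dissimilarity can always be extracted from a pair of multivariate samples.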
22.4.2 An example
The data in Table 22.8 are for the numbers of six different species of plants growing at four sites (A−D) contaminated by heavy metals in an open-cut mine. By inspection of the raw data, it is difficult to see which sites are most
Table 22.6 The dissimilarities, expressed as their distance apart in kilometres, for four patches of remnant vegetation surrounded by farmland. Patches close together will have a low dissimilarity score and those further apart will have a higher one. Note that each patch is no distance from itself. The values are duplicated (i.e. the distance between Mt Sando and Warner Wood is the same as that between Warner Wood and Mt Sando) and the matrix is symmetrical so you only need the dissimilarities either above or below the diagonal showing values of zero.
              Warner Wood   Mt Sando   Weaver Hill   Mt Bentley
Warner Wood   0             7          58            84
Mt Sando      7             0          50            80
Weaver Hill   58            50         0             76
Mt Bentley    84            80         76            0
Table 22.7 Calculation of the Euclidian distance between two samples where data are available for three variables. The numbers of three species of marine snail, Nassarius dorsatus, Nassarius pullus and Nassarius olivaceous are given for each sample. A univariate measure of the distance between the two samples can be calculated using the Euclidian distance, which is the square root of the sum of the squared differences for each of the three variables.
Variable                         Sample A   Sample B   (YA − YB)   (YA − YB)²
Nassarius dorsatus (axis Y1)     24         12         12          144
Nassarius pullus (axis Y2)       33         31         2           4
Nassarius olivaceous (axis Y3)   121        95         26          676
Sum of the squared differences, $\sum_{i=1}^{p} (Y_{iA} - Y_{iB})^2$: 824
Single univariate value for the Euclidian distance between samples A and B: 28.71
Table 22.8 Raw data for the number of six plant species at four sites in an open-cut mine.
Plant species           Site A   Site B   Site C   Site D
Stackhousia tryonii     12       16       22       14
Bowenia serrulata       43       54       6        39
Cycas ophiolitica       32       34       54       28
Macrozamia miquelli     61       23       32       71
Macrozamia serpentina   2        7        10       8
Melaleuca laterifolia   31       65       4        29
Table 22.9 The matrix of results for the Euclidian distances between all six possible paired combinations of sites in Table 22.8.
         Site A   Site B   Site C   Site D
Site A   0
Site B   52.6     0
Site C   59.9     80.9     0
Site D   13.3     62.2     63.1     0
similar and which are most dissimilar. In Table 22.9, the Euclidian distance has been calculated for each pairwise comparison of sites, using equation (22.1), and expressed in a matrix. The calculated matrix of dissimilarities can be used to position the sites in two-dimensional space, as shown in Figure 22.13. The process becomes difficult as soon as you have more than three samples, but MDS is available in many statistical packages. Some simply start the positioning process by placing the samples at random in two dimensions, even though the distances among them are extremely unlikely to correspond to the actual Euclidian distances. Next, all of the samples are moved slightly at random. If this improves the correspondence between the positions of the sites within the two-dimensional space and their known Euclidian distances apart, the change is retained. If it does not, it is discarded and another change chosen at random. This process is done iteratively, meaning it is repeated many thousands or tens of thousands of times, and will result in a gradually improving map of
Figure 22.13 The four sites, for which data are given in Table 22.8, positioned
in two dimensions on the basis of the Euclidian distances between them.
the relationships among the samples. Eventually, the fit within the two-dimensional space cannot be improved any further in relation to the Euclidian distances and the process stops. At this stage there will be a final map showing the best relationship among samples. Importantly, there may be several possible final maps, so most MDS programs repeat the process several times to establish the most common solution.
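The trial-and-error positioning described above can be sketched in a few lines. This is only a toy version written for illustration, assuming NumPy: real MDS programs use more efficient algorithms, and the function names, step size, number of steps and the use of a summed squared mismatch as the measure of fit are all assumptions. The counts are those in Table 22.8.

```python
import numpy as np

# Counts from Table 22.8 (rows are sites A-D, columns are the six plant species).
counts = np.array([
    [12, 43, 32, 61, 2, 31],   # Site A
    [16, 54, 34, 23, 7, 65],   # Site B
    [22, 6, 54, 32, 10, 4],    # Site C
    [14, 39, 28, 71, 8, 29],   # Site D
], dtype=float)

def pairwise_distances(points):
    """Matrix of Euclidian distances between every pair of rows."""
    difference = points[:, None, :] - points[None, :, :]
    return np.sqrt((difference ** 2).sum(axis=-1))

def mismatch(layout, target):
    """Summed squared difference between map distances and target distances."""
    return ((pairwise_distances(layout) - target) ** 2).sum()

def random_search_mds(target, n_steps=5000, step=1.0, seed=0):
    rng = np.random.default_rng(seed)
    # Start with every sample placed at random in two dimensions.
    layout = rng.normal(scale=target.max(), size=(target.shape[0], 2))
    start = best = mismatch(layout, target)
    for _ in range(n_steps):
        # Move all samples slightly at random and keep the change only if it helps.
        trial = layout + rng.normal(scale=step, size=layout.shape)
        score = mismatch(trial, target)
        if score < best:
            layout, best = trial, score
    return layout, start, best

target = pairwise_distances(counts)   # reproduces the matrix in Table 22.9
layout, start_mismatch, final_mismatch = random_search_mds(target)
print(np.round(target, 1))
print(start_mismatch, final_mismatch)
```

Running the search with several different values of seed gives a rough way of checking whether the final map is stable, in the spirit of the repeated runs mentioned above.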
22.4.3 Stress
Ideally, the display of the samples will be two dimensional because this is easiest to interpret. Sometimes, however, the samples will not fit well into a flat two-dimensional plane and the only way to position a particular sample an appropriate Euclidian distance from all the others is to place it above or below the plane. This lack of conformity to a two-dimensional display is called stress and will give a misleading two-dimensional picture of the relationships among samples. For example, a sample above (or below) the plane will seem closer to two neighbouring sites than it is in reality (Figures 22.14(a) and (b)). Stress can be reduced by increasing the number of dimensions in the display and there will be no stress when the number of dimensions is equal to the number of original variables, but this is unlikely to be useful because a multidimensional display is usually impossibly complex to interpret. Hopefully, you will get a two- or three-dimensional display with little stress. Statistical packages for MDS usually give a value for stress, which, as a general guide, should be less than 0.2 and ideally less than 0.1.
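The idea of stress can be illustrated with a small sketch. The text does not say which formula a given package uses, so the measure below (Kruskal's stress-1 applied directly to the dissimilarities) is an assumption, as are the four hypothetical sites and the function names. One site sits above the plane containing the other three, much as site A does in Figure 22.14.

```python
import numpy as np

def pairwise_distances(points):
    """Matrix of Euclidian distances between every pair of rows."""
    difference = points[:, None, :] - points[None, :, :]
    return np.sqrt((difference ** 2).sum(axis=-1))

def stress_1(layout, target):
    """Kruskal's stress-1 (assumed here): mismatch between plotted and target distances."""
    plotted = pairwise_distances(layout)
    upper = np.triu_indices_from(target, k=1)   # use each pair of sites once
    return np.sqrt(((plotted[upper] - target[upper]) ** 2).sum() / (plotted[upper] ** 2).sum())

# Hypothetical sites: three lie on the 'floor' and the fourth is 3 units above it.
sites = np.array([
    [0.0, 0.0, 0.0],
    [4.0, 0.0, 0.0],
    [0.0, 4.0, 0.0],
    [1.0, 1.0, 3.0],
])
target = pairwise_distances(sites)

stress_in_3d = stress_1(sites, target)         # all three dimensions retained
stress_in_2d = stress_1(sites[:, :2], target)  # viewed from above as a flat map
print(stress_in_3d, stress_in_2d)
```

With all three dimensions there is no stress, but flattening the display makes the raised site appear closer to its neighbours than it really is, so the stress becomes greater than 0.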
Figure 22.14 Sometimes sites will not fit into a two-dimensional plane.
(a) Sites B, C, D, E and F are on the flat ‘floor’ of the figure. Site A can only be accommodated accurately in relation to all others by positioning it in space above (or below) C. (b) When viewed from above as a two-dimensional map, site A will be misleadingly close to C.
22.4.4 Summary and cautions on the use of multidimensional scaling
Multidimensional scaling is a way of displaying samples of multivariate data in a reduced number of dimensions. The distance between the samples is an indication of their dissimilarity. Unlike PCA, multidimensional scaling does not identify which variables contribute to the positions of the samples. There are several ways of estimating the dissimilarity among samples. For continuous data, where there are few values of 0, the Euclidian distance is appropriate. For data that include a large number of 0 values (e.g. the numbers of several uncommon species), an index such as the Bray–Curtis coefficient of dissimilarity should be used.
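The text names the Bray–Curtis coefficient without giving its formula. A commonly used form, assumed here, is the sum of the absolute differences between two samples divided by the sum of all the counts in both samples. The sketch applies it to sites A and D from Table 22.8; the function name is chosen for illustration.

```python
def bray_curtis(sample_a, sample_b):
    """Bray-Curtis dissimilarity: 0 for identical samples, 1 for no shared species."""
    total_difference = sum(abs(a - b) for a, b in zip(sample_a, sample_b))
    total_count = sum(a + b for a, b in zip(sample_a, sample_b))
    if total_count == 0:
        raise ValueError("both samples are empty, so the index is undefined")
    return total_difference / total_count

# Counts of the six plant species at sites A and D (Table 22.8).
site_a = [12, 43, 32, 61, 2, 31]
site_d = [14, 39, 28, 71, 8, 29]

a_versus_d = bray_curtis(site_a, site_d)

# Two hypothetical samples with no species in common are completely dissimilar.
no_overlap = bray_curtis([5, 0, 0], [0, 0, 7])
print(a_versus_d, no_overlap)
```

Species that are absent from both samples add nothing to either sum, which is one reason an index of this kind is preferred when the data contain many 0 values.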
Although MDS is a simple technique for displaying multivariate samples in as few as two dimensions, the amount of stress (Section 22.4.3) required to do this needs to be considered because the two-dimensional projection is likely to be misleading when stress is high.
22.4.5 Reporting the results of MDS
Results of an MDS analysis are given as a two-dimensional plot and the value for stress. The analysis of the data in Table 22.8 could be reported as ‘A multidimensional scaling analysis of species composition at sites (A−D) showed A and D were most similar (Figure 22.13) with stress of less than 0.04’ followed by discussion of the possible implications of the similarity/dissimilarity among sites.
22.5 Q-mode analyses: cluster analysis
Cluster analysis is a method for assigning samples to groups (called clusters) on the basis of the similarity/dissimilarity among samples. This is much simpler than it sounds. For example, the a posteriori Tukey test (Chapter 12) for assigning several means to groups that appear to be from the same population is a simple univariate clustering method. The following explanation of cluster analysis relies on an understanding of how a univariate measure of the dissimilarity between samples is derived from multivariate data, as explained for Euclidian distance in Section 22.4.1. Just like MDS, cluster analysis uses a matrix of univariate dissimilarities between pairs of samples. For example, the data for the numbers of six plant species at four sites (the samples) in Table 22.8 were used to construct the matrix in Table 22.9, which has been copied to Table 22.10. It gives the Euclidian distance between all possible pairs of the four sites. There are several types of clustering procedures, but one common method is hierarchical clustering, which can be used to construct a dendrogram – a tree-like diagram – showing clusters based on the amount of within-cluster dissimilarity. Here is an example. First, the four sites in Table 22.10 can be considered as being in four separate groups or clusters because none of the dissimilarities between any of them is 0. They are all dissimilar to some extent.
Table 22.10 (copied from Table 22.9). The matrix of results for the Euclidian distances between all six possible paired combinations of the sites in Table 22.8.
         Site A   Site B   Site C   Site D
Site A   0
Site B   52.6     0
Site C   59.9     80.9     0
Site D   13.3     62.2     63.1     0
Figure 22.15 Fusion of the two least dissimilar (and therefore most similar)
sites (A and D) to give three clusters on the basis of a maximum of 13.3 units of permitted dissimilarity within clusters.
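The whole fusion procedure that this section works through by hand can be sketched as a short program. It uses the dissimilarities in Table 22.10 and the group average linkage method, in which the dissimilarity between two clusters is the average of the original dissimilarities between their members. The function names and data layout are assumptions made for this illustration only.

```python
from itertools import combinations

# Euclidian distances between pairs of sites (Table 22.10).
distances = {
    frozenset("AB"): 52.6, frozenset("AC"): 59.9, frozenset("AD"): 13.3,
    frozenset("BC"): 80.9, frozenset("BD"): 62.2, frozenset("CD"): 63.1,
}

def average_linkage(cluster_x, cluster_y):
    """Average of the original dissimilarities between members of two clusters."""
    pairs = [distances[frozenset((x, y))] for x in cluster_x for y in cluster_y]
    return sum(pairs) / len(pairs)

def hierarchical_clusters(sites):
    """Repeatedly fuse the two least dissimilar clusters, recording each fusion level."""
    clusters = [(site,) for site in sites]   # every site starts as its own cluster
    fusions = []
    while len(clusters) > 1:
        x, y = min(combinations(clusters, 2), key=lambda pair: average_linkage(*pair))
        level = average_linkage(x, y)
        clusters = [c for c in clusters if c not in (x, y)] + [x + y]
        fusions.append((x + y, round(level, 1)))
    return fusions

fusions = hierarchical_clusters("ABCD")
for members, level in fusions:
    print("&".join(sorted(members)), level)
```

The fusion levels recorded in this way are the heights at which the branches of the dendrogram join.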
Second, the dissimilarities in Table 22.10 are examined to find the two sites that are least dissimilar, which are sites A and D, with a dissimilarity between them of only 13.3 units. These two sites are assigned to the first cluster, which therefore has an internal dissimilarity of 13.3 units (Figure 22.15). At this stage, the sites have been assigned to three clusters – one with A&D, plus B and C. The cluster of sites A and D combined, symbolised as (A&D), is then considered as a single sample and the matrix of dissimilarities recalculated. This will not affect the dissimilarity between sites B and C, but the dissimilarity between site B and the ‘new’ combined cluster of (A&D), as well as that between site C and cluster (A&D), will change.

There are several methods for calculating dissimilarity after samples have started being assigned to clusters. The group average linkage method simply takes the average of the dissimilarities between an outside sample (e.g. site B) and those within the cluster (e.g. (A&D)). Therefore, using the initial dissimilarities given in Table 22.10, the new dissimilarity between site B and the cluster (A&D) is the average of the dissimilarity between B and A, and between B and D. This is (52.6 + 62.2)/2 = 57.4. In the same way, the new dissimilarity between site C and cluster (A&D) is the average of the dissimilarity between C and A, and between C and D. This is (59.9 + 63.1)/2 = 61.5. (Note that the dissimilarity between B and C remains the same at 80.9.) These dissimilarities will give the reduced matrix in Table 22.11.

Table 22.11 The reduced matrix of results for the Euclidian distances between the clusters shown in Figure 22.16.

              Sites (A&D)   Site B   Site C
Sites (A&D)   0             57.4     61.5
Site B                      0        80.9
Site C                               0

By inspection, the two most similar samples are now cluster (A&D) and site B (because the dissimilarity is the lowest at 57.4). Therefore, these two are now assigned to the same cluster, with a permitted internal dissimilarity of 57.4. This gives two clusters: (A&D&B) and site C (Figure 22.16).

Figure 22.16 Fusion of the first cluster (A&D) and the next most similar (site B) to give two clusters, (A&D&B) and site C, on the basis of a maximum of 57.4 units of dissimilarity within clusters.

Next, the matrix of dissimilarities is reduced to the one in Table 22.12. Here, because there are only two samples left to compare, it is only necessary to calculate the dissimilarity between (A&D&B) and site C. Once again, the dissimilarities in the original matrix (Table 22.10) are used and averaged, but the calculation is slightly more complex because you need to take the average of the three dissimilarities A–C, B–C and D–C. This is: (59.9 + 80.9 + 63.1)/3 = 68.

Table 22.12 The reduced matrix of results for the Euclidian distance between the only possible pair (A&D&B) and site C.

               Sites A&D&B   Site C
Sites A&D&B    0             68.0
Site C                       0

These values for increasing dissimilarity can be used to construct the final dendrogram showing the sites grouped into fewer and fewer clusters (Figure 22.17). The dendrogram shows a three-cluster solution at 13.3 units of internal dissimilarity, a two-cluster solution at 57.4 units of internal dissimilarity and a single-cluster solution at 68 units of internal dissimilarity. This result is consistent with the results of the MDS analysis of the same data in Figure 22.13, which is not surprising.

Figure 22.17 Dendrogram showing sites A, B, C and D hierarchically arranged in fewer clusters as the amount of dissimilarity permitted within clusters increases. At 13.3 units of dissimilarity, there are three clusters: (A&D), B, and C. These reduce to two clusters: (A&D&B), and C at 57.4 units. Fusion into only one cluster occurs at 68 units of permitted dissimilarity.

The advantage of a cluster analysis is that it gives a quantitative way of assigning samples to groups. For example, from the dendrogram in Figure 22.17 you could suggest that A&D are ‘in the same group’, which is different to group B and group C. Importantly, however, the groupings produced by a cluster analysis may not necessarily correspond to true categorical attributes (such as gender, or black or white beads) given as examples of nominal scale data in earlier chapters. Instead, the categories are based on decisions made about dissimilarity (or its converse, similarity), which is a continuous and ratio scale variable. This is an important point. Cluster analysis is often used by taxonomists – who describe and define species – as a way of helping decide whether individuals should be categorised as the same or different species. Here too, however, even though the analysis can be used to define clusters, it does not mean these have identified real discontinuities or discrete categories that correspond to a species (Box 22.1).
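The group average linkage procedure worked through for Tables 22.10 to 22.12 can be sketched in a few lines of Python. This is only an illustration of the arithmetic described above, not the algorithm used by any particular statistical package, and the function names are hypothetical. The distances are those in Table 22.10.

```python
from itertools import combinations

# Euclidian distances between sites, copied from Table 22.10.
distances = {
    frozenset("AB"): 52.6, frozenset("AC"): 59.9, frozenset("AD"): 13.3,
    frozenset("BC"): 80.9, frozenset("BD"): 62.2, frozenset("CD"): 63.1,
}

def average_dissimilarity(cluster_1, cluster_2):
    """Mean of the original dissimilarities between every member of
    cluster_1 and every member of cluster_2 (group average linkage)."""
    pairs = [frozenset((a, b)) for a in cluster_1 for b in cluster_2]
    return sum(distances[p] for p in pairs) / len(pairs)

def hierarchical_clustering(sites):
    """Repeatedly fuse the two least dissimilar clusters until one remains.
    Returns a list of (fused cluster, dissimilarity) in order of fusion."""
    clusters = [(site,) for site in sites]  # every site starts on its own
    fusions = []
    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2),
                     key=lambda pair: average_dissimilarity(*pair))
        level = average_dissimilarity(c1, c2)
        remaining = [c for c in clusters if c not in (c1, c2)]
        clusters = [c1 + c2] + remaining
        fusions.append((c1 + c2, round(level, 1)))
    return fusions

fusions = hierarchical_clustering("ABCD")
for members, level in fusions:
    print("&".join(members), level)
```

The three fusions reported are A&D at 13.3, A&D&B at 57.4 and all four sites at 68.0, the same levels as in the worked example. Other linkage rules (for example, using the smallest or largest dissimilarity instead of the average) would only require changing `average_dissimilarity`.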
Box 22.1 Defining species: cluster analysis of DNA, morphology and behaviour

Cluster analysis of DNA is being increasingly used by taxonomists. Closely related species are likely to have very similar DNA and vice versa, so cluster analysis can be applied to these data to generate dendrograms that may indicate separate species, but the results need to be interpreted with care. For example, Pape et al. (2000) described and named two species of fly, Sarcophaga megafilosia and Sarcophaga meiofilosia, which are parasitoids of the mangrove snail Littoraria filosa. Both flies have similar geographic distributions but S. meiofilosia is smaller and only attacks snails less than 10 mm long, while S. megafilosia attacks larger snails. The flies behave differently, have different developmental times and can be distinguished by consistent differences in their genitalia, so they seem to be very closely related but nevertheless separate species. Meiklejohn et al. (2011) extracted the DNA from these and several other species of Sarcophaga and carried out a cluster analysis. The DNA of the two parasitoids was found to be so similar that Meiklejohn et al. (2011) commented ‘The 2.891% interspecific variation noted between S. megafilosia and S. meiofilosia suggests that these morphologically distinct specimens could have diverged very recently or possibly belong to the same species. . .’. These results emphasise the importance of considering all the available evidence when defining species.
22.5.1 Reporting the results of a cluster analysis

The results of a cluster analysis are usually given as a dendrogram, although this is often displayed running across the page, with the value of the dissimilarity on the X axis (Figure 22.18).
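Reading a dendrogram amounts to asking which clusters exist at a given amount of permitted dissimilarity. The sketch below is an illustrative helper (the name `clusters_at` is hypothetical) that takes the three fusion levels from the worked example for sites A to D and lists the clusters present at any chosen threshold.

```python
# Fusion levels from the worked example: (members of the fused cluster, dissimilarity).
fusions = [
    (("A", "D"), 13.3),
    (("A", "D", "B"), 57.4),
    (("A", "D", "B", "C"), 68.0),
]

def clusters_at(permitted_dissimilarity, sites=("A", "B", "C", "D")):
    """Return the clusters present once every fusion at or below the
    permitted within-cluster dissimilarity has taken place."""
    clusters = [{site} for site in sites]
    for members, level in fusions:
        if level <= permitted_dissimilarity:
            merged = set(members)
            # Drop the clusters swallowed by this fusion and add the merged one.
            clusters = [c for c in clusters if not c <= merged] + [merged]
    return sorted(sorted(c) for c in clusters)

for threshold in (10, 13.3, 57.4, 68):
    print(threshold, clusters_at(threshold))
```

Below 13.3 units all four sites are separate; the three-, two- and one-cluster solutions appear at 13.3, 57.4 and 68 units, which is what the dendrogram displays graphically.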
Figure 22.18 Horizontal dendrogram showing sites A, B, C and D hierarchically arranged in fewer clusters as the amount of dissimilarity permitted within clusters increases. At 13.3 units of dissimilarity, there are three clusters: (A&D), B and C. These reduce to two clusters: (A&D&B), and C at 57.4 units. Fusion into only one cluster occurs at 68 units of permitted dissimilarity.

22.6 Which multivariate analysis should you use?

The three analyses described in this chapter are all ways of summarising and simplifying a multivariate data set so that relationships among samples can be more easily visualised, but they have different applications. Principal components analysis is useful for data sets where there are few 0 values and you need to know which variables contribute most to differences among samples. Multidimensional scaling can be used with data sets that contain a lot of 0 values. Most MDS programs do not give an indication of which original variables contribute to differences among samples. Cluster analysis assigns samples to groups, based on dissimilarity or similarity, which may help you categorise the samples.

This chapter has only described three commonly used multivariate methods. This is deliberate. First, many life scientists may never use multivariate analyses themselves, but need a conceptual grasp of how they actually work so that they can evaluate reports that include summary statistics and conclusions from these procedures. Second, more powerful methods of analysing multivariate data are being developed, but most of these are derivations of these three ‘core’ methods.
22.7 Questions

(1) Discuss the statement ‘If there are no correlations within a multivariate data set, then principal components analysis really is not very much use at all.’

(2) An environmental scientist carried out a principal components analysis and obtained the following eigenvalues for components 1 to 5. Which components would you use for a graphical display of the data? Why?

Principal component   Eigenvalue   Percentage variation
1                     3.54         44
2                     2.82         23
3                     2.64         22
4                     0.89         6
5                     0.42         5

(3) What is ‘stress’ in the context of a two-dimensional summary of the results from a multidimensional scaling analysis?

(4) Why are the ‘groups’ produced by cluster analysis often not equivalent to true cases of categorical data (such as males versus females)?
23 Choosing a test

23.1 Introduction
Statisticians and life scientists who teach statistics are often visited in their offices by a researcher or student they may never have met before, who is clutching a thick pile of paper and perhaps a couple of flash drives or CDs with labels such as ‘Experiment 1’ or ‘Trial 2’. The visitor drops everything heavily on the desk and says ‘Here are my results. What stats do I need?’

This is not a good thing to do. First, the person whose advice you are seeking may not have the time to work out exactly what you have done, so they may give you bad advice. Second, the answer can be a very nasty surprise like ‘There are problems with your experimental design.’

The decision about the appropriate statistical analysis needs to be made by considering the hypothesis being tested, the experimental design and the type of data. It can save a lot of time, trouble and disappointment if you think about possible ways of analysing the data before the sampling is done, and preferably when the experiment is being designed, instead of only after the data have been collected.

The following 12 tables are a guide to choosing an appropriate analysis from the ones discussed in this book. You need to start at Table 23.1, which initially gives three mutually exclusive alternatives. Once you have decided among these, work downwards within the column you have chosen. There may be more choices and again you need to select the appropriate column and continue downwards. Eventually you will be referred to another table with more choices that lead to a suggested analysis.
Table 23.1 Are you: Testing the hypothesis that one or more univariate samples of ratio, interval, ordinal or nominal scale data are from the same population; Testing whether two or more variables are related; Examining whether there is similarity or dissimilarity among samples of multivariate data?

Table 23.2 Is the hypothesis about: (a) whether one sample of ratio, interval or ordinal scale data is from a population with known or expected population statistics, or (b) whether two or more samples of ratio, interval or ordinal scale data are from the same population?

Table 23.3 Tests for nominal scale data.

Table 23.4 Tests for two or more independent samples of normally distributed ratio, interval or ordinal scale data.

Table 23.5 Tests for two or more independent samples of ratio, interval or ordinal scale data that are not normally distributed.

Table 23.6 Tests for two or more related samples of normally distributed ratio, interval or ordinal scale data.

Table 23.7 Tests for two or more related samples of ratio, interval or ordinal scale data that are not normally distributed.

Table 23.8 Tests for whether two or more variables are related.

Table 23.9 Methods for comparing two or more samples of multivariate data.

Table 23.10 Tests for two or more independent samples of normally distributed ratio, interval or ordinal scale data.

Table 23.11 Tests for two or more independent samples of normally distributed ratio, interval or ordinal scale data.

Table 23.12 Tests for two or more independent samples of normally distributed ratio, interval or ordinal scale data, where you have an orthogonal design for two or more levels of two factors.
Appendix: Critical values of chi-square, t and F
Table A1 Critical values of chi-square when α = 0.05, for 1–120 degrees of freedom. If the calculated value of chi-square is larger than the critical value for the appropriate number of degrees of freedom, then the probability of the result is < 0.05 (and is therefore considered significant with an α of 0.05). For example, for three degrees of freedom the critical value is 7.815, so a chi-square larger than this indicates P < 0.05. Values were calculated using the method given by Zelen and Severo (1964).

Degrees of   α = 0.05     Degrees of   α = 0.05     Degrees of   α = 0.05
freedom                   freedom                   freedom
1            3.841        41           56.942       81           103.010
2            5.991        42           58.124       82           104.139
3            7.815        43           59.304       83           105.267
4            9.488        44           60.481       84           106.395
5            11.070       45           61.656       85           107.522
6            12.592       46           62.830       86           108.648
7            14.067       47           64.001       87           109.773
8            15.507       48           65.171       88           110.898
9            16.919       49           66.339       89           112.022
10           18.307       50           67.505       90           113.145
11           19.675       51           68.669       91           114.268
12           21.026       52           69.832       92           115.390
13           22.362       53           70.993       93           116.511
14           23.685       54           72.153       94           117.632
15           24.996       55           73.311       95           118.752
16           26.296       56           74.468       96           119.871
17           27.587       57           75.624       97           120.990
18           28.869       58           76.778       98           122.108
19           30.144       59           77.931       99           123.225
20           31.410       60           79.082       100          124.342
21           32.671       61           80.232       101          125.458
22           33.924       62           81.381       102          126.574
23           35.172       63           82.529       103          127.689
24           36.415       64           83.675       104          128.804
25           37.652       65           84.821       105          129.918
26           38.885       66           85.965       106          131.031
27           40.113       67           87.108       107          132.144
28           41.337       68           88.250       108          133.257
29           42.557       69           89.391       109          134.369
30           43.773       70           90.531       110          135.480
31           44.985       71           91.670       111          136.591
32           46.194       72           92.808       112          137.701
33           47.400       73           93.945       113          138.811
34           48.602       74           95.081       114          139.921
35           49.802       75           96.217       115          141.030
36           50.998       76           97.351       116          142.138
37           52.192       77           98.484       117          143.246
38           53.384       78           99.617       118          144.354
39           54.572       79           100.749      119          145.461
40           55.758       80           101.879      120          146.567
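Critical values such as those in Table A1 can be checked numerically. The sketch below does not follow the method of Zelen and Severo (1964). As an assumption made purely for illustration, it evaluates the cumulative chi-square distribution with the standard series for the regularised lower incomplete gamma function and then finds the critical value by bisection. The function names are hypothetical.

```python
import math

def chi_square_cdf(x, df):
    """P(chi-square <= x), from the series for the regularised lower
    incomplete gamma function P(a, z) with a = df/2 and z = x/2."""
    if x <= 0:
        return 0.0
    a, z = df / 2.0, x / 2.0
    term = 1.0 / a          # first term of the series
    total = term
    n = 0
    while term > total * 1e-15:
        n += 1
        term *= z / (a + n)
        total += term
    return total * math.exp(a * math.log(z) - z - math.lgamma(a))

def chi_square_critical(df, alpha=0.05):
    """Value of chi-square exceeded with probability alpha, by bisection."""
    low, high = 0.0, df + 10.0 * math.sqrt(2.0 * df) + 20.0
    for _ in range(200):
        mid = (low + high) / 2.0
        if chi_square_cdf(mid, df) < 1.0 - alpha:
            low = mid
        else:
            high = mid
    return (low + high) / 2.0

for df in (1, 3, 10):
    print(df, round(chi_square_critical(df), 3))
```

For 1, 3 and 10 degrees of freedom this should reproduce 3.841, 7.815 and 18.307. A calculated chi-square larger than the returned value corresponds to P < 0.05.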
Table A2 Critical two- and one-tailed values of Student’s t statistic when α = 0.05, calculated using the method given by Zelen and Severo (1964). A t test is used for comparison between two samples or a sample and a population, so both non-directional and directional alternate hypotheses are possible (e.g. for the latter, the alternate hypothesis might be ‘The mean of Sample A is expected to be greater than the population mean μ’). For non-directional and therefore two-tailed alternate hypotheses, if the calculated value of t is outside the range of zero ± the critical value, then the probability of that result is < 0.05 (and therefore considered significant). For example, for six degrees of freedom the value of t must be outside the range of zero ± 2.447. For directional and therefore one-tailed alternate hypotheses, you first need to check whether the difference between two means is in the direction specified by the alternate hypothesis (e.g. if the hypothesis specifies that mean A is greater than mean B, there is no need to look up the critical value if mean B is greater than mean A because the null hypothesis will stand whatever the value of t). If the difference is in the direction specified by the alternate hypothesis, then the absolute value of t (i.e., the value of t written as a positive number irrespective of whether the calculated value is positive or negative) is significant if it is larger than the one-tailed critical value for the appropriate number of degrees of freedom in Table A2. For example, for 20 degrees of freedom the absolute value of t must exceed 1.725 for P < 0.05.

Degrees of   α (2) = 0.05   α (1) = 0.05     Degrees of   α (2) = 0.05   α (1) = 0.05
freedom                                      freedom
1            12.706         6.314            42           2.018          1.682
2            4.303          2.920            44           2.015          1.680
3            3.182          2.353            46           2.013          1.679
4            2.776          2.132            48           2.011          1.677
5            2.571          2.015            50           2.009          1.676
6            2.447          1.943            52           2.007          1.675
7            2.365          1.895            54           2.005          1.674
8            2.306          1.860            56           2.003          1.673
9            2.262          1.833            58           2.002          1.672
10           2.228          1.812            60           2.000          1.671
11           2.201          1.796            62           1.999          1.670
12           2.179          1.782            64           1.998          1.669
13           2.160          1.771            66           1.997          1.668
14           2.145          1.761            68           1.995          1.668
15           2.131          1.753            70           1.994          1.667
16           2.120          1.746            72           1.993          1.666
17           2.110          1.740            74           1.993          1.666
18           2.101          1.734            76           1.992          1.665
19           2.093          1.729            78           1.991          1.665
20           2.086          1.725            80           1.990          1.664
21           2.080          1.721            82           1.989          1.664
22           2.074          1.717            84           1.989          1.663
23           2.069          1.714            86           1.988          1.663
24           2.064          1.711            88           1.987          1.662
25           2.060          1.708            90           1.987          1.662
26           2.056          1.706            92           1.986          1.662
27           2.052          1.703            94           1.986          1.661
28           2.048          1.701            96           1.985          1.661
29           2.045          1.699            98           1.984          1.661
30           2.042          1.697            100          1.984          1.660
31           2.040          1.696            200          1.972          1.653
32           2.037          1.694            300          1.968          1.650
33           2.035          1.692            400          1.966          1.649
34           2.032          1.691            500          1.965          1.648
35           2.030          1.690            600          1.964          1.647
36           2.028          1.688            700          1.963          1.647
37           2.026          1.687            800          1.963          1.647
38           2.024          1.686            900          1.963          1.647
39           2.023          1.685            1000         1.962          1.646
40           2.021          1.684            ∞            1.960          1.645
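The decision rule described in the caption to Table A2 can be written out as a short Python sketch. The dictionary holds only two rows copied from the table (10 and 20 degrees of freedom) and the function name and arguments are hypothetical, chosen for illustration.

```python
# Critical values of t for alpha = 0.05, copied from Table A2:
# degrees of freedom -> (two-tailed value, one-tailed value).
critical_t = {10: (2.228, 1.812), 20: (2.086, 1.725)}

def is_significant(t, df, tails=2, direction_as_predicted=True):
    """Apply the Table A2 decision rule to a calculated value of t.

    For a one-tailed test the direction of the difference is checked first:
    if it is not the direction specified by the alternate hypothesis, the
    null hypothesis stands whatever the value of t.
    """
    two_tailed, one_tailed = critical_t[df]
    if tails == 2:
        return abs(t) > two_tailed
    if not direction_as_predicted:
        return False
    return abs(t) > one_tailed

print(is_significant(-2.3, 10))                                          # two-tailed
print(is_significant(1.9, 20))                                           # two-tailed
print(is_significant(1.9, 20, tails=1))                                  # one-tailed
print(is_significant(1.9, 20, tails=1, direction_as_predicted=False))   # wrong direction
```

With 20 degrees of freedom, t = 1.9 is not significant for a two-tailed hypothesis (1.9 < 2.086) but is significant for a one-tailed hypothesis (1.9 > 1.725), provided the difference is in the predicted direction.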
Table A3 Critical values of the F distribution for ANOVA when α = 0.05. If the calculated value of F is larger than the critical value for the appropriate degrees of freedom, it is considered significant. The columns in Table A3 give the degrees of freedom for the numerator of the F ratio and the rows give the degrees of freedom for the denominator. For example, the critical value for F3,7 is 4.35, so an F statistic larger than this is considered significant at P < 0.05. Values were calculated using the method given by Zelen and Severo (1964).

Denominator    Numerator degrees of freedom
degrees of
freedom        1       2       3       4       5       6       7       8       9       10      12      15      20      30      50      100     200
1              161.45  199.50  215.71  224.58  230.16  233.99  236.77  238.88  240.54  241.88  243.90  245.95  248.01  250.10  252     253     254
2              18.51   19.00   19.16   19.25   19.30   19.33   19.35   19.37   19.38   19.40   19.41   19.43   19.45   19.46   19.47   19.49   19.49
3              10.13   9.55    9.28    9.12    9.01    8.94    8.89    8.85    8.81    8.79    8.75    8.70    8.66    8.62    8.58    8.55    8.54
4              7.71    6.94    6.59    6.39    6.26    6.16    6.09    6.04    6.00    5.96    5.91    5.86    5.80    5.75    5.70    5.66    5.65
5              6.61    5.79    5.41    5.19    5.05    4.95    4.88    4.82    4.77    4.74    4.68    4.62    4.56    4.50    4.44    4.41    4.39
6              5.99    5.14    4.76    4.53    4.39    4.28    4.21    4.15    4.10    4.06    4.00    3.94    3.87    3.81    3.75    3.71    3.69
7              5.59    4.74    4.35    4.12    3.97    3.87    3.79    3.73    3.68    3.64    3.58    3.51    3.45    3.38    3.32    3.27    3.25
8              5.32    4.46    4.07    3.84    3.69    3.58    3.50    3.44    3.39    3.35    3.28    3.22    3.15    3.08    3.02    2.97    2.95
9              5.12    4.26    3.86    3.63    3.48    3.37    3.29    3.23    3.18    3.14    3.07    3.01    2.94    2.86    2.80    2.76    2.73
10             4.96    4.10    3.71    3.48    3.33    3.22    3.14    3.07    3.02    2.98    2.91    2.85    2.77    2.70    2.64    2.59    2.56
11             4.84    3.98    3.59    3.36    3.20    3.10    3.01    2.95    2.90    2.85    2.79    2.72    2.65    2.57    2.51    2.46    2.43
12             4.75    3.89    3.49    3.26    3.11    3.00    2.91    2.85    2.80    2.75    2.69    2.62    2.54    2.47    2.40    2.35    2.32
13             4.67    3.81    3.41    3.18    3.03    2.92    2.83    2.77    2.71    2.67    2.60    2.53    2.46    2.38    2.31    2.26    2.23
14             4.60    3.74    3.34    3.11    2.96    2.85    2.76    2.70    2.65    2.60    2.53    2.46    2.39    2.31    2.24    2.19    2.16
15             4.54    3.68    3.29    3.06    2.90    2.79    2.71    2.64    2.59    2.54    2.48    2.40    2.33    2.25    2.18    2.12    2.10
16             4.49    3.63    3.24    3.01    2.85    2.74    2.66    2.59    2.54    2.49    2.43    2.35    2.28    2.19    2.12    2.07    2.04
17             4.45    3.59    3.20    2.97    2.81    2.70    2.61    2.55    2.49    2.45    2.38    2.31    2.23    2.15    2.08    2.02    1.99
18             4.41    3.56    3.16    2.93    2.77    2.66    2.58    2.51    2.46    2.41    2.34    2.27    2.19    2.11    2.04    1.98    1.95
19             4.38    3.52    3.13    2.90    2.74    2.63    2.54    2.48    2.42    2.38    2.31    2.23    2.16    2.07    2.00    1.94    1.91
20             4.35    3.49    3.10    2.87    2.71    2.60    2.51    2.45    2.39    2.35    2.28    2.20    2.12    2.04    1.97    1.91    1.88
21             4.32    3.47    3.07    2.84    2.69    2.57    2.49    2.42    2.37    2.32    2.25    2.18    2.10    2.01    1.94    1.88    1.84
22             4.30    3.44    3.05    2.82    2.66    2.55    2.46    2.40    2.34    2.30    2.23    2.15    2.07    1.98    1.91    1.85    1.82
23             4.28    3.42    3.03    2.80    2.64    2.53    2.44    2.38    2.32    2.28    2.20    2.13    2.05    1.96    1.88    1.82    1.79
24             4.26    3.40    3.01    2.78    2.62    2.51    2.42    2.36    2.30    2.26    2.18    2.11    2.03    1.94    1.86    1.80    1.77
25             4.24    3.39    2.99    2.76    2.60    2.49    2.40    2.34    2.28    2.24    2.16    2.09    2.01    1.92    1.84    1.78    1.75
26             4.23    3.37    2.98    2.74    2.59    2.47    2.39    2.32    2.27    2.22    2.15    2.07    1.99    1.90    1.82    1.76    1.73
27             4.21    3.35    2.96    2.73    2.57    2.46    2.37    2.31    2.25    2.20    2.13    2.06    1.97    1.88    1.81    1.74    1.71
28             4.20    3.34    2.95    2.71    2.56    2.45    2.36    2.29    2.24    2.19    2.12    2.04    1.96    1.87    1.79    1.73    1.69
29             4.18    3.33    2.93    2.70    2.55    2.43    2.35    2.28    2.22    2.18    2.10    2.03    1.94    1.85    1.77    1.71    1.67
30             4.17    3.32    2.92    2.69    2.53    2.42    2.33    2.27    2.21    2.17    2.09    2.01    1.93    1.84    1.76    1.70    1.66
40             4.08    3.23    2.84    2.61    2.45    2.34    2.25    2.18    2.12    2.08    2.00    1.92    1.84    1.74    1.66    1.59    1.55
50             4.03    3.18    2.79    2.56    2.40    2.29    2.20    2.13    2.07    2.03    1.95    1.87    1.78    1.69    1.60    1.52    1.48
60             4.00    3.15    2.76    2.53    2.37    2.25    2.17    2.10    2.04    1.99    1.92    1.84    1.75    1.65    1.56    1.48    1.44
70             3.98    3.13    2.74    2.50    2.35    2.23    2.14    2.07    2.02    1.97    1.89    1.81    1.72    1.62    1.53    1.45    1.40
80             3.96    3.11    2.72    2.49    2.33    2.21    2.13    2.06    2.00    1.95    1.88    1.79    1.70    1.60    1.51    1.43    1.38
90             3.95    3.10    2.71    2.47    2.32    2.20    2.11    2.04    1.99    1.94    1.86    1.78    1.69    1.59    1.49    1.41    1.36
100            3.94    3.09    2.70    2.46    2.31    2.19    2.10    2.03    1.97    1.93    1.85    1.77    1.68    1.57    1.48    1.39    1.34
200            3.89    3.04    2.65    2.42    2.26    2.14    2.06    1.98    1.93    1.88    1.80    1.72    1.62    1.52    1.41    1.32    1.26
500            3.86    3.01    2.62    2.39    2.23    2.12    2.03    1.96    1.90    1.85    1.77    1.69    1.59    1.48    1.38    1.28    1.21
References
Andrews, M. L. A. (1976) The Life that Lives on Man. London: Faber & Faber.
Chalmers, A. F. (1999) What is This Thing Called Science? 3rd edition. Indianapolis: Hackett Publishing Co.
Cowles, M. and Davis, C. (1982) On the origins of the .05 level of statistical significance. American Psychologist 37: 553–558.
Davis, J. C. (2002) Statistics and Data Analysis in Geology. New York: Wiley.
Fisher, R. A. (1925) Statistical Methods for Research Workers. Edinburgh: Oliver & Boyd.
Howard, R. M. (1993) A plagiarism pentimento. Journal of Teaching and Writing 11: 233–246.
Hurlbert, S. H. (1984) Pseudoreplication and the design of ecological field experiments. Ecological Monographs 54: 187–211.
Kuhn, T. S. (1970) The Structure of Scientific Revolutions. 2nd edition. Chicago: University of Chicago Press.
LaFollette, M. C. (1992) Stealing into Print: Fraud, Plagiarism and Misconduct in Scientific Publishing. Berkeley, CA: University of California Press.
Lakatos, I. (1978) The Methodology of Scientific Research Programmes. New York: Cambridge University Press.
Lombardi, C. M. and Hurlbert, S. H. (2009) Misprescription and misuse of one-tailed tests. Austral Ecology 34: 447–469.
Manly, B. F. J. (2001) Statistics for Environmental Science and Management. Boca Raton: Chapman & Hall/CRC.
McKillup, S. C. (1988) Behaviour of the millipedes, Ommatoiulus moreletii, Ophyiulus verruciluger and Oncocladosoma castaneum in response to visible light: an explanation for the invasion of houses by Ommatoiulus moreletii. Journal of Zoology, London 215: 35–46.
McKillup, S. C. and McKillup, R. V. (1993) Behavior of the intertidal gastropod Planaxis sulcatus (Cerithiacae: Planaxidae) in Fiji: are responses to damaged conspecifics and predators more pronounced on tropical versus temperate shores? Pacific Science 47: 401–407.
Meiklejohn, K. A., Wallman, J. F. and Dowton, M. (2011) DNA-based identification of forensically important Australian Sarcophagidae (Diptera). International Journal of Legal Medicine 125: 27–32.
Pape, T., McKillup, S. C. and McKillup, R. V. (2000) Two new species of Sarcophaga (Sarcorohdendorfia), parasitoids of Littoraria filosa (Sowerby) (Gastropoda: Littorinidae). Australian Journal of Entomology 39: 236–240.
Popper, K. R. (1968) The Logic of Scientific Discovery. London: Hutchinson.
Quinn, G. P. and Keough, M. J. (2002) Experimental Design and Data Analysis for Biologists. Cambridge: Cambridge University Press.
Roig, M. (2001) Plagiarism and paraphrasing criteria of college and university professors. Ethics and Behavior 11: 307–323.
Siegel, S. and Castallan, J. J. (1988) Statistics for the Behavioral Sciences. 2nd edition. New York: McGraw-Hill.
Singer, P. (1992) Practical Ethics. Cambridge: Cambridge University Press.
Sokal, R. R. and Rohlf, F. J. (1995) Biometry. 3rd edition. New York: W.H. Freeman.
Sprent, P. (1993) Applied Nonparametric Statistical Methods. 2nd edition. London: Chapman & Hall.
Stigler, S. M. (2008) Fisher and the 5% level. Chance 21: 12.
Student. (1908) The probable error of a mean. Biometrika 6: 1–25.
Tukey, J. W. (1977) Exploratory Data Analysis. Reading: Addison Wesley.
Zar, J. H. (1999) Biostatistical Analysis. 4th edition. Upper Saddle River, NJ: Prentice Hall.
Zelen, M. and Severo, N. C. (1964) Probability functions. In Handbook of Mathematical Functions, Abramowitz, M. and Stegun, I. (eds.). Washington, DC: National Bureau of Standards, pp. 925–995.
Index
a posteriori comparisons, 157–162 after Levene test, 206 after single-factor ANOVA, 157 after two-factor ANOVA, 181–192 after two-factor ANOVA without replication, 214 and Type 1 error, 163 Tukey test, 158 a priori comparisons, 162–164, 183 absolute value, 322 accuracy, 29 additivity of polynomial expansions in regression, 271 of sums of squares and degrees of freedom, 151, 179 alpha α, 62 of 0.05, 62 other values of, 61–62 alternate hypothesis, 12 analysis of covariance (ANCOVA), 284–296 and regression analysis, 291 assumptions of, 289–293 concept of, 288 reporting the results of, 295 analysis of variance (ANOVA) a posteriori comparisons, 157–165 reporting the results of, 164 Tukey test, 158 a priori comparisons, 162–164 assumptions, 201 homoscedasticity, 196 independence, 201 normally distributed data, 197 concept of, 146 fixed and random effects, 153
interaction, 168 example of, 170 Model I, 153 Model II, 153 multiple comparisons, 157–165 risk of Type 1 error, 140 nested, 222–228 reporting the results of, 230 randomised blocks, 214–216 reporting the results of, 229 repeated measures, 216–221 reporting the results of, 230 single factor, 140–154 a posteriori comparisons, 157–165 arithmetic example, 147–152 degrees of freedom, 154 Model I (fixed effects), 153 Model II (random effects), 153 reporting the results of, 154 unbalanced designs, 152 three-factor, 192 two-factor, 168–194 a posteriori comparisons, 181 advantage of, 168 cautions and complications, 181–192 concept of, 180 fixed and random factors, 187–192 interaction, 169, 182 interaction obscuring main effects, 183 reporting the results of, 193 synergistic effect, 168 Tukey test, 182 unbalanced designs, 192 two-factor without replication, 209 a posteriori testing, 214
and randomised blocks, 214 reporting the results of, 229 ANOVA. See Analysis of variance (ANOVA) apparent replication. See replication average. See mean backward elimination, 278 Bayes’ theorem, 77–86 estimation of a conditional probability, 78 updating probabilities, 80 beta β. See Type 2 error binomial distribution, 103 biological risk, 136 bivariate normal distribution, 241 box-and-whiskers plot, 197–200 worked example, 199 carryover effects, 216 central limit theorem, 102 chi-square statistic χ2, 64 table of critical values, 388 test, 64–66, 302–308 bias with one degree of freedom, 308 for heterogeneity, 306 for one sample, 302 inappropriate use of, 312 reporting the results of, 316 worked example, 64 Yates’ correction, 308, 309 choosing a test, 375 cluster analysis, 368–372 Cochran Q test, 316 coefficient of determination r2, 257 conditional probability, 75 Bayes’ theorem, 77 confidence interval, 99, 108 for a population, 99 for a sample, 100 used for statistical testing, 102 confidence limits, 99, 108 contingency table, 306 control treatments, 35 for disturbance, 36 for time, 36
sham operations, 38 correlation, 234–241 coefficient r, 234 confused with causality, 30 contrasted with regression, 233–234 linear assumptions of, 241 reporting the results of, 242 non-parametric, 342 covariate, 284 data bivariate, 15, 25 continuous, 17 discrete, 17 displaying, 17–28 bivariate, 25 cumulative graph, 20 frequency polygon, 18 histogram, 17 nominal scale, 23 ordinal scale, 23 pie diagram, 22 interval scale, 16 multivariate, 15, 26 nominal scale, 16 ordinal scale, 16 ratio scale, 15 univariate, 15 degrees of freedom, 112 additivity in single-factor ANOVA, 151 for a 2 × 2 contingency table, 307 for F statistic, 154 for t test, 112 in regression analysis, 261 dissimilarity, 361 distribution binomial, 103 normal, 88 of sample means, 95 Poisson, 104 effect size, 132 and power of a test, 135 eigenvalues, 351 table of in PCA, 355
eigenvectors, 351 environmental impact assessment, 42 error and sample size, 102 Type 1, 130 definition of, 61 Type 2, 131 definition of, 61 uncontrollable among sampling units, 143 when estimating population statistics from a sample, 93 within group, 175 ethical behaviour, 48–54 acknowledging input of others, 50 acknowledging previous work, 49 approval and permits, 50 experiments on vertebrates, 51 fair dealing, 49 falsifying results, 52 modifying hypotheses, 124 moral judgements, 51 plagiarism, 48 pressure from peers or superiors, 53 record keeping, 53 Euclidian distance, 362 exact tests Fisher Exact Test, 309 for ratio, interval and ordinal scale data, 327, 334 examples antibiotics and pig growth, 338 Arterolin B and blood pressure, 56 bacteria and turbidity in a reservoir, 349 basal cell carcinomas, 23 cane toads in Queensland, 306 chilli yield and fertilisers, 276 cholera in London, 11 cockroaches activity and weather, 169 growth, 174 control treatments, 35 dental decay in Hale and Yarvard, 25 dental hygiene in male adolescents, 341 dietary supplements and pig growth, 160 drugs and brain tumours, 142 egg laying by mosquitoes, 43
familiarity effect and athletes, 116 gecko behaviour and predators, 315 growth of palm seedlings and soil type, 325 herbivorous snails and predators, 35 hypothesis testing, 8, 11 koalas and tooth wear, 234 lung structure and damage, 336 maggot numbers in cadavers, 267 mites living in human hair follicles, 258 Panama Gold crop, 115 pelt colour in guinea pigs, 302 pollutants in sediments, 354 Portuguese millipedes and light, 8–10 prawn growth in aquaculture ponds, 222 prawns in Dark Lake, 32 predicting adult height, 244 prostate tumours and drugs, 209 pseudoreplication, 32 rats and heavy metals, 284 reaction time of students, 18 sandfly bites and hair colour, 331 scavenging whelks, 313 sperm counts of coffee drinkers, 67 stomach ulcers in humans, 12 student visits to a medical doctor, 17 suggestion and motion sickness, 309 testicular torsion in humans, 43 tomato yield and fertilisers, 274 Type 1 error, 61 Type 2 error, 62 vitamin A and prawn growth, 38 wombat fatalities along a country road, 322 examples, worked analysis of covariance (ANCOVA), 288, 293 analysis of variance (ANOVA), 140–156, 168–195, 209–232 Bayes’ theorem and bowel cancer, 79 and potable water, 77 updating probabilities, 81, 85 box-and-whiskers plot, 199 chi-square test, 64, 306 cluster analysis, 368–372 Kolmogorov–Smirnov one-sample test, 320
linear regression, 258 McNemar test, 315 multidimensional scaling (MDS), 361–368 Pearson correlation, 235–241 population variance, 90 principal components analysis (PCA), 348–353 probability coin tossing, 71, 72 rolling a die, 71, 75 statistical significance testing, 57–61 t test paired sample, 118 single sample, 116 two independent samples, 120 Type 1 error and multiple comparisons, 141 Z statistic, 93 Z test, 98, 111 experiment common sense, 37 good design versus cost, 45 manipulative, 29, 34–42 control treatments, 35 need for independent replicates, 34 placebo, 38 pseudoreplication, 38 treatments confounded in time, 37 meaningful result, 56 mensurative, 29, 34 confounded, 31 correlation and causality, 30 need for independent samples, 32 need to repeat, 34 pseudoreplication, 33 negative outcome, 12 realism, 42 replication not possible, 41 testing an hypothesis, 7 experimental design, 29–46 experimental unit, 1, 15
F statistic, 125, 146 degrees of freedom for, 154 for linear regression, 256 table of critical values for ANOVA, 392 factor, 142 fixed and random, 188 false negative. See Type 2 error false positive. See Type 1 error Fisher Exact Test, 309 reporting the results of, 317 Fisher, R., 60, 141 his choice of P < 0.05, 67 frequency, 17 Friedman test, 338 reporting the results of, 340
G test, 308 reporting the results of, 317 Gossett, W.S. See Student graph bivariate data, 25 cumulative frequency, 20 frequency polygon, 18 histogram, 17 multivariate data, 26 nominal scale data, 23 ordinal scale data, 23 heterogeneity of variances. See heteroscedasticity heteroscedasticity, 196 tests for, 204–205 homogeneity of variances. See homoscedasticity homoscedasticity, 196 Hurlbert, S., 33, 38 hypothesis, 7 alternate, 12, 35 becoming a theory, 11 cannot be proven, 11 null, 12, 35, 60 prediction from, 7 retained or rejected, 7, 11 two-tailed contrasted with one-tailed, 121–124
interaction, 169 obscuring main effects in ANOVA, 183
Kolmogorov–Smirnov one-sample test, 320–325 reporting the results of, 324 Kruskal–Wallis test, 331 and single-factor ANOVA, 334 reporting the results of, 335 Kuhn, T., 13 Lakatos, I., 13 leptokurtic, 103 Levene test, 204 reporting the results of, 205 log-likelihood ratio. See G test Mann–Whitney U test, 325 McNemar test, 315 mean, 26, 87, 93 calculation of, 90 of a normal distribution, 89 standard error of (SEM), 95 mean square, 150 median, 105 meta-analysis, 42 mode, 105 Monte Carlo method, 304 multicollinearity, 279 multidimensional scaling (MDS), 361–368 multiple linear regression, 273–281 multicollinearity, 279 multivariate analyses, 346–373 cluster analysis, 368–372 caution in use of, 372 group average linkage method, 369 hierarchical clustering, 368 reporting the results of, 372 multidimensional scaling (MDS), 361–368 cautions in the use of, 367 example, 363 reporting the results of, 368 stress, 366 principal components analysis (PCA), 348–361 cautions and restrictions, 359 eigenvalue, 351 eigenvector, 351
number of components to plot, 359 practical use of, 358 principal components, 351 redundancy, 348 reporting the results of, 360 Q-mode, 347 R-mode, 347 nested design, 222 non-parametric tests correlation, 342–344 introduction to, 298 nominal scale and independent data, 301–313 chi-square test, 302–308 Fisher Exact Test, 309 G test, 308 inappropriate use of, 312 randomisation test, 308 nominal scale and related data, 314–316 Cochran Q test, 316 McNemar test, 315 ratio, interval or ordinal scale and independent data, 319–335 cautions when using, 319 exact test for two samples, 327 Kolmogorov–Smirnov one-sample test, 320–325 Kruskal–Wallis test, 331 Mann–Whitney test, 325 randomisation and exact tests, 334 randomisation test for two samples, 327 ratio, interval or ordinal scale and related data, 335–340 Friedman test, 338 Wilcoxon paired-sample test, 336 normal distribution, 87–102 descriptive statistics, 89 leptokurtic, 103 platykurtic, 103 skewed, 103 null hypothesis, 12 one-tailed hypotheses and tests, 121–124 cautions when using, 123
orthogonal, 169 design without replication, 209 outliers, 197, 200 parallelism in ANCOVA, 289 showing lack of interaction in ANOVA, 170 parametric, 88 Pearson correlation coefficient r, 234 calculation of, 235–241 degrees of freedom for, 241 peer review, 53 placebo, 38 plagiarism, 48 planned comparisons. See a priori comparisons platykurtic, 103 Poisson distribution, 104 polynomials. See regression: curvilinear Popper, K., 7 population, 1 statistics (parameters), 89–93 post-hoc test. See a posteriori comparisons power of a test, 135 and sample size, 135 precision, 29 predictor variable, 244 principal components analysis (PCA), 348–361 probability ≥ 0.05, 63 < 0.001, 62 < 0.01, 61, 62 < 0.05, 60, 62 < 0.1, 62 addition rule, 71 and statistical testing, 56–68 basic concepts, 71–86 Bayes’ theorem, 77 conditional, 75 multiplication rule, 72 not significant, 63 of 1, 71 of an event, 71 of exactly 0.05, 66
of 0, 71 posterior, 82 prior, 81 relative and absolute risk, 73 significant, 60 stating exact values, 63 unlikely events, 68 cancer clusters, 68 pseudoreplication, 33 alternating treatments, 39 apparent replication, 39–41 clumped replicates, 39 in manipulative experiments, 38–41 in mensurative experiments, 34 inappropriate generalisation, 34 isolative segregation, 40 sharing a condition, 40 Q-mode, 347 r statistic, 234 r² statistic, 257 randomisation test, 327 for nominal scale data, 304, 308 reporting the results of, 317 randomised blocks, 214 range, 90, 105 ranks, 319 redundancy, 348 regression, 244–281 contrasted with correlation, 233–234 curvilinear, 266–271 comparing polynomial expansions, 270 danger of extrapolation, 270 reporting the results of, 271 linear, 244–266 assumptions of, 264 coefficient of determination r², 257 danger of prediction and extrapolation, 262 intercept, 246, 249 reporting the results of, 264 residuals, 263 significance testing, 250–257 slope, 246–249
regression, (cont.) multiple linear, 273–281 refining the model, 278 reporting the results of, 281 replicates, 9 replication apparent, 39 alternating treatments, 39 clumped replicates, 39 isolative segregation, 40 sharing a condition, 40 need for, 9, 32, 34 residuals, 263 use in ANCOVA, 288 response variable, 142, 244 risk, 73 R-mode, 347 sample, 1 mean, 26, 93 random, 1 representative, 1 statistics, 93–95 as estimates of population statistics, 93 variance, 94 sampling unit, 1, 15 scientific method, 14, 48 hypothetico-deductive, 7 Kuhn, T., 13 Lakatos, I., 13 paradigm, 13 sigma Σ, 89 significance level, 57 skew, 17 examining data for, 197 Spearman’s rank correlation, 342 reporting the results of, 344 standard deviation for a population σ, 91 proportion of normal distribution, 91 standard error of the mean (SEM), 95 estimated from one sample, 99 statistic, 63 statistical significance, 57 and biological risk, 136
and biological significance, 67 stress, 366 Student, 100 Student’s t. See t statistic sum of squares, 150 additivity in single-factor ANOVA, 151 synergistic effect. See interaction t statistic, 100 critical values, 112, 390 degrees of freedom, 112 t test, 112–127 assumptions equal variances, 125 normality, 124 random samples, 125 choosing the appropriate, 125 for two independent samples, 118–120 paired sample, 116–118 reporting the results of, 126 single sample, 112–115 theory, 11 transformations, 201–204 arc-sine, 203 legitimacy of, 203 logarithmic, 202 square root, 201 Tukey test, 158 for two-factor ANOVA, 182 Type 1 error, 61, 130 and sample size, 131–134 trade-off with Type 2 error, 131–134 Type 2 error, 61, 131 and sample size, 131–136 unplanned comparisons, 157 variable, 15 dependent, 234 independent, 234 response, 142, 234 variance among group, 146 for a population σ², 90
calculation of, 90 for a sample s², 94 total, 147, 176, 179 within group, 146, 176 variance ratio. See F statistic variation among samples by chance, 3 natural, 2
Wilcoxon paired-sample test, 336 reporting the results of, 340 Yates’ correction, 309 Z statistic, 91 Z test, 108, 112