Editor’s Letter
Mike Larsen,
Executive Editor
Dear Readers,

This issue of CHANCE includes articles on a diverse set of topics. Authors Shahedul Ahsan Khan, Grace Chiu, and Joel Dubin discuss statistical issues with modeling the concentration of chlorofluorocarbons, a primary greenhouse gas, in the atmosphere based on worldwide observations. Read on to learn their predictions. Autar Kaw and Ali Yalcin present a new measure to quantify upsets, or unexpected game outcomes, in college football. Once you study their statistic, you can watch for signs of continued topsy-turvy behavior in the 2009–2010 season. Tristan Barnett analyzes expected returns and the chances of extreme outcomes in online gambling games. How risk-averse are you? The calculations in this article could be useful illustrations for teaching expectation, variance, and other distributional summaries. Alberto Baccini, Lucio Barabesi, and Marzia Marcheselli apply network analysis to editorship networks of statistical journals. In doing so, the authors give insight into network analysis and help us understand the field of probability and statistics. Chris Bilder imagines the potential application of group-testing methods in a popular science-fiction television series. Could "Battlestar Galactica" have benefited from having a statistician/biostatistician on board? Definitely.

In his Visual Revelations column, Howard Wainer and coauthor Peter Baldwin discuss a case in which a teacher was accused of helping students cheat on a test. A careful statistical analysis was critical in reaching a justifiable outcome. In Mark Glickman's Here's to Your Health column, Scott Evans defines and illustrates what are referred to as noninferiority clinical trials. The purpose of such trials is to study whether a new version of a drug is not inferior to the currently accepted treatment. Read the article to learn why noninferiority trials are common and how they compare to other clinical trials. In his Goodness of Wit Test column, Jonathan Berkowitz gives us a standard bar-type cryptic puzzle with one constraint. The theme of the puzzle is statistical inference. Finally, Daniel Westreich provides a little comic relief in "A Statistical Investigation of Urban Skylines in Illinois."

It might be interesting to note that authors in this issue reside in the United States, Canada, Australia, and Italy. We welcome contributions and readers from around the world. I look forward to your comments, suggestions, and article submissions.
Enjoy the issue! Mike Larsen
Letter to the Editor
Dear Editor, I have just today received my copy of the Spring 2009 issue of CHANCE and read with particular interest the column "Pictures at an Exhibition," which compares several attempts at graphing a data set concerning the effectiveness of three antibiotics against 16 strains of bacteria, seven Gram-positive and nine Gram-negative. Three questions are suggested for a graph to help answer: (i) What bacteria does this drug kill? (ii) What drug kills this strain of bacteria? (iii) Does the covariate Gram staining help us make decisions? The authors state that, at first blush, one would think (ii) is the key clinical question, so I begin with it. But, surely the table, itself, answers that question better than any possible graph. Suppose, for example, that our patient is infected with Pseudomonas aeruginosa. The various bacteria are represented by the lines of the table, and they are in alphabetical order, so we can easily find the one line
for this particular bacterium—all the rest of the table being pretty much irrelevant. We then have only three numbers to compare and easily find the one in the column corresponding to neomycin to be the smallest, so that is the drug we should use—unless, perhaps, it is unavailable or our patient is allergic to it, in which case we try the drug with the second-smallest number, namely streptomycin. I do not know whether the actual values, as distinct from relative values, are meaningful, but, if so, we have them in front of our eyes. All the graphs just obfuscate this reasoning. First, one must study the construction of the graph to understand it, and next locate the part that refers to the bacterium in question. Even then, only the relative values are clear, while actual values can, at best, be approximated. Similar remarks apply to (i): for each drug, we just list the bacteria for which it has the smallest MIC, or those for which its MIC
is low absolutely, depending on how we interpret the question. Again, a graph is really not helpful. And, by the way, taking logarithms is of no value whatever for answering either question. A more complicated version of (ii) might be, "What drug should we try if the bacterium is not one of the 16 in the table, or is not identifiable at all?" This suggests considering the data as a two-way layout—a sample (not random, of course, but can we do better than to assume it is?) of 16 bacterial strains, each with MICs for three drugs, and we ask for which drug (if any) are the MICs generally smallest? In this context, (iii) makes sense, since we might know whether the bacterium is Gram-negative or positive, even if we cannot identify it. These questions are not so easily answered just by reading the table, and perhaps a graph could be devised that would help us reach a decision.
Thus, I evaluate the 11 graphs in the article solely on the basis of what light they shed on the altered (ii) and on (iii). In my opinion, it's not much in most cases. Every one of the graphs carefully identifies which data point corresponds to which bacterium, thus cluttering it with information of no relevance to these two questions. Entry 1 is so cluttered as to be absolutely unreadable. Entries 1, 2, and 3 barely distinguish Gram-negative
from Gram-positive by using two shades of gray. Entry 7 distinguishes them solely by labeling, and Entry 5c almost does not distinguish them at all, so they are essentially worthless for (iii). The others distinguish them by juxtaposing two separate graphs and manage to give the impression that Gram-positive bacteria are generally easier to treat than Gram-negative, but only Entry 8 and perhaps Entry 5d have any success in telling us
which antibiotic to use in each of the two cases. I might actually approve of them if CHANCE printed them with different colors instead of three shades of gray!
Dana Quade
Professor Emeritus of Biostatistics
The University of North Carolina at Chapel Hill

Author's Response

Professor Quade has given us an interesting critique of the displays shown in our paper and provides an opening for further discussion. For this I thank him. I agree a well-constructed table that could be sorted by different criteria, depending on the specific individual question posed, would provide a good answer for a number of specific queries. If we knew in advance what single question would be asked of the data, the optimal display is almost surely just a line of prose (e.g., "Streptomycin is the best treatment for an infection with Pseudomonas aeruginosa."). But this is a very limited view of the uses of data display. For most data displays, we do not know in advance what questions will be posed for them to answer. We guessed as to the classes of possible questions, but there might be others as well. A display's value is, in some sense, its robustness; its ability to answer questions that were thought of in advance, as well as some that were not. John Tukey pointed out that "the greatest value of a graph is when it forces us to see what we never expected" (emphasis his). How can we measure the extent to which a specific design has done this? Several of the displays we reproduced, none of them unflawed, did this very well. Several of the displays made it abundantly clear how two of the bacteria were misclassified decades before their classifications were corrected. Such power should not be pooh-poohed. The displays reproduced were not shown in their full glory because of the economic constraints that prevented us from publishing them in color. The fact that they were still able to function in a monochrome figuration is a testament to their robustness. Some displays—most notably Jana Asher's gorgeous tour-de-force—suffered more than others from this monochrome restriction. I suggest that critics go to the CHANCE web site and look at the displays in their full glory before voicing objections. Let me challenge Professor Quade to produce an argument, based on his favorite display of these data, that explains how it forces us to see what we never expected and, in addition, one that scales to more bacteria and more antibiotics. All three of the winners (and a number of the others as well) do a remarkable job of this.

Howard Wainer
Atmospheric Concentration of Chlorofluorocarbons: Addressing the Global Concern with the Longitudinal Bent-Cable Model Shahedul Ahsan Khan, Grace Chiu, and Joel A. Dubin
Biological consequences such as skin cancer, cataracts, irreversible damage to plants, and a reduction of drifting organisms (e.g., animals, plants, archaea, bacteria) in the ocean's photic zone may result from increased ultraviolet (UV) exposure due to ozone depletion. According to "Ozone Science: The Facts Behind the Phaseout" by the U.S. Environmental Protection Agency, each natural reduction in ozone levels has been followed by a recovery, though there is convincing scientific evidence that the ozone shield is being depleted well beyond changes due to natural processes. In particular, ozone depletion due to human activity is a major concern, but may be controlled.
One such human activity is the use of chlorofluorocarbons (CFCs). As cited in "The Science of the Ozone Hole" by the University of Cambridge, the catalytic destruction of ozone by atomic chlorine and bromine is the major cause of polar ozone holes, and photodissociation of CFC compounds is the main reason for these atoms to be in the stratosphere. CFCs are nontoxic, nonflammable chemicals containing atoms of carbon, chlorine, and fluorine. CFCs were extensively used in air conditioning/cooling units and as aerosol propellants prior to the 1980s. While CFCs are safe to use in most applications and are inert in the lower atmosphere, they do undergo significant reaction in the upper atmosphere. According to The Columbia Encyclopedia, CFC molecules take an average of 15 years to travel from the ground to the upper atmosphere, and can stay there for about a century. Chlorine inside the CFCs is one of the most important free radical catalysts to destroy ozone. The destruction process continues over the atmospheric lifetime of the chlorine atom (one or two years), during which an average of 100,000 ozone molecules are broken down. Because of this, CFCs were banned globally by the 1987 Montréal Protocol on Substances That Deplete the Ozone Layer. Since this protocol came into effect, the atmospheric concentration of CFCs has either leveled off or decreased. For example, Figure 1 shows the monthly average concentrations of CFC-11 monitored from a station in Mauna Loa, Hawaii. We see roughly three phases: an initial increasing trend (incoming phase), a gradual transition period, and a decreasing trend after the transition period (outgoing phase). This trend is representative of CFC-11 measurements taken by stations across the world.

Figure 1. Trend of the monthly average concentrations of CFC-11 (in ppt) in Mauna Loa, Hawaii. Source: NOAA/ESRL Global Monitoring Division

The effects of CFCs in ozone depletion are a global concern. Although exploratory data analyses reveal a decrease in CFCs in the Earth's atmosphere since the early 1990s, no sophisticated statistical analysis has been conducted so far to evaluate the global trend. In addition, there are several other important questions regarding the CFC concentration in the atmosphere:

• What were the rates of change (increase/decrease) in CFCs before and after the transition period?
• What was the critical time point (CTP) at which the CFC trend went from increasing to decreasing?
• How long did it take for the CFC concentration to show an obvious decline?

Here, we will address these questions statistically by fitting a special changepoint model for CFC-11 data. We focus on CFC-11 because it is considered one of the most dangerous CFCs to the atmosphere. In fact, it has the shortest lifetime of common CFCs and is regarded as a reference substance in the definition of the ozone depletion potential (ODP). The ODP of a chemical is the ratio of its impact on ozone compared to the impact of a similar mass of CFC-11.
Thus, the ODP is 1 for CFC-11 and ranges from 0.6 to 1 for other CFCs. In a broader sense, we will comment on the global trend of CFC-11 and the effectiveness of the Montréal Protocol in preserving the ozone level by reducing the use of CFC-11. Our findings will also provide a rough idea of how long it may take for CFC-11 to disappear from the Earth's atmosphere.
Data

CFCs are monitored from stations across the globe by the NOAA/ESRL Global Monitoring Division and the Atmospheric Lifetime Experiment/Global Atmospheric Gases Experiment/Advanced Global Atmospheric Gases Experiment (ALE/GAGE/AGAGE) global network program. Henceforth, we will refer to these programs as NOAA and AGAGE. Under the Radiatively Important Trace Species (RITS) program, NOAA began measuring CFCs using in situ gas chromatographs at their four baseline observatories—Pt. Barrow (Alaska), Cape Matatula (American Samoa), Mauna Loa (Hawaii), and South Pole (Antarctica)—and at Niwot Ridge (Colorado) in collaboration with the University of Colorado. We will label these stations 1 to 5, respectively. From 1998–1999, a new generation of gas chromatography called chromatograph for atmospheric trace species (CATS) was developed and has been used to measure CFC concentrations ever since. The AGAGE program consists of three stages corresponding to advances and upgrades in instrumentation. The first stage (ALE) began in 1978, the second (GAGE) between 1981–1985, and the third (AGAGE) between 1993–1996. The current AGAGE stations are located in Mace Head (Ireland), Cape Grim (Tasmania), Ragged Point (Barbados), Cape Matatula (American Samoa), and Trinidad Head (California). These five stations will be labeled 6 to 10, respectively.

We consider monthly mean data for our statistical analysis. Ideally, we wish to have full data for all stations, a long enough period to capture all three phases of the CFC trend, and no change in instrumentation (to avoid the elements of nonstationarity and biased measurement). However, we do not have the same duration of consecutive observations for all stations. Moreover, data were recorded by instrumentation that switched from one type to another. Table 1 summarizes the availability of the consecutive observations and the instrumentations used to record data. Thus, ideal statistical conditions are not achievable in this case. As a compromise, we removed stations 9 and 10 from our analysis due to insufficient data and chose a study period in such a way that it can reflect the changing behavior of the CFC-11 concentration in the atmosphere. The Montréal Protocol came into force on January 1, 1989, so we expect an increasing trend in CFC-11 prior to 1989 because of its extensive use during that period. After the implementation of the protocol, we expect a change (either decreasing or leveling off) in the CFC-11 trend. To characterize this
change, we wish to have a study period starting from some point before the implementation of the protocol. Moreover, we must have sufficient data to observe the change, if any. Thus, we settle for a relatively long study period of 152 months from January of 1988 to August of 2000. In particular, it covers stations 3 and 4 with a single measuring device: RITS. Stations 2 and 5 have RITS data until April of 2000, at which point we truncated their data so only RITS is present for stations 2–5 throughout the study period. Data for the remaining four stations during this period were recorded by two measuring devices—RITS and CATS for Station 1 and GAGE and AGAGE for stations 6–8—each device occupying a substantial range of the 152 months. Figure 2 shows the eight profiles of the corresponding CFC-11 data. Specifically, each station constitutes an individual curve, which is different from the others due to actual CFC-11 levels during measurement, exposure to wind and other environmental variables, and sampling techniques. Our objective is to assess the global CFC-11 concentration in the atmosphere and station-specific characterization of the trends. We do not take into consideration the effects of change in instrumentation (RITS/CATS and GAGE/AGAGE). Modeling these effects is currently in progress, and a preliminary analysis reveals statistically insignificant results. However, between-station differences are accounted for, at least partially, by the random components introduced in our modeling.
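As a small bookkeeping aid, the sketch below (not from the paper; the function name is hypothetical) maps calendar months onto the 1–152 index for the January 1988 – August 2000 study period that the Method section later assigns to the measurement occasions.

```python
# Minimal sketch: map calendar months to the 1-152 index used for the
# January 1988 - August 2000 study period (January 1988 is month 1).

def month_index(year, month, start_year=1988, start_month=1):
    """Index of a calendar month within the study period."""
    return (year - start_year) * 12 + (month - start_month) + 1

# Station 3 spans the whole study period ...
assert month_index(1988, 1) == 1
assert month_index(2000, 8) == 152
# ... while Station 2's first and last RITS months are May 1989 and April 2000.
assert month_index(1989, 5) == 17
assert month_index(2000, 4) == 148
```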
Table 1—CFC-11 Data Summary

| Station | Available Consecutive Observations | Instrumentation |
|---|---|---|
| Pt. Barrow (Station 1) | Nov. 1987 – Feb. 1999 | RITS |
| | Jun. 1998 – Aug. 2008 | CATS |
| Cape Matatula (Station 2) | May 1989 – Apr. 2000 | RITS |
| | Dec. 1998 – Aug. 2008 | CATS |
| Mauna Loa (Station 3) | Jul. 1987 – Aug. 2000 | RITS |
| | Jun. 1999 – Aug. 2008 | CATS |
| South Pole (Station 4) | Jun. 1990 – Nov. 2000 | RITS |
| | Feb. 1998 – Aug. 2008 | CATS |
| Niwot Ridge (Station 5) | Feb. 1990 – Apr. 2000 | RITS |
| | May 2001 – Jul. 2006 | CATS |
| Mace Head (Station 6) | Feb. 1987 – Jun. 1994 | GAGE |
| | Mar. 1994 – Sep. 2007 | AGAGE |
| Cape Grim (Station 7) | Dec. 1981 – Dec. 1994 | GAGE |
| | Aug. 1993 – Sep. 2007 | AGAGE |
| Ragged Point (Station 8) | Aug. 1985 – Jun. 1996 | GAGE |
| | Jun. 1996 – Sep. 2007 | AGAGE |
| Cape Matatula (Station 9) | Jun. 1991 – Sep. 1996 | GAGE |
| | Aug. 1996 – Sep. 2007 | AGAGE |
| Trinidad Head (Station 10) | – | GAGE |
| | Oct. 1995 – Sep. 2007 | AGAGE |

Figure 2. CFC-11 profiles (in parts-per-trillion, ppt) of the eight stations (monthly mean data)

Method

We wish to unify information from each station to aid the understanding of the global and station-specific behavior of CFC-11 in the atmosphere. We will index the stations by i = 1, 2, …, 8 and the months from January of 1988 to August of 2000 by j = 1, 2, …, 152. Let t_ij denote the jth measurement occasion of the ith station. We model the CFC-11 measurement for the ith station at time t_ij, denoted by y_ij, by the relationship

y_ij = f(t_ij, θ_i) + ε_ij,    (1)

where θ_i is a vector of regression coefficients for the ith station, f(·) is a function of t_ij and θ_i characterizing the trend of the station-specific data over time, and ε_ij represents the random error component.

Recall that some stations do not have data for all 152 months. We employ the following system for defining t_ij. For example, the first and last months with recorded data by Station 3 are January of 1988 and August of 2000, respectively. Thus, t_3,1 = 1, t_3,2 = 2, …, t_3,152 = 152. In contrast, Station 2 had its first and last recordings in May of 1989 and April of 2000, respectively. Hence, t_2,1 = 17, t_2,2 = 18, …, t_2,132 = 148. The same approach is used to define the t_ij's for other i's. Note that a few y_ij's (from 1 to 5) were missing between the first and last months for a given station. We replaced them by observations from another data set (e.g., CATS or AGAGE) or by mean imputation based on neighboring time points. As noted by P. E. McKnight, A. J. Figueredo, and S. Sidani in Missing Data: A Gentle Introduction, if just a few missing values are replaced by the mean, the deleterious effect of mean substitution is reduced. So, we expect our findings to be minimally affected.

We would like an expression for f from (1) that not only describes the CFC-11 profiles as shown in Figure 2, but also gives useful information regarding the rates of change and the transition. Although a simple quadratic model such as f(t_ij, θ_i) = β_0i + β_1i t_ij + β_2i t_ij², where θ_i = (β_0i, β_1i, β_2i)′, might be appealing to characterize the overall convexity of the trend, it would not be expected to fit the observed data well in light of the apparent three phases: incoming, outgoing, and curved transition. G. Chiu, R. Lockhart, and R. Routledge developed the so-called bent-cable regression to characterize these three phases (see Figure 3) in their Journal of the American Statistical Association article, "Bent-Cable Regression Theory and Applications." The model is parsimonious in that it has only five regression coefficients, and is appealing due to the greatly interpretable regression coefficients. We will extend their model to account for the between-station heterogeneity as suggested by the different profiles in Figure 2.

Figure 3. The bent-cable function, f, comprising two linear segments (incoming and outgoing) joined by a quadratic bend over [τ_i − γ_i, τ_i + γ_i]. There are three linear parameters—β_0i, β_1i, and β_2i—and two transition parameters, γ_i and τ_i, with θ_i = (β_0i, β_1i, β_2i, γ_i, τ_i)′. The intercept and slope for the incoming phase are β_0i and β_1i, respectively; the slope for the outgoing phase is β_1i + β_2i; and the center and half-width of the bend are τ_i and γ_i, respectively. The critical time point is the point at which the response takes either an upturn from a decreasing trend or a downturn from an increasing trend.

Overall, the bent-cable function, f, represents a smooth trend over time. The random error components, the ε_ij's in (1), account for within-station variability in the measurements around the regression f. According to G. M. Fitzmaurice, N. M. Laird, and J. H. Ware in Applied Longitudinal Analysis, many standard assumptions for regression analyses do not hold for longitudinal data, including independence between all measurements. Instead, it is necessary to account for correlation between repeated measurements within the same unit (i.e., station) over time. Though, for some data, the unit-specific coefficients θ_i may adequately account for this correlation, there is often additional serial correlation remaining that can be accounted for by these ε_ij's. An order-1 autoregressive (AR(1)) serial correlation structure, being very parsimonious, may be well-suited for many types of longitudinal data. For our CFC-11 analysis, we assume an AR(1) structure with correlation parameter ρ = corr(ε_ij, ε_i,j±1) and variance parameters σ²_i = var(ε_ij).
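To make the pieces above concrete, here is a minimal sketch of a bent-cable mean function and its critical time point, written under the standard parameterization used in the bent-cable literature (the paper's exact formulation may differ in minor details). The numerical check at the end plugs in rounded values from the global fit reported later in the Results section.

```python
# Sketch of a bent-cable mean function: two linear phases joined by a
# quadratic bend of half-width gamma centered at tau.
# beta1 is the incoming slope; beta1 + beta2 is the outgoing slope.

def bent_cable(t, beta0, beta1, beta2, tau, gamma):
    if t < tau - gamma:            # incoming linear phase
        q = 0.0
    elif t <= tau + gamma:         # quadratic bend
        q = (t - tau + gamma) ** 2 / (4.0 * gamma)
    else:                          # outgoing linear phase
        q = t - tau
    return beta0 + beta1 * t + beta2 * q

def critical_time_point(beta1, beta2, tau, gamma):
    """Time at which the fitted trend turns over (derivative zero inside the bend)."""
    return tau - gamma - 2.0 * gamma * beta1 / beta2

# Rough check against the published global fit (months indexed with Jan. 1988 = 1):
# the transition Jan. 1989 - Sep. 1994 gives tau - gamma = 13 and tau + gamma = 81;
# incoming slope 0.65 and outgoing slope -0.12 give beta2 = -0.12 - 0.65 = -0.77.
ctp = critical_time_point(beta1=0.65, beta2=-0.77, tau=47.0, gamma=34.0)
print(round(ctp, 1))  # ~70.4, i.e., late 1993, consistent with the reported Nov. 1993 CTP
```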
To define a population level of CFC-11, we relate the station-specific coefficients θ_i to the population coefficient θ = (β_0, β_1, β_2, γ, τ)′ plus a random error component b_i. That is, θ_i = θ + b_i, where b_i is a 5 × 1 column vector. Under the assumption that b_i has mean 0 and a 5 × 5 covariance matrix D, we can conceptually regard stations as having their own regression models and the population coefficients as the average across stations. Then, the covariance matrix D provides information about the variability of the deviation of the station-specific coefficient θ_i from the population coefficient θ. As an extreme example, a zero variation of the deviation between β_1i and β_1 indicates the station-specific and global incoming slopes are identical. In other words, rates of change in the incoming phase are identical across stations. Statistical assumptions for the ε_ij and b_i added to equation (1) constitute our longitudinal bent-cable model. Details of these assumptions are available at www.stats.uwaterloo.ca/stats_navigation/techreports/09WorkingPapers/2009-04.pdf.

Statistical inference was carried out via a Bayesian approach. The main idea of Bayesian inference is to combine data and prior knowledge on a parameter to determine its posterior distribution (the conditional density of the parameter given the data). The prior knowledge is supplied in the form of a prior distribution of the parameter, which quantifies information about the parameter prior to any data being gathered. When little is reliably known about the parameter, it is reasonable to choose a fairly vague, minimally informative prior. For example, in analyzing the CFC-11 data, we assumed a multivariate normal prior for θ with mean 0 and a covariance matrix with very large variance parameters. This leads to a noninformative prior for θ, meaning the data will primarily dictate θ's resulting posterior distribution.

Any conclusions about the plausible value of the parameter are to be drawn from the posterior distribution. For example, the posterior mean (median if the posterior density is noticeably asymmetric) and variance can be considered an estimate of the parameter and the variance of the estimate, respectively. A 100(1 − 2p)% Bayesian interval, or credible interval, is [c_1, c_2], where c_1 and c_2 are the pth and (1 − p)th quantiles of the posterior density, respectively. This credible interval has a probabilistic interpretation. For example, for p = 0.025, the conditional probability that the parameter falls in the interval [c_1, c_2] given the data is 0.95.
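As an illustration of how such posterior summaries are computed in practice, here is a minimal sketch (not the authors' code) that takes a vector of posterior draws for a parameter and returns the posterior mean, median, and an equal-tailed credible interval; the simulated draws are purely hypothetical.

```python
# Minimal sketch: summarizing posterior draws of a scalar parameter with a
# point estimate and a 100*(1 - 2p)% equal-tailed credible interval.
import numpy as np

def posterior_summary(draws, p=0.025):
    draws = np.asarray(draws)
    lower, upper = np.quantile(draws, [p, 1.0 - p])
    return {"mean": draws.mean(),
            "median": np.median(draws),
            "credible_interval": (lower, upper)}

# Toy illustration with fake "posterior draws" for an outgoing slope:
rng = np.random.default_rng(1)
fake_draws = rng.normal(-0.12, 0.05, size=10_000)
print(posterior_summary(fake_draws))  # the interval has 95% posterior probability
```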
Results

Figure 4 presents the estimated global CFC-11 concentrations using longitudinal bent-cable regression. The global drop (gradual) in CFC-11 took place between approximately January of 1989 and September of 1994. Estimated incoming and outgoing slopes were 0.65 and -0.12, respectively. Thus, the average increase in CFC-11 was about 0.65 ppt for a one-month increase during the incoming phase (January of 1988 – December of 1988), and the average decrease was about 0.12 ppt during the outgoing phase (October of 1994 – August of 2000). The 95% credible intervals indicated significant slopes for the incoming/outgoing phases. Specifically, these intervals were (0.50, 0.80) and (-0.22, -0.01), respectively, for the two linear phases, neither interval including 0. The estimated population CTP was November of 1993, which implies that, on average, CFC-11 went from increasing to decreasing around this time. The corresponding 95% credible interval ranged from December of 1992 to November of 1994.

Figure 4. Global fit of the CFC-11 data using longitudinal bent-cable regression (incoming slope = 0.65; outgoing slope = -0.12; transition from January 1989 to September 1994; CTP = November 1993)

Figure 5. Station-specific fits (white) and population fit (gray) of the CFC-11 data at the eight stations (Pt. Barrow, Cape Matatula, Mauna Loa, South Pole, Niwot Ridge, Mace Head, Cape Grim, and Ragged Point), with estimated transition marked by vertical lines. The black line indicates observed data.
The station-specific fits and population fit are displayed in Figure 5. It shows the model fits the data well, with the observed data and individual fits agreeing closely. Table 2 summarizes the fits numerically. We see significant increase/decrease of CFC-11 in the incoming/outgoing phases for all the stations separately and globally. The rates at which these changes occurred (columns 2 and 3) agree closely for the stations. This phenomenon also was evident in the estimate of D, showing small variation of the deviations between the global and station-specific incoming/outgoing slope parameters (variance ≈ 0.02 for both cases). The findings support the notion of constant rates of increase and decrease, respectively, before and after the enforcement of the Montréal Protocol, observable despite a geographically spread-out detection network. They also point to the success of the
widespread adoption and implementation of the Montréal Protocol across the globe. However, the rate at which CFC-11 has been decreasing (about 0.12 ppt per month, globally) suggests it will remain in the atmosphere throughout the 21st century, should current conditions prevail. The transition periods and CTPs varied somewhat across stations (Table 2). This may be due to the extended CFC-11 phase-out schedules contained in the Montréal Protocol—1996 for developed countries and 2010 for developing countries. Thus, many countries at various geographical locations continued to contribute CFCs to the atmosphere during the 152 months in our study period, while those at other locations had stopped. Overall, the eight transitions began between September of 1988 and May of 1989, a period of only nine months. This reflects the
Table 2—Estimated Station-Specific and Global Concentrations of CFC-11

| Station (within-station variance σ²) | Incoming Slope (95% Credible Interval) | Outgoing Slope (95% Credible Interval) | Transition Period (Duration) | CTP (99% Credible Interval) |
|---|---|---|---|---|
| Global | 0.65 (0.50, 0.80) | -0.12 (-0.22, -0.01) | Jan. 1989 – Sep. 1994 (69 months) | Nov. 1993 (Aug. 1992 – May 1995) |
| Barrow (Station 1), σ² ≈ 2.97 | 0.55 (0.39, 0.72) | -0.19 (-0.24, -0.15) | Jan. 1989 – Aug. 1994 (68 months) | Mar. 1993 (July 1992 – Nov. 1993) |
| Cape Matatula (Station 2), σ² ≈ 1.01 | 0.74 (0.56, 0.94) | -0.10 (-0.13, -0.07) | May 1989 – Jan. 1995 (69 months) | May 1994 (Oct. 1993 – Feb. 1995) |
| Mauna Loa (Station 3), σ² ≈ 1.81 | 0.67 (0.52, 0.83) | -0.12 (-0.16, -0.09) | Mar. 1989 – Jun. 1994 (64 months) | Aug. 1993 (Dec. 1992 – May 1994) |
| South Pole (Station 4), σ² ≈ 0.30 | 0.60 (0.42, 0.77) | -0.12 (-0.15, -0.10) | Dec. 1988 – Nov. 1995 (84 months) | Sep. 1994 (April 1994 – March 1995) |
| Niwot Ridge (Station 5), σ² ≈ 0.82 | 0.56 (0.34, 0.79) | -0.11 (-0.13, -0.08) | Nov. 1988 – Jul. 1994 (69 months) | Aug. 1993 (Dec. 1992 – May 1994) |
| Mace Head (Station 6), σ² ≈ 1.20 | 0.59 (0.44, 0.74) | -0.11 (-0.13, -0.08) | Sep. 1988 – Jan. 1994 (65 months) | Mar. 1993 (July 1992 – Dec. 1993) |
| Cape Grim (Station 7), σ² ≈ 0.29 | 0.78 (0.68, 0.93) | -0.07 (-0.09, -0.06) | Mar. 1989 – Nov. 1994 (69 months) | Jun. 1994 (Jan. 1994 – Oct. 1994) |
| Ragged Point (Station 8), σ² ≈ 2.25 | 0.70 (0.55, 0.86) | -0.10 (-0.14, -0.07) | Jan. 1989 – Apr. 1994 (64 months) | Aug. 1993 (Nov. 1992 – June 1994) |
success and acceptability of the protocol across the globe. Durations of the transition periods are similar among stations, except for South Pole. Thus, it took almost the same amount of time in different parts of the world for CFC-11 to start dropping linearly, with an average rate of about 0.12 ppt per month. The last column of the table indicates the global estimate of the CTP is contained in all the station-specific 99% credible intervals, except for those of South Pole and Cape Grim, with the lower bound for South Pole coming three months later than that for Cape Grim, making South Pole a greater outlier. Specifically, the transition for South Pole was estimated to take place over 84 months, an extended
period compared to the other stations. This could be due to the highly unusual weather conditions specific to the location. CFCs are not disassociated during the long winter nights in the South Pole. Only when sunlight returns in October does ultraviolet light break the bond holding chlorine atoms to the CFC molecule. For this reason, it may be expected for CFCs to remain in the atmosphere over the South Pole for a longer period of time, and hence, an extended transition period. Indeed, our findings for South Pole are similar to those reported by S. D. Ghude, S. L. Jain, and B. C. Arya in their Current Science article, "Temporal Evolution of Measured Climate Forcing Agents at South Pole, Antarctica." To
evaluate the trend, the authors used the NASA EdGCM model—a deterministic global climate model wrapped in a graphical user interface. They found the average growth rate to be 9 ppt per year for 1983–1992 and about -1.4 ppt per year for 1993–2004, turning negative in the mid-1990s. With our statistical modeling approach, we estimated a linear growth rate of 0.6 ppt per month (7.2 ppt per year) prior to December of 1988, a transition between December of 1988 and November of 1995, and a negative linear phase (-0.12 ppt per month, or -1.44 ppt per year) after November of 1995. The eight estimates of within-station variability (σ²) are given in the first column of Table 2. We noticed earlier
from Figure 2 that Barrow measurements are more variable, whereas Cape Grim and South Pole show little variation over time. This is reflected in their within-station variance estimates of 2.97, 0.29, and 0.30, respectively. This also explains, at least partially, why the credible intervals of the South Pole and Cape Grim CTPs did not contain the estimate of the global CTP. As expected, we found a high estimated correlation between consecutive within-station error terms (AR(1) parameter ≈ 0.81 with 95% credible interval (0.77, 0.85)).

Conclusion

CFC-11 is a major source for the depletion of stratospheric ozone across the globe. Since the Montréal Protocol came into effect, a global decrease in CFC-11 has been observed, a finding confirmed by our analysis. Our analysis revealed a gradual, rather than abrupt, change, the latter of which is assumed by most standard changepoint models. This makes scientific sense due to CFC molecules' ability to stay in the upper atmosphere for about a century and because their breakdown does not take place instantaneously. The substantial decrease in global CFC-11 levels after the gradual change shown by our analysis suggests the Montréal Protocol can be regarded as a successful international agreement to reduce the negative effect of CFCs on the ozone layer. One possible extension of the proposed bent-cable model for longitudinal data is to incorporate individual-specific covariate(s) (e.g., effects of instrumentations specific to different stations in measuring the CFC-11 data) to see if they can partially explain the variations within and between the individual profiles. One may incorporate this by modeling the individual-specific coefficients θ_i to vary systematically with the covariate(s) plus a random component.

A Statistical Investigation of Urban Skylines in Illinois
Daniel Westreich

[Three "urban skyline" histograms: Normal, Illinois; Lognormal, Illinois; Uniform, Illinois]

Hint: There is, in fact, a town called Normal in Illinois.

Further Reading
Freeman, M. (2006). "A Visual Comparison of Normal and Paranormal Distributions." Journal of Epidemiology and Community Health, 60:6.

Further Reading
Carbon Dioxide Information Analysis Center. The ALE/GAGE/AGAGE Network. http://cdiac.esd.ornl.gov/ndps/alegage.html.

Chiu, G. and Lockhart, R. (2008). "Some Mathematical and Algorithmic Details for Bent-Cable Regression with AR(p) Noise." Working Paper Series No. 2006–07. Department of Statistics and Actuarial Science, University of Waterloo, Waterloo, Ontario. www.stats.uwaterloo.ca/stats_navigation/techreports/06WorkingPapers/2006-07.pdf.

Fitzmaurice, G.M., Laird, N.M., and Ware, J.H. (2004). Applied Longitudinal Analysis. Wiley: Hoboken, New Jersey.

Khan, S.A., Chiu, G., and Dubin, J.A. (2009). "Atmospheric Concentration of Chlorofluorocarbons: Addressing the Global Concern with the Longitudinal Bent-Cable Model." Working Paper Series No. 2009–04. Department of Statistics and Actuarial Science, University of Waterloo, Waterloo, Ontario. www.stats.uwaterloo.ca/stats_navigation/techreports/09WorkingPapers/2009-04.pdf.

McKnight, P.E., Figueredo, A.J., and Sidani, S. (2007). Missing Data: A Gentle Introduction. Guilford Press: New York, New York.

NASA. Ozone Hole Watch. Ozone Facts: What Is the Ozone Hole? http://ozonewatch.gsfc.nasa.gov/facts/hole.html.

NOAA/ESRL Global Monitoring Division. www.esrl.noaa.gov/gmd.

United Nations Environment Programme. (2008). Backgrounder: Basic Facts and Data on the Science and Politics of Ozone Protection. www.unep.org/Ozone/pdf/Press-Backgrounder.pdf.

University of Cambridge. "The Science of the Ozone Hole." www.atm.ch.cam.ac.uk/tour/part3.html.

U.S. Environmental Protection Agency. "Ozone Science: The Facts Behind the Phaseout." www.epa.gov/Ozone/science/sc_fact.html.

U.S. Environmental Protection Agency. Ozone Depletion Glossary. www.epa.gov/ozone/defns.html.

U.S. Environmental Protection Agency. Class I Ozone-Depleting Substances. www.epa.gov/ozone/science/ods/classone.html.
A Metric to Quantify College Football’s Topsy-Turvy Season Autar K. Kaw and Ali Yalcin
To garner the attention of their audiences, the news media, sports commentators, and bloggers all hope to have something to hype during college football season. In 2007, they had such a topic—one would be hard-pressed to recall a more topsy-turvy season, in which highly ranked teams lost regularly to low-ranked and unranked teams.
In the first week, Associated Press (AP) No. 5 team University of Michigan lost to Appalachian State University, an unranked Division-II team. AP wasted no time in booting Michigan out of the AP Top 25. Two weeks later, No. 11 UCLA lost to unranked Utah by a wide margin of 44–6 and also was dropped from the AP Top 25. The topsy-turvy season continued, especially for No. 2 ranked teams. The University of South Florida was ranked No. 2 when they lost to unranked Rutgers 30–27 in the eighth week. This was the same week in which South Carolina, Kentucky, and California—ranked in the top 10 of the AP poll—also lost their games. To top off the season, the title bowl team—Louisiana State University (LSU)—had two regular season losses and ended up winning the national championship, which was a first in the history of the Bowl Championship Series (BCS). Although many ranted and raved about the anecdotal evidence of a topsy-turvy season, is it possible the media and fans exaggerated the unreliability of the 2007 college football season? Were there other seasons that were more topsy-turvy? To answer this question scientifically, we propose a metric to quantify topsy-turvy. Two topsy-turvy (TT) factors are calculated: Week TT for each week of the season and Season TT, a cumulative factor for the end of each week of the season.
Week TT Factor

At the end of each college football week, the AP poll rankings are calculated by polling 65 sportswriters and broadcasters across the nation. Each voter supplies his or her ranking of the top 25 teams. The individual votes are added by giving 25 points to the first-place vote, 24 points to the second-place vote, etc. The addition of the points then produces the list of the AP Top 25 teams of the week.

The method to find the Week TT factor is based on comparing the AP Top 25 poll rankings of schools from the previous week to that of the current week. The difference in the rankings is squared, which allocates proportionately higher importance to bigger week-to-week changes for a given team. The formula for the Week TT factor is given by

Week TT factor = (S_k / 44.16) × 100,    (1)

where S_k is the square root of the sum of the square of the differences in rankings, given by

S_k = √( Σ_{i=1}^{25} (c_i − i)² ),    (2)

and c_i = current week ranking of the previous week's ith-ranked AP Top 25 team.

In equation (2), how do we account for teams that fall out of the AP Top 25 rankings? A team that gets unranked from the previous week is assumed to be the No. 26 team in the current week. In other words, c_i = 26 for any team i that gets unranked. Where does 44.16 come from in equation (1)? It is a normalization number, which is the mean of the lowest and highest possible value of S_k. The lowest possible value for S_k is for when all the rankings stay unchanged from the previous week. The numerical difference in the rankings between the current and previous week would be zero for all teams in this case, so the lowest possible value of S_k = 0. The highest possible value for S_k is obtained when the top 17 teams fall out of the top 25 and the 25–18-ranked teams are ranked 1–8, respectively. In this case, S_k = 88.32.

Figure 1. Box plot of Week TT factors for six seasons (2002–2007)

Figure 1 shows Week TT factors for the seasons between 2002 and 2007. Clearly, the 2003 and 2007 seasons emerge as the most topsy-turvy, while the 2004 season appears stable. Each week, there are 25 teams in the AP Top 25 and 25 changes in rank (some being zero). Figure 2 shows the absolute change in the 25 rankings of the 2004 and 2007 seasons. This further illustrates the topsy-turvy 2007 season and the stable 2004 season.
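As a concrete illustration of equations (1) and (2), here is a minimal Python sketch (not the authors' code) that computes the Week TT factor from two consecutive weekly rankings, treating teams that drop out of the Top 25 as ranked 26; the dictionary-based input format and the team names are assumptions for the example. The final check reproduces the 88.32 maximum described above.

```python
# Sketch of equations (1) and (2): rankings are {team: rank} dictionaries
# for the previous and current weeks; unranked teams are assigned rank 26.
import math

FALL_OUT_RANK = 26
NORMALIZER = 44.16  # mean of the lowest (0) and highest (88.32) possible S_k

def week_tt(previous, current):
    s_k = math.sqrt(sum((current.get(team, FALL_OUT_RANK) - rank) ** 2
                        for team, rank in previous.items()))
    return s_k / NORMALIZER * 100.0

# Extreme case from the text: the top 17 teams drop out and the teams
# previously ranked 25-18 move to ranks 1-8, giving S_k = 88.32.
prev = {f"team{i}": i for i in range(1, 26)}
curr = {f"team{i}": 26 - i for i in range(18, 26)}   # 25 -> 1, ..., 18 -> 8
s_max = math.sqrt(sum((curr.get(t, FALL_OUT_RANK) - r) ** 2 for t, r in prev.items()))
print(round(s_max, 2))  # 88.32
```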
Figure 2. Box plot of changes in rank for seasons 2004 and 2007

Season TT Factor

The Season TT factor is calculated at the end of each week using weighted averages of the Week TT factors. As the season progresses, the Week TT factors are given more weight in the calculation because the upset of a ranked team later in the season is more topsy-turvy than an upset in the beginning of the season, when the strength of a ranked team is less established. The weight given to each Week TT factor in the Season TT factor formula is equal to 1 + (week number of the season)/(number of weeks in the season). For example, the weight given to the Week TT factor in the fifth week of the 2007 season is 1 + 5/15 = 1.3333. The formula for the calculation of the Season TT factor at the end of the ith week is

(Season TT factor)_i = [ Σ_{j=1}^{i} (1 + j/n) × (Week TT factor)_j ] / [ i(2n + i + 1)/(2n) ],    (3)

where n = number of weeks in the full season.

Based on the Season TT factor formula, Figure 3 shows all the Season TT factors. Note that the 2004 season was mostly stable compared to the 2007 and 2003 seasons. On the other hand, the 2005 season—which was mostly "middle-of-the-way"—exhibited high topsy-turvy variability by the week. The higher weights given to the later weeks in the Season TT factor do not result in a bias in the calculation of the Season TT factor. End-of-season TT factors calculated with the above weightage and equal weightage differ by less than 3%.

Figure 3. Box plot of Season TT factors for six seasons (2002–2007)

Figure 4. Effect of fall-out-of-ranking number on end-of-season TT factor
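A minimal sketch of equation (3), assuming the Week TT factors for weeks 1 through i are already available in a list (the function and variable names are illustrative):

```python
# Sketch of equation (3): a weighted average of the Week TT factors through
# week i, with week j weighted by 1 + j/n (n = number of weeks in the season).

def season_tt(week_tt_factors, n):
    """Season TT factor at the end of week i = len(week_tt_factors)."""
    i = len(week_tt_factors)
    weighted = sum((1.0 + j / n) * tt
                   for j, tt in enumerate(week_tt_factors, start=1))
    return weighted / (i * (2 * n + i + 1) / (2.0 * n))  # sum of the weights

# The weight on week 5 of a 15-week season is 1 + 5/15 = 1.3333, as in the text.
print(1 + 5 / 15)
```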
Fall-Out-of-Ranking Number

For teams falling out of the rankings, we used a rank number of 26 to calculate the TT factors. Will using some number other than 26 for the "fall-out-of-ranking" number result in a conclusion other than topsy-turvy? The rankings of the teams could be extended beyond 25 by using the votes received by the unranked teams. However, this approach would suffer from the following drawbacks:

1. Not all teams that get unranked get votes in the current week
2. A team getting one or two votes is not a measure of a true ranking
3. Low-ranked teams falling out of the rankings do not warrant the same weightage as the high-ranked teams getting unranked, because topsy-turviness is determined by the fall of the high-ranked teams

We conducted a sensitivity analysis of the fall-out-of-ranking number. First, a suitable range for the fall-out-of-ranking number needed to be found. We considered the votes received by teams that fall out of ranking and used them to give a ranking of over 25. For a topsy-turvy season such as 2007, the average fall in the ranking of the teams falling out of the AP Top 25 was 12.1 (standard deviation 5.2). For the same season, the average fall in rankings by using the rank number of 26 for teams falling out of the AP Top 25 is 6.7 (standard deviation 6.3). Based on this, we chose a range of 26 to 42 (26 + difference in average fall in rank + two times the standard deviations = 26 + (12.1 − 6.7) + 2 × 5.2 ≈ 42) for the fall-out-of-rank number. The end-of-season TT factors show the same trend across the seasons (Figure 4). Note that a direct comparison cannot
be made between the values of the TT factors obtained for each fall-out-of-rank number, as both the numerator and denominator of equation (1) change accordingly.
Other Metrics

Another way to quantify the college football season as topsy-turvy is to find the percentage of weeks in a season the Week TT factor is high. To do so, we calculated the average and standard deviation of all the Week TT factors for the past six seasons (2002–2007). For the Week TT factor, the average turns out to be 42.1, while the standard deviation is 12.4. If the Week TT factor is the average plus one standard deviation (that is, 42.1 + 12.4 = 54.5) or more, we consider it a measure of a topsy-turvy week. If the Week TT factor is the average Week TT factor less one standard deviation (that is, 42.1 − 12.4 = 29.7) or less, it is a measure of a stable week.
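A tiny sketch of the classification rule just described, using the reported six-season mean and standard deviation (the function name is illustrative):

```python
# Classify a week using the six-season mean and standard deviation of the
# Week TT factor reported in the text.
MEAN_TT, SD_TT = 42.1, 12.4

def classify_week(week_tt):
    if week_tt >= MEAN_TT + SD_TT:
        return "topsy-turvy"
    if week_tt <= MEAN_TT - SD_TT:
        return "stable"
    return "neither"

print(MEAN_TT + SD_TT, MEAN_TT - SD_TT)  # cutoffs of 54.5 and 29.7
```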
Figure 5. Percentage of stable and topsy-turvy weeks, along with the end-of-season TT factor, for each season (2002–2007)
Figure 5 shows the percentage of weeks for each of the last six seasons that were topsy-turvy and stable. Also shown are the end-of-season TT factors. These results agree with the previous assessment, where the 2003 and 2007 seasons are topsy-turvy and the 2004 season is stable. In the 2007 season, no week fell into the category of a stable week, while 33% of the weeks were topsy-turvy. In the 2004 season, by contrast, 33% of the weeks were stable and only 7% were topsy-turvy.
TT Factor Based on Other Polls

Would using ranking polls other than the AP Top 25 give different results? For this, we considered the USA Today poll rankings, calculated by polling the USA Today board of 63 Division 1-A head coaches. Each voter supplies his or her ranking of the top 25 teams. The individual votes are added by giving 25 points to the first-place vote, 24 points to the second-place vote, etc. Adding the points then produces the list of top 25 teams for the week. Figure 6 compares the Week TT factors obtained from the AP and USA Today polls for the 2004 and 2007 seasons. Although the Week TT factors based on the AP and USA Today polls
differ slightly for a few weeks, both polls give similar trends. Table 1 shows the end-of-season TT factors obtained using the AP and USA Today polls for all six seasons. The maximum difference between them is less than 5%.
Other Measures of Disarray

How does the TT factor or the percentage weeks of high TT factor compare with other common measures of disarray, such as the normalized Kendall's tau distance created by Maurice Kendall or Charles Spearman's rank correlation coefficient? The normalized Kendall tau distance, K, is a measure of discordant pairs between two sets (rankings from two consecutive weeks). The distance, K, varies between 0 and 1, where 0 represents identical and 1 represents total disagreement in rankings. The trend of the number (1 − K) for each week through the season is similar to
Figure 6. Week TT factors from the AP and USA Today polls for the 2004 and 2007 seasons
Table 1—End-of-Season TT Factors from the AP and USA Today Polls

| Season | AP Poll | USA Today Poll |
|---|---|---|
| 2002 | 41 | 41 |
| 2003 | 47 | 45 |
| 2004 | 33 | 34 |
| 2005 | 40 | 40 |
| 2006 | 38 | 39 |
| 2007 | 50 | 50 |
Figure 7. Weekly TT autocorrelation for six seasons (2002–2007)
the Week TT factors, but the distinctness between the seasons is not as clear to differentiate between a topsy-turvy and stable season. The Spearman's rank correlation coefficient, ρ, is a measure based on the square of the difference between the rankings of the two sets. The coefficient ρ varies from −1 to 1, where −1 represents total disagreement and 1 represents identical ranking. The trend of the number (1 − ρ) for each week through the season is similar to the Week TT factors. However, the TT factor presents a more appropriate measure of a topsy-turvy season because of the following:

1. The teams that get unranked still get a rank of 25 or less in the formula for ρ, thus introducing a bias that becomes larger in weeks where a significant number of teams fall out of rankings. For example, if four teams get unranked in a particular week, they all are assigned a rank of 23.5 [= (22 + 23 + 24 + 25)/4].
2. If a low-ranked team loses and is out of the Top 25 ranking, it may get a higher rank in the formula for ρ. For example, if four teams get unranked in a particular week and one of the teams was ranked 25 in the previous week, it will be assigned a higher rank of 23.5 [= (22 + 23 + 24 + 25)/4].
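For comparison, here is a minimal sketch (not from the article) of the two alternative measures applied to a pair of consecutive weekly rankings: the normalized Kendall tau distance K and Spearman's ρ computed from the closed-form rank-difference formula, which assumes no ties. The small shuffled ranking used to exercise it is hypothetical, and note the article's own handling of unranked teams differs, as described in the points above.

```python
# Sketch of the normalized Kendall tau distance and Spearman's rho for two
# weekly rankings given as {team: rank} dictionaries over the same teams.
from itertools import combinations

def normalized_kendall_distance(prev_ranks, curr_ranks):
    """Share of team pairs ordered differently in the two weeks."""
    teams = list(prev_ranks)
    discordant = sum(
        (prev_ranks[a] - prev_ranks[b]) * (curr_ranks[a] - curr_ranks[b]) < 0
        for a, b in combinations(teams, 2))
    n = len(teams)
    return discordant / (n * (n - 1) / 2)

def spearman_rho(prev_ranks, curr_ranks):
    """Spearman's rank correlation for two full rankings with no ties."""
    n = len(prev_ranks)
    d2 = sum((prev_ranks[t] - curr_ranks[t]) ** 2 for t in prev_ranks)
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

prev = {f"team{i}": i for i in range(1, 26)}
curr = dict(prev, team2=5, team3=2, team4=3, team5=4)  # hypothetical reshuffle of ranks 2-5
print(1 - normalized_kendall_distance(prev, curr),      # "disarray" versions
      1 - spearman_rho(prev, curr))                      # discussed in the text
```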
Is a Topsy-Turvy Season Random?

To determine the degree of randomness in weekly topsy-turviness, we calculated the lag one autocorrelation coefficient r for each season between 2002 and 2007. The results are shown in Figure 7. The autocorrelation factors for all seasons ranged between ±0.4, which does not indicate the presence of
Table 2—Analysis of Variance for the Weekly TT Factors

| Source of Variation | Sum of Squares | Degrees of Freedom | Mean Square | F0 | P-value |
|---|---|---|---|---|---|
| Weeks | 3562.3 | 14 | 254.4 | 2.6 | 0.0044 |
| Seasons | 2936.5 | 5 | 587.3 | 5.9 | 0.0001 |
| Errors | 6914.4 | 70 | 98.8 | | |
| Total | 13413.1 | 89 | | | |

Note: For the 2002 season, the first week's results were not included in the analysis. All other seasons have data for 15 weeks.
nonrandomness throughout the weekly topsy-turviness. Are the final Week (postseason) TT factors statistically different from those of the regular season? Inspection of the results did not reveal any significant difference between final Week TT factors and those of other weeks. This can be attributed to the number of college bowl games in the postseason. In the 2007 season, 64 teams played in the college bowls, which included 39 prebowl unranked teams. Sixteen bowls were played between unranked teams, seven matched a ranked and unranked team, and only nine had ranked teams face each other. With match-ups such as that, the bowl games are seemingly like any other regular season week, except that more highly and closely ranked teams play each other. To answer conclusively the question of whether any of the weeks in the season tend to be more or less topsy-turvy than others, we conducted an analysis of variance of a topsy-turvy season based on a randomized complete block design in which the weeks were treatments and the seasons were blocks. The resulting ANOVA table is shown in Table 2. As expected, the seasons
showed a significant difference in mean weekly TT factors. The analysis also indicated there was a significant difference in the mean TT factors across weeks. Using John Tukey's test did not reveal any significant trends, but did indicate that week 14 was significantly less topsy-turvy than week 6. Though it was not statistically significant, the last week of the season (week 14) persistently had much lower Week TT factor scores than the rest of the weeks. This is attributed to the pre-bowl week involving conference championships (mostly a match-up between high-ranked teams) and many top-ranked teams (44% in 2007) having finished their regular season the week before.

Note: In addition to the Top 25 teams, other teams that get votes are listed in each weekly AP Poll. We used the votes received by the teams to rank them beyond 25. If the previously ranked team received no votes in the next week, we ranked them as the team after the last team that received a vote.
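A sketch of how such a randomized complete block analysis could be run with standard Python tooling, assuming a long-format table of Week TT factors with columns season, week, and tt (the file name is hypothetical, and Tukey's HSD on weeks is shown as one common follow-up rather than the authors' exact procedure):

```python
# Sketch of the randomized complete block analysis: weeks as treatments and
# seasons as blocks, applied to a long-format table of Week TT factors.
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.multicomp import pairwise_tukeyhsd

df = pd.read_csv("weekly_tt.csv")          # hypothetical file: season, week, tt
model = ols("tt ~ C(week) + C(season)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))     # F tests for weeks and seasons

# One common follow-up: Tukey's HSD comparing mean TT factors across weeks.
print(pairwise_tukeyhsd(df["tt"], df["week"]))
```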
Conclusions

Based on ranking change of college football teams from week to week in AP polls, a metric to measure the topsy-turviness of college football weeks has been developed. Six recent seasons (2002–2007) were used in the analysis. The 2007 season turned out to be the most topsy-turvy, while the 2004 season was the most stable. These findings were confirmed with other measurements, such as change in ranking from week to week and number of weeks deemed topsy-turvy or stable based on the average and standard deviation of all TT factor numbers. Other polls, such as the USA Today poll, resulted in similar trends. In-depth statistical analysis of the weekly TT factors did not indicate the presence of nonrandomness throughout the weekly topsy-turvy season. When
using a randomized complete block design, statistically significant differences were detected in the mean TT factors across weeks and seasons. However, no significant trends were found in the TT factors, except for in the last week before the bowls are played. The methodology described here to determine a topsy-turvy season can be used for other sports that are ranked throughout the season, such as college basketball and baseball.
Further Reading

Associated Press. (2007). "BCS Coordinator Says Increased Parity Won't Encourage BCS Change." November 28, http://sports.espn.go.com/ncf/news/story?id=3132578.

Bagnato, Andrew. (2007). "In Topsy-Turvy Season, 6 Teams Remain Perfect—So Far." USA Today, October 16, www.usatoday.com/sports/college/football/2007-10-16-90477001_x.htm.

Box, G. E. P. and Jenkins, G. (1976). Time Series Analysis: Forecasting and Control. Holden-Day.

ESPN-NCAA College Football Polls, College Football Rankings, NCAA Football Poll, http://sports.espn.go.com/ncf/rankingsindex.

Montgomery, D.C. (2001). Design and Analysis of Experiments, 5th ed. New York: John Wiley and Sons.

Thamel, Pete. (2007). "Topsy-Turvy Season Faces More Flips." The New York Times, October 15, www.nytimes.com/2007/10/15/sports/ncaafootball/15colleges.html.

TT Factor for College Football, www.eng.usf.edu/~kaw/ttfactor.
Gambling Your Way to New York: A Story of Expectation Tristan Barnett
InterCasino, an online gambling web site at www.intercasino.com, recently offered a free flight to New York as a promotional offer to new customers. From Australia, the value of the flight was about $2,000. To qualify, players had to complete their wagering requirements for one of the online games. Assuming the player was happy with the terms and conditions of the offer, two obvious questions are: Is the offer worthwhile? Which is the best game to play? To answer these questions, we need probability theory, details concerning the games, and clear criteria on which to base our answers.
Rules of the Promotion

The following conditions were posted on the InterCasino web site:

1. Deposit $100 or more using any of our payment methods from March 1–31, 2008
2. We will then match the first $100 of this deposit, giving you another $100 in your account (you may need to click refresh to see your balance update)
3. All you need to do then is fulfill the bonus wagering requirement by wagering $5,000 on any of the games below before midnight on March 31, 2008
4. Deposit $100 or more using any of our payment methods between April 1–30, 2008
5. We will then match the first $100 of this deposit, giving you another $100 in your account (you may need to click refresh to see your balance update)
6. All you need to do then is fulfill the bonus wagering requirement by wagering $5,000 on any of the games below before midnight on April 30, 2008

Games that qualify for the promotional offer are 3 Card Stud Poker, Let It Ride Poker, Casino Stud Poker, Pai Gow Poker, Casino Solitaire, Casino War, Red Dog, Bejeweled, Cubis, Sudoku, Keno, Lottery, and Scratchcards.
Analysis of Casino Games and Percent House Margin

The exact rules of play vary by game, but the general situation can be represented abstractly. There is an initial cost C to play the game. Each trial results in an outcome O_i, where each outcome occurs with profit x_i and probability p_i. A profit of zero means the money paid to the player for a particular outcome equals the initial cost. Profits above zero represent a gain for the player; negative profits represent a loss. The probabilities are all non-negative and sum to one over all the possible outcomes. Given this information, the expected profit from outcome O_i is E_i = p_i x_i, and the total expected profit is ΣE_i. The percent house margin (%HM) is then −ΣE_i / C.
Table 1—Representation in Terms of Expected Profit of a Casino Game with K Possible Outcomes

| Outcome | Profit | Probability | Expected Profit |
|---|---|---|---|
| O_1 | x_1 | p_1 | E_1 = p_1 x_1 |
| O_2 | x_2 | p_2 | E_2 = p_2 x_2 |
| O_3 | x_3 | p_3 | E_3 = p_3 x_3 |
| … | … | … | … |
| O_K | x_K | p_K | E_K = p_K x_K |
| Total | | 1.0 | ΣE_i |
If N consecutive bets are made, then the total profit, T, is a random variable: T = X1 + X2 + … + XN, where Xi is the profit on the ith bet. We assume the variables Xi are independently and identically distributed. That is, we assume the probability distribution is the same each time we play and consecutive plays have no impact on each other.

The parameters of T are directly related to the parameters of X:

Mean: μT = N μX
Standard deviation: σT = √N σX
Coefficient of skewness: γT = γX / √N
Coefficient of excess kurtosis: κT = κX / N

Here, μX, σX, γX, and κX are the mean, standard deviation, skewness, and excess kurtosis of the variable X. In general, excess kurtosis = kurtosis − 3, so the normal distribution has an excess kurtosis of 0 and therefore a kurtosis of 3. These parameters are used in a normal approximation to a standardized version of the random variable T. We'll use this normal approximation to compute probabilities associated with the online casino games. A discussion of moments and parameters of probability distributions is presented in "An Aside on Moments of a Probability Distribution."

Let Z be the standardized variable Z = (T − μT)/σT. Variable Z has mean 0 and variance 1, the same as the standard normal distribution. The cumulative distribution function (CDF) of Z can be approximated by the normal distribution (i.e., Prob(Z ≤ z) = F(z) ≈ Φ(z), where Φ(·) is the cumulative normal distribution). This simple approximation will give a poor fit to the tails of the distribution in most cases, because Z may be skewed and have a nonzero excess kurtosis. A better approximation is the normal power approximation, which can be expressed as F(z) ≈ Φ(y), where

y = z − (γT/6)(z² − 1) + [(γT²/36)(4z³ − 7z) − (κT/24)(z³ − 3z)].

Using y instead of z in the cumulative normal distribution provides improved accuracy when the distribution has skewness and excess kurtosis different from those of a standard normal distribution. Failure to recognize a skewed distribution for the outcome is likely to result in underestimating the chance of ruin, given finite capital. We use the normal power approximation despite a warning that it is known to fail when γT > 2 or κT > 6.
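To make the normal power correction concrete, here is a minimal Python sketch (our illustration, not the author's code) that implements the adjustment exactly as written above and compares it with the plain normal approximation. The skewness and excess kurtosis values in the demo are invented for illustration.

```python
from statistics import NormalDist

def np_cdf(z, skew, ex_kurt):
    """Normal power approximation to F(z) = P(Z <= z) for a standardized
    total profit with the given skewness and excess kurtosis."""
    y = (z
         - skew / 6 * (z**2 - 1)
         + skew**2 / 36 * (4 * z**3 - 7 * z)
         - ex_kurt / 24 * (z**3 - 3 * z))
    return NormalDist().cdf(y)

# Illustrative values only: a mildly right-skewed total profit distribution.
z, skew, ex_kurt = -1.0, 0.5, 0.3
print(round(NormalDist().cdf(z), 4))       # plain normal approximation
print(round(np_cdf(z, skew, ex_kurt), 4))  # normal power approximation
```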
The Casino Games
The Wizard of Odds at www.wizardofodds.com gives the rules and %HM for various online casino games, represented in Table 2, by assuming optimal strategies where applicable. Note that the house margins are unknown for Casino Solitaire, Bejeweled, Cubis, Sudoku, Lottery, and Scratchcards. Therefore, we choose to ignore these games when determining the best game to play.
Table 2—InterCasino Games That Qualify for the Promotional Offer

Game | House Margin | Minimum Bet
3 Card Stud Poker | 3.37% (ante bet), 2.32% (pairplus bet) | $5 / $5
Let It Ride Poker | 3.51% | $6
Casino Stud Poker | 5.22% | $2
Pai Gow Poker | 2.86% | $5
Casino War | 2.88% (go to war), 3.70% (surrender) | $2 / $2
Red Dog | 3.16% | $5
Keno | 25–29% | $0.25
Table 3—Expected Profits of 3 Card Stud Poker with the Minimum $5 Bet

Outcome | Profit ($) | Probability | Expected Profit ($)
Straight Flush | 200 | 12/5525 | 0.434
Three of a Kind | 150 | 1/425 | 0.353
Straight | 30 | 36/1105 | 0.977
Flush | 20 | 274/5525 | 0.992
Pair | 5 | 72/425 | 0.847
All Other | -5 | 822/1105 | -3.719
Total |  | 1 | -0.116
Table 2 indicates that the pairplus bet in 3 Card Stud Poker has the lowest %HM of 2.32%, but requires a minimum bet of $5. Therefore, 3 Card Stud Poker requires at most 2,000 trials to turn over $10,000. The go to war option in Casino War has a %HM of 2.88% with a minimum bet of only $2. Therefore, Casino War requires at most 5,000 trials to turn over $10,000. Although Pai Gow Poker has a %HM of 2.86%, we choose to ignore this game due to the complex strategies required to achieve this margin. Based on this information, the analysis to follow will focus on 3 Card Stud Poker (pairplus bet) and Casino War (go to war option).

Tables 3 and 4 give the outcomes, profits, probabilities, and expected profits for 3 Card Stud Poker and Casino War, respectively. The minimum bets for each game have been applied. Note that in Table 4, x = 12(24/310)(23/309) + (22/310)(21/309) and y = (1 − x)/2. The calculations for Table 4 were obtained as follows: 23/311 is the probability of the dealer and player obtaining cards of the same rank. The probability of the dealer having a higher card rank than the player, and vice versa, is then (1 − 23/311)/2 = 144/311. The probability of the dealer and player again obtaining cards of the same rank, given their initial cards were of the same rank, is x. The player wins the hand (given the player's and dealer's initial cards were of the same rank) by drawing a card with the same or higher rank than the dealer. This occurs with probability 23/311 × (x + y). The player loses the hand (given the player's and dealer's initial cards were of the same rank) by drawing a card with a lower rank than the dealer. This occurs with probability 23/311 × y.
An Aside on Moments of a Probability Distribution

Central to the computations in this article is the normal approximation to the distribution of the profits from playing an online casino game many times. The mean, variance, skewness, and excess kurtosis of the distribution of profits all play a part in the approximation. As in the article, assume there is an initial cost, C, to play the game. Each trial results in an outcome Oi, which occurs with probability pi and yields profit xi. The probabilities are all non-negative and sum to one over all the possible outcomes. Given this information, the total expected profit is E = Σi pixi. The percent house margin (%HM) is then −E/C. Table 1 in the article summarizes this information when there are K possible outcomes.

Moments and Cumulants from a Single Bet
The outcome, or profit, from a single bet, X, is a random variable. From probability theory, the moment generating function (MGF) of X is

MX(t) = E(exp(Xt)) = 1 + m1X t + m2X t²/2! + m3X t³/3! + m4X t⁴/4! + …,

where mrX represents the rth moment of X. The moments of X are readily calculated using the following:

m1X = Σi pixi
m2X = Σi pixi²
m3X = Σi pixi³
m4X = Σi pixi⁴

The calculation of these moments is illustrated below:

Outcome | Profit | Probability | 1st Moment | 2nd Moment | 3rd Moment | 4th Moment
O1 | x1 | p1 | p1x1 | p1x1² | p1x1³ | p1x1⁴
O2 | x2 | p2 | p2x2 | p2x2² | p2x2³ | p2x2⁴
O3 | x3 | p3 | p3x3 | p3x3² | p3x3³ | p3x3⁴
… | … | … | … | … | … | …
OK | xK | pK | pKxK | pKxK² | pKxK³ | pKxK⁴
Total |  | 1.0 | m1X = Σi pixi | m2X = Σi pixi² | m3X = Σi pixi³ | m4X = Σi pixi⁴

The cumulant generating function (CGF) of X is the natural logarithm of the MGF:

KX(t) = loge(MX(t)) = k1X t + k2X t²/2! + k3X t³/3! + k4X t⁴/4! + …,

where krX represents the rth cumulant of X. The relationship between the first four cumulants and moments is given by the following:

k1X = m1X
k2X = m2X − m1X²
k3X = m3X − 3m2Xm1X + 2m1X³
k4X = m4X − 4m3Xm1X − 3m2X² + 12m2Xm1X² − 6m1X⁴

These cumulants can be used to calculate the following familiar distributional characteristics (parameters) for X:

Mean: μX = k1X
Standard deviation: σX = √k2X
Coefficient of skewness: γX = k3X / (k2X)^(3/2)
Coefficient of excess kurtosis: κX = k4X / (k2X)²
Moments and Cumulants After N Bets
If N consecutive bets are made, the total profit, T, is a random variable: T = X1 + X2 + … + XN, where Xi is the profit on the ith bet. We assume the variables Xi are independently and identically distributed. When the number of outcomes in a single bet is two (win or lose), the binomial formula can be used to calculate the distribution of profits after N bets. Alternatively, the normal approximation to the binomial distribution can be applied when N is large. When there are more than two outcomes in a single bet, we can use MGFs, CGFs, and a different normal approximation formula. Assuming the outcome from each bet is independent of the others, probability theory tells us the MGF of the random variable T is the product of the MGFs of the Xi's:

MT(t) = E(exp((X1 + X2 + … + XN)t)) = E(exp(X1t)) E(exp(X2t)) … E(exp(XNt)) = MX1(t) MX2(t) … MXN(t).

If the bets are all on the same game and of the same size, the distribution of the profit from each bet is identical and we obtain an important simplification: MT(t) = [MX(t)]^N. Taking logarithms, we obtain a relationship between the CGFs: KT(t) = N KX(t). This relationship can be expressed in terms of the individual cumulants: krT = N krX for all r ≥ 1. Thus, the cumulants of the total profit after N bets of the same size on a single game can be computed directly from the cumulants of the profit for a single bet. Likewise, the parameters of T are directly related to the parameters of X:

Mean: μT = N μX
Standard deviation: σT = √N σX
Coefficient of skewness: γT = γX / √N
Coefficient of excess kurtosis: κT = κX / N
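As a worked example of the formulas in this aside, the following Python sketch (our illustration, not the author's code) computes the four parameters of a single bet directly from an outcome table. Applied to the $5 pairplus outcomes listed in Table 3, it should return values close to the single-trial characteristics reported for 3 Card Stud Poker in Table 5.

```python
from math import sqrt

def bet_parameters(outcomes):
    """Mean, standard deviation, skewness, and excess kurtosis of the
    profit on one bet, via raw moments and cumulants as described above.
    `outcomes` is a list of (profit, probability) pairs."""
    m = [sum(p * x**r for x, p in outcomes) for r in range(1, 5)]   # m1..m4
    k1 = m[0]
    k2 = m[1] - m[0]**2
    k3 = m[2] - 3*m[1]*m[0] + 2*m[0]**3
    k4 = m[3] - 4*m[2]*m[0] - 3*m[1]**2 + 12*m[1]*m[0]**2 - 6*m[0]**4
    return k1, sqrt(k2), k3 / k2**1.5, k4 / k2**2

# Outcomes of the $5 pairplus bet in 3 Card Stud Poker (Table 3).
pairplus = [(200, 12/5525), (150, 1/425), (30, 36/1105),
            (20, 274/5525), (5, 72/425), (-5, 822/1105)]
print([round(v, 2) for v in bet_parameters(pairplus)])
# Should be close to -0.12, 14.55, 8.63, and 102.08 (Table 5).
```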
Table 4—Expected Profits of Casino War with the Minimum $2 Bet

Outcome | Profit ($) | Probability | Expected Profit ($)
Highest Card | 2 | 144/311 | 0.926
Lowest Card | -2 | 144/311 | -0.926
Go to War and Win | 2 | (23/311)(x + y) | 0.079
Go to War and Lose | -4 | (23/311)y | -0.137
Total |  | 1 | -0.058
Table 5—Characteristics of the Amount of Profit After One Trial for 3 Card Stud Poker and Casino War

Game | Mean | Standard Deviation | Coefficient of Skewness | Coefficient of Excess Kurtosis
3 Card Stud Poker | -$0.12 | $14.55 | 8.63 | 102.08
Casino War | -$0.06 | $2.10 | -0.12 | -1.77
By comparing tables 3 and 4, it is evident that the variance of profits and higher-order moments such as skewness are significantly greater for 3 Card Stud Poker. This is a result of the minimum bet being greater for 3 Card Stud Poker and the relatively low probabilities in 3 Card Stud Poker of obtaining the outcomes of a straight flush and three of a kind.
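The arithmetic behind Table 4 is easy to verify. The short sketch below, which assumes the six-deck shoe implied by the 23/311 figure, reproduces the expected profit per $2 bet and the resulting house margin; it is a check of the numbers above, not the author's code.

```python
# Casino War with the minimum $2 bet (Table 4).
p_tie = 23 / 311                    # dealer's card matches the player's rank
p_high = p_low = (1 - p_tie) / 2    # 144/311 each

# Given an initial tie, probability that the two war cards also tie:
x = 12 * (24 / 310) * (23 / 309) + (22 / 310) * (21 / 309)
y = (1 - x) / 2                     # war card strictly higher (or strictly lower)

outcomes = [
    ( 2, p_high),                   # highest card: win $2
    (-2, p_low),                    # lowest card: lose $2
    ( 2, p_tie * (x + y)),          # go to war and win (tie again or higher war card)
    (-4, p_tie * y),                # go to war and lose
]
expected = sum(profit * prob for profit, prob in outcomes)
print(round(expected, 3))             # about -0.058, as in Table 4
print(round(-expected / 2 * 100, 2))  # house margin of roughly 2.9%
```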
Criteria and Results
The game the player chooses will depend on the player's attitude toward risk and whether depositing more than the minimum $200 into an account is an option. A professional gambler, for example, may be more inclined to play 3 Card Stud Poker, as it has a lower %HM. An individual who gambles only to take advantage of these promotional offers, however, may be more inclined to play Casino War. Remember that there are additional costs in depositing more money through processing, withdrawals, and currency conversions. For an individual to make a judgement about which game he or she would prefer to play (if any), we will investigate two scenarios:

Scenario 1: Looks at the end result only by allowing the player to deposit extra funds if necessary. Only one of the two games will be played to turn over the $10,000.

Scenario 2: Attempts to minimize the probability of having to deposit more money while playing. Only one of the two games will be played to turn over the $10,000.

Characteristics of the amount of profit
The equations discussed above are applied to calculate the mean, standard deviation, and coefficients of skewness and excess kurtosis of the amount of profit after one trial; the results are given in Table 5. The characteristics of the amount of profit after N trials for 3 Card Stud Poker with a minimum $5 bet are given in Table 6. Similarly, the characteristics of the amount of profit after N trials for Casino War with a minimum $2 bet are given in Table 7.
Table 6—Characteristics of the Amount of Profit After N Trials for 3 Card Stud Poker with a $5 Minimum Bet

Turnover | Trials | Mean | Standard Deviation | Coefficient of Skewness | Coefficient of Excess Kurtosis
$1,000 | 200 | -$23.17 | $205.81 | 0.61 | 0.51
$2,000 | 400 | -$46.33 | $291.06 | 0.43 | 0.26
$3,000 | 600 | -$69.50 | $356.48 | 0.35 | 0.17
$4,000 | 800 | -$92.67 | $411.63 | 0.31 | 0.13
$5,000 | 1,000 | -$115.84 | $460.21 | 0.27 | 0.10
$6,000 | 1,200 | -$139.00 | $504.14 | 0.25 | 0.09
$7,000 | 1,400 | -$162.17 | $544.53 | 0.23 | 0.07
$8,000 | 1,600 | -$185.34 | $582.13 | 0.22 | 0.06
$9,000 | 1,800 | -$208.51 | $617.44 | 0.20 | 0.06
$10,000 | 2,000 | -$231.67 | $650.84 | 0.19 | 0.05
Table 7—Characteristics of the Amount of Profit After N Trials for Casino War with a $2 Minimum Bet

Turnover | Trials | Mean | Standard Deviation | Coefficient of Skewness | Coefficient of Excess Kurtosis
$1,000 | 500 | -$28.77 | $46.94 | -0.01 | 0.00
$2,000 | 1,000 | -$57.54 | $66.39 | 0.00 | 0.00
$3,000 | 1,500 | -$86.31 | $81.31 | 0.00 | 0.00
$4,000 | 2,000 | -$115.09 | $93.89 | 0.00 | 0.00
$5,000 | 2,500 | -$143.86 | $104.97 | 0.00 | 0.00
$6,000 | 3,000 | -$172.63 | $114.99 | 0.00 | 0.00
$7,000 | 3,500 | -$201.40 | $124.20 | 0.00 | 0.00
$8,000 | 4,000 | -$230.17 | $132.78 | 0.00 | 0.00
$9,000 | 4,500 | -$258.94 | $140.83 | 0.00 | 0.00
$10,000 | 5,000 | -$287.71 | $148.45 | 0.00 | 0.00
As expected, the standard deviation and coefficients of skewness and excess kurtosis are significantly greater for 3 Card Stud Poker. The bonus $200 has not been included in the calculations used to generate tables 6 and 7. Given this $200 bonus, 3 Card Stud Poker and Casino War would have total expected profits of -$31.67 and -$87.71, respectively. Note that the standard deviation, coefficient of skewness, and coefficient of excess kurtosis remain unchanged when the $200 bonus is included. The coefficient of variation is given by the standard deviation divided by the mean, and therefore would change with the $200 bonus.
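The entries of tables 6 and 7 follow from the scaling relations μT = NμX, σT = √N σX, γT = γX/√N, and κT = κX/N. A quick Python check (ours, not the author's calculation) using the single-trial 3 Card Stud Poker figures from Table 5:

```python
from math import sqrt

# Single-trial characteristics of the $5 pairplus bet (Table 5;
# the mean -0.116 is the unrounded Table 3 total).
mu_X, sigma_X, gamma_X, kappa_X = -0.116, 14.55, 8.63, 102.08

for turnover in (1000, 5000, 10000):
    n = turnover // 5                       # number of $5 bets
    row = (n * mu_X, sqrt(n) * sigma_X, gamma_X / sqrt(n), kappa_X / n)
    print(turnover, [round(v, 2) for v in row])
# The printed rows should track the corresponding rows of Table 6.
```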
Distribution of profits
The normal power approximation is applied to give the distribution of profits for both 3 Card Stud Poker and Casino War; a graphical representation is given in Figure 1. Note that the bonus $200 has been added to the horizontal axis to represent the total profit a player can achieve. The mode value in 3 Card Stud Poker is approximately -$100, which is less than the mean value of -$31.67. Using the normal distribution directly, the mode value would be -$31.67. This shows the importance of using the normal power approximation in determining the probabilities for the tail and establishing accurate confidence intervals. What is compelling about these two graphs is that a player can lose as much as $1,500 by playing 3 Card Stud Poker.
Figure 1. The distribution of profits for 3 Card Stud Poker and Casino War by turning over $10,000
Table 8—Confidence Intervals for 3 Card Stud Poker and Casino War

Game | 90% CI | 95% CI | 99% CI
3 Card Stud Poker | (-$1,070, $1,070) | (-$1,250, $1,300) | (-$1,600, $1,750)
Casino War | (-$330, $160) | (-$380, $200) | (-$420, $300)
Scenario 1
The player could compare the two distributions in Figure 1 when deciding which game he or she would prefer to play under Scenario 1. To help interpret Figure 1, confidence intervals (CI) have been constructed and are represented in Table 8. The results are reasonably accurate; simulation was used to verify them. It has already been established that 3 Card Stud Poker has a lower %HM by 0.56% and that a player is expected to lose $56 less by playing 3 Card Stud Poker than Casino War. However, the results in Table 8 indicate a player could lose substantially more money by playing 3 Card Stud Poker than Casino War. On the other hand, a player could gain substantially more money by playing 3 Card Stud Poker.
Table 9—Probability of Being Behind More Than $200 for 3 Card Stud Poker and Casino War

Turnover | 3 Card Stud Poker | Casino War
$1,000 | 0.290 | 0.001
$2,000 | 0.385 | 0.033
$3,000 | 0.433 | 0.125
$4,000 | 0.464 | 0.245
$5,000 | 0.488 | 0.365
Scenario 2
This scenario investigates the intermediate fluctuations during play in an attempt to minimize the chance of depositing more than the minimum amount of $200. Table 9 gives the probabilities, for every $1,000 of turnover (up to a maximum of $5,000), of a player being behind more than $200 and therefore having to deposit more money. The results indicate there is about a 49% chance of depositing more money in 3 Card Stud Poker while fulfilling the first $5,000 in wagering requirements, and about a 37% chance of depositing more money in Casino War. The corresponding chances of depositing more money over the full $10,000 in wagering requirements are difficult to calculate. It is clear, however, that the chance of depositing more money is smaller when playing Casino War. Therefore, under Scenario 2, Casino War is the preferred game to play.
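Table 9 can be approximated by simulation. The sketch below is not the author's calculation; it simply plays each game repeatedly, using the outcome distributions of tables 3 and 4, and estimates the chance of ever being more than $200 behind while wagering the first $5,000. The estimates should land near the $5,000 row of Table 9, up to Monte Carlo error.

```python
import random

random.seed(1)

# Outcome tables: 3 Card Stud pairplus (Table 3) and Casino War (Table 4).
pairplus = [(200, 12/5525), (150, 1/425), (30, 36/1105),
            (20, 274/5525), (5, 72/425), (-5, 822/1105)]
x = 12*(24/310)*(23/309) + (22/310)*(21/309)
y = (1 - x) / 2
war = [(2, 144/311), (-2, 144/311), (2, 23/311*(x + y)), (-4, 23/311*y)]

def p_ever_behind(outcomes, n_bets, limit=-200, sims=10_000):
    """Estimated probability of being more than $200 behind at some point
    during n_bets plays of a game with the given outcome distribution."""
    profits, probs = zip(*outcomes)
    hits = 0
    for _ in range(sims):
        bank = 0
        for p in random.choices(profits, weights=probs, k=n_bets):
            bank += p
            if bank < limit:
                hits += 1
                break
    return hits / sims

print(round(p_ever_behind(pairplus, 1000), 3))   # 3 Card Stud Poker, $5,000 turnover
print(round(p_ever_behind(war, 2500), 3))        # Casino War, $5,000 turnover
```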
Summary
Scenarios 1 and 2 assume the player plays only one game to turn over the required $10,000. If a player had a good run in the first $5,000 of turnover in Casino War, he or she may be inclined to switch to 3 Card Stud Poker. Likewise, if a player had a bad run in the first $5,000 of turnover in 3 Card Stud Poker, he or she may be inclined to switch to Casino War. A player's choice to switch between 3 Card Stud Poker and Casino War depends on his or her current bankroll and how many hands are left to play to wager the required $10,000. The minimum bet for both games has always been applied. A player may choose to speed up the process by betting higher than the minimum. This, again, may depend on the player's current bankroll and how many hands are left to play. In general, increasing the size of the bet in any casino game increases the variance and skewness of the total profit. This would amount to increasing the probability of losing the initial $200 before wagering the initial $5,000 (i.e., a greater chance of having to deposit more money into the casino account).

Is the offer worthwhile? The cost of a return flight to New York from Australia is around $2,000, excluding taxes. The expected cost to the player is about $32 in 3 Card Stud Poker and about $88 in Casino War. Therefore, on average, the offer is worthwhile in the sense of having a positive expected value once one accounts for the value of the flight from Australia.

Which is the best game to play? We identified two games in which the playing strategies were straightforward and the percent house margins relatively low. The best game to play depends on the player's objective toward the offer. A professional gambler, for example, may be more inclined to play 3 Card Stud Poker because it has a lower %HM, whereas an individual who gambles only to take advantage of these promotional offers may be more inclined to play Casino War, as it minimizes the probability of having to deposit more than $200 into an account. There is evidence to show that Casino War is the preferred game if a player is trying to minimize the probability of depositing more than the minimum $200. This is interesting, as such a player is willing to spend an extra $56 to reduce risk.
Further Reading
Packel, E. (1981). The Mathematics of Games and Gambling. Washington, DC: The Mathematical Association of America.
Pentikainen, T. and Pesonen, E. (1984). Risk Theory: The Stochastic Basis of Insurance. New York: Chapman and Hall.
Pesonen, E. (1975). "NP-Technique as a Tool in Decision Making." ASTIN Bulletin, 8(3):359–363. www.casact.org/library/astin/vol8no3/359.pdf.
How Are Statistical Journals Linked? A Network Analysis Alberto Baccini, Lucio Barabesi, and Marzia Marcheselli
Social network analysis is a method of inquiry that focuses on relationships between subjects such as individuals, organizations, or nation states. Network analysis has a broad applicability, as it has also been used to highlight relationships between objects as diverse as Internet pages, scientific papers, organisms, and molecules. Scientific research and its evaluation present a nonreducible social dimension. It is typical to ask a colleague to collaborate when writing a paper, commenting on a book, or revising a project. It is also typical to request the opinions of experts (or peers) to judge the quality of a paper, research results, or a research project. The editorial boards of scientific journals decide which papers are worthy of publication on the basis of revision by anonymous referees. The proxies normally used for measuring the scientific quality of a paper or journal (e.g., the impact factor) are implicitly based on the relational dimension of the scientific activity. Indeed, bibliometric popularity depends on the number of citations received from other scholars (mainly in the same research domain). In some cases, the relevance of individual scientific activity is approximated by esteem indicators. Esteem indicators are based on the positive appraisal other scholars attribute to an individual, and this positive appraisal is reflected in the position he or she occupies in the scientific community (e.g., as the director of a research project or the editor of a scientific journal). All these scholars and activities can be viewed as interdependent, rather than autonomous, units. The scientific activity can then be considered a relational link among the scholars. The connection pattern among scholars gives rise to a social network, and its structure affects the social interactions among them. By adopting the concepts of graph theory, such a network may be represented as a set of vertices denoting actors joined in pairs by lines denoting acquaintance. Hence, quantitative empirical studies in this setting may be conducted with the tools of network analysis. This approach has often been applied to the analysis of networks generated by scientific activities. The most highly investigated topic is probably that of the scientific collaboration network. In this case, two scientists are considered connected if they have coauthored a paper. As an example, Mark Newman, in "Coauthorship Networks and Patterns of Scientific Collaboration," published in the Proceedings of the National Academy of Sciences, analyzed the collaboration networks of scientists in the areas of biology, medicine, and physics and found that all these networks constitute a self-contained world in which the average distance between scientists via intermediate collaborators is tiny.
The Network of Common Editors for Statistical Journals
To explore the structural properties of the network generated by the editorial activities of those in the statistics community, the vertices of the network are statistical journals. A link between a pair of journals is generated by the presence of a common editor on the boards of both. The domain of the present research is the academic community of statisticians gathered around the 81 journals included in the statistics and probability category of the 2005 edition of the Journal of Citation Report Science Edition. By studying the structure of the statistical journal network with the tools of network analysis, we can shed light on the underlying processes scholars use in their research. To the best of our knowledge, the literature presents no extensive discussion of the role of editorial boards for scientific journals.
Figure 1. The statistical journal network. A link indicates a journal shares one or more editors. The journals are labeled according to the legend in Table 1. Dark grey vertices refer to pure and applied probability journals, white vertices to methodological statistical journals, and light grey vertices to applied statistical journals or journals with applications of statistics.
However, we possess anecdotal evidence and recent tentative generalizations. Traditionally, the main function of editorial boards was to determine which articles were appropriate for publication, but this function has changed in the last two to three decades. The spread of the anonymous referee process now allows editorial boards to concentrate on selecting and evaluating referees. In any case, the role of editors can be considered relevant in guiding research—in encouraging or suppressing various lines of research. We assert that editorial boards have power in shaping the editorial processes and policies of statistical journals. Therefore, we hypothesize that each editor
may influence the editorial policy of his or her journal. Consequently, if the same individual sits on the boards of two journals, those journals' editorial policies could have common elements. We will infer considerations about the similarity of editorial policies by observing the presence of scholars on editorial boards. Finally, it is worth remarking that this framework is similar to that considered in interlocking directorship analysis, which is probably the most developed field of application of dual-mode network analysis. An interlocking directorate occurs when a person sitting on the board of directors of a firm also sits on the board of another firm. Those interlocks have become primary indicators of inter-firm
network ties. As discussed by Mark Mizruchi in an Annual Review of Sociology article titled “What Do Interlocks Do? An Analysis, Critique, and Assessment of Research on Interlocking Directorates,” an inter-firm tie can be explained as the result of strategic decisions by firms, such as collusion, cooptation, or monitoring of environmental uncertainty sources.
The Data and Degree Distribution of the Interlocking Editorship Network
The empirical notion of editor that we adopted is very broad. Indeed, it covers all those listed as editor, associate editor, coeditor, member of the editorial board, or member of the advisory editorial board. There is no evidence regarding the roles of different kinds of editors in the editorial process (possibly apart from the role of editor-in-chief), and a single title such as managing editor may often entail different roles for different journals. Hence, the broad definition is assumed. We constructed the affiliation network database ad hoc. We have included 79 of the 81 statistical journals in the statistics and probability category of the 2005 edition of the Journal of Citation Report Science Edition. The two journals excluded were Utilitas Mathematica, given that it has no fixed editorial board, and the Journal of the Royal Statistical Society (Series D), as it ceased publication in 2003. This set of journals includes most major statistics and probability journals. The data on the members of the editorial boards were obtained directly from the hard copies of the journals; the first issue in 2006 was used to determine the makeup of each board. Moreover, the database was managed with the Pajek package, free software for large network analysis. In this database, 2,981 seats were available on the editorial boards and they were occupied by 2,346 scholars. The average number of seats per journal was 37.7, while the average number of seats occupied by each scholar (i.e., the mean rate of participation) was 1.27. A pair of journals is said to be linked when the journals have an editor in common. The number of lines linking the journals is 428, and the density of the interlocking editorship network (i.e., the ratio of the actual number of lines to the maximum possible number of lines) is 0.14. Thus, 14% of the possible lines are present. Figure 1 shows the graph of the network. The vertices are placed automatically by the Pajek package on the basis of the Fruchterman-Reingold algorithm. This procedure allows for an "optimal" layout, avoiding a subjective placement of the network vertices. More precisely, the algorithm moves vertices to locations that minimize the variation in line length in such a way as to generate a visually appealing graph. In this graph, the three main groups are easily recognized: the lower part presents mainly pure and applied probability journals (dark grey vertices), the central part presents methodological statistical journals (white vertices), and the upper part presents applied statistical journals or journals with applications of statistics (light grey vertices).

Figure 2. Degree frequency distribution of the statistical journals. The degree of a journal is the number of connections through editors to other journals.

To begin an exploratory analysis, the degree distribution has been examined. In this setting, the degree of a journal is the number of lines it shares with other journals. Figure 2 displays the degree frequency distribution of the considered journals. The mean degree is 10.84, the median degree is 10, and the degree standard deviation is 7.94. Only three journals—interdisciplinary journals with a major emphasis on other disciplines and edited by members of other scientific communities—are isolated from the network (i.e., they have zero degree). All the other journals are linked directly or indirectly, showing a strongly connected network. Indeed, a search for components in the network trivially shows three components, each made up of one element (the aforementioned journals), and one big component made up of the remaining 76 journals. A component is a maximal connected subnetwork (i.e., each pair of subnetwork vertices is connected by a sequence of distinct lines). Finally, we assessed the association between the number of board editors and the journal degree. Even though it might be expected that journals with large boards possess elevated degree values, only a mild positive dependence exists between these variables. Indeed, Kendall's association coefficient turns out to be 0.30. The lack of dependence could be partially explained by the presence of interdisciplinary journals. For example, Econometrica possesses 50 editors and displays a degree equal to 3. However, there also exist statistical journals with many editors and a low degree, such as The American Statistician with 41 editors and a degree equal to 6.
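For readers who would like to experiment, the same summaries (density, degrees, components) can be computed in a few lines of Python with the networkx package. The journal names and shared-editor counts in the toy edge list below are invented for illustration; they are not the article's data.

```python
import networkx as nx

# Toy interlocking-editorship network: (journal, journal, shared editors).
edges = [("JASA", "JSPI", 7), ("JASA", "Biometrics", 5),
         ("JSPI", "Statistica Sinica", 4), ("Biometrics", "Biostatistics", 4),
         ("Econometrica", "JASA", 1)]

G = nx.Graph()
G.add_node("Quality and Quantity")                   # an isolated journal
G.add_weighted_edges_from(edges)

print(nx.density(G))                                 # share of possible lines present
print(dict(G.degree()))                              # degree of each journal
print([len(c) for c in nx.connected_components(G)])  # component sizes
```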
The Center and Periphery in the Interlocking Editorship Network
A main concern in network analysis is to distinguish between the center and periphery of the network. In our case, the problem is to distinguish between the statistical journals that have a central position in the network and those in
the periphery. As suggested by Stanley Wasserman and Katherine Faust in Social Network Analysis: Method and Application, three centrality measures for each journal in the network may be adopted. The simplest measure is represented by its degree—the more ties a journal has to other journals, the more central its position in the network. For example, the Journal of the American Statistical Association is linked to 35 journals, whereas Multivariate Behavioral Research is linked to one. Hence, the first is more central in the network than the second. The normalized degree of a journal is the ratio of its degree to the maximum possible degree (i.e., the number of journals minus 1). It is a linear transformation of the degree and ranks journals in an equivalent order. Thus, the Journal of the American Statistical Association is linked to about 45% of the other journals in the network, whereas Multivariate Behavioral Research is linked with 1.3%. Table 1 contains the degree and normalized degree for the statistical journals considered. An overall measure of centralization in the network (based on marginal degrees) is given by the so-called degree centralization. In this case, the index turns out to be 0.32, showing that the network of statistical journals is quite centralized. The second centrality measure is given by closeness centrality, which is based on the distance between a journal and all the other journals. In the network analysis, the distance between two vertices is usually based on so-called geodesic distance. A geodesic is the shortest path between two vertices. Its length is the number of lines in the geodesic. Hence, the closeness centrality of a journal is the number of journals (linked to this journal by a path) divided by the sum of all the distances between the journal and the linked journals. The basic idea is that a journal is central if its board can quickly interact with all the other boards. Journals occupying a central location with respect to closeness can be effective in communicating information (sharing research, sharing papers, and deciding editorial policies) to other journals. Table 1 shows the closeness centrality for the statistical journals. By focusing on the connected network of 76 journals, it is possible to compute the overall closeness centrality of 0.33, showing that the network of statistical journals is fairly centralized. The third considered measure is the so-called betweenness centrality. The
idea behind the index is that similar editorial aims between two nonadjacent journals might depend on other journals in the network, especially on those journals lying on the paths between the two. The other journals might have some control over the interaction between two nonadjacent journals. Hence, a journal is more central in this respect if it is an important intermediary in links between other journals. From a formal perspective, the betweenness centrality of a journal is the proportion of all paths between pairs of other journals that include this journal. Table 1 contains the betweenness centrality of the statistical journals. For example, the Journal of the American Statistical Association is in about 10% of the paths linking all other journals in the network. In turn, it is possible to compute the overall betweenness centralization of the network (i.e., the variation in the betweenness centrality of vertices divided by the maximum variation in betweenness centrality scores possible in a network of the same size). In this case, the overall betweenness centralization is 0.09. This value turns out to be low, showing that the betweenness centrality of vertices is not widely scattered. It is worth noting the ranking similarity of the three centrality measures. This is emphasized by Kendall's concordance index, which equals 0.90. This index value is high, so the rankings induced by the centrality measures are in good agreement. The top positions according to the degree centrality and closeness centrality rankings are occupied by the same journals. Moreover, the top four journals in these two rankings are the same and in the same order: Journal of the American Statistical Association, Journal of Statistical Planning and Inference, Statistica Sinica, and Biometrics. The Journal of the American Statistical Association, the Journal of Statistical Planning and Inference, and Statistica Sinica are broad-based journals aiming to cover all branches of statistics. In contrast, Biometrics promotes general statistical methodology with applications to biological and environmental data. The central position of this journal may be explained by its interdisciplinary nature (i.e., by the presence of many influential editors who give rise to a large number of different links with other journals). These four journals also display a top ranking when the betweenness centrality is considered.
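All three centrality measures have standard implementations. The sketch below computes them with networkx on a small random graph standing in for the 79-journal network (which is not reproduced here); note that networkx's closeness convention is close to, though not necessarily identical with, the definition given above.

```python
import networkx as nx

# A stand-in graph; with the real journal network loaded instead, the same
# calls produce rankings of the kind shown in Table 1.
G = nx.gnp_random_graph(20, 0.15, seed=42)

degree      = nx.degree_centrality(G)        # normalized degree
closeness   = nx.closeness_centrality(G)     # based on geodesic distances
betweenness = nx.betweenness_centrality(G)   # share of shortest paths through a vertex

most_between = sorted(G.nodes, key=betweenness.get, reverse=True)[:3]
print(most_between)
```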
Table 1—Statistical Journals, Centrality Measures, and Corresponding Rankings of the 79 Probability and Statistics Journals

Label | Journal | Degree | Normalized Degree | Degree Rank | Closeness | Closeness Rank | Betweenness | Betweenness Rank
1 | Advances in Applied Probability | 18 | 0.231 | 14 | 0.505 | 15 | 0.0438 | 8
2 | American Statistician | 6 | 0.077 | 52 | 0.440 | 43 | 0.0014 | 56
3 | Annales de l'Institut Henri Poincaré | 4 | 0.051 | 59 | 0.345 | 70 | 0.0003 | 66
4 | Annals of Applied Probability | 15 | 0.192 | 19 | 0.475 | 28 | 0.0177 | 24
5 | Annals of Probability | 11 | 0.141 | 37 | 0.437 | 45 | 0.0231 | 21
6 | Annals of Statistics | 23 | 0.295 | 6 | 0.523 | 11 | 0.0299 | 11
7 | Annals of the Institute of Statistical Mathematics | 16 | 0.205 | 18 | 0.501 | 16 | 0.0089 | 34
8 | Applied Stochastic Models in Business and Industry | 21 | 0.269 | 12 | 0.519 | 12 | 0.0325 | 10
9 | Australian & New Zealand Journal of Statistics | 18 | 0.231 | 14 | 0.512 | 14 | 0.0266 | 14
10 | Bernoulli | 19 | 0.244 | 13 | 0.515 | 13 | 0.0268 | 13
11 | Bioinformatics | 4 | 0.051 | 60 | 0.419 | 55 | 0.0014 | 55
12 | Biometrical Journal | 15 | 0.192 | 20 | 0.488 | 21 | 0.0072 | 37
13 | Biometrics | 24 | 0.308 | 4 | 0.538 | 4 | 0.0477 | 7
14 | Biometrika | 4 | 0.051 | 60 | 0.401 | 60 | 0.0006 | 63
15 | Biostatistics | 11 | 0.141 | 31 | 0.445 | 42 | 0.0066 | 38
16 | British Journal of Mathematical and Statistical Psychology | 2 | 0.026 | 66 | 0.361 | 66 | 0.0000 | 71
17 | Canadian Journal of Statistics | 11 | 0.141 | 31 | 0.469 | 32 | 0.0019 | 53
18 | Chemometrics and Intelligent Laboratory Systems | 4 | 0.051 | 60 | 0.380 | 63 | 0.0057 | 40
19 | Combinatorics, Probability & Computing | 2 | 0.026 | 66 | 0.372 | 65 | 0.0008 | 59
20 | Communications in Statistics. Theory and Methods | 22 | 0.282 | 10 | 0.531 | 6 | 0.0126 | 30
21 | Communications in Statistics. Simulation and Computation | 22 | 0.282 | 10 | 0.531 | 6 | 0.0126 | 30
22 | Computational Statistics | 12 | 0.154 | 30 | 0.472 | 30 | 0.0057 | 41
23 | Computational Statistics & Data Analysis | 23 | 0.295 | 7 | 0.531 | 6 | 0.0771 | 3
24 | Econometrica | 3 | 0.038 | 64 | 0.405 | 58 | 0.0000 | 70
25 | Environmental and Ecological Statistics | 23 | 0.295 | 7 | 0.538 | 4 | 0.0502 | 5
26 | Environmetrics | 14 | 0.179 | 24 | 0.484 | 25 | 0.0238 | 19
27 | Finance and Stochastics | 7 | 0.090 | 48 | 0.429 | 49 | 0.0093 | 33
28 | Fuzzy Sets and Systems | 2 | 0.026 | 66 | 0.354 | 68 | 0.0001 | 69
29 | Infinite Dimensional Analysis, Quantum Probability and Related Topics | 5 | 0.064 | 57 | 0.422 | 54 | 0.0495 | 6
30 | Insurance: Mathematics and Economics | 8 | 0.103 | 44 | 0.448 | 40 | 0.0055 | 42
31 | International Journal of Game Theory | 0 | 0.000 | 77 | 0.000 | 77 | 0.0000 | 71
32 | International Statistical Review | 3 | 0.038 | 64 | 0.382 | 62 | 0.0003 | 67
33 | Journal of Agricultural Biological and Environmental Statistics | 6 | 0.077 | 53 | 0.424 | 53 | 0.0008 | 60
34 | Journal of Applied Probability | 14 | 0.179 | 24 | 0.484 | 25 | 0.0254 | 16
35 | Journal of Applied Statistics | 9 | 0.115 | 42 | 0.460 | 36 | 0.0024 | 50
36 | Journal of Business and Economic Statistics | 10 | 0.128 | 38 | 0.463 | 34 | 0.0043 | 44
37 | Journal of Chemometrics | 2 | 0.026 | 66 | 0.315 | 74 | 0.0002 | 68
38 | Journal of Computational and Graphical Statistics | 15 | 0.192 | 20 | 0.494 | 17 | 0.0112 | 32
39 | Journal of Computational Biology | 8 | 0.103 | 44 | 0.427 | 51 | 0.0048 | 43
40 | Journal of Multivariate Analysis | 23 | 0.295 | 7 | 0.527 | 10 | 0.0205 | 22
41 | Journal of Nonparametric Statistics | 8 | 0.103 | 44 | 0.440 | 43 | 0.0014 | 57
42 | Journal of Quality Technology | 8 | 0.103 | 44 | 0.448 | 40 | 0.0233 | 20
43 | Journal of Statistical Computation and Simulation | 7 | 0.090 | 48 | 0.432 | 47 | 0.0026 | 48
44 | Journal of Statistical Planning and Inference | 33 | 0.423 | 2 | 0.582 | 2 | 0.0607 | 4
45 | Journal of the American Statistical Association | 35 | 0.449 | 1 | 0.601 | 1 | 0.1046 | 1
46 | Journal of the Royal Statistical Society, Series A | 2 | 0.026 | 66 | 0.355 | 67 | 0.0000 | 71
47 | Journal of the Royal Statistical Society, Series B | 11 | 0.141 | 31 | 0.451 | 39 | 0.0039 | 45
48 | Journal of the Royal Statistical Society, Series C | 5 | 0.064 | 57 | 0.392 | 61 | 0.0021 | 51
49 | Journal of Theoretical Probability | 6 | 0.077 | 53 | 0.376 | 64 | 0.0024 | 49
50 | Journal of Time Series Analysis | 7 | 0.090 | 48 | 0.437 | 45 | 0.0006 | 62
51 | Lifetime Data Analysis | 24 | 0.308 | 4 | 0.531 | 6 | 0.0245 | 18
52 | Methodology and Computing in Applied Probability | 15 | 0.192 | 20 | 0.488 | 21 | 0.0078 | 35
53 | Metrika | 10 | 0.128 | 38 | 0.463 | 34 | 0.0075 | 36
54 | Multivariate Behavioral Research | 1 | 0.013 | 75 | 0.344 | 71 | 0.0000 | 71
55 | Open Systems & Information Dynamics | 2 | 0.026 | 66 | 0.297 | 75 | 0.0246 | 17
56 | Oxford Bulletin of Economics and Statistics | 0 | 0.000 | 77 | 0.000 | 77 | 0.0000 | 71
57 | Pharmaceutical Statistics | 9 | 0.115 | 42 | 0.419 | 55 | 0.0015 | 54
58 | Probabilistic Engineering Mechanics | 1 | 0.013 | 75 | 0.228 | 76 | 0.0000 | 71
59 | Probability in the Engineering and Informational Sciences | 2 | 0.026 | 66 | 0.334 | 72 | 0.0000 | 71
60 | Probability Theory and Related Fields | 10 | 0.128 | 38 | 0.460 | 36 | 0.0132 | 28
61 | Quality and Quantity | 0 | 0.000 | 77 | 0.000 | 77 | 0.0000 | 71
62 | Scandinavian Journal of Statistics | 7 | 0.090 | 48 | 0.432 | 47 | 0.0012 | 58
63 | Statistica Neerlandica | 6 | 0.077 | 53 | 0.405 | 58 | 0.0004 | 65
64 | Statistica Sinica | 29 | 0.372 | 3 | 0.573 | 3 | 0.0819 | 2
65 | Statistical Methods in Medical Research | 13 | 0.167 | 28 | 0.454 | 38 | 0.0174 | 25
66 | Statistical Modelling | 16 | 0.205 | 17 | 0.494 | 17 | 0.0152 | 26
67 | Statistical Papers | 11 | 0.141 | 31 | 0.472 | 30 | 0.0038 | 46
68 | Statistical Science | 4 | 0.051 | 60 | 0.412 | 57 | 0.0007 | 61
69 | Statistics | 14 | 0.179 | 24 | 0.491 | 19 | 0.0058 | 39
70 | Statistics & Probability Letters | 15 | 0.192 | 20 | 0.491 | 19 | 0.0262 | 15
71 | Statistics and Computing | 10 | 0.128 | 38 | 0.465 | 33 | 0.0150 | 27
72 | Statistics in Medicine | 11 | 0.141 | 31 | 0.475 | 28 | 0.0020 | 52
73 | Stochastic Analysis and Applications | 13 | 0.167 | 28 | 0.488 | 21 | 0.0360 | 9
74 | Stochastic Environmental Research and Risk Assessment | 6 | 0.077 | 53 | 0.429 | 49 | 0.0032 | 47
75 | Stochastic Models | 2 | 0.026 | 66 | 0.334 | 72 | 0.0000 | 71
76 | Stochastic Processes and their Applications | 11 | 0.141 | 31 | 0.427 | 51 | 0.0131 | 29
77 | Technometrics | 14 | 0.179 | 24 | 0.478 | 27 | 0.0203 | 23
78 | Test | 17 | 0.218 | 16 | 0.488 | 21 | 0.0270 | 12
79 | Theory of Probability and its Applications | 2 | 0.026 | 66 | 0.349 | 69 | 0.0005 | 64
Table 2—Line Multiplicity Frequency Distribution

Line Value | Frequency | Frequency (%)
1 | 284 | 66.4
2 | 80 | 18.7
3 | 27 | 6.3
4 | 13 | 3.0
5 | 8 | 1.9
6 | 7 | 1.6
7 | 3 | 0.7
8 | 2 | 0.5
9 | 1 | 0.2
11 | 1 | 0.2
29 | 1 | 0.2
82 | 1 | 0.2

Note: The line value is the number of editors in common between pairs of journals connected by a line.
Annals of Statistics, Computational Statistics & Data Analysis, Environmental and Ecological Statistics, and Applied Stochastic Models in Business and Industry are also central in the network. Annals of Statistics aims to publish research papers with emphasis on methodological statistics. Computational Statistics & Data Analysis publishes papers on different topics in statistics, with an emphasis on computational methods and data analysis. Environmental and Ecological Statistics, which is obviously devoted to a rather special topic, occupies a very central position in the network. This might be due to the increasing importance of environmental research in science. Applied Stochastic Models in Business and Industry publishes papers on the interface between stochastic modeling, data analysis, and their applications in business and finance. In addition, Communications in Statistics - Theory and Methods, Communications in Statistics - Simulation and Computation, the Journal of Multivariate Analysis, and Lifetime Data Analysis are important in sustaining the network structure, even if their role in connecting the other journals in the network is weaker because they have smaller values of betweenness centrality. In turn, the Communications in Statistics journals publish papers devoted to all the main areas of statistics. The Journal of Multivariate Analysis and Lifetime Data Analysis display a higher degree of specialization, as the first aims to publish papers on
multivariate statistical methodology and the second generally considers applications of statistical science in fields dealing with lifetime data. All these journals have a long standing in statistical research; hence, their role in the network is understandable. The less central position of influential journals does not reduce their importance for statistical research. It simply emphasizes that the editorial policies of their boards are different. A different ranking system for journals uses so-called impact factors, which are based on statistical analysis of citation data. However, as written by Vasilis Theoharakis and Mary Skordia in their The American Statistician paper titled "How Do Statisticians Perceive Statistics Journals?", the ranking induced by the perceived journal importance may differ from the ranking given by impact factor.
Valued Network Analysis
It is interesting to consider the strength of the relationship between journals. The network of journals can be characterized as a valued network. More precisely, in a valued network, the lines have a value indicating the strength of the tie linking two vertices. In our case, the value of a line is the number of editors sitting on the boards of the two journals linked by that line.
Table 2 shows the distribution of lines according to their values. As we already know, there are three isolated journals and one pair of journals sharing all 82 editors (i.e., Communications in Statistics - Theory and Methods and Communications in Statistics - Simulation and Computation). We see that 66% of the links are generated by journals sharing one editor and about 91% are generated by journals sharing three or fewer editors. In social network analysis, it is usual to consider lines with higher value to be more important, as they are less personal and more institutional. In the case of the journal network, the basic idea is simple: The editorial proximity between two journals can be measured by observing the degree of overlap among their boards. Two journals with no common editors have no editorial relationship. Two journals with the same board share the same aim (i.e., the two journals have a common or shared editorial policy). As an example, Statistica Sinica (the Chinese statistical journal) and Quality and Quantity (an Italian sociological journal) have no common editors, so their editorial policies can be considered independent of each other. The opposite situation occurs with Communications in Statistics - Theory and Methods and Communications in Statistics - Simulation and Computation. These two journals share all 82 board members and their editorial policies are complementary—theoretical contributions are addressed by the former and computation-oriented contributions by the latter. Obviously, there are different degrees of integration between these two extreme cases. Starting from this basis, it is possible to define cohesive subgroups (i.e., subsets of journals with relatively strong ties). In a valued network, a cohesive subgroup is a subset of vertices among which ties have a value higher than a given threshold. In our case, a cohesive subgroup of journals is a set of journals sharing a number of editors equal to or higher than the threshold. A cohesive subgroup of journals is a subgroup with a similar editorial policy, belonging to the same subfield of the discipline or sharing a common methodological approach. As described in Exploratory Social Network Analysis with Pajek, cohesive subgroups are identified as weak components in m-slices (i.e., subsets for which the threshold value is at least m).
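An m-slice is straightforward to compute once the valued network is available. The sketch below uses networkx; apart from the 82 editors shared by the two Communications in Statistics journals, which is reported above, the edge weights are invented for illustration.

```python
import networkx as nx

# Toy valued network: (journal, journal, number of shared editors).
edges = [("CiS Theory and Methods", "CiS Simulation and Computation", 82),
         ("JASA", "JSPI", 7), ("JASA", "Statistica Sinica", 5),
         ("JASA", "Biometrics", 4), ("Biometrics", "Statistics in Medicine", 5),
         ("JASA", "Technometrics", 2), ("Bernoulli", "Test", 1)]

G = nx.Graph()
G.add_weighted_edges_from(edges)

m = 4  # keep only lines with value >= m, then look at the weak components
heavy_edges = [(u, v) for u, v, w in G.edges(data="weight") if w >= m]
slice_m = G.edge_subgraph(heavy_edges)
for component in nx.connected_components(slice_m):
    print(sorted(component))
```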
Figure 3. The big weak component in the four-slice network, the largest cohesive subgroup of journals linked by at least four editors. The area of vertices is proportional to betweenness centrality.
In this case, the threshold is the number of editors needed to establish a link (i.e., all journals in a cohesive subgroup must be joined to others in the subgroup by at least m editors). The search for cohesive subgroups shows a clear pattern—the presence of a relatively big component and the complete fragmentation of the rest into small groups, mostly made up of a single journal or of tiny groups with niche specialization. Figure 3 contains the representation of the central component of the network, identified as the big weak component in the four-slice network. The 18 journals in this subset have at least four common editors. The dimension of each vertex represents the betweenness centrality of the corresponding journal. The core of the big weak component is actually maintained by the same four journals that display the top rankings with respect to the centralization measures. The Journal of the American Statistical Association, Journal of Statistical Planning and Inference, Statistica Sinica, and Biometrics are strongly connected and control the links with most of the other journals in the big weak component. By dropping the Journal of the American Statistical Association and Journal of Statistical Planning and Inference from the network, the graph splits into three groups. In particular, the Journal of Statistical Planning and Inference is the bridge between two groups and the remaining part of the graph. The first group is constituted by Test and Bernoulli, devoting attention to rather technical papers with special emphasis on Bayesian methodology. The second group is made up of Communications in Statistics - Theory and Methods, Communications in Statistics - Simulation and Computation, Statistics & Probability Letters, and Methodology and Computing in Applied Probability,
which publish papers on the boundary of mathematical statistics and probability. As to the third group, Biometrics is central in maintaining this structure. It is connected to a subgroup of two journals dealing with statistical applications to medicine (Statistics in Medicine and Statistical Methods in Medical Research), to a subgroup of two journals dealing with biological applications (Lifetime Data Analysis and Biostatistics), to a subgroup of two journals dealing with environmental applications (Environmental and Ecological Statistics and Environmetrics), and to a subgroup of methodological journals (Australian & New Zealand Journal of Statistics, Statistica Sinica, and Annals of Statistics).

Figure 4. Small weak components in the four-slice network, the smaller cohesive subgroups of journals linked by at least four editors. The area of vertices is proportional to betweenness centrality.
The seven small weak components resulting from the search for cohesive subgroups can be seen in Figure 4. There are five small weak components given by pairs of journals with specialized aims and two small weak components given by three journals. The components made up of two journals deal with different areas of probability and applied statistics. Of the two components made up of three journals, one represents journals concerned with computational statistics and the other represents journals
dealing with processes theory and the applications of probability to physics.
Future Possibilities
The exploratory analysis developed here relies on a weak hypothesis: Each editor possesses some power in the definition of the editorial policy of his or her journal. Consequently, if the same scholar sits on the board of two journals, those journals could have some common elements in their editorial policies. The proximity of the editorial policies of two scientific journals can be assessed by the number of common editors sitting on their boards. On the basis of this
statement, applying the instruments of network analysis, a simple interpretation of the statistical journal network has been given. The network generated by interlocking editorship seems compact. This is probably the result of a common perspective about the appropriate methods (for investigating the problems and constructing the theories) in the domain of statistics. Competing visions or approaches to statistical research do not prompt scholars to abandon a common tradition, language, or vision about how to conduct research. Moreover, it is not surprising that in the center of the network lie general journals or journals devoted to the recent and growing subfields of the discipline. It might be interesting to analyze network changes brought about through the inclusion of other journals not listed in the statistics and probability category of the Journal of Citation Report Science Edition such as Survey Methodology, Journal of
Official Statistics, and Psychometrika. It may also be revealing to track the evolution of the network over time to understand the extent of its variability, especially within its inner core.
Further Reading
Barabási, A.L. (2003). Linked. New York: Plume Book.
de Nooy, W., Mrvar, A., and Batagelj, V. (2005). Exploratory Social Network Analysis with Pajek. Cambridge: Cambridge University Press.
Hodgson, G.M. and Rothman, H. (1999). "The Editors and Authors of Economics Journals: A Case of Institutional Oligopoly?" The Economic Journal, 109:165–186.
Joint Committee on the Quantitative Assessment of Research. (2008). Citation Statistics. International Mathematical Union. www.mathunion.org/fileadmin/IMU/Report/CitationStatistics.pdf.
Mizruchi, M.S. (1996). "What Do Interlocks Do? An Analysis, Critique, and Assessment of Research on Interlocking Directorates." Annual Review of Sociology, 22:271–298.
Newman, M.E.J. (2004). "Coauthorship Networks and Patterns of Scientific Collaboration." Proceedings of the National Academy of Sciences of the United States of America, 101:5200–5205.
Stigler, G.J., Stigler, S.M., and Friedland, C. (1995). "The Journals of Economics." The Journal of Political Economy, 103:331–359.
Theoharakis, V. and Skordia, M. (2003). "How Do Statisticians Perceive Statistics Journals?" The American Statistician, 57:115–123.
Wasserman, S. and Faust, K. (1994). Social Network Analysis: Method and Application. Cambridge: Cambridge University Press.
Human or Cylon?
Group testing on ‘Battlestar Galactica’ Christopher R. Bilder
Battlestar Galactica, "Tigh Me Up, Tigh Me Down," (from left) James Callis as Dr. Gaius Baltar and Katee Sackhoff as Lt. Kara 'Starbuck' Thrace. Photo by Carole Segal/SyFy Channel/NBCU via AP Images
Statistical science has made significant contributions to medicine, engineering, social science, agriculture, and a multitude of other important areas. Does statistics have a place in the world of science fiction, though? Because science fiction writers try to merge the sci-fi world with the real world in a believable way, one might think statistics could make a significant contribution to solving sci-fi problems. After all, many science fiction works already rely on science to rescue characters from the brink of disaster. In the hit Syfy television show "Battlestar Galactica" (a re-imagined version of the 1970s show), there is an attempt to use science to solve an important problem. Due to the excessive amount of time the proposed solution would take to complete, it is deemed impractical and therefore never implemented. I will show how the problem could be solved instead using a statistical technique called "group testing." Scientists use this technique to solve many real-world problems, including the screening of blood donations for diseases.
'Battlestar Galactica'
The Emmy and Peabody award-winning "Battlestar Galactica" has been rated one of the best science fiction shows of all time. The show is about the struggle between humans and Cylons in a distant part of our galaxy. Cylons are cybernetic life forms originally created by humans. When they evolved and rebelled by destroying humans' home planets, approximately 47,000 humans survived and banded together in a ragtag fleet of spaceships led by a military spaceship named Galactica, which is the source of the show's name. There are two types of Cylons. One has a metallic, robot-like form and one a more human form. The humanoid Cylon was secretly created by other Cylons to help destroy humanity, due to it being indistinguishable from real humans. When the humans discovered the existence of the humanoid Cylon, their top priority became figuring out how to distinguish a human from a Cylon. The character charged with developing a "Cylon detector" is Dr. Gaius Baltar. Fortunately for him, the number of Cylons in the fleet is expected to be small, but all 47,905 individuals in it must be tested. Baltar created a blood test for Cylon indicators, and in the episode "Tigh Me Up, Tigh Me Down," he says a single test will take 11 hours to complete. Extrapolating to include everyone in the fleet, Baltar estimated it would take about 61 years to complete the testing. (From Baltar's calculations, one can deduce that the fleet, like Earth, observes a 24-hour day and 365-day year.) Baltar planned to use "individual testing." This involves testing each blood specimen one at a time for Cylon indicators. The obvious problem with this testing strategy is that it will take a long time. Another problem is that these humans have limited resources. The Cylon testing needs to be done quickly, while using as few resources as possible.
Group Testing The “Tigh Me Up, Tigh Me Down” episode was the ninth episode of the 70-episode series. Also, it was the last episode in which testing for Cylon indicators is mentioned, and the testing is never carried out. Perhaps the writers wanted to put Baltar into an impossible situation. Alternatively, perhaps they did not consult with scientists, because scientists would have suggested using group testing. Group testing, also known as pooled testing, is used in a variety of realworld applications (see “Areas Where Group Testing Is Used”). In the “Battlestar Galactica” situation, group testing would begin by putting each individual into a group. Within a group, parts
of each individual’s specimen would be composited so one test could be performed. If the composited specimen tested negative for Cylon indicators, all individuals within the group would be declared human. If the composited specimen tested positive for Cylon indicators, then there would be at least one Cylon in the group and retesting procedures would be implemented to find the Cylon(s). The potential advantages of using group testing are that a smaller number of tests is required and fewer resources are expended. In general, these advantages occur when the overall prevalence of the trait of interest (e.g., being a Cylon or having a particular disease) in a population is small. There are a number of retesting procedures that can be used to decode a positive group. The easiest and most-used procedure is one originally proposed by Robert Dorfman in his 1943 Annals of Mathematical Statistics article, “The Detection of Defective Members of Large Populations,” which suggests using group testing for syphilis screening of American soldiers during World War II. Dorfman’s procedure retests each individual within a positive group to determine a diagnosis. Overall, an initial group of size I that tests positive would result in I + 1 tests. While this procedure typically results in significant savings when compared to individual testing, there are better procedures. Andrew Sterrett’s 1957 article in the Annals of Mathematical Statistics, “On the Detection of Defective Members of Large Populations,” proposes a different retesting procedure that leads to a smaller expected number of tests than Dorfman’s procedure. For an initial positive group, the procedure begins by randomly retesting individuals until a first positive is found. Once found, the remaining individuals are pooled to form a new group. If this new group tests negative, the decoding process is complete and the individuals in the new group are declared negative. Because group testing is used in low overall prevalence situations, having only one positive is a likely event for a reasonably chosen initial group size. If the new group tests positive, the process begins again by randomly retesting individuals from this group until a second positive is found. Once the second positive is found, the remaining individuals are pooled again to determine if more positives remain.
Areas Where Group Testing Is Used
Screening blood donations: All blood donations need to be screened for diseases such as HIV, hepatitis, and West Nile virus. The American Red Cross uses groups of 16 and Robert Dorfman’s procedure for their screening. See Dodd et al. (2002).
Drug discovery experiments: Early in the drug discovery process, hundreds of thousands of chemical compounds are screened to look for those that are active. The matrix pooling procedure is used often for this purpose. See Remlinger et al. (2006).
Plant pathology: In multiple vector transfer design experiments, groups of insect vectors are transferred to individual plants. After a sufficient amount of time, the plants are examined to determine if they have become infected by the insects. In this case, the plants provide the group responses. See Tebbs and Bilder (2004).
Veterinary medicine: Among the many applications, cattle are screened for the bovine viral diarrhea virus. Groups of up to 100 are formed from ear notches. See Peck (2006).
Public health studies: Group testing provides a cost-efficient mechanism for poorer countries to obtain information on disease prevalence. See Verstraeten et al. (1998).
The whole process of individually testing and repooling is repeated until no more retests are positive. Another frequently used procedure involves creating subsets of a positive initial group. Eugene Litvak, Xin Tu, and Marcello Pagano in their 1994 Journal of the American Statistical Association article, “Screening for the Presence of a Disease by Pooling Sera Samples,” proposed to subsequently halve positive groups until all individual positives have been found. For example, suppose an initial group of size eight tests positive. This group is divided into two halves of size four. Any new group that tests positive is subsequently divided into pools of size two. Finally, the remaining positive groups result in individual tests. All of these procedures assume an individual is initially assigned to one group only and retesting is performed in a hierarchical manner. There are other nonhierarchical testing procedures.
In particular, Ravi Phatarfod and Aidan Sudbury propose in “The Use of a Square Array Scheme in Blood Testing,” a 1994 Statistics in Medicine article, to place specimens into a matrix-like grid. Specimens are pooled within each row and column. Positive individuals occur at the intersection of positive rows and columns. When more than one row and column test positive within a single grid, individuals at the intersections are further individually tested to complete the decoding. Matrix pooling is especially useful in high-throughput screening. In this situation, specimens are arranged into a matrix grid of wells on a plate so robotic arms can do the pooling automatically. With many of these procedures, modifications must be made. For example, the Sterrett and matrix pooling procedures could lead to an individual being declared positive without ever having been tested individually. This might happen if a positive group contains only one positive individual and this individual is the last to be selected using Sterrett’s procedure.
Figure 1. Expected number of tests versus group size using four procedures; group size increments of size 10 are used for constructing the plot; conversion to years of testing is on the right axis. (Axes: group size from 0 to 500; number of tests from 0 to 5,000 on the left; years from 0 to 6 on the right.)
For this reason, it is better to add an extra test for this individual, rather than declaring him or her to be positive by default. Also, when using the halving procedure, a group not of a size evenly divisible by two can still be tested. The successive dividing of positive groups may lead to an individual being tested at an earlier stage. For example, a group of seven can be divided into groups of four and three. For the group of three, a further subdividing leads to one group of two and a separate individual. Finally, when performing matrix pooling, there may be a 10×10 grid available for testing, but 122 individuals need to be tested. This works well for the first 100, but not for the last 22. These last individuals can be tested in two rows of 10, and Dorfman’s procedure can be performed on the last two.
Cylon Detection
“Battlestar Galactica” fans now know there were most likely seven Cylons out of the 47,905 individuals in the fleet, and all of these Cylons were unknown to Baltar. Taking this seven as the true
count, the overall prevalence becomes approximately 0.0001461, which meets the low prevalence criteria needed for group testing to be useful. In comparison, the HIV prevalence for American Red Cross blood donations was 0.00009754 in 2001, when Dorfman’s procedure was used for screening purposes. Baltar makes a number of implicit assumptions with his testing. First, he assumes all the individual testing needs to be done back-to-back. Of course, if Baltar was able to run individual tests on multiple specimens simultaneously, this would greatly reduce the testing time. Perhaps due to the limited resources available to him (a nuclear warhead was needed to create his Cylon detector), only back-to-back testing was mentioned. Back-to-back testing is used here when implementing group testing, as well. Second, Baltar never discusses the possibility of testing error (i.e., the test diagnosis is incorrect). In the real world, however, one usually needs to account for testing error through knowing the sensitivity and specificity of a diagnostic test. No testing error is assumed here when implementing group testing, either.
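As a concrete illustration of how group testing would work under these same assumptions (no testing error, back-to-back tests), here is a small simulation sketch of Dorfman's procedure in Python. The group size of 100 and the random seed are arbitrary illustrative choices, not values taken from the article.

import random

def dorfman_total_tests(statuses, group_size):
    """Total tests under Dorfman's procedure: one pooled test per group, plus
    one retest for every member of any group whose pooled test is positive."""
    tests = 0
    for start in range(0, len(statuses), group_size):
        group = statuses[start:start + group_size]
        tests += 1                 # the pooled test
        if any(group):             # at least one Cylon in the pool
            tests += len(group)    # retest each member individually
    return tests

random.seed(2009)
fleet = [1] * 7 + [0] * (47905 - 7)   # 7 Cylons among 47,905 individuals
random.shuffle(fleet)                  # random assignment to groups

tests = dorfman_total_tests(fleet, group_size=100)
print(tests)                              # one realization; typically around 1,180 tests
print(round(tests * 11 / 24 / 365, 2))    # roughly 1.5 years of back-to-back 11-hour tests

Even a single run of this kind shows the order-of-magnitude savings over the roughly 48,000 tests (about 61 years) that individual testing would require.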
For the four group testing procedures and a given group size, Figure 1 shows the expected number of tests for each procedure. While Dorfman’s procedure is the simplest to implement, it generally results in a larger expected number of tests. Halving generally results in the smallest number of tests, reaching a minimum of 220.77 for a group size of 500—the maximum group size included in the plot. The actual minimum expected number of tests for halving is 172.39 when groups of size 4,080 are used. On the right-side y-axis of the plot in Figure 1, the number of tests has been translated into a year’s length of time. Baltar says individual testing will take about 61 years. Using a group size of 500, the expected time halving takes is 101 days. Even for Dorfman’s procedure, testing will take only 1.45 years (1,155.64 expected tests) using groups of size 80. Given that “Battlestar Galactica” was on television for six years, there would have been plenty of time for testing to be completed. In general practice, the group size corresponding to the minimum expected number of tests is considered the “optimal” size. Usually, this optimal size is an
estimate because the overall prevalence, which is needed for its calculation, is unknown. In the Cylon example here, the overall prevalence is most likely known now, so Figure 1 allows one to see the optimal group sizes for each group-testing procedure. These group sizes often can be found mathematically, too. For example, one can show that the optimal group size for Dorfman’s procedure is approximately the square root of the inverse of the overall prevalence, resulting in 82.73. Of course, if Baltar implemented one of the group-testing procedures, the actual number of tests most likely would not be the same as the expected number of tests. There will be variability from application to application. Figure 2 gives bands illustrating an expected range for the number of tests using the mean ±3×(standard deviation). Applying Chebyshev’s Theorem, one would expect to observe the number of tests to be within this range at least 89% of the time. For larger group sizes, both Dorfman’s and Sterrett’s procedures can produce a lot of variability, leading to uncertainty in the number of tests. Alternatively, both halving and matrix pooling lead to much less uncertainty, so an extremely large number of tests will not likely happen. When comparing the procedures based on their optimal group sizes, though, Dorfman’s and Sterrett’s procedures result in a more reasonable amount of variability, but still more than halving and matrix pooling. Overall, if Baltar chose the halving procedure and a group size of 500, the number of tests likely would be between 127.06 and 314.47 (58 to 144 days).
Figure 2. Expected tests ±3×(standard deviation) bands plotted by group size for four procedures; group size increments of size 10 are used for constructing the plot; conversion to years of testing is on the right axis. (Axes: group size from 0 to 500; number of tests from 0 to 5,000 on the left; years from 0 to 6 on the right.)
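The expected numbers of tests and the optimal group size quoted above can be reproduced from the usual formula for Dorfman’s procedure: one pooled test per group, plus a full round of retests whenever a pooled test is positive. The sketch below treats every group as having exactly the stated size, which is a slight simplification.

from math import sqrt

N = 47905            # individuals in the fleet
p = 7 / N            # overall Cylon prevalence, about 0.0001461
HOURS_PER_TEST = 11

def dorfman_expected_tests(N, p, I):
    """Expected number of tests for Dorfman's procedure with groups of size I."""
    prob_positive_pool = 1 - (1 - p) ** I
    return N / I + N * prob_positive_pool

tests_80 = dorfman_expected_tests(N, p, 80)
print(round(tests_80, 1))                                # about 1,156, close to the 1,155.64 quoted above
print(round(tests_80 * HOURS_PER_TEST / (24 * 365), 2))  # about 1.45 years

# Optimal group size for Dorfman's procedure: roughly the square root of the
# inverse of the overall prevalence.
print(round(sqrt(1 / p), 2))                             # 82.73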
Additional Considerations
While optimal group sizes are nice to know, they often cannot be used in practice. Diagnostic testing procedures must be calibrated to ensure their accuracy is similar to that for individual testing. Using too large a group size can dilute the composited specimen to a point that prevents detection. Also, optimal group sizes can be chosen based on other measures. Commonly, cost is included in the calculation. Labor and the storage of specimens, which may take longer because of retesting, can all be factored into an optimal group size calculation. There are practicality issues that must be considered when choosing a group size, as well. For example, if a 10×10 well plate
is available, but a group size of 11 for matrix pooling is optimal, a group size of 10 will likely be used. Testing error will likely occur at some time with any diagnostic procedure. Because of the additional uncertainty in test results, both individual and group testing will have more tests and more variability than when test results are perfectly accurate. For example, confirmatory tests may be needed to confirm a positive test result. Diagnostic accuracy is defined in terms of two quantities: sensitivity (Se) and specificity (Sp). Sensitivity is the probability an individual or group is diagnosed to be positive from a single test, given the individual or group is truly positive. Specificity is similarly defined for a negative diagnosis given a true negative. For example, if Se = 0.95, Sp = 0.95, and groups of size 80 are used for Cylon detection, the expected number of tests for Dorfman’s procedure increases to 3,495.22 with an expected range of 2,562.19 to 4,428.26 tests (without including confirmatory tests). Still, the number of tests using Dorfman’s procedure is much less than for individual testing. Due to the group-testing protocols, the overall probability of diagnosing an individual to be positive given they are a true positive, called the “pooling sensitivity,” for many procedures is lower than for individual testing.
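To see where a figure like 3,495.22 comes from, the expected-test formula can be extended to allow imperfect tests. The sketch below assumes equal group sizes and errors that are independent across tests, which is an approximation on my part.

N = 47905
p = 7 / N
Se, Sp = 0.95, 0.95    # sensitivity and specificity of a single test
I = 80                 # group size

# A pool tests positive either because it truly contains a Cylon and the test
# detects it, or because it is Cylon-free and the test gives a false positive.
truly_positive = 1 - (1 - p) ** I
tests_positive = Se * truly_positive + (1 - Sp) * (1 - truly_positive)

expected_tests = N / I + N * tests_positive
print(round(expected_tests, 1))   # about 3,495, matching the value quoted above

# Pooling sensitivity under Dorfman's procedure: both the pooled test and the
# follow-up individual test must come back positive.
print(Se ** 2)                    # 0.9025, compared with Se = 0.95 for individual testing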
Table 1—Estimates of the 7/47,905 = 0.0001461 Overall Prevalence Using Only the Initial Group Test Outcomes from a Hierarchical Procedure

Group Size    Estimate     Standard Deviation    95% CI Lower Limit    95% CI Upper Limit    # of Positive Groups
5             0.0001462    0.00005524            0.00003789            0.0002544             7
10            0.0001462    0.00005526            0.00003791            0.0002545             7
50            0.0001466    0.00005542            0.00003802            0.0002553             7
100           0.0001472    0.00005629            0.00003816            0.0002562             7
500           0.0001293    0.00005281            0.00002584            0.0002328             6
1000          0.0001338    0.00005466            0.00002667            0.0002409             6

Note: Individuals are randomly put into groups of the size indicated in the table.
Exact formulas for the pooling sensitivity have been derived for some of the group-testing procedures examined here to better gauge the overall effect of testing error. For example, the pooling sensitivity for Dorfman’s procedure is Se², while for individual testing it is just Se because one test is performed for each individual. The pooling sensitivity can be increased by additional testing of individuals declared negative. In contrast, the pooling specificity—the probability an individual is declared negative through a group-testing procedure given they are a true negative—is often higher than under individual testing. Another general purpose of group testing is to estimate the prevalence of a trait in a population. Sometimes, only estimation is of interest and identification is not a goal. In addition to identifying individuals who are Cylon, Baltar may be interested in using all the initial group tests from a hierarchical procedure to estimate prevalence. This would give him an initial impression of the overall Cylon prevalence in the fleet. After first placing individuals randomly into groups, Table 1 gives the estimates, standard deviations, and 95% confidence intervals for the overall prevalence (supplementary materials at www.chrisbilder.com/grouptesting provide details for the formulas used).
For example, the estimated prevalence is 0.0001462 when only the group test outcomes from groups of five are used. Overall, all the estimates are close to the true prevalence of 0.0001461. Larger group sizes produce estimates a little further from the true prevalence, but this should be expected because less is observed from the population. Notice also that all confidence intervals capture the true prevalence. While these calculations are made for individuals randomly assigned to groups, similar results should be expected for other random groupings due to the small number of Cylons. In fact, it is unlikely to have groups with more than one Cylon, except in the case of the larger group sizes.
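The article points readers to supplementary materials for the estimation formulas. As a sketch, the following uses a standard estimator for equal-sized groups (the maximum likelihood estimator based on the proportion of positive pools, with a delta-method standard error and a Wald confidence interval), which reproduces the group-size-5 row of Table 1. The variable names are mine.

from math import sqrt

I = 5                        # group size
n_groups = 47905 // I        # 9,581 pools
positives = 7                # pools that tested positive (first row of Table 1)

theta_hat = positives / n_groups              # estimated probability a pool tests positive
p_hat = 1 - (1 - theta_hat) ** (1 / I)        # estimated individual-level prevalence
se = sqrt(theta_hat * (1 - theta_hat) ** (2 / I - 1) / (n_groups * I ** 2))

print(round(p_hat, 7))                        # 0.0001462
print(round(se, 8))                           # 0.00005524
print(round(p_hat - 1.96 * se, 8), round(p_hat + 1.96 * se, 7))   # about (0.00003789, 0.0002544)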
Conclusions
As in the real world, statistics could have played a significant role in solving the sci-fi problem in “Battlestar Galactica.” However, the consequences of implementing these procedures might have prematurely stifled fans’ enthusiasm for the show because the humanoid Cylons would have been identified earlier. While the series ended in 2009, a prequel, “Caprica,” premieres in 2010. “Caprica” will investigate topics such as how Cylons were first developed by humans. Of course, this makes one wonder if the use of statistical science could
have played a role in preventing the Cylon attack on the humans in the first place. One can only hope the producers will ask a statistician to serve as a consultant in the writing of the new show.
Further Reading
Dodd, R., Notari, E., and Stramer, S. (2002). “Current Prevalence and Incidence of Infectious Disease Markers and Estimated Window-Period Risk in the American Red Cross Donor Population.” Transfusion, 42:975–979.
Peck, C. (2006). “Going After BVD.” Beef, 42:34–44.
Remlinger, K., Hughes-Oliver, J., Young, S., and Lam, R. (2006). “Statistical Design of Pools Using Optimal Coverage and Minimal Collision.” Technometrics, 48:133–143.
Tebbs, J. and Bilder, C. (2004). “Confidence Interval Procedures for the Probability of Disease Transmission in Multiple-Vector-Transfer Designs.” Journal of Agricultural, Biological, and Environmental Statistics, 9:75–90.
Verstraeten, T., Farah, B., Duchateau, L., and Matu, R. (1998). “Pooling Sera to Reduce the Cost of HIV Surveillance: A Feasibility Study in a Rural Kenyan District.” Tropical Medicine and International Health, 3:747–750.
Visual Revelations
Howard Wainer, Column Editor
A Little Ignorance: How Statistics Rescued a Damsel in Distress Peter Baldwin and Howard Wainer
“A little ignorance is a dangerous thing.” — Apologies to Alexander Pope Although all the facts of the case described below are accurate, the names of the people and places have been modified to preserve confidentiality.
Since the passage of No Child Left Behind, the role of standardized test scores in contemporary education has grown. Students have been tested and teachers, schools, districts, and states have been judged. The pressure to get high scores has increased, and concern about cheating has increased apace. School officials desperately want high scores, but simultaneously do not want even the appearance of malfeasance. Thus, it was not entirely a surprise that a preliminary investigation took place after 16 of the 25 students in Jenny Jones’ third-grade class at Squire Allworthy Elementary School obtained perfect scores on the state’s math test. The principal became concerned because only about 2% of all third-graders in the state had perfect scores. This result, as welcome as it was, seemed so unlikely that it was bound to attract unwanted attention. The investigation began with a discussion between the principal and Jones, in which Jones explained that she followed all the rules specified by Ms. Blifil, a teacher designated by the principal as the administrator of the exam. One of the rules Blifil sent to teachers giving the exam was “point and look,” in which the proctor of the exam was instructed to stroll around the class during the administration of the exam and, if he or she saw a student doing something incorrectly, point to that item and look at the student. Jones did not proctor the exam, as she was administering it individually to a student in special education, but she instructed an aide on how to do it. Point and look seemed clearly forbidden by the state’s rules for test administration. On Page 11 of the administrators’
manual, it says, “Be sure students understand the directions and how to mark answers. Assist them with test-taking mechanics, but be careful not to inadvertently give hints or clues that indicate the answer or help eliminate answer choices.” That pretty much disqualifies point and look. The school’s investigative team, after learning about point and look, had found what it believed to be the cause of the high exam scores. To cement this conclusion, the school administration brought in an expert, Bill Thwackum, to confirm that, indeed, the obtained test results were too good to be true. Thwackum was a young PhD who studied measurement in graduate school. He was asked if the results obtained on the math exam were sufficiently unlikely to have occurred to justify concluding that students had been inappropriately aided by the exam proctor. Thwackum billed the district for 90 minutes of his time and concluded that, indeed, such a result was so unlikely that a fair test administration was not plausible. Jones was suspended without pay for a month and given an “unsatisfactory” rating. Additionally, her pay increment for the year was eliminated and her service for the year in question did not count toward her seniority. Through her union, she filed an appeal, and the union hired a statistician to look into the matter more carefully.
The Statistician’s Findings
“No harm, no foul” and the presumption of innocence help constitute the foundations upon which the great institutions of basketball and justice, respectively, rest. The adjudication of Ms. Jones’ appeal relied heavily on both principles. Ms. Jones admitted to instructing the exam proctor about the “point and look” rule—her culpability on this point is not in dispute. However, she was not disciplined for incorrectly instructing the proctor; rather, she was disciplined for the alleged effect of the proctor’s behavior. This distinction places a much greater burden on the school because they must show that the “point and look” intervention not only took place, but that it had an effect. That is, the school must demonstrate that Ms. Jones’ rule violation (foul) unfairly improved her students’ scores (harm)—failure to show such an effect exculpates her. Further, here the burden of proof lies on the accuser, not the accused. That is, Ms. Jones is presumed innocent unless the school can prove otherwise. And, the scarcity of high-quality student-level data made proving that the effect could be zero much less challenging than proving that it could not have been. The subsequent investigation had two parts. The first was to show it is believable that the cause (the “point and look” rule) had no effect. The second, and larger, part was to estimate the size of the alleged effect (the improvement in performance attributable to this cause), or—as we shall discuss—show that the size of the effect could have been zero.
The Cause
Because the principle of “no harm, no foul” makes the effect—not the cause—of primary concern, our interest in the cause is limited to showing it is plausible that the aide’s behavior had no effect on children’s scores. This is not difficult to imagine if it is recalled that Jones’ third-grade class was populated with 8-year-olds. Getting them to sit still and concentrate on a lengthy exam is roughly akin to herding cats. In addition, there were more than 10 forms of the test handed out at random, so that the aide, in looking down on any specific exam at random, was unlikely to be seeing the same question she saw previously. So, what was the kind of event that likely led to her pointing and looking? Probably pointing to a question that had been inadvertently omitted, or perhaps suggesting a student take the pencil out of his nose—or stop hitting the student in front of him with it. The developers of the test surely had such activities in mind when they wrote the test manual, for on Page 8 of the instructions, they said, “Remind students who finish early to check their work in that section of the test for completeness and accuracy and to attempt to answer every question.” So, it is plausible that the aide’s activities were benign and had little or no effect on test scores. Given this, both the prosecution and defendants sought to measure the alleged effect: Thwackum seeking to show the effect could not have been zero while we tried to show the effect could have been zero. Of course, this was an observational study, not an experiment. We don’t know the counterfactual result of how the students would have done had the aide behaved differently. Therefore, the analysis must use covariate information, untestable assumptions, and the other tools of observational studies.
The Effect
First, we must try to understand how Thwackum so quickly arrived at his decisive conclusion. Had he been more naïve, he might have computed the unconditional binomial probability of 16 out of 25 perfect scores based on the state-wide result of 2% perfect scores. But, he knew that would have assumed all students were equally likely to get a perfect score.
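For reference, that naive calculation is easy to write down; the sketch below simply treats each of the 25 students as independently having the statewide 2% chance of a perfect score, which is exactly the assumption in question.

from math import comb

n, k, p = 25, 16, 0.02   # class size, perfect scores observed, statewide perfect-score rate
prob = sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))
print(prob)              # on the order of 10**-21 under this (inappropriate) assumption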
Figure 1. Statewide frequency distribution of 3rd-grade math scores (raw math score on the horizontal axis, from 4 to 61; frequency on the vertical axis, from 0 to 10,000).
He must have known this was an incorrect assumption and that he needed to condition on some covariate that would allow him to distinguish among students. He had close at hand the results of a reading test with the unlikely name of DIBELS for the same students who were in Jones’ class. He calculated the regression of the DIBELS score against the math score on an independent sample and found, using this model with the DIBELS scores as the independent variable, that it predicted not a single perfect score on the math test for any of the students in Jones’ class. As far as Thwackum was concerned, that sealed the case against the unfortunate Jones. But, it turned out that aside from being immediately available, DIBELS has little going for it. It is a short and narrowly focused test that uses one-minute reading passages scored in a rudimentary fashion. It can be reliably scored, but it has little generality of inference. In an influential study of its validity, the authors of An Evaluation of End-Grade-3 Dynamic Indicators of Basic Early Literacy Skills (DIBELS): Speed Reading Without Comprehension concluded that, “… DIBELS mis-predicts reading performance on other assessments much of the time and, at best, is a measure of who reads quickly without regard to whether the reader comprehends what is read.” And, more to the point of this investigation, when Thwackum used DIBELS to predict math scores, the correlation between the two was approximately 0.4. Thus, it was not surprising that the predicted math scores were regressed toward the center of the distribution and no perfect scores were portended.
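A rough way to see why a predictor that correlates only about 0.4 with math scores can never “predict” a perfect math score is to work on standardized scales. The sketch below treats both scores as roughly normal purely for illustration (the actual math-score distribution is heavily skewed, as discussed below), so the predicted math z-score is simply the correlation times the reading z-score.

r = 0.4   # approximate correlation between DIBELS and math scores
for z_reading in (1.0, 2.0, 3.0):
    print(z_reading, round(r * z_reading, 2))
# Even a reading score three standard deviations above the mean predicts a math
# score of only about +1.2 standard deviations, far below the region (roughly
# the top 2% of students) where perfect scores actually occur.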
Is there a more informative way to study this? First, we must reconcile ourselves to the reality that additional data are limited. What were available were the reading and math scaled scores for all the students in the school district containing Squire Allworthy School and some state and county-wide summary data. Failure to accommodate the limitations of these data may produce a desirable result, but not one that bears scrutiny (as Thwackum discovered). We can begin by noting that the third-grade math test does a relatively poor job of discriminating among higher-performing students. That is, the distribution of scores is highly skewed toward the upper end. This is obvious in the frequency distribution of raw scores shown in Figure 1. Note that although only 2.03% of all third-graders statewide had perfect scores of 61 on the test, about 22% had scores of 58 or greater. There is only a tiny difference in performance between perfect and very good. In fact, more than half the children (54%) taking this test scored 53 or greater. Thus, the test does not distinguish well among the best students—few would conclude a third-grader who got 58 right answers out of 61 questions on a math test was demonstrably worse than one who got 61 answers right. And no one knowledgeable about the psychometric variability of tests would claim such a score difference was reliable enough to replicate in repeated testing. But Figure 1 is for the entire state. What about this particular county? What about Squire Allworthy Elementary School? To answer this, we can compare the county, school-level, and classroom-level results with the state using the summary data
available to us. One such summary for reading—which is not in dispute—is shown in Table 1. We see the county did better than the state, the district did better than the county, Squire Allworthy did better than the district (with 93% of its students being classified as advanced or proficient), and Jones’ students did better than Squire Allworthy (with all examinees scoring at the proficient or advanced level). Clearly, Jones’ students are in elite company, performing at a very high level compared to other third-graders in the state. Let us add that, within the district, reading and math scores correlated positively (r = .55).

Table 1—Third-Grade Reading 2006: Performance Level Percentages

                           Advanced    Proficient    Basic    Below Basic
State                      31%         38%           15%      16%
County                     45%         36%           10%      9%
District                   55%         35%           6%       4%
Squire Allworthy School    58%         35%           6%       1%
Jones’ Students            84%         16%           0%       0%
If we regress reading onto math, we find the regression effect makes it impossible to predict a perfect math score—even from a perfect reading score. In part, this is due to the ceiling effect we discussed previously (the prediction model fails to account for the artificially homogeneous scores at the upper end of the scale), and in part it’s due to the modest relationship between reading and math scores. Thus, when Thwackum used this approach with an even poorer predictor—DIBELS scores—it came as no surprise that Jones’ students’ performance seemed implausible.
Observations
Thwackum’s strategy was not suitable for these data because a regression approach must perforce under-predict high performers. An even simpler approach, which circumvents this problem, is to look at only examinees with perfect math scores. In addition to Jones’ 16 students with perfect scores, there were 23 non-Jones students with perfect math scores within the Squire Allworthy School’s district. We can compare reading scores for Jones’ students with those for non-Jones students. There are three possible outcomes here:
1. Jones’ students do better than non-Jones students
2. All students perform the same
3. Jones’ students do worse than non-Jones students
We observed above that there was a modestly strong positive relationship between reading and math proficiency. For this subset of perfect math performers, there is no variability among scores, so no relationship between reading and math can be observed. Nevertheless, failure to observe any relationship can be plausibly attributed to the ceiling effect, rather than a true absence of relationship between reading and math proficiency. So, we could reasonably suggest that if Jones’ perfect performers were unfairly aided, their reading scores should be lower on average than non-Jones perfect performers. That is, if option 3 is shown to be true, it could be used as evidence in support of the school district’s case against Jones. However, if Jones’ students do equally well or better than non-Jones students—options 1 and 2—or if we lack the power to identify any differences, we must conclude that the effect could have been zero, at least based on this analysis. Typically, if we wanted to compare reading score means for Jones’ 16 students and the district’s 23 students with perfect
math scores, we would make some distributional assumptions and perform a statistical test. However, even that is not necessary here because Jones’ students’ reading mean is higher than the district students’ mean. The district’s 23 students with perfect math scores had a mean reading score of 1549 with a standard error of 30, whereas the 16 students in Jones’s class with perfect math scores had a mean reading score of 1650 with a standard error of 41. The box plots in Figure 2 compare the distribution of reading scores for the two groups of students with perfect math scores. This plot follows common convention, with the box containing the middle 50% of the data and the cross line representing the median. The dotted vertical lines depict the upper and lower quartiles. We can see that Jones’s students’ reading scores are not below the reference group’s (as would be expected had her intervention produced the alleged effect). On the contrary, her 16 perfect math students did noticeably better on reading than non-Jones students who earned perfect math scores. This suggests Jones’s students’ perfect math scores are not inconsistent with their reading scores. And, in fact, if the math test had a higher top, her students would be expected to do better still. Thus, based on the data we have available, we cannot reject the hypothesis that Jones’ students came by their scores honestly.
Figure 2. Reading scores for students with perfect math scores (box plots for Ms. Jones’ students and everyone else; vertical axis: reading score, roughly 1400 to 2000).
Another way to think about this is to predict math scores from reading scores and then compare the predicted scores for the students whose perfect scores are not being questioned to those from Jones’ class. The results of this are shown in the box plots in Figure 3. Of course, these predicted scores are nothing more than a linear transformation of the reading scores from Figure 2. So, it comes as no surprise that we once again see the students in Jones’ class who obtained perfect scores predicted to do better than those other students in the district whose perfect scores were unquestioned. Once again, this provides additional evidence against the hypothesis that the perfect scores obtained by Jones’ students are not legitimate. In short, there is no empirical evidence that Jones’ intervention produced an effect of any kind. Her students’ scores appear no less legitimate than those of students from other classes in the district.
Figure 3. Predicted math scores for students with perfect observed math scores (box plots for Ms. Jones’ students and everyone else; vertical axis: predicted math score, roughly 1400 to 1800). Note: This plot uses the regression model based on all district data except Jones’ students and the 23 non-Jones students with perfect math scores.
The Outcome
On December 5, 2008, there was an arbitration hearing to decide the outcome of Jones’ appeal. Faced with these results, the representatives of the district decided to settle a few minutes in advance of the hearing. Jones’ pay and seniority were restored, and her “unsatisfactory” rating was replaced with one better suited to an exemplary teacher whose students perform at the highest levels. Blifil was relieved of further responsibility in the oversight of standardized exams, and the practice of “point and look” was retired. The statisticians who did these analyses were rewarded with Jones’ gratitude, the knowledge that they helped rectify a grievous wrong, and a modest honorarium. It is hoped Thwackum learned that a little ignorance is a dangerous thing.
Further Reading
Pressley, M., Hilden, K., and Shankland, R. (2002). An Evaluation of End-Grade-3 Dynamic Indicators of Basic Early Literacy Skills (DIBELS): Speed Reading Without Comprehension. East Lansing, MI: Literacy Achievement Research Center.
Column Editor: Howard Wainer, Distinguished Research Scientist, National Board of Medical Examiners, 3750 Market Street, Philadelphia, PA 19104;
[email protected]
Here’s to Your Health
Mark Glickman, Column Editor
Noninferiority Clinical Trials
Scott Evans
Clinical trials have become the gold standard study design for demonstrating the effectiveness and safety of an intervention. The objective of clinical trials is to reduce or eliminate bias and control confounding using tools such as prospective observation, randomization, blinding, and the use of control groups to isolate the effect of an intervention and establish cause and effect. Perhaps the simplest clinical trial design is a single-arm trial in which the intervention is applied to a group of study participants. After allowing a reasonable amount of time for the intervention to have an effect, the study participants are evaluated for a response. Although logical, this design cannot separate the effects of the intervention from the effects of natural history. There is no assurance that a similar response would not have been observed had the study participants not been given the intervention. Also, the observed responses may, in part, be a result of the “placebo effect,” where the belief of being treated affects subsequent responses. Trial designs that use a “control group” can help address these concerns and put the results of an intervention arm into perspective. The most fundamental design is the placebo-controlled trial, in which eligible study participants are randomized to the intervention or a placebo (an inert “fake” intervention). Study participants are then followed over time, and results from the two randomized arms are compared. If the intervention arm can be shown to be more effective than the placebo arm, the effect of the intervention has been demonstrated. However, how can researchers implement a placebo-controlled design for serious diseases where it may be unethical to randomize study participants to a placebo, particularly when effective interventions are available? Noninferiority trial designs have been developed to address this issue. The rationale for noninferiority trials is that to appropriately evaluate an intervention, a comparison to a control group is necessary to put the results of an intervention arm into context. However, for the targeted medical indication, randomization to a placebo is unethical due to the availability of a proven effective therapy. In noninferiority trials, an existing effective therapy is selected to be the “active” control group. For this reason, noninferiority trials are also called “active-controlled trials.”
Noninferiority versus Superiority
The objective of a noninferiority trial is different than that of a placebo-controlled trial. No longer is it necessary to show that the intervention is superior to the control; rather, it is desirable to show that the intervention is “at least as good as” or “no worse than” (i.e., noninferior to) the active control. It is hoped the intervention is better than the active control in other ways (e.g., less expensive, better safety profile, better quality of life, different resistance profile, or more convenient or less invasive to administer). For example, researchers seek less-complicated or less-toxic antiretroviral regimens that can display similar efficacy to existing regimens in the treatment of HIV.
Figure 1. Possible confidence interval outcome scenarios for P1–P2 in noninferiority trials. P1 is the efficacy of the new therapy. P2 is the efficacy of the control group. –M is the noninferiority margin.
Noninferiority cannot be demonstrated with a nonsignificant test of superiority. The traditional strategy of a noninferiority trial is to select a noninferiority margin (M) and, if treatment differences can be shown to be within the noninferiority margin (i.e., < M), then noninferiority can be claimed. The null and alternative hypotheses are H0: δT,active control ≥ M and HA: δT,active control < M, where δT,active control is the effect of the intervention relative to the active control (the amount by which the intervention is worse). The standard analysis is to construct a confidence interval for the difference between arms and note whether the entire confidence interval is within the bounds of the noninferiority margin. For example, if the primary endpoint is binary (e.g., response vs. no response), a confidence interval for the difference in response rates (intervention minus the active control) can be constructed. If the lower bound of the confidence interval is greater than –M, then important differences can be ruled out with reasonable confidence and noninferiority can be claimed. In Figure 1, confidence intervals A–F represent potential noninferiority trial outcome scenarios. The intervals have different centers and widths. If the trial is designed to evaluate superiority, then a failure to reject the null hypothesis results from scenarios A and D (since the confidence interval does not exclude zero). Inferiority is concluded from scenarios B, C, and E, whereas superiority is concluded from scenario F. If the trial is designed as a noninferiority trial, then a failure to reject the null hypothesis of inferiority results from scenarios A,
B, and C, but noninferiority is claimed in scenarios D, E, and F since the lower bound of the interval is > –M. Some confusion often results from scenario E, in which inferiority is concluded from a superiority trial but noninferiority is concluded from a noninferiority trial. This case highlights the distinction between statistical significance (i.e., the confidence interval excludes 0) and clinical relevance (i.e., the differences are less than M). Scenario A is a case in which neither superiority, inferiority, nor noninferiority can be claimed because the
confidence interval is too wide. This may be due to a small sample size or large variation.
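As a concrete version of the decision rule just described, the sketch below computes a Wald confidence interval for the difference in response rates and checks its lower bound against the margin. The counts and the margin of 0.10 are made-up numbers used only for illustration.

from math import sqrt

def noninferiority_check(x_new, n_new, x_ctrl, n_ctrl, margin, z=1.96):
    """95% Wald CI for the difference in response rates (new minus active
    control); noninferiority is claimed if the lower bound exceeds -margin."""
    p1, p2 = x_new / n_new, x_ctrl / n_ctrl
    se = sqrt(p1 * (1 - p1) / n_new + p2 * (1 - p2) / n_ctrl)
    lower, upper = (p1 - p2) - z * se, (p1 - p2) + z * se
    return lower, upper, lower > -margin

# Hypothetical trial: 160/200 responders on the new therapy, 158/200 on the
# active control, noninferiority margin M = 0.10.
print(noninferiority_check(160, 200, 158, 200, margin=0.10))
# Roughly (-0.069, 0.089, True): the interval covers zero but stays above -M,
# so this is scenario D in Figure 1 and noninferiority is claimed.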
Examples
Noninferiority clinical trials have become common in clinical research. Noninferiority trials can be “positive,” resulting in claims of noninferiority, or “negative,” resulting in an inability to make a noninferiority claim. Noninferiority trials are applied to evaluate interventions in many types of disease areas. One example is the AIDS Clinical Trials Group 116A, which was a positive noninferiority trial with a time-to-event endpoint. The trial was designed to compare two antiretroviral therapies for the treatment of HIV infection by evaluating if didanosine (500 mg/day and 750 mg/day) was noninferior to zidovudine. In 1989, zidovudine was the only approved antiretroviral and had been shown to be better than placebo in reducing HIV disease progression. More treatments were needed due to resistance development. A trial of didanosine vs. placebo would not be ethical due to the known efficacy of zidovudine and the seriousness of the consequences of untreated HIV infection. The primary endpoint was the time to an AIDS-defining event or death. It was determined that noninferiority would be concluded if the upper bound of the 90% CI for the hazard ratio (didanosine to zidovudine) was <1.6. The resulting 90% confidence intervals were (0.79, 1.33) and (0.80, 1.34) for the 500 and 750 mg/day arms, respectively, thus resulting in a conclusion of noninferiority. By contrast, the PROFESS study was a negative noninferiority trial with a time-to-event endpoint. The trial concluded that aspirin plus extended-release dipyridamole was not noninferior to clopidogrel for stroke prevention. The primary endpoint was recurrent stroke, and a noninferiority margin was set at a 7.5% difference in relative risk. The 95% CI for the hazard ratio was (0.92, 1.11). As the upper bound of the CI was greater than 1.075, noninferiority could not be concluded. Noninferiority trials do not have to use time-to-event endpoints, but can use a variety of
types of endpoints. For example, in 2003, the Food and Drug Administration (FDA) reviewed a New Drug Application based on a randomized, double-blind, multicenter trial evaluating whether piperacillin/tazobactam (4 g/500 mg) was noninferior to imipenem/cilastatin (500 mg/500 mg) administered intravenously every six hours to treat nosocomial pneumonia in hospitalized patients based on a binary endpoint (response vs. no response). It was determined that noninferiority would be claimed if the lower bound for the 95% CI for the difference in response rates (piperacillin/tazobactam minus imipenem/cilastatin) was greater than -0.20 (i.e., it could be shown that the response rate for piperacillin/tazobactam was not more than 20% worse than imipenem/cilastatin). The response rates in the imipenem/cilastatin and piperacillin/tazobactam arms were (60/99=60.6%) and (67/98=67.7%), respectively, resulting in a lower bound for the 95% CI of -0.066 (>-0.20). Piperacillin/tazobactam was judged to be noninferior to imipenem/cilastatin and was approved by the FDA. Similarly, in a clinical trial evaluating treatments for newly diagnosed epilepsy, Keppra was shown to be noninferior to Carbatrol. The primary endpoint was six months of freedom from seizure, and a noninferiority margin was set at a 15% difference. The 95% CI for the risk difference was (-7.8%, 8.2%), thus noninferiority was concluded.
Running a Successful Noninferiority Trial
Two important assumptions associated with the design of noninferiority trials are constancy and assay sensitivity. In noninferiority trials, an active control is selected because it has been shown to be efficacious (e.g., superior to placebo) in a historical trial. The constancy assumption states that the effect of the active control over placebo in the historical trial would be the same as the effect in the current trial if a placebo group was included. This may not be the case if there were differences in trial conduct (e.g., differences in treatment administration, endpoints, or population) between the historical and current trials. This assumption cannot be tested in the current trial without a placebo group. The development of resistance is one threat to the constancy assumption. For example, Staphylococcus aureus (“staph”) is a bacterial infection that lives on the skin. Historically, staph was commonly acquired in hospital settings and successfully treated by antibiotics such as penicillin, amoxicillin, and methicillin. Recently, staph is more commonly spread outside the hospital setting (i.e., “community acquired”) and has become resistant to treatment with antibiotics. There is a lack of data to support a claim that these antibiotics are currently efficacious against methicillin-resistant Staphylococcus aureus (MRSA). Thus, use of these antibiotics as an active control in a noninferiority study could result in showing noninferiority to therapies that are no better than placebo.
To enable an evaluation of the retention of some of the effect of the active control over placebo, study participants, endpoints, and other important design features should be similar to those used in the trials for demonstrating the effectiveness of the active control over placebo. One can then indirectly assess the constancy assumption by comparing the effectiveness of the active control in the noninferiority and historical trials. Noninferiority trials are appropriate when there is adequate evidence of a defined effect size for the active control so that a noninferiority margin can be justified. A comprehensive synthesis of the evidence that supports the effect size of the active control and the noninferiority margin should be assembled. For these reasons, the data may not support a noninferiority design for some indications, such as acute bacterial sinusitis, acute bacterial exacerbation of chronic bronchitis, and acute bacterial otitis media. “Assay sensitivity” is another important assumption in the design of noninferiority trials. The assumption of assay sensitivity states that the trial is designed in such a way that it is able to detect differences between therapies if they exist. Unless the instrument measuring treatment response is sensitive enough to detect differences, the therapies will display similar responses due to the insensitivity of the instrument, possibly resulting in erroneously concluding noninferiority. The endpoints selected, how they are measured, and the conduct and integrity of the trial can affect assay sensitivity.
Selecting the Active Control
The active control in a noninferiority trial should be selected carefully. Regulatory approval does not necessarily imply that a therapy can be used as an active control. The active control ideally will have clinical efficacy that is of substantial magnitude, estimated with precision in the relevant setting in which the noninferiority trial is being conducted, and preferably quantified in multiple trials. Since the effect size of the active control relative to placebo is used to guide the selection of the noninferiority margin, superiority to placebo must be reliably established and measured. Assurance that the active control would be superior to placebo if a placebo was employed in the trial is necessary. Recently, there has been concern over the development of noninferiority studies using active controls that violate the constancy assumption (i.e., active control efficacy has changed over time) or that do not have proven efficacy over placebo. Research teams often claim that placebo-controlled trials are not feasible because placebos are unethical due to the existence of other interventions, patients are unwilling to enroll in placebo-controlled trials, and institutional review boards question the ethics of the use of placebos in these situations. For example, the Oral HIV AIDS Research Alliance developed a trial for the treatment of HIV-associated oral candidiasis in resource-limited countries. Fluconazole is a standard-of-care therapy in the United States, but is not readily available in resource-limited settings due to high costs. Nystatin is used as a standard-of-care therapy in many resource-limited settings. Gentian violet, an inexpensive topical agent, has shown excellent in vitro
activity against oral candidiasis. A trial evaluating the noninferiority of gentian violet to nystatin was proposed. However, despite the standard use of nystatin, there were no published results from randomized trials that documented the efficacy of nystatin over placebo. Thus, a noninferiority trial could not be justified and a simple comparison of the two therapies was proposed. When selecting the active control for a noninferiority trial, one must consider how the efficacy of the active control was established (e.g., by showing noninferiority to another active control vs. showing superiority to placebo). If the active control was shown to be effective via a noninferiority trial, then one must consider the concern for biocreep. Biocreep is the tendency for a slightly inferior therapy (but within the margin of noninferiority) that was shown to be efficacious via a noninferiority trial to be the active control in the next generation of noninferiority trials. Multiple generations of noninferiority trials using active controls that were, themselves, shown to be effective via noninferiority trials could eventually result in the demonstration of the noninferiority of a therapy that is not better than placebo. Logically, noninferiority is not transitive: If A is noninferior to B, and B is noninferior to C, then it does not necessarily follow that A is noninferior to C. For these reasons, noninferiority trials should generally choose the best available active controls.
Selecting a Noninferiority Margin
The selection of the noninferiority margin in noninferiority trials is a complex issue and one that has created much discussion. In general, the selection of the noninferiority margin is done in the design stage of the trial and is used to help determine sample size. Defining the noninferiority margin in noninferiority trials is context-dependent and plays a direct role in the interpretation of the trial results. The selection of the noninferiority margin is subjective but structured, requiring a combination of statistical reasoning and clinical judgment. Conceptually, one may view the noninferiority margin as the “maximum treatment difference that is clinically irrelevant” or
the “largest efficacy difference that is acceptable to sacrifice in order to gain the advantages of the intervention.” Defining this difference often requires interaction between statisticians and clinicians. As one indirect goal of a noninferiority trial is to show that the intervention is superior to placebo, some of the effect of the active control over placebo needs to be retained (often termed “preserving a fraction of the effect”). Thus, the noninferiority margin should be selected to be smaller than the effect size of the active control over placebo. Researchers should review the historical data that demonstrated the superiority of the active control to placebo to aid in defining the noninferiority margin. Researchers must also consider the within- and across-trial variability in estimates. Ideally, the noninferiority margin should be chosen independent of study power, but practical limitations may arise since the selection of the noninferiority margin dramatically affects study power. One strategy for preserving the estimate of the effect is to set the noninferiority margin to 50% of the estimated active control effect vs. placebo. For example, the STAR trial was designed to evaluate the noninferiority of raloxifene to tamoxifen on the primary endpoint of invasive breast cancer. An earlier trial of tamoxifen vs. placebo resulted in an estimate of relative risk of 2.12 (95% CI = (1.52, 3.03)) favoring tamoxifen. Thus, one option considered for defining the noninferiority margin for the raloxifene trial was 1 + [(2.12 - 1)/2] = 1.56. If the upper bound of the 95% confidence interval for the relative risk of raloxifene vs. tamoxifen was less than 1.56, then noninferiority would be demonstrated. Note, however, that this method does not consider that the estimate of tamoxifen vs. placebo is subject to uncertainty. To account for the variability of the estimate of tamoxifen vs. placebo, the “95%–95% confidence interval method” could be used. In this strategy, the noninferiority margin is set to the lower bound of the 95% confidence interval for the effect of placebo vs. tamoxifen. If the upper bound of the 95% confidence interval for raloxifene vs. tamoxifen was less than 1.52, then noninferiority would be demonstrated. This criterion is stringent and depends directly on the evidence from historical trials. A poor choice of a noninferiority margin can result in a failed noninferiority trial. The TARGET trial was designed to evaluate the noninferiority of tirofiban to abciximab (two glycoprotein IIb/IIIa inhibitors) for the treatment
of coronary syndromes. A noninferiority margin for the hazard ratio (HR) of 1.47 was selected (50% of the effect of abciximab vs. placebo in the EPISTENT trial). The trial was viewed as poorly designed because an agent with an HR = 1.47 would not have been considered noninferior to abciximab. In the SPORTIF V trial, ximelagatran was compared to warfarin (active control) for stroke prevention in atrial fibrillation patients. The event rate for warfarin was 1.2%, and the noninferiority margin was set at 2% (absolute difference in event rates) based on historical data. As the event rate in the warfarin arm was low, noninferiority could be concluded even if the trial could not rule out a doubling of the event rate. For these reasons, the selection of the noninferiority margin should incorporate statistical considerations and clinical relevance considerations. A natural question is whether a noninferiority margin can be changed after trial initiation. In general, there is little concern regarding a decrease in the noninferiority margin. However, increasing the noninferiority margin can be perceived as manipulation unless appropriately justified (i.e., based on external data that is independent of the trial).
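The two margin-setting strategies described above can be restated as a short calculation using the historical tamoxifen-versus-placebo numbers quoted for the STAR example; nothing here goes beyond the figures already given in the text.

# Historical estimate quoted above: RR = 2.12 with 95% CI (1.52, 3.03).
rr_hat, rr_lower = 2.12, 1.52

# Preserve 50% of the point estimate of the effect:
print(1 + (rr_hat - 1) / 2)   # 1.56

# 95%-95% confidence interval method: use the lower confidence limit instead.
print(rr_lower)               # 1.52, a more stringent margin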
Sample Sizes and Interim Analyses
Sample sizes for noninferiority trials are generally believed to be larger than for superiority trials. However, this really depends on the selection of the noninferiority margin and other parameters. Required sample sizes increase with a decreasing noninferiority margin. Stratification can help, as adjusted confidence intervals are generally narrower than unadjusted confidence intervals. Researchers should power noninferiority trials for per-protocol analyses and intent-to-treat (ITT) analyses, given the importance of both analyses. Researchers also need to weigh the costs of type I error (i.e., incorrectly claiming noninferiority) and type II error (i.e., incorrectly failing to claim noninferiority). For binary and continuous endpoints, a standard approach to sizing noninferiority trials is to reverse the roles of the type I and II error probabilities when calculating the sample size for a superiority trial. Another simple approach to sizing a noninferiority trial is to view the trial from an estimation perspective. The strategy is to estimate the difference between treatments with appropriate precision (as measured by the width of a confidence interval). Then, size the study to ensure the width of the confidence interval for the difference between treatments is acceptable. ACTG 5263 used this approach to estimate the difference in event rates (clinical progression of Kaposi’s sarcoma (KS) or death) between two chemotherapies for the treatment of HIV-associated advanced KS disease. The research team concluded that the widest acceptable width of a confidence interval for the difference in event rates was 20%. Data from the AIDS Malignancy Consortium was used to estimate the event rates at 50%, and the team further noted that the width of a confidence interval would be maximized when event rates are 50% in both arms. A sample size was calculated to ensure the width of a confidence interval for the difference in event rates between two arms would be less than 20% when event rates were 50% in both arms. If the event rates were not 50%, then the confidence interval width would be less than 20%.
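A back-of-the-envelope version of that calculation might look like the following; it uses a normal-approximation (Wald) interval for the difference in event rates, which is an assumption on my part, so the trial's actual numbers may differ.

from math import ceil

z, max_width, p = 1.96, 0.20, 0.50   # 95% CI, maximum acceptable width, worst-case event rate

# Full width of the Wald CI for a difference of two proportions with n per arm:
#   width = 2 * z * sqrt(2 * p * (1 - p) / n)
# Solving width <= max_width for n:
n = (2 * z / max_width) ** 2 * 2 * p * (1 - p)
print(ceil(n))   # about 193 participants per arm under these assumptions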
Interim analyses of noninferiority trials can be complicated. It generally takes overwhelming evidence to justify stopping a trial for noninferiority at an interim analysis. Also, there may not be an ethical imperative to stop a trial that has shown noninferiority (in contrast to superiority studies, where, if superiority is demonstrated, there may be an ethical imperative to stop the study, since continued randomization to an inferior arm may be viewed as unethical). In addition, even if noninferiority is demonstrated at an interim time point, it may be desirable to continue the study to assess whether superiority could be shown with trial continuation. It is not uncommon to stop a noninferiority trial for futility (i.e., being unable to show noninferiority). Repeated confidence intervals that control error rates, combined with predicted interval plots, can aid data-monitoring committees with interim decision making.
Two Implicit Sub-Objectives
The traditional approaches to the design and analysis of noninferiority trials have recently been critiqued for failing to distinguish between two distinct sub-objectives of noninferiority trials: demonstrating that the intervention is noninferior to the active control and demonstrating that the intervention is superior to placebo, taking into account historical evidence. The design of a noninferiority trial can then be accomplished by planning to test two hypotheses. Let δT,AC denote the true difference between the intervention and the active control, δAC,P the difference between the active control and placebo, and δT,P the difference between the intervention and placebo, with smaller values favoring the first-listed treatment. To evaluate whether the intervention is noninferior to the active control, a noninferiority margin M could be selected based on clinical considerations (not on the effect of the active control vs. placebo) that will allow a noninferiority claim if supported by the data. Using standard methods for noninferiority trials, one tests H10: δT,AC ≥ M vs. H1A: δT,AC < M. A direct comparison of the intervention to placebo is not possible. However, if the constancy assumption holds, the difference between the intervention and placebo can be estimated by combining the noninferiority trial with the historical placebo-controlled trial of the active control (i.e., δT,P = δT,AC + δAC,P), and one tests H20: δT,P = 0 vs. H2A: δT,P < 0. Adjustments can be made if the constancy assumption does not hold. For example, suppose the current effect of the active control vs. placebo is (1 − λ) times the historical effect. Then δT,P,λ = δT,AC + (1 − λ)δAC,P, and we test H20: δT,P,λ = 0. If the goal of the noninferiority trial is to reject both hypotheses, sample size calculations can be performed for each test and the larger of the two required sample sizes selected. A particular trial may accomplish only one of the two sub-objectives. If the intervention is shown to be superior to placebo but fails to demonstrate noninferiority to the active control, use of the intervention may be indicated for patients in whom the active control is contraindicated or unavailable. Conversely, the intervention could be shown to be noninferior to the active control but not superior to placebo; this may occur when the efficacy of the active control is modest. Recently, there have been claims that the second of the two sub-objectives (i.e., demonstrating superiority to placebo) is the objective of interest in the regulatory setting.
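A minimal sketch of the second test, under the constancy assumption, combines the estimated intervention-vs.-active-control difference from the current trial with the historical active-control-vs.-placebo estimate, adds their variances (assuming the two sources are independent), and forms a z-statistic. The function name, the λ-style discounting argument, and all numeric inputs below are hypothetical illustrations, not values from any trial discussed above.

```python
import math
from statistics import NormalDist

def synthesis_test(d_T_AC, se_T_AC, d_AC_P, se_AC_P, lam=0.0):
    """One-sided test of H20: delta(T,P) = 0 vs. H2A: delta(T,P) < 0.

    d_T_AC, se_T_AC: estimated difference (intervention minus active control)
        and its standard error from the current noninferiority trial.
    d_AC_P, se_AC_P: estimated difference (active control minus placebo) and
        its standard error from the historical trial; negative when the
        active control is effective.
    lam: discount applied to the historical effect when constancy is in
        doubt (lam = 0 corresponds to full constancy).

    The two estimates are treated as independent, so their variances add.
    """
    d_T_P = d_T_AC + (1.0 - lam) * d_AC_P
    se_T_P = math.sqrt(se_T_AC ** 2 + ((1.0 - lam) * se_AC_P) ** 2)
    z = d_T_P / se_T_P
    p_one_sided = NormalDist().cdf(z)  # small p supports superiority to placebo
    return d_T_P, se_T_P, z, p_one_sided

# Hypothetical absolute event-rate differences, for illustration only.
print(synthesis_test(d_T_AC=0.01, se_T_AC=0.02, d_AC_P=-0.10, se_AC_P=0.03))
```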
Industry groups have argued that regulatory approval of new therapies should be based on evidence of superiority to placebo (demonstration of clinically meaningful benefit), not necessarily noninferiority to an active control. Proponents of this perspective (often termed the “synthesis method”) pose several dilemmas and inconsistencies with traditional approaches to noninferiority trials in support of this position. First, the intervention could look better than the active control but not meet the preservation-of-effect condition. Second, two trials with different active controls have different standards for success. Third, if the intervention is shown to be superior to an active control, should the active control be withdrawn from the market? The basic argument is that the required degree of efficacy should be independent of the design (superiority vs. noninferiority) and that superiority to placebo is the standard for regulatory approval. Proponents of the synthesis method thus argue that “noninferiority trial” terminology is inappropriate because the superiority of the intervention to placebo is the true objective. One scientifically attractive alternative design is a three-arm trial consisting of the intervention, the active control, and a placebo arm. This design is particularly attractive when the efficacy of the active control has changed, is volatile, or is in doubt. It allows direct assessment of noninferiority and of superiority to placebo, as well as within-trial validation of the noninferiority margin. Unfortunately, this design is not frequently implemented due to concerns about the ethics of the placebo arm in some settings.
Analyzing a Noninferiority Trial
The choice of the noninferiority margin plays a direct role in the interpretation of a noninferiority trial, unlike the minimum clinically relevant difference often defined in superiority trials. Thus, the justification for the noninferiority margin should be outlined in the analysis. The analysis of noninferiority trials also uses information outside the current trial to infer the effect of the intervention vs. placebo in the absence of a direct comparison.
It is therefore recommended that the response rate, adherence, and other characteristics of the active control in the noninferiority trial be compared to those in the historical trials that compared the active control to placebo and provided evidence of the active control's efficacy. If the active control displays different efficacy than in prior trials, then the validity of the pre-defined noninferiority margin may be suspect, and the interpretation of the results will be challenging. The general approach to analysis is to compute two-sided confidence intervals (p-values are not generally appropriate). A common question is whether one-sided 0.05 confidence intervals are acceptable given the one-sided nature of noninferiority; however, two-sided confidence intervals are generally appropriate for consistency between significance testing and subsequent estimation. Note that a one-sided 95% confidence interval would lower the level of evidence for drawing conclusions compared to the accepted practice in superiority trials. In superiority studies, an ITT-based analysis tends to be conservative (i.e., there is a tendency to underestimate true treatment differences). As a result, ITT analyses are generally considered the primary analyses in superiority trials, as this helps protect the type I error rate. Because the goal of noninferiority trials is to show noninferiority or similarity, an underestimate of the true treatment difference can bias toward noninferiority, thus inflating the “false positive” (i.e., incorrectly claiming noninferiority) error rate. Thus, ITT is not necessarily conservative in noninferiority trials. For these reasons, the ITT and per-protocol analyses (i.e., analyses based on study participants who adhered to the protocol) are often considered co-primary analyses in noninferiority trials. It is important to conduct both analyses (and perhaps additional sensitivity analyses) to assess the robustness of the trial result. Per-protocol analyses often produce a larger effect size because ITT analyses tend to dilute the estimated effect; however, the per-protocol analysis frequently yields wider confidence intervals because it is based on fewer study participants than the ITT analysis.
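To make the confidence-interval approach concrete, the sketch below forms a two-sided 95% confidence interval (normal approximation) for the difference in event rates, intervention minus active control, and concludes noninferiority when the upper limit falls below the prespecified margin. The event counts, sample sizes, and 10% margin are hypothetical and chosen only for illustration.

```python
import math

def noninferiority_ci(events_trt, n_trt, events_ctl, n_ctl,
                      margin=0.10, z=1.959964):
    """Two-sided 95% CI (normal approximation) for the difference in event
    rates, intervention minus active control. Noninferiority is concluded
    when the upper confidence limit falls below the prespecified margin;
    superiority can additionally be claimed if the upper limit is below 0."""
    p_trt, p_ctl = events_trt / n_trt, events_ctl / n_ctl
    diff = p_trt - p_ctl
    se = math.sqrt(p_trt * (1 - p_trt) / n_trt + p_ctl * (1 - p_ctl) / n_ctl)
    lower, upper = diff - z * se, diff + z * se
    return diff, (lower, upper), upper < margin, upper < 0

# Hypothetical counts; in practice the same check would be run on both the
# ITT and per-protocol populations and the results compared.
print(noninferiority_ci(events_trt=45, n_trt=300, events_ctl=40, n_ctl=300))
```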
If a noninferiority trial is conducted and noninferiority of the intervention to the active control is demonstrated, can a stronger claim of superiority be made? In other words, what are the ramifications of switching from a noninferiority trial to a superiority trial? Conversely, if a superiority trial is conducted and significant between-group differences are not observed, can a weaker claim of noninferiority be concluded? That is, can one switch from a superiority trial to a noninferiority trial? In general, it is considered acceptable to conduct an evaluation of superiority after showing noninferiority. Due to the closed testing principle, no multiplicity adjustment is necessary. The ITT and per-protocol analyses are both important for the noninferiority analysis, but the ITT analysis is the most important analysis for the superiority evaluation. It is more difficult to justify a claim of noninferiority after failing to demonstrate superiority. There are several issues to consider. First, whether a noninferiority margin was pre-specified is an important consideration; defining the margin post hoc is difficult to justify and may be perceived as manipulation. The choice of the noninferiority margin needs to be independent of the trial data (i.e., based on external information), which is difficult to demonstrate after the data have been collected and unblinded. Second, is the control group an appropriate control group for a noninferiority trial (e.g., has it demonstrated, and precisely measured, superiority over placebo)? Third, was the efficacy of the control group similar to that displayed in historical trials vs. placebo (the constancy assumption)? Fourth, the ITT and per-protocol analyses become equally important. Fifth, trial quality must be high (acceptable adherence and few dropouts). Sixth, assay sensitivity must be acceptable. The reporting of noninferiority trials has been suboptimal in the medical literature. In an Annals of Internal Medicine article titled “Claims of Equivalence in Medical Research: Are They Supported by the Evidence?,” the authors reviewed 88 studies claiming noninferiority but noted that 67% of these studies claimed noninferiority based on nonsignificant superiority tests. Furthermore, only 23% of the studies pre-specified a noninferiority margin. G. Piaggio and coauthors published an extension of the CONSORT statement outlining appropriate reporting of noninferiority trials in a Journal of the American Medical Association article titled “Reporting of Noninferiority and Equivalence Randomized Trials: An Extension of the CONSORT Statement.” A noninferiority design is an alternative that uses an active control group when a placebo is unethical. A clinical trial using a noninferiority design poses unique challenges, including the selection of the active control and the noninferiority margin. A noninferiority trial must be designed, monitored, analyzed, and reported carefully.
Further Reading
Greene, W.L.; Concato, J.; and Feinstein, A.R. (2000). “Claims of Equivalence in Medical Research: Are They Supported by the Evidence?” Annals of Internal Medicine, 132:715–722.
Piaggio, G.; Elbourne, D.R.; Altman, D.G.; Pocock, S.J.; and Evans, S.J.W. (2006). “Reporting of Noninferiority and Equivalence Randomized Trials: An Extension of the CONSORT Statement.” Journal of the American Medical Association, 295:1152–1160.
“Here’s to Your Health” prints columns about medical and health-related topics. Please contact Mark Glickman ([email protected]) if you are interested in submitting an article.
Goodness of Wit Test #5: Simple Inference
Jonathan Berkowitz, Column Editor
I am one year late in my celebration of the 100th anniversary of Student’s t distribution, but in its honor, the theme of this puzzle is simple inference. The puzzle is a standard bar-type cryptic with one constraint, as you’ll read about in the instructions. I certainly hope you can infer what I imply! A one-year (extension of your) subscription to CHANCE will be awarded for each of two correct solutions chosen at random from among those received by November 1, 2009. As an added incentive, a picture and short biography of each winner will be published in a subsequent issue. Please mail your completed diagram to Jonathan Berkowitz, CHANCE Goodness of Wit Test Column Editor, 4160 Staulo Crescent, Vancouver, BC, Canada V6N 3S2, or send him a list of the answers by email at [email protected]. Note that winners of the puzzle contest in any of the three previous issues are not eligible to win this issue’s contest.
Write Your Own Cryptic Clue Contest
A few issues back, I invited readers to create a cryptic clue for a statistical term. Writing clues is a great way to better understand how cryptic clues work. Every proper cryptic clue has three parts: 1) a definition of the solution word, 2) word play that resolves into the solution word, and 3) nothing else! The third item is often the most difficult. I encourage you to send me your efforts. As encouragement, I will print some of the entries and, where possible, try to incorporate the best entries (with acknowledgment) into future puzzles.
Solution to Goodness of Wit Test #3
This puzzle appeared in CHANCE, 22(1), Page 64. Clues with asterisks have the term “range” embedded within.
Winners from Goodness of Wit Test #3—Home on the Range
David Unger is finishing his first year of teaching in the Mathematics and Statistics Department at Southern Illinois University, Edwardsville. He earned his bachelor’s degree in mathematics from Truman State University, followed by his MS in statistics from the University of Illinois, where he is also working on a PhD. He enjoys sports, the unending projects associated with home ownership, time with his wife, and the hijinx of his two children.
Richard Penny is a senior methodologist for Statistics New Zealand, where he has been for the last 25 years. He works with statisticians who provide statistical (and crossword clue—thanks, Penny) expertise to a range of clients. His interests include the process of collecting data and providing information about how it could, and should, be used. Having met R. A. Fisher while very young, he thinks he was probably fated to become a statistician.
Across: *1 BOOMErangeD [definition: boome(range)d] 7 PRO [rebus: profit – fit] 9 AGENT [rebus: a + gent] *10 OrangeADE [rebus: o(range) + ad + e] *11 ESTrangeD [rebus: est(range) + d] 12 OMEGA [hidden word: hOME GAme] 13 EARSHOT [anagram: to share] 15 NULL [hidden word + reversal: droLL UNcle] 18 IDEA [anagram: aide] 20 MIRACLE [anagram: reclaim] 23 ANOVA (see 8 Down) *24 STrangeST [rebus: st(range) + st] 26 INAUGURAL [rebus + anagram: in + arugula] 27 CAUSE [rebus: CA + use] 28 SIR [initial letters: s + i + r] *29 ARrangeMENT [container: ar(range)(men)t]
Down: 1 BRACELET [anagram + deletion: celebrate – e] 2 OVERTURN [rebus: overt + urn] 3 EXTRA [anagram: a T. Rex] 4 AMONGST [container: a(MO)ngst] 5 GLADDEN [anagram: dangled] 6 DOG COLLAR [rebus + container: do(g+c+o)llar] 7 PLANET [rebus: plane + t] 8 (With 23 Across) ONE-WAY (ANOVA) [anagram: anyone avow a] *14 HYDrangeA [rebus + anagram: h + yd(range)a (day)] 16 SCHEDULE [reversal + container: s(ch)edule (eludes)] 17 TESTIEST [container: tes(ties)t] 19 ASSURER [rebus: as + sure + r] 20 MARILYN [anagram + container: mainly + r] 21 VARIES [rebus: v + Aries] 22 HOT AIR [anagram + rebus: oath + I + r] 25 NICHE [container: nic(h)e]
Goodness of Wit Test #5
Instructions: In the answers to eight clues, each of which intersects another such clue, a sample of letters must be replaced appropriately to allow entry of the answer into the diagram. Enumeration is withheld.
Across
1 Appropriate start to legal fight
4 Bad luck not finishing cleaning device in meeting place
10 Erase sideless part of a ring
11 Turn about having dropped old instrument
12 Part of car exhaust
13 For example, pine out loud at opening of tennis court in South Dakota
14 Fox and others take home some of rice in frying pans
16 Roger takes in unsatisfactory ending to scary, mischievous play
17 Still one abominable snowman
18 Decorating trouble when finally going in circle
20 Piece in a note on initiative
24 Drops reciprocal of standard deviation divided by constant
26 Furnish with jazzy wallpaper ignoring length and width
28 To some extent blame a tax cutter (hyph.)
31 Pitch one possibly related to sound
32 Landmass like Iowa
33 Level drawer
34 Top conmen pitching ingredient
35 Bum runs city inquiry
36 Stimulate limitless swells
Down
1 By promotion, raised values
2 Unchanging university standard retaining one failure
3 Anger a green radical
4 Rod has left these words for you?
5 Start to lean towards strangely definitive statement (2 wds)
6 Agent with fewer assets?
7 Lode containing uranium, trace of titanium, tungsten, and oxygen lasted longer
8 Leaders of orchestra begin overture excluding wind instrument
9 A grim lad arranged song
14 Improving D minus uncovered practice?
15 Arcane, nasty and unknown rules of grammar
16 Again accommodates a quiet time during studies
19 Wanting joint after first sign of crash
21 Believer in improved tithes
22 Choice magic potion
23 Convict's hype ruined senior
25 Earliest particle splitting noble
27 Smart city getting rid of a turn
29 Host with attractive upper limbs?
30 Meat head
A guide to solving cryptic clues appeared in CHANCE 21(3). The use of solving aids—electronic dictionaries, the Internet, etc.—is encouraged.