About the Authors

Michela Baccini works in the Dipartimento di Statistica "G. Parenti" at the University of Florence, Italy. Her interests include analysis of historical series in epidemiology, analysis of survival data, Bayesian meta-analysis, and statistical methods for health studies.

Robert Bell is a member of the statistics research department at AT&T Labs-Research. His research interests range from survey research methods to machine learning.

Susann Blüher is an MD and board-certified pediatrician at the University Hospital for Children and Adolescents Leipzig. She is also a study coordinator at the Clinical Trial Centre Leipzig and is engaged in obesity research and pediatric endocrinology.

Samantha Cook is a statistician in the research department at Google, where her work involves predicting rates of influenza from Google users' search behavior. Other areas of interest include missing data, causal inference, and computation.

Constantine Frangakis, who earned his PhD from Harvard in 1999, is a professor of biostatistics at The Johns Hopkins University. He develops designs and methods of analyses to evaluate treatments in medicine, public health, and policy.

Annegret Franke is a biometrician at the Clinical Trial Centre Leipzig. She earned a diploma and PhD in biomedical cybernetics from Ilmenau and did post-graduate work in medical biometry at Heidelberg. She is working on clinical trials in dermatology, neurosurgery, nuclear medicine, and rheumatologic rehabilitation.

Bianca Gelbrich has an MD in dentistry (Leipzig) and works at the Clinical Trial Centre Leipzig. Her professional fields are orthodontics and pediatric dentistry. She was also a study coordinator at the Clinical Trial Centre and has expertise in forensic odontostomatology.

Götz Gelbrich works as a biometrician at the Clinical Trial Centre Leipzig. He earned his diploma in mathematics from Warsaw University and has doctoral degrees in mathematics (Greifswald) and theoretical medicine (Leipzig). He has experience from clinical trials in various domains and is the senior biometrician of the German Competence Network Heart Failure.

Yehuda Koren is a senior researcher at Yahoo Research, Haifa. He earned his PhD in computer science from the Weizmann Institute of Science in 2003. His research interests include recommender systems, spam filtering, and general data analysis and visualization.

Andreas Krause is director and lead scientist of modeling and simulation at Actelion Pharmaceuticals in Basel, Switzerland. A statistician by training, he has broader interests that include simulation and visualization techniques. Krause can be contacted at [email protected].

Paul Kvam earned his PhD in statistics from the University of California, Davis, in 1990 and is now a professor at the Stewart School of Industrial and Systems Engineering. His research interests focus on statistical reliability with applications to engineering, nonparametric estimation, and analysis of complex and dependent systems.
Editor’s Letter
Mike Larsen,
Executive Editor
Dear Readers,

This issue of CHANCE, the first of the 23rd year for the magazine, focuses on a few hot-topic issues while also including some lighter, amusing articles.

Walter Mebane writes about statistical evidence suggesting fraud in the recent Iranian presidential election. Analyses are presented at the district-, town-, and ballot-box levels, comparing elections in 2005 and 2009. A variation of the so-called Benford's law concerning the distribution of digits is used to judge the counts. Unfortunately, it will take more information than will likely ever be public to explain the unusual data patterns.

Michela Baccini, Sam Cook, Constantine Frangakis, Fan Li, Fabrizia Mealli, Don Rubin, and Elizabeth Zell describe missing data issues and a multiple imputation approach in the Centers for Disease Control and Prevention's Anthrax Vaccine Research Program clinical trial. The study is ongoing. Producing the imputations for missing items is important, but so is the process of evaluating the quality of the resulting imputations and inferences.

Bob Bell, Yehuda Koren, and Chris Volinsky discuss their ensemble approach to winning the $1 Million Netflix Prize. Is their approach to one challenging problem the way to the fastest progress in many areas of study? Steve Lohr of The New York Times comments and adds perspective.

Todd Remund illustrates digital filtering and signal processing with examples of car motion and rocket firing. It is interesting how the sound record of a rocket firing can be used as data and analyzed statistically.

Götz Gelbrich, Annegret Franke, Bianca Gelbrich, Susann Blüher, and Markus Löffler answer the important question: Are the color and flavor of gummy bears related? If you teach, students might want to try to replicate the results. Even if you do not teach, you still might want to try this experiment. We use this article as an opportunity to include a comment about the CONSORT guidelines for clinical trial reporting.

Paul Kvam examines the Electoral College and how the distribution of electoral votes across the states has evolved. A simple multinomial model of counts does not fit, but Kvam suggests a promising alternative. He also makes a radical suggestion for reorganizing the electorate. Perhaps after Congress deals with health care legislation …

In Mark Glickman's Here's to Your Health column, Andreas Krause describes efforts at modeling and simulation for evaluating the effect of drugs and therapies on a "virtual patient" in pharmacodynamic/pharmacokinetic trials. Modern drug development is time consuming and expensive. Statistical simulation and modeling aim to improve the drug development process in both of these dimensions.

In his Visual Revelations column, Howard Wainer explains how a simple graph can have three substantially different interpretations. In doing so, he gives a brief historical account of item response theory and introduces us to Schrödinger's cat.

Peter Loly and George Styan present examples of Latin square designs in stamps, taking us around the world and characterizing the 4×4 Latin squares of experimental design. Color versions of the stamps are available in the online version at www.amstat.org/publications/chance.

Finally, Jonathan Berkowitz's Goodness of Wit Test challenges us with a variety cryptic in the bar-type style. The puzzle is subtitled "Figure It Out." We encourage you to try to do so.

In other news, former CHANCE editor Dalene Stangl of Duke University will present a webinar March 9 at 2 p.m. Eastern time about teaching statistics with articles from the magazine. The topic was the focus of Stangl's article in the 20th anniversary issue (Vol. 20, No. 4), which is available at www.amstat.org/publications/chance for subscribers who are also members of the American Statistical Association. Information about the webinar can be found at www.CAUSEweb.org/webinar.

I look forward to your suggestions and submissions. Enjoy the issue!
Mike Larsen
Fraud in the 2009 Presidential Election in Iran? Walter R. Mebane Jr.
The presidential election that took place in Iran on June 12, 2009, attracted considerable controversy. The incumbent, Mahmoud Ahmadinejad, was declared the winner with 63.3% of the 38,770,288 valid votes, but the opposition candidates (Mir-Hossein Mousavi, Mohsen Rezaei, and Mehdi Karroubi) refused to accept the results. According to the ballot box data, Mousavi officially received 34.2%, Rezaei 1.7%, and Karroubi 0.9% of the valid votes. Widespread demonstrations occurred in the days following the election. Eventually, the demonstrations were repressed, but the government's legitimacy continues to be challenged. The available data support the belief that the official results announced in the election do not accurately reflect the intentions of the voters, and that there was widespread fraud in which the vote counts for Ahmadinejad were substantially augmented by artificial means. However, the data do exhibit a striking
feature that, if adequately explained, might help build confidence that Ahmadinejad actually won. To explain such patterns, though, would depend on information (administrative records, witness testimony, etc.) that goes beyond what a statistical assessment can produce. Such investigations are unlikely to occur in the foreseeable future, so the questions that have been raised are unlikely to receive satisfactory answers.
Previous Studies

Several papers examined various aspects of the Iranian election results shortly after the election concluded. One study by Boudewijn Roukema demonstrates that the first significant digits of vote counts reported for Karroubi in the 366 'towns' into which Iran was administratively divided for the 2009 election differ significantly from the distribution implied by Benford's Law (see the sidebar "Benford's Law and 2BL").
To overcome a basic objection (that the first digits of vote counts predominantly reflect the size of administrative units and each candidate's share of support among the public, and there is no reason to expect either of those to follow any particular statistical distribution), Roukema uses an empirical version of the law that adjusts for these factors. Still, the question remains why anyone should believe a clean election would be associated with, according to Roukema, "the null hypothesis that the first digit in the candidates' absolute numbers of votes are consistent with random selection from a uniform, base 10 logarithmic distribution modulo 1." There is neither an extensive archive of empirical findings from other elections nor a theoretical motivation to support such a belief. What Roukema's results imply about the credibility of the election is unclear. "Preliminary Analysis of the Voting Figures in Iran's 2009 Presidential
Election," a study by Ali Ansari, Daniel Berman, and Thomas Rintoul, uses vote totals for Iran's 30 provinces from 2009 and 2005 (the previous presidential election), along with data from the Iranian census of 2006, to show that several aspects of the 2009 election were surprising. For instance, the study finds no correlation at the province level between increases in turnout and increases in the proportion of votes for Ahmadinejad and a strong positive association between votes for Karroubi in 2005 and votes for Ahmadinejad in 2009. Such findings say little about the possibility of fraud. For one thing, as Ansari, Berman, and Rintoul acknowledge, province-level vote counts represent too high a level of aggregation to reach strong conclusions. The study also finds two provinces where more votes were recorded than there were eligible voters. This finding was superseded by an announcement from Iran's Guardian Council that turnout
Table 1—Estimated Coefficients and Standard Errors in the Robust Overdispersed Multinomial Regression Model for Predicting Vote Counts for Four Candidates in the 2009 Iranian Presidential Election

                                  Ahmadinejad            Rezaei                 Karroubi
Predictor                         coef.       SE         coef.       SE         coef.      SE
Constant                           1.40000    0.2290     −3.63000    0.3150     −4.4200    0.3560
ratio(2009 total/2005 total)      −0.60200    0.1590      0.07160    0.2160      0.2720    0.2340
logitM(2005 Ahmadinejad prop)      0.23200    0.0517      0.49400    0.1110      0.5970    0.0977
logitM(2005 Ghalibaf prop)         0.19400    0.0888     −0.10900    0.1490     −0.3760    0.1720
logitM(2005 Karroubi prop)         0.11100    0.0900      0.44700    0.1390     −0.0359    0.1410
logitM(2005 Larijani prop)        −0.01090    0.0792      0.00296    0.1160     −0.3120    0.1440
logitM(2005 Moeen prop)           −0.80400    0.0978     −0.46300    0.1630      0.5910    0.1800
logitM(2005 Rafsanjani prop)       0.26300    0.1500     −0.26800    0.2470     −0.1950    0.2740
ratio × Ahmadinejad               −0.10400    0.0366     −0.06820    0.0778     −0.3970    0.0629
ratio × Ghalibaf                   0.05890    0.0576      0.15200    0.1050      0.2250    0.1170
ratio × Karroubi                   0.00672    0.0663     −0.24200    0.0990      0.3570    0.0916
ratio × Larijani                   0.02780    0.0579     −0.00318    0.0814      0.2440    0.1000
ratio × Moeen                      0.24400    0.0726      0.14200    0.1220     −0.4820    0.1320
ratio × Rafsanjani                −0.17900    0.1080      0.08070    0.1730     −0.0535    0.1850

Note: The reference candidate is Mousavi. Predictors are the turnout ratio, the logits of the 2005 vote proportions, and the interactions between the turnout ratio and the logits. Data were used from n = 320 towns. Outliers accounted for 81 out of 960 counts (8.4%). Estimates of the overdispersion parameter were σLQD = 17.9 and σtanh = 15.8 using two methods.
exceeded the number of eligible voters in more than 50 cities, supposedly because voters voted at locations away from their home cities.
Town-Level Analysis

Vote counts are available from the 366 towns in the 2009 election and the 325 in the 2005 presidential election. There are 320 towns with the same names at both times that could therefore be matched on that basis. The analysis uses a statistical method that fits an overdispersed multinomial model to the 2009 vote counts, but allows for the possibility that the variables chosen to try to explain the vote counts for each town do not explain some of the counts well: the so-called robust overdispersed multinomial model described by Walter Mebane and Jasjeet Sekhon in their American Journal of Political Science paper, "Robust Estimation and Outlier Detection for Overdispersed Multinomial Models of Count Data" (see the sidebar "Robust Overdispersed Multinomial Logit Model"). Observed
vote counts that are extremely discrepant from what the model predicts are excluded from the analysis and declared to be outliers. Counts that the model predicts a bit more accurately, but still poorly, are downweighted so they have less influence on the estimate of the pattern that prevails through most of the data. For the 320 towns that could be matched, the analysis tries to explain the 2009 vote as a function of two features of the 2005 election: (1) the proportion of votes received by each candidate in the first round and (2) the ratio of the total number of votes in 2009 to the total number of votes in 2005. The rationale for the first set of variables is that it is natural to expect towns that heavily supported Ahmadinejad in 2005 to continue to do so in 2009, towns that supported Karroubi to continue to support Karroubi, and towns that supported Akbar Hashemi Rafsanjani to support Mousavi (due to their similar positions). Concrete expectations are lacking for the patterns to be associated with the other
four 2005 candidates—Mohammad Bagher Ghalibaf, Ali Larijani, Mohsen Mehralizadeh, and Mostafa Moeen— but by including them, a relatively fine-grained implicit description of the political geography of Iran may be produced, which will help obtain relatively good predictions for the 2009 results. The second variable is motivated by the fact that, in 2005, opposition politicians called for a boycott of the election, so a surge in turnout in 2009 probably means many who boycotted in 2005 decided to vote in 2009. Hence, towns that have high ratios should have lower proportions of the vote for Ahmadinejad. It is important to also consider interactions between the change in turnout and each of the vote-proportion variables. For instance, strong support for Ahmadinejad in 2005 might not imply strong support in 2009 in towns where turnout surged. To produce a set of regressors, the ratio of total votes in 2009 to 2005 is computed and a logit function is used to transform the set of 2005 first-stage vote
proportions. The logits are formed as the natural logarithm of the ratio of the proportion of votes for a candidate in each town to the proportion of votes for an arbitrarily chosen reference candidate. Because in the multinomial logit model the focus is on the degree to which each covariate is associated with differences between candidates, choosing a reference candidate is a normalization that does not affect the results. What matters is the difference between two candidates' coefficients for a particular covariate. The reference candidate is specified to have a coefficient of zero for all the covariates. Mehralizadeh was used for this reference candidate, so the logits used here for candidate j are defined as the following:

logitM(pj) = log(pj / pMehralizadeh),
where j is an index for one of the other candidates in 2005: {Ahmadinejad, Ghalibaf, Karroubi, Larijani, Moeen, Rafsanjani}. The interactions between the ratio of total votes and the logits are produced by multiplying each logit by the ratio. As there are three candidates in 2009 besides Mousavi, there are three sets of coefficients on the logits and interactions.

Several aspects of the results (see Table 1) seem natural. Increases in turnout generally went with decreases in Ahmadinejad's support (coefficient estimate β̂21 = −0.602). Places in which there was strong support for Ahmadinejad in the first stage of the 2005 election tended to have strong support for him in 2009 (β̂31 = 0.232). In places where turnout surged in 2009, however, strong support for Ahmadinejad in the first stage of the 2005 election no longer indicated strong support in the 2009 election (β̂91 = −0.104). Places in which there was strong support for Karroubi in 2005 tended to have strong support for him in 2009, at least in cases where 2009 turnout surged above 2005 levels (β̂11,3 = 0.3570). However, places that supported Rafsanjani in 2005 showed no particular predilection for supporting Mousavi in 2009, though Rafsanjani was often identified as backing Mousavi in the post-election controversy.

What is striking about this analysis is that many towns have results the model cannot explain. The preceding estimation produces 81 outliers and 189 downweighted observations out
Benford's Law and 2BL

Benford's Law describes a pattern for the frequency distribution of digits in numbers that occurs under a wide variety of conditions. As described by Theodore Hill in "A Statistical Derivation of the Significant-Digit Law" and Ricardo Rodriguez in "First Significant Digit Patterns From Mixtures of Uniform Distributions," statistical distributions with long tails (like the log-normal) or that arise as mixtures of distributions have values with digits that often satisfy Benford's Law. Under Benford's Law, the leading digits in a number are not uniformly distributed and different digits are not independent. Under Benford's Law, for example, the distribution of the second significant digits j = 0, 1, 2, . . . , 9 in a set of numbers written in base 10 is given by the formula rj = ∑9k=1 log10(1 + 1/(10k + j)).

This relationship has proved relevant in several applications. For example, Wendy Tam Cho and Brian Gaines, in "Breaking the (Benford) Law: Statistical Fraud Detection in Campaign Finance," used Benford's Law as a reference distribution in efforts to detect fraud in financial data. The so-called second-digit Benford's Law (2BL), which specifies the expected frequencies rj for the second significant digits, is relevant for vote counts. Extensive examinations suggest that low-level vote counts rarely have first digits that satisfy Benford's Law, but they often have second digits that do.

A simple thought experiment also tends to argue against Benford's Law for the first digit of vote counts. Imagine a close election in which two candidates evenly divide the votes and all polling stations are the same size M. All the vote counts will be about M/2 and so have the same first digit. Overall, vote counts appear not to satisfy Benford's Law. There seem to be at least two reasons for vote counts to deviate from 2BL: election fraud and strategic voting. Exactly how and when the 2BL distribution arises is a topic for ongoing research.
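To make the 2BL reference distribution in this sidebar concrete, here is a minimal Python sketch (not from the article, which reports analyses carried out elsewhere) that evaluates the formula for rj; the rounded values match the (r0, . . . , r9) = (.120, .114, . . . , .085) frequencies used later for the ballot-box tests.

```python
import math

def second_digit_benford():
    """Expected frequencies r_j of the second significant digit j = 0..9
    under Benford's Law: r_j = sum_{k=1}^{9} log10(1 + 1/(10k + j))."""
    return [sum(math.log10(1 + 1 / (10 * k + j)) for k in range(1, 10))
            for j in range(10)]

if __name__ == "__main__":
    r = second_digit_benford()
    print([round(x, 3) for x in r])
    # [0.12, 0.114, 0.109, 0.104, 0.1, 0.097, 0.093, 0.09, 0.088, 0.085]
```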
of 320×3 = 960 independent pieces of information; 20.5% of the data are downweighted. Sixty of the 81 outliers represent vote counts for Ahmadinejad that the model wholly fails to describe. Mebane lists all the outliers and downweighted observations at www-personal.umich.edu/~wmebane/note29jun2009.pdf. Among the downweighted observations, 172 of the 189 involve poorly modeled vote counts for Ahmadinejad. Sixty-seven percent of these observations (115 out of 172) have more votes recorded for Ahmadinejad than the model would predict. Such a lopsided outcome is extremely unlikely if positive and negative deviations are equally likely. More than half of the 320 towns included in this part of the analysis exhibit vote totals for Ahmadinejad that are not well described by the natural political processes the baseline model is intended to represent. These departures from the model much more often represent additions than declines in the votes reported for Ahmadinejad. Correspondingly, the poorly modeled observations much more often represent declines than additions in the votes reported for Mousavi.

Such results are shocking on their face. The first-round 2005 election results provide a seemingly detailed implicit map of Iranian political geography, and results in 2009 seem quite different. Still, the large number of outliers does not necessarily imply fraud. One way to put these results in perspective is to compare them to the results of a similar analysis conducted using data from an election with no credible allegations of widespread election fraud. To this end, consider the 2008 U.S. presidential election votes for the Democratic, Republican, and Libertarian candidates and the Independent candidate Ralph Nader.
Table 2—Estimated Coefficients and Standard Errors in the Robust Overdispersed Multinomial Regression Model for Predicting Vote Counts for Four Candidates in the 2008 U.S. Presidential Election

                               Democrat             Libertarian           Independent
Predictor                      coef.    SE          coef.     SE          coef.     SE
Constant                       0.140    0.00498     −4.900    0.0185      −4.010    0.0157
logit(2004 Democratic prop)    0.957    0.00871      0.218    0.0368       0.359    0.0264

Note: The reference candidate is Republican. The predictor is the logit of the 2004 Democratic presidential vote proportion. Data were used from n = 931 counties. Outliers accounted for 159 out of 2973 counts (5.3%). Estimates of the overdispersion parameter were σLQD = 5.37 and σtanh = 4.65 using two methods.
A county-level robust overdispersed multinomial model regressing the 2008 vote counts on 2004 vote proportions, using data from 22 states, produces a large number of outliers involving the Democratic candidate. This set of states includes a number of counties (931) greater than the number of towns in Iran. Due to limitations in the available data, the only regressor is the simple logit of the proportion of votes for the Democratic candidate in each county in 2004. The simple logit is logit(p) = log(p/(1 − p)). With the Republican treated as the reference candidate, the parameter estimates appear quite natural (see Table 2). Counties that supported the Democrat in 2004 tended strongly to support the Democrat in 2008 (β̂21 = 0.957). Still, out of 931 × 3 = 2793 independent pieces of information, 159 observations are outliers and 413 observations are downweighted. Altogether 14.8% of the data are downweighted. Among the outliers, all but three observations involve poorly modeled vote counts for the Democrat, and 71 of the outliers (44.7%) have more votes recorded for the Democrat than the model would predict. The results from the U.S. 2008 election suggest elections in which some candidates receive a small number of votes (e.g., Libertarian or Independent) usually produce a substantial number of
outliers. The set of regressors is thinner in the U.S. model than in the Iranian specification. In the U.S. election, the rate and polarization of outliers is less extreme than in Iran. Even so, it is difficult to argue that the large set of outliers found in the analysis of the Iranian data is in itself diagnostic of fraud.
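As a concrete illustration of the regressor construction used in the town-level model above (the turnout ratio, the generalized logits of the 2005 vote proportions with Mehralizadeh as the reference candidate, and their interactions), here is a minimal Python sketch. The input layout and names are hypothetical, not the article's actual data files, and no robust fitting is attempted here.

```python
import math

def build_regressors(town):
    """town: dict with the 2009/2005 total-vote ratio and the 2005
    first-round proportions by candidate, e.g.
    {"ratio": 1.3,
     "p2005": {"Ahmadinejad": .62, "Ghalibaf": .14, "Karroubi": .05,
               "Larijani": .06, "Moeen": .04, "Rafsanjani": .05,
               "Mehralizadeh": .04}}"""
    others = ["Ahmadinejad", "Ghalibaf", "Karroubi", "Larijani",
              "Moeen", "Rafsanjani"]
    ref = town["p2005"]["Mehralizadeh"]                 # reference candidate
    logits = {c: math.log(town["p2005"][c] / ref) for c in others}
    x = {"const": 1.0, "ratio": town["ratio"]}
    x.update({f"logitM_{c}": v for c, v in logits.items()})
    x.update({f"ratio_x_{c}": town["ratio"] * v for c, v in logits.items()})
    return x   # 1 constant + 1 ratio + 6 logits + 6 interactions, as in Table 1
```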
Ballot Box-Level Analysis

This analysis used "ballot box" (polling station) data. These data comprise 45,692 ballot boxes from all 30 provinces, as reported on a Ministry of Interior web page. The question is whether the second significant digits in the ballot box vote counts for each candidate occur with the frequencies associated with Benford's Law. Discrepancies from those frequencies may indicate problems with the vote counts. The rationale for this test is further discussed in Mebane's paper, "Election Forensics: Vote Counts and Benford's Law," available at www-personal.umich.edu/~wmebane/pm06.pdf. In several settings, Mebane found that departures from the Benford distribution in the second significant digit of polling station vote counts are diagnostic of anomalies in the voting process. To decide whether the anomaly represents fraud, strategic voting, or other features of the electoral system, however, depends on information other than the vote counts. More papers on this topic are posted at www.umich.edu/~wmebane.

Here, results are reported from treating the candidates separately but pooling across all the provinces. Looking at each of the 30 provinces separately conveys the same general impression about which candidates have distorted vote totals. The test statistic is X²2BL = ∑9j=0 (nj − Nrj)²/(Nrj), where N is the number of precincts having a vote count of 10 or greater (so there is a second digit) for a particular candidate, nj is the number having second digit j, and rj denotes the proportion expected to have second digit j according to the 2BL distribution, where (r0, . . . , r9) = (.120, .114, .109, .104, .100, .097, .093, .090, .088, .085). If an observed test statistic is larger than the critical value of the chi-squared distribution with nine degrees of freedom, then the conclusion is that the 2BL distribution does not characterize the referent vote counts. With four candidates, there are four test statistics to consider in order to test the hypothesis of no departures from the expected values. Adjustments need to be made for having multiple, simultaneous tests. One way to do this is to control the false discovery rate (FDR) over the set of statistics of interest, as described by Yoav Benjamini and Yosef Hochberg
Table 3—Second-Digit Benford's Law (2BL) Tests, Iran 2009 Presidential Election, Ballot Box Vote Counts

Candidate       N        X²2BL
Mousavi         44721      9.9
Karroubi         9441    315.1
Rezaei          18437    155.3
Ahmadinejad     45630     57.7

Note: N denotes the number of ballot boxes with 10 or more votes for the candidate. X²2BL > 21.03 is statistically significant at test level α = .05.
in their article, "Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing." In this case, a test statistic needs to be larger than 21.03 for there to be a significant result at test level α = .05. The 2BL tests show highly significant discrepancies in the ballot-box vote counts for Ahmadinejad, Karroubi, and Rezaei, but not for Mousavi (see Table 3). The results for the two candidates who received small proportions of the vote reflect the smallness of their vote counts. Karroubi, for instance, received 10 or more votes only in 9,441 of the 45,692 ballot boxes. The key question is whether the small numbers stem from normal politics: either sincerely low support for them or voters strategically abandoning them in favor of one of the two leading candidates. Of course, both candidates disputed that possibility. Another possibility is that there was widespread fraud. Allegations regarding the election include assertions that many ballot boxes were sealed before they could be inspected by opposition candidates' representatives. This opens the possibility of ballot box stuffing or other vote tampering. The simplest interpretation of the fact that Ahmadinejad's, but not Mousavi's, vote counts deviate significantly from the 2BL pattern is that votes were somehow artificially added to Ahmadinejad's totals while
Mousavi’s counts remained unmolested (and perhaps votes for Karroubi and Rezaei were somehow thrown out). But simulations by Mebane in “Detecting Attempted Election Theft: Vote Counts, Voting Machines, and Benford’s Law” and “Statistics for Digits” (both available at www.umich.edu/~wmebane) indicate that if votes are artificially shifted from one candidate to the other, sometimes only the 2BL statistics for the receiver candidate reflect the manipulation. So, the simplest interpretation should be considered a likely possibility, but not the only one. It is also possible that the finding of a significant 2BL statistic for the candidate
with the most votes does not indicate any kind of fraud at all. For instance, in the 22 U.S. states analyzed above, there are about 91,000 precincts. Running the same test for the precinct-level vote counts from the 2008 presidential election for those states gives a highly significant statistic for the Democratic, Republican, Libertarian, and Independent candidates. If California, New York, and Pennsylvania are omitted, there are about 40,000 precincts. Then, the 2BL test statistic for the Democrat is an insignificant 14.2. States in America have different political and institutional profiles. For example, New York has fusion, with the Democratic and Republican candidates
Table 4—Deciles of the Distribution of the Proportion of Invalid Votes by Ballot Box, Iran 2009 Presidential Election, Based on 45,692 Ballot Boxes

Decile      10%     20%     30%     40%     50%     60%     70%     80%     90%
Proportion  0.0000  0.0028  0.0048  0.0067  0.0085  0.0105  0.0129  0.0161  0.0213
Figure 1. Second digits of ballot-box vote counts for president by proportion of invalid votes, 2009 Iranian presidential election. Each solid line is a nonparametric regression curve showing the conditional mean of the second digit of the vote count for each candidate given each invalid vote proportion value. Dashed lines show 95 percent confidence bounds adjusted for having four candidates. The horizontal dotted line shows the mean value expected according to the 2BL distribution. Confidence band width reflects the spread of the data and the number of observations near each ordinate.
appearing under multiple party lines on the general election ballot. So it appears the 2BL test is sensitive to some of those differences. Thus, although discordant according to the 2BL test, there is insufficient evidence based on only this test to assert fraud in the 2009 Iranian elections.
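For readers who want to try the screening test described above on other elections, here is a minimal Python sketch of the X²2BL statistic. The 2BL frequencies come from the formula in the sidebar; the critical values noted in the comments are the unadjusted chi-squared threshold and the article's FDR-adjusted threshold of 21.03 for four simultaneous tests. The input counts are hypothetical; the N and X² values in Table 3 are the article's own results, not reproduced by this code.

```python
import math
from scipy.stats import chi2

# 2BL expected frequencies r_0..r_9
R = [sum(math.log10(1 + 1 / (10 * k + j)) for k in range(1, 10))
     for j in range(10)]

def x2_2bl(all_counts):
    """all_counts: iterable of one candidate's ballot-box vote counts.
    Only counts of 10 or more are used, so a second digit exists."""
    counts = [c for c in all_counts if c >= 10]
    n = [0] * 10
    for c in counts:
        n[int(str(c)[1])] += 1            # second significant digit
    N = len(counts)
    return sum((n[j] - N * R[j]) ** 2 / (N * R[j]) for j in range(10))

# Unadjusted 5% critical value, 9 degrees of freedom (about 16.9);
# the article's FDR-adjusted threshold across four candidates is 21.03.
print(chi2.ppf(0.95, df=9))
```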
Analysis Using Invalid Ballots

The ballot box data include an additional measure: the number of ballots deemed invalid (or "void") for each ballot box.
Ballot boxes to which votes have been added artificially may have unusually low proportions of invalid ballots. Ballot boxes used by voters who are deliberately spoiling their ballots may, on the other hand, have relatively high proportions. Any relationship between the invalid vote proportion and the second digits of the ballot-box vote counts may therefore give insight into the processes producing the votes. Invalid ballots are not all that common. The deciles of the distribution of the proportion of invalid ballots in each
ballot box (see Table 4) show that more than 10% of ballot boxes contain no invalid ballots. The median proportion is 0.0085. The 90th percentile is 0.0213. Based on the ballot box counts, at least, not a lot of protest voting was done by spoiling ballots. The relationship between the proportion of ballots that are invalid and the second digits of the ballot-box vote counts exhibits a couple of significant associations. Figure 1 displays nonparametric regression curves that indicate how the mean of the second digit of the vote counts for each candidate varies with the proportion invalid. See Adrian Bowman and Adelchi Azzalini’s book, Applied Smoothing Techniques for Data Analysis: The Kernel Approach with S–Plus Illustrations, for a description of this method. The conditional mean of the second digit is shown surrounded by 95% confidence bounds. Each pair of confidence bounds corresponds to a test level of α= .05/4, which represents a Bonferroni adjustment to take into account that four candidates are being considered. Each plot includes ballot boxes ranging from those having the smallest proportion of invalid ballots (zero) to those having invalid ballot proportions at the 95th percentile (about 0.026). The question is whether there are sections of each plot in which the mean expected according to the 2BL distribution falls outside of the confidence bounds. The expected mean, equal to 4.19, is indicated by a horizontal dotted line in the plots. No substantial association is apparent over most of the data for Karroubi and Rezaei. Their vote counts’ second digits are always less than the 2BL-expected mean value, except for roughly the top 5% of the invalid proportion distribution. For the other two candidates, there are noteworthy associations. For Mousavi, the second-digit mean always falls inside the confidence bounds, except for invalid vote proportions between about 0.0164 and 0.0238. For Ahmadinejad, the second-digit mean is inside the confidence bounds, except for invalid vote proportions less than about 0.0033. For Mousavi, the exceptional values comprise about 13.5% of the ballot boxes; for Ahmadinejad, the exceptional values occur in about 25.6% of the ballot boxes. To understand what the exceptional patterns may mean for the process that generated the votes, it is useful
to examine the relationship between invalid ballot counts and the proportion of votes reported for each candidate. These relationships are shown in Figure 2, which displays nonparametric regression curves that indicate how the mean of the proportion of votes received by each candidate varies with the proportion invalid. Strong relationships are apparent for all four candidates. Ahmadinejad’s support increases as the proportion invalid decreases, whereas the support for each of the other three candidates decreases. The rate at which Ahmadinejad gains support as the proportion invalid decreases is monotonic and sometimes steep—as the proportion invalid falls from the median value of about 0.0085 to zero, Ahmadinejad’s share of the vote increases from an average of about 0.64 to an average of about 0.77. This corresponds to an average increase of more than 15 votes for every decrease of one invalid vote. The increases in vote share for Ahmadinejad are matched by decreases in the vote share for Mousavi. The steep relationship makes it implausible to argue that the relationship between invalid vote proportions and the respective shifts in votes for Ahmadinejad or Mousavi reflects changes in protest votes (i.e., in blank or spoiled ballots cast by people who liked none of the candidates). The curves for both Ahmadinejad and Mousavi flatten out considerably for invalid vote proportions of about 0.015, roughly the 75th percentile. Ballot box stuffing, or some analogous artificial augmentation of Ahmadinejad’s vote counts, is by far the simplest way to explain these patterns. Plots similar to Fig ure 2, with Ahmadinejad’s support increasing as the proportion invalid decreases, appear for each province if the data for each province are analyzed separately, with the sole exception of the data from Sistan province. A measure comparable to invalid ballots is mostly not available in the U.S. data. It would be useful to collect such information across a wide range of elections thought to be free from substantial fraud in order to have more empirical experience with graphs such as these.
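The curves in Figures 1 and 2 were produced with the kernel regression methods of Bowman and Azzalini. As a rough illustration of the underlying idea only (not their implementation, and with an arbitrary placeholder bandwidth), here is a minimal Nadaraya-Watson smoother in Python.

```python
import numpy as np

def nw_smooth(x, y, grid, bandwidth=0.002):
    """Nadaraya-Watson kernel estimate of E[y | x] on a grid of x values,
    using a Gaussian kernel. The bandwidth of 0.002 is an arbitrary choice."""
    x, y, grid = map(np.asarray, (x, y, grid))
    w = np.exp(-0.5 * ((grid[:, None] - x[None, :]) / bandwidth) ** 2)
    return (w * y).sum(axis=1) / w.sum(axis=1)

# e.g., conditional mean of the second digit given the invalid-vote proportion:
# curve = nw_smooth(invalid_prop, second_digit, np.linspace(0, 0.026, 100))
```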
Analysis of Total Ballots

If votes were faked in ways that neglected to add invalid ballots while Ahmadinejad's vote counts were being boosted, one might also expect to see distortions in
Figure 2. Vote proportions for president by proportion of invalid votes, 2009 presidential election. Each solid line is a nonparametric regression curve showing the conditional mean of the proportion of votes received by each candidate given each invalid vote proportion value. Dashed lines show 95% confidence bounds adjusted for having four candidates. Confidence band width reflects the spread of the data and the number of observations near each ordinate.
the total numbers of ballots counted at each location. Ballots were distributed to polling stations in lots of 50. Many places reportedly ran out of ballots before the polls closed and, in general, there were complaints about how ballots were administered during the election. Complaints filed by Mousavi’s supervisory committee mention ballots having been distributed with no serial numbers, excess ballots having been distributed with no accounting of their whereabouts, and polling stations running out of ballots. The complaints also mention an excess of poorly controlled election stamps. Polling stations that used all ballots should be expected to have total ballot counts that are exactly or nearly a multiple of 50.
Indeed, the distribution of the values listed as the total ballots in the ballot box data shows a pattern that indicates election administration was not uniform throughout the country. Taking the integer remainder of each total count after it is divided by 50, about 4.7% of the ballot box total counts end in either 00 or 50, and another 4.8% end in either 49, 99, 48, or 98. The deviations from an expected uniform level are statistically significant at test level .05 (after adjusting for the FDR to take into account the presence of 50 tests) for remainders 0, 49, 28, and 48. Beyond appearing to confirm the existence of polling stations that completely exhausted the ballots distributed, it is unclear whether such results are
Robust Overdispersed Multinomial Logit Model

The overdispersed multinomial model for J ≥ 2 outcome categories is a model for counts yi = (yi1, . . . , yiJ)′, where i = 1, …, n indexes observations. The total of the counts for observation i is mi = ∑Jj=1 yij, treated as fixed, and the expected value of yij given probability pij is pijmi. The vector of all the probabilities for observation i is pi = (pi1, …, piJ)′. For a vector of K covariates (predictor variables) xij = (xij1, …, xijK)′ and of coefficients βj = (β1j, …, βKj)′, pij is a logistic function of J linear predictors:

pij = exp(µij) / ∑Jk=1 exp(µik),

where µij = x′ijβj = ∑Kk=1 xijkβkj. A commonly used identifying assumption is βJ = 0: J is said to be the reference category.

The covariance matrix for observation i is written using Pi = diag(pi), a J × J diagonal matrix containing the probabilities pi1, …, piJ on the diagonal and zeros in the off-diagonal entries. The covariance matrix is E[(yi − mipi)(yi − mipi)′] = σ2mi(Pi − pipi′) with σ2 > 0. If σ2 = 1, the covariance matrix is the covariance matrix of a multinomial model. There will be mipij(1 − pij) in the jth diagonal entry and −mipijpij′ in the (j, j′) off-diagonal position. If σ2 > 1, the model is said to be overdispersed relative to the multinomial. This model is discussed in Peter McCullagh and John Nelder's book, Generalized Linear Models.

The robust aspect of the estimation algorithm produces two estimates for σ2. One, σ̂2LQD, derives from a least quartile difference estimator (a generalized S-estimator), and the other, σ̂2tanh, derives from a hyperbolic tangent estimator, which is a redescending M-estimator. Observations with an absolute studentized residual greater than 1.803 are downweighted, and observations with an absolute studentized residual greater than 4.0 are declared outliers and given a weight of zero. Walter Mebane and Jasjeet Sekhon present details in their American Journal of Political Science article, "Robust Estimation and Outlier Detection for Overdispersed Multinomial Models of Count Data." Robust estimators are used to produce inferences that are less sensitive to the choice of a certain model.

The logit is the inverse of the logistic function used to define probabilities. Let category ℓ be the reference category used to define logitℓ(pj) = log(pj/pℓ). Recall that ∑Jk=1 pk = 1. Now

pj = (pj/pℓ) / ∑Jk=1 (pk/pℓ) = exp(logitℓ(pj)) / ∑Jk=1 exp(logitℓ(pk)).

Through this equation, estimated coefficients and predictors on the linear scale (the µij's from before) are translated into probabilities, which are between zero and one. Consider the model for votes in 2009 with results from the 2005 election comprising the regressors (predictors). If, in the linear predictor for Ahmadinejad, the coefficient for Ahmadinejad's 2005 vote were 1.0 and all the other coefficients were zero, then the generalized logit formulation means his 2009 vote proportion in each town would be expected to equal his corresponding 2005 proportion. In fact, the estimated coefficient is less than 1.0 and other coefficients are statistically significantly different from zero.

To understand the interactions between the turnout ratio and the 2005 vote proportions, consider the relationship between Karroubi's 2005 vote and his vote proportion in 2009. When the turnout ratio is 1.0, the point estimate for the coefficient that applies to Karroubi's transformed 2005 vote is β53 + β11,3 × (turnout ratio) = −0.0359 + 0.3570(1.0) = 0.3211. These numbers from Table 1 can be found in the 5th row, 3rd coefficient column (−0.0359) and the 11th row, 3rd coefficient column (0.3570). If the turnout ratio had been 1.1, the prediction would have been −0.0359 + 0.3570(1.1) = 0.3568.
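To connect the sidebar's worked example back to probabilities, here is a minimal Python sketch of the generalized-logit transformation. Only the Karroubi coefficient combination (−0.0359 + 0.3570 × ratio) and the constants come from Table 1; using the constants as linear predictors amounts to setting every covariate to zero, which is purely illustrative.

```python
import math

def probs_from_linear_predictors(mu):
    """Generalized-logit (multinomial logit) transformation:
    p_j = exp(mu_j) / sum_k exp(mu_k)."""
    e = {j: math.exp(m) for j, m in mu.items()}
    total = sum(e.values())
    return {j: v / total for j, v in e.items()}

# Table 1 constants, reference candidate Mousavi fixed at 0 (illustration only):
print(probs_from_linear_predictors(
    {"Mousavi": 0.0, "Ahmadinejad": 1.40, "Rezaei": -3.63, "Karroubi": -4.42}))

# Combined coefficient on Karroubi's 2005 logit at two turnout ratios,
# matching the sidebar's worked example:
for ratio in (1.0, 1.1):
    print(ratio, round(-0.0359 + 0.3570 * ratio, 4))   # 0.3211, 0.3568
```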
meaningful. Ballot counts of either 28 or 78 occur in only 1.8% of the ballot boxes, but it is not clear what else may be special about those two values. None of the four candidates received an unusual proportion of the votes in ballot boxes with totals at those values, and none of the candidates exhibits vote counts that vary significantly as a function of the total vote remainder. A simple analysis of variance gives an F-statistic with p-value 0.037 for Rezaei, but that is not significant at the α = .05 level once multiple testing—four candidates—is taken into account.
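A minimal sketch of the total-ballot remainder check described above: tally the remainders of each box's total count modulo 50, compare each remainder's share to the uniform expectation of 1/50, and apply a Benjamini-Hochberg (FDR) adjustment over the 50 tests, mirroring the approach in the text. The data vector is hypothetical, and the binomial test is one reasonable choice rather than the article's exact procedure.

```python
from collections import Counter
from scipy.stats import binomtest

def remainder_shares(totals):
    """totals: list of total ballot counts per ballot box.
    Returns {remainder mod 50: share of boxes}; uniform expectation is 0.02."""
    c = Counter(t % 50 for t in totals)
    n = len(totals)
    return {r: c[r] / n for r in range(50)}

def bh_flagged_remainders(totals, alpha=0.05):
    """Two-sided binomial tests of each remainder's frequency against 1/50,
    selected with the Benjamini-Hochberg false discovery rate rule."""
    n = len(totals)
    c = Counter(t % 50 for t in totals)
    pvals = sorted((binomtest(c[r], n, 1 / 50).pvalue, r) for r in range(50))
    k = max((i for i, (p, _) in enumerate(pvals, 1) if p <= alpha * i / 50),
            default=0)
    return sorted(r for _, r in pvals[:k])
```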
Assessing the Statistical Evidence

Combining 2005 and 2009 town-level data conveys the impression that while natural political processes significantly contributed to the election outcome, outcomes in many towns were produced by different processes. Much more often than not, these poorly modeled observations have vote counts for Ahmadinejad that are greater than the naturalistic model would imply. Ballot-box data show evidence of significant distortions in the vote counts not only for Karroubi and Rezaei, but also for Ahmadinejad. The appearance of an association between invalid ballots and the ballot-box second digits strongly suggests there was extensive ballot-box stuffing (or other fakery) on Ahmadinejad's behalf.

Does the current analysis prove the election was stolen? A lot rests on explaining the pattern in Figure 2, which shows Ahmadinejad's vote share increasing steeply as the proportion of invalid votes decreases. Is there any reasonable explanation of that pattern aside from ballot-box stuffing? It is possible that Ahmadinejad actually won, with many who might have voted for Karroubi or Rezaei instead strategically voting for Ahmadinejad. The likelihood of such votes being cast needs to be assessed based on information beyond what can be extracted from the 2005 and 2009 election returns. To support a benign interpretation of the Iranian election, the additional evidence needs to explain how the strong support for Ahmadinejad happens to line up so strongly with the proportion of invalid votes in the ballot-box vote counts. A recount conducted of 10% of the ballot boxes, as reported by Michael
Weissenstein in an Associated Press article titled “Iran Declares Election Fight Over, Vote Valid,” did not turn up any problems. As there is little reason to believe the ballots were preserved in a secure way since the election, however, the recount fails to provide strong evidence regarding the validity of the results. Evidence is lacking regarding the chain of custody from the time each voter supposedly voted to the moment each ballot was recounted. The opposition declared the official procedure for considering their challenges to the election fundamentally biased and refused to participate in it. Tests such as those considered here can, in general, only identify places where there may be problems with the votes. The tests’ best use is for screening election results, not confirming or refuting claims of fraud. A significant finding should prompt investigations using administrative records, witness testimony, and other facts to try to determine what happened. The problem with the 2009 Iranian election is that the serious questions that have been raised are unlikely to receive satisfactory answers. Transparency is utterly lacking in this case. There is little reason to believe the official results announced in that election accurately reflect the intentions of the voters who went to the polls. Author’s Note: Analysis was conducted using publicly available information. The initial analysis was written as a report drafted on June 14, 2009, and then updated during the next two weeks as new data became available. The data used in the report and this article were downloaded from Iranian Ministry of the Interior web sites and translated and reformatted.
Further Reading

Ansari, Ali, Daniel Berman, and Thomas Rintoul. 2009. Preliminary analysis of the voting figures in Iran's 2009 presidential election. Chatham House, www.chathamhouse.org.uk/publications/papers/view/-/id/755.

Associated Press. 2009. Former leader says Iran trial a sham. August 2.

Bahrampour, Tara. 2009. Militia adds fear to time of unrest. The Washington Post. June 19.

Benjamini, Yoav, and Yosef Hochberg. 1995. Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B 57(1):289–300.

Bowman, Adrian W., and Adelchi Azzalini. 1997. Applied smoothing techniques for data analysis: The kernel approach with S-Plus illustrations. Oxford: Clarendon Press.

Cho, Wendy Tam, and Brian Gaines. 2007. Breaking the (Benford) law: Statistical fraud detection in campaign finance. The American Statistician 61:218–223.

Erdbrink, Thomas, and William Branigin. 2009. Iran's leaders warn more election protests will not be tolerated. The Washington Post. June 30.

Erdbrink, Thomas, and William Branigin. 2009. Obama, in boldest terms yet, presses Iran to halt violence against own people. The Washington Post. June 21.

Hill, Theodore P. 1995. A statistical derivation of the significant-digit law. Statistical Science 10:354–363.

McCullagh, Peter, and John A. Nelder. 1989. Generalized linear models. New York: Chapman & Hall.

Mebane, Walter R. Jr., and Jasjeet S. Sekhon. 2004. Robust estimation and outlier detection for overdispersed multinomial models of count data. American Journal of Political Science 48:392–411.

Mebane, Walter R. Jr. 2006. Detecting attempted election theft: Vote counts, voting machines, and Benford's law. Paper prepared for the 2006 Annual Meeting of the Midwest Political Science Association, Chicago.

Mebane, Walter R. Jr. 2006. Election forensics: Vote counts and Benford's law. Paper prepared for the 2006 Summer Meeting of the Political Methodology Society, University of California-Davis.

Mebane, Walter R. Jr. 2006. Election forensics: The second-digit Benford's law test and recent American presidential elections. Paper prepared for the Election Fraud Conference, Salt Lake City.

Mebane, Walter R. Jr. 2007. Election forensics: Statistics, recounts, and fraud. Paper presented at the 2007 Annual Meeting of the Midwest Political Science Association, Chicago.

Mebane, Walter R. Jr. 2007. Evaluating voting systems to improve and verify accuracy. Paper presented at the 2007 Annual Meeting of the American Association for the Advancement of Science, San Francisco, and the Bay Area Methods Meeting, Berkeley.

Mebane, Walter R. Jr. 2007. Statistics for digits. Paper presented at the 2007 Summer Meeting of the Political Methodology Society, Pennsylvania State University.

Mebane, Walter R. Jr. 2008. Election forensics: The second-digit Benford's law test and recent American presidential elections. In The Art and Science of Studying Election Fraud: Detection, Prevention, and Consequences, ed. R. Michael Alvarez, Thad E. Hall, and Susan D. Hyde. Washington, DC: Brookings Institution.

Mebane, Walter R. Jr. 2009. Note on the presidential election in Iran, www.umich.edu/~wmebane/note29jun2009.pdf.

Mebane, Walter R. Jr., and Kirill Kalinin. 2009. Comparative election fraud detection. Paper presented at the 2009 Annual Meeting of the Midwest Political Science Association, Chicago.

Press TV. 2009. Guardian council: Over 100% voted in 50 cities. Associated Press. June 21.

R Development Core Team. 2009. R: A language and environment for statistical computing. http://cran.r-project.org/doc/manuals/refman.pdf.

Rodriguez, Ricardo J. 2004. First significant digit patterns from mixtures of uniform distributions. The American Statistician 58:64–71.
Multiple Imputation in the Anthrax Vaccine Research Program Michela Baccini, Samantha Cook, Constantine E. Frangakis, Fan Li, Fabrizia Mealli, Donald B. Rubin, and Elizabeth R. Zell
Bacillus anthracis under a microscope
Anthrax, caused by the bacterium Bacillus anthracis, can be a highly lethal acute disease in humans and animals. Prior to the 20th century, it led to thousands of deaths each year. Anthrax infection became extremely rare in the United States in the 20th century, thanks to extensive animal vaccination and anthrax eradication programs. The 2001 anthrax attacks in the United States drew this formidable disease back into the public spotlight. Anthrax spores have not long been used in biological warfare or as a terrorism weapon, but anthrax has been on the bioterrorism (BT) agent list for a long time. U.S. military personnel are now routinely vaccinated against anthrax prior to active service in places where biological attacks are considered a threat.

The current FDA-licensed vaccine, anthrax vaccine adsorbed (AVA), is produced from a nonvirulent strain of the anthrax bacterium. The licensed regimen for AVA is subcutaneous administration of a series of six primary doses (zero, two, and four weeks and six, 12, and 18 months), followed by annual booster doses. In 1998, a pilot study conducted by the U.S. Department of Defense provided preliminary data suggesting AVA could be given intramuscularly and without the two-week dose, reducing adverse events and not adversely impacting immunogenicity. Since 2000, the CDC has been planning and conducting a clinical trial, the Anthrax Vaccine Research Program (AVRP), to evaluate a reduced AVA schedule and a change in the route of administration in humans. The AVA trial is a 43-month
prospective, randomized, double-blind, placebo-controlled trial for the comparison of immunogenicity (i.e., immunity) and reactogenicity (i.e., side effect) elicited by AVA given by different routes of administration and dosing regimens. Administration is subcutaneous (SQ) versus intramuscular (IM). Dosing regimens require as many as eight doses versus as few as four doses. In the AVRP, sterile saline is used as the placebo when a dose is not actually administered. The trial is being conducted among 1,005 healthy adult men and women (18–61 years of age) at five U.S. sites. Participants were randomized into one of seven study groups. One group receives AVA as currently licensed (SQ with six doses followed by annual boosters). Another group receives saline IM or SQ at the same time points as the currently licensed regimen. The five other groups receive AVA IM: one group at the same time points as the currently licensed regimen and the remaining groups in modified dosing regimens. Placebo is given when a dose of AVA is omitted from the licensed dosing regimen. There are 25 required visits over 42 months, during which all participants receive an injection of vaccine or placebo (eight injections total), have a blood sample drawn (16 total), and have an in-clinic examination for adverse events (22 total). Total anti-protective antigen IgG antibody (anti-PA IgG levels) is measured using a standardized and validated enzyme-linked immunosorbent assay (ELISA). The primary study endpoints are four-fold rise in antibody titer, geometric mean antibody concentration, and titer. All adverse events,
including vaccine reactogenicity, are actively monitored. Several reactogenicity endpoints are assessed. Potential risk factors for adverse events (e.g., sex, pre-injection anti-PA IgG titer) also are recorded. The AVA study has been significant because, as a result of the interim analysis, the FDA approved the change in the route of AVA administration from SQ to IM. However, as with other complex experimental and observational data, the AVRP data creates various challenges for statistical evaluation. One such challenge is how to handle the missing data generated by dropouts, missed visits, and missing responses. The simplest complete data analysis that drops any subjects with missing data is not applicable here, because even though the overall missing rate is low at this time (3.4%), only 56 among the approximately 2,000 variables are fully observed and only 208 subjects have fully observed variables. Filling in missing values by copying the last recorded value for a subject on a particular variable (the “last observation carried forward” approach) likely is not a good idea in this situation, because side effects and immune response likely will vary over time and not remain constant. Randomly choosing a case with observed data to serve as a donor of values to a case with missing data (the “hot-deck” strategy) could be problematic due to the high degree of missing values and the need to express uncertainty after imputation. During the last two decades, multiple imputation (MI) has become a standard statistical technique for dealing with missing data. It has been further popularized by several software packages (e.g., PROC MI in SAS, IVEware, SOLAS, and MICE). MI generally involves specifying a joint distribution for all variables in a data set. The data model is often supplemented by a prior distribution for the model parameters in the Bayesian setting. Multiple imputations of the missing values are then created as random draws from the posterior predictive distribution of the missing data, given the observed data. MI has been successfully implemented in many large applications. Two such applications are described in “Filling in the Blanks: Some Guesses Are Better Than Others” and “Healthy for Life: Accounting for Transcription Errors Using Multiple Imputation,” both published in CHANCE, Vol. 21, No. 3. MI for the AVRP substantially increases the challenge, mostly due to the large number and different types of variables in the data set, the limited number of units within each treatment arm (Imputations should be done independently across treatment arms to avoid cross-contamination among groups.), and, most important, theoretical incompatibility in the imputation algorithms used by current available packages such as IVEware and MICE. Another important issue is how to evaluate the imputations, a question that has been largely neglected in most of the MI applications.
High-Dimensional Imputation

Consider a data set with K variables, labeled Y1, ..., YK, each defined as a vector over a common set of N subjects. Each entry can be missing or observed, so M1, ..., MK are the vectors such that Mk,i is 1 or 0, indicating whether Yk,i is missing for subject i.
Imputation with a Joint Model

Probability models for two parts of the problem need to be specified to do imputation. First, one can postulate a model (also called a likelihood) for the joint distribution of the variables, given some parameters: pr(Y1, ..., YK | θ), where θ is a vector of model parameters. One also specifies a prior distribution pr(θ) for the model parameters. The prior distribution can express what is known about the parameters. In other cases, a prior distribution is specified in a manner to have little impact on the analysis, but to enable computation of posterior predictive distributions for missing values. For example, if the Yi have independent Bernoulli distributions with an unknown probability of success θ1, then a common choice for the prior distribution of θ1 is a Beta(α, β) distribution. Small values of α and β correspond to a prior distribution with little influence over the analysis. For another example, if all the Y variables have binary outcomes, pr(Y1, ..., YK | θ) could be specified by a log-linear model. The parameters of the log-linear model are the components of the vector θ.

Second, one needs to consider why the data are missing. Assumptions about why the data are missing are translated into a probability model for the indicator vectors M1, ..., MK. The probability distribution for these vectors is often referred to as the missing data mechanism. An ignorable missing data mechanism is an assumption that the data that are missing are missing not because of what values would have been observed had they been observed, but because of factors associated with the observed data. This assumption can include the existence of different missing rates for different known groups. If a large collection of covariate variables is available for subjects, such as characteristics collected at study baseline, it is fairly common to assume an ignorable missingness mechanism and build a statistical model to predict response. It is often convenient to generate the imputations using simulation techniques, such as a Gibbs sampling algorithm.
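As a toy version of the Bernoulli/Beta example above, here is a minimal Python sketch of drawing multiple imputations from the posterior predictive distribution under an ignorable missingness mechanism. The Beta(1, 1) prior and the choice of five imputations are arbitrary illustrative settings, not those of the AVRP analysis.

```python
import random

def impute_binary(y, m=5, alpha=1.0, beta=1.0):
    """y: list with 0/1 for observed entries and None for missing ones.
    Returns m completed copies of y, each drawn from the posterior predictive
    distribution under a Bernoulli(theta) model with a Beta(alpha, beta) prior."""
    obs = [v for v in y if v is not None]
    s, n = sum(obs), len(obs)
    completed = []
    for _ in range(m):
        theta = random.betavariate(alpha + s, beta + n - s)   # posterior draw
        completed.append([v if v is not None else int(random.random() < theta)
                          for v in y])
    return completed

# Example: five imputations of a binary vector with two missing entries
print(impute_binary([1, 0, 1, None, 1, None, 0]))
```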
Figure 1. A missing data pattern that is transformable to a fully monotone missing data pattern. In the transformation, both variables and units are reordered.
In the current application, a Gibbs sampler is defined by the following steps:
1. Start at preliminary values for the missing data of the variables Y−k, where Y−k is defined to be all the Ys except Yk.
2. For k = 1, ..., K: generate a random value, θ̃, from the current estimate of the posterior distribution pr(θ | Yk, Y−k, M), which is based on the likelihood, the assumptions about the missingness mechanism, and the prior pr(θ).
3. Simulate the missing values of Yk from the current estimate of the posterior predictive distribution pr(Yk | Y−k, θ̃).
When the missing data are ignorably missing, Step 2 can ignore the missingness mechanism and M can be omitted from the notation. As described in Bayesian Data Analysis, the repetition of steps 2 and 3 generally produces simulated values of the missing data that converge in distribution to their posterior predictive distribution under the model.

Practical Complications with High Dimensions
When there are many variables to be imputed, finding a plausible model for the joint distribution pr(Y1, ..., YK | θ) is difficult to accomplish for two common ways of postulating joint models. One such way, which postulates a joint model on all variables simultaneously (e.g., a multivariate normal model), is not flexible enough to reflect the structure of complex data such as the AVA data, which include continuous, semicontinuous, ordinal, categorical, and binary variables. A second way is to postulate a joint model sequentially: postulate a marginal distribution for Y1 first, then a conditional distribution for Y2 given Y1, and so on. If the order of postulation matches the order of a monotone pattern (a special case of missing data pattern), then this way is workable and efficient. This is the fundamental basis for SOLAS. However, if the pattern is not monotone, then one has to compute the
complete conditional distributions pr(Yk | Y−k, θ), which is often impractical because of the complex relation among the parameters of these models. These complications have led researchers to all but abandon the effort to postulate a joint model. Instead, they follow the intuitive method of specifying directly—for each variable with missing values—the univariate conditional distribution pr(Yk | Y−k, θ) given all other variables. Such univariate distributions take the form of regression models and can accurately reflect different data types. The approaches used with such postulations then follow steps 2 and 3. Software such as IVEware and MICE impute missing data this way. There is a catch, though. If one chooses the conditional distributions pr(Yk | Y−k, θ) for k = 1, ..., K directly, there is generally no joint distribution “compatible” with them (i.e., whose conditional distributions pr(Yk | Y−k) equal those of the models chosen). This disagreement is called “incompatibility.” In addition to being theoretically unsatisfactory, incompatibility has the practical implication that the repetition of steps 2 and 3 does not generally lead to a convergent distribution. Some methods choose a particular ordering of variables, starting the imputation with the variables that have the fewest missing values to aid convergence. However, the deeper problem of incompatibility is currently ignored. In “Fully Conditional Specification in Multivariate Imputation,” published in the Journal of Statistical Computation and Simulation, S. Van Buuren and colleagues have presented evidence that in simple cases with ‘good’ starting values, the procedure works acceptably well in practice.
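The sketch below (a rough illustration, not the IVEware or MICE code) shows the flavor of this fully conditional approach for an all-numeric data frame: cycle repeatedly through the variables, regress each on all the others using the currently imputed values, and redraw its missing values from the fitted conditional model. A complete implementation would use type-specific models and would also draw the regression parameters rather than fixing them at their estimates.

```r
impute_fcs <- function(dat, n_iter = 20) {
  miss <- is.na(dat)
  for (k in seq_along(dat))                       # crude starting values: variable means
    dat[miss[, k], k] <- mean(dat[[k]], na.rm = TRUE)

  for (iter in seq_len(n_iter)) {
    for (k in seq_along(dat)) {
      if (!any(miss[, k])) next
      obs    <- !miss[, k]
      others <- dat[, -k, drop = FALSE]
      fit  <- lm(dat[[k]][obs] ~ ., data = others[obs, , drop = FALSE])  # univariate conditional model
      pred <- predict(fit, newdata = others[miss[, k], , drop = FALSE])
      dat[miss[, k], k] <- rnorm(length(pred), pred, summary(fit)$sigma) # stochastic redraw
    }
  }
  dat
}
```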
Monotone Missingness Blocks
An important case to note is a special type of missing data pattern, depicted in Figure 1. In this pattern, one observes that if, for subject i, the variable Y′k,i is missing, then the variable Y′k+1,i is also missing. That is, if M′k,i = 1, then M′k+1,i = 1. Roderick Little and Donald Rubin call such a pattern a “monotone missing data pattern” in Statistical Analysis with Missing Data. The set of
Figure 2. A simple geometric example to illustrate the concept of continuity, which states that objects defined to be relatively similar to one another should have relatively similar properties.
Figure 3. A missing data pattern transformed to monotone blocks pattern of missing data guided by the concept of continuity.
Table 1—Summary of Missing Data and Monotone Blocks by Treatment Arm in Currently Available AVA Data

Treatment Arm:                           0     1     2     3     4     5     6
Number of Subjects:                    165   170   168   166   167    85    84
Number of Missing Values:              927  1372  1558  1383  1325   252   334
Number of Patterns:                     15    13    13    15    15     7     9
Percent in 1st Monotone Block:          45    74    65    79    74    74    87
Percent in First 3 Monotone Blocks:     75    84    85    90    89    91    93
rectangles in the right side of Figure 1 is called a “block.” The pattern in Figure 1 is called a monotone block pattern.

Monotonicity and Compatible Sequential Models
With a monotone pattern, the likelihood of the data under ignorability can be factored sequentially as the product of the following terms:
(i) the product of pr(Y′1,i | θ) for subjects with observed Y′1
(ii) the product of pr(Y′2,i | Y′1,i, θ) for subjects with observed Y′1 and Y′2
(iii) the product of pr(Y′3,i | Y′2,i, Y′1,i, θ) for subjects with observed Y′1, Y′2, and Y′3
Suppose now one postulates models pr(Y′k,i | Y′1,i, ..., Y′k−1,i, θk) for k = 1, ..., K (i.e., in the order that matches the monotone order of missingness). If one assumes the parameters θk are independent in the prior distribution, each model can be fitted separately from the nine subjects’ data as indicated in (i)–(iii), leading to imputations of missing values, with no need to iterate a Gibbs sampler. Moreover, the sequential postulation of the models ensures full compatibility. This advantage is strictly limited to the monotone missing data pattern. However, it sheds light on how one could design the imputations to minimize incompatibility.

Continuity as a Guide to Approach an Ideal Case
For this application, a field of study satisfies continuity if objects defined to be relatively similar have relatively similar properties. An illuminating example of continuity relates to isoperimetric problems, as shown in Figure 2. The first problem asks how to shape a fully flexible fence (Figure 2a) so it encloses maximum area; the answer is a circle (Figure 2a′). Suppose now that, instead of a fully flexible fence, one has a fence that connects eight equal straight segments at flexible joints (Figure 2b), and suppose one asks how to shape this fence to enclose maximum area. Observe that the second fence can be thought of as similar to the first, except for the inflexibility along the segments. Now, assuming the Euclidean geometry is continuous for such problems, one should expect that the
best shape with the restricted fence would be similar to the best shape with the unrestricted fence. Indeed, the best shape with the restricted fence is the regular octagon (Figure 2b′), which in some sense is the closest shape to the circle, given the constraint.

The Proposal of Monotone Blocks
The proposal of monotone blocks is a natural extension of Rubin’s method of multiple imputation using a single major monotone pattern (MISM), which exploits a single major monotone block. One can eliminate incompatibility by rearranging the data set to have a completely monotone pattern. On the other hand, deviations from monotonicity can have degrees. Thus, if one rearranges a data set, such as in Figure 3a, to a pattern close to monotone, such as in Figure 3a′, one will be close to eliminating incompatibility. This argument relates to the example of continuity in Figure 2 if one makes the following relations: relate “if one can rearrange the data set to have a completely monotone pattern” to “if one can rearrange the fence to be completely circular,” and relate “then one can eliminate incompatibility” to “then one can maximize the area within the fence.” Therefore, to apply the argument of continuity here, one first identifies a rearrangement of the data set such that the missing values not forming part of a monotone block are minimal. The part that is monotone is labeled the “first” monotone block. In Figure 3a, the first monotone block consists of the top pattern of missing values, and the bottom pattern of missing values consists of those that do not form part of the first monotone block. For those missing values, one repeats the process, identifying a rearrangement so most form a monotone block, with the rest of the missing values being minimal. The process continues until all missing values have been identified with a monotone block.
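For intuition (this is an illustrative sketch, not the authors' software), the code below imputes a single monotone block of all-numeric variables in one sequential pass: order the variables so missingness only accumulates, then impute each variable given the variables before it. Under a truly monotone pattern no Gibbs iteration is needed; parameter uncertainty, which a fully Bayesian version would include, is ignored here.

```r
monotone_impute <- function(dat) {
  dat <- dat[, order(colSums(is.na(dat))), drop = FALSE]   # least-missing variables first
  stopifnot(!anyNA(dat[[1]]))                              # assume the first variable is complete
  for (k in 2:ncol(dat)) {
    miss <- is.na(dat[[k]])
    if (!any(miss)) next
    earlier <- dat[, 1:(k - 1), drop = FALSE]              # complete by now (observed or imputed)
    fit  <- lm(dat[[k]][!miss] ~ ., data = earlier[!miss, , drop = FALSE])
    pred <- predict(fit, newdata = earlier[miss, , drop = FALSE])
    dat[miss, k] <- rnorm(sum(miss), pred, summary(fit)$sigma)
  }
  dat
}
```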
In the AVA data, we applied this process separately for each treatment arm. The information about the missing data and the monotone blocks of the currently available data is shown in Table 1. As one can see, even though the total number of monotone blocks can be large, the first monotone block usually dominates, covering a large proportion of all missing data. On average, the first three monotone blocks include more than 85% of the missing values in each arm. After the monotone blocks were obtained, we imputed the missing data within each block for each arm as follows:
(i) Start with filling the missing data of all but the first monotone block with preliminary values.
(ii) Fit Bayesian sequential models and simulate the missing values for the first monotone block using steps corresponding to steps 1–3.
(iii) Treat the data imputed for the first monotone block as observed and impute the missing values for the second monotone block.
(iv) Continue across all the monotone blocks.
We then iterated steps (ii), (iii), and (iv) until we detected that the simulations had converged to the desired distributions.
Flexible Regression Models
A benefit of modeling univariate conditional distributions instead of large joint distributions is that it is easy to specify and fit different types of outcomes with different types of models. We classified the outcome variables in the AVA data into the following types, according to the models we planned to apply:
1. Binary outcome with two observed levels
2. Categorical outcome with either three (ordered or unordered) levels, or four unordered levels
3. Ordered categorical outcome with at least four, but at most 11, observed levels that have a natural ordering
4. Continuous outcome, defined here as an ordered outcome with more than 11 observed levels and with no extreme level having an observed frequency of at least 20%
5. Mixed continuous outcome—an ordered outcome with more than 11 observed levels and with one of the two extreme levels having an observed frequency of at least 20%
Different models are used for different types of outcome variables. For unit i, an outcome variable is denoted by Yi and the set of predictors by the vector Xi. Of course, in this conditional modeling approach, an outcome variable in one model can be used as a predictor variable for a different outcome. In that case, the variable can be denoted Yi for one situation and included in the Xi vector in the other.
A binary outcome is assumed to follow a logistic regression likelihood given covariates Xi, with a “pseudocount” prior distribution for the regression coefficients. The logistic regression likelihood specifies that the probability that Yi is equal to 1 given the values of Xi is

logit{pr(Yi = 1 | Xi, β)} = X′i β ,   (1)
where β is a vector of logistic regression coefficients and the logit function of a probability p is ln(p/(1 – p)). This is the standard transformation used in logistic regression. The inverse of this transformation is pr(Yi = 1 | Xi, β) = exp(X′i β)/(1 + exp(X′i β)). The likelihood is the product over all patients of pr(Yi = 1 | Xi, β) when Yi is 1 and 1 – pr(Yi = 1 | Xi, β) when Yi is 0. The pseudocount prior distribution essentially imagines the existence of additional prior observations, some with Yi equal to 1 and some equal to 0. The appendix to the 2004 edition of Rubin’s book Multiple Imputation for Nonresponse in Surveys describes this prior specification further. Based on the implied posterior distribution for β, a random value of β is drawn using an iterative simulation algorithm. Based on the draw of β, the missing values of Y are imputed independently across subjects from the logistic regression model. That is, for each subject, given a value of β, the probability that Y is 1 is computed. Using this probability, a Bernoulli random variable is generated as 1 or 0.

A categorical outcome with three levels or with four or more (L) unordered levels is completely characterized by L – 1 sequential binary regressions. Specifically, one can completely characterize such an outcome by a variable describing “whether the person belongs in level 1,” then a variable describing “whether the person belongs in level 2 among those who do not belong in level 1,” and so on until all levels have been described. Each regression is fitted as described for the binary logistic regression type. A missing value for Yi is then imputed by simulating sequentially the indicators for the events {Yi = 1}, ..., {Yi = L – 1} until one indicator is drawn as 1. If all the indicators are drawn as 0, then Yi is set to L.

A categorical outcome with four or more ordered levels is modeled as the continuous type variable. This modeling is preferred to a proportional odds or probit fit for computational stability. The imputed values are rounded to the nearest level observed in the data.

A continuous outcome is assumed to follow a normal regression likelihood given covariates Xi, with a pseudocount prior distribution for the regression coefficient β and the logarithm of the residual variance σ2. Based on a random draw of β and σ2, the missing values of Y are imputed independently across subjects from the normal model.

For a mixed continuous-binary outcome, one assumes here that the extreme value accounting for at least 20% of the mixed outcome variable Y is 0 and that the remaining values are positive. The model assumed for this type is a logistic regression for Y being at 0 and a normal regression for the log of the positive values of Y. The posterior distribution for the parameters is obtained by fitting these models separately, the first model to the indicator of Y being positive and the second model to the subjects with positive values of Y. A missing value of Y is then imputed by first imputing the indicator of the value being 0 or positive, and, if positive, imputing a value using the normal regression.
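As a rough sketch of the binary case (not the authors' R and Fortran implementation), the function below fits the logistic regression in equation (1) to the observed cases, draws β from a normal approximation to its posterior (standing in for the pseudocount-prior posterior), and imputes the missing values as Bernoulli draws. The covariates in X are assumed numeric.

```r
impute_binary <- function(y, X) {
  obs <- !is.na(y)
  fit <- glm(y[obs] ~ ., data = X[obs, , drop = FALSE], family = binomial)

  # Approximate posterior draw of beta: normal centered at the MLE.
  beta_hat  <- coef(fit)
  beta_draw <- beta_hat + drop(t(chol(vcov(fit))) %*% rnorm(length(beta_hat)))

  eta <- cbind(1, as.matrix(X[!obs, , drop = FALSE])) %*% beta_draw
  p   <- 1 / (1 + exp(-eta))             # inverse logit, as in the text
  y[!obs] <- rbinom(sum(!obs), 1, p)     # impute independently across subjects
  y
}
```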
Fitting and imputation using each model are as described for the binary and continuous cases above. As there are around 2,000 variables and only 1,005 subjects in the AVA data, we constrained the number of predictors that entered the conditional model for each outcome. The predictor selection took place before the imputation procedure. It was based on a preliminary imputation of all the missing values from their empirical marginal distributions. We allowed the predictors for each outcome variable to differ across different arms and monotone patterns. Demographic variables (i.e., age and sex) were fully observed and always included in the model. For each outcome with missing values, the potential predictors were all the variables that were more observed (i.e., with fewer missing values) than the outcome. Ideally, we would have selected from the potential predictors by a stochastic search method such as stepwise regression based on Akaike’s information criterion (AIC). The computational demand of conducting such a search among 2,000 variables for each outcome in each monotone block, however, was prohibitively large. Instead, we fit the regression models of the outcome given each single potential predictor, together with age and sex, and sorted the predictors according to the corresponding AICs. Then, the 20 predictors with the smallest AIC were selected. Finally, we checked the “fittability” of the conditional model, which simultaneously includes all the selected predictors, on the complete cases. Fittability is defined as invertibility of the corresponding design matrix. It is checked sequentially on the subsets of predictors sorted by AIC in a backward fashion. If one predictor is not “fittable,” it does not contain enough information on the outcome and is selected mostly because of the model assumption. Therefore, this predictor is dropped from the subset and the same checking goes on to the next selected predictor until the last one. This checking procedure is done separately within each treatment arm and monotone pattern.
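A schematic version of this screening step (hypothetical data frame and variable names, and a Gaussian model standing in for the outcome-specific models) might look like the following: fit the outcome on each candidate predictor together with the always-included age and sex, sort by AIC, and keep the 20 best.

```r
screen_predictors <- function(dat, outcome, candidates, n_keep = 20) {
  aics <- sapply(candidates, function(v) {
    f <- reformulate(c("age", "sex", v), response = outcome)   # e.g., y ~ age + sex + v
    AIC(lm(f, data = dat))
  })
  names(sort(aics))[seq_len(min(n_keep, length(aics)))]        # predictors with smallest AIC
}
```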
Software coded jointly in R and Fortran has been developed to address each of the univariate problems (i.e., one imputation of one outcome in one treatment arm). Several simulation studies show the software handles data well.
Evaluation Plan
Evaluating methods is always as important as developing them. Much work has been done in proposing and applying imputation methods, but remarkably little in corresponding evaluations. Comparing the imputed values to the observed values (i.e., the ‘truth’) is the most intuitive evaluation, but is neither generally possible nor a valid method of evaluation. Instead, one simulates missing data in a fully observed subset of the data set and imputes these created missing data. Then, one compares the inferences based on the imputed data set to the inferences based on the original, complete data set. A similar approach was taken in the Third National Health and Nutrition Examination Survey (NHANES III) evaluation by Andrew Gelman and Rubin in the Statistical Methods in Medical Research journal article “Markov Chain Monte Carlo Methods in Biostatistics” and by Trivellore Raghunathan and Rubin in their article, “Roles for Bayesian Techniques in Survey Sampling,” which appeared in the proceedings of the Silver Jubilee Meeting of the Statistical Society of Canada.

To avoid conflict of interest, the AVA team was split into two subteams: the evaluation team and the imputation team. The evaluation team first created artificial missingness in a subset of complete data. The imputation team, blinded to the truth, then imputed the created missing data. After that, the evaluation team took over again. They derived and compared the inferences based on the imputed and complete data. The evaluation team first defined a set of key variables (Y1, ..., Yt), the values of which are of primary interest in the analysis. Then, they identified the missing patterns of the key variables presented in the data set. A missing data pattern is a unique vector of the t corresponding missing indicators. There are at most 2^t missing patterns of t variables. For example, in the analysis of the immunogenicity of AVA, the CDC defined the ELISA measurement at four weeks, eight weeks, six months, and seven months as the key variables. A unit (person) who has an ELISA measurement for, say, the first three time points belongs to the missing pattern (0, 0, 0, 1). Defining the complete subset as the set of units whose key variables are fully observed, the evaluation team then simulated 200 copies of the complete subset with varying missing data patterns and proportion of missingness as follows:

Step 1: Count the numbers of units that belong to each pattern. Rank the missing patterns by their sizes in decreasing order. (Pattern 1 is the most prominent missing pattern.)

Step 2: For pattern j = 0, ..., J (0 means fully observed), use logistic regression with a pseudocount prior distribution to model the probability of being in pattern j versus being in patterns j + 1, ..., J given all covariates X. Denote the estimated intercept and coefficients (α̂j, β̂j). The estimated probability pi,j of unit i being in pattern j is logit⁻¹((α̂j, β̂j)(1, Xi)ᵀ).
Step 3: Choose α0 so that the overall proportion of missing data is approximately equal to 10%. For each unit i, calculate its probability of being fully observed: logit⁻¹((α0, β̂0)(1, Xi)ᵀ). Then, randomly assign units to be fully observed based on these probabilities. Next, for each unit that is assigned to be missing (i.e., not fully observed), calculate its probability of belonging to missing pattern 1, logit⁻¹((α̂1, β̂1)(1, Xi)ᵀ), and randomly assign it to pattern 1 (versus patterns 2 to J). Continue this procedure through pattern J – 1. In the end, each unit in the complete set belongs to a missing pattern (including the pattern of fully observed) and the overall missing proportion is 10%.

Step 4: Choose values of α0 that give overall missing proportions of approximately 20%, 30%, 40%, and 50% and repeat Step 3 40 times for each α0.

The imputation team, upon receiving the altered data sets from the evaluation team, used the monotone block imputation procedure to provide five imputed data sets for each of the 200 copies at each value of α0. Finally, the evaluation team calculated the empirical 95% (and 90%, 80%, and 50%) confidence intervals for a set of summarizing functions of each key variable (e.g., a mean) from the imputed data sets. They compared these to the corresponding intervals based on the original, complete data set. Evaluations like this are rarely done, but often give rather disappointing results for standard (nonmissing data) procedures.
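The sketch below condenses Steps 2 and 3 into one function (hypothetical inputs; the real evaluation used pseudocount priors and separate tuning of α0 for each target missing proportion). Given the observed pattern labels and fully observed covariates, it fits the sequential logistic regressions and then randomly reassigns units to patterns, with the first intercept replaced by the tuned value α0.

```r
simulate_patterns <- function(pattern, X, alpha0) {
  X <- as.matrix(X)                      # fully observed covariates
  J <- max(pattern)                      # patterns 0 (fully observed) through J
  assigned <- rep(NA_integer_, nrow(X))
  at_risk  <- rep(TRUE, nrow(X))         # units not yet assigned to a pattern

  for (j in 0:(J - 1)) {
    use <- pattern >= j                  # units observed to be in patterns j, ..., J
    fit <- glm(as.integer(pattern[use] == j) ~ X[use, ], family = binomial)
    b   <- coef(fit)
    if (j == 0) b[1] <- alpha0           # tuned intercept controls the overall missing rate
    p <- plogis(cbind(1, X[at_risk, , drop = FALSE]) %*% b)
    pick <- at_risk
    pick[at_risk] <- runif(sum(at_risk)) < p
    assigned[pick] <- j
    at_risk <- at_risk & !pick
  }
  assigned[at_risk] <- J                 # remaining units fall in the last pattern
  assigned
}
```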
Remarks
Motivated by the missing data problem arising from the AVA trial, we developed a general approach to handling nonmonotone missing data from data sets that have a large number of variables and many types of data structures. The design of the imputations we used here breaks the nonmonotone pattern into blocks of separate patterns, each of which is monotone and can be handled with sequential modeling. There can, of course, be other ways of building around the ideal case of monotonicity to address nonmonotone patterns. An interesting problem is to explore generalizable ways of comparing such approaches in how they best reduce the impact of incompatibility. Within monotone patterns, the large number of variables is still an issue. It was handled here in an approach that looks at each regression isolated from the others. Perhaps it would be beneficial from a predictive standpoint to have a more integrated assessment of reducing the dimension of the models. Moreover, one has to ask how answers to these problems can help design the study better. For example, for which variables do the missing data most affect the overall efficiency or compatibility? An easy and generalizable way to answer such questions could suggest, for example, offering increased incentives for the collection of these data or developing proxy measurements that would have higher observation rates. So far, we have focused on addressing the missing values for the variables intended to be measured in the study with human subjects. A practically more important task is to address missing measurements not intended to be obtained in this study—these values are the survival status of the human subjects if, after the vaccination, they had been exposed to anthrax. For predicting this survival, there is little information from the
human study alone because exposing humans to lethal anthrax doses is not ethical, given the risks of such exposure. For this reason, in parallel to the study with humans, CDC has been conducting a study with macaques. The macaque study is similar to the human study with the important exception that after vaccination, the macaques are exposed to anthrax and their response—including survival status—is measured. An important question, therefore, is how to bridge the human and macaque information to estimate the survival of humans if the vaccination regime under study had been followed by an exposure to anthrax. An answer to this question is not trivial and depends on what parts of the model among dosage, immunogenicity response, and survival are generalizable between the macaque and human studies. Thus, advancing knowledge in this problem will have to rely not on a single study, but on how this study is combined with existing knowledge. Editor’s Note: The findings and conclusions in this article are those of the authors and do not necessarily represent the views of their institutions.
Further Reading

Frangakis, C. E., and D. B. Rubin. 2002. Principal stratification in causal inference. Biometrics 58:21–29.

Gelman, A., J. B. Carlin, H. S. Stern, and D. B. Rubin. 2004. Bayesian data analysis (2nd ed.). Boca Raton: Chapman and Hall.

Gelman, A., and D. B. Rubin. 1996. Markov chain Monte Carlo methods in biostatistics. Statistical Methods in Medical Research 5(4):339–355.

Little, R. J. A., and D. B. Rubin. 2002. Statistical analysis with missing data (2nd ed.). New Jersey: Wiley.

Raghunathan, T. E., and D. B. Rubin. 1998. Roles for Bayesian techniques in survey sampling. Proceedings of the Silver Jubilee Meeting of the Statistical Society of Canada, 51–55.

Rao, J. N. K., W. Jocelyn, and N. A. Hidiroglou. 2003. Confidence coverage properties for regression estimators in uniphase and two-phase sampling. Journal of Official Statistics 19:17–30.

Rubin, D. B. 1987, 2004. Multiple imputation for nonresponse in surveys. New York: Wiley.

Rubin, D. B. 2003. Nested multiple imputation of NMES via partially incompatible MCMC. Statistica Neerlandica 57(1):3–18.

Van Buuren, S., J. P. L. Brand, C. G. M. Groothuis-Oudshoorn, and D. B. Rubin. 2006. Fully conditional specification in multivariate imputation. Journal of Statistical Computation and Simulation 76(12):1049–1064.
All Together Now: A Perspective on the NETFLIX PRIZE
Robert M. Bell, Yehuda Koren, and Chris Volinsky
When the Netflix Prize was announced in October of 2006, we initially approached it as a fun diversion from our ‘day jobs’ at AT&T. Our group had worked for many years on building profiles of customer patterns for fraud detection, and we were comfortable with large data sets, so this seemed right up our alley. Plus, it was about movies, and who doesn’t love movies? We thought it would be a fun project for a few weeks. Boy, were we wrong (not about the fun part, though). Almost three years later, we were part of a multinational team named as the winner of the $1 million prize for having the greatest improvement in root mean squared error (RMSE) over Netflix’s internal algorithm, Cinematch. The predominant discipline of participants in the Netflix Prize appears to have been computer science, more specifically machine learning. While something of a stereotype, machine
learning methods tend to center on algorithms (black boxes), where the focus is on the quality of predictions— rather than ‘understanding’ what drives particular predictions. In contrast, statisticians tend to think more in terms of models with parameters that carry inherent interest for explaining the world. Leo Breiman’s article, “Statistical Modeling: The Two Cultures,” which was published in Statistical Science, provides various views on this contrast. Our original team consisted of two statisticians and a computer scientist, and the diversity of expertise and perspective across these two disciplines was an important factor in our success.
Fundamental Analysis Challenge
The Netflix Prize challenge concerns recommender systems for movies. Netflix released a training set consisting of data from almost 500,000 customers and
their ratings on 18,000 movies. This amounted to more than 100 million ratings. The task was to use these data to build a model to predict ratings for a hold-out set of 3 million ratings. These models, known as collaborative filtering, use the collective information of the whole group to make individualized predictions. Movies are complex beasts. Besides the most obvious characterization into genres, movies differ on countless dimensions describing setting, plot, characters, cast, and many more subtle features such as tone or style of the dialogue. The Movie Genome Project (www.jinni.com/movie-genome.html) reports using “thousands of possible genes.” Consequently, any finite model is likely to miss some of the signal, or explanation, associated with people’s ratings of movies. On the other hand, complex models are prone to overfitting, or matching small details rather than the big picture—especially where data are scarce.
For the Netflix data, the numbers of ratings vary by at least three orders of magnitude among both movies and users. Whereas there are some users who have rated more than 10,000 movies, the average number of ratings per user is 208, and more than one quarter rated fewer than 50 movies. Overfitting is particularly a concern for these infrequent raters. This leads to the fundamental challenge for the Netflix data: How can one estimate as much signal as possible where there are sufficient data without overfitting where data are scarce? Our winning approach combined tried-and-true models for recommender systems with novel extensions to these models, averaged together in an ensemble.
Nearest Neighbors
At the outset of the Netflix competition, the most commonly used collaborative filtering method was nearest neighbors. Gediminas Adomavicius and Alexander Tuzhilin give an overview of the state of the art in “Towards the Next Generation of Recommender Systems: A Survey of the State of the Art and Possible Extensions,” published in IEEE Transactions on Knowledge and Data Engineering. With nearest neighbors, the predicted rating for an item by a user might be a weighted average rating of similar items by the same user. Similarity is measured via Pearson correlation, cosine similarity, or another metric calculated on the ratings. For example, we might expect neighbors of the movie “Saving Private Ryan” to include other war movies, other movies directed by Steven Spielberg, and other movies starring Tom Hanks. A typical nearest neighbors model, as described in “Item-Based Collaborative Filtering Recommendation Algorithms” presented at the 10th International World Wide Web Conference by Badrul Sarwar, George Karypis, Joseph Konstan, and John Riedl, estimates the rating r_ui of item i by user u to be

r̂_ui = Σ_{j∈N(i;u)} s_ij r_uj / Σ_{j∈N(i;u)} s_ij ,   (1)
where N(i; u) is the set of neighbors of item i that were rated by user u and sij is the similarity between items i and j. A nice feature of nearest neighbor models
The Contest That Shaped Careers and Inspired Research Papers
Steve Lohr

Back in October of 2006, when Netflix announced its million-dollar prize contest, the competition seemed to be a neat idea, but not necessarily a big one. The movie rental company declared it would pay $1 million to the contestant who could improve its web site’s movie recommendation system by 10% or more. The contest presented an intriguing problem—and a lucrative one for the winner. For Netflix, it was a shrewd ploy that promised to pay off in improved service and publicity.

But the Netflix contest, which lasted nearly three years, turned out to have a significance that extended well beyond movie recommendations and money. The competition became a model of Internet-era collaboration and innovation, attracting entries from thousands of teams around the world. The leading teams added members as they sought help to improve their results and climb up the Netflix leaderboard. Team members were often located in different countries, communicating by email and sharing work on the web with people they never met face to face. This kind of Internet-enabled cooperative work—known as crowdsourcing—has become a hot topic in industry and academia. The Netflix contest is widely cited as proof of its potential.

Steve Lohr covers technology, economics, and (occasionally) statistics for The New York Times. He has written for The New York Times Magazine, The Atlantic, and The Washington Monthly and is the author of Go To: The Story of the Math Majors, Bridge Players, Engineers, Chess Wizards, Maverick Scientists, and Iconoclasts—the Programmers Who Created the Software Revolution.

The Netflix competition also became a leading exhibit by enthusiasts of “prize economics.” Internet technology makes it possible to tap brainpower worldwide, but those smart people need an incentive to contribute their time and synapses. Hence, the $1 million prize. The prize model is increasingly being tried as a new way to get all kinds of work done, from exploring the frontiers of science to piecework projects for companies. The X Prize Foundation, for example, is offering multimillion-dollar prizes for path-breaking advances in genomics, alternative energy cars, and private space exploration. InnoCentive is a marketplace for business projects, where companies post challenges—often in areas such as product development and applied science—and workers or teams compete for cash payments or prizes. A start-up, Genius Rocket, runs a similar online marketplace mainly for marketing, advertising, and design projects. The emerging prize economy, according to labor experts, does carry the danger of being a further shift in the balance of power toward the buyers—typically corporations—and away from most workers. At first glance, there did seem to be an element of exploitation in the Netflix contest. Thousands of teams from more than 100 nations competed, and it was a good deal for the company. “You look at the cumulative hours and you’re getting PhDs for a dollar an hour,” said Reed Hastings, the chief executive of Netflix. Yet, the PhDs I talked to in covering the Netflix contest for The New York Times were not complaining. Mostly, they found the challenge appealing, even though hardly any were movie buffs. “It’s incredibly alluring to work on such a large, high-quality data set,” explained Joe Sill, an analytics consultant who holds a PhD in machine learning from the California Institute of Technology.
Some professors used the Netflix challenge to introduce graduate students to collaborative research on a big and interesting problem. At Iowa State University, for example, 15 graduate students, advised by five faculty members, tackled the Netflix challenge for a semester, until final exams pulled them away. It proved a rich, real-world test tube for trying out a range of statistical models and techniques. In four months, the Iowa State team improved on the Netflix internal recommendation system by about 4%. “We weren’t doing bad[ly], but it was very time-consuming and eventually the rest of the world passed us by,” said Heike Hofmann, an associate professor of statistics.

Indeed, the main lure for the contestants—even more than a chance for a big payday—was to be able to experiment in applying tools of statistics, computing, and machine learning to the big Netflix data set, 100 million movie ratings. They spent their own time, or took work time with their companies’ blessing, because they knew the lessons learned would be valuable beyond the Netflix case. In the online world, automated recommendation systems increasingly help—and shape—the choices people make not only about movies, but also books, clothing, restaurants, news, and other goods and services. And large-scale modeling, based on ever-larger data sets, is being applied across science, commerce, and politics. Computing and the web are creating new realms of data to explore—sensor signals, surveillance tapes, social network chatter, public records, and more. The skills and insights acquired in the Netflix quest promise to be valuable in many fields.

So, what was the biggest lesson learned? The power of collaboration and combining many models, the contestants said, to boost the results. In a nail-biter finish, two teams each had the same score above the 10% threshold, though the winner made its entry 20 minutes earlier. Both teams used the mash-up of models approach, with no single insight, algorithm, or concept responsible for the performance. “Combining predictive models to improve results,” said Trevor Hastie, a professor of statistics at Stanford University, “is known as the ‘ensemble’ approach.” (The runner-up team, in fact, was called The Ensemble.) Model averaging is another statistical term used to describe such approaches. “It may not be very elegant or satisfying—it would be nice if some produced the natural single method—but that is what worked,” said Hastie, who was not a contestant. “And success with ensemble methods was made possible by the huge amount of data.”

The winning team and the near-miss loser were alliances. The winner, BellKor’s Pragmatic Chaos, started as three AT&T researchers. (One later joined Yahoo Research, but remained on the team.) Two other two-person teams, from Austria and Canada, came on board as the contest progressed. The AT&T scientists participated in the contest with their company’s approval because it was seen as a worthwhile research project. That would have been a smart decision, the researchers agree, even if their team had not landed the prize. “The Netflix contest will be looked at for years by people studying how to do predictive modeling,” said Chris Volinsky, director of statistics research at AT&T. The scientists and engineers on the second-place team—and the employers who gave many of them the freedom to compete in the contest—agreed the time and toil was worth it.
Arnab Gupta, chief executive of Opera Solutions, took a small group of his leading researchers off other work for two years. “We’ve already had a $10 million payoff internally from what we’ve learned,” Gupta said. The Netflix prize contest shaped careers, inspired research papers, and spawned at least one start-up. Shortly after the $1 million contest concluded, Sill said he and other members of The Ensemble team were talking about commercializing their hard-earned knowledge. “There’s nothing concrete yet, but we’re exploring several avenues,” he said.
is that the resulting recommendations are easy to explain to users by pointing to the neighboring items that the user rated highly. Although equation (1) is intuitive, it raises many concerns. First, the choice of similarity metric is arbitrary, without formal justification. Second, relative weights do not depend on the composition of a neighborhood, so highly correlated neighbors of a movie tend to get “double counted.” Finally, predictions are particularly unreliable for movies with few or no close neighbors. In “Scalable Collaborative Filtering with Jointly Derived Neighborhood Interpolation Weights,” presented at the 7th IEEE International Conference on Data Mining by Robert Bell and Yehuda Koren, linear regression is used to estimate customized mixing weights for each prediction to account for correlations among the set of available neighbors. To deal with missing data, these regressions use sufficient statistics from the derived estimates of the covariance matrix for item ratings. Empirical Bayes shrinkage is used to improve reliability of estimated covariances for item pairs with few ratings by a common user.
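The toy sketch below (made-up ratings, Pearson correlation as the similarity, and a tiny neighborhood) computes the prediction in equation (1) for one user–item pair; it is meant only to make the formula concrete, not to reproduce the refinements described above.

```r
R <- matrix(c(5, 4, 5, 4,
              4, 3, 4, 3,
              2, 1, 2, 2,
              3, 3, 4, 3,
              5, 4, NA, 5), nrow = 5, byrow = TRUE)   # users x items, NA = unrated

sim <- cor(R, use = "pairwise.complete.obs")          # item-item similarities s_ij

predict_knn <- function(u, i, k = 2) {
  rated <- which(!is.na(R[u, ]) & !is.na(sim[i, ]) & seq_len(ncol(R)) != i)
  nbrs  <- rated[order(-abs(sim[i, rated]))][seq_len(min(k, length(rated)))]
  sum(sim[i, nbrs] * R[u, nbrs]) / sum(sim[i, nbrs])  # weighted average, equation (1)
}

predict_knn(u = 5, i = 3)   # predicted rating of item 3 by user 5
```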
Matrix Factorization
Many of the most successful single models in the Netflix Prize were latent factor models—most notably ones that use matrix factorization. Matrix factorization characterizes both items and users by d-dimensional vectors of latent factors, where d is much smaller than either the number of movies or users, say d = 50. For movies, a factor might reflect the amount of violence, drama vs. comedy, more subtle characteristics such as satire or irony, or possibly some noninterpretable dimension. For each user, there is a vector in the same dimensional space, with weights that reflect the user’s taste for items that score high on the corresponding factor. Figure 1 shows output from a typical model, where a selected set of movies is shown with their loadings on the first two latent factors. An inspection of the first factor shows a separation of serious movies from silly comedies. On the right side, we see movies with strong female dramatic leads, contrasted on the left with movies aimed at the fraternity set. The second dimension separates critically acclaimed, independent,
Figure 1. Selected movies and their weights across the first two latent factors for a typical matrix factorization model
quirky movies at the top from large-budget Hollywood star–driven movies on the bottom. Latent factor models are attractive because they can extract these orthogonal ‘concepts’ out of the data, without requiring external information about genre, actors, or budget. In the basic matrix factorization model, a user’s interest in a particular item is modeled using the inner product of the user (pu) and item (qi) factor vectors, plus bias terms (bu for users and ai for items) for both. Specifically,
r̂_ui = μ + a_i + b_u + q_iᵀ p_u .   (2)
Ideally, the number of dimensions d could be set large, perhaps in the thousands, to capture as many of the countless subtle movie characteristics that affect users’ ratings as possible. For this simple model, however, predictive performance on a hold-out set begins to degrade for d > 5. One solution is
to minimize a penalized least squares function, which is one way to address overfitting and sparse data. Technically, the sum of squared errors (the first term in (3) below) is augmented by additional terms (four in this example):
Σ_{(u,i)∈training} (r_ui − r̂_ui)² + λ1 Σ_i a_i² + λ2 Σ_u b_u² + λ3 Σ_i ||q_i||² + λ4 Σ_u ||p_u||² ,   (3)
where the lambdas, which control the amount of regularization, are chosen to optimize the MSE on a hold-out set. Penalized least squares can be motivated by assuming each of the parameters, other than µ, is drawn from a normal distribution centered at zero. One option for solving equation (3) is to fit an iterative series of ridge regressions—one per user and one per item in each iteration.
Alternatively, as described in “Bayesian Probabilistic Matrix Factorization Using Markov Chain Monte Carlo,” presented at the 25th International Conference on Machine Learning by Ruslan Salakhutdinov and Andriy Mnih, a full Bayesian analysis is feasible for d of at least 300. In practice, we found that the most effective solution to equation (3) was usually stochastic gradient descent, which loops through the observations multiple times, adjusting the parameters in the direction of the error gradient at each step. This approach conveniently avoids the need for repeated inversion of large matrices. The power and flexibility of the matrix factorization model combined with efficient computational techniques allowed us to develop several important extensions to the main model. One such extension uses the set of items rated by a user to refine the estimated taste factors derived from the ratings themselves. The idea is that users select which items to rate, so that missing ratings
Figure 2. RMSEs for selected matrix factorization models and extensions versus numbers of parameters. (The plot shows RMSE for held-out data against millions of parameters on a logarithmic scale, with one curve each for basic matrix factorization, “+ what was rated,” “+ linear time factors,” and “+ per-day user factors,” and points labeled by the number of factors d, from 50 up to 1,500.)
are certainly not—in the terminology of Roderick Little and Donald Rubin in Statistical Analysis with Missing Data—missing at random. Consequently, the set of movies a user chooses to rate is an additional source of information about the user’s likes. Another extension allows many of the parameters to vary over time. For example, a movie’s popularity might drift over time, as might a user’s standard for what constitutes four stars. Of more consequence, a user’s tastes, as measured by pu, may change over time. Consequently, as described by Koren in “Collaborative Filtering with Temporal Dynamics” (see http://research.yahoo.com/files/kdd-fp074-koren.pdf), we allow the parameters {ai}, {bu}, and {pu} in equation (2) to change gradually or, perhaps, abruptly to allow for the possibility that different members of a family provide ratings on different dates. Additional regularization terms are required to shrink the new parameters.
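A compact sketch of fitting the basic model in equation (2) by stochastic gradient descent is shown below. The toy ratings, learning rate, and single shared penalty λ (standing in for λ1, ..., λ4 in (3)) are illustrative choices, not the values used in the competition.

```r
set.seed(42)
ratings <- data.frame(user = c(1, 1, 2, 2, 3, 3, 4),
                      item = c(1, 2, 1, 3, 2, 3, 1),
                      r    = c(5, 3, 4, 1, 2, 5, 4))
d <- 2; n_users <- 4; n_items <- 3
mu <- mean(ratings$r)
a  <- numeric(n_items)                                    # item biases a_i
b  <- numeric(n_users)                                    # user biases b_u
Q  <- matrix(rnorm(n_items * d, sd = 0.1), n_items, d)    # item factors q_i
P  <- matrix(rnorm(n_users * d, sd = 0.1), n_users, d)    # user factors p_u
gamma <- 0.01; lambda <- 0.05                             # step size and regularization

for (epoch in 1:200) {
  for (t in sample(nrow(ratings))) {                      # visit observations in random order
    u <- ratings$user[t]; i <- ratings$item[t]
    e <- ratings$r[t] - (mu + a[i] + b[u] + sum(Q[i, ] * P[u, ]))   # prediction error
    a[i]   <- a[i] + gamma * (e - lambda * a[i])
    b[u]   <- b[u] + gamma * (e - lambda * b[u])
    q_old  <- Q[i, ]
    Q[i, ] <- Q[i, ] + gamma * (e * P[u, ] - lambda * Q[i, ])
    P[u, ] <- P[u, ] + gamma * (e * q_old  - lambda * P[u, ])
  }
}
```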
Figure 2 displays the RMSE on test data for a variety of matrix factorization models with different numbers of parameters. The top curve shows performance for basic matrix factorization with d = 50, 100, and 200, resulting in approximately 25, 50, and 100 million parameters, respectively. The RMSE for Cinematch was 0.9514, so the 100-factor model achieves slightly better than a 5% improvement. For reference, a model that always predicts the movie average yields an RMSE of about 1.05. The remaining curves illustrate the improvement in RMSE attributable to the extensions outlined above. The second curve (from the top) shows that accounting for what a user rated produces about a 1% improvement for each value of d. The next curve illustrates the additional impact of allowing each {ai}, {bu}, and {pu} to change linearly over time. The final curve shows results for a model that allows each user bias {bu} and taste vector {pu} to vary arbitrarily around the
linear trend for each unique rating date. We see RMSEs continue to decline even as d grows to 1,500, resulting in 300 parameters per observation. Obviously, regularization was essential.
The More the Merrier
An early lesson of the competition was the value of combining sets of predictions from multiple models or algorithms. If two prediction sets achieved similar RMSEs, it was quicker and more effective to simply average the two sets than to try to develop a new model that incorporated the best of each method. Even if the RMSE for one set was much worse than the other, there was almost certainly a linear combination that improved on the better set. And, if two is better than one, then k + 1 must be better than k. Indeed, a hopelessly inferior method often improved a blend if it was not highly correlated with the other components.
At the end of the first year of the competition, our submission was a linear combination of 107 prediction sets, with weights determined by ridge regression. These prediction sets included results of various forms of both nearest neighbors and latent factor models, as well as other methods such as nearest neighbors fit to residuals from matrix factorization. This blend achieved an 8.43% improvement relative to Cinematch on the quiz data. Notably, our best pure single prediction set at the time (RMSE = 0.8888) achieved only a 6.58% improvement. The value of ensembles for prediction or classification is not a new insight. Robert Schapire and Yoav Freund’s boosting and Breiman’s random forests were derived from the idea of forming ensembles of purposely imperfect classifiers or regressions. Our blend differs fundamentally in that the individual components need not share anything. Others also have found ensembles of heterogeneous methods to be effective in various problems. Just as we found value in combining many diverse models and algorithms, we benefited from merging with other competitors who brought their own perspectives to the problem. In 2008, we merged with the BigChaos team, Michael Jahrer and Andreas Toscher, two computer science students from Austria. Their approach to blending models used nonlinear blends via neural networks, which we had not tried. In 2009, we added team Pragmatic Theory, consisting of Martin Chabbert and Martin Piotte, two engineers from Montréal. Among many contributions, they brought new models for incorporating rating frequency to model temporal behavior. After nearly 33 months, our combined team became the first to achieve a 10% improvement over Cinematch, triggering a 30-day period in which all competitors were allowed to produce their best and final submissions. Of course, what had worked so well for us was open to all. On the second-to-last day, a new team aptly named The Ensemble first appeared on the leaderboard. This mega team, which included members from 23 original teams, leapfrogged slightly ahead of us in the public standings, but the winner was left unclear. The actual winner was to be determined by a held-out test set about which Netflix had not provided
any feedback. At the time, nobody knew who actually won the Netflix Prize. As it turned out, the top two teams’ RMSEs on the test set differed by an unbelievably small margin—0.856704 for us, versus 0.856714 for The Ensemble—a difference of 0.00010. The conditions of the contest called for rounding all RMSEs to four decimal places, so the teams were tied. The tiebreaker was submission time of the best entries. We submitted our predictions 20 minutes earlier than The Ensemble. After three years, our margin of victory was 20 minutes. It was like winning a marathon by just a few inches.
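For readers curious about the mechanics, here is a toy sketch (simulated 'truth' and three imaginary prediction sets, and an arbitrary penalty) of blending by ridge regression, as described above for the year-one submission; the competition blends were, of course, far larger and tuned on the quiz set.

```r
set.seed(7)
truth <- runif(1000, 1, 5)                            # probe-set ratings
preds <- cbind(m1 = truth + rnorm(1000, sd = 0.95),   # three imperfect prediction sets
               m2 = truth + rnorm(1000, sd = 1.00),
               m3 = 0.5 * truth + 2 + rnorm(1000, sd = 1.10))

ridge_weights <- function(X, y, lambda = 1) {
  X1 <- cbind(1, X)                                   # add an intercept column
  solve(t(X1) %*% X1 + lambda * diag(ncol(X1)), t(X1) %*% y)
}

w     <- ridge_weights(preds, truth)
blend <- cbind(1, preds) %*% w
sqrt(mean((blend - truth)^2))                         # RMSE of the blended predictions
```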
Success
The Netflix Prize took three years to win and generated a lot of interest in and attention for the field of collaborative filtering and recommendations in general. It is widely considered a success from several perspectives.

Researchers already working in collaborative filtering were energized by the scale and scope of the Netflix data set, which was orders of magnitude bigger than previous recommendation data sets. The competition also attracted many new researchers into the field, as many of the top teams (including ours) had little or no experience in collaborative filtering when they started.

The Netflix Prize showed the success of bringing together the work of many independent people to develop new technology or business strategy, sometimes called crowdsourcing and popularized in the recent book The Wisdom of Crowds, by James Surowiecki. The idea behind crowdsourcing is that an ensemble of opinions, all generated independently and with access to different resources, will perform better than a single expert.

The Netflix Prize drove home the power of ensemble methods. Every leading team created a bucket of heterogeneous models and algorithms. Methods for optimal averaging of an ensemble based on limited information (e.g., we only had RMSE for each model) had a large impact on overall performance.

For the field of statistics, there is a clear lesson. Our team benefited from having teammates from many academic backgrounds, including computer science, machine learning, and engineering. These fields brought different perspectives on problem-solving and different
toolboxes to large data sets. Stepping outside of our domain was helpful in making difficult decisions that turned out to be important. As collaboration has always been at the heart of the statistics profession, this lesson should surprise no one.
Further Reading

Adomavicius, G., and A. Tuzhilin. 2005. Towards the next generation of recommender systems: A survey of the state of the art and possible extensions. IEEE Transactions on Knowledge and Data Engineering 17:734–749.

Bell, R. M., Y. Koren, and C. Volinsky. 2007. The BellKor solution to the Netflix Prize. www2.research.att.com/~volinsky/netflix/ProgressPrize2007BellKorSolution.pdf.

Bell, R. M., and Y. Koren. 2007. Scalable collaborative filtering with jointly derived neighborhood interpolation weights. Presented at the Seventh IEEE International Conference on Data Mining, Omaha.

Breiman, L. 2001. Statistical modeling: The two cultures (with discussion). Statistical Science 16:199–231.

Breiman, L. 2001. Random forests. Machine Learning 45(1):5–32.

Koren, Y. 2009. Collaborative filtering with temporal dynamics. http://research.yahoo.com/files/kdd-fp074-koren.pdf.

Koren, Y., R. Bell, and C. Volinsky. 2009. Matrix factorization techniques for recommender systems. Computer 42(8):30–38.

Little, R. J. A., and D. B. Rubin. 2002. Statistical analysis with missing data (2nd ed.). Hoboken: John Wiley & Sons.

Salakhutdinov, R., and A. Mnih. 2008. Bayesian probabilistic matrix factorization using Markov chain Monte Carlo. Presented at the 25th International Conference on Machine Learning, Helsinki.

Sarwar, B., G. Karypis, J. Konstan, and J. Riedl. 2001. Item-based collaborative filtering recommendation algorithms. Presented at the 10th International World Wide Web Conference, Hong Kong.

Schapire, R. 1990. The strength of weak learnability. Machine Learning 5:197–227.
Analysis of Engineering Data: The Case of Rocket Firings
Todd G. Remund

A piano sometimes reminds people of beautiful classical music—a favorite composer or musician may come to mind. Music, including instrumental harmonics, covers roughly one-half of the limits of human hearing. The piano ranges one-fifth of these tones, whereas the human voice can amount to slightly more than a twentieth of the overall span. More complex sound sources, such as rockets, actually produce an intense medley of tones, many of which are within the limits of human hearing. Few people probably think of rockets when asked what produces music. However, many scientists would argue that there are aesthetic properties to an ignited rocket: the feel in the air, the pops and crackles, the heat and the brightness. From a statistical perspective, a huge amount of data can be recorded while firing a rocket, including information about the frequencies of sound produced. To a scientist or engineer, these data are rich with information that can be extracted using statistical methods that are common to many applications in several fields of science.
Sound, Light, Time Frequencies
It would be fair to say that most people are not concerned about the analysis of frequencies in time series data. But this does not mean they do not benefit from analyzed frequencies, specifically those found in sound. Consider how a prism breaks light into its different components. Most have seen a demonstration of a white light beamed through a prism producing a spectrum of colors on the other side. This is a decomposition of white light into light frequency components. Similarly, ears are designed to decompose sound into a cascade of frequencies, or unique vibrations. Nerve receptors in the ear transfer these high to low frequency messages to the brain, where they are processed as the sense of hearing. In all actuality, this sense is a frequency input to the brain. In engineering sciences, there are various types of time series data—acoustic, pressure, force. All are extremely valuable in providing refined ‘senses’ to scientists. Although it is possible for a scientist to listen to a rocket fire, it would take a superhuman mind to gather information about rocket performance by exclusively using the sense of hearing. A more refined sense must be developed, that of signal processing and statistics. People can augment what their brains intuit with computer-generated graphs and tables that bridge the gap between what we can hear and what we can learn in detail about a rocket or system. Some tones or frequencies in a system are unexpected and unwelcome. Others are necessary and natural. For instance, cars have both welcome and unwelcome vibrations, many of which are audible. Turning the ignition and hearing the car engine roar to life is a welcome sound; it means the engine works. On the other hand, when a car door squeaks, the owner experiences an irritating sound that is fixed by applying a little oil to the affected hinge. In some instances, unwanted frequencies in a system cannot be fixed physically or electrically. Sometimes, it is a result of electrical pollution during the gathering of the data. In any case, it is present as an annoyance. Other times, a particular frequency is of interest in an analysis, but it is present in the overall sound and it must be isolated from the overall spectrum of frequencies found in the data. How might either of these
situations be satisfied? This is where digital filters make their grand entrance. Filtering is sometimes the effectual oil fixing the squeak. Other times, it is like a precision surgical instrument during brain surgery.
What Is Digital Filtering and DSP?
In the book The Scientist and Engineer’s Guide to Digital Signal Processing, Steven Smith gives a good description of the field of signal processing: “Digital signal processing is distinguished from other areas in computer science by the unique type of data it uses: signals. … DSP is the mathematics, the algorithms, and the techniques used to manipulate these signals after they have been converted into digital form.” The set of tools and skills used in DSP is shared with or derived from many fields of science. This sharing between statistics methodology and digital signal processing is ingrained in the subject matter as well: DSP books will have a chapter on statistics, and time series analysis books will contain a chapter or two on methods that are fundamentally DSP. A digital filter is conceptually simple. The difficulty comes in understanding the environment in which these filters are applied. Once the environment is understood, the general concept of the filter can be boiled down to a simple statement. Ideally, digital filters multiply the desired information by one and all other information by zero. For the present consideration, the term “ideal filter” refers to a function that multiplies desired frequency content by one and all others by zero. Signals can be filtered in two alternate environments or domains: the time domain and the frequency domain. In the frequency domain, time series data are represented as functions of frequency. Frequency has units of hertz (Hz), which represent the number of cycles per second. Recall the ear, through which frequency information is passed to the brain. The sense of hearing is not specifically designed to analyze data in the manner we would like, at least not with the sophistication needed for scientific inquiry. For example, hearing a rocket ignite and burn gives useful information only if we hear sounds that can affect our safety. If we hear an explosion, our
brains instantly analyze the sound data and send a message such as “Duck!” Our brains do not say, “The hull of the rocket has been breached in such and such location because the doohickey has failed and thus is spewing hot chunks of material that are currently X meters from my present location, traveling at a velocity of Y mph. Therefore, jump right and roll under the nearest boulder to avoid certain death.” Superheroes might be able to do this in a split second, but the average person cannot. A refined sense for scientific inquiry requires additional help from computers and mathematical methods. Thus, frequency information is attained from data and sometimes analyzed with the help of digital filters.
Information, whether in the time or frequency domain, is completely equivalent. Nothing is lost by transforming from one domain to the other. Signals can be moved or transformed to either domain using an algorithm called the Fast Fourier Transform (FFT), which greatly speeds up computation of the filtered signal for large data sets. Digital filters can be applied in either of these environments. In the time domain, a digital filter is applied using convolution. Filtering is carried out in the frequency domain using elementwise multiplication of the filter coefficients and the signal’s frequency domain components. Aside from faster computation, digital filtering in the frequency domain is more intuitive.
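As a rough illustration of this equivalence (not from the article), the following NumPy sketch builds a toy signal, moves it to the frequency domain with the FFT, verifies that the inverse transform recovers it exactly, and applies an “ideal” low-pass filter by elementwise multiplication. The sampling rate and signal content are hypothetical.

```python
import numpy as np

# A toy signal: a 4 Hz component plus 60 Hz interference, sampled at an assumed 1000 Hz.
fs = 1000.0
t = np.arange(0, 5, 1 / fs)                          # 5 seconds of data
x = np.sin(2 * np.pi * 4 * t) + 0.5 * np.sin(2 * np.pi * 60 * t)

# Move to the frequency domain; the round trip loses nothing.
X = np.fft.rfft(x)
freqs = np.fft.rfftfreq(len(x), d=1 / fs)
assert np.allclose(np.fft.irfft(X, n=len(x)), x)     # time and frequency views are equivalent

# An "ideal" low-pass filter: keep content up to 4.15 Hz, multiply everything else by zero.
H = (freqs <= 4.15).astype(float)
x_filtered = np.fft.irfft(X * H, n=len(x))           # elementwise multiply, then invert
print(np.allclose(x_filtered, np.sin(2 * np.pi * 4 * t), atol=1e-6))   # 60 Hz tone removed
```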
An Illustrated Example
Imagine you are on a trip in your new high-tech futuristic concept car. It has all the bells and whistles. It even has a number of sensors that detect vertical and horizontal motion. After a period of time, you trade places with your 16-year-old son, who has just learned to drive. You wake up at the desired location after dozing for an hour. As you get out of the car, you notice the spare tire has replaced one of your low-profile, high-performance tires. Your son informs you that he hit a tire iron on the side of the road a while back. After thanking him for changing the tire, you decide to turn the situation into a learning experience. You pose the question, “How can the time of the flat be estimated?” The first step is to get the time series data from one of your car’s high-tech sensors. The second step is to filter it.
Figure 1. Time domain amplitude of car motion (in G) versus time (seconds) during a period of a hypothetical trip
After looking at the raw signal with your son, you notice there is a spot in the data in which the sensor has a different pattern than normal and you decide to zoom in on the spot (Figure 1). The computer on the high-tech car recorded the speed during that time interval as 25 mph. Performing some quick calculations, you realize the revolutions per second (rps) of the tire are roughly 4. If there were a flat tire at a specific time, the motion of the car would pick up the thumping of the flat tire at roughly four thumps per second. Filtering the data at 4.15 Hz will reveal any 4 Hz motion by removing all the frequency information above this value. Figure 1 shows the motion of the car in the time domain before filtering. Figure 2 shows the frequency domain, where it is apparent that there is a definite frequency contribution at 120, 60, 6, and 4 Hz. Normally, there is electronic line pollution at multiples of 60 Hz; this represents a nuisance part of the data. At 6 Hz, another annoying pattern is found. You would like to see only the 4 Hz content. One way to do this is through low-pass filtering to remove all content above 4 Hz. Filtering is done using the windowed-sinc filter, a type of finite impulse response (FIR) filter. After having filtered the data at 4.15 Hz, the approximate change in car motion is revealed (Figure 3). Close inspection shows the flat
Figure 2. A frequency domain representation of the data in Figure 1, where amplitude is measured as acceleration (G), phase measured in radians (rad), and frequency measured in Hertz (Hz)
occurred at roughly 1,242 seconds, or 20 minutes 42 seconds after the switch of drivers. The time domain graph of the filtered signal, Figure 3, was used to locate the rough time point at which the behavior of the car motion changed. Figure 4, the frequency domain, indicates the amount of frequency content left after filtering. Notice the simplicity of the concept of filtering: All unwanted frequency components are essentially nullified. Although there may be more precise methods to find the time of the flat tire or any particular event in a signal, this example illustrates the ability of a digital filter to reveal underlying information in a signal and portrays the simplicity of the digital filter concept. Digital filters have the power to expose not just all significant information, but the specific parts of the significant information of most interest. There are methodologies in statistics and DSP that can reveal such information in the frequency domain; however, digital filtering can parse this information and portray it in the time domain.
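A minimal sketch of this kind of analysis, using synthetic data rather than the article’s sensor record: a 4 Hz “thump” is switched on at an assumed time of 1,242 seconds, mixed with mild road noise, and recovered by convolving the record with a windowed-sinc (FIR) low-pass kernel cut off at 4.15 Hz. The sampling rate, noise level, and amplitudes are all made up for illustration.

```python
import numpy as np

fs = 100.0                                    # assumed sampling rate (Hz)
t = np.arange(0, 2000, 1 / fs)                # a stretch of the hypothetical trip, in seconds
rng = np.random.default_rng(0)

# Car motion: road noise throughout, plus a 4 Hz thump once the tire goes flat.
flat_time = 1242.0                            # seconds (the value found in the article)
motion = 0.05 * rng.standard_normal(t.size)
motion += 0.3 * np.sin(2 * np.pi * 4 * t) * (t > flat_time)

# Windowed-sinc (FIR) low-pass kernel with a 4.15 Hz cutoff.
cutoff, numtaps = 4.15, 1001
n = np.arange(numtaps) - (numtaps - 1) / 2
h = np.sinc(2 * cutoff / fs * n) * np.blackman(numtaps)
h /= h.sum()                                  # unit gain at 0 Hz

# In the time domain, filtering is convolution with the kernel.
filtered = np.convolve(motion, h, mode="same")
print("4 Hz motion emerges near t =", t[np.argmax(np.abs(filtered) > 0.15)], "seconds")
```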
Infinite Data and Ideal Filters: Reality of a Bias-Variance Tradeoff
The filtering concept described above is essentially true, where unwanted information—or frequency content—is
multiplied by zero. However, a minor adjustment needs to be made to the filter function in reality. The ideal filter—multiplying unwanted information by zero—is only optimal if certain assumptions can be made. If the data set is of infinite length and the distance in time between each measurement is infinitesimally small, then the ideal filter works and produces perfectly filtered data. The same must also be true of the filter function. Figure 5 depicts what is meant by “infinite” data, as indexed by sample resolution. Infinite data have an infinite number of samples between any two time points and extend from negative to positive infinity in time. In Figure 5, you can see that as you increase the resolution (bottom to top), the samples get closer. Also depicted is the overall length of the signal becoming larger and approaching positive and negative infinity. Thus, infinite data are sometimes defined as a limit of successive sets of data. So, if we record the car motion forever with no breaks in time, we can perfectly filter the data. There are many good reasons why this is not possible. Recording devices only have so much accuracy. Also, we would need to change tires, change the oil, and refill the gas. Not to mention, we should have been recording since before the beginning of time.
Figure 3. The time domain data after filtering with a low-pass filter. The car motion due to the flat becomes more visible.
There are a number of approaches to creating filters that approximate the behavior of the ideal filter for finite data and provide a tradeoff between bias and variance. In DSP, bias and variance are referred to as resolution and stability. The tradeoff between bias and variance—or resolution and stability—can be carried out using the FIR filter. This filter has a windowing function as part of the mechanics that controls the tradeoff. The scientist doing the filtering has a choice of what type of window function to use and the window length. A longer window produces high resolution (small bias) but low stability (large variance). A shorter window length gives low resolution (larger bias) and high stability (smaller variance). Balancing this tradeoff takes practice and a little more understanding of the behavior of this phenomenon. The sequence of steps can be summarized as follows: The frequency domain values are found by calculating the FFT of the data (gray trace and phase found in Figure 4). We multiply these values, per frequency, by the filter coefficients (black trace in Figure 6). This produces the frequency domain data after filtering (black traces in Figure 4). The inverse FFT is then applied to convert this information back into the time domain as the filtered signal (black trace in Figure 3).
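The resolution side of this tradeoff can be seen directly in the filter’s frequency response. The sketch below (hypothetical sampling rate and window lengths, not values from the article) builds two Blackman-windowed sinc kernels and measures how sharply each rolls off near the 4.15 Hz cutoff; the longer window gives the narrower transition, at the cost of the stability described above.

```python
import numpy as np

fs, cutoff = 100.0, 4.15        # assumed sampling rate (Hz) and cutoff frequency

def windowed_sinc(numtaps):
    """Blackman-windowed sinc low-pass kernel of the given length."""
    n = np.arange(numtaps) - (numtaps - 1) / 2
    h = np.sinc(2 * cutoff / fs * n) * np.blackman(numtaps)
    return h / h.sum()                       # unit gain at 0 Hz

for numtaps in (51, 1001):                   # short window vs. long window
    h = windowed_sinc(numtaps)
    H = np.abs(np.fft.rfft(h, n=8192))       # magnitude of the frequency response
    freqs = np.fft.rfftfreq(8192, d=1 / fs)
    # Transition width: how far the response takes to fall from 90% to 10% of full gain.
    width = freqs[np.argmax(H < 0.1)] - freqs[np.argmax(H < 0.9)]
    print(f"{numtaps:4d} taps: transition width about {width:.2f} Hz")
```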
Figure 4. Filtered data shown in the frequency domain doesn’t necessarily reveal anything new about the signal.
Figure 5. An illustration of infinite data. Moving from the bottom to the top, the number of data points increases with the resolution and the negative and positive limits become infinite.
Figure 6. The basic idea of the filtering concept: The wanted amplitudes are kept and the others are eliminated.
Figure 7. Low-pass filter vs. linear regression. Both perform the same.
Described by linear regression, the digital filter takes on new light in the eyes of statisticians, becoming a method to model data. Note that presenting the digital filter as a linear regression is more for conceptual ideas—such as finding the expected value of the filter and delving into its behavior—than for day-to-day use in filtering. The FFT is much more computationally efficient in this respect. By once again considering the conceptual idea of multiplying the unwanted frequency content by zero and the desired content by one, it may become apparent that the filtered data are parallel to a reduced/full model fit. The reduced model is related to the filtered data. Conceptually, when we fit a reduced model with only some of the sine and cosine functions to the data and compute a prediction, we effectively are filtering data. An interesting avenue resulting from this thought is that one may use this method to fit an optimal line to the data with respect to significant frequency content.
With any device built by humans, there are always issues or interesting phenomena. It is easy to find public-domain descriptions of rocket behavior; sloshing is one. Consider the common water dispenser found in many offices. It has a large plastic bottle atop and paper cups dispensed on the side. Imagine tying one of these half-full bottles to your back and trying to run a marathon. The behavior of the water as your body shifts will cause you to be pulled from side to side. When you stop, the water sloshing to one side of the container may jolt you forward a step or two. Now imagine this same issue in the liquid tank connected to the shuttle. The liquid tank on the shuttle is the heaviest part of the vehicle. If the vehicle turns, rolls, or speeds up, there is always a reaction in the liquid contained in this tank. In many situations, engineers would like to use mathematical models to predict the behavior of the fluids in the tank so they can account for this behavior in the algorithms controlling the vehicle. To validate these predictions, methods based on the same foundation as the simple analysis done in Figure 2 are used: comparing what actually happened in the liquid with what was predicted to happen. Digital filters can be used as utilities to perform certain data mining or data conditioning procedures. Sloshing has been well accounted for in the development of liquid propulsion. Based on engineers’ findings, different measures have been put into practice that deal with this issue and others like it. In effect, the beast has been tamed using mathematical models, frequency analysis techniques, and the ingenuity and innovation of engineers and scientists.
Digital Filtering as a Linear Regression
Description of digital filters as a linear regression begins by specifying the predictors. In statistics textbooks, many authors include information about Fourier analysis of time series data. Regression of a time series record on sine and cosine functions often is the initial step in the description of these analyses. With this in mind, it can be said that the finite impulse response (FIR) filter can be presented as a linear regression of the signal on these sine and cosine functions. The basic idea can be described using Figure 6. The information as indexed by frequency—the amplitudes (gray trace)—is found using the least squares parameters. The Fast Fourier Transform (FFT) and least squares are two alternate ways of finding the frequency domain values of the data. A diagonal filter matrix can be constructed using the filter coefficients (black trace). Let’s call this matrix the F matrix, for filter matrix, where the diagonal elements are the frequency domain filter coefficients as stated. To express a digital filter as a linear regression, first the predictor is found using the sine and cosine functions. Note there are N observations in the time series. The design matrix, in a linear regression of transformed time values and time series data, is a matrix full of sine and cosine functions as given below, where j = 1,...,N.
where
with
and
The matrix X is of dimension N × 2N. There are subtleties in applying the filter this way, but they are not important in the general description of the linear regression form of the digital filter. Using these predictors in the linear regression, we get estimates of the regression parameters using the least squares estimator and some manipulations: β̂ = X⁺y, where y is the vector of observed measurements from one of the car sensors. The vector β̂ contains estimated coefficients appropriate for the usual linear regression model y = Xβ. Notice the error term ε has been left off. This is not a mistake; this is a supersaturated regression. All the fluctuations in the data are accounted for. The residuals begin to appear when the filter matrix is applied. On a technical note, the X matrix has full row rank, hence the Moore-Penrose generalized inverse, denoted with a + in the superscript, can be used to find a unique g-inverse in this estimate. To filter the data, we simply pre-multiply the parameters by the F matrix and the design matrix; that is, the filtered signal is XFβ̂.
In the flat tire example, the filter was applied using the FFT. What if the filtering were applied using the linear regression approach? Because of the limits of computing power and the tremendous size of the X matrix, the filter must be modified to make the computation feasible using linear regression. We take the data in the interval from 1241.5 to 1242.5 seconds to show graphically that these two approaches are equivalent and define the filter with a small window length. A smaller window size than that used in Figure 4 makes the X matrix more manageable computationally. Figure 7 shows the likeness of the filtering approaches; whether FFT or regression, the job gets done.
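A sketch of this equivalence in code, under the simplifying assumption of an exact (square) Fourier design matrix rather than the supersaturated 2N-column version described above: the regression route uses the Moore-Penrose pseudoinverse and a diagonal F matrix, and it reproduces the FFT route to machine precision. The signal and cutoff are toy values.

```python
import numpy as np

rng = np.random.default_rng(1)
fs, N = 100.0, 512                              # assumed sampling rate and record length
t = np.arange(N) / fs
y = np.sin(2 * np.pi * 4 * t) + 0.3 * rng.standard_normal(N)   # toy "car motion" record

# Real Fourier design matrix: a cosine (and, except at 0 and Nyquist, a sine)
# column for each Fourier frequency.
freqs = np.fft.rfftfreq(N, d=1 / fs)
cols, col_freqs = [], []
for k, f in enumerate(freqs):
    cols.append(np.cos(2 * np.pi * f * t))
    col_freqs.append(f)
    if 0 < k < N // 2:
        cols.append(np.sin(2 * np.pi * f * t))
        col_freqs.append(f)
X = np.column_stack(cols)
col_freqs = np.array(col_freqs)

# Least squares via the Moore-Penrose pseudoinverse, as in the description above.
beta = np.linalg.pinv(X) @ y

# Diagonal filter matrix F: keep coefficients at or below the 4.15 Hz cutoff.
keep = (col_freqs <= 4.15).astype(float)
y_regression = X @ (keep * beta)

# The same ideal low-pass filter applied through the FFT.
H = (freqs <= 4.15).astype(float)
y_fft = np.fft.irfft(H * np.fft.rfft(y), n=N)

print("largest difference between the two routes:", np.max(np.abs(y_regression - y_fft)))
```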
Summary
Digital filtering can be described as employing a heightened scientific sense. It might be likened to having the ability to ignore some frequencies, while allowing others to pass untouched to our brains for processing. Scientists, through better data collection and filtering, gain better refined senses as time passes. Although there is no high-tech hearing aid that enhances perception to a superhuman level, computers allow the discernment of information in slow motion. Scientists are using these refined senses to learn all they can about rockets, cars, airplanes, and other engineering products. Earthquakes are analyzed in plate tectonics and volcano science. The music and communication industries provide high-quality sound. Digital filters help facilitate refined scientific senses.
Further Reading
Bloomfield, Peter. 2000. Fourier analysis of time series: An introduction (2nd ed.). Hoboken: John Wiley & Sons.
Irizarry, Rafael A. 2004. Parameters with musical interpretations. CHANCE 17(4):30–38.
Neuwirth, Eric. 2001. Listen to your data: Sonification with standard statistics software. CHANCE 14(4):48–50.
Remund, Todd G. 2008. Aerospace industry holds rich opportunities for MS statisticians. Amstat News 378:31.
Smith, Steven W. 1997. The scientist and engineer’s guide to digital signal processing. San Diego: California Technical Publishing.
Gummy Bears and the Pleasure of Biostatistics Lessons Götz Gelbrich, Annegret Franke, Bianca Gelbrich, Susann Blüher, and Markus Löffler
Biostatistics is a mandatory course for medical students in Germany—and often not the most beloved. To enhance students’ engagement, we chose to apply the ideas in Penn Jillette and Teller’s book, Penn & Teller’s How to Play with Your Food, and introduced gummy bears as a teaching tool. The study—to determine whether the color and flavor of gummy bears are related—was meant to introduce the confidence interval and hypothesis test for the rate of an event to medical students. The study treatment consisted of the oral administration of a gummy bear. Prior to the experiment, participants were informed that they would randomly receive either a red gummy bear with a cherry-like flavor or a yellow gummy bear with a lemon-like flavor. Each participant was then blinded using a black scarf (see Figure 1). By tossing a coin, the conductor selected a red or yellow gummy bear (heads or tails, respectively) and the participant was instructed to eat it (and chew well). The participant was then asked to guess the color of the bear based on the flavor. Single blinding was sufficient in this study, as the conductor had to make no assessment. In case of no relationship between the color and flavor of the gummy bears, there was a 50/50 chance of a correct guess. We therefore planned to test the null hypothesis that the hit rate is 50% against the alternative claiming a hit rate above 50%. However, we considered the possibility that the instructions on flavor provided to the participants could be misleading, as they were based on the subjective opinion of the conductors. So, we performed the two-sided test. The Gaussian approximation to the binomial distribution was used to estimate probabilities associated with the rate of correct guesses of the gummy bear colors. We expected at least one-third of the subjects would be able to recognize the information on color from flavor. Considering the 50/50 chance of a correct guess in all other cases, we assumed an overall hit rate of two-thirds. To achieve a power of 0.95 in a two-sided test with type I error rate of 0.05, 114 subjects who could be evaluated were needed. As 135 students participated in our lessons and gummy bears are generally popular, we were optimistic about reaching the recruitment
Figure 1. Blind oral administration of a gummy bear
Figure 2. Study flow. Considered for inclusion: all students participating in the biometry course (n = 135); refused to eat gummy bear (n = 14); randomized (n = 121) to a red gummy bear (n = 60) or a yellow gummy bear (n = 61); all randomized participants included in the analysis (n = 60 and n = 61).
CONSORT: Improving Clinical Trial Reporting Randomized trials form the bedrock of evidence-based medical research. When properly designed, implemented, and analyzed, they allow causal inferences to be made regarding interventions. However, when not properly undertaken and described, results can be misleading or incorrect. An eclectic international group consisting of medical journal editors, statistical methodologists, and clinical trial researchers began meeting in the early 1990s to address concerns about the reporting of medical research. They became the Consolidated Standards of Reporting Trials (CONSORT) group and published a set of recommendations, checklist, and flow diagram (see www.consort-statement.org) to help improve the reporting of clinical trials. The group continues to work to help readers assess the validity of results from randomized medical trials by fostering transparency from authors. There is evidence that the CONSORT guidelines are widely adopted, with endorsements from 328 medical research journals as of January 2009 and translations into 10 additional languages. The most recent version was published simultaneously in the Annals of Internal Medicine, Journal of the American Medical Association, and The Lancet. Both the CONSORT statement and its companion explanation and elaboration document are among the top 1% of cited papers. There is also evidence from nonrandomized comparisons that clinical trial reporting has improved in recent years. The CHANCE editors use the CONSORT materials when reviewing article submissions to which the CONSORT principles should apply. In the case of the gummy bear trial, even though risk to participants was extremely low, useful (and amusing) details of the experiment were added to the article based on the CONSORT checklist. The editors of CHANCE encourage the use of CONSORT materials in instruction, in evaluating clinical studies, and to bolster evidence-based medicine.
target. Subjects were included in the final analysis directly, so no interim analyses were carried out. Figure 2 shows the study flow. Of 135 participants, 14 refused to eat gummy bears. Sixty-seven female and 54 male students aged 20 to 24 years old took part. They were randomized to eat 60 red and 61 yellow gummy bears. Of 121 gummy bears eaten, the colors of 97 were correctly guessed (hit rate 80%, 95% confidence interval 73% to 87%). This rate was significantly higher than 50% (p < 0.001), therefore the null hypothesis of no association between color and flavor was rejected. The hit rates for red and yellow bears were similar (49 of 60 red, 82%, and 48 of 61 yellow, 79%). No adverse event (AE) occurred after the intake of gummy bears (95% confidence interval of AE rate 0 to 3%).
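The reported figures can be checked with a few lines of code, assuming the Gaussian approximation described above for the hit rate and an exact (Clopper-Pearson) upper bound for the adverse-event rate; the exact method for the AE interval is an assumption, as the article does not state which method was used.

```python
import numpy as np
from scipy import stats

n, hits = 121, 97
p_hat = hits / n                                         # 0.80, the reported hit rate

# Gaussian-approximation 95% confidence interval for the hit rate.
half_width = 1.96 * np.sqrt(p_hat * (1 - p_hat) / n)
print(f"hit rate {p_hat:.0%}, 95% CI ({p_hat - half_width:.0%}, {p_hat + half_width:.0%})")

# Two-sided test of H0: hit rate = 50%, using the standard error under the null.
z = (p_hat - 0.5) / np.sqrt(0.5 * 0.5 / n)
print(f"z = {z:.2f}, two-sided p = {2 * (1 - stats.norm.cdf(abs(z))):.1e}")   # p < 0.001

# Exact (Clopper-Pearson) upper 95% bound for the adverse-event rate with 0 events in 121.
print(f"AE rate upper bound: {1 - 0.025 ** (1 / n):.1%}")                     # about 3%
```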
Discussion
We have demonstrated that the color and flavor of gummy bears are significantly related to each other. Gummy bears proved to be superior to textbooks in evoking responses of complacent medical students to statistical challenges. As experiences with gummy bears make statistical methods easier to remember, we expect our method to contribute to making knowledge of confidence intervals more persistent. Nonetheless, the use of sweets for teaching should be limited in view of the increasing prevalence of obesity, diabetes, and other health hazards.
Further Reading
Altman, D. G., K. F. Schulz, D. Moher, M. Egger, F. Davidoff, D. Elbourne, P. C. Gøtzsche, and T. Lang. 2001. The revised CONSORT statement for reporting randomized trials: Explanation and elaboration. Ann Intern Med 134(8):663–694.
Barron, M. M., and P. Steerman. 1989. Gummi bear bezoar: A case report. J Emerg Med 7(2):143–144.
Jillette, P., and Teller. 1992. Penn & Teller’s how to play with your food. New York: Villard Books.
Moher, D., A. Jones, and L. Lepage. 2001. Use of the CONSORT statement and quality of reports of randomized trials: A comparative before-and-after evaluation. Journal of the American Medical Association 285(15):1992–1995.
Moher, D., K. F. Schulz, and D. G. Altman. 2001. The CONSORT statement: Revised recommendations for improving the quality of reports of parallel-group randomized trials. The Lancet 357(9263):1191–1194.
PubMed. www.ncbi.nlm.nih.gov/sitesentrez?cmd=search&db=pubmed.
Sackett, D. L., W. M. Rosenberg, J. A. Gray, R. B. Haynes, and W. S. Richardson. 1996. Evidence-based medicine: What it is and what it isn’t. BMJ 312(7023):71–72.
Ufberg, J. W., and J. Lex. 2005. Abdominal calcifications on unenhanced CT scan due to gummy bear ingestion. J Emerg Med 28(4):469–470.
The ASA Community provides an online setting for like-minded statisticians to connect with peers through tools that make it easy to communicate, collaborate, and share.
CREATE your profile FIND contacts JOIN discussions
Networking: Set up your personal member profile and connect with ASA colleagues.
ASA Groups: The new online home for our ASA committees, sections, and chapters. Members exchange information in real time.
Resource Library: The Resource Library is where members may share, comment on, rate, and tag documents.
Glossary: Collaborate to create and build industry definitions. Rate definitions and add comments to build terms so they can be used by other members.
Calendar: Access calendars for specific groups and the master calendar for all ASA Community events.
Visit http://community.amstat.org today!
Electoral Voting and Population Distribution in the United States Paul Kvam
In the United States, the electoral system for determining the president is controversial and sometimes confusing to voters keeping track of election outcomes. Instead of directly counting votes to decide the winner of a presidential election, individual states send a representative number of electors to the Electoral College, and they are trusted to cast their collective vote for the candidate who won the popular vote in their state. Forty-eight states and Washington, DC, employ the winner-takes-all method, each awarding its electors as a single bloc. Maine and Nebraska select one elector within each congressional district by popular vote and the remaining two electors by the aggregate, statewide popular vote. Due to this all-or-nothing outcome, the winner of the popular vote will not always be the candidate who wins the most electoral votes—such as in the 2000 election. There are numerous critics of the current system who point out how its weaknesses are exacerbated by the naturally uneven spread of the country’s population across the 50 states and that every state—no matter how sparsely populated—is guaranteed to have at least three electoral votes out of the total 538. Two of the votes correspond to the two senators who represent each state, independent of the state population. Under the current rules, the value of a vote differs from state to state. A large state such as California has an immense effect on the national election, but, compared to a sparsely populated state such as Alaska, is grossly under-represented in the U.S. Senate, where all senators have an equal vote. Arnold Barnett and Edward Kaplan, in their 2007 CHANCE article, “A Cure for the Electoral College?” called the Electoral College “the funhouse mirror of American politics” and suggested a weighted voting system that would mitigate the problem caused by the present winner-take-all rule.
An Examination of the Impact of California
Figure 1 illustrates the allotment of electoral votes from 2000–2008. California has 55 out of the total 538. That’s already more than the 13 electoral votes Vice President Walter Mondale achieved in the 1984 election against President Ronald Reagan, more than the 49 President Jimmy Carter achieved against Reagan in 1980, more than the 17 Sen. George McGovern achieved against President Richard Nixon in 1972, and more than the 52 Sen. Barry Goldwater achieved against President Lyndon Johnson in 1964. Given that 270 votes are needed to secure the majority, the candidate who wins the most votes in California, even if by a slim majority, garners more than 20% of the electoral votes needed to win the national election.
The 538 electoral votes in the 2008 presidential election were distributed among 50 states and the District of Columbia (DC). To simplify the language, we will treat DC as a state, because it has a minimum allotment of three electoral votes. So, we will refer to 51 states in this scenario. Consider a hypothetical situation within states in which candidates have an equal chance of winning the popular vote (and thus all of a state’s electoral votes) and state results are decidedly independent of one another. As mentioned previously, we are examining the impact of disparities, but not modeling the actual U.S. political landscape. Suppose candidate A has won the popular vote in California over candidate B. If candidate A wins m out of 50 of the remaining states, then the electoral votes of candidate A will be the sum of the electoral votes of those m states plus the votes in California. Excluding California, the average number of votes per state is 9.66, and the standard deviation is 7.25 votes. The mean number of votes from the remaining states for candidate A can be calculated as the mean of the conditional mean. That is, if candidate A wins M states, the average number of votes is 9.66M. If M—the number of states—has a binomial
Figure 1. Electoral votes in 2008 for 50 states and the District of Columbia (not in figure: Hawaii (4 votes) and Alaska (3 votes))
distribution with n = 50 independent trials and probability of going for candidate A of p = 1/2, then the expected value of M is np = 50(1/2) = 25. Consequently, the expected number of additional votes for candidate A, under the simple coinflip model, is 9.66(25) = 241.5. The calculation for the standard deviation of the number of votes is slightly more complicated, but uses a familiar formula from introductory probability theory. In words, the variance of the number of votes is the mean of the conditional variance plus the variance of the conditional mean. The standard deviation is the square root of that number. In symbols,

Var(Votes) = E(Var(Votes | M)) + Var(E(Votes | M)).   (1)
The variance of the votes given the number of states going for candidate A, assuming the states decide independently, is Var(Votes | M) = 7.25²M. The expectation of this is 7.25²(25) = 1314.06. As before, E(Votes | M) = 9.66M. The variance of M from the binomial distribution is np(1 – p) = 50(1/2)(1 – 1/2) = 12.5. A multiplier such as 9.66 in a variance is squared, so Var(E(Votes | M)) = 9.66²(12.5) = 1166.45. Then, the standard deviation is the square root of the sum of the two parts: √(1314.06 + 1166.45) = 49.80. If the distribution of the number of additional electoral votes for candidate A is roughly normal, then the probability that candidate A gets the needed 215 additional votes to win is approximately

P(Votes ≥ 215) ≈ 1 – Φ((215 – 241.5)/49.80) = 1 – Φ(–0.532) = 1 – 0.297 = 0.703.   (2)

In this scenario, the candidate who wins California will win the general election about 70% of the time. The importance of California as a big state is clear, even if the simple assumptions—such as a 50/50 coinflip for the candidate who wins California to win each other state—do not match political reality. No other state would have this large of a conditional probability. Turning to another issue, can a probability model describe the uneven distribution of electoral votes across the states? If so, what can that model tell us about how the Electoral College disparity has changed since the electoral voting system was introduced 220 years ago?
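The calculation above can be reproduced in a few lines; the per-state mean and standard deviation are taken from the article, and everything else follows from the binomial coin-flip model.

```python
import numpy as np
from scipy import stats

# Under the coin-flip model, candidate A needs 270 - 55 = 215 more votes after California.
needed = 270 - 55
n, p = 50, 0.5                        # remaining states, each a fair coin flip
mean_votes, sd_votes = 9.66, 7.25     # per-state summaries quoted in the article

E_M, Var_M = n * p, n * p * (1 - p)                       # binomial mean and variance of M
E_votes = mean_votes * E_M                                # 9.66(25) = 241.5
Var_votes = sd_votes**2 * E_M + mean_votes**2 * Var_M     # E(Var) + Var(E), equation (1)
sd = np.sqrt(Var_votes)                                   # about 49.8

# Normal approximation to P(Votes >= 215), equation (2); prints about 0.70.
print(1 - stats.norm.cdf((needed - E_votes) / sd))
```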
A Multinomial Model for the Distribution of Electoral Votes
We will consider modeling the distribution of electoral votes to the 51 states. A multinomial distribution is a probability distribution for the number of items from a fixed total distributed to categories or classes. There is a probability for each category, which in this application are the 51 states. Under a multinomial model, the votes are distributed to the states independently of one another and with fixed probabilities of going to each state. To model the distribution of votes, it will make more mathematical sense to consider only the votes corresponding to the congressional seats that are not already guaranteed with minimum state population. Specifically, each
state has two elected members of the Senate (the state population does not matter) and at least one member in the House of Representatives. Additional representatives allotted to the state are based on the population of the state and the constraint that the total number of electoral votes is 538. Let X1,...,Xn be the electoral votes allocated to the n states from the remaining k = 385 possible votes. Then, k = ∑Xi = 385. Now the maximum from the n = 51 states in the 2008 election is California, with 52, and eight sparsely populated states (again, including DC) have none. Can the multinomial model reasonably explain this distribution of votes? A simple model is one for which votes are distributed to states independently with equal probability. This is a multinomial model with constant chances for the 51 states, which will not suffice. Under this model, states would get 7.55 votes on average, with a standard deviation of 2.72 votes. The equal probability model does not characterize the extreme variability that allows eight states to end up with no votes while a single state ends up with 52. Another approach is to treat the probabilities as unknown parameters and estimate them based on the actual observed electoral vote distributions. The maximum likelihood estimates of the probabilities are simply the actual vote counts divided by the number of total votes. In California, for example, the empirical probability is 52/385, or 13.5%. This model fits exactly, but it does not help us learn anything about the distribution beyond what we already know. In 2008, using this framework, we have exactly one sample—the actual number of electoral votes in the nation.
A Multivariate Polya Model
The multivariate Polya distribution (MPD) is closely related to the multinomial distribution when each bin (state) has a probability of receiving each of the k = 385 votes. The 51 probabilities add to one. The probabilities, themselves, are modeled as arising from a mixing distribution. With more than two categories, it is called a Dirichlet distribution and is a model for probabilities adding to one. That is, the probabilities are thought of as arising independently from a single distribution. The resulting mixture distribution for the electoral vote counts is the MPD. In the special case when n = 2, the model for the counts is a binomial probability model and the binomial probability parameter is assigned a beta distribution. The resulting mixed distribution is called a beta-binomial distribution. The multivariate Polya distribution has probability mass function

pX(x1, ..., xn) = [Γ(nβ) / Γ(nβ + k)] × [k! / (x1! ··· xn!)] × ∏j=1,…,n Γ(xj + β)/Γ(β),   (3)
where ∑xj = k and β is called the contagion parameter. The contagion parameter β regulates how evenly k votes will be distributed among n equally likely states. To see how this works, consider a ball and urn problem in which n urns each contain β balls. Another ball is randomly placed in one urn with probability determined by the proportion of the balls in the urn. To start, each urn has an equally likely chance of getting the first ball. The second ball, however, is more likely to go to the same urn that received the first ball. If β is large, the chance does not increase much, but β also can
Figure 2. Goodness-of-fit test result for U.S. 2008 electoral vote data and the multivariate Polya distribution
be a fraction, and the first ball randomly placed in an urn can have a greater effect. For example, as β goes to infinity (β → ∞), the contagion disappears and the distribution of X converges to the multinomial distribution with equal probabilities, (n; k; n⁻¹1), where 1 is an n-vector of ones. Conversely, as β decreases to zero (β → 0), the contagion causes all the votes to be distributed to just one state, creating a maximum amount of unevenness. These opposite extremes for β also represent the extreme difference in multivariate Polya distributions in terms of entropy, including Kullback-Leibler divergence. Using the likelihood in (3), the maximum likelihood estimator for β corresponding to the 2008 U.S. electoral map is β̂ = 0.507, with an approximate 95% confidence interval of (0.307, 0.836) based on the likelihood ratio. To ensure the MPD adequately fits the election data, we will apply a heuristic goodness-of-fit test. Because the expected values for each state are identical, a standard chi-square test is generally ineffective. Alternatively, a test based on the expected values for the n order statistics (the values of X1, …, Xn sorted in ascending order) of X can effectively distinguish MPD data from more general categorical distributions. A test can be constructed using simulation to construct 95% confidence intervals for each order statistic based on MPD simulations. Simulating the MPD is relatively easy, as it can be constructed as a multinomial distribution with mixing parameter (p1, …, pn) having a Dirichlet distribution. A Dirichlet vector can be composed from a vector of independent gamma random variables (with identical scale parameter) divided by their sum. In Figure 2, the goodness-of-fit results are plotted for the 51 states, which are ordered according to the number of
electoral votes. (California is represented by the right-most plotted point.) The 95% confidence intervals are based on 10⁵ simulations. The apparent suitable fit reinforces the assertion that the contagion model can characterize the way electoral votes are distributed differently across the states. This simplifies our modeling effort because the contagion model requires only one parameter to be estimated, instead of 51 separate proportions.
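A sketch of the simulation described above, using the gamma-variable construction of the Dirichlet mixing distribution; the random seed is arbitrary, and the parameter values are those reported in the article.

```python
import numpy as np

rng = np.random.default_rng(2)
n_states, k, beta = 51, 385, 0.507       # values reported in the article

def rmpd(size):
    """Draw electoral-vote allocations from the multivariate Polya distribution."""
    # Symmetric Dirichlet probabilities via normalized gamma random variables ...
    g = rng.gamma(beta, size=(size, n_states))
    p = g / g.sum(axis=1, keepdims=True)
    # ... then distribute the k remaining votes multinomially, given those probabilities.
    return np.array([rng.multinomial(k, pi) for pi in p])

draws = rmpd(100_000)
order_stats = np.sort(draws, axis=1)     # allocations sorted smallest to largest

# Pointwise 95% bands for each order statistic (the basis of the goodness-of-fit plot).
lower = np.percentile(order_stats, 2.5, axis=0)
upper = np.percentile(order_stats, 97.5, axis=0)
print("95% band for the largest state:", lower[-1], "to", upper[-1])
```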
Previous U.S. Elections
Since the first U.S. presidential election in 1789, when President George Washington essentially ran unopposed, the number of states has increased from its original 10 (listed in Table 1) to the current 51. The population (both absolute and relative) of those existing states in 1789, as well as most future states, has changed dramatically in the course of the last 200 years. We are interested in finding out how the distribution of electoral votes across the states has evolved, in terms of the contagion parameter, over this same course of time. Figure 3 shows how the estimate of β has changed, along with the number of states participating in the federal election, since 1789. As the estimated contagion β has decreased over time, state populations have polarized toward small groups of heavily populated states and large groups of sparsely populated ones. The MPD model fit also changes from year to year, sometimes due to a large number of states that were constituted at the time of the census. To go along with Figure 3, Figure 4 displays the 95% confidence intervals for the contagion parameter for each electoral vote configuration in the United States since 1789. In earlier elections, in which there is less certainty about the finiteness of the contagion parameter, the upper confidence
Figure 3. Estimated MPD contagion parameter and number of states in past elections (1789–2008)
Figure 4. Estimated contagion parameter (with 95% confidence interval) for MPD model in past elections (1789–2008)
Table 1—Electoral Vote Distribution for the 1789 and 1792 U.S. Federal Elections

State              1789    1792
Virginia             10      21
Massachusetts        10      16
Pennsylvania         10      15
New York              -      12
North Carolina        -      12
Connecticut           7       9
Maryland              6       8
South Carolina        7       8
New Jersey            6       7
New Hampshire         5       6
Georgia               5       4
Kentucky              -       4
Rhode Island          -       4
Delaware              3       3
Vermont               -       3

Figure 5. Goodness-of-fit test result for U.S. 1789 electoral vote data and the multivariate Polya distribution
limit can become arbitrarily large. For this reason, the upper bounds in Figure 4 are cut off at β ≥ 3.5. The wider intervals corresponding to some earlier elections (including all elections before 1820) suggest the MPD fit is better in more recent years. The MPD goodness of fit for the 1789 data is illustrated in Figure 5. The plot does not belie a fit to the MPD. A goodness of fit for the multinomial distribution (with equal probabilities), however, has a chi-square test statistic of 13.6, with significance P(χ²₉ ≥ 13.6) = 0.14. This is the best fit among all the election cycles that follow; all other significance values of the multinomial goodness-of-fit test are less than 0.01, indicating a clear lack of fit. Given the somewhat uniform spread of the population across the 10 participating states in the 1789 election (see Table 1), it would be hard to believe the founding fathers or anyone else considered future elections based on five times as many states and such a dramatically less even population distribution. The contagion parameter (β), which represents a metric of population evenness or stability, is estimated to be largest (β̂ = 3.0) for the first election, but in every election after 1789, the estimate was much smaller—almost always less than one. Table 1 shows that by the election of 1792—when Washington was re-elected—five more states were added to the union and population disparity between the states increased greatly. Likewise, the contagion parameter decreased from 3.0 to 0.72. While Virginia accumulated a large population with a dominant number of electoral votes (21) in 1792, several smaller states had only the minimum number of three votes. There are two noticeable drops in the contagion parameter in Figures 3 and 4. One was the 1824 election. Shortly before
the 1820 election, five sparsely populated western territories gained statehood (i.e., Alabama, Illinois, Indiana, Mississippi, and Missouri). In the 1820 election itself, President James Monroe defeated John Quincy Adams and was re-elected as president. The other drop is in 1868. In the 1864 election, several southern states did not participate in President Abraham Lincoln’s re-election.
The Effect of the Contagion Parameter
If the contagion parameter is sufficiently small, it becomes more likely that one or two states will dominate in terms of electoral votes. With the current rules, which add two electoral votes to the minimum of one vote each state gets based on relative population, the Electoral College will never satisfy the 80/20 rule. That is, if 20% of the states have almost all the country’s population, they can never garner even 80% of the electoral votes with 51 states. Currently, our most populated 10 states have less than half the total number of electoral votes. Even if we consider only the 385 electoral votes assigned after each state already has the three minimum electoral votes, simulation shows that the 80/20 rule is satisfied only with a contagion parameter of β ≤ 0.22. To illustrate the influence the contagion parameter wields on a presidential election outcome, we again look at how the population disparity creates big states that dominate the others in terms of electoral votes. We consider a population in which the two candidates have an equal chance of winning any individual state and examine the probability that the candidate who wins the largest state wins the election. This is how the “California effect” was computed earlier. We found
Figure 6. The conditional probability (based on 10⁶ simulations) that a candidate will win the election (vertical axis) given he or she has already won the largest state. Along the horizontal axis, the contagion parameter, β, ranges from 0 to 5.
that if a candidate wins California and has a 50/50 chance of winning each of the remaining states, that candidate has a 69.4% chance of winning the election. We again let X = (X1,...,Xn) have a multivariate Polya distribution where k = 385 votes are allotted to n = 51 states. The probability of winning the election, given the largest state is already won, increases as β decreases. Figure 6 shows how the probability of winning the general election goes to one as β decreases to zero. Furthermore, as β increases to infinity, the probability will correspond to the analogous model based on the multinomial distribution with equal bin probabilities. In that case, the largest state is not much larger than most of the other states, and the probability of winning the election is around 0.58.
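The curve in Figure 6 can be approximated by combining the MPD sampler with the coin-flip model, as sketched below. Treating a 269–269 tie as a loss for the candidate is an assumption, and the number of simulations is arbitrary; the article reports values of roughly 0.71 and 0.62 for β = 0.5 and β = 3.0.

```python
import numpy as np

rng = np.random.default_rng(3)
n_states, k, sims = 51, 385, 20_000

def p_win_given_largest(beta):
    """Estimate P(win election | candidate carried the largest state) under the MPD."""
    g = rng.gamma(beta, size=(sims, n_states))
    p = g / g.sum(axis=1, keepdims=True)
    votes = np.array([rng.multinomial(k, pi) for pi in p]) + 3   # add each state's 3-vote floor
    wins = 0
    for v in votes:
        winners = rng.integers(0, 2, size=n_states).astype(bool)  # coin flip in every state
        winners[np.argmax(v)] = True                               # largest state already carried
        wins += v[winners].sum() >= 270
    return wins / sims

for beta in (0.5, 3.0):
    print(f"beta = {beta}: P(win | largest state) ~ {p_win_given_largest(beta):.3f}")
```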
Advertising Dollars and Presidential Elections In 2008, the University of Wisconsin Advertising Project estimated that more than 50% of the cost of television advertisements was spent in just 10 states. In 2000, FairVote’s Presidential Elections Reform Program reported that the advertising money spent in Florida alone exceeded that of 45 states and the District of Columbia combined. The unequal advertising spending reflects how the electoral vote share is unequal across states and the outcome in some states is decided by small margins.
From Figure 6, the probability of winning with β = 0.5 is about 0.71, which is in agreement with the approximation based on California being the biggest state in the 2000–2008 electoral vote distribution. If we go back to the first election in 1789, where the population was more evenly spread across 10 states and β̂ = 3.0, the probability decreases to 0.62.
Achieving Evenness in the Electoral College
During the U.S. Constitutional Convention in 1787, the framework for the Electoral College was created, delegating votes to states in proportion to their representatives in Congress. The United States was still an agrarian society at the time of the first federal election, and the new nation’s population was spread out somewhat uniformly among most of the 10 voting states. It can be argued that this balance was part of the original design that emphasized constitutional rights as a mixture of state-based and population-based government. In “Federalist No. 39,” James Madison proposed to have a Congress with two houses, one based on population (the House of Representatives) and one based on states (the Senate), with the election of the president also relying on this mixture of two government modes. The balance achieved in the first election can be characterized in the statistical model based on the multivariate Polya distribution, where the contagion parameter was estimated as β̂ = 3.00. To achieve this kind of voting balance with our current population distribution, a lot of state lines would have to be redrawn. For example, if the United States collapsed North and South Dakota into one state, removing a star from the flag, this would move the contagion parameter from its current 0.5071 to 0.6461. To gain any substantial increase for β, all of the states with just three or four electoral votes would have to be conjoined with larger states. The 23rd Amendment, which guarantees at least three electoral votes to DC, would
Figure 7. The reconfigured United States, with 39 states and contagion parameter β̂ = 3.0. Modified states are shaded gray. Note that Alaska and Hawaii, not pictured, also are modified.
need rectification, and, more likely, DC would be swallowed by Maryland or Virginia. To show how difficult this balance would be to achieve, following is an efficient (albeit facetious) 12-step master plan that will increase the contagion parameter to its original β̂ = 3.0:
1. Split California into two states (i.e., Northern California and Southern California) based on the county lines starting between San Bernardino and Tulare counties
2. Split off the New York City metropolitan area (including all of Long Island) from the state of New York
3. Split off west Texas, according to the vertical borders implied by its congressional districts, joining it with New Mexico
4. Join Montana, Wyoming, and Idaho
5. Add Alaska to Washington
6. Add Hawaii and Nevada to Northern California
7. Join Utah and Colorado
8. Join North Dakota, South Dakota, and Nebraska
9. Join Maine, Vermont, and New Hampshire
10. Join Virginia and West Virginia
11. Add Delaware and DC to Maryland
12. Join Connecticut and Rhode Island
The results of this splitting and mixing are displayed in Figure 7, based on population estimates for 2007. The New York City metropolitan area, as a new state, would garner 21 electoral votes, while the remains of upstate New York would have 12. Among the 39 states in the reconfiguration, the lowest number of electoral votes for any state is six. No additional pairing significantly increased β̂ past 3.0.
Discussion The reconfigured union of states displayed in Figure 7 serves to illustrate how much the population of the United States has changed since the Constitutional Convention. Barnett and Kaplan suggested that any constitutional amendment to change the electoral system directly has little or no chance of being approved because it would require a large number of small states to vote against their own interests. Their aim is to go through a side door by tweaking the system via a weighted vote share. The authors show that the results would be significantly closer to matching a national popular vote. An additional metric was provided here that reflects how the unevenness in population distribution translates into inequality of voting power between states. The key is to view the current distribution of population and electoral votes as one of many possible realities, depending on how the population distributes itself into states, or how states are defined with respect to population density.
Further Reading
Barnett, A., and E. H. Kaplan. 2007. A cure for the electoral college? CHANCE 20(2):6–9. David, F. N., and D. E. Barton. 1962. Combinatorial chance. Port Jervis: Lubrecht & Cramer, Ltd. Gainous, J., and J. Gill. 2002. Why does voting get so complicated? A review of theories for analyzing democratic participation. Statistical Science 17(4):383–404. Kvam, P. H., and D. Day. 2001. The multivariate Polya distribution in combat modeling. Naval Research Logistics 48(1):1–17. Pearson, C. 2006. Who picks the president? Takoma Park: FairVote. Peterson, D. W. 2008. Putting chance to work: Reducing the politics in political redistricting. CHANCE 21(1):22–26. University of Wisconsin Advertising Project, http://wiscadproject.wisc.edu.
Here’s to Your Health
Mark Glickman,
Column Editor
The Virtual Patient: Developing Drugs Using Modeling and Simulation Andreas Krause
In recent years, many major pharmaceutical companies have established dedicated groups for modeling and simulation (M&S). M&S aims to use quantitative approaches to bridge the gaps between medicine, pharmacology, statistics, and computer science in drug development. This trend is supported by the Food and Drug Administration (FDA) and its Critical Path Initiative, which strongly encourages model-based drug development. M&S develops and uses sophisticated and scientifically sound models to characterize the data measured and observed in human subjects, who are healthy volunteers or patients. Translating observations into mathematical models helps one understand the positive and negative effects of taking a drug—its efficacy and safety. A major aim is to get the dose right. Other aims include better mechanistic, coherent, and—in particular—quantitative understanding of a drug’s effects. Once a model has been developed, a “virtual patient” becomes available. One can give a drug to a virtual patient and let the model predict what is going to happen. What can be expected if the dose is bigger or given twice a day for a longer period of time? What might happen to patients who are Japanese or Caucasian, light or heavy, female or male, adults or children? Typically, multiple measurements are taken from each subject—before the start of treatment, while on treatment, and after having terminated treatment. The models are therefore longitudinal in nature, often falling into the class of nonlinear mixed effect models.
Modeling
Many models are employed in clinical drug development. M&S models frequently fall into one of the categories of pharmacokinetics (PK), pharmacodynamics (PD), or PK/PD, linking PK to PD.
Pharmacokinetic Modeling PK is the discipline of assessing what the body does to the drug—drug absorption, distribution, metabolism, and elimination, known as ADME. The interest is in characterizing the drug concentration over time. The concentration can be measured in the blood stream or an organ of interest, and of course it depends on the dose of drug given. Figure 1 shows a typical concentration-time profile for a subject on a daily dose of 100 mg, administered once every 24 hours for seven days. PK profiles can vary widely, depending on the particular compound and its characteristics. Here, the concentration of the drug in the blood increases up to about half a day after drug intake, followed by a decline in concentration until the next dose is taken. On the first day, about half the concentration of day four is reached. From day four on, the profile is roughly similar on each day, reaching the so-called steady state. After discontinuation of treatment, it takes another five days for the drug to be almost entirely eliminated from the body. Pharmacodynamic Modeling PD is the discipline of assessing what the drug does to the body—curing an illness or provoking an unwanted effect. There can be many pharmacodynamic endpoints, from measures for disease change such as tumor size or change in glucose level to measures for unwanted effects such as increased heart rate or patient-reported outcomes. The assumption is frequently that the drug effect is driven by the drug concentration; the higher the concentration, the bigger the effect. Other factors also might influence the response. There might be a placebo effect driven by what the subject thinks he or she receives. (Placebo originates from the Latin “to please.”) For this reason, studies often include a placebo group in which subjects receive “medication” with no
Figure 1. Pharmacokinetic profile (drug concentration over time in the blood) for a typical subject taking a dose of 100 mg orally every 24 hours for seven days
Figure 2. A PK/PD model for a drug taken orally with two PK compartments and a PD effect compart¬ment (response to treatment). The response is driven by the concen¬tration of the drug in the central compartment, a placebo effect, circadian rhythm, and tolerance.
active drug that looks identical to the one that does contain a drug. Placebo effects, an improvement without a real drug, are common in patient-reported outcomes such as pain, but they can also be present in outcomes such as reductions in migraine or asthma attacks. Circadian effects (from “circa diem,” or “about daily”) are changes that occur naturally in 24-hour cycles, such as regular patterns in heart rate and cortisol levels. Tolerance effects describe the fact that the body gets used to the drug and the effect of the drug is reduced over time. Typical examples of drugs that show tolerance include caffeine and nicotine. The point of modeling is that all these components—and possibly others such as disease status before treatment (mild, severe), demographic characteristics (age, sex, body weight, race), and relevant co-medication—are combined into a single framework to get a coherent basis for medical judgment about the net drug effect. Figure 2 shows a generic PK/PD model for a drug taken orally, such as a tablet. The upper part illustrates the drug flow inside the body (the PK part), from the depot (e.g., the stomach) to the central compartment (e.g., the blood stream) and
the peripheral compartment (e.g., an organ or body fat). The flows of the drug between the compartments are commonly characterized by differential equations describing the change of flow. The lower part (PD) illustrates the patient’s response, driven by the drug concentration in the central compartment, a placebo effect, circadian variation, and possibly tolerance toward the drug that develops over time.
Population Modeling
Once the model is established in its basic structure, a population model introduces individualization to better describe individual data. Parameters such as drug elimination (clearance) and volume of distribution differ between subjects. They might differ because of measurable differences (fixed effects) or unknown effects (random effects). A typical fixed effect in a population model is the effect of body weight on volume; bigger people generally have more mass and thus larger organs and more blood and fat, leading to lower drug concentrations. This implies that, as a general tendency for a given dose, the drug effect is larger in smaller
Figure 3. Pharmacokinetic profile (bottom) and pharmacodynamic response (top) for a subject taking seven doses of 100 mg every 24 hours
people than in bigger people. That difference can be tiny or huge, depending on the drug. Random effects that cannot be controlled can include variation in absorption, magnitude of the drug effect, or circadian rhythm for no apparent reason. The remaining variation after inclusion of fixed and random effects (i.e., mixed effects) is the residual variation. Determining the relevant covariates is not entirely straightforward; differences between sexes might simply be due to women tending to have lower body weight than men. If many covariates are analyzed without adjustment, the chance increases that a covariate is identified as being relevant just by chance. It is generally a good idea to examine only covariates that also have a medical meaning. Population approaches are frequently used to argue whether the dose needs to be individualized. If the drug concentration and the drug effect vary substantially between small and big people, the small people are more likely to experience an undesired effect due to the high concentration and the big
What Do We Know?
Statisticians focus on modeling, estimating, explaining, and controlling variability. Some aspects of models are structural features describing a relationship. Other aspects of a model, such as a random effects term, describe variability. A great deal of effort in developing a statistical model for a complex problem involves selecting variables and determining how to use them. Still, one often wishes more information and variables were available or that it were known that the major contributions to uncertainty were included correctly in the data and model. Former Secretary of Defense Donald H. Rumsfeld described an analogous situation in military planning as follows: As we know, there are known knowns; there are things we know we know. We also know there are known unknowns. That is to say we know there are some things we do not know. But, there are also unknown unknowns—the ones we don’t know we don’t know.
VOL. 23, NO. 1, 2010
people are more likely to experience a small or no therapeutic effect due to the low drug concentration. If these differences become substantial, it might be indicated to adjust the dose for body weight.
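To make the distinction between fixed effects, random effects, and residual variation concrete, here is a minimal sketch of how such a population PK model can be simulated. It assumes a one-compartment model with first-order oral absorption, a body-weight effect on clearance and volume, and invented parameter values; the article's actual model and estimates are not shown here.

```python
# A minimal sketch (not the article's model): one-compartment oral-dose PK with
# a body-weight fixed effect and log-normal random effects. All numbers are
# illustrative assumptions.
import numpy as np

rng = np.random.default_rng(1)

def concentration(t, dose, cl, v, ka):
    """Analytic one-compartment concentration after a single oral dose."""
    ke = cl / v                                  # elimination rate constant
    return dose * ka / (v * (ka - ke)) * (np.exp(-ke * t) - np.exp(-ka * t))

def simulate_subject(dose, weight, t):
    # Fixed effect: clearance and volume scale with body weight (assumed allometric form).
    cl_typ = 5.0 * (weight / 70.0) ** 0.75       # L/h for a 70-kg reference subject
    v_typ = 50.0 * (weight / 70.0)               # L
    # Random effects: unexplained between-subject variability (log-normal).
    cl = cl_typ * np.exp(rng.normal(0.0, 0.3))
    v = v_typ * np.exp(rng.normal(0.0, 0.2))
    ka = 1.0                                     # absorption rate constant, 1/h
    conc = concentration(t, dose, cl, v, ka)
    # Residual variation: proportional measurement error.
    return conc * np.exp(rng.normal(0.0, 0.1, size=t.shape))

t = np.linspace(0.5, 24.0, 48)                   # hours after the dose
light = simulate_subject(dose=100.0, weight=55.0, t=t)
heavy = simulate_subject(dose=100.0, weight=95.0, t=t)
print(f"peak concentration, 55 kg subject: {light.max():.2f}")
print(f"peak concentration, 95 kg subject: {heavy.max():.2f}")
```

Running such a sketch reproduces the qualitative point made above: for the same dose, the heavier subject tends to reach lower concentrations, with the random effects and residual error blurring that tendency from subject to subject.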
Simulation

Frequently, the interest is not only in the average response to treatment, but in the expected range and extremes. What percentage of all subjects in the population can be expected to reach a desired range of drug concentration and effect? How many subjects are expected to pass below a predefined safety threshold? Simulating a large population helps quantify the answer. The population is simulated from the statistical model with all uncertainties, including parameter uncertainty, random effects, and residual variation. For example, if 234 out of 10,000 simulated subjects of different age, sex, body weight, and disease status show an undesired effect, 2.34% of all people treated are estimated to show this effect. The subjects with negative effects, in turn, might have specific characteristics. For example, most subjects with the undesired effect might show low body weight or advanced stages of the disease, giving rise to a discussion about whether these subjects should be excluded from treatment. As always, when it comes to predictions, one needs to keep in mind that "prediction is very difficult, especially about the future" (frequently attributed to Niels Bohr). First, extrapolating outside the observed range of doses or patient characteristics must be done with care. Imagine trying to realistically extend results from adult volunteers to pediatric patients by scaling by body weight. Also, patients with a compromised immune system or other weakness can respond differently to a treatment than the healthy people who volunteered to initially evaluate it. In addition to population simulations in which a single large population is simulated and evaluated, one also can simulate a clinical study. The aim here would be to assess study outcomes to decide which patient population to treat, how many, and for how long with what dose and to derive the estimated probabilities of a statistically significant result. The study characteristics being simulated are the traditional significance levels and statistical power. In many clinical trials, levels of significance and power can be computed analytically.
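The arithmetic of such a population simulation is simple; the toy sketch below mirrors the 234-out-of-10,000 example, with an invented response model and an assumed safety threshold standing in for the real ones.

```python
# A toy population simulation: draw many virtual subjects, including parameter
# uncertainty, between-subject variability, and residual error, and count how
# many fall below an assumed safety threshold. All numbers are invented.
import numpy as np

rng = np.random.default_rng(2)
n = 10_000

pop_mean = rng.normal(loc=2.0, scale=0.1, size=n)   # parameter uncertainty
subject_effect = rng.normal(0.0, 0.4, size=n)       # between-subject random effects
residual = rng.normal(0.0, 0.2, size=n)             # residual variation

response = pop_mean + subject_effect + residual
threshold = 1.0                                      # assumed critical safety limit
below = response < threshold
print(f"{below.sum()} of {n} simulated subjects "
      f"({100 * below.mean():.2f}%) fall below the threshold")
```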
Figure 4. Simulation results: dose-response relationship for efficacy (top) and safety (bottom). In the top graph, the horizontal gray area indicates the desired range, the dark line shows the expected average for the given dose, and the gray lines indicate the range of 90% of the simulated data. In the bottom graph, the line shows the percentage of subjects simulated in the critical safety range (below the critical limit).
When many aspects are variable or unknown, or the model is too complex for analytical solutions (which is frequently the case in longitudinal models), one can use simulation to assess these operating characteristics. Assessing different study setups and deriving the chances of successful outcome for each helps when deciding on the most promising study design.
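As an illustration of simulation-based operating characteristics, the sketch below estimates the power of a hypothetical two-arm parallel design by brute force: simulate the trial many times under an assumed effect size and count how often the test reaches significance. The effect size, variability, and test are assumptions for illustration, not values from the case study.

```python
# Estimate power by simulating many hypothetical two-arm trials.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

def simulated_power(n_per_arm, effect, sd, n_trials=2000, alpha=0.05):
    hits = 0
    for _ in range(n_trials):
        placebo = rng.normal(0.0, sd, n_per_arm)
        active = rng.normal(effect, sd, n_per_arm)
        _, p = stats.ttest_ind(active, placebo)   # two-sample t test per trial
        hits += p < alpha
    return hits / n_trials

for n in (20, 40, 80):
    print(n, "subjects per arm, estimated power:", simulated_power(n, effect=0.5, sd=1.0))
```

The same loop can wrap an arbitrarily complicated longitudinal PK/PD model and analysis, which is exactly when the analytic power formulas stop being available.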
Case Study: Characterizing a New Compound

Actelion Pharmaceuticals Ltd. currently produces several new drugs. One of these drugs is an immunomodulator. It regulates the circulation of lymphocytes, a type of white blood cell that plays a major role in the immune system. Regulation of the immune system can help control autoimmune diseases such as psoriasis (excessive skin production resulting in skin flakes), rheumatoid arthritis, or multiple sclerosis. The immune system also plays a major role if a rejection occurs after organ transplantation. Downregulation of the body's immune system can help control autoimmune diseases. If the immune system is downregulated too much, however, the system becomes vulnerable to infections. Understanding the drug effect, quantifying it, and determining the appropriate dose is therefore literally vital. A population PK model was developed here to characterize the pharmacokinetics of the investigational drug in healthy volunteers (clinical phase I). Based on the PK model, several PD models were developed to evaluate the effects of the drug with respect to total lymphocyte count lowering and to analyze safety aspects of the drug on cardiovascular parameters such as electrocardiogram (ECG) data. Once a model is in place, it represents the current state of knowledge. If all variables—including the drug dose, the subject's total lymphocyte count at baseline, and the subject's
body weight and age—are specified, the model predicts the time course of drug concentration and drug effect. The model developed here was used to study the dose on a continuous scale, including doses that were never given. Such analyses are helpful in determining the doses to give in further clinical studies. Figure 3 shows a typical PK/PD profile for the orally taken drug. In this graph, the parameters had to be modified and the units removed to protect the confidentiality of information with respect to the compound. The bottom curve shows the concentration-time profile over 12 days for a once-daily drug given for seven days. The top curve shows the pharmacodynamic effect, a lowering of the total lymphocyte count. Note the delay in the pharmacodynamic effect: The minimum lymphocyte count appears well after the maximum drug concentration. Balancing efficacy and safety in this drug means a certain reduction in total lymphocyte count is desired and a target range is defined. When it comes to safety, the interest is in the fraction of subjects in whom the total lymphocyte count would drop below a critical threshold such that the immune system might become too compromised. Figure 4 shows the result. (The particular numbers are protected.) The top graph shows the dose-response for efficacy; the horizontal gray area indicates the desired range. The dark line shows the expected average for the given dose (x-axis), and the gray lines indicate the range of 90% of the simulated data. The bottom graph shows the percentage of subjects simulated in the critical safety range (below the critical limit). Even a dose of 0 mg will yield some critical safety values due to the nature of the effect (natural variation, placebo effect, or other phenomena). CHANCE
Know When to Fold 'Em

If you consider a pharmaceutical company with many drugs under development and a finite budget, priorities must be set among compounds and the selected compounds must be developed in an optimal way. In this sense, drug development is a little like gambling: With several drugs at hand, limited resources to develop them all, and different chances and hopes for successful outcomes, you need to put your money on a good hand (a good drug) and you need to play it right. Kenny Rogers recorded a song about this in 1978 with the following lyrics:

You got to know when to hold 'em, know when to fold 'em. Know when to walk away, and know when to run. Every gambler knows that the secret to survivin' is knowing what to throw away and knowing what to keep.

There is rarely a dose that fulfills all efficacy and safety criteria. In this example, there is no dose that gets 90% of all subjects into the desired efficacious range while simultaneously keeping the fraction of subjects in the critical range below the specified threshold. The next clinical phase, phase II, will study the drug effects in patients, so which diseases to tackle must be determined. Different diseases will have different efficacy and safety characteristics. Patients suffering from different diseases also might be different. For example, diabetes patients frequently have high body weights, whereas multiple sclerosis patients are frequently females with low body weights. Therefore, patients with different diseases often receive different doses of the same drug. This drug is now entering the dose-finding stage for different indications. Getting the dose right by balancing efficacy and safety will be crucial for successful drug development.

Decisionmaking

In the end, the goal is to characterize a drug and help the decisionmaking process. There are many tuning parameters, including the following: What dose should the patients receive?
For how long should they be treated? Is the same dose appropriate for all? How should the drug effect be evaluated (which efficacy and safety measures should be used)? Which patients will be treated? The drug might have a better effect on more severely ill patients. If that is the case, should all patients be treated, or just the more severely ill ones? Modern drug development is expensive and many factors must be considered. There is a desire to understand the behavior and performance of the drug in detail. To design cost-effective and medically successful trials, pharmacokinetic and pharmacodynamic models and simulated outcomes are increasingly being used. In this area, there are great opportunities to collaborate and provide key expertise for researchers with advanced training in modeling and simulation.
Further Reading

U.S. Food and Drug Administration. Critical path initiative. www.fda.gov/ScienceResearch/SpecialTopics/CriticalPathInitiative.
Bonate, P. 2005. Pharmacokinetic-pharmacodynamic modeling and simulation. New York: Springer.
Ette, E. I., and P. J. Williams (eds.). 2007. Pharmacometrics: The science of quantitative pharmacology. Hoboken: Wiley.
Kimko, H., and S. B. Duffull (eds.). 2002. Simulation for designing clinical trials: A pharmacokinetic-pharmacodynamic modeling perspective. New York: Marcel Dekker.
Ting, N. (ed.). 2006. Dose finding in drug development. New York: Springer.

"Here's to Your Health" prints columns about medical and health-related topics. Please contact Mark Glickman ([email protected]) if you are interested in submitting an article.
Visual Revelations
Howard Wainer,
Column Editor
Schrödinger's Cat and the Conception of Probability in Item Response Theory

Although not presenting novel graphics or commenting on the quality of graphics, this article discusses how three similar-looking graphs can have quite different meanings. If careful interpretation is critical for such familiar graphs, then careful presentation and reading certainly is needed for more involved illustrations of data and models. This work was supported by the National Board of Medical Examiners. In addition, many of the key ideas contained here, as well as some of the explicit facts, were taken from Paul Holland's 1989 presidential address to the Psychometric Society.
Modern test theory comprises various kinds of mathematical models that provide a quantitative description of what happens when a person meets an item. This description is stochastic in that it usually ends in the probability of the person answering the item correctly. How we interpret this probability is both subtle and important, and to understand it in its full context, we must go back more than a century; the full story involves some of the most famous scientists of the age.
Leipzig, 1880s

The story begins in the mid-19th century in Gustav Theodor Fechner's laboratory in Leipzig (see "The Method of Right and Wrong Cases"), where the study of psychophysics had begun. Experiments were performed in an effort to connect the physically measurable strength of stimuli to their perceived strength. To do so, an experimental subject was prompted
with a weak light (or soft sound or light touch) and asked to indicate whether he saw (or heard or felt) it. This procedure was repeated many times at various levels of stimulus strength; the result was often summarized graphically. Figure
1 provides an example. For my current purpose, it is crucial to note that this fitted function is for a single hypothetical subject (#2718) and each value of ‘P’ reported was derived from repetitions of the same stimulus strength.
The Method of Right and Wrong Cases

Gustav Theodor Fechner (1801–1887) published his Elemente der Psychophysik in 1860. The showpiece of that volume may have been the logarithmic Weber-Fechner Law, connecting sensation (S) with stimulus (R): S = C log(R), but it was not the principal contribution. The most influential innovation was Fechner's methods for measuring sensation. One of these was what Fechner called the method of right and wrong cases, in which he presented a subject with a pair of stimuli and asked which was greater. Through repeated presentations, he was able to derive an estimate of the probability of that individual judging the difference. For a more detailed discussion, see The History of Statistics: The Measurement of Uncertainty Before 1900, by Stephen Stigler.
Figure 1. Stimulus-response curve for subject 2718 (P versus stimulus strength)

Figure 2. Dose-response curve for pesticide 314 (P versus dose)

Figure 3. The probability of getting item 159 correct (P versus proficiency)
In various kinds of biological experimentation (e.g., pharmacology), a similar kind of graphic form is in common usage. In this incarnation, it is often called a dose-response curve. For example, if we wish to determine the effectiveness of various dosages of some pesticide, we would run a sequence of experiments in which a large number of the pest under consideration was divided into experimental groups, and each group would then be administered a different dosage of the pesticide. We would then record the death rate within each group. A plot of the proportion that dies against the dosage (often plotted as log(dosage) when such a transformation yields a more symmetric function) yields the dose-response curve. A typical plot for some hypothetical pesticide is shown in Figure 2. A superficial comparison of the two plots shown suggests they are virtually identical. But this is incorrect. The y-axis label in both cases is P, but those two Ps have very different meanings. The psychophysical P represents variability in the responses of a specific individual to the same stimulus strength. The dose-response P is a sampling probability derived from the differential responses observed across individuals to the same dosage. This difference, though perhaps subtle, can have important consequences.
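As a concrete illustration of the dose-response setting, the sketch below fits a logistic curve to fabricated mortality proportions on the log(dose) scale. A production analysis would more likely use a binomial GLM (probit or logit), but the resulting curve is the same kind of object as Figure 2, and its P is a sampling probability across organisms.

```python
# Fit a logistic dose-response curve to hypothetical group mortality data.
import numpy as np
from scipy.optimize import curve_fit

dose = np.array([1.0, 2.0, 4.0, 8.0, 16.0, 32.0])   # hypothetical dosages
n = np.array([50, 50, 50, 50, 50, 50])               # pests per experimental group
deaths = np.array([4, 9, 21, 33, 44, 48])            # hypothetical death counts
p_obs = deaths / n                                    # observed death proportions

def logistic(x, a, b):
    return 1.0 / (1.0 + np.exp(-(a + b * x)))

# Model the proportion dying as a logistic function of log(dose).
(a, b), _ = curve_fit(logistic, np.log(dose), p_obs, p0=[0.0, 1.0])
ld50 = np.exp(-a / b)      # dose at which the fitted curve crosses 0.5
print(f"fitted slope on log(dose): {b:.2f}, estimated LD50: {ld50:.1f}")
```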
Copenhagen, 1920s and 1930s

The next step in this exploration takes us to Copenhagen in the 1920s and 1930s. These decades were especially exciting in the world of physics. The consequences of quantum theory were only just being realized, and heated debates were taking place among the giants in the field. The most basic issue being debated was the meaning of a probability. In this debate, one side was argued by Erwin Schrödinger and Albert Einstein. On the other stood Niels Bohr, Werner Heisenberg, Max Born, and Paul Dirac. All have been canonized as iconic figures in modern physics; indeed, Einstein's accomplishments place him beside Aristotle and Isaac Newton in the pantheon of science. Einstein and Schrödinger argued that when quantum theory represented the result of an experiment in probabilistic terms, it was an indication that the theory was incomplete and/or the data
were inadequate. When the theory was improved and/or better data gathered, the uncertainty would shrink and eventually disappear. This is the source of Einstein's famous remark, "God does not throw dice." Bohr argued that the uncertainty was part of the system and was, after a certain point, irreducible. Bohr's argument became known as the "Copenhagen interpretation." As part of this discussion, Schrödinger published a series of papers in 1935 in which he proposed a gedankenexperiment, or thought experiment, that has become known as Schrödinger's cat. Schrödinger proposed a box with a cat in it. Also in the box was a small amount of radioactive material that would give rise to a quantum event, say, the release of one particle in an hour with probability 0.5. Also in this box was a Geiger counter that could detect this single particle. If a particle was detected, it would trigger the release of enough poison to kill the cat. Bohr argued that until the box was opened, the cat was alive with probability of 0.5 and dead with the same probability. This was the so-called superposition structure: Until the box was opened, the cat was neither dead nor alive, but a bit of both. Bohr declared that until it is measured, nothing exists. Einstein argued, instead, that the cat was indeed either dead or alive (not both)—not knowing reflected either a shortcoming of the theory (not being able to accurately say when a particle will be released) or the data. Expanding on Einstein's view of this, we would argue that if we had 100 such boxes, we might find 50 of them with dead cats and 50 with live ones. The probability of 0.5 is thus a sampling probability realized across cats, not an intrinsic property of the same cat. Thus, Schrödinger and Einstein's view is closer to that of the dose-response P than Fechner's psychophysical P. The view of Bohr and his colleagues was closer to Fechner's P.

Extensions of the Gedankenexperiment

Various extensions of the cat gedankenexperiment were proposed. One would insert a clock into the box, as well, wired to the Geiger counter so that when the poison was released, the clock would stop. Thus, when the box was opened, not only would you see a dead cat, but also evidence of when it had expired. Bohr argued that the clock was the measurement, and hence the uncertainty ended when the clock stopped. In the more than 70 years since Schrödinger's original cat, there have been many, many variations and alternatives. To this day, the interpretation of Schrödinger's cat remains an open question.
Copenhagen, 1960s

Now let us jump forward another 25 years, but remain in Copenhagen. In 1960, Georg Rasch published the results of his research on a test theory model that would predict/describe
what happens when a person meets an item. The result of this model was a curve—the item characteristic curve, which bears a close resemblance to the two prior figures shown. A typical, but hypothetical, example is presented in Figure 3. But, once again, we must focus on the meaning of the vertical axis. What does P mean? It is not surprising, given the visibility of psychophysical work in combination with the "Copenhagen interpretation" put forth by the most eminent member of the faculty of the University of Copenhagen, that Rasch would interpret P as the internal probability within a specific person. Indeed, this is exactly what he did: … [C]ertain human acts can be described by a model of chance. … Even if we know a person to be very capable, we cannot be
sure he will solve a certain difficult problem, nor even a much easier one. There is always a possibility that he fails—he may be tired or his attention is led astray, or some other excuse may be given. Furthermore, if the problem is neither "too easy" nor "too difficult" for a certain person, the outcome is quite unpredictable. But we may, in any case, attempt to describe the situation by ascribing to every person a probability of solving each problem correctly. … Rasch was not alone in this view. Even Frederic Lord and Melvin Novick in their cornerstone of modern test theory take a similar tack when they describe the "transient state of the person" giving rise to a "propensity distribution." Other researchers, however, took a different view—closely akin to Einstein's insistence that the cat was either dead or alive—in which they thought an examinee either knows the answer or does not, with little uncertainty. Consequently, the P is yielded by a sampling process. Thus, if we group a large sample of examinees (say, n) with the same proficiency and administer a specific item, nP of them will know the answer, on average, and n(1-P) will not. The statistician Allan Birnbaum was one early advocate of this view. He said, "Item scores … are related to an ability θ by functions that give the probability of each possible score on an item for a randomly selected examinee of given ability." Lord appeared to join in this perspective when he said, "An alternative interpretation is that Pi(θ) is the probability that item i will be answered correctly by a randomly chosen examinee of ability θ." Not being fully committed to this view, however, Lord, on the same page, leaves room for some within-person variability when, in the discussion of responses to previously omitted items, he states, "These interpretations tell us nothing about the probability that a specified examinee will answer a specific item correctly." Thus, there appears to be room for both conceptions. There is, however, a dark side to the within-person probability approach. By adopting this outlook, you can be fooled into believing you can ignore the need to
ever consider the examinee population. Holland pointed out the following in a 1990 Psychometrika article: … [T]his is an illusion. Item parameters and subject abilities are always estimated relative to a population, even if this fact may be obscured by the mathematical properties of the models used. Hence, I view as unattainable the goal of many psychometric researchers to remove the effect of the examinee population from the analysis of test data. Thus, the interpretation of P as within-person probability too easily leads one to the chimera of "sample-free item analysis," as Benjamin D. Wright and Graham Douglas did in their 1977 article, "Best Procedures for Sample Free Item Analysis," which appeared in Applied Psychological Measurement. Again, quoting Holland, "The effect of the population will always be there; the best that we can hope for is that it is small enough to be ignored."
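For reference, the item characteristic curve in Figure 3 has, under Rasch's model, the standard logistic form (written here in conventional notation rather than quoted from Rasch):

```latex
% Rasch (one-parameter logistic) model: \theta is the examinee's proficiency
% and b_i is the difficulty of item i.
P_i(\theta) \;=\; \Pr(X_i = 1 \mid \theta)
            \;=\; \frac{\exp(\theta - b_i)}{1 + \exp(\theta - b_i)}
```

Nothing in the formula itself says whether Pr refers to a propensity within one examinee or to a proportion among many examinees who share the same θ; that is precisely the interpretive choice at issue here.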
Item Response Theory and the Pressure of History

Item response theory (IRT) has become a mainstay of modern testing. Traditional test-scoring models that used a metric such as "percent correct" were theories that used the entire test as the unit of measurement. Many modern tests are adaptively constructed to suit the individual examinee and, as such, are made up on the fly. To obtain scores that are comparable across such a collection of tests, a theory was needed that could connect them by making the individual test item the fungible unit. This was the goal of item response theory, and the success of adaptive testing—as described in my book, Computerized Adaptive Testing: A Primer—attests to its efficacy. Yet, Rasch, one of the pioneers of IRT, was led down a path that had serious negative consequences because of an ill-chosen interpretation of one of the fundamental units of the model. The point of this essay was to explore what the forces were that impelled Rasch to this choice. Our conclusion was that it was the power of history combined with the press of authority that made any other interpretation of P more than just difficult.
Rasch had the gigantic epistemological shadow of Bohr looming over him. It is too easy for us, with 50 years of hindsight, to gainsay his choice. I am inclined to be gentle in any criticism of him, but am less forgiving of contemporary figures who persist in making the same mistake.
Further Reading

Birnbaum, A. 1968. Some latent trait models and their use in inferring an examinee's ability. In Statistical theories of mental test scores, F. M. Lord and M. R. Novick, chapters 17–20. Reading: Addison-Wesley.
Fechner, G. T. Elements of psychophysics. Volume 1 translated by Helmut E. Adler. New York: Holt, Rinehart, and Winston, 1966. Originally published as Elemente der Psychophysik (1860).
Holland, P. W. 1990. On the sampling theory foundations of item response theory models. Psychometrika 55:577–601.
Lord, F. M. 1974. Estimation of latent ability and item parameters when there are omitted responses. Psychometrika 39:247–264.
Lord, F. M., and M. R. Novick. 1968. Statistical theories of mental test scores. Reading: Addison-Wesley.
Schrödinger, E. 1935. Die gegenwärtige Situation in der Quantenmechanik (The present situation in quantum mechanics). Naturwissenschaften 48, 807, 49, 823, 50, 844.
Stigler, Stephen M. 1986. The history of statistics: The measurement of uncertainty before 1900. Cambridge: Harvard University Press.
Wainer, H., N. Dorans, D. Eignor, R. Flaugher, B. Green, R. Mislevy, L. Steinberg, and D. Thissen. 2000. Computerized adaptive testing: A primer (2nd ed.). Hillsdale: Lawrence Erlbaum Associates.
Wright, B. D., and G. A. Douglas. 1977. Best procedures for sample-free item analysis. Applied Psychological Measurement 14:281–295.

Column Editor: Howard Wainer, Distinguished Research Scientist, National Board of Medical Examiners, 3750 Market Street, Philadelphia, PA 19104;
[email protected]
Comments on 4x4 Philatelic Latin Squares

Peter D. Loly and George P. H. Styan
Postage stamps are occasionally issued in sheets of n² stamps, printed in an n x n array and containing n of each of the n stamps. Sometimes, the n x n array forms a philatelic Latin square (PLS): Each of the n stamps appears exactly once in each row and exactly once in each column. Figure 1 shows such a sheet (with selvage) with n = 4. From July to August in 1972, Canada hosted four international congresses concerned with the exploration and development of the Earth and man's activities. Featured on the stamps in Figure 1 (top row, from left) are aerial map photography for the 12th Congress of the International Society of Photogrammetry, contour lines for the 6th International Conference of the International Cartographic Association, a geological fault (cross-section of the crust of the Earth, showing different layers of material) for the 24th International Geological Congress, and an aerial view for the 22nd International Geographical Congress.
4 x 4 Latin Squares

There are four 4 x 4 Latin square matrices in reduced form (i.e., 1234 in the top row and first column):
Courtesy of George P. H. Styan
Figure 1. Four international scientific congresses: Canada 1972, PLS type a324
Permuting the last three rows of each of the four Latin square matrices—La, Lb, Lc, Ld—generates six Latin squares, so there are 24 standard-form 4 x 4 Latin squares in all. Standard form means 1234 is in the top row, but not necessarily in the first column. These 24 standard-form Latin squares are shown with a letter—a, b, c, d—and three numbers, as they appear in column 1, rows 2, 3, and 4. Thus, the PLS in Figure 1 is of type a324 and may be represented by the following:
La(324) and the array of 16 stamps in Figure 1 are in Sudoku form—the four stamps each appear once in the top left, top right, bottom left, and bottom right 2 x 2 corners—thus the full Latin square is the solution to a mini-Sudoku puzzle. Frank Yates, in a 1939 Biometrika article, noted that such Latin squares have “balanced corners.”
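The enumeration just described is small enough to verify directly. The sketch below generates the 24 standard-form squares by permuting rows 2 through 4 of the four reduced squares and checks the balanced-corner (Sudoku form) property. Note that the letter labels a through d reflect my own ordering of the four reduced squares, since the displayed matrices are not reproduced here, so they need not match the article's La through Ld.

```python
# Enumerate the 24 standard-form 4 x 4 Latin squares and flag those in Sudoku
# form (each 2 x 2 corner contains all four symbols). The a-d labeling of the
# reduced squares is an assumed ordering, not necessarily the article's.
from itertools import permutations

REDUCED = {
    "a": ((1, 2, 3, 4), (2, 1, 4, 3), (3, 4, 1, 2), (4, 3, 2, 1)),
    "b": ((1, 2, 3, 4), (2, 1, 4, 3), (3, 4, 2, 1), (4, 3, 1, 2)),
    "c": ((1, 2, 3, 4), (2, 3, 4, 1), (3, 4, 1, 2), (4, 1, 2, 3)),
    "d": ((1, 2, 3, 4), (2, 4, 1, 3), (3, 1, 4, 2), (4, 3, 2, 1)),
}

def sudoku_form(sq):
    """True if each 2 x 2 corner block contains the symbols 1-4."""
    corners = [(0, 0), (0, 2), (2, 0), (2, 2)]
    return all(
        {sq[r][c], sq[r][c + 1], sq[r + 1][c], sq[r + 1][c + 1]} == {1, 2, 3, 4}
        for r, c in corners
    )

standard_forms = []
for letter, sq in REDUCED.items():
    for rows in permutations(sq[1:]):                 # permute rows 2, 3, and 4
        square = (sq[0],) + rows
        # Label: letter plus the column-1 entries of rows 2, 3, and 4.
        label = letter + "".join(str(row[0]) for row in square[1:])
        standard_forms.append((label, square))

print(len(standard_forms), "standard-form squares generated")
print("in Sudoku form:", sorted(lab for lab, sq in standard_forms if sudoku_form(sq)))
```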
Latin Squares and Experimental Design

It seems the first use of a 4 x 4 Latin square design in printing postage stamps was in 1972 (by Canada, Figure 1). Latin square designs of size 4 x 4 have been used in Europe since at least the 13th century, when the Majorcan writer and philosopher Ramon Llull (1232–1316) published his "First Elemental Figure" in Ars Demonstrativa.
Much has been written about Latin squares. Terry Ritter, in research comments from Ciphers by Ritter, observed the following: It seems that Latin squares were originally mathematical curiosities, but serious statistical applications were found early in the 20th century, as experimental designs. The classic example is the use of a Latin square configuration to place three or four different grain varieties in test patches. Having multiple patches for each variety helps to minimize localized soil effects. Similar statements can be made about medical treatments. Charles Laywine and Gary Mullen noted the following in Discrete Mathematics Using Latin Squares: While the study of Latin squares is usually traced to a famous problem involving 36 officers considered in 1779 by Leonhard Euler (1707–1783), statistics provided a major motivation for important work in the early and middle decades of the 20th century. Indeed, many of the most important combinatorial results pertaining to Latin squares were obtained by researchers whose primary self-description would be that of a statistician. In fact, it seems the name "Latin squares" is due to Euler, who used Latin characters as symbols. Jeff Miller has images of 13 philatelic items for Euler on his web site at http://jeff560.tripod.com/stamps.html. In Handbook of Combinatorial Designs (2nd ed.), Charles Colbourn and Jeffrey Dinitz identify the following in their
timeline of the main contributors to design theory from antiquity to 1950 (consecutively, born between 1890 and 1902):

Sir Ronald Aylmer Fisher (1890–1962)
William John Youden (1900–1971)
Raj Chandra Bose (1901–1987)
Frank Yates (1902–1994)

Latin squares have, however, been used in the statistical design of experiments for more than 200 years. In randomized controlled trials, there are often too many treatments to cross to have replicates in each factorial treatment combination. The idea of Latin square designs allows main effects to be estimated and tested for significance. The general idea of factorial design with replication, though, has also been important in scientific research and the history of statistics. As discussed by D. A. Preece in his 1990 Biometrics article, "R. A. Fisher and Experimental Design: A Review," the earliest use of a Latin square in an experimental design was probably by the agronomist François Cretté de Palluel (1741–1798), who published a study in 1788 of an experiment involving the fattening of 16 sheep in France, four of each of four breeds. The advantage of using a Latin square design in this experiment was that only 16 sheep were needed, rather than 64 for a completely cross-classified design. The purpose was to show that one might just as well feed sheep root vegetables during the winter, which was cheaper and easier than feeding the normal diet of corn and hay.
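For a sense of how such a design lays out, here is a schematic 4 x 4 assignment in the spirit of that sheep experiment; the breed, pen, and diet labels are invented for illustration rather than taken from the historical study.

```python
# A schematic 4 x 4 Latin square treatment assignment: each diet appears once
# per breed (row) and once per pen (column). Labels are illustrative only.
DIETS = ["turnips", "beets", "potatoes", "corn and hay"]

# A one-step cyclic (circulant) Latin square gives a valid assignment.
layout = [[DIETS[(row + col) % 4] for col in range(4)] for row in range(4)]

for breed, row in zip(["breed A", "breed B", "breed C", "breed D"], layout):
    print(f"{breed}: " + ", ".join(row))
```

Because every diet meets every breed exactly once and every pen exactly once, breed and pen effects are balanced out of the diet comparison with only 16 animals, which is the economy the text describes.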
Table 1. Standard-Form 4 x 4 Latin Squares and PLS Counts and Examples
Latin Squares and Postage Stamps Here, there are 10 types of 4 x 4 philatelic Latin squares from 10 countries: Albania, Canada, Gambia, Guinea, Guinea-Bissau, Hong Kong (China), Malawi, Pitcairn Islands, Portugal, and Tristan da Cunha (see Table 1). The stamps feature animals, artwork, birds, Mickey Mouse, and scientific congresses. As the mathematics educator William Leonard Schaaf (1898–1992) wrote in Mathematics and Science: An Adventure in Postage Stamps, “The postage stamps of the world are, in effect, a mirror of civilization.” As of January 25, 2010, 165 examples of PLS from 74 countries (stamp-issuing authorities) have been identified, but only PLS of the following 10 types of 4 x 4 standard-form Latin squares (see Table 1): a234, a324, a342, a432, b324, b342, b423, b432, c423, and d324 4 x 4 PLS of the types a243, a423, c324, d243, and d342 were found, however, embedded in sheets of more than 16 stamps. Of these 165 PLS examples, 96 were issued for the World Wide Fund for Nature (known as the World Wildlife Fund in the United States and Canada) from 1988–2010 and 32 from Macau.
Philatelic Latin Squares of Type a The stamps shown in Figure 2 form a PLS of type a234, which is a one-step backward circulant and 2 x 2 block-Latin—the top left 2 x 2 corner of four stamps is the same as that in the bottom right corner and the top right corner of four stamps is the same as that in the bottom left corner. This seems to be the most popular type of PLS, with 56 examples found (see Table 1). The stamps were issued by Albania in 1999 and feature Mickey Mouse, the celebrated animated character that became an icon for The Walt Disney Company. The only PLS found of type a342 is shown in Figure 3. It is also in Sudoku form, like the PLS of type a324 shown in Figure 1. The stamps were issued in 2004 by Tristan da Cunha, a group of remote volcanic islands in the Atlantic Ocean, 1,750 miles from South Africa and 2,088 miles from South America. Depicted in Figure 3 is the sub-Antarctic fur seal, which is found in southern parts of the Indian and Atlantic oceans. The stamps shown in Figure 4 form a PLS of type a432, which is a one-step forward circulant and 2 x 2 block-Latin. This is a popular type of PLS, with 39 examples found (see Table 1). The stamps shown in Figure 4 were issued in 2009 by Malawi and feature Lilian’s lovebird, a small African parrot species.
Courtesy of George P. H. Styan
Figure 2. Mickey Mouse: Albania 1999, PLS type a234
Philatelic Latin Squares of Type b The stamps shown in Figure 5 form a PLS of type b324, which is in Sudoku form and another popular type of PLS, with 45 examples found (see Table 1). These stamps were issued by Hong Kong, China, in 2002 and feature four works of art: (top row, from left) “Lines in Motion,” by Chui Tze-hung; “Volume and Time,” by Hon Chi-fun; “Bright Sun,” by Aries Lee; and “Midsummer,” by Irene Chou. Hong Kong, which was a British Crown colony until 1997, is now a special administrative region of the People’s Republic of China, but continues to issue its own postage stamps.
Courtesy of Groth AG
Figure 3. Sub-Antarctic fur seal: Tristan da Cunha 2004, PLS type a342
Courtesy of Groth AG
Figure 4. Lilian’s lovebird: Malawi 2009, PLS type a432
Courtesy of George P. H. Styan
Figure 5. Hong Kong Art Collections: Hong Kong, China, 2002, PLS type b324
Courtesy of Groth AG
Figure 6. African buffalo: Guinea-Bissau 2002, PLS type b342
Courtesy of Groth AG
Figure 7. Pyrenean desman: Portugal 1997, PLS type b423
The sheet shown in Figure 6 comes from Guinea-Bissau, in western Africa. Formerly Portuguese Guinea, Bissau was added to the country’s name upon independence to prevent confusion between it and the nearby Republic of Guinea, which was formerly French Guinea and which issued the stamps shown in Figure 9. Depicted in Figure 6 is the African buffalo, a large African cloven-hoofed mammal. This sheet is the only PLS of type b342 that has been identified. The stamps shown in Figure 7 were issued by Portugal in 1997 and depict the Pyrenean desman, a small semi-aquatic mammal that lives in the Iberian peninsula. The stamps form a PLS of type b423, which is in Sudoku form. The sheet shown in Figure 8 forms a PLS of type b432, which is in Sudoku form and block-Latin. The sheet comes from Gambia, the smallest country on the African continental mainland. Depicted is the black crowned crane, a bird in the family Gruidae.
Courtesy of George P. H. Styan
Figure 8. Black crowned crane: Gambia 2006, PLS type b432
Philatelic Latin Squares of Types c and d The PLS of type c423 (Figure 9) comes from the Republic of Guinea and features the Guinea baboon in row 1, columns 1 and 2, and the collared mangabey in row 1, columns 3 and 4. The last example of a PLS is of type d324 (Figure 10) and the only PLS of this type that has been found. This type is called criss-cross because the same stamp appears in each cell in the main forward diagonal and the main backward diagonal. The stamps were issued by the Pitcairn Islands, a British overseas territory (formerly a British colony). Shown in row 1, column 1, is the black noddy, a seabird of the tern family. It resembles the closely related brown noddy in row 1, column 2, but is smaller with darker plumage, a whiter cap, a longer beak, and shorter tail. The blue-grey ternlet in row 1, column 3, and the sooty tern in row 1, column 4, also are seabirds of the tern family.
Courtesy of Groth AG
Figure 9. Baboons and mangabeys: Guinea 2000, PLS type c423
Courtesy of George P. H. Styan

Figure 10. Terns and noddies: Pitcairn Islands 2007, PLS type d324
Remarks It is not known whether the 10 Latin square designs discussed here and used in printing postage stamps are also the 10 most popular for research design purposes, or if they are particularly advantageous relative to other possibilities. However, Latin squares of types b324, b342, b423, and b432 all have balanced corners and are such that the treatment degrees of freedom can be partitioned into two from the first group of contrasts and one from the second. Moreover, b423 is called “Hudson’s systematic square,” named after Abram Wilfrid Hudson (1896–1982), who was a crop experimentalist with the New Zealand Department of Agriculture. Hudson conducted extensive correspondence with “Student” (i.e., William Sealy Gosset, 1876–1937) that mainly concerned randomization, a practice Hudson unwaveringly opposed. Hudson’s “balanced Latin squares” (i.e., b324, b342, and b423) define magic squares—in addition to the numbers in the rows and columns all adding to 10, so do the numbers in the main forward and backward diagonals. The only other 4 x 4 standard-form Latin square that has this property is type d234. A philatelic Latin square for type d234 has not yet been found. Type b324 (Figure 5) is the second most popular PLS and, in addition to being in Sudoku form, is centrosymmetric— reversing the rows and columns yields the original design. In fact, b324 is the only type of 4x4 Latin square in standard form that is both centrosymmetric and in Sudoku form. Type d234 is centrosymmetric, but is not in Sudoku form. An expanded version of this article (with all stamps shown in color) is available in the supplemental material at www.amstat. org/publications/chance.
Further Reading

Chu, Ka Lok, Simo Puntanen, and George P. H. Styan. 2009. Some comments on philatelic Latin squares from Pakistan. Pakistan Journal of Statistics 25(4):427–471.
Colbourn, Charles J., and Jeffrey H. Dinitz (eds.). 2007. Handbook of combinatorial designs (2nd ed.). Boca Raton: Chapman & Hall/Taylor & Francis.
Cretté de Palluel, [François]. 1790. On the advantage and economy of feeding sheep in the house with roots. Annals of Agriculture and Other Useful Arts 14:133–139 and foldout pages facing 139 (English translation).
Dénes, J., and A. D. Keedwell. 1974. Latin squares and their applications. New York: Academic Press.
Groth AG. WWF conservation stamp collection. www.wwfstamp.com.
Laywine, Charles F., and Gary L. Mullen. 1998. Discrete mathematics using Latin squares. New York: Wiley.
Llull, Ramon. 2007. Ars demonstrativa. Turnhout: Brepols.
Miller, Jeff. Images of mathematicians on postage stamps. http://jeff560.tripod.com/stamps.
Preece, D. A. 1990. R. A. Fisher and experimental design: a review. Biometrics 46(4):925–935.
Puntanen, Simo, and George P. H. Styan. 2008. Stochastic stamps: a philatelic introduction to chance. CHANCE 21(3):36–41.
Ritter, Terry. Latin squares: a literature survey—research comments from Ciphers by Ritter. www.ciphersbyritter.com/RES/LATSQ.HTM.
Roberts, H. S. 1999. A history of statistics in New Zealand. Wellington: New Zealand Statistical Association.
Schaaf, William L. 1978. Mathematics and science: an adventure in postage stamps. Reston: National Council of Teachers of Mathematics.
"Student." 1938. Comparison between balanced and random arrangements of field plots. Biometrika 29(3/4):363–378.
Styan, George P. H. 2009. Personal postage stamp collection. Verdun (Québec), Canada.
Ullrich, Peter. 1999. An Eulerian square before Euler and an experimental design before R. A. Fisher: on the early history of Latin squares. CHANCE 12(1):22–26.
Ullrich, Peter. 2002. Officers, playing cards, and sheep: on the history of Eulerian squares and of the design of experiments. Metrika 56(3):189–204.
Yates, F. 1939. The comparative advantages of systematic and randomized arrangements in the design of agricultural and biological experiments. Biometrika 30(3/4):440–466.
Goodness of Wit Test #7: Figure It Out
Jonathan Berkowitz, Column Editor
This issue's puzzle is another variety cryptic in the bar-type style. It was composed in a hotel room and restaurant in Shanghai, China, during spare time I had between teaching MBA classes for the University of British Columbia. Passersby certainly gave me puzzled looks as they tried to figure out what I was doing. In the same spirit, you will need to figure out the gimmick in this puzzle. A one-year (extension of your) subscription to CHANCE will be awarded for each of two correct solutions chosen at random from among those received by me by April 1, 2010. As an added incentive, a picture and short biography of each winner will be published in a subsequent issue. Mail your completed diagram to Jonathan Berkowitz, CHANCE Goodness of Wit Test Column Editor, 4160 Staulo Crescent, Vancouver, BC Canada V6N 3S2, or email a list of the answers to [email protected]. Note that winners of the puzzle contest in any of the three previous issues will not be eligible to win this issue's contest.
Solution to Goodness of Wit Test #5: Simple Inference

This puzzle appeared in CHANCE Vol. 22, No. 3.
For answers having the string of letters O-N-E or T-W-O embedded, replace those strings of letters with the numerals 1 or 2, as appropriate. For example, “NE(TWO)RKS” becomes “NE2RKS.” Across: 1 DUEL [rebus: due + l] 4 CLUBROOM [rebus + anagram: luck – k=clu +broom] 10 ANNULAR [rebus + terminal deletion: annul + (p)ar(t)] 11 TUBA (reversal + deletion: about – u] 12 TIRE [double definition] 13 SOF(TWO)OD [rebus: S (o + f + t + woo) D] 14 NE(TWO)RKS [ rebus: net + wo(r) ks] 16 ROGUERY [container + rebus: Rog(u)er + y] 17 YETI [rebus: yet + I] 18 ADORNING [rebus + container: ado + r(n) ing] 20 AMBITION [rebus + container: a + m(bit)i + on] 24 DIPS [reversal + container: s(pi)d] 26 APPAREL [anagram + deletion: wallpaper – l – w] 28 MEAT-AX [hidden word] 31 PH(ONE)TIC [anagram: pitch one] 32 ASIA [charade: as + IA] 33 TIER [double definition] 34 COMPONENT [anagram: top conmen] 35 SCRUTINY [anagram: runs city] 36 URGE [terminal deletion: (s)urge(s)] Down : 1 DATA [reversal + rebus: at+ad] 2 UNIFORM [rebus + container: U + n(I+f)orm] 3 ENRAGE [anagram: a green] 4 CLUE [container: c(l)ue] 5 LAST WORD [rebus + anagram: l + towards] 6 BROKER [double definition] 7 OU(TWO)RE [rebus + container: o(U+T+W+O)re 8 OBOE [initial letters: o+b+o+e] 9 MADRIGAL [anagram: a grim lad] 14 NUDISM [anagram: D minus] 15 SYNTAX [anagram + rebus: nasty + X] 16 READAPTS [container + rebus: read( a + p + t)s 19 NEEDING [rebus + beheadment: knee – k + ding] 21 THEIST [anagram: tithes] 22 OPTION [anagram: potion] 23 PRIS(ONE) R [rebus + anagram: PR + senior] 25 PI(ONE)ER [container: p(ion)eer] 27 CHIC [rebus: Chicago – ago] 29 ARMY [double definition + pun: army/leggy] 30 PATE [double definition]
A guide to solving cryptic clues appeared in CHANCE Vol. 21, No. 3. Using solving aids—electronic dictionaries, the Internet, etc.—is encouraged.

Winner from Goodness of Wit Test #5: Simple Inference

John Rickert earned his PhD in number theory from the University of Michigan and now teaches at Rose-Hulman Institute of Technology in Indiana. He coaches the Indiana ARML (math) team and enjoys analyzing the baseball statistics generated by Retrosheet. He also entertains himself by singing and working word and number puzzles.
Goodness of Wit Test #7: Figure It Out

Instructions: Eight of the clue answers will not fit into the grid without modification. All eight are to be modified the same way. It is up to you to figure it out. The other answers are entered normally. Enumerations are withheld to hide the identity of the eight special answers.
Across

1 Badger spies front of trap
5 Stick a derivative in this place
11 Advisors invalidate Igor's license
12 Built up, bank has appalling paper power
13 Back embarrassed, cunning stockbroker
14 Hindu god's period of mourning
16 Swears about introduction of overseas routes
17 Pokes holes right in the center of parts
18 Memories swirling around head of lounge waiter
20 Woman's half of palindrome?
22 Commits to keep brand in periods of decline
25 Lend cashier plastic lamps
27 Emcee consumes one drink
28 Stan's partner, while inside, advertises fuel suppliers
30 Advances for models
31 After second kiss call for difference in output
32 Guard small vestibule
33 Ladies' wild goals

Down

1 I'll impute surprising origin of risk factor
2 Pendant is all Valerie fashioned
3 Reverse fake achieved with great difficulty
4 Cry of terror from church under rocky debris
5 Englishmen edit slogan
6 Hot girl in punk hairdo is the easiest way (2 words)
7 Lego is my haven for self-absorption
8 Local official returning first lady to hospital
9 Trellises possibly please IRS
10 Henri mishandled a rupture
15 Feature of cathedral's unusual patterns
18 Yielded to Head of Ops in Special Education
19 See the French military's leader back confused struggles
21 Rain envelops one piece of sweet, dried fruit
22 Pay agent to return fish
23 Judge is harsh to imprison courtesan
24 Talks disrespectfully with senate fools
25 Duke's wife is mostly game
26 Rule-maker put club in tight spot
29 More solitary lady running to train—that is rare