Editor’s Letter
Sam Behseta,
Executive Editor
Dear Readers,

It is my great pleasure to serve as the new executive editor of CHANCE. As articulated in their first editorial letter, Stephen Fienberg and William Eddy founded this magazine on a simple premise: "CHANCE is intended to entertain and to inform. We hope to bring you articles and columns that will stimulate your thinking on innovative uses of statistics." Some 24 years and seven editors later, I cannot think of a better way to delineate the primary mission of CHANCE. In my view, the magazine will serve the statistical community—as well as its diverse readership outside the statistical world—more effectively by adhering to the original concept its founders envisioned. Realizing this objective, however, will remain nontrivial. As Fienberg put it in an interview with Howard Wainer for the 20th anniversary issue of CHANCE, "The difficult thing, as every CHANCE editor knows, is to get authors who can write about such topics without using equations and lots of technical jargon."

To set an agenda for CHANCE, one can simply draw inspiration from the variety of invigorating topics that have been covered in this magazine: the life and work of R. A. Fisher and Harold Jeffreys, expert witnesses and the courts, teacher evaluations and student grades, statistical history of the AIDS epidemic, memories of election night predictions, the hot hand in ice hockey, the use of statistical evidence in allegations of exam cheating, racial profiling, counter-terrorism, voting irregularities in Palm Beach, sex differences and traffic fatalities, evaluating agreement and disagreement among movie reviewers, fasting during Ramadan and traffic accidents in Turkey, the difficulty of faking data, anatomy of a jury challenge, and the Torah codes, among others.

I would like to express my gratitude to the former executive editor, Michael Larsen. CHANCE benefited greatly from his vision, resulting in highly engaging articles. The good news is he will stay around as an advisory editor. Speaking of the editorial board, I am pleased to welcome two new editors: Scott Evans of Harvard's School of Public Health and Aleksandra Slavkovic of Penn State, who is embarking on a new column titled "O' Privacy, Where Art Thou? Mapping the Landscape of Data Confidentiality."

We open this issue with Stephen Stigler's article about Galton's visualization of Bayes' theorem. Those familiar with Stigler's extensive work on the history of statistics know his every article is a treat. He introduces us to Galton's schematic presentation of the now well-known Bayesian normal model with a normal conjugate prior. As demonstrated in the article, Galton's 1877 machine was capable, quite amazingly, of generating the posterior normal distribution out of the normal-normal model.

Howard Wainer takes a momentary break from the popular Visual Revelations column to give a statistician's reading of the
highly sensitive discourse around the implementation of value-added models (VAM) in K–12 teachers' evaluations. Wainer's critique of the VAM scheme revolves around three arguments: the role of the counterfactual in causal inference, the problem of reading too much into test results, and the ever-present challenge of handling missing data.

Also in this issue, David Rockoff and Philip Yates take a fresh look at Joe DiMaggio's monumental hitting streak during the 1941 season by inquiring about how one would go about quantifying the rarity of such an event. The authors make the point that it is more meaningful to rephrase this objective in a grander context: What is the probability that any player, not just DiMaggio, can ever reach the 56-game mark?

Robert Burks provides an interesting combinatorial argument for the joy of mixing jellybeans to create exciting new flavors, followed by proposing a simple procedure for assessing the distribution of jellybeans. Harry Davis, Hershey Friedman, and Jiangming Ye revisit an ancient sampling problem of establishing a standard volume for eggs. Amid the familiar flaws of using the mean as the measure of centrality, they argue that the ancient approach of averaging the volume of two eggs—a large one and a small one—does well, especially when the underlying distribution of all eggs is assumed to be normal.

Michael Wenz and Joren Skugrud develop a probit model to study the effect of a critical decision in football: following a touchdown with a kick—one extra point—or adopting the riskier strategy of going for a two-point play. Stephen Marks and Gary Smith present an engaging overview of the two-child paradox: In a family with two children, given that one of the children is a girl, what is the probability that the other child is also a girl? Two popular answers are a half and a third, hence a paradox. It turns out that no ambiguity remains if the problem is tackled with Bayes' theorem. Hongmei Liu, Jay Parker, and Wei Sun employ a stratified single-stage sampling design to estimate the mis-shelving rate of books at the library of the University of Illinois at Chicago. Milton W. Loyer and Gene D. Sprechini raise a deceptively simple question: Can the overall probability of an event be larger or smaller than the sum of its individual probabilities? They motivate the solution using two examples: the statistical problem of estimating the proportion of fruit having some defect and the problem of calculating the probability of getting a hit from two baseball players with unequal batting averages.

Finally, I would like to invite you to visit the magazine's redesigned website at http://chance.amstat.org. The enhanced functionality of the site is largely due to the creative work of staff members in the Communications Department at the American Statistical Association.

Sam Behseta
Letter to the Editor

Dear Editor,
I would like to call your attention to a small misconception regarding spie charts, as described by Howard Wainer in his Visual Revelations column in the September 2010 issue [Vol. 23, No. 4] of CHANCE.

A spie chart is composed of a rose superimposed on a regular pie chart. In the example of Figure 4 in the column, the base pie chart describes the general population of Israel in 2002, divided into sex and age groups. The angle of each slice is proportional to the size of the corresponding segment in the population (e.g., males aged 15 to 19). These same angles are retained for the superimposed rose, which portrays Israel's road casualties for that year, divided into the same sex and age groups.

But the radii of the superimposed slices are not directly proportional to the square root of the number of casualties in each group, as suggested by Wainer. Instead, they are proportional to the square root of the ratio of the number of casualties to the general population in the segment. As a result, the areas of the superimposed slices do, in fact, correspond to the data, as they should. And comparing the radius of each slice to the circle bounding the base pie chart shows whether each segment of the population is over- or under-represented among road casualties. Thus, the spie chart shows both the absolute number in each category (slice size) and the hazard (relative radius), as opposed to the line chart of Figure 5, which shows only the hazard.

Dror Feitelson
The Hebrew University of Jerusalem
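To make the geometry concrete, here is a minimal sketch (with invented numbers, not the Israeli data) checking that when slice angles are proportional to population shares and radii to the square root of the casualty-to-population ratio, the slice areas come out proportional to the casualty counts.

```python
import math

# Hypothetical segments: (population share, casualties) -- illustrative only
segments = {
    "males 15-19":   (0.05, 90),
    "males 20-24":   (0.06, 80),
    "females 15-19": (0.05, 30),
    "everyone else": (0.84, 300),
}

total_pop = sum(pop for pop, _ in segments.values())

for name, (pop, casualties) in segments.items():
    angle = 2 * math.pi * pop / total_pop   # base pie: angle proportional to population
    radius = math.sqrt(casualties / pop)    # spie overlay: radius = sqrt(casualty rate)
    area = 0.5 * radius**2 * angle          # area of a circular sector
    # area = 0.5 * (casualties/pop) * 2*pi*(pop/total) = pi * casualties / total,
    # so the ratio below is the same constant for every segment
    print(f"{name:>14}: area / casualties = {area / casualties:.3f}")
```

The constant ratio printed for every segment is the point of the correction: with these definitions, slice area tracks the casualty count while the radius alone shows the hazard.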
Figure 4. Distribution of road accident casualties by age and sex, relative to the size of the population
Figure 5. Line chart of Israel accident data (likelihood of being in a traffic accident). Data source: Israel Bureau of Statistics, 2002
Galton Visualizing Bayesian Inference Stephen M. Stigler
What does Bayes Theorem look like? I do not mean what does the formula—f(θ|X) ∝ p(θ)f(X|θ)—look like; these days, every statistician knows that. I mean, how can we visualize the cognitive content of the theorem? What picture can we appeal to with the hope that any person curious about the theorem may look at it, and, after a bit of study, say, "Why, that is clear—I can indeed see what is happening!"

Francis Galton could produce just such a picture; in fact, he built and operated a machine in 1877 that performs that calculation. But, despite having published the picture in Nature and the Proceedings of the Royal Institution of Great Britain, he never referred to it again—and no reader seems to have appreciated what it could accomplish until recently.

The machine is reproduced in Figure 1 from the original publication. It depicts the fundamental calculation of Bayesian inference: the determination of a posterior distribution from a prior distribution and a likelihood function. Look carefully at the picture—notice it shows the upper
portion as three-dimensional, with a glass front and a depth of about four inches. There are cardboard dividers to keep the beads from settling into a flat pattern, and the drawing exaggerates the smoothness of the heap from left to right, something like a normal curve. We could think of the top layer as showing the prior distribution p(θ) as a population of beads representing, say, potential values for θ, from low (left) to high (right).

The machine does the computation with gravity providing the motive force. There is a knob at the right-hand side of each of two levels. When the platform supporting the top level of beads is withdrawn by pulling the upper knob at the right, the beads fall to the next lower level. On that second level, you can see what is intended to be a vertical screen, or wall, that is close to the glass front at both the left and the right, but recedes to the rear in the middle. If viewed from above, that screen would look something like a normal curve. The vertical screen represents the likelihood function; in this position, it reflects high likelihood for θs in the middle, but if moved to the right, it would represent high likelihood for larger values of θ. Similarly, if moved to the left, high likelihood for smaller θ.

The way the machine works its magic is that those beads to the front of the screen are retained as shown; those falling behind are rejected and discarded. (You might think of this stage as doing rejection sampling from the upper stage.) The surviving beads are shown at this level as a sort of nonstandard histogram, nonstandard because the depths of the compartments vary, with those toward the middle being deeper than those in the extremes. The final stage turns this into a standard histogram: The second support platform is removed by pulling to the right on its knob, and the beads fall to a slanted platform immediately below, rolling then to the lowest level, where the depth is again uniform—about one inch deep from the glass in front. This simply rescales the retained beads, resulting in a distribution that again looks somewhat like another normal curve, one a bit less dispersed than the prior distribution at the top. The magic of the machine is that this lowest level is proportional to the posterior distribution!

How does it work? Each compartment at the top is a fraction of the prior population—a fraction proportional to the prior density. When it falls to the second level, only a fraction of that fraction is retained—that fraction is a height of the likelihood function. The retained group is thus a fraction of a fraction—a product of the fractions, or a product of the prior density and the likelihood function. From Bayes Theorem, this gives a curve proportional to the posterior density. As shown, prior and likelihood both look like normal distributions, with the observed value taken as the central (a priori most likely) value. If the prior were N(0, σ²) and the likelihood were proportional to N(0, τ²), the resulting posterior would be proportional to N(0, c²), where 1/c² = 1/σ² + 1/τ². Galton did the math for exactly this case—perhaps the first time anyone ever deduced the relationship between the prior and posterior variance for normal priors and likelihoods. But the mechanism does not require normality; it is perfectly general.
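As a quick check on this reading of the machine, here is a minimal simulation sketch (a modern analog, not Galton's construction): beads drawn from a normal prior are retained with probability proportional to a normal likelihood, and the spread of the survivors is compared with the posterior variance c², where 1/c² = 1/σ² + 1/τ². The values of σ, τ, and the number of beads are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1877)

sigma, tau = 1.0, 0.7          # prior sd and likelihood sd (illustrative values)
n_beads = 200_000              # beads poured into the top level

# Top level: beads laid out according to the prior N(0, sigma^2)
theta = rng.normal(0.0, sigma, n_beads)

# Second level: the vertical screen retains each bead with probability
# proportional to the likelihood N(x = 0 | theta, tau^2)
likelihood = np.exp(-0.5 * (theta / tau) ** 2)   # unnormalized, max value 1
keep = rng.uniform(size=n_beads) < likelihood    # rejection sampling
posterior_beads = theta[keep]

# Bottom level: the surviving beads should follow the posterior, whose
# variance for the normal-normal model is 1 / (1/sigma^2 + 1/tau^2)
c2 = 1.0 / (1.0 / sigma**2 + 1.0 / tau**2)
print(f"empirical posterior variance: {posterior_beads.var():.4f}")
print(f"theoretical c^2             : {c2:.4f}")
```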
Figure 1. Galton's 1877 machine (images courtesy of www.galton.org)
Bayes' Take

Thomas Bayes (1702–1761), himself, approached his theorem visually. In his original article (published posthumously in 1764), he imagined what is best described today as a billiard table. A ball is rolled at random, and after it comes to rest, its position θ is noted, where 0 represents the left-most possible position and 1 the right-most possible. That roll represents the prior distribution of θ on [0,1]. Then, the ball is rolled again n times, and the number X of times the ball comes to rest to the left of θ is recorded. Given θ, X has a binomial distribution, and Bayes' Theorem would then give the posterior distribution of θ, given X. The example was restricted to uniform prior distributions and the binomial model, and Bayes—unlike Francis Galton—fell back on analysis to find the posterior—an easy task in this case. Galton's method is applicable to Bayes' example, with the vertical screen being taken as proportional to the appropriate Beta density. Karl Pearson, in 1920, noticed that Bayes' example could encompass nonuniform priors—just use a warped billiard table. But, the restriction to beta-binomial likelihood functions seems essential.
Figure 2. An illustration of how different likelihood functions would extract different subsets of the prior population (panel title: "1877 Algorithm: Normal Prior-Posterior")
As mentioned, shifting the vertical screen to the left or right will reflect shifts in the likelihood function. If a lower X value is observed, the vertical screen should be moved to the left by the appropriate amount; if a higher value X is observed, the screen should be moved to the right (Figure 2). The vertical likelihood screen operates like a cookie cutter, automatically cutting the appropriate proportion out of the prior distribution to give (after a simple rescaling) the posterior distribution. With Galton’s machine, we can see how Bayes Theorem works by screening out possible prior values in proportion to the corresponding value of the likelihood. This machine was invented by Galton and used in a public lecture on February 9, 1877. It does not survive today, except through this picture. In 1877, Galton did not have Bayes Theorem explicitly in mind, and so my discussion here represents a repurposing of his device. Galton thought of it as displaying the action of natural selection in a model for inheritance of quantitative characteristics.
The device may be seen as performing a simple multiplication (of prior times likelihood) or as an analog for a rejection sampling algorithm. But as repurposed here, it shows how one may visualize the fundamental operation of Bayesian inference and illustrates how a complex mathematical algorithm may be seen as a simple instance of thresholding.
Further Reading Galton, Francis. 1877. Typical laws of heredity. Nature 15:492–495, 512–514, 532–533. Also published in Proceedings of the Royal Institution of Great Britain 8:282–301. Stigler, Stephen M. 1982. Thomas Bayes’ Bayesian inference. Journal of the Royal Statistical Society, Series A 145:250–258. Stigler, Stephen M. 2010. Darwin, Galton, and the statistical enlightenment. Journal of the Royal Statistical Society, Series A 173:469–482.
Value-Added Models to Evaluate Teachers: A Cry for Help Howard Wainer
This essay was extracted from Howard Wainer’s book Uneducated Guesses: A Leisurely Tour of Some Educational Policies That Only Make Sense If You Say Them Fast, to be published this year by Princeton University Press.
Race to the Top, the Obama administration's program to help reform American education, has much to recommend it—not the least of which is the infusion of much needed money. So it came as no surprise to anyone that resource-starved states rushed headlong to submit modified education programs that would qualify them for some of the windfall. A required aspect of all such reforms is the use of student performance data to judge the quality of districts and teachers. This is a fine idea, not original to Race to the Top. The late Marvin Bressler (1923–2010), Princeton University's renowned educational sociologist, said the following in a 1991 interview:

Some professors are justly renowned for their bravura performances as Grand Expositor on the podium, Agent Provocateur in the preceptorial, or Kindly Old Mentor in the corridors. These familiar roles in the standard faculty repertoire, however, should not be mistaken for teaching, except as they are validated by the transformation of the minds and persons of the intended audience.

But how are we to measure the extent to which "the minds and persons" of students are transformed? And how much of any transformation observed do we assign to the teacher as the causal agent? These are thorny issues indeed. The beginning of a solution has been proposed in the guise of what are generally called "value-added models," or VAMs for short. These models try to partition the change in student test scores among the student, the school, and the teacher. Although these models are still very much in the experimental stage, they have been seized upon as "the solution" by many states and thence included as a key element in their reform proposals. Their use in the evaluation of teachers is especially problematic. Let me describe what I see as three especially difficult problems in the hopes that I might instigate some of my readers to try to make some progress toward their solution.

Problem 1—Causal Inference

One principal goal of VAMs is to estimate each teacher's effect on his/her students. This probably would not be too difficult to do if our goal was just descriptive (e.g., Freddy's math score went up 10 points while Ms. Jones was his teacher). But, description is only a very small step if this is to be used to evaluate teachers. We must have a causal connection. Surely no one would credit Ms. Jones if the claim was "Freddy grew four inches while Ms. Jones was his teacher," although it, too, might be descriptively correct. How are we to go from description to causation? A good beginning would be to know how much of a gain Freddy would have made with some other teacher. But alas, Freddy didn't have any other teacher. He had Ms. Jones. The problem of the counterfactual plagues all of causal inference.

We would have a stronger claim for causation if we could randomly assign students to teachers and thence compare the average gain of one teacher with that of another. But, students are not assigned randomly. And even if they were, it would be difficult to contain post-assignment shifting. Also, randomization doesn't cure all ills in very finite samples. The VAM parameter that is called "teacher effect" is actually misnamed; it should more properly be called "classroom effect." This change in nomenclature makes explicit that certain aspects of what goes on in the classroom affect student learning but are not completely under the teacher's control. For example, suppose there is one 4th grader whose lack of bladder control regularly disrupts the class. Even if his class assignment is random, it still does not allow fair causal comparisons. And so if VAMs are to be usable, we must utilize all the tools of observational studies to make the assumptions required for causal inference less than heroic.
Problem 2—The Load VAM Places on Tests

VAMs have been mandated for use in teacher evaluation from kindergarten through 12th grade. This leads through dangerous waters. It may be possible for test scores on the same subject, within the same grade, to be scaled so that a 10-point gain from a score of, say, 40 to 50 has the same meaning as a similar gain from 80 to 90. It will take some doing, but I believe that current psychometric technology may be able to handle it. I am less sanguine about being able to do this across years. Thus, while we may be able to make comparisons between two 4th-grade teachers with respect to the gains their students have made in math, I am not sure how well we could do if we were comparing a 2nd-grade teacher and a 6th-grade one. Surely a 10-point gain on the tests that were properly aimed for these two distinct student populations would have little in common. In fact, even the topics covered on the two math tests are certain to be wildly different. If these difficulties emerge on the same subject in elementary school, the problems of comparing teachers in high school seem insurmountable. Is a 10-point gain on a French test equal to a 10-point gain in physics? Even cutting-edge psychometrics has no answers for this. Are you better at French than I am in physics? Was Mozart a better composer than Babe Ruth was a hitter? Such questions are not impossible to think about—Mozart was a better composer than I am a hitter—but only for very great differences. Judging differences among teachers is usually much more subtle. What can we do to make some gains on this topic?

Some Hows and Whys of VAM

The principal claim made by the developers of VAM—William L. Sanders, Arnold M. Saxton, and Sandra P. Horn—is that through the analysis of changes in student test scores from one year to the next, they can objectively isolate the contributions of teachers and schools to student learning. If this claim proves to be true, VAM could become a powerful tool for both teachers' professional development and teachers' evaluation. This approach represents an important divergence from the path specified by the "adequate yearly progress" provisions of the No Child Left Behind Act, for it focuses on the gain each student makes, rather than the proportion of students who attain some particular standard. VAM's attention to individual students' longitudinal data to measure their progress seems filled with common sense and fairness.

There are many models that fall under the general heading of VAM. One of the most widely used was developed and programmed by William Sanders and his colleagues. It was developed for use in Tennessee and has been in place there for more than a decade under the name Tennessee Value-Added Assessment System. It also has been called the "layered model" because of the way each of its annual component pieces is layered on top of another. The model begins by representing a student's test score in the first year, y1, as the sum of the district's average for that grade, subject, and year, say μ1; the incremental contribution of the teacher, say θ1; and systematic and unsystematic errors, say ε1. When these pieces are put together, we obtain a simple equation for the first year:

y1 = μ1 + θ1 + ε1,   (1)

or

Student's score (1) = district average (1) + teacher effect (1) + error (1).

There are similar equations for the second, third, fourth, and fifth years, and it is instructive to look at the second year's equation, which looks like the first except it contains a term for the teacher's effect from the previous year:

y2 = μ2 + θ1 + θ2 + ε2,   (2)

or

Student's score (2) = district average (2) + teacher effect (1) + teacher effect (2) + error (2).

To assess the value added (y2 − y1), we merely subtract equation (1) from equation (2) and note that the effect of the teacher from the first year has conveniently dropped out. While this is statistically convenient, because it leaves us with fewer parameters to estimate, does it make sense? Some have argued that although a teacher's effect lingers beyond the year the student had her/him, that effect is likely to shrink with time. Although such a model is less convenient to estimate, it more realistically mirrors reality. But, not surprisingly, the estimate of the size of a teacher's effect varies depending on the choice of model. How large this choice-of-model effect is, relative to the size of the "teacher effect," is yet to be determined. Obviously, if it is large, it diminishes the practicality of the methodology. Recent research from the Rand Corporation, comparing the layered model with one that estimates how a teacher's effect changes from one year to the next, suggests that almost half of the teacher effect is accounted for by the choice of model.

One cannot partition student effect from teacher effect without information about how the same students perform with other teachers. In practice, using longitudinal data and obtaining measures of student performance in other years can resolve this issue. The decade of Tennessee's experience with VAM led to a requirement of at least three years' data. This requirement raises concerns when (i) data are missing and (ii) the meaning of what is being tested changes with time.
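To make the algebra of equations (1) and (2) concrete, here is a minimal simulation sketch of the layered model (illustrative parameter values, not the TVAAS implementation), showing that the year-1 teacher effect drops out of the gain y2 − y1.

```python
import numpy as np

rng = np.random.default_rng(0)
n_students = 1000

# Illustrative parameters (assumed values, not estimates from real data)
mu1, mu2 = 500.0, 520.0            # district averages, years 1 and 2
theta1, theta2 = 6.0, -2.0         # teacher effects, years 1 and 2
eps1 = rng.normal(0, 15, n_students)
eps2 = rng.normal(0, 15, n_students)

# Layered model: the year-1 teacher effect persists into year 2
y1 = mu1 + theta1 + eps1                     # equation (1)
y2 = mu2 + theta1 + theta2 + eps2            # equation (2)

gain = y2 - y1                               # value added
# E[gain] = (mu2 - mu1) + theta2: theta1 has cancelled
print(f"mean gain            : {gain.mean():.2f}")
print(f"(mu2 - mu1) + theta2 : {(mu2 - mu1) + theta2:.2f}")
```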
Problem 3—Missing Data

Missing data are always a huge problem in practical situations, and here the problem is made even more critical by the instability of VAM parameter estimates. The sample size available for the estimation of a teacher effect is typically about 30. This has not yielded stable estimates.
One VAM study showed that only about 20% of teachers in the top quintile one year were in the top quintile the next. This result can be interpreted in either of two ways:

1. The teacher effect estimates aren't much better than random numbers.
2. Teacher quality is ephemeral, and so a very good teacher one year can be awful the next.

If we opt for (1), we must discard VAM as too inaccurate for serious use. If we opt for (2), the underlying idea behind VAM (that being a good teacher is a relatively stable characteristic we wish to reward) is not true. In either case, VAM is in trouble.
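A small simulation sketch (with assumed effect and noise sizes, not real VAM data) shows how estimation noise alone can pull the year-to-year top-quintile repeat rate far below 100% and toward the 20% that pure chance would give, even when true teacher quality is perfectly stable.

```python
import numpy as np

rng = np.random.default_rng(42)
n_teachers = 5000

# Assumed values: true (stable) teacher effects observed through noisy
# yearly estimates. With roughly 30 students per class, the estimation
# noise can easily rival the spread of true effects; here the noise sd
# is set to twice the effect sd purely for illustration.
true_effect = rng.normal(0, 1, n_teachers)
year1_est = true_effect + rng.normal(0, 2, n_teachers)
year2_est = true_effect + rng.normal(0, 2, n_teachers)

top1 = year1_est >= np.quantile(year1_est, 0.8)   # top quintile, year 1
top2 = year2_est >= np.quantile(year2_est, 0.8)   # top quintile, year 2
repeat_rate = (top1 & top2).sum() / top1.sum()
print(f"year-1 top-quintile teachers still in top quintile: {repeat_rate:.0%}")
```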
Current belief is that the problem is (1) and we must try to stabilize the estimates by increasing the sample size. This can be done in lots of ways. Four that come to mind are:

(a) Increasing class size to 300 or so
(b) Collapsing across time
(c) Collapsing across teachers
(d) Using some sort of empirical Bayes trick, gathering stability by borrowing strength from both other teachers and other time periods

Option (a), despite its appeal to lunatic cost-cutters, violates all we know about the importance of small class sizes, especially in elementary education. Option (c) seems at odds with the notion of trying to estimate a teacher effect, and it would be tough to explain to a teacher that her rating was lowered this year because some of the other teachers in her school had not performed up to par. Option (d) is a technical solution that has much appeal to me, but I don't know how much work has been done to measure its efficacy. Option (b) is the one that has been chosen in Tennessee, the state that has pioneered VAM, and has thence been adopted more-or-less pro forma by the other states in which VAMs have been mandated. But, requiring longitudinal data increases data-gathering costs and the amount of missing data.

What data are missing? Typically, test scores, but also sometimes things like the connection between student and teacher. But, let's just focus on missing test scores. The essence of VAM is the adjusted difference between pretest and posttest scores (often given at the ends of the school year, but sometimes there is just one test given in a year and the pretest score is the previous year's post score). The pre-score can be missing, the post-score can be missing, or both can be missing. High student mobility increases the likelihood of missing data. Inner-city schools have higher mobility than suburban schools. Because it is unlikely that missingness is unrelated to student performance, it is unrealistic to assume that we can ignore missing data and just average around them. Yet, often this is just what is done. If a student's pre-test score is missing, we cannot obtain what the change is unless we do something. What is often
done is the mean pre-score for the students that have them (in that school and that grade) is imputed for the missing score. This has the advantage of allowing us to compute a change score, and the mean scores for the actual data and the augmented data (augmented with the imputed scores) will be the same. This sounds like a plausible strategy, but only if you say it fast. It assumes that the people who are missing scores are just like the ones that have complete data. This is unlikely to be true. It isn’t hard to imagine how a principal under pressure to show big gains could easily game the system. For example, the principal could arrange a field trip for the best students on the day that the pre-test is to be given. Those students would have the average of all those left behind imputed as their score. Then, at the end of the year when the posttest is to be given, there is a parallel field trip for the worst students. Their missing data will be imputed from the average of those who remain. The gain scores could thus be manipulated, and the size of the manipulation is directly related to the academic diversity of the student population—the more diverse, the greater the possible gain. Obviously, a better method for dealing with missing data must be found.
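Here is a toy numeric sketch of that field-trip scenario (entirely hypothetical scores) showing how mean imputation of strategically missing tests inflates the apparent gain.

```python
import numpy as np

# Hypothetical class of 10 students; true scores improve by 5 points for everyone.
pre_true = np.array([40, 45, 50, 55, 60, 65, 70, 75, 80, 85], dtype=float)
post_true = pre_true + 5
honest_gain = (post_true - pre_true).mean()          # 5.0

# Gaming: the best three students miss the pre-test,
# the worst three miss the post-test.
pre = pre_true.copy();  pre[-3:] = np.nan             # 75, 80, 85 missing
post = post_true.copy(); post[:3] = np.nan            # 45, 50, 55 missing

# Mean imputation from the students who were actually tested
pre[np.isnan(pre)] = np.nanmean(pre)
post[np.isnan(post)] = np.nanmean(post)

print(f"honest mean gain : {honest_gain:.1f}")
print(f"gamed mean gain  : {(post - pre).mean():.1f}")
```

With these numbers the honest gain of 5 points becomes an apparent gain of 20, and the inflation grows with the spread of the class, as the text argues.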
Concluding Remarks

There is substantial evidence that the quality of teachers is of great importance in children's education. We must remember, however, the lessons brought home to us in the Coleman report (and replicated many times in the almost half-century since) that the effects of home life dwarf teacher effects, whatever they are. If a classroom is made up of students whose home life is filled with the richness of learning, even an ordinary teacher can have remarkable results. But, conversely, if the children's homes reflect chronic lack, and the life of the mind is largely absent, the teacher's task is made insuperably more difficult. Value-added models represent the beginning of an attempt to help us find, and thence reward, the most gifted teachers. But, despite substantial efforts, these models are still not ready for full-scale implementation. I have tried to describe what I believe are the
biggest challenges facing the developers of this methodology. I do this in the hope that once the problems are made explicit, others will add the beauty of their minds to the labor of mine and we may make some progress. But we must be quick, because the pressures of contemporary politics allow little time for extended reflection. Marcel Proust likened aging to being “perched upon living stilts that keep on growing.” We can see farther, but passage is increasingly wobbly. This essay exemplifies Proust’s metaphor.
Further Reading Ballou, D., W. Sanders, and P. Wright. 2004. Controlling for student background in value-added assessment of teachers. Journal of Educational and Behavioral Statistics 29(1):37–65. Braun, H. I., and H. Wainer. 2007. Value-added assessment. In Handbook of statistics, volume 27: Psychometrics, ed. C. R. Rao and S. Sinharay, 867–892. Amsterdam: Elsevier Science. Bressler, M. 1992. A teacher reflects. Princeton Alumni Weekly 93(5):11–14. Coleman, J. S., et al. 1966. Equality of educational opportunity. Washington, DC: U.S. Office of Education. Mariano, L. T., D. F. McCaffrey, and J. R. Lockwood. 2010. A model for teacher effects from longitudinal data without assuming vertical scaling. Journal of Educational and Behavioral Statistics 35:253–279. National Research Council. 2010. Getting value out of value-added. H. Braun, N. Chudowsky, and J. Koenig (eds.). Washington, DC: National Academy Press. Rubin, D. B., E. A. Stuart, and E. L. Zanutto. 2004. A potential outcomes view of value-added assessment in education. Journal of Educational and Behavioral Statistics 29(1):103–116. Sanders, W. L., A. M. Saxton, and S. P. Horn. 1997. The Tennessee value-added educational assessment system (TVAAS): A quantitative, outcomes-based approach to educational assessment. In Grading teachers, grading schools: Is student achievement a valid evaluation measure?, ed. J. Millman, 137–162. Thousand Oaks, California: Corwin Press, Inc.
Joe DiMaggio Done It Again … and Again and Again and Again?
David Rockoff and Philip Yates
Joe DiMaggio done it again!
Joe DiMaggio done it again!
Clackin' that bat, gone with the wind!
Joe DiMaggio's done it again!
– "Joe DiMaggio Done It Again," Woody Guthrie, 1949
AP Photo/Preston Stroup
One of the most-celebrated feats in the history of American sports is baseball player Joe DiMaggio's 56-game hitting streak during the 1941 season. In major league baseball, a 30% success rate by a hitter is considered good, and a batter will typically have three to five attempts in a game; Joltin' Joe hit safely in 56 consecutive games, a record that has rarely been approached since. One might wonder just how amazing this accomplishment is, or how surprised people should be that there has been such a long streak in the history of the sport. Many have attempted to quantify this using statistical methods.

A basic probability approach goes something like this: If a player has a .300 batting average (i.e., gets a hit in 30% of his at-bats) and has four at-bats each game, his probability of getting at least one hit in a given game is 1 − (1 − .3)^4, or .76. Thus, the probability of this player getting a hit in every game during a given 56-game stretch is .76^56, or .00000021 (assuming all at-bats and all games are independent). But, we are really interested in the probability of there ever being a 56-game hitting streak by any player, not just the probability of one particular player achieving it over a given 56-game stretch. A direct probability approach is then difficult, mainly because the universe of 56-game stretches for a player is not independent; performance in games 1–56 during a season shares much information with performance in games 2–57. Fortunately, advances in computer technology and in the availability of baseball data have made a simulation approach feasible.

The inspiration for our research came from a New York Times article on March 30, 2008, titled "A Journey to Baseball's Alternative Universe." Samuel Arbesman and Steven Strogatz ran simulations of baseball seasons to estimate the probability of long hitting streaks, using data for each batter during each season in major league history. They treated a player's at-bats per game as constant across all games in a season; simulated 10,000 baseball histories; and tabulated which player held the longest streak, when he did it, and how long his streak was. There was a hitting streak of at least 56 games in 42% of these simulated histories, meaning we should not be all that amazed that there has been one in our actual observed history.

Don M. Chance, in the CHANCE article "What Are the Odds? Another Look at DiMaggio's Streak," used a modified calculation of hitting opportunities to study the likelihood of long hitting streaks, arguing that nonintentional bases on balls and sacrifice flies are opportunities for a hit and should be included in any calculation. This increases the number of opportunities in a game, but decreases the probability of success in a single opportunity. The net effect is a decrease in the probability of a hit in most games and, thus, of a lengthy hitting streak. We disagree with the notion that a base on balls is a missed opportunity for a hit. A base on balls is usually quite beneficial for the team on the receiving end. A good hitter, unless he has a very long streak going already, is unlikely to eschew the walk in favor of swinging at bad pitches in a dire bid to get a hit. That is presumably why a base on balls is not counted against a player's batting average.

Constant vs. Variable At-Bats

Figure 1. Contour plot of the probability of a player with a .300 batting average getting a hit in each of two games

We noted that assigning the same number of at-bats for each game greatly overestimates the probability of long streaks. This is due to Jensen's inequality. The probability of a two-game hitting streak is much lower if the player's at-bats are, say, two and then six than if his at-bats are four and then four. A simple example can be seen in Figure 1. The probability of a two-game hitting streak is much lower if the player has fewer at-bats in each game. Going down a northwest–southeast diagonal of this lattice structure graphically illustrates Jensen's inequality. Thus, the constant at-bat assumption overestimates the likelihood of long hitting streaks. The need to vary at-bats also is due to the fact that they have decreased and are
even more varied over time. Figure 2 illustrates this phenomenon. The simulations run in “Chasing DiMaggio: Streaks in Simulated Seasons Using Non-Constant At-Bats,” published in the Journal of Quantitative Analysis in Sports in 2009, varied at-bats using Retrosheet game data for all of major league baseball from 1954–2007, as well as for the National League in 1911, 1921, 1922, and 1953. Since the publication of that paper, Retrosheet has added game data from both the American League and National League for the 1920–1929 seasons. It should be noted that these simulations are not true simulations of a game, but simulations of a player’s at-bats in a game over all the games played in a season. Unfortunately, due to the unavailability of some game-by-game data, some or all of the careers of some of the best hitters in baseball (e.g., Willie Keeler, Ted Williams, Joe DiMaggio) are not included in this analysis. The following is a brief overview of the simulations with varying at-bats. For each hitter in each season, the batting average is fixed, using that player’s actual batting average for that season.
Figure 2. Density estimates of at-bats per game by decade. Note: The unit of analysis is a player-season.
Table 1—Top 20 Maximum Hitting Streaks in 1,000 Simulated Baseball Histories
Table 2—Hitting Streaks in 18,607,000 Simulated Player–Seasons
This allows us to treat the 1990 and 2001 versions of a player such as Barry Bonds as two players. We assumed at-bats over the course of a single game are independent. This means the number of hits a player gets in a given game has a binomial distribution with the number of trials equal to the at-bats in that game and the probability of success equal to the batting average for the season. Using the game-by-game data from Retrosheet, each player in each season had their at-bats sampled with replacement from their actual at-bat distribution to create a simulated season's worth of games. This was done 1,000 times to create 1,000 "simulated" baseball histories. A hitting streak was considered to be any run of games with at least one hit. In the results section, this method will be denoted as Binom.
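Here is a minimal sketch of a Binom-style simulation (a simplified stand-in written for illustration, not the authors' code; the at-bat counts and number of replications are invented, though .357 is DiMaggio's actual 1941 batting average).

```python
import numpy as np

rng = np.random.default_rng(1941)

def longest_hitting_streak(games_with_hit):
    """Length of the longest run of consecutive games with at least one hit."""
    best = run = 0
    for hit in games_with_hit:
        run = run + 1 if hit else 0
        best = max(best, run)
    return best

def simulate_season(at_bat_counts, batting_avg):
    """One simulated season: resample at-bats, draw binomial hits per game."""
    ab = rng.choice(at_bat_counts, size=len(at_bat_counts), replace=True)
    hits = rng.binomial(ab, batting_avg)
    return longest_hitting_streak(hits >= 1)

# Hypothetical per-game at-bat counts for a 140-game season; the published
# study draws these from Retrosheet game logs instead.
at_bats = np.array([4, 4, 3, 5, 4, 2, 4, 5, 3, 4] * 14)
streaks = np.array([simulate_season(at_bats, 0.357) for _ in range(2000)])
print(f"seasons with a 56+ game streak: {np.mean(streaks >= 56):.2%}")
print(f"median longest streak         : {np.median(streaks):.0f} games")
```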
Varying Batting Average

While the number of at-bats varied each game, the batting average remained constant. The question remained: How should we vary batting average in this simulation study? In "Chasing DiMaggio: Streaks in Simulated Seasons Using Non-Constant At-Bats," why did the authors feel the need to vary at-bats during the simulations? Over the course of a baseball season, a player is not going to have the same number of at-bats in every game. There may be days when the player is starting and others when he may be coming off the bench to pinch-hit. Varying the at-bats is an attempt to mimic this phenomenon in the simulations. How do we attempt to vary batting average in our simulations to resemble the course of a season? We attempted
this in three ways. The first method we used was the simplest approach to varying batting average. We treated a player’s chance at a hit in a given at-bat as a beta random variable, taking the player’s actual number of hits (successes) in a season as the first shape parameter and the player’s actual number of outs (failures) in a season as the second shape parameter. The mean of this random variable would be the batter’s batting average in a season. In the results section, this method will be denoted as Beta. The second method of varying batting average treats hit probability in a given game as correlated with performance in ”neighboring” games, which may better mimic so-called hot- and cold-hand effects. For a brief description, let’s look at how the method works for a neighborhood of 15 games.
Table 3—Hitting Streaks in 1,000 Simulated Baseball Histories
Table 4—Hitting Streaks for 1941 Joe DiMaggio in 10,000 Simulations
In the simulations, we start with a fixed batting average for each game and run a simulated season, just like the Binom method. Hit probabilities are then updated by incorporating information from neighboring games into the simulated season. For instance, in game 50 of a simulated season, the probability of a base hit is reflected by the player’s performance in games 35 through 65. Using the new hit probabilities, the simulations generate a new array of hits. This process is repeated for each game in a player’s season and denoted as Binom-15. The third method of varying batting average is the same as Binom-15, except it uses a 30-game neighborhood; that is, the probability of a base hit in game 50 should be reflected by the player’s performance in games 20 through 80. This method is denoted as Binom-30.
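The article does not spell out the exact updating formula, so the sketch below shows just one plausible reading of the neighborhood idea: each game's hit probability is re-estimated from the hits and at-bats in a window of surrounding games, and the season is then regenerated. The window width, player, and update rule here are all assumptions for illustration, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(7)

def neighborhood_probs(hits, at_bats, half_width):
    """Per-game hit probability estimated from a window of neighboring games.
    One plausible reading of the Binom-15 / Binom-30 idea, not the published formula."""
    n = len(at_bats)
    probs = np.empty(n)
    for g in range(n):
        lo, hi = max(0, g - half_width), min(n, g + half_width + 1)
        probs[g] = hits[lo:hi].sum() / max(at_bats[lo:hi].sum(), 1)
    return probs

# Hypothetical season: 140 games, 4 at-bats each, .300 season average
at_bats = np.full(140, 4)
hits = rng.binomial(at_bats, 0.300)                 # initial Binom-style draw

probs = neighborhood_probs(hits, at_bats, half_width=15)   # Binom-15-like window
new_hits = rng.binomial(at_bats, probs)             # regenerate with local "hotness"
print(f"local hit probabilities range from {probs.min():.3f} to {probs.max():.3f}")
```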
Results

Table 1 lists the top 20 performances in the simulations. It should be noted that these are the peak streaks for a player. For example, using the Binom-15 method, the 1922 version of George Sisler had a maximum hitting streak of 95 games. His second-highest streak was 83 games. That streak is not included on the list since his highest was 95. Another way to look at the results is to summarize the hitting streaks over each simulated player-season; yet another is to summarize over each of the individual 1,000 baseball histories. Table 2 compares our four methods, showing how many simulated player-seasons contained a hitting streak of at least 40 games, at
least 50 games, and at least 56 games (DiMaggio's record). Table 3 shows the same breakdown, but over entire simulated histories. The Binom-15 method yields the most long streaks—561 individual player-seasons contained a hitting streak of 56 games or longer, and a whopping 450 of the 1,000 histories featured such a streak. This means that, if we assume batting ability fluctuates smoothly over the course of a month, we should not be all that surprised that there has been a 56-game hitting streak in real life. It also shows that results change significantly depending on our assumptions. How did Joltin' Joe do in our simulations using all four methods, plus the constant at-bat method? We ran 10,000 simulations of DiMaggio's 1941 season, using his actual game-by-game at-bats.
Joe DiMaggio with bat ready at first day's workout on March 6, 1946, in Bradenton, Florida, after returning from Panama. DiMaggio had been out since 1942 for U.S. Army service. AP Photo/Preston Stroup
DiMaggio’s actual game-by-game data were obtained from Cliff Blau, a member of the Society for American Baseball Research. Table 4 shows he reached his record only a few times with each method and had more long streaks with the two Binom methods. The first two rows also reinforce that when batting average is treated as constant, the assumption of constant at-bats makes a long streak more likely.
Multiseason Streaks

At the end of the 2005 season through the beginning of the 2006 season, Jimmy Rollins of the Philadelphia Phillies had a hitting streak of 38 games. His streak would not have been captured by the methods previously discussed, although it officially counts as a streak. Chance looked at the odds of long hitting streaks using a player's career data. His analysis used only 125 players—the top 100 in career batting average, plus all
other players who had real-life hitting streaks of at least 30 games. We ran additional simulations on these good hitters to estimate the frequency of long hitting streaks that span two seasons. We started with the same list of players for our analysis using the game-by-game at-bats for their entire career found in the Retrosheet data. Retrosheet contains the complete careers for only 24 of those players and partial data for an additional 62. In these simulations, multiseason long streaks occurred approximately one-tenth as often as single-season long streaks, meaning that to truly compare the likelihood of witnessing a long streak, we should increase the number of long streaks in Tables 2 and 3 by 10%.
Concluding Thoughts

Why is there such a difference between the simulations when batting average is fixed and batting average varies under the "neighbor" method? If a batter was successful in his first 15 or 30 games, his batting average (probability of success of a hit) is going to be higher for the next game. A player who starts a simulated season "hot" is going to have longer streaks using this method. The 30-game neighborhood method had a streak greater than DiMaggio's 1941 streak in 34.3% of our simulated baseball seasons. The 15-game neighborhood method had a streak greater than DiMaggio's 1941 streak in 45% of our simulated baseball seasons. Over a stretch of 15 games, it might be more common to see a player have a higher batting average than over a stretch of 30 games. This may have produced longer hitting streaks in the simulations.

If we are to believe the results of these attempts at simulating baseball histories, it is not surprising that someone did have a 56-game hitting streak during the course of Major League Baseball's history. The surprise comes into play when we pick out a certain player, such as Joltin' Joe or Sisler, to have the specific streak.
Further Reading Arbesman, S., and S. Strogatz. 2008. A journey to baseball's alternative universe. The New York Times. March 30. Berry, S. 1991. The summer of '41: A probability analysis of DiMaggio's streak and Williams' average of .406. CHANCE 4(4):8–11. Chance, D. M. 2009. What are the odds? Another look at DiMaggio's streak. CHANCE 22(2):33–42. Gould, S. J. 1989. The streak of streaks. CHANCE 2(2):10–16. McCotter, T. 2008. Hitting streaks don't obey your rules: Evidence that hitting streaks aren't just byproduct of random variation. Baseball Research Journal 37:62–70. Rockoff, D. M., and P. A. Yates. 2009. Chasing DiMaggio: Streaks in simulated seasons using non-constant at-bats. Journal of Quantitative Analysis in Sports 5(2), Article 4. www.bepress.com/jqas/vol5/iss2/4. Short, T., and L. Wasserman. 1989. Should we be surprised by the streak of streaks? CHANCE 2(2):13. Warrack, G. 1995. The great streak. CHANCE 8(3):41–43, 60.
An Ancient Sampling Technique: Flawed, Surprisingly Good, or Optimal? Harry Zvi Davis, Hershey H. Friedman, and Jianming Ye
About 1,800 years ago, there was a discussion regarding establishing the legal volume of an egg, given that different eggs have different volumes. Both participants in the discussion assumed the ideal measurement was to take the average of the largest and smallest eggs. At first blush, this measure is seriously flawed. In their book Essential Statistics in Business and Economics, D. P. Doane and L. E. Seward said, "It is easy to calculate, but is not a robust measure of central tendency, because it is sensitive to extreme data values." However, upon analysis, it turns out the measure is unexpectedly good, and, for the ancient times and circumstances, may even have been optimal.

The Mishna, originally an ancient oral tradition, was compiled into written form and edited about 1,800 years ago by Rabbi Judah Hanasi (the Prince). The egg of a chicken has implications in laws dealing with ritual purity, an important issue in ancient times, since sacrifices and certain tithes had to be ritually pure. The Mishna in Keilim (17:6), a tractate that deals with ritual purity, discusses the establishment of the legal volume of an egg for purposes of ritual purity. It is important to realize that, 1,800 years ago, there were no universally standardized measures to which one could refer. Thus, egg volume could either be linked to a known measurement or a sampling procedure for establishing the volume could be provided. According to Rabbi Yehuda, one should presumably go to the market, and in the market (which represents a batch
Table 1—Bias, Variance, and Mean Squared Error for Estimating Average Egg Size Based on the Average of the Largest and Smallest Eggs for Different Batch Sizes When the Egg Volumes Follow Four Statistical Distributions

Batch Size (N)    Chisq (6 df)                  T-dist (3 df)                 Normal                        T-dist (10 df)
                  Bias    Variance   MSE        Bias   Variance   MSE         Bias   Variance   MSE         Bias   Variance   MSE
20                0.514   0.237      0.501      0      1.193      1.193       0      0.142      0.142       0      0.269      0.269
50                0.793   0.218      0.847      0      1.970      1.970       0      0.109      0.109       0      0.264      0.264
100               1.008   0.208      1.223      0      3.121      3.121       0      0.093      0.093       0      0.278      0.278
500               1.527   0.193      2.523      0      9.356      9.356       0      0.069      0.069       0      0.325      0.325
1000              1.753   0.189      3.261      0      13.903     13.903      0      0.062      0.062       0      0.347      0.347
of eggs), choose the "largest of the largest" egg and the "smallest of the smallest" egg, and then calculate the average volume of the two eggs. As explained in the Tosefta (Keilim 2:6:4)—a separate collection of statements of the Tannaitic sages that was compiled at about the same time as the Mishna—this is done by placing the two eggs into a vessel filled to the brim with water. After the water overflows, the eggs are replaced with items that do not absorb water until the vessel is again full to the brim. The egg volume is defined as half of the volume of the items placed in the vessel. Conceptually, Rabbi Yehuda was taking the midrange (average) of the maximum value and the minimum value.

Rabbi Yosi fundamentally agreed with Rabbi Yehuda's methodology; however, his problem was that the largest and smallest eggs in the market may not be the largest and smallest eggs in the population. Therefore, according to Rabbi Yosi, one should use the median egg in the market. Both Rabbi Yehuda and Rabbi Yosi were comfortable with using two extreme values to calculate a measure of central tendency. However, the two extreme values can be a poor estimator, and with skewed distributions, their approach is a biased estimate of both the mean and the median.

For consistency, assume the mean of egg volumes is 4 and the variance is 1. To produce mean 4 and variance 1, if X is the random variable, the following transformed variables Y have mean 4 and variance 1: X is chi-square (6) and Y = (X − 6)/sqrt(12) + 4; X is Normal(0,1) and Y = X + 4; X is t(10) and Y = X/sqrt(1.25) + 4; and X is t(3) and Y = X/sqrt(3) + 4. Given a batch of N eggs, Table 1 and Figure 1 give the expected bias, variance, and mean square error in estimating the population mean with the midrange estimate (using a Monte Carlo simulation of 10,000 runs) for four distributions.

For a chi-square distribution with six degrees of freedom (df), which is skewed to the right, the midrange estimate is a flawed measure. The measure is biased, and the larger the batch size, the larger the bias. Thus, increasing the batch size only increases the mean square error. For a t distribution with three df, there is no bias, because the distribution is symmetric, but the mean square error is much larger than for the chi-square distribution with six df. This would be equivalent to sampling the heaviest and lightest people in a large group of adult males. Someone living in a community with the heaviest
person alive—Manuel Uribe of Mexico once weighed 1,320 pounds—will calculate, by averaging the heaviest and lightest observations, an average weight of more than 660 pounds. The larger the batch size, the worse the estimator does.
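A minimal Monte Carlo sketch in the spirit of Table 1 (not the authors' code; 10,000 replications and the same mean-4, variance-1 scalings described above, with scale factor sqrt(10/8) for the t(10) case):

```python
import numpy as np

rng = np.random.default_rng(0)
reps, mean_vol = 10_000, 4.0

def draw_batch(dist, n):
    """Egg volumes scaled to mean 4 and variance 1, as in the text."""
    if dist == "chisq6":
        return (rng.chisquare(6, n) - 6) / np.sqrt(12) + mean_vol
    if dist == "t3":
        return rng.standard_t(3, n) / np.sqrt(3) + mean_vol
    if dist == "normal":
        return rng.normal(mean_vol, 1, n)
    if dist == "t10":
        return rng.standard_t(10, n) / np.sqrt(10 / 8) + mean_vol

for dist in ("chisq6", "t3", "normal", "t10"):
    for batch_size in (20, 100, 1000):
        mids = np.empty(reps)
        for i in range(reps):
            batch = draw_batch(dist, batch_size)
            mids[i] = (batch.max() + batch.min()) / 2   # midrange estimate
        bias = mids.mean() - mean_vol
        mse = np.mean((mids - mean_vol) ** 2)
        print(f"{dist:>7}, N={batch_size:>4}: bias={bias:+.3f}  MSE={mse:.3f}")
```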
Normal Distribution

But consider a normal distribution. Even with a small batch of 20, there is no bias, and the mean square error is only 0.14. As the batch size increases, the variance keeps decreasing. Analyzing a t distribution with 10 df (which has many outliers), the midrange estimate produces a mean square error of about 0.3. Even though the variance increases with an increase in the batch size, the mean square error does not change by much.

According to the authors of "The Influence of Body Size, Breeding Experience, and Environmental Variability on Egg Size in the Northern Fulmar," published in the Journal of Zoology, empirically, egg volume is normally distributed. And as C. J. Adams and D. D. Bell noted in their Journal of Applied Poultry Research article, researchers generally assume the size and weight of eggs are normally distributed and use regression and correlation to determine the factors that affect egg weight.

Assuming a normal distribution of egg volumes, we compare the efficiency of using the midrange estimate versus the mean of random samples. We then present the equivalent size of a random sample (without replacement) with the same accuracy as the midrange estimate (see Table 2). Surprisingly, even though the midrange uses only two eggs, if the batch (market) has 100 eggs, it is equivalent to a random sample of 11 eggs. If the batch size is 1,000, the midrange estimate is equivalent to a random sample of size 16.
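The "equivalent random sample size" column of Table 2 can be recovered directly: with population variance 1, the mean of n randomly sampled eggs has mean squared error 1/n, so a midrange MSE of m corresponds to n = 1/m. A quick check against a few table values (they agree up to Monte Carlo error):

```python
# Midrange MSE for a normal batch, taken from Table 2
midrange_mse = {100: 0.092, 500: 0.069, 1000: 0.064}

for batch_size, mse in midrange_mse.items():
    # With population variance 1, a random sample of n eggs has MSE 1/n,
    # so the equivalent random sample size is simply 1/MSE.
    print(f"batch of {batch_size:>4}: equivalent random sample of about {1 / mse:.1f} eggs")
```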
Sensitivity

How sensitive is the mean square error to picking the maximum and minimum eggs? What happens if there is a slight error in picking the maximum and minimum egg, and a very large and very small egg are picked instead? To simulate the process, we take a subsample of the 10% biggest eggs in the batch and choose a random egg from the subsample as the very large egg. Similarly, we take a subsample of the 10% smallest eggs in the batch and choose a random egg from the subsample as the very small egg. Table 3 presents the results.
Figure 1. Mean squared error as a function of batch size for estimating average egg size based on the average of the largest and smallest eggs when the egg volumes follow four statistical distributions
Table 2—Mean Squared Error as a Function of Batch Size and Equivalent Size of a Random Sample When Egg Volumes Follow a Normal Distribution (Mean of Two Extremes)

Batch Size (N)    MSE      Equivalent Random Sample Size
3                 0.362    2.76
4                 0.300    3.33
5                 0.261    3.83
10                0.187    5.39
20                0.146    6.84
50                0.109    9.18
100               0.092    10.89
500               0.069    14.55
1000              0.064    15.71
Table 3—Mean Squared Error and Equivalent Random Sample Size When Egg Volumes Follow a Normal Distribution for Four Estimation Methods

                  Mean of 2 Extremes     Mean of Random            Mean of 2 Largest       Mean of 2nd Largest
                                         Largest and Smallest      and 2 Smallest          and 2nd Smallest
Batch Size (N)    MSE     Equiv. n       MSE     Equiv. n          MSE     Equiv. n        MSE     Equiv. n
20                0.146   7.0            0.141   7.1               0.094   10.7            0.087   11.5
50                0.109   9.1            0.110   9.1               0.069   14.5            0.060   16.6
100               0.092   10.7           0.097   10.3              0.057   17.5            0.048   20.7
500               0.069   14.5           0.087   11.5              0.041   24.4            0.033   30.0
1000              0.064   16.1           0.086   11.7              0.037   27.2            0.030   33.8

(Equiv. n = equivalent random sample size)
Figure 2. Equivalent random sample size as a function of batch size when egg volumes follow a normal distribution for four estimation methods
Although there is a slight increase in the variance for larger batch sizes, the results are almost identical to the results of using the largest and smallest eggs.
Measuring Errors in Calculating the Mean

But there is a further consideration. For a random sample of, for example, 16 eggs, the displaced water must be divided into 16 equal parts. Conceptually, this is trivial. However, as any parent who has tried to equally divide an item of food among querulous children can testify, the task is not trivial operationally. Furthermore, the more parts into which the displaced water must be divided, the larger the measuring error. Thus, any efficiency gains in estimating the population mean by increasing the number of eggs in the sample may be more than offset by the measuring error in splitting the result into equal parts. It is possible that since the midrange method is superior to any sample size up to 16, using the midrange method (which requires a split into only two equal parts) may produce a better estimate of egg volume.
Optimality

Is it possible to find a different sampling technique, which would improve on the midrange? One suggestion is to use the two largest eggs and two smallest eggs. This reduces the variance by a third (see Table 3). But the eggs must now be split into four parts, rather than two. It is possible that the increase in the error in measuring the mean of the sample outweighs the decrease in the sample variance. But consider taking the second-
largest and the second-smallest eggs. Especially for the larger batch sizes, this reduces the variance by about half. Thus, if the goal is to use a sample of two eggs to estimate the average population egg size, using the second-largest and second-smallest eggs is the best of the alternatives examined here. With a batch size of 1,000, the sample is equivalent to choosing more than 33 random eggs (see Figure 2).
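A short extension of the earlier sketch compares three of the estimators from Table 3 for a normal batch of 1,000 (again an illustrative sketch, not the authors' code):

```python
import numpy as np

rng = np.random.default_rng(1)
reps, batch_size = 10_000, 1000

estimates = {"extremes": [], "2nd extremes": [], "2 largest + 2 smallest": []}
for _ in range(reps):
    b = np.sort(rng.normal(4, 1, batch_size))            # one batch of egg volumes
    estimates["extremes"].append((b[0] + b[-1]) / 2)
    estimates["2nd extremes"].append((b[1] + b[-2]) / 2)
    estimates["2 largest + 2 smallest"].append((b[:2].sum() + b[-2:].sum()) / 4)

for name, vals in estimates.items():
    mse = np.mean((np.array(vals) - 4) ** 2)
    print(f"{name:>24}: MSE = {mse:.3f}, equivalent n = {1 / mse:.1f}")
```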
Further Reading
Abanikannda, O. T. F., A. O. Leigh, and L. A. Ajay. 2007. Statistical modeling of egg weight and egg dimensions in commercial layers. International Journal of Poultry Science 6(1):59–63.
Adams, C. J., and D. D. Bell. 1998. A model relating egg weight and distribution to age of hen and season. Journal of Applied Poultry Research 7:35–44.
Doane, D. P., and L. E. Seward. 2008. Essential statistics in business and economics. New York: McGraw-Hill.
Huber, Peter J. 1981. Robust statistics. New Jersey: Wiley.
Mishna, Soncino English translation, www.halakhah.com/pdf/taharoth/Kelim.pdf.
Pascale, M., J. C. Ollason, V. Grosbois, and P. M. Thompson. 2003. The influence of body size, breeding experience, and environmental variability on egg size in the northern fulmar. Journal of Zoology 261:427–432.
Wikipedia, Manuel Uribe, March 24, 2010.
Jelly Belly Mixing and Uniform Multinomial Distribution
Robert E. Burks
Jelly Belly gourmet jellybeans have not always been the household name they are today. The product of Fairfield, California's Jelly Belly Candy Company first gained national recognition during Ronald Reagan's presidential inauguration in 1981. Reagan ordered more than three tons (2.4 million beans) of the gourmet jellybeans to help celebrate his inauguration and propelled the beans into the national spotlight. In honor of the event, the company created the blueberry-flavored jellybean so Reagan could serve red, white, and blue jellybeans at his parties; it remains a favorite today. The gourmet beans retained a place of prominence on both the president's desk and Air Force One throughout Reagan's presidency, and even rode into space on the Challenger in 1983. The Jelly Belly flavor lineup has grown from its humble start of eight flavors in 1976 (i.e., very cherry, root beer, cream soda, tangerine, green apple, lemon, licorice, and grape) to today's lineup of 50 official flavors. To satisfy global demand, the company produces some two million beans an hour, or approximately 14 billion beans (17,500 tons) annually. Currently, it ships beans to all 50 states and many countries.
What Happened to the Blueberry Muffins?
One recent weekend, my wife and daughter were enjoying a bag of Jelly Belly 40 Flavors. As it turns out, one of their favorite activities is creating new flavors by eating more than one flavored bean at a time. For example, eating two chocolate fudge beans and one toasted marshmallow bean together produces a chocolate mousse flavor. In terms of potential
Table 1—Flavor Combinations Possible by Combining Flavors of 40 Jellybeans

Number of Beans    Number of Combinations
1                  40
2                  780
3                  11,480
4                  123,410
5                  1,086,008
combinations, a 40-flavor bag provides almost an endless set of culinary possibilities, not to mention a great opportunity for a combinatorics class. Since eating order does not matter and repetition is encouraged, when you combine just three flavors, as Table 1 shows, there are 11,480 potential combinations of new flavors. One of my daughter's favorite creations is blueberry muffin, a combination of two blueberry beans and a buttered popcorn bean. At some point during the afternoon, my daughter commented that she was ripped off, since she wasn't able to eat a single blueberry muffin because there were no buttered popcorn–flavored beans in the bag. The bag should have contained approximately 210 beans (six suggested servings × 35 beans per serving) and, assuming a uniform distribution of the 40 flavors, one would have expected 5.25 buttered popcorn beans in the bag. Surely, my daughter should have been able to make one blueberry muffin combination. To investigate further, I contacted the Jelly Belly Candy Company and asked how they mix their flavors. In making the 40-flavor mix, the company takes 25-pound component cases for each of the individual 40 flavors and pours them into a mixing hopper (1,000 pounds total). As the batch is mixed,
the beans are placed back into 25-pound mixed component cases. These component cases are eventually placed back into the hopper and weighed out to make the appropriate package size. Based on this information, if the mixing is done properly, then the content of the packages should be approximately uniformly distributed.
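The counts in Table 1 follow the combinations-with-repetition ("stars and bars") formula C(40 + k − 1, k) for k beans chosen from 40 flavors; for example, C(42, 3) = 11,480. (For two beans the formula gives 820; Table 1's 780 counts pairs of two different flavors.) A quick check:

```python
from math import comb

flavors = 40
for k in range(1, 6):
    # combinations with repetition ("stars and bars"): C(flavors + k - 1, k)
    print(k, comb(flavors + k - 1, k))
```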
An Edible Experiment
My daughter's bag, with no buttered popcorn–flavored jellybeans, appeared to be abnormal. My daughter, always eager to eat more jellybeans, was more than happy to help me in an experiment. The idea of an edible experiment caught on with my other children, who eagerly supported the concept of the more-data-is-better principle; obviously, they are all on their way to becoming statistics professionals. To support this experiment, I bought approximately four pounds of beans, consisting of four 6.25 oz. bags and four 9.0 oz. bags from several locations. Following Ron Fricker's rationale in the Mysterious Case of the Blue M&M's, my objective was to produce four random one-pound (really 15.25 oz.) samples with different manufacturing lots. The eight bags were randomly assigned to four samples: one 6.25 oz. and one 9.0 oz. bag per sample. My methodology was to sort each sample separately, dividing the beans into groups by their flavors, while reminding my six-year-old not to eat the beans. After verifying flavor separation, my daughter re-verified the information and entered the data into an Excel spreadsheet. With three independent reviews of the data, the accuracy of the count of flavors by sample was reasonably certain. Table 2 provides the counts for each group by flavor and the aggregate total for all four samples. Looking at Table 2, the consistency of sample sizes is striking. Also, all samples exceeded the company's suggested total serving size. The package indicates a serving size is 35 beans and, coupled with the 10.5 suggested servings (4.5 servings for the 6.25 oz. bag and six servings for the 9 oz. bag), there should be approximately 367.5 beans per sample. However, the distribution does not appear to be uniform, and some flavors appear more popular. A bar chart of flavors by sample confirms this observation. The chart can be viewed at http://chance.amstat.org/category/supplemental. Overall, very cherry appears to be the most populous flavor, while juicy pear is the least populous. A review of Table 2 clearly demonstrates a discrepancy between the samples and the company's expected uniform distribution. Although about half of the flavors matched up reasonably well, there were several flavors—including very cherry, juicy pear, and, of course, buttered popcorn—that appeared out of sorts.
Statistical Analysis: One Flavor at a Time
Under the hypothesis that the sample was generated according to the company's described distribution, one can compute the probability of observing 65 or more very cherry–flavored jellybeans. Simplifying the flavors into two groups—very cherry and not very cherry—the probability distribution for the number of very cherry beans out of 1,531 is binomial with n = 1,531 independent trials and a probability of p = 0.025 of a single bean being very cherry. The probability of seeing x or more very cherry beans out of 1,531 is the sum from x to 1,531
Table 2—Counts of Jellybeans by Flavor for Four Samples

Jelly Belly Flavor        A     B     C     D     Total
Very Cherry              19    16    16    14      65
Top Banana               13    12    15    14      54
Buttered Popcorn         10    16    13     9      48
Strawberry Jam           14    14     9    11      48
Caramel Apple             6    11    13    17      47
Cream Soda               11    10    12    14      47
Orange Sherbert          13    15     7    12      47
Plum                     14    12    11    10      47
Dr. Pepper               12    16     6    12      46
Tutti-Frutti             10    13    10    13      46
Blueberry                14    14     7     8      43
Piña Colada              11    13     9    10      43
Coconut                   8     9    10    14      41
Peach                     8    12    13     8      41
Toasted Marshmallow      11    10     7    13      41
Root Beer                 7    12     8    13      40
Crushed Pineapple        10    12    12     5      39
French Vanilla            4     8    16    10      38
Strawberry Cheesecake    14     6     7    11      38
Watermelon                6    12    11     9      38
Island Punch             14     7     6    10      37
Margarita                 9    12     8     8      37
Red Apple                10     8    10     9      37
Caramel Corn             10     8     7    10      35
Chocolate Pudding         5     7    11    12      35
Lemon Lime               12     8     7     8      35
Raspberry                15    11     2     7      35
Cotton Candy              6     8    11     9      34
Lemon                    10     4    12     8      34
Orange Juice              6     5    11    12      34
Licorice                  8    12     7     6      33
Green Apple              12     5    10     5      32
Berry Blue                4    12     9     6      31
Strawberry Daiquiri       5     6     5    14      30
Tangerine                10     2    13     5      30
Kiwi                     10     4    10     4      28
Cappuccino                4     7    11     5      27
Sizzling Cinnamon         3     6     7    10      26
Bubble Gum                4     5     8     6      23
Juicy Pear                7     1     8     5      21
Total                   379   381   385   386   1,531
Table 3—Tests of Whether the Mean Number of Beans per Flavor per Sample Is 9.19

Jelly Belly Flavor        Mean    Standard Deviation   Calculated t    p-t     p-binomial
Very Cherry              15.61    2.10                  5.29           0.013   0.000
Top Banana               12.96    1.16                  5.64           0.011   0.006
Buttered Popcorn         11.53    3.06                  1.32           0.278   0.051
Strawberry Jam           11.54    2.44                  1.67           0.194   0.051
Caramel Apple            11.26    4.31                  0.83           0.466   0.069
Cream Soda               11.27    1.56                  2.32           0.103   0.069
Orange Sherbet           11.30    3.32                  1.10           0.352   0.069
Plum                     11.29    1.74                  2.10           0.127   0.069
Dr. Pepper               11.06    4.00                  0.81           0.478   0.092
Tutti-Frutti             11.04    1.64                  1.96           0.146   0.092
Blueberry                10.34    3.71                  0.54           0.627   0.194
Piña Colada              10.33    1.70                  1.16           0.329   0.194
Coconut                   9.83    2.45                  0.45           0.681   0.292
Peach                     9.84    2.51                  0.45           0.683   0.292
Toasted Marshmallow       9.84    2.39                  0.47           0.667   0.292
Root Beer                 9.59    2.79                  0.25           0.817   0.350
Crushed Pineapple         9.37    3.19                  0.10           0.927   0.411
French Vanilla            9.10    4.74                 -0.03           0.976   0.475
Strawberry Cheesecake     9.13    3.59                 -0.03           0.979   0.475
Watermelon                9.12    2.53                 -0.05           0.964   0.475
Island Punch              8.89    3.51                 -0.15           0.894   0.460
Margarita                 8.89    1.86                 -0.28           0.799   0.460
Red Apple                 8.88    0.92                 -0.57           0.607   0.460
Caramel Corn              8.40    1.46                 -0.93           0.420   0.332
Chocolate Pudding         8.38    3.10                 -0.45           0.683   0.332
Lemon Lime                8.41    2.20                 -0.61           0.585   0.332
Raspberry                 8.43    5.41                 -0.24           0.824   0.332
Cotton Candy              8.15    1.94                 -0.92           0.424   0.274
Lemon                     8.16    3.27                 -0.55           0.623   0.274
Orange Juice              8.14    3.30                 -0.55           0.622   0.274
Licorice                  7.93    2.57                 -0.85           0.459   0.220
Green Apple               7.69    3.46                 -0.75           0.508   0.173
Berry Blue                7.44    3.37                 -0.90           0.435   0.132
Strawberry Daiquiri       7.18    4.12                 -0.84           0.462   0.098
Tangerine                 7.20    4.73                 -0.73           0.519   0.098
Kiwi                      6.73    3.34                 -1.28           0.292   0.050
Cappuccino                6.47    2.94                 -1.60           0.208   0.034
Sizzling Cinnamon         6.22    2.72                 -1.88           0.156   0.022
Bubble Gum                5.51    1.60                 -3.97           0.028   0.005
Juicy Pear                5.04    2.97                 -2.42           0.094   0.002
Table 4—Number of Beans by Sample and Bag Size

Sample                        A                 B                 C                 D
Bag size (oz.)            6.25    9.00      6.25    9.00      6.25    9.00      6.25    9.00
Total Beans                153     226       156     225       157     228       159     227
Expected Beans           157.50  210.00    157.50  210.00    157.50  210.00    157.50  210.00
Delta (Total - Expected)  -4.50   16.00     -1.50   15.00     -0.50   18.00      1.50   17.00
of the binomial probabilities at those amounts. Invoking the normal approximation to the binomial gives

$$P(X \ge 65) \approx 1 - \Phi\!\left(\frac{64.5 - 1531(0.025)}{\sqrt{1531(0.025)(0.975)}}\right) = 1 - \Phi(4.29) < 0.0001.$$

Here, the continuity correction makes little difference in the substantive results. Following the company's assertion that the flavors were equally mixed according to a binomial distribution with p = 0.025, observing 65 or more very cherry–flavored beans would rarely occur. Similarly, seeing 21 or fewer juicy pear–flavored jellybeans out of 1,531 jellybeans,

$$P(X \le 21) \approx \Phi\!\left(\frac{21.5 - 1531(0.025)}{\sqrt{1531(0.025)(0.975)}}\right) = \Phi(-2.75) \approx 0.003,$$

would rarely occur. The last column of Table 3 gives the p-values for all 40 flavors based on the binomial test.
If the company did fill the sample bags in accordance with a uniform distribution, one would expect an average of 9.1875 (= 367.5/40) beans of each flavor in a standard sample (one standard 6.25 oz. bag and a standard 9.0 oz. bag). The samples have sizes of 379, 381, 385, and 386, which are larger than standard. To test a hypothesis about the average count of beans, one can multiply the observed counts by a ratio equal to 367.5/n to standardize the count per sample to 367.5. Once that is done, one can test the hypothesis that the mean number is 9.1875 versus not, using a t-test with four observations per flavor. The calculated t-test statistic reported in Table 3 is given by

$$t = \frac{\bar{x} - 9.1875}{s/\sqrt{4}},$$

where $\bar{x}$ and s respectively represent the sample mean and standard deviation. The p-value in the final column provides the probability of observing a test statistic as extreme or more extreme than the observed test statistic value given that the null hypothesis is true. There were clearly multiple flavors that, on average, had significantly more than the expected 9.19 beans per bag and a couple that, on average, were significantly short of this mark.
A closer look at the differently sized bags of the sample provides additional evidence of some discrepancies. Also, it seems to indicate that if you want your money's worth, buy the larger bag. Table 4 shows that the 6.25 oz. bags were filled close to the suggested level of 157.5 beans, but the 9 oz. bags were overfilled in each sample. A review of the eight individual bags of jellybeans that comprise the four samples also shows that six of the eight bags contained fewer than the 40 flavors. In fact, three of the eight bags contained only 38 flavors. Altogether, the number of beans for an individual flavor ranged from 0 to 12 beans, with a standard deviation of 2.6 beans. Clearly, there is a problem with the samples and the company's claim of how the beans are mixed. Three flavors show significant discrepancies from uniform using this approach. Very cherry and top banana appear much more frequently than chance would suggest, whereas bubble gum appears much less frequently. This test is not very powerful because there are only four observations for each flavor, so other apparent departures do not appear significant.
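A sketch of both calculations for a single flavor using SciPy (assumed available); it uses the Table 2 counts and the 9.1875 null mean from the text, so small differences from the published Table 3 values may remain:

```python
import numpy as np
from scipy import stats

n_beans, p = 1531, 1 / 40                 # all beans in the four samples, equal-mix probability
mu, sd = n_beans * p, np.sqrt(n_beans * p * (1 - p))

# Binomial tail probabilities, exact and via the normal approximation with a
# continuity correction (the basis of the p-binomial column of Table 3).
print("P(X >= 65):", stats.binom.sf(64, n_beans, p), stats.norm.sf((64.5 - mu) / sd))
print("P(X <= 21):", stats.binom.cdf(21, n_beans, p), stats.norm.cdf((21.5 - mu) / sd))

# One-sample t-test that the mean standardized count is 9.1875, illustrated with
# the very cherry counts from Table 2 rescaled to a 367.5-bean sample.
counts = np.array([19, 16, 16, 14])       # very cherry counts in samples A-D
sizes = np.array([379, 381, 385, 386])    # total beans in samples A-D
standardized = counts * 367.5 / sizes
print("t-test:", stats.ttest_1samp(standardized, 9.1875))
```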
As with the separate binomial tests, one can worry about whether the largest or smallest statistics are truly significant, since they were in fact chosen because they are the extremes. That is, there is a multiple testing issue. If you conduct 40 independent hypothesis tests at the alpha = 0.05 level, then you would expect, on average, two of them to be significant just by chance. One might further worry about the assumptions of the t-test: Are the four measurements reasonably normally distributed? Perhaps one would prefer a test based on ranks that would be less sensitive to assumptions.
Statistical Analysis: 40 Flavors at Once
Of course, conditional on the number of beans in a bag, when one flavor is under-represented in a bag, another flavor has to be over-represented. Is the overall distribution of counts in the four samples significantly different from uniform? A chi-squared test can be used to answer this question. The chi-squared statistic is a function of the observed counts ($o_i$) in each cell (for each flavor) and the expected counts ($e_i$) in each of the 40 cells. The expected values under a uniform distribution are $e_i = n(1/40)$, where n is the number of beans. The chi-squared statistic is

$$\chi^2 = \sum_{i=1}^{40} \frac{(o_i - e_i)^2}{e_i}$$

and is compared to a chi-squared distribution with 39 (40 − 1) degrees of freedom. Large values of the statistic suggest the distribution is not uniform. For samples A, B, C, and D, the statistic values are 57.6, 63.0, 37.1, and 42.6, which yield p-values of 0.028, 0.009, 0.555, and 0.319. When the samples are combined, the statistic is 76.3, which gives a p-value of 0.0003. One can conclude that, although some individual samples are not significantly different from a uniform distribution of flavors, the distribution appears to be non-uniform overall.
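As a check, the combined statistic and p-value can be reproduced from the Table 2 totals; a minimal SciPy sketch (not the author's code):

```python
import numpy as np
from scipy import stats

# Combined counts for the 40 flavors (the "Total" column of Table 2).
totals = np.array([65, 54, 48, 48, 47, 47, 47, 47, 46, 46,
                   43, 43, 41, 41, 41, 40, 39, 38, 38, 38,
                   37, 37, 37, 35, 35, 35, 35, 34, 34, 34,
                   33, 32, 31, 30, 30, 28, 27, 26, 23, 21])

expected = totals.sum() / 40                       # 1,531/40 = 38.275 beans per flavor
chi2 = float(np.sum((totals - expected) ** 2 / expected))
p_value = stats.chi2.sf(chi2, df=40 - 1)
print(f"combined chi-squared = {chi2:.1f}, p = {p_value:.4f}")   # about 76.3 and 0.0003

# The same test in one call (uniform expected counts are the default).
print(stats.chisquare(totals))
```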
Further Evidence
A review of the buttered popcorn flavor shows a moderate significance of there being more than 9.19 beans per bag. Clearly, this distribution does not appear to match the company's mixing claim, but what about my daughter's claim that she did not receive any buttered popcorn–flavored beans in her bag? Invoking the normal approximation to the binomial once again and assuming she had a standard bag with 210 beans, the probability of zero buttered popcorn beans in a sample is p ≈ 0.0049. In other words, this occurrence should be a rare event. Armed with this clear discrepancy, I contacted the company again to inquire about their mixing process. According to the company, the original batch of mixed flavors is placed in the hopper to begin filling the individual bags to the appropriate weight. However, as time goes by and they run out of mixed component cases, the process is to add individual 25-pound component (not mixed) flavors to the hopper to keep it full. Therefore, the initial bags of any production run may be uniformly distributed, but as time passes, the fill process may deviate from uniformity. The process of keeping the hopper filled after using all mixed cases makes sense when you consider the volume of beans the company must produce each day to meet demand. However, it is curious that the most populous flavors in my samples—very cherry, top banana, and buttered popcorn—tended to top the list of favorite flavors on several informal online surveys. An example of an informal online survey can be found at answers.yahoo.com/question/index?qid=20100117084208AASO6zS. Could it be that the process of loading component cases into the hopper favors these fan-favorite flavors? This investigation provided several insights. First, a good dose of skepticism is always prudent when reading information presented as fact, and simple classroom statistics can be leveraged to examine such claims. This experience also illustrates that it is perfectly acceptable to question and seek additional information when the facts do not seem to mesh with your analysis. Second, don't be too disappointed if you cannot create your favorite combination, or even get your favorite flavor, from a bag of the gourmet beans. Rest assured there is not some company conspiracy to deny you your favorite flavor. You just need to sample another bag.
Tackling the Chart: Two-Point Conversions and Team Differences in Football
Michael Wenz and Joren Skugrud
On October 16, 2005, the New York Giants scored a touchdown and reflexively followed convention by kicking the extra point to pull into a 10-10 tie with 19 seconds left in their game against the Dallas Cowboys. An ESPN.com recap titled "Cowboys Pull Through the Muck to Edge Giants" says about the game, "[It] was over when the Cowboys won the coin toss before overtime. The Giants defense was on the field for nearly 37 minutes during regulation and the heat in Texas Stadium was starting to take its toll." In fact, the Cowboys kicked a field goal on their first drive in overtime for the 13–10 victory. By following the conventional approach, did the Giants miss their best chance to win when they kicked the extra point? Rigorous analysis of more than a decade of National Collegiate Athletic Association (NCAA) and National Football League (NFL) football games shows the conventional wisdom is correct more often than not, but there is still a substantial number of cases in which blindly kicking the extra point is a mistake. Different teams have different optimal strategies. In particular, teams with successful running attacks convert a higher fraction of their two-point conversions, and some of them would increase their win probability by attempting the two-point conversion. In practice, coaches systematically deviate from the optimal strategy, attempting far too many one-point conversions and choosing the wrong
Figure 1. This is a card once carried by Winona State University head coach Tom Sawyer, complete with advertising on the back. The chart gives coaches an easy reference to decide on the optimal strategy for attempting a one-point or two-point conversion after scoring a touchdown, based on the score of the game.
times to go for two. In the high-stakes world of NCAA and NFL football, even a small increase in win probability carries a large potential gain.
The Chart
The conventional wisdom is encapsulated on "the chart," a system so widely used that it is distributed to coaches as promotional junk mail. Figure 1 shows a card once carried by Winona State University head coach Tom Sawyer, complete with advertising on the back. The chart gives coaches an easy reference to decide on the optimal strategy for attempting a one-point or two-point conversion after scoring a touchdown, based on the score of the game. Following a loss that involved two missed two-point conversions, Pittsburgh Steelers coach Mike Tomlin told the media, "Playing the charts, that's not out of bounds … everybody has the chart." In fact, the chart loosely approximates the optimal strategy, but it is imperfect. In a 2000 CHANCE article, Harold Sackrowitz uses a linear programming approach to show that the optimal decision varies depending
on the number of possessions remaining in the game. J. Denbigh Starkey showed in a 2005 essay that the optimal strategy following a touchdown that cuts a 14-point lead to eight is to attempt a two-point conversion immediately. The jumping-off point here is to consider how the optimal conversion decision may be influenced by differences in team conversion probabilities. Should different teams have different charts, and even different charts for different opponents?
Modeling Optimal Strategy Choices
Teams that score a touchdown to cut a seven-point lead to one near the end of a game can essentially play for the tie (kick) or play for the win (go for two). In practice, coaches nearly always kick. From 1994 to 2006 in the NFL and from 1997 to 2006 in NCAA Division I-A, there were 228 games in which a team scored a touchdown with fewer than two minutes in regulation to cut a seven-point lead down to one. And in 222 of those games, they kicked the extra point.
During this period, kicks were successful 99% of the time in the NFL and 96% of the time in NCAA Division I-A games, while two-point conversions were successful 45% of the time in the NFL and 41% of the time in the NCAA games. The expected value of a two-point try is thus lower than that of a one-point try for the average team, but this may not be true for all teams. Let Team X be the team that scores to pull within one point before the conversion and Team Y be the team in the lead. Calculating Team X's win probability under different strategy choices requires an estimate of four probabilities:
A = Probability of a successful one-point try by Team X
B = Probability of a successful two-point try by Team X
C = Probability Team X prevents Team Y from scoring on its next possession
D = Probability Team X will win an overtime game
Table 1—Two-Point Conversion Probit Model Estimates

                    NFL Offense                    NFL Defense                    NCAA I-A Offense               NCAA I-A Defense
                    Estimate          Marg. Eff.   Estimate          Marg. Eff.   Estimate          Marg. Eff.   Estimate          Marg. Eff.
Rush YPC            0.1603* (0.085)    0.063       0.0672 (0.096)     0.026       0.0761** (0.033)   0.030       0.0659* (0.036)    0.025
Completion Rate    -0.6059 (1.00)     -0.238      -1.6099 (1.18)     -0.638       0.1554 (0.417)    -0.061      -0.3596 (0.525)     0.140
Win Percentage      0.4803** (0.22)    0.189      -0.1010 (0.210)    -0.040       0.1657 (0.122)    -0.064      -0.1741 (0.124)     0.067
Constant           -0.6261 (0.639)                 0.6241 (0.7553)                -0.667 (0.248)                -0.170 (0.330)
N                   1151                           1139                           2992                           2967

Standard deviations in parentheses. *Statistically significant at a 90% confidence level. **Statistically significant at a 95% confidence level.
The win probability with a one-point try is equal to A*D*C, and the win probability with a two-point try is equal to B*C, so the optimal strategy is to attempt a two-point conversion if B > A*D. There are at least two challenges in estimating probabilities C and D. First, it is possible that Team Y may be more conservative on their subsequent possession in a tie game than in a game they trail by one point. This may slightly decrease the benefit of attempting a two-point conversion. Second, it is possible that Team X may be able to recover an onside kick, which would reduce some of the risk associated with a failed two-point conversion. These effects might largely offset each other and could be small in magnitude in any case.
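A tiny sketch of this comparison, using the league-average rates quoted in the text and a 50% overtime win probability assumed for an average team:

```python
def go_for_two(a, b, d):
    """True when the two-point try maximizes Team X's win probability.

    a: P(Team X makes the one-point kick)
    b: P(Team X makes the two-point conversion)
    d: P(Team X wins in overtime)
    The common factor C (stopping Team Y on its next possession) cancels,
    so the comparison reduces to b versus a * d.
    """
    return b > a * d

# League-average NFL rates from the text: kick, because 0.45 < 0.99 * 0.50.
print(go_for_two(a=0.99, b=0.45, d=0.50))    # False
# The 2005 Giants' model-predicted 52.1% two-point probability flips the call.
print(go_for_two(a=0.99, b=0.521, d=0.50))   # True
```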
Predicting Success Rates
NFL and NCAA I-A teams average fewer than three two-point attempts per season, so there is not much information contained in two-point conversion rates of a particular team in a particular season. In aggregate, though, it may be possible to find characteristics across teams that are
associated with high success rates. STATS, LLC provided data on NFL teams from 1994–2006 and NCAA Division I-A teams from 1997–2006. 1994 was the first year of the two-point conversion in the NFL, and 1997 was the first year of universal usage of overtime in NCAA Division I-A. Probit regression was used to calculate the probability of a successful two-point try. William H. Greene's Econometric Analysis provides a technical discussion of probit models. Team characteristics thought to influence the probability of success at converting two-point attempts were drawn from three broad categories: rushing proficiency, passing proficiency, and overall team quality. Rushing proficiency measures tested for inclusion in the model were yards per carry and yards per game; passing measures included completion percentage, yards per completion, yards per game, and quarterback rating; and team quality measures included point differential, points scored (offense), points allowed (defense), and team winning percentage. There is a high degree of multicollinearity for measures within each of the categories, so one variable was chosen from each category for
inclusion in the final model specification. The variables were chosen to maximize the model's fit based on the pseudo-R² statistic. The final specification included rushing yards per carry, completion percentage, and team winning percentage. A consistent pattern appeared in all specifications: Teams that can run the football well convert two-point attempts at a statistically significantly higher rate than teams that do not. Passing proficiency did not matter statistically in either league, nor did team quality in the NCAA. The quality of the defense mattered in NCAA games, as those who could stop the run were also good at stopping two-point conversions. In the NFL, teams that had high winning percentages were also good at converting two-point attempts. Included in Table 1 are estimated marginal effects, which measure the increase in two-point conversion probability associated with a change in the explanatory variable. For instance, the marginal effect for rushing yards in NFL games is 0.063. This means a one-yard increase in average rushing yards per carry is associated with an increase of 6.3 percentage points in
Table 2—Sample Means and Standard Deviations

                Rush YPC       Completion Percentage   Winning Percentage
NFL Offense     4.00 (0.46)    58.2% (0.041)           0.500 (0.185)
NFL Defense     4.00 (0.42)    58.3% (0.034)           0.500 (0.185)
NCAA Offense    3.90 (0.79)    55.2% (0.059)           0.507 (0.227)
NCAA Defense    3.856 (0.76)   55.6% (0.048)           0.507 (0.227)

Source: STATS LLC 2007. Used with permission. Standard deviations in parentheses.
Table 3—NFL Recommended Choice vs. Actual Choice Comparisons (Results for Scoring Team)

                                   Two Point Recommended,    One Point Recommended,
                                   One Point Attempted       Two Point Attempted
All Games (W-L-T)                  8-17-0                    1-3-0
Comparison Games (W-L-T)           7-8-0                     1-3-0
Actual Win%                        .467                      .250
Chosen Strategy Prediction         .489                      .456
Recommended Strategy Prediction    .511                      .495
Difference                         +.022                     +.039
the probability that a team will be successful with their two-point conversion. Probit models are nonlinear, so the exact marginal effects depend on the exact rushing average and the other characteristics of the team. The marginal effects presented in Table 1 are the average marginal effects of all the teams that actually appear in the data set. Table 2 presents sample means and standard deviations for the NFL and NCAA Division I-A. The average NFL team rushed for 4.0 yards per carry and converted 44.7% of their twopoint attempts. Boosting the average to 5.0 yards suggests an increase in their conversion probability to 51.0%. The 2006 San Diego Chargers own the highest predicted probability of success at 54.5%. Marginal effects are smaller in college—3.0 percentage points per yard increase—but team-by-team variations in rushing proficiency are much larger. The 2006 West Virginia Mountaineers had the highest predicted probability of success at 53.2%.
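A minimal sketch of fitting such a probit model and computing average marginal effects with statsmodels, run on synthetic data since the STATS LLC attempt-level data are proprietary; the column names and data-generating numbers below are illustrative only, not the authors':

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Synthetic stand-in: one row per two-point try with the offense's season
# characteristics and a 0/1 outcome. Names and coefficients are illustrative.
rng = np.random.default_rng(2)
n = 1151
attempts = pd.DataFrame({
    "rush_ypc": rng.normal(4.0, 0.46, n),
    "completion_rate": rng.normal(0.582, 0.041, n),
    "win_pct": rng.uniform(0.0, 1.0, n),
})
latent = (-0.63 + 0.16 * attempts["rush_ypc"]
          - 0.61 * attempts["completion_rate"] + 0.48 * attempts["win_pct"])
attempts["converted"] = (rng.standard_normal(n) < latent).astype(int)

X = sm.add_constant(attempts[["rush_ypc", "completion_rate", "win_pct"]])
fit = sm.Probit(attempts["converted"], X).fit(disp=False)
print(fit.summary())
# Average marginal effects, the quantity reported in the Marginal Effect columns.
print(fit.get_margeff(at="overall").summary())
```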
Examining On-Field Results
Consider again the case of the New York Giants in their October 2005 matchup with the Dallas Cowboys. An average team would be comparing the average 44.7% success rate on conversions to the product of their overtime win probability (50%) and their probability of successfully kicking the extra point (99% in the NFL). Thus, most teams would correctly opt to kick, as 44.7% is less than 49.5%. The 2005 Giants, however, were one of the top rushing teams in the league. They ranked third in yards per carry at 4.71, and when combined with their other characteristics and the characteristics of the Cowboys defense, their model-predicted probability of success was 52.1%. Their optimal strategy would have been to attempt the two-point conversion. This would have increased their win probability by 2.6%, a significant amount in the high-stakes world of professional football. As it was, the Giants finished the 2005 regular season with a record of 11–5 and lost a first-round playoff game with the Carolina Panthers. An extra victory
On October 23, 1999, with his team clinging to a 28–27 lead and about one minute remaining in the football game, University of Illinois tailback Rocky Harvey burst through the University of Michigan line for a 54-yard touchdown run that looked like the exclamation point on a stirring 20-point comeback victory. Reflexively, the Illini kicked the extra point for an eight-point, 35–27 lead. According to the chart and conventional wisdom, kicking the extra point was the safe, obvious, and nearly universal strategy choice for this situation. But, as Tom Brady marched the Wolverines back the other way, the risk of taking the (almost) sure one point—rather than attempting the less certain two-point conversion—came into focus. A Michigan touchdown and two-point conversion would force overtime, but had the Illini been able to successfully complete their own two-point conversion, their 36–27 lead would have been all but insurmountable. Did the conventional wisdom lead the Illini to the right choice? A discussion of game situations like this and the interesting optimal strategy resulting from the model are discussed in the online supplement for this article, which can be found at http://chance.amstat.org. It turns out the Illini got it right—a one-point attempt had a 3.8 percentage point better chance of leading to a win than a two-point try if Michigan was able to respond with a touchdown of its own. But, it wasn't until the Illini intercepted not one, but two, Tom Brady passes, recovered a fumble, and allowed a safety—all in the last 10 seconds and all inside their own five-yard line—that they were able to escape with a 35–29 victory.
that season would have meant a first-round bye in the postseason. There were 103 games in the NFL and 125 in NCAA Division I-A in which a team scored a touchdown to cut the lead to one point, with the conversion forthcoming and fewer than two minutes remaining, over the period of analysis. NFL teams were successful in 96 of 97 one-point attempts and two of six two-point tries. NCAA I-A teams were successful in 111 of 114 one-point attempts and five of 11 two-point tries. The
optimal strategy in each of these games was computed based on the method outlined previously. Table 3 presents results for the NFL. The first observation is that there is significant deviation from the optimal strategy. The chart says to kick the extra point, and, in fact, that choice is made more often than is justified. The model agrees in 76 of 103 games, but kicking is chosen 97 times. In 25 games, teams chose to kick when going for two would have been the optimal strategy. Some of these
Table 4—NCAA Recommended Choice vs. Actual Choice Comparisons (Results for Scoring Team)

                                   Two Point Recommended,    One Point Recommended,
                                   One Point Attempted       Two Point Attempted
All Games (W-L-T)                  13-17-0                   3-6-0
Comparison Games (W-L-T)           8-12-0                    3-4-0
Actual Win%                        .400                      .429
Chosen Strategy Prediction         .480                      .429
Recommended Strategy Prediction    .527                      .483
Difference                         +.047                     +.054
games were decided by a subsequent score in regulation, so the final outcome turned out not to depend on the conversion decision. In the remaining 15 games, following the recommended strategy would have led to a 2.2% higher expected winning percentage. Of the six teams that attempted a two-point conversion, two followed the recommended strategy and four chose poorly. The four that should have kicked cost themselves 3.9% in expected winning percentage. Table 4 presents the same information for NCAA games. Again, there is significant deviation from the optimal strategy. Teams attempted 114 one-point conversions, rather than the recommended 93, and attempted 11 two-point conversions, rather than the recommended 32. Overall, teams pursued a suboptimal strategy in 39 of 125 cases. There were 30 cases of teams kicking when it was not the optimal strategy, and this cost them 4.7% in expected winning percentage. Nine teams attempted a two-point conversion when they should have played for overtime, costing themselves 5.4% in expected winning percentage. One possibility is that coaches have important information in real time that is not observed in the model, which allows them to make better choices. If this were the case, we should see coaches outperforming the model-predicted values of their actual choices. However, teams
that deviated from the optimal strategy actually did worse than expected. Between NCAA Division I-A and the NFL, there were 68 instances of suboptimal strategies being pursued. In 21 cases, the decision turned out to not matter due to a subsequent score in regulation. In the remaining 47 games, the team making the mistake amassed a record of 19–28. The strategy they chose should be expected to produce a record of about 22–25 in those games, while the recommended strategy would be expected to produce a record of 24–23. There are noteworthy games in which choosing the correct strategy would have increased the odds of a better outcome. The online supplement to this article, found at http://chance.amstat.org, includes a list of each of the games in which the team chose the suboptimal strategy and lost. Aside from the 2005 New York Giants game mentioned here, the 2002 Atlanta Falcons played the chart in their matchup with the Seattle Seahawks and kicked when they should have tried for two. A win would have led to a first-round playoff matchup with the 10–6 San Francisco 49ers instead of the 13–3 Green Bay Packers, though they did beat the Packers. In the college ranks, lowly Vanderbilt missed its best chance to knock off #15 Florida when they chose to kick instead of go for two in their 2002 matchup. Similarly, North Carolina State played it too safe when they had an opportunity to knock off #3 Ohio State early in the 2003 season.
Final Remarks
There are useful extensions of this analysis. First, it would be a constructive exercise to recreate Sackrowitz's conversion chart after adjusting for team differences in conversion rates. Optimal conversion charts would probably vary not just from team to team, but even from game to game as opponents change. Second, in NCAA overtime games, each team is given one chance to score from the 25-yard line. A coin flip determines which team gets to choose whether to play first or second, and in the second overtime round, the choice reverts to the other team. Consider a team that allows a touchdown on the opening possession of the first overtime. If it scores a touchdown, it faces a conversion decision with knowledge about who will have the advantage of choice in the second overtime. Peter Rosen and Rick Wilson examined the determinants of the size of this advantage in a 2007 Journal of Quantitative Analysis in Sports paper, "An Analysis of the Defense First Strategy in College Football Overtime Games." Combining the analysis here with their model would improve on decision-making at the end of the first overtime period. The conventional wisdom is often right when it comes to the two-point conversion. In the 228 situations examined here, kicking the extra point was the recommended strategy in 169 of them. But that leaves 59 games—26% of
them—where the conventional wisdom was wrong. Sure, sometimes coaches do go for two, but this study finds little evidence that they do so with any rhyme or reason, and hopefully this analysis will provide some guidance for when to play it safe and when to go for it.
Further Reading
Clevenson, M. Lawrence, and Jennifer Wright. 2009. Go for it: What to consider when making fourth down decisions. CHANCE 22(1):34–41.
ESPN.com. 2005. Cowboys pull through the muck to edge Giants. http://sports.espn.go.com/nfl/recap?gameId=251016006.
Greene, William H. 2003. Econometric analysis, 5th edition. New Jersey: Prentice-Hall.
Rosen, Peter A., and Rick L. Wilson. 2007. An analysis of the defense first strategy in college football overtime games. Journal of Quantitative Analysis in Sports 3(2):1–17.
Sackrowitz, Harold. 2000. Refining the point(s)-after-touchdown decision. CHANCE 13(3):29–34.
Starkey, J. Denbigh. 2005. Playing with the percentages when trailing by two touchdowns. The Sport Journal 8(4).
Wenz, Michael G., and Joren Skugrud. 2010. Is there status quo bias? Evidence from two-point conversion attempts in football. http://ssrn.com/abstract=1582638.
Estimating Rates at Which Books Are Mis-Shelved
Hongmei Liu, Jay Parker, and Wei Sun
The basic goal of survey sampling is to draw inferences on selected parameters, or characteristics, of a population. Usually, it is impractical or impossible to examine all the individuals in an entire population. By using a controlled random sampling strategy, one can examine a sample selected from a population, which takes less time, costs less money, and is operationally simpler than doing a census. It also achieves acceptably accurate results. Basic ideas of survey sampling were used in a project for an introductory survey sampling course at the University of Illinois at Chicago, taught and supervised by Samad Hedayat, in the fall of 2009. The task was to estimate the mis-shelving rate of books at the university library. The following relates the story of that exercise.
Survey Design
To draw a representative sample that can reflect on the population with reasonable accuracy, we need to select a suitable sampling design. Let U = {1, 2, …, N} index a finite population of N distinct and identifiable units. We refer to N as the size of the finite population. The elements of U—namely 1, 2, …, N—also are known as the sampling units. Here, identifiability means there is a 1:1 correspondence between the units and the indexes 1, 2, …, N. The sampling frame
is a list of all the N population units from which a sample is drawn. There is a study variable Y that assumes values Y1, Y2, ..., YN on the N units in the population. For example, in the study of mis-shelved books, we are interested in the mis-shelving rate. In this case, the total number of books is N, and Yi = 1 if the ith book is mis-shelved, Yi = 0 otherwise; then,

$$\bar{Y} = \frac{\sum_{i=1}^{N} Y_i}{N}$$
refers to the mis-shelving rate per book. Here, all the N books are distinct and identifiable. A duplicate copy of a title counts as a separate book for this application. In reality, the mis-shelving rate per book is small. Therefore, it is often expressed in terms of a suitably defined larger collection (of books). Herein, we will work with the shelves as the reference population units, since these define natural collections of books. For a reasonably moderate collection of identifiable reference units, such as books or shelves, simple random sampling without replacement (SRSWOR) is recommended over taking a convenience or opportunistic sample. Once the sample size n (in terms of the number of reference units to be sampled) has been determined, the SRSWOR method ensures equal chance of all possible combinations of n units out of the totality of all reference units. SRSWOR can be implemented by putting index cards with identifying information on each reference unit into a box, or by putting electronic records into a computer, and randomly selecting n indexes sequentially without replacement with equal probability at each draw. Most often, however, alternative, more convenient, and accurate sampling methods are available for a large collection of units. One method is referred to as stratified simple random sampling (STRSRS). According to this method, the whole population of reference units is divided into a number of subpopulations, called strata, and independent simple random samples are conducted
in each stratum. We denote by L the number of strata and assume there are Mh population units in the hth stratum, h = 1, 2, …, L. We obtained preliminary information on mis-shelving rates and frequency of use on different book collections by consulting librarians and reading literature. Because different collections have different frequency of use and some collections naturally result in a low number of mis-shelved books, we divided the library into six collections by physical location and frequency of use. Each such collection served as a stratum in our study. The collections for our study comprised 1st floor south, 2nd floor south, 2nd floor north, 3rd floor north, 4th floor south, and 4th floor north. Those collections covered all the areas in the library except the noncirculated items and government documents, which we excluded from our study. In the above notation, we have L = 6 strata. If each stratum were amenable to conducting a SRSWOR, we could have done so and combined the estimates from each stratum to estimate the overall mis-shelving rate. Stratified sampling often is more efficient than simple random sampling, because it explicitly removes the between-strata variation from the estimation. At this stage, we realized the stratum sizes were large and still not easy to sample. The books, however, were conveniently located in stacks of bookshelves, running into hundreds and thousands of bookshelves. Moreover, the bookshelves were easy to list and contained a more-or-less constant number of books (30) per shelf. Although there was slight stratum-to-stratum variation in this number, this was approximately true for each stratum. Also, it is not difficult for a single person to check the shelving order of 30 books. Within a stratum, in survey sampling terminology, the population of books is organized into clusters of books. From within each stratum, we decided to select a few clusters and inspect the status of all the books in each selected cluster. Cluster sampling is useful in
Table 1—Precision Test of Sample Size

Sample               Total Number of Books   Mis-Shelving Rate per Shelf (φ̂h)   s.e. of φ̂h
SRSWOR(9945, 100)    3,110                   0.80                                0.0891
SRSWOR(9945, 1000)   29,886                  0.71                                0.0250
Table 2(a)—Mis-Shelving Rate by Column and Row Positions of the Shelf in Book Stacks

                                           Column position
                                  Row       Bottom   Middle    Top     Total
# of Shelves                      End       87       130       104     321
                                  Middle    151      260       168     579
# of Books per Shelf              End       28       29        30      29
                                  Middle    31       30        30      30
# of Mis-Shelved Books per Shelf  End       0.67     0.63      0.74    0.68
                                  Middle    0.86     0.66      0.68    0.72
Mis-Shelving Rate per Book        End       2.36%    2.18%     2.50%   2.33%
                                  Middle    2.78%    2.20%     2.30%   2.38%

Total # of Shelves                          238      390       272     900
Average # of Books per Shelf                30       30        30      30
Average # of Mis-Shelved Books per Shelf    0.79     0.65      0.70    0.70
Mis-Shelving Rate per Book                  2.63%    2.19%     2.37%   2.36%
Table 2(b)—Chi-Square Tests Comparing Mis-Shelving Rate of the Shelves

Row       Column    # of Mis-Shelved Books   Total # of Books   p-Value
End       Bottom    58                       2456               Reference
          Middle    82                       3768               0.6006
          Top       77                       3081               0.7263
Middle    Bottom    130                      4679               0.3127
          Middle    172                      7825               0.6893
          Top       114                      4967               0.8059
situations in which the frame of the population under study is either not readily available or largely inadequate for the purpose of sampling. Indeed, we could not be sure of the exact shelf location of each book based on its electronic record. Instead, the population units are found to be conveniently grouped into several natural clusters. In such cases, splitting the population into representative clusters can make sampling more practical. Then, we could simply select one or a few clusters at random and perform a census within each of the clusters. A potential disadvantage of cluster sampling is that the units within a cluster can be quite similar. For example, a remote shelf with old volumes on it might be relatively undisturbed relative to the general library. One could potentially learn more about the overall mis-shelving rate by taking 30 randomly selected books, rather than 30 from one shelf. Still, cluster sampling is often used due to its practical advantages. In summary, the design used in the project was a stratified single-stage cluster sampling design. That is, first a STRSRS design was used to select clusters (the book shelves in our study) within each stratum (book collection). Then, a census of books in the selected clusters was performed to determine the accuracy of shelving. In effect, we used the shelves as clusters of books and used the shelves as reference units for sampling and inference.
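A minimal sketch of the selection step under this design, using the stratum shelf counts reported later in Table 5; every book on a selected shelf would then be inspected (illustrative code, not the authors'):

```python
import random

random.seed(2009)

# Shelf counts per stratum (the M_oh column of Table 5).
strata = {
    "1st floor south": 1033, "2nd floor south": 1221, "2nd floor north": 1239,
    "3rd floor north": 7869, "4th floor south": 9945, "4th floor north": 9528,
}
m_h = 100   # shelves sampled per stratum in the study

# Stratified single-stage cluster sampling: an independent SRSWOR of shelf
# indexes within each stratum; every book on a selected shelf is then checked.
sample = {s: sorted(random.sample(range(1, n + 1), m_h)) for s, n in strata.items()}
for stratum, shelf_ids in sample.items():
    print(stratum, shelf_ids[:5], "...")
```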
Definition of Outcome Variable
Recall that the parameter of interest is the mis-shelving rate. The understanding was that a mis-shelved book would be one whose call number was larger than the call number of the adjacent books on its right or smaller than the call number of the adjacent books on its left. That is, if {1, 2, 3, 4, 5} is the correct order, then for the sequence {1, 2, 5, 3, 4}, only unit {5} is considered mis-shelved. Our definition included identifiers such as volume numbers in a series of bound volumes of a periodical. We also accounted for books located on the floor close to bookshelves or on
top of other books. We treated a shelf of books as forming a natural cluster. Normally, a book stack consisted of seven rows and 13 columns of shelves on both sides. A shelf is a natural, easy to identify and locate cluster. Also one does not have to account for borrowed, circulating, or lost books when selecting a sample of nonempty shelves, as one does with selecting a sample of individual books. In many contexts, clearly defining variables to be measured is an important component of study development.
Sample Size and Pilot Study
To check whether it is worthwhile to allocate sample size (i.e., number of shelves to be selected from each stratum of shelves) proportional to the stratum size and, further, for example, whether 100 shelves make up a suitable sample size, we conducted a precision test on the 4th floor south collection. This collection had 9,945 shelves in total. We obtained a sample of 100 shelves using an SRSWOR(9945, 100) design. We then sampled another 900 shelves using an SRSWOR design from the remaining subpopulation of 9,845 shelves. Combining these two samples produces a sample of 1,000 shelves belonging to the class of SRSWOR(9945, 1000) designs. We recorded both the number of books per shelf and the number of mis-shelved books per shelf. Table 1 summarizes the comparison between the samples. As seen in Table 1, the precision (standard error) with 100 shelves was about $\sqrt{10}$ times that of the sample with 1,000 shelves (as expected), but it was still quite low. The standard error (s.e.) of $\hat{\varphi}_h$ is approximately

$$\sqrt{\frac{\hat{\varphi}_h\,(30-\hat{\varphi}_h)}{30\,m_h}\left(1-\frac{m_h}{M_{oh}}\right)},$$

which is a formula from survey sampling that accounts for sampling units from a finite population. Therefore, the decision was made to opt for a sample of 100 shelves from each stratum, as it meets our needs of reducing data collection time while keeping a satisfactory level of precision.
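A quick check of this formula against the pilot samples in Table 1 (a sketch; the 100-shelf value differs slightly from the reported 0.0891, presumably because of rounding in the reported rate):

```python
import math

def se_rate_per_shelf(phi_hat, m_sampled, shelves_in_stratum, books_per_shelf=30):
    """Approximate s.e. of the per-shelf mis-shelving rate under SRSWOR,
    including the finite population correction (1 - m/M)."""
    binomial_part = phi_hat * (books_per_shelf - phi_hat) / (books_per_shelf * m_sampled)
    return math.sqrt(binomial_part * (1 - m_sampled / shelves_in_stratum))

# The two pilot samples from the 4th floor south collection (Table 1).
print(se_rate_per_shelf(0.80, 100, 9945))    # about 0.088 (reported: 0.0891)
print(se_rate_per_shelf(0.71, 1000, 9945))   # about 0.025 (reported: 0.0250)
```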
Moreover, we decided to work out an estimate of the mis-shelving rate per shelf, rather than per book, since the latter was small. In every stratum, an estimate of the exact mis-shelving rate per shelf is the ratio of the total number of mis-shelved books arising out of the sampled shelves over the number of sampled shelves. Under the assumption that, on average, the book volume is 30 books per shelf, we also can work out an estimate of the exact mis-shelving rate per book, once the rate is computed per shelf. Moreover, using exact sizes of the sampled shelves would result in computation of a ratio of the form total number of mis-shelved books arising out of the sampled shelves over total number of books arising out of the sampled shelves. This would provide an estimate of the rate per book only in an approximate sense. The estimate would behave like a ratio estimate. We pursue this latter computation below only to check the variation in the final results on the rate per book.
Data Description
Table 2(a) and Table 2(b) show comparisons of shelves with different column and row positions on book stacks to check whether physical locations of the shelves make any difference, using the 900 shelves sampled from the 4th floor south. Column position was determined by proximity to the top or bottom. Those shelves within two levels from the top were considered column-top, and those within two levels from the bottom were considered column-bottom. Otherwise, the shelves were considered column-middle. Position was similarly defined for rows. Those shelves within two of the left end or right end of the row were considered row-end, and everything else was considered row-middle. Statistical tests show there was no significant difference in terms of mis-shelving rates among shelves of different row and column positions, although column-middle seemed to have a lower mis-shelving rate than column-top and column-bottom. Therefore, a shelf of books seemed like a reasonable cluster. The shelves were used as the primary sampling units.
Table 3—Frequency of Mis-Shelved Books by Collection

                 Number of Mis-Shelved Books on a Shelf
Collection       0       1       2       3      4      5      6      7      Total
1st Fl South     59%     19%     14%     5%     3%     0%     0%     0%     100%
2nd Fl South     45%     29%     13%     10%    3%     0%     0%     0%     100%
2nd Fl North     46%     27%     11%     7%     5%     3%     0%     1%     100%
3rd Fl North     46%     31%     16%     5%     2%     0%     0%     0%     100%
4th Fl South     52%     28%     14%     2%     3%     0%     1%     0%     100%
4th Fl North     47%     28%     10%     10%    4%     0%     1%     0%     100%
Overall          49.2%   27.0%   13.0%   6.5%   3.3%   0.5%   0.3%   0.2%   100%
Figure 1. Frequency of mis-shelved books by collection (number of mis-shelved books on a shelf: between 0 and 7)

Figure 2. Boxplot of phi by collection (phi = number of mis-shelved books on a shelf divided by the total number of books on the same shelf)
Table 4—Data Summary by Collection and Book Volume per Shelf

                                        Book Volume per Shelf
Collection (stratum)        ≤10     (10,20]   (20,30]   (30,40]    >40    Overall

# of Shelves
1st Floor S.                  7        27        38        22        6      100
2nd Floor S.                  0        16        56        26        2      100
2nd Floor N.                  0         9        43        46        2      100
3rd Floor N.                  2        31        57         9        1      100
4th Floor S.                  1         7        43        37       12      100
4th Floor N.                  1        14        52        25        8      100

# of Books per Shelf
1st Floor S.                  9        17        25        35       50       25
2nd Floor S.                  /        18        26        34       47       27
2nd Floor N.                  /        18        26        35       44       30
3rd Floor N.                  8        17        26        34       41       23
4th Floor S.                  8        17        27        35       46       31
4th Floor N.                  8        17        26        34       44       28

# of Mis-Shelved Books per Shelf
1st Floor S.               0.57      0.19      0.76      1.05     2.17     0.74
2nd Floor S.                  /      0.75      0.91      1.19     1.50     0.97
2nd Floor N.                  /      0.33      1.30      1.13     0.50     1.12
3rd Floor N.               1.50      0.68      0.86      1.11     3.00     0.86
4th Floor S.               0.00      0.29      0.58      0.89     1.67     0.80
4th Floor N.               0.00      0.14      0.85      1.32     2.63     1.00

Mis-Shelving Rate per Book
1st Floor S.              6.06%     1.12%     3.05%     2.95%    4.38%    2.91%
2nd Floor S.                  /     4.20%     3.50%     3.56%    3.19%    3.58%
2nd Floor N.                  /     1.90%     5.00%     3.27%    1.15%    3.79%
3rd Floor N.             18.75%     4.06%     3.37%     3.27%    7.32%    3.68%
4th Floor S.              0.00%     1.71%     2.16%     2.58%    3.64%    2.57%
4th Floor N.              0.00%     0.82%     3.28%     3.83%    5.92%    3.56%

All Collections
# of Shelves                 11       104       289       165       31      600
Avg. # of Books per Shelf     9        17        26        34       46       27
Avg. # of Mis-Shelved
Books per Shelf            0.64      0.43      0.88      1.10     1.97     0.92

For each collection, we applied an SRSWOR design to select 100 shelves. For each selected shelf, the mis-shelving rate (the ratio of the number of mis-shelved books to the total number of books on the same shelf) was computed.
Analysis Phase
Table 3 and Figure 1 show that most of the shelves had fewer than two books mis-shelved across all the collections. Overall, 49.2% of the shelves had no
books mis-shelved, 27% had only one book mis-shelved, and 13% had two books mis-shelved. As we can see from Figure 1, 2nd floor north and 2nd floor south had the lowest percentages of shelves with fewer than two books mis-shelved, along with a high percentage of shelves with more than two books mis-shelved. The 1st floor south and 4th floor south had the highest percentage of shelves with fewer than two books mis-shelved. Figure 2 shows the mis-shelving rate on single shelves by collection. The
mean level of the 4th floor south collection was much lower than the other collections and also had the smallest spread. One can clearly see that the distributions were right skewed and the 3rd floor north collection had an extremely high outlier. Checking the data, the 3rd floor north collection had one shelf with a high mis-shelving rate because it had three out of its seven books mis-shelved. Table 4 shows the distribution of book volume of the shelves by the six strata. It indicates there is no strong linear
relationship between book volume and mis-shelving rate. To further investigate the impact of book volume per shelf on mis-shelving rate, we used the Wald chi-square test, conducted with the SAS SURVEYFREQ procedure, to analyze the data. The shelves were divided into five groups according to their book volume, as we did in Table 4, and we had six strata. According to the chi-square test results, the association between book volume of the shelves and mis-shelving rate is not significant (p-value = 0.1750, $\chi^2_4$ = 6.3653, Adjusted F = 1.5833). This analysis procedure took into consideration the complex survey design with stratification and clustering.
Accounting for the Complex Sample Design: Statistical Inference

Now we look at statistical inference for mis-shelving rates. When a sample is greater than 5% of the population from which it is selected and the sample is chosen without replacement, the finite population correction factor should be used. The central limit theorem and the standard errors of the mean and of the proportion are based on the premise that the samples are chosen with replacement. However, in virtually all survey research, sampling is conducted without replacement from populations of finite size N. In these cases, particularly when the sample size n is not small in comparison with the population size N (i.e., more than 5% of the population is sampled, so that n/N > 0.05), a finite population correction factor is used to define both the standard error of the mean and the standard error of the proportion.

If n denotes the sample size and N the population size, then under SRSWOR sampling f = n/N is known as the “sampling fraction.” Under SRSWR sampling, the sampled units behave as independently and identically drawn units, and the variance of the sample mean assumes the very simple form σ²/n. However, under SRSWOR sampling, there is an intrinsic dependence of the sampled units among themselves. In that case, the variance of the sample mean assumes the form

σ² N (1 − f) / [(N − 1) n],

where f is the sampling fraction. For large N and an appreciable value of the sampling fraction (of the order of 5% or more), the formula simplifies to σ²(1 − f)/n. The factor (1 − f) is known as the finite population correction factor (abbreviated as fpc). When f is small, we can drop the fpc and the formula reduces to σ²/n, the same as under SRSWR sampling. In the above, σ² refers to the population variance of the values of the study variable Y.

Table 5 summarizes the statistical estimation of mis-shelved books and other parameters. Moh is the number of shelves, or clusters; nh is the average number of books per shelf; Mh is the number of shelves rescaled by a factor of (nh/30); Wh = Mh/ΣMh is the weight of the hth stratum; φ̂h is the estimated mis-shelving rate per shelf (mis-shelved books per shelf); and T̂h = nh·Moh is the estimated total number of books. The collection on 2nd floor north had the highest mis-shelving rate among the six collections we investigated, and the collection on 1st floor south had the lowest. The mean number of mis-shelved books per shelf was estimated to be 0.88, with an estimated standard error of 0.1588. We have assumed the average shelf length in terms of book volume per shelf is 30. Therefore, an estimate of the number of mis-shelved books per 1,000 books is 29, obtained from the per-shelf estimate by multiplying by 1,000/30. Further, the estimated standard error would also have the same inflation factor. The mis-shelving rate we obtained is 2.9%.
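To make the variance formulas above concrete, here is a minimal Python sketch; the population size, sample size, and variance used are illustrative values, not quantities from this study.

```python
def var_mean_srswr(sigma2, n):
    """Variance of the sample mean under simple random sampling with replacement."""
    return sigma2 / n

def var_mean_srswor(sigma2, n, N):
    """Variance of the sample mean under SRSWOR: (sigma^2 / n) * N(1 - f)/(N - 1)."""
    f = n / N                                   # sampling fraction
    return (sigma2 / n) * N * (1 - f) / (N - 1)

# Illustrative numbers only: 100 shelves drawn from a population of 1,000.
sigma2, n, N = 0.25, 100, 1000
print(var_mean_srswr(sigma2, n))      # 0.0025
print(var_mean_srswor(sigma2, n, N))  # about 0.00225, roughly the SRSWR value times (1 - f)
```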
Table 5—Statistical Estimation of Mis-Shelved Books

Collections Strata (L = 6) | Moh (mh = 100) | nh | Mh | Wh | φ̂h | s.e. of φ̂h | T̂h | s.e. of T̂h
1st Floor S. | 1,033 | 25 | 861 | 0.0299 | 0.74 | 0.0898 | 25,825 | 1,043
2nd Floor S. | 1,221 | 27 | 1,099 | 0.0382 | 0.97 | 0.1016 | 32,967 | 749
2nd Floor N. | 1,239 | 30 | 1,239 | 0.0431 | 1.12 | 0.1088 | 37,170 | 755
3rd Floor N. | 7,869 | 23 | 6,033 | 0.2096 | 0.86 | 0.0924 | 180,987 | 5,150
4th Floor S. | 9,945 | 31 | 10,277 | 0.3571 | 0.80 | 0.0891 | 308,295 | 7,915
4th Floor N. | 9,528 | 28 | 8,893 | 0.3090 | 1.00 | 0.0993 | 266,784 | 7,453
Overall | 30,835 | 28 | 28,779 | 1 | 0.88 | 0.1588 | 852,082 | 23,065
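The overall point estimates in Table 5 can be recovered by combining the per-stratum values with the stratum weights. The Python sketch below simply retypes the Wh and φ̂h columns; it reproduces the overall 0.88 mis-shelved books per shelf, the roughly 29 per 1,000 books, and the 2.9% rate, but it does not attempt to reproduce the published standard errors, which reflect the full stratified cluster design.

```python
# Stratum weights W_h and estimated mis-shelved books per shelf phi_h, from Table 5.
W   = [0.0299, 0.0382, 0.0431, 0.2096, 0.3571, 0.3090]
phi = [0.74,   0.97,   1.12,   0.86,   0.80,   1.00]

overall_per_shelf = sum(w * p for w, p in zip(W, phi))   # weighted combination across strata
print(round(overall_per_shelf, 2))            # 0.88 mis-shelved books per shelf
print(round(overall_per_shelf * 1000 / 30))   # about 29 per 1,000 books (30 books per shelf assumed)
print(round(overall_per_shelf / 30, 3))       # mis-shelving rate of roughly 0.029, i.e., 2.9%
```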
In the collection on 2nd floor north, the low frequency of use played a different role than expected. While many of the shelves had fewer than two mis-shelved books, there were 10 shelves with four, five, or even seven mis-shelved books. These high numbers were due to books being stacked after the bookend or on top of the books, instead of being placed somewhere the librarian would notice them and reshelve them properly.
Discussion

Our librarians did a good job in keeping the overall mis-shelving rate low compared to other public and university libraries, though there were certain areas that needed more effort (e.g., 2nd floor north). Due to time constraints, we did not do a census to compare strategies. There are other strategies we could consider and compare to STRSRS, such as systematic sampling, since books are arranged in arrays, and STRSRS with proportional allocation or optimal allocation if we can define a cost structure for the sampling design. We also can look into time effects by inspecting the mis-shelving rate at different periods of the year. We used stratified single-stage cluster sampling. The drawback of this approach is that our strata were placement based. The amount of gain due to stratification largely depends on the degree of
homogeneity within each stratum and heterogeneity among the strata means with respect to the survey variable. If improving the precision of the sampling design is the primary concern, then we could do a subject-based stratified sampling. One possible approach would be to define strata in a different way or to incorporate a second-stage cluster sampling. In second-stage cluster sampling, one would subsample books from each cluster. If the shelves held 300 books instead of 30, it would be time consuming to check all 300 books. Rather, one could take a random sample of 30 books from each shelf and examine those books. This second stage of sampling is common in many applications. Another option, with more advance information, would be to combine homogeneous collections of similar mis-shelving rates into one stratum and reduce the number of strata in the sampling design. Another approach for investigating the mis-shelved books and predicting counts of mis-shelved books is a binomial-Poisson mixture model with each shelf as the specified interval, if information about more covariates is available. There are many factors that may have an impact on the mis-shelving rate, including frequency of use, subjects, timing and flux of users, and the shape of the books. Perhaps mis-shelving is also a result of inexperienced personnel, lack
of attention, fatigue by personnel, or the complexity of some book indexes.
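As a rough illustration of the second-stage subsampling idea mentioned above, the Python sketch below estimates a shelf's rate from a random subsample of 30 books instead of inspecting every book; the 300-book shelf and its mis-shelved count are invented for the example.

```python
import random

def estimate_shelf_rate(shelf_books, subsample_size=30, seed=0):
    """Estimate a shelf's mis-shelving rate from a random subsample of its books.

    shelf_books: list of 0/1 flags, 1 meaning the book is mis-shelved.
    """
    rng = random.Random(seed)
    sample = rng.sample(shelf_books, subsample_size)
    return sum(sample) / subsample_size

# Hypothetical 300-book shelf with 9 mis-shelved books (true rate 3%).
shelf = [1] * 9 + [0] * 291
print(estimate_shelf_rate(shelf))   # subsample-based estimate of the 3% rate
```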
Further Reading

Cochran, W. G. 1977. Sampling techniques. New York: Wiley.
Edwardy, J. M., and J. S. Pontius. 2001. Monitoring book reshelving in libraries using statistical sampling and control charts. Library Resources & Technical Services 45(2):90–94.
Groves, R. M., F. J. Fowler Jr., M. P. Couper, J. M. Lepkowski, E. Singer, and R. Tourangeau. 2009. Survey methodology. 2nd edition. Wiley.
Hedayat, A. S., and B. K. Sinha. 1991. Design and inference in finite population sampling. New York: Wiley.
Heeringa, S. G., B. T. West, and P. A. Berglund. 2010. Applied survey data analysis. Chapman & Hall.
Jan, S. S., A. W. John, and S. Nackil. 2009. A cost-benefit analysis of a collections inventory project: A statistical analysis of inventory data from a medium-sized academic library. The Journal of Academic Librarianship 35(4):314–323.
Kish, L. 1965. Survey sampling. Wiley.
Lohr, S. 2009. Sampling: Design and analysis. Duxbury.
Can the Probability of an Event Be Larger or Smaller Than Each of Its Component Conditional Probabilities? Milton W. Loyer and Gene D. Sprechini
Consider an event that can occur under a set of mutually exclusive and exhaustive conditions. Is it possible for the overall probability of that event to be larger (or smaller) than all of its individual probabilities of occurrence under each of those conditions? Before answering that question, you may wish to consider the following scenario.
Plate Expectations: A Dickens of a Problem

A baseball manager has two available pinch-hitters, each with 100 at-bats so far this season. Arnie has 26 hits for a batting average of 0.260, and Barney has 36 hits for a batting average of 0.360. If the manager chooses at random between the two players, what is the probability of getting a hit? Straightforward calculations indicate the probability of getting a hit from such a random selection is (1/2)(0.260) + (1/2)(0.360) = 0.310. But, are there other factors that should be considered? What about whether the opposing pitcher is left-handed or right-handed? The more detailed Table 1 reveals an amazing fact. While the unconditional probability that the randomly selected pinch-hitter gets a hit is P(H) = 0.310, that probability drops to P(H|L) = (1/2)(0.300) + (1/2)(0.240) = 0.270 if the opposing pitcher is left-handed and P(H|R) = (1/2)(0.100) + (1/2)(0.480) = 0.290 if the opposing pitcher is right-handed. How can this be?
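The arithmetic behind this can be checked directly; the short Python sketch below simply recomputes the probabilities stated above for a pinch-hitter chosen by a fair coin flip, using the Table 1 splits.

```python
# Hits and at-bats from Table 1, split by the opposing pitcher's throwing hand.
arnie  = {"L": (24, 80), "R": (2, 20)}    # (hits, at-bats)
barney = {"L": (12, 50), "R": (24, 50)}

def avg(hits_ab):
    hits, ab = hits_ab
    return hits / ab

p_hit       = 0.5 * (26 / 100) + 0.5 * (36 / 100)             # unconditional
p_hit_left  = 0.5 * avg(arnie["L"]) + 0.5 * avg(barney["L"])  # given a left-hander
p_hit_right = 0.5 * avg(arnie["R"]) + 0.5 * avg(barney["R"])  # given a right-hander

print(p_hit, p_hit_left, p_hit_right)   # approximately 0.310, 0.270, 0.290; both conditionals are lower
```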
An Applied Problem in Statistical Estimation

Consider the following analysis of apple data. The goal is to estimate P(H)—the proportion of fruit having some defect—for a tree of a given variety, rootstock, chemical treatment, etc. Each tree is an experimental unit that provides an equally valuable estimate of the desired proportion, and so the researcher estimates P(H) by adding the separate proportions and dividing by the number of trees. The Table 2 results for a treatment applied to a sample of n = 10 trees give (# of apples having the defect)/(# of apples sampled), the proportion of apples that have the defect for each tree. On the surface, the analysis and interpretation are straightforward. A grower wants to estimate, when he plants a tree and applies a certain treatment, what proportion of the fruit for that tree will have the defect. The tree he plants is considered a random selection from all possible future trees, and so he takes a random sample of existing trees and applies the treatment. Results from the sampled trees are averaged, with the estimate from each existing tree considered an equally valuable estimate for what might occur in future trees. He estimates that a tree given that treatment will produce fruit of which 0.048 = 4.8% has the defect. The variability
Table 1—Detailed Batter Data

Batter | Against Left-Handers | Against Right-Handers | Overall
A | 24/80 = 0.300 | 2/20 = 0.100 | 26/100 = 0.260
B | 12/50 = 0.240 | 24/50 = 0.480 | 36/100 = 0.360
from tree to tree is used to set confidence intervals, perform tests involving other treatments, etc. The unweighted estimate P(H) = ∑P(Hi)/10 = 0.048 is preferred over the weighted estimate P(H) = [∑hi]/[∑ni] = 90/2000 = 0.045 for a number of reasons. Not only is the tree the experimental unit of interest, but the weighted estimate is inappropriate because the so-called weights (i.e., the individual ni values) are arbitrary and not necessarily representative of the population of apples in the orchard. Notice, for example, that each ni is a multiple of 50. In practice, workers are sent into the orchard with trays that hold 50 apples. They are typically instructed to get as many full trays as possible from each tree, or to get as many full trays as possible up to some maximum number of trays for each tree. It is often important for growers to break down the total crop into fruit lost to the ground before the harvest and fruit remaining on the tree until harvest. The so-called lost and remaining fruit are usually processed differently, and for most defects P(H|L) > P(H|R). A typical result might be given mathematically as P(H) = 0.048, P(H|L) = 0.080, and P(H|R) = 0.040 so that the following are true: (1) For a general fruit from a tree, the probability of having the defect is .048. (2) If the fruit is known to be one lost to the ground, that probability rises to .080. (3) If the fruit is known to be one remaining on the tree, that probability drops to .040. The proportions in this example imply that P(L) = 0.20, or 20% of the fruit is lost to the ground. The reasoning is as follows: 0.080 (0.20) + 0.040 (1.00 – 0.20) = 0.048.
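Using the Table 2 counts, a short Python sketch (illustrative only) contrasts the unweighted per-tree estimate and the weighted pooled estimate described above.

```python
# (defective, sampled) apples for the 10 trees in Table 2.
trees = [(6, 150), (10, 200), (5, 250), (9, 450), (5, 100),
         (6, 150), (25, 250), (6, 200), (15, 150), (3, 100)]

unweighted = sum(h / n for h, n in trees) / len(trees)             # mean of the per-tree proportions
weighted   = sum(h for h, _ in trees) / sum(n for _, n in trees)   # pooled proportion

print(round(unweighted, 3), round(weighted, 3))   # 0.048 and 0.045, as in the text
```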
Table 2—Apple Data for 10 Trees

Tree | Results
Tree 1 | 6/150 = 0.04
Tree 2 | 10/200 = 0.05
Tree 3 | 5/250 = 0.02
Tree 4 | 9/450 = 0.02
Tree 5 | 5/100 = 0.05
Tree 6 | 6/150 = 0.04
Tree 7 | 25/250 = 0.10
Tree 8 | 6/200 = 0.03
Tree 9 | 15/150 = 0.10
Tree 10 | 3/100 = 0.03
P(H) | 0.048
Table 3 gives the 10 Table 2 results, this time broken down by lost and remaining fruit. This is the analysis often employed for such data. In mathematical notation, the unconditional probability is less than any of the conditional probabilities: P(H) = 0.048, P(H|L) = 0.088, and P(H|R) = 0.053. In practical terms, the following paradoxical statements apply:

(1) For a general fruit from a tree, the probability of having the defect is 0.048.

(2) If the fruit is known to be one lost to the ground, that probability rises to 0.088.

(3) If the fruit is known to be one remaining on the tree, that probability rises to 0.053.

During more than 20 years of analyzing fruit data, the author has seen such paradoxical results only once. But, no matter how rare the results are, there are only two possible explanations: Either the probability of an event can be larger or smaller than each of its component conditional probabilities, or the estimate for at least one of P(H), P(H|L), or P(H|R) was improperly obtained. After much thought, several false starts, and helpful insight from colleagues, the issue is now resolved. The following adventure from statistics to probability and back to statistics is the story of that resolution.

The Switch from Statistics to Probability

Probability sees the population and tries to predict the sample, whereas statistics sees the sample and tries to predict the population. Furthermore, the best prediction for the value of a sample statistic is typically the value of the corresponding population parameter, and the best point estimate for the value of a population parameter is typically the value of the corresponding sample statistic. Since probability is generally more straightforward than statistics, we restate the preceding statistical estimation problem in terms of a probability problem to expose any nuances that might be being overlooked. Consider a single set of data following the pattern of the apple data—with the number of “trees” reduced from 10 to three and designated A, B, and C, as follows. In addition, let “O” represent “not H.” As previously, “L” means “lost to the ground” and “R” means “remaining” on the tree.

A includes 100 values: 20 Ls (2 H and 18 O) and 80 Rs (24 H and 56 O)
B includes 100 values: 50 Ls (25 H and 25 O) and 50 Rs (12 H and 38 O)
C includes 200 values: 60 Ls (18 H and 42 O) and 140 Rs (42 H and 98 O)

Table 4 summarizes this information, which also exhibits the paradox under consideration, in a format similar to the apple data. There are two possible scenarios for finding P(H), P(H|L), and P(H|R) for selecting one of A, B, or C at random.

Scenario 1: Probabilities Involving Buckets

The first probability scenario involves selecting in such a way that the sample consists of physical objects in the population. Consider three buckets, each containing shapes [hexagons (H) or octagons (O)] that are colored either lilac (L) or rose (R).

Bucket A: 20 lilac-colored shapes (2 hexagons, 18 octagons) and 80 rose-colored shapes (24 hexagons, 56 octagons)
Bucket B: 50 lilac-colored shapes (25 hexagons, 25 octagons) and 50 rose-colored shapes (12 hexagons, 38 octagons)
Bucket C: 60 lilac-colored shapes (18 hexagons, 42 octagons) and 140 rose-colored shapes (42 hexagons, 98 octagons)

A bucket is to be selected at random, and a shape is to be selected from that bucket. We desire to find the following probabilities:

P(the shape is a hexagon) = P(H)
P(the shape is a hexagon, given that it is lilac colored) = P(H|L)
P(the shape is a hexagon, given that it is rose colored) = P(H|R)

In this case, P(H) is straightforward, while P(H|L) and P(H|R) require some thought.
Table 3—Apple Data for 10 Trees, by Lost and Remaining

Tree | Lost | Remaining | Overall
Tree 1 | 3/30 = 0.10 | 3/120 = 0.025 | 6/150 = 0.04
Tree 2 | 6/40 = 0.15 | 4/160 = 0.025 | 10/200 = 0.05
Tree 3 | 3/50 = 0.06 | 2/200 = 0.010 | 5/250 = 0.02
Tree 4 | 5/50 = 0.10 | 4/400 = 0.010 | 9/450 = 0.02
Tree 5 | 3/20 = 0.15 | 2/80 = 0.025 | 5/100 = 0.05
Tree 6 | 3/30 = 0.10 | 3/120 = 0.025 | 6/150 = 0.04
Tree 7 | 2/20 = 0.10 | 23/230 = 0.100 | 25/250 = 0.10
Tree 8 | 2/40 = 0.05 | 4/160 = 0.025 | 6/200 = 0.03
Tree 9 | 2/100 = 0.02 | 13/50 = 0.260 | 15/150 = 0.10
Tree 10 | 1/20 = 0.05 | 2/80 = 0.025 | 3/100 = 0.03
P(H) | 0.088 | 0.053 | 0.048
Table 4—Summarizing the Population Data

 | L | R | Overall
A | 2/20 = 0.100 | 24/80 = 0.300 | 26/100 = 0.260
B | 25/50 = 0.500 | 12/50 = 0.240 | 37/100 = 0.370
C | 18/60 = 0.300 | 42/140 = 0.300 | 60/200 = 0.300
Unweighted P(H) | 0.300 | 0.280 | 0.310
Table 5—Summarizing the Calculated Probabilities

 | P(H|L) | P(H|R) | P(H)
Bucket Scenario | 0.360 | 0.285 | 0.310
Batter Scenario | 0.300 | 0.280 | 0.280 + 0.02p

Table 6—Summarizing the Point Estimates for the Population Parameters

 | P(H|L) | P(H|R) | P(H)
Scenario 1 | 0.360 | 0.285 | 0.310
Scenario 2 | 0.300 | 0.280 | 0.280 + 0.02p
Note that the conditional probability that the shape came from bucket A when it is lilac colored, P(A|L), is not 20/130 (about 0.15), because the 130 L shapes are not equally likely. Instead, P(A|L) may be calculated using Bayes' Theorem as follows:

P(A|L) = P(A∩L)/P(L) = P(A∩L)/[P(A∩L) + P(B∩L) + P(C∩L)]
       = P(A)P(L|A)/[P(A)P(L|A) + P(B)P(L|B) + P(C)P(L|C)]
       = (1/3)(20/100)/[(1/3)(20/100) + (1/3)(50/100) + (1/3)(60/200)] = 0.20

The three desired probabilities may be calculated as follows:

P(H) = P(A)P(H|A) + P(B)P(H|B) + P(C)P(H|C) = (1/3)(26/100) + (1/3)(37/100) + (1/3)(60/200) = 0.310

P(H|L) = P(A|L)P(H|L and A) + P(B|L)P(H|L and B) + P(C|L)P(H|L and C) = (0.20)(2/20) + (0.50)(25/50) + (0.30)(18/60) = 0.360

P(H|R) = P(A|R)P(H|R and A) + P(B|R)P(H|R and B) + P(C|R)P(H|R and C) = (0.40)(24/80) + (0.25)(12/50) + (0.35)(42/140) = 0.285

As expected, the overall P(H) = 0.310 is between the conditional probabilities P(H|R) = 0.285 and P(H|L) = 0.360.
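These bucket calculations can be verified by a small enumeration; the Python sketch below recomputes P(H), P(H|L), and P(H|R) directly from the counts given for buckets A, B, and C.

```python
# Each bucket maps a color to (hexagons of that color, shapes of that color).
buckets = {
    "A": {"L": (2, 20),  "R": (24, 80)},
    "B": {"L": (25, 50), "R": (12, 50)},
    "C": {"L": (18, 60), "R": (42, 140)},
}
p_bucket = 1 / 3   # each bucket is equally likely to be chosen

def size(b):
    return b["L"][1] + b["R"][1]

# Unconditional P(H): average the buckets' overall hexagon proportions.
p_h = sum(p_bucket * (b["L"][0] + b["R"][0]) / size(b) for b in buckets.values())

# Conditional P(H | color) = P(color and H) / P(color), marginalized over the bucket choice.
def p_h_given(color):
    p_color       = sum(p_bucket * b[color][1] / size(b) for b in buckets.values())
    p_color_and_h = sum(p_bucket * b[color][0] / size(b) for b in buckets.values())
    return p_color_and_h / p_color

print(round(p_h, 3), round(p_h_given("L"), 3), round(p_h_given("R"), 3))  # 0.31, 0.36, 0.285
```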
Scenario 2: Probabilities Involving Batting Averages

The second probability scenario involves using historical observed probabilities to predict future occurrences. Consider a team with three available batters who have the following numbers of hits (H) or outs (O) against pitchers that are either left handed (L) or right handed (R).

Batter A: 20 at-bats against left-handed pitchers (2 hits, 18 outs) and 80 at-bats against right-handed pitchers (24 hits, 56 outs)
Batter B: 50 at-bats against left-handed pitchers (25 hits, 25 outs) and 50 at-bats against right-handed pitchers (12 hits, 38 outs)
Batter C: 60 at-bats against left-handed pitchers (18 hits, 42 outs) and 140 at-bats against right-handed pitchers (42 hits, 98 outs)
A batter is selected at random to face an opposing pitcher. We desire to find the following probabilities:

P(the batter gets a hit) = P(H)
P(the batter gets a hit, given that the opposing pitcher is left handed) = P(H|L)
P(the batter gets a hit, given that the opposing pitcher is right handed) = P(H|R)

In this case, P(H|L) and P(H|R) are straightforward, while P(H) requires some thought. Now the conditional P(A|L) is 1/3 because the batter is selected at random, independent of the pitcher. The three desired probabilities may be calculated as follows:

P(H|L) = P(A|L)P(H|L and A) + P(B|L)P(H|L and B) + P(C|L)P(H|L and C) = (1/3)(2/20) + (1/3)(25/50) + (1/3)(18/60) = 0.300

P(H|R) = P(A|R)P(H|R and A) + P(B|R)P(H|R and B) + P(C|R)P(H|R and C) = (1/3)(24/80) + (1/3)(12/50) + (1/3)(42/140) = 0.280

P(H) = P(A)P(H|A) + P(B)P(H|B) + P(C)P(H|C), however, cannot be evaluated directly. Consider, for example, batter A. P(H|A) cannot be determined because A's performance is affected by whether the opposing pitcher is left handed or right handed. It is not valid to say without qualification that P(H|A) = 26/100 = 0.260. But it is true that P(H|L and A) = 2/20 = 0.100 and that P(H|R and A) = 24/80 = 0.300. And so the probability that A gets a hit depends upon p, the probability that the opposing pitcher is left handed. For each batter, it may be said that:

P(H|A) = p·P(H|L and A) + (1 − p)·P(H|R and A) = (p)(0.100) + (1 − p)(0.300) = 0.300 − 0.20p
P(H|B) = p·P(H|L and B) + (1 − p)·P(H|R and B) = (p)(0.500) + (1 − p)(0.240) = 0.240 + 0.26p
P(H|C) = p·P(H|L and C) + (1 − p)·P(H|R and C) = (p)(0.300) + (1 − p)(0.300) = 0.300

And now, P(H) may be expressed in terms of p as follows:

P(H) = P(A)P(H|A) + P(B)P(H|B) + P(C)P(H|C) = (1/3)(0.300 − 0.20p) + (1/3)(0.240 + 0.26p) + (1/3)(0.300) = 0.280 + 0.02p

The probability p may be thought of as an indicator variable that equals 1.0 when the opposing pitcher is left handed and 0.0 when the opposing pitcher is right handed, or p could be
any value between 0 and 1 representing the actual probability that the opposing manager will insert a left-handed pitcher. In either case, as expected, P(H) = 0.280 + 0.02p is between P(H|R) = 0.280 and P(H|L) = 0.300.

Summarizing the Two Probability Scenarios

The calculated probabilities in the two scenarios are summarized in Table 5. Taken together, the two scenarios demonstrate the following key concepts:

(1) The evaluation of the probabilities depends not only on the observed population data, but also on the story problem.

(2) In any scenario, it must be true that P(H) is between P(H|R) and P(H|L).

(3) In one scenario, the unweighted mean of A, B, C is appropriate for the overall probability, but not for the conditional probabilities. In the other scenario, the unweighted mean of A, B, C is appropriate for the conditional probabilities, but not for the overall probability.

It may seem surprising that a single population of values provides two answers for the probabilities P(H), P(H|L), and P(H|R), but there is a fundamental difference between the scenarios. The proportions of Ls in the buckets were fixed by the problem and determine the probability that an object is an L. The proportions of past Ls for the batters were manipulated by managerial choice and do not necessarily determine the probability that the present pitcher will be an L.
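A companion sketch for the batter scenario expresses the overall P(H) as a function of p, matching the formulas above; it uses the same hit and at-bat counts as the bucket example.

```python
# Hits and at-bats for batters A, B, C against left- (L) and right-handed (R) pitchers.
batters = {
    "A": {"L": (2, 20),  "R": (24, 80)},
    "B": {"L": (25, 50), "R": (12, 50)},
    "C": {"L": (18, 60), "R": (42, 140)},
}

def p_hit(p_left):
    """Overall P(H) when a batter is chosen at random, given P(left-handed pitcher) = p_left."""
    total = 0.0
    for b in batters.values():
        hl = b["L"][0] / b["L"][1]
        hr = b["R"][0] / b["R"][1]
        total += (p_left * hl + (1 - p_left) * hr) / 3
    return total

p_h_given_l = sum(b["L"][0] / b["L"][1] for b in batters.values()) / 3   # 0.300
p_h_given_r = sum(b["R"][0] / b["R"][1] for b in batters.values()) / 3   # 0.280
print(p_h_given_l, p_h_given_r, p_hit(0.0), p_hit(1.0))   # approximately 0.300, 0.280, 0.280, 0.300
```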
The Switch Back to Statistics

The best prediction for the value of a sample statistic is typically the value of the corresponding population parameter, and the best point estimate for the value of a population parameter is typically the value of the corresponding sample statistic. Considering the data as summarized in Table 4 to be a population from which a random selection of A, B, or C is to be made, Table 5 gives the probabilities of P(H), P(H|L), and P(H|R). Now, consider the data summarized in Table 4 to be the sample results from selecting A, B, and C from some population and use the previously calculated probabilities of P(H), P(H|L), and P(H|R) to estimate the population parameters P(H), P(H|L), and P(H|R). Since the calculations depend on the story problem, it is necessary to answer the following question: Were the samples A, B, and C selected from the population in a manner that would make the bucket analysis appropriate, or in a manner that would make the batter analysis appropriate? In the words of the original apple paradox, how were the proportions of L determined for the randomly selected trees A, B, and C? Were those proportions (1) fixed by nature [so they are representative of the overall proportion of Ls] or (2) manipulated by the researcher [so they are not necessarily representative of the overall proportion of Ls]? The bucket analysis corresponds to scenario (1), and the batter analysis corresponds to scenario (2). Adapting Table 5
Table 7—Apple Data for 10 Trees, by Lost and Remaining

Tree | Lost | Remaining | Overall P(H)
Tree 1 | 3/30 = 0.10 | 3/120 = 0.025 | 0.10p + 0.025(1-p)
Tree 2 | 6/40 = 0.15 | 4/160 = 0.025 | 0.15p + 0.025(1-p)
Tree 3 | 3/50 = 0.06 | 2/200 = 0.010 | 0.06p + 0.010(1-p)
Tree 4 | 5/50 = 0.10 | 4/400 = 0.010 | 0.10p + 0.010(1-p)
Tree 5 | 3/20 = 0.15 | 2/80 = 0.025 | 0.15p + 0.025(1-p)
Tree 6 | 3/30 = 0.10 | 3/120 = 0.025 | 0.10p + 0.025(1-p)
Tree 7 | 2/20 = 0.10 | 23/230 = 0.100 | 0.10p + 0.100(1-p)
Tree 8 | 2/40 = 0.05 | 4/160 = 0.025 | 0.05p + 0.025(1-p)
Tree 9 | 2/100 = 0.02 | 13/50 = 0.260 | 0.02p + 0.260(1-p)
Tree 10 | 1/20 = 0.05 | 2/80 = 0.025 | 0.05p + 0.025(1-p)
Unweighted P(H) | 0.088 | 0.053 | 0.053 + 0.035p
Table 8—Summarizing the Point Estimates for the Population Parameters

 | P(H|L) | P(H|R) | P(H)
Scenario 1 (All of the Ls & Rs) | 0.073 | 0.041 | 0.048
Scenario 2 (Sample of Ls & Rs) | 0.088 | 0.053 | 0.061
from probability to statistics, the best point estimates for the desired parameters are given by Table 6. The estimates from Scenario 1 apply if the data were gathered using all the Ls and all the Rs from every sampled tree. In that case, the proportions of Ls from the sampled trees are fixed by nature and are a random sample from the distribution of all Ls in the population. The estimates from Scenario 2 apply if the data were gathered using only a portion of the Ls and Rs from each tree. In that case, the proportions of Ls from the sampled trees were manipulated by the researcher, are not a random sample from the distribution of all Ls in the population, and are not necessarily representative of the overall proportion of Ls. If Scenario 1 applies, the problem is solved. If Scenario 2 applies, the matter of determining a value for p must be considered. If p is taken as an indicator variable equal to one for apples lost to the ground and zero for apples remaining on the tree, there is no additional meaningful estimate for the value of the overall P(H). If p is taken to be the overall proportion of apples lost to the ground, such an estimate is possible. In general, one can either (1) estimate p using the data at hand or (2) estimate p from past experience and/or other information independent of the data at hand. For the A, B, C data at hand, a reasonable estimate for p is the unweighted mean (0.20 + 0.50 + 0.30)/3 = 0.333. This is consistent with the notion that the tree is the experimental
unit and that each tree gives an equally valid point estimate for any parameter of interest. Even though such an approach is reasonable for the tree data, it would not be so in the batter scenario. In the batter analysis, p cannot be estimated from the data at hand because the proportion of times each batter faced a left-hander was determined by deliberate managerial strategy and is not necessarily indicative of the true probability that the opposing manager will insert a left-handed pitcher for the moment in question. In the batter analysis, p could reasonably be the proportion of the opposing team's available pitchers that are left-handers, which is independent of the historical proportions for each batter.
Point Estimates for the Applied Problem

Returning to the original sample data and story problem recorded in Table 3, the fact that the numbers of lost and remaining apples are all multiples of 10 suggests the data from each tree come from samples of lost and remaining fruit (Table 7), not from all the fruit produced by the tree. Such is the typical practice. The apple data are similar to the batter data in that the natural overall P(H) obtained by pooling the L and R data is artificial and not necessarily indicative of true overall performance—because it is determined by the proportion of L occurrences imposed by the manager or researcher. If it may be assumed that the reported numbers of lost and remaining
fruit sampled from each tree are roughly proportional to the true numbers of lost and remaining fruit produced by each tree, the data may be used to construct an estimate for p using (0.20 + 0.20 + 0.20 + 0.11 + 0.20 + 0.20 + 0.08 + 0.20 + 0.67 + 0.20)/10 = 2.26/10 = 0.226. If not, an independent “from experience” estimate could be used. Using the estimate p = 0.226, and considering the entire story problem associated with Table 3, the best estimates for the desired parameters are the following: P(H|L) = 0.088 [the unweighted mean] P(H|R) = 0.053 [the unweighted mean] P(H) = 0.053 + 0.035(0.226) = 0.061 As expected, it is true that P(H) = 0.061 is between P(H|R) = 0.053 and P(H|L) = 0.088. The appropriate conclusion for the apple data in Table 3 is that the estimates that P(H|L) = 0.088 and that P(H|R) = 0.053 are correct, but that the original estimate that the overall P(H) = 0.048 is not correct. Even though the method of unweighted means was appropriate for the estimates of the conditional probabilities, it was incorrect to apply that same methodology to estimate the overall P(H). This eliminates the apparent paradox for the problem at hand and provides a method for obtaining the desired estimates in situations following the data collection procedure outlined. Unfortunately, applied statistics often involve design factors that complicate identification of the appropriate analysis.
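The Scenario 2 point estimates just given can be reproduced from the Table 7 counts; the Python sketch below estimates p as the unweighted mean of the per-tree lost fractions and then combines the two conditional estimates.

```python
# (defective, sampled) counts for lost and remaining fruit on each tree, from Table 7.
lost      = [(3, 30), (6, 40), (3, 50), (5, 50), (3, 20), (3, 30), (2, 20), (2, 40), (2, 100), (1, 20)]
remaining = [(3, 120), (4, 160), (2, 200), (4, 400), (2, 80), (3, 120), (23, 230), (4, 160), (13, 50), (2, 80)]

p_h_lost = sum(h / n for h, n in lost) / 10        # 0.088, the unweighted mean
p_h_rem  = sum(h / n for h, n in remaining) / 10   # 0.053
# Estimate p from the sampled lost fractions (0.20, 0.20, ..., 0.67, 0.20 as in the text).
p   = sum(l[1] / (l[1] + r[1]) for l, r in zip(lost, remaining)) / 10
p_h = p * p_h_lost + (1 - p) * p_h_rem

print(round(p, 3), round(p_h, 3))   # roughly 0.226 and 0.061
```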
Further Considerations

What if the data in Table 3 had been gathered using all the Ls and all the Rs from every sampled tree Ti? In that case, Scenario 1 applies and, following the pattern in the bucket analysis, the estimates are as follows:

P(H) = ∑P(Ti)P(H|Ti) = (1/10)(6/150) + (1/10)(10/200) + … + (1/10)(3/100) = 0.048 [the unweighted mean]

P(H|L) = ∑P(Ti|L)P(H|L and Ti) = (0.20/2.26)(3/30) + (0.20/2.26)(6/40) + … + (0.67/2.26)(2/100) + (0.20/2.26)(1/20) = 0.073

P(H|R) = ∑P(Ti|R)P(H|R and Ti) = (0.80/7.74)(3/120) + (0.80/7.74)(4/160) + … + (0.33/7.74)(13/50) + (0.80/7.74)(2/80) = 0.041

As seen in Table 8, the same data yield differing point estimates depending on whether all the Ls and all the Rs were gathered from each tree or a sample of the Ls and Rs was selected for analysis.
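The Scenario 1 (whole-tree) estimates can likewise be checked with a short Python sketch; the counts below are retyped from Table 7, and each tree's conditional rate is weighted by that tree's share of the lost (or remaining) fruit.

```python
lost      = [(3, 30), (6, 40), (3, 50), (5, 50), (3, 20), (3, 30), (2, 20), (2, 40), (2, 100), (1, 20)]
remaining = [(3, 120), (4, 160), (2, 200), (4, 400), (2, 80), (3, 120), (23, 230), (4, 160), (13, 50), (2, 80)]

frac_lost = [l[1] / (l[1] + r[1]) for l, r in zip(lost, remaining)]   # share of each tree's fruit that is lost
frac_rem  = [1 - f for f in frac_lost]

# Condition on L (or R) by weighting each tree's conditional rate by its lost (remaining) share.
p_h_given_l = sum(f * h / n for f, (h, n) in zip(frac_lost, lost)) / sum(frac_lost)
p_h_given_r = sum(f * h / n for f, (h, n) in zip(frac_rem, remaining)) / sum(frac_rem)
# Overall P(H) is the unweighted mean of the per-tree overall rates.
p_h = sum((l[0] + r[0]) / (l[1] + r[1]) for l, r in zip(lost, remaining)) / 10

print(round(p_h_given_l, 3), round(p_h_given_r, 3), round(p_h, 3))   # about 0.073, 0.041, 0.048
```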
In practice, however, analysis of the data is complicated by the actual “design” often being a combination of the two scenarios (i.e., it is common for researchers to gather all the Ls and Rs from trees producing small numbers of apples (Scenario 1) and to gather a sample of the Ls and Rs from trees producing large numbers of apples (Scenario 2)). In such cases, how does the original approach (i.e., using the unweighted estimate for both the overall and conditional probabilities) perform as a compromise solution to a complicated problem? Among the arguments that could be made in favor of such a compromise solution are the following: It is a compromise between the two correct analyses, just as the actual “design” employed is often a combination of the two scenarios. It results in a single analysis for both scenarios, or any combination thereof, an analysis that is easily handled by statistical software to compute confidence intervals and to compare different treatments. Paradoxical results similar to those discussed appear to occur infrequently, and only when there are a few trees whose data are dramatically different from the data of the other trees.

To determine the degree to which this compromise analysis may be an appropriate approximate technique for analyzing data similar to those discussed, we performed simulations by generating 1,000 data sets for 20 trees. In each case, a whole-tree analysis was made using all the apples from every tree, and then a second partial-tree analysis was made using a sample from each tree. The results using the compromise analysis were then compared to the results using the correct analyses. To approximate conditions observed in the field, the simulations were designed as follows (a partial sketch in code follows the list):

(1) The number of apples for each tree was generated by randomly selecting integers from 101 to 1,100.

(2) Independent pairs of values P(L) and P(H) were generated for each tree, selecting P(L) from a beta distribution with mean 0.200 and standard deviation 0.03636 and P(H) from a beta distribution with mean 0.048 and standard deviation 0.00955. To approximate what occurs in practice, a correlation of 0.8 between P(L) and P(H) was used.

(3) To completely determine all required probabilities, a ratio P(H|L)/P(H|R) was selected for each tree from a Fisher's F distribution with mean two and standard deviation 0.5930. Together with the P(L) and P(H) generated in (2), this ratio fixes all the necessary parameters.

(4) For partial-tree simulations, a random process was employed to generate for each tree L and R numbers that were multiples of 10, which ranged from (a) close to the original whole-tree L and R numbers for trees whose total number of apples was near the lower limit of 101 to (b) approximately ¼ of the original L and R values for trees whose total number of apples was near the upper limit of 1,100.
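The following Python fragment is only a partial, hedged illustration of items (1) and (2): it converts a mean and standard deviation to Beta shape parameters and draws per-tree values of P(L) and P(H), but it omits the 0.8 correlation and the P(H|L)/P(H|R) ratio step of item (3), which the authors' simulation includes.

```python
import random

def beta_params(mean, sd):
    """Convert a mean and standard deviation to Beta(a, b) shape parameters."""
    nu = mean * (1 - mean) / sd**2 - 1    # a + b
    return mean * nu, (1 - mean) * nu

rng = random.Random(0)
a_l, b_l = beta_params(0.200, 0.03636)    # per-tree P(L)
a_h, b_h = beta_params(0.048, 0.00955)    # per-tree P(H)

trees = []
for _ in range(20):
    n_apples = rng.randint(101, 1100)     # item (1)
    p_l = rng.betavariate(a_l, b_l)       # item (2); correlation with P(H) not induced here
    p_h = rng.betavariate(a_h, b_h)
    trees.append((n_apples, p_l, p_h))
print(trees[0])
```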
Table 9—Simulation Results for 20 Trees

Parameter | Result | Whole-Tree Correct | Whole-Tree Compromise | Partial-Tree Correct | Partial-Tree Compromise
P(H|L) | Estimator | Weighted | Unweighted | Unweighted | Unweighted
 | Mean | 0.08022 | 0.07838 | 0.07748 | 0.07748
 | St Dev | 0.00762 | 0.00764 | 0.01091 | 0.01091
 | CI | 90.0% | 90.9% | 88.4% | 88.4%
P(H|R) | Estimator | Weighted | Unweighted | Unweighted | Unweighted
 | Mean | 0.04015 | 0.04038 | 0.04033 | 0.04033
 | St Dev | 0.00303 | 0.00309 | 0.00423 | 0.00423
 | CI | 89.7% | 89.0% | 89.9% | 89.9%
P(H) | Estimator | Unweighted | Unweighted | Weighted | Unweighted
 | Mean | 0.04821 | 0.04821 | 0.04870 | 0.04906
 | St Dev | 0.00317 | 0.00317 | 0.00457 | 0.00435
 | CI | 90.0% | 90.0% | 88.5% | 90.9%
Paradox | | 0.00% | 0.00% | 0.00% | 0.40%
Table 10—Re-Titled Detailed Batter Data

Batter | Against Pitchers Born January to June | Against Pitchers Born July to December | Overall
A | 24/80 = 0.300 | 2/20 = 0.100 | 26/100 = 0.260
B | 12/50 = 0.240 | 24/50 = 0.480 | 36/100 = 0.360

Table 11—Regrouped Detailed Batter Data

Batter | Against Left-Handers | Against Right-Handers | Overall
A | 24/96 = 0.250 | 2/4 = 0.500 | 26/100 = 0.260
B | 12/50 = 0.240 | 24/50 = 0.480 | 36/100 = 0.360
The simulation results are given in Table 9. The parameter values used for the simulation were P(H|L) = 0.08, P(H|R) = 0.04, and P(H) = 0.048. “CI” indicates the percent of the 90% confidence intervals that captured the parameter values used for the simulation. “Paradox” indicates the percentage of times the analysis yielded paradoxical results. Table 9 gives rise to the following observations:

(1) In every case, there is more variability in the point estimate for the “partial-tree” data. This is to be expected.

(2) The point estimates using the compromise analysis appear to be very close to the true parameter values.

(3) The confidence levels using the compromise method appear to be very close to the desired 90% for both the “whole-tree” data and the “partial-tree” data.

(4) Using the correct analysis, the paradox never occurred. Since it is not mathematically possible for the paradox to occur with the correct analysis, this is a check for the simulation.

(5) Using the compromise analysis, the paradox never occurred for the “whole-tree” data.

(6) Using the compromise analysis, the paradox occurred four times out of the 1,000 simulations for the “partial-tree” data. When the simulation was run using more than 20 trees per treatment, as is often the case in practice, the paradox was even less likely to occur.

The conclusion of the matter is that, except for the unlikely possibility that the paradox occurs, the so-called traditional compromise analysis often used in practice is an acceptable, convenient alternative to the technically correct, more cumbersome analysis.
Concluding Remarks

In the opening batter scenario with Arnie and Barney, P(H) = 0.310 is the best available estimate if there is no further information about left-handed or right-handed pitchers. In the original apple scenario, P(H) = 0.048 is the best available estimate if there is no further information about lost or remaining fruit. In both situations, the introduction of additional relevant information requires a reassessment. Furthermore, there are two broad principles in operation that should not be overlooked.

First, combining data from two groups (whether right-handed or left-handed pitchers or lost or remaining fruit) can produce misleading, if not paradoxical, results. The classic example is the researcher investigating how the weight of college students affects their performance in the 100-yard dash. For 100 students selected at random, there was a significant negative correlation between weight and time (i.e., the heavier students had the fastest times). But when the males
and females were considered separately, there was a significant positive correlation between weight and time for each gender (i.e., the heavier students, as expected, had the slower times). Because males tend to be heavier than females, and because males tend to be faster than females, putting the two genders together to investigate the relationship between weight and performance is not appropriate. The same can be said for the batters and the apples. Second, it is additional relevant information that requires a reassessment. In the opening batter scenario with Arnie and Barney, for example, suppose the detailed batter data gave, as shown in Table 10, the birth date rather than the throwing hand of the opposing pitcher. It may be assumed that no manager would examine this data and check the birth date of the opposing pitcher before choosing between Arnie and Barney. The statistician must always work with scientists in the field to identify both relevant factors that should be considered and irrelevant anomalies that should be ignored. Finally, note that this new paradox wherein the overall unconditional probability appears to be greater (or less) than the overall conditional probabilities across the groups is distinct from the phenomenon known as Simpson’s Paradox. In the opening batter scenario with Arnie and Barney, moving 16 of Arnie’s hitless plate appearances from the right-handed column to the left-handed, as shown in Table 11, produces Simpson’s Paradox: Arnie has a higher batting average against both left-handers and right-handers, but Barney has the higher overall average. While both the new paradox and Simpson’s Paradox result from combining data from two groups, they are conceptually different and no set of data can exhibit both paradoxes simultaneously. In the above baseball problem, for example, Simpson’s Paradox requires that P(H|L) < P(H|R) for each batter, whereas the new paradox requires that P(H|L) > P(H|R) for one batter and P(H|L) < P(H|R) for the other.
Further Reading

Appleton, David R., Joyce M. French, and Mark P. J. Vanderpump. 1996. Ignoring a covariate: An example of Simpson's paradox. The American Statistician 50:340–341.
Morrell, Christopher H. 1999. Simpson's paradox: An example from a longitudinal study in South Africa. Journal of Statistics Education 7(3). www.amstat.org/publications/jse/secure/v7n3/datasets.morrell.cfm.
Wainer, Howard. 2002. The BK-plot: Making Simpson's paradox clear to the masses. CHANCE 15(3):60–62.
Wainer, Howard, and Lisa M. Brown. 2004. Two statistical paradoxes in the interpretation of group differences: Illustrated with medical school admission and licensing data. The American Statistician 58:117–123.
Westbrooke, Ian. 1998. Simpson's paradox: An example in a New Zealand survey of jury composition. CHANCE 11(2):40–42.
The Two-Child Paradox Reborn? Stephen Marks and Gary Smith
For at least half a century, probability devotees have puzzled over a problem known as the two-child paradox. Suppose we know that a family has two children, and we learn that one of them is a girl. What is the probability that there are two girls? Some have argued that the probability is one-half, others that it is one-third. This problem has reappeared in various guises over the years. Does it matter, for example, if we learn not only the gender of one of the children, but whether the child is the older or younger sibling? A new variant of the two-child problem was introduced recently by Leonard Mlodinow in The Drunkard's Walk: How Randomness Rules Our Lives. Mlodinow argues that the two-child probabilities depend on whether the child we learn about has an unusual name: The probability of two girls is one-third if we learn that one of the children is a girl, but it is one-half if we learn that her name is Florida.
This claim is paradoxical in that it does not seem that knowing a child’s name should have any effect on the probability that her sibling is male or female. We will first analyze the classic two-child problem, in which it is known only that one of the children is a girl. We will then consider the case in which it is also known that the girl is named Florida. For both of these scenarios, we will work through two approaches: a sample space approach and Bayesian analysis.
The Classic Two-Child Problem

We employ the traditional assumptions that boys and girls are equally likely and that the sexes of the children in a family are independent. To be clear about which child is which, we let BG indicate that, in a two-child family, the older child is a boy and the younger a girl, and similarly define GB, GG, and BB. Under the traditional assumptions, the probability of each of these outcomes is equal to one-fourth.
The Sample Space Approach

The conclusion that the probability of two girls is one-third follows from a restricted sample space analysis. It is based on the simple inference that, if one of the children is a girl, then there cannot be two boys. Because the three other possibilities—GB, BG, and GG—are presumed to be equally likely, the probability of GG must be one-third. This analysis is correct if we are solving a different, hypothetical problem in which we draw a family at random from all families that are GG, BG, or GB. In this case, there is exactly a one-third probability of choosing a GG family. This conclusion is not paradoxical. However, the situation is different if we are looking at a family drawn at random from the population of all two-child families, and the difference is best revealed by Bayesian analysis.
Table 1—Bayesian Analysis: Mother Mentions Either Child Equally Often

 | BB | BG | GB | GG | Total
Mentions Boy | 100 | 50 | 50 | 0 | 200
Mentions Girl | 0 | 50 | 50 | 100 | 200
Total | 100 | 100 | 100 | 100 | 400

Table 2—Bayesian Analysis: Boys Are Mentioned Only If There Are No Girls

 | BB | BG | GB | GG | Total
Mentions Boy | 100 | 0 | 0 | 0 | 100
Mentions Girl | 0 | 100 | 100 | 100 | 300
Total | 100 | 100 | 100 | 100 | 400

Table 3—Bayesian Analysis: Mother Mentions Child Independent of Gender and Name

 | BB | BG | GB | GG | Total
Mentions Girl Named Florida | 0 | α50 | α50 | α100 | α200
Mentions Girl Not Named Florida | 0 | (1-α)50 | (1-α)50 | (1-α)100 | (1-α)200
Mentions Boy | 100 | 50 | 50 | 0 | 200
Total | 100 | 100 | 100 | 100 | 400
The Bayesian Approach

The Bayesian approach begins with “prior probabilities” of one-fourth for each of the four possibilities: P(BG) = P(GB) = P(GG) = P(BB) = 1/4. It then uses Bayes' Theorem to calculate “posterior probabilities” that are conditional on the new information about the family. One attractive feature of the Bayesian approach is that it encourages us to think about how the information was obtained. Suppose we learn the sex of one of the children because the mother mentions him or her. In this case, the mother mentions a girl, Gm, and we wish to find the conditional probability that the family is GG, given this information, P(GG | Gm). Applying Bayes' Theorem, which uses the multiplication and addition rules of probability, we get

P(GG | Gm) = P(GG and Gm) / P(Gm)
           = P(GG and Gm) / [P(GG and Gm) + P(BG and Gm) + P(GB and Gm) + P(BB and Gm)]
           = P(Gm | GG)·P(GG) / [P(Gm | GG)·P(GG) + P(Gm | BG)·P(BG) + P(Gm | GB)·P(GB) + P(Gm | BB)·P(BB)].

If the mother has two daughters, she can only mention a girl: P(Gm | GG) = 1. If she has two sons, she can only mention a boy: P(Gm | BB) = 0. If she has a son and a daughter, a plausible assumption is that she is equally likely to mention either child: P(Gm | BG) = P(Gm | GB) = 1/2. Plugging these conditional probabilities, and the prior probabilities, into the formula, we get P(GG | Gm) = 1/2.

The calculations can be done succinctly using a contingency table. For the case just examined, Table 1 shows the expected number of families for various contingencies, based on a random sample of 400 families with two children. The entries in the interior of the table equal joint probabilities like P(GG and Gm) multiplied by 400, while the totals equal marginal probabilities like P(Gm) (for one of the rows) or P(GG) (for one of the columns) multiplied by 400. For example, we expect there to be 100 BG families, and that the mother would mention the boy in half of these and the girl in the other half. Notice that the outcomes BG, GB, and GG are not equally likely once the mother mentions the gender of one of the children, contrary to the assumption of the sample space approach. For example, the joint probability P(GG and Gm) equals P(Gm | GG) P(GG) = (1)(1/4) = 1/4. This implies an expected number of families equal to 100. It is twice the joint probability of the family being BG and the girl being mentioned, for which the expected number of families is 50. However, no matter whether the mother mentions a boy or a girl, her other child is equally likely to be a boy or a girl. For example, if she mentions a girl, we can use the entries from the relevant row of the table to see that the probability that the other child is a boy is given by (50 + 50)/200 = 1/2, while the probability that the other child is a girl is given by 100/200 = 1/2.

One-third would be the correct answer only under the assumption that the mother never mentions a son if she has a daughter: P(Gm | BG) = P(Gm | GB) = 1. Based on the expected numbers of families shown in Table 2, which applies this assumption, the probability of a two-girl family equals one-third if the mother mentions a girl and the probability of a two-boy family equals one if she mentions a boy. This extreme assumption is never included in the presentation of the two-child problem, however, and is surely not what people have in mind when they present it.

A Girl Named Florida

Mlodinow examines the two-child probabilities in the case in which it is learned one of the children is a girl named Florida, evidently an unusual name. He begins with this statement of the classic two-child problem:

Suppose a mother is carrying fraternal twins and wants to know the odds of having two girls, a boy and a girl, and so on … In the two-daughter problem, an additional question is usually asked: What are the chances, given that one of the children is a girl, that both children will be girls?
His answer to the first question is the usual equal probabilities for BB, BG, GB, and GG. His answer to the second question is one-third, based on the sample-space analysis of the classic two-child problem. As shown above, this is fine if Mlodinow does not mean that one of the two children is observed or mentioned to be a girl, but rather is answering a hypothetical question about the chances of there being two girls in families that do not have two boys. Mlodinow later writes the following:

[I]f you learn that one of the children is a girl named Florida … the answer is not 1 in 3—as it was in the two-daughter problem—but 1 in 2. The added information—your knowledge of the girl's name—makes a difference.

Mlodinow is not talking about hypothetical families that do not have two boys. The fraternal twins have been born, and we learn that one of the children is a girl named Florida. Mlodinow supposes that one in a million girls are named Florida. We will generalize this slightly and suppose that a fraction α of girls is named Florida. Like Mlodinow, we suppose children are named independently, so that, in principle, a family could have two girls named Florida.
Bayesian Analysis

Table 3 shows the expected numbers of families for the Bayesian analysis, under the assumption that we learn about a randomly selected child from a randomly selected family. Thus, like Table 1, it assumes a mother is equally likely to mention a boy or girl if there is one of each. Similarly, the conditional probability that the mother mentions a girl named Florida, if she mentions a girl, equals the fraction of girls named Florida, α. Thus, for the 100 cases in which a BG family is selected, it is expected that a boy will be mentioned in 50 cases and a girl in 50 cases, with α50 of the mentioned girls named Florida and (1 – α)50 not named Florida. For the GG column, no matter whether the mother mentions the older daughter or the younger daughter, the probability is α that the girl's name is Florida and 1 – α that it is not Florida.
Table 4—Exact Expected Numbers of Families for Mlodinow's Question

 | BG | GB | GG | Total
Families With a Girl Named Florida | α100 | α100 | α(2-α)100 | α(4-α)100
Families With No Girl Named Florida | (1-α)100 | (1-α)100 | (1-α)²100 | (3-α)(1-α)100
Total | 100 | 100 | 100 | 300
If the mother mentions a girl named Florida, Fm, the conditional probability that she comes from a two-girl family is one-half, regardless of the value of α:

P[GG | Fm] = α100 / α200 = 1/2.

Similarly, if the mother mentions a girl not named Florida, Nm, the probability that she comes from a GG family is also one-half, regardless of the value of α:

P[GG | Nm] = (1 − α)100 / [(1 − α)200] = 1/2.
Thus, if the mother mentions a girl, the probability of two girls is one-half, no matter what her name is. The answer is the same as in our original Bayesian analysis: There is no paradox.
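The contingency-table reasoning in Tables 1 and 3 can be checked with a small enumeration; the Python sketch below computes P(GG | a girl is mentioned) and P(GG | a girl named Florida is mentioned) under the equal-mention assumption, with α treated as a free parameter.

```python
def p_gg_given_girl_mentioned(alpha=None):
    """P(two girls | mother mentions a girl, named Florida if alpha is given).

    Family types BB, BG, GB, GG are equally likely; the mother mentions a
    randomly chosen child; each girl is named Florida with probability alpha.
    """
    families = ["BB", "BG", "GB", "GG"]
    num = den = 0.0
    for fam in families:
        p_fam = 0.25
        for child in fam:                   # the mentioned child, chosen with probability 1/2
            if child != "G":
                continue
            p_mention = p_fam * 0.5
            if alpha is not None:
                p_mention *= alpha          # ...and she happens to be named Florida
            den += p_mention
            if fam == "GG":
                num += p_mention
    return num / den

print(p_gg_given_girl_mentioned())        # 0.5, the classic problem under the mention model
print(p_gg_given_girl_mentioned(1e-6))    # still 0.5: the name does not change the answer
```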
Mlodinow's Approach

Mlodinow does not answer the question he poses about the probability that a given family will have two girls, if we learn it has a girl named Florida. Instead, he gives an approximate answer to a different, hypothetical question: Among all BG, GB, and GG families that have a daughter named Florida, what proportion are GG? His approximation is one-half. This analysis parallels his take on the classic two-child problem. Mlodinow purports to use a Bayesian approach, which he says “is to use new information to prune the sample space.” However, a Bayesian analysis must account for the conditional probabilities that we would observe or learn about a girl, named Florida or otherwise, for different family types, and this Mlodinow does not do. Instead, he does a restricted sample space analysis much like the conventional analysis of the two-child problem. Despite these issues, his analysis is instructive.

Mlodinow assumes 100 million families, but we will assume 400 to be comparable with our earlier analyses. Let GF indicate that a girl is named Florida and GN that she is not. We will denote birth order as above. With his assumption that α is tiny—one in a million—Mlodinow infers that the expected number of GFGF families (equal to α²100) is, for practical purposes, zero. He also infers, since girls not named Florida are almost as numerous as boys, that the expected number of BGF families (equal to α100) is, for practical purposes, equal to that of GNGF families (equal to α(1 – α)100). The commonality in these assumptions is that second-order terms in α² are so small they can be ignored. Since BGF and GFB families are expected in equal numbers, as are GNGF and GFGN families, Mlodinow then concludes that BGF, GFB, GNGF, and GFGN families are all equally likely. With GFGF families ruled out, the proportion of GG families in the BG, GB, and GG families with a girl named Florida is then equal to one-half.

To provide a further rationale for his analysis, Mlodinow offers a second set of calculations. Table 4 shows the relevant exact calculations of expected numbers of families in which there is at least one girl. We can rule out BB, and thus start with the 300 BG, GB, and GG families. Of the 100 BG and 100 GB families, a fraction α has a girl named Florida. Of the 100 GG families, a fraction α of the older girls is named Florida and, of the (1 – α)100 families in which the older girl is not named Florida, a fraction α of the
younger girls is named Florida, giving a total of α100 + (1 – α)α100 = α(2 – α)100 GG families with girls named Florida. The second row of the table can be most easily obtained by subtracting the entries in the first row from the totals for each column.

To translate his presentation to our framework, Mlodinow would observe that we expect 2α100 families with a girl named Florida to be BG or GB, as shown in Table 4. He would then note that, of the 100 expected GG families, α100 will have an older child named Florida and α100 will have the younger child named Florida, yielding 2α100 GG families with a girl named Florida. The approximation α² = 0 is important in this case as well, as is clear from the GG entry in the first row of Table 4, because otherwise there would be double counting of families in which both girls are named Florida. Mlodinow would then conclude that the fraction of GG families in BG, GB, and GG families with a girl named Florida is approximately equal to 2α100/(2α100 + 2α100) = 1/2.

It turns out that this last set of calculations approximates an exact calculation in which girls are counted, rather than families. In BG and GB families combined, there are expected to be exactly 2α100 girls named Florida, and the same number in GG families. There are half as many GG families as there are BG plus GB families combined, but twice as many girls per family and thus twice as many girls named Florida, under the independence assumption. Thus, in the second variant of his analysis, Mlodinow is in effect answering yet another, but still hypothetical, question: What fraction of girls named Florida from two-child families is from two-girl families? However, the exact analysis of this case applies for a girl of any name, and indeed for girls in general. If a girl is randomly selected from all BG, GB, and GG families, the probability that she came from a GG family is exactly one-half. In particular, the fraction of girls named Florida from two-girl families is independent of the value of α: 2α100/(2α100 + 2α100) = 1/2. Thus, it is irrelevant in this variant of the analysis that the girl has an unusual name.

Finally, we can generalize Mlodinow's analysis on its own terms, using the exact formulas in Table 4. For families with a girl named Florida, the fraction of BG, GB, and GG families that is GG equals (2 – α)/(4 – α). Mlodinow, in effect, finds the limit of this formula as α approaches zero, obtaining his answer of one-half. The other extreme is for α to approach one. All girls are named Florida, so that being a girl named Florida is equivalent to being a girl. In this case, the formula evaluates to one-third, which is the fraction of all BG, GB, and GG families that are GG. In general, if we consider families with a girl named Florida, as the name becomes increasingly rare, the number of families that is GG goes down more slowly than the combined number of those that are BG, GB, or GG, and so the fraction of BG, GB, or GG families that are GG goes up.
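The generalized fraction (2 – α)/(4 – α) derived from Table 4 can be evaluated directly; the brief Python check below confirms the two limiting cases discussed above.

```python
def frac_gg_with_florida(alpha):
    """Fraction of BG, GB, and GG families with a girl named Florida that are GG (from Table 4)."""
    return (2 - alpha) / (4 - alpha)

for alpha in (1e-6, 0.1, 0.5, 1.0):
    print(alpha, round(frac_gg_with_florida(alpha), 4))
# Approaches 1/2 as the name becomes rare and equals 1/3 when every girl is named Florida.
```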
Discussion

A general question is how best to accommodate new information into the evaluation of uncertain situations. Use of the restricted sample space approach for the two-child problem does not yield a proper conditional probability that a family
has, say, two girls, given that one has learned that one of the children is a girl. All it offers, in this case, is a hypothetical calculation of the fraction of BG, GB, and GG families that are GG. In the classic two-child problem, it also offers an erroneous illusion of simplicity—that, in general, a two-child family is equally likely to be BG, GB, or GG if we learn one of the children is a girl. In contrast, the Bayesian approach provides useful conditional probabilities that can be applied directly to a family at hand as we acquire new information about it. It also provides discipline in that it requires us to be clear about the full set of assumptions that enter into our probabilistic inferences. In this analysis, we think we have made plausible assumptions about the probability that a mother would mention a girl versus a boy, depending on whether the family is BB, BG, GB, or GG, but the Bayesian approach offers the flexibility to accommodate alternative sets of assumptions in any case.
Further Reading

Bar-Hillel, M., and R. Falk. 1982. Some teasers concerning conditional probabilities. Cognition 11:109–122.
Freund, J. E. 1965. Puzzle or paradox? The American Statistician 19:29–44.
Gardner, M. 1959. The Scientific American book of mathematical puzzles and diversions. New York: Simon and Schuster.
Gardner, M. 1961. The 2nd Scientific American book of mathematical puzzles and diversions. New York: Simon and Schuster.
Mlodinow, L. 2008. The drunkard's walk: How randomness rules our lives. New York: Pantheon Books.
Nickerson, R. S. 1996. Ambiguities and unstated assumptions in probabilistic reasoning. Psychological Bulletin 120:410–433.
Goodness of Wit Test
Jonathan Berkowitz, Column Editor
Goodness of Wit Test #11: Definite Article Rejection!
In previous columns, I have mentioned the National Puzzlers' League (NPL) (www.puzzlers.org), an organization founded in 1883 for lovers of word puzzles and dedicated to raising the standard of puzzling to a higher intellectual level. Cryptic crossword puzzles, the focus of my CHANCE puzzles, are just one type of word puzzle created and solved by NPL members. In this and future issues, I will introduce you to the dizzying array of other puzzle types. Forms are puzzles similar to crosswords. From given clues (straight, not cryptic), the solver fits words into a pattern. Forms differ from crosswords in that form patterns are geometric shapes—squares, pyramids, diamonds, hourglasses, etc.—and have no black squares. Here is a simple example, a square that has the same words across and down. There is an additional element of symmetry, but I'll leave that for you to discover. The answer is at the end of the cryptic puzzle solution in this issue.
Winner from Goodness of Wit Test #9: In One Ear and Out the Other Sandra Taylor is a biostatistician with the Clinical and Translational Science Center in the School of Medicine at the University of California, Davis. She spends much of her free time training and competing with her dogs in dog agility, running, and cycling.
1 Chance subject
2 Principle
3 Pineapple
4 Doctrine
5 Data

In Goodness of Wit Test #11, as in many previous puzzles, some alteration of the themed answer words is required for entry into the grid. I'm sure you will be able to discover the gimmick! A one-year (extension of your) subscription to CHANCE will be awarded for each of two correct solutions chosen at random from among those received by me by March 5, 2011. As an added incentive, a picture and short biography of each winner will be published in a subsequent issue. Mail your completed diagram to Jonathan Berkowitz, CHANCE Goodness of Wit Test Column Editor, 4160 Staulo Crescent, Vancouver BC Canada V6N 3S2, or email a list of the answers to
[email protected]. Please note that winners to the puzzle contest in any of the three previous issues will not be eligible to win this issue’s contest.
Reminder: A guide to solving cryptic clues appeared in CHANCE 21(3). The use of solving aids—electronic dictionaries, the Internet, etc.—is encouraged.
Solution to Goodness of Wit Test #9: In One Ear and Out the Other
This puzzle appeared in CHANCE, Vol. 23, No. 3. Nine “Down” answer words are homophones of names of famous statisticians. Enter their names, not the clue answer words, in the grid. Theme words are: BOX, RAO, YULE, WALD, HALD, LEHMANN, FISHER, BAYES, and COX.
A N N E X F F I S H E R
M A E S T R O S T O L E
B O X P B E N N E T T E
U N T I E E D G E C O X
L E G A L H A L D A N P
A Y U L E S N E A K E R
T O P K N O T A G E D E
O C R W A L D P R O F S
R A O H O U R B A Y E S
Y M A I L B O X M U R I
V E R S S L E H M A N N
J O S T L E S R E N I G
Across: 1 AMBULATORY [anagram: to buy alarm] 10 A-ONE [deletion: Saone–S] 12 CAMEO [rebus: came+o] 14 NEXT [container: n(ex)t] 15 UPROARS [container + rebus: UP(r+oar)S] 16 ESPIAL [rebus + reversal: ESP+lai–d] 17 WHIST [deletion: whilst–l] 20 LENA [hidden word: sto(le na)tion’s] 21 FREE [container: f(r)ee] 24 SOLUBLE [anagram + rebus: l+blouse] 26 FONDANT [anagram + rebus: d+nonfat] 27 ROES [initial letters: r+o+e+s] 29 LEAP [rebus: plea/leap] 34 STEED [container: S(tee)D] 35 GRAMME [hidden word: pro(gram me)eting] 36 HOTCAKE [anagram: choke+at] 38 YUAN [rebus: y+U(a)N] 40 ELTON [anagram: lento] 41 FERN [rebus: fe(r)n] 42 EXPRESSING [rebus + deletion: expires–i+sing]
Down: 1 ANNEX [homophone: an+X] 2 MAESTRO [anagram: some art] 3 (BOX) BALKS [charade: b+AL+Ks] 4 UNTIE [deletion: auntie–a] 5 LEGAL [charade: le+gal] 6 TOPKNOT [rebus: to+pk+not; {do = hairdo}] 7 (RAO) ROW [deletion: crowd–c–d] 8 VERSE [anagram: serve] 9 JOSTLE [rebus: J+host–h+l+e] 11 (YULE) YOU’LL [rebus: yo+u+ll] 13 MAILBOX [homophone: male+bocks] 17 (WALD) WALLED [container + reversal: wa(lle)d] 18 HOUR [container: H(o)ur] 19 BENNETT [rebus + container: ben(net)t] 22 EDGE [deletion: hedge–h] 23 (HALD) HAULED [charade: h+au+led] 24 SNEAKER [anagram + rebus: nears+ke] 25 (LEHMANN) LAYMAN [anagram + deletion: Malayan–a] 26 (FISHER) FISSURE [rebus: fi(SS)(u)re] 28 STOLE [double definition] 30 AGED [reversal + deletion: Degas–s] 31 PROFS [rebus: P(rof)S] 32 (BAYES) BAIZE [hidden word: Du(bai ze)nana] 33 RENIG [anagram: reign] 37 (COX) CAULKS [anagram + rebus: slack+u] 39 URI [rebus: Turin–t–n]
Form Answer: STATS, TENET, ANANA, TENET, STATS. This form reads the same across and down, forward and backward, top to bottom, and bottom to top. Beautiful symmetry, I’m sure you’ll agree.
Goodness of Wit Test #11: Definite Article Rejection! Nine of the clue answer words need to be altered, all in the same way, to fit into the grid. The altered entries are all standard dictionary entries. Enumerations are withheld.
[The blank grid for Goodness of Wit Test #11 appears here.]
Across
1 Disturbed one’s opposite in bed
5 Fly around airwaves encompassing a change at the earth’s surface
10 Regretted being impolite to the audience
11 In that respect container holds present
13 Limits of galaxy estimate spin
14 Cops at roadside holding back main arteries
15 Interrogate trawler smuggling fish
16 Track standard deviation and mean
18 Orestes’ revolutionary sound systems
19 Mutual advantage, say, without a little bit of work in New York
20 Herb coats country breakfast dish
26 Radio PR organized delivery by parachute
28 Train a North Dakota antelope
29 Not offered a French hen without recipe
30 Swapped shirt with radical Eastern Democrat
32 Nurtured listener in real education
33 Cluster he had specially endowed
34 Canadian politician flees realm for Ireland
35 Floors stunned oysters
36 Inspired class to get funny hat
Down
1 Confront bullies’ leader about a couple of students
2 Clown, losing head, swallows sample of unusual liqueur
3 Registered straight line increase again
4 He could order “redo it”!
5 Links joints with no error
6 Solemn after first four score
7 Less common breed raised around end of summer
8 Peculiar ring tone element
9 Slippery, as in Confederate soldier
11 Back artist taking Royal Academy over
12 Supply includes the device for insertion
15 Offered bit of discretion entering best before date
17 Most obvious puzzled panelist
20 He’s entering April baby collection with many words
21 Pope belonging to a city
22 Considerable success oddly loses heart
23 Mobile Herb Alpert skipping past foolish talker
24 Luxury auto club assistant
25 In confusion mounted horse after start
27 Doctor admits one fully conscious person who is trembling
30 Duets broadcast first of tunes on the way up
31 Period covers last prophet