About the Authors Terry Allen is an adjunct professor of sociology at the University of Utah, where he teaches graduate and undergraduate courses in social statistics, research methods, criminology, and deviant behavior and social control. He has been a research analyst with the Utah Department of Public Safety and the Utah Attorney General’s Office, and is the former director of the Utah Medicaid Fraud Statistical Unit.
Necip Doganaksoy is a principal technologist-statistician at GE Global Research and an adjunct professor at Union Graduate College's School of Management in Schenectady, New York. He has worked extensively on statistical applications in quality, reliability, and productivity improvement in business and industrial settings. He is a Fellow of the American Statistical Association and the American Society for Quality and is an Elected Member of the International Statistical Institute.
Glen Buckner has worked for 31 years as a statistical programmer in the Management Information Center of the Church of the Latter Day Saints offices. He does statistical consulting on projects requiring data analysis and statistical modeling. He majored in business statistics and minored in computer science.
Gerald J. Hahn is a retired manager of statistics at the GE Global Research Center in Schenectady, New York, where he worked for 46 years. He has a doctorate in statistics and operations research from Rensselaer Polytechnic Institute in Troy, New York. He is a Fellow of the American Statistical Association and the American Society for Quality and an Elected Member of the International Statistical Institute.
Jason Crowley is a graduate student in statistics at Iowa State University. He earned his BS in economics and statistics from Iowa State. His areas of interest are the visual display of information and statistical computing.
Michael Huber is associate professor of mathematics at Muhlenberg College in Allentown, Pennsylvania. His research interests include mathematical modeling with differential equations and studying/predicting rare events that occur in baseball.
Brenna Curley is a graduate student in statistics at Iowa State University. She earned her BA in mathematics and French at Grinnell College.
Janice Lent is a mathematical statistician at the Energy Information Administration. She holds a PhD in statistics from The George Washington University and previously worked at the Bureau of Labor Statistics and the Bureau of Transportation Statistics.
Trent McCotter is a third-year student at The University of North Carolina School of Law. He has written numerous articles on baseball streaks and frequently provides research for writers at ESPN and The New York Times. His fascination with baseball stats began more than 10 years ago, when he first played All-Star Baseball on his Nintendo64.
William Meeker is a professor of statistics and distinguished professor of liberal arts and sciences at Iowa State University. He is a Fellow of the American Statistical Association and the American Society for Quality. He has done research and consulted extensively on problems in reliability data analysis, warranty analysis, reliability test planning, accelerated testing, nondestructive evaluation, and statistical computing.
Amber Thom graduated from the University of Utah summa cum laude with a BS in both sociology and anthropology with a criminology and corrections certification. She earned an MS in forensic science and currently works for Salt Lake County Criminal Justice Services.
Dave Osthus is a graduate student in statistics at Iowa State University. He earned his BA in mathematics and religion from Luther College in Decorah, Iowa. He has been a fan of Jeopardy! since he was an undergrad.
Philip B. Stark is professor of statistics at the University of California, Berkeley. He works on inference problems and uncertainty quantification in applications ranging from astrophysics to human hearing. He served on California Secretary of State Debra Bowen’s Post Election Audit Standards Working Group and conducted the first six risk-limiting audits in collaboration with election officials in Marin, Santa Cruz, and Yolo counties.
Editor’s Letter Mike Larsen, Executive Editor
Dear Readers,
It is hard to believe three years have passed and my term as executive editor is coming to a close. It has been a lot of work, but also a great experience. Staff at the American Statistical Association (ASA), column editors, editors, advisory editors, and staff at Springer have been supportive and contributed a lot to the success of the magazine during this time. Authors also deserve sincere thanks for their efforts.

I specifically want to thank Herbie Lee and John Kimmel for their efforts related to CHANCE. Herbie is stepping down as editor to become interim vice provost for academic affairs at the University of California, Santa Cruz. John is departing from Springer as executive editor for their statistics program. Best wishes to them on their new endeavors. As you may have heard, the new executive editor for 2011–2013 is Sam Behseta of California State University, Fullerton. I look forward to the new issues and wish Sam all the best.

Let me tell you about three items before describing this issue. First, did you know you can sign up to receive email notifications about the table of contents (TOC) for CHANCE? Just visit www.springer.com/mathematics/probability/journal/144. Second, the ASA has decided to make the online version of CHANCE a member benefit for K–12 teacher members of the ASA. Earlier this year, it was made a benefit for student members. This means a critically important segment of the ASA can access CHANCE without additional charge. Thanks to the ASA for making this possible. Third, the ASA will launch a new blog called “The Statistics Forum, brought to you by the American Statistical Association and CHANCE Magazine” in February, with Andrew Gelman of Columbia University as editor. This development has great potential to extend the mission of CHANCE. In particular, it will give all of us opportunities to participate in discussions about probability and statistics and their role in important and interesting topics.

Now, in this issue, Necip Doganaksoy, Gerald Hahn, and Bill Meeker describe issues involved in validating product reliability with limited budgets and testing timeframes. Several illustrations from real applications are presented. Jason Crowley, Brenna Curley, and Dave Osthus analyze results from the game show “Jeopardy” from 1984–2009. Graphical analysis is used to depict trends across the show’s history. Janice Lent discusses the role of statistical models and statistical insight in forecasting and understanding energy supply and demand. She explains the role of the U.S. Energy Information Administration in developing the statistical National Energy Modeling System. Terry Allen, Amber Thom, and Glen Buckner present multiple correspondence analysis (MCA) as a tool for summarizing
several categorical variables. In this article, MCA is used to interpret factors related to infant homicides. Data come from the Uniform Crime Report. A group of authors provide us their insight into statistical consulting with limited resources. Mark Glickman discusses short-term consulting and the idea of approximating the most principled approaches. One of his illustrations comes from consulting about multiplayer online games. Sarah Ratcliffe and Justine Shults, biostatisticians in a medical school, share their insight regarding methods research on a limited budget, where the limiting factor is often time. Todd Nick and Ralph O’Brien focus on developing grant proposals on a limited budget, which again often means limited time. They encourage developing a task checklist and planning ahead. Richard Ittenbach outlines the process of scale—or composite measure—development, and how it can be approached on a limited budget. He emphasizes that some costs, such as the need to acquire new knowledge, can be difficult to quantify. A different group of authors wrote about consulting in university centers in volume 21, number 2. The current articles and the ones from 2008 are available online and, together, give a lot of useful advice. Are you a fan of bodybuilding champion, movie actor, and California Gov. Arnold Schwarzenegger? Whatever your opinion of him, you’ll enjoy the statistical analysis by Phil Stark of the text of a veto by Schwarzenegger in 2009. Michael Huber reanalyzes a famous homerun hit by baseball legend Mickey Mantle in 1963. Using a combination of mathematics, physics, and statistics, can we accurately predict the distance the ball would have gone if not impeded? Trent McCotter addresses expected maximum length of hitting streaks in baseball using permutations of actual hitting streaks. Read the article to learn how computations were done and what conclusions can be reached. Howard Wainer, in his 20th year as a columnist for CHANCE, gives us his 93rd article (according to his count), which is titled “Pies, Spies, Roses, Lines, and Symmetries.” The Visual Revelations column focuses on common topics: graphical display and history. Finally, Jonathan Berkowitz brings us a new puzzle in his Goodness of Wit Test column. Solving the puzzles involves pattern recognition and looking beyond the obvious. His 10th puzzle is titled “Once Is Enough.” Enjoy the issue! Mike Larsen
What Is Jeopardy!?
A graphical exploration from 1984–2009
Jason Crowley, Brenna Curley, and Dave Osthus
Jeopardy has been a popular television game show for many years. J! Archive, found at www.j-archive.com, contains a wealth of information about this TV quiz show from 1984—its first season—to the present. We extracted data from J! Archive about each contestant—information ranging from their winnings each round to state of residence. Excluded from the data are any tournaments or specialty games such as Celebrity Jeopardy, which we consider to be a different population of interest. In the process of extracting the data, we discovered many inconsistencies, due to both rule changes over the years and irregularities within the J! Archive website.
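As a rough illustration of the kind of extraction involved (this is not the authors' actual code), the XML package in R, which the article describes using later, can read every HTML table on a single J! Archive game page. The URL pattern and game id below are assumptions for illustration only.

```r
# Minimal scraping sketch using the XML package (an assumed approach,
# not the authors' jscrape function). Each game has its own page.
library(XML)
game_url <- "http://www.j-archive.com/showgame.php?game_id=1"  # hypothetical id
tables   <- readHTMLTable(game_url)   # one data frame per HTML table on the page
length(tables)
names(tables)
```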
Using the extracted data, we graphically explored a player’s performance within and across games, utilizing the ggplot2 package in R. Various plots will illustrate shifts in a player’s place within a game, as well as the effect of the final Jeopardy question on a player’s winning margin. The relationship between consecutive games played and final winnings also is explored. Linking the data with U.S. Census measurements, we explored the proportional representation of states on Jeopardy to see if any states or regions were outliers with respect to population or time. Last, we singled out a particular outlier by the name of Ken Jennings, who won 74 games in a row—by far the longest winning streak in Jeopardy history.
Jeopardy! Game Play Jeopardy, hosted by Alex Trebek, is played by three contestants over three rounds. The clues are actually the answers, so the contestants must give their responses in the form of a question. The Jeopardy game board consists of six categories with five clues each. After the host finishes reading a clue, contestants may buzz in. The first to do so has five seconds to respond. If a contestant answers correctly, the dollar value of the clue is added to their score and they get to choose the next clue—a doubly advantageous result. If they answer incorrectly, this dollar value is subtracted from their score and the other contestants have a chance to answer. If no one answers correctly, the last person with a correct response chooses the next clue. The first round is the Jeopardy round. From 1984–2001, clues ranged in dollar value from $100 to $500. Then, from 2001 to the present, they were doubled to range from $200 to $1,000. In the Double Jeopardy round, as the name suggests, clues double in value. So, from 1984–2001, clues ranged from $200 to $1,000; from 2001 to the present they ranged from $400 to $2,000. As Figure 1 on the next page verifies, the change to double all the values happened during season 18— specifically on show #3966, which was broadcast on November 26, 2001. During these first two rounds, there are hidden clues called Daily Doubles—one in the Jeopardy round and two in the Double Jeopardy round. If a contestant selects these clues, they get to wager a dollar amount up to as much as their current total or $1,000, whichever is greater. After wagering, only they are given the chance to answer.
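A tiny helper makes the Daily Double wagering rule above concrete; this is just a restatement of the rule as described, not code from the article.

```r
# Maximum Daily Double wager: the contestant's current total or $1,000,
# whichever is greater, per the rule described above.
max_daily_double_wager <- function(current_total) {
  max(current_total, 1000)
}
max_daily_double_wager(600)    # returns 1000
max_daily_double_wager(4200)   # returns 4200
```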
Alex Trebek arrives at the 37th Annual Daytime Emmy Awards, held at the Las Vegas Hilton on June 27, 2010, in Las Vegas, Nevada. Photo by Tom Donoghue/PictureGroup via AP IMAGES
At the end of the show is the Final Jeopardy round, which consists of just one question. To make it to this round, a contestant must have positive winnings after the Double Jeopardy round; if they have $0 or less, they cannot participate. For the Final Jeopardy question, contestants are told the category and have the commercial break to place their wagers, ranging from $0 to their current score. Upon returning, the contestants have 30 seconds to write down their responses once the clue is revealed. The winner after Final Jeopardy is invited to return to play on the following show as the returning champion. In the case of a tie, all first-place players are invited to return. Before 2003, contestants were only able to return for a maximum of five games. However, starting at the beginning of Season 20, this limit was dropped and contestants are now allowed to return until defeated.
Figure 1. Average final winnings by season. Starting in Season 18, a jump in average final winnings occurred due to the doubling of the clue values.
Figure 2. Maximum winning streak by season. The maximum winning streaks in seasons 20 and 21 are attributed to Ken Jennings.
Figure 2 illustrates this point. We see a large jump in the maximum number of shows a contestant was on Jeopardy with Season 20. In fact, this jump was due to a contestant who holds the record for the longest winning streak in Jeopardy history.
Collecting and Cleaning the Data Before any analysis can be done, data must be available and in a manageable form. The data on the J! Archive website are plentiful, but in a form more suitable to a web browser than a data miner. J! Archive is set up so that each show has
its own website, and the user can basically “relive” the game’s experience. To pull out the needed information, we wrote a function, “jscrape,” using the XML library in R. This arsenal of functions has the ability to parse the HTML code of a website and allows the programmer to extract data from a given table. One inconsistency with this data set is missing episodes. This is illustrated in Figure 3, which plots Show ID versus Player ID, a variable we created that uniquely identifies every contestant. Small gaps can be explained by our omission of Jeopardy tournaments, which we decided were not in our target population of shows. For instance, there was a large
Figure 3. Show ID vs. Player ID. Departures from the linear trend can be explained by repeat contestants, including Jeff Kirby, who is represented by the enlarged point at approximately (5750, 1600). He is the only known contestant to return to the show without permission.
Figure 4. Unadjusted and adjusted final winnings by season. After adjusting for the rule change that doubled clue values, no discernible trend can be seen in average winnings over time.
tournament around show #4750. Also, while it appears we have almost all episodes from recent seasons, the coverage before show #2750 is much more scattered. This is most likely due to the website developers’ continued efforts to enter historical shows, or diminished access to early episodes. The points that stray from the strong linear relationship present in the plot are another interesting oddity seen in Figure 3. After isolating these cases, we looked up these rogue contestants on J! Archive and, in almost every case, found they were players who were invited back to the show due to losses on “poorly worded” or “ambiguous” Final Jeopardy questions. There are relatively few of these repeat contestants because it is against the rules for a player to return to Jeopardy unless invited back. Interestingly enough, the outlier Jeff Kirby on Show #5766—the enlarged point—was a repeat contestant who had not been invited back, according to J! Archive. This explains why the horizontal distance of his point from the line is much larger than the rest of the stray points. Most of the time, contestants who were invited back due to unfair losses had their second show relatively soon after their original. Kirby
returned for his second show approximately 10 years after his first appearance. He was disqualified from his winnings after the producers learned of this rule-breaking.

Besides Player ID, we created a number of other variables to help in our analysis. Among these are proportion of questions correct, proportion of Daily Doubles correct, total questions answered, and the place each contestant is in at the end of each round. Using these variables, we were able to further delve into trends within games, as well as possible shifts in player performance over time and space. We also created a variable that adjusts round winnings for the doubling of question dollar amounts in Season 18. Figure 4 shows the actual average winnings by season, as well as the average adjusted winnings by season. We can see that, after adjusting for the doubling of clue dollar amounts, there is little to no trend over time. Since we have fewer episodes available in earlier seasons, represented by the size of the points, we can see there is more variability in average winnings—a finding in line with the Central Limit Theorem. After this point, all analysis was done using adjusted winnings to better compare results across all seasons.
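A minimal sketch of the Season 18 adjustment described above, assuming a data frame named jeopardy with show_id and final_winnings columns (these names are hypothetical): winnings from show #3966 onward are halved so dollar amounts are comparable across all seasons.

```r
# Hypothetical column names; show #3966 (November 26, 2001) is where the
# clue values doubled, per the article.
jeopardy$adj_final_winnings <- ifelse(jeopardy$show_id >= 3966,
                                      jeopardy$final_winnings / 2,
                                      jeopardy$final_winnings)
```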
Figure 5. Tracking contestant places throughout the game (percentage of contestants by final rank). Players tend to have fewer changes in rank, as shown by the larger, darker bars.
Trends Within the Game

In our graphical analysis of this Jeopardy data, we explored the following three questions related to trends within the game:

1. How do contestants’ places change throughout the game?
2. Under what conditions does a lead change occur as a result of Final Jeopardy?
3. Is there a trend between times on show and average final winnings?

How do contestants’ places change throughout the game? To gauge game dynamics, we tracked players’ places after the Jeopardy, Double Jeopardy, and Final Jeopardy rounds. In Figure 5, we examine the number of players in first, second, and third place at the end of the game, faceted by their place after the first round (across the top) and their place after the second round (down the side). We also shaded the bars by the number of changes that occurred in a player’s place throughout the game. For example, a contestant who went from first to second to first place would have a change of two, whereas someone who went from first to third to first would have a change of four. Overall, a change of zero is more likely than any other specific change, as shown by the darkest bars being larger than any other shade. As expected, the rarest path (albeit not impossible) for a player to take is from first to third to first or from third to first to third. From a numerical standpoint, knowing a contestant’s place after the Jeopardy and Double Jeopardy rounds has predictive value of the contestant’s final place, though to differing degrees. For instance, if a contestant is in first place at the end
of the Jeopardy round, 56.8% of the time they end up in first place after Final Jeopardy. If a contestant is in first place after the Double Jeopardy round, however, they end up in first place 74.1% of the time—an appreciable increase. Likewise, similar patterns exist for both second and third place.

Under what conditions does a lead change occur as a result of Final Jeopardy? In addition to seeing how players’ ranks change throughout the game, we looked at the winning margin between the first- and second-place players after the Double Jeopardy and Final Jeopardy rounds. This way, the only difference is due to the Final Jeopardy question. In addition, Figure 6 is faceted on two characteristics: whether a lead change occurs due to Final Jeopardy and whether the first-place contestant has a significant lead going into Final Jeopardy. A significant lead is defined as the first-place contestant having at least double the winnings of the second-place contestant, thus guaranteeing victory barring any foolish wagering. Examining Figure 6, we see that lead changes typically occur when the winning margin before the Final Jeopardy question is under $10,000. This makes sense, because if a contestant has a significant lead, they should know to wager only what they can afford without dropping out of first place. In fact, we see from the absence of points in the bottom right plot of Figure 6 that no contestant in our data set with a significant lead going into Final Jeopardy lost their game. The number of points between the lead-change and no-lead-change plots suggests the lead usually stays the same after the Final Jeopardy question. Therefore, if a contestant is in first place after the Double Jeopardy round, they are more likely to stay in that position after the Final Jeopardy question.

Is there a trend between times on show and average final winnings? As mentioned earlier, once a contestant has won
Figure 6. Winning margin and lead change (lead after the final round vs. lead after the second round, in dollars). Lead changes do not occur in games with a significant winning margin after Double Jeopardy.
Figure 7. Average final winnings ($) vs. show duration (number of appearances). There is a positive relationship between number of appearances on the show and players’ average final winnings.
a game of Jeopardy, they are invited to participate in the next show. Do these more experienced players tend to win more money? To examine this, we looked at the relationship between the number of times each player has been on the show and their average final winnings. We expected average winnings to increase with the number of times on the show. The box plots in Figure 7 illustrate a general increasing trend in final winnings as players return to the show multiple times—confirming our belief. The many outliers for players who were on the show just once suggest there were some high-scoring, presumably close games. We also note that only one person played as many as 9, 10, 20, and 75 games. However, recalling that players were
not allowed to continue past five games until 2003, perhaps these results would differ if there had never been a limit on the number of games a contestant could play.
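The "significant lead" rule used in Figure 6 is easy to express directly; the function below simply restates the definition given in the text and is not the authors' code.

```r
# A first-place contestant has a "significant" (unlosable) lead when their
# score is at least double the second-place score after Double Jeopardy.
significant_lead <- function(first_score, second_score) {
  first_score >= 2 * second_score
}
significant_lead(20000, 9800)   # TRUE: sensible wagering guarantees the win
significant_lead(14000, 9800)   # FALSE: a lead change is still possible
```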
Spatial Trends

In our graphical analysis of this Jeopardy data, we explore three main questions related to spatial trends.

1. Has the geographical distribution of contestants changed over time?
2. Are some states over- or under-represented?
3. Do certain states earn more winnings, on average, than others?
Figure 8. Number of contestants by state for the first 250 games and the last 250 games. Eight states are unrepresented in the first 250 games, while only three are in the last 250 games. The states in white are unrepresented.
Figure 9. Proportional representation by state. The Northeast, West Coast, Illinois, and Georgia are over-represented.
Has the geographical distribution of contestants changed over time? Jeopardy has been on the air since 1984 and has always been shot and produced in Southern California. In fact, when the show first began, the only way one could qualify for Jeopardy was by auditioning at their California location. These rules changed as the show gained in popularity. After Jeopardy’s second season, the show started sending people out to different areas of the country to increase the show’s qualification accessibility. Later, Jeopardy developed the Brain Bus, a traveling testing station where one can qualify for the show. Now, hopeful contestants can take the compulsory 10-question pre-test online, which has virtually no geographic bounds. With this change over time, we would expect to see the geographic diversity of contestants increase. To test this hypothesis, we examined the spatial distribution of the first and last 250 games of Jeopardy in our data set. As Figure 8 shows, the geographic distribution of contestants does appear to have changed over time. During the first 250 games, there were eight states that had no representation whatsoever, and approximately 20% of all contestants were from California. When looking at the last 250 games, we see an appreciable difference. Only three states were unrepresented, and California made up less than 10% of the total contestants. Though not proven, it is likely these changes are due to Jeopardy’s structural and technological changes, as well as its increased popularity over time.
Are some states over- or under-represented? Looking at the two graphs in Figure 8 leads one to ask the question, “Are some states, such as California and New York, over-represented on Jeopardy?” California and New York are the first- and third-most populous states in the United States, respectively. Thus, if all U.S. citizens are equally likely to make it onto Jeopardy, we would expect California and New York to have the first- and third-most contestants, again respectively. To explore this question, we standardized the data to the population of individual states. We created an expected count of contestants for each state by taking the percentage of the U.S. population that lived in that state and multiplying it by 4,880—the total number of Jeopardy contestants in the data set. We divided the actual number of contestants by the expected number. If greater than one, the state is over-represented. If less than one, the state is under-represented. As Figure 9 shows, there is a clear spatial trend. The Northeast, West Coast, Illinois, and Georgia are over-represented, while the rest of the country is under-represented. To give an idea of the range we are dealing with, North Dakota is the most under-represented state, with Utah not far behind. If all states were equally represented, based on its 2000 population measurement, North Dakota should have had 11 Jeopardy contestants, but only one North Dakotan has graced Jeopardy’s stage. On the flip side, Washington, DC, is the most overrepresented part of the country. Only nine people from
Figure 10. Average winnings ($) by state. Utah and North Dakota have significantly higher average winnings than other states.
Washington, DC, were expected to have appeared on Jeopardy thus far; however, 99 have made it onto the show—an 11-fold difference. When looking at the rest of the map, it is obvious that there are two regions of the country that have higher representation than expected. The question, then, is why? As mentioned earlier, there has always been a testing site in California. Thus, qualifying for the show has historically been much more accessible to those living on the West Coast. When Jeopardy implemented its traveling testing center, though not confirmed, it seems New York City or some other highly populated place in the Northeast would be a logical first place to go. Another possibility is that the intellectual make-up of the United States is not uniform. It is possible that people more likely to qualify for Jeopardy tend to inhabit the Northeast due to the jobs, schools, and lifestyles offered by that region. Do certain states earn more winnings, on average, than others? To answer this question, we examined a scatter plot of average final winnings for each state. From Figure 10, we see that most states have average winnings that are at least in the same neighborhood of one another—between $7,500 and $12,500. Thus, most states do not earn significantly different amounts. There are two states, however, that earn significantly more than
all other states: North Dakota and Utah. As mentioned earlier, there has been only one person from North Dakota on Jeopardy, and though he did well, he did not do well enough to win his game. Thus, North Dakota has such high earnings due to its small representation and commendable performance. Utah, on the other hand, has been represented in 86 shows. Thus, small representation size is not the reason for its success. Instead, there is a more interesting explanation for its intriguing status. That cause goes by the name of Ken Jennings, who is such an anomaly, he deserves his own section.
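Before turning to Jennings, the representation arithmetic quoted above is easy to check using the two extreme cases cited in the text (expected counts come from 2000 Census population shares multiplied by the 4,880 contestants in the data set).

```r
# Representation ratio = observed contestants / expected contestants,
# using the counts quoted in the text for the two extremes.
states   <- c("North Dakota", "District of Columbia")
expected <- c(11, 9)
observed <- c(1, 99)
round(observed / expected, 2)   # about 0.09 (under-represented) and 11 (over-represented)
```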
Ken Jennings Effect

To some, Ken Jennings is as famous as Trebek. His name is synonymous with trivia, and more specifically, winning Jeopardy. Jennings holds the record for the longest winning streak on Jeopardy, with 74 games. This streak blows the competition out of the water, as can be seen in Figure 11 on the following page. In our exploration, we hoped to not just identify the greatest Jeopardy player of all time, but also to take a deeper look into how he was able to put together such an incredible streak. One metric that sheds light on his success is the average number of questions answered correctly during each game. This metric is important because it works on two levels: A correct answer adds
Figure 11. Distribution of number of appearances. Ken Jennings appeared on 75 consecutive shows, while the next-closest person appeared on 20 consecutive shows.
Figure 12. Left panel: average number of questions answered correctly per show, by Player ID. Right panel: density of the margin of victory ($) for games Ken Jennings did and didn’t play. In the left panel, Ken Jennings—the largest point—answered more questions correctly on average than any other contestant. In the right panel, Jennings’ average margin of victory is larger than typical.
money to the contestant’s total and it takes away an opportunity for opponents to add that money to their totals. Figure 12 shows that Jennings answered, on average, just more than 35 questions correctly each game. There are, at most, 60 questions asked on Jeopardy every game. Thus, on average, the other two contestants had to split the other 25 questions. As can be seen, Jennings’s average is higher than any player in Jeopardy history. If Jennings answered significantly more questions than his opponents, it would make sense that his margin of victory also would be significantly different than the other Jeopardy contestants’. To win 74 games in a row, it is unreasonable to
believe every game—or even many games—came down to the wire. Figure 12 (right) confirms that Jennings’s average margin of victory ($26,011) is, indeed, much higher than the average margin of victory in games without him ($9,209)—a nearly $17,000 difference. Thus, much of Jennings’s success is due to Jennings not just beating his opponents, but destroying them.
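A ggplot2 sketch of the kind of density comparison shown in the right panel of Figure 12; the data frame and column names here are assumed for illustration, not taken from the article.

```r
# games is a hypothetical data frame with one row per game, the winner's
# margin of victory in dollars, and a flag for whether Jennings played.
library(ggplot2)
ggplot(games, aes(x = margin_of_victory, fill = jennings_played)) +
  geom_density(alpha = 0.5) +
  xlab("Margin of Victory ($)")
```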
Further Reading

R code: www.public.iastate.edu/~curleyb/chance/CHANCESubmissionRCode.R
Validating Product Reliability Necip Doganaksoy, Gerald J. Hahn, and William Q. Meeker
A new washing machine motor has been designed. Its developers need to demonstrate—before going into full-scale production—that at least 97% of such motors will operate without failure for 10 years, but they have only six months to do so. How should they proceed?
Product Reliability Development For many products, reliability is the key performance metric that, in the long run, stands out most vividly in customers’ minds. The desire for reliability is not new. Throughout history, people learned from their successes and mistakes when building durable wheels, stronger ships, longer bridges, larger domes, and more competitive products. What is different today is the emphasis placed by companies on assuring high reliability of a product during its design and development—and long before the introduction of the product on the market—while still keeping costs at an acceptable level. Ensuring up-front reliability makes business sense. It is much less expensive to build reliability into products during design than to fix a flawed design—to say nothing of the impact on customer relations. A proactive reliability assurance program calls for a definitive plan of action and the use of such tools as failure mode and effects analysis, fault trees, and reliability design reviews. Design engineers strive to understand how failures occur so that their causes can be addressed. But what does all this have to do with statistics and statisticians? A great deal. Product reliability is formally defined as the probability that the product (or, equivalently, a fraction or percentage of the product population) will satisfactorily perform its intended function under field conditions for a specified period (such as warranty or design life). For example, if 1% of the units of a product population fail within the first five years of service, the product’s five-year reliability is 0.99. Reliability is often referred to, more informally, as “quality over time.”
Design Validation

As suggested by the definition, statistics play a key role in measuring reliability. They help designers make sure, before a new product is released, that it will meet or exceed customers’ reliability expectations. There is often insufficient time to evaluate reliability at normal operating conditions. You can judge the appearance of a car by looking at it, and you can judge its performance by test-driving it. Assessing its reliability, however, is not that easy. And it is especially difficult because, as in the washing machine motor example, we are concerned with reliability over a long period, such as 10 years, but typically have only a short period, such as six months, to make our assessment. Three common statistically based approaches for addressing this challenge are the following:

• Use-rate acceleration
• Degradation testing
• Accelerated life testing

Use-Rate Acceleration

Many products are in actual operation only a small fraction of the time. Toasters, batteries, digital cameras, photocopiers, printers, bicycles, and seat belt buckles are just a few examples. We capitalize on this in use-rate accelerated testing by shortening the idle time and thereby increasing the use rate. A toaster might, for example, be designed for 12 years of failure-free operation, assuming twice-a-day usage. Thus, the equivalent of 12 years of normal use can be obtained in fewer than 90 days by running the toaster 100 times daily.

Washing machine example
The washing machine motor problem, stated at the outset, clearly lends itself to use-rate accelerated testing. We, therefore, developed a six-month life test conducted on special equipment on which the motors were subjected to mechanical loads resembling those experienced
Reliability Function In statistical terms, the reliability function R(t) is the fraction of units that will survive to age t. Said another way, R(t)=1 - F(t), where F(t) is the lifetime cumulative distribution function. In statistical reliability studies, one is often concerned with estimating reliability at a specified age. Age may be expressed in a variety of ways, such as the number of hours of operation, the number of start-ups, the number of operating cycles, or whatever is most reasonable from physical considerations.
Key Assumptions for Use-Rate Acceleration

Use-rate acceleration is based on the assumption that the increased cycling excites the failure modes seen in normal operation (i.e., the failure results directly from product operation, and not, say, chemical change over time). This assumption seemed reasonable, based on engineering considerations, for the washing machine motor application and for the failure modes in many, but not all, other situations. Failures due to corrosion, for example, would depend on elapsed time, rather than usage. It also is assumed in use-rate acceleration that the increased use rate will not, itself, change the lifetime distribution. If the motors in our example had not been allowed to cool down, the additional heat generated during the test could have changed the motor cycles-to-failure distribution.
Table 1—Washing Machine Motor Test Data (Converted Into Years of Operation Under Assumed Normal Operating Conditions)

Motor Lifetimes (in years) for Failed Units: 9.6, 13.5, 16.8, 17.4, 18.7, 19.3, 20.9

Running Times (in years) for Unfailed (Censored) Units: 4.5, 6.7, 7.5, 10.6, 10.7, 17.5 (4 units), 17.8, 19.7, 21.0 (48 units)
Censored Data and Analysis

Life data typically include unfailed units, or so-called “censored observations” for which only the survival times at the time of analysis are known; that is, the lifetimes for these units are known only to exceed their survival times. Censoring could arise because units were removed from test for some reason other than failure (such as the need to use the test stand for other testing), because they failed due to some unrelated failure mode that would not happen in actual operation or, simply, because a unit was still operating and had not yet failed. (Censored data also arise in a variety of other situations, such as when we are dealing with measurement equipment that can determine that a particular source of contamination is below a threshold level, but does not provide an actual reading.) In the motor example, only one unit failed before reaching the number of cycles that are equivalent to 10 years of use. A rough estimate of the 10-year reliability would be 1 − (1/66) = 0.985, but this simple estimate is technically problematic since three units had not run for the 10-year equivalent cycles. Thus, graphical and analytic methods that explicitly take the censoring into consideration need to be used. Standard elementary methods, such as ordinary least squares regression, are not appropriate for censored data. Instead, the approach used in our example is the method of maximum likelihood (ML). The basic principle underlying ML is to choose as estimates those values of the parameters from among all possible parameter values that make the observed data most likely (i.e., maximize the probability of the data). In analyzing censored data using ML, one typically makes the important assumption that censoring is unrelated to the time at which the unit would have failed. This assumption would be violated, for example, if devices that looked as if they were about to fail were removed from the test (and declared unfailed) or if the failure mode according to which “unrelated” failures took place was, in fact, not completely unrelated to the failure mode under consideration.
during a typical wash cycle. A sample of motors was to be run for 24 cycles per day, shutting down after each cycle for cooldown. In this manner, 3.5 years of field
operation were simulated in each month of testing—assuming a use rate of four washes per week. The assumption that failures are the result of the actual running of the motor, independent of elapsed
time, seemed reasonable based on engineering considerations. A test of 66 motors (the number of available test stands) for six months (equivalent to 21 years of field operation) was conducted. Monte Carlo simulation indicated that, although it would be nice to test more motors, such a test would provide adequate power for our evaluation.

Statistical and practical considerations in life test

The number of test units and test duration need to be determined so as to achieve an acceptable balance between the statistical risks associated with failing a reliable design and passing an unreliable one. Monte Carlo simulation is a flexible and easy-to-implement technique to evaluate the statistical properties of life test plans and, in this application, suggested that the proposed test plan was reasonable. During the early stages of product development, much of the work is done on lab prototype units under controlled conditions. Therefore, the test units need to be representative of those about which conclusions are to be drawn and should reflect, as closely as possible, the variability expected in future production. Thus, in our example, the motors were built at three times, using multiple lots of material for each of the key assembled parts. Detailed records of the motors’ manufacturing histories and performance were maintained. These could prove to be useful in tracking down (and removing) root causes of premature product failure. Similarly, the test environment needs to be as close as possible to that expected in field operation.

Results
After six months of testing (4,368 test cycles), there were seven bearing failures; the remaining 59 units were unfailed. It is desired to use this data to estimate the 10-year reliability of the new motor, but how do we handle the 59 unfailed units in such an analysis? Clearly, it would be incorrect to treat them as if they had failed at 4,368 test cycles, or, alternately, to ignore them altogether. Fortunately, there are appropriate (typically not considered in an introductory course) statistical methods to handle such censored observations. The results of the life test, converted into customer use years, are shown in Table 1.
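The following sketch shows how the use-rate conversion above can be checked and how a Weibull maximum likelihood fit to the censored Table 1 data might look in R, using the survival package. This is an illustration under stated assumptions, not the software or code the authors used; the data values are transcribed from Table 1.

```r
# Use-rate conversion check: four washes per week in the field versus
# 24 test cycles per day.
field_cycles_per_year <- 4 * 52
24 * (365.25 / 12) / field_cycles_per_year   # about 3.5 field years per test month
4368 / field_cycles_per_year                 # 4,368 test cycles = about 21 field years

# Weibull ML fit to the Table 1 data (status 1 = failure, 0 = censored).
library(survival)
years  <- c(9.6, 13.5, 16.8, 17.4, 18.7, 19.3, 20.9,           # 7 failures
            4.5, 6.7, 7.5, 10.6, 10.7, rep(17.5, 4), 17.8,
            19.7, rep(21.0, 48))                               # 59 censored units
status <- c(rep(1, 7), rep(0, 59))
fit  <- survreg(Surv(years, status) ~ 1, dist = "weibull")
eta  <- exp(coef(fit))   # Weibull scale; the article reports 34.7 years
beta <- 1 / fit$scale    # Weibull shape; the article reports 4.1
exp(-(10 / eta)^beta)    # estimated 10-year reliability (about 0.994 per the article)
```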
with a 95% lower confidence bound of 0.96. These estimates are shown in Figure 2 by the two horizontal lines, meeting the vertical line drawn at 10 years, giving the fraction failing estimate of 0.006
The Weibull Distribution The Weibull distribution is one of the most frequently used distributions to model the lifetimes of products and can be justified for some products based on theoretical (extreme value theory) and practical (provides a good fit to the lifetimes of various products) considerations. It also leads to a variety of distribution shapes, as can be seen from the Weibull distribution probability density functions plotted in Figure 1.
Figure 1. Weibull probability density functions with different values for the shape parameter
The Weibull cumulative distribution function, giving fraction failing as a function of time (the accumulation or integration of the density), is

F(t) = 1 − exp[−(t/η)^β],  t > 0,

where η > 0 is a scale parameter (also the 63rd percentile of the distribution) and β > 0 is a shape parameter (as can be seen from Figure 1). The special case of β = 1 is the exponential distribution.
Choice of an appropriate distribution for lifetime is an important part of product life data analysis. Sometimes, engineering knowledge can be used to determine the right type of distributional model. The normal distribution is generally not an appropriate model to represent lifetimes. Various other distributions, such as the Weibull and the lognormal, are used instead. Which distribution to use is, in practice, often determined empirically by probability plotting. When more than one model seems to fit the data, each such model is often used and the results compared. Figure 2 is a computer-generated Weibull probability plot of the motor data. A probability plot for an assumed distribution model is a plot of the fraction failing as a function of time on axes constructed so the plotted points tend to scatter around a straight line if the assumed distribution is correct. In light of the form of the Weibull distribution, the lifetime in years is on a log scale in Figure 2, and the fraction failing is on a nonlinear probability scale. The plotted points show a nonparametric estimate (i.e., an estimate that does not make any assumption about the underlying distribution) at each of the observed failure times. The unfailed units are not shown in the plot, but the censoring is taken into consideration in estimating the fraction failing. Maximum likelihood (ML) analysis of the data in Table 1 resulted in the Weibull distribution parameter estimates η = 34.7 years and β = 4.1. The solid line in Figure 2 is the ML estimate of the fraction of devices failing as a function of years in service. Note that the plotted points scatter around a straight line, supporting the assumption of a Weibull distribution within the range of the data. The dotted curve in Figure 2 gives approximate 95% upper confidence bounds on these estimates. These results yielded the estimated 10-year reliability
Figure 2. Fitted Weibull distribution for motor lifetimes (fraction failing vs. years) and approximate 95% upper confidence bound (dotted curve)
Multiple Failure Modes Many products are subject to multiple failure modes (i.e., they are subject to more than one cause of failure). Multiple failure modes often lead to more complicated distributional models. In the washing machine example, the assumption of a single failure mode was reasonable because the motor is a mature product susceptible to a single dominant wear-out failure mechanism.
The Lognormal Distribution

For a lognormal distribution for product life, the logarithms of the lifetimes are assumed to be normally distributed. For the lognormal distribution, exp(μ) and σ are scale and shape parameters, respectively. Also, μ and σ are, respectively, the mean and standard deviation of the log lifetimes. The lognormal distribution has been used successfully to represent the life distributions of various electronic components whose failures are driven by chemical degradation.
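For concreteness, a short R check of the lognormal fraction failing under the parameterization above; the μ and σ values here are illustrative only, not from the article.

```r
# Fraction failing by time t when log lifetimes are Normal(mu, sigma).
mu <- log(4000); sigma <- 0.25   # illustrative values only
t  <- 5000
pnorm((log(t) - mu) / sigma)               # via the standard normal cdf
plnorm(t, meanlog = mu, sdlog = sigma)     # same value via the lognormal cdf
```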
degradation exceeding or falling below some specified threshold value. Tire wear, as measured by tread thickness, provides an example. A consumer might define a tire to fail at “blowout.” But the tire manufacturer (or a regulating agency) might consider a tire to have failed when its tread thickness falls below a specified value at which the tire is no longer felt to be safe. The relationship between the amount of degradation and lifetime makes it possible to use degradation models and data to make inferences and predictions about reliability. Degradation tests often are used to assess the reliability of fluorescent light bulbs and LEDs. Such tests, however, generally do not provide useful information about the probability of catastrophic failures or failures that might be due to the external environment, such as being hit by lightning.
GaAs laser example
Figure 3. GaAs laser degradation paths (percent increase in operating current vs. hours) after 2,000 hours of testing
(1 − 0.994) and the estimated 95% upper confidence bound of 0.04 (1 − 0.04 = 0.96 reliability with 95% confidence). Further analysis showed that the desired 0.97 10-year reliability could be demonstrated with approximately 92% confidence. The results were sufficiently favorable to allow management to give the go-ahead for full-scale production of the motor under the assumption that the sampled units and the test environment realistically represented what would be encountered in actual operation.
Degradation Testing Many failure mechanisms can be traced to an underlying degradation process. Degradation eventually leads to a reduction in strength or other change in physical state that can cause failure. Degradation measurements are especially useful in estimating reliability when it is expected that very few or no failures will take place in the time available for testing and when use-rate acceleration is not feasible. In degradation testing, failure is defined, possibly arbitrarily, as the
A GaAs laser used in telecommunications systems has a built-in feedback circuit to generate a constant light output as the lasing material degrades. Over time, a larger current is needed to maintain a specified level of light. Failure was defined as the time at which a 10% increase in current is first needed. Engineers desired to estimate the probability of failure at 5,000 hours (equivalent to about 20 years of normal operation) on a single-condition high-temperature test. This assessment, moreover, was needed after only 2,000 hours of testing. Tests were conducted on 15 lasers, randomly selected from a lab prototype of the production process. None of the lasers exceeded a 10% increase in current after 2,000 hours of testing. This information, per se, was not helpful in estimating 5,000-hour reliability. However, degradation, as expressed by the percent increase in operating current, also was measured periodically on each of the test units. The resulting measurements, up to 2,000 hours, are displayed in Figure 3 for the 15 units. Figure 3 suggests that important information might be gained from the degradation data. The approach used was to generate “pseudo-lifetimes” by extrapolating the degradation paths of each of the devices by regression analysis and using these extrapolations to predict the unit’s lifetime. Figure 4 shows the 2,000-hour degradation paths extrapolated to time to failure or 5,000 hours, whichever came first, assuming
Accelerated Life Tests

We cannot employ use-rate acceleration when a product is normally in continuous, or near continuous, use. For example, a generator in a power plant typically operates continuously and, therefore, it is not possible to significantly accelerate its use rate. Or the lifetime of a product may depend on the life of an adhesive or glue and may not be strongly related to product use. In addition, there may be no reasonable degradation measurement. An alternative then might be to conduct what is commonly referred to as an “accelerated life test” or ALT (even though one can claim that use-rate acceleration is also a kind of accelerated test). ALTs involve high-stress testing and/or product aging acceleration. High-stress testing requires increasing the
Figure 4. GaAs laser degradation paths (percent increase in operating current vs. hours) at 2,000 hours extrapolated to failure (10% current increase) or 5,000 hours
that the linear trend noted in the first 2,000 hours can be validly extrapolated. Seven paths exceeded a 10% current increase prior to 5,000 hours, resulting in “pseudo-lifetimes” of 3,229, 3,514, 3,742, 4,047, 4,282, 4,781, and 4,969 hours. The extrapolations for the eight remaining units suggested they will continue to be unfailed at 5,000 hours. Thus, these units were taken as censored observations at that time. Figure 5 shows a lognormal distribution probability plot of these pseudo-lifetimes, together with a fitted lognormal distribution and the associated 95% confidence interval around the fitted line. The scatter of the pseudo-lifetimes around the fitted straight line suggests the lognormal distribution provides a reasonable fit to the data. The fitted line in Figure 5 led to the ML estimate of the probability of failing by 5,000 hours as 0.48 (or a reliability of 0.52) with an approximate 95% confidence interval of 0.24 to 0.72 (or reliabilities of 0.76 and 0.28). This wide interval reflects the extrapolation and that the analysis is based on a small number of failures. The test, however, was sufficient to show that the device’s reliability at 5,000 hours was unsatisfactory because even the (optimistic) upper confidence bound of 0.76 on 5,000-hour reliability would be unacceptable. The device, therefore, required redesign prior to deployment to improve its 5,000-hour reliability—an interesting conclusion in light of none of the 15 lasers having failed during the 2,000-hour test.
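A small sketch of the pseudo-lifetime idea described above: fit a straight line to one unit's degradation measurements and solve for the time at which the fitted percent increase first reaches 10%. The measurements below are made up purely for illustration; they are not the article's data.

```r
# Hypothetical degradation readings for one laser over the 2,000-hour test.
hours   <- seq(250, 2000, by = 250)
percent <- c(0.6, 1.1, 1.9, 2.4, 3.1, 3.6, 4.3, 4.9)
fit <- lm(percent ~ hours)
# Time at which the fitted line crosses the 10% failure threshold.
pseudo_life <- (10 - coef(fit)[1]) / coef(fit)[2]
pseudo_life   # used as a failure time if below 5,000 hours; otherwise the
              # unit is treated as censored at 5,000 hours
```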
Figure 5. Lognormal probability plot (fraction failing vs. hours) for pseudo-lifetimes with fitted distribution and 95% confidence interval
stress (e.g., voltage or pressure) at which test units operate. A unit will degrade over time and fail when its strength (which generally cannot be observed directly) drops below the applied stress. High-stress testing causes units to fail sooner than normal stress testing. Product aging acceleration involves exposing test units to more severe than normal environments, such as higher than usual levels of temperature or humidity. This accelerates the physical/chemical degradation processes that cause certain failure modes. Such testing tends to age the units faster and cause them to fail earlier.
Similar statistical approaches are used for conducting ALTs for high-stress testing and product aging acceleration. We will use the term “(high) stress” to refer to the accelerating variable for both situations. An important first step in planning an ALT is to identify one or more accelerating variables—and a physically appropriate acceleration model that relates stress to lifetime. The selection of an acceleration model is guided by product domain knowledge and, if possible, should be based on the fundamental physics and chemistry of the failure modes for the application.
The stress on an electrode was measured in volts per mil (vpm); 60 vpm is the normal operating condition (a mil is 1/1000 inch). Stresses above 125 vpm were to be avoided since they were likely to lead to failure modes (such as deformation of insulation due to high heat generated by the high-voltage stress) that were not expected to occur at normal operating conditions. The 90 electrodes were randomly allocated for testing for up to 11 months at five stress levels:

15 electrodes at 75 vpm
30 electrodes at 95 vpm
Figure 6. Results of high-stress test for dielectric insulation material, showing estimated first percentile of life distribution (heavy line) and the 95% approximate lower confidence bound on this percentile (dashed line). O= Observed failure time and ▲ = Censoring time (numbers in parentheses designate the number of censored observations).
20 electrodes at 105 vpm
15 electrodes at 115 vpm
10 electrodes at 125 vpm
Acceleration models
Well-known acceleration models include the following:

• The Arrhenius relationship describing the effect that temperature has on the rate of a simple chemical reaction (which, in turn, can be expressed in terms of life). This model results in a linear relationship between log rate and the reciprocal of the exposure temperature Kelvin (Centigrade degrees + 273.2), which also implies a linear relationship for log failure time versus reciprocal Kelvin.

• The inverse power relationship between life and voltage-stress acceleration. This model results in a linear relationship between log lifetime and log voltage.

• The Coffin-Manson relationship between number of cycles to failure and degree of temperature cycling. This model implies a linear relationship between log lifetime and the range of the thermal cycle (i.e., the difference between the highest and lowest temperatures).

In addition to the acceleration model, a statistical model such as a Weibull or lognormal distribution is used to describe the distribution of lifetimes at
each level of stress. This model is then extrapolated to estimate the expected lifetime under the stress encountered at normal operating conditions. Generator insulation example
This example deals with a newly developed dielectric insulation for stator bars used to operate power generators. The insulation consists of a mica-based system bonded with an organic binder. It was designed to have a reliability of 0.99 after 10 years of use at normal operating conditions. Eleven months were available for testing the new insulation. If reliability is as expected, it is unlikely that any test unit would fail during this time at normal operating conditions. It is not possible to perform use-rate acceleration because the product is operated essentially continuously, and no meaningful degradation measurement was available. Instead, an ALT was conducted. Insulation failures occur due to the failure of the organic material, causing a sudden reduction in strength. As a result, the insulation can no longer handle the voltage stress. Increased voltage also speeds degradation, and thus failure, and was taken as the accelerating variable in the ALT. Tests were conducted on special lab stands using short electrodes built to represent production bars. Sufficient units and testing facilities were available to test 90 electrodes.
The majority of test units at 125 vpm were expected to fail within 11 months. Few failures were expected to occur at 95 vpm, and none at 75 vpm. The data after 11 months from this ALT are displayed in Figure 6. By that time, a total of 32 units had failed at four of the five test conditions; 58 units (15 of 15 at 75 vpm, 27 of 30 at 95 vpm, 15 of 20 at 105 vpm, and 1 of 15 at 115 vpm) had not yet failed, leading to censored observations. From physical considerations and past experience, it was reasonable to assume the following:

• An inverse power relationship between life and voltage-stress acceleration
• A Weibull distribution for lifetime at each voltage

The data were fitted to the assumed model using maximum likelihood to estimate 10-year reliability at 60 vpm. The heavy line shown in Figure 6 is the estimated first percentile of the life distribution as a function of voltage. The associated 95% approximate lower confidence bound on this line is shown by the dashed line. Both lines assume the observed linear trend can be extrapolated to 60 vpm. The approximate 95% lower confidence bound on the first percentile of the lifetime distribution at 60 vpm is 13.2 years. Since this exceeds 10 years, the new insulation meets its reliability goal. Thus, the revised design was approved for
production, subject to continued validation testing on production units.
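The following is a minimal sketch, under simplified assumptions, of the kind of fit just described: a Weibull lifetime distribution whose scale depends on voltage through an inverse power relationship, estimated by maximum likelihood from right-censored data and then extrapolated to the first percentile of life at 60 vpm. The data arrays are invented placeholders, not the insulation test results, and the lower confidence bound shown in Figure 6 is not reproduced.

```python
import numpy as np
from scipy.optimize import minimize

# Weibull regression for an accelerated life test: the Weibull scale depends
# on voltage through an inverse power relationship (log scale linear in log
# voltage), fit by maximum likelihood to right-censored data. The arrays are
# invented placeholders, not the insulation test data.
volts = np.array([125, 125, 125, 115, 115, 105, 95, 95], dtype=float)  # volts per mm
time = np.array([1.2, 2.5, 4.0, 6.5, 11.0, 11.0, 11.0, 11.0])          # months on test
failed = np.array([1, 1, 1, 1, 0, 0, 0, 0])                            # 1 = failure, 0 = censored

def neg_log_lik(params):
    b0, b1, log_shape = params
    shape = np.exp(log_shape)
    scale = np.exp(b0 + b1 * np.log(volts))      # inverse power relationship
    z = time / scale
    log_pdf = np.log(shape / scale) + (shape - 1) * np.log(z) - z ** shape
    log_sf = -z ** shape                         # log survival function (censored units)
    return -np.sum(failed * log_pdf + (1 - failed) * log_sf)

fit = minimize(neg_log_lik, x0=[10.0, -1.0, 0.0], method="Nelder-Mead")
b0, b1, log_shape = fit.x
shape = np.exp(log_shape)

# Estimated first percentile of life (in months) at 60 vpm, the quantity the
# heavy line in Figure 6 tracks; a confidence bound would additionally require
# the delta method or a bootstrap, omitted here.
scale_60 = np.exp(b0 + b1 * np.log(60.0))
print(scale_60 * (-np.log(0.99)) ** (1.0 / shape))
```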
The Hazards of Extrapolation

All three of our examples involve some form of extrapolation, as we have noted. Indeed, most statistical analyses of product lifetime data require assuming a model and extrapolating from it either in time or stress or both. Extrapolation requires the assumption that the patterns observed in the data, such as the straight lines in Figures 4 and 5, will continue beyond the data. It is important to note that statistical confidence intervals reflect only statistical uncertainty (i.e., uncertainty due to limited data), and not model error. Thus, the estimates and associated confidence intervals apply only to the degree to which the assumed model is correct in the region of extrapolation. Incorrect models can lead to seriously incorrect conclusions. Thus, the validity of the model and the extrapolations need to be questioned and assessed, based principally on physical considerations.
Conclusion

Statistical tools play a key role in all phases of the product life cycle, from product design, development, and scaleup to manufacturing to tracking field performance. The design of experiments and statistical process control are two well-known examples. In this article, we described another important application area—validating product reliability. Product and system reliability is a major concern to consumers and manufacturers alike. The inability to meet reliability goals can lead to everything from minor inconveniences (e.g., no toast for breakfast) to human tragedies. Ensuring high reliability in the design of manufactured products is, first and foremost, an engineering challenge. But statistics and statisticians play an important role. They help ensure that meaningful data are obtained quickly in a cost-efficient manner, and that the results are translated into useful information for incisive decision making during product design. In this area, as in many others, statisticians do not work in a vacuum. They need to be intimately knowledgeable about the product with which they are dealing and fully sensitive to the business environment. They need to be articulate team players who can work successfully with engineers, scientists, managers, and sometimes even lawyers and accountants.
To Learn More

In this article, which is based on discussion in The Role of Statistics in Business and Industry by Gerald J. Hahn and Necip Doganaksoy, we have only scratched the surface of how statistics is used to further reliability assurance. We have considered only nonrepairable products (or replaceable components in a repairable product) and discussed only reliability validation. Other applications include assessing how reliability can be improved by building redundancy into the product design and conducting designed experiments to help determine how to make product reliability more robust to variability in manufacturing and use conditions. Applications-oriented books that provide detail on the subjects we have (and have not) discussed include the following:
Accelerated Testing: Statistical Models, Test Plans, and Data Analyses by Wayne Nelson
Applied Life Data Analysis by Wayne Nelson
Applied Reliability (2nd ed.) by Paul A. Tobias and David C. Trindade
Statistical Methods for Reliability Data by William Q. Meeker and Luis A. Escobar
Statistical Models and Methods for Lifetime Data (2nd ed.) by J. F. Lawless
Statistics in a Dynamic Energy Environment
Janice Lent
Since 1977, Americans have turned to the U.S. Energy Information Administration (EIA)—a dedicated group of statisticians, econometric modelers, survey methodologists, and other specialists—for up-to-date, policy-neutral energy data. EIA employees respond to information requests from customers, including congressional committees and kindergarten teachers alike, providing everything from quantitative analyses of potential new energy policies to Sudoku picture puzzles featuring Energy Ant, the cartoon host of EIA's Energy Kids website. In today's dynamic energy environment, the job of being America's premier source of energy information often requires the kinetic energy of a wind turbine in a hurricane and the versatility of a flex-fuel vehicle. What is a flex-fuel vehicle? EIA's website, www.eia.doe.gov, can tell you. Here, we discuss two energy policy issues on which EIA recently provided analysis reports in response to congressional requests: a possible cap-and-trade program for greenhouse gas emissions and the costs and benefits of expanding the use of light-duty diesel-fueled vehicles. While analyzing the potential
What Is EIA?

The U.S. Energy Information Administration (EIA) is the statistical and analytical agency within the U.S. Department of Energy. EIA collects, analyzes, and disseminates independent and impartial energy information to promote sound policymaking, efficient markets, and public understanding of energy and its interaction with the economy and environment. EIA is the Nation's premier source of energy information and—by law—its data, analyses, and forecasts are independent of approval by any other officer or employee of the U.S. government. The Department of Energy Organization Act of 1977 established EIA as the primary federal government authority on energy statistics and analysis, building upon systems and organizations first established in 1974 following the oil market disruption of 1973. EIA conducts a comprehensive data-collection program that covers the full spectrum of energy sources, end uses, and energy flows; generates short- and long-term domestic and international energy projections; and performs informative energy analyses. EIA disseminates its data products, analyses, reports, and services to customers and stakeholders primarily through its website and customer contact center. Located in Washington, DC, EIA is an organization of about 375 federal employees, with an annual budget in fiscal year 2010 of $111 million.
effects of new energy policies is an important component of EIA’s work, it’s just part of the job of gathering, aggregating, analyzing, and disseminating the energy information Americans need. Survey data are provided to EIA primarily by energy industry participants, many of whom also use the aggregate information EIA publishes.
Energy Forecasts and Analyses of Proposed New Energy Policies

EIA developed the National Energy Modeling System (NEMS) to produce regular forecasts of energy production, consumption, imports/exports, and prices. The NEMS bears little resemblance to the compact, elegant-looking models found in statistics textbooks. Designed to capture the real-world complexity of national and international energy markets, the NEMS is a modular system, incorporating energy supply, demand, and conversion components, as represented in Figure 1. The NEMS integrating module combines outputs from the other components with additional macroeconomic information and data on international energy markets to calculate projections. Extensive documentation of the assumptions underlying the projections is available at www.eia.doe.gov/oiaf/aeo/assumption/index.html. NEMS projections are based on the following:

a) A vast collection of EIA and non-EIA data sets, including economic and demographic information at regional, national, and global levels

b) Classic econometric assumptions regarding the effects of long-term supply and demand pressures on energy prices and resource allocation

Because of (b), NEMS energy price forecasts can be off by substantial amounts when factors such as market speculation induce rapid short-term price fluctuations (e.g., those seen in 2008). National long-term forecasts, assuming current government policies, are published annually in EIA's Annual Energy Outlook (AEO), available at www.eia.doe.gov/oiaf/aeo. At the request of policymakers, EIA frequently uses the NEMS for special forecasts designed to predict the possible effects of new policies under consideration.
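The iterate-to-consistency role of an integrating module can be illustrated with a toy system of two "modules." The linear supply and demand curves and every coefficient below are invented for illustration only and bear no relation to the actual NEMS components.

```python
# Toy illustration of an integrating module: separate supply and demand
# "modules" are iterated to a price at which their outputs agree. All
# functional forms and coefficients are invented; they are unrelated to NEMS.
def supply(price):
    """Supply module: energy supplied (arbitrary units) at a given price index."""
    return 60.0 + 0.8 * price

def demand(price, gdp_index=1.0):
    """Demand module: energy demanded, shifted by a macroeconomic input."""
    return 110.0 * gdp_index - 1.2 * price

def integrate(gdp_index=1.0, n_iter=200, step=0.05):
    """Integrating module: adjust the price until supply and demand balance."""
    price = 20.0
    for _ in range(n_iter):
        gap = demand(price, gdp_index) - supply(price)
        price += step * gap           # raise the price when demand exceeds supply
    return price, supply(price)

price, quantity = integrate(gdp_index=1.05)
print(f"equilibrium price index: {price:.1f}, quantity: {quantity:.1f}")
```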
The American Clean Energy and Security Act of 2009: Greenhouse Gas Emissions Cap and Trade

In March 2009, EIA received a request from Rep. Henry A. Waxman and Rep. Edward J. Markey, both of the House Committee on Energy and Commerce, for analysis of the American Clean Energy and Security Act of 2009 (ACESA, H.R. 2454). The centerpiece of ACESA is a "cap and trade" program for greenhouse gas (GHG) emissions. Like its
Figure 1. Schematic representation of the National Energy Modeling System (NEMS) Source: EIA
predecessors, the U.S. Acid Rain Program and the Kyoto Protocol, the ACESA cap-and-trade program sets upper bounds on emissions. These annual caps gradually decrease from 2012 through 2050. Companies whose emissions are capped by ACESA may buy and sell “offsets,” or emissions credits, measured in metric tons of CO2 equivalent. In effect, these companies can either reduce their own GHG emissions or pay other companies to reduce theirs. Ideally, companies that can cheaply reduce emissions will sell credits to those that cannot and the free market for emissions credits will become a mechanism for minimizing the total cost of the emissions reductions required to meet the caps. If companies emitting GHG believe the price of emissions credits is likely to rise with the declining annual caps, they may over-comply with the emissions caps and save some of their credits, creating a bank of GHG emissions credits. Records of the banked credits will be managed by a government agency. Companies can then use the credits when they wish to emit more GHG than the caps allow, or they can sell the credits to investors or other companies that use them to emit GHG. Figure 2 shows an example of the activity that can take place within an emissions trading system (ETS). Companies A and B are initially issued emissions allowances or credits to either use or sell. Because Company A cannot economically reduce its emissions to the level of its allowances, it buys credits from Company B, which has invested in wind energy equipment to meet its electric power needs. Company B reduces its emissions, sells some of its credits on the market, and saves some for later use or sale. With the money from the sold credits, Company B invests in additional renewable energy equipment to further reduce its emissions.
Meanwhile, to cover its emissions, Company A also may pay overseas companies to reduce their emissions (i.e., Company A may buy "international offsets"). Although international offsets are expected to be cheaper than domestic offsets, ACESA restricts the percentage of a company's emissions reductions that can be achieved through international offsets. Also, the availability of international offsets is not guaranteed; international agreements must be put in place for these offsets to become available. The ETS creates an economic incentive for companies to "go green," like Company B. If Company A fails to reduce its emissions, it will continue to pay for credits and become less competitive than Company B. Environmental advocacy groups can further reduce emissions by buying and retiring emissions credits—effectively paying GHG emitters to reduce their emissions. The ACESA caps require a 17% reduction in covered GHG emissions by 2020 and an 83% reduction by 2050, both relative to 2005 emissions levels. Because 2009 NEMS forecasts went to only 2030, EIA's analysis of ACESA was limited to the 2012–2030 period. (NEMS was recently updated to produce forecasts to 2035.) ACESA's fairly complex system of offset credits allows companies a variety of options for meeting the caps and introduces a major area of uncertainty in the analysis of ACESA's possible impacts. Company B, for example, may choose to bank credits, while Company A may choose otherwise, based on differing beliefs about future credit prices. Another area of uncertainty involves the costs and technological constraints of low- and no-carbon energy sources in the coming decades. A research breakthrough could, for example, make an emerging technology such as solar power more economical or increase the public acceptance of nuclear
Figure 2. Example of activity in an emissions trading system
power. Technological advances are, of course, difficult to predict, and the NEMS projections EIA routinely publishes do not incorporate specific assumptions about such advances. Due to the complexity of the NEMS, which relies on thousands of data elements and relationships, error estimation in the statistical sense is not attempted. Instead, analysts may use professional judgment to develop a collection of analysis cases—sets of detailed assumptions that vary to cover a range of possibilities regarding the major areas of uncertainty. The range of outcomes generated by the hypothetical scenarios may essentially be used as bounds, bracketing ranges of possible developments represented by the scenarios. Developing a set of useful analysis cases is not a straightforward exercise. Analysts first try to identify the major areas of uncertainty. Trial runs and professional judgment may be needed to select a manageable number of points from the multidimensional continuum of possibilities. To analyze ACESA's potential effects, EIA analysts first designed six primary analysis cases to bracket a range of possible economic and technological developments. Although EIA's full report on the ACESA analysis gives results from numerous variations on the six main cases described below, the main cases provide a general framework for evaluating ACESA's possible effects.

(1) The ACESA Basic Case assumes moderate use of domestic and international offset credits and continued development and availability of low-carbon technologies to accommodate the emissions reductions required under ACESA. Under this scenario, GHG emitters and investors, anticipating emissions credit price increases in response to reductions in the emissions caps, save approximately 13 billion metric tons (BMT) of credits in the bank by 2030. With the highest level of banked
credits assumed in the various analysis cases, the Basic Case, while realistic, may be considered somewhat optimistic.

(2) The Zero Bank Case is similar to the Basic Case, except it assumes no emissions credits are deposited in the bank (i.e., companies expect credit prices to fall or remain steady over time). While the Zero Bank Case is, in this sense, rather pessimistic and unlikely to be fully realized, it provides a theoretical lower bound on global GHG reduction in the near term—with no banked credits, companies will be using all their credits by emitting GHG. This case implicitly assumes, however, that companies expect the availability of low-carbon, reasonably priced energy technologies to increase, allowing them to meet the increasingly stringent ACESA caps by continuing to reduce GHG emissions. The use of domestic and international offsets is allowed to vary in this scenario; since they are not banking any credits, companies can meet the ACESA caps with less extensive use of offsets than is assumed in the Basic Case.

(3) The No International Offsets Case provides a high-end estimate of domestic GHG reduction by assuming U.S. companies (a) buy no credits from overseas and (b) save 13 BMT of credits in the bank, as in the Basic Case. In this optimistic scenario, domestic emissions are reduced more than enough to meet the ACESA caps.

(4) The High Offsets Case assumes companies immediately begin using international offsets at or near the highest levels ACESA allows.

(5) The High Cost Case represents a scenario in which the costs of no-carbon and low-carbon technologies are 50% higher than in the Basic Case. The higher costs could be expected to result in greater use of (cheaper) international offsets and therefore less domestic GHG reduction.
Table 1—NEMS Summary Results for Six ACESA Scenarios

Case | Domestic Offsets | International Offsets | Banked Credits (BMT) | Technology Expansion Potential | Costs of Low/No Carbon Technology | Domestic GHG Reduction by 2030 | GDP Reduction by 2030
(1) Basic | Basic | Basic | 13 | Basic | Basic | 23.2% | 0.8% (0.3%*)
(2) Zero Bank | Variable (low) | Variable (low) | 0 | Basic | Basic | 12.2% | 0.5% (0.2%*)
(3) No International | Variable (low) | 0 | 13 | Basic | Basic | 33.2% | 1.1% (0.3%*)
(4) High Offsets | High | High | 13 | Basic | Basic | 12.8% | 0.6% (0.2%*)
(5) High Cost | Variable (high) | Variable (high) | Variable (low) | Basic | Basic + 50% | 17.5% | 1.1% (0.4%*)
(6) No International/Limited Tech. | Variable (low) | 0 | 13 | Low | Basic | 28.2% | 2.3% (0.9%*)

*Assuming a 4% annual discount rate. Other calculations assume no discount rate. Source: EIA
(6) The No International Offsets, Limited Technology Case is the No International Offsets Case with an additional assumption of limited availability of low-carbon and no-carbon technologies. A "worst-case scenario" from an economic perspective, this case assumes that all the GHG reduction required under ACESA must be achieved by U.S. companies whose access to alternative technologies does not increase over time. The U.S. economy thus absorbs the costs that the GHG-emitting companies incur in their efforts to comply with the ACESA caps.

The assumptions underlying the six primary cases are summarized in Table 1. The final two columns provide projected percentage reductions in GHG emissions and gross domestic product (GDP). Both are relative to the NEMS reference case, which assumes current legislation without ACESA. As expected, the High Cost and No International Offsets cases result in larger GDP reductions, while the Zero Bank and High Offsets cases result in smaller domestic GHG reductions. These results must be interpreted with caution, however, because they do not reflect environmental and health benefits gained through the GHG reductions. All scenarios assume that the caps are met through some combination of GHG reductions and offsets. The discount rate, the rate at which investors "discount" expected future profits relative to funds currently available, is another important but unknown factor. Table 1 provides results assuming no discount rate and, in parentheses, alternative results assuming a 4% discount rate. In general, higher discount rates tend to make investors less willing to buy or save GHG emissions credits, creating a downward pressure on credit prices.

As of this writing, ACESA is still under Congressional review and revision. As changes to the legislation are proposed, EIA
continues to provide scenario analyses upon request. Although these analyses help quantify the likely effects of the legislation under various assumptions, the big questions remain to be answered by elected policy makers: What scenario—or weighted average of scenarios—is likely to play out in reality? Will the reduction in GHG emissions be worth the economic cost, i.e., the GDP reduction? EIA analysts do not attempt to answer these questions. If you have an opinion on these issues, however, feel free to write to your Congressional representatives.
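As a purely illustrative aside on the discount-rate adjustment reported in parentheses in Table 1, the sketch below shows how a stream of future GDP losses shrinks when each year's loss is weighted by 1/(1 + r)^t at r = 4%. The loss path is invented and is not taken from the EIA analysis.

```python
# Hypothetical illustration of discounting: each year's GDP loss is weighted
# by 1/(1 + r)^t. The loss path below is invented, not EIA's.
rate = 0.04
losses = [0.02 * t for t in range(1, 20)]   # loss (% of GDP) in years 1..19, growing over time

undiscounted = sum(losses)
discounted = sum(loss / (1 + rate) ** t for t, loss in enumerate(losses, start=1))

print(f"cumulative loss, undiscounted:     {undiscounted:.2f} percentage-point-years")
print(f"cumulative loss, discounted at 4%: {discounted:.2f} percentage-point-years")
```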
Light-Duty Diesel Vehicles: Should We Jump on the Bandwagon?

In October 2008, EIA received a request for answers to the oft-heard question, "Why are Americans still using gasoline, while the rest of the world is moving toward diesel?" Motivated by an article published in Popular Mechanics magazine, the request came from Sen. Jeff Sessions of the Senate Committee on Energy and Natural Resources. Diesel fuel, as was noted in the request, is the world's most widely used transportation fuel. About half of the light-duty vehicles (such as cars and light trucks) sold in Western Europe run on diesel. In the United States, while most trailer trucks, buses, and tractors burn diesel, only about 2% of all light-duty vehicles do. Diesel engines running on ultra-low sulfur diesel last longer than conventional gasoline engines, are 20% to 40% more fuel efficient, and emit less CO2. So, should the United States try to switch from gasoline to diesel?
Biofuels and Blends
Ethanol is an alcohol-based fuel made from plants, primarily corn and other starch or sugar crops. The prospect of increased ethanol dependence raises concerns about farmland use and the possible impact on food supplies. Cellulosic ethanol—made from woody fibers found in leaves, grasses, and crop waste—may offer a solution. Due to its more complex manufacturing process, however, cellulosic ethanol is not yet commercially viable. The U.S. Department of Energy is funding research on new methods of producing cellulosic ethanol. Biodiesel is a diesel substitute made from vegetable oils and other organic fats. In the United States, biodiesel blends are used mainly in fleet vehicles such as buses, snowplows, and garbage trucks. Biofuel blends are usually denoted by “E” (ethanol) or “B” (biodiesel), followed by the percentage of biofuel in the blend. Most gasoline sold in the United States is at least 10% ethanol (E10). Another common blend is E85, and the most popular biodiesel blend is B20. Biofuel blends are often cheaper than petroleum-based fuels. E85, for example, currently sells for up to $0.40 less per gallon than regular gasoline (due, in part, to lower taxes). E85, however, is less widely available, offers fewer miles to the gallon, and can only be used in specially equipped vehicles.
EIA was asked to compare several types of light-duty vehicles (e.g., diesel, gasoline, hybrid electric) and explore the issues related to increased use of diesel fuel in the United States. In the report Light-Duty Diesel Vehicles: Market Issues and Potential Energy Emissions Impacts, EIA analysts examined both economic and environmental incentives—or disincentives—for the United States to follow Europe on the path toward diesel. One major difference between the American and Western European vehicle markets stems from differences in emissions standards. Although diesel engines emit less CO2 than gasoline-powered engines, they emit more nitrogen monoxide and nitrogen dioxide—NO and NO2, commonly referred to as NOx, or nitrogen oxides. These gases react with water to
form nitric acid (HNO3), a major contributor to acid rain. In the United States, emissions standards for NOx are set at both state and national levels. For marketing reasons, U.S. vehicle manufacturers sell only vehicles that comply with the strictest state-determined standard for NOx emissions: the California standard of 0.07 grams or less per mile. Western European policymakers have encouraged the transition from gasoline to diesel through lower taxes on diesel vehicles and fuels and more relaxed emissions standards for NOx. The current European Union standard for NOx emissions is equivalent to 0.29 grams per mile—more than four times the level allowed under the California standard. Due to the stricter U.S. emissions standards for NOx, vehicles marketed in the United States require more emissions control equipment than do those marketed in Europe. This extra equipment adds to the price of diesel vehicles and introduces extra maintenance costs in both time and money for diesel vehicle owners. Further adding to the cost of operating a diesel vehicle in the United States is the price premium for diesel fuel. Due to increasing demand for diesel in economically developing countries and the transition from gasoline to diesel in Western Europe, the supply of gasoline to the U.S. market has increased in recent years, easing gasoline prices relative to diesel prices. Moreover, light-duty diesel vehicles have yet to recover from the poor reputation they earned in the 1980s, when Americans were largely disappointed with the performance and reliability of the light-duty diesel vehicles that first appeared on the U.S. market. Although the newer models incorporate technological improvements, many consumers are reluctant to give diesel vehicles a second chance. Proponents of light-duty diesel vehicles would like to see the United States follow the example of Western Europe and adopt policies to increase use of diesel fuels. Because policies to encourage increased use of diesel would have to be viewed as long-term efforts, however, analyses of their potential environmental effects must account for expected advances in vehicle technologies and the availability of alternative fuels such as biofuel blends. EIA used the Greenhouse Gases, Regulated Emissions, and Energy Use in Transportation (GREET) model to compare the environmental effects of several vehicle/fuel combinations. Developed by Argonne National Laboratory, a DOE laboratory managed by the University of Chicago at Argonne, the GREET model offers a complete fuel cycle or "well-to-wheels" evaluation of GHG emissions. That is, it accounts for all GHG emitted from the mining of fossil fuels—or growing of biofuel feedstocks—to the burning of the fuels in a vehicle's engine. GREET also includes a vehicle-cycle model that, although not used in the EIA analysis, estimates the energy and emissions effects of several types of vehicles—from material development to vehicle disposal—under user-specified assumptions. Like the NEMS, the GREET model is a complex, modular system. Extensive GREET documentation is available at www.transportation.anl.gov/modeling_simulation/GREET, where users also may download the GREET software. A graphical user interface allows users to specify fixed values for hundreds of parameters (e.g., percentages of biofuels in fuel blends, market shares of various fuels in current and future years) that represent model assumptions. Adventurous modelers may even use the GREET model to perform stochastic simulations by specifying
Figure 3. GREET model well-to-wheels projections of greenhouse gas emissions for various vehicle/fuel combinations, 2010 Source: EIA
probability distributions for the parameters, rather than fixed values. For the less adventurous, GREET provides data-based default values for most parameters. Plug-in hybrid electric vehicles have rechargeable batteries that can be connected to an external electric power source. For these vehicles, the GREET model allows users to specify a U.S. electrical grid from which the vehicle will draw electricity. The GHG emissions associated with electricity generation vary according to the mix of fuels used in electric power plants. Coal-fired plants produce more GHG, for example, than plants using natural gas or wind energy. Most states now have renewable portfolio standards (RPS), which require electricity suppliers within the state to generate a certain percentage of their electricity from renewable sources. The higher the state RPS, the lower the GHG emissions from the grid that serves the state. GREET lets users choose the California grid (governed by a fairly strict RPS), the Northeast grid (not quite as strict), or the U.S. grid, representing a national average— including those states without any RPS. Users also can specify a theoretical grid with a user-defined fuel mix (e.g., 50% coal, 25% natural gas, 25% renewable). EIA analysts used the GREET model (version 1.8b) to project 2010 GHG emissions, in grams of CO2 equivalent per mile, for the vehicle/fuel combinations shown in Figure 3. While diesel fuel compares favorably to gasoline for all vehicle types, it compares unfavorably to all the other fuel blends considered. As expected, well-to-wheels GHG emissions for plug-in hybrid electric vehicles depend on the electricity grid
specification, with the California grid showing the lowest emissions levels in Figure 3. With regard to environmental impact, therefore, policies that promote faster movement toward biofuels may appear more attractive than policies designed to increase the use of petroleum-based diesel fuel. Again, policymakers will decide whether to change America’s tax and/or emissions policies to encourage the use of diesel fuel. The EIA report helps explain and quantify the pros and cons—both economic and environmental—of diesel as compared to gasoline and other fuels.
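The stochastic-simulation option mentioned above, specifying probability distributions rather than fixed parameter values, can be sketched in a few lines of Monte Carlo code. The parameter names, distributions, and numbers below are hypothetical stand-ins and do not reflect GREET's actual interface or defaults.

```python
import numpy as np

# Monte Carlo propagation of parameter uncertainty to a well-to-wheels
# emissions estimate. Parameter names, distributions, and values are
# hypothetical stand-ins, not GREET inputs.
rng = np.random.default_rng(1)
n_draws = 100_000

# grams CO2-equivalent per gallon, split into upstream ("well-to-pump")
# and combustion ("pump-to-wheels") stages
upstream_g_per_gal = rng.normal(loc=2_500, scale=300, size=n_draws)
combustion_g_per_gal = rng.normal(loc=10_000, scale=200, size=n_draws)

# vehicle fuel economy in miles per gallon
mpg = rng.triangular(left=38, mode=42, right=48, size=n_draws)

g_per_mile = (upstream_g_per_gal + combustion_g_per_gal) / mpg

print(f"mean: {g_per_mile.mean():.0f} g CO2e/mile")
print(f"90% interval: {np.percentile(g_per_mile, 5):.0f} to "
      f"{np.percentile(g_per_mile, 95):.0f} g CO2e/mile")
```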
Serving the Citizens

The mission of EIA's National Energy Information Center (NEIC) is to help America's voters and future voters—our kids—understand energy policy issues and make informed choices as energy consumers. One of NEIC's flagship tools is its easy-to-read online energy encyclopedia, Energy Explained. Filled with color photos and data-based graphics, Energy Explained not only provides energy statistics, but also offers plain-language definitions and background information. In addition to Energy Explained, NEIC provides short explanatory articles on "hot" energy topics in the Energy in Brief section of the EIA site. Recent topics have included how dependent we are on foreign oil and—due to the ACESA debate—what a cap-and-trade program is and how it works. NEIC also maintains an extensive FAQ page based on
Figure 4. The role of renewable energy in the nation's energy supply, 2008: Petroleum 37%, Natural Gas 24%, Coal 23%, Nuclear 9%, Renewables 7% (Biomass 4%, Hydroelectric 2%, Wind 1%, Geothermal 0%, Solar/Photovoltaic 0%). Source: EIA, Renewable Energy Consumption and Electricity Preliminary Statistics 2008, Table 1.3, Primary Energy Consumption by Energy Source, 1949–2008 (June 2009)
questions it receives from EIA's data users. The most colorful pages of EIA's site, however, can be found in the Energy Kids section, a scaled-down, kid-friendly version of Energy Explained. K–12 teachers across the country rely on Energy Kids to provide simple explanations of energy subjects and exercises and activities that make it fun for kids to learn about energy. Figure 4 provides an example of the type of graphic that EIA's users of all ages find helpful. Based on data from several of EIA's surveys, the pie chart shows the distribution of energy consumption in the United States, by primary source, in 2008. As one might expect, petroleum is the biggest slice, followed by natural gas and coal. Renewables represent only about 7% of consumption, and most of this is biomass—primarily wood. Meanwhile, the percentages for geothermal and solar/photovoltaic energy are so small that they round to zero. Figure 4 also helps explain why EIA's current data-collection activities are concentrated on fossil fuels. EIA conducts more than 70 surveys, 49 of which provide data exclusively on fossil fuels. Of these, 27 EIA surveys collect data on petroleum—the big slice—tracking the status of the liquid from the time it's drawn from a well as crude oil through the refinement process, bulk transport, blending, etc., to its sale as a consumer product. Although fossil fuels are likely to remain part of America's energy picture in the coming decades, EIA's customers are interested in biofuels and other "clean energy" technologies such as wind, wave, and solar. EIA is in the process of enhancing and expanding its Internet data-collection capabilities. In 2009, for example, EIA launched an Internet-based survey of biodiesel producers, gathering information about the quantities of biodiesel they're making and the materials (e.g., vegetable oils and poultry fat) from which they're making it. The potential for rapid change in energy prices and consumption patterns also creates challenges for EIA's modelers. The NEMS assumptions and data, for example, must be continuously reviewed and updated to reflect new market realities. Energy analysts and economic researchers are among
the "power users" of EIA's data and projections: They rely on a steady stream of detailed energy information. All of EIA's publications are scrutinized by these professionals, and EIA regularly answers technical questions about its methods of sampling, data collection, estimation, and modeling. So what does the future hold for America's energy markets and consumers? EIA offers no unconditional predictions, but will be there to help policymakers, industry analysts, citizens, and students make sense of it all.

Editor's Note: The views in this article are those of the author and should not be construed as representing those of the U.S. Department of Energy or other federal agencies.
Further Reading

Readers who want more information about energy statistics and developments in the energy field can visit one or more of the websites listed below. Academics who are interested in research grants may visit the Office of Energy Efficiency and Renewable Energy and National Renewable Energy Laboratory sites for information about funding opportunities and application procedures.

Department of Energy: www.energy.gov
Energy Information Administration: www.eia.doe.gov
Office of Energy Efficiency and Renewable Energy: www.eere.energy.gov
Argonne National Laboratory: www.anl.gov
National Renewable Energy Laboratory: www.nrel.gov
Infant Homicides: An Examination Using Multiple Correspondence Analysis
Terry Allen, Amber Thom, and Glen Buckner
Multiple correspondence analysis (MCA) is a highly useful descriptive statistical technique that creates dimensional coordinates used to graphically display relationships among the attributes of multiple variables. It calculates row and column coordinates that are analogous to factors in a principal component analysis (PCA), differing from the latter in that it partitions the chi-square value instead of the total variance among variables. Although MCA is an excellent technique for examining large numbers of categorical variables, it has received little attention in social science research. Many social research variables are naturally discrete, as they are measured at the nominal level. Such variables are naturally displayed in contingency tables and well suited for analyses using the Pearson chi-square or the likelihood ratio chi-square, methods frequently used by social researchers. Both techniques work well for small tables, but they have limitations when tables contain many categorical variables that aggregate into a large number of combinations. Trying to interpret the resulting combinations can be a complex challenge. Correspondence analysis overcomes this difficulty by calculating row and column coordinates—the "correspondence" between rows and columns—and partitioning the chi-square value into dimensions. A correspondence plot can then be produced to show the relationships between the row and column categories in a single visual.
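A minimal sketch of the underlying computation may help: simple correspondence analysis of a two-way table can be obtained from the singular value decomposition of the matrix of standardized residuals, and MCA applies the same decomposition to an indicator (or Burt) matrix built from several categorical variables. The small table below is made up for illustration; it is not the homicide data analyzed later, and the article's own analysis was run in SAS.

```python
import numpy as np

# Simple correspondence analysis of a two-way table via the SVD of the
# standardized residuals; MCA applies the same machinery to an indicator or
# Burt matrix. The counts are an invented toy table.
counts = np.array([[20, 5, 10],
                   [10, 15, 5],
                   [5, 10, 20]], dtype=float)

n = counts.sum()
P = counts / n                        # correspondence matrix
r = P.sum(axis=1)                     # row masses
c = P.sum(axis=0)                     # column masses

# standardized residuals: D_r^(-1/2) (P - r c^T) D_c^(-1/2)
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
U, sing, Vt = np.linalg.svd(S, full_matrices=False)

principal_inertia = sing ** 2                     # partitions chi-square / n
chi_square = n * principal_inertia.sum()
row_coords = U * sing / np.sqrt(r)[:, None]       # principal row coordinates
col_coords = Vt.T * sing / np.sqrt(c)[:, None]    # principal column coordinates

print("chi-square:", round(chi_square, 2))
print("percent of inertia by dimension:",
      np.round(100 * principal_inertia / principal_inertia.sum(), 1))
print("row coordinates (first two dimensions):\n", np.round(row_coords[:, :2], 2))
```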
History of MCA

Correspondence analysis (CA), PCA, and MCA are three types of geometric modeling. These methods result in descriptive statistics and are applicable to any size data set. Geometric modeling was suggested as early as 1940, but was not established as a useful method until the early 1960s. The techniques became widely used in
France in the 1970s—PCA for numerical variables and MCA for categorical variables. Further developments appeared almost exclusively in French publications through the end of the decade. English publications describing geometric modeling began in the 1980s, with introductory books about CA published in the 1990s. As CA became more accepted, it was included in many statistical software programs, according to Brigitte Le Roux and Henry Rouanet in their book Multiple Correspondence Analysis. Recently, a number of software companies have begun to include MCA in their statistical packages. SAS is used for this research because of its excellent correspondence analysis procedure. In addition, XL Stat, SPSS, Stata, and SPAD provide numerical results for MCA. MCA has been used in studies that examine such diverse relationships as car ownership preference by family size and diabetes by ethnic group and geographic location. MCA remains underused, despite its capabilities. Here, we demonstrate the power of MCA by using it on a serious data set. MCA is used to clarify relationships between parent offenders (mother, father) and infant victims
(newborn, baby) of homicides in the United States. Hopefully, data analysis of this sort can provide insight and useful information to prevent these tragedies in the future.
Data and Design for the Study of Infant Homicides

Twenty-seven years of data from supplementary homicide reports (SHR) are used in this analysis. The SHR data is part of the Uniform Crime Report, the oldest continuing, updated database in the United States for recording crimes known to the police. The information therein, collected from all 50 states and the District of Columbia by the FBI, contains variables reflecting the characteristics of reported homicide victims and offenders, including cases of children under the age of one who are killed by their parents. The SHR separate the deaths of infants into two groups: newborns (0–6 days old) and babies (7–365 days old). Because this analysis is designed to examine relationships among variables, not changes in infant homicide rates over time, 27 years of reported homicide data are combined into a single
Table 1—Variables Included in the Analysis and the Frequencies

Parent | Age Group | Race | Weapon Group | Relationship | Newborn | Baby
Father | Older | Black | No weapon | Daughter | 2 | 122
Father | Older | Black | No weapon | Son | 12 | 153
Father | Older | Black | Weapon | Daughter | 2 | 47
Father | Older | Black | Weapon | Son | 7 | 60
Father | Older | White | No weapon | Daughter | 5 | 201
Father | Older | White | No weapon | Son | 15 | 254
Father | Older | White | Weapon | Daughter | 4 | 64
Father | Older | White | Weapon | Son | 7 | 85
Father | Younger | Black | No weapon | Daughter | 3 | 33
Father | Younger | Black | No weapon | Son | 1 | 45
Father | Younger | Black | Weapon | Daughter | 1 | 7
Father | Younger | Black | Weapon | Son | 0 | 15
Father | Younger | White | No weapon | Daughter | 3 | 46
Father | Younger | White | No weapon | Son | 3 | 68
Father | Younger | White | Weapon | Daughter | 2 | 6
Father | Younger | White | Weapon | Son | 0 | 11
Mother | Older | Black | No weapon | Daughter | 9 | 50
Mother | Older | Black | No weapon | Son | 15 | 38
Mother | Older | Black | Weapon | Daughter | 41 | 69
Mother | Older | Black | Weapon | Son | 38 | 64
Mother | Older | White | No weapon | Daughter | 19 | 56
Mother | Older | White | No weapon | Son | 19 | 91
Mother | Older | White | Weapon | Daughter | 75 | 73
Mother | Older | White | Weapon | Son | 70 | 90
Mother | Younger | Black | No weapon | Daughter | 22 | 26
Mother | Younger | Black | No weapon | Son | 9 | 26
Mother | Younger | Black | Weapon | Daughter | 67 | 42
Mother | Younger | Black | Weapon | Son | 54 | 47
Mother | Younger | White | No weapon | Daughter | 21 | 25
Mother | Younger | White | No weapon | Son | 28 | 31
Mother | Younger | White | Weapon | Daughter | 82 | 31
Mother | Younger | White | Weapon | Son | 81 | 40
Total | | | | | 717 | 2,016
Table 2—Percent Breakdown by Attribute for Each Variable

Variable | Attribute | Number | Percent
Parent | Father | 1,283 | 46.98
Parent | Mother | 1,449 | 53.02
Infant | Newborn | 717 | 26.23
Infant | Baby | 2,016 | 73.77
Age Group | Older | 1,857 | 67.95
Age Group | Younger | 876 | 32.05
Race | Black | 1,127 | 41.24
Race | White | 1,606 | 58.76
Weapon Group | No Weapon | 1,451 | 53.09
Weapon Group | Weapon | 1,282 | 46.91
Relationship | Daughter | 1,256 | 45.96
Relationship | Son | 1,477 | 54.04
large database of infant victims and parent offenders. Table 1 lists the variables used in this analysis with the frequency counts for both newborns and babies. The same data is summarized in Table 2, which includes counts and percents for the attributes of each variable. Aggregating the data has several important advantages. First, problems resulting from under-reporting of infant homicides are minimized by the increased number of observations and the longer timeframe of the database. Second, meaningful relationships are more likely to emerge from large stable databases. Third, aggregate databases provide sufficient cases for detailed analyses of patterns and relationships. The SHR contain a number of characteristics of the offender, the victim, and the crime. The two variables of most interest here are “parent,” indicating whether the offender was the mother or the father, and “infant,” indicating whether the victim was a newborn or a baby. This analysis focuses solely on parent offenders who, according to the SHR, are responsible for 78% of infant homicides. Other perpetrators of infant deaths not included in this analysis are neighbors, babysitters, other family members, and strangers. All of the variables in this analysis are dichotomous, including “relationship” (son or daughter), “age-group” of the parent (younger, under 21 years, or older, 21 years and older), and “race” (black or white). More than 97% of all
infants killed are listed in the SHR as white or black, with Hispanics reported as white. To reduce racial confounding, this research includes only intra-racial homicides (i.e., the race of the infant and the parent are the same). Seventeen weapon categories are included in the SHR. For this study, they are summarized into two groups: “weapon use” (parents who use firearms, knives, clubs, etc.) and “no weapon use” (parents who use hands, feet, teeth, etc.). MCA is used to explore the data to find relationships among variables, identify the dimensions underlying the relationships, and display the results in correspondence plots. The population for this analysis totals 2,733 parent/infant homicide cases.
Initial Graphical Analysis

Figure 1, a mosaic-plot matrix of a subset of infant-homicide variables, shows the patterns of association between pairs of variables. The mosaic display, proposed by J. Hartigan and B. Kleiner, uses the areas of rectangles to show frequencies found in a contingency table. The width of the rectangles shows the relative frequencies of the first variable and the height shows the second. The resultant areas are proportional to the frequency of the observations found in the cell. Pearson residuals are represented in the mosaic plots by the shading of the cells (i.e., black represents positive residuals—more observations were observed in a
particular cell than expected—and gray shows negative residuals—fewer observations were observed than expected). The individual mosaic plots display a number of interesting relationships among the variables "parent," "infant," "age-group," and "weapon-group." The panels in the top row illustrate the relationships between "parent" and each of the other variables. Panel 1,2 (row, column) shows the relationship between the variables "parent" and "infant." The overall proportion of newborn homicides is clearly much smaller than the proportion of babies killed. Positive residuals show that fathers are overrepresented in killing babies, while mothers are overrepresented in killing newborns. Panel 1,3 indicates fathers are more likely to be older and mothers, younger. According to the residuals in panel 1,4, fathers are more likely to kill without a weapon, while mothers tend to use a weapon. The remaining panels in the mosaic plot show relationships among the other variables. The residuals in panel 2,3 indicate that younger parents are associated with newborns and older parents are associated with babies. Panel 2,4 shows that newborns are more associated with the use of a weapon, while babies are associated with the lack of a weapon. Panel 3,4 shows that younger parents are more associated with the use of weapons and older parents with no weapons. The lower triangular part of the mosaic matrix shows the same information with the axes reversed.
Figure 1. Mosaic-plot matrix of a subset of the variables. Black areas represent positive residuals, and the gray areas show negative residuals.
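The mosaic display itself is straightforward to reproduce. The sketch below shades a two-variable mosaic for "parent" by "infant" using counts aggregated from Table 1, with black for cells above their expected frequency and gray for cells below; it uses statsmodels and matplotlib rather than the SAS graphics behind Figure 1, so it is only an approximation of that display.

```python
import matplotlib.pyplot as plt
from statsmodels.graphics.mosaicplot import mosaic

# Two-variable mosaic of "parent" by "infant," with counts aggregated from
# Table 1 (Father: 67 newborns, 1,217 babies; Mother: 650 newborns, 799
# babies). Cells with more observations than expected under independence are
# shaded black, the rest gray, mimicking the shading convention of Figure 1.
counts = {("Father", "Newborn"): 67, ("Father", "Baby"): 1217,
          ("Mother", "Newborn"): 650, ("Mother", "Baby"): 799}

n = sum(counts.values())
row_tot = {p: sum(v for (pp, _), v in counts.items() if pp == p) for p in ("Father", "Mother")}
col_tot = {i: sum(v for (_, ii), v in counts.items() if ii == i) for i in ("Newborn", "Baby")}

def shade(key):
    if key not in counts:                       # guard against partial keys
        return {}
    expected = row_tot[key[0]] * col_tot[key[1]] / n
    return {"color": "black" if counts[key] > expected else "gray"}

mosaic(counts, gap=0.02, properties=shade)
plt.show()
```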
Table 3—Results of the Multiple Correspondence Analysis

Dimension | Singular Value | Principal Inertia | Chi-square | Percent | Cumulative Percent
1 | 0.57 | 0.33 | 6510 | 32.63% | 32.63%
2 | 0.42 | 0.17 | 3437 | 17.22% | 49.85%
3 | 0.40 | 0.16 | 3233 | 16.20% | 66.05%
4 | 0.39 | 0.15 | 2961 | 14.84% | 80.89%
5 | 0.32 | 0.10 | 2078 | 10.41% | 91.31%
6 | 0.29 | 0.09 | 1735 | 8.69% | 100.00%
Total | | 1.00 | 19952 | 100.00% |

Total Degrees of Freedom = 121
Figure 2. Draftsman display of the correspondence plots for the first three dimensions
Multiple Correspondence Analysis

The mosaic plots give an indication of the outcome of the multiple correspondence analysis. Table 3 shows the actual results. The overall chi-square value (chi-square = 19,952, df = 121, p < .0001) demonstrates a strong association among some of the dichotomous variables. In many studies, two dimensions are sufficient to uncover significant associations. However, in this analysis, two dimensions account for less than 50% (49.9%) of the total chi-square. A third dimension, accounting for an additional 16.2% of the chi-square value (total percent = 66.05), was therefore included. Multiple correspondence analysis creates a point for each row and column. If two rows or two columns have similar profiles, their points will plot close together and fall in the same direction away from the point of origin.
Figure 2 shows the relationships between each pairing of the three dimensions found through MCA. Panel 1,2 plots the first and second dimensions. Dimension 1 is closely aligned with parent characteristics and clearly separates mother, newborns, younger, and use of a weapon from father, babies, older, and no weapon. It shows that, in parent/infant homicides, mothers tend to be younger, are associated with killing newborns, and use a weapon. Father offenders are older, most associated with killing babies, and tend not to use a weapon. These associations are clearly separated. The mothers’ characteristics are plotted in the positive end of dimension 1, and the fathers’ characteristics are plotted in the negative end of dimension 1. Positive and negative values on correspondence plots have no intrinsic meaning and only serve to show separation among the attributes of the variables.
Dimension 2 is closely aligned with "relationship" and "race." The attributes "black" and "daughter" plot in the positive end of dimension 2, while "white" and "son" plot in the negative end of dimension 2. It appears from this plot that black parents have a higher association with killing daughters and white parents with killing sons. Variables aligned with dimension 1 and dimension 2 are orthogonal (rotated nearly 90 degrees), indicating the independence of the two groups of vectors. This orientation shows there is actually little association between "race" and "relationship" and the other variables associated with the parent offenders. Panel 1,3 compares dimensions 1 and 3, showing that dimension 1 continues to align with "parent," "infant," "weapon-use," and, to a lesser degree, "age-group." Dimension 3, like dimension 2, aligns with "race" and
Figure 3. Three-dimensional view showing the lack of a relationship among the first three dimensions
“relationship.” In this case, however, the attributes “son” and “black” plot in the positive end of dimension 3, while “daughter” and “white” plot in the negative end. The reversal of this pairing from the previous panel clearly demonstrates a lack of association between “race” and “relationship,” as well as between these variables and the others associated with the offenders. In panel 2,3, the independence of “race” and “relationship” from other characteristics is shown by the near right angles between the two vectors. This panel also explains why the attributes of these variables change positions in panels 1,2 and 1,3. The results depend on the observer’s perspective. If the viewing direction is from dimension 2, the horizontal match is black/daughter and white/son. If the viewing direction is from dimension 3, the vertical match is white/daughter and black/son. The vectors that appear shorter are those less well represented in a two-dimensional space. Figure 3 shows all three dimensions in one plot, demonstrating that they are clearly distinct. The characteristics of dimension 1 are perpendicular to 34
dimensions 2 and 3, indicating little or no association. In addition, “race” is at approximate right angles to “relationship,” also indicating a lack of association. A possible explanation for this finding is that parent/infant homicides are more related to other issues than to some of the variables collected in the SHR. For example, homicides in general are concentrated heavily in lower socioeconomic communities, and poverty crosses racial lines. Poor whites have a higher homicide rate than wealthy whites, and poor blacks have a higher homicide rate than wealthy blacks. This relationship between poverty and homicide holds true for infants as well, but socioeconomic status is not included in the SHR.
Conclusion

MCA can be useful for analyzing large social science databases such as the SHR. Although the technique is seldom used in social research, this study demonstrates how MCA can identify a number of dimensions underlying the relationships found in correspondence
plots—dimensions that otherwise might remain hidden. The scope of this article is limited to a single, but important, area of homicides—parents killing their infant children. The MCA analysis identified three major dimensions in the parent/ infant SHR homicide data. The first dimension separates the characteristics associated with mother offenders from those associated with father offenders. Mothers tend to be younger (under 21 years old), are associated with killing newborns (0–6 days old), and tend to use weapons. Fathers tend to be older (21 years and older), are associated with killing babies (7–365 days old), and tend not to use weapons. The other two dimensions, race and relationship (sex) of the infant, are independent of each other and also independent of the characteristics associated with dimension 1. This lack of association is an important result of this MCA. The race of the parent is not related to the sex of the victim, and neither variable is related to the parents’ gender, age, or use of a weapon. Likewise, the race and sex of the child are not associated with whether the murdered infant is a newborn or a baby. The use of MCA for this analysis demonstrates how it can uncover hidden relationships in categorical data that are not apparent using other statistical methods. MCA should be considered along with other statistical tools when designing social-research studies that include categorical variables.
Further Reading

Federal Bureau of Investigation. 1980–2006. Supplementary homicide reports. Unpublished raw data.
Friendly, M. 2000. Visualizing categorical data. Cary, North Carolina: SAS Institute.
Greenacre, M. 2006. Multiple correspondence analysis and related methods. Chapman & Hall.
Hartigan, J.A., and B. Kleiner. 1981. Mosaics for contingency tables. Computer Science and Statistics: Proceedings of the 13th Symposium, pp. 268–273. New York: Springer-Verlag.
Le Roux, B., and H. Rouanet. 2010. Multiple correspondence analysis. Thousand Oaks, California: SAGE Publications.
Statistical Consulting with Limited Resources: Applications to Practice
Mark Glickman, Rick Ittenbach, Todd G. Nick, Ralph O'Brien, Sarah J. Ratcliffe, and Justine Shults
Given the constraints of an uncertain economy, consulting statisticians are often asked to do more with less while continuing to keep the quality of their work unquestionably high. The greater emphasis on cost containment undoubtedly has many benefits, but it brings with it obvious challenges, as well—challenges that extend across many aspects of statistical practice. The following four articles were taken from a topic-contributed session, titled "Statistical Consultation with Limited Resources," presented at the 2009 Joint Statistical Meetings in Washington, DC. Four specific areas of practice are presented here: short-term consulting, methods research, grant proposal development, and scale development. While the four contributions represent different areas of statistical practice, all strive to offer recommendations that are as applicable to their specific areas of practice as they are to consulting more generally. For example, Mark Glickman offers a two-step strategy for attacking challenging and time-sensitive consulting problems. Justine Shults and Sarah Ratcliffe, on the other hand, discuss the importance and benefits of conducting independent research and offer tips to help make this possible with an overly busy schedule. Todd Nick and Ralph O'Brien offer guidance on making one's grant writing more efficient and productive, while Richard Ittenbach introduces statisticians to an area of emerging statistical practice: scale development. Although coming from different perspectives and having different applications, the unifying theme—keeping the quality of one's work high when confronted with limited resources—remains clear.
Short-Term Statistical Consulting with Limited Time and Resources
Mark Glickman is associate professor in the health policy and management department at the Boston University School of Public Health and senior statistician at the Center for Health Quality, Outcomes and Economics Research, a Veterans Administration Center of Excellence. He has had a longstanding interest in the application of statistical methods to rating tournament chess players and has been chair of the U.S. Chess Federation’s ratings committee for more than 15 years.
One of the more rewarding opportunities available to academic statisticians is short-term consulting for clients in the private sector. Not only does statistical consulting have the potential to provide extra income, but, depending on the arrangement with the client, the results of the work can serve as the basis for research publications. The problems with which a client needs help are often interesting, partly because they are typically problems without "textbook" solutions (which is the main reason the consultant was contacted in the first place) and partly because of the novelty of being exposed to a problem with an unfamiliar context. The most common challenge in short-term statistical consulting arises from limited time and resources; a client is often in a situation in which a solution is needed quickly and there isn't an unlimited bank account to finance the work. A good statistical consultant must recognize a client's urgent deadlines and think creatively to produce timely, but reliable, solutions. Short-term statistical consulting arguably requires skills that are complementary to the ones used by researchers. In many consultations, the decision to hire a statistician is made at a point when the client has already invested in a particular approach, and this creates constraints on the set of possible solutions. Furthermore, because of the typical time limitations associated with such projects, the statistician does not have the luxury of applying the most principled approaches to problem-solving and must therefore appeal to more imaginative methods not usually associated with long-term research projects.
Approximating the Most Principled Approaches

A typical kick-off meeting between a statistician and client involves the client describing the relevant background to the problem, the progress made toward solving the problem, and how the client envisions the statistician meeting his or her aims. During this introductory description, it is helpful for the statistician to assess the various constraints imposed by the path the client has taken. From my experience, a client is then willing to let the consultant go off and contemplate possible directions to pursue, being aware of the limited time and resources at one's disposal. It is at this phase of a consulting project that the following two-part strategy can be helpful in forming potential solutions. First, recognizing the complexity of the problem, it is useful to envision the most principled approaches to solving the problem as if no time or resource limitations were imposed. This is often equivalent to deciding how one might approach the problem if it were a research-level project. Almost always, this involves constructing a (possibly complex) statistical model to describe the relationship among variables in the problem. Second, one identifies aspects of the method that can be approximated or substituted with simpler steps without major sacrifices in precision and accuracy. This simplification should result in an approach that is not only reasonably reliable, but also straightforward enough to be implemented by the client. In the context of a complex statistical model, the statistician
might find components of a model that can be simplified without introducing major inaccuracies, or determining linearizations of highly nonlinear aspects of complex models. These two steps can be helpful because the first step anchors the ultimately applied method in a solid statistical framework, while the second step, in effect, prevents the approach from veering too far from the most principled solutions. Several examples from previous consultations illustrate this approach.

Example 1: Multi-Player Online Games

I have been involved in numerous consultations with online gaming organizations interested in implementing rating systems for competitor strength. These consultations have come about based on my having developed a rating system called the Glicko system. The Glicko system measures competitors' abilities in two-player (or two-team) games/sports, including chess, checkers, football, baseball, and so on. The basic probability model underlying the Glicko system, called the Bradley-Terry model, assumes that if competitors A and B are about to play a game, the probability A defeats B is given by exp(θA)/[exp(θA) + exp(θB)], where θA and θB are the respective real-valued playing strength parameters. The most principled approach to analyze game outcome data is to fit the Bradley-Terry model using standard likelihood-based methods. The Glicko system actually makes several computation-saving approximations that allow the system to be applied to thousands of competitors simultaneously. The key approximation is to use a pre-competition estimate of the opponent's strength when evaluating a player's ability based on current game outcomes. In some online games, teams compete against other teams to complete a specified quest (e.g., to kill a particular monster or find a valuable object in the game world). I have been consulted about how to extend the Glicko system to measure player ability in the context of team games. The reason this is a tricky, time-consuming question is that it is not obvious how competitors work together in teams. In some games, the primary determinant of a victory could be the strength of the best player on the team; in other games, victory often hinges on the strength of the worst player. With unlimited time and resources, one could explore various models for outcomes accounting for team-member synergy, but under practical time constraints, an approach was needed to produce a reasonably quick method clients could implement when they were only willing to fund a few hours of my time. I therefore considered a simple model that could be viewed as approximating a potentially complex one. With n players per team, one can assume the probability that team A defeats team B is given by exp(ΘA)/[exp(ΘA) + exp(ΘB)], where ΘA = (θA1 + … + θAn)/n and ΘB = (θB1 + … + θBn)/n denote the average strengths of the players on teams A and B. This model, in effect, assumes the average ability of a team is a decent measure of overall team strength. This expression then permits adapting the formulas used in the Glicko system
for two competitors to two teams of competitors. The basic "trick" is to realize that, when inferring θ_A1, the game can be treated as if there are two competitors: player A1 and the combination of all the players on team B "minus" players 2 through n on team A. The two-player Glicko system can then be applied by estimating θ_A1 using a pre-competition "opponent" estimate of
θ_B1 + … + θ_Bn − θ_A2 − … − θ_An. All other players then have their ratings estimated in parallel using the same method. This approach has been successfully implemented in several online gaming systems. Example 2: Modeling Solitaire Scores Another consulting example that illustrates the method of approximating or simplifying principled approaches involves work with an online gaming company that specializes in different solitaire-type games in which players can win money based on their performance. Performance in each solitaire game was a function of the time it took to complete the game and the success in completing the game's objectives; the performance outcome was a numerical score. I was asked to help develop the following idea for a game. After a player had played a game several times, a pop-up window would appear on the player's screen inviting the player to bet on whether he or she could achieve a numerical score higher than the value displayed in the window. The player would pay $1 to play; if the player achieved a score higher than the one displayed, he or she would win $1 (plus the $1 bet). If the player did not achieve a higher score, he or she would lose the original dollar bet. My task was to determine a fair score for the player to outperform. The client wanted to roll out the game variant as soon as possible and only wanted to pay limited funds to develop the scoring system. This is actually a difficult research-level problem. A player's game scores follow a complicated stochastic process in which the player's ability is clearly changing over time as the player improves. The computation of the game score (which depends on the particular solitaire game) typically produces multimodal score distributions, mainly because a player will outright fail some fraction of the time. Determining a score at which a player is predicted to outperform 50% of the time is therefore a serious challenge with limited time and resources. A conventional approach to the problem might involve modeling the game scores through a nonstationary time series model. In such models, a reasonable estimate of the current mean can be obtained by exponential smoothing of the data, with more recent game scores being weighted more heavily. But the current application really involves estimating the median, the game score at which the probability of outperforming is 50%. I therefore proposed the following simple solution. For the most recent n game scores, let w (a value less than 1) be the weight attached to the most recent score, let w^2 be the weight attached to the next most recent score, and so on down to w^n as the weight of the nth most recent score. Now view the n scores and their weights as a discrete probability distribution (after standardizing the weights to sum to 1). The score that the player is offered to outperform is the median of this discrete probability distribution.
Algorithms for computing medians of discrete distributions, which involve interpolating between adjacent values, are standard in statistics software packages. Once the company accepted the general solution of a weighted median, follow-up work involved determining optimal values of n and w that produced reliable weighted medians, based on the wealth of data the company had collected on player game outcomes.
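As a concrete illustration of the weighted-median solution just described, here is a minimal sketch. It is my own illustration rather than the code delivered to the client; the function name, the example scores, and the value of w are invented, and, as noted above, a production implementation would also interpolate between adjacent scores.

```python
import numpy as np

def weighted_median(scores, w):
    """Median of the discrete distribution that puts weight w**k on the k-th most
    recent score (k = 1 is the most recent), with the weights standardized to sum
    to 1. `scores` is ordered from most recent to oldest."""
    scores = np.asarray(scores, dtype=float)
    weights = w ** np.arange(1, len(scores) + 1)
    weights /= weights.sum()                  # standardize the weights to sum to 1

    order = np.argsort(scores)                # sort the scores, carrying weights along
    sorted_scores = scores[order]
    cum = np.cumsum(weights[order])
    return sorted_scores[np.searchsorted(cum, 0.5)]  # first score whose cumulative weight reaches 0.5

# Example: six recent game scores (most recent first) and a weight of w = 0.9
print(weighted_median([4200, 3100, 5050, 2800, 4700, 3900], 0.9))
```

Because the weights decay geometrically, the offered score tracks the player's recent form while older games fade in influence, which is exactly the behavior the client wanted.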
Success and Simplification Many statistical consulting situations exist in which it is important to begin with the most analytically defensible approach and stick with it—expert witnessing in legal cases serves as a compelling example of when the opposition would attack the statistical expert for making approximations. But in those situations, resources and time constraints are not as critical. When time is of the essence, and resources and funding are limited, a statistical consulting client can be well served if the consultant is willing to make judicious simplifications to the most principled methods that the consultant relies on in his or her academic research.
Methods Research on a Limited Budget Justine Shults is associate professor in the department of biostatistics and epidemiology in the University of Pennsylvania School of Medicine. She has received National Science Foundation and National Institutes of Health funding to develop improved statistical methods for longitudinal data analysis and is coauthoring a book on quasi-least squares regression. Sarah J. Ratcliffe is assistant professor of biostatistics at the University of Pennsylvania School of Medicine and an associate editor for The American Statistician. Her fields of statistical activities include models for informative dropout, functional data analysis, and general statistical collaboration in a variety of medical areas.
As biostatisticians in a medical school, we teach and mentor students, collaborate on grants and manuscripts with medical colleagues, attend faculty meetings, participate on various committees, and work on our own methods research. Like most statisticians, we find our efforts divided, and we often struggle to make time for our own work. Here, we provide some tips for doing methods research on a limited budget, which generally equates to limited time. Our advice should prove especially useful to new graduates who hope to obtain National Institutes of Health (NIH) funding.
Challenges to Development of an Independent Research Program Given the competing demands for our time, it can be tempting to delay the development of an independent research program. For example, it is often easier to respond to the immediate (and sometimes urgent) requests from a medical colleague than it is to work on tedious calculations or software development for a
methods paper. It also can be guilt-inducing to work on your own research when you view time spent on your own papers as time stolen from your students and clinical collaborators. For statisticians who receive all their income through consultation, the situation can be even worse: Any time spent on independent work has to come from family or personal time.
Benefits of Developing and Maintaining an Independent Research Program Although difficult, developing an independent research program has benefits—the first of which is the development of a particular area of expertise. This will lead to invitations to give talks, serve on study sections (committees to review grant proposals), and review papers—activities that are time-consuming, but will further deepen your knowledge and familiarity with the important emerging developments in your research area. Having an identified area of expertise also will improve your ability to obtain funding to support your work. For example, if you are a co-investigator on a grant application for a medical study that is to be submitted to the NIH, your expertise can strengthen the application. First-authoring your own publications, coupled with obtaining co-authorships on medical papers, is also a necessary first step to obtaining your own NIH funding, which is the best way to guarantee time for your work. In the long term, your efforts will (hopefully) have a positive effect on public health. For example, our work on quasi-least squares has been motivated by a desire to improve efficiency in generalized estimating equation analysis of longitudinal data. Improved efficiency can lessen the likelihood that an intervention is unfairly dismissed as ineffective and can lead to cost savings due to reductions in the sample size requirements. Other research on methods for longitudinal data with informative dropout has been used as part of the primary analysis in clinical studies and is regularly written into the analysis sections of grant proposals. Clinical colleagues are often excited to have the data used to further statistical methodology and good clinical practice and to collaborate with someone who can "handle" their dropout issues.
Several Tips Regarding Development of a Research Program Develop a plan. Here, it helps to keep in mind the "Five P" acronym that is popular in the U.S. Navy: prior planning prevents poor performance. We strongly recommend you develop a one-page plan comprising one to three aims that will each correspond to one (or more) manuscripts. A natural starting point for developing these aims is your dissertation. One reasonable approach would be to compare your methods (perhaps in analysis of data from your collaborative studies) with alternative approaches. As you read papers, attend talks, and interact with colleagues, you should be generating ideas to extend and create new methods. It is good to jot these ideas in a single location (repository) as a resource for developing your aims. For example, when you analyze the data as a collaborator on a medical study, you might identify characteristics of the data that aren't well addressed using existing approaches. Extending these approaches (or developing new methods) would potentially be a high-impact research problem.
Once you have taken time to carefully formulate your aims, you can expand them into a more detailed plan that could form the basis for an exploratory (R03) or larger-scale (R01) application. Invest time in searching on the Internet and reading about how to obtain NIH grant funding; many excellent websites are available that may not be tailored to statistical research, but will be useful nonetheless. For example, these sites will stress the importance of developing a detailed plan and timeline for writing the papers described in your aims. To lessen the time to a first publication, you might consider writing a review of a book that describes methods that will be useful in achieving your aims or a more focused research note that demonstrates and compares statistical methods for a clinical or applied research journal. Schedule time to implement your plan. You should always “pay yourself first.” Just as it doesn’t work to build savings by waiting until the end of the month to see if any money is left over, it usually doesn’t work to wait until the end of the day to see if any time is remaining to work on developing and implementing your research aims. To avoid becoming demoralized, be realistic in your timelines. This includes estimating the minimum time needed in each step of your research plan to see progress. Time should then be set aside in these blocks as often as possible. If you have limited time available, you could budget short daily time periods when writing the paper, but then half to a full day per month when programming. It is important to schedule your work during the time of day when you work best. Attempting to publish a paper once per year, or every two years, might be a realistic goal if you have many responsibilities. It will be helpful to take inventory of how you spend your time and, if necessary, make adjustments to your research plan. We are from Philadelphia, which is the city of the fictional character Rocky Balboa. In each of his many boxing movies, Rocky never complains that he has to take time from running and weight lifting to achieve “life work balance.” The same level of dedication and focus will be important for you when you are completing your first R01. In the long run, however, it will be necessary to maintain the activities that keep you mentally, physically, and spiritually healthy. If you are not healthy, your research (and life) will suffer. To accomplish everything, you might need to revise your timeline and to say no to requests for your time. Along with realistic goals, you need to be able to say no to others or refer them to someone else. Is the deadline as urgent as it is made out to be? Do you really need to be the one to drop everything to complete an analysis? You can consult with the senior members of your department or organization and your mentor(s) about what is most important to do in your career. After all, the senior personnel ultimately will be evaluating your progress, so it is to your benefit to involve them in these decisions. However, be sure to find out what is important for promotion in your department (even though most will stress that “there is no checklist for promotion”). Carefully study the achievements of those who were recently promoted and be sure you are primarily devoting time to the activities that will make you an attractive candidate for promotion. Be collaborative. To ensure success, it helps to involve your colleagues in the planning process. 
A mentor or two with experience in a position similar to yours, whether at your own
institution or another, could be helpful with initial advice. You should work hard to build relationships (via co-authorship on medical or statistical papers) with medical colleagues who work in the area to which your statistical methods will be applied because you will describe these relationships and the corresponding motivating studies in your grant applications. If the relationships were cobbled together for a particular application, this will usually be clear to the reviewers, who will then judge your grant as having little potential effect on public health—a key characteristic on which NIH grants are judged. An important caution is that some medical researchers are supported by clinical work and operate according to a "coauthorship OR funding model." This model does not work for statisticians, whose "clinical work" could be viewed as the time spent on analysis. Before starting any work on a medical paper or analysis, ask whether you will be included as a co-author and explain that including a statistical co-author will improve the likelihood that the paper will be published. It might be worthwhile to do some volunteer work to build these relationships, but there should be a clear expectation that your colleagues will include you as a co-investigator at an appropriate level of support when they submit their next R01. It is important to identify and collaborate with senior medical researchers who are able to obtain funding and junior researchers who have the promise of obtaining future funding. This will ensure that your time will be supported and your colleagues will be available long-term, because medical research is usually not possible without financial support. Be careful about how much time you give to collaborative research that does not further your own research plan without sufficient funding (or any funding at all) to cover your time. You should be clear about the level of funding needed and what is being provided before starting a project. In our department, we have a rule that is rarely violated: The acceptable minimum level of support is 5% annual effort for a relatively straightforward small study. We sometimes write a statement (that may be more detailed) in our analysis plans that we will develop new methods if doing so will lead to improved analyses for the grant. If we can argue for a higher level of support (say 20%), then writing these additional methods papers (which will strengthen the publication record of the grant) will be possible. As you become more senior, it also will be helpful to include students in your research. This could be as a funded position on a grant or part of a degree requirement. Additionally, it helps to attend conferences such as the Joint Statistical Meetings and ENAR Spring Meeting and to introduce yourself to other researchers who are working in your area. Further, it helps to be responsive to queries about your work. We have developed working relationships with statisticians we have met through email. In fact, one author is working on a book with a statistician whom she met via email when he wrote with questions about her research.
Summary Methodological research is important for your own reputation as an expert statistician. To be successful in this arena, you will need to be responsible for carving out time to undertake methods development. Without a plan, methodology research is sure to suffer. Thus, remember that prior planning prevents poor performance.
Developing Grant Proposals: Guidelines for Statisticians Collaborating Under Limited Resources Todd G. Nick is professor of pediatrics and biostatistics at the University of Arkansas for Medical Sciences, where he serves as the director of the Pediatric Biostatistics Program at Arkansas Children’s Hospital. He is chair of the Section on Statistical Consulting of the American Statistical Association. His fields of statistical activities include general statistical collaboration in a wide variety of areas in the health sciences. Ralph O’Brien, an American Statistical Association Fellow, is a professor of biostatistics at Case Western Reserve University. He is best known for his teaching and work in sample-size analysis for study planning.
Academic statisticians must juggle many activities: short-term consulting projects and long-term collaborative research, teaching, service to the institution and statistics profession, and their own research in developing and refining new statistical methods. In medical statistics, most of one’s salary—often 60% to 80%—must be covered by grants from the National Institutes of Health (NIH) and various private foundations. It is in our own best interest to collaborate efficiently and effectively when our subject-matter colleagues are developing grant applications. The primary goal of such work is to craft an outstanding “statistical considerations” section that cogently describes the data management plan, randomization scheme (if appropriate), statistical analysis plan, and sample-size analyses. It also outlines the professional team of database informaticists and statisticians that will handle the project when it is funded. However, what is often more critical is that the grant development process enables the statistician to play a key role in framing and refining the research questions, developing the proposed study design, and choosing the right measures. It is good for every institution to provide ample support for its collaborating statisticians to develop outstanding grant applications. Unfortunately, statisticians often have too little time for such activities, which fosters unsuccessful relationships with investigators and thus weaker applications. Accordingly, statisticians must use effective technical and professional strategies for developing grant sections, and their home department or division must have sound policies in place to allocate limited time and resources to those proposals that have the greatest chance of success. We offer the following recommendations for statisticians who spend time developing grant applications.
Plan Ahead In crafting statistical considerations sections, follow a checklist. Learn from and adapt material from previous outstanding applications.
Follow a Checklist Use a checklist of what may need to be included in the proposal. For example, consult St. George's University of London Statistics Guide for Research Grant Applicants at www.sgul.ac.uk. Use Previous Applications To learn and re-learn what reviewers like and dislike, create a repository of statistical considerations sections with the reviews they received. Better statistical groups will have central online password-protected repositories where their members can upload and download such material easily. You may find previous "winning" sections that can serve as effective starting places to craft new sections that will be even better. Any funding acquired by an individual statistician benefits the entire group. Shamelessly, we recommend the fictional statistical considerations section and related commentary written by the second author and his colleagues, available at www.amstat.org/publications/chance/supplemental.cfm. It describes some of the essential features of the statistical analysis plan (SAP), data management plan, and sample-size analyses. The SAP must fuse the research questions with the statistical techniques and strategies. When proposals have specific aims and tight hypotheses or other clear research questions, the SAP can be organized around each of the aims, with a description of how each research question will be handled using sound and modern statistical strategies. Success here depends on writing for two types of reviewers. Nonstatistician reviewers need to absorb the "big picture" of the SAP. A reviewer who is a professional or para-professional statistician will favor seeing something innovative in the approach, assuring that the statistical work is not going to be pedestrian. Of course, the sample-size analyses must be congruent with the SAP. Critically, the conjectures for all the unknowns need to be specified exactly, and they need to reflect what is currently known about the science at hand, as summarized in other parts of the full application. This includes citing specific articles and/or summarizing analyses of preliminary data. For the data management plan, describe the roles of the informaticists and thus how the data will be collected, checked, entered, managed, exported, and secured. Reviewers need to be convinced that the proposed team is solid and fully committed. The statistical considerations section acts like an infomercial for how all the data will be handled, from collection to analysis to interpretation to presentation. An outstanding statistics section cannot save an otherwise weak proposal, but a weak one can doom an otherwise outstanding proposal.
Develop Policy Every group of statisticians should have policies and procedures that govern how its members deal with grant applications. An excellent example can be found on the website for the department of biostatistics in the school of medicine at Vanderbilt University, http://biostat.mc.vanderbilt.edu/wiki/Main/GrantPolicies. These policies describe who they serve, their requirements for advance notice, the appropriate use of statisticians' names, and suitable percent efforts. Here, we mention some important factors.
Require Sufficient Lead Time Successful statisticians need sufficient lead time for planning and developing the statistical components of a grant proposal. Even if an institution supports 20% of a statistician's effort to develop new studies, 80% of his or her effort needs to go to those studies already funded and to teaching, etc. Let's assume you are asked to help develop two grant proposals, one that requires moderate effort—say 50 hours of work (e.g., an exploratory grant that requires writing a simulation program to assess sample size for some nonstandard statistical method)—and another that requires 80 hours of work (e.g., a discrete, specified long-term project that will involve data from multiple clinical sites, interim analyses, and multiple imputation of missing data). Let's assume you work an average of 50 hours per week, with 10 hours per week (20%) available for grant development. Combined, these grant proposals will require 130 hours of your time. Therefore, you will need 13 weeks of lead time. Keep this in mind when deciding whether to accept any new grant development role. Track Hours Spent on Grant Development Keep detailed records of how many hours are worked on each grant developed. Statisticians should be the first to realize that having real data can help greatly in justifying such efforts to institutional leaders. If one's allocated effort (e.g., 15% effort) falls below his or her actual effort (e.g., 20%), then the case can be made for increased institutional funding. Request Realistic and Adequate Funding on Budget It is all too typical for statisticians' funded efforts on grants to be substantially less than the actual efforts required to ensure timeliness and excellence. Discuss the necessary efforts early in the process, when there is still time to negotiate. You should account for estimated personnel, specialized software, travel, and other operating expenses. Actual funding often falls short of what was requested. Will 80% of the proposed budget for statistics be adequate? If not, then you will need to re-negotiate the statistical budget with the principal investigator. If there will be little flexibility to increase funds for statistical support, you and the principal investigator will need to trim the aims to balance funding received with effort required. Besides, many proposals are overly ambitious—a common criticism from reviewers. It is better to allocate different percent efforts over each stage of the grant than to have a constant percent effort over the whole funding period. For example, the typical five-year NIH grant requires more statistical effort at the start and end of the period and less in the middle. In addition, efforts for master's-level statisticians may be minor in the early stages and heavy at the end. You might say "In accord with the time required, Jane Doe, MS, will devote 10% effort during years 1–3, 30% during year four, and 50% during year five." What is an appropriate percent effort? Suppose you are considering one at 10%. Ask yourself, "If I had 10 of these 10% projects as my full-time (100%) job, what would my life be like?" Ask the principal investigator, "If I work one week solid on your project, are you going to be upset if I do not work on your project for another nine weeks?" In the majority of cases, neither the statistician nor the principal investigator is happy with a 10% effort by the main statistician on a project.
Come to agreement with the principal investigator on the general budget and budget justification as soon as you are able to scope out the work required. Should this not go favorably, you will have time to negotiate a reduced level of commitment to the project, maybe just serving in a consulting role at 5% (minimum) effort, or even declining further involvement altogether. Delaying this negotiation hurts all concerned.
Conclusion As W. Edwards Deming once said, "Statisticians have no magic touch." Everything in the statistical considerations section rests on the scientific substance and the quality of argument laid out in other portions of the proposal—its specific aims, research design, measures, and the prior studies and pilot data. Only clear and focused specific aims will beget detailed statistical plans. Relevant data from prior studies, especially pilot data for the study in question, will enable a sound, credible, and convincing sample-size analysis. On the other hand, having no preliminary data to support a sample-size analysis will render it a nearly valueless—if not deceptive—exercise in statistical gamesmanship, which enlightened reviewers will criticize because it has no place in producing good science. For grant proposals that are not yet formulated well, the statistician should indicate clearly that she or he will not be able to make a major contribution. Statisticians need to expend their creativity and time on those proposals that have serious merit and that need our help. This role often requires us to get the entire research team to agree, tightly specify how the study will be performed, and firmly quantify what it hopes and expects to find. This requires heavy involvement by the statistician, certainly not just a few hours. However, because our time is so limited, we need to be efficient and resourceful. We hope this brief homily helps. There are no shortcuts to a job well done. An entire grant proposal can be strengthened when the statistician collaborates fully. However, the onus is on statisticians to repeatedly demonstrate that the amount of effort needed to produce high-quality applications is worth the costs. Investigators who experience this will be eager to form long-term collaborative ties, and they will encourage their leaders to provide ample institutional support for all statistical activities, including grant development.
Scale Development on a Limited Budget Rick Ittenbach is an associate professor of pediatrics at Cincinnati Children’s Hospital Medical Center and the University of Cincinnati College of Medicine. He is actively involved in the Section on Statistical Consulting and collaborates on a range of statistical and design-related issues in biomedical and biobehavioral research.
Statisticians are increasingly being asked to consult on projects that extend their reach into new areas. Scale development, or the development of a composite measure from a series of related items, offers one such example. According to a recent study commissioned by the American Statistical
Association’s Section on Statistical Consulting, more than half of all responding statisticians indicated they either had worked on or consulted on a scale development (SD) project in the past year. For many statisticians, SD projects represent a completely new area of practice—one that they were not trained for in school. Here, I will identify some of the key issues encountered by statisticians when consulting on SD projects and offer recommendations for practice within the context of a limited budget.
Importance of the Process Statistical models represent the bread and butter of a statistician's work. But within the context of SD, statistical models are most often not used until later in the sequence. Following is a brief listing of the key steps in the SD sequence:
1. Concept Development
2. Item Development
3. Field Testing
4. Item Analysis
5. Subscale Development
6. Standardization
7. Post-Development Validation
The more important the measure, the longer the process takes. Whether one is consulting on a brief measure of trauma for use in the ER or a measure of voluntariness for use with parents, developing new measures in a scientifically rigorous way takes time. The longer the process takes, the more important it is to keep an eye on escalating costs. Well-Understood Costs Most statisticians can articulate the cost-based side of their practice: hardware, software, consumables. And, depending on their experience, most can talk in general terms about the importance of accurate sample sizes and the costs of having too few or too many subjects. Some can even discuss the various administrative and operating costs associated with consulting. These costs are all fairly well known, and many consulting statisticians simply choose to try to do more with less, such as when they put off getting a new computer or when they elect to stay with older, less expensive software. There are, however, a number of other less well-known costs that are worth considering when consulting in SD. Less Well-Known Costs Statisticians are not likely to be needed in all aspects of the SD process, but they should be familiar with the entire sequence, if only to know where they can and cannot contribute most effectively. The statistical techniques used in SD projects are fairly straightforward, yet the terms are often different enough to give one pause—as if we might be trespassing into foreign territory. Consulting statisticians should rest assured that SD and its required skill sets are well within reach of traditional statisticians.
New learning
Unless one was trained in SD, new learning will be required. The good news is that until the mid-1980s, SD was primarily based on true score theory—the notion that one's score is a function of both true ability and error: X = T + E. Clearly, one can see the relationship with the general linear model; hence, the techniques used were largely correlations and linear models, with a special focus on the reliability of a given measure. More recently, modern theory techniques such as Rasch analysis and item response theory (IRT) have been well received. These techniques are based on generalized linear models. The caveat for statisticians is that these are item-level techniques, rather than techniques appropriate for derived, continuous scales. While the learning itself is not formidable, the context in which it occurs may be different enough to warrant the consultant hiring his or her own consultant in the area of tests and measurement, or even standardized testing. Most colleges and universities have faculty in schools of education or departments of psychology that can be helpful with the conceptual parts of the SD process. Although formal courses are typically only taught at the larger universities, short courses at the National Council on Measurement in Education (NCME) and American Educational Research Association's (AERA) annual meetings, as well as Peter Bruce's continuing education website (www.statistics.com), can be helpful.
Specialized techniques
The area most often known as psychometrics draws from the disciplines of psychology, education, and statistics. Increasingly, statisticians are becoming familiar with latent variable modeling strategies such as factor analysis, cluster analysis, latent class analysis, and structural equation modeling. But these techniques continue to be based on entire scales with traditional iid and normality assumptions. The modern theory techniques are 1- (Rasch analysis) and 3- (IRT) parameter models that require putting everything onto a log scale so both the person's likelihood of getting an item right and each item's difficulty level can be evaluated using a common scale. Statisticians familiar with logistic regression should have little trouble learning either of the two aforementioned techniques, the point of which is to estimate the probability of a specific response or preference, as a function of the person's latent ability and the item's (a) discrimination, (b) difficulty, and (c) guessing parameters.
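To make the connection with logistic regression concrete, here is a small sketch of the item response functions just described. It is purely illustrative: the a/b/c notation follows the usual discrimination/difficulty/guessing convention, and the numerical values are invented.

```python
import math

def three_pl(theta, a, b, c):
    """Three-parameter logistic (IRT) model: probability of a correct response for a
    person with ability theta on an item with discrimination a, difficulty b, and
    guessing parameter c. theta and b are on the same logit scale."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def rasch(theta, b):
    """Rasch (one-parameter) model: discrimination fixed at 1 and no guessing."""
    return three_pl(theta, a=1.0, b=b, c=0.0)

# Illustrative values: an item of average difficulty, moderate discrimination, some guessing
for theta in (-2.0, 0.0, 2.0):
    print(theta, round(rasch(theta, b=0.0), 3), round(three_pl(theta, a=1.5, b=0.0, c=0.2), 3))
```

The Rasch curve is exactly a logistic regression of a correct response on ability, which is why statisticians comfortable with generalized linear models tend to find these techniques approachable.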
The difference between Rasch analysis and IRT is a conceptual one, and it focuses on the importance of "guessing" and "discrimination" as informative factors. My recommendation would be for the consultant who wishes to move into this area to consider taking a formal course in modern theory techniques; the material is not onerous, but it is not trivial either.
Specialized software
Although workarounds exist in most conventional software packages such as SAS, SPSS, and STATA, it has been my experience that it is easier and more efficient to use software designed specifically for modern theory techniques. Because of its open-source nature, R seems to be the lone exception.
When it comes to specialized software, the packages are not prohibitively expensive. Packages such as Bi-Log, RUMM, Winbugs, and Winsteps can often be purchased for less than $250. When it comes to latent variable modeling software, such as structural equation modeling (SEM), we typically have the same issues with software integration that we have with modern theory techniques. Hence, my recommendation here is to purchase specialized software such as EQS, AMOS, or LISREL. SAS can handle SEM, but the program code is cumbersome. At this point, SPSS and STATA do not seem to be able to handle SEM at all. When it comes to maximizing capability while minimizing cost, vendors of both types of software will often allow users to download programs for temporary use. In some cases, these trial versions are fully functional.
Recruitment/sample size
Statisticians are generally not responsible for recruitment, but they are responsible for generating sample size estimates. Sometimes, the statistician will focus on the overall sample size, rather than cell-specific sample sizes. Often, an SD team will be willing to take all comers in the hopes that all cells will fill themselves appropriately. The more typical scenario is that a study team recruits superfluous numbers of some samples and falls short of their recruiting goals for other samples, threatening the integrity of the measure they are developing. In a recent SD project in which we were developing a new measure of health care beliefs, we had to expand our catchment area and our timeline by a year just to get enough healthy controls. Healthy young adults simply were not going to the doctor at the same rate as young adults who had survived cancer. Whereas we initially thought recruiting healthy controls would be easy, they ended up being the more difficult group to recruit.
Development of templates
Few things are more cost-effective for the statistical consultant than having templates on hand to help structure their work. Take, for example, Statistical Analysis Plans (SAPs) and Data Management Plans (DMPs). The SAP allows the statistician to systematize the more routine parts of a study, allowing everyone to review in writing and then confirm among one another the important analytical parts of a study—so there is no second guessing down the road. The advantage of a DMP is that the team gets to identify explicitly how they will obtain, enter, audit, and protect their data. The extra time spent planning will more than pay for itself in time saved at the end. The same holds true for SD studies. Assuming a scale is only as good as its validity studies indicate, why not choose to design in the all-important follow-up studies before the instrument is fully built? In short, isn't it better to know where you are going before you leave home than to figure it out after you have left? Whether it is an SAP, a DMP, or a post-development validation study, templates can provide the consultant with yet another tool to streamline their efforts and, in so doing, minimize their costs.
Further Reading
Short-Term Statistical Consulting with Limited Time and Resources
Bradley, R. A., and M. E. Terry. 1952. Rank analysis of incomplete block designs, I. The method of paired comparisons. Biometrika 39:324–345.
Glickman, M. E. 1999. Parameter estimation in large dynamic paired comparison experiments. Applied Statistics 48:377–394.
Methods Research on a Limited Budget
Derr, J. 2008. Statistical consulting in a university setting: Don't forget the students. CHANCE 21:38–39.
Developing Grant Proposals: Guidelines for Statisticians Collaborating Under Limited Resources
Inouye, S. K., and D. A. Fiellin. 2005. An evidence-based guide to writing grant proposals for clinical research. Annals of Internal Medicine 142:274–282.
Lesser, M. L., and R. A. Parker. 1995. The biostatistician in medical research: Allocating time and effort. Statistics in Medicine 14:1683–1692.
Parker, R. A. 2000. Estimating the value of an internal biostatistical consulting service. Statistics in Medicine 19:2131–2145.
Sherrill, J. T., D. I. Sommers, A. A. Nierenberg, A. C. Leon, S. Arndt, K. Bandeen-Roche, J. Greenhouse, D. Guthrie, S. L. Normand, K. A. Phillips, M. K. Shear, and R. Woolson. 2009. Integrating statistical and clinical research elements in intervention-related grant applications: Summary from an NIMH workshop. Academic Psychiatry 33(3):221–228.
Scale Development on a Limited Budget
Boen, J. R., and D. A. Zahn. 1982. The human side of statistical consulting. Belmont, CA: Lifetime Learning.
Crocker, L., and J. Algina. 2008. Introduction to classical and modern test theory. Mason, OH: Cengage Learning. Reprint of 2006 edition by Wadsworth.
Derr, J. 2000. Statistical consulting: A guide to effective communication. Pacific Grove, CA: Duxbury Thomson Learning.
DeVellis, R. F. 2003. Scale development: Theory and application (2nd ed.). Thousand Oaks, CA: Sage.
Gullion, C. M., and N. Berman. 2006. What statistical consultants do: Report of a survey. The American Statistician 60(2):130–138.
Johnson, H. D. 2008. Compensation for statistical consulting services: Approaches used at four U.S. universities. CHANCE 21(2):31–35.
Kirk, R. E. 1991. Statistical consulting in a university: Dealing with people and other challenges. The American Statistician 45:28–34.
Louis, T. A. 2008. Compensation for statistical consulting services: Observations and recommendations. CHANCE 21(2):36–37.
Taplin, R. H. 2003. Teaching statistical consulting before statistical methodology. Australian & New Zealand Journal of Statistics 45:141–152.
Null and Vetoed: Chance Coincidence? Governor Schwarzenegger’s acrostic veto Philip B. Stark
—MEMO—
To the Members of the California State Assembly:
I am returning Assembly Bill 1176 without my signature.
For some time now I have lamented the fact that major issues are overlooked while many
unnecessary bills come to me for consideration. Water reform, prison reform, and health
care are major issues my Administration has brought to the table, but the Legislature just
kicks the can down the alley. Yet another legislative year has come and gone without the major reforms Californians
overwhelmingly deserve. In light of this, and after careful consideration, I believe it is
unnecessary to sign this measure at this time.
Sincerely,
Arnold Schwarzenegger
In late October 2009, California Governor Arnold Schwarzenegger vetoed Assembly Bill 1176 (Ammiano), sending the memo above. The first letters of the third through ninth lines (ignoring the blank lines) comprise an expletive acrostic. Many have wondered whether this could be accidental. The governor's office claimed the acrostic is "a wild coincidence." Some have said the chance the acrostic would occur accidentally is 1 in 10 million, but that calculation depends on rather contrived assumptions. I do not think there is an answer to the question, "What is the chance the acrostic would occur accidentally?" In part, this is because there is no single sensible chance model for the wording of a veto. The issue is the framing of an appropriate "null hypothesis" under which to compute the chance. While I am sure that Gov. Schwarzenegger chooses his words carefully, this colorful example invites comparing a variety of null models for how the veto might have been worded "at random." To put the question into a statistical framework, we ask, "If a seven-line message had been worded randomly, what is the chance that particular acrostic would have occurred?" (We ignore for the moment that many other acrostics would have triggered a similar media response (e.g.,
b-u-g-g-e-r o-f-f). We also ignore that the governor vetoes many bills. Multiplicity issues such as these could greatly increase the calculated probabilities.) “Randomly” does not mean much by itself. We could concoct any number of models for wording the message at random. Here, I consider six. They give probabilities that span more than eight orders of magnitude.
A Selection of Null Models What follows is a handful of probability models one might use to calculate the chance that the acrostic is an accident. None is terribly compelling.
Monkeys Banging on Typewriters If seven lines were typed with each character chosen at random, independently, from the 26 letters of the alphabet (ignoring case, spaces, numbers, and punctuation), the chance that the first letter of those seven lines would spell the acrostic is
(1/26)^7 = 1.245 × 10^−10, (1)
(i.e., 1 in 8,031,810,176).
Random Words from the Project Gutenberg Corpus Presumably, the governor was constrained to use English words in his veto. The frequency of initial letters of English words is not uniform. According to Wikipedia (http://en.wikipedia.org/wiki/Letter_frequency), the relative frequencies of the relevant initial letters in the Project Gutenberg corpus (www.gutenberg.org/wiki/Main_Page) are as shown in Table 1.
Table 1—Initial Letter Frequencies from the Project Gutenberg Corpus
Letter   Frequency
c        0.03511
f        0.03779
k        0.00690
o        0.06264
u        0.01487
y        0.01620
If seven words were chosen from the Gutenberg corpus at random, independently, to start the seven lines, the chance their initial letters would comprise the acrostic is
0.03779 × 0.01487 × 0.03511 × 0.00690 × 0.01620 × 0.06264 × 0.01487 = 2.054 × 10^−12, (2)
(i.e., 1 in 486,804,391,348).
Permuting the Lines Suppose the seven lines were given and the order shuffled. What is the chance the letters would come out in an order that comprises the acrostic? Two of the lines start with u and the rest start with distinct letters. The chance is
2/7! = 2/5,040, (3)
(i.e., 1 in 2,520).
Permuting the Words Suppose we hold the governor to his words, but not to their order. That is, suppose we took the 85 words that comprise the seven lines, wrote them each on a card, shuffled the cards well to put them in a random order, then dealt them sequentially. We keep the number of words in each line fixed, so there are line breaks after the 16th word, the 29th word, etc. What would the chance be that the initial letter on each line would comprise the acrostic? Those 85 words include the relevant initial letters with the counts shown in Table 2.
Table 2—Counts of Initial Letters Among Words in the Veto
Letter   Count
c        8
f        3
k        1
o        3
u        2
y        2
There are 85! sequences in which the words can be dealt (not all are distinguishable because some words occur more than once). For the order to spell the acrostic, the first word must be one of the three words that start with f, the 17th word must be one of the two words that start with u, etc. Once those seven words are specified, the remaining 78 can occur in any order. Thus, the probability of drawing the 85 words in an order that gives the acrostic is
(3 × 2 × 8 × 1 × 2 × 3 × 1) × 78!/85! ≈ 1.158 × 10^−11, (4)
(i.e., 1 in 86,377,328,100).
Permuting Words Within Lines We might hold the set of words in each line fixed (acknowledging that the ideas expressed by the words might have a required order) and permute the order of the words within lines randomly. Conceptually, each word on a line is written on a card, and then those cards are shuffled well and dealt in a sequence. The number of words on each line and the number that start with the requisite letter are in Table 3.
Table 3—Number of Words in Each Line in the Veto and Number in That Line Beginning with the Letter Required to Spell the Acrostic
Line   Words   Letter   Count
1      16      f        2
2      13      u        1
3      15      c        1
4      6       k        1
5      13      y        2
6      14      o        1
7      8       u        1
The number of ways the lines could be internally reordered is
16! × 13! × 15! × 6! × 13! × 14! × 8!. (5)
Not all of those are distinguishable, because some words occur more than once on a single line. To have the acrostic, one of the correct letters must be in the first position on each line. The number of orderings that give the acrostic is
2 × 15! × 12! × 14! × 5! × 2 × 12! × 13! × 7!. (6)
Under this null model of shuffling the words in each line, the probability of generating the acrostic is
(2 × 15! × 12! × 14! × 5! × 2 × 12! × 13! × 7!)/(16! × 13! × 15! × 6! × 13! × 14! × 8!) ≈ 1.468 × 10^−7, (7)
(i.e., 1 in 6,814,080).
Random Line Breaks Suppose the governor's words were kept in the order in which they actually occurred, but the 85 words were divided by line breaks into seven lines, each with at least one word. Suppose that every way of breaking the lines was equally likely. What is the chance the first letters on the seven resulting lines would spell the acrostic? To end up with seven lines, six line breaks must be inserted. A line break may be inserted before the second word, the third word, …, or before the 85th word. There are thus 84 places into which six line breaks must be inserted, (84 choose 6) equally likely partitions. Of those, only 12 produce the acrostic (the break to produce u at the beginning of the second line can be in only one place; the break to produce c at the beginning of the third line can happen in any of three places; the break for the k can be in only one place, etc.). Hence, the chance that this randomization scheme would produce the acrostic is
12/(84 choose 6) ≈ 2.952 × 10^−8, (8)
(i.e., 1 in 33,873,462).
Other Models We might keep phrases together, shuffle them, and see where the new line breaks fall. We might allow some words or phrases to substitute for others that express the same idea. There are countless null models we could postulate. None is compelling.
Bayesian Approaches These calculations all give the probability that the acrostic would occur, if it is an accident, for different ways of generating the veto at random. Some might prefer to know the chance the acrostic is an accident. I don't think there is any good way to answer that question. An answer would require a prior probability that the acrostic is accidental, as well as the probability that the acrostic would occur if the veto were generated innocently. We have already seen that the second probability is a problem. The first is, too. For instance, suppose we think there is a 10% chance the governor would do this sort of thing on purpose. Suppose we also think that if the governor did not do this on purpose, the biggest chance the acrostic would come up accidentally is 1 in 2,520. Then, the conditional probability that the governor did this on purpose given that the acrostic occurred is
0.1/(0.1 + 0.9 × (1/2,520)) ≈ 0.996. (9)
On the other hand, if we think there's only a 1 in 1,000 chance that the governor would do this sort of thing on purpose, then
0.001/(0.001 + 0.999 × (1/2,520)) ≈ 0.72. (10)
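The combinatorial arguments above are easy to check by direct computation. The short script below is a verification sketch of my own (it is not part of the original article); it simply recomputes equations (1) through (10) from the counts in Tables 1–3 and the stated priors.

```python
from math import comb, factorial, prod

# Initial-letter frequencies (Table 1) and the acrostic's letters in order
freq = {'f': 0.03779, 'u': 0.01487, 'c': 0.03511, 'k': 0.00690, 'y': 0.01620, 'o': 0.06264}
acrostic = ['f', 'u', 'c', 'k', 'y', 'o', 'u']

monkeys = (1 / 26) ** 7                                  # equation (1)
gutenberg = prod(freq[letter] for letter in acrostic)    # equation (2)
permute_lines = 2 / factorial(7)                         # equation (3): only the two u-lines can swap

# Equation (4): 85 words containing 3 f's, 2 u's, 8 c's, 1 k, 2 y's, and 3 o's (Table 2)
permute_words = (3 * 2 * 8 * 1 * 2 * 3 * 1) / prod(range(79, 86))   # denominator is 85!/78!

# Equations (5)-(7): shuffling words within each line, using the counts in Table 3
favorable = 2 * factorial(15) * factorial(12) * factorial(14) * factorial(5) \
    * 2 * factorial(12) * factorial(13) * factorial(7)
total = prod(factorial(n) for n in (16, 13, 15, 6, 13, 14, 8))
permute_within = favorable / total

line_breaks = 12 / comb(84, 6)                           # equation (8)

for name, p in [("monkeys", monkeys), ("Gutenberg words", gutenberg),
                ("permuted lines", permute_lines), ("permuted words", permute_words),
                ("within-line shuffles", permute_within), ("random line breaks", line_breaks)]:
    print(f"{name:22s} 1 in {1 / p:,.0f}")

# Equations (9)-(10): posterior chance the acrostic was deliberate,
# taking the probability of the acrostic to be 1 if it was deliberate
def posterior(prior, p_accident=1 / 2520):
    return prior / (prior + (1 - prior) * p_accident)

print(round(posterior(0.10), 3), round(posterior(0.001), 3))
```

Run as-is, the script reproduces the "1 in …" figures quoted above and posterior probabilities of roughly 0.996 and 0.72 for the two priors considered.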
Recap and Discussion The null hypothesis for testing “coincidences” matters. In this example, it is easy to get a wide spectrum of values for the probability of a coincidence. In the six calculations here, the probability ranges from about one in a couple of thousand to one in 487 billion, a factor of nearly 200 million—more than eight orders of magnitude.
In my opinion, there is no objective way to get a prior probability, nor is there a good way to upper bound the chance the acrostic would happen accidentally. Different priors and likelihoods give different posteriors. Questioning the Assumptions I don't think any of the sets of assumptions behind the numbers presented above is realistic. None takes into account the fact that the message must be composed of sentences, sentences that make sense in the context of a veto. Perhaps a better null model would pull full sentences at random from Gov. Schwarzenegger's other vetoes, string them together, and see where the line breaks fell. Moreover, the computations presented here take the number of lines to be fixed, equal to seven, but the veto could have had more or fewer. The probability calculations do not take into account the number of vetoes the governor has written in all. The larger the number, the greater the chance this could happen accidentally.
Prison reform, water reform, and health care are major issues the Legislature has overlooked, issues my Administration brought to the table. But while I have lamented of unnecessary reforms for some time now, yet another legislative year has come and gone, and without major bills to sign. I believe this after careful consideration. Overwhelmingly, Californians deserve this consideration—the major light that can, in fact, measure the unnecessary. Many are come to me down the alley at this time. It is just for the kicks.
Yet another legislative year has come and gone without the major reforms Californians overwhelmingly deserve. In light of this, and after careful consideration, I believe it is unnecessary to sign this measure at this time. Sadly, I have for some time lamented the fact that major issues are overlooked while many unnecessary bills come to me for consideration. Water reform, prison reform, and health care are major issues my Administration has brought to the table, but the Legislature just kicks the can down the alley.
Nor do the calculations consider other acrostic expletives that might have occurred. For instance, the same 85 words in a different order are on the previous page. Strained, yes. But it parses. Above are the 85 words, with one substitution. Either of those re-writes might have triggered media attention like that received by the actual veto, but such alternative acrostics are not contemplated in the calculations above. However, they increase the chance of expletives under most of these models. Similarly, the calculations do not consider acrostics formed from the last letter in each line, etc. Is There a Better Approach? It might make more sense to look at the actual phrases in the veto to see whether they seem natural or strained. For instance, "kicks the can down the alley" is unusual—I'm not quite sure what it means—and it provides the rarest of the initial letters, the k. Has the governor used that phrase elsewhere? "Overwhelmingly deserve" also strikes me as odd. In the construction, "water reform, prison reform, and health care," it
seems more natural and parallel to say “health care reform,” but that would move the next line break so that the (rare) k would not be in the first position. In my opinion, such “forensic text analysis” is more persuasive than the probability calculations, which are at least as contrived as the veto.
Conclusion You don’t get probability out without putting probability in. In effect, you are making up the probability of a coincidence by inventing the null model for producing the coincidence. Calculations of the chance of a coincidence are based on strong assumptions about how the coincidence might have been generated, and those assumptions can be quite unrealistic. Gubernatorial vetoes are not written by choosing letters, words, or sentences at random. News consumers should be wary of calculations of the “chance” of a coincidence, regardless of the context.
Mick Has (Almost) Left the Building Michael Huber
Only once in the history of Yankee Stadium did a ball come close to leaving it (literally). "Mantle's Homer Subdues A's" was the headline for a game played on May 22, 1963, when the New York Yankees hosted the Kansas City Athletics in a night game at Yankee Stadium before a crowd of 9,727. According to John Drebinger of The New York Times, "Mickey Mantle belted one of the most powerful home run drives of his spectacular career." In the next paragraph, Drebinger continued, "First up in the last of the 11th with a score deadlocked at 7-all and a count of two balls and two strikes, the famed Switcher leaned into one of Carl [note: Fischer's first name was Bill] Fischer's fast ones and sent the ball soaring. It crashed against the upper façade of the right-field stand, which towers 108 feet above the playing field." The New York Post headline covering the same game read "Mick Has (Almost) Left the Building." Columnist Maury Allen wrote, "A funny thing happened on the way to a 7–0 laugher over the Kansas City A's last night. Mickey Mantle had to win it, 8–7, with the longest home run in his 13-year career." "It was the longest ball I ever hit," Mickey told the reporter. An accompanying photograph in the New York Post had distances as well, showing a trajectory with a distance from home plate to the façade of 475 feet (see Figure 1). How this 475-foot number is calculated is a mystery. The newspaper caption under the photo read, "Computer calculations suggest the ball would have traveled as far as 734 feet." A postscript in the paper states that a physics professor estimated the ball traveling 620 feet if unimpeded. It then goes on to report, "More recent computer calculations suggest the ball, which had not yet reached its apex, would have gone as far as 734 feet if not obstructed." A distance of 734 feet is a little more than two and one-third times the distance from home plate to the right-field foul pole. Could this be true?
Figure 1. Trajectory with distances of Mantle's home run (from the New York Post, May 23, 1963). Photo by the New York Post
Busting the Myths Lewis Early, on his website at www.themick.com/hardestball.html, claims the 734-foot distance is accurate and provides the following rationale:
How do we get 734 feet? We assumed that the ball was at its apex when it struck the façade. However, observers were unanimous in their opinion that the ball was still rising when it hit the façade. How do we determine how high the ball would have gone? In fact, we cannot. From this point forward, all numbers become estimates, depending upon how high we think the ball might have gone. A conservative estimate would be an additional 20 feet. Those 20 feet make a major difference. They cause our estimation of total distance to go up almost 100 feet, to the 734-foot number listed above. Is 20 feet higher a fair estimate? Those present when the ball was hit feel that it would have gone at least that much higher, and many feel that the 20-foot number is far too low.
Early continues, "To get a precise value, we must turn to calculus. There, we have a formula to determine distance (or range) more precisely. That formula is range = […], where v = velocity (estimated at 230 feet per second), and g = the gravitational constant. Using the 117-foot value (the estimated height of the ball where it hit the façade) in the formula, we get a minimum distance of 740.095 feet and a maximum of 976.528 feet!"
The ball certainly would not have sailed 600 feet without the façade in the way, even if the flight of the ball was wind-aided. When a baseball undergoing spin moves through the air, in addition to the force of gravity acting on it, there are two other important forces, which are absent from the formula above. These are the drag force, F_D, and the Magnus force, F_M, which accounts for lift. These forces can be calculated with the following equations:
F_D = (1/2) C_D ρ A v^2 and F_M = (1/2) C_L ρ A v^2, (1)
where C_D is the drag coefficient, C_L is the lift coefficient, ρ is the air density, A is the cross-sectional area of the baseball, and v is the speed of the ball. The actual distances measured to where the ball struck the façade were a height of 108 feet and 1 inch above the ground and a straight-line (hypotenuse) distance of 374 feet (see Figure 2). Tony Morante, director of tours for the New York Yankees at Yankee Stadium, provided the same photograph as that in the New York Post, but with different measurements. Using Morante's values ("official," according to the New York Yankees), let's determine a more reliable potential distance traveled.
Figure 2. Trajectory with actual distances of Mantle's home run. Photo courtesy of Tony Morante, director of tours at Yankee Stadium
Facts and Assumptions
The Yankees and Athletics were playing a night game on May 22. The game lasted three hours and 13 minutes, which puts Mantle's at-bat somewhere in the 11:15 p.m. timeframe, since his home run was a walk-off game winner. According to The New York Times May 23 weather records, the temperature at 11:00 p.m. was 61 degrees, with 39% humidity, winds blowing from the west at eight miles per hour, and a steady barometer of 30.05 inches. Let's assume the air density is 1.23 kilograms per cubic meter, the ball is spinning at 2,500 rpm, and the Yankees' Hall of Fame center fielder hit the ball when it was three feet off the ground, so that it rose 105 feet and one inch by impact.

If he crushed the ball as hard as he could straight into the façade, it would have left Mantle's bat at an angle of 15.84 degrees, which is rare for batted balls that result in home runs. According to Robert K. Adair's The Physics of Baseball, the optimum angle for batted balls is about 35 degrees, though balls projected at 30–40 degrees could travel almost as far. If the ball was hit at an angle greater than 17 degrees, it could not have been rising when it struck the façade. Mantle was known to have an upper-cut-type swing when batting left-handed, which might increase the angle at which the ball traveled, relative to the ground.

Let's incorporate the speed at which the ball was hit. Fischer threw a fastball at Mantle. Assuming Fischer threw it at about 90 miles per hour and the ball came off the bat at an angle of 17 degrees, the ball would have had to leave the bat at more than 540 miles per hour to intersect, on a straight line in the absence of air resistance (drag), a point 374 feet away and 108.0833 feet off the ground. Incidentally, Fischer was not known to have an overpowering fastball. A more realistic angle might be 24 degrees. At this angle, the ball would have had to have a speed of about 138 miles per hour to be still rising. Mantle claimed that "it was the hardest ball I ever hit." In the absence of air resistance, the ball would have left home plate at a speed of about 156 miles per hour (as claimed by Early) and taken just under two seconds to hit the façade. Factor in air resistance, and the flight takes longer, but the ball doesn't travel as far. As the angle of trajectory increases, the "muzzle velocity" of the ball off the bat decreases.

According to Adair, a 75-mile-per-hour swing will send the ball off at about 115 miles per hour, which might be taken as a maximum speed. If Mantle, indeed, crushed the ball as hard as he could, the speed coming off the bat might be higher. However, there would still be drag. Without drag, the ball would travel along a symmetric parabola, covering the same distance beyond the apex as before it; drag instead shortens the trajectory on its way down.

So, back to the weather. A ball travels about six feet farther for every inch the barometer drops (it was steady and normal). Humidity has little effect on the ball, making it travel farther on a humid day by a few inches (humidity was low). A cooler evening causes the air to be denser and the ball to not travel as far. The wind was blowing to dead center (from the west) at approximately eight miles per hour. It is unknown whether the New York City winds were the same in Yankee Stadium, but let's assume they were.

Figure 3. Projected trajectory of Mantle's home run, with a vertical line at 358 feet (altitude in feet vs. total distance traveled in feet)
Winds are usually measured a few meters off the ground, so winds at a higher point (where the ball was sailing) could have been higher. A ball traveling down the right field line is about 45 degrees from center field, so the wind has a component of about 8 × cos(45°) ≈ 5.6 miles per hour along the ball's path. According to Adair, a 5.6-mile-per-hour tail wind would add about 17 feet to a 400-foot fly ball. In his book, Adair charts trajectories of baseballs projected at different angles off the bat and different velocities. He also states that the average ball hit at 35 degrees stays in the air for about five seconds. Finally, an uncertainty in the exact coefficient of drag translates into an uncertainty in the distance traveled: If the drag coefficient changes by 10%, a 400-foot home run might actually go 414 feet or 386 feet.

Alan Nathan, professor emeritus of physics at the University of Illinois, has done extensive research on long fly balls, and his data from thousands of plate appearances involving home runs have accurately characterized the quantity he calls speed off bat. For the first six weeks of 2009, 819 home runs were hit with a mean speed off bat of just over 101 miles per hour. But Mantle was not "average." Suppose he was 20% better than average. Let's allow his speed off bat to be 120 miles per hour (note: this is slightly
above Adair's maximum, but nowhere near the 230-feet-per-second (≈ 156 miles per hour) value offered by Early). How far, then, would the fabled mammoth blast have sent the ball? Using the Pythagorean theorem, Mantle's ground distance to the façade was about 358 feet. Following a 120-mile-per-hour trajectory chart in Adair's book until the ball reaches the ground, and factoring in drag, the ball would travel past 500 feet if the speed (120 mph) and angle (24 degrees) were optimal. Then, factor in the wind (5.6 mph down the right field line). A projected distance of 530 feet would be "in the ballpark" as a maximum prediction. If the initial velocity of the ball is lower, the angle is decreased, or the ball was struck lower (closer to home plate), the range will be lower.

Solving the force equations numerically with the ball spinning at 2,500 revolutions per minute, the distance traveled is approximately 531.6 feet. The famous ball's trajectory is shown in Figure 3. At a horizontal distance of 358 feet from home plate, the ball is about 107 feet, 6 inches, off the ground—close to the 108 feet, 1 inch, estimate of the Yankees. The ball would have been traveling for 3.45 seconds, and would have been descending—having reached a maximum altitude of 110 feet when 321 feet from home plate. The illusion of the ball still rising is just that: an illusion.
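The kind of numerical solution described above can be sketched in a few lines of code. The following is a rough illustration, not the article's actual computation: it integrates the 2-D equations of motion using the drag and Magnus forces of equation (1), with assumed coefficients (C_D = 0.35, C_L = 0.20) in the range commonly reported for a ball spinning at roughly 2,500 rpm, and a tailwind component along the ball's path.

```python
import numpy as np

# Minimal 2-D flight sketch with drag and Magnus lift (F = 0.5 * C * rho * A * v^2).
# C_D and C_L are assumptions, not values given in the article.
rho = 1.23            # air density, kg/m^3 (value used in the article)
A   = 0.00426         # cross-sectional area of a baseball, m^2
m   = 0.145           # mass of a baseball, kg
g   = 9.81            # gravitational acceleration, m/s^2
C_D, C_L = 0.35, 0.20

def trajectory(speed_mph, angle_deg, h0_ft=3.0, wind_mph=5.6, dt=0.001):
    """Integrate until the ball lands; returns (carry in feet, apex height in feet)."""
    v0 = speed_mph * 0.44704                    # mph -> m/s
    theta = np.radians(angle_deg)
    x, y = 0.0, h0_ft * 0.3048
    vx, vy = v0 * np.cos(theta), v0 * np.sin(theta)
    w = wind_mph * 0.44704                      # tailwind component along the ball's path
    apex = y
    while y > 0.0:
        rvx = vx - w                            # velocity relative to the moving air
        v = np.hypot(rvx, vy)
        Fd = 0.5 * C_D * rho * A * v**2         # drag, opposite the relative velocity
        Fm = 0.5 * C_L * rho * A * v**2         # Magnus force, perpendicular to it (lift)
        ax = (-Fd * rvx - Fm * vy) / (m * v)
        ay = (-Fd * vy + Fm * rvx) / (m * v) - g
        vx += ax * dt; vy += ay * dt
        x  += vx * dt; y  += vy * dt
        apex = max(apex, y)
    return x / 0.3048, apex / 0.3048            # convert back to feet

print(trajectory(120, 24))   # prints estimated carry and apex under these assumed coefficients
```

Because the drag and lift coefficients are assumed, the numbers this sketch produces will not match the 531.6-foot figure exactly, but the qualitative behavior is the same: drag shortens the flight, lift and the tailwind extend it, and the apex comes well before the façade.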
Figure 4. Distance estimates and estimated elevation at 358 feet from home plate based on height, ball speed, and angle
The ball only reached a maximum height of 110 feet off the ground, and that occurred before it crossed the outfield fence for a home run.
Simulating the Variables

Striking the stadium's façade requires that the ball's velocity be close to 120 miles per hour when leaving home plate, the initial angle of trajectory be about 24 degrees, and the ball's initial elevation be approximately three feet off the ground. Of these, the key variable is the ball's speed off the bat; higher velocity increases the ball's total horizontal distance.
Figure 4 shows maximum distance predictions for variations in initial height off the ground, bat speed, and angle off the bat. Notice that the optimum conditions mentioned earlier provide a horizontal distance of 531.6 feet and a height of 107.628 feet (close to 108 feet, 1 inch) at a distance of 358 feet from home plate. As the angle increases at high velocity, the maximum horizontal distance also rises, but the ball would miss striking the façade. The ball would truly have left the stadium. Rather than change the weather conditions, let’s simulate the ball’s maximum distance if the ball’s speed, angle, and initial elevation became variable. Assuming each variable is assigned a probability
distribution with appropriate means and standard deviations, we simulate 10,000 (speed, angle, elevation) triples using Monte Carlo methods. Three random numbers (Uniform[0,1]) are generated as probabilities. Each probability is then converted into a value using Excel's NORMINV command for the angle and elevation and the GAMMAINV command for the ball's velocity, as the 120-mile-per-hour speed is rare and better modeled with a gamma distribution. Means and standard deviations were chosen to allow a range of values for the speed between 100 and 120 miles per hour, for the angle between 22 and 26 degrees, and for the elevation between two and three feet above home plate (see Figure 4), such that 95% of all values fall within two standard deviations of the mean.

Once the triples were generated, I counted how many events occurred that provided optimal conditions, which would produce a trajectory that strikes the façade at 108 feet above ground level and 358 feet from home plate. I then replicated the 10,000-run simulation 100 times. In a typical 10,000-run simulation, approximately 1.15% of the speed events were within 1% of 120 miles per hour, approximately 2.00% of the angle events were within 1% of 24 degrees, and approximately 2.25% of the elevation events were within 1% of three feet. In none of the 1 million simulated runs did all three variables fall into the optimal range in the same row. This would indicate that Mantle's feat was, indeed, a rare event. A typical histogram for one 10,000-run simulation of the ball speed is shown in Figure 5.
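A rough re-creation of that simulation, using scipy's inverse CDFs in place of Excel's NORMINV and GAMMAINV, might look like the following. The means, standard deviations, and gamma shape below are assumptions chosen only to respect the stated ranges; they are not the parameters used for the article, so the percentages this sketch produces will differ from those reported above.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1963)
n = 10_000

u = rng.uniform(size=(n, 3))                              # three Uniform[0,1] draws per trial
angle = stats.norm.ppf(u[:, 0], loc=24.0, scale=1.0)      # analogue of Excel's NORMINV
elev  = stats.norm.ppf(u[:, 1], loc=2.5, scale=0.25)
speed = stats.gamma.ppf(u[:, 2], a=9.0, loc=95.0, scale=5/3)  # analogue of GAMMAINV (assumed shape)

near = lambda x, target: np.abs(x - target) <= 0.01 * target  # "within 1%" of the optimum
all_three = near(speed, 120) & near(angle, 24) & near(elev, 3)
print(near(speed, 120).mean(), near(angle, 24).mean(), near(elev, 3).mean(), all_three.sum())
```

The last printed value counts trials in which speed, angle, and elevation are simultaneously near their optimal values, the analogue of the "same row" check described in the text.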
The Stuff Legends Are Made Of

For the sake of legend, I predict the maximum distance for Mantle's historic drive to be 536 feet, which just happens to be one foot for every one of his career home runs. This extraordinary home run in 1963 adds to Mantle's legend. Five hundred thirty-six feet from home plate is still impressive, especially when compared to the long-ball era home runs of the past 15 years.
Figure 5. Sample histogram for ball speed from one 10,000-run simulation. Speeds of at least 120 mph are likely needed to hit the stadium's upper façade at 108.1 feet.

Postscript (Extra Innings)

There are plenty of other theories on the Internet as to how far the ball might have traveled. In addition, many fans claim to have been at the ball park and witnessed the Ruthian blast. Did the ball really rise when it hit the façade, as eyewitnesses claim? As it was a mid-week game against the A's that went into extra innings, it is doubtful that more than a meager few thousand fans, if that many, were still in attendance when Mantle won the game. How many of them were in the upper deck of the right-field seats, qualified to determine whether it was rising? There is no video of this home run, so we have to rely on the weather facts. This approach uses the most widely accepted methods, incorporating trajectories subject to drag and weather.
Further Reading

Adair, R. K. 1994. The physics of baseball, 2nd ed. New York: HarperPerennial.
Allen, M. 1963. New York Post, May 23.
Drebinger, J. 1963. The New York Times, May 23.
Early, L. 2010. The hardest ball I ever hit! www.themick.com/hardestball.html.
Nathan, A. 2008. The effect of spin on the flight of a baseball. Am. J. Phys. 76(2).
Hitting Streaks Don't Obey Your Rules
Evidence that hitting streaks are not just byproducts of random variation
Trent McCotter
Professional athletes naturally experience hot and cold streaks. However, there's been a debate for some time now as to whether these athletes experience streaks more frequently than one would expect—a phenomenon commonly referred to as having "the hot hand." In his review of Michael Seidel's book, Streak, Harvard biologist Stephen Jay Gould wrote, "Everybody knows about hot hands. The only problem is that no such phenomenon exists. The Stanford psychologist Amos Tversky studied every basket made by the Philadelphia 76ers for more than a season. He found, first of all, that probabilities of making a second basket did not rise following a successful shot. Moreover, the number of 'runs,' or baskets in succession, was no greater than what a standard random, or coin-tossing, model would predict." Seidel's book detailed Joe DiMaggio's record 56-game hitting streak in 1941. Gould's point is that hitting streaks are analogous to the runs of baskets by the 76ers, in that neither should show any signs of deviating from a random coin-tossing model.

My study of long hitting streaks for 1957 through 2006, however, seems to show that the actual number of long hitting streaks in baseball is not the same as what a coin-tossing model would produce, even when we try to account for players getting varying numbers of at-bats per game. The problem is that we have been assuming the outcome of one at-bat has no predictive power for the next at-bat. That is, we have been assuming "independence" in baseball, just like when we toss a coin. While one can find many articles that attempt to calculate probabilities of hitting streaks, there has been no article about whether the crucial independence assumption actually holds true. Even the more sophisticated methods for calculating streak probabilities do not use actual game-by-game data, but rather simulate new data that still requires an assumption of independence. In fact, this article shows that the independence assumption does not hold true. By using the coin-flip model all of these years, we have been underestimating the likelihood that a player will put together a 20-, 30-, or even a magical 56-game hitting streak.

As It Relates to Baseball
The question is this: Does a player's performance in one game have any predictive power for how he will do in the next game? If a baseball player usually has a 75% chance of getting at least one base hit in any given game and he's gotten a hit in 10 straight games, does he still have a 75% chance of getting a hit in the 11th game? This is asking whether batters' games are independent. This independence assumption has been lurking silently in numerous articles published in The Baseball Research Journal, CHANCE, and even The New York Times. These articles involve attempts to calculate the probability of a player meeting or beating Joe DiMaggio's major league record 56-game streak. If it's true that batters who are in the midst of a long hitting streak will tend to be more likely to continue the streak than they normally would (they're on a "hot streak"), then we would expect more hitting streaks to have actually happened than we would theoretically expect to have happened. That is, if players realize they've got a long streak going, they may change their behavior (maybe by taking fewer walks or going for more singles as opposed to homers) to try to extend their streaks. Or maybe they really are in an abnormal hot streak.

To calculate the odds of a streak, one usually must determine the odds that a player will have a hit in any given game. In its simplest form, the probability that a player will get a hit in a game with b attempts (at-bats), when the probability on one attempt is p and the attempts are independent, is one minus the probability that he will make outs in all of his at-bats, or 1 − (1 − p)^b. In baseball applications, p has often been replaced with the batting average (AVG) and the number of at-bats by the season average per game: b = (# at-bats)/(# games). For a fabricated player named John Dice, who hit 0.300 in 100 games with 400 at-bats (for an average of four at-bats per game): 1 − (1 − 0.300)^4 = 0.7599, or about a 76% chance of at least one hit in any given game. Thus, if games really are independent and do not have predictive power when it comes to long hitting streaks, this means Dice's 100-game season can be seen as a series of 100 tosses of a weighted coin that will come up heads 76% of the time. Long streaks of heads will represent long streaks of getting a hit in each game.
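As a check on this arithmetic, a short script can compute the per-game hit probability and, under the same weighted-coin model, the chance of at least one long streak somewhere in a season. This is a sketch of the standard coin-flip calculation the article goes on to question, not the author's own code; the player and numbers are the fabricated John Dice example above.

```python
def p_hit_in_game(avg, ab_per_game):
    """Probability of at least one hit in a game under the independence assumption."""
    return 1 - (1 - avg) ** ab_per_game

def p_streak(p, n_games, length):
    """P(at least one run of `length` straight hit-games in `n_games`), hit prob `p` per game."""
    # state[k] = probability the current trailing run of hit-games has length k (< length)
    state = [1.0] + [0.0] * (length - 1)
    done = 0.0                                   # probability the streak has already occurred
    for _ in range(n_games):
        new = [0.0] * length
        new[0] = sum(state) * (1 - p)            # a hitless game resets the run
        for k in range(length - 1):
            new[k + 1] = state[k] * p            # a hit-game extends the run...
        done += state[length - 1] * p            # ...or completes the streak
        state = new
    return done

p = p_hit_in_game(0.300, 4)                      # 0.7599, as in the text
print(p, p_streak(p, 100, 20))                   # chance of a 20-game streak in Dice's season
```

The same dynamic program can be reused to see how sensitive long-streak odds are to the per-game hit probability, which is the point the article develops below when discussing extra at-bats per game.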
The Problem With Independence

Does this method have a fundamental problem as it relates to looking at long hitting streaks? One may worry because it uses a player's overall season stats to make inferences about what his season must have looked like on a game-by-game basis. More advanced methods use simulations to vary the number of at-bats per game, to help account for batters not getting exactly four at-bats every day. However, these simulations still must rely on an assumption of independence, since the number of at-bats per game is varied randomly in the models.

Testing the independence of games can easily be done using a random permutation. That is, we can randomly shuffle each player's season game log (listing his batting line for each game) so the games are no longer in chronological order and the only variation in the number of long streaks would be due entirely to chance.
If we repeat this random permutation over and over, we can compare the number of long streaks in real life to the averages from the permutations. If the permutations show about the same number of streaks as real life, then games are probably close enough to independent that the coin-flip models will work. But, if the permutations show fewer streaks than real life, the coin-flip model will underestimate the probabilities of long hitting streaks.
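A stripped-down version of that permutation idea looks like the sketch below. The actual study permutes the real batting lines of every player-season and tallies streaks of every length; this toy version shuffles a single made-up season (a 0/1 vector in which 1 means the player had at least one hit that game) and compares its longest streak with the shuffled distribution.

```python
import numpy as np

def longest_streak(games):
    """Length of the longest run of hit-games in a 0/1 season vector."""
    best = run = 0
    for g in games:
        run = run + 1 if g else 0
        best = max(best, run)
    return best

def permutation_streaks(games, n_perm=10_000, seed=0):
    """Longest streak in each of n_perm random reorderings of the season."""
    rng = np.random.default_rng(seed)
    games = np.asarray(games)
    return np.array([longest_streak(rng.permutation(games)) for _ in range(n_perm)])

# Hypothetical data: a 150-game season in which the player hit in about 80% of games.
rng = np.random.default_rng(42)
season = rng.random(150) < 0.80
perms = permutation_streaks(season)
print(longest_streak(season), perms.mean(), (perms >= longest_streak(season)).mean())
```

If games were truly independent, the real ordering should look like just another draw from the permutation distribution; systematically longer real-life streaks are the signal the article is looking for.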
The Number Crunching

The batting lines of all players for 1957 through 2006 were compiled, excluding the 0-for-0 batting lines that neither extend nor break a hitting streak. The result was about 2 million batting lines. Then, with the assistance of Peter Mucha of the mathematics department at The University of North Carolina, I took each player's game log for each season of his career and sorted the game-by-game stats in a completely random fashion 10,000 separate times. For each of the 10,000 permutations, we counted how many hitting streaks of each length occurred and compared the counts to real life. We looked only at single-season streaks. By using the actual game-by-game stats (sorted randomly for each player), we didn't have to make theoretical guesses about how a player's hits are distributed throughout the season. If the independence assumption held true, this method of randomly sorting each player's games should—in the long run—yield the same number of hitting streaks of each length that happened in real life. See Figure 1.

Figure 1. More streaks in real life than permutations: z-score vs. length of streak

It's clear that, for almost every length of hitting streak of five-plus games, there have been more streaks in real life than we would expect, given players' game-by-game stats. For demonstration, we can assume an approximately Gaussian distribution to calculate probabilities. Of course, this assumption may not hold true. Some of the individual p-values are not significant at a .05 level, but the majority are.
Table 1—A Cumulative Table of Hitting Streaks, 1957–2006

              Actually Happened       In the 10,000 Random Sortings
Length        # of Hitting Streaks    Avg.          Stan. Dev.    Prob.
Less than 5   204567                  210430.00     351.70        1.0000
5+            64803                   62766.00      150.70        0.0000
10+           8410                    7621.00       69.45         0.0000
15+           1394                    1137.00       30.71         0.0000
20+           274                     192.40        13.32         0.0000
25+           62                      35.74         5.84          0.0000
30+           19                      7.07          2.60          0.0000
35+           5                       1.48          1.21          0.0018
The cumulative table has values that are extremely significant. For instance, we have seen almost three times as many 30-plus-game hitting streaks in real life as we would expect if games were randomly distributed. The number of hitting streaks is significantly higher than we would expect if long hitting streaks could be predicted using the coin-flip model. Additionally, the results of the 10,000 trials converged, meaning the first 5,000 trials had almost exactly the same averages and standard deviations as the second 5,000 trials. These results indicate that many of the attempts to calculate the probabilities of long hitting streaks are actually underestimating the true odds that such streaks will occur. Additionally, if hits are not independent and identically distributed events, it may be extremely difficult to devise a way to calculate probabilities that do produce more accurate numbers.
Why Don't They Match?

An easy explanation could be the quality of the opposing pitching. If a batter faces a bad pitching staff, he'd naturally be more likely to start or continue a hitting streak, relative to his overall season numbers. The problem with this explanation is that it may be too short-sighted: You can't face bad pitching for too long without it noticeably increasing your numbers. One way to measure this is to look at how many long hitting streaks batters have had against particular teams (i.e., a batter getting a hit in 30 straight games vs. the Blue Jays over the course of his career). From 1957–2006, there were 19 hitting streaks of 30-plus games vs. the league as a whole, but only five such streaks by a batter vs. a particular opponent. We expect fewer streaks because you can't count the last 10 games vs. Toronto and the first 20 games vs. Texas as a 30-game hitting streak vs. a particular opponent. But, if facing bad teams were so conducive to hitting streaks, it seems we would have seen more hitting streaks against teams with poor pitching staffs.

This same reasoning is why playing at a hitter-friendly stadium does not seem to pan out, either. Looking at the 19 streaks of 30-plus games from 1957–2006, 50.2% of the games were played at the batters' home stadiums and 49.8% were played at road stadiums. As such, long hitting streaks from 1957–2006 do not seem to be centered around
stretches in which the batter played more games at home—or on the road—than at any given stretch of the season. The third possible explanation is the weather. Hitting usually increases with the temperature, seemingly making the warmest months of the summer fertile ground for a hitting streak. The reason this is important is that hitting streaks are exponential. That is, a player who hits .300 for two months will be less likely to have a hitting streak than a player who hits .200 one month and .400 the next, even though they both have the same season totals. The problem with the weather explanation is that the stats do not bear it out. Of the 274 streaks of 20-plus games from 1957–2006, 62 began in May, 56 in June, 60 in July, and 53 in August. April is excluded because the season frequently begins at varying points in April, so the numbers are not strictly comparable. September is also excluded because streaks that begin in September have a much lower chance of actually making it to 20 or 30 games, simply because the player will run out of games to play. So that eliminates the explanations that would seem most likely. Remember, if all the assumptions about independence were right, we would not even have these differences between the expected and actual number of streaks. This leaves two other possible explanations, each of which may involve psychology more than mathematics.
First Alternative Explanation

Maybe the players who have long streaks going will change their approach at the plate to try to keep their hitting streaks going. This same idea was covered in The Bill James Goldmine, in which James discusses how pitchers will make an extra effort to reach their 20th victory of a season. The result is that there have actually been more 20-win seasons than 19-win seasons in major league history. There is evidence of this effect in hitting streaks, too. About 27% of the permutations yielded at least as many 29-game streaks as happened in real life. This means we're still seeing more 29-game streaks than we should, but it's nothing like 30-game streaks (see Figure 2). Only one of the 10,000 permutations yielded as many 30-game streaks as actually occurred.
Figure 2. Distributions of 29-game and 30-game streak counts in the 10,000 permutations. We're still seeing more 29-game streaks than we should, but it's nothing like 30-game streaks. Only one of the 10,000 permutations yielded as many 30-game streaks as actually occurred.
Figure 3. Longer streaks yield higher deviations: percentage more streaks in real life than in simulations, by length of hitting streak. As a streak gets longer, a batter will become more focused on it, thinking about it during every at-bat and doing anything to keep it going.
These streaks are pretty rare, so we are dealing with small samples, but this helps show that hitters may really be paying attention to their streaks and trying to reach a “famous” length. Once a batter hits that magical 30-game mark, every teammate, fan, and opposing pitcher knows about the streak. Perhaps the opposing pitchers are extra careful in how they pitch to the batter whose streak just made national news by reaching 30 games. Also lending some credibility to this explanation is that the spread (the difference between how many streaks really happened and how many we expected to happen) seems to increase as the length of the streak increases. That is, there have been about 7% more hitting streaks of 10 games than we would expect, but there have been 20% more streaks of 15 games and 80% more streaks of 25 games. Perhaps, as a
streak gets longer, a batter becomes more focused on it, thinking about it during every at-bat and doing anything to keep it going. See Figure 3.

Streaks occur when batters are maximizing their at-bats. Eighty-five percent of the players who had 20-plus game hitting streaks from 1957–2006 had more at-bats per game during their hitting streak than they had for their season as a whole. Overall, it worked out to an average 6.9% increase in at-bats per game during their streak. That extra 6.9% of at-bats per game almost certainly accounts for a portion of the "extra" hitting streaks that have occurred in real life. This increase in at-bats per game during a streak makes sense, as a batter is much less likely to be used as a pinch-hitter or taken out of a game early when he has a hitting streak going. Additionally, when a player is hitting well, his manager is more likely to move him up in the batting order so he gets more plate appearances. Also, pitchers may be hesitant to walk batters (and batters hesitant to take walks) because the players want the streak to end "legitimately," with the batter being given several opportunities to extend the streak.

The extra at-bats per game also account for the slope in Figure 3, which shows an exponential trend in the number of "extra" hitting streaks that have occurred in real life as opposed to permutations. As streak length increases, those extra at-bats make streaks increasingly more likely. For instance, if we take a .350 hitter who plays 150 games and increase his at-bats per game from 4.0 to 4.28 (about a 6.9% increase) for an entire season, his odds of a 20-game hitting streak increase by 34%, but his odds of a 30-game streak increase by 81% and his odds of a 56-game streak increase by 244%. Keep in mind that those increases are larger than we would see in our hitting streak data because the 6.9% increase in at-bats per game applies only to the 20 or so games during the hitting streak—not the entire 150 games a batter plays during a season.

Figure 4. Results of the second permutation: excluding non-starts yields closer totals (z-score vs. streak length)

Table 2—A Cumulative Table of the Second Permutation of 10,000 Trials, Identical to the First Except That All Games in Which the Batter Did Not Start Were Eliminated, 1957–2006

              Actually Happened       10,000 Sortings (Starts Only)
Length        # of Hitting Streaks    Avg.          Stan. Dev.    Prob.
Less than 5   204567                  179082.00     333.00        0.0000
5+            64803                   67102.00      151.00        1.0000
10+           8410                    9280.00       75.50         1.0000
15+           1394                    1472.00       34.40         0.9880
20+           274                     259.00        15.20         0.1620
25+           62                      49.40         6.88          0.0336
30+           19                      10.10         3.12          0.0022
35+           5                       2.19          1.47          0.0280
Second Alternative Explanation

Perhaps something else is going on that is significantly increasing the chances of long streaks, including the idea that hitters experience a hot hand effect and become more likely to have a hitting streak because they are in a period in which they continually hit better than their overall numbers suggest. This hot streak may happen at almost any point during a season, so we do not see a spike in streaks during certain parts of the year. Of course, we expect a player to have a certain amount of hot and cold streaks during any season, but the hot hand effect says the player will have hotter hots and colder colds than we would expect. So, the player's overall totals still balance out, but his performance is more volatile than we would expect using the standard coin-flip model.

There may be some evidence for this. From 1957–2006, binomial models indicate there were about 7% more three- and four-hit games in real life than we would expect given the coin-flip model, but also about 7% more hitless games. What this means is that the overall numbers still balance out over the course of a season, but we are getting more "hot
games” than we would expect, which is being balanced by more “cold games” than we would expect. Over the course of 50 years, those percentages could really add up and result in more hitting streaks. Additionally, there is more evidence that tends to favor the hot hand approach over the varying-at-bat approach. Mucha and I ran a second permutation of 10,000 trials that was the same as the first permutation—except we eliminated all the games in which the batter did not start the game. In our first permutation, we implicitly assumed non-starts are randomly sprinkled throughout the season. But that is likely not the case. Batters will tend to have their non-starts clustered together, usually when they return from an injury and are used as a pinch-hitter, when they have lost playing time and are used as a defensive replacement, or when they are used sparingly as the season draws to a close. Considering that we eliminated a non-insignificant number of hitless games from the permutation, we expected this second trial would contain more streaks than the first permutation. The question was whether this second permutation would contain roughly the same number of streaks as occurred in real life. View Figure 4 and Table 2 for the results. Clearly, the numbers match up better to real-life, with few significant values for individual streaks. However, at the cumulative level, the numbers become significant for streaks of 25-plus games. As the streak length increases, the difference between real life and the two permutations widens further. For streaks of 30-plus games, there are still almost twice as many in real life as in the permutations. Here, we deal primarily with long streaks, but for streaks between five and 15 games, the pattern does not hold: There were fewer such streaks in real life than in the second permutation looking at only starts. Remember, this second permutation is still not comparing apples to apples because we’ve eliminated a lot of games for this second trial, which explains why the permutations returned so few streaks that were less than five games. There are undoubtedly streaks that fall just short of 20 games when looking only at starts, but would’ve hit 20 games if nonstarts (e.g., successful pinch-hitting appearances) were included. The reason this favors the hot hand effect is this: Our first alternative explanation above relies on the idea that players are getting significantly more at-bats per game during their hitting streaks than during the season as a whole. But, the reason for a large part of that difference is that players are not frequently used as nonstarters (e.g., pinch-hitters) during their streaks, so it artificially inflates the number of at-bats per game that the batters get during their streaks relative to their season as a whole. Thus, we should be able to remove the pinch-hitting appearances from our permutations and get results that closely mirror real life. Granted, the numbers are closer, but we still get the result that there have been significantly more long hitting streaks in real life than there “should have been.” This tends to add weight to the hot hand effect, since it just does not match up with what we would expect if the varying number of at-bats per game were the true cause. This study seems to provide strong evidence that players’ games are not independent, identically distributed trials, as statisticians have assumed all these years. It may even provide evidence that events such as hot hands are part of baseball
streaks. It will take even more study to determine whether it is hot hands, the change in behavior driven by the incentive to keep a streak going, or some other cause that really explains why batters put together more hitting streaks than they should have, given their actual game-by-game stats. Given the results, it is highly likely that the explanation is some combination of all these factors. Regardless of the explanations, there is overwhelming evidence that, when the same math formulas used for coin tosses are applied to hitting streaks, the probabilities they yield are incorrect. The formulas incorrectly assume the games in which a batter gets a hit are distributed randomly throughout his season. As a result, we can no longer just “assume independence” when calculating probabilities for long hitting streaks. Our methods have been underestimating the odds that any given player could put together a 56-game hitting streak. Also, perhaps the baseball purists have had it at least partially right all this time. Maybe batters really do experience periods when their hitting is above and beyond what would be statistically expected given their usual performance. This study does not just look at the statistical side of baseball. It also reveals the psychology of it. At the very least, there is evidence that batters adapt their approach to try to keep a long hitting streak going—and baseball players are nothing if not adapters.
Further Reading

Albert, J. 2008. Great streaks. By the Numbers 18(4):9–13. www.philbirnbaum.com/btn2008-11.pdf
Albert, J. 2008. Streaky hitting in baseball. Journal of Quantitative Analysis in Sports 4(1), Article 3. www.bepress.com/jqas/vol4/iss1/3
Arbesman, S., and S. Strogatz. 2008. A journey to baseball's alternate universe. The New York Times, March 30.
Brown, B., and P. Goodrich. 2003. Calculating the odds: DiMaggio's 56-game hitting streak. The Baseball Research Journal 32:35–40.
Chance, D. 2009. What are the odds? Another look at DiMaggio's streak. CHANCE 22(2):33–42.
Freiman, M. 2002. 56-game hitting streaks revisited. The Baseball Research Journal 31:11–15.
Gould, S. J. 1989. The streak of streaks. CHANCE 2(2):10–16.
Levitt, D. 2004. Resolving the probability and interpretations of Joe DiMaggio's hitting streak. By the Numbers 14(2):5–7.
McCotter, T. 2008. Hitting streaks don't obey your rules. The Baseball Research Journal 37:62–70.
Rockoff, D., and P. Yates. 2009. Chasing DiMaggio: Streaks in simulated seasons using non-constant at-bats. Journal of Quantitative Analysis in Sports 5(2), Article 4.
Seidel, M. 2002. Streak: Joe DiMaggio and the summer of '41. Fargo: Bison Books.
Short, T., and L. Wasserman. 1989. Should we be surprised at the streak of streaks? CHANCE 2(2):13.
Warrack, G. 1995. The great streak. CHANCE 8(3):41–43, 60.
Visual Revelations
Howard Wainer, Column Editor
Pies, Spies, Roses, Lines, and Symmetries
Life-changing events rarely announce themselves as such when they make their appearance. Such was the case more than 20 years ago, when I received a fateful phone call. It was my old friend and former colleague, Steve Fienberg. He was the cofounder and coeditor of CHANCE, which at the time was in its third year of existence. He asked me if I would be willing to take over the writing of a column on graphic display that Alan Paller, the initial columnist, had dubbed "Visual Revelations." I thought about it for a moment and said, "OK."
I had ideas for three or four columns, which would carry me through the first year, and figured other topics would pop up as time went on. I hoped my imagination would not run dry before I had fulfilled the three-year term Steve suggested. Twenty years and 93 articles later, my tenure as a columnist and CHANCE’s existence as an outlet for popular statistical discourse still seem to be going strong. It has been a wonderful run and I am delighted to have had the opportunity. Over the years, there have been many topics, but two principal themes
manifested themselves often: graphical display and history—almost always in tandem. Symmetry and consistency suggest it would be altogether appropriate if I mark these two decades with a column that encompasses these two great themes. Let us start with a simple data set from the Israel Bureau of Statistics. It shows the number of males and females in each of the various age categories (in thousands), as well as the road casualties. What are the questions these data might be asked to illuminate? Which display formats would do this well?
Table 1—Distribution of Israeli Population and Road Accident Casualties by Age and Sex
Data from Israel Bureau of Statistics, 2002

             Male Population   Number of Male   Female Population   Number of Female   Male Accident   Female Accident
Age          (in thousands)    Accidents        (in thousands)      Accidents          Rate/1,000      Rate/1,000
0–4          340               623              322                 578                1.8             1.8
5–14         601               1,460            571                 1,161              2.4             2.0
15–19        285               3,431            271                 1,671              12.0            6.2
20–24        272               4,618            265                 2,302              17.0            8.7
25–29        261               3,620            256                 2,006              13.9            7.8
30–34        218               2,349            218                 1,369              10.8            6.3
35–44        376               3,157            392                 1,863              8.4             4.8
45–54        349               2,257            374                 1,623              6.5             4.3
55–64        207               1,302            231                 811                6.3             3.5
65–74        154               693              195                 473                4.5             2.4
75+          114               401              168                 307                3.5             1.8
TOTAL/Mean   3,176             23,911           3,263               14,164             7.5             4.3
Since a graph is usually the answer to a question, it is worthwhile to think a little about what questions we are interested in answering—there are a lot of them—before drafting a figure. We could just look at the numbers of men and women in Israel and study the age distribution. The venerable age pyramid has a long and honorable history in such use. And this kind of plot is invaluable in making policy decisions about work force size, planning for retirement and medical costs, etc. We also might make a similar plot using the frequencies of being in automobile accidents. Such a display is likely to be of only limited use. It would gain in utility if it were paired with the population pyramid mentioned previously, for then we could see not only how often men and women of each age group are in automobile accidents, but also the extent to which any of these frequencies is unrepresentative of the size of that segment of the population. But, if we are willing to sacrifice seeing the actual counts, we can make comparisons easier by dividing the number of those in an accident by the size of that segment of the population and looking directly at the proportion of each age segment in an automobile accident.

This sort of initial thinking represents minimal statistical due diligence, and we can now try out some preliminary displays. We do this with the clear understanding that developing a data display is an iterative process; what we learn from our first display helps us ask new questions and reformulate displays to answer them.

In one of my earliest columns (volume 4, issue 2), I described what I understood were the strengths and weaknesses of pie charts. Among their strengths are that they make explicit that the sum of all pieces equals 100%, they are widely familiar, and they allow viewers who don't know that the fraction 1/3 is larger than 1/4 to make such comparative judgments correctly. Among their weaknesses are that they can't show more than a few segments in an understandable way (three segments are fine; 23 are not) and that judgments of the comparative size of segments are inaccurate, as is the comparison of segments from adjacent pie charts. This is but a sampling. In the intervening decades, my opinion of the utility of pie charts has not improved. What has happened is the development of much-improved alternatives—most especially Bill Cleveland's ingeniously simple dot plots.

But let us not prejudge. Suppose we take some of the data from Table 1 and make a pie of it (see Figure 1). We can see that accident rates for women are greatest between ages 15 and 30 and taper lower for both older and younger females.

Figure 1. Auto accident rates for females
Obviously, we are likely to be just as interested in the rates among males as among females; indeed, we would probably find comparisons between them of interest, as well. Staying in the pie format, we could plot the males in the same way and put the pie alongside the pie for the females for easy visual comparison (see Figure 2).

Figure 2. The age distribution of male and female accident rates
Figure 3. Back-to-back half pies of the Israel road accident data, comparing automobile accident rates for males and females
Figure 4. Distribution of road accident casualties by age and sex, relative to the size of the population
A quick look tells us that the general age structure is strikingly similar across the two sexes. Do we need two pies
to do this? We could use one pie with males on one side and females on the other. This might aid comparison, since
asymmetries would indicate differences, and human eyes are good at spotting asymmetry (Figure 3). This display works better than I would have guessed. We can see that roughly the same pattern of involvement in auto accidents exists for both sexes. But, because each pie (or each half pie) is normed separately by sex, this figuration loses the relative size (compared across sex) of the likelihood of being in an auto accident. How can we keep those aspects of this chart that are valuable while adding the possibility of seeing different overall rates by sex? What about borrowing from Florence Nightingale’s often useful rose diagram? In this figuration, the pie is modified so that instead of each segment’s central angle varying as a function of the size of that segment, it is the segment’s radius that varies. Nightingale used her invention to show monthly mortality, by cause, in the Crimean War. She found that this approach worked better than standard pies for making comparisons across years because each month was in the same relative position every year, whereas they would shift with changes in the data with pies. Her data and roses provided powerful support for her successful argument to improve battlefield hygiene. Hebrew University’s Dror Feitelson invented a variation on Nightingale’s, which he has dubbed the spie chart (see Figure 4). However, spie charts are not a simple combination of two Nightingale roses. A careful look reveals that the angles of each segment are not equal. What Feitelson has done is combine a regular pie chart with a rose. The regular pie chart determines the size of the angles, and so the circle represents this traditional pie—or more accurately a half-pie—the left half for females and the right half for males. Then, superimposed on top of this, is a modified rose in which the angles have been set, but now the radii are proportional to the square root of the amount in the segment. Note it is just their length that carries the information. Because the angles are set so that equal radii would yield areas proportional to the data, the areas of the superimposed rose are not proportional to the data. But, in return for losing the area, we gain a basis of comparison. In Feitelson’s own words, “Slices that now extend beyond the circle of the original pie chart indicate that their relative size
has grown, while slices that are smaller than the original circle indicate that their relative size has shrunk. This provides an immediate and visually striking display of the change from the first partition to the second one, at the price of losing the easy comparison of slice sizes for the second partition."

Before we consider another alternative, it is worth mentioning two things. First, the order of the slices around the pie (and the spie) is not random. It was carefully chosen, as was the color scheme used in the original (variations of blue for males, of red for females). I am forced to concede that, for some purposes, this variation on the pie theme works.

Is there another, simpler alternative that deserves consideration? Let us look at a simple line plot with age as the independent variable (on the horizontal axis) and the likelihood of a road accident shown on the vertical axis. We then draw one curve for males and another for females. In addition, we have added two horizontal lines, representing the average likelihood of road accident for each sex. This allows us to see easily both the sex and the age effects (Figure 5).

Which of these is better? I have performed no perceptual experiments, but I lean toward the line chart. It is simple, has all the information, and easily scales upward to many more age groups. Also, there is the possibility of including more than two groups. If, instead of males and females, we wanted to examine four economic groups, five groups based on education, or six ethnic groups, there would be no problem. Just draw more lines. The pie, or even the spie format, is likely to run out of steam quickly.

These limitations should not rule out pies or spies from your consideration. There may be circumstances in which such formats may be just right. For example, I recently ran across one pie chart that would not have been improved with another format (see Figure 6)—I ask the readers' indulgence for this obiter dictum.
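For readers who want to experiment, here is one way to draw a spie chart with matplotlib. It follows the construction Feitelson describes: the wedge angles come from the ordinary pie of the first partition, and each radius is the square root of the ratio of the two partitions' shares, so a wedge's area tracks the second partition and a radius greater than one marks a slice whose relative share has grown. This is a sketch, not the code behind Figure 4; the data are the male columns of Table 1, truncated to a few age groups for brevity, and the exact scaling used in the published figure may differ.

```python
import numpy as np
import matplotlib.pyplot as plt

labels = ["0-4", "5-14", "15-19", "20-24", "25-29"]            # truncated for brevity
base   = np.array([340, 601, 285, 272, 261], dtype=float)      # male population (thousands)
comp   = np.array([623, 1460, 3431, 4618, 3620], dtype=float)  # male accident counts

p_base = base / base.sum()        # angles come from the ordinary pie of the base partition
p_comp = comp / comp.sum()        # the comparison partition sets the radii

angles = 2 * np.pi * p_base                                    # wedge widths
starts = np.concatenate(([0.0], np.cumsum(angles)[:-1]))
radii  = np.sqrt(p_comp / p_base)                              # wedge area is then proportional to p_comp

ax = plt.subplot(projection="polar")
ax.bar(starts, radii, width=angles, align="edge", edgecolor="white")
ax.plot(np.linspace(0, 2 * np.pi, 200), np.ones(200), color="black")  # unit circle = original pie
ax.set_xticks(starts + angles / 2)
ax.set_xticklabels(labels)
ax.set_yticklabels([])
plt.show()
```

The black unit circle plays the role of the original pie: slices whose bars poke outside it have a larger share of accidents than of population.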
Figure 5. Line chart of the Israel accident data: likelihood of being in a traffic accident, by age and sex. Data source: Israel Bureau of Statistics, 2002
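The line chart itself is simple to reproduce. This minimal sketch uses the rates from Table 1 and adds the two horizontal lines for the overall male and female rates; styling choices are my own, not those of the published figure.

```python
import numpy as np
import matplotlib.pyplot as plt

# Accident rates per 1,000 population, from Table 1 (Israel, 2002).
ages   = ["0-4", "5-14", "15-19", "20-24", "25-29", "30-34",
          "35-44", "45-54", "55-64", "65-74", "75+"]
male   = [1.8, 2.4, 12.0, 17.0, 13.9, 10.8, 8.4, 6.5, 6.3, 4.5, 3.5]
female = [1.8, 2.0, 6.2, 8.7, 7.8, 6.3, 4.8, 4.3, 3.5, 2.4, 1.8]

x = np.arange(len(ages))
plt.plot(x, male, marker="o", label="Males")
plt.plot(x, female, marker="o", label="Females")
plt.axhline(7.5, linestyle="--", label="Male mean (7.5)")      # overall rates from the table
plt.axhline(4.3, linestyle=":", label="Female mean (4.3)")
plt.xticks(x, ages)
plt.xlabel("Age group")
plt.ylabel("Road accident casualties per 1,000")
plt.legend()
plt.show()
```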
Figure 6. Pie chart showing all you need (titled "All the Things You Need")
Further Reading

Cleveland, W. S. 1994. Visualizing data. Summit, NJ: Hobart Press.
Feitelson, D. G. 2003. Comparing partitions with spie charts. Technical Report 2003-87, School of Computer Science and Engineering, The Hebrew University of Jerusalem. www.cs.huji.ac.il/~feit/papers/Spie03TR.pdf
Wainer, H. 1991. Humble pie. CHANCE 4(2):52–53.
Wainer, H. 1995. A rose by another name. CHANCE 8(2):46–51.
Goodness of Wit Test
Jonathan Berkowitz, Column Editor
Goodness of Wit Test #10: Once Is Enough
Why are cryptic crossword puzzles suitable for magazines devoted to all things statistical? Perhaps it is because both involve pattern recognition and looking beyond the obvious for a deeper meaning. The "aha" moment that often accompanies a solution to a cryptic clue is similar to the discovery of lurking variables that can explain spurious associations.
Winners from Goodness of Wit Test #8: Employs Magic

Leon Hall is chair of the mathematics and statistics department at Missouri University of Science and Technology in Rolla. In addition to mathematics, he enjoys love-hate relationships with the St. Louis Cardinals and the game of golf.
Jeff Passel spent the first half of his career at the U.S. Census Bureau and the second half at the Urban Institute. He is now in the midst of the third half at the Pew Hispanic Center. His outside interests include word puzzles, baseball, food, wine, Latin-American art, and a small bit of gardening.
Perhaps, then, word play could be incorporated into the statistics classroom to explain various topics. For example, what is the term for a quantity equal to two-thirds of the median? The answer is mean. Why? Take two-thirds of the six letters in median to get mean. A groaner? Certainly. But a novel way to introduce a less-than-exciting term in descriptive statistics. I welcome your suggestions for other cryptic groaners. In Goodness of Wit Test #10, as in a number of previous puzzles, some alteration of the themed answer words is required for entry into the grid. Once you discover the gimmick, you might be inclined to add to the list of theme words. A one-year (extension of your) subscription to CHANCE will be awarded for each of two correct solutions chosen at random from among those received by me by January 31, 2011. As an added incentive, a picture and short biography of each winner will be published in a subsequent issue. Mail your completed diagram to Jonathan Berkowitz, CHANCE Goodness of Wit Test Column Editor, 4160 Staulo Crescent, Vancouver, BC Canada V6N 3S2, or email a list of the answers to
[email protected]. Please note that winners of the puzzle contest in any of the three previous issues will not be eligible to win this issue's contest.
Reminder: A guide to solving cryptic clues appeared in CHANCE 21(3). The use of solving aids—electronic dictionaries, the Internet, etc.—is encouraged.
Solution to Goodness of Wit Test #8: Employs Magic

This puzzle appeared in CHANCE Vol. 23, No. 2. "Employs Magic" is an anagram of "Olympic Games." The center letter in each of the five squares is an O, representing the five rings of the Olympic symbol.
SQUARE 1: Across: 1 ASSET [deletion: assert–r] 6 DOLLS [rebus: do(ll)s; odd letters = dos] 7 ANOVA [hidden word: Nag(ano Va)ncouver] 8 PAPER [rebus: p+ape+r] 9 TRESS [rebus: t+res+s] Down: 1 ADAPT [rebus + reversal: a+dap+t] 2 SONAR [rebus: s+on+a+r] 3 SLOPE [container: s(lop)e] 4 ELVES [pun: plural of Elvis = elves] 5 TSARS [anagram: stars] SQUARE 2: Across: 1 GUSTS [container: gu(s)ts] 6 UNCAP [rebus: u+n+ca+p] 7 ELOPE [rebus: e+lope] 8 SIREN [rebus: sire+n] 9 T-TEST [deletion: attest–a] Down: 1 GUEST [homophone: guessed] 2 UNLIT [rebus: exchange t and l in until] 3 SCORE [rebus + anagram: CEOs+r] 4 TAPES [reversal + rebus: se(a+p)t] 5 SPENT [rebus: s+pent]
SQUARE 4: Across: 1 MOTET [anagram: totem] 6 AGONY [rebus: ago+NY] 7 GROUP [rebus + anagram: g+pour] 8 METRO [rebus + reversal: or+team–a ] 9 ASHES [deletion: slashes–sl] Down: 1 MAGMA [container: ma(g)ma] 2 OGRES [anagram: gores] 3 TOOTH [rebus: too+them–em] 4 ENURE [rebus + reversal: e+nur+e] 5 TYPOS [hidden word: nas(ty pos) itional] SQUARE 5: Across: 1 APRIL [rebus: a(par–a)il] 6 DRONE [container: d(r)one] 7 LOOSE [rebus: loos+e] 8 INTER [deletion: winter–w] 9 BESTS [container: be(s)ts] Down: 1 AD-LIB [rebus + container: A(d)li+b] 2 PRONE [rebus: p+r+one] 3 ROOTS [container: ro(o)ts] 4 INSET [charade: in+set] 5 LEERS [rebus: le+ers]
SQUARE 3: Across: 1 AMPLE [deletion: sample–s] 6 SALON [rebus: sa+l+on] 7 SPORT [container: spo(r)t] 8 ALTER [anagram: later] 9 MESSY [rebus: mess+y] Down: 1 ASSAM [rebus + reversal: a+ssam] 2 MAPLE [container: ma(p)le] 3 PLOTS [deletion: pilots–i] 4 LORES [anagram: loser] 5 ENTRY [deletion: sentry–s]
Goodness of Wit Test #10: Once Is Enough

Instructions: Eight clue answers need to be altered in a similar way to fit into the grid. These answers, which are all proper names with something shared, are clued by word play only and do not have a definition. The rest of the answer words are clued and entered normally. Enumerations are withheld.
Across
1 Bench female baseball player
5 Slow moving Gore trails revolutionary guard
10 Examine portfolio
11 Married to help servant
13 Foot problem before marriage
14 Exit retreat leaderless
15 Unlimited places to store sword handle
16 Auditor's weak pretense
18 Very old drawer is dirtier
19 Catches parents ill-prepared
20 Had a row with liquid the French poured into missile
26 Bite peeled melon fruit
28 Suspicious LSD guru in the audience
29 Brat loses time slip
30 Donating French wine packing job
32 Nice fish
33 Tennis player part of Star Trek collective
34 Couple owed last of rent
35 Lively ancient city prior to start of big trouble
36 Flimsy argument labels Southern Europe's leaders treacherous
Down
1 French master failing in standard English
2 Execute core of change
3 Drunken priest's vivacious wit
4 Stage last tale
5 Good article on middle of recession in America
6 Kid grabs half a candy decoration
7 Clear Tut's head after plague
8 Prove a BA might be better than average (2 words)
9 Nile swirling over Royal Society ships
11 Narrator follows prophet without introduction
12 Musicians' patron saint lacking two eyelashes
15 Show limitless narration
17 Creoles crazy about a vegetable
20 Heartless controlled rulers rise and come forward (2 words)
21 Strip part of ship during day lost at sea
22 Ms. Woods or Ms. Dashwood
23 State answer, recite prayers in retrospect
24 You and I initially ignore edict
25 Suzy got excited holding pair of gametes
27 Go south to prepare
30 Crow flying up dress
31 Votes against bills not true