Editor’s Letter Mike Larsen, Executive Editor
Dear Readers,
This issue of CHANCE begins with an article by Jana Asher on collecting data in challenging settings. In particular, Asher describes her experiences conducting in-person survey interviews in East Timor. She gives us personal anecdotes, practical statistical advice, and an interesting story.

Qi Zheng explains the origins of the Luria-Delbrück distribution and its role in studying evolutionary change in E. coli. The statistical reasoning underlying the phenomenon has a connection to the distribution of slot machine returns.

Holmes Finch’s article, “Using Item Response Theory to Understand Gender Differences in Opinions on Women in Politics,” compares and contrasts item response models and how they describe a data set. The models are explained using formulas, pictures, and examples.

In Volume 22, Number 4, Jürgen Symanzik proposed a puzzle based on 10 data points and a set of seven instructions. Contest winner Stephanie Kovalchik, a graduate student at UCLA, provided a solution in the form of an amusing letter and an illustrative graphic. The 10 data values were flight times of the Space Shuttle Challenger, in seconds recorded on the log10 scale. Brad Thiessen earned honorable mention for his graph that included temperature and historical facts.

Bernard Dillard asks, “Who turned out the lights?” We are all concerned with energy demand and production. Bernard uses a discrete wavelet transformation to analyze electricity consumption data measured on a frequent time scale. The fit of the model is used in multiscale statistical process control. The ultimate goal is to be able to accurately predict points of extreme energy demand and respond appropriately.

Students in virtually all statistics courses learn something of least squares estimation when studying prediction of an outcome from an explanatory variable. Ivo Petras and Igor Podlubny ask whether there is a reasonable alternative to the default criterion. “Least circles” is presented for your consideration.

To introduce students to concepts of design of experiments, instructors sometimes have students conduct taste tests of various food items, such as gummy bears (see Vol. 23, No. 1). John Bohannon, Robin Goldstein, and Alexis Herschkowitsch compared dog food and pâté. Really, they did. Read about their design and the results in this issue.

Ronald Smeltzer shows us an early time-line bar graph by Philippe Buache depicting the water level of the Seine River in Paris from 1760 to 1766. The picture creatively and effectively depicts data in print before the advent of the modern printing techniques that we enjoy today.

Howard Wainer, in his Visual Revelations column, writes about the graphics in the 2008 National Healthcare Quality Report and State Snapshots. Usefully and accurately displaying information graphically is important and challenging. Wainer makes suggestions for improving some of the displays.

Continuing a series of articles on postage stamps, Peter Loly and George P. H. Styan discuss stamps issued in sheets with 5×5 Latin square designs. Color versions of the stamps, as well as previous articles on stamps, are available online at www.amstat.org/publications/chance.

Jonathan Berkowitz’s puzzle celebrates the 2010 Winter Olympics, which were held in his home city of Vancouver, British Columbia. The puzzle, titled “Employs Magic,” is actually five smaller puzzles, each a cryptic five-square of 10 words. Mark Glickman’s Here’s to Your Health column will appear in the next issue.

In other news, the Executive Committee of the ASA met recently and made decisions that impact CHANCE. First, the committee voted to continue CHANCE for another three years in both print and online versions. The next executive editor will serve 2011–2013. I’ll enjoy reading CHANCE in the years to come. Second, the Executive Committee voted to make the online version of CHANCE free to the ASA’s certified student members. This is a great development, because students are potential long-term subscribers and future authors. They also can be inspired by the significant role that probability and statistics can play in major studies and activities. I hope that other professionals will be motivated to submit articles to CHANCE to entertain and influence this group.

I look forward to your suggestions and submissions. Enjoy the issue!

Mike Larsen

You can receive a table of contents notification for CHANCE by email. Go to www.springer.com/mathematics/probability/journal/144 and add your email address in the box that says “Alerts For This Journal.” The web site also has a place where you can recommend CHANCE to your library.
Collecting Data in Challenging Settings In the Global South—say, East Timor—data collection is not for the faint of heart or weak of stomach. Jana Asher
It is very early in the morning when water falling on my face jolts me awake. The raindrops are blowing into the small bedroom through a gap between the roof and side wall, and, unfortunately, the platform bed on which I am curled is occupied by four other women. Having nowhere to move, I coil up into a tighter ball under our shared blanket and attempt to sleep …despite the small pool of liquid that has formed a sheen on my cheek.
Truck wreck in Sierra Leone. Good thing the truck was big, or the pothole might have swallowed it whole.
Local villagers use machetes to remove a tree trunk from the road in East Timor. There were no governmental services available to clear the tree; the villagers were responsible for keeping their own passages off the mountainside clear.
The next morning, the occupants of the small village in which we are stranded will use machetes to cut out a portion of tree trunk that is blocking our path down the mountainside. We will make it about one-third of the way down the mountain in our truck before we are blocked by a second tree trunk, bigger than the first, and are therefore forced to abandon our vehicle. We will walk the rest of the way down the mountain in a state of hyper-alertness, listening to the creaking of the trees and hoping that one of them does not decide to fall on top of us. In the evening, we will finally make it to Dili, the capital of East Timor. That night, I will call my irate and somewhat despondent husband and assure him I am fine, despite having missed our planned phone call the night before. And then I will collapse in my warm hotel room bed, half a world away from my family and friends. To be fair, when we started out that morning, the weather was hot and dry. By an accident of fate we just happened to select a village on top of a mountain as a test site for our questionnaire on the very day that the rainy season decided to arrive. We completed our testing in a thunderstorm, and by the time we were done, the path we had taken up the mountain no longer existed, having been subsumed by a tree with a diameter greater than a man’s height. All for the sake of testing a questionnaire.
Figure 1. The data collection communication process
Data collection in the Global South is not for the faint of heart or weak of stomach. Nor is it for the overly rule-bound or uninventive. It requires an intimate understanding of your own fragility and mortality, a sense of adventure, a respect for the knowledge and wisdom that each person you encounter provides, and a limitless supply of patience. Finally, it requires a desire to bring the highest intellectual and scientific rigor to the most difficult of circumstances …coupled with the understanding that you will never truly succeed at doing so. But we are getting ahead of ourselves. Let us start by discussing why the questionnaire design and data-collection processes are so essential to the quality of the resulting data, and then review the ingredients that constitute good questionnaire design and data collection. Then we can return to East Timor—by way of the Middle East and Africa.
What Can Go Wrong? It seems like a deceptively simple process, one perhaps that you first tried in an elementary or junior high school class. You have a research question that can be answered by survey data, so you
write some questions, slap them onto a form, copy, distribute, collect, and presto: instant data! Well, yes and no. Any data-collection process represents a complex chain of communications, and as with any other chain, one weak link can break it. The more complicated the data-collection process, the more links in the chain, and the more opportunities for communication to break down. For a traditional interviewer-administered survey, multiple individuals must have a nearly identical conceptual understanding of a question, as shown in Figure 1. When everything goes well, the chain is a circle as presented on the left: That is, the interpretation of what information is desired is identical across the many individuals in the chain, leading to accurate reporting and recording of information. What can happen, however, is that the chain forms an outward spiral like the one on the right—and the information collected is not identical to that which was originally desired by the researcher. Two main issues that can arise in the process of communication are misinterpretations (errors in the communication between people) and inaccuracies in transmission (such as error caused by the interviewer writing down the wrong code on the survey form). The
misinterpretation type of error is best explained by examples. Let us start with a basic question that is fraught with interpretation issues and would therefore almost never appear on a pretested questionnaire.
Avoiding Ambiguous Questions
Consider a question as basic as “What is your income?” To a critical observer, there are obvious ambiguities in this question. Within what time period do we mean? Do we mean household or individual income? Reported taxable income? Are alimony payments, gambling winnings, and other types of income included? What about bartered items? How is a respondent to know? Here is another real-life example: the testing of a web-based survey that was to be administered to expatriate Iraqi physicians so that they could describe their professional and other experiences in their new countries of residence. The research group that hired us to test it had been advised by a well-trained and highly reputable Iraqi physician that instruction at all Iraqi medical schools was conducted in English. Therefore, the survey could be administered in English and the doctors would have little difficulty understanding the questions. Unfortunately, he was wrong. As an illustration, note the following
transcription from a pretest interview. The volunteer was instructed to read the question aloud and then tell the interviewer all thoughts he had as he determined his answer to the question. The goal of the question was to determine whether the physician was required to become recertified to practice medicine in his new country. Volunteer interviewee (reading): If you had to go through a credentialing process or are currently in a credentialing process in your new country, how many years did it take? Please do not answer if you did not need to go through a credentialing process in your new country.
Joahannes Kawa interviews in less than ideal circumstances. Note that privacy is not a common commodity in the rural areas of Sierra Leone, where the arrival of the interviewing team was considered a major event.
Volunteer interviewee (responding): Do you know if the credentialing process is the asylum process or the red card or green card? Interviewer: I'm not allowed to say anything, because it has to be how you interpret it. Volunteer interviewee: Yeah, I need to understand what means …This is asking me…like through the embassy or through the …I don't understand what mean [sic] “credentialing.” (Looks it up in English-Arabic dictionary.) Here credential is like ambassador or delegation …yes. What mean this [sic]? Interviewer: So how would you answer this question? Volunteer interviewee: This is paper of ambassador of delegation [sic] …I don't know.
News from the field in Sierra Leone. Team leader Bafara Jawara calls via satellite phone to report on progress in the Bonthe district.
You do not need to be a seasoned questionnaire designer to see a big problem in the respondent's interpretation. We and our clients determined that their “cultural expert” on Iraqi physicians was familiar only with the “elite”—that is, those physicians who attended the best Iraqi medical schools and held the most prestigious positions in their home country. In fact, of the four Iraqi physicians we interviewed during the pretest of the questionnaire, only one came close to interpreting all the questions accurately. As a result of the pretest, the research team members developed an Arabic version of the questionnaire, thus saving themselves from collecting data that would be, for all intents and purposes, garbage. In general, a rigorous testing procedure will be required to ensure that a survey collects the data it intends
An interview during the field testing in East Timor. The interviewer, Jacinta Gonsalves, is sitting at the table; her supervisor, Silvia Verdial da Silva Lopes, is sitting to her right; and the respondent is the man sitting to her left. Note that privacy was virtually impossible during the interviewing process.
to collect—in other words, that the data collection process ends as a circle instead of a spiral.
Best Practices for Data Collection So what is the right way to design and test a questionnaire and interview respondents? Well, the short answer is that it depends, but there are some best practices that have been developed over the past few decades. Here is a look at the process of developing the questionnaire and training interviewers to administer it. Testing the questionnaire Testing a questionnaire requires multiple steps. A first round of testing should include review of the survey instrument by at least three different types of experts: an expert on the subject of the survey, an expert on the population to be surveyed, and an expert on questionnaire design. Although those roles might overlap—for example, your questionnaire design reviewer might be the same as your expert on the population to be surveyed—none of those experts should be the person who initially designed the questionnaire. Even the most experienced questionnaire designers make mistakes. Once that initial review is completed and the questionnaire has been modified based on the expert comments, the real testing begins. There are several possibilities for the next round of review.
One of the most successful techniques is called “cognitive interviewing.” Cognitive interviewing allows the survey designer some insight into the thinking processes of the respondent. During a cognitive interview, two trained professionals administer the survey to a test respondent. One of the professionals serves as the interviewer, and one records his/her impressions of the interaction between the interviewer and respondent, including tone of voice and body language that might indicate confusion or a strong emotional reaction. The respondent is asked to “think aloud”—that is, to verbalize his/her thoughts while responding to the question. In addition, the interviewer might ask probing questions—questions about the respondent's interpretation of or thinking about the survey. Those probing questions might be either ad-lib or carefully developed prior to the cognitive interviewing process. If possible, the cognitive interview will be tape- or video-recorded for study later. As an example, the transcription of the interview of the Iraqi physician given earlier in this article was taken from a cognitive interview. Cognitive interviewing can occur as a single process or in waves. When cognitive interviewing occurs in waves, between each wave the survey developer modifies the survey on the basis of what was learned during the previous wave. Following the example of the question
“What is your income?” we can imagine a first round of cognitive interviews might uncover the issue that there is no time frame given. The survey designer might respond by reformulating the question as “What was your income over the past 12 months?” A second round of cognitive interviews might reveal that some respondents are including interest income and some are not. The question might then be reformulated as “What was your income from gainful employment over the past 12 months?” and so on, until the cognitive interviews indicate that the respondents to the question are interpreting it as the survey developer intended. Limitations of cognitive interviewing One issue with the cognitive interviewing process is that it is time and labor intensive, so only a limited number of cognitive interviews can be completed during any wave. Therefore, it is a good idea to perform a final field test after the cognitive interviewing. A field test is a small run of the fieldwork for the survey, after which the results are tabulated to make sure that they seem appropriate. The field test allows both input from a large number of individuals from the population of interest and also an opportunity for the interviewers to practice with the survey before the real fieldwork begins. One hopes that any problems found at this stage will be minimal and CHANCE
A team in Sierra Leone reviewing questionnaires under the supervision of team leader Mohamed Daboh. From left to right: Andrew Simbo, Mohamed Daboh, Antoinette Licon, Nancy Joseph, and Joahannes Kawa. Note that the team is working by candlelight; electricity in the field was a luxury, as were flush toilets. Also note the spiffy Joint Statistical Meetings bags, donated by the American Statistical Association.
Transportation in the Bonthe district of Sierra Leone. The team members that covered the Bonthe district spent the majority of their mission being transported by boat. For this reason, they were issued life preservers and special containers in which to store the survey forms and their equipment and personal belongings. Here team members rest as they travel to the next village. The Bonthe team interviewed in the remotest village sampled for the survey. It was on an island that belonged to the district of Pujehun, and the Pujehun team had been unable to reach it. The Bonthe team members, led by Bafara Jawara, were required to ride in a boat for 16 hours and then hike for 10 miles to get to the village.
easily corrected, and the survey will be ready to commence in earnest. Another issue that frequently arises during questionnaire design is the need to translate the questionnaire into one
or more languages. The base minimum for translation is a combination of forward and backward translation—that is, one individual performs a translation between the original and target
language, and another individual translates it back into the original language. The pretranslation is then compared to the version that has been forward- and back-translated to find inconsistencies, and the translation is then corrected. Although this has been an industry standard, recent research has suggested that a more rigorous technique must be used, especially in the case of a questionnaire that is being translated into multiple languages simultaneously. One option is to develop the questionnaire in the multiple languages at the same time, with individual teams for each language starting from a base set of concepts. In that case, cognitive interviewing and other testing methods will occur in all languages, not just the base language. Training the interviewers An essential aspect of the questionnaire design and testing process is the appropriate training of the interviewers. In most large survey projects in the United States, the interviewers are trained to “stick to the script”—in other words, to read the questions on the survey instrument exactly as written, to follow the questions in the order given, and not to offer additional explanation of the questions unless it has been preapproved by the survey manager. This protocol is designed to minimize interviewer bias in the answers—that is, changes in how respondents will answer questions that are due to some behavior of the interviewer. Interviewers may inadvertently change the meaning of a question if they do not read that question exactly, or they may cause respondents to favor one response over another due to the interviewer’s clear preference for that response. For that reason, interviewers are also taught to present the questions in a neutral way and to not implicitly or explicitly express their own opinions as to an appropriate answer. Interviewers must learn the appropriate way to administer the survey. However, there are other aspects of interviewer behavior that must be addressed, including appropriate voice control and body language. Interviewers are trained on techniques for building rapport, asking sensitive questions, and maintaining the confidentiality of the responses of the individuals interviewed. Finally, interviewers need to understand methods for keeping themselves safe in the field.
What Makes Survey Practice Different in a Developing or Transitional Country? Much of our understanding of what constitutes good survey practice has grown out of research that has taken place in the United States, Canada, and Europe— all vastly different environments from those of the Global South. There are several reasons why the data-collection methods discussed above might need significant alteration to be useful in the developing context. A particularly pernicious problem— one that just recently has become a research priority of government statistical organizations in the developed world—is the need to administer a survey to a diverse population comprised of multiple ethnic groups that speak a diverse set of languages. In the past, questionnaires often have been developed without consideration of how particular questions will be interpreted or understood across cultural groups, or whether particular concepts are directly translatable at all: an issue that the forward-/back-translation industry standard does not adequately address. Populations in the Global South might be more varied than those in the Global North for other reasons— including varying levels of literacy and different understanding and tracking of time (e.g., reliance on agricultural cycles rather than a Gregorian calendar). In the context of sensitive questions, ensuring privacy might be close to impossible in the Global South context, where entire families might share a one-room home or apartment and the arrival of interviewers is a villagewide event. And the cultural preferences of many Global South countries lead to higher rates of “acquiescence bias”: the higher likelihood of respondents answering yes to a question, not because the affirmative is true but because they want to please the interviewer, who is perceived as being in a position of authority.
Are There Best Practices for Data Collection in the Global South? The answer to this question is yes and no. There are several organizations— including the World Bank and United Nations—that have compiled the state of the art in random sample surveys in the Global South. However, there is still
Cognitive interviewing in the field in East Timor. From left to right: Jana Asher, Duarte da Silva, and the survey respondent with her daughter.
much research to be done as well as ongoing issues with the indiscriminate transfer of data-collection methodology. One important difference between fieldwork in the Global North and Global South is the availability of support infrastructure in the field. Common developed-world conveniences like cellular phone networks or landlines, hospitals and clinics, restaurants and hotels, and regular electricity and running water simply might not be available in some parts of the Global South. Teams of interviewers can be outfitted with satellite phones and carry their own medical supplies into the field. Interviewers can be vaccinated against local diseases and carry food and shelter to remote locations. And plenty of matches and candles will allow interviewers to complete their activities after dark. In addition, interviewers must be prepared in advance for difficulties such as flat tires, large potholes, or the need to forget the car or truck altogether and travel by motorcycle, boat, or foot. And you never know when your vehicle will not be able to proceed because there is a fallen tree in your way! What about the design of the questionnaire in the Global South context? In this researcher’s experience and opinion, in Global South countries where multiple tribes or cultures may be part of the sampled population, training the
interviewer to recite the questionnaire verbatim, with no ability to explain if the respondent is confused, does not lead to quality data. In fact, a reluctance to adapt to the interviewee can harm the interview process. Rather, the interviewer needs greater training and greater ability to improvise in the field. Additionally, although they are not often used, techniques like cognitive interviewing and more recently developed language translation techniques can and should be used in the Global South. Many cultures in the Global South are more attuned to agricultural cycles and timing than Gregorian divisions of time. In many cultures, even significant dates, like one's birth date, are not known or not important. That can significantly impact the quality of surveys that require recall of events, unless specific techniques are developed for those populations. A technique called the “calendar method” has shown great promise. The interviewer assists the respondent in developing a written calendar of important “landmark” events in his/her life to aid recall of events asked about during the interview. However, in its typical form, the calendar method requires the respondent to be literate and familiar with the Gregorian calendar. The following section describes an alternate method of assisting illiterate respondents in recalling date information: a
method that the author developed in her fieldwork.
A Small Research Example
Team leader Mohamed Daboh watches as a local mechanic in Bo Town attempts to fix his team’s car during fieldwork. Car problems plague fieldwork in a country like Sierra Leone, where there is insufficient infrastructure to maintain the roads—or even pave the majority of them.
Sahr Gbondent holds up the remains of an interview team’s tire during fieldwork in Sierra Leone. During that project, each vehicle was equipped with two extra tires for each two-week mission. Very often, they weren’t enough.
The survey was a national survey of human rights violations that occurred during the armed internal conflict of Sierra Leone. The U.S. State Department sponsored the survey as part of a general program of documenting war crimes. The participating organizations were the American Bar Association and Benetech. All three groups wanted the data to be used by the Sierra Leone Special Court during the prosecution of war criminals. For that to happen, each violation needed to be associated with a perpetrator group and a date. In addition, we were asked to collect age of the victim at the time of violation, the duration of the violation (including duration until death if appropriate), and the current age of all members of the respondent's household. However, a large percentage of the population of Sierra Leone is illiterate and does not regularly consult the Gregorian calendar of years, months, and weeks. Rather, the people are attuned to the seasons (rainy and dry) and other important events such as religious holidays and school terms. Our solution was twofold: We provided the interviewers with more latitude to probe the respondent for information (not sticking to the script of the questionnaire), and we crafted a series of prescripted probe questions that asked the respondents to determine whether a violation occurred before or after a national date of importance. Although the first decision deviated from standard survey practice regarding interviewers, we felt that the complexity of the information sought required that interviewers have the ability to be more creative in the field, and that the additional information elicited far outweighed the potential interviewer bias introduced. The second decision, however, rested firmly on current understanding of memory and cognition by psychologists. Current theory suggests that our recall of date information is based on a few “landmark” events for which we have memorized the date on which the event occurred, combined with the storage in memory of events in series form. In other words, an interviewee might remember that her birthday is October 17 and then
Table 1—Time Probe Results for Sierra Leone War Crimes Documentation Survey

| Probe | Victim Age During Violation | Violation Start Date | Violation Duration | Death Duration | Resident Age |
| --- | --- | --- | --- | --- | --- |
| Missing value | 766 | 171 | 85 | 20 | 701 |
| No probe needed | 44,168 | 11,040 | 57,488 | 10,018 | 32,013 |
| No probe needed + 1 probe | 3,229 | 2,107 | 4,899 | 370 | 1,825 |
| 1 probe code recorded | 16,528 | 32,538 | 2,217 | 380 | 4,310 |
| 2 probe codes recorded | 26 | 18,577 | 28 | 0 | 1 |
| 3 probe codes recorded | 1 | 283 | 0 | 0 | 0 |
| 4 probe codes recorded | 0 | 1 | 0 | 0 | 0 |
| More probe codes | 0 | 0 | 0 | 0 | 0 |
| Total probe events | 19,784 | 51,399 | 2,245 | 380 | 4,311 |
remember that about two weeks later she had her furnace serviced—but it is very unlikely she will remember that her furnace was serviced on October 29. In the case of the Sierra Leone survey, we created a probe question for each of eight events of national importance that were spread through time in a way that allowed interviewers to determine the year in which a particular respondent's reported human rights violation had occurred. For example, the probe “Did that happen before or after the invasion of Freetown?” allowed the interviewers to determine whether an event occurred (roughly) prior to 1999, during 1999, or later. In addition, the interviewers were provided with several scripted probes to help them narrow the time frame during which the violation could have occurred. Those probes referenced seasons (rainy versus dry); religious events (before or after Christmas or Ramadan); school terms (before or after a particular term started or ended); and, if needed, age of the respondent (both age of the respondent when the event happened and age now). The interviewers were free to use any combination of probes they felt was warranted, but they were required to record which probes they used (or note if they had not used any). The strong advantage of this method versus more traditional calendar-based methods was its ability to be used with illiterate individuals. Did it work? The evidence given in Table 1 would suggest so. Using the recorded information about which probes
were used to ask about which violations, we determined that 79.4% of violation dates relied on probing for the date to be determined. Durations were easier to remember: as shown by the results for violation duration and time from violation until death, only 3.5% of those cases required probes. Note that typically only one probe was required to determine dates. We believe that respondents, once introduced to the probing method, engaged in self-probing to recall ages and dates later in the interview. This theory is supported by observations from the cognitive interviewing for the survey, during which respondents began to self-probe. The value of landmark-based probe events is also supported by the response rates to the time-related questions of the survey: Response rates were very high for resident age (99.9%), victim age (99.9%), violation start date (99.9%), and violation duration (96.9%).
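As a rough check on those figures, the short sketch below recomputes them from the Table 1 counts. The grouping of rows into “probed” and “not probed” is our reading of the table, not a calculation published in the article.

```python
# Counts from Table 1, ordered: missing value, no probe needed, no probe + 1 probe,
# then 1, 2, 3, 4, and more probe codes recorded.
start_date = [171, 11_040, 2_107, 32_538, 18_577, 283, 1, 0]
duration   = [85, 57_488, 4_899, 2_217, 28, 0, 0, 0]
death_dur  = [20, 10_018, 370, 380, 0, 0, 0, 0]

def share_probed(counts):
    probed = sum(counts[3:])      # rows with at least one probe code recorded
    return probed / sum(counts)

print(f"violation start dates probed: {share_probed(start_date):.1%}")   # about 79.4%
combined = [a + b for a, b in zip(duration, death_dur)]
print(f"durations probed:             {share_probed(combined):.1%}")     # about 3.5%
```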
Summary Many people required to analyze data do not see those data until they are nice and neat, either in a spreadsheet or another organized format. They might not realize how much error can be introduced through the data-collection process if the process is not carefully planned. You now know what can and will go wrong when appropriate questionnaire design and data collection protocols are not followed—or even when they are.
Scientists are hard at work determining new and better techniques for eliciting information from a variety of people across the world. So the next time you are given a data set to analyze, be sure to ask how the data were acquired. You might be surprised by how little or how much was done to produce data of the highest quality possible—whether or not the questionnaire designer ended up stuck on the top of a mountain.
Further Reading

Bulmer, M., and Warwick, D. P., eds. 1993. Social research in developing countries: Surveys and censuses in the Third World. London: University College London Press.
Casley, D. J., and Lury, D. A. 1981. Data collection in developing countries. London: Oxford University Press.
Tourangeau, R., Rips, L. J., and Rasinski, K. 2000. The psychology of survey response. Cambridge, UK: Cambridge University Press.
United Nations. 2005. Household surveys in developing and transitional countries. http://unstats.un.org/unsd/Hhsurveys/ (October 30, 2009).
Willis, G. B. 1999. Cognitive interviewing: A “how to” guide. http://appliedresearch.cancer.gov/areas/cognitive/interview.pdf.
The Luria-Delbrück Distribution Early statistical thinking about evolution Qi Zheng
In the 1940s, biologists were puzzling over the origin of certain mutations that confer survival advantages to bacteria living under harsh environmental conditions. A well-known example is a mutation that confers on Escherichia coli cells resistance to phage— viruses that infect and kill wild-type bacterial cells. Some biologists believed that such mutations occur spontaneously, or randomly, in the sense that they occur regardless of their usefulness to the organism. Others held that such mutations occur in response to the environment, for example, to the assault of phage. The former hypothesis, dubbed the “random mutation hypothesis,” supports Darwin’s theory of natural selection. The latter hypothesis, called the “directed mutation hypothesis” among other names, cannot be easily reconciled with Darwinism. The Luria-Delbrück probability distribution was developed to describe experimental results produced to address the contentious debate over these two hypotheses.
Photo courtesy of AP/Wide World Photos
King Gustaf Adolf, right, presents the Nobel Prize in Physiology or Medicine to German-born American biologist Max Delbrück in Stockholm, Sweden, December 10, 1969. Delbrück, of the California Institute of Technology, shares the prize with American biologist Alfred D. Hershey and Italian-American biologist Salvador E. Luria for their discoveries concerning the replication mechanism of viruses and their genetic structure.
Luria’s Experiment Salvador Luria, an Italian-born microbiologist, was responsible for several important advances in modern biology. An indirect contribution of Luria to biology was his decision to send James Watson, his first doctoral student, to Europe to pursue postdoctoral research, which culminated in the discovery of the molecular structure of DNA by James Watson and Francis Crick in 1953. In February 1943, after months of preoccupation with the controversy over the two hypotheses, Luria invented a new type of experiment that would shed light on the controversy. A diagram of the experiment is shown on the following page in Figure 1. Each test tube contains a liquid culture into which a few wild-type cells are seeded. During an ensuing incubation period cells grow and divide freely in each tube.
Under the random mutation hypothesis, when a wild-type cell divides, there is a small chance that one of the two daughter cells is a mutant. Since backward mutation is negligible, all offspring of that mutant would be mutants. To help understand how the cell population in a test tube evolves, consider synchronous cell growth. As depicted on the following page in Figure 2, let a cell population start from a common ancestor (top row). Each succeeding generation doubles in size. The offspring of a mutant type (black) are also of the mutant type. At the end of the incubation period, the numbers of mutants in the test tubes are determined by transferring the contents of each tube onto a solid culture in a dish containing a selective agent (e.g., phage). This transferring process is termed plating, which
eliminates wild-type cells but allows each mutant cell to grow and form a visible colony on a solid culture.
Overdispersion and the Slot Machine Luria’s experiment is a classic example of applying a simple statistical principle to an important biological problem. If mutations occur randomly, one would expect mutations to occur earlier in some tubes than in other tubes. Because an early-occurring mutation in general generates a larger number of mutant cells than a late-occurring mutation does, one is likely to observe a considerable amount of variation in the number of mutant cells across the tubes.
Figure 1. A simplified illustration of the fluctuation test. An actual experiment consists of around 30 test tubes. Wild-type cells are seeded into a test tube containing a liquid culture. After an incubation period, the cells in each tube are transferred to a dish where phage destroy the wild-type cells. The surviving mutant cells are fixed in place and grow to visible colonies, which are counted.
Figure 2. One of 36 pedigrees that have 16 mutant cells in the fifth generation. In the above diagram, a white circle stands for a wild-type cell, whereas a black disk stands for a mutant cell. Starting from a common ancestor (top row), each succeeding generation doubles in size. Under the random mutation hypothesis, mutations can occur in any generation. Thus, a large variation in the number of mutant cells across test tubes is expected.
On the other hand, if mutations occur only after the plating procedure brings wild-type cells into contact with phage, then, because cells lose mobility on a solid culture, the number of mutant colonies represents the number of mutations that occurred after plating (see Figure 3 on the following page). As each wild-type cell has an equal chance to mutate upon coming into contact with phage, the number of mutations would obey the binomial law, which should be well approximated by the Poisson law due to the large number of wild-type cells and the exceedingly small probability of mutation.
The idea of measuring this kind of variability to test the random mutation hypothesis struck Luria when he was watching a colleague putting dimes into a slot machine at a faculty dance at Indiana University. As Luria later recounted in his autobiography, at that moment he vividly saw a striking similarity between the large variation in slot machine returns and the variation in the number of mutant cells across the tubes—if random mutations did occur. To put it another way, a jackpot is to the slot machine return what an
early-occurring mutation is to the number of mutants in a test tube. Luria eagerly communicated his novel idea and experimental results to Max Delbrück, who, equally excited, formulated a mathematical model to describe the variation in the number of mutants. Thus was born the Luria-Delbrück experiment, also known as the fluctuation test or the fluctuation experiment. The latter names are due to the fact that in their paper, published in the November 1943 issue of Genetics, Luria and Delbrück used the term “fluctuation” to refer to what a statistician would today
Figure 3. In contrast to Figure 2, mutations occur only after plating under the directed mutation hypothesis.
call “variance” or “variation.” A prominent feature of a Poisson random variable is that the mean and the variance are equal, and hence the variance-to-mean ratio is unity. This would be the case for the distribution of the number of mutants under the directed mutation hypothesis. However, as the slot machine analogy suggests, the ratio can far exceed unity under the random mutation hypothesis. From a modern perspective, the distribution of the number of mutants is overdispersed under the random mutation hypothesis, compared with the distribution under the directed mutation hypothesis. When Luria performed the world’s first fluctuation tests to investigate the mutations that confer on E. coli cells resistance to phage infection, the observed variance-to-mean ratios greatly exceeded unity. Luria and Delbrück thus argued for the occurrence of random mutations in their experiments. In their classic paper, Luria and Delbrück also demonstrated the usefulness of the fluctuation test in measuring microbial mutation rates in the laboratory. In the ensuing six decades, the fluctuation test would gradually be regarded more as a means of estimating mutation rates than as a tool for unraveling the controversy for which the experiment was invented.
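To make the overdispersion argument concrete, here is a small simulation sketch; it is our illustration, not code from the article. It grows 30 virtual tubes under the synchronous-growth model of Figure 2 and compares their variance-to-mean ratio with Poisson counts of the same mean, as the directed mutation hypothesis would predict. The mutation probability and the number of generations are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1943)

def mutants_in_one_tube(generations=20, mu=1e-4):
    """Synchronous growth from one wild-type cell (the model of Figure 2)."""
    wild, mutant = 1, 0
    for _ in range(generations):
        mutant *= 2                      # offspring of mutants are mutants
        new = rng.binomial(wild, mu)     # each wild-type division may yield a mutant daughter
        mutant += new
        wild = 2 * wild - new
    return mutant

tubes = 30
random_hypothesis = np.array([mutants_in_one_tube() for _ in range(tubes)])
# Under the directed mutation hypothesis, the counts would follow a Poisson law.
directed_hypothesis = rng.poisson(random_hypothesis.mean(), size=tubes)

for label, counts in (("random", random_hypothesis), ("directed", directed_hypothesis)):
    ratio = counts.var(ddof=1) / counts.mean()
    print(f"{label:>8} mutation hypothesis: mean = {counts.mean():9.1f}, variance/mean = {ratio:9.1f}")
```

Runs of this kind give a variance-to-mean ratio near 1 for the Poisson counts and a far larger ratio for the simulated tubes, with the occasional jackpot tube doing most of the work.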
Unpublished First Efforts A key step in estimating mutation rates using the Luria-Delbrück experiment is to express the distribution of the number of mutants (also called the “mutant distribution”) in terms of the mutation rate or related quantities. Several prominent statisticians of the time regarded the method adopted by Delbrück as
too crude for the purpose of estimating mutation rates. J. B. S. Haldane was among the first to seek algorithms to calculate the distribution. Haldane used the synchronous growth model as illustrated in Figure 2, and took a combinatorial approach to tackle the distribution. For example, to calculate the probability of 16 mutants in the fifth generation, Haldane first enumerates all five-generation pedigrees having 16 mutants in the fifth generation. Figure 2 shows one of 36 pedigrees that have 16 mutants in the fifth generation. If μ is the probability that a cell division produces a mutant daughter cell, then the probability of that pedigree is μ^5(1 − μ)^26, as in total 31 cell divisions have occurred and five divisions were accompanied by a mutation. The desired probability is the sum of the probabilities of the 36 pedigrees. This approach is simple in principle but can be unwieldy in practice. For instance, a wild-type cell can give rise to 374 pedigrees that have 46 mutant cells in the sixth generation. No efficient algorithms exist for identifying all these pedigrees. A modern approach is to treat the model as a Markov process, computing probabilities for the nth generation using probabilities for the (n−1)st generation; a small sketch of this recursion appears below. Haldane did not publish his results. The original manuscript, written in 1946, is now part of a large collection of Haldane’s papers archived by University College Library in London. Almost at the same time, around 1947, the distribution drew the attention of another giant figure in genetics and statistics. Finding Delbrück’s mathematical treatment of the mutant distribution
less than satisfactory, the young geneticist James Crow posed the question of how to find the distribution of the number of mutants to his newly acquainted friend R. A. Fisher. Fisher, upon hearing the question, leaned back in his chair to think for about a minute and then wrote on a scrap of paper a generating function that Crow could not immediately understand. Crow put aside the piece of paper for later study but could never find it again. As this valuable scrap of paper is unlikely to be ever recovered, the mathematical model that Fisher used to obtain his generating function will continue to remain a mystery. However, it is improbable that Fisher’s generating function was arrived at in the space of a minute or so, Fisher’s legendary intellectual prowess notwithstanding. The raging controversy and the refreshing statistical argument put forward by Luria and Delbrück grabbed the attention of many a contemporary geneticist. It would be hard to imagine that Fisher, then pioneering at the frontiers of both genetics and statistics, had not pondered the controversy and the mutant distribution before the episode that Crow recounted in 1990.
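The following is a minimal sketch of that generation-by-generation recursion; it is our illustration of the synchronous-growth model, not Haldane's manuscript or code from the article. Each generation, every mutant cell yields two mutant daughters, and each wild-type division independently produces one mutant daughter with probability μ.

```python
from math import comb

def mutant_distribution(generations, mu):
    """P(k mutants) after synchronous growth from one wild-type cell,
    computed for each generation from the previous one (a Markov recursion)."""
    dist = {0: 1.0}                      # generation 0: a single wild-type cell
    cells = 1
    for _ in range(generations):
        new_dist = {}
        for m, p in dist.items():
            wild = cells - m             # wild-type cells about to divide
            for j in range(wild + 1):    # j of those divisions produce a mutant daughter
                pj = comb(wild, j) * mu**j * (1 - mu)**(wild - j)
                k = 2 * m + j            # existing mutants double, plus the new ones
                new_dist[k] = new_dist.get(k, 0.0) + p * pj
        dist, cells = new_dist, 2 * cells
    return dist

dist5 = mutant_distribution(generations=5, mu=0.01)
print(dist5.get(16, 0.0))                # P(16 mutants in the fifth generation)
```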
A Published Distribution
D. E. Lea and C. A. Coulson were responsible for the mutant distribution that still is widely used today in laboratories around the world. Lea and Coulson completed their work in June 1947, and their paper appeared in the December 1949 issue of Journal of Genetics. The Lea-Coulson model is a modification of Delbrück’s model reported in the 1943 classic paper
Photo courtesy of AP/Wide World Photos
Salvador E. Luria in his Massachusetts Institute of Technology Laboratory October 16, 1969, after word that he shared the 1969 Nobel Prize with two other bacteriologists for research on viruses.
of Luria and Delbrück. In Delbrück’s model, cell growth is asynchronous, and the number of wild-type cells in a tube at time t is approximated by an exponential growth function of the form N0e^(βt). Here β is the cell-growth rate and N0 is the initial number of cells. It is assumed that mutations occur in accordance with a Poisson process with a time-dependent rate proportional to e^(βt). Furthermore, Delbrück used the exponential function e^(β(t−t')) as a continuous approximation to the number of mutant cells at time t generated by a mutation occurring at an earlier time t'. It is easy to see that Delbrück’s model induces a continuous mutant distribution. As the number of mutant cells generated by a mutation in a typical experiment is relatively small, the second approximation is a crude one. It was due to this second approximation that Delbrück’s model was generally regarded as inadequate. In the Lea-Coulson model, a stochastic birth process with birth rate β replaces the exponential function e^(β(t−t')). The resulting mutant distribution is a discrete one. Lea and Coulson derived a generating function for this distribution. In their mathematical development, however,
Lea and Coulson inadvertently made an assumption that effectively treated the number of wild-type cells immediately before plating (NT) as an infinitely large quantity. This unintentional assumption allows the distribution to be indexed by a single parameter m, the expected number of mutations that occur in a test tube. As in a typical experiment NT is often on the order of 10^8, the effect of this simplifying assumption was largely negligible in practice. The exact form of the generating function allowing for finite NT is believed to have been derived independently by D. G. Kendall and M. S. Bartlett. Details about the origins of this so-called exact distribution were retold in the 1999 Mathematical Biosciences review article “Progress of a Half Century in the Study of the Luria-Delbrück Distribution,” by Q. Zheng. Note that the algorithm proposed by Lea and Coulson to calculate their distribution function was too laborious for routine application.
Recent Developments
The year 1988 saw an explosive resurgence of interest in the directed mutation hypothesis, which intensified interest in the Lea-Coulson distribution. As a result, an improved algorithm for computing the Lea-Coulson distribution function was proposed by Ma and colleagues in 1992; a small sketch of this recursion appears at the end of this section. The generating function of Lea and Coulson along with this improved algorithm entered the third edition of the authoritative monograph, “Univariate Discrete Distributions,” by Johnson, Kemp, and Kotz in 2005. This solidified the habit of calling Lea and Coulson’s distribution the Luria-Delbrück distribution. Among biologists, however, this distribution is often fondly called the “jackpot distribution,” for reasons now clear. Today the fluctuation test is introduced to biology students as an important biological experiment of the 20th century. The reader can gain a deeper understanding of the fluctuation test by consulting the popular genetics text Introduction to Genetic Analysis, by Griffiths et al., or several other biology textbooks. Continued interest in the Lea-Coulson and related distributions is due to the growing demand by biologists for better methods to estimate mutation rates using the fluctuation test. Recent work on mutant distributions focuses on the
improvement of point and interval estimation of the parameter m, from which the mutation rate is readily obtained. The reader can catch a glimpse of relevant computational issues from a tutorial written by microbiologists W. A. Rosche and P. L. Foster in 2000 or from a more specialized account written by Zheng in 2005. The Luria-Delbrück distribution remains an awe-inspiring subject among biologists because of the seminal contributions Luria and Delbrück made to modern biology. It is widely believed that Luria and Delbrück won the 1969 Nobel Prize in physiology or medicine (with A. D. Hershey) in part due to their fluctuation test from which the Luria-Delbrück distribution arose.
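As a concrete sketch of such a computation, the Lea-Coulson probabilities can be built up one term at a time from the single parameter m using the recursion of Ma, Sandri, and Sarkar; the code below is our illustration rather than the authors' published implementation.

```python
import math

def lea_coulson_pmf(m, n_max):
    """First n_max + 1 probabilities of the Lea-Coulson (Luria-Delbrück) distribution
    with expected mutation count m, via the Ma-Sandri-Sarkar recursion:
    p0 = exp(-m) and pn = (m / n) * sum_{k < n} pk / (n - k + 1)."""
    p = [math.exp(-m)]
    for n in range(1, n_max + 1):
        p.append((m / n) * sum(p[k] / (n - k + 1) for k in range(n)))
    return p

probs = lea_coulson_pmf(m=2.0, n_max=10)
print([round(q, 4) for q in probs])
print("P(more than 10 mutants) =", round(1 - sum(probs), 4))
```

The heavy upper tail of these probabilities is the jackpot effect in numerical form.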
Further Reading

Crow, J. F. 1990. R. A. Fisher, a centennial view. Genetics 124:207–11.
Griffiths, A. J. F., Wessler, S. R., Lewontin, R. C., and Carroll, S. B. 2007. Introduction to genetic analysis, 9th ed. New York: W. H. Freeman and Co.
Johnson, N. L., Kemp, A. W., and Kotz, S. 2005. Univariate discrete distributions, 3rd ed. Hoboken, NJ: Wiley.
Lea, D. E., and Coulson, C. A. 1949. The distribution of the numbers of mutants in bacterial populations. Journal of Genetics 49:264–85.
Luria, S. E. 1984. A slot machine, a broken test tube: An autobiography. New York: Harper & Row.
Luria, S. E., and Delbrück, M. 1943. Mutations of bacteria from virus sensitivity to virus resistance. Genetics 28:491–511.
Ma, W. T., Sandri, G. vH., and Sarkar, S. 1992. Analysis of the Luria-Delbrück distribution using discrete convolution powers. Journal of Applied Probability 29:255–267.
Rosche, W. A., and Foster, P. L. 2000. Determining mutation rates in bacterial populations. Methods 20:4–17.
Zheng, Q. 1999. Progress of a half century in the study of the Luria-Delbrück distribution. Mathematical Biosciences 162:1–32.
Zheng, Q. 2005. New algorithms for Luria-Delbrück fluctuation analysis. Mathematical Biosciences 196:198–214.
Using Item Response Theory to Understand Gender Differences in Opinions on Women in Politics Holmes Finch
The analysis and assessment of items on standardized tests is the purview of a subgroup of statisticians known as psychometricians. These individuals study the statistical quality of test items and estimates of examinee ability in a content area that is measured by them. In large-scale testing programs, such as those used by state departments of education, decisions regarding which items to retain for use with the student population are often made with the help of statistics. These statistics estimate such things as the difficulty level of items and how well these items differentiate students with relatively high proficiencies in the content area from those students with lower proficiencies. In addition, psychometric analyses can be used to obtain estimates of examinee proficiency—estimates that are often used in these testing programs to make academic competency decisions. For such decisions to be valid statistically and defensible legally, the items used on these tests must be of the highest possible quality. A part of the quality question focuses on the level of item difficulty and discrimination. Another question is whether an item is fair for all populations that might take the test. Of specific interest is the question of whether individual items exhibit any particular “bias” in favor of one group over another.
Item Response Theory

Item response theory (IRT) is the most common method used by psychometricians for analyzing item-level testing data that can be scored dichotomously: as either correct or incorrect. Typically, correct responses are coded as 1 and incorrect as 0. IRT is characterized by a set of logistic models that differ in terms of the amount of information provided about individual items. For example, the 1 parameter logistic (1PL) model links performance on an individual item only to the examinee’s proficiency, or ability, on the trait being measured and the difficulty of the item. One of the primary advantages of the IRT modeling framework is that item difficulty is on the same scale as an examinee’s ability on the latent trait being measured (e.g., proficiency in math), making a direct comparison between the two possible. The 1PL item response function (IRF) takes the following form

P(Ui = 1 | θ) = e^(θ − bi) / (1 + e^(θ − bi))

where Ui has a value of 1 for a correct answer and 0 for an incorrect answer, θ is the ability of the examinee on the trait measured by the test, and bi is the difficulty for item i. Both θ and bi range from -∞ to ∞. When bi is smaller than θ, the probability of a correct answer is greater than one-half. We can visualize this model using an item characteristic curve (ICC). Take, for example, an item with a difficulty value of 0, which would be considered average. Theoretically, difficulty ranges from -∞ to ∞. The ICC for this model appears in Figure 1. As an individual’s score on the latent trait being measured increases (x-axis), so does their probability of correctly answering the item (y-axis). In this case, the item difficulty of 0 corresponds to the value of the latent trait for which the probability of a correct response is 0.5. For the 1PL model, this interpretation of difficulty will always be the case, although for the more complex models discussed below this is not true. The 1PL model makes the assumption that all items are equally good (or bad) at discriminating those examinees who have high proficiency from those who do not. If we do not want to make such an assumption, we can modify the 1PL model slightly by including item discrimination to get
Figure 1. Item characteristic curve (ICC) for the one parameter logistic (1PL) item response theory (IRT) model. The x-axis is the value of a latent trait. The y-axis is the probability of a correct response.
the 2 parameter logistic (2PL) model. The 2PL item response function is

P(Ui = 1 | θ) = 1 / (1 + e^(-1.7 ai (θ − bi)))
Figure 2. Item characteristic curve (ICC) for the two parameter logistic (2PL) item response theory (IRT) model. The x-axis is the value of a latent trait. The y-axis is the probability of a correct response. Item 1 a=2.0, b=-0.5. Item 2 a=0.5, b=0.5.
Figure 3. Item characteristic curve (ICC) for the three parameter logistic (3PL) item response theory (IRT) model. The x-axis is the value of a latent trait. The y-axis is the probability of a correct response. Item 1 a=2.0, b=-0.5, c=0. Item 2 a=2.0, b=-0.5, c=0.25.
where ai is the discrimination parameter for item i and the other terms are defined as before. Figure 2 contains examples of 2PL ICCs for two items. Note that the item discrimination parameter value (ai) impacts the steepness of the slope, whereas the difficulty parameter value (bi) indicates the location of the IRF. In this case, item 1 has a=2.0 and b=-0.5, whereas item 2 has a=0.5 and b=0.5. Thus, given these values we can conclude that it is easier to obtain a correct answer to item 1 than item 2, and that item 1 is better at differentiating among examinees in terms of their overall location on the latent trait being measured. The scaling constant of -1.7 in the 2PL model is included in order to align this model with a similar one based on the normal ogive. The normal ogive function, as described by R. P. McDonald in Test Theory: A Unified Treatment, is really just the cumulative normal distribution function. A given value of the normal ogive is simply the area to the left of a specific standard normal (Z) value. The inclusion of the constant -1.7 makes the logistic function and the normal ogive essentially the same. Finally, if it is possible for examinees to correctly answer an item by chance (i.e., guessing), the 3 parameter logistic (3PL) model can be used by including a chance parameter. The 3PL IRF is

P(Ui = 1 | θ) = ci + (1 − ci) / (1 + e^(-1.7 ai (θ − bi)))
where ci is the chance parameter for item i and is between zero and 1. Figure 3 contains ICCs for two 3PL items. In this case, item 1 has a=2.0, b=-0.5, and c=0, whereas item 2 has a=2.0, b=-0.5, and c=0.25. The difficulty and discrimination parameter values have the same function in the 3PL as they did in the other IRT models. The ci parameter value serves as the minimum probability of a correct item response, and is generally known as the pseudo-guessing parameter. For this example, there is a 0 probability that an examinee will answer item 1 correctly due strictly to chance, whereas for item 2 the probability of a correct response by chance is 0.25. The 3PL model is only appropriate in cases where a chance correct response to an item is possible. In many survey situations in which our interest is in item endorsement and correct or incorrect responding is not an issue, one of the other IRT models would be more appropriate. An example might include assessing subjects’ political views by asking them to answer yes or no to statements regarding their attitudes on current events.
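For readers who want to experiment with these curves, here is a minimal sketch (ours, not the article's) that evaluates the logistic item response functions above. The function name, the θ grid, and the default scaling constant are our own choices.

```python
import numpy as np

def irt_prob(theta, a=1.0, b=0.0, c=0.0, scale=1.7):
    """Probability of a correct (or endorsed) response under the 3PL model.
    With c = 0 this reduces to the 2PL; with c = 0, a = 1, and scale = 1, the 1PL."""
    return c + (1 - c) / (1 + np.exp(-scale * a * (theta - b)))

theta = np.array([-2.0, -0.5, 0.0, 2.0])          # points on the latent trait scale
# The two items of Figure 3: same a and b, different chance parameters.
print(irt_prob(theta, a=2.0, b=-0.5, c=0.0))      # probability 0.5 at theta = b
print(irt_prob(theta, a=2.0, b=-0.5, c=0.25))     # floor of 0.25; 0.625 at theta = b
```

The second line of output illustrates the point made earlier: once a chance parameter is present, the difficulty bi no longer marks the point where the probability of a correct response is 0.5.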
Women's Rights Survey for Ninth-Grade Students

In 1999, the National Center for Education Statistics (NCES) conducted a large-scale, nationwide study of ninth-grade students' knowledge of democratic practices and governance and their attitudes regarding diversity, international relations, and national identity. (The interested reader can learn more about this study at the NCES web site http://nces.ed.gov/surveys/CivEd/.) Among the scales used in this study was one devoted to attitudes regarding the rights and roles of women in society.
A total of eight Likert items (statements to which a respondent expresses agreement or disagreement) were included, with four possible responses ranging from Strongly Disagree to Strongly Agree, with Don't Know as a fifth option. These items appear in the sidebar "Items on the Women's Rights Subscale." For the purpose of the following analyses, student responses were recoded as either Agree (Strongly Agree and Agree) or Disagree (Strongly Disagree and Disagree). Furthermore, these dichotomous items were coded so that more traditional views of women were given the value 1 and less traditional views were given a 0. Don't Know responses were coded as missing. The resulting scale can be seen, therefore, as measuring the degree to which respondents hold traditional views of women in society.

A nationwide sample of 3,022 ninth-grade students was selected for inclusion in this study, using a complex survey design involving clustering and stratification. A total of 2,767 students responded to the eight items with one of the four Disagree to Agree responses, leading to a response rate of 91.6%. For the purposes of this analysis, and to keep the length of the article reasonable, the item "Men are better qualified as political leaders" was selected as the target for IRT modeling. This item was chosen because of a belief by civics education professionals that it may prove interesting with regard to differences between female and male respondents. Specifically, it was felt that gender-specific response patterns might diverge—even for male respondents who had generally similar views to females both in terms of equal pay for equal work and the need for men and women to have equal opportunities in society.
Item Response Theory Results for Women's Rights

The analysis of the target item, "Men are better qualified as political leaders," may first involve the estimation of item difficulty and discrimination using the 2PL IRT model. In a testing context, item difficulty reflects the location of the item on the latent trait scale. Easier items have lower difficulty parameter values, indicating that individuals with lower levels of proficiency have a higher probability of answering the item correctly. "Discrimination" refers to the ability of the item to differentiate among students based on their underlying proficiency on the trait being measured (e.g., knowledge of math). Higher discrimination values suggest that the item is better able to separate those students with higher proficiencies from those with lower proficiencies.

In the current case, the eight items on the Women's Rights scale can be thought of as measuring the latent opinion of respondents regarding the rights and roles of women in society. The total scale score is an estimate of the actual opinion score, which remains latent and unobserved. As described above, this opinion score can also be estimated using IRT. As an example of the type of information that this analysis can provide, let's consider the target item in this study. The item parameter estimates for this item for the entire group of subjects, as well as results by gender, appear in Table 1. When the groups are considered together, as described by R. J. de Ayala in The Theory and Practice of Item Response Theory, item 8 would be considered to have a large bi value with a relatively high ai value. In this case, the larger bi would indicate that survey respondents are relatively unlikely to agree or strongly agree with the item, whereas the large ai means that this item is able to accurately discriminate between those with relatively high versus low scores on the Women's Rights scale.
Table 1—Item difficulty (level of agreement) and discrimination parameter estimates (standard errors) for the target item "Men are better qualified as political leaders"

Group              Difficulty (Agreement)   Discrimination
All (n=2,767)      1.238 (0.055)            1.327 (0.112)
Male (n=1,375)     0.865 (0.061)            1.263 (0.133)
Female (n=1,392)   2.179 (0.216)            0.867 (0.128)
Items on the Women's Rights Subscale

A scale devoted to attitudes regarding the rights and roles of women in society was administered in the 1999 survey of ninth-grade students by the U.S. Department of Education. A total of eight Likert items were included, with four possible responses ranging from Strongly Disagree to Strongly Agree, with Don't Know as a fifth option. Here are the items:

1. Girls have fewer chances in life than boys.
2. Women have fewer chances in life than men.
3. Women should run for political office.
4. Women have the same rights as men.
5. Women should stay out of politics.
6. Men should have more rights than women when jobs are scarce.
7. Men and women should receive equal pay for the same job.
8. Men are better qualified than women.
Average bi is generally considered to be around 0, whereas "easy" items have difficulty values roughly below -2.0 and "hard" items have difficulty values above 2.0. Although the notion of difficulty on a cognitive assessment is intuitively clear, "difficulty" in the context of an opinion item such as this one might not be. It is important to remember that the difficulty parameter is really a measure of the likelihood of a particular item being endorsed by respondents, with higher difficulty values corresponding to a respondent needing to have a higher level of proficiency to endorse the item. When the task at hand is a cognitive assessment, this translates into the level of proficiency needed to correctly answer the item. On the other hand, in the case of an opinion survey such as this one, endorsing an item means agreeing with it. Therefore, a high difficulty value would indicate that the respondent would need a high level of the latent trait (i.e., a more traditional opinion regarding the role of women in society) to endorse the item. In this context, discrimination refers to the extent to which the item is able to differentiate those with a more traditional view of women from those with a less traditional view, with higher values indicating more discriminatory power.
An issue that might be of some interest to researchers is whether there are differences between genders in these item parameter values, after controlling for respondents’ levels on the Women’s Rights scale.
Figure 4. Example of uniform DIF for a 2PL model. The x-axis is the value of a latent trait. The y-axis is the probability of a correct response. For Group 1, Item 1: a=1.0, b=2.0. For Group 2, Item 1: a=1.0, b=-2.0.
An examination of these values in Table 1 suggests that this item has a larger bi than average for the sample as a whole, meaning that respondents must have a very traditional view of women's roles in society in general to endorse the idea that men are better qualified to be political leaders. We can compare this value with the proportion of individuals who endorsed the item, which was 0.165, or 457 of the 2,767 respondents. The ai value for the full sample is also large, suggesting that this item is effective at separating those with relatively more traditional views of women from those with less traditional views of women.

Because we hypothesized that female students may feel differently than males about the relative merits of women in positions of political leadership, it is worthwhile to compare the bi and ai values for the sample by gender. In order to obtain these parameter estimates by gender, it is necessary to fit the model for males and females separately. The item bi and ai values appear by gender in Table 1. In the sample, this item had a higher bi value for females than males, suggesting that in order for a female student to endorse this item, she would need to have a very traditional view of the role of women in society. Males would also need to have a more traditional view than average to endorse the item, but in the sample it would not need to be as traditional as that of the females. Of the 1,392 female respondents, 113 (0.081) endorsed the notion that males are better qualified as political leaders, while 347 of the 1,375 male respondents (0.252) did the same. The item has a higher discrimination value for male respondents than for females, suggesting that it might be more effective at differentiating more traditional boys from less traditional boys than at making the same distinction among girls.
Differential Item Functioning Analysis: Comparing IRFs Across Subsamples

Despite the fact that there are clearly differences in the difficulty parameter estimates between male and female respondents, this does not necessarily mean that such differences are present in the population. It is entirely possible that the apparently higher value exhibited by the males in the sample is due only to sampling variation. For this reason, some statistical analysis must be conducted to ascertain whether there may be a difference in the population as a whole. In the context of IRT, the comparison of item parameter values between two groups is referred to as "differential item functioning" (DIF) analysis.

There are two types of DIF that can be present for a given item: uniform and nonuniform. Uniform DIF is focused on the difficulty parameter (bi), whereas nonuniform DIF is focused on the discrimination parameter (ai). In the context of educational measurement, "uniform DIF" refers to the situation where, after holding constant the latent trait being measured by the test as a whole, the probability of members of one group providing a correct response is lower than the probability of members of the other group doing so. In other words, an item displays DIF when members of the two groups who are matched on the ability being measured by the instrument as a whole have different probabilities of correctly responding. Another way to think about DIF is that it represents a difference in model parameters between two groups. In this context, uniform DIF occurs when the location of the item (bi) differs between the two groups, while the presence of nonuniform DIF means that the slope (ai) of the curve relating the latent trait to the probability of endorsing the item differs between the groups. Figure 4 provides a generic example of uniform DIF. We can see that although the shape of the ICCs is the same for the two groups, they have very different location parameter (difficulty) values.

Nonuniform DIF occurs when the probability of a correct response to the target item differs between the two groups, but that difference is not constant across the latent trait continuum. It can be thought of as an interaction between the latent trait and group membership, and indeed it is often tested in that way, as will become evident shortly. An example of nonuniform DIF appears in Figure 5. Unlike the case in Figure 4, the location of the item for the two groups is the same; however, their curves cross. For individuals with proficiency levels below 0, those in group 2 have a higher probability of correctly responding to the item, while for those with proficiency above 0, members of group 1 are more likely to respond correctly than are those in group 2.

A key element of DIF analysis is the matching of individuals on the latent trait. If two groups differ on the trait being measured, it would only be expected that their rates of correct responses to an item measuring that trait would differ as well. For example, imagine that a college physics instructor gives her class an introductory physics proficiency test at the beginning of the semester. Some of the students in the class will
have already had physics and would thus be expected to have a higher proficiency in the subject overall. Therefore, for any given item on the test, it would not be a surprise to find that the students who had taken physics previously would have a higher probability of answering the item correctly. Indeed, if the item is a good measure of physics knowledge, we would expect students with experience in physics to answer the item correctly at a higher rate than those who had never taken such a course.

The issue of controlling for differences in the latent trait before comparing item parameter values is at the core of DIF analysis. For an item to demonstrate DIF, we must first match individuals in the two groups on the latent trait being measured by the instrument. Then, if there are differences in the probability of a correct response to the item for individuals who have the same level of the trait being measured, we can conclude that DIF is present. Thus, if we are able to match individuals from the two groups on their physics proficiency and still find a difference in the probability of a correct response to an item, we can conclude that DIF exists, which may be indicative of a problem that favors one group over the other—perhaps in the way the item is written, for example.
Logistic Regression for DIF Analysis

There are a number of approaches available for assessing DIF. In this example, we will use logistic regression (LR). This choice is based on a number of factors, including research such as that by Swaminathan and Rogers showing that LR is an effective tool for DIF assessment. Additionally, it is widely used in areas outside of psychometrics and is therefore familiar to most statisticians. LR also has the ability to simultaneously assess both uniform and nonuniform DIF effectively. The logistic regression model for DIF detection is

$$\ln\!\left(\frac{p(u_i)}{1 - p(u_i)}\right) = \beta_0 + \beta_1\,\theta + \beta_2\,g + \beta_3\,(\theta g)$$
In the full model, p(ui) is the probability of person i responding correctly to the item (coded dichotomously so that 1=correct and 0=incorrect); θ represents proficiency on the trait being measured; g represents the group identifier; and θg represents the interaction between group membership and proficiency. For LR, proficiency will be estimated as the total score on the instrument, excluding the item being tested for DIF, and group is typically coded as 0 or 1. The βs represent the intercept and the slopes for proficiency, group, and the interaction, respectively. The comparison of the log-odds of a correct response for the two groups is done controlling for the proficiency level of each member of the sample. In this way, we ensure that differences in proficiency that might impact the response to the item are factored out when group response patterns are compared. A similar argument can be made for the testing of the interaction term.

The examination of items for DIF with LR involves the estimation of three models and the comparison of the resulting chi-square fit statistics. The models differ with respect to the inclusion of the main effect for group and the interaction of group and the proficiency value. Significant differences between these models indicate the presence of uniform and/or nonuniform DIF. The reader interested in the details of this testing should refer to Bruno Zumbo's A Handbook on the Theory and Methods of Differential Item Functioning (DIF): Logistic Regression Modeling as a Unitary Framework for Binary and Likert-type (Ordinal) Item Scores.
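As an illustration of the three-model strategy just described, here is a minimal Python sketch using statsmodels. It is a sketch under stated assumptions, not code from the article: the column names, the 0/1 group coding, and the synthetic data are mine, and McFadden's pseudo-R² (as reported by statsmodels) stands in for the particular R² measure Zumbo recommends.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from scipy.stats import chi2

def lr_dif_test(df):
    """Three nested logistic models for DIF detection.
    Expects columns: item (0/1 response), score (rest score), group (0/1)."""
    y = df["item"]
    m2 = sm.Logit(y, sm.add_constant(df[["score"]])).fit(disp=0)            # R2
    m1 = sm.Logit(y, sm.add_constant(df[["score", "group"]])).fit(disp=0)   # R1
    dfi = df.assign(inter=df["score"] * df["group"])
    mf = sm.Logit(y, sm.add_constant(dfi[["score", "group", "inter"]])).fit(disp=0)

    lr_uni = 2 * (m1.llf - m2.llf)   # R1 vs R2: uniform DIF (group main effect)
    lr_non = 2 * (mf.llf - m1.llf)   # full vs R1: nonuniform DIF (interaction)
    print(f"uniform DIF:    chi2={lr_uni:.3f}, df=1, p={chi2.sf(lr_uni, 1):.4f}")
    print(f"nonuniform DIF: chi2={lr_non:.3f}, df=1, p={chi2.sf(lr_non, 1):.4f}")
    print(f"effect size (delta pseudo-R2, R1 vs R2): {m1.prsquared - m2.prsquared:.3f}")

# Synthetic illustration: a uniform-DIF item (group shifts the log-odds).
rng = np.random.default_rng(1)
n = 2000
score = rng.integers(0, 8, n)        # rest score on the seven remaining items
group = rng.integers(0, 2, n)        # hypothetical coding: 0 = female, 1 = male
logodds = -4.0 + 0.6 * score + 1.2 * group
item = (rng.random(n) < 1 / (1 + np.exp(-logodds))).astype(int)
lr_dif_test(pd.DataFrame({"item": item, "score": score, "group": group}))
```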
Figure 5. Example of nonuniform DIF for a 2PL model. The x-axis is the value of a latent trait. The y-axis is the probability of a correct response. For Group 1, Item 2: a=1.0, b=0. For Group 2, Item 2: a=0.5, b=0.
In DIF research, an effect size is typically used in conjunction with the hypothesis test in order to determine the degree of DIF that is present. The preferred effect size measure for LR in the context of DIF, according to Zumbo, is the change in R2 values among the models (∆R2). According to Michael G. Jodoin and Mark Gierl in their Applied Measurement in Education article, "Evaluating Type I Error and Power Rates Using an Effect Size Measure with the Logistic Regression Procedure for DIF Detection," the most current guidelines for interpreting this value suggest that ∆R2 < .035 indicates negligible DIF; .035 ≤ ∆R2 ≤ .070 indicates moderate DIF; and ∆R2 > .070 indicates large DIF. In general practice, the effect size values are examined only for items that first exhibit DIF through at least one statistically significant model comparison.
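These cutoffs are simple enough to encode directly; the helper below is a tiny sketch (the function name is mine) of how one might apply the Jodoin and Gierl guidelines in practice.

```python
def classify_dif(delta_r2):
    """Classify DIF magnitude per the Jodoin and Gierl guidelines."""
    if delta_r2 < 0.035:
        return "negligible"
    if delta_r2 <= 0.070:
        return "moderate"
    return "large"

print(classify_dif(0.117))  # the value computed later for the target item: "large"
```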
DIF Analysis of Women's Rights Scale

As noted above, the sample estimates of difficulty and discrimination differed between male and female respondents for the item "Men are better qualified as political leaders." However, it is not known whether these differences are due to sampling variation or to some systematic divergence between the groups in the population. LR will be used to determine which is likely to be the case. Using the methodology outlined above, the chi-square values appearing in Table 2 were obtained. The difference between the values for the full model and the model without the interaction between group and the total score is not statistically significant (p=0.66); however, the difference between this second model and the model containing only the proficiency estimate is (p<0.001).
Table 2—Chi-square fit statistic values for the DIF analysis

Model                                        Chi-square, DF, p-value   R2
Full (proficiency, group, and interaction)   167.745, 3, <0.0001       14.0%
R1 (proficiency and group)                   67.548, 2, <0.0001        14.0%
R2 (proficiency)                             27.087, 1, <0.0001        2.3%

The full model has terms for proficiency (score on the Women's Rights scale), group (male or female), and the interaction between the two variables.
Figure 6. Item characteristic curves for “Men are better qualified as political leaders.” Group 1 is male and Group 2 is female.
This combination of results indicates that while there is no nonuniform DIF present (tested by the difference between the full model and R1), there is uniform DIF. Therefore, we can conclude that the likelihood of endorsing the target item differs between male and female respondents who have been matched on their opinion regarding the traditional role of women. The difference between these groups can be highlighted using their respective item characteristic curves, which appear in Figure 6. In this case, Group 1 refers to males and Group 2 to females. It appears from the figure and the item difficulty values in Table 1 that male respondents need a lower value than females on the latent variable (traditional view of women) to endorse this item. In other words, when we hold this latent variable constant, males are more likely to endorse the item: that is, to believe that men are better qualified to be political leaders. On the other hand, while in the sample the discrimination values of the two groups were somewhat different, this does not appear to hold true in the population at large.

To characterize the magnitude of uniform DIF present between the male and female respondents, we will use the
∆R2 value as described above. For R1 versus R2, it is 0.140 – 0.023 = 0.117. Using the commonly accepted guidelines, we would characterize this as a large effect. In other words, the degree of difference in difficulty parameter values for the two groups is really quite large. Again, it is important to remember that all these results were predicated on controlling for the underlying traditional attitudes toward women.
Gender Differences in Attitudes Toward Politicians

These results suggest that male and female ninth-graders do indeed differ in terms of their views regarding gender and qualifications for political leadership. Even when their overall attitudes toward women in society are held constant, male respondents are more likely than female respondents to endorse an item suggesting that men are better qualified as political leaders. Furthermore, given the absence of nonuniform DIF, we can say that this difference in attitudes across genders is the same regardless of the underlying attitudes toward women in society. In other words, even males with positive attitudes toward women are more likely to endorse an item stating that men are better qualified as political leaders than are females with similar overall attitudes toward women in society. Furthermore, given the effect size value, this difference is quite large, suggesting that it is more than a minor difference of opinion.

Another way to view this uniform DIF is through the lens of gender differences in the item difficulty parameters. In this context, "difficulty" refers to the relative agreement with the statement that men are better qualified to be political leaders than women. The results presented in Table 1 show that female respondents have a higher value of bi, indicating that this item is more difficult for them to endorse than it is for the male respondents.
Further Reading

de Ayala, R. J. 2009. The Theory and Practice of Item Response Theory. New York: The Guilford Press.

DeVellis, R. F. 2003. Scale Development: Theory and Applications. Thousand Oaks, CA: Sage Publications.

Jodoin, M. G., and Gierl, M. J. 2001. Evaluating Type I error and power rates using an effect size measure with the logistic regression procedure for DIF detection. Applied Measurement in Education 14:329–349.

Linacre, J. M. 1997. The normal cumulative distribution function and the logistic. Rasch Measurement Transactions 11(2):569. www.rasch.org/rmt/rmt112m.htm.

McDonald, R. P. 1999. Test Theory: A Unified Treatment. Mahwah, NJ: Lawrence Erlbaum Associates.

Swaminathan, H., and Rogers, H. J. 1990. Detecting differential item functioning using logistic regression procedures. Journal of Educational Measurement 27:361–370.

U.S. Department of Education, Institute of Education Sciences, National Center for Education Statistics. Civics education survey (CivEd). http://nces.ed.gov/surveys/CivEd/.

Zumbo, B. D. 1999. A Handbook on the Theory and Methods of Differential Item Functioning (DIF): Logistic Regression Modeling as a Unitary Framework for Binary and Likert-type (Ordinal) Item Scores. Ottawa, ON: Directorate of Human Resources Research and Evaluation, Department of National Defense. http://educ.ubc.ca/faculty/zumbo/DIF/.
Results of "A Real Challenger of a Puzzle" Graphics Contest

On page 28 of the Fall 2009 CHANCE (Vol. 22, No. 4), Jürgen Symanzik proposed a graphics and communication contest concerning 10 data values. The winning entry is by UCLA PhD student Stephanie Kovalchik, who presented her solution using an entertaining letter from a friendly statistical sleuth. An entry by Bradley Thiessen, associate professor of mathematics at St. Ambrose University, earned honorable mention. Thiessen incorporated temperature at liftoff and additional facts into his graph. Further reading on statistical approaches to studying the shuttle disaster can be found in volume 18 of Statistica Sinica. Thanks to all who entered the contest. If you have suggestions for future contests, please email me, CHANCE editor, at
[email protected]. – Michael D. Larsen
CHANCE Puzzle Question and Solution

Jürgen Symanzik

Question

Which population do the following 10 observations x[1:10] represent?

5.643517 5.721843 5.718105 5.837939 5.780754 5.851633 5.781989 5.836783 5.783540 1.863323

Hints

1. Order matters.
2. The data have been transformed.
3. The first nine points may differ slightly, depending on the source.
4. The 10th observation is famous (before its transformation).
5. A major hint is hidden in Edward Tufte's 1997 book.
6. The puzzle should have been sent on 1/28/09.
7. The data represent the entire population.

Solution

What happened on 1/28? Google "1/28" and you get an unwanted mathematical result, but if you Google "January 28," the top three links (on May 29, 2009) are:

1. http://en.wikipedia.org/wiki/January_28
2. http://en.wikipedia.org/wiki/Space_Shuttle_Challenger_disaster
3. http://www.brainyhistory.com/days/january_28.html

The explosion of the space shuttle Challenger (note the hint in the title), an event many readers will personally recall, is a prominent feature in Visual Explanations, Edward Tufte's 1997 book. The third URL above also provides us with two useful details that relate to our data: "1986 Space Shuttle Challenger 10 explodes 73 seconds after liftoff." Our data consist of 10 values, which is the entire population. Order matters, and the data have been transformed. After trying some transformations, one should realize that x[10] = log10(73) = 1.863323. This number (73) is typically listed in all articles and web pages dedicated to the Challenger explosion. Tufte discusses the Challenger explosion in detail and mentions "73 seconds" twice (p. 38 and p. 39). Now that we know the data have been log10-transformed and that x[10] represents the last (10th) flight duration (in seconds) of the Challenger, it is easy to verify that the other nine observations also represent log10-transformed flight
durations (in seconds) of the first nine missions of the Challenger. Possible sources for these flight durations can be found at the Wikipedia web site http://en.wikipedia.org/wiki/Space_Shuttle_Challenger, which provides a slightly incorrect list of all 10 flight durations. Another possible source is the official NASA web site: http://science.ksc.nasa.gov/shuttle/missions. One hint states that x[1:9] may differ slightly depending on the source. For example, the Wikipedia page above lists "5 days, 00 hours, 23 minutes, 42 seconds" as the duration of the first Challenger mission, whereas the NASA web page indicates "5 days, 2 hours, 14 minutes, 25 seconds." So, these 10 observations indeed represent the log10-transformed flight durations (in seconds) of all 10 missions of the Challenger.

The puzzle question arose when Jürgen Symanzik was teaching a course in statistical graphics and wondered how these population data could be plotted in a meaningful way (that is, avoiding a plot in which the first nine observations sit in the upper part while the most important 10th observation sits alone in the lower corner). Neither the plot of the original flight durations (versus mission number) nor the plot of the log10-transformed flight durations (versus mission number) provides a satisfactory graphical representation of these data. The reader is left with the additional challenge to come up with a meaningful graphical representation of these flight durations.
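For readers who want to verify the solution, the arithmetic is easy to reproduce. The sketch below (a few lines of Python; the duration values are those tabulated in the winning entry's Table 1, which follows) converts each mission duration to seconds and applies the log10 transform.

```python
import math

# Mission durations as (days, hours, minutes, seconds); the first nine are the
# completed Challenger flights, the last is the 73-second final flight.
durations = [
    (5, 2, 14, 25), (6, 2, 23, 59), (6, 1, 8, 43), (7, 23, 15, 55),
    (6, 23, 40, 7), (8, 5, 23, 33), (7, 0, 8, 46), (7, 22, 45, 26),
    (7, 0, 44, 51), (0, 0, 1, 13),
]

for d, h, m, s in durations:
    seconds = ((d * 24 + h) * 60 + m) * 60 + s
    print(f"{seconds:7d} s  ->  log10 = {math.log10(seconds):.6f}")
```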
CHANCE Challenger Contest Winner Stephanie Kovalchik
Figure 1. The winning entry's graphic: the log10-transformed flight durations displayed on a single plot.
Stephanie Kovalchik completed her bachelor's in biology and literature at Caltech. After graduation, she entered UCLA's biostatistics program, where she earned a master's and is currently in her third year as a doctoral student. With the support of a UCLA Biostatistics AIDS training grant, she is working with her adviser, William Cumberland, to develop methods for meta-analyses of time-to-event outcomes.
Ever eager for a chance to make a graphical discovery, I was ready to take Jürgen Symanzik's challenge. With fame and a year of free stats studies on the line, I decided that some guidance from an expert was in order. But who could crack the case? The enterprise needed a person with the astuteness of Lord Peter Wimsey and the mathematical genius of Sir R. A. Fisher (think Holmes in a gendarme's hat). Fortunately, I had become acquainted with just the person when I began my graduate work in the biostatistics department at UCLA. More fortunately, he has always been willing to advise me on any analytic inquiry. Off I dashed to email the statistical sleuth, forwarding the contest description. Below is his answer, addressed to me, for which I give him full credit.
December 29, 2009

Dear SK,

A most challenging challenge—yet not irresolvable. Apply a few statistical principles and you will have your solution. What follows is a description of my deductive journey and a few concepts along the way, which I hope you will find instructive.

Principle 1: Everything is data. The seven explicit hints are only a subpopulation of the clues provided by the puzzler. Note the curious description of the contest as a "challenger" and the repeated emphasis on the date "1/28" (hint 6).

Principle 2: Learn from the history of the ineffective communication of data or be destined to repeat it. Edward Tufte's book Visual Explanations gives an unforgettable examination (hint 5) of the 13 hastily compiled charts of the Morton Thiokol engineers, which taught us how missing, unrepresentative, and small samples could thwart accurate inference. This brings us to our first deduction: The data pertain to the 10 missions of the Challenger shuttle (hint 7), commencing with the maiden voyage of April 4, 1983 (hint 1), and ending with the tragic (hint 4) final flight, January 28, 1986.

Principle 3: A good statistician is a storyteller with a penchant for quantification. The tale told by the Challenger data set is of a continuous measure, with one conspicuous outlier and no duplications, excluding the possibility of a categorical measure, such as erosion or damage events. Joint temperature would seem the next most fitting candidate except for the presence of ties, which no deterministic transformation (hint 2) would change. We must broaden the scope of our variables.

Principle 4: Be resourceful, but know the biases of your sources. At this stage, I unabashedly consulted Wikipedia (2009), whose page on the Challenger shuttle contained a descriptive table of the 10 missions of interest and a variable not previously considered: mission duration. Notably, this measure is continuous and without repeats. Also, nine of the voyage lengths were quite consistent, having an average of 6 days, 22 hours, 12 minutes, and 47 seconds, and a standard deviation of 1 day, 1 hour, 22 minutes, and 22 seconds, excluding the failed final mission, a brief 73 seconds in length. This described the very same clustering characteristics of the data set sought. Ordering the observations by launch date and obtaining the ranks of the mission lengths returned ranks matching our puzzler's data. Only the choice of time scale and the data transformation remain to be determined.

Principle 5: An outlier is not a nuisance but the most informative of observations. Focusing on the 1-minute-and-13-second final flight, a transformation on the minute scale can be excluded, as this is already quite close to the 1.863323 result we seek.
Table 1—Challenger Mission History

Mission Designation   Launch Date   Duration (Days:Hours:Minutes:Seconds)   log10(seconds)
STS6                  4Apr83        5:2:14:25                               5.643517
STS7                  18Jun83       6:2:23:59                               5.721843
STS8                  30Aug83       6:1:8:43                                5.718105
STS41B                3Feb84        7:23:15:55                              5.837939
STS41C                6Apr84        6:23:40:07                              5.780754
STS41G                5Oct84        8:5:23:33                               5.851633
STS51B                29Apr85       7:0:8:46                                5.781989
STS51F                29Jul85       7:22:45:26                              5.836783
STS61A                30Oct85       7:0:44:51                               5.783540
STS51L                28Jan86       0:0:1:13                                1.863323
This leaves the choice of seconds units and, with further reflection on Tufte and his fascinating bivariate displays of mammalian brain to body mass, the consideration of a log10 transform. Applying this to the Wiki data yields a solution with a few forewarned discrepancies (hint 3).

Principle 6: It is unnecessary to make a picture when you have the thousand words. The log10 transform is a useful tool in that it enables us to include all the observations of a dispersed variable on a single plot, avoiding overlapping symbols or the statistical transgression of data exclusion (another infamous shortcoming of the prelaunch risk analysis for the final Challenger mission). With this transform we gain compactness but little insight (Figure 1). I positively concur with Howard Wainer, in his 2009 CHANCE article, "A Good Table Can Beat a Bad Graph: It Matters Who Plays Mozart": the characteristics of some data sets are best communicated with a table (Table 1).
Further Reading

Fygenson, M. 2008. Modeling and predicting extrapolated probabilities with outlooks. Statistica Sinica 18:9–90, with discussion.

Tufte, E. R. 1997. Visual Explanations. Cheshire, CT: Graphics Press.
QED and cheers, The Statistical Sleuth
CHANCE Challenger Contest Honorable Mention

The graph displays the shuttle launch duration versus temperature on the day of launch. The numbers indicate the order of shuttle missions. Duration on the log10 scale is on the vertical axis. Minimum temperatures on launch days are displayed along the horizontal axis. Some historical notes and scientific information also are included.

Author: Brad Thiessen
Brad Thiessen is director of institutional research and strategic planning at Touro University California in Vallejo, California, and an associate professor of mathematics at St. Ambrose University in Davenport, Iowa. He earned his PhD in educational measurement and statistics from the University of Iowa.
Hey, Who Turned Off the Lights?
A look at electricity consumption

Bernard Dillard
Scene One: August 14, 2003. New York City (NYC). A hot, humid dog day. You’ve traveled to the city of dreams to see a matinee of “The Lion King” on Broadway. Horns blowing. Skyline breathtaking. People scrambling. In the theater, you settle into your orchestra seats and great singing and dancing commence. In a magical moment, Rafiki prepares to lift Simba and Nala’s newborn cub as next in line to rule the Pride Lands when suddenly the power goes out. “Please exit the theater.”
Scene Two: April 12, 2004. Los Angeles International Airport (LAX). You’ve finally booked that trip to Hawaii. Sitting in the coach section in the back of the plane by the restroom, you calm yourself: The flight will not be that long, and soon you’ll be walking on the black sand at Big Island. You settle into your seat and close your eyes. “Attention passengers, there will be a (long) delay due to a power outage in one of the control towers. Feel free to get up and use the restroom while we wait.”
Scenes like those described above were two of the many that could have occurred during actual power outages in the two largest U.S. cities. Theories abound as to why those blackouts happened when they did. One theory concerning the 2004 blackout involves a bird sitting on a power line. But that theory could not account for simultaneous outages at the Bellagio Hotel in Las Vegas. Many scholars maintain these outages had more to do with an overloading effect of electricity consumption on a power grid—much like overuse of power in a home causes a fuse to blow. Many posit that, if appropriate measures had been in place to detect signs of excessive electricity use, these power failures could have been avoided through the use of cutting-edge statistical monitoring techniques. Hence, at a time when an understanding of electricity consumption data and its relation to other factors is pivotal to various aspects of American life, including national security, proper analysis of
historical consumption and related data becomes a key element for successfully detecting future abnormalities. The natural way for a statistician to treat electricity consumption data is as a time series. Two factors make the analysis and monitoring tasks nonstandard. First, like many modern time series, the time scale on which the data are collected is frequent. Clearly, the level of data aggregation depends on the objective of the application or analysis. In our case, we are interested in rapid detection of abnormal behavior in electric consumption or related variables (such as temperature) that could indicate a blackout is imminent. According to S. Basu and A. Mukherjee's 1999 INFOCOM article, “Time Series Models for Internet Traffic,” traditional time-series models, such as autoregressive integrated moving average (ARIMA) models, are not useful for data measured on such a frequent scale. The second complicating factor is we typically monitor not only the electricity consumption series but also a set of several related time series, out of a belief
the other series might carry information about consumption. For example, we can monitor the weather and hypothesize that an extreme wave of cold or hot weather would lead to increased consumption. Thus, we need a method that can simultaneously monitor multiple time series and take into account the interrelations between the series. The discrete wavelet transformation (DWT) can be used to analyze the features and structure of electricity consumption and consumption-related data measured on a frequent time scale. Multiscale statistical process control (MSSPC), which combines DWT and control chart methodology, can be used to assess abnormalities in individual time series. These techniques are illustrated in this article using data on hourly electricity consumption and temperatures in New Hampshire from August 29 to September 1, 1997. What can we learn about the time series using these methods? What do we anticipate will be possible with improved statistical methods?
Electric Consumption and Temperature Data

It is well known that fluctuation in electricity consumption depends heavily on many factors, the most important source being meteorology, and particularly temperature, as stated in R. Cottet and M. Smith's 2003 article, "Bayesian Modeling and Forecasting of Intraday Electricity Load," which appeared in the Journal of the American Statistical Association. Although the meteorological variables that affect load can differ according to region, temperature appears to be by far the most important meteorological factor in most locations. Consequently, we study the consumption behavior using not only electricity load data but also relevant temperature data.

The consumption data in this analysis are provided by the New Hampshire Electric Co. (www.seattlecentral.org/qelp/sets/042/042.html#About). They record the electric consumption over the course of four days from one delivery point in New Hampshire at the end of August 1997. The electricity consumption load was measured in kilowatt-hours (kWh) and was recorded over a period of 96 hours: from 12:52 a.m. on August 29, 1997, to 11:52 p.m. on September 1, 1997. Figure 1 describes the consumption series graphically by using a time plot.

The time plot reveals several notable observations. The most noticeable pattern is a cyclical fluctuation, which repeats daily. Typically, early morning hours are characterized by low energy consumption, whereas during the late morning and evening hours, consumption reaches the highest values of the day. This reflects the levels of activities of most people during late morning and evening hours, which require more electricity usage. This pattern of usage results in a daily toothlike structure, which has been observed in other geographical areas and at other periods of time, such as in Harvey and Koopman's "Forecasting Hourly Electricity Demand Using Time Varying Splines" in the Journal of the American Statistical Association.

Also, on September 1, the toothlike structure differs from those of the previous days. The relative maximum on this day exceeds that of the previous three days by far. On this day, the maximum point occurs at 8:52 p.m. The electricity consumption load at this time is 2148.12 kWh, which is the highest load of the entire four days.
Figure 1. Graph of hourly electric consumption load over time for four days in 1997 in New Hampshire
In addition, the second highest consumption on that day (taking place at 10:52 a.m.) is higher than the highest consumption levels on the previous three days.

Along with the consumption data, we consider temperature data for the same area and during the same time. We use hourly dry-bulb temperatures from an hourly observation table from Mount Washington Regional Airport
(HIE) in Whitefield, New Hampshire, from August 29 to September 1, 1997. These data are from the National Climatic Data Center in Asheville, North Carolina (http://lwf.ncdc.noaa.gov/oa/ncdc.html). Temperatures are recorded hourly from 12:52 a.m. on August 29 until 11:52 p.m. on September 1. Figure 2 provides a time plot of the temperature data.

Figure 2. Graph of hourly temperatures over time for four days in 1997 in New Hampshire
The World of the Wavelet

Meaning "small wave," the term "wavelet" refers to mathematical functions that break data down into different frequency components and that then analyze each of the frequency components with a scale-matched resolution. In their article "Wavelets for Computer Graphics," which appeared in IEEE Computer Graphics and Applications, E. J. Stollnitz, T. D. DeRose, and D. H. Salesin describe wavelets as mathematical tools for hierarchically decomposing functions. They essentially allow functions to be described in terms of a coarse overall shape, along with details that range from broad to narrow.

Historically, wavelets have been touted as the quintessential mathematical tool for image compression. In computer science circles, they have been lauded for their ability to flexibly adapt to shapes and patterns of the original image and reconstruct them using minimal space. Through a tag-team effort of using high- and low-pass filters, wavelets produce snapshots of images while minimizing pixilated space. Because wavelets possess such a great ability to stretch and shrink, they are able to confront the task of duplicating complex pictures.

Over the last decade or so, the utility of wavelets has widened from this well-known idea of image compression to the relatively new area of anomaly detection. Even though several scholars have suggested that these mathematical tools may provide promising results for such detection, hardly any literature exists in which these methods are examined alongside age-old, more traditional approaches for detecting out-of-control processes. Simply said, since wavelets possess this uncanny ability to adapt and flex, they become ideal "spies" on the hunt for unknown aberrant behavior in a time series.

Of course, classical approaches to modeling techniques for detecting anomalies center on autoregressive moving-average (ARMA) models and Fourier analysis. Wavelet analysis, however, becomes more suitable than these traditional methods for several reasons:

1. Wavelet analysis allows us to analyze a series while simultaneously preserving temporal and spatial information. Other key methods preserve either temporal or spatial information, not both.

2. Wavelet analysis is more flexible in its monitoring of frequent data (data that are daily, hourly, etc.).

3. Wavelet analysis requires the least tweaking from the nonstatistician user; current software makes for a user-friendly environment to aid in the use of wavelet analysis.
The most prominent pattern in the temperature data is, like the consumption data, a daily cyclical pattern with highs in the late afternoon and lows in the night. The highest temperature of 75°F occurs on the third day (August 31) at 2:52 p.m. (All temperatures are Fahrenheit.) Interestingly, the lowest temperature of 48° is also on the third day. These lows occur at 1:52 a.m. and 3:52 a.m. Temperatures during this four-day period are not unusual for New Hampshire during this time of the year.
The One-Dimensional Wavelet Transform

The goal of using wavelets is to turn the information of a signal into coefficients, which can be manipulated, stored,
transmitted, analyzed, or used to reconstruct the original signal. From a methodological point of view, wavelet techniques offer an analysis of the series as a sum of orthogonal signals corresponding to different time scales. From a more practical viewpoint, wavelets are used to extract information from different types of data like audio signals, images, and, more recently, over-the-counter sales and electricity consumption. The choice of wavelet used in the analysis is based on the interplay between a specific analysis goal (e.g., signal processing or monitoring) and the properties needed in a wavelet filter to achieve that goal. More on wavelets is explained in the two sidebars, “The World of the Wavelet” and “The Haar Wavelet.”
By definition, the discrete wavelet transform (DWT) is described by the mathematical representation

$$W_x(a,b) = \frac{1}{\sqrt{a}} \int x(t)\,\psi\!\left(\frac{t-b}{a}\right) dt,$$

where Wx(a,b) is the set of transformed wavelet coefficients at the appropriate approximation and detail levels, while x(t) represents the data points of the original series at time t. We express ψ(t) as the basic wavelet function onto which the signal is analyzed. When stretched or shrunk, ψ(t) becomes ψ((t−b)/a), where a is the dilation parameter and b is the translation parameter. In the example in this paper, the basic wavelet is stretched or shrunk to the point at which it has become the Haar wavelet. "The Haar Wavelet" sidebar presents more details of the Haar wavelet function and describes a simple numerical example.

Essentially, the DWT is a decomposition of the original series into a linear combination of detail coefficients and the finest approximation coefficients. The DWT is very useful in analyzing data that have been parsed into varying (or multiscale) levels. Hence, wavelets become central to the analysis of our consumption and temperature data.
DWT of the Consumption Data

For the electricity consumption data, the signal is analyzed using the Haar, which is the most basic wavelet. Each wavelet, of course, has its own unique characteristics. The choice of wavelet depends on a few things: insight into the data (what each level captures), the goals of the monitoring experience, and ease of interpretation and generalization. Decomposing our consumption time series with the Haar is appropriate because the goal is to detect any sudden shifts occurring between back-to-back data points. Since the Haar's basic makeup is to average consecutive data points, its strength is its ability to pinpoint any huge jumps or dips in the data stream.

Figure 3 describes the wavelet decomposition of the electricity consumption data using the Haar wavelet and five levels of decomposition. The original signal of the consumption data is given as the top graphs on either side, while the five graphs beneath represent the Haar wavelet decomposition at each of the five levels.
The Haar Wavelet

The simplest wavelet we can use to decompose a series is the Haar wavelet. In the case of the Haar, it is useful to note its mathematical representation. Let ψ(t) be defined by:

$$\psi(t) = \begin{cases} 1 & \text{if } 0 \le t < \tfrac{1}{2} \\ -1 & \text{if } \tfrac{1}{2} \le t < 1 \\ 0 & \text{otherwise} \end{cases}$$

When applied to data, this wavelet performs a moving average for pairs of consecutive data points. An example is given below.

The Haar mother wavelet, ψ(t)
A Simple Discrete Wavelet Transformation (DWT) Example Using the Haar Wavelet

Consider the simple time series of eight observations: [9 27 30 14 20 32 50 26]. A process of averaging and differencing is employed to produce key values used in decomposing the series. Here, we apply the Haar-based DWT for only two levels to convey the basic process. The first set of values that arise from applying the Haar to the series comes about through averaging data pairs. For example, the average of 9 and 27 gives 18. The average of 30 and 14 gives 22. Summarizing in this vein, we have 18, 22, 26, and 38. These are referred to as the approximation coefficients of the DWT. This is the first level of averaging. In signal processing, especially, we would concern ourselves only with these four approximations, which results in the notion of "downsampling," or reducing sample size. For the purposes of monitoring our data, however, it becomes appropriate to ignore downsampling and maintain all sample data points. Our goal is not to compress an image but to monitor a series. Consequently, at level A1, we preserve all data information from the original series by averaging and duplicating these newfound approximations, namely, [18 18 22 22 26 26 38 38].
Figure 1. Original signal and DWT using the Haar wavelet
We then measure deviations by subtracting the average of the data pairs from the first data point of the appropriate pair. For example, the difference between 9 (first point of first data pair) and 18 (average of first data pair) is –9. The difference between 30 and 22 (average of second data pair) is 8. Continuing in this fashion, these deviations are –9, 8, –6, and 12 and are referred to as “detail coefficients.” At level D1, these detail values are presented such that, when added back to their corresponding approximation counterparts, they give the values for the original series. Hence, the detail values are [–9 9 8 –8 –6 6 12 –12]. Essentially, we have the approximation values added to the detail values to yield the values in the original series. Or, [18 18 22 22 26 26 38 38] + [–9 9 8 –8 –6 6 12 –12] = [9 27 30 14 20 32 50 26]. Figure 1 shows a plot of the original series along with plots of the approximation (level A1) and detail (level D1) coefficients for this first-level DWT. Now, level A1 becomes our “new” signal. We get new approximations by averaging distinct values in A1: (18+22)/2 = 20 and (26+38)/2 = 32. We report approximations as [20 20 20 20 32 32 32 32]. Instead of reporting each set of approximation values twice as in level A1, we repeat them four times in level A2 to maintain the original number of data points (no downsampling effect). We get detail values at level D2 as we did in level D1. We subtract “new” approximation values (mimicking the role of the original signal) from “old” approximation values. Or, [18 18 22 22 26 26 38 38] – [20 20 20 20 32 32 32 32] = [–2 –2 2 2 –6 –6 6 6]. Figure 1 shows these two sets of new approximation (A2) and detail (D2) coefficients. Note that A2+D2 = A1 and A1+D1=Signal. One could report A1 or A2 to approximate the series and reduce the amount of information that is required. Or one can examine the details (D1 and D2) to look for abnormal behavior.
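The averaging-and-differencing recipe in this sidebar is easy to reproduce in a few lines of NumPy. The sketch below (function names are mine) recreates the two-level example above, including the repeat-instead-of-downsample bookkeeping, and checks the reconstruction identities A1 + D1 = Signal and A2 + D2 = A1.

```python
import numpy as np

def haar_step(x):
    """One Haar averaging step: pairwise means of consecutive points."""
    x = np.asarray(x, dtype=float)
    return (x[0::2] + x[1::2]) / 2.0

signal = np.array([9, 27, 30, 14, 20, 32, 50, 26], dtype=float)

# Level 1: average pairs, then repeat each average so no data points are
# lost (no downsampling); the details are what must be added back.
m1 = haar_step(signal)        # [18 22 26 38]
A1 = np.repeat(m1, 2)         # [18 18 22 22 26 26 38 38]
D1 = signal - A1              # [-9 9 8 -8 -6 6 12 -12]

# Level 2: treat the level-1 means as the new signal.
m2 = haar_step(m1)            # [20 32]
A2 = np.repeat(m2, 4)         # [20 20 20 20 32 32 32 32]
D2 = A1 - A2                  # [-2 -2 2 2 -6 -6 6 6]

assert np.allclose(A1 + D1, signal)   # A1 + D1 = Signal
assert np.allclose(A2 + D2, A1)       # A2 + D2 = A1
print(A1, D1, A2, D2, sep="\n")
```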
Figure 3. Discrete wavelet transformation (DWT) of electricity consumption data using the Haar wavelet
Figure 4. Discrete wavelet transformation (DWT) of temperature data using the Haar wavelet
The graphs on the left correspond to the approximations, and those on the right correspond to the details. For ease of viewing, we insert vertical lines, which show the approximate beginning and ending of days. We reiterate that we are using a DWT here: Our concern is to analyze the consumption data independent of the temperature data.

We see that different features of the data are captured at different levels of the decomposition: While D4 captures the daily cycle, D3 captures the four similar toothlike structures that are apparent in the original series. At level D1, the points marked by I capture the most abrupt increases and decreases in consumption (day 4). At level D2, point II captures the absolute maximum value of the original series of 2148.12 kWh at 8:52 p.m. on day 4. At level D4, point III captures the absolute minimum consumption value of 882.36 kWh at 3:52 a.m. on day 4.

Regarding the choice of the number of decomposition levels, it appears that five levels capture most of the relevant information in the data, whereas further levels of decomposition contain only irrelevant noise. Technically, the fifth level of decomposition requires 2⁵ = 32 data points (which we have, in fact, with 96). Decomposing into too many levels imposes more "holes" in the analysis and would sacrifice accuracy. Of course, if there were a different consumption series, the locations of I, II, and III would be different, depending on the traits of the different series. When the time series change, the results change. Further, even if the series were massively long, the DWT would still highlight these points, since the Haar focuses on what occurs between two consecutive points and not on what happens among a massive group of points.

One should remember that, at this stage, no anomaly detection has taken place. The DWT is simply a method that summarizes and breaks down the original series into detail and approximation coefficients. It is true that the sum D1+D2+D3+D4+D5+A5 yields the original value of the electricity consumption series at each time point.
DWT of the Temperature Data

Following the same logic as stated with the consumption data, we choose the Haar wavelet to decompose the temperature series. Again, the task using this DWT is to recognize key details in
the series independent of the electricity consumption series. Figure 4 describes the Haar wavelet decomposition of the temperature series. The original series is given as the top graph, and the five graphs below it represent the five-level decomposition. The graphs on the left correspond to the approximations, and those on the right correspond to the details. Daily times are inserted, which approximately correspond to 1 a.m. and 1 p.m.

The decomposition of the temperature series reveals several details. Level D1 captures the sharpest changes in temperatures (two points marked by I): the sharp increase on day 3 (from 61° to 69°) and the sharp fall in temperature on day 4 (from 68° to 62°). It is also interesting to see what happens at the two points marked by II, which occur on the second day around 6 p.m. This peak represents the difference between the approximations (on the left side) at levels 1 and 2 and at levels 2 and 3. At this point, the decomposition highlights the major and sudden decrease in temperature during this time. The approximation differences at these times become noteworthy. At level D4, the Haar captures the daily cycle, highlighting each of the four "hills" in the original series. Point III captures the highest peak, occurring on the third day, while point IV captures the lowest point of the time series.
Monitoring Time Series

Traditionally, the established statistical method of control charts has been used for monitoring a process over time for the purpose of detecting abnormalities. This technique has been heavily used in many applications. It provides systematic guidelines for showing whether a process is "in control" or "out of control," where an in-control process is defined as "a process that is operating with only chance causes of variation present," or one in which the chance causes of variation are an inherent part of the process, as D. C. Montgomery states in Introduction to Statistical Quality Control. Sources of variability that cause the process to be out of control arise, for example, from operator errors, maladjusted machines, or other defective bases.

To create a control chart for monitoring the mean of a process, independent and identically distributed (i.i.d.) samples are taken at every time point from the process, and the following are computed: the process (or sample) mean, the sample
size, the standard deviation of the sample average, and a constant value for the Z-statistic. We use these values to compute the upper control limit (UCL) and lower control limit (LCL), which are the statistical cutoffs for assessing whether the process is in or out of control. Once a new sample arrives, if its mean exceeds the control limits, an out-of-control alarm is raised. Control charts that abide by these principles are referred to as "Shewhart control charts," after their developer, Walter S. Shewhart. Conventionally, we use the constant value of 3 for our Z-statistic, which further classifies our limits in the chart as "3-sigma" control limits.

The problem with simple Shewhart charts is that they assume i.i.d. samples. In a time series, we typically have samples of size 1 (one series of measurements), and the points are autocorrelated (correlated with one another over time). Furthermore, in many cases we cannot assume, as Shewhart charts do, that the distribution of the observations is normal. On the other hand, we showed in the previous section how wavelets can be used to analyze autocorrelated observations without making distributional assumptions.

A method called "multiscale statistical process control" (MSSPC), combining DWT and Shewhart control charts, was introduced by B. R. Bakshi and is primarily used in chemical engineering circles. The idea of this methodology is to decompose the signal using DWT and then to monitor the coefficients at each scale separately using a Shewhart chart. If there is an alarm at one or more scales, the original series is reconstructed from all the coefficients that exceed the thresholds, and another Shewhart chart is used to monitor this reconstructed series. Figure 5 illustrates this MSSPC process. MSSPC sounds an alarm if a new point in the reconstructed series exceeds the control limits. Next, we illustrate this method for our two time series.
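To make the MSSPC recipe concrete, here is a minimal Python sketch, not the authors' code: the Haar scheme follows the sidebar's averaging-and-differencing logic (block means without downsampling), and the 96-hour load series is simulated with an injected spike, since the New Hampshire data are not reproduced in the article.

```python
import numpy as np

def block_means(x, size):
    """Blockwise means over consecutive blocks of `size` points,
    repeated so the result has the same length as x."""
    return np.repeat(x.reshape(-1, size).mean(axis=1), size)

def haar_decompose(x, levels):
    """Haar DWT without downsampling: A_l averages blocks of 2**l points,
    D_l = A_(l-1) - A_l, so D1 + ... + DL + AL reconstructs the signal."""
    x = np.asarray(x, dtype=float)
    approx, details = x, []
    for lev in range(1, levels + 1):
        nxt = block_means(x, 2 ** lev)
        details.append(approx - nxt)
        approx = nxt
    return details, approx

def three_sigma(coeffs):
    """Shewhart-style limits from a baseline stretch of coefficients."""
    mu, sd = coeffs.mean(), coeffs.std()
    return mu - 3 * sd, mu + 3 * sd

# Simulated stand-in for the 96 hourly load readings: a daily cycle plus
# noise, with an artificial consumption spike late on day 4.
rng = np.random.default_rng(7)
hours = np.arange(96)
load = 1400 + 300 * np.sin(2 * np.pi * (hours - 8) / 24) + rng.normal(0, 30, 96)
load[92] += 600                                  # injected anomaly

details, approx = haar_decompose(load, levels=5)
flagged = np.zeros(96, dtype=bool)
for coeffs in details + [approx]:
    lcl, ucl = three_sigma(coeffs[:72])          # limits from the first three days
    flagged |= (coeffs < lcl) | (coeffs > ucl)   # coefficients outside the limits

# MSSPC would now reconstruct a series from only the flagged coefficients
# (zeroing the rest) and run one more Shewhart chart on that reconstruction.
print("hours with out-of-limit coefficients:", np.flatnonzero(flagged))
```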
MSSPC for the Consumption Data

Figure 3 shows a DWT of the electrical data using the Haar. The UCL and LCL are computed at each decomposition level by using the standard deviation and mean of the coefficients in that level. For the approximation level, the mean is exactly the same as for the original series.
MSSPC Methodology
Figure 5. Illustration of the MSSPC algorithm. X represents the initial series and W represents its wavelet decomposition. The quantity aLX represents the finest approximation level, while d1X–dLX represent the detail levels; the mth detail level is denoted dmX. SPC denotes the implementation of a Shewhart control chart at that level. WT represents the wavelet reconstruction, which we call X̂. Finally, we construct a control chart for the reconstructed signal, which we denote by SPC(X̂).
Figure 6. Application of multiscale statistical process control (MSSPC) to electricity consumption data. [Panels: the original signal, details D1–D5, the finest approximation level A5, and the reconstructed series R1, plotted over days 1–4.] For each pair of lines, the top dotted line represents the upper control limit (UCL) and the bottom dotted line represents the lower control limit (LCL).
Figure 7. Application of multiscale statistical process control (MSSPC) to temperature data. [Panels: the original signal, details D1–D5, the finest approximation level A5, and the reconstructed series R1, plotted over days 1–4; points I–IV mark the out-of-control values discussed in the text.] For each pair of lines, the top dotted line at each level represents the upper control limit (UCL) and the bottom dotted line represents the lower control limit (LCL).
For the detail coefficients, the mean at each decomposition level is zero. The standard deviation varies per level and is estimated from the coefficients in that level. Figure 6 shows the MSSPC method applied to the electricity data. Each of the four days is labeled as a marker in the monitoring process. We use observations only from the first three days (72 hours) to determine the UCL and LCL. Doing this allows us to create tighter UCLs and LCLs in our attempt to identify outlying coefficients. The top dotted line in each of the D1–D5 and A5 levels represents the UCL, which is given by the mean plus three times its standard deviation. The dotted line below it represents the LCL, which is given by the mean minus three times its standard deviation. We observe that in each level, there is no point exceeding the UCL or LCL. Based on our baseline stability period, the closest point that approaches either of the limit lines is
point I, which is the greatest consumption value. According to the control chart, however, since it still lies inside the control limits, it is not classified as a point that is out of control. Because none of the points at any of the detail levels or the finest approximation level falls outside the control limits, none of them falls outside the limits in the reconstruction phase. For this series, MSSPC advises us to zero out all points at this phase and conclude that there is no anomalous behavior attached to this particular process of electricity consumption. Level R1 (the reconstructed series) illustrates this, as all points have been zeroed out. There are no UCL and LCL at this level because there is no standard deviation to take into account, since all of the points at each of the decomposition levels remain inside the detection limits. Hence, all sources of variation in this case are considered to be an inherent part of the process.
MSSPC for the Temperature Data

MSSPC is now applied to our temperature data to see if any points are detected that contribute to an out-of-control process. Figure 7 illustrates this process applied to our familiar temperature data points. Again, only observations from the first three days (72 hours) are considered for determining process stability. The upper and lower dotted lines are identified as the UCL and LCL, respectively, as described before. The difference in this data set is that there are a couple of points that fall outside of the UCL and LCL. Points I and II fall outside the limits at their decomposition level. For this detail level, the algorithm gives a UCL of +3.45 units and an LCL of −3.45 units; points I and II are located at about +4 units and −4 units, respectively. This means that in the reconstruction of the series (R1), these
coefficients would be retained and added to any other coefficients at the same time point that lie outside the UCL and LCL of their respective decomposition detail levels. These points are denoted by III and IV. Only the finest-scale coefficients are used to compute the reconstructed signal and its detection limit, so we use the standard deviation in D1 to calculate the detection limit in R1. The reconstructed series then provides us with statistically based reasoning for concluding that some of the variation in the temperature series is attributable to factors other than mere chance.
The Baseline Stability Period

Although wavelets have the ability to monitor hourly data, as seen in this article, real applications typically would have several days or weeks of data available. In our case, there is no guarantee that the temperature points that were labeled as out of control would continue to be so if there had been a much longer length of stability on which the UCLs and LCLs were based. This, in fact, becomes the beauty of MSSPC. Although the algorithm is nonnegotiable with respect to its methodology, it is flexible in that its results are strictly data-driven, data-dependent, and data-sensitive. UCLs and LCLs may change, slightly or drastically, for a different baseline period of stability. A great deal of responsibility lies with the statistician whose task it is to analyze the history of the data and to report the baseline stability period upon which anomalous behavior will be based. As one might imagine, it becomes advantageous to revisit the data and update this stability period every so often to take into account reasonable data changes (such as expansion of the customer base). Any nonsignificant points near but not beyond the control limits will generally remain nonsignificant even if there had been a longer series, as long as the stability period remains unchanged. Hence, a change in significance or nonsignificance of out-of-control points becomes a function of the change of the baseline stability period, as opposed to a function of how long the data series itself is. Irrespective of the baseline period, we still rely heavily on the strength of the Haar wavelet to zero in on sudden and quick disparities in consecutive
time points and signal an appropriate alarm. As mentioned, the Haar has a short memory, which actually serves as a strength in this particular application. This wavelet does not depend on how long a process has been "normal" or how long the baseline stability period is. There is no definitive formula for calibrating this stability threshold. Determining the length of stability, then, should be based on a logical sense of data history within the business (such as the power company). Care should be taken in establishing this baseline period so as to minimize the number of false positives (alerts that are not really alerts) and the number of false negatives (real alerts that are not reported). Practically, businesses would probably need to hire statistical consultants to parse through data and determine an appropriate length of process stability based on each particular data set.
Practical Decisions Based on Results

What do you do with out-of-control points in the reconstruction phase? After all, what good is all the statistical talk if there is no suggested course of action? Although there are no hard-and-fast rules concerning what should be done in cases where statistically based methods are employed, we can at least make decisions informed by something other than gut feeling. In this analysis, our concern primarily lies with values exceeding the UCL, since we were monitoring for high electricity consumption or high temperature. But we could very well have been monitoring both variables during the winter months, in which case temperature values falling below the LCL could signal an alarm by translating into increased consumption and possible grid system failure. Hence, using both the UCL and LCL becomes a key strategy for pinpointing such out-of-control data points. Further, situations exist in which both limits may not be needed; only a one-sided limit may be required, thereby allowing a less extreme value for significance. For example, one application of this method is to the area of biosurveillance as it relates to rapid detection of a large-scale bioterrorist (such as anthrax) attack. In the
last several years, scholars such as Goldenberg et al. have led efforts to monitor nontraditional data sources, such as sales of various over-the-counter (OTC) medications. In their study, they explore situations in which they monitor sales of certain grocery items that people may purchase to treat flulike symptoms. Significant spikes in daily purchase levels above a prescribed UCL only (and not below a certain LCL level) might suggest cause for alarm and may allow for timely intervention before life-threatening spores would cause damage to a person's respiratory system. But the idea is that, through real-time data capture (via UPC bar-code scanning), this data mining and monitoring technique could be applied almost immediately to serve the public's safety interest. In this instance, emphasis would be placed only on analyzing what transpires above the UCL, thus lowering our limit cutoff point and allowing for a less extreme high value for significance. Essentially, the cutoff value applied with MSSPC would depend on the data and the nature of the process to be monitored. The results here show the implementation of MSSPC with the UCL and LCL overlaying all the data points. This simply shows the overall verdict on which points would have signaled an alarm. However, in real life, the points would be identified shortly after having been observed or recorded. Once a point was identified as being out of control, it would immediately be reported to an automated system set up to monitor quality control, which would be established by the statistician and the power grid company. In this new world of technology, little effort would have to be exerted to devise an automated monitoring system to do this and forward information to powers-that-be in control towers to have them make decisions concerning what the system has found to be out-of-control points. Since outlying aberrant data points are strictly based on the historical baseline stability period, we would not need knowledge of the entire series to zero in on out-of-control points. The algorithm identifies and catches the point quickly. Hence, if the data were being monitored hourly or daily, the automated system could report that aberrant behavior shortly after that hourly or daily observation occurs, respectively.
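As a small illustration of that one-sided variant, the sketch below flags only values above an upper limit estimated from an initial baseline period; the Poisson counts, the baseline length, and the three-sigma multiplier are hypothetical choices for the example, not data from the Goldenberg et al. study.

```python
import numpy as np

def one_sided_alarms(series, baseline_len, k=3.0):
    """Flag observations above a one-sided upper control limit
    estimated from an initial baseline stability period."""
    baseline = np.asarray(series[:baseline_len], dtype=float)
    ucl = baseline.mean() + k * baseline.std(ddof=1)   # upper limit only
    new = np.asarray(series[baseline_len:], dtype=float)
    return baseline_len + np.flatnonzero(new > ucl), ucl

# Hypothetical daily counts of an over-the-counter remedy:
rng = np.random.default_rng(1)
sales = rng.poisson(120, 30).astype(float)
sales[27] = 210                       # a simulated spike
idx, ucl = one_sided_alarms(sales, baseline_len=21)
print(idx, round(ucl, 1))             # days flagged above the UCL
```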
In this article, we use a series having only 96 data points. In real life, we could conceivably monitor a series of indefinite length. Significance is not based on how many points are in the series: As soon as a point falls outside of the control limits, it is reported as aberrant. Being out of control is simply based on "looking back" while "rolling forward." The monitoring process would remain in effect until it needed to be updated or tweaked. The only tweaking to the algorithm would be a change in the baseline stability period (which would be entered by the statistician) and whether one- or two-sided limits are used in MSSPC. The basic idea, then, becomes to develop an automated indication system that provides an alert or flag to process operators that an observed value has exceeded prescribed thresholds and should be addressed quickly. In our scenario, an alert would be given at the time point in the temperature series that lies outside the control limits. What this would mean is that consumption should arguably be decreased within a reasonable time frame of the alert so as to avoid any possible overloading effect or blackout. Of course, one does not make decisions based solely on this algorithm. These statistical methods simply provide an objective monitoring tool, which could be used in tandem with other methods that electric companies already use to monitor their power grids. Essentially, attacking these kinds of issues using wavelet-based methods equates to dealing with such problems proactively rather than reactively.
Future Directions

This article examines wavelet-based methods for analyzing and monitoring very frequent time series. Although these methods require fewer parameters and assumptions than other traditional statistical monitoring methods, there are a few parameters that must be specified by the modeler. The first is the choice of wavelet. Again, the Haar is useful because of its ability to detect sudden, abrupt changes in the series. Since the Haar is a wavelet whose basic function is to average and subtract consecutive points in the series, it quickly detects any two consecutive points that have a large range. For gradual, slower changes in the series, this wavelet is not as useful because of its "short memory." Even if the overall series is slowly, monotonically increasing or decreasing, the Haar essentially averages out this effect. By using more complicated wavelets, we are able to capture trends and patterns of the series that are more subtle, like the situation described above. The focus, as seen in the two previous examples, has been on analyzing each data set independently of other factors. However, this traditional DWT approach is limited in that the analysis does not capture any interaction between the series, even though it was clear that electricity consumption depended, among other things, on the temperature. The primary task, then, becomes to investigate both series together and to use wavelet decomposition to analyze both signals simultaneously. Ideally, we want a method that could be generalized to more than two series and that could capture not only relationships between the series at the same time points (e.g., at hour 3) but also relationships of a lagged nature. This multivariate method, called "multiscale principal components analysis" (MSPCA), seeks to capture the interrelations between the different series and any abnormalities that might occur in this relationship. This type of monitoring becomes crucial in detecting anomalies based on the interplay between series. What do you think the results would be if both series could be monitored together in a manner that takes account of their associations with one another over time? We anticipate that we will be better able to identify out-of-control time points when monitoring both series together. A simultaneous monitoring system that would take into account changes in relations between different data sources could provide a great improvement in detecting real outbreaks and in eliminating false alarms. Such theory almost begs for great minds in the statistical field to develop the MSPCA algorithm and investigate the results, comparing them to univariate monitoring with MSSPC. These monitoring techniques can prove invaluable for the statistician. At the end of the day, however, the wavelet-based monitoring techniques described herein can be helpful in
guarding against situations that threaten normal electricity consumption load within a city. Proper use of these methods can contribute to the early detection of any ensuing electricity consumption overload and the natural response of decreasing consumption on the necessary power grid.
Further Reading

Bakshi, B. R. 1998. Multiscale PCA with application to multivariate statistical process monitoring. AIChE Journal 44:1596–1610.
Basu, S., and Mukherjee, A. 1996. Time series models for internet traffic. Proceedings of IEEE INFOCOM, vol. 2, 611–620.
Cottet, R., and Smith, M. 2003. Bayesian modeling and forecasting of intraday electricity load. Journal of the American Statistical Association 98:839–849.
Goldenberg, A., Shmueli, G., Caruana, R. A., and Fienberg, S. E. 2002. Early statistical detection of anthrax outbreaks by tracking over-the-counter medication sales. Proceedings of the National Academy of Sciences 99:5237–5240.
Graps, A. L. 1995. Introduction to wavelets. IEEE Computational Science and Engineering 2(2):50–61.
Harvey, A., and Koopman, S. 1993. Forecasting hourly electricity demand using time-varying splines. Journal of the American Statistical Association 88:1228–1253.
Hubbard, B. B. 1998. The world according to wavelets. Wellesley, MA: A. K. Peters, Ltd.
Montgomery, D. C. 2004. Introduction to statistical quality control, 5th ed. Hoboken, NJ: John Wiley & Sons.
Percival, D. B., and Walden, A. T. 2000. Wavelet methods for time series analysis. Cambridge, UK: Cambridge University Press.
Stollnitz, E. J., DeRose, T. D., and Salesin, D. H. 1995. Wavelets for computer graphics: A primer, part 1. IEEE Computer Graphics and Applications 15(3):76–84.
You, C., and Chandra, K. 1999. Time series models for internet data traffic. IEEE Computer Society: 164–171.
Least Squares or Least Circles? A comparison of classical regression and orthogonal regression
Ivo Petras and Igor Podlubny
A student came to his final exam on statistical methods in economics. The professor asked him to compute the linear regression of y versus x, and the student successfully computed some a and b of the straight line y = a + bx. Then the professor asked the student to compute the linear regression of x versus y, and the student immediately rewrote the previous equation into the form x = (1/b)y – (a/b). The professor was expecting that the student would derive the equation of a conjugate regression line and thus evaluated the student's answer as unsatisfactory. But was the student's answer really incorrect? That all depends on how the first line was calculated, which in turn depends on the criterion used for determining a and b in the first line.
Least Squares

One could hardly name another method used as frequently as the method known as the least squares method. At the same time, it is difficult to name another method that has been accompanied by such strong and long-lasting controversy. (Details of that story appear in the sidebar, "The Birth of Least Squares.") It is also difficult to find another method that is both so easy and at the same time so artificial. Figure 1 is a version of the picture that appears frequently in textbooks and slides and on blackboards as a geometric illustration of the least squares method. Recall how this figure is created. A set of points (xk, yk), k = 1, …, N, is approximated by a line y = a + bx. The classical least squares fitting consists in minimizing the sum of the squares of the vertical deviations of a set of data points,

E(a, b) = (y1 - a - bx1)² + (y2 - a - bx2)² + … + (yN - a - bxN)²,   (1)

from the chosen linear function y = a + bx. Each term in (1) corresponds to a square in Figure 1. In the opinion of the authors, this picture is ugly: It does not have any sign of mathematical beauty. It could be good for those abstract painters Wassily Kandinsky and Kazimir Malevich, but it is not good for Johann Gauss. The line and the squares are in some visual conflict. This conflict is even more obvious if we assume that the coordinate system is not rectangular. Figure 2 gives an illustration.
The Birth of Least Squares

The story of the birth of the least squares method is well covered in the literature and can be summarized as follows. The priority in publication definitely belongs to A. M. Legendre (Nouvelles méthodes pour la détermination des orbites des comètes, 1805), who also gave the method its famous name. C. F. Gauss (Theoria Motus Corporum Coelestium in Sectionibus Conicis Solem Ambientium, 1809) claimed, however, that he knew and used this method much earlier, about 10 years before Legendre's publications. In a letter to Gauss about his new book, Legendre wrote that claims of priority should not be made without proof in previous publications. Gauss did not have such a publication, despite which he actively attacked Legendre. We see that his efforts were fruitful enough: In the vast majority of today's textbooks, the least squares method is attributed to Gauss without further comment. In fact, Gauss's arguments for his priority were not perfect at all. His diaries with computations claimed to be made by the least squares method were lost. His colleagues did not hurry to acknowledge that he had shown them those computations. Indeed, can one imagine that Gauss showed and explained the details of his unpublished computations to his potential competitors?
Only many years later did H. W. M. Olbers (1816) and F. W. Bessel (1832) mention that Gauss showed them something in that sense. But how accurately could they really remember the details of some discussion that happened many years ago? It is also known that H. C. Schumacher suggested repeating Gauss's lost computations that Gauss claimed to have done by the least squares method. Gauss totally rejected this idea, claiming that such attempts would only suggest that he could not be trusted. This, however, has been done by Stigler (1981), who could not reproduce Gauss's results. Later, Celmins (1998) also tried to repeat Gauss's computations, including the adjustments suggested by Stigler (1981), and arrived at the same conclusion that Gauss's results cannot be obtained by the least squares method. In other words, it is well known which method Legendre used, and it is not clear at all which method was used by Gauss. Assuming that it was Gauss who invented the least squares method, it is hard to believe that he did not realize the huge potential of this method and its importance for applications. Knowing Gauss as a prolific mathematician and looking at the present version of the least squares method, one can see a certain contradiction.
This conflict is absolutely obvious if we consider, for example, polar or elliptic coordinates. It is difficult to imagine that C. F. Gauss would have been happy with such a visual interpretation. The key idea here is that the visualization and, more important, the resulting approximation are dependent on the choice of the coordinate system. The vertical distance is not independent of the coordinate system in which vertical is defined. But some definitions of distance are invariant to the coordinate system.
Shape Recognition and Curve Detection
Nowadays, the distance between two points in k-dimensional space is widely used as an optimal fitting criterion in the field of image processing for industrial and scientific applications, especially in problems of shape recognition and curve detection. An ellipse (a circle) is an ellipse (a circle) in any coordinate system. A parabola is a parabola in any coordinate system, too. Those objects are not defined by equations but by their general properties, which include the notion of distance. Indeed, everybody knows from school that a circle is a set of points in a plane that are at the same distance from a given point; an ellipse is a set of points for which the sum of distances from two given points is constant; a parabola is a set of points which are at the same distance from a given point and from a given straight line; and so forth. We sometimes forget that those geometric objects are not related to any particular coordinate system, although some coordinate systems are more suitable for describing those objects by equations. Indeed, one can write simpler equations in a suitable coordinate system. Figure 3 illustrates the fitting of an ellipse to a set of points in two-dimensional space. What must one do to fit a set of points by a circle, or an ellipse, or another geometric shape? One has to draw a sample curve, measure the distance from each point of a set under consideration to the curve, and consider the sum of these distances as a criterion that has to be minimized. Distance in shape recognition and pattern detection is usually a function of squared (or absolute) deviations from the point to the nearest point on the object. For doing this algorithmically using a computer, it is necessary to set the whole picture into some coordinate system. The most common coordinate system we use is the Cartesian rectangular system.
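To make this concrete, here is a minimal sketch of fitting a circle to a cloud of 2D points by minimizing the sum of squared distances from the points to the circle. The synthetic data and the use of SciPy's least_squares routine are assumptions for the example, not part of the article.

```python
import numpy as np
from scipy.optimize import least_squares  # assumed available for the minimization

rng = np.random.default_rng(4)
theta = rng.uniform(0, 2 * np.pi, 60)
pts = np.column_stack([3 + 2 * np.cos(theta), -1 + 2 * np.sin(theta)])
pts += rng.normal(0, 0.05, pts.shape)          # noisy points near a circle

def residuals(params):
    cx, cy, r = params
    # distance from each point to the circle with center (cx, cy) and radius r
    return np.hypot(pts[:, 0] - cx, pts[:, 1] - cy) - r

# Start from the centroid and a rough radius, then minimize the sum of squares.
guess = [pts[:, 0].mean(), pts[:, 1].mean(), 1.0]
fit = least_squares(residuals, guess)
print("center = (%.3f, %.3f), radius = %.3f" % tuple(fit.x))
```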
Figure 1. Least squares method—a classical illustration
Figure 2. Least squares method—nonrectangular coordinates
Figure 3. Fitting a set of points with an ellipse. The "best" ellipse should be the one for which the sum of distances from the points to the ellipse is minimal.

The Least Squares Method, Revisited

A. M. Legendre published an idea on how to circumvent the computational problems that arise in the case of trying to minimize the sum of orthogonal distances from data points to a straight line. The idea was to replace the computational problem with a problem in calculus. Put the whole set of objects (data points and a line) in Cartesian coordinates. Instead of the shortest distances from points to a line, consider the distances from points to the line in a direction that is parallel to the vertical axis (the vertical offsets). This step would give a different criterion to be minimized,

E1(f) = |y1 - f(x1)| + |y2 - f(x2)| + … + |yn - f(xn)|,   (2)

and the minimization problem can be easily solved today using a computer and a suitable numerical routine for minimization. In the times of Gauss and Legendre, however, it would be natural to find an analytical solution to this problem using the differential calculus. The absolute value is not a good function for this, since its derivative is not continuous. It is possible that, thinking in this way, Legendre got a good idea. The absolute value is a positive-valued function, but it does not have a continuous derivative. Is there a positive function close (or somewhat similar) to absolute value whose derivative is continuous? Of course there is: the square. And this is how the classical least squares fitting could appear, which consists in minimizing the sum of the squares of the vertical deviations of a set of data points,

E2(f) = (y1 - f(x1))² + (y2 - f(x2))² + … + (yn - f(xn))²,   (3)

from a chosen function f. And the minimization problem could be solved by the standard techniques of the differential calculus.
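A short sketch of the two criteria just described, applied to hypothetical data: the sum of absolute vertical offsets in (2), minimized numerically, and the least squares criterion (3), minimized by the usual closed-form solution. The data and the use of SciPy's minimizer are assumptions for illustration.

```python
import numpy as np
from scipy.optimize import minimize  # assumed available for the numeric fit

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 25)
y = 1.0 + 2.0 * x + rng.normal(0, 1.0, x.size)   # hypothetical data

# Criterion (2): sum of absolute vertical offsets, handled numerically.
def abs_offsets(params):
    a, b = params
    return np.abs(y - (a + b * x)).sum()
a_l1, b_l1 = minimize(abs_offsets, x0=[0.0, 1.0], method="Nelder-Mead").x

# Criterion (3): sum of squared vertical offsets, solved in closed form.
b_ls = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a_ls = y.mean() - b_ls * x.mean()

print("L1 fit: a = %.3f, b = %.3f" % (a_l1, b_l1))
print("LS fit: a = %.3f, b = %.3f" % (a_ls, b_ls))
```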
The Method of Least Circles
To adjust a viewpoint, let us note that the criterion (3) can be painlessly replaced with

λ [(y1 - f(x1))² + (y2 - f(x2))² + … + (yn - f(xn))²],   λ > 0.   (4)

Indeed, multiplication by a positive number does not affect the point of minimum. Only the minimum value of the criterion function (E) will be multiplied by λ, which itself is not the subject of interest at this stage, since we look for the values of the fitted parameters. Taking λ = π, we obtain

π (y1 - f(x1))² + π (y2 - f(x2))² + … + π (yn - f(xn))².   (5)

Figure 4. "Least circles" viewpoint: (a) the case of classical least squares fitting from the viewpoint of circles; (b) the case of orthogonal distance fitting.

Geometrically, the formula (5) means the sum of the areas of the circles shown in Figure 4(a). The radii of the circles in Figure 4(a) are the vertical offsets of yi from the fitting line. Figure 4(a) is just a reformulation of the standard geometric "illustration" of the least squares method (recall Figure 1). Each of those circles has two points of intersection with the line. It is clear that one cannot consider this picture elegant. Changing the radii slightly, one can preserve n pairs of intersections of the circles and the line. That is, the circles can be a little bigger or a little smaller and each one will still intersect the line in two places. Instead, suppose we adopt the perspective of the shortest distance to the line in two-dimensional space. The resulting circles are shown in Figure 4(b). In this case, the fitting line is a tangent line to all circles. The radii of the circles in Figure 4(b) are equal to the distances between the points (xi, yi) and the fitting line, and this guarantees a unique picture. The criterion to minimize in this case is

π d((x1, y1), f)² + π d((x2, y2), f)² + … + π d((xn, yn), f)²,   (6)
which is, up to a constant multiplier, the formula known under the name of orthogonal regression. It is also known as total least squares or as the errors-in-variables method. Here d((xi, yi), f) denotes the distance between the point (xi, yi) and the fitting line f. There are several obvious advantages to using least circles (squared orthogonal distance) fitting.

1. The shortest (orthogonal) distance is the most natural viewpoint on any fitting.

2. The sum of orthogonal distances is invariant with respect to the choice of the system of coordinates (see Figure 5).

3. There are no conjugate regression lines, which appear after swapping x and y, because in the case of orthogonal regression the fitting y = f(x) gives exactly the same line as the fitting x = f⁻¹(y). (So, the student from the story at the beginning of this article could be absolutely right if he used the orthogonal "least circles" method to produce the first coefficients a and b instead of the classical least squares method!)

4. There are no problems with causality. (Normally, determination of what is an independent variable and what is a dependent variable is simply unclear or even impossible; this is always postulated.)
5. Implementation of the orthogonal fitting does not depend on the number of spatial dimensions.

The fourth point above could be a sticking point for some. Often the goal is to predict an outcome. In that case, one dimension (y) is of primary interest, and one often considers distance in that dimension of primary importance. Still, orthogonal regression can be less sensitive to outlying observations and useful as a form of robust regression.
Figure 5. Least circles method—nonrectangular coordinates
Orthogonal Distance Linear Regression

In general, the use of orthogonal distance fitting requires the use of numerical routines for minimization of the criteria. Fortunately for the student in the econometrics course, in the case of orthogonal distance fitting it is possible to obtain simple formulas for evaluating the parameters of a straight line that fits a given set of points in a plane (the orthogonal linear regression problem). Indeed, the orthogonal distance between a point Pi(xi, yi) and a straight line y = a + bx is illustrated in Figure 6 and is given by

di = |yi - a - bxi| / √(1 + b²).   (7)

Figure 6. Orthogonal distance from a point to a straight line y = a + bx.

Following Legendre, instead of minimizing the sum of orthogonal distances, minimize the sum of their squares:

d1² + d2² + … + dn² = [(y1 - a - bx1)² + … + (yn - a - bxn)²] / (1 + b²).   (8)

As usual, take the partial derivatives with respect to the parameters a and b and set them equal to zero.
A system of two equations for determining the values of a and b is obtained. Details can be found on the CHANCE website at www.amstat.org/publications/chance under the supplemental material. Obviously, there are two possible fitting lines, y = a1 + b1x and y = a2 + b2x, which both run through the centroid (x̄, ȳ) and are mutually orthogonal, since b1b2 = -1. The proper fitting line (the proper pair of the values of a and b) can be determined by the smaller value of the criterion (8). This has a very simple geometric interpretation. Indeed, the set of points that we try to fit with a straight line is a "cloud" of points in 2D. One of the possible fitting lines coincides with the direction of the main "axis" of that cloud, and the second line corresponds to the direction of the width of that cloud. It is worth mentioning that this is an elementary illustration of the relationship between orthogonal distance fitting, on one side, and principal components analysis (PCA), on the other. There are a few mathematical tools which can be used
for orthogonal distance fitting. For solving linear problems in n-dimensional space, the PCA method is appropriate. More generally, for solving linear and nonlinear problems, the singular value decomposition (SVD) and QR decomposition methods are suitable. SVD is widely used in statistics, where it is related to the PCA method. PCA is now mostly used as a tool in exploratory data analysis and for making predictive models, but its applicability is limited by several assumptions (linearity, statistical importance of the mean and covariance, etc.).
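Here is a minimal sketch of orthogonal ("least circles") linear regression using the PCA/SVD connection just mentioned: center the data, take the first principal direction from the SVD as the direction of the fitting line, and note that swapping x and y recovers the same line, as in the student's story. The data are hypothetical and the helper function is an illustration, not a library routine.

```python
import numpy as np

def orthogonal_line(x, y):
    """Fit y = a + b*x by minimizing squared orthogonal distances.
    The line passes through the centroid, with slope taken from the
    first principal direction of the centered data (via SVD)."""
    pts = np.column_stack([x, y])
    centered = pts - pts.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    direction = vt[0]                      # first principal direction
    b = direction[1] / direction[0]        # slope of the fitting line
    a = y.mean() - b * x.mean()            # the line passes through the centroid
    return a, b

rng = np.random.default_rng(2)
x = np.linspace(0, 10, 30)
y = 1.0 + 2.0 * x + rng.normal(0, 1.5, x.size)

a, b = orthogonal_line(x, y)
a_swap, b_swap = orthogonal_line(y, x)     # regress x on y instead
print("y = %.3f + %.3f x" % (a, b))
print("after swapping: x = %.3f + %.3f y, i.e., y = %.3f + %.3f x"
      % (a_swap, b_swap, -a_swap / b_swap, 1.0 / b_swap))
```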
Least Circles, Least Spheres, Least Hyperspheres!

Now let us consider the 3D case. Suppose we have a set of data points, which look to be close to a straight or curved line in 3D, and we want to obtain the equation of the optimal fitting line. First, moving from 2D to 3D makes the idea of "least squares" absolutely useless. The only natural criterion is the minimum sum of distances from the data points to the fitting line, and this criterion, in Legendre's manner, can be replaced with the sum of the volumes of spheres with radii equal to the distances from the data points to the fitting line,

(4π/3) [d1³ + d2³ + … + dN³],   di = d((xi, yi, zi), F),   (9)

where F(x, y, z, a1, a2, …, an) denotes a line in 3D described by implicit or explicit equations containing n parameters a1, a2, …, an. The idea is illustrated in Figure 7, where F is a straight line. Obviously, this approach can be used in n-dimensional space, too, where we have to minimize the sum of hypervolumes of the hyperspheres with radii that are equal to the distances from the data points to the fitting line. Besides fitting data by lines, we can consider fitting data by geometric shapes in n-dimensional space (recall the example of fitting data points by an ellipse in the previous section, or imagine fitting a set of 3D data by the surface of an ellipsoid), which is part of image processing theory. Such an approach can have many unexpected applications: for example, the description of national economies in state space, where 3D data describing the behavior of national economies have been fitted by planes. In other words, there is a uniform approach to fitting lines (either straight or curved) and shapes in n-dimensional space by minimizing the volumes of n-dimensional spheres with radii equal to the orthogonal distances from the data points to the fitting line or shape. Maybe Gauss's method, which has not been successfully reproduced until today, was close to such a viewpoint?

Figure 7. Least spheres fitting of data by a line in 3D.
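In the same spirit, here is a sketch of fitting a straight line to 3D points: the fitted line passes through the centroid, with its direction given by the first principal direction of the centered data. Note that this sketch minimizes the sum of squared orthogonal distances (the PCA solution); the spheres-volume criterion in (9) weights large deviations more heavily and would generally call for a numerical minimization. The data are hypothetical.

```python
import numpy as np

def fit_line_3d(points):
    """Orthogonal-distance fit of a straight line to 3D points.
    Returns a point on the line (the centroid), a unit direction,
    and the orthogonal distance from each point to the fitted line."""
    pts = np.asarray(points, dtype=float)
    centroid = pts.mean(axis=0)
    _, _, vt = np.linalg.svd(pts - centroid, full_matrices=False)
    direction = vt[0]                          # first principal direction
    diff = pts - centroid
    # remove the component along the line; what is left is the perpendicular part
    dist = np.linalg.norm(diff - np.outer(diff @ direction, direction), axis=1)
    return centroid, direction, dist

rng = np.random.default_rng(3)
t = np.linspace(-2, 2, 40)
pts = np.column_stack([1 + 2 * t, -1 + t, 0.5 - 3 * t]) + rng.normal(0, 0.1, (40, 3))
centroid, direction, dist = fit_line_3d(pts)
print(centroid.round(2), direction.round(3), dist.max().round(3))
```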
Further Reading
Ahn, S. J. 2005. Least squares orthogonal distance fitting of curves and surfaces in space. Berlin, Germany: Springer.
Celmins, A. 1998. The method of Gauss in 1799. Statistical Science 13(2):123–135.
Nievergelt, Y. 1994. Total least squares: State-of-the-art regression in numerical analysis. SIAM Review 36(2):258–264.
Pernkopf, F., and O'Leary, P. 2003. Image acquisition techniques for automatic visual inspection of metallic surfaces. NDT & E International 36(8):609–617.
Petras, I., and Podlubny, I. 2007. State space description of national economies: The V4 countries. Computational Statistics & Data Analysis 52(2):1223–1233.
Sardelis, D., and Valahas, T. 2004. Least squares fitting—perpendicular offsets. http://library.wolfram.com/infocenter/MathSource/5292.
Sprott, D. A. 1978. Gauss's contributions to statistics. Historia Mathematica 5:183–203.
Stigler, S. M. 1981. Gauss and the invention of least squares. Annals of Statistics 9(3):465–474.
Van Huffel, S., and Vandewalle, J. 1991. The total least squares problem: Computational aspects and analysis. Philadelphia: SIAM.
Can People Distinguish Pâté from Dog Food?
John Bohannon, Robin Goldstein, and Alexis Herschkowitsch
The diet of domestic dogs in most of the world consists of scraps, the by-products of human food preparation and consumption. Indeed, as noted by J. W. S. Bradshaw in the 1991 Proceedings of the Nutrition Society article "Sensory and Experiential Factors in the Design of Foods for Domestic Dogs and Cats," the close overlap between the diets of Canis familiaris and Homo sapiens may have been crucial for the dog's evolution as a human companion species. According to K. E. Michel's 2006 article "Unconventional Diets for Dogs and Cats," which appeared in the Journal of Small Animal Practice, commercialized dog food is a recent phenomenon, becoming popular only in relatively wealthy industrialized nations since the mid-20th century. Nonetheless, it has grown rapidly into a $45 billion industry. Intense competition for market share has kept the price of dog food low compared to edible goods for humans, even those such as liverwurst and Spam that are derived from similar meat industry by-products. In spite of its attractive price, commercial dog food is left uneaten by humans. One valid concern is the risk of food poisoning. As reported by D. Barboza in his New York Times article "China Makes Arrest in Pet Food Case," several commercial brands of pet food were discovered in 2007 to be contaminated with melamine. The presence of melamine, an industrial fire retardant that can cause renal failure, caused widespread concern. Partly as a result of this scandal, "organic" pet foods have gained significant market share. For example, Newman's Own Organics Premium Pet Food is made exclusively from "human grade" agricultural products. Even if dog food is safe for humans to eat, however, it must overcome considerable prejudice. Part of the barrier is the perception that dog food is
unpalatable. According to Bradshaw, the pet food industry has invested decades of research and development to make its products more appealing to the humans who purchase and handle the food. Pickering reports on the use of human volunteers to compare the sensory qualities of pet food formulae. The aim has been to reduce feelings of disgust while owners serve the food to their pets, rather than to make it more palatable for human consumption. Schaffer's book describes how the diet and lifestyle of dogs in the industrialized world have converged
with those of humans. Could dog food be approaching acceptance as a comestible good fit for humans? Assessing the intrinsic palatability of dog food is a first step in answering this question. Controlling for bias is a challenge. Expectation has a large effect on the hedonic tone of food.
Table 1—Raw data: ranking of samples

Subject    A    B    C    D    E    Which is dog food?
1          1    4    3    2    5    E
2          1    4    5    3    2    D
3          5    2    1    4    3    E
4          1    2    5    3    4    B
5          2    4    5    3    1    B
6          1    5    4    2    3    E
7          2    4    5    1    3    E
8          1    2    5    4    3    E
9          1    4    5    3    2    D
10         2    1    5    3    4    A
11         3    2    5    1    4    C
12         4    2    5    1    3    E
13         1    4    5    4    4    E
14         1    5    4    3    2    C
15         1    4    5    3    2    C
16         3    2    1    5    4    E
17         3    4    5    2    1    B
18         1    3    5    2    4    B
Sums:      34   58   78   49   54
Table 2—Distribution of rankings

Ranking (n, %)   A: Duck liver mousse   B: Spam    C: Dog food   D: Pork liver pâté   E: Liverwurst
1st              10 (56%)               1 (6%)     2 (11%)       3 (17%)              2 (11%)
2nd              3 (17)                 6 (33)     0 (0)         4 (22)               4 (22)
3rd              3 (17)                 1 (6)      1 (6)         7 (39)               5 (28)
4th              1 (6)                  8 (44)     2 (11)        3 (17)               6 (33)
5th              1 (6)                  2 (11)     13 (72)       1 (6)                1 (6)
Rank sums        34                     58         78            49                   54

Percent totals by row might not add to 100% due to rounding. The rank sum is the sum of the ranks given by all 18 subjects to each product.
As reviewed by R. Deliza and H. J. H. MacFie in their 2007 Journal of Sensory Studies article "The Generation of Sensory Expectation by External Cues and Its Effect on Sensory Perception and Hedonic Ratings: A Review," there are many levels at which expectation can have its effects, and many mechanisms have been proposed. In Try It, You'll Like It, L. Lee, S. Frederick, and D. Ariely emphasize that the effects can be subtle and depend on when information is gained relative
to consumption. In Do More Expensive Wines Taste Better? Evidence From a Large Sample of Blind Tastings, R. Goldstein, J. Almenberg, A. Almenberg, J. W. Emerson, A. Herschkowitsch, and J. Katz showed that, with respect to tasting expensive
wines, measuring the hedonic tone free of bias requires a double-blind trial. A double-blind trial is a comparative test of two or more inputs (treatments for disease in a clinical trial; edible materials in taste tests) in which neither the subject nor the person administering the test knows which input is being given to a subject. Neither the subject nor the clinician should know or have any preferential attitudes about the input being received. The researcher, of course, controls and keeps track of everything using confidential codes. We predicted that in a double-blind taste test, subjects would be unable to identify dog food among five samples of meat products with similar appearance and texture, thus allowing them to assess palatability independent of prejudice. We hypothesized that if the dog food were ranked favorably relative to human comestible goods with similar ingredients, it should be considered fit in terms of taste for human consumption.
Materials, Methods, and a Free-for-All

The dog food tested was Newman's Own Organics Canned Turkey & Chicken Formula for Puppies/Active Dogs. The four meat products used for comparison were duck liver mousse ("Mousse de Canard," Trois Petits Cochons, New York); pork liver pâté ("Pâté de Campagne," Trois Petits Cochons, New York); supermarket liverwurst (D'Agostino); and Spam (Hormel Foods Corp., Austin, Minnesota). Each product was pulsed in a food processor to the consistency of mousse. Samples were allocated to serving bowls, labeled A through E, garnished with parsley to enhance presentation, and chilled in a refrigerator to 4°C. To allow one researcher (Bohannon) to perform a double-blind trial, the preparation was carried out by the co-authors (Goldstein and Herschkowitsch). To ensure safety, the researchers did taste the dog food before the preparations were made. As previously reported by Bohannon, the experiment was carried out between 7 p.m. and 10 p.m. on December 31, 2008, in Brooklyn, New York. After fully disclosing the aim of the experiment—to evaluate the taste of dog food—18 subjects volunteered. Subjects were college-educated male and female adults between the ages of 20 and 40.
The five sample dishes, A through E, were presented to subjects with a bowl of crackers ("Table Water Crackers," Carr's of Carlisle, UK). The identity of the samples, unknown to the researcher, was as follows: A, duck liver mousse; B, Spam; C, dog food; D, pork liver pâté; and E, liverwurst. Subjects were asked to rank the "tastiness" of the samples relative to each other on a scale of 1 (best) to 5 (worst). No ties were allowed. Due to the small number of subjects, we used the following design. Subjects sat down around a table in groups of three to five. We laid out the samples in the middle of the table in order (A–E) along with a big plate of crackers. We instructed them to "try all the pâtés, comparing their tastes until you are satisfied that you can rank your preferences based on flavor." The subjects had a data sheet in front of them with a table (sample A–E and a column for ranking 1–5). We observed them trying samples multiple times in a variety of orders. It was a free-for-all. They took their time and made careful choices. The ordering of the samples on the table could, of course, have had some effect, but it should have been reduced by the arrangement of subjects and the fact that they made multiple comparisons of the samples at will. They were not permitted to discuss the tastes or interact with each other until after their evaluations were submitted in writing. They were asked not to make faces or display signs of pleasure or displeasure, although there was no way to enforce perfect compliance with facial expressions, as some responses were more or less involuntary. After the rankings were recorded on data sheets, the subjects stated which of the five samples they believed to be dog food.
Rating the Best and Wurst

The dog food (sample C) was ranked lowest of the five samples by 72% (13) of subjects. The duck liver mousse (sample A) was rated as the best by 56% (10) of subjects. Between these extremes, the majority of subjects ranked Spam, pork liver pâté, and liverwurst in the range of second to fourth place (see Tables 1 and 2). The rank sum is the sum of the ranks given by all 18 subjects to each product. Duck liver mousse had the lowest rank sum (34), and dog food had the highest (78). The others are in the middle and quite similar. A graphical presentation of the data is given in Figure 1.
Table 3—Differences in rank sums with significant results indicated

            D (49)   E (54)   B (58)   C (78)
A (34)      15       20       24       44 (p<0.01)
D (49)      –        5        9        29 (p<0.10)
E (54)      –        –        4        24
B (58)      –        –        –        20
Figure 1. Distribution of rankings of five food items
The rankings were analyzed using the multiple comparison procedure described by Christensen and colleagues. The method compares the observed differences in rank sums to the differences in rank sums that would be observed if rankings were purely random (see Table 3). The tables in that article apply to situations with 20 panelists and six items being ranked. The tables go down to only 20 panelists, but the critical values change slowly. For 18 panelists, by extrapolation we use 37 for p = 0.01, 31 for p = 0.05, and 28 for p = 0.10. This
means that only A (duck liver mousse) is significantly different from C (dog food, difference 44, p-value less than 0.01), and D (pork liver pâté) is marginally significantly different from C (dog food, difference 29, p-value less than 0.10). We can conclude, even with this small number of volunteers, that for some foods people have a clear preference over dog food. The three items with intermediate ranks certainly do not appear to be significantly different in their evaluations. Figure 2 shows these comparisons graphically.

Figure 2. Graph of absolute differences in sum of ranks versus sum of ranks. Sums of ranks for 18 raters of five items range from 18 to 90. Absolute differences in ranks range from 0 to 72. Absolute differences above 37 are significant at the 0.01 level; absolute differences above 28 are significant at the 0.10 level. Note that there are two points for each comparison (e.g., A–C and C–A).

Could the subjects identify the dog food? Only three of 18 subjects correctly identified sample C as the dog food. If subjects were randomly guessing, we would expect, on average, 3.6 (= 18 × 0.2) to guess correctly. The probability of getting 0, 1, 2, or 3 correct guesses when the chance of guessing correctly is 0.20 is 0.02, 0.08, 0.17, and 0.23, respectively. Thus, a (one-sided) p-value for assessing the hypothesis of random guessing is about 0.50. Clearly, in the context of this experiment, the volunteers could not identify the dog food.
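A short sketch of the two calculations just described, using the rankings in Table 1: the rank sums and their pairwise differences compared against the extrapolated critical values (37, 31, and 28), and the binomial computation behind the guessing p-value. SciPy is assumed for the binomial distribution.

```python
import numpy as np
from itertools import combinations
from scipy.stats import binom

# Rankings from Table 1 (rows = 18 subjects, columns = samples A-E).
ranks = np.array([
    [1, 4, 3, 2, 5], [1, 4, 5, 3, 2], [5, 2, 1, 4, 3], [1, 2, 5, 3, 4],
    [2, 4, 5, 3, 1], [1, 5, 4, 2, 3], [2, 4, 5, 1, 3], [1, 2, 5, 4, 3],
    [1, 4, 5, 3, 2], [2, 1, 5, 3, 4], [3, 2, 5, 1, 4], [4, 2, 5, 1, 3],
    [1, 4, 5, 4, 4], [1, 5, 4, 3, 2], [1, 4, 5, 3, 2], [3, 2, 1, 5, 4],
    [3, 4, 5, 2, 1], [1, 3, 5, 2, 4],
])
labels = ["A", "B", "C", "D", "E"]
rank_sums = dict(zip(labels, ranks.sum(axis=0)))   # A:34, B:58, C:78, D:49, E:54
print(rank_sums)

# Pairwise differences against the extrapolated critical values.
for i, j in combinations(labels, 2):
    diff = abs(int(rank_sums[i]) - int(rank_sums[j]))
    flag = ("p<0.01" if diff > 37 else
            "p<0.05" if diff > 31 else
            "p<0.10" if diff > 28 else "n.s.")
    print(f"{i}-{j}: {diff:2d}  {flag}")

# Guessing analysis: 3 of 18 correct when pure guessing succeeds with probability 0.2.
p_value = binom.cdf(3, n=18, p=0.2)   # P(X <= 3), roughly 0.50
print(round(p_value, 2))
```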
Conclusions and Paradoxes

Subjects significantly disliked the taste of dog food compared to a range of comestible meat products with similar ingredients. Subjects were not better than random at identifying dog food among five unlabeled samples. These two results would seem to be paradoxical. Why did the 72% of subjects who ranked sample C as worst in terms of taste not guess that sample C was dog food? One possibility is that slight differences in appearance and texture skewed the guesses. A full 44% (n = 8) of subjects incorrectly chose liverwurst (sample E) as the dog food. As the
texture of samples had been equalized with a food processor, it is possible that subjects were attempting to discern which sample was dog food based on taste, not texture. The explanation we find more compelling, however, is that subjects were primed to expect dog food to taste better than it does. As we assured subjects that the experience would not be disgusting, they might have excluded the worst-tasting sample from their guesses. Regardless of the cause of the distribution of guesses, we are confident that the comparison of taste was free of prejudice that would have been present had the volunteers known before tasting the identity of the products. Even with the benefits of added salt, a smooth texture, and attractive presentation, canned dog food is unpalatable compared to a range of similar blended meat products. We also conclude that to make distinctions between products that taste very similar, one needs a much larger party with far more volunteers.
Further Reading

Barboza, D. China makes arrest in pet food case. New York Times, May 4, 2007.
Bohannon, J. 2009. Gourmet food, served by dogs. Science 323(5917):1006. DOI: 10.1126/science.323.5917.1006b.
Bradshaw, J. W. S. 1991. Sensory and experiential factors in the design of foods for domestic dogs and cats. Proceedings of the Nutrition Society 50:99–106.
Bradshaw, J. W. S. 2006. The evolutionary basis for the feeding behavior of domestic dogs (Canis familiaris) and cats (Felis catus). Journal of Nutrition 136:1927–1931.
Christensen, Z. T., Ogden, L. V., Dunn, M. L., and Eggett, D. L. 2006. Multiple comparison procedures for analysis of ranked data. Journal of Food Science 71(2):S132–S143.
Deliza, R., and MacFie, H. J. H. 2007. The generation of sensory expectation by external cues and its effect on sensory perception and hedonic ratings: A review. Journal of Sensory Studies 11(2):103–128.
Goldstein, R., Almenberg, J., Almenberg, A., Emerson, J. W., Herschkowitsch, A., and Katz, J. 2008. Do more expensive wines taste better? Evidence from a large sample of blind tastings. Journal of Wine Economics 3(1):1–9.
Lee, L., Frederick, S., and Ariely, D. 2006. Try it, you'll like it: The influence of expectation, consumption, and revelation on preferences for beer. Psychological Science 17(12):1054–1058.
Michel, K. E. 2006. Unconventional diets for dogs and cats. Small Animal Practice 36(6):1269–1281.
Pennisi, L. 2002. Canine evolution: A shaggy dog story. Science 298(5598):1540–1542.
Pickering, G. J. 2008. Optimizing the sensory characteristics and acceptance of canned cat food: Use of a human taste panel. Journal of Animal Physiology and Animal Nutrition 93(1):52–60.
Schaffer, M. 2009. One nation under dog: Adventures in the new world of Prozac-popping puppies, dog-park politics, and organic pet food. New York: Holt and Co.
Visual Revelations Howard Wainer, Column Editor
Commentary on the Graphic Displays in the 2008 National Healthcare Quality Report and State Snapshots
This report represents the seventh annual version of what is a substantial effort to characterize some of the major questions about contemporary health care in the United States, to gather data that shed light on those questions, and to organize those data in a coherent and understandable form. This work was accomplished with considerable wisdom, and many important decisions about inclusion, organization, and representation were made carefully and well. However, in any task as complex as this, improvements are always possible. A fundamental tenet of quality control is that the improvement of any complex process is often best accomplished by instituting an ongoing process of improvements. I strongly suspect that the seventh version of this report will not be the last, and so it seems worthwhile to take some time to suggest some improvements for future editions. These suggestions may have more general value, so let me offer these examples to a wider audience. The balance of this essay has the following structure: Section II provides a description of the general characteristics of effective display as well as some specific rules for doing so. Section III has examples drawn from the report with some sample revisions that illustrate how the rules from Section II can be applied. Section IV provides some conclusions.
II. General Characteristics of Effective Display

The preparation of a chart book requires compromises. It is almost always best to choose a small number of conventional graphic formats and reuse them throughout, rather than to invent unique display formats as the character of the data changes. Even though the latter approach may, in some sense, convey the special character of a particular data set, the cost to the reader of becoming familiar with a new format often outweighs any potential benefit. The 2008 National Healthcare Quality Report (NHQR) uses just three graphic formats—the line chart, the bar chart, and the
choropleth map—in combination with various kinds of tables, to convey its component data graphically. The decision to settle on three formats seems wise to me, although the precise character of these representations can be improved. State Snapshots uses a table to archive data in combination with a graphic based on the metaphor of a speedometer. The 2008 graphic choice can be improved upon.
Some principles to improve graphic displays

Goals must be clear and prioritized. A graphic display can have five purposes:

1. Exploration: The data contain a message, and you would like to learn what it is.
2. Communication: You have learned something and want to communicate it to others.
3. Calculation: Instead of doing some sort of arithmetic, a graphic display can do it for you automatically, as a sort of visual algorithm.
4. Archiving: Data are stored for retrieval by others.
5. Decoration: Graphs are pretty and can draw the reader's attention to a particular phenomenon as well as break up the wearisome vista of unending text.

Trying to accomplish too much usually impedes the efficacy of a display's primary purpose. For example, in this report, the primary purpose is almost surely communication, but this is reduced when too many numbers are overwritten on the graph in an effort to archive the data at the same time.

Scales should be chosen to match the purpose. The scales should be chosen to provide maximum acuity. In this way the viewer can often obtain quite an accurate sense of the component data without having to append the visual noise of numerical values. Too large a scale can hide real variations that go unseen. Of course, too small a scale can make random fluctuations seem real, but the viewer can always replot on the larger scale—an option that does not exist if the scale is too large. I view the "too-small scale" as a venial sin, whereas the "too-large" can be mortal.

(Illustrations courtesy of 2008 State Snapshots: Virginia. Derived from 2008 National Healthcare Quality Report, www.ahrq.gov/qual/nhqr08/nhqr08.pdf. November 2009. Rockville, MD: Agency for Healthcare Research and Quality. http://statesnapshots.ahrq.gov/snaps08/download/VA_2008_Snapshots.pdf)
Captions should be informative. There are two kinds of good graphs:

1. strongly good graphs that tell you all you want to know just by looking at them
2. weakly good graphs that tell you all you want to know just by looking at them, once you know what to look for

We can transform a weakly good graph into a strongly good one by having an informative/interpretive caption. Instead of a caption like "The rates of completion of tuberculosis treatment for different age groups," we might have "Although rates of completion of tuberculosis treatment have been increasing overall, children still have a 10% greater likelihood of completion than adults." By forcing the graph-maker to include a caption on a display that explicitly tells the principal point of the display, we gain two benefits: We can discover the point of the display more easily, and we eliminate pointless displays.

ALL is special and should be represented as such. Within any display it is often useful to represent some sort of summary measure—a mean, median, total. Such a measure is different in kind from the various pieces that compose it. It is also more stable statistically by virtue of its being a composite measure. Thus, its representation should be larger, darker, and visually separate from the components.

Avoid legends whenever possible. A legend requires the viewer to first learn the legend and then apply it to the display. In the graphic theorist Jacques Bertin's words, it requires two moments of perception and makes the viewer read the display rather than see it. A far better approach is to label the graphic elements directly.

We're almost never interested in "Alabama first"—the data elements should be ordered in a way that makes sense. Often, ordering by size makes a coherent visual impact as well as suggesting an implicit underlying structure.
III. Examples from the 2008 NHQR Example 1. A bar chart The first graph we encounter in the summary section of NHQR is a simple bar chart. Bar charts have a long history, but perceptual experiments have provided evidence that they can be improved. First, this one has a too-general label on the vertical axis. These data do not simply represent change; they represent improvement, for when the change is in the other direction the bar is directed downward. Thus the label ought to reflect this: Annual percent improvement. Second, each bar has a long label, which is more easily read if it is written horizontally. This is best accomplished by turning the plot on its side. Third, ALL needs to be visually emphasized and segregated. Fourth, the inclusion of the actual percentages is redundant and adds more clutter than precision. 48
VOL. 23, NO. 2, 2010
3.0 2.5 2.5 Annual percentage change in quality
data without having to append the visual noise of numerical values. Too large a scale can hide real variations that go unseen. Of course, too small a scale can make random fluctuations seem real, but the viewer can always replot on the larger scale—an option that does not exist if the scale is too large. I view the “too-small scale” as a venial sin, whereas the “too-large” can be mortal.
NHQR Figure H.1—Median annual rate of change overall and by measure category from baseline to most recent data year (vertical axis: Annual percentage change in quality)
Fifth, ordering the categories by the extent of their improvement immediately aids our understanding of the relative rates of improvement, which seems to follow the path of triage: The most critical is improving the fastest. And finally, the bar itself serves no purpose but to hold up the top line that conveys all the information. By replacing the bar with a large dot at its topmost point, we carry all the information but leave more room on the chart for additional material. To ease interpretation, one can connect the dot to its identifying label with a thin line. An alternative version that makes these changes is shown below.
Alternative to NHQR Figure H.1 (horizontal axis: Annual Percentage Improvement)
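To make the dot-chart redesign concrete, here is a minimal sketch in Python with matplotlib; it is not part of the original article, and both the category names and the improvement values are illustrative stand-ins rather than the published NHQR numbers:

    import matplotlib.pyplot as plt

    # Illustrative values only; the published figure reports the median annual
    # rate of change for each measure category and for ALL measures.
    components = {
        "Acute Treatment": 2.5,
        "Chronic Care Management": 1.8,
        "Prevention": 1.4,
        "Core Measures": 1.1,
    }
    all_measures = 1.3  # the ALL summary, drawn separately and more prominently

    # Order the component categories by size, largest improvement at the top.
    items = sorted(components.items(), key=lambda kv: kv[1])
    labels = [name for name, _ in items]
    values = [value for _, value in items]

    fig, ax = plt.subplots(figsize=(6, 3))
    ax.scatter(values, range(len(values)), color="black", zorder=3)
    ax.set_yticks(range(len(values)))
    ax.set_yticklabels(labels)                      # horizontal, easily read labels
    # ALL is special: a larger, darker dot, visually separated below the components.
    ax.scatter([all_measures], [-1.5], s=120, color="black", zorder=3)
    ax.text(all_measures + 0.05, -1.5, "ALL", va="center", fontweight="bold")
    ax.set_ylim(-2.5, len(values) - 0.5)
    ax.set_xlim(0, 3)
    ax.set_xlabel("Annual percentage improvement")  # the more informative axis label
    fig.tight_layout()
    plt.show()

The same few lines also enforce the other principles discussed above: ordering by size rather than alphabetically, and omitting the printed data values in favor of a readable scale.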
The connecting line can be partially replaced by letting the label serve double duty and thus provide still more space as shown in the slightly modified version below.
Another alternative to NHQR Figure H.1 (horizontal axis: Annual Percentage Improvement)
Note that in both redesigns ALL is recognized as different and displayed as such. It is more vivid and spaced apart to perceptually accentuate its differentness. The data are ordered rationally (almost surely not alphabetically). In this as in most other situations, ordering by size is effective. As predicted, orienting the display horizontally rather than vertically has allowed the labels on each data point to be horizontal and hence read more easily. The label on the x-axis is the more informative one. Accuracy in both words and numbers is important. Last, removing the numerical values from the plot removes their visual clutter without subtracting useful information. If the plot is scaled properly, the viewer can estimate the data value accurately enough. The report did not include the standard errors of the percentages, and so it was not obvious that the data were accurate to the nearest 10th of a percent. It is always wise not to portray data to any more precision than they are worth. I believe that in future designs the dot plot should be substituted for the bar chart, when suitable. In the next example a different alternative is recommended for some bar charts.
Example 2. Another bar chart
This display shares some of the flaws of the previous one and adds a new one: It represents a continuous variable, time, as different colored bars. Convention and good sense would suggest representing time on the x-axis, with each of the age groups as separate graphical elements. This has the added advantage of scalability, so that when new time periods are available the plot can accommodate them effortlessly.
NHQR Figure 2.5—Women under age 70 treated for breast cancer with breast-conserving surgery who received radiation therapy to the breast within 1 year of diagnosis, 1999 and 2005
In the alternative below, the format is changed to a line chart, with time on the x-axis. The vertical scale is expanded so that we can read off the specific values with enough accuracy to allow us to elide the distracting numerical barnacles. ALL is represented in a more dominant form, and the graphic elements that represent the various age groups are labeled directly.
The younger the breast cancer patient the less likely she is to receive radiation therapy
Data are the percentage of women under the age of 70 treated for breast cancer with breast-conserving surgery who received radiation therapy to the breast within one year of diagnosis
Alternative to NHQR Figure 2.5
This representation makes obvious what was easily missed before: Young women are very different. To emphasize this, the figure caption is modified to inform the viewer of the principal inferences that we are led to by these data. Last, we note that there is a decline in the use of radiation between 1999 and 2005 for women up to the age of 60, yet the total increases. This suggests Simpson's paradox deriving from the much larger Ns among women 60–64, which thus dominate the other groups. It would probably be wise to make some mention of this, for it is likely to be confusing to the general readership. It might be useful, if we wish to make inferences about the changes in the use of radiation among breast cancer patients over time, to compute a value for ALL ages that is standardized to a fixed age population. See chapter 10 of my book Graphic Discovery for instructions on how to do this and further examples.
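The arithmetic behind that caution is worth spelling out. The numbers below are entirely hypothetical, chosen only to show how every group's rate can fall while the combined rate rises when group sizes shift, and how standardizing to a fixed age distribution removes the artifact:

    # Hypothetical (N, rate) pairs by age group; not the NHQR data.
    groups_1999 = {"under 60": (800, 0.70), "60-64": (200, 0.78)}
    groups_2005 = {"under 60": (400, 0.68), "60-64": (800, 0.76)}

    def overall(groups):
        """Combined rate: an N-weighted average of the group rates."""
        total = sum(n for n, _ in groups.values())
        return sum(n * r for n, r in groups.values()) / total

    print(round(overall(groups_1999), 3))  # 0.716
    print(round(overall(groups_2005), 3))  # 0.733: higher, although both group rates fell

    # Age-standardized 2005 rate, using the 1999 group sizes as fixed weights.
    total_1999 = sum(n for n, _ in groups_1999.values())
    standardized = sum(n * groups_2005[g][1] for g, (n, _) in groups_1999.items()) / total_1999
    print(round(standardized, 3))          # 0.696: lower than 1999, matching the group-level trend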
Example 3. Bars for time, again
NHQR Figure 2.1—Adults age 50 and over who ever received colorectal cancer screening (colonoscopy, sigmoidoscopy, proctoscopy, or fecal occult blood test—FOBT), 2000, 2003, and 2005
I want to emphasize that bars on a time chart do not usually work as well as other formats. A line chart uses less graphical space than a bar and provides a more evocative metaphor. Again, by avoiding legends as much as possible, we can see the display rather than having to read it (to use Bertin's evocative description). It is always preferable, if possible, to label the data representation directly rather than through a legend. This is not always possible, but when it is, it should be done this way. The vertical axis is labeled a bit more fully than just the almost worthless "percent." And last, ALL is treated differently, as befitting its special character. Note that this format will scale easily to include many more years as well as additional age groups, whereas the bar chart format quickly becomes hopelessly jumbled.
An alternative to NHQR Figure 2.1 (vertical axis: Ever received colorectal screening, in percentages)
Example 4. A line chart that not even a mother could love
NHQR Figure 2.39—Patients with tuberculosis who completed a curative course of treatment within one year of initiation of treatment, by age group, 1998–2004
Not all line charts representing time are winners. NHQR Figure 2.39 is a complex mess, made so through the mistaken insistence that all data values be included and the various graphical elements be identified through a legend. A naïve viewing of this plot might suggest that the complexity of the visual experience is inescapably part of the complexity of the underlying phenomenon. But is this true? In the alternative sketched below, I have expanded the scale on the vertical axis so we can estimate the individual data values with enough accuracy for most purposes and have expanded the axis label to be more fully descriptive. I replaced the legend and labeled each line directly, and last, made the figure caption more descriptive, to inform the viewer of the key aspects of the data. Obviously, the details of the groups being plotted must still be included, but they can occupy a less dominant position on the chart, akin to a footnote.
Alternative to NHQR Figure 2.39—Although rates of completion of tuberculosis treatment have been increasing overall, children still have a 10% greater likelihood of completion than do adults. This chart is for patients with tuberculosis who completed a curative course of treatment within one year of initiation of treatment, by age group, 1998–2004.
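A minimal sketch of that format in Python with matplotlib follows; it is not from the report, and the completion percentages are invented placeholders. The point is the mechanics: an expanded vertical scale, a heavier line for the summary series, and labels written at the ends of the lines instead of a legend:

    import matplotlib.pyplot as plt

    years = list(range(1998, 2005))
    series = {                                   # hypothetical completion percentages
        "0-17":        [88, 89, 90, 91, 91, 92, 93],
        "18-44":       [80, 81, 82, 82, 83, 84, 85],
        "45-64":       [79, 80, 81, 82, 82, 83, 84],
        "65 and over": [76, 77, 78, 79, 80, 80, 81],
        "TOTAL":       [80, 81, 82, 83, 84, 85, 86],
    }

    fig, ax = plt.subplots(figsize=(6, 4))
    for name, values in series.items():
        width = 3 if name == "TOTAL" else 1      # the summary series is drawn heavier
        ax.plot(years, values, color="black", linewidth=width)
        # Label each line directly at its right-hand end; no legend to decode.
        ax.text(years[-1] + 0.1, values[-1], name, va="center")

    ax.set_ylim(75, 95)                          # expanded scale so values can be read off
    ax.set_xlim(1998, 2006)                      # extra room on the right for the labels
    ax.set_ylabel("Percent with completed treatment")
    fig.tight_layout()
    plt.show()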
Example 5. A line chart with a linear extrapolation
NHQR Figure 2.4—Colorectal cancer deaths per 100,000 population per year, United States, 1999–2005
NHQR Figure 2.4 is a simple, straightforward plot. There are only seven data points, so it would take heroic efforts to make them incomprehensible. The data are so simple that a line chart of them can absorb some of the flaws of the other line charts (e.g., a too-large scale, inclusion of data values) without serious ill effects. But the NHQR uses the data from this display to make a prediction. It concludes, "At the present rate of change from 1999 to 2005, this target (a colorectal cancer death rate of 13.7 per 100,000) will not be met by 2010." NHQR Figure 2.4 can be improved in a number of small ways that were illustrated in the previous examples. If the y-axis scale is expanded to cover only 12 to 22, we can read off the data entries without having to write them in. We can expand the time axis to 2012 to show how the linear extrapolation will intersect the goal for 2010 about a year late. We also can insert the fitted equation to emphasize that the rate of colorectal cancer mortality is shrinking by about 0.6 deaths per 100,000 per year.
Alternative to NHQR Figure 2.4 with linear fit (y = 1,235 – 0.6x, r2 = 0.96)
But extrapolation is always a dangerous business. Suppose that, instead of fitting the data with a linear function, we fit a quadratic function, which fits even better. We would find that we make the target date with room to spare. It is probably wise to explicitly state what the extrapolating assumptions are before announcing a prediction. Also, just saying that the target will not be met is less helpful than pointing out exactly when our extrapolation predicts that it would be reached.
Alternative to NHQR Figure 2.4 with quadratic fit (y = -208,710 + 209x – 0.05x2, r2 = 0.98)
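The two extrapolations are easy to reproduce in a few lines of Python. The rates below are invented stand-ins that fall by roughly 0.6 per year, so the exact answers are illustrative, but the comparison shows how the linear and quadratic fits can disagree about when the target is reached:

    import numpy as np

    years = np.arange(1999, 2006)
    rates = np.array([21.1, 20.7, 20.2, 19.6, 18.9, 18.1, 17.2])  # hypothetical, per 100,000
    target = 13.7                                                  # Healthy People 2010 target

    x = years - 2002.0                        # center the years to keep the fits well conditioned
    grid = np.arange(2005, 2020, 0.1)         # extrapolation range
    for degree in (1, 2):
        fit = np.poly1d(np.polyfit(x, rates, degree))
        reached = grid[fit(grid - 2002.0) <= target]
        when = f"about {reached[0]:.1f}" if reached.size else "not before 2020"
        print(f"degree-{degree} fit predicts the target is reached {when}")

With these made-up numbers the linear fit just misses 2010 while the quadratic fit reaches the target with room to spare, which is exactly the sensitivity to assumptions the text warns about.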
Example 6. A choropleth map
When data are gathered from various geographic areas, a common method for displaying them is the choropleth map, in which the geographic entities are shaded in a way that represents the data values that were generated by that region. NHQR Figure 2.2 is a typical choropleth map that has the virtues and flaws of all such displays. It provides a bit of visual distortion, since the size of the state, and hence its visual impact, often has little relationship to the underlying variable of interest (e.g., population size). This flaw is dramatically displayed on election nights, when big, sparsely populated states are all red, and small, densely populated ones are blue. A naïve look would strongly suggest a red victory, even when it was a blue landslide. But this is the price we must pay for the benefits of such a display (showing the location of the phenomenon being plotted).
NHQR Figure 2.2—State variation: Adults age 50 and over who ever received a colonoscopy or sigmoidoscopy, 2006
This plot, and others like it, could be improved by using a shading metaphor that is ordered. The current scheme is not helpful. There is no way to know (without the legend) that black is high, blue is low, and green is average. Instead, if we used a naturally ordered metaphor, the result would be easier to see. One way, in a monochrome display, is to use "darker = more." If you wanted to use color, it would be with increasing saturations. Note that this method easily scales to more than three categories; indeed, we could use continuous shading and not need to use such a coarse categorization as "below average," "above average," and "average." We could actually shade according to the amount and then provide a legend that pairs a scale (say 40 to 75) with the associated shading. We could even use such a method on much smaller geographic entities (e.g., counties or SMSAs) and, with suitable smoothing, be able to spot phenomena that are related to, say, urban vs. rural, rather than state by state. Linda Pickle's wonderful Atlas of United States Mortality provides an almost flawless model for all such work. Also, it is wasteful to use a single category for only one territory. A line of prose works far better. One alternative could be:
Above Average
Average
Below Average
There are no data for Puerto Rico
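One way to sketch the continuous "darker = more" encoding is to push every value through a single Normalize and colormap pair, so the legend is simply the scale itself. The snippet below is an illustration with made-up state values, not the report's method:

    import matplotlib.pyplot as plt
    from matplotlib import cm, colors

    values = {"DC": 71, "MD": 66, "VA": 62, "NC": 58, "WV": 55}   # hypothetical percentages

    norm = colors.Normalize(vmin=40, vmax=75)     # the legend pairs 40-75 with the shading
    shade = {state: cm.Greys(norm(v)) for state, v in values.items()}

    # In a real choropleth each state's polygon would be filled with shade[state];
    # here the encoding is previewed as a labeled strip plus a continuous colorbar.
    fig, ax = plt.subplots(figsize=(6, 1.8))
    for i, (state, v) in enumerate(sorted(values.items(), key=lambda kv: kv[1])):
        ax.bar(i, 1, color=shade[state])
        ax.text(i, -0.1, f"{state} {v}", ha="center", va="top")
    ax.set_axis_off()
    fig.colorbar(cm.ScalarMappable(norm=norm, cmap="Greys"), ax=ax, label="Percent screened")
    plt.show()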
Example 7. Dashboard displays It would be wasteful of space to reproduce a sample dashboard display here. Instead, the interested reader is referred to the following web site: http://statesnapshots.ahrq.gov/snaps08/dashboard.jsp?menuId=4&state=CA&level=0. Briefly, a dashboard display is a four-paneled, multicolored plot covering two pages. It seeks to show a state's performance in both the most recent year and for a baseline year relative to all other states. The first panel is a replica of a speedometer, in which the two years are arrows in one of five categories ranging from "very weak" through "average" to "very strong." The remaining three panels show the state's performance broken down by the types of care, the settings of the care, and by the clinical area. This is a clever metaphor, but it uses up too much space with nondata figurations for efficient communication. It also uses distracting colors that, at the extremes, actually hide the data points. Less is more. One has to remember which kind of arrow is "most recent" and which is "Baseline," and if memory fails, one must turn back to the first page. Also, the display's lack of compactness makes it harder to compare across subareas. I believe that keeping the fundamentals of display in mind can yield improvements. Even a well-designed table would be an improvement, although I can see that making it graphical would ease comprehension considerably. One possibility would be a dot chart in which "Overall" is highlighted and performance within each subclass is ordered (e.g., best first).
One such display is shown below:
Alternative to the dashboard display: California versus all states
Variations on this theme could be tried; for example, one might align all the labels outside the plot frame (see version below). It would take a little empirical investigation to discern which formulation is more successful.
Alternative to the dashboard display: California versus all states
The ordering of the rows of the display deserves further comment. This plot was ordered by performance within California. Yet this display is but one amid 50. Unless all states' data are ordered identically, varying the order from state to state would make comparisons among states unnecessarily difficult. A sensible compromise would order the graph components by the average state. Since it is likely that this common ordering will hold for most states, the viewer's attention will be drawn to any state with a nonmonotonic ordering, thus providing a way of highlighting states with large residuals from the common structure.
A number of questions occurred to me while looking at the original display, the answers to which might lead to changes in subsequent editions. Among them are: Does "most recent year" vary across measures? If not, it would surely be more informative to just enter the actual year instead. If it does vary, I assume the actual years are available somewhere public. Does "Baseline Year" vary? If not, enter that year. Why is "Baseline Year" sometimes missing? If there are no previous data, wouldn't "most recent" become baseline for lack of anything else?
IV. Discussion
This is an especially important time to work hard on accurately measuring the efficacy of health care in the United States, for we are on the brink of a major change in the way health care is delivered and paid for. It is crucial to know the extent to which these changes have accomplished their goal of increasing access to medical care and thence improving the length and quality of Americans' lives. The existing reports have done much of the hard work of deciding what topics would be included, collecting the data that illuminate those topics, and deciding how they would be organized. In this essay I have tried to provide some suggestions as to how the reporting might be improved. There are a few clear avenues for improvement. I recommend that consideration be given to replacing the bar chart format with dot charts. Aside from their cleaner look, there is substantial experimental evidence, such as that reported in Bill Cleveland's books, that dot charts perform better. I also believe that it would be wise to abandon the practice of redundantly including the value of all data points on each graph. If such information is deemed to be important to convey to the reader, place the values in associated tables in an appendix or, better, as Excel files on a web site.
Further Reading
Bertin, J. 1973. Semiologie graphique, 2nd ed. The Hague: Mouton-Gautier. (English translation by W. Berg and H. Wainer, published as Semiology of graphics. Madison, WI: University of Wisconsin Press, 1983.)
Cleveland, W. S. 1994a. The elements of graphing data. Summit, NJ: Hobart Press.
Cleveland, W. S. 1994b. Visualizing data. Summit, NJ: Hobart Press.
Pickle, L. W. 2008. Commentary on "Improving graphic displays by controlling creativity." CHANCE 21:53–53.
Pickle, L. W., Mungiole, M., Jones, G. K., and White, A. A. 1996. Atlas of United States mortality. Hyattsville, MD: National Center for Health Statistics.
Wainer, H. 1997. Visual revelations: Graphical tales of fate and deception from Napoleon Bonaparte to Ross Perot. New York: Copernicus Books (reprinted in 2000, Hillsdale, NJ: Lawrence Erlbaum Associates).
Wainer, H. 2005. Graphic discovery: A trout in the milk and other visual adventures. Princeton, NJ: Princeton University Press.
Wainer, H. 2008. Improving graphic displays by controlling creativity. CHANCE 21:46–52.
Wainer, H. 2009. Picturing the uncertain world: How to understand, communicate and control uncertainty through graphical display. Princeton, NJ: Princeton University Press.
One for the History Books: An Early Time-Line Bar Graph Ronald K. Smeltzer
The Histoire de L'Academie Royale des Sciences volume for 1767 contains a short report on a paper given by Philippe Buache on the structure of river basins in France. This report states that Buache collected data for the water level of the Seine River in Paris from 1732 to 1767 and that the data were available in the form of a figure. Based on this report of 1767, a six-page publication appeared in 1770 in the academy's Mémoires de Mathématique et de Physique for 1767. Included is a remarkable time-line bar graph, "Extrait des Profils qui représentent la Cruë et la Diminution des Eaux de la Seine," with
Buache's data for the period 1760–1766 plotted. Figure 1 shows the plot, printed intaglio on a folded plate in the Histoire and Mémoires volume (in the author's collection) published in 1770. Plotted month by month for a six-year period are the high and low water depths of the Seine in units of the old French pied, which is about 7 percent longer than the English foot. The hachure design, discussed below, creates easily distinguishable overlying bars for the two water levels. Overall the chart is almost free of extraneous matter, one exception being the buttressed pedestals that support the vertical axes. More
detail is visible in Figure 2, which shows the graph’s legend, dated incorrectly by one year, in the upper left panel of Figure 1. Clearly indicated are details such as the range of water level appropriate for navigation, the level above which flooding occurs, the flood level in 1740, and the extreme of the low water level for three particular years. A curious scale, whose utility is not obvious, is shown between the five- and twelve-pied axis markings: The scale is graduated to show the division of a pied, from the bottom upward, into four, five, six, and seven parts, with three sections of four divisions and two sections of five divisions.
Figure 1. A time-line bar graph, “Extrait des Profils qui représentent la Cruë et la Diminution des Eaux de la Seine,” by Philippe Buache, published in 1770, showing month-by-month high and low levels of the Seine, 1760-1766.
To illustrate the hachure, Figure 3 provides a magnified view of a small section of the graph. One sees that the bars representing the high water level are composed of straight lines and that the bars representing the low water level are composed of undulating lines. The key feature that clearly distinguishes the two overlying bars for each month is the graduated line spacing of the bars, such that the top end of each bar presents to the eye a darker image than the bottom; consequently, the top of the bar for low water levels is easily distinguished from the lines of the bar for high water levels. Examination of the lines that form the bars suggests that the lines were etched, whereas engraving seems to have been used for the other features; such mixed intaglio methods were very commonly used for illustrations in the 18th century. Although horizontal shading of bars was criticized as early as 1937 in Brinton's book Graphic Presentation, it must be remembered that not every modern option for shading was readily available in the 18th century. A solid black and shades of flat gray cannot be created by basic etching or engraving processes, which can only create lines in a printing plate. More elaborate intaglio methods, such as mezzotint, aquatint, and use of a roulette, that create a pattern of fine indentations to hold ink in a printing plate were used to create a gray tone effect for specialized work such as portraits and genre scenes, but not for routine illustration work. Background information about Philippe Buache (1700–1773), the author of the paper of 1770, is available in Gillispie's Dictionary of Scientific Biography and in Michaud's Biographie Universelle. Buache began his career as a descriptive geographer, holding the appointment as the chief royal geographer from 1729 and as the assistant geographer to the French academy from 1730. However, he abandoned descriptive geography to take up theoretical studies of the structure of the earth focused on river basins, oceans, and other bodies of water. He authored numerous books and prepared many maps for publication. Many of Buache's publications can be found in the online catalog of Harvard University Library. Remarkably, one finds listed there the web site of a map collector whose collection contains a different version of Buache's graph.
Figure 2. The graph's legend, which shows the range of water level appropriate for navigation, the level above which flooding occurs, the flood level in 1740, and the extreme of the low water level for three particular years.
Figure 3. A magnified view of a small section of the graph: the bars representing the high water level are composed of straight lines, whereas the bars representing the low water level are composed of undulating lines.
The graph that can be found via the Harvard library catalog shows Buache's data for the entire period, 1732 to 1767, noted in his report to the academy, although the data are in some way condensed from the month-by-month data shown in Figure 1. Buache was certainly not the first to plot measurements against time. In a correction to the 1937 paper by Funkhouser, Boyer in 1947 pointed out that a daily graph of barometric readings appeared in 1685 in the Philosophical Transactions of the Royal Society of London. This chart is on a large folding plate with daily barometric readings for the year 1684 plotted in a "broken line" graph. With the addition of vertical lines at each step of this graph, it would be in essence a bar graph, albeit drawn without a proper scale extending down to a zero baseline. Boyer also pointed out the existence of a number of other types of early graphs, including one, based on statistical data from 1669, that was not published until much later. William Playfair is well known for his graph of 1821 showing the price of wheat and the cost of labor. As Playfair lived in Paris for more than five years just before the French Revolution, one might wonder if Playfair ever saw Buache's graph. That Playfair was in and out of courtrooms accused of extortion, swindles, and embezzlement, and that he may have expropriated the intellectual property for the four patents he obtained during the 1780s suggests that he probably took ideas wherever he could find them. This writer's views of Playfair seem most akin to those expressed in the 1990 paper by Costigan-Eaves and Macdonald-Ross, who concluded that
he should be considered as "an imaginative developer and popularizer of the graphic method." That he was far in advance of the general acceptance of the graphical presentation of data is certainly true. An interested reader can readily find examples of other early graphics. For example, a bar chart over time was produced by another Frenchman, Jacques Barbeau-Dubourg (1709–1779), whose remarkable timeline appeared in 1753. The Visual Display of Quantitative Information by Edward Tufte and Graphic Discovery: A Trout in the Milk and Other Visual Adventures by Howard Wainer include many other examples.
Further Reading
Boyer, Carl. 1947. Note on an early graph of statistical data (Huygens 1669). Isis 37(3/4):148–149.
Buache, Philippe. 1770. Exposé De divers objets de la Géographie physique, concernant les Bassins terrestres des Fleuves & Rivières qui arrosent la France, dont on donne quelque détails, & en particulier celui la Seine. Histoire de l'Academie Royale des Sciences avec les Mémoires de Mathématique et de Physique, for 1767. Paris: Imprimerie Royale, pp. 504–509 with folding plate. The account of the oral report appears on pp. 110–112 in the Histoire section of the volume.
Buache, Philippe. 1770. Profils représents la cruë et la diminution des eaux de la Seine et des rivières qu'elle recoit dans le Paris-haut au dessus de Paris. Original in the David Rumsey Collection. http://nrs.harvard.edu/urn-3:hul.ebookbatch.RUMSE_batch:ocm55223545.
Friendly, Michael. 2010. Milestones in the history of thematic cartography, statistical graphics, and data visualization. http://datavis.ca/milestones; www.math.yorku.ca/SCS/Gallery/milestone/milestone.pdf.
Funkhouser, H. Gray. 1937. Historical development of the graphical representation of statistical data. Osiris 3(1):269–405.
Plot, Robert. 1685. A letter from Dr. Robert Plot of Oxford, to Dr. Martin Lister F. of the R. S. concerning the use which may be made of the following history of the weather, made by him at Oxford through out the year 1684. Philosophical Transactions of the Royal Society of London 15(169):930–943 with folding plate.
Smeltzer, Ronald K. 2004. Four centuries of graphic design for science. New York: The Grolier Club.
Tufte, E. R. 2001. The visual display of quantitative information, 2nd ed. Cheshire, CT: Graphics Press. www.edwardtufte.com/tufte.
Wainer, H. 2005. Graphic discovery: A trout in the milk and other visual adventures. Princeton, NJ: Princeton University Press.
Wainer, H. 1998. The graphical inventions of Dubourg and Ferguson: Two precursors to William Playfair. CHANCE 11(4):39–41.
Online Access is offered to all subscribers of CHANCE, including all K-12 members. Full access to the magazine including features, columns, supplements, and more is accessible through the ASA's Members Only. Log in to Members Only now at www.amstat.org/membersonly to take advantage of your free online access.
Comments on 5x5 Philatelic Latin Squares Peter D. Loly and George P. H. Styan
As noted in the Winter 2010 issue of CHANCE (v. 23, n. 1), postage stamps are occasionally issued in sheetlets of n different stamps printed in an n x n array containing n of each of the n stamps. Sometimes the n x n array forms what we call a philatelic Latin square (PLS): Each of the n stamps appears exactly once in each row and exactly once in each column. Much has been published on philatelic Latin squares in both the statistical and the mathematical literature. Examples are the Handbook of Combinatorial Designs edited by Charles Colbourn and Jeffrey Dinitz and the comprehensive book Latin Squares and their Applications by J. Dénes and A. D. Keedwell.
Four Types of 5x5 Philatelic Latin Squares
There are 5x5 philatelic Latin squares from six "countries" (stamp-issuing authorities): Canada (8), Pakistan (1), Swaziland (1), Transkei (5), Turkey (1), and the United States (1). These 17 philatelic Latin squares are all of the circulant type, either one- or two-step backward (C1B, C2B) or forward (C1F, C2F). A 5x5 Latin square is said to be in standard form when the first row is 1, 2, 3, 4, 5, and in reduced form when it is in standard form and in addition the first column is 1, 2, 3, 4, 5. From the well-known Statistical Tables by R. A. Fisher and Frank Yates, we find that there are 56 reduced-form 5x5 Latin squares (in contrast to just four reduced-form 4x4 Latin squares and just one reduced-form 3x3 Latin square), yielding 24 x 56 = 1344 possible distinct types of 5x5 philatelic Latin squares in standard form. Of these 1344 types we have found PLS of just these four types: C1B, C1F, C2B, C2F (see also Table 1).
One-Step Circulants: C1B, C1F
We have identified five PLS of the one-step backward circulant type C1B: one from Swaziland and four from Canada, and 10 of the one-step forward circulant type C1F: five from Transkei, one from Turkey, and four from Canada. The eight PLS from Canada are embedded in two panes each of size 10x10: A 5x5 PLS of type C1B was issued by Swaziland in 1982 for wildlife conservation, the first of a series of four sets of five stamps (Figure 1). This first set was one of the earliest sets of stamps issued for the World Wide Fund for Nature (WWF), an international, nongovernmental organization working on questions regarding the conservation, research, and
restoration of the environment. (It was formerly called the World Wildlife Fund, which remains its official name in the United States and Canada.) Depicted in Figure 1 is Pel's fishing-owl (Scotopelia peli), a large species of owl which feeds nocturnally on fish snatched from the surface of lakes and rivers. Printed on the stamps are the words "Sikhova setinhlanti," which means fishing-owl in SiSwati, a Bantu language spoken in Swaziland and South Africa. Depicted are (top row left to right) male, female, pair, nest and egg, and adult (with youngster). The kingdom of Swaziland is a landlocked country in southern Africa, bordered by South Africa on three sides; to the east it borders Mozambique. These stamps from Swaziland seem to be quite rare. The 2007 Scott Standard Postage Stamp Catalogue lists a single stamp, unused, at US$16 and a strip of five stamps, unused, at US$125, while on the Steurs WWF web site the full PLS is for sale at US$1655.20 (on January 29, 2010, up from US$1483.10 in June 2008). The Groth WWF web site lists these stamps as a "forerunner" (Groth #F40) of the stamps issued for the WWF and lists a strip of five stamps for sale at 150 Swiss francs (approx. US$140, but not in stock on January 31, 2010). A second series of five stamps for wildlife conservation (but not for the WWF) was issued in 1983 (Scott 427), featuring the Lammergeyer or bearded vulture, and a third series, in 1984 (Scott 448), featured the Southern bald ibis. The fourth and last set in this series was issued in 1985 (Scott 475), also celebrating the birth bicentenary of John James Audubon (1785–1851), the U.S. ornithologist and painter; depicted is the Southern ground-hornbill. We do not know if any of these sets of stamps was also issued as a PLS, and we welcome further information.
Table 1—List of Figures for 5x5 Philatelic Latin Squares
Courtesy Marc Steurs
Figure 1. Swaziland 1982: Pel's fishing-owl; PLS type C1B
PLS Embedded in 10x10 Panes from Canada
Four 5x5 philatelic Latin squares are embedded in each of two panes of 100 stamps (each with nominal value of 5 cents and 6 cents), issued by Canada for Christmas 1970 and featuring children's designs (Figure 2 depicts the pane of 5-cent stamps). The pattern in each of the two panes is the same:
A   AF
FA  FAF
where the 5x5 matrix A denotes the one-step backward circulant matrix C1B, the matrix I the 5x5 identity matrix, and F the 5x5 flip or interchange matrix, that is, the identity matrix I with its rows in reverse order (ones on the anti-diagonal and zeros elsewhere).
The philatelic Latin squares defined by AF (upper right-hand corner) and FA (lower left-hand corner) are one-step forward circulants of type C1F, and FAF (lower right-hand corner) is a one-step backward circulant of type C1B like the PLS defined by A (upper left-hand corner), but with the stamps arranged differently. There are, therefore, eight distinct philatelic Latin squares in all, with two one-step forward circulants and two one-step backward circulants in each of the two panes. An interesting feature of these stamps is that just one of the five stamps appears in a block of four only in the center. The 2010 Unitrade Catalogue lists such a block of four adjacent stamps, unused, at CA$200. As noted by the Canada Post, "The meaning of Christmas to the under-13-year-old children of Canada has been captured in the representative group of delightful drawings chosen from tens of thousands of submissions in the Canada Post Office stamp design project." Featured in the pane of 5-cent stamps (Figure 2) are (top row, first five stamps left to right) horse-drawn sleigh, nativity, snowmen and Christmas tree, children skiing, and Santa Claus.
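The block structure is easy to verify computationally. The sketch below (Python, not part of the article) builds a one-step backward circulant, taking "backward" to mean that each row is the row above shifted one place to the left, which is an assumption about the labeling, and checks that all four corner blocks of the 10x10 pattern are Latin squares:

    import numpy as np

    n = 5
    # One-step backward circulant in standard form: 1..5 across the top, each
    # subsequent row shifted one place to the left (an assumed reading of "C1B").
    A = np.array([[(i + j) % n + 1 for j in range(n)] for i in range(n)])
    F = np.fliplr(np.eye(n, dtype=int))   # flip/interchange matrix: ones on the anti-diagonal

    def is_latin(square):
        symbols = set(range(1, square.shape[0] + 1))
        return (all(set(row) == symbols for row in square) and
                all(set(col) == symbols for col in square.T))

    # A @ F reverses the columns of A, F @ A reverses its rows, F @ A @ F reverses both.
    corners = {"A": A, "AF": A @ F, "FA": F @ A, "FAF": F @ A @ F}
    pane = np.block([[corners["A"], corners["AF"]],
                     [corners["FA"], corners["FAF"]]])   # the 10x10 Christmas-pane layout
    print(pane.shape)                                     # (10, 10)
    print(all(is_latin(sq) for sq in corners.values()))   # True: each corner block is a Latin square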
Courtesy George P. H. Styan
Figure 2. Canada, 1970: First set (5-cent stamps) of children's designs for Christmas 1970; PLS types C1B (top left and bottom right 5x5) and C1F (top right and bottom left 5x5)
PLS from Transkei and Turkey
The only other "countries" we have found to issue a 5x5 PLS of type C1F are Transkei and Turkey. Transkei is a region in the Eastern Cape of South Africa, near Swaziland, which issued its own stamps from 1976 to 1994. Five sheetlets of stamps featuring fishing flies were issued annually by Transkei from 1980 to 1984 in type C1F. Figure 3 shows the first of these five sets, with (top row, left to right): Durham ranger, Colonel Bates, black gnat, zug bug, and March brown. In fly-fishing, fish are caught by using artificial flies that are cast with a fly rod and a fly line.
A 5x5 PLS of type C1F was issued in 1960 by Turkey (Figure 4), the earliest 5x5 PLS that we have identified. (As noted in our previous CHANCE article, the earliest 4x4 PLS that we have found was issued by Canada in 1972.) This 1960 PLS from Turkey was issued for the 1960 Summer Olympics, held in Rome August 25–September 11, 1960. Rome had been awarded the organization of the 1908 Summer Olympics, but after the 1906 eruption of Mount Vesuvius, it was forced to decline and pass the honors to London. At the 1960 Olympics, Turkey won seven gold medals and two silver—all nine for wrestling. Depicted on the stamps are (top row, left to right): hurdling, soccer, steeplechase, basketball, and wrestling.
Courtesy George P. H. Styan
Figure 3. Transkei, 1980: fishing flies; PLS type C1F
Courtesy George P. H. Styan
Figure 4. Turkey, 1960: Rome Olympics; PLS type C1F
Two-Step Circulant PLS Types C2B, C2F; Knut Vik and Gerechte Designs
The two-step circulant Latin squares C2B and C2F have several interesting properties. In particular, they are of the type known as a "knight's move" or "Knut Vik" design, as described in the 1990 Biometrics review article by Preece:
Of Latin squares used for crop experiments with a single set of treatments, the earliest examples (published in 1924) are 5x5 squares of the systematic type known as Knut Vik or "knight's move" designs (Knut Vik being a [Norwegian] person, not a Scandinavian translation of "knight's move"!). These are squares where all cells containing any one of the treatments can be visited by a succession of knight's moves (as in chess) and where no two diagonally adjacent cells have the same treatment.
As was observed by R. A. Fisher in The Design of Experiments:
In this arrangement the areas bearing each treatment are nicely distributed over the experimental area, so as to exclude all probability that the more important components of heterogeneity should influence the comparison between treatments.
In 1931 Olof Tedin and in 1951 Øivind Nissen observed that there are just two possible arrangements of the 5x5 Knut Vik design in standard form: C2B and C2F. These designs are pandiagonal, in that all diagonals (the two main diagonals and all broken diagonals) contain each of the five treatments precisely once. They are also A-efficient: the A-criterion in the analysis of variance with a Latin square design is equivalent to minimizing (a) the average variance of treatment-effect estimates, (b) the average variance of an estimated difference in the treatment-effect estimates, and (c) the expected value of the treatment sum of squares when there are no treatment differences. Moreover, no two diagonally adjacent cells have the same treatment and, as noted by Bailey, Kunert, and Martin (1991), C2B and C2F are the only gerechte designs with this property. In a kxk gerechte design, the kxk grid is partitioned into k "regions" S1, S2, …, Sk, say, each containing k cells of the grid; we are required to place the symbols 1, 2, …, k into the cells of the grid in such a way that each symbol occurs once in each row, once in each column, and once in each region.
The row and column constraints say that the solution is a Latin square, and the last constraint restricts the possible Latin squares. Gerechte designs originated in the statistical design of agricultural experiments, where they ensure that treatments are fairly exposed to localized variations in the field containing the experimental plots. As discussed in the article "Some comments on gerechte designs, I: analysis for uncorrelated errors," in the Journal of Agronomy and Crop Science, R. A. Bailey, J. Kunert, and R. J. Martin (1990) point out that Gerechte designs are row and column designs which have an additional blocking structure formed by spatially compact regions. They were popularized in Germany in 1956 by Walter-Ulrich Behrens (1902–1962). We use the German word “gerechte” for these designs, as there is no simple word in English to express the idea of “fair-allocation,” and because the designs originated in Germany and do not appear to have spread from there. Behrens’s idea was to have a class of designs that were efficient, as good systematic designs are, but that allow some randomization. Walter-Ulrich Behrens (1956) called C2B a “gerechte Latin square” (German: “gerechte lateinische Quadrate”) and C2F a “knight’s move layout” (German: “Rösselsprunganordnung”). The well-known Behrens–Fisher problem of comparing means of two populations with possibly unequal variances was first considered by this Walter-Ulrich Behrens and Fisher (1935, 1941).
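The pandiagonal "knight's move" structure is also easy to check by machine. The sketch below is an illustration, not from the article; it takes C2F to mean that each row is the row above shifted two places to the right (an assumption about the authors' labeling) and verifies that every row, column, and broken diagonal contains each of the five symbols exactly once:

    import numpy as np

    n = 5
    # Two-step forward circulant: row i is the first row shifted 2*i places to the right.
    C2F = np.array([[(j - 2 * i) % n + 1 for j in range(n)] for i in range(n)])

    def pandiagonal_latin(square):
        k = square.shape[0]
        symbols = set(range(1, k + 1))
        rows = all(set(r) == symbols for r in square)
        cols = all(set(c) == symbols for c in square.T)
        diags = all(set(square[i, (i + d) % k] for i in range(k)) == symbols for d in range(k))
        antis = all(set(square[i, (d - i) % k] for i in range(k)) == symbols for d in range(k))
        return rows and cols and diags and antis

    print(C2F[0], C2F[1])          # [1 2 3 4 5] [4 5 1 2 3]
    print(pandiagonal_latin(C2F))  # True: all rows, columns, and broken diagonals are complete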
Gerechte Designs and Sudoku
Solutions to 9x9 Sudoku puzzles are examples of gerechte Latin squares, where k = 9 and the regions S1, S2, …, S9 are the nine contiguous 3x3 regions (submatrices or boxes). Charles Colbourn and Jeffrey Dinitz in their 2007 Handbook of Combinatorial Designs define a completed 9x9 Sudoku puzzle as a (3, 3)-Sudoku Latin square. While Sudoku puzzles have become popular only very recently, Sudoku's French ancestors from the late 19th century are discussed by Christian Boyer in his 2007 book Sudoku's French Ancestors.
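The same kind of region check applies in the Sudoku case, with the nine 3x3 boxes as the regions. In the sketch below (again an illustration, not from the article), the grid is a standard formulaic Sudoku solution used purely to have something to test:

    import numpy as np

    k = 9
    grid = np.array([[(3 * i + i // 3 + j) % k + 1 for j in range(k)] for i in range(k)])

    symbols = set(range(1, k + 1))
    rows_ok = all(set(r) == symbols for r in grid)
    cols_ok = all(set(c) == symbols for c in grid.T)
    boxes_ok = all(set(grid[a:a + 3, b:b + 3].ravel()) == symbols
                   for a in range(0, k, 3) for b in range(0, k, 3))
    print(rows_ok and cols_ok and boxes_ok)   # True: a completed (3, 3)-Sudoku Latin square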
Two-Step Circulant PLS Types C2B, C2F from Pakistan and the United States
The only 5x5 PLS of the two-step circulant types C2B and C2F that we have identified are from Pakistan and the United States. A PLS of type C2B was issued by Pakistan (Figure 5) in 2004 (Scott 1045), for the National Philatelic Exhibition in Lahore on Universal Postal Union Day (October 9). This PLS depicts five kinds of tropical fish (top row, left to right): striped gourami (Colisa fasciata), black widow tetra (Gymnocorymbus ternetzi), yellow dwarf cichlid (Apistogramma borellii), tiger barb (Capoeta tetrazona), and neon tetra (Paracheirodon innesi). For more about philatelic Latin squares from Pakistan, see Chu, Puntanen, and Styan (2009).
A PLS of type C2F was issued by the United States (Figure 6) in 1995 (Scott 3019), depicting five antique automobiles from the late 19th and very early 20th century. In the top row (left to right) are the 1893 Duryea, 1898 Columbia, 1901 White, 1894 Haynes, and 1899 Winton.
Courtesy George P. H. Styan
Figure 5. Pakistan, 2004: tropical fish; PLS type C2B
Courtesy George P. H. Styan
Figure 6. USA, 1995: antique automobiles; PLS type C2F
Concluding Remarks
We do not know if the four Latin square designs of size 5x5 that we have found used in printing postage stamps are also the four most popular for statistical experimental design purposes. It is clear, however, that the one-step circulants are very easy to construct and that the two-step circulants have proven popular as Knut Vik designs. An expanded version of this article (with all stamps shown in color) is available in the supplemental material at www.amstat.org/publications/chance.
Further Reading
Bailey, R. A., Cameron, P. J., and Connelly, R. 2008. Sudoku, gerechte designs, resolutions, affine space, spreads, reguli, and Hamming codes. American Mathematical Monthly 115(5):383–404.
Bailey, R. A., Kunert, J., and Martin, R. J. 1990. Some comments on gerechte designs, I: Analysis for uncorrelated errors. Journal of Agronomy and Crop Science 165(2-3):121–130.
Bailey, R. A., Kunert, J., and Martin, R. J. 1991. Some comments on gerechte designs, II: Randomization analysis and other useful methods that allow for inter-plot dependence. Journal of Agronomy and Crop Science 166(2):101–111.
Chu, K. L., Puntanen, S., and Styan, G. P. H. 2009. Some comments on philatelic Latin squares from Pakistan. Pakistan Journal of Statistics 25(4):427–471.
Colbourn, C. J., and Dinitz, J. H., eds. 2007. Handbook of combinatorial designs, 2nd ed. Boca Raton, FL: Chapman & Hall.
Dénes, J., and Keedwell, A. D. 1974. Latin squares and their applications. New York: Academic Press.
Fisher, R. A., and Yates, F. 1963. Statistical tables for biological, agricultural, and medical research, 6th ed. Edinburgh: Oliver & Boyd.
Kloetzel, J. E., ed. Scott standard postage stamp catalogue. Published annually by Scott Publishing, Sidney, OH.
Martin, R. J. 1986. On the design of experiments under spatial correlation. Biometrika 73(2):247–277.
Steurs, Marc. The most complete site for your WWF stamp collection. Online open-access web site: http://wwfstamps.be.
Styan, G. P. H. Personal collection of philatelic items. Verdun (Québec), Canada, February 1, 2010.
Goodness of Wit Test #8: Employs Magic Jonathan Berkowitz, Column Editor
This issue's puzzle was composed in my home city of Vancouver, British Columbia, as we prepared to "Welcome the World" to the 2010 Winter Olympics. In the spirit of "welcoming the world," this puzzle is actually five smaller puzzles, each a cryptic five-square of 10 words. For those of you who have never tried solving cryptic crosswords because they seemed too large or too daunting, these smaller tastes are for you. We hope you will attempt to solve one or more so we can welcome you to the world of cryptic crosswords. Although the squares are, well, square, they represent the five rings of the Olympic symbol. And most of the clues have an Olympic theme. If you solve all five squares, you will find a common element related to the symbol. The title is also a miniature puzzle. A one-year (extension of your) subscription to CHANCE will be awarded for each of two correct solutions chosen at random from among those received by the column editor by July 1, 2010. As an added incentive, a picture and short biography of each winner will be published in a subsequent issue. Please mail your completed diagram to Jonathan Berkowitz, CHANCE Goodness of Wit Test Column Editor, 4160 Staulo
Crescent, Vancouver, BC Canada V6N 3S2, or send him a list of the answers to
[email protected]. Please note that winners to the puzzle contest in any of the three previous issues will not be eligible to win this issue’s contest.
Solution to Goodness of Wit Test #6: Standardizing Conflict This puzzle appeared in CHANCE Vol. 22, No. 4. In the places where two answers clash, the Across answer has an X, and the Down answer has a Y. Replace them with Z (i.e., in "standard” manner). Across: 1. INCUR [rebus: in + cur(e)] 5. EXAMPLE [charade: ex + ample] 10. COAT [odd letters of c(o)o(l) a(c)t(s)] 11. PEAL [anagram: plea] 13. MENACE [rebus + anagram: me + cane] 14. MATRIX [charade: ma + Trix] 15. ACORN [AC + or + n(ew), abbrev.] 16. AMONG [rebus: A(mon) g] 18. DEMERIT [hidden word reversal: en(tire med)ia] 19. PRELIMS [anagram: simpler] 20. ANCHOVY [anagram: havoc NY] 26. STATION [anagram: to stain] 28. HELLO [charade: Hell + O] 29. LINER [deletion + anagram: (B) erlin] 30. EXCESS [homophone: XS] 32. ELATED [beheadment: related – r]
33. HOST [anagram: shot] 34. TELL [double definition] 35. PARADOX [homophone: pair of docs] 36. XEROX [rebus + reversal: X + ero + X]. Down: 1. ICEMAN [anagram: Cinema] 2. NOVA [beheadment: ANOVA – A] 3. CAPTOR [charade: cap + tor] 4. U-TURNS [rebus: (f)utur(e) + NS] 5. EERY [homophone: eerie] 6. YAMMER [anagram + deletion: marry me – r] 7. MINCE [homophone: mints] 8. LACRIMAL [rebus: LA + crim(e) + a + l(and)] 9. EVENTS [anagram: Steven] 11. PEIGNOIR [rebus: Pei + g + noir] 12. MAORI [initial letters: M(an) A(t) O(lduvai) R(equired) I(dentifying)] 15. AMETHYST [anagram + container: teams + thy] 17. MANTILLA [hidden word: infor(mant I'll a)dvise] 20. ASLEEP [rebus + reversal: a + peels] 21. CANAL [rebus: can + Al] 22. VOODOO [rebus: V+O+O+do+O] 23. RECITE [container: r(EC)ite] 24. CLEVER [rebus: c + lever] 25. COSTLY [rebus + anagram: colts + Y] 27. TETRA [hidden word: loca(te tra)ps] 30. EASY [homophone: EZ] 31. SILO [anagram: oils]. Reminder: A guide to solving cryptic clues appeared in CHANCE Vol. 21, No. 3. Using solving aids—electronic dictionaries, the Internet, etc.—is encouraged.
Winners from Goodness of Wit Test #6: Standardizing Conflict
James Walker received his BS in mathematics from MIT. He currently teaches AP subjects at the St. Paul Preparatory School in Minnesota. Besides mathematics, he enjoys running, golfing, cross-country skiing, acting, and playing the bluegrass banjo.
Nathan Wetzel is a professor of statistics at the University of Wisconsin – Stevens Point, where he has been on the faculty since 1997. He enjoys Lego, all types of puzzles, teaching, and anything involving data.
Goodness of Wit Test #8: Employs Magic
Instructions: Clueing and answers are normal. However, the completed grids contain evidence of the Olympic symbol. For full credit you need to find that evidence. A bonus puzzle is provided by the title; see if you can figure it out without any instructions. Special thanks to my good friend and cryptic crossword mentor (and solving partner in the National Puzzlers' League; see www.puzzlers.org for sample puzzles), John Forbes, whose talents I relied on to provide so many clever clues with an Olympics theme.
SQUARE 1 CLUES
Across
1 Protest forgoing real advantage
6 Odds of doors accommodating two large women
7 Nagano, Vancouver, hosting means test (abbrev.)
8 Piste initially to act like back of smoother sheet
9 Lock front of skis to pursue technical matter
Down
1 Heads of American team hold up landing site change
2 First of snowboarders working with a range-finding device
3 Slant cut in sides of skate
4 Little people or famous rock 'n' roll singers?
5 Stars could be leaders of Russian team?
SQUARE 2 CLUES
Across
1 Blows resolution, absorbing bit of steroid
6 Upper north borders of Canada add pressure to remove crown
7 Energy and easy pace leading to jump
8 Alert father before finish of biathlon
9 Student's method manifest except for answer (hyph.)
Down
1 Visitor offered prediction in conversation
2 Until switch of time, left dark
3 Twenty CEOs weaving about middle of courses
4 Records put up around area west of pylons
5 Breathless ski-cross's leader confined
SQUARE 3 CLUES
Across
1 Enough data for us without snow
6 Reverse like luger initially performing in room
7 Wear red during part of broadcast
8 Change grooming later
9 Sloppy meal room close to ceremony
Down
1 Crowd and some of athletes served up tea
2 Guy hides pole in tree
3 Leaders lacking interest in schemes
4 Traditional beliefs upset loser
5 One who watches losing head for competitor
SQUARE 4 CLUES
Across
1 Bizarre totem composition
6 Earlier Lake Placid area passion
7 Constituents pour out after game
8 Gold team returned bereft of a means of transportation
9 What's left from hockey no-nos after dropping first two
Down
1 Mother wins final of curling with hot rock
2 Gores dancing giants …
3 … and half of them canine
4 Benefit from dash returning between errors …
5 … – some nasty positional mistakes
SQUARE 5 CLUES
Across
1 Trouble gaining standard missing a month
6 Routine work completed capturing lead in relay
7 John's at start of event relaxing
8 Put under heading "Missing from season"
9 Top performances slaloming initially inside stakes
Down
1 Ex-boxer grips rear of sled before spontaneous remark (hyph.)
2 Bill, after beginning of promising run, lying flat
3 Goes downhill garnering zero cheers
4 Trendy place producing picture within a picture
5 Looks lasciviously at trio of bobsleighers following the French