Editor’s Letter
Mike Larsen,
Executive Editor
Dear Readers, CHANCE now includes all content in print and online for subscribers and libraries. This will make CHANCE more accessible and useful to subscribers and attractive to potential authors. As you experience the online version, please feel free to share your impressions and suggestions at
[email protected].

The interaction between statistics and health and medicine is well represented in this issue. Articles and columns concern measuring health disparities, studying sleep and sleep disorders, defining and measuring disability, using machine learning in medical diagnosis, comparing effectiveness of antibiotics, and the emerging fields of pharmacogenetics and pharmacogenomics.

Ken Keppel and Jeff Pearcy describe Healthy People 2010, a national initiative to coordinate health improvement and prevent disease. Of course, you have to measure disease rates if you want to judge performance. How many data sources do you think are involved? Numerical examples contrast absolute and relative measures of disparity.

Two articles concern sleep research. Despite the common theme, the two articles differ significantly in data and methods beyond their introductions. Brian Caffo, Bruce Swihart, Ciprian Crainiceanu, Alison Laffan, and Naresh Punjabi discuss data on transitions among sleep states gathered during overnight laboratory observations. Statistical matching plays an important role in the design. James Slaven, Michael Andrew, Anna Mnatsakanova, John Violanti, Cecil Burchfiel, and Bryan Vila report on a study involving actigraphy measurements on a large group of police officers. The officers wear monitors, which record activity levels every minute, for 15 days. Statistical measurements of data quality are important in this study.

How would you define disability? How would you measure it in a sample survey? Michele Connolly explains why statistical description of disability in the population is both complicated and important.

Michael Cherkassky examines two machine learning, or statistical learning, methods for use in medical diagnosis. Cross-validation is used to aid model selection. Methods are applied to three data sets. Cherkassky
is a high-school student and award winner in the 2008 ASA Intel International Science and Engineering Fair. In the Visual Revelations column, Howard Wainer tells us about Will Burtin and his contributions to scientific visualization. Burtin’s data set was the basis of the graphics contest announced in the previous issue. Winners of the graphics contest will be announced in the next issue. Wainer presents Burtin’s original, as well as his own, graphical interpretation of the data in this issue. In Mark Glickman’s Here’s to Your Health column, Todd Nick and Shannon Saldaña discuss pharmacogenetics and pharmacogenomics. Besides being dream words for Scrabble players, these refer to developments at the frontier of science and medicine. These are rich and challenging areas for statistical collaboration. Two articles have sports themes. Bill Hurley examines the first-year performance of hockey sensation Jussi Jokinen and how one can assess exceptional performance. This article could provide nice examples for instructors of regression to the mean and shrinkage estimators. Lawrence Clevenson and Jennifer Wright compute expected returns to punting or going for a first down on fourth down in professional football. Several assumptions and statistical modeling of probabilities of various events are required. This work was part of Wright’s master’s degree paper. To complete the issue, Jonathan Berkowitz brings us his Goodness of Wit Test column puzzle. Berkowitz gives us the hint that some degree of pattern recognition in the answers to clues will be useful (and fun!). I look forward to your comments, suggestions, and article submissions in 2009. Enjoy the issue! Mike Larsen
Healthy People 2010: Measuring Disparities in Health Kenneth G. Keppel and Jeffrey N. Pearcy
Healthy People 2010 (HP2010), led by the Office of Disease Prevention and Health Promotion of the U.S. Department of Health and Human Services, is a national initiative to promote health and prevent disease. Through a national consensus-building process from 1997 to 1999, specific objectives for improving the health of the nation were identified. This process involved participants from 350 national membership organizations; 250 state health agencies; and numerous professionals at local, state, and national meetings. Baseline values for the objectives were established and specific targets were set for improvements to be achieved by 2010. Similar initiatives set objectives for 1990 and 2000. A new set of objectives for 2020 is being developed.

The first overarching goal of this initiative is to increase the length and improve the quality of life. The second goal, which is the focus of this article, is to eliminate disparities in health among subgroups of the population. The second goal requires measurement of disparities and monitoring of changes in disparities over time. HP2010 includes 955 objectives that are being tracked using 190 data sources. Major data sources include the National Health Interview Survey (NHIS), the National Health and Nutrition Examination Survey (NHANES), and components of the National Vital Statistics System (NVSS). There are 504 objectives that call for data by demographic characteristics, including race and Hispanic origin, income, education, gender, geographic location, and disability status. These "population-based" objectives are measured in terms of the rate or proportion of individuals with a particular health attribute, such as a health condition or outcome, a known health risk, or use of a specific health care service. Disparities in health are measured between subgroups of the population based on race and Hispanic origin, income, etc.

Measuring Changes in Disparity
Conclusions about changes in disparity depend on which reference point the disparity is measured from, whether the disparity is measured in absolute or relative terms, and whether the indicator is expressed in terms of favorable or adverse events. When HP2010 began in 2000, there was little agreement on how disparities should be measured. Since 2000, consensus has been reached on the measurement of disparities in HP2010. The most favorable, or “best,” group rate among the groups for each characteristic is chosen as the reference point for measuring disparities. For example, the best racial and ethnic group rate is used as the reference point for measuring disparities among racial and ethnic populations. The best rate is a logical choice for an initiative with the goal to eliminate disparities in health. The ability of one population to achieve the most favorable rate suggests that other populations could achieve this rate. Also, it would not be desirable to eliminate disparity by making the rate for any group worse than it is. The accuracy of the best group rate is a concern when subgroup rates are based on a small population or a small sample. If the best group rate does not meet the criteria for precision, the best rate is selected from among groups that meet the criteria. Disparities can be measured as either the absolute difference or relative difference between the best group rate
and the rate for another group. Conclusions about the size and direction of changes in disparity depend on whether absolute or relative measures are used. In HP2010, disparities are measured in relative terms as the percent difference between each of the other group rates and the best group rate. When disparities are measured in relative terms, they can be compared across indicators with different units of measurement (e.g., %; per 1,000; per 100,000). Reductions in relative disparities are required as evidence of progress toward eliminating disparity. Reduction in an absolute measure of disparity can occur without any
corresponding reduction in the relative measure, and therefore without any progress toward eliminating disparity. In HP2010, changes in disparity are assessed in terms of the percentage point change in the percent difference from the best group rate between two points in time. When disparities are measured in relative terms, conclusions about changes in disparity also depend on whether the indicator is expressed in terms of favorable or adverse events. James Scanlan, in his 1994 CHANCE article "Divining Difference," pointed out that the relative difference in rates of survival between black and white infants decreases as the
relative difference in rates of infant mortality increases. When disparities are measured in HP2010, indicators are usually expressed in terms of adverse events so meaningful comparisons can be made across indicators. For example, objective 3-13 calls for an increase in the percent of women 40 years and older who received a mammogram within the past two years. When disparity is measured, this indicator is expressed as the percent of women who did not receive a mammogram within the past two years. Only a few indicators expressed in terms of averages cannot be expressed readily in adverse terms—the average age at first use of alcohol among adolescents or the median RBC folate level among nonpregnant women, for example.
Numerical Examples of Changes in Rates and Changes in Absolute and Relative Measures of Disparity

Suppose groups A and B are measured on an adverse event at times 1 and 2. Low values are good. In each scenario, group B has the best rate and is the reference point for measuring disparity. For this illustration, we are ignoring possible sampling variability or model uncertainty associated with estimates. In scenario 1, both groups improve, and group A improves faster than group B. In scenario 2, both groups improve, but the percent improvement in group B is larger, so relative disparity increases. In scenario 3, group A improves and disparity decreases. In scenario 4, disparity decreases, but only because the adverse event measure increases in group B.

1. A good scenario—Improvement for Groups A and B and a reduction in disparity

                      Time 1              Time 2              Direction of Change
  Group A             80                  60                  Rate better
  Group B – best      60                  50                  Rate better
  Absolute Disparity  20                  10                  Better
  Relative Disparity  (80-60)/60 = 1/3    (60-50)/50 = 1/5    Better

2. Greater improvement for Group B results in an increase in disparity

                      Time 1              Time 2              Direction of Change
  Group A             80                  60                  Rate better
  Group B – best      60                  40                  Rate better
  Absolute Disparity  20                  20                  Same
  Relative Disparity  (80-60)/60 = 1/3    (60-40)/40 = 1/2    Worse

3. Improvement for Group A only results in a decrease in disparity

                      Time 1              Time 2              Direction of Change
  Group A             80                  70                  Rate better
  Group B – best      60                  60                  Rate same
  Absolute Disparity  20                  10                  Better
  Relative Disparity  (80-60)/60 = 1/3    (70-60)/60 = 1/6    Better

4. An undesirable way for disparity to decrease

                      Time 1              Time 2              Direction of Change
  Group A             80                  80                  Rate same
  Group B – best      60                  65                  Rate worse
  Absolute Disparity  20                  15                  Better
  Relative Disparity  (80-60)/60 = 1/3    (80-65)/65 = 3/13   Better
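The four scenarios can be reproduced with a few lines of code. The sketch below was written for this discussion (it is not from the article); the scenario labels and values simply restate the tables above.

```python
# Hypothetical restatement of the four scenarios; group B holds the "best"
# (lowest) rate on an adverse indicator at both time points.
scenarios = {
    "1. Both improve, A faster": ((80, 60), (60, 50)),  # (A at times 1 and 2), (B at times 1 and 2)
    "2. B improves faster":      ((80, 60), (60, 40)),
    "3. Only A improves":        ((80, 70), (60, 60)),
    "4. B worsens":              ((80, 80), (60, 65)),
}

def disparity(a, b):
    """Absolute and relative disparity of group A measured from the best rate b."""
    return a - b, (a - b) / b

for name, ((a1, a2), (b1, b2)) in scenarios.items():
    abs1, rel1 = disparity(a1, b1)
    abs2, rel2 = disparity(a2, b2)
    print(f"{name}: absolute {abs1} -> {abs2}, relative {rel1:.2f} -> {rel2:.2f}")
```

Running it shows the point of scenarios 2 and 4: the absolute and relative measures can move in different directions, and a relative "improvement" can come from the best group getting worse.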
Figure 1. New cases of Hepatitis A (per 100,000 population, log scale) by race and Hispanic origin: United States, 1997–2005. Rates improving; disparity decreasing, then increasing. Healthy People 2010 target: 4.3. Source: DATA2010, http://wonder.cdc.gov/data2010
Figure 2. Obesity in adults 20 years and older (age-adjusted percent, log scale) by race and ethnicity: United States, 1988–1994, 1999–2002, and 2003–2006. Rates worsening; disparity decreasing. Healthy People 2010 target: 15%. Source: DATA2010, http://wonder.cdc.gov/data2010
To summarize, in HP2010, consensus has been reached on measuring disparities from the best group rate, in relative terms, with indicators expressed in terms of adverse events. Reduction in the percent difference is required as evidence of progress toward eliminating disparity for each group. Reduction in the average percent difference from the best group is indicative of a reduction in disparity for each characteristic (e.g., race and ethnicity).
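The group-level and characteristic-level summaries described in this paragraph can be illustrated with hypothetical rates. The numbers below are invented; only the definitions come from the article.

```python
# Hypothetical adverse-event rates for the groups defined by one characteristic.
rates = {"Group A": 25.0, "Group B": 18.0, "Group C": 12.0, "Group D": 15.0}

best = min(rates.values())  # the best (most favorable) group rate is the reference point
percent_diff = {g: 100 * (r - best) / best for g, r in rates.items() if r != best}
# (Whether the best group itself, at 0%, is averaged in is a convention choice;
#  it is excluded here.)

print(percent_diff)                                    # per-group percent difference from the best rate
print(sum(percent_diff.values()) / len(percent_diff))  # average percent difference for the characteristic
```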
Three Examples from HP2010

Progress toward the first goal—to improve health—can be assessed in terms of changes since the baseline in the
data for each objective. Progress toward the second goal can be assessed in terms of changes in the percent difference from the best group rate for specific subgroups of the population for most populationbased objectives. Progress toward the first goal does not necessarily coincide with progress toward the second and vice versa. The following examples illustrate different results thus far. Hepatitis A: The incidence of new cases of hepatitis A (HP2010 objective 14-6) provides an interesting example of how disparities can change as rates improve. Trends in the rate of new cases of hepatitis A for five racial and ethnic populations are shown in Figure 1. The percent difference between the rate for the Hispanic population and the rate for the group with the best rate (the Asian or Pacific Islander population in 1997 and both the non-Hispanic black and
white populations in 2002) declined from 432% in 1997 to 84% in 2002. The estimates in Figure 1 are shown on a log scale. The convergence in estimates from 1997 to 2002, therefore, represents reduction in relative differences from the best group rate. Not only had the racial and ethnic disparity been substantially reduced, but the HP2010 target of 4.3 new cases of hepatitis A per 100,000 population was achieved for all five racial and ethnic populations in 2002. This is a very desirable result. After 2002, new case rates continued to decline, but relative differences from the best group rate increased for each of the other groups to nearly the same level as they were in 1997. The American Indian or Alaska Native population went from having nearly the worst rate in 1997 to having the best rate in 2005. The reduction in new cases of hepatitis A was due to a combination of strategies, including geographic targeting of immunization programs. The continuing reduction in rates is encouraging, but the increase in disparities is not desirable. Adult Obesity: The disparity among three racial and ethnic populations declined as the percent of obese adults increased. HP2010 objective 19-2 calls for a reduction in the rate of obesity in adults 20 years and older from 23% at baseline in 1988–1994 to 15% in 2010. As indicated in Figure 2, the rate of obesity increased for the three racial and ethnic populations with reliable data. In 1988–1994, the percent of obese adults for both Mexican-American and non-Hispanic black populations was substantially higher than the percent for the non-Hispanic white population, the reference group. In 1999–2002 and 2003–2006, because of the increase in obesity in the reference group, there was only one racial and ethnic group for which the percent of obese adults was substantially higher than the percent for the non-Hispanic white population. The average of the two percent differences from the best group rate was reduced despite an increase in obesity. This is not a desirable way to reduce disparities. Prostate Cancer Death Rate: Objective 3-7 calls for a reduction in the prostate cancer death rate from 31.3 per 100,000 population in 1999 to 28.2 in 2010. Except for the American Indian or Alaska Native population, there were statistically significant declines in prostate cancer death rates between 1999
and 2004 (Figure 3). Between 1999 and 2002, the percent difference from the best group rate increased for the Hispanic, non-Hispanic black, and non-Hispanic white populations. However, when the baseline year and the most recent year are compared, there was no statistically significant change in the percent difference from the best group rate for any of the populations. Despite the improvement in rates for four of five racial and ethnic populations, relative differences from the best group rate were essentially unchanged.

Figure 3. Prostate cancer death rates (age-adjusted, per 100,000 population, log scale) by race and Hispanic origin: United States, 1999–2004. Rates generally improving; disparity increasing, then decreasing. Healthy People 2010 target: 28.2. Source: DATA2010, http://wonder.cdc.gov/data2010

Discussion

To monitor progress toward the elimination of disparities in HP2010, three essential choices were made. Progress toward eliminating disparities is judged to have occurred when relative differences from the best group rate in adverse events are reduced over time. These principles have been employed in measuring disparities and changes in disparity for the population-based objectives in HP2010. Disparities have been measured for race and ethnicity, income, education, gender, geographic location, and disability status. The results have been published in the Healthy People 2010 Midcourse Review. Although the review indicates disparities have been reduced for relatively few objectives and disparities increased for nearly as many objectives, these results do not imply that disparities in health cannot be reduced. Instead, these results are an indication of the difficulty that can be encountered in reducing rates of adverse outcomes for disadvantaged populations. Population groups with more unfavorable rates must improve by greater proportions than the rate for the best group if disparities are to be reduced. The measurement of disparity has not yet become standardized. The importance of choosing a reference point, deciding to measure disparity in absolute or relative terms, and expressing indicators in terms of adverse events is not widely understood. Different choices can lead to different conclusions about changes in disparity. Consensus on the measurement of disparities in HP2010 is a significant contribution to this initiative. Despite this accomplishment, there are issues that still need to be considered. While changes in relative measures of disparity are needed as evidence of progress
toward reducing disparities, there are, as yet, no criteria for determining that a disparity has been eliminated. When the rates for groups are small, a large relative difference might correspond to a tiny absolute difference. If the absolute differences among group rates are small, it is possible that no further reduction is required. Figure 1 shows an example of this type. The question of whether a difference between groups is no longer great enough to warrant public health intervention is not just a statistical one. Social, ethical, and practical factors also need to be considered when decisions are made about public health interventions. Monitoring for HP2010 will continue through the end of the decade. Healthy People 2020 is now being planned. Lessons learned from HP2010 will inform the choice of objectives and methods for measuring health and disparities in health among subgroups of the population. Indeed, the continuing process of building consensus will make it possible to monitor health more accurately and in ways that can lead to a healthier nation.
Further Reading

www.healthypeople.gov
Department of Health and Human Services. (2000) Healthy People 2010, 2nd edition. With Understanding and Improving Health and Objectives for Improving Health. Government Printing Office: Washington, DC.
Carter-Pokras, O. and Baquet, C. (2001) "What Is a 'Health Disparity'?" Public Health Reports, 117(5):426–434.
Scanlan, J.P. (1994) "Divining Difference." CHANCE, 7(4):38–39, 48.
Keppel, K.; Pearcy, J.; Klein, R. (2004) "Measuring Progress in Healthy People 2010." Statistical Notes, No. 25. National Center for Health Statistics: Hyattsville, Maryland.
Keppel, K.; Pamuk, E.; Lynch, J.; et al. (2005) "Methodological Issues in Measuring Health Disparities." Vital and Health Statistics, 2(141). National Center for Health Statistics: Hyattsville, Maryland.
Keppel, K. and Pearcy, J. (2005) "Measuring Relative Disparities in Terms of Adverse Events." Journal of Public Health Management and Practice, 11(6):479–483.
Low, L. and Low, A. (2006) "Importance of Relative Measures in Policy on Health Inequalities." British Medical Journal, 332:967–969.
Office of Disease Prevention and Health Promotion. (2006) Healthy People 2010 Midcourse Review, www.healthypeople.gov/data/midcourse/default.htm#pubs.
Keppel, K.; Garcia, T.; Hallquist, S.; Ryskulova, A.; Agress, L. (2008) "Comparing Racial and Ethnic Populations Based on Healthy People 2010 Objectives." Statistical Notes, No. 26. National Center for Health Statistics: Hyattsville, Maryland.
An Overview of Observational Sleep Research with Application to Sleep Stage Transitioning Brian Caffo, Bruce Swihart, Ciprian Crainiceanu, Alison Laffan, and Naresh Punjabi
Sleep is an essential component of human existence, consuming roughly one third of our lives. Fatigue, jet lag, poor sleep, and vivid dreams are frequent points of our morning discussions. We look and feel terrible after getting too little sleep; hence a $20 billion industry of beds, pillows, pills, and other tools has cropped up to help us sleep better. Correspondingly, there are plenty of products designed to keep us awake.

Despite the importance of sleep in our lives, and the lives of so many other species, a definitive answer on the specific neurobiological or physiologic purpose of sleep eludes researchers. However, substantial advances in the field are uncovering the crucial role that sleep plays in our health, behavior, and well-being. For example, studies of sleep duration have found associations with a variety of important health outcomes. Short sleep duration correlates with impaired cognitive function, hypertension, glucose intolerance, altered immune function, obesity, and even mortality. This point is driven home by the fact that sleep deprivation is a well-recognized form of torture. Consider the following quote from former Israeli Prime Minister Menachem Begin, who suffered forced sleep deprivation as a KGB prisoner:

In the head of the interrogated prisoner, a haze begins to form. His spirit is wearied to death, his legs are unsteady, and he has one sole desire: to sleep, to sleep just a little, not to get up, to lie, to rest, to forget … Anyone who has experienced this desire knows that not even hunger and thirst are comparable with it.

Quantity of sleep is only one measurable facet of sleep that is associated with health. Table 1 gives a few of the more common measurements of sleep and sleep disturbance.

A common sleep disorder of particular public health interest is sleep apnea. This is a chronic condition characterized by collapses of the upper airway during sleep. Complete collapses lead to so-called "apneas;" partial collapses lead to so-called "hypopneas." Over the last decade, research has shown these events can lead to several physiologic consequences, including changes in metabolism, glucose tolerance, and cardiac function. The respiratory disturbance index (RDI), sometimes also called the apnea/hypopnea index (AHI), is the principal measure of severity of sleep apnea. This rate index is the count of the number of apneas and hypopneas divided by the total time slept in hours. A severely affected patient may have an RDI of 30 events per hour or higher. Hence, such a patient is, on average, having a disruption in their sleep and breathing
every two minutes. As one can imagine, such frequent disruptions in sleep and oxygen intake can have negative health consequences. Terry Young, Paul E. Peppard, and Daniel J. Gottlieb write in the American Journal of Respiratory and Critical Care Medicine that a high RDI has been shown to be associated with hypertension, cardiovascular disease, cerebrovascular disease, excessive daytime sleepiness, decreased cognitive function, decreased health-related quality of life, increased motor vehicle crashes and occupational accidents, and mortality. We relate measures of sleep apnea with transitions that occur between “sleep states.” Sleep states are based on visual classification of brain electroencephalograms (EEGs) patterns. Two major sleep states are rapid eye movement (REM) and non-REM. Sleep states can be seen as a categorical response time series. Crude summaries of these states, such as the percentage of time spent in each one, are often used as predictors of health. Instead, we investigate the role sleep apnea has on the rate of transitioning between the states. We emphasize that the rate of transitioning contains important additional information that the crude percentage of time spent in each state omits. Notably, we use matching to account for other variables that might be related to both disease status and sleep
behavior, hence comparing a severely diseased group with a matched nondiseased group.

Table 1—Measurements of Sleep Taken During an Overnight Sleep Study and Routine Clinical Evaluation of Sleep

Sleep Measure                    Description
Arousal index                    Number of arousals per hour slept
Epworth Sleepiness Scale         Aggregate measure of general sleepiness
Respiratory disturbance index    Number of apneas and hypopneas per hour
Sleep architecture               Proportion of time spent in various sleep states
Sleep efficiency                 Time asleep as a proportion of time in bed
Sleep latency                    Time until falling asleep
Total sleep time                 Total time asleep in a night
Sleep Measurement

The gold standard of sleep measurement is based on an overnight sleep study called a "polysomnogram." During a polysomnogram, a patient has several physiologic recordings that are digitized and subsequently stored. Some of these recordings include skull surface electroencephalograms, which measure the actual electrical activity from neurons firing. Because EEGs measure aggregate electrical activity in the cortex, they have poor spatial resolution; however, they have excellent temporal resolution, with hundreds of measurements per second. Other physiologic recordings measure eye movement (an electro-oculogram), leg movement (electromyogram), oxygen saturation (pulse-oximeter), air flow, chin movement activity (electromyogram), chest and abdominal movement (via belts around the torso), and heart rate and rhythm (electrocardiogram).

A polysomnogram produces an enormous amount of data, as each of these signals is recorded nearly continuously over a night of sleep. The signals are processed by trained technicians under the supervision of sleep physicians. The technicians and sleep physicians distill this deluge of information to more manageable summaries. In clinical settings, these summaries are used to help guide patient care decisions. They also are used in research to investigate the causes and consequences of sleep-related phenomena. Table 1 lists examples of summaries of the polysomnogram, as well as the Epworth Sleepiness Scale—a questionnaire-based assessment of daytime sleep propensity.

Often, a concern in sleep clinics is whether the subject has sleep apnea and, if so, to evaluate the severity of the disease. As previously mentioned, the primary measure of severity of sleep apnea is the number of apneas or hypopneas per hour slept. Another important summary splits the sleep pattern into a few distinct sleep states. This is done visually, by trained and certified technicians and physicians, by grouping the data into 30-second "epochs." The states of interest are labeled Wake, Stage I, Stage II, Stage III, Stage IV, and REM. Stages I–IV are referred to as non-REM sleep. Stages I and II represent light sleeping and encompass 3%–8% and 44%–55% of total sleep time, respectively. Stages III and IV represent deeper sleep and comprise about 15%–20% of the total sleep time. In REM sleep, which comprises approximately 20%–25% of total sleep time, the body is inactive while the brain manifests EEG patterns similar to wakefulness. As described by Sudhansu Chokroverty in Sleep Disorders Medicine: Basic Science, Technical Considerations, and Clinical Aspects, most dreaming occurs in REM sleep.

A patient's "sleep architecture" is simply the person-specific percentage of time spent in each of the six states. Sleep architecture can vary between people and within a person as they age. For example, infants spend more than 80% of their sleeping time in REM. It is generally accepted that sleep staging is relevant for understanding sleep's effect on health. We are particularly interested in the impact of the rate of transitioning between the various sleep states. Note that it is not the case that a person necessarily transitions from wakefulness through Stages I to IV in sequential order, and then to REM. Instead, people pass
through the states in cycles, with transitioning from any state to another both possible and likely. Wakefulness to REM is the transition that occurs the least frequently. In addition to measuring nighttime sleep signals, other behavioral measurements can be valuable in sleep research. One of the more widely used measures is the Epworth Sleepiness Scale. This is an aggregated score of several self-administered questions involving sleep behavior and is used as a measure of daytime sleep propensity. For example, patients are scored on whether they fall asleep when sitting and reading or watching television.
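To make these epoch-based summaries concrete, here is a minimal sketch, not from the article, that computes a sleep architecture (percent of epochs in each state) and counts state transitions from a hypothetical sequence of 30-second epoch labels.

```python
from collections import Counter

# Hypothetical sequence of 30-second epoch labels for part of a night.
epochs = ["Wake", "I", "II", "II", "III", "IV", "II", "REM", "REM", "Wake", "II", "REM"]

# Sleep architecture: percentage of epochs spent in each state.
counts = Counter(epochs)
architecture = {state: 100 * n / len(epochs) for state, n in counts.items()}

# Transitions: consecutive epochs whose labels differ.
transitions = [(a, b) for a, b in zip(epochs, epochs[1:]) if a != b]
hours_observed = len(epochs) * 30 / 3600              # each epoch is 30 seconds
transition_rate = len(transitions) / hours_observed   # transitions per hour observed

print(architecture)
print(Counter(transitions), round(transition_rate, 1))
```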
Sleep Transition Rates

The Sleep Heart Health Study (SHHS) added to and combined data from large, well-established longitudinal studies: the Atherosclerosis Risk in Communities Study, the Cardiovascular Health Study, the Framingham Heart Study, the Strong Heart Study, and the Tucson Health and Environment Study. As described by Stuart F. Quan and colleagues, the SHHS recruited more than 6,000 subjects at enormous expense to undergo an abbreviated in-home polysomnogram. Roughly 4,000 of these subjects repeated this process about four years later. The sleep data were processed by trained SHHS technicians and included rigorous quality checks. The SHHS offers a unique data set to understand sleep and health. However, being an observational study, analysis of the data is often challenging. Any effects or absence of effects seen might be due to subtle biases from the sampling or (measured or unmeasured) variables either unaccounted for or improperly accounted for in the analysis.

Matching

We consider now the relationship between sleep transitions and sleep apnea. Hence, we compare a group of severely diseased patients with sleep apnea—as defined by an RDI greater than 22.3 events per hour—with a group of healthy controls having RDIs of less than 1.33 events per hour. The groups were chosen so each subject in the diseased group had a matching subject in the control group. This process helps control for confounding variables, not unlike methods such as regression analysis. However, matching—unlike regression adjustment—forces a discussion of how alike or unlike the diseased and control groups are. In contrast, regression adjustment will happily plod along via linearity assumptions, even if there is no overlap in the confounding variables for the diseased and control groups. Matching is not without its issues, however. Most notably, the SHHS population being studied is a subset of the population selected originally for study; hence, a matched subset may lack the generalizability of results on the original data. Only a subset of the SHHS was eligible to be matched for our analysis. For example, to adequately define disease status, only those subjects with outstanding sleep recording quality and without any history of coronary heart disease, cardiovascular disease, hypertension, chronic obstructive pulmonary disease, asthma, or stroke were eligible. In addition, current smokers were not considered. These rigid qualification standards narrowed the original SHHS pool from more than 6,000 to 183 diseased and 458 nondiseased subjects. These groups are not representative of the population, as conditions
Figure 1. Hypnogram (sleep state by time in hours) for a single subject. The left plot shows the full night, whereas the right plot highlights sleep between the second and third hours.
Figure 2. Transition plots for the 60 diseased (apneic) and matched control subjects. Grayscale points represent transitions where each gray scale corresponds to a different type of transition. The key is such that N represents non-REM, R represents REM, and W represents wake. Hence, NR represents non-REM to REM, RW represents REM to wakefulness, and so on.
such as hypertension and cardiovascular disease commonly occur with sleep apnea. Hence, our ‘diseased’ group is quite healthy in many aspects, excepting the high index of sleep apnea disease severity. The matching variables included body mass index (BMI, the ratio of a subject’s weight to the square of their height), age, race, and sex. Exact matching was used for race and
sex, whereas BMI and age were matched within a caliper (i.e., matched within an acceptable range). The matching procedure produced 60 pairs. Apart from BMI, none of the variables were significantly different (using Student's t test and chi-square tests at the 5% level) when comparing the two groups. Although concern exists about the differential body mass indexes, we note that obesity is the primary cause
Figure 3. Mean/difference plot of log base two of transition rates with diseased minus matched controls on the vertical axis and pairwise average log base two transition rates on the horizontal axis.
of sleep apnea, and both the groups—though having statistically different BMIs—were similar practically. Specifically, the average body mass index for the diseased group was 30.7 kg/m2, whereas it was 29.2 kg/m2 for the nondiseased. Of note is that traditional sleep architecture—though not considered as a matching parameter—was not statistically different between groups. This implies that this gold standard measurement summary of sleep states may not be affected by sleep apnea. Figure 1 displays the time series for the six sleep states for a single subject; such a plot is called a "hypnogram." Figure 2 displays a plot comparing the 60 diseased and 60 matched control subjects. In this plot, each grayscale point corresponds to a different transition type, the horizontal index is time in hours, and the vertical index is subject. This plot simultaneously displays the information of many hypnograms. As discussed in the Journal of Clinical Sleep Medicine by Bruce Swihart and his colleagues, this plot and variations highlight the higher rate of transitioning occurring in the diseased group.

Analysis of Sleep Transition Rates

We restrict ourselves to studying the transitions starting at the first transition from wakefulness to sleep (usually to Stage 1) and ending at the last transition from a sleep state to wakefulness. That is, we discard time before sleep onset and after waking. We note that the initial time in bed before sleep onset (sleep latency) between the matched pairs showed no difference (Student's t-test p-value of 0.70). Figure 3 shows mean/difference plots for the log base two of the transition rates for the matched pairs. By a transition, we mean a change from any sleep state to another; hence, the transition rate is the number of changes divided by the total time asleep in hours. Such plots highlight whether there is a difference between the two matched groups, whereas plotting
against the average highlights whether any such difference is dependent on the magnitude of the transition rates. Log base two is used simply to work with ratios and because powers of two are easier to work with than powers of Euler's number, e, and represent smaller increments than the other option of using base 10. Recall that the transition rate is defined as the number of transitions per hour of sleep. We note that a reasonable discussion could be had on which of the two measures, the transition rate or the raw number of transitions, is more important. It may be that a certain raw number of transitions is important for health, regardless of the rate. However, clearly a person who sleeps longer has more opportunities to transition between states, suggesting the use of rates. Regardless, we focus on only the analysis of the rates. Further analysis of the rates and transition numbers is presented in the paper by Swihart.

In Figure 3, 39 of the 60 observations lie above the horizontal line, potentially indicating that diseased subjects transition more frequently than nondiseased. For example, under a null hypothesis of no difference in transition rates between the diseased and nondiseased groups, the binomial probability of 39 or more pairs out of 60 lying above the horizontal line is only 0.014 (the two-sided p-value would double this number).

A useful summary of each subject's data would be a three-by-three table that displays counts of their previous sleep state by their current sleep state. Table 2 displays the combination of such summary tables across subjects, with the non-REM sleeping states (Stages I–IV) aggregated. Shown are counts of the previous state by the current state, cross-classified by disease status. Transition counts occur in the off-diagonal cells, with the diagonal cells representing instances in which the subjects stayed in the same state from one epoch to the next. For example, in the diseased subjects, there were 346 transitions from non-REM (N) to REM (R), whereas there were 175 transitions from wake (W) to REM. Column totals in these data are special, representing the time at risk for various kinds of transitions. Therefore, of the 346 transitions of type N → R, there were 281.78 total hours spent in non-REM where this type of transition could be made. The most frequent transition in both groups is W → N, with rates of 1,733/66.56 = 26.0 and 1,376/60.54 = 22.7 transitions per hour, respectively. The next most frequent transitions are N → W and R → W. The data paint the picture that a person, diseased or not, spends the majority of their time in the non-REM state. From there, they often transition to REM, but then spend a little time in REM before transitioning to wakefulness (more likely) or back to non-REM (less likely). In addition, from non-REM, they often wake up briefly, then transition (more likely) back to non-REM.

It is of interest to compare whether these rates differ across disease groups. This is a somewhat challenging task, given that one must account for the correlation induced by matching and the correlation of the various rates within a particular subject. For example, in a subject with a high rate of non-REM to wake transitions, it is reasonable to believe there would be a correspondingly high rate of wake to non-REM transitions; hence, these two rates would be correlated.
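The sign-test style probability quoted above for Figure 3 (39 of 60 matched pairs above the zero line) is easy to check. A minimal sketch, assuming a fair-coin null for each matched pair:

```python
from scipy.stats import binom

# Under the null of no difference within a matched pair, each pair is equally
# likely to fall above or below the zero line in Figure 3.
p_one_sided = binom.sf(38, 60, 0.5)   # P(X >= 39) for X ~ Binomial(60, 0.5)
print(round(p_one_sided, 3))          # approximately 0.014; the two-sided p-value doubles this
```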
We fit a model that assumes a constant rate of transitioning over the time at risk for transitioning, a so-called exponential hazard model, accounting for these kinds of correlations.
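The model just described is a Bayesian exponential hazard model fit across transition types and matched pairs, and reproducing it is beyond this article. As a rough, simplified stand-in for a single transition type, a Poisson regression with a log time-at-risk offset estimates a constant rate per hour and a diseased-versus-control rate ratio. All data values below are invented, and the sketch ignores the matching and within-subject correlation the authors account for.

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical per-subject counts of one transition type (say, non-REM -> wake)
# and hours at risk, for diseased (1) and control (0) subjects.
counts = np.array([12, 15, 9, 20, 8, 10, 6, 7])
hours = np.array([5.5, 6.0, 4.8, 6.2, 5.9, 6.1, 5.2, 5.7])
diseased = np.array([1, 1, 1, 1, 0, 0, 0, 0])

# Constant-rate model: log(rate) = intercept + beta * diseased, with
# log(hours at risk) entering as an offset.
X = sm.add_constant(diseased)
fit = sm.GLM(counts, X, family=sm.families.Poisson(), offset=np.log(hours)).fit()

print(np.exp(fit.params[1]))      # estimated relative rate, diseased vs. control
print(np.exp(fit.conf_int()[1]))  # 95% confidence interval for the relative rate
```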
Table 2—Disease and Control Counts of Transitions

                    Previous State (Disease)      Previous State (Controls)
Current State       N        R       W            N        R       W
Non-REM (N)         31,880   160     1,733        32,592   134     1,376
REM (R)             346      7,609   175          351      8,784   114
Wake (W)            1,588    358     6,079        1,210    324     5,775
Total epochs        33,814   8,127   7,987        34,153   9,242   7,265
Total in hours      281.78   67.73   66.56        284.61   77.02   60.54

Note: The columns denote the previous sleep state, whereas the rows denote the current. The times in state are measured in 30-second "epochs." The column totals, which are counts of epochs, are converted into hours for convenience.
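As a quick arithmetic check on the rates quoted in the text, the counts and hours in Table 2 give the following (a sketch; the epoch-to-hour conversion follows the table's note):

```python
# Counts of W -> N transitions (previous state W, current state non-REM) and the
# hours at risk (total time spent in the previous state, W), read from Table 2.
disease_count, disease_hours = 1733, 66.56
control_count, control_hours = 1376, 60.54

disease_rate = disease_count / disease_hours   # about 26.0 transitions per hour
control_rate = control_count / control_hours   # about 22.7 transitions per hour
print(round(disease_rate, 1), round(control_rate, 1))

# Hours at risk come from epoch counts: 7,987 and 7,265 thirty-second epochs in W.
print(round(7987 * 30 / 3600, 2), round(7265 * 30 / 3600, 2))  # 66.56, 60.54
```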
Table 3—Relative Rates Comparing Diseased (Numerator) to Nondiseased (Denominator) Subjects

Transition Type    Relative Rate    Interval
N → R              1.04             [0.84, 1.29]
N → W              1.39             [1.17, 1.65]
R → N              1.46             [1.11, 1.93]
R → W              1.34             [1.08, 1.67]
W → N              1.04             [0.87, 1.22]
W → R              1.27             [0.95, 1.68]
Table 3 shows the estimated relative rates (i.e., the estimated rate of transitioning for the diseased divided by that of the controls) and 95% credible interval. A credible interval is a Bayesian analog to the confidence interval; readers unfamiliar with Bayesian analysis can simply think of them as confidence intervals. The data suggest that the rate of transitioning from non-REM to wakefulness, REM to non-REM, and REM to wakefulness all differ between the two groups. Notably, all the estimates represent increases in the transition rates for the diseased subjects. This suggests that the disruption of sleep continuity in response to the airway collapses during sleep may cause increased transitions between wakefulness and the other states.
Discussion

We briefly reviewed an area of observational sleep research with a particular emphasis on analyzing sleep transitions and their relationship with sleep apnea. The analysis showed some potential differences between the diseased and nondiseased groups with respect to the amount of transitioning, though significant work remains to be done to fully understand this problem. It is especially important to consider how the transition rates differ over the night, relaxing the assumptions of the exponential model we presumed; improving the matching
algorithm; and applying the methods to the large, unmatched data using other adjustments. Hence, the study presented herein represents a small snippet of understanding how sleep is influenced by a specific disease. We emphasize, however, that the overarching focus of our research is to better exploit the full information contained in polysomnograms from large-scale observational studies of sleep. This includes functional data analysis of the sleep state and EEG signals. We believe there is important information omitted by considering only the standard epidemiological summaries, which in most cases were designed as simple clinical indexes and may be improved upon for research purposes. This point is driven home by the example provided here, where several relevant differences in sleep transition behavior were illustrated between a diseased and nondiseased group while the standard index of sleep staging showed none.
Further Reading

The book by Chokroverty contains an excellent summary of sleep medicine. The manuscripts by Swihart et al. provide introductions to the display and analysis of sleep transitions. The article by Young et al. gives an overview of sleep disordered breathing.

Chokroverty, S. (1999) Sleep Disorders Medicine: Basic Science, Technical Considerations, and Clinical Aspects. Butterworth-Heinemann: Boston, MA.
Quan, S.; Howard, B.; Iber, C.; Kiley, J.; Nieto, F.; O'Connor, G.; Rapoport, D.; Redline, S.; Robbins, J.; Samet, J.; et al. (1997) "The Sleep Heart Health Study: Design, Rationale, and Methods." Sleep, 20(12):1077–85.
Swihart, B.; Caffo, B.; Bandeen-Roche, K.; and Punjabi, N. (2008) "Quantitative Characterization of Sleep Architecture Using Multi-State and Log-Linear Models." Journal of Clinical Sleep Medicine, 4(4):349–355.
Swihart, B.; Caffo, B.; Strand, M.; and Punjabi, N. (2007) "Novel Methods in the Visualization of Transitional Phenomena." Johns Hopkins University, Dept. of Biostatistics Working Papers.
Young, T.; Peppard, P.; and Gottlieb, D. (2002) "Epidemiology of Obstructive Sleep Apnea. A Population Health Perspective." American Journal of Respiratory and Critical Care Medicine, 165(9):1217–1239.
Statistical Modeling of Sleep James E. Slaven, Michael E. Andrew, Anna Mnatsakanova, John M. Violanti, Cecil M. Burchfiel, and Bryan J. Vila
Between 50 million and 70 million Americans have some form of chronic sleep disorder that makes them more vulnerable to accidents and disease, degrading their quality of life. It has been shown that quality of sleep affects a person's health and sense of well-being. Poor sleep can have a negative effect on mental and physical characteristics, as well as on social interaction. Hence, the impact of sleep disorders and lack of sleep can be detrimental to job performance. Police officers are especially likely to suffer from sleep disorders, as they get too little sleep because of work schedules that often involve night-shift work, rotating shifts, and overtime, as well as part-time secondary jobs. This can put the officers and the communities they serve and protect at great risk.

In a groundbreaking study, the State University of New York at Buffalo (SUNYAB) and the National Institute for Occupational Safety and Health (NIOSH), with additional funding from the National Institute of Justice (NIJ), are studying the effect of stress on the health of police officers, with the quality and quantity of sleep being part of a comprehensive investigation. Approximately 500 Buffalo, New York, municipal police officers are taking part in the study, which was approved by human subject review boards at both NIOSH and SUNYAB. For the sleep portion of the study, participants wear Motionlogger Actigraphs for 15 days, with the actigraph being removed only to protect from water damage. The actigraphs record
data every minute during those 15 days (Figure 1), giving an extremely large data set to analyze for each participant. According to the American Academy of Sleep Medicine, one of the best tools for studying sleep outside the lab is the wrist actigraph. Actigraphy is the method of using accelerometers—instruments that detect when a person moves—to determine sleep/wake patterns. The actigraph records information about movement and then uses predetermined algorithms to determine if the wearer was awake or asleep at any given time during the day. Actigraphy corresponds well with polysomnography—which must be done in a sleep lab—with the added benefit that the method can record information outside a sleep lab for several days in a row, allowing for less invasive data acquisition. Actigraphs have been used in many sleep studies, including research on physical activity, obesity, and disease associations. Unfortunately, actigraph data require painstaking analyses due to the amount of information collected during constant data recording and the number of possible derived parameters. Actigraphy analyses can range from simple sleep statistics (e.g., total sleep time and sleep efficiency) to more advanced statistical techniques, including structural equation modeling (to determine how groups of sleep variables cluster) and waveform analysis (to find peaks of activity). We have developed several statistical methods to help analyze these large data sets.
Figure 1. First several days of an actigraphy reading. The actigraph’s PIM channel (top) measures the magnitude of acceleration. The life channel (bottom) records micro-vibrations, such as those caused by the wearer’s respiration, for quality control.
Figure 2. Poor-quality data caused by data corruption or noncompliance. Note how the total, or near total, lack of signal on the life channel during some periods differs from a normal actigraph reading in Figure 1.
Statistical Methods
K-statistic

The first method developed to improve the efficiency of actigraphy analysis is a test of the quality of the actigraph data. Data quality can be poor for two reasons: data corruption and participant noncompliance. Data can become corrupted due to a malfunction of the actigraph or during transfer into a computer, giving large portions of time with no data or data that have been truncated. Noncompliance most often results from participants failing to wear the actigraph when they are supposed to. This results in zero readings over many time periods. Both of these issues present problems for data analysis, as simple statistics can be either completely incorrect, as in the case of truncated data, or biased, from noncompliance (Figure 2). We have developed a statistic (the K-statistic) to determine whether the quality of each officer's actigraph data is good enough to use in the analyses. This method looks at the average amplitude of consecutive time points:

$$AA = \frac{1}{n-1} \sum_{i=1}^{n-1} \frac{x_{i+1} + x_i}{2}$$

and the average distance between consecutive time points:

$$AD = \frac{1}{n-1} \sum_{i=1}^{n-1} \frac{\left| x_{i+1} - x_i \right|}{2}$$

of the data to detect whether the overall readings are high enough to provide statistically useful information. The K-statistic is then given as:

$$K = \sqrt{AA^2 + AD^2},$$

where AA and AD are computed from the running mean and the running variance of the data, in either order. If too large a portion of the data is zero or truncated, or if the overall readings vary too little at small intervals, then data quality may be too poor for use in data analysis. Either of these types of data corruption can produce long periods with the same reading, something that is not possible, even when a subject is at rest, because of basic bodily movements caused by respiration.
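A minimal sketch of the K-statistic computation as reconstructed above. It applies AA, AD, and K to a single series of readings; the article's use of running summaries and its acceptance threshold are not reproduced here, and the simulated counts are purely illustrative.

```python
import numpy as np

def k_statistic(x):
    """AA, AD, and K for a series of actigraph readings (a reconstruction sketch,
    not the authors' code)."""
    x = np.asarray(x, dtype=float)
    aa = np.mean((x[1:] + x[:-1]) / 2.0)        # average amplitude of consecutive points
    ad = np.mean(np.abs(x[1:] - x[:-1]) / 2.0)  # average distance between consecutive points
    return np.sqrt(aa**2 + ad**2)

# A record with normal minute-to-minute variation scores much higher than a
# flat record produced by corruption or by not wearing the device.
rng = np.random.default_rng(0)
normal_day = rng.poisson(150, size=1440)   # hypothetical one day of PIM counts
flat_day = np.zeros(1440)                  # device not worn: all zeros
print(round(k_statistic(normal_day), 1), k_statistic(flat_day))
```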
Nonlinear Dynamics and Dimensionality

Nonlinear dynamics play a part in many biological functions, such as heart rhythms and brainwaves. We used nonlinear analysis to compare participants who exhibit good sleep patterns with those who exhibit poor sleep patterns, according to basic sleep parameters provided by Maurice Ohayon and his colleagues in their Sleep article, "Meta-Analysis of Quantitative Sleep Parameters from Childhood to Old Age in Healthy Individuals: Developing Normative Sleep Values Across the Human Lifespan." The actigraphy data from these two groups have significantly different fractal dimensions. Participants with poor sleep patterns exhibit much more movement during sleep than those with good sleep patterns. This movement can be caused by sleep disorders—such as insomnia, apnea, and restless leg syndrome—or by other sources of discomfort. While specific sleep disorders cannot be determined with actigraphy, the movement and resultant poor sleep can be uncovered with the device. This means that analysis of nonlinear dynamics can be used to differentiate between participants' sleep quality (good versus poor) and determine the extent of an individual's poor-quality sleep.

Waiting Time Distributions of Sleep

One of the basic parameters used by medical professionals to evaluate sleep is the wake-to-sleep ratio. As this value is averaged across the entire night of sleep, it does not give an accurate description of the total time a participant spends waiting to fall asleep. We have developed a method for characterizing a participant's total distribution of waiting times. Rather than using an overall average, we analyzed actigraph data to calculate the length of time from sleep to wakefulness for every awakening. This gave a better picture of what was happening during sleep, from the number of awakenings per night to the total number of minutes of sleep before each awakening.

Figure 3. Waiting time distributions of several participants (number of awakenings by time in sleep, in minutes)
Differences in waiting time distributions were shown to be significant between participants with good sleep patterns and those with poor sleep patterns. These distributions were also quite skewed (Figure 3), indicating that the use of the mean waiting time may be unnecessarily inaccurate. The median would be a better parameter to use when the distribution is asymmetric, and our analyses indicate the main differences between participants with good-quality sleep and those with poor-quality sleep occur in the upper end of the distributions, at the 75th and 90th percentiles.
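Here is a minimal sketch, with hypothetical data rather than the authors' code, of building a waiting-time distribution from a minute-by-minute sleep/wake series: collect the length of each run of sleep that precedes an awakening, then summarize the upper percentiles rather than the mean.

```python
import numpy as np
from itertools import groupby

# Hypothetical minute-by-minute scoring for one night: 1 = asleep, 0 = awake.
rng = np.random.default_rng(1)
night = rng.choice([1, 0], size=480, p=[0.9, 0.1])

# Lengths of consecutive runs of sleep, i.e., minutes of sleep before each
# awakening (the final run is kept here even if the night ends asleep).
sleep_runs = [len(list(g)) for value, g in groupby(night) if value == 1]

# For a skewed distribution, the median and upper percentiles are more
# informative than the mean.
print(len(sleep_runs), np.median(sleep_runs), np.percentile(sleep_runs, [75, 90]))
```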
Advanced Sleep Variables

Figure 4. Clustering of study participants for good and poor sleep as determined by additional sleep variables, using canonical correlation analysis (N=228). Axes are the first and second canonical coefficients.
While the basic sleep variables (total sleep time, sleep efficiency, and sleep-to-wake onset) are typically used in classifying a study participant as having good or poor sleep patterns, the use of advanced variables may be of value. Actigraphy provides the opportunity to derive a large number of statistics from sleep/wake data, including sinusoid messor, sinusoid amplitude, maximum daily autocorrelation, time off 24-hour rhythm, number of awakenings during sleep, activity during sleep, sleep ratio, and wake-within-sleep percent. These variables offer additional mathematical and statistical information on sleep quality. We have performed cluster and discriminant analysis to show that these additional variables can perform exceptionally well in differentiating between good and poor-quality sleep. These additional variables had a low classification error rate of approximately 10% between the two sleep qualities (Figure 4). As can be seen from Figure 4, the first correlation coefficient is sufficient in differentiating the sleep qualities, with more than 90% of the variance explained. The first two coefficients describe nearly 99% of the variation in the data.

Although these additional variables may lengthen the time required to analyze sleep data, they can add considerable information. Waveform/cosinor analysis gives parameters that allow a sinusoid to be fit to the data, which has been used extensively as a mathematical model for sleep. Autocorrelation coefficients give information about how the quality of a participant's sleep pattern varies across days, as well as on the participant's circadian rhythm. Average awakenings, activity during sleep, and length of wakefulness give additional information about in-bed activity during sleep. The use of these additional sleep variables will give research investigators more tools to test for differences in sleep quality between groups and to better characterize and parameterize sleep.
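To make the classification step concrete, here is a sketch using linear discriminant analysis on simulated data. The sample size matches Figure 4, but the feature values are invented and the groups are deliberately well separated, so the error here will be smaller than the roughly 10% reported for the study data.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
n = 228  # same sample size as Figure 4, but the values below are simulated

# Columns stand in for derived variables such as sinusoid amplitude, daily
# autocorrelation, activity during sleep, and wake-within-sleep percent.
good = rng.normal(loc=[1.0, 0.6, 0.2, 5.0], scale=0.5, size=(n // 2, 4))
poor = rng.normal(loc=[0.6, 0.4, 0.6, 12.0], scale=0.5, size=(n // 2, 4))
X = np.vstack([good, poor])
y = np.repeat([0, 1], n // 2)

acc = cross_val_score(LinearDiscriminantAnalysis(), X, y, cv=5).mean()
print(f"cross-validated classification error: {1 - acc:.2f}")
```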
Structural Equation Modeling

Figure 5. Structural equation model with standardized path (regression) coefficients between the factor solution and sleep variables and the correlations between factors
Due to the many possible variables that can be derived from actigraphy, it may be necessary to reduce the variable set to a smaller subset, which still provides unique and meaningful information. Structural equation modeling (SEM) can be used as a dimension-reducing analytic procedure where a large number of observed variables are reduced into smaller sets containing highly correlated variables that describe the same underlying construct. Structural equation models consist of two parts: a measurement model and a structural model. The measurement model describes the relationship between measured and latent variables (the subsets to be discovered). The structural model deals with the relationships between those latent variables.
SEM is also useful in that it gives regression coefficients and correlation values between factors. Our initial variable set was total sleep time, sleep-to-wake onset (the average time it took to awaken after falling asleep), wake-within-sleep percent, mean activity during sleep (as measured in volts by the accelerometer), sleep efficiency (the proportion of time spent in actual sleep while in bed), sinusoidal messor (the wavelength of a fitted sinusoid from cosinor analysis), sinusoidal amplitude (the height of a fitted sinusoid from cosinor analysis), daily autocorrelation (correlation coefficient derived from each participant’s sleep/wake cycle), and the time off of a 24-hour sleep/wake rhythm. After initial analysis, the last two variables in the set were found to not be statistically significant. Without them, our final model had excellent fit statistics, which enabled us to group the variables into corresponding latent factors. The variables total sleep time and sleep-to-wake onset grouped together into a latent factor we called “sleep time.” Wake-within-sleep percent, mean activity during sleep, and sleep efficiency grouped together into a factor we named “during sleep activity.” The sinusoidal messor and sinusoidal amplitude parameters grouped into a factor we called “circadian rhythm” (Figure 5). SEM is an excellent method for reducing the number of variables needed to represent sleep/wake patterns. It can help researchers choose the proper variables to analyze, depending on the hypothesis being tested, in order to make their data analysis more efficient.
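The authors fit a full structural equation model (measurement plus structural parts). As a simplified stand-in for the measurement idea, grouping correlated observed variables under a few latent factors, here is a factor-analysis sketch on simulated data. The variable names only echo the article's list; nothing below reproduces the authors' model, software, or estimates.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(3)
n = 228

# Simulate three latent factors ("sleep time", "during sleep activity",
# "circadian rhythm") and seven observed variables loading on them.
latent = rng.normal(size=(n, 3))
loadings = np.array([
    [0.9, 0.0, 0.0],   # total sleep time
    [0.8, 0.0, 0.0],   # sleep-to-wake onset
    [0.0, 0.9, 0.0],   # wake-within-sleep percent
    [0.0, 0.8, 0.0],   # mean activity during sleep
    [0.0, 0.7, 0.0],   # sleep efficiency
    [0.0, 0.0, 0.9],   # sinusoidal messor
    [0.0, 0.0, 0.8],   # sinusoidal amplitude
])
X = latent @ loadings.T + 0.3 * rng.normal(size=(n, 7))

fa = FactorAnalysis(n_components=3, random_state=0).fit(X)
print(np.round(fa.components_, 2))   # estimated loadings; rows are factors
```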
Discussion

As a consequence of this work, we expect to be able to analyze the actigraphy data of the BCOPS study with more accuracy and with more tools to find differences not only in sleep quality, but also in the corresponding disease association analyses. Using the K-statistic makes determining the quality of data sets easier and faster, as opposed to manually viewing each data set. SEM is capable of reducing the set of possible variables by grouping them and allowing investigators to select the ones best suited for that particular study. Nonlinear analysis, waiting time distributions, and the ability to use many nonstandard variables give research investigators more tools to identify differences in participants' sleep quality and in study populations. They also give researchers the ability to conduct more in-depth statistical and mathematical analysis. Ultimately, our work should help make wrist actigraphy more accurate and less expensive for research investigators and physicians who study and treat the millions of workers around the United States who suffer from sleep disorders. The findings and conclusions in this report are those of the authors and do not necessarily represent the views of the National Institute for Occupational Safety and Health.
Further Reading

The National Academies. (2006) "U.S. Lacks Adequate Capacity to Treat People with Sleep Disorders; Efforts Needed to Boost Sleep Research, Treatment, and Public Awareness." The National Academies: Washington, DC. http://www8.nationalacademies.org/onpinews/newsitem.aspx?RecordID=11617.
Ohayon, M.; Carskadon, M.; Guilleminault, C.; Vitiello, M. (2004) "Meta-Analysis of Quantitative Sleep Parameters from Childhood to Old Age in Healthy Individuals: Developing Normative Sleep Values Across the Human Lifespan." Sleep, 27:1255–1273.
Slaven, J.E.; Andrew, M.E.; Violanti, J.M.; Burchfiel, C.M.; Vila, B.J. (2006) "A Statistical Test to Determine Quality of Accelerometer Data." Physiol Meas, 27:413–423.
Slaven, J.E.; Andrew, M.E.; Violanti, J.M.; Burchfiel, C.M.; Mnatsakanova, A.; Vila, B.J. (2008) "Dimensional Analysis of Actigraph Derived Sleep Data." Nonlinear Dynamics, Psychology, and Life Sciences, 12(2):153–161.
Slaven, J.E.; Mnatsakanova, A.; Burchfiel, C.M.; Li, S.; Violanti, J.M.; Vila, B.J.; Andrew, M.E. (2008) "Waiting Time Distributions of Actigraphy Data." The Open Sleep Journal, 1:1–5.
Slaven, J.E.; Andrew, M.E.; Violanti, J.M.; Burchfiel, C.M.; Vila, B.J. (2008) "Factor Analysis and Structural Equation Modeling of Actigraphy Derived Sleep Variables." The Open Sleep Journal, 1:6–10.
Vila, B.J. (2006) "Impact of Long Work Hours on Police Officers and the Communities They Serve." Am J Ind Med, 49:972–980.
Disability: It’s Complicated Michele Connolly
It can happen in an instant. A soldier in Iraq loses a leg from a roadside bomb. A baby is born with Down syndrome. An elderly woman loses the ability to speak due to a stroke. Or it can happen more insidiously. A diabetic with nerve damage can gradually become blind. A man with Alzheimer's disease may lose the ability to care for himself. All these individuals have a disability. Disabilities vary dramatically and can affect people at any time and at any age. An individual's disability also may change over time. Someone with multiple sclerosis may have a deteriorating condition, but may regain function with rehabilitation.

Disability cannot be directly measured. There is no blood test, medical procedure, or functional test that absolutely measures disability. Disability is a subjective construct used to measure the impact of very real, but disparate, events. Just as the status of an individual changes over time, overall disability levels and trends change as new interventions are discovered, new types of disability emerge, existing types of disability change, and the population ages. The changes are reflected in disability programs, policy, and the definitions we use.

There are dozens of federal disability programs, each of which has its own unique purpose. In 2002, according to Nanette Goodman and David Stapleton in their article, "Federal Program Expenditures for Working-Age People with Disabilities" that was published in the Journal of Disability Policy Studies, 11.3% of all federal outlays amounting to $226 billion were spent on disability programs, just for the working-age population. During that time, it was estimated that states spent an additional $50 billion for joint federal-state programs. Disability data must be collected to address program and policy issues. But disability measures are challenging.

What Is Disability?
In general, disability is defined as a limitation or inability to perform usual societal roles due to a medical condition or impairment. Societal roles include growing, developing, and learning for people under age 18, working for working-age adults (ages 18–64), and living independently for the elderly (ages 65 and older). In addition, usual activities include recreation and interaction with family, friends, and neighbors. Usual activities vary by individual circumstances and age. For example, full-time college students are in their working years (18–64), but may not be ready to join the work force until graduation. The term "elderly" is often defined as age 65 and older, yet many people retire later. In addition, growing, developing, learning, working, and living independently are all general terms open to interpretation.

Not all people with medical conditions have a disability. According to 2007 estimates by the Centers for Disease Control and Prevention, 10.7% of the population aged 20 and older (23.5 million Americans) had diagnosed or undiagnosed diabetes. Yet, in December of 2002, about 237,000 (just 4.0%) of disabled workers received Social Security Disability Insurance (SSDI) program benefits as the result of "endocrine, nutritional, and metabolic disorders," a category in which diabetes is the major (but not the only) condition.

Disability can be permanent or temporary. Those who use crutches for a broken leg or who are recovering from knee replacement surgery may not be considered to have a disability by most federal programs because their condition is temporary. However, for purposes of accessibility mandated by the Americans with Disabilities Act (ADA), they are considered to have a disability because they need ramps or curb cuts to get around during recovery. Periodicity also can complicate disability definition. Disability can be ever present (e.g., blindness), episodic (e.g., cancer and mental illness), or somewhere in between (e.g., people with arthritis who have good and bad days). Disabling conditions can often be successfully treated and corrected. For example, congenital heart defects in infants can be resolved by surgical intervention. Disabilities also can disappear as situations (and thus definitions) change. School-age children with dyslexia may be regarded as having a learning disability, which may entitle them to special education (a disability program). But after they leave school, they may not continue to be considered as having a disability.

Severity is also a consideration. Some measures contain implicit severity indicators. For example, a person may be
asked if she has difficulty climbing a flight of stairs. If she answers yes, she is asked if she is able to climb stairs at all. In this case, there are three levels of severity: able to climb stairs, limited in the ability to climb stairs, or unable to climb stairs at all. Some tasks are so basic that the inability to perform them is considered more severe. A person who reports he is limited in performing one of the activities of daily living, such as going to the bathroom, may be considered to have a more severe disability than someone who is limited in climbing a flight of stairs. Disabilities can be “visible,” as for people who use wheelchairs, scooters, or seeing-eye dogs, or “invisible,” as for those with mental illness or limited physical endurance. Technological advances, such as prostheses, can give functioning back to a person who has lost a limb due to cancer, combat, or an accident. Whether these individuals have a disability depends on the situation. No description of disability is complete without addressing ability. We all have many abilities—even if we have a disability. Abilities can be task-specific, such as lifting a bag of groceries, or more general, such as the ability to work. Many people with disabilities are able to work. President Franklin D. Roosevelt was confined to a wheelchair due to polio. One of the world’s most brilliant theoretical physicists is Stephen Hawking, a man who continues to publish and lecture while almost completely
paralyzed from amyotrophic lateral sclerosis (ALS, also known as Lou Gehrig’s disease).
Federal Disability Definitions and Programs

Altogether, the federal government employs a staggering 67 definitions of disability, which can be pared down to 41 after overlaps are accounted for. Besides federal definitions, hundreds more disability definitions exist for state, local, and tribal governments. As most are either derived from or similar to federal definitions, our focus is on federal programs. Federal disability definitions are not written in stone; they are rooted in congressional legislation and executive branch rules and regulations. Changes to federal disability definitions result from new legislation, regulations, reauthorizations, and court decisions. These and other definition changes affect estimates of disability prevalence rates and trends (see History of Federal Disability Programs). Most recently—on September 25, 2008—President George W. Bush signed the ADA Amendments Act (ADA-AA), which broadened and clarified the interpretation of the definition of disability that had become narrowed due to court decisions.
History of Federal Disability Programs Disability programs are as old as this country. The first disability program (and definition) was enacted by the Continental Congress on August 26, 1776, to provide compensation for “every officer, soldier, or sailor losing a limb in any engagement or being so disabled in the service of the United States so as to render him incapable of earning a livelihood.” Federal disability programs in the United States are historically rooted in veterans’ disability programs, which dominated federal disability for most of our history. As times changed, so did veterans’ disability programs and their impact on society, including the formation of disability programs for the nonveteran population. During the Civil War, an estimated 360,222 soldiers died on the Union side and 281,881 were wounded, many of whom were amputees. The best estimate of Confederate dead is 258,000 (Drew Gilpin Faust in The Republic of Suffering). No figures are available for the number of wounded Confederate soldiers. The huge number of disabled soldiers (called invalids) and dependent widows and orphans called for an extensive Civil War pension system. The Civil War pension definition of disability, similar to that of the American Revolution, was dependent on the ability to work. Disability pensions were given to “… any person who served in the military or naval service, received an honorable discharge, and who was wounded in battle or in the line of duty and is not unfit for manual labor by reason thereof, or who from disease or other causes incurred in the line of duty.” Pension benefits for the massive number of surviving dependents (i.e., widows) represented the first
large-scale social program in this country. This may have served as the precedent for dependent coverage in today's Social Security and other programs. Approximately 204,000 wounded veterans came home from World War I. Veterans' disability compensation was expanded and modernized to "… establish courses for rehabilitation and vocational training for veterans with dismemberment, sight, hearing, and other permanent disabilities." The focus shifted from providing disability benefits to those who were incapable of work to providing services to help veterans with disabilities return to work. This policy shift reached over into the civilian population, when, by 1920, the Basic Vocational Rehabilitation Services program was established to help people with disabilities (not just veterans) attain gainful employment. The 1944 Servicemen's Readjustment Act, known as the GI Bill, was enacted to provide returning veterans from World War II (including those with disabilities) a college or vocational education. Benefits included educational costs, a stipend, one year of unemployment compensation, and home and business loans. A striking social change occurred after World War II at the University of Illinois at Urbana-Champaign, where returning veterans with disabilities successfully obtained a college education. Some 30 years before the enactment of the Individuals with Disabilities Education Act and about 50 years before the ADA, these veterans showed the importance of architectural changes (accommodations) and personal assistance.
Functional Disability Measures in the NHIS-D

Functional disability measures were the most complex, as many body systems are involved, but they are also the most widely accepted and often the most useful for policy and program purposes. Functional measures included the following:

• Limitations in or the inability to perform a litany of physical activities (e.g., walking, lifting 10 pounds, reaching)
• Serious sensory impairments (e.g., inability to read newsprint, even with glasses or contact lenses; hearing and speaking impairments)
• Mental impairments (e.g., frequent depression or anxiety, frequent confusion, disorientation, or difficulty remembering) that seriously interfered with life during the past year
• Long-term care needs (e.g., needing the help of another person or special equipment for basic activities of daily living (bathing, dressing, going to the bathroom) and instrumental activities of daily living (going outside, managing money and/or medication))
• Use of selected assistive devices (e.g., scooters, wheelchairs, Braille)
• Developmental delays for children identified by a physician (e.g., physical, learning)
• Inability to perform age-appropriate activities for children under age 5 (e.g., sitting up, walking by age 3)
Questions about mental impairments were difficult to develop due to the stigma of mental illness. The NHIS-D based the question series on an earlier supplement on mental illness in conjunction with the cognitive questionnaire lab. It was found that four approaches needed to be used: symptoms (e.g., frequent depression, anxiety, confusion, disorientation, difficulty remembering, getting along with others), a diagnosis, use of prescription psychotropic drugs, and use of community mental health services. For example, some respondents would report a diagnosis of schizophrenia, but not report any symptoms, prescription drugs, or use of services. Other individuals would report use of psychotropic drugs for schizophrenia, but not report any symptoms, diagnosis, or use of services. The final question designed to determine disability was whether a person reported that his or her mental illness seriously interfered with his or her life during the past year. A major flaw in the mental disability measures was the lack of a question on psychosis. One was proposed: "… Do you see things other people don't see or hear things other people don't hear?" The question did not work. Non-psychotic respondents answered yes in the cognitive lab, explaining that they were color-blind, had better than 20/20 vision, or had excellent hearing.
The many and varied federal disability programs suggest it is highly unlikely that there will ever be a single federal definition of disability. Two federal disability programs illustrate the complexity we face in defining disability by specific criteria—IDEA, the Individuals with Disabilities Education Act, and the Social Security Administration's SSDI program. Besides disability, both programs employ a number of other factors in their eligibility criteria.
Measuring Disability Through Surveys

National population-based surveys are the best source of overall disability rates, profiles, and trends. Surveys collect data on a rich variety of other sociodemographic and economic characteristics so it is possible to understand the lives of people with disabilities. These data can be used to understand policy issues that cannot be examined using the often limited information in administrative program records. For example, while the Social Security Administration has data on those who receive SSDI benefits, the agency does not collect data on who might be eligible. It is challenging to replicate disability definitions and legislative eligibility criteria in surveys, but not impossible. Perhaps the hardest part of disability survey measurement is translating federal legislative definitions into plain English survey questions that can be understood by respondents. It can be done, however, by careful work in the cognitive questionnaire labs, pretests, and statistical analyses. Statisticians also must address other critical survey issues, such as the effects of where the questions fit into the overall survey (context), mode (telephone, mail, or personal interview), and self versus proxy response. Data sets from the large surveys discussed here are made available to researchers as public use data (stripped of personal identifiers) after confidentiality and privacy concerns are met.

Several major national population-based surveys contain items on disability. These surveys include the National Health Interview Survey (NHIS), conducted by the National Center for Health Statistics, the Surveys of Income and Program Participation (SIPP), and the American Community Survey (ACS), conducted by the U.S. Census Bureau and the replacement for the long form of the decennial census.

The purpose of the NHIS is to monitor the nation's health. The NHIS—which covers the civilian, noninstitutionalized population—was established in 1956 and is the world's longest-running health survey. Altogether, approximately 35,000 households containing about 87,500 individuals are sampled each year. Interviews are conducted in person. The major disability items on the NHIS ascertain limitation of activity (e.g., working and going to school). Besides basic health information, special supplements are collected on areas of public health concern, such as cancer screening, smoking, and mental health.

The SIPP, sponsored by the U.S. Census Bureau, examines income, labor force data, assets, and participation in and eligibility for dozens of federal programs. The SIPP, which started in 1983, has been a rich source of sociodemographic data, including disability, but now is in the process of being redesigned. The SIPP was designed as an overlapping set of longitudinal panels, each generally lasting about three years. SIPP panels ranged from 14,000 to 36,700 households representing the civilian, noninstitutionalized population. Respondents were interviewed largely through in-person interviews every four months on a basic set of questions (income and
participation in the labor force and federal programs) and on topical modules on various subjects. The disability module was asked at two points one year apart. Unfortunately, it appears that under the SIPP redesign, the disability questions will be reduced and that disability will only be obtained at one point in time. Other features, such as panel length, may be changed. The ACS, which began in 1996, is designed to provide data every year that was previously collected by the long form of the decennial census every 10 years. The ACS (and previously the decennial census) is the only source of sociodemographic and economic data at the state and local levels. More than 3 million households participate in the ACS by mail, with follow-up personal interviews if necessary. ACS estimates can be obtained by states, counties, cities, metropolitan areas, and population groups of 65,000 or more. Starting in December 2008, the U.S. Census Bureau will release three-year estimates for population groups of 20,000 or more. At first, the ACS used the 2000 census questions, but much statistical and methodological testing has been done. The latest version—now being collected in the 2008 ACS and scheduled to be collected in the 2009 Annual Social and Economic Supplement to the CPS—contains six separate items on hearing impairments, visual impairments, mental impairments, physical impairments, activities of daily living (self-care), and instrumental activities of daily living. In the ACS, the item on work disability has been dropped for methodological concerns. A list of the current ACS questions is contained in Disability Items from the 2008 American Community Survey.
Prevalence of Disability

Disability is widespread, but the exact number of Americans with disabilities depends on the measure or definition used. In 2006, according to the ACS, nearly 41.3 million Americans (standard error 16,000), or more than one in seven aged 5 or older, reported a disability. During 2006, estimates from the NHIS indicated that 35.8 million people of all ages reported a limitation in their usual activity due to a chronic health condition. This comes to 12.2% (standard error of 0.2%), or about one in eight Americans. It is important to note that while disability prevalence increases with age, most people with disabilities are not elderly. The ACS reports that 65% of those with a disability are under age 65; the NHIS estimates that figure to be 67%. A fair amount of variation is expected between results from the ACS and NHIS. Disability definitions differed, even though a great deal of overlap exists in the concepts—if not the specific questions—and each had a different design and data collection mode. The ACS focused on broad categories of disabilities (i.e., physical, mental, sensory, self care, going outside, and the ability to work). The NHIS measures disability as limitations in the ability to carry out usual activities by age group (i.e., going to school, play, work, self care). Disability estimates among working-age adults from the 1994 National Health Interview Survey Supplement on Disability (NHIS-D) yielded four figures, depending on which broad measure was used.
Social Security Disability Insurance (SSDI)

The SSDI program is the largest disability program in the world and the primary program for working-age adults in this country. SSDI and Social Security retirement are funded by the Federal Insurance Contributions Act, or FICA, payroll tax, paid by employers and employees. The SSDI definition of disability is "the inability to engage in any substantial gainful activity (SGA) by reason of any medically determinable physical or mental impairment which can be expected to result in death or to last for a continuous period of not less than 12 months." The SSDI definition of disability is "all or nothing." There is no partial disability, as under Workers' Compensation, nor are there degrees of disability, as for veterans' programs. Eligibility is achieved through a five-step sequential process based on coverage under Social Security, disability, employment, education, age, and vocational factors. Those who are denied SSDI have the right to appeal the decision under certain circumstances. After 24 months of SSDI benefits, Medicare is extended to SSDI beneficiaries, even though they are under age 65. The Social Security system pays benefits to two additional categories of people with disabilities: disabled widow(er)s aged 50 to 59 and adults aged 18 or older who were disabled in childhood (ADC) and who have at least one parent who receives (or received, if deceased) Social Security retirement or disability benefits. The average age for disabled workers in December of 2005 was 51.9 years for men and 51.7 years for women. Early retirement under Social Security can be obtained starting at age 62. In 2002, the leading causes of disability for disabled workers were mental disorders other than mental retardation (e.g., schizophrenia, severe depression) at 28.1%, musculoskeletal system and connective tissue disorders (e.g., bad back) at 23.9%, diseases of the circulatory system at 10.1%, diseases of the nervous system and sense organs (e.g., multiple sclerosis, traumatic brain injury, epilepsy) at 9.6%, and mental retardation at 5.2%. Disabled widow(er)s had a similar pattern of leading disability causes. However, among ADC, the leading causes were mental retardation at 43.6%, other mental disorders besides retardation at 13.0%, and diseases of the nervous and sense organs at 8.6%. In December of 2007, slightly more than 7.1 million disabled workers received a monthly average of $1,004 in SSDI benefits. As of December of 2007, nearly 225,000 disabled widow(er)s received benefits, and slightly fewer than 795,000 received benefits as ADC. Altogether, more than 8.1 million workers, survivors, or dependents received Social Security benefits on the basis of their own disability.
These measures, based on about 100 questions, were functional, work disability, perceived as having a disability as defined in the ADA, and receipt of disability program benefits. About 25.7 million working-age adults reported a functional disability (e.g., climbing stairs, seeing); 16.9 million reported a limitation or inability in work; 11.1 million reported that they perceived themselves or others perceived them as having a disability; and 9.1 million reported receiving disability program benefits from SSDI, Supplemental Security Income (SSI), and/or the Veterans' Administration (VA) programs. Discussion of these broad measures is presented in Functional Disability Measures in the NHIS-D.
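The precision behind the prevalence figures quoted above can be gauged from the published standard errors. The sketch below is a simple normal-approximation calculation using the point estimates and standard errors as printed in the article; it is illustrative, not part of the official survey documentation.

```python
# Rough 95% intervals from the published estimates and standard errors.
se_acs_millions = 16_000 / 1e6   # ACS: 41.3 million with a standard error of 16,000 people
lo, hi = 41.3 - 1.96 * se_acs_millions, 41.3 + 1.96 * se_acs_millions
print(f"ACS count: 41.3 million (95% CI roughly {lo:.2f} to {hi:.2f} million)")

p, se = 12.2, 0.2                # NHIS: 12.2% with a standard error of 0.2 percentage points
print(f"NHIS rate: {p}% (95% CI roughly {p - 1.96*se:.1f}% to {p + 1.96*se:.1f}%)")
```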
Future Needs

Disability measurement is constantly evolving as society changes and medical and rehabilitation advances are made. Perhaps the greatest challenge is posed by the large number of veterans returning with disabilities from the wars in Iraq and Afghanistan. As of December 18, 2008, the Department of Defense reported that 4,211 members of the military were killed in Iraq and 558 in Afghanistan. The proportion of troops who survive their wounds is the greatest of any American war. The number wounded was 30,879 in Iraq and 2,605 in Afghanistan as of December 18, 2008. We do know that, besides physical wounds, many returning veterans have either post-traumatic stress disorder (PTSD) or traumatic brain injury (TBI)—two challenging disability measures.
As of December 31, 2007, 2.9 million veterans received benefits from the VA Disability Compensation program, and 7.8 million were enrolled in the VA Health Care System during fiscal year 2007. These numbers will grow as more veterans return from Iraq and Afghanistan. It is too early to see what changes will occur as a result of these wars to veterans’ disability programs, nonveterans’ disability programs, and society, but this is clearly an area where additional work is needed to improve measurement as a way to improve the treatment and support for this new group of disabled Americans.
Next Steps

There are four major areas requiring attention so we can better examine our disability policy and programs to improve the lives of Americans with disabilities.

Work Disability – The U.S. Census Bureau announced that the work disability item would be dropped from the ACS and the planned 2010 census due to methodological concerns. This is unfortunate. The ability to work is the central focus of most federal disability programs serving the working-age population and is specifically cited as an example of a major life activity in the ADA-AA.
Special Education: The Individuals with Disabilities Education Act (IDEA)

The IDEA is divided into Part B, which serves children aged 3 through 21, and Part C, serving children under age 3. The purpose of the IDEA, enacted in 1975, is to make special education and related services available to children with disabilities so they can receive a free and appropriate public education to prepare them for employment and independence as adults. Under Part C, early intervention services are provided to prepare children for an education and eventual independence when they reach adulthood. Not all children with disabilities need special education services.

In the fall of 2005, approximately 6.8 million children aged 3 through 21 received special education and related services under Part B of the IDEA. Disability in Part B is defined as having one of the following 13 conditions: mental retardation, hearing impairment (including deafness), speech or language impairments, visual impairments (including blindness), serious emotional disturbance, orthopedic impairments, specific learning disabilities, traumatic brain injury, multiple disabilities, deafblindness, autism, other disabilities (e.g., asthma, attention deficit disorder), and developmental delay (for ages 3–9 at state discretion). Individual states establish criteria for each of the 13 categories. There is variation among states and localities in how Part B is defined. Children are evaluated by an educational team, specific to particular schools and type of disability. An individualized education plan (IEP) is typically prepared by a multidisciplinary team, tailored to the needs of each child, and periodically reviewed with the child's parents. The IEP changes over time, as children mature and learn.

Part C of IDEA, known as the Early Intervention Program for Infants and Toddlers with Disabilities, served nearly 300,000 children during the fall of 2006. Infants are under age 1, and toddlers are between the ages of 1 and 3. Part C is a federal grant program to states serving infants and toddlers with disabilities and their families. The purpose of Part C services is to enhance the development of infants and toddlers with disabilities, reduce the need for Part B, and help families meet the needs of their very young children with disabilities. Infants and toddlers served by Part C are defined as either having a developmental delay or a condition with a high probability of resulting in a developmental delay based on diagnostic medical measures. Developmental delays include cognitive, physical, communication, social or emotional, or adaptive functioning. Similar to Part B, eligibility for the disability categories is determined by states. Altogether, in the fall of 2006, approximately 2.4% of the population under age 3 was served by Part C. Slightly less than half (46%) of those served were under the age of 2.
Previously, the work disability question has been considered and tested within the context of the entire disability series. Yet, work disability is also an aspect of employment. Could the U.S. Census Bureau look at work disability as an employment item? This may be a better fit for methodological concerns. Second, disability is subjective and not easily verified through methodological work. It appears that participants in the cognitive questionnaire lab reported no limitations in work, even though they were collecting disability benefits. More program knowledge is needed. The SSDI and disability portion of the SSI programs allow and encourage employment and rehabilitative efforts for those receiving benefits (e.g., Ticket to Work).

Mental Impairments – As described in the U.S. Census Bureau report "Evaluation Report Covering Disability," methodological work on mental impairments emphasized cognitive impairments among the elderly. Clearly, this focus needs to be expanded to include returning veterans with PTSD and TBI.

Instrumental Activities of Daily Living – The IADL item was dropped from the ACS. IADLs—which include shopping, using the telephone, and managing money and/or medication—typically refer to activities that involve social interaction and more sophisticated self care. IADLs tend to require more mental and cognitive skills. While in the past, methodological research on IADLs has focused on the elderly, it is worth reexamining these items in light of returning veterans.

ADA-AA – On January 1, 2009, the ADA-AA, which now includes specific examples of major life activities in the law, takes effect. Although many examples appear in the ACS and other surveys, many do not—the ability to work being the most critical. ADA-AA major life activities not typically included in surveys are manual tasks, eating, sleeping, standing, bending, speaking, breathing, learning, thinking, communicating, and working. Methodological work needs to be done to measure progress of the ADA-AA.

Methodological Work – The U.S. Census Bureau is to be commended for its methodological work performed on the ACS and its coordination with other federal agencies. Because definitions of disability are constantly evolving, methodological work and analyses must continue to evolve. Space and time constraints on surveys are real concerns. For example, if one question identifies disability for 90% of a certain category and 10 questions identify 97%, we could analyze who is in the 7% and potentially drop nine questions. Even though the NHIS-D is old, the 100+ questions can be analyzed for overlaps and more efficient disability questions can be designed. Data from other surveys could be used, as well.

Cooperation is required. No one federal program (or agency) is responsible for disability, and no single federal agency is responsible for disability statistics. Disability is too important to ignore. Creative work needs to be done by statisticians, federal agencies, academia, and advocacy groups.
Disability Items from the 2008 American Community Survey

The full questionnaire can be found at www.census.gov/acs/www/Downloads/SQuest08.pdf. The disability questions begin with question 16 and are asked of each person listed at the address. If the respondent is 5 years or older, then question 17 is asked. Otherwise, question 17 is skipped for that respondent. Question 18 is asked of those aged 15 or older. The answer to each question is yes or no.

16. a. Is this person deaf or does he/she have serious difficulty hearing?
16. b. Is this person blind or does he/she have serious difficulty seeing, even when wearing glasses?
17. a. Because of a physical, mental, or emotional condition, does this person have serious difficulty concentrating, remembering, or making decisions?
17. b. Does this person have serious difficulty walking or climbing stairs?
17. c. Does this person have difficulty dressing or bathing?
18. Because of a physical, mental, or emotional condition, does this person have difficulty doing errands alone, such as visiting a doctor's office or shopping?

Further Reading

Clipsham, J.A. (2008) "Disability Access Symbols." Graphic Artists Guild: www.gag.org/resources/das.php.
Adler, M.C. and Hendershot, G. (2000) "Federal Disability Surveys in the United States: Lessons and Challenges." In ASA Proceedings, Section on Survey Research Methods, pp. 98–104, Alexandria, VA: American Statistical Association. www.amstat.org/sections/SRMS/proceedings/papers/2000_014.pdf.
Adler, M.C.; Clark, R.F.; DeMaio, T.J.; Miller, L.F.; Saluter, A. (1999) "Collecting Information in the 2000 Census: An Example of Interagency Cooperation." Social Security Bulletin, 62(4). www.ssa.gov/policy/docs/ssb/v62n4/v62n4p21.pdf.
Brault, M.; Stern, S.; Raglin, D. (2007) "Evaluation Report Covering Disability." 2006 American Community Survey Content Test Report P.4. Washington, DC: U.S. Census Bureau. www.census.gov/acs/www/AdvMeth/content_test/P4_Disability.pdf.
Turek, J. (2008) "Committee on Statistics and Disability." Amstat News, 370:10. www.amstat.org/publications/amsn/index.cfm?fuseaction=pres042008.
General Accounting Office. (2008) "Federal Disability Programs: Coordination Could Facilitate Better Data Collection to Assess the Status of People with Disabilities." Statement of Daniel Bertoni, Director of Education, Workforce, and Income Security; Testimony before the Subcommittee on Information Policy, Census, and National Archives, Committee on Oversight and Government Reform, House of Representatives, June 4. www.gao.gov/new.items/d08872t.pdf.
Goodman, N. and Stapleton, D. (2007) "Federal Program Expenditures for Working-Age People with Disabilities." Journal of Disability Policy Studies, 18(2):66–78.
National Health Interview Survey 1995 Supplement Booklet: Disability, Phase 1. www.cdc.gov/nchs/data/nhis/dis_ph1.pdf.
Jussi Jokinen, Regression to the Mean, and the Assessment of Exceptional Performance W. J. Hurley
Dallas Stars forward Jussi Jokinen, of Finland, works the puck in a hockey game against the Montreal Canadiens on December 23, 2007, in Dallas, Texas. (AP Photo/Matt Slocum)
Late in the second period of a game between the Dallas Stars and Edmonton Oilers during the 2005–2006 National Hockey League (NHL) season, Jussi Jokinen, a rookie with the Stars, was awarded a penalty shot. For such shots, the goaltender has a sizable advantage, as NHL players score on roughly one in three penalty shots. Jokinen scored and, in the process, continued a rather remarkable streak. He scored on his three penalty shots during the NHL preseason and on nine straight penalty shots during the regular season up to the Oiler game. So, all in, he was successful on his first 13 penalty shots, an unofficial NHL record not likely to be broken any time soon. Jokinen's performance, and those of some lesser known players, led media commentators to assert
that there were players with exceptional ability on penalty shots and that NHL teams were actively looking for these penalty shot specialists. Jokinen has since cooled off. Throughout the complete 2005–2006 regular season, he went 10 for 13, and over the 2006–2007 season, five of 12, a frequency more in line with the league-wide rate.

Jokinen's streak, his subsequent cooling off, and the media discussions at the time the streak was maturing raise some interesting questions. The first is an assessment of the role of chance in explaining Jokinen's rookie season performance. Based on a standard order statistics argument, one would expect the highest relative scoring frequency (defined as the fraction of penalty shots that result in a goal) to be fairly high. The question is whether Jokinen's frequency, 10 for 13, is sufficiently high to warrant the conclusion that something other than chance is part of the explanation.

The second question is whether the media prognosticators were correct that taking a penalty shot is a special skill, that some otherwise gifted NHL scorers did not possess this skill, and that it was important for teams to identify their shootout specialists. For example, Joe Sakic, the Colorado Avalanche captain and future Hall of Famer, is a prolific scorer, but did not score a single goal in his seven penalty shots over the 2005–2006 season. There is significant experimental evidence that people are quick to find a pattern in a random sequence where there isn't one, especially when the sequence is relatively short. Amos Tversky and Daniel Kahneman, in a 1971 article published in Psychological Bulletin, termed this bias the "law of small numbers." In sum, does the evidence support the existence of a subclass of players with shootout ability superior to proven NHL stars?

Finally, the shootout data set covers two NHL seasons. This affords an opportunity to study the phenomenon of regression to the mean. Were players who exhibited high scoring rates on penalty shots throughout the 2005–2006 season able to maintain those rates over the 2006–2007 season? Did the worst players over the 2005–2006 season improve over the 2006–2007 season? Given the increasing competitiveness of the NHL, teams must do well in shootouts. Every point counts. For instance, during the 2005–2006 season, Tampa Bay finished in the Eastern Conference's last playoff spot with two more points than Toronto. During the season, Tampa Bay won six of
10 shootouts, whereas Toronto won only three of 10. Clearly, the Toronto shootout performance cost them a playoff spot.
Shootouts and Uncertainty

At the start of the 2005–2006 season, the NHL introduced a shootout competition to determine which team gets an additional point (toward league standings) when there is a tie at the end of regulation time and five minutes of 4-on-4 overtime play. The shootout rules are specified in Rule 89 of the NHL Rulebook (see www.nhl.com/rules/rule89.html): Each team will be given three shots, unless the outcome is determined earlier in the shootout. After each team has taken three shots, if the score remains tied, the shootout will proceed to a "sudden death" format. No player may shoot twice until everyone who is eligible has shot. As in soccer, teams alternate taking penalty shots at the goalies until there is a winner. It is more difficult to score on a hockey penalty shot than on a soccer free kick.

In Table 1, the shootout success percentages are calculated for the 2005–2006 and 2006–2007 NHL seasons. Based on all shots for both years, the frequency of scoring is 0.3311. It is interesting to note that scoring on an NHL goaltender on a penalty shot is comparable in difficulty to hitting a baseball thrown by a Major League Baseball pitcher, a task generally considered one of the toughest in sports.

Table 1—Shootout Performance of All NHL Players Over the 2005/06 and 2006/07 Seasons

Season     Attempts    Goals    Ratio
2005/06    981         329      0.3354
2006/07    1,209       396      0.3275
Totals     2,190       725      0.3311

Most NHL teams tend to rely on the same three to four players for shootouts. It is only when a shootout goes beyond six shots (three for each team) that additional shooters are employed. For instance, throughout the 2005–2006 season, the Dallas Stars used Jokinen, Mike Modano, Sergei Zubov, and Antti Miettinen for 37 of the team's 42 penalty shots. A sample of those who took the most shots should be useful for assessing how good the best shootout players are. Coaches select the players they think best at shootouts to take the shots. I decided to examine an "elite set," those players who took at least five penalty shots in both the 2005–2006 and 2006–2007 seasons. There were 52 such players. Over both seasons, they took 859 shots and were successful on 339 for a relative scoring frequency of 0.395. The relative scoring frequency for nonelite set players was only 0.290. Hence, there is a substantial difference in the performance of these two groups.

Another important factor is the chance that an NHL game gets to a shootout. In Table 2, the number of shootouts for the 2005–2006 and 2006–2007 seasons is presented. Throughout both seasons, the frequency of games going to a shootout is 0.1253.

Table 2—Frequency of Shootout Games Over the 2005/06 and 2006/07 NHL Seasons

Season     Games    #Shootouts    Ratio
2005/06    1,230    145           0.1179
2006/07    1,230    163           0.1325
Totals     2,460    308           0.1253

Assessing the Jokinen Streak

In 2005–2006, Jokinen's rookie season, he finished fourth in rookie scoring with 55 points. Over the 2006–2007 season, his point production fell to 48, but his plus-minus rating was eight. The plus-minus statistic is important because it is a rough approximation of the player's offensive and defensive capabilities. Jokinen's plus-minus was the second-highest on the Dallas Stars team. To assess Jokinen's shootout ability, it would be inappropriate to consider only his first 13 shots. Obviously, this sample would be highly selective and biased. Nonetheless, it is interesting to determine the chances of it happening.
Under the assumption that Jokinen's probability of scoring on a penalty shot was the average for the elite set, 0.395, and that these shots are independent, the chance of 13 consecutive goals is

(0.395)^{13} = 0.0000057,    (1)
a small chance indeed. One of the explanations for Jokinen's early success was his uncanny ability to execute a shot known as "The Paralyzer." This move requires the shooter to first fake to his forehand side, and in so doing, get the goalie to move that way. Thereafter, he quickly moves the puck back to his backhand and, with one hand on his stick and the puck as far away from his body as possible, guides it into the corner away from the direction of his and the goaltender's movement. Interested readers can visit YouTube for a video of Jokinen executing the technique (see www.youtube.com/watch?v=QJbI-nITjIM). It is a shot Peter Forsberg made famous in the shootout in the 1994 Olympic Gold Medal game between Sweden and Canada. It proved to be the winning goal and has since been commemorated on a Swedish stamp. I am not sure how often Jokinen tried it during his streak, but it was a lot. Hence, the combination of his rookie status and the almost flawless execution of a very difficult shot gave him a temporary
edge on goaltenders. But this edge was likely to be short-lived. Given the importance of the shootout to NHL regular season success, goaltenders and their coaches studied Jokinen and gradually developed a "book" on him, just as Major League hitters develop a book on opposing pitchers, particularly young pitchers.

Alternatively, suppose we assess his shootout performance throughout his complete rookie season (excluding pre-season), when he was successful on 10 of 13 shots. If we were to consider this performance in isolation, we could compute a one-sided p-value under the hypothesis that Jokinen was able to score at the rate of an elite set player, 0.395. Therefore, the chance Jokinen would score 10 or more times on 13 shots would be a sum of binomial probabilities:

\sum_{i=10}^{13} \binom{13}{i} (0.395)^i (1 - 0.395)^{13-i} = 0.007018,    (2)
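Calculations (1) and (2) can be reproduced with a couple of lines of Python, assuming scipy is available (the article itself does not describe any software).

```python
# Reproducing (1) and (2): 13 straight goals, and 10 or more goals in 13 shots.
from scipy.stats import binom

p_elite = 0.395
print(p_elite ** 13)              # chance of 13 consecutive goals, roughly 0.0000057
print(binom.sf(9, 13, p_elite))   # P(X >= 10) for X ~ Binomial(13, 0.395), about 0.0070
```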
which is certainly small enough to reject the null that Jokinen scores with a 0.395 frequency in favor of an alternative that says he does better. One of the limitations of this calculation is that it ignores the uncertainty associated with the number of shots. To include the effects of this uncertainty, suppose elite set player i takes an uncertain number of shots, Si. As each team plays 82 games and a particular game goes to a shootout with probability 0.1253, Si follows a binomial random variable with parameters n = 82 and q = 0.1253 and density

g(s_i) = \binom{82}{s_i} q^{s_i} (1 - q)^{82 - s_i}.    (3)
Given a realization of Si, say si, player i will have a success frequency of at least 10/13 if he scores on at least

\delta_i(s_i) = \left\lceil \frac{10}{13} s_i \right\rceil    (4)

of these shots, where the notation \lceil x \rceil means to round the number x to the next highest integer. Hence, conditional on Si = si, a player will have a success frequency of at least 10/13 with probability

r(s_i) = \sum_{j=\delta_i(s_i)}^{s_i} \binom{s_i}{j} (0.395)^j (1 - 0.395)^{s_i - j}.    (5)

Therefore, the chance that player i's relative scoring frequency would be at least 10/13 is

\upsilon_i = \sum_{j} r(s_j) g(s_j).    (6)

The only difficulty in this calculation is where to start the summation in (6). It would not be appropriate to include the outcome where a player was, say, two for two on penalty shots. For this reason, I imposed the restriction that a player had to take at least eight shots. Under this assumption, υi = 0.006623, which is close to what I got above using (2). In this case, making allowances for the uncertainty in the number of shots has little effect on the p-value.

But there is a more serious problem. We are picking the player with the highest penalty shot scoring frequency during the 2005–2006 season and applying a binomial sampling distribution for the average elite set player. Instead, we should examine his performance using the distribution of the relevant order statistic; in this case, the maximum. What we are interested in is the chance the player with the highest relative scoring frequency has a relative scoring frequency of at least 10/13. This can be calculated as follows. Suppose there are m players in our elite set. Let the relative scoring frequency for player i be Xi. We need to consider the statistic

Y_m = \max(X_1, X_2, \ldots, X_m).    (7)

To get its cumulative distribution function, we proceed in the usual way:

F_m(y) = \Pr(Y_m \le y) = \Pr(\max(X_1, X_2, \ldots, X_m) \le y) = \Pr(X_1 \le y, X_2 \le y, \ldots, X_m \le y) = \Pr(X_1 \le y) \Pr(X_2 \le y) \cdots \Pr(X_m \le y) = [B(y)]^m,    (8)

where B(y) is the probability that a player has a relative scoring frequency no better than y. Note that

B(10/13) = 1 - 0.0066230 = 0.993377.    (9)

Hence, the probability that the player with the highest relative scoring frequency would have one of 10/13 or better is

\Pr(Y_m \ge 10/13) = 1 - F_m(10/13) = 1 - [B(10/13)]^m.    (10)
Values of Pr(Ym ≥ 10/13) are shown below for three values of m:

m      Pr(Ym ≥ 10/13)
50     0.282707
75     0.392501
100    0.485490
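The chain of calculations in (3) through (6) and (10) can be sketched as follows, again assuming scipy. The cutoff of eight shots follows the text; whether the sum in (6) should be renormalized over s ≥ 8 is not spelled out in the article, so the plain sum is used here, which is why the printed value of 0.0066230 may differ slightly from this sketch.

```python
# Chance a given elite-set shooter reaches a 10/13 rate when the number of shots
# is random, and the chance that the best of m such shooters does.
import math
from scipy.stats import binom

p, q, n_games = 0.395, 0.1253, 82

upsilon = sum(
    binom.pmf(s, n_games, q) * binom.sf(math.ceil(10 * s / 13) - 1, s, p)
    for s in range(8, n_games + 1)          # restriction: at least eight shots
)
print("upsilon:", upsilon)                  # the article reports 0.0066230

B = 1 - upsilon                             # B(10/13), about 0.9934
for m in (50, 75, 100):
    print(m, 1 - B ** m)                    # should track the table above
```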
Assuming there were somewhere between 50 and 100 players taking at least eight shots each, there was a relatively good chance (approaching the flip of a coin) that we would observe a relative scoring frequency of at least 10/13. This is hardly evidence that Jokinen’s shootout performance during the 2005–2006 regular season was exceptional.
While the use of the maximum order statistic to assess Jokinen’s performance is a step in the right direction, it could be criticized because all players are assumed to have the same chance of scoring on a penalty shot. An obvious way to relax this assumption is with shrinkage estimators.
Shrinkage Estimators

Charles Stein, in his 1955 article titled "Inadmissibility of the Usual Estimator for the Mean of a Multivariate Normal Distribution," showed it is possible to improve upon maximum likelihood parameter estimates in terms of squared error loss when the parameters of several independent normal distributions are to be estimated. Bradley Efron and Carl Morris, in a 1975 paper published in the Journal of the American Statistical Association, applied this idea to estimating the batting averages of a subset of Major League Baseball players throughout the 1970 season based on averages over their first 45 at-bats. Their technique is not directly applicable to the assessment of relative scoring frequency on penalty shots for each player for a number of reasons, most notably because the sample data on penalty shots is not large enough to justify the normality assumption. Fortunately, Jim Albert has developed an empirical Bayes procedure for estimating binomial probabilities for any set of sample sizes, which he discusses in a 1984 paper titled "Empirical Bayes Estimation of a Set of Binomial Probabilities." A rough outline of his procedure is as follows: Suppose we have a sample of binomial observations X1, X2, …, Xp with probabilities θ1, θ2, …, θp and numbers of observations n1, n2, …, np. Let

n = n1 + n2 + … + np    (11)

and

X = X1 + X2 + … + Xp.    (12)
Applied to our shootout data, ni is the number of shots taken by player i and θi is player i's relative scoring frequency. Assuming heterogeneity of players, the maximum likelihood estimate of θi is

\theta_i^{MLE} = \frac{X_i}{n_i}.    (13)

On the other hand, if all players were the same, we could estimate θi with

\theta_I = \frac{X}{n}.    (14)

The empirical Bayes procedure, then, estimates θi with a linear combination of these two:

\theta_i^{EB} = (1 - \lambda_i)\theta_i^{MLE} + \lambda_i \theta_I,    (15)

where the estimate of λi depends on the assumption about the prior distribution for θ = (θ1, θ2, …, θp). Albert employs a beta distribution, Beta(Kh, K(1 − h)), with a suitable joint distribution for the hyper-parameters K and h. Under his assumption,

\lambda_i = \frac{K}{n_i + K}.    (16)
To apply this technique to estimate relative scoring frequencies for elite set players, I employed the method of moments to estimate K and h. (The method of moment estimators for a beta distribution can be found in the Engineering Statistics Handbook, www.itl.nist.gov/div898/handbook.) With these in hand, equation (15) was used to estimate the relative scoring frequencies for players in the elite set. The data set for the estimation consisted of the aggregated elite set player shootout performances over the 2005–2006 and 2006–2007 seasons. The results for a subset of these players are shown in Table 3. Note the effect of these shrinkage estimators: for the highest relative scoring rates, the estimated scoring rates are lower, and for the lowest rates, they are higher. How, then, could we use this information to assess Jokinen’s shootout performance over the 2005–2006 season? Suppose we take as given the empirical Bayes scoring frequencies in Table 3 and calculate the chance that the scoring frequency of the player with the highest scoring frequency would exceed 10/13. Under the assumption that these players take the same number of shots they did over the 2005–2006 season, I calculate this probability to be 0.842855. Under the assumption that all players take 13 shots (the same as Jokinen), the probability is 0.514710. Hence, this approach to assessing Jokinen’s performance also suggests that, while it was a very good performance, it was well within the bounds of normal statistical fluctuation.
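The mechanics of (13), (15), and (16) can be illustrated on a few of the rows from Table 3 below. In the sketch, K is set to an illustrative value and the pooled rate of 0.395 reported in the article is used for θI; reproducing the published estimates exactly would require the full elite-set data and the method-of-moments step described in the text.

```python
# Empirical Bayes shrinkage of shootout scoring rates, equations (13)-(16).
players = {                       # name: (attempts, goals), taken from Table 3
    "Kariya": (18, 12), "Jokinen": (24, 14), "Kotalik": (13, 7),
    "Hejduk": (14, 3), "Ponikarovsky": (12, 1),
}
theta_pool = 0.395                # pooled elite-set rate reported in the article
K = 13.0                          # illustrative prior "sample size" (an assumption)

for name, (n_i, x_i) in players.items():
    mle = x_i / n_i                           # equation (13)
    lam = K / (n_i + K)                       # equation (16)
    eb = (1 - lam) * mle + lam * theta_pool   # equation (15)
    print(f"{name:13s}  MLE {mle:.4f}  shrunk {eb:.4f}")
```

The qualitative effect is the one noted in the text: high raw rates are pulled down toward the pool, low raw rates are pulled up, and players with fewer attempts are shrunk more.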
Table 3—Empirical Bayes Estimates of Scoring Frequencies for Players in the Elite Set During the 2005/06 and 2006/07 Seasons

Player    Name            Attempts    Goals    Rate      θ_i^EB
1         Kariya          18          12       0.6667    0.5525
2         Kozlov          18          12       0.6667    0.5525
3         Jokinen         24          14       0.5833    0.5170
4         Koivu           21          12       0.5714    0.5038
5         Kotalik         13          7        0.5385    0.4665
48        Hejduk          14          3        0.2143    0.3012
49        McDonald        16          3        0.1875    0.2804
50        Prucha          11          2        0.1818    0.2972
51        Boyes           12          2        0.1667    0.2853
52        Ponikarovsky    12          1        0.0833    0.2453

Regression to the Mean

This data set covering two NHL seasons offers a good opportunity to examine the concept of regression to the mean. Suppose all shooters in the elite set score with probability 0.395 on every penalty shot they take. With this assumption in mind and for a specific period of time, the actual scoring frequencies will vary about 0.395. Now consider what would happen in a subsequent period. We would expect that the performance of the best shooters in the first period would fall in the second. In fact, this is precisely what has happened.
Table 4—The Shootout Performance of the Best 10 Shooters in 2005/06 During the 2005/06 and 2006/07 Seasons

Player      Goals 05/06    Shots 05/06    Rel. Freq. 05/06    Goals 06/07    Shots 06/07    Rel. Freq. 06/07
Sykora      5              6              0.833               1              5              0.200
Whitney     4              5              0.800               0              5              0.000
Jokinen     10             13             0.769               5              12             0.417
Frolov      3              4              0.750               4              9              0.444
Kozlov      5              7              0.714               7              11             0.636
Kariya      5              7              0.714               7              11             0.636
Williams    5              7              0.714               3              9              0.333
Satan       7              10             0.700               5              13             0.385
Kozlov      8              12             0.667               5              13             0.385
Richards    6              9              0.667               5              12             0.417
Table 4 compares the relative scoring frequency of the best 10 players in the 2005–2006 season with their performance in the 2006–2007 season. Note that the relative scoring frequency of all 10 fell over the 2006–2007 season. These players had a combined relative scoring frequency of 0.725 over the 2005–2006 season and 0.420 over the 2006–2007 season. For Jokinen in particular, his scoring frequency for the 2006–2007 season fell to five goals in 12 attempts, a considerable drop from his stellar rookie performance. For these 10 players, the correlation coefficient for their relative frequencies over the two seasons is –0.571, which, as expected, is negative and significant.
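The quoted correlation can be reproduced directly from the Table 4 figures above (numpy assumed).

```python
# Year-to-year correlation of relative scoring frequencies for the top 10 shooters.
import numpy as np

freq_0506 = [0.833, 0.800, 0.769, 0.750, 0.714, 0.714, 0.714, 0.700, 0.667, 0.667]
freq_0607 = [0.200, 0.000, 0.417, 0.444, 0.636, 0.636, 0.333, 0.385, 0.385, 0.417]
print(np.corrcoef(freq_0506, freq_0607)[0, 1])   # roughly -0.571
```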
Table 5—A Comparison of the Shootout Performances of Players Grouped into Quartile Ranges by 5-on-5 Point Totals for the 2005/06 Season

Quartile           Average Points    #Shots    #Goals    Relative Frequency
Q1 (36 players)    83.3              305       108       0.3541
Q2 (36 players)    59.1              252       95        0.3770
Q3 (36 players)    43.9              225       82        0.3644
Q4 (35 players)    27.3              161       62        0.3851
Regression to the mean also should apply to the worst performers in the elite set for the 2005–2006 season. The bottom 10 players in the elite set, as measured by relative scoring frequency over the 2005–2006 season, scored 12 goals in 68 attempts for a relative scoring frequency of 0.1765. This same group during the 2006–2007 season scored 35 goals in 84 attempts for a relative scoring frequency of 0.4167, a considerable improvement and comparable to the elite set average scoring frequency of 0.395.
Who Are the Shootout Specialists?

Toward the end of Jokinen's streak and the 2005–2006 NHL season, the hockey media, particularly the Canadian TV media, were suggesting the existence of a group of players who were better than average in shootouts and that, for specific teams, it was not necessarily the case that their best shootout players were their best 5-on-5 players. What does the evidence suggest? To this point, I have argued that the performances of the players with the best shootout percentages are consistent with normal statistical variation. Given the chance nature of the shootout, we would expect a subset of players to do very well, but the performances of these players at the top do not support the conclusion that they have above average ability on shootouts. Moreover, for these players at the top, we observe the phenomenon of regression to the mean: The best players in one year are not the best the next year, and the worst in one year improve their performance the next.

Here is another piece of supporting evidence. For the 2005–2006 season, I looked at all NHL players who took at least three shots in the penalty shootout. There were 143 such players. I then ranked these players according to their regular season points (goals plus assists) and put them into quartiles. Table 5 compares the quartile performances in the penalty shootout. In 5-on-5 play, there are clearly significant differences in the performance of the quartiles. The top quartile (36 players) had an average point count (goals + assists) almost three times that of the bottom quartile. But these shootout frequencies are statistically the same. All in, this evidence does not support the existence of a specialist shootout group.

What it does suggest is that the game of hockey is a continuous flow game, where both sides compete to create and destroy scoring chances. We can think of the play leading to a goal in two parts. There is the play that leads to a scoring chance (opportunity generation) and, subsequently, the shooter converting the opportunity into a goal (opportunity conversion). The data suggest NHL players have comparable conversion skills but substantially different generation skills. This, of course, is the beauty of 5-on-5 play. Great goals require great team play and gritty determination, two characteristics not essential to success in a penalty shootout.
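One quick way to check the claim that the quartile shootout frequencies in Table 5 are statistically indistinguishable is a chi-square test of homogeneity on the goal/miss counts. This is a sketch assuming scipy, not the analysis reported in the article.

```python
# Chi-square test of homogeneity for the four quartiles in Table 5.
from scipy.stats import chi2_contingency

goals = [108, 95, 82, 62]
shots = [305, 252, 225, 161]
table = [[g, s - g] for g, s in zip(goals, shots)]   # [goals, misses] per quartile
chi2, p, dof, _ = chi2_contingency(table)
print(f"chi-square = {chi2:.2f}, df = {dof}, p = {p:.2f}")  # p should be large
```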
Summary
Toward the end of Jokinen’s streak and the 2005–2006 NHL season, the hockey media, particularly the Canadian TV media, were suggesting the existence of a group of players who were better than average in shootouts and that, for specific teams, it was not necessarily the case that their best shootout players were their best 5-on-5 players. What does the evidence suggest? To this point, I have argued that the performances of the players with the best shootout percentages are consistent with normal statistical variation. Given the chance nature of the shootout, we would expect a subset of players to do very well, but the performances of these players at the top does not support the conclusion that they have above average ability on shootouts Moreover, for these players at the top, we observe the phenomenon of regression to the mean: The best players in one year are not the best the next year, and the worst in one year improve their performance the next. Here is another piece of supporting evidence. For the 2005–2006 season, I looked at all NHL players who took at least three shots in the penalty shootout. There were 143 such players. I then ranked these players according to their regular season points (goals plus assists) and put them into quartiles. Table 5 compares the quartile performances in the penalty shootout. In 5-on-5 play, there are clearly significant differences in the performance of the quartiles. The top quartile (36 players) had an average point count (goals + assists) almost three times that of the bottom quartile. But these shootout frequencies are statistically the same. All in, this evidence does not support the existence of a specialist shootout group. What it does suggest is that the game of hockey is a continuous flow game, where both sides compete to create and destroy scoring chances. We can think of the play leading to a goal in two parts. There is the play that leads to a scoring chance (opportunity generation) and, subsequently, the shooter converting the opportunity into a goal (opportunity conversion).
Summary

During his rookie NHL season, Jussi Jokinen had the highest relative scoring rate (10 of 13) in the NHL. The interesting question is whether this exceptional performance can be explained by chance. Based on an order statistic argument using shrinkage estimators to estimate scoring abilities, I found there is a high probability that the player with the highest relative scoring frequency would have gone 10 for 13 or better. Hence, I conclude that Jokinen's performance over the 2005–2006 season was consistent with normal statistical variation.
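The flavor of the order statistic argument can be conveyed with a simplified Monte Carlo sketch. The simplification below assumes each of the 143 shooters takes 13 attempts with a common conversion probability of about 0.37, roughly the league-wide shootout rate suggested by Table 5. The article's actual calculation uses shrinkage estimates of individual abilities and the players' actual attempt counts, so this is only a rough illustration of why a 10-for-13 maximum is not surprising.

```python
# Rough Monte Carlo illustration of the order-statistic argument:
# how often does the best of 143 shooters go 10-for-13 or better by chance alone?
# Simplifying assumptions (not the article's exact setup): every shooter takes
# 13 attempts and converts each with probability 0.37.
import numpy as np

rng = np.random.default_rng(0)
n_players, n_attempts, p_convert = 143, 13, 0.37
n_sims = 100_000

goals = rng.binomial(n_attempts, p_convert, size=(n_sims, n_players))
best = goals.max(axis=1)
print("P(best shooter makes 10 or more of 13) ≈", (best >= 10).mean())
```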
Further Reading

Albert, J.H. (1984), "Empirical Bayes Estimation of a Set of Binomial Probabilities," Journal of Statistical Computation and Simulation, 20:129–144.
Efron, B., and Morris, C. (1975), "Data Analysis Using Stein's Estimator and Its Generalizations," Journal of the American Statistical Association, 70:311–319.
Everson, P. (2007), "Stein's Paradox Revisited," CHANCE, 20:49–56.
Engineering Statistics Handbook, www.itl.nist.gov/div898/handbook.
Gould, S.J. (1989), "The Streak of Streaks," CHANCE, 2:10–16.
Nevzorov, V.B. (2001), Records: Mathematical Theory, American Mathematical Society.
Stein, C. (1955), "Inadmissibility of the Usual Estimator for the Mean of a Multivariate Normal Distribution," Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, Vol. 1, Berkeley: University of California Press, 197–206.
Tversky, A., and Kahneman, D. (1971), "Belief in the Law of Small Numbers," Psychological Bulletin, 2, 105–110.
Go For It
What to consider when making fourth-down decisions

M. Lawrence Clevenson and Jennifer Wright
It is Sunday afternoon in the late fall and a football game has just started between the Green Bay Packers and the St. Louis Rams on the frozen tundra of Lambeau Field. St. Louis received the opening kick-off, moved the ball forward for some plays, but their third-down attempt to reach a first down failed and left them with fourth down and two yards to go on their own 42-yard line. Without much thought, the punting team comes onto the field and St. Louis punts. The decision to punt on fourth down is so common that hardly anyone thinks about it. But, wait. Should St. Louis have punted, or should they have gone for it? One way to decide would be to quantify expected points after both decisions. But what does one need to know to do that?

For readers unfamiliar with American football, the two teams try to advance the ball toward the other's (defensive) goal line. The team with the ball (offense) gets four chances to advance the ball. These chances are called "downs." An important rule is that if the offense can advance the ball 10 yards or more before their four chances are over, they get a "new first down" and four more chances to advance the ball. Often, if a team has reached its fourth down (its last chance) and the players feel they cannot gain a new first down, they will punt the ball (kick it) as far from their defensive goal as possible. Other times, if the team is close enough to its offensive goal line, it will try for a field goal, which means a player kicks the football between the posts at the end of the field, a possibility we do not address here. That is, we assume the offense has the ball at a point on the field where a field goal attempt is not a realistic option. If the offense punts, goes for a field goal, or tries for the first down and fails on its fourth down, then the other team becomes the team on offense, with the opportunity to advance the ball.

Fourth-Down Decisions

We must examine three scenarios to make the decision to punt or "go for it." Punting gives the other team the ball and a corresponding expectation of scoring from their new position on the field after the punt. Going for a first down and failing also gives the other team the ball and an even greater expectation of scoring, as it will be much closer to the goal line; this is the worst result of the three. Going for a first down and succeeding gives the offense a positive expectation of scoring, clearly the best of these three scenarios for the offense. Time may be a factor near the end of the half or game, so we assume the team on defense will have adequate time to try to score after the fourth-down play. In these circumstances, a statistician would argue that the decision should be made by computing expected points (both positive and negative) from the decisions. Therefore, we need statistical models for these expectations.
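The comparison just described can be written compactly. Using notation introduced later in the article, let EP(x) denote expected points from a first down x yards from the goal and let p be the probability of converting the fourth down; the shorthand quantities below are mine, not symbols used by the authors.

```latex
% Net expected-points comparison for the fourth-down decision (shorthand sketch).
% EP_punt    : opponent's expected points after a punt
% EP_fail    : opponent's expected points after a failed conversion attempt
% EP_success : offense's expected points after a successful conversion,
%              less the opponent's expected points on its next possession
\[
\text{Go for it if } \; p\,EP_{\text{success}} - (1-p)\,EP_{\text{fail}} \;>\; -\,EP_{\text{punt}}.
\]
```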
Expected Points from a First Down

What is the expected number of points with a new first down from a given yard line? Let EP(x) represent a team's expected points when that team has a first down with a given number of yards, x, from the opposing goal. We seek a model to estimate EP(x) for each team. There are 32 teams in the National Football League with various strengths in offensive and defensive play. Teams from Green Bay, San Francisco, St. Louis, Chicago, and Indianapolis in the 2005 season were chosen for this study. This selection reduces the required data collection effort while maintaining a variety of strengths and weaknesses among the teams used to build statistical models. A qualitative summary of the team strengths and weaknesses in 2005 appears in Table 1.

Table 1—Summary of Offensive and Defensive Team Strength, Mid-Season 2005

Team            Offense   Defense
Chicago         Weak      Strong
Indianapolis    Strong    Strong
San Francisco   Weak      Weak
St. Louis       Strong    Weak
Green Bay       Medium    Medium

All the plays from all the games played by these five teams in the 2005–2006 season were collected (see www.nfl.com). When a team had a first down (first down with 10 yards to go or first and goal to go), a point in the data set for the team is created. Each point has a bivariate measurement. The explanatory variable (x) is yards from the goal line, and the response variable (y) is the number of points made before giving up the ball to the defensive team. The response variable, therefore, is a member of {0, 3, 6, 7, 8}. Zero points means no score. Three points are awarded for a field goal. A touchdown counts as six points. After a touchdown, a kick through the uprights adds one point (7) and a successful two-point conversion play adds two points (8). Negative scores (a safety or a fumble or interception returned for a touchdown) are rare and do not appear in these data sets. First downs after a penalty (e.g., first and 15 or first and five, etc.) were disregarded, as these all originally began with the standard first and 10.

Four models were examined for predicting points from a first down. Models 1 and 2 use least squares to estimate coefficients. That is, coefficients were estimated to minimize the average squared difference between the observed value of y and the expected value. Models 3 and 4 use logistic regression. Again, the coefficients were estimated to minimize the average squared difference between the observed value of y and the expected value, but the computation algorithm is more involved. The models are as follows:

1. Quadratic regression: EP(x) = b0 + b1x + b2x².

2. Cubic regression: EP(x) = b0 + b1x + b2x² + b3x³.

3. Linear logistic regression with the response being one of three events, no score (0), a field goal (3), or a touchdown (7): EP(x) = 0·P(0) + 3·P(3) + 7·P(7), where

P(3) = P(Y = 3) = e^(β30 + β31x) / (1 + e^(β00 + β01x) + e^(β30 + β31x)),
P(0) = P(Y = 0) = e^(β00 + β01x) / (1 + e^(β00 + β01x) + e^(β30 + β31x)),
P(7) = 1 – P(3) – P(0).

For this model, we replaced actual points of 6, 7, or 8 with 7. There were actually very few such replacements, as there were few touchdowns that did not result in seven points.

4. Quadratic logistic regression: This is similar to Model 3, but there is a quadratic term in the exponential functions.

Figure 1. Graph of data points for Chicago's points vs. first down position (yards from the goal) in the 2005 regular season. Points are jittered for viewing repetitions.
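To make the fitting step concrete, here is a brief sketch of how Models 1 and 3 might be estimated from first-down data. The arrays `yards` and `points` are hypothetical stand-ins for a team's first-down observations (x = yards from the goal, y = points scored on the possession); the authors worked from www.nfl.com play-by-play data, and Model 3 is shown here with an off-the-shelf multinomial logistic fit rather than the authors' own algorithm.

```python
# Sketch: estimating EP(x) two ways from hypothetical first-down data.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical observations: yards from the goal and points scored (0, 3, or 7).
yards = np.array([5, 12, 20, 35, 48, 60, 72, 85, 95, 30, 55, 80])
points = np.array([7, 7, 3, 3, 0, 3, 0, 0, 0, 7, 0, 0])

# Model 1: quadratic regression EP(x) = b0 + b1*x + b2*x^2, fit by least squares.
b2, b1, b0 = np.polyfit(yards, points, deg=2)
def ep_quadratic(x):
    return b0 + b1 * x + b2 * x ** 2

# Model 3 (in spirit): a three-category logistic model for P(0), P(3), P(7),
# combined into EP(x) = 0*P(0) + 3*P(3) + 7*P(7).
logit = LogisticRegression(max_iter=1000)   # multinomial fit for the 3 classes
logit.fit(yards.reshape(-1, 1), points)
def ep_logistic(x):
    probs = logit.predict_proba(np.array([[x]]))[0]   # ordered by logit.classes_
    return float(np.dot(logit.classes_, probs))       # classes are 0, 3, 7

print(ep_quadratic(42), ep_logistic(42))
```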
With the logistic regression models, expected points were computed using the value 7 for the event of a touchdown, which was the actual result in nearly all cases. There was little difference between EP(x) for Models 1, 2, and 4. Figure 1 shows all of the first-down data for Chicago; the points have been jittered to display repetitions of cases. Three polynomial models for EP(x) and the almost identical fits of the quadratic and cubic models can be seen in Figure 2. Table 2 exhibits the best-fitting polynomial and logistic equations, along with their R² values. The interesting general consistency, but slight variation, in the expected fits is displayed in Table 3. We calculated the average points at intervals of five yards to see more clearly how the average points scored varied with the yards from the goal. These are displayed in Figure 3 with the chosen quadratic model. Because of its greater simplicity, we chose a quadratic regression model to estimate EP(x) for each team. Of course, different teams had different coefficients resulting from the least squares estimates when EP(x) was fit to their data. The other teams had similar results.

Table 2—Estimated Equations and R² Values for the Models of EP(x) for Chicago

Model                      Equation                                                                     R²
Linear                     EP(x) = –0.0481x + 4.4156                                                    0.1795
Quadratic (Model 1)        EP(x) = 0.0009x² – 0.1408x + 6.1255                                          0.2260
Cubic (Model 2)            EP(x) = –2E-6x³ + 0.0013x² – 0.154x + 6.2434                                 0.2262
Linear Logistic (Model 3)  P(0) = e^(–1.190+0.048x) / (1 + e^(–0.450+0.009x) + e^(–1.190+0.048x))       0.1849
                           P(3) = e^(–0.450+0.009x) / (1 + e^(–0.450+0.009x) + e^(–1.190+0.048x))
                           P(7) = 1 – P(0) – P(3)
Quadratic Logistic         P(0) = e^(–2.832+0.140x–0.001x²) / (1 + e^(–1.308+0.070x–0.0007x²) + e^(–2.832+0.140x–0.001x²))   0.2287
(Model 4)                  P(3) = e^(–1.308+0.070x–0.0007x²) / (1 + e^(–1.308+0.070x–0.0007x²) + e^(–2.832+0.140x–0.001x²))
                           P(7) = 1 – P(0) – P(3)
Figure 2. Graph of data points and models for Chicago's points vs. first down position (yards from the goal) in the 2005 regular season. Points indicate data. The solid, dotted, and dashed lines are the cubic, quadratic, and linear model fits.
Figure 3. Graph of mean points joined by solid lines at yard intervals noted on the x-axis and the fitted quadratic model boxes joined by the dashed lines for EP(x) for Chicago's first down data for the 2005 regular season.
Table 3—Mean of Actual Data Points, for Chicago, at the Indicated Intervals vs. the Model's Expected Points for the Same Interval Midpoint (2005 Regular Season)

Chicago       Count (n)   Mean    Quad Model   Cubic Model   Quadratic Logistic Model   Linear Logistic Model
Yards 1–2     8           6.375   5.934        6.034         5.766                      4.535
Yards 3–7     8           5.125   5.399        5.455         5.403                      4.365
Yards 8–12    13          5.462   4.823        4.840         4.942                      4.155
Yards 13–17   8           4.250   4.201        4.185         4.349                      3.886
Yards 18–22   15          2.400   3.649        3.614         3.761                      3.605
Yards 23–27   10          2.800   3.180        3.136         3.234                      3.328
Yards 28–32   15          3.400   2.772        2.727         2.772                      3.053
Yards 33–37   25          1.560   2.325        2.286         2.279                      2.710
Yards 38–42   19          3.053   1.986        1.955         1.923                      2.412
Yards 43–47   21          2.000   1.723        1.704         1.663                      2.156
Yards 48–52   26          1.423   1.469        1.463         1.426                      1.878
Yards 53–57   26          1.385   1.240        1.249         1.225                      1.591
Yards 58–62   25          0.760   1.093        1.112         1.100                      1.373
Yards 63–67   35          0.657   0.970        0.998         0.997                      1.143
Yards 68–72   44          1.568   0.911        0.941         0.943                      0.976
Yards 73–77   21          0.762   0.887        0.914         0.915                      0.805
Yards 78–82   35          0.486   0.915        0.929         0.925                      0.663
Yards 83–87   15          1.133   0.985        0.978         0.972                      0.551
Yards 88–92   10          0.300   1.115        1.071         1.076                      0.444
Yards 93–99   10          2.100   1.352        1.240         1.313                      0.340
Total         389

4-19-CHI 3 (12:34) B.Maynard punts 40 yards to CHI 43, Center-P.Mannelly. A.Randle El to CHI 42 for 1 yard (H.Hillenmeyer). PENALTY on PIT-S.Morey, Offensive Holding, 10 yards, enforced at CHI 42.

Figure 4. Example of a play-by-play situation from the December 11, 2005, game, Chicago vs. Pittsburgh. This is in the third quarter, as extracted from www.nfl.com.
Expected Net Yards for a Punt

When St. Louis punts to Green Bay from their own 42-yard line, the position from which Green Bay starts their next series of downs will vary with the effectiveness of the punt. The data for all the punts made by our five teams in 2005 were examined. Figure 4 is a typical example extracted from NFL.com for one play.
This is how you read the summary from Figure 4: On this fourth-down situation, Chicago punted 40 yards, and then Pittsburgh ran the punt reception back for one yard, bringing the net punt gain to 39 yards. A penalty was called (holding on the return team) for a loss of 10 yards. The net punt distance became 49 yards. Similarly, for each punt, the net punt distance was calculated as the distance from the starting punt position to the new punt position at the end of the play. The example above was an effective punt, with a net punt distance of 49 yards. Most punts are less effective, and we chose to simplify this part of the analysis by assuming all punts net the average punting effectiveness for that team. Because the quadratic model for EP(x) is nearly linear, there will be little change in the expected points after a punt with this assumption.
Expected Yards When Getting a First Down

If St. Louis goes for it and obtains a first down, what do they gain on average? They gain the opportunity to score points on this possession. They will have a first-and-10, some number of yards from the goal line. How many yards from the goal? That depends on how much yardage they gained, beyond the necessary two yards (recall it was fourth and two at their own 42 in our example). Data for all of St. Louis' successful attempts at a first down from third-down positions showed that, on average, they gained approximately eight more yards than the first-down marker. We use this value to estimate St. Louis' position after a successful first down. The analysis will change little if we use the detailed distribution, as EP(x) is nearly a linear function. In addition, the quadratic function EP(x) is convex, and so Jensen's inequality says this analysis understates, slightly, the value of a successful first-down attempt. A study looking at variability and expectations would need to address this issue more carefully. Unsuccessful attempts usually result in not much change from the current position and were not analyzed separately. That is, it is assumed an unsuccessful attempt delivers the football to the opposing team at the line of scrimmage, where the fourth-down play started.
Probability of a Fourth-Down Conversion

If the offense does not punt on fourth down, what is the probability of it successfully achieving a new first down? How do we answer this question? Teams rarely try for a new first down on fourth down, and thus not much data exist on fourth-down conversion attempts. Of course, teams always try for a new first down on third down. We decided to use success rates on third-down and fourth-down conversions together to model the probability of a successful conversion on fourth-down attempts. While defensive teams might try even harder to prevent fourth-down conversions (by risking longer gains in an all-out attempt to stop the conversion), they already usually align their defense to prevent third-down conversions. We believe the fourth-down conversion rates would be quite close to the third-down conversion rates.

The cases are again bivariate, with the explanatory variable being yards to go for a new first down and the response variable being success or failure. Because the response variable is binary, some logistic regression models were compared. The linear logistic model gave approximately the same estimated probabilities as the quadratic logistic model, and so was chosen for its greater simplicity. Figure 5 exhibits the quadratic logistic probability graph for Indianapolis, shown with the actual relative frequency of success.

Figure 5. Indianapolis third-down success: the relative frequency of successful first-down conversions by yards to go, with the fitted quadratic probability model.
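As an illustration of this step, the sketch below fits a linear logistic model for the probability of converting, as a function of yards to go, to hypothetical third- and fourth-down attempt data (1 = converted, 0 = stopped); the authors fit such models separately for each team from www.nfl.com play-by-play records.

```python
# Sketch: linear logistic model for P(conversion | yards to go), on made-up data.
import numpy as np
from sklearn.linear_model import LogisticRegression

yards_to_go = np.array([1, 1, 2, 2, 3, 4, 5, 6, 7, 8, 10, 12, 15, 2, 3, 9])
converted   = np.array([1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0,  0,  0,  1, 0, 1])

model = LogisticRegression()
model.fit(yards_to_go.reshape(-1, 1), converted)

# Estimated probability of converting fourth-and-two
p_two = model.predict_proba([[2]])[0, 1]
print(f"Estimated P(convert | 2 yards to go) = {p_two:.3f}")
```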
Comparing the Expected Points

Recall that we are questioning St. Louis' decision to punt on fourth and two from their own 42-yard line. For St. Louis, the average net punt is 34 yards. From St. Louis' perspective, if the punt nets 34 yards (the St. Louis average), then Green Bay will be 76 (42 + 34) yards from the goal and have EP(76) = 1.320 expected points. Punting puts St. Louis down, on average, 1.320 points.

If they go for it, they need two yards, and, if successful, they average eight additional yards, so, on average, a successful attempt gains 10 yards. This would leave them 100 – 42 – 10 = 48 yards from the goal. Their expected points from this position are 2.635. However, the previous analysis asked what Green Bay's scoring potential was when they received the ball after a punt. For proper comparison, we need to compare that with the net average points when Green Bay next receives the ball, regardless of what St. Louis does and how many points they score. Remember, we are considering scenarios in which the team on defense will have enough time to try to score as the offense. So, the gain from a successful fourth-down conversion has to be decreased by Green Bay's scoring potential on their next possession. Of course, we do not know where they will start that next possession. Assuming St. Louis does score a touchdown or field goal, the average position would be approximately the 25-yard line. The exact position is not so important, because EP(x) changes little when x is large. Green Bay's EP(75) is 1.338. Thus, St. Louis will achieve a gain by successfully making a first down of 2.635 – 1.338 = 1.297.

The linear logistic model for St. Louis shows their estimated probability of a successful fourth-down conversion, at two yards to go, is 0.576. So, St. Louis has an expected loss, by going for it, as follows: P(Failure)(Expected points for Green Bay 42 yards from the goal) – P(Success)(Expected points gained by a successful conversion) = (1 – 0.576)(2.548) – 0.576(1.297) = 0.333 (expected points behind the next time Green Bay has the ball). Recall that they expect to be down 1.320 points by punting. St. Louis gains an average of almost a point by the decision to go for it in this situation.

Notice that this analysis shows the correct choice is to go for it, and that the expected loss from punting is not even close to the expected loss from attempting to convert a first down. Yet, with the prevailing understanding of NFL games, St. Louis would be strongly criticized by every NFL expert for 'gambling' or 'not playing the percentages' or being 'wild risk takers' if they went for it and failed. Pundits (punt-its) might even say, "They should have gone with the percentages." But our analysis shows that to go for it is the percentage play, and many experts probably have never looked at any percentages. If they go for a first down, they have a reasonable chance (58%) of keeping the ball and thus scoring points with this possession. And their expectation goes from –1.320 to –0.333, or from clearly negative to almost even. The data show the decision to go for it is the "percentage play."
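The St. Louis calculation can be reproduced directly from the numbers quoted above. The expected-point values (EP(76) = 1.320, EP(48) = 2.635, EP(75) = 1.338, and EP(42) = 2.548 for Green Bay) and the conversion probability 0.576 are taken from the article; the small function below simply recombines them.

```python
# Reproducing the fourth-and-two comparison for St. Louis at its own 42-yard line,
# using the expected-point values and conversion probability quoted in the text.

def go_for_it_margin(p_convert, ep_opponent_after_fail, ep_success_net):
    """Expected points behind (positive = behind) the next time the opponent has the ball."""
    return (1 - p_convert) * ep_opponent_after_fail - p_convert * ep_success_net

ep_punt = 1.320                  # Green Bay's EP(76) after an average 34-yard net punt
ep_opponent_after_fail = 2.548   # Green Bay's EP(42) if the conversion attempt fails
ep_success_net = 2.635 - 1.338   # St. Louis' EP(48) less Green Bay's EP on its next possession
p_convert = 0.576                # estimated P(convert | 2 yards to go) for St. Louis

going = go_for_it_margin(p_convert, ep_opponent_after_fail, ep_success_net)
print(f"Expected points behind: punt = {ep_punt:.3f}, go for it = {going:.3f}")
# Going for it leaves St. Louis about 0.33 points behind on average versus 1.32 by punting,
# a gain of roughly one expected point.
```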
Comparison of Teams

We computed the values of expected points, EP(x), using each team's quadratic regression model. To obtain an idea of how optimal fourth-down choices vary from team to team, we chose to average our EP(x) values for the five teams studied when computing the 'other' team's scoring potential; in the following tables, EP(x) for the specific team is used when that team is on offense. Tables 4 and 5 show calculations similar to the one done for St. Louis with fourth and two at their own 42-yard line, but for all cases. The columns indicate yards necessary for a first down (1 to 20 yards), and the rows specify the yards from the goal line (30 to 99). Teams should not punt when they are within 30 yards of the goal line; the relevant decisions then are go for it or try a field goal. The values in the table cells are the differences between the expected points for punting and the expected points for attempting to get a first down. An X in a cell means the situation is not possible. Grey areas indicate situations in which the team should punt (positive expected difference, or, in other words, the expected loss for attempting a first down is larger than the expected loss for punting). Table 4 gives results for Chicago. Table 5 gives results for St. Louis. Results for the other three teams used in the analysis can be found at www.amstat.org/publications/chance.

Interestingly, our results show that even a poor offensive team such as Chicago should go for a first down more often than they actually do. For example, on fourth and one at midfield (50 yards from the end zone), this analysis shows Chicago should go for a first down. Intuitively, this may be more obvious than NFL coaches seem to realize. They have an estimated probability of success of 0.4747, about 50/50. So essentially, they are taking a 50/50 shot at having nearly the same situation as their opponent, which should favor neither team on average. So going for it is not disadvantageous, and punting produces a disadvantage.

Offensively strong and defensively weak teams such as St. Louis should go for a first down even more often. These teams have high expected points when they have the ball and high expected points for the opponent when the opponent has the ball. They also have higher probabilities of successfully converting a fourth-down attempt for a new first down. They should try to keep the ball.
Table 4—Expected Difference in Points for Going for It vs. Punting on Fourth Down for Chicago

Note: Yards from the end zone and yards to go for the first down are on axes. Points are more for going for it in the grey area. Impossible situations are marked with X.

Table 5—Expected Difference in Points for Going for It vs. Punting on Fourth Down for St. Louis

Note: Yards from the end zone and yards to go for the first down are on axes. Points are more for going for it in the grey area. Impossible situations are marked with X.

Summary

Our analysis shows there are many situations in which the correct decision on fourth down is to go for a first down. This analysis assumed it was early in the game and the correct decision would be determined by expected points. Coaches may be more comfortable with punting when the expectation analysis shows they gain a small amount by trying to keep the ball. After all, if the team fails on an attempted fourth down, the coach will likely receive some criticism. So, for factors beyond those considered in our analysis, they may want to widen the grey areas to punt. However, even allowing a margin of, say, 0.25 or 0.5 points, the coach could make a better decision in many circumstances by going for it.

Our analysis only applies when the game is not close to finished. Near the end of the game, models for expected points should be replaced by models for the probabilities of particular results (no score, a field goal, or a touchdown), probabilities that we have modeled in our analysis. At the beginning of the season, teams would be without the data we used to analyze fourth-down decisions to go for it. To use our analyses, decisionmakers might try to find the team among our five most similar to their own with regard to offensive and defensive strength. As their season progresses, they could then use the data from the current season.

Similar analyses to compare kicking a field goal with going for a first down were done by Jennifer Wright in her master's thesis at California State University. Again, she found that the decision to go for a first down or touchdown, rather than kick a field goal, should be made more often than it is.
Further Reading

Agresti, A. (2002), Categorical Data Analysis (2nd Edition), John Wiley & Sons.
Bartshe, P. (2005), "An NFL Cookbook: Quantitative Recipes for Winning," STATS, 6:12–13.
Myers, R. (2000), Classical and Modern Regression with Applications (2nd Edition), Duxbury Press.
Sackrowitz, H. (2000), "Refining the Point(s)-After-Touchdown Decision," CHANCE, 13:29–34.
Stern, H. (1998), "Football Strategy: Go for It!" CHANCE, 11:20–24.
Theismann, J., and Tarcy, B. (2001), The Complete Idiot's Guide to Football (2nd Edition), Alpha Books.
NFL Football Data, www.nfl.com/stats/2005/regular.
Application of Machine Learning Methods to Medical Diagnosis

Michael Cherkassky
The technological boom in recent years has come with great advances in medical technologies. New technologies, such as MRI (magnetic resonance imaging) and ECG (electrocardiogram), enable better understanding of the functions and malfunctions of the human body. However, technological progress adds pressure on medical professionals, who are now faced with an influx of data to interpret. Moreover, inundated with data, doctors may be more prone to mistakes. Misdiagnosis, though seemingly rare, is actually quite common. A recent study of Patient Safety in American Hospitals by HealthGrades found that per 1,000 patients, more than 150 were classified as "failure to rescue," meaning a failure to diagnose correctly or in time. Twenty percent of patients in the emergency department (ED) are misdiagnosed, according to reports at http://wrongdiagnosis.com. In addition, John Davenport, in his paper "Documenting High-Risk Cases to Avoid Malpractice Liability," observes that the majority of the cases of misdiagnosis occurs in serious diseases such as breast cancer, appendicitis, and colon cancer. Clearly, from these statistics, it can be concluded that misdiagnosis is a significant problem. Thus, it is necessary to find an efficient, unbiased, and accurate method for diagnosis. Machine learning computer aided diagnostics (CAD) provides a realistic solution to the problem of misdiagnosis. Though the intuition of a human can never be replaced, machine learning can provide a useful secondary opinion to help in the diagnosis of a patient. In CAD applications, empirical data from various medical studies are used to estimate a
predictive diagnostic model, which can be used for diagnosing new patients. The simplest type of predictive problem is classification. In a classification setting, the goal is to estimate a model that classifies patients’ data into two classes (e.g., healthy and sick) based on available features of each patient. The input features may include clinical data (e.g., the number of lymph nodes, results of a blood test), demographic data (e.g., age and sex), genomic data, etc. Often, the number of input variables d is large, say, d =10 inputs or 100 inputs, and can even reach hundreds of thousands of variables in genomic data. Current medical technologies, such as heart monitors, take numerous readings on several characteristics each minute. The scenario here is limited, but future work would explore several of the topics considered previously. Due to the large number of input variables, designing a classifier amounts to estimating a decision boundary in a high-dimensional space based on available diagnostic data about past patients. This is a difficult task for medical doctors because (a) humans have no intuition of working in/ visualizing a high-dimensional input space and (b) there may be many models (decision boundaries) that explain available historical data equally well, but have different prediction accuracies. Machine learning methods address both problems and enable estimation of statistically reliable classification models for medical diagnosis. Once such a predictive model is estimated, it can be used in future diagnosis (classification) of new patients. Prediction refers to the CAD model assigning the diagnosis to a new patient, based on the values of input features for this patient. Two binary classification methods are k-nearest neighbors (kNN) and support vector machine (SVM) classifier. The kNN method is a simple classical method based on the intuitive idea of classifying a new patient based on his/her similarity to other patients (with known classification labels). The SVM method is a more recent technology that has become widely used since the late 1990s. It is based on a solid theoretical foundation of statistical learning theory and uses a new concept (i.e., margin) to control the prediction accuracy.
Statistical Learning Methods

The field of statistical learning or pattern recognition studies the process of estimating an unknown (input, output) dependency or structure of a system from a finite number of (input, output) samples (i.e., observations or cases). Learning methodologies have been of growing interest in modern science and engineering when the underlying system under study is unknown or too complex to be mathematically described. Machine learning can estimate a 'useful' model to characterize the unknown system using available data. The estimated model
is expected to have good prediction accuracy for the future data. Learning is the process of estimating an unknown (input, output) dependency or structure of a system using a limited number of observations. The general scenario for machine learning involves three components (see Figure 1): a generator of random input samples, a system that returns an output for a given input vector, and a learning method that estimates an unknown (input, output) mapping of the system from the observed samples (x_i, y_i), i = 1, …, n. Here, the input vector x denotes the patient's characteristics relevant for diagnosis (classification), and y denotes the class label. We only consider applications with two possible classes (so-called binary classification). Note that the observed or training data have classification labels. However, the future (test) data are unlabeled and have to be classified by a learning method.

Figure 1. General Setting for Machine Learning

The learning method is a computer algorithm that implements a set of possible models (or a set of functions) f(x, ω), ω ∈ Ω, describing the unknown system. This set of functions is parameterized, and Ω denotes the set of parameters used to index the set of functions. The learning method then attempts to select the 'best' predictive model f(x, ω) from this set of functions, using only the available data samples. For classification problems used in this study, the quality of a model is measured as its error rate (for future samples). That is, for a given input x, if a model correctly predicts the class label y, then its error is zero; if it makes an incorrect prediction, its error is one. The prediction error rate is the fraction of incorrectly classified future samples over the total number of future samples. The main problems are that the future or test data are unknown and the model has to be estimated using only a finite number of training samples.

Statistical learning does not necessarily use statistical models as they would be presented in an introductory statistics course. The procedures in statistical learning can be defined by algorithms with necessary values (e.g., parameters, tuning constants) being estimated using available data. Whereas a model in the world of statistics corresponds to a statistical model that relies on probability distributions, the procedures (or models) in statistical learning are algorithmic and do not have to correspond to underlying statistical models. Occasionally, algorithms in statistical learning incorporate statistical models (e.g., mixture models), but at other times the predictions have little or no statistical interpretation. In any case, researchers in statistical learning theory refer to their procedures as models and their tuning constants as parameters.

k Nearest Neighbors Method

kNN is a simple method of classification based on the notion that, for a given input x, its estimated class label should be similar to the class labels of its 'neighbors,' or surrounding points. So the new (unlabeled) input can be classified by a majority vote of its neighbors from the (labeled) training set. Intuitively, this method makes sense. For example, if the new patient is 50 years old, female, and has five cancerous nodes, she probably should have the same diagnosis as a woman who is 49 years old with six cancerous nodes. One critical consideration is how to measure distance in X-space. In this method, the similarity between input samples is measured as the Euclidean distance (between these samples)
in the input space. However, there are many ways to quantify distance, including statistical, or Mahalanobis, distance. Moreover, each input variable is normalized from zero to one to prevent one variable from 'outweighing' another. The main practical issue in the design of kNN classifiers is a proper selection of the value of k, which usually depends on the training data. However, as the number of input variables (the dimensionality of the input space X) increases, the concept of k nearest neighbors can become problematic. As dimensionality increases, the kNN method should become less accurate because points are likely to be more dispersed in high dimensions. This phenomenon is known as "the curse of dimensionality" in statistics.

Support Vector Machine Classifier

The SVM method was introduced for estimating a linear decision boundary with high prediction accuracy. It uses a new concept in binary classification, the margin. For example, in Figure 2(a), there are many linear decision boundaries that classify (separate) the available data into two classes equally well (with 100% accuracy). However, these linear decision boundaries will have different prediction accuracy (for future samples). It can be shown that the best decision boundary has the maximum separation margin between the training samples from the two classes (as shown in Figure 2(b)). Therefore, the optimal model (decision boundary) would be the model that has the maximum distance between the decision boundary and the nearest points from each class.

Figure 2(a). Multiple linear decision boundaries separating the data with zero error

Figure 2(b). Linear decision boundary with maximum margin

The motivation for achieving the maximum margin is relatively simple. If the margin of the decision boundary is small, there is large variability in terms of how many ways the model can separate the data. However, if the margin is large, there is only one way to separate the data. Because of such lower variability, the large-margin model is less prone to random fluctuations of the training data and thus yields higher prediction accuracy (for future data). Note that the concept of margin is independent of dimensionality, so a large-margin decision boundary can guarantee good generalization, even for high-dimensional data. Vladimir Vapnik, in The Nature of Statistical Learning Theory, extended the idea of margin to situations where the training data are not separable and the decision boundary is nonlinear. These extensions have resulted in the (nonlinear) SVM methodology for classification, where nonlinear models are implemented via so-called kernel mapping. Nowadays, many publicly available software implementations of SVM exist, and this study used MATLAB-based software. When applying SVM classification software to available data, one has to specify two SVM parameters: the parameter C that controls the margin size and a kernel parameter that controls the degree of nonlinearity (of the decision boundary). This study uses the radial basis function (RBF) kernel, with a single parameter Sigma denoting the RBF width. Proper tuning of the SVM parameters C and Sigma for a given data set is analogous to the optimal selection of k in the kNN method.
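The author tuned k, C, and Sigma with MATLAB-based software; the sketch below illustrates the same kind of grid search with cross-validation using scikit-learn on a small synthetic data set, so the data, the parameter grids, and the library are stand-ins rather than the study's actual setup. (In scikit-learn's RBF parameterization, gamma plays the role of 1/(2·Sigma²).)

```python
# Sketch: tuning k for kNN and (C, gamma) for an RBF SVM by cross-validation.
# Synthetic data and parameter grids are illustrative stand-ins for the study's setup.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X = MinMaxScaler().fit_transform(X)   # scale each input to [0, 1], as in the article

knn_search = GridSearchCV(KNeighborsClassifier(),
                          {"n_neighbors": [1, 5, 15, 30, 50]}, cv=15)
svm_search = GridSearchCV(SVC(kernel="rbf"),
                          {"C": [1, 10, 20], "gamma": [0.01, 0.1, 1.0]}, cv=15)

knn_search.fit(X, y)
svm_search.fit(X, y)
print("best k:", knn_search.best_params_, "CV error:", 1 - knn_search.best_score_)
print("best SVM:", svm_search.best_params_, "CV error:", 1 - svm_search.best_score_)
```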
Description of Data Sets

I used three publicly available data sets from the UCI Machine Learning Repository. Preprocessing the data included scaling each input variable to the same range (zero to one) and organizing the data into MATLAB format. Scaling was necessary to make sure one input variable did not dominate other input variables because of its much larger values.

Haberman's Survival Data Set

Haberman's data set contains data from studies conducted on the survival of patients who have undergone surgery for breast cancer. The studies were conducted between 1958 and 1970 at the University of Chicago's Billings Hospital. The data set offers three numerical input variables: age of patient at time of operation, patient's year of operation, and number of positive axillary nodes detected. I disregarded the second input variable (patient's year of operation) for this study because it has very low association with survival and most likely would not have any influence on the output. The output, or class attribute, represents whether a patient survived or died within five years of the operation. There are a total of 306 samples; 81 are labeled as dead and 225 as alive. The correlation coefficients between the two inputs and the output are shown in Table 1.

Statlog Heart Disease Data Set

The Statlog Heart Disease data set contains 270 patient records, where each record (sample) has 13 input variables (see Table 2) and an output (diagnosis) indicating whether the patient has heart disease or is normal. Input variables include age, sex, blood pressure, cholesterol level, etc. Inputs 1, 4, 5, 8, 10, and 12 are real valued. Inputs 2, 6, and 9 are categorical. Inputs 3, 7, 11, and 13 are ordinal (i.e., high, medium, low). There are a total of 270 samples in this data set; 120 are positive and 150 are negative. Correlation coefficients between each of the inputs and the output are shown in Table 2.

Wisconsin Diagnostic Breast Cancer Data Set

The Wisconsin Diagnostic Breast Cancer data set includes results of a medical test for 599 female patients suspected of having breast cancer. Each patient record has 30 input variables computed from a fine needle aspirate (i.e., a diagnostic procedure used to investigate lumps and tumors) of a breast mass conducted in November of 1995. The input variables describe characteristics of the cell nuclei present in the breast, including radius, area, symmetry, etc. (see Figure 3). The output is a diagnosis of whether the cells are benign or malignant. There are a total of 599 data points, of which 375 samples are classified as benign instances and 212 as malignant instances. All the input variables are real valued.

Experimental Procedure

In the problem of estimating a classifier from available data (i.e., training data), we are faced with two goals: (1) an accurate explanation of the training data and (2) good generalization for future data. All modeling approaches implement some sort of data fitting, but the true goal of modeling is prediction. The trick to finding the optimal model is balancing these goals. For example, a model may be great at explaining the training data but can be poor in generalization for future data. This problem is addressed using complexity control, which amounts to choosing an optimal model complexity for a given (training) data set. For example, as k increases, the complexity of the kNN model decreases, and vice versa. Likewise, increasing the SVM Sigma will decrease the complexity of the SVM model. An optimal model complexity should help generalization for future data.

Resampling

An effective tool for implementing complexity control (i.e., model selection) is resampling, or cross-validation. An extremely complex model may successfully classify all the training data correctly, but it will usually yield poor generalization for future data. A solution lies in cross-validation, which partitions the available data into training and validation sets. Thus, we can use the training set for model estimation and the validation set to validate our model. Next, we change the model complexity and repeat the prior steps, and finally select the model that provides the lowest prediction error for validation data. This approach solves the dilemma of overfitting because the most complex model (low training error) will not necessarily provide the best validation error. However, the results are sensitive to how the data is split, thus it is necessary to partition the data randomly.

A specific type of cross-validation, illustrated in Figure 4, is called M-fold cross-validation, which involves dividing the available data set Z of size n into M randomly selected separate subsets of size n/M each. Then, one subset is left out and the remaining M-1 subsets are used to estimate the model. The prediction error for this model is estimated using the left-out subset. This is repeated for M partitions (or M folds) to find the average prediction error. In model selection, we would test many parameters and select the ones that gave the least cross-validation error. A special case of M-fold cross-validation (with M = n) is called leave one out (LOO) cross-validation. In LOO cross-validation, each left-out subset includes only one point; the remaining n-1 data points are used for model estimation. In large data sets, LOO cross-validation and higher-fold cross-validation (i.e., 10–15 fold) would yield the same results. However, in smaller data sets (<100), LOO and 10–15 fold cross-validation would give different results. Ideally, LOO cross-validation would be used for every analysis; however, LOO cross-validation is computationally demanding, so we resort to higher-fold cross-validation.

Model Selection

Model selection refers to optimal tuning of model parameters for each method for each data set. We used LOO cross-validation for choosing an optimal value of k for kNN. We chose this particular implementation of resampling because the kNN algorithm is relatively simple to implement and program. We used five values for the parameter k and selected the one that returned the smallest resampling error. If the optimal selected value of k was an extreme of the five values, then I tested five new values until I was sure the optimal value of k yielded the smallest cross-validation error.

Parameter tuning for the SVM algorithm was a little different than for kNN. Because SVM is such a complex algorithm, I decided to use a 15-fold cross-validation to minimize the computing time (rather than the LOO cross-validation used for kNN). Each 15-fold cross-validation randomly partitions the data into 15 subsets. Similar to the procedure above, I tuned the parameters Sigma and C until I found the model with the least cross-validation error.
Table 1—Correlation Coefficients Between Input Variables and Output for Haberman's Data Set

Input Variable     Correlation Coefficient
Age                0.06795
Number of Nodes    0.28677
Table 2—Correlation Coefficients Between Input Variables and Output for Statlog Heart Disease Data Set

Input Number   Input Variable                                                 Correlation Coefficient
1              Age                                                            0.21
2              Sex                                                            0.29
3              Chest Pain Type (4 values)                                     0.42
4              Resting blood pressure                                         0.16
5              Serum cholesterol                                              0.12
6              Fasting blood sugar                                            –0.02
7              Resting electrocardiographic                                   0.18
8              Maximum heart rate                                             –0.41
9              Exercise-induced angina                                        0.42
10             (Oldpeak) ST depression induced by exercise relative to rest   0.42
11             Slope of the peak exercise ST segment                          0.34
12             Number of major vessels (0–3) colored by fluoroscopy           0.45
13             Thal: 3=normal; 6=fixed defect; 7=reversible defect            0.52
Figure 3. Breast cells compiled from fine needle aspiration
Figure 4. Partitioning the data into training and validation sets via five-fold cross-validation
Table 3—Selection of Optimal Value of k for Haberman's Data via LOO Cross-Validation

k     Error (%)
1     43.13
3     31.70
7     25.82
15    24.51
30    25.49
45    22.88
47    22.55
50    22.22
53    23.52
57    24.51
60    24.51
100   26.47
Testing: Two Rounds of Cross-Validation

Testing the data refers to comparing the effectiveness of kNN and SVM. If we use all our data to choose an optimal model for each method and data set, we cannot infer the generalization accuracy of these models because we cannot use previously 'seen' data for testing. Therefore, in testing this data, I used two rounds of cross-validation. The first cross-validation was a five-fold cross-validation, which split the data into test sets and training sets. For Haberman's data, this partitioning was deliberately not random, but ordered as follows:

Test Data for Fold 1: samples # 1, 6, 11, 16…
Test Data for Fold 2: samples # 2, 7, 12, 17…
Test Data for Fold 3: samples # 3, 8, 13, 18…
etc.

Because this data was ordered by age, I thought each fold would provide an accurate representation of the data. When the first cross-validation was finished, I had five partitions of available data in training and test sets. The second level of cross-validation was used to select optimal tuning parameters for each method, separately for each of the five training data sets. Once the optimal model for the specific training set was found, the model was tested using the test set (formed in the first round of cross-validation). This was done five times, and the test error of the method (either kNN or SVM) was computed by averaging the test errors for five test data sets.
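A compact way to express this two-round scheme is nested cross-validation: an outer five-fold split estimates the test error, while an inner cross-validation tunes the parameters on each outer training set. The sketch below shows the idea with scikit-learn on placeholder data; it uses random outer folds rather than the ordered split described above.

```python
# Sketch of the two rounds of cross-validation (nested CV): the outer loop
# estimates test error, the inner loop (inside GridSearchCV) tunes parameters.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=306, n_features=2, n_redundant=0,
                           random_state=0)   # placeholder data

inner_tuning = GridSearchCV(KNeighborsClassifier(),
                            {"n_neighbors": [5, 15, 30, 50]}, cv=15)
outer_folds = KFold(n_splits=5, shuffle=True, random_state=0)

test_accuracy = cross_val_score(inner_tuning, X, y, cv=outer_folds)
print("estimated test error (%):", 100 * (1 - test_accuracy.mean()))
```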
Results

After the data were preprocessed, the two methods (kNN and SVM) were applied to the three medical data sets. The results (for both learning methods) try to address two distinct questions:

1. What are the optimal tuning parameter(s) of a learning method (for each data set)? These 'optimal' parameter values result in an estimated model with high generalization for future data. Selection of optimal model parameters was based on the smallest cross-validation error using resampling techniques.

2. What is the true prediction accuracy (or test error) of each learning method? This compares the two methods in terms of their generalizability and selects the 'best' learning method for each data set. Estimation of the test error for each method was based on using resampling techniques.

Haberman's Breast Cancer Data

Model selection results for Haberman's data set are presented in Figure 5. The decision boundary for kNN using the value k=5 results in the complex (wiggly) decision boundary shown in Figure 6. In each figure, the variables are standardized from 0 to 1. The ranges for age and number of cancerous nodes are 30–83 and 0–52, respectively. The optimal k for this data using the kNN algorithm was k=50, yielding a smallest LOO cross-validation error of 22.22% (see Table 3). This optimal decision boundary is shown in Figure 7. Model selection results for the SVM method are summarized in Table 4. An optimal model with Sigma=0.5 and C=10 yields a cross-validation error of 25.19%. This SVM model is graphically displayed in Figure 8. The test error of each method was estimated via five-fold cross-validation. The kNN algorithm had an average test error of 22.5% (SE 9.3%), whereas the SVM algorithm had an average test error of 21.5% (SE 4.2%), as shown in Table 5.

Statlog Heart Disease Data

For the 10-dimensional Statlog data, the optimal k for the kNN algorithm was k=47, yielding a smallest LOO cross-validation error of 15.56% (see Table 6). The optimal SVM model had optimal parameters found as Sigma=35, C=30, yielding a cross-validation error of 16.30% (see Table 7). The test error of each method was estimated via five-fold cross-validation. The kNN algorithm had an average test error of 16.3% (SE 4.0%), whereas the SVM algorithm had an average test error of 15.9% (SE 4.5%), as shown in Table 8.

Wisconsin Diagnostic Breast Cancer Data

For this 30-dimensional data set, the optimal value of k for the kNN method was k=15, yielding a smallest LOO cross-validation error of 2.46% (see Table 9). The optimal parameters for the SVM method were found as Sigma=30, C=100, yielding a cross-validation error of 4.40% (see Table 10). The test error of each method was estimated via five-fold cross-validation. The kNN algorithm had an average test error of 3.69% (SE 1.69%), whereas the SVM algorithm had an average test error of 3.20% (SE 1.70%), as shown in Table 11.
Figure 5. Haberman's breast cancer data: 306 samples, 81 dead and 225 alive (number of nodes vs. age)
Figure 6. kNN classifier decision boundary for Haberman's data, k=5 (number of nodes vs. age)
Figure 7. kNN classifier decision boundary for Haberman's data, k=50 (number of nodes vs. age)
Figure 8. Nonlinear SVM decision boundary for Haberman's data. SVM parameters: RBF kernel with Sigma=0.25, C=10.
Table 4—Selection of Optimal SVM Parameters for Haberman's Data via 15-Fold Cross-Validation

Sigma   C    Error (%)
0.1     1    25.75
0.1     10   25.77
0.1     20   26.07
0.5     1    26.81
0.5     10   25.19
0.5     20   26.16
1       1    26.81
1       10   26.16
1       20   25.84
5       1    26.48
5       10   26.19
5       20   26.51

Table 7—Selection of Optimal SVM Parameters for Statlog Data via 15-Fold Cross-Validation

Sigma   C    Error (%)
35      10   16.67
35      20   17.04
35      30   16.30
40      10   16.67
40      20   17.04
40      30   16.67
45      10   17.04
45      20   17.04
45      30   17.04
Table 5—Estimation of Prediction Accuracy of kNN and SVM Classifiers for Haberman's Data via Five-Fold Cross-Validation

        kNN              SVM
Fold    k    Error (%)   Sigma   C    Error (%)
1       40   24.19       .06     1    20.39
2       40   11.48       0.1     10   18.03
3       40   16.39       0.5     10   21.31
4       40   21.31       1.0     1    27.87
5       30   39.34       .04     10   19.67
Average kNN Error = 22.5        Average SVM Error = 21.5

Table 8—Estimation of Prediction Accuracy for kNN and SVM Classifiers for Statlog Data via Five-Fold Cross-Validation

        kNN              SVM
Fold    k    Error (%)   Sigma   C    Error (%)
1       55   11.11       40      30   11.11
2       40   16.67       35      20   18.52
3       60   22.22       30      30   22.22
4       40   14.81       45      30   12.96
5       30   16.67       30      30   14.81
Average kNN Error = 16.3        Average SVM Error = 15.9

Table 6—Selection of Optimal Value of k for Statlog Data via LOO Cross-Validation

k    Error (%)
1    24.44
5    20.00
10   19.63
15   18.52
30   17.04
35   16.30
40   16.30
47   15.56
50   16.30
60   17.41
75   16.67

Table 9—Selection of Optimal Value of k for Wisconsin Breast Data via LOO Cross-Validation

k    Error (%)
1    4.75
3    2.99
5    3.34
7    2.99
15   2.46
17   2.64
20   3.16
25   4.04
30   3.87
40   4.57
50   4.51
Table 10—Selection of Optimal SVM Parameters for Wisconsin Breast Data via Fifteen-Fold Cross-Validation

Sigma   C     Error (%)
30      1     37.28
30      20    6.51
30      50    4.93
30      75    4.58
30      100   4.40
50      1     37.28
50      20    16.01
50      50    7.74
50      75    6.15
50      100   5.80
70      1     37.28
70      20    33.07
70      50    12.13
70      75    8.97
70      100   7.21
Table 11—Estimation of Prediction Accuracy for kNN and SVM Classifiers for Wisconsin Breast Data via Five-Fold Cross-Validation

        kNN              SVM
Fold    k    Error (%)   Sigma   C     Error (%)
1       20   6.14        20      125   4.39
2       10   3.51        15      125   1.75
3       5    1.75        15      200   0.88
4       10   2.63        15      200   4.39
5       10   4.42        15      150   2.65
Average kNN Error = 3.69        Average SVM Error = 3.20
From these results, we can see that:

• The SVM method yields better (smaller) test error than the kNN method for all three data sets. However, the performance of the kNN method was surprisingly good for the two high-dimensional data sets used in this study.

• The cross-validation error of kNN was consistently lower than that of the SVM method; however, this has not translated into a lower test error. This observation confirms the premise that the small cross-validation error (achieved for tuning model parameters) is a poor measure of the test error.
Summary

This study made heavy use of resampling techniques for tuning parameters of learning methods, as well as for estimating their prediction accuracy. My results show that accurate medical diagnostic models can be obtained using standard machine learning methods. Specifically, the SVM method diagnosed patients from Haberman's Data Set with a 78.5% accuracy rate, from the Statlog Heart Disease data set with 84.1% accuracy, and from the Wisconsin Diagnostic Data Set with 96.8% accuracy. However, proper application of machine learning methods to real-life data requires proper preprocessing of the data and careful use of resampling techniques for estimating prediction accuracy (or test error) of diagnostic models. This study used prescaling of all inputs to the same range, and the preprocessing yielded good predictive models for both learning methods. Without such preprocessing, both the kNN and SVM methods would have produced inferior predictive models. Overall comparisons suggest that SVM is a better (more robust) machine learning method than kNN.

However, my research also brings up some questions. It was unexpected that the kNN method would perform so well in comparison to the SVM algorithm. It is also puzzling that the kNN was very accurate for high-dimensional inputs. In the future, I hope to investigate why kNN performed so well, as well as apply many more learning algorithms to diagnostic data. Another interesting project would be to get more diagnostic data sets for different diseases. With autism being such a prevalent condition in young children, I think an innovative research project would be to investigate the relationship between childhood immunization shots and the severity of autism.
Further Reading

Asuncion, A., and Newman, D.J. (2007), UCI Machine Learning Repository, www.ics.uci.edu/~mlearn/MLRepository.html.
Davenport, J. (2000), "Documenting High-Risk Cases to Avoid Malpractice Liability," Family Practice Management.
Duda, R.O., Hart, P.E., and Stork, D.G. (2001), Pattern Classification (2nd Edition), New York: Wiley.
Hand, D.J., Mannila, H., and Smyth, P. (2001), Principles of Data Mining, Cambridge, MA: MIT Press.
Hastie, T., Tibshirani, R., and Friedman, J. (2001), The Elements of Statistical Learning: Data Mining, Inference and Prediction, New York: Springer.
Mangasarian, O.L., and Wolberg, W.H. (1990), "Cancer Diagnosis via Linear Programming," SIAM News, 23(5):1, 18.
Patient Safety in American Hospitals (July 2004), HealthGrades Quality Study, HealthGrades, www.healthgrades.com/media/english/pdf/HG_Patient_Safety_Study_Final.pdf.
Vapnik, V. (1995), The Nature of Statistical Learning Theory, New York: Springer.
Wolberg, W.H., and Mangasarian, O.L. (1990), "Multisurface Method of Pattern Separation for Medical Diagnosis Applied to Breast Cytology," Proceedings of the National Academy of Sciences, 87:9193–9196.
Visual Revelations
Howard Wainer,
Column Editor
A Centenary Celebration for Will Burtin: A Pioneer of Scientific Visualization
2008 marks the 100th anniversary of the birth of Will Burtin (1908–1972), one of the foremost graphic designers of the 20th century. During his career, he had an enormous influence on the character of modern design, and more specifically to the point of this column, he was an early developer of what has come to be called scientific visualization.

Burtin was born in Ehrenfeld, a suburb of Cologne, the only son of August and Gertrude Bürtin. He successfully began his career in Germany, despite the dismal economic conditions of Germany after World War I and the Great Depression. In the summer of 1938, he fled Germany with his Jewish wife, Hilde Munk (1910–1960), after the rise of the Nazis. One event that must have weighed heavily in the timing of his decision to leave was Josef Goebbels' 1937 request that he become the design director of the Propaganda Ministry. This request led his wife to ask her American cousin, Max Munk, to sponsor their immigration to the United States. His sponsorship led to permission that arrived just in time, because in 1938, Adolf Hitler repeated Goebbels' request and Burtin could stall no longer.

His departure from Germany was also a departure from most things German; Burtin adamantly refused to speak German. In 1946, he visited Albert Einstein as part of his research for a Fortune article, "The Physics of the Bomb." Einstein was then actively trying to convince the world of the dangers of nuclear weapons that were not under international control. He would not speak English with Burtin, and Burtin would not speak German with him, so the interview was conducted bilingually.

Despite his limited English, Burtin was an almost instantaneous success. Within months of his arrival, he won a contract to design the Federal Works Agency Exhibition for the U.S. Pavilion at the New York World's Fair. By 1939, he had designed the cover for the World's Fair issue of The Architectural Forum magazine, which won the Art Directors' Club medal for cover design. Thus began a rich career, which included a long relationship with Upjohn Pharmaceuticals. During this time,
German-born American graphic designer Will Burtin poses for a portrait in front of a display of his work in the fields of science and medicine, USA, 1950s. (Photo by Arnold Newman/Getty Images)
he was responsible for the design of much of the content of Upjohn’s magazine, Scope, which was focused on communicating technical material to physicians. CHANCE
51
Table 1—The Effectiveness of Three Antibiotics Against 16 Bacteria Shown as Minimum Inhibitory Concentration (µg/ml)

Bacteria                            Penicillin   Streptomycin   Neomycin   Gram Staining
Aerobacter aerogenes                   870            1            1.6       negative
Brucella abortus                         1            2            0.02      negative
Brucella anthracis                       0.001        0.01         0.007     positive
Diplococcus pneumoniae                   0.005       11           10         positive
Escherichia coli                       100            0.4          0.1       negative
Klebsiella pneumoniae                  850            1.2          1         negative
Mycobacterium tuberculosis             800            5            2         negative
Proteus vulgaris                         3            0.1          0.1       negative
Pseudomonas aeruginosa                 850            2            0.4       negative
Salmonella (Eberthella) typhosa          1            0.4          0.008     negative
Salmonella schottmuelleri               10            0.8          0.09      negative
Staphylococcus albus                     0.007        0.1          0.001     positive
Staphylococcus aureus                    0.03         0.03         0.001     positive
Streptococcus fecalis                    1            1            0.1       positive
Streptococcus hemolyticus                0.001       14           10         positive
Streptococcus viridans                   0.005       10           40         positive
The Cell
The 1950s were transformational years in biology and, more specifically, 1953 was an annus mirabilis. In this one remarkable year, James Watson and Francis Crick published their famous paper on the double helix structure of DNA. In the same issue of Nature, Maurice Wilkins, Alex Stokes, and Herbert Wilson published a paper providing the X-ray crystallographic evidence to support Watson and Crick, and, in that same issue, Rosalind Franklin and Ray Gosling added further support and suggested that the phosphate backbone of the DNA molecule lies on the outside of the structure. A week later, in the next issue of Nature, Watson and Crick added detailed speculation on how the base pairing in the double helix allows DNA to replicate. Information about detailed cellular structure poured from the literature, but because it was hard to visualize, it was hard to integrate. Burtin convinced Upjohn president Jack Gauntlett that it would be worthwhile to fund the construction of a giant (24 feet across and 12 feet high) model of a human red blood cell that would provide all the details thus far known about it. It contained structures seen in electron microscopes, but not yet explained. It embodied, on a grand scale, modern scientific visualization. Burtin’s cell was unveiled in San Francisco at the 1958 meeting of the American Medical Association. It was the star of the convention and, subsequently, traveled widely.
My thanks to Stephen Clyman and Steve Goodman for helping track down the meaning of the dependent variable in Burtin’s antibiotic graph and to Editha Chase for helping put everything together.
Impact of Three Antibiotics on a Variety of Bacteria
In the post–World War II world, antibiotics were called “wonder drugs,” for they provided quick and easy cures for what had previously been intractable diseases. Data were being gathered to aid in learning which drug worked best for which bacterial infection. Being able to see the structure of drug performance from outcome data was an enormous aid for practitioners and scientists alike. In the fall of 1951, Burtin published a graph showing the performance of the three most popular antibiotics on 16 bacteria. The data used in his display are shown in Table 1. The entries of the table are the minimum inhibitory concentration (MIC), a measure of the effectiveness of the antibiotic. The MIC represents the concentration of antibiotic required to prevent growth in vitro. The covariate “gram staining” describes the reaction of the bacteria to Gram staining. Gram-positive bacteria are those that are stained dark blue or violet; Gram-negative bacteria do not react that way.
Figure 1. Will Burtin’s diagram comparing the impacts of penicillin, streptomycin, and neomycin on a range of bacteria (Scope, Fall, 1951)
Figure 2. Box plots showing how the log transformation makes the distributions of MIC symmetric
Figure 3. Dot plots that differentiate gram-positive from gram-negative bacteria tell us clearly that penicillin is unique among these three in its differential response to these two classes of bacteria.
Figure 4. The MICs of three antibiotics on gram-positive bacteria ordered by the efficacy of penicillin
Burtin, who to my knowledge had no training as a statistician, made a variety of wise choices in the display he constructed of these data (Figure 1). His display is a direct lineal descendant of Florence Nightingale’s famous Rose, in which the radii of the segments convey the amount of the data, rather than a traditional pie chart, in which the angle of each segment is the carrier of the information. Burtin saw the huge range of values the data took and realized some sort of re-expression was necessary. He chose a log transformation. Such re-expression is obvious to someone with statistical training, but it is reassuring that a designer should come to the same conclusion. The box plots in Figure 2 show the distributions of performance for each drug after log transform, oriented so better performance is at the top. We see immediately that the transformation worked, as the resulting distributions are symmetric without unduly long tails. In addition, we can see there is far greater variation in the performance of penicillin than in the other two drugs. Why? A dot plot that identifies Gram-positive and Gram-negative bacteria shows penicillin works far better for Gram-positive bacteria than for Gram-negative, differential performance that is not evident for the other two drugs. Burtin noticed this and visually segregated the bacteria that were Gram-positive from those that were Gram-negative. His resulting display is compact, accurate, and informative, but with the wisdom borne of a half century of work on statistical display and exploratory data analysis, can we improve matters?
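Readers who want to experiment with these choices can redraw the dot plot directly from the Table 1 values. The sketch below is not Burtin's or the column's code; it assumes a Python environment with matplotlib and simply plots the MICs on a log scale, coloring Gram-positive and Gram-negative bacteria differently.

```python
# Replot the Table 1 MICs on a log scale, coloring Gram-positive and
# Gram-negative bacteria differently (matplotlib assumed).
import matplotlib.pyplot as plt

# (bacterium, penicillin, streptomycin, neomycin, gram)
mic = [
    ("Aerobacter aerogenes", 870, 1, 1.6, "neg"),
    ("Brucella abortus", 1, 2, 0.02, "neg"),
    ("Brucella anthracis", 0.001, 0.01, 0.007, "pos"),
    ("Diplococcus pneumoniae", 0.005, 11, 10, "pos"),
    ("Escherichia coli", 100, 0.4, 0.1, "neg"),
    ("Klebsiella pneumoniae", 850, 1.2, 1, "neg"),
    ("Mycobacterium tuberculosis", 800, 5, 2, "neg"),
    ("Proteus vulgaris", 3, 0.1, 0.1, "neg"),
    ("Pseudomonas aeruginosa", 850, 2, 0.4, "neg"),
    ("Salmonella (Eberthella) typhosa", 1, 0.4, 0.008, "neg"),
    ("Salmonella schottmuelleri", 10, 0.8, 0.09, "neg"),
    ("Staphylococcus albus", 0.007, 0.1, 0.001, "pos"),
    ("Staphylococcus aureus", 0.03, 0.03, 0.001, "pos"),
    ("Streptococcus fecalis", 1, 1, 0.1, "pos"),
    ("Streptococcus hemolyticus", 0.001, 14, 10, "pos"),
    ("Streptococcus viridans", 0.005, 10, 40, "pos"),
]

fig, ax = plt.subplots(figsize=(9, 4))
markers = {"Penicillin": "o", "Streptomycin": "s", "Neomycin": "^"}
gram_color = {"neg": "tab:blue", "pos": "tab:red"}
for j, (drug, mark) in enumerate(markers.items()):
    ax.scatter(range(len(mic)),
               [row[1 + j] for row in mic],
               marker=mark,
               c=[gram_color[row[4]] for row in mic],
               label=drug)

ax.set_yscale("log")   # the re-expression Burtin chose
ax.invert_yaxis()      # smaller MIC means more effective, so put it at the top
ax.set_xticks(range(len(mic)))
ax.set_xticklabels([row[0] for row in mic], rotation=90, fontsize=7)
ax.set_ylabel("MIC (µg/ml), log scale")
ax.legend(title="Antibiotic (marker shape)")
plt.tight_layout()
plt.show()
```

Inverting the log-scaled axis puts the most effective (smallest) MICs at the top, matching the orientation used in Figure 2.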
An obvious nit is his omission of a circular reference line at .01. When interpolating between reference points, humans have a tough time with a log scale. Thus, it seems useful to provide as many intermediate waypoints as possible so linear interpolation is not too far off. But, perhaps it isn’t important to judge accurately between .1 and .001. If so, improving the level of visual precision might not be necessary. A place where real improvement may be possible is in the ordering of the bacteria. Let us consider just the Gram-positive bacteria. Suppose we order the graph by the success rate of penicillin. One possible display is shown in Figure 4. Now, we can see clearly that for Gram-positive bacteria—except Streptococcus fecalis—penicillin works well. For Staph infections, however, neomycin seems to have an edge. The second panel of this display (shown here as Figure 5) would then be the Gram-negative bacteria, this time ordered by the effectiveness of neomycin. From even a cursory examination of this two-panel display we can easily decide which drug is best for what bacteria. We note that, for these bacteria at least, the other two drugs dominate streptomycin. I contend that this two-panel display, although it lacks the compactness of Burtin’s original design, has a small edge in exposing the underlying structure of drug effectiveness. I suspect the rank order of the bacteria in each panel exposes an underlying molecular structure, but I leave it to others to uncover its meaning. Also, by including the component data, I challenge readers to come up with further improvements
and send them to me, which could easily form the basis of a future column.

Figure 5. The MICs of three antibiotics on gram-negative bacteria ordered by the efficacy of neomycin
Postscript
A sensitive reader might plausibly ask why an article honoring Will Burtin would spend much of its time offering suggestions on the improvement of one of his designs. A fine question. An answer, of sorts, is found in a framed letter that hangs next to his daughter, Carol Burtin Fripp’s, bed. It is dated April 18, 1959, on specially designed letterhead of the Type Directors Club of New York and says:
Dear Will: This comes from four guys who sat in the 18th row during the forum session. First we are friends who appreciate your work for the TDC; second, we are aware of your talent as designer. But may we help you—and future meetings that you chair—by saying that, to put it plainly on the line, you talk too long. You don’t have to be eternal to be immortal. Your audience will get much more from you if you say more briefly—and more orderly—what you have to say.
Yours, for a better meeting in the future and for a better Will Burtin who knows when to stop… . Four Friends in the 18th Row who wish you well
Someone should write a letter like this to statisticians.
Further Reading
Franklin, R. and Gosling, R.G. (1953) “Molecular Configuration in Sodium Thymonucleate.” Nature, 171:740–741.
Nightingale, F. (1858) Notes on Matters Affecting the Health, Efficiency and Hospital Administration of the British Army. London.
Remington, R.R. and Fripp, R.S.P. (2007) Design and Science: The Life and Work of Will Burtin. Hampshire, England: Lund Humphries.
Watson, J.D. and Crick, F.H.C. (1953) “A Structure for Deoxyribose Nucleic Acid.” Nature, 171:737–738.
Watson, J.D. and Crick, F.H.C. (1953) “Genetical Implications of the Structure of Deoxyribose Nucleic Acid.” Nature, 171:964–967.
Wilkins, M.H.F., Stokes, A.R., and Wilson, H.R. (1953) “Molecular Structure of Deoxyribose Nucleic Acid.” Nature, 171:738–740.
Column Editor: Howard Wainer, Distinguished Research Scientist, National Board of Medical Examiners, 3750 Market Street, Philadelphia, PA 19104;
[email protected]
Here’s to Your Health
Mark Glickman,
Column Editor
Pharmacogenetics and Pharmacogenomics: Statistical Challenges in Design and Analysis Todd G. Nick and Shannon N. Saldaña
Can genetic information help predict an individual’s reaction to a drug? Can the process of dosing medication be improved from the current trial-and-error approach by knowing a patient’s genetic fingerprint? Can drug and dosage optimization reduce the chances of an adverse drug reaction?
Studies in pharmacogenetics (PGt) attempt to answer these important questions. From our vantage point as pharmacogenetic researchers, the design and analysis of PGt research bring interesting challenges to professionals in our field. Statisticians will continue to have a critical role in addressing the design and analytical challenges. Consequently, it is important for statisticians to learn about the science of PGt research. It also is important that professionals in PGt research strive to gain a sense of statistical issues and conundrums that arise in their studies. The term “pharmacogenetics” was coined by Friedrich Vogel in 1959 as the study of the role of genetics in drug response. In the Medical Subject Headings (MeSH) thesaurus, PGt is defined as “a branch of genetics which deals with the genetic variability in individual responses to drugs and drug metabolism.” Pharmacogenomics (PGx) is a related discipline that uses genomics
and proteomics technology over the entire genome, and the term is often used interchangeably with PGt. In an effort to develop consistent definitions for drug regulation, the United States Food and Drug Administration (FDA) recently developed a guidance on terminology in the discipline of PGx and PGt. PGx is defined as “the study of variations of deoxyribonucleic acid (DNA) and ribonucleic acid (RNA) characteristics as related to drug response.” The FDA considers PGt a subset of PGx and defines it as “the study of variations in DNA sequence as related to drug response.”
Why Is PGt/PGx Research Important?
The ultimate goal of PGt and PGx research is to facilitate individualized pharmacotherapeutic regimens for patients based on individual genetic profiles. Together with other factors that may influence drug response—such as
age, sex, concomitant medications, diseases, diet, and lifestyle factors—PGt/PGx may facilitate creation of individualized treatment plans for patients. This is the overarching goal of personalized medicine, which intends to incorporate known individual factors with the objective of providing a patient with the right drug and dose at the right time. The practice of personalized medicine has the potential to drastically improve health outcomes and reduce costs to health care systems. Drug response can be grouped broadly into two categories: desired responses (therapeutic efficacy) and undesired responses (side effects or adverse drug reactions (ADRs), such as toxicity responses). Genetic differences among individuals explain some variability in drug response. The consequence of many genetic variations in the context of pharmacotherapy is not known. Typically, data are lacking on the predictive ability of genetic
variations in drug or dose selection. The most extensively researched PGt components in drug response are drug-metabolizing enzymes (DMEs), the proteins responsible for the conversion of medications to other compounds that ultimately will be eliminated from the body. In some cases, DMEs activate medications, but in most cases, they inactivate them. Two Caucasian patients who have the same age, sex, height, weight, and
overall health status may experience vastly different responses to the analgesic codeine. Codeine requires activation by a genetically variable DME to an analgesic component, called morphine. One patient may experience the expected analgesic response, while the other patient may not experience analgesia. The latter may be due to lack of the enzyme or inadequate function. Roughly 5% to 10% of Caucasians have little or no activity of the DME required
to convert codeine to morphine and may need to be treated with an alternative analgesic agent. Anticipated benefits of incorporating PGt/PGx into practice include improving medication selection and dosing accuracy. Ultimately, a patient’s genetic profile may be used to predict therapeutic response to a medication. As a result, drugs unlikely to benefit a patient may be avoided, thus increasing the likelihood of therapeutic success.
Terminology
PGx and PGt utilize traditional textbook genetic terminology as described previously by Schwender, Rabstein, and Ickstadt in “Do You Speak Genomish?” (CHANCE, 2006, volume 19). For transparency, we briefly explain a few terms here.
• Genome: all the genetic material contained in an organism or a cell, which includes both the chromosomes within the nucleus and the DNA in mitochondria.
• Chromosome: a string of DNA; the elements or building blocks of a DNA strand traditionally are denoted with the letters A, C, G, and T.
• Gene: a combination of DNA segments that together code for a functional unit (e.g., codes for proteins).
• Locus: a place or position of a gene on a chromosome. The plural of locus is loci. Genes occur in pairs at loci along the pairs of chromosomes.
• Allele: a version of genetic material at a locus. Genes at the same locus are called allelic to each other and are called alleles. Genes at different loci are non-allelic to each other. Genes with two alleles are diallelic.
• Polymorphism: a genetic variation in which the most common allele at a locus occurs with a proportion of at most 99% in a population. That is, of all the individuals in a population, at most 99% of the individuals have one allele at a given locus. In such a situation, we say the locus is polymorphic. A locus with more than 99% of the individuals with one allele is considered nearly monomorphic, and if all individuals have the same allele, it is a monomorphic locus.
• Nucleotide: one of the building blocks of both DNA and RNA.
• Insertion/deletion polymorphism: occurs when a nucleotide may be added or removed from the DNA sequence. This may result in the protein product of the gene being dysfunctional.
• Single nucleotide polymorphism (SNP): the simplest class of polymorphisms. An SNP is a DNA sequence variation at a locus and represents a difference in a single nucleotide. For example, an SNP may replace the nucleotide cytosine (C) with the nucleotide thymine (T) in a certain stretch of DNA.
• Genetic marker: a segment of DNA with an identifiable physical location on a chromosome; used to study the association between disease and genetic variation at a given locus. A genetic marker can be as simple as an SNP or could be based on multiple SNPs.
• Genotype: genetic makeup of an individual.
• Phenotype: an observed trait. A genotype governs phenotype in the absence of other factors. For example, genes encode for blue eyes, so the individual has blue eyes unless they alter their eye color via other means.
A phenotype often is an expression of a genotype, but may also be influenced by other factors. For example, how fast a person metabolizes a drug could be an expression of genotype, but it may also be influenced by factors such as age, concomitant medications, and severity of disease. When there are two possible alleles at the same locus, A and a, there are three possible genotypes: AA (homozygous for the A allele), Aa (heterozygous), and aa (homozygous for the a allele). If the phenotype is the same for the AA and Aa genotypes, then we say A is dominant to the allele a and allele a is recessive to allele A. With respect to a phenotype, if the heterozygous genotype Aa falls between the two homozygous genotypes, the alleles are said to be codominant. An additive allele model is a codominant model that assumes the heterozygous genotype falls exactly in between the two homozygous genotypes. A general allele model does not assume any relationship between genotype and phenotype and can be considered assumption-free. The terms dominant, recessive, and codominant refer to the effect of the allele. When the genetic mechanism is not so simple, multiple SNPs or genes at multiple loci are included in an analysis. If the multiple SNPs or genes on the same chromosome remain together in blocks when inherited by a child from a parent, each combination of SNPs in a block forms a haplotype.
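To make the allele models concrete, here is a small illustration of how a genotype variable translates into design variables under the general, additive, and dominant codings just described. It is not part of the original column; it assumes the Python pandas library, and the genotype values are made up.

```python
# Translate a genotype column into design variables under the allele models
# described above (pandas assumed; the genotype values are made up).
import pandas as pd

genotypes = pd.Series(["AA", "Aa", "aa", "Aa", "AA"], name="genotype")

# General model: two indicator variables, with AA as the reference genotype
general = pd.get_dummies(genotypes, prefix="G").drop(columns="G_AA")

# Additive (codominant) model: count the copies of the a allele (0, 1, or 2)
additive = genotypes.map({"AA": 0, "Aa": 1, "aa": 2})

# Dominant model for a: any copy of the a allele counts the same
dominant = (additive > 0).astype(int)

print(pd.concat([genotypes, general, additive.rename("a_count"),
                 dominant.rename("a_dominant")], axis=1))
```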
Likewise, a patient’s genetic profile may be used to predict tolerability of a medication, thus allowing a safer and more effective starting dose. Several therapeutic areas are currently influenced by incorporation of PGt test results (see the Therapeutic Areas Influenced by PGt Testing sidebar for specific examples). In genetic analyses, data collected from related individuals, such as from
twins or families, provide the most genetic information. Unfortunately, using related individuals in PGt/PGx studies generally is not possible because we typically do not have information about response to a drug from more than one member of a family. Therefore, similar to a standard genetic association study or randomized controlled trial (RCT), these studies often involve unrelated individuals.
Therapeutic Areas Influenced by PGt Testing
There are several therapeutic areas that have been influenced by PGt testing. Here are three examples.
TPMT and 6-mercaptopurine Toxicity
The clinical utility of PGt is highlighted by an important, but genetically polymorphic, drug-metabolizing enzyme (DME) called thiopurine S-methyltransferase (TPMT). This enzyme is involved in the detoxification of several drugs, including 6-mercaptopurine, a medication used to treat certain types of cancer. Ninety percent of the population has TPMT activity, as expected, and can process the drug safely, but the remaining 10% do not respond as predicted based on the normal dosing regimen. Approximately one in 300 individuals possesses a combination of alleles requiring only 6%–10% of the standard dose of medication. For these individuals, toxicity (potentially fatal bone marrow suppression) can result if standard dosing is used. Consequently, PGt testing for these allelic variants is useful in predicting who is more susceptible to 6-mercaptopurine toxicity. The FDA and drug manufacturer recognize this importance, and it is reflected in the package labeling.
UGT1A1 and Irinotecan Toxicity
Another example that highlights the clinical utility of PGt is dosing of the anti-cancer medication, irinotecan. Irinotecan is used to treat several types
of cancer, and its active metabolite is metabolized by the genetically polymorphic DME, uridine-diphosphate glucuronosyltransferase 1A1 (UGT1A1). Roughly half of the patient population is at increased odds for a life-threatening decrease in white blood cells: 10% of the population is homozygous for two deficient alleles and has a 0.50 probability of developing this severe reaction, and 40% are heterozygous and have a 0.11 probability of developing this life-threatening reaction. Consequently, PGt testing for the UGT1A1 deficiency allele, UGT1A1*28, is useful in predicting who is more susceptible to irinotecan toxicity.
CYP2D6 and Tamoxifen Response in Breast Cancer
CYP2D6 is involved in the metabolism of roughly 25% of medications and has more than 50 variants described. Tamoxifen, a medication commonly used in patients with breast cancer, is metabolized to an active metabolite, endoxifen, by CYP2D6. In contrast to previous examples, in which PGt was clinically useful in predicting potential drug toxicity, PGt may be useful in tamoxifen-treated patients to predict the likelihood of treatment success. Patients with two or more functional CYP2D6 alleles have expected endoxifen blood levels, whereas those with reduced function or deletion alleles have low to nondetectable levels and decreased response to therapy.
Illustration of a Simplified PGt Analysis
To demonstrate different aspects of a PGt statistical analysis, we present an example of a randomized clinical trial. This is a simplified example with only one genetic component, not hundreds, and only one drug. Additionally, for the sake of simplicity, no covariates—such as concurrent drugs, age, or health status—are incorporated. To illustrate the modeling of a response as a function of gene and drug, consider the data recently published by Hans-Henrik Parving and colleagues from the PGt component of the Reduction of Endpoints in non-insulin-dependent diabetes mellitus (NIDDM) with the Angiotensin II Antagonist Losartan (RENAAL) study. This study examined renal outcomes as a function of the drug losartan and the angiotensin-converting enzyme (ACE) gene in patients with NIDDM and nephropathy. The ACE insertion (I)/deletion (D) polymorphism has genotypes II, ID, and DD. Losartan is in a class of medications called angiotensin II receptor antagonists. In this study, the primary outcome—or endpoint—was a composite event of time to end-stage renal disease (ESRD), doubling of serum creatinine, or death, whichever was shorter. Five additional outcomes involving the individual endpoints or a combination of the individual events were presented in their study. Here, we focus on the composite endpoint. It is good not to have too low an event rate, or rate of reaching the composite endpoint. Table 3 of Parving and colleagues’ paper presents outcomes for 1,435 patients. For the losartan-taking patients, the event rates were 42.4% (72 of 170; simple binomial standard error 3.8%) for genotype II, 44.0% (155 of 352; SE 2.6%) for genotype ID, and 42.9% (81 of 189; SE 3.6%) for genotype DD. For the placebo-taking patients, the rates were 42.5% (85 of 200; SE 3.5%) for II, 47.6% (151 of 317; SE 2.8%) for ID, and 51.2% (106 of 207; SE 3.5%) for DD. The analysis we present below is for illustrative purposes only and differs in three substantial aspects from Parving and colleagues’ paper. First, we use the binary response summarized in their table, the presence or absence of the composite endpoint, and not the time-to-event endpoint. Second, we did not
adjust for important covariates, such as geographic region, because we use their summary data. Third, we do not restrict our attention to an additive type of gene action; instead, we use a general model for allele effect. To model the occurrence of an event (a binary response), we use a logistic regression model with indicators for genotype and drug treatment. The logit of the probability that Y=1 is the log of the ratio of the probability to one minus the probability: logit(P(Y=1)) = log(P(Y=1)/P(Y=0)). In this study, Y=1 in the presence of the composite endpoint. Predictors are included in the model in the following manner:
logit(P(Y = 1 | Drug, GID, GDD)) = β0 + β1Drug + β2GID + β3GDD + β4Drug×GID + β5Drug×GDD,
where Drug indicates losartan treatment (Drug=1 if losartan taker and Drug=0 if placebo taker) and GID and GDD are two design variables needed for a general allele model (GID=1 if genotype ID and 0 otherwise; GDD=1 if genotype DD and 0 otherwise).
Figure 1. Relationship between genotype and composite endpoint by drug. Left plot does not allow for interaction. Right plot does allow for interaction.
The above model does not assume the drug effect is the same for each genetic subgroup. The estimates of event rates for the different drug-genotype groups match the directly estimated proportions given above. The right plot of Figure 1 displays the fitted relationship of gene and drug on composite endpoint, allowing for interaction between genotype and drug. Although the right plot suggests the effect of drug on composite endpoint depends on gene, testing for interaction formally with the null hypothesis H0: β4 = β5 = 0 shows no evidence of gene-by-drug interaction (χ2 = 1.30, two degrees of freedom (d.f.), P = 0.52). The statistic was computed using a likelihood ratio test (LRT) for testing a model with gene and drug against a model with gene, drug, and gene-by-drug interaction. This test compares the change in deviance values for two models, here one with and one without interaction, and compares its value to a chi-square distribution with d.f. equal to the difference in the d.f. of the two models. Because there was no evidence of interaction, it is not appropriate to perform separate significance tests in each
of the subgroups (e.g., for each genotype is there a significant treatment effect). Instead, it is of interest to test for the main effects of gene and drug separately. The left plot in Figure 1 assumes no gene-drug interaction and also uses a general model for allele effect. That is, we remove the interaction terms from the logistic regression model and re-estimate the remaining coefficients. For the main effects, there is no evidence of a genotypic effect (χ2 = 1.99, two d.f., P = 0.37) or treatment effect (χ2 = 2.34, one d.f., P = 0.13). As a result, there is little to gain by using this drug and considering these genetic factors for this disease. It is interesting to note that the joint test of marginal genetic effects and gene-drug interaction testing H0: β2 = β3 = β4 = β5 = 0 provided no evidence of a gene effect at any level of drug (χ2 = 3.29, four d.f., P = 0.51). Additionally, the marginal drug effects and gene-drug interaction testing H0: β1 = β4 = β5 = 0 provided no evidence of a drug effect at any level of gene (χ2 = 3.64, three d.f., P = 0.30). In fact, the total LRT for an overall test of any gene or drug effect testing H0: β1 = β2 = β3 = β4 = β5 = 0 provided no evidence of any effect, justifying no further analyses (χ2 = 5.51, five d.f., P = 0.36). One can consider three hypothetical alternative data sets to illustrate how significant effects would appear. First, imagine the rates for the no losartan group are unchanged, but all the losartan groups have rates 10 percentage
points lower. Such a data set would show a strong drug effect, but no drug-gene interactions. Second, imagine no change in the rates for the II and ID genotype groups, but results 10 points higher than in reality for the DD group. The interaction is still not significant, but there is a significant effect on outcome for genotype DD. Third, imagine all the data being the same as they were originally, but the rate for the no losartan group being 10 points higher for the DD group. Such a data set would show a significant drug-gene interaction. It would suggest that using the drug in the DD group is beneficial, but not helpful in the other groups. In the original summary data, Parving and colleagues assumed an additive allele effect and used the time-to-event endpoint. Using the binary response of composite endpoint, the model can be expressed as:
logit(P(Y = 1 | Drug, ACount)) = β0 + β1Drug + β2ACount + β3Drug×ACount,
where ACount is the number of deletion alleles (zero, one, or two). The total LRT for an overall test of any gene or drug effect testing H0: β1 = β2 = β3 = 0 provided no evidence of any effect (χ2 = 5.32, three d.f., P = 0.15), and no further analyses are necessary. For illustrative purposes, the joint test of marginal genetic effects and
gene-drug interaction testing provided no evidence of a gene effect at any level of drug (χ2 = 3.10, two d.f., P = 0.21). Additionally, the marginal drug effects and gene-drug interaction testing H0: β1 = β3 = 0 provided no evidence of a drug effect at any level of gene (χ2 = 3.57, two d.f., P = 0.17). There is no evidence of a gene-drug interaction testing H0: β3 = 0 (χ2 = 1.30, one d.f., P = 0.25). Parving and colleagues also found no evidence for interaction using time-to-event as the endpoint (P=0.21).
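For readers who want to reproduce the general-allele-model analysis, a minimal sketch follows. It uses the grouped event counts quoted above from Table 3 of Parving and colleagues' paper and assumes Python with NumPy, SciPy, and statsmodels (none of which are mentioned in the original article); the likelihood ratio test for the interaction terms should reproduce the statistic quoted above (χ2 = 1.30 on two d.f.).

```python
# Refit the general allele model to the grouped counts quoted above
# (events and totals from Table 3 of Parving and colleagues' paper).
import numpy as np
import statsmodels.api as sm
from scipy import stats

# Order: losartan II, ID, DD; placebo II, ID, DD
events = np.array([72, 155, 81, 85, 151, 106])
totals = np.array([170, 352, 189, 200, 317, 207])
drug   = np.array([1, 1, 1, 0, 0, 0])       # 1 = losartan, 0 = placebo
g_id   = np.array([0, 1, 0, 0, 1, 0])       # design variable for genotype ID
g_dd   = np.array([0, 0, 1, 0, 0, 1])       # design variable for genotype DD

endog = np.column_stack([events, totals - events])   # (successes, failures)
X_main = sm.add_constant(np.column_stack([drug, g_id, g_dd]))
X_full = sm.add_constant(np.column_stack(
    [drug, g_id, g_dd, drug * g_id, drug * g_dd]))

fit_main = sm.GLM(endog, X_main, family=sm.families.Binomial()).fit()
fit_full = sm.GLM(endog, X_full, family=sm.families.Binomial()).fit()

# Likelihood ratio test for the gene-by-drug interaction (two d.f.)
lrt = 2 * (fit_full.llf - fit_main.llf)
print(f"chi2 = {lrt:.2f}, P = {stats.chi2.sf(lrt, 2):.2f}")
```

Dropping further columns from the design matrix in the same way gives the main-effect and overall tests discussed above.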
Statistical Challenges in PGt/PGx Research
There are several challenges for statisticians who work in PGt/PGx research. Adequate solutions to these challenges will require development of statistical methods and theory, as well as scientific advances in the field.
Pooling Data and Identifying Regions
In The Adventure of the Copper Beeches, Sherlock Holmes alluded to insufficient data by uttering “Data! data! data! I can't make bricks without clay!” This statement certainly is applicable in PGt/PGx research. There is a real need for research studies to help guide the development of the field. For example, according to the Agency for Healthcare Research and Quality (AHRQ), there is “a paucity of good-quality data addressing the questions of whether testing for CYP450 [DME] polymorphisms in adults entering SSRI
[selective serotonin reuptake inhibitor] treatment for nonpsychotic depression leads to improvement in outcomes, or whether testing results are useful in medical, personal, or public health decision-making.” The Cytochrome P450 (CYP450) system, largely in the liver, is responsible for the metabolism of the majority of medications. This system comprises multiple specific enzymes, several of which have known genetic variations that influence drug metabolism. For sparse data, it is important to combine results across studies, and this is becoming popular with genetic association studies, especially when genetic effects are small. However, as of December 17, 2008, more than 19,000 meta-analyses have appeared in the online bibliographic database MEDLINE with only 16 including the search terms PGt or PGx. This is a small number of articles, considering that 8,119 articles include the terms PGt or PGx, and 5,056 include PGt as a MeSH term. PGt studies in the future should pool PGt data from multiple sources, which would increase power and improve population coverage. Caution should be used when there are differences in allele frequencies among subgroups (e.g., ethnic subpopulations) and disease rates. That is, there may be some confounding of the genetic and disease association due to subgroup and disease association. This is one form of population stratification and can cause spurious associations (i.e., increase the chance of finding a false positive). A challenge for the statistician is to develop better methods that will allow for pooling of genetic data that take into account multiple genetic variants simultaneously, missing data due to poor signal-to-noise ratios of the genotyping assay, and heterogeneity such as ethnicity of individuals. It is typical for studies to use a candidate gene approach, which assesses the association between a particular SNP—or sets of SNPs of a gene region—and a drug response. This approach is not feasible when the candidate genes are ill-defined and the biology of the disease being studied is not well-understood. For example, in the AHRQ report mentioned above, several studies were found that examined only a limited number of more commonly known genotypes of CYP450 enzymes involved in SSRI metabolism and did not account for the involvement of multiple CYPs
in drug metabolism. In such a situation, one could consider a genome-wide association study (GWAS) to scan markers across the complete set of genetic variants and associate variants with a drug response. After new candidate regions are identified, researchers could focus on specific regions. The challenge to a statistician is to develop and validate clinical prediction models for high-dimensional data while considering complexities such as gene-drug interactions. Here, validation refers to the accuracy of a model when applied to new patient samples.
Incorporating Nontraditional Study Designs
Design issues are always complicated and involve trade-offs. For example, RCTs often are not practical or feasible in PGt studies. In fact, PGt studies often are “piggy-backed” onto RCTs designed to detect only a drug-based treatment effect. If DNA samples from participants in the RCT were saved or “banked,” then they could be tested retrospectively for association with outcomes. Analyses from these banked samples are typically considered hypothesis-generating and would need to be prospectively confirmed. For example, an RCT may show a treatment effect that is considered small, but when genetic testing is performed, one genetic subgroup demonstrates a statistically larger treatment effect. This result would need to be verified in a prospective fashion before it could be considered trustworthy. Ideally, researchers would take into account both the drug and genetic marker by designing prospective studies. In PGt studies, it is often beneficial for prospective studies to incorporate a so-called enrichment design. These designs are common in clinical oncology trials and involve an additional screening phase with the active treatments evaluated. They capitalize on the fact that some subjects are more likely to respond to a drug compared to other subjects, and this “responding subgroup” will show the most promise for a treatment effect. The first phase acts as a screening process for selecting the subgroup that responds favorably to treatment. The second phase involves the “responding subgroup” being randomized to treatment. These designs require fewer subjects because only one subgroup is
studied, not two or more as with most PGt studies. For example, an enrichment design, also called a targeted design, would enroll subjects of a particular genotype prospectively and exclude subjects in other genotypes who are at increased risk of side effects or fail to respond to a drug. To make these trials more affordable, another alternative to genotyping all subjects is to identify and genetically test targeted subgroups that are most likely to contribute information to the tested hypothesis. By identifying individuals with extreme phenotypic values, the required sample size is decreased. For instance, only subjects who are highly responsive and highly resistant to treatment (lower and upper deciles of the phenotypic distribution) would be genotyped. Statisticians could encourage the use of adaptive designs, where one uses accumulating data to decide how to modify aspects of a study without undermining the integrity of the trial. The use of the designs discussed above may prove to be more efficient or practical than standard designs. Each design choice, however, carries risks and benefits. For example, with enrichment designs, responders need to be identified accurately. If identification is not reliable, then important subgroups may be excluded from being studied and resources will be spent testing individuals with less potential to be informative about the hypotheses. The challenge for the statistician is to further investigate the efficiency of these novel designs under different scenarios and to encourage their use if proven efficient.
Modeling Complexity
It is well-known that RCTs provide the best evidence of a treatment effect, if it exists. Typically, a trial is conducted to compare response rates from an experimental drug and a placebo. However, in a PGt trial, we have an additional genetic component where the goal is to provide evidence of a difference in treatment effect in the complementary genetic subgroups. The test of differential treatment effect across genetic subgroups is a test of gene-treatment interaction. For example, in the Genetics of Hypertension Associated Treatment (GenHAT) study of high-risk hypertensive subjects, the primary hypothesis was a gene-drug interaction, as described by
Donna K. Arnett and colleagues in a 2002 Pharmacogenomics Journal paper. In GenHAT, it was of interest to study how the association between type of medication and coronary heart disease differed across hypertensive genetic variants. The GenHAT study was the largest PGt study conducted when it was undertaken. More specifically, suppose, for example, we are interested in examining a difference in treatment effect between a drug and placebo group in three genetic groups corresponding to the genotypes AA, Aa, and aa. We can express the two factors, drug and gene, using standard notation that is common in a linear regression model. We did this in our simple example discussed earlier in this article. For the drug factor, one design variable is required because there are two drugs. For the genetic factor, the number of design variables depends not only on the number of genetic groups, but on the assumptions, if any, made about the allele effect (e.g., general or dominant). With many alleles and multiple drugs, the range of statistical models can be fairly large. PGt analyses should also take into account multivariate outcomes. There may be multivariate phenotypes that are correlated (e.g., multiple neuropsychological tests performed in the rehabilitation of brain-injured patients) or multiple outcomes that are not necessarily correlated (e.g., efficacy and safety). Models that handle multiple genetic variants and SNPs simultaneously should be used as well. Data reduction techniques, such as principal components analysis, could be used to reduce multiple measures into one composite phenotype or reduce multiple SNPs (e.g., 50 SNPs) within a candidate gene into one or a few scores. For reducing the number of SNPs that are close together and possibly highly correlated, another approach is to analyze haplotypes. The pattern of the haplotypes may show that a few SNPs (haplotype tagging SNPs) can explain a majority of the information of all the SNPs. A specific challenge to statisticians who work in this area is to develop methods that handle uncommon haplotypes and rare genetic variants in an efficient manner. Another challenge is to continue to develop and evaluate methods for reducing model complexity
when considering combinations of genes and gene-drug interactions.
Handling Multiplicity Issues and Small Sample Sizes
Parallel with any “omics”-type of science (e.g., genomics and proteomics), a statistician has the major challenge of jointly handling severe multiplicity issues and small samples. Multiplicity issues stem from analyzing hundreds, if not thousands, of genetic variants, and small samples usually result from the exorbitant expense of processing the samples. PGt research is especially susceptible to these issues, given the vast number of classifiers (e.g., genetic variants, gene-gene interactions). What makes PGt unique is grappling with inconsistent individual responses to a drug (i.e., intrasubject variability). This has been referred to by Stephen Senn as patient-by-treatment—or patient-by-drug—interaction. Senn points out that it is not useful to look at gene-by-drug interaction unless there is patient-by-drug interaction. Therefore, PGt studies with small samples or multiple comparison problems often result in a majority of the findings being false. Small studies with positive associations have not been consistently replicated (type I error, false positives). Those with negative findings from underpowered studies have difficulty detecting meaningful effects. Nebert and colleagues discuss a significant finding in their 2008 Drug Metabolism Reviews paper that the rate of caffeine metabolism or risk of myocardial infarction from drinking coffee is associated with a particular SNP in or near the CYP1A2 gene. They point out that
CYP1A2 activity differs by more than 60-fold among individuals and that it is impossible to devise a genetic test to distinguish between high and low CYP1A2 activity in individuals. Therefore, they regard these significant associations as being inconclusive, not adequately powered, and needing replication. Sample sizes need to be much larger than the standard trial to detect gene-drug interactions or to perform subgroup analyses. In PGt, power depends on the minor allele frequency of the variants of interest, the type of gene action (e.g., additive model), the number of variants examined, the interaction effect of interest, and the variability among and within patients. The challenge for the statistician here is to evaluate power while considering the issues addressed above in combination. Specifically, methods should be explored and evaluated that compute power to test for gene-drug interactions in the presence of inconsistent patient responses to drugs.
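As one illustration of the kind of power evaluation called for here, the sketch below simulates trials with a gene-by-drug interaction under an additive allele model and counts how often the interaction is detected. It is not from the article; every number in it (allele frequency, effect sizes, sample size) is an arbitrary assumption chosen only to show the mechanics, and it assumes Python with NumPy, SciPy, and statsmodels.

```python
# Simulation-based power for a gene-by-drug interaction under an additive
# allele model. All parameter values are illustrative assumptions.
import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(1)

def interaction_power(n, maf=0.3, b_int=0.5, nsim=200, alpha=0.05):
    rejections = 0
    for _ in range(nsim):
        drug = rng.integers(0, 2, n)                  # 1:1 randomization
        acount = rng.binomial(2, maf, n)              # additive allele count
        eta = -0.5 + 0.2 * drug + 0.1 * acount + b_int * drug * acount
        y = rng.binomial(1, 1.0 / (1.0 + np.exp(-eta)))
        X = sm.add_constant(np.column_stack([drug, acount, drug * acount]))
        full = sm.GLM(y, X, family=sm.families.Binomial()).fit()
        reduced = sm.GLM(y, X[:, :3], family=sm.families.Binomial()).fit()
        lrt = 2 * (full.llf - reduced.llf)
        rejections += stats.chi2.sf(lrt, 1) < alpha   # one-d.f. interaction test
    return rejections / nsim

# print(interaction_power(1500))
```

Extending the simulation to allow inconsistent patient responses to the drug would address the patient-by-drug interaction issue Senn raises.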
Summary
PGt/PGx influences pharmacodynamics (PD), or what a drug does to the body, in terms of efficacy and toxicity. PGt/PGx also influences the other determinant of medication response, pharmacokinetics (PK), or what the body does to the drug. PK encompasses such processes as absorption, distribution, metabolism, and excretion. A variety of PGt/PGx factors are pivotal in these processes. There are issues with PK/PD studies, especially with PK studies, that we have not addressed. These include the analysis of repeated concentration measurements on subjects and determining the
therapeutic window for subgroups (e.g., poor versus extensive metabolizing phenotypes). These issues are not unique to PGt/PGx studies, but are associated with the full assortment of PK/PD studies. Nonetheless, they are other rich areas for statistical contributions. The challenges discussed here and other reasons we have not mentioned have led some to doubt whether personalized drug therapy based on DNA testing will ever be achievable. It is clear, however, that individual variation in drug response is a considerable clinical problem. It is a fact that FDA-approved product labeling of many medications now incorporates PGt information. Indeed, the FDA web site includes a table of PGt factors from product labeling: www.fda.gov/cder/genomics/genomic_biomarkers_table.htm. We think consideration of PGt/PGx factors when initiating medication therapy in many cases is vital to identifying individuals likely to experience successful therapeutic response and identifying individuals at risk for toxicity responses. Medications or doses should be adjusted on an individual basis to optimize potential benefit and minimize risk of toxicity. The success of PGt/PGx will depend on the perception of patients and doctors of how important individualized therapy is in terms of identifying optimal drugs and dosages. Regardless, PGt/PGx will provide a rich opportunity for medical researchers and statisticians to engage in clinically and statistically revolutionary work.
Further Reading Arnett, D.K.; Boerwinkle, E.; Davis, B.R.; Eckfeldt, J.; Ford, C.E.; Black, H. (2002) “Pharmacogenetic Approaches to Hypertension Therapy: Design and Rationale for the Genetics of Hypertension Associated Treatment (GenHAT) Study.” Pharmacogenomics J., 2(5):309–17.
Banks, D. (2008) “Statisticians and Metabolomics: Collaborative Possibilities for the Next *omics Revolution?” CHANCE, 21(2):5–11.
Food and Drug Administration. (2008) “Guidance for Industry: E-15 Definitions for Genomic Biomarkers, Pharmacogenomics, Pharmacogenetics, Genomic Data, and Sample Coding Categories.” www.fda.gov/cber/gdlns/iche15term.htm.
Jorgensen, A.L. and Williamson, P.R. (2008) “Methodological Quality of Pharmacogenetic Studies: Issues of Concern.” Statistics in Medicine, 27(30):6547–6569.
Kelly, P.J.; Stallard, N.; Whittaker, J.C. (2005) “Statistical Design and Analysis of Pharmacogenetic Trials.” Statistics in Medicine, 24(10):1495–1508.
Matchar, D.B.; Thakur, M.E.; Grossman, I.; McCrory, D.C.; Orlando, L.A.; Steffens, D.C.; Goldstein, D.B.; Cline, K.E.; Gray, R.N. (2006) “Testing for Cytochrome P450 Polymorphisms in Adults with Nonpsychotic Depression Treated with Selective Serotonin Reuptake Inhibitors (SSRIs).” Evidence Report/Technology Assessment No. 146. AHRQ Publication No. 07-E002.
Nebert, D.W.; Zhang, G.; Vesell, E.S. (2008) “From Human Genetics and Genomics to Pharmacogenetics and Pharmacogenomics: Past Lessons, Future Directions.” Drug Metabolism Reviews, 40(2):187–224.
Parving, H.H.; de Zeeuw, D.; Cooper, M.E.; Remuzzi, G.; Liu, N.; Lunceford, J.; et al. (2008) “ACE Gene Polymorphism and Losartan Treatment in Type 2 Diabetic Patients with Nephropathy.” Journal of the American Society of Nephrology, 19(4):771–779.
Schork, N.J.; Fallin, D.; Tiwari, H.K.; Schork, M.A. (2001) “Pharmacogenetics.” In Handbook of Statistical Genetics (Balding, D.; Bishop, M.; Cannings, C., eds.) 741–764.
Senn, S.J. (2007) Statistical Issues in Drug Development, 2nd ed. Wiley.
“Here’s to Your Health” prints columns about medical and health-related topics. Please contact Mark Glickman ([email protected]) if you are interested in submitting an article.
Goodness of Wit Test
Jonathan Berkowitz, Column Editor
One of the most appropriate definitions of statistical analysis is pattern recognition. From that perspective, statistical analysis has a great deal in common with solving word puzzles. I collect examples of ways in which word puzzles can be incorporated into a discussion of statistics and share two of my favorites here. I often point out to my students that the median is a much more amusing statistic than the mean. I justify this by pointing out that just as a pilot is often accompanied by a co-pilot, the median must often be accompanied by the co-median, or comedian. Groan! We all know the word significant has multiple meanings; statisticians make a clear distinction between statistically significant and practically (or clinically) significant. But significant also means power of attorney. How? Add some spaces and punctuation to the word: significant becomes sign if I can’t! I welcome your contributions to my collection. The greater the groan, the better. This issue’s puzzle has a small gimmick. Some of the solution words have something in common. I hope you enjoy the ‘aha!’ of discovery when you come across the common element. If you are a novice solver or need some reminders about the type of wordplay in cryptic clues, refer to the guide to solving cryptic clues that appeared in CHANCE 21(3). I will remind you again that the use of solving aids such as electronic dictionaries, Google, anagram-finders, etc. is encouraged. Crosswords are meant to teach, as well as entertain. A one-year (extension of your) subscription to CHANCE will be awarded for each of two correct solutions chosen at random from among those received by the column editor by May 1, 2009. As an added incentive, a picture and short biography of each winner will be published in a subsequent issue. Mail your
Past Winner Stephen Ellis is a statistician at the Duke Clinical Research Institute in Durham, North Carolina, though he works remotely from the California Bay Area. He earned his bachelor’s degree in mathematics and physics from Washington University, before earning a PhD in statistics from Stanford University. Previous work experience includes that with MathSoft (subsequently named Insightful Corp. and now Tibco Spotfire) in Seattle, Washington. Away from work, he enjoys bicycling, basketball, playing the cello, and playing cards.
completed diagram to Jonathan Berkowitz, CHANCE Goodness of Wit Test Column Editor, 4160 Staulo Crescent, Vancouver, BC Canada V6N 3S2, or send him a list of the answers by email ([email protected]) by May 1, 2009. Please note that winners of the puzzle contest in any of the three previous issues will not be eligible to win this issue’s contest.
Solution to Goodness of Wit Test #1 This puzzle appeared in CHANCE 21(3):63.
Across: 1. QUEUES [homophone: cues] 4 ODORLESS [anagram: old roses] 10 AUTOMATIC [charade: au + Tom + at + IC] 11 SKIER [deletion: riskier – RI] 12 TIME [charade: initial letters = t + i + m + e] 13 RANDOMNESS [anagram: Ms. Anderson] 15 LATENCY [charade: late + N(c)Y] 16 SPLINE [container: sp(l)ine] 19 EROICA [deletion: erotica – t] 21 GENERAL [anagram: enlarge] 23 ALAN TURING [anagram: natural + in + g] 25 MAPS [reversal: spam] 27 CHEER [charade: ch + e’er] 28 SYMBOLIZE [homophone: cymbal eyes] 29 SCHEDULE [anagram: she clued] 30 LLOYDS [anagram: Dolly’s]
Down: 1 QUARTILE [charade with deletion: square – se + tile] 2 ESTIMATOR [anagram: East Timor] 3 EMMY [homophone: M.E.] 5 DECIDES [charade with reversal: Se + diced] 6 RESAMPLING [container with anagram: ring + maples] 7 ELITE [charade: e-lite] 8 STRESS [charade: s + tress] 9 STEADY [anagram: stayed] 14 UNFILTERED [anagram: enter fluid] 17 NORMALITY [hidden word: Beni-n or Mali ty-rant] 18 CLUSTERS [charade with anagram: C + results] 20 AEROSOL [container with reversal: A(eros)OL] 21 GENOME [charade with anagram: ge – l + nome] 22 CAUCUS [deletion: Caucasus – as] 24 ALEPH [hidden word: m-ale ph-ysician] 26 POLL [double definition]
Goodness of Wit Test #3 “Discouraging Words Are Seldom Heard” A well-known statistical term appears in all the answers to clues marked with an asterisk. Additionally, for those answers, this word is excluded from the wordplay part of the clue.
ACROSS
*1. Backfired and made a loud sound (11) 7. In support of 50% of profit (3) 9. One-man representative (5) *10. Ring notice excellent citrus drink (9) *11. Disaffected established Democrat (9) 12. Last letter written during home game (5) 13. Doctor to share hearing distance (7) 15. Droll uncle keeps back nothing (4) 18. Concept confused aide (4) 20. Reclaim new and extraordinary event (7) 23. See 8 Down *24. Most unusual stone street (9) 26. First in tossed arugula (9) 27. California to take advantage of principle (5) 28. Gentleman heads to swanky international resort (3) *29. Order craft to hold crew (11)
DOWN
1. Wildly celebrate losing last piece of jewelry (8) 2. Upset plain Greek vase (8) 3. A T. Rex chewed up more than usual (5) 4. In the midst of Missouri surrounded by feeling of anxiety (7) 5. Dangled oddly to make happy (7) 6. Money (about $1000 Canadian initially) to order item some clerics wear (9, 2 wds) 7. Earth, perhaps, beginning to tumble past flier (6) 8. (With 23 Across) Anyone avow a corrected means test? (6 hyph.,5) *14. Hot day refreshed flower (9) 16. Escapes northward carrying church program (8) 17. Most irritable with even scores during exam (8) 19. Like ruse developed by radical underwriter (7) 20. Norma Jean remake mainly about actress’s heart (7) 21. Following victory, Ram makes changes (6) 22. Broken oath I read at first is empty talk (6, 2 wds) 25. Agreeable to gain half specialized market (5)