School Psychology Review, Volume 40, No. 1, 2011 (General Issue)
National Association of School Psychologists

Contents

Cognitive Correlates of Inadequate Response to Reading Intervention . . . 3
Jack M. Fletcher, Karla K. Stuebing, Amy E. Barth, Carolyn A. Denton, Paul T. Cirino, David J. Francis, & Sharon Vaughn

Teacher Judgments of Students' Reading Abilities Across a Continuum of Rating Methods and Achievement Measures . . . 23
John C. Begeny, Hailey E. Krouse, Kristina G. Brown, & Courtney M. Mann

Behavior Problems in Learning Activities and Social Interactions in Head Start Classrooms and Early Reading, Mathematics, and Approaches to Learning . . . 39
Rebecca J. Bulotsky-Shearer, Veronica Fernandez, Ximena Dominguez, & Heather L. Rouse

Escape-to-Attention as a Potential Variable for Maintaining Problem Behavior in the School Setting . . . 57
Jana M. Sarno, Heather E. Sterling, Michael M. Mueller, Brad Dufrene, Daniel H. Tingstrom, & D. Joe Olmi

Treatment Integrity of Interventions With Children in the School Psychology Literature from 1995 to 2008 . . . 72
Lisa M. Hagermoser Sanetti, Katie L. Gritter, & Lisa M. Dobey

Race Is Not Neutral: A National Investigation of African American and Latino Disproportionality in School Discipline . . . 85
Russell J. Skiba, Robert H. Horner, Choong-Geun Chung, M. Karega Rausch, Seth L. May, & Tary Tobin

Potential Bias in Predictive Validity of Universal Screening Measures Across Disaggregation Subgroups . . . 108
John L. Hosp, Michelle A. Hosp, & Janice K. Dole

School Psychology Research: Combining Ecological Theory and Prevention Science . . . 132
Matthew K. Burns

Prereading Deficits in Children in Foster Care . . . 140
Katherine C. Pears, Cynthia V. Heywood, Hyoun K. Kim, & Philip A. Fisher

Effects of the Helping Early Literacy with Practice Strategies (HELPS) Reading Fluency Program When Implemented at Different Frequencies . . . 149
John C. Begeny

Determining an Instructional Level for Early Writing Skills . . . 158
David C. Parker, Kristen L. McMaster, & Matthew K. Burns
School Psychology Review, 2011, Volume 40, No. 1, pp. 3–22
Cognitive Correlates of Inadequate Response to Reading Intervention

Jack M. Fletcher, Karla K. Stuebing, and Amy E. Barth, University of Houston
Carolyn A. Denton, University of Texas Health Science Center at Houston
Paul T. Cirino and David J. Francis, University of Houston
Sharon Vaughn, University of Texas—Austin

Abstract. The cognitive attributes of Grade 1 students who responded adequately and inadequately to a Tier 2 reading intervention were evaluated. The groups included inadequate responders based on decoding and fluency criteria (n = 29), only fluency criteria (n = 75), adequate responders (n = 85), and typically achieving students (n = 69). The cognitive measures included assessments of phonological awareness, rapid letter naming, oral language skills, processing speed, vocabulary, and nonverbal problem solving. Comparisons of all four groups identified phonological awareness as the most significant contributor to group differentiation. Measures of rapid letter naming, syntactic comprehension/working memory, and vocabulary also contributed uniquely to some comparisons of adequate and inadequate responders. In a series of regression analyses designed to evaluate the contributions of responder status to cognitive skills independently of variability in reading skills, only the model for rapid letter naming achieved statistical significance, accounting for a small (1%) increment in explained variance beyond that explained by models based only on reading levels. Altogether, these results do not suggest qualitative differences among the groups, but are consistent with a continuum of severity associated with the level of reading skills across the four groups.

Author note. Correspondence regarding this article should be addressed to Jack M. Fletcher, Department of Psychology, University of Houston TMC Annex, 2151 W. Holcombe, Suite 222, Houston, TX 77204-5053; e-mail: [email protected]. Copyright 2011 by the National Association of School Psychologists, ISSN 0279-6015.
A recent consensus report suggested that students with learning disabilities (LD) should be identified on the basis of inadequate treatment response, low achievement, and traditional exclusionary criteria (Bradley, Danielson, & Hallahan, 2002). The most controversial component of this report was the indication that an assessment of response to instruction is a necessary (but not sufficient) component of identification. From a classifi-
cation perspective, the validity of this provision should be tested as a hypothesis by comparing adequate and inadequate responders on attributes not used to define the groups, such as cognitive processing. If adequate and inadequate responders can be differentiated from students typically developing on these nondefinitional variables, the classification hypothesis accrues validity (Morris & Fletcher, 1988). The consensus report excluded assessments of cognitive processing skills known to underlie different kinds of LD as a component of identification. We differentiate cognitive assessments of skills that support mental operations (e.g., language, memory, problem solving) and do not involve reading for task completion from assessments of different components of reading, such as decoding, fluency, and comprehension. The latter are also cognitive measures, but are determined in part by cognitive processes that vary with the component of reading that is assessed (Vellutino, Fletcher, Snowling, & Scanlon, 2004). Assessing cognitive skills is controversial in school psychology because of questions about the value added by these tests for identifying or treating LD (Gresham, 2009); however, these assessments are commonly employed, and strengths and weaknesses in cognitive processes are clearly related to the achievement domains that represent LD (Reynolds & Shaywitz, 2009). Although assessment of cognitive processes is not required for identification of LD in the Individuals with Disabilities in Education Act (IDEA; U.S. Department of Education, 2004), Hale et al. (2008) proposed that inadequate responders to Tier 2 intervention should receive a cognitive assessment to explain why the students did not respond to intervention, to guide treatment planning, and as an alternative to LD eligibility models explicitly identified in IDEA (ability–achievement discrepancy and methods based on response to intervention). This issue has significant implications for everyday practice in school psychology because it suggests a major role for cognitive assessment for intervention (and for identification). However, a recent review (Pashler, 4
McDaniel, Rohrer, & Bjork, 2009) did not identify evidence that interventions based on group by treatment interactions (e.g., learning styles, aptitude by treatment interactions) were differentially related to outcomes. Consistent with views from other school psychologists, whether cognitive skills represent child attributes that interact with treatment outcomes and are essential components of intervention planning is not well established (Gresham, 2009; Reschly & Tilly, 1999). Moreover, little research establishes whether inadequate responders differ from adequate responders and typical achievers outside of the defining characteristics of inadequate instructional response and poor development of academic skills. Taking an approach somewhat different from the analysis of group by treatment interactions, we approached the question of cognitive assessment from a classification perspective, addressing whether there are unique cognitive attributes of inadequate responders. Cognitive and Behavioral Attributes of Inadequate Responders One meta-analysis has addressed whether cognitive skills represent attributes of variously defined subgroups of inadequate responders (Nelson, Benner, & Gonzalez, 2003). This meta-analysis initially utilized a literature review by Al Otaiba and Fuchs (2002), which summarized 23 studies of preschool through Grade 3 students who received reading interventions. Al Otaiba and Fuchs reported that most studies identified difficulties with phonological awareness as a major characteristic of inadequate responders. However, difficulties with rapid naming, phonological working memory, orthographic processing, and verbal skills, as well as attention and behavior problems, and demographic variables, also correlated with inadequate response. In their meta-analysis of 30 studies, Nelson et al. (2003) began with these 23 studies. They used the same search criteria as Al Otaiba and Fuchs (2002), but disagreed on the inclusion of 4 studies and added 11 other studies. Moderate to small weighted effect sizes were reported for rapid naming (Zr
⫽ 0.51), problem behavior (Zr ⫽ 0.46), phonological awareness (Zr ⫽ 0.42), letter knowledge (Zr ⫽ 0.35), memory (Zr ⫽ 0.31), and IQ (Zr ⫽ 0.26). Effect sizes for demographic and disability/retention variables were negligible. Except for the negligible weightings for demographic variables and the statistical equivalence of rapid naming, phonological awareness, and behavior problems, these results were consistent with Al Otaiba and Fuchs (2002). Since these two syntheses, other studies have examined cognitive characteristics of students with inadequate response to reading intervention. Stage, Abbott, Jenkins, and Berninger (2003) compared cognitive attributes in students who responded “faster” or “slower” to a Grade 1 intervention. Faster responders had higher initial reading levels and reading-related language skills, including phonological and orthographic awareness, rapid naming, and verbal reasoning. Al Otaiba and Fuchs (2006) used a letter naming fluency task to classify students as consistently and inconsistently responsive to intervention in kindergarten and Grade 1. Consistently inadequate responders obtained lower scores on measures of morphology, vocabulary, rapid naming, sentence repetition, and word discrimination, and had higher rates of problem behaviors. Phonological segmentation was weakly related to responder status, with Al Otaiba and Fuchs emphasizing low verbal skills (e.g., vocabulary) as a major attribute of inadequate responders. Vellutino, Scanlon, and Jaccard (2003) found that students who responded to Grade 1 intervention had cognitive profiles similar to typically achieving students after intervention. Before intervention, responders had been lower in phonological processing and initial levels of reading skills than typical achievers. Before and after intervention, inadequate responders were best differentiated from adequate responders on phonological awareness, rapid naming, and verbal working memory, but not verbal IQ or nonverbal processing abilities. In a second study, Vellutino, Scanlon, Small, and Fanuele (2006) used the same untimed word reading criterion (25th percentile) to classify students at the end of Grade 3
as poor readers who were difficult and less difficult to remediate. Measures of rapid naming, phonological processing, vocabulary, and verbal IQ showed a stepwise progression in accordance with the groups’ levels of word reading skills. They interpreted the relation of reading level and cognitive processing as indicating that “the cognitive abilities underlying reading ability can be placed on a continuum that determines the ease with which a child acquires functional reading skills” (Vellutino et al., 2006, p. 166). These studies identify difficulties with phonological awareness, rapid naming, vocabulary, and oral language skills as the most consistent cognitive attributes of inadequate responders. However, these differences are relative to the samples and measures chosen for investigation, and most were ad hoc applications of responder criteria; the studies were not designed to assess differences in adequate and inadequate responders. These findings are also influenced by differences in interventions and criteria for inadequate response. Criteria for Inadequate Response It is difficult to specify the role of intervention differences in evaluating these studies because they vary in intensity, comprehensiveness, and grade level of the at-risk students. For the second issue, different criteria do not identify the same students as inadequate responders (Barth et al., 2008; Burns & Senesac, 2005; Fuchs, Fuchs, & Compton, 2004; Speece & Case, 2001). Fuchs and Deshler (2007) noted that methods for assessing intervention response varied both by the method and type of assessment employed. Methods include (1) final status, based on end of the year status on a criterion-or norm-referenced assessment; (2) slope discrepancy, based on criterion-referenced assessments of growth; and (3) dual discrepancy, which uses assessments of growth and the end point of the criterion-referenced assessment. Summarizing across studies, Fuchs and Deshler (2007) reported that rates of agreement were generally low when inadequate responders were identified using different methods. Another source 5
of variability across identification methods is measurement error because of small amounts of unreliability in the identification measures and the difficulty of determining where to put the cut point on distributions that are essentially normal, a problem for any assessment or study of LD (Francis et al., 2005). Variability across identification approaches is also from differences in the types of assessments used to identify responder status, such as the use of timed assessments of word reading and passages versus untimed word reading. No studies have used a norm-referenced fluency measure with a national standardization to identify inadequate responders.
Rationale for the Present Study
Our overall research question was whether adequate and inadequate responders to a Grade 1 reading intervention can be differentiated on cognitive measures not used to define responder status, addressing the classification issue fundamental to determining whether inadequate responders might benefit from an assessment of cognitive processes. To assess differences stemming from the type of measure employed for determining responder status, we used norm-referenced end-of-year assessments of timed and untimed word reading, and performance on a criterion-referenced oral reading fluency probe. The cognitive measures included those implicated in previous studies of inadequate responders: phonological awareness, rapid naming, vocabulary, and oral language skills (Al Otaiba & Fuchs, 2006; Nelson et al., 2003). Phonological awareness, rapid letter naming skills, and vocabulary are also major correlates of poor reading (Vellutino et al., 2004). To further address oral language skills, we administered measures of syntactic comprehension and listening comprehension, both linked to reading comprehension difficulties (Catts & Hogan, 2003). In addition, we included measures of nonverbal problem solving and processing speed for broader coverage of the cognitive domain. We hypothesized that (1) regardless of the reading domain (decoding vs. fluency) used to define responder status, inadequate responders will have poorer performance on measures of verbal skills (e.g., vocabulary and oral language) than adequate responders or typically achieving students; (2) phonological awareness skills will be more strongly associated with responder status when defined by decoding criteria, whereas rapid letter naming skills will be more associated with responder status when defined by fluency criteria; and (3) differences in cognitive skills between adequate and inadequate responders will reflect differences in the severity of reading impairment (i.e., a continuum of severity).

Method

Participants
The study was approved by the Institutional Review Boards at the University of Houston and University of Texas—Austin. We derived the sample from the entire Grade 1 general education population in nine elementary schools located in two study sites, one in a large urban area and the other in a smaller suburban community. These students, largely minority and economically disadvantaged, were the basis for a Tier 2 reading intervention study (see Denton, Cirino et al., in press, for the complete report). We excluded from screening only students who received their primary reading instruction outside of the regular general education classroom or in a language other than English, and those with school-identified severe intellectual or behavioral disabilities. Figure 1 presents a flow chart illustrating student assignments to intervention groups. Denton, Cirino et al. (in press) screened 680 students for reading problems in September, identifying 461 as at risk for reading difficulties and 219 as not at risk. The large proportion of at-risk students reflects the participation of schools with many at-risk students and the use of a screening plan designed to minimize false negative errors (i.e., failure to identify a “true” at-risk child), which carries with it a higher false-positive rate (i.e., identification of students as at risk who actually develop as typical readers; Fletcher et al., 2002). Because of the potentially high false-positive
Figure 1. Flow chart showing origins of the sample for this study. DF1 = impaired on Basic Reading, TOWRE, and CMERS; DF2 = impaired on Basic Reading and TOWRE; F1 = impaired on TOWRE and CMERS; F2 = impaired on TOWRE; F3 = impaired on CMERS; Basic Reading = Woodcock-Johnson III composite of Word Identification and Word Attack (untimed decoding); TOWRE = Test of Word Reading Efficiency (timed decoding); CMERS = Continuous Monitoring of Early Reading Skills (passage fluency).
rate, students identified as at risk were then progress monitored with oral reading fluency probes from the Continuous Monitoring of Early Reading Skills (CMERS; Mathes & Torgesen, 2008) biweekly through October and November. Of the 461 identified in the initial screening, 273 failed to attain fluency benchmarks by the end of November. At this point, these 273 students were randomly assigned to one of three Tier 2 treatment groups that varied in intensity (8 weeks of instruction 2 and 4 times weekly; 16 weeks of instruction 2 times weekly). The final sample for the current study included 189 at-risk readers who completed
the intervention and the post-test assessments and 69 students identified as not at risk at the beginning of the year. The 84 at-risk readers who were initially randomized and not included in the current study were 37 students in an alternate group who did not receive treatment because of insufficient resources and another 41 who were not treated because they moved, were withdrawn by the school or parents, or did not receive sufficient intervention. Two students received intervention, but were missing post-test data, and one student with decoding, but not fluency, deficits could not be classified using our criteria. Three students were dropped because of nonverbal IQ
scores below 70 to exclude possible intellectual disabilities. The 69 typically developing students were drawn from an original sample of 84 students randomly selected at the beginning of the study from the large sample of students not at risk on the screen. The 15 excluded students were 9 who moved and 6 who met inadequate responder criteria at post-test. At-risk students who were assigned to intervention but not included did not differ from those who remained on sociodemographic characteristics or baseline scores (p > .05).
Criteria for Inadequate Response
To ensure that all students who needed continued intervention were identified, we cast a broad net to identify inadequate responders, including three separately applied end-of-treatment criteria: (a) untimed word reading standard score below 91 (25th percentile) on the Woodcock-Johnson III Basic Reading Skills Cluster (WJIII; Woodcock, McGrew, & Mather, 2001); (b) word reading fluency standard score below 91 on the composite score from the Test of Word Reading Efficiency (TOWRE; Torgesen, Wagner, & Rashotte, 1999); and (c) oral passage reading fluency below 20 words correct per minute (wcpm) on the CMERS. The cut point for the norm-referenced tests follows previous studies of inadequate responders as well as cut points employed in many studies of LD (Torgesen, 2000; Vellutino et al., 2006). For the CMERS, the oral reading fluency criterion was selected based on the procedures used for the DIBELS, where scores below 20 wcpm indicate that the child is at risk (dibels.uoregon.edu/docs/benchmarkgoals.pdf, retrieved August 24, 2009). We used the DIBELS norms because there was no national sample for CMERS from which to identify a norm-referenced cut point. The cut point we used is more stringent than in other studies using CMERS (e.g., Mathes et al., 2005). The CMERS stories are less difficult than Grade 1 DIBELS stories, so equating precisely to DIBELS cut points (e.g., 25th percentile) was not justified. In support of this decision,
increasing the CMERS criterion to 28 wcpm, which would represent the 25th percentile in the Hasbrouck and Tindal (2006) norms, would significantly increase the number of students identified as inadequate responders based solely on the CMERS. We show later that the increase in criterion score is difficult to justify given their scores on other reading measures. To further evaluate this cut point, we compared fluency rates in relation to the WJIII Basic Reading and TOWRE criteria. Consistent with the CMERS cut point, students in this sample who scored within 3 points of the WJIII Basic Reading criterion of 90 averaged 19.0 (SD = 9.4) wcpm; those with scores within 3 points of the TOWRE cut point of 90 had average CMERS rates of 24.3 (SD = 11.5) wcpm. The application of these criteria yielded 85 adequate responders and 5 subgroups of inadequate responders (n = 104; see Figure 1). This rate may seem high, but it is partly the consequence of applying multiple criteria to identify inadequate responders and of measurement error. Figure 1 shows that the inadequate responders included two groups with primary impairments in decoding skills who also fell below criteria on both fluency assessments (DF1; n = 19), or who fell below criteria on the TOWRE, but not the CMERS (DF2; n = 10). Three groups had primary problems in fluency, but not decoding. The first fluency group (F1; n = 41) fell below both TOWRE and CMERS criteria. Fluency Group 2 (F2; n = 20) fell below only the TOWRE criterion, whereas Fluency Group 3 (F3; n = 14) fell below only the CMERS criterion.
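To make the decision rules concrete, the sketch below restates the five subgroup definitions in code. It is illustrative only: the function and label names are ours, and the only values taken from the text are the end-of-treatment cut points (standard scores below 91 on the WJIII Basic Reading and TOWRE composites, below 20 wcpm on the CMERS).

```python
# Illustrative restatement of the subgroup rules described above; names are
# ours, and only the cut points come from the text.
def classify_responder(basic_reading: float, towre: float, cmers_wcpm: float) -> str:
    decoding = basic_reading < 91       # untimed word reading (WJIII Basic Reading)
    word_fluency = towre < 91           # timed word reading (TOWRE composite)
    passage_fluency = cmers_wcpm < 20   # oral passage reading fluency (CMERS)

    if decoding and word_fluency and passage_fluency:
        return "DF1"                    # impaired on all three measures
    if decoding and word_fluency:
        return "DF2"                    # decoding and timed word reading, adequate CMERS
    if decoding:
        return "unclassified"           # decoding deficit without a TOWRE deficit;
                                        # not covered by the five subgroups
    if word_fluency and passage_fluency:
        return "F1"
    if word_fluency:
        return "F2"
    if passage_fluency:
        return "F3"
    return "adequate responder"
```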
Procedures
Description of the intervention
Tier 1 instruction. All students received Tier 1 instruction using evidence-based programs, a prerequisite for school selection. Beginning in September, three experienced literacy coaches held monthly meetings with Grade 1 classroom reading teachers at each school to examine graphs of their at-risk students' oral reading fluency data, discuss student progress, and provide instructional strategies. Teachers received in-class coaching on request.
Tier 2 intervention. Beginning in January, students in each intervention condition received the same Tier 2 supplemental small-group reading intervention, provided by 14 trained paraprofessionals who were coached and supported by the same literacy coaches. Intervention was provided in 30-min sessions in groups of two to four students with one tutor, in a location outside of the regular classroom, on the three varying schedules previously described. The intervention was comprehensive and addressed phonemic awareness, decoding, fluency, vocabulary, and comprehension. Tutors followed a manualized curriculum with a specific scope and sequence based on a modification of the Grade 1 Read Well program (Sprick et al., 1998). Read Well is a highly structured curriculum that supports the delivery of explicit instruction in phonemic awareness and phonics, with practice in fully decodable text and repeated reading to meet fluency goals. Modifications for this study added explicit instruction in vocabulary and reading comprehension, as well as more detailed lesson scripting to support high-fidelity implementation. Read Well includes one instructional unit for each letter–sound correspondence in the scope and sequence, and four lessons are provided for each unit. Tutors used mastery tests included in the program to individualize student placement in specific units and the number of lessons taught in each unit.
Fidelity of implementation. Following procedures commonly implemented in intervention research, fidelity data were collected through direct observation of tutors three times during the semester. To document program adherence and quality of implementation, tutors were rated on Likert scales ranging from 0 (expected but not observed) to 3 (observed nearly all of the time). The mean total rating (including both fidelity and quality) was 2.47 (SD = 0.27, range 2.01 to 2.95), indicating strong implementation of the intervention (Denton, Cirino et al., in press).
Outcomes. Although there was evidence of growth in reading for many students over the intervention period, an evaluation of treatment efficacy at the end of the 8- and 16-week periods showed few differences in outcomes for the three intensity/duration groups (see Denton, Cirino et al., in press). Therefore, we combined the three groups to determine responder status.
Test administration
Students were assessed by examiners who completed an extensive assessment training program. All assessments were completed in the students' elementary schools in quiet locations. The cognitive and pretest achievement measures were administered at the end of November and December. The post-test achievement measures and nonverbal problem-solving task were administered in May. The variation in the timing of the cognitive assessments resulted from time limitations imposed by the schools. We administered the cognitive variables before intervention to ensure that they would not be influenced by treatment. The nonverbal problem-solving measure is not likely to change over the short intervention because these skills were not taught.
Measures
We selected cognitive and language measures implicated either as correlates of inadequate response or as indicators of constructs often associated with LD. The battery was necessarily parsimonious, as we were restricted to a 60-min time frame by the schools, which were concerned about time lost from instruction. With some exceptions, all measures had a national standardization. A description of all tests can be found at www.texasldcenter.org/project2.asp.
Measures to determine student group membership
Woodcock-Johnson III Test of Achievement (Woodcock et al., 2001). The Basic Reading Skills composite combines the Letter-Word Identification and Word Attack subtests of untimed decoding skills.
Letter-Word Identification assesses the ability to read real words; Word Attack assesses the ability to read phonetically correct nonsense words. The reliability of the composite ranges from .97 to .98 for students aged 5–8 years.
Test of Word Reading Efficiency (Torgesen et al., 1999). The Sight Word Efficiency subtest assesses the timed reading of real words presented in a list format. Phonemic Decoding Efficiency assesses timed reading of pseudowords. The TOWRE composite was the dependent variable. Alternate forms and test–retest reliability coefficients exceed .90 in this age range.
Continuous Monitoring of Early Reading Skills (Mathes & Torgesen, 2008). The CMERS is a timed measure of oral reading fluency for connected text. All texts were written at approximately a Grade 1.7 readability level according to the Flesch-Kincaid index and were 350–400 words in length. Students were required to read two passages, for 1 min each. Test–retest reliability for the first two screening periods in this study was .93. The dependent variable is the total number of words read correctly in 60 s averaged over the two stories.
Other academic measures
We gave, but did not analyze, the WJIII Passage Comprehension and Spelling subtests to permit a broader characterization of the students' academic development. Passage Comprehension uses a cloze procedure to assess sentence-level comprehension by requiring the student to read a sentence or short passage and fill in missing words. Spelling involves orally dictated words written by the student. Coefficient alpha ranges from .94 to .96 for Passage Comprehension and .88 to .92 for Spelling in the 5- to 8-year age range.
Cognitive and linguistic measures
Comprehensive Test of Phonological Processing (Wagner, Torgesen, & Rashotte, 1999). Blending Words measures the ability to combine sounds to form whole words. Elision requires deletion of specific
sounds from a word. For students 5– 8 years, coefficient alpha is .96 and .99 for Elision and Blending Words subtests, respectively. To reduce variables for analysis, a composite score was created by averaging the standardized Blending Words and Elision subtests. The Rapid Letter Naming subtest measures the speed of naming letters presented in a repeated pattern. We only administered Form A, so a standardized score could not be computed. The dependent measure was the number of letters identified divided by the total time to identify all items, and was converted into time per letter. Alternate-form and test–retest reliability coefficients are at or above .90 for students aged 5– 8 years. Clinical Evaluation of Language Fundamentals— 4 (Semel et al., 2003). Concepts and Following Directions assesses the understanding and execution of oral commands containing syntactic structures that increase in length and complexity. The syntactic component involves manipulation of pronouns and sentence structure; the working memory component involves the increasing length of the commands (Tomblin & Zhang, 2006). Test–retest reliability is .87 to .88 and coefficient alpha is .90 to .92 for students 5– 8 years of age. Understanding Spoken Paragraphs subtest evaluates the ability to understand oral narrative texts. The test–retest reliability is .76 to .81; coefficient alpha is .64 to .81 for students 5– 8 years of age. Underlining Test (Doehring, 1968). The Underlining Test is a paper-and-pencil measure of speed of processing (or focused attention). For each subtest, a target is displayed at the top of a page. Below are lines with the target stimuli and distracters. The participant underlines target stimuli as fast as possible for either 30 or 60 s. We used 3 subtests in which the target stimuli were (1) the number 4, nested with randomly generated single numbers; (2) a symbol (a plus sign) nested among other symbols; and (3) a diamond containing a square that also contained a diamond. The score for each subtest is the total number of correct targets identified mi-
nus errors. We computed age-adjusted residuals with a mean of 0 (SD = 1) for each subtest across the whole sample and then averaged these scores to create a composite.
Kaufman Brief Intelligence Test—2 (K-BIT; Kaufman & Kaufman, 2004). The Kaufman Brief Intelligence Test—2 is an individually administered intellectual screening measure. Verbal Knowledge assesses receptive vocabulary and general information (e.g., nature, geography). Matrices assesses nonverbal problem solving, requiring students to choose a diagram from among five or six choices that either "goes with" a series of other diagrams or completes an analogy. Both subtests are good indicators of overall intellectual functions. Internal consistency ranges from .86 to .89 for Verbal Knowledge and .78 to .88 for Matrices in students 5–8 years of age.

Results

Hypotheses 1 and 2 were assessed with multivariate analyses of variance (MANOVAs) comparing group performance across the seven cognitive variables. We followed procedures in Huberty and Olejnik (2006) for a descriptive discriminant analysis that permits interpretation of the contribution of a set of dependent variables to the MANOVA. A MANOVA computes a linear composite (i.e., discriminant function) that maximally separates groups. Following Huberty and Olejnik, we used three methods for interpreting the contribution of individual variables to the discriminant function: canonical structure correlations, standardized discriminant function coefficients, and univariate tests, where alpha per variable was set at .05/7 = .007. The canonical structure coefficients represented the bivariate correlation of each variable with the discriminant function maximally separating groups, whereas the standardized coefficients provided an index of the unique contribution of each variable to group separation given the set of variables selected for the model. We presented the univariate tests because there are no statistical tests associated with either of the two multivariate methods for interpreting the canonical variates. In a comparison of two groups, the canonical structure correlations and the univariate tests parallel one another (Huberty & Olejnik, 2006). Although MANOVA is not affected by the scaling of the measures, visual interpretation is facilitated when the measures are on the same scale. We adjusted the raw scores of the cognitive variables for age and retained the studentized residuals, placing all of the variables on a z-score metric. This permitted control for the small age differences across the four groups (see Table 2). We checked each group's distributions for restriction of range from either the scaling or the approach to group definitions and found no evidence of range restriction.
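The scaling step and the descriptive MANOVA can be sketched as follows. This is not the authors' code: the column names are hypothetical, and simple standardized residuals stand in for the studentized residuals reported above.

```python
# Sketch of the age-adjustment and z-scaling described above, plus a
# descriptive MANOVA across groups. Column names are hypothetical; the
# article uses studentized residuals, simplified here to standardized ones.
import pandas as pd
import statsmodels.api as sm
from statsmodels.multivariate.manova import MANOVA

COGNITIVE = ["pa", "rapid_naming", "underlining", "celf_usp",
             "celf_cd", "kbit_verbal", "kbit_matrices"]

def age_adjusted_z(df: pd.DataFrame) -> pd.DataFrame:
    """Regress each cognitive score on age; return residuals on a z metric."""
    X = sm.add_constant(df["age"])
    out = df[["group"]].copy()
    for col in COGNITIVE:
        resid = sm.OLS(df[col], X).fit().resid      # remove age-related variance
        out[col] = (resid - resid.mean()) / resid.std(ddof=1)
    return out

# Usage, assuming `students` has one row per child with "age", "group",
# and the seven cognitive columns:
#   scaled = age_adjusted_z(students)
#   formula = " + ".join(COGNITIVE) + " ~ group"
#   print(MANOVA.from_formula(formula, data=scaled).mv_test())
```

The per-variable alpha of .05/7 = .007 mentioned above would then be applied to the univariate follow-up tests rather than to the multivariate statistic.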
Comparisons of Decoding- and Fluency-Impaired Subgroups
Table 1 presents mean standard scores for the three reading measures used to define the groups, showing that the DF1 group was more impaired than the DF2 group. The F1 group, which fell below both fluency criteria, was more impaired in decoding than the F2 group, which fell below the TOWRE criterion; the least impaired group (F3) fell below only the CMERS criterion. Not surprisingly, because we created the differences among groups by the cut points used to define them, analyses of variance across the five groups were significant for all three measures (p < .0001). Table 1 shows that the DF1 group differs from the other four groups on all three measures. The DF2 group has higher decoding (and fluency) scores than the DF1 group, but is significantly below the three F groups on decoding. The DF2 group does not differ from the F1 and F2 groups on the TOWRE or from the F2 group on the CMERS, consistent with the definitions. The F1 and F2 groups are similar on WJIII Basic Reading, but significantly below the F3 group, which has above-average decoding scores. To maximize sample size within groups, we evaluated whether the decoding/fluency (DF1, DF2) and reading fluency (F1, F2, F3) subgroups could be differentiated on the cognitive variables.
Table 1
Means and Standard Deviations for the Decoding/Fluency and Fluency Groups on Criterion Academic Outcome Variables

                    DF1 (N = 19)    DF2 (N = 10)    F1 (N = 41)     F2 (N = 20)     F3 (N = 14)
Task                M       SD      M       SD      M       SD      M       SD      M        SD
Basic Reading (a)   78.37   9.79    84.80   4.69    98.02   4.68    98.95   4.54    108.71   4.27
TOWRE (a)           73.79   7.68    80.00   5.31    83.83   5.02    85.85   3.28    97.36    4.70
CMERS (b)           7.95    5.19    27.25   4.89    12.95   4.90    27.83   6.47    15.50    3.71

Note. DF1 = impaired on Basic Reading, TOWRE, and CMERS; DF2 = impaired on Basic Reading and TOWRE; F1 = impaired on TOWRE and CMERS; F2 = impaired on TOWRE; F3 = impaired on CMERS; Basic Reading = Woodcock-Johnson III composite of Word Identification and Word Attack (untimed decoding); TOWRE = Test of Word Reading Efficiency (timed decoding); CMERS = Continuous Monitoring of Early Reading Skills (passage fluency).
(a) Standard score (M = 100, SD = 15). (b) Words correct per minute.
The cognitive profiles are graphically displayed in Figure 2. For the two DF groups, the MANOVA was significant, F(7, 21) = 2.99, p < .025, η² = .50. Univariate comparisons were significant for CTOPP phonological awareness, F(1, 27) = 8.94, p < .006, KBIT Verbal Knowledge, F(1, 27) = 7.24, p < .02, and KBIT Matrices, F(1, 27) = 14.50, p < .0007 (all DF1 < DF2). These differences are consistent with the previously observed stepwise progression in severity of reading difficulties (DF1 < DF2) in Table 1, with Figure 2 also suggesting parallel cognitive profiles that reflect the severity of the reading problems, so we combined them for subsequent analyses. Note that the DF2 group is too small (n = 10) to treat separately, and the presence of decoding deficits indicates different treatment needs from the reading fluency groups; had we compared these two groups separately with the other groups, the difference would not have met a Bonferroni-adjusted critical value of alpha (p = .05/7 = .007). The MANOVA of the three fluency groups across the seven cognitive variables was not significant (see Figure 2), F(14, 132) = 1.18, p < .30, η² = .21.
No univariate comparisons achieved the critical level of alpha (p < .05).
Comparisons of Adequate and Inadequate Responder Groups
Sociodemographic variables. In Table 2, the frequencies for age, subsidized lunch, English as a Second Language status, and ethnicity are shown by group. There were significant differences in age across the four groups, F(3, 254) = 25.26, p < .0001, with the decoding/fluency group significantly older than the other three groups, which did not differ. However, the size of the age difference is small, with a maximum of about 7 months between the decoding/fluency and adequate responder groups. This difference was addressed by residualizing for age in scaling the cognitive data. There were no significant relations of group with gender, χ²(3, N = 258) = 6.80, p < .08; English as a Second Language status, χ²(3, N = 257) = 2.33, p < .51; or race, χ²(12, N = 258) = 17.24, p < .15. Subsidized lunch status and group were significantly related, χ²(3, N = 258) = 8.24, p < .05.
Figure 2. Cognitive profiles for inadequate responders defined by decoding and fluency (DF) criteria (upper panel) and only fluency (F) criteria (lower panel). PA = Phonological Awareness; CELF = Clinical Evaluation of Language Fundamentals—4; USP = Understanding Spoken Paragraphs; CD = Concepts/Directions; KBIT = Kaufman Brief Intelligence Test—2.
Table 2
Demographics by Group

                                   Decoding/Fluency   Reading Fluency   Responder   Typical
Variable                           (N = 29)           (N = 75)          (N = 85)    (N = 69)
Age (years)*          Mean         7.08               6.60              6.42        6.54
                      SD           0.48               0.33              0.28        0.41
% Male*                            72                 44                51          51
% Subsidized Lunch*                86                 71                61          59
% English as Second Language       31                 20                21          17
% Black                            45                 41                28          33
% White                            7                  17                19          15
% Hispanic                         48                 40                49          44
% Other                            0                  1                 4           9

*p < .05.
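The group comparisons on these demographic variables are chi-square tests of independence. The sketch below is illustrative only: the counts are reconstructed from the rounded percentages and group sizes in Table 2 (the article does not report the raw contingency table), so the result only approximates the published statistic for subsidized lunch.

```python
# Illustrative chi-square test of independence of the kind reported for the
# Table 2 demographic variables. Counts are reconstructed from the rounded
# percentages and group Ns in Table 2 and are therefore approximate.
import numpy as np
from scipy.stats import chi2_contingency

# Rows: decoding/fluency, reading fluency, adequate responder, typical.
# Columns: subsidized lunch (yes, no).
lunch = np.array([[25,  4],
                  [53, 22],
                  [52, 33],
                  [41, 28]])

chi2, p, dof, expected = chi2_contingency(lunch)
print(f"chi2({dof}, N = {lunch.sum()}) = {chi2:.2f}, p = {p:.3f}")
# Prints a value close to the reported chi2(3, N = 258) = 8.24, p < .05.
```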
The participants in the decoding/fluency group were more likely to receive a subsidized lunch. Consistent with Nelson et al. (2003), this difference was small, so we did not use this variable as a covariate.
Cognitive comparisons. For Hypotheses 1 and 2, we performed a subset of all possible comparisons to control Type I error and maintain power. The decoding/fluency group was compared to the reading fluency and adequate responder groups, and the reading fluency group was compared to the adequate responder group. This permitted a direct evaluation of differences between adequate and inadequate responders defined by different reading domains. We also compared the adequate responder and typical groups to evaluate the responders' progress toward the performance levels of the not-at-risk group. The significance level was controlled at .05 by setting the alpha per comparison at .0125 (.05/4). Table 3 provides means and standard deviations for the seven cognitive variables. A MANOVA of the age-residualized scores across the four groups was significant, F(21, 712.67) = 9.17, p < .0001, η² = .50; only the first discriminant function was significant (p < .0001).
Figure 3 shows the age-residualized z-score profiles for the four groups. The decoding/fluency group showed the poorest performance across measures of phonological awareness, rapid naming, and syntactic comprehension/working memory (Concepts and Directions), with a stepwise progression showing higher levels of performance in the reading fluency, adequate responder, and typical groups, in that order. As Table 3 shows, this progression parallels the progression of performance on the academic achievement measures used to define the groups, as well as on other measures of reading comprehension and spelling. Interpretation of the significant discriminant function in Table 4 shows that phonological awareness and rapid letter naming were most strongly related to group separation.
Decoding/fluency versus reading fluency groups. The MANOVA for the decoding/fluency and reading fluency groups did not meet the critical level of alpha, F(7, 96) = 1.89, p = .08, η² = .12. No univariate tests met the critical level of alpha (p < .0007); syntactic comprehension/working memory (p < .05) and rapid letter naming (p < .04) had the largest effects.
Table 3
Performance by Group on the Cognitive and Achievement Variables in Original Measurement Units

                              Decoding/Fluency   Reading Fluency   Adequate Responder   Typical
                              (N = 29)           (N = 75)          (N = 85)             (N = 69)
Variable                      Mean     SD        Mean     SD       Mean     SD          Mean     SD
Cognitive variables
  Phonological Awareness      83.53    11.85     93.00    9.93     101.03   9.07        111.49   10.05
  Rapid Naming                0.86     0.29      1.00     0.26     1.05     0.29        1.38     0.31
  Underlining Test            0.43     0.14      0.39     0.11     0.41     0.12        0.42     0.11
  CELF USP                    82.24    17.81     85.07    15.82    91.08    16.45       96.99    19.03
  CELF Concepts/Directions    69.31    9.98      79.07    13.65    86.59    13.98       93.62    16.80
  KBIT Verbal Knowledge       79.45    13.91     82.69    16.61    91.14    14.33       95.86    16.76
  KBIT Matrices               86.93    11.01     89.57    11.05    96.00    11.86       101.84   13.82
Achievement variables
  Basic Reading               80.59    8.85      100.27   6.09     111.63   8.04        118.93   10.49
  TOWRE                       75.93    7.48      86.89    6.82     100.77   7.09        113.57   12.11
  CMERS                       14.61    10.59     17.39    8.20     37.84    15.83       75.73    26.15
  Passage Comprehension       76.17    9.88      88.44    5.98     98.40    6.52        107.51   9.06
  Spelling                    81.97    11.11     96.47    7.07     105.58   10.05       116.09   11.72

Note. Phonological Awareness = average standard score of Comprehensive Test of Phonological Processing Blending Phonemes and Elision; CELF = Clinical Evaluation of Language Fundamentals—4; USP = Understanding Spoken Paragraphs; KBIT = Kaufman Brief Intelligence Test—2; Basic Reading = Woodcock-Johnson III composite of Word Identification and Word Attack (untimed decoding); TOWRE = Test of Word Reading Efficiency (timed decoding); CMERS = Continuous Monitoring of Early Reading Skills (passage fluency). Standard scores (M = 100, SD = 15) are used except for the timed tests (Rapid Naming and Underlining), which are in targets per second, and the CMERS, which is in words correct per minute.
Decoding/fluency versus adequate responder groups. The MANOVA for the decoding/fluency and adequate responder groups achieved the critical level of alpha, F(7, 106) = 3.71, p < .002, η² = .20. Table 4 shows that the three methods for interpreting the contribution of individual variables to the discriminant function (canonical correlation, standardized discriminant function, univariate) concurred in heavily weighting phonological awareness, rapid letter naming, and syntactic comprehension/working memory.
Reading fluency versus adequate responder groups. The MANOVA for the reading fluency and adequate responder groups achieved the critical level of alpha, F(7, 152) = 2.86, p = .0008, η² = .11. Table 4 shows results similar to the comparison of decoding/fluency and adequate responder groups. The three methods for interpreting variable contribution to the discriminant function concurred in heavily weighting phonological awareness. KBIT vocabulary and matrices also met the critical level of alpha for the univariate tests, but the standardized coefficients are relatively small in relation to the coefficient for phonological awareness.
Adequate responder versus typical groups. The MANOVA for the adequate responder and typical groups achieved the critical level of alpha, F(7, 146) = 15.54, p < .0001, η² = .43. Table 4 shows that the three methods concurred in weighting phonological awareness and rapid letter naming as the primary contributors to group separation.
Figure 3. Mean z scores for cognitive measures for groups of inadequate responders who meet both decoding and fluency criteria, only fluency criteria, responders, and typical achievers. PA = Phonological Awareness; CELF = Clinical Evaluation of Language Fundamentals—4; USP = Understanding Spoken Paragraphs; CD = Concepts/Directions; KBIT = Kaufman Brief Intelligence Test—2.

Regression Analyses: A Continuum of Severity?
The test of Hypothesis 3 was based on Stanovich and Siegel (1994), who compared cognitive functions in poor readers who met and did not meet IQ–achievement discrepancy definitions. Regression models were created in which each cognitive variable was predicted by the criterion reading measures and a contrast representing, in the present study, the difference between adequate and inadequate responders.
A significant beta weight for the contrast indicates variance in cognitive functions independent of reading level. This finding would suggest that a continuum-of-severity explanation (Vellutino et al., 2006) is an inadequate model of the relation of cognitive performance to differences between adequate and inadequate responders. For each of the seven models, the WJIII Basic Reading, TOWRE composite, and CMERS wcpm were entered into a regression model along with a single contrast (adequate vs. inadequate responders). Before conducting the seven regressions, we investigated the suitability of the data for regression analysis by evaluating (1) linearity of the relations between predictors and outcome variables; (2) heteroscedasticity; (3) non-normality of the residuals, which may be caused by outlier data points; and (4) multicollinearity among the predictor variables (Hamilton, 1992).
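Written schematically, each of the seven models has the form below; the notation is ours rather than the article's, and quadratic terms for Basic Reading or the CMERS were added where the linearity screening called for them (see below):

\[
\mathrm{Cog}_{i} = b_{0} + b_{1}\,\mathrm{BasicReading}_{i} + b_{2}\,\mathrm{TOWRE}_{i} + b_{3}\,\mathrm{CMERS}_{i} + b_{4}\,\mathrm{Contrast}_{i} + e_{i},
\]

where Contrast_i codes adequate versus inadequate responder status. A b_4 reliably different from zero would indicate group differences on the cognitive measure beyond those carried by the three reading scores; a b_4 near zero is what the continuum-of-severity account predicts.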
Table 4
Canonical Structure Coefficients, Standardized Discriminant Function Coefficients, and Univariate Tests by Group for Significant MANOVAs

                              All Groups      DF-Responder    RF-Responder    Responder-Typical
Variable                      r      sdfc     r      sdfc     r      sdfc     r      sdfc
Phonological Awareness        .84*   .84      .85*   .71      .78*   .61      .81*   .87
Rapid Naming                  .64*   .60      .54*   .41      .24    .18      .64*   .65
Underlining Test              .10    -.02     -.02   -.17     .34    .27      .06    -.07
CELF USP                      .28*   -.01     .37    -.04     .48    .12      .19    .04
CELF Concepts/Directions      .46*   .09      .68*   .37      .45    -.21     .34*   -.05
KBIT Verbal Knowledge         .33*   -.05     .36    .02      .61*   .37      .23    -.04
KBIT Matrices                 .40*   .26      .24    .20      .57    .38      .33    .23

Note. sdfc = standardized discriminant function coefficient; DF = decoding/fluency; RF = reading fluency; USP = Understanding Spoken Paragraphs; PA = Phonological Awareness; CELF = Clinical Evaluation of Language Fundamentals—4; KBIT = Kaufman Brief Intelligence Test—2. The comparison of decoding/fluency and reading fluency groups did not achieve the critical level of alpha, p < .0125.
*Univariate p < .007.
The evaluation of linearity suggested a quadratic term for WJIII Basic Reading in the model for rapid letter naming and for the CMERS in the phonological awareness model. Heteroscedasticity was significant only for KBIT Matrices, χ²(14) = 28.56, p < .01, so all tests of predictor regression weights in this model used a heteroscedasticity-consistent test. There was no evidence of non-normality or multicollinearity. Across all seven models, the contrast between adequate and inadequate responders was statistically significant (p < .05) only for Rapid Naming, b = -.15745, t(252) = -2.31, p < .02. The negative sign of the b weight adjusts the predicted mean score of the adequate responders down and the mean of the inadequate responders up relative to the prediction obtained from the reading level predictors alone. The direction of the change indicated that adequate and inadequate responders are more similar on Rapid Naming than would be predicted on the basis of the reading level predictors alone.
The addition of this contrast resulted in an increase in explained variance from 39% to 40%, a small increment. The group contrast did not account for unique variance in any of the other models, consistent with the hypothesized continuum of severity.

Discussion

For the overall research question, these results support the validity of classifications of LD that include evaluations of intervention response. Although we were not able to show statistically significant differences in cognitive skills between the decoding/fluency and reading fluency groups, both the decoding/fluency and reading fluency groups were clearly differentiated from the adequate responder group when separately compared (Figure 3), thus supporting the validity of the classification model proposed by Bradley et al. (2002). We also evaluated three hypotheses concerning the cognitive attributes of adequate and inadequate responders to Grade 1 reading intervention. The first hypothesis predicted that regardless of definition, inadequate responders identified with either decoding or
fluency criteria would show poorer performance than adequate responders or typically achieving students on measures of oral language. Although this hypothesis was supported in a univariate context, the support was less apparent in a multivariate context. Phonological awareness was a major contributor to group separation for the overall comparison of the four groups as well as for any significant two-group comparison. As a metacognitive assessment of language processing, phonological awareness is correlated with oral language measures, such as vocabulary, verbal reasoning, and listening comprehension, but loads on a different factor in latent variable studies in this age range (Fletcher, Lyon, Fuchs, & Barnes, 2007). Although the relation of phonological awareness and reading proficiency is well established, phonological awareness does not require reading and is often poorly developed in poor readers with strong oral language skills (Vellutino et al., 2004). Other comparisons of general verbal skills depended on the domain of language and which groups were being compared. A measure of listening comprehension did not contribute to group separation for any comparison. The measure of syntactic processing/ working memory (Concepts and Directions) contributed more robustly to comparisons of decoding/fluency and reading fluency groups versus adequate responders, but not for adequate responders versus typical students. Vocabulary did not contribute uniquely to group separation except for comparisons of the reading fluency group and adequate responders, which was also the only comparison where the nonverbal problem solving measure (KBIT Matrices) contributed uniquely. Although some univariate studies consistently identify low vocabulary/verbal reasoning as a major attribute of inadequate responders, often as a proxy for verbal intellectual capacity (Al Otaiba & Fuchs, 2006), like Stage et al. (2003) and Vellutino et al. (2006), the unique contribution of vocabulary to inadequate response was less apparent in a multivariate context, especially in relation to phonological awareness. Rapid letter naming skills also contributed uniquely to the separation of all four 18
groups, the comparison of adequate responders and typical students, and the decoding/ fluency and adequate responder groups. We did not find evidence for the second hypothesis. Rapid letter naming was not more strongly related to inadequate response if fluency criteria were used. As in other studies (Vellutino et al., 2006), phonological awareness measures were stronger correlates of inadequate response than rapid letter naming measures and other language skills regardless of the reading domain used to define responder status. The third hypothesis was largely supported. Six of the seven regression models for each cognitive variable residualized for the reading measures revealed no significant contrasts of the inadequate (combined decoding/ fluency and reading fluency groups) versus the adequate responder groups. Only the contrast in the model for rapid letter naming accounted for more variance than a model with the three reading level variables. The increment in explained variance was small (1%), but demonstrated that the models were adequately powered to detect small effects. If there are unique cognitive attributes of inadequate responders, we would expect that more of these contrasts would achieve statistical significance and that effects would be larger. In fact, there was a stepwise progression similar to that observed by Vellutino et al. (2006), in which the degree of severity in the cognitive profiles paralleled the levels of reading skills across inadequate and adequate response groups (see Table 3). Because the contrast of adequate and inadequate responders was largely accounted for by the criterion reading skills, which themselves reflect a continuum of severity, these results are consistent with Vellutino et al. Regarding the overall research question related to group differentiation on the basis of response to intervention criteria, the results indicate that a classification of LD incorporating inadequate response yields subgroups that can be differentiated on cognitive variables not used to create the subgroups. However, no single method would detect the pool of all inadequate responders. Particular concern should be expressed for the sole use of a
passage reading fluency measure as in some implementations of response to intervention models. As the comparison of identification rates shows, not all students who meet normreferenced criteria on other tests were detected with this approach. Elevating the benchmark for the passage reading fluency measure would have increased the number of students impaired only in passage reading fluency, which represents a relatively mild reading problem on a measure with lower reliability. Limitations of the Study The generalization of study results should be guided by our descriptions of the study sample, the intervention approach, and its implementation and outcomes. In addition, our choice of criteria for adequate intervention response should be considered. We did not incorporate criteria based on growth in this study or evaluate a dual discrepancy model based on both slope and end point assessments. We did not adequately assess verbal working memory because of the time required for these measures. We cannot determine whether syntactic comprehension versus working memory constructs account for the generally stronger contribution of the Concepts and Directions subtest to group differentiation relative to vocabulary and listening comprehension. Group averages do not address the variability of individuals within a group, an analysis that is beyond the scope of this article. However, there are many subtyping studies based on cognitive skills that generally have not shown relations with treatment outcomes (Vellutino et al., 2004). Morris et al. (1998) identified subtypes based on profiles across eight cognitive domains that identified subtypes of poor readers with variations in phonological awareness, rapid naming, and lexical skills using the same constructs as this study operationalized via other cognitive measures, including verbal and nonverbal working memory, spatial cognition, and processing speed. Thus, it may be that the variability among individuals within the inadequate responder
groups will reflect the subtypes identified by Morris et al. The intervention from which this study was derived resulted in growth in many students. However, that study was designed to evaluate the effects of a Tier 2 intervention as commonly implemented in schools (Denton, Cirino et al., in press). However, this intervention did not generate results as robust as other early intervention studies implemented for 25 weeks or more and began in the first semester of Grade 1 (e.g., Denton, Nimon et al., 2010; Mathes et al., 2005). Findings may have been different had we delivered more intensive interventions. The pattern of results may be different with older students, who are more likely to be impaired in reading comprehension, where the cognitive correlates are more closely associated with oral language skills (Catts & Hogan, 2003). Our study had many economically disadvantaged students. We only studied reading, so the results may not extend to LD involving math and written expression. Finally, the results do not apply to students who are assessed prior to the onset of formal reading instruction, where measures of phonological awareness, rapid naming, and vocabulary predict reading difficulties (Fletcher et al., 2007). There are other cognitive tests that could be used, including those commonly proposed for assessing cognitive processes, such as the WJIII cognitive battery (Flanagan et al., 2007), the Cognitive Assessment System (Naglieri, 1999), and subtests from the Wechsler intelligence scales (Hale et al., 2008). Our approach focused on constructs and was limited by the amount of assessment we could complete in the context of ongoing intervention research. Conclusions and Future Directions for Research Studies of this sort should be completed with additional cognitive variables in the context of interventions at Tiers 2 and 3 that are more robust than the present study. In addition, there are no studies we know of that address the cognitive characteristics of inade19
responders at the secondary level. Larger studies with a mixed group of economically advantaged and disadvantaged students may be able to evaluate the heterogeneity of the inadequate responder groups and whether the two decoding/fluency groups should be combined. Other domains of LD (e.g., math, written expression) should be investigated. The initial premise of this study was supported. Subgroups defined on the basis of inadequate response and low achievement can be differentiated on variables not used to define them. However, the differentiation seems to reflect a continuum of impairment that parallels the severity of impairment in reading skills as opposed to qualitatively distinct variation in the cognitive profiles of adequate and inadequate responders. Although more research is clearly needed, the results do not support the hypothesis of value-added benefits of assessments of cognitive processes for inadequate responders to a Tier 2 intervention (Hale et al., 2008). The critical assessment data for reading intervention planning may be the level of reading skills and the domains of impairment (decoding, fluency, comprehension). It is noteworthy that in recent research, group by treatment interactions have been demonstrated for assessments of reading components that are directly tied to instruction. Connor et al. (2009) have shown in a series of studies that helping classroom reading teachers vary the amount of code-based versus meaning-based instruction, based on strengths and weaknesses in decoding versus comprehension, leads to better outcomes compared to classrooms in which this assessment information and assistance was not provided. More obviously, providing reading interventions for students with reading disabilities is more effective than providing math interventions for students with reading difficulties (and vice versa). Thus, although assessing cognitive processes for intervention purposes may not be associated with qualitatively distinct cognitive characteristics and may not justify the extensive assessments proposed by Hale et al. (2008, in press), assessment of reading components and other academic skills appears to be well justified.
References Al Otaiba, S., & Fuchs, D. (2002). Characteristics of children who are unresponsive to early literacy intervention: A review of the literature. Remedial and Special Education, 23, 300 –316. Al Otaiba, S., & Fuchs, D. (2006). Who are the young children for whom best practices in reading are ineffective? An experimental and longitudinal study. Journal of Learning Disabilities, 39, 414 – 431. Barth, A. E., Stuebing, K. K., Anthony, J. L., Denton, C. A., Mathes, P. G., Fletcher, J. M., et al. (2008). Agreement among response to intervention criteria for identifying responder status. Learning and Individual Differences, 18, 296 –307. Bradley, R., Danielson, L., & Hallahan, D. P. (Eds.). (2002). Identification of learning disabilities: Research to practice. Mahwah, NJ: Erlbaum. Burns, M. K., & Senesac, S. V. (2005). Comparison of dual discrepancy criteria to assess response to intervention. Journal of School Psychology, 43, 393– 406. Catts, H. W., & Hogan, T. P. (2003). Language basis of reading disabilities and implications for early identification and remediation. Reading Psychology, 24, 223– 246. Connor, C. M., Piasta, S. B., Fishman, B., Glasney, S., Schatschneider, C., Crowe, E., et al. (2009). Individualizing student instruction precisely: Effects of child by instruction interactions on first graders’ literacy development. Child Development, 80, 77–100. Denton, C. A., Cirino, P. T., Barth, A. E., Romain, M., Vaughn, S., Wexler, J., et al. (in press). An experimental study of scheduling and duration of “tier 2” first grade reading intervention. Journal of Research on Educational Effectiveness. Denton, C. A., Nimon, K., Mathes, P. G., Swanson, E. A., Kethley, C., Kurz, T., et al. (2010). The effectiveness of a supplemental early reading intervention scaled up in multiple schools. Exceptional Children, 76, 394 – 416. Doehring, D. G. (1968). Patterns of impairment in specific reading disability. Bloomington, IN: University Press. Flanagan, D. P., Ortiz, S. O., & Alphonso, V. C. (Eds.). (2007). Essentials of cross- battery assessment. New York: John Wiley. Fletcher, J. M., Foorman, B. R., Boudousquie, A. B., Barnes, M. A., Schatschneider, C., & Francis, D. J. (2002). Assessment of reading and learning disabilities: A research-based, intervention-oriented approach. Journal of School Psychology, 40, 27– 63. Fletcher, J. M., Lyon, G. R., Fuchs, L., & Barnes, M. A. (2007). Learning disabilities: From identification to intervention. New York: Guilford Press. Francis, D. J., Fletcher, J. M., Stuebing, K. K., Lyon, G. R., Shaywitz, B. A., & Shaywitz, S. E. (2005). Psychometric approaches to the identification of learning disabilities: IQ and achievement scores are not sufficient. Journal of Learning Disabilities, 38, 98 –110. Fuchs, D., & Deshler, D. K. (2007). What we need to know about responsiveness to intervention (and shouldn’t be afraid to ask). Learning Disabilities Research & Practice, 20, 129 –136. Fuchs, D., Fuchs, L. S., & Compton, D. L. (2004). Identifying reading disabilities by responsiveness-to-instruction specifying measures and criteria. Learning Disability Quarterly, 27, 216 –227.
Gresham, F. M. (2009). Using response to intervention for identification of specific learning disabilities. In A. Akin-Little, S. G. Little, M. A. Bray, & T. J. Kehl (Eds.), Behavioral interventions in schools: Evidencebased positive strategies (pp. 205–220). Washington, DC: American Psychological Association. Hale, J., Alfonso, V., Berninger, V., Bracken, B., Christo, C., Clark, E., et al. (in press). Critical issues in response-to-intervention, comprehensive evaluation, and specific learning disabilities identification and intervention: An expert white paper consensus. Learning Disability Quarterly. Hale, J. B., Fiorello, C. A., Miller, J. A., Wenrich, K., Teodori, A. M., & Henzel, J. (2008). WISC-IV assessment and intervention strategies for children with specific learning disabilities. In A. Prifitera, D. H. Saklofske, & L. G. Weiss (Eds.), WISC-IV clinical assessment and intervention (2nd ed., pp. 109 –171). New York: Elsevier. Hamilton, L. C. (1992). Regression with graphics: A second course in applied statistics. Belmont, CA: Wadsworth. Hasbrouck, J., & Tindal, G. A. (2006). Oral reading fluency norms: A valuable assessment tool for teachers. The Reading Teacher, 59, 636 – 644. Huberty, C. J., & Olejnik, S. (2006). Applied discriminant analysis (2nd ed.). New York: Wiley. Kaufman, A. S., & Kaufman, N. L. (2004). Kaufman Brief Intelligence Test (2nd ed.). Minneapolis, MN: Pearson Assessment. Mathes, P. G., Denton, C. A., Fletcher, J. M., Anthony, J. L., Francis, D. J., & Schatschneider, C. (2005). An evaluation of two reading interventions derived from diverse models. Reading Research Quarterly, 40, 148 – 183. Mathes, P. G., & Torgesen, J. K. (2008). Continuous monitoring of early reading skills. Dallas, TX: Istation. Morris, R. D., & Fletcher, J. M. (1988). Classification in neuropsychology: A theoretical framework and research paradigm. Journal of Clinical and Experimental Neuropsychology, 10, 640 – 658. Morris, R. D., Stuebing, K. K., Fletcher, J. M., Shaywitz, S. E., Lyon, G. R., Shankweiler, D. P., et al. (1998). Subtypes of reading disability: Variability around a phonological core. Journal of Educational Psychology, 90, 347–373. Naglieri, J. A. (1999). Essentials of CAS assessment. New York: Wiley. Nelson, R. J., Benner, G. J., & Gonzalez, J. (2003). Learner characteristics that influence the treatment effectiveness of early literacy interventions: A metaanalytic review. Learning Disabilities Research & Practice, 18, 255–267. Pashler, H., McDaniel, M., Rohrer, D., & Bjork, R. (2009). Learning styles: Concepts and evidence. Psychological Science in the Public Interest, 9, 105–119. Reschly, D. J., & Tilly, W. D. (1999). Reform trends and system design alternatives. In D. Reschly, W. Tilly, & J. Grimes (Eds.), Special education in transition (pp. 19 – 48). Longmont, CO: Sopris West. Reynolds, C. R., & Shaywitz, S. E. (2009). Response to intervention: Ready or not? Or, from wait-to-fail to watch-them-fail. School Psychology Quarterly, 24, 130 –145.
Semel, E., Wiig, E. H., & Secord, W. A. (2003). Clinical evaluation of language fundamentals (4th ed.) San Antonio TX: The Psychological Corporation. Speece, D. L., & Case, L. P. (2001). Classification in context: An alternative approach to identifying early reading disability. Journal of Educational Psychology, 93, 735–749. Sprick, M. M., Howard, L. M., & Fidanque, A. (1998). Read Well: Critical foundations in primary reading. Longmont, CO: Sopris West. Stage, S. A., Abbott, R. D., Jenkins, J. R., & Berninger, V. W. (2003). Predicting response to early reading intervention from verbal IQ, reading-related language abilities, attention ratings, and verbal IQ-word reading discrepancy: Failure to validate discrepancy method. Journal of Learning Disabilities, 36, 24 –33. Stanovich, K. E., & Siegel, L. S. (1994). Phenotypic performance profile of children with reading disabilities: A regression-based test of the phonological-core variable-difference model. Journal of Educational Psychology, 86, 24 –53. Tomblin, J. B., & Zhang, X. (2006). The dimensionality of language ability in school-age children. Journal of Speech, Language, and Hearing Research, 49, 1193– 1208. Torgesen, J. K. (2000). Individual responses in response to early interventions in reading: The lingering problem of treatment resisters. Learning Disabilities Research & Practice, 15, 55– 64. Torgesen, J. K., Wagner, R., & Rashotte, C. (1999). Test of Word Reading Efficiency. Austin, TX: Pro-Ed. U.S. Department of Education. (2004). Individuals with Disabilities Education Improvement Act, 20 U.S.C. §1400. Washington DC: Author. Vellutino, F. R., Fletcher, J. M., Snowling, M. J., & Scanlon, D. M. (2004). Specific reading disability (dyslexia): What have we learned in the past four decades? Journal of Child Psychiatry & Psychology & Allied Disciplines, 45, 2– 40. Vellutino, F. R., Scanlon, D. M., & Jaccard, J. (2003). Toward distinguishing between cognitive and experiential deficits as primary sources of difficulty in learning to read: A two-year follow-up to difficult to remediate and readily remediated poor readers. In B. R. Foorman (Ed.), Preventing and remediating reading difficulties (pp. 73–120). Baltimore: York Press. Vellutino, F. R., Scanlon, D. M., Small, S., & Fanuele, D. P. (2006). Response to intervention as a vehicle for distinguishing between children with and without reading disabilities: Evidence for the role of kindergarten and first-grade interventions. Journal of Learning Disabilities, 39, 157–169. Wagner, R. K., Torgesen, J. K., & Rashotte, C. A. (1999). Comprehensive Test of Phonological Processing. Austin, TX: Pro-Ed. Woodcock, R. W., McGrew, K. S., & Mather, N. (2001). Woodcock-Johnson III Tests of Achievement. Itasca, IL: Riverside Publishing.
Date Received: September 1, 2009
Date Accepted: September 1, 2010
Action Editor: Sandra M. Chafouleas
Article accepted by previous Editor.
Jack M. Fletcher, Ph.D. is a Hugh Roy and Lillie Cranz Cullen Distinguished Professor of Psychology at the University of Houston. He is the Principal Investigator of the NICHD-funded Texas Center for Learning Disabilities. His research addresses classification and definition and the neuropsychological and neurobiological correlates of learning disabilities. Karla K. Stuebing, Ph.D. is a Research Professor at the Texas Institute for Measurement, Evaluation and Statistics, Department of Psychology, University of Houston. Her research focuses on measurement issues in development disorders. Amy E. Barth, Ph.D. is a Research Assistant Professor at the Texas Institute for Measurement, Evaluation and Statistics, Department of Psychology, University of Houston. Her research addresses the assessment of language and cognitive skills in language and learning disabilities. Carolyn A. Denton, Ph.D. is an Associate Professor of Pediatrics in the Children’s Learning Institute at the University of Texas Health Science Center at Houston. Her research is focused on interventions for children with reading disabilities and difficulties, and she is the Principal Investigator of an NICHD-funded study of interventions for children with both reading and attention difficulties. Paul T. Cirino, Ph.D. is a developmental neuropsychologist whose interests include disorders of math and reading, executive function, and measurement. He is an associate professor in the Department of Psychology at the Texas Institute for Measurement, Evaluation and Statistics at the University of Houston. David J. Francis, Ph.D. is a Hugh Roy and Lillie Cranz Cullen Distinguished Professor and Chairman of Psychology at the University of Houston. He is the Director of the Texas Institute for Measurement, Evaluation and Statistics with a long-term focus on measurement issues in learning disabilities. Sharon Vaughn, Ph.D. is the H.E. Hartfelder/Southland Corp Regents Chair at the University of Texas at Austin and the Executive Director of the Meadows Center for Preventing Educational Risk. She is interested in intervention studies of a variety of populations with reading problems.
School Psychology Review, 2011, Volume 40, No. 1, pp. 23–38
Teacher Judgments of Students’ Reading Abilities Across a Continuum of Rating Methods and Achievement Measures John C. Begeny and Hailey E. Krouse North Carolina State University Kristina G. Brown Georgia Gwinnett College Courtney M. Mann University of North Carolina—Chapel Hill Abstract. Teacher judgments about students’ academic abilities are important for instructional decision making and potential special education entitlement decisions. However, the small number of studies evaluating teachers’ judgments are limited methodologically (e.g., sample size, procedural sophistication) and have yet to answer important questions related to teachers’ judgments. Thus, a primary goal of the present study was to examine unanswered questions about teacher judgments (e.g., what is the relationship between teacher judgments and students’ performance on widely used reading measures) and to meaningfully improve upon earlier research methodologically (e.g., involving a large enough sample of teachers for sufficient statistical power). In doing so, teachers’ perceptions of students’ reading performance were examined across five different measures of reading ability, including direct measures such as the Dynamic Indicators of Basic Early Literacy Skills (DIBELS) and state-mandated end-of-grade tests, and indirect measures such as a brief teacher rating scale. Findings suggested that teachers had considerable difficulty judging students’ reading levels across most of the measures (e.g., DIBELS and end-of-grade tests), and were better judges of high-performing readers compared to low- and average-performing readers. Implications for research and practice are discussed.
Teachers' judgments about their students' academic achievement are highly important (Hurwitz, Elliott, & Braden, 2007; Meisinger, Bradley, Schwanenflugel, Kuhn, & Morris, 2009). For instance, based on judgments of student
achievement, teachers make daily decisions about instructional materials, teaching strategies, and student-learning groups (Clark & Peterson, 1986; Sharpley & Edgar, 1986). These judgments have been shown to influence teachers’
Correspondence regarding this article should be addressed to John C. Begeny, College of Humanities and Social Sciences, Department of Psychology, 640 Poe Hall, Campus Box 7650, Raleigh, NC 27695-7650; e-mail:
[email protected]
expectations of student achievement, the ways in which teachers and students interact, and student outcomes (e.g., Cadwell & Jenkins, 1986; Good & Brophy, 1986; Hurwitz et al., 2007). Moreover, long-term educational decisions, such as eligibility for special education services, are also affected by teacher judgments (Gresham, MacMillan, & Bocian, 1997; VanDerHeyden, Witt, & Naquin, 2003; Ysseldyke & Algozzine, 1983). Student performance on an objective assessment should also influence teachers' instructional decision making, which is particularly relevant in a response to intervention (RTI) model of data-based decision making. RTI models use curriculum-based measures of reading (CBM-R) as the most common method for assessing elementary-aged students' reading skills once every 3-4 months (for most students), and as regularly as once every 1 or 2 weeks (for students receiving intervention services; Burns & Gibbons, 2008). However, even with objective assessment data being generated for some students as much as once per week, teachers make more frequent instructional decisions based on their judgments of students' academic performance (Berliner, 2004; Gerber, 2005). Also, decisions based on teacher judgments are arguably the first step in preventing learning difficulties. More specifically, teachers generally have access to more regular, objective assessments of student learning (e.g., CBM-R) only after (a) a student is identified by multiple school personnel as having a learning difficulty, and (b) the student begins receiving more intensive intervention (Burns & Gibbons, 2008). In the absence of such data, ongoing and accurate judgments of student performance may prevent the need for intensive intervention because teachers can first employ early, less intensive forms of instruction/intervention. Related to the topic of RTI and data-based decision making, Gerber (2005) argued that even a standardized approach to RTI is not likely to be successful at a meaningful scale of implementation because of variation in teachers' judgments about students' responsiveness to instruction. More specifically, Gerber stated that "the teacher actively measures
the distribution of responsiveness in her class by processing information from a series of teaching trials and perceives some range of students as within the [teachable range]" (p. 516). He then posited that teacher perceptions of students' differential responsiveness to instruction ultimately influence how teachers refer students for special education. The diagnostic accuracy of special education referrals can be increased by including objective data such as CBM-R, but even CBM-R data may sometimes lack desired levels of specificity, sensitivity, and predictive power for important educational decisions (Christ & Silberglitt, 2007; VanDerHeyden et al., 2003). Thus, accurate teacher judgments may help to minimize the frequency of false positives and false negatives that result from measurement error.

Summary of Previous Research

There is an expanding body of research investigating the accuracy of teacher judgments, including a review of earlier research by Hoge and Coladarci (1989). In that review, Hoge and Coladarci evaluated 16 studies conducted between 1971 and 1988, and found a moderate correlation between student achievement and teacher judgments (Mdn = .66). Moreover, they found correlations that were highly variable, ranging from .28 to .92, which they interpreted to suggest the presence of moderating variables, such as degree of similarity between teacher judgment measures and the achievement measures, as well as item format of the judgment scales. However, subsequent studies investigating teacher judgments of students' academic abilities rarely examined whether teachers' judgment accuracy differs depending on the type of measure used to assess student achievement, and no studies compared teachers' judgments about students' reading skills to reading measures that are widely used in schools. Four studies have used CBM-R as a measure to evaluate teachers' judgment accuracy, but none used CBM-R in the ways it is commonly used in schools, and none used reading performance on commonly used CBM-R universal screening systems, such as the
Dynamic Indicators of Basic Early Literacy Skills (DIBELS; Good & Kaminski, 2002). Previous research with CBM-R found that teachers significantly overestimated third-grade students' performance (Hamilton & Shinn, 2003), and showed that teachers were more accurate when asked to estimate students' performance compared to their peers rather than estimating actual reading scores in words read correctly per minute (WCPM; Feinberg & Shapiro, 2003). However, a notable criticism of these earlier CBM-R studies was that teachers' ability to predict students' WCPM on particular passages was not a realistic standard by which teachers should be determined as accurate or inaccurate judges of students' reading achievement. In other words, accurately estimating students' exact WCPM score may be impractical or even irrelevant, and has less instructional decision-making utility than accurately judging students' reading level (e.g., low, average, high), as determined by WCPM scores and CBM-R benchmark standards. Previous studies also showed that correlations between student achievement and teacher judgment may actually mask teachers' judgment accuracy (Eckert, Dunn, Codding, Begeny, & Kleinmann, 2006; Feinberg & Shapiro, 2003), suggesting a need to use percentage agreement analyses to evaluate teacher judgments. However, percentage agreement analyses have shown inconsistent results. One study found that teachers made more accurate judgments about students who were performing at the lowest level (i.e., frustration) on grade-level material (Eckert et al., 2006), whereas another found that teachers had particularly poor judgment accuracy of students reading at frustration and instructional levels (Begeny, Eckert, Montarello, & Storie, 2008).

Limitations of Previous Research, Purposes of This Study, and Research Questions

Although research in this area has progressed, the small number of previous studies still limits our understanding about teacher judgments in important ways. First, of the two
studies requiring teachers to estimate student reading levels based on CBM-R (Begeny et al., 2008; Eckert et al., 2006), both included small samples of teachers (10 and 2, respectively) and students (87 and 33, respectively), which limits the statistical power and external validity of findings. Second, student reading levels in those studies were determined with (a) CBM-R passages developed only for research purposes (and therefore not actually used by teachers in schools), and (b) CBM-R benchmark standards that were developed approximately 30 years ago (Fuchs & Deno, 1982) and were unconnected to the actual passages used in the study. This is an important limitation in the research because reading levels identified with CBM-R scores are now in widespread use across the country and the normative samples used to determine these levels have been significantly updated and improved upon from the earlier benchmarks (e.g., Good & Kaminski, 2002). Third, previous studies have not asked teachers to estimate students' reading performance on the most widely used measures of reading, such as common universal screening assessment systems (e.g., DIBELS) and state-mandated end-of-grade tests. In an era of increased academic accountability in U.S. schools, an understanding about teachers' judgments of students' performance compared to these types of measures is warranted because both measures serve as educationally meaningful indicators of student reading skill. One goal of the present study was to examine previously unanswered research questions about teacher judgments. First, we asked, To what extent are teachers able to accurately judge students' performance on the DIBELS oral reading fluency measure, as determined by students' specific WCPM score and students' reading level (i.e., the updated benchmark categories of At Risk, Some Risk, or Low Risk; Good & Kaminski, 2002)? As noted previously, reading levels that inform educational decision making are arguably the most meaningful standard by which teachers should be determined as accurate or inaccurate judges of students' reading achievement, but no previous research has examined teachers'
judgments with commonly used reading levels such as those obtained with DIBELS. For similar reasons, the second primary research question asked, To what extent are teachers able to accurately judge students' performance on students' end-of-grade reading assessment? Third, to gather a more comprehensive understanding about teacher judgments and the alternative methods by which to evaluate teacher judgments, this study also asked, How is teacher judgment accuracy meaningfully influenced by using indirect measures of reading performance (i.e., a rating scale of reading abilities and a class-comparison ranking task)? Collectively, this study extends previous research by evaluating teacher judgments with five different measures of reading, including three important reading measures (DIBELS WCPM scores, DIBELS reading levels, end-of-grade tests) not previously evaluated in this area of research. This study also adds unique information to the relatively small teacher judgment research literature because we (a) included a substantially larger sample of teachers and students, which allowed for adequate statistical power; and (b) evaluated whether teacher judgment accuracies appeared better or worse depending upon the type of analytic strategy employed (i.e., correlational analyses, percentage agreement indices, and inferential statistics). Moreover, we examined judgment accuracy across grade levels and student ability levels because previous research found differential judgment accuracies for those variables (Begeny et al., 2008; Meisinger et al., 2009).

Method

Participants

Teacher informed consent forms were distributed to all first- through fifth-grade teachers (n = 27) in one Southeastern kindergarten through fifth-grade elementary school. A total of 27 teachers (100%) provided their consent and consequently participated in the current study. All students in each of the 27 participating teachers' classrooms (N = 502) were asked to participate in a larger project
that occurred at the same time within the same school. Participation in this larger project involved having students read four to six oral reading fluency passages (described in detail later) and reviewing students' end-of-year, statewide test results. Of those students, 486 (96.8%) returned parental consent for participation. For the purposes of this study, eight students from each classroom were randomly selected for participation and we asked teachers to estimate those eight students' reading abilities. Eight students were selected from each class in order to (a) increase the probability that teachers would be rating students with varying reading-ability levels, (b) maintain consistency for the number of students each teacher would rate, and (c) obtain a large enough sample of students. According to the participating school's end-of-grade state-mandated reading assessment, students in the participating school performed, on average, similar to national averages of the U.S. reading assessment. Specifically, students in the participating school read at the following levels: Below Basic = 29.2%; Basic = 47.6%; Proficient = 21.7%; and Advanced = 1%. Reading levels across all participants in this study were commensurate with school-wide end-of-grade test scores in reading. Thus, teachers ultimately estimated the reading abilities of a representative sample of students from the participating school. Students. After selecting 216 students (eight students per classroom), 212 students were represented in the data set because four students were absent during data collection. Of the student participants, 51.4% (109) were female, 41.0% were African American, 36.8% were Caucasian, 18.4% were Hispanic, and 1.0% reported mixed race. The large majority of students (81.5%) received free or reduced-price lunch; 7.5% of the participants had been retained and 6.5% received special education services. Teachers. Six of the 27 teacher participants (all female) taught first grade and 6 taught third grade, and 5 each taught second, fourth, and fifth grades. Eight teachers had a
bachelor of arts (BA) or bachelor of science (BS) degree, 3 teachers held BA/BS degrees plus additional training, 11 teachers had master of arts (MA) degrees, 4 teachers had MA degrees plus additional training, and 1 teacher was nationally certified as a teacher. The total number of years teaching ranged from 1 to 30 years (M = 15.6, SD = 9.2) and the total number of years teaching at the current grade level ranged from 1 to 26 years (M = 8.3, SD = 6.9). Eighteen teachers (66.7%) indicated they received at least some training on issues related to reading fluency through undergraduate training (n = 2), graduate training (n = 4), workshops (n = 6), or other professional development activities (n = 6), but the participating school was not using CBM-R as a part of their reading assessment procedures. Thus, teachers did not have prior experience administering CBM-R or interpreting CBM-R data. When asked how much reading time was allocated daily in the classroom, the first- and fourth-grade teachers reported 1.5 hr, whereas the second-, third-, and fifth-grade teachers reported 2 hr of daily reading instruction. All of the teachers at this school used the Open Court (McGraw-Hill Education, 2002) reading curriculum, supplemented by trade books, skill sheets, and library time. A power analysis indicated that a sample size of 27 teachers would provide adequate power (.78) for the statistical testing of two-tailed correlational relationships (Cohen, 1992). Based on findings and procedures from previous research (e.g., Begeny et al., 2008; Eckert et al., 2006), the power analysis was calculated by setting the significance level to .05 and assuming a large effect (.5).

Materials

Oral reading fluency. First- through fifth-grade benchmark passages from the DIBELS were used to evaluate students' oral reading fluency. The DIBELS Oral Reading Fluency (DORF) measure is a standardized, individually administered assessment of a student's rate and accuracy of reading connected text. The student reads aloud from a passage
for 1 min and the scorer records the occurrence of errors, omissions, substitutions, or hesitations longer than 3 s. The score is the total number of words read correctly within 1 min (WCPM), which can then be used to determine a categorical score (At Risk, Some Risk, or Low Risk). Strong correlations (ranging from .60 to .80) have been found between DORF and statewide assessments of reading achievement (Elliott, Lee, & Tollefson, 2001; Good, Simmons, & Kame'enui, 2001; Hintze, Ryan, & Stoner, 2003; McGlinchey & Hixson, 2004). In the present study, we used the first and third winter benchmark passages at each grade level (assessments occurred in mid-December). Of the two passages administered per grade level, the student's average WCPM score was used for analyses. Teacher Rating Scale of Reading Performance (TRSRP). The 9-item TRSRP (Begeny et al., 2008) was used to assess teachers' perceptions of their students' reading abilities. The TRSRP asks teachers to respond to 9 items on a 5-point scale, ranging from Consistently Poor to Consistently Successful. Teachers are asked to judge students' performance across a broad range of reading skills, including decoding, reading accuracy, reading fluency, reading comprehension, average completion rates in reading/language arts written work, application of reading skills to other school subjects, and overall reading performance. Sample items include "Please rate the student's level of fluency during oral reading" and "Please rate the student's ability to apply learned reading skills to other school subjects (e.g., social studies)." Based upon data from the present study, the TRSRP demonstrated adequate reliability (alpha = .97), and a principal components factor analysis resulted in a one-factor solution, accounting for 81.5% of the variance. Teacher interview data sheet. An interview data sheet (similar to the sheet used by Begeny et al., 2008) was used to immediately record teacher responses during each teacher interview. On the interview data sheet, interviewers recorded teachers' ranks and
estimates of each student participant's reading abilities. Information recorded on the data sheet also included (a) estimates of DORF reading level (i.e., At Risk, Some Risk, or Low Risk) for grade-level passages, (b) estimates of WCPM on the two grade-level passages administered to each student, (c) percentile estimates of each student in terms of his/her reading fluency, compared to all of the other students in the student's classroom, and (d) estimates of Language Arts scores on the Palmetto Achievement Challenge Test (PACT). Class percentile-ranking chart. Toward the end of the teacher interview, each teacher was asked to estimate each student's reading fluency skills compared to all other students in her classroom. A class ranking chart (identical to the chart used by Begeny et al., 2008) was used for making estimations. The chart consisted of six numbers, each corresponding to a specific percentile range (1 = the bottom 10th percentile; 2 = the 11th through 30th percentile; 3 = the 31st through 50th percentile; 4 = the 51st through 70th percentile; 5 = the 71st through 89th percentile; 6 = the top 10th percentile). Students' actual percentile rank was determined by ranking all students in each given classroom according to each student's winter DORF WCPM score. PACT. As part of their end-of-year statewide assessment, educators at the participating school administered the PACT to all third- through fifth-grade students and then reported those scores to us for purposes of our study. The PACT is the state's means of assessing student progress toward national educational standards and all test items are aligned with the state's academic standards. Consistent with the U.S. test of reading achievement given at Grades 4 and 8, student scores on the PACT are categorized as Below Basic, Basic, Proficient, and Advanced. For the purposes of this study, we were concerned with student scores on the English Language Arts subtest of the PACT. Reliability of the PACT is strong, with alpha coefficients for the English Language Arts portion of the test
above .91 for each grade level (Tenenbaum, 2003).

Procedure

Undergraduate psychology students and graduate-level school psychology students administered DORF measures with student participants. Each research assistant was trained to administer DORF measures prior to the start of the project and each demonstrated 100% accurate administration on three consecutive trials before administering the DORF with student participants. All student participants (with the exception of the first-grade students) were asked to read a total of six passages: two passages at grade level, two passages one grade level below, and two passages one grade level above. First-grade students read four passages (two first- and two second-grade passages). For the purposes of the teacher interviews, we were only concerned with students' performance on the two grade-level passages. Teacher interviews lasted approximately 12-15 min and were completed by the primary researcher and three school psychology graduate students. Like the procedures employed by Begeny et al. (2008), all interviews were conducted one-on-one with teachers, and the interviewers followed a specified interview protocol created by, and available from, the first author. Each teacher was asked to estimate 8 students' reading abilities from her classroom. The interviewer completed all questions regarding 1 student before asking questions about the next, and the order of students queried about during the interview was determined randomly. Each interview began by asking the teacher general questions and explaining important concepts related to the interview. First, the teacher was asked about the reading curriculum currently used in her classroom, how much time was allotted each day for reading, the types of reading activities employed during that time, and the types of materials used for reading instruction. The researcher then explained the accuracy and fluency definitions relevant for the remainder of the interview. This was done to ensure that teachers clearly
understood the reading behaviors evaluated by the DORF, so that teacher judgments would not be unfairly influenced by lack of knowledge about this measure. Accuracy was defined as the proportion of correct to incorrect words a student said while reading a passage aloud, regardless of how long it took the student to read the passage. Fluency was defined as reading aloud with speed and accuracy. The researcher provided examples and emphasized that fluency takes into account both reading accuracy and reading speed. Each teacher was encouraged to ask questions and attempts were made to ensure that the teacher clearly understood the difference between reading accuracy and reading fluency. Each teacher verbally indicated she understood the distinction between these two concepts before the researcher proceeded with the interview. Next, the researcher presented the teacher with a reading-level table displaying the DIBELS winter benchmark standards for the teacher's classroom grade level. The researcher explained that a student's reading level (i.e., At Risk, Some Risk, or Low Risk) is determined by WCPM at each particular grade level. The teacher was given information about the range of WCPM scores that represented each reading level (e.g., a third-grader reading third-grade material would fall in the At Risk range if she read fewer than 67 WCPM, the Some Risk range if she read between 67 and 91 WCPM, and the Low Risk range if she read more than 91 WCPM). The researcher also described how grade-level reading material will probably be too difficult for students in the At Risk range, whereas students in the Low Risk range are likely ready to begin practicing material of greater difficulty. The teacher was then asked to respond to a series of questions about one of the 8 student participants from her class. First, she was asked to estimate the student's reading level (i.e., At Risk, Some Risk, or Low Risk) for grade-level material. Next, the researcher showed the teacher the two DIBELS grade-level reading passages her students read and allowed the teacher to read each passage. All
reading passages displayed the cumulative word count of each line of the passage, and the teacher was asked to predict the number of WCPM that the student would read on each of the two passages. The average of the two WCPM estimates represented the teacher-estimated scores used in our analyses. The researcher then presented the teacher with the class percentile-ranking chart and explained the chart. The researcher asked the teacher to estimate the number (1-6) that best represented the student's reading fluency skills. This task was not dependent upon the teacher's judgment of each student's DORF level. In other words, one task involved the teacher estimating the student's DORF level and associated WCPM scores, whereas another task was to estimate the student's reading fluency skills according to the class percentile-ranking chart. Following the percentile-ranking estimate, each third- through fifth-grade teacher was asked to estimate the student's Language Arts score on the PACT (first- and second-grade students did not take the PACT). Each teacher was shown the score classification choices (i.e., Below Basic, Basic, Proficient, and Advanced) and was asked to select one as her estimate of the student's score.

Overview of Dependent Measures, Predictor Variables, and Analyses

Dependent measures included students' observed reading performance (e.g., PACT and DORF scores) and predictor variables included teachers' estimates of students' reading abilities (e.g., TRSRP ratings, estimated DORF and PACT scores). Because previous studies have suggested that teachers' judgments are not always portrayed accurately through correlation coefficients alone (Feinberg & Shapiro, 2003), and because others have argued that percentage agreement analyses or inferential statistics provide a clearer understanding about teacher judgment accuracy (Begeny et al., 2008; Eckert et al., 2006; Feinberg & Shapiro, 2003), teachers' judgment ability was assessed through percentage agreement analyses, inferential
statistics (t tests and effect sizes), and correlational analyses. One method of percentage agreement analysis compared students' actual reading level (i.e., At Risk, Some Risk, or Low Risk) on grade-level material to teachers' estimates of the students' reading level (i.e., At Risk, Some Risk, or Low Risk) on grade-level material. With respect to correlational analyses, Pearson's product-moment correlation coefficient (r) was computed for the following variables: (a) students' WCPM on grade-level material and teachers' estimates, and (b) students' WCPM on grade-level material and teachers' ratings of their reading skills on the TRSRP. Kendall's tau-b correlation coefficient (rT) was computed for the following variables: (a) students' DORF reading level and teachers' estimates, (b) students' PACT reading level and teachers' estimates, and (c) students' class ranks in reading fluency (according to the class percentile-ranking chart) and teachers' estimated ranks.

Interscorer Reliability and Procedural Integrity

As noted previously, DORF assessments within this study were administered as part of a larger project that occurred at the same time and in the same school. Research assistants in this study were also responsible for administering assessments for the larger project; in total, 2,280 DORF assessments were administered with 486 first- through fifth-grade students. Of these assessments, 39.6% of the student testing sessions were audiotaped to allow for the calculation of interscorer reliability of the DORF administrations. Trained research assistants listened to each session and independently scored WCPM and words read incorrectly per minute. Reliability was calculated by dividing the number of words agreed upon by the total number of words read, and then multiplying that value by 100. Results revealed an average reliability measurement of 99.3% (range = 86%-100%). To evaluate interviewers' adherence to the interview protocol, each interviewer audiotaped two to three interviews, resulting
in 10 (37.0%) audiotaped sessions. A trained research assistant listened to each session and recorded the number of procedural steps followed correctly during the interview. Across each interviewer, 100% of steps were followed correctly. Using the audiotapes, we also computed interscorer agreement to assess whether interviewers accurately recorded teachers' statements and responses. Interscorer agreement was 100%.

Results

Prior to all analyses, we considered whether the following teacher-related variables may have influenced teachers' judgment accuracy of their students' reading abilities: grade level taught (Grades 1, 2, 3, 4, or 5); years of teaching experience (few = <5 years; moderate = 6-15 years; several = 16+ years); amount of professional training (master's degree and/or national certification vs. no master's degree); and previous training in reading fluency (yes or no). Given the number of student behavior and teacher estimation variables present in our study, we selected only one student behavior/teacher estimation variable as the focus for the above comparisons: teachers' percentage of agreement between students' actual DORF reading level (At Risk, Some Risk, or Low Risk) and teachers' reading-level estimates. Compared to the other variables measured in this study, we feel this particular comparison has the strongest educational implications because this is arguably the most useful estimate a teacher could make about a student's reading ability in order to make good and timely instructional decisions (Good & Kaminski, 2002; Good et al., 2001). To classify each teacher with respect to her overall judgment accuracy, we developed a 3-point scale based on the 7-8 students about whom each teacher provided academic judgments: 1 = 0-33% accurate judgments, 2 = 34-67% accurate judgments, and 3 = 68-100% accurate judgments. For our initial analyses, we categorized data in this way as a pragmatic means to aggregate a large amount of data and improve the interpretability of our initial analyses.
Table 1
The Relationship Between Teacher Judgment and Students' Reading Performance

Dependent Measures and Predictors: Correlation Coefficient
Students' WCPM on grade-level material and teachers' estimates of students' WCPM on grade-level material: r = .51**
Students' reading level at grade-level material and teachers' estimates of students' reading level: rT = .47**
Students' PACT reading score and teachers' estimates of students' PACT reading score (a): rT = .58**
Students' WCPM on grade-level material and teachers' rating of reading skill on the TRSRP rating scale: r = .43**
Students' percentile rank in reading fluency and teachers' estimated percentile rank of students' reading fluency: rT = .56**

Note. WCPM = words read correctly per minute; TRSRP = Teacher Rating Scale of Reading Performance; PACT = Palmetto Achievement Challenge Test; r = Pearson correlation coefficient; rT = Kendall's tau-b correlation coefficient.
(a) PACT scores are based on third- through fifth-grade participants only.
** p < .01.
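Table 1 mixes Pearson and Kendall tau-b coefficients depending on whether the paired variables are continuous (WCPM estimates) or ordinal (reading levels, PACT categories, percentile ranks). As a rough illustration of how coefficients of this kind can be computed, the following Python sketch is offered; it is not the authors' analysis code, and the data and variable names are hypothetical.

```python
# Illustrative only: hypothetical data, not the study's dataset.
from scipy import stats

# Continuous pairs: actual WCPM vs. teacher-estimated WCPM
actual_wcpm = [42, 67, 88, 95, 110, 53, 71]
estimated_wcpm = [60, 80, 85, 120, 115, 70, 90]
r, p_r = stats.pearsonr(actual_wcpm, estimated_wcpm)  # Pearson r for continuous scores

# Ordinal pairs: DORF reading level coded 0 = At Risk, 1 = Some Risk, 2 = Low Risk
actual_level = [0, 1, 2, 2, 2, 0, 1]
estimated_level = [1, 1, 2, 2, 1, 0, 2]
tau_b, p_tau = stats.kendalltau(actual_level, estimated_level)  # Kendall's tau-b for ordinal categories

print(f"Pearson r = {r:.2f} (p = {p_r:.3f}); Kendall tau-b = {tau_b:.2f} (p = {p_tau:.3f})")
```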
The series of chi-square tests used to evaluate the possible differences in teacher judgment accuracy revealed no significant differences across any of the teacher-related variables, including differences in judgment accuracy between individual teachers (range of p = .25-.52). Because of these findings, we aggregated our data across teachers for remaining analyses.

Teachers' Judgments of Students' Performance on the DORF Measures

As shown in Table 1, correlation coefficients solely related to DORF measures were in the moderate range. The correlation between students' actual WCPM on grade-level material and teachers' estimates was .51, and the relationship between students' DORF reading level and teachers' estimates of reading level was .47. Percentage agreement and "how close" judgments. We evaluated percentage of agreement between teachers' estimates of DORF reading level and students' actual
DORF reading level on grade-level material. As shown in Table 2, teachers accurately identified just over half (57.5%) of students' actual reading levels. Across grade level of teachers, second-grade teachers accurately judged the highest percentage of students (72.5%) and third-grade teachers accurately estimated the lowest percentage of students (44.7%). Using a chi-square analysis, we found no statistically significant differences between grade levels on teachers' judgment accuracy with this measure, chi-square(4, N = 212) = 7.18, p = .13. We also evaluated whether teachers were better at estimating students who scored in the At Risk, Some Risk, and Low Risk ranges. We found that 55.3% of students scoring At Risk (N = 56) were accurately estimated at that level. Further, 40.3% and 71.3% of students scoring at the Some Risk (N = 62) and Low Risk (N = 94) levels were accurately estimated at those respective levels. These differences were statistically significant, chi-square(2, N = 212) = 15.14, p < .01.
Table 2
Percentage of Accurate Teacher Judgments for DORF Reading Levels and PACT Levels

Grade Level (Number of Students)     DORF Reading Level     PACT Reading Level
First (46)                           60.9%                  Not applicable
Second (40)                          72.5%                  Not applicable
Third (47)                           44.7%                  56.5%
Fourth (39)                          56.4%                  51.3%
Fifth (40)                           55.0%                  52.9%
All grades combined (212)            57.5%                  53.8%

Note. DORF = Dynamic Indicators of Basic Early Literacy Skills Oral Reading Fluency; PACT = Palmetto Achievement Challenge Test.
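The percentage agreement figures in Table 2 and the accompanying chi-square tests can be illustrated with a brief sketch. This is only an assumed reconstruction of the general approach (the article does not specify which chi-square variant was computed), and the counts and function name below are hypothetical rather than the study's data.

```python
# Illustrative only: hypothetical counts, not the study's data.
from scipy.stats import chi2_contingency

def percent_agreement(actual, estimated):
    """Percentage of cases where the teacher's estimated category matches the actual category."""
    matches = sum(a == e for a, e in zip(actual, estimated))
    return 100.0 * matches / len(actual)

actual = ["At Risk", "Some Risk", "Low Risk", "Low Risk"]
estimated = ["Some Risk", "Some Risk", "Low Risk", "At Risk"]
print(percent_agreement(actual, estimated))  # 50.0

# 2 x 3 contingency table: rows = accurate vs. inaccurate judgments,
# columns = actual DORF level (At Risk, Some Risk, Low Risk); counts are made up.
table = [[31, 25, 67],
         [25, 37, 27]]
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi-square({dof}) = {chi2:.2f}, p = {p:.3f}")
```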
For inaccurate teacher estimates of students' DORF reading levels, we also evaluated how close teachers were to an accurate judgment and whether they tended to over- or underestimate students' abilities. We found that teachers overestimated 55.6% of the students and underestimated 44.4% of the students. We also recorded whether teacher estimates were one level off (e.g., student actually scored At Risk, but teacher estimated Some Risk) or two levels off (e.g., student actually scored Low Risk, but teacher estimated At Risk). For inaccurately estimated students who scored At Risk, 69.2% of these students were judged to be Some Risk (one level off) and 30.8% of these students were estimated to be Low Risk (two levels off). For inaccurately estimated students who scored Low Risk, 74.1% of these students were judged to be Some Risk (one level off) and 25.9% of these students were estimated to be At Risk (two levels off). Of the inaccurately estimated students who scored Some Risk, teachers overestimated 64.9% of these students (i.e., estimated them to be Low Risk). Statistical differences between actual and estimated WCPM scores. To further examine teachers' judgment accuracy of students' actual performance and to evaluate a form of continuous data across students with varying ability levels (similar to analyses used in related research), a series of paired-samples t tests were conducted. Specifically, students
were divided into three separate groups according to their reading levels, and within each group, differences between actual and estimated WCPM scores were evaluated. Table 3 shows there were significant differences (p < .01) between actual and estimated WCPM scores across each of the three reading levels. Across each reading level, teachers, on average, overestimated students' actual WCPM. Effect sizes, represented as Cohen's d (Cohen, 1988), are also presented in Table 3. These data reveal a medium to large effect size (d = 0.67) for the At Risk group and a large effect size (d = 0.98) for the Some Risk group.

Teachers' Judgments of Students' Performance on the PACT End-of-Grade Test

Consistent with the correlations found with the DORF measures, Table 1 shows that the correlation between students' estimated and actual performance on the PACT was in the moderate range (rT = .58). Similar to the DORF analyses above, we evaluated percentage of agreement between teachers' estimates of PACT reading levels and students' actual PACT reading levels. Again, these analyses only include third- through fifth-grade teachers and students because students in earlier grades do not take the PACT.
Table 3
Summary of Descriptive, Inferential, and Effect Size Statistics of the Teacher Judgments and Students' DORF Performance on Grade-Level Material Across Reading Levels

Students' Reading Level     Actual M (SD)      Estimated M (SD)     df     t       p        d
At Risk (WCPM)              45.81 (27.23)      75.38 (56.31)        55     4.82    <.001    +0.67
Some Risk (WCPM)            59.82 (37.92)      107.27 (58.11)       61     6.78    <.001    +0.98
Low Risk (WCPM)             108.20 (39.03)     123.73 (54.22)       93     2.90    .005     +0.33

Note. DORF = Dynamic Indicators of Basic Early Literacy Skills Oral Reading Fluency; WCPM = words read correctly per minute; d = effect size, calculated as (estimated mean score - actual mean score)/pooled SD.
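A minimal sketch of the paired-samples t test and the effect size defined in the Table 3 note, (estimated mean - actual mean)/pooled SD, follows. The scores are hypothetical, and the pooled SD shown (root mean square of the two group SDs) is one common reading of the note rather than necessarily the authors' exact computation.

```python
# Illustrative only: hypothetical scores, not the study's data.
import numpy as np
from scipy import stats

actual = np.array([40.0, 52.0, 38.0, 61.0, 45.0, 50.0])      # actual WCPM within one reading level
estimated = np.array([66.0, 75.0, 70.0, 95.0, 60.0, 88.0])   # teacher-estimated WCPM for the same students

t, p = stats.ttest_rel(estimated, actual)                     # paired-samples t test

# Cohen's d per the Table 3 note: (estimated mean - actual mean) / pooled SD
pooled_sd = np.sqrt((estimated.std(ddof=1) ** 2 + actual.std(ddof=1) ** 2) / 2)
d = (estimated.mean() - actual.mean()) / pooled_sd

print(f"t({len(actual) - 1}) = {t:.2f}, p = {p:.4f}, d = {d:.2f}")
```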
As shown in Table 2, teachers accurately identified 53.8% of students' actual PACT levels. Across grade level of teachers, third-grade teachers accurately judged the highest percentage of students (56.5%) and fourth-grade teachers accurately estimated the lowest percentage of students (51.3%). There were no statistically significant differences between grade levels, chi-square(2, N = 119) = 0.25, p = .88. We also evaluated whether teachers were better at estimating students who scored in the Below Basic, Basic, Proficient, or Advanced ranges. We found the following percentages of judgment agreements across these ranges: Below Basic (70.0%, N = 23), Basic (54.7%, N = 64), Proficient (41.4%, N = 29), and Advanced (50.0%, N = 2). These differences were not statistically significant, chi-square(3, N = 119) = 4.13, p = .25. Finally, similar to our DORF "how close" analyses, we determined whether inaccurate teacher PACT estimates were over- or underestimates of students' actual PACT levels. Analyses revealed that 62.9% of students were overestimated and 37.1% of students were underestimated.

Teachers' Judgments of Students' Performance Using Indirect Measures

As shown in Table 1, the lowest correlation across each of the measures and predictors was with students' actual WCPM score
and teachers' TRSRP rating (r = .43), which was slightly lower than the correlation between students' actual and estimated percentile rank in reading fluency (rT = .56). However, like the DORF and PACT correlations, correlations between students' actual performance and teachers' estimates using the indirect measures were in the moderate range.

Discussion

Despite important advances in developing weekly or monthly progress monitoring assessment methods that can assist teachers in making data-based instructional decisions, teachers make more frequent and ongoing instructional decisions based on their judgments of students' academic performance (Berliner, 2004), which likely influence data-based decision-making models such as RTI (Gerber, 2005). Further, teachers' judgments are arguably the first step in preventing learning difficulties, and teachers must continue to make accurate judgments about student performance because even commonly used progress monitoring assessment methods (such as CBM-R) have limitations. Although all earlier research was limited by using noncontemporary CBM-R materials, interpretive procedures, and/or benchmark norms, current findings were generally commensurate with previous research. There was a moderate relationship between teachers'
estimated WCPM scores and students' actual scores, which was consistent with previous research (Begeny et al., 2008; Feinberg & Shapiro, 2003). We also found that teachers in our study overestimated students' WCPM scores, judged low- and average-performing readers less accurately than high-performing readers, and showed judgment accuracy that did not differ by individual teacher or teacher grade level, all of which were consistent with the Begeny et al. (2008) study. Further, our study showed that teachers estimated students' PACT categorical score about as well as they judged students' DORF reading level (53.8% vs. 57.5% accurate judgments, respectively), and teachers also tended to overestimate students' PACT scores. However, unlike the significant differences in judgment accuracy that emerged between students of different DORF reading levels, no statistically significant differences in judgment accuracy were found between the four PACT categories. Finding that teachers' PACT judgments were about as accurate as their DORF judgments may be at least partially explained by the close relationship between students' oral reading fluency scores and their end-of-grade test scores (Good et al., 2001; McGlinchey & Hixson, 2004). Teachers' estimates using the TRSRP rating scale and estimates of reading fluency percentile rank were in the moderate range, which was similar to the other teacher judgment methods used (i.e., the DORF and PACT measures). These findings are also commensurate with previous research that used indirect teacher judgment methods, such as rating scales and comparisons to classmates (Begeny et al., 2008; Eckert et al., 2006; Feinberg & Shapiro, 2003). As such, it appears that teachers are no better able to estimate students' actual reading performance with indirect measures such as rating scales and classwide percentile rankings. Overall, findings from this study showed that teachers often made inaccurate judgments about students' actual reading performance, tended to overestimate students' abilities, and were better judges of high-performing versus low- or average-performing readers. One
possible explanation for teachers' inaccurate judgments about students' oral reading fluency may be that teachers do not understand what is meant by reading fluency, and may confuse this with concepts such as reading accuracy (Hamilton & Shinn, 2003). Alternatively, inaccurate teacher judgments may simply stem from misunderstanding how students are judged as low-, average-, or high-performing (Eckert et al., 2006). Although these remain possible explanations for the findings in the present study, our methodology helped to rule out those explanations because the teacher participants in our study (a) were given explicit information about the concepts of reading accuracy and fluency, and (b) were given the specific criteria for students representing low-, average-, and high-performers. Thus, another explanation for low accuracy could be that teachers need more training and practice using oral reading fluency assessments in order to improve judgment accuracy, which is plausible because there is evidence that many teachers receive little preservice training with academic assessment in general (Schafer & Lissitz, 1987), and with CBM in particular (Begeny & Martens, 2006).

Limitations

Although this study extended previous research in unique ways, the present study also has limitations. Most notably, the external validity in this study is somewhat limited because all participants came from one school in the Southeast, and at the time of the study, teacher participants did not have prior experience administering CBM-R or interpreting CBM-R data. The participating school shared characteristics with a large percentage of schools throughout the United States, which suggests potential implications beyond the teachers and students of this particular school, but the external validity is unknown. Like nearly all previous studies in this area, this study may be limited by using teachers as the unit of analysis but analyzing data at the student level. In other words, although our data suggest that each teacher rated her students across the continuum of possible levels,
using multiple estimates by each teacher may violate an assumption about independence (e.g., a teacher may estimate all of her students as being strong readers or all of them as poor readers). Another possible limitation of this study is the use of DIBELS and the associated DORF risk categories. Using DIBELS as a primary measure of teachers’ judgment accuracy offers practical significance because as of 2009, more than 15,000 schools adopted DIBELS as a screening and/or progress monitoring assessment tool (University of Oregon, 2009). However, some have argued that using DIBELS risk categories to screen students’ reading levels may be inadequate for identifying students who are at risk or very at risk (Jenkins, 2003). Other commonly used measures for screening students’ reading skills have important limitations as well, but the potential limitations of DIBELS risk categories should be considered. A fourth potential limitation of this study was our use of two DORF assessments to determine a student’s reading level, rather than using the more traditional median score of three CBM-R assessments. Using only two measures was done (a) out of convenience for teacher and student participants, (b) because we were not using DORF levels for highstakes educational decisions, and (c) because previous research among our research group suggest there are minimal differences at one testing period when comparing an average of two passages versus a median of three passages (Begeny, 2010). Nevertheless, using an average of two passages remains a possible limitation. Potential Implications and Directions for Future Research This study offers potential implications for both research and practice. The current study coincides with earlier research suggesting that correlation coefficients did not provide a complete (or fully accurate) account of teacher-judgment accuracy (Begeny et al., 2008; Eckert et al., 2006). Thus, teacher-judgment accuracy should be measured with mul-
tiple analytic strategies, including percentage of agreements (and subsequent analyses with χ² tests), t tests, effect sizes, and correlation coefficients. Furthermore, teacher estimates of students’ WCPM should carry less interpretive weight than teachers’ judgments about student reading levels because the latter is a more realistic standard to evaluate teacher judgments and likely has a stronger influence on teachers’ instructional decision making. However, evaluating teachers’ estimates of WCPM could help to more fully understand whether teachers tend to over- or underestimate students’ oral reading fluency. Teachers may not make good decisions about some students’ reading instruction because they are uncertain about students’ oral reading fluency abilities, which is potentially important for two primary reasons. First, teacher judgments are important across numerous facets of student learning, including daily instructional decision making and educational placements (Gerber, 2005; Gresham et al., 1997). Second, oral reading fluency has been widely endorsed as a critical component of early reading assessment and instruction for elementary-aged students (Fuchs, Fuchs, Hosp, & Jenkins, 2001; Kame’enui & Simmons, 2001). Thus, teachers who are unable to accurately gauge student reading fluency levels may inadvertently weaken students’ learning opportunities and their overall reading development, which can occur when teachers overestimate or underestimate students’ abilities. The current data also suggest that teachers may have similar difficulty judging how students will perform on high-stakes, end-of-grade reading tests. Similar to the implications for inaccurate teacher judgments of students’ oral reading fluency, inaccurate judgments about students’ potential on high-stakes tests could also compromise student learning opportunities. For example, if a teacher presumes a student will pass an end-of-grade test without needing additional instructional support during the school year, the child’s education would be compromised if, in fact, he needs such support to acquire the skills needed to be successful with that test. Our study revealed
there were a number of students that teachers suspected would perform well on the PACT, but ultimately performed below teachers’ expectations. Of course, additional research is needed to examine teachers’ ability to judge student performance on end-of-grade tests other than the PACT. Because school psychologists offer unique expertise within the schools regarding assessment and how assessment can influence effective intervention, these data suggest that school psychologists may help improve student learning outcomes by offering teachers in-service training about measures of oral reading fluency and by helping them (a) make instructional decisions with oral reading fluency data (particularly in the early grades); (b) further conceptualize the importance of making data-based decisions, more generally; and (c) conceptualize the possible link between periodic data-based decisions and accurate teacher judgments that must occur daily. As has been shown in earlier research, student outcomes are positively associated with teachers who periodically monitor students’ progress with CBM (Fuchs & Fuchs, 1986). However, future research is needed to clarify whether these positive outcomes are associated with (a) specifically using CBM to make decisions; (b) the routine practice of and focus on using data (generally) to make decisions; or (c) a combination of these (and potentially other) factors. In other words, although teachers should be encouraged to use objective data for instructional decision making, they must still make ongoing instructional decisions in the absence of objective data and must make accurate judgments to potentially counteract objective data containing large degrees of measurement error. As such, by using objective data periodically, teachers’ ongoing instructional decisions (based on accurate judgments of student performance) should be more effective. Future research should also address the specific relationship between teacher training, the use of specific assessments, and the instructional decisions made as a result of such use and training. In addition, qualitative research methods in future studies may help 36
researchers and practitioners understand how teachers make judgments about students’ academic abilities, how often they attempt to gauge a student’s abilities (e.g., daily, weekly, monthly), and how these judgments influence teachers’ daily instructional decision making. Footnotes 1 According to Good, Simmons, Kame’enui, Kaminski, & Wallin (2002), DIBELS risk categories were developed using longitudinal predictive information from all participants in the DIBELS Data System for the academic years of 2000 –2001 and 2001–2002. Thus, end-of-year reading outcomes were compared to students’ performance on previous benchmark measures to identify patterns of performance on each measure. Receiver Operator Characteristic curves were computed for each benchmark goal of each measure and used to analyze the measure’s sensitivity and specificity relative to two different outcomes: reading health and reading difficulty. Using this methodology, the authors identified three risk categories: At Risk, Some Risk, and Low Risk. “At Risk” describes “a level of performance where the odds are against achieving subsequent goals” (p. 3). Twenty percent or fewer of students within this category would be expected to achieve their subsequent reading goals; therefore, “substantial” intervention is indicated. “Some Risk” describes the category of students for whom the odds of success on subsequent reading goals is approximately 50%. Therefore, intervention is indicated for students who are identified as being at some risk. Finally, the “Low Risk” category refers to those students who are on grade level and are likely to meet subsequent reading goals. Specifically, 80% or more students with this pattern would be expected to achieve subsequent goals.
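The footnote above describes a cut-score analysis: students’ benchmark scores are compared with a later reading outcome, and sensitivity and specificity are examined across candidate goals using Receiver Operating Characteristic methods. The short sketch below is purely illustrative (hypothetical data and cut score; it is not the DIBELS Data System procedure) and shows how sensitivity, specificity, and an ROC curve could be computed for one screening measure.

```python
# Illustrative sketch only (hypothetical data, not the DIBELS derivation):
# sensitivity and specificity of a candidate benchmark cut score for predicting
# a later reading outcome, plus the full ROC curve across possible cut scores.
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

benchmark = np.array([12, 25, 40, 8, 33, 55, 18, 47, 29, 61])  # fall screening scores (made up)
met_goal = np.array([0, 0, 1, 0, 1, 1, 0, 1, 0, 1])            # 1 = met end-of-year reading goal

cut = 30  # hypothetical candidate benchmark goal
flagged_at_risk = benchmark < cut
sensitivity = np.mean(flagged_at_risk[met_goal == 0])   # at-risk children correctly flagged
specificity = np.mean(~flagged_at_risk[met_goal == 1])  # on-track children correctly not flagged

# Lower benchmark scores indicate higher risk, so the negated score is used as
# the decision variable for the ROC curve.
fpr, tpr, thresholds = roc_curve(met_goal == 0, -benchmark)
auc = roc_auc_score(met_goal == 0, -benchmark)
print(sensitivity, specificity, auc)
```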
References Begeny, J. C. (2010). Comparing an average of two passages versus a median of three passages when interpreting measures of oral reading fluency. Manuscript in preparation. Begeny, J. C., Eckert, T. L., Montarello, S., & Storie, M. S. (2008). Teachers’ perceptions of students’ reading abilities: An examination of the relationship between teachers’ judgments and students’ performance across a continuum of rating methods. School Psychology Quarterly, 23, 43–55. Begeny, J. C., & Martens, B. K. (2006). Assessing preservice teachers’ training in empirically validated behavioral instruction practices. School Psychology Quarterly, 21, 262–285.
Berliner, D. (2004). Describing the behavior and documenting the accomplishments of expert teachers. Bulletin of Science, Technology, & Society, 24, 200 –212. Burns, M. K., & Gibbons, K. A. (2008). Implementing response-to-intervention in elementary and secondary schools: Procedures to assure scientific-based practices. New York: Routledge. Cadwell, J., & Jenkins, J. (1986). Teacher’s judgments about their students: The effects of cognitive simplification strategies on the rating process. American Educational Research Journal, 23, 460 – 475. Christ, T. J., & Silberglitt, B. (2007). Estimates of the standard error of measurement for curriculum-based measures of oral reading fluency. School Psychology Review, 36, 130 –146. Clark, C. M., & Peterson, P. L. (1986). Teachers’ thought processes. In M. C. Wittrock (Ed.), Third handbook of research on teaching (pp. 255–296). New York: Macmillan. Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum Associates. Cohen, J. (1992). A power primer. Psychological Bulletin, 112, 155–159. Eckert, T. L., Dunn, E. K., Codding, R. S., Begeny, J. C., & Kleinmann, A. E. (2006). Assessment of mathematics and reading performance: An examination of the correspondence between direct assessment of student performance and teacher report. Psychology in the Schools, 43, 247–265. Elliott, J., Lee, S. W., & Tollefson, N. (2001). A reliability and validity study of the Dynamic Indicators of Basic Early Literacy Skills—Modified. School Psychology Review, 30, 33– 49. Feinberg, A. B., & Shapiro, E. S. (2003). Accuracy of teacher judgments in predicting oral reading fluency. School Psychology Quarterly, 18, 52– 65. Fuchs, L. S., & Deno, S. L. (1982). Developing goals and objectives for education programs [Teacher’s Guide]. U.S. Department of Education Grant, Institute for Research in Learning Disabilities, University of Minnesota, Minneapolis. Fuchs, L. S., & Fuchs, D. (1986). Effects of systematic formative evaluation: A meta-analysis. Exceptional Children, 53, 199 –208. Fuchs, L. S., Fuchs, D., Hosp, M. K., & Jenkins, J. R. (2001). Oral reading fluency as an indicator of reading competence: A theoretical, empirical, and historical analysis. Scientific Studies of Reading, 5, 239 –256. Gerber, M. M. (2005). Teachers are still the test: Limitations of response to instruction strategies for identifying children with learning disabilities. Journal of Learning Disabilities, 38, 516 –524. Good, T. L., & Brophy, J. R. (1986). School effects. In M. C. Wittrock (Ed.), Handbook of research on teaching (3rd ed., pp. 570 – 602). New York: Macmillan. Good, R. H., & Kaminski, R. A. (Eds.). (2002). Dynamic Indicators of Basic Early Literacy Skills (6th ed.). Eugene, OR: Institute for the Development of Educational Achievement. Available from http://dibels. uoregon.edu/ Good, R. H., Simmons, D., & Kame’enui, E. (2001). The importance and decision-making utility of a continuum, of fluency-based indicators of foundational reading skills for third-grade high-stakes outcomes. Scientific Studies of Reading, 5, 257–288.
Good, R. H., Simmons, D. S., Kame’enui, E. J., Kaminski, R. A., & Wallin, J. (2002). Summary of decision rules for intensive, strategic, and benchmark instructional recommendations in kindergarten through third grade (Technical Report No. 11). Eugene: University of Oregon. Gresham, F. M., MacMillan, D. L., & Bocian, K. M. (1997). Teachers as tests: Differential validity of teacher judgments in identifying students at-risk for learning difficulties. School Psychology Review, 26, 47– 60. Hamilton, C., & Shinn, M. R. (2003). Characteristics of word callers: An investigation of the accuracy of teachers’ judgments of reading comprehension and oral reading skills. School Psychology Review, 32, 228 – 240. Hintze, J. M., Ryan, A. L., & Stoner, G. (2003). Concurrent validity and diagnostic accuracy of the Dynamic Indicators of Basic Early Literacy Skills and the Comprehensive Test of Phonological Processing. School Psychology Review, 32, 541–556. Hoge, R. D., & Coladarci, T. (1989). Teacher-based judgments of academic achievement: A review of the literature. Review of Educational Research, 59, 297–313. Hurwitz, J. T., Elliott, S. N., & Braden, J. P. (2007). The influence of test familiarity and student disability status upon teachers’ judgments of students’ test performance. School Psychology Quarterly, 22, 115–144. Jenkins, J. (2003, December). Candidate measures for screening at-risk students. Paper presented at the National Research Center on Learning Disabilities Responsiveness-to-Intervention Symposium, Kansas City, MO. Kame’enui, E. J., & Simmons, D. C. (2001). Introduction to this special issue: The DNA of reading fluency. Scientific Studies of Reading, 5, 203–210. McGlinchey, M. T., & Hixson, M. D. (2004). Using curriculum-based measurement to predict performance on state assessments in reading. School Psychology Review, 33, 193–203. McGraw-Hill Education. (2002). Open court reading. New York: Author. Meisinger, E. B., Bradley, B. A., Schwanenflugel, P. J., Kuhn, M. R., & Morris, R. D. (2009). Myth and reality of the word caller: The relation between teacher nominations and prevalence among elementary school children. School Psychology Quarterly, 24, 147–159. Schafer, W. D., & Lissitz, R. W. (1987). Measurement training for school personnel: Recommendations and reality. Journal of Teacher Education, 38, 57– 63. Sharpley, C. F., & Edgar, E. (1986). Teachers’ ratings vs standardized tests: An empirical investigation of agreement between two indices of achievement. Psychology in the Schools, 23, 106 –111. Tenenbaum, I. M. (2003). Technical documentation for the 2003 Palmetto Achievement Challenge Tests of English language arts, mathematics, science, and social studies. Retrieved April 30, 2010, from http:// ed.sc.gov/agency/Accountability/Assessment/old/ assessment/publications/documents/PACT-Tdoc03. pdf University of Oregon. (2009). DIBELS data system brochure. Available from https://dibels.uoregon.edu/ resources/data_system/brochure/dds_brochure_color_ low.pdf 37
VanDerHeyden, A. M., Witt, J. C., & Naquin, G. (2003). The development and validation of a process for screening and referrals to special education. School Psychology Review, 32, 204 –227. Ysseldyke, J., & Algozzine, R. (1983). On making psychoeducational decisions. Journal of Psychoeducational Assessment, 1, 187–195.
Date Received: October 19, 2009. Date Accepted: October 28, 2010. Action Editor: Matthew K. Burns. Article was accepted by previous Editor.
John C. Begeny is an assistant professor at North Carolina State University, and his current research examines methods to improve children’s reading abilities, strategies to narrow the gap between research and practice, and international education. He has received several grants for his teaching and research activities, including grants to improve literacy development for children living in low-income communities nationally and internationally. As part of The Guilford Press School Practitioner Series, he is currently writing a book intended to help educators use academic consultation in schools. Hailey E. Krouse is a school psychology doctoral student at North Carolina State University. Her primary research involves evaluating the reliability and validity of the Wechsler Intelligence Scale for Children—Fourth Edition with deaf and hard-of-hearing children. However, she also has interest in reading research, particularly research that evaluates fluency-based reading interventions for young children. Kristina G. Brown is a part-time faculty member at Georgia Gwinnett College in Lawrenceville, Georgia. She received her master’s degree from Wake Forest University and her doctorate from North Carolina State University. Her research interests include the assessment of and intervention with early academic skills. Courtney M. Mann is an early childhood intervention and literacy MA student at the University of North Carolina—Chapel Hill. Her primary research interests are in the development of literacy interventions and teaching strategies that can be effectively used in the classroom setting with struggling and nonstruggling readers, especially diagnostically driven and linguistically based approaches. Other research interests include biological and cognitive aspects of language acquisition and reading.
School Psychology Review, 2011, Volume 40, No. 1, pp. 39 –56
Behavior Problems in Learning Activities and Social Interactions in Head Start Classrooms and Early Reading, Mathematics, and Approaches to Learning Rebecca J. Bulotsky-Shearer and Veronica Fernandez University of Miami Ximena Dominguez SRI International Heather L. Rouse University of Pennsylvania Abstract. Relations between early problem behavior in preschool classrooms and a comprehensive set of school readiness outcomes were examined for a stratified random sample (N ⫽ 256) of 4-year-old children enrolled in a large, urban school district Head Start program. A series of multilevel models examined the unique contribution of early problem behavior in structured learning activities, peer interactions, and teacher interactions to reading, mathematics, and approaches to learning at the end of the year, accounting for child demographic variables (child age, sex, and ethnicity). Early problem behavior in structured learning activities consistently predicted lower academic outcomes (early reading and mathematics ability) as well as lower motivation, attention, and persistence in academically focused tasks. Early problem behavior in peer situations predicted lower attitude toward learning, reflecting children’s difficulties self-regulating and engaging appropriately in socially mediated classroom learning activities. Implications for intervention within early childhood educational programs serving low-income children are discussed.
A growing body of early childhood research provides empirical evidence that preschool problem behavior negatively influences school readiness in multiple domains (Bowman, Donovan, & Burns, 2001; Denham, 2006; Raver, 2002; Thompson & Raikes, 2007). Prevalence estimates in urban early childhood educational programs suggest that
as many as 30% of children exhibit moderate to clinically significant emotional and behavioral needs (Barbarin, 2007; Feil et al., 2005; Qi & Kaiser, 2003). Unfortunately, programmatic resources to address children’s needs are scarce. Referrals for psychological evaluations through early intervention often take many months and access to individual psychological
Correspondence regarding this article should be addressed to Rebecca J. Bulotsky-Shearer, University of Miami, Department of Psychology, Child Division, 5665 Ponce de Leon Blvd., Coral Gables, FL 33146-0751; e-mail:
[email protected]
service providers who work with young children is limited (Cooper et al., 2008). For children living in urban poverty, it is essential that problem behavior is identified early when classroom-based interventions can be most effective (Bowman et al., 2001; Klein & Knitzer, 2007). Logically, early intervention efforts are dependent upon the availability of psychometrically sound and developmentally appropriate measurement tools for diverse low-income populations (Nuttall, Romero, & Kalesnik, 1999; U.S. Department of Health and Human Services, 2001). In large early childhood programs, teacher or parent rating scales are often the most efficient and cost-effective mechanisms for identifying children in need of intervention (McDermott, 1993). However, the validity of data from parent or teacher rating scales with low-income minority preschool populations has been called into question (Lopez, Tarullo, Forness, & Boyce, 2000; U.S. Department of Health and Human Services, 2001). The most commonly available measures identify problem behavior via checklists of psychiatric symptoms that identify the type of internalizing or externalizing problem (e.g., Reynolds & Kamphaus, 2002; Achenbach, 1991). Empirical studies suggest that when asked to use these measures early childhood educators underreport problem behavior to avoid stigmatizing children with labels that are not linked to classroom-based services (Lutz, 1999; Mallory & Kearns, 1988; Piotrkowski, Collins, Knitzer, & Robinson, 1994). In addition, checklist measures have been criticized because they (a) require teachers to infer children’s internal thoughts or feelings, and (b) do not consider the classroom context within which behavior problems occur (Fantuzzo & Mohr, 2000; Friedman & Wachs, 1999; McDermott, 1993). Understanding where problem behaviors are the most challenging to children within daily classroom learning activities and social interactions is critical to inform developmentally appropriate classroom interventions that can reach diverse, low-income children (Cooper et al., 2008; Klein & Knitzer, 2007; Meisels, 1997). 40
A contextual assessment approach is needed to identify where problem behavior occurs within the preschool classroom, and to examine the influence of problems within classroom contexts on academic and learningrelated skills recognized as important dimensions of school readiness (Kagan, Moore, & Bredekamp, 1995). Below, we present a developmental and ecological model that guides our research and then summarize early childhood research that examines the relationship between preschool problem behavior, and academic and learning-related readiness skills (Kagan et al., 1995). In addition, we critique the extant literature and provide a rationale for a more contextually relevant approach to examine classroom problem behavior and its effects on the school readiness of diverse lowincome children. Developmental and Ecological Systems Framework A developmental ecological model provides a conceptual framework for understanding the preschool classroom as a unique developmental setting and its dynamic proximal influence on children’s behavior (Bronfenbrenner & Morris, 1998). This model suggests that in order to understand early problem behavior, assessments must consider the proximal contexts within which problem behavior occurs (Friedman & Wachs, 1999; Kontos & Keyes, 1999; McDermott, 1993), which include interactions among children, teachers, and instructional materials that serve as the primary mechanism for children’s learning (Pianta, 2006). Moreover, preschool learning opportunities contain distinct developmental demands that require complex behavior and increase the likelihood of behavior problems (Kontos & Keyes, 1999). Specific social and emotional skills are required to navigate each type of preschool learning activity or social interaction, and the skills needed vary across activity setting (Kontos & Keyes, 1999). For example, during structured learning activities such as circle time children must be able to self-regulate, inhibit verbal and motor activity, listen carefully, and pay attention; in free play,
another repertoire of skills is required to initiate and maintain cooperative peer interactions. When the demands of learning situations do not match children’s self-regulation, attention, cognitive skills, or motivation, behavior problems may occur (Goldstein, 1995; McEvoy & Welker, 2000). Preschool Problem Behavior and School Readiness Both early reading and mathematics ability have been recognized as important cognitive readiness skills for preschool children and as predictive of future academic success (Duncan et al., 2007; Kagan et al., 1995). Unfortunately, early childhood research substantiates the negative association between problem behavior and reading ability in preschool (Dominguez & Greenfield, 2009; Fantuzzo, Bulotsky, McDermott, Mosca, & Lutz, 2003; Harden et al., 2000; Lonigan et al., 1999), kindergarten (Vaughn, Hogan, Lancelotta, Shapiro & Walker, 1992; Ready, LoGerfo, Burkham, & Lee, 2005; Spira & Fischel, 2005), and first grade (Bub, McCartney, & Willett, 2007; McWayne & Cheung, 2009). Externalizing problem behavior such as aggressive or inattentive behavior has been associated with reading delays (Campbell, Shaw, & Gilliom, 2000), language deficits (Arnold, 1997; Stevenson, Richman, & Graham, 1985), and poor literacy skills (Dominguez & Greenfield, 2009). Although less research has focused on internalizing problem behavior, a recent study found that socially reticent and withdrawn behavior in the Head Start classroom were negatively associated with children’s expressive and receptive vocabulary skills at the end of the year (Fantuzzo, Bulotsky et al., 2003). Mathematics ability is also recognized as important to kindergarten readiness, but very few studies have examined the association between preschool problem behavior and mathematics skills. Three recent studies conducted in Head Start classrooms provide evidence for the negative influence of preschool problem behavior on mathematics outcomes. In a cross-sectional study, Dobbs, Doctoroff,
Fisher, and Arnold (2006) found that teacher-rated problem behavior, internalizing symptoms, and attention problems predicted lower mathematics skills. In a predictive study, Dominguez and Greenfield (2009) found that teacher-rated behavioral concerns predicted lower teacher-reported mathematics skills at the end of the year, and Fantuzzo et al. (2007) found that early academically disengaged problem behavior in the Head Start classroom predicted lower mathematics ability at the end of the preschool year. This research provides evidence for the negative association between types of preschool problem behavior and early reading and mathematics ability. However, the research does not provide any information about the most behaviorally challenging classroom situations, nor does it provide information about the effects of problems occurring within these situations on children’s readiness skills. Approaches toward learning, another important dimension of school readiness for preschool children, has been identified as one of the most critical yet least understood readiness domains (Kagan et al., 1995). Approaches to learning reflect “how” children engage in learning and their enthusiasm for learning (Hyson, 2008). These include children’s initiative and curiosity, engagement and persistence, and reasoning and problem-solving skills (McDermott, Green, Francis, & Stott, 2000). Few studies have examined the influence of preschool problem behavior on children’s approaches to learning, particularly for low-income children. However, two recent Head Start studies provide evidence for a negative relationship between the two constructs (Dominguez & Greenfield, 2009; Fantuzzo, Bulotsky-Shearer, Fusco, & McWayne, 2005). Dominguez and Greenfield (2009) found that teacher-reported behavioral concerns predicted lower global approaches to learning outcomes. Examining multiple dimensions of learning behaviors, Fantuzzo et al. (2005) found that early aggressive problem behavior differentially predicted lower attitude toward learning, and inattentive problems predicted lower competence motivation, attention, and persistence during learning tasks. Although
this research provides evidence for the relations between types of problem behavior and children’s approaches to learning, it does not provide information about the effects of problem behavior within the context of classroom social or learning situations. Multisituational Assessment Approach McDermott (1993) developed an alternative approach to studying children’s classroom problem behavior that was sensitive to the learning and social demands of the classroom context, and was recently adapted for use within early childhood classrooms (Lutz, Fantuzzo, & McDermott, 2002). The Adjustment Scales for Preschool Intervention (ASPI) is a multisituational assessment of problem behavior occurring within structured learning, teacher interactions, and peer interactions within the classroom context (BulotskyShearer, Fantuzzo, & McDermott, 2008). Problems in structured learning includes behavior problems within the context of academic learning activities, both teacher-initiated learning situations (e.g., sitting during teacher-directed activities, involvement in class activities, paying attention in class) and peer-mediated learning situations (e.g., taking part in games with others, free play/individual choice). Problems in peer interactions consist of peer situations (e.g., getting along with agemates, behaving in the classroom, and standing in line). Problems in teacher interaction includes situations such as talking to teacher, answering teacher questions, greeting teacher, seeking teacher help, and helping teacher with jobs. Initial research provided evidence for the unique and differential effects of problem behavior within structured learning, teacher, and peer interactions on social and academic outcomes in preschool, above and beyond traditional ratings of externalizing and internalizing behavior (Bulotsky-Shearer et al., 2008). A clear pattern emerged where problem behavior during structured learning situations predicted lower cognitive readiness outcomes (literacy and mathematics skills), whereas 42
problem behavior in peer interactions predicted lower social competence (interactive peer play). This was the first study to examine the unique contribution of situational problem behavior to school readiness outcomes. However, the study had several limitations. Findings were confounded by source invariance because the teacher reported on both problem behavior and children’s readiness outcomes. Language and mathematics readiness skills were assessed using a global measure of cognitive skills. In addition, the statistical effects of children nested within classrooms were not accounted for in the regression models. Further research is needed to extend this initial study and examine the differential effects of situational problem behavior on a more comprehensive set of school readiness skills, using multiple measurement methods and sources, and employing a multilevel analytic model. Study Purpose The purpose of the present study was to address limitations in previous research and to extend initial research in several important ways. This study sought to examine the differential relations among early preschool emotional and behavioral problems within classroom situations, independent assessments of early reading and mathematics ability, and teacher ratings of approaches to learning at the end of the year. In addition, multilevel modeling was employed to examine these differential relations for a representative sample of urban Head Start children. Based on previous research, three hypotheses were generated: (a) first, that early problem behavior in structured learning situations would predict lower academic outcomes (literacy and mathematics ability); (b) second, that early problem behavior in structured learning would predict lower behavioral engagement in learning tasks (competence motivation, attention, and persistence); and (c) third, that early problem behavior in peer interactions would predict lower socially mediated learning behaviors (attitude toward learning).
Method Participants A stratified random sample of 257 Head Start children from a large urban school district program in the Northeast participated in this study. Children in this sample consisted of 4-year-old children targeted to go on to kindergarten in the fall. In the fall of the Head Start year, children’s ages ranged from 4.05 to 5.12 years (M ⫽ 4.65, SD ⫽ 0.30). Sex was split evenly, with girls comprising 49% of the sample. The children were predominantly African American (69%), with 28% Latino, 4% Caucasian, and 1% Asian. The participants were predominantly low income, with annual income for 93% of the program’s families below $15,000. Children in the sample were enrolled in 20 schools, 78 classrooms, across the nine geographic regions. Program demographic information indicated that all teachers were credentialed in early childhood education and had at least a bachelor’s degree. The majority (61%) had experience teaching in Head Start for at least 5 years. Teachers were predominantly Caucasian (62%) with 29% African American, 3% Latino, 1% Asian and 5% other. Measures Classroom situational problem behavior. The ASPI (Bulotsky-Shearer et al., 2008; Lutz et al., 2002) was used to assess emotional and behavioral problems within routine preschool classroom situations at the beginning of the Head Start year. The ASPI is a 144-item multidimensional instrument based on teacher observations of adaptive and maladaptive behavior across 22 routine, preschool classroom situations and 2 categories of nonsituationally specific behavior problems (e.g., unusual habits or outbursts; Lutz et al., 2002). The scale’s behavioral items reflect both problem behavior (122 items) as well as more adaptive behavior (22 items) within the context of interactions with the teacher, relationships with peers, involvement in structured and unstructured classroom activities, and
games and play. Teachers complete the scale by endorsing as many behaviors as apply in each of the 22 classroom situations. For example, for the situation “How does the child greet you as the teacher?” the teacher endorses as many of the following child behaviors that apply: “Greets as most other students do,” “Waits for you to greet him/her first,” “Does not greet you even after you greet him/her,” “Seems too unconcerned about people to greet,” “Welcomes you loudly,” “Responds with an angry look or turns away,” “Clings to you.” The ASPI was standardized on a sample of urban Head Start children and validated for use with this population. The scale was developed in partnership with Head Start teachers, special needs coordinators, and parents who created the scale content to ensure its developmental appropriateness for preschool children and scripted the items in the parlance of early childhood educators (rather than that of clinical psychologists). Construct validity studies of the ASPI with urban, low-income preschool populations have revealed three situational dimensions: Problems in Structured Learning, Peer Interactions, and Teacher Interactions (Bulotsky-Shearer et al., 2008). The three situational dimensions demonstrated adequate internal consistencies, with Cronbach alpha coefficients of .84, .81, and .75 (Problems in Structured Learning, Peer Interactions, and Teacher Interactions, respectively) and have been found to be replicable and generalizable to important subgroups of the standardization sample (i.e., younger and older children, boys and girls, African American, Latino, and Caucasian ethnicities). Convergent and divergent validity of the three ASPI situational dimensions was established with constructs of interactive peer play and classroom learning competence in preschool (Bulotsky-Shearer et al., 2008), approaches to learning in preschool (Domínguez, Vitiello, Fuccillo, Greenfield, & Bulotsky-Shearer, 2011), interactive peer play in kindergarten (Bulotsky-Shearer, Dominguez, Bell, Rouse, & Fantuzzo, 2010), and language and literacy achievement in kindergarten and first grade (Bulotsky-Shearer & Fantuzzo, 43
2011). Correlations between situational dimensions and outcomes ranged from .16 to .63 with high positive associations between problems in peer interactions and socially disruptive behavior; problems in structured learning and disconnected behavior; and negative associations between problems in structured learning and cognitive skills such as language and literacy (Bulotsky-Shearer et al., 2008; Dominguez et al., 2011). For the present study, ASPI T scores based on an area conversion derived from the normative sample of urban Head Start children from the Northeast (N ⫽ 829) were used (Lutz et al., 2002).
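The ASPI dimension scores described above are reported as T scores derived through an area conversion based on the normative sample. As a rough illustration of what a normalized (area) conversion involves, and not the scale authors’ exact procedure, the sketch below maps a raw dimension score to a percentile rank in a normative sample and then to a T score (M = 50, SD = 10) through the inverse normal distribution; norm_scores and raw are hypothetical inputs.

```python
# Minimal sketch of an area (normalized) T-score conversion; not the ASPI
# authors' exact procedure. `norm_scores` stands in for the normative sample's
# raw scores, and `raw` is one child's raw dimension score.
import numpy as np
from scipy.stats import norm, percentileofscore

def area_t_score(raw, norm_scores):
    # Percentile rank of the raw score within the normative distribution
    pct = percentileofscore(norm_scores, raw, kind="mean") / 100.0
    pct = np.clip(pct, 0.001, 0.999)    # avoid infinite z at the extremes
    return 50.0 + 10.0 * norm.ppf(pct)  # normalized T score (M = 50, SD = 10)

# Example: a raw score falling at roughly the 84th percentile of the norms
# maps to a T score of about 60.
```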
Early reading ability. The Test of Early Reading Ability, third edition (TERA-3; Reid, Hresko, & Hammill, 2001) was used to assess early reading ability at the end of the Head Start year. The TERA-3 is a nationally standardized individually administered test of reading ability for children 3 years, 6 months, to 8 years, 6 months. It measures alphabet knowledge, knowledge of conventions of print, and construction of meaning from print. In addition to these three subscale scores, an overall standardized composite score can be derived (Reading Quotient). The normative sample was based on a stratified national sample (N = 875) configured on the U.S. Census. Substantial reliability exists with the Cronbach’s alpha coefficient of .95. Validity is supported through correlations with established measures of academic achievement and cognitive ability. For the present study, the overall composite Reading Quotient (M = 100, SD = 15) was used as the most reliable indicator of children’s reading ability.

Early mathematics ability. The Test of Early Mathematics Ability, second edition (TEMA-2; Ginsburg & Baroody, 1990) was used to assess early mathematics skills at the end of the Head Start year. The TEMA-2 is a 65-item, individually administered, nationally normed assessment of informal early mathematics (concepts of relative magnitude, counting, calculation with objects present) and formal mathematics (reading and writing numbers, number facts, calculation in symbolic form) for children 3 years to 8 years, 11 months of age. The TEMA-2 was normed on a nationally representative sample of 896 children across 27 states. Internal consistencies were high across all age ranges, as was test–retest reliability (Ginsburg & Baroody, 1990). Criterion validity was established through correlations with standardized scores on the TEMA, Diagnostic Achievement Battery (Newcomer, 2001), and Quick Score Achievement Test (Hammill, Ammer, Cronin, Mandlebaum, & Quinby, 1987). For this study, the overall composite Mathematics Quotient (M = 100, SD = 15) was used.

Approaches toward learning. The Preschool Learning Behavior Scale (PLBS; McDermott, Leigh, & Perry, 2002) was used to measure children’s learning-related behaviors at the end of the Head Start year (McDermott et al., 2000). The PLBS is a 29-item nationally standardized teacher-completed rating scale of readily observable learning behaviors within the classroom. This measure was developed in collaboration with Head Start teachers and staff. Three reliable and valid dimensions have been derived: Competence Motivation, Attention/Persistence, and Attitude Toward Learning (with Cronbach’s alpha coefficients of .85, .83, and .75, respectively). Sample items include “Says task is too hard without making much effort to accept it,” “Accepts new tasks without fear or resistance,” and “Responds without taking sufficient time to look at the problem or work out a solution.” Teachers rate the child’s behavior on a Likert scale: most often applies, sometimes applies, or doesn’t apply. The Competence Motivation scale assesses children’s willingness to take on tasks and their determination to complete activities successfully. The Attention/Persistence dimension measures the degree to which children pay attention and are able to persist with difficult tasks. The Attitude Toward Learning dimension focuses on such concepts as children’s willingness to be helped, desire to please the teacher, and ability to cope when frustrated. Convergent and divergent validity has been established for urban, low-income preschool children with di-
rect assessments of cognitive ability, receptive and expressive vocabulary skills, teacher-rated social skills, teacher- and parent-rated interactive peer play competencies, and direct observations of children’s classroom self-regulation (Fantuzzo, Perry, & McDermott, 2004; McDermott et al., 2002). For the present study, PLBS T scores (M ⫽ 50, SD ⫽ 10) based on the national normative sample were used (McDermott et al., 2002). Procedures Sampling. A stratified, random sample of 257 children was drawn for the purposes of the study. Children in this sample consisted of 4-year-old children targeted to go on to kindergarten in the fall. Children within classrooms were stratified to be demographically and geographically representative of the school district’s nine geographic regions. During that academic year, the prekindergarten Head Start program served a total 4,539 children across 73 centers and nine geographic regions. Data collection. Data collection involved: (a) administrative data including child, family, and teacher demographic information routinely collected by the district’s Head Start program; (b) Head Start teacher assessments of classroom situational behavior problems (ASPI) collected programmatically at the beginning of the preschool year; (c) Head Start teacher assessments of preschool learning behaviors (PLBS) at the end of the preschool year; and (d) individually administered direct assessments of children’s language (TERA-3) and mathematics ability (TEMA-2) at the end of the preschool year. Consent for children’s participation was obtained from parents as part of a larger collaborative university research partnership project with an urban public school district Head Start program in the Northeast. Approval for the research activities was obtained from the Director of the Head Start program and from the Head Start Policy Council. Approval from the University Institutional Review Board was obtained prior to initiating data collection. Program administrative data
were prepared in cooperation with the School District’s Office of Research and Evaluation and the Head Start program. Before archival data were obtained, a confidentiality agreement was signed to ensure the confidentiality of all identifying information. Data were linked by school district personnel using students’ unique district identification numbers. Once the files were integrated, any identifiers were stripped from the files to protect the confidentiality of participants before proceeding with data analyses. The ASPI is systematically collected twice a year (within the first 45 days in the fall and in mid-May at the end of the Head Start year) as part of a federal Head Start assessment requirement (Performance Standard, 1304.20; U.S. Department of Health and Human Services, 1996). The study’s principal investigator obtained permission from the school district administration to use these administrative records and integrate them as described above. In the spring of the Head Start year, teachers were contacted to elicit participation in the study. Prior to data collection, research team members met with teachers individually to explain the purpose of the study and to clarify issues of confidentiality, informed consent, and data collection procedures. Packets including the PLBS were distributed to teachers individually who completed the PLBS and a demographic questionnaire. Concurrently, a team of master’s-level psychology or education graduate students were hired and trained to conduct individually administered direct assessments (TERA-3 and TEMA-2). Children who were randomly sampled to participate were assessed individually outside of the Head Start classroom in a quiet place following a brief “warm-up” period. Results Descriptive Statistics To ensure that data were normally distributed, all variables of interest were examined for outliers, homoscedasticity, and kurtosis. Table 1 presents descriptive statistics and Table 2 presents the bivariate correlation matrix. Low to moderate associations were found 45
Table 1
Descriptive Statistics for Sample (N = 257)

Measure                                                N     Mean     SD
Fall
  Child age (in years)                                257     4.65   0.30
  Problems in Peer Interactions (ASPI)                220    50.36  10.05
  Problems in Structured Learning Situations (ASPI)   220    46.88   9.41
  Problems in Teacher Interactions (ASPI)             220    48.78  10.25
Spring
  Attitude (PLBS)                                     230    54.11   8.94
  Motivation (PLBS)                                   230    52.53   9.15
  Persistence (PLBS)                                  230    53.27   9.71
  TEMA-2                                              253    90.26  13.35
  TERA-3                                              252    87.94  12.94

Note. ASPI = Adjustment Scale for Preschool Intervention; PLBS = Preschool Learning Behavior Scale; TEMA-2 = Test of Early Mathematics Ability: 2nd Edition; TERA-3 = Test of Early Reading Ability: 3rd Edition. Scores for the ASPI and PLBS represent T scores (M = 50, SD = 10). Scores for the TEMA and TERA represent quotient scores (M = 100, SD = 15) based on their respective national standardization samples.
between the three ASPI situational problems and learning behaviors (ranging from −.26 to −.48) and between ASPI problems in structured learning and mathematics ability (r = −.15, p < .05). Multilevel Modeling Results To examine the unique relationship between the three ASPI situational problems assessed by the teacher at the beginning of the preschool year and the three school readiness outcomes, a series of multilevel models were tested to account for the hierarchical structure of the data (children nested within classrooms). A series of two-level models were analyzed using hierarchical linear modeling (HLM Version 6.01a; Raudenbush, Bryk, Cheong, & Congdon, 2004). Separate models were constructed for early reading ability (TERA-3), for early mathematics ability
(TEMA-2), and for each of the three approaches to learning dimensions (PLBS). The first set of models specified were fully unconditional models in order to determine the distribution of variance in each of the outcomes attributable to Level 1 (variability from differences between children within classrooms) and Level 2 (variability from differences between classrooms). Once it was established that there was substantial variability to be explained at each level, variables were entered in a series of steps. First, child demographic covariates (age, sex, and ethnicity) were entered at Level 1 to determine the proportion of child-level variance explained in outcomes. Child age was entered as a demographic covariate in approaches to learning outcome models; however, because the TEMA and TERA overall mathematics and reading quotients were age-normed, child age was not entered as a covariate in these models. Second, the three ASPI situational problem dimensions (behavior problems in structured learning, teacher and peer interactions) were entered at Level 1 to examine the additional proportion of variance explained in children’s outcomes by the ASPI above and beyond demographic covariates. All variables, including dummy-coded demographic variables, were centered at the group mean (Enders & Tofighi, 2007). To determine the relative contribution of each set of variables to children’s readiness outcomes, the percent of incremental variance explained by child demographic variables and ASPI situational problem behavior was examined. To determine the direction and strength of the effects of each predictor, the fixed effects (unstandardized regression coefficients) from the multilevel models were examined. Situational problem behavior and early reading ability. To determine the unique contribution of ASPI situational problems assessed at the beginning of the preschool year to children’s reading ability, a multilevel model tested the relationship between the three ASPI situational dimensions and the TERA-3 reading quotient assessed at the end of the year. The final multilevel model consisted of two levels (the child level and the
Table 2
Bivariate Correlations Among Child-Level Measures

Measure                                      1       2       3       4       5       6       7       8
1. Fall Problems in Peer Interaction         —     .58**   .50**  −.48**  −.26**  −.43**  −.04     .03
2. Fall Problems in Structured Learning              —     .62**  −.42**  −.37**  −.48**  −.15*   −.11
3. Fall Problems in Teacher Interaction                      —    −.39**  −.39**  −.39**  −.06    −.06
4. Attitude                                                          —     .57**   .73**   .18**   .07
5. Motivation                                                                —     .73**   .23**   .20**
6. Persistence                                                                       —     .23**   .18**
7. Early Math                                                                                —     .60**
8. Early Reading                                                                                     —

*p < .05. **p < .01.
classroom level). In the Level 1 equation (Equation 1), the reading ability score (Y) for a child (i) who is in a classroom (j) is a function of the intercept (β0j; the estimated classroom mean score) after adjusting for demographic covariates (β1j, β2j, β3j, and β4j), problem behavior scores (β5j, β6j, and β7j), and the error term associated with this estimated mean (rij).

Level 1: Yij = β0j + β1j(Sex) + β2j(Black) + β3j(Hispanic) + β4j(Other) + β5j(Problems in Peer Interaction) + β6j(Problems in Structured Learning) + β7j(Problems in Teacher Interaction) + rij    (1)

In the Level 2 equation (Equation 2), the adjusted outcome mean score for children in each classroom (β0j) is a function of the grand mean score (γ00) and the error term associated with this estimated mean (u0j).

Level 2: β0j = γ00 + u0j    (2)
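The authors fit these models in HLM 6. As a minimal illustration of the same two-level structure (not the authors’ code), the sketch below uses Python’s statsmodels with hypothetical column names (classroom, terra, and group-mean-centered predictors). It fits the fully unconditional model, computes the intraclass correlation underlying the variance-partitioning statements that follow, and then fits a random-intercept version of the conditional model in Equation 1.

```python
# Minimal sketch (not the authors' software or code): two-level models of the
# kind described above, fit with statsmodels. Column names (classroom, terra,
# sex, black, hispanic, other, peer, struct, teach) are hypothetical stand-ins.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("headstart.csv")  # hypothetical long-format file, one row per child

# Group-mean center the child-level predictors within classrooms, as described
# in the text (Enders & Tofighi, 2007).
for col in ["sex", "black", "hispanic", "other", "peer", "struct", "teach"]:
    df[col + "_c"] = df[col] - df.groupby("classroom")[col].transform("mean")

# Fully unconditional model: partitions outcome variance into between-classroom
# (random intercept) and within-classroom (residual) components.
m0 = smf.mixedlm("terra ~ 1", data=df, groups=df["classroom"]).fit()
tau00 = float(m0.cov_re.iloc[0, 0])  # between-classroom variance
sigma2 = m0.scale                    # within-classroom (Level 1) variance
icc = tau00 / (tau00 + sigma2)       # proportion of variance between classrooms
print(f"ICC = {icc:.3f}")

# Conditional model: demographic covariates and the three ASPI dimensions
# predict the reading quotient; the intercept varies randomly across classrooms.
m1 = smf.mixedlm(
    "terra ~ sex_c + black_c + hispanic_c + other_c + peer_c + struct_c + teach_c",
    data=df, groups=df["classroom"],
).fit()
print(m1.summary())
```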
Results from the unconditional model indicated that a significant proportion of variance in the TERA Reading Quotient was at-
tributable to differences between classrooms (8%) with the remaining 92% attributable to child-level differences. Child demographic covariates (gender and ethnicity), as a set, accounted for 1% of the child level variance in early reading ability. The situational dimensions, as a set, accounted for an additional 4% of the child level variance in early reading ability. Table 3 presents the results for the final models. Unstandardized regression coefficients (B), degrees of freedom (df), t ratio, and p values indicate the direction and magnitude of the associations between child-level demographic variables, situational problem behavior, and children’s reading ability. The direction and strength of the regression coefficients indicated that early problem behavior in structured learning situations predicted lower reading ability scores at the end of the year. Situational problem behavior and early mathematics ability. To determine the unique contribution of ASPI situational problems assessed at the beginning of the preschool year to children’s mathematics ability, a multilevel model tested the relationship between the three ASPI situational dimensions and the TEMA-2 mathematics quotient assessed at the end of the year. The final model was identical to the one estimated for early reading ability, so the equation is not repeated here. Results from the unconditional model 47
Table 3
Relationship Between Preschool Situational Problem Behavior Dimensions and Early Mathematics and Reading Ability

                                                Early Mathematics Ability        Early Reading Ability
Fixed Effects                                   Coefficient   df    t Ratio      Coefficient   df    t Ratio
Intercept (β0j)                                 91.30**       74    104.44       88.69**       74    95.11
Sex (β1j)                                       2.90          208   1.56         −0.75         207   −0.40
Black (β2j)                                     −2.57         208   −0.41        −2.36         207   −0.38
Hispanic (β3j)                                  −5.95         208   −0.78        −9.67         207   −1.28
Other (β4j)                                     −1.20         208   0.14         −3.60         207   −0.42
Fall Problems in Peer Interaction (β5j)         0.20          74    1.25         0.24          207   1.83
Fall Problems in Structured Learning (β6j)      −0.45*        74    −2.51        −0.36*        207   −2.17
Fall Problems in Teacher Interaction (β7j)      0.10          74    0.58         −0.14         207   −0.84

Random Effects                                  Variance Component   df   χ²     Variance Component   df   χ²
Intercept (γ00)                                 10.96                13   18.16  12.85                74   90.88
Fall Problems in Peer Interaction (u1j)         0.32**               13   30.69  n/a                  n/a  n/a
Fall Problems in Structured Learning (u2j)      0.17**               13   37.57  n/a                  n/a  n/a
Fall Problems in Teacher Interaction (u3j)      0.04*                13   26.16  n/a                  n/a  n/a
Level-1 Effects (rij)                           130.73               —    —      146.74               —    —

Note. n/a = not applicable because these variance components were fixed and not allowed to vary randomly in this model.
*p < .05. **p < .01.
indicated that a small proportion of the variance in the TEMA Mathematics Quotient was attributable to differences between classrooms (4.4%) and 96% was attributable to child-level differences. Multilevel modeling was still considered the most appropriate analytic approach because the percent of variance attributable to between classroom differences was close to 5% (Raudenbush & Bryk, 2002). Child demographic covariates (gender and ethnicity) accounted for 1% of the child-level variance in early mathematics ability. The situational dimensions, as a set, accounted for an additional 24% of the child-level variance in early mathematics ability. Table 3 presents the results for the final models. Unstandardized regression coefficients (B), degrees of freedom (df), t ratio, and p values indicate the direction and magnitude of the associations between childlevel demographic variables, situational prob48
lem behavior, and children’s mathematics outcomes. The regression coefficients (Table 3) indicated that early problem behavior in structured learning activities was associated with lower mathematics ability scores at the end of the year. Situational problem behavior and approaches to learning. To determine the unique contribution of ASPI situational problems to children’s approaches to learning, a set of multilevel models tested the relationship between the three ASPI situational dimensions and the three PLBS dimensions (competence motivation, attention/persistence, and attitude toward learning). A separate model was estimated for each of the three PLBS outcomes. Final models were identical to the one estimated for early reading and mathematics ability, except that age was also included as a
Table 4
Relationship Between Preschool Situational Problem Behavior Dimensions and Approaches to Learning

                                                Competence Motivation          Attention/Persistence          Attitude Toward Learning
Fixed Effects                                   Coefficient   df   t Ratio     Coefficient   df   t Ratio     Coefficient   df   t Ratio
Intercept (β0j)                                 52.71**       67   67.68       53.60**       67   72.95       54.27**       67   68.11
Age (β1j)                                       2.65          187  1.14        1.17          187  0.50        0.10          187  0.05
Sex (β2j)                                       −1.97         187  −1.67       −1.92         187  −1.62       −0.99         187  −0.97
Black (β3j)                                     7.16          187  1.67        4.03          187  0.91        5.02          187  1.31
Hispanic (β4j)                                  6.85          187  1.35        4.01          187  0.77        3.39          187  0.76
Other (β5j)                                     12.46*        187  2.06        10.32         187  1.69        10.82*        187  2.04
Fall Problems in Peer Interaction (β6j)         0.07          187  0.86        −0.16         67   −1.74       −0.29**       67   −3.26
Fall Problems in Structured Learning (β7j)      −0.22*        187  −2.08       −0.38**       187  −3.60       −0.13         187  −1.46
Fall Problems in Teacher Interaction (β8j)      −0.17         187  −1.53       −0.02         187  −0.19       −0.03         67   −0.26

Random Effects                                  Variance Component  df   χ²    Variance Component  df   χ²    Variance Component  df   χ²
Intercept (γ00)                                 21.72**             67   145.04  18.69**           51   114.27  30.21**           32   150.23
Fall Problems in Peer Interaction (u1j)         n/a                 n/a  n/a     0.08*             51   72.86   0.15**            32   67.45
Fall Problems in Structured Learning (u2j)      n/a                 n/a  n/a     n/a               n/a  n/a     n/a               n/a  n/a
Fall Problems in Teacher Interaction (u3j)      n/a                 n/a  n/a     n/a               n/a  n/a     0.02              32   46.06
Level-1 Effects (rij)                           52.79               —    —       48.82             —    —       34.11             —    —
Note. n/a = not available. *p < .05. **p < .01.
covariate at the child level. Results from the unconditional models indicated that a substantial proportion of the variance in the PLBS dimensions was attributable to differences between classrooms, thus confirming that multilevel modeling was the most appropriate analytic approach (Raudenbush & Bryk, 2002). For competence motivation, attention/persistence, and attitude toward learning outcomes, 22%, 14%, and 26% of the variance, respectively, was attributable to classroom-level differences; the remaining 78%, 86%, and 74% was attributable to child-level differences.
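One way to express the “percent of child-level variance explained” reported below is the proportional reduction in the Level 1 residual variance between nested models (Raudenbush & Bryk, 2002); that this is exactly how the authors computed their increments is an assumption. A minimal helper, continuing the earlier statsmodels sketch:

```python
# Assumes baseline_fit and conditional_fit are MixedLM results from the earlier
# sketch (e.g., a demographics-only model vs. demographics plus the three ASPI
# dimensions), fit to the same outcome and sample.
def level1_variance_explained(baseline_fit, conditional_fit):
    """Proportional reduction in within-classroom (Level 1) residual variance."""
    return (baseline_fit.scale - conditional_fit.scale) / baseline_fit.scale

# Example: level1_variance_explained(m_demographics, m_demographics_plus_aspi)
```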
Child demographic covariates (age, gender, and ethnicity), as a set, accounted for 6%, 9%, and 5% of the child-level variance in competence motivation, attention/persistence, and attitude toward learning, respectively. The situational dimensions, as a set, accounted for an additional 13%, 31%, and 37% of the child level variance in competence motivation, attention/persistence, and attitude toward learning, respectively. Table 4 presents the results for the final models. Regression coefficients indicated that the ASPI situational problem dimensions differentially related to the three 49
approaches to learning outcomes. Problems in structured learning situations were negatively associated with children’s competence motivation, attention, and persistence. Problems in peer interactions at the beginning of the preschool year were negatively associated with children’s attitude toward learning. Discussion The present study advances the knowledge base by examining the distinct contribution of problem behavior within routine preschool learning and social contexts to a comprehensive set of academic and learning-related readiness skills. Guided by a developmental-ecological model, the study examined these relations for a representative sample of children living in urban poverty. Findings extend prior research and suggest that where children exhibit early problem behavior within the early childhood classroom matters for their mastery of important academic readiness skills as well as learning-related skills such as attention, persistence, and motivation to learn. Situational Problem Behavior, Early Reading, and Mathematics Ability Study findings confirmed initial hypotheses, providing consistent evidence for the negative influence of early problem behavior within structured learning activities on children’s reading and mathematics outcomes. Controlling for child demographics, early problem behavior in structured learning situations predicted lower scores on both measures of reading and mathematics ability at the end of the year. This finding is consistent with initial research that identified the differential effects of early problems within organized classroom learning activities on more global teacher measures of cognitive skills (Bulotsky-Shearer et al., 2008). The present study replicates these findings employing nationally norm-referenced direct assessments. Findings are supported by early childhood research and ecological theory underscoring the positive benefits of children’s ac-
tive engagement within classroom activities intentionally designed by educators to foster learning (Bowman et al., 2001; National Association for the Education of Young Children, 2009). Our study provides empirical support for the notion that multiple social and emotional skills are required to navigate structured learning situations where reading or mathematics skills are taught (Kontos & Keyes, 1999). Children with difficulty attending, regulating their behavior, or engaging appropriately during structured activity times (such as circle time or small group time) demonstrated lower early reading and mathematics skills. In our study, direct assessments of these academic skills tested children’s alphabet knowledge, phonemic awareness, early number operations, and counting skills—academic skills likely intentionally taught during structured learning activities within the Head Start classroom (Rimm-Kaufman, Pianta & Cox, 2000). Situational Problem Behavior and Approaches to Learning Confirming hypotheses, early problem behavior in structured learning activities predicted lower competence motivation, attention, and persistence in learning tasks at the end of the year. Behavior problems in peer interactions predicted lower attitude toward learning. Although research conducted within Head Start supports the finding that early problem behavior in structured learning and teacher interactions negatively affects children’s approaches to learning as a global construct (Domínguez et al., 2011), this is the first study to examine the differential relations between situational problem behavior and multiple dimensions of preschool learning behaviors. Study findings are supported by early childhood research documenting the importance of learning-related skills to academic readiness outcomes. Competence motivation, attention, and persistence have been identified as the key dimensions that link children to learning and engagement in academically focused activities (Fantuzzo et al., 2005; Rouse
& Fantuzzo, 2008). In our study, children exhibiting early problem behavior within structured learning situations demonstrated lower motivation to learn, attention, and persistence with learning tasks at the end of the year. Early childhood research supports this finding, documenting a negative association between early internalizing problem behavior, competence motivation, and autonomous classroom behavior (Fantuzzo et al., 2004). Research also provides evidence for a negative relationship between early externalizing problem behavior, attention, and persistence at the end of the year (Fantuzzo et al., 2005). In our study, problem behavior in structured learning activities included both internalizing and externalizing behavior. Both types of problem behavior within structured learning contexts related differentially to children’s competence motivation and attention and persistence, which were the two approaches to learning dimensions found most strongly associated with early academic success (Rouse & Fantuzzo, 2008). Our study also extends previous research by documenting the negative association between early problem behavior in peer interactions and children’s attitude toward learning. In our study, attitude toward learning consisted of learning-related behaviors within the context of socially mediated learning interactions with peers and teachers (e.g., children’s willingness to be helped, desire to please the teacher, propensity to express hostility when frustrated; Fantuzzo et al., 2004). Children demonstrating problem behavior in peer interactions exhibited difficulties regulating their emotions, tolerating frustration, and accepting help in learning activities involving peers and teachers. Although this is the first study to examine the relations between situational problem behavior and children’s attitude toward learning, initial research employing the ASPI situational dimensions provides some support for these findings. This research identified a link between problem behavior in peer interactions and higher disruptive peer play skills in the classroom (Bulotsky-Shearer et al., 2008). In addition, findings from two recent studies suggest that early socially dis-
ruptive or emotionally dysregulated behavior, such as aggressive behavior, is associated with lower attitude toward learning in Head Start (Fantuzzo et al., 2005, 2007). Although further research is needed to confirm this differential relation, our findings suggest that children who demonstrate early behavioral difficulties within the peer context are less likely to acquire important adaptive learning-related skills within socially mediated learning activities throughout the course of the preschool year. Limitations and Directions for Future Research Although the study employed a complex multilevel analytic approach to examine children’s behavior within routine classroom social and learning contexts, several qualifications must be acknowledged. The study sample was intentionally representative of 4-year-old children enrolled in a large, urban school district Head Start program serving predominantly African American families in the Northeast; thus, findings are limited to this population. Future studies should examine the generalizability of our findings to other more geographically, culturally, and ethnically diverse populations of preschool children (e.g., children served in rural programs, or children from other cultural or linguistic backgrounds). In addition, our study did not include contextual variables at the family or classroom level that might explain variance in reading, mathematics, and approaches to learning outcomes. Future studies could extend this line of research by examining the additive or interactive influence of contextual variables documented in the literature to affect school readiness. These could include family risk variables such as poverty, maternal unemployment, or depression (Garbarino, 1995; McLoyd, 1998), or protective factors such as classroom quality (e.g., teacher–child interactions or teacher instructional support; Pianta, 2006). For example, a growing body of early childhood research suggests that classroom process quality (teacher sensitivity; instructional and emotional support) may moderate
the effect of classroom activity setting on problem behavior (Rimm-Kaufman, La Paro, Downer, & Pianta, 2005). A recent study in Head Start documented that observed classroom emotional support buffered the negative effects of problem behavior in teacher interactions on a global assessment of approaches to learning (Domínguez et al., 2011). Our study is also qualified by the fact that the Head Start teachers completed both the ASPI and the PLBS. Although direct assessments (TERA and TEMA) were used to measure children’s academic readiness, teacher report was used to assess both situational problem behavior (ASPI) and approaches to learning (PLBS). We intentionally chose the PLBS as a measure of approaches to learning because it is a multidimensional instrument validated for use with low-income preschool populations (Rogers, 1998; U.S. Department of Health and Human Services, 2001). In addition, we chose this measure because research suggests that teachers are one of the most knowledgeable and reliable sources for accurate, summative observations of children’s classroom behavior (McDermott, 1986). Nevertheless, it is important to acknowledge that the relations observed between situational problem behavior and learning behavior may stem in part from shared method variance. Future studies can extend this research by incorporating additional assessments of children’s approaches to learning across additional sources and methods (e.g., parents or direct observation by independent observers). Finally, our models theoretically assumed directional relations between early situational problem behavior and children’s readiness outcomes. Further research is needed to confirm or disconfirm this assumption, as researchers are increasingly highlighting the overlapping and bidirectional nature of school readiness domains (McWayne & Cheung, 2009; Snow, 2007). The temporal structure of our data did not permit the examination of an alternate hypothesis (testing the influence of early reading, mathematics, and learning behaviors on problem behavior) or possible mediating mechanisms among these variables. However, future studies could test a
cross-lagged structural equation model to determine which statistical model best fit these data: whether children’s academic skills and learning-related behaviors predicted situational problem behavior or vice versa (Muthén & Muthén, 1998–2010). A longitudinal model that examines the influence of situational problem behavior on academic and learning-related outcomes across the transition to elementary school would also strengthen our findings. For example, a recent longitudinal study examined the relations between preschool situational problem behavior and literacy and language outcomes in kindergarten and first grade (Bulotsky-Shearer & Fantuzzo, 2011). Children with behavioral difficulties in preschool classroom learning situations demonstrated significantly lower early reading fluency, language, and reading achievement across these critical transition points in elementary school. Implications for Policy and Practice There are a number of implications of this research. First, our study highlights the importance of early identification of problem behavior within the classroom context for preschool children living in urban poverty. Our findings underscore the need to attend to behavioral problems within the context of early childhood classroom learning and social experiences where fundamental mathematics, reading, and approaches to learning skills are intentionally taught. Our study suggests that when problem behavior occurred in structured learning situations, it mattered for children’s early reading and mathematics skills. In addition, when problems occurred in classroom structured learning activities or peer interactions, they influenced children’s ability to acquire important learning-related behaviors across the preschool year. To inform targeted interventions, it is critical that developmentally and contextually relevant tools such as the ASPI be used to identify problems where learning occurs in early childhood classrooms. Our study is responsive to national priorities that call for the expansion of developmentally and contextually appropriate assess-
ments to inform early identification and intervention efforts for low-income populations (U.S. Department of Health and Human Services, 2001). Key to programmatic early intervention is the use of high-quality assessment tools that can identify a comprehensive set of problem behaviors within routine, developmentally appropriate classroom learning contexts. Use of such tools is particularly important for diverse low-income populations whose mental health needs are traditionally underidentified within community-based early childhood educational programs (U.S. Department of Health and Human Services, 2001). In our study, we provide additional validity for an assessment of early situational problem behavior based on observations within the naturalistic classroom context, by early childhood teaching staff— key natural resources and contributors to children’s development (Fantuzzo, McWayne, & Bulotsky, 2003). Early childhood programs can benefit directly from using information about the ASPI situational dimensions to inform both universal and targeted classroom-based intervention efforts. Rather than identifying “types” of behavior problem (e.g., internalizing or externalizing), this study identified those situations “where” problem behavior in the preschool classroom most affected children’s learning. Data collected program-wide regarding the most challenging classroom situations for both children and teachers can help direct programmatic resources toward classrooms in greatest need, as well as guide staff professional development efforts that complement universal and targeted classroom strategies such as the Pyramid Model (Powell, Dunlap, & Fox, 2006). In the classroom, teachers can use ASPI data to identify “where” children’s behavior problems are most likely to influence learning and to take steps to support more adaptive behavior. This ecological approach to intervention shifts the focus of intervention from a more traditional focus on “fixing” the individual child to “making the larger system work” (e.g., changing the classroom situation to better fit the capacities of the child; Evans & Evans, 1990; Swartz & Martin, 1997).
References Achenbach, T. M. (1991). Child Behavior Checklist. Burlington: University of Vermont. Arnold, D. H. (1997). Co-occurrence of externalizing behavior problems and emergent academic difficulties in young high-risk boys: A preliminary evaluation of patterns and mechanisms. Journal of Applied Developmental Psychology, 18, 317–330. Barbarin, O. (2007). Mental health screening of preschool children: Validity and reliability of ABLE. American Journal of Orthopsychiatry, 77, 402– 418. Bowman, B. T., Donovan, M. S., & Burns, M. S. (Eds.). (2001). Eager to learn: Educating our preschoolers. Washington, DC: National Academy Press. Bronfenbrenner, U., & Morris, P. A. (1998). The ecology of developmental processes. In W. Damon (Ed.), Handbook of child psychology: Theoretical models of human development (5th ed., Vol. 1, pp. 993–1028). New York: Wiley. Bub, K. L., McCartney, K., & Willett, J. B. (2007). Behavior problem trajectories and first-grade cognitive ability and achievement skills: A latent growth curve analysis. Journal of Educational Psychology, 99, 653– 670. Bulotsky-Shearer, R., Dominguez, X., Bell, E., Rouse, H., & Fantuzzo, J. W. (2010). Relations between behavior problems in classroom social and learning situations and peer social competence in Head Start and kindergarten. Journal of Emotional and Behavioral Disorders, 18(4). Advance online publication. doi: 10.1177/ 1063426609351172. Bulotsky-Shearer, R., & Fantuzzo, J. (2011). Preschool behavior problems in classroom learning situations and literacy outcomes in kindergarten and first grade. Early Childhood Research Quarterly, 26(1), 61–73. Advance online publication. doi:10.1016/j.ecresq.2010.04.004. Bulotsky-Shearer, R., Fantuzzo, J. W., & McDermott, P. A. (2008). An investigation of classroom situational dimensions of emotional and behavioral adjustment and cognitive and social outcomes for Head Start children. Developmental Psychology, 44, 139 –154. Campbell, S. B., Shaw, D. S., & Gilliom, M. (2000). Early externalizing behavior problems: Toddlers and preschoolers at risk for later maladjustment. Development and Psychopathology, 12, 467– 488. Cooper, J. L., Aratani, Y., Knitzer, J., Douglas-Hall, A., Masi, R., Banghart, P., et al. (2008). Unclaimed children revisited: The status of children’s mental health policy in the United States. New York: National Center for Children in Poverty, Columbia University Mailman School of Public Health. Retrieved November 1, 2008, from http://nccp.org/publications/pdf/text_853.pdf Denham, S. A. (2006). Social-emotional competence as support for school readiness: What it is and how do we assess it? Early Education and Development, 17, 57– 89. Dobbs, J., Doctoroff, G. L., Fisher, P. H., & Arnold, D. H. (2006). The association between preschool children’s socio-emotional functioning and their mathematical skills. Journal of Applied Developmental Psychology, 27, 97–108. Domínguez Escalo´n, X., & Greenfield, D. B. (2009). Learning behaviors mediating the relationship between behavior problems and academic outcomes. NHSA Dialog, 12, 1–17. 53
Dominguez, X., Vitiello, V. E., Fuccillo, J., Greenfield, D. B., & Bulotsky-Shearer, R. J. (2011). The role of context in preschool learning: A multilevel examination of the contribution of context-specific problem behaviors and classroom process quality to approaches to learning. Journal of School Psychology. Manuscript in press. Duncan, G. J., Claessens, A., Huston, A. C., Pagani, L. S., Engel, M., Sexton, H., et al. (2007). School readiness and later achievement. Developmental Psychology, 43, 1428 –1446. Enders, C. K., & Tofighi, D. (2007). Centering predictor variables in cross-sectional multilevel models: A new look at an old issue. Psychological Methods, 12, 121– 138. Evans, W. H., & Evans, S. S. (1990). Ecological assessment guidelines. Diagnostique, 16, 49 –51. Fantuzzo, J. W., Bulotsky, R., McDermott, P., Mosca, S., & Lutz, M. N. (2003). A multivariate analysis of emotional and behavioral adjustment and preschool educational outcomes. School Psychology Review, 32, 185–203. Fantuzzo, J. W., Bulotsky-Shearer, R., Frye, D., McDermott, P. A., McWayne, C., & Perlman, S. (2007). Investigation of social, emotional, and behavioral dimensions of school readiness for low-income, urban preschool children. School Psychology Review, 36, 44 – 62. Fantuzzo, J. W., Bulotsky-Shearer, R., Fusco, R. A., & McWayne, C. (2005). An investigation of preschool emotional and behavioral adjustment problems and social-emotional school readiness competencies. Early Childhood Research Quarterly, 20, 259 –275. Fantuzzo, J. W., McWayne, C., & Bulotsky, R. (2003). Forging strategic partnerships to advance mental health science and practice for vulnerable children. School Psychology Review, 32, 17–37. Fantuzzo, J. W., & Mohr, W. K. (2000). Pursuit of wellness in Head Start: Making beneficial connections for children and families. In D. Cicchetti & J. Rappaport (Eds.), The promotion of wellness in children and adolescents (pp. 341–369). Washington, DC: Child Welfare League of America, Inc. Fantuzzo, J. W., Perry, M. A., & McDermott, P. (2004). Preschool approaches to learning and their relationship to other relevant classroom competencies for low-income children. School Psychology Quarterly, 19, 212– 230. Feil, E. G., Small, J. W., Forness, S. R., Serna, L. A., Kaiser, A. P., Hancock, T. B., et al. (2005). Using different measures, informants, and clinical cut-off points to estimate prevalence of emotional or behavioral disorders in preschoolers: Effects on age, gender, and ethnicity. Behavioral Disorders, 30, 375–391. Friedman, S. L., & Wachs, T. D. (Eds.). (1999). Measuring the environment across the life span: Emerging methods and concepts. Washington, DC: American Psychological Association. Garbarino, J. (1995). Raising children in a socially toxic environment. San Francisco: Jossey Bass. Ginsburg, H. P., & Baroody, A. J. (1990). Test of Early Mathematics Ability, Second Edition. Austin, TX: ProEd. Goldstein, S. (1995). Understanding and managing children’s classroom behavior. New York: Wiley. 54
Hammill, D., Ammer, J. J., Cronin, M. E., Mandlebaum, L. H., & Quinby, S. S. (1987). Quick-Score Achievement Test. Austin, TX: Pro-Ed. Harden, B. J., Winslow, M. B., Kendziora, K. T., Shahinfar, A., Rubin, K. H., Fox, N. A., et al. (2000). Externalizing problems in Head Start children: An ecological exploration. Early Education & Development, 11, 357–385. Hyson, M. (2008). Enthusiastic and engaged learners: Approaches to learning in the early childhood classroom. Washington, DC: National Association for the Education of Young Children. Kagan, S. L., Moore, E., & Bredekamp, S. (Eds.) (1995). Reconsidering children’s early development and learning: Toward common views and vocabulary. Washington, DC: National Education Goals Panel. Klein, L., & Knitzer, J. (2007). Promoting effective early learning: What every policymaker and educator should know. New York: National Center for Children in Poverty. Retrieved August 1, 2007, from http:// www.nccp.org/publications/pdf/text_695.pdf Kontos, S., & Keyes, L. (1999). An ecobehavioral analysis of early childhood classrooms. Early Childhood Research Quarterly, 14, 35–50. Lonigan, C. J., Bloomfield, B. G., Anthony, J. L., Bacon, K. D., Phillips, B. M., & Samwel, C. S. (1999). Relations among emergent literacy skills, behavior problems, and social competence in preschool children from low- and middle-income backgrounds. Topics in Early Childhood Education, 19, 40 –53. Lopez, M. L., Tarullo, L. B., Forness, S. R., & Boyce, C. A. (2000). Early identification and intervention: Head Start’s response to mental health challenges. Early Education & Development, 11, 265–282. Lutz, M. N. (1999). Contextually relevant assessment of the emotional and behavioral adjustment of Head Start children. Unpublished doctoral dissertation, University of Pennsylvania, Philadelphia. Lutz, M. N., Fantuzzo, J., & McDermott, P. (2002). Multidimensional assessment of emotional and behavioral adjustment problems of low-income preschool children: Development and initial validation. Early Childhood Research Quarterly, 17, 338 –355. Mallory, B. L., & Kearns, G. M. (1988). Consequences of categorical labeling of preschool children. Topics in Early Childhood Special Education, 8, 39 –50. McDermott. P. A. (1986). The observation and classification of exceptional child behavior. In R. T. Brown & C. R. Reynolds (Eds.), Psychological perspectives on childhood exceptionality: A handbook (pp. 136 –180). New York: Wiley. McDermott, P. A. (1993). National standardization of uniform multisituational measures of child and adolescent behavior pathology. Psychological Assessment, 5, 413– 424. McDermott, P. A., Green, L. F., Francis, J. M., & Stott, D. H. (2000). Preschool Learning Behavior Scale. Philadelphia: Edumetric and Clinical Science. McDermott, P. A., Leigh, N. M., & Perry, M. A. (2002). Development and validation of the Preschool Learning Behaviors Scale. Psychology in the Schools, 39, 353– 365. McEvoy, A., & Welker, R. (2000). Antisocial behavior, academic failure, and school climate: A critical review. Journal of Emotional & Behavioral Disorders, 8, 130 – 140.
McLoyd, V. C. (1998). Socioeconomic disadvantage and child development. American Psychologist, 53, 185– 204. McWayne, C. M., & Cheung, C. (2009). A picture of strength: Preschool competencies mediate the effects of early behavior problems on later academic and social adjustment for Head Start children. Journal of Applied Developmental Psychology, 30, 273–285. Meisels, S. J. (1997). Using work sampling in authentic performance assessments. Educational Leadership, 54, 60 – 65. Muthe´n, L. K., & Muthe´n, B. O. (1998 –2010). Mplus user’s guide (6th ed.). Los Angeles, CA: Muthe´n & Muthe´n. National Association for the Education of Young Children. (2009). Developmentally appropriate practice in early childhood programs serving children from birth through age 8: Position statement. Washington, DC: Author. Retrieved February 18, 2009, from http:// www.naeyc.org/about/positions/dap.asp Newcomer, P. (2001). Diagnostic Achievement Battery— Third edition. Austin, TX: Pro-Ed. Nuttall, E. V., Romero, I., & Kalesnik, J. (1999). Assessing and screening preschoolers: psychological and educational dimensions. Boston, MA: Allyn & Bacon. Pianta, R. C. (2006). Teacher-child relationships and early literacy. In D. Dickinson & S. Newman (Eds.), Handbook of early literacy research (Vol. 2, pp. 149 –162). New York: Guilford. Piotrkowski, C. S., Collins, R. C., Knitzer, J., & Robinson, R. (1994). Strengthening mental health services in Head Start: A challenge for the 1990s. American Psychologist, 49, 133–139. Powell, D., Dunlap, G., & Fox, L. (2006). Prevention and intervention for the challenging behaviors of toddlers and preschoolers. Infants & Young Children, 19, 25– 35. Qi, C. H., & Kaiser, A. P. (2003). Behavior problems of preschool children from low-income families: Review of the literature. Topics in Early Childhood Special Education, 23, 188 –216. Raver, C. C. (2002). Emotions matter: Making the case for the role of young children’s emotional development for early school readiness. Retrieved November 1, 2003, from http://www.srcd.org/spr.html Raudenbush, S. W., & Bryk, A. S. (2002). Hierarchical linear models: Applications and data analysis methods (2nd ed.). Thousand Oaks, CA: Sage. Raudenbush, S. W., Bryk, A. S., Cheong, Y. F., & Congdon, R. (2004). Hierarchical linear and nonlinear modeling (Version 6.01a) [Computer software]. Lincolnwood, IL: Scientific Software International. Ready, D. D., LoGerfo, L. F., Burkam, D. T., & Lee, V. E. (2005). Explaining girls’ advantage in kindergarten literacy learning: Do classroom behaviors make a difference? The Elementary School Journal, 106, 21–38. doi:10.1086/496905. Reid, D. K., Hresko, W. P., & Hammill, D. D. (2001). Test of Early Reading Ability, third edition. Austin, TX: Prod-Ed.
Reynolds, C. R., & Kamphaus, R. W. (2002). The clinician’s guide to the Behavior Assessment System for Children (BASC). New York: Guilford. Rimm-Kaufman, S. E., La Paro, K. M., Downer, J. T., & Pianta, R. C. (2005). The contribution of classroom setting and quality of instruction to children’s behavior in kindergarten classrooms. The Elementary School Journal, 105, 377–394. Rimm-Kaufman, S. E., Pianta, R. C., & Cox, M. J. (2000). Teachers’ judgments of problems in the transition to kindergarten. Early Childhood Research Quarterly, 15, 147–166. Rogers, M. R. (1998). Psychoeducational assessment of culturally and linguistically diverse children and youth. In H. B. Vance (Ed.), Psychological assessment of children: Best practices for school and clinical settings (2nd ed., pp. 355–384). New York: Wiley. Rouse, H. L., & Fantuzzo, J. W. (2008). Competence motivation in Head Start: An early childhood link to learning. In C. Hudley & A. Gottfried (Eds.), Academic motivation and the culture of schooling in childhood and adolescence (pp. 15–35). New York: Oxford. Snow, K. L. (2007). Integrative views of the domains of child function: Unifying school readiness. In R. C. Pianta, M. J. Cox, & K. L. Snow (Eds.), School readiness and the transition to kindergarten in the era of accountability (pp. 197–216). Baltimore, MD: Brookes. Spira, E. G., & Fischel, J. E. (2005). The impact of preschool inattention, hyperactivity, and impulsivity on social and academic development: a review. Journal of Child Psychology and Psychiatry, 46, 755–774. Stevenson, J., Richman, N., & Graham, P. J. (1985). Behaviour problems and language abilities at three years and behavioural deviance at eight years. Journal of Child Psychology and Psychiatry, 26, 215–230. Swartz, J. L., & Martin, W. E. J. (Eds.). (1997). Applied ecological psychology for schools within communities: Assessment and intervention. Mahwah, NJ: Erlbaum. Thompson, R. A., & Raikes, H. A. (2007). The social and emotional foundations of school readiness. In D. F. Perry, R. K. Kaufman, & J. Knitzer (Eds.). Social and emotional health in early childhood: Building bridges between services and systems (pp. 13–36). Baltimore, MD: Brookes. Vaughn, S., Hogan, A., Lancelotta, G., Shapiro, S., & Walker, J. (1992). Subgroups of children with severe and mild behavior problems: Social competence and reading achievement. Journal of Clinical Child Psychology, 21, 98 –106. U.S. Department of Health and Human Services. (1996). Final rule—Program performance standards for the operation of Head Start programs by grantee and delegate agencies, 45 CFR Part 1304. Federal Register, 61, 57186 –57227. U.S. Department of Health and Human Services. (2001). Report of the Surgeon General’s Conference on Children’s Mental Health: A national action agenda. Washington, DC: U.S. Department of Health and Human Services. www.surgeongeneral.gov/cmh/
Rebecca Bulotsky-Shearer, PhD, is currently an assistant professor in the Department of Psychology, Child Division, at the University of Miami. She is involved in community-based research within the Head Start community in Miami-Dade County. Her research interests include the development of contextually relevant measures of emotional and behavioral adjustment for culturally and linguistically diverse, low-income preschool populations; and examination of the relation between early behavioral adjustment and preschoolers’ engagement in classroom learning and social interactions. She also conducts research on early protective influences in the home and school contexts that promote learning for low-income preschool children. Veronica Fernandez, BS, is a doctoral candidate in the applied developmental program at the University of Miami. Her research interests include identifying and examining factors that affect the school readiness outcomes of low-income preschool children, especially those who have an identified or suspected disability. She intends for her research to inform and improve the referral, identification, and service delivery processes for these children. Ximena Domínguez, PhD, received her doctorate in applied developmental psychology from the University of Miami in 2010. She is currently a research social scientist at SRI International, where she collaborates on projects aimed at improving low-income children’s school readiness and serves as evaluation director for early mathematics and science intervention programs. Her research interests include examining the effects of social-emotional skills and learning-related behaviors on young children’s academic school readiness. Her research also aims to identify classroom-level processes that promote adaptive classroom behavior and ensure children’s successful engagement in important classroom learning experiences that foster literacy, mathematics, and science. Heather Rouse, PhD, received her doctorate in school, community, and clinical child psychology from the University of Pennsylvania in 2007. Her research interests include population-based investigations of early biological and familial risks to educational well-being and the identification of protective factors that mitigate the negative effects of these risks. In 2010 she was awarded a Public Policy Fellowship from the Stoneleigh Foundation in recognition of her contributions to the development of an integrated administrative data system to support public policy research in Philadelphia. With this fellowship, she is serving as the deputy research director for Philadelphia’s Policy and Analysis Center, a collaboration between the school district and the city to conduct cross-systems research to improve public services for vulnerable children and families.
Date Received: September 6, 2010 Date Accepted: November 9, 2010 Action Editor: Tanya Eckert. Article was accepted by previous Editor.
School Psychology Review, 2011, Volume 40, No. 1, pp. 57–71
Escape-to-Attention as a Potential Variable for Maintaining Problem Behavior in the School Setting Jana M. Sarno and Heather E. Sterling The University of Southern Mississippi Michael M. Mueller Southern Behavioral Group Brad Dufrene, Daniel H. Tingstrom, and D. Joe Olmi The University of Southern Mississippi Abstract. Mueller, Sterling-Turner, and Moore (2005) reported a novel escape-to-attention (ETA) functional analysis condition in a school setting with one child. The current study replicates Mueller et al.’s functional analysis procedures with three elementary school-age boys referred for problem behavior. Functional analyses verified that each participant’s problem behavior was maintained by escape from academic demands. Follow-up functional analyses in which target behaviors in escape versus ETA conditions were compared resulted in higher levels of target behavior in the ETA condition for 2 of the 3 participants. The current study also extended previous research by including a treatment analysis. Treatments designed to address escape and attention functions were more effective at reducing the target behaviors than treatments designed to target escape alone for all 3 participants. Results and implications for future research are discussed.
ior. In their original work, Iwata and colleagues measured levels of target behaviors during experimental conditions (i.e., attention, escape, alone) and compared the data to levels of target behavior in a control condition in which the reinforcers were available noncontingently. Iwata et al.’s methodology has been used extensively to identify the behavioral function of self-injurious behavior in clinical settings and has been used with a variety of behaviors and in other nonclinical settings. Although use of functional analysis proce-
Incorporating experimental analyses into a functional behavioral assessment is an effective and time-efficient approach for the assessment and treatment of problem behavior (Hanley, Iwata, & McCord, 2003; Mueller, Sterling-Turner, & Moore, 2005; Mueller, Nkosi, & Hine, in press). The functional analysis methodology developed by Iwata, Dorsey, Slifer, Bauman, and Richman (1982) is an analogue evaluation of problem behavior in which purported reinforcers are withheld and then delivered contingent upon target behav-
This article was taken, in part, from the first author’s thesis. Correspondence regarding this article should be addressed to Heather E. Sterling, The University of Southern Mississippi, 118 College Drive, #5025, Hattiesburg, MS 39406; E-mail:
[email protected] Copyright 2011 by the National Association of School Psychologists, ISSN 0279-6015
dures is reported less commonly in school settings (Hanley et al., 2003), studies have been reported with examples of disruptive school-based behaviors reinforced by peer attention (e.g., Broussard & Northup, 1997), teacher attention, (e.g., Gunter, Jack, Shores, Carrell, & Flowers, 1993), access to tangible items (e.g., Moore, Mueller, Dubard, Roberts, & Sterling-Turner, 2002), and escape from academic demands (e.g., Broussard & Northup, 1995). Although a functional behavioral assessment, including experimental analysis, may not be necessary to address all disruptive behaviors in school settings (Gresham et al., 2004), additional research on the effect of idiosyncratic variables is needed. In school settings, as in other nonclinical settings, unique environmental variables (e.g., setting, personnel, physical) could require modifications to the standard functional analysis conditions typically reported. For example, in school settings, tasks in the form of academic demands (e.g., ongoing instruction, independent practice worksheets) are, at least theoretically, present throughout the majority of the day. Likewise, concurrent and potentially competing reinforcers in the form of peer attention, teacher attention, or preferred activities (e.g., reading a more desirable book) or items (e.g., playing with a toy hidden in a desk) for inappropriate behavior may be present. Thus, students may be provided with escape from academic demands, while subsequently being provided with an additional reinforcer for problem behavior. Because student behavior can be under the discriminative control of multiple antecedent events or reinforced by multiple variables (e.g., teacher and peer attention, access to preferred materials, breaks from work), it is important to examine a combination of factors that may be maintaining problem behavior in the classroom. Over the past few years, investigations of the effects of multiple variables have begun in and out of the classroom. For example, Hoff, Ervin, and Friman (2005) examined the separate and combined effects of escape and peer attention on disruptive behavior in the 58
general education classroom. Following a descriptive assessment, including interviews and direct observations, Hoff and colleagues formulated three hypotheses to test in an alternating treatments design: access to peer attention, escape from a nonpreferred activity, and access to peer attention and escape from a nonpreferred activity. Treatment analysis data verified the initial hypothesis of access to peer attention and escape from academic demands. In addition, a combined intervention targeting both attention and escape decreased problem behaviors to near zero levels. Moore, Mueller et al. (2002) investigated the influence of the simultaneous delivery of therapist attention on self-injurious behavior in a tangible condition. Following the initial functional analysis, attention in the tangible condition was evaluated using a reversal design. In one phase, juice and brief attention were delivered contingent on self-injurious behavior. In the second phase, the delivery of the preferred stimulus (juice) was returned contingent on problem behavior and attention was withheld. The results of the follow-up analysis demonstrated that self-injurious behavior occurred at higher rates when the juice and attention were delivered concurrently than when the juice was presented alone. By incorporating procedural variations in the functional analysis methodology, Moore, Mueller et al. (2002) demonstrated that the presence of attention could confound the outcomes of functional analysis conditions. Moore and colleagues hypothesized that “practical solutions for the tangible condition might be to restrict attention as much as possible or to weaken the dependency between problem behavior and therapist attention by delivering attention on a response-independent schedule” (p. 284). However, Moore, Mueller et al. did not present treatment data to support their hypothesis. It is conceivable, though, as the authors suggested, that the influence of attention might affect other consequent analysis conditions. In Mann and Mueller (2009), the functional analysis results of a girl’s aggression appeared to be maintained by attention. The results of the functional analysis of her behav-
ior showed high levels of aggression in the attention condition and low levels in the escape from academic demand, access to tangibles, and toy play control conditions. When she failed to acquire a functional communicative response to replace aggression for attention, a follow-up functional analysis was used to evaluate whether access to attention was part of a chain of reinforcers maintaining aggression. In the follow-up analysis, attention-to-tangibles, attention-to-escape, and attention alone as a control condition were each used. Aggression was high in the attention-to-tangibles condition and low in the attention-to-escape and the attention-only control. When functional communication training was used to address the attention-to-tangibles (i.e., manding for access to tangibles and attention, rather than attention only), the response was acquired and the aggression decreased. These results highlight two issues relevant for school-based functional analysis. First, Iwata et al.’s (1982) methodology is useful, even if structural variants to assess multiple reinforcers are required. Second, for behaviors maintained by multiple reinforcers, matching treatments to both reinforcers may be required to reduce target behaviors substantially. Mueller et al. (2005) provided pilot data for an ETA condition used in a school setting. For one child, a functional analysis with escape, attention, and toy play conditions was conducted using the procedures described by Iwata et al. (1982). Results of the initial functional analysis showed that problem behavior occurred only in the escape from academic demand condition, although at levels lower than those typically observed in the classroom setting. After escape was identified by the initial functional analysis, the researchers assessed a combination of variables to determine whether differential levels of problem behavior would occur with the addition of attention during the break from academic demands, as this was observed in the direct behavioral observation prior to the initial functional analysis. In the follow-up functional analysis, the escape-only, ETA, and control conditions were presented. A substantially higher level of tantrums was demon-
strated in the ETA condition than in the escape-alone condition or the control conditions. Mueller et al. (2005) hypothesized that without the information derived from the follow-up analysis, an intervention based on the escape-only hypothesis would have failed. Although the results of Mann and Mueller (2009) provide some support for this hypothesis, Mueller et al. (2005) did not provide any intervention data. Other limitations should be addressed as well. First, the investigation was a pilot study of the ETA condition and involved only one participant. Another limitation was that the consultant collected all data and, because of staffing issues, no interobserver agreement (IOA) data were collected. Given the limitations of Mueller et al. (2005), the current study was undertaken with two primary goals. First, we replicated Mueller et al.’s ETA investigation with additional participants and in a more controlled manner, including IOA data and multiple behavioral observers, to determine whether the ETA function would emerge in additional participants. The second goal was to extend Mueller et al. by evaluating two different behavioral interventions, one that presented an escape-only treatment and one that was matched to both functions (escape and teacher attention). We predicted that differential treatment results would emerge for students who showed higher levels of problem behavior in the ETA condition, with stronger treatment effects favoring the combined treatment for children with an ETA function when compared to children with escape-maintained problem behavior only. Method Participants and Setting Three elementary school-age boys referred for problem classroom behavior participated. All students were enrolled in public schools and were placed in general education classrooms in a rural Southeastern school district. Teacher and parental consent were secured for participation; participant names used hereafter are pseudonyms. Brandon was a 6-year-old Caucasian male enrolled in a general education first-grade classroom. Brandon
was diagnosed with attention deficit hyperactivity disorder (combined type) when he was 5 years old and was prescribed a 10-mg dose patch of methylphenidate (Daytrana). Franklin and J’Marcus were 5-year-old African American males enrolled in separate general education kindergarten classrooms. J’Marcus and Franklin had no medical diagnoses and were prescribed no medications. All sessions were conducted in the participants’ classrooms during typically scheduled activities that corresponded to teacher-reported times when problem behaviors were most frequent. The students’ classroom teachers implemented all functional analyses and treatment evaluation sessions. Measures Functional Assessment Informant Record for Teachers (FAIR-T). The FAIR-T is an instrument administered to teachers to generate hypotheses concerning the function of problem behavior (Edwards, 2002). The FAIR-T is designed with four components to achieve this purpose: (a) general referral information, (b) identification and description of problem behavior, (c) potential antecedents for problem behavior, and (d) potential consequences that follow the problem behavior most frequently. Researchers have demonstrated that the hypotheses generated from information gathered via the FAIR-T correspond with behavioral function identified in experimental analyses (e.g., Doggett, Edwards, Moore, Tingstrom, & Wilczynski, 2001; Dufrene, Doggett, Henington, & Watson, 2007). Intervention Rating Profile-15 (IRP-15). The Intervention Rating Profile-15 (IRP-15; Martens, Witt, Elliott, & Darveaux, 1985) was used as a social validity measure of the treatment conditions. The IRP-15 is composed of 15 questions that the respondent rates on a Likert-type scale ranging from 1 (strongly disagree) to 6 (strongly agree). Total scores range from 15 to 90, and a total score above 52.50 represents a rating of “acceptable” (Von Brock & Elliott, 1987). The IRP-15 has high reported internal consistency (Cronbach α = .98), and all items load on a
General Acceptability Factor (ranging from 0.82 to 0.95; Martens et al., 1985). Problem behavior. Child problem behavior served as the primary dependent variable and was reported as the percentage of intervals in which the behavior occurred. Problem behavior included: inappropriate vocalizations (Brandon, J’Marcus, Franklin), which was defined as talking or yelling without teacher permission; elopement (J’Marcus), which was defined as any movement 1 m away from the teacher or teacher-designated area without permission; and banging on surfaces (Franklin), which was defined as throwing academic materials in a downward motion to the desk, and/or floor so that it made an audible sound on impact. Additional data were also collected for task engagement during the treatment evaluation phases. Task engagement was defined as the student’s eyes directed at work materials and/or manipulating objects associated with the teacher command. Task engagement is presented as the percentage of intervals in which behavior was observed during a session. A 10-s partial interval recording system was used for all observations. All sessions were 10 min in length. Procedures First, functional behavioral assessments that included teacher interview and direct classroom observations were conducted to generate hypotheses of behavioral function. Second, functional analyses were used to verify escape from task demands as the maintaining variable for referred behaviors. Third, follow-up functional analyses were used to investigate the additive effects of attention delivered during the break from academic demands (i.e., ETA). Finally, two different treatments were compared to examine the effects on target behavior when an escape-only treatment was alternated with an intervention package that targeted escape and attention. Functional behavior assessment. Each teacher was administered the FAIR-T as a semistructured interview in order to define target behaviors and their immediate anteced-
ent and consequent events. Next, direct behavioral observations were conducted in the students’ classrooms. The information obtained from the functional assessment was used to form hypotheses about potential behavior-reinforcer relationships. Conditional probabilities (VanDerHeyden, Witt, & Gatti, 2001) were also calculated from the observational data to determine the temporal proximity of specific consequences and target behaviors. Conditional probabilities for each participant were as follows: Brandon—escape = 74%, teacher attention = 24%, peer attention = 0%; Franklin—escape = 66%, teacher attention = 25%, peer attention = 5%; J’Marcus—escape = 59%, teacher attention = 29%, peer attention = 9%. Descriptive data suggested escape from academic demands or teacher attention in the form of reprimands and redirection to work might be reinforcing the target behaviors identified for each child. Thus, each child proceeded to the experimental phases of the study reported below. Functional analysis. A hypothesis-driven (Repp, Felce, & Barton, 1988) functional analysis was used to identify the reinforcers for problem behaviors. For each participant, escape from academic demands and teacher attention were tested. A play condition was included as an experimental control. Conditions were presented in a random order and results were evaluated using a multi-element design. All conditions were 10 min in duration. A 2-min break was given between conditions. During the break, the student and teacher continued with the naturally occurring classroom activities (e.g., read a book, transitioned between activities). Students were not informed of changes in contingencies across sessions, and different stimuli were used across conditions (e.g., academic worksheets during demand conditions; leisure items during attention conditions). Each participant’s teacher implemented the functional analysis and conducted between 2 and 4 sessions per day. Control (play) condition. The student was provided free access to attention and pre-
ferred play materials available in the classroom. The teacher engaged in interactive play with the student and delivered attention at least every 30 s. No programmed consequences or demands were delivered for target behaviors. Attention condition. The student was allowed unrestricted access to activities/items typically available in the classroom. The teacher interacted with the student until he was engaged in an activity. Next, the teacher removed herself from the activity, saying she needed to do work at her desk. Contingent on target behavior(s), the teacher delivered verbal attention in the form of reprimands or redirection to work, consistent with verbalizations noted in the descriptive observations. Following the delivery of attention, the teacher returned to work and the student continued to have free access to preferred items. Escape from academic demand condition. During the escape condition, the student was presented with work materials identified by the teacher as associated with problem behavior in the past. A graduated prompting (i.e., verbal, gestural, physical) sequence was used to deliver academic demands. If problem behavior occurred, the teacher removed the academic demand and walked away from the student. No attention was provided to the student. Following the 30-s break period, the teacher returned to the student and delivered another demand and repeated the procedure described above. Follow-up functional analysis. Following the initial functional analysis, a follow-up functional analysis was conducted to investigate the additive effects of attention during the escape condition. All conditions were 10 min in duration, and 2-min breaks were given between conditions. The teacher implemented between 2 and 4 experimental conditions per day. The escape from academic demands and control/play conditions were implemented in an identical manner as in the initial functional analysis.
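Throughout these analyses, the dependent measure for each 10-min session is the percentage of 10-s partial intervals in which the target behavior was scored. A minimal sketch of that computation follows; the interval records shown are hypothetical and are not drawn from the study’s data.

# Illustrative sketch: scoring one 10-min session recorded with a 10-s
# partial-interval system (60 intervals). Each entry is True if the target
# behavior occurred at any point during that interval. Hypothetical data only.
def percent_intervals(intervals):
    """Percentage of observation intervals in which behavior was scored."""
    return 100.0 * sum(intervals) / len(intervals)

session = [False] * 60
for i in (3, 4, 5, 12, 13, 27, 40, 41, 42):  # behavior scored in 9 of 60 intervals
    session[i] = True
print(f"{percent_intervals(session):.1f}% of intervals with problem behavior")  # 15.0%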
Escape-to-attention condition. During the ETA condition, the student was presented with work materials identified by the teacher as associated with problem behavior in the past. A graduated prompting (i.e., verbal, gestural, physical) sequence was used to deliver academic demands. Contingent on target behavior, the teacher removed the task materials and provided verbal attention during the 30-s break. The quality of teacher attention during the escape break and the nature of teacher attention were based on information obtained in the descriptive assessment (i.e., reprimands, redirections, physical attention). During the 30-s break, the teacher continued to deliver attention to the student in the typical manner for that classroom (e.g., “You need to get back to work”, “I told you no screaming, you have to work.”). Following the 30-s break, the teacher re-presented the task and the prompting sequence continued. Treatment evaluations. A treatment comparison was employed to evaluate the target behavior under two different treatment types. One treatment, Escape Extinction, targeted escape only. The other treatment (Escape Extinction + Differential Reinforcement of Alternative Behaviors) targeted escape and teacher attention. The ETA treatment conditions were evaluated using a B/C/B/C design for Brandon and C/B/C designs for Franklin and J’Marcus. Escape extinction (EE). The EE condition was identical to the escape condition from the functional analysis with the exception that no break was delivered for target behavior. Difficult academic materials were presented using a graduated prompting sequence. No attention was delivered for target behaviors. Escape extinction + differential reinforcement of alternative behaviors (EE+DRA). The EE+DRA condition was implemented identically to the EE condition with one exception. During the EE+DRA phase, the schedule of attention was based on the descriptive data obtained during baseline ob-
servations. For Brandon and J’Marcus, attention was delivered every 30 s contingent on demonstrating appropriate behavior. For Franklin, teacher attention was delivered every 15 s. In this phase, teacher attention consisted of descriptive praise for appropriate behavior (i.e., task engagement; “Great job working.”) and/or physical attention (e.g., pats on the back). If problem behavior occurred when the interval elapsed, the interval was reset and the teacher did not deliver attention to the student. That is, problem behavior did not result in the delivery of teacher attention. Interobserver Agreement Two observers were assigned to one student: one observer served as the primary data collector and the other collected IOA data. Agreement coefficients were calculated by dividing the total number of agreements by the number of agreements plus disagreements and multiplying by 100. IOA data were collected across a minimum of 30% of sessions during all phases of the study. IOA data during the initial functional analysis were: Brandon, M = 95% (range = 85%–100%); Franklin, M = 96% (range = 92%–100%); and J’Marcus, M = 95% (range = 90%–100%). IOA data during the follow-up functional analysis were: Brandon, M = 98% (range = 90%–100%); Franklin, M = 91% (range = 85%–100%); and J’Marcus, M = 98% (range = 95%–100%). IOA data during the treatment sessions were: Brandon, M = 96% (range = 90%–100%); Franklin, M = 92% (range = 84%–100%); and J’Marcus, M = 93% (range = 90%–97%). Procedural and Treatment Integrity All teachers were trained to implement the functional analysis and treatment evaluation conditions, based on procedures outlined by Moore, Edwards et al. (2002). For all activities for which teachers were trained, a series of steps was created. Procedural integrity was calculated for each session by dividing the number of correctly implemented steps by the total number of steps for that condition. Procedural integrity
data (see Appendix A) were collected during each functional analysis session and ranged from 90% to 100% across all teachers. Procedural integrity for individual components of each treatment condition was calculated for a minimum of 50% of the treatment evaluation sessions. During the EE condition, treatment integrity averaged 98% (range = 96%–100%) for Brandon’s teacher, 80% (range = 65%–91%) for Franklin’s teacher, and 89% (range = 86%–92%) for J’Marcus’s teacher. During the EE + DRA treatment, treatment integrity averaged 98% (range = 92%–100%), 86% (range = 58%–100%), and 95% (range = 92%–100%) for Brandon’s, Franklin’s, and J’Marcus’s teachers, respectively.
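Both reliability indices reported above are simple proportions. The sketch below illustrates the interval-by-interval agreement calculation and the integrity calculation using hypothetical records, not the study’s observation files.

# Illustrative sketch of the two indices described above. Hypothetical records only.
def interobserver_agreement(primary, secondary):
    """Agreements divided by agreements plus disagreements, multiplied by 100."""
    agreements = sum(p == s for p, s in zip(primary, secondary))
    return 100.0 * agreements / len(primary)

def integrity(steps_correct, steps_total):
    """Percentage of protocol steps implemented correctly in a session."""
    return 100.0 * steps_correct / steps_total

obs1 = [True, False, False, True, False, False]
obs2 = [True, False, True, True, False, False]
print(round(interobserver_agreement(obs1, obs2), 1))  # 83.3 (5 of 6 intervals agree)
print(integrity(9, 10))                               # 90.0 (9 of 10 steps correct)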
Results Initial Functional Analysis Initial functional analysis results are depicted in Figure 1. Data supported the hypothesis that each participant’s behavior was reinforced by escape from academic demands. Brandon (top panel) exhibited problem behavior in an average of 14% (range = 12%–19%) of the observed intervals during the escape condition. The mean percentage of intervals with problem behavior during the teacher attention condition was 0.67% (range = 0%–2%). No problem behavior was observed during the control sessions. Franklin’s (middle panel) mean percentage of intervals with problem behavior during the escape condition was 11.33% (range = 8%–16%) and less than 1% during the attention sessions (range = 0%–2%). No problem behavior was observed during control sessions. The bottom panel of Figure 1 depicts the results of the initial functional analysis for J’Marcus. The mean percentage of intervals containing problem behavior during the escape condition was 32.67% (range = 29%–37%). No problem behavior was observed during control and attention conditions. Follow-up Functional Analysis
Figure 1. Percentage of intervals containing problem behavior during the initial functional analysis for Brandon (top panel), Franklin (middle panel), and J’Marcus (bottom panel).
The top panel of Figure 2 depicts the results of Brandon’s follow-up functional analysis. The mean percentage of intervals containing problem behavior during the escape condition was 10.5% (range = 7%–15%), and no problem behavior occurred during the control condition. The mean percentage of intervals with problem behavior in the ETA condition was 11.5% (range = 7%–14%). The ETA condition resulted in slightly more problem behavior than the escape condition. However, given the substantial overlap in the level of behavior between the escape and ETA conditions, the addition of teacher attention during the escape interval did not produce differential levels of responding for Brandon across the two conditions. As shown in the middle panel of Figure 2, Franklin’s mean percentage of intervals
with problem behavior during the escape condition was 11.33% (range = 7%–17%). Low levels of problem behavior occurred during the control condition (range = 0%–3%). Problem behavior in the ETA condition occurred in an average of 36.67% (range = 31%–44%) of intervals. The high level of behavior in the ETA condition and the low levels of behavior in the other two conditions suggest that Franklin’s problem behavior was reinforced by attention during the escape period. The results of the follow-up functional analysis for J’Marcus are depicted in the bottom panel of Figure 2. The mean percentage of intervals with problem behavior during the escape condition was 23.33% (range = 22%–25%), and no problem behavior occurred during the control condition. A substantial increase in problem behavior was observed during the ETA condition (M = 46.67%;
Figure 2. Percentage of intervals containing problem behavior during the modified functional analysis for Brandon (top panel), Franklin (middle panel), and J’Marcus (bottom panel).
range = 40%–58%), suggesting that J’Marcus’s problem behavior was reinforced by attention delivered during breaks from work. ETA Treatment Evaluations Brandon. The top panel of Figure 3 depicts the percentage of intervals with problem behavior and task engagement observed during Brandon’s treatment evaluation. During the first EE treatment phase, an increasing trend in Brandon’s problem behavior was observed (M = 38.4%; range = 17%–60%). Following the implementation of the EE+DRA treatment condition, an immediate decrease was observed in Brandon’s problem behavior. The mean percentage of intervals with problem behavior was 11.8% (range =
Figure 3. Percentage of intervals with problem behavior and task engagement for Brandon (top panel), Franklin (middle panel), and J’Marcus (bottom panel) during the escape-to-attention treatment evaluations with escape extinction (EE) and escape extinction + differential reinforcement of alternative behaviors (DRA).
1%–23%) of intervals. When the EE treatment was reintroduced, Brandon’s problem behavior increased slightly to a mean percentage of intervals of 12.33% (range = 12%–13%). The level of problem behavior observed in the second EE phase was relatively stable but did not reach the levels observed in the first EE phase. Finally, the reintroduction of the combined EE+DRA treatment produced stable, low levels of problem behavior (M = 4.5%; range = 2%–6%). The top panel of Figure 3 also depicts Brandon’s task engagement data during the ETA treatment evaluation sessions. During the first EE treatment phase, Brandon’s task engagement showed a decreasing trend as problem behavior increased. Brandon was engaged with task materials, on average, during 27.2% (range = 0%–58%) of the intervals. When the combined EE+DRA treatment was implemented, Brandon’s level of task engagement increased immediately and continued on an upward trend over the phase (M = 73%; range = 40%–95%). Following the reintroduction of the EE treatment, Brandon’s mean task engagement stabilized at 78.67% (range = 78%–80%) of the intervals. In the final EE+DRA treatment phase, Brandon’s task engagement increased slightly to a mean of 86.5% (range = 83%–95%). Franklin. The results for Franklin’s ETA treatment evaluations are depicted in the middle panel of Figure 3. During the first EE+DRA treatment phase, Franklin’s problem behavior decreased slightly across the phase (M = 9.17%; range = 2%–19%). When the EE treatment was implemented, a large and immediate increase in problem behavior was observed, averaging 46% (range = 38%–45%) of the intervals. When the EE+DRA treatment was reintroduced, a large and immediate decrease was observed in Franklin’s problem behavior (M = 25.5%; range = 25%–26%). Franklin engaged with academic materials during an average of 77.67% (range = 65%–95%) of intervals during the first EE+DRA treatment evaluation phase. With the implementation of the EE treatment, a
large and immediate decrease was observed in Franklin’s task engagement (M = 47.33%; range = 42%–55% of observed intervals). Finally, with the reintroduction of the EE+DRA treatment, Franklin’s task engagement increased slightly to a mean of 61% (range = 52%–70%) of the observed intervals. J’Marcus. The bottom panel of Figure 3 depicts J’Marcus’s ETA treatment evaluation results for problem behavior and task engagement. During the initial EE+DRA treatment phase, J’Marcus exhibited low levels (M = 15%; range = 10%–20%) of problem behavior with a decreasing trend across the phase. Following the implementation of the EE treatment phase, problem behavior showed an immediate increase and continued trending upward (M = 51%; range = 33%–65%). Finally, after the reimplementation of the EE+DRA treatment, an immediate and large decrease was observed in J’Marcus’s problem behavior. Low and stable levels of problem behavior were observed in the final EE+DRA treatment condition, with a mean level of 3.75% (range = 2%–5%). During the EE+DRA treatment phase, J’Marcus was appropriately engaged with academic work materials during a mean of 78.67% (range = 73%–83%) of the intervals. During the EE treatment phase, J’Marcus’s task engagement decreased sharply over the phase, with a mean of 33.33% (range = 16%–62%) of intervals. When the EE+DRA treatment was reimplemented, an immediate increase in J’Marcus’s task engagement was observed, and behavior levels were stable throughout the phase (M = 97.5%; range = 94%–100%). Treatment Acceptability Each classroom teacher completed the IRP-15 at the conclusion of each treatment phase. Overall, all teachers rated the EE+DRA treatment condition as more acceptable than the EE treatment condition. For the EE+DRA condition, the total scores were 89 for all participants. For the EE condition, the following total scores were obtained: 71, 30,
and 16 for Brandon’s, Franklin’s, and J’Marcus’s teachers, respectively. Thus, two of the three teachers rated the EE treatment as an “unacceptable” treatment, as evidenced by total scores substantially lower than the traditionally used cutoff score of 52.50. Discussion The purpose of the current investigation was to replicate and extend Mueller et al.’s (2005) case study of a novel functional analysis condition designed to assess the additive effects of attention as a reinforcer during breaks from academic tasks. That is, the present study sought to determine whether the addition of teacher attention to an escape interval (i.e., ETA) would result in elevated levels of problem behavior when compared to a standard escape condition for 3 children. The second purpose of the study was to evaluate whether a treatment package that targeted escape and attention functions would reduce target behavior better than a treatment targeting only escape. The descriptive data from the functional assessment suggested that problem behavior led to escape from task demands for all participants. Classroom observations revealed that teachers also provided attention (e.g., reprimands, requests to return to work) when the students were not engaged in work; low levels of peer attention for problem behavior were observed. The results of the initial functional analyses verified that all three participants’ problem behavior was maintained by escape from task demands. The first research question evaluated whether the addition of teacher attention during the escape interval would produce elevated levels of problem behavior when compared to an escape-only condition. The results supported Mueller et al.’s (2005) findings, as the follow-up functional analysis showed increases in problem behavior in the ETA condition for 2 of the 3 participants relative to the escape-only and play/control conditions. Responding during the escape-only and ETA functional analyses was similar for Brandon.
As hypothesized by Mueller et al. (2005), the findings that attention can reinforce problem behavior during a work task suggest that the presentation of task demands may motivate problem behavior reinforced both by escape from an aversive task and by teacher attention. In the conditions described by Iwata et al. (1982), the only establishing operation for the assessment of attention as a reinforcer was the deprivation of attention. As seen in the current analyses, attention functioned as a reinforcer in contexts other than those in which a child was being ignored. The second research question investigated treatment implications for ETA-maintained problem behavior. A treatment designed to match the escape function only (i.e., EE) and a treatment targeting escape with the addition of teacher attention (i.e., EE+DRA) were evaluated. Problem behavior decreased for all participants during the EE+DRA treatment, and a general increasing trend in problem behavior was found during the standard EE treatment. Likewise, concomitant increases in task engagement and decreases in problem behavior were observed across all participants. Thus, differential responsiveness to treatment based on behavioral function was observed, although differences for Brandon were minimal by the end of treatment. The present study adds to a growing literature base of studies investigating the effect of multiple reinforcing variables delivered together (compound reinforcers) or in sequential arrangement (chained reinforcers). Golonka et al. (2000) described the additive effects of escape to an enriched environment contingent upon appropriate behavior as treatment for problem behavior in a classroom. Although a compound or chained reinforcer was not used in their functional analysis, the treatment results indicated that contingent escape to an enriched environment was more effective than contingent escape alone. Mueller et al. (2005) used Golonka et al.’s (2000) findings as the basis for their ETA condition. However, no treatment data were described in their one-participant case study. The present results add to
Mueller et al.’s (2005) findings by replicating effects across additional participants. In addition, preliminary treatment findings showed that when a behavior is reinforced by ETA, providing attention combined with an EE intervention was more effective than escape extinction alone. Although treatment phases were truncated for some participants, the treatment comparison data provide some intriguing information for future study. Immediate changes of a substantial magnitude were observed in all participants’ problem behavior during the initial EE+DRA condition, regardless of the ordering of treatments. In subsequent iterations of the treatment comparisons, problem behavior levels for J’Marcus and Franklin continued to show differential responding, with substantially lower levels observed in the combined treatment phase. Brandon, who did not exhibit differential responding during the modified ETA functional analysis, showed less substantial differences across the treatment conditions in the latter treatment phase comparisons. It is possible that the initial differences between the EE and EE+DRA treatments could reflect an extinction burst associated with the EE treatment, rather than an actual difference between the two treatments. It is also possible that the inclusion of prompts in the EE condition may have provided students with preferred attention, thereby making the condition functionally similar to the EE+DRA condition. However, given the observed differences between the two treatment conditions, the effects of prompts were likely minimal. Additional treatment comparisons may provide further support for matching treatment programs to behavioral function, the ultimate goal of conducting a functional behavioral assessment. The simplest explanation of the current treatment results is that the EE+DRA treatment worked better than EE for Franklin and J’Marcus because EE+DRA addressed both the escape and the attention aspects of the compound function. For Brandon, although EE initially produced higher levels of problem behavior, each treatment reduced the behavior to similar levels by the end of the treatment
evaluation. The explanation of Brandon’s outcomes is also straightforward, but different from the reasons why the EE+DRA treatment worked so well with Franklin and J’Marcus. That is, the addition of a positive reinforcer into Brandon’s demand context most likely reduced the aversiveness of the task and therefore reduced the motivation to engage in escape-maintained behavior. Positive reinforcement techniques have been used successfully for over 30 years to reduce behaviors maintained by escape. Carr, Newsom, and Binkoff (1980) first demonstrated that attenuating the aversiveness of task demand situations through the delivery of highly preferred edibles during work tasks reduced escape-maintained problem behavior. Several studies followed, replicating and supporting the aversiveness-attenuating benefits of introducing preferred tangibles or food items into demand contexts (e.g., Fischer, Iwata, & Mazaleski, 1997; Mazaleski, Iwata, Vollmer, Zarcone, & Smith, 1993; Mueller, Edwards, & Trahant, 2003). Adding preferred tangibles to a demand context makes use of reinforcers not identified during the functional assessment. That is, the addition of positive reinforcers into demand contexts can reduce escape-maintained behavior by using noncontingent reinforcement with a functional or arbitrary reinforcer, or by positively reinforcing behaviors such as task engagement or compliance through differential reinforcement of alternative behavior (DRA). The results of the present study apply this well-known concept to a treatment in which escape and attention functions required support. By interjecting positive reinforcers into the demand context using DRA procedures, the aversiveness of the task was reduced through the provision of a functional reinforcer. Data also were collected on treatment acceptability, which has not been commonly reported in previous FBA research (Ervin et al., 2001). In the present study, all three teachers rated the EE+DRA treatment as more acceptable than EE alone. Surprisingly, two of the three teachers rated the EE treatment as “unacceptable”; such low ratings of treatment
acceptability are rarely reported in the published literature. Teachers specifically reported that they strongly disagreed that the EE treatment was effective for changing problem behavior, acceptable to use in the classroom, and consistent with interventions they had used in the past. These ratings may have been influenced by the fact that teachers completed the acceptability ratings after implementing both interventions and after seeing graphed data supporting the relative effectiveness of the EE+DRA treatment over the EE treatment alone, a finding consistent with analogue treatment acceptability research (e.g., Tingstrom, 1989; Tingstrom, McPhail, & Bolton, 1989; VonBrock & Elliott, 1987). Likewise, the higher ratings for the combined treatment may have been influenced by the addition of differential reinforcement for task engagement (i.e., praise), as previous researchers have reported that reinforcement-based interventions are generally rated as more acceptable than punishment-based interventions (e.g., Blampied & Kahan, 1992; Elliott, Witt, Galvin, & Peterson, 1984). Although extinction-based procedures are not technically classified as punishment in the behavioral literature, the teachers may have perceived the treatment as such and thus rated it less favorably vis-à-vis a treatment that included a richer schedule of reinforcement. A few limitations in the current study are worth noting. Data collection was discontinued early for Franklin because of a 9-day suspension from school for fighting at the end of the school year. Thus, only two data points were collected during the final treatment phase. As such, definitive information about the long-term effects of the EE+DRA treatment is lacking. Additional replications of treatment phases, counterbalancing treatment order across students with escape-only and ETA functions, extended data collection within phases, and follow-up data would strengthen the experimental design of such studies and add information about long-term treatment effects. Next, the current treatment comparisons, although successful in reducing target behavior, did not allow for a full analysis
of the mechanisms underlying the change. As described above, multiple theoretical postulates are reasonable and plausible. Future studies may seek to determine exactly which mechanisms of behavior change are at play in reducing the behaviors. For example, a study might compare the reductive effects of EE with those of noncontingent reinforcement with an arbitrary reinforcer, such as preferred tangible items, to discover whether the addition of attention in the current study reduced target behavior because it was functionally related to the target behavior or whether it simply reduced the aversiveness of the tasks presented. Notwithstanding the limitations, the present study provides additional support for translating FBA research into practice in the schools. One criticism of previous FBA research is that individuals with specialized training (e.g., researchers, behavior specialists) conducted the functional analysis conditions. In the current investigation, the students’ teachers implemented all assessment and treatment conditions, albeit with supervision and feedback. Another strength is that the functional analysis occurred during standard classroom activities rather than in an analogue setting. In addition, all participants were general education students, unlike the majority of FBA studies, which have been conducted with participants with disabilities (Ervin et al., 2001; Hanley et al., 2003; Hoff et al., 2005). The present study is in line with previous research documenting the importance of identifying idiosyncratic variables during FBAs in an effort to develop the most effective treatment plan (Hoff et al., 2005). The results also add to a growing literature of school-based studies investigating the effect of idiosyncratic variables on problem behavior. The findings provide a heuristic and experimental example for incorporating descriptive assessment information into the development of functional analysis conditions in the school setting. Similar to previous studies, the results show that descriptive assessment information was vital to the precise identification of behavioral function (Galiatsatos & Graff, 2003; Lalli & Casey, 1996; Mueller et al., 2005; Richman &
Hagopian, 1999; Tiger, Hanley, & Bessette, 2006). The ETA condition was designed specifically to assess the relative contributions of escape from academic demands and the additive effects of teacher attention following problem behavior, a consequence frequently observed in classroom settings (Kurtz et al., 2003; McKerchar & Thompson, 2004). The results suggest the ETA condition may have utility for FBAs and treatment planning in the school setting. In future research, researchers may wish to investigate other sources of reinforcement that may occur during the escape interval (i.e., peer attention, access to preferred activities/items). Researchers may also wish to investigate additional ETA-based treatment options, including the addition of teacher attention to other treatments for escape-maintained behavior.
References
Blampied, N. M., & Kahan, E. (1992). Acceptability of alternative punishments: A community survey. Behavior Modification, 16, 400–413.
Broussard, C. D., & Northup, J. (1995). An approach to functional assessment and analysis of disruptive behavior in regular education classrooms. School Psychology Quarterly, 10, 154–164.
Broussard, C. D., & Northup, J. (1997). The use of functional analysis to develop peer interventions for disruptive classroom behavior. School Psychology Quarterly, 12, 65–76.
Carr, E. G., Newsom, C. D., & Binkoff, J. A. (1980). Escape as a factor in the aggressive behavior of two retarded children. Journal of Applied Behavior Analysis, 13, 101–117.
Doggett, R. A., Edwards, R. P., Moore, J. W., Tingstrom, D. H., & Wilczynski, S. M. (2001). An approach to functional assessment in general education classroom settings. School Psychology Review, 30, 313–328.
Dufrene, B. A., Doggett, R. A., Henington, C., & Watson, T. S. (2007). Functional assessment and intervention for disruptive classroom behaviors in preschool and Head Start classrooms. Journal of Behavioral Education, 16, 368–388.
Edwards, R. P. (2002). A tutorial for using the functional assessment informant record for teachers (FAIR-T). Proven Practice, 4, 31–33.
Elliott, S. N., Witt, J. C., Galvin, G. A., & Peterson, R. (1984). Acceptability of positive and reductive behavioral interventions: Factors that influence teachers’ decisions. Journal of School Psychology, 22, 353–360.
Ellis, J., & Magee, S. (2004). Modifications to basic functional analysis procedures in school settings: A selective review. Behavioral Interventions, 19, 205–228.
Ervin, R. A., Radford, P. M., Bertsch, K., Piper, A. L., Ehrhardt, K. E., & Poling, A. (2001). A descriptive analysis and critique of the empirical literature on school-based functional assessment. School Psychology Review, 30, 193–210.
Fischer, S. M., Iwata, B. A., & Mazaleski, J. A. (1997). Noncontingent delivery of arbitrary reinforcers as a treatment of self-injurious behavior. Journal of Applied Behavior Analysis, 30, 239–249.
Galiatsatos, G. T., & Graff, R. B. (2003). Combining descriptive and functional analyses to assess and treat screaming. Behavioral Interventions, 18, 123–128.
Golonka, Z., Wacker, D., Berg, W., Derby, K. M., Harding, J., & Peck, S. (2000). Effects of escape to alone versus escape to enriched environments on adaptive and aberrant behavior. Journal of Applied Behavior Analysis, 33, 243–246.
Gresham, F. M., McIntyre, L. L., Olson-Tinker, H., Dolstra, L., McLaughlin, V., & Van, M. (2004). Relevance of functional behavioral assessment research for school-based intervention and positive behavioral support. Research in Developmental Disabilities, 25, 19–37.
Gunter, P. L., Jack, S. L., Shores, R. E., Carrell, D. E., & Flowers, J. (1993). Lag sequential analysis as a tool for functional analysis of student disruptive behavior in classrooms. Journal of Emotional and Behavioral Disorders, 1, 138–148.
Hanley, G. P., Iwata, B. A., & McCord, B. E. (2003). Functional analysis of problem behavior: A review. Journal of Applied Behavior Analysis, 36, 147–185.
Hoff, K. E., Ervin, R. A., & Friman, P. C. (2005). Refining functional behavior assessment: Analyzing the separate and combined effects of hypothesized controlling variables during ongoing classroom routines. School Psychology Review, 34, 45–57.
Iwata, B. A., Dorsey, M. F., Slifer, K. J., Bauman, K. E., & Richman, G. S. (1982). Toward a functional analysis of self-injury. Analysis and Intervention in Developmental Disabilities, 2, 3–20.
Iwata, B. A., Wallace, M. D., Kahng, S., Lindberg, J. S., Roscoe, E. M., Conners, J., et al. (2000). Skill acquisition in the implementation of functional analysis methodology. Journal of Applied Behavior Analysis, 33, 181–194.
Kurtz, P. F., Chin, M. D., Huete, J. M., Tarbox, R. S., O’Conner, J. T., Paclawskyj, T., et al. (2003). Functional analysis and treatment of self-injurious behavior in young children: A summary of 30 cases. Journal of Applied Behavior Analysis, 36, 205–219.
Lalli, J. S., & Casey, S. D. (1996). Treatment of multiply controlled problem behavior. Journal of Applied Behavior Analysis, 29, 391–396.
Mann, A. J., & Mueller, M. M. (2009). False positive functional analysis results as a contributor of treatment failure during functional communication training. Education and Treatment of Children, 32, 121–149.
Martens, B. K., Witt, J. C., Elliott, S. N., & Darveaux, D. X. (1985). Teacher judgments concerning the acceptability of school-based interventions. Professional Psychology: Research and Practice, 16, 191–198.
Mazaleski, J. A., Iwata, B. A., Vollmer, T. R., Zarcone, J. R., & Smith, R. G. (1993). Analysis of the reinforcement and extinction components in DRO contingencies with self-injury. Journal of Applied Behavior Analysis, 26, 143–156.
McKerchar, P. M., & Thompson, R. H. (2004). A descriptive analysis of potential reinforcement contingencies in the preschool classroom. Journal of Applied Behavior Analysis, 37, 431–443.
Moore, J. W., Edwards, R. P., Sterling-Turner, H. E., Riley, J., DuBard, M., & McGeorge, A. (2002). Teacher acquisition of functional analysis methodology. Journal of Applied Behavior Analysis, 35, 73–77.
Moore, J. W., Mueller, M. M., Dubard, M., Roberts, D. S., & Sterling-Turner, H. E. (2002). The influence of therapist attention on self-injury during a tangible condition. Journal of Applied Behavior Analysis, 35, 283–286.
Mueller, M. M., Edwards, R. P., & Trahant, D. (2003). Translating multiple assessment techniques into an intervention selection model for classrooms. Journal of Applied Behavior Analysis, 36, 563–573.
Mueller, M. M., Nkosi, A., & Hine, J. F. (in press). Functional analysis in public school settings: A review of 90 functional analyses. Journal of Applied Behavior Analysis.
Mueller, M. M., Sterling-Turner, H. E., & Moore, J. M. (2005). Towards developing a classroom-based functional analysis condition to assess escape-to-attention as a variable maintaining problem behavior. School Psychology Review, 34, 425–431.
Repp, A. C., Felce, D., & Barton, L. E. (1988). Basing the treatment of stereotypic and self-injurious behaviors on hypotheses of their causes. Journal of Applied Behavior Analysis, 21, 281–289.
Richman, D. M., & Hagopian, L. P. (1999). On the effects of “quality” of attention in the functional analysis of destructive behavior. Research in Developmental Disabilities, 20, 51–62.
Tiger, J. H., Hanley, G. P., & Bessette, K. K. (2006). Incorporating descriptive assessment results into the design of a functional analysis: Case example involving a preschooler’s handmouthing. Education & Treatment of Children, 29, 107–124.
Tingstrom, D. H. (1990). Acceptability of time-out: The influence of problem behavior severity, interventionist, and reported effectiveness. Journal of School Psychology, 28, 165–169.
Tingstrom, D. H., McPhail, R. L., & Bolton, A. B. (1989). Acceptability of alternative school-based interventions: The influence of reported effectiveness and age of target child. The Journal of Psychology, 123, 133–140.
VanDerHeyden, A. M., Witt, J. C., & Gatti, S. (2001). Descriptive assessment method to reduce overall disruptive behavior in a preschool classroom. School Psychology Review, 30(4), 548–567.
Von Brock, M. B., & Elliott, S. N. (1987). Influence of treatment effectiveness information on the acceptability of classroom interventions. Journal of School Psychology, 25, 131–144.
Date Received: August 27, 2009
Date Accepted: January 15, 2011
Action Editor: Tanya Eckert
Jana Sarno, MA, BCBA, is a senior program supervisor with Coyne and Associates Education Corporation in San Diego, California, and a doctoral candidate at The University of Southern Mississippi. Her areas of interest include early intervention with children diagnosed with autism, verbal behavior, and functional analysis. Heather E. Sterling, PhD, is an associate professor of psychology and the director of training for the school psychology program at The University of Southern Mississippi. Her research areas include functional assessment and analysis, single-case research design, school-based consultation, and intellectual and developmental disabilities. Michael Mueller, PhD, BCBA-D, received his doctorate in school psychology from The University of Southern Mississippi. He is currently the director of behavioral services for Southern Behavioral Group in Atlanta, Georgia. Brad A. Dufrene, PhD, is an assistant professor and director of the School Psychology Service Center at The University of Southern Mississippi. His research interests include functional assessment and school-based consultation. Daniel H. Tingstrom, PhD, is a professor of psychology and is affiliated with the school psychology program in the Department of Psychology at The University of Southern Mississippi. His research interests include applied behavior analysis, and the implementation and evaluation of behavioral and academic interventions. D. Joe Olmi, PhD, is a professor of psychology and departmental chairperson at The University of Southern Mississippi. His research interests include positive behavioral interventions and supports, compliance training, and school-based consultation.
APPENDIX A PROCEDURAL INTEGRITY FOR ETA CONDITION
Student:
Session:
Teacher:
Date:
Observer:
Condition: ETA
This form is used to assess the level of procedural integrity for each teacher-implemented functional analysis ETA condition. Record whether the teacher behaviors were implemented as planned (Yes) or not implemented as planned (No) during each FA ETA condition.
YES  NO  N/A
1. Seats student at his/her desk or table
2. Teacher places academic materials on the student’s desk
3. Teacher provides verbal instructions to student to complete the academic work
4. Teacher waits 5 seconds for compliance
   a. The student complies
      i. Teacher provides descriptive praise
      ii. Teacher moves to the next demand
   b. The student does not comply
      i. Teacher restates the instructions with verbal and gestural prompts
      ii. Teacher waits 5 seconds for compliance
         A. Student complies
            1. Teacher provides descriptive praise
            2. Teacher moves to the next demand
         B. Student does not comply
            1. Teacher restates the instructions and provides hand-over-hand guidance
5. Teacher does not respond to any other problem behavior
6. Contingent on problem behavior
   a. Teacher removes task demand for 30 s
   b. Teacher provides attention during escape period
Repeat steps 3–6 for each demand sequence
School Psychology Review, 2011, Volume 40, No. 1, pp. 72–84
Treatment Integrity of Interventions With Children in the School Psychology Literature from 1995 to 2008 Lisa M. Hagermoser Sanetti, Katie L. Gritter, and Lisa M. Dobey University of Connecticut Abstract. Increased accountability in education has resulted in a focus on implementing interventions with strong empirical support. Both student outcome and treatment integrity data are needed to draw valid conclusions about intervention effectiveness. Reviews of the literature in other fields (e.g., applied behavior analysis, prevention science) suggest that most researchers fail to report treatment integrity data. The purpose of this study was to review the treatment integrity data reported in all experimental intervention studies published in four school psychology journals between 1995 and 2008. Results indicate that a majority of published studies do not include a definition of the independent variable and half do not include quantitative treatment integrity data.
Preparation of this article was supported by a grant provided by the University of Connecticut Research Foundation. Opinions expressed herein do not necessarily reflect the position of the University of Connecticut, and such endorsements should not be inferred. Correspondence regarding this article should be addressed to Lisa M. Hagermoser Sanetti, University of Connecticut, Department of Educational Psychology, U-2064, Storrs, CT 06269-2064; E-mail: [email protected]
Copyright 2011 by the National Association of School Psychologists, ISSN 0279-6015
For decades, treatment integrity has received minimal conceptual or empirical attention in education research. However, treatment integrity has become a more frequent topic of discussion during the past 10 years as traditional service-delivery approaches were challenged and evidence-based practice (EBP) and response to intervention (RTI) became more prevalent in education research (Kratochwill, Albers, & Steele-Shernoff, 2004). The enactment of the No Child Left Behind Act (2001) mandated that educators implement research-based instruction, which made EBP central to education research and practice. EBP encompasses (a) conducting high-quality intervention evaluation research, (b) selecting interventions proven effective in high-quality evaluation research, (c) implementing the selected intervention as intended by developers, and (d) evaluating the local effectiveness of the intervention (Kratochwill et al., 2004), which are also components of any RTI model (National Association of State Directors of Special Education, 2008). As the EBP and RTI movements gained momentum, it became apparent that treatment integrity was integral to both (e.g., Kratochwill et al., 2004; Noell & Gansle, 2006). Treatment integrity data are necessary for EBP because they allow (a) researchers to draw valid conclusions about intervention effectiveness in research trials, (b) consumers to select an intervention and understand whether it can be adapted to their setting, (c) practitioners to ensure the intervention is implemented as intended in an applied setting, and (d) teams
to evaluate the effectiveness of the intervention in their setting. Likewise, ensuring high levels of treatment integrity is at the crux of RTI (Brown & Rahn-Blakeslee, 2009) because decisions about intervention intensity (potentially including special education placement) are based on student response to evidence-based interventions. The EBP and RTI movements motivated scholars (e.g., Brown & Rahn-Blakeslee, 2009; Fixsen, Naoom, Blasé, Friedman, & Wallace, 2005; Jones, Clarke, & Power, 2008; Noell, 2008; Noell & Gansle, 2006; Power et al., 2005) to attend to the role of treatment integrity in contemporary education research and practice. Early definitions of treatment integrity (e.g., Gresham, 1989; Moncher & Prinz, 1991; Yeaton & Sechrest, 1981) focused primarily on the degree to which an intervention was implemented as intended. More recent conceptualizations (Dane & Schneider, 1998; Fixsen et al., 2005; Jones et al., 2008; Noell, 2008; Power et al., 2005; Waltz, Addis, Koerner, & Jacobson, 1993) suggest that treatment integrity is a multidimensional construct. Although approximately 20 different dimensions (e.g., adherence, quality of implementation, participant exposure, participant responsiveness) have been proposed across multiple models, four dimensions are common across all: (a) adherence (degree to which an intervention was implemented as intended), (b) quality of implementation (how well the intervention is implemented), (c) exposure (the duration and/or frequency of the intervention), and (d) program differentiation (the difference between the intervention and another intervention or practice as usual; Sanetti & Kratochwill, 2009). Although there is emerging empirical support for some of these dimensions (e.g., Dusenbury, Brannigan, Falco, & Hansen, 2003; Dusenbury, Brannigan, Hansen, Walsh, & Falco, 2005; Gullan, Feinberg, Freedman, Jawad, & Leff, 2009; Hirschstein, Edstrom, Frey, Snell, & MacKenzie, 2007), adherence remains the sine qua non of treatment integrity; without adherence, there may be little reason to consider issues such as quality or exposure (Schulte, Easton, & Parker, 2009).
Despite considerable evolution in how we conceptualize treatment integrity, the methodological importance and role of the construct have not changed. The demonstration of a functional relationship between the implementation of an intervention (i.e., independent variable) and a change in student outcomes (i.e., dependent variable) is, and has always been, at the core of intervention effectiveness research. To draw valid conclusions regarding the presence of such a relationship, it is necessary to define and assess both intervention implementation and student outcomes (Johnston & Pennypacker, 1993). Making incorrect assumptions about the presence of a functional relationship between an intervention and student outcomes is a possible result of failing to adequately define and assess the intervention. Thus, without an assessment of an operationally defined intervention, the internal validity of an intervention outcome study is threatened as a change in student outcomes (or lack thereof) is open to a wide variety of explanations (Shadish, Cook, & Campbell, 2002). Despite the methodological importance of treatment integrity, its assessment has been limited in the treatment outcome literature (Noell, 2008). This is evidenced by reviews of the treatment outcome literature in areas such as applied behavior analysis (Gresham, Gansle, & Noell, 1993; McIntyre, Gresham, DiGennaro, & Reed, 2007; Peterson, Homer, & Wonderlich, 1982), learning disabilities (Gresham, MacMillan, Beebe-Frankenberger, & Bocian, 2000), anger management (Gansle, 2005), autism (Wheeler, Baggett, Fox, & Blevins, 2006), alternative communication (Snell, Chen, & Hoover, 2006), prevention programming (Dane & Schneider, 1998), and psychotherapy (Perepletchikova, Treat, & Kazdin, 2007). Results across three reviews (Gresham et al., 1993; McIntyre et al., 2007; Peterson et al., 1982) of the applied behavior analysis literature indicated that researchers were more likely to operationally define the intervention than assess its implementation. Further, on average, only 20.6% of the researchers, across the reviews, assessed and reported any
quantitative treatment integrity data. Similar results were found in the learning disabilities and autism literatures; only 18.5% and 18%, respectively, of the researchers in these fields assessed and reported any quantitative treatment integrity data (Gresham et al., 2000; Wheeler et al., 2006). Gansle’s (2005) review of the anger management literature indicated that only 10% of treatment outcome studies assessed and reported any quantitative treatment integrity data. Slightly more promising results were found in reviews of the alternative communication (Snell et al., 2006) and prevention science (Dane & Schneider, 1998) literatures, in which 30% and 31%, respectively, of researchers assessed and provided any quantitative treatment integrity data. Perepletchikova and colleagues (2007) found that only 3.5% of psychotherapy outcome studies included adequate treatment integrity assessment. Although the studies reviewed represent only a sample of all the treatment outcome studies conducted in education and related fields, the data presented above provide a clear and consistent message that treatment integrity data are largely absent from the treatment outcome literature base in these fields. The widespread failure of researchers to report quantitative treatment integrity data, combined with the methodological importance of treatment integrity, suggests that researchers in many fields were at risk for drawing invalid conclusions about intervention effectiveness. This is particularly concerning for two reasons. First, the level of treatment integrity affects outcomes (Durlak & DuPre, 2008). Thus, without treatment integrity data, it is impossible to know whether positive outcomes were obtained under conditions of perfect, high, moderate, or variable levels of treatment integrity, which could have significant implications for future research as well as transportability of interventions to applied settings. Second, educators are facing increasing demands for high levels of accountability (e.g., EBP; No Child Left Behind, 2001; Individuals with Disabilities Education Improvement Act, 2004), which mandate that educators implement research-based interventions. Before educators are able to fulfill this
mandate, educational researchers must conduct and disseminate high-quality intervention evaluation research, which requires treatment integrity assessment (e.g., Task Force on Evidence-Based Interventions, 2003). Prevention and intervention evaluation studies published in the school psychology literature have not been comprehensively reviewed with regard to treatment integrity assessment. Thus, the primary purpose of the current study was to systematically code the treatment integrity data reported in all school- and home-based studies published between 1995 and 2008 in four school psychology journals (i.e., Journal of School Psychology [JSP], Psychology in the Schools [PITS], School Psychology Quarterly [SPQ], School Psychology Review [SPR]). These years were chosen because reviews in related fields indicate a low and variable level of attention to treatment integrity prior to 1995 and an increasing trend in reporting treatment integrity data thereafter (e.g., McIntyre et al., 2007). Each of these journals publishes at least four issues per year and specifically targets school psychology researchers and practitioners. In schools, a variety of individuals implement interventions, across a variety of settings, to remediate a variety of concerns experienced by youth of all ages. Consequently, a secondary purpose of this study was to summarize and analyze characteristics of treatment outcome research in school psychology (e.g., by year, treatment agent, location, dependent variable, student age) and treatment integrity trends across these characteristics to better understand the treatment outcome research being conducted within the field and any related trends in treatment integrity assessment. More specifically, the current study addressed the following four research questions: 1. How prevalent are operational definitions of the independent variable and treatment integrity assessment data in the treatment outcome studies published in the school psychology literature between 1995 and 2008? 2. How likely were implementation inaccuracies for the interventions evaluated
in the treatment outcome studies published in the school psychology literature between 1995 and 2008? 3. What are the general characteristics (e.g., student age, intervention location, treatment agents, and dependent variables) of the treatment outcome studies published in the school psychology literature between 1995 and 2008? 4. Did the inclusion of an operational definition of the independent variable or treatment integrity data vary by study characteristics (e.g., student age, intervention location, treatment agents, and dependent variables)? Method Selection of the Research A total of 2,023 articles from four school psychology journals (i.e., PITS, JSP, SPQ, SPR) were reviewed by the second author to determine potential for inclusion. Articles for review were located through a serial search of the table of contents of PITS, JSP, SPQ, and SPR between 1995 and 2008. To qualify for inclusion, a study had to fulfill four criteria. First, the study had to have been published between the years of 1995 and 2008. Second, all of the participants in the study had to be younger than 19 years of age, as we were evaluating only interventions implemented with youth. Third, the study had to be experimental (including quasi-experimental), in that the design allowed for conclusions regarding a causal relationship between implementation of an independent variable (i.e., intervention) and change in a dependent variable (i.e., student outcome). A study was considered experimental if it had either (a) a single-subject design, including a baseline phase (except in cases where a baseline was implied or unnecessary) or (b) a recognized between- or within-subject group design (e.g., randomized control designs, matched control group designs, factorial designs). Studies were excluded from the review if they were considered nonexperimental, assessment oriented, or brief. A study was judged as nonexperimental if it employed a (a)
demonstration (i.e., with no baseline), (b) case study (i.e., AB design), (c) single-subject without a baseline phase (when one is not implied and is necessary), or (d) nonexperimental (e.g., correlational study) design. In addition, because we were evaluating only intervention articles, a study was excluded if only an assessment was performed (e.g., functional assessment, preference assessment). If an article contained an initial functional assessment followed by an experimental study of an intervention based on the functional assessment, the study was included. Finally, brief reports (i.e., articles that were three or fewer pages in length) were excluded. As discussed in both McIntyre et al. (2007) and Peterson et al. (1982), articles of three or fewer pages seldom provide sufficient detail with regard to study procedures and analyses. Thus, we chose to exclude brief reports from the analysis to avoid potentially underreporting intervention definition and treatment integrity monitoring. Following the above criteria, 210 articles were included for further review. Because some of the articles contained more than one study, a total of 223 studies met the inclusionary criteria. A full list of the articles chosen for review is available from the first author. Selection Reliability To check the reliability of the article selection process, the first author conducted independent ratings on 101 (5%) of the articles considered for inclusion in the study. Interrater agreement was calculated as the number of instances of agreement divided by the agreements plus disagreements, multiplied by 100%. These comparisons yielded 100% agreement. To further ensure appropriate selection of articles, the first and second authors met weekly to discuss and resolve any questions about potential articles for inclusion. Coding In this review, we focused on (a) how the independent variables were operationally defined; (b) to what extent they were monitored and measured; (c) the level of risk for
treatment implementation issues in each study; and (d) trends in independent variable definition and treatment integrity assessment across publication year, treatment agent, location of the intervention, design, dependent variable, and student age. Coding procedures for each of these variables are described below. Studies were also coded based on year of publication (i.e., 1995–2008). Operational definition of the intervention. For each study, we used criteria based on Baer, Wolf, and Risley (1968) and used by Gresham et al. (1993) and McIntyre et al. (2007). Raters considered the question: “Has the treatment been operationally defined?” In answering this question, raters determined whether they could replicate the intervention with the information provided. Studies were coded as “yes,” “no,” or “reference.” Studies coded “reference” did not contain enough information to replicate the intervention, but included a reference to a more extensive source (e.g., book chapter, manual, technical report, other manuscript). Treatment integrity assessment. Studies were coded based on whether they included treatment integrity data. Raters answered the question: “Did the study include a statement regarding treatment integrity on at least one independent variable?” Studies were coded using criteria based on Gresham et al. (1993) and McIntyre et al. (2007). According to these criteria, studies were coded as “yes” if the authors had specified a method of measurement (e.g., observer present, videotaping sessions, using a component checklist) and reported quantitative data. Studies were coded as “monitored” if they mentioned that treatment integrity was assessed or monitored but failed to provide quantitative data. Studies were coded as “no” if there was no mention of treatment integrity checks and no treatment integrity data were reported. Risk for treatment implementation issues. Treatments were coded for having no risk, low risk, or high risk for treatment
implementation issues using the criteria defined by Peterson et al. (1982) and used by McIntyre et al. (2007). Treatments were coded as “no risk” when “treatment integrity assessment” was coded as “yes” or “monitored.” Treatments were coded as “low risk” when authors neither provided quantitative treatment integrity data nor mentioned treatment integrity assessment but the intervention was nonetheless considered likely to have been implemented accurately. Examples of treatments in this category included those that were computer mediated, employed permanent products (e.g., posting classroom rules), or were applied continuously (e.g., presence of a 1-to-1 aide). Treatments were coded as “high risk” when authors neither provided quantitative treatment integrity data nor mentioned treatment integrity assessment, and the intervention was not considered to be at low risk for inaccurate implementation. Treatment agent. The person implementing the intervention was categorized into one of the following groups: teacher, professional, paraprofessional, parent, sibling, peer tutor, researcher, self, or other. If more than one person was responsible for implementing the intervention, each treatment agent was coded. Definitions for these categories were based on those used by McIntyre et al. (2007). Teachers included general education teachers, special education teachers, and early childhood educators. Professionals included psychologists, speech and language pathologists, and school principals. Paraprofessionals included teacher’s aides and members of the support staff. Peer tutors were students who were not the focus of the intervention. “Researcher” was recorded if the treatment agent was collecting data for the purpose of the research study. “Self” was recorded for treatments that were self-administered or self-mediated. If the treatment agent did not fit into one of these categories, “other” was recorded. Other treatment agents included, for example, volunteers and interns. Each category was coded when the intervention was implemented by individuals in more than one category.
Location of intervention. Studies were coded according to where the intervention took place: regular public school, charter school, private school (nonparochial), parochial school, residential school, hospital school, home, not specified, or, for locations that did not fit any of these categories, other. Other locations included, for example, Head Start and university-based lab schools. Each location was coded when the intervention was implemented in more than one location. Design. Studies were coded with regard to experimental design. Those that used reversal, multiple baseline, alternating treatments, or changing criterion designs were classified as single-case designs, and those that used randomized control designs, matched control group designs, or factorial designs were coded as group designs. Dependent variable. Each dependent variable was coded as belonging to one of eight categories: academic behaviors, social behaviors, disruptive behaviors, stereotypic/ destructive behaviors, daily living skills, psychological well-being, academic-related behavior, or other. Examples of outcomes included in each category are provided below. Academic behaviors included those that dealt with the acquisition or performance of academic skills such as reading, math, spelling, and so forth. Social behaviors included social skills, play behaviors, sharing behaviors, and so on. Disruptive behaviors included behaviors that led to the disruption of social environments such as a student being out of his or her seat or talking without permission. Stereotypic/destructive behaviors were those that interfered with instruction or hurt the student (e.g., self-injurious behaviors), others, or property. Daily living skills included behaviors related to safety, independent functioning, mobility, eating, and so forth. The psychological well-being category included self-esteem, self-concept, or psychological adjustment. Academic-related behaviors were defined as behaviors that facilitated improved academic outcomes, such as engagement or opportunities to respond. Other behaviors included, for
example, knowledge change, health-related functioning (e.g., asthma), absences, and treatment acceptability. All relevant dependent variables were coded for each study. Student age. The age of study participant(s) was coded using the following categories: preschool (0–4 years), elementary (5–12 years), high school (13–18 years), or any combination of these age categories. When study participants spanned age ranges, all relevant age ranges were coded. Rater Training and Interrater Agreement The first author trained two advanced graduate students in school psychology in the use of the coding criteria across one 1-hr and two 2-hr sessions. In the first session, the first author (a) gave each rater a coding handbook that included the definition of each category and code; (b) reviewed the handbook; and (c) discussed the categories, codes, and definitions with the raters. In addition, the first author gave the raters four experimental treatment outcome studies to code; all training articles were published in school psychology journals prior to 1995. Coding results, problems with coding, and ambiguous codes were discussed during the second session and the coding scheme was modified to increase clarity. The first author gave the raters four additional experimental treatment outcome articles to code using the modified coding scheme. In the third session, the results of applying the modified coding scheme were discussed. Disagreements in ratings were discussed and all raters reached 100% agreement via consensus. Each rater was assigned half of the articles in the database. In addition, a random sample of 10% of studies meeting the inclusion criteria was selected for interrater agreement coding. Studies were coded on nine categories: (a) journal (4 categories), (b) year of publication (14 categories), (c) student age (3 categories), (d) location of the intervention (9 categories), (e) treatment agent (9 categories), (f) operational definition of the independent variable (3 categories), (g) treatment integrity assessment (3
categories), (h) risk for treatment implementation issues (3 categories), and (i) dependent variable (8 categories). There was perfect agreement for seven categories: journal, year of publication, student age, location of intervention, operational definition of the independent variable, treatment integrity assessment, and dependent variable. Interrater reliability for the remaining categories in the coding scheme was estimated using two statistics: percentage agreement and Cohen’s kappa (Cohen, 1960). Percentage agreement was calculated by the following formula: number of agreements/(number of agreements + number of disagreements) × 100. Percentage agreement and kappa were 99.3% and .95, 96.8% and .97, and 96.8% and .97 for all codes, treatment agent, and risk for treatment implementation issues, respectively.
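Cohen's kappa adjusts the raw percentage agreement for the agreement expected by chance given each rater's marginal distribution of codes. A minimal sketch of the computation follows; it is not the authors' analysis code, and the rater vectors are hypothetical.

```python
from collections import Counter


def cohens_kappa(rater1, rater2):
    """Kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and
    p_e is chance agreement derived from each rater's marginal code frequencies."""
    n = len(rater1)
    p_o = sum(a == b for a, b in zip(rater1, rater2)) / n
    freq1, freq2 = Counter(rater1), Counter(rater2)
    p_e = sum(freq1[c] * freq2[c] for c in set(rater1) | set(rater2)) / (n * n)
    return (p_o - p_e) / (1 - p_e)


# Hypothetical risk-for-implementation-issues codes assigned by two raters to ten studies
rater_a = ["no", "no", "high", "low", "no", "no", "high", "no", "low", "no"]
rater_b = ["no", "no", "high", "low", "no", "no", "high", "no", "no", "no"]
print(round(cohens_kappa(rater_a, rater_b), 2))  # 0.81 (observed agreement = 0.90)
```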
blocks (i.e., 1995–1999, 2000–2003, and 2004–2008); (b) treatment agents were collapsed into four categories: teachers, researchers, multiple treatment agents, or others (i.e., professionals, paraprofessionals, parents, peers, self, and other combined); (c) location of intervention was collapsed to public school or other location(s); and (d) dependent variables were collapsed into four categories: academic, disruptive, other, and multiple. Student age categories did not need to be collapsed. Results General Characteristics of Treatment Outcome Studies The results related to the general characteristics of the reviewed studies are presented first to provide a context for the treatment integrity data. A vast majority of the coded studies targeted elementary-age students (n = 186) as opposed to high-school-age (n = 44) or preschool-age (n = 20) students. Likewise, a vast majority of coded studies were conducted in public schools (n = 170). Additional locations of intervention studies were other (n = 27), home (n = 13), private school (n = 10), residential school (n = 7), hospital school (n = 2), and charter school (n = 2). No studies were conducted in parochial schools and the location of intervention was not specified for five studies. Overall, teachers (n = 116) and researchers (n = 113) most frequently served as intervention agents. However, target students themselves (n = 35), professionals (n = 27), parents (n = 26), paraprofessionals (n = 21), others (n = 20), and peers (n = 13) also served as treatment agents. Siblings did not serve as treatment agents in any of the reviewed studies. Academic (n = 106) and disruptive behaviors (n = 72) were the most common dependent variables in the coded studies, followed by academic-related behaviors (n = 49), other (n = 43), social behaviors (n = 42), psychological well-being (n = 17), stereotypic/destructive behaviors (n = 5), and daily living skills (n = 5). Overall, the number of coded studies that employed single-case (n = 120)
and group (n = 103) experimental designs was roughly equal. Treatment Integrity Data in Treatment Outcome Studies Of the 223 studies coded, 71 (31.8%) included an operational definition of the independent variable, 86 (38.6%) provided readers with a reference to another source to enable them to access more information about the intervention, and 65 (29.1%) did not define the independent variable. Half (n = 112; 50.2%) of the coded studies reported quantitative treatment integrity data as a percentage of implementation. Twenty-nine studies (13%) included a statement about monitoring treatment integrity but did not provide quantitative data. Eighty-three studies (37.2%) neither included quantitative treatment integrity data nor reported monitoring treatment integrity. Of the researchers who did include quantitative treatment integrity data, 73.8% reported the number of intervention sessions for which treatment integrity data were collected. On average, treatment integrity data were collected for 61.2% of intervention sessions and the average overall level of treatment integrity reported was 93.6%. A majority of studies (n = 139; 62.3%) were deemed to be at no risk for treatment implementation issues, as either the authors provided quantitative treatment integrity data or reported that treatment integrity was monitored. Just under one third (n = 65; 29.1%) of studies were deemed to be at high risk for treatment implementation issues. Eighteen studies (8.1%) did not provide treatment integrity data, but were considered to be at low risk for treatment implementation issues because of the nature of the intervention (e.g., computer mediated).
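To make the interrater agreement statistics described in the Method section concrete, the short Python sketch below computes percentage agreement and Cohen's kappa (Cohen, 1960) for two raters. It is offered only as an illustration; the rating vectors are invented, not the study's data.

from collections import Counter

def percentage_agreement(r1, r2):
    """[agreements / (agreements + disagreements)] * 100."""
    agreements = sum(a == b for a, b in zip(r1, r2))
    return 100.0 * agreements / len(r1)

def cohens_kappa(r1, r2):
    """Cohen's (1960) kappa: chance-corrected agreement for two raters."""
    n = len(r1)
    observed = sum(a == b for a, b in zip(r1, r2)) / n
    c1, c2 = Counter(r1), Counter(r2)
    # Expected agreement if the two raters assigned codes independently
    expected = sum(c1[k] * c2[k] for k in set(c1) | set(c2)) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical treatment integrity codes assigned by two raters to ten studies
rater_1 = ["adherence", "none", "adherence", "monitored", "none",
           "adherence", "none", "monitored", "adherence", "none"]
rater_2 = ["adherence", "none", "adherence", "monitored", "adherence",
           "adherence", "none", "monitored", "adherence", "none"]

print(percentage_agreement(rater_1, rater_2))          # 90.0
print(round(cohens_kappa(rater_1, rater_2), 2))        # approximately 0.84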
Treatment Integrity Assessment by Study Characteristics As we were interested in visually analyzing any trends in intervention definition and treatment integrity assessment across the years studied, both visual analysis and statistical analyses for these data are reported first. Second, we report the results of the statistical analyses related to intervention definition, treatment integrity assessment, and risk for treatment implementation issues across treatment agents, locations of intervention, dependent variables, and student ages. Unless otherwise stated, the .05 level of statistical significance was used to determine significance in all analyses. Year of publication. Across the 14-year period, the percentage of coded studies including an operational definition of the intervention or a reference to a definition of the intervention was variable. On average, across journals, approximately one third of the treatment outcome studies published each year included an operational definition of the independent variable (M = 34%, SD = 7.62) or reference to such a definition (M = 37.51%, SD = 10.54). Overall, across journals, the percentage of studies including treatment integrity assessment data demonstrated an increasing trend over the 14-year period (see Figure 1). On average, researchers provided quantitative treatment integrity data in 48.12% (SD = 15.54) of experimental treatment outcome studies published each year. Such data were included in, on average, 39.71% (SD = 16.05) of articles during the first 5 years (1995–1999) and 59.06% (SD = 14.95) of articles during the last 5 years (2004–2008) studied.
Figure 1. Percentage of studies including quantitative treatment integrity assessment data by year of publication.
Despite differences in trends of intervention definition and treatment integrity assessment across the years studied, χ² analyses indicated no significant differences in the presence of an operational definition, treatment integrity assessment, or risk for treatment implementation issues across year of publication. Treatment agent, location of intervention, dependent variable, student age. χ² analyses indicated no significant differences in the presence of an operational definition of the intervention, treatment integrity assessment, or risk for treatment implementation issues across treatment agent, location of intervention, or dependent variable categories. With regard to student age, χ² analyses indicated no significant differences in the presence of an operational definition of the intervention or risk for treatment implementation issues. There was a significant relationship between student age and treatment integrity assessment, χ²(3) = 8.6, p = .036. Follow-up analyses were conducted for all possible 2 × 2 relationships and a Bonferroni correction was applied (p < .008). Results indicated that treatment integrity was significantly more likely to be assessed with preschoolers than high schoolers, χ²(1) = 7.65, p = .006. The difference in treatment integrity assessment between elementary school and preschool approached significance, χ²(1) = 6.95, p = .008. Across journals, treatment integrity was assessed in 82.35% (SD = 1.24), 48.73% (SD = 3.18), 36.84% (SD = 0.52), and 48.82% (SD = 1.03) of studies conducted at the preschool, elementary school, high school, and multiple student age levels, respectively. Discussion The primary purposes of this study were to elucidate the characteristics of the school psychology treatment outcome literature published between 1995 and 2008 and, more specifically, to analyze the extent to which researchers operationally define the intervention and assess treatment integrity. The data indicated that a majority of school psychology treatment outcome studies targeted academic and behavioral outcomes and were conducted
in public elementary schools with researchers or teachers as the intervention agents. A vast majority of researchers did not include operational definitions of the intervention in their manuscripts. The percentage of studies that included an operational definition of the intervention in the current study (31.8%) was nearly identical to that reported by Gresham and colleagues (1993) in their review of school-based behavioral interventions published between 1980 and 1990 (35%). However, the current data are more encouraging when the number of researchers providing a reference to another source that describes the intervention is considered. Together, nearly three-quarters (70.4%) of the published treatment outcome studies allowed readers to access detailed information about the intervention. It is unclear whether this level of reporting is consistent with previously published school-based treatment outcome studies as Gresham et al. did not code referenced intervention definitions. It is clear, however, that the rate of intervention definition in the school psychology literature is lower than that found in treatment outcome studies published in the applied behavior analysis and autism literatures. In these fields, only a small percentage (2%–8%) of studies did not provide the reader with either an operational definition of the intervention or a reference to another source (McIntyre et al., 2007; Wheeler et al., 2006). There are several potential reasons for the lower rate of providing operational definitions of interventions in school psychology. First, it may reflect the theoretical orientation of the authors being published. Unlike fields such as applied behavior analysis, in which most researchers embrace a behavioral orientation that emphasizes operationally defining variables, school psychology researchers may identify with more varied theoretical orientations that do not have such an emphasis. Second, the discrepancy may be a result of the complexity of the intervention. Assessing treatment integrity is more straightforward when an intervention is implemented in a single setting, for a few students at a time, by one individual, and is readily observable. Most
school-based interventions are not so simple. Treatment integrity assessment is more complex when an intervention is implemented across numerous settings (e.g., classroom, recess), for multiple students (e.g., class-, group-, or school-wide), and by a variety of individuals (e.g., paraprofessionals, teachers), and may not be readily observable (e.g., decoding text, self-talk). Evaluations of such interventions are often published in school psychology journals (e.g., Bramlett, Cates, Savina, & Lauinger, 2010). Regardless of the reasons why, nearly one third of treatment outcome studies did not include adequate access to a definition of the intervention being studied and it is essential that the prevalence of such information increase (Dusenbury et al., 2003; Sanetti & Kratochwill, 2009). Having an operational definition of the intervention is necessary for researchers to replicate an intervention evaluation study and thus build a systematic body of research to support an intervention's effectiveness. Furthermore, an operational definition of the intervention facilitates disseminating effective practices to practitioners because the information provides the discrete intervention steps, which should be provided regardless of whether the intervention is described via journal articles, books, conference presentations, or professional development sessions. As all journals have a limited amount of space, having authors provide a reference to another source may be an efficient method of providing a reader with access to the information needed without taking up journal space. Providing such a reference, however, may only be an appropriate option if the intervention was implemented exactly as described in the referenced publication and the publication is readily accessible to all readers. For example, referencing an intervention description provided in an unpublished dissertation study has the practical effect of leaving most readers without a description of the intervention. Another alternative could be utilizing journal Web sites for expanded article-related content. Considering that most individuals reading professional journals have access to the Internet, journal Web sites might provide an efficient
and accessible location for detailed intervention descriptions. Half of the researchers in the current review reported quantitative treatment integrity data. This percentage is encouraging as it is higher than was found in other fields (Dane & Schneider, 1998; Gansle, 2005; Gresham et al., 1993, 2000; McIntyre et al., 2007; Perepletchikova et al., 2007; Peterson et al., 1982; Wheeler et al., 2006). These data are also encouraging as they indicate that researchers are creating methods (e.g., direct observation, permanent product review, and treatment agent self-report) to assess treatment integrity. However, it is possible that invalid conclusions regarding the causal relationship between an intervention and student outcomes may have been drawn in half of the treatment outcome studies reviewed. Furthermore, the treatment integrity data in a vast majority of the reviewed studies consisted solely of adherence data. It seems that the scope of researchers' treatment integrity assessments has not evolved as the conceptualization of treatment integrity has shifted over the past 15 years (Dane & Schneider, 1998; Fixsen et al., 2005; Jones et al., 2008; Noell, 2008; Power et al., 2005; Waltz et al., 1993). For those studies in which treatment integrity assessment data were provided, the average level of treatment integrity reported was quite high. This result is similar to trends seen in the applied behavior analysis literature (Gresham et al., 1993; McIntyre et al., 2007), in which treatment integrity levels were, on average, above 90%. With regard to the relationship between intervention definition and treatment integrity assessment across study characteristics, results indicated that there were few significant differences in intervention definition and treatment integrity assessment across study characteristics. The finding that researchers are more likely to assess treatment integrity in studies conducted in preschools and possibly elementary schools, as compared to high schools, may be related to the logistical difficulties in assessing treatment integrity when students change classes several times a day.
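The age-related finding above rests on the follow-up 2 × 2 χ² comparisons with a Bonferroni-corrected alpha reported in the Results. A minimal Python sketch of that kind of test is shown below; the counts are hypothetical and serve only to illustrate the procedure, not to reproduce the study's data.

from scipy.stats import chi2_contingency

# Hypothetical counts: rows = student age group, columns = integrity assessed (yes, no)
table = [[14, 3],    # preschool studies (invented numbers)
         [16, 27]]   # high school studies (invented numbers)

chi2, p, dof, expected = chi2_contingency(table)

# Bonferroni correction for six pairwise 2 x 2 follow-up comparisons (.05 / 6, roughly .008)
alpha = 0.05 / 6
print(f"chi2({dof}) = {chi2:.2f}, p = {p:.3f}, significant at corrected alpha: {p < alpha}")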
Limitations Several limitations need to be taken into account when interpreting our results. First, because of resource constraints, not all journals related to school psychological practice could be reviewed. Despite this, the current review provides a sense of the current state of operational definition of the intervention and treatment integrity assessment in school psychology research. Second, the small cell sizes across codes required us to collapse categories to conduct χ² analyses. Doing so might have masked differences in more discrete categories (e.g., not assessed and monitored) that we were unable to analyze because of cell size. Third, for studies in which the operational definition of the intervention was coded as "referenced," we did not obtain and review the intervention definition in the referenced publication. Thus, it is possible that the referenced publication did not include an adequate operational definition of the intervention. Fourth, we considered an operational definition of an intervention to be adequate if we, as researchers, could replicate the intervention based on the description provided. Our decisions may not generalize to practitioners or other consumers who are not as familiar with the intervention literature base. Finally, the coding scheme used was adapted from previous reviews of the applied behavior analysis literature. As a result of using this coding scheme, it is possible that (a) the exclusion of less well-controlled studies resulted in a slight overestimate of the actual rates of treatment integrity and (b) information was not captured that might have been had we used an alternative coding scheme. Potential Implications for Research and Practice Results of the current study have several potential implications for research and practice, many of which have been elucidated throughout the discussion and are only highlighted again here. Researchers need to include more information about the interventions they are evaluating and provide quantitative treatment integrity data. Individuals
involved in the editorial and peer-review process can greatly influence the prevalence of treatment integrity data in the treatment outcome research by requiring such data for publication (Dusenbury et al., 2003; Sanetti & Kratochwill, 2009). It may be necessary for researchers to expand their conceptualization of treatment integrity to be consistent with current thought. Several researchers have demonstrated that multiple dimensions of treatment integrity can be assessed and can differentially predict student outcomes (e.g., Dusenbury et al., 2005; Hirschstein et al., 2007). In addition, increased attention to treatment integrity data may enable us to build a more systematic body of research supporting an intervention. For example, we will be better able to identify critical components, which will enable us to adapt interventions to increase their contextual fit while maintaining their effectiveness. Having an empirical basis for intervention adaptation will enable educators to take a partnership-based approach to implementing interventions, which could facilitate effective and sustained implementation of evidence-based interventions within RTI models. Best practices suggest, and the implementation of RTI requires, that practitioners implement evidence-based interventions and evaluate their effectiveness. To implement an evidence-based intervention, a practitioner has to have an adequate definition of the intervention. Thus, the importance of the presence of an operational definition of an intervention for those practitioners who read journal articles is obvious. Intervention description in journal articles is equally important, however, for those practitioners whose practice is informed by other dissemination materials (e.g., books, conference presentations), as those materials should be informed by research. So, for example, if an intervention is shown to be effective and the intervention is included in a book of effective school-based interventions, it is essential that the author be able to articulately describe the discrete intervention steps as they were implemented in the evaluation studies. Dissemination and adoption of research-based practices is a complex process, but it seems
that providing an adequate definition of the intervention in an evaluation study is one of many steps toward this end. Regardless of whether one is a researcher or a practitioner, it is necessary to be able to articulate the intervention being implemented and to assess treatment integrity, along with student outcomes, to draw valid conclusions about the relationship between an intervention and student outcomes. As the science of treatment integrity assessment develops, researchers and practitioners alike can utilize their knowledge of best practices in assessment and apply those, in combination with recommendations from the field about treatment integrity assessment (e.g., Noell, 2008; Schulte et al., 2009), to develop treatment integrity assessment procedures appropriate for their needs. References Baer, D., Wolf, M., & Risley, T. (1968). Some current dimensions of applied behavior analysis. Journal of Applied Behavior Analysis, 1, 91–97. Bramlett, R., Cates, G. L., Savina, E., & Lauinger, B. (2010). Assessing effectiveness and efficiency of academic interventions in school psychology journals: 1995–2005. Psychology in the Schools, 47, 114–125. Brown, S., & Rahn-Blakeslee, A. (2009). Training school-based practitioners to collect intervention integrity data. School Mental Health, 1, 143–153. Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20, 37–46. Dane, A. V., & Schneider, B. H. (1998). Program integrity in primary and early secondary prevention: Are implementation effects out of control? Clinical Psychology Review, 18, 23–45. Durlak, J. A., & DuPre, E. P. (2008). Implementation matters: A review of research on the influence of implementation on program outcomes and the factors affecting implementation. American Journal of Community Psychology, 41, 327–350. Dusenbury, L., Brannigan, R., Falco, M., & Hansen, W. B. (2003). A review of research on fidelity of implementation: Implications for drug abuse prevention in school settings. Health Education Research, 18, 237–256. Dusenbury, L., Brannigan, R., Hansen, W. B., Walsh, J., & Falco, M. (2005). Quality of implementation: Developing measures crucial to understanding the diffusion of preventive interventions. Health Education Research, 20, 308–313. Fixsen, D. L., Naoom, S. F., Blasé, K. A., Friedman, R. M., & Wallace, F. (2005). Implementation research: A synthesis of the literature. FMHI Publication #231. Tampa, FL: University of South Florida, Louis de la Parte Florida Mental Health Institute, The National Implementation Research Network. Retrieved November 1, 2007, from
http://nirn.fmhi.usf.edu/resources/publications/Monograph/pdf/monograph_full.pdf Gansle, K. A. (2005). The effectiveness of school-based anger interventions and programs: A meta-analysis. Journal of School Psychology, 43, 321–341. Gresham, F. M. (1989). Assessment of treatment integrity in school consultation and prereferral intervention. School Psychology Review, 18, 37–50. Gresham, F. M., Gansle, K. A., & Noell, G. H. (1993). Treatment integrity in applied behavior analysis with children. Journal of Applied Behavior Analysis, 26, 257–263. Gresham, F. M., MacMillan, D. L., Beebe-Frankenberger, M. E., & Bocian, K. M. (2000). Treatment integrity in learning disabilities intervention research: Do we really know how treatments are implemented? Learning Disabilities Research and Practice, 15, 198–205. Gullan, R. L., Feinberg, B. E., Freedman, M. A., Jawad, A., & Leff, S. S. (2009). Using participatory action research to design an intervention integrity system in the urban schools. School Mental Health, 1, 118–130. Hirschstein, M. K., Edstrom, L. V., Frey, K. S., Snell, J. L., & MacKenzie, E. P. (2007). Walking the talk in bully prevention: Teacher implementation variable related to initial impact of the Steps to Respect program. School Psychology Review, 36, 3–21. Individuals with Disabilities Education Improvement Act, 20 U.S.C. § 1400 et seq. (2004). Johnston, J. M., & Pennypacker, H. S. (1993). Strategies and tactics of behavioral research (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum Associates. Jones, H. A., Clarke, A. T., & Power, T. J. (2008). Expanding the concept of intervention integrity: A multidimensional model of participant engagement. In Balance, 23, 4–5. Kratochwill, T. R., Albers, C. A., & Steele-Shernoff, E. (2004). School-based interventions. Child and Adolescent Psychiatric Clinics of North America, 13, 885–903. McIntyre, L. L., Gresham, F. M., DiGennaro, F. D., & Reed, D. D. (2007). Treatment integrity of school-based interventions with children in Journal of Applied Behavior Analysis studies from 1991 to 2005. Journal of Applied Behavior Analysis, 40, 659–672. Moncher, F. J., & Prinz, R. J. (1991). Treatment fidelity in outcome studies. Clinical Psychology Review, 11, 247–266. National Association of State Directors of Special Education. (2008). Response to intervention: Blueprints for implementation. Alexandria, VA: Author. No Child Left Behind, 20 U.S.C. § 6301 et seq. (2001). Noell, G. H. (2008). Research examining the relationships among consultation process, treatment integrity, and outcomes. In W. P. Erchul & S. M. Sheridan (Eds.), Handbook of research in school consultation: Empirical foundations for the field (pp. 315–334). Mahwah, NJ: Lawrence Erlbaum Associates. Noell, G. H., & Gansle, K. A. (2006). Assuring the form has substance: Treatment plan implementation as the foundation of assessing response to intervention. Assessment for Effective Intervention, 32, 32–39. Perepletchikova, F., Treat, T., & Kazdin, A. E. (2007). Treatment integrity in psychotherapy research: Analysis of the studies and examination of the associated factors. Journal of Consulting and Clinical Psychology, 75, 829–841.
Peterson, L., Homer, A., & Wonderlich, S. (1982). The integrity of independent variables in behavior analysis. Journal of Applied Behavior Analysis, 15, 477–492. Power, T. J., Blom-Hoffman, J., Clarke, A. T., Riley-Tillman, T. C., Kellerher, C., & Manz, P. (2005). Reconceptualizing intervention integrity: A partnership-based framework for linking research with practice. Psychology in the Schools, 42, 495–507. Sanetti, L. M. H., & Kratochwill, T. R. (2009). Toward developing a science of treatment integrity: Introduction to the special series. School Psychology Review, 38, 445–459. Schulte, A. C., Easton, J. E., & Parker, J. (2009). Advances in treatment integrity research: Multidisciplinary perspectives on the conceptualization, measurement, and enhancement of treatment integrity. School Psychology Review, 38, 460–475. Shadish, W. R., Cook, T. D., & Campbell, D. T. (2002). Experimental and quasi-experimental designs for generalized causal inference. Boston: Houghton Mifflin. Snell, M. E., Chen, L. Y., & Hoover, K. (2006). Teaching augmentative and alternative communication to students with severe disabilities: A review of intervention research 1997–2003. Research and Practice for Persons with Severe Disabilities, 31, 203–214.
Task Force on Evidence-Based Interventions. (2003). Procedural and coding manual for review of evidence-based interventions. Retrieved August 3, 2009, from http://www.indiana.edu/~ebi/projects.html Waltz, J., Addis, M. E., Koerner, K., & Jacobson, N. S. (1993). Testing the integrity of a psychotherapy protocol: Assessment of adherence and competence. Journal of Consulting and Clinical Psychology, 61, 620–630. Wheeler, J. L., Baggett, B. A., Fox, J., & Blevins, L. (2006). Treatment integrity: A review of intervention studies conducted with children with autism. Focus on Autism and Other Developmental Disabilities, 21, 45–54. Yeaton, W. H., & Sechrest, L. (1981). Critical dimensions in the choice and maintenance of successful treatments: Strength, integrity, and effectiveness. Journal of Consulting and Clinical Psychology, 49, 156–167.
Date Received: May 10, 2009
Date Accepted: June 1, 2010
Action Editor: George Bear
Article was accepted by previous Editor.
Lisa M. Hagermoser Sanetti, PhD, is an Assistant Professor in the Neag School of Education and a Research Scientist with the Center for Behavioral Education and Research (CBER) at the University of Connecticut. Dr. Sanetti’s primary areas of research interest include treatment integrity assessment and promotion and evidence-based practice in schools. Katie L. Gritter, MS, is a doctoral candidate in the School Psychology program at the University of Connecticut. Topics of interest include treatment integrity, behavioral consultation, and evidence-based interventions. Lisa M. Dobey, MA, is a graduate of the school psychology program at the University of Connecticut. She is currently a school psychologist in the Belmont, MA public schools.
School Psychology Review, 2011, Volume 40, No. 1, pp. 85–107
Race Is Not Neutral: A National Investigation of African American and Latino Disproportionality in School Discipline Russell J. Skiba Indiana University Robert H. Horner University of Oregon Choong-Geun Chung and M. Karega Rausch Indiana University Seth L. May and Tary Tobin University of Oregon Abstract. Discipline practices in schools affect the social quality of each educational environment, and the ability of children to achieve the academic and social gains essential for success in a 21st century society. We review the documented patterns of office discipline referrals in 364 elementary and middle schools during the 2005–2006 academic year. Data were reported by school personnel through daily or weekly uploading of office discipline referrals using the Web-based School-wide Information System. Descriptive and logistic regression analyses indicate that students from African American families are 2.19 (elementary) to 3.78 (middle) times as likely to be referred to the office for problem behavior as their White peers. In addition, the results indicate that students from African American and Latino families are more likely than their White peers to receive expulsion or out of school suspension as consequences for the same or similar problem behavior. These results extend and are consistent with a long history of similar findings, and argue for direct efforts in policy, practice, and research to address ubiquitous racial and ethnic disparities in school discipline.
This research was supported in part by U.S. Department of Education Grant H326S980003. Opinions expressed herein do not necessarily reflect the policy of the Department of Education, and no official endorsement by the Department should be inferred. Correspondence regarding this article should be addressed to Russell J. Skiba, Center for Evaluation and Education Policy, 1900 E. 10th St., Bloomington, IN 47401; e-mail: [email protected]
The Supreme Court’s ruling in Brown v. Board of Education in 1954 set the nation on a path toward equalizing educational opportunity for all children. The right not to be discriminated against on the basis of race, color, or national origin was explicitly guaranteed by
Title VI of the Civil Rights Act of 1964 (Browne, Losen, & Wald, 2002). Those protections were expanded to students with disabilities in the Individuals with Disabilities Education Improvement Act of 2004 and to educational outcomes for all children in the Elementary and Secondary Education Act (No Child Left Behind, 2008). Yet continuing racial and ethnic disparities in education ranging from the achievement gap (Ladson-Billings, 2006) to disproportionality in special education (Donovan & Cross, 2002) to dropout and graduation rates (Wald & Losen, 2007) have led some to question the extent to which the promises of Brown have been fulfilled (Blanchett, Mumford, & Beachum, 2005). In particular, over 30 years of research has documented racial and socioeconomic disparities in the use of out-of-school suspension and expulsion. The purpose of this article is to describe a national investigation exploring the extent of, and patterns in, racial and ethnic disparities in school discipline at the elementary and middle school level. Consistently Demonstrated Disproportionality For over 25 years, in national-, state-, district-, and building-level data, students of color have been found to be suspended at rates two to three times that of other students, and similarly overrepresented in office referrals, corporal punishment, and school expulsion (Skiba, Michael, Nardo, & Peterson, 2002). Documentation of disciplinary overrepresentation for African American students has been highly consistent (see e.g., Gregory, 1997; McCarthy & Hoge, 1987; McFadden, Marsh, Price, & Hwang, 1992; Raffaele Mendez & Knoff, 2003; Skiba et al., 2002; Wu, Pink, Crain, & Moles, 1982). According to data from the U.S. Department of Education Office for Civil Rights, disciplinary disproportionality for African American students appears to have increased from the 1970s, when African Americans appeared to be at approximately twice the risk of out-of-school suspension, to 2002, when African American students’ risk for suspension was almost three times as great
as White students (Wald & Losen, 2003). Although disciplinary overrepresentation of Latino students has been reported in some studies (Raffaele Mendez & Knoff, 2003), the finding is not universal across locations or studies (see e.g., Gordon, Della Piana, & Keleher, 2000). Possible Causative Mechanisms A number of possible hypotheses have been proposed as mechanisms to account for rates of disciplinary disparity by race/ethnicity, including poverty, differential rates of inappropriate or disruptive behavior in school settings, and cultural mismatch or racial stereotyping. The possible mechanisms are discussed in the following. Poverty Race and socioeconomic status (SES) are unfortunately highly connected in American society (McLoyd, 1998), raising the possibility that any finding of racial disparities in school discipline can be accounted for by disproportionality associated with SES. Low SES has been consistently found to be a risk factor for school suspension (Brantlinger, 1991; Wu et al., 1982). Yet when the relationship of SES to disproportionality in discipline has been explored directly, race continues to make a significant contribution to disproportionate disciplinary outcomes independent of SES (Skiba et al., 2002; Wallace, Goodkind, Wallace, & Bachman, 2008; Wu et al., 1982). Higher Rates of Disruption Among Students of Color A related hypothesis might be that students of color, perhaps because they have been subjected to a variety of stressors associated with poverty (see e.g., Donovan & Cross, 2002), may learn and exhibit behavioral styles so discrepant from mainstream expectations in school settings as to put them at risk for increased disciplinary contact. Investigations of student behavior, race, and discipline have consistently failed, however, to find evidence of differences in either the frequency or intensity of African American students’ school
behavior sufficient to account for differences in rates of school discipline. Some studies have found no significant differences in behavior between African American and White students (McCarthy & Hoge, 1987; Wu et al., 1982), while others have reported that African American students receive harsher levels of punishment for less serious behavior than other students (McFadden et al., 1992; Shaw & Braden, 1990). Skiba et al. (2002) compared the types of infractions for which African American and White middle school students in a large urban district were referred to the office, and found no obvious differences in severity of behavior, but found that African American students tended to be referred to the office more often for offenses that required a higher degree of subjectivity, such as disrespect or loitering. Cultural Mismatch or Racial Stereotyping With a teaching force in most school districts in this nation that is predominantly White and female (Zumwalt & Craig, 2005), the possibility of cultural mismatch or racial stereotyping as a contributing factor in disproportionate office referral cannot be discounted. Townsend (2000) suggested that the unfamiliarity of White teachers with the interactional patterns that characterize many African American males may cause these teachers to interpret impassioned or emotive interactions as combative or argumentative. In an ethnographic study of disciplinary practices at an urban elementary school, Ferguson (2001) documented the seemingly unconscious process whereby racial stereotypes may contribute to higher rates of school punishment for young African American males. There is some indication that teachers do make differential judgments about achievement and behavior based on racially conditioned characteristics. Neal, McCray, Webb-Johnson, and Bridgest (2003) found that students who engaged in a “stroll” style of walking more often associated with African American movement style were more likely to be judged by teachers as being more aggressive
or lower achieving academically, whether the student was African American or White. In an extensive study of teacher ratings, Zimmerman, Khoury, Vega, Gil, and Warheit (1995) found evidence that African American students were more likely to be rated as having more extensive behavior problems by both Hispanic and non-Hispanic White teachers. In addition, teachers were more likely than parents to rate African American students as more problematic and less likely than parents to rate White students’ behavior as more problematic. In a more restricted sample set in a high-poverty inner-city setting, Pigott and Cowen (2000) found no evidence of a child–teacher race interaction in teacher ratings of their students, but found that all teacher groups reported a higher incidence of race-related stereotypes for African American students. There is some classroom observational data consistent with either a cultural mismatch or racial stereotyping explanation. Vavrus and Cole (2002) analyzed videotaped interactions among students and teachers, and found that many ODRs were less the result of serious disruption than what the authors described as “violations of … unspoken and unwritten rules of linguistic conduct” (p. 91), and that students singled out in this way were disproportionately students of color. In a study of office referral practices in an urban high school, Gregory and Weinstein (2008) found that, among a sample of African American students with ODRs, differences in classroom management style significantly contributed to both student attitudes toward classroom management and actual disciplinary outcomes. Further, even among students with multiple referrals to the office, only certain student–teacher combinations resulted in higher rates of office referral. Summary A number of hypotheses might be applied to explain the ubiquitous overrepresentation of African American students in a range of school disciplinary consequences. It seems likely that, in the face of multiple hypotheses, the disproportionate representation of students
of color in school discipline is complex and multiply determined. Risks of Disproportionate Representation in School Exclusion Overrepresentation in out-of-school suspension and expulsion appears to place African American students at risk for a number of negative outcomes that have been found to be associated with those consequences. First, given documented positive relationships between the amount and quality of engaged time in academic learning and student achievement (Brophy, 1988; Greenwood, Horton, & Utley, 2002), and conversely between school alienation/school bonding and subsequent delinquency (Hawkins, Doueck, & Lishner, 1988), procedures like out-of-school suspension and expulsion that remove students from the opportunity to learn and potentially weaken the school bond must be viewed as potentially risky interventions. Second, a substantial database has raised serious concerns about the efficacy of school suspension and expulsion as a behavioral intervention in terms of either reductions in individual student behavior or overall improvement in the school learning climate (see e.g., American Psychological Association, 2008). Finally, by removing students from the beneficial aspects of academic engagement and schooling, suspension and expulsion may constitute a risk factor for further negative outcomes, including poor academic performance (Skiba & Rausch, 2006), school dropout (Ekstrom, Goertz, Pollack, & Rock, 1986), and involvement in the juvenile justice system (Wald & Losen, 2003). Thus, the overrepresentation of African American students in such high-risk procedures must be considered highly serious. Gaps in Knowledge There are substantial gaps in the research literature exploring racial and ethnic disparities in school discipline, some extending to basic descriptive information. Data concerning the representation of Hispanic/Latino students in school discipline are limited and highly inconsistent. Few studies of school
discipline have focused on school level as a variable (elementary vs. middle vs. high school) and fewer still have examined disproportionality across school levels (Skiba & Rausch, 2006). Third, although the disciplinary process has been recognized as a complex, multilevel process proceeding from office referral to administrative disposition (see e.g., Morrison, Anthony, Storino, Cheng, Furlong, & Morrison, 2001), little attention has been paid to the relative contribution of office referrals and administrative consequences to racial and ethnic disparities in school discipline. Finally, few investigations have been both comprehensive and detailed. That is, empirical investigations of school disciplinary processes appear either to rely on national or state databases (e.g., U.S. Department of Education Office for Civil Rights data) that provide a comprehensive perspective on suspension or expulsion, but little detail concerning the initial offense that led to referral; or to analyze local school or district databases of ODRs that provide a richer picture of student infractions, but may or may not be generalizable to other locations. Purpose and Assumptions The purpose of this investigation was to explore racial and ethnic disparities in office referrals and administrative discipline decisions in a nationally representative sample. The schools in the sample had been involved in efforts to reform their school disciplinary practices using School-wide Positive Behavior Supports (SWPBS) for at least one year. SWPBS is a whole-school approach to prevention of problem behavior that focuses on defining, teaching, and rewarding behavioral expectations; establishing a consistent continuum of consequences for problem behavior; implementing a multitiered system of behavior supports; and actively using data for decision making (Sugai & Horner, 2006). A core element of the SWPBS implementation process is systematic data collection on occurrence of problem behaviors that result in office referrals and the discipline decisions associated with those referrals.
Although the data were drawn from a subsample of schools implementing SWPBS, the purpose of this investigation was not in any way to explore the effects or effectiveness of SWPBS as an intervention for reducing disciplinary referrals or disproportionality in referrals. Rather, to our knowledge, these data provide the most comprehensive and nationally representative sample for addressing some of the gaps in research knowledge regarding racial and ethnic disproportionality in school disciplinary procedures. We used descriptive and logistic regression analyses to explore patterns of disproportionality in office referral rates, patterns of disciplinary decisions across different racial/ethnic groups (African American, Hispanic, White), and school level (elementary vs. middle school). The analyses make two assumptions about effective disciplinary practices and hence about the types of data that would provide evidence of an effective and equitable disciplinary system. First, we presume that the most effective disciplinary systems are graduated discipline systems (American Psychological Association, 2008) in which minor infractions produce less severe administrative consequences than more severe infractions. The philosophy and practice of zero tolerance has tended to emphasize an alternate model, in which both minor and major infractions are met with more severe consequences, but a substantial database has failed to support the efficacy of practices based on that perspective (American Psychological Association, 2008). Second, given that there is no evidence supporting a distribution of infractions that varies in severity by race, we presume that disciplinary outcomes will be proportional across racial/ethnic categories. Methods Data Source The subjects for this investigation were drawn from data generated by the School-wide Information System (SWIS; May et al., 2006), which was being used in over 4,000 schools across the nation during the 2005–2006
academic year (cf. http://www.swis.org, January 2007). The SWIS is a three-component decision system for gathering and using school discipline data for decision making. The components of SWIS are (a) a data collection protocol that schools adopt that uses operationally defined categories for problem behavior, school-wide standards defining which problem behaviors are addressed in classrooms versus sent to the office, and a structure for team meetings in which data are used; (b) a Web-based computer application for entering ODR data and retrieving summary reports in graphic and tabular formats (May et al., 2006); and (c) a facilitator-based training system to help teams use data for active decision making. The entry of data into the SWIS computer application requires that students be identified by name, district identification number, grade, Individualized Education Program (IEP) status, and ethnicity. The content of an office discipline referral includes information about (a) the type of problem behavior leading to the referral; (b) the time of day, location, referring adult, and others present during the event; (c) the presumed maintaining consequence (e.g., access to attention, escape from work, response to taunting from peers); and (d) the primary administrative decision (e.g., conversation, detention, loss of privilege, parent report, suspension) resulting from the referral. This information is summarized in a series of reports that allow an administrator, specialist, team, or faculty member to monitor the rate of office discipline referrals; the type of behaviors leading to referrals; the time of day, location, and presumed maintaining consequence; and the administrative decision patterns. The SWIS system also provides the option for a school to compute a summary of ODRs by race/ethnicity. Race/ethnicity within SWIS is determined by the family designation when a child is enrolled in school, but is limited in specificity to the six federal race categories: African American, Asian, Native American, Pacific Islander, Hispanic/Latino, and Caucasian.
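For readers unfamiliar with referral-level data, the following Python sketch models the fields that an office discipline referral entry is described as carrying in the paragraph above. The class and field names are our own invention for illustration and are not the actual SWIS data schema.

from dataclasses import dataclass
from typing import Optional

@dataclass
class OfficeDisciplineReferral:
    """Hypothetical record mirroring the ODR content described in the text."""
    student_id: str                 # district identification number
    grade: int
    iep_status: bool                # Individualized Education Program status
    ethnicity: str                  # one of the six federal race categories
    problem_behavior: str           # type of behavior leading to the referral
    time_of_day: str
    location: str
    referring_adult: str
    others_present: Optional[str]
    maintaining_consequence: str    # e.g., access to attention, escape from work
    administrative_decision: str    # e.g., conversation, detention, suspension

referral = OfficeDisciplineReferral(
    student_id="000123", grade=4, iep_status=False, ethnicity="African American",
    problem_behavior="Disruption", time_of_day="10:45", location="Classroom",
    referring_adult="Teacher", others_present=None,
    maintaining_consequence="Escape from work", administrative_decision="Detention",
)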
Selection of problem behavior for all schools using SWIS is based on a mutually exclusive and exhaustive list of 24 operationally defined “major problem behaviors” and three operationally defined “minor problem behaviors.” School-based consequences for these reported behaviors are coded into 14 mutually exclusive and exhaustive categories of “administrative decisions.” Operational definitions of both student behaviors and administrative decisions may be found on the SWIS Resources site at http://www.swis.org/index.php?page=resources;rid=10121. Participant Sample As of January 2007, there were over 4,000 schools in the United States at varying stages of SWIS adoption. During the fall of 2007, we identified from this population of schools a subset of 436 schools that (a) used SWIS for the full 2005–2006 academic year, (b) reported ethnicity information, (c) had grade levels between kindergarten and sixth grade (K–6) or sixth and ninth grade (6–9), and (d) agreed to share anonymous summaries of their data for evaluation purposes. These schools reported total enrollment of 120,148 students in elementary grades (K–6) and 60,522 students in middle school grades (6–9). Spaulding et al. (2010) compared the demographic features of the sample with 73,525 comparable schools in the National Center for Educational Statistics (NCES) sample for 2005–2006 to assess bias in size, proportion of students with an IEP, SES, racial/ethnic distribution, and location (urban, suburban, or rural). They found the SWIS database to be within 5% of the NCES data in all categories except that the SWIS database was composed of fewer schools within the high SES category and had a larger number of schools with high ethnic/racial diversity. At the K–6 and 6–9 levels, the sample did not differ from the NCES data in the percentage of large, midsized, or rural schools represented, but at the high school level there were more schools with large enrollment. It is important to note that schools adopting SWIS were self-nominated, and in
most cases were in varying stages of implementing school-wide positive behavior support (Sugai et al., 2002; Lewis & Sugai, 1999). Adoption of SWIS does not require that schools also adopt school-wide positive behavior support, but often districts investing in adoptions of school-wide discipline systems are more likely to also invest in adoption of SWIS. Again, the purpose of this investigation was not to explore any attributes, effects, or effectiveness of the use of SWPBS in these schools. No information was available within the SWIS database to indicate the length or fidelity of SWPBS adoption. For purposes of analysis, the 436 schools were organized into an elementary level (K–6) and a middle school level (6–9). Schools that overlapped to some extent (e.g., Grades 5–9) were placed in the group with the largest degree of overlap. Schools with a degree of overlap that did not permit sorting with confidence (e.g., schools serving Grades K–8) were dropped from the sample. The final sample included 272 K–6 level schools and 92 6–9 level schools. Research Questions and Variables The primary research questions focused first on the pattern of ODRs by race and then on the pattern of administrative decisions by race. The following questions guided the study. ODRs. Two questions were addressed through descriptive data and logistic regression analyses. 1. To what extent does racial/ethnic status make a contribution to rates of ODR in elementary or middle schools? 2. In which categories of ODRs are racial or ethnic disparities evident? The original SWIS data included 27 categories of disciplinary infraction that could be entered as the reason for an ODR. For conceptual and analytic clarity, we categorized those infractions into the categories minor misbehavior, disruption, noncompliance, moderate infractions, major violations, use/possession, tardy/truancy, and other/unknown
(Note: specific infractions included in each category may be found in Table 3). Administrative decisions. Three questions were addressed through descriptive data, simple logistic regression, and multinomial logit models. 3. To what extent does racial/ethnic status make a contribution to administrative decisions concerning disciplinary consequences in elementary or middle schools? 4. In which categories of disciplinary consequence are racial or ethnic disparities evident? 5. What are the racial disparities in the interaction of infraction types and administrative decisions regarding consequence? In which infraction/consequence pairs do such disparities occur? The original SWIS data included 14 categories of administrative decision regarding primary disciplinary consequence. For conceptual and analytic clarity, we categorized those decisions into the categories of minor consequences, detention, moderate consequences, in-school suspension, out-of-school suspension and expulsion, and other/unknown. (Note: Specific administrative decisions comprising each category may be found in Table 4.) Data analyses. Descriptive analyses and a series of logistic and multinomial logit regression analyses (Greene, 2008) were used to describe the extent of disproportionality in student infractions, administrative decisions, and their interaction. The first logistic equation, addressing Research Question 1 (Table 2), was designed to test the extent to which race proved a factor in the probability of an ODR. The second analysis, addressing Research Question 2 (Table 3), was a multinomial logit model designed to explore the contribution of race to referrals for specific types of infraction. The third regression, addressing Questions 3 and 4, was a multinomial logit regression testing the influence of type of infraction and race on the probability of receiving
a given consequence (Table 4). The final multinomial logit regression, addressing Research Question 5, was designed to test the specific influence of race on the probability of all possible infraction/consequence interactions (Table 5). Early analyses showing differences in patterns of results by school level led us to conduct separate analyses for K–6 and 6–9 schools. The measure used to index disproportionality throughout the analyses is the odds ratio (OR) drawn from the logistic and multinomial logit regression equations, with values greater than 1.0 indicating overrepresentation and values less than 1.0 indicating underrepresentation. In contrast to risk ratios, the OR may offer a more stable and accurate estimate of disproportionality, because it accounts for both occurrence and nonoccurrence of the event being measured (Finn, 1982; Oswald, Coutinho, Best, & Singh, 1999). For all analyses involving racial/ethnic categories, the index category was “White,” while the index category for infractions or administrative decisions varied and is noted in the footnote of the table for that analysis. Finally, note that description of the results refers to “overrepresentation” or “underrepresentation” of a given racial/ethnic group with respect to a given infraction or disciplinary consequence. In descriptive results, presented as composition indices, proportionality is compared to that group’s representation in the population. Across all logistic and multinomial logit analyses, disproportionality is framed in terms of over- or underrepresentation in comparison to the index group. This is not intended to convey any judgment concerning how frequently a given consequence ought to be applied (e.g., Latino students being underrepresented in detention does not in any way suggest they should be placed in detention more frequently), but rather is simply a numerical representation of the probability of occurrence relative to the index group (White students). The Box–Tidwell transformation test (Cohen, Cohen, West, & Aiken, 2003) was performed to test the assumption that there is a linear relationship between the independent variables and the log odds (logit) of the dependent measure.
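A minimal sketch of this modeling setup, using the statsmodels formula interface on an invented student-level data frame (one row per student, a 0/1 indicator for at least one ODR, and race/ethnicity with White as the index category), is shown below. Exponentiating the coefficients yields odds ratios of the kind reported in Table 2; this is an illustration of the approach, not the authors' code, and the simulated base rates are arbitrary.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)

# Invented student-level data: race/ethnicity and whether the student had >= 1 ODR
race = rng.choice(["White", "African American", "Hispanic/Latino"],
                  size=5000, p=[0.5, 0.25, 0.25])
base_prob = {"White": 0.20, "African American": 0.35, "Hispanic/Latino": 0.16}
referred = rng.binomial(1, [base_prob[r] for r in race])
df = pd.DataFrame({"race": race, "referred": referred})

# Logistic regression with White as the index (reference) category
model = smf.logit("referred ~ C(race, Treatment(reference='White'))", data=df).fit(disp=False)
odds_ratios = np.exp(model.params)
print(odds_ratios)  # values > 1 indicate overrepresentation relative to White students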
Table 1
Enrollment, Number of Students Referred, and Number of Referrals Disaggregated by Racial/Ethnic Group

K–6 Schools (N = 272)
Group                      Enrollment N (%)      Students Referred N (%)    Referrals N (%)
Hispanic/Latino            25,051 (20.9)         4,311 (13.1)               12,863 (9.6)
African American           30,961 (25.8)         11,577 (35.3)              57,601 (43.0)
White                      54,690 (45.5)         11,703 (35.7)              45,900 (34.3)
Unknown/all othersᵃ        9,446 (7.9)           5,212 (15.9)               17,518 (13.1)
Total N of students        120,148 (100.0)       32,803 (100.0)             133,882 (100.0)

6–9 Schools (N = 92)
Group                      Enrollment N (%)      Students Referred N (%)    Referrals N (%)
Hispanic/Latino            10,332 (17.1)         4,245 (16.9)               18,419 (14.5)
African American           13,228 (21.9)         8,024 (32.0)               52,894 (41.7)
White                      32,975 (54.5)         9,542 (38.1)               42,605 (33.6)
Unknown/all others         3,987 (6.6)           3,260 (13.0)               12,842 (10.1)
Total N of students        60,522 (100.0)        25,071 (100.0)             126,760 (100.0)

Note. K–6 = kindergarten through Grade 6; 6–9 = Grade 6 through Grade 9.
ᵃ Unknown/all others: American Indian/Alaskan Native, Asian, Pacific Islander/Native Hawaiian, Not Listed, and Unknown.
Tests across all models failed to yield any significant results, indicating that there is no nonlinearity issue for this sample. Results Disproportionate representation in school discipline can occur at either the point of referral or administrative decision. To track disproportionality through these two points in the disciplinary process, results are organized into two sections: ODRs and administrative decisions. ODRs Table 1 presents the total enrollment, number of students referred to the office, and number of ODRs disaggregated by race for K–6 and 6–9 level schools. The percentages in Columns 2, 4, and 6 are column percentages, reflecting the percent of enrollment, students referred, or total number of ODRs respectively for each racial/ethnic group. Each percentage in Column 4 or 6 therefore
represents a composition index (Donovan & Cross, 2002) that can be interpreted by comparing it to the percent overall enrollment in Column 2. Thus, at the K–6 level, African American students appear to be overrepresented, relative to their proportion in the population, among those referred to the office, representing 25.8% of total enrollment (Column 2), but 35.3% of those referred to the office (Column 4). White and Hispanic/Latino students are underrepresented relative to their enrollment among those referred to the office at the K–6 level. At the 6–9 level, African American students appear to be overrepresented, and White students appear to be underrepresented in their rate of ODRs as compared to their percentage in the population. Hispanic/Latino students appear to be roughly proportionately represented in middle school ODRs. The level of overreferral of African American students becomes even more apparent if one examines the absolute number of referrals to the office (Column 6). The discrepancy between
Table 2
Logistic Regression of the Influence of Race/Ethnicity on Referral(a)

Group                        Referred: Odds Ratio
K–6 Schools
  Hispanic/Latino                   0.76*
  African American                  2.19*
  Unknown/all others                NA
  Number of cases                   120,148
  Model χ2                          7,152
  Nagelkerke Pseudo R2              0.084
  % Correctly predicted             73.5
6–9 Schools
  Hispanic/Latino                   1.71*
  African American                  3.79*
  Unknown/all others                NA
  Number of cases                   60,522
  Model χ2                          6,925
  Nagelkerke Pseudo R2              0.146
  % Correctly predicted             67.4

Note. K–6 = kindergarten through Grade 6; 6–9 = Grade 6 through Grade 9; NA = not available.
(a) Reference category is Not Referred for outcome and White for Race/Ethnicity.
*p < .05.
The discrepancy between individual referrals and total referrals suggests a higher rate of multiple referrals to the office for African American students than for White or Hispanic/Latino students at both the elementary and middle school levels.

Differences in ODRs across racial/ethnic categories were compared more directly using the ORs drawn from a logistic regression analysis predicting the probability of at least one ODR (Table 2). The index group for all analyses is White students. All ORs in Table 2 are significant at the .01 level. At the K–6 level, African American students' odds of being referred to the office are 2.19 times those of White students; the overrepresentation of African American students in ODRs relative to White students appears to increase (OR = 3.79) at the 6–9 level. A somewhat different pattern of disproportionality is in evidence for Hispanic/Latino students.
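As an illustration only (not part of the original analyses), the K–6 referral ORs reported in Table 2 can be recovered directly from the Table 1 counts, because a logistic regression with race/ethnicity as the sole predictor reproduces the raw sample odds ratios:

```python
# Illustrative check of the K-6 referral odds ratios in Table 2 using the
# Table 1 counts (students referred vs. enrolled).
def referral_odds(referred, enrolled):
    """Odds of receiving at least one office discipline referral."""
    return referred / (enrolled - referred)

odds_white = referral_odds(11_703, 54_690)             # ~0.27
odds_african_american = referral_odds(11_577, 30_961)  # ~0.60
odds_latino = referral_odds(4_311, 25_051)             # ~0.21

print(round(odds_african_american / odds_white, 2))    # 2.19, matching Table 2
print(round(odds_latino / odds_white, 2))              # 0.76, matching Table 2
```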
At the K–6 level, Hispanic/Latino students are underrepresented in their rate of referral to the office relative to White students (OR = 0.76). At the 6–9 level, however, Hispanic/Latino students in this sample are overrepresented (OR = 1.71) in their rate of ODRs. These results at the 6–9 level appear at first glance to be somewhat at odds with the composition indices (Table 1), which suggest that rates of ODRs for Hispanic/Latino students are roughly proportionate at the 6–9 level. Thus, the significant Latino overrepresentation relative to White students at the middle school level appears to result not from absolute overreferral of Latino students, but rather from the substantial underreferral to the office of White students relative to their representation in the population.

The ORs associated with ODRs broken down by infraction, comparing African American and Latino students with White students, are presented in Table 3. With the exception of Tardy/Truancy, Major Violations, and Use/Possession for Hispanic/Latino students at the K–6 level, all ORs are significant at the p < .01 level. At both the K–6 and 6–9 levels, African American students are significantly overrepresented in ODRs across all infraction types, with the highest ORs relative to White students for the infraction types of Tardy/Truancy, Disruption, and Noncompliance. At the K–6 level, Hispanic/Latino students are underrepresented as compared with White students in ODRs for Minor Misbehaviors, Disruption, Noncompliance, and Moderate Infractions. At the 6–9 level, in contrast, Hispanic/Latino students appear to be overrepresented relative to White students for all ODR categories.

Administrative Decisions

Table 4 presents the results of a multinomial logistic regression predicting the likelihood of a particular administrative decision using two models. In Model 1, the likelihood of an administrative consequence is predicted solely from type of infraction. In Model 2, race/ethnicity is added to type of infraction to predict the likelihood of an administrative consequence.
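A minimal sketch of this two-model setup is shown below. It is not the authors' code; the data file and column names (consequence_code, infraction, race) are hypothetical, and reference categories would need to be set to Minor Consequences (outcome), Tardy/Truancy (infraction), and White (race) to match the article.

```python
# Hypothetical sketch of the Table 4 analysis: a multinomial logit predicting
# the administrative decision from infraction type alone (Model 1) and from
# infraction type plus race/ethnicity (Model 2).
import numpy as np
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("referrals.csv")  # one row per office discipline referral

y = df["consequence_code"]  # integer-coded outcome; 0 = minor consequences (reference)

X1 = pd.get_dummies(df["infraction"], drop_first=True).astype(float)
X2 = pd.concat(
    [X1, pd.get_dummies(df["race"], drop_first=True).astype(float)], axis=1
)

model1 = sm.MNLogit(y, sm.add_constant(X1)).fit(maxiter=100)
model2 = sm.MNLogit(y, sm.add_constant(X2)).fit(maxiter=100)

print(np.exp(model2.params))  # odds ratios relative to the reference outcome
```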
Table 3
Multinomial Logit Regression of the Influence of Race/Ethnicity on Referrals by Infractions: Odds Ratios(a)

K–6 Schools (n = 142,451; Model χ2 = 12,848; Nagelkerke Pseudo R2 = 0.093; % correctly predicted = 61.3)

Infraction               Hispanic/Latino   African American   Unknown/all others
Minor Misbehaviors            0.66**            1.77**               NA
Disruption                    0.67**            3.96**               NA
Noncompliance                 0.60**            3.32**               NA
Moderate Infractions          0.77**            2.62**               NA
Major Violations              0.91              2.87**               NA
Use/Possession                1.22              2.96**               NA
Tardy/Truancy                 0.84              6.58**               NA
Other/Unknown                 0.80**            2.66**               NA

6–9 Schools (n = 89,279; Model χ2 = 12,001; Nagelkerke Pseudo R2 = 0.129; % correctly predicted = 40.5)

Infraction               Hispanic/Latino   African American   Unknown/all others
Minor Misbehaviors            1.50**            3.72**               NA
Disruption                    1.31**            5.60**               NA
Noncompliance                 1.87**            5.43**               NA
Moderate Infractions          1.76**            4.76**               NA
Major Violations              1.80**            3.91**               NA
Use/Possession                1.93**            2.02**               NA
Tardy/Truancy                 2.44**            4.40**               NA
Other/Unknown                 1.30**            4.55**               NA

Note. Minor Misbehaviors = minor inappropriate verbal language, minor physical contact, minor defiance/disrespect/noncompliance, minor disruption, minor property misuse, other minor misbehaviors; Disruption = disruption; Noncompliance = defiance/disrespect/insubordination/noncompliance; Moderate Infractions = abusive language/inappropriate language, fighting/physical aggression, lying/cheating, harassment/bullying; Major Violations = property damage, forgery/theft, vandalism, bomb threat/false alarm, arson; Use/Possession = use/possession of tobacco, alcohol, combustible items, weapons, drugs; Tardy/Truancy = tardy, skip class/truancy, dress code violation; Other/Unknown = other behavior, unknown behavior; K–6 = kindergarten through Grade 6; 6–9 = Grade 6 through Grade 9; NA = not available.
(a) Entries in each cell represent the odds ratio drawn from the multinomial logit, holding constant the contribution of all other variables. Reference category is Not Referred for outcome and White for Race/Ethnicity.
**p < .01.
For both Model 1 and Model 2, the odds of receiving a suspension/expulsion for committing a minor infraction are very low at both the K–6 and 6–9 levels, and appear to increase proportionally as the seriousness of the infraction increases. Thus, the odds of receiving a suspension or expulsion for Use/Possession are very high at both the K–6 (OR = 16.60) and 6–9 (OR = 53.01) levels. Model 2 in Table 4 adds race/ethnicity to the model, which results in little change to the ORs or significance levels for type of infraction as compared to Model 1. Race/ethnicity enters the equation significantly for most administrative decisions. Both African American and Latino students are overrepresented in suspension/expulsion relative to White students at both the K–6 and 6–9 levels. African American students are underrepresented in the use of detention at the K–6 level, and underrepresented in all administrative consequences except suspension/expulsion at the 6–9 level. In contrast, Hispanic/Latino students are underrepresented relative to White students in the use of moderate consequences, but overrepresented in detention, at both the K–6 and 6–9 levels. The continuing significance of race/ethnicity in Model 2, even after controlling for type of behavior, indicates that race/ethnicity contributes to administrative decisions regarding discipline independent of type of infraction, above and beyond any prior disparity in classroom referral.

Table 5 presents the ORs for receiving various administrative consequences broken down by race and type of infraction, drawn from a series of multinomial logit regression analyses at the K–6 and 6–9 levels testing for the presence of differential administrative treatment of the same offense. The results describe a complex pattern of variation across type of infraction, race/ethnicity, and school level. At the elementary level, African American students were more likely than White students to receive out-of-school suspension/expulsion for all types of infractions tested (note that tardiness/truancy and use/possession could not be estimated in the model because of zero cells). In particular, African American elementary school students were more likely than White students to be suspended out of school for minor misbehavior (OR = 3.75, p < .01). They were also less likely to receive in-school suspension for disruption or noncompliance, less likely to receive moderate consequences for noncompliance, and less likely to receive detention for minor misbehavior or moderate infractions. Latino students at the elementary level were more likely than White students to be suspended/expelled for all infractions except disruption, and were also more likely than White students to receive detention for minor misbehavior, noncompliance, and moderate infractions. Finally, Latino students were more likely than White students to receive in-school suspension for minor misbehavior, and less likely to receive in-school suspension for noncompliance.

A slightly different pattern of infraction/consequence relationships appears at the middle school level. The overrepresentation of African American students in suspension/expulsion for specific offenses is less consistent at the 6–9 level, with ORs significantly greater than 1.00 only for disruption, moderate infractions, and tardy/truancy. The pattern of African American underrepresentation in less serious consequences was more pronounced at the 6–9 level, however, with ORs significantly less than 1.00 for almost all less serious consequences across most types of infractions. Hispanic/Latino disproportionality in suspension/expulsion at the 6–9 level was more consistent with the elementary level findings, with ORs significantly greater than one across all types of infractions except use/possession. Significant underrepresentation of Hispanic/Latino students was found in the use of moderate consequences for disruption and moderate infractions, and in in-school suspension for minor misbehavior and tardy/truancy.
Table 4
Multinomial Logit Regression of the Influence of Behavior and Race on Administrative Decisions: Odds Ratios(a)

K–6 Schools

Model 1 (N of cases = 133,882; Model χ2 = 33,899; Nagelkerke Pseudo R2 = 0.240; % correctly predicted = 57.5)

Variable                 Detention   Moderate Consequence   In-School Suspension   OSS and Expulsion   Other/Unknown
Infractions(b)
  Minor Misbehaviors       0.96            0.64                  0.17**                 0.02**             0.03**
  Disruption               0.76*           1.40                  0.79                   0.59**             0.01**
  Noncompliance            0.88            1.51                  0.88                   0.78               0.01**
  Moderate Infractions     1.04            1.42                  1.28                   1.55**             0.01**
  Major Violations         1.19            2.53**                1.44*                  1.14               0.01**
  Use/Possession           0.94            1.84                  4.34**                16.60**             0.02**
  Other/Unknown            1.33*           2.25*                 1.19                   1.32               0.02**

Model 2 (N of cases = 133,882; Model χ2 = 35,681; Nagelkerke Pseudo R2 = 0.251; % correctly predicted = 57.5)

Variable                 Detention   Moderate Consequence   In-School Suspension   OSS and Expulsion   Other/Unknown
Infractions(b)
  Minor Misbehaviors       0.88            0.66                  0.17**                 0.03**             0.03**
  Disruption               0.73*           1.41                  0.78                   0.63**             0.01**
  Noncompliance            0.84            1.53                  0.88                   0.85               0.01**
  Moderate Infractions     0.96            1.44                  1.28                   1.73**             0.01*
  Major Violations         1.07            2.60**                1.44*                  1.27               0.01**
  Use/Possession           0.82            1.91                  4.35**                18.82**             0.02**
  Other/Unknown            1.20            2.30**                1.19                   1.48**             0.02**
Race(b)
  Hispanic/Latino          1.24**          0.71**                1.01                   1.52**             0.64**
  African American         0.78**          1.03                  0.99                   1.64**             1.03
  Other/Unknown            NA              NA                    NA                     NA                 NA

6–9 Schools

Model 1 (N of cases = 126,760; Model χ2 = 28,905; Nagelkerke Pseudo R2 = 0.211; % correctly predicted = 32.5)

Variable                 Detention   Moderate Consequence   In-School Suspension   OSS and Expulsion   Other/Unknown
Infractions(b)
  Minor Misbehaviors       0.48**          0.27**                0.27**                 0.27*              1.06
  Disruption               0.30**          0.38**                0.58**                 1.03               0.20**
  Noncompliance            0.29**          0.41**                0.74**                 1.40**             0.24**
  Moderate Infractions     0.30**          0.67**                1.20**                 6.40**             0.33**
  Major Violations         0.38**          2.04**                1.04**                 6.59**             0.50**
  Use/Possession           0.27**          1.97*                 2.05**                53.01**             0.61*
  Other/Unknown            0.49**          1.20*                 1.04                   2.95**             0.49**

Model 2 (N of cases = 126,760; Model χ2 = 31,123; Nagelkerke Pseudo R2 = 0.226; % correctly predicted = 33.3)

Variable                 Detention   Moderate Consequence   In-School Suspension   OSS and Expulsion   Other/Unknown
Infractions(b)
  Minor Misbehaviors       0.49**          0.27**                0.27**                 0.28**             1.07
  Disruption               0.33**          0.40**                0.59**                 1.09               0.20**
  Noncompliance            0.31**          0.43**                0.75**                 1.45**             0.24**
  Moderate Infractions     0.31**          0.66**                1.22**                 6.70**             0.34**
  Major Violations         0.38**          1.93**                1.44**                 6.89**             0.50**
  Use/Possession           0.25**          1.74*                 1.98**                55.65**             0.58*
  Other/Unknown            0.51**          1.19*                 1.05                   3.13**             0.50**
Race(b)
  Hispanic/Latino          1.06*           0.77**                0.99                   1.58**             0.90**
  African American         0.58**          0.53**                0.82**                 1.12**             0.73**
  Other/Unknown            NA              NA                    NA                     NA                 NA

Note. Detention = detention; Minor Consequences = time in office, loss of privileges, conference with student, parent contact, individualized instruction; Moderate Consequences = Saturday school, bus suspension, restitution; In-School Suspension = in-school suspension; OSS and Expulsion = out-of-school suspension, expulsion; Other/Unknown = other administrative decision, unknown administrative decision; K–6 = kindergarten through Grade 6; 6–9 = Grade 6 through Grade 9; NA = not available.
(a) Reference category is Minor Consequences for outcome.
(b) Reference category is Tardy/Truancy for Infractions and White for Race/Ethnicity.
*p < .05. **p < .01.
Detention
1.22** 0.58** NA
1.27 0.94 NA
1.30** 1.04 NA
Raceb
Hispanic/Latino African American Unknown/All Others N Model 2 Nagelkerke Pseudo R 2
Hispanic/Latino African American Unknown/All Others N Model 2 Nagelkerke Pseudo R 2
Hispanic/Latino African American Unknown/All Others N Model 2 Nagelkerke Pseudo R 2
0.72 0.70** NA
1.03 0.94 NA
0.85 1.24* NA
Moderate Consequences
0.72** 0.81** NA 21,374 141 0.007
0.78 0.83* NA 8,203 52 0.007
3.01** 1.76** NA 56,884 1,673 0.033
In-School Suspension
K–6 Schools
1.24* 1.22** NA
1.28 1.54** NA
2.06* 3.75** NA
OSS and Expulsion
0.63* 0.53** NA
0.82 0.44** NA
Disruption 1.36 1.08 0.84 0.54** NA NA
Noncompliance 1.06 1.09 0.92 0.47** NA NA
Moderate Consequences
0.80 0.46** NA
Detention
Minor Misbehaviors 0.51** 1.05 1.03 0.70** NA NA
Other/ Unknown
1.05 0.88** NA 34,642 887 0.026
0.99 0.93 NA 18,570 498 0.028
0.36** 0.52** NA 23,181 1,209 0.054
In-School Suspension
6–9 Schools
1.25** 0.99 NA
1.59** 1.35** NA
1.83** 0.95 NA
OSS and Expulsion
0.94 0.90* NA
1.12 1.19** NA
0.27* 0.62* NA
Other/ Unknown
Table 5 Multinomial Logit Regression of the Influence of Race on Administrative Decisions in Individual Infraction: Odds Ratiosa
1.27** 0.91* NA
1.12 0.85 NA
Hispanic/Latino African American Unknown/All Others N Model 2 Nagelkerke Pseudo R 2
Hispanic/Latino African American Unknown/All Others N Model 2 Nagelkerke Pseudo R 2
Hispanic/Latino African American Unknown/All Others N Model 2 Nagelkerke Pseudo R 2
Detention
Raceb
1.01 0.73 NA
0.48** 1.32** NA
Moderate Consequences
0.69 0.90 NA 2,978 47 0.017
0.89 0.98 NA 34,792 504 0.015
In-School Suspension
K–6 Schools
1.87** 2.02** NA
1.59** 1.84** NA
OSS and Expulsion
1.11 0.45** NA
0.39 0.33 NA
Major Violations 1.03 1.47 0.87 0.52** NA NA
Use/Possessionc 1.42 0.34 NA
Moderate Consequences
0.72* 0.59** NA
Detention
Moderate Infractions 1.08 1.01 1.14* 0.52** NA NA
Other/ Unknown
Table 5 Continued
1.00 0.29** NA 1,015 40 0.047
0.94 0.80 NA 2,120 84 0.040
1.03 0.88** NA 24,520 481 0.020
In-School Suspension
6–9 Schools
2.32 0.55 NA
1.93** 1.30 NA
1.74** 1.13** NA
OSS and Expulsion
0.67 0.38 NA
0.96 1.14 NA
1.34** 1.28** NA
Other/ Unknown
1.32 1.10 NA
Hispanic/Latino African American Unknown/All Others N Model 2 Nagelkerke Pseudo R 2
0.55 0.83 NA
Moderate Consequences
0.62* 1.38** NA 4,506 186 0.042
In-School Suspension
2.12** 1.69** NA
OSS and Expulsion
1.06 0.48** NA
Tardy/Truancyc 1.34** 0.63** NA
Moderate Consequences
0.71 1.19 NA
Detention
Other/Unknown 0.99 0.94 0.61** 0.57** NA NA
Other/ Unknown
1.86** 0.62** NA 16,384 823 0.051
0.93 1.00 NA 6,328 403 0.064
In-School Suspension
6–9 Schools
1.66** 1.37** NA
3.99** 1.67** NA
OSS and Expulsion
2.28** 0.38** NA
1.70** 0.81* NA
Other/ Unknown
Note. Detention = detention; Moderate Consequences = Saturday school, bus suspension, restitution; In-School Suspension = in-school suspension; OSS and Expulsion = out-of-school suspension, expulsion; Other/Unknown = other administrative decision, unknown administrative decision; K–6 = kindergarten through Grade 6; 6–9 = Grade 6 through Grade 9; NA = not available. Each ODR/school level combination in this analysis represents a separate model. Thus, with separate N values for each model, it is important to note that the odds ratios across models are not directly comparable. (a) Outcome reference category is Minor Consequences. (b) Race/Ethnicity reference category is White. (c) Model could not be estimated because of zero counts in one or more cells. *p < .05. **p < .01.
Hispanic/Latino African American Unknown/All Others N Model 2 Nagelkerke Pseudo R 2
Detention
Raceb
K–6 Schools
Table 5 Continued
Discussion

We conducted a disaggregated analysis of a detailed, nationally representative data set in order to provide a more comprehensive picture of disproportionality in discipline across racial/ethnic categories and school levels. The results indicate that, across an extensive national sample, significant disparities exist for African American and Latino students in school discipline. Patterns are complex and moderated by type of offense, race/ethnicity, and school level. Nevertheless, the overall pattern of results indicates that both initial referral to the office and the administrative decisions made as a result of that referral contribute significantly to racial and ethnic disparities in school discipline.

Across a national sample, African American students have twice the odds of receiving ODRs compared to White students at the elementary level, and almost four times the odds of being referred to the office at the middle school level. A different pattern of disproportionality emerges for Hispanic students, with significant overrepresentation (OR = 1.71) at the middle school level but significant underrepresentation (OR = 0.76) at the elementary school level. These results are consistent with previous research indicating ubiquitous overrepresentation in school discipline for African American students, but inconsistent evidence of disparities for Latino students (Gordon, Della Piana, & Keleher, 2000; Skiba & Rausch, 2006). It is possible that the striking shift found in the current study from Hispanic under- to overrepresentation in ODRs as one moves to the middle school level may help explain some of the inconsistency in previous findings. For African American students at the elementary school level, and for both African American and Latino students at the middle school level, disparities in rates of referral were widespread across referral types.

One explanation for racial and ethnic disproportionality in school discipline is that such disparities are primarily a result of socioeconomic disadvantage (National Association of Secondary School Principals, 2002); that is, African American students, overexposed to the stressors of poverty, are more likely to be undersocialized with respect to school norms and rules. Yet previous research has found no evidence that disciplinary disproportionality can be explained to any significant degree by poverty (Wallace et al., 2008; Wu et al., 1982). More important, there appears to be little support for a hypothesis that African American students act out more in similar school or district situations (McFadden et al., 1992; McCarthy & Hoge, 1987; Wu et al., 1982). The current middle school finding that the types of ODR most likely to lead to disparate African American discipline are disruption and noncompliance is consistent with a growing body of previous research suggesting that the types of referrals in which disproportionality is evident are most likely to be in categories that are more interactive and subjectively interpreted, such as defiance (Gregory & Weinstein, 2008) and disrespect (Skiba et al., 2002).

One important premise of the present research is that effective disciplinary systems are more likely to have at their core a graduated model, in which more serious consequences are reserved for more serious infractions. Although zero tolerance disciplinary philosophy and practice have focused on "sending a message" to potentially disruptive students by applying relatively harsh punishments for both minor and more serious infractions (Skiba & Rausch, 2006), reviews of the evidence regarding school discipline have found little evidence supporting the effectiveness of such an approach (American Psychological Association, 2008). Rather, a graduated discipline model in which the severity of consequences is scaled in proportion to the seriousness of the infraction, often in conjunction with a tiered model of discipline (Sugai, 2007), appears to hold far more promise as an effective and efficient method for organizing school disciplinary policy and practice. Prior to disaggregation of the data (Model 1 in Table 4), the current data seem to show some evidence of such a pattern of graduated discipline. That is, the odds for all students in this sample of receiving a suspension or expulsion for minor misbehavior are extremely low at both the elementary and middle school levels, and they gradually increase such that it becomes highly likely that use and possession of weapons or drugs will result in a suspension or expulsion. Although, given the absence of information on the extent of SWPBS implementation for this sample, it is
impossible to know whether SWPBS implementation was related to that finding, these data do suggest that many of the school disciplinary systems in the current sample are, in general, organized in a way that could be expected to be efficient and effective.

In the logistic regressions regarding student infraction, consequence, and race/ethnicity (Table 4), both the various types of infractions and the racial/ethnic categories enter the equation significantly at both the elementary and middle school levels. The failure of race to substantially alter the pattern of ORs for infractions means that the nature of the recorded infraction is an important contributor to the severity of consequences received, regardless of student race or ethnicity. At the same time, the significance of race in predicting suspension and expulsion even after the inclusion of infraction means that, regardless of the type of infraction, race/ethnicity makes a significant contribution to the type of consequence chosen for a given infraction. The pattern of differential treatment is even more clearly articulated by the pattern of administrative decisions for various infractions (Table 5). At the elementary school level, African American students were more likely than White students to be suspended or expelled for any offense, and Latino students were more likely to be suspended for all offenses except disruption. In particular, African American students have almost four times the odds, and Hispanic students twice the odds, of being suspended or expelled for a minor infraction at the elementary school level. Although the pattern of overrepresentation in out-of-school suspension/expulsion is somewhat less pronounced at the middle school level, there is still substantial evidence of differential treatment. African American students were significantly more likely than White students to be suspended or expelled for disruption, moderate infractions, and tardy/truancy, while Latino students were more likely to be suspended or expelled in Grade 6–9 schools for all infractions except use/possession. In addition, underrepresentation in the use of minor or moderate consequences appears to be more pronounced at the middle school level, especially for African American students. Although specific patterns differ by race/ethnicity and school level, these findings are again consistent with previous investigations (e.g., McFadden et al., 1992; Shaw & Braden, 1990) that have found evidence of differential processing (Gregory et al., 2010) at the administrative level.

In summary, these results suggest that both differential selection at the classroom level and differential processing at the administrative level make significant contributions to the disproportionate representation of African American and Latino students in school discipline. For African American students, disproportionality at both the elementary and middle school levels begins at referral, most particularly in the areas of tardiness/truancy, noncompliance, and general disruption; for Latino students, disparities in initial ODRs emerge at the middle school level. Yet regardless of previous disproportionality at referral, the type of infraction, or the school level, the findings from this study indicate that students of different races and ethnicities are treated differently at the administrative level, with students of color being more likely to receive more serious consequences for the same infraction. As an investigation of extant data, this study was able to identify the existence of disproportionate outcomes at the classroom and administrative levels but, without local observation, was unable to specify the classroom or school variables that create such imbalances. Further research, in particular ethnographic or observational studies that can isolate specific teacher-student or administrator-student interactions, is essential for increasing understanding of the variables contributing to racial and ethnic disparities in school discipline.

The focus of this article was not on intervention per se, but these results may hold important implications for monitoring the effects of interventions intended to address disciplinary disproportionality. Although the efficacy of SWPBS in reducing rates of ODRs has been well demonstrated (Barrett, Bradshaw, & Lewis-Palmer, 2008; Bradshaw, Koth, Thornton, & Leaf, 2009; Bradshaw,
Mitchell, & Leaf, 2009; Horner et al., 2009; Nelson, Martella, & Marchand-Martella, 2002; Safran & Oswald, 2003; Taylor-Greene & Kartub, 2000), few investigations (Jones et al., 2006) have explored the issue of PBS and cultural variation, or sought to explore how the application of such school-wide systems will affect rates of disciplinary disproportionality. Kauffman, Conroy, Gardner, and Oswald (2008) have argued that there is no evidence that behavioral interventions operate differently based on ethnicity, gender, or religion, but they also noted that differential effects based on race and ethnicity have been understudied in the behavioral literature, and that "many studies in leading behavioral journals . . . do not report sufficient detail about the cultural identities of participants" (p. 255). Until a sufficient database has been accumulated on interventions for reducing disproportionate representation in school discipline outcomes, it seems logical that implementations of interventions designed to affect student behavior in school should disaggregate their results, in order to empirically explore the extent to which those interventions work equally well for all groups. The current data demonstrate a marked discrepancy between the aggregated data, in which the severity of infraction and consequence are relatively well matched, and the disaggregated data, which show that African American and Latino students receive more severe punishment for the category "minor misbehavior." Such a pattern of results is consistent with emerging research on culturally responsive pedagogy and classroom management (e.g., Harris-Murri, King, & Rostenberg, 2006; Serpell, Hayling, Stevenson, & Kern, 2009; Utley, Kozleski, Smith, & Draper, 2002) in suggesting that it cannot be assumed that interventions intended to improve behavior will be effective to the same degree for all groups. Existing racial and ethnic differences in the use of current disciplinary interventions strongly indicate that, for any intervention strategy aimed at reducing such disparities, disciplinary outcome data should be disaggregated, in order to explicitly evaluate whether SWPBS, or indeed any general intervention, is equally effective for all racial/ethnic groups.

Limitations

The present investigation was not able to explicitly test the influence of SES on the tested relationships. Despite widespread beliefs to the contrary, there is no previous evidence that the overrepresentation of African American or Latino students in school disciplinary outcomes can be fully explained by individual or community economic disadvantage (Skiba et al., 2002; Wallace et al., 2008; Wu et al., 1982). Further investigation is needed, however, to parse the relative contributions of individual, classroom, and school characteristics to disciplinary disproportionality, including both SES and the complex effects of gender (see, e.g., Wallace et al., 2008). In addition, although we were able to explore variations across educational levels and racial/ethnic categories to arrive at a more complex rendering of disciplinary disproportionality, we were not able to analyze the data by geographic location or school locale. It seems highly likely that the variables contributing to racial and ethnic disparities will vary considerably by location and locale, especially for groups, such as Hispanic students, for whom previous research has shown inconsistent results. It may well be that the specific causes of racial disparities are regionally unique, requiring local analysis of causes and conditions (Skiba et al., 2008), in much the same way that functional behavior analysis is used at the individual level to develop individualized behavior plans tailored to the needs of each child in each situation. Finally, using the school as the unit of analysis restricted our ability to investigate the contribution of prior infractions, a variable that might well be expected to have a significant effect on administrative decisions regarding disciplinary consequences. It is important to note, however, that there is no previous research that we are aware of that explores the association of students' prior record of school infractions with racial and ethnic disproportionality in school discipline.
There is an exceptionally long history in this nation of accepting a stereotype of African Americans, especially African American males, as being more prone to disordered behavior or criminality (see, e.g., Muhammed, 2010), often with little or no supporting evidence. With no evidence supporting the notion that there are concurrently higher levels of disruption among African American students, we see no reason to presume that disparate rates of discipline between racial and ethnic groups can be explained by differential behavioral histories.

Summary and Recommendations

The fact of racial/ethnic disproportionality in school discipline has been widely and, we would argue, conclusively demonstrated. Across urban and suburban schools, quantitative and qualitative studies, and national and local data, African American and, to some extent, Latino students have been found to be subject to a higher rate of disciplinary removal from school. These differences do not appear to be explainable solely by the economic status of those students, nor by a higher rate of disruption among students of color. Opportunity to remain engaged in academic instruction is arguably the single most important predictor of academic success. In the absence of an evidence-based rationale that could explain widespread disparities in disciplinary treatment, it must be concluded that the ubiquitous differential removal of African American and Latino students from the opportunity to learn represents a violation of the civil rights protections that have developed in this country since Brown v. Board of Education.

We propose here that the existing empirical evidence for disproportional school discipline by race, and the severe effect of exclusionary discipline on educational success, make the disproportional application of exclusionary discipline an issue in need of immediate and substantive response. At the school level, (a) data on discipline by race should be reported regularly (monthly) to faculty, (b) policies focused on prevention and culturally responsive practice should be encouraged, and (c) investment in developing appropriate social behaviors should be made before resorting to exclusionary consequences. At the district and state level, (a) disaggregated data on discipline patterns should be available and disseminated, (b) policies addressing disciplinary inequity and promoting equity should be established, and (c) personnel development options should be made available to minimize the disproportionate application of discipline. At the federal level, (a) research funding is needed to move beyond mere description of disproportionality to clear documentation of causal mechanisms and functional interventions for reducing disparate outcomes; (b) resources are needed to document the technical assistance and implementation strategies that will allow state- and district-wide responses to disproportionate use of discipline; and (c) as is currently the case for disproportionality in special education, federal monitoring practices should regularly require disaggregated reporting of discipline patterns and mandate the development and implementation of corrective action plans where disparities are found.

Racial and ethnic disparities that leave students of color behind remain ubiquitous in American education. The national report Breaking Barriers (Caldwell, Sewell, Parks, & Toldson, 2009; Toldson, 2008) found that while personal, family, and community factors all contribute to such disparities, so do school and teacher characteristics, such as student perceptions of being respected and supported by teachers and perceptions of school safety. To the extent that the policies and practices of schools maintain or widen racial gaps, it is imperative that policy makers and educators search for school-based solutions that can contribute to reducing racial and ethnic disparities in important educational outcomes. All children deserve access to effective educational settings that are predictable, positive, consistent, safe, and equitable. Access to educational achievement requires the support needed to be socially successful in school. This typically involves not simply ensuring that problem behavior is addressed equitably, but investing in building school cultures in which appropriate behavior is clearly defined,
actively taught, and consistently acknowledged. For race to become a socially neutral factor in education, all levels of our educational system must be willing to make a significant investment devoted explicitly to altering currently inequitable discipline patterns, to ensure that our instructional and disciplinary systems afford all children an equal opportunity for school learning.

References

American Psychological Association Zero Tolerance Task Force. (2008). Are zero tolerance policies effective in the schools? An evidentiary review and recommendations. American Psychologist, 63, 852–862. Barrett, S., Bradshaw, C., & Lewis-Palmer, T. (2008). Maryland state-wide PBIS initiative. Journal of Positive Behavior Interventions, 10, 105–114. Blanchett, W. J., Mumford, V., & Beachum, F. (2005). Urban school failure and disproportionality in a post-Brown era: Benign neglect of the constitutional rights of students of color. Remedial and Special Education, 26, 70–81. Bradshaw, C., Koth, C., Thornton, L., & Leaf, P. (2009). Altering school climate through School-wide Positive Behavioral Interventions and Supports: Findings from a group randomized effectiveness trial. Prevention Science, 10, 100–115. Bradshaw, C., Mitchell, M., & Leaf, P. (2009). Examining the effects of school-wide positive behavioral interventions and supports on student outcomes: Results from a randomized controlled effectiveness trial in elementary schools. Journal of Positive Behavior Interventions, 12, 133–148. Brantlinger, E. (1991). Social class distinctions in adolescents' reports of problems and punishment in school. Behavioral Disorders, 17, 36–46. Brophy, J. E. (1988). Research linking teacher behavior to student achievement: Potential implications for instruction of Chapter 1 students. Educational Psychologist, 23, 235–286. Browne, J. A., Losen, D. J., & Wald, J. (2002). Zero tolerance: Unfair, with little recourse. In R. J. Skiba & G. G. Noam (Eds.), New directions for youth development (No. 92, Zero tolerance: Can suspension and expulsion keep schools safe?) (pp. 73–99). San Francisco: Jossey-Bass. Caldwell, L. D., Sewell, A. A., Parks, N., & Toldson, I. A. (2009). Before the bell rings: Implementing coordinated school health models to influence the academic achievement of African American males. Journal of Negro Education, 78, 204–215. Cohen, J., Cohen, P., West, S. G., & Aiken, L. (2003). Applied multiple regression/correlation analysis for the behavioral sciences (3rd ed.). Mahwah, NJ: Lawrence Erlbaum. Donovan, M. S., & Cross, C. T. (Eds.). (2002). Minority students in special and gifted education. Washington, DC: National Academies Press. Elementary and Secondary Education Act. Pub. L. 89–10, 79 Stat. 77, 20 U.S.C. ch. 70 (1965).
Ekstrom, R. B., Goertz, M. E., Pollack, J. M., & Rock, D. A. (1986). Who drops out of high school and why? Findings from a national study. Teachers College Record, 87, 357–373. Ferguson, A. A. (2001). Bad boys: Public schools and the making of Black masculinity. Ann Arbor: University of Michigan Press. Finn, J. D. (1982). Patterns in special education placement as revealed by the OCR survey. In K. A. Heller, W. H. Holtzman, & S. Messick (Eds.), Placing children in special education: A strategy for equity (pp. 322–381). Washington, DC: National Academy of Sciences National Academy Press. Gordon, R., Della Piana, L., & Keleher, T. (2000). Facing the consequences: An examination of racial discrimination in U.S. public schools. Oakland, CA: Applied Research Center. Gottfredson, D. C., Gottfredson, G. D., & Hybl, L. G. (1993). Managing adolescent behavior: A multiyear, multischool study. American Educational Research Journal, 30, 179 –215. Greene, W. H. (2008). Econometric analysis (6th ed.). Upper Saddle River, NJ: Pearson/Prentice Hall. Greenwood, C. R., Horton, B. T., & Utley, C. A. (2002). Academic engagement: Current perspectives on research and practice. School Psychology Review, 31, 328 –349. Gregory, A., Skiba, R., & Noguera, P. (2010). The achievement gap and the discipline gap: Two sides of the same coin? Educational Researcher, 39, 59 – 68. Gregory, J. F. (1997). Three strikes and they’re out: African-American boys and American schools’ responses to misbehavior. International Journal of Adolescence and Youth, 7(1), 25–34. Gregory, A., & Weinstein, S. R. (2008). The discipline gap and African Americans: Defiance or cooperation in the high school classroom. Journal of School Psychology, 46, 455– 475. Harris-Murri, N., King, K., & Rostenberg, D. (2006). Reducing disproportionate minority representation in special education programs for students with emotional disturbances: Toward a culturally responsive response to intervention model. Education and Treatment of Children, 29, 779 –799. Hawkins, J. D., Doueck, H. J., & Lishner, D. M. (1988). Changing teaching practices in mainstream classrooms to improve bonding and behavior of low achievers. American Educational Research Journal, 25, 31–50. Horner, R. H., Sugai, G., Smolkowski, K., Eber, L., Nakasato, J., Todd, A., & Esperanza, J. (2009). A randomized, waitlist-controlled effectiveness trial assessing school-wide positive behavior support in elementary schools. Journal of Positive Behavior Interventions, 11(3), 133–144. Individuals with Disabilities Education Improvement Act of 2004. Pub. L. No. 108 – 446, 20 U.S.C. 1400 et seq. (2004). Jones, C., Caravaca, L., Cizek, S., Horner, R., & Vincent, C. G. (2006). Culturally responsive schoolwide positive behavior support: A case study in one school with a high proportion of Native American students. Multiple Voices for Ethnically Diverse Exceptional Learners, 9, 108 –119. Kauffman, J. M., Conroy, M., Gardner, R., & Oswald, D. (2008). Cultural sensitivity in the application of behav105
ior principles to education. Education and Treatment of Children, 31, 239 –262. Ladson-Billings, G. (2006). From the achievement gap to the education debt: Understanding achievement in U.S. schools. Educational Researcher, 35, 3–12. May, S., Ard, W., Todd, A. W., Horner, R. H., Glasgow, A., Sugai, G., et al. (2006). School-wide information system. Eugene: Educational and Community Supports, University of Oregon. McCarthy, J. D., & Hoge, D. R. (1987). The social construction of school punishment: Racial disadvantage out of universalistic process. Social Forces, 65, 1101– 1120. McFadden, A. C., Marsh, G. E., Prince, B. J., & Hwang, Y. (1992). A study of race and gender bias in the punishment of handicapped school children. Urban Review, 24, 239 –251. McLoyd, V. C. (1998). Socioeconomic disadvantage and child development. American Psychologist, 53, 185– 204. Morrison, G. M., Anthony, S., Storino, M., Cheng, J., Furlong, M. F., & Morrison, R. L. (2001). School expulsion as a process and an event: Before and after effects on children at-risk for school discipline. New Directions for Youth Development: Theory, Practice, Research, 92, 45–72. Muhammed, K. G. (2010). The condemnation of blackness: Race, crime, and the making of modern urban America. Cambridge, MA: Harvard University Press. National Association of Secondary School Principals. (2000, February). Statement on civil rights implications of zero tolerance programs. Testimony presented to the United States Commission on Civil Rights, Washington, DC. Neal, L. I., McCray, A. D., Webb-Johnson, G., & Bridgest, S. T. (2003). The effects of African American movement styles on teachers’ perceptions and reactions. Journal of Special Education, 37, 49 –57. Nelson, J. R., Martella, R. M., & Marchand-Martella, N. (2002). Maximizing student learning: The effects of a comprehensive school-based program for preventing problem behaviors. Journal of Emotional and Behavioral Disorders, 10, 136 –148. No Child Left Behind Act of 2001, 20 U.S.C. § 6319 (2008). Oswald, D. P., Coutinho, M. J., Best, A. M., & Singh, N. N. (1999). Ethnic representation in special education: The influence of school-related economic and demographic variables. The Journal of Special Education, 32, 194 –206. Pigott, R. L., & Cowen, E. L. (2000). Teacher race, child race, racial congruence, and teacher ratings of children’s school adjustment. Journal of School Psychology, 38, 177–196. Raffaele Mendez, L. M., & Knoff, H. M. (2003). Who gets suspended from school and why: A demographic analysis of schools and disciplinary infractions in a large school district. Education and Treatment of Children, 26, 30 –51. Safran, S. P., & Oswald, K. (2003). Positive behavior supports: Can schools reshape disciplinary practices? Exceptional Children, 69, 361–373. Serpell, Z., Hayling, C. C., Stevenson, H., & Kern, L. (2009). Cultural considerations in the development of school-based interventions for African American ado106
lescent boys with emotional and behavioral disorders. Journal of Negro Education, 78, 321–332. Shaw, S. R., & Braden, J. P. (1990). Race and gender bias in the administration of corporal punishment. School Psychology Review, 19, 378 –383. Skiba, R. J., & Rausch, M. K. (2006). Zero tolerance, suspension, and expulsion: Questions of equity and effectiveness. In C. M. Evertson & C. S. Weinstein (Eds.), Handbook of classroom management: Research, practice, and contemporary issues (pp. 1063– 1089). Lawrence Erlbaum Associates. Skiba, R. J., Michael, R. S., Nardo, A. C., & Peterson, R. (2002). The color of discipline: Sources of racial and gender disproportionality in school punishment. Urban Review, 34, 317–342. Skiba, R. J., Simmons, A. B., Ritter, S., Gibbs, A. C., Karega Rausch, M., Cuadrado, J., et al. (2008). Achieving equality in special education: History, status, and current challenges. Exceptional Children, 74, 264 –288. Spaulding, S., Irvin, L., Horner, R., May, S., Emeldi, M., Tobin, T., & Sugai, G. (2010). School-wide social behavioral climate, student problem behavior, and administrative decisions: Empirical patterns from 1510 schools nationwide. Journal of Positive Behavioral Interventions, 12, 69 – 85. Sugai, G. (2007, August). Response to intervention approach to school-wide discipline. Paper presented at the OSEP Response to Intervention Summit, Washington, DC. Sugai, G., & Horner, R. H. (2006). A promising approach for expanding and sustaining school wide positive behavior support. School Psychology Review, 35, 246 – 259. Sugai, G., Horner, R. H., & Gresham, F. (2002). Behaviorally effective school environments. In M. R. Shinn, G. Stoner, & H. M. Walker (Eds.), Interventions for academic and behavior problems: Preventive and remedial approaches (pp. 315–350). Silver Spring, MD: National Association for School Psychologists. Sugai, G., & Lewis, T. J. (1999). Developing Positive Behavioral Support Systems. In G. Sugai & T. J. Lewis (Eds.), Developing positive behavioral support for students with challenging behaviors. Council for Children with Behavioral Disorders, 15–23. Taylor-Greene, S. J., & Kartub, D. T. (2000). Durable implementation of school-wide behavior support: The high five program. Journal of Positive Behavior Interventions, 2, 233–245. Todd, A. W., Haugen, L., Anderson, K., & Spriggs, M. (2002). Teaching recess: Low-cost efforts producing effective results. Journal of Positive Behavioral Interventions, 4, 46 –52. Toldson, I. A. (2008). Breaking barriers: Plotting the path to academic success for school-age AfricanAmerican males. Washington, DC: Congressional Black Caucus Foundation. Townsend, B. L. (2000). The disproportionate discipline of African-American learners: Reducing school suspensions and expulsions. Exceptional Children, 66, 381–391. Utley, C. A., Kozleski, E., Smith, A., & Draper, I. L. (2002). Positive behavior support: A proactive strategy for minimizing behavior problems in multicultural youth. Journal of Positive Behavior Interventions, 4, 196 –207.
Vavrus, F., & Cole, K. (2002). “I didn’t do nothin’”: The discursive construction of school suspension. The Urban Review, 34, 87–111. Wald, J., & Losen, D. J. (2003). Defining and redirecting a school-to-prison pipeline. In J. Wald & D. J. Losen (Eds.), New directions for youth development (No. 99; Deconstructing the school-to-prison pipeline) (pp. 9 –15). San Francisco: Jossey-Bass. Wald, J., & Losen, D. J. (2007). Out of sight: The journey through the school-to-prison pipeline. In S. Books (Ed.) Invisible children in the society and its schools (3rd ed.) (pp. 23–27). Mahwah, NJ: Lawrence Erlbaum Associates. Wallace, J. M., Jr., Goodkind, S. G., Wallace, C. M., & Bachman, J. (2008). Racial/ethnic and gender differences in school discipline among American high school students: 1991–2005. Negro Educational Review, 59, 47– 62. Wu, S. C., Pink, W. T., Crain, R. L., & Moles, O. (1982). Student suspension: A critical reappraisal. The Urban Review, 14, 245–303.
Zimmerman, R. S., Khoury, E. L., Vega, W. A., Gil, A. G., & Warheit, G. J. (2006). Teacher and parent perceptions of behavior problems among a sample of African American, Hispanic, and Non-Hispanic White students. American Journal of Community Psychology, 23, 181–197. Zumwalt, K., & Craig, E. (2005). Teachers’ characteristics: Research on the indicators of quality. In M. Cochran-Smith & K. M. Zeichner (Eds.), Studying teacher education: The report of the AERA Panel on Research and Teacher Education (pp. 111–156). Mahwah, NJ: Lawrence Erlbaum.
Date Received: September 23, 2009
Date Accepted: January 6, 2011
Action Editor: George Bear
Russell J. Skiba is Professor in the School Psychology Program and Director of the Equity Project at Indiana University. His current research focus includes racial and ethnic disproportionality in school discipline and special education, and developing effective school-based systems for school discipline and school violence. Robert H. Horner is Alumni-Knight Endowed Professor of Special Education at the University of Oregon. His research focuses on severe disabilities, instructional design, positive behavior support, and implementation science. Choong-Geun Chung is a statistician at the Center for Evaluation and Education Policy at Indiana University in Bloomington. His research interests are analytical issues in racial and ethnic disproportionality in special education and in school discipline, and statistical models for access and persistence in higher education. M. Karega Rausch currently serves as the Director of the Office of Education Innovation for Indianapolis Mayor Greg Ballard. He is currently completing his doctorate in school psychology at Indiana University-Bloomington. Rausch's research interests include equity in school discipline and special education. Seth L. May is a research assistant and software developer at Educational and Community Supports in the University of Oregon's College of Education. His efforts currently focus on developing research databases and large software systems for managing, collecting, and reporting on student behavior. Tary J. Tobin is a Research Associate and an Adjunct Assistant Professor of Special Education at the University of Oregon. Her current research interests include staff development, cultural and linguistic diversity, and use of prevention science and technology in school and community interventions.
School Psychology Review, 2011, Volume 40, No. 1, pp. 108–131
Potential Bias in Predictive Validity of Universal Screening Measures Across Disaggregation Subgroups John L. Hosp University of Iowa Michelle A. Hosp Iowa Department of Education Janice K. Dole University of Utah Abstract. Universal screening measures are an integral component of any tiered system of instructional delivery. Recent studies of screening measures have often excluded examinations of bias in predictive validity. The present study examined a common screening instrument for evidence of bias in predictive validity across the four disaggregation categories of the No Child Left Behind Act. Performance of 3,805 students in Grades 1–3 on the Nonsense Word Fluency and Oral Reading Fluency measures of the Dynamic Indicators of Basic Early Literacy Skills were examined cross-sectionally in relation to a state criterion-referenced test. Bias in predictive validity was found, but varied by grade and by disaggregation category. Implications are discussed.
Universal screening is a crucial component of any comprehensive system of assessment (Salvia, Ysseldyke, & Bolt, 2009), especially those used within a problem-solving or response to intervention framework (Batsche et al., 2005). Efficient and effective delivery of the most appropriate interventions to the right students requires a consistent and accurate process of identifying which students need what help (Hosp & Ardoin, 2008). While providing this help, it is also crucial to be able to accurately and efficiently judge each student’s response (Barnett et al., 2007). Universal screening involves the assessment of all students within a classroom, grade, school, or
district on measures that are valid indicators of important academic or social/emotional outcomes (Ikeda, Neessen, & Witt, 2008). These assessments should be quick to administer and score, and they should provide information that leads to valid inferences about those outcomes (Hosp & Ardoin, 2008). These inferences are the decisions that need to be made in identifying each student's level of need as well as in grouping students with similar needs.
Correspondence regarding this article should be addressed to John L. Hosp, College of Education, N264 Lindquist Center, Iowa City, IA 52242; e-mail: [email protected]
Copyright 2011 by the National Association of School Psychologists, ISSN 0279-6015
Given the recent focus on universal screening that has come from a renewed emphasis on problem solving in delivering educational services, it is no surprise that there have been recent advances in the development of screening measures (Catts, Fey, Zhang, & Tomblin, 2001; Foorman, Francis, Fletcher, Schatschneider, & Mehta, 1998; O'Connor & Jenkins, 1999). However, there has also been increased scrutiny to ensure that these measures yield reliable data that accurately classify students as needing intervention or not. Ritchey and Speece (2004) explored characteristics of screening assessment that should be considered in the early identification of reading disabilities. Differences in the skills measured, the performance tasks required, and content coverage are all characteristics that can affect the classification accuracy of a measure. The timing of a measure is important in terms of both the interval between screening measurement and outcome measurement and the point in a developmental sequence at which the screening takes place. Selection of the outcome is also important, as prediction of more proximal outcomes is likely to be more accurate than prediction of more distal ones. Jenkins, Hudson, and Johnson (2007) suggested additional factors to consider when developing and using screening measures, such as accounting for the severity of the problem, the number of risk levels used (a dichotomous at risk/not at risk decision or a polytomous system), and the inclusion of cross-validation, which is a crucial component of the development of any measure (Haladyna, 2006).

Predictive Validity

A key component in the determination of the quality of a screening measure is its predictive validity. Predictive validity is an indication of how well performance on a criterion measure is predicted by performance on a screening measure when there is a difference in the time of administration (typically 3–5 months) between the two measures (Salvia et al., 2009). The criterion measure is also described as a meaningful outcome (Ikeda et al., 2008), such as performance on the state high-stakes test.
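As a simple illustration (simulated data, not the study's dataset), a predictive validity coefficient of this kind is just the correlation between a fall screening score and a later criterion score for the same students:

```python
# Simulated illustration of a predictive validity coefficient: correlate a fall
# ORF screening score with a spring criterion (e.g., state test) score.
import numpy as np

rng = np.random.default_rng(0)
fall_orf = rng.normal(60, 25, size=500).clip(min=0)        # words correct per minute
spring_criterion = 300 + 1.2 * fall_orf + rng.normal(0, 30, size=500)

r = np.corrcoef(fall_orf, spring_criterion)[0, 1]
print(round(r, 2))  # roughly .70 here; the text reports typical values of .65-.75
```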
Researchers have evaluated the predictive validity of Nonsense Word Fluency (NWF) and Oral Reading Fluency (ORF) as compared to norm-referenced tests of reading (e.g., the Woodcock Reading Mastery Test; Woodcock, 1998; see Ritchey, 2008) as well as the mandated high-stakes state tests of many states (e.g., Florida: Buck & Torgesen, 2002; Washington: Stage & Jacobson, 2001). Predictive validity coefficients for both NWF and ORF typically average between .65 and .75 (cf. Hintze & Silberglitt, 2005; Ritchey, 2008; Roehrig, Petscher, Nettles, Hudson, & Torgesen, 2007; Shanahan, 2003). Catts, Petscher, Schatschneider, Bridges, and Mendoza (2009) recently extended this predictive validity work by looking for floor effects that might influence the predictive accuracy of screening measures, particularly Initial Sound Fluency, Phoneme Segmentation Fluency, NWF, and ORF from the Dynamic Indicators of Basic Early Literacy Skills (DIBELS; Good et al., 2004). Floor effects occur when the lower end of the performance range for a scale does not go low enough to adequately describe participants' performance (Drew, Hardman, & Hosp, 2008). They are demonstrated by a large number of individuals receiving scores near the minimum possible score, such that the scores for a group are "bunched" near the minimum. This can have a negative effect on predictive validity because of the restriction in the range of participants' performance. Using nearly 19,000 students from Florida, Catts et al. (2009) demonstrated that floor effects were present and that they reduced the predictive validity of the data. The effect on predictive validity was most pronounced in kindergarten and Grade 1, with a decreasing effect in Grade 2 and little to no effect in Grade 3. Catts et al. concluded that more sensitive measures of early literacy are needed to overcome these floor effects and their effect on predictive validity.
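A small simulated sketch (not drawn from Catts et al.) of how a floor effect attenuates predictive validity:

```python
# Simulated floor effect: when a screener cannot register low performance, the
# restricted range at the bottom attenuates its correlation with the criterion.
import numpy as np

rng = np.random.default_rng(1)
true_skill = rng.normal(0, 1, size=2_000)
criterion = true_skill + rng.normal(0, 0.5, size=2_000)
screener = true_skill + rng.normal(0, 0.5, size=2_000)

floored = np.maximum(screener, 0.0)  # scores "bunched" at the minimum for ~half the sample

print(round(np.corrcoef(screener, criterion)[0, 1], 2))  # ~.80 without the floor
print(round(np.corrcoef(floored, criterion)[0, 1], 2))   # noticeably lower with the floor
```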
not examine the potential differential performance of screening measures across different subgroups of students. The achievement gap between subgroups of students has been a longstanding and persistent issue in American education (Rampsey, Dion, & Donahue, 2009). This is one reason that disaggregation across various traditionally underperforming subgroups was required for states to demonstrate adequate yearly progress under the No Child Left Behind Act (NCLB; 2002). By explicitly disaggregating the performance of various subgroups (i.e., students from economically disadvantaged backgrounds, students with limited English proficiency, students with disabilities [SwD], and students from various racial/ethnic backgrounds), schools, districts, and states would be able to determine whether they were meeting the needs of all groups of students. However, this analysis is on state-identified outcome measures only and does not address the potential for differential prediction on screening measures. Although overall classification accuracy is an important consideration when evaluating screening measures, bias in predictive validity is also an important consideration (Cole & Moss, 1993). Bias in predictive validity (also referred to as “differential prediction”) is a difference in the quality of inferences when making a judgment of individuals from one group rather than another (Helms, 2006). That is, it is a difference between two groups in the predictive validity of a measure. Instruments used for universal screening are often characterized by high rates of under- or overidentification. This is part of the effect that test use has on students that has been shown to differentially affect different subgroups of students (Cleary, Humphreys, Kendrick, & Wesman, 1975). It is also often implicated in the disproportionate representation of minority students in special education (Hosp & Reschly, 2003) and differential provision of services, in that not only might some individuals, but some groups, might be overidentified yet underserved (Donovan & Cross, 2002). With the increased emphasis on assessment and accountability as mandated through NCLB as well as Race to the Top (2010) and the in110
alignment of the Individuals with Disabilities Education Act (2004) with the proposed revisions to the Elementary and Secondary Education Act (NCLB, 2002), the influence of assessment on students and the importance of examining potential bias in predictive validity arguably has never been higher. Unfortunately, bias in predictive validity is something that is evaluated less frequently than it should be (Betts et al., 2008). For example, the National Center on Response to Intervention conducts technical reviews of screening instruments for reading and math. To date, the technical review committee has reviewed nine reading-related screening measures, and only two (DIBELS ORF and STAR Reading [Renaissance Learning, 2011]) have provided evidence of predictive validity across disaggregation groups (see http://www.rti4success.org). There have been a few studies examining differential predictive validity (i.e., with a span of >3 months between predictor and criterion) of screening measures across disaggregation groups. Wiley and Deno (2005) found differences between English learners (EL) and English fluent (ES) students on both Maze and ORF tasks as compared to a state high-stakes test at Grades 3 and 5. Roehrig et al. (2007) found no differences for students receiving free or reduced-price lunch, EL students, or African American and Hispanic third-grade students in prediction of the state high-stakes test using ORF. Betts et al. (2008) found no difference for EL students, African American students, or Asian-American students, but some difference between White and Latino students when predicting second-grade reading outcomes from kindergarten screening assessments. Last, Fien et al. (2008) found that 7 of 24 comparisons between EL and ES students demonstrated differential prediction, but that 5 of the 7 were from winter kindergarten assessments, which is consistent with Catts et al.'s (2009) concern about floor effects and suggests that the measures may differentially affect different subgroups of students. When examined as a group, these studies show no clear pattern of differential prediction that is consistent across groups, grade levels, or samples.
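For readers unfamiliar with the term, "differential prediction" is usually checked by asking whether a single prediction equation fits all groups. The sketch below shows one common regression-based version of that check (a Cleary-style model with group and group-by-predictor terms). It is offered only as an illustration with simulated, hypothetical data; it is not the ROC and quantile-regression approach used in the present study, and statsmodels is assumed purely for convenience.

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical data: screen = fall screening score, outcome = spring criterion score,
# group = 0/1 membership in a disaggregation subgroup. No bias is built into this simulation.
rng = np.random.default_rng(0)
n = 500
group = rng.integers(0, 2, n)
screen = rng.normal(50, 15, n)
outcome = 150 + 0.2 * screen + rng.normal(0, 8, n)

# Cleary-style check: do the group intercept or the group-by-screen slope terms
# add predictive information beyond the screener alone?
X = sm.add_constant(np.column_stack([screen, group, screen * group]))
fit = sm.OLS(outcome, X).fit()
print(fit.params)    # constant, screen, group, screen x screen*group
print(fit.pvalues)   # reliable group or interaction terms would suggest differential prediction
```

The same logic underlies the studies reviewed above: a measure predicts differentially when group membership changes either the intercept or the slope linking the screener to the outcome.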
Purpose of the Study
Given the importance of screening in response to intervention (Hughes & Dexter, 2007), the concerns detailed by Catts et al. (2009), and the need to demonstrate accountability for disaggregated subgroups within NCLB, the purpose of the current study was to examine the possibility of bias in predictive validity (including differential prediction and differential floor effects) across the disaggregation categories included in NCLB. As such, the research questions guiding this study were as follows:
1. How well do benchmark scores on the NWF and ORF measures of the DIBELS predict Grades 1-3 scores on a state criterion-referenced test when examined across the disaggregation categories of NCLB?
2. How much does the accuracy of prediction of the NWF and ORF measures of the DIBELS on a state criterion-referenced test vary as a function of level of performance when examined across the disaggregation categories of NCLB?
These research questions served as the basis for the following hypotheses:
H1. Benchmark scores on the NWF and ORF measures of the DIBELS will differentially predict scores on a state criterion-referenced test when examined across the disaggregation categories of NCLB.
H2. Accuracy of prediction of the NWF and ORF measures of the DIBELS on a state criterion-referenced test will vary as a function of level of performance when examined across the disaggregation categories of NCLB.
Method
Participants
Participants were 3,805 students enrolled in Grades 1-3 of Utah's Reading First schools during the 2006-2007 school year. This sample included all the students in Utah's Reading First schools who had data on both
measures. The entire sample of students was 50.8% male, 71.8% eligible for free or reduced-price lunch, 25.3% EL, 9.4% students with disabilities, 45.8% White, 38.7% Hispanic, 8.7% American Indian, 2.6% Pacific Islander, 2.1% African American, and 1.1% Asian. See Table 1 for the demographic characteristics broken out by each subgroup used in the analysis for each grade level. Analyses indicated no differences between the demographic profile of the final sample and overall school demographics.
Measures
As part of Utah's Reading First, all children were required to be administered a screening instrument at least three times per year in order to predict which students were likely to not reach proficiency on the state's criterion-referenced test, which is used to report adequate yearly progress to the U.S. Department of Education as a condition of NCLB. The reading coaches, reading coordinators, and administrators from the participating schools chose to use the DIBELS as their screening measures, which was then implemented in all Reading First schools in Utah. For the purposes of this study, only NWF and ORF were included because these are the only DIBELS measures administered in Grades 1-3, which are the grades in which an outcome measure is also administered.
NWF. This is a standardized, individually administered measure of a student's ability to use letter-sound correspondence to decode short consonant-vowel-consonant (CVC) and vowel-consonant (VC) nonsense words. Given a page of these words, the student must verbally produce either the individual letter sounds or each nonsense word. The student's score is the number of correct letter-sound correspondences produced within 1 min. Reliability for NWF with first-grade students has been reported as .94 for test-retest (Harn, Stoolmiller, & Chard, 2008) and .83 (Mdn; range .67 to .88) for 1-month alternate form (Good et al., 2004).
Table 1
Demographic Characteristics of the Participants in Each Subgroup at Each Grade Level

Each line gives counts for the nine groups in this order: FRL, non-FRL, EL, non-EL, SwD, non-SwD, AI, W, H.

Grade 1 (n = 1353)
n:    945 408 403 950 101 1252 94 622 547
Male: 483 213 203 494 73 624 39 324 286
FRL:  — — 368 577 70 875 78 311 487
EL:   368 35 — — 24 379 30 9 337
SwD:  70 30 24 77 — — 6 56 31
AA:   19 7 7 19 4 22 — — —
AI:   78 16 30 64 6 88 — — —
As:   9 2 6 5 0 11 — — —
W:    311 311 9 613 56 566 — — —
H:    487 59 337 210 31 516 — — —
PI:   33 10 10 33 2 41 — — —
O:    8 2 4 6 2 8 — — —

Grade 2 (n = 1241)
n:    886 351 311 930 106 1135 66 618 473
Male: 459 187 150 497 72 575 32 340 234
FRL:  — — 286 600 73 813 48 339 428
EL:   286 22 — — 17 294 3 7 276
SwD:  73 33 17 89 — — 10 60 32
AA:   23 3 6 21 1 26 — — —
AI:   48 18 3 63 10 56 — — —
As:   11 3 8 6 1 13 — — —
W:    339 279 7 611 60 558 — — —
H:    428 43 276 197 32 441 — — —
PI:   28 4 10 23 1 32 — — —
O:    9 1 1 9 1 9 — — —

Grade 3 (n = 1088)
n:    766 322 247 841 141 947 60 502 450
Male: 397 155 139 413 91 461 28 244 244
FRL:  — — 221 545 107 659 40 268 401
EL:   221 26 — — 33 214 0 8 211
SwD:  107 34 33 108 — — 8 72 55
AA:   22 7 8 21 2 27 — — —
AI:   40 20 0 60 8 52 — — —
As:   10 8 9 9 3 15 — — —
W:    268 234 8 494 72 430 — — —
H:    401 49 211 239 55 395 — — —
PI:   18 3 7 14 1 20 — — —
O:    7 1 4 4 0 8 — — —

Note. FRL = students receiving free/reduced-price lunch; non-FRL = students not receiving free/reduced-price lunch; EL = English learners; non-EL = English-proficient students; SwD = students with disabilities; non-SwD = students without IEPs; AA = African American; AI = American Indian; As = Asian; W = White; H = Hispanic; PI = Pacific Islander; O = other race/ethnicity.
ORF. This is a standardized, individually administered measure of the accuracy and rate of a student's ability to orally read connected text. Given a grade-level passage of previously unseen material, the student reads aloud for 1 min. The number of words read correctly in that minute is recorded. Three separate passages are administered, with the student's median words-read-correctly score serving as the recorded score.
Reliability of ORF has been reported as .95 for alternate form (Good, Kaminski, Smith, & Bratten, 2001) and .96 for test-retest (Catts et al., 2009).
Utah State Criterion-Referenced Tests (UCRTs). The UCRTs are group-administered tests given to all students in Grades 1-8 in the spring of each school year. The questions are in multiple-choice format with
students recording their answers on a computerized Scantron sheet. The items are aligned with the state core curriculum, with cut scores established by the Utah State Office of Education to determine the minimum score a student must receive to achieve proficiency in the state curriculum. Because NWF and ORF are designed to predict reading outcomes, the English/Language Arts component was used as the outcome in this study. Reliability of the UCRTs was reported as .92 (Kuder-Richardson 20) and .93 (stratified alpha) for internal consistency; criterion validity was reported as .65 with the Grade 3 Iowa Test of Basic Skills (Hoover et al., 2003; Utah State Office of Education, 2007). Analyses also suggest that the UCRT meets standards as a nonbiased instrument (Utah State Office of Education, 2007).
Procedures
All measures were administered by classroom teachers or reading coaches. Training for administration and scoring of the DIBELS measures was conducted by outside expert consultants hired by the Utah Reading First Director to conduct multiple two-day trainings for educators across the state. District-based coaches and coordinators were also trained to use the DIBELS administration integrity checklists while observing practice administrations of the measures. Although data from those checklists are not available for analysis in this study, no individual was allowed to administer the measures without demonstrating administration and scoring accuracy above 95% to a trainer. The Utah Reading First evaluation team provided a schedule for DIBELS administration to all participating schools in order to have consistency in administration times across schools. Two-week windows for administration were provided for fall (2-4 weeks after the first day of school, typically occurring in early September), winter (2 weeks in January, equidistant from the beginning and end of the school year), and spring (2-4 weeks before the last day of school, typically occurring in early May). NWF was
administered at all three time points in Grade 1, whereas ORF was administered in winter and spring of Grade 1 and at all three time points in Grades 2 and 3. The UCRTs are administered over a 2-week period typically occurring in April or early May. The spring DIBELS window was scheduled so as not to overlap with the UCRT administration window to reduce scheduling burden.
Data Analysis
To make comparisons, four sets of analyses were conducted. These aligned with the disaggregation categories required through NCLB: economic disadvantage (operationalized as receiving free/reduced-price lunch or not), limited English proficiency (identification as an English learner or not), disability status (identification as having a disability or not), and race/ethnicity (White, Hispanic, and American Indian, the three groups with samples large enough for analysis). The sample sizes for each analysis can be found in Table 2. Two methods of analysis were used to address the research questions for this study. First, Receiver Operating Characteristic (ROC) curves were calculated for each group within each disaggregation category, for each measure, at each grade level. ROC curve analysis is a method of judging the diagnostic efficiency of a measure (Swets, 1996). The three indexes included from the ROC curve analyses are sensitivity (SE; i.e., the proportion of students correctly classified as nonproficient on both measures being compared), specificity (SP; i.e., the proportion of students correctly classified as proficient on both measures), and area under the curve (AUC). AUC ranges from 0.5 to 1.0, gives the probability that a predictor correctly classifies a pair of students from two different categories (e.g., proficient, nonproficient), and can be used as an effect size statistic (Swets, 1988). The SE, SP, and AUC indexes were all compared between groups within disaggregation categories using a two-proportions test (Sprinthall, 2003). Determination of significance was adjusted for multiple comparisons, with p < .001 used given the 33 comparisons in each family of analyses (Drew et al., 2008).
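As a concrete illustration of these indexes, here is a minimal sketch of how SE, SP, and AUC might be computed for one screener and one outcome, and how two groups' values of an index could be compared with a two-proportions test. The benchmark and proficiency cut, the variable names, and the use of scikit-learn and SciPy are assumptions made for the example, not part of the authors' procedures.

```python
import numpy as np
from scipy.stats import norm
from sklearn.metrics import roc_auc_score

def screening_indexes(screen, outcome, benchmark, proficiency_cut):
    """SE, SP, and AUC for one screening measure against one outcome.

    A student is flagged 'at risk' if screen < benchmark and is 'nonproficient'
    if outcome < proficiency_cut.
    """
    at_risk = np.asarray(screen) < benchmark
    nonproficient = np.asarray(outcome) < proficiency_cut
    se = float(np.mean(at_risk[nonproficient]))       # sensitivity: struggling students caught
    sp = float(np.mean(~at_risk[~nonproficient]))     # specificity: proficient students not flagged
    auc = roc_auc_score(nonproficient, -np.asarray(screen, dtype=float))  # lower score = higher risk
    return se, sp, auc

def two_proportion_z(p1, n1, p2, n2):
    """Two-proportions z test (as in Sprinthall, 2003) for comparing an index across groups."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    z = (p1 - p2) / np.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return z, 2 * (1 - norm.cdf(abs(z)))
```

In this framing, SE and SP depend on both the screening benchmark and the outcome proficiency cut, which is one reason the study reports them alongside AUC rather than relying on any single index.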
The second method of analysis was quantile regression. Quantile regression is similar to ordinary least-squares regression in that a line is fit by minimizing a function of the residuals, but rather than estimating a single best-fit line through the conditional mean, it estimates conditional quantiles by asymmetrically weighting the data above and below the point of interest (Koenker, 2005). These estimates can be plotted in a simple line graph to illustrate how the relation between two variables changes across levels of performance. By plotting the quantile regression lines for multiple groups on a single graph, differences in the relation between the two variables across disaggregation groups can be examined, as well as the presence of floor and ceiling effects.
Results
Descriptive Statistics
The means and standard deviations of the performance of each disaggregated group are shown in Table 2. Although on average the traditionally underrepresented groups (FRL, EL, SwD, Hispanic, and American Indian) appeared to perform below their comparison groups (non-FRL, non-EL, non-SwD, and White, respectively), this was not a hypothesis of the current study and therefore was not tested for statistical significance. Also, all groups showed within-grade growth on both NWF and ORF, but again these gains were not compared to a norm or criterion standard to test their significance. Within-grade growth cannot be compared for the UCRTs because they are administered only once per year. Cross-grade comparisons are not made here because this sample is cross-sectional rather than longitudinal.
ROC Curve Analyses
To answer the first research question, "How well do benchmark scores on the NWF and ORF measures of the DIBELS predict Grades 1-3 scores on a state criterion-referenced
test when examined across the disaggregation categories of NCLB?", a series of ROC curves were calculated. The AUC index was evaluated as good if it was >.80 (Metz, 1978); the SE index was evaluated as good if it was >.80 (Carran & Scott, 1992); the SP index was also evaluated as good if it was >.80. In addition, the AUC, SE, and SP indexes were compared between the groups using a two-proportions test.
Economic disadvantage. Results of the ROC curve analyses for the economic disadvantage disaggregation comparisons are shown in Table 3. Overall, the AUCs fell into the desired range for screening measures. All AUCs for both groups, except for the three administrations of the NWF measure for the FRL group, were >.80. For SE, ORF in Grades 2 and 3 exceeded the .80 criterion for both the FRL and non-FRL groups, with the exception of spring Grade 3 for the non-FRL group (.79). The only measurement for either group to meet the criterion in Grade 1 was winter ORF for the non-FRL group. The opposite was true for SP; no measurements in Grades 2 or 3 met the criterion, but all except winter NWF and winter ORF for the FRL group did in Grade 1. Using the conservative criterion of p < .001, two measurements demonstrated differences in AUC, two in SE, and three in SP. Most of these differences were in Grade 1, with two of the SP differences in ORF for Grade 2. There were no significant differences in AUC or SE for Grade 2, and no significant differences for any index at Grade 3.
Limited English proficiency. Results of the ROC curve analysis for the limited English proficiency disaggregation comparisons are shown in Table 4. Overall, the AUCs fell into the desired range for screening measures for ORF but not for NWF. For SE, ORF in Grades 2 and 3 exceeded the .80 criterion for both groups. In Grade 1, only the winter ORF measurement for the EL group exceeded the criterion. The opposite was true for SP; no measurements in Grades 2 or 3 met the criterion, but all except winter NWF and winter
Table 2
Means and Standard Deviations for Each Group in Each Grade

Groups on each line, in this order: FRL, non-FRL, EL, non-EL, SwD, non-SwD, W, H, AI.

NWF, Grade 1 (n = 945, 408, 403, 950, 101, 1252, 622, 547, 94)
Fall Mean:   34.2 40.4 30.4 38.2 21.4 37.2 39.0 31.1 43.6
Fall SD:     22.5 26.2 20.5 24.6 17.5 23.9 25.8 19.7 24.0
Winter Mean: 61.3 69.7 55.4 67.0 45.6 65.2 68.1 56.8 74.3
Winter SD:   27.5 32.8 25.6 30.1 24.5 29.3 31.0 25.1 28.7
Spring Mean: 78.3 86.1 74.0 82.4 59.2 82.3 82.9 73.5 101.0
Spring SD:   36.0 36.8 34.5 35.4 35.5 35.9 35.7 32.3 47.1

ORF, Grade 1 (same n as above; administered in winter and spring only)
Winter Mean: 27.9 40.6 21.6 35.6 16.6 32.8 39.0 22.0 38.2
Winter SD:   25.9 34.0 20.9 31.0 18.0 29.5 33.3 20.1 28.0
Spring Mean: 49.0 65.1 41.7 58.8 33.5 55.4 62.0 44.4 53.3
Spring SD:   30.3 35.7 27.4 33.7 27.6 32.7 35.0 27.7 29.4

ORF, Grade 2 (n = 886, 351, 311, 930, 106, 1135, 618, 473, 66)
Fall Mean:   43.4 56.8 36.0 50.6 22.4 49.3 52.4 40.0 44.8
Fall SD:     28.9 33.8 28.2 31.1 20.6 30.7 32.8 27.4 27.8
Winter Mean: 66.0 82.7 56.9 75.5 39.2 73.6 78.3 61.7 62.6
Winter SD:   34.9 38.5 35.9 35.1 32.1 35.6 37.4 34.2 31.7
Spring Mean: 79.4 96.2 70.3 88.9 50.7 87.2 91.0 75.7 77.5
Spring SD:   37.1 38.5 38.6 37.4 34.4 37.1 38.6 36.4 33.2

ORF, Grade 3 (n = 766, 322, 247, 841, 141, 947, 502, 450, 60)
Fall Mean:   69.5 82.3 56.0 79.0 42.6 77.7 81.6 66.1 65.2
Fall SD:     34.2 35.8 32.0 34.6 32.5 33.1 35.5 32.9 31.1
Winter Mean: 82.4 95.5 69.0 92.0 51.5 91.0 94.4 78.8 77.8
Winter SD:   37.2 37.4 36.8 36.6 37.7 35.0 37.6 35.8 34.9
Spring Mean: 99.1 111.4 86.0 108.0 66.9 107.6 110.3 96.3 95.6
Spring SD:   37.9 36.0 37.3 36.5 40.9 34.4 37.2 36.5 37.7

UCRT (administered in spring only; n as for the corresponding grade above)
Grade 1 Mean: 159.7 167.3 155.6 164.5 154.4 162.5 166.9 157.3 158.6
Grade 1 SD:   11.2 13.6 9.9 12.5 11.6 12.3 12.8 10.2 10.2
Grade 2 Mean: 160.1 166.2 156.0 163.8 154.5 162.5 165.2 158.0 159.3
Grade 2 SD:   10.6 10.6 10.7 10.5 9.4 10.8 10.9 10.2 8.2
Grade 3 Mean: 160.9 164.5 156.7 163.6 155.4 162.8 164.7 159.4 158.8
Grade 3 SD:   9.8 9.9 9.9 9.6 9.4 9.7 9.4 10.1 7.2

Note. NWF = Nonsense Word Fluency; ORF = Oral Reading Fluency; UCRT = Utah Criterion-Referenced Test; FRL = students receiving free/reduced-price lunch; non-FRL = students not receiving free/reduced-price lunch; EL = English learners; non-EL = English-proficient students; SwD = students with disabilities; non-SwD = students without IEPs; W = White; H = Hispanic; AI = American Indian. The mean for NWF is the mean number of correct letter sounds per minute; for ORF, the mean is the mean number of words read correctly in one minute. The UCRT score reported here is the scaled score, which is equated across grade levels so that a score of 160 always indicates performance at the median for that grade.
ORF for the EL group did in Grade 1. Using the conservative criterion of p < .001, no comparisons between the EL and non-EL groups demonstrated significant differences in AUC. In Grade 1, the winter NWF and winter ORF comparisons were significant for SP. Winter ORF was also significant for SE. In Grade 2, the winter and spring ORF comparisons were significant for SP, but not for SE. In Grade 3, all SE and SP comparisons were significant.
Disability status. Results of the ROC curve analysis for the disability status disaggregation comparisons are shown in Table 5. Overall, the ORF AUCs fell into the desired range for screening measures, but the NWF AUCs did not. The one exception is the Grade 1 winter ORF AUC for the SwD group (.794).
For SE, ORF met the criterion at all measurement points for the SwD group, but only for Grades 2 and 3 (with the exception of winter ORF at Grade 2, .794) for the non-SwD group. SP again had nearly the opposite pattern, with the non-SwD group measurements at Grade 1, except for winter ORF (.795), meeting the criterion. For the SwD group, only spring NWF met the criterion. None of the AUC comparisons met the criterion for significance; however, almost all of the SE and SP comparisons met the significance criterion, with the exception of winter ORF (SE p = .033) and spring NWF (SP p = .010).
Race/ethnicity. Results of the ROC curve analysis for the race/ethnicity disaggregation comparisons are shown in Tables 6 and
Table 3
ROC Two-Proportions Test Results for the Economic Disadvantage Disaggregation Analysis

Measurement points (columns), with benchmark (BM) scores in parentheses:
Grade 1: F NWF (24), W NWF (50), S NWF (50), W ORF (20), S ORF (40); Grade 2: F ORF (44), W ORF (68), S ORF (90); Grade 3: F ORF (77), W ORF (92), S ORF (110).

FRL
SE:  .535 .530 .324 .769 .732 .845 .821 .870 .874 .865 .865
SP:  .853 .790 .917 .754 .851 .671 .721 .663 .595 .577 .613
AUC: .761 .733 .741 .849 .867 .862 .874 .866 .843 .841 .846
PPP: .764 .692 .775 .739 .816 .666 .692 .660 .591 .582 .606
NPP: .665 .648 .598 .784 .778 .868 .844 .865 .875 .875 .881
κ:   .384 .319 .245 .523 .586 .517 .534 .507 .439 .426 .464

non-FRL
SE:  .618 .600 .373 .809 .627 .812 .812 .859 .815 .827 .790
SP:  .875 .862 .936 .838 .899 .737 .820 .771 .693 .656 .668
AUC: .845 .817 .802 .901 .884 .892 .888 .889 .822 .840 .827
PPP: .648 .617 .683 .650 .697 .496 .583 .545 .468 .443 .441
NPP: .860 .853 .801 .922 .867 .925 .924 .945 .918 .919 .904
κ:   .501 .466 .360 .601 .543 .451 .543 .526 .406 .373 .361

p (FRL vs. non-FRL)
SE:  .001 .020 .075 .089 .000 .134 .689 .575 .019 .121 .004
SP:  .290 .002 .230 .001 .018 .014 .000 .000 .003 .017 .089
AUC: .001 .001 .018 .011 .390 .121 .453 .230 .384 .968 .430

Note. ROC = receiver operating characteristic; FRL = students receiving free/reduced-price lunch; BM = benchmark score; SE = sensitivity; SP = specificity; AUC = area under the curve; PPP = positive predictive power; NPP = negative predictive power; κ = kappa; F = fall; W = winter; S = spring; NWF = Nonsense Word Fluency; ORF = Oral Reading Fluency.
Table 4
ROC Two-Proportions Test Results for the Limited English Proficiency Disaggregation Analysis

Measurement points (columns), with benchmark (BM) scores in parentheses:
Grade 1: F NWF (24), W NWF (50), S NWF (50), W ORF (20), S ORF (40); Grade 2: F ORF (44), W ORF (68), S ORF (90); Grade 3: F ORF (77), W ORF (92), S ORF (110).

EL
SE:  .551 .535 .314 .820 .743 .847 .816 .837 .906 .928 .906
SP:  .865 .747 .899 .690 .823 .620 .653 .595 .468 .468 .541
AUC: .761 .711 .709 .848 .859 .848 .853 .839 .846 .865 .859
PPP: .839 .766 .828 .804 .867 .778 .787 .764 .683 .688 .714
NPP: .545 .509 .458 .712 .674 .721 .693 .699 .794 .833 .817
κ:   .353 .260 .182 .513 .544 .479 .474 .444 .386 .410 .460

non-EL
SE:  .538 .551 .347 .745 .688 .864 .821 .882 .829 .824 .824
SP:  .868 .831 .929 .809 .880 .707 .775 .719 .668 .642 .660
AUC: .787 .762 .783 .865 .876 .892 .894 .893 .827 .833 .834
PPP: .668 .618 .708 .659 .740 .560 .612 .576 .506 .486 .500
NPP: .791 .789 .742 .865 .851 .923 .909 .934 .904 .899 .901
κ:   .427 .393 .317 .537 .578 .495 .543 .522 .418 .386 .407

p (EL vs. non-EL)
SE:  .660 .589 .246 .001 .036 .465 .841 .059 .001 .000 .000
SP:  .881 .000 .061 .000 .005 .003 .000 .000 .000 .000 .000
AUC: .289 .046 .003 .406 .390 .036 .049 .010 .484 .234 .352

Note. ROC = receiver operating characteristic; EL = students with limited English proficiency; BM = benchmark score; SE = sensitivity; SP = specificity; AUC = area under the curve; PPP = positive predictive power; NPP = negative predictive power; κ = kappa; F = fall; W = winter; S = spring; NWF = Nonsense Word Fluency; ORF = Oral Reading Fluency.
Table 5
ROC Two-Proportions Test Results for the Disability Status Disaggregation Analysis

Measurement points (columns), with benchmark (BM) scores in parentheses:
Grade 1: F NWF (24), W NWF (50), S NWF (50), W ORF (20), S ORF (40); Grade 2: F ORF (44), W ORF (68), S ORF (90); Grade 3: F ORF (77), W ORF (92), S ORF (110).

SwD
SE:  .723 .708 .523 .846 .831 .947 .947 .933 .968 .968 .946
SP:  .714 .686 .857 .571 .714 .514 .571 .457 .370 .352 .352
AUC: .755 .741 .760 .794 .841 .859 .868 .841 .822 .825 .804
PPP: .825 .807 .872 .786 .844 .810 .819 .788 .729 .723 .718
NPP: .581 .558 .492 .667 .694 .818 .826 .762 .870 .864 .792
κ:   .415 .374 .325 .432 .541 .523 .550 .446 .397 .377 .348

non-SwD
SE:  .520 .522 .308 .768 .696 .822 .794 .845 .831 .825 .822
SP:  .868 .822 .927 .795 .875 .702 .765 .711 .647 .622 .653
AUC: .779 .753 .753 .869 .875 .867 .876 .870 .827 .831 .835
PPP: .720 .655 .731 .710 .785 .604 .647 .614 .519 .505 .526
NPP: .734 .725 .672 .841 .815 .891 .874 .898 .891 .892 .894
κ:   .408 .355 .260 .556 .584 .497 .533 .516 .413 .393 .424

p (SwD vs. non-SwD)
SE:  .000 .000 .000 .033 .000 .000 .000 .000 .000 .000 .000
SP:  .000 .000 .010 .000 .000 .000 .000 .000 .000 .000 .000
AUC: .575 .787 .873 .029 .317 .810 .810 .384 .881 .857 .342

Note. ROC = receiver operating characteristic; SwD = students with disabilities; BM = benchmark score; SE = sensitivity; SP = specificity; AUC = area under the curve; PPP = positive predictive power; NPP = negative predictive power; κ = kappa; F = fall; W = winter; S = spring; NWF = Nonsense Word Fluency; ORF = Oral Reading Fluency.
Table 6
ROC Two-Proportions Test Results for the Race/Ethnicity Disaggregation Analysis: Hispanic to White

Measurement points (columns), with benchmark (BM) scores in parentheses:
Grade 1: F NWF (24), W NWF (50), S NWF (50), W ORF (20), S ORF (40); Grade 2: F ORF (44), W ORF (68), S ORF (90); Grade 3: F ORF (77), W ORF (92), S ORF (110).

Hispanic
SE:  .547 .574 .329 .829 .738 .837 .801 .829 .876 .900 .886
SP:  .831 .799 .896 .703 .843 .659 .685 .655 .594 .554 .602
AUC: .761 .739 .743 .862 .859 .844 .847 .841 .846 .859 .859
PPP: .795 .774 .790 .769 .849 .723 .729 .719 .635 .620 .643
NPP: .604 .609 .526 .773 .728 .800 .767 .791 .855 .873 .867
κ:   .366 .363 .212 .535 .573 .506 .490 .493 .453 .436 .471

White
SE:  .640 .622 .433 .823 .756 .894 .856 .925 .837 .837 .829
SP:  .869 .817 .930 .815 .878 .697 .780 .712 .662 .649 .673
AUC: .845 .808 .828 .897 .911 .909 .907 .911 .830 .838 .842
PPP: .640 .548 .689 .616 .693 .509 .576 .529 .443 .434 .449
NPP: .871 .857 .820 .928 .910 .949 .939 .964 .883 .925 .924
κ:   .511 .420 .412 .577 .618 .475 .548 .512 .302 .369 .389

p (Hispanic vs. White)
SE:  .001 .097 .000 .787 .478 .007 .018 .000 .085 .004 .012
SP:  .069 .435 .038 .000 .084 .180 .000 .042 .030 .003 .023
AUC: .000 .005 .000 .066 .005 .001 .002 .000 .503 .368 .465

Note. ROC = receiver operating characteristic; BM = benchmark score; SE = sensitivity; SP = specificity; AUC = area under the curve; PPP = positive predictive power; NPP = negative predictive power; κ = kappa; F = fall; W = winter; S = spring; NWF = Nonsense Word Fluency; ORF = Oral Reading Fluency.
7. Three different patterns of AUCs emerged: for White students, the AUCs met the criterion for all measurements; for Hispanic students, the AUCs met the criterion for all ORF measurements, but not NWF; for American Indian students, the AUCs met the criterion for ORF in Grades 1 and 2 only, but not for NWF. For both White and Hispanic students, SE met the criterion in Grades 2 and 3 as well as winter ORF in Grade 1. For the American Indian students, SE only met the criterion for spring ORF in Grade 2 and fall ORF in Grade 3. SP met the criterion for all three groups at all measurements in Grade 1 except for winter ORF for the Hispanic students. No measurements in Grades 2 or 3 met the criterion for SP. Because the comparisons can only be conducted between two groups, Hispanic and American Indian students were compared to White students separately. In comparing Hispanic students to White students, significant differences in AUC were found for fall and spring NWF in Grade 1 and fall and spring ORF in Grade 2. A similar pattern was noted for SE with the exception of fall ORF in Grade 2. SP was significantly different between the groups for winter ORF in Grades 1 and 2. There were no significant comparisons in Grade 3. In comparing American Indian students to White students, only fall NWF met the criterion for significance. For SE, all comparisons in Grade 1 met the criterion for significance, as did fall ORF in Grade 2. For SP, all three measurements in Grade 3 met the criterion for significance. Summary of ROC results. Across the four disaggregation groups, there were few statistically significant differences in AUC between the groups being compared. All of these differences were with the NWF measure with the exception of ORF at grade 2 comparing Hispanic and White students. When examining the SE and SP indexes, there tended to be more differences in SE in Grade 1, particularly with the NWF measure; and more differences in SP in Grades 2 and 3 (most clearly illustrated in the American Indian to White comparisons). A final pattern was that for some
groups, at certain grade levels (EL Grade 3, SwD Grades 2 and 3) there were significant differences in SE and SP, but not in AUC.
Quantile Regression Analyses
To answer the second research question, "How much does the accuracy of prediction of the NWF and ORF measures of the DIBELS on a state criterion-referenced test vary as a function of level of performance when examined across the disaggregation categories of NCLB?", a series of quantile regression models were developed and the graphs of the resultant correlations plotted and compared. Interpretation of the graphs was conducted by visual inspection comparing the regression plots between the groups. Floor or ceiling effects are demonstrated when a line is not horizontal (horizontal lines indicating that the regression coefficients are similar across all points in the performance range). When two groups have similar plots, this indicates that the floor or ceiling effect (or lack thereof) affects both groups similarly. When the plots are different, one group is affected by the floor/ceiling effect more than the other.
Economic disadvantage. As can be seen in Figure 1, there are differences in the plots for the fall and winter NWF and ORF at Grades 1 and 2. The spring measurements in Grades 1 and 2 as well as all three measurements in Grade 3 have similar plots for both groups.
Limited English proficiency. As can be seen in Figure 2, there are differences in the plots for spring NWF and winter ORF in Grade 1 as well as fall and winter ORF in Grade 2. All three measurements in Grade 3 have similar plots for both groups.
Disability status. As can be seen in Figure 3, there are differences in the plots for nearly every measurement point except spring NWF in Grade 1 and fall ORF in Grade 3.
Race/ethnicity. As can be seen in Figure 4, there are differences among the plots for fall and winter NWF and winter ORF in Grade 1 as well as fall ORF in Grade 2. There
Table 7
ROC Two-Proportions Test Results for the Race/Ethnicity Disaggregation Analysis: American Indian to White

Measurement points (columns), with benchmark (BM) scores in parentheses:
Grade 1: F NWF (24), W NWF (50), S NWF (50), W ORF (20), S ORF (40); Grade 2: F ORF (44), W ORF (68), S ORF (90); Grade 3: F ORF (77), W ORF (92), S ORF (110).

American Indian
SE:  .283 .233 .133 .450 .567 .714 .794 .825 .855 .764 .745
SP:  .894 .915 .979 .894 .915 .795 .773 .727 .492 .475 .492
AUC: .723 .735 .715 .822 .837 .862 .873 .859 .765 .743 .733
PPP: .800 .813 .875 .897 .917 .829 .829 .789 .676 .686 .676
NPP: .432 .423 .407 .508 .569 .774 .774 .786 .696 .680 .654
κ:   .137 .115 .074 .317 .435 .604 .604 .570 .357 .360 .327

White
SE:  .640 .622 .433 .823 .756 .894 .856 .925 .837 .837 .829
SP:  .869 .817 .930 .815 .878 .697 .780 .712 .662 .649 .673
AUC: .845 .808 .828 .897 .911 .909 .907 .911 .830 .838 .842
PPP: .640 .548 .689 .616 .693 .509 .576 .529 .443 .434 .449
NPP: .871 .857 .820 .928 .910 .949 .939 .964 .883 .925 .924
κ:   .511 .420 .412 .577 .618 .475 .548 .512 .302 .369 .389

p (American Indian vs. White)
SE:  .000 .000 .000 .000 .001 .000 .147 .013 .617 .095 .063
SP:  .478 .017 .059 .055 .280 .049 .873 .757 .000 .000 .000
AUC: .001 .073 .004 .020 .016 .124 .267 .087 .093 .012 .004

Note. ROC = receiver operating characteristic; BM = benchmark score; SE = sensitivity; SP = specificity; AUC = area under the curve; PPP = positive predictive power; NPP = negative predictive power; κ = kappa; F = fall; W = winter; S = spring; NWF = Nonsense Word Fluency; ORF = Oral Reading Fluency.
are also differences for the Hispanic group in spring ORF for Grade 2 (on the higher end of the distribution) and the American Indian group in fall ORF for Grade 3 (again, particularly on the high end). All other plots are similar.
Summary of quantile regression results. In general, there was less bias in predictive validity in Grade 3 than in Grades 1 and 2, as well as less of an influence of a floor effect (i.e., slope in the regression line). The patterns in Grade 1 appeared similar for NWF and ORF except for the EL comparisons. However, it should be noted that although there are many differences in the performance of different groups across measures, there are also many similarities, indicating that patterns of potential bias are not extreme or consistent.
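For readers who want to see how plots like Figures 1-4 could be generated, the following is a hedged sketch that estimates the slope of the outcome on the screener at a series of quantiles, separately by group, using statsmodels' QuantReg; the data frame, its column names, and the choice of the slope as the plotted quantity are assumptions made for illustration, not a description of the authors' exact procedures.

```python
import numpy as np
import statsmodels.formula.api as smf

def quantile_slopes(df, taus=np.arange(0.05, 1.0, 0.05)):
    """Slope of `outcome` on `screen` at each quantile tau, computed separately per group.

    df is assumed to be a pandas DataFrame with columns: screen, outcome, group
    (e.g., "FRL" / "non-FRL").
    """
    slopes = {}
    for label, sub in df.groupby("group"):
        model = smf.quantreg("outcome ~ screen", sub)
        slopes[label] = [model.fit(q=t).params["screen"] for t in taus]
    return taus, slopes

# Plotting the returned values (e.g., with matplotlib) gives one line per group:
# taus, slopes = quantile_slopes(df)
# for label, vals in slopes.items():
#     plt.plot(taus, vals, label=label)
```

A roughly horizontal line suggests the relation is stable across the performance range; a line that changes sharply toward the lower quantiles is one signature of the floor effects discussed above, and visibly different lines for two groups are the kind of differential pattern summarized here.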
Figure 1. Quantile regression plots for grades 1–3 DIBELS measures for the economic disadvantage disaggregation analysis.
Discussion
Figure 2. Quantile regression plots for Grade 1–3 DIBELS measures for the limited English proficiency disaggregation analysis.
In our nation’s push to improve educational outcomes for all students, examination of bias in predictive validity of educational measures is vital. Consistency in our decision making is important in order to ensure consistency in service delivery and outcomes (Barnett et al., 2007) and to prevent over- or un-
deridentification of a subgroup of students. Unfortunately, studies of bias in predictive validity of screening measures are relatively uncommon (Betts et al., 2008). In this study, we examined universal screening data for bias in predictive validity across the disaggregation categories mandated by NCLB. We found that measures with good overall predictive validity
(NWF, ORF) may not demonstrate consistent levels of predictive validity when focusing on different subgroups. Our results also suggest that this differential prediction varies across the subgroup analyses. Findings support prior research in which the patterns of predictive validity (or bias) have varied across studies. There are many potential explanations for this variation in pattern across studies.
Figure 3. Quantile regression plots for Grade 1–3 DIBELS measures for the disability status disaggregation analysis.
In addition to the typical differences across research studies (different settings, participants, and so on), studies vary in use of criterion measures, inclusion and exclusion of variables, and instruction and intervention. Because studies of prediction bias involve relating a predictor measure to an outcome measure and use of a cut score, each of these components may contribute to differential prediction patterns.
The bias may be conceived as residing with the predictor measure (i.e., it being the dominant factor influencing the differential prediction because of variation in performance or functioning), the criterion measure, or the cut scores for one measure or the other (Flaugher, 1978). Results of predictive validity bias studies can indicate the presence of differential prediction, but generally not the location of that bias.
Figure 4. Quantile regression plots for Grade 1–3 DIBELS measures for the race/ethnicity disaggregation analysis.
A second source of variation is differential inclusion of variables. If a variable is correlated with both the predictor and the criterion measures, the coefficients (and therefore the decisions) may be biased because of the omission of the variable rather than the performance of the measures (Johnson, Carter, Davison, & Oliver, 2001). This could be the influencing factor involved in the
participants in different studies receiving different instruction and intervention. Because the predictive validity studies involve a 3- to 6- month lag between administration of the predictor and criterion measures, participants received instruction and most likely differential intervention based on individual needs. This instruction may vary across studies and across classrooms or schools within studies adding another source of variance. These considerations highlight the importance of examining a phenomenon across studies to examine the pattern in greater detail. In relation to the findings of Catts et al. (2009), we found a similar pattern of greater floor effects in Grade 1 (for both NWF and ORF) than in Grade 2, with little to no floor effect in Grade 3. As students progress in grade, their performance distribution is high enough to not have the restriction of range indicative of a floor effect. As far as differential floor effects across the disaggregation categories, there was not a clear pattern. All groups demonstrated floor effects at the fall and winter screenings for NWF and the winter ORF screening in Grade 1. All but the race/ ethnicity comparisons also demonstrated floor effects in fall of Grade 2. No subgroup demonstrated floor effects in the spring of any year (i.e., potential concurrent predictive bias). No group demonstrated floor effects at all measurement points across all grades. The pattern displayed in this study appears to be that the first measurement period within a year (Grades 1 and 2) is the most likely to exhibit differential prediction. This makes sense in that group performance at the beginning of the year is likely to be lower than at other points, making the impact of floor effects more likely. If two groups perform differently, one would be more susceptible to floor effects than the other (i.e., the group with the lower overall performance). In addition, if the groups came into their education with different prior knowledge and experience, or were receiving differential curriculum or instruction (or differentially effective curriculum or instruction), different levels or patterns of performance could be expected (Donovan & Cross, 2002). The pattern of differential prediction in this study
potentially caused by floor effects in the lower grades was not duplicated in the ROC analyses. From the ROC analyses, although AUC is a valid effect size statistic for use in comparisons (Swets, 1988), the present results suggest that it may not be best to use it in isolation to judge bias in predictive validity. Differences in both SE and SP between two groups, with decisions for one group having higher SE and decisions for the other having higher SP, can actually offset each other in the determination of AUC. The clearest examples of this phenomenon are in Grade 3 for the limited English proficiency comparisons (Table 4) and all grades for the disability status comparison (Table 5). For these comparisons, AUC was similar between the groups, but there were significant differences in SE and SP. An implication of this pattern is that different decision-making mistakes are being made for different groups of students. From our results for the limited English proficiency comparison, ORF at Grade 3 demonstrated greater sensitivity for EL (i.e., the measures were better at identifying which individuals in the EL group would not meet proficiency on the outcome than for the non-EL group). Conversely, both measures demonstrated better specificity for non-EL (i.e., the measures were better at identifying which individuals in the non-EL group would meet or exceed the criterion for proficiency on the outcome measure than for the EL group). If using ORF at Grade 3 to screen students using a direct route approach, wherein those predicted to not meet the proficiency criterion on the outcome measure are automatically placed in supplemental instruction, or Tier 2 (Jenkins et al., 2007), one would make more false-positive errors for the EL group. This means that more EL students would be placed into Tier 2 intervention programs that they do not necessarily require than their non-EL peers. The reverse would also occur: more non-EL students would not receive Tier 2 services that they needed than their EL peers (i.e., a higher false-negative rate for non-EL).
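To make the direct-route example concrete, the sketch below tabulates false-positive and false-negative rates at a single benchmark separately for two groups; the function, its argument names, and the example cut values are all hypothetical illustrations rather than values or code from this study.

```python
import numpy as np

def error_mix(screen, outcome, group, benchmark, proficiency_cut):
    """False-positive and false-negative rates at one benchmark, split by group."""
    screen, outcome, group = map(np.asarray, (screen, outcome, group))
    flagged = screen < benchmark                  # predicted nonproficient: direct route to Tier 2
    nonproficient = outcome < proficiency_cut
    rates = {}
    for g in np.unique(group):
        m = group == g
        fp = float(np.mean(flagged[m & ~nonproficient]))   # proficient students sent to Tier 2 anyway
        fn = float(np.mean(~flagged[m & nonproficient]))   # struggling students screened out
        rates[g] = {"false_positive": fp, "false_negative": fn}
    return rates

# e.g., error_mix(orf_scores, outcome_scores, el_status, benchmark=110, proficiency_cut=160)
```

Comparing the two error rates across groups at the benchmark actually used for placement is a direct way to see the pattern described above, even when AUC looks similar for both groups.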
One way to examine the presence of counteracting SE and SP is to use multiple indexes of classification accuracy. If one index indicates a difference (e.g., SE), yet another does not (e.g., SP), there is a different pattern of predictive validity than if both indexes demonstrate differences. In the case of a nonsignificant AUC with significant SE and SP, this would also provide a check of whether two phenomena are counteracting each other (indicating that the phenomenon may be a result of differential base rates on the criterion measure). Another strategy for examining and combating differential predictive validity is to identify different cut scores for different groups and/or different outcome measures (Roehrig et al., 2007). By systematically identifying different cut scores that maintain the same levels of sensitivity and specificity, similarity in the proportions of false positives and false negatives across different groups could be ensured. However, one potential caution is that generalizability of performance can be lost for the decisions made for individual students. If each school or district uses different criteria to determine the direct route to supplemental or intensive services, a student could move from one building where he was predicted to be proficient on the outcome measure (and therefore not receiving supplemental services) to another where he was not predicted to be proficient and therefore in need of additional instruction. Although both decisions may be correct in determining the student's needs, this creates an additional level of coordination and judgment for the school to which the student moves. In addition to the need to examine multiple indexes in a determination of differential predictive validity, there are other implications from our results. The first is that individual students are included in multiple groups (e.g., a Latino student receiving free lunch, or a student with a disability who also has limited English proficiency). As such, if different cut scores are developed for use with different groups, which one would be used for making a decision about service delivery for an individual student? Because the disaggregation categories are mutually exclusive (e.g., a student cannot be both economically disadvantaged and not economically disadvantaged) and comprehensive (e.g.,
all students are either economically disadvantaged or not), every student can be classified along the dimensions of every disaggregation category. This would mean up to four separate cut scores (five if sex is added) for every student, and identifying which one was the most accurate would be a difficult web to untangle. A second implication is that the absence of prediction bias does not automatically equal fairness. As stated previously, predictive validity has to do with prediction of outcomes; bias in predictive validity occurs when a test differentially predicts that outcome for one group as compared to another. By contrast, fairness can be conceived as differences in the mean test scores that an individual is being compared to (or used to develop the criterion) that are not directly related to the focus of the measure (i.e., construct-irrelevant variance). Although approaches to quantitatively address lack of fairness in assessment do exist (see Helms, 2006), they are not widely adopted or used.
Limitations
The above findings should be interpreted in light of some potential limitations. First, the sizes of the groups being compared were not equivalent, which could lead to differences in the consistency of scores and error for the groups (Tabachnick & Fidell, 2007). Second, some of the group sizes were relatively small (n > 100 students). Although each group included was large enough to run the analyses, larger samples would provide more stable estimates, particularly for the quantile regression analyses, in which more stable estimates would provide smoother plot lines (Koenker, 2005). A third limitation is the lack of an African American subgroup in the race/ethnicity analyses. This is a potential limitation because African American students are, and have traditionally been, one of the larger racial/ethnic groups in the United States. Fourth, the data for this study came exclusively from Reading First schools. As such, the minority, English learner, and economically disadvantaged proportions of students are higher than those of the state as a whole. Similar to the
results from the Catts et al. (2009) study, the extent of this effect is unclear. Last, as with almost any study examining prediction with screening measures, there is an intervening agent present, as the results from the screening measures were intended to be used to make decisions about placement and intervention provision. This can affect the classification accuracy estimates despite the fact that it is precisely the purpose for which the measures are designed (Hosp, Dole, & Hosp, 2006).
Implications for the Practice of School Psychology
Despite the above-mentioned limitations, there are messages that school psychologists can take from the current findings. First, use of a single measure is not prudent for screening decisions. Given these preliminary findings of bias in predictive validity, using other pieces of data to validate the decision from a screening measure should reduce the potential for false positives or negatives. Triangulation of data, either from other screening measures that address the same skill in a different way or from progress monitoring data collected after the screening, provides additional information with which to make a decision. Second, using a team to make decisions can be useful for screening decisions as well as for eligibility decisions as mandated by the Individuals with Disabilities Education Act (2004). Similar to the use of multiple pieces of data or a structured decision-making process, it introduces an added layer of accountability to make sure that there is agreement in the decisions.
References
Barnett, D. W., Hawkins, R., Prasse, D., Graden, J., Nantais, M., & Pan, W. (2007). Decision-making validity in response to intervention. In S. R. Jimerson, M. Burns, & A. VanDerHeyden (Eds.), Handbook of response to intervention: The science and practice of assessment and intervention (pp. 106-116). New York: Springer. Batsche, G., Elliott, J., Graden, J. L., Grimes, J., Kovaleski, J. F., Prasse, D., et al. (2005). Response to intervention: Policy considerations and implementation. Alexandria, VA: National Association of State Directors of Special Education. Betts, J., Reschly, A., Pickart, M., Heistad, D., Sheran, C., & Marston, D. (2008). An examination of predictive
bias for second grade reading outcomes from measures of early literacy skills in kindergarten with respect to English-Language learners and ethnic subgroups. School Psychology Quarterly, 23, 553–570. Buck, J., & Torgesen, J. (2002). The relationship between performance on a measure of oral reading fluency and performance on the Florida Comprehensive Assessment Test (FCRR Technical Report No. 1). Tallahassee: Florida Center for Reading Research. Carran, D. T., & Scott, K. G. (1992). Risk assessment in preschool children: Research implications for the early detection of educational handicaps. Topics in Early Childhood Special Education, 12, 196 –211. Catts, H. W., Fey, M. E., Zhang, X., & Tomblin, J. B. (2001). Estimating the risk of future reading difficulties in kindergarten children: A research-based model and its clinical implementation. Language, Speech, and Hearing Services in Schools, 32, 38 –50. Catts, H. W., Petscher, Y., Schatschneider, C., Bridges, M. S., & Mendoza, K. (2009). Floor effects associated with universal screening and their impact on the early identification of reading difficulties. Journal of Learning Disabilities, 42, 162–176. Cleary, T., Humphreys, L. G., Kendrick, S. A., & Wesman, A. (1975). Educational uses of tests with disadvantaged students. American Psychologist, 30, 15– 41. Cole, N., & Moss, P. (1993). Bias in test use. In R. L. Linn (Ed.), Educational measurement (3rd ed.; pp. 201– 220). Phoenix, AZ: The Oryx Press. Donovan, M. S., & Cross, C. T. (Eds.). (2002). Minority students in special and gifted education. Washington DC: National Academy Press. Drew, C. J., Hardman, M. L., & Hosp, J. L. (2008). Designing and conducting research in education. New York: Sage. Fien, H., Baker, S. K., Smolkowski, K., Mercier-Smith, J. L., Kame’enui, E. J., & Beck, C. T. (2008). Using nonsense word fluency to predict reading proficiency in kindergarten through second grade for English learners and native English speakers. School Psychology Review, 37, 391– 408. Flaugher, R. L. (1978). The many definitions of test bias. American Psychologist, 33, 671– 679. Foorman, B. F., Francis, D. J., Fletcher, J. M., Schatschneider, C., & Mehta, P. (1998). The role of instruction in learning to read: Preventing reading failure in at-risk children. Journal of Educational Psychology, 90, 37–55. Good, R. H., Kaminski, R. A., Shinn, M., Bratten, J., Shinn, M., Laimon, D., et al. (2004). Technical adequacy of DIBELS: Results of the Early Childhood Research Institute on measuring growth and development (Technical Report No. 7). Eugene: University of Oregon. Good, R. H., Kaminski, R. A., Smith, M. R., & Bratten, J. (2001). Technical adequacy and second grade DIBELS Oral Reading Fluency (DORF) passages (Technical Report No. 8). Eugene: University of Oregon. Haladyna, T. M. (2006). Roles and importance of validity studies in test development. In S. M. Downing & T. M. Haladyna (eds.), Handbook of test development (pp. 739 –758). Hillsdale, NJ: Lawrence Erlbaum Associates. Harn, B. A., Stoolmiller, M., & Chard, D. J. (2008). Measuring the dimensions of alphabetic principle on the reading development of first graders: The role of 129
automaticity and unitization. Journal of Learning Disabilities, 41, 143–157. Helms, J. E. (2006). Fairness is not validity or cultural bias in racial-group assessment: A quantitative perspective. American Psychologist, 61(8), 859 – 870. Hintze, J. M., & Silberglitt, B. (2005). A longitudinal examination of the diagnostic accuracy and predictive validity of R-CBM and high-stakes testing. School Psychology Review, 34, 372–386. Hoover, H. D., Dunbar, D. A., Frisbie, D. A., Oberley, K. R., Bray, G. B., Naylor, R. J. (2003). The Iowa Tests of Basic Skills. Rolling Meadows, IL: The Riverside Publishing Company. Hosp, J. L., & Ardoin, S. (2008). Assessment for instructional planning. Assessment for Effective Intervention, 33, 69 –77. Hosp, J. L., Dole, J. A., & Hosp, M. K. (2006, July). DIBELS as a predictor of proficiency on high stakes outcome assessments for at-risk readers. Paper presented at the annual meeting of the Society for the Scientific Study of Reading, Vancouver, BC. Hosp, J. L., & Reschly, D. J. (2003). Referral rates for intervention or assessment: A meta-analysis of racial differences. The Journal of Special Education, 37, 67– 80. Hughes, C., & Dexter, D. (2007). Universal screening within a response-to-intervention model (report brief for RTI Action Network). New York: National Center for Learning Disabilities. Ikeda, M. J., Neessen, E., & Witt, J. C. (2008). Best practices in universal screening. In A. Thomas & J. Grimes (Eds.), Best practices in school psychology (5th ed., Vol. 2, pp. 103–114). Bethesda, MD: National Association of School Psychologists. Individuals with Disabilities Education Improvement Act, 20 U.S.C. § 1400 (2004). Jenkins, J. R., Hudson, R. F., & Johnson, E. S. (2007). Screening for at-risk readers in a response to intervention framework. School Psychology Review, 36, 582– 600. Johnson, J. W., Carter, G. W., Davison, H. K., & Oliver, D. H. (2001). A synthetic validity approach to testing differential prediction hypotheses. Journal of Applied Psychology, 86, 774 –780. Koenker, R. (2005). Quantile regression. New York: Cambridge University Press. Metz, C. E. (1978). Basic principles of ROC analysis. Seminars in Nuclear Medicine, 8, 283–298. No Child Left Behind Act, 20 U.S.C. § 6301 (2002). O’Connor, R. E., & Jenkins, J. R. (1999). Prediction of reading disabilities in kindergarten and first grade. Scientific Studies of Reading, 3, 159 –197. Race to the Top, 26 U.S.C. § 1 (2009). Rampsey, B. D., Dion, G. S., & Donahue, P. L. (2009). NAEP 2008 trends in academic progress (NCES
2009 – 479). Washington, DC: National Center for Education Statistics, Institute for Education Sciences, U.S. Department of Education. Renaissance Learning. (2011). STAR Reading. Retrieved from http://www.renlearn.com/sr/ Ritchey, K. D. (2008). Assessing letter sound knowledge: A comparison of letter sound fluency and nonsense word fluency. Exceptional Children, 74, 487–506. Ritchey, K. D., & Speece, D. L. (2004). Early identification of reading disabilities: Current status and new directions. Assessment for Effective Intervention, 29(4), 13–24. Roehrig, A. D., Petscher, Y., Nettles, S. M., Hudson, R. F., & Torgesen, J. K. (2007). Accuracy of the DIBELS Oral Reading Fluency measure for predicting third grade reading comprehension outcomes. Journal of School Psychology, 46, 343–366. Salvia, J., Ysseldyke, J. E., & Bolt, S. (2009). Assessment: In special and inclusive education (11th ed.). New York: Wadsworth. Shanahan, T. (2003). Review of the DIBELS: Dynamic Indicators of Basic Early Literacy Skills (6th ed.). In B. S. Plake, J. C. Impara, & R. A. Spires (eds.), The sixteenth mental measurements yearbook (pp. 310 – 313). Lincoln, NE: Buros Institute of Mental Measurements. Sprinthall, R. C. (2003). Basic statistical analysis (7th ed.). New York: Pearson. Stage, S. A., & Jacobson, M. D. (2001). Predicting student success on a state-mandated performance-based assessment using oral reading fluency. School Psychology Review, 30, 407– 419. Swets, J. A. (1988). Measuring the diagnostic accuracy of diagnostic systems. Science, 240, 1285–1293. Swets, J. A. (1996). Signal detection theory and ROC analysis in psychology and diagnostics. Hillsdale, NJ: Lawrence Erlbaum Associates. Tabachnick, B. G., & Fidell, L. S. (2007). Using multivariate statistics (5th ed.). Needham Heights, MA: Allyn & Bacon. Utah State Office of Education. (2007). Utah ELA CRT technical manual. Salt Lake City, UT: Author. Available at www.usoe.k12.ut.us Wiley, H. I., & Deno, S. L. (2005). Oral reading and maze measures as predictors of success for English learners on a state standards assessment. Remedial and Special Education, 26, 207–214. Woodcock, R. (1998). Woodcock Reading Mastery Test— Revised/Normative Update. Circle Pines, MN: American Guidance Service.
Date Received: November 11, 2010 Date Accepted: January 15, 2011 Action Editor: Sandy Chafouleas
John L. Hosp, PhD, is an associate professor of teaching and learning at the University of Iowa and codirector of the Center for Disability Research and Education. His current research interests include aligning assessment and instruction through curriculum-based measurement and curriculum-based evaluation, particularly in the elementary grades, as well as the disproportionate representation of students of color in special education. Michelle K. Hosp, PhD, is a consultant with the Iowa Department of Education. Her interests are curriculum-based measurement and curriculum-based evaluation for reading and literacy with elementary students. She has extensive experience writing about reading and assessments as well as presenting at local, state, and national conferences. She is also currently a trainer for the National Center for Response to Intervention. Janice A. Dole, PhD, is a professor of education at the University of Utah, where she teaches graduate courses in reading. Her research interests include school reform in reading, professional development, and summer reading loss in high-poverty schools.
School Psychology Review, 2011, Volume 40, No. 1, pp. 132–139
Commentary School Psychology Research: Combining Ecological Theory and Prevention Science Matthew K. Burns University of Minnesota Abstract. The current article comments on the importance of theoretical implications within school psychological research, and proposes that ecological theory and prevention science could provide the conceptual framework for school psychology research and practice. Articles published in School Psychology Review should at least discuss potential implications for theory, should be written from an ecological and preventative perspective, and should discuss implications for policy when appropriate. Moreover, intervention studies published as general articles in School Psychology Review should address moderating or mediating variables so that researchers can better understand the ecology and causal mechanisms. Manuscripts that address strength-based competence enhancement, multitiered systems of support for a variety of issues, school-based coordination of services, or multicultural competence will be consistent with an ecological approach to prevention.
Research and the scientific method are the foundation for school psychology practice and training (Ysseldyke et al., 2006). Fortunately, research within school psychology is more rigorous than ever before and recently has had notable effects on policy and practice. The number of federally funded studies published in School Psychology Review (SPR) almost doubled between 2006 and 2010 in comparison to the previous 5 years, which represented a shift in conceptual soundness, methodological rigor, and alignment with federal priorities within school psychology research (Power & Mautone, 2011). The impact factors and immediacy index for
school psychology journals, which are based on frequency with which articles in a journal are cited and used as estimates of journal quality and influence, consistently rate among the highest within educational psychology. Moreover, the provision for using response to intervention as part of the learning disability diagnostic process and the rapid increase in its use in practice are directly linked to school psychology research. The widespread use of response to intervention is an example of how our research has directly affected the lives of countless children. Given the expanding influence of school psychology research, it is essential that journals
Correspondence regarding this article should be addressed to Matthew K. Burns, 341 Education Science Building, 56 E. River Road, Minneapolis, MN 55455; E-mail:
[email protected] Copyright 2011 by the National Association of School Psychologists, ISSN 0279-6015
promote ever-evolving standards of methodological rigor to lend confidence to conclusions that affect children, communities, and families. We will continue to follow published methodological guidelines (e.g., Kratochwill et al., 2010; What Works Clearinghouse, 2008) to enhance the internal validity of research conclusions and will promote the use of sophisticated designs and analyses. However, increased rigor alone does not influence practice. Ellis (2005) suggested that for educational innovations to have a lasting effect, there should be convincing research regarding their theoretical basis, their effectiveness in highly controlled settings, and the consistency of results when applied in natural settings. School psychological research has focused on effectiveness and even somewhat on consistency of implementation (e.g., Bolt, Ysseldyke, & Patterson, 2010; Hagermoser Sanetti & Kratochwill, 2009), both of which will advance the field. However, our research has yet to adequately address theoretical implications, and doing so will move our science toward greater maturity. The purpose of this article is to comment on the importance of theoretical implications within school psychological research, to discuss which theoretical orientation will advance the field, and to outline potential implications for research and practice.

Why Does Theory Matter?

Many community organizations conduct free health screenings to promote early identification of potential health difficulties. Early identification is almost always important in treating health problems, but the President of the Minnesota Academy of Family Practice argued against these health screenings, stating that potential patients were better off receiving regular physicals that examined the entire body and considered risk factors and family history than they were receiving a "blind search for disease" (Yee, 2009). In other words, diagnosticians cannot understand risk data without fully considering the context from which they came. This is quite analogous to the role of theory within research in that the data cannot be adequately interpreted unless they are contextualized within theory.
Tharinger (2000) cautioned against an overreliance on empirically supported treatments, and suggested that instead practitioners should rely on an integration of theoretical frameworks and a working knowledge of empirical research. Practitioners are frequently presented with newly developed interventions for which the research base is still developing. School psychologists should be cautious about practices without a solid research base, but should avoid those without a theoretical foundation because theoretical and conceptual frameworks provide a structure to guide practices and solve problems (Tharinger, 2000; Tilly, 2008). Applying more sophisticated conceptual frameworks to intervention research is the best way to advance it (Hughes, 2000) because it allows us to give meaning to data while providing a heuristic to guide future research (Tharinger, 2000). In the absence of a guiding theory, research results in disjointed incrementalism with a focus on problems to be solved rather than positive goals to be sought, fragmented and disparate findings, and “a sequence of trials, errors, and revised trials” (Lindblom, 1979, p. 517). Moreover, identifying the theoretical underpinnings of an intervention allows us to better adapt it to specific settings without sacrificing efficacy and to consider broader implications. Methodological rigor, which is not synonymous with sophisticated statistical analysis, is needed to assure the validity of the findings, but advancements in theory is how fields truly move forward. Which Theory? What does school psychology uniquely offer to the education and development of children and youth in various applied settings? We should be the most expert consumers and synthesizers of research within a school (Keith, 2008) and have a level of expertise regarding data-based decision making that exceeds our peers (Ysseldyke et al., 2006). In addition to these important functional roles, we should also bring an ecological and developmental perspective to problems and address them through a preventative framework. In other words, school psychology should be the application of an eco133
logical developmental approach to prevention science, and the juxtaposition of these theoretical orientations provides the conceptual framework for the field. Ecological systems theory appears to be the prevailing wisdom within most childfocused fields of psychology, and taking an ecological perspective has fostered a better understanding of many phenomena from physical inactivity (Spence & Lee, 2003), to school violence (Baker, 1998), to peer relations (Elias & Dilworth, 2003). Below I will provide a succinct summary of ecological theory, prevention science, and school psychology’s conceptual consistency with each. Ecological Perspective to Development Sheridan and Gutkin’s (2000) seminal article suggested that school psychology’s positive effect on children and youth was limited by our historical reliance on a medical model paradigm and its focus on finding and treating pathology. They further suggested that an ecological perspective would better advance the field. Ecological theory is defined as the study of the multiple interconnected environmental systems that influence individual development (Bronfenbrenner, 1977) and operates under four fundamental assumptions as outlined by Apter and Conoley (1984): (a) individuals are an inseparable part of a system; (b) “disturbance is not viewed as a disease located within the body of the child, but rather as a discordance in the system” (p. 89); (c) dysfunctions are the result of a mismatch between an individual’s skills and knowledge, and the environmental demands; and (d) any intervention should focus on the system to be most effective. School psychological services are delivered through a systems approach that not only recognizes the need to modify systems that affect schools, but also examines individual student problems by assessing the systems that affect a student’s development (Ysseldyke et al., 2006). For example, Hughes (2000) points out that “individuals who share common risk factors or psychopathology experience divergent outcomes” (p. 311), and academic success, positive relationships with teachers, and prosocial friendship networks all reduce negative behavioral pat134
terns of conduct disorders (Hughes, Cavell, & Jackson, 1999; Stattin & Magnusson, 1996). Thus, an ecological approach to preventing a conduct disorder could include academic interventions in addition to teaching appropriate interpersonal skills. Prevention Science Prevention science is the process of identifying potential risk and protective factors to eliminate or mitigate major human dysfunction (Coie et al., 1993), and has consistently been shown to be an effective approach (Botvin, 2004; Stith et al., 2006; Wilson, Gottfredson, & Najaka, 2001). Caplan (1964) conceptualized preventions within the now famous three tiers of primary (stopping from happening before it occurs), secondary (delaying onset), and tertiary (reducing the effect), which focused on reducing risk factors and negative behaviors. More recent conceptualizations of prevention science reduce risk while also promoting wellness through a strength-based approach (Hage et al., 2007). Moreover, prevention efforts should also improve personal well-being and the environments in which people live, learn, and work (Romano & Hage, 2000). Coie et al. (1993) outlined four principles of prevention science: (a) address fundamental causal factors for the dysfunction, including those attributed to the individual’s environment; (b) address risk factors before they stabilize; (c) target those at highest risk; and (d) coordinate action in each domain of functioning. The four principles for prevention science and its focus on wellness promotion and strength-based approaches (Hage et al., 2007) align well with school psychological services, which are best delivered through a three-tiered model and should result in “enhancement of student competence and development of the capacity of systems to meet students needs” (Ysseldyke et al., 2006, p. 12). Implications of Applying Ecological Theory to Prevention Science Advancing theory is a slow and deliberate process (Cowen, 1977), but it is a process in which school psychology should engage.
How school psychology will advance prevention theory over the immediate, foreseeable, and long-term future is a matter of some debate. Several scholars have suggested specific practices and research foci based on an ecological perspective or prevention science (e.g., Elias & Dilworth, 2003; Hage et al., 2007; Hughes, 2000; Sheridan & Gutkin, 2000). Below, I will synthesize the recommendations into a list that incorporates both perspectives for research and practice within school psychology. Potential Implications for Research Identifying Moderator Variables School psychology is an applied science and often studies the application of interventions to various problems. Intervention research has historically only rarely examined moderator variables (McClelland & Judd, 1993), but doing so would foster our understanding of how individual ecology affects intervention effectiveness. Intervention researchers could also more frequently examine individual participant’s ecological patterns of development (Hughes, 2000) within the study design. Contextualize Within Prevention Science Given the consistency among ecological theory, prevention science, and school psychology described above, it is difficult to imagine topics within school psychological research that could not be contextualized within an ecological approach to prevention. Researchers are encouraged to better understand prevention science and what an ecological perspective brings to it, and to present studies within that framework. They are also encouraged to discuss the potential implications that study data have for those theories. Policy Research can perhaps influence practice most directly by first influencing the policy that directs the practice. Research published within SPR could have implications for policy
that positively affect the various systems in which children and youth live and function. The Children, Research, and Public Policy (CRPP) section of SPR is designed for research that addresses empirical findings with implications for national, state, and local policy development, but of the three journal sections (CRPP, Research Brief, and Research into Practice) there are approximately half as many manuscripts submitted to the CRPP section than are submitted to either of the other two. Researchers could design research from a policy perspective to better address governmental, legislative, and political influences on health and well-being of the many groups that we serve, but could also discuss implications within studies in which the focus is more micro in nature. Causal Mechanisms Sheridan and Gutkin (2000) indicated that researchers should identify the necessary, desirable, and sufficient conditions for interventions to be successful so that practitioners can recognize that which is unalterable and that which can be changed to accommodate different environments. The potential implications for closing the research-to-practice gap for that recommendation are obvious and appreciated, but focusing on the most important aspect of an intervention could identify the causal mechanism, which could lead to theoretical implications. Identifying the causal mechanism for various interventions could more directly link interventions to theory and lead to modifications of interventions and expansions of existing theories, both of which would be a desirable outcome for intervention research. Mahoney (2002) defined causal mechanisms as the often unobservable entities and processes that actually bring about the desired outcome. Research that examines mediating variables attempts to identify what underlies the relationship between an independent and dependent variable (Baron & Kenny, 1986), and could be helpful in identifying causal mechanisms. Understanding the forces that cause the effect can help researchers better explain the 135
effects, derive new hypotheses about the effects, and integrate existing knowledge (Mahoney, 2002). Moreover, mediation research could help us better understand the operative elements, which would allow for the interventions to be more precise and efficient with potential broader applications. Change Process School psychologists need to be well versed in systems consultation (Ysseldyke et al., 2006), but they should also understand the change process and how it affects the systems in which we work. More specifically, they should be able to promote institutional change that will enhance the well-being of individuals and groups. Grimes, Kurn, and Tilly (2006) and Ervin, Schaughency, Goodman, McGlinchey, and Matthews (2006) discussed how the Adelman and Taylor (1997) phases of change model was applied at local educational agencies and individual schools to bring about systems change in order to enhance student wellbeing. These two articles are excellent examples to which school psychologists could turn for information about changing systems to better help children and youth. However, demonstrations in different settings are also needed. Moreover, each phase of the process—(a) creating readiness, (b) initial implementation, (c) institutionalization, and (d) ongoing evolution—presents a potential research agenda for school psychologists. Engaging in that line of inquiry would almost certainly lead to more successful systems change to better enhance the well-being of individuals and groups. Potential Implications for Practice Use Interventions That Promote Strengths Rather Than Reduce Symptoms According to the 1999 Surgeon General’s report, mental health is more than an absence of pathology; it is also the “successful performance of mental function, resulting in productive activities, fulfilling relationships with other people, and the ability to adapt to change and to cope with adversity” (U.S. De136
partment of Health and Human Services, 1999, p. 4). Thus, mental health involves competence in interacting with others, learning academic (e.g., reading) and career-related skills, problem solving, and many other productive activities. School psychologists who work from an ecological approach to prevention should adopt this definition of mental health and focus on building skills as a way to reduce problem behaviors. Utilize a Tiered Model for Service Delivery There is an almost universal acceptance that tiered service-delivery models are preferable for school psychologists (Feifer, 2008; Tilly, 2008; Ysseldyke et al., 2006), but this approach is critical for prevention science. In fact, the three-tiered model associated with response to intervention and positive behavior supports can be linked directly to public health and prevention efforts (Ervin et al., 2006; Sugai & Horner, 2006). Allocating resources through tiered services allows for efficiency in the system while also proactively preventing problems, preventing problems from becoming worse, and targeting those who are most at risk. Coordination of Services One of the basic tenets of ecological theory is that the various systems that affect development are interconnected (Bronfenbrenner, 1977). Thus, one of the most important activities in which school psychologists can engage is to coordinate the services for individual and groups of students. It stands to reason that schools should work closely with community mental health care providers, but a lack of coordination among agencies has previously interfered with appropriate services (Fantuzzo, McWayne, & Bulotsky, 2003). School psychologists are often the primary mental health service providers for children and youth, but they are also uniquely trained to bridge the gap between schools and community agencies (National Association of School Psychologists, 2009). Thus, practitioners should work closely with community
mental health, juvenile justice, and other child-serving agencies, and doing so improved outcomes for children, their families, and their communities with fewer disruptions to the learning environment (Adelman & Taylor, 2006). In addition to interagency collaboration, partnering with the child’s family is an important component of coordinated services. Given the meta-analytic research that consistently demonstrated the positive effects of partnering between schools and families (Hill & Tyson, 2009; Jeynes, 2007), school psychologists should also help enhance the connection between parents and teachers. Moreover, parents should be members of school-based teams and should collaborate with all educational professionals involved in the education of their children. Multicultural Competence Cultural competence is an ethical obligation (Jacob & Hartshorne, 2007) for school psychologists and a foundational competence for practice (Ysseldyke et al., 2006), but it is especially important within an ecological perspective to prevention science (Hage et al., 2007). To explain cultural competence would go well beyond the scope of this article, but involves much more than superficial assumptions about individual students from any given population. For example, assuming that Native American children should be taught with a holistic (i.e., determine meaning from the whole picture), visual, reflective, and cooperative approach, which is a common assumption among educators (Pewewardy, 2002), creates a stereotype rather than an instructional accommodation. Modifying instruction based on assumptions regarding populations does not accelerate achievement and can interfere with student learning (Shaw, 2001). Instead of making decisions with generalization, Ortiz (2008) suggested that when school psychologists work with culturally diverse students that they “begin with the hypothesis that the examinee’s difficulties are not intrinsic in nature, but rather that they are more likely attributable to external or environ-
mental problems” (p. 664). It is then necessary to collect individual data to analyze the problem and to interpret all data through each student’s unique experiences and background, including the child’s language proficiency and degree of acculturation before selecting interventions (Paredes Scribner, 2002). Implications for SPR Articles published in SPR should at least discuss potential implications for theory and should be written from an ecological and preventative perspective. Moreover, manuscripts submitted to SPR should explicitly discuss potential implications for policy when appropriate. Publication decisions will be determined by the scientific merit of each individual paper, but consistency with ecological theory and prevention science, and potential implications for policy, will be included as criteria with which to judge a manuscript’s consistency with SPR’s focus. Special series proposals will be largely evaluated by their consistency with an ecological perspective to prevention science. SPR currently publishes intervention research that often appears atheoretical because it only investigates whether an intervention is effective and does not ask why it is. Certainly empirical testing of interventions and descriptions thereof will continue to be welcome submissions to the Research Into Practice and Research Briefs sections of SPR. However, interventions research published as general articles in SPR should address moderating or mediating variables so that researchers can better understand the ecology and the causal mechanisms of the intervention. There is a large range of manuscript topics that is appropriate for SPR, which is a strength of the journal. However, potential authors are especially encouraged to submit their work to SPR if it addresses strengthbased competence enhancement, multitiered systems of support for a variety of issues, school-based coordination of services, or multicultural competence. Special series proposals regarding these topics will also be welcomed submissions. 137
Conclusion The ideas presented here are not new, but are syntheses of previous recommendations for school psychology. It is difficult to imagine a time when school psychology had such an effect on the education of children and youth in this country, but there is work to be done. Change theorists propose that organizations are not likely to succeed when they change as a reaction to external forces, but must evolve in a manner consistent with those forces (Hannan & Freeman, 1984). Many states are adopting a multitiered system of support that combines both response to intervention and positive behavior support within one comprehensive model. This multitiered approach was developed from public health and is consistent with prevention science. Thus, school psychology should move even closer to that conceptual orientation and add the ecological perspective that we bring. A move in that direction would be consistent with external forces driving education today, but would also extend from our strengths and could increase the effect of school psychology research on policy. Moreover, better contextualizing our research within a strong theoretical framework will allow for conceptual clarity of the data and present areas for future research. Although these ideas are not new, they will be the driving force for SPR during the next 5 years. Thomas Power guided SPR with a similar vision throughout the past 5 years, and the current focus will merely continue what he started. I thank Dr. Power and his editorial team for their excellent work and vision for the journal. The incoming editorial leadership team humbly welcomes the challenge and looks forward to putting this vision in place. References Adelman. H. S., & Taylor. L. (1997). Toward a scale-up model for replicating new approaches to schooling. Journal of Educational and Psychological Consultation, 8, 197–230. Adelman, H. S., & Taylor, L. (2006). The implementation guide to student learning supports in the classroom and schoolwide: New directions for addressing barriers to learning. Thousand Oaks, CA: Corwin Press. Apter, S. J., & Conoley, J. C. (1984). Childhood behavior disorders and emotional disturbance: An introduction to teaching troubled children. Englewood Cliffs, NJ: Prentice-Hall. 138
Baker, J. A. (1998). Are we missing the forest for the trees? Consider the social context of school violence. Journal of School Psychology, 36, 29 – 44. Baron, R. M., & Kenny, D. A. (1986). The moderatormediator variable distinction in social psychological research: Conceptual, strategic, and statistical considerations. Journal of Personality and Social Psychology, 51, 1173–1182. Bolt, D. M., Ysseldyke, J., & Patterson, M. J. (2010). Students, teachers, and schools as sources of variability, integrity, and sustainability in implementing progress monitoring. School Psychology Review, 39, 612– 630. Botvin, G. (2004). Advancing prevention science and practice: Challenges, critical issues, and future directions. Prevention Science, 5, 69 –72. Bronfenbrenner, U. (1977). Toward and experimental ecology of human development. American Psychologist, 32, 513–531. Caplan, G. (1964). Principles of preventive psychiatry. New York: Basic Books. Coie, J. D., Watt, N. F., West, S. G., Hawkins, J. D., Asarnow, J. R., Markman, H. J., et al. (1993). The science of prevention: A conceptual framework and some directions for a national research program. American Psychologist, 48, 1013–1022. Cowen, E. L. (1977). Psychologists and primary prevention: Blowing the cover story. American Journal of Community Psychology, 5, 481– 489. Elias, M. J., & Dilworth, J. E. (2003). Ecological/developmental theory, context-based best practice, and school-based action research: Cornerstones of school psychology training and policy. Journal of School Psychology, 41, 293–297. Ellis, A. K. (2005). Research on educational innovations (4th ed.). Larchmont, NY: Eye on Education. Ervin, R. A., Schaughency, E., Goodman, S. D., McGlinchey, M. T., & Matthes, A. (2006). Merging research and practice to address reading and behavior school-wide. School Psychology Review, 35, 198 –223. Fantuzzo, J., McWayne, C., & Bulotsky, R. (2003). Forging strategic partnerships to advance mental health science and practice for vulnerable children. School Psychology Review, 32, 17–37. Feifer, S. G. (2008). Integrating response to intervention within neuropsychology: A scientific approach to reading. Psychology in the Schools, 45, 812– 825. Grimes, J., Kurns, S., & Tilly, W. D. (2006). Sustainability: An enduring commitment to success. School Psychology Review, 35, 224 –243. Hage, S. M., Romano, J. L., Conye, R. K., Kenny, M., Matthews, C., Schwartz, J. P., et al. (2007). Best practice guidelines on prevention, practice, research, training, and social advocacy for psychologists. Journal of Counseling Psychologists, 35, 493–566. Hagermoser Sanetti, L. M., & Kratochwill, T. R. (2009). Toward developing a science of treatment integrity: Introduction to the special series. School Psychology Review, 38, 445– 459. Hannan, M. T., & Freeman, J. (1984). Structural inertia and organizational change. American Sociological Review, 49, 149 –164. Hill, N. E., & Tyson, D. F. (2009). Parental involvement in middle school: A meta-analytic assessment of the strategies that promote achievement. Developmental Psychology, 45, 740 –763.
Hughes, J. N. (2000). The essential role of theory in the science of treating children: Beyond empirically supported treatments. Journal of School Psychology, 38, 301–330. Hughes, J. N., Cavell, T. A., & Jackson, T. (1999). Influence of teacher–student relationships on aggressive children’s development: A prospective study. Journal of Clinical Child Psychology, 28, 173–184. Jacob, S., & Hartshorne, T. S. (2007). Ethics and law for school psychologists (5th ed.). Hoboken, NJ: John Wiley & Sons. Jeynes, W. H. (2007). The relationship between parental involvement and urban secondary school student achievement: A meta-analysis. Urban Education, 42, 82–110. Keith, T. Z. (2008). Best practices in applied research. In A. Thomas & J. Grimes (Eds.), Best practices in school psychology—IV (pp. 91–102). Bethesda, MD: National Association of School Psychologists. Kratochwill, T. R., Hitchcock, J., Horner, R. H., Levin, J. R., Odom, S. L., Rindskopf, D. M., & et al. (2010). Single-case designs technical documentation. Retrieved from What Works Clearinghouse Web site: http://ies.ed.gov/ncee/wwc/pdf/wwc_scd.pdf Lindblom, C. E. (1979). Still muddling, not yet through. Public Administration Review, 39, 517–526. Mahoney, J. (2002, August) Causal mechanisms, correlations, and a power theory of society. Paper presented at the annual meeting of the American Political Science Association, Boston, MA. Available online at http://www.allacademic.com/meta/p66368_index.html McClelland, G. H., & Judd, C. M. (1993). Statistical difficulties of detecting interactions and moderator effects. Psychological Bulletin, 114, 376 –390. National Association of School Psychologists. (2009). Appropriate behavioral, social, and emotional supports to meet the needs of all students (Position Statement). Bethesda, MD: Author. Ortiz, S. O. (2008). Best practices in nondiscriminatory assessment. In A. Thomas & J. Grimes (Eds.). Best practices in school psychology (5th edition, pp. 661– 678). Bethesda, MD: National Association of School Psychologists. Paredes Scribner, A. (2002). Best assessment and intervention practices with second language learners. In A. Thomas & J. Grimes (Eds.), Best practices in school psychology IV (pp. 337–351). Bethesda, MD: National Association of School Psychologists. Pewewardy, C. (2002). Learning styles of American Indian/Alaskan Native students: A review of the literature and implications for practice. Journal of American Indian Education, 41, 22–56. Power, T. J., & Mautone, J. A. (2011). School Psychology Review: Looking back on 2006 –2010. School Psychology Review, 39, 673– 678. Romano, J. L., & Hage, S. M. (2000). Prevention and counseling psychology: Revitalizing commitments for the 21st century. The Counseling Psychologist, 28, 733–763.
Shaw, C. C. (2001). Instructional pluralism: A means to realizing the dream of multicultural social reconstructionist education. In C. A. Grant & M. L. Gomez (eds.), Campus and classroom: Making schooling multicultural (2nd ed.; pp. 47–64). Upper Saddle River, NJ: Prentice Hall. Sheridan, S. M., & Gutkin, T. B. (2000). The ecology of school psychology: Examining and changing our paradigm for the 21st century. School Psychology Review, 29, 485–502. Spence, J. C., & Lee, R. E. (2003). Toward a comprehensive model of physical activity. Psychology of Sport and Exercise, 4, 7–24. Stattin, H., & Magnusson, D. (1996). Antisocial development: A holistic approach. Development and Psychopathology, 8, 617– 646. Stith, S., Pruitt, I., Dees, J., Fronce, M., Green, N., Som, A., et al. (2006). Implementing community-based prevention programming: A review of the literature. Journal of Primary Prevention, 27, 599 – 617. Sugai, G., & Horner, R. R. (2006). A promising approach for expanding and sustaining school-wide positive behavior support. School Psychology Review, 35, 245–259. Tharinger, D. (2000). The complexity of development and change: The need for the integration of the theory and research findings in psychological practice with children. Journal of School Psychology, 38, 383–388. Tilly III, W. D. (2008). The evolution of school psychology to science-based practice. In A. Thomas & J. Grimes (Eds.), Best practices in school psychology V (pp. 18 –32). Washington, DC: National Association of School Psychologists. U.S. Department of Health and Human Services. (1999). Mental health: A report of the surgeon general. Rockville, MD: U.S. Department of Health and Human Services, Substance Abuse and Mental Health Services Administration, Center for Mental Health Services, National Institutes of Health, National Institute of Mental Health. What Works Clearinghouse. (2008). Procedures and standards handbook version 2.0. Washington, DC: Institute for Education Sciences. Wilson, D. B., Gottfredson, D. C., & Najaka, S. S. (2001). School-based prevention of problem behaviors: A meta-analysis. Journal for Quantitative Criminology, 17, 247–272. Yee, C. M. (2009, February 8). Medical tests at churches aren’t what docs ordered: Screenings by for-profit companies are popular, but experts say the exams waste money and can create anxiety. Minneapolis Star Tribune. Retrieved from http://www.startribune.com/ lifestyle/39266627.html?page⫽1&c⫽y Ysseldyke, J., Burns, M., Dawson, P., Kelley, B., Morrison, D., Ortiz, S., et al. (2006). School psychology: A blueprint for training and practice III. Bethesda, MD: National Association of School Psychologists.
Matthew K. Burns is a professor of educational psychology, coordinator of the school psychology program, and codirector of the Minnesota Center for Reading Research at the University of Minnesota. He is the editor of School Psychology Review and past editor of Assessment for Effective Intervention. Specific areas in which he has conducted research include assessment of instructional level, academic interventions, facilitation of problemsolving teams, and putting them together within a response to intervention framework.
School Psychology Review, 2011, Volume 40, No. 1, pp. 140 –148
RESEARCH BRIEF Prereading Deficits in Children in Foster Care Katherine C. Pears, Cynthia V. Heywood, and Hyoun K. Kim Oregon Social Learning Center Philip A. Fisher Oregon Social Learning Center University of Oregon Abstract. Reading skills are core competencies in children’s readiness to learn and may be particularly important for children in foster care, who are at risk for academic difficulties and higher rates of special education placement. In this study, prereading skills (phonological awareness, alphabetic knowledge, and oral language ability) and kindergarten performance of 63 children in foster care were examined just before and during the fall of kindergarten. The children exhibited prereading deficits with average prereading scores that fell at the 30th to 40th percentile. Variations in prereading skills (particularly phonological awareness) predicted kindergarten teacher ratings of early literacy skills in a multivariate path analysis. These findings highlight the need for interventions focused on prereading skills for children in foster care.
Children in foster care fare worse than their peers on many indicators of academic adjustment, exhibiting high rates of special education placement, discipline referrals, and school dropout (e.g., Scherr, 2007; Zima et al., 2000). Children in foster care also lag significantly behind their peers in reading, writing, numeracy, and language (Mitic & Rimer,
2002) and perform significantly worse on measures of academic and socioemotional adjustment compared to children from low socioeconomic backgrounds (Pears, Fisher, Bruce, Kim, & Yoerger, 2010). This poor performance does not appear to be solely attributable to unique risk factors often found in children in foster care. For example, Fantuzzo
This research was supported by the following grants: DA021424, National Institute on Drug Abuse (NIDA), U.S. Public Health Service (PHS); and DA023920, NIDA, PHS. The authors thank Kristen Greenley and Angie Relling for project management, Katie Lewis and Matthew Rabel for editorial assistance, and the children and families who participated in the project. Correspondence regarding this article should be addressed to Katherine C. Pears, Oregon Social Learning Center, 10 Shelton McMurphey Boulevard, Eugene, OR 97401; e-mail:
[email protected] Copyright 2011 by the National Association of School Psychologists, ISSN 0279-6015
and Perlman (2007) found that, even when other risk factors (e.g., birth and poverty risks) were controlled, being in out-of-home care significantly and independently predicted poor academic and behavioral adjustment for children in second grade. Given the elevated risks for poor school adjustment among children in foster care, there is a need for research on the potential early precursors of school difficulties with this population. Additional knowledge about the reading development of children at risk for academic failure because of foster care placements could expand the scientific knowledge base about academic skill development in general, and could allow service providers to tailor preventive intervention services to the needs of such populations (Justice, Invernizzi, Geller, Sullivan, & Welsch, 2005). Because child welfare agencies often have limited resources for screening and intervention services for children in foster care (Zima et al., 2000), such targeted interventions could aid agencies in maximizing resources to increase the chances of better school outcomes for these children. Early reading skills are an important predictor of later academic and behavioral adjustment in the general population. Children who struggle with reading in the first and second grades are likely to exhibit difficulties into middle and high school (e.g., Cunningham & Stanovich, 1998). Poor reading skills have also been linked to behavioral difficulties at school, which may increase the likelihood of problems such as antisocial behavior and juvenile delinquency (Halonen, Aunola, Ahonen, & Nurmi, 2006). In one of the few studies to examine reading skills in school-aged children in foster care, Fantuzzo and Perlman (2007) suggested that children in out-of-home placements show markedly poorer reading skills than their peers as early as second grade. Reading difficulties may already be well established by the second grade (Al Otaiba & Fuchs, 2006); thus, identifying risk factors for poor reading before school entry might help in preventing later problems. To date, there are no published studies on prereading skills in children in foster care
before school entry, and research regarding early screening could aid intervention efforts to prevent subsequent difficulties in this population. The prereading skills considered in this study have been linked to later reading abilities in the general population. In kindergarten, phonological awareness (i.e., the ability to distinguish sounds in words) predicts better reading outcomes across the early school years, and alphabetic understanding (i.e., the ability to recognize letters) is linked to well-developed or deficient reading skills (National Institute for Literacy, 2009). General language skills also appear to be important to later reading abilities, particularly reading comprehension (e.g., Catts, Fey, Zhang, & Tomblin, 1999). Based on the deficits observed in the later academic functioning of children in foster care (Mitic & Rimer, 2002), we hypothesized that the children in our study would perform more poorly on measures of prereading skills than the general population. Schatschneider, Fletcher, Francis, Carlson, and Foorman (2004) noted that there was little agreement in the literature on the relative importance of specific prereading skills in predicting later reading abilities. In addition, different prereading skills may be differentially important to outcomes for specific populations and could be effective targets for intervention (Al Otaiba & Fuchs, 2006). We can best pinpoint those targets by testing such associations within particular populations such as children in foster care. In this study, we examined associations between prereading skills in children in foster care and teacher-rated early literacy skills in kindergarten, while controlling for the other prereading skills and an estimate of general intelligence. The following research questions guided the study:

• How do the prereading skills of children in foster care compare to those of general population children?
• To what extent are particular prereading skills more important than others in predicting teacher-rated early literacy skills in kindergarten for children in foster care?
Methods Participants The participants in this study were 63 (36 females; 57%) children in foster care. To be eligible for the study, each child had to be in nonrelative or relative foster care at recruitment, entering kindergarten in the fall, and a monolingual or bilingual English speaker. The children and their foster families were recruited from two counties in the Pacific Northwest of the United States, each with a midsized metropolitan area. Our staff members first contacted each child’s caseworker to request consent for the child to participate and then contacted the foster caregivers to invite them to participate. Both the caseworker and foster caregivers had to consent to participate. The mean age of the children was 5.46 years (SD ⫽ 0.36). Fifty-nine percent of the children were in nonrelative foster care. The children had experienced an average of 3 unique foster placements (SD ⫽ 1) and an average of 558 days in care (SD ⫽ 397). The ethnicity breakdown of the sample was as follows: 59% European American, 27% Latino, and 14% mixed race. The children in this study were part of a larger sample of children participating in an efficacy trial of a school readiness intervention for children in foster care. However, all of the children in the current study were randomly assigned to the control group. Measures Phonological and phonemic awareness. Phonological awareness was assessed using the Phonological Awareness Composite scale score from the Comprehensive Test of Phonological Processing (CTOPP; Wagner, Torgesen, & Rashotte, 1999). This score is a composite of the scale scores from the Elision, Blending Words, and Sound Matching subtests. Reliability estimates for 5- and 6-yearolds were ␣ ⫽ .95 and .96, respectively (Wagner et al., 1999). Percentile rankings of the children’s scores were used to compare the performance of children in foster care to that of children in the general population. In addition, raw scores on the Initial 142
Sound Fluency (ISF) measure from the Dynamic Indicators of Basic Early Literacy Skills (DIBELS; Good & Kaminski, 2002) were used to assess phonemic awareness. The DIBELS are designed to assess reading development of students from kindergarten through sixth grade. In the ISF measure, the child is asked to orally produce the initial sound of a word that corresponds to a stimulus picture. The total score is the number of correct initial sounds produced in 1 min. Alternate-form reliability for ISF data are high (r ⫽ .72; Good et al., 2003). The percentile ranks of the children’s raw ISF scores were based on the norms for general population children tested in the fall of their kindergarten year. As the children in this study were about to enter kindergarten, this was felt to be an appropriate comparison sample. Alphabetic understanding. Each child’s raw score on the Letter Naming Fluency (LNF) measure from the DIBELS was used to assess alphabetic understanding. The children are asked to identify as many upper- and lowercase letters as possible from a randomly ordered array. The score is the number of correct letters identified in 1 min. Alternateform, 1-month reliability for LNF data are high (r ⫽ .88; Good et al., 2003). As with the ISF scores, percentile ranks for the raw LNF scores were based on the norms for general population children in the fall of kindergarten. Oral language ability. Oral language ability was assessed using each child’s scaled core language score (Sentence Structure, Word Structure, and Expressive Vocabulary subscales; M ⫽ 100, SD ⫽ 15) of the Clinical Evaluation of Language Fundamentals Preschool–Second Edition (CELF-P; Wiig, Secord, & Semel, 2004). Internal consistency coefficients for data from this scale are high (for ages 4 –5 years, ␣ exceeded .92). Percentile ranks of the core language score were used in analyses comparing the scores of children in foster care to those of general population children. Estimated general cognitive ability. The scaled score from the Block Design subscale of the Wechsler Preschool and Pri-
mary Scales of Intelligence—Third Edition (Wechsler, 2002) was used to estimate general cognitive ability. Data from this subscale are strongly correlated with the Full Scale IQ (r ⫽ .72; Wechsler, 2002). Teacher-rated early literacy. Each child’s kindergarten teacher completed the 26item Pre-Literacy Rating Scale (PLRS) from the CELF-P during the fall of kindergarten. The PLRS shows good internal consistency (␣ ⫽ .95) and is designed to measure the frequency with which children display a number of critical emergent reading and writing skills. Although there was some overlap with the measures used to assess prereading skills before kindergarten entry (e.g., “the child identifies and names 5 or more letters of the alphabet”), the PLRS items assess a broader range of skills specific to reading and writing abilities (e.g., “The child holds a book right side up” and “The child copies and/or writes own name accurately”). The teachers were asked to rate the frequency with which each child displayed the behaviors on a 4-point scale: 1 (never) to 4 (always) or N/A. A mean PRLS score (range ⫽ 1– 4) was computed for each child. This was used in the correlational and path analyses described below. Early intervention services. To account for any early intervention services received, the foster caregivers were interviewed about the type and duration of such services. The caregivers indicated the duration of services received on a 5-point scale: 1 (less than 1 school year) to 5 (more than 2 school years). Procedure The children’s prereading skills were assessed twice during the summer before kindergarten entry: at the beginning and the end of the summer just before the start of school. The 1.5 h assessments were conducted at the research center. Because general cognitive ability is assumed to be a fairly stable trait (Wechsler, 2002), the Block Design subscale was measured only at the beginning of the summer. Early intervention services were assessed only at the beginning of the summer.
The CTOPP, DIBELS, and CELF scores used in the current study were taken from the assessments conducted at the end of the summer. This was done to ensure that the measures used were the closest to the start of school. Information was only available from the assessments conducted at the beginning of the summer for 8 of the students, but their scores were used in the analyses to increase statistical power. The PLRS scores were taken from teacher interviews in the fall of kindergarten an average of 2.93 months (SD = 1.00) after the start of school. The mean length of time between the end-of-summer child assessment and the teacher interview was 3.51 months (SD = 1.17). All assessments were conducted by undergraduate- and graduate-level assessors trained by supervisors experienced in standardized test administration. The assessors were trained to reliability with their supervisors while assessing practice participants who were not part of the study sample. Periodic checks of their reliability were also conducted as they assessed the study participants.

Data Analysis Plan

The children's percentile rankings on each of the prereading measures were used to analyze the first research question (How do the prereading skills of children in foster care compare to those of general population children?). Chi-square analyses were used to determine if the percentages of children in foster care falling below the 25th and 50th percentile ranks for each prereading skill measure were significantly different from those of the general population. In addition, the percentage of children in foster care falling below the critical score for each measure was examined using chi-square analyses. Path modeling was conducted using Mplus (Muthén & Muthén, 2007) to answer our second research question (To what extent are particular prereading skills more important than others in predicting teacher-rated early literacy skills in kindergarten for children in foster care?). We chose to use path analysis because it allows for the estimation of missing data using full information maximum likelihood estimation and accounts for correlated measurement error. An alpha level of p < .05 was used to determine statistical significance in all analyses reported below.
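The path model itself was estimated in Mplus. As a rough, non-authoritative illustration of the second research question, a complete-case multiple regression of teacher-rated literacy on the prereading measures and the cognitive-ability control can be sketched in Python. The data file and column names below are hypothetical, and the sketch omits the full information maximum likelihood treatment of missing data and the predictor covariances estimated in the published model.

```python
# Simplified stand-in (not the authors' Mplus analysis) for the question of which
# prereading skills uniquely predict teacher-rated early literacy (PLRS mean).
# File name and column names are hypothetical placeholders.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("prereading_scores.csv")  # one row per child

model = smf.ols(
    "plrs_mean ~ phon_awareness + initial_sound_fluency + letter_naming_fluency"
    " + core_language + block_design",
    data=df.dropna(),  # complete cases only; the study used FIML instead
).fit()

print(model.summary())  # each coefficient approximates a unique path in the model
```

In this simplification, each regression coefficient plays the role of the corresponding direct path from a prereading measure to the teacher rating.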
Table 1
Means, Standard Deviations, and Percentages of Children at or Below Critical Scores for the Prereading Measures

Phonological awareness composite score: M = 91.70, SD = 10.24; at/below the 25th percentile, 53.7% (χ² = 23.73*); at/below the 50th percentile, 81.5% (χ² = 21.41*); at/below the critical score, 53.7% (χ² = 28.74*)
Initial sound fluency raw score: M = 7.11, SD = 7.69; at/below the 25th percentile, 44.3% (χ² = 12.07*); at/below the 50th percentile, 72.1% (χ² = 11.95*); at/below the critical score, 36.1% (χ² = 9.84*)
Letter naming fluency raw score: M = 7.49, SD = 10.22; at/below the 25th percentile, 50.8% (χ² = 21.69*); at/below the 50th percentile, 78.7% (χ² = 20.08*); at/below the critical score, 47.7% (χ² = 28.92*)
Oral language ability scaled score: M = 94.85, SD = 15.24; at/below the 25th percentile, 36.7% (χ² = 4.36*); at/below the 50th percentile, 68.3% (χ² = 8.07*); at/below the critical score, 26.7% (χ² = 5.08*)
Estimated general cognitive ability: M = 8.77, SD = 3.28
Teacher-rated early literacy: M = 3.14, SD = 0.57

Note. Chi-square tests were used to determine if the percentages of children in foster care at or below given percentiles or scores differed significantly from those in the general population. *p < .05.
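The chi-square values in Table 1 reflect comparisons of observed counts against the proportions expected from national norms (e.g., 25% of a norm sample falls at or below the 25th percentile). A minimal sketch of that kind of one-sample goodness-of-fit test is shown below; whether it matches the authors' exact computation is an assumption, and the counts are illustrative placeholders rather than study values.

```python
# One-sample chi-square test of an observed count against a normative proportion.
# Illustrative only; counts are placeholders, not values from the study.
from scipy.stats import chisquare

def norm_referenced_chi_square(n_at_or_below, n_total, expected_proportion):
    """Compare the observed split to the split expected under national norms."""
    observed = [n_at_or_below, n_total - n_at_or_below]
    expected = [n_total * expected_proportion, n_total * (1 - expected_proportion)]
    return chisquare(f_obs=observed, f_exp=expected)

# Hypothetical example: 33 of 61 children at or below the 25th percentile.
stat, p = norm_referenced_chi_square(33, 61, 0.25)
print(f"chi-square(1) = {stat:.2f}, p = {p:.4f}")
```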
Results

Preliminary Analyses

Our preliminary analyses indicated that there were no differences in the prereading and PLRS scores on the basis of foster care type (relative vs. nonrelative), county of residence, or gender (t = −1.58 to 1.92, p = .96 to .06). The only significant difference was that children of Latino ethnicity had lower core language scores on the CELF-P (M = 86.19, SD = 15.26) than children of non-Latino ethnicity (M = 98.00, SD = 14.12), t(58) = −2.80, p < .05. This may have been because the biological families of some of the children of Latino ethnicity used Spanish or a mixture of English and Spanish in the home. Given that the sample was recruited after entering foster care, it was not possible to gather this information. Latino ethnicity was included as a control variable in preliminary path analyses. The path model that included Latino ethnicity was not significantly different from the model reported below (χ² difference = 5.68, p = .34). Thus, Latino ethnicity was not included in further analyses. An alternative path analysis was conducted excluding from the sample the 8 children who only had CTOPP, DIBELS, and CELF scores from the beginning of the summer. The path model without these children did not significantly differ from the path model with these children (χ² difference = 1.04, p = .95). Thus, the results from the path model that included all of the children are presented below.

Descriptive Analyses

The children's mean scores on the measures are presented in Table 1.
Also shown in Table 1 are the percentages of children at or below the 25th and 50th percentiles for each prereading skill measure. Chi-square tests were used to determine if the percentages of children at or below the 25th and 50th percentiles differed significantly from what would be expected by chance. This was the case for all of the measures. In addition, we examined the percentages of children at or below the critical scores for each prereading skill measure. Children scoring below the critical scores are considered to be at risk for reading or language difficulties. Scores at or below the 23rd percentile on the CTOPP are considered below average to very poor (Wagner et al., 1999). Children who score at or below the 16th percentile (i.e., one standard deviation or more below the mean) on the CELF are considered at risk for language difficulties (Wiig et al., 2004). For the DIBELS, children who score below the 20th percentile on the ISF or LNF measure are considered to be at risk for later reading difficulties (Good et al., 2003). Chi-square analyses (see Table 1) indicated that the proportion of children in foster care scoring below the critical scores on each of these measures was significantly greater than what would be expected by chance. Thirty-nine percent of the children had received early intervention services: 19% for less than 1 school year, 18% for 1–2 school years, and 2% for more than 2 school years.

Multivariate Path Model

Before the path analysis, the associations between the children's scores on the prereading measures, the mean of the PLRS, and the control variables were examined. The positive correlations among the prereading skill measures and between the prereading skill measures and the PLRS mean score were significant (r = .27 to .68 and .33 to .59, p < .05). The children's phonological awareness scores demonstrated a particularly strong association with their core language scores (r = .68), raising the possibility of multicollinearity. However, the two skills may be differentially associated with later reading abilities; thus, we decided to keep these two scores separate in the path analysis and to undertake additional testing to ensure that the strong association did not change the results. When the control variables were examined, the WPPSI Block Design scale scores were significantly positively associated with the prereading skill and teacher measures (r = .33 to .43, p < .05), with the exception of initial sound fluency (r = .22, ns). The length of time for early intervention services was not significantly associated with any measure (r = −.19 to −.02, ns). Thus, this variable was not included in the path analysis. The path model (see Figure 1) showed acceptable fit, χ²(5) = 4.95, p = .42, Comparative Fit Index (CFI) = 1.00, Tucker-Lewis Index (TLI) = 1.00, Root Mean Square Error of Approximation (RMSEA) = 0.00. When all of the prereading measures were included in the model, only phonological awareness was a unique significant predictor of teacher-rated early literacy skills. All of the prereading skill measures significantly covaried with one another. As a group, they accounted for a significant amount of the variance in the teacher ratings (R² = .42, p < .05). Two alternate models were estimated to examine the potential effects of the high correlation between phonological awareness and oral language ability. The first included all of the prereading measures except oral language ability, and the second included all of the measures except phonological awareness. Neither alternate model significantly differed from the model that included all of the measures (χ² difference = 2.30 and 3.93, p = .32 and .34, respectively). Core language was not a significant predictor of PLRS scores in the alternate model. Given these results, we concluded that the results of the full model did not seem to be overly influenced by the strong association between phonological awareness and oral language ability.

Figure 1. Path model of prereading measures, teacher ratings, and control measures. *p < .05.

Discussion

Data regarding our first research question were consistent with a worrisome observation reported previously in the literature: up to 50% of children in foster care entering
kindergarten are at risk for later reading difficulties. On phonological awareness, one of the most predictive prereading skills (Schatschneider et al., 2004), 54% of the children in this study scored below the 23rd percentile. Further, most of the children scored below the 50th percentile on all prereading skill measures. This is consistent with the high rates of developmental delays found in children in foster care (e.g., Klee, Kronstadt, & Zlotnick, 1997) and builds upon past studies by focusing on the prereading skills essential for the development of reading ability. Our findings for our second research question were consistent with research with the general population (e.g., Schatschneider et al., 2004): phonological awareness was the strongest predictor of teacher-rated early literacy skills in kindergarten. This was true even when estimated general cognitive ability was controlled and in the presence of other prereading skill measures. The association between phonological awareness and future teacher ratings suggested a potentially important target for intervention with children in foster care. A number of studies have demonstrated that it is possible to bolster future read146
ing abilities and prevent reading difficulties by improving phonological awareness (e.g., Bus & Van IJzendoorn, 1999). In addition, such interventions may increase the effectiveness of future reading interventions, as strong phonological awareness skills appear to predict better response to literacy interventions (Al Otaiba & Fuchs, 2006). Ideally, all children at risk for reading difficulties, including children in foster care, would receive early intervention in a range of prereading abilities. However, given the often limited resources within the child welfare system, specifying the targets that have the most influence on reading outcomes might help to identify services that have the greatest impact. Our results suggest that all preschool-aged children in foster care should receive phonological awareness screening and that those with deficits should receive early intervention services. However, additional research is needed before recommendations for practice or policy can be confidently made. It was somewhat surprising that the length of time that the children had received early intervention services was not associated with any of the prereading skill measures. However, such services typically focus on
specific disabilities (e.g., providing articulation therapy or occupational therapy) and might not be specific to prereading skills. Further, the caregivers might have underestimated the length of early intervention services because of a lack of knowledge of the children’s care histories. Limitations and Future Directions Although this study is one of the first to examine specific prereading skills in children in foster care at kindergarten entry, a number of caveats should be mentioned. First, the sample size was small compared to other studies of early literacy skills in the general population. Although this limitation is understandable given the difficulties involved in longitudinal data collection in this population, the results should be interpreted with caution and replicated within a larger sample. To ensure that the significant effect of phonological awareness on teacher reports of early literacy was robust, we conducted a Monte Carlo analysis. Such analyses help to determine whether there is enough power with a given sample size to detect an effect of a given magnitude across multiple samples (Muthe´n & Muthe´n, 2002). Maximum power (i.e., ⬎0.99) was obtained with 1000 random samples, suggesting that the effect was robust despite the small sample size. Because of the small sample size, it was beyond the scope of this study to specify the precursors of the prereading skill deficits documented here. Future researchers should more finely detail the early factors that may affect the prereading skills of children in foster care (e.g., type of maltreatment or time spent in foster care). Although we focused on the association between prereading skills and teacherrated early literacy, there are likely many other factors that affect school outcomes for children in foster care. For example, attention might be important to early school performance (Pears et al., 2010). Finally, as noted earlier, children in foster care perform more poorly on measures of socioemotional development and academic performance than children from at-risk, low socioeconomic back-
grounds (Pears et al., 2010). In future work, it would be useful to compare the scores of children in foster care on the specific prereading skill measures in this study with the scores of children from low socioeconomic backgrounds. Despite these limitations, our results indicate that children in foster care lag far behind the general population on a number of prereading skills and suggest some targets for prevention and early intervention with these children, most notably phonological awareness. Programs that target the prereading skills of these children might help to guide them to a more positive trajectory of academic success. References Al Otaiba, S., & Fuchs, D. (2006). Who are the young children for whom best practices in reading are ineffective? An experimental and longitudinal study. Journal of Learning Disabilities, 39, 414 – 431. Bus, A. G., & Van IJzendoorn, M. H. (1999). Phonological awareness and early reading: A meta-analysis of experimental training studies. Journal of Educational Psychology, 91, 403– 414. Catts, H. W., Fey, M. E., Zhang, X., & Tomblin, J. B. (1999). Language basis of reading and reading disabilities: Evidence from a longitudinal investigation. Scientific Studies of Reading, 3, 331–361. Cunningham, A. E., & Stanovich, K. E. (1998). What reading does for the mind. American Educator, 22(1– 2), 8 –15. Fantuzzo, J., & Perlman, S. (2007). The unique impact of out-of-home placement and the mediating effects of child maltreatment and homelessness on early school success. Children and Youth Services Review, 29, 941– 960. Good, R. H., & Kaminski, R. A. (2002). Dynamic Indicators of Basic Early Literacy Skills (6th ed.). Eugene, OR: Institute for the Development of Educational Achievement. Good, R. H., Kaminski, R. A., Smith, S., Simmons, D., Kame’enui, E., & Wallin, J. (2003). Reviewing outcomes: Using DIBELS to evaluate a school’s core curriculum and system of additional intervention in kindergarten. In S. R. Vaughn & K. L. Briggs (Eds.), Reading in the classroom: Systems for the observation of teaching and learning (pp. 221–266). Baltimore: Paul H. Brookes. Halonen, A., Aunola, K., Ahonen, T., & Nurmi, J.-E. (2006). The role of learning to read in the development of problem behavior: A cross-lagged longitudinal study. British Journal of Educational Psychology, 76, 517–534. Justice, L. M., Invernizzi, M., Geller, K., Sullivan, A. K., & Welsch, J. (2005). Descriptive-developmental performance of at-risk preschoolers on early literacy tasks. Reading Psychology, 26, 1–25. 147
Klee, L., Kronstadt, D., & Zlotnick, C. (1997). Foster care’s youngest: A preliminary report. American Journal of Orthopsychiatry, 67, 290 –299. Mitic, W., & Rimer, M. (2002). The educational attainment of children in care in British Columbia. Child and Youth Care Forum, 31, 397– 414. Muthe´n, L. K., & Muthe´n, B. O. (2002). How to use a Monte Carlo study to decide on sample size and determine power. Structural Equation Modeling, 4, 599 – 620. Muthe´n, L. K., & Muthe´n, B. O. (2007). Mplus user’s guide (5th ed.). Los Angeles: Authors. National Institute for Literacy. (2009). Developing early literacy: Report of the National Early Literacy Panel. Washington, DC: Author. Pears, K. C., Fisher, P. A., Bruce, J., Kim, H. K., & Yoerger, K. (2010). Early elementary school adjustment in maltreated children in foster care: The roles of inhibitory control and caregiver involvement. Child Development, 81(5), 1550 –1564. Schatschneider, C., Fletcher, J. M., Francis, D. J., Carlson, C. D., & Foorman, B. R. (2004). Kindergarten prediction of reading skills: A longitudinal comparative analysis. Journal of Educational Psychology, 96, 265–282. Scherr, T. G. (2007). Educational experiences of children in foster care: Meta-analyses of special education, re-
tention and discipline rates. School Psychology International, 28, 419 – 436. Wagner, R. K., Torgesen, J. K., & Rashotte, C. A. (1999). Comprehensive Test of Phonological Processing: Examiner’s manual. Austin, TX: PRO-ED. Wechsler, D. (2002). Wechsler Preschool and Primary Scales of Intelligence—Third Edition. San Antonio, TX: PsychCorp. Wiig, E. H., Secord, W. A., & Semel, E. (2004). Clinical Evaluation of Language Fundamentals Preschool— Second Edition. San Antonio, TX: Harcourt Assessment. Zima, B. T., Bussing, R., Freeman, S., Yang, X., Belin, T. R., & Forness, S. R. (2000). Behavior problems, academic skill delays and school failure among schoolaged children in foster care: Their relationship to placement characteristics. Journal of Child and Family Studies, 9, 87–103.
Date Received: June 19, 2009
Date Accepted: November 1, 2010
Action Editor: Matthew K. Burns
Article accepted by previous Editor.
Katherine C. Pears, Ph.D. is a Research Scientist at Oregon Social Learning Center (OSLC). Her research interests include school readiness in high-risk populations and interventions to improve school readiness outcomes, the development of social cognitive and social-emotional skills in high-risk preschool children, and the effects of early adversity on child social and cognitive development. Cynthia V. Heywood, Ph.D. is an early career scientist at OSLC. Her research interests include the development and implementation of child-directed and family-based interventions for high-risk populations. Specifically, she is interested in the development of self-regulation and stress reactivity in early childhood, mechanisms of resilience, parenting practices for caregivers of infants and children with intensive needs, and clinical applications of mindfulness practices. Hyoun K. Kim, Ph.D. is a research scientist at the Oregon Social Learning Center. Her research interests center around the development of psychopathology in children, adolescents and young adults from at-risk backgrounds, including depression, delinquency, drug use, health risking sexual behavior, and intimate partner violence. Philip A. Fisher, Ph.D. is a senior research scientist at OSLC and Professor of Psychology at the Univesity of Oregon. His research interests include prevention research in the early years of life, the effects of early stress on the developing brain, and the plasticity of neural systems in response to environmental interventions.
School Psychology Review, 2011, Volume 40, No. 1, pp. 149 –157
RESEARCH BRIEF Effects of the Helping Early Literacy with Practice Strategies (HELPS) Reading Fluency Program When Implemented at Different Frequencies John C. Begeny North Carolina State University Abstract. Approximately 40% of U.S. fourth-grade students are nonfluent readers. In response to the need for fluency-based instructional programs for elementaryaged students, the Helping Early Literacy with Practice Strategies (HELPS) Program was developed by integrating eight evidence-based fluency-building instructional strategies into a systematic program that can be (a) feasibly implemented by several types of educators, and (b) accessed for free by all educators. The present study sought to examine the effects of HELPS with second-grade students when implemented three times per week compared to once or twice per week, and throughout most of a school year. Results showed that students receiving HELPS three times per week significantly outperformed a control group of students on the measures of reading fluency and comprehension. Students who received HELPS an average of 1.5 times per week significantly outperformed the control group students on the measure of reading fluency. Implications of these findings for school psychologists are discussed.
Several factors highlight the need for additional research to evaluate the effects of structured, easy-to-use reading programs designed to strengthen students’ reading fluency. For example, of the small number of programs specifically designed to improve students’ reading fluency (e.g., Fluency First, Great Leaps, Read Naturally), many of those pro-
grams currently have little to no published research evaluating their effectiveness (Begeny, 2009). To illustrate, as of September 2009, the What Works Clearinghouse reviewed 170 reading programs for kindergarten to third-grade students and found that none of the programs met What Works Clearinghouse guidelines as demonstrating “strong evidence
Correspondence regarding this article should be addressed to John C. Begeny, College of Humanities and Social Sciences, Department of Psychology, 640 Poe Hall, Campus Box 7650, Raleigh, NC 27695-7650; e-mail:
[email protected]
of a positive effect with no overriding contrary evidence” for reading fluency (What Works Clearinghouse, 2009). The need to empirically evaluate reading fluency programs is also evidenced by the following. First, although reading fluency is considered one of the five areas of early reading that should be targeted during instruction (Armbuster, Lehr, & Osborn 2001; National Reading Panel, 2000), approximately 40% of U.S. elementary-aged students are considered nonfluent readers (Daane, Campbell, Grigg, Goodman, & Oranje, 2005). Second, oral reading fluency is one of the strongest predictors of students’ overall reading ability, including reading comprehension and performance on end-of-grade tests (e.g., Fuchs, Fuchs, Hosp, & Jenkins, 2001; McGlinchey & Hixson, 2004). Third, reading fluency was identified as a possible indicator of a specific learning disability in the U.S. Individuals With Disabilities Education Improvement Act of 2004. In response to the aforementioned factors, the Helping Early Literacy with Practice Strategies (HELPS) Program (Begeny, 2009) was created to specifically target students’ reading fluency development. In the initial study evaluating HELPS (Begeny et al., 2010), second-grade students of all reading ability levels received the program as a supplement to their core reading curriculum for approximately 10 min per day, three times per week, from February through April of a traditional school year. Students’ performance was evaluated across eight different measures of reading and was compared to a control group and to students who received the Great Leaps K–2 Reading Program (Mercer & Campbell, 1998), which is a widely adopted reading fluency program. Overall findings from the study revealed that students who received the HELPS Program scored significantly better than students in the control group across five of the eight measures of reading, although the differences between students’ performance on the measure of reading comprehension were not statistically significant. Students who received the Great Leaps program did not sig150
nificantly outperform the control group on any of the reading measures. Given the initial findings suggesting that HELPS appears to be an effective program for improving early elementary-aged students’ reading skills, the purpose of the present study was to replicate and extend the initial study in three primary ways. First, this study sought to examine the effects of HELPS when implemented at different frequencies during a given week. In the initial evaluation of HELPS, the program was implemented three times per week. However, it is possible that HELPS would be equally effective if implemented less frequently (e.g., once or twice per week). Therefore, a primary research question in the present study sought to examine the comparative effects of HELPS when implemented three times per week versus an average of 1.5 times per week. Second, this study aimed to evaluate the effects of HELPS when implemented throughout most of school year. Although the Begeny et al. (2010) study demonstrated the effectiveness of HELPS when implemented for only three months, that study did not show statistically significant differences between the HELPS and control group students on the standardized measure of reading comprehension. Because student improvements in reading fluency often improve students’ reading comprehension (Therrien, 2004), additional research is needed to elucidate whether HELPS may improve students’ reading comprehension if implemented over a longer period of time. Lastly, as a means to replicate some of the methodology from the earlier HELPS study, this study likewise included second-grade participants with a full range of reading-ability levels and compared the performance of students receiving HELPS to a wait-list control group of students. The main applied relevance of this study is as follows. First, published research on “packaged” programs (i.e., systematic procedures that include multiple instruction strategies) designed to improve students’ reading fluency is meager (Begeny et al., 2010). As such, there is a scarcity of evidence to support the efficacy of such programs. Therefore, for
school psychologists and other educators to make well-informed decisions about which programs might best improve students’ reading fluency, additional research is needed to evaluate such programs and identify those that are effective. Second, because HELPS appears to be a promising reading fluency program, additional research about this program is needed to further assist educators with understanding how to best use the program. For example, although initial research with HELPS suggests that meaningful effects should occur when implementing it three times per week at the end of a school year, educators could save valuable time and resources if the program could be implemented half as often and still produce the same positive outcomes. Likewise, educators using HELPS would benefit from knowing whether yearlong implementation is more or less advantageous than briefer implementation periods. Method Participants Student participants. At the start of the study (which began in September), consent for student participation was obtained for all but one eligible second-grade student (N ⫽ 101) across six classrooms of one public elementary school (kindergarten through sixth grade) in the southeastern region of the United States. All second-grade students from these classes were eligible for participation unless they were classified by the school as academically and intellectually gifted (N ⫽ 6). For the purposes of this study, 90 eligible students who provided consent were randomly selected to participate. Ninety students were randomly selected because this allowed for an equal and relatively large sample size per group, and research personnel could support no more than 90 total students. Using a randomized block design, 15 students from each of the six classrooms were randomly assigned (5 students each) to one of three conditions. Four of the students from the total sample moved to another school during the project, leaving 86 students who completed all phases of the study. Of these stu-
dents, 29 received HELPS three times per week (i.e., the HELPS-3 group), 29 received HELPS one or two times per week (i.e., the HELPS-1.5 group), and 28 were in a wait-list control group. During the present study, control-group participants received their typical reading instruction (i.e., treatment as usual) until the completion of the study in April. At that point, participants were eligible to receive the HELPS Program. At the beginning of the study, participants ranged in age from 6.83 to 8.67 years, with a mean age of 7.5 years (7 years, 6 months). Forty-five (52.3%) of the participants were female, 61.6% were White, 11.6% were Black, 19.8% were Latino, 1.2% were Asian, and 5.8% were identified as "Other Ethnicity." Eight (9.3%) of the students in the sample had been previously retained in a grade level, and 9 received reading assistance in addition to their regular classroom instruction. Across conditions, χ² analyses revealed no significant differences on any of these demographic variables: sex, χ²(2, N = 86) = 0.03, p = .99; ethnicity, χ²(8, N = 86) = 3.32, p = .91; retained, χ²(2, N = 86) = 0.30, p = .86; extra reading assistance, χ²(2, N = 86) = 0.30, p = .86. Throughout the participants' school, 34% of the students received free or reduced-price lunch and 12% qualified for special education services, but because of school policies, this demographic information could not be obtained for individual students. The core reading instruction across all second-grade teachers in the participating school was similar. All teachers integrated language arts into their daily curriculum for approximately 90 min; each utilized the Houghton Mifflin basal reading series (Cooper & Pikulski, 2004); and each included daily independent reading activities, phonics and vocabulary lessons, writing activities, and small reading groups determined by student reading ability.

Implementation agents (tutors). All HELPS instructional sessions were implemented in a one-on-one (adult–student) format in a quiet hallway outside each participant's
classroom during the morning hours. The lead researcher, 22 undergraduate psychology majors, and four postbaccalaureate volunteers shared HELPS implementation responsibilities throughout the study. Prior to working with student participants, the lead researcher instructed all tutors on HELPS implementation procedures and ensured that each tutor reached mastery criterion according to an implementation protocol. Mastery criterion was set at 100% implementation integrity during two consecutive “practice” sessions (i.e., sessions conducted with other adults) and two consecutive sessions with student participants. In addition, all tutors’ implementation integrity was monitored and recorded regularly throughout the study, using the same procedural protocol used for training purposes. When necessary, implementation support and feedback was provided. Materials Consistent with the Begeny et al. (2010) study, a set of 88 instructional passages ranging in difficulty from the beginning of first grade to the end of fourth grade were used as part of all HELPS implementation procedures. HELPS implementation materials also included (a) a specific implementation protocol for tutors to follow while implementing HELPS procedures, (b) a Progress Tracking Form to facilitate communication between tutors working with the same student across different days, (c) a Student Graph for each HELPS participant (for the purposes of implementing goal setting and performance feedback procedures), (d) a Star Chart for each HELPS participant (for purposes of the motivational reward system), and (e) examiner copies of the instructional passages so tutors could score students’ reading performance during each HELPS instructional session. Additional information about the instructional materials used in this study is described by Begeny et al. (2010). Interested readers can also visit the HELPS Web site (http:// www.helpsprogram.org) to access all instructional materials. 152
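For readers who want to operationalize this style of tutor training and integrity monitoring, the following minimal Python sketch shows one way to score a session-level step checklist and check a mastery-style criterion of consecutive sessions at 100% integrity. The helper names, the 12-step protocol, and the example data are hypothetical illustrations; they are not taken from the HELPS implementation protocol itself.

```python
# Minimal sketch, not part of the HELPS materials: scoring an implementation-
# integrity checklist and checking a mastery-style criterion. The helper names,
# the 12-step protocol, and the example data are hypothetical illustrations.

def integrity_percentage(steps_completed):
    """Percentage of protocol steps implemented correctly in one session."""
    return 100.0 * sum(steps_completed) / len(steps_completed)

def meets_mastery(session_percentages, required_consecutive=2, criterion=100.0):
    """True once the tutor reaches the criterion on the required number of
    consecutive observed sessions (e.g., 100% integrity twice in a row)."""
    streak = 0
    for pct in session_percentages:
        streak = streak + 1 if pct >= criterion else 0
        if streak >= required_consecutive:
            return True
    return False

# Example: one tutor observed across four sessions on a 12-step protocol.
sessions = [
    [True] * 11 + [False],  # one step missed: 91.7%
    [True] * 12,            # 100%
    [True] * 12,            # 100%
    [True] * 12,            # 100%
]
percentages = [integrity_percentage(s) for s in sessions]
print([round(p, 1) for p in percentages])                     # [91.7, 100.0, 100.0, 100.0]
print("Mastery criterion met:", meets_mastery(percentages))   # True
```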
Measures The Gray Oral Reading Test, Fourth Edition (GORT; Wiederholt & Bryant, 2001) was used at pre- and post-test to evaluate participants’ reading growth throughout the study. As shown by Begeny et al. (2010) in a factor analysis of eight separate measures of early reading, and as described by Wiederholt and Bryant (2001), the GORT includes two distinct measures of reading performance: a measure of reading fluency (GORT-Fluency) and a measure of reading comprehension (GORT-Comp.). Performance on the reading fluency and comprehension measure each produce an age-based standard score, which has a mean of 10 and a standard deviation of 3. Because the GORT provides alternate forms for testing, Form A was used during pretest assessments and Form B was used during post-test. GORT-Fluency and GORTComp. meet rigorous standards for reliability and validity and have sufficient evidence for measuring fluency and comprehension (Big Ideas in Beginning Reading, 2005; Florida Department of Education, 2005). For example, coefficient alphas for the GORT-Fluency and GORT-Comp. are at or above .90 for early primary students, and test–retest reliability estimates for these measures range from .85 to .95 (Wiederholt & Bryant, 2001). Procedures Participants in the control group received their typical language arts curriculum throughout the duration of the study and were assessed with the pre- and post-test measures previously described. Unless there was an extenuating circumstance (e.g., school or university holiday, class field trip, student absence), students in the HELPS-3 condition received one instructional session every Monday, Wednesday, and Friday from the beginning of October through mid-April. The most notable exception to the consistent implementation schedule was a four-week gap of time in which tutors were unavailable to deliver instruction because of the mid-December to mid-January university holiday. Students in
the HELPS-1.5 condition received one or two instructional sessions each week, also from the beginning of October through mid-April, with the exception of the winter holiday noted above. More specifically, students in this condition would receive one instructional session in a given week (i.e., on Wednesday), two sessions the following week (i.e., on Monday and Friday), one session during the next week (i.e., on Wednesday), two sessions the next week (i.e., on Monday and Friday), and so forth. Schedules were therefore arranged so that students in the HELPS-1.5 condition received an instructional session every 4 to 5 days. All students receiving the HELPS Program were pulled from class during the morning hours when language arts activities occurred in their respective classrooms. Instructional sessions required approximately 8–10 min to complete. Over the course of the study, HELPS-3 students received an average of 55.9 sessions (range = 54–59; SD = 1.8) and HELPS-1.5 students received an average of 28.2 sessions (range = 27–30; SD = 1.0). HELPS instructional procedures include eight evidence-based strategies shown in previous research to improve students' reading fluency (see, for example, Morgan & Sideridis, 2006; Therrien, 2004). The strategies include repeated reading, modeling, phrase-drill error correction, two verbal cueing procedures, goal setting, performance feedback, and a motivational/reward system. These strategies were integrated into one structured program because (a) research suggests that integrating several fluency-based strategies into one instructional package is typically more effective than implementing one or only a small number of strategies (e.g., Therrien, 2004); and (b) the program offers educators an integrated set of clear, systematic procedures that target students' reading fluency development. Detailed information about HELPS implementation procedures is available in the HELPS Program teacher's manual (Begeny, 2009), which can be accessed with other HELPS instructional materials at the HELPS Web site (http://www.helpsprogram.org).
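To make the dosage contrast between the two schedules concrete, the sketch below generates a Monday/Wednesday/Friday calendar and the alternating one- and two-session weeks described above, then counts the scheduled sessions. The calendar year and the four-week winter break dates are assumptions made only for illustration; sessions actually delivered in the study were fewer because of absences, holidays, and similar interruptions.

```python
# Minimal sketch contrasting the intended dosage of the two schedules described
# above. The calendar year and the break dates are assumptions for illustration.
from datetime import date, timedelta

START, END = date(2008, 10, 1), date(2009, 4, 15)               # assumed study window
BREAK_START, BREAK_END = date(2008, 12, 15), date(2009, 1, 15)  # assumed winter break

def scheduled_days(weekdays):
    """All dates in the window on the given weekdays (0 = Monday), skipping the break."""
    d, out = START, []
    while d <= END:
        if d.weekday() in weekdays and not (BREAK_START <= d <= BREAK_END):
            out.append(d)
        d += timedelta(days=1)
    return out

helps3 = scheduled_days({0, 2, 4})  # Monday, Wednesday, and Friday every week
# HELPS-1.5 alternates weeks: Wednesday only one week, Monday and Friday the next.
helps15 = [d for d in helps3
           if (d.isocalendar()[1] % 2 == 0) == (d.weekday() == 2)]

weeks = (END - START).days / 7
print(f"HELPS-3:   {len(helps3)} scheduled sessions (~{len(helps3) / weeks:.1f} per week)")
print(f"HELPS-1.5: {len(helps15)} scheduled sessions (~{len(helps15) / weeks:.1f} per week)")
```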
Procedural Integrity

To measure procedural integrity, each tutor's implementation of HELPS was regularly observed by another tutor (each of whom was specifically trained in steps for evaluating implementation integrity). Each tutor evaluated another tutor's implementation integrity at least 10 times, as this was a shared responsibility among tutors. Procedural integrity was typically checked in vivo by a second tutor. When in vivo observations were not possible (i.e., <10% of the time), the session was recorded via audiotape and scored by a second tutor. Procedural integrity was evaluated for 381 (15.6%) of the 2,441 instructional sessions. Across tutors, the average percentage of steps followed accurately was 98.7% (SD = 2.79), and this was consistent with both in vivo and audio-taped sessions. The average procedural integrity for each tutor was above 95%.

Results

To ensure there were no differences between groups at the outset of the study, a one-way analysis of variance was computed for GORT-Fluency and then for GORT-Comp. Both analyses yielded nonsignificant differences between the groups: F(2, 83) = 0.17, p = .84, for GORT-Fluency; F(2, 83) = 0.17, p = .85, for GORT-Comp. Next, for each separate measure, a repeated-measures analysis of variance was used to determine the overall differences from pretest to post-test when comparing the three groups. This statistical design used a within-subjects factor of time (i.e., pretest to post-test) by a between-subjects factor of condition (i.e., HELPS-3, HELPS-1.5, and the control group). When finding a significant time × condition interaction, post hoc analyses were computed to evaluate specific differences between each group. For both analyses, the data met the assumptions for using a repeated-measures analysis of variance. For GORT-Fluency, there was a significant time × condition interaction, F(2, 83) = 26.88, p < .001, η² = 0.393. Post hoc analyses were then conducted using Tukey's
Table 1
Reading Measure Means and Standard Deviations (SD) by Group at Pretest and Post-Test

                   HELPS-3 (a)                      HELPS-1.5 (b)                    Control (c)
Measure            Pre            Post              Pre            Post              Pre            Post
GORT-Fluency       9.24 (3.44)    12.31 a>c (3.14)  9.76 (3.00)    12.14 b>c (2.59)  9.46 (3.65)    8.29 (3.09)
GORT-Comp.         9.31 (4.43)    12.38 a>c (2.88)  8.66 (4.18)    10.62 (3.63)      9.00 (4.30)    8.29 (2.52)

Note. HELPS = Helping Early Literacy with Practice Strategies Program; HELPS-3 = students who received HELPS three times per week; HELPS-1.5 = students who received HELPS an average of 1.5 times per week; GORT = Gray Oral Reading Test, Fourth Edition. All values are reported as standard scores. Post hoc t-test statistical significance (using Tukey's HSD to control for multiple comparisons) was set at p < .05. Subscripts denote that groups performed reliably higher.
HSD because this type of test is commonly used and stringently controls for Type I error when multiple comparisons are evaluated. Post hoc analyses for GORT-Fluency revealed statistically significant differences between the HELPS-3 and control group (p < .05) and the HELPS-1.5 and control group (p < .05). For GORT-Comp., there was also a significant time × condition interaction, F(2, 83) = 5.37, p < .01, η² = 0.115. Post hoc comparisons revealed statistically significant differences between the HELPS-3 and control group (p < .05). No other statistically significant differences were found. Table 1 summarizes these findings and shows each group's mean standard score at pre- and post-test across the two measures. Table 2 shows Cohen's (1988) d effect size comparisons for both of the reading measures and across each of the conditions. As is conventionally done with effect size comparisons, and as was done by Begeny et al. (2010), d was computed by subtracting one group's mean change score on the particular measure from another group's mean change score on that same measure, and then dividing that value by the pooled standard deviation of the two respective groups on that measure. For each effect size calculation, pre- and post-test standard scores were used to obtain each group's standard deviation score, and as sug-
gested by Cohen (1988), the pooled standard deviation represented the root mean square of the two standard deviations. Discussion Given the need for additional research to evaluate structured programs designed to improve students’ reading fluency (e.g., see What Works Clearinghouse, 2009), this study sought to extend and replicate the initial study that showed promising effects of the HELPS reading fluency program. Specifically, the present study sought to examine the effects of HELPS with second-grade students when implemented (a) three times per week compared to an average of 1.5 times per week, and (b) throughout most of a school year. A primary finding from this study showed that the students receiving HELPS three times per week significantly outperformed the control group students on the measures of reading fluency and reading comprehension. With both measures, the difference between the groups at post-test exceeded one full standard deviation (i.e., 3 standard score points), suggesting a meaningful practical significance between the groups. Accordingly, the differences between these students’ performance produced large effect sizes. Students receiving the HELPS Program half as frequently (i.e., an average
Table 2
Effect Size Comparisons Between Conditions Across All Reading Measures

                            Conditions Compared
Reading Measure             HELPS-3 vs. Control    HELPS-1.5 vs. Control    HELPS-3 vs. HELPS-1.5
GORT-Fluency                1.21 (L)a              1.11 (L)a                0.21 (S)
GORT-Comprehension          1.00 (L)a              0.71 (M to L)            0.27 (S)

Note. HELPS = Helping Early Literacy with Practice Strategies Program; HELPS-3 = students who received HELPS three times per week; HELPS-1.5 = students who received HELPS an average of 1.5 times per week; GORT = Gray Oral Reading Test, Fourth Edition; L = Large; M = Medium; S = Small. Effect sizes reported as Cohen's d. Magnitude of effect, using recommendations by Cohen (1988), is shown in parentheses.
a A statistically significant difference (p < .05).
of 1.5 times per week) also significantly outperformed the control group students on the measure of reading fluency. However, there was no statistically significant difference between these groups on the measure of reading comprehension. This finding may have been influenced by insufficient statistical power to detect group differences, which represents a limitation of this study. The findings from this study are easily interpretable based upon previous research. For instance, it is unsurprising that students receiving HELPS (regardless of implementation frequency) made meaningful reading improvements compared to students not receiving the program, as HELPS includes the key instructional strategies (e.g., repeated reading, modeling, performance feedback) that have been shown in previous research to improve elementary-aged students’ fluency (Morgan & Sideridis, 2006; National Reading Panel, 2000; Therrien, 2004). Likewise, the effects of HELPS in this study are consistent with meta-analyses showing medium to large effect sizes when multiple fluency-based strategies are combined (National Reading Panel, 2000; Therrien, 2004). However, in contrast to previous research (including most research summarized in meta-analyses on fluency instruction), the HELPS Program offers a unique contribution to this literature base because it offers a structured and systematic procedure for using mul-
tiple evidence-based strategies designed to improve students’ reading fluency. Interpreting the findings from the present study in comparison to the initial HELPS study (i.e., Begeny et al., 2010) also reveals consistent patterns of effects. In the initial study, second-grade students receiving HELPS three times per week, for approximately 3 months, significantly outperformed control group students on the GORT-Fluency measure (at levels highly similar to the differences shown in this study). However, in that initial study, the differences between HELPS and control-group students on the GORT-Comp. measure only approached statistical significance. The findings in this study similarly showed that students who received HELPS less frequently (i.e., an average of 1.5 times per week) outperformed controlgroup students on the measures of reading fluency and comprehension, but the differences in performance on the comprehension measure were not statistically significant. In contrast, students in the present study who received HELPS three times per week over most of a school year significantly and meaningfully outperformed control-group students on both the measure of reading fluency and comprehension. Given the findings from the first two studies that evaluated HELPS, as well as the importance of reading comprehension, the evidence suggests that second-grade students should benefit most by receiving the HELPS 155
Program three times per week (ideally, every Monday, Wednesday, and Friday) for most of the school year. This implementation plan should produce the most meaningful gains in students’ reading fluency and comprehension. With this said, teachers may encounter weeks in a school year when they cannot implement HELPS with a student three times per week. For example, student absences, school holidays, and special classroom activities may sometimes prevent teachers from implementing HELPS three times per week, every week. Fortunately, evidence from the first two studies also suggests that an occasional deviation from the three-time-per-week implementation plan should not substantially decrease the overall benefit of HELPS.
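As a concrete illustration of the change-score effect size reported in Table 2, the following minimal sketch implements the computation described there: the difference between two groups' mean pre-post change divided by a root-mean-square pooled standard deviation. The input values are illustrative placeholders, not the study's actual group statistics.

```python
# Minimal sketch of the change-score effect size used for Table 2: the
# difference between two groups' mean pre-post change divided by a pooled SD
# taken as the root mean square of the two groups' standard deviations.
# The inputs below are illustrative placeholders, not the study's group data.
from math import sqrt

def change_score_d(change_mean_a, change_mean_b, sd_a, sd_b):
    """Cohen's d for the difference in two groups' pre-post change scores."""
    pooled_sd = sqrt((sd_a ** 2 + sd_b ** 2) / 2)
    return (change_mean_a - change_mean_b) / pooled_sd

# Example: one group gains 3.0 standard-score points while the comparison
# group loses 1.0, and both groups have standard deviations near 3.3.
print(round(change_score_d(3.0, -1.0, 3.3, 3.3), 2))  # 1.21, a large effect
```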
Study Limitations

One limitation to this study was that HELPS was implemented in only one school. To better understand the effectiveness of this program, future research should evaluate HELPS in multiple schools and with students in other grades. Although second grade is an ideal time to promote students' fluency development, end-of-first-grade and third-grade students should also benefit from this program as a supplement to core reading instruction (e.g., see Begeny, 2009). Other limitations of this study include implementation factors. For example, in this study HELPS was implemented by university students and volunteers, rather than school-based staff. Although there is initial evidence that school-based professionals can implement HELPS with integrity and success (see Begeny, 2009), research reports and continued studies are needed to clarify the effects of HELPS when implemented by classroom teachers and support staff (e.g., school psychologists, reading specialists, teacher assistants). Also related to implementation feasibility, a classroom teacher's one-on-one implementation of HELPS for an entire classroom of students is likely unfeasible in most schools. Thus, future work is needed to identify practical options for efficiently implementing HELPS in a one-on-one context, or possibly even implementing it in a small-group context. This study is also limited because it did not specifically evaluate the effects of HELPS with low-performing readers. This is important because, as noted above, schools with even modest levels of staff resources may not be able to implement HELPS three times per week with all students. As such, educators may be more likely to implement HELPS with a subset of students who need the most assistance with fluency development. Relevant to each of the aforementioned limitations, the HELPS Program teacher's manual (Begeny, 2009) provides some guidance and commentary for educators.

Potential Implications for Practice
There are meaningful implications of the present study for school psychologists, particularly when one considers that school psychologists in the United States receive more referrals for students experiencing reading difficulties than any other school-based concern (Bramlett, Murphy, Johnson, Wallingsford, & Hall, 2002). For instance, HELPS appears to be an effective fluency-based program that can be (a) implemented with early elementary-aged students of all reading-ability levels, (b) learned and implemented with a feasible amount of training and supervision, and (c) accessed for free by all educators. Therefore, when consulting with teachers, school psychologists might recommend HELPS to teachers who want to improve students’ reading fluency and otherwise do not have the materials and/or strategies to do so. Furthermore, as a means to better facilitate reading instruction services in schools that have limited resources, school psychologists might consider training other educators (e.g., special education teachers, teacher assistants, and reliable school volunteers) on HELPS implementation procedures (in addition to training classroom teachers). As noted previously, the structured and systematic nature of the HELPS Program should aid in implementation feasibility across multiple types of educators, and use of the program by
multiple educators within a school should therefore increase implementation capacity throughout the school. Interested readers should review the HELPS Program teacher’s manual (Begeny, 2009) for additional information and materials related to learning HELPS and training others to use this program. References Armbuster, B., Lehr, F., & Osborn, J. (2001). Put reading first: The research building blocks for teaching children to read (kindergarten through Grade 3). Washington, DC: National Institute for Literacy. Begeny, J. C. (2009). Helping Early Literacy with Practice Strategies (HELPS): A one-on-one program designed to improve students’ reading fluency. Raleigh, NC: The HELPS Education Fund. Retrieved from http://www.helpsprogram.org Begeny, J. C., Laugle, K. M., Krouse, H. E., Lynn, A. E., Parker, M., & Stage, S. A. (2010). A control-group comparison of two reading fluency programs: The Helping Early Literacy with Practice Strategies (HELPS) program and the Great Leaps K–2 reading program. School Psychology Review, 39, 137–155. Big Ideas in Beginning Reading. (2005). Analysis of Reading Assessment Instruments for K–3. Retrieved December 1, 2005, from http://reading.uoregon.edu/ assessment/analysis.php Bramlett, R. K., Murphy, J. J., Johnson, J., Wallingsford, L., & Hall, J. D. (2002). Contemporary practice in school psychology: A national survey of roles and referral problems. Psychology in the Schools, 39, 327– 335. Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum. Cooper, J. D., & Pikulski, J. J. (2004). Houghton Mifflin reading series. Orlando, FL: Houghton Mifflin Harcourt. Daane, M. C., Campbell, J. R., Grigg, W. S., Goodman, M. J., & Oranje, A. (2005). Fourth-grade students reading aloud: NAEP 2002 Special Study of Oral Reading. U.S. Department of Education, National Cen-
ter for Education Statistics. Washington, DC: Government Printing Office. Florida Department of Education. (2005). Primary and secondary diagnostic instruments. Retrieved November 6, 2007, from http://justreadflorida.com/educators/ PrimSecDiagChart.asp?style⫽normal Fuchs, L. S., Fuchs, D., Hosp, M. K., & Jenkins, J. R. (2001). Oral reading fluency as an indicator of reading competence: A theoretical, empirical, and historical analysis. Scientific Studies of Reading, 5, 239 –256. Individuals With Disabilities Education Improvement Act of 2004. Pub. L. No. 108 – 446, 20 U.S.C. 1400 et seq. (2004). McGlinchey, M. T., & Hixson, M. D. (2004). Using curriculum-based measurement to predict performance on state assessments in reading. School Psychology Review, 33, 193–203. Mercer, C. D., & Campbell, K. U. (1998). Great leaps reading program, kindergarten–2nd grade. Gainesville, FL: Diarmuid. Morgan, P. L., & Sideridis, P. D. (2006). Contrasting the effectiveness of fluency interventions for students with or at risk for learning disabilities: A multilevel random coefficient modeling meta-analysis, Learning Disabilities Research & Practice, 21, 191–210. National Reading Panel. (2000). Report of the National Reading Panel: Teaching children to read: An evidence based assessment of the scientific research literature on reading and its implications for reading instruction: Reports of the subgroups (NIH Publication No. 00 – 4754).Washington, DC: U.S. Government Printing Office. Therrien, W. J. (2004). Fluency and comprehension gains as a result of repeated reading: A meta-analysis. Remedial and Special Education, 25, 252–261. Wiederholt, J. L., & Bryant, B. R. (2001). Gray Oral Reading Test (4th ed.). Austin, TX: Pro-Ed. What Works Clearinghouse. (2009). Beginning reading. Retrieved August 30, 2009, from http://ies.ed.gov/ ncee/wwc/reports/beginning_reading/topic/tabfig. asp#tbl2
Date Received: February 1, 2010
Date Accepted: October 18, 2010
Action Editor: Thomas J. Power
Article was accepted by previous Editor.
John C. Begeny is an assistant professor at North Carolina State University, and his current research examines methods to improve children’s reading abilities, strategies to narrow the gap between research and practice, and international education. He has received several grants for his teaching and research activities, including grants to improve literacy development for children living in low-income communities nationally and internationally. As part of The Guilford Press School Practitioner Series, he is currently writing a book intended to help educators use academic consultation in schools.
School Psychology Review, 2011, Volume 40, No. 1, pp. 158 –167
RESEARCH BRIEF Determining an Instructional Level for Early Writing Skills David C. Parker, Kristen L. McMaster, and Matthew K. Burns University of Minnesota Abstract. The instructional level is helpful when identifying an intervention for math or reading, but researchers have yet to investigate whether the instructionallevel concept can be applied to early writing. The purpose of this study was to replicate and extend previous research by examining technical features of potential instructional-level criteria for writing. Weekly writing performance was assessed with 85 first-graders over 12 weeks using Picture-Word and Sentence Copying prompts. Data from the students with the highest slopes were used to derive instructional level criteria. Several scoring procedures across Picture-Word and Sentence Copying prompts produced reliable alternate-form correlations and statistically significant relationships with a standardized writing assessment. Determining an instructional level in writing appears feasible; however, further research is needed to examine the instructional utility of this approach.
Writing skills are essential for satisfactory academic progress during kindergarten through twelfth-grade education and for later vocational success (Graham & Perin, 2007), but writing problems often go undetected until late elementary or middle school, when they become increasingly difficult to remediate (Baker, Gersten, & Graham, 2003). Early identification and intervention are critical for preventing the long-term negative consequences of persistent writing problems
(Berninger, Nielsen, Abbott, Wijsman, & Raskind, 2008). Given that previous research has established that early writing skills (e.g., transcription skills) are related to compositional fluency (Graham, Berninger, Abbott, Abbott, & Whitaker, 1997), the successful development of these skills might establish a foundation on which to develop later writing skills, contributing to the reverse of the national trend in poor writing outcomes (Salahu-Din, Daane, & Persky, 2008).
This research was supported in part by Grant H324H030003 awarded to the Institute on Community Integration and the Department of Educational Psychology, College of Education and Human Development, at the University of Minnesota, by the Office of Special Education Programs in the U.S. Department of Education. Correspondence regarding this article should be addressed to David Parker, 250 Education Sciences Building, 56 East River Rd., Minneapolis, MN 55455; e-mail:
[email protected]
The learning hierarchy (Haring & Eaton, 1978) is an intervention heuristic that could guide intervention development for writing problems. In using the learning hierarchy, interventions are identified by matching student skill with one of four phases of student learning (Haring & Eaton, 1978). First, interventions focus on skill accuracy (the acquisition phase) through high modeling and frequent cueing. Next, interventions are targeted to enhance the speed with which the skill is performed (fluency phase) through additional practice and contingent reinforcement. Once a student can accurately and fluently exhibit the skill, efforts can focus on the later phases of maintenance and generalization. Although research supports the learning hierarchy as an intervention heuristic (Burns, Codding, Boice, & Lukito, 2010), it remains unknown as to when the intervention focus should change. The instructional level is a potential criterion that could be used to identify whether the intervention should focus on acquisition, fluency, or maintenance/generalization. Gickling and Armstrong (1978) operationally defined the instructional level for reading as material in which the student could read 93%– 97% of the words. Reading less than 93% of the words represented a frustration level and exceeding 97% was an independent level. Researchers have found that task completion, task comprehension, and time on task increased when instructional level material was used (Gickling & Armstrong; Treptow, Burns, & McComas, 2007). Moreover, students provided with an acquisition intervention (i.e., high modeling and cueing) to facilitate an instructional level in reading experienced increased and sustained growth over a period of 15 weeks compared to a randomly assigned control group (Burns, 2007). Math intervention research also supports the instructional level as a decision-making criterion. Burns, VanDerHeyden, and Jiban (2006) empirically derived instructional-level criteria for math by computing slopes of growth and finding the mean baseline score for students with the highest growth. Students who did not experience high growth rates may have (a) demonstrated proficient skills before
instruction (i.e., an independent level) and had little room to grow, or (b) started too low (i.e., at a frustration level) and the instruction did not adequately address their learning needs (Burns et al., 2006). Analysis of a subset of the data used for validation showed that students who had initial math scores within the instructional level made the greatest gains over time (Burns et al. ). Recent meta-analytic research that used the Burns et al. (2006) instructionallevel criteria found stronger effects for acquisition interventions (modeling and cueing) provided for students with frustration-level skill than for those whose baseline performance represented an instructional level (Burns et al., 2010). Identifying an instructional level for early writing skills could help interventionists determine the type of intervention struggling writers need (e.g., modeling vs. practice). To determine a student’s instructional level in other academic domains, the interventionist assesses student skill within a specific domain by sampling the behavior for a short time period (e.g., 1– 4 min) from a specific instructional stimulus (e.g., a passage that will be read as part of instruction, or a single-skill math probe such as single-digit multiplication), and then computes either the percentage of the words read correctly for reading or the rate at which the skill was completed for math (e.g., digits correct per minute). The resulting data are sufficiently reliable for instructional decisions in reading and math (Burns et al., 2000, 2006), but less is known about the adequacy of assessments for instructional decision making in writing. One assessment approach that has shown promise for yielding technically sound data for instructional decision making in writing is curriculum-based measurement (CBM; Deno, Mirkin, & Marston, 1982), which might serve as an approach for identifying students’ instructional level in writing. CBM for writing typically consists of brief prompts to which students respond for 3–5 min that are scored for the number of words written (WW), words spelled correctly (WSC), or correct word sequences (CWS; Videen, Deno, & Marston, 1982). Several CBM options for assessing stu159
dent progress provide information about early writing skills that are important for later writing development (Graham et al., 1997). For example, McMaster, Du, & Petusdottir (2009) found CBM using sentence copying or pictures with words produced promising technical characteristics, and subsequent research showed these measures were sensitive to growth over time (McMaster et al., 2011). Whereas previous research has established that early writing CBM is promising for measuring progress in early writing, less is known about which measures and assessment procedures can be used to directly inform intervention. Given that acquisition interventions at the instructional level correspond to positive learning outcomes in other academic domains, research is needed to determine whether an instructional level can be identified for writing. The purpose of the current study was to replicate Burns and colleagues’ (2006) study of instructional level in math by applying their methods to writing. Specifically, we examined the reliability and criterion validity of potential instructional-level estimates for beginning writing based on different types of prompts and scoring procedures. The following research questions guided the study: (a) To what extent do different prompts and scoring procedures affect the reliability of writing assessment data? (b) To what extent do different prompts and scoring procedures affect estimates of the criterion validity of writing assessment data? (c) What mean initial writing scores are linked to the highest rates of growth during weekly progress monitoring of writing? (d) How reliable are instructional-level categories based on empirically derived criteria? (e) How well do instructional-level categories relate to a standardized measure of writing? Method Setting and Participants Data for this study were drawn from a larger study of the technical features of slopes produced from weekly administered early writing CBM prompts (McMaster et al., 2011). The larger study was conducted in a Midwestern urban school district with five 160
first-grade classrooms from two schools that were selected by convenience sampling. As described in the larger study, all consenting students participated, resulting in 85 students (51% male). Forty-one percent were White, 28% Black, 26% Hispanic, 2% Native American, and 2% Asian American. Fifty-seven percent were eligible for the federal free or reduced-price lunch program, 21% were English learners, and 17% received special education. Measures CBM tasks. Participants in the larger study completed several CBM tasks on a weekly basis for 12 weeks. In the present study, we examined data from two tasks completed each week that were informed by the current understanding of early writing development (i.e., that transcription skills play a critical role in writing development; Graham et al., 1997) and examined for use with firstgrade students (McMaster et al., 2009). Sentence Copying consisted of packets of eight pages, with three sentences on each page. Participants were instructed to copy an example sentence at the top of the first page (e.g., “We have one cat.”). Then, they were instructed to copy the remaining sentences and to stop after 3 min. Alternate-form reliability on Sentence Copying using the current scoring procedures ranged from r ⫽ .63 to .80, and criterion validity with a standardized normreferenced writing measure ranged from r ⫽ .23 to .50 (McMaster et al., 2011). Picture-Word prompts consisted of words with a picture above each word. Participants wrote a sentence using the word provided. Before the task, the examiner drew a picture (e.g., a tree) on the board and wrote the word underneath. Then, the examiner asked the students to generate sentences using the word. After allowing the students to practice, the examiner instructed them to write as many sentences as they could using the words and pictures on their probes. After 3 min, the examiner instructed participants to stop. Alternate-form reliability on Picture Word using the current scoring procedures ranged from
r = .70 to .77, and criterion validity ranged from r = .23 to .54 (McMaster et al., 2011). Writing samples were scored using words written, words spelled correctly, and correct word sequences. A word was defined as at least two letters written in sequence, or single-letter words such as "I" and "a" (Deno, Mirkin, & Marston, 1980). Words were judged as spelled correctly by scoring them as a computer would score them (i.e., syntax and semantics were not taken into account), and were computed by subtracting the number of all words that were judged as being spelled incorrectly from the total words written. A correct word sequence was defined as any two adjacent, correctly spelled words that are acceptable within the context of the sample to a native speaker of English (Videen et al., 1982).

Test of Written Language—3rd Edition (TOWL-3). The TOWL-3 (Hammill & Larsen, 1996) is a comprehensive test of written language designed for students from 7 years to 17 years 11 months of age. The Spontaneous Writing subtest (Form A) was group administered to all participants. Students were presented with a picture of astronauts, space ships, and construction activity; asked to plan a story about the picture; and then write as much as they could in 15 min. Writing samples were scored based on Contextual Conventions (capitalization, punctuation, and spelling), Contextual Language (quality of vocabulary, sentence construction, and grammar), and Story Construction (e.g., quality of plot, prose, character development, and interest). Alternate-form reliabilities for Spontaneous Writing for 7-year-olds ranged from r = .60 to .87, and the measure is reported to correlate well with other standardized writing measures (r = .50; Hammill & Larsen, 1996).

Procedures

All writing prompts were administered by a trained graduate student during Week 1 and by the classroom teachers thereafter. Fidelity observations indicated that measures were administered with high levels of accuracy. Scorers included the first and second
authors, four additional graduate research assistants (all special education or school psychology students), and one special education teacher. All scorers had scoring experience on previous CBM projects and received training for this project. Mean interrater agreement was 95% for both CBM prompt types and 88% for the TOWL-3 Spontaneous Writing score. See McMaster et al. (2011) for a complete description of prompt administration, scoring, and interrater agreement. Data Analyses The first step was to compute reliability coefficients for both fluency and accuracy metrics of writing performance. Fluency metrics consisted of the total scores students received for the two prompt types using WW, WSC, and CWS (e.g., 25 CWS on the PictureWord prompt). Accuracy metrics were computed for the two prompt types by dividing the total number of WSC by the total number of words written, and the total number of CWS by the total number of word sequences. For example, a student who produced 25 WSC out of a total of 30 words on Sentence Copying would have an accuracy score of 83%. Fluency and accuracy scores for Weeks 2 and 3 were then correlated using Pearson product moment correlation coefficients (the first week of data were omitted because of potential task novelty). Next, data that produced acceptable reliability coefficients were evaluated for criterion-related validity by correlating Sentence Copying and Picture-Word scores at Week 2 with the age-based total of the TOWL-3 standard scores using a Pearson product moment correlation. Fluency data were subsequently converted to categories of frustration, instructional, and independent levels. To make these conversions, slopes of growth over Weeks 2–12 of progress monitoring were computed using ordinary least-squares regression, which represented average weekly growth in WW, WSC, and CWS for each student. Next, we identified students whose slopes of growth equaled or exceeded the 66th percentile of all the slopes within the sample (as in Burns et al., 161
Table 1
Means, Standard Deviations, and Correlation Coefficients for Fluency and Accuracy Scores for Sentence Copy and Picture-Word Prompts and Accompanying Scoring Procedures

                                        Fluency                              Accuracy
                              Probe 2         Probe 3              Probe 2         Probe 3
Prompt/Procedure              M      SD       M      SD      r     M      SD       M      SD      r
Picture-Word
  Words written               17.0   8.4      18.4   8.6    .71*
  Words spelled correctly     13.4   7.7      15.0   8.6    .67*   76.1   23.4     77.8   24.6   .52*
  Correct word sequences      11.9   8.6      13.1   9.1    .67*   54.2   28.6     55.9   26.5   .46*
Sentence Copy
  Words written               16.7   7.1      16.8   7.7    .71*
  Words spelled correctly     12.8   6.6      13.3   7.1    .74*   74.6   25.1     78.8   19.7   .60*
  Correct word sequences      11.9   7.3      12.6   8.1    .70*   59.8   29.6     64.6   26.7   .56*

*p < .01.
Finally, the mean score on each probe type among the group of students with high rates of growth was considered an estimate of an instructional level for beginning writing. Categories were subsequently created by establishing a range for the instructional-level scores, which was determined by computing the standard error (SE) of the mean. Scores that exceeded the mean by more than two SEs were considered to be at an independent level, and scores that fell more than two SEs below the mean were considered to be at a frustration level. This process was repeated for the data from Week 3, and kappa coefficients were computed for agreement between the data from each week. The criterion-related validity of the categorical data was then evaluated by computing Spearman rho correlation coefficients between the total standard scores on the TOWL-3 and the categories produced by Week 2 data for each of the scoring procedures and prompts.
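As a concrete illustration of the slope and cutoff steps just described, the following sketch fits an ordinary least-squares line to each student's weekly scores and flags students whose growth is at or above the 66th percentile. The data and names are hypothetical; this is a sketch of the described procedure rather than the authors' analysis code.

```python
import numpy as np

# Hypothetical weekly CWS totals for Weeks 2-12 of progress monitoring.
weeks = np.arange(2, 13)
student_scores = {
    "s01": [10, 11, 11, 13, 14, 14, 16, 17, 17, 19, 20],
    "s02": [12, 12, 13, 13, 12, 14, 14, 15, 15, 15, 16],
    "s03": [8, 8, 9, 9, 10, 10, 10, 11, 11, 12, 12],
    # ... one entry per student
}

# Ordinary least-squares slope = estimated average weekly growth per student.
slopes = {sid: np.polyfit(weeks, scores, 1)[0]
          for sid, scores in student_scores.items()}

# High-growth group: slopes at or above the 66th percentile of all slopes.
cutoff = np.percentile(list(slopes.values()), 66)
high_growth = [sid for sid, slope in slopes.items() if slope >= cutoff]
print(f"66th percentile slope = {cutoff:.2f}; high-growth students: {high_growth}")
```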
Results

To address the first research question, regarding the reliability of prompts and scoring procedures, delayed alternate-form reliability coefficients for each scoring procedure under each prompt were computed for both fluency and accuracy metrics; they are reported in Table 1. Correlation coefficients for scores on the fluency metric approached or exceeded .70 for each scoring procedure under both prompt types, whereas the coefficients for scores on the accuracy metrics were at or below .60, suggesting less reliable scores for the accuracy measures. Reliability coefficients at or above .70 are acceptable for programs of research that are in early stages (Nunnally & Bernstein, 1994), but coefficients below .70 are less interpretable. For that reason, subsequent analyses excluded the accuracy data and included only the fluency measures.

To address the second research question, regarding the criterion validity of prompts and scoring procedures, the fluency scores for each procedure and prompt were correlated with the TOWL-3 Spontaneous Writing standard score total. The results are presented in the second column of Table 2. With the exception of WW for Sentence Copying prompts, each of the scoring procedures was significantly correlated with the TOWL-3 total standard scores at the more conservative alpha level of .01. Significant correlations ranged from r = .42 for WSC on the Sentence Copying prompt to r = .52 for CWS on the Picture-Word prompt.
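Both the reliability and the criterion-validity coefficients reported here are Pearson product-moment correlations between paired scores. A brief sketch with invented scores is shown below; the variable names are hypothetical.

```python
from scipy.stats import pearsonr

# Hypothetical paired scores for the same students.
cws_week2   = [11, 18, 9, 22, 14, 7, 16, 12]
cws_week3   = [13, 20, 8, 24, 15, 9, 18, 11]
towl3_total = [92, 108, 85, 115, 99, 81, 104, 95]

# Delayed alternate-form reliability: Week 2 vs. Week 3 fluency scores.
r_rel, p_rel = pearsonr(cws_week2, cws_week3)

# Criterion-related validity: Week 2 fluency scores vs. TOWL-3 standard scores.
r_val, p_val = pearsonr(cws_week2, towl3_total)

print(f"reliability r = {r_rel:.2f} (p = {p_rel:.3f})")
print(f"criterion validity r = {r_val:.2f} (p = {p_val:.3f})")
```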
Table 2
Criterion-Related Validity Coefficients Between Scoring Procedures for Each Prompt and the TOWL-3 Total Score

                                 Correlation with TOWL-3 Total
Prompt/Procedure                    r          Category (ρ)
Picture-Word
  Words written                    .32*           .36*
  Words spelled correctly          .48*           .46*
  Correct word sequences           .52*           .50*
Sentence Copy
  Words written                    .26            .21
  Words spelled correctly          .42*           .46*
  Correct word sequences           .46*           .48*

Note. TOWL-3 = Test of Written Language—3rd Edition. *p < .01.
Validity was further explored by examining concurrent (i.e., within Week 2) and predictive (i.e., between Weeks 2 and 3) correlations of the same scoring procedures across the Sentence Copying and Picture-Word prompts. All coefficients for these analyses were significant and moderate, ranging from r = .42 to r = .69.

The third research question addressed whether an instructional level could be estimated for beginning writing. Given the acceptable technical adequacy of the fluency-based measures, instructional-level estimates were computed for each of the scoring procedures and prompts. These estimates are based on the initial performance of the students whose slopes were at or above the 66th percentile. The slopes that represented the 66th percentile for the Picture-Word prompts were .87 for WW, .96 for CWS, and .84 for WSC. For Sentence Copying, the slopes that represented the 66th percentile were .97 for WW, .96 for CWS, and .90 for WSC.
Table 3
Derivation of and Estimates for Fluency Instructional-Level Criteria for Scoring Procedures Within Prompt Types

                                   Fluency Raw                 Fluency Criteria
Prompt/Procedures             Mean      SD       SE            (3-min Probe)
Picture-Word
  Words written               14.46     8.68     1.64          11–18
  Words spelled correctly     11.43     7.03     1.33           9–14
  Correct word sequences      10.93     8.56     1.62           8–14
Sentence Copy
  Words written               16.25     6.58     1.24          14–19
  Words spelled correctly     13.39     6.52     1.23          11–16
  Correct word sequences      13.32     7.86     1.49          10–16
Instructional-level criteria were computed by finding the mean score from Weeks 2 and 3 for this high-growth group and building a two-SE range around those data (see Table 3). Twenty-eight (34.6%) of the students were classified as high responders using each scoring procedure for both prompt types.

To answer the fourth research question, the fluency writing data were converted to the categories of frustration, instructional, and independent levels. Reliability of the categorical data was estimated using Cohen's (1960) kappa coefficient, which corrects for chance agreement. The number and percentage of students scoring at the frustration, instructional, and independent levels of difficulty are presented in Table 4 along with the kappa coefficients. Kappa coefficients for the categories using Picture-Word data were all .46, and those using Sentence Copying data ranged from .37 (WW) to .47 (CWS). This indicates that the categorical agreements across Weeks 2 and 3 were 46% above chance for the scoring procedures for the Picture-Word data and 37% to 47% above chance for the Sentence Copying data.
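To make the derivation of the Table 3 criteria and the Table 4 kappa values concrete, the sketch below builds a two-SE range around the mean score of the high-growth group, assigns levels, and computes Cohen's (1960) kappa between two probes. The data and names are hypothetical; only the arithmetic mirrors the procedure described above (e.g., 14.46 ± 2 × 1.64 reproduces the 11–18 criterion for Picture-Word WW).

```python
import numpy as np

def instructional_range(high_growth_scores):
    """Mean +/- 2 standard errors of the mean, as in Table 3."""
    scores = np.asarray(high_growth_scores, dtype=float)
    se = scores.std(ddof=1) / np.sqrt(len(scores))
    return scores.mean() - 2 * se, scores.mean() + 2 * se

def categorize(score, low, high):
    """Frustration below the range, instructional within it, independent above it."""
    if score < low:
        return "frustration"
    if score > high:
        return "independent"
    return "instructional"

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two sets of categorical judgments."""
    cats = set(labels_a) | set(labels_b)
    n = len(labels_a)
    p_obs = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    p_chance = sum((labels_a.count(c) / n) * (labels_b.count(c) / n) for c in cats)
    return (p_obs - p_chance) / (1 - p_chance)

# Hypothetical use: derive the criterion range from the high-growth group's scores,
# then categorize each student's Week 2 and Week 3 scores and check agreement.
low, high = instructional_range([14, 9, 21, 17, 12, 16, 11, 19])
week2 = [8, 13, 22, 15, 5, 17]
week3 = [10, 12, 16, 14, 7, 19]
cats2 = [categorize(s, low, high) for s in week2]
cats3 = [categorize(s, low, high) for s in week3]
print(f"criterion range: {low:.1f}-{high:.1f}; kappa = {cohens_kappa(cats2, cats3):.2f}")
```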
Table 4
Number and Percentage of Fluency Scores Categorized as Frustration, Instructional, and Independent and Kappa Coefficients

                                           Probe 2                                           Probe 3
                           Frustration  Instructional  Independent     Frustration  Instructional  Independent
Prompt/Procedure             N     %       N     %       N     %         N     %       N     %       N     %      Kappa
Picture-Word
  Words written             19   23.8     19   23.8     42   52.5       21   25.3     15   18.1     47   56.6      .46*
  Words spelled correctly   24   30.0     18   22.5     38   47.5       23   27.7     12   14.5     48   57.8      .46*
  Correct word sequences    30   37.5     20   25.0     30   37.5       29   34.9     17   20.5     37   44.6      .46*
Sentence Copy
  Words written             24   29.6     30   37.0     27   33.3       23   28.8     24   30.0     33   41.2      .37*
  Words spelled correctly   29   35.8     24   29.6     28   34.6       29   36.2     28   35.0     23   28.8      .46*
  Correct word sequences    32   39.5     25   30.9     24   29.6       31   38.8     20   25.0     29   36.2      .47*

*p < .01.
Finally, to answer the fifth research question, the categorical data from Week 2 were correlated with the TOWL-3 standard score total using Spearman's rho; these coefficients are presented in the third column of Table 2. As with the continuous fluency data, the correlations were significant (p < .01), with the exception of WW for the Sentence Copying prompt, and CWS resulted in correlation coefficients at or near .50 for both the Picture-Word (ρ = .50) and Sentence Copying (ρ = .48) prompts.
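Because the frustration, instructional, and independent levels are ordinal, the coefficients in the third column of Table 2 are rank-based. A minimal sketch of such a computation, with invented data and hypothetical names, codes the levels 0–2 and correlates them with TOWL-3 standard scores.

```python
from scipy.stats import spearmanr

# Hypothetical Week 2 levels (0 = frustration, 1 = instructional, 2 = independent)
# and TOWL-3 standard score totals for the same students.
levels_week2 = [0, 1, 2, 1, 0, 2, 1, 2]
towl3_total  = [84, 96, 112, 101, 88, 118, 93, 107]

rho, p = spearmanr(levels_week2, towl3_total)
print(f"Spearman rho = {rho:.2f} (p = {p:.3f})")
```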
Discussion

Results of this study suggest that WSC and CWS for Sentence Copying, as well as all scoring procedures for the Picture-Word prompts, were sufficiently reliable and statistically significantly correlated with a standardized measure of writing. Similar to previous research in math (Burns et al., 2006), acceptable coefficients were found for fluency but not accuracy scores. Overall, correlations were similar to reliability and validity findings from previous research on CBM for beginning writers (Coker & Ritchey, 2010; McMaster et al., 2009). Our findings also suggest that these data can be converted to instructional-level categories that are reliable and significantly related to scores on a standardized measure of writing.

Although these findings provide a method to assess the instructional level for early writers, discussion is warranted with regard to how the instructional-level concept applies to writing. In keeping with the learning hierarchy (Haring & Eaton, 1978), frustration-level scores might indicate that a student requires additional modeling in the skill being taught, whereas scores within the instructional level might suggest that the student is ready for fluency-building activities in that skill. In reading and math, it is possible to manipulate the difficulty of the material, whereas the task for writing is to produce an entirely new product on blank paper. Thus, the instructional level for writing may depend more on the skills of the learner than on the difficulty of the required task. This study was the first step in a line of inquiry designed to understand the
instructional level as related to writing, and additional instructional implications should be determined through future research. For example, as currently conceptualized, the instructional level might serve as a heuristic for determining at what stage of the instructional hierarchy (Haring & Eaton, 1978) a student's skills fall, which suggests that future research should examine the effectiveness of early writing interventions in a way similar to what has been done in math (Burns et al., 2010). Moreover, other types of CBM prompts, such as spelling (a critical component of transcription; Berninger & Amtmann, 2003), might offer a curricular domain in which the difficulty of the material could drive application of the instructional level, much as reading material is manipulated to correspond to students' instructional levels (Gickling & Armstrong, 1978).

The findings and implications of this study should be considered in light of the limitations of the data. First, the current sample of students came from two schools in a single district with a relatively high proportion (21%) of students who were classified as English language learners. Thus, the degree to which the results would apply to students in other locations with other characteristics and curricula is unknown, and the overall rates of growth from which instructional levels were computed may have been affected by growth characteristics that are specific to English language learners. Moreover, this study included only first-graders, and previous research in math found different instructional levels across grades (Burns et al., 2006).

A second limitation is the somewhat arbitrary designation of high and low growth rates used to determine the instructional level. The current procedure used the top third of all slopes to identify students at the instructional level because that was the approach used in previous research (Burns et al., 2006). Other approaches, such as using the median slope (cf. Vellutino et al., 1996), could be selected and might lead to different estimates of instructional-level ranges. Further research could be conducted to identify cutoffs that yield the most precise instructional-level estimates. Moreover, different scoring procedures yielded different numbers of students in
each of the instructional-level categories. For instance, for the Picture-Word prompt types, WW, WSC, and CWS resulted in 19, 24, and 30 students, respectively, at the frustration level. These scoring procedures assess different skills, and thus future research would be necessary to identify which of the procedures results in the most useful instructional-level criteria. Related to this point, the criteria derived in this study suggest three categories into which students' skills can fall, but such a fine-grained evaluation of student skills may not be necessary for early writing, and perhaps a dichotomous interpretation of the instructional level (i.e., below versus at or above) would sufficiently suggest the need for different interventions.

Third, whereas the kappa coefficients calculated to determine the reliability of the categorical data were statistically significant, they were modest. It is possible that students who were at the upper limits of the instructional-level criteria made sufficient gains between assessments that they moved to the independent level; this possibility could be explored in future studies by conducting direct evaluations of classification accuracy. Further research is also needed to determine whether other indices of student writing yield more reliable classifications. Related to this point, whereas the criterion-related validity coefficients were statistically significant and similar to those generally found in the writing literature (cf. McMaster & Espin, 2007), they are not strong and thus prevent firm conclusions regarding criterion validity. This problem is not new to writing assessment research, given the complex, multidimensional nature of writing (cf. Tindal & Hasbrouck, 1991). Because of these limitations in the technical adequacy of writing measures, future researchers may wish to continue to investigate various writing tasks, durations, and scoring procedures to identify more precise estimates of writing skill. One direction for future research might be to investigate the influence of the amount of time allowed for completion of fluency-based writing assessments, and whether increasing the duration of writing
samples would yield more technically sound and instructionally useful data.

In light of the above limitations, the current study should primarily be used as a heuristic for empirically deriving instructional levels for writing; further work is needed to determine whether using CBM, or perhaps other skill-based writing assessments, to identify instructional levels for writing is an instructionally useful approach. To validate the utility of identifying an instructional level in writing, instruction or intervention would need to be manipulated to occur within and outside of the instructional-level ranges suggested by this study. Results indicating the strongest outcomes for students taught at their instructional level would support the applicability of the instructional level to writing. The current design did not permit this analysis, but the results suggest the value of future efforts to do so.

References

Baker, S., Gersten, R., & Graham, S. (2003). Teaching expressive writing to students with learning disabilities: Research-based applications and examples. Journal of Learning Disabilities, 36, 109–123.

Berninger, V., & Amtmann, D. (2003). Preventing written expression disabilities through early and continuing assessment and intervention for handwriting and/or spelling problems: Research into practice. In H. L. Swanson, K. Harris, & S. Graham (Eds.), Handbook of learning disabilities (pp. 345–363). New York: Guilford Press.

Berninger, V. W., Nielsen, K. H., Abbott, R. D., Wijsman, E., & Raskind, W. (2008). Writing problems in developmental dyslexia: Under-recognized and undertreated. Journal of School Psychology, 46, 1–21.

Burns, M. K. (2007). Reading at the instructional level with children identified as learning disabled: Potential implications for response-to-intervention. School Psychology Quarterly, 22, 297–313.

Burns, M. K., Codding, R. S., Boice, C. H., & Lukito, G. (2010). Meta-analysis of acquisition and fluency math interventions with instructional and frustration level skills: Evidence for a skill by treatment interaction. School Psychology Review, 39, 69–83.

Burns, M. K., Tucker, J. A., Frame, J., Foley, S., & Hauser, A. (2000). Interscorer, alternate-form, internal consistency, and test-retest reliability of Gickling's model of curriculum-based assessment for reading. Journal of Psychoeducational Assessment, 18, 353–360.

Burns, M. K., VanDerHeyden, A. M., & Jiban, C. L. (2006). Assessing the instructional level for mathematics: A comparison of methods. School Psychology Review, 35, 401–418.
Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20, 37–46.

Coker, D. L., & Ritchey, K. D. (2010). Curriculum-based measurement of writing in kindergarten and first grade: An investigation of production and qualitative scores. Exceptional Children, 76, 175–193.

Deno, S. L., Mirkin, P., & Marston, D. (1982). Valid measurement procedures for continuous evaluation of written expression. Exceptional Children, 48, 368–371.

Deno, S. L., Mirkin, P., & Marston, D. (1980). Relationships among simple measures of written expression and performance on standardized achievement tests (Research Report No. IRLD-RR-22). Minneapolis: University of Minnesota, Institute for Research on Learning Disabilities.

Gickling, E. E., & Armstrong, D. L. (1978). Levels of instructional difficulty as related to on-task behavior, task completion, and comprehension. Journal of Learning Disabilities, 11, 559–566.

Graham, S., Berninger, V. W., Abbott, R. D., Abbott, S. P., & Whitaker, D. (1997). Role of mechanics in composing of elementary school students: A new methodological approach. Journal of Educational Psychology, 89, 170–182.

Graham, S., & Perin, D. (2007). A meta-analysis of writing instruction for adolescent students. Journal of Educational Psychology, 99, 445–476.

Hammill, D. D., & Larsen, S. C. (1996). Test of Written Language—Third Edition. Austin, TX: PRO-ED.

Haring, N. G., & Eaton, M. D. (1978). Systematic instructional technology: An instructional hierarchy. In N. G. Haring, T. C. Lovitt, M. D. Eaton, & C. L. Hansen (Eds.), The fourth R: Research in the classroom (pp. 23–40). Columbus, OH: Merrill.

McMaster, K. L., Du, X., & Petursdottir, A. (2009). Technical features of curriculum-based measures for beginning writers. Journal of Learning Disabilities, 42, 41–60.

McMaster, K. L., Du, X., Yeo, S., Deno, S. L., Parker, D., & Ellis, T. (2011). Curriculum-based measures of beginning writing: Technical features of the slope. Exceptional Children, 77, 185–206.

McMaster, K. L., & Espin, C. A. (2007). Technical features of curriculum-based measurement in writing: A literature review. Journal of Special Education, 41, 68–84.

Nunnally, J. C., & Bernstein, I. H. (1994). Psychometric theory (3rd ed.). New York: McGraw-Hill.

Salahu-Din, D., Persky, H., & Miller, J. (2008). The Nation's Report Card: Writing 2007. Retrieved December 1, 2009, from http://nces.ed.gov/nationsreportcard/writing/

Tindal, G., & Hasbrouck, J. (1991). Analyzing student writing to develop instructional strategies. Learning Disabilities Research and Practice, 6, 237–245.

Treptow, M. A., Burns, M. K., & McComas, J. J. (2007). Reading at the frustration, instructional, and independent levels: Effects on student time on-task and comprehension. School Psychology Review, 36, 159–166.

Vellutino, F. R., Scanlon, D. M., Sipay, E. R., Small, S. G., Chen, R., Pratt, A., et al. (1996). Cognitive profiles of difficult-to-remediate and readily remediated poor readers: Early intervention as a vehicle for distinguishing between cognitive and experiential deficits as basic causes of specific reading disability. Journal of Educational Psychology, 88, 601–638.
Videen, J., Deno, S. L., & Marston, D. (1982). Correct word sequences: A valid indicator of proficiency in written expression (Research Report No. IRLD-RR-84). Minneapolis: University of Minnesota, Institute for Research on Learning Disabilities.
Date Received: November 12, 2009
Date Accepted: October 25, 2010
Action Editor: Tanya Eckert
Article accepted by previous Editor.
David C. Parker is a doctoral candidate in the school psychology program within the Department of Educational Psychology at the University of Minnesota. His research interests include direct assessment and intervention for reading and writing, particularly in the primary elementary grades.

Kristen L. McMaster, Ph.D., is an Associate Professor of Special Education in the Department of Educational Psychology, University of Minnesota. Her research interests include (a) promoting teachers' use of data-based decision making and evidence-based instruction and (b) developing individualized interventions for students at risk for or identified with reading and writing-related disabilities.

Matthew K. Burns, Ph.D., is a Professor of Educational Psychology, Coordinator of the School Psychology Program, and Co-Director of the Minnesota Center for Reading Research at the University of Minnesota. His current research interests include curriculum-based assessment for instructional design, interventions for academic problems, problem-solving teams, and incorporating all of these into a response-to-intervention model.