Individualizing the Assessment of Language Abilities
Multilingual Matters

Age in Second Language Acquisition
BIRGIT HARLEY
Bicultural and Trilingual Education
MICHAEL BYRAM and JOHAN LEMAN (eds.)
Communication and Simulation
D. CROOKALL and D. SAUNDERS (eds.)
Cultural Studies in Foreign Language Education
MICHAEL BYRAM
Current Trends in European Second Language Acquisition Research
HANS W. DECHERT (ed.)
Dialect and Education: Some European Perspectives
J. CHESHIRE, V. EDWARDS, H. MUNSTERMANN, B. WELTENS (eds.)
Introspection in Second Language Research
C. FAERCH and G. KASPAR (eds.)
Key Issues in Bilingualism and Bilingual Education
COLIN BAKER
Language Acquisition: The Age Factor
D. M. SINGLETON
Language in Education in Africa
CASMIR M. RUBAGUMYA (ed.)
Language and Education in Multilingual Settings
BERNARD SPOLSKY (ed.)
Language Planning and Education in Australasia and the South Pacific
R. B. BALDAUF and A. LUKE (eds.)
Marriage Across Frontiers
A. BARBARA
Methods in Dialectology
ALAN R. THOMAS (ed.)
Minority Education: From Shame to Struggle
T. SKUTNABB-KANGAS and J. CUMMINS (eds.)
The Moving Experience: A Practical Guide to Psychological Survival
G. MELTZER and E. GRANDJEAN
Modelling and Assessing Second Language Acquisition
K. HYLTENSTAM and M. PIENEMANN (eds.)
The Role of the First Language in Second Language Learning
HÅKAN RINGBOM
Second Language Acquisition - Foreign Language Learning
B. VanPATTEN and J. F. LEE (eds.)
Special Language: From Humans Thinking to Thinking Machines
C. LAUREN and M. NORDMAN (eds.)
Teaching and Learning English Worldwide
J. BRITTON, R. E. SHAFER and K. WATSON (eds.)
Variation in Second Language Acquisition (Vol. I and II)
S. GASS, C. MADDEN, D. PRESTON and L. SELINKER (eds.)

Please contact us for the latest book information:
Multilingual Matters, Bank House, 8a Hill Rd, Clevedon, Avon BS21 7HH, England
MULTILINGUAL MATTERS 59
Series Editor: Derrick Sharp
Individualizing the Assessment of Language Abilities
Edited by
John H. A. L. de Jong and Douglas K. Stevenson
MULTILINGUAL MATTERS LTD
Clevedon - Philadelphia
Library of Congress Cataloging in Publication Data
Individualizing the Assessment of Language Abilities
Edited by John H. A. L. de Jong and Douglas K. Stevenson
p. cm. (Multilingual Matters: 59)
Bibliography: p.
Includes indexes
1. Language and languages--Ability testing. 2. Individualized instruction.
I. Jong, John H. A. L. de, 1947- . II. Stevenson, Douglas Keith, 1944- . III. Series: Multilingual Matters (Series): 59
P53.4.I53 1990  418'.0076 dc20

British Library Cataloguing in Publication Data
Individualizing the Assessment of Language Abilities (Multilingual Matters: 59)
1. Children. Language skills. Assessment
I. Jong, John H. A. L. de. II. Stevenson, Douglas K.
401'.9
ISBN 1-85359-067-3
ISBN 1-85359-066-5 (pbk)

Multilingual Matters Ltd
Bank House, 8a Hill Road, Clevedon, Avon BS21 7HH, England
&
1900 Frost Road, Suite 101, Bristol, PA 19007, USA

Copyright © 1990 John H. A. L. de Jong, Douglas K. Stevenson and the authors of individual chapters.
All rights reserved. No part of this work may be reproduced in any form or by any means without permission in writing from the publisher.

Index compiled by Meg Davies (Society of Indexers)
Typeset by SB Datagraphics
Printed and bound in Great Britain by WBC Print Ltd., Bridgend
CONTENTS

Preface  ix
Foreword  xi

Section I: Theoretical Considerations on Individualized Assessment
1. Social Aspects of Individual Assessment (Bernard Spolsky)  3
2. Response to Spolsky (Arthur Hughes)  16
3. Learner-Centred Testing through Computers: Institutional Issues in Individual Assessment (J. Charles Alderson)  20
4. Response to Alderson (Paul Tuffin)  28
5. National Issues in Individual Assessment: The Consideration of Specialization Bias in University Language Screening Tests (Grant Henning)  38
6. Response to Henning: Limits of Language Testing (Graeme D. Kennedy)  51
7. Psychometric Aspects of Individual Assessment (Geoffrey N. Masters)  56
8. Response to Masters: Linguistic Theory and Psychometric Models (John H.A.L. de Jong)  71
Section II: Language Teaching and Individualized Assessment
9. Individual Learning Styles in Classroom Second Language Development (Rod Ellis)  83
10. Comprehension of Sentences and of Intersentential Relations by 11- to 15-Year-Old Pupils (Denis Levasseur and Michel Pagé)  97
11. Discourse Organization in Oral and Written Language: Critical Contrasts for Literacy and Schooling (Rosalind Horowitz)  108
12. Indeterminacy in First and Second Languages: Theoretical and Methodological Issues (Antonella Sorace)  127
13. An Experiment in Individualization Using Technological Support (Norma Norrish)  154
14. Discrete Focus vs. Global Tests: Performance on Selected Verb Structures (Harry L. Gradman and Edith Hanania)  166

Section III: Individualization and Assessment Procedures
15. Operationalising Uncertainty in Language Testing: An Argument in Favour of Content Validity (Alan Davies)  179
16. Minority Languages and Mainstream Culture: Problems of Equity and Assessment (Mary Kalantzis, Diana Slade, and Bill Cope)  196
17. The Role of Prior Knowledge and Language Proficiency as Predictors of Reading Comprehension among Undergraduates (Tan Soon Hock)  214
18. The Language Testing Interview: A Reappraisal (Gillian Perrett)  225
19. Directions in Testing for Specific Purposes (Gill Westaway, J. Charles Alderson and Caroline M. Clapham)  239

List of Contributors  257
Index  263
PREFACE

The 15 chapters in this volume were chosen from among the several hundred papers which were presented at the 1987 AILA World Congress in Sydney, Australia. The first four chapters are complemented by responses, making a total of 19 contributions. This collection reflects, and responds to, the growing international interest in the theme of individualization. More specifically, it brings forward many of the current issues, problems, and hopes of the movement towards individualizing the assessment of language abilities.

Independence and internationalism are two of the most noteworthy markers of this movement. It is not tied to any particular school or "brand" of linguistics, or any single approach or trend in language teaching. Nor does it presume any one testing method or measure. Similarly, the interest in individualization does not have a nationalistic slant, or a geographical center. Rather, the interest is as apparent in Malaysia and Montreal as it is in Melbourne, and colleagues working in Edinburgh, Jerusalem, or San Antonio find much in common.

These chapters sample and represent this great diversity in professional interests and origins. As a result, they are of interest to the classroom teacher as well as the linguist and the teacher of teachers, to those concerned with large-scale assessment, as well as those concerned about that one individual who, when taking a test, always takes it personally.

The editors would like to express their appreciation to the two dozen contributors, first, for allowing their papers to be included in this volume, and secondly, for complying with the rather stringent time requirements which allowed this volume to appear so that what is of current interest remains current. We would also like to thank the publisher, Multilingual Matters, for their immediate and firm support in bringing together contributions which otherwise would only have been available piecemeal.

Editorial policy has been to interfere as little as possible with each contributor's presentation style and academic accent. Even consistent spelling variation among the contributions, being normal and natural, has been maintained. In the same spirit, addresses have been provided in the List of Contributors.

THE EDITORS
FOREWORD

The current movement towards the individualization of language assessment reveals a great diversity of interests, activities, and goals. At the same time, it responds to three general concerns, or directions. The first is the increased awareness of the intricacy of language, the second is the availability of new technologies, and the third is the resolution of issues connected with the democratization of education. Each is apparent in linguistics and language pedagogy, as well as in language testing. And each is perhaps best understood as a complex factor, a set of related influences which, when taken together, is distinguished by effect.

The first influence comes from our continually increasing awareness of the intricacy of language and language behavior, whether viewed from the perspective of second language acquisition research, sociolinguistic surveys, studies in attitudes and motivation, attempts to derive communicative syllabuses, or the question of what it means, exactly, to know and use a language. The increased methodological rigor with which language studies have been approached in the past two decades has, above all, heightened this awareness. Researchers are more wary of considering "all other things equal," as they know, from experience, that they seldom are. Some linguists might still maintain the utility of positing idealized native-speakers in contained situations for the purposes of abstract modelling. Yet most now agree that the presumption, at any level, of free variation comes at high cost when models are applied or, if ever, their validity is examined. The inclusion of nonnative speakers in multilingual communities in other than purely academic situations has, of course, added further dimensions of complexity.

When pursuing applied goals, therefore, language testing theory cannot assume that one size will fit all. However, the testing models presently available tend to be undetermined (unable to specify interrelationships by weight or degree) or merely suggestive. As a result, with no sufficiently specified and validated testing framework available, the quality of testing and assessment can best be enhanced by individually tailoring instruments so that they are sensitive to highly differentiated populations, multipurpose
goals, and needs. That this will necessarily restrict the applicability of such tests is acknowledged, as is the related problem of cross-interpretation among test results. Both are argued, at least, to be acceptable trade-offs for increased specificity.

This is not to say that the traditional broad-spectrum sampling of skills and abilities, the "universal" large-scale group administered test, is either ill-conceived or inappropriate. The attention given to such approaches in the testing literature reflects the very real need for such measures. In the U.S.A., for example, these procedures remain necessary to conveniently and inexpensively (and still, objectively) survey the English language proficiencies of the hundreds of thousands of foreign students who seek admission each year to hundreds of different universities. Too often, in fact, the practical constraints under which such tests must operate have not been given sufficient weight in critical reviews. Yet they only serve to highlight the relative success of some of these tests, given their purposes.

There are, nonetheless, a large number of testing contexts for which broad-spectrum sampling is obviously insufficient. Until quite recently, however, measures specifically fitted to individual populations, situations, purposes and goals were largely impractical, if they were also concerned with the professional requirements to demonstrate reliability and validity. The ideal has long been apparent, but the aligned costs in time, effort, and administrative and interpretive complexity were usually prohibitive.

The second influence on the movement towards individualized assessment is the promise that new technologies, when applied to language testing, can largely mitigate many such problems of cost and convenience. With more microcomputers available at a reasonable price, and more language testers at home with computer-adapted approaches, the profiling of individual abilities against specific goals or syllabuses has become much less utopian. Various branching or implicational methods which allow an examinee's skill levels to be quickly approached, and then thoroughly reviewed, are also now more feasible, with computerized databanks served by advances in psychometric theory. Tests are being developed which, based upon a blend of Item Response Theory (IRT) with Bayesian decision theory, allow test takers to answer only as many questions as needed to be assessed at predetermined levels of accuracy. This can be contrasted with traditional practice, which required an examinee to march through a complete set of group-referenced items. This concept of starting and stopping a formal assessment procedure, relative to an individual's abilities, is not new, of course.
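The logic of such an adaptive procedure can be sketched in a few lines of code. The sketch below is a hypothetical illustration only, not a description of any operational test: it assumes a Rasch-type (one-parameter IRT) item bank, a simple Bayesian update of the ability estimate after each simulated response, and a stopping rule tied to a predetermined standard error.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical item bank: Rasch (one-parameter) difficulties on a logit scale.
item_difficulties = np.linspace(-3.0, 3.0, 40)

def p_correct(theta, b):
    # Rasch model: probability of a correct response at ability theta, difficulty b.
    return 1.0 / (1.0 + np.exp(-(theta - b)))

def adaptive_test(true_theta, se_target=0.4, max_items=25):
    # Administer items until the posterior standard error falls below se_target.
    grid = np.linspace(-4.0, 4.0, 161)                 # discrete grid of abilities
    posterior = np.full_like(grid, 1.0 / grid.size)    # flat prior over ability
    unused = list(range(item_difficulties.size))
    mean, se, asked = 0.0, float("inf"), 0
    while asked < max_items and unused:
        # Choose the unused item whose difficulty is closest to the current
        # estimate: the most informative choice under the Rasch model.
        next_item = min(unused, key=lambda i: abs(item_difficulties[i] - mean))
        unused.remove(next_item)
        b = item_difficulties[next_item]
        correct = rng.random() < p_correct(true_theta, b)   # simulated examinee
        likelihood = p_correct(grid, b) if correct else 1.0 - p_correct(grid, b)
        posterior = posterior * likelihood
        posterior = posterior / posterior.sum()
        asked += 1
        mean = float(np.sum(grid * posterior))              # ability estimate
        se = float(np.sqrt(np.sum(posterior * (grid - mean) ** 2)))
        if se < se_target:
            break
    return mean, se, asked

print(adaptive_test(true_theta=0.8))

In a run of this kind an examinee near the middle of the scale is typically located to the target precision after a dozen or so items, whereas a fixed form would administer the complete set regardless of how early the decision could have been made.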
Yet through such computerized approaches, much of the time and expense (and intensive training of test administrators) demanded by older, individually administered procedures can be avoided. In some ways, then, microcomputer adaptions can be seen as a bridge between the convenience of large-scale, paper and pencil tests, and the personalized, in-depth virtues of the face-to-face, one-on-one examination. In addition, however, they can more easily draw upon recent advances in psychometric theory and practice to further the quality of testing.

One approach with great potential is the linking of microcomputers and videodiscs, which can raise the level of test authenticity, while at the same time offering compatible, yet individualized, situational contexts. Such a marriage would also enable interactive assessment. Not only could a dialogue, for example, be seen as well as heard in a suitable context (as opposed to traditional heard-only tapes, or printed dialogues), but examinees could also "interrupt" conversations in order to ask for a repeat, and so on. With an audiovisual data base, tests can be tailored to individual teaching goals and language learners, against a more realistic context than is available through textbooks and test booklets.

Adaptive assessment, of course, sometimes raises "two cultures" reactions, that is, a general distaste for anything that smacks of the technological. Not infrequently, parallels to language laboratories have been drawn. Nonetheless, few test takers are averse to assessment which, with demonstrable fairness, also offers convenience and appropriateness from their point of view. Whether it is seen as ironic or not, a higher degree of personalization in testing can be achieved using computerized procedures than would otherwise be feasible in most testing situations. This is especially true when large numbers of test takers are involved.

The third set of influences which have furthered the interest in individualized assessment are closely intertwined with concepts of test fairness, ethics in testing, and, in general, the overall democratization of education. Here, as one might expect, questions arise as to what, for example, constitutes fair and useful testing from the learner's point of view, about different definitions of content or method bias, about what levels of accuracy are called for in making various types of decisions, and even the fundamental question of "why test?" Pressures of educational accountability are evident here too: in those testing contexts where, when a student is tested, the teacher (or the syllabus, or teaching methods, or entire system) is tested at the same time, or in the larger educational context, where the measure is more and more how well all of the pupils do, and not just which pupils do well.

The public attention to accountability is probably greatest at present in North America, as is the resultant attention to individualized instruction
and, therefore, individualized assessment. Interestingly, in some areas, the established professional testing standards and ethics are now being strengthened by formal legal requirements that tests publicly demonstrate their relevance and fairness along with the more usual technical qualities. To the extent that broad-spectrum testing tends to provide less precise information the more an individual "deviates" from the norm, individualized testing is in one way simply fulfilling what test developers have long recommended, but have never been able to require. This is that test users tailor the tests, the uses of the test, norms, and interpretive guidelines to their own populations and educational needs. It would be tempting, in this sense, to see the interest in individualized instruction and testing as a logical, if belated response to a demonstrated need. It would be naive, however, to overlook the considerable pressures which have been brought to bear, especially in those educational contexts in which populations to whom tests have been given often bear little resemblance to those on which a test was normed, or skill referenced. Even in those cases in which bias has not been demonstrated, the attention to individualization is often of great attitudinal worth. It is, therefore, not lightly rejected as being just the cosmetics of face validity. These three sets of influences are those which can be said to have effected much of the current interest in the individualization of language assessment. Needless to say, for any specific area of interest, or any specific testing or educational context, they are not necessarily determinant or, for that matter, even acknowledged. Our intent here is not to isolate specific influences in individual settings, but to provide a generalized backdrop against which the individual contributions can be highlighted. Other, less general influences could, of course, be mentioned. One could, for example, stress the immediate influences of any of the many approaches to teaching a language for specific purposes. And, one could even attempt to trace influences by following a cline of less general to more particular (e.g. from Business English down to English for Hotel Desk Receptionists). On the other hand, it could also be argued that this attention to individualization is not always a chosen shift in testing paradigms, but sometimes a necessary reaction to change in language teaching and testing markets. Similarly, it could be argued that some of the current attention to individualization can be attributed to rapid growth of the field of language testing, itself. In other words, as more and more individuals throughout the world become proficient in test design and development, there will also be a notable gain in the number of tests which they construct to fit their own specific interests and circumstances, and a decline in those which are taken
"off the shelf" from commercial suppliers, or adapted from large-scale, international testing programs. As these two examples show, there is, for any specific case, a myriad of possible causes and motivations for the interest in individualization with some of them, after a closer look, less apparent than others. Yet, while such speculations are perhaps interesting in their own right, they do little to show what the present interests actually are. And this is the major purpose of this volume, that is, to represent, by example, the great variety of interests which at present have converged on the theme of individualization in language assessment. In this way it is hoped to provide a concise, yet accessible overview of contemporary activity internationally, while keeping the various viewpoints, arguments, disputes, problems, hopes, approaches, and testing contexts intact and, as it were, speaking with their own voices. It would be ironical indeed, given the topic, if its individualistic nature were hidden by a selection which only stressed a few of its many aspects. Therefore, in selecting the papers, an attempt was made to represent as much variety as possible, from the general and philosophical discussions of fundamental concerns, to the actual research study that pursues a specific hypothesis, from planned, large-scale testing systems, to the very real-life concerns of the classroom teacher and tester. Some of the contributions are brief and descriptive. Others give considerable attention to theoretical issues and assumptions. The first four chapters address general considerations from four different perspectives. Social and ethical, institutional, national, and psychometric issues are dealt with. Each of these chapters is followed by an invited response. Together, they offer an introduction to many of the topics and issues dealt with in a more applied fashion in the rest of the volume. The next six contributions demonstrate the need for individualized assessment by revealing the requirements of individual language learners, for instance, characteristics related to age, learning styles, or learning objectives. The last five chapters show how a neglect of the concept of individualization in the development and use of language tests can jeopardize the assessment procedures, and obscure interpretative results. Alternatives are presented, and discussed. JOHN H.A.L. DE JONG CITOARNHEM DOUGLAS K. STEVENSON UNIVERSITY OF ESSEN
SECTION I THEORETICAL CONSIDERATIONS ON INDIVIDUALIZED ASSESSMENT
1 Social Aspects of Individual Assessment

BERNARD SPOLSKY
Bar-Ilan University

Statistical Approaches to Test Fairness

The key feature of the modern or structuralist-psychometric approach to language testing has been the use of statistical methods to establish the fairness of tests. Where the traditional method of testing and examinations tended to rely on the good will and innate honesty of the examiner to avoid any unfair bias in a test, the modern trend to explicit statement of test criteria has required the development of publishable and replicable methods of showing that tests are reliable, valid, and unbiased. The goal is a fine one, and psychometricians have been scrupulous and zealous in its pursuit.

The meaning of the terms is quickly explained. A test is reliable if its results are consistent when the test is repeated. A test is valid if it measures what it is intended to measure. A test is unbiased if it will give the same results for any set of test takers of equal ability, regardless of other characteristics.

The method of achieving these goals has been well studied and is generally understood. The reliability of a measure can be estimated by repeating the test at some future time, by looking at the correlation between the two halves of the test, by developing an alternate form, or by the use of a statistical formula such as coefficient alpha (Carmines and Zeller, 1979) which checks internal consistency of a test. The correlation itself is a measure of reliability; the issue of how much reliability is needed remains to be decided.
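The internal-consistency estimate mentioned here, coefficient alpha, is simple enough to compute directly from an examinee-by-item score matrix. The following sketch uses invented data purely for illustration; a real analysis would of course use the full response matrix of the test in question.

import numpy as np

def cronbach_alpha(scores):
    # Coefficient alpha for an (examinees x items) score matrix:
    # alpha = k/(k-1) * (1 - sum of item variances / variance of total scores)
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]
    item_variances = scores.var(axis=0, ddof=1)        # variance of each item
    total_variance = scores.sum(axis=1).var(ddof=1)    # variance of total scores
    return (k / (k - 1)) * (1.0 - item_variances.sum() / total_variance)

# Invented data: six examinees by four dichotomously scored items.
responses = [[1, 1, 1, 0],
             [1, 1, 0, 0],
             [1, 0, 1, 1],
             [0, 0, 0, 0],
             [1, 1, 1, 1],
             [0, 1, 0, 0]]
print(round(cronbach_alpha(responses), 2))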
The validity of a test, or more precisely, the validity of its interpretation (Cronbach, 1971), needs to be established in several different ways. The most easily understood sense of the term is criterion-related validity, when the test has been used to measure some behavior external to the test itself (Nunnally, 1978); the correlation of the test to this external measure is its concurrent or predictive validity (Carmines and Zeller, 1979). The second is content validity, the extent to which the content of the test is generally accepted as defining the universe of behavior being measured (Cronbach and Meehl, 1955). The third is construct validity, where a theoretical body of assumptions about the constructs being measured is empirically supported. Each of these approaches is open to debate: the question of what is a suitable outside measure, of what is appropriate content, of what is the theoretical model underlying the test, these are all questions on which there can be and is honest disagreement and serious scholarly debate. Current testing theory favors using several different methods to establish the validity of test interpretations.

The notion of test bias can be studied outside of its social context; internal statistical techniques exist that make it possible to determine that a test is equitable, "so that when an inequality exists between groups' test scores, the disparity is due primarily to differences in whatever it is the test purports to measure" (Osterlind, 1983:8). The various measures are based on a critical assumption of testing statistics, unidimensionality, the notion that a test is measuring a single ability. Item bias can be investigated within classic testing theory, by using chi-square, ANOVA, and other techniques, but is also particularly well handled within Item Response Theory, which has methods of handling individuals independently from items (Osterlind, 1983). The use of these various techniques makes it possible to develop more focussed tests, in which any bias is identified and captured for the purpose of the tester.

Each of these complementary approaches results from methods developed within psychometrics to be sure that a test is fair in itself. They help clear the way for dealing with the more generalized question of whether the test results are used fairly. It should be noted that a test is not reliable, or valid, or unbiased, but rather that its use with a certain population has a certain degree of reliability, or validity, or lack of item bias. The various measures still need interpretation, and for this we must go beyond statistics.

What Does a Language Test Measure?

The statistical methods of modern testing enable us to try to focus more precisely on the complex issue of what is being measured by a test. This is clearly at the heart of the validity question: a decision on validity is a decision on the nature of the ability being measured. It is also critical to the issue of
test item bias, for the fundamental assumption of unidimensionality requires an understanding of the underlying construct. Only when we know what we have measured (Bachman and Clark, 1987) are we ready to consider the socially critical question of the relevance of this for the purposes to which we will put the results.

There is not space here to discuss the question of what it means to know a language. I have argued elsewhere (Spolsky, 1973; 1985b) that this is a matter of considerable complexity. I would sum up the requirement for a satisfactory statement in the following preference condition (Spolsky, 1989:80):1

Condition 20. Linguistic outcome condition (typical, graded): Prefer to say that someone knows a second language if one or more criteria (to be specified) are met. The criteria are specifiable:
a. as underlying knowledge or skills (dual knowledge condition);
b. analyzed or unanalyzed (analyzed knowledge condition; unanalyzed knowledge condition);
c. implicit or explicit (implicit/explicit knowledge condition);
d. of individual structural items (sounds, lexical items, grammatical structures) (discrete item condition);
e. which integrate into larger units (language as system condition);
f. such as functional skills (integrated function condition);
g. for specified purposes (see for instance academic skills condition, communicative goal condition);
h. or as overall proficiency (overall proficiency condition);
i. productive or receptive (productive/receptive skills condition);
j. with a specified degree of accuracy (variability condition; accuracy condition);
k. with a specified degree of fluency (automaticity condition);
l. and with a specified approximation to native speaker usage (native speaker target condition);
m. of one or more specified varieties of language (specific variety condition).

It is worth expanding on some aspects of these conditions. First, there is the basic distinction between competence, seen as an abstract set of rules accounting for underlying knowledge, and performance, defined as observable behavior. Testers are concerned with performance, but aim to understand the underlying abilities and knowledge that can be revealed by performance. A second set of distinctions has involved some form of the theoretical division of competence into its various components, such as linguistic, pragmatic, and sociolinguistic. The first component, Chomskyan linguistic competence, accounted for by the grammar of the language, is itself
subdivided into at least phonetic representation, phonology, syntactic structures and rules, semantic structures and rules, correspondence rules between syntax and semantics, and lexicon (Jackendoff, 1983:9). While the division into these subcomponents is generally accepted, the boundaries between them are not clear, nor is there agreement about the boundary between the grammar and the second major component, also clearly a kind of competence, the pragmatics, or the general rules of language use. There may be a third component, sociolinguistic competence, which may be defined informally as socially specific rules of language use. It is also possible to treat pragmatics and sociolinguistics together, the former dealing with universal and the latter with local society-specific rules. The third distinction is between separate components of language knowledge, whether structural or functional, and the notion of general language proficiency, which may be defined operationally within information or probability theory as the ability to work with reduced redundancy. Now an obvious question that follows after making these distinctions is to ask about how the parts are related. On the question of the relation between competence and performance, there are good arguments that favor building the latter on the former. 2 At the same time, we must note that there are strong arguments presented against any such connection: those who wish to develop performance grammars present the following kind of argument: There is, however, no direct link between an interlanguage rule system and interlanguage performance data. The learner's output is directly determined by rules of language production and only indirectly by the corresponding linguistic rule system (Jordens, 1986:91). Even more powerful arguments in favor of performance grammars are likely to develop as a result of work with the implications of Parallel Distributed Processing (McClelland et al., 1986; Rumelhart et al., 1986) for language knowledge and use (see e.g. Sampson, 1987). An equally difficult challenge is set if we wish to consider the relation between structural and functional descriptions. This issue is faced on a practical level by those who take on the complex task of intertwining productively a notional-functional syllabus and a structural one. Why this is difficult is clear if we look at the same task tackled on a theoretical level by those who have attempted to trace the connection between speech acts and the many different formal structures that realize them. One elaborate analysis of functional language competence is the work in speech act theory. Bach and Harnish (1979) attempt to relate linguistic structures and speech acts. Their answer however does not satisfy. The key
problem faced by those working to relate the functional and formal characteristics is the very absence of the possibility of one-to-one mappings. Not only are there many different forms of words that I can use in making a request, e.g., "Please shut the window", "Close the window, please", "Close it, please", "Do it, please", but also there are many different syntactic structures that may be used, e.g., a question like "Is the window open?" or a statement like "I am cold." Given the difficulty and perhaps theoretical impossibility of specifying precisely the relation between structure and function, 3 we can see the necessity for continuing to include both approaches in our model, and be willing to describe language proficiency in both functional and structural terms.

Equal difficulty is set by the notion of overall proficiency. Work in language testing research has been concerned to clarify as much as possible the relations between the various kinds of testing tasks and the specific abilities that they measure. The multitrait-multimethod studies of language testing encouraged by Stevenson (1981) were essentially concerned with attempting to separate the various strands built into language tests. In practice, it has turned out to be simpler to think up new tests and testing techniques than to explain precisely what it is that they are measuring.

Certain conclusions are however safe. First, there is a good relation between various kinds of language tests. Part of this relation comes from the fact that they are all formal tests: thus, subjects who for various reasons do not test well (become over-anxious, or are unwilling to play the special game of testing, i.e. answering a question the answer to which is known better by the asker than the answerer) will not be accurately measured by any kind of formal test: there will be a large gap between their test and their real life performance (Spolsky, 1984; 1985a). A second part of this relation might well be explained by some theory of overall language proficiency.

Overall language proficiency was originally derived from John Carroll's (1961) idea of integrative language tests. The argument was presented as follows:

The high correlation obtained between the various sections of TOEFL [Test of English as a Foreign Language] and other general tests of English suggests that in fact we might be dealing with a single factor, English proficiency. . . . (Spolsky, 1967:38)

In a subsequent paper, the acknowledgement to Carroll is spelled out:

Fundamental to the preparation of valid tests of language proficiency is a theoretical question: What does it mean to know a language? There are two ways in which this question might be answered. One is to follow
what Carroll (1961) has referred to as the integrative approach and to say that there is such a thing as overall proficiency. (Spolsky et al., 1968)

There are empirical and theoretical arguments for this claim. The empirical argument follows from the work of a number of language testing scholars including Holtzman (1967) who were at the time reporting very high correlations between various language proficiency measures. It was further supported by a series of studies by Oller and some of his colleagues who, using factor analysis, were struck by the power and importance of a common first factor that Oller labelled "unitary language competence."

There is now considerable doubt over the validity of this statistical argument. 4 The use of one kind of factor analysis, exploratory principal components analysis, tends to exaggerate the size of the first factor. This statistical technique, because it does not start with a hypothetical model of the underlying factors and their relationships, sets up a general component that includes in it much of the unexplained scatter. The statistical argument over language proficiency parallels similar debates over the notion of a single measurable factor of intelligence. The key argument between Thurstone, who claimed that there were three underlying mental abilities, and Spearman, who argued for one, depended on the statistical techniques they used. Thurstone showed that Spearman's analysis of results of tests to produce a one-vector solution that he labelled g (a general intelligence factor) is theory bound and not mathematically necessary. By using a different technique, Thurstone produced his own three-vector solution, and then proceeded to reify the three vectors as primary mental abilities. But there is no reason to believe that Thurstone's own primary abilities are not themselves dependent on the tests used; with more tests added, one could use the same statistical technique to show more kinds of primary ability. Gould (1981) argues convincingly that the statistical tests that have been used in this debate do not in fact make it possible to distinguish between single and multifactor causes. He is particularly critical of the reification of the results of factor analyses. Oller (1984) has acknowledged the criticism of the statistical basis for his claim and is now much more hesitant.
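The statistical point lends itself to a small simulation, offered here only as a hypothetical aside: even when subtest scores are generated from two distinct, though correlated, abilities, an unrotated principal components analysis assigns well over half of the variance to a single first component.

import numpy as np

rng = np.random.default_rng(1)

# Simulate 500 examinees with two distinct but correlated abilities
# (say, a "structural" and a "functional" ability, correlated at about .6).
n = 500
ability_cov = np.array([[1.0, 0.6],
                        [0.6, 1.0]])
abilities = rng.multivariate_normal([0.0, 0.0], ability_cov, size=n)

# Six subtest scores: three load mainly on the first ability, three on the
# second, each with independent measurement error.
loadings = np.array([[0.8, 0.1], [0.7, 0.2], [0.8, 0.0],
                     [0.1, 0.8], [0.2, 0.7], [0.0, 0.8]])
scores = abilities @ loadings.T + 0.5 * rng.standard_normal((n, 6))

# Unrotated principal components of the correlation matrix.
corr = np.corrcoef(scores, rowvar=False)
eigenvalues = np.sort(np.linalg.eigvalsh(corr))[::-1]
print("share of variance on the first component:",
      round(eigenvalues[0] / eigenvalues.sum(), 2))

The first component in such a run absorbs most of the shared variance, even though a two-factor structure was built into the data; read uncritically, the output would suggest a single "general" ability.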
Oller's arguments are not only statistical, but relate in part to the notion of an expectancy grammar. The theoretical argument presented in Spolsky (1968) is not statistical either, but draws attention to the link between the creative aspect of language, especially as represented by the fact that speakers of language can understand and create sentences they have not heard before, and the work of Miller and Isard (1963), working within models of information theory, on the importance of ability to understand language with reduced redundancy. The information theory model was particularly important because it treated language in an independent way; as Chomsky's earliest work demonstrated, the probability analysis it used was quite different from a structural linguistic analysis.

Tests that mask the message randomly (dictation with or without noise, or the cloze test) then might be considered independent, non-linguistically determined measures of language proficiency. If they are tapping specific abilities, they are doing it in a more or less random way, and so are testing a random conglomerate of specific items. Indeed, the weakness that Klein-Braley (1981) spotted in the cloze test was its use of the word, a more or less linguistic unit, as the unit to be deleted, and as she showed, this very fact meant that a specific cloze test was biassed towards measuring specific structural features. The C-test that Raatz and Klein-Braley (1982) have proposed tries to overcome this problem by deleting not words but parts of words; it is thus further from being a measure of structural ability, and so closer to a general measure.

According to this argument, the best (because most general) measure of this kind would be a written equivalent of the dictation with noise; the visual noise too would need to be added randomly, and as equivalent to the white or pink noise of the aural test, one would use some form of visual blurring for the written passage. One might want to argue that a test like this is functional if not structural, but the very lack of face validity and obvious task authenticity in the cloze and the noise test is what makes these tests so abstract and non-specific. It is of course possible to find ways to make the tasks seem authentic, 5 but the very fact that an effort is needed makes my point.
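The contrast between the two deletion procedures is easy to make concrete. The sketch below illustrates the principle only; it is not a reconstruction of Klein-Braley's or Raatz's actual materials. It builds a cloze that deletes every nth running word, and a C-test that deletes the second half of every second word.

def cloze(text, nth=7, start=3):
    # Classic cloze: delete every nth running word, replacing it with a blank.
    words = text.split()
    return " ".join("_____" if i >= start and (i - start) % nth == 0 else w
                    for i, w in enumerate(words))

def c_test(text):
    # C-test principle: delete the second half of every second word.
    # (In practice the first and last sentences are left intact; that
    # refinement is omitted here.)
    words = text.split()
    out = []
    for i, w in enumerate(words):
        if i % 2 == 1 and len(w) > 1:
            keep = (len(w) + 1) // 2               # keep the first half
            out.append(w[:keep] + "_" * (len(w) - keep))
        else:
            out.append(w)
    return " ".join(out)

sample = ("The weakness of deleting whole words is that each gap tends "
          "to test one particular structural feature of the language.")
print(cloze(sample))
print(c_test(sample))

Because the cloze deletes whole words, each gap tends to hinge on one grammatical or lexical point; the half-word deletions of the C-test spread the sampling more evenly across the text, which is exactly the property claimed for it above.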
All these studies continue to provide support for the notion of general language proficiency however difficult it might be to measure it. This may be summarized as a condition (Spolsky, 1989:72):

Condition 19. Overall proficiency condition (necessary): As a result of its systematicity, the existence of redundancy, and the overlap in the usefulness of structural items, knowledge of a language may be characterized as a general proficiency and measured.

One of the important results of this condition is the possibility of developing a common core of items to be included in a general introductory course; as a general rule, specific purpose teaching of a foreign language follows the teaching of this common core.

Essentially, then, we have seen that there are both in general theory and in language testing theory and practice three interrelated but not overlapping approaches to describing and measuring knowledge of a second language, the one structural, the second functional, and the third general. Anybody who knows a second language may be assumed to have all three kinds of knowledge, and they are related but not in any direct way, so that any description on one dimension alone is just as likely to be distorted as a description on the basis of one aspect of one dimension (e.g. vocabulary knowledge only for structural knowledge, or greeting behavior only for functional).

We saw above that the statistical methods for establishing the reliability, validity and lack of bias of a test are based on certain assumptions, the most important of which is unidimensionality. The survey of the nature of language knowledge in the last section makes clear that this assumption requires careful consideration in the case of language tests. The existence of something like overall proficiency does mean that there will be high correlation (overlap) among various kinds of language tests, but not enough to justify a claim like Oller's for unitary ability. Work with the multitrait-multimethod approach, which attempts to tease out the dimensions, has been less successful than hoped, for in language it is very difficult to distinguish traits from methods.

The Social Contract in Tests

There has been a lot of recent concern for what is called authenticity in language tests. This often ignores the special artificiality of language tests. It is not the actual nature of the task posed to the test taker so much as the breach of a normal conversational maxim that creates the artificiality. The issue was pointed out by Searle (1969) in his discussion of the types of illocutionary acts. A question as a speech act is defined by its fulfilling the conditions:

Preparatory: 1. S[peaker] does not know "the answer", i.e., does not know if the proposition is true, or in the case of the propositional function, does not know the information needed to complete the proposition truly. . .
2. It is not obvious to both S[peaker] and H[earer] that H[earer] will provide the information at that time without being asked.
Sincerity: S[peaker] wants the information. (1969:65)

But Searle immediately goes on to point out the special problem of the examination situation and the status of examination questions:

There are two kinds of questions, (a) real questions, (b) exam questions. In real questions S[peaker] wants to know (find out) the answer; in exam questions, S[peaker] wants to know if H[earer] knows. (1969:65)
He says later (1969:69) that questions are in fact requests, and that examination questions are requests "that the hearer display knowledge", or as I have suggested earlier, display the skills of performing a task like writing an essay, conducting a conversation, playing a role in a dialogue. From this analysis we are forced to the conclusion that testing is not authentic language behavior, that examination questions are not real however much like real life questions they seem, and that an examinee needs to learn the special rule of examinations before he or she can take part in them successfully.

Thus, the greatest challenge to the notion of authentic tests comes from those like Edelsky and her colleagues (1983) who argue that the whole process is artificial whatever you do, fatally flawed by the fact that in a test someone is asked to perform in an unnatural way: to answer questions asked by someone who doesn't want to know the answers because he already knows them. According to this analysis, there can be no test that is at the same time authentic or natural language use; however hard the tester might try to disguise his purpose, it is not to engage in genuine conversation with the candidate, or to permit him/her to express his/her ideas or feelings, but rather to find out something about the candidate in order to classify, reward, or punish him/her. Only if the candidate knows this and accepts the rules of the game, Edelsky argues, will he/she cooperate. The most we can then ask for is an authentic test; not authentic communication.

This fact then establishes the social contract of the test: an understanding on the part of the person taking the test that the performance is necessary and relevant to the task. But, as we have seen, both psychometric and linguistic considerations raise serious doubts about our rights to make such claims. Even the best prepared test still contains at least a modicum of error, and however hard we may have worked on validity, we must be left with doubts about exactly what it is we have measured.

An Ethical Envoi

The only morally justifiable answer to this dilemma is, as I have suggested earlier (Spolsky, 1981; 1984), an approach that sets ethical criteria for testing according to the purposes for which the test results may be used. The less important the results will be to the present or future career of the test taker, the more we are justified in what testers call quick and dirty tests. Conversely, the more important the results, the larger the rewards and punishments, the more detailed and precise and varied (and expensive) the test itself should be. In practical terms, we know that the more generalized a
test is, the less accurate its results will be for any individual. The real danger of modern objective tests is that they can hide behind their scientific basis, and disguise their uncertainties in surface objectivity. At least in a traditional examination, whether face to face or reading the marks made by a pen put to paper, the examiner is regularly reminded that he or she is judging a real person. Of particular importance in solving this problem is the need for full and detailed reporting of results. In a criterion-referenced test, the result reported is the ability of the subject to perform the criterion skill or some reasonable approximation. One of the main advantages of tests like the FSI Oral Interview is in fact the detailed description included in the results; the consumer has available a prose description of the kind of performance that leads to a particular score. That these tests still involve potential problems can be attributed to the fact that they assume a single scale; as I have argued elsewhere (Spolsky, 1986), they depend on a shared agreement on the appropriateness of the scale. Among other problems of the oral interview (Raatz, 1981 and Bachman, 1987 both draw attention to its failure to meet psychometric criteria) is its ending up with a single score. For this reason, I can see strong arguments for the greater fairness and wider usefulness of individualized scales. The norm-referenced test, objective though it may be, is potentially dangerous to the health of those who take it, for a casual test user might easily be misled in interpreting its results. In the best of cases (and one might think of a well-developed, highly standardized, sensitively revised test like TOEFL as a good example of the best of cases), there is still plenty of reason to advise the users of test results to develop their own validation and interpretations. The wisest (and so ethically most correct) interpretations of TOEFL are those that take into account the age, country of origin, background, and chosen field of study of the applicant, and that draw on the institution's own previous experience to determine the weight that should be given to the TOEFL scores in admission decision and counselling. Because there is a natural tendency on the part of those who use test results to take shortcuts, there is a moral responsibility on testers to see that results are not just accurate but do not lend themselves to too quick interpretation. For this purpose, profiles rather than single scores seem of special value: score reports that include several skills, tested in different ways, and adding if at all possible some time dimension. This last factor is most seldom included, but can be most revealing. Proficiency tests usually ignore the dynamic dimension; they give no way of deciding whether the subject is in process of rapid language learning or has
long since reached a plateau. But this kind of information, achievable from the kind of profile possible in teacher-administered observational testing or sometimes in a face to face interview, is surely of vital importance.

While the normal tendency of the tester is to try to increase the reliability and validity of the simple test (and the other contributors to this volume have given excellent advice on how this can be done), I can see strong grounds for reversing direction, and aiming, as individualized testing does, to increase the amount of information gathered and reported to the consumer. From the social and ethical point of view, individualized assessment is particularly important when it makes clear the complexity of second language knowledge and the difficulty of summing it up in any simple measure.

Notes

1. A somewhat different model has been proposed by Bachman (forthcoming).
2. A criticism of Chomsky's formulation was that he seemed on the one hand to treat performance as a wastebasket for anything that would not fit into competence, and on the other to set up his own statement of what kinds of regularities would be admitted to linguistic competence.
3. This task is undertaken for literary interpretation by Schauber and Spolsky (1986) using a preference model.
4. See for instance the first seven papers in Hughes and Porter (1983).
5. Stevenson (1985) has shown how students may be convinced that the tasks are authentic and reflect real-life events, but the artificiality remains.

References

Bach, K. and R. M. Harnish (1979) Linguistic Communication and Speech Acts. Cambridge, MA: MIT Press.
Bachman, L. F. (1987) Problems in examining the validity of the ACTFL oral proficiency interview. In: A. Valdman (ed), Proceedings of the Symposium on the Evaluation of Foreign Language Proficiency. Bloomington, IN: Indiana University.
Bachman, L. F. (Forthcoming) Fundamental Considerations in Language Testing. Oxford: Oxford University Press.
Bachman, L. F. and J. L. D. Clark (1987) The measurement of foreign/second language proficiency, Annals of the American Academy of Political and Social Science, 490, 20-33.
Carmines, E. G. and R. A. Zeller (1979) Reliability and Validity Assessment. Sage University Paper series on Quantitative Applications in the Social Sciences, series no. 07-017. Beverly Hills and London: Sage Publications.
Carroll, J. B. (1961) Fundamental considerations in testing for English language proficiency of foreign students. Testing the English Proficiency of Foreign Students. Washington, DC: Center for Applied Linguistics.
Cronbach, L. J. (1971) Test validation. In: R. L. Thorndike (ed), Educational Measurement. 2nd edn. Washington, DC: American Council on Education.
Cronbach, L. J. and P. E. Meehl (1955) Construct validity in psychological tests, Psychological Bulletin, 52, 281-302.
Edelsky, C., Altwerger, B., Barkin, F., Floris, B., Hudelson, S., and K. Jilbert (1983) Semilingualism and language deficit, Applied Linguistics, 4, 1-22.
Gould, S. (1981) The Mismeasure of Man. New York: Norton.
Holtzman, P. D. (1967) English language proficiency testing and the individual. In: D. C. Wigglesworth (ed), Selected Conference Papers of the Association of Teachers of English as a Second Language. Los Altos, CA: Language Research Associates.
Hughes, A. and D. Porter (eds) (1983) Current Developments in Language Testing. London: Academic Press.
Jackendoff, R. (1983) Semantics and Cognition. Cambridge, MA: MIT Press.
Jordens, P. (1986) Production rules in interlanguage: evidence from case errors in L2 German. In: E. Kellerman and M. Sharwood Smith (eds), Crosslinguistic Influence in Second Language Acquisition. New York: Pergamon Press.
Klein-Braley, C. (1981) Empirical investigations of cloze tests. Ph.D. dissertation, University of Duisburg.
McClelland, J. L., D. E. Rumelhart, and the PDP Research Group (1986) Parallel Distributed Processing: Explorations in the Microstructures of Cognition. Volume II, Psychological and Biological Models. Cambridge, MA: MIT Press.
Miller, G. A. and S. Isard (1963) Some perceptual consequences of linguistic rules, Journal of Verbal Learning and Verbal Behavior, 2, 217-228.
Nunnally, J. C. (1978) Psychometric Theory. 2nd edn. New York: McGraw-Hill.
Oller, J. W., Jr. (1984) "g", what is it? In: A. Hughes and D. Porter (eds), Current Developments in Language Testing. London: Academic Press.
Osterlind, S. J. (1983) Test Item Bias. Sage University Paper series on Quantitative Applications in the Social Sciences, series no. 07-030. Beverly Hills and London: Sage Publications.
Raatz, U. (1981) Are oral tests tests? In: C. Klein-Braley and D. K. Stevenson (eds), Practice and Problems in Language Testing I. Frankfurt am Main: Peter D. Lang.
Raatz, U. and C. Klein-Braley (1982) The C-test - a modification of the cloze procedure. In: T. Culhane, C. Klein-Braley, and D. K. Stevenson (eds), Practice and Problems in Language Testing. Colchester: University of Essex.
Rumelhart, D. E., J. L. McClelland, and the PDP Research Group (1986) Parallel Distributed Processing: Explorations in the Microstructures of Cognition. Volume I, Foundations. Cambridge, MA: MIT Press.
Sampson, G. (1987) Review article, Language, 63, 871-886.
Schauber, E. and E. Spolsky (1986) The Bounds of Interpretation: Linguistic Theory and Literary Text. Stanford: Stanford University Press.
Searle, J. R. (1969) Speech Acts: an Essay in the Philosophy of Language. Cambridge: Cambridge University Press.
Spolsky, B. (1967) Do they know enough English? In: D. C. Wigglesworth (ed), Selected Conference Papers of the Association of Teachers of English as a Second Language. Los Altos, CA: Language Research Associates.
Spolsky, B. (1968) Language testing - the problem of validation, TESOL Quarterly, 2, 88-94.
Spolsky, B. (1973) What does it mean to know a language, or how do you get someone to perform his competence? In: J. W. Oller, Jr. and J. C. Richards (eds), Focus on the Learner: Pragmatic Perspectives for the Language Teacher. Rowley, MA: Newbury House.
Spolsky, B. (1981) Some ethical questions about language testing. In: C. Klein-Braley and D. K. Stevenson (eds), Practice and Problems in Language Testing I. Frankfurt am Main: Peter D. Lang.
Spolsky, B. (1984) The uses of language tests: an ethical envoi. In: C. Rivera (ed), Placement Procedures in Bilingual Education: Education and Policy Issues. Clevedon: Multilingual Matters.
Spolsky, B. (1985a) The limits of authenticity in language testing, Language Testing, 2, 31-40.
Spolsky, B. (1985b) What does it mean to know how to use a language: an essay on the theoretical basis of language testing, Language Testing, 2, 180-91.
Spolsky, B. (1986) A multiple-choice for language testers, Language Testing, 3, 148-158.
Spolsky, B. (1989) Conditions for Second Language Learning: Introduction to a General Theory. Oxford: Oxford University Press.
Spolsky, B., Sigurd, B., Sato, M., Walker, E., and C. Aterburn (1968) Preliminary studies in the development of techniques for testing overall second language proficiency, Language Learning, Special Issue, No. 3, 79-101.
Stevenson, D. K. (1981) Beyond faith and face validity: the multitrait-multimethod matrix and the convergent and discriminant validity of oral proficiency tests. In: A. S. Palmer, P. J. M. Groot, and G. A. Trosper (eds), The Construct Validation of Tests of Communicative Competence. Washington, DC: TESOL.
Stevenson, D. K. (1985) Authenticity, validity and a tea party, Language Testing, 2, 41-7.
2 Response to Spolsky

ARTHUR HUGHES
University of Reading

Spolsky's stimulating chapter touches on too many issues for this response to deal with them all adequately. Instead, it will concentrate on two of Spolsky's desiderata in language testing: the provision of test users with more detailed information on an individual's test performance than is usually the case at present; and the reflection in test content of different approaches to the description of knowledge of a second language.

The "Ethical Envoi" is concerned essentially with the obligation of testers to be fair. What does it mean to be fair? To answer this question, it is necessary, I think, to distinguish between the tester and the test (results) user, even though on occasion they will in fact be the same person. In order to be fair, the tester has to create instruments which will provide test users with sufficient accurate and interpretable information on language ability for the purpose to which the information will be put. In addition, the tester should make users aware of the limitations of the information provided (indicating, for example, the likely magnitude of error). The tester should also perhaps remind them, as does the TOEFL handbook, that language ability is usually only one factor to be considered in the decisions that they have to take, and that a weakness in language may be offset by strengths elsewhere. It is then the test user's responsibility to take decisions which are fair to the test taker. Only if both tester and test user meet their obligations will the test taker be fairly treated.

It is hard to argue against the need to be fair. It is, however, possible to question whether greater fairness would actually be achieved by the means which Spolsky suggests. He proposes that test results should be full and detailed, and reported in the form of a profile for each individual test taker. Not only would this help meet the requirement of sufficiency mentioned above, but it should also compel the test user to exercise care in interpretation.
Ideally these profiles will be based on the use of individualized scales, teacher-administered observational testing, the testing of several skills, and will take into account information on previous training. I agree that fuller and more individual information ought to increase fairness to the test taker, provided that use is made of it by the test user and provided, of course, that the information is accurate. Unfortunately, these two conditions may not be easy to meet.

The British Council ELTS test in fact provides test users with a profile of each candidate. However, despite efforts on the part of the British Council to educate test users, very few of them appear to look at performance on the different components of the test. According to the British Council (1988), only two British universities or constituent colleges look at the whole profile, while nine others demand a minimum score on every component (the same score on each component). The remainder rely on the global score. Unless these British institutions are untypical, it would seem that profiles are unlikely in practice to add significantly to the fairness with which test takers are treated. This is because the behaviour of test users, whose cooperation is necessary to ensure fairness, cannot be controlled by testers.

The condition that the additional information be accurate is potentially more important. If it is not met when the first condition is, then many test takers are likely to be treated less, not more, fairly. Imagine students who are applying for entry to university in an English-speaking country and who take an English language test. As well as a score which represents the sum of scores on a number of subtests, there is also available separate information on writing ability, as with the ELTS test. It happens that, because of the nature of the academic courses for which they are applying, a particular university is especially interested in students' ability in writing. If the university makes use of the additional information, a student who does not do very well overall, but who does well on writing, may be accepted, when he would have been rejected if only the overall score had been available. This seems eminently fair. Similarly, a student who does well overall, but who does badly in writing, may be rejected. This too may be seen as fair, since the student, it could be argued, would probably be unable to cope with the demands of the course. But the treatment of these two cases will only be fair if the writing component of the test is sufficiently reliable and valid. If it is not, it is most unfair to base important decisions on estimates of writing ability. The suggestion is sometimes heard that decisions in such cases should be based on the global score and the component score. Inasmuch as this means paying attention to one reliable score and to one unreliable score, the advice seems neither helpful to the test user nor fair to the test taker.
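To make concrete what "the likely magnitude of error" means for a component score, the following minimal sketch (in Python, added here purely for illustration; the scale, standard deviation, and reliability figures are invented and are not taken from the ELTS or any other test discussed in this volume) computes the classical standard error of measurement and the band within which a candidate's score can plausibly be expected to lie:

```python
# Illustrative only: how the reliability of a component score translates into
# a margin of error around the reported score. All figures are invented.

def score_band(observed, sd, reliability, z=1.96):
    """Approximate 95% band around an observed score, using the classical
    standard error of measurement: SEM = SD * sqrt(1 - reliability)."""
    sem = sd * (1 - reliability) ** 0.5
    return observed - z * sem, observed + z * sem

# A hypothetical writing subtest reported on a nine-point scale, SD = 1.5
print(score_band(6.0, sd=1.5, reliability=0.90))  # roughly (5.1, 6.9)
print(score_band(6.0, sd=1.5, reliability=0.55))  # roughly (4.0, 8.0)
```

With a reliability of 0.90 the band is narrow enough to act on; with 0.55 it spans nearly half the scale, which is precisely the sense in which important decisions based on an unreliable writing score become unfair.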
We know that high reliability in valid writing tests is not easily achieved. Not only must there be high scorer reliability (which is likely to be achieved on a regular basis through multiple scoring by thoroughly trained scorers), there will also have to be at least two (and preferably more) substantial writing tasks. In fact, in order to achieve acceptable accuracy in all components of a profile, we will need longer tests than we are used to contemplating. We will also need to invest in the training and monitoring of scorers for any subjective components. Individualized scales, if they are to be used, will involve a considerable investment of time and effort if they are to be valid and reliable. And teacher-administered observational testing, desirable as it may be, will obviously require extensive training and continued monitoring, if that too is to be valid and reliable.

What all this adds up to is something that we should already be well aware of: accurate information on the language ability of individuals is an expensive commodity. It has to be recognized that a high price will have to be paid for the level of fairness that Spolsky would like to see. One response to this might be to say that test constructors should try to convince test users (and perhaps test takers) that such a price is worth paying. But what justification is there for saying that? Are we in a position to tell test users and test takers just what they will get for their money (assuming that in the end they will be paying for the extra information)? Do we know what differences there would be in test user decisions if fuller and more reliable information were provided, and test users were really prepared to take advantage of it? I do not think we do. But it is open to empirical investigation, and something which may be worth pursuing. Certainly, if significant differences were found, these would be more convincing than argument in the abstract. In the meantime, to insist on more expensive tests without knowing the benefits which will accrue in decisions taken by the test users may be regarded as quite unprofessional, and even unfair.

So far in this response, I have concentrated on the question of providing individual profiles and the need for accuracy. But whatever the form in which information is presented, it must be not only accurate but also appropriate and readily interpretable in terms of the test users' purposes. While these purposes will vary, it seems not unreasonable to assume that the test user will typically be concerned to know about the test taker's communicative ability in a language. It is on the basis of this assumption that I wish to question the apparent implication for testing of there being "three interrelated but not overlapping approaches to describing and measuring knowledge of a second language, the one structural, the second functional, and the third general". Spolsky says that "Anyone who knows a language may be assumed to have all three kinds of knowledge" and that "any
description on one dimension alone is likely to be distorted". It would seem to follow from this, in the context of the general theme of the chapter, that information on all three "dimensions" should be obtained in order to give an adequate description of the test taker's ability in the language.

First let us take "general proficiency" or "overall ability". This notion would seem to be inherently unhelpful with respect to the description of an individual's communicative ability, particularly where the profiling of that ability is intended. Spolsky defines overall proficiency as "the ability to work with reduced redundancy". The notion does indeed have as its basis the fact that tests making use of reduced redundancy (such as cloze) have tended to correlate quite highly with tests of various skills (such as listening and reading) and, not surprisingly, with tests that measure separately a number of such skills. But it is generally recognised that such tests are, in Spolsky's words, "quick and dirty", suitable only for the least important, easily rectifiable decisions. This is because scores on these tests, though they may give a rough idea of the language ability of many people, will give a misleading impression of the ability of many others. In the context of individualization and of profiling, it is also hard to see how test users are to interpret scores on such a test, and what use they will make of them. What usable information can a measure of overall ability add to that provided by reliable and valid measures of communicative skills? I can see no place for it in the kind of testing that Spolsky recommends in his Ethical Envoi.

Interpretable and accurate information on the other two dimensions (the structural and the functional) will obviously be helpful to many test users. My only concern here is that such information should not be derived from test components made up of discrete-point items. There is a considerable misfit between, for instance, many people's performance on a grammar test and their ability to use grammar in communicating. Reports of structural and functional ability should be based on their manifestations in communicative performance, provided that the assessment is valid and reliable.

I will end, as did Spolsky, on an ethical note. Responding to his chapter, and being reminded of how many things we do not know, has made clear to me that the professional obligation to be fair implies another: that is, to continue to carry out the research necessary to make this possible.

References

British Council (1988) English Language Entrance Requirements in British Educational Institutions. London: British Council.
3 Learner-Centred Testing through Computers: Institutional Issues in Individual Assessment

J. CHARLES ALDERSON
University of Lancaster

Microcomputers are increasingly being used in language classrooms as an aid to the development of competence. Reading, grammar, and vocabulary exercises have been devised to allow individuals to focus upon areas relevant to their own needs. And group exercises are also used, based upon computer software, to develop a range of interactive language skills. In the early stages of the development of computer software, the emphasis was mainly on exercises in multiple-choice format, which closely resembled language drills. In addition, cloze procedures of various sorts were used to produce so-called reading exercises. More recently, exercise formats have become more varied, although there is still a preponderance of gap-filling and multiple-choice methods. Word-processing software, databases, and spreadsheets are frequently exploited in language classrooms in order to develop learners' productive skills.

Developments in Computer Assisted Language Learning (CALL) have met with some scepticism, largely on the basis of comparisons of the advent of CALL to the introduction of the language laboratory in the 1960s, which is generally regarded as having failed to live up to the claims made for its effectiveness. However, it appears to be the case that microcomputers are now fairly generally seen to have a useful role to play in language learning and teaching, provided that they are appropriately integrated into the curriculum.

Computers have, of course, been used in language testing for a long time, but their use has been largely restricted to the analysis of test results. The current availability of cheap but powerful microcomputers has
made test delivery by computer both feasible and attractive and, indeed, work is progressing in various places in two areas: computerized adaptive testing (CAT) and computer-based language tests. This chapter addresses computer-based English language tests (CBELT), since most of the work to date has been in the area of ESL/EFL (English as a Second/Foreign Language).

Computerised Adaptive Testing

CAT involves the delivery to testees of items which are gradually tailored to the apparent ability level of the testee. Items are delivered to candidates on an individual basis, usually but not necessarily via a VDU screen, and responses are made via the keyboard. On the basis of an ongoing estimate of the testee's ability, derived from an Item Response Theory (IRT) analysis of performance, items are selected from an item bank in the computer's memory within an appropriate level of difficulty for individual candidates. It is possible, through such tests, to gain an estimate of a candidate's ability in a given area much more quickly than by exposing a candidate to a test on which many items will be inefficient for that candidate, because they are simply too easy or too difficult to add information about ability.

To date, work on CAT has been restricted to developing and refining the statistical model used for estimating person ability and item difficulty. Development work seems to have concentrated on the reliability of test scores and the increased efficiency with which test scores can be obtained. The methods or formats by which test items are delivered, the content of the computer-delivered tests, and the nature of the validity of such tests have been relatively little discussed. Indeed, the tests used in CAT are usually tests which have been separately developed for delivery through traditional paper and pen means, and are multiple-choice in format. They do not attempt any innovation in test method or test content.

Computer-Based English Language Testing (CBELT)

Recent work in CBELT at the University of Lancaster has attempted to address the issue of whether language tests delivered by microcomputer can benefit from the speed, patience, and memory of such computers in order to bring about changes in test method or content, and more generally, to examine the nature and role of testing within educational settings. It is with the results of this research that the rest of this chapter will be concerned, as it is the belief of the author that it is in this area that the most important
contribution of computers to language testing will be made. It is unlikely to be the case that developments in CBELT will bring about improvements in the content of language tests. There is also still doubt about whether there will be significant improvements in test methods as a result of further work on CBELT. However, it seems certain that CBELT has important potential for changing the way in which tests are viewed and used within curricula. The individualization which CBELT makes possible is likely to ensure that most applications of CBELT are at an institutional rather than a national level. Yet the issues that such developments raise are important not only for institutional uses and interpretations of tests, but also for applied linguistics more generally.

The Institute for English Language Education at the University of Lancaster has been engaged in research into the nature of CBELT, partly funded by the British Council, for two years. The research began with a speculative phase, during which the possibilities for test development and delivery were explored at some length: the result of this work is to be published shortly. In the second phase, some of the ideas in phase 1 were developed into sample item types, as part of a feasibility study. Phase 3 will consist of the construction of innovative computer-based tests, together with a shell authoring system to enable users to input a test of their choice for computer delivery. Research will concurrently be conducted into the validity, acceptability, and potential of these tests within language learning curricula.

Computer Scoring

In this research, CBELT is defined as tests which are delivered and scored by computer. The term is not restricted to tests which are supposedly constructed by computer as, for instance, in the use of pseudo-random deletion procedures to produce cloze tests by mechanical means. More importantly, however, this definition rules out tests which are delivered by computer but scored by humans. This restriction has major consequences for the sorts of tests that can be considered to be CBELT. In particular, direct tests of writing and even speaking can clearly be delivered by computer, but cannot be computer scored on the basis of any interesting set of criteria, and this is likely to remain true for several decades to come. In fact, this restriction means that test techniques must be restricted to those which can be objectively marked. It does not, however, mean that multiple-choice techniques are the only acceptable item types that can be used in CBELT. Gap-filling techniques are perfectly feasible, provided that it is possible, by whatever means, to arrive at a key of acceptable and unacceptable responses.
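As a purely illustrative sketch (in Python; the items, acceptable responses, and partial-credit weights are invented and do not come from any test discussed here), a gap-filling scoring key of the kind just described might be represented and applied as follows:

```python
# Illustrative only: a gap-filling scoring key with acceptable responses and
# optional partial-credit weights. Items and weights are invented.

KEY = {
    "gap_1": {"went": 1.0, "travelled": 1.0, "got": 0.5},
    "gap_2": {"because": 1.0},
}

def score_response(gap_id, response):
    """Return the credit assigned to a response (0.0 if it is not in the key)."""
    return KEY.get(gap_id, {}).get(response.strip().lower(), 0.0)

print(score_response("gap_1", "Travelled"))  # 1.0
print(score_response("gap_1", "got"))        # 0.5  (reduced credit)
print(score_response("gap_2", "so"))         # 0.0
```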
In addition, matching techniques, language transformation tasks, reordering tasks, information transfer techniques, and other objectively scored item types are available for use in CBELT. The larger the scoring key is, the more the advantages of CBELT over conventional test delivery are apparent, since reliability of marking is ensured on CBELT. In addition, differential weights can easily be assigned to different responses by the computer, something which it is very difficult for humans to do consistently. Thus, communality of response scoring procedures, of which clozentropy is but one example, are perfectly feasible.

A further advantage of computer scoring is not only that results can be made available to interested parties immediately after taking the test, but, significantly, that they can be made available whilst the candidate takes the test. In other words, immediately a candidate has made a response and committed himself to it, the computer can evaluate it and provide the candidate with feedback on the response. This feedback can be in the form of "right/wrong" information, or it can be in the form of hints or clues as to what an acceptable answer might be. Thus a candidate can be given the chance to have a second attempt at an item, with or without guidance. A successful second attempt can be awarded reduced credit, or the number of attempts required to arrive at an acceptable response could be recorded and subsequently used in a variety of ways. It could, for example, be entered into a record of performance, or stored for comparison with future performances. It could be used in calculating the final outcome on the test, or it could be used diagnostically or remedially, either by teachers or by the learner him/herself.

Tests and Exercises

The possibility of immediate feedback, and the impact of this feedback upon subsequent performance by candidates, raises the interesting question of the distinction between a test item and an exercise. It has been common in both the teaching and the testing literature to assert that there are indeed significant differences between tests and exercises. However, it is unusual to find any discussion of what might constitute the distinguishing characteristics of either. The implication of most discussions is that exercises aid learning whereas tests do not: they "simply" assess whether learning has taken place. Such a distinction is clearly overly simplistic, however, if only because it is evident that learners can learn from tests. Tests are typically administered to learners in situations where they have to perform without support. Assistance from teachers is held to invalidate a test, and assistance from peers is termed "cheating". In the case
of exercises, presumably the role of teachers and other learners is typically to provide guidance, support and feedback during the activity as a means of encouraging successful completion of the task and as a way of facilitating learning. The possibility of immediate feedback on CBELT thus blurs the distinction between test and exercise.

Furthermore, the computer can provide other means of support during the test-taking. HELP facilities are commonly available in computer software, providing explanations and guidance to users. Such HELP facilities can easily be built into computer-based tests as well. Further explanations of test instructions, including mother tongue versions, dictionary definitions of unknown words, contextual meanings of unknown words, explanations, paraphrases or translations of structures: all these support facilities can be made available to CBELT users. It is possible to adjust test scores for use of such facilities, or to withhold the facilities for particular test purposes, or for particular candidates. Indeed, the candidate himself could choose to take a test with more or less HELP. This element of learner choice is a potentially important feature of CBELT, which can be extended to include learner control over when to take a test. For instance, a learner could engage in CALL exercises which provide feedback on the quality of performance, and even suggest when it is appropriate for the learner to take a relevant test. The decision, however, could be left up to the learner.

Diagnosis and Remediation

The blurring of a distinction between a test and an exercise also means that it is possible to think of incorporating teaching routines into tests. Thus, in the light of feedback on performance to date during a test, a learner may be allowed to opt for a remedial teaching exercise in a branching program, and then re-enter the test once the exercises have been satisfactorily completed. A computer could easily keep track of the responses made by candidates, and develop a profile of performance which could be used as the basis of remedial intervention. Thus, if a learner displayed particular weaknesses in a specific grammatical area, he could be branched out of the test into exercises. Alternatively, he could be branched into a more detailed test routine which explored in depth the nature of his weakness in order to develop an appropriate diagnosis. This sort of diagnostic routine could include hints and clues as to the nature of particular linguistic rules and might be designed to provide specific feedback on the choice of particular items by learners.
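The following hedged sketch (Python; the item, credit scheme, and threshold are invented, and this is not Alderson's software) illustrates the two mechanisms just described: a second attempt rewarded with reduced credit, and a simple check of whether accumulated performance in one grammatical area is weak enough to branch the learner into remedial exercises:

```python
# Illustrative only: reduced credit for a second attempt, plus a simple
# branching check based on a running profile of performance by area.

def administer_item(item, get_response, max_attempts=2, retry_credit=0.5):
    """Return (credit, attempts_used) for a single item."""
    for attempt in range(1, max_attempts + 1):
        if get_response(item["prompt"]) in item["acceptable"]:
            return (1.0 if attempt == 1 else retry_credit), attempt
    return 0.0, max_attempts

def needs_remediation(results, area, threshold=0.5):
    """results: list of (area, credit) pairs; branch out if mean credit is low."""
    credits = [credit for item_area, credit in results if item_area == area]
    return bool(credits) and sum(credits) / len(credits) < threshold

item = {"prompt": "She ___ to school yesterday.", "acceptable": {"went"}}
answers = iter(["goes", "went"])                             # simulated learner input
print(administer_item(item, lambda _prompt: next(answers)))  # (0.5, 2)
print(needs_remediation([("past tense", 0.5), ("past tense", 0.0)], "past tense"))  # True
```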
However, such possibilities for diagnosis and remediation raise an important problem for applied linguists and language teachers. The limitation on the development of such tests is not the capacity of the hardware, or the complexity of the programming task, but our inadequate understanding of the nature of language learning and of language use. The fact is that applied linguistics to date has been able to provide very little insight into the causes of language learning difficulties. Second language acquisition research, studies of learners' interlanguages, and research into error analysis have not yet helped explain how learners learn what they learn, nor have they given much guidance on the construction of suitable teaching or learning activities. Diagnosis of learner weaknesses is still in a very primitive state: the challenge of CBELT is more to the linguist and applied linguist to provide appropriate input on the nature of the branching routines, and on the hints, clues and feedback that would help learners, than to the computer programmer to produce adequate software.

Reporting of Scores

Computers also make it easier to produce very detailed reports of learner performance on tests. The output of a CBELT routine need not be a single test score, but could be considerably more complex. The production of profile scores for different subtests or test components is straightforward. In addition, the computer can combine item scores in a multiplicity of different ways in order to arrive at a profile score. Thus, items can be grouped for scoring purposes according to a range of linguistic or even statistical criteria. The length of time a learner took to complete a task or test can be recorded separately and reported, or used to calculate a revised test score. The number of times a learner requested HELP or dictionary facilities can similarly be reported, as can the number of attempts made on an item or set of items. The linguistic and content characteristics of texts used in comprehension tests can be reported alongside a test score. Any one learner's score can be reported in comparison with other learners' performances, with criterion group performances or with native speaker performances. The possibilities for detailed and complex reporting of results are very great indeed. In fact, the possibilities almost certainly outstrip the ability of teachers or applied linguists to understand or interpret or even simply to digest the information. Once again, CBELT presents a challenge to testers and applied linguists generally to decide what would constitute useful, understandable and usable information: the challenge is to determine what information one needs to gather about learners' performance, not simply what one can gather.
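One minimal illustration of such a profile report (Python; the categories, scores, and timings are invented) groups item scores by a linguistic category and accumulates response times:

```python
# Illustrative only: a simple profile grouping item scores by category and
# accumulating response times. Data are invented.

from collections import defaultdict

def profile(item_records):
    """item_records: list of (category, score, seconds) tuples."""
    totals = defaultdict(lambda: {"score": 0.0, "n": 0, "seconds": 0.0})
    for category, score, seconds in item_records:
        totals[category]["score"] += score
        totals[category]["n"] += 1
        totals[category]["seconds"] += seconds
    return {c: {"proportion_correct": v["score"] / v["n"], "seconds": v["seconds"]}
            for c, v in totals.items()}

records = [("tense", 1, 12.0), ("tense", 0, 20.5), ("articles", 1, 8.2)]
print(profile(records))
# {'tense': {'proportion_correct': 0.5, 'seconds': 32.5},
#  'articles': {'proportion_correct': 1.0, 'seconds': 8.2}}
```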
Learner Ratings

When a learner's response to a CBELT item is gathered, it is possible to collect other data from the learner at the same time which might be of considerable interest. Two things in particular come to mind. One is the degree of confidence a learner has that the response he has made is in fact appropriate or correct. This data can be gathered in response to a question after the learner has completed an item, or the learner can simply indicate the degree of confidence during the response itself by inputting 1, 2, or 3 depending upon whether he was sure (1), fairly sure (2), or unsure (3) that the response was accurate. Comparisons can then be made, either by the computer, or by teachers or learners, between the degree of confidence a learner had in a performance, and the nature of that performance. A further possibility is for learners to evaluate their ability on particular language items or tasks before, during and after taking a test on similar items or tasks. Again, it is a simple matter for the computer to compare self-evaluations with actual performances, and to make this information available to learners and teachers. Such information clearly has implications for learners' self-awareness, and for learner training programmes in which learners' perceptions of their abilities can be explored.

Such information is also potentially useful for test and item validation purposes. Indeed, CBELT offers the possibility of gathering validation data during the test delivery which it is difficult to gather on conventional paper and pen tests. After each item, a learner could be required to fill out a closed-choice question on attitudes to the item, on what the learner thought was being tested, about the judged difficulty of the item, and so on. Alternatively, the question could be open-ended (much like scratchpads in current word-processing software) or could be designed to elicit introspective or retrospective information on the processes undergone by the learner when taking the test item. One would expect such information, potentially at least, to contribute substantially to our understanding of a test's or an item's validity. Indeed, it may well be the case that CBELT might be able to contribute to applied linguistics by providing the possibility for gathering information on learners' test-taking strategies and, conceivably, even their learning strategies, either by techniques such as those mentioned above, or by keeping records of students' behaviours and responses when taking tests. By so doing, CBELT could not only obtain information useful for the weighting of test scores, for the development of more appropriate and valid tests, and for the provision of more useful support during tests. It could also add to our understanding of the nature of language learning, language use and language proficiency.
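A small sketch of the first of these possibilities (Python; the sample data are invented), comparing the 1-2-3 confidence ratings described above with the correctness of the responses, might run as follows:

```python
# Illustrative only: comparing confidence ratings (1 = sure, 2 = fairly sure,
# 3 = unsure) with the correctness of the responses. Sample data are invented.

def confidence_summary(responses):
    """responses: list of (confidence, correct) pairs;
    returns the proportion correct at each confidence level."""
    summary = {}
    for level in (1, 2, 3):
        at_level = [correct for conf, correct in responses if conf == level]
        if at_level:
            summary[level] = sum(at_level) / len(at_level)
    return summary

data = [(1, True), (1, True), (2, False), (3, False), (3, True)]
print(confidence_summary(data))  # {1: 1.0, 2: 0.0, 3: 0.5}
```

A learner whose proportion correct is low even at confidence level 1 is, in the terms used above, over-confident, and that comparison is exactly the kind of information that could feed into learner training or test validation.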
Conclusion

It is likely that most computer-based language testing will be implemented within or in relation to particular curricula or institutions, rather than in some coordinated national programme. The advent of CBELT raises a number of issues about testing and test use which will have to be resolved at an institutional rather than national level. These issues include the place of testing within, before, or after language learning curricula, and the relationship that is possible between the test and the learning materials, given the limitations and possibilities of CBELT. Decisions on when to take a test, on the level of difficulty of the test, and on the content of the test, can all be made independently by the learner via menu choices, or jointly with teachers, or with the advice of the computer. This may well lead to a democratization of the testing process.

CBELT seems to imply a blurring of the customary distinction between a test and an exercise, which can have far-reaching implications for the degree of support a learner can choose to receive, or which teachers can offer (or withhold). Teaching routines can be incorporated within a test, into which a learner can be branched in response to his developing performance. Detailed diagnosis of a learner's performance is in principle possible, if institutions can decide how to provide and react to such diagnoses. Similarly, very detailed profiles of performance can be produced, if they can be used, and if they are needed. Computer scoring of performance may limit the item types that can be implemented in CBELT, but does offer the possibility of reliable scoring based upon large databases for scoring keys, as well as variable weighting of the value of responses. Immediate feedback on performance raises the interesting issue of whether learners should be allowed second or subsequent attempts at an item, and of how to interpret scores obtained with such feedback. The possibility of gathering simultaneous ratings from learners of their performance, their self-evaluations, their opinions on test items, and their introspective accounts of what they think they had to do in order to answer an item: all these possibilities allow for increased learner-centredness in test development and test use. An important question institutions need to address is: do they want to and can they take advantage of these possibilities?
4 Response to Alderson

PAUL TUFFIN
University of Adelaide

In responding to Alderson's chapter I need to make clear the rather particular perspective from which this response will be made. It is the perspective of a person spending a considerable amount of time and effort in the area of assessment of Languages Other Than English (LOTE) and of English as a Second Language (ESL). It is a rather particular perspective in that my work in this area has been limited to the attempt to produce, along with many other colleagues in South Australia and across Australia, a National Assessment Framework for Languages at Senior Secondary Level (Years 11 and 12 of Australian high schools). I believe, however, that it is a constructive perspective since, as will become evident, both the theoretical and practical issues which are being faced in work on the Framework tend, when associated with the possibility of computer-based language testing, to highlight a number of Alderson's key themes.

Development of the Framework

The concept of a framework which would attempt to provide overall guidelines for the assessment at Year 12 level of LOTE and ESL first appeared as part of discussions within the Curriculum Area Committee: Languages of the Senior Secondary Assessment Board of South Australia (SSABSA). While this Committee continued to work on its framework, it was suggested that a national approach might be of benefit to all: students, syllabus working groups, and accrediting authorities alike. This suggestion was acceded to by the Australasian Conference of Accreditation and Certification Authorities, which perceived potential economies of scale as well as educational advantages in a national approach, particularly with regard to small-candidature languages.
As a result the first of a series of National Consultations was organised by SSABSA and the Australian Languages Levels (ALL) Project. There is a close relationship between the ALL project and the National Framework. This relationship exists not only in the sharing of a common approach (and consequently in the use of common terminology and quite often of common text) but also in a fairly extensive common membership of the ALL Project Reference Committee and the Consultative Group for the National Framework. Following the third Consultation, held in July 1987, it was decided to produce a revised Framework which, because all decisions of the Consultative Group have only been made with full consensus of all state and territory representatives, will at least have the potential to be taken up and used by states and territories across Australia.

Description of the Framework

An overall description of the Framework can be taken from the Summary of the still-to-be-revised draft version:

This Framework is designed to meet the varying needs of learners of languages with a living social context. It provides for the development of Year 12 syllabuses and assessment schemes in these languages and is capable of fitting into the varied assessment requirements of State and Territory accrediting authorities in Australia. The Framework also provides concrete, practical guidelines to support the work of syllabus groups in individual languages, while bringing a measure of cohesion and commonality to the subject field.

The Framework consists of a General Course, an Extended Course, and a Specialist Studies Course. Since all courses are structured and defined on the same basis, and are designed to cover a continuum of development in competence to use languages, the Framework as a whole constitutes a single syllabus structure capable of meeting the needs of students with differing levels of language competence on entry to Year 12.

The Framework as a whole is designed to give teachers and students greater flexibility and scope in determining programs which better reflect student needs and encourage learner autonomy. A communicative approach which emphasises the purposeful and meaningful use of language offers a suitable base for both the Framework and its desired outcomes.
The Framework provides for the development of a common set of expectations for all languages, expressed in common terminology, which will ensure that students' achievements are seen to be comparable one to the other, no matter which language they study, so that all language subjects are accorded equal esteem. Within this common set of expectations, however, there will exist sufficient flexibility (through the possibility of differing balances of skills and the linked application of differing coefficients in the assessment process) to allow of the differences which exist, for example, in potential student rates of acquisition of skills in using Roman, Cyrillic, Arabic, or ideographic scripts.

The basis for the organisation of the Framework and for its overall structure is the identification of six Activity Types which it shares with the ALL Project. The process whereby these Activity Types have been identified is described within the Framework document but this description is too lengthy to reproduce here. At this point it is sufficient to note that they represent a principled attempt to produce generalised definitions and descriptions, and a categorisation of the various activities in which people are involved as they communicate with one another using language. They are not intended in any way to constitute an exhaustive description of all possible uses of language but to provide a principled means for deriving goals in the form of activities for senior secondary language courses in Australia in relation to which students' ability to use languages can also be assessed. Neither are they intended to constitute watertight and mutually exclusive categories but to serve as useful focuses for the development of syllabuses and assessment procedures which emphasise the purposeful use of language.

The six Activity Types which provide the common ground plan upon which the Framework is erected are these:

1. Establishing and maintaining relationships and discussing topics of interest (e.g. through the exchange of information, ideas, opinions, attitudes, feelings, experiences, plans).
2. Participating in social interaction related to solving a problem, making arrangements, taking decisions with others, and participating in transactions to obtain goods, services, and public information.
3. Obtaining information (a) by searching for specific details in a spoken or written text, (b) by listening to or reading a spoken or written text as a whole and then processing and using the information obtained.
4. Giving information in spoken or written form on the basis of personal experience (e.g. giving a talk, writing an essay or set of instructions).
5. Listening to, reading, or viewing, and responding personally to stimulus (e.g. story, play, film, song, poem, picture).
6. Creating (e.g. a story, a dramatic episode, a poem, a play).

Assessment Activities

In 1986 development of the Framework reached a point where it became evident that no further progress could be made unless criteria statements for setting assessment activities for all three courses were produced. Similarly, there was an evident need for the production of criteria for judging performance. This work was beyond the time resources of members of the Curriculum Area Committee: Languages of SSABSA and so application was made to the Schools Commission for funding to appoint a Project Officer to carry out the task. Funding was approved and the work carried out with excellent results by the Project Officer, Antonio Mercurio, who had the support and guidance of a specific-purpose reference group. The product of this work consists of three booklets (one for each of the Framework's courses) containing detailed criteria for the setting of assessment activities for each of the six activity types: in other words, a total of eighteen sets of criteria. In each case sample assessment activities are also provided.

It is not possible here to examine each of these criteria statements in respect of the potential application of computer-based testing as presented in Alderson's chapter, and it is probably not altogether desirable since there would doubtless be considerable repetition. It is thus my intention to proceed by exemplification and so to limit this examination to one course, the General Course (Years 11 and 12 only), and to those activity types which appear most immediately available for computer-based testing. In doing this I need to emphasise that I can already see potential, thanks to Alderson's work, for the application of computer-based testing in the assessment of activity types which do not fit into this category. Activity Type 1 (establishing and maintaining relationships and discussing topics of interest), for example, appears at first sight most unsuitable. Yet if one considers this activity type as including (as it must) written correspondence, then the possibility of building up a text from a limited number of possible sequentially presented sentences towards producing a "correct" piece of correspondence would seem capable both of delivery and scoring by
computer. I shall also keep to the limit of CBELT itself and treat "computer-based testing" as including computer scoring.

The activity type which appears most immediately available for computer-based testing is Activity Type 3. This was designated above as "Obtaining information (a) by searching for specific details in a spoken or written text, (b) by listening to or reading a spoken or written text as a whole and then processing and using the information obtained". The criteria for setting assessment activities for this activity type (as for all others) are presented under three major headings: "The nature of the assessment activity", "Linguistic and socio-cultural demands", and "Level of support". Of these three sections the first and last appear most relevant here in that they relate to format and delivery, while the section on linguistic and socio-cultural demands relates more to the level of the language presented and expected, and to items of socio-cultural significance.

The nature of the assessment activity is described thus:

With regard to Activity Type 3(a), the assessment activity should be such that:
- the student searches for specific items of information, from one or more texts, as a response to explicit cues
- the kind of information acquired for use is of an objective nature whereby the student is asked to comprehend literally specific items of information arising from texts read or listened to (e.g., to give precise details, dates, times, prices).

With regard to Activity Type 3(b), the assessment activity should be such that:
- the student listens to or reads information in the target language for a given purpose, paying attention to the overall meaning of the text (spoken or written) or to large sections of the text
- the information required for use in the target language is of the kind where the student is asked for more than simple factual information, e.g. to extract gist, to summarise the information, to make notes, to offer explanations.

These criteria statements indicate the presence in the case of Activity Type 3(a) of clear potential both for computer delivery and computer scoring. In the case of Activity Type 3(b), however, the more open-ended use of the information extracted would mean that computer delivery would be possible but that computer scoring would be unlikely to be achievable without considerable constraints being applied to the student's response.
The "Level of support" statement for Activity Type 3(a) and (b) reads as follows: The assessment activity should be such that the student is given access to relevant reference materials (e.g. bilingual dictionaries, grammar reference books). The text (written or spoken) on which the assessment activity is based: should be such that the context will assist the student in finding and using relevant information should be such that the presentation/organisation of the text does not hinder and preferably assists the search for the information required (e.g. use of headings, sub-headings, diagrams, photographs, format of listening documents, density of text, lay-out, varying type face) should provide an adequate number of informational clues so that specific and general items of information can be extracted. Here the potential for achieving the desired improvement in the presentation/organisation of text through the use of computers and through computers linked with other equipment is evident, as is the potential both for controlled access to reference materials (through HELP) or the monitoring of free access with consequent adjustment to scores. The "Criteria for judging performance" constitute a fourth booklet which includes both general criteria and examples of specific criteria for judging performance at particular levels in particular activity types. Of interest here is the description of the detailed assessment procedure which is proposed along with a global assessment procedure. The "Detailed Assessment" statement includes these lines: Analysis of the various components which are deemed crucial to the success of the [student's] language performance and a diagnosis of the student's response occur here. Assessors need to provide explicit criteria statements on such variables as fluency, accuracy, range of expression, socio-cultural appropriateness, etc., to ensure reliability between markers. The potential advantages of computer scoring in this process are once more evident: reliability is ensured, consistency is assured and the data-base can go beyond any single individual's capacity.
Framework-Oriented Responses to Some Key Issues

Given then that this very limited examination of Framework assessment procedures reveals clear scope for the use of computer-based testing, and thereby not just a theoretical but a potentially practical link between its further development and developments in CBELT, it would seem worthwhile responding to some of the key issues raised in Alderson's chapter from a Framework-oriented perspective.

Alderson states: "It is unlikely to be the case that developments in CBELT will bring about improvements in the content of language tests". Alderson has concerned himself with this matter far longer than I have and I do not intend to gainsay him. However, from the particular perspective to which I have admitted, it already seems to me that there are strong possibilities for a very fruitful influence on the process of developing assessment procedures for the Framework through consideration of how computer-based testing might be used as part of these procedures. This is not so much to suggest that this interaction will produce hitherto undiscovered language test content types. Rather, it may produce new and improved combinations of test content types which, through the kinds of processes which Alderson has presented to us, may well be far more capable of speedy and reliable validation. It is worth noting that within the already existing suggestions for assessment activities in the Framework there is a tendency for activity types to be combined within an overall "authentic" activity, with the focus for assessment being placed upon one particular activity type within this combination. Possibilities of appropriately weighted computer-based scoring in such combinations of activity types are attractive.

This last point leads on to another issue raised by Alderson. He states: "There is also doubt whether there will be significant improvements in test methods as a result of further work on CBELT". From a Framework-oriented perspective, however, there is likely to be a very beneficial influence exerted on test methods through an interaction of existing Framework approaches and reference to possibilities of computer-based testing methods already highlighted through CBELT. I have not mentioned before that another suggestion in the production of assessment activities within the Framework is that assessors should set out "Conditions of Assessment" by answering a pro-forma list of questions on these conditions. Examples of answers to these questions are provided with some of the sample assessment activities. It is interesting to observe how some of the aspects of testing which CBELT brings to the fore as being readily and more easily available and/or handled in computer-based testing (such as time control, possibility of review, possibility of limited repetition, weighting) are already present. The
example below is associated with a sample assessment activity (taking telephone messages) for the Activity Type "Obtaining information by searching for specific details in a spoken or written text":

Conditions of Assessment:

In which language is the response required? the target language
Who is the audience to whom the response is directed? receiver of message
Who is assessing the response? teacher
What kind of support is to be offered to the student by teacher or others? telephone messages repeated a certain number of times
What support materials are to be offered to the student? none
How much time is to be offered to the student for the preparation or execution of the task? suggested time limit
What is the expected length of the written response? jottings
Which written draft is to be considered for assessment? final draft
Which parts of the activity are to be assessed and are these parts to be given equal weighting? jottings submitted
How many active participants are involved in producing the response? the student only
Are there any other special conditions? no.

The issue of the potential which Alderson sees for CBELT in changing the way in which tests are viewed and used within curricula reveals another area of coincidence between CBELT and developments within the Framework. Both projects, by different routes, have arrived at a position where the line between test and exercise becomes blurred. Alderson notes in this regard that the CBELT ability to allow student choice of when to be
tested and to provide instant feedback, branching options, and support through HELP, inevitably leads to such a blurring. The development of "Activity Types" in the Framework and the ALL Project has led to a similar blurring. The "process" activity through which the student learns language is guided by the same concept of activity types as is the "assessment" activity during which the student's performance in using the language is assessed. In this way Alderson's CBELT-oriented comment on the inadequacy of discussions on distinguishing characteristics between test and exercise is borne out by experience in developing the Framework.

An apparent point of tension exists between the Framework's desire to be national and the clear potential which Alderson stresses for computer-based language testing not only to be learner-centred but quite possibly learner-controlled. Other apparent points of tension between the two are the Framework's need to fit the requirements of certification and accreditation authorities, and the Framework's need to cover assessment areas which go well beyond the present capabilities of CBELT. On the first point Alderson himself notes that "It is likely that most computer-based language testing will be implemented within or in relation to particular curricula or institutions, rather than in some coordinated national programme". In the case of the Framework, I believe that it is a "particular curriculum" which hopefully will also be a "coordinated national programme". For this reason the amount of control which can be given to the student will be limited, but this in itself in no way, I believe, implies that computer-based testing can have no place in the Framework's suggested assessment procedures. It is simply a matter of limitation, and recognition of this in fact provides the answer to the two other apparent points of tension: (a) computer-based testing in the Framework would be constrained by requirements of accrediting and certificating authorities but not ruled out, and (b) application of computer-based testing in the Framework will probably not cover the assessment of all activity types.

Do these limitations mean that, despite the points of convergence and potential benefit noted earlier, the offerings of CBELT and Alderson's work are not worthy of pursuit within the context of the National Assessment Framework for Languages at Senior Secondary Level? I believe not. Leaving to one side the vast and very exciting potential of combined computer/human delivery and scoring in language assessment (which is not within the remit of CBELT), the advantages for the Framework and for accrediting authorities of using computer-based testing in those areas where it can be applied are too evident to ignore. There is the human resources advantage of not requiring high-level supervision, the advantage of instant and consistent scoring, and
the advantage of variable weighting (and its instant calculation). There is the advantage of being able to track students' paths through tests, the advantage of parallel validation of tests, the advantage of being able to control or monitor time and/or support, and the advantages of item-banking. There are also the advantages for those preparing assessment activities which will result from being able to include a new and challenging perspective from which to approach them. So, in answer to Alderson's final question as to whether institutions want to and are able to take advantage of the possibilities offered by computer-based language testing, I can only say that this is one individual who will be trying very hard to convince a number of accrediting and certificating institutions in Australia that they do and they can!
5 National Issues in Individual Assessment: The Consideration of Specialization Bias in University Language Screening Tests

GRANT HENNING
Educational Testing Service

The issue of bias or, more properly, differential item functioning for tests in general has been investigated by a number of researchers over the past two decades (Angoff and Ford, 1973; Berk, 1982; Cronbach, 1976; Linn, 1976; Peterson and Novick, 1976; Scheuneman, 1985; etc.). Most of this research has focused on questions of race or gender bias in aptitude testing. Chen and Henning (1985) studied the possibility of differential item functioning in English language proficiency/placement tests that might be attributed to group differences in examinee native language and culture. To date, however, little empirical research has investigated the question of bias or differential item functioning in language proficiency/placement tests that can be attributed to academic specializations of the examinees, or to the subject matter passages contained in the test. One exception is a recent study by Hale (1988) reporting a small but significant interaction effect of reading passage selection on performance with the Test of English as a Foreign Language (TOEFL). Students in humanities/social sciences outperformed students in biological/physical sciences on text related to the former major-field group. The reverse was true for text related to the latter group. Other studies conducted with regard to major-field differences exclusively in the reading skill area include Alderson and Urquhart (1983; 1984; 1985), Brown (1982), Erickson and Molloy (1983), Moy (1975), and Peretz (1986). In
general these studies point to a small but consistent tendency for students in particular academic specializations to perform better with reading passages related to their specializations. A preliminary pilot study of the specialization bias phenomenon was reported for reading, listening, and error detection passages by Henning and Gee (1987) using the same measurement context, and reaching many of the same conclusions, as the present study. The question of specialization bias is particularly important given growing interest in the teaching and testing of English for specific purposes. Its importance is also linked to recent concerns about the validity of screening instruments used for the selection of nonnative English speaking teaching assistants in various university subject areas and academic institutions throughout the United States, and perhaps elsewhere as well. Indeed, one of the major national issues currently facing American university campuses concerns appropriate selection procedures for awarding teaching assistantships to foreign nationals, particularly in science and engineering departments (NAFSA, 1987). At issue is the tension between candidates' knowledge of subject material and their ability to communicate that knowledge in a linguistically and culturally acceptable manner. This issue is further extended by the frequent criticism of currently available standardized language tests that they are insufficiently directed to the particular academic specializations of the targeted examinees (NAFSA, 1987). Recent attempts in Britain to create tests that are sensitive to the specialization background of the test taker (e.g. English Language Testing Service and Royal Society of Arts tests) have helped further to contextualize and clarify the issues related to specialization bias (Porter, Hughes and Weir, 1988). The present study has been directed to the question, "Is there evidence of specialization bias or differential item performance on a language proficiency/placement test that may be attributed to the match or mismatch of test passage content to academic specializations of the examinees?" Of interest in this inquiry is the ascertaining of the presence, magnitude, and direction of differential item functioning, as well as the investigation of appropriate methods for the indication of such differential item functioning. The correspondence of test topic content and the academic backgrounds, goals and interests of individual examinees is here considered an important national issue in individual assessment. Unlike previous studies cited, this study focuses on individual item functioning within specialized passages and across language skills, rather than on mean passage or test differences only with regard to the skill of reading.
Method Subjects During the Fall 1985 academic quarter at the University of California at Los Angeles, 672 entering graduate and undergraduate students were administered the English as a Second Language Placement Exam (ESLPE) to determine eligibility for university admission and/or level of ESL instructional placement. Based on student responses to a demographic questionnaire, students were initially grouped under five specialization categories, as follows: (1) Physical Sciences (engineering, mathematics, computer science, physics, astronomy, and chemistry); (2) Humanities (education, linguistics, literature, philosophy, classics, history, and all language studies); (3) Fine Arts (visual and performing arts, and architecture); (4) Business (accounting, marketing, management, and economics); and (5) Social Sciences (psychology, sociology, and anthropology). Because of the comparatively small size of the group of Social Science students, that category was dropped from the study. In an effort to make group sizes reasonably comparable, the final sample was again constrained by use of random selection from among the pools of subjects from the larger specialization categories. The final sample consisted, then, of 59 randomly selected Physical Science students, 59 randomly selected Business students, all 41 Fine Arts students, and all 40 Humanities students, comprising a total sample of 199 students for whom English was not a native language. Instrumentation The English as a Second Language Placement Exam (ESLPE) is a battery of five 30-item multiple-choice subtests and one 20-minute written composition requiring two and one half hours to administer. For purposes of the present study, the composition was not analyzed as a potential source of differential item functioning, since it was not an item-based test. The comparatively high internal consistency reliability (0.96 KR-20) and concurrent validity (0.79 and 0.84 correlations with the TOEFL) of the objective portion of the test have been detailed elsewhere (Henning, Hudson, and Turner, 1985). Of particular importance to this study is the consideration that passage content selection for the test proceeded rationally according to a plan to permit approximate proportional matching of topic representation to student specialization. Thus, for the listening comprehension (30 items), reading
comprehension (30 items), and error detection (30 items) subtests of the exam, use was made of passages with topics chosen to be relevant to the respective specialization categories described above. The remainder of the items of the test, that is, grammar (30 items) and vocabulary (30 items), were also analyzed for specialization differential item functioning even though these items were not passage dependent and there was no a priori basis for labeling them under particular specialization rubrics. Analyses Variations of three general methods were used to reveal differential item functioning: (1) the Angoff item difficulty by group plotting method (Angoff and Ford, 1973), with Rasch difficulty calibrations (Rasch, 1980; Wright and Stone, 1979) substituted for the Delta statistics originally employed by Angoff; (2) regression residual analyses whereby means and standard deviations of errors of prediction were compared for individual specialization groups; and (3) the Mantel-Haenszel chi-square technique (Holland and Thayer, 1986; Mantel and Haenszel, 1959) to compare reference and focus groups in all combinations at five matched levels along the proficiency total score continuum. Results Prior to analysis of differential item functioning, it was considered helpful to examine comparability of the four specialization person groups. To do this, person ability estimates were calculated for each person on a Rasch Model logit ability scale using the BICAL program to analyze performance on the ESLPE. Means and standard deviations of ability estimates for four groups individually and combined are reported in Table 5.1.
TABLE 5.1 Person ability estimates means and standard deviations for four specialization groups and group total

            Science    Humanities   Fine Arts   Business   Total
            (N = 56)   (N = 38)     (N = 36)    (N = 57)   (N = 187)
mean        1.37       1.65         1.11        1.87       1.58
st. dev.    0.81       0.70         0.72        1.13       0.90
Note from Table 5.1 that 12 persons were dropped from the original total sample of 199 for this analysis due to non-conformity of person characteristic curves to the expectations of the model. Presumably these persons were guessing or exhibiting other uncharacteristic behavior that caused their response data to be viewed as invalid and a potential source of contamination. This rate of attrition (six per cent) appeared unimportant inasmuch as it was both small and fairly evenly distributed across specialization groups and since it was comparable to the attrition rate commonly observed in other Rasch model applications. Note also that, while group ability means are comparable (that is, each within one logit and nearly within one standard deviation of the others), there is clear evidence that Fine Arts students exhibited lower overall English language proficiency than Business students. This finding is consistent with differentiation in apparent specialization needs for accuracy of verbal communication and with methods of recruitment at the University. This finding of differential group abilities is, as such, no indication of testing bias or differential item functioning. Item difficulty by group plotting The Angoff item difficulty by group plotting method (Angoff and Ford, 1973) involves evaluating the independence of item difficulties from group characteristics by comparing pairs of item difficulty estimates based on separate calibrations from different groups of subjects. In this study Rasch (Rasch, 1980) difficulty estimates were used instead of the Delta statistics originally employed by Angoff and Ford (1973). Figure 5.1 reports scatterplots for all 150 items in the ESLPE of Rasch item difficulty estimations based on the results of individual specialization groups versus estimations based on the results of the total of all four groups. Figures 5.1a through 5.1d present these scatterplots for the Science, Humanities, Fine Arts, and Business groups respectively. Note that correlations of paired difficulty estimates ranged from 0.90 for Humanities to 0.96 for Science and Business. Confidence intervals of 95% have been constructed around each of the regression lines in a manner analogous to the method of Angoff and Ford (1973). Items whose plots are positioned outside and above the confidence intervals may be said to be significantly easier (p < 0.05) for the particular specialization group than they are for the total sample including all groups. Items whose plots are located outside and below the confidence intervals may be said to be significantly more difficult (p < 0.05) for the particular specialization group than they are for the total sample including all groups.
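A rough sketch of this plotting criterion is given below; it assumes that separate Rasch item difficulty estimates are already available for one specialization group and for the total sample. The function name, the simulated difficulties, and the use of a band 1.96 residual standard deviations wide, drawn parallel to the regression line, are illustrative assumptions only and do not reproduce the original BICAL-based analysis.

```python
# Illustrative sketch only: flag items whose group-based Rasch difficulty
# falls outside an approximate 95% band drawn parallel to the regression of
# group difficulties on total-sample difficulties.
import numpy as np

def flag_differential_items(d_group, d_total, z=1.96):
    """d_group, d_total: Rasch item difficulty estimates (logits) from a
    separate calibration on one specialization group and on all groups."""
    d_group = np.asarray(d_group, dtype=float)
    d_total = np.asarray(d_total, dtype=float)
    slope, intercept = np.polyfit(d_total, d_group, 1)   # least-squares line
    residuals = d_group - (slope * d_total + intercept)
    band = z * residuals.std(ddof=2)                      # parallel 95% band
    easier = np.where(residuals < -band)[0]   # easier for the group than expected
    harder = np.where(residuals > band)[0]    # harder for the group than expected
    return easier, harder, residuals

# Simulated example with 150 items (values are invented, not ESLPE data).
rng = np.random.default_rng(1)
d_total = rng.normal(0.0, 1.0, 150)
d_group = d_total + rng.normal(0.0, 0.3, 150)
easier, harder, _ = flag_differential_items(d_group, d_total)
print(f"{easier.size} items easier, {harder.size} items harder for this group")
```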
FIGURE 5.1 Scatterplots of Rasch Model difficulty estimates of 150 items of the ESLPE from four specialization groups (horizontal: a: Science Group, N = 59; b: Humanities Group, N = 40; c: Fine Arts Group, N = 41; d: Business Group, N = 59) versus estimates from total of all groups (vertical: N = 199). Items are represented by dots. Larger dots represent multiple entries.
Thus for no single group-total comparison could it be said that more than 10 of 150 items (or 6.6% in the case of Humanities) exhibited differential item functioning or bias whether for or against the group in question. Least bias or differential item functioning occurred for the Fine Arts group with only 5 of 150 items (or 3.3%) identified. As the error of item calibrations in fact increases towards the extremes of the ability continuum, drawing the confidence intervals strictly parallel to the regression line may even be considered too conservative.
TABLE 5.2 Items exhibiting differential functioning or specialization bias, for or against, four specialization groups
Bias direction: Difficulty Advantage | Difficulty Disadvantage
Item type: VC GR ED RD LS | VC GR ED RD LS
Science Group
11 06 19
Humanities Group
03 12 15 06 22 23 25 09 13
Fine Arts Group Business Group
10
03 20 21 08 22
23 11
14
08 19 11 14 07 06 25 08 16 17
VC = vocabulary; GR = grammar; ED = error detection; RD = reading comprehension; LS = listening comprehension
Table 5.2 lists all items (subtest identifiers and item numbers) found to exhibit this kind of specialization bias or differential item functioning by the method indicated. It is interesting to note that those sections of the test most subject to bias phenomena were the Grammar and Vocabulary subtests, for which passage content selection was not a consideration. The Listening, Reading, and Error Detection subtests, for which passage content selection was viewed as a possible source of bias, exhibited disproportionately low differential item functioning and almost no systematic bias by this criterion.
The only discernible pattern from among the non-passage dependent items concerned item Grammar 08 that was biased against all groups but the Science group, and this Grammar subtest item required correct use of causal conjunctions. From among the Listening, Reading, and Error Detection subtests and those passage-bound items that could be identified in passage content with one of the four specialization categories considered here, no items could be said to be biased in favor of any specialization. Only three of these items were found to be biased against particular specialization groups: item Error Detection #06 (Humanities) was biased against the Business group; item Error Detection #23 (Humanities) was biased against the Science group; item Listening #11 (Science) was biased against the Humanities group. Again, use here of the term bias is with the conventional meaning that there was a systematic performance advantage for or against a given group. It is worth mentioning that, for this particular analysis, numbers of items identified as being biased may have been conservative in that the total group criterion also included the focus group. However, since there were as many as four groups of virtually equal size, since it was possible in this way to maintain the same criterion for all groups, and since the same procedure was repeated to provide comparative differentials for each group, it was thought that this approach would be appropriate and would not overly obscure any systematic bias. Regression residual analyses The next method employed in analysis involved the use of mean regression residuals from the prediction of Rasch item difficulties for individual groups regressed on Rasch item difficulties for the total of all groups (N = 199). Residuals were used rather than squared residuals since only the subset of specialized items was tallied in each comparison, and since it was known that such a partial tally was unlikely to sum to zero. Table 5.3 reports the comparison of residual means for specialized items with residual means for non-specialized items, within each specialization group. Table 5.3 indicates that there were no significant t-values in this comparison, supporting the conclusion that there was no overall systematic bias resulting from the selection of passage content in this particular testing context as judged by this particular methodology. This analysis, unlike the preceding analysis, has focused on the mean differences for specialization-related and non-related items rather than on the comparative performances of individual items.
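A compact sketch of this residual comparison, under the assumption that group and total-sample Rasch difficulties and a mask marking the specialization-related items are available, might look as follows. The ordinary two-sample (Welch) t-test is used here only as a stand-in; the exact t-statistic computed in the original study is not specified.

```python
# Illustrative sketch: mean regression residuals for specialized versus
# non-specialized items, with a two-sample t-test on the difference.
import numpy as np
from scipy import stats

def residual_comparison(d_group, d_total, specialized_mask):
    d_group = np.asarray(d_group, dtype=float)
    d_total = np.asarray(d_total, dtype=float)
    mask = np.asarray(specialized_mask, dtype=bool)
    slope, intercept = np.polyfit(d_total, d_group, 1)
    residuals = d_group - (slope * d_total + intercept)
    spec, nonspec = residuals[mask], residuals[~mask]
    t, p = stats.ttest_ind(spec, nonspec, equal_var=False)  # Welch t-test
    return spec.mean(), spec.std(ddof=1), nonspec.mean(), nonspec.std(ddof=1), t, p
```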
TABLE 5.3 Comparison of Specialized and Non-Specialized Items: Residuals of Rasch item difficulty estimations in specialized groups versus in total of groups (Means and Standard Deviations) and t-Values

                        Residuals for Specialized Items      Residuals for Non-Specialized Items
Group        N          n      Mean      St. Dev.            n      Mean      St. Dev.      t-Value
Science      59         21     0.095     0.305               129    0.021     0.394         0.823
Humanities   40         41     0.024     0.414               109    -0.009    0.643         0.306
Fine Arts    41         6      -0.117    0.324               144    0.005     0.479         0.616
Business     59         22     0.208     0.232               128    -0.037    0.425         0.518

Mantel-Haenszel chi-square analysis The third method of analysis involved the use of Mantel-Haenszel chi-square analysis (Holland and Thayer, 1986; Mantel and Haenszel, 1959). This procedure has been amply described in the references provided. Suffice to note here that, since the procedure consists of successive chi-square computations at fixed ability intervals, it results in the promulgation of two separate statistics for each item, (1) a chi-square value indicating independence or dependence of item response patterns for a focus and a reference group, and (2) a common odds ratio indicating magnitude of the relationships examined. In the overall analysis of differential item functioning by this procedure, there appears to be an overall trend for specialization mean chi-square values not to be significant, as was apparent also from both methods of analysis described above. Individual item results have not been computed, but the overall finding suggests that, while, as with the first analysis described above, a few individual items may show differential item performance in one direction or another, there appears to be no overall systematic bias for or against groups based on passage content selection. Differential person functioning Following this unanticipated finding of minimal or no differential item performance by the three methods described, it subsequently became of interest to examine whether persons, if not items, could be found to exhibit a kind of differential person functioning for prescribed item groups.
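The sketch below shows one way the Mantel-Haenszel statistics for a single item might be computed across matched score levels. The stratification into five roughly equal total-score levels follows the description given earlier in this chapter, but the function name, the continuity correction, and the handling of sparse strata are assumptions made for illustration and are not drawn from the original computations.

```python
# Illustrative sketch: Mantel-Haenszel chi-square (with continuity correction)
# and common odds ratio for one item, computed across matched ability strata.
import numpy as np

def mantel_haenszel(correct, is_reference, total_score, n_strata=5):
    """correct: 0/1 responses to the item; is_reference: True for the
    reference group, False for the focal group; total_score: matching
    criterion, split here into n_strata roughly equal score levels."""
    correct = np.asarray(correct)
    ref = np.asarray(is_reference, dtype=bool)
    strata = np.array_split(np.argsort(total_score), n_strata)
    num = den = a_sum = e_sum = v_sum = 0.0
    for idx in strata:
        c, r = correct[idx], ref[idx]
        a = np.sum(r & (c == 1))       # reference group, correct
        b = np.sum(r & (c == 0))       # reference group, incorrect
        cc = np.sum(~r & (c == 1))     # focal group, correct
        d = np.sum(~r & (c == 0))      # focal group, incorrect
        n = a + b + cc + d
        if n < 2:
            continue
        num += a * d / n
        den += b * cc / n
        a_sum += a
        e_sum += (a + b) * (a + cc) / n
        v_sum += (a + b) * (cc + d) * (a + cc) * (b + d) / (n * n * (n - 1))
    odds_ratio = num / den if den > 0 else float("inf")
    chi_square = (abs(a_sum - e_sum) - 0.5) ** 2 / v_sum if v_sum > 0 else 0.0
    return chi_square, odds_ratio
```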
This focus in analysis is unusual since the Mantel-Haenszel analysis is commonly used primarily to find whether items function differently for different person groups (e.g., gender, race, language, or specialization differences). It is not usually possible, as it was in this case, to classify the items in advance as more appropriate for one group or another. Here it was proposed to investigate whether persons function differently for different item groups (that is, items differing according to major-field classification). Table 5.4 reports the overall means of the chi-square and common odds ratio statistics for persons by specialization group.
TABLE 5.4 Mean Mantel-Haenszel chi-square and common odds ratio statistics for persons by specialization group (N total = 199)

Group         N     Chi-square    Odds Ratio
Science       59    0.00001       0.0003
Humanities    40    1.937         -0.147
Fine Arts     41    2.399         -0.164
Business      59    7.652*        0.293
*p < 0.05

Notice that only for students in the Business specialization did a significant low magnitude relationship exist, supporting a view that Business students may have some overall advantage on the test that could be labeled differential person functioning. Note also that these findings seem to parallel the data reported in Table 5.1 suggesting also that Business students had the greatest overall proficiency. Discussion This study has investigated the possibility of specialization bias or differential item functioning in language proficiency/placement tests. It has been of importance in this investigation to ascertain the presence, extent, and direction of any systematic bias associated with the selection and application of passage content in the assessment of language proficiency of university students in particular specialization categories. Three methods of
analysis were employed in the study; that is, the Angoff Group Item Difficulty Scatterplot Method, the Regression Residual Method, and the Mantel-Haenszel Method. While individual items were identified that showed differential item functioning, there was no indication by any of these methods that systematic specialization bias existed that could be associated with passage content selection for specified subtests of the proficiency test battery. This finding could support at least two possible conclusions. First, it could be argued that since anticipated specialization bias did not appear to occur in this context, there may be no fairness considerations in similar testing contexts that would support the need for English for Specific Purposes (ESP) testing. This would not rule out the development of ESP tests for reasons of heightened motivation or perceived face validity on the part of the examinees. Alternatively, it could be suggested that, if we view "specialization" as existing along a continuum both for persons and for items and their associated passages, the persons and items considered in the present study were generally not sufficiently far along the continuum to reach a bias threshold. In other words, had the passages been more arcane and specialized, and had the examinees been more advanced in their academic preparation, then differential item functioning would have been more apparent. The problem with this likely view is that no metric of degree of specialization is currently available, nor is it clear that such a metric is needed if passages adapted from first-year university academic texts, as in the present study, do not appear to pose a bias problem for typical entering university students. It seems necessary to offer suggestions for why the results of the present study do not agree in every case with those studies cited earlier showing differential performance on specialized reading passages by members of different specializations. A variety of potential explanations could be given, including the following:
1. The passages selected in some of the earlier studies were purposely chosen to be relevant and comprehensible only to members of the specializations represented. By contrast, the passages of the ESLPE, while topically identifiable with particular specializations, have been adapted in test development so that inaccessible vocabulary and usage would be minimized.
2. The comparisons made for the first two analyses of this study were not between opposing groups such as Science and Humanities, but rather were between individual groups and the total of all groups. Some of the studies reported earlier were concerned with maximizing
differences between contrasting groups. Thus the focus here was not on how to find differences, but rather on whether unfair differences associated with passage selection would be found to persist in pragmatic test construction as it was done with the ESLPE.
3. The students entering UCLA in recent years are usually above the TOEFL 500 mark in language proficiency to begin with. This suggests that, unlike what may have been the case with previous studies cited, knowledge of English may have been so pervasive for these students that major-field differential passage content posed little barrier to comprehension.
Of final interest in the study is the recognition that in the same way that items may be said to exhibit differential item functioning for person groups, so it now appears that persons can be said to exhibit differential person functioning for item groups. Thus, while a test may not manifest content bias for or against a particular specialization group, it may be possible for members of that group to exhibit specialization bias for or against particular sections of the test. More specific information about the importance of this procedure should be available on completion of some of the analyses begun for the present study.
References
Alderson, J.C. and A.H. Urquhart (1983) The effect of student background discipline on comprehension: a pilot study. In: A. Hughes and D. Porter (eds), Current Developments in Language Testing. London: Academic Press.
Alderson, J.C. and A.H. Urquhart (1984) The problem of student background discipline. In: T. Culhane, C. Klein-Braley and D.K. Stevenson (eds), Practice and Problems in Language Testing. Occasional Papers 29. Colchester: University of Essex.
Alderson, J.C. and A.H. Urquhart (1985) The effect of students' academic discipline on their performance on ESP reading tests, Language Testing, 2, 192-204.
Angoff, W.H. (1982) The use of difficulty and discrimination indices in the identification of biased items. In: R.A. Berk (ed), Handbook of Methods for Detecting Test Bias. Baltimore: Johns Hopkins University Press.
Angoff, W.H. and S.F. Ford (1973) Item-race interaction on a test of scholastic aptitude, Journal of Educational Measurement, 10, 95-106.
Berk, R.A. (ed) (1982) Handbook of Methods for Detecting Test Bias. Baltimore: Johns Hopkins University Press.
Brown, J.D. (1982) Testing EFL reading comprehension in engineering English. Ph.D. dissertation, University of California at Los Angeles. Dissertation Abstracts International, 43, 1129A-1130A.
Chen, Z. and G. Henning (1985) Linguistic and cultural bias in language proficiency tests, Language Testing, 2, 155-163.
Cronbach, L.J. (1976) Equity in selection: where psychometrics and political philosophy meet, Journal of Educational Measurement, 13, 31-41.
Erickson, M. and J. Molloy (1983) ESP test development for engineering students. In: J.W. Oller, Jr. (ed), Issues in Language Testing Research. Rowley, MA: Newbury House.
Hale, G.H. (1988) The interaction of student major-field group and text content in TOEFL reading comprehension. Research Report No. 88-1. Princeton, NJ: Educational Testing Service.
Henning, G. and Y. Gee (1987) Specialization bias. A paper presented at the Ninth Annual Language Testing Research Colloquium. Hollywood, Florida, April 26-28.
Henning, G., Hudson, T. and J. Turner (1985) Item response theory and the assumption of unidimensionality for language tests, Language Testing, 2, 141-154.
Holland, P.W. and D.T. Thayer (1986) Differential item functioning and the Mantel-Haenszel Procedure. Technical Report No. 86-69. Research Report No. 86-31. Princeton, NJ: Educational Testing Service.
Linn, R.L. (1976) In search of fair selection procedures, Journal of Educational Measurement, 13, 53-58.
Mantel, N. and W. Haenszel (1959) Statistical aspects of the analysis of data from retrospective studies of disease, Journal of the National Cancer Institute, 22, 719-748.
Moy, R.H. (1975) The effect of vocabulary clues, content familiarity and English proficiency on cloze scores. Master's thesis, University of California at Los Angeles.
NAFSA Conference (1987) Proceedings of the Workshop on Screening Non-native Teaching Assistants. Long Beach: National Association of Foreign Student Advisors.
Peretz, A.S. (1986) Do content area passages affect student performance on reading comprehension tests? Paper presented at the twentieth meeting of the International Association of Teachers of English as a Foreign Language, Brighton, England.
Peterson, N.S. and M.R. Novick (1976) An evaluation of some models for culture-fair selection, Journal of Educational Measurement, 13, 3-29.
Porter, D., Hughes, A. and C. A. Weir (eds) (1988) Proceedings of a conference held to consider the ELTS validation project. ELTS Research Report, Vol. I(ii). Cambridge: University of Cambridge Local Examinations Syndicate.
Rasch, G. (1980) Probabilistic Models for Some Intelligence and Attainment Tests. 2nd edn. Chicago: University of Chicago Press.
Scheuneman, J.D. (1985) Explorations of causes of bias in test items. ETS Research Report 85-42. Princeton, NJ: Educational Testing Service.
Wright, B.D. and M.H. Stone (1979) Best Test Design: Rasch Measurement. Chicago: MESA Press.
6 Response to Henning: Limits of Language Testing GRAEME D. KENNEDY Victoria University of Wellington Grant Henning's stimulating chapter highlights an important issue for applied linguistics. Even when sophisticated statistical analyses confirm the essential reliability and validity of a testing instrument, it is clear that consumers of the results are not always going to be convinced that the results are valid. However, the issue of language testing for placement or screening purposes is unlikely to disappear, in large part because the great movement of individuals and groups across national and linguistic boundaries continues to grow as one of the most important phenomena of this century. While the major focus of attention is often on the language proficiency necessary for advanced education, there has also been controversy over accreditation to practice medicine, control air traffic, broadcast, and teach, to name only some of the most salient of many fields. It is the issue of English language proficiency for teaching purposes, however, which provides the focus for Henning's chapter and which, perhaps surprisingly for non-American colleagues, he identifies as "one of the major national issues currently facing American university campuses". How to determine whether nonnative speakers of English who know their subject are proficient enough in English to be allowed to teach at tertiary level "in a linguistically and culturally acceptable manner" is indeed a difficult question for it involves fairness to two parties, namely the individual potential teaching assistants and those whom they teach. The study showed that on a well-established screening test battery used at a major United States university there is no evidence that there is systematic bias for or against persons who work in particular academic
fields, resulting from the subject matter of the passages used in the test. This conclusion is reached on the basis of three independent analytic techniques. One of the strengths of the study comes from using these three means of analysis, to show so overwhelmingly the absence of bias, that it may not be productive to pursue the issue of specialization bias any further along the lines described. Indeed, although the issue is unlikely to have been laid to rest, one of Henning's conclusions is that there may be no case for English for Specific Purposes testing in these circumstances except on motivational or face validity grounds. Henning does tentatively suggest that a possible reason for the inability to detect specialization bias may be because the content of the test and the background of the examinees were not specialized enough. Now this may be the case and is worth exploring further. To begin with, the four so-called "specialization areas" are in some cases very broad indeed. While Fine Arts and Business may each be relatively homogeneous, the same cannot be said for Physical Science and Humanities. Compare, for example, the content of the following sentences taken quite randomly, as were the texts in the ESLPE, from first-year undergraduate texts. Examples (a) and (b) are from different fields grouped as Physical Science while (c) and (d) come from the Humanities grouping.
(a) Mathematical statistics: If cells in a contingency table are combined in order to satisfy conditions (a) and (b) in example 9.4.1, then one degree of freedom must be subtracted each time a cell is combined with one or more others.
(b) Chemistry: The necessary condition for dissolving a compound is that its ionic product be reduced below its solubility product. This can be achieved by addition of a reagent which reacts with one of the ions in the system.
(c) Literary criticism: Semiotic structuralism and deconstructionism generally take no cognizance at all of the various ways that texts can relate to their oral substratum. In a sense, the three sinister old women are conventional figures; they derive pretty clearly from Macbeth. But the derivation is quite transcended by Scott's use of the vernacular.
(d) Education: The experimenter found positive transfer that depended on the extent to which the first task was learned and the degree of
similarity of the two tasks. The better the learning of the first task, the greater was the amount of transfer. The more closely similar the responses were in the two tasks, the greater also was the transfer.
It is not at all self-evident that a reader familiar with the content of (a) is going to feel at home with (b). Nor will persons familiar with (c) necessarily claim a community of interest with (d). It does not require very much experience of disciplines such as literature, education, linguistics, and philosophy to realize that while for university administrative purposes they may sometimes be grouped together as Humanities, their lexical and semantic structures and methods of argumentation frequently have very distinctive characteristics. Thus, although as Henning points out it may be difficult to measure degrees of specialization, it would seem to be unsatisfactory to assume that passages of text general enough to be accessible in a test for persons from widely different academic fields will really reflect the often arcane and inaccessible lexical and pragmatic structures of particular disciplines. Rather than pursue the issue of matching language test content to the person being tested, a more valuable line of inquiry might be to look more closely at the issue of face validity and to ask whether the kind of language test discussed by Henning really tests relevant language skills and, further, whether a test of English language is the right kind of test for the selection of teaching assistants. Listening comprehension, reading comprehension, and error analysis, whether using specialist content or not, will not necessarily be seen to be the language skills which cause the greatest offence or unease among the students with whom the teaching assistants come into contact. The production skills of speaking grammatically well-formed English with acceptable pronunciation, writing on the blackboard or on students' work in well-formed English, following normal discourse conventions in interaction with students - these would seem to be perceived to be relevant. While it is understandable that in this study the composition performance was not analysed because it was not item based, it may nevertheless have provided a better indication than the receptive skills of one aspect of production proficiency. The question of whether an even more valid test of language proficiency is, however, the most appropriate way of selecting teaching assistants should also be considered. To gain entry to a graduate school in the United States, potential foreign teaching assistants will already have had to get high scores on the TOEFL examination and a test of academic potential such as the Graduate Record Examination. If a foreign graduate student is well-qualified enough on academic and language proficiency grounds to gain entry to the
university, then it may be appropriate to consider what further attributes will count in making him or her an acceptable and successful teaching assistant. Davies (1975:21) warned on the basis of his study of some 2,000 foreign students in Scotland that "it would be a mistake to exaggerate the place of language among foreign students' problems". His observation of "the exaggerated role given to language as the 'explanation' for all kinds of social and psychological problems in educational discussions" is still relevant. A recent paper by Light et al. (1987) similarly found that TOEFL scores were not effective predictors of academic success among graduate students. But, if not language, then what other source or sources of the problem in selecting acceptable teaching assistants may be identified? Research by Brodkey (1972) showed that familiarity with the speaker rather than familiarity with a particular variety of foreign-accented English was the critical factor in mutual intelligibility between native and nonnative speakers. It is normally harder to understand people we are not familiar with, partly because spoken communication depends on interaction and on interpersonal skills rather than simply language proficiency in any narrow sense. Thus the assessment of personality characteristics such as openness, attentiveness, warmth, confidence, adaptability, willingness to admit error, and so on, may be relevant. It is worth noting that in media personalities, including film stars, quite low language proficiency can be compensated for by other personal characteristics. Maurice Chevalier, while not a native speaker of English, did not have difficulty being acceptable to English-speaking audiences in spite of his "accent". Given academic suitability and optimal personal qualities, then it should be possible for applied linguists to contribute to training in communication skills acceptable in American universities. For example, by means of videotapes and direct instruction, potential teaching assistants could be trained in acceptable modes of interaction with students: how not to be imperious or arrogant, how to respond to criticism, how to respond to questions. Just as medical personnel may sometimes have to accept that patients in another country will assume they have a right to question the doctor and must be listened to and answered, so teaching assistants in the United States should expect different interactional norms from ones they may have been used to in their own countries. For example, the teacher is not always agreed with nor may be accorded the status he or she might be used to. There is a further possible solution to the issue of foreign teaching assistantships, and it is quite removed from applied linguistics. Teaching assistantships serve at least two important purposes: to provide teachers for the university, and to provide financial support for the teaching assistants to
enable them to pursue graduate studies in expensive institutions where normal fees would be prohibitive. For graduate students from foreign countries, where the primary aim of the teaching assistantship might be to provide student support, then the most appropriate course of action might be to provide more scholarships or grants, rather than cause embarrassment or even antipathy by employing these graduate students as low-paid teachers. Thus, working for social change or change in institutional practices may be a more effective solution than one which depends on improved language testing or some other contribution from applied linguistics. Ultimately the issue is a social and political one in the sense that English no longer belongs to the traditional English-speaking countries alone. With the spread of English as an international language, the issue of whose language it is, and whose responsibility it is when intelligibility is difficult or breaks down, has thus become increasingly important with implications which go far beyond the particular issue which has been investigated here.
References
Brodkey, D. (1972) Dictation as a measure of mutual intelligibility, Language Learning, 22, 203-220.
Davies, A. (1975) Do foreign students have problems? ELT Documents, 3, 18-21.
Light, R. L., Xu, M., and J. Mossop (1987) English proficiency and academic performance of international students, TESOL Quarterly, 21, 251-261.
7 Psychometric Aspects of Individual Assessment GEOFFREY N. MASTERS University of Melbourne The application of psychometric methods to language testing has had a checkered history. In the 1930s, psychometricians armed with their new tool, factor analysis, set about investigating the dimensions underlying performances on mental tests. Thurstone (1938) and others used factor analysis in an attempt to discover and isolate sets of elementary "abilities" that make up complex intellectual functioning. It was hoped that the isolation of these component abilities would enable a more detailed assessment and description of an individual's test performance. When applied to language testing, the factor analytic problem became one of identifying the dimensions underlying language proficiency. Thus Davis (1944) reported the isolation of nine distinguishable dimensions of reading ability including such skills as identifying the main idea in a passage, and determining a writer's purpose. Thurstone (1946) in a reanalysis of Davis's reading tests argued for the interpretation of reading comprehension as a single skill. Later, Davis (1972) reported the identification of eight reading skills. Thorndike (1973) in an independent analysis argued for three. Other researchers like Harris (1948), Derrick (1953), and Spearitt (1972) performed other analyses of reading tests and arrived at other numbers of factors. In short, this early encounter between modern psychometrics and language testing was disappointingly inconclusive. Despite the many studies carried out to identify component dimensions of language proficiency, we are in many ways no more enlightened about the "true" number of dimensions underlying a proficiency like reading comprehension than we were when the question was first raised. In practice, most modern reading tests assume that reading comprehension can usefully be conceptualized and treated as unidimensional.
A second development in psychometrics which had an influence on language testing had its origins in behavioral psychology and took the form of the behavioral objectives movement. The goal of this movement was to improve test reliability and validity by specifying the detailed behaviors (skills and knowledge) to be assessed by individual test items. The intention was that each objective or skill should be "stated in terms which are operational, involving reliable observation and allowing no leeway in interpretation" (Bloom et al., 1971: 34). In this way, test items could be written to assess specific behavioral objectives and could be scored unambiguously as either right or wrong. When applied to language testing, the behavioral objectives task became one of breaking down language into discrete "elements" each of which could be reliably assessed using a discrete point test item (Lado, 1961). An individual's measure of proficiency then became a count of the number of items answered correctly. Spolsky (1978) has referred to the influence of the behavioral objectives movement on language tests as the "psychometric-structuralist" phase of language testing. In recent years, the appropriateness of attempting to represent language in terms of a set of component "elements", each of which can be assessed with a discrete point test item, has been questioned and discussed at length. Morrow (1981), for example, argues that this "atomistic" approach to test design depends on the assumption that language proficiency is equivalent to knowledge of a set of language elements, and that "knowledge of the elements of a language in fact counts for nothing unless the user is able to combine them in new and appropriate ways to meet the linguistic demands of the situation in which he wishes to use language" (1981: 11). A third development in psychometrics which was quickly recognized as having implications for individualized language assessment was the introduction of the concept of "criterion-referencing" by Glaser in 1963. In a system of criterion-referenced assessment, an individual's test performance is interpreted not in terms of the performances of other learners (e.g., a reading age), but in terms of the knowledge and skills demonstrated by that individual. Central to the concept of criterion-referencing as it was introduced by Glaser is the notion of a continuum of developing proficiency. Glaser defined criterion-referencing as the process of interpreting an individual's test score by referring it to "the behaviors which define each point along a continuum ranging from no proficiency at all to perfect performance" (1963: 519). In applications of criterion-referencing to language testing, however, the central notion of a continuum of developing proficiency frequently has
been lost. Instead, under the influence of the behavioral objectives movement, "criterion-referenced" language tests often divide an area of proficiency into large numbers of component subskills. These are assessed separately, and an individual's test score is replaced by a checklist that indicates whether or not each of these subskills has been "mastered". The "criterion-referenced" reading comprehension tests developed by the Instructional Objectives Exchange (IOX, 1972) are an example of this approach; they provide a list of 91 separately assessed reading comprehension subskills. Another development in psychometrics with the potential to significantly improve individualized language assessment has been the introduction of item response theory (IRT) models for test analysis. Two features of these psychometric models make them particularly attractive in the context of language assessment. The first is the opportunity these models provide to mark out and define a continuum of increasing proficiency in an area of language development. Once such a continuum has been constructed, individual language assessments can be interpreted in terms of levels of proficiency along this continuum (i.e., language assessments can be criterion-referenced). A second significant feature of IRT models is that they provide a way of comparing directly performances on different language tasks. This possibility was first demonstrated by Rasch (1960) who sought a method by which he could place students taking different reading tests on a single developmental scale. The psychometric model that Rasch developed enabled him to monitor students' developing reading ability over time using their performances on a series of increasingly difficult (i.e., graded) reading tests. This model is now used to provide criterion-referenced proficiency measures on a wide variety of language tests, including the Degrees of Reading Power, the Woodcock Reading Mastery Tests, the English Language Skills Profile (Hutchinson and Pollitt, 1987), and the TORCH Tests of Reading Comprehension (Mossenson, Hill and Masters, 1987). The development of item response models has opened up exciting new approaches to individualized language assessment, some of which are discussed by Henning (1987). To date, however, most applications of these models have been limited to dichotomously scored (i.e., right/wrong) test questions. While questions of this kind may be appropriate for the assessment of reading and listening comprehension and some more mechanical aspects of language, they are less likely to be appropriate for tests of language production. For the assessment of spoken and written language, it is usual to use expert judgements and ratings rather than discrete point tests.
This chapter discusses the application of modern psychometric methods to the individualized assessment of spoken and written language production. It shows how an item response model can be used to provide criterionreferenced interpretations of individuals' proficiencies using language profiles based on professional judgements and ratings. The opportunity to make "specifically objective" comparisons of individual assessments using these methods is discussed. The issue of dimensionality is addressed from the perspective of item response theory, and the use of statistical tests of model-data fit to identify individuals with particularly idiosyncratic language profiles is described. An Integrative Assessment Framework An ongoing issue in language testing has been the tension between holistic and atomistic approaches to assessment. On one hand, a common approach to assessing and recording students' performances on written essays and in oral interviews has been to use impression marking to produce a single global score for each student (e.g., on a scale of 1 to 10). This approach is efficient, is considered to provide acceptable levels of intermarker reliability, and is consistent with a view of language as an integrative, communicative act. A disadvantage of this approach is that it does not provide information of a diagnostic kind and enables only very general descriptions of individuals' levels of language functioning. On the other hand, more atomistic approaches to language assessment (e.g., IOX's 91 reading comprehension objectives) appear to offer a great deal of diagnostic information about individuals, but may be of questionable validity if they assess language skills outside the context of meaningful communications. Much recent work in language testing has attempted to construct a compromise between these two extremes by developing communicative tasks which can be used to assess a number of aspects of language functioning in situ rather than as isolated skills. Examples of "integrative" tasks of this kind include the speaking and writing tests of the English Language Skills (TELS) Profile (Hutchinson and Pollitt, 1987). The TELS oral communication tasks, for example, identify five aspects ("components") of an individual's oral performance, each of which is assessed in the context of an interview. These five components, Appropriacy, Coherent Fluency, Superficial Fluency, Interactive Skills, and Amount of Support, provide a profile of each individual's interview performance and so are more descriptive of that person's oral communication proficiency. In the case of the TELS oral communication test, each interviewee receives a grade on each of these five
components of performance. Four possible grades (labelled 0 to 3) are defined for each component. Descriptions of the four grades defined for Amount of Support are reproduced in Table 7.1.
TABLE 7.1 Ordered categories for TELS Amount of Support item (Hutchinson and Pollitt, 1987)
Grade  Description
3  The pupil will usually be able to operate independently; the interlocutor will not feel the need to provide conscious support, apart from occasional lapses.
2  While the pupil will show some evidence of being able to operate independently, the interlocutor will still need to give support in at least one of the following ways: rephrasing or repeating questions or remarks, directing the course of the discussion, calling attention to or supplying details from the source materials.
1  The pupil will need support in all of the ways described above. The interlocutor must include frequent rephrasing and repetition, supply information, and regularly prompt the pupil to take his turn.
0  Despite the efforts of the interlocutor, the pupil's contribution to the task will be minimal.
Integrative tests which rely on professional judgements of a number of aspects of a person's language production appear to meet Morrow's (1981: 12) plea for language tests which are designed to reveal ". . . not simply the number of items which are answered correctly, but to reveal the quality of the candidate's language performance." A test of communicative ability, according to Morrow (1981: 17f.), can be expected to rely on ". . . modes of assessment which are not directly quantitative, but which are instead qualitative. It may be possible or necessary to convert these into numerical scores, but the process is an indirect one and is recognised as such." Constructing Measures of Language Proficiency An individual's results on an integrative language task like the TELS oral communication test are a set of expert ratings on the components of
performance defined for that task. If the intention is to obtain a measure of an individual's proficiency in the area of functioning (e.g., oral communication) assessed by that test, then this measure must be constructed not from a count of right answers, but from that person's set of ratings. The simplest way to do this is to assign successive integers to grades and to sum these over the rated aspects of performance to obtain a total score for each person. However, there are many advantages in having an explicit statistical model to supervise this process. The item response model described here is designed to supervise the process of combining ratings on a number of aspects of performance to obtain a global measure of proficiency. This model is known as the partial credit model (Masters, 1982) and it has been applied to integrative language tests by Davidson and Henning (1985), Pollitt and Hutchinson (1987), Adams, Griffin and Martin (1987), and Harris, Laan and Mossenson (1988). The model can be explained using the four grades defined in Table 7.1 for the TELS oral communication component Amount of Support. The intention in Table 7.1 is that these four grades are ordered 0 < 1 < 2 < 3, and are indicators of increasing oral communication proficiency. We begin by considering some particular person n with an overall level of oral communication proficiency bn and represent that person's probabilities of receiving grades of 0, 1, 2 and 3 on Amount of Support as Pni0, Pni1, Pni2, and Pni3. If we now restrict our attention to any pair of adjacent grades x-1 and x, the higher person n's overall level of proficiency, the more likely it should be that person will receive a grade of x rather than a grade of x-1. In other words, the conditional probability Pnix / (Pnix-1 + Pnix) should increase with bn. In the partial credit model (PCM), this conditional probability is modelled to increase with bn in the following way:
Pnix / (Pnix-1 + Pnix) = exp(bn - dix) / [1 + exp(bn - dix)]     (1)
where dix is a parameter associated with the set of grades for Amount of Support and governs the probability of a person receiving a grade of x rather than a grade of x-1 on this component of performance. Because person n must receive one of the four grades on Amount of Support, the individual grade probabilities must sum to 1, that is:
Pni0 + Pni1 + Pni2 + Pni3 = 1     (2)
From equations (1) and (2) and with a small amount of algebra it follows
directly that the model probabilities of person n with proficiency bn receiving grades 0, 1, 2 and 3 on Amount of Support are:
Pni0 = 1 / Y
Pni1 = exp(bn - di1) / Y
Pni2 = exp(2bn - di1 - di2) / Y
Pni3 = exp(3bn - di1 - di2 - di3) / Y     (3)
where Y is the sum of the four numerators on the right of these equations and ensures that the four response probabilities on the left sum to 1. This is the general form of the partial credit model. It can be extended to any number of grades by continuing the pattern shown here. When this model is used, it provides a measure of overall proficiency for each individual based on that person's set of ratings. All measures are expressed on a continuum of developing proficiency. In addition, the analysis provides a set of parameter estimates for the various aspects upon which performances have been rated. For Amount of Support, for example, three parameters di1, di2 and di3 are obtained. These can be used to mark out the continuum to enable a criterion-referenced interpretation of individuals' proficiency measures. Criterion-Referencing The use of the partial credit model to make criterion-referenced measures of language proficiency is illustrated in Figures 7.1 and 7.2. Figure 7.1 shows how the probability of a person requiring a particular level of support during an oral interview (see Table 7.1) might be modelled to vary with overall oral communication proficiency. At any given level of oral proficiency, the widths of the four regions of this map give the estimated probabilities of a person at that level of proficiency receiving grades of 0, 1, 2 and 3 on Amount of Support. Persons with low levels of proficiency will require a great deal of support. A person whose oral proficiency bn is estimated at -2.0 logits, for example, has a probability of about 0.6 of receiving the lowest possible grade of 0 and a probability of just over 0.3 of receiving a grade of 1. The probability of requiring this kind of support decreases with increasing proficiency. When the partial credit model is used to analyze a set of ratings, a map like Figure 7.1 can be drawn for each rated aspect of performance. These maps will differ somewhat in appearance because of the different ways in which the grades are defined for different aspects of performance. While the basic shapes of the regions in these maps are fixed by the algebra of the PCM
FIGURE 7.1 Partial credit model probabilities for Amount of Support grades
FIGURE 7.2 Interpreting oral communication proficiency
(3), the positions and widths of these regions will vary from aspect to aspect and are governed by the estimates di1, di2 and di3 for each aspect. Figure 7.2 shows how a set of maps of this kind can be used to provide a detailed picture of a continuum of developing language proficiency. In this example, six different aspects of performance have been rated, each using four grades. The symbols representing these four grades are the same as those used in Figure 7.1. An excellent practical example of this approach to criterion-referencing is provided by Pollitt and Hutchinson (1987). To use Figure 7.2, we first locate an individual's estimated level of oral communication proficiency on the vertical continuum. This is then referred to the columns on the right. Consider, for example, a person with an estimated proficiency of 1.5 logits. Reading horizontally across the six columns, this person will most probably receive a grade of 3 on aspects I and II, and a grade of 2 on aspects III through VI. A person with an estimated oral proficiency of -0.5 logits, on the other hand, will most probably receive a grade of 1 on aspects I, II and III, and a grade of 0 on aspects IV, V and VI. In this way, meaning can be attached to global measures of oral communication proficiency. By referring to the definitions of the grades for each aspect, it is possible to provide a criterion-referenced description of an individual's level of oral proficiency. This description will be a description of the typical characteristics of that level. Testing Model-Data Fit A fundamental difference between item response methods and exploratory factor analysis is that the purpose of an IRT analysis is not to isolate and identify different dimensions in a test, but to evaluate the extent to which various components of performance can be combined to obtain a global measure of proficiency. In the context of Figure 7.2, the question is: to what extent is it valid to combine ratings on these six aspects of interview performance to obtain a global measure of a person's oral communication proficiency? In an IRT analysis, this question is answered by carrying out statistical and possibly graphical tests of model-data fit. Harris et al. (1988), for example, analyzed ratings of various aspects of children's narrative writing and concluded that ratings of Spelling and Grammar did not function in the same way as other aspects of story writing (e.g., Story Structure, Setting, Dialogue Use). They decided to not include Spelling and Grammar in measures of children's narrative writing ability, but to report them separately.
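A small numerical sketch of equations (3) may make the reading of such maps, and the expected grade distributions against which observed grades are compared when fit is checked, more concrete. The step parameter values below are invented for illustration only; they are not the calibrated TELS values.

```python
# Illustrative sketch: partial credit model category probabilities for one
# rated aspect, following equations (3).
import numpy as np

def pcm_probabilities(b, deltas):
    """b: person proficiency (logits); deltas: step parameters di1..dim."""
    cumulative = np.concatenate(([0.0], np.cumsum(b - np.asarray(deltas, float))))
    numerators = np.exp(cumulative)        # one numerator per grade 0..m
    return numerators / numerators.sum()   # divide by Y, the sum of numerators

# Invented step parameters for a four-grade aspect such as Amount of Support.
deltas = [-1.5, 0.0, 1.2]
for b in (-2.0, -0.5, 1.5):
    print(f"b = {b:+.1f}:", np.round(pcm_probabilities(b, deltas), 2))
```

With these invented values, a person at -2.0 logits receives probabilities of roughly 0.6 and 0.36 for grades 0 and 1, which echoes the pattern described for Figure 7.1; the correspondence is illustrative only.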
A simple graphical test of model-data fit is illustrated in Figure 7.3. Whereas Figure 7.1 showed how the distribution of grades on Amount of Support might be modelled to vary with increasing oral proficiency, Figure 7.3 shows how this distribution of grades might vary in practice. A test of model-data fit for this aspect of performance is to compare Figures 7.1 and 7.3. This comparison shows that more persons of high oral proficiency have received a grade of 1 than has been modelled, and more persons of low overall proficiency have received a grade of 3 than has been modelled. In other words, this aspect of performance is somewhat less discriminating than has been modelled. A statistical comparison of these two pictures usually also would be made to evaluate whether or not Amount of Support should be included in a measure of oral communication proficiency. (Note that Figures 7.1 and 7.3 have been invented for illustrative purposes and do not in fact shed any light on this question.) Adams (1988) provides practical examples of misfit diagnosed in this way.

Investigating Language Profiles

In addition to enabling the construction of a criterion-referenced scale of developing proficiency from a set of expert ratings and providing a means of evaluating the extent to which different aspects of performance can validly be combined into a single global measure of proficiency, the partial credit model provides a way of investigating individual profiles of performances. This can be illustrated using Figure 7.2. Consider, for example, a person whose oral communication proficiency is estimated as 1.0 logits. The most probable ratings at this overall level of proficiency are a rating of 3 on aspect I; 2 on aspects II, III, and IV; and 1 on aspects V and VI. These "most probable" ratings are derived from the performances of a wider group of individuals and represent typical performance. Suppose, however, that this particular individual in fact received the ratings shown in Table 7.2. This pattern of ratings is very different from the typical pattern of ratings obtained by persons of a similar level of overall proficiency. This individual has received much worse ratings on aspects I and III than expected, and somewhat better ratings on aspects II, IV, V, and VI. This simple technique of comparing typical and actual patterns of ratings can be carried out statistically and an index used to identify persons with highly idiosyncratic patterns of performances. In general, these will be individuals with special areas of strength and weakness.
FIGURE 7.3 Observed grade distribution for Amount of Support
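As a rough illustration of the comparison described above (not a procedure taken from the chapter), observed grade proportions within bands of estimated proficiency can be set against the proportions the model predicts. All values below, including the model-implied proportions, are invented placeholders.

```python
import numpy as np

# Invented proficiency estimates (logits) and observed Amount of Support grades.
b_hat  = np.array([-2.1, -1.4, -0.9, -0.6, 0.2, 0.4, 0.8, 1.1, 1.5, 2.2])
grades = np.array([   0,    1,    0,    1,   2,   2,   1,   2,   3,   3])

# Model-implied grade proportions per proficiency band (placeholder values,
# e.g. band averages of the PCM probabilities from the earlier sketch).
expected = {(-3, -1): [0.55, 0.35, 0.08, 0.02],
            (-1,  1): [0.15, 0.40, 0.35, 0.10],
            ( 1,  3): [0.02, 0.10, 0.40, 0.48]}

for (lo, hi), exp in expected.items():
    band = (b_hat >= lo) & (b_hat < hi)
    observed = np.bincount(grades[band], minlength=4) / band.sum()
    print(f"[{lo}, {hi}): observed {np.round(observed, 2)}  expected {np.array(exp)}")
```

Persistent discrepancies between the observed and expected rows, of the kind contrasted in Figures 7.1 and 7.3, would suggest that the aspect discriminates more or less sharply than the model assumes.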
TABLE 7.2 One person's "most probable" and actual ratings

                            Rated Aspects of Performance
                            I    II   III   IV   V    VI   Total Score
Most probable ratings:      3    2    2     2    1    1    11
Actual ratings:             1    3    0     3    2    2    11
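A crude stand-in for the statistical index mentioned above is simply to examine the residuals between the actual and the most probable ratings in Table 7.2 (the ratings below are those tabulated above).

```python
import numpy as np

most_probable = np.array([3, 2, 2, 2, 1, 1])   # aspects I-VI, from Table 7.2
actual        = np.array([1, 3, 0, 3, 2, 2])

residuals = actual - most_probable             # negative = worse than expected
print("residuals per aspect:", residuals)
print("sum of squared residuals:", int(np.sum(residuals ** 2)))
```

Operational person-fit statistics standardise such residuals by their model variances, but even this simple index would flag a profile as uneven as the one in Table 7.2 given its total score.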
The reason for performing tests of model-data consistency at the level of individuals is not to label some learners as atypical, but to better understand their pattern of developing abilities. Through the technique just described it may be possible to diagnose an area of slow development which requires special attention. A child whose rating on Use of Dialogue in narrative writing was unexpectedly low (given their overall level of narrative writing ability), for example, may be identified as requiring help with this aspect of their writing.

Discussion

Psychometric methods often are associated with atomistic/behavioristic approaches to language assessment, objectively defined correct answers, quantity rather than quality, and reliability achieved through an over-restrictive view of the nature of language (Morrow, 1981). Spolsky (1978), for example, uses the term "psychometric" to label a bygone era of language testing. But this conception of psychometrics is too narrow. It fails to recognize the significant advances that have occurred in the development of psychometric methods for the analysis of qualitative judgements. Modern IRT methods are concerned primarily with the construction of criterion-referenced proficiency scales along which development can be monitored and in terms of which individual performance profiles can be studied. The particular psychometric method described in this chapter brings together developments in modern psychometrics and language testing. It incorporates Glaser's original intention for criterion-referencing in that it can be used to mark out a continuum of developing proficiency so that test performances can be interpreted in terms of the behaviors (levels of language functioning) that define this continuum. As an IRT model and, in particular, a member of the Rasch family of measurement models, it permits comparisons to be made across different combinations of calibrated language tasks. And, consistent with much recent work in language testing, it relies on expert judgements of the quality of various aspects of individuals' language productions.
Applications of this method to individualized language assessment by Adams et al. (1987), Davidson and Henning (1985), Griffin (1985), Harris et al. (1988), and Pollitt and Hutchinson (1987) illustrate the use of item response theory in a wide variety of contexts. As language testers become increasingly familiar with these methods of analysis, we can expect to see these methods contributing to an improved understanding of the nature of developing language proficiencies as well as being used to study and understand the language development of individual learners.

References

Adams, R.J. (1988) Applying the partial credit model to educational diagnosis, Applied Measurement in Education, 1, 347-361.
Adams, R.J., Griffin, P.E., and L. Martin (1987) A latent trait method for measuring a dimension in second language proficiency, Language Testing, 4, 9-27.
Bloom, B.S., Hastings, J.T., and G.F. Madaus (1971) Handbook on Formative and Summative Evaluation of Student Learning. New York: McGraw-Hill.
Davidson, F. and G. Henning (1985) A self-rating scale of English difficulty: Rasch scalar analysis of items and rating categories, Language Testing, 2, 164-179.
Davis, F.B. (1944) Fundamental factors of comprehension in reading, Psychometrika, 9, 185-197.
Davis, F.B. (1972) Psychometric research on comprehension in reading, Reading Research Quarterly, 7, 628-678.
Derrick, C. (1953) Three aspects of reading comprehension as measured by tests of different lengths. Research Bulletin 53-8. Princeton, NJ: Educational Testing Service.
Glaser, R. (1963) Instructional technology and the measurement of learning outcomes: some questions, American Psychologist, 18, 519-521.
Griffin, P.E. (1985) The use of latent trait methods in the calibration of tests of spoken language in large-scale selection-placement programs. In: Y.P. Lee, A.C.Y.Y. Fok, R. Lord, and G. Low (eds), New Directions in Language Testing. Oxford: Pergamon.
Harris, C.W. (1948) Measurement of comprehension of literature: studies of measures of comprehension, School Review, 56, 332-342.
Harris, J., Laan, S., and L. Mossenson (1988) Applying partial credit analysis to the construction of narrative writing tests, Applied Measurement in Education, 1, 335-346.
Henning, G. (1987) A Guide to Language Testing: Development, Evaluation, Research. Rowley, MA: Newbury House.
Hutchinson, C. and A. Pollitt (1987) The English Language Skills Profile. London: Macmillan.
Instructional Objectives Exchange (1972) Language Arts: Comprehension Skills x-12. Los Angeles, CA: IOX.
Lado, R. (1961) Language Testing. London: Longmans, Green and Co.
Masters, G.N. (1982) A Rasch model for partial credit scoring, Psychometrika, 47, 149-174.
Morrow, K. (1981) Communicative language testing: revolution or evolution. In: J.C. Alderson and A. Hughes (eds), ELT Documents 111: Issues in Language Testing. London: The British Council.
Mossenson, L.T., Hill, P.W., and G.N. Masters (1987) TORCH Tests of Reading Comprehension. Hawthorn: Australian Council for Educational Research.
Pollitt, A. and C. Hutchinson (1987) Calibrating graded assessments: Rasch partial credit analysis of performance in writing, Language Testing, 4, 72-92.
Rasch, G. (1960) Probabilistic Models for some Intelligence and Attainment Tests. Copenhagen: Denmarks Paedagogiske Institut.
Spearitt, D. (1972) Identification of subskills of reading comprehension by maximum likelihood factor analysis, Reading Research Quarterly, 8, 92-111.
Spolsky, B. (1978) Language testing: art or science? In: G. Nickel (ed), Language Testing. Stuttgart: Hochschul-Verlag.
Thorndike, R.L. (1973) Reading as reasoning, Reading Research Quarterly, 9, 135-147.
Thurstone, L.L. (1938) Primary mental abilities, Psychometric Monographs, 1.
Thurstone, L.L. (1946) Note on a re-analysis of Davis's reading tests, Psychometrika, 11, 185-188.
8
Response to Masters: Linguistic Theory and Psychometric Models

JOHN H.A.L. DE JONG
National Institute for Educational Measurement (CITO), Arnhem

Masters's lucid chronicle of the ups and downs in the relation between psychometrics and language testing, together with his illuminating example of an application of new psychometric methods to the notoriously difficult area of testing oral proficiency, is both rewarding and revealing. It is rewarding in the sense that it illustrates how the field of language testing can benefit from new psychometric models. It also helps to adjust the too narrow view of psychometricians as number crunchers who are so preoccupied with statistics and large samples that they disregard effects at the individual level. It is revealing, because by demonstrating the richness of new psychometric methods, Masters refutes the too common allegation that psychometrics stands in the way of sensitive, individualized language testing. Psychometric theory does offer refined models which allow for highly sophisticated analyses of test data. Therefore, it is up to language teachers and language acquisition theorists to define what it is they want to assess, how different stages in language acquisition can be defined, and how they wish to grade these stages. There is a paradox to be found in the present state of the art in language testing. Language proficiency, it would seem, can be judged by the man in the street, it can be measured using advanced psychometric models, and it can also be defined by language specialists. But the relation among the views of the public, the measurements obtained from objective tests, and the constructs of applied linguists is far from being established. Indeed, as Masters points out, item response theory (IRT) models provide the opportunity to mark out a continuum of increasing proficiency.
Masters and Evans (1986) have correctly observed that this concept of increasing proficiency is directly related to Glaser's (1963) original concept of criterion-referencing as the interpretation of an individual's achievement as a position on a continuum of developing proficiency. But can language specialists define language proficiency, let alone answer the question of what it is that makes one language user more proficient than another? In the past decades impressive efforts have been made to define objectives for language curricula. One thinks, for example, of the Threshold Level of the Council of Europe (van Ek, 1975), the Graded Objectives in the United Kingdom (Southern Examining Group, 1986), or of the Australian Language Levels and the National Assessment Framework for Languages at Senior Secondary Level in Australia (cf. Tuffin, this volume). These efforts aim primarily at confining the potentially infinite domain of language proficiency to a finite list of language elements, functions, or notions to be mastered at a particular level. The acceptability of these levels, grades, and frameworks seems to rely primarily on the authority of the scholars involved in their definition, or on the political status of the bodies that control and promote them. Furthermore, rating scales of language proficiency have been developed which provide fairly detailed descriptions of hypothesized stages of language proficiency. Among these are the ACTFL Proficiency Guidelines (American Council on the Teaching of Foreign Languages, 1986) and the Australian Second Language Proficiency Ratings (ASLPR) (Ingram, 1984). These scales, which have been developed as measurement instruments, claim to represent a "design for measuring foreign language proficiency" (ACTFL, 1986), or a means to "match learners against a concept of how proficiency develops" (Ingram, 1984). However, the adequacy of these scales as descriptions of developing language proficiency has been subject to considerable doubt among scholars (Savignon, 1985; Bachman and Savignon, 1986; Bachman, 1987; de Jong, 1987; Douglas, 1987). And the use of these scales requires a more or less intensive training, which might well be indicative of their ambiguity. Masters mentions a number of tests, such as the English Language Skills (TELS) Profile (Hutchinson and Pollitt, 1987) and the TORCH Tests of Reading Comprehension (Mossenson, Hill, and Masters, 1987), which claim to be criterion-referenced proficiency measures. These tests have been shown to measure to a certain degree some school-related development: a relationship exists between, for example, years at school and scaled score on the tests. But whether the observed progress of learners is indeed a development of language proficiency remains to be proven. Furthermore, the suggested stages of development are too loosely defined to constitute a basis for a sound and falsifiable scientific model of language acquisition.
Another influential test claiming to constitute a criterion-referenced measure (Stansfield, 1986) is the Test of Written English (TWE) developed by the Educational Testing Service (ETS). Again there seems to be no more proof of the actual existence of the scale used other than users-agree-it-works. One of the obstacles to the assessment of language proficiency seems to be the analytic approach imposed by what is often understood as "modern scientific" methods. Time and again scholars have stood up against atomistic approaches to language proficiency, but every so often the pendulum just swings back. The various numbers of language dimensions or skills from the factor analytic period described by Masters were questioned by Carroll (1961) and then, more drastically, replaced by Oller's general proficiency factor (Oller, 1976). And while this g-like factor in its turn was in the process of being dethroned by more complex models (Bachman, forthcoming; Canale and Swain, 1980; Vollmer and Sang, 1980; 1983; Sang et al., 1986), the representation of language as a set of elements was called into doubt once again (Morrow, 1981). Once more holistic scoring (Aghbar, 1983; Conlan, 1983; Stansfield, 1986) and integrative testing (Klein-Braley, 1985) were advocated. More recently, IRT models have been brought into action to support the idea of a single dimension across different language skills (Henning, Hudson, and Turner, 1985). And attempts at content-related testing from a language for specific purposes perspective have not shown the effects hoped for (Westaway, Alderson, and Clapham, this volume). However, when the issue is taken further, and a student's writing ability in L2 proves to be related even to his writing ability in L1 (Canale, Frenette, and Bélanger, 1985), one might start to doubt what the general ability that is measured really is. Might it be that a considerable proportion is made up from a general intellectual ability, which, though language related (Vernon, 1950; Carroll, 1961), is not necessarily a foreign language related ability (Cummins, 1983; Alderson, 1984; de Jong and Glas, 1987; de Jong, 1988)? Might it be that language tests are often measuring general cognitive development, instead of the advances of a student in mastering a language other than his own? In this sense, I have argued elsewhere for the need to verify whether tests do in fact show a difference between subjects who know and subjects who don't. In the case of foreign language tests, for instance, this would be the difference between those who speak the language as their native tongue and those who do not (de Jong, 1983; 1984; 1988; de Jong and Glas, 1987). Although such differences may be revealed using general language proficiency measures (taking care not to fall into the trap of measuring general cognitive development), it should also be possible to discover what accounts for these differences. One of the avenues taken to answer this question has been the attempt at defining components of language proficiency.
Lado (1961) is often regarded as the first to advocate the use of discrete-point tests. Subsequent generations of language testers wishing to stress an integrative conception of language proficiency have conveniently forgotten that in Lado's view the use of discrete-point tests should be carefully embedded in a general proficiency construct, the whole not being definable as the sum of the parts. Apart from overlooking Lado's admonitions, a problem with many attempts at describing the constituent components of general language proficiency has been the adherence to descriptions of language according to a particular linguistic theory. There are many who still describe language in terms of the traditional concept of language: lexical items in a network relation which is defined by grammatical rules and which in its turn may be observable through sequencing and morphological features. Though there may be few who will overtly profess adherence to this model, modern linguistics has not succeeded in replacing it entirely, not in the mind of the general public or of the language learner, nor in the practice of many language teachers. Applied linguists are "better informed" and have adopted richer concepts from general linguistics, sociolinguistics, etc., to define the components of language proficiency. However, linguistics aims at understanding the phenomenon of language, at understanding how language is structured. By contrast, the study of language acquisition and language testing is not directly concerned with the structure of language itself, but with the development of language proficiency. It is questionable at least whether this development can be adequately described in terms of the acquisition of the structural components of language (Newmark, 1971; Sapon, 1971). The revelation of developmental stages of language proficiency may profit more from cognitive psychology, neurolinguistics, and cognitive neurophysiology, than from general linguistics and sociolinguistics. Recent findings in cognitive neurophysiology, for example, seem to indicate that separate brain areas are involved in visual and auditory coding of lexical items, each with independent access to supramodal articulatory and semantic systems (Petersen et al., 1988; Posner et al., 1988a; 1988b). These findings are consistent with multiple route models as have been proposed by cognitive psychologists (e.g. Rumelhart et al., 1986) and present new arguments in favor of separating performance from competence. On the other hand, psychological experiments in sequencing and timing in speech production have provided support for the notion of hierarchical coding at the level of utterance plans (Gordon and Meyer, 1987) as opposed to models of element-by-element encoding procedures. Keele (1987) has discussed the relevance of these hierarchic sequencing structures to the ability to use different codes or languages. He has suggested that entire upper levels of a hierarchy, not just subunits, can be transferred from one language
to another, provided that the subject (a) is sufficiently prepared at the motor level for the production of the second language, and (b) is able to attach to concepts the words (or larger units) from either language. The appeal of models like the one proposed by Masters is that they allow for a general proficiency measurement, while at the same time offering the possibility of investigating the extent to which a hypothesized multicomponent model can explain the observations. This possibility is offered also by factorial designs using extended batteries of carefully designed discrete-point tests. But IRT models provide a more practical approach. Furthermore, the person-oriented character of IRT enables test users to evaluate the applicability of the model at the individual level. IRT applied in this way also allows users to trace alternative interpretations of test results whenever an individual's proficiency fails to conform to the hypothesized structure of the general model. Importantly, this quality also allows individual learning styles in second or foreign language development (cf. Ellis, this volume) to be taken into account. There remain, however, some practical problems with the test Masters uses as illustration. Even if the criteria used to evaluate a person's oral proficiency form a scale, it has not yet been established whether this scale is indeed a language proficiency scale, and the criteria or stages would need to be defined more closely. A very precise definition may often be a simplification by nature, but is also necessary because any ambiguity in the interpretation of the stage definitions will inevitably lead to a lack of reliability and hence jeopardize the validity of the measurement. Furthermore, how can the danger of halo effects be satisfactorily ruled out when using judgmental scales? And, finally, is it at all feasible to attain a sufficiently reliable assessment of a subject's oral proficiency using tests of such restricted length? In the TELS oral proficiency measure described by Masters, an interviewee receives one of four possible grades on five components of oral performance. Though a rating scale yields more information per item than a dichotomously scored item, the number of observations (five) seems rather limited. The limited number of observations and the possibility of halo effects both lead to a data set which will not be easily rejected by the model. In other words, given the test procedure it is highly improbable that the model will be rejected and, therefore, it is hard to provide sufficient proof of the validity of the procedure. In addition to these practical problems, I would like to address a theoretical issue related to the linguistic model underlying procedures such as TELS and, subsequently, also to the psychometric model used in the analysis.
The TELS oral proficiency profile distinguishes five components of performance, one of which is "Amount of Support". The five components are regarded as separately identifiable. A subject's oral proficiency can be reported as a profile of ratings on each of the components. In the process of evaluating the oral proficiency data using the partial credit model, two steps can be distinguished. First, the ratings of each component have to be scaled, that is, from an ordinal scale an interval scale has to be constructed. This step can only be successful if the ratings constitute levels of growing proficiency in a single dimension. The second step is combining the component ratings in an overall test score. In order to allow this combination of component ratings, among other assumptions, the model necessarily presupposes that each component can be assessed independently, that is, obtaining a low score on any one component does not preclude obtaining a high score on any other component. This feature is commonly known as the independence of items, and is generally assumed to be a necessary characteristic in classical test theory as well as in IRT. The assumption of independence is necessary within any model which hypothesizes an additive relationship between the constituent elements of the trait to be measured. In the application of factor analysis, the research effort is directed at determining the psychologically independent and statistically uncorrelated components that account for the variance of a group of subjects on a battery of measures. By reporting total scores as a simple summation of more or less correlated subtests, however, the whole exercise of extracting these factors becomes gratuitous. Similarly, the component ratings in Masters's model are summed to add up to the total test score as a sufficient statistic to assess each subject's overall ability, and to predict probable ratings on each of the components. The assumption which has to underlie such a procedure is that the rated components contribute independently towards the overall proficiency that is being measured. In multidimensional models using total scores as sufficient statistics, the relation between the subskills representing different dimensions is hypothesized to be compensatory, in that deficits in one subskill could be overcome by strengths in another. Language proficiency, however, may well prove to involve more complicated relationships between constituents. For example, the correct pronunciation of a word may be influenced independently by knowledge of phonological rules, familiarity with the lexical item itself, understanding of its syntactical relations with other lexical items, and even by awareness of the concept it expresses. Such relationships would call for multiplicative item combination rules where subskills are regarded as distinct and individually necessary prerequisites for a correct response (Jannarone, 1986).
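The contrast between an additive, compensatory combination rule and a multiplicative, conjunctive one can be sketched as follows. The functional forms below are illustrative only and are not taken from Jannarone's models; the subskill levels are invented.

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def compensatory_success(subskills, weights, difficulty):
    """Additive rule: a weighted sum of subskill levels drives success,
    so a deficit in one subskill can be offset by strength in another."""
    return logistic(np.dot(weights, subskills) - difficulty)

def conjunctive_success(subskills, difficulties):
    """Multiplicative rule: every subskill is separately necessary,
    so the probabilities of clearing each hurdle are multiplied."""
    return np.prod(logistic(subskills - difficulties))

levels = np.array([2.0, -1.5, 1.0])                          # hypothetical subskill levels (logits)
print(compensatory_success(levels, np.ones(3) / 3, 0.0))     # about 0.62
print(conjunctive_success(levels, np.zeros(3)))              # about 0.12
```

Under the additive rule the weak subskill barely matters; under the conjunctive rule it dominates, which is the kind of relationship suggested above for the constituents of correct pronunciation.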
In recent years cognitive theorists have proposed increasingly complex models of multiple ability traits (e.g. Pellegrino and Glaser, 1979), and in the field of second language learning hierarchical models of language acquisition have also been proposed (e.g. Sang et al., 1986). To complicate things further, the relative contribution of constituent components of a more general skill such as reading comprehension does not appear to remain constant during the learning process. Drum, Calfee, and Cook (1980) found that different subskills interfere with overall comprehension depending on the level at which the reader is performing. Yet, here again it may be pointed out that the definition of the constructs to be measured needs to be provided by the language specialists. Psychometric theory already offers item response models to evaluate such complex constructs of interdependent skills and subskills (e.g. Fischer, 1973; Andrich, 1984; 1985; Embretson, 1983; 1985a; 1985b; Jannarone, 1986; 1987). These models provide a completely different outlook on test construction and evaluation. The principle of internal consistency, so predominant both in classical test theory and in unidimensional latent trait models, is replaced by mathematical models of test performance. Embretson (1983) points out that these mathematical models allow for a new type of construct validity research. She argues that traditional construct validation research examines the correlations of individual differences on a test with other measures and evaluates the validity of the test as a measure of individual differences. It is therefore concerned with nomothetic span. Mathematical modelling of task difficulty, in contrast, aims at revealing the construct representation of the task by identifying the constructs that account for performance. The ideal of a single dimension underlying language proficiency will most probably prove to be too simple. Still, for practical purposes, it may often be desirable or necessary to express the proficiency measured with a single rating. In those circumstances, tests that can be shown to fit as nearly as possible a unidimensional psychometric model are to be preferred. If, on the other hand, a multicomponent approach to assessment is feasible, adhering to an additive, compensatory measurement model seems counterproductive. The overall score, that is, the sum of subtest ratings, does not provide the maximum amount of information which can be extracted from the measurement data.

References

Aghbar, A.A. (1983) Grid-based impressionistic scoring of ESL compositions. Paper presented at the 18th Annual TESOL Convention, Toronto.
Alderson, J.C. (1984) Reading in a foreign language: a reading problem or a language problem? In: J.C. Alderson and A.H. Urquhart (eds), Reading in a Foreign Language. New York: Longman.
American Council on the Teaching of Foreign Languages (1986) ACTFL Proficiency Guidelines. Hastings-on-Hudson, NY: American Council on the Teaching of Foreign Languages.
Andrich, D. (1984) The attenuation paradox of traditionalist test theory as a breakdown of local independence in person item response theory. Paper presented at the National Conference of Measurement in Education, New Orleans.
Andrich, D. (1985) A latent trait model for items with response dependencies: implications for test analysis and construction. In: S.E. Embretson (ed), Test Design: Development in Psychology and Psychometrics. Orlando, Florida: Academic Press.
Bachman, L. F. (1987) Problems in examining the validity of the ACTFL oral proficiency interview. In: A. Valdman (ed), Proceedings of the Symposium on the Evaluation of Foreign Language Proficiency. Bloomington, IN: Indiana University.
Bachman, L. F. (Forthcoming) Fundamental Considerations in Language Testing. Oxford: Oxford University Press.
Bachman, L. F. and S. Savignon (1986) The evaluation of communicative language proficiency: a critique of the ACTFL oral interview, Modern Language Journal, 70, 380-390.
Canale, M., and M. Swain (1980) Theoretical bases of communicative approaches to second language teaching and testing, Applied Linguistics, 1, 1-47.
Canale, M., Frenette, N., and M. Bélanger (1985) On the interdependence of L1 and L2 writing in a minority setting. In: S. Jones, M. Desbrisay, and T. Pari Bakht (eds), Proceedings of the 5th Annual Language Testing Colloquium. Ottawa: Carleton University.
Carroll, J.B. (1961) Fundamental considerations in testing for English language proficiency of foreign students. In: Testing the English Proficiency of Foreign Students. Washington, DC: Center for Applied Linguistics.
Conlan, G. (1983) Comparison of analytic and holistic scoring techniques. Unpublished draft. Princeton, NJ: Educational Testing Service.
Cummins, J. (1983) Language proficiency and academic achievement. In: J.W. Oller, Jr. (ed), Issues in Language Testing Research. Rowley, MA: Newbury House.
de Jong, J.H.A.L. (1983) Focusing in on a latent trait: construct validation by means of the Rasch model. In: J. van Weeren (ed), Practice and Problems in Language Testing 5. Arnhem: CITO, National Institute for Educational Measurement.
de Jong, J.H.A.L. (1984) Listening, a single trait in first and second language learning, Toegepaste Taalwetenschap in Artikelen, 20, 66-79.
de Jong, J.H.A.L. (1987) Defining tests for listening comprehension: a response to Dan Douglas's "Testing listening comprehension". In: A. Valdman (ed), Proceedings of the Symposium on the Evaluation of Foreign Language Proficiency. Bloomington, IN: Indiana University.
de Jong, J.H.A.L. (1988) Le modèle de Rasch: les principes sous-jacents et son application à la validation de tests, Toegepaste Taalwetenschap in Artikelen, 31, 57-70.
de Jong, J.H.A.L. and C.A.W. Glas (1987) Validation of listening comprehension tests using Item Response Theory, Language Testing, 4, 170-194.
Douglas, D. (1987) Testing listening comprehension in the context of the ACTFL proficiency guidelines. In: A. Valdman (ed), Proceedings of the Symposium on the Evaluation of Foreign Language Proficiency. Bloomington, IN: Indiana University.
Drum, P.A., Calfee, R.C., and L.K. Cook (1980) The effects of surface structure variables on performance in reading comprehension tests, Reading Research Quarterly, 16, 486-513.
Embretson, S.E. (1983) Construct validity: construct representation versus nomothetic span, Psychological Bulletin, 93, 179-197.
Embretson, S.E. (1985a) Multicomponent latent trait models for test design. In: S.E. Embretson (ed), Test Design: Development in Psychology and Psychometrics. Orlando, Florida: Academic Press.
Embretson, S.E. (1985b) Studying intelligence with test theory models. In: D.K. Detterman (ed), Current Topics in Human Intelligence. Volume 1: Research Methodology. Norwood, NJ: Ablex.
Fischer, G. (1973) Linear logistic test model as an instrument in educational research, Acta Psychologica, 37, 359-374.
Glaser, R. (1963) Instructional technology and the measurement of learning outcomes: some questions, American Psychologist, 18, 519-521.
Gordon, P.C. and D.E. Meyer (1987) Hierarchical representation of spoken syllable order. In: A. Allport, D.G. McKay, W. Prinz, and E. Scheerer (eds), Language Perception and Production: Relationships between Listening, Speaking, Reading and Writing. London: Academic Press.
Henning, G., Hudson, T., and J. Turner (1985) Item response theory and the assumption of unidimensionality for language tests, Language Testing, 2, 141-154.
Hutchinson, C. and A. Pollitt (1987) The English Language Skills Profile. London: Macmillan.
Ingram, D.E. (1984) Australian Second Language Proficiency Ratings. Canberra: Australian Government Publishing Services.
Jannarone, R.J. (1986) Conjunctive item response theory kernels, Psychometrika, 51, 357-373.
Jannarone, R.J. (1987) Locally dependent models for reflecting learning abilities. USCMI Report No. 87-67. Columbia, SC: University of South Carolina, College of Engineering.
Keele, S.W. (1987) Sequencing and timing in skilled perception and action: an overview. In: A. Allport, D.G. McKay, W. Prinz, and E. Scheerer (eds), Language Perception and Production: Relationships between Listening, Speaking, Reading and Writing. London: Academic Press.
Klein-Braley, C. (1985) A cloze-up on the C-Test: a study in the construct validation of authentic tests, Language Testing, 2, 76-109.
Lado, R. (1961) Language Testing. London: Longmans, Green and Co.
Masters, G.N. and J. Evans (1986) A sense of direction in criterion-referenced assessment, Studies in Educational Evaluation, 12, 257-265.
Morrow, K. (1981) Communicative language testing: revolution or evolution. In: J.C. Alderson and A. Hughes (eds), ELT Documents 111: Issues in Language Testing. London: The British Council.
Mossenson, L.T., Hill, P.W., and G.N. Masters (1987) TORCH Tests of Reading Comprehension. Hawthorn: Australian Council for Educational Research.
Newmark, L.D. (1971) A minimal language-teaching program. In: P. Pimsleur and T. Quinn (eds), The Psychology of Second Language Learning. Cambridge: Cambridge University Press.
Oller, J.W., Jr. (1976) Evidence for a general language proficiency factor: an expectancy grammar, Die Neueren Sprachen, 75, 165-174.
Pellegrino, J.W., and R. Glaser (1979) Cognitive correlates and components in the analysis of individual differences, Intelligence, 3, 169-186.
Petersen, S.E., Fox, P.T., Posner, M.I., Mintun, M., and M.E. Raichle (1988) Positron emission tomographic studies of the cortical anatomy of single-word processing, Nature, 331, 585-589.
Posner, M.I., Petersen, S.E., Fox, P.T., and M.E. Raichle (1988a) Localization of cognitive operations in the human brain. ONR Technical Report 88-1. Arlington, VA: Office of Naval Research.
Posner, M.I., Sandson, J., Dhawan, M., and G.L. Shulman (1988b) Is word recognition automatic? A cognitive-anatomical approach. ONR Technical Report 88-4. Arlington, VA: Office of Naval Research.
Rumelhart, D.E., McClelland, J.L., and the PDP Research Group (1986) Parallel Distributed Processing: Explorations in the Microstructures of Cognition. Volume I, Foundations. Cambridge, MA: MIT Press.
Sang, F., Schmitz, B., Vollmer, H.J., Baumert, J., and P.M. Roeder (1986) Models of second language competence: a structural equation approach, Language Testing, 3, 54-79.
Sapon, S.M. (1971) On defining a response: a crucial problem in the analysis of verbal behavior. In: P. Pimsleur and T. Quinn (eds), The Psychology of Second Language Learning. Cambridge: Cambridge University Press.
Savignon, S. (1985) Evaluation of communicative competence: the ACTFL provisional proficiency guidelines, Modern Language Journal, 69, 129-134.
Southern Examining Group (1986) General Certificate of Secondary Education: French, 1988 Examinations. Oxford: The Southern Examining Group.
Stansfield, C. (1986) A history of the Test of Written English: the developmental year, Language Testing, 2, 225-234.
van Ek, J.A. (1975) The Threshold Level. Strasbourg: The Council of Europe.
Vernon, P.E. (1950) The Structure of Human Abilities. New York: Wiley.
Vollmer, H.J. and F. Sang (1980) Zum psycholinguistischen Konstrukt einer internalisierten Erwartungsgrammatik, Linguistik und Didaktik, 42, 122-148.
Vollmer, H.J. and F. Sang (1983) Competing hypotheses about second language ability: a plea for caution. In: J.W. Oller, Jr. (ed), Issues in Language Testing Research. Rowley, MA: Newbury House.
SECTION II LANGUAGE TEACHING AND INDIVIDUALIZED ASSESSMENT
9
Individual Learning Styles in Classroom Second Language Development

ROD ELLIS
Ealing College of Higher Education

Research into the relationship between learning style and second language acquisition (SLA) has been mainly concerned with cognitive style. This term refers to the way in which learners process information. The construct is premised on the assumption that different learners have characteristic modes of operation which function across a variety of learning tasks, including language learning. The principal measure of cognitive style used in SLA research has been the Group Embedded Figures Test of field dependency/independency developed by Witkin et al. (1971). This requires subjects to identify a simple geometric figure within a more complex design. Learners who are able to carry out this task easily and rapidly are said to be "field independent", while those who cannot do so are "field dependent". It has been hypothesised that field independents will do better in classroom learning because they will be better able to analyse formal grammar rules. However, SLA research which has used the Group Embedded Figures Test has been far from conclusive (cf. Ellis, 1985, and McDonough, 1987 for reviews of the literature). Field independency appears, at best, to be only weakly correlated with second language (L2) proficiency. However, the hypothesis that SLA is influenced by the way in which learners orientate to the learning task remains an appealing one. The relative failure to find any consistent relationship between cognitive style and language learning may be the result of the test measure which has been used.
The crucial distinction with regard to learning style may be between learners who are norm-oriented as opposed to those who are communicative-oriented (Clahsen, 1985; Johnston, undated). Norm-oriented learners are those who are concerned with developing knowledge of the linguistic rules of the second language, while communicative-oriented learners are those who seek to develop their capacity to communicate effectively in the L2 irrespective of formal accuracy. According to this proposal, learners vary in the way in which they view the language learning task. This task can be viewed as a modular one in accordance with the theory of SLA proposed by Bialystok and Sharwood Smith (1985). The general construct "language" is subdivided into different modules according to which aspect (e.g. linguistic vs. pragmatic) is involved. Each module is further subdivided to reflect the general distinction between "competence" and "performance". Thus, in the case of the linguistic module, acquisition entails both the development of knowledge of the actual rules that constitute the language and control of the knowledge which has been acquired. We can speculate that norm-oriented learners choose to focus on acquiring linguistic knowledge, while communicative-oriented learners endeavour to develop channel control mechanisms. There are a number of interesting pieces of research that reflect the basic distinction between norm-oriented and communicative-oriented learners. Hatch (1974), on the basis of an extensive review drawing on 15 observational studies of 40 L2 learners, distinguished "rule formers" (i.e. learners who progress steadily by building up their knowledge of L2 rules) from "data gatherers" (i.e. learners who gain rapidly in fluency but who do not appear to sort out any rules). Seliger (1980) found that some learners were "planners" who endeavoured to organise their productions prior to performance, while others were "correctors" who preferred to go first for fluency and then to edit subsequent performance as necessary. Dechert (1984) describes two different learners' approaches to the retelling of an oral narrative; one was "analytic", characterised by long pauses at chunk boundaries, an absence of filled pauses and corrections, very few additions to or omissions from the original story, and serial, propositional processing. The other was "synthetic", characterised by shorter pauses at chunk boundaries, many filled pauses and corrections, considerable changes to the original story, and ad hoc episodic processing. Schmidt (1983) reports on a particularly illuminating case study of a Japanese painter called Wes. The study covered a three-year period. Schmidt found little development in this learner's grammatical competence over this period, but considerable development in sociolinguistic, discourse, and strategic competence. Schmidt suggests that the type of progress Wes demonstrated was a product of his learning style and that the acquisition of grammatical competence is in part, at least, independent of the acquisition of the general ability to communicate effectively.
We can also speculate, as does Schmidt, that learners with different learning-task orientations will also vary systematically on individual learner factors such as motivation and aptitude. For example, a norm-oriented learner might be expected to score highly in tests of grammatical analysis such as the "Words in Sentences" section of the Modern Language Aptitude Test (Carroll and Sapon, 1959). Communicative-oriented learners, on the other hand, might be expected to do better on tests of word memory, on the grounds that vocabulary is particularly important for communication. The purpose of the study reported in this chapter is to explore these speculations with reference to the acquisition of German as a second language in a classroom context by a group of 39 adult learners. The study has these aims:

(1) To test whether field dependency/independency is related to the acquisition of linguistic knowledge and channel control.
(2) To investigate the extent to which the acquisition of linguistic knowledge and channel control proceed independently of each other.
(3) To explore which individual learner factors relate respectively to norm-oriented and communicative-oriented learners.

Each of these aims will be considered separately.

Subjects and Instructional Context

The subjects were 39 adult students taking beginning courses in German at two institutions of higher education in London. They were aged between 18 and 41 years with different first language backgrounds: English, Spanish, French, Mauritian Creole, and Arabic. All the learners were experienced classroom learners, having reached A-Level or equivalent in at least one second language other than German. Although the course was designed for complete beginners, 14 of the students already had some limited knowledge of German. The courses the learners were taking were part of a degree programme leading to a BA in Applied Language Studies. They lasted a full academic year, although the period of this study covered only two terms (approximately 22 weeks). The students were taught in separate groups, receiving between 7 and 12 hours of language instruction per week. Two course books were used with different groups. One book provided a traditional structural course and the other a notional-functional course. In fact, however, the methods of instruction varied little between groups.
In general, the overall approach was a traditional one, involving extensive explanation of formal grammar points together with various kinds of practice and translation exercises. None of the groups received much opportunity to use German in natural communication. In other words, the instruction was accuracy rather than fluency focussed (Brumfit, 1984).

Field Dependency/Independency

The Group Embedded Figures Test (Witkin et al., 1971) was administered to the 39 learners at the end of the study. The test provided measures of the degree of field independency of each subject. A number of different measures of learning were obtained as follows.

1. Word Order Acquisition Score

The learners performed an information gap activity in pairs at the ends of term 1 and term 2. This required them to describe to each other pictures making up a story in order to reconstruct the narrative and then to tell the complete story. The intention was to obtain a corpus of relatively spontaneous speech. Transcriptions were prepared and obligatory occasions for three German word order rules obtained. The three rules were PARTICLE, INVERSION and VERB-END. Research into the naturalistic acquisition of German (Meisel, 1984) has shown that these rules are developmental in the sense that learners acquire them in a fixed sequence. The percentage of correct supplying of each word order rule was computed. In order to obtain the maximum score a learner must have produced a minimum of three obligatory occasions and performed correctly in all of them. Learners who produced fewer than three obligatory occasions were penalised by reducing the maximum score possible by 33% if they produced only two obligatory occasions, and by 66% if they produced only one. Learners who produced no obligatory occasions scored zero. Learners who produced three or more obligatory occasions were awarded a score out of 100% according to the percentage they performed correctly. The Word Order Acquisition Score was designed to provide a measure of the learners' level of acquisition of linguistic knowledge at the end of term 1 (Time 1) and term 2 (Time 2).
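A minimal sketch of one way this scoring rule could be implemented is given below. The chapter does not spell out exactly how scores below the reduced maximum were computed, so the scaling step is an assumption.

```python
def word_order_score(n_correct, n_obligatory):
    """Score (0-100) for one word order rule, with the penalty for
    producing fewer than three obligatory occasions."""
    if n_obligatory == 0:
        return 0.0
    pct_correct = 100.0 * n_correct / n_obligatory
    if n_obligatory >= 3:
        max_score = 100.0          # full credit possible
    elif n_obligatory == 2:
        max_score = 100.0 - 33.0   # maximum reduced by 33%
    else:
        max_score = 100.0 - 66.0   # maximum reduced by 66%
    return pct_correct * max_score / 100.0   # scale to the reduced maximum (assumed)

print(word_order_score(3, 4))   # 75.0
print(word_order_score(2, 2))   # 67.0 at most, despite 100% correct use
```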
2. Word Order Acquisition Gain Score

This was calculated by subtracting the Word Order Acquisition Score at Time 1 from that at Time 2. It was intended to provide a measure of the rate of acquisition of linguistic knowledge.

3. Speech Rate Score

This was computed using the same speech data as for the first two learning variables. The score consisted of the number of syllables produced in one minute of speech after disfluencies (i.e. repetitions, corrections, fillers, and parts of words) had been discounted. The Speech Rate Score was designed to provide a general measure of the learners' channel control at Times 1 and 2.

4. Speech Rate Gain Score

This was calculated by subtracting the Speech Rate Score at Time 1 from that at Time 2. It was intended to provide a measure of the rate of acquisition of channel control.
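The rate and gain measures in items 3 and 4 amount to a simple calculation; the syllable counts below are invented for illustration.

```python
def speech_rate(total_syllables, disfluent_syllables, minutes=1.0):
    """Syllables per minute after discounting repetitions, corrections,
    fillers and parts of words."""
    return (total_syllables - disfluent_syllables) / minutes

rate_t1 = speech_rate(total_syllables=95, disfluent_syllables=18)
rate_t2 = speech_rate(total_syllables=132, disfluent_syllables=12)
gain = rate_t2 - rate_t1        # Speech Rate Gain Score: Time 2 minus Time 1
print(rate_t1, rate_t2, gain)   # 77.0 120.0 43.0
```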
5. Vocabulary Proficiency Score

This was obtained by means of a test used by the Department of German as a Foreign Language at the University of Munich. The test consisted of 31 items requiring learners to complete a sequence of four words with the appropriate item. For example: Tag : hell = Nacht : . . . The items became progressively more difficult. The students were given 10 minutes to complete the test, which was administered once only at Time 2.

6. Grammar Proficiency Score

A discrete item grammar test (also from the University of Munich) was administered at Time 2. The test consisted of 30 items arranged in order of difficulty. A sliding marking scale was used giving weight to the more difficult items. The students were given 20 minutes to complete the test.

7. A Cloze Score

The passage for the cloze test was a written version of one of the picture compositions used in the information gap activity, prepared by a native speaker of German. The first 50 words of the text were given in full. Thereafter, every tenth word was deleted. The score was out of 25. The students were allowed 15 minutes to complete the test.

Pearson product-moment correlation coefficients between the field independency scores and the scores for the various measures of learning were obtained. These are displayed in Table 9.1. The coefficients are all negative. However, no relationship reaches statistical significance. For this sample of learners, therefore, cognitive style as measured by the Group Embedded Figures Test is unrelated to measures of the acquisition of linguistic knowledge or of channel control.

TABLE 9.1 Correlations between field independency and learning variables

Learning Variables                          Correlation with Field Independency
1) Word Order Acquisition    Time 1         -0.05
                             Time 2         -0.27
2) Word Order Acquisition Gain              -0.22
3) Speech Rate               Time 1         -0.09
                             Time 2         -0.11
4) Speech Rate Gain                         -0.01
5) Vocabulary Proficiency                   -0.00
6) Grammar Proficiency                      -0.19
7) Cloze                                    -0.15

N = 39. All correlations non-significant: p > 0.05.
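As a quick check (not part of the chapter), the largest coefficient in Table 9.1 can be tested against the 5% criterion for a sample of 39:

```python
import numpy as np
from scipy import stats

n, r = 39, -0.27                          # largest correlation in Table 9.1
t = r * np.sqrt((n - 2) / (1 - r ** 2))   # t-statistic for a Pearson r
p = 2 * stats.t.sf(abs(t), df=n - 2)      # two-tailed p-value
print(round(t, 2), round(p, 3))           # about -1.71 and 0.096, i.e. p > 0.05
```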
Acquisition of Linguistic Knowledge and Channel Control

The second aim of the study was to investigate to what extent the measures of linguistic knowledge were independent of the measures of channel control. If the scores obtained could be shown to be unrelated, this would provide support for the modular theory of language acquisition proposed by Bialystok and Sharwood Smith (1985), and would also suggest that learners differ according to whether they are norm- or communicative-oriented. The same measures of learning as described in the previous section were used. Two kinds of statistical analysis were carried out. First, Pearson product-moment correlation coefficients were computed between the measures of linguistic knowledge and the measures of channel control. Second, a principal components factor analysis was run to gain further insight into the relations between the two sets of variables.

The Pearson product-moment correlations between the measures of linguistic knowledge and speech rate obtained at Times 1 and 2 are shown in Table 9.2. The coefficients are all small, and none approaches significance at the 5% level. These results suggest that for this group of learners acquisition of linguistic knowledge and channel control develop separately. However, a significant negative correlation between gains in word order acquisition and in speech rate from Time 1 to Time 2 was found (r = -0.44; p < 0.01).

TABLE 9.2 Correlations between linguistic knowledge and channel control

Linguistic Knowledge                    Correlation with Speech Rate
Word Order Acquisition (Time 1)         -0.16
Word Order Acquisition (Time 2)          0.04
Vocabulary Proficiency                   0.04
Grammar Proficiency                      0.07
Cloze                                    0.14

N = 39. All correlations non-significant: p > 0.05.

Thus, those learners who showed the greatest gain in acquisition of the three word order rules manifested the smallest gain in general oral
fluency and, conversely, those learners who developed the ability to process speech the most rapidly displayed the smallest gain in knowledge of the word order rules.

A principal components factor analysis was carried out using the scores obtained on all the learning variables at Times 1 and 2. Two factors were extracted in accordance with the hypothesis that linguistic knowledge and channel control constitute independent aspects of acquisition. The results (see Table 9.3) show that the knowledge variables all load positively on Factor 1, while the Speech Rate Scores produce loadings around zero. The pattern is reversed for Factor 2. The "Control" variables load strongly and the knowledge variables only weakly. There is, therefore, reason to assume that the factors represent "Knowledge" and "Control" respectively, and that these two aspects of acquisition are independent.

TABLE 9.3 Results of factor analysis (two factor solution)

Variable                        Factor 1   Factor 2   Eigenvalue   % of variance
Speech Rate (Time 1)             -0.03       0.88        3.06          43.7
Speech Rate (Time 2)              0.09       0.82        1.48          21.1
Word Order Acquisition (T1)       0.64      -0.05        0.86          12.4
Word Order Acquisition (T2)       0.66       0.16        0.75          10.7
Vocabulary Proficiency            0.78      -0.03        0.44           6.3
Grammar Proficiency               0.91       0.00        0.29           4.2
Cloze                             0.86       0.10        0.11           1.6
Taken together, these results indicate that for this group of learners a clear distinction can be made between the acquisition of linguistic knowledge on the one hand and of channel control on the other. It should be noted that the two modules were found to be independent irrespective of whether the measurements of each module were obtained from the same or different data sets. Thus, Speech Rate was weakly correlated with both Word Order Acquisition and general proficiency. Furthermore, development in one module is linked to a lack of development in the other. Learners may choose which aspect of acquisition to place emphasis on and progress differently according to the choice they make. Some learners opt to develop the rule system of the target language, while others prefer to develop fluency.
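A principal components analysis of the kind reported in Table 9.3 can be reproduced in a few lines. The score matrix below is randomly generated, since the study's raw data are not given in the chapter, and the loadings are left unrotated.

```python
import numpy as np

rng = np.random.default_rng(0)
scores = rng.normal(size=(39, 7))          # N = 39 learners x 7 learning variables (invented)

corr = np.corrcoef(scores, rowvar=False)   # correlation matrix of the variables
eigenvalues, eigenvectors = np.linalg.eigh(corr)
order = np.argsort(eigenvalues)[::-1]      # components in descending order of eigenvalue
eigenvalues = eigenvalues[order]
loadings = eigenvectors[:, order] * np.sqrt(eigenvalues)

print(np.round(eigenvalues, 2))                            # eigenvalues
print(np.round(100 * eigenvalues / eigenvalues.sum(), 1))  # % of variance
print(np.round(loadings[:, :2], 2))                        # two-factor solution
```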
Individual Learner Factors and Knowledge/Control

The results reported in the previous section give credence to the claim that learners differ in the general way they orientate to the learning task. They suggest that there are norm-oriented and communicative-oriented learners. In this section we explore whether individual learner factors such as motivation and aptitude can be used to distinguish such learners. There is a rich literature dealing with the relationship between learner factors and language learning. Gardner (1980) summarised years of research in Canada by concluding that a general index of attitude and motivation and standard measures of aptitude together account for around 27% of the variance in learning (measured by means of grade scores) obtained by instructed foreign language learners. Gardner and other researchers, however, have been solely concerned with the relationship between learner factors and linguistic knowledge. There has been no research, to the best of our knowledge, that has investigated the relationship between learner factors and channel control. In this study a number of measures of different learner factors were obtained at the beginning of the study. The measures were:

1. Integrative Motivation

The measure of integrative motivation was derived from the subjects' responses to a number of statements of the kind: This language will enrich my background and broaden my cultural horizons. They were included in a questionnaire administered at the beginning of the study. The subjects rated each statement on a scale from 1 (unimportant) to 3 (very important). The ratings were then aggregated.

2. Instrumental Motivation

This measure was obtained in a similar way to that for integrative motivation. The students responded to statements such as: This language seems easier than others I could have taken.

3. Expectancy of Achieving Native Speaker Fluency

The learners were asked to rate how probable they thought it was that they would one day achieve native speaker fluency in German. The scale was from 0 (completely improbable) to 5 (completely probable). This measure was designed to provide an indication of how the subjects saw themselves as learners of German in the long term.

4. Aptitude (words in sentences)

This was assessed by means of Part IV of the Modern Language Aptitude Test (Carroll and Sapon, 1959).
The test requires the subjects to identify the function of words within a sentence. There were 45 items. A time limit of 13 minutes was set.
5. Aptitude (memory). This was measured by means of a test (Skehan, 1982) that assesses the subjects' abilities to memorise words in an unknown foreign language (Finnish). The subjects were given 5 minutes to memorise the words and 2 minutes to write down all the words they could remember.
6. Aptitude (sound discrimination). This was measured by means of Part 5 of Pimsleur's (1966) Language Aptitude Battery. The test requires the subjects to learn phonetic distinctions and to recognise them in different contexts. There were 30 items.
7. Aptitude (sound-symbol association). This was measured using Part 6 of Pimsleur's Language Aptitude Battery. The subjects were required to listen to words and select the closest written version in a multiple-choice format. There were 30 items.

TABLE 9.4 Intercorrelations between learning measures and learner factors

                                     Individual Learner Factors
Learning Measures              (1)     (2)     (3)      (4)     (5)     (6)     (7)
Speech Rate (Time 1)          0.15   -0.05   -0.03    -0.09    0.06   -0.06   -0.19
Speech Rate (Time 2)         -0.17    0.03    0.18    -0.06   -0.11    0.11    0.14
Word Order Acquisition (T1)  -0.03    0.12    0.41**   0.14    0.07    0.29   -0.07
Word Order Acquisition (T2)  -0.02   -0.04    0.18     0.16   -0.07   -0.20   -0.25
Vocabulary Proficiency        0.07    0.04    0.45**   0.29    0.33*   0.05    0.04
Grammar Proficiency           0.06    0.29    0.42**   0.21    0.09   -0.12    0.03
Cloze                         0.03    0.37*   0.40**   0.16    0.29   -0.05    0.26
Key to Learner Factors: (1) Integrative Motivation (2) Instrumental Motivation (3) Expectancy of Achieving Native Speaker Fluency (4) Aptitude (words in sentences) (5) Aptitude (vocabulary memory) (6) Aptitude (sound discrimination) (7) Aptitude (sound-symbol association) N = 39
Significance levels: * p < 0.05; ** p < 0.01
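The coefficients in Table 9.4 are ordinary Pearson product-moment correlations. As an illustration only, and not a description of the authors' procedure, a table of this kind could be generated along the following lines; the variable names are assumed.

    # Illustrative sketch: Pearson correlations between acquisition measures and
    # learner-factor scores, starred at the 5% and 1% levels. Names are hypothetical.
    import pandas as pd
    from scipy.stats import pearsonr

    def correlation_table(acquisition: pd.DataFrame, factors: pd.DataFrame) -> pd.DataFrame:
        rows = {}
        for measure in acquisition.columns:
            rows[measure] = {}
            for factor in factors.columns:
                r, p = pearsonr(acquisition[measure], factors[factor])
                star = "**" if p < 0.01 else ("*" if p < 0.05 else "")
                rows[measure][factor] = f"{r:.2f}{star}"
        return pd.DataFrame(rows).T   # one row per learning measure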
Table 9.4 presents the Pearson product-moment correlations between the various measures of acquisition described in previous sections and the measures of individual learner factors. In general, the results are disappointing. None of the learner factors are related to Speech Rate; the correlations are uniformly low. This is, perhaps, not so surprising, as the measures used to assess individual learner differences were developed to determine which factors affected the learning of linguistic knowledge rather than channel control. The correlations between learner factors and measures of linguistic knowledge are generally stronger, with Expectancy of Achieving Native Speaker Fluency the best overall predictor of learning outcomes. This factor relates at the 1% level of significance to all three proficiency measures and also to Word Order Acquisition (T1). However, many of the correlations that might have been expected to reach statistical significance fail to do so. In particular, Integrative Motivation and Aptitude (Words in Sentences), two factors which previous research has shown to be significant in predicting learning outcomes, produce only low correlations, especially where Word Order Acquisition is concerned.
There are a number of explanations for these results, which need not concern us here. It is sufficient to note that for this sample of learners the individual learner factors that were investigated shed little light on the aim of the study, which was to explore which factors characterise norm-oriented as opposed to communicative-oriented learners. As might have been expected, given the knowledge-focussed direction of previous research into individual differences, none of the affective and aptitudinal measures obtained provide any insights into the personal characteristics of those learners who orientate towards the acquisition of channel control. This is clearly an area where more research is needed.

Summary and Conclusion

The main findings of the study can be summarised as follows:
(1) Cognitive style (measured as field independency) was not significantly related to any of the measures of linguistic knowledge or channel control.
(2) Measures of linguistic knowledge and channel control were unrelated, suggesting that these two aspects of SLA are independent of each other. However, acquisition of one aspect of SLA was inversely related to acquisition of the other aspect.
(3) Quantitative measures of motivation and aptitude failed to distinguish clearly those learners who were knowledge oriented from those learners who were control oriented.
The research reported in this chapter has attempted to explore an issue of considerable current interest in SLA studies. It has been concerned with identifying differences in learning style. The results suggest that the distinction between norm-oriented learners who seek to develop their knowledge of linguistic rules and communicative-oriented learners who are more concerned with acquiring the channel capacity to perform with greater fluency in the target language is a valid one. This distinction may serve as a more profitable focus of research than that between field dependency/independency, with which previous research into learning styles in SLA has been largely concerned.
Substantial future research is needed to determine whether the independence of knowledge and control in SLA applies to other learner samples. Such research should ideally obtain measures of linguistic knowledge that reflect mainstream SLA enquiry into developmental sequences (as was attempted in this study by means of the Word Order Acquisition Score) as well as more traditional measures of proficiency. Such research should also develop adequate measures of channel capacity, drawing perhaps on the work in second language productions by the Kassel team (cf. Dechert, Mohle, and Raupach, 1984). We need to know much more about the kind of competence which communicative-oriented learners develop. We can speculate that in addition to channel control (explored in this study) they will manifest greater acquisition of discourse, sociolinguistic and strategic competence (cf. Canale, 1983) than norm-oriented learners, as Schmidt's study of Wes has shown. Ideally, also, we need information about the long-term outcomes of learners with different learning styles. Which style leads to greater success in language learning? Do norm-oriented learners eventually catch up with communicative-oriented learners in other aspects of communicative competence? Do communicative-oriented learners eventually acquire adequate linguistic competence? Schmidt's study suggests that they might not do so. What of balanced learners, that is, learners who display an equal tendency for accuracy and fluency? Do they achieve the best of both worlds, acquiring both formal knowledge and satisfactory channel control? Are learning styles fixed or do they change as acquisition proceeds? These are all questions about which we know very little.
Finally, we need research that can help us to identify the personal characteristics of learners with markedly different learning styles. Here a more qualitative approach involving the use of such research techniques as learner diaries may prove more insightful in the first instance than the kind of quantitative approach reported above. Diary studies kept by 6 of the 39 learners in this study suggest that learning style is revealed with considerable clarity in the way learners respond to such factors as grammar explanations, type of instructional activity, tests and teacher correction. References Bialystok, E. and M. Sharwood Smith (1985) Interlanguage is not a state of mind: an evaluation of the construct for second language acquisition, Applied Linguistics, 6, 101-117. Brumfit, C.J. (1984) Communicative Methodology in Language Teaching. Cambridge: Cambridge University Press. Canale, M. (1983) From communicative competence to language pedagogy. In: C. Richards and R.W. Schmidt (eds), Language and Communication. London: Longman. Carroll, J. B. and S. M. Sapon (1959) Modern Language Aptitude Test. New York: The Psychological Corporation. Clahsen, H. (1985) Profiling second language development: a procedure for assessing L2 proficiency. In: K. Hyltenstam and M. Pienemann (eds), Modelling and Assessing Second Language Acquisition. Clevedon: Multilingual Matters. Dechert, H. (1984) Individual variation in language. In: H. Dechert, D. Mohle, and M. Raupach (eds), Second Language Productions. Tübingen: Gunter Narr. Dechert H., Mohle D., and M. Raupach (eds) (1984) Second Language Productions. Tübingen: Gunter Narr. Ellis, R. (1985) Understanding Second Language Acquisition. Oxford: Oxford University Press. Gardner, R. C. (1980) On the validity of affective variables in second language acquisition: conceptual, contextual, and statistical considerations, Language Learning, 30, 255-70. Hatch, E. (1974) Second language universals, Working Papers on Bilingualism, 3, 1-17. Johnston, M. (undated) L2 acquisition research: a classroom perspective. In: M. Johnston and M. Pienemann (eds), Second Language Acquisition: A Classroom Perspective. Sydney: Adult Migrant Education Service. McDonough, S. (1987) Psychology in Foreign Language Teaching. 2nd edn. London: George Allen and Unwin. Meisel, J. (1984) Strategies of second language acquisition: more than one kind of simplification. In: R. Andersen (ed), Pidginization and Creolization as Language Acquisition. Rowley, MA: Newbury House. Pimsleur, P. (1966) Pimsleur Language Aptitude Battery. New York: Harcourt, Brace and World. Schmidt, R. (1983) Interaction, acculturation, and the acquisition of communicative competence: a case study of an adult. In: N. Wolfson and E. Judd (eds), Sociolinguistics and Language Acquisition. Rowley, MA: Newbury House.
Seliger, H. W. (1980) Utterance planning and correction behavior: its function in the grammar construction process for second language learners. In: H. Dechert and M. Raupach (eds), Towards a Cross-linguistic Assessment of Speech Production. Kasseler Arbeiten zur Sprache und Literatur 7. Frankfurt am Main: Peter D. Lang. Skehan, P. (1982) Memory and motivation in language aptitude testing. Ph.D. dissertation, University of London. Witkin, H., Oltman, P., Raskin, E., and S. Karp (1971) A Manual for the Embedded Figures Test. Palo Alto, CA: Consulting Psychologists Press.
10 Comprehension of Sentences and of Intersentential Relations by 11- to 15-year-old pupils 1

DENIS LEVASSEUR and MICHEL PAGÉ
Université de Montréal

This chapter deals with the development of text comprehension abilities in pupils who are in upper primary school and secondary school. These subjects are from 11 to 15 years old. Pupils of this age are generally considered to require no further systematic training in reading, although their experience in reading texts has still not been very extensive. By systematic training in reading, we mean training to decode written discourse in order to be able to understand the meaning of a text. At the end of primary school, such systematic training in reading is coming to an end. In secondary school, when pupils are 13 and older, this training in reading is no longer given. It is assumed that by this age they are able to grasp the meaning of any text that they have to read in school.
Research on text comprehension very rarely addresses this particular age group. It deals mainly with the comprehension abilities of pupils in the first grades of primary school, and very rarely does it deal with subjects who are 12 and older. Our main interest in the research presented here (Levasseur, 1988) has been to test the hypothesis that, contrary to the general assumption, text comprehension abilities are still developing in some important way after primary school. This hypothesis is based on previous research where a preliminary attempt was made to assess comprehension of paragraphs by subjects of 12 years of age (Pagé, 1988).
Text Comprehension: A Multilevel Process

Text comprehension, as we define it, is the ability to construct a unified representation with the pieces of information given in the flow of successive sentences in a text. This definition can be made more explicit when we consider the different units of textual information and the levels of processing which correspond to these specific units.
Textual information is given to the reader in packages of increasing size. In order to distinguish between these packages, we have delineated five units of textual information. The smallest unit of textual information is the individual word. The second smallest unit is the sentence, which itself is defined by syntactic structures. The third level is made of the coreference strings which link a sequence of sentences. The specific units of information within this level are the means of cohesion, which include such devices as recurrence, proforms and articles, and coreference. The fourth level is the paragraph, whose organizational structure is signalled by sentences announcing this organization, by the specific order of the sentences, and sometimes by the explicit connectives which link the sentences. The largest unit is the complete text, which at times may be composed of only a few paragraphs and at other times may be composed of many, covering many pages.
Comprehension of textual information is defined as the cognitive process by which a reader constructs a cognitive representation of the textual information. When such a process is completed, the result should be a unified representation of the complete information in a text. This state is achieved through a continuous interaction of different levels of processing. In correspondence with the five units of textual information, we have distinguished five levels of cognitive activity:
- words of a text activate concepts or categories in the semantic memory
- sentences are processed as semantic propositions; through this process, syntactic features are transformed into semantic relations linking concepts (Frederiksen, 1975)
- coreference strings are understood when the reader is able to see that the means of cohesion are there to link some elements occurring in different successive sentences (De Beaugrande, 1980)
- the coherent structure of paragraphs is comprehended when the reader constructs the coherence relations that link the meaning of successive sentences to each other in a paragraph (Hobbs, 1982)
- comprehension of the whole text is achieved when the reader has constructed a unified representation of the whole textual information. A unified representation may be defined from different theoretical viewpoints: macrostructures (van Dijk, 1980), frames (Frederiksen, 1985), "schématisation" (Borel et al., 1983), textworld models (De Beaugrande, 1980).

Inferential activity in the processing of sentences

In the experiment reported here, we were interested, in the first place, in observing "text-based inferences", which were demonstrated by Frederiksen (1979) to be an important feature of the processing of sentences. Our aim was not to produce a detailed study of the different inferential operations by which a reader modifies the propositions of the text base when he recalls them. Rather, we were interested in evaluating, as a whole, the inferential activity we could detect by analyzing recall performance of subjects much older than those studied by Frederiksen.

Comprehension of intersentential coherence relations

In order to explain our second point of interest, which is the fourth level of processing, where coherence relations between sentences are comprehended, we need to define explicitly what coherence relations are. The theoretical concept of coherence relations, which involves relating one sentence to another in a paragraph, has not yet received much detailed study. Jerry Hobbs (1978) has, however, done some seminal work on coherence relations, which are an essential feature in the DIscourse ANAlysis (DIANA) program he is working on at the Stanford Research Institute (Hobbs, 1982).
At the local level, a coherence relation links together two successive sentences in a paragraph. Constructing a coherence relation between two sentences is more than simply recognizing the sequential connectivity that is signaled by the means of cohesion. Thus, it is more than merely recognizing the recurrence of one concept in the two sentences. Constructing a coherence relation between two sentences involves comprehending in what particular way the second sentence adds information to the preceding one.
Hobbs distinguishes four different groups of coherence relations. Two of these groups are described here to illustrate this more precisely. Both are used in describing the coherent structure of the text we used in our experiment. These groups of relations are called expansion relations and linkage relations (Hobbs, 1978). When expansion relations link two sentences, they indicate the particular way the second sentence expands the first. This expansion can be either a parallel relation, a contrast relation, an exemplification relation, or an elaboration relation.
A "parallel relation involves moving from a predication about some set of entities to the same predication about a similar set of entities" (Hobbs, 1982:231 f.). In the target text used in this experiment, an example of a parallel relation occurs between a sentence in which one of two entrances to a squirrel nest is described (it is the main entrance, located at the base of the nest, etc.) and a following sentence in which the other entrance is described (it is a small door used in case of emergency, etc.).
In a contrast relation, "the move is to the negation of the predication about similar entities, or to the same predication about entities that lie at the opposite end of a scale" (Hobbs, 1982:231). There is an example of such a contrast relation occurring in our text. The first sentence of a pair states that the squirrel is able to distinguish a few colors and the following one states that other mammals do not see colors, except monkeys.
The generalization relation "is a move from a predication about some set of entities to the same predication about classes to which they belong" (Hobbs, 1982:231). There is no case of this relation in our text. But there are many cases of the reversed relation, the exemplification relation, in which the predication is done first about a class of entities, and secondly, about a member or subset of the same class. An exemplification relation links the sentence which states that certain parts of the squirrel's anatomy are not well known, with the subsequent sentence which states that one part, which is not well known, is the eye.
The fourth relation to be defined here is the elaboration relation. It is very frequent in informative texts like the one we used in the experiment. In this particular relation, two successive sentences make a predication about the same set of entities. The move to the second sentence adds some more specific information to the information already given in the preceding sentence about the same entity. For example, a sentence in our text states that the squirrel likes to have many nests; the subsequent sentence states that some squirrels use as many as four nests in the same period of time.
Finally, we must define another relation which is classified in the group of linkage relations: this one is the explanation relation. Relations of the linkage group are "those that arise out of the need to link what the writer says that is new and remarkable with what is known or believed to be known to the reader" (Hobbs, 1982:231). The explanation relation occurs when an event, state or object present in the message is somewhat unusual and needs to be explained. In order to connect it with what the reader already knows, the writer provides a causal chain from some normal situation to what is unusual. In our text, we have a few cases of this. In one example, the first sentence states that building a nest with a perfectly watertight roof is in conformity with the habits of a squirrel. This information needs an explanation, which is provided by the next sentence: the squirrel is one of the few animals which are very sensitive to storms. Coherence relations are not explicitly signaled in the text. They must be inferred by the reader. "By choosing and ordering utterances in a particular fashion, the writer can exercise some control over this inference process by supplying or modifying the appropriate framework of their interpretation" (Hobbs, 1982:234). However, the inferential activity on the part of the reader is essential. This important point can be made more explicit by considering the following case of elaboration relations. In order to construct an elaboration relation, the reader must first identify at least one common entity about which the predication is made in the two related sentences. He must also recognize a specification relation between the predicates of the two sentences. This is sometimes an easy task, like in the example which was given for the elaboration relation. It is not difficult to build a link between the predicate "many nests" and the predicate "four" which specifies how many nests are used by a squirrel at the same time. But this example illustrates clearly what a specification relation is. In many cases, the inference process must supply implicit information which is required to build such a relation between two predicates. Coherence relations are described as links between sentence elements which can be identified by propositional analysis. It is well known that propositional analysis, which is a basic feature in cognitive psychology, is limited to the sentence itself. Hence, semantic relations identified through propositional analysis cannot help with identifying relations between sentences. The concept of coherence relation helps greatly to identify these links and to give a description of the coherent structure of paragraphs. Thus, when this approach is coupled with propositional analysis, it becomes possible to study the comprehension of packages made of two sentences and more. That is precisely what we have aimed to do in the experiment.
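One possible way of making this coupling concrete is sketched below: propositions are tagged with the sentence in which they occur, coherence relations link pairs of sentences, and the propositions embedded in a relation are simply those belonging to either of the two linked sentences. The representation and its field names are assumptions made for illustration; they are not part of the DIANA program or of the analysis reported here.

    # Illustrative representation (not the authors' or Hobbs' implementation) of
    # propositions grouped by sentence and of coherence relations linking sentences.
    from dataclasses import dataclass

    @dataclass
    class Proposition:
        sentence_id: int      # sentence in which the proposition occurs
        predicate: str
        arguments: tuple      # entities the predication is about

    @dataclass
    class CoherenceRelation:
        kind: str             # "parallel", "contrast", "exemplification", "elaboration", "explanation"
        first_sentence: int
        second_sentence: int

    def propositions_in_relation(props, relation):
        """All propositions embedded in a given coherence relation, i.e. those
        belonging to either of the two sentences the relation links."""
        linked = {relation.first_sentence, relation.second_sentence}
        return [p for p in props if p.sentence_id in linked]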
Description of the Experiment

A total of 604 subjects of three different ages (11, 13, and 15 years) were selected to read the target text in class and to write down in a text all the information which they could recall. Among these, 20 subjects were selected randomly in each of the three age groups and their performances in recalling the target text were then analyzed.
The text used in such an experiment must be written according to certain criteria to make it similar to school textbooks. The one we used is 7 paragraphs long, with each paragraph consisting of 4 or 5 sentences. A full variety of means of cohesion and of types of semantic propositions is represented. On the level of coherence, the text presents cases of each of the five different relations we have described. These properties are consistent with those which are usually found in informative texts of the descriptive type, and which are written to present different aspects of an object (Borel et al., 1983). As we have already indicated, the object described in the target text is the squirrel. Every child in Canada is familiar with this small animal: squirrels live in towns as well as in the country. It is important that the text deals with a familiar object, but to be interesting, it must add some entirely new information. Two important conditions are then satisfied: the subjects possess largely developed knowledge schemas enabling them to process the information, but they also have to expand these schemas to insert new knowledge into them.

Recall Analysis

The first type of data on which the subjects are to be compared has been obtained by contrasting the semantic propositions in the sentences of the text recalled by the subjects with those of the target text. In order to do this analysis, we have adopted Carl Frederiksen's (1975) method of propositional analysis as well as his method of contrastive analysis. This analysis precisely identifies how semantic propositions of the target text are recalled. Three different cases may occur:
R: A proposition is completely recalled. This is the case when every element of a text proposition can be identified in the recalled text.
Ri: A proposition is recalled with an inference which slightly transforms it. However, in spite of the inference, every element of a text proposition is still identifiable in the recalled text.
I: A proposition is modified by an inference. A proposition of the target text is considerably transformed by an inference, but some elements of the target text proposition can still be identified in the recalled text.
From the point of view of the general model of text processing elaborated by Frederiksen (personal communication), these three types of recall response indicate different depth levels in the processing of information. R-type responses indicate that the processing has been realized at the most shallow level, whereas I-type responses indicate that the text has been processed at the deepest level. Contrasting the target text and the subjects' performance in recalling this text on the propositional level yields the data from which we can study the amount of inferential activity occurring in the processing of the sentences.
The second type of data on which the subjects are to be compared is obtained by grouping the propositions according to the coherence relation in which they are embedded. Since there are five different coherence relations in the target text, the propositions are classified into five groups. For example, when two sentences are related by an elaboration relation, all the propositions of these two sentences are embedded in an elaboration relation and they are grouped together. All the propositions belonging to sentences related together by an elaboration relation constitute the whole group of propositions embedded in an elaboration relation. The propositions recalled by the subjects have been grouped according to this classification of the target text's propositions. For example, when a proposition recalled by a subject comes from a sentence related to another one by an elaboration relation, this proposition is located in the group of propositions embedded in an elaboration relation.

Statistical Comparisons

The statistical analyses of these data were carried out using multivariate analysis of variance with repeated measures. Following this method, the total number of propositions recalled by the subjects in each school grade has been contrasted as comparisons between subjects. In the two comparisons within subjects which have been made, the first one contrasts the number of propositions recalled in each of the five groups of propositions classified according to the coherence relations in which they are embedded. The second one contrasts the number of recalled propositions in the three types delineated as propositions recalled completely (R), recalled with an inference
(Ri) and modified by an inference (I). In this design, the interaction effects between the school grades, the types of recalled propositions and the groups of propositions classified according to the coherence relations have been analyzed by a crossed analysis. First order and second order interaction effects have been obtained through this crossed analysis.

Comparison between subjects with respect to number of propositions recalled

The comparison between subjects reveals the difference between the age groups in the total number of propositions recalled from the text. The 11-year-old pupils recall significantly fewer propositions than the 13-year-old subjects do (F=46.41, p<0.0001). Compared to the 15-year-old subjects, the 13-year-old subjects recall significantly fewer propositions (F=14.09, p<0.001). Thus, this first comparison shows a very large discrepancy between age groups in the total number of propositions recalled. This first overall result will be further considered with regard to coherence relations and types of response.

Comparison within subjects with respect to number of propositions recalled

The first comparison within subjects looks for differences in the number of propositions recalled by the subjects in each of the five groups of propositions characterized by the coherence relation in which they are embedded. Four contrasts have been made on this important matter:
C1: The number of propositions embedded in a parallel relation has been compared with the number of propositions embedded in an elaboration relation.
C2: The number of propositions embedded in parallel and elaboration relations, which move from one piece of specific information to other specific information, has been compared with that of the example relation, which moves from general information to specific.
C3: The number of propositions embedded in relations which are positive expansions (elaboration, parallel, and exemplification), and where the move adds some information, has been compared to that of contrast relations, where the move cancels some information.
C4: The number of propositions embedded in relations of the expansion group (elaboration, parallel, exemplification, and contrast) has been compared with the number of propositions embedded in an explanation relation, which belongs to the group of linkage relations in Hobbs' model (1978).
Taken altogether, subjects have recalled more propositions in the parallel relations group than they have in the group of elaboration relations (F = 10.90, p < 0.001). They have recalled more propositions embedded in the relations which move from one piece of specific information to another piece, compared with the relation which moves from general information to specific (F = 3.93, p = 0.05). They also have recalled more propositions embedded in the relations where the move adds information, compared to the one where the move cancels information (F = 39.47, p < 0.0001). The last comparison, between expansion relations and linkage relations, is not clearly significant (F = 3.53, p = 0.06).
In summing up these results, we may conclude that the number of propositions recalled by the subjects differs according to the coherence relations in which they are embedded. It is important to take into account some first order interactional effects regarding this general result. These interactions show that the tendency of the subjects to recall more propositions of the positive expansion group of coherence relations (elaboration and parallel), which was observed in the 11-year-old subjects' performance, appears to be even stronger in the performance of 13- and 15-year-old subjects. Hence, this first comparison within subjects indicates clearly that coherence relations are an important variable to take into account in the study of the development of reading comprehension abilities.

Comparison within subjects with respect to types of propositions recalled

The comparison within subjects contrasting the types of recalled propositions shows the importance of coherence relations from another point of view. The main result of this analysis is that the number of propositions which have been recalled with an inference (Ri) or modified by an inference (I) is larger than the total amount of propositions which have been recalled completely (R) (F = 30.17, p < 0.0001). However, this general result has to be interpreted according to second order interactional effects. The interaction effects clearly indicate that subjects of different ages process the propositions in different ways according to the coherence relations in which they are embedded. In the recalls produced by the 11-year-
old subjects, the number of propositions recalled completely (R) is smaller than the number of propositions recalled with an inference (Ri) or modified by an inference (I), in all the coherence relations groups of propositions. On this particular point, the older subjects present a different performance. The older the subjects are, the more they completely recall the propositions which are embedded in coherence relations of the positive expansion group (elaboration, parallel, and exemplification). Thus, from the performance of the older subjects, it seems that they are better at recalling propositions completely than younger subjects, although they still modify many propositions by their inferential activity. This observation means that inferential activity on the propositional level does not increase with age and that it may not be an indicator of the development of comprehension abilities. However, the type of coherence relations is a variable that must be taken into account here too.

Final Remarks

Three statements can be made to summarize the observations made in this experiment:
(1) there is a substantial increase in the subjects' capacity to store textual propositions over the period from 11 to 15 years of age;
(2) the type of coherence relations in which propositions are embedded is a variable which has a clear influence on the process of storing textual information;
(3) during the same period of age, there is also an increase in the subjects' capacity to store textual propositions without transforming them by inference, but this increase occurs in the case of propositions embedded in certain types of coherence relations only.
From these observations, we may conclude that coherence relations constitute an important variable in the study of the development of text comprehension abilities, and that the ability to infer links between sentences deserves to be extensively studied.
These observations have been made on performances in a recall task. To investigate abilities involved in text processing with more precision and with less expense, a method using questions is now being experimented with. Questions are designed to check the comprehension of different types of links between sentences in paragraphs. The text paragraphs used in the experiment present a varied combination of coherence relations. Every particular combination is covered by a particular paragraph structure: procedural, narrative, descriptive, comparative, etc. The questions, asked to check comprehension of the paragraph coherence structure, address information given in two or more
sentences in a paragraph. To answer these questions correctly, the subject must assemble information coming from many other sentences in the paragraph and sometimes from all of them. In this way, we expect that it will be possible to assess the comprehension of a large variety of coherent paragraph structures with precision yet within a reasonable amount of time. Note 1. Denis Levasseur's Ph.D. thesis, which is reported in this chapter, is part of a research program (FCAR Grant no EQ-2550) directed by the co-author. References Borel, M.J., Grize J.B., and D. Miville (1983) Essai de Logique Naturelle. Berne: Peter D. Lang. De Beaugrande, R. (1980) Text, Discourse and Process: Towards a Multidisciplinary Science of Texts. Norwood, NJ: Ablex Publishing Co. Frederiksen, C.H. (1975) Representing logical and semantic structure of knowledge acquired from discourse, Cognitive Psychology, 7, 371-485. Frederiksen, C.H. (1979) Discourse comprehension and early reading. In: L.B. Resnick and P. A. Weaver (eds), Theory and Practice of Early Reading. Hillsdale, NJ: Lawrence Erlbaum Associates. Frederiksen, C.H. (1985) Cognitive models and discourse analysis. In: C.R. Cooper and S. Greenbaum (eds), Written Communication Annual, vol. I: Linguistic Approaches to the Study of Written Discourse. Beverly Hills, CA: Sage. Hobbs, J.R. (1978) Why is discourse coherent? Technical Report, Stanford Research Institute, Menlo Park, CA. Hobbs, J.R. (1982) Towards an understanding of coherence in discourse. In: W.G. Lenhert and M.H. Ringle (eds), Strategies for Natural Language Processing. Hillsdale, NJ: Lawrence Erlbaum Associates. Levasseur, D. (1988) Structures de cohérence et inférences, dans le développement des habiletés en compréhension et rappel de texte informatif. Thèse de doctorat inédite, Département de psychologie, Université de Montréal. Pagé, M. (1988) The measurement of text comprehension: results of an experiment at the end of primary school. In: Elisabetta Zuanelli Sonino (ed), Literacy in School and Society: International Trends and Issues. New York: Plenum Publishing Co. van Dijk, T.A. (1980) Macrostructures: An Interdisciplinary Study of Global Structures in Discourse, Interaction, and Cognition. Hillsdale, NJ: Erlbaum.
11 Discourse Organization in Oral and Written Language: Critical Contrasts for Literacy and Schooling 1

ROSALIND HOROWITZ
University of Texas at San Antonio

In 1984, the Linguistics Association of Great Britain and the British Association for Applied Linguistics convened as a "Committee for Linguistics in Education" to discuss the higher-level differences between speech and writing and between the processing of oral and written language. There were over 80 linguists at the meeting. In his account of the event Hudson (1984:2) states that, "A decade or so ago, it would probably have been hard to muster more than a handful of linguists for a discussion of this sort; we knew very little about the relations between speech and writing, and cared even less."
In 1985, the first European Conference on Research on Learning and Instruction was held at the University of Leuven, Belgium, as a function of the European Association for Research on Learning and Instruction (EARLI), affiliated with the American Educational Research Association. This conference addressed a number of themes, including oral and written discourse processing, instructional and social interactions, and individual differences, with the intent of improving learning and instruction in school settings. Approximately 140 participants from 12 countries attended.
In 1987, the Eighth World Congress of the International Association of Applied Linguistics (AILA) met at the University of Sydney. Approximately 800 participants from 37 countries attended. Once again, a number of papers
at this meeting addressed features of discourse; some addressed oral language while others addressed written, and some made comparisons between the two (see Horowitz, 1987a). These three meetings suggest a growing interest in oral and written language contrasts, and indicate that the 21st century will give increasing attention to oral and written discourse in learning in classrooms. Research reported at these meetings demonstrates a significant broadening in perspective about language and a broadening of approaches for assessment procedures (e.g. see The Harvard Education Letter, 1988).
The interest in oral and written language contrasts is rather remarkable, given that the study of oral and written language and all that surrounds it was restricted, for most of this century, by the dictum set forth by Bloomfield (1933:21) that "writing is not language, but merely a way of recording language by means of visible marks", or by those who saw natural oral language as not worthy of serious study because it was unselfconscious and spontaneous (Halliday, 1987:55). Moreover, looking at the scientific research, I found that there has indeed been limited published study, outside of these conference events, of children's uses of oral and written language, be it inside or outside of schools. Also, little has been published on contrasts of oral and written discourse that would be useful for the assessment of language abilities and that would be critical for formal learning (Halliday, 1986).
Discourse organization is a central element of discourse processing. Whether it be as a listener or reader, speaker or writer, the individual learner in schools is repeatedly confronted with the task of constructing a mental representation of information received. There are primarily two representations of discourse organization that present an enormous challenge for many children and adolescents. First, there is the surface, linguistic representation of particular concepts and their relations. For example, cause and effect relations, as in "Sarah burned the roast. Consequently, David took her out for dinner", may be used to structure an entire discourse. Secondly, there is the social-situational representation of, for example, such a cause and effect structure. Discourse is processed not only linguistically, but also in social terms. Language users access social-situational knowledge and social models, based on repetitions of situations, drawn from memory or text, to understand written or spoken communication. For example, in the above case, the particular experiences and models formulated in memory about unsuccessful cooking will come into play in the reader's meaning system. Knowledge about social situations and a particular series of social events, such as those used in the opening and closing of talk, prove to be critical in discourse processing (see van Dijk, 1980; 1987; Goffman, 1974; Hinds, 1979; Sacks, Schegloff and Jefferson, 1974; Schegloff and Sacks, 1973).
In schools students are often required to produce and process discourse with organizational patterns in a variety of ways. They are asked to partake in structured discussions, to respond during planned recitation, to orally compose (for example, in impromptu or formal speaking), and to participate in structured cooperative learning groups or paired learning situations. They are also asked to read, from a range of genres and disciplines, texts with different organizational patterns, and to write about those texts using various organizational forms and social knowledge (Horowitz, 1985a; 1985b). The majority of what is learned in schools is acquired through oral language (speaking or listening) and written language (writing or reading), whether it be by print or talk, computers or other electronic media. Assessment of language abilities and potentials for learning must take into account the different language systems, the structure of the oral and the written discourse, and the social world they evoke.
In this chapter I will discuss (a) persisting myths about the complexity of oral and written language; (b) the different processes used in structuring and negotiating meaning in oral and written language; and (c) research studies of discourse organization which demonstrate psychological differences (including age-grade) and social-situational factors that interact with discourse organization and influence comprehension and meaning.
The studies that I refer to in this chapter are part of an international line of studies that I have begun, and are designed to show developmental and individual differences in production and processing of discourse structure. To date, studies have included 7-year-olds, 14-, 20-, and 24-year-olds (mean ages) in the United States. Altogether, I have collected almost a thousand protocols for linguistic analysis. Through these studies, I offer suggestions for assessment of oral and written discourse, and of cognitive growth.

Persisting Myths about Oral and Written Language

Most of the work on oral and written language has concentrated on adult rather than child communication. This literature identifies oral-written language differences, although opinions differ with respect to the complexity of oral and written language. The oral-written language contrasts suggested in the literature need further study based on real discourse protocols.
Oral language is typically associated with conversation that is produced in a real social context with face-to-face exchange, and grounded in interpersonal relationships that are clearly established. This language is characterized as narrative-like and episodic. The structure is linear. Cohesion is expressed through deixis (reference to people, places, or objects
outside of the discourse), but also through prosodic features of the language (pitch, stress, juncture), paralinguistic devices (nonverbal cues such as facial expressions), and hedges and repetitions. In oral language much can remain unsaid, in particular between lovers or close friends. Oral language is regarded as informal, natural, and unselfconscious. Finally, some believe that oral language is structureless and less complex than writing. Written language, on the other hand, is typically associated with the language of books and explanatory prose such as is found in schools. It is often regarded as the higher form of discourse. Writing is believed to be formal, academic, and planned. It is often portrayed as autonomous language, that is, lacking in context. Writing is highly structured and hierarchical, it lacks the linear nature of oral language. Cohesion is expressed, not by reference to outside of the text, but by reference and redundancy within the text, through signal words (e.g. connectives such as however, moreover, because, or similarly). Finally written language is supposedly detached from the receiver, and highly succinct and controlled (see Akinnaso, 1982; Hudson, 1984). However, adult oral and written language do not constitute unitary constructs. In listening to or reading oral and written language, we realize that there is much variation and sometimes characteristics overlap. Oral language may be quite formal, as when one is engaged in addressing a police officer, and written language may be quite informal, as in a love letter. Sometimes writing is intended to be fleeting, as in a quick note to oneself to stop at the bank, and oral language can be designed to be permanent, as in a marriage proposal. Sometimes oral language is used in writing. For instance, many children, and even adults, talk as they write (Dyson, 1983). This represents a mingling, inside and outside of text, of the two forms of language. There is little research which shows how oral and written language emerge and interact. However, evidence is mounting that the course of oral and written development differs by age, social groups, situations, and also between children and adults. Chafe and Tannen (1987) have proposed that adult oral language reflects involvement, while written language reflects detachment between communicators. Tannen (1982) has indicated that there are certain strategies that adult writers use that are more oral-like and others that are more written-like. She shows that differences in language result not so much from the use of oral versus written language, but are due to degree of personal involvement. A writer may be highly involved and display features typical of conversation. In contrast, lack of personal involvement will produce features typical of exposition, where the message concentrates on the ideas presented rather than the persons involved. Likewise, Brown (1982) has proposed that most
adult oral language is person oriented and argues that there is a message oriented speech that schools typically do not teach.
Children's oral language may be highly structured, depending upon its purpose and audience, or may be limited in structure and subordination due to cognitive, linguistic, or social knowledge limitations. The psychological or social context in which a child's oral or written language is produced and used in schools will also influence the degree to which structure is present. Of course, in many respects, teachers and textbooks control the type and degree of structure that children use. Finally, school language, oral or written, is different from outside-of-school oral or written language. The age of the communicator and the relationship with the teacher will determine to some extent the nature and complexity of the oral and written language used. The kinds of contexts and problem situations that are established for testing can be powerful factors for evoking certain kinds of child language. Some of this may be more oral-like and some more written-like. Some will become more or less structured, and more or less socially appropriate. Finally, teachers and testers alike must also be sensitive to social and ethnic cultural group differences associated with children's oral and written language. For example, in some cultures, there is limited oral talk to adults. An undergraduate Japanese student of mine, who was studying to be a teacher, rarely spoke in class and even in the context of my office said little. I, for some time, worried about whether or not she would be too shy to teach in a classroom. I later realized the limited speech had to do with a sign of respect to an adult teacher.

Producing Oral and Written Language

The sense of organization in writing has its roots in interpersonal relations and social experiences with dyadic conversations. Little is known at this time about when children begin to create organization in oral or written discourse. However, it appears that organizational patterns are first produced and developed through parent-child and child-child interactions in social situations. We are only now beginning to build a corpus of teacher-child talk in schools. Most research on discourse processing and production in schools has studied children's narratives or storytelling. Nonetheless, I was interested in studying children's talk, through use of arguments, for persuasion, and in dyadic communication. One reason is that there is little research on young children's ability to persuade, and in fact some researchers believe that young children are unable to write persuasion. It is often argued
that persuasive writing is difficult, because it requires linguistic and social-cognitive controls, and also knowledge that many children do not possess until at least grade four. Furthermore, to my knowledge, there are no studies of primary grade children's use of emerging organization in the development of persuasive arguments, and the structuring of those arguments in order to take control of a situation, which is so vital for their development. Although there is little research about young children's use of structure in written persuasion, there is evidence that children do develop skill with structure in oral persuasion. They make requests to parents and friends and cleverly use a range of appeals to achieve their preconceived ends.
The 32 subjects who participated in this research were second graders in a rural elementary school in Texas. Subjects were randomly assigned in counterbalanced order to an oral and written condition. Each subject completed both an oral and written task. The children were taken one at a time into a room in the high school building, adjacent to the elementary school. Subjects were asked to persuade either a mother or father (an intimate target audience) to purchase a Snoopy Sno Cone Machine.2 The experimenter presented a video-tape of the American television commercial for the Snoopy Sno Cone Machine prior to the written task and again prior to the oral task. A tape recorder was set up to record the oral discourse produced. Following Scardamalia, Bereiter, and Goelman (1982), two supplementary prompts were used in the course of the oral and the written tasks, to encourage the children to sustain their discourse. Samples of discourse were collected in November and December at the height of the Christmas holiday "request" season, because it is natural for children to orally request gifts from parents, or to write requests, for example, to Santa Claus.3
The oral and written protocols were collected and the oral protocols transcribed. The discourse was scored according to several criteria. Persuasion was examined for (a) the number of words in each protocol, (b) the number of persuasive arguments, (c) the categories of persuasive arguments used under each mode, including logic and appropriateness of appeals, and (d) the organizational structure used in the persuasions.
Children produced significantly more words in the oral mode than in their written productions (p<0.01). This finding was to be expected from previous research. For one thing, young children have not yet developed skill with writing. But also, oral language is said to be less succinct and to contain redundancies and hedges not acceptable in writing. No significant differences were found in the number of arguments used in the oral as opposed to the written discourse (p > 0.05). These children have
apparently not yet learned how to adjust the number of arguments under each condition, or the task may not have elicited this. We found significant differences in the types of arguments used across the oral mode (p<0.001), with more pleas and trades than other types of arguments. Across the written discourse significantly more pleas (p< 0.001) were used than any other type. There were also significantly more task-related arguments (p <0.05) in the oral than in the written mode. An important finding of this research is that there was considerable variation in the range of persuasive appeals that were used. Horowitz and Davis (1984) developed a classification system to assess the kind of arguments children used in the oral and written conditions. Appendix A gives a list of least-to-more complex types of arguments used by second graders to organize and support requests made under this task. We found more oral than written appeals to benefit the persuader. It may be that these children have not had the opportunity to use persuasion in writing in school. Or perhaps, they were not comfortable using certain arguments in writing, which in speaking, they were comfortable with and were willing to use. Most interesting was the nature of the discourse organizations used. We found large differences in the degree and quality of organizations used in the persuasions. The children responded to the task with different text and social-situational knowledge structures. Appendix B contains examples of the subjects' oral and written discourse. The first form was where the experimenter was addressed and the child told or wrote what they would say to a parent (see Example 1.2). The second form was a dialogue format where the child role-played the child and the parent in a conversation about the Snoopy Sno Cone Machine. The child spoke, waited, responded to an unheard answer, and then continued (see Example 1.1). The third form was a dialogue format where the children role-played both the child and the parent, interacting (see Examples 2.1, 2.2, 3.1, and 3.2). The fourth form was a kind of memo (Example 4). The fifth format was a letter directly to the intended audience (Example 5). Children varied their organizations from what might be termed a list-structure, simply listing points for why they wanted the Sno Cone Machine, to using cause-and-effect explanations as an organization form. Statistical analysis of the discourse produced by the subjects will be the subject of further research. However, the tremendous variation among subjects in responding to the task can be illustrated with the following three examples. One of the 7-year-olds employed a list structure to organize her request for a Snoopy Sno Cone Machine. Example 1.2 includes a list of seven
reasons why Maia believes that her parents should buy her a Snoopy Sno Cone Machine. Her list is neatly numbered and concise, but it is also unconnected and lacking in cohesion (as a list typically is). The transcription of her oral counterpart is less systematic and less succinct (Example 1.1). As will be shown below, the list structure is also used by 14- and 20-year-olds who are not skilled communicators. It is used particularly in writing by college freshmen, 20-year-olds who practice list making in lectures (Horowitz, 1987b). In this instance, Maia is using the list structure because she may not have other organizational patterns at her disposal. Additional samples of her writing would need to be obtained, with other discourse purposes and tasks, to determine the range of organizations that she can use and the social situations she knows about and can evoke. Maia's discourse shows that she is anticipating counter-arguments by her parent, as a good persuader should do. But she does not seem to know how to use complete "if then" constructions such as If they buy me an ice cone machine, I wouldn't say anything if it broke. Finally, Maia's written list has no misspellings, which is unusual for a 7-year-old and is rarely seen even from older subjects.

Another subject, Billie Jo (Example 2), turned his persuasion into a story and convinced his mother to go to his father's shop for money, engaging his listener and reader in a series of events that he may have observed in his family. The story form is the discourse type that he probably is most comfortable with. His oral and written protocols show limited communication skill. His sentences are incomplete, the ideas are not fully developed, and his discourse seems less mature with respect to vocabulary and syntax than many of the other samples from this 7-year-old group. Finally, Billie Jo uses dialogue initially but later switches his discourse into the narrative mode. He uses the traditional "The End" and seems to close with his winning over the parent at the end, akin to the traditional ending "and they lived happily ever after." Like many other American 7-year-olds, he has not been exposed to persuasive speaking or writing assignments in school.

The third sample is a discourse by Angie, also a 7-year-old, in the same grade and school as Maia and Billie Jo. Angie produced an expanded oral persuasive piece, 259 words long, while her written text was only 66 words in length. She created a drama between her mother and herself, conveying all of the nuances of the social situation that she imagined would appeal to her mother for the Sno Cone Machine. Angie covered a range of arguments noted in Appendix A, and she showed tremendous skill in playing the role of her mother, as well as expressing her own point of view. In the oral rendition, Angie used two voices, one to represent herself and the other to represent her mother, with control of intonation, pitch, word stress, and appropriate pauses to suit each. Of the 32 samples that we collected, her oral version was
the most interesting. In writing, Angie produces a shorter discourse. She is not yet able to use quotation marks, and does not know how to signify in writing the different voices, the tone elements, or the pitch. However, her written presentation is more concise, a tendency we would expect from more mature adult writers too. Angie uses transitional spelling, that is, her text contains traditional English spelling but also invented spelling.

We have learned from this experiment that comparisons of oral and written language are valuable for assessing and understanding an individual child's language ability and potential for growth in discourse and literacy. Comparisons offer a new assessment tool beyond standardized tests. In the case of Angie, her oral language is far more developed than her written language with respect to the number of words, types of arguments, and use of role playing and creation of social situations. Teachers or testers should expect that Angie has the potential for far more development in writing. Oral language production may well be indicative of a child's potential with written language, just as a measure of listening skill may serve as an index of potential for reading.

In this study, types of arguments used by the children to organize talk and text were identified. Children used simple appeals, such as requests and pleas, to ask for the Snoopy Sno Cone Machine. Some used appeals that were modified and beneficial to the persuader, while still very much pleading for the toy. Others, and the largest number, used trades, sometimes as arguments to counter parental objections. Trades are frequently used by 6- and 7-year-olds (see also Mishler, 1979), but more work is needed to know when this argument form emerges, how it gets modified over time, and when its use diminishes. The subjects in this study rarely presented their arguments in a logical manner with benefits for the parent or for some larger group, which suggests a concentration on self. Finally, this research also shows that discourse structures are related to social-situational contexts, recounts of those contexts, and parent-child roles. Children in this research responded to the persuasion task by calling forth in memory mental models that they hold of particular parent-child situations and dialogues. Work on children's language development must be related to their knowledge of social settings and situations.

Processing Oral and Written Language

An awareness of discourse organization not only influences the production of oral and written language, but also the ability to process and
comprehend language. Organizational patterns of discourse have been suggested to have an effect on recall (see also Horowitz, 1982). Text and teacher talk in schools are organized by patterns of list structure, compare and contrast, cause and effect, and problem and solution. Further study is needed to reveal the effects of organizational patterns on the learning process.

Meyer and Freedle (1979; 1984) conducted a study in which, while maintaining the content, they altered the organization of a text. The text they used dealt with the requirement that athletes lose particular amounts of body water. Their subjects were graduate students, age unidentified, who listened to tape recordings of this passage manipulated in four different ways with respect to the top organizational structure, also called the rhetorical structure. Their research demonstrated different effects on recall depending on the organizational framework. Their subjects produced greater recall of text organized by compare and contrast, and by cause and effect, than of text organized by list structure. This important work, coupled with other studies, has led researchers to view the list structure as a less developed structure and a less effective organizational pattern for communication.

In Horowitz (1987b) I repeated the Meyer and Freedle study with some variation. Subjects were 120 14-year-olds (mean age) in ninth grade, and 99 20-year-olds (mean age) enrolled in freshman composition classes at a university in the United States. In addition, there were 63 college students, 24-year-olds (mean age), who were juniors, sophomores, and seniors at the same university. The same procedures were followed as in Meyer and Freedle (1979; 1984) except that subjects were asked to read rather than listen to the Body Water passages. In addition, a second text was used. This was an edited version of a passage on social spiders taken from a National Science Foundation publication (1978) designed for the general public. Social Spiders was edited to resemble Body Water according to a number of criteria. It was made to be of similar length and the placement of rhetorical structures resembled that of Body Water. Both passages were scored using a prose analysis system developed by Meyer (1985). Social Spiders was organized by each of the organizational patterns used in Body Water. As a control, versions of both texts without a rhetorical top structure were used.

One of the goals of the study was to determine whether for Social Spiders the compare and contrast pattern would facilitate greater recall than the other patterns, as Meyer and Freedle had found to be the case with Body Water. Secondly, in the Meyer and Freedle study subjects were asked to listen to a text. I examined whether findings would persist with Body Water and Social Spiders under a reading task. For some students, listening to these texts might be easier than reading them, depending on their experience with
organizational patterns in written language. Finally, the previous research did not include a control text, that is, a text without a specific organizational pattern. I was interested in how well 14-, 20-, and 24-year-olds would fare in processing a text that did not have an explicit overall organization.

Appendix B contains some examples from the written discourse produced by the subjects in this study. Example 7 represents a list-structure organization produced by a 15-year-old, Dan, following the reading of the Social Spiders text. The original was organized by compare-contrast structure. The recall is accurate, but the ideas are not connected. The list structure is frequently used to begin a categorizing process when taking notes from lectures or readings. It is also, however, used by learners who are not skilled communicators. Skilled readers processing the compare-contrast structure reproduced the compare-contrast structure in their recall. Example 8 illustrates the use of the list structure by another subject, a 15-year-old, Scott, with Body Water, where the original structure was a cause-effect form. A 19-year-old, Jenny (Example 9), when given a control condition text without a top structure, started out with a list structure, but then switched to a cause-effect form and a reconstruction of the text that showed some knowledge of sociology as a domain of knowledge. Another subject, Dorothy (Example 10), a 24-year-old, produced a 2-sentence paragraph that concisely gave the gist of the Social Spiders text, showing a contrast, following the original compare-contrast text. The older subjects are more concise, controlled, and maintain the pattern read in their recall, provided the text structure is clear (see also Kintsch and Yarbrough, 1982).

Contrary to the research by Meyer and Freedle, the adversative (compare-contrast) pattern did not result in greater text recall, and an explicit rhetorical structure was not always needed. More importantly, the effect of a text structure was found to be dependent on the text in which it was placed, on the text topic, and on the reader. The question is not which structure is better or more useful for particular grades, but which structure best matches the content and purpose of the text. Finally, the 20-year-olds perceived Body Water and Social Spiders as being significantly different in memorability. They may have been aware that Social Spiders would be more difficult to socially envision. Yet the 14-year-olds saw no difference in memorability. The 20-year-olds showed greater prior knowledge than the 14-year-olds for Body Water and Social Spiders. The 14-year-olds judged the texts as more organized than the 20-year-olds did; they were also less skillful in differentiating degrees of text organization. This research demonstrates that reader judgments of text structure, and the social factors used in discourse, warrant further study, and that these variables are powerful in discourse processing.
Conclusion

The studies reported here add to a growing body of scientific research on discourse organization. Unlike previous work on functional literacy, these experiments address multiple variables that must be considered in discourse analysis. Previous studies have not considered actual structures, discourse organization, and social-situational knowledge, applied to the structures used by specific learners at particular ages. The studies discussed above examined the discourse organizations used in the production and processing of language. This work has implications for language education, cognition and learning, and testing in school settings. This is because part of what is learned, and should be tested, in schools is the ability to manipulate these discourse organizations in both oral and written language systems, in social situations and contexts. The research reported here, and other studies conducted along these lines in applied linguistics and cognitive psychology, allow us to address not only the features of discourse but also the social meanings that surround the language. Oral and written discourse represent systems of communication with sometimes unique, and other times overlapping, potentials for learning. This type of research allows us to assess individual student potential for learning across and through these systems.

Notes

1. The information included in this chapter was prepared with the gracious support of a National Academy of Education Spencer Fellowship awarded for 1985-1988. Appreciation is due also to the National Academy of Education and the Spencer Foundation in the United States. I thank Melissa Beams, Diana Starrett, and Marion Roumo for their help on this manuscript.
2. The Snoopy Sno Cone Machine is a plastic copy of the American cartoon character Snoopy's dog house. Ice cubes are dropped into the chimney of this toy dog house, are then crushed, and come out of the door and fill a paper cup. The children in the television commercial are seen pouring flavored syrup over the ice from a dispenser shaped like the cartoon character Snoopy. The commercial is shown regularly on television in the afternoon.
3. See Garvey (1975) for some discussion of requests in children's speech.

References

Akinnaso, F.N. (1982) On the differences between spoken and written language, Language and Speech, 25, 97-125.
Bloomfield, L. (1933) Language. New York: Holt, Rinehart and Winston.
Brown, G. (1982) The spoken language. In: R. Carter (ed), Linguistics and the Teacher. London: Routledge and Kegan Paul.
Chafe, W. and D. Tannen (1987) The relation between written and spoken language, Annual Review of Anthropology, 16, 383-407.
Dyson, A.H. (1983) The role of oral language in early writing processes, Research in the Teaching of English, 17, 1-30.
Garvey, C. (1975) Requests and responses in children's speech, Journal of Child Language, 2, 41-63.
Goffman, E. (1974) Frame Analysis. An Essay on the Organization of Experience. Cambridge, MA: Harvard University Press.
Halliday, M.A.K. (1986) Spoken and Written Language. Victoria: Deakin University Press.
Halliday, M.A.K. (1987) Spoken and written modes of meaning. In: R. Horowitz and S.J. Samuels (eds), Comprehending Oral and Written Language. San Diego and London: Academic Press.
Hinds, J. (1979) Organizational patterns in discourse. In: T. Givón (ed), Syntax and Semantics, Vol. 12, Discourse and Syntax. New York: Academic Press.
Horowitz, R. (1982) The limitations of contrasted rhetorical predicates on reader recall of expository English prose. Ph.D. dissertation, University of Minnesota, Minneapolis.
Horowitz, R. (1985a) Text patterns: part I, Journal of Reading, 28, 448-454.
Horowitz, R. (1985b) Text patterns: part II, Journal of Reading, 28, 534-541.
Horowitz, R. (1987a) Conference explores applied linguistics: a report on the 8th World Congress of the International Association of Applied Linguistics, Reading Today, 5, 5.
Horowitz, R. (1987b) Rhetorical structure in discourse processing. In: R. Horowitz and S.J. Samuels (eds), Comprehending Oral and Written Language. San Diego and London: Academic Press.
Horowitz, R. and R. Davis (1984) Oral and written persuasive strategies used by second graders. Paper presented at the Third International Congress for the Study of Child Language, Austin, TX.
Hudson, R. (1984) The higher level differences between speech and writing. Committee for Linguistics in Education, Working Paper No. 3. London: University College, Department of Phonetics and Linguistics.
Kintsch, W. and J.C. Yarbrough (1982) Role of rhetorical structure in text comprehension, Journal of Educational Psychology, 74, 828-834.
Meyer, B.J.F. (1985) The Organization of Prose and its Effects on Memory. Amsterdam: North Holland.
Meyer, B.J.F. and R.O. Freedle (1979) Effects of different discourse types on recall. Research Report, No. 6. Tempe, AZ: Arizona State University, Department of Educational Psychology.
Meyer, B.J.F. and R.O. Freedle (1984) Effects of discourse types on recall, American Educational Research Journal, 21, 121-143.
Mishler, E.G. (1979) "Wou' you trade cookies with the popcorn?": the talk of trades among six-year-olds. In: O. Garnica and M.L. King (eds), Language, Children and Society: The Effect of Social Factors on Children Learning to Communicate. New York: Pergamon.
National Science Foundation, The (1978) Unraveling the top arachnid, Mosaic, 9 (6), 10-18.
Sacks, H., Schegloff, E.A., and G. Jefferson (1974) A simplest systematics for the organization of turn-taking for conversation, Language, 50, 696-735.
Scardamalia, M., Bereiter, C., and H. Goelman (1982) The role of production factors in writing ability. In: M. Nystrand (ed), What Writers Know: The Language, Process, and Structure of Written Discourse. New York: Academic Press.
Schegloff, E.A. and H. Sacks (1973) Opening up closings, Semiotica, 8, 289-327.
Tannen, D. (1982) The myth of orality and literacy. In: W. Frawley (ed), Linguistics and Literacy. New York: Plenum.
The Harvard Education Letter (1988) Testing: Is there a right answer? 4, 1-4.
van Dijk, T.A. (1980) Macrostructures: An Interdisciplinary Study of Global Structures in Discourse, Interaction, and Cognition. Hillsdale, NJ: Erlbaum.
van Dijk, T.A. (1987) Episodic models in discourse processing. In: R. Horowitz and S.J. Samuels (eds), Comprehending Oral and Written Language. San Diego and London: Academic Press.

Appendix A: categories of persuasive strategies with examples from oral and written protocols

I. SIMPLE APPEALS
a. Requests
Mom, can I have a Snoopy Sno Cone Machine? (oral)
b. Pleas
Please, Mommy, please? (oral)

II. HIGHER LEVEL APPEALS (includes rationale)
a. Normative Appeals
My sister likes it, too. (written)
I think my grnmom what sum to and my ucel to. (written)
b. Beneficial to Persuader
It will be fun to play with. (oral)
I want a Snoopy Sno Cone Machine. (oral)
c. Task Related Statements and Information
Just put it (ice) in and it will come out soft. (oral)
All it does is just need a little ice to put in it and we can make a whole lot of snow cones and we can do a lot of stuff with it, too. (oral)

III. AUDIENCE DIRECTED APPEALS: TRADES
a. Negative Trades
If I don't get it, I'll beg you and if I don't get it for my birthday I'll get mad. (oral)
I'll stop bugging you if you get me the Snoopy Sno Cone Machine. (oral)
b. General (Normative) Trades
And I'll make good grades. (written)
I might get one for being good. (oral)
c. Audience Specific Trades
I won't bother you when you're washing dishes. (oral)
I'll take out the trash and make the beds and rake the yard and feed the dog and cat. (written)
You don't have to buy ice. (oral)
We can take it around with us if we're going somewhere hot like to Singapore or something, it's real hot there. (oral)

IV. APPEALS TO OFFSET COUNTER-ARGUMENTS
It won't cost much money. (written)
I wouldn't fight over it. (written)
Why don't you go over to my Dad's shop and get some money from him? (oral)

V. COMPLEX LOGICAL ARGUMENTS
a. Benefit to Persuader
But Mom, last . . . last year I had one and it broke. (implied, oral)
You no wy I want the snowpe snow knme michin becuse I like ice creme. (written)
b. Benefit to Audience
c. Benefit to Greater Group: Family or Society

Appendix B: examples of oral and written discourse
EXAMPLE 1, Maia, age 7
EXAMPLE 2, Billie Jo, age 7
EXAMPLE 3, Angie, age 7
EXAMPLE 4, P.J., age 7
EXAMPLE 5, Albert, age 7
EXAMPLE 6, Matthew, age 7
EXAMPLE 7, Dan, age 15
EXAMPLE 8, Scott, age 15
EXAMPLE 9, Jenny, age 19
EXAMPLE 10, Dorothy, age 24
12 Indeterminacy in First and Second Languages: Theoretical and Methodological Issues

ANTONELLA SORACE
University of Edinburgh

This study is concerned with the nature of nonnative intuitions of grammaticality, their development in the process of adult second language acquisition and their use in the empirical determination of interlanguage competence. One of the central issues addressed in the following discussion is the phenomenon of indeterminacy, which can be defined, in very broad terms, as the absence of a clear grammaticality status for a given language construction in the speaker's linguistic competence, and which manifests itself either in the speaker's lack of intuitions or in variability at the intuitional level. The study has both a theoretical and a methodological dimension, the main features of which can be summarized as follows:

(a) theoretically, it aims at probing similarities and differences between linguistic intuitions in native and nonnative grammars, given that the latter are, by their very nature, more open to indeterminacy than the former. What is needed is a characterization of interlanguage indeterminacy at different stages of the language acquisition process and of its multiple causes, as well as an analysis of intermediate grammaticality in native language grammars. The concept of acceptability hierarchy will be employed as a framework for the comparison between the two.

(b) methodologically, it aims at exploring problems related to the elicitation of linguistic intuitions in empirical second language
acquisition research. It is a well-known fact that the most common validity and reliability requirements are often overlooked in theoretical linguistics, particularly Chomskyan linguistics. Research on second language acquisition cannot afford to underestimate the fact that interlanguage grammars, being unstable by definition, are even more problematic objects of investigation than native grammars. The usual procedures for the elicitation of interlanguage intuitions (such as dichotomous rating tests) need to be replaced by alternative methods that take account of the built-in indeterminacy of nonnative grammars. This study suggests some possibilities in this direction. On the assumption that ranking measurements are more reliable than rating ones, three different ranking methods were tested and compared in a small-scale pilot study involving both second language learners of Italian at different levels of proficiency and a control group of native Italian speakers.

The Problem of Indeterminacy in Linguistic Intuitions

Native intuitions and linguistic theory

In recent years a number of criticisms have been raised with respect to the use of native speakers' linguistic intuitions as the main source of data for the confirmation or disconfirmation of hypotheses in linguistic theory. It should be recognized that, despite the seriousness of these objections, faith in the validity of linguistic intuitions is, at least in principle, less unconditional than it appears. Doubts in this respect were clearly expressed, for example, by Chomsky (1965:20) who pointed out that linguistic intuitions were not an "objective operational measure", but that the search for more reliable procedures was a matter of minor importance at that stage of research. And yet, twenty years later, no visible progress has been made towards more adequate methods for testing linguistic hypotheses, as is apparent in this full-length quotation from Chomsky (1986:36f.):

In actual practice, linguistics as a discipline is characterized by attention to certain kinds of evidence that are, for the moment, readily accessible and informative: largely, the judgments of native speakers. Each such judgment is, in fact, the result of an experiment, one that is poorly designed but rich in the evidence it provides. In practice, we tend to operate on the assumption, or pretense, that these informant judgments give us "direct evidence" as to the structure of the I-
language but of course this is only a tentative and inexact working hypothesis . . . in principle, evidence concerning the character of the I-language and initial state could come from many different sources apart from judgments concerning the form and meaning of expressions: perceptual experiments, the study of acquisition and deficit or of partially invented languages such as creoles, or of literary usage, or language change, neurology, biochemistry and so on . . . [but] the judgments of native speakers will always provide relevant evidence for the study of language . . . although one would hope that such evidence will eventually lose its uniquely privileged status.

While it is still a widespread practice on the part of linguists to rely on grammaticality judgements in order to support their theoretical claims, there has been a growing awareness of the fact that very little is known about the psychological nature of linguistic intuitions. A clearer understanding of the cognitive factors involved in the mental formation of linguistic intuitions and in their expression through the mediation of judgements would constitute the foundation for a more effective and informative exploitation of intuitional data. Criticisms have been mainly concerned with two issues:

(a) validity, or the relationship between (i) linguistic intuitions and grammatical competence, (ii) linguistic intuitions and acceptability judgements, and (iii) linguistic intuitions and different kinds of underlying norm;

(b) reliability, referring to (i) inconsistency among native speaker judgements, as well as to (ii) the intermediate grammaticality of certain areas of language.

These questions deserve careful consideration, since they are relevant to the analysis and understanding of nonnative linguistic intuitions.

The validity problem

If the elicitation of intuitional data is regarded as a small-scale experiment, the results of which are individual judgements, then it is crucial to ensure that such judgements actually tap the speaker's internalized grammatical competence. The assumption that there exists a correspondence between judgements and underlying grammar has been the target of several objections. Two arguments have been raised against the presupposed validity of linguistic intuitions:
(1) the capacity to have relevant linguistic intuitions may not be a reflection of grammatical competence: it may derive from a separate faculty characterized by a set of properties sui generis that are not shared by other kinds of linguistic behaviour (Bever, 1970; 1974; Snow and Meijer, 1977);

(2) even if linguistic intuitions are directly related to grammatical competence, they may be affected by other factors that are extralinguistic in nature and cannot be easily isolated. A sentence may be judged acceptable or unacceptable for reasons that have little to do with its actual grammaticality status in the competence of speakers; in other words, speakers may direct their attention towards aspects of the sentence irrelevant to the purpose of the experiment and judge something different from what they were expected to judge (Bever, 1970; Botha, 1973; Levelt, 1974).

The former argument is more fundamental than the latter, since if linguistic intuitions turned out to be totally independent of the speaker's internalized grammar it would obviously make no sense to use them as basic data for the purpose of constructing models of grammar. Although the psychological laws of the intuitional process are still poorly understood, it is nevertheless indisputable that the use of acceptability judgements and introspective reports has led to the discovery of a substantial number of significant facts about syntactic processes (see Newmeyer, 1983 on this point). These results would hardly be explainable if no connection was assumed between grammatical knowledge and linguistic intuitions. Moreover, studies addressing the question of determining the degree of consistency between acceptability judgements and linguistic performance have generally shown that the two dimensions are highly correlated (Quirk and Svartvik, 1966; Greenbaum and Quirk, 1970). This suggests that native speakers tend to rely on the same grammar for both the sentences that they accept as well-formed and those that they are able to produce. There are therefore sufficient grounds to rule out the extreme hypothesis of a separation between linguistic competence and intuitional processes.

The problem still remains that the speaker's internalized grammar is not the only system activated in the intuitional process. The interaction of the grammar with other cognitive and pragmatic systems is fully explainable within a modular conception of language, according to which the grammar is only one of a number of human cognitive systems, each governed by its own principles and each contributing to the superficial complexity of language (Newmeyer, 1983; White, 1985). Botha (1973) reserves the term "spurious" for intuitions determined or affected by extralinguistic factors, as opposed to
"genuine" intuitions, which originate from the informant's internalized grammar. A variety of factors may be at the source of spurious intuitions. To mention the most relevant: (a) perceptual strategies (Bever, 1970; 1974; Snow, 1974), as in the famous example "The horse raced past the barn fell", which tends to be judged as ungrammatical by most informants because of the tendency to take the first NP V NP sequence as the main clause; (b) context of presentation (Levelt, 1971; 1974; Snow, 1974). A sentence of dubious grammaticality is more likely to be judged as ungrammatical if placed after a set of clearly grammatical sentences, or as grammatical if placed after a set of clearly ungrammatical sentences. This suggests that judgements in isolation are very different from judgements by contrast, as will be pointed out later. Furthermore, the order in which sentences are given to informants may influence their judgements (Greenbaum, 1988). (c) pragmatic considerations (Hawkins, 1982): when faced with syntactic ambiguity, informants tend to prefer the reading that (1) represents the most frequent interpretation, and (2) requires fewer assumptions about previous discourse; (d) linguistic training (Levelt, 1974; Botha, 1973). When linguists use themselves as main informantsproducing "the theory and the data at the same time" as Labov (1972:199) puts itthere cannot be any guarantee that their judgements are not influenced by theoretical expectations. Also, linguistically naive native speakers have been found to be more normative and more confident than native speakers with a background in linguistics in some studies (Ross, 1979), and less consistent (Snow and Meijer, 1977; but this conclusion is contradicted by Bradac et al., 1980). It seems appropriate to conclude that intuitions doat least partiallyreflect knowledge of grammar, and that the interference of extragrammatical factors can be controlled, at least to a certain extent, by carefully selecting the test sentences, the test design and the informants. Even if it is assumed, for the sake of simplicity, that all extralinguistic factors can be isolated, a decision about the nature of the rules underlying genuine intuitions is not a straightforward task. In producing grammaticality judgements, speakers may unconsciously shift towards the norm they believe they should follow, and away from the norm actually governing their internalized grammar (Greenbaum and Quirk, 1970; Coppieters, 1987). It is
important to distinguish among different attitudes to usage that speakers may have. According to Greenbaum and Quirk, three potentially different but often interacting factors can be reflected by speakers' judgements: (a) beliefs about the forms they habitually use, (b) beliefs about the forms that ought to be used, and (c) willingness to tolerate usage in others that corresponds neither to their own habitual forms nor to prescriptive forms. This is a question of great relevance with respect to the intuitions of second language learners, as they often have both a metalinguistic and an interlanguage norm available, and the two may not coincide. It is also relevant with respect to bilingual/multilingual communities, in which more than one grammar (associated with varying degrees of prestige) usually coexists in speakers' competence.

A further problem is represented by the fact that intuitions and judgements are not the same, although they are often used interchangeably. Linguistic intuitions ("non reasoned feelings": Botha, 1973:174) are not easily accessible, either to the speaker's conscious mind or to the researcher: they can only be expressed through the mediation of judgements, which are linguistic descriptions and may therefore be inaccurate. This applies in particular to judgement tests requiring complex verbalizations of rules or other forms of metalinguistic statements.

The reliability problem

The concept of reliability is usually related to that of consistency: both intersubject consistency, that is, agreement among judgements produced by different informants, and intrasubject consistency, or agreement among judgements produced by the same informant in different replications of the test. There are areas of language on the acceptability of which native speakers do not agree, or do not have any intuitions. The easiest way of solving the question of intersubject consistency is to ascribe conflicting intuitions (supposedly genuine) to idiolectal or dialectal differences. In other words, speakers may disagree in their intuitions because they do not share the same grammar. According to this position, there cannot exist any built-in variation in the grammar. Ringen (1974) distinguishes between the culturalist view of language, according to which language is a cultural object or an institution, and the mentalist view, which regards language as the mental reality underlying speech behaviour. Interestingly, both views consider intersubject consistency of judgements unnecessary in linguistic research, either because members of the same speech community share the
same cultural institutions or because speakers of the same language share the same mentally internalized grammar.

Another common solution is to minimize the inconsistencies of native speakers' intuitions. Newmeyer (1983) draws a distinction between superficial and genuine disagreement about sentence grammaticality. Superficial disagreement does not concern the actual grammaticality status of a given sentence but rather its analysis: whether, for example, some fundamental characteristic of the sentence should be explained by a grammatical principle or by an extragrammatical one. Genuine disagreement, on the other hand, arises from having conflicting intuitions about the grammaticality of the sentence. According to Newmeyer, the vast majority of alleged native speakers' disagreements on data are superficial. Cases of genuine disagreement, however, are far from uncommon. Early theories of transformational grammar recognized the existence of this problem, although their solution was based on the concept of definite grammaticality. Chomsky, for example, stated (1957:14) that ". . . in many intermediate cases we shall be prepared to let the grammar itself decide, when a grammar is set up in the simplest way so that it includes the clear cases and excludes the clear non-sentences".

The alternative way of accounting for intersubject disagreement is to acknowledge that native language grammars are indeterminate and that such an indeterminacy is a characteristic of natural human languages. In this perspective, linguistic structures are not simply grammatical or ungrammatical: they may be grammatical to a degree. Models of grammar have been elaborated based on grammaticality as a relative property of sentences (fuzzy grammars). Lakoff (1973) and Mohan (1977) argue that there exists an ordinal scale of acceptability within a speech community such that all speakers are likely to agree on the rank order of acceptability values in a given set of sentences, although they may disagree on the absolute rating of individual sentences. The pattern of individual rating judgements, however, should not be random, in that it would reveal an implicational scale, the basis of which is a shared ordinal scale. In other words, it would be inconsistent for a speaker to rank sentence A as more acceptable than sentence B, but then accept B and reject A. Along the same lines, Ross (1979) suggests that a language L may be seen as consisting of an indefinite number of acceptability hierarchies, each leading away from the core to the periphery of L, and governed by implicational laws. These laws are such that (a) acceptance of a sentence at distance x from the core implies acceptance of any more central sentences along the same hierarchy (i.e. between x and the core), and (b) speakers may disagree
on the absolute acceptability values of individual sentences because they may have different acceptability thresholds on the same hierarchy. The concepts of relative grammaticality and acceptability hierarchy are important because, besides accounting for intersubject disagreement of judgements or lack of intuitions, they are also consistent with psychometric theory, according to which people are usually more accurate in producing relative rather than absolute judgements (see, for example, Nunnally, 1967: 44). This seems to be true of judgements of grammaticality, even when they involve rating isolated sentences. When a given form has close variants, informants may match the variants mentally before making a judgement (Greenbaum, 1988), or try to find a visual context in which the sentence could make sense (see Levelt et al., 1977, who suggest that grammaticality judgements tend to be faster and more positive for high imagery, or concrete, materials than for low imagery, or abstract, materials). If so, then inconsistencies in absolute acceptability ratings may be due to differing abilities of individuals to retrieve a matching variant, or to construct mental contexts. Moreover, these concepts have important implications for second language acquisition theory, which should provide an answer to the following questions: (a) how do acceptability hierarchies develop? (b) can nonnative speakers ever attain the same acceptability hierarchies as those shared by the majority of native speakers? Recent developments in linguistic theory within a Government and Binding framework account for language indeterminacy in a more satisfactory way than earlier versions of transformational grammar. As Liceras (1983) points out, there are parameters of core grammar that are fixed in a variety of ways or not fixed at all. As a result, some areas of grammar may be characterized by permanent parametric variation, which leads to variability and inconsistencies in native speakers' intuitions. Following Pateman (1985), one can say that variability is naturally compatible with Universal Grammar (UG) in that UG constraints may be more or less rigid in specifying given structures as analyses for given input patterns. The force of UG constraints is presumably stronger near the core of an acceptability hierarchy, whereas the periphery, being UG-unspecified, is a more fertile ground for variability (in the form of cultural norms, speech adaptations, individual beliefs, and conscious rationalizations about language). UG and cognitive resources nonspecific to language have a combined influence at the periphery, although the modes of this combination are largely unexplained.
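To make the implicational view of acceptability hierarchies sketched above more concrete, it can be restated as a simple consistency check: given a shared rank order of sentences from core to periphery, an informant's accept/reject pattern is consistent only if everything accepted is at least as central as everything rejected. The following sketch is purely illustrative; the sentence labels, judgement data and function name are hypothetical and are not drawn from any of the studies cited here.

```python
# Hypothetical illustration of the implicational-scale idea (Lakoff, 1973;
# Mohan, 1977; Ross, 1979): informants may disagree on absolute ratings,
# but each informant's accept/reject pattern should define a single
# threshold on the shared hierarchy (core ... periphery).

def is_implicationally_consistent(hierarchy, accepted):
    """hierarchy: sentence ids ordered from core (most acceptable) to periphery.
    accepted:  set of sentence ids the informant marked as acceptable.
    Consistent iff every accepted sentence precedes every rejected one."""
    last_accepted = max((i for i, s in enumerate(hierarchy) if s in accepted),
                        default=-1)
    first_rejected = next((i for i, s in enumerate(hierarchy) if s not in accepted),
                          len(hierarchy))
    return last_accepted < first_rejected

# A shared (hypothetical) hierarchy and two informants with different thresholds.
shared_hierarchy = ["S1", "S2", "S3", "S4", "S5"]
informant_a = {"S1", "S2", "S3"}   # stricter threshold, still consistent
informant_b = {"S1", "S2", "S4"}   # accepts S4 but rejects S3: inconsistent

print(is_implicationally_consistent(shared_hierarchy, informant_a))  # True
print(is_implicationally_consistent(shared_hierarchy, informant_b))  # False
```

On this view, two informants with different acceptability thresholds, as in Ross's account, can both be internally consistent while disagreeing on the absolute status of individual sentences; only a violation of the shared ordering itself would count as genuine inconsistency.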
Nonnative intuitions and second language acquisition theory

One often has the impression, in reading the second language acquisition literature, that principles and methodologies have been borrowed from linguistic theory without questioning their applicability to the study of interlanguage grammars. Given the fundamental difference between native and nonnative grammars, that the former are fully developed, steady states of linguistic competence (at least in their core), whereas the latter are unstable, transitional states of knowledge, it is even more surprising that very few studies have so far attempted a more precise definition of interlanguage linguistic intuitions, or of the elicitation procedures employed in the collection of intuitional data from second language learners. As with native intuitions, the assumption normally made by researchers working with nonnative intuitional data is that there exists a relationship between (a) learners' grammaticality judgements and the transitional state of their interlanguage knowledge, on the one hand, and (b) learners' judgements and their actual intuitions, on the other. As to the former, it seems reasonable to claim that, if extralinguistic variables are properly controlled, interlanguage judgements actually reflect interlanguage knowledge (on the basis of the arguments outlined above). It can be a more complex task, however, to decide about the kind of norm consulted by learners in the process of producing a judgement, particularly in a learning situation that fosters the development of metalinguistic knowledge. The elicitation of immediate judgement responses under well-defined time constraints may provide a partial solution to this problem, as suggested by the present experimental study. 1

What characterizes interlanguage grammars and distinguishes them from native grammars is the pervasiveness of indeterminacy. In many cases, the indefiniteness of interlanguage rules leads to the learner's inability to express a clear-cut judgement of grammaticality. One of the factors contributing to interlanguage indeterminacy is the permeability of interlanguage grammars, their "openness" to the penetration of other linguistic systems. The question has been debated as to whether permeability is a competence phenomenon or whether it is restricted to the performance level (Adjémian, 1976; 1982; Liceras, 1983); or whether permeability is peculiar to interlanguage but does not affect native language grammars. It will be assumed here that permeability is a crucial, though possibly not unique, property of interlanguage competence, that it is necessary in order for acquisition to take place, and that it generates indeterminacy by creating the conditions for the coexistence of more than
one rule for the same aspect of grammar. Such coexistence may involve either rules belonging to different linguistic systems or rules belonging to successive stages of interlanguage development, or both. The result is variability or indecisiveness in learners' intuitions. As Klein (1986) suggests, interlanguage grammars may be regarded as "test grammars" in which a rule is associated, at a given time T, with both a degree of confirmation, indicating the certainty with which the learner perceives his/her knowledge of the L2, and a degree of criticalness, representing the stability of that rule, in other words, whether that rule is undergoing a process of change at that time. The relation between confirmation and criticalness is not necessarily linear although, as Klein points out, weakly confirmed rules are more likely to become critical. Naturally, the degree of criticalness of rules may increase only if the interlanguage grammar is permeable to the integration of new forms.

Besides the ever-changing nature of interlanguages, which necessarily brings about diachronic variability, another cause of indeterminacy can be identified from a Universal Grammar (UG) perspective. This has to do with the fact that UG specification of core properties is probably narrower in scope and strength than in native grammars. This allows for a wider periphery and consequently more room for permeability and variation. Let us examine the latter point in more detail. UG is the only cognitive force operating in first language acquisition and it finds uncharted territory, in the presence of rich input, for fixing the parameter settings relevant to the language being learned. The second language acquisition situation, on the other hand, is such that either (or more often both) of the following conditions obtain:

(a) within the scope of UG, there is attrition between L1(Ln-) parameters and L2 parameters, so that, depending on the nature of the evidence available, L1-parameters may have to be reset, or may be left unset. Furthermore, the quantity and/or quality of the input available may be insufficient for the L2 parameters to be (re)set and for significant projections to take place;

(b) outside the scope of UG, there is competition with problem-solving cognitive resources generally available to adult learners, which may offer structural solutions to input configurations not definable in terms of UG constraints (often called "unnatural" rules). Also, it is not implausible that the availability of UG is, at least in part, neurologically reduced in adult language acquisition (see Lamendella, 1977; Felix, 1981; 1985).
This double-sided conflict may account for the fact that acceptability hierarchies for a second language never become fully determinate. Only a small portion of the core is unambiguously specified by UG, but a relatively larger portion has been (or continues to be) the battleground of conflicting parameter settings and presents long-lasting or even permanent parametric variation. Furthermore, an ample periphery remains open to the influence of different kinds of factors, both grammatical and extragrammatical in nature. This partly explains why one finds variable or inconsistent intuitions in nonnative speakers even at very advanced levels of language proficiency (Coppieters, 1987). To the extent that interlanguages are natural languages, falling within the range of possible grammars allowed by UG, learners' intuitions will also vary around a limited number of alternatives ("possible" alternatives). To the extent that interlanguages are also determined by the problem-solving cognitive faculty, learners' intuitions will take more idiosyncratic and largely unpredictable forms. Where the input presents severe limitations in quantity and variety, as in conventional instructional settings, the scope of action of the language-specific faculty is obviously restricted in that it lacks the necessary triggering evidence. Consequently, alternative structural solutions are likely to be provided either by the L1 or by cognitive resources nonspecific to language. This in turn determines (a) greater intersubject variability, since different learners may come up with idiosyncratic hypotheses; (b) greater intrasubject variation, since individual learners may formulate competing hypotheses and the conflict cannot be solved by positive confirmation or disconfirmation, as would happen in naturalistic acquisition.

Types of interlanguage indeterminacy

Successive stages of interlanguage development are characterized by different kinds of indeterminacy. It is crucial for the researcher to ascertain whether superficially similar patterns at the intuitional level are or are not derived from the same mental representations in the interlanguage grammar. In the initial stages of acquisition interlanguage grammars are indeterminate for the obvious reason that they are incomplete: learners cannot have intuitions about those constructions that are not represented in the transitional stage of their interlanguage competence. Most studies on learners' intuitions have been concerned with this type of indeterminacy (Schachter, Tyson and Diffley, 1976; Arthur, 1980; Gass, 1983, among others). 2
In a theoretically more interesting sense, the grammatical status of a given construction can become indeterminate at a more advanced stage of the acquisition process. In this case, indeterminacy is due not to lack of knowledge but rather to the learner's restructuring of knowledge within the interlanguage grammar. A rule may become critical after a period of relative stability, triggering a process of reanalysis that in turn may lead to temporary regression in performance abilities, as reported by studies on U-shaped curves in language acquisition (Kellerman, 1985). Although these U-shaped developmental patterns have generally been captured at the performance level, it is entirely plausible to assume that the temporary loss of determinacy of a given construction also involves a decrease in the learner's ability to express a definite acceptability judgement about it.

One of the interesting questions arising from the concept of acceptability hierarchy is whether nonnative speakers of a language L can ever construct acceptability hierarchies that are indistinguishable from those shared by native speakers of L. A positive answer to this question would imply that at very advanced stages of the acquisition process virtually no differences could be found between the linguistic intuitions expressed by learners and those expressed by native speakers. In other words, (very proficient) nonnative speakers would have acceptability values similar to those of native speakers with respect to a given area of the L2 grammar. Some of the available evidence suggests, however, that this is not necessarily the case and that at least certain areas of grammar may never become determinate in interlanguages. Coppieters (1987) conducted a study aimed at investigating competence differences between the intuitions of native speakers and those of near-native speakers of French about some highly productive aspects of grammar. Among these were the contrast imperfect/present perfect, the distinction between the third person pronouns il/elle and ce, and that between preposed and postposed uses of adjectives. The most striking result of his study (which was, however, questionable on methodological grounds) was the extent of the gap between native and nonnative intuitions. While native speakers seem to share a "native majority norm", with respect to which they show minimal variation, all nonnative speakers deviate from the native norm in statistically significant ways (the closest nonnative value to the majority norm is three standard deviations from it), and reveal extensive variation. Furthermore, the interpretation that nonnative speakers produced of the grammatical forms in question was on several occasions remarkably different from the interpretation offered by native speakers, even though their actual rating judgements were the same. This suggests, according to Coppieters (1987), that the two groups of informants may have developed significantly different competence grammars
for French, despite the fact that they express the same judgements of grammaticality and are nearly indistinguishable in production. Finally, nonnative speakers often lacked any clear intuitions on some of the grammatical rules investigated (in particular, the distinction perfect/imperfect) and their preference for either form was unsystematic. Given the high proficiency level of these subjects, one could assume that their interlanguage grammar has fossilized with respect to certain rules or, to put it differently, that the grammar reveals permanent parametric variation for these areas of grammar.

The Elicitation of Linguistic Intuitions

The elicitation of valid and reliable judgemental data is a complex task. As for interlanguage intuitions, the issues of validity and indeterminacy are correlated, in that it is necessary to capture indeterminacy (when this is present) in order to obtain valid judgements. Such a correlation is often overlooked in interlanguage research. Furthermore, reliability requirements assume a special significance when applied to nonnative intuitions. Our discussion of the particular methodological problems related to the collection and analysis of nonnative speakers' acceptability judgements will first deal with issues of reliability. Greenbaum (1977b) proposes four reliability criteria for native speakers' intuitions, three of which are concerned with intrasubject consistency and one with intersubject consistency. Let us briefly examine their applicability to nonnative intuitions.

(a) Replication of the same test with different subjects belonging to the same speech community. This approach leaves us with the problem of defining "the same speech community", which is difficult with respect to native speakers and impossible with respect to nonnatives. In the second language acquisition literature, however, it is often assumed that learners at the same proficiency level, or from the same language background, share common characteristics and can therefore be said to "belong" to the same group.

(b) Replication of the same test with the same subjects after a lapse of time. What needs to be defined in this case is the optimal length of time between two successive administrations of a test. The interval cannot be too short if one wants to avoid producing a learning effect in the subjects. More importantly, when working with second language learners, the interval cannot be too long since interlanguage grammars are in constant evolution. For this reason, this type of replication is unsuitable for the investigation of nonnative intuitions.
(c) Replication of the same test with the same subjects, using different but equivalent materials. As Greenbaum (1977a) points out, different lexical versions of the same sentence type should be built into experiments as a matter of course. Inconsistent responses to lexical versions may indicate either unreliable subjects, or the intrusion of syntactic or lexical differences among sentences which are irrelevant to the syntactic feature under investigation. Equivalence among lexicalizations cannot be taken for granted (except perhaps for the most common constructions), since theoretically identical sentences may turn out to be different in the informants' perception (Bradac et al., 1980). This is even more crucial in the case of nonnative informants, particularly at low levels of proficiency, who may judge lexicalizations differently because they lack knowledge either of the vocabulary or of the other syntactic aspects of the sentences.

(d) Replication of the test with the same subjects and the same test materials but asking for different kinds of measurements. The obvious question is whether different measurements actually elicit the same aspects of the subjects' judgements, or, in other words, whether they are equally valid. As will be shown below, some of the most common measurements of linguistic intuitions are not valid for the purpose of testing interlanguage grammars.

Judgemental scales and intrasubject consistency

As mentioned before, native speakers may not be able to express absolute judgements of grammaticality for the indeterminate areas of their native language, especially if they are asked to perform on a binary, or dichotomous, rating scale. Given the pervasiveness of indeterminacy in interlanguage grammars, it is apparent that dichotomous judgement tests are even more inadequate for the investigation of interlanguage competence. Nevertheless, they are the most widely employed elicitation procedure (see the report in Chaudron, 1983). Judgements obtained by means of a dichotomous scale may provide inaccurate or deceptive information about the learner's state of interlanguage competence, particularly if the object of investigation is an indeterminate structure. Let us consider the case in which a learner is asked to produce an absolute judgement ("correct" vs. "incorrect") on a construction that is not yet, or no longer, determinate in his/her interlanguage grammar. Any sentence exemplifying that construction will be marked as either correct or incorrect without having such a status in the interlanguage grammar: the learner's choice will be random. If there are different versions of the same
test sentence, judgements may or may not be inconsistent. If they are not, the researcher is left with the deceptive impression that the construction is determinate. But if judgements are inconsistent, there can be no unambiguous interpretation of such inconsistency. There is no way of deciding whether it can be traced back to random choice, for example, because the learner has no representation of the structure in question, or whether it is due to intermediate or advanced indeterminacy, generated, respectively, by reanalysis of knowledge and permanent parametric variation. The validity of judgements obtained through dichotomous scales is therefore highly questionable.3

The basic inadequacy of dichotomous judgements is not much improved if binary scales are replaced by three-point scales. The addition of a third category "not sure" or "don't know" to a dichotomous scale would seem at first sight to solve the problem of learners' random choice and uninterpretability of the consistency factor. It might be argued, in fact, that because of the presence of such a category learners would not be forced to produce a categorical judgement: they would be free to express their uncertainty by channelling their responses to the neutral category. In practice, however, one is likely to obtain invalid results because of two problems.

The first has to do with personality factors. Learners tend to fall into two major groups: those who choose the neutral category most of the time in order not to commit themselves to a definite judgement, and those who never choose it, as they feel reluctant to mark their uncertainty explicitly. A three-point scale is therefore not an adequate solution on psychological grounds.

The other problem lies in the definition of the middle category. It can be chosen either because of a state of psychological uncertainty of the subject, or because of a state of linguistic intermediate grammaticality of the construction in question. In theory it is possible for a sentence to be judged completely acceptable but with less than full confidence. But a sentence can also be judged as being of intermediate grammaticality with complete confidence. Although in practice acceptability and certainty tend to be correlated, this possibility cannot be ruled out.

It can be revealing to include a separate scale for certainty, so that learners are asked to express acceptability judgements as well as the degree of confidence with which each judgement is produced. As mentioned before, learners' perceptions of their own knowledge may be very different from their actual knowledge. It is in the researcher's interest to isolate the degree of certainty (or uncertainty) in judgements, on the assumption that acceptability and certainty are both relevant, though independent, dimensions of intuitional data. The difficulty lies in ensuring that they are clearly perceived
as distinct by the informants, certainty judgements being a kind of "meta-evaluation" about acceptability judgements. The instructions provided to the subjects are therefore crucial in an experiment that relies on two separate scales.4

Including more than three points can increase scale reliability, but subjects may find it difficult to maintain a stable and consistent criterion in their use of middle categories. Moreover, the same ambiguity problems as with three-point scales arise with respect to the interpretation of middle choices.

Ranking scales are psychologically more reliable, provided that factors such as order of presentation and sentence position are appropriately controlled in the test design. They also have greater validity, since sentences to be compared differ only with respect to the particular grammatical feature under investigation. One can therefore be confident that subjects' judgements are concerned with that feature and not with some irrelevant aspect of the sentence. Of course, two or more sentences may be equally (un)grammatical for a given subject. This cannot be captured unless the possibility is allowed to rank more than one sentence in the same position. Ranking scales can cope more adequately with interlanguage indeterminacy, since they do not require absolute judgements.5 It is for this reason that the experiment reported in this study is based on relative judgements, elicited by means of different types of ranking scales. These are briefly described below.

The most common ranking procedure is the ranking of pairs or sets of sentences. The reliability of ranking tests involving more than three sentences is not clear, although experiments with native speakers have used sets of 6 (Snow and Meijer, 1977) or even 11 sentences (Mohan, 1977). Learners at initial and intermediate stages of interlanguage development should not be expected to handle large sets of sentences with confidence. Depending on whether it is timed or not, this procedure leaves ample margin for correcting one's initial reactions. Furthermore, the larger the set of sentences is, the lower the face validity of the test.

By analogy with experiments run by Miller (1969), subsequently applied in second language acquisition research by Kellerman (1978), an alternative ranking procedure requires subjects to sort a set of cards, each representing a sentence, and distribute them in piles according to their perceived degree of acceptability. Subjects are instructed to form as many piles as they wish and then arrange them in rank order. This procedure has greater face validity than straightforward ranking, in that subjects feel free to manipulate sentences and move them around without being tied to a particular order.
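How a completed card sort is turned into scores is not specified in this passage; one plausible coding, sketched below on invented data, is to give every sentence the rank position of its pile, so that sentences sorted into the same pile receive tied scores. The function name and the example response are hypothetical, not part of the original study.

```python
# Hypothetical coding of a card-sorting response: piles are listed from
# least to most acceptable, and every sentence in a pile receives the
# pile's rank position, so sentences in the same pile are tied.

def score_card_sort(piles):
    """piles: list of lists of sentence IDs, ordered from least to most
    acceptable. Returns a dict mapping sentence ID -> rank of its pile."""
    scores = {}
    for rank, pile in enumerate(piles, start=1):
        for sentence_id in pile:
            scores[sentence_id] = rank
    return scores

# One subject's (invented) response: three piles covering six sentences.
response = [["s4", "s6"], ["s1"], ["s2", "s3", "s5"]]
print(score_card_sort(response))
# {'s4': 1, 's6': 1, 's1': 2, 's2': 3, 's3': 3, 's5': 3}
```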
Another technique, magnitude estimation, has been developed in psychophysics for the scaling of sensory attributes such as brightness and loudness, and then extended to the measurement of the intensity of opinions and attitudes (Stevens, 1966). To our knowledge, it had not been applied to measure reactions to grammatical acceptability before. After providing subjects with an answer sheet, isolated sentences are projected on a screen, one at a time. Subjects are instructed to assign any number that seems to them appropriate to the first sentence and then to assign successive numbers proportionally, in such a way that they reflect their subjective impression of sentence acceptability. Experiments with various forms of nonmetric stimuli have shown that subjects normally maintain a stable criterion in the sequence of their judgements: it is plausible to assume that a proportional criterion also underlies judgements of acceptability. The advantage of this method is that it elicits immediate reactions, without leaving time for consulting metalinguistic knowledge.

Two Experiments with Acceptability

A pilot study was conducted to investigate the development of acceptability hierarchies in the nonnative grammars of learners of Italian. The area of investigation of the Italian grammar selected for the experiment was concerned with different aspectual distinctions related to imperfectivity and perfectivity. This area was chosen not only because it presents considerable difficulty for learners of Italian, but also because it is characterized by indeterminacy in the intuitions of native Italian speakers. As such, it lends itself well to the observation and comparison of native and nonnative acceptability hierarchies. This study does not provide details about the linguistic aspects of imperfectivity, nor will it be concerned with the interpretation of the acceptability hierarchy obtained.

The study aimed at finding out:

(a) whether there exists an acceptability hierarchy in the intuitions of native Italian speakers for this particular grammatical area;

(b) whether nonnative acceptability hierarchies gradually approximate to native ones and eventually, at very advanced stages of mastery of the second language, come to overlap with them; and whether there is an increasingly strong correlation between intrasubject consistency of learners' judgements and degree of determinacy of grammatical structures in their linguistic competence;
(c) whether different ranking procedures (cards, magnitude estimation, and straightforward ranking) produce similar results and how they compare in terms of validity and reliability.

The study involved four groups of subjects: (a) 18 Italian native speakers, (b) 6 near-native speakers of Italian, (c) 6 advanced learners, and (d) 6 low intermediate learners. The native speakers were all Italians living in Edinburgh, either permanently or temporarily, in the age range 19-52. The near-native speakers were chosen on the criterion that they could pass for natives in their mastery of Italian; none was of Italian origin; all lived and worked in Edinburgh. The learners were university students of Italian at the University of Edinburgh. All nonnative speakers had English as their mother tongue. All subjects participated in the experiment on a voluntary basis.

Subjects had to judge the acceptability of 90 sentences, divided into three sets. Each set consisted of (a) 15 grammatical sentences, of which 10 represented the categories of use for the imperfective area and 5 represented the categories of use for the perfective area, and (b) 15 sentences lexically identical to the ones above but differing in the use of tense, that is, having the present perfect tense instead of the imperfect and vice versa. Although most of these sentences were ungrammatical, some of them were actually acceptable. Each set therefore included 30 sentences. The three sets were equivalent to one another in that they consisted of different lexicalizations of the same categories. The order of presentation of sentences was varied within each set and resulted from appropriate randomization procedures.

As mentioned before, subjects were asked to take three different types of judgement tests: Ranking, Cards, and Magnitude Estimation. The directionality of responses varied in that "1" meant "most acceptable" for Ranking and Cards, but "least acceptable" for Magnitude Estimation. Written instructions (in Italian for native speakers and in English for nonnative speakers) were provided at the beginning of the experiment, followed by further verbal clarification if required. The word "acceptability" was used throughout the instructions, and its interpretation was deliberately left vague in order to avoid an exclusive focus on grammatical correctness. (The instructions in English can be found in the Appendix.)

Both the native and the nonnative subjects were divided into 3 subgroups A, B, and C of 6 people each. In the case of the nonnative speakers, this division coincided with the three levels of proficiency, so that each level consisted of 6 subjects. Each subgroup received a different combination of tests and lexicalizations, so that all subjects had to judge all lexicalizations but in association with different tests. Moreover, each
subgroup was in turn divided into 3 pairs of subjects, in such a way that each pair would be presented with the three tests in a different order. This was meant to counterbalance possible effects of order of presentation and of differences among lexicalizations on subjects' responses. Subjects were tested in pairs. No time limits were imposed, but subjects generally took approximately 30 minutes to complete the battery of tests. At the end of the session, they were invited to express informal comments on the tests and on the aspects of grammar investigated.

In order to identify an acceptability hierarchy in the judgements of native speakers, the following steps were taken:

(a) the means were calculated (over 18 responses for native speakers; over 6 for each of the nonnative levels) for each sentence in both its grammatical (G) and its ungrammatical (U) versions and for the three tests. Arithmetic means were calculated for Ranking and Cards; geometric means for Magnitude Estimation (see Stevens, 1966);

(b) the three means obtained for each sentence were averaged and then rank-ordered. Since each sentence corresponded to a grammatical category, an order of categories was produced. Whether such an order can be seen as an acceptability hierarchy of course depends on the degree of intertask consistency.

The following procedures were then performed on the sentence/category means to answer the quantitative questions (an illustrative sketch of these computations is given after the list):

(c) a one-way ANOVA was applied to verify whether the four levels were actually different in terms of their acceptability judgements. This was done for each task separately;

(d) Spearman's rank order correlation coefficients were calculated among levels for each task in order to obtain the degree of consistency between native and nonnative acceptability hierarchies at different proficiency levels. This was done in pairs for all the six possible permutations (native/near-native; native/advanced; native/low intermediate; near-native/advanced; near-native/low intermediate; advanced/low intermediate); also, it was done separately for G and U categories;

(e) Spearman's rank order correlation coefficients were also calculated among tasks for each level in order to obtain the degree of intrasubject consistency. This was done separately for G and U categories. This procedure also tests whether the three methods consistently produce the same results;
(f) t-tests were performed between G and U versions for each task and for each level in order to verify whether the method actually succeeds in separating G from U sentences, and whether the ability to distinguish between G and U sentences is proportional to the proficiency level;

(g) Spearman's rank order correlation coefficients were calculated between G and U versions within each task for each level, in order to see whether there are two inverse acceptability hierarchies for G and U sentences.
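The computations in steps (a) to (g) can be sketched in a few lines of code. The fragment below is an illustration only, not the original analysis: the scores are randomly generated, the design is reduced to two proficiency levels, ten sentences, and six subjects per cell, and SciPy routines stand in for whatever statistical package was actually used.

```python
# Illustrative reconstruction of steps (a)-(g); all scores below are invented.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
tasks = ["ranking", "cards", "mag_est"]
levels = ["native", "advanced"]
# scores[task][level][version] is a (subjects x sentences) array of responses.
scores = {t: {l: {v: rng.uniform(1, 30, size=(6, 10)) for v in ("G", "U")}
              for l in levels} for t in tasks}

# (a) Per-sentence means: arithmetic for Ranking and Cards,
#     geometric for Magnitude Estimation (Stevens, 1966).
def sentence_means(task, level, version):
    data = scores[task][level][version]
    return stats.gmean(data, axis=0) if task == "mag_est" else data.mean(axis=0)

# (b) Average the three task means per sentence and rank-order them
#     to obtain an acceptability hierarchy for each level.
hierarchy = {l: stats.rankdata(np.mean([sentence_means(t, l, "G") for t in tasks],
                                       axis=0)) for l in levels}

# (c) One-way ANOVA over proficiency levels (only two here, shown for one task).
f_ratio, p_anova = stats.f_oneway(*[sentence_means("cards", l, "G") for l in levels])

# (d) Spearman's rho between the hierarchies of two levels.
rho_levels, _ = stats.spearmanr(hierarchy["native"], hierarchy["advanced"])

# (e) Spearman's rho between two tasks within one level (intrasubject consistency).
rho_tasks, _ = stats.spearmanr(sentence_means("cards", "native", "U"),
                               sentence_means("ranking", "native", "U"))

# (f) Paired t-test between G and U versions of the same sentences.
t_value, _ = stats.ttest_rel(sentence_means("cards", "native", "G"),
                             sentence_means("cards", "native", "U"))

# (g) Spearman's rho between G and U versions within a task and level.
rho_gu, _ = stats.spearmanr(sentence_means("cards", "native", "G"),
                            sentence_means("cards", "native", "U"))

print(f_ratio, rho_levels, rho_tasks, t_value, rho_gu)
```

Note that because Magnitude Estimation has the opposite directionality from Ranking and Cards, negative correlations between it and the other two tasks, as in Table 12.2 below, are to be expected.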
The remainder of the discussion will only be concerned with the results obtained for imperfectivity, leaving perfectivity aside.

One of the most obvious results of the study is that there appears to be a clear and consistent hierarchy for ungrammatical (U) sentences, but not a clear hierarchy for grammatical (G) sentences. In other words, native speakers, and to a lesser extent nonnatives, seem to be better at distinguishing among degrees of unacceptability than at distinguishing degrees of acceptability, which explains why in most cases the intratask correlation coefficients between G and U are not significantly negative, as they would be if there were two converse rank orders. This is shown by Table 12.1. One reason for this may be that most G sentences are unambiguously acceptable and therefore tend to be top-ranked. This produces the lack of an overall negative correlation, although there are individual grammatical sentences that are more indeterminate than others.

TABLE 12.1 Intratask correlation coefficients for grammatical (G) and ungrammatical (U) versions of sentences at four proficiency levels (NS = native; NNS = near-native; ADV = advanced; INT = intermediate)

                   NS        NNS       ADV       INT
Cards G/U         -0.488    -0.358    -0.148    -0.630
Ranking G/U       -0.018    -0.624    -0.473    -0.564
Mag.est. G/U      -0.297    -0.233    -0.061    -0.403

Interestingly, learners show higher negative correlation coefficients between G and U than native and near-native subjects, suggesting that perhaps they have a "black and white" vision of acceptability and tend to mark the G and U versions of each sentence in opposite ways, often by using the extremes. In several cases natives, and to some extent near-natives, judge both versions to be acceptable (or unacceptable), which obviously affects the G/U correlation. They are more sensitive to shades of (un)acceptability. Put another way, nonnatives (especially those who are still learners) tend to be more normative than natives, probably under the influence of the prescriptive grammar rules they were taught.

TABLE 12.2 Intertask correlation coefficients for grammatical (G) and ungrammatical (U) sentences at four proficiency levels (NS = native; NNS = near-native; ADV = advanced; INT = intermediate)

                       NS                    NNS                   ADV                INT
                   U          G          U          G         U        G        U        G
Cards-Ranking      0.912***   0.179      0.769**    0.683*    0.470    0.088    0.633*   0.485
Cards-Mag.est.    -0.930***  -0.418     -0.765**   -0.385    -0.197   -0.248   -0.603   -0.430
Mag.est.-Ranking  -0.894***  -0.670*    -0.899***  -0.205    -0.542   -0.448   -0.379   -0.255

*p < 0.05  **p < 0.01  ***p < 0.001

Another clear result is that there seems to be no appreciable difference between ranking methods, particularly for native speakers and at high levels of proficiency of nonnatives. Table 12.2 reports the correlation coefficients among the methods, which are again higher for U-sentences than for G-sentences. This suggests that both the Magnitude Estimation task and the Cards test are quite reliable, or at least as reliable as the common straightforward ranking procedure. Despite their usual comments ("How can I possibly remember the number I gave to the first sentence?"), native and near-native subjects did in fact maintain a stable judging criterion. As mentioned before, the main advantage of this procedure is that it elicits immediate responses, without leaving room for second thoughts and, presumably, for the intervention of conscious knowledge of grammar. That there is no difference with the other two tests could mean that native and near-native speakers do not make much use of their metalinguistic knowledge. The picture is different for learners, who obviously keep being exposed to language instruction: Cards and Ranking seem to give more consistent results both in terms of correlation and in terms of discrimination between G and U sentences, as can be seen from Table 12.1 and Table 12.3, respectively.
TABLE 12.3 Intratask t-values for grammatical (G) and ungrammatical (U) versions of sentences at four proficiency levels (NS = native; NNS = near-native; ADV = advanced; INT = intermediate)

                   NS          NNS        ADV         INT
Cards G/U         -4.461**     2.509*     3.512**    -3.389**
Ranking G/U        4.657**    -2.131     -5.210***   -4.145***
Mag.est. G/U      -4.838***    3.441**    2.904*      2.498*

*p < 0.05  **p < 0.01  ***p < 0.001

This may be due to the fact that in these two tests the subjects have the opportunity to retrieve grammar rules, which they cannot do in the Magnitude Estimation test. Therefore, at lower proficiency levels, Magnitude Estimation may tap a different kind of language knowledge from Cards and Ranking. In other words, it could rely on intuitive, as opposed to metalinguistic, knowledge. Also, it can be seen from Table 12.3 that the t-value which indicates a significant difference between G and U versions of sentences is particularly high for native speakers.

The correlation coefficients reported in Table 12.2 also indicate that the degree of intrasubject consistency generally increases with the proficiency level. This means that the grammatical structures tested become more determinate in the interlanguage grammar. The degree of consistency between native and nonnative acceptability values increases with proficiency as well, as can be seen from Table 12.4.

TABLE 12.4 Interlevel correlation coefficients between acceptability hierarchies with three methods

             CARDS                RANKING              MAG.EST.
             U         G          U          G         U         G
NS-NNS       0.673*    0.748*     0.930***   0.685*    0.809**   0.330
NS-ADV       0.673*    0.582      0.718*     0.345     0.348     0.030
NS-INT       0.285     0.515      0.173      0.091     0.036     0.036
NNS-ADV      0.588     0.536      0.842**    0.494     0.124     0.097
NNS-INT      0.515     0.179      0.224      0.297    -0.330     0.030
ADV-INT      0.527     0.574      0.470      0.033     0.633*    0.825**

*p < 0.05  **p < 0.01  ***p < 0.001

The correlation coefficients between native and near-native speakers are strongly
significant. The correlations between advanced and intermediate learners are in almost all cases higher than those between the latter and native speakers, which means that the learners at the two levels tend to be more similar to one another than to natives. As predicted, increase in determinacy implies gradual approximation to the target grammar.

The analysis of variance performed over proficiency levels for both G and U sentences confirms that there is a significant difference among subgroups of subjects with respect to rank orders in all the three methods. This is shown by Table 12.5.

TABLE 12.5 Analysis of Variance on four proficiency levels: F-ratios for grammatical (G) and ungrammatical (U) sentences in three tasks

              U           G
Cards         20.481**    13.054**
Ranking       13.807**     3.121*
Mag.est.      15.732**   110.386**

*p < 0.05  **p < 0.01

In conclusion, this study suggests a number of theoretical and methodological reasons why ranking scales are more adequate than rating scales for the purpose of eliciting linguistic intuitions from nonnative speakers, and in particular for the determination of acceptability hierarchies in interlanguage development. Ranking measurements such as Magnitude Estimation seem to be valid and reliable tests of non-reasoned intuitions of both native and nonnative speakers, although more research on a wider scale is needed.

Notes

1. This is not to imply that metalinguistic knowledge cannot be internalized and eventually govern the learner's interlanguage production. The problem of distinguishing between types of underlying norms arises when metalinguistic knowledge has not yet been internalized. It can be assumed that time constraints prevent learners from retrieving their metalinguistic rules, particularly at the initial stages of the acquisition process (see Bialystok, 1981).
2. See for example Schachter et al. (1976:70): "There are strings which the learner knows to be grammatical or ungrammatical, because he has internalized rules that allow him to make judgments on them. These are the determinate strings. But there are also strings about which the learner has no knowledge at all because, with his incomplete rule system, he does not have internalized rules which allow him to make judgments on them. These are the indeterminate strings."

3. It is often the case that most inconsistent judgements are produced by learners at intermediate stages of proficiency. See, for example, White (1985), where a dichotomous judgement test was used and five judgements of the same type (out of a total of seven) were counted as evidence of consistent behaviour: more than 45% of the subjects appeared to be "indecisive", and the highest concentration of these turned out to be at the intermediate levels. White suggests that this is ad hoc behaviour, but in fact it may well be that some of the subjects are operating on the basis of their interlanguage competence, which is indeterminate with respect to subjacency (possibly because of conflicting parameter settings). This results in real (and as such, interesting) inconsistency.

4. Bialystok and Frölich (1978) adopted as a measure of certainty an oral judgement test designed to distinguish the use of implicit knowledge from the use of explicit knowledge. Their assumption was that the degree of certainty would be directly proportional to the degree of consciousness. Judgements based on metalinguistic knowledge would therefore be associated with a high degree of certainty, whereas judgements derived from implicit knowledge would be characterized by a low degree of certainty. However, it is entirely possible to be certain of the grammaticality of a given construction without necessarily being aware of the underlying rule. Similarly, it is possible to express judgements on the basis of conscious rules without feeling confident about them.

5. It is worth pointing out, however, that if learners are asked to judge structures of which they have no knowledge at all, their responses are likely to be random, independent of the testing procedure. But this would be irrelevant to the purpose of determining learners' interlanguage competence. As Liceras (1983:136) puts it, "If it is assumed that learners do not have intuitions about their grammar, by definition they do not have knowledge of their interlanguage and a model of the interlanguage grammar cannot be constructed."

Appendix: Instructions in English

The purpose of this experiment is to measure the way in which learners of Italian judge the acceptability of some structures in Italian: in particular, how they order certain sentences with respect to one another. In making your judgements, think if a sentence "sounds" more or less acceptable than another. To the extent that you can, try to distinguish between different degrees of acceptability (or unacceptability).
Magnitude estimation

Isolated sentences will be projected on the screen in front of you, one at a time. You will have to judge the acceptability of these sentences. Give the first sentence any number you wish; then assign successive sentences numbers that are proportional to the first number you chose. For instance, if you gave the number 6 to the first sentence and you think that the second sentence is less acceptable than the first, give it a number lower than 6. Similarly, if you think that the second sentence is more acceptable than the first one, choose a number higher than 6.

Cards

You have a set of cards, each representing a sentence. Your task is to arrange the sentences in different piles according to their degree of acceptability. You may form as many piles as you wish. Once you have divided all the sentences into piles, place the piles along the tape measure in front of you, starting from 'least acceptable' on your left towards 'most acceptable' on your right. For each position you choose along the tape measure, please write the numbers corresponding to the sentences in that pile.

Ranking

You have a list of 30 sentences in front of you. Some of them are perfectly acceptable and normal, others are less acceptable. Your task is to rank these sentences according to their degree of acceptability, by giving them numbers from 1 to 30 so that:

1 = least acceptable
30 = most acceptable

You may give the same number to more than one sentence if you think they have the same degree of acceptability. You may therefore not need to use all numbers up to 30.

References

Adjémian, C. (1976) On the nature of interlanguage systems, Language Learning, 26, 297-320.
Adjémian, C. (1982) La spécificité de l'interlangue et l'idéalization des langues secondes. In: Centre de Recherche, Université de Paris VIII (ed), Grammaire Transformationelle: Théorie et Méthodologies.
Arthur, B. (1980) Gauging the boundaries of second language development: a study of learners' judgments, Language Learning, 30, 177-194.
Bever, T.G. (1970) The cognitive basis for linguistic structures. In: J.R. Hayes (ed), Cognition and The Development of Language. New York: John Wiley and Sons.
Bever, T.G. (1974) The ascent of the specious, or there's a lot we don't know about mirrors. In: D. Cohen (ed), Explaining Linguistic Phenomena. New York: John Wiley and Sons.
Bialystok, E. (1981) Some evidence for the integrity and interaction of two knowledge sources. In: R.W. Andersen (ed), New Dimensions in Second Language Acquisition Research. Rowley, MA: Newbury House.
Bialystok, E. and M. Frölich (1978) Aspects of second language learning in classroom settings, Working Papers on Bilingualism, 13, 2-26.
Botha, R.P. (1973) The Justification of Linguistic Hypotheses. The Hague: Mouton.
Bradac, J.J., Martin, L.W., Elliott, N.D., and C.H. Tardy (1980) On the neglected side of linguistic science: multivariate studies of sentence judgment, Linguistics, 18, 967-995.
Chaudron, C. (1983) Research on metalinguistic judgments: a review of theory, methods and results, Language Learning, 33, 343-377.
Chomsky, N. (1957) Syntactic Structures. The Hague: Mouton.
Chomsky, N. (1965) Aspects of The Theory of Syntax. Cambridge, MA: MIT Press.
Chomsky, N. (1986) Knowledge of Language: Its Nature, Origin and Use. New York: Praeger.
Coppieters, R. (1987) Competence differences between native and fluent nonnative speakers, Language, 63, 544-573.
Felix, S.W. (1981) On the (in)applicability of Piagetian thought to language learning, Studies in Second Language Acquisition, 3, 179-192.
Felix, S.W. (1985) More evidence on competing cognitive systems, Second Language Research, 1, 47-72.
Gass, S.M. (1983) The development of L2 intuitions, TESOL Quarterly, 17, 273-291.
Greenbaum, S. (ed) (1977a) Acceptability in Language. The Hague: Mouton.
Greenbaum, S. (1977b) The linguist as experimenter. In: F.R. Eckman (ed), Current Themes in Linguistics. New York: John Wiley and Sons.
Greenbaum, S. (1988) Good English & the Grammarian. London: Longman.
Greenbaum, S. and R. Quirk (1970) Elicitation Experiments in English. Linguistic Studies in Use and Attitude. London: Longman.
Hawkins, J.A. (1982) Constraints on modelling real-time processes: assessing the contribution of linguistics. Paper presented at the conference on Constraints on Modelling Real-Time Processes, St Maximin, France.
Kellerman, E. (1978) Giving learners a break: native language intuitions as a source of predictions about transferability, Interlanguage Studies Bulletin, 15, 59-92.
Kellerman, E. (1985) If at first you don't succeed . . . In: S.M. Gass and C. Madden (eds), Input in Second Language Acquisition. Rowley, MA: Newbury House.
Klein, W. (1986) Second Language Acquisition. Cambridge: Cambridge University Press.
Labov, W. (1972) Sociolinguistic Patterns. Philadelphia: University of Pennsylvania Press.
Lakoff, G. (1973) Fuzzy grammar and the performance-competence terminology game. In: C. Corum et al. (eds), Papers from the Ninth Regional Meeting of the Chicago Linguistic Society. Chicago: Chicago Linguistic Society.
Lamendella, J.T. (1977) General principles of neurofunctional organization and their manifestation in primary and non-primary acquisition, Language Learning, 27, 155-196.
Levelt, W.J.M. (1971) Some psychological aspects of linguistic data, Linguistische Berichte, 17, 18-30.
Levelt, W.J.M. (1974) Formal Grammars in Linguistics and Psycholinguistics. Vol. 3: Psycholinguistic Applications. The Hague: Mouton.
Levelt, W.J.M., van Gent, J.A.W.M., Haans, A.F.J., and A.J.A. Meijers (1977) Grammaticality, paraphrase and imagery. In: S. Greenbaum (ed), Acceptability in Language. The Hague: Mouton.
Liceras, J. (1983) Markedness, contrastive analysis, and the acquisition of Spanish syntax by English speakers. Ph.D. dissertation, University of Toronto.
Miller, G.A. (1969) A psychological method to investigate verbal concepts, Journal of Mathematical Psychology, 6, 169-191.
Mohan, B.A. (1977) Acceptability testing and fuzzy grammar. In: S. Greenbaum (ed), Acceptability in Language. The Hague: Mouton.
Newmeyer, F.J. (1983) Grammatical Theory: Its Limits and Its Possibilities. Chicago: University of Chicago Press.
Nunnally, J.C. (1967) Psychometric Theory. New York: McGraw-Hill.
Pateman, T. (1985) From nativism to sociolinguistics: integrating a theory of language growth with a theory of speech practices, Journal for the Theory of Social Behaviour, 15, 38-59.
Quirk, R. and J. Svartvik (1966) Investigating Linguistic Acceptability. The Hague: Mouton.
Ringen, D. (1974) Linguistic facts: a study of the empirical scientific status of transformational generative grammar. In: D. Cohen and J.A. Wirth (eds), Testing Linguistic Hypotheses. New York: John Wiley and Sons.
Ross, J.R. (1979) Where's English? In: C.J. Fillmore, D. Kempler, and W.S.-Y. Wang (eds), Individual Differences in Language Ability and Language Behavior. New York: Academic Press.
Schachter, J., Tyson, A., and F. Diffley (1976) Learners' intuitions of grammaticality, Language Learning, 26, 67-76.
Snow, C. (1974) Linguists as behavioural scientists: towards a methodology for testing linguistic intuitions. In: A. Kraak (ed), Linguistics in The Netherlands 1972-1973. Assen: Van Gorcum.
Snow, C. and G. Meijer (1977) On the secondary nature of syntactic intuitions. In: S. Greenbaum (ed), Acceptability in Language. The Hague: Mouton.
Stevens, S.S. (1966) A metric for social consensus, Science, 151, 530-541.
White, L. (1985) The acquisition of parameterized grammars: subjacency in second language acquisition, Second Language Research, 1, 1-17.
13 An Experiment in Individualization Using Technological Support

NORMA NORRISH
Victoria University of Wellington

This chapter describes some of the problems encountered in the teaching of an accelerated beginners' course in French and some of the solutions found to deal with these problems. It is an update of an article written in 1985 (Norrish, 1987).

Language teaching experts such as Van Ek, Pit Corder, Strevens and Wilkins have all stressed the importance of the learners' particular needs and the responsibility of the teacher to satisfy them. A course must be defined in terms of the needs of the students who select it. The practice of devising a course on the basis of general pedagogical principles and then slotting the learners into it is no longer acceptable. In today's learners' market I found it hard to formulate a definition which would cover the requirements of the vastly different clients enrolling in the course discussed in this chapter. The nearest I have come to it is by analogy. I quote from a New Zealand newspaper (Daily News 7/4/87):

To a question from the Committee Chairman Mr. Simon Shera, Mrs. Tatersall explained that "fish'n'chips" encompassed the normal range of deep fried foods within that gambit, such as crab sticks and hot dogs. However Mr. Shera found it difficult to accept that hot dogs came under the same definition as fish, although he conceded oysters might. But not chicken he thought.

"Fish'n'chips" are the real beginners. "Crab sticks" are mature students who studied French years ago, hated it but felt that they did not want to waste those years in which they sweated and failed. The "oysters" are those who
have taken classes run by Continuing Education, have spent holidays in France, New Caledonia, or Tahiti. They found that they could say various sentences as opening gambits but were unable to understand either the answers or the approaches of the native speakers. You might say they had clammed up . . . and wanting to put the situation right joined the course. The "hot dogs" and "the chicken" stand for the young men and women who are already good at the subject, have reached sixth and even seventh form level, are perhaps looking for a "fast-food" 6 credits and who according to Mr. Shera should not be there. However, when challenged they say that they want to keep their French going and that their workload would not permit their taking on the full first-year 12-credit course to which their qualifications would admit them.

The initial language achievement level is not the only area of disparity. The students are from many different departments in which they are majoring. They come from Accountancy, Law, Sociology, Music, Political Science, Economics, and Architecture. For some of them there is the further difficulty of knowing English only as a second language. Some of them are in their late teens, others have already retired. Some work full-time, some part-time, and some are full-time students. The first problem, therefore, lies in the wide range of ability, interest, and commitment.

Differentiated Training and Learning

The plight of the real beginners, finding themselves in immediate competition with a large number of people who already have a reasonable working knowledge of the language, deserves prime consideration. These students are also uncomfortably aware that the gap between them and the more advanced students will widen even further. The tempo of the course accelerates with the passing weeks, and the teacher's expectation is corrupted by the swinging standard set by those who have come to the course with a head start. The feeling of unfairness rankles in the mind of the learner and raises an immediate and lasting barrier to the learning process. Somehow these students, who see themselves as already disadvantaged, have to be protected in a course which was, after all, originally introduced for their benefit.

The notion of "levels" has attracted me for some time. It seems sensible, as is done in music for instance, for students to enter themselves for the grade which realistically matches their experience and potential. An adaptation of this idea led me to set up three levels within the course with a measure of flexibility between them. They are not completely interdependent, thereby
protecting the weak members in the group from competition with the strong. At the same time there is a certain dosage of content common to all three levels which ensures cohesion within the course. In this way the beginners feel some security and all students profit from the confidence which comes from making their own choice of level at which to start off. To try to offer more than three levels of entry with the possibility of interchange, within one course, is probably too difficult. Admittedly they are broad divisions but they signal a healthy departure from the norm of one level for all.

Having made this initial decision, it is logical to approach teaching method, course content definition, and assessment in the same spirit and try to facilitate the learning process for as many different students as possible. The method is to a large extent dictated by the requirements of the learners and the way they learn best. They have all had ample time during their previous educational experiences to find out which strategies work best for them. Some are most comfortable if they hear the message, others find they cannot concentrate unless they see the message in a visual form, and still others find that, whether they are listening or looking, they have to reformulate the message and write it in their own words if they are to retain the information. Some may favour a combination of all three ways.

The Language Laboratory as a Free Access Library

If the teacher wishes to take advantage of the different learning strategies of the students, it is necessary to use a method which presents material via all three channels, audio, visual, and active. In this way there is a chance of latching on to one or other of the systems which the learner has found to be the most effective for him/herself. Work which is absorbed through the ears and the eyes simultaneously and is linked to some activity of an oral or written nature seems to result in a very high retrieval rate and students regularly comment on this factor. There is, however, a problem of logistics. How can a combination of audio-visual as well as textual activities be set up in such a way that there is reasonable access for a number of individual learners, each working according to a personal timetable? The obvious answer is to use the language laboratory with its battery of machines and its facility for "library" access.

The mention of "technological support" usually stimulates considerable interest. Unfortunately there is a less favourable response when the words "language laboratory" are used. The disastrous conclusions published in the Keating Report (1963), showing that the use of the laboratory made no
difference to the effectiveness of teaching methods, created a deep negative impression in the minds of language teachers. The subsequent critiques of the findings made much less of an impact. The damage had been done. Even though a number of laboratories have had their name changed in an effort to overcome the prejudice of the last two decades, there are some teachers who have retained their distrust and dislike of language laboratories and their attitude is all too easily picked up in their classes. Fortunately none of the students in these beginners' groups has been exposed to such tutors and they are ready to accept the combination of tutor, machine, and student as an unbeatable trio. And whatever reservations may be held by some about monitored sessions in the laboratories, few would challenge their appropriateness for "library" use, and it is to this end that they are predominantly used in this course.

All the equipment needed to complete the various assignments is permanently set up and a flexible booking system enables students to take advantage of any free hours they have during the day to work on the set assignment. Enrolling students are asked to say how they feel about working regularly on the different machines carrying the assignments. All express their readiness, many with enthusiasm, to master the equipment and the material. Some of the "mature" women admit to being intimidated by the thought of working with machines, attributing this to their "conditioning". The confidence which they feel by the end of the course is, they say, an extra bonus.

The availability of so much hardware makes it possible to ensure that students are provided with adequate exposure to the language outside the minimal weekly two-hour contact with the tutor. It also makes it possible to present similar material in very different ways. The learning percentage is greater if the material to be absorbed is presented in several different ways rather than repeated several times in the same way.

One of the two skills taught in this course is reading comprehension. Vocabulary is acquired in context by the usual method of working on written texts at home. It is also learned and revised using microcomputers in the laboratories and students, particularly those who are working at the basic level, find this a most acceptable method of exposure to a wide range of words and idioms. Similarly, they work on a grammar manual at home after its presentation in class and then complete exercises on verbs using the computer.

The other skill which is taught is listening comprehension. Wherever possible the spoken and written forms are presented together though not necessarily simultaneously. The main part of the work is on tape and this is accompanied by either a visual element, or a script, or both. Reference folders are provided containing supplementary vocabulary, grammar, and cultural notes and a worksheet has to be completed as the student works
through the assignment. The visual element may be in the form of a video cassette, a film strip, a set of slides, or simply line drawings or photographs.

Besides catering for individual learning strategies this multi-sensory approach provides the added spin-off of increased tolerance in the learner. Students will concentrate for a longer spell of time when they are hearing, seeing, and doing than when occupied with only one of the senses. This is of capital importance when students are working on an individualized basis. There must be a dovetailing of activities to keep students working on their own for the length of time it takes to master the set tasks.

So the method boils down to the introduction by the teacher, in the class situation, of the material to be studied, then the study of that material by individual students working mostly on machines in the laboratories to enable each student to profit from his/her favourite learning strategy. The final phase consists of personal feedback to the student from the teacher, outside the class situation, on progress made on the set assignments.

Adaptation of Course Material to Individual Students

The concept of individualization influences not only the method of teaching one uses but also the choice of content. Students work more willingly on topics or themes to which they can personally relate than on those which are set on the basis of general interest. One of the items in a questionnaire given at the beginning of the year asks for information on the personal interests or problems of each student. Based on the answers received over the period of time the course has been run, a series of texts for reading comprehension has been collected, from which the student may select a set number on which to work during the year. The topics range from windsurfing to summer allergies.

Completion of three modules, consisting of a variety of assignments, is a major requirement in this course. The Appendix contains an example of an assignment sheet describing the first module for basic level students. The assignment sheets are arranged in the form of checklists. They specify exactly the many small tasks to be completed and provide a framework for the rewarding activity of ticking them off. For reading comprehension, there are 26 texts in all from which to choose. Each text is labelled as recommended for basic, basic/intermediate, intermediate, intermediate/advanced and advanced levels. Students who have selected to work at the advanced levels must work on either the transitional intermediate/advanced or the advanced texts. There is no restriction for other students, for whom interest in the topic may compensate for the difficulty of the level.
Students may, if they wish, work on the modules in pairs or in small groups provided that the names of those working jointly are indicated on the assignments of those concerned. Peer teaching in the class situation has proved to be successful for many of the learners and there seem to be good reasons for extending the practice to learning situations elsewhere. In the individual evaluations of the course which take place at the end of the year, many students have claimed that cooperation of this kind, particularly on tasks which are either lengthy and perhaps repetitive or those which are demanding and involve grammatical analysis, constitutes one of the most rewarding aspects of the course.

The fact that some of the students have done no French, or very little, poses the problem of the level of language. It is counterproductive to present material for study in which the level of language does not correspond to the level of interest. These are adult learners wishing to penetrate the language and thought of people of another culture. They accept that they are not equipped to deal with rapid, high-powered oral discussions or stylish, specialized, written articles. What they do expect and respond to is material with a normal measure of sophistication to match their level of maturity. Therefore a number of bilingual texts are provided which enable students to perceive, by comparing them, to what extent the two languages are similar and how far they differ from each other. The exercise also ensures that the idea of the appropriateness of a word-to-word translation is scotched very early in the course.

Individualization and Auto-Correction

It is made very clear, at the beginning of the course, that the marking of all work is part of the learning process. Therefore it is the responsibility of the students to treat the correction of their work as a positive feature contributing to the learning task and not as a scoring exercise. This is preferable, in any case, to the usual situation where much of the creative and nervous energy of the teacher is dissipated by the chore of endless marking. Inevitably it takes the tutor time to correct multiple copies of different assignments and to write in the correct version. The time delay in returning the work results in a loss of interest in the exercise by the students, so that the homing sheets do little but confirm their impressions of how well or badly the work had been done. There must be no time lag if any real learning from mistakes is to be likely.

Commuters, for example, check their crossword puzzles every day and they would be outraged if someone else did it for them and did not let them
see where they had gone wrong until a week later. By then they would have completed other puzzles and would have completely forgotten the options and dilemmas of the first set of clues. If the marking is done by the students as it is done by the commuters, that is, as soon as the answers are available and before the next task is undertaken, then there is no question of simply ticking those sections which are right. It means going through the assignment and comparing each answer with the fair copy, then referring to the question and the relevant part of the original text, and, finally, understanding the reasons for any errors or omissions.

In a system relying heavily on individualized learning and auto-correction, it is imperative, if credits are involved, that safeguards be set up. No coordinator wishes to have his or her course branded as an easy option which enables a student to beat the system and receive credit for virtually no effort. Adequate precautions must be taken. Therefore the sheets or tapes containing the correct answers are made available to the students when it is clear from the student's work that the student has at least made a reasonable attempt at the completion of the task. The marking is done under supervision either in tutorial time or at selected times in the laboratory office. The student then hands back the answer sheets or taped answers to the person supervising, and the work is immediately stored in the student's personal file. This is kept for reference to be used at the end-of-year assessment. In this way neither the "master" answer materials nor the corrected students' scripts can be casually copied by less highly motivated students wishing to complete the modules in record time with a minimum of effort.

In the middle of the year there is a two-hour test consisting of listening and reading comprehension exercises. This test serves as a diagnostic test, although I prefer to call it a diagnostic check as this appears less threatening to the learners. The objective is to check on the appropriateness of the choice of level the student made on entry to the course. The test is carefully marked but the numerical results act only as a temporary pointer and are not counted in the final assessment. The bonus of the exercise comes from the fact that the students discuss with me their strengths and weaknesses. They are able to see from detailed distribution sheets how they rate on each subtest in relation to the whole group and also to the students who have chosen the same level.

Individualized Courses and Formal Assessment

There is always a point in time at which formal assessment has to take place and in most cases a numerical ranking has to be provided. In a course
which has played down the competitive aspect of learning, it is hard to come to terms with such a different concept. But the Registry (and in many cases the student) expects a percentage mark, and some compromise must be made. Therefore the element of formal testing is an integral part of the course. A course which is designed to deal with learners at different levels in terms of its content and the amount of work required must necessarily be assessed in a way which takes account of these differences. For this course, which is internally assessed, there are three requirements.

The first of these is "terms". Each student must attend 75% of the teaching sessions. This is not just a bureaucratic decision where an acte de présence is required to ensure, in part, the granting of six credits at the end of the year. It is essential to the course on two counts. Firstly, because of the logistics involved in a programme of this kind. There are many demonstrations necessary on how to work efficiently on the machines and on how to tackle the many different kinds of assignment to be done when working on one's own. These demonstrations and explanations are given in class time and must not be missed. Secondly, and more importantly, regular and frequent teacher and peer contact makes individualization a going proposition. There must be, on the one hand, continuous feedback from the tutor to the learner on general progress and, on the other, there must be a consistent forging of links between the learners to counteract any feelings of isolation which might otherwise occur. The group dynamic facilitates a subsequent move to a head-to-head situation between teacher and student, and student and peer. The readiness of students to participate in a language course is an indication of the measure of their ability eventually to communicate with others. Such participation can be encouraged and enjoyed provided the students have a sufficiently regular contact with each other.

The second requirement concerns a series of assignments available at basic, intermediate or advanced level (see Appendix). These must be adequately completed and corrected by the students at the levels each has chosen. The assignments are divided into three modules of more or less equal length, and the student contracts on entry to the course to work on the different tasks on a "library" basis. Provided each module is finished by the pre-set deadline, there is complete flexibility for the student to schedule the individual tasks to suit other academic or personal pursuits. The average weekly workload has been worked out in accordance with an across-the-board guideline given within the University for a six-credit course at first-year level. The recommended total amount of time to be spent is six hours. Two of these are contact hours, two or three are spent working with materials
in the language laboratories, and one or two are spent working at home on reading comprehension texts and learning or revision of vocabulary and grammar. Lists of times taken by previous students to complete sample tasks are available so that students have some idea of how much time to allocate to the different kinds of tasks and activities. The number of tasks to be completed by each student is proportional to the level chosen, as the more advanced students are expected to work more quickly. There is a core of work which is common to all levels so that there is the option of moving from one level to another with a reasonable amount of the workload for that level already done. Students may change level at any time during the year, but not after a deadline set shortly before the end of the second semester. Completion of all three modules is a major element of this course.

The third requirement is the end-of-year test. This consists of a listening comprehension test, which is the same for all three levels, and a reading comprehension test. The latter is in two parts, part one being based on the work done in assignments at the three different levels and part two consisting of unseen passages for reading comprehension. These texts are selected according to the level of difficulty. A sample test is talked over in class the week before the final test. The format is explained so that the student knows what to expect and is prepared for the kind of exercises to be dealt with. The test is very demanding in that there are many questions to be answered in a limited length of time. Points are given for any correct information; no points are subtracted. In this way students are not penalized for what they do not know, but are given credit where it is due. The multiplicity of exercises ensures that the lucky speculator who is successful at "spotting" likely questions has less of an advantage than if there were only three or four areas tested. The "unseen" sections of the papers are included to evaluate how well the students can deal with the foreign language. To test only the course material gives an inadequate result, treating, as it does, only selected parts of the language. Somewhere along the line the learner must be measured against the authenticity of the unfamiliar.

In general, one would expect that students who had done the advanced work would receive an A result or a B1, intermediate students a B1 or B2, and basic students a C result. There is, of course, no guarantee that students would achieve that result simply because they have tackled the yearly workload for a particular level. The actual test and the modules must show consistency and must merit the grade for that level. If they do not, then a student will receive the lower grade. Equally, at the basic level there may be students whose tests give evidence of great progress and whose numerical percentage is a marginal B2. About two-thirds of the test is common to all three levels. Therefore, on a subtest which is identical for all students (e.g. the
listening comprehension test) there is an area where an intermediate student and a basic student may have done equally well (see Figure 13.1). The opportunity to move into the B category is available to those basic students (a) who receive 59% or more on their test, (b) whose assignments have been above average, and (c) who are prepared to do extra assignments on an intensive basis to give them the extra percentage to move into the B category. This gives some incentive to the highly motivated student entering the course at real beginners' level and for whom restriction to a C result is a depressing prospect. Finally, a student who has done really badly in the two-hour test does not necessarily fail the course. Provided that a reasonable level of commitment to the course throughout the year is substantiated on examination of the student's personal folder containing marked assignments and class work, the basic pass mark is given.
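These decision rules for a basic-level student can be expressed compactly. The Python sketch below is only illustrative: the 59% threshold, the above-average assignment condition, and the commitment safeguard are taken from the description above, but the function name, the argument names, and the assumed pass mark of 50% are inventions of this note, and the real decision is of course made by the tutor on inspection of the student's folder.

    def basic_student_result(test_percent, assignments_above_average,
                             willing_to_do_extra_work, folder_shows_commitment,
                             pass_mark=50):
        # A weak test does not automatically fail the course: the personal
        # folder of marked assignments and class work is examined first.
        if test_percent < pass_mark:
            return "basic pass" if folder_shows_commitment else "fail"
        # Promotion out of the C band for a basic-level student.
        if (test_percent >= 59 and assignments_above_average
                and willing_to_do_extra_work):
            return "B"
        return "C"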
FIGURE 13.1 Distributions of students, over 5 score point intervals, at basic, intermediate, and advanced levels on end-of-year listening comprehension test

Conclusion

This course has been offered over a period of five years. It has been successful both as a full-year course and as an intensive summer course lasting six weeks. It involved, initially, an enormous commitment of time to set up the work, reference, and answer sheets, and also the answer tapes for the
various assignments. Each year there are changes to be made, suggestions from the students to be incorporated in the scheme, and refinements to be carried out in what can be a cumbersome booking and checking system. But it has been a rewarding enterprise involving quite substantial adjustments in approach and attitude on the part of both teacher and students. The teacher has to become only one of many resources for the learner, has to abandon the powerful, autocratic role and hand over to the machines the job of providing the more varied, and, in some cases, more authentic input to the learner. The task becomes one of diagnosis of the individual's problems and potential, of triggering interest and of finding ways and means of sustaining it. The students have to be prepared to lose part of the shelter and anonymity of the large group, to take responsibility for their own weaknesses and to do something about them, and to learn how to learn alone.
Appendix: Sample assignment sheet
14 Discrete Focus vs. Global Tests: Performance on Selected Verb Structures

HARRY L. GRADMAN AND EDITH HANANIA
Indiana University

It has been suggested that a learner's performance in the target language varies depending on whether the learner's attention is focused on content, as in a natural conversation, or on form, as in a grammar-based task. This chapter reports on a preliminary investigation of variability in language performance in its relation to testing task, with reference to verb structure.

References in the literature to the subject of variability have appeared in two main areas: in writings on the monitor model, and in interlanguage studies. In his monitor model, Krashen considers that the learner may bring to bear on language production a knowledge of the rules. In other words, when individuals focus on form, they monitor their language production by applying formally learned, consciously available rules. This notion has been used to interpret differences in the reported order of acquisition of morphemes, suggesting that data elicited through discrete-point tasks would yield a different order of acquisition than data obtained otherwise (Krashen, 1981; Dulay, Burt and Krashen, 1982).

Variability in performance has also become a focus of interest in interlanguage studies. Dickerson's article (1975) on "the learner's interlanguage as a system of variable rules" drew attention to systematic variability in the acquisition of the English sound system by Japanese learners: variability related to phonological environment and to task. (Spontaneous speech was different from reading a dialogue, which was again different from reading a word list.) In subsequent articles, Tarone (1979 and 1982) suggests that a learner's interlanguage ranges along a continuum, from a "superordinate style," where attention is focused on language form, to a "vernacular or
subordinate style," where attention is focused on accomplishing a communicative task. She argues that the vernacular is the most systematic style, but she states that data collection for the study of interlanguage systems should include output in both styles. This variability in language performance is borne out in the experience of language teachers. It is not unusual for a teacher to find that students who perform well on grammar tests are unable to correctly produce the same structures in their spontaneous speech or writing. It is clear that such a variability would have particular relevance to language testing. If discrete test items that focus on linguistic form invoke conscious knowledge of rules that may not have become part of the productive system of the learner, then global tests may reflect more accurately the learner's ability to apply those rules in communicative situations. The purpose of the present work was to explore, in a controlled experimental setting, the performance of a group of learners at different levels of language proficiency on two main types of tests: discrete point, where attention is focused more on form, and global, where attention is focused more on communicative content. The area selected for testing was English verb forms. Verbs are, of course, a central part of English sentence structure. Furthermore, the various verb forms are acquired at different stages of language learning and would therefore provide a rich body of data for comparative analysis. Specifically, the study sought answers to the following questions: 1. Does performance vary according to the type of test? 2. If so, is this variation affected by level of proficiency? specific verb forms? Procedure Tests For the purpose of the investigation, three tests were prepared to measure learners' knowledge of selected verb forms, each test representing a different type of task: cloze, multiple-choice, and fill-in-the-blank. The cloze test consisted of two passages of continuous discourse from which only verbs (some, not all) had been deleted. The first passage was a
narrative, a short humorous story written in the past tense; and the second was an expository science-related passage written in the present tense. Both passages were taken from ESL texts at an intermediate level. There were 15 blanks in each passage, giving a total of 30 cloze items. The students were instructed to fill in each blank with the one word that best fitted the meaning. In order to preserve the global character of the test, the instructions did not specify that the missing words were verbs.

The multiple-choice test also consisted of 30 items which corresponded to the items on the cloze test. The items followed the usual multiple-choice format: discrete sentences, each containing a blank, were followed by four choices. An attempt was made to produce items that were as similar as possible to their cloze counterparts, eliciting the same verb structures, with attention given to syntactic and phonological environments as well.

The third test consisted of discrete sentences, each with a blank preceded by the base form of a specified verb. The students were instructed to fill in the blank with the appropriate form of the given verb, adding helping verbs or modals where necessary. Examples were given to illustrate the directions. Again, the intent was to produce items that would parallel those on the cloze test.

To sum up, the three tests each consisted of 30 items which were designed to elicit corresponding verb structures. However, each of the tests represented a different type of task. Cloze required closure through the production of an appropriate verb within the context of continuous discourse, the attention of the test takers presumably being focused more on content than on form. With multiple-choice, the task was essentially one of recognition, requiring selection of the correct form of the verb from four alternatives. In fill-in-the-blank, the task involved production, as in cloze, but since the base form of the lexical verb was given, the focus of the production task was on form. In that respect, the fill-in-the-blank test was intermediate between the other two tests in the type of task involved.

Subjects

The subjects for this study were 126 nonnative speakers of English studying at Indiana University. They were students in the Intensive English Program and in two graduate linguistics classes, and they represented several language backgrounds, including Arabic, Chinese, Japanese, Malay, Spanish, and Italian. The subjects were divided into three groups by proficiency level (low, intermediate, and advanced) on the basis of
TOEFL scores. The low group consisted of students with scores below 420, the intermediate group had students with scores between 420 and 530, and the advanced group had scores above 530.

The three tests were given at the same session in the following order: cloze passage (narrative), cloze passage (expository), fill-in-the-blank, and multiple-choice. Ample time was allotted, and the papers were collected at the completion of each task. The tests were also given to 20 native speakers of English as a reference group.

Analysis

The analysis of data was based on percent scores. The cloze test was scored for correct verb form, regardless of lexical choice. Non-verb entries were considered inapplicable and were eliminated from the calculation. This method of scoring was used to insure that the cloze scores reflected only correct use of verb form, which was the concern of this investigation, and in that respect to make the cloze scores comparable with the scores of the other two tests.

Mean scores on the three tests were compared for the whole group and for each of the three levels on all 30 items. The analysis then focused on five specific verb structures: V-ed (simple past tense), V-s (present tense, 3rd person singular), BE (present, is/are), perfective, and modal, comprising 21 items in all. The data obtained enabled comparison of performance across task for the whole group, for each level, and for each of the five verb forms. In order to determine whether differences in performance on the tests were statistically significant, the t-test was applied to the mean scores, taken in pairs. The results of the analysis are presented below.
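As an illustration of the scoring rule just described, the Python sketch below computes a percent score for verb form on the cloze task, with non-verb entries excluded from the calculation. The function name and the representation of responses as (is_verb, form_correct) pairs are assumptions introduced here for illustration, not part of the original study.

    def cloze_verb_form_score(responses):
        # responses: one (is_verb, form_correct) pair per blank; entries that
        # are not verbs are treated as inapplicable and dropped from the
        # denominator, so the score reflects correct verb form only.
        applicable = [form_correct for is_verb, form_correct in responses if is_verb]
        if not applicable:
            return None
        return 100.0 * sum(applicable) / len(applicable)

    # e.g. cloze_verb_form_score([(True, True), (True, False), (False, False)])
    # gives 50.0: the non-verb entry is ignored.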
Results

Table 14.1 summarizes the overall results. It gives the mean percent scores for all 30 items, distributed by test type and by proficiency level. The figures show that the results for cloze and fill-in-the-blank tests are quite similar at each of the three levels, but that multiple-choice scores are significantly higher in all cases. Table 14.2 gives the corresponding results for the 21 items pertaining to the five specific verb structures that we examined.

TABLE 14.1 Overall Results by Level (30 Items)

                 Mean Score %
Level      N     CL     MC     FB
Adv.      48   85.9   90.7   86.4
Int.      46   68.0   77.4   66.0
Low       32   43.7   56.6   42.8
TOTAL    126   68.7   77.2   67.8

Differences are significant at p < 0.01 at all levels for MC/CL and for MC/FB (except Adv., at p < 0.05).
CL = cloze; MC = multiple-choice; FB = fill-in-the-blank

TABLE 14.2 Overall Results by Level (21 Items)

                 Mean Score %
Level      N     CL     MC     FB
Adv.      48   85.4   93.1   87.5
Int.      46   70.0   82.9   67.3
Low       32   43.3   60.8   42.4
TOTAL    126   69.1   81.2   68.7

Differences are significant at p < 0.01 at all levels for MC/CL and for MC/FB.
CL = cloze; MC = multiple-choice; FB = fill-in-the-blank

As the Table shows, the overall results for these 21 items parallel the results in Table 14.1. Here, too, cloze and fill-in-the-blank scores are similar, but multiple-choice scores are significantly higher. However, it is also evident that the extent of the difference varies according to proficiency level. For the advanced group, multiple-choice scores are a little higher than cloze and fill-in-the-blank scores. The differences are greater for the intermediate group and much more marked for the lower group. In other words, score differences between multiple-choice and the other two tests are most pronounced for the weakest group of students and, as might be expected, differences are smallest for the advanced group.
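The chapter reports only that "the t-test was applied to the mean scores, taken in pairs", without specifying the exact variant used. A minimal sketch of one plausible reading, a paired t statistic computed over the same students' scores on two tasks, is given below; the function name and the use of the sample standard deviation are assumptions made here for illustration.

    from math import sqrt
    from statistics import mean, stdev

    def paired_t(scores_a, scores_b):
        # scores_a and scores_b are percent scores for the same students on
        # two tasks (e.g. multiple-choice vs. cloze); the statistic tests
        # whether the mean difference departs reliably from zero.
        diffs = [a - b for a, b in zip(scores_a, scores_b)]
        n = len(diffs)
        return mean(diffs) / (stdev(diffs) / sqrt(n))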
The question now arises as to whether the verb forms, taken individually, reflect the same differences as do the overall data. Table 14.3 gives the relevant data for each of the five specific verb forms, but for all levels combined.

The results that are presented in Table 14.3 reveal some variability among the various structures examined. The significant differences noted in the previous Table between multiple-choice and the other two tests do not appear here in the case of V-ed. Furthermore, an overall pattern of variability emerges. For the V-ed structure, there are hardly any differences among the tests. The differences between multiple-choice and the other two are more pronounced for the V-s and BE structures, and are even greater for perfectives and modals. It was particularly interesting for us to find that this sequence of increasing variability as one goes down the list of verb structures parallels the sequence of decreasing overall scores (that is, increasing difficulty), as can be seen in the last column of Table 14.3. Here, the mean scores for all the tests combined showed that the whole group performed best on V-ed, followed by BE and then V-s, and performed worst on perfectives and modals. This observation suggests that a relationship may exist between patterns of variability in test scores and the linguistic difficulty of a particular verb structure.

TABLE 14.3 Overall Results by Verb Structure (N = 126)

                    Mean Score %
Structure       CL     MC     FB   All Tests
V-ed          88.9   86.2   85.4     86.8
V-s           60.0   77.0   68.2     68.4
BE            67.8   82.9   77.0     75.9
Perfective    58.8   76.7   51.4     62.3
Modal         42.2   77.4   46.4     55.3

Differences are significant for MC/CL at p < 0.01 for all structures except V-ed. Differences are significant for MC/FB at p < 0.01 only for perfectives and modals.
V-ed = simple past tense, both regular and irregular; V-s = present tense, 3rd person singular; BE = present, is/are
The question to consider now is how this variability by structure is related to the learners' level of proficiency. The next tabulation of data, Table 14.4, shows how the test scores are distributed by verb structure and by proficiency level. Only three sets of data are given, as examples.

TABLE 14.4 Results by Level for Three Selected Structures

Structure: V-ed              Mean Score %
Level      N     CL     MC     FB
Adv.      48   98.9   94.8   95.6
Int.      46   92.2   88.9   85.6
Low       32   69.4   69.5   69.8
Differences are not significant: p > 0.01

Structure: V-s               Mean Score %
Level      N     CL     MC     FB
Adv.      48   86.1   90.6   88.0
Int.      46   57.0   79.4   65.6
Low       32   25.0   53.0   43.2
Differences are significant at p < 0.01 for CL/MC at the two lower levels.

Structure: Modal             Mean Score %
Level      N     CL     MC     FB
Adv.      48   63.5   91.5   70.8
Int.      46   38.0   73.9   44.9
Low       32   16.5   60.9   11.7
Differences are significant at p < 0.01 for all levels for MC/CL and MC/FB.

If we look at the easiest structure, V-ed, we find that, regardless of the task, there are no significant differences in test scores, at any of the three levels. On the other
hand, if we look at the more difficult structures examined, perfectives and modals (only modals are shown in Table 14.4), we find that the multiple-choice scores are significantly higher than the other two, at all levels. Again the differences are most pronounced for the weakest group of students. Table 14.4 also gives an example of the two structures, V-s and BE, that are intermediate in difficulty. Here there is mixed variability. Significant differences between multiple-choice and cloze appear at the two lower levels of proficiency for V-s (shown in Table 14.4), and at the two upper levels of proficiency for BE (not shown).
FIGURE 14.1 Overall results by Student Level (data taken from Table 14.2, mean score %)

The above set of results is also illustrated with bar diagrams in Figures 14.1 and 14.2, the first showing overall results from Table 14.2, and the second the results from Table 14.4. It is clear from Figure 14.1 that the cloze and fill-in-the-blank tests yield similar scores, both being lower than multiple-choice scores, and that the differences are greater at the lower levels of proficiency. With respect to verb structures, Figure 14.2 demonstrates the pattern of data in Table 14.4: no significant differences for the easiest structure (V-ed), significant differences only at the two lower levels for the intermediate structure (V-s), and significant differences throughout for the difficult structure (modals).

At this point, it is useful to examine the variability in student performance in relation to the contrasting features that each of the testing tasks represents. Cloze and fill-in-the-blank differ mainly in one set of features: GLOBAL, with attention focused on content, vs. DISCRETE, with attention focused on form.
FIGURE 14.2 Results by Verb Structure and Student Level (data taken from Table 14.4, mean score %)

Our results show that, although differences are not substantial, performance is consistently better on the discrete focus task (fill-in-the-blank), except for the easiest structure V-ed, where all results were very close, and for the more difficult verb structures at the low proficiency level. Presumably, students at this low level have not done much class work on perfectives and modals and therefore do not readily recognize sentential cues that elicit these structures.
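The pairwise comparisons that follow rest on two contrasting features per task. Purely as a reading aid, one might record them in a small data structure like the Python sketch below; the dictionary layout, the key names, and the helper function are inventions of this note, not part of the original study.

    # Each task characterised by the two features discussed in the text.
    TASK_FEATURES = {
        "cloze":             {"focus": "global",   "response": "production"},
        "fill-in-the-blank": {"focus": "discrete", "response": "production"},
        "multiple-choice":   {"focus": "discrete", "response": "recognition"},
    }

    def contrasting_features(task_a, task_b):
        # Features on which two tasks differ; e.g. cloze vs. multiple-choice
        # differ on both focus and response, mirroring the discussion below.
        a, b = TASK_FEATURES[task_a], TASK_FEATURES[task_b]
        return [feature for feature in a if a[feature] != b[feature]]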
With respect to fill-in-the-blank and multiple-choice, both of these tests represent discrete tasks focusing on form. The contrast between them concerns the feature of production vs. recognition. Fill-in-the-blank requires production, whereas multiple-choice requires recognition of the correct form. Differences in scores for these two tests are greater than they were in the first case (cloze/fill-in-the-blank), and they are apparent in all structures (except for V-ed, as noted above).

The third pair of tasks is cloze and multiple-choice. Here the contrast combines the two main features: global vs. discrete, and production vs. recognition. Cloze may be regarded as a global and production task; multiple-choice as a discrete focus and recognition task. Our results have shown that, among the three tests, differences in scores are greatest between cloze and multiple-choice, and that these differences are highly significant.

The above discussion can be summarized in the following points. On the whole, cloze and fill-in-the-blank tests, which differ with respect to global vs. discrete task, show only small differences in test scores. On the other hand, multiple-choice and fill-in-the-blank tests, which differ with respect to recognition vs. production, show considerably greater differences. This contrast suggests that the role of production vs. recognition may be more prominent than that of global vs. discrete focus. This conclusion is further supported by the fact that cloze and multiple-choice, which combine both contrasting features (global and production vs. discrete and recognition), show the greatest differences in test scores.

The question may now be raised as to whether these broad patterns of variation in fact mask real differences among students at the individual level. One way to answer this question is to look at correlations between the task scores. High correlations could be taken to indicate that individual student performance corresponds with the general patterns of variability for the whole group. Pearson linear correlations were applied to data on the three tasks, taken in pairs. The correlation coefficients for the total sample were found to be high: 0.81 for cloze and multiple-choice, 0.79 for cloze and fill-in-the-blank, and 0.84 for fill-in-the-blank and multiple-choice. Correlations between the fill-in-the-blank and multiple-choice tasks appear to be slightly higher than they are for the other pairs of tests, which may reflect the focus on form in these two tests.
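For readers who want to reproduce this kind of check on their own data, a minimal Pearson correlation can be computed as in the Python sketch below. The function name is arbitrary and the snippet assumes two complete score lists of equal length for the same students; it sketches the standard product-moment formula rather than the exact computation used in the study.

    from math import sqrt

    def pearson_r(x, y):
        # Pearson product-moment correlation between two lists of scores.
        n = len(x)
        mean_x, mean_y = sum(x) / n, sum(y) / n
        sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
        sxx = sum((a - mean_x) ** 2 for a in x)
        syy = sum((b - mean_y) ** 2 for b in y)
        return sxy / sqrt(sxx * syy)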
The above correlation data may be interpreted in two ways. The first is that the tests are to a large extent measuring the same kind of linguistic knowledge. The second is that, although there is a high correspondence in the individuals' relative performance on these tests, the correspondence is by no means perfect, indicating that there are indeed some appreciable individual differences.

Conclusions

The purpose of the present phase of our work was to determine whether performance on certain verb forms varies according to the type of task, and whether this variation is affected by level of proficiency and by the specific verb form involved. On the basis of the results we have presented, the following conclusions may be drawn.
1. Performance does indeed vary according to task, the major differences being between a cloze type and a multiple-choice test.
2. The extent of this variation depends on the learners' level of proficiency, the lower levels showing the greatest differences.
3. The extent of the variation also depends on the particular verb structure involved, the more difficult structures showing greater differences.
4. Differences in performance appear to be related not only to global vs. discrete focus, but also to production vs. recognition, the relative contributions of these two factors being themselves dependent upon the level of proficiency and on the particular structure examined.
5. Although there is some evidence of individual differences, the overall patterns appear to be stable.
6. The type of task is clearly an important factor to consider in the evaluation of test results, a factor which appears to be particularly important at lower levels of proficiency and not necessarily as important at the upper levels.

References

Dickerson, L. J. (1975) The learner's interlanguage as a system of variable rules, TESOL Quarterly, 9, 401-407.
Dulay, H., Burt, M. and Krashen, S. D. (1982) Language Two. New York: Oxford University Press.
Krashen, S. D. (1981) Second Language Acquisition and Second Language Learning. Oxford: Pergamon Press.
Tarone, E. E. (1979) Interlanguage as chameleon, Language Learning, 29, 181-191.
Tarone, E. E. (1982) Systematicity and attention in interlanguage, Language Learning, 32, 69-84.
SECTION III
INDIVIDUALIZATION AND ASSESSMENT PROCEDURES
15 Operationalising Uncertainty in Language Testing: An Argument in Favour of Content Validity

ALAN DAVIES
University of Edinburgh

All language measurement involves uncertainty. The twin features of variability and error are endemic to all attempts to study and measure language learning: variability because of linguistic imprecision, error because of measurement failure. I shall argue in this chapter that language testing deliberately operationalises these uncertainties in order to move towards its necessary aim of explicitness, and, in so doing, provides an increasingly powerful research methodology for Applied Linguistics.

Variables Involved in Language Proficiency Measurement

The five major variables involved in language proficiency are ill-defined and subject to unreliability. These variables are: the native speaker, the cut-off, the criterion score, the test, and the language itself. I shall mention the sources of uncertainty for the first four variables and I then propose to consider the language variable at greater length since it embodies the types of uncertainty present in the other variables.

The native speaker

Given the non-existence of a homogeneous speech community, which exists only in an idealised form, the native speaker disappears from view.
The idealisation has no substance in terms of minute particulars of language. Native speakers disagree, all grammars leak. As Ross (1979) shows, the only code one is a native speaker of is one's own idiolect. For language testing therefore it is important which native speaker is chosen as model.

The cut-off

Cronbach's words still hold good: "setting a cutting score requires a value judgment" (1961: 335). It is users of tests, rather than test constructors, who regard every test as criterion referenced. Pass marks properly should be determined by the test users themselves and they should always be related to local circumstances. When I observed, during an evaluation study, that the English Language Battery (ELBA) results (Ingram, 1973; Davies, 1983) were being viewed so very differentially in different parts of the same university, my first reaction was one of dismay. Looking back I now feel that I was wrong and the difference of views indicated that the users were in fact making up their own minds, quite properly.

ELBA, designed by a former colleague, Elisabeth Ingram, in the early 1960s on the structuralist model, accumulated data over many years which we have in part analysed. The findings of interest to this chapter were first, that differential levels of English may be required for different types of academic study. Thus in three Faculties of the University of Edinburgh the combined mean scores of postgraduate successes and failures on ELBA (success and failure referring to academic success/failure in the students' own academic courses of Edinburgh University) were as shown in Table 15.1.

TABLE 15.1 ELBA Mean Scores of Academic Successes and Failures in three Faculties, University of Edinburgh

              Postgraduate Successes        Postgraduate Failures
Faculty         N    Mean ELBA Score          N    Mean ELBA Score
Arts          278          78                15          70
Medicine      112          72                10          67
Science       141          64                 7          53

The numbers of failures were, as they typically are, small, but it is clear that Arts requires a higher mean than either Medicine or Science. Indeed Arts were failing students on scores which would have predicted success in Science. The conclusion that what matters for adequate performance is not the same for different Faculties, as seen in these successes and failures
figures, suggests that the difference is one of levels. The second finding is one of happenstance but it does illustrate how very particular and local proficiency may need to be, that proficiency is in fact very context sensitive. It is this: another criterion used for the ELBA study, a difficult and frustrating one, was supervisors' judgements of their students' English. It was noticeable when we looked at the results that in several cases (and in the same Faculty this time) students on the same score were judged by different supervisors in opposing ways (see Table 15.2).

TABLE 15.2 Differential Prediction for a single Mean ELBA Score

                        ELBA score    Supervisor's rating of English
Student 1 (Germany)         65        Not satisfactory
Student 2 (Nepal)           65        Satisfactory
There are several possible explanations: (1) ELBA is not reliable, (2) supervisors are using different scales, or (3) individual students have different required proficiencies. My guess is that (2) and (3) are both partly true and that they mask a fourth explanation which concerns supervisors' expectations, viz. that their expectation for a West European student's English proficiency is higher than it is for an Asian student's English.

The criterion score

Validation by external means requires a quantifiable criterion with which shared variance can be established. As is well known the typical predictive correlation with academic examination criteria is about 0.30, but as is also well known, such criteria are commonly unreliable and in any case may be irrelevant. For example it is not axiomatic that performance in the subject of study is necessarily critical for English proficiency. That is one problem, the unsatisfactory status of the usual criterion. There is another problem, the uncertainty as to what reaching criterion means or involves, since it can be accomplished by a variety of accumulations (although this collapses into the cut-off question).

The test

It has been noted (Alderson, 1979) that cloze and dictation tests, but especially cloze, are difficult to replicate, in the sense of producing parallel
versions. This is a problem not only for cloze and dictation but for all language tests. Parallelism of items typically means statistical equivalence plus categorical similarity. But the equivalence of two test forms is, beyond a core similarity, a very grey area. And if forms of the same test are difficult to calibrate equivalently, how much more is that true of different proficiency tests. The general level of correlation is about 0.70, representing a shared variance of about 50%. The generous implication here is that every proficiency test is a sample of all possible proficiency tests and that what the unshared shows is, apart from error, proficiency that every test could have but doesn't include. But that is an alarming conclusion. In the first place, the effect of unshared variance is to produce varying rank orders. In the second place, the notion of untapped proficiency means that we are quite unclear about proficiency and that what counts as proficiency unique to one test may not be proficiency at all.

The language

Linguistic theories face three major problems: universality, combining power, and the incorporation of extra language data.

A view of language as an interlocking system of systems is straightforward enough for it to apply across all languages. It would be unthinkable in any linguistic theory for a natural language not to have a sound system (a phonology), a meaning system (a semantics), and a combining system (a syntax). Theories may differ as to how these systems relate and whether to admit other systems as systems (e.g. pragmatics, discourse, context). But there is no dispute about the need to posit three main systems. Similarly with high level sentence rules. All linguistic theories accept that their purpose is to provide rules for sentences, the rules are productive, or sentence generators. But all that says is that the highest linguistic unit is the sentence (leaving aside the contentious issue of whether higher linguistic units than the sentence can be specified). So far so good . . . all languages have sentences. But for a linguistic theory to be of any interest it must provide more information, viz. that a sentence is made up of parts, themselves made up of further parts, combining and recombining in various ways. No doubt linguistic theories may again agree that sentences contain, for example, subjects and predicates. But already, even at this still high level of generality, problems arise, whether the focus of the sentence is a subject or a predicate, and so on. Undoubtedly there is universality across languages but current investigations of universal grammar indicate the fugitiveness of that universality. What is universal is probably either a set of components (the categories or parts of speech of a traditional grammar) or a
very small set of phrase structure rules which indicate that all languages process and combine in similar ways subjects and predicates, subjects and objects, nouns and verbs. Thereafter, theories have to resort to some form of controlled ad hocking transformations. All languages, then, can be analysed linguistically into rules and exceptions. The aim of all theories is to increase the rules and limit the exceptions. The problem of combination has already been addressed. If it is a problem for generalising across languages, it is equally an intralanguage problem, between systems (how, say, grammar and semantics combine), between systems and part systems (how semantics and discourse combine), and within systems (how parts of the grammar combine so as to produce or generate sentences; and how discourse, for example, can be reduced to rules). This is the fundamental task of linguistic theories and they cope by various solutions of rule + exceptions (or transformations). The saying "all grammars leak" is a way of accepting two things: first, interperson variation such that at a certain point a separate grammar is needed for each individual speaker, as evidenced by the common experience of native speaker disagreement; and second, the accounting for what a single native speaker can do, that is to say that a grammar can only go as far as a linguistic description, even though it must be the case that native speakers control or know a much more elaborate grammar than can be described. That is, in part, another way of saying that for the native speaker those no-man's land areas of discourse and intonation etc. are indeed rule governed, but what those rules are and how they combine with grammar and phonology and semantics it is difficult to say. Hence the attempts through more refined theory to incorporate more and more in the description. The problems so far noted have to do with language as a form of independent behaviour, treating it as if it took place in a laboratory, in isolation from other behaviours and forms of organisation. But language is never an independent activity, except in linguistic description. It always takes place in social settings, it is always produced by and indicative of individual processes, it always refers backwards and forwards in time, thus demanding that previous knowledge and indeed world knowledge be drawn on and shared. Reducing such knowledge to rule is problematic, but to deny the incorporation of such knowledge into, for example, material for language testing is intolerably restrictive. Attention to extra language data moves linguistics from its narrow or micro focus to its broad or macro focus, from the linguistics of language systems to the bracketed linguistics (sociolinguistics, psycho. . . , etc.). If, however, the data are regarded as being potentially intralanguage, in other
words colonisable by linguistics, then of course linguistics systems expand themselves, or multiply into more systems in order to incorporate the claimed data. Such an approach might argue that linguistics must of necessity explain or account for cognitive processes or language acquisition or contextual setting or language variety. These claims are commonly made by psycholinguists and sociolinguists on behalf of the data at the centre of their attention, which they regard as rule governed and which they claim are part of language.

It will be noted that the narrow and the broad approaches are equivalent to the areas of explicitness and uncertainty. Indeed, the rival claims of the two can be found within the narrow approach alone since, as has been noted above, all theories, however narrow, acquire uncertainty as they seek explicitness about the extra data they are incorporating. Or to express the relationship symbolically, Explicitness + Data = Uncertainty, or: E + D = U. The very act of incorporation, that is of constructing theories and writing grammars, necessary and welcome acts, leads to uncertainty but an uncertainty that must be admitted and accepted.

Language Testing and the Problem Areas in Linguistics

Universality

This is the least serious problem for language testing since tests are by their very nature tests of particular languages. However, in the areas of language aptitude, language acquisition, and bilingualism, comparability of tests across languages would be of value: language aptitude because of the desirability of being able to make general statements about ability to learn languages in the sense that language aptitude is of interest only if it is general. Similarly with language acquisition, where again both the linguistic and the learning interest are in the general processes and properties of acquisition rather than in the acquisition of one language. That being so, both First and Second Language Acquisition Studies can only benefit from comparable methodologies and procedures and categories of a hypothesised order from more powerful as well as more detailed linguistic theories. (Note that the typical methodologies used in eliciting both First and Second Language Acquisition systems make use of so-called morphemes, in part because of their aim at generality, a quality that is overlooked in the many animadversions against such methods.)

Bilingual studies can also benefit from more universal linguistic theories since statements about bilingualism (both individual and societal) would be
more valid were there matching and comparability across the languages in question.

Combining power and language testing

It is here that the normative nature of language testing becomes evident. As we have seen, language varies between individuals, hence the problem of a language description which is more than very high level abstraction, and within one individual, in the sense that ontogenetic development is lifelong, hence the problem of writing a grammar for an individual. Language tests have so far done no better than linguistic grammars, proposing an idealised form of the grammar (for the description) and of some part of the language (for the test). What is likely is that a test will favour a particular subgroup in society, often a more advantaged group. Although individual members of such a subgroup will vary among one another linguistically, it remains the case that they are more like one another than they are like less advantaged groups. The choice of such a subgroup as a model, target, or norm of the test is made inevitable by the proximity of the educational model to this subgroup's variety, the elaborated code, the educated standard of the written language which cannot but have a backwash effect on the forms of speech of those who traditionally succeed in education, viz. this same subgroup. The effect is to exaggerate and perhaps increase the existing social disadvantage of those whose language variety is different.

For the sake of equity it would be desirable for language tests to incorporate variation, just as it is desirable for grammars to do so: the sociolinguistic extension of the latter is one way of admitting and providing for language variation within grammars. The employment within testing of Second Language Acquisition elicitation techniques as test methods and of implicational scalar techniques for scoring and analysis indicates how tests may eventually be able to incorporate language variation also. However, it is an open question to what extent such admission of variation into tests is desirable, even if it is possible, given that not only do tests now clearly operate as norms but perhaps that is what they must do, and indeed all that they can do.

Language tests are capable of handling language elements, testing for control of elements in the grammar, the phonological systems, etc. This is the well tried discrete point approach. But the very nature of language systems is that they combine, grammar with semantics, etc. Now a test of the overall language systems (i.e. of all the systems in combination) may be no more than
a set of tests of the systems. In essence this is an argument about the factorial structure of language proficiency which remains unresolved. What is clear is that linguistic descriptions do not provide any compellingly satisfactory account of system combination. Therefore what passes for a test of language systems is either a disguised grammar test or a confusing mixture, a kind of countable shotgun approach to combination, of which cloze and dictation are favoured examples; but the framework essay and the structured interview, older favourites, are also, in the same sense, language system tests. The point is that there is no rationale for designing a language test to test language systems in combination because there exists no algorithm which will tell us how language systems combine. All we can do is guess, recognising that any language sample must provide features of all linguistic systems and that syntax is likely to dominate in any sample we choose, which is only to admit what cloze and dictation results typically show, viz. that the dominant variable is grammar.

Incorporation of extra language data and language testing

The tension between explicitness and uncertainty exists in all linguistic enterprise. It becomes more strained here in the attempts to incorporate within systematic descriptions the stuff of language use, those marginal areas where system is uncertain, in which language interacts with social and psychological systems, above all to provide determinacy for contextual language use. If we consider Sapir's example, the farmer killed the duckling (Sapir, 1921: 82), the grammar and the semantics (and the phonology when spoken) of the sentence are straightforward. What is problematic is what are the dynamics of such a sentence: which farmer, why did s/he kill the duckling, where, who said so, what elicited this statement anyway, to which question or first statement could it act as a response, what could be the next statement, would such a statement be equally likely from young and old, male and female, is it a standard form, does it suggest casual or formal style, and so on. Why should anyone say this . . . is it a comment or a description, a criticism . . . ? All such questions concern the status of this sentence in contextual settings: the aim of the additive linguistic disciplines is to incorporate such concerns within linguistic systematisation.

The area of most difficulty for language testing is in the attempt to develop true performance tests, a good example of which is communicative language testing. The impetus to develop communicative language tests
comes from the communicative language teaching movement, itself a response on the pedagogic side to the continuing inadequacy of more traditional structural methods and to a general move in pedagogy, and therefore in language teaching, from a deductive to an inductive philosophy, and on the linguistic side primarily for the extension of linguistics into social context. This is a powerful double edged sword and communicative language teaching has had an important influence on language teaching developments, though it is not clear how much it has actually influenced practice.

The problems that arise for language testing are precisely that language testing needs to be clear, in this case explicit, about what is under test. Ironically there is no such constraint on communicative language teaching since the teacher can explain and expand what the tasks, the texts, the exercises, the lessons only hint at by making them come alive and so have immediate communicative effect, in other words, tolerate the central uncertainty of such tasks and not demand explicitness. But that is precisely what tests, as tests, must do. In testing this is desirable both for the language input and for the measurement output. Some flexibility is possible between the input and the output, that is, more uncertainty on the input side (vagueness as to what the language content of the tasks is) is acceptable but only if it is accompanied by greater certainty on the output side. This has traditionally been the case for tests of both spoken and written production where care has been taken to objectify the open-ended tasks of essay writing and interviews (open-ended even when they are constrained by frameworks and skeletons).

To return to the Sapir example, a strict grammar question would focus on morphology, and in fact on the testee's knowledge of grammar, thus:

Write in the correct past tense marker: The farmer kill . . . the duckling.

A less strict item would concentrate on, for example, sequence, thus:

Put these words in a correct sequence: farmer, duckling, killed, the, the

Already it will be observed that the strict grammar constraint has been relaxed since the "correct" response must be:

The farmer killed the duckling.

and not:

The duckling killed the farmer.

although both are grammatically well formed on the information given. Let us now add a time marker so as to facilitate testing for tense, thus:
Which of these is correct:
A Yesterday the farmer kills the duckling.
B Yesterday the farmer killing the duckling.
C Yesterday the farmer killed the duckling.
D Yesterday the farmer will kill the duckling.

The "correct" response of course is C, but the assumption has been made in the item construction that the testee will be aware of the time-tense relationship. Already, then, the discreteness of the linguistic system has had to be broken down and indeed this is the case for much, perhaps most, discrete testing in the structural tradition. Of course it is legitimate to maintain that grammatical control of, for example, English past tense assumes an awareness of the time-tense relationship, but even in this very narrow example the explicitness of what is being tested has already shifted into slight uncertainty. How much more uncertainty must a performance test accept!

So back to the farmer and the duckling, this time regarding it as a potential utterance in a real context of situation. Let us regard the sentence:

The farmer killed the duckling.

as a possible response or as second member of a two part response pair. What is of interest is what the first part would be, thus:

Which of the following sentences (A, B, C) is most likely to elicit the response: The farmer killed the duckling?
A Who killed the duckling?
B What did the farmer do to the duckling?
C What did the farmer kill?

In fact all three questions are possible first part members. No doubt the canonical choice is C, given that in unmarked theme the new information comes at the end of the response and since farmer and kill are given in C then the new information must be duckling. But there is no reason to exclude A and B since we cannot assume that theme is unmarked, nor can we presume to judge how the sentence The farmer killed the duckling is read, that is, any one of the content words may receive the main tonic. One solution would be to accept all three choices as "correct", but that would give the benefit of the doubt to the testee who checks all three choices that s/he is aware both that they are all possible and under what circumstances they are. The limiting case would be if a testee were to select A. Should that be regarded as incorrect? In a structural type test it would be, but it can hardly be wrong in a communicative test. It is, of course, the case that once we admit into the equation the testee's awareness of context everything becomes possible. It is
a common experience of language test construction that native speakers used in piloting will claim that for them "incorrect" choices are "correct". They are of course imagining an appropriate context for the incorrect form and giving that context saliency.

The two issues of combining power and of incorporating extra language data come together in the field of languages for specific purposes (LSP). Here we are concerned with the interlocking of linguistic systems, with variation (register differences) and with the relation of language systems to their social context in terms of institutional differences between, for example, science and journalism, or between different kinds of science. LSP is pedagogically important because it enables the learner to avoid the general in favour of the particular. At least it is on that ground that it has gained support. The pedagogic problem is of course whether the general and the specific are so different that it is useful to teach them separately, as indeed is the extension of the question into whether the specific are, and how far they are, different from one another. LSP is also pedagogically important because it has been used to argue for a delay in beginning language teaching until adulthood and to explain why literature is not a necessary part of language teaching. Now it can be argued that all language teaching is a form of LSP, even so-called general language teaching, but in fact this begs the interesting question which is not pedagogic but linguistic: to what extent one LSP differs linguistically from another and to what extent there is a language common core; to what extent the LSP is content validity and not just face validity.

There are two apparently contradictory arguments against LSP on linguistic grounds. The first is how, on what principled grounds, one variety can be distinguished from another. The second is how, once the process has started, it can be stopped. (It will be observed that the two arguments are exactly those found in the language-dialect distinction where attempts to distinguish on linguistic grounds between language and dialect are unsuccessful.) On what linguistic grounds then can scientific English (to take one language as example) be distinguished from journalistic English? The answer is that there are no linguistic grounds on which such a distinction can be made, that in terms of linguistic systems there is no principled way in which a distinction can be made. There are, however, two possible ways in which such a distinction may be contemplated: first, that there is a differential use of linguistic systems (e.g. more passives, fewer transitives, more definite articles), second, that there may be systematic distinctions within those marginal systems we have spoken of, discourse, pragmatics, etc. In addition, of course, there will necessarily be important differences in lexical choice and these will have some influence on differential semantic use. Furthermore, differential lexical uses are very important from a
pedagogical viewpoint, since knowledge of specific content words may assist or hinder understanding. So we can agree that major language varieties such as scientific English and journalistic English may be distinguishable in terms of lexical choice, grammar, and semantic use, and, less certainly, discourse and pragmatic system. Though we must be clear what can be claimed in terms of such systematic differences, as will be clear when we look at the issue in reverse. So far we have asked the question whether such distinctions can be made at all and we have decided that, yes, they can be, in terms of use; we are less clear about distinctions in terms of system. The second issue continues the question: can such distinctions be brought to a stop, or, once started, is there a principled reason for refusing to make further distinctions? Having distinguished scientific from journalistic English, can we now distinguish among varieties of scientific English, both in terms of content or subject matter (physics, medicine, surgery, paediatrics, etc.) and in terms of style or mode or methodology: that is, are some scientific Englishes more formal than others, more spoken, more laboratory reporting, more research oriented, can the research reports be distinguished? In all these cases the question we return to is: in terms of what? Indeed it may be the case that the English of X uses more spoken English than the English of Y, but that is a feature of language use, not of linguistic difference. And so on for the other uses. The issue comes back always to the theoretical one of discrete language varieties and it looks again as though once we have admitted a difference in terms of use between, say, science and non-science there is no legitimate reason for refusing to accept further distinctions ad infinitum until, of course, logically we arrive at individual use; and even that as we have noted is uncertain. The most sensible position seems to be that there are indeed differences between major language uses, themselves defined on institutional grounds (e.g. scientific: compare the political criteria used to determine differences among languages), that these differences have language correlates, that to some extent these are systematic, that this is an important pedagogical question and is indeed of sociolinguistic interest, and that it is no more (or less) possible to pin down these systematic variations than anything else in linguistic description. And that, basically, we should treat the issue as pragmatic and not theoretical. The implications for testing are therefore very clear. Tests of LSP are indeed possible but they are distinguished from one another on nontheoretical grounds, that is, their variation depends on practical and ad hoc distinctions which cannot be substantiated. In other words, such tests should be determined on entirely pragmatic grounds, such that their separate status
relates to occasions which are quite non-linguistic, for instance, the existence of institutional support such as separate teaching provision or professional divisions (e.g. doctor-dentist, chemist-biochemist). In all such cases it is necessary to remember that there are pragmatic and testing imperatives as well as linguistic arguments.

One way out of the dilemma of uncertainty is to delay, to put a stop to all descriptions, etc. until the time is right, the "waiting on science" approach. The problem of course is that this is unworkable for testing, certainly (and probably for much else beside), since tests are needed with urgency. The alternative way is to make use of the very uncertainty which appears to vitiate all attempts to produce error-free tests, a virtue of necessity, but more than that, it is the Heisenbergian acceptance of the uncertainty that is at the very root of our being. We shall look at this a little more carefully towards the end of the chapter.

It is probably also the case that too much in language proficiency testing is made of prediction. As we have seen, prediction is vague with a lack of clarity of what to predict; and the prediction which we do achieve is weak, 9% or 10% of the variance. What this argues for is better testing or better criteria or something quite other, for example, abandoning correlational relationships and replacing them by, for instance, very gross, broad bands, since by and large this is the only information which can be made full use of.
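The variance figures quoted here and earlier follow from the usual reading of a correlation coefficient: its square is the proportion of variance shared between predictor and criterion. A small illustrative computation in Python is given below; the function name is arbitrary and the two coefficients are simply the round figures cited in this chapter.

    def shared_variance(r):
        # Proportion of criterion variance accounted for by a predictor
        # correlating r with it.
        return r * r

    shared_variance(0.30)   # 0.09: roughly the 9-10% noted for prediction
    shared_variance(0.70)   # 0.49: the roughly 50% overlap between proficiency tests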
the data) than across institutions (0.41 to 0.46 vs. 0.24 to 0.29), evidence again of the context-related nature of proficiency.

Repeating the predictor

We had speculated that a major reason for the typical low correlations for prediction in language proficiency testing was that there was too long a time gap between predictor and criterion in which differential individual learning took place. To test this hunch, which we considered important, we retested on the predictor one month before criteria started to become available. The result was no change in prediction: figures remained around 0.30. No doubt our retest sample was badly truncated because of the voluntary registration for retesting, that is, those who offered themselves were in nearly all cases students with above-mean scores on initial ELTS testing.

Subject supervisors and language tutors

Our survey showed that subject supervisors were closer to ELTS in their judgements of students' English than were English language tutors or the students' own self-assessments. We realised that one explanation could be that supervisors are closer to ELTS because they know their students well, see them frequently, and as a result are able to judge their English proficiency with more accuracy than are language tutors. An alternative explanation is that, contrary to their superficial judgement of ELTS as an ESP test, giving it high face validity, supervisors are more likely to be in linguistic tune with the very general structural features of Test G1, which in fact contributes 0.83 to the overall band score. Language tutors, on the other hand, are more likely to take a thoroughgoing communicative view. (One of the anomalies of the ELTS test is that although it was designed as a communicative test of ESP the designers retained two general tests, both multiple-choice short-answer tests of the discrete-point variety, one testing grammar, the other a listening test of appropriate short responses. In our analysis it is these two tests which dominate factorially.)

Our advice on best cut-off

This was, again, "it depends". We were able to show that it is only for those scoring below 4.5 that there are more students failing than passing on
criterion. At score 6.0 there is a marked change, with a rapid drop to 6% failure. The results for specialist samples show the need to interpret these results in terms of each particular context.

General prediction

Basically our main conclusion is that, overall, there is the usual prediction at about 0.30, but that there is considerable variation across both discipline and skill. Medicine Oral (M3), for example, produced a correlation of 0.53 with the criterion. We concluded therefore that within its self-imposed constraints (a very dominant G1 Test) ELTS does indicate that ESP is a valid construct, but that the variation is more complex than the rather simple subject-specific structure permits, involving language skills and no doubt other variables not tested in ELTS.

Conclusions

In spite of the messiness and the uncertainty, the S of LSP stands up but not as expected. There are differential language needs but not systematically so; thus some groups may be advantaged in terms of the language of their study, other groups in terms of the productive or receptive skills. What no test can do is to provide a fixed pattern for all cases; but what can be done is to provide a range of tests, spread across the whole LSP field. The students and the receiving institutions can choose whichever test array they prefer. Test arrays, the natural thrust of the LSP model, will therefore vary from institution to institution and may even be individual-related where numbers are very small. The issue is political-linguistic, not unlike that of determining how many ethnicities to recognise in a polity. The answer can only be pragmatic: as many as want to be recognised. No doubt there are financial implications in terms of the cost-effectiveness of choice but not implications of principle. Such provision of choice has to be the logic and it is the logic that accepts uncertainty as inescapable. Of course a vastly extended test pool is needed but that can be built up over time. What is difficult is the problem of equivalence, but that may (like prediction) be a problem which we have chosen to make too much of. The only "logical" alternatives are either a fixed set of "general" specific tests or the same general test for everyone. The same-test-for-everyone approach may have the charm of simplicity and ease of test
administration (as also of custom since that is the normal practice) but it cannot be taken since it inevitably denies the value of surface communicativity, that is, that there is some influence on language performance of certain background knowledges. However, it should be noted that these two alternative decisions, to have one small set of general specific tests or to return to one overall proficiency test, are just as full of uncertainty as the table-d'hôte approach.

Explicitness comes by defining what is being tested quite carefully and then measuring it as accurately as possible. It is indeed the case that only through the development of an innovative test like ELTS and through its validation study could our understanding of proficiency have been advanced this much further. Where concentration in testing is needed is, in my view, on extremely careful language analysis and manipulation and therefore on as accurate measurement as possible. Thus we leave uncertain what must remain so, for example, prediction. We tighten up on internal validation and, as far as possible, on reliability. Above all we provide a much wider choice. To what extent this vitiates the principle of comparability (Alderson and Urquhart, 1983) is an interesting question but not, in the end, a testing one. Rather it is a theoretical one which tests may be used to help elucidate. For practical testing activity it does not matter if proficiency tests are comparable for different groups (existing ones aren't anyway). What does matter is that different groups can have their proficiency measured for their own needs or the demands placed on them. Even the explicitness I have emphasised is less important than the acceptance of uncertainty. To put it another way, reliability is not a separate construct from validity (although that is how it is often presented); it is part of validity, and we should be prepared to trade it off in gross assessments if we can gain an increase in validity. We probably do not need more complex statistical forms of analysis, for example, Item Response Theory. The homogeneity quest in any case is chimerical and by assuming sameness and therefore idealising we move the test away from a communicative purpose. A search for homogeneity is more suitable, as has often been said, for very narrow systems, for instance, mathematics. But that is knowledge learning, not language learning. Language testing needs to be about integrated pragmatic systems: the major skills we bring as testers must be linguistic-descriptive; the statistical analyses must always be ancillary and minor.

References

Alderson, J.C. (1979) The cloze procedure and proficiency in English as a foreign language, TESOL Quarterly, 13, 219-227.
Alderson, J.C. and A.H. Urquhart (1983) The effect of student background discipline on comprehension: a pilot study. In: D. Porter and A. Hughes (eds), Current Developments in Language Testing. London: Academic Press.
Criper, C. and A. Davies (eds) (1988) ELTS Validation Project Report. ELTS Research Report Vol. I(i). University of Cambridge Local Examinations Syndicate.
Cronbach, L.J. (1961) Essentials of Psychological Testing. 2nd edn. London: Harper and Row.
Davies, A. (1983) English language testing in the University of Edinburgh: report on the English Language Battery (ELBA Test) with an analysis of results 1973-1979. Unpublished manuscript, University of Edinburgh.
Ingram, E. (1973) English standards for foreign students, University of Edinburgh Bulletin, 9, 4f.
Ross, J.R. (1979) Where's English? In: C.J. Fillmore, D. Kempler and W.S.-Y. Wang (eds), Individual Differences in Language Ability and Language Behavior. New York: Academic Press.
Sapir, E. (1921) Language: An Introduction to the Study of Speech. New York: Harcourt, Brace and World.
16
Minority Languages and Mainstream Culture: Problems of Equity and Assessment 1

MARY KALANTZIS
University of Wollongong

DIANA SLADE
University of Sydney

and

BILL COPE
University of Wollongong

Mass migration over the past few centuries, but particularly in the decades since the Second World War, has produced in almost every country of the world an unprecedented degree of cultural and linguistic diversity. Too often, however, mainstream social institutions have systematically attempted to ignore this reality. In the arena of education, for example, standardised forms of assessment, in a single national language, reflected the dominant ideology of an era in which the homogenising process of the "melting pot" was supposed to be at work. This form of assessment purported to be a generally valid measure of socially relevant educational outcomes, relevant at least from the points of view of employers, further education institutions, and so on.

In some societies, however, where the cultural and linguistic impact of immigration has been particularly significant, there has been a growing
recognition of the reality and lasting nature of diversity. In Australia, for example, there has been a clear move over the past few decades from a government policy of "assimilation" to one of "multiculturalism". Yet the growing recognition and servicing of cultural and linguistic diversity has its own limits. To test or assess in ways which are culturally and linguistically relative can also be at the expense of measuring a more general linguistic-cognitive competence required for social access and social equity. The traditional, standardised, linguistically and culturally biased assessment procedures certainly excluded large social groups from success in the education system. However, new mechanisms of language learning and assessment, which recognise the reality of pluralism and reward students in terms relative to cultural-linguistic background, can also exclude certain groups from access to mainstream social and educational institutions, but this time under the veil of democratic rhetoric.

This chapter will discuss the difficulties of pluralised, differentiated forms of language pedagogy and assessment. It will suggest ways in which new forms of assessment can be devised which are both sensitive to the diversity of students' specific backgrounds and useful measures of those more general linguistic competencies required for effective social participation. The chapter is founded on an extensive piece of research, undertaken for the Australian Department of Immigration and Ethnic Affairs, which investigated the relation of minority languages and mainstream culture. This involved an investigation of policy issues, both theoretically and in international comparison. An evaluation of attitudes to language maintenance was made using a sample of children and parents of German and Macedonian background in the Wollongong-Shellharbour region of New South Wales. Language assessment instruments were developed with which we could compare first language (L1) and second language (L2) competence; this was followed by the trialling and implementation of these instruments with the children of the sample whose attitudes we had surveyed (Kalantzis, Cope, and Slade, 1986).

The main difficulty this research project attempted to resolve was how to assess in different languages using sufficiently general and comparable tests so that relative competence in each language could be measured. In this chapter we will summarise some of the results of this project. In the first part, we will describe the complex amalgam of problems the project addressed: government policy, people's attitudes to language maintenance, and those other linguistic educational needs of which students and parents are frequently not fully aware. Second, we will discuss different approaches to pedagogy and assessment in a multilingual society, contrasting relativistic
approaches to pluralism with more sophisticated approaches which balance cultural issues with social equity. Finally, we will address the more specific logistical dilemma of producing culturally and linguistically differentiated forms of pedagogy and assessment which are equally concerned with the objectives of social equity, access, and participation through education.

The Challenge of Multilingualism

Australia now has a National Policy on Languages, in the form of the Lo Bianco Report (Lo Bianco, 1987), officially adopted by the Federal Government in 1987. But this development was the end result of a slow and laborious process that has left a legacy of contradiction and compromise. Through to the early 1970s, the implicit (and often not-too-implicit) policy rationale behind Federally funded programs, such as the Child Migrant Education Program, had been cultural and linguistic assimilation or integration through the learning of English. By the mid 1970s, however, it was becoming clear that the old policy of assimilation/integration was not working. A plethora of government reports, for example, showed that students of non-English speaking background (NESB) were educationally disadvantaged (Kalantzis and Cope, 1984). Moreover, assimilation or integration at a cultural and linguistic level was simply not occurring and specialist servicing (such as language services) was increasingly needed as the NESB proportion of the population increased and as "ethnic" lobby groups emerged.

By the late 1970s a new policy of multiculturalism had been developed in which "ethnicity" was a crucial new term of social categorisation. Cultural difference was now seen to be a significant concern of government policy. In the area of language policy and practice, education programs now came to emphasise cultural diversity and language maintenance. In many ways this was a constructive and sophisticated development in policy. The language of one's ethnic community certainly can be an important fulcrum of communality, a rallying point for political and social solidarity, of psychological comfort in the home, of heartfelt aesthetic power, or a conveyor of specific meanings closely related to one's sense of self and identity. But, too often, the project of cultural pluralism simply meant the celebration of colourful cultural differences and attempting to foster self-esteem without necessarily changing educational outcomes.

In the area of language learning, we saw the rise of so-called community language programs. Inadequate funding went hand in hand with limited rationales such as cultural maintenance and self-esteem. This meant that
these programs, despite frequent good intentions, were often no more than tokenistic and short-term. Students would learn a limited range of language for domestic interchange (relevant, community forms of the language) and a little bit of culture. At the same time a considerable amount of the educational responsibility was put on the semi-private sector as meagre subsidies were given to the community-run ethnic schools which operated outside mainstream school hours.

Critics of this approach to the language question came to argue that community language learning would not become a serious or fully effective educational endeavour until it concerned itself with language: not just as a way of maintaining culture, but as a tool of social access as well (summarised in Kalantzis and Cope, 1987; NACCME, 1987). This would necessarily involve transitional bilingualism (thus learning mainstream school subjects in one's mother tongue with a gradual transition to English so that lines of cognitive development were not disrupted by language break), bilingual education (using both mother tongue and the second language for instruction), and teaching community languages with the same degree of intellectual seriousness as traditional foreign languages, such that students become competent in literate standard forms of the language as well as oral and dialect community forms. Indeed, it was argued that funding for tokenistic community language programs with restricted domestic-communicative and cultural aims might well have been better spent on improving specialist English as a second language teaching. Furthermore, whilst such language programs could possibly achieve important educational objectives, they were perhaps better not undertaken at all than undertaken in such a way that they could be proved ineffective, given the highly political context in which they operated.

The 1987 National Policy on Languages shows a serious concern with strengthening language policy and government services. The broad social goals of language policy are stated by the report to be "enrichment, economic opportunities, external relations and equality". In the teaching and learning of languages, three guiding principles are spelt out: English for all, support for Aboriginal and Torres Strait Islander Languages, and, perhaps most dramatically, a language other than English for all. Funding for programs arising from the Policy to 1989 was announced but, rather predictably, at levels inadequate to achieve its ambitious objectives. Despite the idealism of the Policy, the realities of funding reflect political opportunism more than educational or social objectives. In fact, the earlier phase of multiculturalism, with its emphasis on cultural maintenance and community relevance, was a relatively inexpensive response to ethnic lobby groups and the perceived migrant vote.
The pragmatic political context was very much reflected in the way our research was instigated. When we began negotiations, our brief was to survey attitudes to minority L1 maintenance. Satisfying stated personal desires was thus conceived to be the purpose of language policy. Following this brief, we undertook an attitudes survey, but at the same time, we also attempted a rather more objective comparative analysis of L1/L2 language competencies measured in terms of communicative performance in those forms of the language required for educational and social access. This would involve a critical analysis of what form of language was being maintained and how this related to the acquisition of English. The assessment component of the research project, in other words, was not part of our original brief but something we considered necessary to balance the more immediately political concern with attitudes against more serious linguistic and pedagogical questions.

In the attitudes component of the research, to summarise our results very briefly, the German and Macedonian adult groups interviewed did, indeed, show a degree of concern that their community languages be taught and that their ethnic group get a "slice of the action". Since there was public funding available, its pattern of allocation was seen to be a mark of esteem or respect on the part of the mainstream society. In the particular case of the Macedonian group, maintenance of the mother tongue was considered a vehicle for the preservation of religious and social cohesion and for the continuation of patronage networks. This group was also concerned to preserve a language that had only obtained a standardised, written form relatively recently, and to demonstrate that it was a viable official language in its own right. The German group, on the other hand, claimed particular virtue for its ethnic group in being better educated and learning English quickly. Yet German was also considered to be an important language for the wealth of its literature and international significance, and therefore a significant school subject offering. Although all groups interviewed agreed on the importance of learning English, none made links between L1 maintenance and learning English, nor recognised the possible role of L1 maintenance in general school achievement, nor the cognitive disruption of entering a school whose sole language of instruction is not one's L1. Language maintenance was seen to be an issue of esteem, respect, prestige, and funding rather than a serious pedagogical matter.

The fundamental dilemma that emerges is whether pedagogy and assessment should be relative to political exigency and the subjective aspirations of community groups, or based on the more technical, less readily comprehensible, linguistic prerequisites of educational and social access. The students in our sample groups were more pragmatic in this
regard. Their main goals were success at school in English and the relevance of their L1 in helping them get a job, although, still, there was no clear understanding about what forms of L1 learning and assessment would achieve these ends.

Approaches to Pedagogy and Assessment in a Multilingual Society

The historical and sociolinguistic roots of the language question can be traced to two powerful and contradictory tendencies in the modern world: the cultural/linguistic universalisation of national cultures and languages and, at the same time, the juxtaposition of people of diverse cultural and linguistic backgrounds. As it is neither possible nor desirable to return to the old melting-pot, assimilationist model in which the universal national language and culture was to be imposed on the whole population, there are now two approaches to servicing sociolinguistic pluralism. One is the simple pluralist response which concerns itself, in immediate reaction to the racism and ineffectiveness of assimilation, with the preservation, celebration, and reproduction of cultural diversity. The second, working with a more holistic understanding of culture, concerns itself equally with the dynamics of diversity and the dynamics of unification. In the moments in which this approach concerns itself with the logistics of empowerment, it can sometimes be mistaken to be a form of cultural and linguistic assimilation.

The pluralist conception

To elaborate, a simple pluralist conception of culture for multiculturalism busies itself with the celebration and maintenance of only such variety as can comfortably exist in diversity. In so doing, it often tends to neglect the wider structural context limiting that diversity. Indeed, in this wider context, maintenance is a problematic concept, and L1 assessment which is ostensibly relevant to cultural background and measuring linguistic preservation might be quite inappropriate. For example, within a language, according to historical and social context, there is considerable variety of register, genre, syntax or semantic field. Sociolinguistically, some L1 language forms (such as those originating in peasant-agrarian contexts) do not have the same functional efficacy as others (such as those of educated, middle-class city dwellers) as tools for reading and conceiving the western industrialism to be found in the immigrant's new setting. Moreover, certain forms of minority
language, confined in their sociolinguistic form and context to the realm of oral, domestic interchange, should not simply be reproduced through education institutions for their authenticity as community forms of the language. In other words, individualising and relativising assessment procedures for L1 background students should not be at the expense of evaluating for those forms of language and cognition necessary to reading and participating in the wider society of settlement. These communicative requirements also extend to trade and educational contacts with the country of origin and manipulation of literate, standard forms of the language. There is still a place, in other words, for more universal and objective assessment measures. Language policy cannot simply orient itself towards maintenance for maintenance's sake, nor can assessment be individualised to the point where it is relative only to linguistic and cultural background. Language is an open, dynamic communicative tool. Pedagogy and assessment need to move beyond the inherent sociolinguistic conservatism of relativistic approaches to linguistic diversity.

The special position of English in Australia's multilingual society

Furthermore, L1 proficiency has to be measured against L2 proficiency. The English language in Australia today is not just a means of cultural-linguistic dominance and assimilation. It is a common language of social understanding and effective participation. The particular geographical origin of this language is a historical accident, and as such, English has no special virtue as a language. Nonetheless, today it is more than just one piece in the Australian mosaic of community languages. It is also a crucial means to social power and self-determination. Some proponents of pluralist multilingualism may well denounce the possible assimilationist intent in ESL or transitional bilingual programs. But the pedagogical, evaluative, and final social consequences, even in the teaching of languages other than English, can be very different according to policy rationales which view language as a tool of self-determination and social access, as against those which view language mainly as a symbol and as a means of maintaining cultural diversity. There is a danger, for example, that language teaching and assessment programs emanating from a poorly thought-out simple pluralist position might end up being short-term, poorly funded exercises in community relations. There are also the limitations to the concept of community language discussed above, if this implies teaching and assessing children only in the form in which it is used in the community. The
fundamental problem arises from rationales and assessment procedures which orient themselves to specific popular/cultural/affective factors, rather than more general factors such as cognition, educational access, and social outcomes. The obvious implication for assessment is that individualising and relativising evaluation should not be at the expense of other important elements that require measurement. It is crucial to recognise that the old standardised tests in a common national language were incapable of recognising the complex dynamics of linguistic and cognitive proficiency in which L1 and L2 are integrally related for students of minority-language background. Yet a radical relativising of pedagogy and assessment, measured according to relevance to individual background, equally can neglect the role of languages as a means of participation and access in a wider social context. The Language Question research project aimed to make assessment an integral part of language planning again by attempting to overcome the limitations of both standardised testing and relativist pluralism. The assessment procedures we developed tried to measure the very specific and individual relation of L1 and L2 proficiency, whilst at the same time measuring for those more universal forms of language necessary for effective social participation.

Design and Implementation of the English, Macedonian, and German Assessments

As discussed above, the brief for this project, received from the Department of Immigration and Ethnic Affairs, was to undertake an attitudes survey. The emphasis on attitudes is in keeping with the fashionable notions of "needs", "relevance", and "community involvement". But that attitudes were conceived to be the central question in the language debate indicates a critical dilemma, which became the central question of our research: whether community languages are a political palliative or a pedagogical imperative. Insofar as the maintenance of community languages is regarded as worthy, this should not just be an accommodation to public opinion. Rather, it should be seen to be critical in the servicing of educational needs.

Objectives of the assessment project

We had three major aims in conducting the language assessments. Our first goal was to evaluate the children's English language performance on
academic and non-academic language tasks. There is an obvious need for such information on NESB children. There are serious indications of educational underachievement of these children in Australia, yet very little research has been done in this area. Secondly, if language programs are to be considered as part of everyday, mainstream schooling, it is necessary to have information on proficiency in the mother tongue of different children who might be targeted for community language programs. Thirdly, we wanted to establish whether there was any correlation between proficiency levels in English and proficiency levels in the mother tongue.

The task we set ourselves was extremely ambitious because of the scarcity of available assessment materials. Problems of comparability between English and Macedonian/German dogged the whole exercise, but the assessment tools we designed are, we hope, a contribution to satisfying a pressing need. Our research made it clear that there is a vacuum in this area in Australia, both in terms of knowledge about language, and the needs and experiences of students from non-English speaking backgrounds, as well as in terms of there being no instrument to evaluate this experience adequately.

When we asked teachers how they identified language difficulties, replies ranged from those (the majority) who believed in informal and subjective procedures, to the few who still administered traditional, fully standardised tests with statistical support. The majority of the teachers who advocated informal and subjective procedures objected to systematic language assessment for two main reasons. Firstly, they believed that all existing language tests are culture-specific and therefore inappropriate; or, secondly, that most traditional, standardised tests only give an account of the students' ability to manipulate abstract grammatical rules and not of their ability to actually use the language in real contexts. It is indeed harmful to slot children into self-fulfilling categories, or to make assumptions about the child's intellectual abilities from their performance in a language test. However, it is essential to be able to diagnose children's language difficulties so that appropriate teaching and learning strategies can be formulated. If the students' needs are to be adequately catered for, the teacher must know, as precisely as possible, what stage of linguistic development has been reached. This is particularly the case for children with language difficulties. So the need that has to be addressed is for the development of systematic assessment procedures that are not culturally inappropriate, that can assess the students' communicative abilities, and that are sensitive to the diversity of the students' linguistic and cultural backgrounds. Moreover, as English is the language of instruction in schools, success in English therefore being
essential for cognitive and intellectual development, these tests need also to be able to measure the students' ability to use the language for educational purposes. The problem, therefore, is to try to devise a test that is capable of measuring individual differences, but not to the point where all forms of language are seen as purely relative. The test also needs to be relevant to the linguistic and cognitive demands placed on children at school, and as such be able to provide measures of the more general linguistic competence necessary for educational success.

Design of the assessment procedures

The details of the assessment tasks, rating procedures, and reliability indices are fully discussed in Kalantzis, Cope, and Slade (1986). Here we will give a brief description of the assessment tasks and the main features of the rating procedures that were used. When we designed the assessment procedures our first concern was that the tasks to be undertaken should be realistic, communicative operations and should include both academic and non-academic language activities. The tasks should, as closely as possible, resemble the type of linguistic demands made of children of the target ages. When designing the tasks we first considered the following questions:
1. What is the general area in which the target students actually use English, Macedonian, or German?
2. What are the general areas in which the students need English, Macedonian, or German?
3. How well, and with what degree of skill, do they need to perform in these areas?

We restricted the study to looking only at classroom interaction, and we did this in all four skills: speaking, listening, reading, and writing. For most of the students we assessed, a major setting for communicating in English is the school. This is not the case with Macedonian or German, where, despite the fact that in some instances those languages were taught in school, they were essentially languages used at home and in a limited number of other places outside the school. Our intent, however, was not to measure just the domestic and community use of the first language, but to devise comparable, bilingual tests that are essentially measuring those skills necessary for educational success. As we discuss below, this presented a number of problems in the designing of the German and Macedonian tests; in
particular finding relevant material for the reading and listening sections of the test.

We wanted to get an indication of the students' ability in all four skills, so the assessment procedures had four major sections. The listening and speaking assessment tasks are related, as are the reading and writing. The assessments included a combination of direct and indirect approaches as a way of dealing with the conflicting demands of the requirements of validity, reliability, and efficiency. We decided to have a combination of communicative integrative tasks and discrete-point tasks, with the emphasis on the communicative aspect. The compromise appeared in the rating procedures rather than with the content or form of the assessments themselves. We will now briefly describe each segment of the assessment tasks.

ESL speaking and writing tasks

For the productive skills of speaking and writing, we devised a communicative task that attempted to measure both the students' overall performance and the strategies and skills which had been used in the process of performing that task. It is possible to assess both spoken and written abilities in this way. As we were restricted to the kinds of interaction that the students at school are engaged in, we had two issues to consider: teacher-student interaction, and student-student interaction. We decided, therefore, to have an oral interview (involving teacher-student interaction) and a student-student task. Because we needed to administer the questionnaire about attitudes to language maintenance, we decided to include it as part of the oral interview and, by doing so, it provided us with a purposeful communication task where a genuine transfer of information was taking place. Having an authentic task to complete overcame many of the problems associated with oral interviews. In the trialling of this element of the evaluations we tried a variety of approaches, for example, interviewing two or three students together, but we decided it was both more efficient and more likely to elicit extended responses if we interviewed the students individually (for a detailed discussion of this see Kalantzis, Cope, and Slade, 1986).

The student-student task was by far the most successful component, both in terms of the students' involvement and the linguistic skills it demonstrated. It was designed to evaluate a student's communicative effectiveness when interacting with another student. We chose a barrier task which involved giving and interpreting instructions or descriptions. The
students faced each other, but were separated by a physical barrier, so neither could see what the other was doing. The barrier was low enough for them to have eye contact. By using such a task, we could assess the students' effectiveness in communicating precise instructions and the ability to clarify and repair where necessary. It was also possible to assess the recipients' ability to interpret information, to seek clarification, and to respond appropriately. There are problems with making definitive statements about recipients' listening ability based on this type of task, as this is affected by the clarity of the information given, and so on. This, therefore, was only one of the aspects of the listening assessment.

The aim of the task was for subject B to build (year 5) or draw (year 10) a replica of the model constructed by subject A. Thus B had to ask as many questions as possible. As students often perform this sort of task at school, they found the demands familiar. After completing the task, A and B compared drawings/models, B reported to A on her/his version, and if it was not the same, A corrected, re-explained, and so on. They then described to the teacher/evaluator what had happened. The task involved the manipulation of abstract, analytical concepts and as such was close to the cognitively demanding, context-reduced end of the continuum as outlined by Cummins (1984). In this way the two sections of the assessment of oral interaction gave us an indicator of the student's basic interpersonal communicative skills, and also of their ability to perform in cognitive/academic language tasks. In the trialling of the task we identified what information ought to be included in a notionally "good" account of a story (see Ellis, 1984; Arthur et al., 1980), and we designed a matrix which included these elements as a checklist to guide the markers. The matrix was used for the markers to write down detailed comments, and the rating scale we designed was used to allocate a mark.

The evaluation of writing consisted of only one task: a free writing activity that was directly related to the theme of the reading passage. This free writing technique was one that the students were familiar with and as such was seen to be a relevant task.

ESL listening and reading tasks

For the purposes of objectivity and reliability, we designed a listening and reading test that had marker objectivity. We tried to ensure that the content represented the kind of tasks that confronted the language users at school. These assessments had a multiple-choice format which focused on lexico-grammatical features (such as the ability to deduce the meaning of unfamiliar lexical items), discourse features (for example, cohesive devices)
as well as generic features, that is, the ability to identify the particular text type and its characteristic schematic structure (see Martin and Rothery, 1985). The multiple-choice questions were designed to assess the students' ability to interpret and understand these features. We trialled these assessment tasks with ten ESB and ten NESB speakers. In the listening and reading evaluations we tried to identify how the students processed the text and what they found most difficult. We did this by trying a variety of questions/approaches and by discussing the process and outcome with the students. On the basis of the trialling, the final content and format of multiple-choice questions evolved.

The assessment tasks for listening skills had three aspects: the oral interview, the barrier task, and the listening activity on tape. The students' listening ability could be partly gauged through the oral interview and the barrier task. As it was difficult to judge by those alone, we also included a separate and specific listening task. Furthermore, as argued earlier, for the purpose of objectivity and reliability, we designed a listening and reading test that had marker objectivity and hence reliability (see Kalantzis, Cope, and Slade, 1986). We tried to ensure that the context was realistic: for year 5 we chose a principal's school announcement, and for year 10 an authentic extract of taped casual conversation. These were followed by multiple-choice questions.

The evaluation of reading skills had two aspects: a reading passage with multiple-choice questions; in addition, the students were asked to read aloud the first paragraph of this passage. It could be argued that reading aloud is not an authentic task, as native speakers very rarely perform this activity. However, it is not unusual for children to read aloud at school (even at the secondary level). In fact it is often seen as an essential part of the development of literacy skills. There are technical skills of reading aloud which can provide useful indices of reading ability. It is thus an important diagnostic technique enabling the identification of particular problems students may have. As we asked the children to read the first paragraph of the reading passage they had previously been tested on, the overall context was familiar.

Performance criteria

It remains to discuss the problem of developing criteria for assessing the different components of the tasks. The concern was not whether candidates passed or failed. The point was to assess what they could do. So we were more concerned with making qualitative rather than quantitative judgements, although for statistical reasons a quantitative measure was necessary. The problem that immediately arises is whether it is possible to assess production
in ways which are not so subjective as to render the assessment procedures completely unreliable. An interim solution proposed by Carroll (1980), Ingram (1984) and others is to design an operational scale of attainment in which different levels of proficiency are defined in terms of a set of performance criteria. There are quite a few criteria that have now been developed, including the ratings procedure adapted for the Royal Society of Arts Examinations (RSA) and the Australian Second Language Proficiency Ratings (ASLPR), designed for adult ESL learners in Australia. The emphasis of the RSA rating scale is on providing a profile of the learner, whilst the emphasis of ASLPR is on providing a rating scale. They are both designed for adult second language learners. Influenced by these criteria, we designed our own ratings procedures that would be more appropriate for year 5 and year 10 children, and that could provide us with both a profile of the learner and a rating scale. We needed both these measures as we were interested in making quantitative as well as qualitative statements (detailed results are available in Kalantzis, Cope and Slade, 1986). We will now briefly discuss the difficulties we had in designing comparable Macedonian and German tests.

Assessment procedures for Macedonian and German

For the Macedonian and German assessment procedures the same framework was used as in the English language evaluations, although different topics were suggested for the interview to stimulate discussion. Similar tasks were used for the student/student interaction. They were slightly modified to make them more appropriate. For example, the diagram for the year 10 students was altered so as not to include objects and symbols that they would not be able to identify in Macedonian and German. It is important to stress again that the design and format of the German and Macedonian assessments followed the same theoretical framework as those for English. The German and Macedonian listening and reading evaluations were also passages followed by multiple-choice questions. Despite trialling and extensive modifications, the content of these passages remained unsatisfactory and reflects the lack of locally produced material in Macedonian or German which is relevant to Australian students.

There were three major problems in designing the German and Macedonian tests. The first problem was finding fluent Macedonian and German speakers who were aware of the theoretical and practical developments in language assessment. We believed this pedagogical understanding was important as the design of the assessment tasks involved much more than the translation of those we had already produced in English. For example, it would have been culturally inappropriate simply to translate
the listening and reading passages. However, for any comparisons to be made, it was also necessary to design comparable tasks that had the same framework and that were based on the same theoretical principles. The first stage of the design of the Macedonian evaluation instruments involved a Macedonian exchange teacher in consultation with us. We explained our framework in detail, and discussed the importance of ensuring cultural relevance in the design of the instruments. A university lecturer in Macedonian and a Macedonian school teacher were consulted for the second stage of the design. The reading and listening passages that were first constructed, they felt, were not relevant or appropriate for children brought up in the Australian-Macedonian community. This inappropriateness was seen both in terms of the topic of the material chosen and the lexico-grammatical items it contained. In cooperation with us and the exchange teacher, the material was amended so as to be more relevant to the Australian-Macedonian context, whilst still attempting to balance this relevance with the requirements of standard or literate forms of the languages.

The German assessment tasks were devised by a German teacher from Sydney in consultation with us. She worked within the framework used for the English evaluations, and chose listening and reading passages that she felt would be suitable. In a second stage, the tasks were modified by two German teachers from Wollongong, as these were seen as inappropriate vehicles for evaluating the German with which children in the Wollongong area were familiar. In both cases, it was clear that much needs to be done at the materials development level and on assessment procedures for languages other than English, so that there is a pool of experience and material to draw upon for research and for teachers. In many ways, we had to construct our assessment procedures from scratch. We hope the experience will provide a basis for further necessary action research in this area.

The second major problem concerns comparability for languages which, as we argued earlier in this chapter, are put to quite different uses in the Australian context. Bearing this in mind, we aimed the procedures at the two major uses to which both English and the mother tongue might be put: everyday communication and formal schooling situations.

The third problem was the impossibility of standardising the German and Macedonian assessment procedures against a control group of native speakers. This involved more than the impossibility of administering these to control groups in Germany and Macedonia. The languages spoken there
display differences in form, content, and purpose when compared to the German and Macedonian spoken in Australia. Yet we also wanted to overcome the difficulty of simply assessing those oral, dialect, and domestic forms of language used in everyday community context, that is, of proving the rather circular point that the language of the individual and community is used well in terms measured by individual and community usage.

Administration of the assessments

Our concern was to examine language proficiency in English and Macedonian or English and German. We assessed 111 children, across 7 schools. Of the total, 42 children formed a control group of native English speakers, and the rest were children of Macedonian or German background. We decided to assess students in years 5 and 10; year 5 because they were at a stage in their schooling which was critical in terms of language development and extension, and year 10 as at this stage they were preparing for work or further study in tertiary institutions. In fact, we originally intended to draw our sample from year 11, but had to change this to year 10 when it was discovered that very few children of Macedonian background made it into the senior years in the sample schools.

The results will only be discussed briefly here, and we refer you to the report for a full account of the results of this research. Although several L1 maintenance programs were being run which involved the sample groups in some of the day schools we surveyed and a number of the sample groups attended community-run ethnic schools outside normal school hours (and despite the fact that most of the sample were born in Australia), our battery of evaluations showed that English language proficiency was lower for the Macedonian and German groups than for an English L1 control sample from the same schools. In the L2 evaluations, students scored considerably better in oral than they did in literate forms of the language. This pattern of results showed no sign of improvement between the primary and secondary school components of the sample. In other words, ESL programs had not brought even Australian-born NESB students up to the average standard of their ESB peers. The L1 experience had not been used constructively to this end, and the practical requirements of L1 learning (such as employment in translation/interpreter services, or the possibility of working overseas) had not been realised through significantly improved literacy proficiency. Perhaps limited language maintenance goals (relevant to existing domestic usage, for example) had been met, but not broader objectives relating to social access and participation.
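To make the kind of comparison just described concrete, the short Python sketch below shows the tabulation involved: group means for each skill set against the English L1 (ESB) control sample. It is offered purely as an illustration of the procedure; the group labels, the 0-10 score scale, and all the values are invented and are not the data reported in Kalantzis, Cope, and Slade (1986).

# A minimal sketch, with invented scores, of tabulating group means per skill
# and each group's gap relative to the ESB control group.
from statistics import mean
from collections import defaultdict

records = [
    # (group, skill, score) -- illustrative values only
    ("ESB control", "oral", 7.2), ("ESB control", "literate", 6.8),
    ("ESB control", "oral", 7.0), ("ESB control", "literate", 7.1),
    ("Macedonian", "oral", 6.5), ("Macedonian", "literate", 5.1),
    ("Macedonian", "oral", 6.1), ("Macedonian", "literate", 4.8),
    ("German", "oral", 6.9), ("German", "literate", 5.6),
    ("German", "oral", 6.6), ("German", "literate", 5.9),
]

scores = defaultdict(list)
for group, skill, score in records:
    scores[(group, skill)].append(score)

group_means = {key: mean(vals) for key, vals in scores.items()}

# Report each group's mean per skill and its gap relative to the control group.
for (group, skill), m in sorted(group_means.items()):
    control = group_means[("ESB control", skill)]
    print(f"{group:12s} {skill:9s} mean={m:4.1f} gap_vs_control={m - control:+5.1f}")

Run over the actual assessment results, a table of this kind is what lies behind the statement above that the Macedonian and German groups fell below the control group, and further below it in literate than in oral skills.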
Conclusion

Despite the methodological difficulties, we did succeed in creating assessment procedures which we consider successfully measured comparative L1 and L2 ability. Standardised language tests in a national language, although they might crudely forecast success in a monolingual education system, tell us nothing of the complex relation of L1 and L2 language abilities which unfairly disadvantages many NESB students. A radical and simplistic move to a pluralist paradigm, however, measuring language abilities relative to background and community context, also can disadvantage NESB students. Simplistic pluralist assessment procedures and pedagogy with limited objectives of esteem and cultural maintenance do not necessarily help students acquire those linguistic-cognitive capacities necessary for social access and participation in a late industrial society. L1 language teaching and assessment for NESB students could work towards those literate and abstracting forms of language necessary to read the broader social structures; standard forms of language necessary for Australian trade interaction with the country of origin or socially-mobile return migration; and language ability necessary for employment in translation/interpreting/teaching service in Australia. In the relation of L1 with L2, transitional bilingualism can maintain lines of cognitive development in mainstream subject areas as students' language abilities develop. Finally, community languages should be taught and discussed with the same degree of intellectual seriousness as traditional foreign languages.

With these objectives in mind, language assessment in a multilingual society needs to be able to measure the complexity of the L1/L2 relation through assessment in both languages. This, however, needs to be comparable, and we also need to evaluate those general forms of language which will be the most effective educational/communicative cognitive tools to social equity. Evaluation and pedagogy which are simply relative to the multiplicity of community contexts, aiming no more than to preserve and maintain these cultural-linguistic differences, are an inadequate response to the undoubted limitations of the old, standardised testing procedures. Given the situation of linguistic diversity in Australia it is essential to devise bilingual assessment procedures that are able to give indications of students' cognitive and linguistic development in both their first language and in English. In this research we wanted to devise comparable assessment tasks which would be both sensitive to students' linguistic and cultural backgrounds and capable of eliciting those forms of language and cognition which are necessary for educational success and mobility. Our tests are very much an interim measure, but we hope they will provide the first step in a
necessary process of developing systematic assessment procedures for children of non-English speaking backgrounds.

Note

1. For more details on the research presented in this chapter, see Kalantzis, M., Cope, B. and D. Slade (1989) Minority Languages and Dominant Culture: Issues of Education, Assessment and Social Equity. London: Falmer Press.

References

Arthur, B., Weiner, R., Culver, M., Lee, Y.J. and D. Thomas (1980) The register of impersonal discourse to foreigners: verbal adjustments to foreign accent. In: D. Larsen-Freeman (ed), Discourse Analysis in Second Language Research. Rowley, MA: Newbury House.
Carroll, B.J. (1980) Testing Communicative Performance. An Interim Study. Oxford: Pergamon.
Commonwealth of Australia (1984) A National Language Policy. Senate Standing Committee on Education and the Arts. Canberra: Australian Government Publishing Service.
Cummins, J. (1984) Bilingualism and Special Education: Issues in Assessment and Pedagogy. Clevedon: Multilingual Matters.
Ellis, R. (1984) Communication strategies and the evaluation of communicative performance, ELT Journal, 38, 39-44.
Ingram, D.E. (1984) Introduction to the Australian Second Language Proficiency Ratings. Canberra: Australian Government Publishing Service.
Kalantzis, M. and B. Cope (1984) Multiculturalism and education policy. In: G. Bottomley and M. de Lepervanche (eds), Ethnicity, Class and Gender. Sydney: George Allen and Unwin.
Kalantzis, M. and B. Cope (1987) Pluralism and equitability: multicultural curriculum strategies for schools, National Advisory and Consultative Committee on Multicultural Education, Position Paper, Canberra.
Kalantzis, M., Cope, B. and D. Slade (1986) The Language Question: The Maintenance of Languages other than English. 2 vols. Canberra: Australian Government Publishing Service.
Lo Bianco, J. (1987) National Policy on Languages. Canberra: Australian Government Publishing Service.
Martin, J.R. and J. Rothery (1985) Teaching Writing in the Primary School: A Genre Based Approach to the Development of Writing Activities. Working Papers. Sydney: Department of Linguistics, University of Sydney.
NACCME (1987) Education in and for a Multicultural Society: Issues and Strategies for Policy Making. Canberra: National Advisory and Consultative Committee on Multicultural Education.
17
The Role of Prior Knowledge and Language Proficiency as Predictors of Reading Comprehension among Undergraduates

TAN SOON HOCK
University of Malaya

The Need to Read in English

Malaysian undergraduates can be characterized as readers of English as a foreign language. However, they are not exempted from what is expected of native-speaker undergraduates, that is, they should be able to read academic materials written in English, both extensively and intensively. The ability to read with comprehension in English is therefore an indispensable skill for Malaysian undergraduates. If they have not mastered it already during their eleven years of learning English at school, they must acquire it at the university. The alternative solution to this dilemma faced by the non-native speaker of English would be to rely on material written originally in Bahasa Malaysia, the national language and the medium of instruction in school and for most of the undergraduate courses, or on material which has been translated into Bahasa Malaysia. This again is not easily achievable because publication and translation in Bahasa Malaysia between 1967 and 1980 was more than 50% below the required number of 1350 titles a year in relation to the 13.4 million population in Malaysia. Even if translated books were readily available, the need to be able to read efficiently is not obviated. This is because the rate at which translations become available cannot keep pace with the cumulation
In other words, scientific work is often obsolete by the time a translation is published.

Clearly, Malaysian undergraduates need to be able to read in English, and, what is more, from a preliminary study it appeared that they need to be able to apply this ability to reading material that is discipline-related. But while the place of content is well designated in the literature, the relative weights of linguistic competence in the target foreign language and conceptual, discipline-related knowledge have yet to be determined. Still, it is a well-documented fact that linguistic competence plays an important role in determining the extent of comprehension, be it oral or written. Obviously, any reading programme will have to consider these two variables, especially one that seeks to develop reading skills in content-related areas, as the University of Malaya's does.

While it is acknowledged that language proficiency and prior knowledge are important variables in reading comprehension, there is as yet little evidence to suggest the extent to which they are able to predict success. This chapter attempts to answer this question by presenting evidence collected from a study conducted with University of Malaya undergraduates. It is also in part a response to the observation by Alderson and Urquhart (1984: xxvii) that because

    . . . reading is a complex activity . . . the study of reading must be interdisciplinary. If the ability involves so many aspects of language, cognition, life and learning, then no one academic discipline can claim to have the correct view of what is crucial in reading: linguistics certainly not, probably not even applied linguistics. Cognitive and educational psychologists are clearly centrally involved; sociology and sociolinguistics, information theory, the study of communication systems and doubtless other disciplines all bear upon an adequate study of reading.

Alderson and Urquhart go on to point out: "The literature on reading abounds with speculations, opinions and claims, particularly in foreign language reading, but relatively little evidence is brought to bear on specific issues." This study acknowledges that there is certainly a lack of research in this direction for foreign language reading. As such, it attempts to be interdisciplinary in seeking support, on the one hand, from linguistics and applied linguistics to understand the role of language, and, on the other, from cognitive psychology to evaluate the role of prior knowledge.

Alderson (1984:24) acknowledges that there is no certainty as to whether reading in a foreign language is a language problem or a reading problem. He concludes:
    The answer, perhaps inevitably, is equivocal and tentative: it appears to be both a language problem and a reading problem, but with firmer evidence that it is a language problem, for low levels of foreign language competence, than a reading problem . . . there is great need for further research into the relative contribution to foreign language reading performance of foreign language competence and first language ability, on particular tasks, seen in relation to other factors like conceptual knowledge, to help us to define more closely the nature and level of the language competence ceiling or threshold for particular purposes.

Method

Subjects

A sample of 317 undergraduates was drawn, consisting of 141 students of Medicine, 95 students of Law, and 81 students of Economics. They were all in the third year of their studies. This year was thought particularly suitable because by this time the students would have become "specialists" in the sense that they would have decided on their area of specialization or would have delved deep enough into the course to be considered more "specialized" than their colleagues from other disciplines.

Theoretical and pragmatic considerations were involved in the selection of the three faculties. A theoretical consideration was that if one of the predictors of reading comprehension is prior knowledge in the form of (mainly) content knowledge, it appeared expedient to choose faculties whose disciplines would have as little overlap in knowledge as possible with one another. Hence, medical students would know less about law than they would about the science-based disciplines. Similarly, law and economics would, hopefully, be sufficiently differentiated to keep the content knowledge distinct. This, for example, would be difficult to achieve between economics and history, as both share a common humanities background. In this way the three faculties, in their own right, demand prior knowledge of a specific nature from students which would allow them to comprehend texts and references efficiently. For the Medicine students the field of specialization was parasitology, for the students in Economics it was marketing, and for the Law students it was family law.
The availability of students was the main pragmatic criterion, and the extent of administrative ease with which students could be obtained for this study also pointed to these three faculties. The third-year medical students were believed to be more cooperative and accessible, as there was no formal end-of-year examination to study for. The students in the Law faculty could be reached through their language classes, since they were registered for courses at our Language Centre. The availability of cooperative subject-specialists-cum-course lecturers ensured equally smooth access to the students during lecture hours.

Tests

The study used three types of tests: (1) a general English proficiency test to serve as a common yardstick for measuring the language proficiency of all subjects in the study, (2) prior knowledge tests to measure the subjects' background knowledge in the disciplines involved in the study, and (3) discipline-related English reading tests.

1. The general English proficiency test

To measure the general English proficiency of all the subjects in the study the English Proficiency Test Battery (EPTB, Form D, 1977) was used. This is a short version of a British Council test (Davies and Alderson, 1977) to assess the English language proficiency of foreign students intending to enter British universities. It consists of listening comprehension (67 multiple-choice questions), reading comprehension (50 modified cloze items, first letter of word given), and grammar (50 multiple-choice questions). Administration of the whole test takes approximately 90 minutes.

2. The prior knowledge tests

A measure of the candidates' prior knowledge in the disciplines involved was operationalized in two ways:
a. the extent to which they were familiar with their own subject area, namely, parasitology, marketing, or family law, as the case may be;
b. the extent to which they were conversant with the topic under discussion in the discipline-related reading texts, namely, malaria, consumer behavior, or divorce.
Since these areas and topics were regarded as highly specialized, the writer had to rely on subject specialists. Therefore, the lecturers of the respective courses were provided with a set of general instructions and asked to devise test items following these instructions. In this way, adequate coverage of the points to be tested and the number of items required would be provided, and content validity would be furthered. Other experiments have measured the extent of prior knowledge of the topic through a test consisting of a set of items based on the actual reading of the text itself as a pre-reading activity. This study, however, allowed the subject specialists to gauge the importance of the facts specific to the topic that ought to be tested. This lends a measure of objectivity to the test items in that the test is removed from the text itself. At the same time the level of content validity is advanced.

In order to ensure that students' scores reflected their real understanding of the topics and areas tested, and were not due to language difficulties that might be encountered, questions were presented in two languages, Bahasa Malaysia and English, in the Economics faculty. This was not considered necessary in the other two faculties, since lectures and tutorials there were, in the main, conducted in English. The testing format for the prior knowledge tests in Law and Economics was multiple-choice. For the Medicine students a true/false test was developed.

3. The discipline-related reading tests

For the discipline-related reading tests, cloze tests were constructed. The source of the texts for the three disciplines was comparable: academic texts/references of a typically academic, expository style, which concentrated on a single theme. Hence, the source and language can be regarded as authentic. The length was made as comparable as possible, depending on the treatment of the topic. It was not considered reasonable to simply end all texts at a certain word count, say the 700th word, just for the sake of uniformity. Rather, it seemed more practical and sensible to end each text at a point where the topic had been satisfactorily dealt with. This policy led to taking complete texts as they were found in the source material. For this reason there is an unavoidable, but minor, discrepancy in length (see Table 17.1 for details of each text). The texts were also subjected to various readability formulae to check the comparability of the selected texts with respect to difficulty. They were all found to be college-level texts.
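The Flesch figures reported in Table 17.1 come from a standard readability formula. As an illustration of how such a check might be run (this sketch is not from the original study; the function names are invented and the syllable counter is only a rough heuristic):

    import re

    def count_syllables(word):
        # Very rough heuristic: one syllable per run of vowels.
        return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

    def flesch_reading_ease(text):
        # Flesch Reading Ease = 206.835 - 1.015*(words/sentence) - 84.6*(syllables/word)
        sentences = max(1, len(re.findall(r"[.!?]+", text)))
        words = re.findall(r"[A-Za-z']+", text)
        syllables = sum(count_syllables(w) for w in words)
        return 206.835 - 1.015 * len(words) / sentences - 84.6 * syllables / len(words)

Scores in the 30-50 band on this scale are conventionally read as "difficult", college-level prose, which is the range reported for the three texts in Table 17.1.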
TABLE 17.1 Details of discipline-related tests
Discipline   Text                   No. of words   Fry            Flesch                  Test format   Del. rate   No. of blanks
Medicine     Malaria(1)                 955        208 (College)  36.49 (Diffic.)         Cloze            10            91
Law          Divorce(2)                1045        173 (College)  32.51 (Diffic.)         Cloze            10            96
Economics    Consumer behavior(3)       764        161 (College)  46.57 (Fairly diffic.)  Cloze            10            71
Sources:
1 Brown, H.W. and D.L. Belding (1964) Basic Clinical Parasitology. 2nd ed. New York: Appleton-Century-Croft, pp. 85-86.
2 Ahmad, I. (1978) Family Law in Malaysia and Singapore. Singapore: Malayan Law Journal (Pte) Ltd, pp. 78-80.
3 McDaniel, C., Jr. (1982) Marketing. 2nd ed. New York: Harper and Row, pp. 117-119.

Sceptics of text comparability may take comfort in the research design of the study, which is crossed in that the subjects from the different groups sat for all the texts. Due to time constraints, however, the Economics students could not sit for the Medical and Law prior knowledge tests.

The cloze format was used with all three texts. A deletion rate of 10 was chosen and students' responses were scored according to the exact scoring procedure.
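Constructing and scoring a fixed-ratio cloze of this kind is mechanical. A minimal sketch follows; the function names are illustrative and not part of the study, and the exact-word criterion is simplified here to a case-insensitive string match.

    def make_cloze(text, rate=10):
        # Delete every `rate`-th word, replacing it with a numbered blank.
        words = text.split()
        key = {}
        for i in range(rate - 1, len(words), rate):
            key[len(key) + 1] = words[i]
            words[i] = "({}) ______".format(len(key))
        return " ".join(words), key

    def exact_score(responses, key):
        # Exact-word scoring: only the originally deleted word counts as correct.
        return sum(1 for n, answer in key.items()
                   if responses.get(n, "").strip().lower() == answer.lower())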
Results

As reported above, all subjects sat for the EPTB as well as for the discipline-related tests. With the exception of the Economics students, all subjects took all prior knowledge tests. For the purpose of this study, texts were classified as "familiar" if they were administered to students of that particular discipline, and "unfamiliar" if students of another discipline sat for them. Test results for all groups on all tests are summarized in Table 17.2.

TABLE 17.2 Results for three groups of students on all tests: means and standard deviations
                          Med. students          Law students          Econ. students
Tests                     n     mean   s.d.      n     mean   s.d.     n     mean   s.d.
English Proficiency
  EPTB                   141     72     12       92     72     13      81     70     13
Prior Knowledge
  Medicine               141     56     10       92      7     10       -      -      -
  Law                    141     38     11       92     64      9       -      -      -
  Economics              141     38     11       92     38     10      81     60      9
Discipline tests
  Medicine               141     45      7       94     35      8      80     34      7
  Law                    139     43     11       92     53     11      78     44      9
  Economics              141     30      6       94     32      5      81     34      5

In order to gauge the significance of the two variables (discipline-related prior knowledge and English language proficiency) as predictors of reading comprehension, a simple multiple regression analysis was used. Analyses of performance of the students from the three faculties on familiar and unfamiliar texts were carried out. Tables 17.3, 17.4, and 17.5 present the findings on comprehension of familiar texts. Each table reports the standardized regression weights (Beta) for the predictor variables, the t-value of the regression weights, and their significance level (p). Below each table the squared multiple correlation coefficient (R²) is given, together with its F-value and the significance level (p). The squared multiple correlation coefficient indicates the proportion of variance in the discipline-related reading test accounted for in total by the language proficiency and prior knowledge tests.

TABLE 17.3 Effect of Medical Prior Knowledge and Language Proficiency on comprehension of medical text by medical students
Source                         Beta weight   t-value   Signific. level
1. Medical Prior Knowledge        0.243        3.36       p<0.01
2. Language Proficiency           0.513        8.06       p<0.001
Adjusted R² = 0.503; F-value = 58.123; p<0.001
TABLE 17.4 Effect of Law Prior Knowledge and Language Proficiency on comprehension of law text by law students
Source                         Beta weight   t-value   Signific. level
1. Law Prior Knowledge            0.294        3.73       p<0.001
2. Language Proficiency           0.589        7.45       p<0.001
Adjusted R² = 0.619; F-value = 73.277; p<0.001

TABLE 17.5 Effect of Economics Prior Knowledge and Language Proficiency on comprehension of economics text by economics students
Source                         Beta weight   t-value   Signific. level
1. Economics Prior Knowledge      0.284        2.97       p<0.01
2. Language Proficiency           0.558        5.84       p<0.001
Adjusted R² = 0.463; F-value = 28.552; p<0.001

Prior knowledge of the medical topic and competence in the target language are both shown to be significant predictors of comprehension of the medical text (Table 17.3). A comparison of Beta weights, however, reveals that language proficiency (0.513) is a stronger predictor than medical prior knowledge (0.243). As in the medical text, prior knowledge of the law domain and a firm grasp of the target language both contribute to a prediction of the extent to which a law student is able to extract and interpret a law text (Table 17.4). Here too, however, the strength of prediction from a score on a language proficiency test (0.589) is greater than that of the score on a test of prior knowledge (0.294). The findings on the economics text (Table 17.5) confirm those of the other two faculties. While both the variables are significant predictors of comprehension of the economics text, language proficiency (0.558) is a better predictor than prior knowledge in economics (0.284).
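The Beta weights and adjusted R² reported in Tables 17.3 to 17.5 (and in the later tables) are standard outputs of least-squares regression on standardized variables. A minimal sketch of how they could be computed from raw scores is given below; the variable names are illustrative, and the original analysis may well have been run with different software.

    import numpy as np

    def standardized_regression(y, X):
        # Return standardized (Beta) weights and adjusted R-squared.
        zscore = lambda a: (a - a.mean(axis=0)) / a.std(axis=0, ddof=1)
        zy, zX = zscore(np.asarray(y, float)), zscore(np.asarray(X, float))
        beta, *_ = np.linalg.lstsq(zX, zy, rcond=None)          # Beta weights
        r2 = 1 - ((zy - zX @ beta) ** 2).sum() / (zy ** 2).sum()
        n, k = zX.shape
        adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
        return beta, adj_r2

    # e.g. y = cloze scores on the medical text; X = columns of prior knowledge and EPTB scores.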
In the light of these identical findings for the three faculties, it can be concluded that there is adequate empirical evidence to show that comprehension of a text can be predicted to a significant extent by both the amount of background knowledge in the topic domain of the text and proficiency in the language the text is written in. However, the weight of language proficiency is about twice the weight of subject-specific background knowledge in the prediction of how well a reader can extract and interpret the meaning of a foreign language text on the subject matter.

If prior knowledge and language proficiency have been determined as significant predictors of comprehension of familiar texts, it is logical to expect that the same findings would extend to the comprehension of unfamiliar texts as well. For unfamiliar texts, however, language proficiency would be expected to contribute proportionally even more to reading comprehension. Tables 17.6 and 17.7 present data on the performance of medical students reading unfamiliar texts.

TABLE 17.6 Effect of Law Prior Knowledge and Language Proficiency on comprehension of law text by medical students
Source                         Beta weight   t-value   Signific. level
1. Law Prior Knowledge            0.145        1.93       p=0.05
2. Language Proficiency           0.651        8.69       p<0.001
Adjusted R² = 0.513; F-value = 56.263; p<0.001

TABLE 17.7 Effect of Economics Prior Knowledge and Language Proficiency on comprehension of economics text by medical students
Source                         Beta weight   t-value   Signific. level
1. Economics Prior Knowledge      0.121        1.57       p=0.11
2. Language Proficiency           0.637        8.25       p<0.001
Adjusted R² = 0.473; F-value = 48.609; p<0.001
From the results of the analyses presented in Tables 17.6 and 17.7 it appears that indeed, when medical students read texts that belong to other discipline areas, their linguistic competence in the target language has a much greater weight in the prediction of their comprehension success than prior knowledge of the subject matter. In fact the regression weights of prior knowledge on the unfamiliar subject areas do not reach significance. This is observed for the performance on the law text (Table 17.6) as well as on the economics text (Table 17.7).

TABLE 17.8 Effect of Medical Prior Knowledge and Language Proficiency on comprehension of medical text by law students
Source                         Beta weight   t-value   Signific. level
1. Medical Prior Knowledge        0.077        0.91       p=0.36
2. Language Proficiency           0.689        8.21       p<0.001
Adjusted R² = 0.450; F-value = 33.736; p<0.001

TABLE 17.9 Effect of Economics Prior Knowledge and Language Proficiency on comprehension of economics text by law students
Source                         Beta weight   t-value   Signific. level
1. Economics Prior Knowledge     -0.049       -0.55       p=0.58
2. Language Proficiency           0.665        7.38       p<0.001
Adjusted R² = 0.409; F-value = 29.344; p<0.001

Tables 17.8 and 17.9 concern the law students' comprehension of unfamiliar texts. In the law faculty, too, reading performance on texts outside the discipline area seems predictable largely by proficiency in the language the text is written in. Here too the predictive value of the unfamiliar knowledge remains insignificant. The stronger predictive value for language proficiency supports the earlier findings, which show that students depend less on their prior knowledge than on their linguistic competence to understand these relatively unfamiliar texts. The same speculations put forth earlier can also be applied here, namely, that this could be due to an unduly low level of prior knowledge in the subject matter.
In the framework of the schema notion of reading (Adams and Collins, 1979; Carrell, 1981; Rumelhart, 1980), it could be said that when foreign language readers possess less information (i.e. a less elaborated schema) to integrate with that found in the text, their ability to reconstruct the meaning of the text inevitably suffers. It may be that the topics, concepts and ideas found in the texts were not within the knowledge base of the reader in sufficient amounts to be efficiently accessed. The example of the law students' medical prior knowledge is a case in point. This implies, again, that a minimum "take-off" level of prior knowledge is required before it can be a useful aid to comprehension.

In conclusion it can be said that foreign language readers make use of their prior knowledge of the topic domain as well as their knowledge of the target language in their attempt to reconstruct and interpret the meaning of a text. This emerges very strongly with texts that are related to the students' discipline of study. When confronted with texts that lie outside the discipline area, of which they have relatively little prior knowledge, the foreign language reader is inclined to rely more on his grasp of the foreign language to make sense of the text. It is suggested that this implies that the existence of prior knowledge alone is not the criterion for better comprehension, but the amount of it.

References

Adams, M.J. and A.M. Collins (1979) A schema-theoretical view of reading. In: R.O. Freedle (ed.), New Directions in Discourse Processing. Norwood, NJ: Ablex.
Alderson, J.C. (1984) Reading in a foreign language: a reading problem or a language problem? In: J.C. Alderson and A.H. Urquhart (eds), Reading in a Foreign Language. New York: Longman.
Alderson, J.C. and A.H. Urquhart (1984) Introduction: what is reading? In: J.C. Alderson and A.H. Urquhart (eds), Reading in a Foreign Language. New York: Longman.
Carrell, P.L. (1981) The role of schemata in L2 comprehension. Paper presented at the 1981 TESOL Convention. Detroit, Michigan.
Davies, A. and J.C. Alderson (1977) English Proficiency Test Battery (EPTB). Edinburgh: Edinburgh University.
Rumelhart, D.E. (1980) Schemata, the building blocks of cognition. In: R.J. Spiro, B.C. Bruce and W.F. Brewer (eds), Theoretical Issues in Reading Comprehension. Hillsdale, NJ: Erlbaum.
18 The Language Testing Interview: A Reappraisal

GILLIAN PERRETT
University of Sydney

The Unstructured Interview

Unstructured interviews are interviews which do not rely on an interview schedule, and therefore the questions are neither preformulated nor identical for each subject. Today, they are found in many tests and testing programs. Simmonds (1985) lists a dozen British tests which include an unstructured interview, and in the United States such an interview is used as an elicitation device in conjunction with the Foreign Service Institute's Absolute Language Proficiency Ratings (1968). In Australia an unstructured interview has for some years been a very widely used elicitation technique in association with the ASLPR (Australian Second Language Proficiency Ratings) and the AMES (Adult Migrant Education Service of New South Wales) rating scale to test high school and adult students for placement into general classes.

The unstructured oral interview has found popularity because it is considered to test global, general proficiency, both by generating language which has linguistic and non-linguistic context (Oller, 1979:305) and by generating authentic language and real communication (Spolsky, 1985:34). Claims such as these, together with the high face validity of the interview, have led to its considerable popularity as a testing instrument. These favourable claims have arisen from contrast with other types of oral tests which are less global, authentic, contextualised or communicative. Among these are linguistic manipulation tasks (translation, transformation, substitution, completion) or structured interviews where preformulated questions are asked in order to elicit specific structures (see Burt and Dulay, 1978).
They have also arisen from being based on models of language which have an inadequate discourse component, and which lack an account of how language relates to context. As a result there is little awareness of what it could mean to assess the interview in terms of discourse features or in terms of context influencing the discourse features produced.

Criticisms of the one-to-one interview as a testing device fall into two groups. Firstly there are those issues which surround the question of authenticity. Shohamy and Reves (1985:54) refer to Stevenson's (1985) and Spolsky's (1985) distinction between "authentic test language" and "real life language". They go on to elaborate the difference between the two in terms of: the goal of the interaction (to obtain assessment); the participants (unfamiliarity between them); the setting (role plays carried out away from the proper physical context); the topic (imposed by the tester instead of being determined by both participants in an unplanned manner); and the fixed time limit (which has an effect on the quality of language produced). Upshur (1975), Jones (1985), and Rea (1985) all refer to the difficulty of testing the language required for particular situations away from the actual situations in which that language variety is used.

Secondly there is the question of the discourse structure itself. These interviews are interactive only in the narrow sense that in them occurs a one-way flow of information between two speakers. Jones (1985) makes the point that it is not easy for examinees to use questions and commands in the interview situation. Upshur (1975:62) expresses the opinion that the question-answer format is not enough: "the interviewer should add false starts, mumbles, ambiguous reference questions and shifts of initiative", and that it does not test participant roles other than conversational roles, roles such as teaching, seeking information, or foreigner talk.

These criticisms are all valid, but they are weakened by being made outside of a model of language which can relate questions of discourse structure and authentic language to the situation in which the language is used. They are also weakened by an implicit assumption that all interviews are intrinsically the same and are different from all other language situations. When the same criticisms are made within the framework of systemic functional linguistics it is possible to suggest how the two sets of factors are systematically related.

This chapter examines the utility of the oral interview as an elicitation device for communicative language testing. It argues that despite its high face validity, the range of linguistic phenomena the interview can elicit is limited.
While such interviews provide data which is satisfactory for the assessment of the subject's phonological, lexico-grammatical, and some discourse systems, they cannot yield adequate information about the subjects' control of topic or text type, nor about interactive aspects of discourse such as speech functions or exchange structure. Nor can interviews provide information about how the subject might use language in other situations. Systemic functional linguistics provides a model which is capable of explaining how the social constraints of the context in which the language testing interview occurs prevent the interviewer from eliciting a wider range of discoursal phenomena.

Analysis of Six Interviews

Six language testing interviews have been analysed to show the nature of the restrictions that the interview places on the subject's output. This analysis seeks to show the discourse limitations of the interviews by examining the generic structure, topics, subtexts, and conversation structure of each. It uses a stratified random sample of six of the 48 interviews collected by Johnston (1985) of the NSW Adult Migrant Education Service (AMES) for his own second language acquisition research. These interviews were typical AMES course placement interviews conducted by skilled and experienced AMES interviewers. The interviews were with three Polish women and three Vietnamese men. The proficiency level ratings accorded to each subject by the AMES raters are given in Table 18.1.

TABLE 18.1 Ratings for six subjects on the ASLPR, given by trained AMES raters
Subject    Nationality    Sex    Rating
IS         Polish         F      0+
ES         Polish         F      1
LJ         Polish         F      2
Van        Vietnamese     M      0
Duc        Vietnamese     M      1-
Phuc       Vietnamese     M      2

Initially, stages of generic structure were identified. Hasan's model of generic structure potential (Halliday and Hasan, 1985:56) identifies those situational variables which comprise the cultural configuration: a specific set of values that describe context in a systematic way and which, she claims, "can be used for making certain kinds of predictions about text structure", that is,
which stages are optional and obligatory, what order they may occur in, and which stages are iterative. These variables combine to predict the generic structure, or the stages, of a text. The following stages of generic structure are identifiable in the interviews (brackets indicate optionality and the caret indicates sequence):

(Orientation) ^ Questioning of Subject ^ (Pre-Closing) ^ (Questioning by Subject) ^ (Closing)

The Orientation, Pre-Closing and Closing differ from the Questioning of Subject (Q of S) and the Questioning by Subject (Q by S) in that they function to control the progress of the interview. They are very minimal in length, and in this data set the Orientation and Closing appear not to have necessarily been recorded in their entirety, and therefore not to have been considered relevant testing data. Orientation and Closing were found in only three interviews; four had a Pre-Closing; five had the Q by S. Only one element of structure, the Q of S, appeared in all six interviews.

The Q by S and the Q of S are clearly the two most important sections. Together they form the body of the interview. Of the two, the Q of S is the more important. The Q of S is obligatory while the Q by S is optional. The sequence Orientation ^ Q of S at the beginning of the interview suggests that the Q of S is the main business of the interview. If there is a Pre-Closing it always precedes the Q by S, suggesting that the Q by S is of secondary significance. The Q of S may reappear either after the Q by S or the Pre-Closing, again suggesting its superior importance. Most importantly, the Q of S contains much more material than the Q by S. Table 18.2 shows the distribution of subtexts (for a definition of subtexts see below) between the two stages Q of S and Q by S. On average more than 90% of the subtexts are of the Q of S stage.

TABLE 18.2 Distribution of subtexts over stages Q of S and Q by S
          Van (0)   IS (0+)   Duc (1-)   ES (1)   LJ (2)   Phuc (2)   Total
Q of S       9        27        37         49       22        40       184
Q by S       2         1         5          1        2         0        11
Total       11        28        42         50       24        40       195
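The structure potential above can be treated as a simple pattern over stage labels. The following sketch is purely illustrative and not part of the original analysis; the label spellings, and the allowance for a single recurrence of the Q of S, are assumptions based on the description given above.

    import re

    # O = Orientation, Q = Questioning of Subject, P = Pre-Closing,
    # S = Questioning by Subject, C = Closing.  Only Q is obligatory, and
    # Q is allowed to recur once after the Pre-Closing or the Q by S.
    PATTERN = re.compile(r"^O?QP?S?Q?C?$")

    def is_well_formed(stages):
        codes = {"Orientation": "O", "QofS": "Q", "PreClosing": "P",
                 "QbyS": "S", "Closing": "C"}
        return bool(PATTERN.match("".join(codes[s] for s in stages)))

    print(is_well_formed(["Orientation", "QofS", "PreClosing", "QbyS"]))   # True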
"unity". These "chunks" were then examined and it was found that the subtext boundaries largely coincided with the following criteria: 1. beginning of new stage of schematic structure 2. topic refusal by subject 3. topic "letting off" by interviewer 4. explicit marker of end of topic 5. explicit marker of beginning of topic 6. new participant in agent role 7. new participant in medium/range role 8. new type of verbal process 9. new circumstantial element 10. new lexical field 11. new tense, time reference, or aspect All departures from these criteria were then considered and a few adjustments were made so that all boundaries between subtexts did meet the above criteria. Thus boundaries between subtexts are signalled by either overt markers of topic change, shifts in participants, processes or attendant circumstances (all of which suggest topic shift), or changes in the way a topic is dealt with in terms of time reference. So a subtext is distinguished from its TABLE 18.3 Distribution of subtexts over motifs by order of total frequency of motifs Subject and ASLPR rating Van IS Duc ES LJ Phuc Total 0 0+ -1 1 2 2 Occupation 2 3 16 10 4 8 43 Language 1 5 5 6 4 8 29 Native country 3 8 5 10 26 Family 3 7 8 3 1 2 24 Hobbies 3 7 10 20 Australia 3 4 5 5 17 Travel to Aust. 2 3 4 2 5 16 Education 2 4 3 2 11 Personal I.D. 2 1 3 Friends 2 1 3 English class 3 3 Total 11 28 42 50 24 40 195
So a subtext is distinguished from its adjacent subtexts both in terms of its topic and in its orientation to that topic. Topics, which inevitably relate closely to the individual informant's personal experience, sort into broader "motif" categories which can be compared across interviews (see Table 18.3).

TABLE 18.3 Distribution of subtexts over motifs by order of total frequency of motifs
Subjects and ASLPR ratings: Van (0), IS (0+), Duc (1-), ES (1), LJ (2), Phuc (2)
Occupation: 2 3 16 10 4 8 (total 43)
Language: 1 5 5 6 4 8 (total 29)
Native country: 3 8 5 10 (total 26)
Family: 3 7 8 3 1 2 (total 24)
Hobbies: 3 7 10 (total 20)
Australia: 3 4 5 5 (total 17)
Travel to Aust.: 2 3 4 2 5 (total 16)
Education: 2 4 3 2 (total 11)
Personal I.D.: 2 1 (total 3)
Friends: 2 1 (total 3)
English class: 3 (total 3)
Total: 11 28 42 50 24 40 (total 195)

One purpose of establishing subtexts was to determine the proportion of topics introduced by the interviewer and by the subject. Table 18.4 shows that the relative number of introductions made by the subjects themselves does not consistently increase as general language proficiency increases. With one exception the subjects introduce less than 20% of the subtexts, and there is no pattern in the figures for these five subjects to suggest that control of topic increases with increased language proficiency.

TABLE 18.4 Number of topic introductions by speaker
                 Van (0)   IS (0+)   Duc (1-)   ES (1)   LJ (2)   Phuc (2)   Total
by interviewer      9        27        36         46       14        33       165
by subject          2         1         6          4       10         7        30
Total              11        28        42         50       24        40       195
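The proportions referred to above can be read directly off Table 18.4; a small illustrative calculation (the figures are copied from the table) makes the point.

    # Topic introductions from Table 18.4: (by interviewer, by subject) per subject.
    introductions = {
        "Van": (9, 2),  "IS": (27, 1), "Duc": (36, 6),
        "ES": (46, 4),  "LJ": (14, 10), "Phuc": (33, 7),
    }

    for name, (by_interviewer, by_subject) in introductions.items():
        share = by_subject / (by_interviewer + by_subject)
        print(f"{name}: {share:.0%} of topics introduced by the subject")

    # All but LJ (10/24, about 42%) fall below the 20% mark noted above.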
Table 18.5 presents details on the modes of topic nomination. Topic nominations are most often realised as questions if nominated by the interviewer. But the figures show a positive relation between level of proficiency and the likelihood of a topic being nominated by means of a statement by the subject. Subjects at higher levels of proficiency seem even more likely to nominate topics with statements than their interviewers are. Direct questions are easiest for low level subjects to respond to, yet apparently the interviewer persists in using questions regardless of the proficiency level of the subjects, even when the subjects' own topic nominations become more varied. Therefore the subjects' ability to respond to various types of topic introduction was not tested in these interviews.

TABLE 18.5 Mode of topic nomination by speaker
Subjects and ASLPR ratings: Van (0), IS (0+), Duc (1-), ES (1), LJ (2), Phuc (2)
By means of questions
  by interviewer: 9 27 25 42 11 31 (total 145)
  by subject: 2 3 1 2 (total 8)
By means of statements
  by interviewer: 11 4 3 2 (total 20)
  by subject: 1 3 3 8 7 (total 22)
Total: 11 28 42 50 24 40 (total 195)

Halliday (1985:68) distinguishes between language which is used to exchange information and language which is used to negotiate the transfer of goods and services. The subtexts in each of the interviews were analysed according to their discourse genre. Table 18.6 shows that among the six interviews, with a total of 195 subtexts, only 4 texts occur which are not concerned with Information Exchange. When only one single offer, one apology, one request, and one piece of advice occur in the course of six interviews it is obvious that the social uses of language are not being tested at all.

TABLE 18.6 Distribution of subtexts over discourse genres
Subjects and ASLPR ratings: Van (0), IS (0+), Duc (1-), ES (1), LJ (2), Phuc (2)
Info. Exchange: 11 27 42 48 24 39 (total 191)
Offer: 1 (total 1)
Request: 1 (total 1)
Advice: 1 (total 1)
Apology: 1 (total 1)
Total: 11 28 42 50 24 40 (total 195)
The specific text types occurring within those sections of the interviews which are concerned with Information Exchange are Report (Present, Past, Future, and General), Recount, Narrative, Description, Opinion, and Discussion. The distributions of these text types in each interview are given in Table 18.7. There is little evidence in these six interviews of any systematic attempt to elicit additional text types from the subject. It was observed that when such an attempt is made, it is generally a response to the difficulty the subject finds in "keeping on talking" and is therefore more frequent in the lower proficiency levels (where it is in fact less appropriate). More advanced subjects talk more readily and as a result their ability to "talk in different discourse genres" appears to receive less probing in these interviews.

TABLE 18.7 Distribution of subtexts over text types within Information Exchange by order of total frequency of text types
Subjects and ASLPR ratings: Van (0), IS (0+), Duc (1-), ES (1), LJ (2), Phuc (2)
Pres. Report: 9 16 23 20 14 5 (total 87)
Past Report: 8 10 16 7 19 (total 60)
Opinion: 2 1 4 6 1 (total 14)
Recount: 3 6 (total 9)
Discussion: 3 3 2 (total 8)
Fut. Report: 2 2 1 (total 5)
Gen. Report: 3 (total 3)
Description: 2 1 (total 3)
Narrative: 2 (total 2)
Total: 11 27 42 48 24 39 (total 191)

A last type of analysis is the analysis of the conversation structure of the interviews. Use of the model of conversation structure developed by Martin (1985) and Ventola (1987), based on the work of Berry (1981), can show more clearly how far the interviewer is in control of the discourse. According to this model, conversation is constructed out of exchanges which consist of moves. Each exchange negotiates one or more propositions and the moves of an exchange are related by syntax which is potentially elliptical. Every exchange has one compulsory move:

K1   I don't know what I will study now but I think I will study soon, you know.

The K symbolises a knowledge/information move whereas the 1 indicates that the speaker is the source of the information. There are various optional moves, for example:
K2   Can you tell me your name?
K1   Yes, my name Van.
K2   Uh huh.

A second type of move is the A move, which is a move associated with the actual performance of an action. Action exchanges are represented thus:

A1   And, ah . . . do you want a cigarette?
A2   Please.
A1   Here you are.

The distribution of K1 moves and A1 moves (see Table 18.8) shows a similar imbalance between language used in connection with Information Exchange and language which is used to negotiate the transfer of goods and services (cf. Table 18.6). This distinction, then, is realised both at the level of discourse genre (Table 18.6) and at the level of conversation structure (Table 18.8).

TABLE 18.8 Conversation structure: distribution of K1 and A1 moves by speaker
Subjects and ASLPR ratings: Van (0), IS (0+), Duc (1-), ES (1), LJ (2), Phuc (2)
K1 moves
  by interviewer: 7 6 46 21 7 37 (total 124)
  by subject: 18 63 150 123 54 186 (total 594)
A1 moves
  by interviewer: 1 2 1 (total 4)
  by subject: 1 1 (total 2)
Total: 25 70 198 145 62 224 (total 724)
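The counts in Table 18.8 amount to a tally of move types by speaker over the coded exchanges. The sketch below is illustrative only; the fragment of coded data is invented, not drawn from the interviews.

    from collections import Counter

    # Each move is coded as (speaker, move type).
    moves = [("interviewer", "K2"), ("subject", "K1"), ("interviewer", "K2"),
             ("interviewer", "A1"), ("subject", "A2"), ("interviewer", "A1")]

    tally = Counter((speaker, move) for speaker, move in moves if move in ("K1", "A1"))
    print(tally)   # Counter({('interviewer', 'A1'): 2, ('subject', 'K1'): 1})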
Discussion

The range of limitations in the scope of subject discourse in the interviews is not primarily the result of limited language proficiency. In terms of the features described, there is very little distinction between the beginning and the intermediate level speakers. If the features described could be attributed to limited language proficiency we would expect figures which would indicate a diminution of the restrictions with increased proficiency.

Systemic functional linguistics offers a model of language which shows how the control exercised by the context of the interview is responsible for certain discourse features of the language testing interview, features which might discriminate between different proficiency levels in language generated in another social context. Systemic functional register theory relates instances of language performance to the features of immediate context and social structure which determine them. Halliday and Hasan (1985) use Malinowski's concepts of context of culture and context of situation to posit three variables which define the register of a text:

    Field describes what is happening, that is, the subject matter and purpose of the text.
    Mode describes the part the language plays in the situation, including channel and rhetorical mode.
    Tenor describes the participants and the relationships between them.

They claim "Any piece of text . . . will carry with it indications of its context [so we can] reconstruct from the text certain aspects of the situation, certain features of the field, tenor and mode" (p.38). In their view the notion of register is

    . . . the concept of a variety of language corresponding to a variety of situation. . . It can be defined as a configuration of meanings that are typically associated with a particular situational configuration of field, mode, tenor (pp.38f).

The language testing interview is a unique sociolinguistic event which does not find its parallel anywhere else in our culture. Looking at the three register variables posited by systemic functional linguistics will make the reasons for this clear.

Field deals primarily with purpose and topic. The language testing interview is unique in having two purposes embedded one within the other. The overt, but secondary, purpose of the interview is for the interviewer to extract factual information of some type from the interviewee. In any other interview there is a clear pragmatic purpose of exchanging certain categories of information and there is a high degree of consensus about what these categories are. Therefore the duration of the interview, the length and structure of the stages within it, and the length of speaker turns are determined by the demands of this content.
However, the covert, but real, purpose of the language testing interview is to display the language proficiency of one participant. This orientation results in a one-way flow of personal information in order to exhibit language fluency. The question of what information is exchanged is secondary. What is actually discussed is not considered important, as the interest is in the degree of fluency evidenced by the subject. Thus the concept of field offers a way of distinguishing between the overt and the covert purposes of the interview.

Mode deals with the role of language within a situation. The role of language in the language testing interview is display on one side and facilitation of display on the other. Brindley (1979:5) states explicitly: "the main purpose of the interview is to encourage the student to speak so the interviewer should resist the temptation to talk at length". It is assumed by both interviewer and informant that the primary purpose of the interview is for the subject to talk as much as possible and the interviewer only as much as is necessary to keep the subject talking. A fluent subject is often given the opportunity to produce long monologues in answer to questions rather than to exhibit interactional skills. The weaker speakers engage in more interaction (of a limited kind) as they require support and "scaffolding". This accounts for the fact that the interviewer uses his control to elicit K moves from the subject, as these are the moves which can most easily be expanded or added to within a single turn. A more interactional style of discourse is not sought because this would produce more even turn lengths, and therefore a certain amount of "wasted time". As a result subjects are restricted in the type of language (information or action), and move types available to them within the interview format. They are not given a chance to show what they may be able to do as conversationalists.

Tenor deals with the relationships of power, social distance and affect between participants (Poynton, 1985:76). Social distance between the interviewer and the subject is extreme. Not only are they strangers, but the tester, especially if perceived as a teacher or somebody who will make decisions about the subject's future, is in a position of considerable power over the subject. The tester is likely to be a total stranger and, even if known, not an intimate. The combination of extreme social distance with real or perceived power results in it being hard or impossible for the interviewer to relinquish control even if s/he wishes to. In terms of conversational roles this means that the interviewer remains the initiator of the conversation and the subject remains the respondent. These roles do not shift throughout the interview. The tenor variables are so strong in these interviews that they ultimately override the other variables of field and mode.
Conclusion

The register variables are responsible both for the homogeneity of the generic structure of the interviews and for the discourse limitations described in this chapter. Topics are introduced predominantly by the interviewer, regardless of the proficiency level of the subject. Topic nominations by the interviewers are most frequently questions, even when subjects are able to show greater variety in their own nominations. An imbalance exists between the two sections of the body of the interview which suggests that it is implicitly considered more important to test whether the subject can give information than elicit it. Language to exchange goods and services is not tested at all; language which conveys information is preferred. A range of text types which involve the exchange of information clearly develops, but the fact that there is no evidence of any deliberate probing to elicit these suggests that it was not (at the time the interviews were given) considered important to elicit a variety of text types. Analysis of conversation structure brings up again the imbalance between language as reflection and language as action: the subject has a disproportionate number of K1 moves compared to the interviewer. The subject is given too few opportunities to question and has no opportunity to produce appropriate commands, requests, or offers.

Although instructions to interviewers about handling the interview in terms of discourse are often very inexplicit (see, for example, Brindley, 1979:7f; Ingram, 1984:114-33), improving them is likely to have a limited effect in view of the fact that the power relationship between interviewer and interviewee remains unchanged. To test uses of language appropriate to other registers, true ethnographic observations have to be undertaken, which are expensive in time and money, or other registers have to be simulated in the testing situation (Morrison and Lee, 1985; Berkoff, 1985). There are situations in which the interview should be abandoned in favour of the many other types of oral elicitation techniques currently being developed.

The analysis presented in this chapter leads to the conclusion that, in general terms, it is not possible to use an interview to assess the subjects' ability to control conversation, to produce topic-initiations, or to assume responsibility for the continuance of the discourse. Suggestions to the interviewer about how to be more sensitive to the discourse features of the language cannot address those unchanging and unchangeable characteristics of the interview as a cultural event in which there is an uneven distribution of power and control.
References

Berkoff, N.A. (1985) Testing oral proficiency: a new approach. In: Y.P. Lee, A.C.Y.Y. Fok, R. Lord, and G. Low (eds), New Directions in Language Testing. Oxford: Pergamon Press.
Berry, M. (1981) Systemic linguistics and discourse analysis: a multilayered approach to exchange structure. In: M. Coulthard and M. Montgomery (eds), Studies in Discourse Analysis. London: Routledge and Kegan Paul.
Brindley, G. (1979) Evaluation of E.S.L. Speaking Proficiency through the Oral Interview. Sydney: A.M.E.S. Syllabus Development Project.
Burt, M. and H. Dulay (1978) Some guidelines for the assessment of oral language proficiency and dominance, TESOL Quarterly, 12, 177-192.
Foreign Service Institute School of Language Studies (1968) Absolute Language Proficiency Ratings. Reproduced in J.L.D. Clark (1972), Foreign Language Testing: Theory and Practice. Philadelphia, PA: Center for Curriculum Development.
Halliday, M.A.K. (1985) An Introduction to Functional Grammar. London: Edward Arnold.
Halliday, M.A.K. and R. Hasan (1985) Language, Context, and Text: Aspects of Language in a Social-semiotic Perspective. Geelong: Deakin University.
Ingram, D. (1984) Report on the Formal Trialling of the Australian Second Language Proficiency Ratings (ASLPR). Canberra: Australian Government Publishing Service.
Johnston, M. (1985) Syntactic and Morphological Progressions in Learner English. Canberra: Department of Immigration and Ethnic Affairs.
Jones, R.L. (1985) Language testing and the communicative language teaching curriculum. In: Y.P. Lee, A.C.Y.Y. Fok, R. Lord, and G. Low (eds), New Directions in Language Testing. Oxford: Pergamon Press.
Martin, J.R. (1985) Process and text: two aspects of human semiosis. In: J.D. Benson and W.S. Greaves (eds), Systemic Perspectives on Discourse. Vol. 1. Advances in Discourse Processes XV. Norwood, NJ: Ablex.
Morrison, D.M. and N. Lee (1985) Simulating an academic tutorial: a test validation study. In: Y.P. Lee, A.C.Y.Y. Fok, R. Lord, and G. Low (eds), New Directions in Language Testing. Oxford: Pergamon Press.
Oller, J.W., Jr. (1979) Language Tests at School: A Pragmatic Approach. London: Longman.
Poynton, C. (1985) Language and Gender: Making the Difference. Geelong: Deakin University.
Rea, P.M. (1985) Language testing and the communicative language teaching curriculum. In: Y.P. Lee, A.C.Y.Y. Fok, R. Lord, and G. Low (eds), New Directions in Language Testing. Oxford: Pergamon Press.
Shohamy, E. and T. Reves (1985) Authentic language tests: where from and where to? Language Testing, 2, 48-59.
Simmonds, P. (1985) A survey of English language examinations, ELT Journal, 39, 33-42.
Spolsky, B. (1985) The limits of authenticity in language testing, Language Testing, 2, 31-40.
Stevenson, D.K. (1985) Authenticity, validity and a tea party, Language Testing, 2, 41-47.
Upshur, J.A. (1975) Objective evaluation of oral proficiency in the ESOL classroom. In: L. Palmer and B. Spolsky (eds), Papers on Language Testing 1967-1974. Washington, DC: TESOL.
Ventola, E. (1987) The Structure of Social Interaction: A Systemic Approach to the Semiotics of Service Encounters. London: Frances Pinter.
19 Directions in Testing for Specific Purposes

GILL WESTAWAY
British Council
J. CHARLES ALDERSON
University of Lancaster
and
CAROLINE M. CLAPHAM
University of Lancaster

The English Language Testing Service (ELTS) is jointly produced and administered by the British Council and the University of Cambridge Local Examinations Syndicate (UCLES). Introduced in 1980, it provides a systematic and continuously available means of assessing the English language proficiency of nonnative speakers of English wishing to study or train in the medium of English. Based on a specification of the students' language needs, the Testing Service offers an on-demand test designed to measure candidates' general language skills and other skills needed for effective study or training. The test takes account of differences in subject specialism and course types. As well as the test instrument itself, the Service also provides guidance on how the test results should be interpreted in order to reach decisions about the need for English language tuition before or alongside the main course of study.

In 1986/7 ELTS was taken by around 14,000 candidates at 150 centres worldwide. The test results are currently accepted for undergraduate or postgraduate entry by all British universities and polytechnics and by an increasing number of receiving institutions in the wider English-speaking world, notably in Australia and Canada.
The current ELTS divides into two main tests: the Academic and the Non-Academic. The Academic ELTS is made up of five subtests (cf. Table 19.1). Two of these (G) are designed to test general English language proficiency. The remaining three (M) form modules to test language skills in particular subject areas: Life Sciences, Medicine, Physical Sciences, Social Studies and, for those candidates not covered by these specific subject areas, General Academic. Each candidate's scores are reported on a Test Report Form as Bands of Ability associated with performance descriptors from Band 1 (Non-user) to Band 9 (Expert User), with a profile score containing details of each of the five subtests. An Overall Band Score gives the mean score of the subtests.

TABLE 19.1 Current structure of ELTS, Academic
Subtest   Skill tested                              Test method                     Questions   Minutes
G1        General Reading                           multiple-choice                    40          40
G2        General Listening                         multiple-choice                    35          30
M1        Study Skills (subject-specific)           multiple-choice                    40          55
M2        Writing (subject-specific)                extended writing                    2          40
M3        Interview (general & subject-specific)    individual guided discussion      n.a.        12-15

The Non-Academic Module is intended for candidates whose training is to be of a practical or technical nature. It is made up of three subtests (cf. Table 19.2).

TABLE 19.2 Current structure of ELTS, Non-Academic
Subtest   Skill tested          Test method                           Test length (minutes)
M1        Listening             multiple-choice                               30
M2        Reading & Writing     multiple-choice & writing sentences           45
M3        Interview             individual guided discussion                  15
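On this reporting scheme the Overall Band Score is simply the arithmetic mean of the profile bands. A minimal sketch follows; the profile values are invented for illustration, and no rounding convention is assumed here.

    def overall_band(profile):
        # profile: subtest -> band, e.g. the five Academic subtests G1, G2, M1, M2, M3.
        return sum(profile.values()) / len(profile)

    print(overall_band({"G1": 6, "G2": 5, "M1": 6, "M2": 5, "M3": 7}))   # 5.8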
The Edinburgh Validation Study

When the English Language Testing Service was first introduced in 1980, "it had been as fully piloted as was compatible with the time constraints imposed by the need to bring it into use as quickly as possible" (Criper and Davies, 1986:11). Given the innovative nature of the test's design, the British Council and the University of Cambridge Local Examinations Syndicate were anxious to maximize data on the test's validity. Consequently in 1981 the Edinburgh ELTS Validation Study was set up (Criper and Davies, 1986). Briefly, the project aimed to evaluate:

1. the predictive validity with respect to candidates' success in their academic studies;
2. the extent to which proficiency in English affects success in academic studies;
3. the test's face, content and construct validity;
4. the concurrent validity with two test batteries of non-subject-specific proficiency in English as a foreign language, the English Language Battery (ELBA) (Ingram, 1967) and the English Proficiency Test Battery (EPTB) (Davies and Alderson, 1977);
5. the internal reliability and retest reliability.

Predictive validity

As with all predictive studies, the design of this predictive study was not without its problems. Defining the criterion of "success" is difficult in a situation in Britain where few postgraduate students actually fail their courses, although their performance may not be up to the required standards. This means that a further difficulty is identifying a suitable criterion for academic success or failure. Sampling was another problem. As all the students used in the study were already in Britain, the sample was inevitably truncated, as the potential failures were less likely to have got through the system. Nevertheless, on the whole the ELTS test was felt to predict as well as any other English language proficiency test, accounting for about 10% of the variance (i.e. correlations of about 0.30).
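The figure of about 10% follows from the correlations themselves, since the proportion of variance accounted for is the square of the correlation coefficient:

    r = 0.30          # typical correlation between ELTS scores and the success criterion
    print(r ** 2)     # 0.09, i.e. roughly 10% of the variance accounted for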
Interestingly, when all the candidates' results from all the subject modules were put together for statistical analysis, the modules G1, G2 and M3 provided almost as much predictive power as the whole test; M1 and M2 appeared to be adding no significant prediction to the test results. However, when the results of the different modules were analysed separately, the figures varied considerably. It cannot be established how much of this variation is due to lack of equivalence between subject modules, but it could be interpreted as evidence in support of the theory that the language contribution over different academic skills and disciplines is not constant. As to the question of optimum score levels in ELTS for various types of academic success, there was evidence to show that the critical cut-off point was at Band 6.

Construct validity

The validation report concluded that ELTS can be described as an English for specific purposes (ESP) test and draws on the methodology of a needs analysis design. Nevertheless the test was found to be weak in both areas (Criper and Davies, 1986:99):

    In the first, the lack of specificity as well as the uncertainty as to level . . . This is in part a reflection on the weak content validity of the test, drawing too little on subject specialist opinion, in part a flaw in the theory of ESP itself. Like register analysis before it, ESP both in teaching and testing falls down once it moves from the process of variation, variety, specific purpose to discrete entities which appear to be impossible to delineate and to keep apart. The failure then is not in ELTS . . . but in the theory.

    In the second area, that of needs analysis, ELTS was constructed with something of a needs analysis blueprint but in what was, as we have now seen, a highly unsystematic way and also in a thoroughly nonempirical manner. Since needs analysis stands or falls by the empiricism it demands, it is regrettable that the needs analysis used for the construction of this first version of ELTS was not in itself the result of an empirical investigation but rather, as far as we can see, the result of the best guesses of various language teachers. This activity may well have helped in the construction of a good language test but in no way can it be regarded as an exercise in needs analysis.

Content validity

It was impossible to provide an adequate assessment of the content validity since the original test specifications were no longer available.
There were, however, differences of opinion among applied linguists asked to judge what particular test items did in fact test. Judgements on the success of the different subtests in meeting their supposed aims also varied.

Face validity

Apart from some criticism of the G2 Listening subtest, candidates were generally in favour of ELTS, considering it to be a "fair" test of their English proficiency. Supervisors' views were also largely favourable.

Concurrent validity

The test was found to overlap considerably with ELBA and with EPTB (correlations of 0.77 and 0.81 respectively). This was considered to be at least partly due to the strong effect of the general component in ELTS. Correlations between candidates' scores and supervisors' and language tutors' ratings were low (around 0.30) and inconclusive.

Reliability

Reliability indices (KR-20) for the multiple-choice subtests ranged from 0.80 to 0.93. In a study designed to look at test/retest reliability over an eight-month period, correlations for the multiple-choice sections of the test (G1, G2 and M1) were all in the 0.70s. Reliability for the two open-ended subtests (M2 and M3) was reported to be more problematic (correlations of 0.49 and 0.53 respectively). It should be noted, however, that since these studies considerable work has been undertaken, including the development of detailed assessment guides, to improve reliability. One important finding was that when the results were analysed by subject area, the different modules were seen to behave differently.
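For readers unfamiliar with the KR-20 statistic quoted above, the sketch below shows how such an internal-consistency index can be computed for a dichotomously scored multiple-choice subtest. Everything in it is an assumption introduced for illustration: the function name, the use of numpy, and the simulated response matrix (which is hypothetical, not ELTS data).

import numpy as np

def kr20(responses):
    """Kuder-Richardson formula 20 for a matrix of 0/1 item scores
    (rows = candidates, columns = items)."""
    k = responses.shape[1]                        # number of items
    p = responses.mean(axis=0)                    # proportion correct per item
    item_variance = (p * (1 - p)).sum()           # sum of item variances p*q
    total_variance = responses.sum(axis=1).var()  # variance of candidates' total scores
    return (k / (k - 1)) * (1 - item_variance / total_variance)

# Illustrative use with simulated candidates and items (hypothetical data only):
rng = np.random.default_rng(1)
ability = rng.normal(size=500)                    # candidate ability
difficulty = rng.normal(size=40)                  # item difficulty
p_correct = 1 / (1 + np.exp(-(ability[:, None] - difficulty)))
responses = (rng.random((500, 40)) < p_correct).astype(int)
print(round(kr20(responses), 2))                  # internal-consistency estimate for the simulated data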
Practicality

There were found to be considerable practical difficulties in administering ELTS, owing to the length of the test, the need to set up an interview for each individual candidate, the need to find and train examiners for the M2 Writing test and the M3 Interview, and the amount of time spent dealing with the wealth of test booklets comprising the G and M components in the different modules. The validation report therefore recommended that the test should be simplified, with fewer items and a shorter overall length. A more serious difficulty, however, was perceived to be that of matching candidates to the appropriate modules, for, as the report points out, if this matching is indeed a problem, much of the rationale for the complexity of the test design is lost (Criper and Davies, 1986:108):

The principle underlying ELTS is that true English proficiency is best captured in a test of specific purposes. If it is the case that matching student to module (or testee to test) is so uncertain, then ELTS loses the very advantage it was designed to maximise.

The "ELTSVAL" Conference

Following the submission of the report on the ELTS validation study (Criper and Davies, 1986), a conference was organized in London in October 1986, at which a group of language testing researchers presented papers reacting to the report and discussed the implications of its findings for a revised ELTS test.1 During the conference the following main points emerged.

Construct validity

Much discussion centred on construct validity, with particular reference to the direction in which the revised ELTS should move. For while the theory of ESP no longer enjoys the same popularity as in the 1970s, and Munby's model (Munby, 1978) for communicative syllabus design is widely held to be inadequate, there is no single accepted theory of communicative competence available to replace them. The need for wider consultation among applied linguists was recognized as an important step in the process of determining the shape of the revised test.
Content validity

Because of the problems of attempting to reconstruct the original content specifications of the current ELTS test post hoc in order to create parallel versions, it was realized that one of the crucial stages in the construction of a revised test would be the drawing up of specifications which would provide a blueprint for future versions of the test.

Reliability

The reliability needed to be improved upon, despite the attention paid in recent years to the tightening up of assessment procedures for M2 and M3.

Practicality

The test should be shortened and administrative procedures simplified.

Continuity

Despite its weaknesses, it was clearly the case that the current ELTS has very high face validity, especially with receiving institutions. It was therefore felt to be important to ensure a significant degree of continuity in a revised test. Like the current ELTS, the revised ELTS should have a good washback effect on teaching. Furthermore, retaining the same system of score reporting would build upon the accumulated experience of receiving institutions in interpreting ELTS test results.

Data Collection for the ELTS Revision Project

In January 1987 a project to revise the ELTS was officially set up by the British Council and the University of Cambridge Local Examinations Syndicate (UCLES). The ELTS Revision Project has the task of developing, by September 1989, a revised test which will meet the changing demands of its users worldwide. The Project is overseen by a steering committee which includes representatives of the British Council, UCLES and the International Development Programme of Australian Universities (IDP Australia). Drawing lessons from the shortcomings of the development process of the current ELTS, it was felt that, before the revision of the test was begun, as much data as possible should be gathered to ascertain the views of test users and testing experts on the current test.
The process would be fully documented at all stages so that the methodology followed would be accessible to those concerned with further revisions of this or other testing systems in the years to come. The following steps were taken to collect the maximum amount of data:

1. Questionnaires were sent out to overseas test administrators, receiving institutions (universities and polytechnics), overseas sponsors, and teachers on pre-sessional English courses. Reports were compiled from the responses.
2. Two one-day workshops were held, one with language testers and one with teachers on pre-sessional and in-sessional English courses. At both these meetings participants gave their views on the future structure, method and content of the revised test, and sent in written comments and suggestions afterwards.
3. Interviews were held in London with British Council Departments involved in using and interpreting the test scores. The results were summarized in a report.
4. A random sample of 1,000 Test Report Forms was analysed to examine which candidates were entered for which subject modules.
5. A report was compiled by the Testing Service on the difficulties of servicing the current ELTS.
6. Papers on the nature of language proficiency relevant to the testing of English for Academic Purposes were requested from a number of well-known applied linguists.

The general opinion of virtually all informants was that the overall design of ELTS should remain the same. Not unexpectedly, some aspects of the test were consistently criticized, but nevertheless all the groups who contributed their views were generally satisfied with the test.

Overseas test administrators generally expressed a high level of satisfaction with ELTS, which they considered an important aspect of the British Council's work overseas and recognized as a good predictor of the value to be derived from a course of study. They did, however, complain that it was a burden to administer and was unfair to some groups of students. Responses indicated that this group was in favour of a revised test which assessed reading, writing, listening, and speaking while avoiding the overlap between reading and study skills which exists in the current test in G1 and M1. If a modular structure were retained, it should be better adapted to the needs of the candidates.
The availability of practice materials would ensure that candidates were familiar with the demands of the test, which in its revised form should be suitable for a wider range of candidates, including pre-undergraduates. The regular production of new parallel versions was perceived to be essential, as was the simplification of the administrative procedures.

Almost all of the receiving institutions consulted indicated that they accepted ELTS scores as proof of English language proficiency. Band 6 was the most commonly accepted score for both undergraduate and postgraduate entry. In setting conditions for entry, institutions tended to quote Overall Band scores for administrative convenience, whereas individual departments frequently used the profiles, which they found more informative. Most liked the fact that ELTS is offered in six different Academic subject modules and did not think that this number should be reduced. If a reduction was necessary, the most popular option was for three modules. The majority felt that, because the academic needs of undergraduates and postgraduates did not differ significantly, it was not necessary to develop separate tests for these two target populations.

The feedback from the overseas sponsors was very limited, largely because most sponsors are unfamiliar with the test and are not normally directly involved in interpreting ELTS results. The clearest response came from Oman, where ELTS had been used to test groups of "pre-undergraduate" students, for whom neither the Academic nor the Non-Academic test was felt to be appropriate.

As one might expect, pre-sessional teachers showed interest in the washback effect of the test upon classroom teaching and in this connection were in favour of the development of a wider range of item types in the revised ELTS. The profile scores were felt to be useful, although concern was expressed at the unreliability of the subjectively marked subtests. Some teachers mentioned the difficulty students experienced in selecting the appropriate module and favoured the development of more subject-specific modules (e.g. Computer Studies, Architecture). They did, however, recognize the practical problems of implementing such a suggestion.

The language testers were generally in favour of reducing the number of Academic subject modules from six to two or three. Some participants felt that there should be a reduction to a single general test for all candidates, but it was recognized that the high face validity of the current test was due at least in part to the subject specificity of the modular structure. It was suggested that the aim of shortening the test could best be achieved not by cutting out one of the subtests, but by having fewer items per subtest or by using test methods (e.g. a cloze) that allowed a greater item yield for a given period of time.
ELTS was generally regarded favourably by the British Council staff interviewed and was felt to be a considerable improvement over its predecessors (e.g. the ELBA test). It was perceived as accurately identifying the need for pre-sessional English tuition in most cases, although the feeling was expressed that it was not suitable for all categories of overseas students (e.g. Omanis). People attending short vocational courses and visitors on industrial attachments were also mentioned as target populations for whom the test was less suitable. One of the most frequently articulated comments was the need for a shorter test which was simpler to administer.

The most obvious point to emerge from the analysis of the Test Report Forms was the very wide range of courses taken by ELTS candidates and the range of different subject modules sat for by candidates within a given field. For example, of 12 students studying Geology, 9 took the Physical Sciences module, 1 took General Academic and 2 took Technology.

The report compiled by the Testing Service on the difficulties in servicing the current ELTS underlined the problems of producing parallel versions of the ELTS test because of the lack of original specifications on which to base the multiple-choice items in G1, G2 and M1; post-hoc specifications had been drawn up for this purpose. The problem of low reliability in the subjectively marked subtests (M2 and M3) was emphasized. Considerable work had been undertaken on the production of detailed Assessment Guides, and monitoring of the M2 marking had been started. Nevertheless, more systematic monitoring of standards in the assessment of the M2 and M3 subtests was recommended for the future.

The position papers commissioned from applied linguists, setting out their views on the nature of language proficiency to be incorporated in an English for Academic Purposes (EAP) test, were very heterogeneous, thus confirming the impression that no single theory of communicative competence has yet replaced the Munby (1978) model on which the current test is based. There was, however, a consensus that the Munby model itself was not appropriate for a test which is intended to reflect current views of language proficiency.

In July 1987 a group of language testing researchers, including representatives from Australia, Canada, and the United States, was invited to meet in a Consultative Conference. The researchers had been given the opportunity to consider the data collected by the ELTS Revision Project prior to the conference and were asked to put forward their proposals for a revision of the test. The need to produce a test which is less cumbersome and cheaper to administer than the current ELTS was a major factor in the discussions on the proposed shape of a new version of the test.
The majority felt that changes to the test structure should not be too radical and that General and Modular components should be maintained. Most felt that there was no need to develop separate tests for undergraduates and postgraduates, but that there should be a differentiation between academic and non-academic target populations. The suggestion was made that different target groups could take different combinations of subtests. The majority were in favour of a reduction from six to either two or three subject modules, with a General component testing structure and lexis, listening and speaking, and a Modular component testing academic reading, writing and listening. With regard to test method, there was support for the exploration of a greater variety of item types, for example cloze, information transfer, and short-answer questions. No views were expressed on the Non-Academic Module, a fact which underlines the general lack of awareness in the English language teaching and testing profession of the existence and nature of this module.

The Structure of the Revised ELTS

Following the Consultative Conference, the ELTS Revision Steering Committee drew up proposals as to the structure and contents of the revised ELTS. It was decided that the new ELTS would contain three General components (G1-Grammar, G2-Listening, G3-Speaking) and two Modular components (M1-Academic Reading, M2-Academic Writing). For the Modular components, three subject-specific modules (Arts & Social Science, Physical Science & Technology, Life & Medical Sciences) as well as a General Training Module would be developed. Total test length would be limited to 2.5 hours. It is not the purpose of this chapter to examine the rationale behind these decisions in detail, but nevertheless several points should be noted.

Because it would never be possible to develop enough modules to satisfy all students, there was a certain amount of feeling that there should be no subject-specific modules in the revised ELTS. Nevertheless, as has been noted above, almost all participants, and notably the receiving institutions, felt that one of the attractions of ELTS is the choice of subject modules. Furthermore, the ELTS Validation Study (Criper and Davies, 1986) had provided some evidence in support of an ESP test model, and the decision to keep the modular structure therefore prevailed.

The decision as to how many subject modules there should be was difficult. For reasons of practicality it was agreed that, in spite of support in some quarters for the idea of increasing the number of subject modules offered, the revised test should have fewer.
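By way of summary before the module question is taken up below, the component structure decided on by the Steering Committee could be set out as follows. This is purely an illustrative sketch of the decisions described above, not an official specification; the key names and layout are invented for this chapter, and no timings for individual subtests are assumed.

# Illustrative summary of the proposed structure of the revised ELTS,
# as described in this chapter; names and layout are hypothetical.
REVISED_ELTS = {
    "general_components": {
        "G1": "Grammar",
        "G2": "Listening",
        "G3": "Speaking",
    },
    "modular_components": {
        "M1": "Academic Reading",
        "M2": "Academic Writing",
    },
    "modules": [
        "Arts & Social Science",
        "Physical Science & Technology",
        "Life & Medical Sciences",
        "General Training",
    ],
    "total_length_hours": 2.5,   # overall limit stated for the revised test
}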
Opinion was divided between those who felt that a simple distinction should be made between Arts and Sciences and those who felt that a further subdivision should be attempted. The analysis of the 1,000 Test Report Forms had revealed that candidates were divided into three roughly equal groups: one third intending to take subjects in Arts and Social Sciences, one third Physical Science and Technology, and the remaining third Life and Medical Sciences. This coincided with the recommendations of some receiving institutions, who had suggested conflating the existing six modules in this way, and in the absence of any other firm data indicating how this matter should be resolved, the decision was taken to have three modules in these broad subject areas in the revised test. Subsequent stages of the Revision Project will seek evidence as to the validity and value of this decision.

The data gathered indicated that the revised test battery should cater for a wider range of target populations than the existing ELTS, which was felt to be unsuitable for prospective undergraduates, since they do not normally have sufficient specialist knowledge in their chosen subject of study, and for "access" students. The revised ELTS aims to cater for the following target groups:

Undergraduates/postgraduates

Since the revised test will no longer be divided into so many subject-specific areas, undergraduates should not experience problems selecting a module. Undergraduates and postgraduates would therefore sit the same test, since no evidence was found of a need for a separate undergraduate module.

"Access" students

Such students are planning to undertake school matriculation exams or elementary technical qualifications. They may possibly apply for academic courses at a later date, but at the time of taking ELTS they will be unable to cope with the demands of the Academic modular component. A General Training Module will be available for such candidates.

Non-academic students

There is evidence to show that an increasing number of the candidates sitting ELTS are intending to take vocational courses, which vary a great deal in nature and length. The revised ELTS will therefore include a General Training Module with components appropriate to this group.
English as a second language (ESL) students

Some students are currently exempt from taking ELTS because, as they come from countries where English is the medium of instruction, their language level is expected to be high. Nevertheless, owing to their unfamiliarity with British study methods, they often experience difficulties with their courses. It is proposed that such students should be required to take only the modular component of the revised ELTS.

Many of those consulted considered that the most appropriate place for the Listening subtest would be in the M component. Candidates could then be asked to listen to a lecture, make notes and then carry out a writing task. On practicality grounds, however, this proposal could not be implemented: until the day when all test centres can be equipped with individual headphones, so that candidates can listen to different texts at the same time, a modular listening component is out of the question.

In the current ELTS, the Speaking test is in the modular component. This has been found to be problematic for the following reasons: (1) as the candidate and the interviewer are frequently from different subject disciplines, the attempt at authentic academic exchange in Phase 2 of the interview, in which candidates have to talk about a topic in a specific subject area, is largely unsuccessful; (2) undergraduates, who do not yet have a subject discipline, find Phase 2 very difficult to deal with; (3) some candidates (e.g. ESL students) sit only the General component. For these reasons it seemed sensible to transfer the oral interaction subtest to the General component.

Theoretical Issues

We conclude by briefly considering some of the issues raised by the procedures followed in moving towards the development of a revised ELTS, as described in this chapter.

Subject specificity of test content

There was considerable variety in the views of the different groups consulted, ranging from those who favour a separate subject module for each course of study to those who would prefer a single academic module for all subject disciplines. The current modules are felt by some to be too general, but greater differentiation of modules might make the matching of students to modules even more problematic than at present. Once a decision has been taken on the specificity or generality of the Modular component, this clearly affects:
a. content validity: it is important to try to determine what all the subject areas in one module have in common and what differentiates them from all other study areas;
b. concurrent validity: should the different subject modules correlate highly with one another, or are they designed to measure different skills according to the requirements of the different study areas? Can they, therefore, ever strictly speaking be considered parallel tests? Similarly, should the new Physical Science & Technology module correlate highly with the two separate modules of Physical Science and Technology in the current test?
c. construct validity: the construct being used is one of divisible competence, not one of general language proficiency; it is based on some notion of separable skills and on the belief that different sorts of candidates require different sorts of competence, and not simply different levels of some general competence.

It is interesting to note that in the revision of ELTS there has been a movement away from subject specificity of content towards greater differentiation according to target group.

Construct validity

It is now generally accepted that the Munby (1978) construct is unsatisfactory. Apart from the fact that it lacks a firm empirical basis, it is criticized for the vagueness of its categories, the considerable overlap between supposedly different categories, and arbitrary divisions and significant omissions in its structure and content. The question, however, remains as to what will replace the Munby model, and it was in this connection that it was felt necessary to include the consultation of applied linguists in the data collection phases of the Project, to gain their views on the nature of language proficiency. We are, perhaps unsurprisingly, far from consensus and, in this regard, ELTS is likely to have to break new ground in devising and operationalizing a construct.

Content validity

In the course of this chapter the missing original specifications for the current ELTS have been alluded to on several occasions. This has caused particular problems in the production of parallel versions.
Faced with the task of designing a revised ELTS, we must from the outset question the role and the nature of the specifications in this process. Firstly, how should they be arrived at? Should detailed needs analysis be undertaken, despite the fact that observation of previous projects of this type has indicated that serious difficulties are encountered, not in the drawing up of the specifications themselves, but in their conversion into test items? Should the specifications be regarded as guidelines, subject to reinterpretation, modification, and revision in the light of the problems of test development, or should they be seen as a blueprint to be followed exactly in matters of, for example, text selection? Should the specifications be reconstructible, and if not, can they be considered valid?

Test construction procedures

In the setting up of this Project and the formulation of aims, objectives and timescales, we have given considerable thought to the role of research. Within a framework where both time and money are in relatively short supply, we have examined carefully what it is that we already know and how we can best find out what is not known. Our decision has been to reduce the time which could have been spent on preliminary analyses or speculations and to increase the time spent verifying intuitions, experience and existing research done by others in the field. After reviewing work already done in the area of testing EAP, and drawing extensively upon existing needs analyses and the accumulated experience of EAP teachers and testers, teams were set up to produce draft specifications and items for their own test components. The Project will then submit these drafts to the scrutiny of a wide variety of informants, including subject specialists, teachers, and candidates. Modifications will then be made in the light of feedback from these sources.

Parallel validation

The current ELTS was quite justifiably criticized in the early 1980s for its lack of validation. The Edinburgh validation study (Criper and Davies, 1986), described earlier in this chapter, has to some extent placated those who have been hostile to the test since its inception. The ELTS Revision Project, however, has recognized the importance of setting up procedures to ensure that test validation can run parallel with test development, and not happen post hoc, as has been the case with the current ELTS.
The Revision Project aims to draw on the existing ELTS worldwide to validate the revised test, thus avoiding the problem encountered by the Edinburgh validation study of having a truncated sample. Furthermore, it intends to capitalize on the ELTS validation study data and procedures, whilst adding the benefit of systematically gathered judgements on test content from subject specialists and introspective data from test takers before the new test is launched in September 1989.

Practicality

As the data gathered from many different quarters underline, the current test has been criticized for being cumbersome to administer. The Revision Project has recognized that the future success of the ELTS depends not only on its validity and reliability but also on its practicality, and has turned its attention to questions of:
marking (the need for objectively scorable subtests to be clerically markable, and for provision to be made for the writing and oral interaction subtests to be markable either at the local test centre or centrally);
training (to minimize the need for lengthy training);
reduction of the number of pieces of paper test administrators have to deal with (this will be achieved by, for example, reducing the number of subject modules from six to three);
simplification (consideration of exactly what this means and how it can be achieved);
cost (the British Council is examining its costing mechanisms to ensure that the test is run in a cost-effective way; in addition, the possibility of holding the test off British Council premises is being explored).

Face validity

Scrutiny of the findings of the data collection undertaken as part of the Revision Project reveals that ELTS has high face validity in most quarters. One of the major decisions drawn from the data has therefore been to develop a revised test which does not differ too radically from the existing ELTS, so as to capitalize on the positive attitudes towards the Testing Service built up over the last seven years. The nine-Band scoring system and the profile reporting are thus to be retained, as is a choice of subject modules, albeit more limited. We must, however, question to what extent face validity should be taken seriously. Let us consider, for example, the issue of the number of desirable subject modules.
Receiving institutions happily proclaim their support for a test instrument which tests candidates' English language proficiency in their field of study, yet how many of them understand the problems encountered in matching candidates to the appropriate modules, or indeed monitor which module their prospective students have been entered for? Face validity is not always as determinable as it may first appear and should not, therefore, become an overriding concern in test design.

Profile reporting

It has been agreed that profile reporting should be retained as a feature of the revised ELTS because of the use which receiving institutions, pre-sessional teachers, and the British Council's own English Tuition Coordination Unit claim to make of the diagnostic information it provides. We should, however, question how much of a diagnostic function the test in its present form can really perform, and not exaggerate the usefulness of ELTS beyond its role as a screening/proficiency test. In any case, the current state of knowledge about the nature of language proficiency precludes the possibility of developing a truly diagnostic test, so claims for the diagnostic value of the revised test will not be emphasized.

This chapter has aimed to show the breadth of the consultation that has been undertaken during the first stage of the ELTS Revision Project. The Project has attempted to consult all those who could conceivably have opinions to offer on the current ELTS, and has endeavoured to involve testing specialists from the United Kingdom and beyond in the process of devising a revised test. And yet we must still ask ourselves whether the data we have collected are exhaustive. Have we tapped all the views of those concerned with ELTS? If we have, how do we weight them? Do we give more importance to some views than to others? As the earlier sections of this chapter show, the amounts of data collected are enormous, and in many areas the conclusions to be drawn are far from clear. Inevitably the pressure to simplify and rationalize may conflict with what the data seem to be saying, but in the final analysis we must recognize that the process of test development involves a degree of compromise.

Looking forward to the subsequent stages of the Revision Project, leading up to the launch of the new ELTS in September 1989, we must make sure that the Project remains flexible and open to feedback as the new test evolves. The decision-making process should be sensitive to the data as they are gathered and considered, even though ultimately it cannot be blindly driven by data, since these are often conflicting and contradictory. Professional and doubtless subjective judgements will still be necessary. What is important is to ensure that the final product is the best possible compromise and that we have good evidence for this.
Note

1. The Criper and Davies report and the proceedings of the 1986 ELTSVAL conference are available in Criper and Davies (1988) and Porter, Hughes and Weir (1988), respectively.

References

Criper, C. and A. Davies (1986) The ELTS validation study: report to the British Council and the University of Cambridge Local Examinations Syndicate (unpublished report).
Criper, C. and A. Davies (eds) (1988) ELTS Validation Project Report. ELTS Research Report Vol. I(i). University of Cambridge Local Examinations Syndicate.
Davies, A. and J.C. Alderson (1977) English Proficiency Test Battery (EPTB). Edinburgh: Edinburgh University.
Ingram, E. (1967) English Language Battery (ELBA). Edinburgh: Edinburgh University.
Munby, J. (1978) Communicative Syllabus Design. Cambridge: Cambridge University Press.
Porter, D., Hughes, A. and C. Weir (eds) (1988) Proceedings of a Conference Held to Consider the ELTS Validation Project. ELTS Research Report Vol. I(ii). University of Cambridge Local Examinations Syndicate.
LIST OF CONTRIBUTORS

J. Charles Alderson is currently directing the English Language Testing Service (ELTS) Revision Project for the British Council and the University of Cambridge Local Examinations Syndicate. Other research work with which he is associated includes the development and validation of computer-based language tests, the validation of exam reform, and the impact of exam reform on classrooms.
Address: Institute for English Language Education, Bowland College, University of Lancaster, Bailrigg, Lancaster LA1 4YT, England.

Caroline Clapham is Research Coordinator for the ELTS Revision Project for the British Council and the University of Cambridge Local Examinations Syndicate. She has recently contributed to an evaluation of a large-scale examination in Sri Lanka, and to the construction of a British Council item analysis program for the BBC computer.
Address: Institute for English Language Education, Bowland College, University of Lancaster, Bailrigg, Lancaster LA1 4YT, England.

Bill Cope has been a research fellow at the Centre for Multicultural Studies at the University of Wollongong since 1984. His main areas of research are multicultural education, language policy, and historical research into the question of Australian national identity. His doctoral dissertation traced changes in Australian identity since 1945, as reflected in history and social studies curricula.
Address: Centre for Multicultural Studies, University of Wollongong, P.O. Box 1144, Wollongong, N.S.W. 2500, Australia.

Alan Davies is editor of Applied Linguistics. His main interests are in language in education, language testing, the language of religion, and theoretical models of applied linguistics.
Address: University of Edinburgh, Department of Applied Linguistics, 14 Buccleuch Place, Edinburgh EH8 9LN, Scotland.
John H.A.L. de Jong is a Senior Research Scientist at the Dutch National Institute for Educational Measurement (CITO). His special interest is the application of Item Response Theory to the measurement of levels of language ability. He is chairman of the Scientific Commission on Language Testing of the International Association of Applied Linguistics (AILA).
Address: CITO, P.O. Box 1034, 6801 MG Arnhem, The Netherlands.

Rod Ellis' interests centre on second language acquisition, in particular the relationship between instruction and L2 learning, the role of learning styles, and the nature of L2 variability. His most recent book is Instructed Second Language Acquisition: The Teaching-Learning Relationship.
Address: Ealing College of Higher Education, St Mary's Road, Ealing, London W5, England.

Harry L. Gradman is Professor of Linguistics, and Director of the Center for English Language Training at Indiana University. His research interests include language analysis, program evaluation, and language testing, with a special focus on reduced redundancy.
Address: Center for English Language Training, Indiana University, Bloomington, IN 47405, USA.

Edith Hanania is Research Associate with the Center for English Language Training, and Part-time Associate Professor of Linguistics, at Indiana University. Among her research interests are language acquisition, language testing, language teaching, and contrastive studies of Arabic and English.
Address: Center for English Language Training, Indiana University, Bloomington, IN 47405, USA.

Grant Henning is currently a Senior Research Scientist at the Educational Testing Service, where he coordinates research for TOEFL Program examinations. He has taught EFL/ESL, and directed several university-level EFL/ESL programs. His research interests include language test development and validation, and the study of cognitive and affective variables related to all aspects of language learning and teaching.
Address: R12, Educational Testing Service, Princeton, NJ 08541, USA.
Rosalind Horowitz's research interests include discourse organization, and the psychological and social-situational factors which influence discourse processing in schools. With the goal of improving literacy, she has examined the relationships between oral and written language forms, and listening and reading processes. The volume Comprehending Oral and Written Language is among her recent publications.
Address: College of Social and Behavioral Sciences, Division of Education, The University of Texas, San Antonio, TX 78285, USA.

Arthur Hughes is Director of the Testing and Evaluation Unit of the Centre for Applied Language Studies at the University of Reading. He was founding editor, with Don Porter, of the journal Language Testing. His current research is concerned with the relationship between control of grammatical structures and perceived communicative ability.
Address: Testing and Evaluation Unit, Centre for Applied Language Studies, University of Reading, Whiteknights, P.O. Box 218, Reading RG6 2AA, England.

Mary Kalantzis is a research fellow at the Centre for Multicultural Studies at the University of Wollongong. Among her publications are Mistaken Identity: Multiculturalism and the Demise of Nationalism in Australia and, as a co-author, The Language Question. Her interest in multicultural education is reflected by numerous publications. Since 1979 she has been coordinator, with Bill Cope, of "The Social Literacy" Curriculum Project.
Address: Centre for Multicultural Studies, University of Wollongong, P.O. Box 1144, Wollongong, N.S.W. 2500, Australia.

Graeme Kennedy is Professor of Applied Linguistics, and Director of the English Language Institute at Victoria University of Wellington. His current research interests include computer-assisted analysis of English, second language learning theory, and English for academic purposes.
Address: English Language Institute, Victoria University of Wellington, Private Bag, Wellington, New Zealand.

Denis Levasseur has just completed his doctoral dissertation at the University of Montreal. He has been a member of Michel Pagé's research team, which worked on the development of text comprehension abilities.
Address: Université de Montréal, Département de Psychologie, C.P. 6128, Succ. A, Montréal QUE H3C 3J7, Canada.
Geoff Masters' paper was written while he was a senior lecturer in the Centre for the Study of Higher Education at the University of Melbourne. Since August 1988 he has been Assistant Director of the Australian Council for Educational Research and head of ACER's Measurement Division. A special interest is the application of item response models to ordered categories of response or performance in educational settings.
Address: Australian Council for Educational Research, 9 Frederick Street, Hawthorn, Victoria 3112, Australia.

Norma Norrish is Senior Lecturer in charge of the Language Laboratories at the Victoria University of Wellington. Her research interests include the teaching of French across age and ability ranges, individualizing learning by using technological resources, and testing in language laboratories. She recently co-authored a manual that exploits the use of video-tapes in language learning.
Address: Language Laboratories, Victoria University of Wellington, P.O. Box 600, Wellington, New Zealand.

Michel Pagé is Professor of Psychology at the University of Montreal. His research and teaching activity is devoted to the field of Educational Psychology. His special interest is in cognitive learning and communicative abilities.
Address: Université de Montréal, Département de Psychologie, C.P. 6128, Succ. A, Montréal QUE H3C 3J7, Canada.

Gillian Perrett has taught ESL to refugees, school and university students in Australia and the USA, and has trained teachers from Papua New Guinea and Burma. She is currently writing a doctoral dissertation which models the discourse capabilities of adult learners of English. She is co-author of the series Studying in Australia.
Address: Department of Linguistics, University of Sydney, Sydney, N.S.W. 2006, Australia.

Diana Slade has been a lecturer in the MA programme in Applied Linguistics at the University of Sydney since 1985. Her main research interests are conversational analysis, language testing, curriculum development, and multicultural education. Her publications include The Language Question (with Kalantzis and Cope) and Teaching Casual Conversation, Volumes I and II (with Norris).
Address: Linguistics Department, University of Sydney, Sydney, N.S.W. 2006, Australia.
Antonella Sorace is interested in the psycholinguistic aspects of the development and mental organization of knowledge in adult second language acquisition. Her research has focused on the interaction between different cognitive sources in learners' construction of interlanguage grammars.
Address: University of Edinburgh, Department of Applied Linguistics, 14 Buccleuch Place, Edinburgh EH8 9LN, Scotland.

Bernard Spolsky is Professor of English at Bar-Ilan University. He works in the field of educational linguistics, sociolinguistics, and language testing. His recent books include Language and Education in Multilingual Settings and Conditions for Second Language Learning: Introduction to a General Theory. He is at present writing a book, with Robert Cooper, on The Languages of Jerusalem and another on Educational Linguistics.
Address: Department of English, Bar-Ilan University, 52 100 Ramat Gan, Israel.

Douglas K. Stevenson teaches linguistics and language pedagogy at the University of Essen. He is Co-chairman of the Interuniversitäre Sprachtestgruppe (IUS) and has edited several volumes on language testing. His main area of interest in testing is validity and validation theory.
Address: University of Essen, Fachbereich 3, Universitätsstr., 4300 Essen 1, Federal Republic of Germany.

Tan Soon Hock is Associate Professor of English and Deputy Director of the Language Centre at the University of Malaya. Her doctoral dissertation, on reading comprehension among EFL learners, reflects one of her main research interests. She teaches courses on language testing to teacher-trainees, and her publications include articles on language testing, programme evaluation, and English for Special Purposes.
Address: Language Centre, University of Malaya, Kuala Lumpur 59100, Malaysia.

Paul G. Tuffin is Director of the South Australian Institute of Languages at the University of Adelaide, and Senior Lecturer in Modern Greek at the South Australian College of Advanced Education. His professional interests include language teaching methods, teacher training, and curriculum development and assessment, as well as language policy and planning. A special area of concern has been the development of the "National Assessment Framework for Languages at the Senior Secondary Level".
Address: South Australian College of Advanced Education, 46 Kintore Avenue, Adelaide S.A. 5000, Australia.
Gill Westaway is Testing and Evaluation Adviser with the British Council, London. Her main area of interest is the testing of English for Academic Purposes. She is currently working on a joint British, Australian, and Canadian project to revise the English Language Testing Service (ELTS).
Address: British Council, English Language Management Department Overseas, 10 Spring Gardens, London SW1A 2BN, England.
INDEX A Ability, language 10, 12, 16, 18-19, 73, 109-10, 116, 212, 216 communicative 19, 60, 161, 206-7 multiple a. traits 77 Absolute Language Proficiency Ratings 225 Acceptability, hierarchies 127, 129-30, 133-4, 137-49, 143-9, 146, 148, 150-1 Accountability, educational xiii-xiv Accreditation, of professionals 51 Accuracy, in L2 42, 84, 86 in testing xiii, 5, 12, 16-18, 19, 94 Acquisition, L1 136, 184 L2 25 hierarchical models 77 and intuition 134, 135-8, 139 and learning style 83-95 order of 166 ACTFL Proficiency Guidelines 72 Adams, M.J. & Collins, A.M. 224 Adams, R.J. 66 Adams, R.J. et al. 61, 69 Adjémian, C. 135 Adults, L2 acquisition 85-95, 127, 136 oral/written language 110-12 Age, and oral/written language 111 and recall 104, 105-6 Aghbar, A.A. 73 Akinnaso, F.N. 111 Alderson, J. Charles 20-7, 28, 31, 34-7, 73, 181-2, 215-16, 239-56 Alderson, J.C. & Urquhart, A.H. 38, 194, 215 AMES (Adult Migrant Education Service of New South Wales)225, 227 Andrich, D. 77 Angoff item difficulty 41, 42-5, 48 Angoff, W.H. & Ford, S.F. 38, 41, 42 ANOVA 4, 145 Appeals 113, 121
Aptitude, learner 85, 91-4, 92, 184 Arthur, B. 137 Arthur, B. et al. 207 ASLPR, see Australian Second Language Proficiency Ratings Assessment, formal, and individualization 160-3, 180, 193 institutional aspects 20-7 integrative 59-60 interactive xiii L1/L2 204-11 and minorities 196-8, 202-3 national aspects 38-50 psychometric aspects 3-4, 11, 56-70 social aspects 3-15 see also individualization; National Assessment Framework; testing; tests Assimilation 197, 198, 201, 202 Attitudes, learner 91 to minority language 197, 200, 203, 206 Attrition 136 Australia, multiculturalism 197 National Policy on Languages 198, 199-200, 202 see National Assessment Framework for Languages Other Than English Australian Languages Levels (ALL) Project 29-30, 36, 72
Australian Second Language Proficiency Ratings (ASLPR) 72, 209, 225, 230, 232, 233 Authenticity, test xiii, 9, 10-11, 208, 226 Auto-correction, see correction B Bach, K. & Harnish, R.M. 6-7 Bachman, L.F. 12, 13n., 72, 73 Bachman, L.F. & Clark, J.L.D. 5 Bachman, L.F. & Savignon, S. 72 Backwash effect 185, 245, 247 Behavioral objectives movement 57-8 Berk, R.A. 38 Berkoff, N.A. 236 Berry, M. 232 Bever, T.G. 130, 131 Bialystok, E. & Frölich, M. 150n. Bialystok, E. & Sharwood Smith, M. 84, 89 Bias, cultural 197, 204 specialization 38-49, 41, 43, 44, 51-5 test xiii-xiv, 3-4, 9, 10, 42, 185 item 4-5 Bilingualism 184-5 transitional 199, 202, 212 Bloom, B.S. et al. 57 Bloomfield, L. 109 Borel, M.J. et al. 99, 102 Botha, R.P. 130, 132 Bradac, J.J. et al. 131, 140 Brindley, G. 235, 236 British Council, ELTS test 17, 217, 239, 241, 245-6, 248, 254-5 Brodkey, D. 54 Brown, G. 111-12 Brown, J.D. 38 Brumfit, C.J. 86 Burt, M. & Dulay, H. 225 C C-test 9 CALL, see computers, C. Assisted Language Learning Canale, M. & Swain, M. 73 Canale, M., Frenette, N. & Bélanger, M. 73 Carmines, E.G. & Zeller, R.A. 3, 4
Carrell, P.L. 224 Carroll, J.B. & Sapon, S.M. 85, 91 Carroll, John 7-8, 73, 209 CAT, see computers, computerized adaptive testing Cause-effect structure 117-18 CBELT, see computers, c.-based English language tests Chafe, W. & Tannen, D. 111 Channel control 84, 85, 87-8, 89, 89-91, 93-4 Chaudron, C. 140 Chen, Z. & Henning G. 38 Chi-square analysis 4, 41, 46, 47, 47, 48 Child, oral/written language 65, 109-10, 112-16, 206-13 Chomsky, Noam 8-9, 13n., 128-9, 133 Clahsen, H. 84 Clapham, Caroline M. 73, 239-56 Cloze test 19, 186, 217-19 and discrete focus/global tests 167-71, 170, 173, 175-6 and learning styles 88, 88, 89, 90, 92 problems of 9, 181-2 Clozentropy 23 Cognition, and comprehension 98-9 development 73, 110, 199-200, 212 and intuition 129, 130 and language acquisition 136-7, 184 see also style, cognitive Coherence relations 98, 99-101, 102-7 Cohesion 98, 99, 102, 110-11, 115 Combination, and language testing 182-3, 185-6, 189 Communication, dyadic 12; see also ability, communicative Communication approach 29-30, 187, 192 Community language 198-9, 200-4, 211, 212 Comparability, text 194, 218-19 Compare-contrast structure 117-18 Compensatory measurement model 77 Competence, communicative 94, 244, 248 definition 5-6, 74
< previous page
page_264
next page >
< previous page
page_265
next page > Page 265
discourse 94 grammatical 84, 129-30 interlanguage 140, 150n. L1/L2 197, 200, 215, 216, 252 linguistic 5-6, 84, 127, 130, 135, 143, 197, 205, 215, 223 pragmatic 6 sociolinguistic 6, 84, 95 strategic 94 testing 18, 29 unitary 8 Components analysis, exploratory principal 8 Comprehension, listening 40, 53, 157, 160, 162-3, 163, 192, 215, 217 reading 41, 53, 56, 97-107, 157, 158, 160, 162, 215-16, 218-24 Computers, C. Assisted Language Learning (CALL) 20, 24 c.-based English language tests (CBELT) 21-7, 31-2, 34-7 in language learning 156-8 in language testing xii-xiii, 20-7, 28-37 see also scoring Conlan, G. 73 Consistency 3, 33, 40, 77, 132, 139, 140-3, 145, 148 Construct 4, 77 Content, test 4, 16, 22, 34, 39, 73, 251-2 and bias 44-6, 48-9, 52-3 individualized 158-9, 161 Context, of discourse 109-10, 111-12, 116, 119, 131 of interview 226, 227, 234 and proficiency xiii, 181, 184, 188-9, 192-3 Continuity 245 Contract, social 11 Contrast relation 100, 104-5 Contrastive analysis 102 Conversation, structure 110-11, 232-3,233, 236 Cope, Bill 196-213 Coppieters, R. 131, 137, 138 Correction, auto-correction 159-60, 161 Criper, C. & Davies, A. 191, 194n., 241, 242, 244, 249, 253 Criteria, performance 33, 208-9
test 3, 31, 181, 191, 192 ethical 10-13, 16, 19 Criterion-referencing 12, 57-9, 62-5, 66, 68, 72, 180 Cronbach, L.J. 3, 38, 180 Cronbach, L.J. & Meehl, P.E. 4 Cue, nonverbal 111 Culture, and language 112, 227, 234 Cummins, J. 73, 207 Cut-off, see scoring D Data, extralanguage, and language testing 182, 183-4, 186-91 Davidson, F. & Henning, G. 61, 69 Davies, A. 54 Davies, A. & Alderson, J.C. 217, 241 Davies, Alan 179-95 Davis, F.B. 56 de Beaugrande, R. 98, 99 de Jong, John H.A.L. 71-80 de Jong, John H.A.L. & Glas, C.A.W. 73 Dechert, H. 84 Dechert, H. et al. 94 Decision theory, Bayesian xii Degrees of Reading Power 58 deixis 110-11 Democratization, of education xi, xiii of testing 27 Derrick, C. 56 Devices, paralinguistic 111 Diagnosis, in computer testing 24-5, 27, 33 of language difficulties 204 and scoring systems 59 and testing 160, 164, 208, 256 Dickerson, L.J. 166 Dictation test 181-2, 186 Dimensionality 59 Discourse, genre 231-2, 233 oral/written 108-16, 122-6 processing 108-10, 116-18 structure 109-10, 112-16, 119, 226
< previous page
page_266
next page > Page 266
Discrete-point test 19, 57, 59, 74-5, 87, 166-7, 173-4, 185, 188, 192, 206 Distance, social 235 Diversity, language 196-8, 201-2, 204, 212 Douglas, D. 72 Drum, P.A. et al. 77 Dulay, H., Burt, M. & Krashen, S.D. 166 Dyson, A.H. 111 E Edelsky, C. et al. 11 Edinburgh University, ELTS Validation Study 241-4, 253-4 language testing 191-3 Education, bilingual 199 and multilingualism 198-9 Educational Testing Service 72-3 EFL, see English, as L2 Elaboration relation 100, 101, 103-5, 106 Ellis, Rod 75, 83-96, 207 ELTSVAL Conference 244-5 Embretson, S.E. 77 English, for Academic Purposes 246-8, 253 as international language 55 as L2 21, 28, 40, 168, 199, 202-3, 204-9, 214-15, 250-1 for special purposes 39, 48, 52, 180, 189-91, 192-3, 240-4, 249 verb structure 167-72, 171, 172, 174, 176 English Language Battery (ELBA)tests 180, 180-1, 181, 241, 243, 248 English Language Skills (TELS) Profile 59-60, 60, 61, 72, 75 English Language Testing Service (ELTS)17, 39, 191-3, 217, 239-44, 240 revision 244-55 validation study 194, 241-4, 249 English Proficiency Test Battery (EPTB)217, 219, 241, 243 English as a Second Language Placement Exam (ESLPE)40-5, 48-9, 52 Equity, and language diversity 197-8, 212 and testing 4, see also fairness Erickson, M. & Molloy, J. 38 Error analysis 25, 53 Error detection 41, 44-5
ESL, see English, as L2 ESLPE, see English as a Second Language Placement Exam Ethnicity, and language 198 Exchange, conversational 110, 232-3 Exemplification relation 100, 104-5, 106 Exercise, and computer use 20, 23-4, 27, 35-6 Expansion relations 100, 105-6 Expectations, teacher 155, 181, 214 Explanation relation 101, 105 Explicitness 179, 186, 194 F Factor analysis 8, 56, 65, 73, 76, 89-90, 90 Fairness, of assessment xiii-xiv, 3-4, 10-13, 16-19, 48, 51, 243, 246 of learning situation 155 Feedback, from teacher 158, 161 and use of computers 23-4, 24-5, 27, 36 Felix, S.W. 136 Field 234-5 Field dependency/independency 83, 85, 86-8, 88, 93-4 Fischer, G. 77 Fluency, and L2 acquisition 5, 84, 86, 89-91, 92, 93-4, 235 Focus, discrete 166-76 Framework essays 186, 187 Fredericksen, C.H. 98, 99, 102-3 French, accelerated learning 154-64 FSI Oral Interview 12 Function, differential item functioning 38, 39, 41-6, 44, 47-9 differential person functioning 41, 41, 46-7, 47, 49 and structure 6-7, 9-10, 18-19 G Gardner, R.C. 91 Garvey, C. 119n.
< previous page
page_266
next page >
< previous page
page_267
next page > Page 267
Gass, S.M. 137 Generalization relation 100 German, adult acquisition 85-95 as minority language 200, 205, 209-11 Glaser, R. 57, 68, 72 Goals, language-learning 30-1, 72 Goffman, E. 109 Gordon, P.C. & Meyer, D.E. 74 Gould, S. 8 Government and Binding 134 Graded Objectives 72 Gradman, Harry L. & Hanania, Edith 166-76 Graduate Record Examination 53 Grammar, acquisition 157, 162 expectancy 8 fuzzy 133 interlanguage 127-8, 135-40, 148 performance 6 testing 41, 45, 87-8, 88, 89, 90, 92, 187-8, 192, 217 transformational 133, 134 universal 134, 136-7, 182-3 Grammaticality 127, 129-35, 138-46, 146, 147, 148, 149 Greenbaum, S. 131, 134, 139-40 Greenbaum, S. & Quirk, R. 130, 131, 132 Griffin, P.E. 69 Group Embedded Figures Test 83, 86-8 H Hale, G.H. 38 Halliday, M.A.K. 109, 230 Halliday, M.A.K. & Hasan, R. 227, 234 Harris, C.W. 56 Harris, J. et al. 61, 65, 69 Hasan, R. 227 Hatch, E. 84 Hawkins, J.A. 131 Hedges 111, 113 Henning, G. & Gee, Y. 39 Henning, G., Hudson, T. & Turner, J. 40, 73 Henning, Grant 38-50, 51-5, 58
Hinds, J. 109 Hobbs, Jerry 98, 99-101, 105 Holland, P.W. & Thayer, D.T. 41, 46 Holtzman, P.D. 8 Horowitz, R. & Davis, R. 114 Horowitz, Rosalind 108-26 Hudson, R. 108, 111 Hughes, A. & Porter, D. 13n. Hughes, Arthur 16-19 Hutchinson, C. & Pollitt, A. 58, 59, 72 I Indeterminacy 127-51 elicitation 139-43, 143-9 interlanguage 137-9 and native intuitions 128-9, 133-4, 143, 143-9 and nonnative intuitions 128, 134-7, 143-9 reliability problem 132-4 validity problem 129-32 Individualization, and auto-correction 159-60 and CBELT 22 and course content 158-9 and formal assessment xi-xv, 12-13, 160-3, 180, 193 and language minorities 202-3 Inference, text-based 99-101, 102-4, 105-6 Information, exchange 231, 233, 234-5, 236 textual 98, 102-3, 106 Information gap activity 86, 88 Information theory 6, 8 Ingram, Elisabeth 72, 180, 209, 236, 241 Instructional Objectives Exchange 58, 59 Integration 198 Interaction, student-student 206-7, 209 teacher-pupil 54, 112-13, 117, 206 Interlanguage, competence 127, 135, 150n. grammar 6, 128, 135-9, 142, 148 indeterminacy 137-9 and variability 166-7 International Development Programme of Australian Universities (IDP Australia) 245
< previous page
page_268
next page > Page 268
Interview 206, 208, 225-6 structured 186, 187 unstructured 225-36 analysis 227-33 purpose 234-5 Intuition, elicitation 127-8, 129, 139-43 interlanguage 127-8, 137-9 and linguistic theory 128-9 native 128-34, 138-9, 143-9 nonnative 135-7, 139-49 Italian, and acceptability hierarchies 143-9, 150-1 Item, differential functioning 38, 39, 41-6, 43, 46, 47-9 Item response theory (IRT) xii, 4, 21, 58, 59, 61, 65, 68-9, 71-3, 75-7, 194 Items, independence 76 J Jackendoff, R. 6 Jannarone, R.J. 77 Johnston, M. 84, 227 Jones, R.L. 226 Jordens, P. 6 Judgement, and intuition 129-35, 138-47, 150 of supervisors 181, 181, 192 Kalantzis, M. & Cope, B. 198, 199 Kalantzis, M., Cope, B. & Slade, D. 197, 205, 208, 209 Kalantzis, Mary 196-213 Keating Report 156-7 Keele, S.W. 74-5 Kellerman, E. 138, 142 Kennedy, Graeme D. 51-5 Klein, W. 136 Klein-Braley, C. 9, 73 Knowledge, and learning styles 84, 85, 86-8, 89, 89-94 linguistic 6, 10, 16, 18-19, 57 metalinguistic 141-3, 147-8 prior 214-24, 222, 223 testing 217-19, 218, 220, 221 underlying 5 Krashen, S.D. 166
L Laboratory, language xiii, 20, 156-8, 162 Labov, W. 131 Lado, R. 57, 73 4 Lakoff, G. 133 Lamendella, J.T. 136 Lancaster University, Institute for English Language Education 22 Language, culturalist view 132 definitions 74 and linguistic theory 182-4 mentalist view 132 minority 196-213 oral, see production; proficiency, language; structure school 112, 117 for specific purpose xiv, 5, 9, 73, 189-93, 240, 246 see also English, for specific purposes Language Aptitude Battery 92 Learner, communicative-oriented 83-5, 89, 91, 93-4 norm-oriented 83-5, 89, 91, 93-4 Learner factors 91-3, 92 Learning, accelerated 154-64 Levasseur, Denis & Pagé, Michel 97-107 Level, language learning 155-6, 158-62 proficiency 168-72, 170, 172, 173, 174, 175-6, 180-1 Levelt, W.J.M. 130, 131 Levelt, W.J.M. et al. 134 Liceras, J. 134, 135, 150n. Light, R.L. et al. 54 Linguistics, applied 25, 26, 74, 179 systemic functional 226-7, 234 Linkage relations 100-1, 105, 106 Linn, R.L. 38 List structure 114-15, 117-18 Listening, L1/L2 205-6, 207-8, 209-10 testing 44-5, 251 see also comprehension Lo Bianco Report 198 M
McClelland, J.L. et al. 6
McDonough, S. 83
Macedonian, as minority language 200, 205, 209-11
Maintenance, language 197-8, 200, 201-3, 206, 211-12
Malaya, University of 215-24
Malinowski, B.K. 234
Mantel, N. & Haenszel, W. 41, 46
Mantel-Haenszel chi-square analysis 41, 46, 47, 47, 48
Martin, J.R. 232
Martin, J.R. & Rothery, J. 208
Masters, G.N. & Evans, J. 71-2
Masters, Geoffrey N. 56-70, 71-3, 75-6
Meaning, negotiation 110, 119
Measurement, problems of 5-10
Mercurio, Antonio 31
Meyer, B.J.F. 117
Meyer, B.J.F. & Freedle, R.O. 117-18
Miller, G.A. 142
Miller, G.A. & Isard, S. 8
Mischler, E.G. 116
Mode 234-5
Modern Language Aptitude Test 85, 91-2
Mohan, B.A. 133, 142
Monitor model 166
Morrison, D.M. & Lee, N. 236
Morrow, K. 57, 60, 68, 73
Mossenson, L.T., Hill, P.W. & Masters, G.N. 58, 72
Motivation,
  instrumental 91, 92
  integrative 91, 92, 93
  learner 85, 91, 93-4, 163
Move, conversational 232-3, 233, 235, 236
Moy, R.H. 38
Multiculturalism 197-8, 200-2
Multilingualism 197-8, 202, 212
Multiple regression analysis 191, 219-21
Multivariate analysis 103-4
Munby model 244, 248, 252
N
NACCME 199
NAFSA Conference 39
National Assessment Framework for Languages Other Than English 28-37, 72
Native speaker, and proficiency xi, 179-80, 183
  see also indeterminacy; intuition
Needs analysis 242, 253
Needs, learner xii, xiv, 29, 154, 193-4, 203-4, 239, 246-7
Neurolinguistics 74
Neurophysiology, and language proficiency 74
Newmark, L.D. 74
Newmeyer, F.J. 130, 133
Norrish, Norma 154-65
Nunnally, J.C. 4, 134
O
Oller, J.W. 8, 10, 73, 225
Osterlind, S.J. 4
P
Parallel relation 100, 104-5, 106
Parameter setting 136-7, 139
Partial credit model 61-6, 63, 76
Pateman, T. 134
Pearson correlations 88, 89, 93, 175
Peer teaching 23, 159, 161
Peretz, A.S. 38
Performance, language,
  definition 5-6, 74
  measuring 5-10, 19, 31, 33-4, 59-60, 84, 186-8
  as task-related 176
  variability in 166-76, 180-1
Permeability 135-6
Perrett, Gillian 225-38
Person, differential person functioning 46-7, 47, 49
Persuasion, in writing 112-15, 121-2
Petersen, S.E. et al. 74
Peterson, N.S. & Novick, M.R. 38
Pimsleur, P. 92
Planning, language 84, 203
Pluralism, cultural 197-8, 201-2, 212
Policy, language, Australia 198, 199-200, 202
Pollitt, A. & Hutchinson, C. 61, 65, 69
Porter, D., Hughes, A. & Weir, C. 39
Posner, M.I. et al. 74
Poynton, C. 235
Practicality 243-4, 245, 250-1, 254
Pragmatics, see competence, pragmatic
Prediction 54, 191-4, 241-2, 246
  of reading comprehension 215-24
Probability theory 6, 9
Processing, Parallel Distributed 6
Production,
  oral/written language 53, 112-16
  rules of 6
  testing 175
Proficiency, language,
  components 56, 73-4, 246
  continuum 57-8, 62, 65-6, 68-9, 71-2
  definition 6-7, 56-7, 72, 248, 255
  L1 204, 209-12
  L2 202-9, 211-12, 218-24, 220, 221, 222, 223
    and interview 230, 232, 233-6
    testing xii, 239-55, 246-8
  measuring 5-20, 53-4, 60-5, 72-7, 191, 217
    validation 191-3
    variables 179-84
  oral 62-8, 63, 64, 67, 71, 75-6
  overall 5, 7-10, 18-19, 47, 66, 73, 76, 90, 194, 225
  see also level, proficiency
Profiles 17-19, 59, 209, 240, 247, 254-5
  advantages of 12-13, 16
  and computer use 24-5, 27
  and partial credit model 66-8, 75-6
Propositional analysis 101, 102-3
Prose analysis 117
Prosodic features 111
Psychology, cognitive 74, 101
Psychometric theory xii-xiii, 3-4, 56-69, 71-7, 134
Q
Question, definition 10-11
Quirk, R. & Svartvik, J. 130
R
Raatz, U. 12
Raatz, U. & Klein-Braley, C. 9
Ranking scales 128, 142-9, 151
Rasch difficulty estimates 41, 42, 43, 45-6, 46
Rasch, G. 41, 58, 68
Rating scales 60-1, 72, 128, 141-2, 149, 207, 209, 225, 227
Rea, P.M. 226
Readability 218
Reading,
  aloud 208
  L1/L2 205-6, 207-8, 209-10, 218-19
  in L2 44-5, 214-24
  schema notion 224
  skills 56, 58, 77
  training 97
  see also comprehension, reading
Recall 99, 102-6, 117-18
Register 189, 234, 236
Regression residual analyses 41, 45-6, 48
Reliability xii, 10, 13, 17, 21, 57, 68, 142, 144, 194
  definition 3
  ELTS 241, 243, 245, 248
  and intuition 129, 132-4, 139
  scoring 18, 23, 33, 59
Remediation 24-5
Repetitions 111
Requests 11
  by child 112, 113-16, 121
Ringen, D. 132
Ross, J.R. 131, 133, 180
Royal Society of Arts Examinations (RSA) 39, 209
Rule, and L2 acquisition 84, 90, 94
Rumelhart, D.E. 224
Rumelhart, D.E. et al. 6, 74
S
Sacks, H., Schegloff, E.A. & Jefferson, G. 109
Sampson, G. 6
Sang, F. et al. 73
Sapir, E. 186-7
Sapon, S.M. 74
Savignon, S. 72
Scardamalia, M. et al. 113
Schachter, J. et al. 150n.
Schachter, J., Tyson, A. & Diffley, F. 137
Schauber, E. & Spolsky, E. 13n.
Schegloff, E.A. & Sacks, H. 109
Scheuneman, J.D. 38
Schmidt, R. 84-5, 94
Scoring,
  by computer 22-3, 25, 27, 32-4, 36
  cut-off 180-1, 191, 192-3, 242
  of ELTS 254
  global 59
  global/discrete focus testing 169-72, 172, 175
  impression marking 59
Searle, J.R. 10-11
Self-evaluation 26, 27
Seliger, H.W. 84
Senior Secondary Assessment Board of South Australia (SSABSA) 28-9, 31
Sentence,
  and linguistic theory 182
  in text comprehension 98-107
Sequencing, in speech production 74
Shohamy, E. & Reves, T. 226
Simmonds, P. 225
Skehan, P. 92
Slade, Diana 196-213
Snow, C. 131
Snow, C. & Meijer, G. 131, 142
Sociolinguistics 74; see also competence, sociolinguistic
Sorace, Antonella 127-53
Spearitt, D. 56
Spearman, C. 8, 145-6
Specialization, see bias, specialization
Specification relation 101
Specificity xii
Speech,
  assessment 59-69
  L1/L2 205-7
  rate 87, 88, 89-90, 90, 92, 93
  and writing 108-9, 110-12
Speech act theory 6-7
Spolsky, Bernard 3-15, 16-19, 57, 68, 225, 226
Stansfield, C. 72, 73
Statistics, and language competence 3-4, 7-8, 10, 59, 61, 103-4
Stevens, S.S. 143
Stevenson, D.K. 7, 13n., 226
Strategies, learning 26, 156-8
Strevens, 154
Structuralist model 3, 57, 180
Structure,
  and function 6-7, 9-10, 18-19
  generic 227-8
  and oral/written language 112, 113, 116-19
  and proficiency 74
  rhetorical 117-18
Style, cognitive 83, 88, 93
Styles, learning 75, 83-95, 141
  superordinate/subordinate 166-7
Subskills 58, 76-7
Support,
  in computer testing 27, 36-7, 191
  in oral testing 60, 60-2, 62, 63, 66, 67, 76
Syllabus design 29, 244
System, language, testing 186, 189-90
Systemic functional register theory 234
T
Tan Soon Hock 214-24
Tannen, D. 111
Tarone, E.E. 166-7
Teaching assistants, see universities, selection/screening procedures
Technology, support from 156-8
Tenor 234, 235
Test theory, classical 4, 76, 77
Test of Written English (TWE) 72-3
Testing,
  increased interest in xiv-xv
  and language variation 185
  linguistic problems 184-91
  see also computers; criteria
Tests,
  communicative 186-8, 192, 194, 206, 226
  computerized adaptive (CAT) 21
  fill-in-the-blank 168-71, 173-5
  gap-filling 22
  global 167-70, 173-6, 225
  integrative 7-8, 59-60, 61, 73, 206
  learner response to 26-7, 36
  multiple-choice 21, 22, 40, 168-71, 173-6, 194, 208, 209, 217-18, 243, 248
  multitrait-multimethod studies 7, 10
  norm-referenced 12
  observational 17, 18
  open-ended 243
  oral 251
  for specific purposes 244, 247
  type 227
  see also assessment; computers; criterion-referencing; discrete-point test; validity
Text,
  subtext 228, 228-30, 229, 231, 232
  type 230-2, 232, 236
Text comprehension, see comprehension, text
Thorndike, R.L. 56
Threshold Level 72
Thurstone, L.L. 8, 56
Time, as factor in testing 12, 25, 37
TOEFL (Test of English as a Foreign Language) 7, 12, 16, 169
  and specialization bias 38, 49, 53-4
Topic,
  change 229
  control 227, 229-30, 230, 235, 236
  nominations 230-1, 231, 236
TORCH Tests of Reading Comprehension 59, 62
Trade 114, 121-2
Traits 76-7
  latent t. model 77
  multiple ability 77
Tuffin, Paul 28-37
U
UCLES (University of Cambridge Local Examinations Syndicate) 239, 241, 245
Uncertainty,
  and content validity 191-3
  and language testing 184-91
  operationalizing 179-94
  variables 179-84
Unidimensionality 4, 5, 10, 56, 77
Universality, and language testing 182-3, 184-5
Universities, selection/screening procedures xii, 39, 40-1, 47-8, 51, 53-5
Upshur, J.A. 226
Use, language 29, 205, 226-7, 230-1, 236
V
Validation, parallel 37, 253-4, see also Edinburgh University, ELTS Validation Study
Validity,
  concurrent/predictive 4, 40, 241-2, 243, 252
  construct 4, 77, 242, 244, 252
  content 4, 181, 189, 191-3, 194, 218, 242-3, 244-5, 252-3
  criterion-related 3-4
  definition 3
  face xiv, 9, 142, 144, 189, 192, 225, 226, 243, 245, 247, 254-5
    and specialization bias 48, 52, 53
  and intuition 129-32, 139, 142
  predictive, see validity, concurrent
  scoring 59
  test xii, 10-11, 13, 17-18, 21, 51, 57, 75, 181-2, 241-4
Van Dijk, T.A. 99, 109
Variability, see performance, language; proficiency, language, measuring
Ventola, E. 232
Vernon, P.E. 73
Videodiscs, in language testing xiii
Vocabulary,
  acquisition 157, 162
  and learning style 85, 87, 88, 89, 90, 92
  testing 41
Vollmer, H.J. & Sang, F. 73
W
Washback effect 185, 245, 247
Westaway, Gill 73, 239-56
White, L. 130, 150n.
Wilkins, H. 154
Witkin, H. et al. 83, 86
Woodcock Mastery Reading Tests 58
Word order, acquisition 86-7, 88, 89, 89-90, 90, 92, 93-4
Wright, B.D. & Stone, M.H. 41
Writing,
  assessment 17-18, 59-69, 73
  L1/L2 205-7
  see also speech