A Taste for Corpora
Studies in Corpus Linguistics (SCL)
SCL focuses on the use of corpora throughout language study, the development of a quantitative approach to linguistics, the design and use of new tools for processing language texts, and the theoretical implications of a data-rich discipline.
General Editor
Elena Tognini-Bonelli
The Tuscan Word Centre / The University of Siena
Consulting Editor
Wolfgang Teubert
University of Birmingham
Advisory Board
Michael Barlow, University of Auckland
Douglas Biber, Northern Arizona University
Marina Bondi, University of Modena and Reggio Emilia
Christopher S. Butler, University of Wales, Swansea
Sylviane Granger, University of Louvain
M.A.K. Halliday, University of Sydney
Yang Huizhong, Jiao Tong University, Shanghai
Susan Hunston, University of Birmingham
Graeme Kennedy, Victoria University of Wellington
Geoffrey N. Leech, University of Lancaster
Michaela Mahlberg, University of Nottingham
Anna Mauranen, University of Helsinki
Ute Römer, University of Michigan
Jan Svartvik, University of Lund
John M. Swales, University of Michigan
Martin Warren, The Hong Kong Polytechnic University
Volume 45 A Taste for Corpora. In honour of Sylviane Granger Edited by Fanny Meunier, Sylvie De Cock, Gaëtanelle Gilquin and Magali Paquot
A Taste for Corpora In honour of Sylviane Granger Edited by
Fanny Meunier Sylvie De Cock Gaëtanelle Gilquin Magali Paquot Université catholique de Louvain
John Benjamins Publishing Company Amsterdam / Philadelphia
The paper used in this publication meets the minimum requirements of American National Standard for Information Sciences – Permanence of Paper for Printed Library Materials, ANSI Z39.48-1984.
Cover design: Françoise Berserik Cover illustration from original painting Random Order by Lorenzo Pezzatini, Florence, 1996.
Library of Congress Cataloging-in-Publication Data
A Taste for Corpora : In honour of Sylviane Granger / Edited by Fanny Meunier, Sylvie De Cock, Gaëtanelle Gilquin and Magali Paquot. p. cm. (Studies in Corpus Linguistics, ISSN 1388-0373 ; v. 45) Includes bibliographical references and index. 1. Corpora (Linguistics) 2. Language and languages--Computer-assisted instruction. 3. Second language acquisition--Computer-assisted instruction. I. Meunier, Fanny. II. Granger, Sylviane, 1951- P128.C68.T37 2011 410.1’88--dc22 ISBN 978 90 272 0350 2 (Hb ; alk. paper) ISBN 978 90 272 8708 3 (Eb)
2011008291
© 2011 – John Benjamins B.V. No part of this book may be reproduced in any form, by print, photoprint, microfilm, or any other means, without written permission from the publisher. John Benjamins Publishing Co. · P.O. Box 36224 · 1020 ME Amsterdam · The Netherlands John Benjamins North America · P.O. Box 27519 · Philadelphia PA 19118-0519 · USA
To Sylviane Granger, once our professor, always our mentor, now our colleague and dear friend
Table of contents
Acknowledgements
List of contributors
Preface
Bengt Altenberg
Putting corpora to good uses: A guided tour
Sylvie De Cock, Gaëtanelle Gilquin, Fanny Meunier and Magali Paquot
Frequency, corpora and language learning
Geoffrey Leech
Learner corpora and contrastive interlanguage analysis
Hilde Hasselgård and Stig Johansson†
The use of small corpora for tracing the development of academic literacies
JoAnne Neff van Aertselaer and Caroline Bunce
Revisiting apprentice texts: Using lexical bundles to investigate expert and apprentice performances in academic writing
Christopher Tribble
Automatic error tagging of spelling mistakes in learner corpora
Paul Rayson and Alistair Baron
Data mining with learner corpora: Choosing classifiers for L1 detection
Scott Jarvis
Learners and users – Who do we want corpus data from?
Anna Mauranen
Learner knowledge of phrasal verbs: A corpus-informed study
Norbert Schmitt and Stephen Redwood
Corpora and the new Englishes: Using the ‘Corpus of Cyber-Jamaican’ to explore research perspectives for the future
Christian Mair
Towards a new generation of Corpus-derived lexical resources for language learning
David Wible and Nai-Lung Tsao
Automating the creation of dictionaries: Where will it all end?
Michael Rundell and Adam Kilgarriff
Addendum: Select list of publications by Sylviane Granger
Subject index
Name index
Acknowledgements
We would first of all like to thank all the contributors to this volume for their enthusiasm, their diligence in keeping to deadlines, and their patience in complying with our editorial demands, one of which was secrecy for quite a while! We would also like to express our most sincere gratitude to Elena Tognini-Bonelli, editor of the Studies in Corpus Linguistics series, and to Kees Vaes and his team at Benjamins for their much appreciated trust and support. Last but not least, we would like to thank Sylviane for not finding out about our secret project and meetings before all was (officially) revealed to her!
List of contributors
Bengt Altenberg
Lund University, Sweden
Alistair Baron
Lancaster University, United Kingdom
Caroline Bunce
Universidad Complutense de Madrid, Spain
Sylvie De Cock
University of Louvain, Belgium
Gaëtanelle Gilquin
University of Louvain, Belgium
Hilde Hasselgård
University of Oslo, Norway
Scott Jarvis
Ohio University, United States of America
Stig Johansson†
University of Oslo, Norway
Adam Kilgarriff
Lexical Computing Ltd., Brighton, United Kingdom
Geoffrey Leech
Lancaster University, United Kingdom
Christian Mair
University of Freiburg, Germany
Anna Mauranen
University of Helsinki, Finland
Fanny Meunier
University of Louvain, Belgium
JoAnne Neff van Aertselaer
Universidad Complutense de Madrid, Spain
Magali Paquot
University of Louvain, Belgium
Paul Rayson
Lancaster University, United Kingdom
Stephen Redwood
University of Nottingham, United Kingdom
Michael Rundell
Lexicography MasterClass and Macmillan Dictionaries, United Kingdom
Norbert Schmitt
University of Nottingham, United Kingdom
Christopher Tribble
London University, United Kingdom
Nai-Lung Tsao
National Central University, Taiwan
David Wible
National Central University, Taiwan
Preface
Bengt Altenberg
The digital revolution has had a profound effect on contemporary life. It has changed our way of communicating with each other and our ways of gathering and processing information. In linguistics the change has also been dramatic. It has made it possible to develop models for simulating language behaviour and practical applications in human-machine interaction and to create tools for storing, processing and analysing large amounts of text. The development of computer corpus linguistics is now familiar to most scholars interested in the study of language. The fact that we can analyse large corpora of various kinds has provided a solid empirical basis for the description of language and language use. Although corpus linguistics is strictly speaking a methodology rather than a theory of language, it has opened up new approaches to the study of language and new and fruitful ways of matching theory and data. Today we tend to take this development for granted. But it is profitable to remember that carefully compiled computer corpora and tools for exploring them did not arise ‘out of the blue’. They were – and are – the laborious achievement of inspired linguists who understood the potential of the new technology and knew how to use it for linguistic purposes. Computer corpus linguistics has had several pioneers of this kind since its beginning in the 1960s. This book is a tribute to one of these pioneers: Sylviane Granger, professor of English at the University of Louvain, Belgium. Sylviane Granger began her career the hard way, in what has humorously been called the era BC (‘Before Computers’) when corpus data were stored on cards in shoeboxes or filing cabinets. Her Ph.D. thesis on the use of the passive in spoken English (published in 1983) was the result of a painstaking manual inventory and analysis of be + past participle forms in the files of the Survey of English Usage (then not yet available in computerized form) at University College, London. 
That experience undoubtedly trained her in handling and analysing a large amount of corpus data but it also, one can imagine, made her appreciate the advantages offered by computerized corpora which were being developed at the time. But Sylviane Granger also had another fervent interest. Being trilingual in French, English and Dutch, she was deeply concerned with second language learning and teaching, notably the learning and teaching of English as a foreign language (EFL) and
– almost as a logical consequence – in contrastive analysis. These interests can be seen as the main driving forces behind the research conducted at the Centre for English Corpus Linguistics (CECL) which she founded at Louvain-la-Neuve in 1990. Since then, her wide-ranging interests in English corpus linguistics, her ambition to use the results for pedagogical purposes, and her enthusiasm as a teacher and project leader have made the CECL a veritable hothouse of corpus research and development which has inspired a large number of scholars around the world and fostered a new generation of enthusiastic co-workers at Louvain-la-Neuve and abroad. The research activities at the CECL have undergone a remarkable expansion since its beginning 20 years ago. Broadly speaking, the development has focused on four related areas:
– The creation and analysis of computer corpora of various kinds: learner corpora, multilingual corpora, corpora of English for Specific Purposes, etc.
– Linguistic research on these corpora ranging from lexis and phraseology to grammar and discourse with special emphasis on the development of corpus-related methodologies and on matching empirical data and linguistic theories
– Pedagogical applications, for example in learner-oriented lexicography, textbooks, web-based dictionaries, proficiency testing, etc.
– The development and use of computer-aided tools in research and pedagogical applications
The work in these areas has expanded organically in a series of related steps, each serving to supplement or refine the results of the previous one. For instance, the first corpus initiated by Sylviane Granger was the widely successful International Corpus of Learner English (ICLE), a computerized corpus of written argumentative essays produced by advanced learners of English with a number of different mother tongues. This written corpus was soon supplemented with a spoken counterpart (LINDSEI) consisting of interviews of intermediate to advanced EFL learners.
However, both these corpora offered a cross-sectional view of the learners’ interlanguage. To redress this limitation a new longitudinal learner corpus project (LONGDALE) has recently been launched, again involving advanced learners with different mother tongues but followed over a period of three years. Another example of the expansion of the work at CECL is the development of cross-linguistic research. The learner corpora give evidence of errors as well as quantitative deviations – overuse or underuse – from a (selected) native English norm or standard of comparison. These errors and deviations tend to differ in type and frequency depending on the L1 of the learners. One natural question that arises is to what extent L1 interference (transfer) plays a role in the learners’ production. This question encourages a contrastive perspective and the development of multilingual (comparable or translation) corpora which can provide empirical evidence for testing claims in second language acquisition theory which have previously mainly been based on intuition.
However, interlanguage phenomena like underuse or overuse of a target language feature may also be the result of overgeneralization of a target language structure or, alternatively, reflect special characteristics of the selected native English norm. The choice of target norm is therefore problematic. Which variety (or varieties) of English should be the target in second language research and teaching? Should all learners have the same target? Questions like these inevitably lead to a concern with language variation and the characteristics of different varieties of English. The result has been a development of learner corpora and multilingual corpora representing English for Specific Purposes (such as newspaper editorials, business English, academic English, law, etc.). Another recent interest at the CECL is to compare learner English with indigenized varieties of English (‘World Englishes’). All these perspectives require special methodologies and the use of various computer-aided tools for marking, analysing and presenting the data and for the creation of pedagogical applications of various kinds (e.g. learner-oriented dictionaries, textbooks, multilingual term banks). For example, in order to compare learner data with native data (L2 vs. L1) or different kinds of learner data (L2 vs. L2) Sylviane Granger developed the Contrastive Interlanguage Analysis (CIA) methodology which has been a fruitful approach in many ICLE studies. In addition, to integrate the CIA method with contrastive observations from multilingual corpora, she developed the Integrated Contrastive Model which helps the researcher to predict or explain various deviant interlanguage phenomena. 
Examples of computer-aided tools developed by the team at the CECL are the error-tagging system designed for the ICLE corpus and various learner-oriented projects in electronic lexicography, such as the creation of a web-based phraseological dictionary of English for Academic Purposes intended for non-native writers and of a trilingual terminological database of university-related terms (English-French-Dutch). This short survey of the work carried out at the CECL can only give an indication of the varied and rapidly expanding activities initiated by Sylviane Granger (for details, see the CECL homepage at www.uclouvain.be/en-cecl.html). Apart from her central influence as an enthusiastic organizer and creative researcher, she has inspired a large number of scholars around the world and created fruitful international cooperation around her projects. The collection of articles presented here on the occasion of her 60th birthday gives a good indication of her wide research interests. They illustrate the variety of topics and approaches that characterize the field as well as new lines of development. In presenting this collection the editors and contributors wish to join her colleagues and friends around the world in celebrating her pioneering achievement, hoping that her enthusiasm and creativity will continue to inspire us in the years to come.
Putting corpora to good uses
A guided tour
Sylvie De Cock, Gaëtanelle Gilquin, Fanny Meunier and Magali Paquot
This volume is a tribute to Professor Sylviane Granger, and a special gift for her 60th birthday. Its eleven chapters tackle corpora from a wide range of perspectives, thus reflecting Sylviane’s insatiable taste for corpora and her many interests. They were written by distinguished scholars whose work is appreciated by Sylviane, but who are also (long-standing) friends of hers. The different contributions aim to shed light on the numerous linguistic and pedagogical uses to which corpora can be put. They present cutting-edge research in the authors’ respective domains of expertise and suggest directions for the future. Given the many potential uses of corpora, the volume is inevitably incomplete and limited in size and focus, but we nevertheless believe that it will offer readers an informed account of the important role that corpora play in applied linguistics today. In this chapter, we will first guide readers through the main paths that Sylviane has explored in her career so far, and then provide an overview of the articles that are brought together in this volume. Sylviane Granger is a corpus linguist, a specialist in contrastive linguistics, a lexicographer, and also an English as a Foreign Language teacher. She is a polymathic applied linguist and her impressive list of publications (see Addendum, this volume) reflects her numerous research interests including corpus linguistics (native, learner and bilingual corpora), phraseology, lexicography, English as a Foreign Language, English for Academic Purposes, Second Language Acquisition, contrastive linguistics and technology-enhanced language learning. Twenty years ago, Sylviane founded the Centre for English Corpus Linguistics (CECL) at the Université catholique de Louvain (UCL), Belgium.
From a modest start in 1990, with one table, one chair, one computer, one bookcase and one researcher, the centre has gradually grown to include many more tables and computers, but above all many more researchers. To date some twenty researchers have been directly involved in the work done at the CECL, a world-renowned research centre. This exponential growth is the result of Sylviane’s enthusiasm, work, vision and leadership. Sylviane has always been an
enthusiastic project and team leader. In addition, she has always put a lot of energy and effort into promoting learner corpus research through her publications, the many talks she has given all over the world, but also via the (co-)organization of Summer/Easter schools and international conferences in Louvain-la-Neuve. Recent conferences include Phraseology 2005: The Many Faces of Phraseology, and eLexicography in the 21st Century: New Challenges, New Applications (2009). Today, the CECL is busy organizing the Learner Corpus Research 2011 conference to mark the 20th anniversary of its creation. Sylviane has been one of the main driving forces behind learner corpus research and she initiated two pioneering projects in the field: the International Corpus of Learner English (ICLE, Granger et al. 2009) and the Louvain International Database of Spoken English Interlanguage (LINDSEI, Gilquin et al. 2010). The ICLE project started in 1990 and the second version of ICLE, released in 2009, contains data from 16 mother tongue backgrounds, for a total of 3.3 million words. As for LINDSEI, whose first version has recently been released, it was started in 1995 and to date contains about 800,000 words produced by learners from 11 different mother tongue backgrounds. Methodological issues have also been a major concern for Sylviane and, in 1996, she proposed the Contrastive Interlanguage Analysis (CIA) approach (Granger 1996) to analyze learner corpora. The advent of learner corpus research can be said to have taken place with the publication of Learner English on Computer (Granger 1998), a collection of pioneering papers on learner language, largely based on ICLE, which has inspired many publications in learner corpus research. The volume Sylviane co-edited in 2002, entitled Computer Learner Corpora, Second Language Acquisition and Foreign Language Teaching (Granger et al. 2002), provides a follow-up with further developments in the field.
At the time of writing, Sylviane’s research appetite and enthusiasm remain undiminished and her head is full of new ideas and exciting projects! She often says that none of this would have been possible without her team at the CECL, and this is probably true. But what would a team be without an inspirational team leader who always looks on the bright corpus side of life? With this book, we explicitly want to thank her for her catching enthusiasm, her intellectual perceptiveness, her unfailing expert guidance, her sparkling personality, but also for her friendship and for the time she spends with us, be it to discuss academic or more personal everyday life matters, or even to party and have a good laugh. As highlighted at the beginning of this introduction, the different contributions included in the book reflect the numerous linguistic and pedagogical uses to which corpora can be put. The first two chapters address two central issues in corpus research: the notion of frequency and the role of contrastive analysis. In Chapter 1, Leech examines the role of frequency, as established on the basis of corpus evidence, in language learning. He shows that after early word-frequency lists such as West’s General Service List, followed by a generative period characterised by rejection of frequency, the advent of electronic corpora has led to renewed interest in frequency (frequency of words, but also of collocations, constructions, etc.). This movement is supported by recent trends in linguistics such as the development of usage-based theories or the recognition of
frequency effects in grammaticalisation. Leech claims that frequency, though not the only relevant factor, is important for language teaching, because of the principle of ‘more frequent = more important to learn’, according to which the most frequent words are the most useful ones to the learner (for comprehension as well as production purposes). The chapter finishes with some words of caution (what is most frequent does not necessarily correspond to what is most salient, and corpora from which frequencies are extracted do not always match learners’ needs) and some words of comfort (ordinal frequencies, i.e. how words are ordered along a frequency list, are normally sufficient, and these are usually quite similar across corpora). In the second chapter, Hasselgård and Johansson† start their paper with a select review of pre-corpus interlanguage studies, focusing on three Scandinavian research projects, before moving on to the development of computerized learner corpora. They focus on the ICLE project and on CIA, and present a number of valuable insights into advanced learner English that were gained from using comparable corpora and a common model of analysis. They then introduce another framework developed by S. Granger, viz. the Integrated Contrastive Model (ICM), which makes it possible to explain and/or predict mother-tongue (L1)-specific learner problems on the basis of systematic comparisons of the first language and the target language. The two research models are illustrated by means of three case studies. The first two studies adopt CIA to investigate the use of quite and I would say in four ICLE sub-corpora and the third one uses the ICM to analyse seem in the interlanguage of Norwegian learners. After identifying a number of challenges that learner corpus research needs to meet, Hasselgård and Johansson conclude by praising the dynamism and enthusiasm that characterise this relatively new field.
Chapters 3 and 4 discuss the development of academic literacies. Neff van Aertselaer and Bunce do so by examining the use of reporting verbs and evaluative lexical resources in two small corpora of texts written within the framework of an academic writing (AW) course by EFL Spanish university students at B1 and B2 levels of the Common European Framework of Reference. The Academic Writing (AW) course was organised around a series of can do descriptors to make explicit the required structural and rhetorical features to be learned. The authors compare their results with the ICLE Spanish sub-corpus and show that, by providing explicit descriptors for argumentative writing, the syllabus for the two AW courses did actually support students’ literacy growth. This is also confirmed by a comparison of the AW texts written at the beginning and end of the academic writing course. The study also illustrates how learner corpus data can be used to evaluate the syllabus and modify classroom teaching practices. In Chapter 4, Tribble investigates expert and apprentice performances in academic writing, drawing on Biber’s (2006) account of lexical bundles. He compares lexical bundles in a corpus of apprentice written production (KCL Apprentice Writing Corpus) and a close analogue corpus of British Academic Written English (BAWE), an exemplar corpus (Applied Linguistics Corpus) and two progressively more distant analogue corpora (BNC Baby, Academic and Acta Tropica). The chapter provides concrete illustrations of how the written production of postgraduate students in a single disciplinary area can be
used to trace contrasts between apprentice and expert writing, and how the account of such contrasts can be exploited in materials development for English for Academic Purposes (EAP) writing courses. Tribble’s study demonstrates how corpus analysis can help meet the learners’ linguistic needs; it also shows that a focus on lexical bundles fosters a better understanding of apprentice writers’ strategies. Issues pertaining to the automatic analysis of learner corpora are addressed in Chapters 5 and 6. In Chapter 5, Rayson and Baron present the novel application of a hybrid approach to the detection of spelling errors in learner data. They use a modified version of the Variant Detector (VARD) software, initially developed to match historical spelling variants to modern equivalents, to detect spelling errors in ICLE sub-corpora consisting of 50,000 words from three different mother tongue backgrounds (French, German and Spanish). They show the potential of natural language processing methods to contribute to the automatic error analysis of learner corpora as VARD can both assist a manual editing process of a sample corpus and be trained and run automatically to generate larger amounts of data for analysis. The authors explain, however, that despite the very high precision rate obtained by VARD, further research is still needed to improve the recall rate of detection of learner errors, especially those that can only be found using contextual patterns. In the next chapter, Jarvis uses data mining techniques to automatically detect the L1 of learners. The influence of the first language on a second has been one of the most researched topics in learner corpus studies. Most of these studies have used CIA to identify features of non-nativeness in learner productions and assess whether these features are peculiar to one language group, and thus possibly due to the influence of the learners’ mother tongues.
In a number of recent publications, however, Jarvis has put forward the detection-based approach to cross-linguistic influence, a complementary and largely automatic approach to detect cross-linguistic influence. The author compares 20 learning algorithms used for supervised classification, i.e. classifiers, and assesses their ability to learn to detect L1-related patterns of use of n-grams in 12 ICLE sub-corpora. He also explains that the applications of the detection-based approach to cross-linguistic influence are tremendous and largely transcend the field of language learning and teaching, as they could for instance be used for intelligence purposes. Chapters 7 to 9 deal with the sometimes blurred frontiers between second/foreign language acquisition, second language use and new varieties of English. Mauranen compares learner corpora, which contain data produced by second/foreign language learners, and corpora of English as a Lingua Franca (ELF), which contain data produced by non-native speakers who use English as a contact language. She first highlights the differences between the two, making a distinction between second language acquisition and second language use, and showing how this distinction, and the social, cognitive and interactive differences it implies, may impact corpus compilation and interpretation. While the division according to mother tongue background makes sense in learner corpora, for example, it is much less relevant (and feasible) in ELF corpora, which usually incorporate unpredictable combinations of mother tongues. On the other hand, learners and ELF users are also shown to share certain features that can be seen
to reflect the cognitive processes underlying the production of (non-native) language. The processes of overgeneralisation and simplification, for instance, are important in both second language acquisition and second language use, and can result in similar lexicogrammatical or phraseological features, as exemplified by Mauranen. On the basis of these similarities and differences, the author argues that learner corpora and ELF corpora should be kept separate, but are of great mutual interest. In Chapter 8, Schmitt and Redwood analyze 68 second language learners’ productive and receptive knowledge of some of the most common phrasal verbs in English with the help of productive and receptive tests. In addition to frequency effects, the authors also address the potential link between mode (spoken vs. written) and phrasal verb knowledge, as well as the interactions of other factors that can lead to individual differences in the acquisition of phrasal verbs (second/foreign language proficiency, gender, age, and amount and type of exposure to the target language inside and outside the classroom). The authors demonstrate that frequency can predict phrasal verb knowledge to a considerable degree in terms of productive mastery, but not in terms of receptive mastery. Whilst their results show no effect for formal-instruction-based variables, they show that more out-of-class exposure facilitates the learning of phrasal verbs. In the next chapter, Mair shows how corpus linguistics has contributed to the study of the so-called ‘New Englishes’. His own research focus is on Jamaican English and Jamaican Creole, which he explores on the basis of a large corpus of diasporic Jamaican web-posts, called the Corpus of Cyber-Jamaican.
Mair highlights interesting features of Jamaican English and Creole as it is used in computer-mediated communication, for example the higher frequency of basilectal variants in cyber-Jamaican than in face-to-face interaction, which he explains by the phenomenon of ‘anti-formality’, i.e. “conscious closing of social distance”. He also deals with lexical borrowings from African languages in Jamaican English and Creole, with words such as mzungu (‘white person’ in Kiswahili) or wahala (‘trouble/problem’ in Nigerian Pidgin) being found in the Corpus of Cyber-Jamaican. More generally, the paper underlines the benefits of relying on data derived from the World Wide Web, which includes more non-standard forms than corpora of face-to-face interaction, in order to investigate variation in the New Englishes. It also argues that web-forums can provide an arena for language contact that would be unlikely to occur in the real world, resulting in the rapid globalisation of certain vernacular features. The last two chapters of the book are devoted to the role that corpora can play in the development of lexical and lexicographical resources for language learning. Wible and Tsao report on a new corpus-derived lexical resource designed to help bridge the gap between language learners’ needs and what corpora can offer when it comes to vocabulary learning. After arguing that vocabulary knowledge is best seen as a rich network of interconnections among words and that corpora as collections of texts and tokens fail to give language learners direct access to this web of interconnections, the chapter describes the lexical knowledgebase StringNet, which has been specifically created to reflect what learners need to master. The authors explain how corpus-derived ‘hybrid n-grams’, in which part-of-speech categories can occur alongside lexemes or word forms, have been instrumental in automatically discovering not only patterns of word behaviour but also
the relations among these patterns and words. In addition, they show how hybrid n-grams make it possible to uncover the larger patterns in which collocations often tend to be embedded. Finally, Wible and Tsao suggest that language learners could be given access to the lexical knowledgebase StringNet via a browser-based tool which could help them discover patterns they had not thought of looking for. As for Rundell and Kilgarriff, they examine and evaluate the role of computers and automation in modern dictionary making and more specifically in the period from the late 1990s onwards. The focus is on a number of lexicographic tasks that have been or are in the process of being automated to a significant degree. These include the compilation of lexicographic corpora (with the advent of the ‘web corpus’), the development of headword lists (e.g. selecting headwords, identifying multiwords or new words), the identification of the key linguistic features of the lexical units included in the dictionary (e.g. their collocational/colligational preferences, the grammatical or register labels they should be assigned), and the selection of examples to be included (e.g. using the GDEX [‘good dictionary examples’] algorithm). The contribution and development of word sketches and the Sketch Engine are also highlighted and amply illustrated. Throughout the chapter the authors show how automation has made it possible not only to relieve lexicographers of more tedious work involved in dictionary making but also to increase consistency and reliability when describing language and compiling dictionary entries. Their paper is rounded off by a discussion of possible further developments of the process of automation in lexicography.
We hope that the guided tour of some of the key approaches, methods and domains of applications of (learner) corpus research provided in this volume will help readers refine and/or develop their own taste for corpora, and that it will prompt them to discover and freely explore new paths.
References
Biber, D. 2006. University Language: A Corpus-based Study of Spoken and Written Registers [Studies in Corpus Linguistics 23]. Amsterdam: John Benjamins.
Gilquin, G., De Cock, S. & Granger, S. (eds). 2010. The Louvain International Database of Spoken English Interlanguage. Handbook and CD-ROM. Louvain-la-Neuve: Presses universitaires de Louvain.
Granger, S. 1996. From CA to CIA and back: An integrated approach to computerized bilingual and learner corpora. In Languages in Contrast. Text-based Cross-linguistic Studies [Lund Studies in English 88], K. Aijmer, B. Altenberg & M. Johansson (eds), 37–51. Lund: Lund University Press.
Granger, S. (ed.). 1998. Learner English on Computer. London: Addison Wesley Longman.
Granger, S., Dagneaux, E., Meunier, F. & Paquot, M. (eds). 2009. The International Corpus of Learner English. Handbook and CD-ROM. Version 2. Louvain-la-Neuve: Presses universitaires de Louvain.
Granger, S., Petch-Tyson, S. & Hung, J. (eds). 2002. Computer learner corpora, second language acquisition and foreign language teaching. Amsterdam & Philadelphia: Benjamins.
Frequency, corpora and language learning
Geoffrey Leech
I begin this chapter with a brief survey of how frequency – in particular, frequency of words – had a role in language learning in the days before electronic corpora existed. Then I consider how the ‘corpus revolution’ made frequency information available in a totally unprecedented way from the 1960s onward. But how far is this useful to the language learner and teacher? Is the right kind of frequency knowledge being captured? In the second half of this chapter, I will consider the equation ‘more frequent = more important to learn’, what questions of frequency we really need to ask, and how far they can be answered in the present state of corpus linguistics.1
1. Introduction
If asked what is the one benefit that corpora can provide and that cannot be provided by other means, I would reply ‘information about frequency’. Frequency is also a theme which has recurred in language learning – although it has also suffered from neglect (as will be briefly explained below). Hence there is need for a re-appraisal of the links between frequency, corpora and language learning. Following this introduction, the chapter is divided into four main sections: Section 2: ‘A brief glance at history’; Section 3: ‘Recent progress in frequency studies relevant to language learning’; Section 4: ‘New directions in applied linguistics favourable to frequency’; Section 5: ‘Challenges and possible solutions’. The chapter ends with some concluding remarks (Section 6).
To begin with, it is as well to make clear that there are three usages of frequency that might be confused.
a. ‘Raw frequency’ is simply a count of how many instances of some linguistic phenomenon X occur in some corpus, text or collection of texts.
b. ‘Normalized frequency’ (sometimes called ‘relative frequency’) expresses frequency relative to a standard yardstick (e.g. ‘tokens per million words’).
1. I am very grateful to the editors, Sylvie De Cock, Gaëtanelle Gilquin, Fanny Meunier and Magali Paquot, for their valuable suggestions and support in helping me to improve this chapter.
c. In what I will call ‘ordinal frequency’, the frequency of X is compared with the frequencies of Y, of Z, etc. Thus a rank frequency list, in which words are listed in order of frequency, is the classic example of ordinal frequency.
Although (a) is the raw measure from which (b) and (c) are derived, it is of little or no use in itself. Normalized frequency (b) is of course essential if we are to make comparisons between corpora, texts, etc., of different sizes. But my view is that ordinal frequency (c) is the most useful measure to use when we are considering language learning. It is of no use for the language teacher to be told that shall occurs 175 times per million words in a corpus. But to be told that will is much (15 times) more frequent than shall may well be pedagogically useful.
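The three usages of frequency distinguished above can be illustrated with a minimal Python sketch. The function names and the toy token list are illustrative assumptions, not part of any corpus tool discussed in this chapter; a real study would of course work from a tokenized corpus of realistic size.

```python
from collections import Counter


def raw_frequency(tokens):
    """(a) Raw frequency: a plain count of each word form in the token list."""
    return Counter(tokens)


def normalized_frequency(raw_count, corpus_size, per=1_000_000):
    """(b) Normalized frequency: occurrences per `per` tokens (default: per million)."""
    return raw_count * per / corpus_size


def rank_list(counts):
    """(c) Ordinal frequency: words in descending order of frequency.

    Ties are broken alphabetically so that the ordering is reproducible.
    """
    return [word for word, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))]


def frequency_ratio(counts, x, y):
    """How many times more frequent x is than y (the 'will vs. shall' comparison)."""
    return counts[x] / counts[y]


if __name__ == "__main__":
    # Hypothetical toy data, far too small for real conclusions.
    tokens = "i will go and you will stay but we shall see and they will come".split()
    counts = raw_frequency(tokens)
    print(counts["will"], counts["shall"])
    print(normalized_frequency(counts["shall"], len(tokens)))
    print(rank_list(counts)[:3])
    print(frequency_ratio(counts, "will", "shall"))
```

The last function corresponds to the kind of statement argued above to be most useful pedagogically: a comparison between items rather than an isolated per-million figure.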
2. A brief glance at history
The historical sketch I am about to give roughly divides into three epochs: (a) early frequency studies; (b) the rejection of frequency; (c) the computer age and the revival of frequency studies.
2.1 Early frequency studies
The early chapters of introductions to corpus linguistics by Kennedy (1998) and by McEnery & Wilson (2001) give something of the background to this. But for my present purpose, it is enough to refer to one or two landmarks in the provision of word-frequency information on English. Thorndike (1921, 1932), Thorndike & Lorge (1944), and West (1953)2 are noted examples of word-frequency lists produced by counting and calculating word frequencies by hand in the first half of the twentieth century – before, that is, the development of computers. By present-day standards, the corpora used were pitifully small, and the selection of texts they contained included some choices hardly ideal for learners of the current language. For example, Thorndike (1921, 1932) made use of a corpus containing such classics from the 17th, 18th and 19th centuries as Dryden’s Dramatic Essays, the American Declaration of Independence, and Jane Austen’s Pride and Prejudice. However, the important point here is that word frequency was taken seriously as a guide for language teaching in those days, and in spite of the enormous amount of unrewarding ‘slave labour’ involved, building frequency lists was felt to be a worthwhile exercise. The simple postulate justifying this effort was: ‘more frequent = more important to learn’.
Of greater interest from the theoretical point of view was the mathematical work of Zipf. Zipf’s Law (1935, 1949) held that the frequency of any word is inversely proportional to its rank in the frequency list, such that the nth word has a frequency of approximately 1/n × the frequency of the word of highest rank. Zipf’s Law gave a more heavily weighted importance to the most frequent words than would be expected according to normal distribution. Language is such that the most frequent 50 words (i.e. word-types) account for 40% of word-tokens in a corpus of texts; the most frequent 3,000 words account for 85% of word-tokens; and the most frequent 10,000 words account for 92% of word-tokens (see Figure 1). Carroll’s (1971) mathematically-induced estimate of the number of word-types in the English language was 609,606 words, of which a majority have extremely small probabilities. For practical purposes we can say that the wordstock of English is both very large and open-ended.
[Figure 1: graph. Vertical axis: % of total words in the LCN (10% to 100%); horizontal axis: 1 to 10,000 (the 10,000 most frequent words). Annotations: top 10 words (the, be, of, and, a, to, in, have, it, I) – 25% of the language; top 100 words – 50% of the language; top 3000 words – 86% of the language. Example words labelled on the graph: year, up, good, very, deep, fresh, unique, mingle, therapeutic, viable, carefully, overwhelm, aspiration, consumption, stylistic, consolation, erode, concession.]
Figure 1. Frequency graph of the 10,000 most frequent words in the Longman Corpus Network (Reproduced by permission of Pearson Education Limited from: Stephen Bullon and Geoffrey Leech, ‘Longman Communication 3000 and the Longman Defining Vocabulary’. In Longman Communication 3000. 1. Harlow, Essex: Pearson/Longman.)
2. West’s book was called A General Service List of English Words, and recorded frequencies of senses, not just words. Although not published until 1953, West’s book was based on counts laboriously undertaken in the decades before 1950.
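Zipf’s Law as formulated in this section, and the notion of the share of word-tokens covered by the most frequent word-types, can be sketched in a few lines of Python. This is an idealisation under stated assumptions: the vocabulary size of 10,000 types and the top frequency of 60,000 are hypothetical values chosen for illustration, and real corpora only approximate the 1/n relationship.

```python
def zipf_expected(top_frequency, rank):
    """Expected frequency of the word at `rank` under the idealised law f(n) = f(1) / n."""
    if rank < 1:
        raise ValueError("rank must be 1 or greater")
    return top_frequency / rank


def coverage(frequencies, top_k):
    """Proportion of all word-tokens accounted for by the top_k most frequent word-types."""
    ordered = sorted(frequencies, reverse=True)
    return sum(ordered[:top_k]) / sum(ordered)


if __name__ == "__main__":
    # Hypothetical idealised vocabulary: 10,000 types, the commonest occurring 60,000 times.
    frequencies = [zipf_expected(60_000, n) for n in range(1, 10_001)]
    for k in (10, 100, 3000):
        print(k, round(coverage(frequencies, k), 3))
```

Under these assumptions the top 10, 100 and 3,000 types cover roughly 30%, 53% and 88% of the tokens in the idealised vocabulary. These figures are of the same order as those cited above but are not a reconstruction of them, since the cited percentages come from real corpus data rather than from the formula.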
2.2 The rejection of frequency
In linguistics, the second half of the twentieth century, at least up to the 1990s, was dominated by the generative school of Noam Chomsky, who rejected the value of frequency in the study and understanding of language. Chomsky famously used the illustration of I live in Dayton, Ohio and I live in New York to show that the greater frequency of the latter sentence as compared with the former was of no linguistic relevance or interest. Of course, this had more to do with the differences of population between Dayton, Ohio and New York – from Chomsky’s point of view, a matter of performance (and hence of no value to linguistics) rather than competence. He
concluded that “probabilistic considerations have nothing to do with grammar” (Chomsky 1964 [1962]: 215) – using grammar in a broad Chomskyan sense to include the whole language system. From that time until (roughly) the end of the century, since Second Language Acquisition (SLA) research was heavily influenced by the generative paradigm, it was difficult to find any serious reference to frequency in publications about the learning of languages, and where frequency was discussed, it was dealt with perfunctorily and sometimes negatively. The well-known authoritative handbook by Rod Ellis, The Study of Second Language Acquisition (1994), has little to say about frequency, and offers very little extra in its second edition of over a thousand pages, published as recently as 2008. The only substantial reference to frequency is in the section headed ‘The frequency hypothesis’, in which the emphasis is wholly on the learner’s input frequency (see Ellis 1994: 269–273, 2008: 241–246). For corpus linguistics, a more relevant question is: how can both the learner’s input and output be adjusted to the future likely needs of the learner as revealed in corpora?
2.3 The computer age and the revival of frequency studies
It can be said that the corpus revolution in linguistics began with the completion and distribution of the Brown Corpus in 1964.3 Shortly after, Kučera & Francis (1967) used this to create the first word frequency lists for English based on corpus data. Later, in Francis & Kučera (1982), they published lemmatized frequency lists, based on the part-of-speech (POS) tagged version of the corpus. Further word frequency lists were derived from the Lancaster-Oslo/Bergen (LOB) Corpus of British English (Hofland & Johansson 1982; Johansson & Hofland 1989), and for the first time grammatically informed word frequency lists derived automatically from matching computer corpora became available to the language researcher and the language teacher permitting comparison of American and British English. Of course, this was only the first step: in the last forty years, there has been an immense increase in the number of corpus-based frequency studies both for written and spoken English, as more diversified corpora as well as much larger corpora have become available. Apart from word frequency lists and studies (e.g. those derived from the British National Corpus [BNC] – Leech et al. 2001), corpus-based frequency studies have dealt with collocations (e.g. Sinclair et al. 1970, republished in Krishnamurthy 2005), and with frequency of grammatical categories, structures, etc. Here hundreds of grammatical studies could be mentioned, starting from Ehrman (1966), and culminating in a corpus-based frequency grammar of English (Biber et al. 1999) as well as with frequency studies of the language of learners (Granger 1997, 1998). It goes almost without saying that the availability of electronic corpora has revolutionized the application of frequency information whether derived from general corpora, specialized corpora, written texts or spoken transcriptions. It is also clear that frequency data from authentic texts have been one of the major driving forces of natural language processing (NLP), leading to the development of sophisticated statistical methods and probabilistic systems. One of the first steps was taken in the probabilistic POS tagging of the LOB Corpus, employing a modified Hidden Markov Process model (Marshall 1983, 1987). The history of statistical modelling in NLP, however, cannot be pursued further here. See Jelinek (1998) for further coverage.
3. The Brown Corpus was originally issued by W. Nelson Francis and Henry Kučera of Brown University, under the title A Standard Sample of Present-Day Edited American English, for use with Digital Computers.
2.4 Co-frequency, collocation
Another great step forward was taken through the pursuit of co-frequency – i.e. the frequency of X and Y occurring together in a corpus, as measured against the probability of their occurring together by chance. A serious beginning was made in Sinclair’s research discussed in his (and colleagues’) OSTI report (1970), using a small corpus of spoken English of 135,000 words. Obviously, as Sinclair pointed out, a much bigger corpus (of 20 million words or more) was needed to produce significant results for collocational analysis. This was achieved and surpassed in the 1980s and 1990s with Sinclair’s development of the Birmingham Collection of English Texts, later known as the Bank of English, as well as by other corpora such as the BNC. To give an impression of how vastly the size of corpora on which frequency studies are based has mushroomed in the last forty years: in comparison with Sinclair’s spoken corpus of 135,000 words in 1970, a recently published frequency dictionary of American English (Davies & Gardner 2010) is based on a corpus of 385,000,000 words, including 79,000,000 words of speech. This dictionary is also an innovation in providing, alongside individual word frequencies, a classified list of common collocations for each word. Word frequency lists such as those of Francis and Kučera were of limited interest to corpus linguists like John Sinclair, who urged the inadequacy of the open choice principle of treating every word-token in a string as if independently selected, as contrasted with the idiom principle whereby texts are observed to be constructed in terms of “a large number of semi-preconstructed phrases that constitute single choices, even though they might appear to be analysable into segments” (Sinclair 1991: 110). 
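The idea of co-frequency ‘as measured against the probability of their occurring together by chance’ can be made concrete with a short sketch. It uses two association measures in common use in corpus linguistics, mutual information (log2 of observed over expected co-occurrence) and the t-score ((observed − expected) / √observed). The function names and the counts in the example are hypothetical, and the sketch does not claim to reproduce the procedure of any particular study cited here.

```python
import math


def expected_cooccurrence(freq_x, freq_y, corpus_size):
    """Co-occurrences of X and Y expected by chance if they were independent."""
    return freq_x * freq_y / corpus_size


def mutual_information(freq_xy, freq_x, freq_y, corpus_size):
    """Mutual information: log2 of observed over expected co-occurrence."""
    if freq_xy <= 0:
        raise ValueError("the pair must be observed at least once")
    return math.log2(freq_xy / expected_cooccurrence(freq_x, freq_y, corpus_size))


def t_score(freq_xy, freq_x, freq_y, corpus_size):
    """t-score: (observed - expected) divided by the square root of observed."""
    if freq_xy <= 0:
        raise ValueError("the pair must be observed at least once")
    expected = expected_cooccurrence(freq_x, freq_y, corpus_size)
    return (freq_xy - expected) / math.sqrt(freq_xy)


if __name__ == "__main__":
    # Hypothetical counts: X occurs 1,000 times, Y 500 times, and the pair
    # 50 times in a corpus of one million words.
    print(round(mutual_information(50, 1000, 500, 1_000_000), 2))
    print(round(t_score(50, 1000, 500, 1_000_000), 2))
```

With very small corpora the expected value is tiny and a single chance co-occurrence can produce a high mutual information value, which is one way of seeing why a much larger corpus is needed for reliable collocational analysis.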
Sinclair’s idiom principle has since been followed up by many corpus linguists and lexicographers for whom multi-word units – collocations, lexical bundles, and the like – are essential to the fabric of language, as well as to the learning of language. Indeed, corpus research itself has shown observationally the importance of word combinations, whose significance is capable of being measured by statistical formulae such as mutual information, t-test, and log likelihood. Sinclair, in championing the idiom principle, was following to some extent in the footsteps of his former Edinburgh colleague M.A.K. Halliday, and Halliday’s teacher J.R. Firth (1957), who first gave prominence to the co-frequential concept of collocation (Halliday 1966; Sinclair 1966). Halliday (1961: 273–277) had stated that the level of lexis (including collocation) had
to be a distinct level of linguistic description,4 and at the same time had proposed that the levels of grammar and of lexis were interrelated along a cline or continuum of delicacy (ibid. 276–277). For him, the levels of grammar and lexis constituted a single lexico-grammatical level accounting for the formal structuring of language. Many corpus linguists have espoused something like this model, evidenced as it is by a multitude of studies,5 with the result that the interpenetration of grammar and lexis (and hence the spread of lexical frequency-based concepts into grammar) has become widely accepted. In this respect, it can be said that the corpus revolution has introduced a new theoretical perspective on linguistic structuring: one in bold contrast to the mainstream paradigm of Chomsky (e.g. Chomsky 1965: 84–88) whereby grammar and lexicon are two clearly distinct components. It also challenges a tradition long established in language study, whereby grammars and dictionaries provide distinct kinds of information about a language, and are published in separate covers.
3. Recent progress in frequency studies relevant to language learning
In this part of the chapter I will revisit four topics already briefly touched on, showing how studies of frequency have been increasingly applied to various linguistic units or components:
a. word frequency (by register, region, etc.)
b. co-frequency between words – lexis and collocation
c. grammatical frequency
d. lexico-grammatical frequency – co-frequency between lexis and grammatical structures
I will consider how these topics have been advanced by recent research. Other linguistic levels at which frequency has been somewhat less investigated, e.g. semantics, will remain in the background.
3.1 How frequency is important for English Language Teaching (ELT)
First, let us revisit word frequency lists. The case for ‘more frequent = more important to learn’ is simply put: “The reasoning behind this position is that learners should be taught what is most frequent in language, since it is what is of most use to them” (Gilquin 2006a: 58). In other words, the more frequent a word is in language use, the more likely it is to be useful to the learner. This is (a) because it will be more frequently encountered in the language use of other people, and (b) because it will be more frequently needed for the learner’s own language use.
4. “So there must be a theory of lexis, to account for that part of linguistic form which grammar cannot handle” (Halliday 1961: 273).
5. For example, Moon (1998), Nesselhauf (2005), Adolphs (2008), Römer & Schulze (2009), and a rich range of studies contributed to Granger & Meunier (2008).
3.2 Word frequency associated with language varieties
However, frequency counts are least useful when they are based on a general corpus covering the range of the language; they are more useful if they are differentiated for region and register. This is one advantage that the corpus revolution has brought, and which was lacking in earlier manually-based studies. Earlier I mentioned briefly Johansson & Hofland’s (1989) frequency lists of comparable corpora for American English (AmE) and British English (BrE): Brown and LOB. These and other corpora in the Brown family show differences in regional varieties of written English and show, for example, that the auxiliaries must, may, should and shall have been declining sharply in frequency, and that in this decline BrE is following in the wake of a sharper decline in AmE (Leech et al. 2009: 71–83). It is also possible to compare AmE and BrE in terms of spoken English: two comparable corpora of conversation (the demographically sampled part of the BNC and the Longman Corpus of Spoken American English [LCSAE]) show that in AmE, much more than in BrE, ‘core’ modals like must, may, should and shall are less frequent than in the written language, whereas some ‘semimodals’ resulting from grammaticalization – constructions like be going to and have to – have reached a greater frequency than most core modals (see Leech et al. 2009: 100). In Leech et al. (2001), based on the BNC, we presented word frequency lists for both written and spoken English, and also lists of words which were most ‘key’ in these two varieties (i.e. those most strongly associated with written texts or with spoken texts). Dictionaries such as the Longman Dictionary of Contemporary English (LDOCE) also give variety-differentiated frequency information. 
Since its third edition (1995), LDOCE has flagged words in the first thousand, second thousand, and third thousand in terms of frequency in speech and in writing, differentiating between their occurrence in the two media, where not surprisingly word frequencies differ greatly. For example, in Table 1 the items in List A are in the top 1,000 words for speech, but below the top 3,000 in writing: in other words, these words are much more at home in the spoken medium. On the other hand, those in List B are in the top 1,000 words for writing, but below the top 3,000 in speech: they strongly prefer the written medium.
Table 1. Words strongly associated with spoken (List A) and written (List B) English
List A: Words strongly associated with the spoken medium
awful, basically, bet (verb), daddy, dear (interj), exam, go (noun), hello (interj), hi (interj), hopefully, like (adv), like (conj), mine (pron), mom, mummy, OK (adv), OK (interj), ours (pron.), penny, phone (verb), rid (adj), yeah (interj), yep (interj)
List B: Words strongly associated with the written medium
authority, institution, security, program (noun), reveal, sector, king, thus
In addition to frequency information about speech and writing, there are also corpus-based frequency lists relating to different registers or domains – such as the Academic Word List of Coxhead (2000). Such differentiated frequency information is potentially very useful for learners of a language, or more directly, for those preparing teaching materials, selecting reading materials, or devising tests. Up to recently, corpora have been restricted largely to the written medium, and frequency lists were presented as undifferentiated as to variety: words like daddy and institutional would appear side by side in the same list without much distinction (in fact, their overall frequencies in the entire BNC are close – 22 and 20 occurrences per million words respectively). So this is a decided step forward: for the learner, vocabulary resources for speech are very different from vocabulary resources for writing, and corpora have enabled us to see this clearly and in considerable detail.6
A further innovation in recent lexicography for advanced learners has been the recognition also of semantic frequency. Using again the example of LDOCE (1995 and later editions), the various senses of a word are listed under each headword in frequency order, and likewise homographs are presented in frequency order. In such ways, dictionaries using corpus resources for the advanced learner have been striving to supply the information the learner needs most frequently in readily accessible form.
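The criterion used to assign words to List A and List B in Table 1 (top 1,000 in one medium, below the top 3,000 in the other) lends itself to a small sketch. The function names, the returned labels and the example ranks are hypothetical illustrations of the banding logic only; they are not LDOCE’s actual data or procedure.

```python
def frequency_band(rank, band_size=1000, n_bands=3):
    """Return 1, 2 or 3 for a rank in the first, second or third thousand, else None."""
    if rank < 1:
        raise ValueError("rank must be 1 or greater")
    band = (rank - 1) // band_size + 1
    return band if band <= n_bands else None


def medium_preference(spoken_rank, written_rank):
    """Classify a word by the List A / List B criterion described for Table 1."""
    spoken_band = frequency_band(spoken_rank)
    written_band = frequency_band(written_rank)
    if spoken_band == 1 and written_band is None:
        return "List A (spoken)"
    if written_band == 1 and spoken_band is None:
        return "List B (written)"
    return "no strong preference"


if __name__ == "__main__":
    # Hypothetical ranks for three unnamed words.
    print(medium_preference(spoken_rank=400, written_rank=5200))
    print(medium_preference(spoken_rank=4100, written_rank=650))
    print(medium_preference(spoken_rank=300, written_rank=250))
```

The design choice worth noting is that the comparison is made on rank bands within each medium, not on raw or normalized counts, which keeps the classification independent of the differing sizes of the spoken and written components of a corpus.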
3.3 A more considered view
The principle “more frequent = more important to learn” can scarcely be gainsaid as a general principle. However, one of the discoveries from the study of learner corpora is that non-native students of English tend to overuse the words towards the top of the frequency lists:
A number of studies reveal that learners from a wide variety of (unrelated) mother tongue backgrounds display a common tendency to overuse common, non-specific words such as important (...) or big or nice (De Cock & Granger 2004: 78)
Part of this effect may well be due to failure to adapt to the written medium: it is true that words such as big and nice are very frequent – but this is only in the spoken medium, whereas nice, in particular, is rather infrequent in writing. A more general reason for overuse of common words is that they are the words learners have encountered and used most in the past. They are inevitably words with which the learners feel most familiar, most confident and most comfortable – “lexical teddy bears”, as Hasselgren (1994) calls them. Hence it is important to make a distinction between frequency in past experience for the learner, and frequency in projected future experience. The reason for prioritizing commoner items over less common items in teaching is that they are predictably the items the learner will encounter, and need to use, more frequently in the future. But the reason why learners overuse common words must be that they are the words they have encountered and used most frequently in the past. The conclusion is that, if we are to follow the ‘more frequent = more important to learn’ principle, attainment in vocabulary acquisition must be linked progressively and systematically to extending the range of use to less common vocabulary, including less common uses of frequent words (cf. Lennon 1996).7 The focus of learning should be step by step on less frequent items which the learner needs to entrench for further use. From a testing perspective too, as Alderson (2007) points out, less common items are more discriminatory in the evaluation of levels of performance – for example, in vocabulary size placement tests. All this indicates that, applied to learning processes, frequency should be a relative, not an absolute quantity. What is important is that more common words should be most usefully learned before less common words, whether those more common words are in the top bands of frequency or not. So far, then, the postulate ‘more frequent = more important to learn’ has not been overthrown. The overuse effect implies simply that the students, in their learning process, have not progressed down the frequency list as far as is desirable. They are relying too much on well-worn and well-loved paths of expression.
6. This is all the more important since learners tend to confuse spoken and written registers (see e.g. Altenberg & Tapper 1998 or Gilquin & Paquot 2008).
3.4 Frequency of word combinations: Is it more important than frequency of individual words?8
Teubert (2004: 188) goes so far as to claim: “Not simple words but collocations constitute the true vocabulary of a language”. This may be going too far, as Teubert here embraces the idiom principle one hundred percent. Nevertheless, the formulaicity of English has been calculated as around 21% in written texts and even higher (30%) in spoken language (Biber et al. 1999: 993–994).9 Learning vocabulary is not just a matter of acquiring individual words, but of acquiring phraseology. Hence frequency of word combinations, as well as of words, should be an important input to the learning process. The strange thing is that, according to De Cock (1998), the percentage of formulaicity in learners’ productions is even higher than in those of native speakers (although some formulae are erroneous). Again, this may be the result of a ‘teddy bear’ effect, whereby learners hang on to the use of well-worn and familiar phrases, rather than risking new ones.
7. Also we should include here the collocational patterns of frequent words, usually disregarded at more advanced levels because the words are considered as easy or known, whilst studies have shown that these collocational patterns were not mastered even at an advanced level (see for instance Nesselhauf 2003).
8. As background to this section, see Durrant (2009) and Ellis & Simpson-Vlach (2009).
9. There are many ways of defining formulaicity (see Moon 1998) and the percentage figures here are derived from a very specific definition: lexical bundles (3-grams and 4-grams) recurring at least 10 times per million words. These percentages are estimated from Biber et al.’s (1999: 993–994) Figures 13.2 and 13.3. A somewhat earlier study by Eeg-Olofsson & Altenberg (1996) reported that as many as 86% of words in two 5,000-word samples (one monologue, the other dialogue) “were part of a recurrent word combination in one way or another”. See also Altenberg (1998).
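The definition of formulaicity used for the percentages in this section (lexical bundles, i.e. 3-grams and 4-grams recurring at least 10 times per million words) can be sketched as follows. The function names are illustrative, and the `min_count` parameter is an added assumption: in a very small sample every sequence trivially exceeds 10 per million, so a minimum raw count is needed to keep the toy example meaningful. The sketch is not a reconstruction of the procedure behind the figures cited.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous sequences of n tokens, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def lexical_bundles(tokens, sizes=(3, 4), min_per_million=10.0, min_count=2):
    """Map each recurrent n-gram to its rate per million tokens.

    An n-gram qualifies if its normalized rate reaches `min_per_million`
    and (an assumption added for small samples) it occurs at least
    `min_count` times.
    """
    total = len(tokens)
    bundles = {}
    for n in sizes:
        for gram, count in Counter(ngrams(tokens, n)).items():
            rate = count * 1_000_000 / total
            if count >= min_count and rate >= min_per_million:
                bundles[gram] = rate
    return bundles


if __name__ == "__main__":
    # Hypothetical toy sample.
    sample = "i do not know what i do not want".split()
    for gram in lexical_bundles(sample):
        print(" ".join(gram))
```

A measure of formulaicity in the sense used above would then be the proportion of tokens in a text that fall inside at least one such bundle.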
3.5 Grammatical frequency
The focus has been mainly on lexical frequency so far – the easiest kind of frequency data to extract from corpora. The collection of data on frequency of grammatical categories, grammatical constructions and the like can be achieved automatically only if the corpus has been annotated with the grammatical information supplied by POS tagging and (ideally) parsing.10 This annotation process is far from easy. Alternatively, unless the grammatical items happen to be unambiguously identifiable from their orthographic form, grammatical information has to be gathered laboriously by manual intervention. Nevertheless, much has been learned from corpora about grammatical frequency since the first POS tagging (of the Brown Corpus) was achieved in 1970 (Greene & Rubin 1971). Many results come from individual case studies of particular areas of English grammar. A more concerted corpus-based account of grammatical frequency is provided by the Longman Grammar of Spoken and Written English (Biber et al. 1999). At the more theoretical level, one rather unexpected finding (Sampson 2007) is that frequency of grammatical structures, defined as tree fragments or mother-daughter sequences, follows a Zipfian curve similar to that of word frequency, with an enormous tail of structures occurring only once in a corpus, just as the tail of vocabulary frequency (around 50%) consists of words that only occur once (hapax legomena). This is surprising for those brought up in the Chomskyan framework (Chomsky 1957: 13) where there is assumed to be a clear dividing line between items which are grammatical and those which are not. The common assumption up to recently has been that the grammar is a closed system of rules whereas the lexicon is open-ended. On grammatical frequency, perhaps even more than lexical frequency, corpus findings can be surprising alike to native speakers and to experienced teachers of the language. 
For example, it has been reported that teachers of English, when asked whether the present progressive or the present simple is more common, typically opt for the progressive.11 This choice is reinforced by the fact that in syllabuses, the present progressive has sometimes been taught before the present simple. Teachers, it can be supposed, are hugely surprised to be told that (according to corpus evidence) the progressive aspect is about 20 times less common than the simple non-progressive aspect (see Figure 2). Another illustration of how teaching practices in grammar have been notoriously at odds with corpus evidence is that of conditional sentences. For a generation at least, Thomson & Martinet’s (1980: 186–192) best-selling grammar textbook helped to perpetuate the time-honoured assumption that there are just three categories of conditional which learners of English have to master:12
First conditional: Protasis: present simple; Apodosis: will + infinitive; e.g. If you don’t get it he’ll repeat it.
Second conditional: Protasis: past simple; Apodosis: would + infinitive; e.g. If I had an acre to plant, I would spend all day working on it.
Third conditional: Protasis: past perfect; Apodosis: would have + infinitive; e.g. If I’d owned it, I would have thrown it away.
[Examples from the LCSAE]
10. However, tagging and even parsing do not necessarily imply that the retrieval of grammatical phenomena is fully automatic. Sometimes, considerable manual post-editing is necessary (cf. Gilquin 2002). On the other hand, advanced corpus software such as BNCweb (for use with the BNC) can undertake queries leading to the retrieval of syntactic patterns by use of a powerful query syntax known as CQP employing regular expressions – see Hoffmann et al. (2008: 215–243).
Figure 2. Chart showing the frequency of the simple aspect (non-perfect, non-progressive) compared with those of the perfect and progressive aspects (based on Biber et al. 1999: 461–462; the portions represent percentages of all verb phrases). Chart segments: simp, perf, prog, perf+prog.
11. Douglas Biber, personal communication: “I have used this example in literally dozens of talks, and I consistently get the same result. The most dramatic case was probably a plenary that I gave at AAAL several years ago. The estimated attendance was c. 800, but only c. 20 raised their hand to vote for simple aspect as more frequent”.
12. To be fair, Thomson and Martinet allow for variants on these three patterns, for example where other modals than will and would occur. More recent pedagogical accounts of grammar tend to include the zero type. For further corpus evidence and discussion, see Gabrielatos (2003, 2007).
Figure 3. Likelihood of Manner preceding Place, Place preceding Time, and Manner preceding Time (based on Biber et al. 1999: 811; frequencies per 10,000 words). Panels: Manner & Place (M-->P, P-->M); Place & Time (P-->T, T-->P); Manner & Time (M-->T, T-->M). Key: --> means “before”.
However, corpora show that more frequent than each of these three is the unmodalized conditional, often called the ‘zero type’, typically with the present simple tense in both clauses: If you do it in twenty days, you’re wonderful. [Example from the LCSAE]
For the millions of learners who have sweated over the second and third conditionals, it might be a comfort (or alternatively, a vexation) to know that these are fairly rare in comparison with the type just illustrated. Yet a further example is the ‘MPT rule’, repeated in many books and materials, decreeing that the order of adverbials at the end of a clause is ‘Manner followed by Place and Place followed by Time’. In practice, this turns out to be a probabilistic rule, and not a very good one at that. The charts in Figure 3 show the likelihood that these three classes will occur in the order stated.13 As the anecdote of the progressive just mentioned suggests, ‘authoritative’ figures in language teaching, whether teachers, materials writers or just native speakers, are very poor at guessing relative frequencies of grammatical classes and structures. If the time wasted teaching rather uncommon structures and weak rules is to be avoided, the ‘more frequent = more important to learn’ principle should be applied to grammar. This is where corpus evidence again becomes crucial.
13. The data for Figure 3 comes from Biber et al. (1999: 811), Figures 10.14–16.
3.6 Phraseology and the interaction of lexis and grammar
In the interaction of lexis and grammar, frequency helps to unlock predictable patterns of meaning. This is definitely an area of corpus-based investigation whose hour has come. Recently, various frameworks have been put forward extending the collocational analysis paradigm to apply to frequency of co-occurrence of both lexical and grammatical choices:
a. Pattern grammar: described as “a corpus-driven approach to the lexical grammar of English” (Hunston & Francis 2000)
b. Collostructions: the statistical measurement of the degree of attraction or repulsion between words and constructions (Stefanowitsch & Gries 2003)
c. Word sketches: use of the Sketch Engine software to derive a summary of a word’s collocational behaviour in terms of grammatical slots (Kilgarriff & Tugwell 2002)
d. Concgrams: use of ConcGram software to generate word-collocations of variable position and distance, such that (for example) play a role, play an important role, a key role to play can all be listed as belonging to the same concordance output (Cheng et al. 2006)
Here I will not go into the technical characteristics distinguishing these approaches from one another. The important point, as I see it, is that they all explore statistically the until-recently-neglected interface between lexis and grammar. Lexis, in its pure Hallidayan and Sinclairian form, focuses on patterns of word co-occurrence while excluding generalizations on the level of grammatical structure.14 On the other hand, many approaches to grammar have neglected the level of lexical patterning. Surely the most valuable way to synthesise the relations between lexis and grammar within a single lexico-grammatical framework is to use corpus linguistic techniques such as those in (a)–(d) above. I will illustrate this with just two examples, the first of word sketches and the second of collostructions. Table 2 displays a word sketch of the noun bank, showing its co-occurrence connections in terms of frequency and salience (a strength-of-association measure), with verbs in the Subject-of relation, with verbs in the Object-of relation and with adjectives or nouns as modifiers of bank.
Table 2. Part of a word sketch (after Kilgarriff & Tugwell 2002: 131) of the noun bank

subject-of    num    sal    object-of    num    sal    modifier         num    sal
lend           95   21.2    burst         27   16.4    central          755   25.5
issue          60   11.8    rob           31   15.3    Swiss             87   18.7
charge         29    9.5    overflow       7   10.2    commercial       231   18.6
operate        45    8.9    line          13    8.4    grassy            42   18.5
step           15    7.7    privatize      6    7.9    royal            336   18.2
deposit        10    7.6    defraud        5    6.6    far               93   15.6
borrow         12    7.6    climb         12    5.9    steep             50   14.4
eavesdrop       4    7.5    break         32    5.5    issuing           23   14.0
finance        13    7.2    oblige         7    5.2    confirming        13   13.8
underwrite      6    7.2    sue            6    4.7    correspondent     15   11.9
account        19    7.1    instruct       6    4.5    state-owned       18   11.1
wish           26    7.1    owe            9    4.3    eligible          16   11.1

num = number of tokens; sal = salience (roughly: strength of association)

14. See Halliday (1961: 273–277), Halliday (1966) and Sinclair (1966).

The automated analysis of grammatical structure, as shown by the Sketch Engine, has reached a stage where errors are rather few, and results can be regarded as substantially reliable. On the other hand, a semantic element of analysis is still lacking, as we can see from the juxtaposition at the top of the Object-of list of burst (where the bank is obviously a river bank) and rob (where the bank is obviously a financial institution).
The second and rather similar technique derives from Stefanowitsch & Gries’s (2003) statistical concept of collostructional analysis interrelating (as its name suggests) collocational analysis and construction grammar. It can be illustrated from the analysis of the construction [Verb NP as X] by Gries et al. (2005: 649) – see Table 3.

Table 3. A partial collostructional listing (from Gries et al. 2005: 649) of verbs most strongly attracted to the construction [Verb NP as X]

verb            tokens   strength    verb            tokens   strength
regard              80    166.476    recognise/ize       12     12.159
describe            88    134.870    categorise/ize       6     11.525
see                111     78.790    perceive             6      8.304
know                79     42.796    hail                 3      6.316
treat               21     28.224    appoint              5      6.073
define              18     23.843    interpret            5      5.920
use                 42     21.425    class                3      5.379
view                12     17.861    denounce             3      5.158
map                  8     12.796    dismiss              4      5.079

verb = verb in construction; tokens = number of tokens; strength = collostruction strength

The interesting debate here lies in two different measures, item frequency and strength-of-association (collostructional strength), which can produce different results. In Table 3, the verbs see and describe are more frequent in this construction than regard. But regard is a more ‘typical’ verb to use with the [Verb NP as X] construction, because a larger proportion of its tokens occur with this construction as compared with others. It is more securely attracted (or ‘bonded’) to this construction than to others. Describe and (especially) see are more general-purpose verbs that do not have this special relationship with the construction. As another illustration, Stefanowitsch & Gries (2003: 231) determine the collostructional strength of verbs with the progressive. The most strongly bonded verbs, in order, are talk, go, try, look, work, sit and wait. This order is obviously not that of the frequency of the verbs themselves, which is (as it happens) as follows: go, look, work, try, talk, sit and wait. The debate is to determine whether learners acquire the construction better
with common verbs or with bonded verbs: arguably a matter for SLA specialists, rather than corpus linguists. But surely both measures are potentially useful to the learner.
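The contrast between raw frequency in a construction and strength of attraction to it can be sketched in Python. The sketch assumes that attraction is measured by a one-tailed Fisher exact test on a 2×2 table and reported as the negative base-10 logarithm of the p-value, which is one common choice in collostructional analysis; the function names and every count below are invented for illustration and are not the figures behind Table 3.

```python
from math import comb, log10

def fisher_right_tail(a, b, c, d):
    """P(X >= a) under the hypergeometric distribution for the table [[a, b], [c, d]]."""
    row1, col1, n = a + b, a + c, a + b + c + d
    upper = min(row1, col1)
    total = comb(n, col1)
    hits = sum(comb(row1, k) * comb(n - row1, col1 - k) for k in range(a, upper + 1))
    return hits / total

def attraction(verb_in_cx, verb_total, cx_total, corpus_total):
    """Negative log10 of the one-tailed p-value (higher = more strongly attracted)."""
    a = verb_in_cx                 # verb inside the construction
    b = verb_total - a             # verb elsewhere
    c = cx_total - a               # other verbs inside the construction
    d = corpus_total - a - b - c   # other verbs elsewhere
    return -log10(fisher_right_tail(a, b, c, d))

# Hypothetical counts: 10,000 verb tokens, 200 tokens of the construction.
specialised = attraction(40, 100, 200, 10_000)    # rarer verb, 40 of its 100 uses
general = attraction(60, 2_000, 200, 10_000)      # common verb, 60 of its 2,000 uses
print(specialised > general)
```

With these invented counts the general-purpose verb is the more frequent one inside the construction (60 tokens against 40), yet the specialised verb receives the higher attraction score, because a far larger proportion of its tokens occur there.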
4. New directions in applied linguistics favourable to frequency
In this section, striking an optimistic, forward-looking note, I take account of present directions of research favouring the importance of frequency. After this, I turn less optimistically in Section 5 to the problems of determining frequency relevant to language learning. Twenty years ago, there was very little support for the idea that frequency phenomena contribute to our understanding of language and language learning. Now, I believe, there has been something of a transformation which brings frequency increasingly into the limelight. I will say something about:
a. theoretical positions favouring frequency (Section 4.1)
b. frequency effects in language change (Section 4.2)
c. frequency effects in language acquisition, including both L1 and L2 learning (Section 4.3)
4.1 Theoretical positions favouring frequency
Three theoretical positions which have been gaining momentum since the 1990s all implicitly or explicitly give frequency a role in the workings of language: usage-based linguistics, cognitive linguistics, and construction grammar. These three differently labelled approaches are so closely linked that they could be called different facets of the same theoretical paradigm.
Usage-based linguistics (based on observation and analysis of language in use – see Barlow & Kemmer 2000) reacts strongly against Chomsky’s position that linguistics is concerned with competence (a mental phenomenon) rather than with performance (the use of language in utterances and texts) – or, to use a later terminology, with (internal) I-language rather than with (external) E-language. During the heyday of the generative paradigm, as we have seen, performance-based theorizing was inevitably eclipsed, although the usage-oriented paradigm of Halliday’s systemic functional grammar, for example, maintained a following (largely outside the USA). More recently, usage-based approaches have made a significant comeback, especially in the western part of the USA.
Cognitive grammar/linguistics has also gained momentum in the western states of the USA since the 1970s, and is perhaps found in its most influential form in the cognitive grammar of Langacker (1987). Although this is not the place to expound the theoretical foundations of the cognitive linguistics enterprise, among its important tenets is that the way we use and process language is integral to the nature of language
as a cognitive phenomenon. In this sense cognitive linguistics is usage-oriented. The notion of entrenchment is key to Langacker’s cognitive grammar: repeated exposure to a linguistic item makes the difference between an item that is strongly and centrally established as part of language cognition (entrenched), and one that is weakly established and peripheral. Entrenchment is central to processes of language acquisition, and it is dependent on frequency: the more frequently a linguistic item has been encountered and used, the more entrenched in the language user’s competence it is likely to be (see Langacker 1987: 100; Gries 2006).
Construction grammar (Fillmore et al. 1988; Goldberg 1995) is a framework for describing and accounting for language structure in terms of constructions, rather than ‘words and rules’. A construction is a symbolic unit that combines both form and meaning, and may be linguistically complex. It is commonly postulated that constructions are learned and stored as wholes, and that they are learned from the bottom up, on the basis of actual language use. A construction can be an idiomatic combination of words, like garbage in, garbage out; it can also be semi-idiomatic, like the let alone construction, or an abstract pattern such as the double-object construction. Hence constructions accord with the phraseologists’ view of a grammar-lexicon continuum, for which Goldberg has coined the term constructicon. Once again, frequency plays a role, in that frequency of occurrence in the learning process is seen as a necessary precondition for construction status.
These three approaches are indeed so closely linked that some might object to their being distinguished from one another. For example, ‘cognitive linguistics’ could be regarded as a cover term that includes construction grammar, and has the usage-based approach as one of its chief tenets.
4.2 Frequency effects in language change
In diachronic linguistics, frequency has come to the fore above all in the theory of grammaticalization (Hopper & Traugott 2003), which focuses on the way lexical material becomes (over time) converted into grammatical material as a prime force in language change. Many studies (e.g. Hooper 1976; Bybee & Hopper 2001; Bybee 2007) show the relevance of frequency, both as an input and as an output to the grammaticalization process. For example, frequent expressions are susceptible to phonetic reduction (e.g. don’t know --> dunno; kind of --> kinda), a trigger of grammaticalization. Also, after the criterial changes of grammaticalization have taken place, the increase in frequency can continue for centuries – witness the rise in frequency of the English progressive, a continuous development from before Early Modern English up to the present day. Recent short-term diachronic studies using the Brown family of corpora show significant trends in change of grammatical frequency partly motivated by grammaticalization as well as other processes, such as colloquialization. Leech et al. (2009: 142–143) find that frequency changes like the increasing use of the progressive cannot be attributed to expansion of the progressive to particular verb classes or other
categorical, structural or semantic trends. Rather, there seems to be a general increase of frequency across the board. It seems a fairly natural assumption that one result of a strengthened cognitive representation of a linguistic form is that it gets used more often by individuals, and more generally by the language community. Thus, from this perspective, input frequency and output frequency are both concomitants of grammatical change:
greater input frequency → greater entrenchment → greater output frequency
4.3 Frequency effects in language acquisition
The sequence represented graphically above is primarily, of course, to be applied to language development in the individual, and only secondarily to a whole language community of users. Tomasello (2003), more than anyone else, has demonstrated the case for a usage-based theoretical position on first language acquisition, rejecting Chomsky’s view of universal grammar as a genetic basis for language acquisition, and instead arguing for the view that language acquisition takes place through implicit learning (using cognitively generic learning strategies) of patterns of form and meaning encountered in the child’s language input. Further, Ellis (2002a, 2002b) has presented persuasively the evidence of frequency effects in language processing generally, and more particularly in SLA. He finds that explicit and implicit learning and memory are complementary, implicit learning being driven by frequency of exposure. These two learning processes are seen as coming from very different neurological sources, the implicit capability deriving from the hippocampal system, and explicit learning from the neo-cortical system. Frequency of activation leads to the (implicit) learning of prototype categories. However, our knowledge of frequency is unconscious, and research has shown that even experts in language and language teaching have a poor record of guessing the frequency of linguistic items such as verbs (cf. Alderson 2007). These findings explain why (in the anecdote mentioned earlier) language teachers are unable to recognize that the present simple is many times more frequent than the present progressive. Ellis’s frequency effects link SLA with the idiom principle, the phraseological perspective on learning, construction grammar, and data-driven learning (Johns 1994). They indicate how learning is adaptive to an unfolding history of inputs, how change is incremental and cumulative, and how prior activation facilitates subsequent activation. 
As we learn to process high-frequency phenomena such as multi-word expressions and collostructions faster, we become more adapted to identifying them as units and processing them holistically. Priority in learning goes to formulae, then to higher structures (both subsumed under the constructions of construction grammar). Ellis’s line of research ties psycholinguistic research in language processing and SLA closely to learner corpus research. Researchers in SLA and in learner corpora, which seemed
to be on separate tracks a few years ago, are at last coming together (see Granger et al. 2002) and frequency appears to be a key link between them. We can now begin to see how the principle of ‘more frequent = more important to learn’ fits in with advances in learning theory and SLA. Institutional L2 teaching often has to implement adaptive learning within the confines of a curriculum where opportunities for L2 input and L2 output are severely limited in time. An important goal, in this case, is to present the learners with materials and productive tasks that extend their range of competence by moving them as far as possible from frequent towards less frequent. The implicit learning which is dominant in L1 acquisition can, of course, be complemented in L2 acquisition by explicit learning, which, through the conscious ‘noticing’ of language phenomena (see Schmidt 1990, 1995), can improve the learner’s control of the language.
5. Challenges and possible solutions
The preceding section leads to the conclusion that frequency is an important consideration in language learning, and, since corpora are the only practicable means of supplying frequency information, this is where corpus linguistics should be able to make a key contribution. However, we should not paint too rosy a picture of this marriage between corpus linguistics and SLA: there are difficulties in determining the relevance of frequency, and in supplying the corpus-derived information needed.
5.1 Challenge I: Bringing together corpus linguistic and cognitive linguistic approaches
We have seen that corpus linguistics and cognitive linguistics are becoming strongly linked through the usage-based paradigm. But there are some signs that the ‘more frequent = more important to learn’ principle is not always supported by cognitive linguistics. Gries (2006) and Gilquin (2006b) present two examples where what is prototypical (and therefore more salient and central from a cognitive perspective) does not correspond to what is most frequent. Gries’s analysis of the verb run from both the cognitive and the corpus angle suggests that there is a discrepancy between the most likely prototype sense of run (motion) and the most frequent sense (fast pedestrian movement). Similarly, Gilquin’s analysis of causative verb constructions leads her to the conclusion that the prototypical case of causation is not the most frequent. Although determination of what is the prototype is far from clear-cut, these results appear to contradict the implication, for example, from Ellis’s work, that the most frequent category is the most entrenched and therefore the most cognitively salient. Perhaps one way of resolving this conundrum is to recognize that the establishment of a prototype category in the adult competence may have taken place at a relatively earlier stage
of language acquisition, when (for example) the ‘fast pedestrian movement’ sense of run would in fact be the sense in commonest use. Hence the most prototypical usage would not necessarily be the one found most frequently in an adult corpus. However, there is much more work to be done on this.
5.2 Challenge II: Corpora do not always match learners’ needs
There are many different kinds of corpora, but none of them seem to be exactly the kind of corpus that will give frequency information relevant to learners. For English, for example, the following varieties of corpora have been, or can be, used to provide the empirical basis for ELT materials:
a. General purpose reference or monitor corpora (e.g. the BNC, the Bank of English)
b. Corpora of English for Specific Purposes (ESP) and English for Academic Purposes (EAP) (e.g. Corpus of Professional English, CSPAE, MICASE, BASE Corpus)15
c. Corpora of EFL (English as a Foreign Language) learner language (e.g. ICLE, LINDSEI)16
d. Corpora containing the language of native speaker (NS) children (e.g. CHILDES)17
e. Corpora of teenager and young adult NSs (e.g. LOCNESS, COLT)18
f. Corpora of English as a Lingua Franca (e.g. VOICE, ELFA)19
15. Corpus of Spoken, Professional American-English; Michigan Corpus of Academic Spoken English; corpus of British Academic Spoken English.
16. International Corpus of Learner English; Louvain International Database of Spoken English Interlanguage.
17. Child Language Data Exchange System.
18. Louvain Corpus of Native English Essays; Bergen Corpus of London Teenage Language.
19. Vienna-Oxford International Corpus of English; English as a Lingua Franca in Academic Settings.
This list is far from complete and new corpora are making their appearance month by month. In fact, there are so many corpora of potential use for English language education that it may seem perverse to suggest that they are not enough. To some extent, though, it is a matter of debate what kind of corpus best suits the needs of a learner. The general principle, I suggest, is that such a corpus should represent as far as possible the target linguistic communicative behaviour to which learning is directed. Despite the usefulness of the above types of corpora for various purposes, there are reasons why they are not optimal for particular groups of language learners. General purpose corpora (a), containing both written and spoken material, although they yield frequency data useful for adult learners, are less useful for younger adults such
as the average undergraduate student, and because of their ‘adult’ style and content, might be considered quite unsuitable for primary or secondary school learners. The same applies to ESP and EAP corpora (b) such as MICASE: these are well tailored to the academic needs of students or those training for a professional career, but not for more general groups. Corpora of learner language (c) such as ICLE and LINDSEI do, of course, provide vital frequency data for comparison of learners’ language to that of NSs, as well as comparison of the interlanguage of learners of different mother tongue backgrounds. Even here, however, it remains somewhat problematic whether the target linguistic behaviour with which the language of such student learners should be compared is that of NSs of the target language of their own age group, or the specialist adults we typically find as authors in written corpora of NSs, or indeed some other target communities such as non-native speakers (NNS) using English as a Lingua Franca (ELF), whose language use is recorded in a corpus such as VOICE. For learners of primary school and high school ages, there is as yet a dearth of NS children’s/teenagers’ language of primary or secondary school age (d-e), although CHILDES contains a wide variety of spoken data of earlier age groups. Corpora of ELF (f), e.g. VOICE (Seidlhofer 2004) or ELFA (Mauranen 2006), are new contenders on the scene, and raise the whole question of whether NSs’ language should any longer be regarded as the standard to aim at, as it has been unquestioningly considered in the past. In all these kinds of corpora (except for the largest reference and monitor corpora) there is an issue, also, about the size of available corpora and their representativeness in terms of different registers and activity types. 
A corpus intended to represent frequency data of target language behaviour should ideally be large and wide-ranging enough to yield reliable frequencies not only of words but of collocations: something that requires large corpora. For the normal EFL educational curriculum, the ideal corpus should be longitudinal, representing competent target language use appropriate to the age cohort of the learners. An early example of such a corpus (for NS learners, however) was the 5-million-word text collection used for the AHI Frequency Book (Carroll et al. 1971), which consisted of reading text materials used in US schools from the third grade to twelfth grade. Textbooks, readers, and other learning materials have been used for research both in Germany and in Japan, but the emphasis of the research (e.g. Mindt 1996; Römer 2005) has been to show how far the language to which students are exposed in school is divergent from that of NS corpora. Recent research on corpora of textbooks is reported in Meunier & Gouverneur (2009), who also give an account of their TeMa (Textbook Material) corpus consisting of general-purpose best-selling international ELT textbooks. So here is another issue: how appropriate is the teaching-induced language on which students are led, through their curriculum, to model themselves? It seems that, for various reasons, we are far from an ideal situation in which the frequency information applied to learner input comes from a corpus tailor-made to meet the learner’s needs.
6. Conclusion: With words of comfort
In spite of the negative points raised in the preceding section, it should be emphasized in conclusion that frequency information remains a highly valuable resource for input to language learning materials and testing, and that it is increasingly available. To insist on precise frequency counts is often to aim at too high an ideal, for, as Halliday put it long ago (1971: 344), “a rough indication of frequency is often just what is needed”. The afore-mentioned case of teachers who believed the present progressive to be more frequent than the present simple illustrates just how wildly wrong people’s intuitions of linguistic frequency can be: virtually any corpus representing NS productions, spoken or written, would correct this erroneous belief.
A further point (referring back to distinctions I made in Section 1) is that in general, corpora differ much more in terms of raw frequency or normalized frequency than in terms of ordinal frequency (the placing of items in an order of frequency). Fortunately, raw or normalized frequency counts are rarely needed: ordinal frequency (allowing certain items to be prioritized above others) is usually all that matters for language learning and teaching purposes.
The greatest need, I believe, is for the development of longitudinal corpora of both NSs and NNS learners. However, without waiting for the Holy Grail of the ideally tailored corpus for a given learner group, much could be achieved by building a database of frequency data from a range of different corpora and subcorpora, to enable ELT professionals to compare frequencies in different styles, registers, age groups, etc. For a given target learner group, corpora could be given weightings relative to their relevance to the group, resulting in optimal frequencies approximating to the learners’ needs. In this way the best available value could be put on Halliday’s call for approximate frequency.
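The weighting proposal just outlined can be sketched in Python. Everything in the sketch is hypothetical: the corpus names, sizes, counts and relevance weights are invented, and a weighted mean of per-million frequencies is only one simple way of combining sources. The code normalizes raw counts, combines them according to the weights, and returns the resulting ordinal ranking.

```python
def per_million(raw_count, corpus_size):
    """Normalized frequency: occurrences per million words."""
    return raw_count * 1_000_000 / corpus_size

def weighted_ranking(corpora, weights):
    """corpora: name -> (size in words, {item: raw count}); weights: name -> relevance."""
    total_weight = sum(weights[name] for name in corpora)
    combined = {}
    for name, (size, counts) in corpora.items():
        share = weights[name] / total_weight
        for item, raw in counts.items():
            combined[item] = combined.get(item, 0.0) + share * per_million(raw, size)
    # Ordinal frequency: items from most to least frequent (ties broken alphabetically).
    ranking = sorted(combined, key=lambda item: (-combined[item], item))
    return ranking, combined

# Invented figures for two imaginary corpora and a young learner group.
corpora = {
    "general_reference": (2_000_000, {"word_a": 18_000, "word_b": 900, "word_c": 1_400}),
    "teen_speech": (500_000, {"word_a": 5_000, "word_b": 400, "word_c": 200}),
}
weights = {"general_reference": 1.0, "teen_speech": 3.0}
ranking, combined = weighted_ranking(corpora, weights)
print(ranking)
```

In the imaginary general corpus taken alone, word_c outranks word_b; once the corpus judged more relevant to the learner group is weighted more heavily, the order of the two is reversed. Only the ranking is needed for prioritizing items, in line with the point about ordinal frequency made above.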
One final point: the emphasis on frequency in this chapter should not mislead any reader into thinking that ‘all we need to do is to count things’. In the selecting, devising and grading of learning materials, not only frequency, but other values, such as learner interest and motivation, learner difficulty, etc. need to be factored in. But, to correct what I believe to have been the neglect of frequency in thinking up to now, I suggest that from now on, there is no reason why any choices regarding learner input, learner performance and learner evaluation should not be frequency-informed.
References Adolphs, S. 2008. Corpus and Context: Investigating Pragmatic Functions in Spoken Discourse [Studies in Corpus Linguistics 30]. Amsterdam: John Benjamins. Alderson, J.C. 2007. Judging the frequency of English words. Applied Linguistics 28(3): 383–409. Altenberg, B. 1998. On the phraseology of spoken English: The evidence of recurrent word combinations. In Phraseology: Theory, Analysis and Applications, A.P. Cowie (ed.), 101–122. Oxford: Clarendon Press.
Geoffrey Leech Altenberg, B. & Tapper, M. 1998. The use of adverbial connectors in advanced Swedish learners’ written English. In Learner English on Computer, S. Granger (ed.), 80–93. London: AddisonWesley Longman. Barlow, M. & Kemmer, S. 2000. Usage-based Models of Language. Stanford CA: CSLI. Biber, D., Johansson, S., Leech, G., Conrad, S. & Finegan, E. 1999. Longman Grammar of Spoken and Written English. London: Longman. Bybee, J. 2007. Frequency of Use and the Organization of Language. Oxford: OUP. Bybee, J. & Hopper, P. (eds). 2001. Frequency and the Emergence of Linguistic Structure [Typological Studies in Language 45]. Amsterdam: John Benjamins. Carroll, J.B. 1971. Statistical analysis of the corpus. In The American Heritage Frequency Book, J.B. Carroll, P. Davies & B. Richman (eds), xxi-xl. Boston MA: Houghton Mifflin. Carroll, J.B., Davies, P. & Richman, B. 1971. The American Heritage Frequency Book. Boston MA: Houghton Mifflin. Cheng, W., Greaves, C. & Warren, M. 2006. From n-gram to skipgram to concgram. International Journal of Corpus Linguistics 11(4): 411–433. Chomsky, N. 1957. Syntactic Structures. The Hague: Mouton. Chomsky, N. 1964 [1962]. A transformational approach to syntax. In Proceedings of the Third Texas Conference on Problems of Linguistics Analysis, A.A. Hill (ed.), 124–158. Austin TX: University of Texas. (Reprinted in J.A. Fodor & J.J. Katz. 1964. The Structure of Language, 211–241. Englewood Cliffs NJ: Prentice-Hall.) Chomsky, N. 1965. Aspects of the Theory of Syntax. Cambridge MA: The MIT Press. Coxhead, A. 2000. A new Academic Word List. TESOL Quarterly 34(2): 213–238. Davies, M. & Gardner, D. 2010. A Frequency Dictionary of Contemporary American English. London: Routledge. De Cock, S. 1998. A recurrent word combination approach to the study of formulae in the speech of native and non-native speakers of English. International Journal of Corpus Linguistics 3: 59–80. De Cock, S. & Granger, S. 2004. 
Computer learner corpora and monolingual learners’ dictionaries: The perfect match. In The Corpus Approach to Lexicography, W. Teubert & M. Mahlberg (eds), Special issue of Lexicographica 20: 72–86. Durrant, P. 2009. Investigating the viability of a collocation list for students of English for Academic Purposes. English for Specific Purposes 28(3): 157–169. Eeg-Olofsson, M. & Altenberg, B. 1996. Recurrent word combinations in the London-Lund Corpus: Coverage and use for word-class tagging. In Studies in Synchronic Corpus Linguistics, C.E. Percy, C.F. Meyer & I. Lancashire (eds), 97–107. Amsterdam: Rodopi. Ehrman, M.E. 1966. The Meanings of the Modals in Present-Day American English. The Hague: Mouton. Ellis, N.C. 2002a. Frequency effects in language processing: A review with implications for theories of implicit and explicit language acquisition. Studies in Second Language Acquisition 24(2): 249–260. Ellis, N.C. 2002b. Reflections on frequency effects in language processing. Studies in Second Language Acquisition 24(2): 297–339. Ellis, N.C. & Simpson-Vlach, R. 2009. Formulaic language in native speakers: Triangulating psycholinguistics, corpus linguistics, and education. Corpus Linguistics and Linguistic Theory 5: 61–78. Ellis, R. 1994. The Study of Second Language Acquisition. Oxford: OUP. Ellis, R. 2008. The Study of Second Language Acquisition, 2nd edn. Oxford: OUP.
Fillmore, C.J., Kay, P. & O’Connor, M.K. 1988. Regularity and idiomaticity in grammatical constructions: The case of let alone. Language 64: 501–538.
Firth, J.R. 1957. Modes of meaning. In Papers in Linguistics 1934–51, 190–215. Oxford: OUP.
Francis, W.N. & Kučera, H. 1982. Frequency Analysis of English Usage: Lexicon and Grammar. Boston MA: Houghton Mifflin.
Gabrielatos, C. 2003. Conditional sentences: ELT typology and corpus evidence. Paper given at the Annual Meeting of the British Association of Applied Linguistics, University of Leeds, 4–6 September 2003.
Gabrielatos, C. 2007. If-conditionals as modal colligations: A corpus-based investigation. In Proceedings of the Corpus Linguistics Conference: Corpus Linguistics 2007, M. Davies, P. Rayson, S. Hunston & P. Danielsson (eds). Birmingham: University of Birmingham.
Gilquin, G. 2002. Automatic retrieval of syntactic structures: The quest for the Holy Grail. International Journal of Corpus Linguistics 7(2): 183–214.
Gilquin, G. 2006a. Highly polysemous words in Foreign Language Teaching: How to give learners a flying start. In Proceedings of the 7th Conference on Teaching and Language Corpora, Université Paris 7 – Denis Diderot, 1–4 July 2006, 58–60.
Gilquin, G. 2006b. The place of prototypicality in corpus linguistics. Causation in the hot seat. In Corpora in Cognitive Linguistics: Corpus-based Approaches to Syntax and Lexis, S.T. Gries & A. Stefanowitsch (eds), 159–191. Berlin: Mouton de Gruyter.
Gilquin, G. & Paquot, M. 2008. Too chatty: Learner academic writing and register variation. English Text Construction 1(1): 41–61.
Goldberg, A. 1995. Constructions: A Construction Grammar Approach to Argument Structure. Chicago IL: University of Chicago Press.
Granger, S. 1997. On identifying the syntactic and discourse features of participle clauses in academic English: Native and non-native writers compared. In Studies in English Language and Teaching, J. Aarts, I. de Mönnink & H. Wekker (eds), 185–198. Amsterdam: Rodopi.
Granger, S. (ed.). 1998. Learner English on Computer. London: Addison-Wesley Longman.
Granger, S., Hung, J. & Petch-Tyson, S. (eds). 2002. Computer Learner Corpora, Second Language Acquisition and Foreign Language Teaching [Language Learning & Language Teaching 6]. Amsterdam: John Benjamins.
Granger, S. & Meunier, F. (eds). 2008. Phraseology: An Interdisciplinary Perspective. Amsterdam: John Benjamins.
Greene, B.B. & Rubin, G.M. 1971. Automatic Grammatical Tagging of English. Providence RI: Department of Linguistics, Brown University.
Gries, S.T. 2006. Corpus-based methods and cognitive semantics: The many senses of to run. In Corpora in Cognitive Linguistics: Corpus-based Approaches to Syntax and Lexis, S.T. Gries & A. Stefanowitsch (eds), 57–99. Berlin: Mouton de Gruyter.
Gries, S.T., Hampe, B. & Schönefeld, D. 2005. Converging evidence: Bringing together experimental and corpus data on the association of verbs and constructions. Cognitive Linguistics 16(4): 635–676.
Halliday, M.A.K. 1961. Categories of the theory of grammar. Word 17(3): 241–292.
Halliday, M.A.K. 1966. Lexis as a linguistic level. In In Memory of J.R. Firth, C.E. Bazell, J.C. Catford, M.A.K. Halliday & R.H. Robins (eds), 148–162. London: Longman.
Halliday, M.A.K. 1971. Linguistic functions and literary style. In Style: A Symposium, S. Chatman (ed.), 330–365. Oxford: OUP.
Hasselgren, A. 1994. Lexical teddy bears and advanced learners: A study into the ways Norwegian students cope with English vocabulary. International Journal of Applied Linguistics 4(2): 237–258.
Hoffmann, S., Evert, S., Smith, N., Lee, D. & Berglund Prytz, Y. 2008. Corpus Linguistics with BNCweb – A Practical Guide. Frankfurt: Peter Lang.
Hofland, K. & Johansson, S. 1982. Word Frequencies in British and American English. Bergen: Norwegian Computing Centre for the Humanities.
Hooper, J. 1976. Word frequency in lexical diffusion and the source of morphophonological change. In Current Progress in Historical Linguistics, W. Christie (ed.), 96–105. Amsterdam: North Holland.
Hopper, P.J. & Traugott, E.C. 2003 [1993]. Grammaticalization. Cambridge: CUP.
Hunston, S. & Francis, G. 2000. Pattern Grammar: A Corpus-driven Approach to the Lexical Grammar of English [Studies in Corpus Linguistics 4]. Amsterdam: John Benjamins.
Jelinek, F. 1998. Statistical Methods for Speech Recognition. Cambridge MA: The MIT Press.
Johansson, S. & Hofland, K. 1989. Frequency Analysis of English Vocabulary and Grammar: Based on the LOB Corpus, 2 Vols. Oxford: Clarendon Press.
Johns, T. 1994. From printout to handout: Grammar and vocabulary teaching in the context of data-driven learning. In Perspectives on Pedagogical Grammar, T. Odlin (ed.), 293–317. Cambridge: CUP.
Kennedy, G. 1998. An Introduction to Corpus Linguistics. London: Addison-Wesley Longman.
Kilgarriff, A. & Tugwell, D. 2002. Sketching words. In Lexicography and Natural Language Processing: A Festschrift in Honour of B.T.S. Atkins, M-H. Corréard (ed.), 125–137. EURALEX.
Krishnamurthy, R. (ed.). 2005. English Collocation Studies: The OSTI Report, by J. Sinclair, S. Jones & R. Daley. London: Continuum.
Kučera, H. & Francis, W.N. 1967. Computational Analysis of Present-day American English. Providence RI: Brown University Press.
Langacker, R.W. 1987. Foundations of Cognitive Grammar, Vol. I: Theoretical Prerequisites. Stanford CA: Stanford University Press.
Leech, G., Hundt, M., Mair, C. & Smith, N. 2009. Change in Contemporary English: A Grammatical Study. Cambridge: CUP.
Leech, G., Rayson, P. & Wilson, A. 2001. Word Frequencies in Written and Spoken English: Based on the British National Corpus. Harlow: Longman.
Lennon, P. 1996. Getting ‘easy’ verbs wrong at the advanced level. International Review of Applied Linguistics 34(1): 23–36.
Longman Dictionary of Contemporary English, 3rd edn, Dir. D. Summers. 1995. London: Longman.
Marshall, I. 1983. Choice of grammatical word-class without global syntactic analysis. Computers and the Humanities 17: 139–150.
Marshall, I. 1987. Tag selection using probabilistic methods. In The Computational Analysis of English: A Corpus-based Approach, R. Garside, G. Leech & G. Sampson (eds), 42–56. London: Longman.
Mauranen, A. 2006. A rich domain of ELF – the ELFA Corpus of Academic Discourse. Nordic Journal of English Studies 5(2): 145–159.
McEnery, T. & Wilson, A. 2001. Corpus Linguistics, 2nd edn. Edinburgh: EUP.
Meunier, F. & Gouverneur, C. 2009. New types of corpora for new educational challenges: Collecting, annotating and exploiting a corpus of textbook material. In Corpora and Language Teaching [Studies in Corpus Linguistics 33], K. Aijmer (ed.), 179–201. Amsterdam: John Benjamins.
Mindt, D. 1996. English corpus linguistics and the foreign-language teaching syllabus. In Using Computer Corpora for Language Research: Studies in Honour of Geoffrey Leech, J. Thomas & M. Short (eds), 232–247. London: Longman.
Moon, R. 1998. Fixed Expressions and Idioms in English: A Corpus-based Approach. Oxford: OUP.
Nesselhauf, N. 2003. The use of collocations by advanced learners of English and some implications for teaching. Applied Linguistics 24(2): 223–242.
Nesselhauf, N. 2005. Collocations in a Learner Corpus [Studies in Corpus Linguistics 14]. Amsterdam: John Benjamins.
Römer, U. 2005. Progressives, Patterns, Pedagogy. A Corpus-driven Approach to English Progressive Forms, Functions, Contexts and Didactics [Studies in Corpus Linguistics 18]. Amsterdam: John Benjamins.
Römer, U. & Schulze, R. (eds). 2009. Exploiting the Lexis-Grammar Interface [Studies in Corpus Linguistics 35]. Amsterdam: John Benjamins.
Sampson, G. 2007. Grammar without grammaticality. Corpus Linguistics and Linguistic Theory 3(1): 1–32, 111–129.
Schmidt, R.W. 1990. The role of consciousness in second language learning. Applied Linguistics 11: 129–158.
Schmidt, R.W. 1995. Consciousness and foreign language teaching: A tutorial on the role of attention and awareness in learning. In Attention and Awareness in Foreign Language Learning and Teaching, R.W. Schmidt (ed.), 1–63. Honolulu HI: University of Honolulu.
Seidlhofer, B. 2004. Research perspectives on teaching English as a lingua franca. Annual Review of Applied Linguistics 24: 209–239.
Sinclair, J. 1966. Beginning the study of lexis. In In Memory of J.R. Firth, C.E. Bazell, J.C. Catford, M.A.K. Halliday & R.H. Robins (eds), 410–430. London: Longman.
Sinclair, J. 1991. Corpus, Concordance, Collocation. Oxford: OUP.
Sinclair, J., Jones, S. & Daley, R. 1970. English Lexical Studies: Report to OSTI on Project C/LP/08. Ms, University of Birmingham 1970. Reprinted in Krishnamurthy (ed.) 2005.
Stefanowitsch, A. & Gries, S.T. 2003. Collostructions: Investigating the interaction of words and constructions. International Journal of Corpus Linguistics 8(2): 209–243.
Teubert, W. 2004. Units of meaning, parallel corpora, and their implications for language teaching. In Applied Linguistics: A Multidimensional Perspective, U. Connor & T.A. Upton (eds), 171–189. Amsterdam: Rodopi.
Thomson, A.J. & Martinet, A.V. 1980 [1960]. A Practical English Grammar, 3rd edn. Oxford: OUP.
Thorndike, E.L. 1921. Teacher’s Word Book. New York NY: Columbia Teachers College.
Thorndike, E.L. 1932. A Teacher’s Word Book of 20,000 words. New York NY: Columbia Teachers College.
Thorndike, E.L. & Lorge, I. 1944. The Teacher’s Word Book of 30,000 Words. New York NY: Columbia Teachers College.
Tomasello, M. 2003. Constructing a Language: A Usage-based Approach to Child Language. Cambridge MA: Harvard University Press.
West, M. 1953. A General Service List of English Words. London: Longman.
Zipf, G.K. 1935. The Psychobiology of Language. Boston MA: Houghton Mifflin.
Zipf, G.K. 1949. Human Behavior and the Principle of Least Effort. Reading MA: Addison-Wesley.
Learner corpora and contrastive interlanguage analysis

Hilde Hasselgård and Stig Johansson¹

This paper gives a glimpse of pre-corpus interlanguage studies, focusing on some Scandinavian research projects, before moving on to the development of computerized learner corpora and computer-aided interlanguage analysis with special reference to the International Corpus of Learner English (ICLE) project. Contrastive interlanguage analysis (CIA) is defined and discussed, followed by a presentation of the so-called Integrated Contrastive Model (ICM). The two models of analysis are illustrated by means of three case studies: two using CIA to study the use of quite and I would say across four learner groups in ICLE, and one using the ICM to analyse seem in the interlanguage of Norwegian learners. Towards the end, some challenges for interlanguage research are discussed.
1. Introduction

Learning a foreign language is a slow and, for most people, difficult process which rarely leads to full mastery. Even advanced language learners make mistakes and normally have a limited repertoire compared with native speakers of the target language. Problems may be linked to features of the target language, the learner’s first language or to the learning process itself. Revealing features of learner language, or interlanguage, has become an important means of surveying both obvious and more subtle differences between interlanguage and native speaker performance, and can potentially lead to improved language teaching as well as insights into the processes of language learning.
1. Stig Johansson sadly passed away before the article was finalized, but contributed substantially to the first submission of it and read and commented on a near-final version. The authors thank Bengt Altenberg for insightful comments on an early and a near-final version of this paper.
2. Interlanguage studies before computer corpora

In the 1940s and 1950s, linguists interested in language teaching emphasized the role of contrastive analysis, on the assumption that “in the comparison between native and foreign language lies the key to ease or difficulty in foreign language teaching” (Lado 1957: 1). The aim of the comparison was to identify both easy and difficult features of the language to be learnt. Lado considered that first language transfer in the foreign language might either help the learners or cause them to produce grammatical and lexical structures that deviate from the target norm (e.g. Lado 1957: 58).

Observations of deviant features of learner language have probably always been made by language teachers, but it was not until about 50 years ago that they were subjected to systematic analysis. The 1960s and the early 1970s were the heyday of error analysis. Error analysis could be based on elicitation data and/or (pre-electronic) corpus data.² Unlike contrastive analysis, error analysis is not restricted to interlingual transfer (Hammarberg 1973: 29). However, although Nickel (1973: 24) saw the “growing interest in error analysis [...] in connection with the efforts undertaken [...] to objectify measuring and grading of achievement in language testing”, it quickly became apparent that it was not sufficient to focus on errors, as pointed out in Hammarberg’s (1973) paper entitled “The insufficiency of error analysis”. At the same time, Enkvist (1973) put the question “Should we count errors or measure success?” Other perspectives on learner language were suggested: Levenston (1971) drew attention to overindulgence and under-representation in learner language, i.e. features which may not be overtly wrong but differ in e.g. style and register from the language of native speakers.³

One of the most important figures in the development of learner language research, Pit Corder, also pointed out that there are both overt and covert errors: overt errors produce linguistically unacceptable sentences, while “covertly erroneous sentences are those which are not appropriate in the context in which they occur” (Corder 1973: 272–3). More important, he stressed the significance of errors in providing a window into the learner’s mind; i.e. the study of a learner’s errors enables the researcher to “infer the nature of his knowledge at that point in his learning career and discover what he still has to learn” (ibid.: 257). Thus, the aim of the study is not only to map the errors, but also to represent the learner’s level of proficiency. Svartvik (1973: 8) goes a step further in suggesting that the term ‘error analysis’ should be replaced by the more appropriate ‘performance analysis’: “Although the study of errors is a natural starting-point, the final analysis should include linguistic performance as a whole, not just deviation”.⁴

2. Note that the term ‘corpus’ is used in this section to denote a “collection of naturally occurring examples of language [...] which has been collected for linguistic study” (Hunston 2002: 2). Nowadays, however, the term tends to imply that the corpus is “stored and accessed electronically” (ibid.).
3. Levenston relates over-indulgence and under-representation to contrastive analysis; learners are found (or predicted, in the case of learner groups other than Levenston’s own Hebrew-speaking students) to overindulge in “structures which closely resemble translation-equivalents in the mother tongue, or L1, to the exclusion of other structures (‘under-representation’) which are less like anything in L1” (1971: 115).

To illustrate features of these early studies of learner errors and learner performance, we will present a few investigations, chosen from the work of researchers in Scandinavia. The first two investigations were initiated within the context of the Swedish-English Contrastive Studies project directed by Jan Svartvik (see Svartvik 1973). While Thagg Fisher (1985) focuses on a grammatical problem, Linnarud (1986) is concerned with lexis. Finally we will outline a larger-scale Danish project that aimed at a comprehensive description of language learning as well as learner language.

Thagg Fisher (1985) is a study of Swedish learners’ concord problems in English. Concord errors produced by Swedish learners, as found in three situations (essays, translations, and recorded speech), were excerpted and analysed. This material was supplemented by elicitation tests given to learners and native speakers. The outcome was a detailed account of the frequency of different types of concord errors, a comparison of the three situations, and an analysis of the major causes of concord difficulty. A hierarchy of concord error gravity was established, taking into account the behaviour of native speakers and their reactions to the learners’ errors. Besides pointing out difficult areas for Swedish learners, Thagg Fisher discovered conflicts between grammar/textbook norms and actual language use. An important finding was that concord ‘errors’ are not a matter of either/or, since there are ‘vague’ areas where the norms for concord depend on contextual factors such as medium and style (1985: 177ff.). There is thus a scale of error gravity implying varying degrees of irritation and negative evaluation by native speakers (see also Johansson 1978).
Some errors were classified as ‘nativelike’, reflecting areas where native speaker usage may differ from the prescriptive norm, and others as ‘non-nativelike’ (Thagg Fisher 1985: 191), reflecting problems that are characteristic of learners and that are generally evaluated more negatively. Teaching should thus emphasize the latter type and de-emphasize the former. Pedagogical applications of the study also include improved descriptions of concord in English teaching materials.

Linnarud’s (1986) investigation is a performance analysis of lexis in general, not just errors. The material consisted of English compositions written by Swedish 17-year-old learners and a comparable group of native speakers of English. A number of quantitative measures were used, the most important of which were lexical individuality (lexical words unique to the writer), lexical sophistication (the number of less frequent words), lexical variation (type-token ratio), and lexical density (the proportion of lexical words in relation to the total number of words). The compositions were assessed by both Swedish L1 (i.e. first language) and English L1 evaluators. Not surprisingly, the native speaker group wrote longer texts and made fewer mistakes. There was a large difference in lexical individuality between the learner group and the native speakers, and a strong positive correlation with evaluations; lexical creativity was appreciated by all evaluators, but slightly more by the native speakers of English. Just as the native speakers used more unique words, they also used more rare words; there was thus a great difference between the two groups in lexical sophistication, but without a corresponding correlation with evaluations. Lexical variation was greater for the native speakers, but this measure turned out to be unsatisfactory as it was not adjusted for the length of the compositions. The native speaker essays also had a slightly higher lexical density, but no correlation was found with evaluations. Commenting on the findings, Linnarud stresses the importance of lexis in composition and makes a number of pedagogical recommendations for teaching vocabulary and grading compositions. In both areas the importance of communication and context is emphasized. Writing a composition is not primarily an exercise in using correct language, but a means of expressing ideas and communicating a message where lexical choice plays a crucial role (Linnarud 1986: 120).

4. The term which eventually became established was ‘interlanguage studies’, connected with Selinker’s (1972) term ‘interlanguage’.

Although the studies by Thagg Fisher and Linnarud are very different in most respects, they are alike in the comparison of learner language with the language of native speakers. Both used text material combined with elicitation, and both were also very much concerned with pedagogical applications of their research.

A much more comprehensive project was going on in Denmark around the same time, the Project In Foreign language pedagogy (PIF), one outcome of which was the book Learner Language and Language Learning (Færch et al. 1984). Many aspects of language learning and the study of language learning are discussed in the book, drawing on the corpus compiled for the project.
This was an extensive collection of samples of the written and spoken English (including video-recordings) of more than a hundred Danes, ranging from the near-beginner (after one year of instruction) to the near-native (higher education students) stage. The cross-sectional data allowed ‘pseudo-longitudinal’ studies of language learning (Færch et al. 1984: 297). Note this comment in the description of the corpus:

With the one exception that the 12 learners at the lowest level did not provide written texts, each of these texts was elicited from all our informants. So as to hold as many factors constant as possible, learners with different ages, experience and personalities were given the same tasks. Most of these tasks were familiar from school, e.g. reading aloud and writing an essay, whereas the video-taped conversation was novel and represented an attempt to place the learner in a real communicative situation. (Færch et al. 1984: 295f.)
At that time, the PIF learner corpus was unique both in size and range and, most importantly, in the systematic way in which the corpus was developed. In a working paper Færch (1979) reported that the corpus of written learner language amounted to about 100,000 words and the corpus of spoken learner language to about 250,000 words, and he presented plans for computerization of the material.⁵ Here we are very close to the stage of computer-aided analysis of learner language.
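Quantitative measures of the kind used by Linnarud (1986) become straightforward to compute once learner texts are available in electronic form. The following Python sketch is an illustration only, not a reconstruction of her procedure: the tokenizer, the very small FUNCTION_WORDS list and the function names are assumptions introduced here, and lexical sophistication is left out because it would require an external word-frequency list.

```python
import re

# Hypothetical, deliberately tiny function-word list. A real study would use
# a full list or a part-of-speech tagger to separate lexical words from
# grammatical words.
FUNCTION_WORDS = {
    "a", "an", "the", "and", "but", "or", "of", "in", "on", "at", "to",
    "is", "are", "was", "were", "be", "it", "i", "you", "he", "she", "we",
    "they", "this", "that", "not", "with", "for", "my", "his", "her",
}


def tokenize(text):
    """Lowercase the text and return its alphabetic word tokens."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def lexical_variation(tokens):
    """Type-token ratio: distinct word forms divided by running words."""
    return len(set(tokens)) / len(tokens)


def lexical_density(tokens):
    """Proportion of lexical (non-function) words among all running words."""
    lexical = [t for t in tokens if t not in FUNCTION_WORDS]
    return len(lexical) / len(tokens)


def lexical_individuality(tokens, other_texts):
    """Lexical word types used in this text and in none of the other texts."""
    used_by_others = set().union(*other_texts)
    own_lexical_types = {t for t in tokens if t not in FUNCTION_WORDS}
    return own_lexical_types - used_by_others


# Two invented mini-"essays", for demonstration only.
essay_a = tokenize("The dog chased the cat and the cat chased the mouse.")
essay_b = tokenize("A dog barked at the postman.")

print(round(lexical_variation(essay_a), 2))
print(round(lexical_density(essay_a), 2))
print(sorted(lexical_individuality(essay_a, [essay_b])))
```

Because the type-token ratio tends to fall as a text grows longer, comparing compositions of unequal length with lexical_variation as written would call for some adjustment, for example computing it over a fixed number of running words from each text.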
3. Learner computer corpora

A significant step in interlanguage studies was the development of computerized learner corpora and computer-aided interlanguage analysis. Whereas earlier work was generally limited in scale and range, it now became possible to increase the size and variety of the material; and whereas the material used earlier rarely went beyond the individual researcher, the new electronic corpora could be developed as research tools to be used more generally by scholars in the field. The new technology and the research methods developed in corpus linguistics in general allowed new kinds of studies to be performed, for example with easier access and greater attention to frequency of occurrence and patterns of language use. Interest in learner corpora increased rapidly,⁶ to a great extent inspired by the work of Sylviane Granger and her team at the Centre for English Corpus Linguistics, Université catholique de Louvain, which we will focus on below.⁷

In 1990 Sylviane Granger initiated a highly successful project to collect an International Corpus of Learner English (ICLE), which inspired similar work in many other countries. The background was her interest in interlanguage studies and also a wish to extend English corpus research beyond native and second-language varieties of English, which were the focus of the International Corpus of English (ICE), initiated by Sidney Greenbaum (1991). Both ICE and ICLE should in turn be seen against the background of the development of ‘families’ of corpora within English corpus linguistics, i.e. corpora that are compiled according to the same design criteria and therefore lend themselves to comparative studies.⁸

Apart from the computerization of the material and the development of computational analysis tools, the main innovative aspect of ICLE is the systematic approach to corpus design and the compilation of comparable sub-corpora produced by learners with a wide range of different mother-tongue backgrounds (e.g. Granger 1994, 1996).⁹ These make it possible to examine the extent to which learner language is mother-tongue specific or reflects general language learning processes.

5. The death of Claus Færch in 1987 hampered the further development of the PIF Project.
6. A detailed survey of learner corpora can be found in Pravec (2002). See also www.uclouvain.be/en-cecl-lcWorld.html.
7. See the website of the Centre for English Corpus Linguistics: www.uclouvain.be/en-cecl.html.
8. The best-known of these ‘families’ is probably the ‘Brown family’, including the Brown Corpus, the LOB Corpus and their younger siblings FROWN and FLOB; see http://icame.uib.no/newcd.htm.
4. Contrastive interlanguage analysis

A special feature of the ICLE project is that a framework for learner corpus research has been developed alongside the corpus. This is Contrastive Interlanguage Analysis (CIA), said to lie “at the core of the ICLE project” (Granger 1996: 43). Unlike contrastive analysis, which involves the linguistic comparison of (normally) two languages, CIA concerns varieties of the same language. It “involves quantitative and qualitative comparisons between native language and learner language (L1 vs. L2) and between different varieties of interlanguage (L2 vs. L2)” (Granger 2009: 18; see also Granger 1996).

The former type of comparison thus presupposes a comparable corpus of native speaker (NS) data, whose role is to serve as a yardstick for measuring the extent to which L2 English differs from L1 English. As pointed out by Barlow (2005: 345), “a variety of issues arise” when “a learner corpus is to be contrasted with an NS corpus”, for example concerning regional variety and text type. In addition, the level of proficiency of the native speakers should be considered to avoid inadvertent comparisons between novice and professional writers (Granger 2002: 12). The solution to these issues within the ICLE project was the compilation of the Louvain Corpus of Native English Essays (LOCNESS), consisting of essays written by British and American students. ICLE and LOCNESS are relatively closely matched for text type (mostly argumentative writing) as well as writer age and experience. However, there is less information available on contributors in LOCNESS than in ICLE (age, sex, writing conditions, etc.). Furthermore, the LOCNESS texts are more heterogeneous as to essay topics as well as contributors (both university students and A-level pupils). This has caused many researchers to use only a sample of it, for instance by excluding A-level essays, or by using only US or only UK texts.
Still, LOCNESS remains the best available comparable corpus to match ICLE and continues to be widely used.

The extent to which an NS reference corpus is adequate for CIA is intimately connected with the aim of the comparison; cf. the discussions by Ädel (2006: 206) and Gilquin et al. (2007: 326 f.). From the point of view of descriptive linguistics, it is a clear advantage that the corpora can be closely matched on the most relevant variables, such as the age and level of expertise of the writers. From an English Language Teaching (ELT) perspective, however, a student corpus such as LOCNESS may be considered unsuitable as a reference corpus because it does not represent the desired target norm for proficiency or the type of language one would like to teach (cf. Leech 1998: xix f.). Thus, if the aim is to identify areas of argumentative or academic writing in which learners need to improve, an NS corpus consisting, for example, of press editorials or academic articles may be preferable.

9. The first edition of ICLE, released on CD-ROM in 2002, contained about 2.5 million words of English, chiefly argumentative essays written by university students representing 11 different mother-tongue backgrounds. In the second edition, ICLEv2, released in 2009, the number of sub-corpora has increased to 16, and the material has been enriched with analysis tools (see Granger et al. 2009).

Comparing data from a learner corpus and an NS corpus enables the researcher to identify overuse, underuse and misuse in the English of the learners. As Granger has repeatedly emphasized (e.g. 1998a: 18), the terms over- and underuse are intended as neutral, quantitative measures of linguistic differences, not as qualitative judgements on interlanguage performance. Importantly, the study of overuse and underuse marks a widening of the scope of traditional error analysis as these phenomena, which are difficult to identify reliably other than by computational methods, often do not constitute errors. Instead, they reflect areas in which learner language differs from NS language in frequency of distribution rather than in correctness. For example, the expression kind of occurs 49 times per 100,000 words in the Norwegian sub-corpus of ICLE (ICLE-NO)¹⁰ and 12.3 times in LOCNESS. The Norwegian learners thus use the expression about four times as often as the native-speaker students, a clear case of overuse. The question of whether or not they use it correctly, however, requires a qualitative investigation.

Contrastive Interlanguage Analysis also includes the comparison of different non-native-speaker (NNS) varieties. With ICLE, such comparisons are greatly facilitated by the common design of the sub-corpora, with control of a range of relevant variables (see Granger et al. 2009: 3ff.).
For example, a comparison of the Norwegians’ use of kind of with that of their Swedish neighbours reveals that the Swedes overuse the expression almost as much as the Norwegians, with 44.8 occurrences per 100,000 words. French learners overuse it even more, with 73.1 occurrences. In fact, kind of is universally overused across the sub-corpora of the second edition of the International Corpus of Learner English (ICLEv2) (Granger et al. 2009), ranging from 29.1 (Tswana) to 138.5 occurrences per 100,000 words (Mandarin), which may be linked to the fact that the expression represents a way of making up for insufficiently nuanced vocabulary. Other lexicogrammatical items may be underused by some learner groups and overused by others. For example, French learners are known to overuse indeed in contrast to some other learner groups (Granger 2004: 135), such as Norwegians, who underuse it at 11.2 occurrences per 100,000 words vs. 17.9 in LOCNESS.

10. The ICLE sub-corpora are referred to here by means of their tags in ICLE with the last two letters showing the L1 background of the learners (Norwegian, Swedish, German, French, Spanish, Hong Kong Chinese).

The potential and usefulness of CIA have been demonstrated in a wide range of studies, as evidenced by e.g. Granger (1998c) and Granger et al. (2002). It should be noted that CIA is by no means restricted to the ICLE corpus or to English; the methodology has been adopted by other researchers using interlanguage corpora of, for instance, German, Italian and Norwegian.¹¹ Nor is it restricted to written language. Spoken learner language is being explored by means of, for example, the Louvain International Database of Spoken English Interlanguage (LINDSEI),¹² compiled as a spoken counterpart of ICLE (Brand & Kämmerer 2006: 130) and comprising different L1 backgrounds and an NS reference corpus (ibid.: 134). Because the compilation of spoken corpora is costly in terms of time as well as money, the sub-corpora of LINDSEI are rather small (about 100,000 words). At present the completed sub-corpora represent 11 L1 backgrounds (Bulgarian, Chinese, Dutch, French, German, Greek, Italian, Japanese, Polish, Spanish and Swedish), but more teams are joining. Until very recently (2010), the corpus was not publicly available outside the national project teams, and not all the sub-corpora have so far been used much in research. Hence the remainder of this chapter will continue to focus on the analysis of written corpora.
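The figures cited in this section are normalized frequencies (occurrences per 100,000 words), and the comparison behind the labels ‘overuse’ and ‘underuse’ amounts to setting two such rates against each other. The following Python sketch illustrates the arithmetic under stated assumptions: the function names are introduced here for illustration, the raw count and corpus size in the normalization step are invented and are not taken from ICLE or LOCNESS, and only the rates for kind of and indeed are those quoted above.

```python
def per_100k(raw_count, corpus_size):
    """Normalize a raw count to occurrences per 100,000 running words."""
    return raw_count * 100_000 / corpus_size


def usage_ratio(learner_rate, reference_rate):
    """Learner rate divided by native-speaker reference rate."""
    return learner_rate / reference_rate


def describe(learner_rate, reference_rate):
    """Label the difference in the neutral, purely quantitative CIA sense."""
    ratio = usage_ratio(learner_rate, reference_rate)
    if ratio > 1:
        return "overuse"
    if ratio < 1:
        return "underuse"
    return "no difference"


# Normalization step with invented numbers (NOT real corpus figures):
# 98 hits in a hypothetical 200,000-word sub-corpus.
print(per_100k(98, 200_000))

# Rates per 100,000 words as quoted in the text: (ICLE-NO, LOCNESS).
quoted_rates = {
    "kind of": (49.0, 12.3),
    "indeed": (11.2, 17.9),
}

for item, (learner_rate, reference_rate) in quoted_rates.items():
    label = describe(learner_rate, reference_rate)
    ratio = round(usage_ratio(learner_rate, reference_rate), 2)
    print(item, label, ratio)
```

A ratio of this kind is purely descriptive. Whether a difference is statistically reliable, and whether the learners use the item appropriately, are separate questions that the ratio does not answer.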
5. Some significant findings of CIA

The availability of similar corpora with a common design as well as a common research model for investigating them has led to a number of important insights into advanced learner English. In this section we will present what we consider to be significant findings within the lexis, grammar and discourse of advanced learners of English (see also Hunston 2002: 206 ff.). Most of them come from studies of one or more non-native varieties compared to an NS corpus, usually LOCNESS.

NNS vocabulary is found to be generally less varied than that of native speakers. According to Ringbom (1998), learners rely greatly on a relatively small vocabulary containing many words with a general meaning, such as people and thing. Similarly, Hasselgren (1994) observes that learners tend to overuse frequent words belonging to the core vocabulary at the expense of more precise synonyms, i.e. they cling to their ‘lexical teddy bears’, which is Hasselgren’s term for “the words they feel safe with” (ibid.: 237). Furthermore, learners tend to use a slightly greater number of recurrent word combinations than native speakers do (De Cock et al. 1998: 72 f.), and the frequently recurring word combinations are not always the same in L1 as in L2 English (cf. Wiktorsson [2003], who found that the prefabs used by Swedish learners were more informal than those of native speakers).

11. For information on the FALKO corpus of learner German, the VALICO corpus of learner Italian and the ASK corpus of learner Norwegian, as well as other learner corpora, see www.uclouvain.be/en-cecl-lcWorld.html.
12. See www.uclouvain.be/en-cecl-lindsei.html.

Another common finding is that the written English of advanced learners is to a great extent influenced by informal spoken language. This shows up clearly in the use of features of interactiveness, such as first- and second-person pronouns and other signs of writer/reader visibility (Petch-Tyson 1998) and the high frequency of various modal expressions (Aijmer 2002) and questions (Virtanen 1998). In their study of connector use, Altenberg & Tapper (1998) found that Swedish learners tend to overuse informal connectors (such as sentence-initial and and but) at the expense of more formal connectors. Eia (2006) found the same tendency among Norwegian learners. Gilquin & Paquot (2008: 50) likewise found an overuse of sentence-initial and and but as well as other spoken-like features in learner writing. In addition to the influence of spoken English they suggest that this may be explained by L1 transfer (in the case of different style levels of otherwise equivalent expressions in the L1 and the L2), teaching-induced factors, and developmental factors (ibid.: 52 ff.).

Interestingly, Ädel (2008) shows that the use of interactional features seems to depend on factors such as task setting and intertextuality; untimed essays written by students who used topical texts as a starting point for their discussion displayed far fewer interactional features than those in ICLE-SW, although written by Swedish students at the same stage of their studies. The claim that non-native written English borrows features from the spoken language should thus be treated with some caution. It should also be remembered that although learners may import features of spoken English into their writing, there are huge differences between real conversation and ICLE essays (see Gilquin & Paquot 2008). When learner writing is compared to spoken data, one finds that the ‘spoken’ features are relatively modestly represented in the NNS essays after all. A small indication of this is given in Table 1, in which the first four rows reproduce Petch-Tyson’s (1998: 112) figures for first- and second-person pronouns in some sub-corpora of ICLE.
The Swedish learners come across as most interactive in their writing; however, the pronouns are twice as frequent in the spoken dialogues in the British National Corpus (BNC).

Table 1. Use of first- and second-person reference across a number of corpora (based on Petch-Tyson [1998: 112] with added figures for Hong-Kong Chinese and the BNC), per 50,000 words

Corpus                                Per 50,000 words
Dutch L1                              1,195
Finnish L1                            1,531
French L1                             1,202
Swedish L1                            1,998
HK Chinese L1                           449
BNC spoken dialogue                   3,973
BNC written (press editorials)          834
US English (LOCNESS)                    449
Hilde Hasselgård and Stig Johansson
Petch-Tyson’s (1998) study of writer/reader visibility was carried out at a time when the ICLE corpus contained only Western L1 backgrounds. Interestingly, a corresponding investigation of the more recent Hong-Kong sub-corpus of ICLE (ICLE-HK) indicates that first- and second-person pronouns are not overused by Hong Kong learners (fifth row of Table 1). The difference between ICLE-HK and the other ICLE sub-corpora is likely to have cultural explanations. Returning to the issue of reference corpus, however, it is also noteworthy that the press editorials in the BNC have nearly twice as many first- and second-person references as the US section of LOCNESS (see Table 1), thus potentially reducing the degree of overuse by the European learners of English and suggesting underuse in the US and HK Chinese groups. The question of authorial presence has also been investigated by Hyland (2002), who compares the use of first-person reference in student reports to that of published journal articles within the same disciplines. Hyland finds that the student reports contain four times fewer references to first person than the journal articles do; i.e. the student reports have 10.1 references per 10,000 words and the published articles have 41.2 (ibid.: 1099). The findings are explained by reference to the students’ lack of authority in the field with a concurrent reluctance to assert themselves. This is backed up by the students’ own comments in interviews (ibid.: 1097). By comparison, ICLE-HK (cf. Table 1) contains about 90 first- and second-person pronouns per 10,000 words, of which the majority (81/10,000 words) are first-person. In other words, first-person reference is eight times more frequent in argumentative essays than in the reports examined by Hyland (2002), thus suggesting that the use of interactive features may vary with text type. 
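The comparison above mixes two normalisation bases (per 50,000 words in Table 1, per 10,000 words in Hyland 2002). The following minimal sketch shows the rescaling and the two ratios quoted in the text. The function name `rescale` and the variable names are illustrative assumptions; only the figures come from the text.

```python
def rescale(freq: float, from_base: int, to_base: int) -> float:
    """Convert a frequency normalised per `from_base` words to one per `to_base` words."""
    return freq * to_base / from_base


# Figures quoted in the text (variable names are illustrative).
icle_hk_per_50k = 449          # first- and second-person pronouns in ICLE-HK, Table 1
icle_hk_first_person = 81.0    # first-person pronouns in ICLE-HK, per 10,000 words
hyland_reports = 10.1          # first-person reference in student reports, per 10,000 words
hyland_articles = 41.2         # first-person reference in journal articles, per 10,000 words

icle_hk_per_10k = rescale(icle_hk_per_50k, 50_000, 10_000)
articles_vs_reports = hyland_articles / hyland_reports
essays_vs_reports = icle_hk_first_person / hyland_reports

print(f"ICLE-HK, all pronouns: {icle_hk_per_10k:.1f} per 10,000 words")
print(f"Journal articles vs. student reports: {articles_vs_reports:.1f} times")
print(f"ICLE-HK first person vs. student reports: {essays_vs_reports:.1f} times")
```

The three values come out at roughly 90, 4 and 8, in line with the figures given in the paragraph above.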
A number of studies find that learners transfer syntactic patterns as well as discourse patterns from their L1 to their written English. For example, Osborne (2008) revealed strong L1 influence as regards the learners’ placement of adverbs; contrasts between language families could be clearly seen in the patterns found in the learner corpora. More specifically, the sequence V-Adv-O was overused by Romance L1 learners, underused by Germanic L1 learners and used with a frequency similar to that of the NS control corpus by a group consisting of Slavic and Finnish L1 learners (Osborne 2008: 134). Nesselhauf (2005: 242) found that L1 influence occurred in about half of the non-nativelike collocations identified in the German ICLE sub-corpus (ICLE-GE), which suggests that phraseological patterns are transferred in a similar manner to syntactic patterns. The transfer of L1 syntactic patterns into NNS English need not constitute errors, but may lead to an overuse of the pattern in question, possibly with unintended discourse effects. Boström Aronsson (2003), for example, found that Swedish learners overuse cleft constructions, which could to some extent be explained by analogy with Swedish style. The clefts are generally not ungrammatical, but according to Boström Aronsson (2003: 209), they may entail “unmotivated focus and emphasis, and implications of contrastiveness when there is none”. Extraposition was also found to be twice as frequent in ICLE-SW as in NS writing. As the construction often has an evaluative
function, its overuse is interpreted as “a tendency for NNS to foreground their opinions and evaluative comments” (Herriman & Boström Aronsson 2009: 109). Hasselgård (2009a) found equal overuse of extraposition in ICLE-NO. A later study (Hasselgård 2009b) showed that fronted time and space adverbials were overused in ICLE-NO compared to LOCNESS. The frequencies were, however, similar to those found in a collection of Norwegian NS argumentative essays. The fronted time and space adverbials in ICLE-NO furthermore had discourse functions more typical of Norwegian than of English (particularly as text organizers). A feature of learner language that is attributable to either learner strategies or lack of proficiency and/or register awareness (Altenberg 1997: 130) is the use of metadiscourse. Ädel (2006: 189) found that Swedish advanced learners used metadiscourse twice as often as American students, who in turn used it more often than British students. The overuse among learners concerned above all “personal metadiscourse” (ibid.: 190), i.e. items that refer directly to the writer and/or reader of the text. The functions of personal metadiscourse items are “to introduce the topic and to repeat (or review) some preceding discourse unit” (ibid.: 94), i.e. to negotiate the text as discourse between writer and reader. Such items may also be involved in definitions of terms and concepts (ibid.). The quantitative differences between the writer groups may be due to different writing conventions in the three cultures (ibid.: 154), but may also reflect the learners’ consciousness that they are writing in a foreign language. As mentioned above, most of the CIA studies carried out so far involve the comparison of native English to only one or two non-native varieties. 
Studies that involve a wider range of non-native varieties often make interesting observations, such as the scale of writer/reader visibility revealed by Petch-Tyson (1998: 112), the typological differences found by Osborne (2008), and the differences in the use of academic vocabulary observed by Paquot (2010). It is to be hoped that the greater cultural and linguistic variation in L1 background represented in the latest version of ICLE, along with the improved facilities for searching and analysing the corpus, will inspire more such studies.
6. From CIA to the integrated contrastive model

Like the analysis of interlanguage, contrastive analysis has profited greatly by the development of corpus research methods. The English-Norwegian Parallel Corpus (ENPC) was the first electronic bidirectional translation corpus of its kind (see Johansson 2007: 10 ff.). The model combines the idea of a translation corpus with that of a comparable corpus, i.e. one in which the original texts in both languages are matched for genre, publication date and size. This design allows the researcher to study translation correspondence in both directions of translation and to compare original and translated texts in the same language or original texts in different languages.
The method for contrastive analysis based on parallel corpora has lately been successfully paired with the CIA method; see Granger (1996) and Gilquin (2000/2001) on the Integrated Contrastive Model (ICM). This model offers a new dimension to interlanguage studies, enabling the researcher not only to differentiate general from L1-specific learner problems but also to explain and/or predict such problems on the basis of contrastive analyses of the L1 and the target language, in the spirit of the weak version of the contrastive analysis hypothesis (Wardhaugh 1970: 123). The link between learner corpus research and contrastive analysis is explored e.g. in Gilquin et al. (2008).

The Integrated Contrastive Model is visualized in Figure 1 (see note 13). Granger (1996: 46) points out that “the model involves constant to-ing and fro-ing between CA [Contrastive Analysis] and CIA. CA data helps analysts to formulate predictions about interlanguage which can be checked against CIA data”. This part of the procedure follows the arrow marked “predictive” in Figure 1. In the opposite direction, deviations between learner language and native language can be explained (or ‘diagnosed’) by recourse to the contrastive analysis. The arrows pointing out of the figure were added by Gilquin (2000/2001: 100 f.) to show that not all errors can be explained by a contrastive analysis (see also Corder 1973: 288). The other change in Gilquin’s version of Granger’s (1996) diagram is the use of broken lines between CA and CIA to indicate a weaker connection between the two.

[Figure 1 (diagram not reproduced). Labels recovered from the figure: CA (OL vs. OL; SL vs. TL); CIA (NL vs. IL; IL vs. IL); arrows labelled “predictive” and “diagnostic”; TRANSFER.]

Figure 1. The Integrated Contrastive Model (quoted from Gilquin 2000/2001: 100, based on Granger 1996: 47)
13. Key to the abbreviations found in Figure 1: CA = Contrastive Analysis; OL = Original Language; SL = Source Language; TL = Target Language; CIA = Contrastive Interlanguage Analysis; NL = Native Language; IL = InterLanguage.
The weak connection was also pointed out by Corder (1973: 229 ff.) who argued that differences between the native language and the foreign language need not produce learning difficulty. Differences between the native and the target language can also have unexpected effects on interlanguage, as demonstrated by Johansson & Stavestrand (1987). Since Norwegian does not have a grammaticalized progressive aspect, a natural assumption would be that Norwegian learners will have difficulties acquiring the progressive, and furthermore that they will underuse it. The investigation showed that the Norwegian learners indeed made a number of mistakes with the form. Curiously, most of the errors consisted in using the progressive where a simple form was required. Hence, the second prediction failed: the learners in fact overused the progressive. The overuse is believed to be caused by factors such as (intralingual) hypercorrection, overexposure in teaching and the simpler morphology of the progressive (i.e. only one form of the lexical verb needs to be mastered).
7. Case studies

As an additional demonstration of contrastive interlanguage analysis, we will present two small-scale case studies based on ICLEv2, namely the use of the single lexical item quite and the phraseological item I would say. Four L1 groups have been selected: Norwegian, German, French and Spanish, thus representing two Germanic and two Romance L1 backgrounds. Texts have been identified on the basis of the learners’ first language, irrespective of home country. LOCNESS has been used for comparison. A third study makes use of the Integrated Contrastive Model in an investigation of seem in ICLE-NO against the background of a contrastive study based on the ENPC.
7.1 Quite
Granger (1998a and b) has drawn attention to the overuse of the all-round intensifier very at the expense of collocationally restricted -ly intensifiers such as closely or highly. Do we find a similar tendency for quite? Table 2 shows that quite is overused in all the learner groups but most markedly so among the Germans, followed at a distance by the Norwegians (both at significance levels of p < 0.01).14 The overuse of quite in ICLE-GE ties in with the general overuse of adjective modification by German learners identified by Lorenz (1998: 57). In ICLE-FR and ICLE-SP, the overuse is less dramatic (significant at p < 0.05 for ICLE-FR, but less obviously so at p = 0.1 for ICLE-SP). The overall frequency distribution shown in Table 2 thus seems to reflect the Germanic – Romance distinction. The question of how the learners use this word, however, can only be answered by studying concordance lines.

14. The ICLE frequencies were found using the statistics function on the ICLEv2 CD, while LOCNESS was analysed using the corpus tool AntConc (www.antlab.sci.waseda.ac.jp/software.html). The frequencies from each ICLE sub-corpus and LOCNESS were compared using chi-square.

Table 2. Quite across corpora: Raw frequencies and relative frequencies per 100,000 words

Corpus     Occurrences   Rel. freq.
ICLE-NO     92           43.7
ICLE-GE    147           62.3
ICLE-FR     78           38.0
ICLE-SP     63           31.8
LOCNESS     67           20.5

The word quite can enter into a number of grammatical patterns, notably as: (i) modifier of adjective – quite safe; (ii) modifier of adverb – quite easily; (iii) modifier of predicate – never quite enter the big money fights; (iv) modifier of indefinite or quantified noun phrase – quite a remarkable feat, quite some time; (v) modifier of definite noun phrase/nominalized adjective – quite the opposite; (vi) modifier of prepositional phrase – quite by chance. Table 3 gives the relative frequencies of the different patterns across the corpora under study. Strikingly, the overuse of quite among German and Norwegian learners is visible across the patterns, while the French and Spanish learners differ from the native speakers mainly in the use of quite as a modifier of an adjective.

Figure 2 shows the proportional distribution of the patterns across the corpora. The adjective modifier function of quite is most common in all the learner groups as well as in the NS corpus. However, the groups differ as to the use of other patterns: Spanish learners use other patterns very little, while Norwegian and German learners use quite for indefinite NP modification significantly more often than native speakers (p < 0.05) and also for adverb modification more often than native speakers though not at significant levels. French learners use other patterns more than the Spanish learners, but less than Norwegian and German learners. The adverb-modifying quite takes up a larger proportion in NS than in NNS writing, but as Table 3 shows, this pattern is actually more frequent in the learner corpora, except ICLE-SP. All other types are too rare to show reliable tendencies, but we may note that the category of ‘other’ (which includes cases of misuse) does not occur in LOCNESS.
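As an illustration of the kind of chi-square comparison mentioned in note 14, the sketch below tests each ICLE sub-corpus against LOCNESS using the figures in Table 2. It rests on several assumptions that are not in the source: the corpus sizes are back-computed from the raw and relative frequencies, the test is a plain Pearson chi-square on a 2 x 2 table (occurrences of quite vs. all other words) without continuity correction, and the function names are illustrative. Since the exact procedure and corpus sizes behind the reported significance levels are not given, the values produced here need not match them.

```python
import math

# Raw occurrences and relative frequencies (per 100,000 words) from Table 2.
TABLE_2 = {
    "ICLE-NO": (92, 43.7),
    "ICLE-GE": (147, 62.3),
    "ICLE-FR": (78, 38.0),
    "ICLE-SP": (63, 31.8),
    "LOCNESS": (67, 20.5),
}


def estimated_size(occurrences: int, rel_freq: float, base: int = 100_000) -> int:
    """Back-compute an approximate corpus size in words (an assumption, not a reported figure)."""
    return round(occurrences * base / rel_freq)


def chi_square_2x2(hits_a: int, size_a: int, hits_b: int, size_b: int) -> tuple[float, float]:
    """Pearson chi-square (1 df, no continuity correction) for one word in two corpora."""
    a, b = hits_a, size_a - hits_a   # corpus A: target word, all other words
    c, d = hits_b, size_b - hits_b   # corpus B: target word, all other words
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    p = math.erfc(math.sqrt(chi2 / 2))  # upper tail of the chi-square distribution, 1 df
    return chi2, p


ref_hits, ref_rel = TABLE_2["LOCNESS"]
ref_size = estimated_size(ref_hits, ref_rel)

results = {}
for corpus, (hits, rel) in TABLE_2.items():
    if corpus == "LOCNESS":
        continue
    results[corpus] = chi_square_2x2(hits, estimated_size(hits, rel), ref_hits, ref_size)
    chi2, p = results[corpus]
    print(f"{corpus} vs. LOCNESS: chi-square = {chi2:.2f}, p = {p:.4g}")
```

A test of this kind treats every word token as an independent observation, which is a simplification for essay corpora in which a few writers may account for many of the hits.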
Table 3. Patterns of quite across corpora, relative frequencies per 100,000 words

Corpus     +adj   +adv   +pred   +indef NP   +PP   +def NP   other
ICLE-NO    24.7   4.3    1.9     10.5        1.0   1.0       0.5
ICLE-GE    38.5   6.8    2.1     12.3        0.4   2.1       0
ICLE-FR    25.4   3.9    1.0      5.9        0     0.5       1.5
ICLE-SP    25.2   2.5    1.0      0          0.5   1.0       1.5
LOCNESS    12.6   3.4    0.9      2.8        0.6   0.3       0
[Figure 2 (chart not reproduced): proportional distribution, on a scale from 0% to 100%, of the patterns +adj, +adv, +pred, +indef NP, +PP, +def NP and other in ICLE-NO, ICLE-GE, ICLE-FR, ICLE-SP and LOCNESS.]

Figure 2. Patterns of quite across corpora
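The relation between the relative frequencies in Table 3 and a proportional distribution of the kind shown in Figure 2 can be sketched as follows. This is a minimal illustration that assumes the proportions are simply each pattern's share of the corpus total. The source does not state how the figure was computed, and the names `TABLE_3`, `shares` and `proportions` are illustrative.

```python
PATTERNS = ["+adj", "+adv", "+pred", "+indef NP", "+PP", "+def NP", "other"]

# Relative frequencies per 100,000 words, copied from Table 3.
TABLE_3 = {
    "ICLE-NO": [24.7, 4.3, 1.9, 10.5, 1.0, 1.0, 0.5],
    "ICLE-GE": [38.5, 6.8, 2.1, 12.3, 0.4, 2.1, 0.0],
    "ICLE-FR": [25.4, 3.9, 1.0, 5.9, 0.0, 0.5, 1.5],
    "ICLE-SP": [25.2, 2.5, 1.0, 0.0, 0.5, 1.0, 1.5],
    "LOCNESS": [12.6, 3.4, 0.9, 2.8, 0.6, 0.3, 0.0],
}


def shares(freqs: list[float]) -> list[float]:
    """Turn per-pattern frequencies into percentages of the corpus total."""
    total = sum(freqs)
    return [100 * f / total for f in freqs]


proportions = {
    corpus: dict(zip(PATTERNS, shares(freqs))) for corpus, freqs in TABLE_3.items()
}

for corpus, dist in proportions.items():
    line = ", ".join(f"{pattern} {value:.0f}%" for pattern, value in dist.items())
    print(f"{corpus}: {line}")
```

Computed this way, the +adv share is largest in LOCNESS even though the +adv frequency per 100,000 words is higher in ICLE-NO, ICLE-GE and ICLE-FR, which is the contrast between proportion and frequency noted in the text.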
The Spanish learners have the smallest extent of overuse, but at the same time differ most from native speakers in their use of quite. German learners, on the other hand, have a proportional distribution of patterns that does not differ much from that of the NS group in spite of the overuse shown in Tables 2 and 3. As noted above, Norwegian and German learners often use quite as a modifier of noun phrases. Examples are given in (1) – (3). (1) ... which now suddenly requires an education with quite a lot of theory. (ICLE-NO) (2) ... reading my way through the book itself, which turned out to be quite an adventure given my poor standard of French. (ICLE-GE) (3) Stating that the time of dreaming and imagination is over is quite a sad statement. (ICLE-NO) Norwegian and German learners have a potential problem in placing the indefinite article between quite and a premodifying adjective, as in (3), since both Norwegian and German place the article before the equivalent of quite in a corresponding construction. The pattern seen in (2) and (3) must thus be a result of successful learning. The pattern ‘quite a(n) + adjective’ occurs 5 times in ICLE-NO; however ‘a quite + adjective’ is found 6 times. The corresponding figures for ICLE-GE are 9 vs. 8. Thus, both learner groups use the pattern of their L1 in about half the cases. Interestingly, a similar variation is found in LOCNESS. The pattern ‘a quite + adjective’, illustrated by (4), occurred twice while the other pattern occurred only once. However, in the BNC, the ‘quite a(n) + adjective’ pattern is clearly most frequent, with 27 instances per million words as against 5.6 for ‘a quite + adjective’.15
15. By comparison, the French learners had ‘quite a(n) + adjective’ 7 out of 11 times. The Spanish learners used quite with a premodified indefinite noun phrase only once, with the article preceding quite.
(4) One possible solution is a quite radical one. (LOCNESS) (5) Passengers whose life seems to revolve around annoying others – listening to not-quite-personal stereos, smoking in no smoking sections, ... (LOCNESS)
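The two article placements counted above (‘quite a(n) + adjective’ vs. ‘a quite + adjective’) can be located in raw text with a simple pattern search. The sketch below is only an illustration, not the procedure used in the study: the regular expressions and the function name `candidates` are assumptions, and because no part-of-speech information is used, the hits are candidates that still need manual filtering (quite a lot, for instance, is matched although lot is not an adjective). The sample sentences are examples (1), (3) and (4).

```python
import re

# Candidate patterns; X is whatever word follows, not necessarily an adjective.
QUITE_A_X = re.compile(r"\bquite\s+an?\s+(\w+)", re.IGNORECASE)
A_QUITE_X = re.compile(r"\ban?\s+quite\s+(\w+)", re.IGNORECASE)


def candidates(sentences: list[str]) -> dict[str, list[str]]:
    """Collect candidate hits for both article placements."""
    hits: dict[str, list[str]] = {"quite a(n) + X": [], "a(n) quite + X": []}
    for sentence in sentences:
        hits["quite a(n) + X"] += [m.group(0) for m in QUITE_A_X.finditer(sentence)]
        hits["a(n) quite + X"] += [m.group(0) for m in A_QUITE_X.finditer(sentence)]
    return hits


sample = [
    "... which now suddenly requires an education with quite a lot of theory.",             # (1)
    "Stating that the time of dreaming and imagination is over is quite a sad statement.",  # (3)
    "One possible solution is a quite radical one.",                                        # (4)
]

found = candidates(sample)
for pattern, matches in found.items():
    print(pattern, matches)
```

Running the same search over a part-of-speech-tagged version of the corpus would allow the adjective restriction to be applied automatically.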
Example (5) shows a creative use of quite. No similar uses were found in the NNS corpora. However, a close examination of the NNS concordances for quite also shows some cases of dissonance (Hasselgren’s [1994] term for non-nativelike usage):
(6) Even in the text there are quite allusions to Pamela. (ICLE-SP)
(7) This kind of allusion is quite used in abstracts or introductions. (ICLE-SP)
The dissonance can be due to grammatical error as in (6), where quite modifies a bare noun phrase. In (7) the predicate is not one that can be modified for degree. Both cases of dissonance can possibly be explained as equivalence errors between quite and Spanish bastante, which carries much the same meaning as quite, but unlike quite can be used as a modifier of a noun or a participle verb.16 Similarly, there are examples from ICLE-FR where the dissonant use of quite is due to an equivalence error; in (8) this probably concerns quite/assez as well as changing/changeant. In (9) the collocation quite many is one that is not found in the BNC, but which may reflect the French expression (d’)assez nombreux. (8) Whereas political borders can be quite changing, cultural ones are not. (ICLE-FR) (9) On a human level, I met quite many foreigners, but no Dutch people. (ICLE-FR) German and Norwegian learners do not seem to have much difficulty with quite, probably due to the semantic and syntactic similarity with the nearest L1 equivalents ganz and ganske. A typical example of dissonance in these two sub-corpora is given in (10), where the dissonance is caused by a confusion between a good deal and quite a lot. In (11) the problem with the adjective modification is the context, i.e. the use of a ‘compromiser’ (Lorenz 1998: 56) where understatement does not seem intentional. (10) ... but the figures clearly show that men on the average earn quite a deal more than women here in Norway. (ICLE-NO) (11) ... and we had to spent nearly two, quite exiting years in the monster’s dungeon. (ICLE-GE) This CIA study of quite yielded some interesting findings. First of all, the quantitative investigation showed overuse of quite in all four learner groups, though to different degrees. The overuse was most pronounced in ICLE-GE and least in ICLE-SP. However, a qualitative study showed that quite is not used in the same way in the five corpora examined. 
The Spanish learners use quite as a modifier of an adjective at the cost of all other constructions, while the Germans and the Norwegians overuse it as a modifier of indefinite noun phrases. Finally, dissonant uses were studied. Most uses of quite are correct in all the corpora. However, the most serious cases of dissonance were found among the Spanish and French learners, possibly because the greater similarity between quite and its closest equivalent in the Germanic languages led to fewer problems among the German and Norwegian learners. The qualitative analysis thus uncovered problems in those learner groups that were quantitatively closer to native speaker usage.

16. Thanks to Magali Paquot and Maximino Jesus Ruiz Rufino for identifying the Spanish source of transfer.
7.2 I would say
In recent years, a great deal of research has focused on recurrent sequences in language, largely inspired by John Sinclair’s insightful work on collocations and his insistence on the importance of the ‘idiom principle’ (Sinclair 1991). Studies comparing learner and NS phraseology have shown important differences in this area (see e.g. Wiktorsson 2003; Meunier & Granger 2008). Hasselgård (2009a: 134) found that Norwegian learners overuse the string I would say. In ICLE-NO it typically functioned as an expression of stance, often prefacing a conclusion. In the native speaker data used for comparison (from the British component of the International Corpus of English, ICE-GB), the expression was found either in its literal sense or with the meaning of approximation.

As a follow-up to this, we have studied the same expression across different learner groups and in LOCNESS. Table 4 shows Norwegian and French learners to have approximately the same degree of overuse, while the German and Spanish learners are closer to the distribution found in LOCNESS. Unlike the results for quite, the use of I would say does not reflect the Germanic – Romance distinction. Still, the use of the expression may be attributed to L1 transfer, or it may even be teaching-induced (some Norwegian textbooks list the expression as a possible turn of phrase in argumentation). Incidentally, the expression is mentioned by Granger (1998b: 156) as part of the learner’s (restricted) repertoire “for introducing arguments and points of view”.

Table 4. I would say across learner groups: Raw frequencies and relative frequencies per 100,000 words

Corpus     Occurrences   Rel. freq.
ICLE-NO    27            12.8
ICLE-GE    10             4.2
ICLE-FR    23            11.2
ICLE-SP     7             3.5
LOCNESS     5             1.5
First we examined I would say in LOCNESS. Surprisingly it was found with functions not attested in ICE-GB (Hasselgård 2009a: 134), namely as a stance marker (12) and as an introduction to a conclusion (13). (12) ... and so in some ways I would say that he is of use to the party. (LOCNESS) (13) In conclusion, I would say that a single europe would lead to a damaging loss of sovereignty for Britain ... (LOCNESS) Both instances in LOCNESS of I would say as a stance marker have the function of signposting the following proposition as the speaker’s considered, but tentative opinion. As shown in (14), this use can also be identified in other NS material, such as the BNC. We may note that the meaning of say in example 14 (taken from the academic writing section of the corpus) is close to suggest. The fairly literal implication (i.e. the writer’s response to a question) seems typical of NS use of the expression. (14) So, what is to be done about sexism in language? I would say, whatever is most effective in making people think about the implications of the expressions they use. (BNC: CGF) In ICLE-FR I would say is by far most frequent (80–90%) as part of a conclusion. The expression is most often accompanied by phrases such as to conclude or in conclusion, as exemplified by (15). This conclusive use carries a higher degree of modal certainty than the tentative use illustrated by (12) and (14). The conclusive use of I would say in ICLE-FR is most likely related to similar expressions in French as illustrated in example (16).17 (15) To conclude with this whole debate, I would say that I can hardly find positive arguments to stand for the compulsory military service. (ICLE-FR) (16) En conclusion je dirais que ce baladeur m’a complètement séduit. (www.iaddict.fr/ipod-shuffle.php) The Norwegian learners also use I would say in conclusions, but the stance marker use is about equally common. 
The latter typically occurs earlier on in the essay, prefacing a proposition that the writer is going to argue for. For example, (17) is the second sentence of an essay on ‘dreaming and imagination’. The expression can also have a meaning similar to ‘I think’, as shown by (18), and this use may be found anywhere in the text.

(17) I would say that it is a statement close to the truth of today’s society, and in this essay I will give my opinions on the topic, and some reasons why this could be a fact. (ICLE-NO)

17. Google searches restricted to the domain .fr showed that je dirais often collocates with en conclusion or en somme. Interestingly, the one example of in conclusion I would say in the BNC comes from a school essay.
(18) There is a vast difference between speeding and intentionally murdering another human being. In this first case I would say that punishment is just right, by removal of the driver’s license for a period of time ... (ICLE-NO) A formal difference between I would say and its closest Norwegian equivalent jeg vil si (lit: ‘I will say’) is that the Norwegian modal vil has the present tense form, which is a potential source of transfer errors. However, the expression I will say occurs only three times in ICLE-NO. It signals either stance or conclusion, as shown by (19), which occurs towards the end of a text. (19) Anyway, from my point of view, I will say there is a great space for both dreaming and imaginations in our lives. (ICLE-NO) The Norwegian expression is fairly close to ‘from my point of view’, i.e. it flags the following proposition as the speaker’s opinion, but not necessarily as tentative. In English, however, the past-tense form of the modal gives the expression I would say a tentative ring (e.g. Biber et al. 1999: 496). It is thus possible that the Norwegian learners, through L1 transfer, invest the English expression with a higher degree of assertiveness than it seems to have in NS usage. The German learners use I would say mostly to express stance, but also in a more literal sense as a metatextual device (Ädel 2006); in (20) the writer simply explains how s/he would answer a question. There are also a few cases of I would say in conclusions, as in (21). (20) Well, what is best for them, what is it they love? I would say: sitting on their mothers’ or fathers’ lap while being told a story ... (ICLE-GE) (21) On balance, I would say that corporal punishment is no appropriate means to fight against criminality. (ICLE-GE) In example (22), from ICLE-SP, I would say has a slightly different metatextual function, namely that of commenting on the use of a word, while (23) shows the expression of stance. 
These are the main uses of I would say in ICLE-SP, and the Spanish learners use them about equally often. (22) The recruit spends (“wastes” I would say) almost a year of his life (nine month is the average in Europe) doing nothing except ... (ICLE-SP) (23) First of all I would say that love was completely under the social convections and prejudice, ... (ICLE-SP) This investigation has shown clear overuse of I would say by Norwegian and French learners. The overuse can probably be explained in both cases by the existence of similar expressions in the learner’s L1. The qualitative study shows that the learners use the expression for different functions: the conclusive use is most frequent in ICLE-FR, where the expression often collocates with conclude, sum up or similar words. A plausible explanation for the overuse of this function is the frequent collocation in French
of je dirais with expressions such as en conclusion. As for the Norwegian learners, we suggested that they overuse the expression in conclusions because of the different degree of modal certainty carried by the Norwegian cognate expression. The conclusive use is absent from the ICE-GB material used by Hasselgård (2009a), but is found in LOCNESS. Yet, the phraseology of I would say in native speaker material suggests a lower degree of assertiveness than would normally be desirable in the conclusion to a line of argumentation. It is thus possible that conclusive I would say is related not just to L1 influence but also to developmental factors or to (lack of) speaker authority, though this is a point that needs further study. The stance-marker function of I would say is found in all the corpora, though it dominates most in ICLE-GE and ICLE-SP. The metatextual function would seem to constitute a relatively simple way of marking a rhetorical structure of question and answer in the text, and may thus be a feature of novice writing. Phraseological usage clearly depends on style and register and consequently reflects the proficiency level and writing experience of the writers. For this reason a reference corpus of ‘expert’ writing might usefully complement the NS student corpus. Furthermore, the study of the phraseology of learner language shows very clearly that contrastive interlanguage analysis would profit vastly from being supplemented by a contrastive analysis of the learner’s first language and the target language.18
7.3 A Norwegian perspective on seem
To give an example of how the Integrated Contrastive Model can work, we will take as our starting point Johansson’s (2007: 117 ff.) analysis of seem and its Norwegian correspondences in the ENPC and supplement this with an investigation of seem in ICLE-NO and LOCNESS. Johansson’s study was triggered by the observation that seem “sometimes seems to disappear without a trace in translations into Norwegian and likewise may be added, seemingly without any motivation, by English translators” (ibid.). Seem indeed turned out to be much more frequent in English originals than in translations (145.8 vs. 100.5 occurrences per 100,000 words). When comparing the English constructions with seem to their Norwegian correspondences, Johansson found that (i) English catenative constructions are strikingly more common than the corresponding syntactic choice in Norwegian; (ii) copula constructions are far more common in English than in Norwegian; those with a noun phrase complement are found in English only; (iii) English clauses with dummy subject it or there + seem(s) are less common than the corresponding Norwegian structures with the dummy subject det; (iv) an experiencer is more commonly expressed in Norwegian than in English; and (v) Norwegian uses more comparative structures, particularly with som (‘as (if)’, ‘like’) (2007: 123 and 138).
18. For a good example, see Paquot (2008).
Learner corpora and contrastive interlanguage analysis
Apart from the expectation that seem will be underused, these findings give rise to the following predictions for ICLE-NO compared to LOCNESS: (i) catenative seem will be underused; (ii) copular patterns will be underused, especially those with a noun phrase complement; (iii) a dummy subject will be used more often by the Norwegian learners; (iv) an experiencer will be expressed more often by the Norwegian learners; and (v) comparative structures will show up more often in the context of seem. The overall expectation is in fact not met: ICLE-NO has a higher frequency of seem per 100,000 words than LOCNESS (117 vs. 90). Even more surprisingly, the catenative function accounts for a slightly higher proportion of the occurrences of seem in ICLE-NO (51%) than in LOCNESS (47.5%). On the other hand, the copular function is, as predicted, more common in LOCNESS, with a proportion of 35.5%, compared to 28% of the occurrences of seem in ICLE-NO. The third prediction is partly met; in ICLE-NO, 31.5% of the occurrences of seem collocate with the dummy subject it, as against 23% in LOCNESS. Existential there, however, is more common with seem in LOCNESS (6 vs. 3 occurrences) but these figures are too low to reveal patterns. The predicted overuse of comparative structures is also to some extent confirmed. In any case the collocations seem like and seem as if are about twice as common in ICLE-NO as in LOCNESS. Finally, explicit experiencers are almost twice as common in ICLE-NO as in LOCNESS, which was expected on the basis of the contrastive study. An example is given in (24). (24) The oral examinaton seems to me to be more of a test in how to tackle stress ... (ICLE-NO) To dig further into the (mis-)match between the predictions based on Johansson’s contrastive study and the evidence from ICLE-NO we need to take a closer look at the learner data. 
First, the overuse of seem must be seen in connection with the general overuse of modal and hedging expressions in learner data, as shown by Aijmer (2002). Though not a modal auxiliary, seem clearly has modal meanings, particularly of evidentiality, and is thus handy for writers who want to hedge their claims. The unexpected overuse of catenative seem may take place at the expense of copular seem, as the most common lexical verb following catenative seem in ICLE-NO is be, often with a copular function, as seen in (25). By contrast, in (26) the predicative follows seem directly, without the aid of copular be. Admittedly, be is the most frequent verb following catenative seem in LOCNESS too, but it is more predominant in ICLE-NO (41.5% vs. 33% of all occurrences of catenative seem). It is thus likely that the Norwegian learners add be by analogy with corresponding Norwegian constructions (cf. Johansson 2007: 120). (25) The characters seem to be able to come to terms with Willie Loman’s death. (ICLE-NO) (26) This idea does not seem acceptable to the British public. (LOCNESS)
Hilde Hasselgård and Stig Johansson
The high frequency of dummy it in clauses with seem might be expected from the more general tendency of Norwegian to prefer light sentence openings (Hasselgård 2005). The dummy subject typically refers forward to a clause in extraposition. Interestingly, the Norwegian learners use the conjunction like more often than the more formal that in the extraposed clauses, as in (27). This may be due to the frequent use of som (‘as’, ‘like’) found in a number of Norwegian correspondences of seem in Johansson (2007). The learners also use the subordinator as if much more often than the native speakers (11 vs. 3 occurrences), no doubt influenced by the Norwegian equivalent som om illustrated in (28). (27) To me it seemed like some of the teachers had never been teaching school children (ICLE-NO) (28) ... but it seems you also know that if that happens it would be just as easily finished. (ENPC: ABR1) ... men det virker som om du også vet at hvis det skjer, kan det avsluttes like lett. (ABR1T) [lit: but it seems as if ...] In (28) som om corresponds to as if. However, om can be omitted in this construction, which is probably the cause of some dissonant occurrences like (29), where as is not followed by if. This type of dissonance can thus be explained by reference to the learner’s L1. (29) It might seem as it will cost a lot in the beginning ... (ICLE-NO)
The frequent expression of experiencers with seem in ICLE-NO, illustrated by (24) above, correlates with the general tendency to writer/reader visibility in learner texts (Petch-Tyson 1998), as the most common realization of the experiencer is to me. The native speakers use seem(s) to me in 7 out of 14 experiencer phrases, but the Norwegian learners use it in 14 out of 22, and in addition three of the remaining cases have an experiencer that includes the speaker, e.g. many of us. The tendency to overuse experiencer phrases may thus have two explanations; the more frequent expression of an experiencer in Norwegian and/or the learners’ inclination to be visible in their texts. Further exploration of other sub-corpora of ICLE is needed to check which of the explanations is more plausible. This case study has illustrated that the connection between learner data and contrastive data is far from straightforward. As discussed by Gilquin (2008), even features of learner language that may be attributed to L1 transfer on the basis of a contrastive analysis may in fact have other causes. In the study of seem it seems that the overuse of the word is related to the general overuse of modal markers by learners of English. The expression of experiencers may be either L1-related or due to the tendency for learners to use colloquialisms in their written texts (e.g. Altenberg & Tapper 1998). The preference of like to that in subordinate clauses may likewise have two explanations. However, the overuse of it as a dummy subject and the occasional omission of if in as if are very likely caused by L1 transfer.
It should be noted that the ICM, with the parallel corpora available, suffers from a mismatch of genres and/or writer proficiency. The ENPC consists of fictional and nonfictional texts. None of them are argumentative or academic (with the possible exception of a few popular science texts) and all are produced by professional writers and translators. Thus, an ICM analysis based on a corpus such as the ENPC should ideally be checked against a (monolingual) corpus of student writing in the learner’s L1 to control for genre and writer variables. The contrastive analysis based on ‘OL vs. OL’ in Figure 1 above might thus include a comparison of comparable monolingual corpora of student writing.
8. Some challenges
Granger has often discussed (e.g. Granger 2004: 134; Granger 2009: 14) the challenge of translating findings from CIA studies into pedagogical issues and EFL practice (see also Hunston 2002: 208). On the one hand, CIA studies usually outline potential pedagogical implications of the investigation, typically measures that will bring the learners closer to NS performance; on the other these measures are not necessarily directly translatable to classroom practice. In any case, the recommendations should probably to a greater extent take proper account of the reference corpus used as well as learner needs and teaching objectives (Granger 2009: 22). As pointed out by Ädel (2006: 206), “if we take it for granted that learners aim to achieve as professional a style of writing as possible, we should not make recommendations to learners based on native-speaker student usage, but rather should use professional native-speaker writing as the target”. For example, if compared to LOCNESS, Norwegian advanced learners underuse the connector however, even at a frequency of 66 per 100,000 words (N = 139), since LOCNESS has 181 instances per 100,000 words (N = 591). But a change of reference corpus alters the picture dramatically. The press editorials in the BNC, for example, have 58 occurrences of however per 100,000 words. We may thus wonder whether the Norwegian learners really underuse the word, or whether it is the LOCNESS writers who overuse it. In some cases of underuse, EFL teaching might focus on the underused items, though at the risk of inducing overuse instead. In the case of overused items, as noted by Hunston (2002: 209), there may be little point in saying “Use thing less often” without knowing what the relevant alternatives would be in specific contexts.
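The arithmetic behind the however comparison can be made explicit with a small sketch. It uses only the rates and raw counts quoted above; the function names are hypothetical, and the corpus sizes it derives are back-calculated approximations (the published rates are rounded), not figures reported in the text.

```python
"""Normalised frequencies of the connector 'however', using figures quoted in the text."""


def per_100k(count: int, corpus_size: float) -> float:
    """Return occurrences per 100,000 words."""
    return count / corpus_size * 100_000


def implied_corpus_size(count: int, rate_per_100k: float) -> float:
    """Back-calculate an approximate corpus size from a raw count and a rate."""
    return count / rate_per_100k * 100_000


def usage_ratio(learner_rate: float, reference_rate: float) -> float:
    """Ratio below 1 suggests underuse, above 1 overuse, relative to the reference."""
    return learner_rate / reference_rate


# Rates (per 100,000 words) and raw counts as given in the text.
LEARNER_RATE, LEARNER_N = 66, 139    # Norwegian advanced learners
LOCNESS_RATE, LOCNESS_N = 181, 591   # native-speaker student essays
BNC_EDITORIAL_RATE = 58              # BNC press editorials (no raw count given)

# Assumption: these sizes are only approximate, since the quoted rates are rounded.
learner_size = implied_corpus_size(LEARNER_N, LEARNER_RATE)
locness_size = implied_corpus_size(LOCNESS_N, LOCNESS_RATE)

# The same learner rate looks like underuse or mild overuse depending on the reference.
ratio_vs_locness = usage_ratio(LEARNER_RATE, LOCNESS_RATE)
ratio_vs_bnc = usage_ratio(LEARNER_RATE, BNC_EDITORIAL_RATE)

if __name__ == "__main__":
    print(f"Implied learner corpus size: about {learner_size:,.0f} words")
    print(f"Implied LOCNESS size: about {locness_size:,.0f} words")
    print(f"Learners vs. LOCNESS: {ratio_vs_locness:.2f}")
    print(f"Learners vs. BNC editorials: {ratio_vs_bnc:.2f}")
```

A ratio well below 1 against LOCNESS and slightly above 1 against the BNC editorials is the quantitative form of the point made here: whether a feature counts as underused depends on the reference corpus chosen.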
The example of however given above also illustrates that the concepts of overuse and underuse are not straightforward, and quantitative findings need to be carefully considered and cross-checked with qualitative analyses before exposing learners to them.19
19. In fact, Granger (2009: 22) points out that “features of learner language uncovered by L[earner] C[orpus] research need not necessarily lead to targeted action in the classroom”. This will depend on the degree of divergence between learner and native speaker usage as well as on learner needs.
This is,
however, not to deny the immense value of quantitative studies based on the CIA method and the ICLE corpus collection, but researchers should keep their eyes open for alternative reference corpora and external causes for some of the findings; cf. Ädel (2008) and Gilquin & Paquot (2008). Another important challenge concerns genre, as Biber et al. (1999) convincingly demonstrate that grammar depends on register. Studies of advanced learner language often suggest that learners are unaware of genre requirements (e.g. Altenberg 1997, Gilquin & Paquot 2008), and that this may be part of the explanation for the general overuse of informal and spoken-like features. This may well be true. However, the comparison of Hyland’s (2002) study of scientific reports written in English by Hong Kong learners with the figures for ICLE-HK (see Table 1 above) may indicate that learners of English can adapt their style to different registers. The challenge for CIA is thus to expand its empirical base to include more registers. This work has been started with the ongoing compilation of a new international learner corpus, the Varieties of English for Specific Purposes dAtabase (VESPA). With this corpus alongside ICLE and LINDSEI it will be possible to extend the field of CIA into studies of genre, medium and style. Finally, the study of corpora such as ICLE and LINDSEI can give invaluable insights into the interlanguage of learners at a particular proficiency level. However, such corpora cannot reveal much about language learning. For example, dummy it is not often used instead of existential there in ICLE-NO even though this is a well-known learning problem for Norwegians (cf. Hasselgård 2009a). When do learners begin to keep the two constructions apart? At what stage do learners whose native language does not have a grammaticalized progressive start overusing the form in English (cf. Johansson & Stavestrand 1987)? 
When do learners acquire syntactic patterns that are different from those of their own native tongue, and by what steps? To answer such questions, we need data representing different stages of the learning process, from beginners to advanced learners, for instance along the lines of the Danish PIF project (Færch et al. 1984). Hopefully, the new Longitudinal Database of Learner English (LONGDALE) project will bring corpus-linguistic studies closer to the language learning process.20
20. For information on the VESPA and the LONGDALE projects, see www.uclouvain.be/en-cecl-vespa.html and www.uclouvain.be/en-cecl-longdale.html, respectively.
9. The revolution continues
About twenty years after the ICLE project was conceived, the achievement seems immense. This applies not just to the important work done by Sylviane Granger and her team at the Centre for English Corpus Linguistics. No less important is the enthusiasm which has spread to many countries across the world (a good overview is given at
www.uclouvain.be/en-cecl-lcWorld.html). The study of learner corpora is now an established field of applied linguistics. But it is a field which keeps evolving; new projects emerge, and thereby the potential for renewed research procedures, more sophisticated corpus tools, new types of investigations and new applications. An important example of the recognition of interlanguage research is the ICLE-based contribution of the Centre for English Corpus Linguistics to the Macmillan Dictionary (Rundell 2007). ‘Get-it-right’ boxes as well as a section entitled “Improve your writing skills” are advertised as key features of the dictionary.21 One of the earliest articles presenting the ICLE project, Granger (1994), carries the title “The Learner Corpus: A revolution in applied linguistics”. It has indeed been revolutionary in the sense that it has opened up a whole range of new research questions. Contrastive Interlanguage Analysis has turned out to be a fruitful paradigm. And yet there were significant studies of learner language preceding ICLE. At the outset of our paper we drew attention to some early work in Scandinavia. A hallmark of these studies is the concern with pedagogical applications (Thagg Fisher 1985; Linnarud 1986) and with issues of language learning (Færch et al. 1984). What they lacked was the comparison across different mother-tongue groups. In contrast, the CIA paradigm includes both learner vs. native speaker comparison and the possibility of comparing across groups of learners with different mother-tongue backgrounds. Moreover, the Integrated Contrastive Model has a great advantage over earlier error analysis and contrastive studies undertaken previously for purposes of improving language teaching: the combined resources inherent in the model secure a much better basis for explaining errors as well as making and testing predictions of learning difficulties. 
In spite of the wealth of studies, Granger (2009: 14) admits that “learner corpus research has not yet fully realized its stated ambition as its links with SLA have been somewhat weak and it has given rise to relatively few concrete pedagogical applications”. But the potential is definitely there, and Granger points out some important directions to go. If these are followed, the future seems bright for foreign-language pedagogy and for understanding interlanguage and the processes of foreign language acquisition.
21. See www.macmillandictionaries.com/about/MED2/keyfeatures.htm.
References
Ädel, A. 2006. Metadiscourse in L1 and L2 English [Studies in Corpus Linguistics 24]. Amsterdam: John Benjamins. Ädel, A. 2008. Involvement features in writing: do time and interaction trump register awareness? In Linking up Contrastive and Learner Corpus Research, G. Gilquin, S. Papp & M.B. Díez-Bedmar (eds), 35–53. Amsterdam: Rodopi. Aijmer, K. 2002. Modality in advanced Swedish learners’ written interlanguage. In Computer Learner Corpora, Second Language Acquisition and Foreign Language Learning [Language
Learning & Language Teaching 6], S. Granger, J. Hung & S. Petch-Tyson (eds), 55–76. Amsterdam: John Benjamins. Aijmer, K. (ed.). 2009. Corpora and Language Teaching [Studies in Corpus Linguistics 33]. Amsterdam: John Benjamins. Altenberg, B. 1997. Exploring the Swedish component of the International Corpus of Learner English. In Proceedings of PALC’97 Practical Applications in Language Corpora (Lódz, 10–14 April 1997), B. Lewandowska-Tomaszczyk & P.J. Melia (eds), 119–132. Lódz: Lódz University Press. Altenberg, B. & Tapper, M. 1998. The use of adverbial connectors in advanced Swedish learners’ written English. In Learner English on Computer, S. Granger (ed.), 80–93. London: Longman. Barlow, M. 2005. Computer-based analyses of learner language. In Analysing Learner Language, R. Ellis & G. Barkhuizen (eds), 335–357. Oxford: OUP. Biber, D., Johansson, S., Leech, G., Conrad, S. & Finegan, E. 1999. Longman Grammar of Spoken and Written English. London: Longman. Boström Aronsson, M. 2003. On clefts and information structure in Swedish EFL writing. In Extending the Scope of Corpus-based Research. New Applications, New Challenges, S. Granger & S. Petch-Tyson (eds), 197–210. Amsterdam: Rodopi. Brand, C. & Kämmerer, S. 2006. The Louvain International Database of Spoken English Interlanguage (LINDSEI): Compiling the German component. In Corpus Technology and Language Pedagogy, S. Braun, K. Kohn, & J. Mukherjee (eds), 127–140. Frankfurt: Peter Lang. Corder, S.P. 1973. Introducing Applied Linguistics. Harmondsworth: Penguin. De Cock, S., Granger, S., Leech, G., & McEnery, T. 1998. An automated approach to the phrasicon of EFL learners. In Learner English on Computer, S. Granger (ed.), 67–79. London: Longman. Eia, A.-B. 2006. The use of linking adverbials in Norwegian advanced learners’ written English. MA thesis, University of Oslo. Enkvist, N.E. 1973. Should we count errors or measure success? In Errata: Papers in error analysis, J.
Svartvik (ed.), 16–23. Lund: Gleerup/Liber. Færch, C. 1979. Computational analysis of the PIF Corpus of learner language. PIF Working Papers 1, 2nd rev. version. Department of English, University of Copenhagen. Færch, C., Haastrup, K. & Phillipson, R. 1984. Learner Language and Language Learning. Copenhagen: Nordisk Forlag A.S. & Clevedon: Multilingual Matters. Gilquin, G. 2000/2001. The Integrated Contrastive Model: Spicing up your data. Languages in Contrast 3(1): 95–124. (Printed in 2003). Gilquin, G. 2008. Combining contrastive and interlanguage analysis to apprehend transfer: detection, explanation, evaluation. In Linking up Contrastive and Learner Corpus Research, G. Gilquin, S. Papp & M.B. Díez-Bedmar (eds), 3–34. Amsterdam: Rodopi. Gilquin G., Granger S. & Paquot M. 2007. Learner corpora: The missing link in EAP pedagogy. In Corpus-based EAP Pedagogy, P. Thompson (ed.). Special issue of Journal of English for Academic Purposes 6(4): 319–335. Gilquin, G., Papp, S. & Díez-Bedmar, M.B. (eds). 2008. Linking up Contrastive and Learner Corpus Research. Amsterdam: Rodopi. Gilquin, G. & Paquot, M. 2008. Too chatty: Learner academic writing and register variation. English Text Construction 1(1): 41–61. Granger, S. 1994. The Learner Corpus: A revolution in applied linguistics. English Today 10(3): 25–32.
Granger, S. 1996. From CA to CIA and back: An integrated approach to computerized bilingual and learner corpora. In Languages in Contrast. Papers from a Symposium on Text-based Cross-linguistic Studies, Lund 4–5 March 1994 [Lund Studies in English 88], K. Aijmer, B. Altenberg, & M. Johansson (eds), 37–51. Lund: Lund University Press. Granger, S. 1998a. The computer learner corpus: A versatile new source of data for SLA research. In Learner English on Computer, S. Granger (ed.), 3–18. London: Longman. Granger, S. 1998b. Prefabricated patterns in EFL writing. In Phraseology. Theory, Analysis, and Applications, A.P. Cowie (ed.), 145–160. Oxford: OUP. Granger, S. (ed.). 1998c. Learner English on Computer. London: Longman. Granger, S. 2002. A bird’s-eye view of learner corpus research. In Granger, Hung & Petch-Tyson (eds), 3–33. Granger, S. 2004. Computer learner corpus research: current status and future prospects. In Applied Corpus Linguistics: A Multidimensional Perspective, U. Connor & T. Upton (eds), 123–145. Amsterdam: Rodopi. Granger, S. 2009. The contribution of learner corpora to second language acquisition and foreign language teaching: A critical evaluation. In Aijmer (ed.), 13–32. Granger, S., Hung, J. & Petch-Tyson, S. (eds). 2002. Computer Learner Corpora, Second Language Acquisition and Foreign Language Learning [Language Learning & Language Teaching 6]. Amsterdam: John Benjamins. Granger, S., Dagneaux, E., Meunier, F. & Paquot, M. (eds). 2009. International Corpus of Learner English. Version 2. Louvain-la-Neuve: Presses universitaires de Louvain. Greenbaum, S. 1991. The development of the International Corpus of English. In English Corpus Linguistics: Studies in Honour of Jan Svartvik, K. Aijmer & B. Altenberg (eds), 83–91. London: Longman. Hammarberg, B. 1973. The insufficiency of error analysis. In Errata: Papers in error analysis, J. Svartvik (ed.), 29–36. Lund: Gleerup/Liber. Hasselgård, H. 2005.
Theme in Norwegian. In Semiotics from the North: Nordic Approaches to Systemic Functional Linguistics, K. L. Berge & E. Maagerø (eds), 35–48. Oslo: Novus. Hasselgård, H. 2009a. Thematic choice and expressions of stance in English argumentative texts by Norwegian learners. In Aijmer (ed.), 121–139. Hasselgård, H. 2009b. Temporal and spatial structuring in English and Norwegian student essays. In Corpora and Discourse – and Stuff. Papers in Honour of Karin Aijmer. R. Bowen, M. Mobärg, & S. Ohlander (eds), 93–104. Göteborg: Acta Universitatis Gothoburgensis. Hasselgren, A. 1994. Lexical teddy bears and advanced learners: A study into the ways Norwegian students cope with English vocabulary. International Journal of Applied Linguistics 4: 237–259. Herriman, J. and Boström Aronsson, M. 2009. Themes in Swedish advanced learners’ writing in English. In Aijmer (ed.), 101–120. Hunston, S. 2002. Corpora in Applied Linguistics. Cambridge: CUP. Hyland, K. 2002. Authority and invisibility: authorial identity in academic writing. Journal of Pragmatics 34: 1091–1112. Johansson, S. 1978. Studies of Error Gravity. Native Reactions to Errors Produced by Swedish learners of English. Gothenburg: Acta Universitatis Gothoburgensis. Johansson, S. 2007. Seeing through Multilingual Corpora: On the Use of Corpora in Contrastive Studies [Studies in Corpus Linguistics 26]. Amsterdam: John Benjamins. Johansson, S. & Stavestrand, H. 1987. Problems in learning – and teaching – the progressive form. In Proceedings from the Third Nordic Conference for English Studies [Stockholm
Studies in English 73(1)], I. Lindblad & M. Ljung (eds), 139–148. Stockholm: Almqvist & Wiksell. Lado, R. 1957 [1971]. Linguistics across Cultures: Applied Linguistics for Language Teachers. Ann Arbor MI: University of Michigan Press. Leech, G. 1998. Preface. In Learner English on Computer, S. Granger (ed.), xiv–xx. London: Longman. Levenston, E. A. 1971. Overindulgence and underrepresentation – Aspects of mother tongue interference. In Contrastive Linguistics, G. Nickel (ed.), 115–121. Cambridge: CUP. Linnarud, M. 1986. Lexis in Composition: A Performance Analysis of Swedish Learners’ Written English [Lund Studies in English 74]. Lund: Gleerup/Liber. Lorenz, G. 1998. Overstatement in advanced learners’ writing: Stylistic aspects of adjective intensification. In Learner English on Computer, S. Granger (ed.), 53–66. London: Longman. Meunier, F. & Granger, S. (eds). 2008. Phraseology in Foreign Language Learning and Teaching. Amsterdam: John Benjamins. Nesselhauf, N. 2005. Collocations in a Learner Corpus [Studies in Corpus Linguistics 14]. Amsterdam: John Benjamins. Nickel, G. 1973. Aspects of error evaluation and grading. In Errata: Papers in Error Analysis, J. Svartvik (ed.), 24–28. Lund: Gleerup/Liber. Osborne, J. 2008. Adverb placement in post-intermediate learner English: A contrastive study of learner corpora. In Linking up Contrastive and Learner Corpus Research, G. Gilquin, S. Papp & M.B. Díez-Bedmar (eds), 127–146. Amsterdam: Rodopi. Paquot, M. 2008. Exemplification in learner writing: A cross-linguistic perspective. In Phraseology in Foreign Language Learning and Teaching, F. Meunier & S. Granger (eds), 101–119. Amsterdam: John Benjamins. Paquot, M. 2010. Academic Vocabulary in Learner Writing. From Extraction to Analysis. London: Continuum. Petch-Tyson, S. 1998. Writer/reader visibility in EFL written discourse. In Learner English on Computer, S. Granger (ed.), 107–118. London: Longman. Pravec, N. A. 2002.
Survey of learner corpora. ICAME Journal 26: 81–114. Ringbom, H. 1998. Vocabulary frequencies in advanced learner English: A cross-linguistic approach. In Learner English on Computer, S. Granger (ed.), 41–52. London: Longman. Rundell, M. (Editor in chief) 2007. Macmillan English Dictionary for Advanced Learners, 2nd edn. Oxford: Macmillan Education. Selinker, L. 1972. Interlanguage. International Review of Applied Linguistics 10(3): 219–231. Sinclair, J. 1991. Corpus, Concordance, Collocation. Oxford: OUP. Svartvik, J. (ed.). 1973. Errata: Papers in Error Analysis. Lund: Gleerup/Liber. Thagg Fisher, U. 1985. The Sweet Sound of Concord: A Study of Swedish Learners’ Concord Problems in English [Lund Studies in English 73]. Lund: Gleerup/Liber. Virtanen, T. 1998. Direct questions in argumentative student writing. In Learner English on Computer, S. Granger (ed.), 94–106. London: Longman. Wardhaugh, R. 1970. The contrastive analysis hypothesis. TESOL Quarterly 4(2): 123–130. Wiktorsson, M. 2003. Learning Idiomaticity. A Corpus-Based Study of Idiomatic Expressions in Learners’ Written Production [Lund Studies in English 105]. Stockholm: Almqvist & Wiksell International.
Corpora used in examples and case studies
British National Corpus (BNC) <www.natcorp.ox.ac.uk/>
English-Norwegian Parallel Corpus (ENPC) <www.hf.uio.no/ilos/english/services/omc/enpc/>
International Corpus of English, British Component (ICE-GB) <www.ucl.ac.uk/english-usage/projects/ice-gb/>
International Corpus of Learner English (ICLE) <www.uclouvain.be/en-cecl-icle.html>
Louvain Corpus of Native English Essays (LOCNESS) <www.uclouvain.be/en-cecl-locness.html>
The use of small corpora for tracing the development of academic literacies
JoAnne Neff van Aertselaer and Caroline Bunce
Since Erasmus exchanges have fostered student mobility in the European Union, various features of argumentation skills for Academic English (AE) have become central elements of university curricula. This chapter presents an analysis of a small corpus of texts written in an academic writing (AW) class by English as a Foreign Language (EFL) Spanish university students at B1 and B2 levels of the Common European Framework for Languages (CEFR). The small corpus data is contrasted with the Spanish sub-corpus of the International Corpus of Learner English (SPICLE) regarding the use of certain devices for intertextuality and evaluation. The study shows that students who have been given very definite CEFR guidelines regarding the use of specific academic features are able to improve their writing, even though there remain certain types of errors in their overall lexico-grammatical production.
1. Introduction
Given the increasing student mobility within the European Union, skill in the critical argumentation indispensable for academic writing (AW) in English has become an essential competency. This development within institutions of higher education is reflected in the manual called Relating Language Examinations to the Common European Framework of Reference for Languages, published in 2009 by the Language Policy Division of the Council of Europe. On various pages (pp. 44, 138, 177), this document addresses the question of two text types which are essential for academic work: the descriptive-chronological text type (as in lab reports) and the argumentative text type (essential in all academic disciplines, at least for many sections of an academic report or research article). To the Appendix on ‘Written assessment criteria’ (Table C4, p. 187) of this document, the Language Policy Division has attached additional columns for these two text types. The specifications list features of argumentative AW, such as the ability to present a case; provide a critical appreciation of proposals; expand and support a point of view with subsidiary points, reasons and examples; and provide an appropriate reader-friendly logical structure. If these characteristics constitute what
university students’ writing will be judged on, then it is crucial that university teachers analyse academic writing in the different disciplines in order to ascertain what these features, which include a mixture of structural and rhetorical patterns, are and how they could be best taught. That is, these general Common European Framework of Reference (CEFR) features do not specify the linguistic realizations that AW requires and therefore, these must be identified and incorporated into can do statements for writing syllabi.1 In this chapter, we focus on a series of lexical choices which enter into grammar patterns and their pragmatic associations2 – so often the focus of the work of Sylviane Granger (Granger 1983; Granger 1998a; Granger 1998b; Gilquin et al. 2007; Meunier & Granger 2008) – in order to show how the elaboration of can do statements for a one-semester academic writing course can improve student writing (and reading skills) in terms of the students’ communicative goals, if not their syntactic competency. The use of these lexical items are traced through two corpora: the Spanish sub-corpus of the International Corpus of Learner English (SPICLE), a collection of texts produced by Spanish English as a Foreign Language (EFL) students with no specific training in AW, as compared to a corpus consisting of texts written by similar students as part of a course in AW. The purpose of the various comparisons was to ascertain whether the syllabus for the two AW courses (2007–2008 and 2008–2009) was actually beneficial to the students’ literacy growth in the production of texts.3 Therefore, the study focuses more on the students’ text production than on the readings used as models during the course. In both of the years of the AW course, the ultimate aim of the study was pedagogical, i.e. revising the syllabus and thus classroom practices. 
The study shows that, while instructors of an AW course cannot hope to significantly improve their students’ grammatical competence over a one-semester period, by providing explicit descriptors for argumentative writing, they are able to help the students understand the dialogic nature of argumentation. The attention given to the frequency of different features of argumentation and the ways in which these combine shows students how to produce more sophisticated texts. Furthermore, the study also illustrates how small corpora can be usefully employed both to trace learners’ developmental patterns and subsequently adapt specific classroom teaching practices (Thompson 2001a).
1. This study forms part of the work completed for a national project funded by the Spanish Ministry of Science and Innovation (FFI2008–03968).
2. Following Hoey (2005: 43), we define pragmatic association as the particular pragmatic function(s) that words and nested combinations of words are primed for because of frequent use, such as ‘as can be seen in Table 2’ as a discourse marker for presenting information. Also see Hunston & Francis (1999).
3. No attempt was made to measure the improvement (or not) of the students’ reading competency.
2. The development of academic literacies in an EFL context
According to Johns (1997: 2), literacy is an inclusive term which refers to both reading and writing, and also “encompasses ways of knowing particular content, languages and practices”, including strategies to deal with “understanding, discussing, organizing and producing texts”. As many researchers have noted (Bazerman 1994; Johns 1997; Bhatia 2004), the development of academic literacy in particular disciplines depends on the students’ having become aware of the requirements of the genre in question – giving rise to what Hoey refers to as “productive priming” (Hoey 2005: 11) – and also being conscious of the socio-cultural forces which give rise to the intertextual nature of academic texts. In the context of university students of English Studies at the Universidad Complutense de Madrid, course instructors have observed that the students can readily classify text types4 into narrative or descriptive passages; however, they have difficulty in explaining the reasons for their categorisation, particularly in identifying text-internal features of argumentative texts, such as the use of modal verbs, concessive constructions, and adversative lexical phrases in order to present various viewpoints. That is, students are intuitively aware of features of text types but this schematic knowledge is insufficient for them to produce good argumentative texts. These linguistic forms and text patterns (text internal features) should be understood as a means for negotiating a stance within a genre. But students rarely comprehend texts in terms of negotiating multiple text external discourses, perhaps because they do not fully understand texts as a form of social practice. It must also be admitted that student texts, mostly written for teacher evaluation, do not usually bring about any “consequent social action” (Bazerman 1994: 79). A useful concept for presenting such text external factors is genre.
4. Text types have been defined following Werlich (1983), who proposes 5 types – description, exposition, narration, argumentation and instruction. Genre has been defined following Biber (1995) and Swales (1990). Text types are considered to have internal (linguistic) features which define the types in themselves, while genres are heavily influenced by cultural, external features. Different text types may occur within a single genre, as in a research paper, which may include a narrative account of past research, an expository account in the Methods section and argumentative text type in the Discussion section. In the 2001 book on the Common European Framework of Reference for Languages (Council of Europe 2001: 95), text types are referred to, but these are in fact genres (comic books, textbooks, newspapers, etc.).
Swales (1990: 45–58) has defined genre as a class of communicative events with a shared set of purposes and goals, carried out within certain conventions for the presentation of contents, positioning and form. EFL undergraduate students, such as those whose texts are studied here, have not had enough experience with different varieties of academic texts, except for textbooks, to have formed prototypical concepts for these different texts, and in particular, for highly conventional texts such as a formal research paper. Since students’ contact with academic sub-genres has mostly centred
JoAnne Neff van Aertselaer and Caroline Bunce
on textbooks, it is very likely that they will confuse the types of text-internal features (such as the use of imperative verbs and vocatives like let’s) found in textbooks with the language they are to use in essays and academic papers. Therefore, sequenced, goal-directed reading tasks should be the starting point for genre acquisition (Swales 1990: 76). Linked to the concept of genre is that of discourse community, which is defined by Swales (1990: 24–27) as having “a common set of public goals” and, among the expert members of the community, shared discursive practices, which often develop into one or more genres. Our students need to become aware of the nature of the external and internal factors which influence the academic discourse communities they are entering, in our case, Linguistics and Literature. These differences exist both between these two communities and among various types of subgenre, such as textbooks, essays, critical analyses, and term papers (Bhatia 2004: 31). In addition to the necessity of beginning the AW course for university students with general notions of genre and discourse community, at a very early point, intertextuality should be introduced as a way of helping students realise that their texts will enter into some academic discourse community, as limited as that may be within their own institutions. There are various ways in which academic texts are intertextual. Their form is a reflection of prior texts (both in structural and rhetorical features). Their content also engages with prior texts, in that the arguments must be strengthened by the reading and digesting of others’ texts. Additionally, academic texts must combine both the author’s intention, that is, the stance expressed towards the content, with the evaluation of those texts read and cited as background material. 
Often students do not conceive of themselves as members of an academic discourse community and therefore do not see their texts as participating in what Briggs & Baumann (1992: 146) have described as the “ongoing process of producing and receiving discourse”. Without our students’ understanding of this dialogic process, they will not be able to make sense of the way in which structural and rhetorical features combine in order to construct an effective academic argument. There is a further complication for the Spanish context. Writing in academic contexts is often seen primarily as knowledge telling and may be governed by an assumption that students should display the knowledge they have acquired, usually that given by the teacher in class or the textbook. This attitude is reflected in examination questions which do not require the candidate to put forth stance moves or to have completed outside critical readings. For example, a typical literature question for a Spanish university entrance exam (Educared 2009) is the following: Características del Modernismo (“Characteristics of Modernism”). As it is not really a question, this type of essay prompt merely requires the candidates to list a set of characteristics, not to examine the various issues involved, or to contrast sources; in fact, the latter are not required at all. These types of prompts, requiring mainly descriptive answers, given over a number of years of schooling, mean that little attention is given to argumentation, as a lesser-valued skill at
secondary level.⁵ In contrast, in most schooling in English-speaking contexts, narrative and descriptive texts are the focus of instruction until approximately 9 or 10 years of age when factual writing of different types (description, report, explanation, persuasion, Martin 1990: 15) begins to take on importance (Perera 1989), not only for examinations but, when students are older, for longer texts as well, such as term papers. For the latter, argumentation becomes the main text type and descriptive text is used mostly for contextualization and exemplification, in support of the arguments presented. If Spanish contexts stress description (what something is like) over persuasive exposition/argumentative text types (reasons and arguments),⁶ Spanish university students entering English Studies may have to struggle in order to comprehend argumentation patterns and incorporate them into their writing. At tertiary level, it is difficult for instructors to convince students that they must strive to create their own voice, perhaps because the text-internal and text-external features still remain implicit. The purpose of the can do statements elaborated for this course is to provide students with an explicit set of such features which can serve as the basis for academic literacy exercises, and ultimately, academic essays.
3. The academic writing course
In order to encourage knowledge transformation, and not merely the knowledge telling found in descriptive texts, the instructors found it necessary to draw up a series of guidelines or can do descriptors to make explicit the required structural and rhetorical features to be learned. Since the competence levels of the students are mixed, the syllabus for the course centres on specific genre and intertextual practices, as displayed in Table 1, which must be learned by the students of the AW course, regardless of their competence level in English.⁷
5. This assumption is corroborated by the number of points given to the students taking the Spanish Literature and Language exams for university entrance. The argumentative essay counts for 1 point out of 10 points in total.
6. Although argumentation is one of the text types mentioned in preparatory university courses for Spanish students, it is not a text type that students frequently practise.
7. The classes are not streamed in the English Studies Department at our university; thus, as was the case in both AW courses considered in this study, students’ levels may range from A2 to C1, as tested during the first class with the Oxford Quick Placement (OQP) Test. It is not possible to simply exclude students whose level is not at B1, the level at which the first specific descriptors appear on the Writing Grid for Argument (Council of Europe 2009: 187). In order to measure students’ progress regarding the structural and rhetorical descriptors, the data from sample 1 (AW1) had to be matched with the final essay data (AW2) from the same students. For this purpose, we selected from each of the two courses (2007–2008 and 2008–2009), 20 initial essays (n = 40) and 20 final essays (n = 40). The competence level of the 40 students (OQP Test) was as follows: A2 level: 20%; B1 level: 20%; B2/C1 levels: 65%; and C2 level: 5%.
Table 1. Can do statements for B2 level
Features of structural and rhetorical competence | Qualifications
Structural features
– Can reword the prompt of a writing assignment incorporating opposing points of view appropriate for argumentative genre | Proper contextualization
– Can present all claims and supporting data in a logically organized way | Few stranded claims or data
– Can use both prospection and encapsulation⁸ to create coherence | Few limitations regarding lexical phrases used
– Can conclude by restating major ideas and placing the arguments in a wider context | Suggestion of future events
Rhetorical features
– Can consider other points of view, adopting a critical stance | Can distinguish among the arguments in sources
– Can incorporate intertextuality by reporting others’ views and statements, using lexical resources, such as adjectives, adverbs and verbs, which show writer alignment (stance) | Can use a wide range of reporting verbs (suggest, claim, show, etc.)
– Can use a reasonably extensive range of hedges and boosters as well as impersonalization strategies in presenting claims | Can make effective use of passive voice, modalized utterances, abstract rhetors (non-human agents)
– Can successfully use a variety of discourse markers (DMs) to indicate flow of text | Can effectively use lexical cohesive devices (synonyms, hyponyms, etc.) as well as DMs
8. Following Sinclair (1993: 8), encapsulation is defined as phrases which reformulate what has been stated, usually in order to move on to another topic or conclusion, and prospection occurs when “the phrasing of a sentence leads the addressee to expect something specific in the next sentence” (Sinclair 1993: 12), namely because the speaker/writer has alluded to topics to be dealt with.
These features were also used to measure the students’ written performance throughout and at the end of the course. These criteria enabled the instructors to avoid solely focusing on the elimination of student errors and instead, to concentrate, more reasonably, on feasible advancement in discourse competency. The can do statements were presented on the first day of the course and frequently referred to before focusing on specific writing exercises. Students reported having found these descriptors clear and useful and also having referred to them for home assignments. As can be observed in this table, the can do statements cover a range of genre characteristics. By the end of the course, the student is expected to display ownership
of the ideas presented as claims and sub-claims, as well as adopting an authorial stance suitable for a nuanced argumentation. Of the above features, the ones examined in this study are rhetorical rather than structural, particularly those related to intertextuality, such as the range of reporting verbs used and the internal (authorial) and external (non-authorial) voices used to present points of view.
4. The study
As previously mentioned, the aim of the study was to discover how EFL students negotiate stance in academic papers, with the ultimate aim of examining our students’ progress in the acquisition of various devices for stance-taking, an important feature of the AW syllabus.
4.1 Texts included in the study
For purposes of measuring development in student writing, the SPICLE corpus (see Table 2), collected throughout the 1990s, provides a picture of Spanish EFL university writing without the benefit of a specific AW course. This corpus is a collection of texts (194,845 words) on general interest and literature topics, written by third- and fourth-year Spanish EFL university students, and included as the Spanish component of the International Corpus of Learner English (ICLE), held at Louvain. The data from this corpus are compared with the two small sub-corpora (AW1 and AW2) of English Studies students enrolled in the Academic Writing class at the Complutense University of Madrid (UCM), during 2008 and 2009. These texts (27,462 words) are samples of argumentative essays written by second-year English Studies students on general interest topics (i.e., approximately the same as those used for the ICLE corpus, but excluding literature topics). Writing sample 1 (AW1), collected during the second week of the course, was matched with the texts of the final sample (AW2), written by the same students. The students were required to do writing assignments throughout the course, but only course-initial and course-final samples of their writing were selected since the aim of the study was to analyse the students’ progress and evaluate the effectiveness of the AW course. These two AW sub-corpora show the gains made by UCM students after explicit teaching of the features of academic writing. The data from the two sub-corpora are also compared with each other in order to trace the development regarding the specific features listed in the can do statements. The study is further complemented by previous studies carried out on part of the Louvain Corpus of Native English Essays (LOCNESS), texts written by American university students, especially regarding the use of deictics referring to propositions in the text.
The results from all of these studies will be used to inform the syllabus design for the AW course in the future.
Table 2. Corpora included in the study
Name of corpus | Number of words
SPICLE | 194,845
AW TEXTS | 27,462
AW1 | 10,596
AW2 | 16,866
Since the corpora were of different sizes, all the figures for the data were normed per one hundred words to permit comparisons. The texts produced by the AW students represented a very limited number of words because, for the purpose of measuring progress, we could use only the texts written by the students who had completed both the first and final writing assignments, elaborated in class from notes. In Appendix I, there is an example of a final essay (AW2) from the writing course in 2009, and in Appendix II, in order to show developmental trends, there are two essays from the same student enrolled in the writing course in 2008: the initial essay (AW1) and the final essay (AW2).
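The norming step described above can be illustrated with a minimal Python sketch. The corpus sizes are those given in Table 2 and the raw totals are the reporting-verb totals reported for each corpus; the names `CORPUS_SIZES` and `normed_per_100` are hypothetical and introduced only for illustration, not taken from the study.

```python
# Corpus sizes in words, as given in Table 2.
CORPUS_SIZES = {"SPICLE": 194_845, "AW1": 10_596, "AW2": 16_866}


def normed_per_100(raw_count: int, corpus: str) -> float:
    """Return the frequency of an item per 100 words of the named corpus."""
    return raw_count / CORPUS_SIZES[corpus] * 100


# Total reporting-verb tokens per corpus (raw figures reported in the study).
raw_totals = {"SPICLE": 835, "AW1": 98, "AW2": 194}

normed_totals = {
    name: round(normed_per_100(count, name), 2)
    for name, count in raw_totals.items()
}
print(normed_totals)  # {'SPICLE': 0.43, 'AW1': 0.92, 'AW2': 1.15}
```

Under these assumptions the results agree, up to rounding, with the normed totals reported for the three corpora (0.4, 0.92 and 1.15).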
4.2 Methods and procedures
In order to investigate stance-taking, we first searched for the reporting verbs used by the students in order to compare the latter with a list used by expert article writers in English (cf. Neff et al. 2001) and then also carried out a more qualitative study of the rhetors, or agents, established by the students as giving voice to evaluations or claims. Two main criteria governed the inclusion or not of data in the study: one concerning the rhetor (usually the subject) associated with the verb and the other concerning the ideas, statements or arguments introduced by the verb (usually the object). The first criterion was that the verb should be associated with an identifiable rhetor which could be considered to be one of the text’s voices (rhetors) and to participate in the textual discussions. Thus, the instances of conclude with the function of discourse organizer (e.g. “To conclude: the best solution is ...”) were not classified, since it was considered that they did not give sufficient emphasis to the rhetor, but rather served principally to organize the text. The second criterion was that the verb should introduce or be associated with propositional content which could be phrased as a statement or question (e.g. “Many agree that TV is too violent”). Thus uses such as “the discovery of AIDS has changed how people think” or “They should think about their morals” were not included. The data were included in the study if they fulfilled at least one of these criteria. An impersonal use such as “It is reasonable to conclude also that without the satellite this would not have occurred” was thus accepted because it fulfilled the second criterion though not the first, while “These works and studies have looked at this issue from many different angles” was also included on the grounds that it fulfilled the first criterion though not the second.
The initial quantitative approach was to focus on a range of reporting verbs, such as argue, note, suggest and show, which we had observed as frequently used in LOCNESS and expert academic texts (cf. Neff et al. 2001). We first searched for the root and irregular forms of the verbs (see Table 5 for the full list) using Wordsmith 5.0 (Scott 2007). Some instances of reporting verbs from the SPICLE corpus were not included when they occurred in display-type answers, particularly in the literature essays, such as “In the two final stanzas, John Donne explains the meaning of that conceit” and “Joan says she will rather die than spend the rest of her days in prison”. These uses by SPICLE writers were considered instances of contextualization and not argumentation and therefore were not taken into account in this study. The initial analysis of reporting verbs showed some basic patterns and tendencies with regard to the different discourse verbs used by the students. It also became apparent that certain verbs tended to be used with certain types of rhetors, e.g., “this shows that ...”. Therefore, in a second step, we carried out a more qualitative study in order to categorize the rhetors, that is, to classify the use of voice (abstract rhetor, non-specific rhetors or personal pronouns, etc.) and impersonal and/or passive constructions (i.e., no agent).
Academic texts frequently use rhetors of various kinds, such as those shown in Table 3: specific human agents (I, you, we); non-specific human agents (one); specific and/or named rhetors (two researchers from New York; Hyland); general, non-specific and unnamed rhetors (some people may think that ...); or abstract rhetors (This study shows that ...). There is also frequent use of impersonal constructions, such as it is important to note that ..., and passive constructions, such as it has been said that ..., in which the rhetorical act appears to have no human agency. Many of these latter constructions permit the writer to present her arguments as resting upon common knowledge and factual, objective data. All allow the writer to adopt a variety of stances with regard to the propositions put forward, ranging from distancing herself from these propositions to subscribing to them. There are, of course, many ways in which writer stance can be expressed, and in successful academic argumentation stance-taking consists of a complex combination of a variety of linguistic features.
Table 3. Categorization of the different voices associated with reporting verbs
Classifications | Examples
Abstract rhetors | An examination of the programming has concluded that ...
Impersonal and passive constructions | it has been said that...; it is necessary to point out that...
General, non-specific and unnamed rhetors | Some people may think...; Opponents claim...; Proponents of X argue that...; The average reader may not find...
Specific and/or named rhetors | Methvin believes that...; Two researchers from New York found that...
Deictics as subject (referring to propositions in the text) | This shows that...
“one” subject | One may assume that...
“you” subject | If you analyze many of these arguments...
“we” subject | Before we discuss the case of ...
“I” subject | Personally, I find that...; I have always believed that...
Therefore, in a third step of the research, we decided to focus on four lexical resources for evaluation (all displayed with examples in Table 4) explicitly taught during the course, namely it + copular verb + evaluative adjectival phrase + that, it + copular verb + evaluative adjectival phrase + to + verb of mental or verbal processes, and two uses of adverbs ending in ly: those conveying a writer comment on the whole content of the proposition (disjunct), and those modifying a discourse oriented verb. As occurred with the reporting verbs, Wordsmith 5.0 was used for the word searches (using the strings it **** that, it **** to and *ly) and initial data sorting, while the subsequent elimination of irrelevant data was done manually. For the purpose of comparison, all the figures for the various data were normed per 100 words and chi-square was used to test for statistical significance.
Table 4. Different types of evaluative devices examined in study
Evaluative lexical device | Examples
it + copular verb + adjectival phrase + that | it is obvious that...; it is indisputable that...; it seems more logical that...
it + copular verb + adjectival phrase + to + verb of knowing/saying | it is important to take into account that...; it is only natural to think that...; it seems contradictory to say that...
*ly adverbs used as disjunct/used to modify discourse verb | immigration is obviously a problem...; but unfortunately, many governments do not....; I will briefly summarise...; scientists plausibly claim...
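For readers without access to the concordancer, the wildcard searches described above can be approximated with regular expressions. The following Python sketch is an assumption-laden stand-in and does not reproduce Wordsmith's own query syntax: the pattern names, the choice of copular verbs (is, seems, appears) and the limit of three intervening words are all hypothetical. As in the study, the output is only a list of candidates that would still need manual checking.

```python
import re

# Hypothetical regex stand-ins for the wildcard searches; not Wordsmith syntax.
# "it **** that": it + copular verb + up to three words + that
IT_ADJ_THAT = re.compile(
    r"\bit\s+(?:is|seems|appears)\s+(?:\w+\s+){1,3}?that\b", re.IGNORECASE
)
# "it **** to": it + copular verb + up to three words + to + verb
IT_ADJ_TO = re.compile(
    r"\bit\s+(?:is|seems|appears)\s+(?:\w+\s+){1,3}?to\s+\w+", re.IGNORECASE
)
# "*ly": any word ending in ly (adverbs, but also false hits such as "only")
LY_WORD = re.compile(r"\b\w+ly\b", re.IGNORECASE)
# Root plus irregular form of a reporting verb, e.g. find*/found
FIND_FORMS = re.compile(r"\b(?:find\w*|found)\b", re.IGNORECASE)


def search(pattern: re.Pattern, text: str) -> list[str]:
    """Return every matched string so that hits can be checked by hand."""
    return [match.group(0) for match in pattern.finditer(text)]


sample = (
    "It is obvious that immigration is obviously a problem, and it is only "
    "natural to think that scientists found this."
)
print(search(IT_ADJ_THAT, sample))  # ['It is obvious that']
print(search(IT_ADJ_TO, sample))    # ['it is only natural to think']
print(search(LY_WORD, sample))      # ['obviously', 'only']
print(search(FIND_FORMS, sample))   # ['found']
```

The hit "only" in the third search shows why a purely automatic count would overestimate the evaluative adverbs and why irrelevant data have to be removed manually.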
5. Analysis and discussion
Stance-taking in any piece of writing requires the use of different devices employed within a very nuanced textual process. During the 37-hour AW course, it was not possible to teach all these diverse strategies. Thus, the instructors opted for a limited number of structural and rhetorical indicators, which appear as can do descriptors in Table 1. In this study we explore the use of several of these indicators, namely reporting verbs and the rhetor types that occur with them, and four types of lexical devices for evaluation.
5.1 Reporting verbs
The principal findings for the reporting verbs in each corpus are presented in Table 5, with the raw figures in the left-hand columns followed by the figures normed per 100 words. First we discuss the unusual frequencies of some of the individual verbs and then some developmental trends.
Table 5. Occurrences of reporting verbs per corpus
Verb | SPICLE raw fig. | SPICLE normed fig. | AW1 raw fig. | AW1 normed fig. | AW2 raw fig. | AW2 normed fig.
address* | 0 | 0 | 0 | 0 | 0 | 0
agree*/disagree* | 32 | 0.02 | 2 | 0.02 | 6 | 0.04
analyz*/s* | 19 | 0.01 | 4 | 0.04 | 6 | 0.04
argu* | 12 | 0.006 | 4 | 0.04 | 12 | 0.07
assum* | 4 | 0.002 | 0 | 0 | 0 | 0
believ* | 51 | 0.03 | 0 | 0 | 0 | 0
claim* | 4 | 0.002 | 0 | 0 | 26 | 0.15
conclud* | 7 | 0.004 | 2 | 0.02 | 2 | 0.01
discuss* | 14 | 0.007 | 4 | 0.04 | 6 | 0.04
establish* | 0 | 0 | 0 | 0 | 0 | 0
explain* | 30 | 0.02 | 6 | 0.06 | 16 | 0.09
find*/found | 3 | 0.002 | 0 | 0 | 0 | 0
focus* on | 18 | 0.009 | 4 | 0.04 | 4 | 0.02
hypothesis*/iz* | 0 | 0 | 0 | 0 | 0 | 0
indicat* | 3 | 0.002 | 0 | 0 | 4 | 0.02
look* at | 3 | 0.002 | 0 | 0 | 0 | 0
not* (note) | 4 | 0.002 | 4 | 0.04 | 4 | 0.02
point* out/to | 22 | 0.01 | 8 | 0.08 | 20 | 0.12
present* | 4 | 0.002 | 0 | 0 | 0 | 0
prov* (prove) | 20 | 0.01 | 4 | 0.04 | 4 | 0.02
provid* (+ evidential N.) | 4 | 0.002 | 2 | 0.02 | 2 | 0.01
refer* | 25 | 0.01 | 0 | 0 | 0 | 0
report* | 1 | 0.0005 | 2 | 0.02 | 0 | 0
say*/said | 246 | 0.1 | 8 | 0.08 | 18 | 0.11
show* | 45 | 0.02 | 10 | 0.09 | 12 | 0.07
stat* (state) | 8 | 0.004 | 12 | 0.11 | 24 | 0.14
stud* (study) | 0 | 0 | 0 | 0 | 0 | 0
suggest* | 9 | 0.005 | 2 | 0.02 | 8 | 0.05
think*/thought | 247 | 0.1 | 20 | 0.2 | 20 | 0.12
Total | 835 | 0.4 | 98 | 0.92 | 194 | 1.15
5.1.1 Unusual frequencies
As can be seen, four of the verbs used by expert writers (cf. Neff et al. 2001), address, establish, hypothesise/hypothesize and study, were not used at all in any of the student corpora. There are also very few tokens of assume, find, look at and present. These results point to the EFL students’ lack of range in using reporting verbs, as corroborated by other studies (Charles 2006; Neff et al. 2001). It is worth noting that some of the reporting verbs used by experts are particularly academic in tone, such as hypothesise/hypothesize, and are probably not commonly used even by native undergraduate students. As a result of both novice writer and EFL writer limitations, both groups of EFL university writers show a certain tendency to rely on a limited set of discourse oriented verbs.
5.1.2 Developmental trends
In comparing the SPICLE data with the AW data, three main trends become apparent:
1. the concentration of the SPICLE tokens on two verbs
2. the progressive increase in frequencies of use of some verbs: from SPICLE to AW1 to AW2
3. the progressive decrease in frequencies of use of some verbs: from SPICLE to AW1 to AW2
The data resulting from the corpus of students who had no specific training in AW, i.e., the SPICLE corpus, show that there is a much greater concentration of use on very few common discourse verbs. In fact, two verbs, think and say, account for approximately 59% of the total use of reporting verbs. The texts of students who received AW training show a broader range of reporting verbs. Verbs that carry more discourse value, e.g., suggest, state and claim, now appear more frequently, which allows these students to rely less heavily on think and say. In the AW1 texts, think and say accounted for approximately 29% and in AW2, 20% of the total reporting verbs.
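The percentages for think and say can be checked directly against the raw figures reported for these two verbs and for the corpus totals. The short Python sketch below does only that arithmetic; the names `counts` and `share_of_say_and_think` are hypothetical and introduced for illustration.

```python
# Raw token counts reported in the study: (say, think, total reporting verbs).
counts = {
    "SPICLE": (246, 247, 835),
    "AW1": (8, 20, 98),
    "AW2": (18, 20, 194),
}


def share_of_say_and_think(corpus: str) -> float:
    """Percentage of all reporting-verb tokens accounted for by say and think."""
    say, think, total = counts[corpus]
    return (say + think) / total * 100


shares = {corpus: round(share_of_say_and_think(corpus)) for corpus in counts}
print(shares)  # {'SPICLE': 59, 'AW1': 29, 'AW2': 20}
```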
This result suggests that, although the AW writers still show a certain limited range of reporting verbs, similar to that of the SPICLE group, they rely far less on the two verbs previously mentioned and more readily use other discourse oriented verbs which are more academic in tone and convey a greater degree of authorial stance. Of the 21 remaining verbs (after discounting the 8 verbs occurring either negligibly or not at all in the corpora), 19 (agree/disagree, analyze, argue, claim, conclude, discuss, explain, focus on, indicate, note, point out/to, prove, provide, report, say, show, state, suggest, think) appear with greater frequency in one of the AW sub-corpora than in the SPICLE corpus, thus, in general terms corroborating the finding that the AW writers show less over-reliance on a limited range of verbs. It is encouraging for the instructors to note that 8 verbs (argue, claim, explain, indicate, point out, show, state, suggest) also show a longitudinal increase in frequency when the AW1 sub-corpus is compared with the AW2 sub-corpus. In the case of argue, explain, point out, show, state and suggest, the AW1 texts show a greater frequency than the SPICLE texts and the AW2 texts, in turn, an increase in frequency vis-à-vis the AW1 texts. Claim and indicate
are not used by the AW student writers in their first essays (AW1), but the students have incorporated them into their writing by the final week of the course (AW2) and use them with a greater frequency than the SPICLE writers. Finally, regarding the decrease in frequencies of use from SPICLE to the AW texts, there are two verbs, believe and refer, that show this tendency. The explanation for this decrease appears somewhat complex and can only be offered tentatively. In the SPICLE texts, 64% of the instances of refer correspond to interactive⁹ uses with the pronouns we and I (e.g. “We have previously refered to”, “here we are referring to the fact that”, “I am refering to Spain”). The AW writers’ avoidance of such expressions in an attempt to achieve a more impersonal academic voice explains, at least in part, the absence of this verb in their data. As far as believe is concerned, the SPICLE writers use this verb interactionally¹⁰ with I and we as rhetors in 43% of the cases, such as in “I believe that university studies must be reformed”. This means that their claims are often made almost exclusively in terms of personal experience rather than by relying on external authoritative sources. This overuse of believe may point to a transfer effect and a mismatch of registers since oral Spanish prefers believe (“yo creo”, I believe) to think for expressing personal opinions. In the light of these data our hypothesis was that, in contrast to the SPICLE writers, the AW students express their opinions by different means. One such device would be through evaluative adjectives and adverbs, which are used precisely to comment on the claims made by others, as in “Actually, what it clearly reveals is that the result of this process is ...”. This use of lexical resources to convey writer alignment would suggest that the AW students have been successful in incorporating the rhetorical devices set out in the can do statements.
Table 6 presents the total number of occurrences of reporting verbs in the SPICLE data as compared with AW1 and AW2. There are statistically significant differences in frequency between the SPICLE texts (produced without the aid of specific writing instruction) and the two AW sub-corpora. Moreover, significance increases as the students progress through the coursework, from sample one (AW1) to the final essay (AW2).
Table 6. Total occurrences of reporting verbs: SPICLE, AW1 and AW2
Total reporting verbs
Corpora | SPICLE (raw figures) | AW texts (raw figures) | P
SPICLE vs AW1 | 835 | 98 | <0.025
SPICLE vs AW2 | 835 | 194 | <0.001
9. Thompson (2001b: 58) defines interactive textual resources as those which “help to guide the reader through the text”.
10. Thompson (2001b: 58) defines interactional resources as involving “the reader collaboratively in the development of the text”.
Table 7. Total occurrences of reporting verbs: AW1 and AW2
Total reporting verbs
Corpora | AW1 (raw figures) | AW2 (raw figures) | P
AW1 vs AW2 | 98 | 194 | <0.05
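One way of running a chi-square comparison of the AW1 and AW2 totals is sketched below in Python. The chapter does not state how the contingency tables were constructed, so the layout used here (reporting-verb tokens versus all other words in each sub-corpus, Pearson chi-square with one degree of freedom and no continuity correction) is an assumption, and the function name `chi_square_2x2` is hypothetical. The sketch is not guaranteed to reproduce the p-values reported in the tables.

```python
def chi_square_2x2(hits_a: int, size_a: int, hits_b: int, size_b: int) -> float:
    """Pearson chi-square (1 d.f., no continuity correction) for two corpora.

    Each corpus contributes one row: [item tokens, all other words].
    """
    a, b = hits_a, size_a - hits_a
    c, d = hits_b, size_b - hits_b
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


# AW1: 98 reporting verbs in 10,596 words; AW2: 194 in 16,866 words.
statistic = chi_square_2x2(98, 10_596, 194, 16_866)
print(round(statistic, 2))  # 3.14

# Standard critical values for 1 d.f.: 2.706 (p = 0.10) and 3.841 (p = 0.05).
approaches_significance = 2.706 < statistic < 3.841
print(approaches_significance)  # True
```

Under these assumptions the statistic falls between the critical values for p = 0.10 and p = 0.05.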
However, a chi-square comparison of AW1 and AW2, displayed in Table 7, shows that the increase in frequency only approaches significance (p between 0.10 and 0.05). The lack of significant findings may have occurred because in the AW2 texts, the students use other, more varied devices to support their claims, not reflected by the reporting verb figures. A detailed qualitative analysis of the AW texts certainly suggests this is the case. What emerges from this is that by the end of the course the students are able to effectively balance external voices with their own, as their expressions make clear, e.g. “There can be no doubt that ...”, “Even if it has been clearly demonstrated that ...” and “it seems more logical that society should struggle ...”. Table 8 shows the total figures for the SPICLE corpus and the two AW sub-corpora with regard to the types of rhetors used. The clearest developmental trends can be observed in the increase on the part of AW students in the use of abstract and impersonal rhetors (e.g. “one” as subject), and of impersonal and passive constructions, with a corresponding and extremely marked decrease in the use of we and I as subject. All of these figures point to a greater degree of impersonalization in these texts and suggest, once again, that authorial voice is being conveyed by other evaluative lexical means.
Table 8. Rhetor types: raw figures and normed frequencies
Classifications | SPICLE raw fig. | SPICLE normed fig. | AW1 raw fig. | AW1 normed fig. | AW2 raw fig. | AW2 normed fig.
Abstract rhetors | 91 | 0.05 | 16 | 0.15 | 26 | 0.15
Impersonal and passive constructions | 139 | 0.07 | 36 | 0.34 | 38 | 0.22
General, non-specific and unnamed rhetors | 159 | 0.08 | 22 | 0.21 | 34 | 0.2
Specific and/or named rhetors | 78 | 0.04 | 20 | 0.2 | 80 | 0.5
Deictics as subject (referring to propositions in the text) | 1 | 0.0005 | 0 | 0 | 0 | 0
“one” subject | 2 | 0.001 | 2 | 0.02 | 12 | 0.07
“you” subject | 10 | 0.005 | 0 | 0 | 2 | 0.01
“we” subject | 125 | 0.06 | 0 | 0 | 0 | 0
“I” subject | 230 | 0.11 | 2 | 0.02 | 2 | 0.01
A further developmental pattern is clear in the greater incorporation of named outside sources, such as “Porter (2002) claims that ...” and “As the National Energy Education Development Project (2008) has shown ...”. These academically appropriate references are specifically mentioned in the can do statements and are targeted in the course exercises. However, one category shows no improvement, i.e. the use of a deictic as subject (referring to propositions in the text). The AW writers do not make use of this cohesive device in conjunction with reporting verbs and further analysis of the texts shows that it is generally not exploited. They prefer to use deictics in noun phrases such as “These two simple questions are answered in a book called ...”. This absence does not mean that their texts lack cohesion, but perhaps suggests that their range of devices is still somewhat limited. The use of deictics should therefore be the focus of future instruction and inclusion in a can do statement.
5.2 Evaluative lexical resources
We had hypothesized that the lack of a statistically significant difference between the totals for reporting verbs in the AW1 and AW2 texts could be explained by the writers using other lexical devices, shown in Table 9, to convey internal and external voice and writer alignment. In order to test this assumption, a comparison was made of the figures for four types of evaluative lexical devices: it + copular verb + evaluative adjectival phrase + that, it + copular verb + evaluative adjectival phrase + to + verb expressing mental or verbal processes, and two types of adverbial functions ending in ly. The data analysed here show mainly developmental patterns. Table 9 displays the figures for these evaluative devices and shows that, as far as the total use of such devices is concerned, while they occur with a slightly lower relative frequency in the AW1 texts than in the SPICLE texts, their frequency in the AW2 texts is considerably higher. The results of the chi-square tests, as shown in Table 10, revealed that the low frequency in the AW1 sub-corpus did not differ statistically from the SPICLE writers’ use. However, the greater frequency in the AW2 sub-corpus, as compared to SPICLE, was highly significant (p < 0.001). Thus it can be said that the students who had received no specific AW instruction (SPICLE) and those with just two weeks of instruction (the AW1) used these lexical resources to a similar degree. On the other hand, from the highly significant difference between the total use of evaluative devices in the AW2 texts, as compared to those used by AW1 (Table 11), it can be concluded that the students have become aware of the importance of such expressions for negotiating writer stance. It can also be presumed, with a reasonable degree of confidence, that the lack of statistical significance regarding the increase in the total use of reporting verbs between AW1 and AW2 texts is attributable to and compensated for by the greater use of these other lexical resources by AW2 students.

Table 9. Distribution of evaluative devices per corpus

| Type of evaluative device | SPICLE raw fig. | SPICLE normed fig. | AW1 raw fig. | AW1 normed fig. | AW2 raw fig. | AW2 normed fig. |
|---|---|---|---|---|---|---|
| it * adj. + that | 64 | 0.03 | 3 | 0.03 | 13 | 0.08 |
| it * adj. + to + verb of mental or verbal process | 49 | 0.03 | 5 | 0.05 | 3 | 0.02 |
| *ly adverbs used as disjunct/used to modify discourse verb | 137 | 0.07 | 5 | 0.05 | 56 | 0.33 |
| Total | 250 | 0.13 | 13 | 0.12 | 72 | 0.43 |

Table 10. Total uses of evaluative devices: SPICLE, AW1 and AW2

| Corpora | SPICLE (raw figures) | AW texts (raw figures) | P |
|---|---|---|---|
| SPICLE vs AW1 | 250 | 13 | Not significant |
| SPICLE vs AW2 | 250 | 72 | <0.001 |

A more detailed examination of the number of occurrences of each type of lexical device shows that the developmental increase is not uniform across the categories. In the it * adj. + that constructions, the relative frequencies in the SPICLE and AW1 texts are similar, while in the AW2 texts the frequency is higher. In the case of the *ly adverbs, the frequency in the AW1 texts is the lowest, and in the AW2 texts is the highest while in the SPICLE texts the frequency falls between these two. There is even a case of the AW2 texts showing the lowest frequency of the three groups, namely in constructions with it * adj. + to + verb of mental or verbal process, especially think and say. This does not, however, detract from the relevance of the considerable development seen in the total figures. One individual category does merit particular comment, i.e. that of the adverbs. Here the frequency shown in the AW2 texts – indicating a very clear developmental pattern – is markedly higher (0.33) than that of either the AW1 texts (0.05) or the SPICLE texts (0.07).
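The comparisons reported in Tables 10 and 11 can be approximated with a 2 × 2 chi-square test that sets each corpus's count of evaluative devices against its remaining running words. The sketch below shows one common way of setting up such a test; it is not necessarily the procedure the authors followed. The corpus sizes are not given in this section, so the values used here are rough back-calculations from the rounded totals in Table 9 (raw figure ÷ normed figure × 100) and must be read as assumptions, as must the function names.

```python
# Critical values of the chi-square distribution with 1 degree of freedom.
CRIT_P05 = 3.841
CRIT_P001 = 10.828

# Raw totals of evaluative devices, taken from Table 9.
HITS = {"SPICLE": 250, "AW1": 13, "AW2": 72}

# ASSUMPTION: approximate corpus sizes in words, back-calculated from the
# rounded normed totals in Table 9 (raw / normed * 100); not the exact sizes.
SIZES = {"SPICLE": 192_000, "AW1": 10_800, "AW2": 16_700}


def chi_square_2x2(hits_a, size_a, hits_b, size_b):
    """Pearson chi-square (no continuity correction) for hits vs. other words."""
    observed = [
        [hits_a, size_a - hits_a],
        [hits_b, size_b - hits_b],
    ]
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (obs - expected) ** 2 / expected
    return statistic


def compare(a, b):
    """Chi-square statistic for two corpora named in HITS and SIZES."""
    return chi_square_2x2(HITS[a], SIZES[a], HITS[b], SIZES[b])


for pair in [("SPICLE", "AW1"), ("SPICLE", "AW2"), ("AW1", "AW2")]:
    stat = compare(*pair)
    if stat >= CRIT_P001:
        verdict = "p < 0.001"
    elif stat >= CRIT_P05:
        verdict = "p < 0.05"
    else:
        verdict = "not significant"
    print(f"{pair[0]} vs {pair[1]}: chi-square = {stat:.2f} ({verdict})")
```

Under these assumed sizes the three verdicts agree with the P column of Tables 10 and 11 (not significant, < 0.001, < 0.001). The statistic values themselves depend on the approximated corpus sizes and should not be quoted as results of the study.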
Table 11. Total uses of evaluative devices: AW1 and AW2

| Corpora | AW1 (raw figures) | AW2 (raw figures) | P |
|---|---|---|---|
| AW2 vs AW1 | 13 | 72 | <0.001 |

Furthermore, when the two types of adverbial use are examined individually, other interesting observations can be made. Table 12 shows that, in the SPICLE corpus, 13 of the 137 adverbs (9.5%) studied correspond to adverbs used to modify a reporting verb. The AW writers do not use adverbs in this way at all in their course-initial texts. In their final writing essays, however, 20 of the 56 adverbs (35.7%) are being used in this way. Interestingly, 19 of these are used to comment on external voices. This represents a relative frequency of 0.11 per one hundred words, in contrast to the SPICLE writers’ use of only 6 such adverbs (relative frequency of 0.003 per one hundred words). Thus, it is evident that the AW writers have, by the end of the course, realised the importance of evaluative adverbs to comment on propositions from various internal and external sources. Also relevant is the fact that they use these adverbs to comment on the opinions of others to a far greater degree than do students who have not studied AW. There seems to be a shift from the type of use found in SPICLE to boost a very personal presentation of the writer’s stance (e.g. “I strongly believe that ...”, “I completely disagree with the idea that ...”) to a more complex use in the AW2 texts (e.g. “as scientists plausibly claim”, “it is rightly considered that ...”) which combines others’ voices and their own. While there is a certain awkwardness in some of the examples (e.g. “as Tierney (1996) wrongly pointed out”), which is usual when learners are testing the limits of a recently acquired lexical device, it is clear that these writers have become aware of, and are experimenting with, a new possibility for expressing their alignment with external propositions (see the example of a final essay in Appendix I).

Table 12. Types of *ly adverbs used in the three sub-corpora

| Sub-corpus | Disjuncts | Adverbs modifying reporting verbs: internal voices | Adverbs modifying reporting verbs: external voices |
|---|---|---|---|
| SPICLE | 124 | 7 | 6 |
| AW1 | 5 | 0 | 0 |
| AW2 | 36 | 1 | 19 |
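The proportions quoted in this paragraph follow directly from the raw counts in Table 12. As a quick arithmetic check, the sketch below recomputes the share of *ly adverbs that modify a reporting verb in each sub-corpus. Only the counts come from the table; the dictionary layout and the function name are illustrative choices.

```python
# Raw counts from Table 12: disjuncts, and adverbs modifying reporting verbs
# that introduce internal or external voices.
TABLE_12 = {
    "SPICLE": {"disjunct": 124, "internal": 7, "external": 6},
    "AW1": {"disjunct": 5, "internal": 0, "external": 0},
    "AW2": {"disjunct": 36, "internal": 1, "external": 19},
}


def reporting_verb_share(counts):
    """Return (modifiers of reporting verbs, all *ly adverbs, percentage)."""
    modifiers = counts["internal"] + counts["external"]
    total = modifiers + counts["disjunct"]
    return modifiers, total, 100 * modifiers / total


for corpus, counts in TABLE_12.items():
    modifiers, total, share = reporting_verb_share(counts)
    print(f"{corpus}: {modifiers} of {total} adverbs ({share:.1f}%)")

# SPICLE: 13 of 137 adverbs (9.5%)
# AW1: 0 of 5 adverbs (0.0%)
# AW2: 20 of 56 adverbs (35.7%)
```

The totals recovered here (137, 5 and 56) also match the raw figures for *ly adverbs in Table 9.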
6. Conclusion
This study, using corpora of various sizes, set out to examine Spanish EFL university students’ use of various devices for intertextual dialogue: discourse-oriented verbs, including the various types of rhetors which can co-occur with them, and certain kinds of grammar patterns, such as anticipated it constructions and modal adverbs, which also permit the inclusion of writer evaluations. The EU Framework adopted in the AW course and the re-working of the descriptors for writing (can do statements) – developed from analysis of the student data previously collected (SPICLE texts) – arose from the practical needs of our instructors and their students. The instructors used structural and rhetorical features to draw up a set of criteria for measuring the students’ written performance throughout and at the end of the course.
Although the features examined in this study were not the only criteria used for evaluation of the students’ final texts, these aspects of AW, adopted as criteria for advancement, enabled the instructors to avoid focusing solely on the EFL students’ lexico-grammatical errors and, instead, to give more attention to feasible improvement in discourse competence (see Appendix II for developmental trends). The difference, in regard to these criteria, between the SPICLE group (with no specific training in AW) and the two sub-groups of the AW course has shown that the academic literacy of university students can be improved by studying the students’ use of text-internal and text-external features and by centring sets of exercises around these features. Specifically, the comparison of the AW1 texts (written at the beginning of the course) with the AW2 texts (the final sample of writing) presented here confirms that the students benefited from a detailed list of features which they could learn to incorporate into their texts in the limited time period of the course. To be sure, there are still some features that call for attention. However, the use of explicitly stated text-internal and text-external requirements has advanced the students’ understanding of the dialogic processes involved in argumentative writing and, in many cases, their discourse competence improved so greatly that these writers appear to have progressed to an entirely new stage of competence,¹¹ as measured by the descriptors included in the CEFR for argumentative writing.
11. At the 2008 Montenegro meeting of the English Profile Networks: Research Network in SouthEast Europe, Cambridge First Certificate Examiners estimated that the B1.1. student <011–2008> whose initial and final texts appear in Appendix II had improved her writing to such an extent that, in the final essay, she appeared to be approaching B2, First Certificate Level.

References
Bazerman, C. 1994. Systems of genres and the enhancement of social intentions. In Genre and New Rhetoric, A. Freedman & P. Medway (eds), 79–101. London: Taylor & Francis.
Bhatia, V. 2004. Worlds of Written Discourse. London: Continuum.
Biber, D. 1995. Dimensions of Register Variation: A Cross-linguistic Comparison. Cambridge: CUP.
Briggs, C. & Baumann, R. 1992. Genre, intertextuality and social power. Journal of Linguistic Anthropology 2: 131–172.
Charles, M. 2006. The construction of stance in reporting clauses: A cross-disciplinary study of theses. Applied Linguistics 27: 492–518.
Council of Europe. 2001. Common European Framework of Reference for Languages. Cambridge: CUP.
Council of Europe. 2009. Relating Language Examinations to the Common European Framework of Reference for Languages: Learning, Teaching, Assessment (CEFR). Strasbourg: Language Policy Division.
Educared. 2009. Exámenes resueltos, Literatura española. (accessed April 2009).
Gilquin, G., Granger, S. & Paquot, M. 2007. Learner corpora: The missing link in EAP pedagogy. Journal of English for Academic Purposes 6(4): 319–335.
Granger, S. 1983. The “be” + Past Participle Construction in Spoken English: With Special Emphasis on the Passive. Amsterdam: North-Holland.
Granger, S. 1998a. Prefabricated patterns in advanced EFL writing: Collocations and formulae. In Phraseology: Theory, Analysis and Applications, A. Cowie (ed.), 145–160. Oxford: OUP.
Granger, S. (ed.). 1998b. Learner English on Computer. London: Addison Wesley Longman.
Hoey, M. 2005. Lexical Priming: A New Theory of Words and Language. London: Routledge.
Hunston, S. & Francis, G. 1999. Pattern Grammar: A Corpus-driven Approach to the Lexical Grammar of English [Studies in Corpus Linguistics 4]. Amsterdam: John Benjamins.
Johns, A. 1997. Text, Role and Context. Cambridge: CUP.
Martin, J. 1990. Factual Writing: Exploring and Challenging Social Reality. Oxford: OUP.
Meunier, F. & Granger, S. (eds). 2008. Phraseology in Foreign Language Learning and Teaching. Amsterdam: John Benjamins.
Neff, J., Martínez, F. & Rica, J. P. 2001. A contrastive study of qualification devices in NS and NNS argumentative texts in English. In ERIC Document Reproduction Service, ED 465301. Washington DC: Educational Resource Information Center, U.S. Department of Education.
Perera, K. 1989. Children’s Writing and Reading: Analysing Classroom Language. Oxford: Basil Blackwell.
Scott, M. 2007. Wordsmith 5.0. Oxford: OUP.
Sinclair, J.M. 1993. Written discourse structure. In Techniques of Description. Spoken and Written Discourse, J.M. Sinclair, M. Hoey & G. Fox (eds), 6–31. London: Routledge.
Swales, J. 1990. Genre Analysis: English in Academic and Research Settings. Cambridge: CUP.
Thompson, G. 2001a. Corpus, comparison, culture: Doing the same things differently in different cultures. In Small Corpus Studies and ELT [Studies in Corpus Linguistics 5], M. Ghadessy, A. Henry & R. Rosenberry (eds), 311–334. Amsterdam: John Benjamins.
Thompson, G. 2001b. Interaction in academic writing: Learning to argue with the reader. Applied Linguistics 22(1): 58–78.
Werlich, E. 1983. A Text Grammar of English. Heidelberg: Quelle und Meyer.
Appendix 1 Final student essay <001–2009, level B1.3> Nowadays, specialists of the environment are constantly insisting on the importance of recycling. It is rightly considered that recycling is the best way of lengthening the Earth’s life. Consequently, society will experiment a healthier and longer life. The problem comes when recycling. For many people it means something which might be wrong. Porter(2002) claims that recycling involves processing used materials, reduce the consumption of fresh raw materials, reduce energy usage, reduce air pollution(from incineration) and water pollution(from landfilling). First, it is obligatory to know that everyday day people use energy in their different daily activities. For instance: when someone turn on a light; when someone is cooking; when both, public transport and own cars, are being used; etc. That is, society wastes energy unconsciously, according to The National Energy Education Development Project (2008). In order to avoid this wasting Porter (2002) proposes some materials that could be recycled, such as: paper, glass, metal, electronics, textils, and paper. None the less, in
some countries governments state that recycling is not only difficult to do but also useless. That means that recycling wastes resources, as well as non-recycling do (Tierney 1996). In spite of what Tierney claimed, the NEEDP (National Energy Education Development Project) (2008) published some advice so that recycling would be easier. Turning off machines like the microwave, air-conditioning, etc when not necessary are some. Furthermore, using less energy (energy conservation) and increasing ecological machines and transport, could also help. With regard to Economy, it is insistingly claimed that recycling would affect market prices. Therefore, economy would experiment a descent due to the recycling process, which is significantly expensive (Tierney 1996). Moreover, landfills would be needed and that would arise prices and incinerators are considered as an useless recycling way. Otherwise, they would waste more energy than save, as Tierney (1996) has intelligently stated. In conclusion, society is divided due to the recycling process. While a part says that it is utterly unnecessary it is unconditionally stated by the other part that it is a favorable option and an incredible chance to save, not only human beings but also the entire world. Unless the recycling process would be forbidden by governments, people should act and recycle.
Appendix 2 Initial essay (AW1), B1.1. student* <011–2008> Some students believe that university degrees do not prepare people for the real world. To what degree would you agree with this belief?
Essay Prompt (an exact copy)
P. 1 The actual Educative System is being the main theme in a lot of debats in our days. Politics and students do not agree because it is truth that a lot of students believe that the university degrees do not prepare us for the real world and for our first job.
Contextualization Claim 1
P. 2 On the one hand, I believe that they have some reason because when you finish your degree really you don’t have made anything similar to you be will have to do in your real job. In my case, for instance, when I finished my university degree which is english filology, I will not have any idea to teach in a school class.
Data 1 Example 1
P. 3 On the other hand, I think that we learn a lot of literature and english: so we will be able to make our job really well as soon as we get a bit of self-confidence, because we have a good preparation.
Claim 2 Contradict Claim 1
P. 4 FINALLY, students should have more practise classes and periods in our university degrees, although I think that when I have finished my degree I will have a great knowledges for my new job although I haven’t much experience.
Conclusion (repetition of P. 3)
* Paragraphs are indicated with “P”.
Final essay (AW2), the same B1.1. student <011–2008> P 1. Physician-assisted suicide is a controversial theme because of a lot of human lifes depend on this decision. In this debat, there is a mixture of religious beliefs, opinion of the family members and of course, moral values. Nevertheless, the theme has become more important in the recent years.
Claim 1 Data 1, 2, 3 (prospection) Mistaken DM Stranded information for Claim 1
P 2. Against the legalisation, the religious beliefs have a big importance. It could be observe how people who have a strong religious education are generally disagree with legalizing assisted suicide. Some people belive that legalizing this type of suicide, this could lead to euthanasia, and more terminally ill patients could decide to dye. People who disagree with this legalisation believe that this kind of patients show a large value of resignation.
Data 1.1.
P 3. Other important aspect in this debat could be the opinion of the family members. People who agree with the legalization of the physician-assisted suicide, probably consider that this solution would help to relieve families of the burdens of caring for terminally ill relative, because it is very exausting looking after a ill patient. Although, perhaps, family member who disagree with this idea prefer to look after their relatives for all their lifes. In total, there are an 35% of the people who do not agree with Euthanasia in various European countries.
Data 2.1.
It seems that a lot of people disagree with the legalization, but in Europe more than an 60% of the population agree with Euthanasia. People believe that a terminal ill have the right to control his/her own life of course, if the dead has a justificated cause.
Stranded data 5, better placed with Data 2.1.
P. 5 Doctor has a important paper [role] too, because they could be prosecuted for assisting in the suicide, although if the legalization of assisted suicide is approved they could not be prosecuted because this process would be legal.
Data 3 (not previously mentioned)
P. 6 In conclusion, the dead is an important aspect but, over all, it seems that it is more important the right of the human being to decide about his/her own life, and of care it is more important if they don’t want to live, although it is important to consider the opinion of the family members.
Conclusion DM: weakens conclusion (& possible contradiction of previous claims)
Stranded data 4, better placed with Data 1.1.
Revisiting apprentice texts
Using lexical bundles to investigate expert and apprentice performances in academic writing
Christopher Tribble
Early developments in corpus linguistics were driven by the needs of those interested in the description of a language, whether as grammarians or lexicographers, not the needs of language teachers. This has hindered the development of corpus applications in language education, but work by Granger and others changed the situation and it is now accepted that learner data can be a valuable resource for those concerned with language education. Drawing on Biber’s (2006) account of lexical bundles, this chapter provides a practical example of how the written production of postgraduate students in a single disciplinary area can be used to build an account of contrasts between apprentice and expert writing, and how this account can be used in the development of a course specification for English for Academic Purposes (EAP) writing.
1. Introduction
The first phase of computer corpus development in the United Kingdom was driven by linguists and lexicographers whose primary concern was to build a broad account of the written and spoken production of native speakers of the language. However, from the late 1980s onwards, there was also an interest in building corpus resources which would have relevance to the needs of foreign language learners (e.g. Summers & Rundell 1990) and, as a result of the Collins Birmingham University International Language Database (COBUILD) project at Birmingham University (Sinclair 1987), this kind of data became directly available to a group of teachers concerned with the needs of students on English for Academic Purposes programmes at Birmingham. It was this partnership between researchers and teachers which led to the development of corpus informed or Data Driven Learning (DDL) approaches to language teaching (Johns 1994). A major motivation for these early developments in corpus informed language teaching (e.g. Johns 1988, Tribble & Jones 1990, Stevens 1991) was linked to a concern over the kinds of language data which had hitherto been available for classroom use.
Sinclair later summarised this concern as follows: “Linguistics has been formed and shaped on inadequate evidence and in a famous phrase ‘degenerate data’. [...] In linguistics up till now we have been relying very heavily on speculation” (Sinclair 2004: 9). What these teachers, materials developers and researchers wanted was to give their students access to real rather than made-up language data, with the long term ambition of enabling learners to become actively involved in discovery learning themselves: “If you are able to give your students access to a PC you can also give them the chance to discover rules of language use for themselves” (Tribble & Jones 1990: 56). Although the claims of some of those involved in the attempt to use corpus data in language education were criticised as having insufficient regard for the linguistic needs of learners (see Widdowson 1991), teachers and materials developers continued to work with corpus data to prepare syllabus specifications (e.g. Willis 1990) and practical teaching materials (e.g. Thurston & Candlin 1997). During this period, work also started on the collection and use of corpus data which it was felt would be more directly relevant to language learning purposes (Willis & Willis 1988), and on the investigation of learner language with a view to better understanding the problems which learners faced in the process of language acquisition (Meara & English 1987, Granger 1993 and 1994). Alongside this research into issues in learner writing (i.e. error, overuse, underuse, etc.) work was also done to use learner language as a resource in classroom learning and syllabus design (Tribble 1989). Building on this earlier experience, Granger & Tribble (1998) offered a set of reasons for using learner corpora in the development of teaching materials, and provided some practical examples of how this might be done.
Granger & Tribble’s study focused on the contrast between native speaker (NS) and non-native speaker (NNS) writing,¹ with an emphasis on how students could improve their accuracy in the use of problematic words. They addressed two main questions in the course of their study − which forms could most usefully be presented to learners, and which model should be drawn on in the preparation of teaching materials. These questions remain relevant today and will provide the starting point for this present paper. In the following sections, I will ask anew the questions which forms and which models best meet the needs of learners, this time focusing on the needs of students of English for Academic Purposes (EAP). In answering these questions I will first discuss the value of a focus on what are variously called lexical bundles (Biber et al. 1999), clusters (Scott 2008), or, within the Natural Language Processing (NLP) community, n-Grams. I will then go on to demonstrate how corpora of expert and apprentice texts can be used in the development of practical resources to support Master’s level students who need to write English for Academic Purposes. As the texts are written by native and non-native speakers of English, and as they are potentially of a standard which could be offered for publication in journals belonging to the Applied Linguistics discourse community, I have chosen to use the terms expert and apprentice in relation to the texts in these two corpora; native and non-native are not relevant categories in the present context. Such an approach draws on Rampton’s (1990) account of expertise in foreign language production and Bazerman’s (1994) notion of ‘expert performance’. In carrying out this investigation I shall be building on Hyland’s (2008a and 2008b) studies in which lexical bundles were used to identify contrasts both between expert and apprentice performances, and between texts in different disciplinary areas. This study extends work in the area by considering contrasts between expert and apprentice performances within the same disciplinary area, thereby offering a descriptive model which can be of value to teachers and learners in English for Specific Purposes (ESP) and disciplinarily specific EAP programmes.
1. It is interesting to note how the Native Speaker and Non-native Speaker dichotomy was not contentious terminology at that time.
2. Forms and models
2.1 Which forms?
Granger & Tribble (1998) took individual word forms as the starting point for their work. However, a growing number of studies (Biber & Conrad 1999; Biber 2006; Cortes 2004; Scott & Tribble 2006; Hyland 2008a and 2008b) have demonstrated the value of the investigation of lexical bundles in differentiating between language production in different registers, genres or text populations within genres. Although these combinations have been variously defined, a useful and simple account of lexical bundles is given in Biber (2006: 134). They are simply the most frequently occurring sequences of words, such as do you want to and I don’t know what. These examples illustrate two typical characteristics of lexical bundles: they are usually not idiomatic in meaning, and they are usually not complete grammatical structures. The actual cut-off which is used to determine whether or not to include a specific lexical bundle in a study largely depends on the size of the corpus one is studying. For Biber (2006) the threshold was 40 times per million words. Scott (2008) sets a default of a minimum of 5 occurrences in the corpus under investigation in his account of clusters. As this study makes use of smaller corpora than those used by Biber, I use a lower threshold of normalised count of 3 instances per 100,000 words. This threshold has been arrived at empirically and was chosen to ensure that there was a sufficiently large set of bundles in the final results to allow useful comparisons. Apart from the issue of which threshold to use, a further consideration when selecting the lexical bundles to study is whether to choose bundles containing three words, four words or more. Most studies to date have focused on 4-word combinations. As Cortes (2004: 401) argues: “many 4-word bundles hold 3-word bundles in their structures (as in as a result of, which contains as a result)”. However, in a study which compared apprentice writing with expert performances, Scott & Tribble
(2006: 141) argue in favour of also maintaining a focus on 3-word lexical bundles, commenting that 3-word bundles offer learners ways of discovering a range of less frequent, but no less valuable phrasal combinations. They give as an example the fact that: “... most may be by far and away the most frequent immediate right collocate of ‘one of the’, but there is a wealth of other combinatorial potentials which we will lose sight of if three word clusters + their right collocates are excluded from our analysis” (ibid.: 141). Their conclusion is that: “... while 4-word clusters are strong discriminators between registers, there is a good argument for also using 3-word clusters together with their immediate right collocates in studying the contrast between different styles of writing or the product of different groups of writers” (ibid.: 142). In this present study, I will also focus on both 4-word and 3-word lexical bundles in an attempt both to identify contrasts between the ways in which expert and apprentice writers construct the texts which are required of them in order to participate in specific academic genres, and to outline a basis from which apprentice writers can enhance their performances within these genres.
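The selection procedure described in this section (count every contiguous word sequence of a given length, normalise the counts, and keep the sequences at or above a threshold) can be sketched in a few lines. The example below is an illustration only: the tokeniser, the function names and the toy text are assumptions, and it does not reproduce the behaviour of any software used in the study. The default threshold is the 3 instances per 100,000 words adopted here, and the second function lists the immediate right collocates of a 3-word bundle, in the spirit of Scott & Tribble's (2006) suggestion.

```python
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into word tokens (a simplification)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def lexical_bundles(tokens, n, per_100k=3.0):
    """Return n-word sequences whose normalised frequency meets the threshold.

    Frequencies are normalised to occurrences per 100,000 words.
    """
    if not tokens:
        return {}
    counts = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    scale = 100_000 / len(tokens)
    return {
        " ".join(gram): count * scale
        for gram, count in counts.items()
        if count * scale >= per_100k
    }


def right_collocates(tokens, bundle):
    """Count the words that immediately follow each occurrence of a bundle."""
    target = tuple(bundle.split())
    n = len(target)
    return Counter(
        tokens[i + n]
        for i in range(len(tokens) - n)
        if tuple(tokens[i:i + n]) == target
    )


# Toy text, invented for illustration; a real study would read corpus files.
sample = (
    "One of the most common problems is one of the most studied. "
    "One of the main reasons is that one of the reasons is unclear."
)
tokens = tokenize(sample)

# In so small a text every sequence clears 3 per 100,000 words, so a much
# higher threshold is used here simply to keep only the repeated 3-word bundle.
print(lexical_bundles(tokens, n=3, per_100k=15_000))
print(right_collocates(tokens, "one of the"))
```

For comparison, Biber's threshold of 40 per million words is equivalent to 4 per 100,000 words, so the cut-off of 3 per 100,000 used in this study is slightly more inclusive. Calling `lexical_bundles` with `n=4` gives the 4-word bundles under the same normalisation.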
2.2 Which models?
In foreign language teaching, the choice of which model to present to students has become multiply problematic. Concerns around notions of authenticity summarised in Widdowson’s (1990) discussion of the contrast between authentic and genuine input for language teaching can be seen as leading to the debates around the relevance of corpus data which are summarised in Seidlhofer (2003). In the teaching of pronunciation, concerns over the ownership of English, the status of native and non-native speakers of the language, and the utility of native speaker standards for pronunciation (Seidlhofer 2000) have led to Jenkins’ (2000) argument for a model for pronunciation teaching based on empirical comprehensibility criteria rather than on idealised native speaker models. In EAP writing instruction, Hyland (2002) has argued for greater specificity in the kinds of models used. This stance is in opposition to approaches which argue in favour of developing general and transferable academic competencies (Spack 1988), or to the position of proponents of Academic Literacies (Lea & Street 1998; Lillis 2001) who can be seen as arguing against the use of models at all. In this present study the choice of model is less problematic as it is possible to see a close correlation between the kinds of research papers and dissertations which Masters’ students have to write for purposes of assessment, and the research papers that they read in their field of study. Indeed, in the King’s College assessment criteria for a piece of work to achieve the highest grade it has to be: “Striking insightful, displaying for example: publishable quality, outstanding research potential1”. In this context, it is possible to see examples from journal articles in applied linguistics as the textual result of expert performances, following Bazerman (1994: 131) for whom the “expert performance describes the whole act [of composition] with all its potential variety and complexity.”
Figure 1. The Teaching/Writing Cycle (Feez 1998: 28). Stages of the cycle: 1. Building the context; 2. Modelling and deconstructing the text; 3. Joint construction of the text; 4. Independent construction of the text; 5. Linking related texts.
Exemplar texts arising from expert performances can then be exploited within a teaching/learning cycle such as that proposed by Feez (1998), shown in Figure 1, and constitute an excellent basis for an exemplar corpus. By having access to both the exemplars themselves and the results of corpus analysis, students are better placed to work through the ‘modelling and deconstructing the text’ phase of the cycle. Here we can see modelling fitting in with Bazerman’s (ibid: 193) proposal that modelling is a “process when we try out the behaviours we observe in others; it is clearly related to learning by imitation as advocated in classical rhetoric”, and with Flowerdew’s comments that:
this skill of seeking out instances of genre-dependent language modelling use in English and incorporating them in one’s own writing or speaking is not limited to foreign languages. Many native speakers make use of others’ writing or speech to model their own work in their native language, where the genre is an unfamiliar one. It is time that this skill was brought out of the closet, and exploited as an aid to learning, instead of remaining a secret activity not acknowledged by teachers. (Flowerdew 1993: 313)
In this study I shall use the term exemplar corpus to refer to a collection of texts made up of expert performances which are very closely aligned with the kinds of written production to which apprentices aspire (Tribble 2001). For technical writing students on an ESP programme this could be a set of engineering manuals; for post-graduate
Christopher Tribble
students on an in-sessional programme, it could be a set of PhD theses in disciplinarily relevant fields. In this instance the exemplar corpus I shall draw on is a collection of journal articles in Applied Linguistics. I shall use the term analogue corpus (Tribble 2001) to refer to collections of texts which are generically different from a particular group’s target performance, but which usefully share some features (both in terms of register and organisation) with these performances. Texts in analogue corpora can be close to or distant from the target behaviour of a particular group of learners. Thus, a collection of factual encyclopaedia essays can be a reasonably close analogue corpus for students on written composition courses, while a collection of newspaper editorials can constitute a distant but still useful analogue corpus for students who need to develop argumentative essay writing skills. In this study, I will use a collection of successful student writing as a close analogue corpus, and a collection of general academic writing from the British National Corpus (BNC) as a more distant analogue corpus. Alongside these I will use a collection of published academic journal articles from Acta Tropica, an international journal that covers biomedical and health sciences with particular emphasis on topics relevant to human and animal health in the tropics and the subtropics.
3. Investigating lexical bundles in apprentice and expert texts

3.1 Data
The five data sets drawn on in this study are listed in Table 1. The purpose of this selection has been to enable comparisons between lexical bundles in a corpus of apprentice written production (KCL Apprentice Writing Corpus) and a close analogue corpus (BAWE), an exemplar corpus (Applied Linguistics Corpus) and two progressively more distant analogue corpora (BNC Baby – Academic and Acta Tropica).
3.2 Method
Word (.doc) documents (KCL Apprentice Writing Corpus) and PDFs (Applied Linguistics Corpus/Acta Tropica) were converted to Unicode text using a Word 2003 macro or commercially available software and, in the case of KCL Apprentice Writing Corpus, front and end matter was separated from the body of the text with Text Encoding Initiative (TEI; see note 2) compliant tags. It was not feasible to carry out this process with the ad hoc Applied Linguistics Corpus and Acta Tropica corpora, so a certain amount of noise had to be accepted in findings from these collections.
2. http://www.tei-c.org/index.xml (accessed 28/11/09).
Table 1. Corpus data in the study

Corpus | Words
Acta Tropica (distant analogue) | 3,409,969
Applied Linguistics Corpus (exemplar) | 939,923
BAWE (close analogue) | 3,277,560
BNC Baby – Academic (medium close analogue) | 1,000,198
KCL Apprentice Writing Corpus (research corpus) | 509,891

Description

Acta Tropica – A 3.5 million word corpus of 1,071 articles from the journal Acta Tropica from 1989–2009 (http://www.elsevier.com/wps/find/journaldescription.cws_home/506043/description#description – accessed 01/12/09)

Applied Linguistics Corpus – A collection of 174 research articles in applied linguistics drawn from the following journals: Applied Linguistics; Discourse and Society; English for Specific Purposes Journal; English Language Teaching Journal; English World-Wide; International Journal of Applied Linguistics; Journal of English for Academic Purposes; Journal of Second Language Writing; Language and Communication; Language Learning and Technology; ReCALL; Studies in Higher Education; TESOL Quarterly

Corpus of British Academic Written English – “The BAWE corpus contains 2761 pieces of proficient assessed student writing, ranging in length from about 500 words to about 5000 words. Holdings are fairly evenly distributed across four broad disciplinary areas (Arts and Humanities, Social Sciences, Life Sciences and Physical Sciences) and across four levels of study (undergraduate and taught masters level). Thirty-five disciplines are represented” (http://www2.warwick.ac.uk/fac/soc/al/research/collect/bawe/ – accessed 27/11/09)

BNC Baby (Academic) – a one million word subset made up of thirty texts from the Academic Writing component of the British National Corpus (http://www.natcorp.ox.ac.uk/corpus/index.xml. ID=products#baby accessed 08-02-10)

King’s College London – Apprentice Writing Corpus – a corpus of apprentice writing donated by students on the MA programme in English Language Teaching and Applied Linguistics and BA in English Language and Communications. The corpus is under development and not publicly available as yet. For this study a subset of 119 MA level assignments and dissertations was used. 34% of texts were written by students for whom English is not a first language.
3. http://www.kcl.ac.uk/schools/sspp/education/courses/masters/elt/handbook.html (accessed 27/11/09).
UNEDITED top 5
ENGLISH FOR ACADEMIC PURPOSES | 889
OF ENGLISH FOR ACADEMIC | 791
JOURNAL OF ENGLISH FOR | 782
ENGLISH FOR SPECIFIC PURPOSES | 405
CAMBRIDGE CAMBRIDGE UNIVERSITY PRESS | 264
Figure 2. Applied Linguistics Corpus – pre-edit

EDITED top 5
ON THE OTHER HAND | 130
IN THE CASE OF | 105
AT THE SAME TIME | 101
IN THE USE OF | 70
ON THE BASIS OF | 66
Figure 3. Applied Linguistics Corpus – post-edit
Once the data was ready for processing, a WordSmith Tools Index was created for each corpus, and 3-word and 4-word lexical bundle lists were generated and saved in Excel spreadsheets. The first task was then to deal with the noise problem in the journal article collections (Applied Linguistics Corpus and Acta Tropica) by manually editing the lexical bundle lists. An example of the unedited and edited data is given in Figures 2 and 3, which show how relatively straightforward it was to spot lexical bundles which were part of the end matter (bibliography/footnotes) or text metadata. Any ambiguity was resolved by reviewing a concordance of the lexical bundle in question. Once the lexical bundle lists had been saved in an Excel workbook, all lexical bundle statistics were normalised to counts per 100,000 words. This has enabled a more meaningful comparison of the relative distributions of different lexical bundles across the corpora. However, as the main emphasis in this particular study has been on the top 40 lexical bundles in each corpus, the frequency of occurrence of individual lexical bundles has had little significance other than for ranking purposes.
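The counting and normalisation steps described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the procedure used in the study, which relied on WordSmith Tools and Excel. The tokenisation rule and the function names (tokenize, bundle_counts, normalise) are introduced here for illustration only, and the raw count of 207 is a hypothetical figure.

```python
import re
from collections import Counter

def tokenize(text):
    # Assumption: a token is any run of letters, upper-cased; WordSmith Tools
    # applies its own (configurable) tokenisation rules.
    return re.findall(r"[A-Za-z]+", text.upper())

def bundle_counts(tokens, n):
    # Count every contiguous n-word sequence in the token list.
    return Counter(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def normalise(counts, corpus_size, per=100_000):
    # Convert raw counts to counts per `per` running words.
    return {bundle: count * per / corpus_size for bundle, count in counts.items()}

text = "On the other hand, the end of the story is, on the other hand, unclear."
tokens = tokenize(text)
counts = bundle_counts(tokens, 4)
norm = normalise(counts, len(tokens))

# Hypothetical raw count in a corpus of 939,923 words (the size of the
# Applied Linguistics Corpus in Table 1).
example = normalise({"HYPOTHETICAL BUNDLE": 207}, 939_923)
```

Normalising in this way allows a list from a corpus of about half a million words to be set beside one from a corpus of over three million words without the larger corpus dominating the comparison.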
3.3 Findings: 4-word lexical bundles
The analysis of four-word lexical bundles has been carried out in relation to two main parameters: (a) the extent to which lexical bundles are shared between the corpora, and (b) the extent to which specific categories of lexical bundles are shared or not shared in the different corpora. This analysis makes use of a system of lexical bundle categorisation proposed by Hyland (2008a and 2008b). Drawing on Halliday’s (1994) metafunctions of language, Hyland groups lexical bundles from a functional perspective (Research-oriented, Text-oriented, and Participant-oriented), and then offers a more finely grained categorisation of each lexical bundle within these broad groupings. The framework is given in Figure 4.
RESEARCH-ORIENTED. Help writers to structure their activities and experiences of the real world.
  location – indicating time and place (at the beginning of, at the same time, in the present study);
  procedure (the use of the, the role of the, the purpose of the, the operation of the);
  quantification (the magnitude of the, a wide range of, one of the most);
  description (the structure of the, the size of the);
  topic – related to the field of research (in the Hong Kong, the currency board system).

TEXT-ORIENTED. These clusters are concerned with the organisation of the text and the meaning of its elements as a message or argument.
  transition signals – establishing additive or contrastive links between elements (on the other hand, in addition to the, in contrast to the);
  resultative signals – mark inferential or causative relations between elements (as a result of, it was found that, these results suggest that);
  structuring signals – text-reflexive markers which organise stretches of discourse or direct reader elsewhere in text (in the present study, in the next section, as shown in fig);
  framing signals – situate arguments by specifying limiting conditions (in the case of, with respect to the, on the basis of, in the presence of, with the exception of).

PARTICIPANT-ORIENTED. These are focused on the writer or reader of the text (Hyland 2005).
  stance features – convey the writer’s attitudes and evaluations (are likely to be, may be due to, it is possible that);
  engagement features – address readers directly (it should be noted that, as can be seen) (Hyland 2008a: 49).

Figure 4. Functional categorisation of 4-word lexical bundles
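Once bundles have been coded by hand against a framework of this kind, tallying them by broad grouping is a simple lookup. The sketch below is illustrative only: the label format (R_, T_, P_ prefixes) follows the coding used in the figures of this chapter, the handful of entries is a small sample rather than a full coding scheme, and the names CODING, broad_group and tally are assumptions introduced here.

```python
from collections import Counter

# A small hand-coded sample; a real coding table would cover every bundle analysed.
CODING = {
    "ON THE OTHER HAND": "T_transition",
    "AS A RESULT OF": "T_resultative",
    "IN THE PRESENT STUDY": "T_structuring",
    "IN THE CASE OF": "T_framing",
    "AT THE END OF": "R_location",
    "A WIDE RANGE OF": "R_quantification",
    "IT SHOULD BE NOTED": "P_engagement",
}

GROUPS = {"R": "Research-oriented", "T": "Text-oriented", "P": "Participant-oriented"}

def broad_group(label):
    # "T_framing" -> "Text-oriented"
    return GROUPS[label.split("_")[0]]

def tally(bundles):
    # Count bundles per broad grouping; anything not yet coded is reported as "uncoded".
    result = Counter()
    for bundle in bundles:
        label = CODING.get(bundle)
        result[broad_group(label) if label else "uncoded"] += 1
    return result

summary = tally(["ON THE OTHER HAND", "IN THE CASE OF", "AT THE END OF", "SEE FRONT MATTER"])
```

The coding itself remains an analytical judgement; the lookup only makes the resulting tallies reproducible.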
The first analysis reported here reviews the shared and unshared lexical bundles in the five corpora included in the study. This demonstrates the similarities and differences between the written production of the apprentice writers in the KCL Apprentice Writing Corpus and the written production of the authors included in the comparator corpora (both close and distant). The second analysis makes a direct comparison between lexical bundles in the research corpus (KCL Apprentice Writing Corpus) and the closer analogue corpora (BAWE/BNC Baby – Academic).

3.3.1 KCL Apprentice Writing Corpus compared with BNC Baby – Academic, BAWE, and Applied Linguistics Corpus + Acta Tropica

A complete summary of the 4-word lexical bundle comparison is given in Figure 5. This figure is presented here in a reduced format as, for the moment, we are only concentrating on the distribution of lexical bundles rather than on specific lexical bundle categories, but a more legible version is provided as an appendix. Dark shading indicates (a) those lexical bundles which are shared across the four main corpora (BNC Baby – Academic, BAWE, Applied Linguistics Corpus, KCL Apprentice Writing Corpus) and (b) those from this group which also occur in Acta Tropica. Paler shading with white text indicates those which occur in at least three of the main corpora, and lighter shading with black text those which occur in two of the main corpora. These different groups are also shown in the Acta Tropica set. Lexical bundles outside the shaded areas only occur in that specific corpus.
[Figure 5 table: top 40 4-word lexical bundles with normalised frequencies (norm) for BNC Baby – Academic, BAWE, Applied Linguistics Corpus and KCL Apprentice Writing Corpus]
Figure 5. 4-word lexical bundles
Two kinds of contrast emerge here. The first is the relatively small number of lexical bundles in KCL Apprentice Writing Corpus which also occur either in all three comparator corpora (8): on the other hand/in the form of/as a result of/it is important to/at the end of/on the basis of/as well as the/the end of the, or in two of these corpora (3): that there is a/in the context of/a wide range of. Although these lexical bundles are prominent in any analysis of lexical bundles in written academic English (Biber 2006: 158–159), the small number which occur in KCL Apprentice Writing Corpus is noteworthy. The smaller size of the corpus partly explains the higher proportion of topic related lexical bundles in the KCL Apprentice Writing Corpus list (the early appearance of topic related lexis is a typical feature of any word/lexical bundle list for small corpora), and the contrasting institutional roles of apprentices and experts in academic writing events can account for the absence of stance markers such as: it is possible to/it is clear that. However, the absence of core academic framing markers such as: in terms of the/in the case of/the way in which/the extent to which is striking, and indicates an area where the apprentice writers at King’s may need support to extend the ways in which they develop arguments and comment on findings.
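The sharing counts reported above amount to asking, for each bundle in the research corpus list, how many comparator lists also contain it. A minimal Python sketch follows. The short lists are illustrative samples chosen for this example, not the full top 40 lists of the study, and the function names are assumptions introduced here.

```python
from collections import Counter

def corpora_per_bundle(top_lists):
    # Map each bundle to the number of corpus lists that contain it.
    seen = Counter()
    for bundles in top_lists.values():
        seen.update(set(bundles))  # set() so a bundle counts once per corpus
    return seen

def shared_with_comparators(research, comparators):
    # For each bundle in the research list, count the comparator lists containing it.
    return {b: sum(b in set(c) for c in comparators.values()) for b in research}

# Illustrative samples only (a few bundles per corpus).
top_lists = {
    "BNC": ["ON THE OTHER HAND", "THE END OF THE", "THAT THERE IS A", "THE HOUSE OF LORDS"],
    "BAWE": ["ON THE OTHER HAND", "THE END OF THE", "THAT THERE IS A", "IT CAN BE SEEN"],
    "APPLING": ["ON THE OTHER HAND", "THE END OF THE", "THE USE OF CORPORA"],
    "KCL": ["ON THE OTHER HAND", "THE END OF THE", "THAT THERE IS A", "IN THIS ESSAY I"],
}

comparators = {name: lst for name, lst in top_lists.items() if name != "KCL"}
shared = shared_with_comparators(top_lists["KCL"], comparators)
overall = corpora_per_bundle(top_lists)
```

In this sample, a value of 3 in `shared` corresponds to a bundle found in all three comparator corpora, 2 to a bundle found in two of them, and 0 to a bundle found in the research corpus alone.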
3.3.2 Applied Linguistics Corpus vs. KCL Apprentice Writing Corpus – shared and unshared lexical bundles

The second analysis is a direct comparison between the exemplar corpus (Applied Linguistics Corpus) and the apprentice corpus (KCL Apprentice Writing Corpus). As can be seen in Figure 6, 16 of the top 40 lexical bundles are shared between these two collections (Research orientation, R_: 6; Text orientation, T_: 6; Participant orientation, P_: 4). The three shared topic related lexical bundles give an indication of the alignment of the content of the two corpora. Of the 16 shared lexical bundles in the two corpora, eight are common to all four of the corpora included in the main study: on the other hand/in the form of/as a result of/it is important to/at the end of/on the basis of/as well as the/the end of the, and can be seen as having core functions in academic registers (see Biber 2006: 158–159).

no. | APPLING | NORM | Func | KCL_AWC | NORM | Func
1 | ON THE OTHER HAND | 22 | T_transition | IT SHOULD BE NOTED | 5 | P_engagement
2 | ON THE BASIS OF | 14 | T_framing | I WOULD LIKE TO | 11 | P_stance
3 | THE END OF THE | 13 | R_location | IT IS IMPORTANT TO | 7 | P_stance
4 | IN THE CONTEXT OF | 12 | T_framing | TO BE ABLE TO | 5 | P_stance
5 | AT THE END OF | 12 | R_location | AT THE END OF | 6 | R_location
6 | A WIDE RANGE OF | 11 | R_quantification | THE END OF THE | 5 | R_location
7 | AS WELL AS THE | 10 | T_framing | A WIDE RANGE OF | 5 | R_quantification
8 | IT IS IMPORTANT TO | 9 | P_stance | AS A LINGUA FRANCA | 9 | R_topic
9 | AS A RESULT OF | 9 | T_resultative | ENGLISH AS A LINGUA | 7 | R_topic
10 | IN THE FORM OF | 8 | T_framing | NATIVE SPEAKERS OF ENGLISH | 6 | R_topic
11 | NATIVE SPEAKERS OF ENGLISH | 8 | R_topic | IN THE FORM OF | 10 | T_framing
12 | AS A LINGUA FRANCA | 6 | R_topic | ON THE BASIS OF | 6 | T_framing
13 | ENGLISH AS A LINGUA | 6 | R_topic | IN THE CONTEXT OF | 6 | T_framing
14 | I WOULD LIKE TO | 5 | P_stance | AS WELL AS THE | 5 | T_framing
15 | TO BE ABLE TO | 5 | P_stance | AS A RESULT OF | 8 | T_resultative
16 | IT SHOULD BE NOTED | 5 | P_engagement | ON THE OTHER HAND | 21 | T_transition
17 | AS CAN BE SEEN | 5 | P_engagement | I AM GOING TO | 7 | P_engagement
18 | THE FACT THAT THE | 6 | P_stance | WHEN IT COMES TO | 8 | R_location
19 | CAN BE USED TO | 5 | P_stance | IN THIS ESSAY I | 7 | R_location
20 | AT THE SAME TIME | 16 | R_location | IN THE PROCESS OF | 6 | R_procedure
21 | THE BEGINNING OF THE | 11 | R_location | IS ONE OF THE | 7 | R_quantification
22 | AT THE BEGINNING OF | 11 | R_location | THE INVOLVEMENT LOAD HYPOTHESIS | 15 | R_topic
23 | ONE OF THE MOST | 7 | R_quantification | OF ENGLISH AS A | 8 | R_topic
24 | THE REST OF THE | 7 | R_quantification | OF TEXTS HAVE YOU | 8 | R_topic
25 | AT THE UNIVERSITY OF | 11 | R_topic | TEXTS HAVE YOU WRITTEN | 8 | R_topic
26 | OF SECOND LANGUAGE WRITING | 7 | R_topic | ENGLISH LANGUAGE OF INSTRUCTION | 7 | R_topic
27 | OF ENGLISH AS A | 6 | R_topic | IN THE FIELD OF | 6 | R_topic
28 | THE USE OF CORPORA | 6 | R_topic | ELT AND APPLIED LINGUISTICS | 6 | R_topic
29 | NATIVE AND NON NATIVE | 6 | R_topic | NATIVE LIKE USE OF | 6 | R_topic
30 | OF SPOKEN AND WRITTEN | 5 | R_topic | LIKE USE OF ENGLISH | 6 | R_topic
31 | IN THE CASE OF | 19 | T_framing | AS A FOREIGN LANGUAGE | 6 | R_topic
32 | THE EXTENT TO WHICH | 10 | T_framing | IMPLICIT AND EXPLICIT KNOWLEDGE | 6 | R_topic
33 | IN THE USE OF | 10 | T_framing | IN THE EXPANDING CIRCLE | 6 | R_topic
34 | THE USE OF THE | 9 | T_framing | ENGLISH AS A SECOND | 5 | R_topic
35 | THE NATURE OF THE | 7 | T_framing | MEANING OF A WORD | 5 | R_topic
36 | THE WAYS IN WHICH | 7 | T_framing | OF THE INVOLVEMENT LOAD | 5 | R_topic
37 | IN TERMS OF THE | 7 | T_framing | THAT THERE IS A | 8 | T_framing
38 | THE RESULTS OF THE | 7 | T_resultative | THE ROLE OF THE | 7 | T_framing
39 | IN THE PRESENT STUDY | 7 | T_structuring | TO LOOK AT THE | 5 | T_framing
40 | ON THE ONE HAND | 6 | T_transition | IN THE NEXT SECTION | 6 | T_structuring

Figure 6. Applied Linguistics Corpus vs. KCL Apprentice Writing Corpus: Shared/unshared lexical bundles (see note 4)

4. All counts in this analysis have been normalised to counts per 100,000 in order to facilitate meaningful comparisons across the differently sized data sets used in this study.
Amongst the unshared lexical bundles, the high number of R_topic lexical bundles in KCL Apprentice Writing Corpus (15) is not surprising given the relative corpus sizes (Applied Linguistics Corpus 939,923 vs. KCL Apprentice Writing Corpus 509,891), and other contrasts can be accounted for by the genre contrast between the MA assignments and dissertations and published research articles. These are 4-word lexical bundles that are found in KCL Apprentice Writing Corpus alone (i.e. they do not occur in any of the other corpora in this study). Specific indicators (see Figure 7) include:

I AM GOING TO | P_engagement
WHEN IT COMES TO | R_location
IN THIS ESSAY I | R_location
IS ONE OF THE | R_quantification
THAT THERE IS A | T_framing
THE ROLE OF THE | T_framing
TO LOOK AT THE | T_framing
IN THE NEXT SECTION | T_structuring
Figure 7. Unshared lexical bundles (KCL Apprentice Writing Corpus)

What is surprising is that a significant number of Research, Text and Participant oriented lexical bundles found in Applied Linguistics Corpus are present neither in the top 40 of KCL Apprentice Writing Corpus nor in the top 100 lexical bundle list for KCL Apprentice Writing Corpus. In Figure 8 an “x” indicates that an item does not appear in the top 100 lexical bundle list in this corpus. Where a lexical bundle was present outside the top 40, a number indicates its rank position.
RESEARCH oriented | #
THE BEGINNING OF THE | x
THE REST OF THE | x
AT THE SAME TIME | 74
ONE OF THE MOST | 50

TEXT oriented | #
IN THE USE OF | x
THE USE OF THE | x
IN TERMS OF THE | x
THE RESULTS OF THE | x
IN THE PRESENT STUDY | x
ON THE ONE HAND | x
IN THE CASE OF | 88
THE WAYS IN WHICH | 87
THE EXTENT TO WHICH | 62
THE NATURE OF THE | 59

PARTICIPANT oriented | #
AS CAN BE SEEN | x
CAN BE USED TO | x
THE FACT THAT THE | 63

Figure 8. Applied Linguistics Corpus lexical bundles
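The “x or rank” convention used in Figure 8 can be expressed as a small lookup against a frequency-ranked list. The following Python sketch is illustrative only: the function name rank_or_x, the cutoff parameter and the three-item ranked list are assumptions made for this example and do not reproduce the study’s data.

```python
def rank_or_x(bundle, ranked_bundles, cutoff=100):
    # Return the 1-based rank of `bundle` in a list sorted by descending
    # frequency, or "x" if it is absent or falls outside the cutoff.
    try:
        rank = ranked_bundles.index(bundle) + 1
    except ValueError:
        return "x"
    return rank if rank <= cutoff else "x"

# Toy ranked list standing in for an apprentice corpus bundle list.
ranked = ["ON THE OTHER HAND", "I WOULD LIKE TO", "ONE OF THE MOST"]

present = rank_or_x("ONE OF THE MOST", ranked)
absent = rank_or_x("IN THE USE OF", ranked)
beyond_cutoff = rank_or_x("ONE OF THE MOST", ranked, cutoff=2)
```

Applied to each exemplar corpus bundle in turn, a lookup of this kind separates bundles that apprentices use less often (a rank between 41 and 100) from those that are effectively missing from their repertoire.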
When one considers the role that these lexical bundles have in the development of argument, the presentation and evaluation of data, and the guiding of the reader, their lower frequency of occurrence in the KCL Apprentice Writing Corpus texts indicates a contrast between the ways in which apprentice writers comment on results and organise their texts and the approaches adopted by experts in Applied Linguistics Corpus. The examples below give an indication of the importance of these lexical bundles. Sentence concordances have been used here rather than the standard KWIC format. The examples for the extent to which show how the expert writers limit their evaluations of their own or others’ research findings:

This has implications for the extent to which academic writing teachers can fruitfully adopt the concept of discourse community to contextualise student writing.
The distinctive ways in which discourse communities construct knowledge raise questions about the extent to which their discourses can exclude the prior experiences of novice participants, particularly those who may not share the same beliefs and values of the discourse community.
The extent to which each of the criteria was fulfilled was graded on a four-point scale from ‘Very much’ to ‘Not at all’.
Our next task is to review what kinds of advice or models are provided in published materials for EAP students, and to assess the extent to which these might need complementing.
The use of the results of the as a sentence theme demonstrates how expert writers use such a device to assist in foregrounding the results of research processes:
The results of the questionnaire survey thus showed that the course design was quite satisfactory to the students.
The results of the study indicate that in Conservation Biology abstracts include some moves that have been ascribed to research article introductions.
The results of the current study can be used to teach advanced level students pursuing master’s and doctoral degrees the structure of research article introductions and abstracts in their disciplines.
To make the two groups as similar as possible from a proficiency-level point of view, we also used the results of the diagnostic test which all students take at the beginning of the semester.
3.3.3 4-word lexical bundles in Acta Tropica

Acta Tropica was included in the study to demonstrate another value of analysing lexical bundles. Setting aside topic specific lexical bundles (e.g. enzyme linked immunosorbent assay, for the diagnosis of), the lexical bundles derived from this corpus also strongly contrast with those derived from Applied Linguistics Corpus (a direct counterpart), as well as BAWE and BNC Baby – Academic. However, unlike the case of KCL Apprentice Writing Corpus, where the contrast may arise from limitations in the performance repertoire of apprentice writers, the contrast in Acta Tropica arises from contrasting textual and disciplinary practices. This becomes clear the moment we consider the Text oriented lexical bundles which occur uniquely in Acta Tropica. Nearly all of these relate in one way or another to the reality of engaging in and reporting the results of research in the natural sciences. Thus we find traces of the major need to report experimental or observational results (often in tabular form): the results of the/are shown in table/were found to be/has been shown to/been shown to be; to report on collaborative research: in this study we; and to specify the specific bio-chemical conditions under which experiments were carried out: in the presence of/in the absence of.

3.3.4 Interim conclusion: 4-word lexical bundles

Thus far, this study confirms the value of 4-word lexical bundles in differentiating between text collections by author status (expert/apprentice) and disciplinary area. From a pedagogic perspective, the study offers valuable insights for teachers and students with an interest in how expert writers in a specific field construct allowable contributions at a high level. In the next section we will see how 3-word lexical bundles can be used to extend this analysis.
3.4 Findings: 3-word lexical bundles
Key findings for the 3-word lexical bundles in the main research corpora are presented in Figure 9. Again, a more legible version of the table is provided in the appendices. The distribution of 3-word lexical bundles across the corpora is similar to that observed for 4-word lexical bundles. Thirteen 3-word lexical bundles are shared between all four main corpora (as opposed to eight shared 4-word lexical bundles), and the contrast between Acta Tropica and the other four corpora is similarly marked (see Appendix 0 below). When the KCL Apprentice Writing Corpus and Applied Linguistics Corpus 3-word and 4-word lexical bundle lists are compared (see Figures 6 and 10), we find that the same level of sharing (16) occurs across the two corpora, and that Applied Linguistics Corpus contains an important group of Research and Text oriented lexical bundles which do not occur with high frequency in KCL Apprentice Writing Corpus (see Figure 11).
BNC Baby Academic | Norm | BAWE | Norm | Applied Linguistics | Norm | KCL Apprentice Writing | Norm | Acta Tropica | Norm
IN TERMS OF | 34 | IN ORDER TO | 58 | THE USE OF | 80 | IN ORDER TO | 56 | AS WELL AS | 782
PART OF THE | 30 | AS WELL AS | 36 | IN ORDER TO | 43 | THE USE OF | 51 | THE USE OF | 614
THERE IS A | 30 | THE FACT THAT | 30 | IN TERMS OF | 40 | IN TERMS OF | 41 | IN ORDER TO | 575
ONE OF THE | 27 | THE USE OF | 29 | AS WELL AS | 36 | ONE OF THE | 34 | ONE OF THE | 429
A NUMBER OF | 27 | THERE IS A | 29 | ONE OF THE | 35 | AS WELL AS | 34 | THE ROLE OF | 323
AS WELL AS | 26 | ONE OF THE | 27 | A NUMBER OF | 32 | THERE IS A | 30 | A NUMBER OF | 307
SOME OF THE | 24 | IN TERMS OF | 26 | THE FACT THAT | 26 | THE FACT THAT | 29 | THE FACT THAT | 295
THE USE OF | 23 | AS A RESULT | 19 | ON THE OTHER | 25 | SOME OF THE | 27 | THE NUMBER OF | 982
THE FACT THAT | 21 | PART OF THE | 18 | THE OTHER HAND | 23 | A NUMBER OF | 23 | THE EFFECT OF | 458
IN ORDER TO | 19 | A NUMBER OF | 17 | SOME OF THE | 23 | ON THE OTHER | 22 | IT HAS BEEN | 307
ON THE OTHER | 16 | ON THE OTHER | 15 | PART OF THE | 17 | THE OTHER HAND | 21 | THE PRESENCE OF | 1188
THE OTHER HAND | 13 | THE OTHER HAND | 14 | THERE IS A | 17 | AS A RESULT | 20 | IN THIS STUDY | 654
AS A RESULT | 11 | SOME OF THE | 13 | AS A RESULT | 15 | PART OF THE | 18 | THE PRESENT STUDY | 527
IT IS NOT | 25 | THE NUMBER OF | 18 | THE CASE OF | 25 | THE IMPORTANCE OF | 31 | THE DEVELOPMENT OF | 524
THE NUMBER OF | 24 | THAT IT IS | 18 | THE NUMBER OF | 24 | IT IS NOT | 25 | THE PREVALENCE OF | 486
THAT IT IS | 17 | IT IS NOT | 18 | THE END OF | 22 | THAT IT IS | 23 | OF PLASMODIUM FALCIPARUM | 465
THE END OF | 17 | THE IMPORTANCE OF | 15 | IN THE CASE | 20 | THE ROLE OF | 27 | OF THE DISEASE | 463
THE CASE OF | 13 | THE END OF | 15 | THE IMPORTANCE OF | 15 | DUE TO THE | 19 | DUE TO THE | 453
IN THE CASE | 12 | THE CASE OF | 14 | IN WHICH THE | 18 | THAT THERE IS | 17 | ACCORDING TO THE | 441
THERE IS NO | 24 | IN THE CASE | 12 | THE ROLE OF | 18 | THERE IS NO | 16 | IN THE PRESENT | 430
IN WHICH THE | 18 | DUE TO THE | 38 | THE BASIS OF | 16 | BE ABLE TO | 16 | MATERIALS AND METHODS | 429
THE BASIS OF | 13 | THERE IS NO | 22 | END OF THE | 15 | IN THE CLASSROOM | 36 | OF TRYPANOSOMA CRUZI | 404
THAT THERE IS | 13 | CAN BE SEEN | 21 | SUCH AS THE | 15 | ENGLISH AS A | 22 | OF T CRUZI | 389
PERCENT OF | 24 | BE ABLE TO | 18 | CAN BE SEEN | 14 | IN THE UK | 19 | OF THE PARASITE | 389
TERMS OF THE | 22 | SUCH AS THE | 17 | IN THE CORPUS | 22 | THE CONCEPT OF | 19 | THE TREATMENT OF | 386
IT MAY BE | 18 | NEED TO BE | 17 | IN ACADEMIC WRITING | 21 | THE INVOLVEMENT LOAD | 19 | SEE FRONT MATTER | 381
IN THIS CASE | 17 | THAT THERE IS | 15 | USE OF THE | 19 | IN OTHER WORDS | 19 | WAS CARRIED OUT | 371
IT HAS BEEN | 16 | IT IS A | 13 | OF IN THE | 18 | NON NATIVE SPEAKERS | 18 | WORLD HEALTH ORGANIZATION | 371
AND IT IS | 15 | IT HAS BEEN | 12 | THE PRESENT STUDY | 17 | NEED TO BE | 18 | FOUND TO BE | 361
LIKELY TO BE | 15 | END OF THE | 11 | AT THE SAME | 17 | INVOLVEMENT LOAD HYPOTHESIS | 18 | A TOTAL OF | 359
THE EFFECT OF | 13 | IT CAN BE | 21 | THE CONTEXT OF | 17 | THE TARGET LANGUAGE | 18 | THE ABSENCE OF | 355
TO BE A | 13 | TO BE A | 15 | THE BEGINNING OF | 17 | THE MEANING OF | 17 | BASED ON THE | 352
AND SO ON | 13 | A RESULT OF | 13 | THE SAME TIME | 16 | THE RELATIONSHIP BETWEEN | 17 | RIGHTS RESERVED DOI | 349
THE HOUSE OF | 13 | THE DEVELOPMENT OF | 13 | ENGLISH AS A | 16 | WOULD LIKE TO | 17 | E MAIL ADDRESS | 328
IT IS A | 13 | IN THIS CASE | 13 | THE TEACHING OF | 16 | VARIETIES OF ENGLISH | 16 | CARRIED OUT IN | 313
MANY OF THE | 12 | IT WOULD BE | 12 | IN THIS STUDY | 16 | BASED ON THE | 16 | IN REVISED FORM | 312
BUT IT IS | 12 | BE USED TO | 12 | OF THE CORPUS | 15 | MOST OF THE | 16 | RECEIVED IN REVISED | 312
IS LIKELY TO | 12 | THE PRESENCE OF | 11 | OF THE STUDENTS | 15 | THE NOTION OF | 16 | T B GAMBIENSE | 303
IN THE FIRST | 12 | IT IS IMPORTANT | 11 | THE RESULTS OF | 15 | OF THE LANGUAGE | 15 | ANALYSIS OF THE | 299
IT IS POSSIBLE | 11 | IT IS THE | 11 | ANALYSIS OF THE | 15 | THE PROCESS OF | 15 | IN THE PRESENCE | 299

Figure 9. 3-word lexical bundles across all corpora
A further feature of the 3-word lexical bundles in Applied Linguistics Corpus and KCL Apprentice Writing Corpus is illustrated in Figure 12. This figure shows the fourteen 3-word lexical bundles which are not entailed by 4-word lexical bundles in Applied Linguistics Corpus. This group constitutes a further kind of resource, which is either limited to a three-word horizon (in this paper) or is less phrasally stable, but which nevertheless collocates with a range of discoursally or disciplinarily important lexis. Examples include:
– DISCOURSE: the immediate right collocates of IN ORDER TO: understand/identify/gain/avoid/determine/ensure/investigate/control/function/provide/achieve/assess/explore/facilitate/illustrate/improve/better/complete/express/help/see/address/establish
– DISCIPLINE: the immediate right collocates of THE IMPORTANCE OF: learning/linguistic/writing/academic/frequent/genre/language/student/understanding/appropriate/aware/awareness/based/corpora
As mentioned earlier, these 3-word lexical bundle PLUS collocate combinations would not necessarily be sufficiently frequent to figure as 4-word lexical bundles, yet they constitute an important resource for learners to draw on during Phase 2 (modelling and deconstructing) of the teaching and learning cycle discussed above (Section 0).
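Both ideas in this paragraph, checking whether a 3-word bundle is entailed by a 4-word bundle and listing the immediate right collocates of a bundle, can be sketched in a few lines of Python. This is an illustration under assumptions made here: "entailed" is read as "occurs as a contiguous three-word stretch of a listed 4-word bundle", the token sequence is invented for the example, and the function names are hypothetical.

```python
from collections import Counter

def is_entailed(three_word, four_word_bundles):
    # True if the 3-word bundle is the first or last three words of any
    # listed 4-word bundle.
    target = three_word.split()
    for bundle in four_word_bundles:
        words = bundle.split()
        if words[:3] == target or words[1:] == target:
            return True
    return False

def right_collocates(tokens, bundle):
    # Count the word immediately following each occurrence of `bundle`.
    words = bundle.split()
    n = len(words)
    return Counter(
        tokens[i + n] for i in range(len(tokens) - n) if tokens[i:i + n] == words
    )

four_word = ["ON THE OTHER HAND", "IN THE CASE OF"]
# Invented token sequence for illustration.
tokens = ("WE READ IN ORDER TO UNDERSTAND AND WRITE IN ORDER TO IDENTIFY "
          "AND IN ORDER TO UNDERSTAND").split()

collocates = right_collocates(tokens, "IN ORDER TO")
```

A bundle such as THE OTHER HAND is entailed by ON THE OTHER HAND and so adds little to the 4-word analysis, whereas IN ORDER TO is not entailed by either listed bundle and is worth examining through its collocates.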
Shared 3-word lexical bundles

# | APPLING | Norm | Func | KCL_AWC | Norm | Func
1 | THE FACT THAT | 26 | P_engagement | THE FACT THAT | 29 | P_engagement
2 | THE IMPORTANCE OF | 15 | P_engagement | THE IMPORTANCE OF | 31 | P_engagement
3 | PART OF THE | 17 | R_description | PART OF THE | 18 | R_description
4 | SOME OF THE | 23 | R_description | SOME OF THE | 27 | R_description
5 | AS WELL AS | 36 | R_location | AS WELL AS | 34 | R_location
6 | THE ROLE OF | 18 | R_procedure | THE ROLE OF | 27 | R_procedure
7 | THE USE OF | 80 | R_procedure | THE USE OF | 51 | R_procedure
8 | A NUMBER OF | 32 | R_quantification | A NUMBER OF | 23 | R_quantification
9 | ONE OF THE | 35 | R_quantification | ONE OF THE | 34 | R_quantification
10 | ENGLISH AS A | 16 | R_topic | ENGLISH AS A | 22 | R_topic
11 | AS A RESULT | 15 | T_framing | AS A RESULT | 20 | T_framing
12 | IN ORDER TO | 43 | T_framing | IN ORDER TO | 56 | T_framing
13 | IN TERMS OF | 40 | T_framing | IN TERMS OF | 41 | T_framing
14 | THERE IS A | 17 | T_framing | THERE IS A | 30 | T_framing
15 | ON THE OTHER | 25 | T_transition | ON THE OTHER | 22 | T_transition
16 | THE OTHER HAND | 23 | T_transition | THE OTHER HAND | 21 | T_transition

Unshared 3-word lexical bundles

# | APPLING | Norm | Func | KCL_AWC | Norm | Func
1 | OF IN THE | 18 | fragment | BE ABLE TO | 16 | P_engagement
2 | CAN BE SEEN | 14 | P_engagement | NEED TO BE | 18 | R_stance
3 | AT THE SAME | 17 | R_location | WOULD LIKE TO | 17 | R_stance
4 | END OF THE | 15 | R_location | THE PROCESS OF | 15 | R_procedure
5 | THE BEGINNING OF | 17 | R_location | MOST OF THE | 16 | R_quantification
6 | THE END OF | 22 | R_location | IN THE CLASSROOM | 36 | R_topic
7 | THE PRESENT STUDY | 17 | R_location | IN THE UK | 19 | R_topic
8 | THE SAME TIME | 16 | R_location | INVOLVEMENT LOAD HYPOTHESIS | 18 | R_topic
9 | THE NUMBER OF | 24 | R_quantification | NON NATIVE SPEAKERS | 18 | R_topic
10 | THE RESULTS OF | 15 | R_resultative | OF THE LANGUAGE | 15 | R_topic
11 | ANALYSIS OF THE | 15 | R_topic | THE CONCEPT OF | 19 | R_topic
12 | IN ACADEMIC WRITING | 21 | R_topic | THE INVOLVEMENT LOAD | 19 | R_topic
13 | IN THE CORPUS | 22 | R_topic | THE MEANING OF | 17 | R_topic
14 | OF THE CORPUS | 15 | R_topic | THE NOTION OF | 16 | R_topic
15 | OF THE STUDENTS | 15 | R_topic | THE RELATIONSHIP BETWEEN | 17 | R_topic
16 | THE TEACHING OF | 16 | R_topic | THE TARGET LANGUAGE | 18 | R_topic
17 | IN THE CASE | 20 | T_framing | VARIETIES OF ENGLISH | 16 | R_topic
18 | IN WHICH THE | 18 | T_framing | BASED ON THE | 16 | T_framing
19 | SUCH AS THE | 15 | T_framing | IT IS NOT | 25 | T_framing
20 | THE CASE OF | 25 | T_framing | THAT IT IS | 23 | T_framing
21 | THE CONTEXT OF | 17 | T_framing | THAT THERE IS | 17 | T_framing
22 | USE OF THE | 19 | T_procedure | THERE IS NO | 16 | T_framing
23 | THE BASIS OF | 16 | T_resultative | DUE TO THE | 19 | T_resultative
24 | IN THIS STUDY | 16 | T_structuring | IN OTHER WORDS | 19 | T_transition

Figure 10. 3-word lexical bundles (KCL Apprentice Writing Corpus vs. Applied Linguistics Corpus)
Bundle | Norm | Func
CAN BE SEEN | 14 | P_engagement
AT THE SAME | 17 | R_location
END OF THE | 15 | R_location
THE BEGINNING OF | 17 | R_location
THE END OF | 22 | R_location
THE PRESENT STUDY | 17 | R_location
THE SAME TIME | 16 | R_location
THE NUMBER OF | 24 | R_quantification
THE RESULTS OF | 15 | R_resultative
IN THE CASE | 20 | T_framing
IN WHICH THE | 18 | T_framing
SUCH AS THE | 15 | T_framing
THE CASE OF | 25 | T_framing
THE CONTEXT OF | 17 | T_framing
USE OF THE | 19 | T_procedure
THE BASIS OF | 16 | T_resultative
IN THIS STUDY | 16 | T_structuring

Figure 11. Research and textual lexical bundles in Applied Linguistics Corpus which do not appear in KCL Apprentice Writing Corpus
APPLING | Norm | Func
A NUMBER OF | 32 | R_quantification
ANALYSIS OF THE | 15 | R_topic
CAN BE SEEN | 14 | P_engagement
IN ORDER TO | 43 | T_framing
IN THIS STUDY | 16 | T_structuring
IN WHICH THE | 18 | T_framing
PART OF THE | 17 | R_description
SOME OF THE | 23 | R_description
SUCH AS THE | 15 | T_framing
THE BEGINNING OF | 17 | R_location
THE IMPORTANCE OF | 15 | P_engagement
THE NUMBER OF | 24 | R_quantification
THE ROLE OF | 18 | R_procedure
THERE IS A | 17 | T_framing

Figure 12. 3-word lexical bundles not subsumed by 4-word lexical bundles in Applied Linguistics Corpus
4. From description to application
An example of how lexical bundles have been used in pedagogy might be instructive as a final step in this article. Cortes (2006) provides an account of some of the issues that can arise when lexical bundles are taught in the context of academic writing instruction in the disciplines. Working with a group of native English-speaking third- and fourth-year History students at Iowa State University, and with the active participation of a member of the History faculty, she attempted to teach a salient set of n-grams drawn from research articles in their disciplinary area. Through pre- and post-course analysis of the students’ writing, Cortes’ study attempted to assess the extent to which relevant lexical bundles could be acquired and used by participants in the study. Results were disappointing, with only a small number of the target bundles being incorporated into texts which students wrote following the two-week course of five 20-minute sessions. In a subsequent analysis of the students’ writing and published research articles, Cortes notes that one of the main reasons for this apparent failure is that students “generally favored simple conjunctions, conjuncts, and adverbs to express functions which published authors frequently convey by using lexical bundles” (Cortes 2006: 399). She concludes by commenting:
All in all, the differences in linguistic exponents (lexical bundles, adverbs, conjunctions, etc.) used to convey academic-related functions by published authors and university students present a gap that seems difficult to bridge. On the one hand, expressions like lexical bundles, which are extremely frequent in the production of published authors in history are extremely rare in student production. On the other hand, students seem to favor structurally simple expressions or single words to convey certain functions, expressions and words which, in general, are frequently used in spoken registers and do not seem to be published authors’ first choices. (ibid: 401)
Cortes herself admits that part of the reason for this apparent failure might lie in the instructional methodology she used (which depended on the introduction of a set of de-contextualised lexical bundles, with gap-fill and matching activities as a means of consolidating learning), and also in a mismatch between the present level of the students’ cognitive development as writers and their engagement with the discipline, and the target language behaviour they were expected to achieve. The lesson which I draw from Cortes’ study is that it is essential: (a) to align the exemplar corpus as closely as possible to the needs of the learners; (b) to recognise that lexical bundles express epistemologies and modes of reasoning which students may not yet be able to access; and (c) to accept that instruction should not simply be a process of presentation, practice and production, but will require the kinds of critical engagement which are implicit in genre approaches to language instruction as outlined in Section 0 above.
5. Conclusions
From my perspective, the key insight of Granger’s pioneering work on learner language (from Granger 1993 through to Granger & Paquot 2009) has always been that the production of those who are on the way to gaining fuller control of linguistic systems is as legitimate a focus for linguistic research as the hitherto privileged production of native-speaking users of a language. Indeed, in language pedagogy I would argue that unless we have a clear idea of what aspects of the language system our students use, fail to use, underuse and overuse when their production is set against that of relevant comparators, we will be hard pressed to develop useful curricula and learning programmes. In this present study, I hope that I have demonstrated that there is value in investigating the contrasts between what advanced students in a disciplinary area are able to do and the language use of experts in the same field. Clearly, lexical bundles are only part of the story; as Cortes’ study highlights, matching the exemplar corpus to learners’ current needs is also a major issue. One of the things that is now required is a serious effort to better understand how lexical bundles operate across different stages in, for example, the development of an argument, the reporting of results, and the citing of authorities, and how apprentice writers at different levels of engagement with their disciplinary areas realise such literacy practices. Despite these challenges, I would contend that lexical bundles offer a valuable starting point for a better understanding of how apprentice writers write, and of how we as teachers can help students to develop expertise in disciplinary writing.
References
Bazerman, C. 1994. Constructing Experience. Carbondale IL: Southern Illinois University Press.
Biber, D. 2006. University Language: A Corpus-based Study of Spoken and Written Registers [Studies in Corpus Linguistics 23]. Amsterdam: John Benjamins.
Biber, D. & Conrad, S. 1999. Lexical bundles in conversation and academic prose. In Out of Corpora: Studies in Honor of Stig Johansson, H. Hasselgard & S. Oksefjell (eds), 181–189. Amsterdam: Rodopi.
Cortes, V. 2004. Lexical bundles in published and student disciplinary writing: Examples from history and biology. English for Specific Purposes 23: 397–423.
Cortes, V. 2006. Teaching lexical bundles in the disciplines: An example from a writing intensive history class. Linguistics and Education 17(4): 391–406.
Feez, S. 1998. Text-based Syllabus Design. Sydney: NCELTR, Macquarie University.
Flowerdew, J. 1993. An educational or process approach to the teaching of professional genres. ELT Journal 47(4): 305–316.
Granger, S. 1993. The international corpus of learner English. In English Language Corpora: Design, Analysis and Exploitation, J. Aarts, P. de Haan & N. Oostdijk (eds), 57–69. Amsterdam: Rodopi.
Granger, S. 1994. The learner corpus: A revolution in applied linguistics. English Today 39(10/3): 25–29.
Granger, S. & Tribble, C. 1998. Exploiting learner corpus data in the classroom: Form-focused instruction and data-driven learning. In Learner Language on Computer, S. Granger (ed.), 199–209. Harlow: Longman.
Granger, S. & Paquot, M. 2009. In search of General Academic English: A corpus-driven study. In Options and Practices of L.S.P Practitioners Conference Proceedings [University of Crete Publications, E-media, 94–108], K. Katsampoxaki-Hodgetts (ed.). (12 July, 2009).
Halliday, M.A.K. 1994. An Introduction to Functional Grammar, 2nd edn. London: Edward Arnold.
Hyland, K. 2002. Specificity revisited: How far should we go now? English for Specific Purposes 21: 385–395.
Hyland, K. 2005. Stance and engagement: A model of interaction in academic discourse. Discourse Studies 7(2): 173–91.
Hyland, K. 2008a. Academic clusters: Text patterning in published and postgraduate writing. International Journal of Applied Linguistics 18(1): 41–62.
Hyland, K. 2008b. As can be seen: Lexical bundles and disciplinary variation. English for Specific Purposes 27: 4–21.
Jenkins, J. 2000. The Phonology of English as an International Language. Oxford: OUP.
Johns, T. 1988. Whence and whither classroom concordancing? In Computer Applications in Language Learning, T. Bongaerts, P. de Haan, S. Lobbe & H. Wekker (eds). Dordrecht: Foris.
Johns, T. 1994. From printout to handout: Grammar and vocabulary learning in the context of data-driven learning. In Approaches to Pedagogic Grammar, T. Odlin (ed.), 293–313. Cambridge: CUP.
Lea, M.R. & Street, B.V. 1998. Student writing in higher education: An academic literacies approach. Studies in Higher Education 23(2): 157–172.
Lillis, T.M. 2001. Student Writing: Access, Regulation, Desire. London: Routledge.
Meara, P. & English, F. 1987. Lexical errors and learners’ dictionaries. London: Birkbeck College, Applied Linguistics Group. (16 November, 2009).
Rampton, M.B.H. 1990. Displacing the ‘native speaker’: Expertise, affiliation, and inheritance. ELT Journal 44(2): 97–101.
Scott, M. 2008. WordSmith Tools version 5. Liverpool: Lexical Analysis Software.
Scott, M. & Tribble, C. 2006. Textual Patterns: Key Words and Corpus Analysis in Language Education [Studies in Corpus Linguistics 22]. Amsterdam: John Benjamins.
Seidlhofer, B. 2000. Mind the gap: English as a mother tongue vs. English as a lingua franca. Views 9(1): 51–58.
Seidlhofer, B. (ed.). 2003. Controversies in Applied Linguistics. Oxford: OUP.
Sinclair, J.M. (ed.). 1987. Looking Up. An Account of the COBUILD project. London: Collins.
Sinclair, J.M. 2004. New evidence, new priorities, new attitudes. In How to Use Corpora in Language Teaching, J.M. Sinclair (ed.), 271–297. Amsterdam: John Benjamins.
Spack, R. 1988. Initiating ESL students into the academic discourse community: How far should we go? TESOL Quarterly 22(1): 29–52.
Stevens, V. 1991. Classroom concordancing: Vocabulary materials derived from relevant, authentic text. English for Specific Purposes 10: 10–15.
Summers, D. & Rundell, M. 1990. Longman Dictionary of Contemporary English, 2nd edn. Harlow: Longman.
Thurston, J. & Candlin, C.N. 1997. Exploring Academic English: A Workbook for Student Essay Writing. Sydney: NCELTR.
Tribble, C. 1989. The use of text structuring vocabulary in native and non-native speaker writing. MUESLI News, June 1989: 11–13.
Tribble, C. 1999. Genres, keywords, teaching: Towards a pedagogic account of the language of project proposals. In Rethinking Language Pedagogy from a Corpus Perspective: Papers from the Third International Conference on Teaching and Language Corpora [Lodz Studies in Language], L. Burnard & T. McEnery (eds). Frankfurt: Peter Lang.
Tribble, C. 2001. Corpora and corpus analysis: New windows on academic writing. In Academic Discourse, J. Flowerdew (ed.), 131–149. Harlow: Addison Wesley Longman.
Tribble, C. & Jones, G. 1990. Concordances in the Classroom. Harlow: Longman.
Widdowson, H.G. 1990. Aspects of Language Teaching. Oxford: OUP.
Widdowson, H.G. 1991. The description and prescription of language. In Georgetown University Round Table on Languages and Linguistics 1991, J.E. Alatis (ed.), 11–24. Washington DC: Georgetown University Press.
Willis, D. 1990. The Lexical Syllabus. London: Collins.
Willis, J. & Willis, D. 1988. Collins COBUILD English Course, Part 1. Birmingham: Collins COBUILD.
Appendices

4-word clusters

BNC Baby Academic and BAWE (the two bundle columns are interleaved in the source and cannot be re-paired with their frequencies; values are given in source order)
Bundles: AT THE SAME TIME; THAT THERE IS A; A WIDE RANGE OF; ONE OF THE MOST; THE REST OF THE; CAN BE USED TO; THE FACT THAT THE; IT IS POSSIBLE TO; IN TERMS OF THE; IN THE CONTEXT OF; AS WELL AS THE; THE EXTENT TO WHICH; AT THE SAME TIME; IN THE FORM OF; IT IS IMPORTANT TO; AT THE END OF; IN THE CASE OF; AS A RESULT OF; THE WAY IN WHICH; IN THE CASE OF; ON THE BASIS OF; IS ONE OF THE; TO BE ABLE TO; IT IS POSSIBLE TO; THE EXTENT TO WHICH; IN TERMS OF THE; THE REST OF THE; THE WAY IN WHICH; THAT THERE IS A; ONE OF THE MOST; CAN BE USED TO; THE FACT THAT THE; ON THE BASIS OF; AT THE END OF; IN THE FORM OF; AS WELL AS THE; IT IS IMPORTANT TO; THE END OF THE; AS A RESULT OF; ON THE OTHER HAND; THE END OF THE; ON THE OTHER HAND
Norm, BNC Baby Academic: 7, 7, 7, 5, 5, 5, 5, 4, 4, 8, 11, 4, 5, 10, 6, 8, 7, 8, 9, 9, 13
Norm, BAWE: 10, 10, 12, 8, 9, 9, 9, 10, 11, 12, 13, 15, 19, 6, 13, 15, 15, 17, 18, 22, 26

Applied Linguistics | Norm | KCL Apprentice Writing | Norm | Acta Tropica | Norm
ON THE OTHER HAND | 22 | ON THE OTHER HAND | 21 | ON THE BASIS OF | 5
ON THE BASIS OF | 14 | IN THE FORM OF | 10 | AS WELL AS THE | 4
THE END OF THE | 13 | AS A RESULT OF | 8 | AS A RESULT OF | 4
AT THE END OF | 12 | IT IS IMPORTANT TO | 7 | THE END OF THE | 6
AS WELL AS THE | 10 | AT THE END OF | 6 | IN THE CASE OF | 6
IT IS IMPORTANT TO | 9 | ON THE BASIS OF | 6 | AT THE END OF | 5
AS A RESULT OF | 9 | AS WELL AS THE | 5 | ON THE OTHER HAND | 9
IN THE FORM OF | 8 | THE END OF THE | 5 | IS ONE OF THE | 3
IN THE CASE OF | 19 | THAT THERE IS A | 8 | IN THE PRESENT STUDY | 10
AT THE SAME TIME | 16 | IN THE CONTEXT OF | 6 | AT THE TIME OF | 5
IN THE CONTEXT OF | 12 | A WIDE RANGE OF | 5 | THE RESULTS OF THE | 4
A WIDE RANGE OF | 11 | TO BE ABLE TO | 5 | IN THE PRESENCE OF | 9
THE EXTENT TO WHICH | 10 | I WOULD LIKE TO | 11 | IN THE ABSENCE OF | 7
ONE OF THE MOST | 7 | AS A LINGUA FRANCA | 9 | FOR THE TREATMENT OF | 6
THE REST OF THE | 7 | OF ENGLISH AS A | 8 | IN THE TREATMENT OF | 5
IN TERMS OF THE | 7 | IS ONE OF THE | 7 | FOR THE DIAGNOSIS OF | 5
THE FACT THAT THE | 6 | ENGLISH AS A LINGUA | 7 | ENZYME LINKED IMMUNOSORBENT ASSAY | 4
CAN BE USED TO | 5 | NATIVE SPEAKERS OF ENGLISH | 6 | WERE FOUND TO BE | 4
TO BE ABLE TO | 5 | THE INVOLVEMENT LOAD HYPOTHESIS | 15 | FOR THE DETECTION OF | 4
THE BEGINNING OF THE | 11 | WHEN IT COMES TO | 8 | WAS FOUND TO BE | 4
THE USE OF THE | 9 | OF TEXTS HAVE YOU | 8 | FOR THE PRESENCE OF | 4
4-word clusters (continued)

BNC Baby Academic and BAWE (the two bundle columns are interleaved in the source and cannot be re-paired with their frequencies; values are given in source order)
Bundles: THE SIZE OF THE; IN RELATION TO THE; IT IS DIFFICULT TO; THE HOUSE OF LORDS; PER CENT OF THE; IN THE UNITED STATES; AT THE TIME OF; THE HOUSE OF COMMONS; AS SHOWN IN FIG; IN THE ABSENCE OF; RULE IN RYLANDS V; IS LIKELY TO BE; THE RULE IN RYLANDS; IN THIS CASE THE; THE NATURE OF THE; THE COURT OF APPEAL; THAT THERE IS NO; AS WE HAVE SEEN; CAN BE SEEN AS; THAT THERE IS NO; DUE TO THE FACT; IS DUE TO THE; THE ROLE OF THE; IN THE SAME WAY; IT IS NECESSARY TO; CAN BE SEEN THAT; A RESULT OF THE; TO THE FACT THAT; IT CAN BE SEEN; CAN BE SEEN IN; IN RELATION TO THE; THE SIZE OF THE; THE BEGINNING OF THE; THE USE OF THE; IT IS DIFFICULT TO; IT IS CLEAR THAT; THE NATURE OF THE; IT IS CLEAR THAT
Norm, BNC Baby Academic: 5, 4, 4, 4, 8, 7, 5, 5, 5, 5, 4, 4, 4, 4, 4, 4, 4, 4, 4
Norm, BAWE: 6, 6, 6, 6, 6, 6, 7, 7, 8, 9, 9, 11, 6, 6, 6, 7, 8, 9, 10

Applied Linguistics | Norm | KCL Apprentice Writing | Norm | Acta Tropica | Norm
NATIVE SPEAKERS OF ENGLISH | 8 | TEXTS HAVE YOU WRITTEN | 8 | HAS BEEN SHOWN TO | 4
THE NATURE OF THE | 7 | I AM GOING TO | 7 | USED IN THIS STUDY | 4
OF ENGLISH AS A | 6 | ENGLISH LANGUAGE OF INSTRUCTION | 7 | RESEARCH AND TRAINING IN | 4
AS A LINGUA FRANCA | 6 | IN THIS ESSAY I | 7 | AND TRAINING IN TROPICAL | 4
ENGLISH AS A LINGUA | 6 | THE ROLE OF THE | 7 | THE TOTAL NUMBER OF | 4
I WOULD LIKE TO | 5 | IN THE FIELD OF | 6 | UNDP WORLD BANK WHO | 4
THE WAYS IN WHICH | 7 | ELT AND APPLIED LINGUISTICS | 6 | FOR RESEARCH AND TRAINING | 4
AT THE UNIVERSITY OF | 11 | NATIVE LIKE USE OF | 6 | WAS CARRIED OUT IN | 3
AT THE BEGINNING OF | 11 | IN THE PROCESS OF | 6 | TRAINING IN TROPICAL DISEASES | 3
IN THE USE OF | 10 | LIKE USE OF ENGLISH | 6 | IN THE NUMBER OF | 3
IN THE PRESENT STUDY | 7 | AS A FOREIGN LANGUAGE | 6 | ARE SHOWN IN TABLE | 3
THE RESULTS OF THE | 7 | IMPLICIT AND EXPLICIT KNOWLEDGE | 6 | BEEN SHOWN TO BE | 3
OF SECOND LANGUAGE WRITING | 7 | IN THE EXPANDING CIRCLE | 6 | WORLD BANK WHO SPECIAL | 3
THE USE OF CORPORA | 6 | IN THE NEXT SECTION | 6 | THIS WORK WAS SUPPORTED | 3
ON THE ONE HAND | 6 | ENGLISH AS A SECOND | 5 | IN AN AREA OF | 3
NATIVE AND NON NATIVE | 6 | TO LOOK AT THE | 5 | AS WELL AS IN | 3
AS CAN BE SEEN | 5 | IT SHOULD BE NOTED | 5 | WORK WAS SUPPORTED BY | 3
IT SHOULD BE NOTED | 5 | MEANING OF A WORD | 5 | IN THIS STUDY WE | 3
OF SPOKEN AND WRITTEN | 5 | OF THE INVOLVEMENT LOAD | 5 | SPECIAL PROGRAMME FOR RESEARCH | 3
3-word clusters

BNC Baby Academic and BAWE (the two bundle columns are interleaved in the source and cannot be re-paired with their frequencies; values are given in source order)
Bundles: THAT IT IS; THE NUMBER OF; IT IS NOT; THAT THERE IS; AS A RESULT; THE BASIS OF; THE OTHER HAND; ON THE OTHER; IN WHICH THE; IN ORDER TO; THE FACT THAT; THE USE OF; THERE IS NO; SOME OF THE; IN THE CASE; THE IMPORTANCE OF; AS WELL AS; A NUMBER OF; ONE OF THE; THE END OF; THERE IS A; THE CASE OF; IT IS NOT; PART OF THE; CAN BE SEEN; THERE IS NO; DUE TO THE; IN THE CASE; THE CASE OF; THE END OF; THAT IT IS; THE NUMBER OF; SOME OF THE; THE OTHER HAND; ON THE OTHER; A NUMBER OF; PART OF THE; AS A RESULT; IN TERMS OF; ONE OF THE; THERE IS A; THE USE OF; THE FACT THAT; AS WELL AS; IN ORDER TO; IN TERMS OF
Norm, BNC Baby Academic: 17, 13, 24, 13, 25, 11, 13, 18, 16, 19, 24, 21, 12, 23, 24, 26, 17, 27, 13, 27, 30, 30, 34
Norm, BAWE: 21, 22, 38, 12, 14, 15, 15, 18, 18, 18, 13, 14, 15, 17, 18, 19, 26, 27, 29, 29, 30, 36, 58

Applied Linguistics | Norm | KCL Apprentice Writing | Norm | Acta Tropica | Norm
THE USE OF | 80 | IN ORDER TO | 56 | AS WELL AS | 26
IN ORDER TO | 43 | THE USE OF | 51 | THE USE OF | 21
IN TERMS OF | 40 | IN TERMS OF | 41 | IN ORDER TO | 19
AS WELL AS | 36 | ONE OF THE | 34 | ONE OF THE | 14
ONE OF THE | 35 | AS WELL AS | 34 | THE ROLE OF | 11
A NUMBER OF | 32 | THERE IS A | 30 | A NUMBER OF | 10
THE FACT THAT | 26 | THE FACT THAT | 29 | THE FACT THAT | 10
ON THE OTHER | 25 | SOME OF THE | 27 | THE NUMBER OF | 33
THE OTHER HAND | 23 | A NUMBER OF | 23 | THE EFFECT OF | 15
SOME OF THE | 23 | ON THE OTHER | 22 | IT HAS BEEN | 10
PART OF THE | 17 | THE OTHER HAND | 21 | THE PRESENCE OF | 40
THERE IS A | 17 | AS A RESULT | 20 | IN THIS STUDY | 22
AS A RESULT | 15 | PART OF THE | 18 | THE PRESENT STUDY | 18
THE CASE OF | 25 | THE IMPORTANCE OF | 31 | THE DEVELOPMENT OF | 18
THE NUMBER OF | 24 | IT IS NOT | 25 | THE PREVALENCE OF | 16
THE END OF | 22 | THAT IT IS | 23 | OF PLASMODIUM FALCIPARUM | 16
IN THE CASE | 20 | THE ROLE OF | 27 | OF THE DISEASE | 16
THE IMPORTANCE OF | 15 | DUE TO THE | 19 | DUE TO THE | 15
IN WHICH THE | 18 | THAT THERE IS | 17 | ACCORDING TO THE | 15
THE ROLE OF | 18 | THERE IS NO | 16 | IN THE PRESENT | 14
THE BASIS OF | 16 | BE ABLE TO | 16 | MATERIALS AND METHODS | 14
END OF THE | 15 | IN THE CLASSROOM | 36 | OF TRYPANOSOMA CRUZI | 14
SUCH AS THE | 15 | ENGLISH AS A | 22 | OF T CRUZI | 13
3-word clusters (continued)

BNC Baby Academic and BAWE (the two bundle columns are interleaved in the source and cannot be re-paired with their frequencies; values are given in source order)
Bundles: TO BE A; AND IT IS; LIKELY TO BE; THE EFFECT OF; THE HOUSE OF; IT IS A; MANY OF THE; BUT IT IS; IS LIKELY TO; IN THE FIRST; IT IS POSSIBLE; IT HAS BEEN; IN THIS CASE; TO BE A; IT MAY BE; AND SO ON; IT CAN BE; TERMS OF THE; IT IS THE; IT IS IMPORTANT; THE PRESENCE OF; BE USED TO; IT WOULD BE; IN THIS CASE; THE DEVELOPMENT OF; A RESULT OF; END OF THE; IT HAS BEEN; IT IS A; THAT THERE IS; NEED TO BE; SUCH AS THE; BE ABLE TO; PER CENT OF
Norm, BNC Baby Academic: 15, 15, 13, 13, 13, 12, 12, 12, 12, 11, 13, 16, 13, 17, 18, 22, 24
Norm, BAWE: 11, 11, 11, 12, 12, 13, 13, 13, 15, 21, 11, 12, 13, 15, 17, 17, 18

Applied Linguistics | Norm | KCL Apprentice Writing | Norm | Acta Tropica | Norm
CAN BE SEEN | 14 | IN THE UK | 19 | OF THE PARASITE | 13
IN THE CORPUS | 22 | THE CONCEPT OF | 19 | THE TREATMENT OF | 13
IN ACADEMIC WRITING | 21 | THE INVOLVEMENT LOAD | 19 | SEE FRONT MATTER | 13
USE OF THE | 19 | IN OTHER WORDS | 19 | WAS CARRIED OUT | 12
OF IN THE | 18 | NON NATIVE SPEAKERS | 18 | WORLD HEALTH ORGANIZATION | 12
THE PRESENT STUDY | 17 | NEED TO BE | 18 | FOUND TO BE | 12
AT THE SAME | 17 | INVOLVEMENT LOAD HYPOTHESIS | 18 | A TOTAL OF | 12
THE CONTEXT OF | 17 | THE TARGET LANGUAGE | 18 | THE ABSENCE OF | 12
THE BEGINNING OF | 17 | THE MEANING OF | 17 | BASED ON THE | 12
THE SAME TIME | 16 | THE RELATIONSHIP BETWEEN | 17 | RIGHTS RESERVED DOI | 12
ENGLISH AS A | 16 | WOULD LIKE TO | 17 | E MAIL ADDRESS | 11
THE TEACHING OF | 16 | VARIETIES OF ENGLISH | 16 | CARRIED OUT IN | 10
IN THIS STUDY | 16 | BASED ON THE | 16 | IN REVISED FORM | 10
OF THE CORPUS | 15 | MOST OF THE | 16 | RECEIVED IN REVISED | 10
OF THE STUDENTS | 15 | THE NOTION OF | 16 | T B GAMBIENSE | 10
THE RESULTS OF | 15 | OF THE LANGUAGE | 15 | ANALYSIS OF THE | 10
ANALYSIS OF THE | 15 | THE PROCESS OF | 15 | IN THE PRESENCE | 10
Automatic error tagging of spelling mistakes in learner corpora
Paul Rayson and Alistair Baron
Manual error tagging of learner corpus data is time consuming and creates a bottleneck in the analysis of learner corpora. This has led researchers to apply techniques from the area of natural language processing to assist in the automatic analysis of such data. This chapter presents the novel application of a hybrid approach to the detection of spelling errors in learner data. The Variant Detector (VARD) software was developed to match historical spelling variants to modern equivalents with the intention of improving the accuracy and robustness of corpus linguistics techniques when applied to historical corpora. Here, we describe its application to detect spelling errors in written learner corpora consisting of 50,000 words from each of three learner backgrounds (French, German and Spanish).
1. Introduction
As witnessed by the contributions in this book and elsewhere, computer learner corpus (CLC) research originating from the Louvain-la-Neuve team led by Sylviane Granger has contributed significantly to the description and analysis of learner errors, second language acquisition research and beyond (Dagneaux et al. 1998, Granger 1999, Granger & Thewissen 2005a, 2005b, Meunier & Granger 2008). One of the key elements of CLC research, in addition to the collection of real language output from learners, is the marking of learner errors directly in the resulting corpora. The annotation of learner errors, also known as error tagging, enables mistakes to be counted, sorted in specific ways and viewed in context. Error tagging has previously been carried out manually or semi-automatically, using software assistance in the form of computer-aided error analysis (Dagneaux et al. 1998). However, even when using intelligent editors, manual error tagging is time consuming and creates a bottleneck in the analysis of learner corpora. Recently, researchers in Intelligent Computer-Aided Language Learning (ICALL) have begun to apply results from the area of natural language processing (NLP) to learner corpora, with two workshops bringing together research on the automatic analysis of learner language in 2008¹ and 2009². The research presented in this paper continues this trend.
Spelling mistakes are among the most basic errors that learners make in their writing. In other areas of corpus linguistic research, the consideration of spelling variants is also an important issue, for example in corpora of online or internet varieties such as chat language or email communication (Gries & Myslin 2009), where novel variants (such as gr8 for great) are emerging. In applying corpus linguistic techniques to the analysis of Early Modern English corpora, spelling variants (such as avysyd instead of advised) are found to degrade the performance and robustness of techniques such as key word analysis (Baron et al. 2009b), part-of-speech tagging (Rayson et al. 2007)³ and semantic tagging (Archer et al. 2003). Hence, techniques have been developed to detect historical spelling variants and insert modern equivalents alongside the original forms. The corpus techniques can then be applied to the modern forms while retaining the original spellings.
Our contribution in this paper is to apply the Variant Detector (VARD) software (Baron & Rayson 2008), originally designed to address this issue in historical corpora, to learner data. Our aim is to evaluate VARD’s potential for the automatic detection of learner spelling errors and the insertion of corrections within the learner corpora. Patterns of learner spelling errors are more diverse (e.g. across mother-tongue backgrounds) than spelling errors resulting from typos and other native errors. The hybrid approach taken by the VARD tool is therefore expected to be of particular value in this area. Our research contributes to the understanding of automatic analysis of learner language and, if successful, will partly address the bottleneck of manual error analysis of learner corpora, because at least one type of error can be marked up automatically.
The remainder of the paper begins (in Section 2) with an introduction to the VARD tool and a description of previous work on spelling errors in learner data. In Section 3, we describe the experimental setup and the data used for the study. Section 4 presents the results and we conclude the paper in Section 5 with some suggestions for further work.
¹ https://www.calico.org/p-364-CALICO%2008%20Workshop.html
² https://calico.org/p-420-AALL09.html
³ A few similar studies have been carried out on the effect of learner errors on part-of-speech taggers (van Rooy and Schäfer, 2002).

2. Background
In addition to learner data, there are other varieties of language with significant amounts of spelling variation. One such area is historical corpora. Over recent years, vast digitisation efforts have been undertaken to create textual resources, for example the Open Content Alliance⁴, Google Books⁵ and Early English Books Online⁶. Much of this data is out-of-copyright material from the Early Modern English period. In addition, historical corpora have been compiled containing texts from the same time period; these include the Helsinki, ARCHER, ZEN, the Corpus of Early English Correspondence (CEEC), the Corpus of English Dialogues (CED) and the Early Modern English Medical Texts (EMEMT) corpus⁷.
Our research on the detection of spelling variants was driven by the need to take natural language processing (NLP) tools that were trained on modern language varieties, and apply them to historical corpora. When applied to historical data, the accuracy and robustness of existing NLP tools (e.g. part-of-speech taggers and semantic taggers) are severely reduced due to a number of factors, the most prominent of which is the presence of historical spelling variants and the resultant mismatch with the modern lexicons embedded within the tools. In addition, even the most basic corpus linguistic techniques and methods such as frequency profiling, key words, concordances and collocations are affected, because the frequency counts of a given word are dispersed across a number of different orthographic forms in the corpus (see Baron & Rayson 2009, for a summary).
Our solution to the problems of historical corpus analysis was to develop the Variant Detector (VARD) software, to act as a pre-processor for NLP and corpus tools. The initial aim of the software was to process the corpus data and insert modern equivalents alongside the historical variants. The historical variants were preserved within an XML tag, but the subsequent NLP or corpus tools ‘saw’ only the modern equivalent for tagging or searching, for example: company
The original version of the VARD tool (Rayson et al. 2005) exploited a large manually compiled list of c. 45,000 entries, each consisting of an historical variant linked to its modern equivalent. The pre-processing consisted of a search and replace operation on the corpus data. This first version was useful for corpora on which it had been trained but suffered due to its fixed list of variants when applied to previously unseen corpora. Due to the nature of historical spelling variation, listing all of the possible variants was shown not to be a scalable solution. VARD2⁸ (Baron & Rayson 2008) was then developed to address this limitation by incorporating a hybrid of other methods for detecting variants and finding candidate modern equivalents, using techniques embedded in spell checkers such as those contained in word processors (e.g. Microsoft Word).

⁴ http://www.opencontentalliance.org/
⁵ http://books.google.com
⁶ http://eebo.chadwyck.com/home
⁷ See the Corpus Resource Database for more information: http://www.helsinki.fi/varieng/CoRD/
⁸ VARD 2 is freely available for academic research from http://www.comp.lancs.ac.uk/~barona/vard2/

The process incorporated in the current version of VARD (2.2) employs the following steps:
1. Compare each word in the input text to a large and broad coverage modern word list derived from the British National Corpus and the Spell Checking Oriented Word List (SCOWL)⁹. If the input text word is not found in the modern list then mark it as a potential variant.
2. For each potential variant, produce a list of candidate modern equivalents using the four techniques below and rank the resulting list with a confidence score based on the weighted combination of techniques used to find each candidate:
a. Known variants list, i.e. the manually created list of historical variants and modern equivalents
b. Phonetic matching algorithm, adapted from the Soundex phonetic algorithm that is used to assign the same code to homophones in order to match them despite small spelling differences
c. Letter replacement rules, representing common patterns of alternative spellings, e.g. ‘replace u with v’
d. Edit distance, which records the number of edits (insertions, deletions and substitutions) required to transform the variant to its equivalent
3. In the interactive version of VARD2, present the resulting ranked list to the user alongside each variant, allowing the user to choose the best modern replacement (in a similar way to how corrections are displayed in word processors).
A non-interactive ‘batch’ version of the tool also exists which can perform automatic insertion of the highest ranked modern equivalents that have scores above a certain threshold. The confidence score that is used to rank the candidate modern equivalents is based on a weighted combination of the four methods listed above (see 2.a-d).
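The detection and ranking steps listed above can be illustrated with a small sketch. This is not VARD's implementation: the classic Soundex code stands in for VARD's adapted phonetic algorithm, the scoring formula (summed weights of the methods that proposed a candidate, divided by the total weight) is only one plausible reading of "weighted combination", and the word list, known variants, rules, weights and `max_edits` cut-off are all invented for the example.

```python
def soundex(word):
    """Classic four-character Soundex code (stand-in for VARD's adapted version)."""
    groups = {"1": "bfpv", "2": "cgjkqsxz", "3": "dt", "4": "l", "5": "mn", "6": "r"}
    codes = {ch: digit for digit, letters in groups.items() for ch in letters}
    word = word.lower()
    result = word[0].upper()
    prev = codes.get(word[0], "")
    for ch in word[1:]:
        code = codes.get(ch, "")
        if code and code != prev:
            result += code
        if ch not in "hw":          # h and w do not separate equal codes
            prev = code
    return (result + "000")[:4]

def edit_distance(a, b):
    """Levenshtein distance: insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def apply_rules(word, rules):
    """Apply each (old, new) letter replacement rule once, at every position."""
    out = set()
    for old, new in rules:
        start = word.find(old)
        while start != -1:
            out.add(word[:start] + new + word[start + len(old):])
            start = word.find(old, start + 1)
    return out

def rank_candidates(word, lexicon, known, rules, weights, max_edits=2):
    """Return (score, candidate) pairs, best first; [] if the word is in the lexicon."""
    word = word.lower()
    if word in lexicon:             # step 1: not a potential variant
        return []
    found = {}                      # candidate -> set of methods that proposed it
    if word in known:
        found.setdefault(known[word], set()).add("known")
    for cand in apply_rules(word, rules):
        if cand in lexicon:
            found.setdefault(cand, set()).add("rules")
    code = soundex(word)
    for entry in lexicon:
        if soundex(entry) == code:
            found.setdefault(entry, set()).add("phonetic")
        if edit_distance(word, entry) <= max_edits:
            found.setdefault(entry, set()).add("edit")
    total = sum(weights.values())
    scored = [(sum(weights[m] for m in methods) / total, cand)
              for cand, methods in found.items()]
    return sorted(scored, key=lambda pair: (-pair[0], pair[1]))

# Hypothetical toy resources, invented for this example.
lexicon = {"advised", "company", "love", "have", "avoid"}
known = {"companie": "company"}
rules = [("u", "v"), ("y", "i"), ("ie", "y")]
weights = {"known": 8, "rules": 5, "phonetic": 4, "edit": 3}

print(rank_candidates("loue", lexicon, known, rules, weights))
print(rank_candidates("companie", lexicon, known, rules, weights))
```

A batch mode in the spirit of the description would simply accept the top-ranked candidate whenever its score exceeds a chosen threshold and leave the word untouched otherwise.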
Initial weights are assigned to each method based on our previous experience with the tool, but when applied to text the tool recalculates these weights based on the number of times that a method is used to find a candidate replacement that is subsequently chosen by the user in the interactive tool. Hence, during a training phase these weights can change substantially over time. This might reflect where, for example, in a particular historical corpus the pre-built list of known variants is not as suitable as expected and where the letter replacement technique is more often successful at suggesting chosen candidate modern equivalents. It is this capability to ‘learn’ that makes VARD2 a tool worthy of consideration for other situations where spelling variation occurs. It allows the tool to be retrained by first applying it to sample texts from a particular corpus and then running it in a non-interactive mode over the remainder of the corpus. The interactive version of VARD2 also permits users to customise the tool by adding to the built-in list of known variants, replacing or extending its modern lexicon (e.g. for a different language) and adding new letter replacement rules. The learning method is described in more detail in Baron & Rayson (2009), where the tool is evaluated on a child language corpus in addition to an historical dataset.

⁹ See http://wordlist.sourceforge.net/scowl-readme

In addition to VARD2, we have developed a complementary tool called DICER (Discovery and Investigation of Character Edit Rules). DICER takes a corpus previously standardised manually through VARD, or a list of variant and equivalent mappings, and extracts a database of letter replacement rules and their frequencies of use in the input text. This frequency analysis of letter replacement patterns can be used subsequently to create a new set of letter replacement rules for VARD2 or to find extensions or restrictions on the existing set of rules. The aim for DICER was to improve the accuracy of VARD2 after manual training, and we have shown that a significant increase in performance follows (Baron et al. 2009a). In addition, DICER results can be used to study spelling variation in a given corpus, e.g. to find changes in spelling patterns over time or in child language data. In this paper, we will apply DICER to detect patterns of misspelling in learner data.
Until recently, there has been a dearth of work focussing on the spelling errors of English as a Foreign Language/English as a Second Language (EFL/ESL) learners. As Granger (2003) points out, error-tagged corpora of learner language are especially useful for second language acquisition (SLA), foreign language teaching (FLT) research and computer-assisted language learning (CALL). As well as the research on learner corpora, research on spelling errors falls across multiple areas within linguistics, second language acquisition, language learning and teaching, psycholinguistics, and educational research. An early study by Ibrahim (1978) categorised the spelling errors made by a group of Arab EFL learners in the Department of English at the University of Jordan.
The size of the group of learners was unspecified and the resulting categorisation described a list of error types rather than tokens. Bebout (1985) compared misspellings made by first (English speaking children) and second language learners (Spanish-speaking adults learning English). She described how the ESL-speakers made more errors involving consonant doubling. Zutell & Allen (1988) examined the spelling strategies of 108 English-Spanish bilingual children and found a Spanish phonological influence on the spelling of English words. Wade-Woolley & Siegel (1997) examined the spelling performance of 79 children and found that second language (ESL) speakers performed in a similar manner to native speakers. Cook (1997) compared the spelling of adult second language (L2) users of English with that of first language (L1) children and L1 adult users. Cook’s source data was corpus-based and taken from a mixture of exam scripts, essay samples from assessment tests, and essays produced not under exam conditions. Wang & Geva (2003) carried out a longitudinal study of 35 Chinese ESL children and found an L1 transfer effect in relation to two English phonemes. Figueredo’s (2006) extensive review article incorporated an analysis of 27 other papers considering the influence of first language upon ESL learners’ spelling, including those mentioned above. The majority of the studies cited used a descriptive qualitative approach and supported the hypothesis that there is a relationship between the first language of the ESL learner and development of skills in English spelling.
What all of these studies have in common is the manual approach to finding spelling errors in language data, the resultant small size of the data sets and the variety of patterns observed from different learner backgrounds. Since the late 1990s, the amount of research activity involving the collection and annotation of learner corpora has grown significantly (Tono 2003) and it is the case that manual error analysis is still predominantly the norm. In this paper, we aim to address this issue by piloting a hybrid approach to the automatic discovery and correction of learner spelling errors. The eventual aim is to allow much larger datasets to be analysed automatically, thus improving the reliability and replicability of the research. We follow in the footsteps of other corpus-based studies of spelling errors from language learners. Lefer & Thewissen (2007) studied orthographic and morphological errors in learner argumentative essays using samples from the Spanish, German and French components of the International Corpus of Learner English (ICLE) representing intermediate to advanced learner writing. They compared a manual approach using the second version of the Louvain error tagging system (Dagneaux et al. 2005) with an automatic approach which identified unknown words as a side effect of semantic tagging using the UCREL Semantic Analysis System (USAS) tagger (Rayson et al. 2004). In order to use these automatic results, a significant amount of manual weeding out was needed (around half of the results) since the USAS tagger marked as unknown (and therefore as candidates for spelling errors) proper nouns and other words that were not in its lexicon. The manual approach was shown to be better at identifying contextual and capitalisation errors e.g. woman for women and church for Church. On average the manual approach identified around 14% more learner errors. 
Lefer & Thewissen’s paper was itself a revisitation of an earlier study by Granger & Wynne (1999) to explore the effect of learner corpora on measures of lexical variation such as type-token ratio. Granger & Wynne showed that such measures should be considered unsafe on learner corpus data because it contains non-standard word forms (i.e. spelling errors and nonstandard coinages) which will unduly boost type-token ratio measures. As a side effect of their research, they were able to extract these non-standard forms using an earlier version of the USAS system. Milton & Chowdhury (1994) reported on the corpus-based analysis of interlanguage which included the manual markup of spelling errors in a written corpus of Chinese learners of English. Nicholls (2003) described the error coding in the Cambridge Learner Corpus which included the markup of spelling errors in a very large (6-million word) component. However, this data is not publicly available.

Recently, research in Intelligent Computer-Aided Language Learning (ICALL) has applied natural language processing techniques to learner corpora with the aim of improving spelling checkers for L2 writers in addition to informing second language acquisition research. Some research in this area has found its way into prototype applications. The Microsoft Research ESL Assistant¹⁰ targets common errors made by native speakers of East Asian languages (Chinese, Japanese and Korean) (see Gamon et al. 2008, 2009). Check My Words¹¹ is targeted at Chinese learners of English in Hong Kong and enables learners to check their vocabulary and grammar online (Milton 2004). However, still further progress is required. Rimrott & Heift (2008) evaluated the performance of the spell checker in Microsoft Word on 1,027 spelling error types (1,808 tokens) produced by L2 writers of German and found that only 62% of the errors were successfully detected and corrected. This was mainly due to L2 learner errors in their corpus containing multiple erroneous letters rather than one single erroneous letter, as is more expected in native writing. This estimate is confirmed in a study by Hovermale (2008), who found that one third of the errors made by Japanese learners of English were not detected and corrected by standard spell checkers. As with spell checking of native language, the detection and correction of spelling errors for learner language must deal with both non-word and real-word errors. For real-word errors, where the error is contextual (e.g. their instead of there), the problem has extra complexity in the learner data because the error most likely sits with other errors in the surrounding context and there are more possible deviations from the target word (Hovermale & Mehay 2009). Lee (2009) has extended the work on spelling error correction to that of grammatical error correction using syntactic analysis.

10. http://research.microsoft.com/en-us/projects/msreslassistant/
11. http://mywords.ust.hk/

3. Experiment

Having described the previous research in this area, we now turn to the evaluation of the detection and correction abilities of the VARD software on learner data. In this section, we discuss the experimental methodology. The data that we used for this experiment is an expanded version of the set that was used to study orthographic and morphological errors in learner writing (Lefer & Thewissen 2007). That study used 30,000 words each from three learner populations (Spanish, German and French) with data drawn from the ICLE corpus. Since the 2007 study, this data set has been expanded by Jennifer Thewissen and incorporates c. 50,000 words per L1 background. The data is marked up for all types of learner error. However, we focussed on the spelling and morphological errors, marked as (FS), (FM) and (GADJN) in the corpus. The descriptions of the relevant tags under the form (F) category in the ICLE corpus error tagging manual (Dagneaux et al. 2005) are as follows:

(FS) includes all spelling errors. It is also used for misuse or omission of capital letters, word coinages (those which do not belong to the (FM) category), borrowings, homophones (e.g. it’s/its, their/there), doubling of consonants/vowels, and misuse/omission of hyphens/blanks in compound words.

(FM) is used for morphological errors (inflectional and derivational). Inflectional errors result from the misuse of grammatical morphemes (plural, genitive, verb morphology, degree of adjectives, etc.) while derivational errors are due to the addition of an erroneous affix to an existing word.
Learner errors are marked in the ICLE corpus with an error tag in round brackets followed by the original error and then the corrected form which is enclosed in dollar signs. In order for the data to be processed by VARD and DICER, it was converted into an XML format. All other error tags and corrections were removed with a Perl script leaving only the spelling errors. For example, for the original text: If we believe the boulevard newspapers and the talk shows and especially the commercials, the human race is on (FS) it’s $its$ way to perfection. We are offered the right soap for dry or greasy skin, the perfect collar for (GDO) your $our$ dachshund or (FS) sheep dog $sheepdog$, the (FS) imacculate $immaculate$ photo for (GDO) your $our$ wedding album, the ideal book for (GDO) your $our$ (LS) difficulties $problems$ in maths and biology, the best fitness (FS) programm $program$ for (GDO) your $our$ (LP) life belts $excess weight$ – or rather against them.
The converted version was as follows: If we believe the boulevard newspapers and the talk shows and especially the commercials, the human race is on its way to perfection. We are offered the right soap for dry or greasy skin, the perfect collar for your dachshund or sheepdog, the immaculate photo for your wedding album, the ideal book for your difficulties in maths and biology, the best fitness program for your life belts – or rather against them.
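The conversion step can be illustrated with a short Python sketch. It is not the Perl script used for the experiment: the `<variant orig="...">` element is a hypothetical XML shape chosen for illustration, and the regular expression assumes that every error tag has the form `(TAG) error $correction$` as in the example above. Tags in `KEPT_TAGS` keep the correction together with the learner form, while all other tags are removed and the learner's original wording is left in place.

```python
import re
from xml.sax.saxutils import escape, quoteattr

# Error categories retained for the spelling experiment (from the text above).
KEPT_TAGS = {"FS", "FM", "GADJN"}

# (TAG) learner form $corrected form$ ; the learner form may span several words.
# Assumption: no '(' or '$' occurs inside the learner form itself.
ERROR = re.compile(r"\((?P<tag>[A-Z]+)\)\s*(?P<err>[^$(]+?)\s*\$(?P<corr>[^$]*)\$")


def convert(text: str) -> str:
    """Convert inline error tags to a hypothetical XML markup."""
    parts = []
    last = 0
    for m in ERROR.finditer(text):
        parts.append(escape(text[last:m.start()]))
        if m["tag"] in KEPT_TAGS:
            # Keep the correction and record the learner form as an attribute.
            parts.append(f"<variant orig={quoteattr(m['err'])}>{escape(m['corr'])}</variant>")
        else:
            # Drop the tag and the correction, keep what the learner wrote.
            parts.append(escape(m["err"]))
        last = m.end()
    parts.append(escape(text[last:]))
    return "".join(parts)
```

For instance, `convert("on (FS) it's $its$ way")` wraps the corrected form `its` in a `variant` element whose `orig` attribute holds `it's`, whereas `convert("for (GDO) your $our$ dachshund")` simply yields the learner's text without the tag.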
In addition, other spelling errors occur when a general rule of grammar is broken, in which case the error is classified under the Grammar (G) category in the error tagging manual e.g. (GADJN) poors $poor$ people. Here, the word poors is a spelling error, in the sense that the attachment of the ‘s’ breaks the rule that all adjectives are invariable in English. These errors were also included in our experimental dataset. The converted corpus was fed into VARD which enabled us to calculate the accuracy of the VARD tool. The manually marked up corpus was used as a gold-standard for the first analysis. Subsequently, the manually marked up corpus was split into training and test sets in order to see whether VARD’s learning abilities can be used to retrain
the tool from the detection of Early Modern English variants to those produced by language learners. In parallel to the VARD analysis, the manually marked up corpus was loaded into the DICER tool and the results are discussed in the following section.
4. Results

Using the manually corrected corpus as a gold-standard in the DICER tool, we can produce a variety of statistics of interest. Although these results are not the main focus of this paper, it is worth highlighting the differences between the corpora in order to place our later VARD analysis in context. First, we can compare the profiles for edit distance. As mentioned in Section 2, edit distance is the number of edits (insertions, deletions and substitutions) that are needed to change an original word form produced by the learner into the corrected form that has been manually inserted. It should be noted that neither DICER nor VARD takes account of corrections (i.e. edits) due to capitalisation, and these are given an edit distance of zero. Without contextual knowledge, it is difficult to correctly predict capitalisation errors. Overall, there are 1765 spelling errors manually marked in the three corpora, of which 307 are due to capitalisation errors. Table 1 shows the percentage of corrections (and therefore learner errors) that are due to capitalisation. The rate in the German corpus (13.7%) is noticeably lower than in the French and Spanish corpora. Turning now to errors which are not due to capitalisation, Table 2 shows the profile of edit distance for the overall corpus and then broken down by learner background.

Table 1. Percentage of replacements due to capitalisation

                                     All      French   German   Spanish
All replacements                     1765     351      432      982
Replacements due to capitalisation   307      67       59       181
                                     17.4%    19.1%    13.7%    18.4%

Table 2. Edit distance profile

Edit distance   All %   French %   German %   Spanish %
1               75.0    69.7       83.1       73.0
2               14.2    16.2        8.6       16.1
3                4.0     5.3        2.9        4.1
4                2.0     2.5        1.9        1.9
5+               4.8     6.3        3.5        4.9

Rimrott & Heift (2008) found that only 62% of the L2 misspellings in their dataset were corrected by Microsoft Word. This was mainly due to many L2 misspellings
in their dataset containing multiple-edits. Our results show that an average of 75% of the learner errors have an edit distance of one, representing the insertion, deletion or substitution of one character between the learner error and the corrected form. An average of 14.2% of the errors have an edit distance of two from the corrected form, although the German corpus shows a significantly lower percentage (8.6%). Beyond this point, there are much smaller numbers of errors: 4.0% showing an edit distance of three overall, 2.0% with an edit distance of four overall and then 4.8% of the corrections have an edit distance of five or more. Using DICER, we can also detect where in each word the corrections are required. Table 3 shows the location of these changes within words of each corpus. It can be seen that French learners are more likely than German learners to make spelling errors at the end of words and Spanish learners even more so. By contrast, German learners are much more prone to make spelling errors towards the middle of words than French and Spanish learners. The DICER analysis also permits the counting of what types of spelling errors are being made and, in addition, to group them by type of edit, i.e. by insertion, deletion and substitution. Table 4 shows these comparative results that are based on types of rules. From these initial results, we can hypothesise that Spanish learners make more substitution errors than French and German learners. It should be noted that this table is based on numbers of types rather than tokens. Therefore, the overall figures are not an average of the three groups. The higher overall percentage of substitution errors (76.2%) is due to a lack of overlap between specific types of substitution errors, i.e. the three corpora do not share as many types of substitution errors (as they do with deletions and insertions). 
The implications of this difference between learner groups will also emerge in the VARD analysis that is to follow.

Table 3. Position of corrections within words

Position of correction   All %   French %   German %   Spanish %
Start                     7.2     7.3        4.9        8.2
Second                    8.4     7.9        6.8        9.3
Middle                   56.7    60.6       66.8       50.8
Penultimate               8.6     7.9       10.7        7.8
End                      19.2    16.4       10.7       23.9

Table 4. Type of spelling errors made

Type of correction   All %   French %   German %   Spanish %
Deletion             12.1    15.9       17.4       12.6
Insertion            11.7    14.7       16.5       12.6
Substitution         76.2    69.4       66.1       74.8
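A rough approximation of this kind of analysis, classifying each correction by type of edit and by position within the word, can be sketched with Python's standard `difflib` module. This is not DICER's algorithm: the alignment produced by `SequenceMatcher` is not guaranteed to be minimal, the boundaries of the five position categories are assumptions, and the sketch counts edit tokens, whereas Table 4 is based on rule types.

```python
from collections import Counter
from difflib import SequenceMatcher
from typing import Iterable, List, Tuple

KINDS = {"replace": "substitution", "delete": "deletion", "insert": "insertion"}


def position(index: int, length: int) -> str:
    """Map a character index in the learner form to a position category.

    The category boundaries are assumptions made for this sketch.
    """
    if index == 0:
        return "start"
    if index >= length - 1:
        return "end"
    if index == 1:
        return "second"
    if index == length - 2:
        return "penultimate"
    return "middle"


def extract_edits(learner: str, corrected: str) -> List[Tuple[str, str, str, str]]:
    """Return (type, learner characters, corrected characters, position) per edit."""
    a, b = learner.lower(), corrected.lower()  # capitalisation is ignored
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    edits = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op != "equal":
            edits.append((KINDS[op], a[i1:i2], b[j1:j2], position(i1, len(a))))
    return edits


def profile(pairs: Iterable[Tuple[str, str]]) -> Tuple[Counter, Counter]:
    """Tally edit types and positions over (learner, corrected) pairs."""
    kinds, positions = Counter(), Counter()
    for learner, corrected in pairs:
        for kind, _old, _new, pos in extract_edits(learner, corrected):
            kinds[kind] += 1
            positions[pos] += 1
    return kinds, positions
```

Applied to the pair `("programm", "program")`, for example, the sketch reports a single deletion of `m` at the end of the word; a pair that differs only in capitalisation yields no edits, in line with the treatment of capitalisation described above.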
The DICER software would also allow us to investigate further differences in terms of specific learner spelling errors. However, such an analysis is out of scope for this study and remains future work. Now, we turn to the results arising from the application of the VARD tool to the data. During our initial experiments on the learner corpus data, we made some improvements to the VARD software over and above those described in our recent work (Baron & Rayson, 2009). These included improvements to how the known variants list and letter replacement rules are used. For the known variants list, each individual variant to modern equivalent mapping is now assigned a precision and recall score which contributes to the confidence score for a candidate variant replacement. This allows more fine-grained training of VARD as some entries in the list will be of more value to the current dataset than others (e.g. entries of Early Modern English are likely to be of much less use for learner errors than new additions to the list from training). The letter replacement technique was also improved to allow more specific additions to the rule list from the DICER analysis; previously a rule’s application position was limited to start, end or anywhere. Middle, second and penultimate positions have been added to bring the letter replacement method in line with the DICER analysis. Taking the manually corrected corpus as a gold-standard, we are able to count the number of ‘real-word errors’ using VARD. These are learner spelling errors that match other words in the VARD dictionary, e.g. it’s (learner) for its (corrected) and the (learner) for they (corrected). In addition, the real word error category includes a large number of corrections due to removing or inserting spaces or hyphens e.g. match box (learner) for matchbox (corrected) and mountain-bikes (learner) for mountain bikes (corrected). 
Without taking into account local context, VARD and other spell-checking tools are unable to spot these errors. Table 5 shows the percentages of types and tokens that are real-word errors. The Spanish learners represented in the corpus make the lowest proportion of spelling errors that are also real words, while the German learners make over double that proportion, although they make fewer errors overall. If we include errors due to capitalisation, then the overall percentages increase to 27.7% for types and 33.8% for tokens. This corresponds roughly with a third of the tokens in the Rimrott & Heift (2008) study which Microsoft Word was unable to correct. The large number of real-word errors causes a significant problem for VARD as discussed below.

Table 5. Real-word errors as a percentage of total spelling errors

         All %   French %   German %   Spanish %
Types    19.7    20.7       36.4       12.7
Tokens   21.9    22.6       35.9       15.2

Table 6. VARD recall and precision before training

            Types %   Tokens %
Recall       7.8       7.6
Precision   87.9      88.9

Given that we have a manually checked corpus where all the learner errors have been marked and corrected, we can apply the VARD software to this data and calculate
the number of learner spelling errors that it detects and how many it corrects. By comparing the automatic method as represented by VARD with the manual method from the corpus, we can calculate how accurate the tool is. It is worth reiterating at this point that the VARD tool was designed to detect historical spelling variants in Early Modern English and the resources it contains in terms of known variants list and letter replacement rules have been trained for this kind of data. Applying the tool as is, without any training, we observe the results shown in Table 6. The accuracy (precision) of the untrained tool is high with a success rate of almost 90%. However, the number of learner spelling errors that it finds and tries to correct (recall) is low, under 10%. There is a compromise to be made between recall and precision; if we wish to lower the thresholds in the tool and attempt to correct more errors, then the accuracy will fall. Two scenarios can be imagined. First, where the analyst will carry out manual spot checks on the data output from VARD. In this case, lower precision and higher recall are acceptable. Second, where the analyst wishes to run a very large amount of data without carrying out any post-editing. Here, higher precision is preferred since we would not want the tool to introduce corrections where they are not needed (false positives). In addition to the learner spelling errors that are manually marked in the data set, VARD also considers the remaining words in the text. Therefore, it may also detect other candidates for learner errors that have not been marked as such by the human analyst, e.g. und12. The rate of false positives is very low as reported in Table 7. From the point of view of our experiment, these can be viewed as mistakes by VARD. However, if it can help identify untagged learner spelling errors, VARD can then be used as a tool to assist the human analyst with manual error tagging. 
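The trade-off between recall and precision under a replacement threshold can be made concrete with a small Python sketch. The definitions used here are assumptions chosen to match the description above, not necessarily the exact formulas behind the reported figures: recall is the share of manually marked errors for which a replacement is made, precision is the share of those replacements that agree with the manual correction, and a false positive is a replacement made on a word that the analyst did not mark. The `Token` structure and the default threshold are hypothetical.

```python
from typing import Dict, Iterable, NamedTuple, Optional


class Token(NamedTuple):
    learner: str               # form written by the learner
    gold: Optional[str]        # manual correction, or None if not marked as an error
    suggestion: Optional[str]  # top-ranked automatic candidate, if any
    confidence: float          # confidence score of that candidate (0..1)


def evaluate(tokens: Iterable[Token], threshold: float = 0.8) -> Dict[str, float]:
    """Compute recall, precision and false positives for a given threshold."""
    gold_errors = attempted = correct = false_positives = 0
    for t in tokens:
        replaced = t.suggestion is not None and t.confidence >= threshold
        if t.gold is not None:
            gold_errors += 1
            if replaced:
                attempted += 1
                correct += int(t.suggestion == t.gold)
        elif replaced:
            false_positives += 1  # replacement where no error was marked
    return {
        "recall": attempted / gold_errors if gold_errors else 0.0,
        "precision": correct / attempted if attempted else 0.0,
        "false_positives": float(false_positives),
    }


# Toy data: four marked errors and one unmarked word.
SAMPLE = [
    Token("programm", "program", "program", 0.95),
    Token("imacculate", "immaculate", "inarticulate", 0.85),
    Token("definitly", "definitely", "definitely", 0.60),
    Token("the", "they", None, 0.0),   # real-word error, not detected
    Token("und", None, "and", 0.55),   # unmarked word
]
```

On this toy sample, lowering the threshold from 0.8 to 0.5 raises recall from one half to three quarters, but it also introduces a replacement for the unmarked word `und`, which is the kind of false positive discussed above.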
The second stage of our experiment is to use part of the manually corrected corpus to train the VARD tool on the type of spelling errors that learners make. Our approach has been to use three quarters of each corpus as training material and one quarter for testing. The training and test sections are sampled randomly into 500 (±10) word sections.

Table 7. VARD false positive rate before training

                  Types %   Tokens %
False positives   5.8       4.6

12. This was deliberately left untagged by the analyst since it occurs in German book titles.

Table 8. VARD recall and precision after training

            Types %   Tokens %
Recall      13.8      16.2
Precision   87.9      90.7

We use a replacement threshold of 80% in VARD for all experiments
reported. Following this training process, we observe the results shown in Table 8 for the overall dataset of the three corpora. The precision values have increased slightly while the recall values have significantly improved. This means that VARD is correcting around double the number of learner spelling errors compared to before training, without any loss of accuracy. Splitting the data into individual native languages and training VARD on these datasets in isolation provides improved performance over using the dataset as a whole. Figure 1 shows VARD’s recall rising to higher levels in the individual language sets. In terms of tokens¹³, scores of 14.5% for French, 20% for German and 16.2% for Spanish are observed. An improvement in precision was also observed, with French and German reaching 100% accuracy throughout the training process and Spanish beginning with 100% accuracy but dropping slightly to 96.6% by the end of training.

Figure 1. Training effect on VARD recall for individual languages (recall in % tokens for French, German and Spanish, plotted against sample tokens seen)

13. For types, recall scores of 14.5% for French, 18.4% for German and 15.8% for Spanish are observed.

The third stage of our experiment links back to the DICER analysis described earlier. We used the DICER tool to extract by hand a set of letter replacement rules observed in the manual corrections. We extracted the most frequent (i.e. successful)
rules and added them to the set of letter replacement rules within VARD. The aim was to improve the automatic analysis by increasing the likelihood of a correction suggested by VARD being the same as the one introduced manually in the error tagging. The resulting improvement can be seen in Table 9. Again, the precision is not affected, but the rate of recall (i.e. number of spelling errors found and corrected) has improved. Around one quarter of the learner spelling errors that are manually error tagged are now being found by VARD and automatically corrected.

Table 9. VARD recall and precision after manual addition of DICER rules

            Types %   Tokens %
Recall      19.1      23.4
Precision   87.7      90.8

Again, using each native language in isolation improves performance; Figure 2 shows VARD’s recall during the training of each language set, as in Figure 1. Here recall is further improved by introducing rules from specific DICER analyses for individual native languages. Recall scores (in terms of tokens¹⁴) of 21.8%, 29.6% and 20.8% are attained for French, German and Spanish respectively. Precision is maintained at 100% for all languages throughout the training process¹⁵.

Figure 2. Training effect on VARD recall after manual addition of DICER rules (recall in % tokens for French, German and Spanish, plotted against sample tokens seen)

14. For types, recall scores of 21.8% for French, 22.5% for German and 20.6% for Spanish are observed.
15. These perfect precision scores are tempered slightly by VARD introducing false positives through the attempted standardisation of extra words detected as variants (as described for the whole dataset in Table 7).
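The effect of adding position-restricted letter replacement rules can be illustrated with the following Python sketch. The rule format, the example rules and the toy lexicon are hypothetical and are not taken from VARD or from the DICER output; the six position labels follow those named in the text (start, second, middle, penultimate, end and anywhere), but their exact boundaries are assumptions. Only rules with a non-empty left-hand side are handled.

```python
from typing import Iterable, NamedTuple, Set


class Rule(NamedTuple):
    old: str        # characters in the learner form (must be non-empty)
    new: str        # replacement characters (may be empty for a deletion)
    position: str   # start, second, middle, penultimate, end or anywhere


def position_matches(start: int, end: int, length: int, where: str) -> bool:
    """Check whether the span [start, end) satisfies a position restriction."""
    if where == "anywhere":
        return True
    if where == "start":
        return start == 0
    if where == "second":
        return start == 1
    if where == "end":
        return end == length
    if where == "penultimate":
        return end == length - 1
    if where == "middle":
        return start >= 2 and end <= length - 2
    raise ValueError(f"unknown position: {where}")


def apply_rules(word: str, rules: Iterable[Rule], lexicon: Set[str]) -> Set[str]:
    """Return lexicon words reachable from `word` by one rule application."""
    candidates = set()
    for rule in rules:
        if not rule.old:
            continue  # pure insertions are outside the scope of this sketch
        start = word.find(rule.old)
        while start != -1:
            end = start + len(rule.old)
            if position_matches(start, end, len(word), rule.position):
                candidate = word[:start] + rule.new + word[end:]
                if candidate in lexicon:
                    candidates.add(candidate)
            start = word.find(rule.old, start + 1)
    return candidates


# Hypothetical rules of the kind a frequency analysis might suggest.
RULES = [Rule("mm", "m", "end"), Rule("s", "", "end"), Rule("e", "a", "middle")]
LEXICON = {"program", "poor", "separate"}
```

With these toy rules, `apply_rules("programm", RULES, LEXICON)` proposes `program`, while a rule restricted to the start of the word would propose nothing for `seperate`; restricting a rule to the positions where it is actually observed is one way of limiting spurious candidates.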
Compared with the approach used by Lefer & Thewissen (2007) and described in Section 2, the results show that VARD is much more suitable for this automated task. Lefer & Thewissen had to manually weed out around 58% of the unrecognised forms suggested by the automatic process as candidate learner errors. VARD’s false positive rate is around 5% as shown above.
5. Conclusion

In this paper, we have discussed previous work on spelling errors in learner data from a variety of perspectives: second language acquisition, language teaching, psycholinguistics and educational research. The experiments described here draw on two areas: computer-aided language learning, where techniques from natural language processing are applied to learner corpora, and computer learner corpus research, where manual error tagging is still the norm. The techniques employed by the VARD tool are highly accurate, around 90% precision, for correcting learner errors. Further research is required to improve the detection of learner errors since the recall shown is around 23%. Specifically, new techniques are required for the detection of learner spelling errors that can only be found using contextual patterns. Techniques such as these do exist in spell checkers, but learner research shows that different patterns emerge depending on the language background and experience of the learner.

We have shown the potential of NLP methods to contribute to the automatic error analysis of learner corpora. VARD can contribute in at least two ways. First, it can assist a manual process of editing by suggesting further learner spelling errors that are missed by a human analyst. Second, after the manual correction of a sample corpus, VARD can be trained and run automatically over the full corpus to generate larger amounts of data for analysis. Indirectly through computer learner corpus research, these results contribute to the improvement of spell checking techniques for learners and allow the selection of corpus data for L1-specific spelling and morphology exercises. Seventeen years ago, Granger & Meunier (1994) suggested the idea of a grammar checker for learners of English. Much more research is required, but we offer the experimental results and VARD tool presented here as a partial contribution to this endeavour.
Acknowledgements

We are grateful to Jennifer Thewissen who provided the error-tagged data for our experiments and commented on a draft of this paper.
References

Archer, D., McEnery, T., Rayson, P. & Hardie, A. 2003. Developing an automated semantic analysis system for Early Modern English. In Proceedings of the Corpus Linguistics 2003 Conference [UCREL Technical Paper Number 16], D. Archer, P. Rayson, A. Wilson & T. McEnery (eds), 22–31. Lancaster: UCREL, Lancaster University.
Baron, A., Rayson, P. & Archer, D. 2009a. Automatic standardization of spelling for historical text mining. In Proceedings of Digital Humanities 2009, Maryland, USA, 309–312. College Park MD: University of Maryland.
Baron, A., Rayson, P. & Archer, D. 2009b. Word frequency and key word statistics in historical corpus linguistics. Anglistik: International Journal of English Studies 20(1): 41–67.
Baron, A. & Rayson, P. 2008. VARD 2: A tool for dealing with spelling variation in historical corpora. In Proceedings of the Postgraduate Conference in Corpus Linguistics. Birmingham: Aston University.
Baron, A. & Rayson, P. 2009. Automatic standardisation of texts containing spelling variation: How much training data do you need? In Proceedings of Corpus Linguistics 2009, University of Liverpool, UK, July 2009. Liverpool: University of Liverpool.
Bebout, L. 1985. An error analysis of misspellings made by learners of English as a first and as a second language. Journal of Psycholinguistic Research 14(6): 569–593.
Cook, V.J. 1997. L2 users and English spelling. Journal of Multilingual and Multicultural Development 18(6): 474–488.
Dagneaux, E., Denness, S. & Granger, S. 1998. Computer-aided error analysis. System: An International Journal of Educational Technology and Applied Linguistics 26(2): 163–174.
Dagneaux, E., Denness, S., Granger, S., Meunier, F., Neff, J. & Thewissen, J. 2005. Error Tagging Manual Version 1.2. Louvain-la-Neuve: Centre for English Corpus Linguistics, Université Catholique de Louvain.
Figueredo, L. 2006. Using the known to chart the unknown: A review of first-language influence on the development of English-as-a-second-language spelling skill. Reading and Writing 19(8): 873–905.
Gamon, M., Gao, J., Brockett, C., Klementiev, A., Dolan, W., Belenko, D. & Vanderwende, L. 2008. Using contextual speller techniques and language modeling for ESL error correction. In Proceedings of IJCNLP, Hyderabad, India, Asia Federation of Natural Language Processing, January 2008, 449–456.
Gamon, M., Leacock, C., Brockett, C., Dolan, W., Gao, J., Belenko, D. & Klementiev, A. 2009. Using statistical techniques and web search to correct ESL errors. CALICO Journal 26(3): 491–511.
Granger, S. 1999. Use of tenses by advanced EFL learners: Evidence from an error-tagged computer corpus. In Out of Corpora – Studies in Honour of Stig Johansson, H. Hasselgård & S. Oksefjell (eds), 191–202. Amsterdam: Rodopi.
Granger, S. 2003. Error-tagged learner corpora and CALL: A promising synergy. CALICO Journal 20(3): 465–480.
Granger, S. & Meunier, F. 1994. Towards a grammar checker for learners of English. In Creating and Using English Language Corpora: Papers from the Fourteenth International Conference on English Language Research on Computerized Corpora, Zürich 1993, U. Fries, G. Tottie & P. Schneider (eds), 79–91. Amsterdam: Rodopi.
Granger, S. & Wynne, M. 1999. Optimising measures of lexical variation in EFL learner corpora. In Corpora Galore, J. Kirk (ed.), 249–257. Amsterdam: Rodopi.
Granger, S. & Thewissen, J. 2005a. Towards a reconciliation of a ‘Can Do’ and ‘Can’t Do’ approach to language assessment. Paper presented at the Second Annual Conference of EALTA, 2nd–5th June 2005, Voss, Norway.
Granger, S. & Thewissen, J. 2005b. The contribution of error-tagged learner corpora to the assessment of language proficiency. Paper presented at the 2005 Language Testing Research Colloquium, July 20th–22nd 2005, Ottawa, Canada.
Gries, S. & Myslin, M. 2009. k dixez? A corpus study of Spanish Internet Orthography. In Proceedings of Corpus Linguistics 2009, Liverpool, July 2009. Liverpool: University of Liverpool.
Hovermale, D. 2008. SCALE: Spelling correction adapted for learners of English. In Proceedings of the CALICO-08 ICALL Special Interest Group Pre-conference Workshop, San Francisco, CA, USA.
Hovermale, D. & Mehay, D. 2009. Real-word spelling correction for CALL. Presented at the Sixth Midwest Computational Linguistics Colloquium, MCLC-6, May 2009, Indiana University, Bloomington, USA.
Ibrahim, M. H. 1978. Patterns in spelling errors. English Language Teaching Journal 32: 207–212.
Lee, J.S.Y. 2009. Automatic Correction of Grammatical Errors in Non-native English Text. PhD dissertation, Massachusetts Institute of Technology.
Lefer, M.-A. & Thewissen, J. 2007. Orthographic and morphological errors in learner writing. Presented at ICAME 2007, Stratford-upon-Avon, UK.
Meunier, F. & Granger, S. (eds) 2008. Phraseology in Foreign Language Learning and Teaching. Amsterdam: John Benjamins.
Milton, J. 2004. Mark my words: Technologies for supporting, managing and responding to student writing. In Proceedings of the Second Teaching and Learning Symposium, Hong Kong, May 17, 2004, Senate Committee on Teaching and Learning Quality, and Center for Enhanced Learning and Teaching, HKUST, Hong Kong.
Milton, J. & Chowdhury, N. 1994. Tagging the interlanguage of Chinese learners of English. In Proceedings of Joint Seminar on Corpus Linguistics and Lexicology, Guangzhou and Hong Kong, 19–22 June, 1993, Language Centre, HKUST, Hong Kong, 127–143.
Nicholls, D. 2003. The Cambridge Learner Corpus – error coding and analysis for lexicography and ELT. In Proceedings of Corpus Linguistics 2003, Lancaster University, UK, 572–581.
Rayson, P., Archer, D., Piao, S. L. & McEnery, T. 2004. The UCREL semantic analysis system. In Proceedings of the Workshop on Beyond Named Entity Recognition Semantic Labelling for NLP Tasks in Association with 4th International Conference on Language Resources and Evaluation (LREC 2004), 25th May 2004, Lisbon, Portugal, 7–12.
Rayson, P., Archer, D. & Smith, N. 2005. VARD versus word: A comparison of the UCREL variant detector and modern spell checkers on English historical corpora. In Proceedings of Corpus Linguistics 2005, University of Birmingham, UK, July 2005.
Rayson, P., Archer, D., Baron, A., Culpeper, J. & Smith, N. 2007. Tagging the bard: Evaluating the accuracy of a modern POS tagger on Early Modern English corpora. In Proceedings of Corpus Linguistics 2007, July 27–30, University of Birmingham, UK.
Rimrott, A. & Heift, T. 2008. Evaluating automatic detection of misspellings in German. Language Learning & Technology 12(3): 73–92.
van Rooy, B. & Schäfer, L. 2002. Southern African Linguistics and Applied Language Studies 20(4): 325–335.
Tono, Y. 2003. Learner corpora: Design, development and applications. In Proceedings of Corpus Linguistics 2003, Lancaster University, UK, 800–809.
Paul Rayson and Alistair Baron Wade-Woolley, L. & Siegel, L. 1997. The spelling performance of ESL and native speakers of English as a function of reading skill. Reading and Writing 9(5–6): 387–406. Wang, M. & Geva, E. 2003. Spelling acquisition of novel English phonemes in Chinese children. Reading and Writing: An Interdisciplinary Journal 16: 325–348. Zutell, J. & Allen, V. 1988. The English spelling strategies of Spanish-speaking bilingual children. TESOL Quarterly 22(2): 333–340.
Data mining with learner corpora
Choosing classifiers for L1 detection
Scott Jarvis

This paper discusses the usefulness of machine-learning techniques for the investigation of cross-linguistic influence in learner corpora, and focuses on an approach known as supervised classification. Within this approach, one of the challenges that researchers face is deciding which particular method – or classifier – to use for a particular task. The classification task that this paper deals with is the ability of classifiers to learn to detect native language-related patterns in samples of second language writing. The empirical portion of this paper compares 20 classifiers in relation to their ability to perform this task with second language texts written by learners from 12 different native language backgrounds on the basis of their use of words and word sequences (or n-grams).
1. Introduction

An important characteristic of corpus analysis – including learner corpus analysis – is its heavy reliance on computer automation for purposes of discovering patterns in the data. Because of the size and complexity of most language corpora, it would be infeasible to perform comprehensive analyses of the data without computer automation. Automated processes of searching for and extracting newly discovered information from large databases, such as language corpora, are often referred to as data mining, and the computer-based tools available for this type of information retrieval are becoming increasingly varied and sophisticated. One of many approaches to data mining involves what is known as classification, which can be further divided into unsupervised classification and supervised classification. The purpose of unsupervised classification in the case of corpus analysis is to identify clusters of texts that have similar contents in relation to a number of textual features (or variables), such as the relative frequencies of specific letters, morphemes, words, word classes, syntactic constructions, semantic relations, and/or ratios or other types of indices that reflect various aspects of the contents of a text. The clustering or identification of sets of texts with similar contents can lead to new discoveries
concerning the factors at play in the texts or among the people who produced the texts. For example, a study by Jarvis et al. (2003) used an unsupervised classification tool known as cluster analysis to examine the linguistic similarities and differences that can be found across highly rated learner texts. The results of the study revealed multiple clusters of highly rated texts, which indicated multiple profiles of effective second language (L2) writing. One of the profiles, for instance, involved a high level of lexical diversity, a high use of nouns, and a high use of prepositions, whereas another profile involved a low use of nouns and prepositions but a high use of adverbials and present tense verbs. Among other things, this study provided an indication of the combinations of variables that work together in successful L2 texts, and also showed that there are multiple alternative routes to successful L2 writing. This study used cluster analysis, but some of the other tools available for unsupervised classification include statistical procedures known as Independent Component Analysis and neural network models (Hinton & Sejnowski 1999; Duda et al. 2000; Hyvärinen & Oja 2000; Kotsiantis & Pintelas 2004). For convenience, computer programs that perform unsupervised classification are often referred to simply as clusterers (Witten & Frank 2005). Supervised classification, in turn, is a form of machine learning where a computer program learns to recognize patterns associated with predefined classes. The term supervised refers to the fact that, in this type of machine learning, the computer program is not designed to discover classes (i.e. groups of cases) on its own, but is told what the relevant classes are and is directed to discover the patterns in the data that are most distinctive of those particular classes. When used with a corpus of texts, supervised classification performs its learning on training data (i.e. 
a subset of the corpus) that include not only the types of textual features that are used in unsupervised classification, but which also include predefined class labels associated with each text. In some cases, the labels represent text variables, such as the topics of the texts or the genres they represent. In other cases, the labels represent attributes of the authors who produced the texts, such as their gender, nationality, or attitude toward the topic. The purpose of supervised classification is to discover patterns among the textual features fed into the program that may be predictive of the class labels associated with the texts. After the program has constructed a predictive model on the basis of the training data, the model is applied to a further set of texts whose labels are withheld from the classifier in order to determine how generalizable the model is – to determine how accurately it is able to predict the class memberships of texts whose classes are unknown (to the classifier, at least). High levels of classification accuracy are indicative of two things: (a) that the data do indeed contain patterns associated with the class labels in question (i.e. the program could not learn to detect these class-related patterns correctly if there were nothing in the data to learn), and (b) that the program itself is effective in discovering these patterns. A recent study by Crossley & McNamara (2009) provides a clear example of how supervised classification can be used to discover patterns associated with different groups of writers. The study examined argumentative essays written in English by
both English-speaking and Spanish-speaking university students. The purpose of the study was to determine the ways in which “L2 writers of English differ from L1 writers in their use of lexical cohesive devices and other lexical features” (p. 123). The predefined classes in this study were thus native versus nonnative, and the authors set out to determine how accurately the class membership of each text could be predicted on the basis of the use of various lexical and cohesion-related features (e.g. average levels of hypernymy and polysemy, argument overlap, the use of causal verbs and motion verbs). To conduct their analysis, the researchers used a supervised classification tool known as Discriminant (Function) Analysis. During the training phase of their analysis, they fed the class labels (i.e. native or nonnative) and the feature values (e.g. hypernymy values, polysemy values) for half of the texts into the Discriminant Analysis program. During this phase, the computer program created a statistical model of the relationship between features and classes. Then, during the testing phase, the researchers applied that statistical model to the other half of the texts to determine how well it could predict whether they were written by native or nonnative speakers of English. The results showed that a statistical model based on just 7 features provided the highest degree of classification accuracy, correctly classifying 79.10% of the texts that were held back for the testing phase. The strength and clarity of these results pointed to relatively reliable differences between English-speaking and Spanish-speaking university students in relation to the levels of lexical depth of knowledge, lexical variation, and lexical sophistication found in their argumentative writing.

I mentioned earlier that computer programs used for unsupervised classification are often referred to simply as clusterers.
By contrast, computer programs used for supervised classification purposes are often referred to as classifiers. One such classifier is Discriminant Analysis – mentioned in the preceding paragraph – and other classifiers include statistical programs such as Support Vector Machines, Bayesian classification models, rule-based models, and decision-tree models, among many others (Witten & Frank 2005; Kotsiantis 2007). Classifiers are used not only for text classification purposes, but also for a vast range of classification purposes in many other fields, such as in medical research, where they are used in the identification of predictors of diseases like cancer (e.g. Shen et al. 2007), and in geophysical research, where they are used to create maps of land use based on data from satellite-borne sensors (e.g. Liu et al. 2003). A number of the discoveries and developments regarding classifiers correspondingly come from other fields (e.g. Molinaro et al. 2005), and several studies referred to in this paper thus come from outside of linguistics and language-related research.

The topic of the present chapter is the value of (supervised) classifiers for second language research with learner corpora. Such tools are already in widespread use in fields such as stylometry, literary analysis, and information science, and they are likely to become increasingly common in the analysis of learner corpora, too. Whereas traditional second language research tends to examine individual language
features (e.g. subject-verb agreement) in relation to the central tendencies of groups of learners defined according to a given criterion (e.g. proficiency), classifiers highlight the ways in which multiple language features work together in the language use of individuals sharing particular background characteristics. In other words, traditional second language research tends to examine (a) one language feature at a time (b) in relation to group tendencies, whereas classifier-driven research examines (c) constellations of features in (d) the language use of individuals belonging to certain groups. Regarding (b) and (d), where traditional second language research and learner corpus research tend to rely on group means and/or overall frequencies of occurrence, classifier-driven research tends to rely instead on classification accuracy – i.e. the percentage of individuals within each group whose use of a particular bundle of features is group-distinctive in the sense of being both representative of a particular group (i.e. group-representative) and also relatively unique to that group (i.e. group-specific). The classifier-driven approach offers a clearer picture of how well learners’ group memberships can be predicted on the basis of their language behaviors.

In recent papers, I have emphasized the value of classifiers for the investigation of cross-linguistic influence (or language transfer), and I have referred to the use of such means for the collection of evidence for cross-linguistic effects as the detection-based approach (Jarvis 2010; Jarvis forthcoming). As I describe in these papers, the detection-based approach complements the more traditional comparison-based approach by offering an alternative argument for cross-linguistic influence, which is that the ability to identify learners’ native languages (L1s) accurately from their patterns of language use is prima facie evidence for L1 effects.
Such evidence is not necessarily incontestable, as even in the legal sense the term prima facie evidence means that the case is not closed but that the evidence is judged to be strong enough to be presented to a jury, or alternatively that the evidence compels a particular conclusion unless counterevidence is available to rebut it (e.g. Herlitz 1994–1995: 395). Accordingly, a fully rigorous detection-based argument for the presence of L1 effects requires not only L1 detection accuracies that are significantly above the level of chance, but also the presentation of facts that rule out the possibility that the observed L1 detection accuracies may have resulted from factors other than L1 influence. When such confounds cannot be ruled out, other means will be needed to achieve argumentative rigor. Yet, even when argumentative rigor is not achieved through detection-based argumentation alone, the detection-based approach still plays an important role in leading researchers to possible cases of cross-linguistic influence that can later be scrutinized from the perspectives of both detection-based and comparison-based evidence (see Jarvis 2000; Jarvis & Pavlenko 2008; Jarvis 2010) – similar to how prima facie evidence in a legal context is the impetus for a full trial. The detection-based approach to transfer research is highlighted in a forthcoming collection of studies dealing with issues of argumentative rigor and the strength of
evidence for L1 influence derivable from classifiers applied to learner corpora (see Jarvis & Crossley forthcoming). Given that those issues are addressed at length in that volume, I will not deal with them further in this paper. However, those studies do raise a separate question that I will attempt to address relatively comprehensively in the present chapter. It is the question of which of the many available classifiers is the most useful for this type of research in terms of the levels of classification accuracy it achieves, the interpretability of its output, and the practicality of its use. As I will discuss shortly, the answer to this question is complicated by the fact that the usefulness of classifiers appears to be heavily dependent on how many and which specific language features are being investigated, what the specific relationship is between those features, and what their specific relationship is with the class variable that the researcher is trying to predict. The studies in the Jarvis & Crossley (forthcoming) volume all use Linear Discriminant Analysis (LDA) for classification purposes, and the main purpose of the present paper is to determine how LDA compares with other classifiers in terms of its ability to learn to recognize L1-related patterns among the features under investigation. Because each study in the Jarvis & Crossley volume investigates a different set of language features in relation to learners’ L1 backgrounds, however, it is possible that the best performing classifier could be different for each study. Given that it is impractical in this paper for me to conduct a thorough comparison of classifiers with the data used in each of those studies, I will restrict my focus to the study in that volume that deals with the greatest number of features and the greatest number of L1 backgrounds. 
The study in question is an investigation of the ability of LDA to identify L1-related patterns in the use of 722 word n-grams (or multiword sequences) in L2 essays extracted from the International Corpus of Learner English (ICLE) that were produced by learners of English from 12 L1 backgrounds (Jarvis & Paquot forthcoming). The relevant details of that study will be described in Section 3.2, after I have discussed the types of classifiers that are available and how they work, and after I have reviewed empirical studies that have compared their performance in general, as well as studies that have examined their ability to detect learners’ L1 backgrounds in particular.
2. Classifiers

It is important to point out at the beginning of this section that the term classifier is alternately used in the literature with two different meanings: a general sense and a specific sense. In its general sense, the term refers to a computer program (e.g. a program that performs LDA) that has been designed to construct a predictive model of the relationship between features (e.g. language features) and a class variable (e.g. L1 background). In its specific sense, it refers to a specific model that has been constructed by such a program. In the present paper, I will use this term in its general sense, and will use the term model to refer to the more specific meaning.
2.1 Types of classifiers
The number of existing classifiers is large and quickly expanding, which makes it infeasible to present a full catalog. In this section, I will briefly describe the major types of classifiers in relation to the kinds of algorithms they rely on. This information is also summarized in Table 1, along with examples of some of the prominent classifiers within each category. More information about each type of classifier can be found in Appendix 1.

Table 1. Major types of classifiers

Type of classifier            Examples
Centroid-based                Linear Discriminant Analysis (LDA); Quadratic Discriminant Analysis (QDA); Nearest Shrunken Centroids (NSC)
Boundary-based                Support Vector Machines (SVM); Optimal Separating Hyperplanes (OSH); Sequential Minimal Optimization (SMO)
Bayesian classifiers          Naïve Bayes (NB); Naïve Bayes Multinomial (NBM); Complement Naïve Bayes (CNB)
Artificial neural networks    Multilayer Perceptron Analysis (MPA); Radial Basis Function (RBF)
Decision trees                Simple CART (SC); Random Forest (RF)
Rule-based                    Conjunctive Rule Classifier (CRC); RIPPER (RIP)
Means-based                   Delta; Delta Prime (DP)
Composite                     Naïve Bayes Tree (NBT); Classification via Regression (CVR); LogitBoost (LB)

Centroid-based classifiers create statistical models in which each case (e.g. text) is mathematically represented as if it existed as a point in multi-dimensional space. The classifier also determines the exact center, or centroid, of all points belonging to the same class (e.g. L1 group). During the testing phase, the classifier measures the distance between each case and each centroid, and classifies each case as a member of the class whose centroid it is closest to. Boundary-based classifiers are similar to centroid-based classifiers, except that instead of relying on group centroids, they attempt to determine boundaries between each cluster of points, and then classify each case as belonging to the class whose boundaries it falls within.

Bayesian classifiers use a simple algorithm for calculating the probability that a particular case (e.g. text) is a member of a particular class (e.g. L1 group). These probabilities are calculated from feature values, such that a higher or lower frequency of a particular feature (e.g. the definite article) will either raise or lower the probability that a case belongs to a particular class. Classification is performed by aggregating the probabilities for all features, and by classifying a case as belonging to the class for which it has the highest probability. Whereas Bayesian classifiers treat features as if they are independent of one another, artificial neural networks treat them as being interrelated. Artificial neural networks also assign different weights to different features in order to improve classification accuracy.

Decision trees and rule-based classifiers are similar to each other in that both tend to rely on the flow-chart principle, where individual cases are sorted into classes through sequences of decisions. The decisions at each step can usually be stated as if-then statements (e.g. if the relative frequency of the definite article in this text is between 95 and 110 occurrences per 1000 words, then proceed to branch/rule X). Means-based classifiers, on the other hand, are more similar to Bayesian classifiers in the sense that they involve adding up values associated with individual features, and classifying cases as belonging to the group with the most similar aggregate value. The difference between Bayesian classifiers and means-based classifiers is that the former rely on probabilities calculated for each feature, whereas the latter rely on z-scores (or standardized values) associated with each feature.

Finally, composite classifiers (which are sometimes referred to as meta classifiers or ensemble classifiers) involve a combination of more than one type of classification, such as the combination of Bayesian probabilities with a decision-tree algorithm.
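As an illustration of the centroid-based principle described in this section, the following is a minimal sketch in Python rather than an implementation of LDA or of any classifier listed in Table 1. It assumes plain Euclidean distance, whereas LDA and its relatives use more elaborate distance measures. The feature values, the class labels L1_A and L1_B, and all function names are hypothetical.

```python
import math

def centroid(vectors):
    """Return the component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def train_centroids(features, labels):
    """Training phase: compute one centroid per class label."""
    by_class = {}
    for vector, label in zip(features, labels):
        by_class.setdefault(label, []).append(vector)
    return {label: centroid(vectors) for label, vectors in by_class.items()}

def classify(centroids, vector):
    """Testing phase: assign the class whose centroid is nearest (Euclidean)."""
    return min(centroids, key=lambda label: math.dist(vector, centroids[label]))

# Hypothetical training data: rates per 1000 words of two language features
train_x = [[95.0, 12.0], [100.0, 10.0], [105.0, 11.0],
           [60.0, 25.0], [55.0, 27.0], [65.0, 23.0]]
train_y = ["L1_A", "L1_A", "L1_A", "L1_B", "L1_B", "L1_B"]

model = train_centroids(train_x, train_y)

# A new text whose class label is withheld from the model
prediction = classify(model, [98.0, 13.0])
```

In this toy setting each text is a point in two-dimensional space; with hundreds of features the logic is the same, only the number of dimensions grows.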
2.2 Feature selection and parameter tuning
Classification accuracy is often enhanced by reducing the number of features in the model. This not only improves the efficiency of the classifier, but also removes unnecessary complexity from the model. In many cases, overall classification accuracy can be improved by removing features that do not contribute to the predictive power of the model. Some classifiers also have restrictions on the number of cases that are needed per feature. For example, Linear Discriminant Analysis (LDA) rests on parametric assumptions for multivariate statistical tests, which, according to the old rule of thumb, require at least 10 cases per variable (i.e. 10 texts per feature). Other scholars have proposed even more rigid requirements for LDA, calling for at least five times as many cases per class as there are features (Burns & Burns 2008: 591) or at least 20 times as many cases in the overall training data as there are features (Field 2005). Under all of these circumstances, it is desirable if not absolutely necessary to limit the number of features on which a classifier will build its model of class membership. Ultimately, features should be selected and prioritized on the basis of theoretical criteria. However, sometimes this is not possible or desirable, such as in the case of exploratory research whose purpose is the pre-theoretical, empirical discovery of which features (or combination of features) are most predictive of membership in
given classes. Multiple automated options for feature selection have been developed for these types of exploratory purposes. With respect to LDA, statistical software applications such as SPSS include stepwise options for selecting features according to how much a particular feature adds to the strength of the model. The stepwise method first chooses the feature that is most strongly correlated with the class (or grouping) variable, then it chooses the feature with the next highest unique correlation with the class variable, and so on until no further feature adds significantly to the model (McLachlan 2004: 412; Burns & Burns 2008: 604–605). Options also exist for adjusting the alpha level used to define significance and for choosing different indices of model strength. Choosing the best set of features for a particular analysis usually involves a good deal of experimentation with different options, as well as the use of cross-validation, which I will discuss in the following section.

There are three general methods for performing feature selection: wrappers, filters, and embedded methods. According to Guyon & Elisseeff (2003: 1166), “wrappers utilize the learning machine of interest as a black box to score subsets of variable [sic] according to their predictive power. Filters select subsets of variables as a pre-processing step, independently of the chosen predictor. Embedded methods perform variable selection in the process of training and are usually specific to given learning machines” (emphases in the original).
I have only rarely encountered the use of wrappers in the classification literature, but such methods are available in classification software such as Weka (see Witten & Frank 2005).[1] Examples of the use of filters are much more frequent, including the use of measures such as information gain (a measure of entropy changes) or the Gini index (a measure of statistical dispersion) for feature selection before the data are submitted to a decision-tree classifier (see Raileanu & Stoffel 2004; Kotsiantis 2007: 252). An example of an embedded method is the use of the stepwise procedure in Linear Discriminant Analysis (LDA), which I described earlier. There appears to be a consensus that no one feature-selection method is consistently better than all others; instead, different methods tend to work best in different contexts (e.g. Guyon & Elisseeff 2003: 1178; Kotsiantis 2007: 252). For the researcher, this unfortunately means that a good deal of experimentation with different methods may be in order.

[1] Wrappers determine all possible subsets of features from the given feature pool, and then follow an algorithm for sampling these subsets and submitting them to the chosen classifier in order to determine which combination of features results in the highest classification accuracy for that particular classifier (see e.g. Kohavi & John 1997).

Under certain circumstances, some classifiers may perform optimally without removing any features at all. This is particularly likely with classifiers such as Support Vector Machines (SVM) that do not have strict restrictions on the ratio of features to cases and that deal well with multidimensionality (i.e. the complexity resulting from a high number of features).

Regardless of whether or how feature selection is carried out, the classification accuracy of a classifier can often be improved by adjusting – or tuning – its parameters. An important parameter for LDA, for example, is prior probabilities – i.e. whether the classifier should treat each class as being equal or whether it should calculate prior probabilities from class sizes (e.g. the number of texts from each L1 background). SVM, for its part, includes a complexity parameter that can be adjusted, and also different kernel options that determine how boundaries between classes are calculated. Random Forest, in turn, includes parameter options for setting the maximum depth of a tree (i.e. how many sequential decisions it will include), the number of randomly selected features that will be included in each tree, and the number of trees it will generate (see, e.g. Witten & Frank 2005; Shen et al. 2007). In many cases, a classifier’s default settings will produce optimal results, but researchers need to be aware of what the settings are and how they might affect the results. Experimenting with different parameter settings is often useful, of course, and fortunately some statistical software suites (e.g. Weka) include automated means for testing multiple parameter settings and choosing the one that results in the highest classification accuracy.
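To make the idea of a filter more concrete, the following Python sketch ranks features by information gain before any classifier is involved. It is a simplified illustration, not the procedure used in Weka or in any of the studies cited here. It assumes that each feature has been reduced to a discrete value (here, presence or absence of an n-gram in a text), and the feature names, labels, and function names are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(labels).values())

def information_gain(feature_values, labels):
    """Reduction in class entropy obtained by splitting on a discrete feature."""
    total = len(labels)
    remainder = 0.0
    for value in set(feature_values):
        subset = [lab for val, lab in zip(feature_values, labels) if val == value]
        remainder += (len(subset) / total) * entropy(subset)
    return entropy(labels) - remainder

def rank_features(columns, labels):
    """Filter step: order feature names from most to least informative."""
    gains = {name: information_gain(values, labels)
             for name, values in columns.items()}
    return sorted(gains, key=gains.get, reverse=True), gains

# Hypothetical data: four texts, two L1 groups, two binary n-gram features
labels = ["L1_A", "L1_A", "L1_B", "L1_B"]
columns = {
    "ngram_x": [1, 1, 0, 0],  # present only in L1_A texts
    "ngram_y": [1, 0, 1, 0],  # unrelated to L1 group
}

ranking, gains = rank_features(columns, labels)
```

A researcher using such a filter would keep only the top-ranked features and pass them to the chosen classifier; relative frequencies would first need to be binned into discrete ranges for this particular measure to apply.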
2.3 Cross-validation
One of the most important components of classification analysis is the use of cross-validation (CV). In the simplest case, this involves splitting one’s data into a training set and a testing set. During the training phase, the classifier is given both the feature values and the class labels for all of the cases (e.g. texts) in the training set. The training set is thus used to build the model of the relationship between features and classes. Afterwards, during the testing phase, that model is applied to the feature values of each of the cases in the testing set in order to determine how accurately the model is able to predict the class memberships of new cases (i.e. cases that were not used to build the model, and whose class labels are withheld from the model). The model’s classification accuracy with the testing set is used as an indication of the model’s generalizability as a predictor of class membership in the classes in question. If both the training set and testing set are truly large, representative of their populations, and normally distributed, then this form of CV should be perfectly reliable. When these conditions are not met, however, the specific way in which the data are divided into training and testing sets can result in an accidental bias and an unstable model. This means that the classification accuracy results would be different if the data were divided differently, or if the testing set were used as the training set and vice versa (Molinaro et al. 2005: 3306). A useful way to compensate for this problem is by dividing the data into multiple sets and by iterating through several steps where each data set is given its own turn of being held back as the testing set while all other sets are combined to form the training set. This method is referred to as k-fold CV (sometimes as v-fold CV), where k refers to the number of sets the data have been divided into and likewise the number of training-testing iterations the CV will progress through.
The most common form of k-fold CV is 10-fold CV, which has been found to be optimal with respect to both reliability and efficiency (cf. Molinaro et al. 2005: 3306; Lecocke & Hess 2006: 315). In a 10-fold CV, the data are divided into 10 equally sized
Scott Jarvis
subsets. In the first fold of the CV, the first subset is held back as the testing set, and the other nine subsets are used as the training set. During the second fold of the CV, the second subset is held back as the testing set, and the other nine are used as the training set, and so forth. Each fold of the CV produces results concerning the number or percentage of cases in the testing set that were classified accurately. After all 10 folds of the CV are completed, the final cross-validated accuracy is the overall percentage of cases classified correctly across all 10 folds. In k-fold CV, the highest value that k can take is the number of cases in the database. K-fold CV that treats each case as a separate subset of the data is referred to as leave-one-out CV (LOOCV). In LOOCV, the number of folds in the CV is equal to the number of cases. In each fold of LOOCV, only a single case is held back as the testing set while all others are used as the training set to build the model. The final result of LOOCV is, as with other forms of CV, the overall percentage of testing cases that have been classified correctly. Although some researchers have pointed to potential problems with LOOCV (e.g. Gavin & Teytaud 2002), empirical comparisons between different forms of CV have generally shown 10-fold CV and LOOCV to be similarly superior to other forms of CV, with 10-fold CV being the most efficient computationally (e.g. Molinaro et al. 2005: 3306; Lecocke & Hess 2006: 315). The process of CV takes on an extra degree of complexity in cases where feature selection is performed in conjunction with classification. This is because automated feature selection capitalizes on even randomly occurring strong statistical relationships in the data. This results in a model that is overfitted to the training set in relation to the features included in the model. 
In order to avoid bias in the selection of features for the model, and in order to avoid overly optimistic CV results regarding the ability of the model to predict the class memberships of new cases, it is therefore necessary to embed feature selection within each fold of the CV. This will not only give more realistic results regarding the predictive power of the model, but will also show, for example, which features are selected across multiple folds of the CV, and therefore which features are truly generalizable predictors of class membership. Feature selection that is embedded within the folds of a CV has alternatively been referred to as honest, complete, embedded, and nested CV (e.g. Molinaro et al. 2005: 3303; Lecocke & Hess 2006: 316). In this paper, I will refer to it as embedded CV, and it is noteworthy that both embedded 10-fold CV and embedded LOOCV have been found to be reliable and optimally unbiased estimates of classification accuracy (e.g. Molinaro et al. 2005).
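The logic of embedded CV can be sketched as follows. This is an illustrative sketch only: the feature-ranking rule used here (the spread of class means) is a crude stand-in chosen for brevity and is not the stepwise procedure used in the studies cited above, and all function names and data are hypothetical. The essential point is that feature selection is repeated inside every fold and sees only that fold's training cases.

```python
import random
from collections import Counter

def fold_indices(n_cases, k, seed=0):
    """Shuffle the case indices and deal them into k near-equal subsets."""
    idx = list(range(n_cases))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def class_means(X, y, features):
    """Mean of each listed feature per class (a nearest-centroid model)."""
    means = {}
    for label in sorted(set(y)):
        rows = [x for x, lab in zip(X, y) if lab == label]
        means[label] = [sum(r[f] for r in rows) / len(rows) for f in features]
    return means

def select_features(X, y, m):
    """Hypothetical filter: keep the m features whose class means differ most."""
    all_feats = list(range(len(X[0])))
    means = class_means(X, y, all_feats)
    spread = [max(mu[f] for mu in means.values()) -
              min(mu[f] for mu in means.values()) for f in all_feats]
    return sorted(all_feats, key=lambda f: -spread[f])[:m]

def predict(means, features, row):
    """Assign the class whose centroid (on the selected features) is nearest."""
    def dist(label):
        return sum((row[f] - mu) ** 2 for f, mu in zip(features, means[label]))
    return min(means, key=dist)

def embedded_cv(X, y, k=10, m=2, seed=0):
    """Return cross-validated accuracy and how often each feature was selected."""
    correct, selected = 0, Counter()
    for test_idx in fold_indices(len(X), k, seed):
        held_out = set(test_idx)
        train = [i for i in range(len(X)) if i not in held_out]
        X_tr, y_tr = [X[i] for i in train], [y[i] for i in train]
        feats = select_features(X_tr, y_tr, m)  # training cases only
        selected.update(feats)
        model = class_means(X_tr, y_tr, feats)
        correct += sum(predict(model, feats, X[i]) == y[i] for i in test_idx)
    return correct / len(X), selected

# Toy data (invented): feature 0 separates the classes, features 1-5 are noise.
rng = random.Random(1)
X = [[0.0 if i < 20 else 10.0] + [rng.random() for _ in range(5)] for i in range(40)]
y = ["A"] * 20 + ["B"] * 20
accuracy, selection_counts = embedded_cv(X, y, k=10, m=2)
print(accuracy, selection_counts)
```

The selection counts show which features are chosen in every fold and which are chosen only occasionally; the latter are the ones most likely to reflect chance patterns in a particular training set.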
3. Previous research
3.1 Which classifier is best?
A number of studies have performed comparisons of classifiers in order to determine which classifier (with which set of features and which parameter settings) produces
Data mining with learner corpora
the highest level of cross-validated classification accuracy. The results of these studies have varied a great deal. One of the broadest comparisons of classifiers is a study by Shen et al. (2007), which does not deal with language at all, but rather compares the classification accuracy of nine classifiers in relation to their ability to predict the occurrence of liver cancer. Seven of the nine classifiers in the study are among those listed in Table 1. These include Linear Discriminant Analysis (LDA), Quadratic Discriminant Analysis (QDA), and Nearest Shrunken Centroids (NSC) (i.e. centroid-based classifiers), Support Vector Machines (SVM) with a linear kernel and SVM with a radial kernel (boundary-based classifiers), Simple CART (SC) and Random Forest (RF) (decision trees), and LogitBoost (composite classifier). The other two classifiers included in the study were K Nearest Neighbor (KNN) classification and a type of artificial neural network referred to by the authors as NNET. The data included 88 cases (people), two classes (59 people with liver cancer, 29 people without liver cancer), and 30 features (mass spectrometry measures of biological samples taken from the participants). Because of the relatively small size of the database, the authors divided it into only four subsets for CV purposes (i.e. 4-fold CV), and they ran the classifiers on the full set of 30 features as well as on a smaller set of 17 features derived through a stricter threshold criterion (see p. 332). The results showed that SVM with a radial kernel produced the highest cross-validated classification accuracies with both 30 features and 17 features. In both cases, SVM with a radial kernel classified 67% of the participants correctly with respect to whether they had been diagnosed with liver cancer. 
Because of restrictions regarding the necessary proportion of cases to features with some classifiers, QDA could not be used with 30 features, but with 17 features it turned out to be nearly as powerful as SVM, producing a classification accuracy of 66%. NSC and RF also performed quite well, whereas LogitBoost, KNN, and SC performed quite poorly in this particular classification task. In the study just described, SVM produced the most predictive model of class membership, but the relative usefulness of this and other classifiers can differ a good deal from one study to the next, and this is true not only between fields but also within fields of research. In the literature on text classification, a study by Jockers & Witten (2010) shows that SVM is the least effective out of five classifiers in identifying the authors of texts in the Federalist Papers corpus. The five classifiers in the comparison were SVM, KNN, NSC, Delta, and a form of Discriminant Analysis referred to as Regularized Discriminant Analysis (RDA). There were three classes (i.e. Jay, Madison, and Hamilton, the three authors of the Federalist Papers) and 2,907 features consisting of the relative frequencies of the words and word bigrams (i.e. combinations of two words) that occurred at least once in the training texts written by Jay, Madison, and Hamilton. A smaller set of 298 features was also used, consisting of all words and word bigrams that have an overall relative frequency of at least 0.05% (or 5 occurrences per 10,000 words) in the Federalist Papers corpus as a whole. On the testing set of 70 texts of known origin, the best 10-fold cross-validated performance was 100% classification accuracy, which was achieved by (a) NSC using 718 of the original 2,907 features, (b) NSC using 199 of the
truncated set of 298 features, and (c) RDA using 312 of the original 2,907 features. The lowest cross-validated accuracy was 86%, achieved by SVM using all 2,907 features, but SVM achieved 94% accuracy when run on just the set of 298 features. The differences between the two studies just described in relation to the ranking of classifiers such as SVM could be due to many factors, such as differences in (a) the number of features they dealt with, (b) the sizes of their training sets, and (c) the specific characteristics of the features and classes under investigation. From the perspective of (c), a study that is especially interesting and also directly relevant to the present chapter is a study by Estival et al. (2007). In this study, the authors performed a comparison of a number of classifiers available in the Weka toolkit (see Witten & Frank 2005) in order to determine which worked best in each of a number of classification tasks involving a database of 9,836 email messages written in English by 1,033 people from 5 regions of the world who speak three different L1s. The eight classifiers in the study included decision trees, a so-called lazy learner, a rule-based classifier, boundary-based classifiers, and so-called meta classifiers. The authors extracted 689 features from the data, which were divided into three categories: character-related (e.g. punctuation frequencies, word length indices), lexical (e.g. relative frequencies of function words and part-of-speech categories), and structural (e.g. paragraph breaks, presence or absence of various HTML tags). The feature values for each text were fed into all eight classifiers, and the classifiers were compared in relation to their ability to predict class memberships for the following class variables: age, gender, L1 (Arabic, English, Spanish), level of education, country of origin, agreeableness, conscientiousness, extraversion, neuroticism, and openness. 
The classifiers were used in combination with feature selection, parameter tuning, and 10-fold cross-validation in order to arrive at a reliable and optimal solution for each classifier in each classification task. The results showed that the very highest level of classification accuracy was achieved by Random Forest (RF, a decision-tree classifier) in the task involving L1 prediction (84%) when it was combined with an information-gain criterion for feature selection and when it also involved removing features that were highly correlated with L1 (e.g. function words that were used by speakers of only one L1). The second highest classification accuracy was achieved by Sequential Minimal Optimization (SMO – a boundary-based adaptation of Support Vector Machines) in the task involving country prediction (81%), and SMO also performed best in relation to gender (69%) and age (56%); in all cases, SMO performed most optimally when using all 689 features. The only other classifier to achieve a cross-validated classification accuracy above 60% was Bagging (a meta classifier), which achieved an accuracy rate of 80% in the task involving education prediction. It achieved this result when using all features except function words. Together, these studies show that most of the major classifiers perform quite well in certain circumstances, but no classifier is the best pattern learner in all classification tasks. Unfortunately, it is not possible to predict in advance which classifier will perform best in a particular classification task, but an understanding of the characteristic strengths and weaknesses of each classifier can nevertheless help in deciding which to
use for a particular purpose. The advantages and disadvantages of each type of classifier are summarized by Kotsiantis (2007: 262–263). The author points out, for example, that boundary-based classifiers (e.g. SVM, SMO) and neural networks (e.g. MPA) tend to deal best with large numbers of features, especially when the features are continuous variables. With fewer variables, Bayesian classifiers are particularly useful, and in cases where the features involve a combination of discrete, binary, and continuous variables, decision trees tend to be superior. Decision trees, Bayesian classifiers, and rule-based classifiers are also advantageous with respect to interpretational transparency (i.e. how easy it is to see which features lead to which predictions). When the researcher’s goal is strictly classification accuracy and not interpretational transparency, Kotsiantis points out that it may be best to rely on the majority vote of an ensemble of classifiers (ibid.: 263). This is one of the methods that will be used in the empirical portion of this chapter (see Section 5).
3.2 Previous studies on L1 detection
The first investigation of L1 classification I am aware of is Mayfield Tomokiyo & Jones (2001). This study used a simple Bayesian classifier (NB) to examine whether lexical features (word n-gram and part-of-speech frequencies) could be used to distinguish English spoken texts produced by native speakers versus those produced by nonnatives (Chinese and Japanese speakers). The authors also tested whether the classifier could correctly distinguish the Chinese from the Japanese speakers. Due to the small sample and apparent lack of controls, the results are probably overly optimistic, but they show that the classifier was able to achieve 100% accuracy in distinguishing between the samples produced by Chinese (n = 6) and Japanese speakers (n = 31). The study also showed that the NB classifier performs much better with a carefully selected subset of fewer than 100 features (selected on the basis of the information gain index and various stopword lists) than with the full set of 4,800 features. Other studies on L1 classification include, in chronological order, Jarvis et al. (2004), Koppel et al. (2005), Tsur & Rappoport (2007), the Estival et al. (2007) study described earlier, Wong & Dras (2009), and the studies in the Jarvis & Crossley (forthcoming) volume. In the remainder of this section, I will focus on the four studies that are most relevant to the purposes of the present chapter. These include Koppel et al. (2005), Tsur & Rappoport (2007), Wong & Dras (2009), and Jarvis & Paquot (forthcoming). All four studies examine a large number of texts drawn from the ICLE corpus, and all four studies perform L1 prediction on the basis of a large number of textual (including lexical) features. The study by Koppel et al. (2005) used the Support Vector Machines (SVM) classifier with a linear kernel in an attempt to predict the L1 backgrounds of English texts written by speakers of five different L1s: Bulgarian, Czech, French, Russian, and Spanish. 
The texts represent only a sample of the texts available in the ICLE, but are nevertheless quite numerous, consisting of 258 texts per L1 group, for a total of 1,290 texts. The
features examined are also quite numerous, consisting of 1,035 features of the following types: 400 function words, 200 frequent letter n-grams, 185 error types, and 250 rare part-of-speech bigrams identified by Francis & Kucera (1982). The 10-fold cross-validated classification accuracy with all 1,035 features was 80%. Tests run with subsets of the features showed that the 400 function words alone result in 75% classification accuracy, and the 200 letter n-grams alone result in 71% classification accuracy. These results – particularly the 80% achieved with all 1,035 features – are quite remarkable in light of the fact that the number of classes (or L1s) is five, meaning that the level of chance is only 20% accuracy (i.e. given that the L1 groups are equally represented in the sample). The study by Tsur & Rappoport (2007) replicates several aspects of the Koppel et al. study but focuses especially on the ability of letter n-grams to distinguish L1 groups. Tsur & Rappoport used the same classifier (SVM) with the same five L1 backgrounds, but drew their own random sample of texts from the ICLE, which included 238 (instead of 258) texts per L1, for a total of 1,190 texts. The researchers also used several of the same features as were used by Koppel et al., but focused mainly on the effects of letter bigrams and trigrams. Tsur & Rappoport appear not to have performed an overall classification using all features at the same time, but their results with different sets of features show a 10-fold cross-validated accuracy of 67% with 460 function words, 66% with the 200 most frequent letter bigrams, and 60% with the 200 most frequent letter trigrams. Given that Tsur & Rappoport used essentially the same data and the same classifier as Koppel et al. did, it is unclear why Tsur & Rappoport found lower accuracy rates for similar sets of features. 
Three possible explanations are that (a) Tsur & Rappoport may have used a different kernel with SVM (they did not report which kernel they used), (b) SVM parameters may have been tuned slightly differently in the two studies, and (c) the slightly smaller number of texts used in the Tsur & Rappoport study may have hindered the classifier’s construction of an optimal model of the relationship between the features in question and the learners’ L1 backgrounds. A further follow-up to the Koppel et al. (2005) study is found in Wong & Dras (2009). Like the two previous studies, Wong & Dras also used SVM as their classifier. However, they used SVM with a radial kernel instead of the linear kernel that Koppel et al. used. Wong & Dras also extended their study to seven L1 backgrounds represented in the ICLE, adding Chinese and Japanese to the five L1s in the previous two studies. Despite the increased number of L1 groups, however, the sample in Wong & Dras is smaller than that of the previous two studies, consisting of only 70 texts per group (490 total) as the training set and 25 texts per group (175 total) as the testing set. As this implies, the authors also appear to have used a simple-split CV rather than a k-fold CV. Wong & Dras also used a slightly different set of features, which nevertheless overlaps a great deal with the other two studies. The features in this study include 400 function words, 500 character n-grams, and 650 part-of-speech n-grams. The highest level of classification accuracy achieved was 74%, which was attained with two different sets of features: a combination of all three types of features, and a combination of just the function words and part-of-speech n-grams. The 74% accuracy achieved in this study is perhaps equally remarkable
as the accuracy rate of 80% in the Koppel et al. (2005) study given the larger number of L1s in the Wong & Dras study and the concomitantly lower level of chance (i.e. 14% vs. 20%). At the same time, the smaller sample and the use of a simple-split CV in the Wong & Dras study do cast some doubt on the generalizability of their results. Some of the questions left unanswered by these studies are (a) whether high levels of classification accuracy can also be achieved when the number of L1s is extended substantially beyond seven, (b) whether highly frequent words that include not just function words but also content words will facilitate L1 classification, and (c) what levels of L1 classification accuracy can be attained with n-grams made up of word sequences instead of letter sequences and part-of-speech sequences. These are some of the questions that Jarvis & Paquot (forthcoming) set out to address while taking advantage of the wealth of data available in the newest version of the ICLE. The newest version of the ICLE includes argumentative and literary texts written in English by learners from 16 different L1 backgrounds, but Jarvis & Paquot chose to focus on just 12 of these because the texts written by Chinese, Japanese, Tswana, and Turkish speakers include a relatively high proportion of lower proficiency texts (cf. Table 6 in Granger et al. 2009).² Following the conventions of the previous three studies, Jarvis & Paquot used only those texts that were between 500 and 1,000 words in length. This was done for reasons of inter-class comparability, and a further criterion that the authors imposed on their selection of texts was to use only argumentative texts. The resulting breakdown of the number of texts per L1 group that were included in the study is shown in Table 2.
Table 2. Texts included in the Jarvis & Paquot (forthcoming) study
L1            Number of texts
Bulgarian       140
Czech           116
Dutch           125
Finnish         121
French          200
German          182
Italian          86
Norwegian       270
Polish          288
Russian         144
Spanish         144
Swedish         217
TOTAL         2,033
2. The fact that Wong & Dras (2009) included ICLE texts written by Chinese and Japanese speakers raises additional questions about the reliability of their results.
The features used in the Jarvis & Paquot study included four categories of word n-grams extracted from the data: unigrams (single words), bigrams (two-word sequences), trigrams (three-word sequences), and quadrigrams (four-word sequences). The selected n-grams were the most frequent 200 n-grams in each category that occurred at least 35 times in the data and were not prompt-induced. This latter criterion meant that n-grams were manually disqualified if they included content words (and their families) that were used in the essay prompts that the essays were written in response to. Examples of such prompt-induced words that were disqualified include society, prison, science, technology, television, religion, imagination, and dream, which appear in essay prompts such as ‘Some people say that in our modern world, dominated by science and technology and industrialisation, there is no longer a place for dreaming and imagination. What is your opinion?’ or ‘Marx once said that religion was the opium of the masses. If he was alive at the end of the 20th century, he would replace religion with television’. The final feature set for the study consisted of 200 unigrams, 200 bigrams, 200 trigrams, but only 122 quadrigrams because only 122 quadrigrams met the criteria for inclusion. The total number of features was therefore 722, and these included frequent n-grams made up of both content words and function words. Unlike the previous three studies, which used an SVM classifier, Jarvis & Paquot chose LDA as their classifier because of the relatively clear interpretability of the results it produces and for reasons of consistency with the other studies in the same collection. One of the disadvantages of this choice, however, is that LDA has stricter statistical assumptions than SVM and most other classifiers, requiring at least 10 cases per feature. 
Given that there were 2,033 texts in their dataset, Jarvis & Paquot were able to perform LDA classification using only approximately 200 features instead of the full set of 722 features. In order to comply with this limitation while nevertheless taking advantage of the full set of 722 features, they performed stepwise feature selection using a relatively strict criterion for feature entry (p < .01) and removal (p > .05), which resulted in the selection of 200 features representing a combination of unigrams, bigrams, trigrams, and quadrigrams that contributed the most to the strength of the model. Because they combined feature selection with classification, it was also necessary to embed the stepwise procedure within their 10-fold CV. Their final embedded 10-fold cross-validated classification accuracy for predicting the L1 backgrounds of the 2,033 texts representing 12 L1 backgrounds was 53.57%. This is lower than the accuracy levels achieved by the previous three studies, but this is of course to be expected in light of the substantially higher number of L1s in this study. Jarvis & Paquot set their LDA classifier to treat all L1 groups as having equal prior probabilities, which means that the level of chance for correct L1 prediction was only 8% (compared with 20% for five L1 groups and 14% for seven L1 groups). A more conservative baseline is the level of accuracy that would have been attained if all texts had been classified as belonging to the biggest L1 group (i.e. the Polish group, n = 288), in which case the baseline is 14% accuracy. In either case, the result of 53%
cross-validated classification accuracy with 12 L1 groups is quite high and clearly points to group-distinctive behaviors in the learners’ use of n-grams. Although the higher number of L1s is certainly largely responsible for the lower classification accuracies found in the Jarvis & Paquot study vis-à-vis those found in the previous three studies, other factors that may also have contributed to the difference are differences in (a) the strictness of the researchers’ criteria for text inclusion, (b) the specific features that were used in the analysis, and (c) the classifier that was used. In the study presented in the following sections of this chapter, I address the third possibility, and specifically consider whether other classifiers can achieve higher cross-validated classification accuracies with precisely the same texts and features that Jarvis & Paquot used. This is especially relevant in light of the fact that many other classifiers do not have strict restrictions on the ratio of texts to features. Accordingly, an important question is whether classifiers that are able to create a model that includes all 722 of the features used in the Jarvis & Paquot study would achieve superior classification accuracies.
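The baseline figures cited above can be checked directly against the text counts in Table 2. The short calculation below is offered only as a worked illustration; the variable names are my own, and the last line simply applies the rule of thumb of at least 10 cases per feature mentioned earlier.

```python
# Number of texts per L1 group, copied from Table 2.
texts_per_l1 = {
    "Bulgarian": 140, "Czech": 116, "Dutch": 125, "Finnish": 121,
    "French": 200, "German": 182, "Italian": 86, "Norwegian": 270,
    "Polish": 288, "Russian": 144, "Spanish": 144, "Swedish": 217,
}

total_texts = sum(texts_per_l1.values())

# Chance level when all L1 groups are given equal prior probabilities.
chance_level = 1 / len(texts_per_l1)

# More conservative baseline: every text assigned to the largest group.
largest_group = max(texts_per_l1, key=texts_per_l1.get)
majority_baseline = texts_per_l1[largest_group] / total_texts

# Ceiling on features for LDA under the 10-cases-per-feature rule of thumb.
max_lda_features = total_texts // 10

print(f"total texts:        {total_texts}")
print(f"chance level:       {chance_level:.1%}")
print(f"majority baseline:  {majority_baseline:.1%} ({largest_group})")
print(f"max LDA features:   {max_lda_features}")
```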
4. Method
As just mentioned, the data used in the present study are identical to those in the Jarvis & Paquot (forthcoming) study. A breakdown of the 2,033 texts included in the present study was shown in Table 2, and the features are, again, 722 n-grams made up of the most frequent unigrams (n = 200) (e.g. the, to, of, and, a, is, in, that, it, are, for, be, not, they, have, we), bigrams (n = 200) (e.g. of the, on the, there is, I think, we are), trigrams (n = 200) (e.g. a lot of, in order to, the fact that, one of the, on the other, in my opinion), and quadrigrams (n = 122) (e.g. on the other hand, at the same time, I would like to, to be able to, is one of the) that occur at least 35 times in the data and are not prompt-induced. The purpose of the present study is to compare the performance of a number of classifiers in relation to their ability to produce accurate cross-validated predictions of the L1 backgrounds of the 2,033 texts in the data, with all relying on the same pool of n-gram features. The classifiers selected for this study include Linear Discriminant Analysis (used by Jarvis & Paquot), Support Vector Machines (used by Koppel et al. 2005, Tsur & Rappoport 2007, Wong & Dras 2009, and found to be a superior classifier by Shen et al. 2007), Random Forest (found to be superior for L1 detection by Estival et al. 2007), Sequential Minimal Optimization (also found to be useful by Estival et al. 2007), Nearest Shrunken Centroids (found to be superior for authorship attribution by Jockers & Witten 2010), Delta and Delta Prime (also found to be useful by Jockers & Witten 2010), Naïve Bayes (used by Mayfield Tomokiyo & Jones 2001 in their study of L1 detection), and a number of classifiers included in the Weka toolkit. Some classifiers that are available in the Weka toolkit were not included because of how much time they took to process the data (e.g. the MultilayerPerceptron classifier took over two hours to complete its initial, pre-CV analysis), or because their categories were considered to be already well enough represented.
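The selection of word n-gram features described above (the most frequent n-grams in each category that occur at least 35 times and are not prompt-induced) can be approximated with a short sketch. Several simplifications are assumed here and should not be attributed to the original study: tokens are lowercased and split with a simple regular expression, and prompt-induced n-grams are excluded by exact word match against a hypothetical list, whereas the original disqualification was done manually and extended to word families.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens (a simplification)."""
    return re.findall(r"[a-z']+", text.lower())

def word_ngrams(tokens, n):
    """Return all contiguous n-word sequences in a token list."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def select_ngrams(texts, n, top=200, min_count=35, prompt_words=frozenset()):
    """Most frequent n-grams meeting the frequency and prompt criteria."""
    counts = Counter()
    for text in texts:
        counts.update(word_ngrams(tokenize(text), n))
    kept = [gram for gram, count in counts.most_common()
            if count >= min_count and not (set(gram.split()) & prompt_words)]
    return kept[:top]

# Toy corpus (invented); the thresholds are lowered so that something survives.
texts = ["On the other hand, the dream", "on the other hand"]
prompt_words = frozenset({"dream"})  # hypothetical prompt-induced word
for n in (1, 2, 3, 4):
    print(n, select_ngrams(texts, n, top=5, min_count=2, prompt_words=prompt_words))
```

With a real corpus, the defaults of top=200 and min_count=35 correspond to the criteria stated in the text, and a category may yield fewer than 200 items if too few n-grams meet them, as happened with the quadrigrams.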
In all, I used 20 classifiers and attempted to find the optimal classification accuracy for each by experimenting with various parameter settings and feature-selection methods. Of course, it was not feasible to test all possible parameter settings and feature-selection methods with these 20 classifiers, so it is possible that the optimal accuracy rates for some classifiers may be higher than what I have found. Nevertheless, I am confident in the results for the best-performing classifiers because most of these achieved their highest levels of classification accuracy without any parameter tuning or feature selection at all. In all cases, the final classification accuracies were determined through 10-fold CV. In the case of LDA, where the optimal result was obtained through the reduction of features, feature selection was embedded in the 10-fold CV. In the case of the classifiers available in Weka, however, only non-embedded 10-fold CV was used. This means that the classifiers run in Weka that relied on feature selection may have produced somewhat overly optimistic classification accuracies, but this is probably not a problem because these classifiers produced the lowest accuracy rates in any case.
5. Results
The results of the classifier comparison are shown in Table 3, where the 20 classifiers are listed according to their optimal accuracy rates with respect to the data under investigation. The table also shows the classifier type that each classifier represents, as well as the software application (and the software package, in the case of R) that was used to run the classifier. The final two columns in the table show the number of features that was selected for each classifier’s optimal model and the optimal 10-fold cross-validated accuracy rates for each classifier. As these results show, the best-performing classifiers for the present classification task are Linear Discriminant Analysis (LDA), Sequential Minimal Optimization (SMO), Naïve Bayes Multinomial (NBM), and Nearest Shrunken Centroids (NSC), with relatively little difference between them. LDA was restricted to only 200 features, but it performed as well as or better than all classifiers that could and did include all 722 features in their models. Aside from LDA, the other 10 classifiers with the highest classification accuracies all made use of the full set of 722 features, although this statement needs to be qualified with respect to Delta Prime, which in the present case was found to achieve optimal results when it ignored differences between texts and the means of particular L1 groups that did not exceed 0.8 standard deviations. All 20 classifiers achieved classification accuracies that are substantially higher than chance (8%) and, with the exception of Random Tree, also substantially higher than the more conservative baseline of 14%. Nevertheless, a noticeable gap exists between the seven classifiers with the best performance (accuracies of 44.66% and higher) and the 13 classifiers with the worst performance (accuracies of 39.10% and lower).
Table 3. Classifiers ordered by accuracy in detecting L1 background
Classifier                           Class. Type  Application  Features        Accuracy (10-fold CV)
Lin. Discriminant Analysis (LDA)     Centroid     SPSS         200 (stepwise)  53.57%
Seq. Minimal Optimization (SMO)      Boundary     Weka         722             53.22%
Naïve Bayes Multinomial (NBM)        Bayesian     Weka         722             52.29%
Nearest Shrunken Centroids (NSC)     Centroid     R (pamr)     722             51.45%
Delta Prime (>.8) (DP)               Means        Perl script  722             47.37%
Complement Naïve Bayes (CNB)         Bayesian     Weka         722             44.76%
Support Vector Machines (SVM)        Boundary     R (e1071)    722             44.66%
Naïve Bayes Tree (NBT)               Composite    Weka         722             39.10%
Bayes Net (BN)                       Composite    Weka         722             38.81%
Logit Boost (LB)                     Composite    Weka         722             37.48%
Naïve Bayes (NB)                     Bayesian     Weka         722             34.92%
Classification via Regression (CVR)  Composite    Weka         69 (infogain)   34.38%
Bagging (Bag)                        Tree         Weka         69 (infogain)   32.66%
Delta                                Means        Perl script  722             30.20%
Random Forest (RF)                   Tree         Weka         55 (g. stepw.)  29.32%
Simple Cart (SC)                     Tree         Weka         69 (infogain)   26.36%
J48 Graft (J48G)                     Tree         Weka         55 (g. stepw.)  23.81%
J48                                  Tree         Weka         55 (g. stepw.)  23.02%
Jrip                                 Rule         Weka         69 (infogain)   23.02%
Random Tree (RT)                     Tree         Weka         55 (g. stepw.)  18.74%
A further question that is addressed by the results is whether the majority-vote method with an ensemble of classifiers might produce higher classification accuracies than any one classifier alone (cf. Kotsiantis 2007). To find out, I created several different ensembles of classifiers, beginning with the two to five classifiers with the highest accuracy rates in Table 3, and then successively including four additional classifiers that had relatively high classification rates but rely on relatively distinct algorithms. I considered the diversity of algorithms to be an advantage for the majority-vote (or ensemble) method because it helps to avoid a situation where similar algorithms lead to the same misclassifications. In each ensemble of classifiers, I extracted the L1 predictions that each classifier made regarding each text, and then determined each text’s ensemble classification as the L1 prediction made by a plurality of classifiers. The L1 prediction made by a classifier was thus treated as the classifier’s vote, and the final L1 classification for a text was determined by a plurality of votes. In some cases, however, two or more L1s received the same number of winning votes (i.e. two or more L1s tied for first place). There were a number of cases, for example, where the highest number of votes that any L1 received was only two or three, and in such cases it was common for more than one L1 to receive this number of votes. When this happens, it is difficult to say whether the ensemble method has truly identified the correct L1, but it is clear that this may nevertheless be a useful way of narrowing the range of possibilities when the L1 truly is not known.

Table 4 shows the classification accuracies that were achieved with each ensemble of classifiers. The results include both the percentage of correct classifications where there was no tie for first place and the percentage of correct classifications that included ties for first place. The former involves unambiguously correct classifications, whereas the latter is a combination of unambiguously correct classifications plus those cases where the ensemble vote identified the correct L1 as one of two or more equal possibilities.

It is not completely clear from these results whether the ensemble method is superior to the use of LDA or SMO alone. On the basis of clear-winner voting, ensembles of at least three classifiers produce results very similar to those of LDA and SMO. When ties are taken into consideration, the ensembles produce higher classification accuracies, but ties ultimately still need to be resolved. Assuming that ties will not be settled completely successfully, the true predictive power of ensemble voting is likely to be somewhere between the clear-winner and tied-for-first accuracy rates. Interestingly, both the clear-winner and tied-for-first accuracy rates appear to become stable with ensembles of five or more classifiers, remaining in the narrow range of 53.32–53.76% for clear winners, and staying within the narrow range of 59.67–60.21% for ties.
In light of these results, it seems doubtful that the inclusion of more classifiers in the ensemble would result in higher accuracy rates, especially since the accuracy rates of most of the remaining available classifiers are rather low.

Table 4. Ensemble classification by majority vote

Ensemble of classifiers | Accuracy (clear winner) | Accuracy (clear winner + tied for first place)
LDA, SMO | 38.71% | 68.08%
LDA, SMO, NBM | 53.96% | 65.17%
LDA, SMO, NBM, NSC | 51.50% | 61.49%
LDA, SMO, NBM, NSC, DP | 53.32% | 59.67%
LDA, SMO, NBM, NSC, DP, LB | 53.47% | 60.16%
LDA, SMO, NBM, NSC, DP, LB, CVR | 53.76% | 59.81%
LDA, SMO, NBM, NSC, DP, LB, CVR, Bag | 53.37% | 60.21%
LDA, SMO, NBM, NSC, DP, LB, CVR, Bag, SC | 53.52% | 59.76%
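The two accuracy figures reported for each ensemble can be illustrated with a minimal Python sketch of plurality voting. It is not the procedure actually used in the study (which is not given as code); the function names and the toy predictions below are assumptions made purely for illustration.

```python
from collections import Counter

def plurality_winners(votes):
    """Return the set of labels that share the highest vote count."""
    counts = Counter(votes)
    top = max(counts.values())
    return {label for label, n in counts.items() if n == top}

def ensemble_accuracy(predictions, gold):
    """predictions: one list of predicted L1s per classifier (same text order).
    gold: the true L1 of each text.
    Returns (clear-winner accuracy, clear-winner + tied-for-first accuracy)."""
    clear = with_ties = 0
    for i, truth in enumerate(gold):
        winners = plurality_winners([p[i] for p in predictions])
        if truth in winners:
            with_ties += 1          # correct L1 is at least tied for first place
            if len(winners) == 1:
                clear += 1          # correct L1 is the sole winner
    return clear / len(gold), with_ties / len(gold)

# Toy data (invented): three classifiers voting on four texts.
predictions = [
    ["FR", "FI", "DE", "ES"],
    ["FR", "SE", "DE", "IT"],
    ["FR", "FI", "NL", "FR"],
]
gold = ["FR", "FI", "NL", "ES"]
print(ensemble_accuracy(predictions, gold))  # (0.5, 0.75)
```

In the fourth toy text every classifier votes differently, so the correct L1 counts only toward the tied-for-first figure. With only two voters, any disagreement necessarily produces a tie, which is consistent with the large gap between the two figures for the LDA, SMO ensemble in Table 4.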
6. Discussion and conclusions

The primary research question addressed in the present study concerns which of the many available classifiers is best able to learn to recognize the relationship between n-gram patterns in ICLE texts and the L1 group membership of the learners who produced those texts. For the classification task used in the present study, LDA showed the strongest ability to learn these patterns, but SMO, NBM, and NSC produced comparably high levels of classification accuracy. One important difference, however, is that LDA achieved this result with far fewer features than these other classifiers did. In some situations, such as where the number of texts available is only a few dozen or even a few hundred, LDA’s restrictions on the ratio of features to texts may severely hinder its usefulness. This was not the case in the present analysis, however. Thus, it appears that LDA was indeed one of the best options – if not the best option – for the particular purposes of the Jarvis & Paquot (forthcoming) study.

According to the results of the present study, not even the use of the ensemble method would have led to an improvement over LDA in the number of unambiguously correctly classified cases. Potential uses of the ensemble method do seem intriguing, however, especially with respect to how they might help to determine the true percentage of texts in a corpus that contain the relevant group-related pattern. The question of which classifier is best is actually secondary to this higher aim of discovering and capturing as much of the true group-related patterning (or signal) as possible. It seems that the ensemble method is a good place to start in order to obtain an estimate of the percentage of texts in which the true signal may be found. On the basis of the results of the ensemble method, the researcher could then select a specific classifier that accounts for as much of that signal as possible, and which also has other characteristics (e.g. interpretational transparency) that are conducive to the purposes of the study. In the present case, it appears that LDA was able to capture most of the true L1-related signal embedded within the features it was given, although the tied-for-first results of the ensemble method suggest that some of that signal may be intertwined with other signals and may therefore be difficult if not impossible to tease apart fully.

A critical question regarding L1-related signals is whether a classifier’s ability to identify the correct L1 backgrounds of L2 texts necessarily means that the signal that the classifier has tuned into really is being produced by the L1 itself, or whether it may be produced by other factors that happen to coincide with L1 group divisions. For example, if the L1 groups represented in the data are not at equivalent levels of L2 proficiency, if they have had different types and amounts of L2 instruction, and/or if they have experienced the L2 in differing environments with different types and amounts of input and exposure and differing opportunities to use the L2, then these factors by themselves, in combination with one another, and/or in combination with L1 influence, may be the source of the signal that allows the classifier to achieve such high levels of L1 classification accuracy. Regarding the ICLE data used in the present study, it is in fact certain that the different L1 groups have differing ranges of L2 abilities
(see Bestgen et al. forthcoming), and there are also some indications of potential effects of training and instruction on the ICLE texts that coincide with L1 divisions (see e.g. Paquot 2010). Nevertheless, there is also a great deal of evidence of direct L1 influence in the ICLE data, such as the French writers’ distinctively high use of on the contrary (see also Paquot 2007) and the Finnish writers’ distinctively high use of all the time (see Jarvis & Paquot forthcoming), whose counterparts in the respective L1s occur with correspondingly high frequency rates. The very real effects of cross-linguistic influence are also underscored by the finding by Jarvis & Paquot (forthcoming) that, when texts are misclassified, there is a strong tendency for them to be classified into the correct language families (e.g. Dutch as German, Italian as French or Spanish, Norwegian as Swedish). An important direction for future research in this area will be to combine the power of classifiers with principled methods for teasing apart the effects of the L1 from other potential factors (cf. Jarvis 2000; Jarvis 2010).

In this study, I have highlighted the use of classifiers for investigating L1-related effects, but it is important to recognize that classifiers can be used similarly for the investigation of the relationship between language features and other class variables, such as text type, task type, topic, learners’ L1 writing proficiency, learners’ L2 proficiency in general or in specific ability areas, learners’ educational backgrounds, the context of their L2 learning, the number of years of L2 instruction they have received, and so forth. Concerning cross-linguistic influence, classification could and probably should also be used to investigate influences of prior languages besides just the L1. One of the interesting results in Jarvis & Paquot (forthcoming), for example, is that when Finnish speakers’ texts are misclassified, they are more frequently identified with Swedish than with any other L1 background. Interestingly, Swedish is unrelated to Finnish but is a language that all Finnish speakers are required to study in school. Influences of nonnative languages on each other may, in fact, be some of the strongest signals intertwined with the L1 signal. Other promising avenues for the future of this area of research include the development of classifiers that perform cross-language comparisons and retrieval (cf. Sorg & Cimiano 2010), which could considerably enhance the ability of classifiers to detect and verify direct influences of one language on another. Finally, despite the exciting technological developments in this area of research, it is of course important to make sure that we use them to enhance rather than to supplant our expertise in qualitative linguistic analysis, which is ultimately what allows us to make sense of learner data.
References

Bestgen, Y., Granger, S. & Thewissen, J. Forthcoming. Error patterns and automatic L1 identification. In Approaching Transfer through Text Classification: Explorations in the Detection-based Approach, S. Jarvis & S. Crossley (eds). Bristol, UK: Multilingual Matters.
Burns, R.B. & Burns, R.A. 2008. Business Research Methods and Statistics Using SPSS. London: Sage.
Burrows, J.F. 2002. ‘Delta’: A measure of stylistic difference and a guide to likely authorship. Literary and Linguistic Computing 17: 267–287.
Crossley, S.A. & McNamara, D.S. 2009. Computational assessment of lexical differences in L1 and L2 writing. Journal of Second Language Writing 18: 119–135.
Duda, R.O., Hart, P.E. & Stork, D.G. 2000. Pattern Classification, 2nd edn. New York NY: Wiley.
Estival, D., Gaustad, T., Pham, S.B., Radford, W. & Hutchinson, B. 2007. Author profiling for English emails. Proceedings of the 10th Conference of the Pacific Association for Computational Linguistics (PACLING 2007), Melbourne, Australia, 31–39.
Field, A. 2005. Discovering Statistics Using SPSS. London: Sage.
Francis, W. & Kucera, H. 1982. Frequency Analysis of English Usage: Lexicon and Grammar. Boston MA: Houghton Mifflin.
Gavin, G. & Teytaud, O. 2002. Lower bounds for training and leave-one-out estimates of the generalization error. In ICANN 2002, LNCS 2415, J.R. Dorronsoro (ed.), 583–588. Berlin: Springer.
Granger, S., Dagneaux, E., Meunier, F. & Paquot, M. 2009. The International Corpus of Learner English. Handbook and CD-ROM, Version 2. Louvain-la-Neuve: Presses universitaires de Louvain.
Guan, H., Zhou, J. & Guo, M. 2009. A class-feature-centroid classifier for text categorization. In Proceedings of the 18th International Conference on the World Wide Web, 201–210. New York NY: Association for Computing Machinery.
Guyon, I. & Elisseeff, A. 2003. An introduction to variable and feature selection. Journal of Machine Learning Research 3: 1157–1182.
Herlitz, G.N. 1994–1995. The meaning of the term prima facie. Louisiana Law Review 55: 391–408.
Hinton, G. & Sejnowski, T.J. (eds). 1999. Unsupervised Learning: Foundations of Neural Computation. Cambridge MA: The MIT Press.
Hoover, D.L. 2004a. Testing Burrows’s Delta. Literary and Linguistic Computing 19: 453–475.
Hoover, D.L. 2004b. Delta prime? Literary and Linguistic Computing 19: 477–495.
Hyvärinen, A. & Oja, E. 2000. Independent component analysis: Algorithms and application. Neural Networks 13: 411–430.
Jarvis, S. 2000. Methodological rigor in the study of transfer: Identifying L1 influence in the interlanguage lexicon. Language Learning 50: 245–309.
Jarvis, S. 2010. Comparison-based and detection-based approaches to transfer research. In EUROSLA Yearbook 10, L. Roberts, M. Howard, M. Ó Laoire & D. Singleton (eds), 169–192. Amsterdam: John Benjamins.
Jarvis, S. Forthcoming. Introduction. In Approaching Transfer through Text Classification: Explorations in the Detection-based Approach, S. Jarvis & S. Crossley (eds). Bristol, UK: Multilingual Matters.
Jarvis, S., Castañeda-Jiménez, G. & Nielsen, R. 2004. Investigating L1 lexical transfer through learners’ wordprints. Paper presented at the 2004 Second Language Research Forum. State College, Pennsylvania.
Jarvis, S. & Crossley, S.A. (eds). Forthcoming. Approaching Transfer through Text Classification: Explorations in the Detection-based Approach. Bristol, UK: Multilingual Matters.
Jarvis, S., Grant, L., Bikowski, D. & Ferris, D. 2003. Exploring multiple profiles of highly rated learner compositions. Journal of Second Language Writing 12: 377–403.
Jarvis, S. & Paquot, M. Forthcoming. Exploring the role of n-grams in L1 identification. In Approaching Transfer through Text Classification: Explorations in the Detection-based Approach, S. Jarvis & S. Crossley (eds). Bristol, UK: Multilingual Matters.
Jarvis, S. & Pavlenko, A. 2008. Crosslinguistic Influence in Language and Cognition. London: Routledge.
Jockers, M.L. & Witten, D.M. 2010. A comparative study of machine learning methods for authorship attribution. Literary and Linguistic Computing 25: 215–223.
Keerthi, S.S., Shevade, S.K., Bhattacharyya, C. & Murthy, K.R.K. 2001. Improvements to Platt’s SMO algorithm for SVM classifier design. Neural Computation 13: 637–649.
Kohavi, R. & John, G.H. 1997. Wrappers for feature subset selection. Artificial Intelligence 97: 273–324.
Koppel, M., Schler, J. & Zigdon, K. 2005. Determining an author’s native language by mining a text for errors. In Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, 624–628. Chicago IL: Association for Computing Machinery.
Kotsiantis, S. 2007. Supervised machine learning: A review of classification techniques. Informatica Journal 31: 249–268.
Kotsiantis, S. & Pintelas, P. 2004. Recent advances in clustering: A brief survey. WSEAS Transactions on Information Science and Applications 1: 73–81.
Lecocke, M. & Hess, K. 2006. An empirical study of univariate and genetic algorithm-based feature selection in binary classification with microarray data. Cancer Informatics 2: 313–327.
Liu, J.Y., Zhuang, D.F., Luo, D. & Xiao, X. 2003. Land-cover classification of China: Integrated analysis of AVHRR imagery and geophysical data. International Journal of Remote Sensing 24: 2485–2500.
Mayfield Tomokiyo, L. & Jones, R. 2001. You’re not from ’round here, are you? Naive Bayes detection of non-native utterance text. In Proceedings of the Second Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL ’01), unpaginated electronic document. Cambridge MA: The Association for Computational Linguistics.
McCallum, A. & Nigam, K. 1998. A comparison of event models for Naïve Bayes text classification. In AAAI/ICML-98 Workshop on Learning for Text Categorization [Technical Report WS-9805], 41–48. Menlo Park CA: Association for the Advancement of Artificial Intelligence.
McLachlan, G.J. 2004. Discriminant Analysis and Statistical Pattern Recognition. Hoboken NJ: Wiley.
Millet-Roig, J., Ventura-Galiano, R., Chorro-Gascó, F.J. & Cebrián, A. 2000. Support Vector Machine for arrhythmia discrimination with wavelet-transform-based feature selection. Computers in Cardiology 27: 407–410.
Molinaro, A.M., Simon, R. & Pfeiffer, R.M. 2005. Prediction error estimation: A comparison of resampling methods. Bioinformatics 21: 3301–3307.
Paquot, M. 2007. EAP Vocabulary in EFL Learner Writing. A Phraseology-oriented Approach. PhD dissertation, Université Catholique de Louvain.
Paquot, M. 2010. Academic Vocabulary in Learner Writing: From Extraction to Analysis. London: Continuum.
Prinzie, A. & Van den Poel, D. 2007. Random multiclass classification: Generalizing Random Forests to random MNL and random NB. In DEXA 2007, LNCS 4653, R. Wagner, N. Revell & G. Pernul (eds), 349–358. Berlin: Springer.
Raileanu, L.E. & Stoffel, K. 2004. Theoretical comparison between the Gini Index and Information Gain criteria. Annals of Mathematics and Artificial Intelligence 41: 77–93.
Shen, C., Breen, T.E., Dobrolecki, L.E., Schmidt, C.M., Sledge, G.W., Miller, K.D. & Hickey, R.J. 2007. Comparison of computational algorithms for the classification of liver cancer using SELDI mass spectrometry: A case study. Cancer Informatics 3: 329–339.
Sorg, P. & Cimiano, P. 2010. An experimental comparison of explicit semantic analysis implementations for cross-language retrieval. In Natural Language Processing and Information Systems: NLDB 2009, LNCS 5723, H. Horacek, E. Metais & R. Munoz (eds), 36–48. Berlin: Springer.
Tibshirani, R., Hastie, T., Narasimhan, B. & Chu, G. 2003. Class prediction by Nearest Shrunken Centroids, with applications to DNA microarrays. Statistical Science 18: 104–117.
Tsur, O. & Rappoport, A. 2007. Using classifier features for studying the effect of native language on the choice of written second language words. Proceedings of the Workshop on Cognitive Aspects of Computational Language Acquisition, 9–16. Cambridge MA: The Association for Computational Linguistics.
Witten, I.H. & Frank, E. 2005. Data Mining: Practical Machine Learning Tools and Techniques, 2nd edn. Amsterdam: Elsevier.
Wong, S.-M. J. & Dras, M. 2009. Contrastive analysis and native language identification. In Proceedings of the Australasian Language Technology Association, 53–61. Cambridge MA: The Association for Computational Linguistics.
Appendix 1. Types of classifiers

Centroid-based classifiers. This type of classifier creates a vector space model in which each case (e.g. text) is represented as a vector in multidimensional space. The vectors are created by entering the values for each feature (e.g. relative frequencies for various language features found in the text) into formulas that combine these values with numerical weights that maximize the similarities within classes and the differences between classes. The vector space model also creates a prototype vector for each class (e.g. each L1 background), which is referred to as the class centroid. Classification is performed by comparing the vector for a text with each of the class centroids, and by classifying the text as belonging to the class whose centroid is closest, or most similar, to the text’s own vector. Centroid-based classifiers are among the most popular classifiers because of their computational efficiency, and Linear Discriminant Analysis (LDA) is probably the most widely used classifier of this type. Despite their computational efficiency, however, LDA and other traditional centroid-based classifiers have often been found to perform less well than other types of classifiers, such as boundary-based classifiers. Their lower performance is believed to be due to the fact that the traditional algorithms for determining class centroids do not produce initial values that are optimally representative of their classes. To compensate for this deficiency, new classifiers have been developed for adjusting centroids to make them more predictive of the classes they represent (see e.g. Guan, Zhou and Guo 2009). One such classifier is Nearest Shrunken Centroids (NSC). NSC uses a threshold function for adjusting each class centroid a certain distance toward the middle of all class centroids, which in many cases has been found to improve classification accuracy (see e.g. Tibshirani et al. 2003).

Boundary-based classifiers. These classifiers are similar to centroid-based classifiers in the sense that both represent texts as vectors in a vector space model. The difference is that boundary-based classifiers do not use centroids, but instead use mathematical means for determining boundaries between classes. These boundaries are referred to as hyperplanes, and the cases that lie along the margins of the hyperplanes are referred to as support vectors. Classification takes place by determining which side of a hyperplane a case’s vector falls on, and by classifying the case as a member of the class whose side it is on. The most common boundary-based classifier is Support Vector Machines (SVM), but adaptations of SVM have been developed to correct the positioning of the hyperplane between support vectors (e.g. Optimal Separating Hyperplanes, OSH, see Millet-Roig et al. 2000) and to optimize the SVM algorithm for efficiency (e.g. Sequential Minimal Optimization, SMO, see Keerthi et al. 2001).

Bayesian classifiers. The most common Bayesian classifier is referred to as a Naïve Bayes classifier, which relies on a relatively simple algorithm for using feature values to determine the probability that a particular case belongs to a particular class. One of the assumptions underlying this approach is that the presence or absence of one feature is independent of the presence or absence of other features. With respect to language features, this assumption is generally wrong, and for this reason, a Naïve Bayes classifier tends not to classify texts as accurately as more sophisticated types of classifiers do. However, attempts have also been made to modify the Bayesian model in order to create alternative Bayesian classifiers that do take into account the relationships among features (Kotsiantis 2007).
Naïve Bayes Multinomial (NBM) and Complement Naïve Bayes (CNB) are two such classifiers (see e.g. McCallum & Nigam 1998).

Artificial neural networks. An artificial neural network is, as the name implies, a mathematically represented network of interconnected artificial neurons, or nodes. In learning tasks, such as supervised classification, an artificial neural network is trained on input-output pairs in order to allow it to assign appropriate weights to the input that will result in the correct output. Artificial neural networks usually consist of multiple input nodes and at least one layer of so-called hidden nodes between input and output nodes. In this type of model, initial input values are sent to each of the intermediate hidden nodes with which they are connected, and then each of the hidden nodes uses that information with its own set of weights to calculate its own activation value, which is then passed on to the output nodes with which it is connected. As with real neural networks, individual input units can have effects throughout the entire network, but each neuron (or node) interprets and converts information in its own particular way. Some of the more sophisticated neural network classifiers include Multilayer Perceptron Analysis (MPA) and Radial Basis Function (RBF) (e.g. Kotsiantis 2007).

Decision trees. As described by Kotsiantis (2007: 251), “decision trees are trees that classify instances [i.e. texts, in this case] by sorting them based on feature values. Each node in a decision tree represents a feature in an instance to be classified, and each branch represents a value that the node can assume. Instances are classified starting at the root node and sorted based on their feature values”. The root node of a decision tree represents the feature that best divides the training data into correct classes, and the tree’s construction progresses from one node to the next in accordance with which further features best separate the classes in the training data. One problem with decision tree classifiers is that they are prone to overfitting the training data, which means that they tend to become so overly complex in accounting for the specific characteristics of the training data that they do not generalize well to future cases. In order to avoid overfitting, decision-tree classifiers often include pruning algorithms that remove leaves and even full branches that do not improve the classification accuracy. Simple CART (SC) is an example of a decision-tree classifier that uses a pruning algorithm. Another approach to reducing dimensionality is to limit the number of features used to build a tree. The Random Forest (RF) classifier, for example, builds decision trees based on a random subset of features. However, given that a single subset of features might result in an unstable model, RF constructs several random trees (hence, a random forest) and classifies texts according to the majority vote of all the trees in the forest (Kotsiantis 2007, Prinzie & Van den Poel 2007).

Rule-based classifiers. A decision tree can be converted into a rule-based classifier by formulating each possible path from the root node to each leaf as a separate rule, but other types of algorithms can also be used for creating a rule-based classifier. Rules can be thought of as IF-THEN statements, where the IF part of the statement often includes multiple AND (conjunction) and OR (disjunction) conditions related to the features that are used to predict class membership. The predicted class membership is the THEN part of the statement.
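To make the IF-THEN form concrete, the following minimal Python sketch applies an ordered list of rules to a text's feature values. The two n-gram features are the ones discussed in this chapter, but the thresholds, the rule order and the default label are invented for illustration; they are not results of the study, and a real rule learner would induce such rules from training data.

```python
import operator

# Comparison operators allowed in a condition.
OPS = {">": operator.gt, "<=": operator.le}

# Each rule is (conditions joined by AND, predicted class).
# Several rules predicting the same class act as OR alternatives.
# Thresholds are hypothetical relative frequencies.
RULES = [
    ([("on_the_contrary", ">", 0.5), ("all_the_time", "<=", 0.2)], "French"),
    ([("all_the_time", ">", 0.4)], "Finnish"),
]

def classify(features, rules=RULES, default="unknown"):
    """Return the class of the first rule whose conditions all hold."""
    for conditions, label in rules:
        if all(OPS[op](features.get(name, 0.0), threshold)
               for name, op, threshold in conditions):
            return label
    return default  # no rule fired

print(classify({"on_the_contrary": 0.8, "all_the_time": 0.1}))  # French
print(classify({"all_the_time": 0.6}))                          # Finnish
```

Keeping the rules as data rather than as hard-coded branches makes it easy to see how many rules each class has accumulated.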
There can be several rules associated with each class, but an excessive number of rules generated by the classifier “is usually a sign that the learning algorithm is attempting to ‘remember’ the training set, instead of discovering the assumptions that govern it” (Kotsiantis 2007: 253). This would be a matter of overfitting the training data, and one way of minimizing overfitting is to use pruning procedures, similar to what is done with decision-tree classifiers. RIPPER is a prominent rule-based classifier that proceeds repeatedly through a series of growing and pruning phases until it arrives at an optimal set of rules (see Kotsiantis 2007: 253–254).

Means-based classifiers. This is perhaps the simplest type of classifier, as it is based simply on the mean difference between the value of each feature in a particular text and the mean values of those same features for each class. Because different features will usually have different ranges of values, the values are first converted into z scores, which use a standardized scale to represent how many standard deviations above or below the mean a particular value is. To take a simple example, if the features in question are the relative frequencies of the words the, of, and and, then the first step is to calculate the overall means and standard deviations for these words in the entire training set. These overall means and standard deviations are then used as the basis for calculating z scores for each of these words for every text and for every class (e.g. every L1 group). Classification is carried out by determining the mean difference between the z scores for these features in a text and the corresponding z scores for each class, and by classifying the text as a member of the class for which it shows the smallest mean difference. This type of classification was first introduced by Burrows (2002) as a way of determining the authorship of a specific text, where the classes in question are individual authors rather than groups. This method has been referred to alternately as Delta and Burrows’ Delta (e.g. Hoover 2004a), and has proven to be quite effective for identifying the authors of specific texts. However, under certain circumstances, it has been found to be even more effective when it is combined with an algorithm that ignores small differences (e.g. < 0.2 standard deviations) between the z scores of texts and classes. This modified method is sometimes referred to as Delta Prime (Hoover 2004b).

Composite classifiers. Not all classifiers fit neatly into one of the categories just described, and in fact some classifiers combine elements from more than one category. For example, a Naïve Bayes Tree is constructed as a normal decision tree but then implements Naïve Bayes classification at its leaves. Other composite classifiers combine regression methods with classification procedures by giving binary values to classes, and by creating a separate regression model for each class value. This is referred to as Classification via Regression. A variation of this is the LogitBoost classifier, which creates a logistic model of the probability that a case belongs to one of two competing classes. The LogitBoost classifier also re-samples the data adaptively so that it prioritizes the selection of cases that are most often misclassified, which then makes the classifier better able to account for unusual cases. Through re-sampling, LogitBoost creates multiple possible models of the training data.
The classification of cases in LogitBoost is carried out through the majority-vote principle, similar to what was described in relation to the decision tree RF classifier (see e.g. Shen et al. 2007). LogitBoost and several other composite classifiers are referred to as meta classifiers in the Weka toolkit (see Witten & Frank 2005).
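As an illustration of the means-based (Delta) procedure described earlier in this appendix, the following Python sketch standardizes feature values against the whole training set and assigns a text to the class with the smallest mean absolute difference in z scores. It is a minimal sketch and not the Perl script used in the study: the function name, the data layout, the use of the population standard deviation, and the toy frequencies for the, of and and are all assumptions made for illustration, and the Delta Prime refinement is not included.

```python
from statistics import mean, pstdev

def delta_classify(train, text):
    """train: dict mapping class label -> list of feature vectors
    (e.g. relative frequencies of 'the', 'of', 'and' per text).
    text: feature vector of the text to classify.
    Returns the label with the smallest mean absolute z-score difference."""
    rows = [row for class_rows in train.values() for row in class_rows]
    n = len(text)
    # Overall mean and standard deviation of each feature in the training set.
    mus = [mean(r[j] for r in rows) for j in range(n)]
    sds = [pstdev([r[j] for r in rows]) for j in range(n)]

    def z_scores(vec):
        # A constant feature (sd == 0) is left unscaled to avoid dividing by zero.
        return [(v - m) / (s if s > 0 else 1.0) for v, m, s in zip(vec, mus, sds)]

    z_text = z_scores(text)
    delta = {}
    for label, class_rows in train.items():
        class_means = [mean(r[j] for r in class_rows) for j in range(n)]
        z_class = z_scores(class_means)
        delta[label] = mean(abs(a - b) for a, b in zip(z_text, z_class))
    return min(delta, key=delta.get)

# Toy training data (invented): two texts per L1 group, three features each.
train = {
    "FR": [[0.060, 0.030, 0.025], [0.062, 0.031, 0.024]],
    "FI": [[0.050, 0.035, 0.030], [0.052, 0.036, 0.031]],
}
print(delta_classify(train, [0.061, 0.030, 0.025]))  # FR
```

Standardizing against the whole training set, rather than within each class, keeps the z scores of texts and classes on the same scale, which is what makes the differences comparable across features.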
Learners and users – Who do we want corpus data from?

Anna Mauranen

Learner corpora and lingua franca corpora differ in important ways in social and interactional aspects. Yet in the cognitive domain of language processing they have much in common, as reflected in lexicogrammatical and phraseological features. They can therefore be seen as complementary takes on second language research. We can expect advanced second/foreign language learners to show similar linguistic features to lingua franca speakers, and supporting evidence is accumulating. This paper suggests that although some features in an English as a lingua franca (ELF) corpus can be explained on cognitive principles similar to those likely to operate in learners, such as economy of effort, others cannot. For instance the common use of not quite native-like phraseological units requires a use-based rather than learning-based explanation. On the whole, the major differences between learner and ELF corpora make it necessary to keep them separate. At the same time, both can contribute results of considerable mutual interest.
1. Introduction

In a world that is increasingly globalised, learning and using second and foreign languages has become everyday reality for a growing number of people. Bi- or multilingualism has always been a normal feature of human life, but powerful linguistic theories were erected in the last century on the idea that the ordinary speaker is monolingual. This speaker was assumed to possess a virtually infallible intuition about the grammaticality of his or her own language, and theoretically modelling such an intuitive competence would involve just a little extra step of abstraction and idealization. Well, reality has struck back and we have accepted not only the fallibility of the native speaker, but also their multilingualism. The majority of people have some knowledge of more than one language. The new realism in linguistics has shifted our interest increasingly towards analysing the ways in which languages are used, and building models on the basis of large databases of actual usage. Corpora soon proved to be not merely convenient
repositories of authentic examples for grammarians and lexicographers to draw on but powerful instruments for discovering new facets of language. Collecting corpora started with native speakers, and this already revolutionised second/foreign-language teaching: instead of trusting the native’s intuition, the learner could now consult the natives’ attested language practices by accessing corpora. Tim Johns’s “Data-driven learning” (e.g. Johns 1991, 1994) was an innovative application to language learning of the insights that John Sinclair’s Cobuild project had uncovered in lexical patterning (e.g. Sinclair 1987, 1991). The route of corpus-based research from one application (lexicography) to revelations of theoretical significance (phraseological patterning in language) and from there to a different application (language learning) showed in an intriguing way how closely intertwined practical and theoretical interests can be. In the same spirit of merging practical and theoretical interests, another step was not a long way off, which put corpora and learners together in yet a new configuration: language learners themselves became sources of linguistic data. Second/foreign-language corpora, or computer learner corpora (CLC) began to make their way to the corpus world and even make inroads into the Second Language Acquisition (SLA) research world in the mid-nineties. Sylviane Granger was the primus motor in this with her now world-famous International Corpus of Learner English (ICLE; see Granger et al. 2009) of English as a Foreign Language (EFL) essays, which has subsequently been followed by other kinds of learner corpora, both in Louvain and elsewhere: the Polish and English Language Corpora for Research and Applications (PELCRA) corpus, the Japanese EFL Learner corpus (JEFLL), and the Hong Kong TELEC Secondary Learner Corpus (TSLC) among others. 
The remarkable applicability of corpus findings to language learning was testified by a lexicographic extension (Macmillan English Dictionary 2007), which made use of learner corpus results by Granger’s team (see e.g. Gilquin et al. 2007). Learner corpora have made a difference to language research on two fronts: one is the break away from corpora of native speakers only. There is no reason to assume that the only speaker groups of linguistic interest should be native speakers. How languages are learned, whether first or later languages, is a central theoretical question in linguistic enquiry. The other direction where learner corpora are beginning to make an impact is the field of second language acquisition. SLA has traditionally been dominated by an experimental approach, and thereby necessarily small-scale studies. With sizable corpora of learner language, general patterns of L2 can be found (assuming ‘second language’, L2, as a cover term here for foreign, second, third, nth and any nonfirst language), and importantly, corpus data can provide a powerful source in investigating the influence of first language transfer, along with other kinds of data (see, for example, Granger et al. 2002; Jarvis & Pavlenko 2008). Evidence of both has been found in the many studies that learner corpora have given rise to all over the world. What is the next step in this chain of corpus linguistic development? It seems to me that non-native users in their own right are the group that we need corpus data from. Corpora of English as a Lingua Franca (ELF) are already in existence: English as
Learners and users – Who do we want corpus data from
a Lingua Franca in Academic Settings (ELFA) in Helsinki, and Vienna-Oxford International Corpus of English (VOICE) in Vienna. These corpora are not confined to the language classroom, but constitute the new avenue that is now open to the research community: authentic language of second language speakers in multilingual environments where the language is naturally spoken as a lingua franca. If learner corpora liberated L2 learners from the laboratory, corpora of lingua franca and second language use liberate L2 speakers from the confines of the classroom.
2. How are learner and L2 user corpora different?

When people use English as a lingua franca, that is, a contact language between speakers who do not share a first language, they are L2 users but not learners. We can draw a line between second language acquisition (SLA) and second language use (SLU) and look at its consequences on corpus compilation and interpretation. To appreciate the contrast between learner and L2 user corpora, it is useful to sketch out basic dissimilarities between L2 learners and L2 users. At the same time, we should not lose sight of the fact that they also have very much in common. Although there is reason to believe that fundamental processes in language use must be essentially the same for a speaker’s languages whether they are an early bilingual’s languages, an initial monolingual’s first and later languages, or a plurilingual’s complex mixture of language resources, it is also reasonable to assume that there is bifurcation at stages closer to the ‘surface’ of the actual reception and production processes among languages differentially entrenched in speakers’ repertoires – such as their first and other languages. Thus, even though processes such as memory storage and retrieval are likely to be basically similar in terms of neural pathway formation, information chunking, or simultaneous processing and monitoring at different levels, things like ease and speed of retrieval, access to alternative expressions, and mapping linguistic and social repertoires effectively onto each other are probably rather different in a speaker’s different languages. The differences between first and second languages are in many respects shared by second language learners and users, and therefore any research findings from either group are of interest to those studying the other. However, what I want to do at this point is to throw light on matters where these two groups differ.
The differences are perhaps best illuminated by reference to certain social, cognitive, and interactive parameters. In social terms, an immediately obvious difference is that ELF speakers do not share a cultural background or a first language: by definition a lingua franca means a vehicular language between speakers who do not share a first language (L1). In most classrooms around the world, especially where English is learned as a foreign language, students share an L1. As many textbooks and other pedagogical materials testify, much pedagogical effort rests on the assumption that learners who share a mother tongue will have similar problems and are therefore offered similar remedies. Mixed
classrooms also exist, of course, and in those, English can appear as a lingua franca along with its role as the object of study, but these may be more typical of English-speaking countries than of the many non-English-speaking countries where same-language students are taught. In an environment of shared linguistic and cultural assumptions the social orientation to the new language is also shared, and along with it, cultural identities and expectations relative to target language speakers. For learners, the principal English-speaking countries constitute ‘target cultures’, which can be seen against their own cultural background for comparison, contrast, and for models of social appropriateness. In a lingua franca environment, the scene is quite different, because the vehicular language is chosen as a matter of convenience or necessity, and interlocutors may have very little idea of each other’s cultural backgrounds or familiarity with Anglo-American cultures. English-speaking cultures with their conventions may be far from appropriate in situations where communicative effectiveness hinges upon dealing with various cultures and cultural mixes as they come up in particular situations or tasks. Often the target is simply an international or global audience. But a definable, let alone national, target culture is an irrelevant concept.

A classroom is a social environment of a particular kind. It imposes particular social positions on learners that do not hold outside its own context. Why this matters is that the learner position overrules other social parameters in a classroom setting, whereas outside the classroom other social parameters relevant to positioning people override the learner status. A learner position is one from which educational and classroom targets are viewed as relevant, and they regulate the norms of interaction in every respect.
This is particularly obvious in giving and receiving feedback, providing and following models of behaviour, and practices of assessing performance. Outside the bounds of educational settings interactional parameters do not follow classroom rules – there are even pedagogical genres that are specific to educational settings only, such as particular question-answer sequence types, fill-in exercises, or ‘composition’. In principle it is possible to transcend such borderlines at times; they are negotiable to an extent, and we can make a learner position relevant in an ordinary encounter outside the classroom by, for example, asking our interlocutors about the correctness of our language. We can also invoke learners’ other identities such as professional or gender identities in class. But it seems these borderlines are not often transcended, and even in native/non-native situations where a non-native speaker is assuming a learner position, native speakers tend to orient to them as speakers, not correcting their language but orienting to the contents of what is being said (see e.g. Kurhila 2003).

It has often been pointed out that people can alternate in these roles – in the classroom they are learners, but as soon as they get outside it they may turn into users of the same language. Therefore, the argument continues, the identities are inseparable because the people are the same. However, we do not have single or simple identities, but assume them situationally as is relevant, foregrounding and backgrounding our different identities and their elements, and drawing on them as the need arises in response to a social environment. It is important to be sensitive to the situational demands on
social identity also as an analytical principle: when people enter an educational context as language learners, their position shifts from that of a user to whom a given language is the relevant means of communication. For example, I am writing this in Denmark and, despite my virtually nonexistent Danish, do not have any inclination to position myself as a learner of Danish outside my “Danish for beginners” class. To get by, I use Swedish, English, or rudimentary Danish according to how I judge the demands of the situation, but rarely if at all invoke a learner identity outside the classroom.

A “learner” identity can also be seen as reductive and limiting, as has been pointed out by Firth & Wagner (1997), who criticize the SLA research paradigm for doing just that: learners are seen as deficient communicators, and their output as a struggle with difficulties. The target set for them is an idealised native speaker, which is beyond reach for learners, given that ideal models just do not fit in with the contingencies of reality. Firth & Wagner called for a broadening of the basis of SLA studies to embrace the everyday use of a second language outside classroom settings, and the inclusion of L2–L2 communication. ELF research has done this, expanding the perspective for L2 research.

There is also the wider issue of the potential influence of these different kinds of L2 – learner or user – on the English language. Learner language cannot influence the target language by definition, because it orients to acquiring the native norm; learners get corrected for their errors, and because they are learners, they will accept the correction as far as they can, and the target language remains intact. ELF is used to achieve communication in international environments, and it does not have a ‘target language’ but is an ‘instrument language’.
The forms that ELF assumes may not be very influential in fleeting encounters between strangers at airports, but it constitutes the working language of many more permanent and important communities of practice in business, academia, research, and so on. In the absence of linguistic authority other than communicative efficiency in a community of practice, group norms evolve without the external intervention of the standard language norms that guide learners and teachers. Speakers mediating norms may even deliberately appeal to practices that are international but not in accordance with British or American norms (Hynninen 2011). Since ELF is a more widespread use of English than communities where English is used as a native language (ENL), and since many ELF using communities command high international prestige in for instance multinational companies, international politics and science, they hold the key to the future of English. Changes in English as a result of its increasing use as a global lingua franca are likely to arise from common but complex units such as phraseological sequences which consist of structural and lexical elements and which have variable as well as invariant parts. ELF speech shows many kinds of phraseological sequences. As pointed out in the previous paragraph, ELF speech has the potential to influence English by its new developments, most likely when the same features appear independently and repeatedly in different places. As the interactively co-constructed group norm evolves, there is no intervening external authority to correct ELF speakers, but whatever works well is likely to be strengthened and diffused. Phraseological sequences may be a point
where ELF begins to impact English more widely, because they seem to serve their communicative purposes quite well without being target-like in ENL terms.

An example of a phraseological frame with such potential is -ly speaking. It is a partly flexible unit, where the adverb ending can in principle be attached to any adjective. In practice this possibility is constrained by conventional preference, so that although the unit is productive, it is also restricted. I looked at the Michigan Corpus of Academic Spoken English (MICASE) (http://lw.lsa.umich.edu/eli/micase/index.htm), more precisely, its ENL parts, since it is the most closely comparable ENL corpus to ELFA. In MICASE, the -ly speaking frame appears fifteen times in the whole 1.8-million-word corpus (i.e. 8.3 per million words), and the overwhelmingly preferred expression is generally speaking with six occurrences. Strictly speaking appears twice, and all other cases just once (Table 1). With fifteen occurrences in MICASE, the expected number in the first part of ELFA (ELFA(i), the first 0.6 million words of the database, which was finished earlier than the rest) is five, assuming that the rate of occurrence is the same. However, the actual number of occurrences is 19 (Table 2), i.e. 31.7 per million words. In SLA terms, we might want to speak of ‘overuse’ of the frame in ELFA. For ELF, this is hardly a relevant characterisation. Clearly, the frame is salient and it is being utilised as a conveniently productive frame. Thus in ELFA the frame is proportionally almost four times as common as in MICASE, with more occurrences even in absolute terms. In spite of this, the most frequent item (historically speaking) occurred only three times, followed by four others, each with two instances (basically/formally/frankly/generally speaking).
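The normalisation behind these figures can be retraced in a few lines of Python. This is only an illustrative sketch: the function and variable names are assumptions introduced here, and the corpus sizes are the rounded figures quoted in the text (1.8 million words for MICASE, 0.6 million for ELFA(i)).

```python
# Raw counts and (rounded) corpus sizes as quoted in the text.
MICASE_WORDS = 1_800_000   # size of MICASE, in words
ELFA_I_WORDS = 600_000     # size of ELFA(i), in words
MICASE_HITS = 15           # occurrences of the -ly speaking frame in MICASE
ELFA_I_HITS = 19           # occurrences of the frame in ELFA(i)


def per_million(hits: int, corpus_words: int) -> float:
    """Normalise a raw count to a rate per million words."""
    return hits / corpus_words * 1_000_000


def expected_hits(ref_hits: int, ref_words: int, target_words: int) -> float:
    """Count expected in the target corpus if the reference rate held there."""
    return ref_hits * target_words / ref_words


micase_rate = per_million(MICASE_HITS, MICASE_WORDS)
elfa_rate = per_million(ELFA_I_HITS, ELFA_I_WORDS)
expected = expected_hits(MICASE_HITS, MICASE_WORDS, ELFA_I_WORDS)
ratio = elfa_rate / micase_rate

print(f"MICASE rate:  {micase_rate:.1f} per million words")
print(f"ELFA(i) rate: {elfa_rate:.1f} per million words")
print(f"Expected in ELFA(i) at the MICASE rate: {expected:.0f}")
print(f"ELFA(i) / MICASE ratio: {ratio:.1f}")
```

Normalising to a common base is what allows corpora of different sizes to be compared at all; the sketch says nothing about whether the difference is statistically reliable, which would need a separate test.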
Thus, in SLA terms we can detect both ‘overuse’ (of the frame) and ‘underuse’ (of the preferred item) in ELF, together with the total absence of the item that ranks second in ENL (strictly speaking).

Table 1. Partly flexible phraseological frame: -ly speaking in MICASE

Expression                  Abs. frequency
generally speaking           6
strictly speaking            2
morally speaking             1
objectively speaking         1
properly speaking            1
relatively speaking          1
roughly speaking             1
simply speaking              1
stylistically speaking       1
Total                       15

Table 2. Partly flexible phraseological frame: -ly speaking in ELFA(i)

Expression                  Abs. frequency
historically speaking        3
basically speaking           2
formally speaking            2
frankly speaking             2
generally speaking           2
comfortably speaking         1
honestly speaking            1
largely speaking             1
legally speaking             1
linguistically speaking      1
realistically speaking       1
relatively speaking          1
seriously speaking           1
Total                       19

What the more general implication is, however, is that such shifts in ELF preferences affect English usage. If we take ‘English usage’ to refer to all of the
English being used in the world, this must be the case. Even if native speakers should maintain their previous usage, they are in a minority, and the breaking down of conventionally preferred forms affects the relative frequencies of English if all its global use is taken into account. Frequency patterns in turn affect anyone using the language. The expression -ly speaking was partly variable to begin with, and it is possible that L2 speakers perceive such expressions as even more freely variable. Yet even fully fixed expressions can become subject to similar fracturing: in the first half of the ELFA corpus, the form as the matter of fact occurred more often than the ENL as a matter of fact. ELF generates its own patterning on the basis of standard varieties of English, and gradually makes inroads into its use. In cognitive terms, ELF speakers as L2 users do not orient to their linguistic environment as a setting for language learning, but focus their efforts on making sense and making themselves understood. Many ELF scholars have noted a strong orientation to content over form in ELF discourse (e.g. Karhukorpi 2006; Ehrenreich 2009), and so have researchers working on authentic L1–L2 interaction in real-life conversations (Kurhila 2003). In contrast, the cognitive orientation of learners is far more towards language form. This is a consequence of the pedagogical setting, where the immediate focus is necessarily on learning grammar, vocabulary, phonology, and phraseology in the new language. Other aspects of language, such as textual organisation, style, register, and pragmatics come in as well, but the principle remains the same. Feedback and evaluation are based on mastering elements of the language, and it is hard to imagine an educational setting where this would not be so. Long-term objectives of SLA curricula are typically defined in real-life communicative terms, but those objectives
remain outside the classroom context. They can be simulated in the classroom, but not performed there. Thus they cannot be assessed in class for their success or effectiveness in achieving their goals outside it. Communicative simulations may be pedagogically useful, as Widdowson often points out (e.g. Widdowson 2000), even realistic and meaningful, but not authentic in the basic sense of being real (see also Mauranen 2004). Success in SLA and SLU contexts thus depends on different criteria, and the situated cognitive orientations in SLA and SLU diverge in consequence.

The particular demands on ELF speakers are often exacerbated by the unpredictably varying language parameters they need to cope with: their interlocutors’ accents, transfer features, and proficiency levels. The reality for classroom learners may also be more varied in multilingual circumstances than in shared-L1 classes, but the variation becomes familiar in the classroom, as speakers get used to each other’s ways of speaking (see Smit 2010). Reduced predictability is therefore basically an SLU feature that affects cognition as well as interaction.

From an interactional perspective, users of ELF typically find themselves in situations where discourse norms are not clear or given. Terms of appropriate interaction must be negotiated by participants. In contrast, while learners may not master the discourse conventions of the target culture, they orient to modelling their behaviour on those, and take the lead from native speakers. Native speaker authority and superior expertise are axiomatic for learners of foreign languages, and therefore native preferences at all levels of language use are to be emulated for improved proficiency. In a lingua franca context, linguistic authority is not given, and participants are not seeking to learn the language from each other.
As mutual understanding is constructed, any linguistic solutions that serve the purpose may be adopted by common consent – the best solutions need not be the most standard-like or native-like (see Hülmbauer 2009). Things that are ruled out in SLA classrooms, like language mixing, can be effective strategies in ELF communication (see e.g. Klimpfinger 2009). In this way, there is far greater symmetry in the interaction of lingua franca users, in addition to the openness and negotiability of discourse conventions. As ELF speakers orient to mutual comprehensibility, they engage in interactive strategies in support of this, such as enhanced explicitness (Mauranen 2007). An important aspect of ELF communication is that speakers seem to be prepared for the possibility of misunderstanding (Mauranen 2006a), and take steps to pre-empt that, which in effect results in few misunderstandings (Mauranen 2006a; Kaur 2009). Clearly, learners attend to comprehensibility as well in certain types of cooperative communication tasks, and again this is not a categorical divide between learners and users, but rather a difference in emphasis: for users it is a constant determiner of behaviour, whereas for learners it is less vital, even if desirable, as the classroom safety net will prevent major disasters ensuing from communication breakdown. The above distinctions are reflected in compiling corpora of learner English and ELF. For example the ELFA corpus of English as a Lingua Franca in Academic Settings (www.eng.helsinki.fi/elfa/elfacorpus), which was the first ELF corpus (finished 2008),
differs radically from learner corpora at the outset because it has deliberately avoided collecting any data from learners of English. In other words, it has not recorded any data from classes where English would be the object of study. This choice was prompted by the general considerations just discussed, and also in view of more directly corpus-related issues, among which two major factors separate learner and ELF corpora. One is speaker proficiency and the other is participants’ mother tongue. Learner corpora are compiled following the principle familiar from SLA data of controlling for learner proficiency as well as possible. While this cannot in practice be too narrowly defined, as students even in the same classrooms or in comparable stages in their studies vary in their proficiency levels (see e.g. Granger et al. 2009), the criteria are set with awareness of proficiency levels in mind. Either all data is gathered from the same proficiency level or the data is organised in terms of developmental stages. In an ELF corpus this would be an untenable solution, because it is in the nature of lingua franca communication that speakers’ proficiencies vary, sometimes considerably. This is one of the unpredictable factors in a lingua franca environment. There is not only a straightforward variability of level, such as obtains between stronger and weaker students, but a huge diversity of previous learning environments and earlier experiences of English. The earlier language experiences that usually count in learner corpora relate to time spent in English-speaking countries, but in the case of ELF, earlier experience often comes from non-English speaking countries, which is increasingly typical among internationally mobile university students. Thus any idea of an even tolerably unidimensional scale of proficiency is alien to ELF and should not be attempted in an ELF corpus. 
The second corpus compilation principle where SLA and SLU part company is in terms of speakers’ first language. It makes sense for a learner corpus to keep first languages separated – either in different corpora or in separate sections or subcorpora. Moreover, each L1 background needs to be represented in a sufficient and similar way to enable comparisons. For practical as well as theoretical reasons, any research on learner corpora is interested in the effects of a particular L1 on the target L2, even though this is not the only kind of research such corpora lend themselves to (see e.g. Ädel 2008). This is clearly useful for teaching applications, and there is continued interest in such language-specific information, as testified by the popularity of guidebooks for teachers that contain typical learner errors from different L1 backgrounds along with their preferred target forms in ENL (e.g. Swan’s hugely popular reprinted and re-edited guidebook 2005). Textbooks of this kind would certainly benefit from systematic corpus-based studies of learner language now increasingly available (see e.g. Nesselhauf 2004, 2005), and it is to be hoped that the evidence that corpora can provide of the distribution of error and problem types according to learners’ first languages will find its way to teaching and learning materials on a scale comparable to ENL corpora. However, pedagogical applicability does not exhaust the potential of L1-based divisions in learner corpora; by keeping L1s clearly separate a learner corpus also lends itself to testing more theoretical predictions about the errors of learners with
a particular language background (e.g. Ringbom 1992, 2007; papers in Granger 1998). It is therefore important to capitalise on the deeper understanding of the deviations from target use that can be derived from learner corpora. It makes sense, then, to maintain dividing lines according to language background in learner corpora – and they need not be limited to first languages, but can incorporate bilingual backgrounds where English is a third language, and so on. Insofar as sufficiently large groups with similar language backgrounds can be found, the case for keeping them in separate subcorpora can be made. But this is not the path for ELF. If we wish to investigate a lingua franca, we need to focus on environments where it is typically used, and gather data from those situations. This may mean a proliferation of first languages in unpredictable mixes. So for example ELFA has speakers from 51 different first languages in a million words of speaking (as opposed to the 16 carefully controlled ones in the second version of ICLE), which appear in their authentic mixes, and in different quantities. It is hardly possible to even try to control for language backgrounds so as to gather equal amounts from all languages represented, but just as in learner corpora, it is important to keep track of the L1s. In ELF, the focus is on ensuring a good mix so as not to get excessive dominance from one language group. This might skew findings towards L1 transfer features from that group – and as learner corpora among other evidence (see e.g. Jarvis & Pavlenko 2008) tell us, this is a ubiquitous feature of learner language and therefore extremely likely to surface in SLU. Dominance of one L1 group or closely related L1s, say, Nordic languages, may also affect the propensity to code-switch, to rely on shared cultural knowledge, and other interactive strategies. 
ELFA has therefore made a special effort to keep the majority language of the matrix culture, Finnish, to a reasonable proportion, and at just over a quarter of the data it is a good achievement (see Mauranen et al. 2010).

What about native speakers and ELF? ENL speakers also find themselves every now and then in situations where English is a lingua franca. While this is true, it is not of central relevance to ELF use. The VOICE corpus (http://voice.univie.ac.at/) has drawn the line between ELF and native/non-native situations where a non-native majority begins: dyadic conversations between ENL and non-native speakers do not count as lingua franca use. While there is a certain arbitrariness in this, it is a workable solution to a dilemma that otherwise might linger on forever. ELFA has included ENL speech as part of polylogic conversations, and in all, ENL talk amounts to about 5% of the corpus.

In all, learner and ELF corpora have fundamental differences that require keeping them clearly separate. The main distinction boils down to language as an object of study vs. language as a means of achieving particular objectives in real environments. In spite of this, or perhaps because the two strands of non-native corpus studies differ in their very starting points, they can be of use to each other in important ways. While learner corpora can inform ELF study about what linguistic deviations from ENL are common in SLA and thus likely to have learning-based explanations if found in ELF, ELF can in return enlighten SLA research by showing how L2 actually works in
ordinary life outside educational contexts and what might be worth focusing on in teaching. Together, learner and ELF language research can get deeper into the nature of languages other than the first.
3. How are learner and L2 user corpora similar?

The most obvious affinity between learner and lingua franca corpora is that both collect data from speakers using a non-native language. Their social background and language identity are not those of a native speaker on account of their primary socialisation in other environments. Cultural conventions known widely among ENL speakers are largely unfamiliar to them. Such social and cultural similarities are perhaps the most obvious but at the same time somewhat trivial, because the differences between learners and users in sociocultural identity and position, discussed in the previous section, are so fundamental. Where cultural aspects most probably converge in their influence on ELF speakers as well as learners is around problems both groups experience with ENL speech. These are above all allusions to aspects of major ENL cultures that their members are likely to share but non-members far less likely to do so; especially certain culture-specific linguistic expressions, particularly what Seidlhofer (2002) has termed “unilateral idiomaticity” – that is, one speaker using idiomatic expressions the interlocutor might not know. These phenomena may be even more problematic to learners, whose relation to ‘target culture’ and ‘target language’ goes beyond the needs of effective communication on the spot.

Where the similarity of learners and L2 users is the most relevant is in the domain of cognitive processes. They all use a linguistic repertoire where items are inclined to be less deeply entrenched than in an L1 repertoire, and where different stored systems are likely to compete (see e.g. Riionheimo 2009). It is here that we can expect most fruitful cross-fertilization between the two kinds of corpora. If we try to understand how second languages differ from first languages linguistically, and the kinds of changes that languages undergo in the hands of L2 speakers, learner and user corpora are both important data sources.
If we look at some of the most typical examples of non-standard lexicogrammatical features from ELFA, the similarities are immediately clear. One commonly occurring non-standard feature is article use. Articles in ELF are often missing (it was absolutely in spirit of the time), superfluous (not in a principle) or just different from those expected in Standard English (I have written down here a word chronic liver diseases). Likewise, prepositions are very often used in non-standard ways (discuss about; obsession in; we’re dealing what is science; on this stage). Similar departures from Standard English article and preposition use have been found in learner English (e.g. Jarvis & Odlin 2000), with article use remaining shaky even in near-native speakers (Ringbom 1993).
In morphology, two tendencies can be distinguished: regularisation of irregular forms (teached) and overproductive or nonstandard morphology (interpretate; maximalise; introducted). The tendency to regularise is probably also what underpins shifts from uncountable to countable nouns (offsprings). Morphology tends to be overproductive in its possibilities in natural languages. While convention and acquired preference keep it in check in native language communities, its possibilities are liberally utilised by non-natives. This has also been observed in learners, and referred to as ‘overgeneralisation’ (e.g. Master 1997) or ‘elaborative simplification’ (Meisel 1980; see also Winford 2003). ELF speakers also resort to well-known word formation practices like ‘back-formation’, which leads to forms like interpretate. Syntactically lack of concord or agreement is usual (each sciences; the main ideas is), as are for example nonstandard word order in interrogative clauses, or embedded inversions (Ranta 2010, forthcoming) and the very high frequency of the -ing form of the verb, especially in its progressive use (Ranta 2006). Thus, while some of these processes can easily be seen as facets of simplification (like regularisation), they cannot all be comfortably thrown into that category (like morphological overproductivity or lack of concord), at least, unless we adopt simplification as an overarching cover term in Meisel’s (1980) way, and add modifiers like ‘elaborative’ to keep it alive. The similarity of the above examples of recurrent lexicogrammatical ELF features to what are regarded as typical learner errors is clear. It does not seem too far-fetched to suggest that such features result from speaking a second language, and the weak or unstable entrenchment of lexicogrammar in the L2, perhaps involving online choices from competing systems. Such cognitive aspects of dealing with an L2 should cover much of the common ground between learners and users. 
A similar explanation would seem plausible in the case of phraseological units, which are notorious in SLA for presenting difficulties for learners (e.g. Nattinger & DeCarrico 1992; see also Seidlhofer 2002; Mauranen 2006b). While the SLU angle departs from the ‘difficulty’ or ‘error’ conceptualization of such expressions, the linguistic phenomenon in itself is the same. Examples of approximations to ENL phraseology abound in ELFA (to put the end on it, take closer look to the world, on the end, the hen or the egg...). Many of the phraseological units have a different preposition or article from the ENL phrase, but also lexical and structural substitutions occur. These are not dissimilar to the findings that have been made in learner language studies (e.g. papers in Schmitt 2004), notably in the corpora compiled at the Centre for English Corpus Linguistics (CECL) in Louvain (e.g. papers in Meunier & Granger 2008). What is worth noticing about these units is not only the now generally recognised fact that second-language speakers tend to get them slightly wrong even at high levels of proficiency, but perhaps more importantly that L2 speakers use them with great frequency. The fact that people use them outside educational environments is important for efforts to understand their significance to L2 users. Educational settings may reward learners for their use (and penalise them for getting them slightly wrong), whereas in second language use such as lingua franca communication their use must be
explained by other means. In other words, we need use-based explanations along with learning-based ones. As pointed out by Wray (2002) among others, schematic units reduce a speaker’s processing load because they are relatively predictable and make processing faster. Therefore when we speak second languages, it would seem a good strategy to try to resort to those just as we do in our first languages. Such units may be less readily available in an L2, at least in their accurate form, than they are in a well-entrenched first language where at least monolinguals face no competition from other stored systems, but since they are useful building blocks, their approximate forms may work just as well for the purposes of facilitating communication. In this, learner language and ELF should be essentially similar.
Some distributional patterns reveal other tendencies that can also easily be related to SLA findings. One example is the distribution of ‘announced self-repairs’, studied by Marx & Swales (2005) in the MICASE corpus. By these they meant
phrases that a speaker might use when he or she wanted to tell the interlocutors that an attempt to fix a speech mistake, clarify an idea, or rephrase an ambiguous utterance was coming up. (Marx & Swales 2005)
On inspection, it turned out that these were very different in MICASE and ELFA: in sheer quantitative terms, there was strikingly more announced self-rephrasing in ELFA (Mauranen 2007). As to the actual expressions, the favourite ELF items were not the same as those in ENL. The overwhelmingly most common ELF way of announcing a self-rephrase was ‘I mean’, while the corresponding top preference fell on ‘in other words’ in the ENL material. ‘I mean’ was the second most common in MICASE, but nowhere near ‘in other words’, which was more than four times as frequent. None of the other expressions found by Marx & Swales were much repeated in the ELFA data, and some did not appear at all. This is in line with what is often found in learner data: learners use a small variety of expressions for a particular function, but the ones they use are extremely frequent (see e.g. Altenberg & Granger 2002). This is of course an economical strategy for any L2 user: ‘make good use of the items you know’. Lingua franca communication is akin to learner strategies in this respect. Resources have to go far, so speakers economise on their cognitive effort. One meaning for one form, or isomorphism, is what Winford (2003) suggests as a central principle in SLA. It would indeed seem like an economical strategy to hang on to one form (‘I mean’) for one sense, rather than learn several (‘in other words’, ‘that is to say’, etc.), and this goes for learners and L2 users alike. The explanation seems to fit this instance fairly well, but it is hardly likely to suffice as the overarching principle of L2 use. We already saw above cases that would not seem to be readily explained in this way: in many cases ELF introduced new variability into ENL preferences, and the frame ‘-ly speaking’ showed little preferential patterning. And even if ‘I mean’ was the clearly preferred expression, it was by no means the only one used for the ‘announced self-repair’ function.
The finding also presents a further question: why this expression?
Why ‘I mean’ and not ‘in other words’, which in comparable ENL circumstances is the most frequent? The preferred ELF expression is not the most bookish one either, as one might quite reasonably expect in an academic, text-dominated environment, but one which is typical of everyday speech. This would seem to point to spontaneous acquisition in social interaction rather than to classroom learning or a strong written language bias. Interestingly, there is also evidence suggesting that L2 learners tend to overuse expressions from informal spoken mode even in their writing, and following a primarily academic education (see e.g. Gilquin & Paquot 2008). There is much left to investigate here, and these questions certainly fall within the common ground shared by all those who take an interest in understanding L2, whether it occurs in a learning or use environment.
4. Conclusion
Learner language corpora such as ICLE and others in the by now impressive CECL collection are a great step forward in studying learner performance, because they have a wide international coverage and capture learner language in large quantities. The corpora complement experimentally and qualitatively oriented SLA studies in a very important way, reaching far beyond the small-scale studies typical in the field. They have been compiled in a number of countries in a reasonably comparable way and consist of learners’ extended products as part of their normal language studies. The resulting corpora are not entirely without their problems (cf. e.g. Ädel 2006), but some faults can always be found in large databases. They are definitely a major contribution to language learning research and applications to learning and teaching.
Learner corpora also provide highly useful material for comparison with lingua franca studies, because advanced learners can reasonably be expected to show many similar language features to speakers who are in natural out-of-classroom situations – and there is already evidence that they do. Some examples were shown in this paper of common features with plausible similar explanatory bases. Others again seemed to point to more use-based than learning-based explanations. Looking at ELF in real-life contexts gives us a view of second language in action. It is precisely the commonalities between the learner and the user perspectives that hold the most theoretical promise: what remains constant in different social contexts, and conversely, what is different according to social and interactional context? The most fruitful areas of common interest are clearly lexicogrammatical and phraseological aspects of language, which are of course at the heart of much corpus work.
To understand what second languages are about, and what they can tell us about human language in general, we need research into second language learning as well as second language use. Corpora in both domains give the best opportunities for seeing the big picture of the kind of language that we are dealing with.
In all, there are principled differences between learner and ELF corpora, and good reasons for keeping them separate. At the same time, they share certain features which make them yield results which are of great mutual interest.
References
Ädel, A. 2006. Metadiscourse in L1 and L2 English [Studies in Corpus Linguistics 24]. Amsterdam: John Benjamins.
Ädel, A. 2008. Involvement features in writing: Do time and interaction trump register awareness? In Linking up Contrastive and Learner Corpus Research, G. Gilquin, M.B. Díez Bedmar & S. Papp (eds), 35–53. Amsterdam: Rodopi.
Altenberg, B. & Granger, S. 2002. The grammatical and lexical patterning of make in native and non-native student writing. Applied Linguistics 22(2): 173–189.
Ehrenreich, S. 2009. English as a lingua franca in multinational corporations – Exploring business communities of practice. In English as a Lingua Franca: Studies and Findings, A. Mauranen & E. Ranta (eds), 126–151. Newcastle: Cambridge Scholars.
Firth, A. & Wagner, J. 1997. On discourse, communication, and (some) fundamental concepts in SLA research. Modern Language Journal 81(3): 285–300.
Gilquin, G., Granger, S. & Paquot, M. 2007. Learner corpora: The missing link in EAP pedagogy. Journal of English for Academic Purposes 6(4): 319–335.
Gilquin, G. & Paquot, M. 2008. Too chatty: Learner academic writing and register variation. English Text Construction 1(1): 41–61.
Granger, S. (ed.). 1998. Learner English on Computer. London: Addison Wesley Longman.
Granger, S., Dagneaux, E., Meunier, F. & Paquot, M. (eds). 2009. The International Corpus of Learner English. Handbook and CD-ROM, Version 2. Louvain-la-Neuve: Presses universitaires de Louvain.
Granger, S., Hung, J. & Petch-Tyson, S. (eds). 2002. Computer Learner Corpora, Second Language Acquisition and Foreign Language Teaching [Language Learning & Language Teaching 6]. Amsterdam: John Benjamins.
Hülmbauer, C. 2009. “We don’t take the right way. We just take the way that we think you will understand” – The shifting relationship between correctness and effectiveness in ELF. In English as a Lingua Franca: Studies and Findings, A. Mauranen & E. Ranta (eds), 323–347. Newcastle: Cambridge Scholars.
Hynninen, N. 2011. The practice of ‘mediation’ in English as a lingua franca interaction. Journal of Pragmatics. 965–977.
Jarvis, S. & Odlin, T. 2000. Morphological type, spatial reference, and language transfer. Studies in Second Language Acquisition 22: 535–566.
Jarvis, S. & Pavlenko, A. 2008. Crosslinguistic Influence in Language and Cognition. London: Routledge.
Johns, T. 1991. Should you be persuaded – Two examples of data-driven learning materials. English Language Research Journal 4: 1–16.
Johns, T. 1994. From printout to handout: Grammar and vocabulary teaching in the context of data-driven learning. In Perspectives on Pedagogical Grammar, T. Odlin (ed.), 293–313. Cambridge: CUP.
Karhukorpi, J. 2006. Negotiating Opinions in Lingua Franca E-mail Discussion Groups, Discourse Structure, Hedges and Repair in Online Communication. Licentiate thesis, University of Turku.
Kaur, J. 2009. Pre-empting problems of understanding in English as a Lingua Franca. In English as a Lingua Franca: Studies and Findings, A. Mauranen & E. Ranta (eds), 107–123. Newcastle: Cambridge Scholars.
Klimpfinger, T. 2009. “She’s mixing the two languages together” – Forms and functions of codeswitching in English as a Lingua Franca. In English as a Lingua Franca: Studies and Findings, A. Mauranen & E. Ranta (eds), 344–371. Newcastle: Cambridge Scholars.
Kurhila, S. 2003. Co-constructing Understanding in Second Language Conversation. Helsinki: University of Helsinki.
Macmillan English Dictionary for Advanced Learners, 2nd edn. 2007. Basingstoke: Macmillan.
Marx, S. & Swales, J.M. 2005. Announcements of self-repair: “all i’m trying to say is, you’re under an illusion”.
Master, P. 1997. The English article system: Acquisition, function, and pedagogy. System 25: 215–232.
Mauranen, A. 2004. Spoken corpus for an ordinary learner. In How to Use Corpora in Language Teaching [Studies in Corpus Linguistics 12], J.McH. Sinclair (ed.), 89–105. Amsterdam: John Benjamins.
Mauranen, A. 2006a. Signalling and preventing misunderstanding in English as lingua franca communication. International Journal of the Sociology of Language 177: 123–150.
Mauranen, A. 2006b. Speaking the discipline. In Academic Discourse Across Disciplines, K. Hyland & M. Bondi (eds), 271–294. Bern: Peter Lang.
Mauranen, A. 2007. Hybrid voices: English as the Lingua Franca of academics. In Language and Discipline Perspectives on Academic Discourse, K. Fløttum, T. Dahl & T. Kinn (eds), 244–259. Newcastle: Cambridge Scholars.
Mauranen, A., Hynninen, N. & Ranta, E. 2010. English as an academic lingua franca: The ELFA project. English for Specific Purposes 29(3): 183–190.
Meisel, J. 1980. Linguistic simplification. In Second Language Development: Trends and Issues, S. Felix (ed.), 13–40. Tübingen: Gunter Narr.
Meunier, F. & Granger, S. (eds). 2008. Phraseology in Foreign Language Learning and Teaching. Amsterdam: John Benjamins.
Nattinger, J.R. & DeCarrico, J. 1992. Lexical Phrases and Language Teaching. Oxford: OUP.
Nesselhauf, N. 2004. Learner corpora: Learner corpora and their potential for language teaching. In How to Use Corpora in Language Teaching [Studies in Corpus Linguistics 12], J.McH. Sinclair (ed.), 125–152. Amsterdam: John Benjamins.
Nesselhauf, N. 2005. Collocations in a Learner Corpus [Studies in Corpus Linguistics 14]. Amsterdam: John Benjamins.
Ranta, E. 2006. The ‘attractive’ progressive – why use the -ing form in English as a lingua franca? Nordic Journal of English Studies 5(2): 95–116.
Ranta, E. 2010. Models for English grammar at school? Paper given at the International ELF 3 Conference, 22–25 May 2010, University of Vienna, Austria.
Ranta, E. Forthcoming. Universals in a Universal Language? – Study into the Verb-Syntactic Features of English as a Lingua Franca. PhD dissertation, University of Tampere.
Riionheimo, H. 2009. Interference and attrition in inflectional morphology: A theoretical perspective. In Language Contact Meets English Dialects: Studies in Honour of Markku Filppula, E. Penttilä & H. Paulasto (eds), 83–106. Newcastle: Cambridge Scholars.
Ringbom, H. 1992. On L1 transfer, L2 comprehension and L2 production. Language Learning 42(1): 85–112.
Ringbom, H. 1993. Near-Native Proficiency in English. Turku: English Department Publications Åbo Akademi University.
Ringbom, H. 2007. Cross-Linguistic Similarity in Foreign Language Learning. Clevedon: Multilingual Matters.
Schmitt, N. (ed.). 2004. Formulaic Sequences: Acquisition, Processing and Use [Language Learning & Language Teaching 9]. Amsterdam: John Benjamins.
Seidlhofer, B. 2002. The shape of things to come? Some basic questions about English as a Lingua Franca. In Lingua Franca Communication, K. Knapp & C. Meierkord (eds), 269–302. Frankfurt: Peter Lang.
Sinclair, J. (ed.). 1987. Looking Up. Account of the Cobuild Project in Lexical Computing. London: Collins Cobuild.
Sinclair, J. 1991. Corpus, Concordance, Collocation. Oxford: OUP.
Smit, U. 2010. English as a Lingua Franca in Higher Education. Berlin: Mouton de Gruyter.
Swan, M. 2005. Practical English Usage, 3rd edn. Oxford: OUP.
Widdowson, H. 2000. On the limitations of linguistics applied. Applied Linguistics 21(1): 3–25.
Winford, D. 2003. An Introduction to Contact Linguistics. Oxford: Blackwell.
Wray, A. 2002. Formulaic Language and the Lexicon. Cambridge: CUP.
References to corpora
ELFA:
ICLE:
MICASE:
VOICE. 2009. The Vienna-Oxford International Corpus of English (Version 1.0 online):
Learner knowledge of phrasal verbs
A corpus-informed study
Norbert Schmitt and Stephen Redwood
This study analyses whether a group of learners’ productive and receptive knowledge of some of the most common phrasal verbs (PVs) is related to the frequency of those PVs. Secondly, we look at factors which may have affected the learners’ PV knowledge. The learners completed two tests (productive, receptive) and were also required to complete a biodata questionnaire containing questions about age, gender and nationality, and items relating to the language instruction they received and the incidental exposure they had to English. The analysis of the data shows that there is a relationship between learner knowledge and PV frequency, and that extensive reading and watching English language films and TV programmes appear to have a positive effect on the acquisition of PVs.
1. Introduction
Phrasal verbs are one of the most productive areas of the English language (Konishi 1958, Bolinger 1971), consisting of many thousands of items (Gardner & Davies 2007), and with new ones regularly coming into use (e.g. chill out, freak out, log off/on, max out, scroll up/down, sex up, space out). They are a key feature of both spoken and written language, with Gardner & Davies (ibid.: 347) estimating that phrasal verbs occur, on average, every 192 words, that is almost 2 phrasal verbs per page of written text. Language coursebooks are now belatedly giving much more attention to these items, and a growing number of dictionaries and other publications devoted exclusively to phrasal verbs have been published in recent years, for example: Longman Dictionary of Phrasal Verbs (Courtney 1983), The Ultimate Phrasal Verb Book (Hart 2009), English Phrasal Verbs in Use: Advanced (McCarthy & O’Dell 2007), Dictionary of English Phrasal Verbs and their Idioms (McArthur & Atkins 1990), and Collins COBUILD Dictionary of Phrasal Verbs (Sinclair 2002).
Despite their frequency in spoken and written language, phrasal verbs are often perceived as ‘difficult’ by both English as a Foreign/Second Language (EFL/ESL) teachers and their learners. There appear to be a number of reasons for this. Much of the language that we use is both idiomatic and formulaic and cannot be interpreted simply by looking at the individual
words (Moon 1997). Phrasal verbs as multi-word units are no exception and many are opaque, making them difficult to decipher and understand. They often consist of a high frequency, monosyllabic, delexicalised verb (e.g. get, give, go, make, take) and one of a fixed number of particles (e.g. down, in, off, on, out, over, up), and the problem for learners is that these frequent and apparently simple components may come together to form units which are specialised, emotive, and idiomatic (e.g. the situation is really getting her down; I can’t make out what this says; don’t give up now; it was too much to take in). The opaque and idiomatic nature of some phrasal verbs presents obvious difficulties for learners and these problems are compounded when we take into account the significant number of phrasal verbs that are also polysemous. Sometimes there is a degree of transparency, and a semantic link may be made between the different senses (cf. fill in a hole, fill in a form, fill in somebody on something, fill in for somebody). However, in other instances the connection is more tenuous (cf. put up a fence, put up a fight, put somebody up for the night), and the meanings more difficult to interpret. In addition to the semantic complexity of phrasal verbs, particle movement can also present difficulties for learners. We may think of phrasal verbs as holistic multiword units, but with most transitive and a number of intransitive phrasal verbs, particles may be separated from their verbs by pronouns, adverbs or noun phrases (e.g. she put her new fur coat on; he picked her up from the station; I’ll come straight over to see you; we tried to calm the old woman down). Learners not only have to decide whether a phrasal verb is separable (cf. I stayed up late last night; *I stayed late up last night) but also what it can be separated by (adverb, pronoun, short noun phrase, long noun phrase). 
For example, it is acceptable to say he gave all of his vast fortune away, but not *the rebels are putting a huge amount of resistance up. This decision is not always based on transitivity or other grammatical considerations, but often depends on stylistic and syntactic conventions, context, prosody and intended meaning (see Bolinger 1971).
2. The acquisition of phrasal verbs
The widespread use of phrasal verbs means that learners need to know them, but their semantic, syntactic, and pragmatic complexities lead to learning difficulties. So how can researchers and teachers help learners master this important linguistic feature? One way is to better understand the factors that lead to the learning of phrasal verbs. SLA research has identified a wide range of factors that influence language learning in general (see Dörnyei 2009; and Ellis 2008 for overviews), but recent research and theorizing have highlighted exposure to the target language as the driving force of language learning (e.g. Ellis 2003; Tomasello 2003). This exposure can come from the naturalistic environment, or from classroom input. In both cases, frequency is an
essential factor, because all things being equal, the more frequent an item is, the more a learner will be exposed to it. This is certainly true for vocabulary learning, where frequency is widely accepted as one of the best predictors of whether individual words will be known or not (Nation 2001; Schmitt 2008, 2010). However, phrasal verbs have some important differences from individual words as we have seen above, and it is not obvious whether frequency is such a clear predictor of their learning as it is for individual words. If not, teachers and materials writers will have to look for other characteristics to guide the sequencing of the phrasal verbs they wish to teach. However, if frequency in the learning environment does prove to be predictive, then practitioners could tentatively assume that learners know the highest frequency phrasal verbs from exposure, and would need to focus on teaching the somewhat less frequent ones. So in a naturalistic environment, frequency is important, because the more frequently items occur, the better they are generally learned. This has been shown in a number of studies into incidental vocabulary learning from reading, perhaps the most important source of outside input (e.g. Horst 2005; Rott 1999). But there are a number of other ways that learners can gain exposure to the target language, including film, television, radio, music, and social networking sites. We do not know yet how much effect these kinds of exposure have on language acquisition, but some believe that they can help significantly in the learning process (e.g. Pemberton & Fallahkair 2008; Sjöholm 2004). For explicit instruction, frequency is essential for selecting the phrasal verbs that will be the most beneficial for learners. There are many thousands of phrasal verbs (e.g. Gardner & Davies 2007), but as with other vocabulary items, some occur more frequently in language than others. 
Lexical items that are in common use are more often than not those which are the most useful, and as such their acquisition should be a priority for both teachers and learners (Leech, this volume; Nation 2001; Nation & Waring 1997). Unfortunately, some of the ‘most frequent phrasal verb’ lists in textbooks and dictionaries appear to be based more on intuition and tradition than on solid corpus data. As a result of this somewhat arbitrary selection process, students may be learning low frequency phrasal verbs which are rarely used in the real world, and worse, not acquiring those which are most frequent and useful (Darwin & Gray 1999: 67). Good frequency information would indicate which phrasal verbs are the most common, and therefore the ones to prioritise. So frequency is an important factor in learning from both naturalistic environment and formal instruction contexts. For determining the frequency of occurrence of lexical items in both, corpus analysis is the essential tool. Before the advent of corpora, intuition was the only guide to assessing lexical frequency, and while it may be a useful tool, it is not always a reliable guide (Hunston 2002: 20–21; Schmitt 2008: 333). But with the development of large and accessible corpora (multi-million word corpora are now common), it has become possible to determine the most common words and phrases, and their most frequent uses (Biber et al. 1999; Gardner & Davies 2007;
Miller 2005). This is particularly true with formulaic language, which is probably the linguistic category that phrasal verbs can best be conceptualized as belonging to. For example, Sylviane Granger and her research unit (Centre for English Corpus Linguistics) at the Université catholique de Louvain have shown how corpus evidence can illustrate L2 learners’ acquisition and use of various kinds of formulaic language (see De Cock 2000; De Cock et al. 1998; Granger 1998; Granger & Meunier 2008; Learner Corpus Bibliography 2010; Meunier & Granger 2008).
From the above discussion we see that there are good reasons to expect that frequency should be a strong factor in the learning of phrasal verbs. However, there has been little direct research on the relationship between the two. This study will focus on comparing the frequency of phrasal verbs (as determined by corpus evidence) with the degree to which L2 learners know them (receptively and productively), which leads to the first research question:
1. How well do learners know, productively and receptively, some of the most frequently occurring phrasal verbs in the English language?
There are also a number of other factors which may affect how successfully a learner masters common phrasal verbs, and we will also explore a limited number of these:
2. Does overall language proficiency have a significant effect on phrasal verb knowledge?
3. Do gender and age have a significant effect on phrasal verb knowledge?
4. Do the amount and mode of language instruction have a significant effect on phrasal verb knowledge?
5. Does incidental learning through exposure to the target language outside the classroom have a significant effect on phrasal verb knowledge?
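The first research question turns on how learner knowledge relates to corpus frequency. One common way to quantify a relationship of this kind is a rank correlation, and the sketch below computes Spearman's rho in plain Python. This is an illustration under stated assumptions only: the passage above does not say which statistic the study uses, the function names are hypothetical, and any knowledge scores fed to it here are invented placeholders rather than study data.

```python
def average_ranks(values):
    """Return 1-based ranks for values, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the block of items tied with the item at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (no zero-variance check)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))
```

Called with a list of corpus frequencies and a parallel list of per-item knowledge scores (for instance, the proportion of learners answering each item correctly, a hypothetical measure), `spearman` returns a value between -1 and 1; values near 1 would indicate that more frequent phrasal verbs tend to be better known.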
3. Methodology
3.1 Participants
Our participants consisted of 68 EFL/ESL students from three private language schools in the Nottingham and Eastbourne areas; 23 students at intermediate level and 45 at upper intermediate level. Their levels had been assessed initially by their schools’ placement tests and confirmed, after a number of lessons and by further progress checks, by their EFL/ESL teachers. The participants were made up of 47 females and 21 males, ranging in age from 14 to 55, from 14 countries, with 10 mother tongues, the largest group being the Italians (32). Table 1 shows a breakdown of the participants’ nationalities, genders, ages and language levels.
Table 1. The Participants

Nationality    N    M    F    Age      Intermediate   Upper Intermediate
Italian        32   5    27   14–21    18             14
Colombian      9    6    3    18–26    0              9
Spanish        6    1    5    33–46    0              6
Polish         5    0    5    23–29    5              0
Saudi          5    4    1    20–33    0              5
German         2    0    2    19–55    0              2
Libyan         2    0    2    17–30    0              2
Chilean        1    0    1    31       0              1
Chinese        1    1    0    21       0              1
Kazak          1    0    1    28       0              1
Portuguese     1    1    0    22       0              1
Taiwanese      1    0    1    15       0              1
Turkish        1    0    1    38       0              1
Vietnamese     1    1    0    25       0              1
Totals         68   19   49   –        23             45
3.2 Target phrasal verbs
The study included 60 phrasal verbs. The majority (50) were taken from Gardner & Davies’ (2007) list of the 100 most frequently occurring phrasal verbs in the British National Corpus (BNC 2007). Because phrasal verbs are considered difficult to acquire, we concentrated on high frequency examples because we wished to see how well our learners knew the type of phrasal verb they would presumably have had the most exposure to. However, we also wished to have a range of frequency on the list, so we included ten less frequent phrasal verbs, which were selected from student coursebooks and grammar reference books. In addition to investigating the relationship between overall phrasal verb frequency and learner knowledge, we also wanted to find out whether there were differences in knowledge levels between those phrasal verbs found more often in written language and those more frequent in spoken language. To do this, we consulted the BNC (2007) for phrasal verb frequencies. We chose the BNC because it is one of the largest corpora (100 million words) publicly available, and because it represents a cross-section of written and spoken language from a wide range of late twentieth century sources (BNC Homepage). Another key advantage is that the complete corpus can be bought and downloaded, which made our phrasal verb analysis possible. Gardner & Davies used the following definition as the basis for the identification and tagging of phrasal verbs: “all two-part verbs in the BNC consisting of a lexical verb ... proper ... followed by an adverbial particle ... that is either contiguous ... to that verb or non-contiguous ...”
(ibid.: 341). Our definition of a phrasal verb was rather broader in that we included verbs followed by prepositional as well as adverbial particles (see Biber et al. 1999: 403; Bolinger 1971: 6; Collins Cobuild English Grammar 2005; McArthur & Atkins 1990). First we looked up the phrasal verbs’ overall frequency in the complete BNC. We then repeated the process using only the spoken section (10 million words). Finally, by subtracting the spoken frequency results from those for the complete BNC we arrived at figures for the written section (90 million words). Each phrasal verb and its inflections (come off, came off, coming off) was tagged for contiguous (verb + particle) and noncontiguous (verb + word(s) + particle) occurrences, up to a limit of 4 words separating verb and particle. We found that most phrasal verbs were either contiguous, or separated by a single word. Very few phrasal verbs were separated by 4 words (228 occurrences in the whole of the BNC) and there were many lexical strings that were not phrasal verbs (the Carry On films; the meeting was held on the 28th of January; pay me when you get back). The occurrences of phrasal verbs separated by 5 or more words were so infrequent that we did not consider these in the calculations. When we compared our findings for the complete BNC with those of Gardner & Davies (see Appendix A for comparison) we found that the majority of our frequency figures were higher, on occasions significantly so (e.g. get in, go in, put on), with a small number lower (e.g. carry on, carry out). These differences may be partly due to our use of a broader phrasal verb definition, or the tagging methods used; but we can only speculate as we do not know exactly how Gardner and Davies’ figures were calculated. The results showing the frequency figures for the 60 target phrasal verbs are shown in Table 2.
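The counting procedure described above can be illustrated with a minimal Python sketch. It is not the authors' actual tool: it assumes a plain list of word tokens rather than the tagged BNC, and the function names (`tokenize`, `count_phrasal_verb`, `written_frequency`) and the regular-expression tokeniser are hypothetical. Because it matches surface strings only and ignores sentence boundaries, it would also pick up false hits of the kind mentioned above (e.g. the Carry On films), which a real analysis would need to exclude.

```python
import re
from collections import Counter

MAX_GAP = 4  # the study counted up to 4 words between verb and particle


def tokenize(text):
    """Very rough tokeniser (an assumption): runs of letters and apostrophes."""
    return re.findall(r"[A-Za-z']+", text)


def count_phrasal_verb(tokens, verb_forms, particle, max_gap=MAX_GAP):
    """Count verb + particle occurrences, keyed by the number of intervening words.

    Key 0 is a contiguous occurrence (verb + particle); keys 1..max_gap are
    non-contiguous occurrences (verb + word(s) + particle).
    """
    forms = {form.lower() for form in verb_forms}
    particle = particle.lower()
    counts = Counter()
    for i, token in enumerate(tokens):
        if token.lower() not in forms:
            continue
        # Take the nearest particle within the allowed window, if there is one.
        for gap in range(max_gap + 1):
            j = i + 1 + gap
            if j >= len(tokens):
                break
            if tokens[j].lower() == particle:
                counts[gap] += 1
                break
    return counts


def written_frequency(total_frequency, spoken_frequency):
    """Written-section frequency obtained by subtraction, as in the study."""
    return total_frequency - spoken_frequency
```

For instance, `count_phrasal_verb(tokens, ["come", "comes", "came", "coming"], "off")` groups the inflections of come off under one count, and `written_frequency(16228, 3637)` reproduces the written figure of 12591 reported for go on in Table 2.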
3.3
Receptive and productive measurement instruments
One of the goals of the study was to establish learners’ knowledge about the target phrasal verbs, and it seemed important to assess both receptive and productive mastery (Schmitt 2010). The productive test used a cloze technique in which the participants had to produce the target vocabulary themselves, requiring a higher level of mastery than would a receptive word recognition test (Groot 2000: 76). Cloze tests are used extensively as a testing procedure, and are seen, especially in the area of vocabulary, as a good measure of lexical knowledge (Read 1997). An example for set up is given below: The police s__________ u__________ roadblocks to stop people driving into the city centre. (build, erect) To be consistent with the aim of testing both productive and receptive knowledge of the same target language we used similar items in both tests. The differences between the two tests being, the first letter prompts were omitted from the receptive test, multiple-choice options were added to the receptive test, and the items in each test were in different orders. To help reduce guessing, a fifth ‘Don’t know’ option was included in the receptive test. The productive test, being the one in which the participants had to
Learner knowledge of phrasal verbs
Table 2. BNC Target Phrasal Verb Frequency

Rank  Phrasal verb   BNC     Written  Spoken
1     go on          16228   12591    3637
2     pick up        10884   10147    737
3     come in        9777    7700     2077
4     take up        9450    6548     2902
5     go out         7765    5008     2757
6     hold on        6977    4444     2533
7     put on         6760    5484     1276
8     find out       6329    5605     724
9     work out       5257    4732     525
10    make up        5231    3369     1862
11    come out       5190    3922     1268
12    sit down       5022    4610     412
13    take on        4717    3886     831
14    carry on       4695    1994     2701
15    set up         4199    3981     218
16    go in          3892    3449     443
17    get up         3637    2338     1299
18    get on         3441    1949     1492
19    carry out      3406    2283     1123
20    come down      3083    2301     782
21    get out        3010    2367     643
22    get in         2587    2466     121
23    bring in       2565    1928     637
24    put up         2386    2118     268
25    go over        2152    1810     342
26    go off         1728    1326     402
27    break down     1469    1031     438
28    take back      1430    1272     158
29    move on        1415    809      606
30    put out        1251    753      498
31    get in         4671    3221     1450
32    hold on        1797    1493     304
33    go over        1732    1173     559
34    move in        1377    1134     243
35    turn down      1355    1182     173
36    look around    1350    1268     82
37    come over      1262    916      346
38    come off       1191    803      388
39    sit up         1181    1040     141
40    put off        1075    851      224
41    make out       1067    937      130
42    turn off       1057    650      407
43    pick out       905     732      173
44    hold back      862     811      51
45    take down      849     521      328
46    give back      654     404      250
47    move up        616     537      79
48    move back      603     527      76
49    move out       594     468      126
50    give out       550     404      146
51    dig up         383     318      65
52    pay back       295     230      65
53    pin down       249     229      20
54    tear up        224     202      22
55    think over     219     206      13
56    fall behind    156     145      11
57    pass away      104     89       15
58    chat up        100     87       13
59    take after     84      58       26
60    cool off       65      59       6

BNC = token frequency BNC complete
Written = token frequency BNC written
Spoken = token frequency BNC spoken
Norbert Schmitt and Stephen Redwood
recall and produce the target language, was to be administered first. If the productive test was given second there would be the possibility of testees remembering some of the multiple-choice answers from the receptive test. Obviously, learners who knew the answer to an item productively would also be likely to know it receptively as receptive knowledge usually precedes productive knowledge (Melka 1997; Schmitt 2010). The receptive version of the above item is illustrated below. See Appendices B & C for the complete productive and receptive tests.

The police __________ __________ roadblocks to stop people driving into the city centre. (build, erect)
A. set in    B. set up    C. set on    D. set at    E. ?

3.4 Biodata questionnaire
In addition to establishing the relationship between frequency and learner knowledge, we were also interested in gathering information about some of the other factors which may have had an effect on phrasal verb acquisition. We already had a rough idea of our participants’ language proficiency from the school’s in-house assessments, so we produced a 10-item questionnaire which contained items on basic biodata information (age, gender, nationality), and items relating to language exposure through classroom instruction, and exposure through incidental learning, that is, extensive reading, the media and entertainment, and social networking (see Appendix D for the complete questionnaire).
3.5 Procedure
Having written the tests it was essential to thoroughly pilot them to test their validity and reliability (Dörnyei 2007), and importantly, to confirm that they could be completed in the time available (Schmitt 2010). We first asked ten educated native speakers to complete the tests and comment on any difficulties they had with any of the test items. Subsequent feedback showed that most of the native speakers took 15 to 25 minutes to finish the productive test and 10 to 15 minutes for the receptive test. They reported no serious problems with the items, but we listened carefully to the comments they made and as a consequence rewrote several of the items to make them clearer, and modified the definitions/synonyms for others to improve their performance. We repeated the exercise with 8 other native speakers and made further modifications. We were satisfied that the instruments worked well with native speakers but we also required confirmation from non-native speakers. We asked a number of upper-intermediate and advanced level learners to complete the tests, and in response to the feedback received a number of minor alterations were made. The tests were then given to the participants in a single session in their intact classes with a short break between the tests. Instructions explaining the purpose of the
test, its format, and what the participants had to do were printed at the beginning of each test, together with example items. To make sure the participants knew exactly what to do, we led them through all the instructions, paying particular attention to the example items and the amount of time they had to complete the test. In addition, we explained that they could answer each question with the base form of the phrasal verb, but that any of its inflections would be accepted as correct (work out, worked out, working out). The productive test was given first, then the receptive version after a 10-minute break, and finally the biodata questionnaire. All the participants had sufficient time to complete both tests and fill in the biodata questionnaire.
4. Results and discussion

4.1 Phrasal verb frequency and knowledge
The main aim of the study was to explore the link between the frequency of phrasal verbs and how well they are learned by L2 learners. In other words, do learners tend to know more about the most frequently occurring phrasal verbs than the less frequent ones? In addition we wanted to discover whether there was a link between mode (spoken vs. written) and phrasal verb knowledge. That is, do learners know more about those phrasal verbs more frequently found in written language, those used more often in spoken language, or do they have a broad knowledge extending across the two modes? To explore this link, we first carried out correlations comparing the results of the productive and receptive tests with our phrasal verb frequency rankings from the BNC complete, BNC written, and BNC spoken. The Pearson coefficients indicated a significant positive correlation between mean test scores and phrasal verb frequencies, as shown in Table 3. The correlation coefficients were moderate for the productive test, and relatively low for the receptive test. To achieve a better understanding of the strengths of the correlations, the correlation coefficients were squared (r²), giving figures that represent the percentage of the variance in the test scores that can be related to frequency. These figures are shown in parentheses, and they indicate that for the BNC complete, 20% of the variance in the productive scores was attributable to frequency, but for the receptive scores only 9% of the variance was related to frequency. Thus we find that the learning of phrasal verbs is related to their frequency of occurrence, just as it is with individual words. However, the relationship is not particularly strong, and varies according to productive and receptive knowledge. 
As for the difference between phrasal verbs in written and spoken discourse, there was virtually no difference in terms of receptive knowledge, and only a small difference in terms of productive knowledge. These results suggest it is probably sufficient to use overall corpus frequency figures (i.e. combined written and spoken) when thinking about the likely acquisition of phrasal verbs, as there seems to be no real advantage to distinguishing between spoken and written frequencies.
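The step from a correlation coefficient to the percentage of variance it accounts for is simple arithmetic, and a short sketch may make it concrete. It squares the coefficients reported in Table 3 and expresses them as percentages; the function name `shared_variance_pct` is illustrative, and half-up rounding to one decimal place is an assumption about how the published figures were rounded.

```python
from decimal import Decimal, ROUND_HALF_UP


def shared_variance_pct(r):
    """Return r squared as a percentage, rounded half-up to one decimal.

    `r` is passed as a string so that values such as ".45" are held exactly.
    """
    pct = Decimal(r) ** 2 * 100
    return pct.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)


# Pearson coefficients as reported in Table 3: (productive, receptive).
table3 = {
    "BNC complete": (".45", ".30"),
    "BNC written": (".41", ".29"),
    "BNC spoken": (".46", ".31"),
}

if __name__ == "__main__":
    for corpus, (productive, receptive) in table3.items():
        print(
            corpus,
            shared_variance_pct(productive),
            shared_variance_pct(receptive),
        )
```

Because the coefficient is squared, a correlation has to be fairly large before it accounts for much variance: r = .30 corresponds to only 9%.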
Table 3. Correlations between Tests Scores and Phrasal Verb Frequencies (BNC)

Phrasal Verb frequencies   Productive test   Receptive test
BNC complete               .45** (20.3)ᵃ     .30* (9.0)
BNC written                .41** (16.8)      .29* (8.4)
BNC spoken                 .46** (21.2)      .31* (9.6)

*p < .05, **p < .01
a. r² reported in percentage
To better understand the frequency-knowledge relationship, it is useful to look at the data in graphic form. Figure 1 shows the frequencies (BNC complete) of the 60 phrasal verbs used in the study. They range from the most frequent, go on (16,228 tokens) to the least frequent, cool off (65 tokens). If the participants’ phrasal verb knowledge was related closely to phrasal verb frequency, then test results should have shown a similar curve to that in Figure 1. Figures 2 (BNC complete), 3 (BNC written) and 4 (BNC spoken) are graphic representations of the relationship between phrasal verb knowledge and phrasal verb frequency according to the three corpora. The phrasal verbs are arranged in frequency groups of 5 to reduce the effect of individual item variation.

[Figure 1. Target Phrasal Verb (PV) Frequency BNC Complete. x-axis: PV frequency ranking; y-axis: PV token frequency (thousands).]

[Figure 2. Test Scores BNC Complete. x-axis: PVs by frequency (grouped in 5s); y-axis: Mean test scores %; series: productive, receptive.]

[Figure 3. Test Scores BNC Written. x-axis: PVs by frequency (grouped in 5s); y-axis: Mean test scores %; series: productive, receptive.]

[Figure 4. Test Scores BNC Spoken. x-axis: PVs by frequency (grouped in 5s); y-axis: Mean test scores %; series: productive, receptive.]

Several points are immediately evident when viewing these curves. First, none of the curves match that in Figure 1 very closely, so learning does not seem to be highly dependent on the absolute frequency of a phrasal verb. Second, there is a considerable amount of variation in knowledge of the phrasal verbs (the curves bounce up and down), even though the phrasal verbs have been clustered in groups of five to even out this variation. Thus we find that learning does not smoothly follow rank order frequency either. Third, despite the previous two observations, there is clearly some overall relationship between frequency and knowledge. This is most obvious with the productive tests, where there is a clear decline in knowledge as frequency decreases, with the exception of a blip at the 46–50 frequency ranking (see below). The receptive trend is harder to discern, with a fairly clear decline in the first ten or so phrasal verbs, but thereafter a great deal of variation in what is essentially a plateau. Fourth, as might be expected, the receptive scores were usually higher than the productive scores, as recalling language in order to use it productively is more difficult and requires a greater depth of knowledge than being able to recognize it receptively (e.g. Groot 2000; Nation 2001). Overall, the learners scored 17% higher on the receptive test than the productive test on average and this difference was significant (Pearson, t = 12.01, p<.001, Eta squared = .69). Overall, the evidence points to a general trend of higher frequency leading to a greater chance of learning phrasal verbs to a productive degree of mastery. The relationship is not strongly linear, but higher frequency phrasal verbs were clearly learned by a greater number of our participants than lower frequency phrasal verbs. Conversely, with the exception of the very highest frequency phrasal verbs, there does not seem to be a very reliable relationship between the frequency of phrasal verbs and
mastery of receptive knowledge. Thus, it seems that in order to develop the more advanced productive mastery of phrasal verbs, the repeated exposure that comes from higher frequency is necessary. Receptive mastery, which is presumably easier to acquire, does not seem so dependent on this exposure. We might speculate that this is because only a few exposures might lead to receptive mastery. This would be congruent with findings from incidental vocabulary acquisition studies, where it has been found to take around 8–10 exposures to learn words from reading, but where productive mastery is seldom achieved (Schmitt 2008). Furthermore, the relationship between frequency and learning may be stronger than demonstrated here. We used mainly the highest frequency phrasal verbs (50/60) in this study to see if our participants knew these high-exposure items. If we had used a group of phrasal verbs with a wider range of frequencies, we may well have found a clearer frequency-knowledge trend. Another point to keep in mind is that the frequency information was from occurrence in general, as indicated by the sources included in the BNC. We assume that higher frequencies in the BNC also indicate higher levels of exposure among our participants. However, this assumption may be unfounded to some extent. Learners may well receive quite different exposure to the L2, especially in a classroom, than is indicated by a native corpus. If we were able to use their actual exposure rates as our frequency figures, the correlation would very probably be higher. This leads to the question of whether other corpora may better predict the learning of phrasal verbs. One suitable candidate is the Corpus of Contemporary American English (COCA). It has in excess of 400 million words of text and is divided equally among spoken, fiction, popular magazine, newspaper, and academic texts (Davies 2008). 
We did not have the resources to carry out a full lemmatized and noncontiguous analysis as we did with the BNC, but were able to do a simplified analysis based only on the contiguous base forms of the target phrasal verbs as shown in Table 2. The correlations for frequency and productive mastery, using the data from our test results, were very similar to the BNC results, but the correlations for frequency and receptive mastery were marginally higher than the BNC figures (Table 4).

Table 4. Correlations between Tests Scores and Phrasal Verb Frequencies (COCA)

Phrasal Verb frequencies   Productive test   Receptive test
COCA complete              .42** (17.6)ᵃ     .36** (13.0)
COCA written               .40** (16.0)      .34** (11.6)
COCA spoken                .42** (17.6)      .38** (14.4)

**p<.01
a. r² reported in percentage

The similar results from the two main large-scale, accessible corpora give us confidence in concluding that the frequency of phrasal verbs as shown by large corpora predicts phrasal verb
acquisition (productive mastery) somewhere around the 16–20% covariance level and receptive mastery at around the 8–14% level. The data also give us a chance to look at how well our participants knew the phrasal verbs in real terms. The majority of the participants were able to recognize most of the phrasal verbs receptively (average score 65.2%), and were able to produce 48.2% of them on average. Thus we find that despite being a relatively difficult type of lexical item, the participants had a good knowledge of the target phrasal verbs relative to their language levels. However, there were a number (18) of phrasal verbs that fewer than half of the learners knew either receptively or productively. A number of these were relatively infrequent (cool off, dig up, pin down), which we expected to have low scores, but a number were some of the most frequent in the BNC (e.g. carry out, go in, take up, work out). The low scores may in part be attributable to learners being unfamiliar with some of the contexts and meaning senses presented in the tests, or the wording of some of the test items themselves, but even if we make some allowance for these anomalies, there would still remain a number of moderately high to high frequency phrasal verbs that were relatively unknown to at least half of the students. There do not appear to be any particular semantic or syntactic features that distinguish these phrasal verbs from others in the tests. In fact, some were relatively transparent (come off, give out, go in, hold back), and we can speculate that some students’ lack of receptive or productive knowledge of these items was due to the absence or paucity of exposure to these phrasal verbs, even though they occurred relatively frequently in the corpora. 
One factor that may partly account for this lack of exposure is the fact that often a learner’s primary source of exposure to English is in the language classroom, through the medium of student coursebooks, which are frequently the core resource of the language syllabus. Although a number of coursebooks purport to be corpus-based, research shows that often the language presented in these publications appears to have been selected in an intuitive or arbitrary fashion without reference to corpus data (e.g. Koprowski 2005). Furthermore, the phrasal verbs are often presented on a single page in large numbers in test-like formats, which give little or no opportunity for learners to use the target phrasal verbs productively. Additionally, once these phrasal verbs have been ‘covered’ on these pages, quite frequently no attempt is made to recycle these items in subsequent parts of the book. Another factor that may influence phrasal verb exposure is the fact that the majority of learners around the world are taught by non-native teachers who themselves may not use, or even be aware of, those phrasal verbs most commonly used. Finally, we looked at the phrasal verbs at the lower end of the frequency range to see if we could explain the blip which occurred at the 46–50 rank level. The blip is largely a result of the spoken frequency curve, and so we looked at the five phrasal verbs in this cluster in terms of spoken frequency. They are carry on, look around, move up, move back, and pay back. These phrasal verbs were considerably better learned than the phrasal verbs in adjoining frequency clusters (41–45: break up, give out, sit up, make out, move out; 51–55: dig up, hold back, take after, tear up, pin down). It is difficult to pinpoint the reason for this, although one potential explanation is that
at least four of the phrasal verbs can be interpreted literally (look around, move up, move back, pay back), while in the adjoining clusters there are more phrasal verbs which cannot be (break up = *breaking something in an upwards direction; take after = *taking something subsequently to something else). Regardless of the phrasal verb characteristic(s) at play here, our frequency/knowledge curves suggest that phrasal verbs are idiosyncratic in terms of learning burden, and that a purely frequency-based explanation can never fully explain their acquisition.
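The frequency clusters discussed in this section (items ranked by frequency and averaged in groups of five) can be sketched as follows. This is only an illustration of the grouping step: the function name `cluster_means` is hypothetical and the scores are invented, not taken from the study.

```python
def cluster_means(scores_by_rank, size=5):
    """Average consecutive groups of `size` scores.

    `scores_by_rank` lists one mean test score per phrasal verb, ordered
    from the most frequent item to the least frequent. A final group with
    fewer than `size` items is averaged over the items it has.
    """
    chunks = (
        scores_by_rank[start:start + size]
        for start in range(0, len(scores_by_rank), size)
    )
    return [sum(chunk) / len(chunk) for chunk in chunks]


if __name__ == "__main__":
    # Invented percentage scores for ten items (ranks 1-10).
    made_up_scores = [80, 70, 60, 50, 40, 90, 10, 50, 50, 50]
    print(cluster_means(made_up_scores))
```

Averaging within clusters smooths item-level variation, but a single well-known or poorly known item can still move the mean of a five-item cluster noticeably, which is one way a blip such as that at ranks 46–50 can arise.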
4.2 Individual differences factors in the acquisition of phrasal verbs
We have seen that frequency is a factor in the acquisition of phrasal vocabulary, but that it only explains 10–20% of the variance in test scores. Unsurprisingly, other factors must also be at play. One area that has been well documented is that of first language (L1) influence on language acquisition. Phrasal verbs are found predominantly in English and a few other cognate languages. German, for example, whilst not having phrasal verbs as such, does have particle verbs which are superficially similar (see Waibel 2007: 38–40). L1 influence is certainly an important factor in language acquisition, and the absence of a feature, like phrasal verbs, from a learner’s L1, can affect the way a second language (L2) is acquired (e.g. Dagut & Laufer 1985; Hulstijn & Marchena 1989; Laufer & Eliasson 1993; Liao & Fukuya 2004; Siyanova & Schmitt 2007; Swan 1997). However, as only 2 of our 68 learners had an L1 (German) that contained an equivalent to phrasal verbs, we decided not to take this factor into consideration, and concentrated instead on other individual differences to determine if they had any effect on the acquisition of phrasal verbs.

4.2.1 Language proficiency
Does phrasal verb knowledge increase as overall language proficiency rises? To answer this question we compared the scores of the intermediate and upper-intermediate learners to see if there were significant differences.1 Table 5 shows the results of the independent sample t-tests, indicating that the upper-intermediate learners scored on average higher than their intermediate counterparts. The differences in scores were significant (p<.05), and the effect sizes (eta squared)2 large, accounting for 20% (productive) and 23% (receptive) of the differences between the intermediate and upper-intermediate scores, confirming that learners’ knowledge of phrasal verbs appears to be related to overall language proficiency. The differences in the phrasal verb knowledge of intermediate and upper-intermediate students may also be related to the language level at which learners are first exposed to phrasal verbs. Very few coursebooks below intermediate level have any explicit or implicit reference to phrasal verbs, and whilst there may be valid pedagogic reasons for this, it does mean that phrasal verb acquisition may lag behind other areas of language at lower proficiency levels.

Table 5. Proficiency Level Comparisons (Independent Samples T-Tests)

                                 Mᵃ      SD      df   t         Effect sizeᵇ
Productive
  intermediate (n = 23)          22.91   5.15    66   –4.079*   .20
  upper-intermediate (n = 45)    33.65   10.00
Receptive
  intermediate (n = 23)          32.00   5.80    66   –4.482*   .23
  upper-intermediate (n = 45)    41.91   7.79

*p < .001
a. Max score = 60
b. Eta squared

1. The proficiency assessments of the different schools involved in the study are idiosyncratic, and so cannot be directly compared. This makes the distinction between intermediate and upper intermediate proficiency levels in the study somewhat tenuous. Although it is difficult to quantify what these proficiency levels mean in absolute terms, we had personal experience of all participants, and feel that the distinction accurately reflects a noticeable difference in relative levels of proficiency.
2. Effect size is a measure of the strength of the relationship between two variables. Eta squared is the proportion (.01 = small effect, .06 = moderate effect, .14 = large effect) of the total variance that is attributed to an effect and is usually expressed as a percentage by multiplying it by 100.

4.2.2 Gender
There has been much debate about the role of gender in language learning and acquisition, and research has examined a number of areas such as language proficiency, attitudes, motivation, and learning, cognitive and metacognitive strategies (e.g. Kobayashi 2002; Tercanlioglu 2004). We were interested to know if gender was also a factor in the acquisition of phrasal verbs. The results from our t-tests indicate that, although males scored higher in both tests, the differences in scores were not statistically significant, and for these participants at least, gender did not appear to be a factor in their knowledge of phrasal verbs.

4.2.3 Age
We were also interested in whether age had any influence on the participants’ productive and receptive knowledge of phrasal verbs. The ages of the learners ranged between 14 and 55 and for the purpose of the analyses we divided them into 3 age groups (under 18, 18–25, over 25). The results from one-way ANOVAs indicate that the older learners scored higher in both the productive and receptive tests, but the differences in scores were not significant, suggesting that age was not a significant factor.
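The eta squared values in Table 5 can be checked against the reported t statistics. The sketch below assumes the commonly used conversion for an independent-samples t-test, eta squared = t² / (t² + df); the chapter does not state which formula was applied, and the function name `eta_squared_from_t` is illustrative.

```python
def eta_squared_from_t(t, df):
    """Eta squared for an independent-samples t-test.

    Assumed formula: t^2 / (t^2 + df), where df = n1 + n2 - 2.
    The sign of t does not matter because t is squared.
    """
    return t * t / (t * t + df)


if __name__ == "__main__":
    df = 23 + 45 - 2  # intermediate and upper-intermediate groups in Table 5
    for label, t in (("productive", -4.079), ("receptive", -4.482)):
        print(label, round(eta_squared_from_t(t, df), 2))
```

Under this assumption the t values and degrees of freedom in Table 5 give the reported effect sizes of .20 and .23 after rounding to two decimal places.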
4.3 Exposure to target language inside and outside the classroom
The second type of factor we explored was the amount and type of exposure our participants had to English both inside and outside the language classroom.

4.3.1 Formal language instruction
Achieving proficiency in a second language is dependent on a number of factors, not least the quantity and quality of language instruction. The biodata questionnaire included items relating to the length of time the participants had been learning English, where they took their lessons, and how many hours of instruction they received each week. The results of the comparison of test scores, perhaps surprisingly, indicated that overall the type of instruction and hours of classroom input that the learners received did not have a significant effect on their test scores.

4.3.2 Extensive reading
Research indicates that extensive reading can improve vocabulary knowledge and have a positive effect on language proficiency overall. Using data collected from the biodata questionnaire, we divided students into those who read in English 0–1 hour per week, 1–2 hours, and 2+ hours. One-way ANOVAs were significant (productive: F(2, 65)=4.46, p<.05; receptive: F(2, 65)=4.04, p<.05), and Least Significant Difference (LSD) post-hoc tests showed the difference (p<.05) existed between those who read the least (0–1 hour = 27.0 productive and 37.4 receptive) and those who read the most (2+ hours = 36.7 productive and 45.2 receptive). The effect sizes were moderately high (.12 productive, .11 receptive). So while differences in classroom input did not significantly affect acquisition of phrasal verbs, the amount of input from reading did have an effect.

4.3.3 Watching English language films and television
As reading had an effect on the acquisition of phrasal verbs, it is interesting to see if other types of non-classroom input did as well. Another way of increasing one’s exposure to the target language is through watching English language films and TV shows, and we included an item in the biodata questionnaire asking participants how much time they spent on these activities. Using the same methodology as for reading, we came up with nearly identical findings. The ANOVAs were significant (productive: F(2, 65)=4.54, p<.05; receptive: F(2, 65)=3.83, p<.05) and the LSD post-hoc tests (p<.05) showed that learners who spent more than 2 hours per week watching English language films and TV shows knew more phrasal verbs (33.2 productive, 42.9 receptive) than those who only watched for an hour or less (25.7 productive, 36.7 receptive). The effect sizes are the same as for reading (.12 productive, .11 receptive). These results indicate that this type of exposure is also effective in promoting the acquisition of phrasal vocabulary.
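The effect sizes reported for reading and for film/TV watching can be related to the F statistics in the same way. The sketch below assumes the standard identity for a one-way ANOVA, eta squared = (F × df_between) / (F × df_between + df_within), and labels the result using the benchmarks given in the chapter's note on effect size (.01 small, .06 moderate, .14 large). The function names are illustrative, and the chapter does not state how its values were computed.

```python
def eta_squared_from_f(f_value, df_between, df_within):
    """Eta squared for a one-way ANOVA from F and its degrees of freedom.

    Assumed identity: SS_between / SS_total
    = (F * df_between) / (F * df_between + df_within).
    """
    explained = f_value * df_between
    return explained / (explained + df_within)


def effect_label(eta_squared):
    """Label an effect size using the .01 / .06 / .14 benchmarks."""
    if eta_squared >= 0.14:
        return "large"
    if eta_squared >= 0.06:
        return "moderate"
    if eta_squared >= 0.01:
        return "small"
    return "negligible"


if __name__ == "__main__":
    # F(2, 65) values as reported for reading and for film/TV watching.
    reported = {
        "reading, productive": 4.46,
        "reading, receptive": 4.04,
        "film/TV, productive": 4.54,
        "film/TV, receptive": 3.83,
    }
    for name, f_value in reported.items():
        eta = eta_squared_from_f(f_value, 2, 65)
        print(name, round(eta, 2), effect_label(eta))
```

On this assumption all four F values fall between the moderate and large benchmarks, which is consistent with the description of the effect sizes as moderately high.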
4.3.4 Listening to English language music
Another type of input that many learners partake of is listening to English music outside the classroom. English language popular music has a worldwide appeal and some research has indicated that incidental listening can have a positive effect on language acquisition (Sjöholm 2004). We used the same type of analysis as for reading and film/TV watching, but whilst those learners who listened to English language music for 1 to 2 hours per week scored higher than those who listened less, the differences in their scores were not significant. It seems therefore that the amount of listening to English music does not affect the acquisition of phrasal vocabulary. This may be because listening to music requires much less attention and concentration than watching films or TV programmes.

4.3.5 Social networking
Social networking sites have become extremely popular in recent years (e.g. Facebook, MySpace, Twitter), and together with other forms of electronic communication (Skype, SMS) have allowed millions to interact and socialise on a global scale. English is often the lingua franca of the Internet and we were interested to see how many of the participants took advantage of these forms of communication to practise their language skills, and whether it had any effect on their vocabulary knowledge. Half of the participants spent more than two hours each week using English as a lingua franca on social networking sites. However, those who used these sites the most did not score significantly higher than those who used these sites the least.
5. Conclusion

Our study set out to explore what ESL/EFL learners knew about relatively frequent phrasal verbs, and how that knowledge was acquired. We found that frequency (as indicated by large General English corpora) predicted phrasal verb knowledge to a moderate degree in terms of productive mastery (r² ≈ 20%), but much less so in terms of receptive mastery (r² ≈ 9%). Corpus frequency figures will always be useful in identifying the phrasal verbs that need to be known, as high frequency phrasal verbs undoubtedly have great utility for students. However, the same frequency figures seem to have differential ability in predicting whether phrasal verbs are known or not. They produce strong enough correlations to predict productive knowledge to some extent, but seem to lack the capacity to do the same for receptive mastery. Clearly, the acquisition of phrasal verbs relies on more than just frequency of exposure. Interestingly, our results showed no effect for formal-instruction-based variables, but did show that more out-of-class exposure (in the form of outside reading and film/TV watching) facilitated the learning of phrasal verbs. It is interesting to note that not all outside exposure was beneficial though; the amount of listening to English language music and social networking did not have an effect. Perhaps the most encouraging outcome of the study
was the relatively good knowledge our participants demonstrated of the target phrasal verbs. Overall, they knew about two-thirds of them receptively, and about one-half productively. While admittedly the target phrasal verbs were mostly among the most frequent in English, this knowledge is a good start, and the quest continues to find ways of helping students/learners master the rest of the phrasal verb inventory. A better understanding of the ways frequency interacts with learner knowledge and acquisition can only aid this pursuit.
References

Biber, D., Johansson, S., Leech, G., Conrad, S. & Finegan, E. 1999. Longman Grammar of Spoken and Written English. Harlow: Longman.
Bolinger, D.L.M. 1971. The Phrasal Verb in English. Cambridge MA: Harvard University Press.
British National Corpus, version 3 (BNC XML Edition). 2007. Distributed by Oxford University Computing Services on behalf of the BNC Consortium.
Collins Cobuild English Grammar. 2005. 2nd edn. Glasgow: HarperCollins.
Courtney, R. 1983. Longman Dictionary of Phrasal Verbs. London: Longman.
Dagut, M. & Laufer, B. 1985. Avoidance of phrasal verbs: A case for contrastive analysis. Studies in Second Language Acquisition 7(1): 73–79.
Darwin, C.M. & Gray, L.S. 1999. Going after the phrasal verb: An alternative approach to classification. TESOL Quarterly 33(1): 65–83.
Davies, M. 2008. The Corpus of Contemporary American English (COCA): 400+ million words, 1990-present.
De Cock, S. 2000. Repetitive phrasal chunkiness and advanced EFL speech and writing. In Corpus Linguistics and Linguistic Theory, C. Mair & M. Hundt (eds), 51–68. Amsterdam: Rodopi.
De Cock, S., Granger, S., Leech, G. & McEnery, T. 1998. An automated approach to the phrasicon of EFL learners. In Learner English on Computer, S. Granger (ed.), 67–79. London: Addison Wesley Longman.
Dörnyei, Z. 2007. Research Methods in Applied Linguistics. Oxford: OUP.
Dörnyei, Z. 2009. The Psychology of Second Language Acquisition. Oxford: OUP.
Ellis, N.C. 2003. Constructions, chunking, and connectionism: The emergence of second language structure. In The Handbook of Second Language Acquisition, C.J. Doughty & M.H. Long (eds), 63–103. Oxford: Blackwell.
Ellis, R. 2008. The Study of Second Language Acquisition. Oxford: OUP.
Gardner, D. & Davies, M. 2007. Pointing out frequent phrasal verbs: A corpus-based analysis. TESOL Quarterly 41(2): 339–359.
Granger, S. 1998. Prefabricated patterns in advanced EFL writing: Collocations and formulae. In Phraseology: Theory, Analysis and Applications, A.P. Cowie (ed.), 145–160. Oxford: OUP.
Granger, S. & Meunier, F. (eds). 2008. Phraseology: An Interdisciplinary Perspective. Amsterdam: John Benjamins.
Groot, P.J.M. 2000. Computer assisted second language acquisition. Language Learning and Technology 4(1): 60–81.
Hart, C.W. 2009. The Ultimate Phrasal Verb Book, 2nd edn. Hauppauge NY: Barron’s Educational Series.
Horst, M. 2005. Learning L2 vocabulary through extensive reading: A measurement study. The Canadian Modern Language Review 61(3): 355–382.
Hulstijn, J.H. & Marchena, E. 1989. Avoidance: Grammatical or semantic cause? Studies in Second Language Acquisition 11(3): 241–255.
Hunston, S. 2002. Corpora in Applied Linguistics. Cambridge: CUP.
Kobayashi, Y. 2002. The role of gender in foreign language learning attitudes: Japanese female students’ attitudes towards English learning. Gender and Education 14(2): 181–197.
Konishi, T. 1958. The growth of the verb-adverb combination in English: A brief sketch. In Studies in English Grammar and Linguistics: A Miscellany in Honour of Takanobu Otsuka, K. Araki & T. Otsuka (eds). Tokyo: Kenkyusha.
Koprowski, M. 2005. Investigating the usefulness of lexical phrases in contemporary coursebooks. ELT Journal 59(4): 322–332.
Laufer, B. & Eliasson, S. 1993. What causes avoidance in L2 learning: L1–L2 difference, L1–L2 similarity, or L2 complexity? Studies in Second Language Acquisition 15(1): 35–48.
Learner Corpus Bibliography. 2010. Centre for English Corpus Linguistics.
Liao, Y. & Fukuya, Y.J. 2004. Avoidance of phrasal verbs: The case of Chinese learners of English. Language Learning 54(2): 193–226.
McArthur, T. & Atkins, B. 1990. Dictionary of English Phrasal Verbs and Their Idioms. London: Collins.
McCarthy, M. & O’Dell, F. 2007. English Phrasal Verbs in Use: Advanced: 60 Units of Vocabulary Reference and Practice; Self-study and Classroom Use. Cambridge: CUP.
Melka, F. 1997. Receptive vs. productive aspects of vocabulary. In Vocabulary: Description, Acquisition and Pedagogy, N. Schmitt & M. McCarthy (eds), 84–102. Cambridge: CUP.
Meunier, F. & Granger, S. (eds). 2008. Phraseology in Foreign Language Learning and Teaching. Amsterdam: John Benjamins.
Miller, G. 2005. WordNet, Version 2.1. Princeton University.
Moon, R. 1997. Vocabulary connections: Multi-word items in English. In Vocabulary: Description, Acquisition and Pedagogy, N. Schmitt & M. McCarthy (eds). Cambridge: CUP.
Nation, I.S.P. 2001. Learning Vocabulary in Another Language. Cambridge: CUP.
Nation, P. & Waring, R. 1997. Vocabulary size, text coverage and word lists. In Vocabulary: Description, Acquisition and Pedagogy, N. Schmitt & M. McCarthy (eds), 6–19. Cambridge: CUP.
Pemberton, L. & Fallahkhair, S. 2008. Interactive television as a vehicle for language learning. In Interactive Digital Television: Technologies and Applications, G. Lekakos, K. Chorianopoulos & G.I. Doukidis (eds), 18–32. Hershey PA: IGI Publishers.
Read, J. 1997. Vocabulary and testing. In Vocabulary: Description, Acquisition and Pedagogy, N. Schmitt & M. McCarthy (eds), 303–320. Cambridge: CUP.
Rott, S. 1999. The effect of exposure frequency on intermediate language learners’ incidental vocabulary acquisition and retention through reading. Studies in Second Language Acquisition 21(4): 589–619.
Schmitt, N. 2008. Instructed second language vocabulary learning. Language Teaching Research 12(3): 329–363.
Schmitt, N. 2010. Researching Vocabulary: A Vocabulary Research Manual. Basingstoke: Palgrave.
Sinclair, J.M. 2002. Collins COBUILD Dictionary of Phrasal Verbs. London: HarperCollins.
Siyanova, A. & Schmitt, N. 2007. Native and nonnative use of multi-word vs. one-word verbs. International Review of Applied Linguistics in Language Teaching 45: 119–139.
Learner knowledge of phrasal verbs Sjöholm, K. 2004. The complexity of the learning and teaching of EFL among Swedish-minority students in bilingual Finland. Journal of Curriculum Studies 36(6): 685–696. Swan, M. 1997. The influence of the mother tongue on second language vocabulary acquisition and use. In Vocabulary: Description, Acquisition, Pedagogy, N. Schmitt & M. McCarthy (eds). Cambridge: CUP. Tercanlioglu, L. 2004. Exploring gender effect on adult foreign language learning strategies. Issues in Educational Research 14(2): 181–193. Tomasello, M. 2003. Constructing a Language: A Usage-based Theory of Language Acquisition. Cambridge MA: Harvard University Press. Waibel, B. 2007. Phrasal Verbs in Learner English: A Corpus-based Study of German and Italian Students. Freiburg: Albert-Ludwigs-Universität.
Norbert Schmitt and Stephen Redwood
Appendix A. BNC phrasal verb frequency: Comparison of results

| # | Phrasal verb | G&D | S&R |
|---|---|---|---|
| 1 | go on | 14903 | 16228 |
| 2 | carry out | 10798 | 4199 |
| 3 | set up | 10360 | 10884 |
| 4 | pick up | 9037 | 9777 |
| 5 | go out | 7688 | 7765 |
| 6 | find out | 6619 | 6760 |
| 7 | make up | 5469 | 6329 |
| 8 | come out | 5022 | 5231 |
| 9 | come in | 4814 | 9450 |
| 10 | work out | 4703 | 5190 |
| 11 | take up | 4608 | 5257 |
| 12 | sit down | 4478 | 4717 |
| 13 | take on | 4199 | 5022 |
| 14 | get up | 3936 | 3892 |
| 15 | carry on | 3869 | 2587 |
| 16 | get out | 3545 | 3406 |
| 17 | come down | 3305 | 3637 |
| 18 | put up | 2835 | 3083 |
| 19 | get on | 2696 | 3441 |
| 20 | bring in | 2505 | 3010 |
| 21 | break down | 2199 | 2386 |
| 22 | go off | 2104 | 2565 |
| 23 | go in | 1974 | 4695 |
| 24 | put out | 1660 | 1728 |
| 25 | take back | 1628 | 1469 |
| 26 | get down | 1538 | 1415 |
| 27 | put on | 1428 | 6977 |
| 28 | move on | 1419 | 2152 |
| 29 | put back | 1369 | 1251 |
| 30 | break up | 1286 | 1430 |
| 31 | sit up | 1158 | 1181 |
| 32 | get in | 1127 | 4671 |
| 33 | make out | 1105 | 1067 |
| 34 | turn down | 1051 | 1355 |
| 35 | come over | 1004 | 1262 |
| 36 | go over | 991 | 1732 |
| 37 | hold on | 908 | 1797 |
| 38 | pick out | 856 | 905 |
| 39 | hold back | 823 | 862 |
| 40 | move in | 790 | 1377 |
| 41 | look around | 779 | 1350 |
| 42 | take down | 775 | 849 |
| 43 | put off | 742 | 1075 |
| 44 | turn off | 594 | 1057 |
| 45 | move out | 573 | 594 |
| 46 | move back | 566 | 603 |
| 47 | give out | 532 | 550 |
| 48 | come off | 518 | 1191 |
| 49 | give back | 507 | 654 |
| 50 | move up | 477 | 616 |
| 51 | dig up | – | 383 |
| 52 | pay back | – | 295 |
| 53 | pin down | – | 249 |
| 54 | tear up | – | 224 |
| 55 | think over | – | 219 |
| 56 | fall behind | – | 156 |
| 57 | pass away | – | 104 |
| 58 | chat up | – | 100 |
| 59 | take after | – | 84 |
| 60 | cool off | – | 65 |

G&D = Gardner & Davies token frequency; S&R = Schmitt & Redwood token frequency
Learner knowledge of phrasal verbs
Appendix B. Productive phrasal verb test

Student: _____________________________________ Level: _________

We are carrying out a study of students’ receptive and productive knowledge of phrasal verbs. To help us in our research please complete this productive knowledge test. Read each question carefully, and then write what you think the missing words (a phrasal verb) are, in the space next to the question. To help you, the first letter of each word is shown. We have also given a definition for each phrasal verb after every sentence. There are 60 questions and each one uses a different phrasal verb. You have 40 minutes to finish the test. Good luck!

Example questions:

i
This is a really good piece of work. You must have p________ i__________ a lot of effort. (make an effort, spend time)
Answer: put in
ii
I don’t have enough money to pay the tuition fees. I need to ask the bank if I can t__________ o__________ a loan to pay for them. (get, obtain)
Answer: take out
iii
We spent the afternoon at the airport watching the planes t________ o__________ and land. (leave the ground and fly)
Answer: take off / taking off

1
Mike needs a lift from the station. Can you go and p__________ him u________? (collect, give a lift)
2
I think we’ve spent enough time talking about this. We should m__________ o__________ to the next item. (continue, proceed)
3
P__________ the book b__________ on the shelf when you’ve finished with it. (return, replace)
4
I don’t like that picture on the wall there. I think I’ll t__________ it d__________ and hang it somewhere else. (remove, move to a lower position)
5
I don’t want to stay in and cook tonight. Let’s g__________ o__________ for a meal. (leave your house for a special reason)
6
I know you’re tired, but we can’t stop now. We have to g__________ o__________ until we finish. (continue, proceed)
7
She was relaxing reading a book when a loud crash made her s________ u__________ straight in her chair. (seated with a straight back)
8
It rained all morning, but in the afternoon the sky cleared and the sun c__________ o__________. (appear, become visible)
9
I can’t find my phone anywhere. Please l__________ a__________ the house and see if you can find it. (search, view)
10
The police s__________ u__________ roadblocks to stop people driving into the city centre. (build, erect)
11
Derek has got the keys to his new flat and I’m going to help him m__________ i__________ tomorrow. (occupy, start to live in)
12
I was extremely sorry to hear that John’s father p__________ a__________ yesterday. I understand that he had been very ill for a long time. (die)
13
I wonder where Pete is today. Jim, could you f__________ o__________ what’s happened to him? (discover, check)
14
I have been offered a really good job in London, but I don’t want to move, so I’m going to t__________ the offer d__________. (reject, refuse, say no)
15
Please c__________ i__________, take a seat and make yourself comfortable. (enter)
16
We heard this really loud explosion, and found out later that a bomb had g__________ o__________ in the city centre. (explode)
17
I think I t__________ a__________ my mother. We look very similar and we like the same kind of things. (similar to, be like)
18
As I was running across the field one of my shoes got stuck in the mud and c__________ o__________. (be detached, separate from)
19
There are plenty of chairs. Let’s all s__________ d__________ together and have a nice long chat. (take a chair)
20
It’s a problem finding a job now because companies are just not t__________ o__________ new staff at the moment. (employ, recruit, accept)
21
Don’t let go of the rope. H__________ o__________ tight and I’ll try and pull you out. (grasp, grip firmly)
22
I thought this question was difficult at first but I managed to w__________ o__________ the right answer in the end. (learn, discover, calculate)
23
What time does this train g__________ i__________ to Manchester? (arrive, enter the station)
24
I am trying to get Peter to tell me when he wants to go on holiday, but it’s very difficult to p__________ him d__________ to an exact date. (make him decide)
25
There are a lot more girls than boys in the English Department. In fact, they m__________ u__________ 85% of the students. (comprise, form)
26
There’s more milk in the fridge. Can you g__________ some o__________ please? (remove, take from)
27
We could fit more people on the bus if everybody m__________ u__________ a bit. (change position to make more space)
28
I should go to bed. I’ve got to g__________ u__________ early in the morning. (rise from bed)
29
They c__________ o__________ from Italy every summer to stay with us in London. (travel)
30
Jean was so angry with Ray that she took all his photos, t__________ them u__________, and threw the pieces on the fire. (rip apart, shred)
31
Henry has m__________ o__________ of his flat and gone back to live with his parents. (leave, vacate)
32
I need more time to decide what to do. Can you give me a few days to t__________ it o__________? (consider, contemplate, ponder)
33
I didn’t mean to stop you working. Please c__________ o__________ with what you were doing. (continue)
34
When searchers saw the floating wreckage they knew the missing plane had c__________ d__________ in the sea. (crash, land, fall)
35
I have got your new English books here. Maria, can you g_______ them o_______ to the class? (distribute, hand to)
36
The writing was very difficult to read and it was hard to m_________ o_________ what it said. (see, recognise, distinguish)
37
When are you going to p__________ me b__________ the money I lent you? (return)
38
Can you g__________ me b__________ my pen? I need it now. (return)
39
There’s no room for my things on the shelf. Your books t__________ u__________ all the space. (occupy, use, fill)
40
The doctors aren’t sure what’s wrong with her and they need to c__________ o__________ more tests. (do, complete)
41
Don’t climb up there. It’s dangerous. G__________ d__________ at once before you fall! (move to a lower position, descend)
42
Let’s p__________ u__________ some posters on the notice board to advertise our concert. (fix/attach somewhere they can be seen)
43
When we first met we didn’t like each other much but now we g_______ o_______ really well. (have a good relationship, be friends)
44
Mary missed a lot of lessons and has f__________ b__________ the rest of the class. She will have to work hard to catch up. (fail to keep level with)
45
When I have a long piece of writing to do I find it easier if I b__________ it d__________ into small parts. (divide, separate, take apart)
46
I can’t hear what you’re saying. Can you t__________ that music o__________? (stop by using a switch)
47
Trevor was working in his garden the other day, putting in some new plants, when he d__________ u__________ an old box full of silver coins. (remove from the ground)
48
Do the plates g__________ i__________ this cupboard? I’m not sure where to put them. (be stored, be put)
49
It’s always a good idea to g__________ o__________ your answers to check you haven’t made any silly mistakes. (check, examine, survey)
50
The staff, using buckets of water, managed to p_________ the fire o_________ before the fire crew arrived. (extinguish, stop from burning)
51
The crowd rushed forward and the riot police were unable to h__________ them b__________. (stop, contain, check)
52
Mark thinks he is a bit of a romeo. He is always trying to c__________ u__________ the girls. (talk to in a friendly way)
53
They’ve p__________ o__________ their trip to Australia until next year to give them more time to save up some money. (postpone, cancel until a later date)
54
Lots of people applied for the job but Mary was p__________ o__________ as the best candidate. (choose, select)
55
Quick, p__________ your coat o__________. We’re going now. (wear, clothe yourself)
56
Are you m__________ b__________ to Scotland after you’ve finished your work here? (return)
57
When the food and drink ran out the party b__________ u__________ and everyone went home. (come to an end, finish)
58
It’s so hot. Let’s go for a swim in the lake to c__________ o__________. (lose heat, get colder)
59
The football club sacked their manager and b__________ i__________ a new man in the hope of improving results. (introduce, employ)
60
This phone’s still not working properly. I’ll have to t__________ it b_________ to the shop where I bought it. (return)
Thank you very much for completing the first part of the study.
Appendix C. Receptive phrasal verb test

Student: __________________________________ Level: __________

For the second part of our study we would like to know about your receptive knowledge of phrasal verbs. To help us, please complete this multiple choice test. Read each question carefully, and then choose the best answer (A, B, C, D) to go in the spaces. There is only one correct answer for each question. If you do not know the answer write E. To help you there is a definition for each phrasal verb after every sentence. You have 30 minutes to finish. Good luck!

Each item gives the sentence, followed by options A, B, C, D and E in that order.

Example question:

0 When we tried to buy tickets for the concert we were told that they had ______ ______ within a couple of hours. (all had been bought and there were none left)
sold down
sold out
sold up
sold in
?
Answer: B

1 We heard this really loud explosion, and found out later that a bomb had ______ ______ in the city centre. (explode)
gone back
gone in
gone off
gone up
?
2 We could fit more people on the bus if everybody _____ _____ a bit. (change position to make more space)
broke up
looked up
turned up
moved up
?
3 I am trying to get Peter to tell me when he wants to go on holiday, but it’s very difficult to ______ him ______ to an exact date. (make him decide)
pin in
pin on
pin up
pin down
?
4 I think we’ve spent enough time talking about this. We should _____ _____ to the next item. (continue, proceed)
move in
move down
move out
move on
?
5 Mike needs a lift from the station. Can you go and ______ him ______? (collect, give a lift)
pick out
pick up
pick at
pick on
?
6 I wonder where Pete is today. Jim, could you ______ ______ what’s happened to him? (discover, check)
find in
find up
find on
find out
?
7 I should go. I’ve got to _____ _____ early in the morning. (rise from my bed)
work up
stand up
get up
take up
?
8 What time does this train _____ _____ to Manchester? (arrive, enter the station)
take in
give in
get in
bring in
?
9 I don’t like that picture on the wall there. I think I’ll ______ it _______ and hang it somewhere else. (remove, move to a lower position)
turn down
stand down
take down
hold down
?
10 It rained all morning, but in the afternoon the sky cleared and the sun _____ _____. (appear, become visible)
took out
came out
made out
passed out
?
11 Henry’s _____ _____ of his flat and gone back to live with his parents. (leave, vacate)
moved out
moved off
moved back
moved on
?
12 When searchers saw the floating wreckage they knew the missing plane had _____ _____ in the sea. (crash, land, fall)
come down
come across
come out
come up
?
13 There’s no room for my things on the shelf. Your books _____ _____ all the space. (occupy, use, fill)
take in
take on
take out
take up
?
14 Can you ______ me ______ my pen? I need it now. (return)
give off
give up
give out
give back
?
15 It’s always a good idea to _____ _____ your answers to check you haven’t made any silly mistakes. (check, examine, survey)
show over
take over
go over
give over
?
16 There are plenty of chairs. Let’s all _____ _____ together and have a nice long chat. (take a chair)
sit down
sit on
sit over
sit off
?
17 The football club sacked their manager and _____ _____ a new man in the hope of improving results. (introduce, employ)
held in
brought in
turned in
came in
?
18 Lots of people applied for the job, but she was _____ _____ as the best candidate. (choose, select)
picked out
picked back
picked over
picked in
?
19 The police _____ _____ roadblocks to stop people driving into the city centre. (build, erect)
set in
set up
set on
set at
?
20 This phone’s still not working properly. I’ll have to ______ it ______ to the shop where I bought it. (return)
take back
set back
turn back
look back
?
21 Jean was so angry with Ray that she took all his photos, ______ them ______, and threw the pieces on the fire. (rip apart, shred)
took up
tore up
set up
looked up
?
22 Let’s _____ _____ some posters to advertise our concert. (fix, attach somewhere they can be seen)
go up
put up
give up
sit up
?
23 When the food and drink ran out the party _____ _____ and everyone went home. (come to an end, finish)
broke up
broke in
broke over
broke out
?
24 Please _____ _____, take a seat and make yourself comfortable. (enter)
put in
come in
give in
bring in
?
25 I was extremely sorry to hear that John’s father _____ _____ yesterday. I understand that he had been very ill for a long time. (die)
passed about
passed back
passed away
passed up
?
26 I know you’re tired, but we can’t stop now. We have to _____ _____ until we finish. (continue, proceed)
put on
look on
go on
take on
?
27 Do the plates _____ _____ this cupboard? I’m not sure where to put them. (be stored, be put)
come in
give in
take in
go in
?
28 They _____ _____ from Italy every summer to stay with us in London. (travel)
come about
come on
come off
come over
?
29 I need more time to decide what to do. Can you give me a few days to ______ it ______? (consider, contemplate, ponder)
think over
think under
think up
think back
?
30 I can’t find my phone anywhere. Please _____ _____ the house and see if you can find it. (search, view)
look across
look down
look on
look around
?
31 When I have a long piece of writing to do I find it easier if I ______ it ______ into small parts. (divide, separate, take apart)
break back
break off
break out
break down
?
32 I thought this question was difficult at first but I managed to _____ _____ the right answer in the end. (learn, discover, calculate)
work in
work up
work out
work off
?
33 Don’t let go of the rope. _____ _____ tight and I’ll try and pull you out. (grasp, grip firmly)
Hold in
Hold off
Hold on
Hold up
?
34 I have got your new English books here. Maria, can you ______ them ______ to the class? (distribute, hand to)
give off
give out
give up
give on
?
35 There’s more milk in the fridge. Can you _____ some _____ please? (remove, take from)
go out
hold out
make out
get out
?
36 I don’t want to stay in and cook tonight. Let’s _____ _____ for a meal. (leave your house for a special reason)
go under
go out
go up
go in
?
37 I didn’t mean to stop you working. Please _____ _____ with what you were doing. (continue)
carry off
carry back
carry on
carry up
?
38 It’s so hot. Let’s go for a swim in the lake to _____ _____. (lose heat, get colder)
cool on
cool in
cool off
cool up
?
39 There are a lot more girls than boys in the English Department. In fact, they _____ _____ 85% of the students. (comprise, form)
make on
make up
make in
make off
?
40 ______ the book ______ on the shelf when you’ve finished with it. (return, replace)
Put off
Put in
Put under
Put back
?
41 The crowd rushed forward and the riot police were unable to ______ them ______. (stop, contain, check)
hold under
hold back
hold on
hold over
?
42 The doctors aren’t sure what’s wrong with her and they need to _____ _____ more tests. (do, complete)
carry down
carry in
carry up
carry out
?
43 Don’t climb up there. It’s dangerous. _____ _____ at once before you fall! (move to a lower position, descend)
Get down
Take down
Look down
Put down
?
44 When are you going to _____ me _____ the money I lent you? (return)
pay back
pay on
pay after
pay down
?
45 The staff, using buckets of water, managed to ______ the fire ______ before the fire crew arrived. (extinguish, stop from burning)
put out
put up
put off
put in
?
46 It’s a problem finding a job now because companies are just not _____ _____ new staff at the moment. (employ, accept)
taking on
going on
looking on
getting on
?
47 I can’t hear what you’re saying. Can you _____ that music _____? (stop by using a switch)
turn out
turn back
turn in
turn off
?
48 She was relaxing reading a book when a loud crash made her _____ _____ straight in her chair. (seated with a straight back)
sit off
sit over
sit up
sit on
?
49 I think I _____ _____ my mother. We look very similar and we like the same kind of things. (similar to, be like)
take in
take up
take after
take back
?
50 Are you _____ _____ to Scotland after you’ve finished your work here? (return)
standing back
looking back
moving back
bringing back
?
51 Mary missed a lot of lessons and has _____ _____ the rest of the class. She will have to work hard to catch up. (fail to keep level with)
looked behind
turned behind
fallen behind
put behind
?
52 I have been offered a really good job in London, but I don’t want to move, so I’m going to ______ the offer ______. (reject, refuse, say no)
turn over
turn up
turn down
turn off
?
53 Trevor was working in his garden the other day, putting in some new plants, when he _____ _____ an old box full of silver coins. (remove from the ground)
dug down
dug up
dug off
dug on
?
54 Derek has got the keys to his new flat and I’m going to help him _____ _____ tomorrow. (occupy, start to live in)
give in
move in
make in
work in
?
55 When we first met we didn’t like each other much but now we _____ _____ really well. (have a good relationship, be friends)
take on
look on
bring on
get on
?
56 The writing was very difficult to read and it was hard to _____ _____ what it said. (see, recognise, distinguish)
make off
make out
make up
make in
?
57 They’ve _____ _____ their trip to Australia until next year to give them more time to save up some money. (postpone, cancel until a later date)
put off
put up
put over
put in
?
58 As I was running across the field one of my shoes got stuck in the mud and _____ _____ . (be detached, separate from)
took off
came off
turned off
put off
?
59 Quick, ______ your coat ______. We’re going now. (wear, clothe yourself)
look on
put on
hold on
make on
?
60 Mark thinks he is a bit of a romeo. He is always trying to ____ _____ girls. (talk to in a friendly way)
chat up
chat off
chat in
chat out
?
Appendix D. Biodata questionnaire

Finally, we would like to know how much exposure you have to English. Please spend a few minutes filling in this brief questionnaire.

How long have you been learning English?
less than 1 year / 1 – 2 years / 3 – 5 years / over 5 years

Where do you have English lessons? (you can mark more than one box)
school / language school / private lessons

How many hours of English lessons do you have each week?
1 – 2 hours / 2 – 4 hours / more than 4 hours

How much time do you spend reading books, magazines and newspapers in English, or visiting English language websites each week?
0 – 1 hour / 1 – 2 hours / more than 2 hours

How much time do you spend watching films, videos or TV in English each week?
0 – 1 hour / 1 – 2 hours / more than 2 hours

How much time do you spend listening to music in English each week?
0 – 1 hour / 1 – 3 hours / more than 3 hours

Do you use English to make new friends and keep in contact with people? (Facebook, MySpace, Twitter, Skype, email, instant messaging, SMS [texts] etc)
never / 1 – 2 hours a week / more than 2 hours a week

Your age
Your gender: male / female
Your nationality: country

Many thanks for your help. If you would like to know your scores please fill in your email address below.
Corpora and the new Englishes
Using the ‘Corpus of Cyber-Jamaican’ to explore research perspectives for the future

Christian Mair¹

Contrasts between British and American usage were an important topic in computer-aided corpus linguistics from the very start. The present contribution shows how from these beginnings the scope of corpus-based research was successively extended to cover standard varieties of the New Englishes (e.g. in the International Corpus of English) and eventually also non-standard and vernacular varieties, so that today the corpus-linguistic approach has become an important complement to sociolinguistics in the study of variation in the New Englishes. From a general discussion of this development, the contribution moves on to present the ‘Corpus of Cyber-Jamaican’ (CCJ), a large web-derived corpus of diasporic Jamaican web forums, and shows in a number of exploratory studies how this new resource can be used to investigate the globalisation of vernacular features.
1. The corpus-based documentation of the New Englishes: A brief historical survey

An interest in the description of regional varieties of Standard English, especially the pluricentric standardisation of the world language, was one of the driving forces behind the rise of computer-aided corpus linguistics from the very start. When W. Nelson Francis and Henry Kučera had completed the Standard Corpus of Present-Day Edited American English, for Use with Digital Computers (subsequently known as the Brown corpus) in 1964 (sampling date of texts: 1961), this inspired the creation of its British analogue, the Lancaster-Oslo/Bergen (or LOB) corpus, albeit with a delay of more than ten years (completion of corpus: 1978; sampling date: 1961). To this pair were eventually added three corpora devoted to different kinds of New Englishes: the Kolhapur Corpus of Indian English in 1986 (sampling date: 1978), the Australian Corpus of English (ACE) and the New Zealand ‘Wellington’ Corpus of English (sampling dates: 1986 and 1986/87 respectively). By the early nineteen-nineties corpus coverage of the New Englishes was thus certainly not complete, but the major types – transplanted ‘settler’ Englishes (Australia, New Zealand) and second-language or ‘official’ English (India) – were represented.

¹ The present paper was written while I enjoyed the extremely productive and congenial working environment provided by FRIAS, Freiburg University’s Institute for Advanced Studies. I am grateful for this support. My thanks are also due to the members of the CCJ team, Anastasia Cobet, Johanna Holz, Véronique Lacoste, Andrea Moll, Larissa Teichert, for help with corpus searches and many fruitful discussions.

In spite of its general currency, the term ‘New English(es)’ is notoriously fuzzy to define. The least controversial understanding is the purely chronological one: a variety of English which arose in the wake of the second wave of British colonial expansion, after the loss of the thirteen North American colonies in the late 18th century. This definition does not imply any claim about the linguistic structure or social status of any one specific New English. A more specific understanding (advocated, for example, in Platt et al. 1984) would restrict the term to non-native institutionalised varieties (which, of course, would exclude Australian English or natively spoken South African English). If defined in this way, New Englishes are first and foremost to be described as contact Englishes, or learner Englishes which have been institutionalised in their communities. Since all New Englishes, however defined, result from colonialism, there is significant overlap between this group and what Peter Trudgill has referred to as “colonial Englishes”:

[We shall use the term] colonial as a technical term covering in principle all types of English other than those spoken in England and the lowlands of Scotland – the part of the world to which English was almost entirely confined until the seventeenth century, which is to say for most of its history. Those varieties of English which are spoken elsewhere in the world – the colonial varieties – have resulted from movements of people outwards from Britain, from the seventeenth century onwards, often involving dialect mixture; the influence of other languages with which English has come into contact; and independent developments that have occurred subsequently in different parts of the world, some of them in response to new environments and new uses. These colonial varieties include the forms of English spoken in the Highlands of Scotland, in Wales, in the English county of Cornwall (which has been entirely English speaking only since the seventeenth or eighteenth centuries), and in Ireland, the Isle of Man, Canada, the United States of America, Central America, South America, the Caribbean, the Bahamas, Bermuda, St. Helena, Tristan da Cunha, the Falkland Islands, Liberia, East Africa, South Africa, Zimbabwe, Australia, and New Zealand, as well as in many other areas of the world where second-language and/or pidginized and creolized forms of English are to be found. (Trudgill 1986: 127)
As can be seen, the difference between natively spoken and non-native varieties of English plays no prominent role in this definition, and this has become the mainstream view in World Englishes research today (with the term colonial being replaced by postcolonial, as in the title of Schneider’s [2007] book). Given that the boundary between native and non-native varieties is permeable and that US independence is a historical rather than a linguistic watershed, a priori definitions and categorisations seem of limited practical value. Singaporean English, for example, like Malaysian English, started out as a clear case of a second-language variety at the dusk of the colonial era in South East Asia in the 1950s. For many of its speakers, however, it has now become a native variety (while Malaysian English, by contrast, has undergone a process of societal disestablishment, towards being a foreign language rather than a second or official one, during the same period – cf. Schneider 2007: 147–148). Irish English represents a chronological dilemma. Is it a New English because its most distinctive features result from the mass shift from Irish to English in the second half of the 19th century, or do we have to also consider its pre-history extending back into the Middle Ages? Jamaican English, the variety which will be focussed on in the present contribution, is intractable on both counts. Chronologically, Jamaican Creole or patois, the English-lexifier creole spoken by the mass of the population, had certainly consolidated by the first half of the 18th century, before US independence. 
With regard to native-speaker status, there is a clear conflict between popular opinion, according to which the creolophone British West Indies have always seen themselves as ‘English-speaking’, and many linguists and educators, who point out that the creoles of the region are phonologically and grammatically distinct languages from English and that a competence in English for most speakers is acquired in the educational system, much as is the case in second-language communities. After reviewing the various problems, I shall adopt the following working definition of ‘New English(es)’: ‘any postcolonial variety of English which is undergoing the process of endonormative stabilisation and standardisation that Standard British and American English, the two global reference standards, have completed’. Australian English, for which there have been widely accepted locally produced dictionaries and usage manuals since the 1980s and which has supra-national influence in the South Pacific region, presents itself as a New English far advanced in this process. Jamaican English, whose norms are still emerging in a force-field defined by an inherited but weakening British Standard, increasing influence from the US, and a growing readiness to accommodate Jamaican Creole features, is following behind at some distance. Whichever way one defines the notion of ‘New English’, however, all early corpora in the field had a major shortcoming, in that they were restricted to written English. In this regard, the International Corpus of English (ICE) project, conceived in 1990 by Sidney Greenbaum (see Greenbaum 1990, 1996), represented a major advance by
Christian Mair
including spoken language.2 Equally important was a broadening of the base for systematic comparative research, both on the relationship between ‘Old’ and New Englishes, and on similarities and contrasts among the New Englishes themselves. From the start, links were established between the ICE project and the International Corpus of Learner English (ICLE) developed at around the same time by Sylviane Granger. Apart from studying the obviously interesting question of which widespread learner features generally tend to make it into institutionalised second-language standards, it is possible also to look at more specific constellations: Hong Kong English, for example, is documented as an institutionalised contact English in ICE, and similar contact phenomena will no doubt be in evidence in the Chinese component of ICLE.3 The critical question whether or not the second-language New Englishes represent a special case of language acquisition was bravely raised in a very early paper by Williams (1987), but not followed up systematically – no doubt partly because of the lack of suitable data and corpora. Today, we have these data, for example by joining up ICE and ICLE in the way envisaged by Sylviane Granger at the very inception of the ICLE project. Directed by Sidney Greenbaum until his death and then passing on into the stewardship of Bas Aarts, the British component of ICE (ICE-GB) was completed, annotated for part of speech, parsed syntactically, and made searchable through a sophisticated customised software package (ICECUP). When with the second release of the corpus in 2006 the sound files were made available alongside the transcription, a benchmark was set for other ICE ventures and the wider corpus-linguistic community. 
A project designed to document as many existing and emerging regional standard varieties of English as possible is of course beset by numerous difficulties, but it is vivid testimony to Greenbaum’s far-reaching insight that, at the time of this writing, ten more ICE sub-corpora have been completed (in plain-text versions) and several more are in the making at various stages of completeness.4
2. Another pioneer in the development of spoken corpora of the New Englishes who deserves mention is Janet Holmes, who started collecting texts for the Wellington Spoken Corpus of New Zealand English (WSC; completed 1998) in 1988.
3. For a more comprehensive survey of work taking place at the intersection of New-Englishes and learner-English research, see Mukherjee & Hundt (forthcoming).
4. The completed sub-corpora cover Australia, Canada, East Africa, Hong Kong, India, Ireland, Jamaica, New Zealand, Philippines and Singapore. Most prominent and eagerly awaited among those still being compiled is ICE-USA, while work is going on to produce corpora for Fiji, Ghana, Malaysia, Malta, Namibia, Nigeria, Pakistan, South Africa, Sri Lanka, and Trinidad & Tobago. See http://ice-corpora.net/ice/ for further information.
As the survey of these corpus-ventures makes clear, the focus of current corpus-based research on the New Englishes is very much on the Standard English end of the sociolinguistic scale. ICE corpora document the English of educated users of the language, and not of others. Thus, the Jamaican component of ICE focuses on Jamaican English, and not on the island’s vital and thriving English-lexifier creole, or the mesolectal span of the English-creole continuum which most residents of the island tend to feel most at home in (Patrick 1999). This, however, is a bias which is likely to be redressed in the foreseeable future as sociolinguists and dialectologists are becoming increasingly receptive to the potential of corpora and corpus-related methods of investigation (cf., e.g., Kortmann 2005; Beal et al. 2007; Tagliamonte & D’Arcy 2007a, 2007b; Mair 2009). As for non-standard corpora specifically of the New Englishes, consider the Corpus of Nigerian Pidgin, published as part of Deuber (2005), or the Corpus of Written British Creole (compiled by Mark Sebba and Susan Dray, cf. http://www.ling.lancs.ac.uk/staff/mark/cwbc/cwbcman.htm). While the attempt is still occasionally made (cf. Nelson 2006), as in most other sub-disciplines of corpus linguistics, it is now impossible to compile tidy and complete lists of available corpus resources for the study of the New Englishes. As corpora are becoming mainstream and the vernacularisation of the World Wide Web is progressing apace, corpora of the New Englishes of all sizes and degrees of specialisation are proliferating, with many of them being derived from the Web, the almost inexhaustible digital text archive (as, e.g., in Mukherjee & Hoffmann 2006, a study of Indian newspaper language). The rich corpus-linguistic working environment which we now have for the study of the New Englishes encourages comparative research among these varieties (and, of course, also among some or all of them and the ‘old’ Englishes which provided the input at their genesis and have continued to influence them since). Which features are shared by many or even all of the New Englishes (as ‘vernacular universals’, ‘Angloversals’, ‘New Englishisms’)?5 Which features, on the other hand, are locally specific (and therefore often traceable to particular language contact in multilingual settings)?
Are there specific morphosyntactic profiles distinguishing natively spoken New Englishes such as Canadian English, New Zealand English or Australian English from second-language New Englishes such as West African or Southern Asian varieties? Can we establish recurrent diachronic trends in the emergence and stabilisation of the New Englishes? And so on. Tentative answers to the first three questions are proposed, for example, in the synoptic chapters included in the recent Handbook of Varieties of English (e.g. Kortmann & Szmrecsanyi 2004), whereas the fourth question is at the centre of Schneider (2007).
5. This is not the place to enter into a detailed discussion of contrasts and overlap between these three terms. The notion of ‘vernacular universal’ goes back to Chambers (e.g. 2000, 2003); ‘Angloversal’ was coined by the present writer – see Mair (2003) – and was made the subject of an extensive study by Sand (2005); ‘New Englishisms’ is used in Simo Bobda (2001, 2004).
2. Current challenges: The web as a data source for the study of the new Englishes
After this sketch of the thriving field of corpus-based research on the New Englishes, I would like to move on to present first findings from an ongoing research project of my own, the investigation of the use of Jamaican English and Jamaican Creole in a large (> 15 million words) corpus of Jamaican web-posts which is currently being compiled at Freiburg. The project builds on previous research based on the Jamaican component of ICE (ICE-JA), with the obvious first goal being to position ‘cyber-Jamaican’ with regard to spoken and written usage as documented in ICE-JA.6 As computer-mediated communication is not restricted territorially to the island of Jamaica but involves a large Jamaican diaspora in the US, Canada and Great Britain, it will also be interesting to compare the strength of American influence on the web (in the Corpus of Cyber-Jamaican) and ‘on the ground’ (in ICE-JA).
On the methodological and theoretical plane, the project positions itself at the intersection of web-based corpus linguistics and the emerging field of the sociolinguistics of computer-mediated communication. There has been a boom in all fields of web-related corpus-linguistic activity. For example, it is now standard practice in corpus compilation to draw on available digitised text from the web. Mark Davies’ large web-derived Latin American newspaper corpora produced in the late 1990s (Davies 2001) were a pioneering venture, and his 400+-million-word generically balanced Corpus of Contemporary American English (COCA, http://www.americancorpus.org/) is convincing proof of the viability of the approach in the field of English. The Special Interest Group of the Association for Computational Linguistics (ACL) on Web as Corpus (SIGWAC, http://www.sigwac.org.uk/) is one of several initiatives set up to ensure that the technological challenges presented by the task are tackled in a co-ordinated way. So far, most such activity has resulted in corpora documenting standard varieties of language.
The presence of informal and vernacular features has been noted in many web-based textual genres and has usually been explained as the result of the mediated immediacy of digital communication (its ‘pseudo-oral’ quality). Vernacular varieties of English, however, have not usually been made the explicit target of collection. In addition to introducing a new type of data into web-based corpus linguistics, the corpus also holds considerable innovative potential for the study of computer-mediated communication. Linguistic studies of computer-mediated communication on the other hand – whether in the mainstream inspired by discourse-analytical approaches (Beißwenger & Storrer 2008) or in the emerging sociolinguistic paradigm (Androutsopoulos 2006a) – commonly employ the methods of qualitative ethnography or use small corpora compiled for the purposes of a particular study. Some of the findings from such analyses are therefore necessarily provisional in the sense that it is not clear to what extent the insights are confined to the discursive constellation investigated and to what extent they can be generalised. In such a situation, large corpora can be extremely helpful. To give a simple example: in a qualitative analysis of a few ‘threads’ in a particular web-based discussion forum, the researcher can never be sure whether all relevant conventionalised spellings of some vernacular form are actually represented or whether the statistical preference for one spelling over the others is significant; in a corpus drawn from millions of posts by thousands of participants, the answers to these questions can be given on much firmer ground.
6. The e-mail messages in the small W1B-section of ICE-JA represent the only infiltration of computer-mediated language into this corpus. They will not figure in the comparison.
3. The data: CCJ, a corpus of cyber-Jamaican English/Jamaican Creole
Before describing the data on which the present study is based, a brief word of clarification is in order on the terminology employed. The term used to refer to the emerging local norm of educated English usage in Jamaica, the most populous and culturally most influential former British colony in the Caribbean, is ‘Jamaican English’ (JamEng). Jamaican English is considered one of the New Englishes which have arisen in the wake of decolonisation and can be placed at various stages of endonormative development as described in the widely used model proposed by Schneider (2007). ‘Jamaican Creole’ (JC), locally known as patois, on the other hand, is the name of the English-lexicon Creole which consolidated on the island in the course of the transition to a slave-based plantation economy in the early 18th century. It is easy to claim a status for JC as a separate and independent language from English on the basis of its distinctive grammar and phonology, but most speakers on the island do not draw a rigid line of division between JamEng and JC. While small minorities of speakers may be clearly dominant in ‘basilectal’ JC or ‘acrolectal’ JamEng, most cover a span on the ‘mesolectal’ range of the JC-JamEng ‘continuum’ in their own usage. Note again that, as has been pointed out, even speakers who show strong and obvious influences from JC still consider themselves as speakers of English, if of a more or less highly stigmatised variety.7
What are the problems this highly fluid and flexible continuum situation raises in corpus-practical terms? They are manageable in written material: outside of experimental and literary writing it is usually easy to distinguish between JamEng and JC in written texts. From colonial times, there has been a strict diglossia, in which the language appropriate for writing has been Standard English. JC has a phonemic orthography developed by linguists Frederic G. Cassidy and Robert B. LePage which is used in some scholarly linguistic writing and valiantly promoted by the University of the West Indies-based Jamaican Language Unit but has made little headway among the general public, who prefer an English-based ad hoc system. Full texts written in JC tend to come from the domains of folklore, dialect poetry, or humour. In recent times, the reggae movement, its derivatives (e.g. dancehall) and socio-religious movements such as Rastafarianism, with their emphasis on affirming the African and Afro-Caribbean folk heritage of Jamaica, have added to this repertoire.
7. Is English We Speaking has thus come to serve as the appropriately emblematic title of a collection of essays on Caribbean literature and culture by Mervyn Morris (1999) and is echoed in two linguistic studies: Youssef (2004) and Deuber (2009).
In the written texts of ICE-JA, borrowings from JC are therefore very rare, and usually identified as such, for example by the use of quotation marks, as in the following instance:
(1) They just cannot afford to go to University and so their education ceases at the high school sixth form, community college level. They often times do not have the financial ‘backative’ needed to secure a Student Loan. (ICE-JA, W1a, student writing)
The writer consciously chooses a JC term meaning ‘support’ in order to add local colour to the text and draws attention to this device by using the quotation marks. Given that we are dealing with student writing, the quotation marks may additionally be used to prevent a teacher correcting the use of backative as a mistake. While there are a number of plausible rhetorical or stylistic motivations for the occasional use of JC vocabulary in JamEng written texts, contact features or JC borrowing at the morphosyntactic level would usually be considered erroneous:
(2) Secondly the government provides a Student Loan Bureau which lends qualified students the tuition fee to pay the university and this money is paid back after the student have recieve|receive their tertiary level education and can pay back the loan. (ICE-JA, W1a, student writing)
There is self-correction of the spelling error in receive (as indicated by |), but the absence of the plural in student (or of appropriate singular marking on have) and of the regular past-participle ending -ed in received would be noted and disapproved of by expert writers. This is different in spoken Jamaican English, where the occasional use of such contact features is common even among the educated speakers sampled for ICE-JA, not only as the result of occasional code-switching into JC but also in informal upper-mesolectal JamEng. Compare the following three typical specimens:
(3) Don’t get panicky about it don’t get vex when it’s our turn to be scrutinised but just deal with it you know (ICE-JA, S1a, telephone conv.)
(4) No no but they’re not around but what you find is that the persons who are teaching JAMALs [Jamaican Movement for the Advancement of Literacy teaching modules] are person like me who no know nutten but are scared of word ... (ICE-JA, S1a, face-to-face conv.)
(5) Worst if you value the person friendship and you think the person is somebody you want to keep in touch with there’s no way you’re going to I mean let that candle [go] out – going to always try to keep the candle burning (ICE-JA, S1a, face-to-face conv.)
Example (3) is similar to the written example (1). One JC word appears in an otherwise JamEng sentence. However, the sound files show that there are no intonational distancing signals marking off vex which could be compared to the ‘scare’ quotes found around backative. The JC synonym vex for angry is used to make the point more emphatically and functions as a conversational cue establishing a relationship of ethnic solidarity between speaker and listener. Examples (4) and (5) show variability in inflectional marking which is essentially of the same type as that found in the written example (2). The evaluation, however, is entirely different. What in the written text is perceived as an error becomes a useful stylistic signal of informality in speech.
The presence of JC within spoken JamEng, however, is not confined to such moderate contact influence as illustrated in examples (3) to (5). In fact if there is one lesson ICE-JA teaches the analyst it is that at the level of spontaneous face-to-face interaction JC is an inextricable part of the linguistic resources of the educated Jamaican speakers sampled for the corpus. There is no diglossic situation, as in writing, but the continuum rules: given the appropriate context of situation, a speaker of JamEng may find it convenient to reach into the lower mesolectal range, as is illustrated in example (6). Example (7), finally, combines the phenomena illustrated in (5) and (6), in that we have a baseline style which can be defined as informal JamEng (note the absence of inflectional marking of the 3rd person singular in the present tense in she have ...) within which we find switching to JC (marked by preverbal negation and tense and aspect marking, as in me no know how that a go work).
(6) she naa go school tomorrow with her hair look so <#> She feel she must be the hottest thing so mummy did haffi carry her go hairdresser (ICE-JA, S1a, face-to-face conv.)
[she’s not going to school tomorrow with her hair looking like this – she feels she must be the hottest thing so mummy had to take her to the hairdresser’s]
(7) A: Uh she says she wants to be a model and a lawyer <#> Oh me no know how that a go work [... Oh, I don’t know how that’s going to work]
B: Mhm Hope she has found a way to model skinny
A: In fact she she have more body than me but she have a nice little shape going <#> But I just miss her you know (ICE-JA, S1a, face-to-face conv.)
Considering what we know about the informality and pseudo-orality of computer-mediated communication, we would expect cyber-Jamaican to be more like speech than writing. And indeed all the phenomena illustrated in examples (3) to (7) from the spoken texts of ICE-JA were attested copiously in exploratory searches; in addition, further contact features were observed which are unusual even in face-to-face conversation. This was the reason for the compilation of CCJ, the Corpus of Cyber-Jamaican English/Jamaican Creole.
Obtaining data from the web in the quantities envisaged here (> 15 million words) presents a modest technical challenge but is beset with a number of ethical issues involving copyright and privacy. This is why – unlike other corpora compiled at Freiburg under the direction of the present author – public access to the material will remain restricted for the time being.8
At the start of the CCJ project in 2008, the major issue that needed to be resolved was to identify suitable donor sites on the web. As any Google search for terms such as riddim (‘rhythm’, particularly reggae and related styles), bashment (‘party’, ‘celebration’), oonoo (‘you’, plural), criss (‘crisp’, general-purpose term of approval), inna yaad (‘at home’), mek I tell (‘let me tell’), im a go (‘he is going (to)’) and so forth will testify, JamEng/JC long ago ceased to be a marginal or locally restricted variety of English; today it has a quantitatively impressive presence on the Web (which is, incidentally, not restricted to English-language sites).9 However, the majority of these sites are single-issue ventures, usually devoted to reggae music, inter-racial dating, or similarly restricted activities. After discarding these, a number of sites remained which featured web-based discussions which met the following criteria:
a. the discussions covered a broad range of topics;
b. participants were numerous and of diverse backgrounds;
c. participants mainly originated from Jamaica or the Jamaican diaspora in Canada, Great Britain and the United States.
As one site, http://www.jamaicans.com, exceeded all others in size and quality, it was decided to make it the focus of the investigations rather than mix several sources in the compilation of CCJ. It patently met the first two criteria mentioned above; as for the third, 1,318 from among a total of 2,141 users disclosed their place of residence in their user profiles, and selective cross checks with the contents of the posts did not give rise to undue suspicion regarding the reliability of these self-reports.
8. Arguably, the fact that posters contributing to web-based discussion forums operate under pseudonyms and should in principle be aware that what they say is in the public domain makes it legitimate to use what they produce as the data for linguistic analysis, without infringing on informants’ privacy. On the other hand, informants clearly have not given their explicit consent, and in fact most of them would consider the presence of the non-participating researcher as an instance of unwanted ‘lurking’. The fact that the material needs to be downloaded for preservation and analysis raises additional issues of copyright. For a survey of positions in this ongoing debate, see Crystal (2006: 200–201) or Mann & Stewart (2002: 39–64).
9. See, for example, sites such as www.reggae.fr or www.raggamafia.at.
As the history of the use of JamEng/JC on the web is short, the plan was for the corpus to be stratified by year, so as to make possible the real-time study of the emergence of writing norms in computer-mediated communication for this variety (for a related study see Deuber & Hinrichs 2007). From this point of view, as well, it seemed more appropriate to document conventions of one representative web-based community of practice rather than mix several. In order to make the material amenable to analysis by conventional corpus-analytical software and to preserve the data for possible replication and testing after they were removed from the original site, large amounts of material were automatically downloaded and manually post-edited in 2008. Table 1 surveys the material thus obtained (and originally produced by a total of 2,140 different contributors10 to forum discussions) in purely quantitative terms.
Table 1. CCJ – corpus size by year
Year     Number of words
2000                 354
2001             104,077
2002             697,184
2003           1,808,513
2004           1,683,851
2005           1,477,049
2006           2,127,915
2007           4,878,145
2008           3,833,655
Total         16,610,743
To gain a first idea of the quality of the material contained in CCJ a number of high-frequency morphosyntactic variables were looked at, such as the realisational variants of the going to-future (cf. Table 2). This list is not exhaustive but certainly captures the most important variants of the variable. All those forms which would be expected to turn up in comparable web-based discussion forums in other regions of the English-speaking world were subsumed under the category ‘acrolect’. The basilectal variants conform to traditional JC grammar, whereas the mesolectal ones are those which have arisen as compromises between JamEng and JC norms in the continuum situation. Note that the mesolectal forms are reductions or simplifications when compared to the corresponding standard English ones, whereas the basilectal ones have explicit grammatical marking absent from standard English. Psychologically, omission of marking is less salient than the use of marking which is not part of English grammar and consequently carries less stigma in sociolinguistic terms. Variants not listed in Table 2 include those which are very rare (e.g. me is going – with just one attestation) or those, such as me/mi go, for which the vast majority does not have future reference (but, in this case, past). Even this very first count highlights what will turn out to be a recurrent feature – the over-representation of basilectal variants in comparison to spoken data from ICE-JA.
10. This is based on login information which was cross-checked with information in the posts. There are the expected occasional irregularities, such as the same individual shifting to a new pseudonym, but such cases are very rare. More importantly for the analysis, there are drastic differences in activity, with core members contributing thousands of posts and marginal ones only a handful. Another factor which needs to be taken into account is that some forum members reveal themselves to be in contact offline. For these, a real-life dimension is added to their computer-mediated interaction, which will need to be taken into account in the detailed analyses.
Table 2. Realisational variants of the going to-future in CCJ (1st person sg., affirmative)
Variant                 Hits
I am going to            891
I am gonna               109
I’m going to             746
I’m gonna                366
subtotal acrolect      2,112
I going (to)              65*
I gwine                   67
me/mi going              168
subtotal mesolect        300
me/mi gwine              327
me/mi a go               405
subtotal basilect        732
* The figures in Table 2 present raw concordance output as the searches usually ensure a satisfactory degree of both precision and recall. In a preliminary study, for example, two samples of 100 attestations of am going to were inspected and found to contain only 10 and 8 instances respectively in which go was used as a motion verb. The figure for I going (to), however, is an exception, as 71 instances of am I going to had to be subtracted as obvious mishits.
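The kind of variant search summarised in Table 2 can be sketched in a few lines of Python. The chapter does not publish its search strings or name its concordancing software, so the regular expressions, the `PATTERNS` and `BANDS` tables, and the function names below are all illustrative assumptions. The sketch shows one way of excluding the interrogative am I going to at search time instead of subtracting it afterwards.

```python
import re
from collections import Counter

# Hypothetical search patterns for the variants in Table 2. The chapter does
# not give its actual search strings, so these are assumptions.
PATTERNS = {
    "I am going to": r"\bI am going to\b",
    "I am gonna": r"\bI am gonna\b",
    "I'm going to": r"\bI['’]m going to\b",
    "I'm gonna": r"\bI['’]m gonna\b",
    # The negative lookbehind skips the interrogative "am I going to",
    # the mishit mentioned in the note to Table 2.
    "I going (to)": r"(?<!\bam )\bI going\b",
    "I gwine": r"\bI gwine\b",
    "me/mi going": r"\bm[ei] going\b",
    "me/mi gwine": r"\bm[ei] gwine\b",
    "me/mi a go": r"\bm[ei] a go\b",
}

# Assignment of variants to lectal bands, following Table 2.
BANDS = {
    "I am going to": "acrolect",
    "I am gonna": "acrolect",
    "I'm going to": "acrolect",
    "I'm gonna": "acrolect",
    "I going (to)": "mesolect",
    "I gwine": "mesolect",
    "me/mi going": "mesolect",
    "me/mi gwine": "basilect",
    "me/mi a go": "basilect",
}


def count_variants(text):
    """Count case-insensitive matches of every variant pattern in text."""
    counts = Counter()
    for variant, pattern in PATTERNS.items():
        counts[variant] = len(re.findall(pattern, text, flags=re.IGNORECASE))
    return counts


def band_totals(counts):
    """Sum variant counts into acrolect, mesolect and basilect totals."""
    totals = Counter()
    for variant, n in counts.items():
        totals[BANDS[variant]] += n
    return totals


# Invented sample text, used only to exercise the patterns.
sample = (
    "I am going to call her. Am I going to regret it? "
    "I going to town later. Mi a go tell dem. "
    "mi gwine sleep now. I'm gonna try."
)
print(band_totals(count_variants(sample)))
```

Raw counts of this kind would still need the manual inspection described in the note, for instance to separate future reference from go used as a motion verb.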
With ICE text category S1A (90 samples of face-to-face and 10 samples of telephone conversation) amounting to ca. 200,000 words of text, there are only 47 relevant instances for comparison, but the distribution is clear nevertheless: the vast majority, namely 40, fall into the acrolectal band of Table 2, a further 6 into the mesolectal one, and only one into the basilect. The same stylistic spread is revealed when other variables are chosen as diagnostics. For example, a search for realisations of the progressive reveals a very similar picture, as appears from Table 3. Unlike Table 2, this list requires extensive post-editing and is still incomplete in minor ways, most importantly because the search strategy adopted here misses all instances with adverbial phrases intervening between the pronoun or auxiliary and the present participle – a problem which chiefly distorts the results for JamEng. On the other hand, the search under-collects for JC because it disregards all those instances among the 371 instances of im a [verb], which, in accordance with basilectal JC grammar,11 have indeterminate or female gender reference. The corresponding distribution for the conversational part of ICE-JA is as follows: from a total of 37 comparable instances, 18 fall into the acrolect, 17 into the mesolect and only 2 into the basilect. As in the case of the going-to future (Table 2), the basilect is thus obviously over-represented in computer-mediated communication when compared to face-to-face interaction.
11. There is no gender-contrast in the third-person singular pronoun at the level of the basilect.
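The comparison between CCJ and ICE-JA is easier to follow when the raw band totals are converted into percentage shares. The sketch below does this for both variables, using the subtotals of Table 2, the manually post-edited column of Table 3, and the ICE-JA S1A counts quoted in the text. The dictionary layout and the function name `band_shares` are assumptions made for illustration; no new data are introduced.

```python
# Band totals for the going to-future: CCJ from Table 2, ICE-JA S1A from the text.
GOING_TO = {
    "CCJ": {"acrolect": 2112, "mesolect": 300, "basilect": 732},
    "ICE-JA S1A": {"acrolect": 40, "mesolect": 6, "basilect": 1},
}

# Band totals for the progressive: CCJ from the manually post-edited column
# of Table 3, ICE-JA S1A from the text.
PROGRESSIVE = {
    "CCJ": {"acrolect": 1070, "mesolect": 262, "basilect": 596},
    "ICE-JA S1A": {"acrolect": 18, "mesolect": 17, "basilect": 2},
}


def band_shares(counts):
    """Return each band's share of all instances, in per cent."""
    total = sum(counts.values())
    return {band: 100 * n / total for band, n in counts.items()}


for label, table in (("going to-future", GOING_TO), ("progressive", PROGRESSIVE)):
    for corpus, counts in table.items():
        shares = band_shares(counts)
        summary = ", ".join(f"{band} {share:.1f}%" for band, share in shares.items())
        print(f"{label:16} {corpus:11} {summary}")
```

On these figures the basilect accounts for roughly 23 per cent of going to-future instances in CCJ against roughly 2 per cent in ICE-JA, and for roughly 31 per cent of progressives against roughly 5 per cent. The ICE-JA percentages rest on only 47 and 37 instances respectively and should be read with that in mind.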
Table 3. Realisational variants of the progressive in CCJ (3rd person sg. fem., affirmative)
Variant                      All hits   Manually post-edited conc.
she is *in(g)                     793                          664
she’s *in(g)                      500                          383
shes *in(g)                        21                           18
she s *in(g)                       13                            5
Subtotal acrolect               1,327                        1,070
she *in(g) (= mesolect)           431                          262
she a [verb] (= basilect)         880                          596
For a final and even more drastic illustration of this general trend, consider the use of fi, a JC grammatical marker often corresponding to English infinitival to, but additionally expressing purposive and related modality. Its extremely basilectal status is reflected in the fact that the 200,000 words of spontaneous conversation in ICE-JA contain only 9 instances (a normalised frequency of ca. 45 per million words). In CCJ, by contrast, it occurs 34,071 times, which corresponds to a rate of 2,051 per million words.
Clearly, this degree of mismatch in the use of basilectal features between face-to-face and computer-mediated interaction needs an explanation. The obvious explanation, that participants contributing to the forum discussions are drawn from a lower social class of speakers, with a correspondingly lower average level of education, clearly does not hold. Most posters reveal themselves to be professionals with a middle-class outlook and income and a corresponding command of English – exactly the profile, in other words, which characterises contributors to the spoken portions of ICE-JA. The fact to be accounted for is that the same type of speaker is apparently more ready to draw on JC when using the computer keyboard than when speaking in face-to-face interaction.
To understand the causes of this phenomenon in detail, it will be necessary to focus on the usage of the roughly 120 key participants who have each contributed more than 1,000 posts to the material, as individual profiles and preferences differ considerably. One generalisation which holds for almost all of them, though, is that there is a correlation between the density of JC features and the topic of a particular thread. No topic is completely off limits for JC. For example, while most contributions on the topic of the International Monetary Fund (IMF) and its policies’ impact on Jamaica are formulated in JamEng, it is not difficult to find occasional counter-examples:
(8) (borrowing)? Well di IMF never waan fi len dem afta di crash [...] (CCJ 2006)
[borrowing? Well, the IMF did not want to lend them after the crash]
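The normalised frequencies reported above for fi follow from dividing the raw hit count by the corpus size and scaling to one million words. The short sketch below reproduces the arithmetic from the figures given in the text and in Table 1. The function name `per_million` is an assumption, and the ICE-JA word count is the approximate "ca. 200,000" stated in the text, not an exact figure.

```python
def per_million(hits, corpus_words):
    """Normalise a raw hit count to occurrences per million words."""
    return hits * 1_000_000 / corpus_words


ICE_JA_S1A_WORDS = 200_000   # approximate ("ca. 200,000 words" in the text)
CCJ_WORDS = 16_610_743       # total from Table 1

ice_rate = per_million(9, ICE_JA_S1A_WORDS)
ccj_rate = per_million(34_071, CCJ_WORDS)

print(f"ICE-JA S1A: {ice_rate:.0f} per million words")
print(f"CCJ:        {ccj_rate:.0f} per million words")
print(f"ratio:      {ccj_rate / ice_rate:.1f}")
```

On these figures fi is roughly 45 times more frequent per million words in CCJ than in the conversational part of ICE-JA.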
However, as is to be expected, yaad (‘yard’, or ‘home’) topics encourage shifts into JC, and so do emotional topics, such as love and romance. For example, on the topic of jealousy more than half the contributions show strong JC influence, as is revealed by a lexical search for jealous. Example (9) is dominantly in standard English, with one instance of a copula-less progressive (now she spreading) making a minimal concession to the standard-creole continuum. The short passage in example (10), on the other hand, is maximally packed with non-standard morphosyntactic features:
(9) I have tried helping her with this new idiot that she met, now she has jumped on the phone telling my aunt in California that she is getting married to the scum “just met him not even 5 months ago” On top of that now she spreading lies that I must be jealous and some other crap. (CCJ 2008)
(10) mi jealous...mi no have no phone fi nobody fi call mi (CCJ 2004)
[I’m jealous ... I don’t have a phone for anybody to call me]
Searching for the non-standard spelling jellus, predictably, raises the likelihood of finding other JC features to almost 100 per cent:
(11) Compry she stop by, so de whole a we a sing har Happy Birthday, dat a when she tell we say har fren dem a work give har a Microwave she seem happy enuff bout it but it look like Coolbeans did a get jellus caaa Compry a gwaan over de flowers (CCJ 2004)
[Compry, she stopped by, so all of us are singing her Happy Birthday, that’s when she told us that her friends from work gave her a Microwave; she seemed happy enough about it, but it looked like Coolbeans was getting jealous because Compry was going on about the flowers]
The following two brief sections will focus on issues raised by the data which can be studied before the individual speaker profiles have been compiled and analysed (work now in progress). Section 4, on ‘anti-formality’, explores the chief factor which is responsible for lowering the inhibition to use JC in computer-mediated communication in comparison to face-to-face interaction. Section 5, on lexical borrowings from African languages, is more exploratory in nature and deals with web-forums as a potential site for language contact unlikely to be encountered in the real world. It is intended to show how a sociolinguistics of computer-mediated communication can be developed into a sociolinguistics of globalisation.
4. Anti-formality
As has been pointed out above, CCJ shows speakers who are generally very proficient in English incorporating considerable amounts of JC features into their texts – much more than would be occasioned by normal ‘conversational’ informality. They thus present very clear instances of what lexicographer Richard Allsopp has defined as the
“anti-formal” tendency in Caribbean English usage. He defines the three stylistic levels12 he distinguishes as follows:
Formal: “Accepted as educated; belonging or assignable to IAE [internationally acceptable English]; also any regionalism which is not replaceable by any other designation. No personal familiarity is shown when such items are used.”
Informal: “Accepted as familiar; chosen as part of usually well-structured, casual, relaxed speech, but sometimes characterized by morphological and syntactic reductions of English structure and by other remainder features of decreolization. Neither inter-personal tenseness nor intimacy is shown when such items are used, and the speaker is usually capable of switching to the upper level when necessary but more easily to the lower.”
Anti-formal: “Deliberately rejecting Formalness; consciously familiar and intimate; part of a wide range from close and friendly through jocular to coarse and vulgar; any Creolized or Creole form or structure surviving or conveniently borrowed to suit context or situation. When such items are used an absence or a wilful closing of social distance is signalled. Such forms survive profusely in folk-proverbs and sayings, and are widely written with conjectural spellings in attempts at realistic representations of folk-speech in Caribbean literature.” (Allsopp 1996: lvi-lvii)
Apparently, it is not just informality but a wilful and conscious “closing of social distance” which is encouraged in web-based communication. What in Allsopp’s database was mostly confined to folk-speech, proverbs and literary representations thereof has apparently become general practice on the web. In conversation, anti-formality is not without risks. As a conscious closing of social distance it is by definition face-threatening and may be perceived as rude and aggressive.
In the pseudo-conversational environment provided by the web forum, this risk is curtailed, and participants engage in anti-formal linguistic behaviour in a spirit of playfulness. This playfulness is apparent even at the spelling level, where legibility often takes a back seat to expressiveness: maddaratar, maddarator, maddarater, maddrayta and madrater are just a selection of the variants used to refer to the moderator of the list. Laddamassi is used to write ‘Lord have mercy’. The playfulness also involves the use of local or yaad stereotypes and allusions, as in the following stretch of match-making banter:
(12) [RollinCalf] Well ep. yuh si mi a chat bout now, mi fraid a married like puss cause di firs one was not pleasant.
[Well, EP, you see I’m talking about the present. I’m afraid of marriage like a tomcat cause the first one was not pleasant]
[e_p_11_4_08] Pray tell..wat did happen?
12. A fourth level, “erroneous”, is not relevant to the present discussion.
[RollinCalf] Well afta wi leff yard, shi go a university, get har degree an suddenly was too stush fi mi. Satdeh time...braps,13 beef soup done, seh shi neva go a college fi bwile no yam. One mawnin mi a hat up little mackrel an some pepper an shi tell mi seh mi a tink up di house. Nex ting mi know, when wi have party, some breed a people weh no look like mi or come from weh mi come fram pack up mi house. Anyway mek mi tap yaw cause mi blood a start fi hat up already. [Well after we left home, she went to university, got her degree and suddenly was too refined for me. Saturday time, and all of a sudden the beef soup is done and she says she didn’t go to college to cook yams. One morning I’m heating up a little mackerel and some peppers and she tells me that I’m stinking up the house. Next thing I know, when we had a party, a kind of people that didn’t look like me or came from where I came from fill my house. Anyway, let me stop here cause my blood is starting to heat up already] [...] Well it was a looong time before mi did check fi certain woman. Don’t get me wrong, mi can speaky spokey wid di bess a dem wen mi ready but mi neva grow soh. [Well it was a long time before I understood the woman for sure. Don’t get me wrong: I can speak posh with the best of them when I’m ready to, but I didn’t feel like it] [e_p_11_4_08] JAH Know rolling calf yuh soun like wan taxi man weh did a tell mi him bout fi him situation out of the blue sky o/basically di sed ting..him seh him file fi him wife an sen har guh college fi 6 years an afta she graduate she was a different ooman all together...wats up with this trend? Not to say that all women from Jamaica are like that..but I have heard a few stories... [You know, Rolling Calf, you sound like this taxi driver that was telling me about his situation out of the blue sky. 
Basically the same thing: he said he applied for papers for his wife and sent her to college for 6 years and after she graduated she was a different woman altogether. What’s happening with this trend? ...] (CCJ 2004)
There is some uncertainty about the status of parts of the text. What did happen? (second turn) looks like an instance of Standard English emphatic do but is probably better regarded as mesolectal JC, with invariable did replacing the basilectal anteriority marker ben/(w)en, with the passage thus simply translating as ‘what happened?’. The one unusual expression is the literary archaism pray tell, also used in the second turn. Combining 18th-century literary English and mesolectal 21st-century JC in one utterance challenges RollinCalf to live up to expectations in his response, which he does by styling himself as the bwoy from yaad, increasingly alienated from his wife, who is educated and socially ambitious, or to put it in JC terms: stush (or to give the more usual vernacular spelling: stoosh). Yaad is contrasted with the university; modern cuisine with boiling yams, and so on. RollinCalf even draws on the folk-linguistic concept of speaky-spoky, which refers to a speaker’s conscious effort to use Standard English, with the implication that the target is not necessarily reached.14 Note also that the conclusion to the story formulates a moral in informal JamEng rather than JC, which shows that for the contributors to the forum the baseline style is English, and JC is a resource which is mobilised for a conscious activity which combines serious debate and play in about equal measure. This observation is a helpful reminder that the greater quantitative presence of JC in web-based material does not automatically imply a status upgrade in sociolinguistic terms.
13. I take this to be a misspelling for baps/bragadaps, ‘suddenly/abruptly’.
A last important question that we must address is whether material derived from the web can be legitimate and authentic data for a sociolinguistic analysis. This question presents itself in a particularly pressing fashion whenever the spirit of anti-formal linguistic playfulness results in forms which are highly implausible by the standards of the JamEng-JC continuum that governs face-to-face interaction. While the examples quoted so far were remarkable because of an extreme degree of style-shifting of the kind familiar in real-life encounters, the following post violates the constraints of the JamEng-JC grammatical continuum because it seems to ‘pack’ acrolectal and basilectal features into a mix rather than rapidly shift from one level to the other. It is from Bizi_Q – at 5,512 posts one of the most active contributors – and starts off a thread on gem-cleaning:
(13) [Bizi_Q] so what mi can use fi clean gems, stones, diamonds, etc?? mi have de silver/gold cleaning solution but it nuh seem fi do nutten fi de stone dem.. sometimes it look wussa dan when it went in. mi read up pon de net an some sites say use dish washing liquid..others say that’s a big no no.
what unu use clean unu stones? would it be better fimi carry it go a jeweler?? how much dat would cost?
[so what can I use to clean gems, stones, diamonds, etc.? I have the silver/gold cleaning solution but it does not seem to do anything for the stones ... sometimes it looks worse than when it went in. I read up on the net and some sites say ‘use dish washing liquid’ ... others say that’s a big no no. What do you use to clean your stones? Would it be better for me to take it to the jeweller’s? How much would that cost?] (CCJ 2008)
Among other features, this post is characterised by the high frequency of the JC grammatical marker fi, which in many cases corresponds to infinitival to in English translations. In Bizi_Q’s post, fi occurs together with several other basilectal JC features: the 2nd-person plural unu, preverbal negation with no, and the optional JC plural marker dem (e.g. in de stone-dem). This pattern of co-occurrence is expected, as all these features index the same stylistic range on the creole-standard continuum. However, the use of fi in phrases with inflected plurals, as in fi clean gems, stones, diamonds, is not. Fi indexes a basilectal style (or low social status of the speaker), while the plural inflection is in regular use only in acrolectal style (or with educated speakers). Similarly, unu stones combines a basilectal possessive pronoun with an acrolectal nominal head, which is as incongruous at first sight as the combination of an ‘English’ main clause (would it be better) with a dependent clause that contains a JC serial-verb construction (carry it go). Judged by the norms of face-to-face interaction, such usage is incoherent. By the standards of classical sociolinguistics, the data are unsuitable for analysis because they are inauthentic.
14. The speaky-spoky style is often characterised by hypercorrections such as the over-use of prestige pronunciations associated with stoosh [or stush?] or posh talk: filther for filter, gloss for glass, or hassist for assist. For further information see Patrick (1999: 277–278).
By contrast, I would argue that such data are not just idiosyncratic oddities but that dealing with them is a high priority in a sociolinguistics for the 21st century. JC is no longer the ‘local’ language that it used to be in earlier stages of its development, firmly rooted in its community of speakers and largely confined to use in face-to-face interaction. Today, JC is among the vernaculars which are regularly heard on the media, have gone on the move and become globally available linguistic resources. One thing which CCJ shows beyond doubt is that JC on the web is used by real people in authentic communication. The specific mix of JamEng and JC in CCJ is therefore in no way less real than the different one spoken on the island. As Nikolas Coupland puts it:
At least implicitly, sociolinguistics has made strong assumptions about authentic speech and the authentic status of (some) speakers. Sociolinguistics has often assumed it is dealing with ‘real language.’ [...] But ‘real language’ is an increasingly uncertain notion. In late-modern social arrangements and in performance frames for talk, do we have to give up on authenticity? (Coupland 2007: 179)
Coupland’s question is rhetorical, and his answer is in the negative. I agree. Bizi_Q is styling or performing her language, but the game works because she can mobilise real resources shared by herself and her readers. Put briefly, the effect of ‘artificially’ or ‘unexpectedly’ packing a simple request for help on a practical matter with JC morpho-syntactic features is to make a somewhat businesslike request for information part of an exercise in recreational community-building among JC speakers in the diaspora. On a purely utilitarian level, this is useful, because it may increase the quantity and quality of the responses elicited; on a more general level, it is part of the strategic way in which Bizi_Q draws on her and her community’s linguistic resources to create her persona in the forum. CCJ contains:
a. passages which read as if spontaneously produced spoken JC had been transferred to the screen,
b. passages such as the one discussed, in which a presumably unconscious element of stylization is evident, and
c. passages which are consciously crafted with rhetorical skill.
None of them is more authentic data than the others, but a framework is required to make explicit how they are authentic in different ways. The concept of indexicality, as proposed by Silverstein (2003) and refined by Eckert (2008), is a good starting point. The traditional view in sociolinguistics is that variation reflects speakers’ membership in pre-established social categories and that an individual speaker’s agency is confined to increasing the proportion of standard-like variants of a given variable in more formal (hence more monitored) styles. Drawing attention to the obvious limitations of such a view, Eckert (2008: 453) argues that “meanings of variables are not precise or fixed but rather constitute a field of potential meanings – an indexical field, or constellation of ideologically related meanings, any one of which can be activated in the situated use of the variable”. Seeing things in this way, we can reconcile the apparent paradox posed by the CCJ data. We can both accept that in broad terms use of JC in computer-mediated communication is derived from (and depends on) the conventions that have evolved in the sociolinguistic continuum that governs face-to-face interaction and still accommodate the very different ways in which individual variables are deployed in the new medium. Some CCJ examples do not differ at all from face-to-face data in their indexical potential. Consider, for instance, example (14), a part of passage (9) discussed above:
(14) On top of that now she spreading lies that I must be jealous and some other crap. (CCJ 2008)
According to Silverstein (2003: 227), first-order indexicality indexes the pragmatic properties of a particular communicative act, reproducing social macro-categories in language. Potentially, however, every linguistic variant used in this way can become indexical at the second-order level and index speakers’ or writers’ metapragmatic evaluation of the situation.
What we have in (14) is an informal exchange between equals. This is unproblematically indexed by the informal variant of the progressive (she spreading) on the level of first-order (pragmatic) indexicality. On the level of second-order (metapragmatic) indexicality, however, the fact that the informal variant of the progressive adopted here is a typically mesolectal JC one (and not for example an internationally current non-standard one such as she’s spreadn) becomes significant. Higher-order indexicality, that is communicative effects of a more complex nature and possibly unique to the specific thread, are not in evidence here. This is different in (15) and (16), which contain highly unusual spellings involving the sequence <kkk>, which turn out to be a part of a rather complex strategy of online linguistic identity management:
(15) bway.... but it’s important to dress 3 y/olds like this? what did blu seh but blakkk ooman slut culture? (CCJ 2008)
[Boy ... but it’s important to dress three-year olds like this? What did Blu say about the slut culture of black women?]
(16) only in amerikkka does this happen, only in amerikkka does a campaign go one for so two or more years wasting good money that could help ppl. Americans are fools if dem look pan a man ar ooman spouse an base dem decisions of dat person pan a nex smaddy. MORE FIYAH fi dem type a tinkin deh. An mi love Obama wife still (CCJ 2008)
[... Americans are fools if they look at a man’s or a woman’s spouse and base their decisions on that person on someone else. More shame (= fire) on that type of thinking. And I still love Obama’s wife]
Here, the signal is purely visual. Neither blakkk nor Amerikkka indexes Jamaicanness at any level. Even so, such experimental and sensational spellings convey a message in their context. In general terms, they are a liberally used device for emphasis and attention-getting, and three k-s is not the upper limit in this regard. Among the words in the material featuring six or more contiguous k-s are OK, wicked, bruck (= JC ‘break’), back, and several others. Nor is the custom confined to <k>, as is shown by spellings such as eeeediotttttttt (‘idiot’) or ciiiigaretttttttttes. More is at stake, however, than merely drawing attention to key terms and sensitive topics such as blackness or America. Recall that KKK is a widely known abbreviation for the Ku-Klux-Klan, the racist secret society founded after the Confederacy’s defeat in the Civil War, and that a spelling such as AmeriKKKa has long had iconic status among radical minorities opting to disaffiliate themselves from the American way of life.15 As such, it is available for further creative development, and various web-based glossaries claim that the spelling blakkk may be used to denote individual blacks who are worse for the community than the Klan.16 Although the few attestations of blakkk in the CCJ data have a negative connotation (as certainly does the “blakkk ooman slut culture” of example [15]), they do not corroborate this specific, narrow meaning.
A fairly complex process of inferencing allows us to conclude that, for blakkk and Amerikkka, <kkk> indexes (1) writers’ high degree of emotional involvement when mentioning key concepts in their discourse, (2) a consciously politicised radical stance, and (3) affiliation with the ‘hip-hop nation’ and pop culture, which are key agents propagating sensationalist non-standard spellings in the contemporary US media.17 Just as the linguistic indexes of African-American political radicalism and pop-cultural chic percolate into JamEng/JC, the diasporic webforum can be a conduit through which JamEng/JC linguistic material diffuses internationally. This brings us to the final section of the present paper, which will explore the Web as a novel site for contact between languages and varieties.
15. The first OED attestations are from 1969 (s.v. Amerika).
16. One example is the Urban Dictionary, at http://www.urbandictionary.com/define.php?term=blakkk (consulted on 16 March 2010), which defines the word as “an African American whose actions are more damaging to a black community than the KKK” and illustrates the use with the following unsourced citation: “We dot [sic!] rid of KKK night riders, and got blakkk drivebys instead”.
17. This can be proved easily even in a perfunctory web search for ‘boyzz’, ‘niggazzz’, ‘greatest hitz’, ‘bizkit’, etc., for instance. Note that in the last two examples we are not even dealing with phonetic respellings as the /s/ in hits and biscuit is clearly voiceless.
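Expressive letter repetition of the kind discussed in this section (blakkk, amerikkka, eeeediotttttttt) lends itself to a simple pattern search. The following Python sketch is offered only as an illustration of how such forms might be retrieved from forum text; the function name elongated_tokens and the default threshold of three identical letters are assumptions made here, and the sketch does not reproduce the search procedure actually used on CCJ.

```python
import re
from typing import List


def elongated_tokens(text: str, min_run: int = 3) -> List[str]:
    """Return the words containing a run of min_run or more identical letters."""
    if min_run < 2:
        raise ValueError("min_run must be at least 2")
    # One letter followed by at least (min_run - 1) repetitions of itself;
    # IGNORECASE lets mixed-case spellings such as AmeriKKKa match as well.
    run = re.compile(r"([a-z])\1{%d,}" % (min_run - 1), re.IGNORECASE)
    words = re.findall(r"[A-Za-z]+", text)
    return [word for word in words if run.search(word)]


# Snippets taken from examples (15) and (16) and from the running text.
sample = "only in amerikkka does this happen blakkk ooman eeeediotttttttt"

print(elongated_tokens(sample))             # runs of three or more letters
print(elongated_tokens(sample, min_run=6))  # only the most emphatic spellings
```

Raising min_run to six corresponds to the stricter criterion mentioned above for words with six or more contiguous k-s, although the sketch applies it to any letter.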
5. The globalisation of vernacular features: A ‘Black Atlantic’ on the web?
A great deal of recent research in sociolinguistics has focussed on a phenomenon which has been referred to as the “globalisation of vernacular features” (Meyerhoff & Niedzielski 2003). It has been noted, for example, that the new quotative be like, which was first recorded in the United States in the 1970s and 1980s, has spread extremely rapidly into many other varieties of English all over the world (including, on the basis of ICE evidence, Irish, Canadian and Jamaican English).18 This almost instantaneous global spread of linguistic innovations raises the question whether traditional processes of diffusion via face-to-face interaction are sufficient to account for the speed of the developments observed or whether additional media influence needs to be invoked to account for them. In addition, one must be careful not to regard the use of ‘American’ quotatives as an instance of straightforward Americanisation of the other varieties affected. People using American quotatives rarely do so in conscious imitation of American language norms; rather they use a linguistic resource of American origin to achieve a communicative effect in a different ‘local’ linguistic environment. In the Jamaican case, for instance, the new quotative be like is preferably used by (young) women, and up to this point JamEng is indeed like American English (cf. Höhn forthcoming). However, while in American English it is in contrast with the standard quotation-introducing verb say and other non-standard forms such as go or be all, in JamEng it is the third option alongside Standard English say and JC mi/yu/im etc. seh. This three-way choice between a formal variant, a local informal variant and an imported informal variant is specific to the sociolinguistic situation of present-day Jamaica and thus gives an internationally available linguistic form its specifically local functional value.
18. There is no doubt that the new quotative is widely used in contemporary British English, as well (see, e.g., Buchstaller 2006a, 2006b). However, ICE-GB was compiled too early to record this. What is intriguing is the absence of quotative be like from the ‘true’ second-language ICE corpora, such as ICE-India or ICE-Singapore, which raises the issue of whether this is accidental (due, for example, to sampling dates) or systematic.
An interesting phenomenon to study from a language-and-globalisation perspective is the use of loanwords from African languages in CCJ. Since the end of the slave trade there has been little opportunity for such loans to spread into JC through face-to-face interaction. There is, of course, opportunity for contact between speakers of JC and Africans in diasporic situations, for example in multilingual metropolises such as London, New York or Toronto. However, the typical ways in which people of African descent in the Americas have learned about Africa in the past 150 years have been through the work of political activists in the anti-colonial and civil-rights movements, and through the work of writers and scholars. Alongside these ‘elite’ links, there have also been popular movements. Many African-American and Caribbean musicians and entertainers have large followings in Africa, and a literal or metaphorical return to Africa is at the heart of several politico-religious movements, such as Rastafarianism. In principle, there is no reason why people from, say, Nigeria or South Africa should not join thematically relevant discussions on a forum such as Jamaicans.com (or, conversely, why Jamaicans with an Afrocentric ideological orientation should not take part in an immensely popular Nigerian forum such as Nairaland.com). In practice, however, to the extent that contributors can be located, such cross-over among forums still seems to be the rare exception. The overwhelming majority of contributors to Jamaicans.com are Jamaican, or people of Jamaican background resident in the US, the UK or Canada. The few who are not are a mixed bag – from the (presumably white) Canadian who would like to learn JC through the woman from Kurdistan currently resident in Austria to the Belgian who spent five years in Jamaica, with very few Africans among them. To put it mildly, the evidence is not sufficient to claim that the ‘Black Atlantic’ cultural region which joined West Africa, Europe and large parts of the Americas from Virginia and Maryland to the North East of Brazil in the 17th and 18th centuries is re-emerging in cyber-space in the 21st. However, the absence of African contributors from forum discussions does not mean that Africa has no impact at all.
The spirit of linguistic adventurousness evident in the posts extends to lexical borrowing from African languages, and in this sense computer-mediated communication can become one of several avenues of lexical innovation in contemporary JC. The use of mzungu and wahala by forum participants provides an illustration. M(u)zungu is a slightly derogatory, originally Kiswahili term for ‘white person’ widely used in Southern Africa, whereas wahala, originally from Hausa, is also a very common word meaning ‘trouble/problem’ in Nigerian Pidgin and hence widely known throughout West Africa. Interestingly, these two words already have some international profile in World English, which is reflected in the fact that they are recorded in the Oxford English Dictionary (OED). In both cases, however, the quotations provided show that they have been circulated largely as exotica, either in writing on Southern Africa or Nigeria or through literary works by authors from the region. Here are the crucial segments of the OED documentation for mzungu:
[1844 J. L. KRAPF Diary 25 Sept. (Birmingham University Library: C.M.S. Archives CA5/016/28) f. 496, Sheikh Ibrahim soon after my arrival dispatched a messenger to the nearest Wanica villages, informing the chiefs of the arrival of a M’soongo (as a European is called in the Sooahelee tongue).] [...]
1961 Transition No. 2. 33/2, I found myself welcomed not in spite of the fact that I was a mzungu, not exactly because of it, but rather with the sense of being welcome anyway but with particular pleasure because I was white.
1975 B. KAGGIA Roots of Freedom vii. 66 We could no longer accept the belief that a mzungu was better than an African.
1992 Harper’s Mag. Jan. 65/1 Njoki.is almost never asked what her family thinks of her being married to a ‘mzungu’ – a white person.
In spite of a history of more than 150 years of attestation, the OED gives the impression that the word has never moved from specialist discourse into general circulation. For example, there is nothing which would make us expect it to turn up in CCJ. The history of wahala is similar, if considerably shorter:
1973 W. SOYINKA Season of Anomy xii. 258 He was going to his house on the reservation and would not step out of it until all the trouble was over. ‘I shall simply lock myself in’ he grinned, ‘stock up on stout and drink the wahala dry.’
1982 B. EMECHETA Double Yoke (1983) x. 93 Look at all the wahala he raised about the university forms.
1986 E. AMADI Estrangement ii. 19 The GOC came with his wahala.
1987 C. ACHEBE Anthills of Savannah x. 137 If I for know na such big oga de for my front for that go-slow how I go come make such wahala for am.
1988 N.Y. Times 21 Feb. VII. 26/2 The taxi driver blames the ‘wahala’ of the traffic incident on the fact that Ikem doesn’t travel around in a chauffeur-driven car that befits his position.
All but the last citation are from works of fiction by West African authors, and even the last quotation from the New York Times is from a book review about such work. As in the case of mzungu, there is nothing in the OED evidence to make us expect that wahala would occur in CCJ, because, after all, neither the human geography of Southern and Eastern Africa nor West African writing in English are central concerns of the contributors to the discussion. However, both words do occur. A search for muzung* and mzung* retrieved the vast majority of instances (= 1,010).19 It is striking to note that the word was not used before 2005, when a trickle of instances came in which developed into a veritable flood from 2007. The first and most persistent user is Blugiant. Blugiant, who joined the forum on 1 June 2003, has contributed a total of 2,121 posts and is thus one of the more prolific authors. His contributions show him to be a social activist who is informed by a black-consciousness ideology. In addition, he deserves attention because of his experimentalism in JC spelling. In 2005 he launches the term mzungu. The context is a discussion of a newspaper article celebrating the fact that black earning power per household in the New York City borough of Queens has outstripped white earning power. For Blugiant, however, this is not a genuine sign of social advancement, but merely represents cooking of the statistical books:
19. Alternative spellings do occur, but seem to be exceedingly rare: e.g. zungu (2 occurrences).
(17) [Blugiant] Quote: “In addition to the larger share of whites who are elderly, said Andrew Hacker, a Queens College political scientist, black Queens families usually need two earners to get to parity with working whites.” soo arjen [= address to another participant] oww manee peeps inn dem caribbean household versus mzungu oousehold. iff itt tekk two ar more caribbean wage earnas wukkinn more owa dan mzungu fii mekk more da mzungu oousehold widd less peeps wat iss da artikkle seyinn bout da caribbean peeps qualitee aff life. [So, Arjen: how many people are there in those Caribbean households versus white households? If it takes two or more Caribbean wage earners working more hours than whites in order to make more than a white household with fewer people, what is the article saying about the Caribbean people’s quality of life? (...)] [Jaded] its saying yu gotta do what yu gotta do... whatz the option..sit back an keep paying rent an gripe that life isn’t fair [Kingston20] Maybe we should a chill pon the corner block a play dice and bawl bout how zungu a howl we dung..no true Blu. Nothing no satisfy some people, atleast the majority of us a try a ting with what we got, nobody seh America was easy the fact that you come to another person’s country and have fi start over from scratch means that your quality of life ago suffer fi a long time till you can access the factors of production, land,labor, capital goods which lead to weath generation. This is basic economics. [Maybe we should relax on the corner block and play dice and moan about how the whites are holding us down, shouldn’t we, Blu? Some people are satisfied by nothing. At least the majority of us is trying something with what we’ve got. Nobody said America was easy. The fact that you come to somebody else’s country and have to start from scratch means that your quality of life is going to suffer for a long time. 
(...)] (CCJ 2006) Blugiant is not challenged for his use of the word, which allows two interpretations. Either it is known by the other forum members, or they work out the meaning from the context. In the end, it is taken up and used by other contributors, although it has to be admitted that Blugiant remains the active promoter until the end of the period of observation in mid-2008, contributing 455 of the 588 posts in which the word occurs. His main achievement is to have made the word well-known and familiar on the forum, which may provide motivation for subsequent use by others and outside the forum. The situation is somewhat more complex for wahala, whose primary use is as the pseudonym of a contributor and term of address used by others. The screen-name or pseudonym adopted by a contributor is a central element of self-stylisation and online identity construction. While Bechar-Israeli (1995) points out that screen names make few references to nationality or ethnic group in general-purpose Internet Relay Chat, Androutsopoulos (2006b: 539) reads “the strong preference for home language screen names as an index of ethnicity” in his study of Turkish, Iranian and Indian diasporic web forums based in Germany. Home language in CCJ is JC, and JC is indeed a direct
Corpora and the new Englishes
or indirect presence in many (though not the majority of) screen names, from the straightforward Rasta Talk of Irie Dawta to the more subtle pun in SailorBuoy [bwaI]. Among the Dr. Dudds, Peppers, Style Divas, ChurchDudes, Championgals, etc., Wahalla’s use of an African moniker is thus clearly a marked and presumably meaningful choice. In one instance, user Wahalla is even challenged as to whether he is aware of the significance of his name:
(18) [queenvenus] by the way do you know what the word wahalla means in pigeon english, which is spoken in some parts of west Africa (CCJ 2007)
To which he replies:
(19) [Wahalla] Actually the only place I have heard the term used is along the coast of Nigeria.. But then the Ibo say its not their word.. The people from Kanu deny it is from them.. Tejh Ijaw deny it ever came from them. Personally I go for the beleif that is a dervation from ishalla.. But edumicate mi.... Cause Wahalla just chose his name by the radomised chance without ever having uttered the word in anger in a sentence???? He never live in Nigeria... Guess its cool to call how the nigerians speak English Pigeon.. Yu si if sumadi call patwa pigeon english dem getta tongue lasjing fram mi in no uncertain terms.. Guess its cool for people from Belgium to call Nigerian English Pigeon English... Why is it that people from England assume that once words and syntax from colonised comumication language pass into English it becomes pigeon English when spoken in the former anglohone countries????? (CCJ 2007)
If we are to trust this information, Wahalla picked up the word during a brief visit to the Nigerian coast, misinterpreting its emotional connotations somewhat as expressing anger rather than sorrow. The interesting thing here is that it is other forum members who enlighten him on its meaning. 
The ensuing metalinguistic discussion is noteworthy because of its factual errors and conceptual flaws, but also because it is one of many signs of the keen interest which participants take in the topic of language variation.
6. Conclusion and outlook
After presenting a brief sketch of the state-of-the-art in the corpus-based research on the New Englishes, the present paper has gone on to explore innovations in two directions. The first new departure concerns the type of data investigated. The World Wide Web has not only become more linguistically heterogeneous over the past few years (cf. Dor 2004; Danet & Herring 2007), but it has also become a repository of substantial amounts of non-standard English, which provided the motivation to compile a web-derived 16-million-word corpus of Jamaican English/Jamaican Creole. When used in computer-mediated communication, this variety shows a number of distinctive properties. Most importantly, diasporic web-based communities of practice of the type investigated here
Christian Mair
use JC forms far more commonly than in traditional writing – an expected result –, but also more frequently than in spoken face-to-face interaction sampling sociologically comparable groups of speakers. The reason for this greater readiness to use a stigmatised variety in the new medium is that the covert prestige associated with it comes for free; that is, users do not run the risk of being categorised as uneducated or lower-class (as they would be in face-to-face encounters) but are able to mobilise the linguistic resources of JC for self-stylisation, the creation of atmosphere and the construction of identity.
The second innovation proposed was to study language contact on the web from the perspective of the sociolinguistics of globalisation, as providing an arena (1) for the world-wide spread of originally highly localised vernaculars and (2) for contact between non-standard varieties of English beyond the sphere of face-to-face interaction in physical space. In this way, the present study contributes to an emerging sociolinguistics of computer-mediated communication (on which see Androutsopoulos 2006a).
Had we chosen Chicano-run forums in the USA or Nigeria’s popular Nairaland forum instead of Jamaicans.com, we would have been alerted to what is probably the most glaring lacuna in corpus-based research on the New Englishes – namely a failure to recognise their being embedded in sometimes intensely multilingual communities. ICE-JA aims to provide a corpus of JamEng – and yet makes clear that spoken English in Jamaica cannot be understood without also taking into account JC. In CCJ, no attempt was made to separate JamEng and JC even at the compilation stage. Future corpora of Indian, Nigerian or any other kind of second-language English should, unlike the ICE sub-corpora, be conceived as multilingual corpora from the very start, and the contact languages with which English interacts should stop being treated as ‘extra-corpus’ material. 
As one prominent sociolinguist and expert on the New Englishes has recently put it:
For many ‘New English’ speakers, monolingualism is the marked case, a special case outside of the multilingual prototype. Today’s ideal speaker lives in a heterogeneous society (stratified along increasingly globalized lines) and has to negotiate interactions with different people representing all sorts of power and solidarity positions on a regular basis. What is this ideal speaker a native speaker of, but a polyphony of codes/languages working cumulatively (and sometimes complementarily), rather than a single, first-learned code? (Mesthrie 2006: 482)
Multilingual corpora are necessary tools to research a multilingual reality. The Web may not be a suitable window on multilingualism in everyday face-to-face interaction, but multilingual diasporic web forums are certainly not in short supply and await corpus-linguistic exploration.
References
Allsopp, R. 1996. Dictionary of Caribbean English Usage. Oxford: OUP.
Androutsopoulos, J. (ed.). 2006a. Special issue on ‘The Sociolinguistics of Computer-mediated Communication’. Journal of Sociolinguistics 10(5).
Androutsopoulos, J. 2006b. Multilingualism, diaspora, and the internet: Codes and identities on German-based diasporic websites. Journal of Sociolinguistics 10: 520–547.
Beal, J., Corrigan, K.P. & Moisl, H. (eds). 2007. Creating and Digitizing Language Corpora, Vol. 1: Synchronic Databases. Basingstoke: Palgrave Macmillan.
Bechar-Israeli, H. 1995. From to : Nicknames, play, and identity on Internet Relay Chat. Journal of Computer-Mediated Communication 1(2). (19 March 2010).
Beißwenger, M. & Storrer, A. 2008. Corpora of computer-mediated communication. In Corpus Linguistics. An International Handbook, Vol. 1, A. Lüdeling & M. Kytö (eds), 292–308. Berlin: Mouton de Gruyter.
Buchstaller, I. 2006a. Diagnostics of age-graded linguistic behaviour: The case of the quotative system. Journal of Sociolinguistics 10: 3–30.
Buchstaller, I. 2006b. Social stereotypes, personality traits and regional perception displaced: Attitudes towards the ‘new’ quotatives. Journal of Sociolinguistics 10: 362–381.
Chambers, J.K. 2000. Universal sources of the vernacular. In Die Zukunft der europäischen Soziolinguistik/The Future of European Sociolinguistics/Le Futur de (la) sociolinguistique européenne, U. Ammon, K.J. Mattheier & P.H. Nelde (eds), 11–15. Tübingen: Niemeyer.
Chambers, J.K. 2003. The sociolinguistics of immigration. In Social Dialectology: In Honour of Peter Trudgill [IMPACT: Studies in Language and Society 16], D. Britain & J. Cheshire (eds), 97–113. Amsterdam: John Benjamins.
Coupland, N. 2007. Style: Language, Variation and Identity. Cambridge: CUP.
Crystal, D. 2006. Language and the Internet, 2nd edn. Cambridge: CUP.
Danet, B. & Herring, S.C. (eds). 2007. The Multilingual Internet: Language, Culture, and Communication Online. Oxford: OUP.
Davies, M. 2001. Creating and using multimillion-word corpora from web-based newspapers. In Corpus Linguistics in North America: Selections from the 1999 Symposium, R.C. Simpson & J. Swales (eds), 58–75. Ann Arbor MI: The University of Michigan Press.
Deuber, D. 2005. Nigerian Pidgin in Lagos. Language Contact, Variation and Change in an African Urban Setting. London: Battlebridge.
Deuber, D. 2009. ‘The English we speaking’: Morphological and syntactic variation in educated Jamaican speech. Journal of Pidgin and Creole Languages 24: 1–52.
Deuber, D. & Hinrichs, L. 2007. Dynamics of orthographic standardization in Jamaican Creole and Nigerian Pidgin. World Englishes 26: 22–47.
Dor, D. 2004. From Englishization to imposed multilingualism: Globalization, the Internet, and the political economy of the linguistic code. Public Culture 16: 97–118.
Eckert, P. 2008. Variation and the indexical field. Journal of Sociolinguistics 12: 453–476.
Greenbaum, S. 1990. Standard English and the International Corpus of English. World Englishes 9: 79–83.
Greenbaum, S. (ed.). 1996. Comparing English Worldwide: The International Corpus of English. Oxford: Clarendon.
Höhn, N. Forthcoming. Quotatives in Jamaican English. PhD dissertation, University of Freiburg.
Kortmann, B. (ed.). 2005. A Comparative Grammar of British English Dialects: Agreement, Gender, Relative Clauses. Berlin: Mouton de Gruyter.
Kortmann, B. & Szmrecsanyi, B. 2004. Global synopsis: Morphological and syntactic variation in English. In A Handbook of Varieties of English, Vol. II: Morphology and Syntax, B. Kortmann, K. Burridge, R. Mesthrie, E.W. Schneider & C. Upton (eds), 1142–1202. Berlin: Mouton de Gruyter.
Mair, C. 2003. Kreolismen und verbales Identitätsmanagement im geschriebenen jamaikanischen Englisch. In Zwischen Ausgrenzung und Hybridisierung: Zur Konstruktion von Identitäten aus kulturwissenschaftlicher Perspektive [Identitäten und Alteritäten 14], E. Vogel, A. Napp & W. Lutterer (eds), 79–96. Würzburg: Ergon.
Mair, C. 2009. Corpus linguistics meets sociolinguistics: The role of corpus evidence in the study of sociolinguistic variation and change. In Corpus Linguistics: Refinements and Reassessments – Proceedings of the 2007 ICAME Conference – Stratford-upon-Avon, A. Renouf & A. Kehoe (eds), 1–26. Amsterdam: Rodopi.
Mann, C. & Stewart, F. 2002. Internet Communication and Qualitative Research: A Handbook for Researching Online. London: Sage.
Mesthrie, R. 2006. Society and language: Overview. In Encyclopedia of Language and Linguistics, Vol. 11, K. Brown (ed.), 472–484. Amsterdam: Elsevier.
Meyerhoff, M. & Niedzielski, N. 2003. The globalization of vernacular variation. Journal of Sociolinguistics 7: 534–555.
Morris, M. 1999. Is English We Speaking and Other Essays. Kingston: Ian Randle.
Mukherjee, J. & Hoffmann, S. 2006. Describing verb-complementational profiles of new Englishes: A pilot study of Indian English. English World-Wide 27: 147–173.
Mukherjee, J. & Hundt, M. (eds). Forthcoming 2011. Exploring Second-Language Varieties of English and Learner Englishes: Bridging a Paradigm Gap [Studies in Corpus Linguistics 44]. Amsterdam: John Benjamins.
Nelson, G. 2006. World Englishes and corpora studies. In The Handbook of World Englishes, B.B. Kachru, Y. Kachru & C.L. Nelson (eds), 733–750. Oxford: Blackwell.
OED = Oxford English Dictionary Online.
Patrick, P. 1999. Urban Jamaican Creole: Variation in the Mesolect [Varieties of English around the World G17]. Amsterdam: John Benjamins.
Platt, J., Weber, H. & Ho, M.L. 1984. The New Englishes. London: Routledge.
Sand, A. 2005. Angloversals? Shared Morphosyntactic Features in Contact Varieties of English. Post-doctoral research thesis, Universität Freiburg im Breisgau.
Schneider, E. 2007. Postcolonial English: Varieties around the World. Cambridge: CUP.
Silverstein, M. 2003. Indexical order and the dialectics of sociolinguistic life. Language and Communication 23: 193–229.
Simo Bobda, A. 2001. Taming the madness of English. Modern English Teacher 10(2): 11–18.
Simo Bobda, A. 2004. Linguistic apartheid: English language policy in Africa. English Today 77: 19–26.
Tagliamonte, S. & D’Arcy, A. 2007a. Frequency and variation in the community grammar: Tracking a new change through the generations. Language Variation and Change 19: 341–380.
Tagliamonte, S. & D’Arcy, A. 2007b. The modals of obligation/necessity in Canadian perspective. English World-Wide 28: 47–87.
Trudgill, P. 1986. Dialects in Contact. Oxford: Blackwell.
Williams, J. 1987. Non-native varieties of English: A special case of language acquisition. English World-Wide 8: 161–199.
Youssef, V. 2004. ‘Is English we speaking’: Trinbagonian in the twenty-first century. English Today 20(4): 42–49.
Towards a new generation of corpus-derived lexical resources for language learning
David Wible and Nai-Lung Tsao
[N]ature is already, in its forms and tendencies, describing its own design.
Emerson
This chapter first argues that, despite their convenience compared to paper-based resources, corpora are, by their very nature as collections of texts and tokens, severely limited in what they can offer directly to language learners or teachers. The focus here is on understanding these limitations with respect to lexical knowledge, and it is suggested that overcoming them requires a different sort of digital resource that mediates between corpora on the one hand and teachers or learners on the other. The challenge is complicated by the fact that such a lexical knowledge resource should capture patterns of word behaviors that fall along a continuum between grammatically well-behaved and lexically idiosyncratic. A knowledgebase called StringNet, designed to capture this range of word behaviors, is described and motivated in detail.
1. Introduction
Arguably the most common reason that language educators turn to corpora is for help in teaching vocabulary. They do this typically because corpora can readily show words in the contexts of their actual use. A premise of this chapter is that, despite the advantages, an unfortunate gap still stands between what learners need for vocabulary learning and what corpora currently provide. A further premise is that reducing this distance calls for a new generation of corpus-derived resources. In what follows, we try to develop an analysis of the nature of this gap, to motivate what sort of resource might bridge it, and to illustrate this with one resource designed to help with the bridging.
The perspective we hope to develop makes sense only on a certain view of the nature of vocabulary knowledge and the role of words within a language. That view is eloquently distilled in the work of Dwight Bolinger (1977; 1985; inter alia). So we begin with an extended quote from the opening of his five-page gem, “Defining the Indefinable”. His point here concerns lexicography. We quote him because his description
of the dictionary writer’s challenge in defining words mirrors the classroom language teacher’s challenge in teaching them.
Lexicography is an unnatural occupation. It consists in tearing words from their mother context and setting them in rows – carrots and onions, and beetroot and salsify next to one another – with roots shorn like those of celery to make them fit side by side, in an order determined not by nature but by some obscure Phoenician sailors who traded with Greeks in the long ago.1 Half of the lexicographer’s labor is spent repairing this damage to an infinitude of natural connections that every word in any language contracts with every other word, in a complex neural web knit densely at the center but ever more diffusely as it spreads outward... Undamaged definition is impossible because we know our words not as individual bits but as parts of what Pawley and Syder (1983) call lexicalized sentence stems, hundreds of thousands of them, conveniently memorized to repeat – and adapt – as the occasion arises.... A speaker who does not command this array, as Pawley and Syder point out, does not know the language, and there is little that a dictionary can do to promote fluency beyond offering a few hints.
Bolinger (1985: 69)
1. “Referring to the development of the alphabet by the Greeks after its invention by the Phoenicians. (Ed.)” (This note appears in the original. D. W. & N.-L. T.)
As we read this now, the promise of corpora cannot help but suggest itself. “...[A]nd there is little that a dictionary can do...” he says. “Yes, but there’s so much that corpora can do”, we want to reply. If lexicographers must tear words from their natural habitat to plant them in alphabetic rows, and if the resulting dictionaries are of so limited value in light of the infinitude of interconnections lost in this process of domestication, then corpora are the promising counter-balance. Corpora release words back into their “mother contexts”, into the wild where teachers can guide students on highly affordable digital fieldtrips into this habitat and train them to do ecological research in vivo to balance all of the in vitro lab work with uprooted isolated words that they traditionally have done. The urge to assign this sort of role to corpora within Bolinger’s analogy as a counter-balance to the dictionary is perfectly understandable. We want to argue, however, that corpora in fact do not constitute this sort of natural habitat of words in the wild that Bolinger alludes to. It is the misconception that they do so, we will suggest, which has limited our ways of exploiting the promise of corpora for language learning. We then describe what would be closer to the analog of such natural environments for words, illustrate how corpora can play an essential role in creating these habitats (or lexical ecologies), and sketch their value for language education along the way.
First, it is worth elucidating why picturing “corpora as a natural habitat of words” within Bolinger’s metaphor is to misconstrue that metaphor, why corpora are not even close to playing such a role in that picture. Our reasons have nothing to do with the insufficient authenticity of corpora. There have been claims put forth that corpora lack ‘authenticity’ once their texts have been torn from larger contexts of situation
(Widdowson 2000;2 Mishan 2004; inter alia). While this criticism may be warranted, it is irrelevant to our point. To see our point, we need to look more closely at Bolinger’s metaphor. He evokes an ecology of tangled roots that interconnect all the lexical fauna in ways that are lost once they are uprooted for transplant into the ordered garden of the dictionary. But a corpus is not such a natural ecology of words, nor even a sample of one. A corpus is not where this myriad of original relations hold. A corpus is a collection of tokens. For botanical plants in the real out of doors, the connections do indeed hold among tokens (i.e. real and concrete instances of vegetation). But this is not the case in the metaphorical wild of words that Bolinger wants us to imagine. To think of the lexical connections he has in mind as holding among tokens of words, say among words found in a corpus, would render his whole metaphor incoherent. A token of a word in a particular utterance or line of text holds nothing near the ‘infinitude of natural connections’ among words he is indicating, at least not at that plane of existence, the plane of tokens. At most a word token exhibits some syntagmatic connection with co-occurring tokens in the same utterance, sentence, or text. But Bolinger is trying to evoke something much more imposing and substantial. Extending the quote by a few words, we see that he means “...an infinitude of connections that every word in any language contracts with every other word” (emphasis added). This extent of interconnectivity could be true only of words as types, not tokens, only of words meant as abstractions, i.e., as lexemes. And if this is not enough to lift us from thinking at the level of tokens, he tells us later in the sentence that the network of connections is “...a complex neural web...”. It is neural. It is the language user’s own mental grasp of the relations among words. 
So the crucial natural habitat of words is neither an alphabetic dictionary (Bolinger’s point) nor massive collections of tokens of words in context as found in corpora (our point). In the dictionary, the interconnections have been severed; in a corpus, as we argue above, they have not yet taken hold.
In this article we try to build on two points drawn from Bolinger. First, his metaphor of a complex neural web is apt; it gives a useful (though, of course, partial) picture of the mature language user’s grasp of the words of a language. Second, learners aspiring to such a grasp will not find what they need in a dictionary. And we add a third point: neither will they find it in a corpus. In our estimation, two of the more urgent and at the same time tractable issues in corpus-based computational linguistics for language education are: (1) constructing alternatives to the dictionary on the one hand and the corpus on the other as lexical resources that more closely reflect Bolinger’s picture of what learners need to master and (2) creating the means of making such knowledge resources accessible when and where they matter to learners and teachers (Wible 2008). We categorize the first issue as one of lexical knowledge discovery (or extraction) and the second issue as lexical knowledge representation. We could also call these the issues of What (knowledge) and How (to represent it). Of course, progress on the first must be made before the second becomes an issue. We need to have knowledge in hand before worrying about how to make it available. Thus, our purpose in what follows centers around the first issue. Specifically we describe and motivate a particular sort of lexical knowledgebase intended to capture at least some of the lexical interconnectivity pictured by Bolinger. We have little to say in this chapter about the second issue: how to make this knowledge accessible to learners. Wible (2008) offers a view on a new generation of lexical representation for language learning, especially for multiword expressions, which require alternatives to both the dictionary and the corpus. The corpus-derived resources we describe in this chapter are completely compatible with the approach to lexical representation for learners that is proposed there. Near the end of this chapter we touch on the issue of representation by describing how this is the case.
2. Our thanks to one of the reviewers for pointing out the relevance of the Widdowson (2000) article to this point.
2. The gap between corpora and lexical knowledge
With a few incisive lines, Bolinger has made memorably clear a fundamental limitation of the dictionary as a resource for learning words. Here we are interested in clarifying some limitations of corpora for that same role. As a resource for learning words, a corpus can be seen as a repository of instances of words in use. The intuition behind much of corpus-supported vocabulary learning and teaching, as we mentioned at the outset, is that vocabulary learning depends upon exposure to instances or tokens of words in use and corpora can provide users with just such exposure. We know of no educator or researcher, however, who finds tokens of words to be interesting in themselves. Tokens are valuable only as windows onto what they betoken. And basically, what they betoken is patterns of word behaviors. It is precisely for this reason that corpus concordancing and KWIC searches are seen as useful. They are seen as useful to the extent that they show conventional uses of words, that is, to the extent that they reflect patterns of their behavior. Data-driven language learning approaches, for example, are premised on the assumption that guided exposure to the data (the tokens) will reveal these underlying patterns to the learner (Johns 1994; inter alia). The point we want to make here, however, is that concordancing and KWIC searches are not designed to find patterns. They are designed to find strings. Literally and technically, they search by means of string matching, not by pattern matching. Detecting patterns is left up to the user. The sorts of patterns that these searches are good at making salient are cases where the forms in the strings coincide with a pattern. For example, a query of the term look will show numerous cases of look at listed together if results are displayed alphabetically by the word to the right of the search term. 
And this is because the two items are contiguous and the second one is a specific word form, so all its tokens show up together under an alphabetic listing. But much of the patterned use of words, say, as in formulaic sequences or multiword expressions, will
fall through the cracks here. As Read and Nation (2004) point out, many patterns of word behavior involve non-contiguous strings, rendering them resistant to discovery “whether it be by human intuition or automated computer search” (pp. 31–32). There are corpus search programs that do support pattern searches, but only for users who are willing to learn a technical language such as regular expressions or other similarly complex notations. Once this learning curve is passed, however, the patterns that these programs find are the patterns that the user tells them to search for. That is, users must specify a pattern in their query. This is different from learners discovering patterns they had not thought of looking for. In corpus applications for language learning, this point is crucial. It is precisely these sorts of facts about word behavior (those that have not even occurred to the learner to wonder about) that underlie their most recalcitrant errors or misconceptions in lexical knowledge (Wible 2008). So one of the more important issues for corpus-supported language learning is how to overcome the dilemma that corpora are best suited either for those who have a lexicographer’s nose for lexical facts or those who already know what they are looking for before they begin.
Here, then, is the source of the gap between corpora and language learners that we want to address. Corpora are ideal for storing instances of language in use, but the mind is not. In the long run, the mind is better at distilling than storing. One property of words worth distilling is the patterns of their behavior. And what Bolinger pointed out is that, in the case of words, what the mind distills is not a list but a web. Corpora do no such distilling; nor do they afford it in any straightforward way. Accordingly, the sort of resource we aim for is something more akin to such a distilled web, with the dense interconnections navigable not only among words but also among patterns of word uses.
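The contrast drawn in this section between string matching and pattern matching can be made concrete with a minimal sketch of a KWIC search. The function name kwic, the toy sentence, and the three-word context width are illustrative assumptions and do not come from any particular concordancer. The search finds exact matches of the query string and sorts the hits by the words to the right, which is all that brings the look at cases together.

```python
def kwic(tokens, keyword, width=3):
    """Return (left, keyword, right) triples for every exact match of keyword.

    Hypothetical helper for illustration: it only compares strings and has
    no notion of a pattern.
    """
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:  # plain string matching
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            lines.append((left, tok, right))
    # Display order: alphabetical by the words to the right of the keyword
    return sorted(lines, key=lambda line: line[2])


# Toy data invented for this sketch
text = ("look at this and look up the word then look at that "
        "but look the word up first")
concordance = kwic(text.split(), "look")
for left, kw, right in concordance:
    print(" ".join(left).rjust(20), kw.upper(), " ".join(right))
```

In this toy run the two contiguous look at hits end up adjacent, while look up the word and look the word up are listed apart even though they instantiate the same verb-particle pattern. Noticing that relation is left to the reader of the concordance.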
3. The role of some current constructs
Researchers interested in extracting patterns of behavior of words from corpora have relied traditionally on the construct of the n-gram. N-grams are ordered n-tuples of grams, and the ‘grams’ of n-grams are linguistic units, typically words. So a tri-gram is a contiguous sequence of three words; a 4-gram is such a sequence of 4 words, and so on, with no upper limit, in principle, on the value of n. A specific n-gram is a type, and we can search a corpus for all of its tokens. Thus, for example, put up with, taken as a tri-gram, is a type, and we can extract all tokens of it from a corpus, using string matching to identify each occurrence where these three word forms appear side by side.3 Such work with n-grams lies behind much corpus-based research for second language learning. Important constructs such as ‘lexical bundles’ (Biber et al. 1999; Biber & Conrad 1999; Biber et al. 2003; Biber et al. 2004) and ‘formulaic sequences’ (Simpson-Vlach & Ellis 2010), for example, have been operationalized as n-grams, making it possible to identify them in large corpora, to rank them by frequency, compute the strength of association of the co-occurring words with each other, determine which n-grams are distinctive of particular genres, and in general to render them susceptible to analysis. This sort of corpus work has been valuable in helping determine which multiword sequences are important for learners to learn and teachers to teach.
3. Typically the grams of n-grams are specific word forms (as opposed to lexemes). Thus put up with and putting up with would be two distinct tri-grams. This makes it possible to extract n-grams from corpora that have not been lemmatized. But, as I try to show later, it also obscures the obvious relationship between put up with and putting up with. StringNet encodes this relationship while still distinguishing between the two variations.
N-grams are one-dimensional, however. They encode syntagmatic relations of co-occurring elements. For this reason, they flatten the space available for representing the rich network of interconnections among words that we aim to encode. This one-dimensionality has consequences. To illustrate just one, seen as n-grams, the strings consider himself lucky and consider yourself lucky are simply two different tri-grams, as distinct from each other as they are from any other tri-grams, say from stroke of luck, a fine line, up with the, or close to you. There is something counter-intuitive about this. There are degrees of similarity and difference among n-grams worth distinguishing and connections among them worth capturing. To make such distinctions and capture such connections, however, we need something other than the n-gram. It is the same one-dimensionality of n-grams that explains in part why they are made available to users (when made available at all) in lists. Such lists may be organized in order of frequency of occurrence in a corpus or by the strength of association of the component grams, say, by mutual information (MI) score (Simpson-Vlach & Ellis 2010), but always they are lists and the lists are flat, ranking n-grams but showing no relations between or among them. 
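A small sketch may help to show what a flat list of n-grams looks like in practice. The helper ngrams and the toy sentence are assumptions made for illustration only, with tri-grams over word forms counted by simple frequency.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every contiguous sequence of n tokens as a tuple (hypothetical helper)."""
    return zip(*(tokens[i:] for i in range(n)))


# Toy data invented for this sketch
tokens = ("you should consider yourself lucky and he should "
          "consider himself lucky too").split()

# Each distinct tri-gram is a type; the Counter stores its token frequency
trigram_counts = Counter(ngrams(tokens, 3))
for gram, freq in trigram_counts.most_common(3):
    print(freq, " ".join(gram))
```

In the resulting table, consider yourself lucky and consider himself lucky are separate keys with separate counts. Nothing in the data structure records that they differ only in the choice of reflexive pronoun, which is the flatness described above.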
It is worth recalling that our approach to narrowing the gap between corpora and learners is to create, as an underlying knowledge resource, a navigable web rather than a list. The point here is that such a web will not be composable from the traditional unit of the n-gram. There is research that avoids the restrictions of the n-gram, some of it explicitly acknowledging the limitations of n-grams for lexicography and language education. In a substantial literature on collocation extraction, for example, it is common to assume a window of proximity for the two collocating words rather than fixed sequences of grams in which the two occupy fixed slots (Church & Hanks 1990; Church et al. 1991; Dunning 1993; Manning & Schutze 1999; Evert & Krenn 2001; inter alia). Collocability is computed from instances where the two words co-occur within this window, in some cases with no requirement that they be adjacent or part of any fixed strings like n-grams. On this approach, verb collocates for the noun mistake are extracted by counting (and computing conditional probabilities in the presence of) verbs that occur within, say, a five-word window of mistake. Accordingly, the V-N collocation make mistake is extracted by taking into account indiscriminately all cases where make and mistake co-occur within the window. This includes cases of make a mistake, make lots of mistakes, make so many mistakes, make the mistake, make the biggest mistake, and so on, with no regard for what intervenes in the rest of the window. Word association
measures run on such data have produced important results in extracting collocations from corpora. While collocation research has proceeded apace with little regard for the n-gram, another literature has taken aim directly at the n-gram and its limits (Cheng et al. 2006; Wible et al. 2006a). This work, however, considers the limitations arising not from the n-gram’s one dimensionality (the limitation we are concerned with) but from its contiguity, the restriction that the grams must co-occur in uninterrupted, contiguous sequences. The alternatives explored in this literature, for example congrams and skipgrams (Cheng et al. 2006; Cheng et al. 2009), loosen this restriction to allow for discontinuous sequences of grams. This work opens up spaces horizontally, allowing a wider range of variations in what sequences can be retrieved for a target word or pair of co-occurring words. Skipgram searches detect both contiguous and non-contiguous word co-occurrences within a window. The congram in addition allows variation in the linear ordering of the co-occurring words, for example, both played a role and the role she played. It is important to note that these researchers intend the constructs of skipgrams and congrams for lexicographic research rather than for direct use by learners or teachers. While the skipgram and congram acknowledge the importance of discontinuous strings in understanding word behavior and open up the spaces surrounding the target words, what they extract are sentences containing the congrams rather than patterns that include these intervening spaces (finding patterns again would be done by the user). In this respect, the perspective on word behaviors that this work takes is still essentially horizontal. The limitations of this perspective are difficult to elucidate without a contrasting alternative view. For this, we turn next to the alternative we pursue in the design of the lexical knowledgebase called StringNet.
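The window-based counting that underlies the collocation extraction discussed in this section can be sketched as follows. This is a simplified illustration under stated assumptions and not the procedure of any of the cited studies. The function name window_collocates and the four toy sentences are invented, windows are not allowed to cross sentence boundaries, and no part-of-speech filtering or association measure is applied.

```python
from collections import Counter


def window_collocates(sentences, node_forms, window=5):
    """Count words found within `window` tokens of any form of the node word.

    Hypothetical helper: whatever else intervenes in the window is ignored.
    """
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, tok in enumerate(tokens):
            if tok in node_forms:
                lo = max(0, i - window)
                hi = min(len(tokens), i + window + 1)
                counts.update(tokens[j] for j in range(lo, hi) if j != i)
    return counts


# Toy data invented for this sketch
sentences = [
    "Do not make a mistake",
    "They make lots of mistakes",
    "We make so many mistakes",
    "He will make the biggest mistake",
]
counts = window_collocates(sentences, {"mistake", "mistakes"})
print(counts.most_common(3))
```

The four sentences all contribute to a single count for make, although the material between make and the node word differs each time. What is returned is a count per word, not a pattern that includes the intervening slots.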
4. The lexical knowledgebase⁴
What is missing from the one-dimensional n-gram and its extensions, we want to suggest, is the paradigmatic dimension. And this we incorporate into StringNet in a straightforward way that has wide-ranging consequences. For a construct that captures both syntagmatic and paradigmatic aspects of word behaviors, we have introduced the notion of hybrid n-gram (Tsao & Wible 2009; Wible & Tsao 2010). We hope to show that, once the paradigmatic dimension is introduced within the minimal unit of the hybrid n-gram, it then becomes possible from a corpus-derived set of these units to generate not simply a list, but a net. This is because the paradigmatic dimension makes it possible not only to identify patterns of word behavior but to create an entirely new space where relations hold among these patterns and among the words in them. Moreover, in this space these relations become susceptible to automatic detection
4. See Wible and Tsao (2010) for details on the computational aspects of the knowledgebase design.
David Wible and Nai-Lung Tsao
and indexing. The resulting StringNet is a massive and organic lexical knowledgebase whose structure is not prescribed but emerges. We illustrate in what follows.⁵
4.1 Hybrid n-grams
Essentially, hybrid n-grams expand the inventory of gram types and allow these different types of grams to co-occur in the same hybrid n-gram. Most notably, we add parts of speech (POS) as a gram type. Thus, in the same hybrid n-gram, POS grams can occur alongside lexemes or word forms. For example, in the hybrid n-gram consider [pnx] lucky, the second gram, [pnx], is the POS tag for reflexive pronouns. Thus, this one hybrid n-gram describes consider himself lucky and consider yourself lucky and all other instances with different reflexive pronouns in that same second slot. This both captures the similarity between these strings and distinguishes them from other trigrams (e.g. from the likes of, time after time, tally up the, or so if you). Similarly, while traditional n-grams must treat my point of view and your point of view as different 4-grams, as distinct from each other as from the 4-grams once upon a time, in case of emergency, and a friend in need, the hybrid n-gram [dps] point of view can capture the similarity they share and differentiate them from other, dissimilar 4-grams. The hybrid n-grams of StringNet permit four distinct types of grams: (1) specific word forms, such as climb, climbed, climbing; (2) lexemes (indicated in bold), such as climb, subsuming all its word form variations; (3) fine-grained part of speech (POS) categories (46 different categories from CLAWS 5, see Burnard 2007), indicated in brackets like [noun sg]; and (4) coarse-grained POS categories (twelve general tags subsuming the 46 fine-grained ones),⁶ also marked in brackets.
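The four gram types can be made concrete with a small Python sketch that enumerates candidate hybrid n-grams for one tagged string. It is an illustration only and not the StringNet extraction code. The `Gram` type, the upper-case convention for lexemes, and every tag label other than [pnx] are assumptions made for the example. The sketch also applies the requirement that each hybrid n-gram contain at least one lexical gram (a word form or lexeme).

```python
from itertools import product
from typing import List, NamedTuple, Tuple


class Gram(NamedTuple):
    """One token seen on four tiers, from most specific to most general."""
    word: str        # tier 0: word form
    lexeme: str      # tier 1: lexeme (upper case here, standing in for bold)
    fine_pos: str    # tier 2: fine-grained POS tag
    coarse_pos: str  # tier 3: coarse-grained POS tag


LEXICAL_TIERS = (0, 1)  # word form and lexeme count as lexical grams

# Only [pnx] comes from the chapter; the other tag labels are placeholders.
TOKENS = [
    Gram("consider", "CONSIDER", "[verb base]", "[verb]"),
    Gram("yourself", "YOURSELF", "[pnx]", "[pron]"),
    Gram("lucky", "LUCKY", "[adj base]", "[adj]"),
]


def hybrid_ngrams(tokens: List[Gram]) -> List[Tuple[str, ...]]:
    """Enumerate every tier combination that keeps at least one lexical gram."""
    results = []
    for tiers in product(range(4), repeat=len(tokens)):
        if not any(t in LEXICAL_TIERS for t in tiers):
            continue  # an all-POS combination is not lexically grounded
        results.append(tuple(tok[t] for tok, t in zip(tokens, tiers)))
    return results


grams = hybrid_ngrams(TOKENS)
print(("CONSIDER", "[pnx]", "lucky") in grams)   # True: lexeme + POS + word form
print(("[verb]", "[pron]", "[adj]") in grams)    # False: no lexical gram
```

The number of combinations this naive enumeration yields should not be read as StringNet's own count for a string of the same length, since the tier definitions used here are assumptions.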
5. Another lexicographic tool that incorporates both the syntagmatic and paradigmatic dimensions in capturing patterns of word behavior is Sketch Engine (Kilgarriff et al. 2004). Sketch Engine is a fundamental contribution to the ‘new generation’ of corpus-derived lexical resources referred to in the title of this chapter. StringNet and Sketch Engine differ from each other in the approach taken to discovering and representing the syntagmatic and paradigmatic dimensions. A point by point comparison of the two resources is beyond the scope of this chapter and would be misleading because of their differing aims and approaches. One representative difference worth pointing out here is that Sketch Engine has the advantage of providing a set of important pre-determined functional slots in the context of the target word (for example, in the case of a target verb, it clearly lays out the grammatical subject slot for that verb and shows the various strings attested as subject of that verb, and so on). As we elaborate in the text, StringNet takes a different tack. Specifically, its hybrid n-grams are built case by case (though automatically) for a search word without targeting specified or pre-determined functional relations to the target word that should be represented. In this and other respects, the two resources are suited to exploring or highlighting complementary (as well as overlapping) aspects of word behavior.
6. For example, all the various verb forms distinguished in the detailed POS tag set are subsumed into the single category V in the coarse-grained set.
The main restriction imposed on the co-occurrence of gram types within a hybrid n-gram is that at least one of the co-occurring grams must be lexical. So there must be at least one lexeme or word form in a hybrid n-gram. This ensures that hybrid n-grams are lexically grounded and reflect word behavior. Traditional n-grams are subsumed as one type of hybrid n-gram. Thus, the traditional 5-gram leaving aside the question of is included in StringNet, but as just one of the numerous hybrid n-grams that also describe this same string. With four tiers of gram types available for each gram slot in hybrid n-grams, such a single traditional 5-gram corresponds to 512 distinct hybrid n-grams.⁷ Table 1 includes a small sampling for this one. While the discussion here is focused on lexical knowledge discovery rather than representation, it is worth mentioning one new possibility that hybrid n-grams create for knowledge representation. Specifically, the hybrid n-grams of StringNet afford a unique concordancing that answers queries not with lists of sentences but with lists of patterns. We implement this concordancing through a web-based search interface
Table 1. Sampling of the distinct hybrid n-grams corresponding to the traditional 5-gram leaving aside the question of
Hybrid n-grams
Leaving   aside   the     question    of
Leaving   aside   [art]   question    of
Leaving   aside   [art]   question    [prp]
Leaving   aside   the     [noun]      of
Leaving   aside   the     [noun]      [prp]
Leaving   aside   the     [noun sg]   of
Leaving   aside   the     [noun sg]   [prp]
Leaving   aside   [art]   [noun sg]   [prp]
Leaving   aside   [art]   [noun]      [prp]
Leave     aside   [art]   [noun sg]   of
Leave     aside   [art]   question    of
Leave     aside   [art]   question    [prp]
[Vvg]     aside   the     question    of
[Vvg]     aside   the     question    [prp]
[verb]    aside   the     question    of
[verb]    aside   the     question    [prp]
7. Every traditional 6-gram corresponds to 2,048 hybrid 6-grams, each 7-gram to 8,192 hybrid 7-grams, and an 8-gram to 32,768 distinct hybrid 8-grams.
Figure 1. Sample LexChecker search results for eye
called LexChecker (see Figure 1).⁸ The listed patterns, in turn, are each linked to all sentences in BNC that instantiate them (see Figure 2).⁹ Returning now to knowledge discovery, an important feature of hybrid n-grams is that, while they represent patterned word uses, the patterns are not prescribed, for example by an inventory of lexical entry templates, but emerge and are discovered by simple computational means from the BNC. Recall that hybrid n-grams draw upon four tiers of gram types, each type with the potential to occupy any slot in a string. The patterns that are automatically extracted for StringNet are all those describable within this immense space of possibilities and attested in the BNC with a threshold frequency.¹⁰ A single algorithm extracts a wide range of lexical behaviors. These include not only fixed expressions such as tongue in cheek, by word of mouth, on foot, one way or another, slip of the tongue, in fact, wear and tear, as a matter of fact, by trial and error, with the possible exception of, stand up and be counted, but also quirky patterns of highly idiosyncratic word behaviors such as [noun] and [noun] alike (‘teacher and student alike’) or vary from [noun] to [noun] as well as grammatical constructions, such
8. www.lexchecker.org. Query results are ranked by a combination of frequency of the hybrid n-gram and the association strength among the co-occurring grams composing it (determined by a mutual information, MI, measure). Since StringNet is huge, exceeding four terabytes (over 4000 gigabytes), the only practical means of making it available is through a web-based service.
9. We rank the hybrid n-grams in the search results according to a normalized MI measure that enables us to place hybrid n-grams of different lengths along the same scale. For details, see Tsao & Wible (2009).
10. Currently, the frequency threshold is set at five; that is, any hybrid n-gram attested with five or more tokens in BNC is included in StringNet.
Example sentences are from British National Corpus
No. Sentences
1 The mite is just visible to the naked eye and feeds on honey bees and their grubs by sucking their body fluids.
2 But many kinds of bacteria in nature form elaborate colonies, often quite visible to the naked eye, in which different individuals perform different functions, so that the whole colony functions as if it were a single organism.
3 Because the creatures of the plankton on individually are small, they are not always visible to the naked eye.
4 Because they are so faint, not a single one is visible to the naked eye.
5 Protozoa are much larger than bacteria or viruses, although still not visible to the naked eye.
6 However, some cells, like the large eggs of frogs, are easily visible, and the human egg is just visible to the naked eye.
7 The human and mouse eggs are about one tenth of a millimetre in diameter and are just visible to the naked eye.
8 Quite often, olivine and pyroxene begin to crystallize out early on, so they may be present in the final rock as quite large crystals, up to a centimetre across, many times larger than the crystals surrounding them, and easily visible with the naked eye.
Figure 2. Sentences linked to the hybrid n-gram visible [prep] the naked eye
as the ‘It Cleft proto-construction’: it be the [noun] that [verb] (‘It’s the thought that counts’). Such a space is much richer and more capable of registering nuances than one using typical lexical entries that assume a discrete delineable interface between words and grammar and that impose a priori structure on the permitted patterns or features of words to be encoded. Hybrid n-grams also add important refinements to collocation knowledge by detecting larger patterns in which many collocations are often embedded. This is possible because hybrid n-grams not only detect collocating words that may or may not be adjacent to each other, but also encode them as part of the contextual patterns within which they co-occur. Taken as collocations pure and simple, widely cited pairs such as spend time or make mistakes leave hidden important features of the patterns in which they conventionally appear. A common frustration for students is to have a teacher correct a miscollocation like I pay time... to I spend time..., only to have the revised version then marked again for a newly discovered error when the teacher notices what follows the collocation: I spend time to clean my room every Saturday (cf. spend time cleaning... or take time to clean...). Hybrid n-grams express the full patterns here, and a query of StringNet for time lists them: spend time [Vvg] and take time [to V]. Similarly, learners are commonly taught the collocation make a mistake (as opposed to the miscollocation do a mistake) but not the patterns of its larger context. Yet this collocation, as so many others, is simply one part of a set of wider patterns: make the mistake of [Vvg] but not make the mistake [to V]. Nor is this the complete pattern. The noun mistake takes both complement types of [Vvg] and to [V], but the choice of which forms should follow mistake is conditioned not by mistake alone but in
combination with what precedes it. Compare I made the mistake of trusting him and It was a mistake to trust him. Notice too that even definiteness here (the mistake vs. a mistake) is conditioned by context. These are subtle dependencies within which collocation forms just one part. In StringNet they are distilled whole and set into clear relief as distinct hybrid n-grams: be a mistake [to V] and make the mistake of [Vvg]. Such patterns, exhibiting lexically determined grammatical properties, suggest hybrid n-grams’ potential for uncovering colligation (Hoey 2004) and for investigating lexico-grammatical constructions.¹¹
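How a hybrid n-gram such as spend time [Vvg] separates a conventional use from a learner error can be sketched in a few lines of Python. This is a minimal illustration, not the StringNet matcher. The token representation, the upper-case lexeme convention, and all tag labels other than [Vvg], [noun sg], [noun] and [verb] are assumptions made for the example.

```python
from typing import List, Sequence, Tuple

# A token seen on four tiers: word form, lexeme, fine POS, coarse POS.
Token = Tuple[str, str, str, str]


def find_matches(pattern: Sequence[str], sentence: Sequence[Token]) -> List[int]:
    """Return the start positions at which every slot of `pattern`
    equals one of the four tier values of the aligned token."""
    n = len(pattern)
    hits = []
    for start in range(len(sentence) - n + 1):
        window = sentence[start:start + n]
        if all(slot in tok for slot, tok in zip(pattern, window)):
            hits.append(start)
    return hits


# Hybrid n-gram: lexeme SPEND, word form 'time', fine POS [Vvg].
PATTERN = ("SPEND", "time", "[Vvg]")

# 'I spend time cleaning' (placeholder tags where the chapter gives none).
GOOD = [
    ("I", "I", "[pers pron]", "[pron]"),
    ("spend", "SPEND", "[verb base]", "[verb]"),
    ("time", "TIME", "[noun sg]", "[noun]"),
    ("cleaning", "CLEAN", "[Vvg]", "[verb]"),
]

# 'I spend time to clean', the learner error discussed in the text.
BAD = GOOD[:3] + [
    ("to", "TO", "[inf marker]", "[particle]"),
    ("clean", "CLEAN", "[verb inf]", "[verb]"),
]

print(find_matches(PATTERN, GOOD))  # [1]
print(find_matches(PATTERN, BAD))   # []
```

Because a slot may name any tier, the same function also matches fully lexical patterns and more general ones such as ("SPEND", "[noun sg]", "[Vvg]").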
4.2 Relations among hybrid n-grams
The appearance of POS categories in the hybrid n-grams of StringNet not only creates an explosion in the number of patterns of word behavior we detect, but also, as mentioned above, raises the possibility of detecting relations among these patterns.¹² It is in this respect that introducing the paradigmatic dimension makes possible the emergence of an organic structure in the lexical knowledgebase. Consider the 5-gram discussed above, for example. The query results for the word aside include not only the 5-gram we saw above ‘leaving aside the question of’, but also a parent of that hybrid n-gram: ‘leaving aside the [noun sg] of’. And that parent/child relation between them is automatically detected and encoded in StringNet by virtue of the fact that question in the first hybrid n-gram is a case of (or a child of) its counterpart gram in the second one, the category [noun sg]. In turn, StringNet indexes this parent hybrid n-gram to all of the other tokens that instantiate it in BNC. Further, StringNet indexes this specific [noun sg] slot in this exact hybrid n-gram to all of the particular nouns that are attested in that specific slot in ‘leaving aside the [noun sg] of’. As it turns out, there are thirteen different such nouns appearing in the 28 tokens of this pattern in BNC. Thus, this set of thirteen nouns constitutes a paradigm. The noun question of course belongs to this specific paradigm, but this is not a discrete or independent fact. Just as ‘leaving
11. See Wible & Tsao (2010) for a description of StringNet as a resource for investigating linguistic constructions.
12. We prune search results shown to users to eliminate a large number of redundant patterns. For example, any pattern that is attested by only one of its more specific sub-patterns (or ‘children’) is pruned as redundant and the attested sub-pattern retained. For example, the pattern ‘from [dps] point [prn] view’ is automatically pruned by comparing it with the more specific pattern ‘from [dps] point of view’ while the latter, more specific pattern, is retained. This is because in all instantiations of ‘from [dps] point [prn] view’, the [prn] category is realized as the preposition ‘of’. In contrast, when this retained pattern ‘from [dps] point of view’ is compared with its sub-patterns or children, for example with ‘from my point of view’, it will be found that ‘my’ is not the only instantiation of the [dps] slot in this pattern. There are attested cases of ‘from their/our/his/her point of view’. Thus, the more general parent pattern here ‘from [dps] point of view’ is not pruned and is shown in the results because the [dps] represents attested substitutability in that slot. For details on this pruning, see Wible & Tsao (2010).
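The pruning rule in footnote 12 can be read as a test on each POS slot: if only one filler is ever attested in a slot, the slot shows no substitutability and the pattern is redundant. The Python sketch below implements that reading as an assumption. The function names are hypothetical, and the actual procedure is the one described in Wible & Tsao (2010). The five instances are the possessives the footnote reports as attested.

```python
from typing import Dict, List, Sequence, Set


def slot_fillers(pattern: Sequence[str],
                 instances: List[Sequence[str]]) -> Dict[int, Set[str]]:
    """Collect the distinct words attested in each bracketed (POS) slot."""
    fillers: Dict[int, Set[str]] = {
        i: set() for i, gram in enumerate(pattern) if gram.startswith("[")
    }
    for inst in instances:
        for i in fillers:
            fillers[i].add(inst[i])
    return fillers


def is_redundant(pattern: Sequence[str],
                 instances: List[Sequence[str]]) -> bool:
    """Assumed reading of the rule: prune when some POS slot has exactly
    one attested filler, since a more specific child says the same thing."""
    return any(len(f) == 1 for f in slot_fillers(pattern, instances).values())


ATTESTED = [tuple(f"from {dps} point of view".split())
            for dps in ("my", "their", "our", "his", "her")]

PARENT = ("from", "[dps]", "point", "[prn]", "view")
CHILD = ("from", "[dps]", "point", "of", "view")

print(is_redundant(PARENT, ATTESTED))  # True: [prn] is always 'of'
print(is_redundant(CHILD, ATTESTED))   # False: [dps] varies
```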
aside the [noun sg] of’ is related by parenthood to ‘leaving aside the question of’ because question is a case of [noun sg], so it in turn has its own parent patterns. Thus, navigating StringNet upward from the hybrid n-gram ‘leaving aside the [noun sg] of’, we discover the parent pattern ‘[Vvg] aside the [noun sg] of’, with [Vvg] including the verb ‘leaving’ as a subcase (or child).¹³ In this parent pattern we have co-occurring POS slots in the same hybrid n-gram, that is, two paradigms marked as [Vvg] and [noun sg] connected syntagmatically. Linking to its instances we find two things. First, the [Vvg] slot in this pattern is instantiated by only two distinct verbs: setting (aside) and leaving (aside). Second, the choice between these two verbs in this position corresponds to a change of the membership in the paradigm in the neighboring [noun sg] slot. This can be seen by comparing the two (see Figure 3). The slight shift from leaving to setting in the [Vvg] gram here corresponds to an accompanying change in the membership of the neighboring [noun sg] paradigm. The noun question appears in this [noun sg] slot in the presence of both setting and leaving, and it has only one cohort noun that shares this same distribution: issue. So with each shift in the neighborhood (syntagmatically), this noun contracts a new set of relations with a new set of neighbors (paradigmatically) with a slight overlap in the two conditions. Thus, hybrid n-grams derived from a corpus and indexed to it can tell us more than one-dimensional constructs can about the company that words keep. One-dimensional n-grams can indicate a specific pattern of company kept between leaving and question and between leaving and issue but by two separate and independent n-grams. The added dimension of hybrid n-grams, however, can tell us that, as a consequence, question and issue keep company with each other, but only in another
13. The possibility of actually ‘navigating’ StringNet by linking to the parents or children of any hybrid n-gram that appears in any search results is realized in a prototype research interface that we have just completed at the time of writing (http://nav.stringnet.org). Navigation is afforded in the prototype by a pair of links appearing beside each hybrid n-gram listed in a set of search results alongside the current ‘Examples’ link that accompanies each hybrid n-gram in a search result. These new links are labeled ‘Parents’ and ‘Children’. Clicking on the ‘Parents’ link, say, for the hybrid n-gram ‘consider yourself lucky’ gives a list of its parent hybrid n-grams, for example, ‘consider [pnx] lucky’, ‘consider yourself [adj]’ and so on. These parents in turn show links to each of their parents and children. This is invaluable for research into constructions to determine whether a specific string is a one-off frozen expression or a specific case of a more general construction. Thus, the hybrid n-gram ‘it’s the thought that counts’ links to the parent ‘it’s the [noun] that counts’ showing variation possible in that exact noun slot. (In fact, a list of the nouns attested in that noun slot appears in a pop-up linked to the [noun] slot.) This hybrid n-gram links in turn to its own parent ‘it’s the [noun] that [verb]’, showing substitutability in the verb slots as well (and, by pop-up windows, can show the verbs that are attested in that slot), thus leading to the proto-construction, the ‘It Cleft’. The research possibilities afforded by StringNet proliferate with such a navigable network. For example, dependencies can be detected within such discovered constructions, such as the non-canonical agreement holding in the ‘It Cleft’ (e.g. ‘It’s the voters that count’ vs. ‘It’s the thought that counts’).
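The parent and child links used for navigation rest on an ‘is a case of’ relation between grams in the same slot. The Python sketch below shows one way such a relation could be checked. It is a hypothetical illustration rather than StringNet's indexing code. The `IS_A` table is a hand-written stand-in holding only links mentioned in the chapter, and it skips the lexeme tier for brevity.

```python
from typing import Dict, Sequence

# Hand-written 'is a case of' links drawn from the chapter's examples.
IS_A: Dict[str, str] = {
    "leaving": "[Vvg]",
    "setting": "[Vvg]",
    "[Vvg]": "[verb]",
    "question": "[noun sg]",
    "issue": "[noun sg]",
    "[noun sg]": "[noun]",
}


def generalizes(general: str, specific: str) -> bool:
    """True if `general` equals `specific` or lies above it in IS_A."""
    current = specific
    while current != general:
        if current not in IS_A:
            return False
        current = IS_A[current]
    return True


def is_ancestor(parent: Sequence[str], child: Sequence[str]) -> bool:
    """True if every slot of `parent` generalizes the aligned slot of
    `child` and the two patterns are not identical."""
    return (len(parent) == len(child)
            and tuple(parent) != tuple(child)
            and all(generalizes(p, c) for p, c in zip(parent, child)))


child = ("leaving", "aside", "the", "question", "of")
parent = ("leaving", "aside", "the", "[noun sg]", "of")
grandparent = ("[Vvg]", "aside", "the", "[noun sg]", "of")

print(is_ancestor(parent, child))        # True
print(is_ancestor(grandparent, child))   # True
print(is_ancestor(child, parent))        # False
```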
[Figure 3 diagram: the hybrid n-gram ‘[Vvg] aside the [noun sg] of’. [Vvg] slot: leaving, setting. [noun sg] slot: study, difficulty, case, ..., position, impact, question, issue, mass, ..., forfeiture, sum, chance, bundle.]
Figure 3. Co-occurring paradigms occupying two POS slots in a hybrid n-gram
dimension (the vertical, paradigmatic here) as members of the same paradigm. And the cross-indexing of hybrid n-grams makes it possible to detect exactly all of the intersections of the vertical and horizontal dimensions that bring two words into company anywhere in this network (like drinking buddies but not only, also prayer partners and sparring partners and so on) and what other words are implicated along the way (who else frequents the same bar, church, and gym). And so, the possibilities for ‘word associations’ that we can detect under StringNet quickly proliferate. This example shows a minute portion of the relations that a word contracts with other words in the context of StringNet. We see, for example, two related paradigms containing the noun question. StringNet contains hundreds of thousands of such paradigms, and none of them are isolated. They are indexed to each other, directly or indirectly, by virtue of the myriad syntagmatic and paradigmatic connections that take hold in this web. A single word may occur in hundreds or thousands of highly specific paradigms, implicating that word in a unique “infinitude of natural connections” that extend horizontally and vertically. In StringNet, they are susceptible to discovery and exploration. To take one of the simpler possibilities, information theoretic or statistical measures can be used to determine to what extent there is really a conditioning relationship between the membership of the [Vvg] slot and that of the [noun sg] in the family of patterns above as opposed to differences in membership we would expect by chance given the corpus size and the frequencies of the words involved. While we have not yet run such measures, the point here is that, by virtue of its structure, StringNet makes it possible to do so. That is, the rich and explicitly structured web of relations encoded in StringNet makes it possible to run such measures and discover these and other more complex and nuanced connections.
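One of the information-theoretic measures alluded to above is the mutual information between the fillers of two slots. The Python sketch below computes it from a table of co-occurrence counts. It is offered only as an illustration of what such a measure could look like: the function name is hypothetical, and the counts are invented toy values, not BNC frequencies. The nouns question and issue are placed with both verbs, as the text reports, while noun_x and noun_y are placeholders.

```python
import math
from collections import Counter
from typing import Dict, Tuple


def mutual_information(counts: Dict[Tuple[str, str], int]) -> float:
    """Mutual information (in bits) between the two slot variables,
    estimated from raw co-occurrence counts."""
    total = sum(counts.values())
    left: Counter = Counter()
    right: Counter = Counter()
    for (verb, noun), c in counts.items():
        left[verb] += c
        right[noun] += c
    mi = 0.0
    for (verb, noun), c in counts.items():
        if c == 0:
            continue  # zero cells contribute nothing
        # p(v,n) / (p(v) * p(n)) simplifies to c * total / (left * right)
        ratio = c * total / (left[verb] * right[noun])
        mi += (c / total) * math.log2(ratio)
    return mi


# Invented counts for the [Vvg] and [noun sg] slots (illustration only).
toy_counts = {
    ("leaving", "question"): 6,
    ("leaving", "issue"): 3,
    ("leaving", "noun_x"): 4,
    ("setting", "question"): 2,
    ("setting", "issue"): 1,
    ("setting", "noun_y"): 3,
}

print(round(mutual_information(toy_counts), 3))
```

A value of zero would indicate no dependence between the two slots. With samples this small the estimate is biased upward, so in practice it would need to be compared with what chance alone produces, for instance by shuffling the fillers of one slot.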
5. Knowledge representation and access for users
The fundamental issue motivating this chapter has been the gap between the sorts of knowledge language learners need, on the one hand, and the sort of thing a corpus is, on the other. We have described StringNet as a corpus-derived knowledgebase designed to help bridge this gap, specifically with respect to lexical knowledge. But how does StringNet constitute a contribution in this respect? How is StringNet closer to the knowledge a language learner needs? As we pointed out earlier, this challenge of ‘getting closer’ in the case of lexical knowledge resources involves the two aspects of lexical knowledge discovery and lexical knowledge representation. This chapter is devoted mainly to knowledge discovery, specifically, showing how StringNet distills from corpora the patterned uses of words and the relations of these patterns and these words to each other. The main purpose has been to show how this design for a lexical knowledgebase in itself comes uniquely close to the sorts of lexical knowledge a language learner needs compared, for example, to the collection of tokens that constitutes a corpus. With the size and complexity of StringNet, the question of representation remains, however: How can learners access it? There sits this knowledge in the form of a massive cross-indexed network of lexical patterns. How can it reach learners? StringNet, due to its unique structure and content, in fact opens a wide range of new possibilities for lexical knowledge representation.¹⁴ While the aspect of representation is not the main concern of this chapter, we sketch briefly here one among the range of ways that StringNet knowledge can be represented to learners. Specifically, we describe how it can support and extend the browser-based approach laid out in Wible et al. (2006b) and Wible (2008). Wible et al. 
(2006b) and Wible (2008) describe and motivate an approach to helping learners learn collocations through a browser-based agent that appears to users as an icon on their Web browser’s toolbar. Clicking that icon triggers the agent to detect, in real time, any collocations appearing in the text of the webpage the user is currently browsing. Detected collocations then appear in a dropdown menu, from which the user can select specific collocations to highlight within the current webpage or to show multiple example sentences from BNC containing that collocation. This browser-based approach is designed on the pair of assumptions that, first, input is the key to acquiring collocations and, second, there is little within that input which differentiates collocations as such from free combinations. And here we add that it is not only collocations that fly under the radar in this way but the whole range of multiword expressions. This argument is laid out in detail in Wible (2008). An agent that can detect
14. For example, it has enabled us to rely on a single algorithm to automatically detect and correct a diverse array of learner errors. To give a sampling of the range of coverage for this one algorithm, it detects the error in my point of view and corrects it to from my point of view, detects the error enjoy to read, correcting it to enjoy reading, and detects pay attention on as an error and suggests correcting it as either pay attention to or focus attention on (Tsao & Wible 2009).
these multiword expressions in real time in the texts that users freely browse on the Web thus provides a crucial and heretofore unavailable support to learners in facing the challenge of learning multiword expressions from input. Until now we have implemented this browser-based approach in a tool called Collocator (http://toolbar.stringnet.org). As the name suggests, however, Collocator is limited to detecting collocations or two-word expressions. StringNet opens the possibility of detecting the much wider range of multiword expressions under the same conditions that Collocator detects and shows collocations during web browsing.¹⁵ Whereas Collocator can detect make...mistake, StringNet makes it possible to detect make the mistake of including or It would be a mistake to include.... A further enhancement StringNet makes possible is the capacity for the tool not simply to detect the string of words that constitutes the multiword expression in the webpage (make the mistake of including) but to show the pattern that it betokens (make the mistake of [Vvg]) and to illustrate with examples the range that this pattern encompasses (make the mistake of including/assuming/withholding...). This is due to the fact that StringNet consists of hybrid n-grams and can match any of these hybrid n-grams to all tokens that instantiate it. As a result, a browser-based tool that has StringNet as its knowledge source can not only identify for a reader a multiword token as an instance of a type or pattern, but can show that reader the particular type or pattern it instantiates and provide other tokens of this same type, illustrating the coverage of the pattern and familiarizing the reader with the members of the family of tokens that instantiate that pattern. The central contribution of this browser-based approach is that it opens up a radically different path of access to lexical resources. 
When it comes to modes of access, corpora closely resemble traditional dictionaries in a key respect. They both require the user to initiate a query (or look up a word), and this requires that the user has decided what target word or target expression to search for. This posture that dictionaries and corpora require of users is of little value in helping with gaps in the learners’ knowledge which they are unaware of in the first place, ones they would not therefore deliberately seek to address with a query or search. On our browser-based approach, in contrast, the agent on the toolbar actively discovers for the user those forms within the texts they are reading that are worth attending to. Since access to StringNet can be provided by such an agent piggybacking on the browser’s toolbar, it can show users multiword expressions which they may never have thought to look up on their own initiative and which are found in the texts they have freely chosen to read. Thus rather than being overwhelmed or confused by needing to navigate such a huge lexical knowledgebase, learners are shown only those exact patterns and examples the agent brings to their attention, and these are only the ones relevant to the text they are currently reading. While this is by no means the sole application that StringNet is intended
15. Currently StringNet includes expressions (or hybrid n-grams) ranging from two to eight grams in length.
to support, it is an illustrative one which attends to the fundamental issue of lexical knowledge representation in a learner-centered way.
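The step from a detected token to its pattern and to sibling tokens can be sketched as a simple lookup. The Python fragment below is a toy illustration and does not reproduce the Collocator or StringNet implementation. The `PATTERNS` table and the `detect` function are hypothetical, and the table holds only the one pattern and the three verbs cited in this section.

```python
import re
from typing import List, Tuple

# (fixed part, open slot, attested fillers). The fillers are the three
# examples given in the text; the table itself is a hypothetical stand-in.
PATTERNS: List[Tuple[str, str, List[str]]] = [
    ("make the mistake of", "[Vvg]", ["including", "assuming", "withholding"]),
]


def detect(page_text: str) -> List[Tuple[str, str, List[str]]]:
    """For each known token found in the text, return the matched string,
    the pattern it instantiates, and the other attested fillers."""
    hits = []
    for fixed, slot, fillers in PATTERNS:
        for filler in fillers:
            surface = f"{fixed} {filler}"
            if re.search(r"\b" + re.escape(surface) + r"\b",
                         page_text, re.IGNORECASE):
                others = [f for f in fillers if f != filler]
                hits.append((surface, f"{fixed} {slot}", others))
    return hits


text = "It is easy to make the mistake of assuming too much."
for surface, pattern, others in detect(text):
    print(surface, "->", pattern, "| also:", ", ".join(others))
```

A realistic agent would match on lexemes and POS tags rather than literal strings, so that inflected variants such as made the mistake of assuming would also be found.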
6. An emergent langue
Corpus and usage-based linguists have typically eschewed as far as possible any traffic with abstractions such as langue. Perhaps this has been because attempts to approximate such abstractions have traditionally entailed theorizing beyond the warrants of data, valuing elegance over coverage and too casual an acceptance of Sapir’s aphorism that “all grammars leak.” While shunning abstractions may have its reasons, language teachers still need something to sate classes of thirsty learners. Perhaps the problem is not with abstraction per se but with how we arrive at it, whether by theorizing or rather by distilling. Stereotypically it is approached by brilliant theorizing that sees leakage as a small price to pay for beauty and simplicity. But abstraction can be arrived at instead by a different sort of simplicity. By the simple but relentless work of a spider, making every single simple connection afforded by the raw data at hand and then in turn whatever new connections become possible by exploiting the structure from these first ones, then more from those, and so on. Such a simple process might require not brilliant theorizing but loyalty to data in the extreme. Certainly, the sort of abstraction that emerges will be less wieldy, more unruly, more organic than the sort of langue typically envisioned as a grammar. But, perhaps for this very reason, it just may hold water.
References
Biber, D., Conrad, S. & Cortes, V. 2004. If you look at...: Lexical bundles in university teaching and textbooks. Applied Linguistics 25(3): 371–405.
Biber, D., Johansson, S., Leech, G., Conrad, S. & Finegan, E. 1999. The Longman Grammar of Spoken and Written English. London: Longman.
Biber, D. & Conrad, S. 1999. Lexical bundles in conversation and academic prose. In Out of Corpora: Studies in Honor of Stig Johansson, H. Hasselgard & S. Oksefjell (eds), 181–189. Amsterdam: Rodopi.
Biber, D., Conrad, S. & Cortes, V. 2003. Lexical bundles in speech and writing: An initial taxonomy. In Corpus Linguistics by the Lune, A. Wilson, P. Rayson & T. McEnery (eds), 71–93. Frankfurt: Peter Lang.
Burnard, L. (ed.) 2007. Reference Guide for the British National Corpus (XML Edition). Published for the British National Corpus Consortium by the Research Technologies Service at Oxford University Computing Services, February.
Bolinger, D. 1977. Idioms have relations. Forum Linguisticum 2: 157–169.
Bolinger, D. 1985. Defining the indefinable. In Dictionaries, Lexicography and Language Learning, R. Ilson (ed.), 69–73. Oxford: Pergamon Press.
Cheng, W., Greaves, C., Sinclair, J.M. & Warren, M. 2009. Uncovering the extent of the phraseological tendency: Towards a systematic analysis of concgrams. Applied Linguistics 30(2): 236–252.
Cheng, W., Greaves, C. & Warren, M. 2006. From n-gram to skipgram to concgram. International Journal of Corpus Linguistics 11(4): 411–433.
Church, K.W. & Hanks, P. 1990. Word association norms, mutual information, and lexicography. Computational Linguistics 16(1): 22–29.
Church, K.W., Gale, W., Hanks, P. & Hindle, D. 1991. Using statistics in lexical analysis. In Lexical Acquisition: Using On-line Resources to Build a Lexicon, U. Zernik (ed.), 115–164. Hillsdale NJ: Lawrence Erlbaum Associates.
Dunning, T. 1993. Accurate methods for the statistics of surprise and coincidence. Computational Linguistics 19(1): 61–74.
Emerson, R.W. 1982. Nature. In Nature and Selected Essays. New York NY: Penguin.
Evert, S. & Krenn, B. 2001. Methods for the qualitative evaluation of lexical association measures. In Proceedings of the 39th Annual Meeting of the Association of Computational Linguistics, B. L. Webber (ed.), 188–195. Toulouse, France.
Hoey, M. 2004. A theory for TaLC? The textual priming of lexis. In Corpora and Language Learners [Studies in Corpus Linguistics 17], G. Aston, S. Bernardini & D. Stewart (eds), 21–44. Amsterdam: John Benjamins.
Johns, T. 1994. From printout to handout: Grammar and vocabulary teaching in the context of Data-driven Learning. In Perspectives on Pedagogical Grammar, T. Odlin (ed.), 293–313. Cambridge: CUP.
Kilgarriff, A., Rychly, P., Smrz, P. & Tugwell, D. 2004. The sketch engine. In Proceedings of the Eleventh EURALEX International Congress, G. Williams & S. Vessier (eds), 105–116. Lorient: UBS.
Manning, C. & Schütze, H. 1999. Foundations of Statistical Natural Language Processing. Cambridge MA: The MIT Press.
Mishan, F. 2004. Authenticating corpora for language learning: A problem and its resolution. ELT Journal 58(3): 219–227.
Read, J. & Nation, P. 2004. Measurement of formulaic sequences. In Formulaic Sequences: Acquisition, Processing, and Use [Language Learning & Language Teaching 9], N. Schmitt (ed.), 23–35. Amsterdam: John Benjamins.
Simpson-Vlach, R. & Ellis, N. 2010. An academic formulas list: New methods in phraseology research. Applied Linguistics 31(4): 487–512.
Tsao, N.-L. & Wible, D. 2009. A method for unsupervised error detection and correction. Paper presented at the North American Association of Computational Linguistics (NAACL) Conference, Workshop on Innovative Uses of NLP for Building Educational Applications. Boulder, Colorado, June, 2009.
Wible, D. 2008. Multiword expressions and the digital turn. In Phraseology in Foreign Language Learning and Teaching, F. Meunier & S. Granger (eds), 163–181. Amsterdam: John Benjamins.
Wible, D. & Tsao, N.-L. 2010. StringNet as a computational resource for discovering and representing linguistic constructions. Paper presented at the North American Association of Computational Linguistics (NAACL) Conference, Workshop on Extracting and Using Constructions in Computational Linguistics. June, 2010.
Wible, D., Kuo, C.-H., Chen, M.C., Tsao, N.-L. & Hung, T.F. 2006a. A computational approach to the discovery and representation of lexical chunks. Paper presented at TALN 2006, Leuven, Belgium, April, 2006.
Wible, D., Kuo, C.-H., Chen, M.C., Tsao, N.-L. & Hung, T.F. 2006b. A ubiquitous agent for unrestricted vocabulary learning in noisy digital environments. Lecture Notes in Computer Science 4053: 503–512.
Widdowson, H.G. 2000. On the limitations of linguistics applied. Applied Linguistics 21(1): 3–25.
Automating the creation of dictionaries
Where will it all end?
Michael Rundell and Adam Kilgarriff
The relationship between dictionaries and computers goes back around 50 years. But for most of that period, technology’s main contributions were to facilitate the capture and manipulation of dictionary text, and to provide lexicographers with greatly improved linguistic evidence. Working with computers and corpora had become routine by the mid-1990s, but there was no real sense of lexicography being automated. In this article we review developments in the period since 1997, showing how some of the key lexicographic tasks are beginning to be transferred, to a significant degree, from humans to machines. A recurrent theme is that automation not only saves effort but often leads to a more reliable and systematic description of a language. We close by speculating on how this process will develop in years to come.
1. Introduction
This paper describes the process by which – over a period of 50 years or so – several important aspects of dictionary creation have been gradually transferred from human editors to computers. We begin by looking at the early impact of computer technology, up to and including the groundbreaking COBUILD project of the 1980s. The period that immediately followed saw major advances in the areas of corpus building and corpus software development, and the first dedicated dictionary writing systems began to appear. These changes – important though they were – did not significantly advance the process of automation. Our main focus is on the period from the late 1990s to the present. We show how a number of lexicographic tasks, ranging from corpus creation to example writing, have been automated to varying degrees. We then look at several areas where further automation is achievable and indeed already being planned. Finally, we speculate on how much further this process might have to run, and on the implications for dictionaries, dictionary-users, and dictionary-makers.
2. Computers meet lexicography: From the 1960s to the 1990s
The great dictionaries of the 18th and 19th centuries were created using basic technologies: pen, paper, and index cards for the lexicography, hot metal for the typesetting and printing. In the English-speaking world, the principle that a dictionary should be founded on objective language data was established by Samuel Johnson, and applied on a much larger scale by James Murray and his collaborators on the Oxford English Dictionary (OED; Murray et al. 1928). The task of collecting source material – citations extracted from texts – was immensely laborious. Johnson employed half a dozen assistants to transcribe illustrative sentences which he had identified in the course of his extensive reading, while the OED’s ‘corpus’ – running into several million handwritten ‘slips’ – was collected over several decades by an army of volunteer readers. And this was only the first stage in the dictionary-making process. In all of its components, the job of compiling a dictionary was extraordinarily labour-intensive. Johnson’s references to ‘drudgery’ are well-known, but Murray’s letters testify even more eloquently to the stress, exasperation, exhaustion and despair which haunted his life as the OED was painstakingly assembled (Murray 1979, esp. Ch. XI).
It was Laurence Urdang – as Editor of the Random House Dictionary of the English Language (Stein & Urdang 1966) – who first saw the potential of computers to facilitate and rationalize the capture, storage and manipulation of dictionary text.1 From this point, the idea of the dictionary as a database, in which each of the components of an entry has its own distinct field, became firmly established. An early benefit of this approach was that cross-references could be checked more systematically: the computer generated an error report of any cross-references that did not match up, and errors would then be dealt with manually.
An extremely dull task was thus transferred from humans to computers, but with the added benefit that the computers made a much better job of it. And when learner’s dictionaries began to control the language of definitions by using a limited defining vocabulary (DV), similar methods could be used to ensure that proscribed words were kept out. In a further development, the first edition of the Longman Dictionary of Contemporary English (LDOCE1; Procter 1978) included some categories of data (notably a complex system of semantic coding) which were never intended to appear in the dictionary itself. In projects like these, the initial text-compilation process remained largely unchanged, but subsequent editing was typically done on pages created by line printers, with the revisions keyed into the database by technicians.
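The two routine checks mentioned here, cross-reference validation and defining-vocabulary compliance, can be illustrated with a minimal sketch. It is not a reconstruction of any publisher's system: the entry structure, the toy defining vocabulary and the function names `dangling_xrefs` and `dv_violations` are all assumptions made for illustration, and a real check would lemmatize definition words before comparing them with the DV.

```python
import re

# Hypothetical entry records: headword -> definition text and cross-references.
entries = {
    "car": {"definition": "a road vehicle with an engine and four wheels",
            "xrefs": ["vehicle"]},
    "vehicle": {"definition": "a machine used for carrying people",
                "xrefs": ["automobile"]},
}

# Toy defining vocabulary (assumption); a real DV holds a few thousand words.
defining_vocabulary = {
    "a", "road", "vehicle", "with", "an", "engine", "and", "four",
    "wheels", "machine", "used", "for", "people",
}

def dangling_xrefs(entries):
    """Return (headword, target) pairs whose target is not itself a headword."""
    return sorted(
        (headword, target)
        for headword, entry in entries.items()
        for target in entry["xrefs"]
        if target not in entries
    )

def dv_violations(entries, dv):
    """Return (headword, word) pairs for definition words outside the DV.

    Words are compared as they appear; no lemmatization is attempted here.
    """
    report = []
    for headword, entry in entries.items():
        for word in re.findall(r"[a-z]+", entry["definition"].lower()):
            if word not in dv:
                report.append((headword, word))
    return sorted(report)

xref_report = dangling_xrefs(entries)
dv_report = dv_violations(entries, defining_vocabulary)
```

As in the workflow described above, the output is only an error report; deciding how to repair each item remains a manual step.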
1. We are aware that our detailed knowledge relates mainly to developments in English language lexicography. We apologise in advance for our Anglocentrism and any exaggerated claims it has led to.
2.1 Year Zero: The COBUILD project
Some time around 1981 marks Year Zero for modern lexicography. The COBUILD project brought many innovations in lexicographic practices and editorial styles (as described in Sinclair 1987), but our focus here is on the impact of technology, and its potential to take on some of the tasks traditionally performed by humans. Computers were central to the COBUILD approach from the start. Like the visible tip of an iceberg, the eventual dictionary would be derived from a more extensive database, and lexicographers created their entries using an array of coloured slips to record information of different types (Krishnamurthy 1987). Every linguistic fact the lexicographers identified would be supported by empirical evidence in the form of corpus extracts. For the first time, a large-scale description of English was created from scratch to reflect actual usage as illustrated in (what was then) a large and varied corpus of texts. The systematic application of this corpus-based methodology represents a paradigm shift in lexicography. What was revolutionary in 1981 is now, a generation later, the norm for any serious lexicographic enterprise. But from the point of view of the human-machine balance, COBUILD’s advances were relatively modest. Corpus creation was still a laborious business. As the use of scanners supplemented keyboarding, data capture was somewhat less arduous than the methods available to Henry Kučera two decades earlier, when he used punched cards to turn a million words of text into the Brown Corpus (Kučera & Francis 1967). But like their predecessors at Brown, the COBUILD developers were testing available technology to its limits, and building the corpus on which the dictionary would be based involved heroic efforts (Renouf 1987). As for the lexicographic team, few ever got their hands on a computer. 
Concordances were available in the form of microfiche printouts, and the fruits of their analysis were written in longhand – the slips then being handed over to a separate team of computer officers responsible for data-entry.
2.2 The 80s and 90s
The fifteen years or so that followed saw quite rapid technical advances. Computers moved from being large and expensive machines available only to specialists, to become everyday objects to be found on most desks in the developed world. This has brought vast changes to many aspects of our lives. During this period, corpora became larger by an order of magnitude, and improved corpus-query systems (CQS) enabled lexicographers to search the data more efficiently. The constituent texts of a corpus were now routinely annotated in various ways. Forms of annotation included tokenization, lemmatization, and part-of-speech tagging (see Grefenstette 1998: 28–34 and Atkins & Rundell 2008: 84–92 for summaries), and this allowed more sophisticated, better-targeted searches. From the beginning of the 1990s, it became normal for lexicographers to work on their own computers rather than depending on technical staff
for data-entry, and the first generation of dedicated dictionary-writing systems (DWS) were created. By the late 1990s, the use of computers in data analysis and dictionary compilation was standard practice (at least for English). But to what extent was lexicography ‘automated’ at this point? Corpus creation remained a resource-intensive business. Corpus analysis was easier and faster, but lexicographers found themselves handling far more data. From the point of view of producing more reliable dictionary entries, access to higher volumes of data was a good thing. But scanning several thousand concordance lines for a word of medium frequency (within the time constraints of a typical dictionary project) is a demanding task – in a sense, a new form of drudgery for the lexicographer. On the entry-writing front, the new DWS made life somewhat easier. When we use this kind of software, the overall shape of an entry is controlled by a ‘dictionary grammar’. This in turn implements the decisions made in the dictionary’s style guide about how the many varieties of lexical facts are to be classified and presented. Data fields such as style labels, syntax codes, and part-of-speech markers have a closed set of possible contents which can be presented to the compiler in drop-down lists. Lexicographers no longer have to remember whether a particular feature should appear in bold or italics, whether a colloquial usage is labelled ‘inf ’, ‘infml’ or ‘informal’, and so on. In areas like these, human error is to a large extent engineered out of the writing process. A good DWS also facilitates the job of editing. For example, an editor will often want to restructure long entries, changing the ordering or nesting of senses and other units. This is a hard intellectual task, but the DWS can at least make it a technically easy one. 
Meanwhile, some essential but routine checks – cross-reference validation, defining vocabulary compliance, and so on – are now fully automated, taking place at the point of compilation with little or no human intervention. With more linguistic data at their disposal and better software to exploit it, and with compilation programs which strangle some classes of error at birth, support the editing process, and quietly handle a range of routine checks, lexicographers now had the tools to produce better dictionaries: dictionaries which gave a more accurate account of how words are used, and presented it with a degree of consistency which was hard to achieve in the pre-computer age. Whether this makes life easier for lexicographers is another question. Delegating low-level operations to computers is clearly a benefit for all concerned. The computers do the things they are good at (and do them more efficiently than humans), while the lexicographers are relieved of the more tedious, undemanding tasks and thus free to focus on the harder, more creative aspects of dictionary-writing. But the effect of these advances is limited. The core tasks of producing a dictionary still depend almost entirely on human effort, and there is no sense, at this point, of lexicography being automated.
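The closed-set fields described in this section (style labels, part-of-speech markers and the like) are what allow a dictionary-writing system to reject a label such as ‘infml’ when the style guide prescribes ‘informal’. The sketch below illustrates the principle only; the field names and permitted values are assumptions, not the dictionary grammar of any actual DWS.

```python
# Hypothetical closed sets; a real style guide would define its own.
ALLOWED_VALUES = {
    "pos": {"noun", "verb", "adjective", "adverb"},
    "style": {"informal", "formal", "literary"},
}

def validate_entry(entry):
    """Return (field, value) pairs whose value is outside the field's closed set."""
    problems = []
    for field, allowed in ALLOWED_VALUES.items():
        value = entry.get(field)
        if value is not None and value not in allowed:
            problems.append((field, value))
    return problems

good_entry = {"headword": "kid", "pos": "noun", "style": "informal"}
bad_entry = {"headword": "kid", "pos": "noun", "style": "infml"}
```

In an editing interface the same closed sets would populate the drop-down lists, so that an invalid value could not be keyed in at all; the check above is the equivalent for text arriving from elsewhere.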
3. From 1997 to the present
What we describe above represents the state of the art in the late 1990s. For present purposes, we will take as our baseline the year 1997, which is when planning began for a new, from-scratch learner’s dictionary. If the big change to the context of working life in the 80s and 90s was that most of us (in lexicography and everywhere else) got a computer, the big change in the current period is that the computer got connected to the Internet.
When work started on the Macmillan English Dictionary for Advanced Learners (MEDAL; Rundell, ed., 2001), we had the advantage of entering the field at a point when the corpus-based methodology was well-established, and the developments described above were in place. But we faced the challenge of entering a mature market in which several high-quality dictionaries were already competing for the attention of language learners and their teachers. It was clear that any new contender could only make a mark by doing the basic things well, and by doing new things which had not been attempted before but which would meet known user needs. It was equally clear that computational methods would play a key part in delivering the desired innovations.
The rest of this paper reviews developments in the period from 1997 to the present, and discusses further advances that are still at the planning stage. The work we describe represents a collaboration between a lexicographer and a computational linguist (the authors), and shows how the job of dictionary-makers has been supported by, and in some cases replaced by, computational techniques which originate from research in the field of natural language processing (NLP). We will conclude with some speculations on the direction of this trajectory: is the end point a fully-automated dictionary? Does it even make sense to think in terms of an ‘end point’?
First, it will be helpful to give a brief inventory of the main tasks involved in creating a dictionary, so that we can assess how far we have progressed along the road to automation. They are:
– corpus creation
– headword list development
– analysis of the corpus:
  – to discover word senses and other lexical units (fixed phrases, phrasal verbs, compounds, etc.)
  – to identify the salient features of each of these lexical units
    1. their syntactic behaviour
    2. the collocations they participate in
    3. their colligational preferences
    4. any preferences they have for particular text-types or domains
– providing definitions (or translations) at relevant points
– exemplifying relevant features with material gleaned from the corpus
– editing compiled text in order to control quality and ensure consistent adherence to agreed style policies
We look at all of these, some in more detail than others.
3.1 Corpus creation
For people in the dictionary business, one of the most striking developments of the 21st century is the ‘web corpus’. Corpora are now routinely assembled using texts from the Internet and this has had a number of consequences. First, the curse of data-sparseness, which has dogged lexicography from Johnson’s time onwards, has become a thing of the past.2 The COBUILD corpus of the 1980s – an order of magnitude larger than Brown – sought to provide enough data for a reliable account of mainstream English, but its creators were only too aware of its limitations.3 The British National Corpus (BNC) – larger by another order of magnitude – was another attempt to address the issue. As new technologies have arisen to facilitate corpus creation from the web, it has become possible to create register-diverse corpora running into billions of words. Software tools such as WebBootCat (Baroni & Bernardini 2004; Baroni et al. 2006) provide a one-stop operation in which texts are selected according to user-defined parameters, ‘cleaned up’, and linguistically annotated. The timescale for creating a large lexicographic corpus has been reduced from years to weeks, and for a small corpus in a specialized domain, from months to minutes. Texts on the web are, by definition, already in digital form. The overall effect is to drastically reduce both the human effort involved in corpus creation and the ‘entry fee’ to corpus lexicography.4 Thus the process of collecting the raw data that will form the basis of a dictionary has to a large extent been automated. Inevitably there are downsides. The granularity of smaller corpora (in terms of the balance of texts, the level of detail in document headers, and the delicacy of annotation) cannot be fully replicated in corpora of several billion words. While for some types of user (e.g.
grammarians or sociolinguists) this will sometimes limit the usefulness of the corpus, for lexicographers working on general-purpose dictionaries, the benefits of abundant data outweigh most of the perceived disadvantages of web corpora. There were good reasons why the million-word Brown Corpus of 1962 was designed with such great care: a couple of ‘rogue’ texts could have had a disruptive effect on the statistics. In a billion-word corpus the occasional outlier will not compromise the overall picture. We now simply aim to ensure that the major text-types are all well represented.
2. We should perhaps add this rider: “at least for the most widely-used languages, for which many billions of words of text are now available”.
3. “Every time COBUILD doubles its corpus, we want to double it again” (Clear 1996: 266).
4. Hence, for example, there are now substantial corpora for ‘smaller’ languages such as Irish or the Bantu languages of southern Africa: Kilgarriff et al. (2006); de Schryver & Prinsloo (2000).
Concerns about the diversity of text-types available on the web have proved largely unfounded. Comparisons of web-derived corpora against benchmark collections like the BNC have produced encouraging results, suggesting that a well-designed web corpus can provide reliable language data (Sharoff 2006; Baroni et al. 2009).5
3.2 Headword lists
Building a headword list is the most obvious way to use a corpus for making a dictionary. Ceteris paribus, if a dictionary is to have N words in it, they should be the N words from the top of the corpus frequency list.
3.2.1 In search of the ideal corpus
It is never as simple as this, mainly because the corpus is never good enough. It will contain noise and biases. The noise is always evident within the first few thousand words of all the corpus frequency lists that either of us has ever looked at. In the BNC, for example, a large amount of data from a journal on gastro-uterine diseases presents noise in the form of words like mucosa – a term much-discussed in these specific documents, but otherwise rare and not known to most speakers of English.6 Bias in the spoken BNC is illustrated by the very high frequencies for words like plan, elect, councillor, statutory and occupational: the corpus contains a great deal of material from local government meetings, so the vocabulary of this area is well represented. Thus keyword lists of the BNC in contrast to other large, general corpora show these words as particularly BNC-flavoured. And unlike many of today’s large corpora, the BNC contains, by design, a high proportion of fiction. Finally, if our dictionary is to cover the varieties of English used throughout the world, the BNC’s exclusive focus on British English is another limitation.
If we turn to UKWaC (the UK ‘Web as Corpus’; Baroni et al. 2009), a web-sourced corpus of around 1.6 billion words, we find other forms of noise and bias. The corpus contains a certain amount of web spam. In particular, we have discovered that people advertising poker are skilled at producing vast quantities of ‘word salad’ which is not easily distinguished – using automatic routines – from bona fide English. Internet-related bias also shows up in the high frequencies for words like browser and configure.
While noise is simply wrong, and its impact is progressively reduced as ongoing cleanups are implemented, biases are more subtle in that they force questions about the sort of language to be covered in the dictionary, and in what proportions.7
5. See for example Keller & Lapata (2003); Fletcher (2004). For general background to web corpora, see Kilgarriff & Grefenstette (2003); Atkins & Rundell (2008: 78–80); Baroni et al. (2009).
6. In the BNC mucosa is marginally more frequent than spontaneous and enjoyment, though of course it appears in far fewer corpus documents.
7. As is now generally recognised, the notion of ‘representativeness’ is problematical with regard to general-purpose corpora like BNC and UKWaC, and there is no ‘scientific’ way of achieving it: see e.g. Atkins & Rundell (2008: 66).
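The ‘top N words of the frequency list’ principle, and the mucosa problem noted in footnote 6, can be made concrete with a small sketch. It counts both how often a word occurs and how many documents it occurs in, and sets aside words concentrated in very few documents. The whitespace tokenization, the `min_docs` cut-off and the toy documents are all assumptions for illustration; this is not the procedure used for any dictionary discussed here, and a real list would be built from lemmas rather than raw tokens.

```python
from collections import Counter

def frequency_profile(documents):
    """Count tokens overall and the number of documents each token occurs in."""
    token_freq = Counter()
    doc_freq = Counter()
    for doc in documents:
        tokens = doc.lower().split()   # naive tokenization (assumption)
        token_freq.update(tokens)
        doc_freq.update(set(tokens))   # each document counts once per word
    return token_freq, doc_freq

def headword_candidates(documents, n, min_docs=2):
    """Top-n tokens by frequency, skipping tokens found in fewer than min_docs documents."""
    token_freq, doc_freq = frequency_profile(documents)
    ranked = [word for word, _ in token_freq.most_common()
              if doc_freq[word] >= min_docs]
    return ranked[:n]

# Toy corpus: 'mucosa' is frequent overall but confined to one document.
documents = [
    "mucosa mucosa mucosa mucosa tissue",
    "the enjoyment of tissue",
    "the enjoyment",
]
token_freq, doc_freq = frequency_profile(documents)
candidates = headword_candidates(documents, n=3)
```

In the toy data mucosa tops the raw frequency list yet is excluded from the candidates because it occurs in a single document. A filter of this kind deals with noise, not with bias, which still calls for editorial decisions.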
3.2.2 Multiwords
English dictionaries have a range of entries for multiword items, typically including noun compounds (credit crunch, disc jockey), phrasal and prepositional verbs (take after, set out) and compound prepositions and conjunctions (according to, in order to). While corpus methods can straightforwardly find high-frequency single-word items and thereby provide a fair-quality first pass at a headword list for those items, they cannot do the same for multiword items. Lists of high-frequency word-pairs in any English corpus are dominated by items which do not merit dictionary entries: the string of the is usually top of any list of bigrams. We have several strategies here: one is to view multiword headwords as collocations (see discussion below) and to find multiword headwords when working through the alphabet looking at each headword in turn. Another, currently underway in the Kelly project (Kilgarriff 2010), is to explore lists of translations of single-word headwords for a number of other languages into English, and to find out what multiwords occur repeatedly.
3.2.3 Lemmatization
The words we find in texts are inflected forms; the words we put in a headword list are lemmas. So, to use a corpus list as a dictionary headword list, we need to map inflected forms to lemmas: we need to lemmatize. English is not a difficult language to lemmatize as no lemma has more than eight inflectional variants (be, am, is, are, was, were, been, being), most nouns have just two (apple, apples) and most verbs, just four (invade, invades, invading, invaded). Most other languages, of course, present a substantially greater challenge. Yet even for English, automatic lemmatization procedures are not without their problems. Consider the data in Table 1.
Table 1. Complexity in verb lemmatization rules for English
lemma | -ed, -s forms | Rule | -ing form | Rule
fix | fixed, fixes | delete -ed, -es | fixing | delete -ing
care | cared, cares | delete -d, -s | caring | delete -ing, add -e
hope | hoped, hopes | delete -d, -s | hoping | delete -ing, add -e
hop | hopped | delete -ed, undouble consonant | hopping | delete -ing, undouble consonant
 | hops | delete -s | |
fuse | fused | delete -d | fusing | delete -ing, add -e
fuss | fussed | delete -ed | fussing | delete -ing
bus (AmE) | bussed, busses?? | delete -ed/-s, undouble consonant | bussing | delete -ing, undouble consonant
bus (BrE) | bused, bused | delete -ed | busing | delete -ing
To choose the correct rule we need an analysis of the orthography corresponding to phonological constraints on vowel type and consonant type, for both British and American English.8 Even with state-of-the-art lemmatization for English, an automatically extracted lemma list will contain some errors. These and other issues in relating corpus lists to dictionary headword lists are described in detail in Kilgarriff (1997).
8. The issue came to our attention when an early version of the BNC frequency list gave undue prominence to verbal car.
3.2.4 Practical solutions
Building a headword list for a new dictionary (or revising one for an existing title) has never been an exact science, and little has been written about it. Headword lists are by their nature provisional: they evolve during a project and are only complete at the end. A good starting point is to have a clear idea of what your dictionary will be used for, and this is where the ‘user profile’ comes in. A user profile “seeks to characterize the typical user of the dictionary, and the uses to which the dictionary is likely to be put” (Atkins & Rundell 2008: 28). This is a manual task, but it provides filters with which to sift computer-generated wordlists. An approach which has been used with some success is to generate a wordlist which is (say) 20% larger than the list you want to end up with – thus, a list of 60,000 words for a dictionary of 50,000 – and then whittle it down to size taking account of the user profile. Then, if the longer list contains obsolescent terms which are used in 19th century literature, but the user profile specifies that users are all engaged with the contemporary language, these items could safely be deleted. If the user profile included literary scholarship, they could not.
3.2.5 New words
As everyone involved in commercial lexicography knows, neologisms punch far above their weight. They might not be very important for an objective description of the language but they are loved by marketing teams and reviewers. New words and phrases often mark the only obvious change in a new edition of a dictionary, and dominate the press releases. Mapping language change has long been a central concern of corpus linguists, and a longstanding vision is the ‘monitor corpus’, the moving corpus that lets the researcher explore language change objectively (Clear 1988; Janicivic & Walker 1997). The core method is to compare an older ‘reference’ corpus with an up-to-the-minute one to find words which are not already in the dictionary, and which are in the recent corpus but not in the older one. O’Donovan & O’Neill (2008) describe how this has been done at Chambers Harrap Publishers, and Fairon et al. (2008) describe a generic system in which users can specify the sources they wish to use and the terms they wish to trace.
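In its simplest form, the core method just described is a set comparison. The sketch below is an illustration under stated assumptions (pre-tokenized, lemmatized input and a function name, `new_word_candidates`, invented here); it is not the system of O’Donovan & O’Neill or of Fairon et al.

```python
def new_word_candidates(reference_tokens, recent_tokens, dictionary_headwords):
    """Words in the recent corpus that are in neither the reference corpus nor the dictionary."""
    reference = set(reference_tokens)
    known = set(dictionary_headwords)
    return sorted({word for word in recent_tokens
                   if word not in reference and word not in known})

# Toy data (assumption): tokens are already lemmatized and lower-cased.
reference_tokens = "the bank give credit".split()
recent_tokens = "the credit crunch hit the blog".split()
dictionary_headwords = {"the", "bank", "give", "credit", "hit", "crunch"}

candidates = new_word_candidates(reference_tokens, recent_tokens,
                                 dictionary_headwords)
```

Here crunch and hit are absent from the reference corpus but already in the dictionary, so only blog survives as a candidate. On real data a comparison this crude would return mostly noise, which is why further filtering is needed.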
The nature of the task is that the automatic process creates a list of candidates, and a lexicographer then goes through them to sort the wheat from the chaff. There is always far more chaff than wheat. The computational challenge is to cut out as much chaff as possible without losing the wheat – that is, the new words which the lexicography team have not yet logged but which should be included in the dictionary. For many aspects of corpus processing, we can use statistics to distinguish signal from noise, on the basis that the phenomena we are interested in are common ones and occur repeatedly. But new words are usually rare, and by definition are not already known. Thus lemmatization is particularly challenging since the lemmatizer cannot make use of a list of known words. So for example, in one list we found the ‘word’ authore, an incorrect but understandable lemmatization of authored, past participle of the unknown verb author. For new-word finding we will want to include items in a candidate list even though they occur just once or twice. Statistical filtering can therefore only be used minimally. We are exploring methods which require that a word that occurred once or twice in the old material occurs in at least three or four documents in the new material, to make its way onto the candidate list. We use some statistical modulation to capture new words which are taking off in the new period, as well as the items that simply have occurred where they never did before. Many items that occur in the new words list are simply typing errors. This is another reason why it is desirable to set a threshold higher than one in the new corpus. We have found that almost all hyphenated words are chaff, and often relate to compounds which are already treated in the dictionary as ‘solid’ or as multiword items. 
English hyphenation rules are not fixed: most word pairs that we find hyphenated (sand-box) can also be found written as one word (sandbox), as two (sand box), or as both. With this in mind, to minimize chaff, we take all hyphenated forms and two- and three-word items in the dictionary and ‘squeeze’ them so that the one-word version is included in the list of already-known items, and we subsequently ignore all the hyphenated forms in the corpus list. Prefixes and suffixes present a further set of items. Derivational affixes include both the more syntactic (-ly, -ness) and the more semantic (-ish, geo-, eco-).9 Most are chaff: we do not want plumply or ecobuddy or gangsterish in the dictionary, because, even though they all have Google counts in the thousands, they are not lexicalized and there is nothing to say about them beyond what there is to say about the lemma, the affix and the affixation rule. The ratio of wheat to chaff is low, but amongst the nonce productions there are some which are becoming established and should be considered for the dictionary. So we prefer to leave the nonce formations in place for the lexicographer to run their eye over.
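The ‘squeezing’ step and the frequency thresholds discussed in this section can be combined in a short sketch. The exact thresholds (`max_old=2`, `min_new=3`) are one reading of ‘once or twice’ and ‘three or four documents’, and the data structures and function names are assumptions for illustration rather than a description of the authors' implementation.

```python
import re

def squeeze(item):
    """Collapse a hyphenated or multiword item into its one-word form."""
    return re.sub(r"[-\s]+", "", item.lower())

def filter_candidates(old_counts, new_doc_counts, dictionary_items,
                      max_old=2, min_new=3):
    """Filter a new-word candidate list.

    old_counts: word -> occurrences in the reference corpus
    new_doc_counts: word -> number of documents in the new corpus
    """
    known = {squeeze(item) for item in dictionary_items}
    kept = []
    for word, new_docs in new_doc_counts.items():
        if "-" in word:                # ignore all hyphenated corpus forms
            continue
        if squeeze(word) in known:     # already covered, e.g. as 'sand box'
            continue
        if old_counts.get(word, 0) <= max_old and new_docs >= min_new:
            kept.append(word)
    return sorted(kept)

# Toy data (assumption).
dictionary_items = ["sand box", "credit crunch", "author"]
old_counts = {"staycation": 1, "mucosa": 40}
new_doc_counts = {"sandbox": 10, "sand-box": 4, "staycation": 5,
                  "teh": 1, "mucosa": 50, "eco-buddy": 7}

kept = filter_candidates(old_counts, new_doc_counts, dictionary_items)
```

In the toy data only staycation passes: sandbox matches the squeezed dictionary item sand box, the hyphenated forms are ignored, teh (a typing error) occurs in too few documents, and mucosa was already common in the reference corpus. Affixed nonce formations are deliberately not filtered, in line with the preference stated above for leaving them to the lexicographer.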
9. Here we exclude inflectional morphemes, addressed under lemmatization above: in English a distinction between inflectional and derivational morphology is easily made.
For the longer term, the biggest challenge is acquiring corpora for the two time periods which are sufficiently large and sufficiently well-matched. If the new corpus is not big enough, the new words will simply be missed, while if the reference corpus is not big enough, the lists will be full of false positives. If the corpora are not well-matched and, for example, the new corpus contains a document on vulcanology and the reference corpus does not, the list will contain words which are specialist vocabulary rather than new, like resistivity and tephrochronology. While vast quantities of data are available on the web, most of it does not come with reliable information on when the document was originally written. While we can say with confidence that a corpus collected from the web in 2009 represents, overall, a more recent phase of the language than one collected in 2008, when we move to words with small numbers of occurrences, we cannot trust that words from the 2009 corpus are from more recently-written documents than ones from the 2008 corpus. Two text types where date-of-writing is usually available are newspapers and blogs. Both of these have the added advantage that they tend to be about current topics and are relatively likely to use new vocabulary. Our current efforts for new-word-detection involve large-scale gathering of one million words of newspapers and blogs per day. The collection started in early 2009 and we need to wait at least one year or possibly two before we can assess what it achieves. Over a shorter time span lists will be dominated by short-term items and items related to the time of year. It will take a longer view to support the automatic detection of new words which have become established and have earned their place in the dictionary.
3.3
Collocation and word sketches
As in most areas of life, new ways of doing things typically evolve in response to known difficulties. What has tended to happen in the dictionary-development sphere is that we first identify a lexicographic problem, and then consider whether NLP techniques have anything to offer in the way of solutions. And when computational solutions are devised, we find – as often as not – that they have unforeseen consequences which go beyond the specific problem they were designed to address. When planning a new dictionary, it is good to pay attention to what other dictionaries are doing, and to consider whether you can do the same things but do them better. But this is not enough. It is also important to look at emerging trends at the theoretical level and at their practical implications for language description. Collocation is a good example. The arrival of large corpora provided the empirical underpinning for a Firthian view of vocabulary, and – thanks to the work of John Sinclair and others – collocation became a core concept within the language-teaching community. Books such as Lewis (1993) and McCarthy & O’Dell (2005) helped to show the relevance of collocation at the classroom level, but in 1997 learner’s dictionaries had not yet caught up: they showed an awareness of the concept, but their coverage of collocation was patchy and unsystematic. This represented an opportunity for MEDAL.
Michael Rundell and Adam Kilgarriff
The first author described the problem to the second, who felt it should be possible to find all common collocations for all common words automatically, by using a shallow grammar to identify all verb-object pairs, subject-verb pairs, modifier-modifiee pairs and so on, and then to apply statistical filtering to give a fairly clean list, as proposed by Tapanainen & Järvinen (1998; and for the statistics, Church & Hanks 1990). The project would need a very large, part-of-speech-tagged corpus of general English: this had recently become available in the form of the British National Corpus. First experiments looked encouraging: the publisher contracted the researcher to proceed with the research, and the first versions of word sketches were created. A word sketch is a one-page, corpus-based summary of a word’s grammatical and collocational behaviour, as illustrated in Figure 1.
Figure 1. Part of a word sketch for return (noun)
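The two steps described above (extracting grammatical-relation pairs, then filtering them statistically) can be sketched as follows. The sketch assumes that a shallow grammar has already produced (relation, headword, collocate) triples; the scoring is a mutual-information-style measure in the spirit of Church & Hanks (1990), computed within each relation. The function name, the frequency threshold and the toy data are assumptions, and the measure actually used for the word sketches may differ.

```python
import math
from collections import Counter


def score_collocations(triples, min_count=2):
    """Score (relation, headword, collocate) triples with a pointwise
    mutual information measure computed separately for each relation."""
    triples = list(triples)
    triple_counts = Counter(triples)
    head_counts = Counter((rel, head) for rel, head, _ in triples)
    coll_counts = Counter((rel, coll) for rel, _, coll in triples)
    rel_counts = Counter(rel for rel, _, _ in triples)

    scored = []
    for (rel, head, coll), n in triple_counts.items():
        if n < min_count:
            continue  # a crude statistical filter: drop rare pairs
        pmi = math.log2(
            n * rel_counts[rel]
            / (head_counts[(rel, head)] * coll_counts[(rel, coll)])
        )
        scored.append((rel, head, coll, n, pmi))
    # Strongest associations first.
    return sorted(scored, key=lambda row: -row[4])


toy = (
    [("object_of", "return", "expect")] * 3
    + [("object_of", "return", "make")]
    + [("object_of", "decision", "make")] * 5
)
for row in score_collocations(toy):
    print(row)
```

Grouping the output by relation and listing the top-scoring collocates under each would give the general shape of a word sketch.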
As the lexicographers became familiar with the software, it became apparent that word sketches did the job they were designed to do. Each headword’s collocations could be listed exhaustively, to a far greater degree than was possible before. That was the immediate goal. But analysis of a word’s sketch also tended to show, through its collocations, a wide range of the patterns of meaning and usage that it entered into. In most cases, each of a word’s different meanings is associated with particular collocations, so the collocates listed in the word sketches provided valuable prompts in the key task of identifying and accounting for all the word’s meanings in the entry. The word sketches functioned not only as a tool for finding collocations, but also as a useful guide to the distinct senses of a word – the analytical core of the lexicographer’s job (Kilgarriff & Rundell 2002). Prior to the advent of word sketches, the primary means of analysis in corpus lexicography was the reading of concordances. Since the earliest days of the COBUILD project, the lexicographers scanned concordance lines – often in their thousands – to find all the patterns of meaning and use. The more lines were scanned, the more patterns would tend to be found (though with diminishing returns). This was good and objective, but also difficult and time-consuming. Dictionary publishers are always looking to save time, and hence budgets. Earlier efforts to offer computational support were based on finding frequently co-occurring words in a window surrounding the headword (Church & Hanks 1990). While these approaches had generated plenty of interest among university researchers, they were not taken up as routine processes by lexicographers: the ratio of noise to signal was high, the first impression of a collocation list was of a basket of earth with occasional glints of possible gems needing further exploration, and it took too long to use them for every word. 
But early in the MEDAL project, it became clear that the word sketches were more like a contents page than a basket of earth. They provided a neat summary of most of what the lexicographer was likely to find by the traditional means of scanning concordances. There was not too much noise. Using them saved time. It was more efficient to start from the word sketch than from the concordance. Thus the unexpected consequence was that the lexicographer’s methodology changed, from one where the technology merely supported the corpus-analysis process, to one where it pro-actively identified what was likely to be interesting and directed the lexicographer’s attention to it. And whereas, for a human, the bigger the corpus, the greater the problem of how to manage the data, for the computer, the bigger the corpus, the better the analyses: the more data there is, the better the prospects for finding all salient patterns and for distinguishing signal from noise. Though originally seen as a useful supplementary tool, the sketches provide a compact and revealing snapshot of a word’s behaviour and uses and have, in most cases, become the preferred starting point in the process of analyzing complex headwords.
3.4
Word sketches and the Sketch Engine since 2004
Since the first word sketches were used in the late 1990s in the development of the first edition of MEDAL, word sketches have been integrated into a general-purpose corpus query tool, the Sketch Engine (Kilgarriff et al. 2004) and have been developed for a dozen languages (the list is steadily growing). They are now in use for commercial and academic lexicography in the UK (where most of the main dictionary publishers use them), China, the Czech Republic, Germany, Greece, Japan, the Netherlands, Slovakia, Slovenia and the USA, and for language and linguistics teaching all round the world. Word sketches have been complemented by an automatic thesaurus (which identifies the words which are most similar, in terms of shared collocations, to a target word) and a range of other tools including ‘sketch difference’, for comparing and contrasting a word with a near-synonym or antonym in terms of collocates shared and not shared. There are also options such as clustering a word’s collocates or its thesaurus entries. The largest corpus for which word sketches have been created so far contains over five billion words (Pomikálek et al. 2009). In a quantitative evaluation, two thirds of the collocates in word sketches for five languages were found to be ‘publishable quality’: a lexicographer would want to include them in a published collocations dictionary for the language (Kilgarriff et al. 2010).
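The thesaurus and ‘sketch difference’ ideas can both be expressed as set operations over collocations. The sketch below assumes each word’s sketch is reduced to a set of (grammatical relation, collocate) pairs and uses Jaccard overlap as the similarity measure; that measure, the function names and the toy data are assumptions for illustration, not a description of the Sketch Engine’s own computation.

```python
def shared_collocation_similarity(sketch_a, sketch_b):
    """Jaccard overlap between two sets of (gramrel, collocate) pairs."""
    if not sketch_a and not sketch_b:
        return 0.0
    return len(sketch_a & sketch_b) / len(sketch_a | sketch_b)


def sketch_difference(sketch_a, sketch_b):
    """Split the collocates of two words into shared and unshared sets."""
    return {
        "shared": sketch_a & sketch_b,
        "only_first": sketch_a - sketch_b,
        "only_second": sketch_b - sketch_a,
    }


# Toy data, invented for illustration.
clever = {("modifies", "idea"), ("modifies", "trick"), ("and/or", "funny")}
intelligent = {("modifies", "idea"), ("modifies", "life"), ("and/or", "funny")}

print(shared_collocation_similarity(clever, intelligent))
print(sketch_difference(clever, intelligent))
```

Ranking all candidate words by such a similarity score against a target word yields a thesaurus entry; the difference function gives the contrastive view for a pair of near-synonyms.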
3.5
Word sketches and the Sketch Engine in the NEID project
The New English-Irish Dictionary (NEID) is a project funded by Foras na Gaeilge, the statutory language board for Ireland, and planned by the Lexicography MasterClass.10 It has provided a setting for a range of ambitious ideas about how we can efficiently create ever more detailed and accurate descriptions of the lexis of a language. The project makes a clear divide between the ‘source-language analysis’ phase of the project, and the translation and final-editing phases. A consequence is that the analysis phase is an analysis of English in which the target language (Irish) plays no part, and the resulting ‘Database of ANalysed Texts of English’ (DANTE) is a database with potential for a range of uses in lexicography and language technology. It could be used, for example, as a launchpad for bilingual dictionaries with a different target language, or as a resource for improving machine translation systems or text-remediation software. The Lexicography MasterClass undertook the analysis phase, with a large team of experienced lexicographers, over the period 2008–2010.11 The project has used the Sketch Engine with a corpus comprising UKWaC plus the English-language part of the New Corpus for Ireland (Kilgarriff et al. 2006). In the course of the project, three innovations were added to the standard word sketches.
10. http://www.lexmasterclass.com.
11. For an account see Atkins et al. (2010).
3.5.1 Customization of sketch grammar
Any dictionary uses a particular grammatical scheme in its choice of the repertoire and meaning of the grammatical labels it attaches to words. The Sketch Engine also uses a grammatical scheme in its ‘Sketch Grammar’, which defines the grammatical relations according to which it will classify collocations in the word sketches: object_of, and/or etc. in Figure 1. The Sketch Grammar also gives names to the grammatical relations. This raises the prospect of mapping the grammatical scheme that is specified in a dictionary’s Style Guide onto the scheme in the Sketch Grammar. In this way, there will be an exact match between the inventory of grammatical relations in the dictionary, and those presented to the lexicographer in the word sketch. A relation that is called NP_PP, for a verb such as load (load the hay onto the cart) in the lexical database will be called NP_PP, with exactly the same meaning, in the word sketch. Such an approach will simplify and rationalize the analysis process for the lexicographer: for the most part s/he will be copying a collocate of type X in the word sketch, into a collocate of type X (under the relevant sense of the headword) in the dictionary entry s/he is writing. The NEID was the first project where the Sketch Grammar and Dictionary Grammar were fully harmonized: the Sketch Grammar was customized to express the same grammatical constructions and collocation-types, with the same names, as the lexicographers would use in their analysis. Another Macmillan project (the Macmillan Collocations Dictionary; Rundell, ed., 2010) subsequently used the same approach.
3.5.2 ‘Constructions list’ as top-level summary of word sketch
The dictionary grammar for the NEID project is quite complex and fine-grained. In the case of verbs, for example, any of 43 different structures may be recorded. Consequently we soon found that word sketches were often rather large and hard to navigate around.
To address this, we introduced an ‘index’, which appears right at the top of the word sketch and summarizes its contents by listing the constructions that are most salient for that word (cf. Figure 2). In other cases, we found that there were a large number of constructions involving prepositions and particles, and that these could make the word sketch unwieldy. To address this, we collected all the preposition/particle relations on a separate web page, as in Figure 3.
3.5.3 ‘More data’ and ‘Less data’ buttons
The size of a word sketch is (inevitably) constrained by parameters which determine how many collocates and constructions are shown. The Sketch Engine has always allowed users to change the parameters, but most users are either unaware of the possibility or are not sure which parameters they should change or by how much. A simple but much-appreciated addition to the interface was ‘More data’ and ‘Less data’ buttons so the user can, at a single click, see less data (if they are feeling overwhelmed) or more data (if they have accounted for everything in the word sketch in front of them, but feel they have missed something or not said enough).
Figure 2. Part of a word sketch for remember (verb), where the verb’s main syntactic patterns appear in the box at top left
Figure 3. Word sketch for argue, showing part of the page devoted to prepositional phrases
3.6
Labels
Dictionaries use a range of labels (such as usu pl., informal, Biology, AmE) to mark words according to their grammatical, register, domain, and regional-variety characteristics, whenever these deviate significantly from the (unmarked) norm. All of these are facts about a word’s distribution, and all can, in principle, be gathered automatically from a corpus. In each of these four cases, computationalists are currently able to propose some labels to the lexicographer, though there remains much work to be done. In each case the methodology is to:
– specify a set of hypotheses
  – there will usually be one hypothesis per label, so grammatical hypotheses for the category ‘verb’ may include:
    – is it often/usually/always passive
    – is it often/usually/always progressive
    – is it often/usually/always in the imperative
– for each word
  – test all relevant hypotheses
  – for all hypotheses that are confirmed, add the information to the word sketch.
Where no hypotheses are confirmed – when, in other words, there is nothing interesting to say, which will be the usual case – nothing is added to the word sketch.
3.6.1 Grammatical labels: usu. pl, usu. passive, etc.
To determine whether a noun should be marked as ‘usually plural’, it is possible simply to count the number of times the lemma occurs in the plural, and the number of times it occurs overall, and divide the first number by the second to find the proportion. Similarly, to discover how often a verb is passivized, we can count how often it is a past participle preceded by a form of the verb be (with possible intervening adverbs) and determine what fraction of the verb’s overall frequency the passive forms represent. Given a lemmatized, part-of-speech-tagged corpus, this is straightforward. A large number of grammatical hypotheses can be handled in this way. The next question is: when is the information interesting enough to merit a label in a dictionary?
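The passive count just described can be sketched as follows, assuming each sentence is a list of (lemma, tag) pairs. The tag names (VVN for a past participle, AV0 for an adverb, any tag beginning VV for a lexical verb) loosely follow the BNC’s tagging conventions but are assumptions for this sketch, as are the function names. The second function ranks verbs against one another, which is one way of judging whether a given proportion is unusual; the third turns a cut-off into a list of candidate labels.

```python
from collections import Counter


def passive_proportions(sentences, min_freq=1):
    """Fraction of each lexical verb's occurrences that are passive.

    A verb token counts as passive when it is a past participle preceded
    by a form of 'be', allowing intervening adverbs.
    """
    total, passive = Counter(), Counter()
    for sent in sentences:
        for i, (lemma, tag) in enumerate(sent):
            if not tag.startswith("VV"):
                continue
            total[lemma] += 1
            if tag != "VVN":
                continue
            j = i - 1
            while j >= 0 and sent[j][1] == "AV0":  # skip adverbs
                j -= 1
            if j >= 0 and sent[j][0] == "be":
                passive[lemma] += 1
    return {v: passive[v] / n for v, n in total.items() if n >= min_freq}


def percent_at_least_as_passive(proportions):
    """For each verb, the percentage of verbs that are at least as passive
    as it is (so a small value means 'among the most passive')."""
    values = list(proportions.values())
    return {
        verb: 100 * sum(1 for q in values if q >= p) / len(values)
        for verb, p in proportions.items()
    }


def propose_labels(proportions, cutoff_percent):
    """Verbs falling within the top `cutoff_percent` most passive verbs."""
    ranks = percent_at_least_as_passive(proportions)
    return sorted(v for v, r in ranks.items() if r <= cutoff_percent)
```

With a real tagged corpus, `min_freq` would be set high enough (for instance several hundred occurrences) to keep rare verbs from dominating the ranking.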
Should we, for example, label all verbs which are over 50% passive as often passive? To assess this question, we want to know what the implications would be: we do not want to bombard the dictionary user with too many labels (or the lexicographer with too many candidate-labels). What percentage of English verbs occur in the passive over half of the time? Is it 20%, or 50%, or 80%? This question is also not in principle hard to answer: for each verb, we work out its percentage passive, and sort according to the percentage. We can then give a figure which is, for lexicographic purposes, probably more informative than ‘the percentage passive’: the percentile. The
percentile indicates whether a verb is in the top 1%, or 2%, or 5%, or 10% of verbs from the point of view of how passive they are. We can prepare lists as in Table 2. This uses the methodology for finding the ‘most passive’ verbs (with frequency over 500) in the BNC. It shows that the most passive verb is station: people and things are often stationed in places, but there are far fewer cases where someone actively stations things. For station, 72.2% of its 557 occurrences are in the passive, and this puts it in the 0.2% ‘most passive’ verbs of English. At the other end of the table, levy is in the passive just over half the time, which puts it in the 1.9% most passive verbs. The approach is similar to the collostructional analysis of Gries & Stefanowitsch (2004). As can be seen from this sample, the information is lexicographically valid: all the verbs in the table would benefit from an often passive or usually passive label. A table like this can be used by editorial policy-makers to determine a cut-off which is appropriate for a given project.
Table 2. The ‘most passive’ verbs in the BNC, for which a usually passive label might be proposed
Percentile   Ratio (% passive)   Lemma       Frequency
0.2          72.2                station           557
0.2          71.8                base            19201
0.3          71.1                destine           771
0.3          68.7                doom              520
0.4          66.3                poise             640
0.4          65.0                situate          2025
0.5          64.7                schedule         1602
0.5          64.1                associate        8094
0.6          63.2                embed             688
0.7          62.0                entitle          2669
0.8          59.8                couple           1421
0.9          58.1                jail              960
1.1          57.8                deem             1626
1.1          55.5                confine          2663
1.2          55.4                arm              1195
1.2          54.9                design          11662
1.3          53.9                convict          1298
1.5          53.1                clothe            749
1.5          52.8                dedicate         1291
1.5          52.4                compose          2391
1.6          51.5                flank             551
1.7          50.8                gear              733
1.9          50.1                levy              603
For instance, what proportion of verbs should
attract an often passive label? Perhaps the decision will be that users benefit most if the label is not overused, so just 4% of verbs would be thus labelled. The full version of Table 2 tells us what these verbs are. And now that we know precisely the hypothesis to use (“is the verb in the top 4% most-passive verbs?”) and where the hypothesis is true, the label can be added into the word sketch. In this way, the element of chance – will the lexicographer notice whether a particular verb is typically passivized? – is eliminated, and the automation process not only reduces lexicographers’ effort but at the same time ensures a more consistent account of word behaviour.
3.6.2 Register labels: formal, informal, etc.
Any corpus is a collection of texts. Register is in the first instance a classification that applies to texts rather than words. A word is informal (or formal) if it shows a clear tendency to occur in informal (or formal) texts. To label words according to register, we need a corpus in which the constituent texts are themselves labelled for register in the document header. Note that at this stage, we are not considering aspects of register other than formality. One way to come by such a corpus is to gather texts from sources known to be formal or informal. In a corpus such as the BNC, each document is supplied with various text type classifications, so we can, for example, infer from the fact that a document is everyday conversation, that it is informal, or from the fact that it is an academic journal article, that it is formal. The approach has potential, but also drawbacks. In particular, it is not possible to apply it to any corpus which does not come with text-type information. Web corpora do not. An alternative is to build a classifier which infers formality level on the basis of the vocabulary and other features of the text. There are classifiers available for this task: see for example Heylighen & Dewaele (1999) and Santini et al. (2009).
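Once documents carry a register label, whether taken from corpus metadata or assigned by a classifier, the tendency of a word to occur in informal texts can be quantified. The following is a minimal sketch under the assumption of a two-way formal/informal labelling; the function name, the score (the informal share of a word’s summed relative frequencies) and the minimum-count threshold are illustrative assumptions.

```python
from collections import Counter


def informality_scores(documents, min_count=2):
    """documents: iterable of (register, tokens) pairs, where register is
    'formal' or 'informal'.

    Returns, for each sufficiently frequent word, a score between 0 and 1:
    1.0 means the word occurs only in informal texts, 0.0 only in formal ones.
    Relative frequencies are used so that unequal amounts of formal and
    informal text do not distort the comparison.
    """
    counts = {"formal": Counter(), "informal": Counter()}
    sizes = Counter()
    for register, tokens in documents:
        counts[register].update(tokens)
        sizes[register] += len(tokens)

    scores = {}
    for word in set(counts["formal"]) | set(counts["informal"]):
        if counts["formal"][word] + counts["informal"][word] < min_count:
            continue
        rel_informal = (
            counts["informal"][word] / sizes["informal"] if sizes["informal"] else 0.0
        )
        rel_formal = (
            counts["formal"][word] / sizes["formal"] if sizes["formal"] else 0.0
        )
        scores[word] = rel_informal / (rel_informal + rel_formal)
    return scores


docs = [
    ("informal", "yeah it was kinda fun yeah".split()),
    ("formal", "it was therefore concluded that it was valid".split()),
]
print(sorted(informality_scores(docs).items(), key=lambda kv: -kv[1]))
```

Sorting words by this score orders the vocabulary from most to least informal, after which a cut-off can be chosen in the same way as for the passive label.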
Following this route, we have recently labelled all documents in a five billion word web corpus according to formality, so we are now in a position to order words from most to least formal. The next tasks will be to assess the accuracy of the classification, and to consider – just as was done for passives – the percentage of the lexicon we want to label for register. The reasoning may seem circular: we use formal (or informal) vocabulary to find formal (or informal) vocabulary. But it is a spiral rather than a circle: each cycle has more information at its disposal than the previous one. We use our knowledge of the words that are formal or informal to identify documents that are formal or informal. That then gives us a richer dataset for identifying further words, phrases and constructions which tend to be formal or informal, and allows us to quantify the tendencies.
3.6.3 Domain labels: Geol., Astron., etc.
The issues are, in principle, the same as for register. The practical difference is that there are far more domains (and domain labels): even MEDAL, a general-purpose learner’s dictionary, has 18 of these, while the NEID database has over 150 domain labels. Collecting large corpora for each of these domains is a significant challenge.
It is tempting to gather a large quantity of, for example, geological texts from a particular source, perhaps an online geology journal. But rather than being a ‘general geology’ corpus, that subcorpus will be an ‘academic-geology-prose corpus’, and the words which are particularly common in the subcorpus will include vocabulary typical of academic discourse in general as well as of the domain of geology. Ideally, each subcorpus will have the same proportions of different text-types as the whole corpus. None of this is technically or practically impossible, but the larger the number of subcorpora, the harder it is to achieve. In current work, we are focusing on just three subcorpora: legal, medical and business, to see if we can effectively propose labels for them.
Once we have the corpora and counts for each word in each subcorpus, we need to use statistical measures for deciding which words are most distinctive of the subcorpus: which words are its ‘keywords’, the words for which there is the strongest case for labelling. The maths we use is based on a simple ratio between relative frequencies, as implemented in the Sketch Engine and presented in Kilgarriff (2009).
3.6.4 Region labels: AmE, AustrE, etc.
The issues concerning region labels are the same as for domains but in some ways a little simpler. The taxonomy of regions, at least from the point of view of labelling items used in different parts of the English-speaking world, is relatively limited, and a good deal less open-ended than the taxonomy of domains. In MEDAL, for example, it comprises just 12 varieties or dialects, including American, Australian, Irish, and South African English.
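The keyword comparison described for domain subcorpora, which applies equally to regional ones, can be sketched as a ratio of relative frequencies. The text above says only that the measure is a simple ratio between relative frequencies; the additive smoothing constant in the sketch below is an assumption, included so that words absent from the reference corpus do not cause division by zero, and the function name and figures are hypothetical.

```python
def keyword_scores(focus_counts, focus_size, ref_counts, ref_size, smoothing=100.0):
    """Rank words by (focus frequency per million + k) / (reference
    frequency per million + k), where k is the smoothing constant."""
    scores = {}
    for word, count in focus_counts.items():
        focus_pm = count * 1_000_000 / focus_size
        ref_pm = ref_counts.get(word, 0) * 1_000_000 / ref_size
        scores[word] = (focus_pm + smoothing) / (ref_pm + smoothing)
    # Most distinctive words of the focus subcorpus first.
    return sorted(scores.items(), key=lambda kv: -kv[1])


# Invented counts for a hypothetical legal subcorpus of one million words,
# compared with a general reference corpus of 100 million words.
legal = {"plaintiff": 50, "the": 60_000}
general = {"plaintiff": 10, "the": 6_000_000}
print(keyword_scores(legal, 1_000_000, general, 100_000_000))
```

A larger smoothing constant favours commoner words among the keywords, a smaller one favours rarer words, so the constant is itself an editorial choice.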
3.7
Examples
Most dictionaries include example sentences. They are especially important in pedagogical dictionaries, where a carefully-selected set of examples can clarify meaning, illustrate a word’s contextual and combinatorial behaviour, and serve as models for language production. The benefits for users are clear, and the shift from paper to electronic media means that we can now offer users far more examples. But this comes at a cost. Finding good examples in a mass of corpus data is labour-intensive. For all sorts of reasons, a majority of corpus sentences will not be suitable as they stand, so the lexicographer must either search out the best ones or modify corpus sentences which are promising but in some way flawed.
3.7.1 GDEX
In 2007, the requirement arose – in a project for Macmillan – for the addition of new examples for around 8,000 collocations. The options were to ask lexicographers to select and edit these in the ‘traditional’ way, or to see whether the example-finding process could be automated. Budgetary considerations favoured the latter approach, and
subsequent discussions led to the GDEX (‘good dictionary examples’) algorithm, which is described in Kilgarriff et al. (2008). Essentially, the software applies a number of filters designed to identify those sentences in a corpus which most successfully fulfil our criteria for being a ‘good’ example. A wide range of heuristics is used, including criteria like sentence length, the presence (or absence) of rare words or proper names, and the number of pronouns in the sentence. The system worked successfully on its first outing – not in the sense that every example it ‘promoted’ was immediately usable, but in the sense that it significantly streamlined the lexicographer’s task. GDEX continues to be refined, as more selection criteria are added and the weightings of the different filters adjusted. For the DANTE database, which includes several hundred thousand examples, GDEX sorts the sentences for any of the combinations shown in the word sketches, in such a way that the ones which GDEX thinks are ‘best’ are shown first. The lexicographer can scan a short list until they find a suitable example for whatever feature is being illustrated, and GDEX means they are likely to find what they are looking for in the top five examples, rather than, on average, within the top 20 to 30.
3.7.2 One-click copying
DANTE is an example-rich database in which almost all word senses, constructions, and multiword expressions are illustrated with at least one example. All examples are from the corpus and are unedited (DANTE is a lexical database rather than a finished dictionary). Lexicographers are thus required to copy many example sentences from the corpus system into the dictionary editing system. We use standard copy-and-paste but in the past this has often been fiddly, with one click to see the whole sentence, then manoeuvring the mouse to mark it all.
So we have added a button for ‘one-click copying’: now, a single click on an icon at the right of any concordance line copies not the visible concordance line, but the complete sentence (with headword highlighted) and puts it on the clipboard ready for pasting into the dictionary.
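The kind of filtering that GDEX performs (Section 3.7.1) can be illustrated with a toy scorer built on the heuristics named there: sentence length, rare words, proper names and pronouns. This is not the GDEX algorithm itself; the penalty weights, the length limits, the pronoun list and the function names are all assumptions chosen for illustration.

```python
import string

PRONOUNS = {"he", "she", "it", "they", "him", "her", "them", "this", "that"}


def example_score(sentence, common_words, min_len=10, max_len=25):
    """Score a candidate example between 0 and 1 (higher is better)."""
    words = [w.strip(string.punctuation) for w in sentence.split()]
    words = [w for w in words if w]
    proper = rare = pronouns = 0
    for i, w in enumerate(words):
        if i > 0 and w[0].isupper():
            proper += 1            # capitalized mid-sentence: likely a name
        elif w.lower() not in common_words:
            rare += 1              # not in the common-vocabulary list
        if w.lower() in PRONOUNS:
            pronouns += 1          # may need context to be understood
    penalty = 0.2 * rare + 0.2 * proper + 0.1 * pronouns
    if not min_len <= len(words) <= max_len:
        penalty += 0.3
    return max(0.0, 1.0 - penalty)


def rank_examples(sentences, common_words):
    """Return the sentences with the best-scoring candidates first."""
    return sorted(
        sentences,
        key=lambda s: example_score(s, common_words),
        reverse=True,
    )
```

Showing only the first few sentences returned by `rank_examples` corresponds to the short list that the lexicographer scans.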
3.8
Tickbox lexicography (TBL)
One-click copying is a good example of a simple software tweak that streamlines a routine lexicographic task. This may look trivial, but in the course of a project such as DANTE, the lexicographic team will be selecting and copying several hundred thousand example sentences, so the time-savings this yields are significant. Another development – currently in use on two lexicographic projects – takes this process a step further, allowing lexicographers to select collocations for an entry, then select corpus examples for each collocation, simply by ticking boxes (thus eliminating the need to retype or cut-and-paste). We call this ‘tickbox lexicography’ (TBL), and in this process, the lexicographer works with a modified version of the word sketches, where each collocate listed under the various grammatical relations (‘gramrels’) has a tickbox beside it. Then, for each word sense and each gramrel, the lexicographer:
– ticks the collocations s/he wants in the dictionary or database
– clicks a ‘Next’ button
– is then presented with a choice of six corpus examples for every collocation, each with a tickbox beside it (six is the default, and assumes that – thanks to GDEX – a suitable example will appear in this small set; but the defaults can of course be changed)
– ticks the desired examples, then clicks a ‘Copy to clipboard’ button.
The system then builds an XML structure according to the Document Type Definition (DTD) of the target dictionary (each target dictionary has its own TBL application). The lexicographer can then paste this complex structure, in a single move, directly into the appropriate fields in the dictionary writing system. In this way, TBL models and streamlines the process of getting a corpus analysis out of the corpus system and into the dictionary writing system, as the first stage in the compilation of a dictionary. Here again, the incremental efficiency gains are substantial. The TBL process is especially well-adapted to the emerging situation where online dictionaries give their users access to multiple examples of a given linguistic feature (such as a collocation or syntax pattern): with TBL, large numbers of relevant corpus examples can be selected and copied into the database with minimum effort.
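The final step, turning ticked collocations and examples into an XML fragment ready for pasting, might look like the following sketch. The element and attribute names are hypothetical: a real TBL application would follow the DTD of its target dictionary, which is not given here.

```python
import xml.etree.ElementTree as ET


def build_tbl_fragment(headword, gramrel, selections):
    """Build an XML string from the lexicographer's ticked choices.

    selections maps each ticked collocate to the list of ticked examples.
    All element and attribute names are illustrative assumptions.
    """
    root = ET.Element("collocations", {"headword": headword, "gramrel": gramrel})
    for collocate, examples in selections.items():
        node = ET.SubElement(root, "colloc")
        ET.SubElement(node, "collocate").text = collocate
        for sentence in examples:
            ET.SubElement(node, "example").text = sentence
    return ET.tostring(root, encoding="unicode")


ticked = {
    "expect": ["Investors expect a return on their money."],
    "guarantee": ["The scheme guarantees a return of five per cent."],
}
print(build_tbl_fragment("return", "object_of", ticked))
```

Because the structure is generated rather than typed, each collocate and example arrives in the right field without retyping.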
4. Conclusions
If we look back at the list of lexicographic tasks (Section 3, above), we find that the following have been – or soon will be – automated to a significant degree:
– corpus creation
– headword list building
– identification of key linguistic features or preferences (syntactic, collocational, colligational, and text-type-related)
– example selection.
Further improvements are possible for each of these technologies (notably the GDEX algorithm and the text-type classifiers), and many of these are already in development. An especially interesting approach we are now looking at is one that takes the whole automation process a step further. In this model, we envisage a change from the current situation, where the corpus software (some version of the word sketches) presents data to the lexicographer in (as we have seen) intelligently pre-digested form, to a new paradigm where the software selects what it believes to be relevant data and actually populates the appropriate fields in the dictionary database. In this way of working, the lexicographer’s task changes from selecting and copying data from the software, to validating – in the dictionary writing system – the choices made by the computer. Having deleted or adjusted anything unwanted, the lexicographer then tidies up and
completes the entry. The principle here is that, assuming the software can be trained to make the ‘right’ decisions in a majority of cases, it is more efficient to edit out the computer’s errors than to go through the whole data-selection process from the beginning. If this approach can be made to work effectively, we are likely to see a further change in lexicographers’ working practices – and a further shift towards full automation. Automated lexicography is still some way off. In particular, we have not yet reached the point where definition writing and (hardest of all) word sense disambiguation (WSD) are carried out by machines. In both cases, however, it may be possible to solve the problem by redefining the goal. If, for example, we think less in terms of discrete, numbered ‘dictionary senses’, and more of the contribution that a word makes to the meaning of a given communicative event, then the task starts to look less intractable. It has become increasingly clear that the meaning of a word in a particular context is closely associated with the specific patterning in which it appears – where ‘patterning’ encompasses features such as syntax, collocation, and domain information. A good deal of research is going on in this area, notably Patrick Hanks’ work on ‘Corpus Pattern Analysis’ (e.g. Hanks 2004), and it is self-evident that computers can identify and count clusters of patterns more readily than they can count something as unstable as ‘senses’. This offers one way forward. Equally, definitions could become less important if the user who encounters an unknown word could immediately access half a dozen very similar corpus examples (filtered by GDEX or the like), and then draw his or her own conclusions. Whether this could be a viable alternative to the traditional definition – especially when the user is a learner – remains to be seen.
We have described a long-running collaboration between a lexicographer and a computational linguist, and its outcomes in terms of the way that dictionary text is compiled in the early 21st century. There is plenty more to be done, but it should be clear from this brief survey that the interaction between lexicography and language engineering has already been fruitful and promises to deliver even greater benefits in the future.
References
Atkins, S., Kilgarriff, A. & Rundell, M. 2010. The Database of Analysed Texts of English (DANTE). In Proceedings of 14th EURALEX International Congress, A. Dykstra & T. Schoonheim (eds), 549–556. Leeuwarden, The Netherlands.
Atkins, S. & Rundell, M. 2008. The Oxford Guide to Practical Lexicography. Oxford: OUP.
Baroni, M. & Bernardini, S. 2004. BootCaT: Bootstrapping corpora and terms from the web. Proceedings of LREC 2004, 1313–1316. Lisbon: ELDA.
Baroni, M., Bernardini, S., Ferraresi, A. & Zanchetta, E. 2009. The WaCky wide web: A collection of very large linguistically processed web-crawled corpora. Language Resources and Evaluation Journal 43(3): 209–226.
Baroni, M., Kilgarriff, A., Pomikálek, J. & Rychlý, P. 2006. WebBootCaT: A Web tool for instant corpora. In Proceedings of 12th EURALEX International Congress, E. Corino, C. Marello, C. Onesti (eds), 123–131. Alessandria: Edizioni Dell’Orso.
Church, K. & Hanks, P. 1990. Word association norms, mutual information and lexicography. Computational Linguistics 16: 22–29.
Clear, J. 1988. The monitor corpus. In ZüriLEX ‘86 Proceedings, M. Snell-Hornby (ed.), 383–389. Tübingen: Francke.
Clear, J. 1996. Technical implications of multilingual corpus lexicography. International Journal of Lexicography 9(3): 265–273.
de Schryver, G.-M. & Prinsloo, D.J. 2000. The compilation of electronic corpora, with special reference to the African languages. Southern African Linguistics and Applied Language Studies 18(1–4): 89–106.
Fairon, C., Macé, K. & Naets, H. 2008. GlossaNet2: a linguistic search engine for RSS-based corpora. Proceedings, Web As Corpus Workshop (WAC4), S. Evert, A. Kilgarriff & S. Sharoff (eds), 34–39. Marrakesh.
Fletcher, W.H. 2004. Making the Web more useful as a source for linguistic corpora. In Applied Corpus Linguistics: A Multidimensional Perspective [Studies in Corpus Linguistics 16], U. Connor & T. Upton (eds), 191–205. Amsterdam: Rodopi.
Grefenstette, G. 1998. The future of linguistics and lexicographers: Will there be lexicographers in the year 3000? In Actes EURALEX 1998, T. Fontenelle, P. Hiligsmann, A. Michiels, A. Moulin & S. Theissen (eds), 25–42. Liège: Université de Liège.
Gries, S.T. & Stefanowitsch, A. 2004. Extending collostructional analysis: A corpus-based perspective on ‘alternations’. International Journal of Corpus Linguistics 9(1): 97–129.
Hanks, P.W. 2004. Corpus Pattern Analysis. In Proceedings of the Eleventh EURALEX International Congress, G. Williams & S. Vessier (eds), 87–98. Lorient: UBS.
Heylighen, F. & Dewaele, J.-M. 1999. Formality of language: Definition, measurement and behavioural determinants. Internal Report, Free University Brussels.
Janicivic, T. & Walker, D. 1997. NeoloSearch: Automatic detection of neologisms in French Internet documents. Proceedings of ACH/ALLC’97, 93–94. Ontario, Canada: Queen’s University.
Keller, F. & Lapata, M. 2003. Using the web to obtain frequencies for unseen bigrams. Computational Linguistics 29(3): 459–484.
Kilgarriff, A. 1997. Putting frequencies in the dictionary. International Journal of Lexicography 10(2): 135–155.
Kilgarriff, A. 2009. Simple maths for keywords. In Proceedings, Corpus Linguistics, M. Mahlberg, V. González-Díaz & C. Smith (eds). Liverpool: University of Liverpool.
Kilgarriff, A. 2010. Comparable corpora within and across languages, word frequency lists and the Kelly project. In Proceedings, 3rd Workshop on Building and Using Comparable Corpora, R. Rapp, P. Zweigenbaum & S. Sharoff (eds), 1–5. Malta: LREC.
Kilgarriff, A. & Grefenstette, G. 2003. Introduction to the special issue on the web as corpus. Computational Linguistics 29(3): 333–348.
Kilgarriff, A., Husák, M., McAdam, K., Rundell, M. & Rychlý, P. 2008. GDEX: Automatically finding good dictionary examples in a corpus. In Proceedings of the XIII Euralex Congress, E. Bernal & J. DeCesaris (eds), 425–431. Barcelona: Universitat Pompeu Fabra.
Kilgarriff, A., Kovář, V., Krek, S., Srdanović, I. & Tiberius, C. 2010. A quantitative evaluation of word sketches. In Proceedings of 14th EURALEX International Congress, A. Dykstra & T. Schoonheim (eds), 372–379. Leeuwarden, The Netherlands.
Kilgarriff, A. & Rundell, M. 2002. Lexical profiling software and its lexicographic applications: A case study. In Proceedings of the Tenth Euralex Congress, A. Braasch & C. Povlsen (eds), 807–818. Copenhagen: University of Copenhagen.
Kilgarriff, A., Rundell, M. & Uí Dhonnchadha, E. 2006. Efficient corpus development for lexicography: Building the new corpus for Ireland. Language Resources and Evaluation Journal 40(2): 127–152.
Kilgarriff, A., Rychly, P., Smrz, P. & Tugwell, D. 2004. The sketch engine. In Proceedings of the Eleventh EURALEX International Congress, G. Williams & S. Vessier (eds), 105–116. Lorient: UBS.
Krishnamurthy, R. 1987. The process of compilation. In Looking Up: An Account of the COBUILD Project in Lexical Computing, J.M. Sinclair (ed.), 62–85. London: Collins.
Kučera, H. & Francis, W.N. 1967. Computational Analysis of Present-Day American English. Providence RI: Brown University Press.
Lewis, M. 1993. The Lexical Approach. Hove: Language Teaching Publications.
McCarthy, M. & O’Dell, F. 2005. English Collocations in Use. Cambridge: CUP.
Murray, K.E.M. 1979. Caught in the Web of Words: James A.H. Murray and the Oxford English Dictionary. Oxford: OUP.
Murray, J., Bradley, H., Craigie, W. & Onions, C.T. 1928. Oxford English Dictionary. Oxford: OUP.
O’Donovan, R. & O’Neill, M. 2008. A systematic approach to the selection of neologisms for inclusion in a large monolingual dictionary. In Proceedings of the XIII Euralex Congress, E. Bernal & J. DeCesaris (eds), 571–579. Barcelona: Universitat Pompeu Fabra.
Pomikálek, J., Rychlý, P. & Kilgarriff, A. 2009. Scaling to billion-plus word corpora. Advances in computational linguistics.
Special Issue of Research in Computing Science 41: 3–14. Procter, P. (ed.). 1978. Longman Dictionary of Contemporary English. Harlow: Longman. Renouf, A. 1987. Corpus development. In Looking Up: An Account of the COBUILD Project in Lexical Computing, J.M. Sinclair (ed.), 10–40. London: Collins. Rundell, M. (ed.). 2001. Macmillan English Dictionary for Advanced Learners. Oxford: Macmillan Education. Rundell, M. (ed.). 2010. Macmillan Collocations Dictionary. Oxford: Macmillan Education. Santini, M., Rehm, G., Sharoff, S. & Mehler, A. (eds). 2009. Introduction: In Special issue on Automatic Genre Identification: Issues and Prospects. Journal for Language Technology and Computational Linguistics 24(1): 129–145. Sinclair, J.M. (ed.) Looking Up: An Account of the COBUILD Project in Lexical Computing. London: Collins. Sharoff, S. 2006. Creating general-purpose corpora using automated search engine queries. In Wacky! Working Papers on Web as Corpus, M. Baroni & S. Bernardini (eds), 63–98. Bologna: Gedit. Stein, J. & Urdang, L. (eds). 1966. Random House Dictionary of the English Language. New York NY: Random House. Tapanainen, P. & Järvinen, T. 1998. Dependency concordances. International Journal of Lexicography 11(3): 187–203.
addendum
Select list of publications by Sylviane Granger*
1. Books
The Louvain International Database of Spoken English Interlanguage. Handbook and CD-ROM (G. Gilquin, S. De Cock & S. Granger eds). Presses universitaires de Louvain: Louvain-la-Neuve, 2010. eLexicography in the 21st Century: New Challenges, New Applications. Proceedings of ELEX2009 (S. Granger & M. Paquot eds). Cahiers du CENTAL. Presses universitaires de Louvain: Louvain-la-Neuve, 2010. The International Corpus of Learner English. Handbook and CD-ROM. Version 2 (S. Granger, E. Dagneaux, F. Meunier & M. Paquot eds). Presses universitaires de Louvain: Louvain-la-Neuve, 2009. Phraseology: An Interdisciplinary Perspective (S. Granger & F. Meunier eds). John Benjamins: Amsterdam, 2008. Phraseology in Foreign Language Learning and Teaching (F. Meunier & S. Granger eds). John Benjamins: Amsterdam, 2008. Eigo Gakushusha Kopasu Nyumon – SLA to Kopasu no Deai (Introduction to English Learner Corpus – SLA Meets Corpus Linguistics) (S. Granger ed.). Japanese translation of S. Granger (ed.) Learner English on Computer (Addison Wesley Longman). Kenkyusha: Tokyo, 2008. Corpus-based Approaches to Contrastive Linguistics and Translation Studies (S. Granger, J. Lerot & S. Petch-Tyson eds). Foreign Language Teaching and Research Press: Beijing, 2007. Extending the Scope of Corpus-based Research: New Applications, New Challenges (S. Granger & S. Petch-Tyson eds), Rodopi: Amsterdam, 2003. Corpus-based Approaches to Contrastive Linguistics and Translation Studies (S. Granger, J. Lerot & S. Petch-Tyson eds), Rodopi: Amsterdam, 2003. Computer Learner Corpora, Second Language Acquisition and Foreign Language Teaching (S. Granger, J. Hung & S. Petch-Tyson eds), Language Learning and Language Teaching 6. John Benjamins: Amsterdam, 2002. The International Corpus of Learner English. Handbook and CD-ROM (S. Granger, E. Dagneaux & F. Meunier eds). Presses universitaires de Louvain: Louvain-la-Neuve, 2002. Lexis in Contrast. 
Corpus-based Approaches (B. Altenberg & S. Granger eds), Studies in Corpus Linguistics 7. John Benjamins: Amsterdam, 2002. Contrastive Linguistics and Translation (S. Granger, L. Beheydt & J.-P. Colson eds), Special issue of Le Langage et l’Homme XXXIV(1), 1999. Learner English on Computer (S. Granger ed.). Addison Wesley Longman: London, 1998.
* The publications are sorted by publication type (books or articles) and in reverse chronological order.
Dictionnaire des Faux Amis/Dictionary of Faux Amis Français-Anglais English-French (J. Van Roey, S. Granger & H. Swallow). Duculot: Gembloux. 3rd edition Duculot: Paris, 1998. Perspectives on the English Lexicon (S. Granger ed.). Peeters: Louvain-la-Neuve, 1991. Thèmes Grammaticaux Français-Anglais (S. Granger & J. Van Roey), Collection Pédasup 7. Academia: Louvain-la-Neuve, 1988. The Be + Past Participle Construction in Spoken English with Special Emphasis on the Passive (S. Granger), North Holland Linguistic Series. Elsevier Science Publishers: Amsterdam, 1983. Tendencje interpretacyjne i generatywne w gramatyce transformacyjnej (S. Granger & B. Devlamminck), Katolicki Uniwersytet Lubelski: Lublin, 1981.
2. Articles
Error patterns and automatic L1 identification (Y. Bestgen, S. Granger & J. Thewissen). In S. Jarvis & S. Crossley (eds) Approaching Transfer through Text Classification: Explorations in the Detection-based Approach, forthcoming. How to use foreign and second language learner corpora? (S. Granger). In A. Mackey & S.G. Gass (eds) A Guide to Research Methods in Second Language Acquisition. Basil Blackwell, forthcoming. Learner corpora (S. Granger). In C.A. Chapelle (ed.) The Encyclopedia of Applied Linguistics. Wiley-Blackwell: Oxford, forthcoming. The comparative and combined contributions of n-grams, Coh-Metrix indices, and error types in the L1 classification of learner texts (S. Jarvis, Y. Bestgen, S. Crossley, S. Granger, M. Paquot, J. Thewissen & D. McNamara). In S. Jarvis & S. Crossley (eds) Approaching Transfer through Text Classification: Explorations in the Detection-based Approach, forthcoming. Language for specific purposes learner corpora (S. Granger & M. Paquot). In T.A. Upton & U. Connor (eds) The Encyclopedia of Applied Linguistics. Language for Specific Purposes. Wiley-Blackwell: Oxford, forthcoming. Categorizing spelling errors to assess L2 writing (Y. Bestgen & S. Granger). International Journal of Continuing Engineering Education and Life-Long Learning (IJCEELL), in press. From phraseology to pedagogy: Challenges and prospects (S. Granger). In T. Herbst, P. Uhrig & S. Schüller (eds) Chunks in the Description of Language. A Tribute to John Sinclair. Mouton de Gruyter: Berlin, in press. From EFL to ESL: Evidence from the International Corpus of Learner English (G. Gilquin & S. Granger). In M. Hundt & J. Mukherjee (eds) Exploring Second-Language Varieties of English and Learner Englishes: Bridging a Paradigm Gap. John Benjamins: Amsterdam, in press. Comparable and translation corpora in cross-linguistic research. Design, analysis and applications (S. Granger). Contemporary Foreign Language Studies 2. Shanghai Jiao Tong University, 2010. 
Vingt ans d’analyse de corpus d’apprenants: Leçons apprises et perspectives (S. Granger). In P. Cappeau, H. Chuquet & F. Valetopoulos (eds) L’exemple et le corpus: quel statut? Travaux linguistiques du CerLiCO. Presses universitaires de Rennes: Rennes, 2010, 29–42. Customising a general EAP dictionary to meet learner needs (S. Granger & M. Paquot). In S. Granger & M. Paquot (eds) eLexicography in the 21st Century: New Challenges, New Applications. Proceedings of ELEX2009. Cahiers du CENTAL. Presses universitaires de Louvain: Louvain-la-Neuve, 2010, 87–96.
How can data-driven learning be used in language teaching? (G. Gilquin & S. Granger). In A. O’Keeffe & M. McCarthy (eds) The Routledge Handbook of Corpus Linguistics. Routledge: London, 2010, 359–370. Learner corpora: A window onto the L2 phrasicon (S. Granger). In A. Barfield & H. Gyllstad (eds) Collocating in another Language: Multiple Interpretations. Palgrave Macmillan: London, 2009, 60–65. Lexical verbs in academic discourse: A corpus-driven study of learner use (S. Granger & M. Paquot). In M. Charles, S. Hunston & D. Pecorari (eds) At the Interface of Corpus and Discourse: Analysing Academic Discourses. Continuum: London, 2009, 193–214. In search of General Academic English: A corpus-driven study (S. Granger & M. Paquot). In K. Katsampoxaki-Hodgetts (ed.) Options and Practices of L.S.P Practitioners Conference Proceedings. University of Crete Publications, E-media, 2009, 94–108. Integrated Digital Language Learning (G. Antoniadis, S. Granger, O. Kraif, J. Medori, C. Ponton & V. Zampa). In N. Balacheff, S. Ludvigsen, T. de Jong, A. Lazonder & L. Montandon (eds) Technology-Enhanced Learning. Principles and Products. Springer: Berlin, 2009, 89–103. The contribution of learner corpora to second language acquisition and foreign language teaching: A critical evaluation (S. Granger). In K. Aijmer (ed.) Corpora and Language Teaching. John Benjamins: Amsterdam, 2009, 13–32. Japanese translation of ‘Prefabricated patterns in advanced EFL writing: Collocations and formulae’ (OUP, 1998) (S. Granger). In A. Cowie (ed.) Phraseology: Theory, Analysis and Applications. Kurosio Publishers: Tokyo, 2009, 185–204. Learner corpora (S. Granger). In A. Lüdeling & M. Kytö (eds) Corpus Linguistics. An International Handbook. Volume 1. Walter de Gruyter: Berlin, 2008, 259–275. Disentangling the phraseological web (S. Granger & M. Paquot). In S. Granger & F. Meunier (eds) Phraseology: An Interdisciplinary Perspective. 
John Benjamins: Amsterdam, 2008, 27–49. From dictionary to phrasebook? (S. Granger & M. Paquot). In E. Bernal & J. DeCesaris (eds) Proceedings of the XIII EURALEX International Congress, Barcelona, Spain, 2008, 1345–1355. Phraseology in language learning and teaching. Where to from here? (S. Granger & F. Meunier). In F. Meunier & S. Granger (eds) Phraseology in Foreign Language Learning and Teaching. John Benjamins: Amsterdam, 2008, 247–252. Learner corpora in foreign language education (S. Granger). In N. Van Deusen-Scholl & N.H. Hornberger (eds) Encyclopedia of Language and Education. Volume 4. Second and Foreign Language Education. Springer: Berlin, 2008, 337–351. Improve your writing skills (S. De Cock, G. Gilquin, S. Granger, M.-A. Lefer, M. Paquot & S. Ricketts). In M. Rundell (editor in chief) Macmillan English Dictionary for Advanced Learners (second edition). Macmillan Education: Oxford, 2007, IW1–IW50. Learner corpora: The missing link in EAP pedagogy (G. Gilquin, S. Granger & M. Paquot). In P. Thompson (ed.) Corpus-based EAP Pedagogy. Special issue of Journal of English for Academic Purposes 6(4), 2007, 319–335. Reprint of ‘The computer learner corpus: A versatile new source of data for SLA research’ (1998) (S. Granger). In W. Teubert & R. Krishnamurthy (eds) Corpus Linguistics: Critical Concepts in Linguistics. Volume 2. Routledge: London, 2007, 166–182. Reprint of ‘A bird’s-eye view of computer learner corpus research’ (2002) (S. Granger). In W. Teubert & R. Krishnamurthy (eds) Corpus Linguistics: Critical Concepts in Linguistics. Volume 2. Routledge: London, 2007, 44–72. Corpus d’apprenants, annotation d’erreurs et ALAO: Une synergie prometteuse (S. Granger). Cahiers de lexicologie 91(2), 2007, 117–132.
From corpora to confidence (M. Rundell & S. Granger). English Teaching Professional 50, 2007, 15–18. Corpus linguistics, language learning & ELT: Interviewing Sylviane Granger (S. Granger & V. Viana). Mindbite 1, 2007, 11–14. Extraction of multi-word units from EFL and native English corpora. The phraseology of the verb ‘make’ (S. Granger, M. Paquot & P. Rayson). In A.H. Buhofer & H. Burger (eds) Phraseology in Motion I. Schneider Verlag: Baltmannsweiler, 2006, 57–68. Quelles machines pour enseigner la langue? (G. Antoniadis, C. Fairon, S. Granger, J. Medori & V. Zampa). In P. Mertens, C. Fairon, A. Dister & P. Watrin (eds) TALN 06: Verbum ex Machina. Volume 2. Presses universitaires de Louvain: Louvain-la-Neuve, 2006, 795–805. Computer learner corpora and monolingual learners’ dictionaries: The perfect match (S. De Cock & S. Granger). In W. Teubert (ed.) Special issue of Lexicographica 20, 2005, 72–86. Computer learner corpus research: Current status and future prospects (S. Granger). In U. Connor & T. Upton (eds) Applied Corpus Linguistics: A Multidimensional Perspective. Rodopi: Amsterdam, 2004, 123–145. Practical applications of learner corpora (S. Granger). In B. Lewandowska-Tomaszczyk (ed.) Language, Corpora, E-Learning. Peter Lang: Frankfurt, 2004, 291–301. High frequency words: The bête noire of lexicographers and learners alike. A close look at the verb ‘make’ in five advanced learners’ dictionaries of English (S. De Cock & S. Granger). In G. Williams & S. Vessier (eds) Proceedings of the Eleventh EURALEX International Congress. Université de Bretagne-Sud: Lorient, 2004, 233–243. The International Corpus of Learner English: A new resource for foreign language learning and teaching and second language acquisition research (S. Granger). TESOL Quarterly 37(3), 2003, 538–546. Error-tagged learner corpora and CALL: A promising synergy (S. Granger). 
CALICO (Special issue on Error Analysis and Error Correction in Computer-Assisted Language Learning) 20(3), 2003, 465–480. The corpus approach: A common way forward for contrastive linguistics and translation studies (S. Granger). In S. Granger, J. Lerot & S. Petch-Tyson (eds) Corpus-based Approaches to Contrastive Linguistics and Translation Studies. Rodopi: Amsterdam, 2003, 17–29. A bird’s eye view of learner corpus research (S. Granger). In S. Granger, J. Hung & S. Petch-Tyson (eds) Computer Learner Corpora, Second Language Acquisition and Foreign Language Teaching. Language Learning and Language Teaching 6. John Benjamins: Amsterdam, 2002, 3–33. Recent trends in cross-linguistic lexical studies (B. Altenberg & S. Granger). In B. Altenberg & S. Granger (eds) Lexis in Contrast. Corpus-based Approaches. Studies in Corpus Linguistics 7. Benjamins: Amsterdam, 2002, 3–48. The grammatical and lexical patterning of make in native and non-native student writing (B. Altenberg & S. Granger). Applied Linguistics 22(2), 2001, 173–194. Didactique des langues étrangères, linguistique de corpus et traitement automatique des langues (S. Granger). In M. Marquillo Larruy (ed.) Questions d’épistémologie en didactique du français (langue maternelle, langue seconde, langue étrangère). Cahiers FORELL. Université de Poitiers: Poitiers, 2001, 105–109. Analyse des corpus d’apprenants pour l’ELAO basé sur le TAL (S. Granger, A. Vandeventer & M.J. Hamel). Corpus Linguistics. Special issue of TAL (Traitement Automatique des Langues) 42(2), 2001, 609–621.
Optimising measures of lexical variation in EFL learner corpora (S. Granger & M. Wynne). In J. Kirk (ed.) Corpora Galore. Rodopi, Amsterdam, 1999, 249–257. Use of tenses by advanced EFL learners: Evidence from an error-tagged computer corpus (S. Granger). In H. Hasselgård & S. Oksefjell (eds) Out of Corpora – Studies in Honour of Stig Johansson. Rodopi: Amsterdam, 1999, 191–202. Prefabricated patterns in advanced EFL writing: Collocations and formulae (S. Granger). In A. Cowie (ed.) Phraseology: Theory, Analysis and Applications. Oxford University Press: Oxford, 1998, 145–160. The computerized learner corpus: A versatile new source of data for SLA research (S. Granger). In S. Granger (ed.) Learner English on Computer. Addison Wesley Longman: London, 1998, 3–18. Tag sequences in learner corpora: A key to interlanguage grammar and discourse (S. Granger & J. Aarts). In S. Granger (ed.) Learner English on Computer. Addison Wesley Longman: London, 1998, 132–141. An automated approach to the phrasicon of EFL learners (S. De Cock, S. Granger, G. Leech & T. McEnery). In S. Granger (ed.) Learner English on Computer. Addison Wesley Longman: London, 1998, 67–79. Learner corpus data in the foreign language classroom: Form-focused instruction and data-driven learning (S. Granger & C. Tribble). In S. Granger (ed.) Learner English on Computer. Addison Wesley Longman: London, 1998, 199–209. Automatic lexical profiling of learner texts (S. Granger & P. Rayson). In S. Granger (ed.) Learner English on Computer. Addison Wesley Longman: London, 1998, 119–131. Computer-aided Error Analysis (E. Dagneaux, S. Denness & S. Granger). System. An International Journal of Educational Technology and Applied Linguistics 26(2), 1998, 163–174. On identifying the syntactic and discourse features of participle clauses in academic English: Native and non-native writers compared (S. Granger). In J. Aarts, I. de Mönnink & H. 
Wekker (eds) Studies in English Language and Teaching. Rodopi: Amsterdam, 1997, 185–198. The computer learner corpus: A testbed for electronic EFL tools (S. Granger). In J. Nerbonne (ed.) Linguistic Databases. CSLI Publications: Stanford CA, 1997, 175–188. Automated retrieval of passives from native and learner corpora: Precision and recall (S. Granger). Journal of English Linguistics 25(4), 1997, 365–374. From CA to CIA and back: An integrated approach to computerized bilingual and learner corpora (S. Granger). In K. Aijmer, B. Altenberg & M. Johansson (eds) Languages in Contrast. Text-based Cross-linguistic Studies. Lund Studies in English 88. Lund University Press: Lund, 1996, 37–51. Learner English around the world (S. Granger). In S. Greenbaum (ed.) Comparing English World-wide. Clarendon Press: Oxford, 1996, 13–24. Romance words in English: From history to pedagogy (S. Granger). In J. Svartvik (ed.) Words. Proceedings of an International Symposium. Almqvist and Wiksell International: Stockholm, 1996, 105–121. Connector usage in the English essay writing of native and non-native EFL speakers of English (S. Granger & S. Tyson). World Englishes 15, 1996, 9–29. The learner corpus: A revolution in applied linguistics (S. Granger). English Today 39(10/3), 1994, 25–29. Towards a grammar checker for learners of English (S. Granger & F. Meunier). In U. Fries & G. Tottie (eds) Creating and Using English Language Corpora. Rodopi: Amsterdam, 1994, 79–91.
New insights into the learner lexicon: A preliminary report from the International Corpus of Learner English (S. Granger, F. Meunier & S. Tyson). In L. Flowerdew & K.K. Tong (eds) Entering Text. The Hong Kong University of Science and Technology: Hong Kong, 1994, 102–113. La description de la compétence lexicale en langue étrangère: Perspectives méthodologiques (S. Granger & G. Monfort). Acquisition et Interaction en Langue Etrangère (AILE) 3, 1994, 55–75. Cognates: An aid or a barrier to successful L2 vocabulary development? (S. Granger). ITL Review of Applied Linguistics 99–100, 1993, 43–56. The International Corpus of Learner English (S. Granger). In J. Aarts, P. de Haan & N. Oostdijk (eds) English Language Corpora: Design, Analysis and Exploitation. Rodopi: Amsterdam, 1993, 57–69. The International Corpus of Learner English (S. Granger). The European English Messenger 2(1), 1993, 34. False friends: A kaleidoscope of translation difficulties (S. Granger & H. Swallow). Le Langage et l’Homme 23(2), 1988, 108–120. The Longman Dictionary of Contemporary English and the Collins Cobuild English Language Dictionary (S. Granger & J.P. Van Noppen). Revue Belge de Philologie et d’Histoire LXVI(3), 1988, 710–713. Why the passive? (S. Granger). In J. Van Roey (ed.) English-French Contrastive Analyses. Contrastive Analysis Series. Acco: Leuven, 1976, 23–57. A survey of transformational theories (Part 3) (S. Granger & B. Devlamminck). Le Langage et l’Homme 30, 1976, 49–67. A survey of transformational theories (Part 2) (S. Granger & B. Devlamminck). Le Langage et l’Homme 29, 1975, 25–50. Tendances interprétatives et génératives en grammaire transformationnelle (S. Granger & B. Devlamminck). Cahiers de l’Institut de Linguistique de Louvain (Cours et Documents) 3, 1975–76, 25–115. A survey of transformational theories (Part 1) (S. Granger & B. Devlamminck). Le Langage et l’Homme 26, 1974, 41–55. 
On some active and passive structures with infinitive and their French correspondents (S. Granger). Cahiers de l’Institut de Linguistique de Louvain 1(5), 1972, 705–732.
Subject index
A
academic English xv, 63, 94
academic literacy 63–65, 67, 80, 88
academic writing 39, 50, 63–80, 85, 90–91, 94, 101, 185, 276
  see also English for academic purposes
accuracy 86, 109, 111, 113, 116, 120–121
  see also classification accuracy
acrolect 215, 219–220, 225–226
adjectival phrase 72, 77
adjective 45–48, 75, 116, 160
advanced learner xiv, 14, 40, 43, 55–56, 114, 168
adverb 42, 46, 68, 72, 75, 77–79, 101, 160, 174, 273
adverbial 18, 43, 128, 220
adverbial particle 177–178
affix 116, 266
American English 11, 13, 209, 211, 229, 264
ANOVA 188–189
apprentice writing 87–88, 90–91, 93–100, 102–103, 105, 107
argumentative essay xiv, 38, 42–43, 67, 69, 90, 114, 128
argumentative writing 38, 64, 80, 129
artificial neural network 132–133, 137, 152
authorship attribution 143, 154
auxiliary 13, 53, 220
B
Bank of English 11, 25
basilect 215, 219–221, 224–226
borrowing 115, 216, 222, 230
British Academic Written English (BAWE) corpus 90–91, 93, 98
British English 10, 13, 229, 263
British National Corpus (BNC) 10–11, 13–14, 16, 25, 41–42, 47–48, 50, 55, 90–91, 112, 177–179, 181–186, 246, 248, 251, 262–264, 268, 274–275
  BNC Baby 90–91, 93, 98
BROWN corpus 10, 13, 16, 22, 37, 209, 259, 262
bundle see lexical bundle
C
can do statements 64, 67–69, 75, 77, 79
Chinese learners of English 39–42, 113–115, 139–141, 177, 212
classification accuracy 128–153
classroom 55, 64, 85–86, 157–159, 162–163, 168, 176, 238, 267
classroom input 174, 180, 185–186, 189–190
cluster 86–88, 93
  see also lexical bundle
co-frequency 11–12
COBUILD project 85, 156, 257, 259, 262, 269
cognate 52, 187
cognitive linguistics 21–22, 24
cohesion 68, 77, 129
coinage 114–115
colligation 248, 261, 278
collocate see collocation
collocation 10–12, 15, 19, 26, 48–51, 53, 88, 99, 111, 242–243, 247–248, 251–252, 267–271, 276–279
Collocator 252
collostruction 19–20, 23, 274
collostructional analysis 19, 274
Common European Framework of Reference (CEFR) 63–65, 80
competence 9, 21–22, 24, 64, 67–68, 80, 155, 211
compound 115, 261, 264, 266
computer-mediated communication 214, 218–220, 222, 227, 230, 233–234
congram 243
conjunction 54, 101, 264
connector 41, 55
construction 13, 19–23, 42, 49, 52–54, 56, 65, 71–72, 76, 78–79, 127, 226, 246, 247–249, 271, 275, 277
construction grammar 19, 21–23
contrastive analysis xiv, 2, 34, 38, 43–44, 52, 54–55
contrastive interlanguage analysis (CIA) xv, 2, 33, 38–40, 43–45, 52, 55–57
conversation 13, 36, 41, 217, 161, 164, 220–223, 275
copula 52–53, 72, 77, 222
Corpus of Contemporary American English (COCA) 185, 214
Corpus of Cyber-Jamaican 209, 214–215, 217
creole 211–215, 217, 222–223, 225–226, 233
cross-linguistic influence 127, 130, 148
  see also transfer
cross-validated classification accuracy 128–131, 133–147, 152–153
cross-validation 134–135, 138
Czech learners of English 139, 141
D
Danish learners of English 36
Database of ANalysed Texts of English (DANTE) 270, 277
deictic 69, 71, 76–77
derivation 116, 266
detection-based approach 130
developmental
  developmental factors 41, 52
  developmental patterns 64, 70, 73–74, 76–78, 80
  developmental stages 163
diachrony 22, 213
dialect 210, 215, 276
dictionary
  dictionary grammar 260, 271
  dictionary writing system 257, 260, 278
discourse xiv, 40, 42–43, 66, 70–72, 74, 77, 79–80, 87, 91, 97, 99, 161–162, 181, 214, 228, 231, 276
discourse competence 68, 80
discourse community 66, 87
discourse markers 64, 68, 70
discourse oriented verbs 71–72, 74, 79
discriminant analysis 129, 137
  see also linear discriminant analysis
Dutch xiii, xv
Dutch learners of English 40–41, 48, 141, 148
E
elicitation test 34–36
English as a foreign language (EFL) xiii, xiv, 1, 3, 25–26, 55, 63–65, 69, 74, 79–80, 113, 156, 173, 176, 190
English as a lingua franca (ELF) 25–26, 155–157, 159–169, 190
English as a Lingua Franca in Academic Settings (ELFA) corpus 25–26, 157, 160–162, 164–167
English as a native language (ENL) 159–161, 163–168
English as a second language (ESL) 113–114, 173, 176, 190
English for academic purposes (EAP) xv, 1, 25–26, 85–88
  see also academic writing
English for specific purposes (ESP) xiv–xv, 25–26, 87, 89
English language teaching (ELT) 12, 25–27, 39
entrenchment 22–23, 166
error xiv–xv, 34–35, 39, 42, 44–45, 48, 51, 57, 63, 68, 80, 86, 109–110, 113–123, 140, 159, 163, 166, 216–217, 233, 241, 247, 251, 260
error analysis 34, 39, 57, 109–110, 114, 123
error tagging 109, 114–116, 120, 122–123
  see also manual error tagging
evaluative adjectival phrase 72, 77
expert performance 87–89
expert writer 74, 97–98, 216
exposure 22–23, 147, 173–177, 180, 185–186, 189–190, 199, 240
F
face-to-face interaction 217, 220–222, 225–227, 229, 234
Federalist Papers corpus 137
Finnish learners of English 41–42, 141, 148
formulaic language 173, 176
formulaic sequences 11, 16, 23, 240, 242
formulaicity 15–16
French xiii, xv
French learners of English 39–41, 45–51, 109, 114–115, 117–119, 121–122, 139, 141, 148
frequency xiv, 7–27, 37, 39, 111, 174–176, 180–187, 190, 242
frequency list 8–15, 263–264
  see also co-frequency, grammatical frequency, lexical frequency, phrasal verb frequency, word combination frequency
function word 138, 140–142
G
GDEX 276–279
gender 128, 138, 158, 173, 176, 180, 188, 220
genre 43, 55–56, 65–68, 87–89, 96, 102, 128, 158, 214, 242
German 187
German learners of English 39–40, 42, 45–49, 51, 109, 114–115, 117–122, 141, 148, 177, 187
grammar xiv, 10, 12, 16–19, 21–23, 35, 40, 56, 64, 79, 115–116, 123, 161, 177, 215, 219–220, 247, 253, 260, 268, 271
grammar checker 123
  see also construction grammar, pattern grammar, systemic functional grammar
grammatical construction 16, 246, 248, 271
grammatical frequency 12, 16, 22
grammatical relation 271, 277
grammaticalization 13, 22
H
historical corpus 109–112
historical spelling variant 109–111, 120
hybrid n-gram 243–250, 252
I
idiomatic expression 22, 87, 165, 173–174
idiom principle 11, 15, 23, 49
implicit learning 23–24
Indian English 210, 213, 234
inflection 116, 178, 181, 217, 226, 264, 266
integrated contrastive model xv, 33, 43–45, 52, 57
intelligent computer-aided language learning (ICALL) 109, 114, 123
International Corpus of English (ICE) 37, 211–212, 214, 229
  ICE-GB 49–50, 52, 212, 229
  ICE-JA 214, 216–217, 219–221, 234
International Corpus of Learner English (ICLE) xiv–xv, 2, 25–26, 33, 37–54, 56–57, 63–64, 69, 114–116, 131, 139–141, 147–148, 156, 164, 168, 212
  ICLE-FR 45–46, 48–51
  ICLE-GE 42, 45–49, 51–52
  ICLE-HK 42, 56
  ICLE-NO 39, 43, 45–54, 56
  ICLE-SP 45–46, 48–49, 51–52
intertextuality 41, 63, 65–69, 79
Irish English 211, 229, 262, 270, 276
Italian learners of English 40, 141, 148, 176–177
J
Jamaican Creole 211, 214–222, 224–234
Jamaican English 211–212, 214–221, 225–226, 228–229, 233–234
Japanese learners of English 40, 114–115, 139–141, 156
L
L1 detection 127, 130, 139, 143
Lancaster-Oslo/Bergen (LOB) corpus 10–11, 13, 37, 209
language instruction 36, 75–77, 86, 101–102, 173, 175–176, 180, 189–190
language proficiency see proficiency
language teaching 3, 8, 18, 23, 33–34, 57, 85, 88, 113, 123, 156, 267
  see also English language teaching
learning materials 26–27, 163
lemmatization 10, 185, 241, 259, 264–266, 273
LexChecker 246
lexical bundle 11, 15, 85–88, 90, 92–102, 130, 242
  see also cluster
lexical frequency 12, 16, 175
lexical knowledge 178, 237, 240–241, 251
lexical knowledge discovery 239, 245, 251
lexical knowledge representation 239, 251, 253
lexical variation 35–36, 114, 129
lexico-grammar 12, 19, 39, 63, 80, 155, 165–166, 168, 248
lexicography xiv–xv, 1, 11, 14, 85, 156, 222, 237–238, 241–242, 257–262, 265–266, 269–271, 273, 275–279
  see also tickbox lexicography
lexicon 12, 16, 22, 111–112, 114, 215, 275
lexis xiv, 11–12, 18–19, 35–36, 40, 94, 99, 270
linear discriminant analysis 131–134, 137, 143–145
literacy 65, 102, 216
  see also academic literacy
Longman Corpus of Spoken American English (LCSAE) 13, 17–18
Longitudinal Database of Learner English (LONGDALE) xiv, 56
longitudinal
  longitudinal corpus xiv, 26–27, 56
  longitudinal study 74, 113
  see also pseudo-longitudinal study
Louvain Corpus of Native English Essays (LOCNESS) 25, 38–43, 45–50, 52–53, 55, 69, 71
Louvain International Database of Spoken English Interlanguage (LINDSEI) xiv, 2, 25–26, 40, 56
M
machine learning 127–128
manual error tagging 109, 120, 123
mesolect 212, 215–217, 219–221, 224, 227
metadiscourse 43
metatextual function 51–52
Michigan Corpus of Academic Spoken English (MICASE) 25–26, 160, 167
misspelling 113, 117, 224
mistake 33, 35, 45, 109–110, 120, 167, 216
misuse 39, 46, 115–116
modality 13, 17, 41, 50–54, 65, 79
morpheme 116, 127, 266
morphological error 114–116
morphology 45, 116, 123, 166, 223, 266
morphosyntax 213, 216, 219, 222
multilingual corpora xiv–xv, 234
multilingualism 155, 234
multi-word 264, 266
multi-word expression 23, 240, 251–252, 277
multi-word sequence 127, 131, 141–142, 242
multi-word unit 11, 174
N
native English xiv–xv, 43, 101
  see also English as a native language
native language 38, 44–45, 56, 89, 115, 121–122, 127, 130, 159, 166
native-speaker 16, 18, 25–26, 33–36, 38–40, 46–47, 49, 52, 54–55, 57, 85–86, 88–89, 113–114, 129, 139, 155–156, 158–159, 161–162, 164–165, 180, 211, 234
Natural Language Processing (NLP) 11, 86, 109, 111, 114, 123, 261, 267
neural network 128, 139, 152 see also artificial neural network
New Englishes 209–210, 212–213, 215, 233–234
n-gram 86, 127, 131, 139–143, 147, 241–250, 252
norm xiv–xv, 34–35, 39, 114, 123, 158–159, 162, 211, 215, 218–219, 226, 229, 273
Norwegian 40, 43, 45, 47, 51–54
Norwegian learners of English 33, 39, 41, 45–56, 141, 148
noun 13, 19–20, 48, 114, 128, 166, 242, 244–250, 264, 268, 273
noun phrase 46–49, 52–53, 77, 174
P
paradigmatic 243–244, 248, 250
parameter tuning 133, 138, 144
particle 174, 178, 187, 271 see also adverbial particle
part-of-speech (POS) tagging 10, 110–111, 259, 268, 273
passive xiii, 68, 71–72, 76, 273–275
pattern grammar 19
pedagogical applications xiv–xv, 1–2, 35–36, 55, 57, 64, 101, 163
pedagogical dictionary 276
pedagogy 8, 17, 36, 57, 102, 157–158, 161, 163
phrasal verb 173–183, 185–191, 194–195, 199, 261
phrasal verb acquisition 180, 188
phrasal verb frequency 175–177, 179, 181–183, 185–186, 190, 194
phrasal verb knowledge 176, 181–182, 187–188
phraseology xiv, 1–2, 16, 18, 49, 52, 161, 166
Polish learners of English 40, 141–142, 156, 177
pragmatics 64, 161, 174, 227
preposition 128, 165–166, 248, 264, 271
present progressive 16–17, 23, 27
present simple 16–18, 23, 27
production xiv, 16, 27, 63–64, 85, 87, 102, 157, 232, 276 see also written production
productive
  productive mastery 178, 185–186, 190
  productive test 178, 180–183, 185
proficiency xiv, 39, 43, 52, 55, 130, 141, 148, 162–163, 176, 180, 187–189
proficiency level 34, 38, 52, 56, 147, 162–163, 166, 187–188
pronoun 41–42, 71, 75, 174, 220, 226, 244, 277
pronunciation 88, 225
prototype category 23–24
pseudo-longitudinal study 36
Q
qualitative approach 38–39, 48–49, 51, 55, 70–71, 76, 113, 148, 214
quantitative approach xiv, 35, 38–39, 43, 48, 55–56, 71, 167, 219, 225, 270
R
receptive
  receptive knowledge 173, 178, 180–181, 185, 188, 199
  receptive mastery 185–186, 190
  receptive test 178, 180–183, 185, 188
reference corpus 38–40, 42, 52, 55, 265, 267
register 12–14, 26–27, 34, 43, 52, 56, 75, 87–88, 90, 95, 102, 161–162, 273, 275
reporting verb 68–79
rhetor 68, 70–72, 75–76, 79
Russian learners of English 139, 141
S
second language acquisition (SLA) xiv, 1, 10, 21, 23–24, 57, 109, 113–114, 123, 156–157, 159–164, 166–168, 174
second language learning xiii, 168, 241
Sketch Engine 19, 244, 270–271, 276 see also word sketch
Sketch Grammar 271
social networking 175, 180, 190
sociolinguistics 209, 212, 214, 219, 222, 225–227, 229, 234
Spanish 48, 66–67, 75, 113
Spanish learners of English 40, 45–49, 51, 63–64, 67, 69, 79, 109, 113, 115, 117–119, 121–122, 129, 138–139, 141, 148, 177
speech 10–11, 13–15, 35, 40–41, 89, 159, 164–165, 167–168, 177, 181, 217, 223, 226 see also spoken English
spelling variants 109–111, 120
spoken English xiii, 10–11, 13, 36, 41, 234 see also speech
standard English 165, 212, 215, 219, 222, 224–225, 229, 233
standard varieties of English 161, 209, 212, 234
StringNet 237, 241, 243–252
Swedish 35, 42, 148, 159
Swedish learners of English 35, 39–43, 141, 148, 159
syntactic patterns 16, 42, 56, 272
syntagmatic 239, 242–244, 250
systemic functional grammar 21
T
teaching materials 14, 35, 86
textbook xiv–xv, 17, 26, 35, 49, 65–66, 157, 163, 175
tickbox lexicography (TBL) 277–278
transfer xiv, 34, 41–42, 44, 48–49, 51, 54, 75, 113, 130, 156, 162, 164 see also cross-linguistic influence, detection-based approach, L1 detection
U
UK Web as Corpus (UKWaC) 263, 270
usage-based approach 2, 21–24, 253
V
Variant Detector (VARD) 109–113, 115–123
varieties of English xv, 37, 210–214, 229, 234, 263
Varieties of English for Specific Purposes dAtabase (VESPA) 56
verb 13, 17, 19–24, 45, 48, 53, 65–66, 68, 70–79, 116, 128–130, 166, 220–221, 226, 229, 242, 244–245, 247, 249, 264–266, 268, 271–275 see also copula, discourse oriented verb, modality, phrasal verb, reporting verb
Vienna-Oxford International Corpus of English (VOICE) 25–26, 157, 164
vocabulary learning 175, 237, 240
voice 67, 69–71, 75–77, 79
W
word combination 11, 15, 40, 87
word combination frequency 15–16
word sketch 19–20, 267–273, 275, 277–278
Wordsmith 71–72, 92
writing 13–14, 36, 38, 40–43, 46, 52, 55, 57, 63–70, 72, 75, 79–80, 85–91, 93–101, 103, 110, 114–115, 127–128, 148, 168, 173, 177, 181, 215–218, 230–231, 234, 267, 279 see also academic writing, apprentice writing, argumentative writing, dictionary writing system
written English 13, 40–42, 91, 211
written production 85, 89, 93
written texts 11, 13, 15, 36, 54, 215–216
Z
Z score 133, 153–154
Name index
A
Ädel 38, 41, 43, 51, 55–56, 163, 168
Adolphs 12
Aijmer 41, 53
Alderson 15, 23
Allen 113
Allsopp 222–223
Altenberg 14–15, 41, 43, 54, 56, 167
Androutsopoulos 214, 232, 234
Archer 110
Atkins 173, 178, 259, 263, 265, 270
B
Barlow 21, 38
Baroni 262–263
Baron 110–111, 113, 119
Baumann 66
Bazerman 65, 87–89
Beal 213
Bebout 113
Bechar-Israeli 232
Beißwenger 214
Bernardini 262
Bestgen 148
Bhatia 65–66
Biber 10, 15–18, 51, 56, 65, 85–87, 94–95, 175, 178, 242
Bolinger 173–174, 178, 237–241
Boström Aronsson 42–43
Brand 40
Briggs 66
Buchstaller 229
Burnard 244
Burns 133–134
Burrows 154
Bybee 22
C
Candlin 86
Carroll 9, 26
Chambers 213, 265
Charles 74
Cheng 19, 243
Chomsky 9–10, 12, 16, 21, 23
Chowdhury 114
Church 242, 268–269
Cimiano 148
Clear 262, 265
Conrad 87, 242
Cook 113
Corder 34, 44–45
Cortes 87, 101–102
Coupland 226
Courtney 173
Coxhead 14
Crossley 128, 131, 139
Crystal 218
D
Dagneaux 109, 114–115
Dagut 187
Danet 233
D’Arcy 213
Darwin 175
Davies 11, 173, 175, 177–178, 185, 214
DeCarrico 166
De Cock 14, 16, 40, 176
de Schryver 262
Deuber 213, 215, 218
Dewaele 275
Dor 233
Dörnyei 174, 180
Dras 139–141, 143
Duda 128
Dunning 242
Durrant 15
E
Eckert 227
Eeg-Olofsson 15
Ehrenreich 161
Ehrman 10
Eia 41
Eliasson 187
Elisseef 134
Ellis 10, 15, 23–24, 174, 242
English 86
Enkvist 34
Estival 138–139, 143
Evert 242
F
Færch 36–37, 56–57
Fairon 265
Fallahkair 175
Feez 89
Field 133
Figueredo 113
Fillmore 22
Firth 11, 159
Fletcher 263
Flowerdew 89
Francis 10–11, 19, 64, 140, 209, 259
Frank 128–129, 134–135, 138, 154
Fukuya 187
G
Gabrielatos 17
Gamon 114
Gardner 11, 173, 175, 177–178
Gavin 136
Geva 113
Gilquin 2, 12, 14, 16, 24, 38, 41, 44, 54, 56, 64, 156, 168
Goldberg 22
Gouverneur 26
Granger 2, 10, 12, 14, 24, 38–39, 44–45, 49, 55, 57, 64, 86–87, 102, 109, 113–114, 123, 141, 156, 163–164, 166–167, 176
Gray 175
Greenbaum 37, 211
Greene 16
Grefenstette 259, 263
Gries 19–20, 22, 24, 110, 274
Groot 178, 183
Guan 151
Guo 151
Guyon 134
H
Halliday 11–12, 19, 21, 27, 92
Hammarberg 34
Hanks 242, 268–269, 279
Hart 173
Hasselgård 43, 49–50, 52, 54, 56
Hasselgren 14, 40, 48
Heift 115, 117, 119
Herlitz 130
Herriman 43
Herring 233
Hess 135–136
Heylighen 275
Hinrichs 218
Hinton 128
Hoey 64–65, 248
Hoffmann 16, 213
Hofland 10, 13
Höhn 229
Hooper 22
Hoover 154
Hopper 22
Horst 175
Hovermale 115
Hülmbauer 162
Hulstijn 187
Hundt 212
Hunston 19, 34, 40, 55, 64, 175
Hyland 42, 56, 87–88, 92–93
Hynninen 159
Hyvärinen 128
I
Ibrahim 113
J
Janicivic 265
Järvinen 268
Jarvis 128, 130–131, 139, 141–143, 147–148, 156, 164–165
Jelinek 11
Jenkins 88
Jockers 137, 143
Johansson 10, 13, 35, 43, 45, 52–54, 56
John 134
Johns 23, 65, 85, 156, 240
Jones 85–86, 139, 143
K
Kämmerer 40
Karhukorpi 161
Kaur 162
Keerthi 152
Keller 263
Kemmer 21
Kennedy 8
Kilgarriff 19–20, 244, 262–264, 269–270, 276–277
Klimpfinger 162
Kobayashi 188
Kohavi 134
Konishi 173
Koppel 139–141, 143
Koprowski 186
Kortmann 213
Kotsiantis 128–129, 134, 139, 145, 152–153
Krenn 242
Krishnamurthy 10, 259
Kučera 10–11, 140, 209, 259
Kurhila 158, 161
L
Lado 34
Langacker 21–22
Lapata 263
Laufer 187
Lea 88
Lecocke 135–136
Leech 9–10, 13, 22, 39, 175
Lee 115
Lefer 114–115, 123
Lennon 15
Levenston 34
Lewis 267
Liao 187
Lillis 88
Linnarud 35–36, 57
Liu 129
Lorenz 45, 48
Lorge 8
M
Mair 213
Manning 242
Mann 218
Marchena 187
Marshall 11
Martinet 17
Martin 67
Marx 167
Master 166
Mauranen 26, 162, 164, 166–167
Mayfield Tomokiyo 139, 143
McArthur 173, 178
McCallum 152
McCarthy 173, 267
McEnery 8
McLachlan 134
McNamara 128
Meara 86
Meisel 166
Melka 180
Mesthrie 234
Meunier 12, 26, 49, 64, 109, 123, 166, 176
Meyerhoff 229
Miller 176
Millet-Roig 152
Milton 114–115
Mindt 26
Mishan 239
Molinaro 129, 135–136
Moon 12, 15, 174
Morris 215
Mukherjee 212–213
Murray 258
Myslin 110
N
Nation 175, 183, 241
Nattinger 166
Neff 70–71, 74
Nelson 213
Nesselhauf 12, 15, 42, 163
Nicholls 114
Nickel 34
Niedzielski 229
Nigam 152
O
O’Dell 173, 267
Odlin 165
O’Donovan 265
Oja 128
O’Neill 265
Osborne 42–43
P
Paquot 14, 41, 43, 52, 56, 102, 131, 139, 141–143, 147–148, 168
Patrick 213, 225
Pavlenko 130, 156, 164
Pemberton 175
Perera 67
Petch-Tyson 41–43, 54
Pintelas 128
Platt 210
Pomikálek 270
Pravec 37
Prinsloo 262
Prinzie 153
Procter 258
R
Raileanu 134
Rampton 87
Ranta 166
Rappoport 139–140, 143
Rayson 110–111, 113–114, 119
Read 178, 241
Renouf 259
Riionheimo 165
Rimrott 115, 117, 119
Ringbom 40, 164–165
Römer 12, 26
Rott 175
Rubin 16
Rundell 57, 85, 259, 261, 263, 265, 269, 271
S
Sampson 16
Sand 213
Santini 275
Schäfer 110
Schmidt 24
Schmitt 166, 175, 178, 180, 185, 187
Schneider 211, 213, 215
Schulze 12
Scott 71, 86–87
Seidlhofer 26, 88, 165–166
Sejnowski 128
Selinker 35
Sharoff 263
Shen 129, 135, 137, 143, 154
Siegel 113
Silverstein 227
Simo Bobda 213
Simpson-Vlach 15, 242
Sinclair 10–11, 19, 49, 68, 85–86, 156, 173, 259
Siyanova 187
Sjöholm 175, 190
Smit 162
Sorg 148
Spack 88
Stavestrand 45, 56
Stefanowitsch 19–20, 274
Stein 258
Stevens 85
Stewart 218
Stoffel 134
Storrer 214
Street 88
Summers 85
Svartvik 34–35
Swales 65–66, 167
Swan 163, 187
Szmrecsanyi 213
T
Tagliamonte 213
Tapanainen 268
Tapper 14, 41, 54
Tercanlioglu 188
Teubert 15
Teytaud 136
Thagg Fisher 35–36, 57
Thewissen 109, 114–115, 123
Thompson 64, 75
Thomson 17
Thorndike 8
Thurston 86
Tibshirani 152
Tomasello 23, 174
Tono 114
Traugott 22
Tribble 85–87, 89–90
Trudgill 210
Tsao 243, 246, 248, 251
Tsur 139–140, 143
Tugwell 19–20
U
Urdang 258
V
Van den Poel 153
van Rooy 110
Virtanen 41
W
Wade-Woolley 113
Wagner 159
Waibel 187
Walker 265
Wang 113
Wardhaugh 44
Waring 175
Werlich 65
West 8
Wible 239–241, 243, 246, 248, 251
Widdowson 86, 88, 162, 239
Wiktorsson 40, 49
Williams 212
Willis 86
Wilson 8
Winford 166–167
Witten 128–129, 134–135, 137–138, 143, 154
Wray 167
Wynne 114
Y
Youssef 215
Z
Zhou 151
Zipf 8–9
Zutell 113