Ha Ha Ha! The (1206) us to COUGH UP into her (1245)
The (1246) Are nasty posts from the (1374)
The president of the (1375) her credibility by hiring this (1449)
I hear the new (1590) drop out of the C of C. The (1629) don't live here! As to the (1648) this latest appointment of (2144) as far as to threaten the (2160) threatening to boycott the (2211) what if Ly/Arms boycott the (2349) can you get by going to the
Chamber Maid chamber of commerce clean up Chamber Maid
Chamber Maid didn't have a pot Chamber Pot!Talk about Chamber of Commerce is said to Chamber Maid being made from Chamber of Commerce needs to Chamber Maid. Too bad, I used Chamber Maid wants to get rid Chamber had no money and has Chamber of Commerce, not all Chamber Maid, the C of C is Chamber of Commerce with | Chamber of Commerce, they have Chamber of Commerce? Any Chamber of Commerce to demand
As shown above, chamber collocates with of commerce, maid, and pot. The collocate chamber maid has negative connotations and is used to undermine the image of a female candidate running for public office.
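The keyword-in-context lines and collocates described above can be approximated with a few lines of Python. This is a minimal sketch, not the concordancing software used for the study: the function names `kwic` and `right_collocates`, the character-window `width` parameter, and the sample sentence are all assumptions made for illustration.

```python
import re
from collections import Counter


def kwic(text, keyword, width=30):
    """Return (left, match, right) triples for each occurrence of keyword."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    lines = []
    for m in pattern.finditer(text):
        # Take `width` characters of context on each side of the match.
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        lines.append((left, m.group(0), right))
    return lines


def right_collocates(text, keyword, width=30):
    """Count the word that immediately follows each occurrence of keyword."""
    counts = Counter()
    for _left, _match, right in kwic(text, keyword, width):
        words = re.findall(r"[A-Za-z']+", right)
        if words:
            counts[words[0].lower()] += 1
    return counts


# Hypothetical sample text, not taken from the corpus.
sample = "The Chamber of Commerce met. The chamber maid left."
for left, match, right in kwic(sample, "chamber", width=12):
    print(f"{left:>12} [{match}] {right}")
print(right_collocates(sample, "chamber"))
```

Counting only the first word to the right is a deliberate simplification; a fuller collocation analysis would also look at the left context and at wider spans.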
Susana M. Sotillo and Julie Wang-Gempp
Table 3: Spandex (Epithet and Collocations)
spandex (7)
(602) An overweight forty-something in spandex tights and spin-art
(620) in person!Blonde SpandexTaxes
(1450) bookstore and replace it with a spandex shop.
(1451) it with a spandex shop. Spandex Spandex 'R' Us!
(2006) further than she can stretch her spandex. This is the woman
(2009) in november). HERE, USE MY SPANDEX TO ABSORB THE
(2208) issue. She DOES wear spandex. Where are
Table 4: Clark Politics (Hyperbole and Sarcasm)
administration (14)
(843) that Mayor Robert Ellenport's administration will make a
(865) incompetence in Ellenport's administration.
(1333) relative of which administration official is
(1423) ADMINISTRATION: YOU
(1567) and no one in the Ellenport administration was
(1781) He added he's upset the administration awarded 16,000
(2958) Limited Open Green Spaces. His administration should be
(3978) easier for voters to defend this administration then it is to
(4390) residents rising up against the Administration, the pocket
(4393) election, all of a sudden the Administration and it's
(5746) a referendum on the Ellenport administration.
election
A Senior Bus on Election Day? <span >
(296) <pre>< Election Time grows near...
(3387) dems having won in the last election: as has been pointed
(3439) an opponents signs in a past election.
|< From: > Election Time grows
(4392) now that Cislo is up for election, all of a sudden the
(4412) power, beginning with the election of Pete N.,
(5160) DEMOCRATS WILL LOSE THIS ELECTION BECAUSE
(5208) and they vote for you. After the election, if they ask you
(5322) of financial trouble. (But this election is not for mayor
(5351) not playing a part in this election, you must be kidding,
(5532) A N D , jerk-- since when is any election NOT about the
Class, Ideology, and Discursive Practices in Online Political Discussions
Likewise, in Table 3, Spandex is repeatedly used as an epithet to describe this female candidate. Spandex refers to a cheap, elastic synthetic fabric. It conveys an image of cheapness (synthetic textile), unreliability (elastic fabric that stretches), and bad taste (an overweight person using spandex). Those using Spandex as an epithet want to portray this candidate as unrefined and unreliable, thus undermining her political platform and civic accomplishments.

Table 4 displays key words in context (administration, election) for the Clark Politics thread. This is an acerbic online political discussion between supporters and detractors of Democrats and Republicans vying for public office. The then Mayor of Clark Township is virulently attacked both online and in Letters to the Editor in the Star-Ledger, a metropolitan newspaper. Individuals posting to this political discussion thread claim that the Mayor is trying to influence election outcomes and is actively supporting certain Democratic candidates. As shown in Table 4, the term administration collocates primarily with the Mayor’s last name (Ellenport) and with the determiners the, this, and which. Those criticizing the local governing body want to establish a direct link between negative aspects of the town’s current administration and the Mayor’s alleged autocratic managerial style. In Excerpt 1, individuals posting to this acrimonious online political discussion thread use metaphors, hyperbole (e.g., the reincarnation of Goebbels), and action verbs (to strike, set up, decree, draw, spread, attacking, come out, force) in an attempt to persuade the cyber audience to challenge the Mayor and vote against his Democratic candidates in the upcoming election:

Excerpt 1:
< Subject: RE: Clark Politics > < From: > Ellenport's Y2K website < Date: 07-Jul-99 >
Well, R.E., the reincarnation of Goebbels, strikes again. Using Taxpayer money, he wants to set up a site in an attempt to draw readership away from this site, and thus spread his lies and innuendoes via the Internet. Of course, it will be an EP PR site, lauding him and his lackeys for everything they do, and attacking the opposition at every opportunity. The only way to stop this travesty is for taxpayers to attend the next council meeting and let the Council and R.E. know that you are AGAINST this useless waste of your tax dollars, the way the people from the Picton Park area came out and forced the Council to do a turnaround on their RE-decreed stance against fencing their park in. Roy

No one is immune from online attacks, as shown below in Excerpts 2 through 4 from the Corzine thread (consecutive exchanges). The participants are middle-class residents of the towns of Nutley and Montclair. In the following exchanges,
Corzine’s detractors use semantic motifs that appeal to various politically conservative members of the cyber audience who perceive issues such as universal health care and free education as attacks on capitalism:

Excerpt 2:
< Subject: John Corzine for US Senate > < From: > Italian Americans against Corzine < Date: 21-Apr-00 >
Corzine says he has "bold ideas"? Universal this and universal that are not bold ideas. They are representative of the same tired, empty promises of statist elites who will never have to live in the world they've created for us.

Excerpt 3:
< Subject: RE: Jon Corzine for US Senate > < From: > < Date: 21-Apr-00 >
Corzine sits around the country club, imagining everything is made of gold, and tells anyone not falling asleep how the city of gold should be built for those same people who can't get a glass of water at the same country club. A yuppie hypocrite. Socialist with a limousine. Leftie with his Brooks Brothers suits on hand for emergencies only....when his tailor gets sick.

Excerpt 4:
< Subject: RE: Jon Corzine for US Senate > < From: > < Date: 21-Apr-00 >
Corzine was attacking Florio over the tax hike issue! The height of Corzine hypocricy. Corzine wants a HUGE and EXPENSIVE socialist health care bureaucracy, but he points at and condemns the Florio record on taxes. Is any left who thinks Corzine is honest?

Supporters of Corzine counterattack by using both abusive language and factual information concerning tax hikes and political machinations within the Democratic Party, as shown in Excerpt 5:

Excerpt 5:
< Subject: RE: Jon Corzine for US Senate > < From: > Corzine Fan! < Date: 21-Apr-00 >
Brooks Brothers? County Club? You are such an ass. He doesn't belong to a country club and just cause the guy can afford a nice suit makes him a yuppie. You know nothing you idiot. Jon Corzine is a good person who instead of sitting at the country club or socializing or traveling his retirement away at age 50 he is going to union halls and senior citizen homes to talk to people about trying to change a system that does not work and will not be able to support the numbers of people who are on it. You whine about socializing health care but there are hundreds of thousands of people in the country who have none. You talk about hipocracy like you know what it is. Florio's tax hike was done to help the working families of NJ but his so called political smarts failed to let him in on a big secret. 1.don't tax the rich--they'll run a campaign against you 2. Don't ram your policies down peoples throats, explain what your doing and why…

The frequency lists helped us identify repetitive words or groups of words that seemed to encode specific political ideologies or orientations. Word collocations to the left and right of the keyword socialist abound in the Corzine thread, such as socialist medicine, socialist liar, Socialist Party, socialist health care, yuppie socialist, hypocrite socialist. These words encode specific political ideologies and orientations of groups of individuals posting to the discussion thread. Table 5 displays instances of negative semantic prosody or “the phonological colouring which spreads beyond segmental boundaries,” as cited in Partington (2001: 84). Negatively loaded words are frequently exploited by ideologically conservative individuals posting to this thread in an effort to undermine Jon Corzine’s liberal political platform and discourage prospective voters:

Table 5: Negative Semantic Prosody (Jon Corzine’s Thread)
Socialist (22)
(266) To hell with Corzine and his socialist medicine.
(271) for universal health care as socialist medicine asshole?
(311) calls himself an “independent socialist” could win a
(344) and not Corzine the yuppie socialist.
(348) country club. A yuppie hypocrite. Socialist with a
(352) wants a HUGE and EXPENSIVE socialist health care
(402) stuff. It is a lie. Corzine is a socialist. Go to the NJ
(501) that mean Comrade Corzine’s socialist medicine is

The word collocation analysis reveals issues of concern to various participants in these online political discussions, and underscores divergent ideological orientations among those posting lengthy messages to the discussion threads. With respect to the type of lexical verbs used frequently by individuals posting to these political discussion threads, verbs of cognition and perception
such as know, think, believe, and feel seem to predominate when discussing political intrigues or township spending practices. As shown in Table 6 (from the Union Township thread), the verbs know, think, and believe, which encode mental states or activity, appear to be used by participants posting messages to this discussion thread as a means of taking a stand on social and economic issues that directly affect their community. They are also used for displaying an understanding of local and statewide political issues, institutional practices, and political machinations:

Table 6: Verbs of Cognition and Perception in Union Township
believe
(1026) ordinance (for those who don't believe, call the Clerk's
(1115) accepted opinion. However, I truly believe that the current
(1432) often enough people will begin to believe it--especially with
(1436) as the Big Liars would have you believe--but because the
know (21)
(245) And in 1997. Items 1,2, 6 and 9 I know I personally brought to
(465) isn't responsive government, I don't know what is. Incidentally,
(494) information. The guys at town hall know this chat page exists.
(499) information age. I posted what I know, based on conversations a
(500) with the Township Engineer. If I know this info, then the
(508) problem, and with the study, we know about the cure. By
(509) here, lots of folks should now know about the study - and I
(535) their own saftey. Believe me, I know.
(636) better off keeping his mouth shut. I know that if I were a
(647) That anyone in Union should not know that the mayor is NOT
(654) Amazing The person that wanted to know about a Mayors election
(887) incomes above $500,000, we all know once a tax is created it
think (12)
(136) 09-Sep-99 > to ? you seem to think that we have no problems
(252) seriously. Of course, I think this is just skimming
(329) away in its own containers. I think that the DPW did a great
(334) waiting for a pick-up. Do you think we were the only
(704) remember hearing $500K, but I think it was a grant just
(802) speak to me. Does anyone really think that the state, once
(804) for dollar revenue to Union? I think not. There is no reason
(970) Russocrat living under? You think Petty is only a
(1332) though! What does everybody think??

3.3 Pronoun Usage
The use and function of pronouns and pronominals have been extensively investigated in CDA as a linguistic mechanism through which powerful elites and defenders of the status quo exert discursive domination over less powerful actors in the polity. Discursive domination in CDA implies controlling the national
discourse, setting the political agenda, defining discourse parameters, and enacting national guidelines (van Dijk 1996). In each of the four political discussion threads examined, participants eagerly exchanging messages seem to use first person and second person singular pronouns to a greater extent than first and third person plural pronouns, which are normally found in the type of political institutional discourse investigated by critical discourse analysts. In Excerpt 6, for instance, I is used by one of the participants posting to the Union Township thread to indicate in-group membership and display knowledge of local political affiliations and shenanigans:

Excerpt 6:
< Subject: > RE: Union Election -1999 < From: > Dem Two < Date: 18-Oct-99 >
As a former "Russocrat", now just an ordinary run-of-the mill Democrat I'd like to bring to the attention of a previous poster his/her error in observation. The inner circle of the local Democratic Party consists of about 5 individuals who "call the shots". My guess is that group is limited to JC, CM ,AT,PS & TP. The infamous "JP" is at best a "consultant". M. Cohen is no more a 'bigwig' in the Democratic Party than I am. If that were the case would he be a candidate in a classically gerrymandered GOP district? Why would the Dems hang one of their 'bigwigs' out to dry? And, why would he be foolish enough to do it?

An individual responding to the former “Russocrat’s” comments uses you to directly address the author of the message and challenge the veracity of his/her claims. In Excerpt 7, this participant offers alternative explanations for the current intrigues and power struggles within the Democratic Party:

Excerpt 7:
< Subject: > RE: Union Election -1999 < From: > Real Unionite < Date: 23-Oct-99 >
Issue 1--what cloud is the so-called former Russocrat living under? You think P. is only a consultant and C. is just a smart guy running because he has some kind of goodwill? C. is president of the JP-JC coalition-inspired club whose existence is dependent on them and for whom, like the rest in his club, he is a puppet only carrying out orders to bring the worst in machine politics into Union …

The above message posted by Real Unionite is criticized by another participant in Excerpt 8 below, who uses you to directly reprimand him/her for allegedly distorting the truth:
tags in the BNC data files, stripped SGML tags including grammatical markup, and mapped SGML entities to the corresponding characters. Spaces around orthographic word-interior hyphen and apostrophe were removed. The resulting text data were amalgamated into nine large data files with 87,221,955 tokens total for further processing.15 Frequency lists of 1-, 2-, 3-, 4-, 5-, 25-, and 50-grams in the two corpora were produced with kfNgram. Relevant options chosen were: not case-sensitive, preserve word-interior hyphens and apostrophes, replace numerals with #, floor 50. Standard kfNgram character remapping was chosen, so boundaries between sentences, paragraphs and even entire texts were ignored on the reasonable assumption that the random “pseudo-n-grams” resulting from this expediency would fall below the relatively high threshold chosen. The 5,000 most frequent alphabetic n-grams for each value of n were then imported into a Microsoft Access database for further analysis. Three sets of queries yielded the following record sets16: 1.
Excerpt 8:
< Subject: > RE: Union Election -1999 < From: > To Real Unionite < Date: 24-Oct-99 >
You are free to criticize but you have an obligation to base the criticism on TRUTH. Unless F. ran under the Russo mantle in the distant unremembered past and was then plucked [from obscurity] only to run and lose, you cannot make that statement regarding his recent bids for township committee. Those have come after many years of quite visible service on the Planning Board and then the Board of Adjustment.

A pronoun closely associated with exclusion, they, was used more frequently by those posting to two of the political discussion threads: Bloomfield Politics and Union Election. CDA-motivated research has shown that the pronouns we/us versus they/them are frequently used to communicate in-group and out-group membership in political discourse. As shown in Excerpt 9, netizens from the Union Election thread are aware of the us vs. them socio-political configurations:

Excerpt 9:
< Subject: > RE: Union Election -1999 < From: > Politico < Date: 28-Sep-99 >
Look, they is them and we is us. Issues be damned, we is looking out for us. There ain't nothing more. It's the game.

Once the topics, themes, or issues discussed in each of these threads had been identified, we calculated the type/token ratio, that is, the number of different words (types) divided by the total number of words (tokens), for each text after normalizing it to 10,000 words. As previous CDA research has shown, those engaged in political discourse, including prominent politicians and committed citizens, choose specific rhetorical strategies, semantic motifs, syntactic structures, and lexical items that reflect a shared history, socio-cultural beliefs, attitudes, and political orientations (Mautner 2000).
Using corpus linguistic tools to analyze linguistic forms, collocations, and rhetorical devices or strategies will help us uncover factors that underlie the power relations and discursive practices of groups of individuals from towns with dissimilar socio-economic characteristics whose political goals and ideologies clash in cyberspace.
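The type/token ratio mentioned above is simple to compute. The following is a minimal sketch under stated assumptions: the tokenizer (lowercased runs of letters and apostrophes), the function names, and the optional `sample_size` cut-off are illustrative choices, not a description of the procedure actually used in the study.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens (letters and apostrophes)."""
    return re.findall(r"[a-z']+", text.lower())


def type_token_ratio(text, sample_size=None):
    """Return distinct words divided by total words.

    If sample_size is given, only the first sample_size tokens are used,
    which is one possible way to compare texts of different lengths.
    """
    tokens = tokenize(text)
    if sample_size is not None:
        tokens = tokens[:sample_size]
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


# Hypothetical sentence, not taken from the corpus: 4 types over 5 tokens.
print(type_token_ratio("The cat and the dog"))
```

Because the ratio falls as a text gets longer, comparing threads on equal-sized samples (for example with `sample_size=10_000`) keeps the figures comparable.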
Table 7: Topics, Tone, and Participants’ Linguistic Choices and Rhetorical Strategies

Bloomfield Politics
Topics or Political Issues Discussed: Local elections to Town Council.
Tone of Arguments: Ad hominem arguments are directed at a female Republican candidate for Town Council who lives in the less affluent section of town. Agonistic exchanges predominate.
Linguistic Choices and Rhetorical Strategies: Epithets and hyperbole are used to disparage and undermine: “This thread isn't about socks. It's about a candidate who doesn't pay taxes, has the mental ability of a flea, and totally lacks class or dignity.”

Jon Corzine for US Senate
Topics or Political Issues Discussed: The discussion focuses on Jon Corzine’s platform and the $60 million he spent on this Senate race. Corzine successfully defeated an experienced Democratic opponent in the primary elections.
Tone of Arguments: Those posting messages against Corzine’s candidacy to the US Senate use slippery slope fallacies and negatively charged words in an effort to undermine his political platform.
Linguistic Choices and Rhetorical Strategies: Hyperbole and words with negative connotations such as socialist, communist, comrade, socialized medicine, lying yuppie, prevail in arguments put forth by Corzine’s opponents.

Clark Politics
Topics or Political Issues Discussed: Democrats and Republicans engage in heated discussions. Participants identifying with the Republican party vigorously try to persuade voters to reject alleged political shenanigans and vote against the Democratic Mayor’s protégé.
Tone of Arguments: Highly acrimonious exchanges predominate in this thread.
Linguistic Choices and Rhetorical Strategies: Participants use hyperbole and metaphors in an attempt to persuade the cyber audience: “Why aren’t you jumping up and down and screaming about what RE and his puppets are doing to the residents of Dawn Drive? Could it be you can’t because your lips are stuck to the Mayor’s ass?”

Union Elections
Topics or Political Issues Discussed: Historical facts and local and county politics are discussed. Useful information is exchanged with the cyber audience about the impact of Hurricane Floyd, problems with flooding, and ideas for streamlining municipal services.
Tone of Arguments: Standpoints and arguments are logically structured and clearly stated in this thread. Participants appear to be trying to reach consensus with respect to town politics and issues that affect them. In general, the tone of the arguments is civil.
Linguistic Choices and Rhetorical Strategies: Attempts to use race as an issue are directly confronted. For example, the use of nonstandard features by an alleged ‘impostor’ is challenged: “Bad impersonation of an African American. You called yourself ‘Vauxhall Voice’ in the past.”
4 Results and Discussion

4.1 Issues, Tone of Arguments, Linguistic Choices and Rhetorical Strategies
With respect to the first research question posed, the specific issues or topics addressed by participants in these CMC political discussions, Table 7 summarizes the topic or political issues discussed in each of the four threads, the types of
arguments presented, and the use of linguistic forms and rhetorical devices. Netizens posting messages to this unmoderated public bulletin board are concerned with a variety of issues, which include the political platform and qualifications of various candidates for public office, township-related financial problems, quality of life concerns, information sharing, and consensus building. Acrimonious exchanges and ad hominem arguments predominate in three of the four threads examined. Thus, with the exception of the exchanges found in the Union Election political discussion thread, the answer to the second research question concerning the tone of these arguments is that these political discussions are combative or agonistic. For example, in Excerpt 10, participants display their hostility toward this female Republican candidate by claiming that she has often failed to pay her taxes on time and lacks class and the capacity to think rationally:

Excerpt 10:
< Subject: RE: Bloomfield Politics - V is running for 3rd Ward??????? > < From: > < Date: 22-Sep-99 >
This thread isn't about socks. It's about a candidate who doesn't pay taxes, has the mental ability of a flea, and totally lacks class or dignity.

As shown in Excerpt 11, in addition to attacking her public image and credibility, one of her detractors associates her personal style with unbecoming behavior in public, such as cursing and screaming, which generally characterizes unschooled individuals:

Excerpt 11:
< Subject: RE: Bloomfield Politics – V is running for 3rd Ward??????? > < From: > < Date: 22-Sep-99 >
Maure is an outstanding lady and a conscientious councilwoman. She is very interested in what goes on in our third ward. All of a sudden her opponent is doing things out of her category trying to get points. It's not going to work because people don't forget all her cursing and swearing at the Board of Ed meetings. So keep up the good job Maure and you will be our new councilwoman.

4.2 Word Collocations and Negative Semantic Prosody
A qualitative analysis of the chat data enables us to answer the third research question, which concerns the use of specific rhetorical devices (e.g., sarcasm, hyperbole, and epithets) and word collocations to encode class and political ideology. Those actively posting to each of the four political discussion threads
use epithets and hyperbole, as well as negatively loaded words and word collocations, to structure their arguments. This seems to be a calculated strategy to draw the reader’s attention to the message posted and encourage debate. For instance, as shown in Table 8, critics of a female Republican candidate to Town Council, who lives in the South End or the working-class section of the Township of Bloomfield, draw attention to the negative connotations of cursing and swearing in an effort to persuade voters to reject her candidacy because they regard her as uncouth and uneducated, and thus incapable of representing her constituency effectively.

Table 8: Collocations for Cursing
cursing (5)
(1066) people don't forget all her cursing and swearing at the
(1312) They all remember too well her cursing and swearing at the
(1316) comes, remember her as the cursing and screaming woman.
(1893) few years ago were wild. The cursing, badmouthing,
As Cotterill (2001) has shown, negative semantic prosody spreads unpleasant connotations beyond single word boundaries. It characterizes adversarial exchanges in the Bloomfield thread and is used by critics of a female Republican candidate in this township to deconstruct her image and undermine her political platform (e.g., chamber maid, Spandex, her cursing and swearing). In three of the four threads, Bloomfield Politics, Jon Corzine, and Clark Politics, participants in this virtual town square are bent on deconstructing the public image and political platform of various candidates running for public office. They do so through their choice of words, expressions, and word collocations. In the case of the Bloomfield Politics thread, personal attacks posted to this public bulletin board on the Internet are followed up with letters to the editor in local newspapers in an effort to persuade local voters to reject the female Republican candidate.

Likewise, in the Jon Corzine for US Senate thread, critics of Corzine use word collocations with highly negative associations (Comrade Corzine, socialist medicine, multimillionaire yuppie) to deconstruct his public image among members of the cyber audience. However, twenty-nine percent (29%) of those posting messages to the Corzine thread support his candidacy and defend his political ideas. Jon Corzine, former chairman of brokerage house Goldman Sachs, is a wealthy, well-educated, and successful financial expert. Although two powerful and widely circulated newspapers, The New York Times and The Star-Ledger, failed to endorse his bid for the U.S. Senate, he defeated a seasoned politician, former New Jersey Governor James Florio, in the June 2000 primary and went on to defeat Robert Franks, a popular Republican candidate, in November 2000.

We are not, however, claiming that those posting to these political discussion threads influenced voters at the local or statewide level, since they represented less than 1% of the total number of registered voters from the various towns involved in this investigation.
4.3 Knowing and Thinking Online: Quantitative and Qualitative Results
The fourth research question investigates how participants employ mental verbs and personal pronouns and for which purposes. With respect to the use of verbs of cognition and perception (i.e., verbs denoting mental activity), Table 9 shows that those posting actively to the four discussion threads frequently utilize know and think. Both verbs encode cognition. There are also quantitative differences in the use of these verbs. For example, as shown in Table 9, think, know, and feel are employed to a greater extent by those posting to the Bloomfield political discussion thread; Bloomfield is a town routinely classified as lower-middle class/working-class. Also, despite the agonistic tone of the arguments and the informal nature of these political discussions, the verbs believe (cognition) and feel (perception) are rarely utilized by those posting messages to these four threads. This may be related to the virtual context or setting in which these political discussions are taking place. Discussions normally take place in clearly defined situational contexts (i.e., interviews or casual conversations happen in a specific setting or physical space), and this element is absent in online discussions (Yates 1996). Normed frequencies in Table 9 indicate that know is more extensively used by participants in the Bloomfield and Union Election discussions (31 and 30, respectively) than by those posting to the Corzine (23) and Clark Politics (19) threads.

Table 9: Verbs Denoting Mental Activity Normed per 10,000 Words

Thread                      Total Number of Words  Believe  Feel  Know  Think  Total Number of Mental Verbs
Bloomfield Politics         9,777                  8        15    31    34     88
Clark Politics              13,061                 7        8     19    13     47
Jon Corzine for US Senate   12,895                 5        2     23    26     56
Union Election              10,567                 11       2     30    14     57
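Norming a raw count to a rate per 10,000 words, as in Table 9, is a one-line calculation. The sketch below is illustrative only: the thread sizes come from Table 9, but the raw counts are hypothetical placeholders (the chapter reports only normed values), and the function name `normed_frequency` is an assumption.

```python
def normed_frequency(raw_count, total_words, base=10_000):
    """Scale a raw count to occurrences per `base` words of running text."""
    if total_words <= 0:
        raise ValueError("total_words must be positive")
    return raw_count / total_words * base


# Thread sizes are from Table 9; the raw counts are hypothetical placeholders.
thread_sizes = {"Bloomfield Politics": 9_777, "Clark Politics": 13_061}
hypothetical_raw_know = {"Bloomfield Politics": 30, "Clark Politics": 25}

for thread, size in thread_sizes.items():
    rate = normed_frequency(hypothetical_raw_know[thread], size)
    print(f"{thread}: {rate:.1f} per 10,000 words")
```

Norming matters here because the threads differ in length (from 9,777 to 13,061 words), so raw counts alone would overstate verb use in the longer threads.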
A qualitative analysis of these two verbs of cognition in context sheds light on the underlying reasons for their frequent use. For example, know appears to be employed by participants to mark a certain stance with respect to a political position taken, or to comment on the current state of affairs in the polity. As shown below in Excerpt 12 from the Corzine thread, know is used to display one’s understanding of institutional and societal practices:
Excerpt 12:
The fact is Corzine was the chairman of A FORTUNE 500 company!!! The man knows finances, business practices, etc. He had several thousand people working under him.

Normed frequency counts that appear in Table 9 above for think indicate that this verb is used frequently and slightly more often than know in the Bloomfield thread (34 vs. 31, respectively). In Excerpt 13, a netizen anxious about the political struggles and issues affecting Bloomfield uses think when reflecting upon his/her political choices in the November 2001 election:

Excerpt 13:
I voted for who I thought would best represent Bloomfield on a local and state level. I did not vote on personalities but representation. Living in this town for many years I for one do not think we are on a par with the other towns in our legislative district. (11/8/2001)

In Excerpt 14, think, which often collocates with you in this thread, is utilized by a town resident to scold the Republican candidate to Town Council and negatively comment on her character:

Excerpt 14:
Do you really think all of K’s supporters support YOU? Think they forgot about your duct tape on the lawn sign? Do you think when they whisper they’re pledging their support to you? YOU and your friends are the divisive ones. Now you’re paying the price. (11/4/2001)

The vertical distribution of verbs denoting cognition, perception, and activity (believe, feel, know, and think) is displayed in Figure 1. This graph shows striking differences in the frequency counts for each of the four verbs, which account for 76.62% of all verbs denoting mental states and activity selected for analysis. As has been shown in Excerpts 12 through 14, two of these verbs, know and think, are used effectively in these electronic discussions to indicate a contributor’s posture on specific issues such as taxes and township management, or to display his/her understanding of the political climate at the local or state level.
With respect to pronouns, CDA-motivated research has repeatedly shown that pronouns such as we/us vs. they/them are often used to communicate in-group and out-group membership. In addition, corpus linguistics research has shown that discourse-pragmatic functions are accomplished through linguistic forms, which index expressions of stance and serve as metalinguistic devices (see Tao 2001). Therefore, any likely associations between the form and function of pronouns (i.e., form-function relations) are carefully examined in these
computer-mediated political discussions. It is possible that the skillful use of pronouns by vocal participants in these online discussions may enhance the persuasive force of political arguments in hotly contested elections among members of the cyber audience. The results of the deployment of personal pronouns in the four political discussion threads are presented in the following section.

[Figure 1: Verbs Denoting Mental Activity Normed per 10,000 Words. Bar chart of believe, feel, know, and think for the Bloomfield, Clark, Corzine, and Union threads; vertical axis from 0 to 40.]

4.4 Distribution and Use of Personal Pronouns
In order to effectively answer the second part of the fourth question, concerning the use of pronouns in the four discussion threads, quantitative and qualitative analyses were carried out. The vertical distribution of these pronouns in the four discussion threads is displayed in Figure 2. Since the “we vs. they” and “us vs. them” schemas are widely used for differentiating socio-economic and political power (i.e. elites vs. the powerless), we expected to find high frequency counts for they, we, us, and them in all four threads. However, as shown in the raw and lemmatized frequency counts in Table 10, the totals for we and they in these political discussions were 349 and 387, respectively. A pronoun closely associated with exclusion, they, was used more frequently by those posting to the Bloomfield and Union threads (99 and 100, respectively). The inclusive pronoun we (126) was used to a greater extent by those posting to the Bloomfield thread, which may reflect working-class or lower-middle-class attitudes and ways of designating inclusion within compartmentalized sections of the township, as shown in messages repeatedly posted to this thread: “We are almost the laughing stock of Essex county right
behind Newark. Something will happen here in the next couple of weeks…”-“We have been spreading the word about her for weeks. We must remember we cannot stop now. V. does not have it as a council person mentally.”
[Figure 2: Total Pronoun Usage Lemmatized (Four Political Discussion Threads). Bar chart of I, You, We, and They for Bloomfield Township, Clark Township, Corzine (Nutley & Montclair Towns), and Union Township; vertical axis from 0 to 350.]

Contributors to the Union Township thread also use the first person plural pronoun more frequently. It is generally used to display solidarity with neglected constituents, especially when discussing the shallowness of political campaigns, where instead of addressing issues that affect ordinary citizens, politicians resort to mudslinging: “Again, we’re dealing with POLITICIANS! So, probably there will never be a “valid issue,” as long as it’s possible to sling mud. All political campaigns are based on negativity. Remember: when a Democrat calls his Republican opponent a “bum,” and in turn the Republican calls his Democratic counterpart a “bum,” the taxpayers can rest assured that no matter who wins we will have a “bum” in office.”

Table 10 displays the usage of pronouns, which have been lemmatized (i.e. each word has been reduced from its inflectional and variant forms to its base form) in order to perform a quantitative analysis of the data. This analysis seeks to determine whether or not the most frequently occurring personal pronouns such as I, you, we, and they are utilized in significantly different ways by participants posting to each of the four political discussion threads.
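The lemmatization step described above can be illustrated with a short Python sketch. It is only an approximation under stated assumptions: the lookup table, the function name, and the sample sentence (pieced together from phrases in the excerpts) are hypothetical, and the study itself worked with TACT rather than with code of this kind.

```python
import re
from collections import Counter

# Hypothetical lookup table: each surface form maps to the lemma that
# labels the corresponding row of the pronoun table.
PRONOUN_LEMMAS = {
    "i": "I", "me": "I", "my": "I", "mine": "I",
    "you": "you", "your": "you", "yours": "you",
    "we": "we", "us": "we", "our": "we", "ours": "we",
    "they": "they", "them": "they", "their": "they", "theirs": "they",
}


def lemmatized_pronoun_counts(text: str) -> Counter:
    """Count pronoun lemmas in `text`, folding variant forms together."""
    # Lowercase and keep alphabetic runs, so "we're" yields "we" and "re".
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(PRONOUN_LEMMAS[t] for t in tokens if t in PRONOUN_LEMMAS)


sample = "We must remember we cannot stop now. Do you think they forgot about your sign?"
counts = lemmatized_pronoun_counts(sample)
print(counts["we"], counts["you"], counts["they"])
```

Because the sketch lowercases every token, an abbreviation such as "US" would be counted under we, so real postings would need manual checking of such cases.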
Table 10: Pronoun Usage Lemmatized

Lemmatized  Form    Bloomfield  Clark     Jon       Union
Pronouns            Politics    Politics  Corzine*  Election
I           I       112         92        87        91
            me      23          22        21        18
            my      16          14        22        16
            mine    0           0         0         0
            Total   151         128       130       125
You         you     188         89        228       43
            your    55          18        62        8
            yours   1           2         1         1
            Total   244         109       291       52
We          we      73          39        38        52
            us      24          8         19        12
            our     29          5         20        29
            ours    0           0         1         0
            Total   126         52        78        93
They        they    52          57        56        54
            them    22          15        9         11
            their   24          24        27        35
            theirs  1           0         0         0
            Total   99          96        92        100

* Includes Towns of Nutley & Montclair.
Table 11: Two-way Design Chi-Square (Pronoun Usage)

Pronouns                          Bloomfield  Clark Township  Jon Corzine Thread    Union Township  Total
                                  Thread      Thread          (Nutley & Montclair)  Thread
1st Person Singular (I)           151         128             130                   125             534
2nd Person Singular/Plural (You)  244         109             291                   52              696
1st Person Plural (We)            126         52              78                    93              349
3rd Person Plural (They)          99          96              92                    100             387
Total                             620         385             591                   370             1,966
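The two-way test on the Table 11 counts can be recomputed from the observed frequencies alone. The Python sketch below assumes a standard Pearson chi-square test of independence without continuity correction. The chapter does not say which software or procedure was used, and the function name is hypothetical.

```python
# Observed counts from Table 11 (rows: I, you, we, they;
# columns: Bloomfield, Clark, Corzine, Union).
observed = [
    [151, 128, 130, 125],
    [244, 109, 291, 52],
    [126, 52, 78, 93],
    [99, 96, 92, 100],
]


def chi_square_independence(table):
    """Return (chi-square statistic, degrees of freedom) for a two-way table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            # Expected count if pronoun choice were independent of thread.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (obs - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof


statistic, dof = chi_square_independence(observed)
print(f"chi-square = {statistic:.2f}, df = {dof}")
```

Under these assumptions the statistic comes out close to the value reported for the two-way design, with 9 degrees of freedom; the p-value would still have to be read from a chi-square table or computed with a statistics package.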
Chi-Square tests were used to determine whether or not there were significant differences between the observed and expected values of these four variables (I, you, we, they) as employed by participants in the four online political discussion threads. We hypothesized that there were no significant differences in the way participants used these four pronouns. In the two-way design shown in Table 11, we are comparing four different pronouns or variables among themselves as used by those posting messages across four different political discussion threads. The results indicate that there are significant differences in the use of these pronouns across all four threads (χ² = 156.06, df = 9, p < .001). This means that contributors to these computer-mediated political discussions use the same lexical categories
in significantly different ways.⁴ Thus the null hypothesis (no significant differences) is rejected. To further investigate the differences among these four pronouns as deployed in each of the four discussion threads, we employed a one-way Chi-Square test. The results are displayed in Table 12, which is a summary table of one-way Chi-Square (χ²) tests.

Table 12: One Way Design Chi-Square

Pronouns                    Bloomfield       Clark            Jon Corzine  Union Election  χ² results
                            Politics Thread  Politics Thread  Thread       Thread
1st Person Singular         151              128              130          125             χ² = 3.17, df = 3, p < .10
2nd Person Singular/Plural  244              109              291          52              χ² = 192.6, df = 3, p < .001
1st Person Plural           126              52               78           93              χ² = 20.00, df = 3, p < .001
3rd Person Plural           99               96               92           100             χ² = 6.17, df = 3, p < .10
The one-way design χ² results show how one variable behaves across four different political discussion threads. That is, the four pronouns are compared independently. The results show that all participants posting to these threads use the first person singular and third person plural pronouns, I and they, in the same manner when posting messages to the discussion threads, but significantly differ in their use of you, the lemmatized second person pronoun (χ² = 192.6, p < .001), and we, the first person plural pronoun (χ² = 20.00, p < .001). Since CDA research has shown that these two pronouns are generally associated with issues of inclusion or in-group membership and exclusion or out-group membership (Oktar 2001), the results of our analysis show that these discursive practices are employed by those posting to the four discussion threads. To summarize, Table 10 and Figure 2 show that the lemmatized pronouns you (696) and I (534) are used to a greater extent than we (349) and they (387) in all four political discussion threads. However, the results of One-Way Chi-Square tests show that netizens posting to the Bloomfield discussion thread use the personal pronouns you and we in a significantly different manner from contributors to the other three threads. Although individuals discussing political issues in the Corzine thread use you to a greater extent than others posting to the Clark and Union threads, the high numbers can be explained by the fact that this thread includes residents from two different towns, Nutley and
Montclair. Socio-economic and class differences may influence the choice of you and we, two pronouns which are in most cases used to designate exclusion and inclusion, depending on the context surrounding their use. As demonstrated by the One-Way Chi-Square (χ²) results, residents from Bloomfield, a working-class/lower-middle-class township, use you and we for different functions and to a significantly greater extent than residents from the other middle- and upper-middle-class townships.

4.5 The Pragmatics of You
A subsequent qualitative analysis of each of the four pronouns identified in the chat data allows us to address the fifth research question, which seeks to uncover the specific discourse-pragmatic functions that are accomplished through the use of personal pronouns. The results of this corpus investigation demonstrate that participants in these electronic discussions use we and they for in-group and out-group social categorization. In addition, an exhaustive analysis of you reveals that it is used to accomplish five specific pragmatic functions, which include social categorization and persuasion. Furthermore, the use of first and second person pronouns also reflects the self-centered nature of these computer-mediated texts, and the similarities to speech or face-to-face (F2F) communication (Yates 1996).

A close examination of variable context samples for you obtained from all four discussion threads reveals that when participants use the second person singular or plural pronoun, you, in the context of heated computer-mediated political discussions, five specific functions are accomplished: (1) giving or requesting information; (2) persuading prospective voters; (3) admonishing a political candidate not physically present; (4) excluding those who do not share one’s political beliefs or ideologies; and (5) addressing others in a broad sense (i.e. using generic “you”). Excerpts 15 through 20 illustrate the discourse-pragmatic functions of you in these online political discussions. For example, in Excerpt 15 of the Bloomfield thread, where ad hominem arguments abound, you is used to give advice or information to a political ally (or opponent), which in turn creates a sense of intimacy between two cyber strangers:

Excerpt 15 Boy are you right that Bloomfield is up for the sale to the highest bidder! For evidence, you just need to go to the website of the Election Law Enforcement Commission (www.elec.state.nj.us) and click on View Reports and enter in the names of our local candidates. You'll need to download a "plug-in" but once you get to see the campaign contribution reports you won't be sorry that you spent the time to do it. (11/4/2001)

Animated and sarcastic political exchanges between supporters and detractors of Corzine are illustrated by Excerpt 16 below, where you is used to address a specific member of the cyber audience:
Excerpt 16 Before you go off on another verbal flamethrower assault, why not explain to us how “universal health care” and “free college tuition” can possibly be provided for free? Does that mean the Mayo Clinic and Princeton will remain the highest standards on name alone? Make sense before you march out there with fists in the air shouting, “Comrade Corzine hates Fascists!”

You is also often used to admonish a candidate not physically present, as is the case in Excerpt 17 from the Corzine thread:

Excerpt 17 <Title: Jon Corzine for US Senate> <MEMO TO CORZINE FROM THE ITALIAN COMMUNITY> DROP OUT NOW BEFORE ITS TOO LATE YOUR MONEY CANNOT AND WILL NOT BUY YOU AN ELECTED OFFICE THAT YOU ARE UNQUALIFIED FOR. YOUR FROM THAT MM BROKERAGE HOUSE GOLDMAN SACKS. YOU MADE MANY MILLIONS I WONDER IF IT WAS DUE TO INSIDER TRADING. (Italics, misspellings and grammatical errors have been preserved since they were present in the original postings.)

As shown in Excerpts 18 and 19, the deictic nature of you also serves to exclude those who hold different political ideologies and socio-cultural values:

Excerpt 18 <Title: Jon Corzine for US Senate> I value my heritage AND my paycheck. To hell with Corzine and his socialist medicine. Haven’t you idiots learned anything from the collapse of the Soviet Union? Everything has a price. Nothing is free. Corzine is selling a used car lemon to voters.

Excerpt 19 < Subject: RE: Clark Politics > < From: > Mr. M.'s stupid comment < Date: 07-Jul-99 > Understand something Mr. M., Picton Park and Dawn Drive are radically different situations. Many of the Council members are in favor of building Picton Park; you are against it. You are a thorn in their side and F told you that he would not tolerate trouble making from you over
Picton Park. The Picton Park meetings are held to serve a purpose; the attendees want to accomplish something so those meetings are worthwhile.

Finally, as shown in Excerpt 20, you is also used in a generic sense in the context of these computer-mediated discussions. It can be replaced by one:

Excerpt 20 < Title: Union Election – 1999> Not a very impressive victory when you consider no one knew BK, he had no platform, money, or personality, and no party support. Makes you wonder what would have happened if a viable candidate was run.

Despite the fact that participants have to use a keyboard in order to prepare their messages, these political discussions are more similar to informal conversations in slow motion than to those taking place in more formal environments (e.g., public forums or interviews with candidates). Although civil discourse is violated on numerous occasions in the Bloomfield and Clark threads, individuals discussing local and statewide politics in these threads seem to be observing appropriate social conventions found primarily in face-to-face conversations (Collot and Belmore 1996). In fact, in the Corzine thread, a careful reading of the messages shows that two of Corzine’s critics use you as a means of directly addressing Corzine the candidate in an informal but polite tone: “Q: What makes you qualified for the Senate, sir?” This is a very effective deictic expression that encodes class aspects of social relationships. It is also used to achieve a specific communicative intent; that is, to address a socially powerful candidate not physically present. Instead of the expected “we/they” and “us/them” schema for social categorization, it seems that the hybrid nature of these text-intensive online discussions accounts for the frequent use of you and I. This virtual setting creates an atmosphere conducive to self-centered narratives.
Yates (1996) has pointed out in his analysis of computer conferencing that the high levels of first and second person pronoun usage found in CMC discourse can be explained by the lack of a clearly defined semiotic field or social structure in which communication among participants takes place. With respect to the use of the first person, I, participants in this study feel the necessity to continuously display their knowledge of topics under discussion, challenge critics, and persuade potential allies. On the other hand, you, the second person, as has been extensively discussed, seems to be used by those posting to these discussion threads to perform five specific functions: providing information; persuading voters; admonishing a political candidate; excluding others; and speaking in general terms. To sum up, the data indicate how pronoun choices in these online discussions are part of a highly grammaticalized system and assume addressee knowledge of the identity of the individual posting a message (i.e. the speaker) in
order to identify referents in relation to the point of origin. Crucially, the references can only be understood by an addressee who is able to reconstruct the speaker’s viewpoint. When this reconstruction occurs, the intersubjectivity attained is a kind of common ground the speakers and addressees share.

5 Conclusion
The present study has used a theoretical and methodological framework informed by the traditions of critical discourse analysis and corpus linguistics in order to uncover class indicators, political ideology, and discursive practices in the online political discussions examined. The major findings of this study can be summarized as follows:

1. Netizens posting to these online political discussion threads are concerned with the qualifications and political platforms of local and statewide candidates, quality of life issues, township financial problems, information sharing, and consensus building.

2. Acrimonious exchanges and ad hominem arguments predominate in three of the four discussion threads whereas consensus building characterizes debate in the Union Election thread.

3. Negative semantic prosody, which spreads unpleasant connotations beyond single word boundaries, characterizes adversarial exchanges in these online discussions. Specific word collocations, choice of lexical items, sarcasm, hyperbole, and epithets, which encode differences in class and political ideology, are used to undermine political candidates and persuade the cyber audience.

4. Those posting to the four discussion threads utilize verbs denoting mental activity to display their knowledge and understanding of local and statewide issues, political institutions, and societal practices.

5. Whereas personal pronouns I and they are employed in the same manner by contributors to these online discussions, their use of you and we differs significantly. Additionally, five pragmatic functions are accomplished through the use of you: giving or requesting information; persuading members of the cyber audience; admonishing a political candidate; excluding those with different political ideologies; and addressing others in a general sense.
In the case of the Clark Township political discussions, opponents of the then Democratic Mayor appear to have succeeded in portraying his protégé as an ineffective political candidate by disseminating damaging information about her political alliances and dealings. Postings to this unmoderated bulletin board and
fliers distributed among residents of this township prior to the local elections became so rancorous that some citizens publicly expressed their indignation in Letters to the Editor published in the Star-Ledger, criticizing their tone and content:

Every town in New Jersey has a small cadre of people who live to criticize the local government and, in some instances, its employees. They seem to thrive on negativity. Very rarely will they compliment someone who performs a service for the town. I also find K’s relishing the misfortunes of the mayor’s bankruptcy offensive and immoral. K. is within his rights as a citizen, but we need civil discourse, not venom, when discussing issues. A moratorium on personal attacks both locally and nationally is in order. (Letters to the Editor, J.D., 8/13/2000)

Informative messages that contributed to the formation of a healthy online political community were posted regularly to the Union Election thread. Although the trading of insults characterized several of the initial exchanges among residents, very useful information was disseminated to town residents. Participants seemed genuinely interested in consensus building.

An extensive qualitative analysis of linguistic data has enabled us to uncover different political ideologies and discursive practices that characterize the political exchanges of residents from socio-economically diverse towns. The results of this study seem to indicate that socio-cultural orientations and political ideologies found at the local level generally reflect marked differences in political ideology and socio-economic policies found at the national level, which generally favor powerful elites and political lobbyists at the expense of less powerful sectors of American society such as members of the working- and lower-middle classes. Digital communication networks and the Internet are changing the nature of political discourse.
The findings of this study indicate that ad hominem arguments and negatively oriented semantic prosody abound in the Bloomfield thread, though they are also present in the Clark Politics and Jon Corzine discussions. In contrast, very useful political and historical facts, as well as technical information, are employed to structure arguments by participants in the Union Election thread. Unlike face-to-face speech acts, these cyber acts or postings are not mitigated by euphemisms, which are often used to save face, or appeal to higher authority or common interests. Strategies such as persuasion, opposition, resistance, protest, and consensus building are enacted by cyber chatters as social actions in the political and cultural context of these computer-mediated discussions through the use of rhetorical devices, word collocations, mental verbs, and personal pronouns. In the past, except for letters to the editor, the general public has had very limited or no access to the media but this has changed with the phenomenal growth of the Internet and expansion of digital networks (Gurak 1996). As van Dijk (1996: 89-90) observes, mentally mediated control of the actions of others is
the ultimate form of power. This is what powerless individuals in dialogue with more powerful participants are attempting to do through these computer-mediated political exchanges. They are attempting to influence a wider cyber audience comprised of ordinary citizens and individuals in positions of authority (e.g., the Mayor, Town Council, and the more affluent residents of each of the towns). It is possible that these new digital technologies can strengthen democracy because they enable less powerful political actors to compete on a more equal playing field with stronger and more powerful political actors, such as the rich and influential members of society. As Hollihan (2001: 159) points out, the Internet and digital communication networks have the potential to create “healthy public spheres,” in which carefully constructed political arguments are tested and evaluated by an informed cyber audience that represents a wide spectrum of American society. It is only through a broad participation of an involved and informed electorate that democratic communities can flourish and encourage all citizens to work towards a common goal. The effects of computer-mediated political discussions on voter behaviour at the local and state levels, as well as the percentage of voters who directly participate in these discussions, should be empirically investigated. It is worth pursuing in future studies the extent to which political discussions in cyberspace influence voters’ opinions, political choices, and public policy in similar urban and suburban American towns.

Notes

1. Available as shareware, TACT is a suite of powerful software programs for analyzing electronic texts. The manual, Using TACT with Electronic Texts (1996), is edited by I. Lancashire, J. Bradley, W. McCarty, M. Stairs, and T. R. Wooldridge.

2. Type is the individual graphic word because TACT cannot subcategorize different word forms that belong to the same lemma, and token is the total number of words that appear in the text being analyzed.

3. Normalization is a means of adjusting raw frequency counts from texts of different lengths so that they can be compared accurately (see Biber et al. 1998: 263).

4. We are indebted to Dr. Longxing Wei for his editorial suggestions and assistance in performing Chi-Square tests.
References

Biber, D., S. Conrad, and R. Reppen (1998), Corpus linguistics, Cambridge: Cambridge University Press.
Bromberg, H. (1996), Are MUDs communities? Identity, belonging and consciousness in virtual worlds, in R. Shields (ed.), Cultures of Internet: Virtual spaces, real histories, living bodies, London: Sage, pp. 143-152.
Chilton, P. and C. Schäffner (1997), Discourse and politics, in T.A. van Dijk (ed.), Discourse as social interaction, Vol. 2, London: Sage, pp. 206-230.
Collot, M. and N. Belmore (1996), Electronic language: A new variety of English, in S. Herring (ed.), Computer-mediated communication: Linguistic, social and cross-cultural perspectives, Amsterdam: John Benjamins, pp. 13-28.
Cotterill, J. (2001), Domestic discord, rocky relationships: Semantic prosodies in representations of marital violence in the O.J. Simpson trial, Discourse & Society, 12 (3): 291-312.
Fairclough, N. and R. Wodak (1997), Critical Discourse Analysis, in T.A. van Dijk (ed.), Discourse as social interaction, Vol. 2, London: Sage, pp. 258-284.
Gurak, L.J. (1996), The rhetorical dynamics of a community protest in cyberspace: What happened with Lotus MarketPlace?, in S. Herring (ed.), Computer-mediated communication: Linguistic, social and cross-cultural perspectives, Amsterdam: John Benjamins, pp. 265-277.
Hauben, M. and R. Hauben (1997), Netizens: On the history and impact of Usenet and the Internet, Los Alamitos, CA: IEEE Computer Society Press.
Herring, S., D. Johnson, and T. DiBenedetto (1992), Participation in electronic discourse in a ‘feminist’ field, in K. Hall, M. Bucholtz, and B. Moonwomon (eds), Locating power: Proceedings of the Second Berkeley Women and Language Conference, Berkeley, CA: Berkeley Women and Language Group, pp. 250-262.
Herring, S., D. Johnson, and T. DiBenedetto (1995), This discussion is going too far! Male resistance to female participation on the Internet, in M. Bucholtz and K. Hall (eds), Gender articulated: Language and the socially constructed self, New York: Routledge, pp. 67-120.
Hollihan, T. (2001), Uncivil wars, New York: St. Martin’s.
Hollihan, T., P. Riley, and J.F. Klumpp (1993), Greed versus hope, self-interest versus community: Reinventing argumentative praxis in post-free marketplace America, in R.E. McKerrow (ed.), Argument and the postmodern challenge, Fairfax, VA: Speech Communication Association, pp. 332-339.
Katz, J. (1996), The age of Paine. http://www.hotwired.com/wired/3.05/features/paine.html.
Kennedy, G. (1998), An introduction to corpus linguistics, London: Longman.
Kolko, B. (1998), We are not just (electronic) words: Learning the literacies of culture, body, and politics, in T. Taylor and I. Ward (eds), Literacy theory, New York: Columbia University Press, pp. 61-78.
Kress, G. (1996), Representational resources and the production of subjectivity: Questions for the theoretical development of Critical Discourse Analysis in a multicultural society, in C.R. Caldas-Coulthard and M. Coulthard (eds), Texts and practices, London: Routledge, pp. 15-31.
Lancashire, I. (1996), Using TACT with electronic texts: A guide to text-analysis computing tools, New York: The Modern Language Association of America.
Mautner, G. (2000), Deutschland über alles – and we are part of ‘alles’, in M. Reisigl and R. Wodak (eds), The semiotics of racism, Vienna, Austria: Passagen Verlag, pp. 223-236.
McChesney, R. (1999), Rich media, poor democracy, Chicago: University of Illinois Press.
Ng, H. and J. Bradac (1993), Power in language, London: Sage.
Nguyen, D.T. and J. Alexander (1996), The coming of cyberspacetime and end of the polity, in R. Shields (ed.), Cultures of Internet: Virtual spaces, real histories, living bodies, London: Sage, pp. 99-124.
Oktar, L. (2001), The ideological organization of representational processes in the presentation of us and them, Discourse & Society, 12 (3): 313-346.
Partington, A. (2001), Corpus-based description in teaching and learning, in G. Aston (ed.), Learning with corpora, Houston, TX: Athelstan, pp. 63-84.
Ronfeldt, D. (1991), Cyberocracy, cyberspace, and cyberology: Political effects of the information revolution, Santa Monica, CA: RAND.
Sardar, Z. (1996), alt.civilizations.faq: Cyberspace as the darker side of the west, in Z. Sardar and J. Ravetz (eds), Cyberfutures, New York: New York University Press, pp. 14-41.
Simpson, P. (1993), Language, ideology and point of view, London: Routledge.
Sproull, L. and S. Kiesler (1986), Reducing social context cues: Electronic mail in organizational communication, Management Science, 32: 1491-1512.
Tao, H. (2001), Discovering the usual with corpora: The case of remember, in R. C. Simpson and J.W. Swales (eds), Corpus linguistics in North America, Ann Arbor: University of Michigan Press, pp. 116-144.
Teo, P. (2000), Racism in the news: A critical discourse analysis of news reporting in two Australian newspapers, Discourse & Society, 11 (1): 7-49.
van Dijk, T.A. (1996), Discourse, power and access, in C.R. Caldas-Coulthard and M. Coulthard (eds), Texts and practices, London: Routledge, pp. 84-104.
van Dijk, T.A. (1997), Discourse as interaction in society, in T.A. van Dijk (ed.), Discourse as social interaction, Vol. 2, London: Sage, pp. 1-37.
van Dijk, T.A. (1998), What is political discourse analysis? in J. Blommaert and C. Bulcaen (eds), Political linguistics, Amsterdam: John Benjamins, pp. 11-52.
van Dijk, T.A. (2001), Political discourse and ideology, April 29 (2nd draft), Jornadas del Discurso Politico, UPF. Barcelona, Spain, 1-17, Retrieved from the World Wide Web July 6, 2001: http://www.hum.uva.nl/teun/dis-pol-ideo.htm.
Widdowson, H.G. (1998), The theory and practice of critical discourse analysis, Applied Linguistics, 19: 136-151.
Yates, S. (1996), Oral and written linguistic aspects of computer conferencing: A corpus based study, in S. Herring (ed.), Computer-mediated communication: Linguistic, social and cross-cultural perspectives, Amsterdam: John Benjamins, pp. 29-46.
Computer Learner Corpus Research: Current Status and Future Prospects

Sylviane Granger
University of Louvain, Belgium

Abstract

Despite a mere decade of existence, the field of computer learner corpus (CLC) research has been the focus of so much active international work that it seems worth taking a retrospective look at the research accomplished to date and considering the prospects for future research in both Second Language Acquisition (SLA) studies and Foreign Language Teaching (FLT) that emerge. One of the main distinguishing features of computer learner corpora – and indeed one of their main strengths – is that they can be used by specialists from both these fields and thus constitute a possible point of contact between them. The first three sections of this chapter are devoted to a brief overview of the main aspects of CLC research: data collection, methodological approaches, learner corpus typology, and size and representativeness. Sections 4 and 5 review the tangible results of CLC research in the fields of SLA and FLT.
1 Introduction
The relative youth of computer learner corpus (CLC) research as a field of scientific enquiry (it burgeoned as a discipline as recently as the late 1980s) renders a definitive assessment of its achievements somewhat premature. However, enough work has been done to take stock of advances made in the field and to evaluate its future prospects. The main objective of this article is to assess whether, in making Leech’s (1992: 106) description of corpus linguistics our own, we would be justified in calling CLC research “a new research enterprise, a new way of thinking about learner language, which is challenging some of our most-deeply rooted ideas about learner language.” After highlighting some of the main features that distinguish CLC data from other types of learner data, I will take stock of the current situation in terms of corpus collection and analysis and give an overview of the current results and future prospects in two distinct but closely related fields: Second Language Acquisition (SLA) and Foreign Language Teaching (FLT).

2 Distinguishing Features of CLC Data
There is nothing new in the idea of collecting learner data. Both FLT and SLA researchers have been collecting learner output for descriptive and/or theory-building purposes since the disciplines emerged. In view of this, it is justified to ask what added value, if any, can be gained from using learner corpus data. Computer learner corpora typically fall into the category of natural or “open-ended” language use data, a data type which has not tended to be favoured in recent SLA research. There are many reasons why SLA researchers have tended to prefer other types of data, notably experimental and introspective data. The intention here, however, is not to expand on these (for a brief overview, see Granger 1998b: 4-6) and compare the respective values of natural and elicited data types, but instead to highlight three features which give CLC data a definite advantage over previously used natural use data, in the hope of reinstating this neglected data type.
2.1 Size
Computer learner corpora are electronic collections of spoken or written texts produced by foreign or second language learners. As the data is stored electronically, it is possible to collect a large amount of it fairly quickly. As a result, learner corpora are now counted in the millions rather than in the hundreds or thousands of words. But is big beautiful in SLA/FLT terms? The answer to this question is more of a “yes on the whole” or a “yes but” than an unqualified “yes.” Many SLA researchers have highlighted the drawback of using a very narrow empirical base. In reference to longitudinal SLA studies, which usually involve a highly limited number of subjects, Gass and Selinker (2001: 31) note that “It is difficult to know with any degree of certainty whether the results obtained are applicable only to the one or two learners studied, or whether they are indeed characteristic of a wide range of subjects.” It is the same kind of dissatisfaction and mistrust that led MacWhinney (2000: 3) to build the CHILDES child language acquisition database:
Conducting an analysis on a small and unrepresentative sample may lead to incorrect conclusions. Because child language data are so time-consuming to collect and to process, many researchers may actually avoid using empirical data to test their theoretical predictions. Or they may try to find one or two sentences that illustrate their ideas, without considering the extent to which their predictions are important for the whole of the child's language. In the case of studies of pronoun omission, early claims based on the use of a few examples were reversed when researchers took a broader look at larger quantities of transcript data.
Like child language data, L2 data is difficult to collect. While the practice of getting students to submit their homework electronically has become standard in some countries, in others this is still a very remote prospect. In any case, some types of text, for instance those produced as part of an exam or as a classroom exercise, still tend to be handwritten. The difficulty is compounded in the case of spoken data. In the absence of reliable automatic speech recognition software, collecting and transcribing oral data remains a highly time-consuming activity. In addition, any data that has been keyed in manually or scanned needs to go through a process of careful proofreading to ensure that the original learner text is faithfully transcribed with no new errors introduced and all the original ones kept.
This being said, there is no doubt that the widespread use of word processors, electronic mail and web-based learning environments will speed up learner corpus collection. Indeed some of the most recent learner corpora have been collected fully automatically (see Wible et al. 2001). Whether collected electronically over a very short period of time or after years of painstaking work, current learner corpora tend to be rather large, which is a major asset in terms of representativeness of the data and generalizability of the results. Of course, a very large data sample is not necessary for all types of SLA research. A detailed longitudinal study of one single learner is of great value if the focus is on individual interlanguage development. Likewise in FLT, as pointed out by Ragan (1996: 211), small corpora compiled by teachers of their own pupils’ work are of considerable value: “the size of the sample is less important than the preparation and tailoring of the language product and its subsequent corpus application to draw attention to an individual or group profile of learner language use.” In addition, as we will see in the following section, size is only really useful if the corpus has been collected on the basis of strict design criteria.
2.2 Variability
Learner language is highly variable. It is influenced by a wide variety of linguistic, situational and psycholinguistic factors, and failure to control these factors greatly limits the reliability of findings in learner language research. The strict design criteria which should govern all corpus building make corpora a potentially very attractive type of resource for SLA research. As rightly pointed out by Cobb (2003: 396), “It is a common misconception that corpus building means collecting lots of texts from the Internet and pasting them all together.” Atkins et al. (1992) list 29 variables to be considered by corpus builders. While many of these variables are also relevant for learner corpus building, the specific nature of learner language calls for the incorporation of L2-specific variables. Figure 1 represents all the variables that are controlled for and recorded in one particular CLC, the International Corpus of Learner English (ICLE) database. In addition to some general dialectal and diatypic variables, which are also used in native corpus building, the ICLE database contains a series of L2-specific variables, pertaining to the learner or the task. A search interface enables researchers to select data on the basis of these criteria (for more information on the ICLE, see Granger 2003a; Granger et al. 2002). This degree of control distinguishes CLC data from the samples of language use that are commonly used in SLA research. In his critique of EA (Error Analysis) studies, Ellis (1994: 49)
lists some of the factors that can bring about variation in learner output and notes that “unfortunately, many EA studies have not paid sufficient attention to these factors, with the result that they are difficult to interpret and almost impossible to replicate.” Gass and Selinker (2001: 33) make a similar comment in relation to cross-sectional SLA studies: “there is often no detailed information about the learners themselves and the linguistic environment in which production was elicited.”
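The practical value of recording such variables can be sketched in a few lines of Python. The record layout, field names and sample sentences below are hypothetical and are not taken from the ICLE search interface; the sketch only suggests how texts stored together with learner and task metadata might be selected on explicit criteria, which is what makes a study easier to interpret and replicate.

```python
from dataclasses import dataclass

@dataclass
class LearnerText:
    # Hypothetical metadata fields, loosely modelled on learner and task variables.
    text: str
    mother_tongue: str
    timed: bool
    exam: bool
    reference_tools: bool

def select(corpus, **criteria):
    """Return the texts whose metadata match every given criterion."""
    return [t for t in corpus
            if all(getattr(t, key) == value for key, value in criteria.items())]

# Invented sample records, for illustration only.
corpus = [
    LearnerText("I am agree with this opinion.", "French", True, False, False),
    LearnerText("He have loosed his keys.", "Dutch", False, False, True),
    LearnerText("It depends of the situation.", "French", False, False, True),
]

# Select untimed texts written by French-speaking learners.
untimed_french = select(corpus, mother_tongue="French", timed=False)
```

Because the selection criteria are stated explicitly in the call, another researcher could rerun the same selection on the same corpus.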
Figure 1: ICLE general and L2-specific variables
General variables (dialectal): age, gender, mother tongue, region
General variables (diatypic): medium, field, genre, topic, length
L2-specific variables (learner): other FL, L2 exposure
L2-specific variables (task): timing, exam, reference tools
It would be wrong, however, to paint too rosy a picture of current CLC. In all fairness, one must admit that (a) there are not many tightly-designed learner corpora in the public domain, and (b) there are so many variables that influence learner output that one cannot realistically expect ready-made learner corpora to contain all the variables for which one may want to control. Ideally, as stated by Biber (1993: 256), “theoretical research should always precede the initial design and general compilation of texts.” This preliminary theoretical analysis is the only way to ensure that the corpus will contain all the relevant design parameters.
2.3 Automation
So far, research on learner language has been largely manual. The ground covered in SLA and FLT research over the last decades shows that major advances can be made in the field without having recourse to computers. However, the benefit that researchers can derive from automating some of their work is so great that it would seem a pity to do without the invaluable help it can provide. While with small language samples the gain in terms of time and effort may not seem large enough to compensate for the investment necessary to become familiar with automated methods and tools, using big corpora makes it absolutely essential to use automated approaches. In the following, I will focus on four functions –
COUNT, SORT, COMPARE and ANNOTATE – which lend themselves particularly well to automation, and highlight their relevance for SLA/FLT research.
2.3.1 COUNT
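As a rough illustration of the counting options this subsection describes, the following Python sketch computes a word count, a frequency list in both sort orders, a type/token ratio and a list of two-word combinations. The sample sentences and the simple tokenizer are assumptions made for the example; packages such as WordSmith Tools apply their own tokenization rules, which are not reproduced here.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens (a deliberately crude rule)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

# Invented learner-style sample, for illustration only.
sample = "I think that people think a lot. I think that it is important."
tokens = tokenize(sample)

word_count = len(tokens)                    # number of running words (tokens)
freq = Counter(tokens)                      # word frequency list
ttr = len(freq) / word_count                # type/token ratio

# Two-word combinations; this simple version ignores sentence boundaries.
bigrams = Counter(zip(tokens, tokens[1:]))

by_frequency = freq.most_common()           # list sorted in frequency order
alphabetical = sorted(freq.items())         # list sorted in alphabetical order
```

Raw counts like these only become comparable across texts of different lengths once they are divided by `word_count`, which is why the precise word count matters.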
This function involves a series of options, from the crude to the highly sophisticated, all of which are potentially very useful for interlanguage studies. The crudest function of all, counting the number of words in a text, is essential if one is to compare the frequency of linguistic items in various texts. To effect this type of comparison, researchers working on the basis of non-electronic texts have no other option but to count the average number of words per page and multiply the resulting figure by the number of pages in the text to obtain a rough estimate. If the data is computerised, the researcher can obtain the precise figure using the word count option on his/her word processor. More sophisticated options, provided by text handling packages, such as WordSmith Tools (Scott 1996), provide researchers with word frequency lists sorted in alphabetical or frequency order, type/token ratios and a series of other statistical measures (number of paragraphs, average number of words per sentence, etc.). Frequency lists of two or more word combinations are of great value to the growing number of SLA/FLT researchers interested in phraseological/routine aspects of interlanguage. In addition, all annotations inserted in the corpus (e.g., errors, grammatical categories, lemmas) can be counted and the frequencies compared across individual learners or learner populations.
2.3.2 SORT
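The two facilities discussed in this subsection, a sorted concordance and a collocate display, can be approximated in a short Python sketch. The function names, the window sizes and the sample text are assumptions chosen for illustration; they do not reproduce the behaviour of any particular concordancing program.

```python
from collections import Counter

def concordance(tokens, node, width=3):
    """Return (left context, node, right context) for each occurrence of node."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            lines.append((left, tok, right))
    return lines

def collocates(tokens, node, window=2):
    """Count every word occurring within `window` tokens on either side of node."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            counts.update(tokens[max(0, i - window):i])
            counts.update(tokens[i + 1:i + 1 + window])
    return counts

# Invented sample, already tokenized, for illustration only.
tokens = ("we make a decision and they make a mistake "
          "but you make progress").split()

# Sort concordance lines alphabetically by the words to the right of the node.
lines = sorted(concordance(tokens, "make"), key=lambda line: line[2])
for left, node, right in lines:
    print(f"{' '.join(left):>20}  {node}  {' '.join(right)}")

make_collocates = collocates(tokens, "make", window=2)
```

Sorting on the right-hand context groups together the objects that follow the node word, which is one way patterns such as verb plus noun combinations become visible.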
One of the simplest but at the same time most rewarding benefits of electronic data is the multitude of possibilities offered in terms of sorting facilities. Concordancing programs give SLA/FLT researchers an unparalleled view of learners’ lexico-grammatical patterning of words (i.e. their use or misuse, or over/underuse, of collocations, colligations and other (semi-)prefabricated phrases). In addition, more sophisticated programs such as WordSmith Tools combine the COUNT and SORT facilities and offer a collocate display, which gives the exact frequency of all words occurring within a particular window on either side of the headword.
2.3.3 COMPARE
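A comparison of two frequency lists of the kind discussed in this subsection might be sketched as follows. The chapter does not say which statistical test WordSmith Tools applies, so the log-likelihood (G2) statistic used here is an assumption, chosen because it is a common choice for this type of comparison. The frequency figures are invented, and the threshold of 3.84 is the chi-square critical value for p < 0.05 with one degree of freedom.

```python
import math
from collections import Counter

def log_likelihood(a, b, size_a, size_b):
    """Log-likelihood (G2) for a word seen a times in corpus A and b times in corpus B."""
    expected_a = size_a * (a + b) / (size_a + size_b)
    expected_b = size_b * (a + b) / (size_a + size_b)
    g2 = 0.0
    if a > 0:
        g2 += a * math.log(a / expected_a)
    if b > 0:
        g2 += b * math.log(b / expected_b)
    return 2 * g2

def compare_lists(freq_a, freq_b, threshold=3.84):
    """Return (word, G2, 'over' or 'under') for words whose G2 reaches the threshold.

    'over' means the word is relatively more frequent in corpus A than in corpus B.
    """
    size_a, size_b = sum(freq_a.values()), sum(freq_b.values())
    results = []
    for word in set(freq_a) | set(freq_b):
        a, b = freq_a[word], freq_b[word]
        g2 = log_likelihood(a, b, size_a, size_b)
        if g2 >= threshold:
            direction = "over" if a / size_a > b / size_b else "under"
            results.append((word, round(g2, 2), direction))
    return sorted(results, key=lambda r: -r[1])

# Invented frequency lists (learner vs. native), for illustration only.
learner = Counter({"very": 40, "thing": 30, "however": 5, "other": 925})
native = Counter({"very": 15, "thing": 10, "however": 25, "other": 950})

keywords = compare_lists(learner, native)
```

With these invented figures, `keywords` lists the words whose relative frequency differs significantly between the two lists, each marked as over- or underused in the learner list.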
Interlanguage is a variety in its own right, which can be studied as such without comparing it to any other variety. However, for many purposes, both theoretical and applied, it is useful to compare it to other language varieties to bring out its specificities. This contrastive approach, which is usually referred to in CLC-based research as Contrastive Interlanguage Analysis, may involve two types of comparison: a comparison of native language and learner language (L1 vs. L2) and a comparison of different varieties of interlanguage (L2 vs. L2). The “compare list” facility in WordSmith Tools makes it possible to automate these comparisons: it compares frequency lists from two corpora and brings out the words or phrases that are significantly over- or underused in either corpus (for illustrations, see section 4).
2.3.4 ANNOTATE
Garside et al. (1997: 2) define corpus annotation as “the practice of adding interpretative, linguistic information to an electronic corpus of spoken and/or written data.” While a raw learner corpus is in itself a highly useful resource, it does not take long for the SLA/FLT researcher to realise that it would be even more useful if it contained an extra layer of information, which could also be counted, sorted and compared. To this end, researchers can either use off-the-shelf annotating tools or develop their own. For obvious reasons, researchers tend to prefer ready-made tools. A number are available, some free of charge (for a survey, see Meunier 1998). However, it is important to bear in mind that all these programs – whether lemmatizers, part-of-speech (POS) taggers, or parsers – have been trained on the basis of native speaker corpora, and there is no guarantee that they will perform as accurately when confronted with learner data. While the success rate of POS-taggers has been found to be quite good with advanced learner data (Meunier 1998: 21), it has proved to be very sensitive to morpho-syntactic and orthographic errors (Van Rooy and Schäfer 2003) and will therefore tend to decrease as the number of these errors increases. Pilot studies aimed at testing the reliability of the annotation, which are recommended whatever the type of corpus used, are therefore a must with learner corpora¹. Similarly, while lemmatizers are potentially very useful for lexical analyses of interlanguage, researchers have to be aware that only the standard realisations of the lemma will be retrieved (i.e. for the lemma LOSE, the standard forms lose/loses/losing/lost, but not the (sometimes equally frequent!) non-standard forms loose/looses/loosing/loosed). If proved reliable, a POS-tagged learner corpus is a very powerful resource, allowing for detailed studies of the use of grammatical categories, such as prepositions, phrasal verbs, modals, passives, etc.
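The lemmatization caveat can be made concrete with a small Python sketch. Both lookup tables are hypothetical: the first stands in for a ready-made lemmatizer that knows only standard forms, the second for a list of non-standard spellings that a researcher might add by hand. No real lemmatizer is called, and the sample sentence is invented.

```python
from collections import Counter

# Hypothetical lemma table covering only standard forms, as a ready-made
# lemmatizer trained on native data might.
STANDARD_FORMS = {"lose": "LOSE", "loses": "LOSE", "losing": "LOSE", "lost": "LOSE"}

# Hypothetical researcher-added table of non-standard learner spellings.
# Caveat: "loose" is also a legitimate English word, so a blanket mapping like
# this one would need checking in context before any counts were reported.
LEARNER_FORMS = {"loose": "LOSE", "looses": "LOSE", "loosing": "LOSE", "loosed": "LOSE"}

def lemma_counts(tokens, table):
    """Count lemmas for the tokens found in the table; other tokens are skipped."""
    return Counter(table[t] for t in tokens if t in table)

# Invented learner sample, for illustration only.
tokens = "they loose time and he lost money because she looses hope".split()

standard_only = lemma_counts(tokens, STANDARD_FORMS)
extended = lemma_counts(tokens, {**STANDARD_FORMS, **LEARNER_FORMS})
```

In this sample the standard table retrieves one occurrence of LOSE, whereas the extended table retrieves three, which suggests how far a lemma-based frequency could underestimate learners' use of a verb.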
Note, however, that the search and retrieval possibilities depend on the granularity of the tagset, which is extremely variable (from 50 up to 250 tags). POS-taggers and lemmatizers have undeniable advantages, not least of which is the fact that they are fully automatic, but there are other types of annotation that SLA/FLT researchers may want to add to the text for which no ready-made program exists. This type of tagging, which de Haan (1984) calls “problem-oriented tagging,” can be inserted with the help of editing tools to speed up the process. Any type of annotation is potentially useful (discourse annotation, semantic annotation, refined syntactic annotation, etc.), but one type, error annotation, is particularly relevant for interlanguage studies and is enjoying
growing popularity among CLC researchers. While I would not go as far as Wible et al. (2001: 311) who consider that unannotated learner corpora are “in themselves (...) worth little to teachers and researchers,” I fully agree that error annotation is a major added value, especially if the corpus is compiled for FLT purposes. Several systems of annotation have been developed (Milton and Chowdhury 1994; Dagneaux et al. 1998; Nicholls 2003) and have been exploited in a series of innovative FLT applications.
These three main distinguishing features clearly differentiate computer learner corpora from the language use data types that have traditionally been used in SLA and FLT research. It should be borne in mind, however, that each type of investigation calls for its own data collection methods and, as a result, learner corpora should not be seen as a panacea, but rather as one highly versatile resource which SLA/FLT researchers can usefully add to their battery of data types.
3 Learner Corpus Collection and Analysis
This section aims to assess the current state of CLC research in terms of (1) corpus collection: What learner corpora have been compiled to date? What are their main characteristics? Are there gaps that would need to be filled? And (2) corpus analysis: What types of analysis have been carried out? What methodological approaches have been adopted? I will focus exclusively on English not only for reasons of space but also because this is where a majority of the research has been carried out to date. It should be noted, however, that the CLC movement has recently gained new momentum and CLC projects on languages other than English are mushrooming in all parts of the world. The recent launch of a “multilingual learner corpus” project, which will contain data in several L2s² (Tagnin 2003), is but one significant example of this trend.
3.1 Corpus Collection
Rather than duplicating Pravec’s (2002) excellent survey, which gives a wealth of information (size, availability, learner background information, etc.) on the best-known written learner corpora, I will adopt a more general outlook. By situating current CLC along a series of dimensions, I hope to be able to bring out some of the main characteristics of current CLC and hence to make suggestions for future data collection.
Computer learner corpora fall into two major categories: commercial CLCs, which are initiated by major publishing companies, and academic CLCs, which are compiled in educational settings.³ The two major commercial learner corpora, the Longman Learners’ Corpus and the Cambridge Learner Corpus, are both very big (10 million words for the Longman corpus and 16 million for the Cambridge corpus). The academic corpora, far more numerous, are extremely variable in size (the Hong Kong University of Science and Technology Learner Corpus contains 25 million words while the Montclair Electronic Language Database only contains 100,000 words). In addition to the 8 academic corpora listed by Pravec (2002), a myriad of other corpora have been or are being collected and exploited by individual researchers and/or teachers. The paradox we face is that while there is an abundance of learner corpora, hardly any of them are available for academic research. It is to be hoped that the recently published first version of the International Corpus of Learner English (Granger et al. 2002), comprising 2.5 million words of EFL writing, will be the first of many CLCs to become publicly available.
Current CLC can be classified along two major dimensions relating to characteristics of the learners who have produced the data and characteristics of the tasks they were requested to perform.
3.1.1 Learners
The learners represented in current CLC are overwhelmingly learners of English as a Foreign Language (EFL) rather than as a Second Language (ESL). The line between the two categories is undoubtedly a fine one, but if ESL is broadly defined as taking place “with considerable access to speakers of the language being learned, whereas learning in a foreign language environment does not” (Gass and Selinker 2001: 5), it is quite clear that the latter dominates the current CLC scene. Regarding L1 background, there is a clear difference between commercial corpora, which tend to have multi-L1 coverage, and academic corpora, which tend to cover learners from only one mother tongue background, the ICLE database being a notable exception in this respect. The learners’ proficiency predominantly falls in the intermediate-advanced range. This somewhat vague description reflects the well-known fact that “one researcher’s advanced category may correspond to another’s intermediate category” (Gass and Selinker ibid: 37). The fuzziness is compounded by the fact that compilers, following established corpus design practices (see Atkins et al. 1992: 5), have tended to use external criteria to compile their corpus. As regards proficiency, this comes down to favouring the criterion of “institutional status” (for instance, third year English undergraduates) over other criteria such as impressionistic judgements, research-designed tests or standardised tests (Thomas 1994).
3.1.2 Task
As regards medium, the number of written learner corpora by far exceeds the number of spoken learner corpora. Far from being restricted to learner corpora, the difficulty of collecting and transcribing spoken data also affects native corpus building, as evidenced by the limited proportion of speech in recent mega-corpora of English (the BNC has 10% spoken vs. 90% written data). However, in the case of spoken learner language, the difficulty is multiplied by a factor of 10 and the time involved in collecting and transcribing data is so prohibitive that
collaborative projects such as the LINDSEI⁴ project would seem to be the only realistic course to take.
As regards the field of discourse, the language covered by learner corpora is predominantly English for General Purposes (EGP) rather than English for Specific Purposes (ESP). For writing, English for Academic Purposes (EAP), which can be seen as situated between EGP and ESP, gets the lion’s share because of its importance in the EFL context.
Another dimension along which CLC can be classified is the longitudinal vs. cross-sectional dimension. The overwhelming majority of CLC covering more than one type of interlanguage data are cross-sectional (i.e. they contain data gathered from different categories of learners at a single point in time). Genuine longitudinal corpora, where data from the same learners are collected over time, are very few and far between. For this reason, researchers interested in interlanguage development tend to collect quasi-longitudinal corpora (i.e. corpora gathered at a single point in time but from learners of different proficiency levels). Though easier to collect than “real” longitudinal corpora, this type of corpus is nevertheless still relatively infrequent.
Learner corpora also differ in their degree of processing. While most current learner corpora consist of raw data (i.e. they contain the learner texts with no added annotation), there are several projects based on POS-tagged corpora. At the same time, the number of error-tagged learner corpora is clearly on the increase.
This very brief overview shows that the language data contained in current CLC falls short of covering the wide diversity that characterises learner language. A lot of work remains to be done, not only to compile CLC representing hitherto neglected data types, but also to make the numerous CLC that have been compiled – either commercially or academically – available to the scientific community. One new promising development gives cause for optimism.
Synchronous corpus building projects, in which corpora are collected online while the students carry out a pedagogical task (see section 5 below), solve many of the difficulties that beset standard asynchronous CLC building and will hopefully contribute to faster corpus building and dissemination.
3.2 Corpus Analysis
For a field that is little over ten years old, CLC has already generated a very rich and diversified body of research. The learner corpus bibliography stored on the Louvain website⁵ contains over 150 publications and is a good starting point for any researcher wishing to embark on learner corpus analysis. In this section, I will restrict myself to highlighting some of the areas in which research has been particularly active, distinguishing between the following three broad categories: methodological and analytic framework, contrastive interlanguage analysis (CIA) and computer-aided error analysis (CEA).
3.2.1 Methodological and Analytical Framework
Like any new discipline, computer learner corpus research has had to avail itself of a sound framework of analysis. To this end, it has been able to rely to some extent on the methodological and analytic apparatus developed in the field of corpus linguistics (CL). There are however special considerations with learner corpora, given the type of language data involved, and the reasons for collecting them differ from other corpus endeavours, specifically because of their relevance to language learning theory and practice. The CL apparatus has therefore had to be tailored for the specific needs of CLC research and several publications have contributed to this. Leech (1998) and Granger (1998, 2002) contain wide-ranging discussions of particular methodological and analytical considerations relating to CLC, including methods of analysis such as CIA and CEA. Meunier (1998) deals more specifically with the software tools that can be used in CLC research, Van Rooy and Schäfer (2003) look into the reliability of POS-tagging of CLC data and de Mönnink (2000) examines the feasibility of parsing CLC. Other descriptions of the CIA methodology can be found in Granger (1996) and Gilquin (2001), while the principles of CEA are presented in Milton and Chowdhury (1994), Dagneaux et al. (1998), de Haan (2000) and Nicholls (2003). In addition, highly valuable methodological guidelines and warnings are contained in the many CLC case studies that have appeared to date.
3.2.2 CIA studies
The bulk of CLC research so far has been of the CIA type. There has been a wide range of topics, but some fields have received a great deal of attention, in particular high frequency vocabulary (Ringbom 1998, 1999; Källkvist 1999; Altenberg 2002), modals (Aijmer 2002; McEnery and Kifle 2002; Neff et al. in press), connectors (Milton and Tsang 1993; Field 1993; Granger and Tyson 1996; Altenberg and Tapper 1998; L. Flowerdew 1998b), collocations and prefabs (Chi Man-Lai et al. 1994; De Cock 1998, 2000; De Cock et al. 1998; Howarth 1996; Granger 1998; Nesselhauf 2003). Most of the CIA studies are based on unannotated learner corpora. A few, however, make use of POS-tagged corpora and compare the frequency of grammatical categories or sequences of grammatical categories in native and learner corpora (Aarts and Granger 1998; Granger and Rayson 1998; de Haan 1999; Tono 2000). All these studies bring out the words, phrases, grammatical items or syntactic structures that are either over- or underused by learners and therefore contribute to the foreign-soundingness of advanced interlanguage even in the absence of downright errors.
It is important to understand at this point that this CIA approach would draw fire from some SLA theorists for its failure to study interlanguage (IL) in its own right but rather as an incomplete version of the target language (TL). This practice, which Bley-Vroman (1983) refers to as the “comparative fallacy,” is discussed as follows by Larsen-Freeman and Long (1991: 66): “researchers should not adopt a normative TL perspective, but rather seek to discover how an IL structure which appears to be non-standard is being used meaningfully by a learner.” In her recent excellent book on Corpora in Applied Linguistics, Hunston (2002: 211-2) expresses a similar view when she writes that one of the drawbacks of the CIA approach is that “it assumes that learners have native speaker norms as a target.” However, she adds that the CLC approach also has two advantages: first, the standard is clearly identified and if felt to be inappropriate can be changed and replaced by another standard; and second, the standard is realistic: it is “what native/expert speakers actually do rather than what reference books say they do.” In addition, it is important to bear in mind that most CLC research so far has involved advanced EFL learners (i.e. learners who are getting close to the end point of the interlanguage continuum and who are keen to get even closer to the NS norm). For this category of learners more than any other, it makes sense to try and identify the areas in which learners still differ from native speakers and which therefore necessitate further teaching.
3.2.3 CEA studies
CEA has led to a much more limited number of publications than CIA. Apart from articles describing error tagging systems (see above), there are a few articles focusing on certain specific error categories (lexical errors: Chi Man-lai et al. 1994; Källkvist 1995; Lenko-Szymanska 2003; tense errors: Granger 1999). In view of the investment of time necessary to error tag corpora and analyse the results, it is not surprising that CEA studies should to some extent be lagging behind. However, it should be borne in mind that in CLC research, errors are not isolated from the texts in which they originated, as was the case in traditional EA studies, but rather are studied in context alongside cases of correct use and over- and underuse. Discussions of errors can therefore be found in a large number of CLC case studies.
This brief overview gives a glimpse of the buzz of activity in the CLC field, but at the same time it leaves a certain impression of patchiness. This may well be due to the corpus linguistic bottom-up approach which, as stated by Swales (2002: 152) “involves working from small-stretch surface forms and then trying to fit them into some larger contextual frame,” a method which produces a “huge amount of trial-and-error.” It is important to bear in mind, however, that what can be presented as a down side of the corpus linguistic approach is also its major strength: it is the required passage to gain new insights into language. This being said, one must acknowledge that the wider perspective is often difficult to discern from current CLC studies. In the coming sections, I will therefore try to highlight the wider SLA (section 4) and FLT (section 5) implications of CLC research.
4 Computer Learner Corpora and SLA
To what extent can CLC contribute to SLA research? Second Language Acquisition is the study of how second languages are learned. It involves questions such as “Are the rules like those of the native language? Are they like the rules of the language being learned? Are there patterns that are common to all learners regardless of the native language and regardless of the language being learned? Do the rules created by second language learners vary according to the context of use?” (Gass and Selinker 2001: 1). CLC data can contribute to answering these questions. The use of bilingual corpora in addition to learner corpora can help answer the first question. Researchers can only say for sure if the learner’s rules “are like those of the native language” if they have detailed descriptions of the learner’s native language compared with the target language. This integrated contrastive perspective, which combines classic CA (Contrastive Analysis) and CIA, is a very reliable empirical platform from which to conduct interlanguage research (for illustrations of the method, see Gilquin 2001; Altenberg 2002). The following questions involve the two types of comparison that are at the heart of the CIA methodology: comparisons of native and learner data and comparisons of different interlanguages to each other. As to the last question, recourse to strictly controlled learner corpora is a good way of identifying the impact of different “contexts of use.” In fact, richly documented corpora such as the ICLE allow researchers to carry out cross-sectional research without having to cope with the major disadvantage that is usually presented as being part and parcel of this type of study: “The disadvantage [of cross-sectional studies] is that, at least in the second language acquisition literature, there is often no detailed information about the learners themselves and the linguistic environment in which production was elicited” (Gass and Selinker 2001: 33). 
On the whole, the contribution of CLC research to SLA so far has been much more substantial in description than interpretation of SLA data. In my view, there are two main reasons for this. First, as rightly pointed out by Hasselgård (1999), learner corpus research has mainly been conducted by corpus linguists rather than SLA specialists: “A question that remains unanswered is whether corpus linguistics and SLA have really met in learner corpus research. While learner language corpus research does not seem to be very controversial in relation to traditional corpus linguistics, some potential conflicts are not resolved, nor commented on by anyone from ‘the other side’.” It is undeniable that the term “learner corpus” – or “corpus” for that matter – is rarely found in SLA books and articles. However, there are signs that this is beginning to change. Two recent studies (Housen 2002; Wible and Ping-Yu Huang 2003) show the advantage of using CLC to test SLA hypotheses, in this case the Aspect Hypothesis. In particular, Housen (2002: 78) remarks that “computer-aided language learner corpus research provides a much needed quantificational basis” for current SLA hypotheses and makes it possible to “empirically validate previous research findings obtained from smaller transcripts, as well as to test explanatory hypotheses about pace-setting factors in second language acquisition” (ibid: 108).
The second reason for the emphasis on description has perhaps been that the type of interlanguage CLC researchers have been most interested in (i.e. the interlanguage of intermediate to advanced EFL learners) was so poorly described in the literature that they felt the need to establish the facts first before launching into theoretical generalisations. According to McLaughlin (1987: 80), this focus on description is typical of the interlanguage paradigm: “The emphasis in Interlanguage theory on description stems from a conviction that it is important to know well what one is describing before attempting to move into the explanatory realm. There is a sense that as descriptions of learners’ interlanguages accumulate, answers will emerge to the larger questions about second-language acquisition.” Even at this early stage, a much more accurate picture of advanced EFL interlanguage is beginning to emerge. This appears clearly from an excellent recent study by Cobb (2003), who replicated three European CLC studies with Canadian data and found a high degree of similarity. The three studies highlighted the following characteristics of advanced interlanguage: overuse of high frequency vocabulary (Ringbom 1998), high frequency of use of a limited number of prefabs (De Cock et al. 1998) and a much higher degree of involvement (Petch-Tyson 1998). Several other studies point to the stylistic deficiency of advanced learner writing, which is often characterised by an overly spoken style or a somewhat puzzling mixture of formal and informal markers. All in all, CLC studies suggest that “advanced learners are not defective native speakers cleaning up a smattering of random errors, but rather learners working through identifiable acquisition sequences. The sequences are not the -ing endings and third person -s we are familiar with, but involve more the areas of lexical expansion, genre diversification, and others yet to be identified” (Cobb 2003: 419).
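The notion of overuse that recurs in these studies can be made concrete as a comparison of normalised frequencies in a learner corpus and a native reference corpus. The Python sketch below is purely illustrative and is not the procedure used in any of the studies cited: the function names, the simple regular-expression tokeniser, the per-1,000-token normalisation and the toy texts are all assumptions introduced here. It handles single-word items only; multi-word connectors would require n-gram matching, and real studies would normally add a significance test.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def per_thousand(count, total):
    """Normalise a raw count to a frequency per 1,000 tokens."""
    return 1000.0 * count / total if total else 0.0


def overuse_ratios(learner_text, reference_text, items):
    """Return {item: learner frequency / reference frequency}.

    A ratio above 1 suggests overuse by the learners, below 1 underuse.
    The ratio is None when the item is absent from the reference text,
    because no meaningful ratio can be computed in that case.
    """
    learner_tokens = tokenize(learner_text)
    reference_tokens = tokenize(reference_text)
    learner_counts = Counter(learner_tokens)
    reference_counts = Counter(reference_tokens)

    ratios = {}
    for item in items:
        learner_freq = per_thousand(learner_counts[item], len(learner_tokens))
        reference_freq = per_thousand(reference_counts[item], len(reference_tokens))
        ratios[item] = learner_freq / reference_freq if reference_freq else None
    return ratios


if __name__ == "__main__":
    # Invented toy texts, standing in for a learner and a native corpus.
    learner_sample = "But it is true. But indeed people think so. But why?"
    native_sample = "It is true, but many people doubt it. Why would they not?"
    print(overuse_ratios(learner_sample, native_sample, ["but", "indeed"]))
```

Normalising before comparing matters because learner and reference corpora rarely have the same size; raw counts alone would simply reflect the larger corpus.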
Advanced interlanguage is the result of a very complex interplay of factors: developmental, teaching-induced and transfer-related, some shared by several learner populations, others more specific. An ongoing study of linkwords (Granger 2003b) in five different subcorpora of the ICLE (French, Dutch, Spanish, Italian and German learners) provides convincing evidence of this interplay of factors. Some features, like the overuse of the coordinator but or the tendency to favour initial position for adverbial connectors, are probably partly developmental: they give evidence of a more simplified linking system. On the other hand, there are quite a few transfer-related uses. French learners’ overuse of indeed is not shared by the other learner groups. It is clearly due to a faulty one-to-one equivalence between indeed and en effet, a tendency which is reinforced by teaching and reference books.⁶ Some other phenomena, like the overuse of nevertheless or on the one hand ... on the other hand, are clearly teaching-induced. They are the direct consequence of the long lists of connectors found in most ELT textbooks, which classify connectors in broad semantic categories (contrast, addition, result, etc.) but fail to provide guidelines on their precise semantic, syntactic and stylistic properties, thereby giving learners the erroneous impression that they are interchangeable. When combined, these factors can reinforce each other. For instance, the overuse of on the contrary, which was attested in all five subcorpora of the ICLE and is probably teaching-induced, was found to be much more marked in the case of French- and Italian-speaking learners, due to the presence in the learners’ mother tongue of a formally equivalent connector (au contraire and al contrario). Likewise, there is evidence that the tendency to place connectors in initial position may be reinforced by teaching (J. Flowerdew 2001: 81).

5 Computer Learner Corpora and FLT
The usefulness of computer corpora for FLT is now widely acknowledged and many would agree with Aston (1995: 261) that “corpora constitute resources which, placed in the hands of teachers and learners who are aware of their potential and limits, can significantly enrich the pedagogic environment”. The main fields of application of corpus data are materials design, syllabus design and classroom methodology.⁷ In all three fields, there is very active work in progress, but, with the exception of ELT dictionaries, the number of concrete corpus-informed achievements is not proportional to the number of publications advocating the use of corpora to inform pedagogical practice. According to L. Flowerdew (1998a: 550), this is due to the fact that in most corpus studies “the implications for pedagogy are not developed in any great detail with the consequence that the findings have had little influence on ESP syllabus and materials design.” As to classroom use of corpus data, although learners could undoubtedly benefit from exploring language to discover for themselves the underlying grammatical rules and/or typical patterns of use, teachers seem reluctant to introduce this type of “discovery learning” in their everyday teaching practices (see Mukherjee 2003). As learner corpora have developed much later than native corpora, one could expect CLC-informed pedagogical materials to be even more limited, and yet activity in this field seems to be just as buoyant as in the native corpus field, already resulting in the production of new CLC-informed tools which address learners’ attested difficulties. As space is limited, I will restrict myself here to the description of two categories of CLC-informed ELT tools: learners’ dictionaries and CALL (Computer-Assisted Language Learning) programs (for a more detailed survey of practical applications of learner corpora, see Granger forthcoming).

5.1 CLC-informed reference tools
Only a few years after the production of the first CLC-informed dictionary, the Longman Essential Activator (1997), learner corpus data have made their entry into general advanced learners’ dictionaries. The latest editions of the Longman Dictionary of Contemporary English (LDOCE) (2003) and the Cambridge Advanced Learner’s Dictionary (CALD) (2003) both contain language notes based on their respective learner corpora, notes intended to help learners to avoid making common mistakes. The language notes in LDOCE are based on careful analysis of a raw (i.e. unannotated) corpus, while CALD has made use of an extensive error-tagged corpus (for a description of the error tagging system, see Nicholls 2003). The language notes are a clear added value for dictionary users as they draw their attention to very frequent errors, which in the case of advanced learners have often become fossilised (accept + infinitive, persons instead of people, news + plural, etc.). Most notes are useful, but space is regrettably limited in paper versions of dictionaries and selecting the most useful information is a challenging task. There is no doubt, however, that in subsequent electronic versions of the dictionaries, where space is no longer so much of an issue, it will be possible to include much more information derived from CLC analysis in the form of notes and, crucially, to provide much more L1-specific information. Such information is currently sorely lacking, yet it is very important to learners who, even at an advanced stage of proficiency, still have considerable difficulty with transfer-related interlanguage errors.

5.2 CLC-informed CALL programs
The pioneer of CLC-informed CALL programs is Milton (1998), who developed a writing kit called WordPilot. This program combines remedial exercises targeting Hong Kong learners’ attested difficulties and a writing aid tool which helps learners to select appropriate wording by accessing native corpora of specific text types. Cowan et al.’s (2003) ESL Tutor program is an error correction courseware tool that contains units targeting persistent grammatical errors produced by Korean ESL students. The program is L1-specific, addressing errors that are clearly transfer-related. Wible et al.’s (2001) web-based writing environment is different from the other two as learner corpus building and analysis are integrated in normal pedagogical activities. The CALL environment contains a learner interface, where learners write their essays, send them to their teacher over the Internet and revise them when they have been corrected by the teacher, as well as a teacher interface, where teachers correct the essays using their favourite comments (comma splice, article use, etc.) stored in a personal Comment Bank. This environment is extremely attractive both for learners, who get immediate feedback on their writing and have access to lists of errors they are prone to produce, and for teachers, who progressively and painlessly build a large database of learner data from which they can draw to develop targeted exercises.

6 Conclusion
In learner corpus research, as in any corpus endeavour, “a great deal of spadework has to be done before the research results can be harvested” (Leech 1998: xvii). As I hope to have shown in this survey, researchers have spared no pains to build and analyse learner corpora and their efforts have been rewarded as the harvest has already begun. However, it is not yet time to rest on our laurels. We need a wider range of learner corpora (in particular, ESP, speech and longitudinal data) with more elaborate processing (POS-tagging and error-tagging). Results need to be interpreted in the light of current SLA theory and incorporated in syllabus and materials design. Computer learner corpora have the potential of bridging the gap between SLA and ELT, but one must acknowledge that the ELT community has joined the learner corpus “revolution” (Granger 1994) more quickly and enthusiastically than the SLA community. There are signs that this is changing, as SLA specialists begin to recognise the value of CLC data which, by virtue of their size and representativeness, can help them validate their hypotheses and indeed formulate new ones. There are clearly exciting times ahead. Let’s roll up our sleeves and get to work!

Notes

1. For an illustration of such a pilot study to test the reliability of automatic extraction of passives, see Granger 1997.
2. The USP (University of Sao Paulo) Multilingual Learner Corpus will contain German, English and Spanish L2 written data from Brazilian learners.
3. Note, however, that commercial corpora have been used for academic research and academic corpora for commercial purposes.
4. LINDSEI stands for Louvain International Database of Spoken English Interlanguage. Like its sister project, ICLE, it covers data from advanced EFL learners from various mother tongue backgrounds. More information on the project can be found on the following website: http://www.fltr.ucl.ac.be/fltr/germ/etan/cecl/CeclProjects/Lindsei/lindsei.htm.
5. The learner corpus bibliography can be consulted on the following website: http://juppiter.fltr.ucl.ac.be/FLTR/GERM/ETAN/CECL/publications.html. Suggestions for additions to the bibliography can be sent to [email protected].
6. The Robert-Collins English-French dictionary gives en effet as the first translation of indeed.
7. For an excellent overview of the usefulness of corpus data for materials development and classroom use, see Tomlinson (1998), Part A: Data collection and materials development, pp. 25-89.
References

Aarts, J. and S. Granger (1998), Tag sequences in learner corpora: A key to interlanguage grammar and discourse, in S. Granger (ed.), Learner English on computer, London: Addison Wesley Longman, pp. 132-141.
Aarts, J. and W. Meijs (eds) (1984), Corpus linguistics: Recent developments in the use of computer corpora, Amsterdam: Rodopi.
Aijmer, K. (2002), Modality in advanced Swedish learners’ written interlanguage, in S. Granger, J. Hung, and S. Petch-Tyson (eds), Computer learner corpora, second language acquisition and foreign language teaching, Amsterdam: John Benjamins, pp. 55-76.
Aijmer, K., B. Altenberg, and M. Johansson (eds) (1996), Languages in contrast: Papers from a symposium on text-based cross-linguistic studies in Lund, 4-5 March 1994, Lund, Sweden: Lund University Press.
Altenberg, B. (2002), Using bilingual corpus evidence in learner corpus research, in S. Granger, J. Hung, and S. Petch-Tyson (eds), Computer learner corpora, second language acquisition and foreign language teaching, Amsterdam: John Benjamins, pp. 37-54.
Altenberg, B. and M. Tapper (1998), The use of adverbial connectors in advanced Swedish learners’ written English, in S. Granger (ed.), Learner English on computer, London: Addison Wesley Longman, pp. 80-93.
Archer, D., P. Rayson, A. Wilson, and T. McEnery (eds) (2003), Proceedings of the Corpus Linguistics 2003 Conference, Technical Papers 16, Lancaster University: University Centre for Computer Corpus Research on Language.
Aston, G. (1995), Corpus evidence for norms of lexical collocation, in G. Cook and B. Seidlhofer (eds), Principle and practice in applied linguistics: Studies in honour of H.G. Widdowson, Oxford: Oxford University Press, pp. 257-270.
Atkins, S., J. Clear, and N. Ostler (1992), Corpus design criteria, Literary and Linguistic Computing, 7: 1-16.
Biber, D. (1993), Representativeness in corpus design, Literary and Linguistic Computing, 8 (4): 243-257.
Bley-Vroman, R. (1983), The comparative fallacy in interlanguage studies: The case of systematicity, Language Learning, 33: 1-17.
Cambridge Advanced Learner’s Dictionary (2003), Cambridge: Cambridge University Press.
Chi Man-Lai, A., K. Wong Pui-Yiu, and M. Wong Chau-ping (1994), Collocational problems amongst ESL learners: A corpus-based study, in L. Flowerdew and A.K.K. Tong (eds), Entering text, Hong Kong: Language Centre, Hong Kong University of Science and Technology, and Department of English, Guangzhou Institute of Foreign Languages, pp. 157-165.
Cobb, T. (2003), Analyzing late interlanguage with learner corpora: Québec replications of three European studies, The Canadian Modern Language Review/La Revue canadienne des langues vivantes, 59 (3): 393-423.
Cook, G. and B. Seidlhofer (eds) (1995), Principle and practice in applied linguistics: Studies in honour of H.G. Widdowson, Oxford: Oxford University Press.
Cowan, R., H.E. Choi, and D.H. Kim (2003), Four questions for error diagnosis and correction in CALL, CALICO Journal, 20 (3): 451-463.
Dagneaux, E., S. Denness, and S. Granger (1998), Computer-aided error analysis, System: An International Journal of Educational Technology and Applied Linguistics, 26: 163-174.
De Cock, S. (1998), A recurrent word combination approach to the study of formulae in the speech of native and non-native speakers of English, International Journal of Corpus Linguistics, 3: 59-80.
De Cock, S. (2000), Repetitive phrasal chunkiness and advanced EFL speech and writing, in C. Mair and M. Hundt (eds), Corpus linguistics and linguistic theory, Amsterdam: Rodopi, pp. 51-68.
De Cock, S., S. Granger, G. Leech, and T. McEnery (1998), An automated approach to the phrasicon of EFL learners, in S. Granger (ed.), Learner English on computer, London: Addison Wesley Longman, pp. 67-79.
Ellis, R. (1994), The study of second language acquisition, Oxford: Oxford University Press.
Field, Y. (1993), Piling on the additives: The Hong Kong connection, in R. Pemberton and E. Tsang (eds), Studies in lexis, Hong Kong: Hong Kong University of Science and Technology, pp. 247-267.
Flowerdew, J. (2001), Concordancing as a tool in course design, in M. Ghadessy, A. Henry, and R.L. Roseberry (eds), Small corpus studies and ELT, Amsterdam: John Benjamins, pp. 71-92.
Flowerdew, J. (ed.) (2002), Academic discourse, London: Longman.
Flowerdew, L. (1998a), Corpus-linguistic techniques applied to textlinguistics, System, 26: 541-552.
Flowerdew, L. (1998b), Integrating ‘expert’ and ‘interlanguage’ computer corpora findings on causality: Discoveries for teachers and students, English for Specific Purposes, 17: 329-345.
Flowerdew, L. and A.K.K. Tong (eds) (1994), Entering text, Hong Kong: Language Centre, Hong Kong University of Science and Technology, and Department of English, Guangzhou Institute of Foreign Languages.
Garside, R., G. Leech, and A. McEnery (eds) (1997), Corpus annotation: Linguistic information from computer text corpora, London: Longman.
Gass, S.M. and L. Selinker (2001), Second language acquisition: An introductory course, Mahwah, NJ: Lawrence Erlbaum.
Ghadessy, M., A. Henry, and R.L. Roseberry (eds) (2001), Small corpus studies and ELT: Theory and practice, Studies in Corpus Linguistics 5, Amsterdam: John Benjamins.
Gilquin, G. (2001), The integrated contrastive model: Spicing up your data, Languages in Contrast, 3 (1): 95-123.
Granger, S. (1994), The learner corpus: A revolution in applied linguistics, English Today, 39 (10/3): 25-29.
Granger, S. (1996), From CA to CIA and back: An integrated approach to computerized bilingual and learner corpora, in K. Aijmer, B. Altenberg, and M. Johansson (eds), Languages in contrast, Lund, Sweden: Lund University Press, pp. 37-51.
Granger, S. (1998a), Prefabricated patterns in advanced EFL writing: Collocations and formulae, in A.P. Cowie (ed.), Phraseology: Theory, analysis and applications, Oxford: Oxford University Press, pp. 145-160.
Granger, S. (1998b), The computer learner corpus: A versatile new source of data for SLA research, in S. Granger (ed.), Learner English on computer, London: Addison Wesley Longman, pp. 3-18.
Granger, S. (ed.) (1998), Learner English on computer, London: Addison Wesley Longman.
Granger, S. (1999), Use of tenses by advanced EFL learners: Evidence from an error-tagged computer corpus, in H. Hasselgård and S. Oksefjell (eds), Out of corpora, Amsterdam: Rodopi, pp. 191-202.
Granger, S. (2002), A bird’s-eye view of learner corpus research, in S. Granger, J. Hung, and S. Petch-Tyson (eds), Computer learner corpora, second language acquisition and foreign language teaching, Amsterdam: John Benjamins, pp. 3-33.
Granger, S. (2003a), The International Corpus of Learner English: A new resource for foreign language learning and teaching and second language acquisition research, to appear in TESOL Quarterly, special issue on corpus linguistics (Autumn 2003).
Granger, S. (2003b), A multi-contrastive approach to the use of linkwords by advanced learners of English: Evidence from the International Corpus of Learner English, Paper presented at the ‘Pragmatic markers in contrast’ workshop organized by the Koninklijke Vlaamse Academie van België voor Wetenschappen en Kunsten, Brussels, 22-23 May 2003.
Granger, S. (forthcoming), Practical applications of learner corpora, in B. Lewandowska-Tomaszczyk (ed.), Language, corpora, e-learning, Frankfurt: Peter Lang.
Granger, S., E. Dagneaux, and F. Meunier (2002), The International Corpus of Learner English: Handbook and CD-ROM, Louvain-la-Neuve: Presses Universitaires de Louvain. Available from http://www.i6doc.com
Granger, S., J. Hung, and S. Petch-Tyson (eds) (2002), Computer learner corpora, second language acquisition and foreign language teaching, Language Learning and Language Teaching 6, Amsterdam: John Benjamins.
Granger, S. and S. Petch-Tyson (eds) (in press), Extending the scope of corpus-based research: New applications, new challenges, Amsterdam: Rodopi.
Granger, S. and P. Rayson (1998), Automatic profiling of learner texts, in S. Granger (ed.), Learner English on computer, London: Addison Wesley Longman, pp. 119-131.
Granger, S. and S. Tyson (1996), Connector usage in the English essay writing of native and non-native EFL speakers of English, World Englishes, 15: 19-29.
de Haan, P. (1984), Problem-oriented tagging of English corpus data, in J. Aarts and W. Meijs (eds), Corpus linguistics: Recent developments in the use of computer corpora, Amsterdam: Rodopi, pp. 123-139.
de Haan, P. (1999), English writing by Dutch-speaking students, in H. Hasselgård and S. Oksefjell (eds), Out of corpora, Amsterdam: Rodopi, pp. 203-212.
de Haan, P. (2000), Tagging non-native English with the TOSCA-ICLE tagger, in C. Mair and M. Hundt (eds), Corpus linguistics and linguistic theory, Amsterdam: Rodopi, pp. 69-79.
Harmer, J. (2001), The practice of English language teaching, Harlow, UK: Longman.
Hasselgård, H. (1999), Review of Granger (ed.), Learner English on computer, ICAME Journal, 23: 148-152.
Hasselgård, H. and S. Oksefjell (eds) (1999), Out of corpora, Amsterdam: Rodopi.
Housen, A. (2002), A corpus-based study of the L2-acquisition of the English verb system, in S. Granger, J. Hung, and S. Petch-Tyson (eds), Computer learner corpora, second language acquisition and foreign language teaching, Amsterdam: John Benjamins, pp. 77-116.
Howarth, P. (1996), Phraseology in English academic writing: Some implications for language learning and dictionary making, Tübingen, Germany: Max Niemeyer Verlag.
Hunston, S. (2002), Corpora in applied linguistics, Cambridge: Cambridge University Press.
Källkvist, M. (1995), Lexical errors among verbs: A pilot study of the vocabulary of advanced Swedish learners of English, Working papers in English and Applied Linguistics, 2, Research Centre for English and Applied Linguistics, University of Cambridge: 103-115.
Källkvist, M. (1999), Form-class and task-type effects in learner English: A study of advanced Swedish learners, Lund Studies in English 95, Lund, Sweden: Lund University Press.
Larsen-Freeman, D. and M.H. Long (1991), An introduction to second language acquisition research, London: Longman.
Leech, G. (1992), Corpora and theories of linguistic performance, in J. Svartvik (ed.), Directions in corpus linguistics, Berlin: Mouton de Gruyter, pp. 105-22.
Leech, G. (1998), Learner corpora: What they are and what can be done with them, in S. Granger (ed.), Learner English on computer, London: Addison Wesley Longman, pp. xiv-xx.
Lenko-Szymanska, A. (2003), Lexical problems in the advanced learner corpus of written data, Paper presented at PALC 2003 (Practical Applications of Language Corpora), Lodz, Poland, 4-6 April 2003.
Lewandowska-Tomaszczyk, B. and P.J. Melia (eds) (2000), PALC’99: Practical applications in language corpora, Frankfurt am Main: Peter Lang.
Longman Dictionary of Contemporary English (2003), Harlow, UK: Longman.
Longman Essential Activator (1997), Harlow, UK: Longman.
MacWhinney, B. (2000), The CHILDES Project, Volume 1: Tools for analysing talk: Transcription format and programs, Mahwah, NJ: Lawrence Erlbaum.
Mair, C. and M. Hundt (eds) (2000), Corpus linguistics and linguistic theory, Amsterdam: Rodopi.
McEnery, T. and N.A. Kifle (2002), Epistemic modality in argumentative essays of second-language writers, in J. Flowerdew (ed.), Academic discourse, London: Longman, pp. 182-215.
McLaughlin, B. (1987), Theories of second-language learning, London: Edward Arnold.
Meunier, F. (1998), Computer tools for the analysis of learner corpora, in S. Granger (ed.), Learner English on computer, London: Addison Wesley Longman, pp. 19-37.
Meunier, F. (2002), The pedagogical value of native and learner corpora in EFL grammar teaching, in S. Granger, J. Hung, and S. Petch-Tyson (eds), Computer learner corpora, second language acquisition and foreign language teaching, Amsterdam: John Benjamins, pp. 119-141.
Milton, J. (1998), Exploiting L1 and interlanguage corpora in the design of an electronic language learning and production environment, in S. Granger (ed.), Learner English on computer, London: Addison Wesley Longman, pp. 186-198.
Milton, J. and N. Chowdhury (1994), Tagging the interlanguage of Chinese learners of English, in L. Flowerdew and A.K.K. Tong (eds), Entering text, Hong Kong: Language Centre, Hong Kong University of Science and Technology, and Department of English, Guangzhou Institute of Foreign Languages, pp. 127-143.
Milton, J. and E. Tsang (1993), A corpus-based study of logical connectors in EFL students’ writing, in R. Pemberton and E. Tsang (eds), Studies in lexis, Hong Kong: Hong Kong University of Science and Technology, pp. 215-246.
de Mönnink, I. (2000), Parsing a learner corpus, in C. Mair and M. Hundt (eds), Corpus linguistics and linguistic theory, Amsterdam: Rodopi, pp. 81-90.
Mukherjee, J. (2003), Bridging the gap between applied corpus linguistics and the reality of English language teaching in Germany, in this volume.
Neff, J., E. Dafouz, H. Herrera, F. Martinez, J. Rica, M. Diez, R. Prieto, and C. Sancho (in press), Contrasting learner corpora: The use of modal and reporting verbs in expression of writer stance, in S. Granger and S. Petch-Tyson (eds), Extending the scope of corpus-based research: New applications, new challenges.
Nesselhauf, N. (2003), The use of collocations by advanced learners of English and some implications for teaching, Applied Linguistics, 24: 223-242.
Nicholls, D. (2003), The Cambridge Learner Corpus: Error coding and analysis for lexicography and ELT, in D. Archer, P. Rayson, A. Wilson, and T. McEnery (eds), Proceedings of the Corpus Linguistics 2003 Conference (CL 2003), Lancaster University: University Centre for Computer Corpus Research on Language, pp. 572-581.
Pemberton, R. and E. Tsang (eds) (1993), Studies in lexis, Hong Kong: Hong Kong University of Science and Technology.
Petch-Tyson, S. (1998), Writer/reader visibility in EFL written discourse, in S. Granger (ed.), Learner English on computer, London: Addison Wesley Longman, pp. 107-118.
Pravec, N.A. (2002), Survey of learner corpora, ICAME Journal, 26: 81-114.
Ragan, P.H. (1996), Classroom use of a systemic functional small learner corpus, in M. Ghadessy, A. Henry, and R.L. Roseberry (eds), Small corpus studies and ELT, Amsterdam: John Benjamins, pp. 207-236.
Renouf, A. (ed.) (1999), Explorations in corpus linguistics, Amsterdam: Rodopi.
Ringbom, H. (1998), Vocabulary frequencies in advanced learner English: A cross-linguistic approach, in S. Granger (ed.), Learner English on computer, London: Addison Wesley Longman, pp. 41-52.
Ringbom, H. (1999), High frequency verbs in the ICLE corpus, in A. Renouf (ed.), Explorations in corpus linguistics, Amsterdam: Rodopi, pp. 191-200.
Scott, M. (1996), WordSmith Tools, Oxford: Oxford University Press.
Swales, J. (2002), Integrated and fragmented worlds: EAP materials and corpus linguistics, in J. Flowerdew (ed.), Academic discourse, London: Longman, pp. 150-164.
Tagnin, S. (2003), A multilingual learner corpus in Brazil, Paper presented at the Learner Corpus Workshop organized within the framework of the Corpus Linguistics 2003 Conference (CL 2003), Lancaster, 28-31 March 2003.
Thomas, M. (1994), Assessment of L2 proficiency in second language acquisition research, Language Learning, 44: 307-336.
Tomlinson, B. (ed.) (1998), Materials development in language teaching, Cambridge: Cambridge University Press.
Tono, Y. (2000), A corpus-based analysis of interlanguage development: Analysing part-of-speech sequences of EFL learner corpora, in B. Lewandowska-Tomaszczyk and P.J. Melia (eds), PALC’99: Practical applications in language corpora, Frankfurt am Main: Peter Lang, pp. 323-340.
Van Rooy, B. and L. Schäfer (2003), Automatic POS tagging of a learner corpus: The influence of learner error on tagger accuracy, in D. Archer, P. Rayson, A. Wilson, and T. McEnery (eds), Proceedings of the Corpus Linguistics 2003 Conference (CL 2003), Lancaster University: University Centre for Computer Corpus Research on Language, pp. 835-844.
Wible, D., C-H. Kuo, F-Y. Chien, A. Liu, and N-L. Tsao (2001), A web-based EFL writing environment: Integrating information for learners, teachers, and researchers, Computers and Education, 37: 297-315.
Wible, D. and P-Y. Huang (2003), Using learner corpora to examine L2 acquisition of tense-aspect markings, in D. Archer, P. Rayson, A. Wilson, and T. McEnery (eds), Proceedings of the Corpus Linguistics 2003 Conference (CL 2003), Lancaster University: University Centre for Computer Corpus Research on Language, pp. 889-898.
Concordancing and Corpora for K-12 Teachers: Project MORE

Boyd Davis and Lisa Russell-Pinson
University of North Carolina-Charlotte¹

Abstract

This paper describes the technology-based training initiatives developed for public school teachers in Mecklenburg County, North Carolina. It focuses on the uses of corpora and concordancing in in-services and coursework designed for ESL, sheltered-content ESL and content-area teachers and describes how these groups of teachers have responded to the initial training. The challenges of introducing K-12 teachers to the applications of corpora and concordancing are highlighted and recommendations for overcoming these obstacles are presented.

1 Introduction
Corpus linguists have traditionally emphasized the pedagogical applications of using corpora and computerized software, such as concordancers, for second- and foreign-language teachers (Hunston 2002). In this context, corpora and concordancers are seen as important tools in the creation of learning aids, such as dictionaries (e.g., COBUILD) and reference books (e.g., Biber et al. 1999); in addition, these technologies provide both language instructors and language learners with ways to explore, search and organize linguistic information in vast amounts of authentic material (Hyland 2000). Language classes need not be the only beneficiaries of this approach, however. Given the influx of non-native English speaking students into the public school systems across the U.S. and the shortage of support for ESL classes in many locales, more and more second-language English learners are being placed into mainstream classes. Although many subject-area teachers do not have coursework or training in working with ESL students, they nonetheless must develop strategies to address their students’ diverse needs (Villegas and Lucas 2002), and we find that corpora and concordancing can assist them in doing so.

Project MORE, an initiative supported by the Training All Teachers program of the Office of English Language Acquisition (U.S. Department of Education), has been working with ESL instructors, content-area teachers and public school administrators in Charlotte, North Carolina, to produce mainstream classroom materials based on a corpus of over 600 oral narratives collected from native and non-native speakers in Mecklenburg County, North Carolina, and teacher-training activities keyed to this and other corpora. As a part of our discussion, we provide examples from our corpus of these narratives and the pedagogical materials created to accompany them. Since the materials frequently include concordance-based activities, we also indicate and exemplify the range of perceptions, both positive and negative, voiced by public school teachers about the ease and effectiveness of using web-accessible corpora and concordance-based materials in their classes. We are especially interested in identifying both general and specific challenges to introducing corpora and concordancing to K-12 public school teachers, and close by reviewing some ways in which we have begun to address different kinds of challenges.

2 Background
Project MORE is designed to serve all classroom teachers in Charlotte-Mecklenburg Schools (CMS) who work with English Language Learners (ELLs). The project was developed in response to the rapid growth in the number of ELLs in CMS. CMS is one of the twenty-five largest school systems in the U.S., with 145 schools and 112,458 students in 2002-2003. ELLs comprised 7.1% of the CMS student body in 2002-2003, a 22% increase from the previous academic year (CMS Fast Facts 2002). An example of this growth and the diversity it represents is reflected in the student enrollment at Martin Middle School. Martin Middle School is the third largest of the 29 middle schools, with 1,100 students. In the 2001-2002 school year, there were 9 ELLs enrolled in the school. However, at the beginning of the 2002-2003 academic year, Martin was established as an ESL Site for CMS, and since then its ELL population has increased more than ten-fold. Figure 1 shows the demographic make-up of the student body at Martin for the 2002-2003 academic year. A similar growth in the number of ELLs across the school system has put a strain on many local schools, which are having difficulty attracting, hiring and retaining qualified ESL teachers. As a result, students are being placed into content-area classes sooner than the two years typically recommended in this region, and often without the benefit of adequate ESL instruction. At the same time, licensed content-area teachers in North Carolina are still not required to have coursework or practical experience in responding to the needs of ELLs, despite the fact that the number of ELLs in the state has increased more than 200% in the past decade (U.S. Department of Education 2002) and is projected to continue on this trajectory for at least the next decade. To address this disparity in teacher training, Project MORE:

• creates supplemental materials based on materials from a corpus of oral narratives; these materials are keyed to the North Carolina Standard Course of Study (NCSCOS) for ESL, sheltered-content ESL and content-area teachers to use with their ELLs and with other at-risk students.
• trains prospective and practicing ESL, sheltered-content ESL and content-area teachers to adapt Project MORE materials and develop their own materials suitable to the needs of their ELLs and other at-risk students.
• instructs prospective and practicing ESL, sheltered-content ESL and content-area teachers on how to use computer-based technologies, including corpora and concordancing, and to implement these in their classes.
• works with prospective and practicing teachers and school administrators to increase their cultural competence.
134 ELLs originating from 29 countries and representing over 20 native languages

Spanish speakers and countries: Mexico, United States, El Salvador, Peru, Honduras, Colombia, Dominican Republic, Guatemala, Venezuela, Canada
African countries and languages: Ghana (Ewe and Twi), Liberia (Creole), Somalia (Somali)
French speakers and countries: Congo, Haiti, Guinea
Arabic speakers and countries: Ethiopia, Sudan, Saudi Arabia
Asian countries and languages: China (Chinese), Vietnam (Vietnamese), Japan (Japanese), Korea (Korean), India (Gujarati and Hindi)
Other countries and languages: Yugoslavia (Croatian), Ukraine (Russian), Netherlands (Dutch), Brazil (Portuguese), N Mariana Islands (Chuukese), United States (Hmong), United States (Khmer), Canada (Somali)

Figure 1: Martin Middle School ELL Demographics (2002-2003)

2.1 Corpus-based Pedagogical Materials
At the heart of the materials development and cultural awareness activities in Project MORE is the Charlotte Narrative and Conversation Collection (CNCC), a corpus of over 600 oral narratives and conversations with residents in the greater Mecklenburg County, North Carolina region. The CNCC currently features materials in the following languages:
• English (multiple varieties; native and non-native speakers)
• Spanish (multiple varieties)
• Chinese (multiple varieties)
• Hmong
• Vietnamese
• Korean
• Russian
• Japanese
Collection of other languages will continue throughout the duration of the project. All narratives and conversations are transcribed, and non-English varieties have been translated into English. Transcripts, as well as the accompanying audio or video of the narratives, are web-deliverable (http://education.uncc.edu/more/).
The conversations and narratives in the CNCC cover a wide range of topics, including favorite childhood books and stories, folktales, typical daily activities, childhood memories, travel adventures and historical events, so they easily lend themselves to supporting a number of themes commonly found in language arts and social studies textbooks. At the same time, they promote greater cultural awareness through exposure to the diverse voices in the community.
We use the interviews in the corpus to develop materials for classroom use by creating a set of activities keyed to a narrative. In doing so, ESL, sheltered-content ESL, and content-area teachers can see how a single narrative supports a range of activities across different ability levels and content areas, and fulfills many primary and secondary standards for those levels and content areas, as set forth in the NCSCOS. Then, these activities are used to train teachers to develop their own activities from the CNCC.
The Appendix contains a set of staff-developed activities for language arts and social studies. The activities are keyed to a CNCC interview with Preeyaporn Chareonbutra, who talks about her family in Thailand. The set includes:
• A cloze listening activity keyed to the corpus narrative;
• Gist and detail exercises connected to comprehension of the corpus narrative;
• Map skills related to locations mentioned in the corpus narrative;
• Pre-writing and pre-speaking work with Venn diagrams for expressing ‘same-different’ relationships for the themes in the corpus narrative;
• Research and discussion questions keyed to details of the corpus narrative.
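As a rough illustration of how a cloze handout of the kind listed above might be drafted from a transcript, the following Python sketch blanks out the first occurrence of each chosen word and collects an answer key. It is only a sketch under stated assumptions: the function name make_cloze, the blank width and the numbering style are hypothetical and are not part of the Project MORE materials; the sample sentences are taken from the cloze excerpt in Figure 2 with the keyed answers filled in.

```python
import re


def make_cloze(text, targets, blank="_______"):
    """Blank out the first occurrence of each target word, in text order.

    Hypothetical helper for illustration only; it handles single-word
    targets and returns the cloze text plus a numbered answer key.
    """
    remaining = {t.lower() for t in targets}
    key = []

    def blank_out(match):
        word = match.group(0)
        if word.lower() in remaining:
            remaining.discard(word.lower())
            key.append((len(key) + 1, word))
            return f"({len(key)}) {blank}"
        return word

    # Visit every word once, left to right, so blanks are numbered in reading order.
    cloze = re.sub(r"[A-Za-z']+", blank_out, text)
    return cloze, key


text = (
    "Today we have taken a trip back in time and the year is 1913. "
    "We are in Chicago, Illinois. "
    "The November snows are just beginning to fall on the city."
)
cloze, key = make_cloze(text, ["time", "November"])
print(cloze)
print("Key:", "; ".join(f"({n}) {w}" for n, w in key))
```

A teacher would still choose the target words by hand, for example to focus on vocabulary previewed before listening; the script only saves the retyping.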
Although the subject matter of the narrative is age-appropriate for middle-school students and fits the content emphases for grade 7 social studies, the pedagogical materials developed from the narrative are here accompanied by the instructional objectives for elementary grade levels. This is because ELLs often need to review (or to be introduced to) the skills covered in lower grades due to limited or interrupted schooling (Short 1998). Thus, this set of activities aligns with a number of the NCSCOS objectives for 2nd and 3rd grade social studies and K-6th grade language arts.
2.2 Corpora and Concordancing in Teacher-training
As part of our work, prospective and practicing public school teachers are trained to use online corpora and concordancing tools. Project MORE’s teacher in-services, as well as graduate-level courses taught by project staff, contain instruction on corpora and concordancing and hands-on application of these tools. The Longman Grammar of Spoken and Written English (Biber et al. 1999), the Bank of English site and the CNCC are some of the resources used during teacher-training sessions and in coursework. In addition, we have developed a number of activities for teachers learning about corpora and concordancing for the first time; these activities introduce the notion that corpora are resources not just for teachers and materials designers, but also for learners (Gavioli and Aston 2001; Hyland 2000; Stevens 1995).

MB = Ms. Boal
HS = Hannah Schuenemann

MB: (1) _______ _______ Ms. Boal and I am interviewing Hannah Schuenemann.
HS: (2) _______ _______ Hannah Schuenemann and Ms. Boal is interviewing me.
MB: Today we have taken a trip back in (3)____________ and the year is 1913. We are in Chicago, Illinois. The (4)________________ snows are just beginning to fall on the city. Hannah, can you tell me about your husband, Herman? I heard that he was so loved by the (5) _______________ of Chicago.
***
MB: What happened then?
HS: No one knows for sure. The captain, his boat, and the crew were never seen again. Days later, two (23) _____ found some of the trees.

Key: (1) I am; (2) I’m; (3) time; (4) November; (5) people; (23) fishermen

Figure 2: Cloze activities developed by Barbara Boal

Probably the most convincing demonstrations for public-school teachers are corpus and concordancing materials created by their peers. An example of this is seen in the case study of middle-school ESL teacher Barbara Boal.
Encouraged by her students’ response to the narrative by Chareonbutra referenced above, as well as the set of materials keyed to it, Boal took the techniques as her own and developed not only a fictitious tape-recorded interview as a listening exercise that provided the gist of a story the students were to read, but also a set of cloze activities corresponding to this story. Local teachers attending a Project MORE in-service took notice when we and Boal showed them how our collaborative analyses of student errors on this activity suggested what instructional priorities Boal should adopt, and more importantly from our perspective, how a set of concordance materials could be used by both teachers to determine these priorities. Figure 2 contains two excerpts from Boal’s cloze, followed by recommendations for the next steps to take with students.
While there was not a consistent pattern of errors for Boal’s high-beginner class for many of the items, the students’ responses to item 23 showed a lack of control over word forms. The answer to this item is fishermen, but students wrote answers as varied as fishmen, fishman, fisher, fishining, fisheling, and fish, suggesting problems with how English creates compounds and changes parts of speech. Accordingly, we developed and modified a set of concordance materials to be used by both teacher and students. Using the American English corpus from the Bank of English site, we ran concordances on the words above, selected relevant examples (Figure 3) and modified a few of the examples (see Tribble and Jones 1990; Aston 1997 for a fuller discussion) to make them more comprehensible to her ELLs.

                Eating vegetables and  fish is very healthy
        The menu offers grilled fresh  fish and seafood, steak, and ribs
   Did you read the book by Professor  Fish? I have not.
                              Are you  fishing for your supper? Good luck
                Forrest Gump bought a  fishing boat to catch shrimp
                     Last year, those  fishermen caught a lot of fish!
For National Public Radio, I'm Sophie  Fisher in Geneva.

Figure 3: Examples of a modified concordance

Boal emphasized to her colleagues at the in-service how useful this approach is for teaching word forms. The teachers in attendance were eager to apply similar techniques, both to understand the language difficulties their students have and to strengthen their instruction.
3 Responses from Teachers about Corpora and Concordancing
A number of teachers have given favorable feedback on the corpus and concordance-based teacher-training initiatives developed for the project. On anonymous feedback forms from our in-services, many of the teachers comment on the usefulness of activities for their own teaching, noting that these technologies can help them teach “grammar on a deeper level” and illustrate the differences between “literal and expressive language”. For example, after being introduced to corpora and concordancing in a graduate-level English course, Helene,² a K-3 ESL teacher, wrote:

Concordance [sic] would be useful for an ESL teacher. Showing students the context certain words are used in will be helpful in their understanding of English grammar. As an ESL teacher of younger students, I can use their ‘big books’ to point out sight words that come before and after.

Kelly, a high school English teacher, stated:

Concordancing was good because you were able to see the many different ways a word can be used. It would be good to model for students (as a teacher) to show students the many different ways a word can be used to develop writing and vocabulary. I have never seen a callocation [sic] before.

Others who have participated in our classes and in-services remarked that they now understand how corpora and concordancing can help them design activities for their classes. Among the teacher-suggested ways of applying information gained from these technologies are: students’ demonstrating the meaning of action verbs and their collocates, students’ drawing pictures or cartoons of collocations appropriate to their grade-level, and the teacher’s incorporation of collocations on the classroom “word wall.”
Despite positive responses about the usefulness of corpora and concordancing from these and other public school teachers, there has still been some resistance to these tools. The rest of the paper discusses some of the challenges in introducing corpora and concordancing to K-12 teachers and suggests ways that these difficulties can be addressed.
3.1 Inability to Understand Utility of Corpora
While teachers generally agreed that corpora are good resources for second-language learners, some did not understand how to make the connection between corpora and their curricula. For instance, after reviewing some Project MORE materials during the first year of the grant, Marlene, a middle-school social studies teacher, remarked:

[Corpus-based materials are] great for language arts but absolutely useless for social studies.

To address this concern of content-area teachers, the project sponsors a mini-grant competition for University of North Carolina-Charlotte Arts and Science faculty and staff. Awardees revise course curricula to include interviews and narratives from the CNCC. They develop classroom materials from the corpus and then model how to use the corpus to create activities for content classes. To be eligible for the competition, faculty must propose revising a course in which 50% or more of the students are teacher-licensure candidates. To date, the project has awarded eight mini-grants to faculty teaching courses in American studies, applied linguistics, art, children’s literature, educational research methods, history, Spanish, and writing across the curriculum.
In addition to supporting the mentoring of prospective content-area teachers, we create content-based activities for ELLs from the corpus and use these as models for practicing content-area teachers in CMS. While we focused on language arts for the first year of the grant, we turned our attention to creating activities for social studies during the second year and will focus on math and science in the final year. After drafting materials and ensuring that they align with the NCSCOS, we send them to teachers, who comment on the appropriateness of the activities for their students, test them with their students and suggest ways to adapt them for different proficiency and grade levels. After completing this extensive review process, the teachers have a better idea of how the corpus narratives can be effectively used in their classes and are ready to begin developing their own materials from the corpus with the input of project staff.
3.2 Intimidation by Corpus-technology
We discovered that working with computers induces anxiety among many prospective and practicing teachers. Because the teachers may not have access to technology in the classroom and most have not been trained to use it with students, they are reluctant to try unfamiliar forms of technology, such as corpora and concordancing. In order to put teachers at ease, we have developed two primary strategies: offering instruction in technology to both pre-service and in-service teachers and drawing on the cultural background of most of the teachers in our classes and workshops.
3.2.1 Providing Instruction in Technology
Many of the teachers that we have worked with are intimidated by technology because they lack experience in using computers. To help remedy this situation, we conducted a day-long technology-based in-service, open to all teachers and administrators in CMS, for 1 hour of license-renewal credit. To meet local and national technology standards, we used our corpus as the basis for modeling a number of techniques, including conducting searches on the web and on websites, accessing audio and video on the web, locating appropriate supplemental classroom materials on the web, and participating in an on-line discussion. These activities laid the groundwork for our introduction of brief definitions and examples of corpora and concordances to the participants, after which we asked them to complete an activity based on their brief examination of some words from the Dolch List. In this activity, teachers are asked to access the Dolch List from a link we provide and choose three words from it. Then, they are asked to run concordances of these words at the Bank of English site and note some common collocations of these words. After this, they select one of the collocations and draw a cartoon using the collocation in the caption to show learners the meaning. Theresa, an elementary school reading teacher, investigated the word full (Figure 4).

Figure 4: Teacher-produced illustrations of collocates for the word full

Leslie and Lydia, middle-school ESL teachers, explored collocates of three words: jump, little and sleep; then, based on some of the collocates that they found in the concordances, Leslie drew a cartoon strip (Figure 5) to show how each word could be used.

Figure 5: Teacher-produced cartoon based on concordances

As a follow-up, all of the in-service participants were given homework in which they had to design a lesson appropriate for their students by using the corpus of narratives and interviews in the CNCC and some of the electronic tools presented in the workshop.
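The step in the Dolch List activity where teachers note common collocations of a word can be approximated with a simple count of the words that occur next to it. The Python sketch below is a minimal illustration under stated assumptions: the function name collocates and the window size are hypothetical, the three sample sentences are invented for this example (they are not taken from the Bank of English or the CNCC), and an online concordancer will typically use more refined statistics than raw counts.

```python
import re
from collections import Counter


def collocates(texts, node, window=1):
    """Count the words found within `window` positions of the node word.

    Hypothetical helper for illustration only: raw neighbour counts,
    with no frequency weighting or part-of-speech information.
    """
    counts = Counter()
    node = node.lower()
    for text in texts:
        tokens = re.findall(r"[a-z']+", text.lower())
        for i, token in enumerate(tokens):
            if token == node:
                left = tokens[max(0, i - window):i]
                right = tokens[i + 1:i + 1 + window]
                counts.update(left + right)
    return counts


# Invented sample sentences, standing in for concordance output.
sample = [
    "The bus was full of children.",
    "She gave a full report to the class.",
    "My plate is full of rice.",
]
neighbours = collocates(sample, "full", window=1)
for word, count in neighbours.most_common(3):
    print(word, count)
```

Even a count this crude surfaces patterns such as full of, which a teacher could then illustrate with a captioned drawing as in the activity described above.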
Knowing that teachers may have little time or few opportunities to learn about technology once they enter the classroom, we have devoted a considerable amount of teaching time to exposing pre-service teachers to corpora and concordancing while they are still doing their coursework. We have incorporated activities involving the use of online corpora and concordancing in undergraduate courses often taken by pre-service teachers, such as “Introduction to World Literature”, and in graduate courses taken as electives by both pre- and in-service teachers, such as “Great Books I,” an introductory course focusing on classic Western literature. For example, graduate students working with Charles Dickens’ Bleak House used an online Dickens site to run concordances of key words such as chill, chilling, chilly to identify and examine themes running throughout the novel. Only once did the students find that the word chill refers to the act of becoming colder, which was more typically expressed by chilled to signal that the act had already been accomplished. Instead, they quickly noticed that Dickens typically used ‘chill’ to describe empty edifices or the falling rain. Seeing their own literary interpretations expanded by direct reference to the text reduced the students’ feelings of intimidation by the technology. The graduate students quickly inferred how they could use the output of a literary concordancer as a way to stimulate close reading with mainstream high-school students.
Courses in graduate licensure programs, such as Introduction to Linguistics, Family and Community Literacy, ESL Professionals in the 21st Century, and Language Assessment, include both theoretical articles and hands-on experience with concordancing. These latter classes are largely comprised of prospective and practicing ESL teachers, who in addition to teaching, are increasingly being asked to offer their content-area colleagues on-the-job strategies for working with ELLs. The hope is that in addition to using corpora and concordancing in their own classes, these ESL teachers can provide a resource to others by offering coaching in how to use corpora and concordancers to facilitate instruction, as Boal did in the in-service mentioned above.
3.2.2 Drawing on Teachers’ Cultural Background
Since most of the pre-service and in-service teachers that we have worked with in CMS grew up in the Southeastern part of the U.S. and come from a primarily Protestant background, we have found that it is helpful to connect computerized concordancing with something with which the students may already be familiar – namely, Biblical concordances. Before we introduce electronic concordancing to a class or teacher workshop, we ask if anyone has heard of concordances before. Inevitably, there is at least one teacher who talks about memorizing texts from Biblical concordances in childhood. We then ask the teacher(s) to explain what a Biblical concordance is, what it looks like and how it was created. We also ask the teachers if they have any experience with literary concordances and this often generates a few responses to further the discussion. In a written activity used to activate the students’ background knowledge prior to presenting electronic concordancing, Tina, a high-school English teacher, noted:

I have used a concordance in reference to The Bible and Shakespeare in looking up word order and trying to define a word through context clues.

After the teachers are familiar with the concept of concordancing in non-technological terms, we show them computer-based corpora and concordancing. Having the teachers share their own experiences with concordances prior to this introduction to electronic concordancing serves three main purposes. First, the teachers help others understand the concept of concordancing by using language and examples that are familiar to their peers. Second, hearing about concordancing from their colleagues makes teachers more receptive to using the technology later. Finally, the teachers who lead the discussion on concordancing often feel empowered by having their past experiences valued in a professional setting.
3.3 Perception of Information Overload
Milton (1999: 236) writes, “Learners…often need more guidance in the operation of the language than a purely discovery-based approach…provides.” Although his remark is referring to novice writers, we believe that this statement applies equally well to teachers who are novices at using corpora and concordancing and recommend that appropriate guidance be given to these teachers as they embark on this linguistic journey. For example, teachers often feel inundated by the number of tokens returned by a concordance. Cory, a high school English teacher, expressed this feeling on a feedback form following a concordance activity: The mass amounts of information and contextual evidence [from a concordance] is [sic] overload. Consequently, I don’t get too much from it. Based on her experiences and those reported by other teachers, we typically limit the number of concordance lines that we give to the teachers to relevant examples. We find that upon initial exposure to concordancing, the teachers feel comfortable working with no more than 10 lines and prefer working with 5 lines. Once the teachers are familiar with the concepts of corpora and concordancing, we explain how to use an electronic concordance and show them examples of concordance lines; we then ask them to brainstorm possible applications of this information to their teaching contexts. We also have them run their own concordances on words. After teachers have become more familiar with the concepts and what can be gained from using corpora and concordancers, we gradually increase the number of concordance lines used for demonstration.
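The practice of showing novices only a handful of concordance lines can be made concrete with a small keyword-in-context (KWIC) routine that stops after a set number of hits. The Python sketch below is an illustration under stated assumptions, not the tool used in the project: the function name kwic and its width and limit parameters are hypothetical, the default of 5 lines simply mirrors the preference reported above, and the sample text reuses four of the modified sentences from Figure 3.

```python
import re


def kwic(text, pattern, width=30, limit=5):
    """Return at most `limit` keyword-in-context lines for a regex pattern.

    Hypothetical helper for illustration only. Each line shows up to
    `width` characters of context on either side of the bracketed match.
    """
    text = " ".join(text.split())  # collapse line breaks and extra spaces
    lines = []
    for match in re.finditer(pattern, text, flags=re.IGNORECASE):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context so the matched words line up in a column.
        lines.append(f"{left:>{width}}[{match.group(0)}]{right}")
        if len(lines) == limit:
            break
    return lines


sample = (
    "Eating vegetables and fish is very healthy. "
    "The menu offers grilled fresh fish and seafood, steak, and ribs. "
    "Forrest Gump bought a fishing boat to catch shrimp. "
    "Last year, those fishermen caught a lot of fish!"
)
lines = kwic(sample, r"\bfish\w*", limit=3)
for line in lines:
    print(line)
```

Raising limit step by step corresponds to gradually increasing the number of lines used for demonstration; selecting which lines are relevant remains a judgment the trainer makes by hand.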
Because teachers report that their apprehension diminishes when we follow this sequence, we can recommend guided discovery as a training technique.
3.4 Ambivalence about Using ‘Authentic Language’
The use of authentic language has been a staple of many foreign- and second-language curricula for the past two decades (Hedge 2000; Omaggio Hadley 1993), primarily because “…if the goal of [language] teaching is to equip students to deal ultimately with the real world, they should be given opportunities to cope with this in the classroom” (Hedge 2000: 67). However, pre- and in-service teachers, particularly those who teach content-area subjects and have little if any training in language pedagogy, often do not appreciate the richness of authentic language and what it can bring to the classroom; instead, they sometimes see authentic language as a drawback for use in their classes, for it often deviates from their notions of appropriate language usage and thus conflicts with their perceived role as standard-bearers of “good English.”
An example of this attitude is reflected in the anonymous comments and questions from students using the CNCC in a materials development project for an education course. The majority of students in the course were practicing K-12 teachers and felt that the themes of the narratives in the corpus could easily be linked to their classroom instruction. However, without exception, they wanted to change the language of the narratives to be more like “standard English”. To respond to this initial resistance of the students to the language of the corpus materials, project staff worked with both the students and the course instructor to explain the benefits of using authentic language in the classroom, including helping students:
• to transition between basic interpersonal communicative skills (BICS) and cognitive academic language proficiency skills (CALPS) (Cummins 1980);
• to understand the range of dialects in the community;
• to become more culturally sensitive.
We also provided them with strategies for developing activities from the corpus materials that took the focus of the lesson away from prescriptive grammatical “errors” in the narratives; for example, we suggested that rather than using corpus narratives as models for grammatical “correctness”, the teachers develop materials for global and detailed listening and reading comprehension, vocabulary development and reinforcement and cultural awareness. Furthermore, to give further legitimacy to using the authentic language contained in the corpus of narratives, we reminded these prospective and practicing teachers that some of the selections in required language arts textbooks in CMS, such as John Steptoe’s Stevie and Piri Thomas’ The Amigo Brothers, contain non-standard dialect features. Of course, the strategies presented above are not a panacea for ridding teachers of their linguistic biases, but they do constitute several ways to increase teachers’ awareness of why the authentic language represented in corpora is important, and thus the likelihood that teachers will use such resources in the future.
4 One Final Thought
In this chapter, we have described how Project MORE has used a corpus of oral narratives to produce classroom activities and teacher-training materials for K-12 teachers; in addition, we have discussed how prospective and practicing teachers have been introduced to corpora and concordancing in coursework and in-services, highlighted some of the obstacles that we have faced in working with these teachers and proposed some ways to overcome these challenges. We are committed to our work with K-12 teachers of all stripes – ESL, sheltered-content ESL and content-area – because we believe that all public school teachers in the U.S., regardless of the subject they teach, are language teachers at heart and deserve to be informed of and to be taught to use corpora and concordancing to inform their own instruction. As Conrad (1999: 3) observes:

Practising teachers and teachers-in-training… owe it to their students to share the insights into language use that corpus linguistics provides. To do any less could disadvantage a generation of learners.

Notes
1. We appreciate the efforts of the many teachers in the Charlotte-Mecklenburg school system, including Barbara Boal and those who have participated in our workshops and in-services, and our students, for they have helped us expand our own understanding of how corpora and concordancing can be applied to content-area K-12 classes.
2. The names of the teachers referred to in the remainder of the chapter are pseudonyms.
References
Aston, G. (1997), Enriching the learning environment: Corpora in ELT, in A. Wichmann, S. Fligelstone, T. McEnery, and G. Knowles (eds), Teaching and language corpora, London: Longman, pp. 51-64.
Biber, D., S. Johansson, G. Leech, S. Conrad, and E. Finegan (1999), Longman grammar of spoken and written English, Harlow, UK: Pearson Education.
Charlotte-Mecklenburg Schools (2002), CMS fast facts.
Conrad, S. (1999), The importance of corpus-based research for language teachers, System, 27 (1): 1-18.
Cummins, J. (1980), The construct of proficiency in bilingual education, in J.E. Alatis (ed.), Georgetown University round table on languages and linguistics, Washington: Georgetown University Press, pp. 81-103.
Gavioli, L. and G. Aston (2001), Enriching reality: Language corpora in language pedagogy, ELT Journal, 55: 238-246.
Hedge, T. (2000), Teaching and learning in the language classroom, Oxford: Oxford University Press.
Hunston, S. (2002), Corpora in applied linguistics, Cambridge: Cambridge University Press.
Hyland, K. (2000), Disciplinary discourses: Social interactions in academic writing, London: Longman.
Milton, J. (1999), Lexical thickets and electronic gateways, in C.N. Candlin and K. Hyland (eds), Writing: Texts, processes and practices, London: Longman, pp. 221-243.
Omaggio Hadley, A. (1993), Teaching language in context (2nd edition), Boston: Heinle and Heinle.
Short, D. (1998), Secondary newcomer programs: Helping recent immigrants prepare for school success, ERIC Digest, Washington, DC: ERIC Clearinghouse on Languages and Linguistics.
Stevens, V. (1995), Concordancing with language learners: Why? When? What?, CAELL Journal, 6 (2): 2-10.
Tribble, C. and G. Jones (1990), Concordances in the classroom, London: Longman.
U.S. Department of Education (2002), The growing numbers of limited English proficient students: 1991/1992-2001/2002.
Villegas, A.M. and T. Lucas (2002), Educating culturally responsive teachers: A coherent approach, Albany: State University of New York Press.
Appendix
Stories from My Mother and Father
This set of activities is developed from an interview with Preeyaporn Chareonbutra (transcript below) from the Charlotte Narrative and Conversation Collection. The set of activities contains subject matter and techniques keyed to the emphases and standards for middle school students. Simultaneously, it is designed to fulfill a number of North Carolina Standard Course of Study Goals for elementary language arts and social studies. This allows middle school content-area, sheltered-ESL and ESL teachers to introduce (or reinforce) content-area skills typical for lower grades to newcomer ELLs who may have limited or interrupted schooling.

Activity 1: Cloze Activity
This activity fulfills the following North Carolina Standard Course of Study Goals:
English Language Arts:
Grades 1 and 2:
Oral Language Strand Skill Continuum: Students can increase oral and written vocabulary by listening, discussing, and responding to literature that is read and heard.
Goal 3: The learner will make connections through the use of oral language, written language, and media and technology.
Grades 1, 2, 3, 4 and 5:
Goal 2: The learner will develop and apply strategies and skills to comprehend text that is read, heard, and viewed.

Teacher Instructions
1. Divide students into pairs or small group teams of 3-4 students. (It works well to pair less fluent students with more fluent ones.)
2. Preview unfamiliar vocabulary and/or grammatical forms from the narrative with students.
3. Play the audio of the narrative once for the students. (For less fluent students, you may want them to listen to the audio and follow along with the full written transcript.)
4. After listening to the narrative, give the students a cloze activity to complete. Go over the instructions with the students.
5. Play the audio twice for the students and ask them to fill in the cloze activity as they listen.
6. Have students in pairs or small groups compare their answers.
7. Review activity with the class.
Stories from My Mother and Father
Listen and fill in the blanks.
MC: Meredith Combs (Interviewer)
PC: Preeyaporn Chareonbutra (Interviewee)

MC: ____________________________ Meredith Combs and I am interviewing Preeyaporn.
PC: _____________________________ Preeyaporn Chareonbutra and I am interviewed by Meredith.
MC: Pree, can you tell me about some of the _________________ you remember being told as a child?
PC: Being told, like, from--.
MC: That maybe that your _________ told you, um, or family members, or a teacher, that, that told a story to you?
PC: Um. What kind of stories would you like today, tales or real stories?
MC: Just some that stick out in your mind.
PC: OK, I remember my ____________________ stories when she was a young girl in her small town and she was like a beautiful ________________ in that village. And she was a __________________ on the village and every year she had to prepare for a dance and, and she knew a lot of boys and of boys was, were, interested in her and but, um, her, her girlfriends were like her security guards and she’s very naughty and um, my ______________ found her in a, within a, at a store in that town. And he wasn’t there but he just visited the _____________ and the, his first impression was her, um, personality, like, she’s, very, um, talkative, and she’s different from the other girls, because, I think, because, um, most Thai women were, um, at that time, were, shy, didn’t ___________ much and um, and he liked her.
MC: Because she was different?
PC: Yeah, um. She’s _____________________.
MC: Did she used to tell, tell you about that when you were __________?
PC: Yeah, so funny.
MC: What stories did, um, your dad tell you?
PC: Um, my _____________ didn’t have a lot of stories, mostly from his, his, his real, real true stories from his experience. Um [pause], I remember he, he talked about his younger brother, who’s not in Thailand now, because he’s married to a German _______________ and I think he had a good time with that brother and
Concordancing and Corpora for K-12 Teachers
he’s pretty close to him and he always miss him still, you know, and he’s in Switzerland now--.
MC: Far away?
PC: Yeah and um, my father was, was in a military __________________ for a few years, and when he came home and knew that the younger ________________ had a, had a job and he was a _________________, a guitarist in a rock band.
MC: Oh neat.
PC: And so he had like, like a, free time after school and he’s thinking about what kind of job he wanted to, to do after the school, because he could choose it. You know, he didn’t have to go to be a soldier. But, um, he spent, um, a few months with his younger brother. And he said he pretended, um, to be a manager of that band and went around, you know, and they had a show. He went with them and, and, um, because he hung out with those, the musicians a lot of times so he, he learned to, to um, _________ ___________, because they _________ only English songs, the 60s, 70s songs and he knew a lot of songs and, um, he talk about, um, the songs and he, he sang the songs to me and then every time he, sang the ___________ to me, he would mention this younger brother.

Activity #2: Listening Comprehension
This activity fulfills the following North Carolina Standard Course of Study Goals:
English Language Arts:
Kindergarten, Grades 1 and 2:
Oral Language Strand Skill Continuum: Students can increase oral and written vocabulary by listening, discussing, and responding to literature that is read and heard.
Goal 3: The learner will make connections through the use of oral language, written language, and media and technology.
Grades 1, 2, 3, 4 and 5:
Goal 2: The learner will develop and apply strategies and skills to comprehend text that is read, heard, and viewed.
Goal 4: The learner will apply strategies and skills to create oral, written, and visual texts.
Grades 3, 4 and 5:
Goal 1: The learner will apply enabling strategies and skills to read and write.

Teacher Instructions
1. Divide students into pairs or small group teams of 3-4 students. (It works well to pair less fluent students with more fluent ones.)
2. Preview unfamiliar vocabulary and/or grammatical forms from the narrative with students.
3. Play the audio of the narrative once for the students. (For less fluent students, you may want them to listen to the audio and follow along with the full written transcript.)
4. After listening to the narrative, give the students the comprehension questions to complete. Go over the instructions with the students.
5. Play the audio a second time for the students and ask them to listen for the information in the questions.
6. Play the audio a third time, stopping to allow students time to write down their answers.
7. Have students in pairs or small groups compare their answers.
8. Review the activity with the class.

Stories from My Mother and Father
Listen to the story and answer the questions.
1. Preeyaporn is from Thailand. She talks about a story that her mother told her. Describe Preeyaporn’s mother when she was younger.
2. Where did Preeyaporn’s mother and father meet?
3. What did Preeyaporn’s father think of Preeyaporn’s mother when they first met?
4. Where does Preeyaporn’s uncle live?
5. Who did Preeyaporn’s uncle marry?
6. What kind of school did Preeyaporn’s father attend?
7. What kind of music did Preeyaporn’s uncle play?
8. How did Preeyaporn’s father learn to speak English?
Activity #3: Geography Reinforcement
This activity fulfills the following North Carolina Standard Course of Study Goals:
English Language Arts:
Kindergarten, Grades 1 and 2:
Goal 3: The learner will make connections through the use of oral language, written language, and media and technology.
Grades 1, 2, 3, 4 and 5:
Goal 2: The learner will develop and apply strategies and skills to comprehend text that is read, heard, and viewed.
Goal 4: The learner will apply strategies and skills to create oral, written, and visual texts.
Social Studies:
Grade 2:
Goal 8: The learner will apply basic geographic concepts and terminology, including map skills.

Teacher Instructions
1. Divide students into pairs or small group teams of 3-4 students. (It works well to pair less fluent students with more fluent ones.)
2. Preview unfamiliar vocabulary and/or grammatical forms from the narrative with students.
3. Distribute the handout and go over the instructions with the students.
4. Play the audio of the narrative for the students.
5. Ask students to work together to complete the activities on the handout.
6. Review the activity with the class.

Stories from My Mother and Father
Preeyaporn lives in North Carolina but is from Thailand. Listen to Preeyaporn’s story about her family. Use a world map or a globe to help you find out information about the countries important to her and her family.
1. Find the U.S. What continent is it on? __________________________
2. Find Thailand. What continent is it on? _________________________
3. Find Switzerland. What continent is it on? _______________________
4. Circle the countries that border water: U.S. Thailand Switzerland
5. Circle the country that borders Canada: U.S. Thailand Switzerland
6. Circle the country that borders Germany: U.S. Thailand Switzerland
7. Circle the country closest to China: U.S. Thailand Switzerland
8. Circle the country closest to Cuba: U.S. Thailand Switzerland
9. Circle the country closest to Poland: U.S. Thailand Switzerland
10. Find North Carolina. Is it on the East Coast or the West Coast? _______
11. Preeyaporn was born in Thailand but now lives in the United States; Preeyaporn’s uncle is from Thailand but lives in Switzerland. Which country would you like to travel to? Why? Share your opinions with your classmate and find the country on a world map.
12. Look on the internet or go to the library to find more information about the country you would like to visit. Write a paragraph about the country. You may want to consider: location, capital, customs, language(s), food, type of money (currency), climate and historical and cultural landmarks.
Activity #4: Identifying Similarities and Differences
This activity fulfills the following North Carolina Standard Course of Study Goals:
English Language Arts:
Grades 1 and 2:
Oral Language Strand Skill Continuum: Students can increase oral and written vocabulary by listening, discussing, and responding to literature that is read and heard.
Goal 3: The learner will make connections through the use of oral language, written language, and media and technology.
Grades 1, 2, 3, 4 and 5:
Goal 2: The learner will develop and apply strategies and skills to comprehend text that is read, heard, and viewed.
Goal 4: The learner will apply strategies and skills to create oral, written, and visual texts.
Grades 3, 4 and 5:
Goal 1: The learner will apply enabling strategies and skills to read and write.
Social Studies:
Grade 3:
Goal 2: The learner will infer that individuals, families and communities are and have been alike and different.

Teacher Instructions
1. Divide students into pairs. (It works well to pair a less fluent student with a more fluent one.)
2. Preview unfamiliar vocabulary and/or grammatical forms from the narrative with students.
3. Distribute the handout and go over the instructions with the students.
4. Play the audio of the narrative for the students.
5. Ask students to work together to complete the activities on the handout.
6. Review the activity with the class.

Stories from My Mother and Father
Preeyaporn lives in North Carolina and tells a story about her family. Listen to Preeyaporn’s story about her family to help you with the questions below.
1. In her story, Preeyaporn describes her mother. Describe a person in your family to one of your classmates. Listen to your classmate tell you about a person in his or her family. Ask your classmate questions about his or her description: What does he or she look like? How old is he or she? Where does he or she live? What does he or she like to do for fun? Is she or he a student? Where? Does he or she work? Where?
2. Write a paragraph about a person in your family. Make sure that you describe:
✓ how the person looks,
✓ how old the person is,
✓ where the person lives,
✓ what the person likes to do for fun,
✓ whether the person is a student and
✓ whether the person has a job.
3. Compare your family to Preeyaporn’s family. How is it similar? How is it different? Fill in this Venn Diagram to show the similarities and differences.
[Venn diagram]
Write a paragraph about the similarities and differences between your family and Preeyaporn’s family.
4. Ask your classmate about his or her family. How is it similar to your family? How is it different from your family? Fill in this Venn Diagram to show the similarities and differences.
[Venn diagram]
Write a paragraph about the similarities and differences between your family and your classmate’s family.
Transcript for Activities:
(NOTE: This is an excerpt from a longer transcript.)
MC: Meredith Combs (Interviewer)
PC: Preeyaporn Chareonbutra (Interviewee)

MC: My name is Meredith Combs and I am interviewing Preeyaporn.
PC: My name is Preeyaporn Chareonbutra and I am interviewed by Meredith.
MC: Pree, can you tell me about some of the stories you remember being told as a child?
PC: Being told, like, from--.
MC: That maybe that your parents told you, um, or family members, or a teacher, that, that told a story to you?
PC: Um. What kind of stories would you like today, tales or real stories?
MC: Just some that stick out in your mind.
PC: OK, I remember my mother’s stories when she was a young girl in her small town and she was like a beautiful girl in that village. And she was a dancer on the village and every year she had to prepare for a dance and, and she knew a lot of boys and of boys was, were, interested in her and but, um, her, her girlfriends were like her security guards and she’s very naughty and um, my father found her in a, within a, at a store in that town. And he wasn’t there but he just visited the town and the, his first impression was her, um, personality, like, she’s, very, um, talkative, and she’s different from the other girls, because, I think, because, um, most Thai women were, um, at that time, were, shy, didn’t speak much and um, and he liked her.
MC: Because she was different?
PC: Yeah, um. She’s different.
MC: Did she used to tell, tell you about that when you were young?
PC: Yeah, so funny.
MC: What stories did, um, your dad tell you?
PC: Um, my dad didn’t have a lot of stories, mostly from his, his, his real, real true stories from his experience. Um [pause], I remember he, he talked about his younger brother, who’s not in Thailand now, because he’s married to a German woman and I think he had a good time with that brother and he’s pretty close to him and he always miss him still, you know, and he’s in Switzerland now--.
MC: Far away?
PC: Yeah and um, my father was, was in a military school for a few years, and when he came home and knew that the younger brother had a, had a job and he was a musician, a guitarist in a rock band.
MC: Oh neat.
PC: And so he had like, like a, free time after school and he’s thinking about what kind of job he wanted to, to do after the school, because he could choose it. You know, he didn’t have to go to be a soldier. But, um, he spent, um, a few months with his younger brother. And he said he pretended, um, to be a manager of that band and went around, you know, and they had a show. He went with them and, and, um, because he hung out with those, the musicians a lot of times so he, he learned to, to um, speak English, because they sang only English songs, the 60s, 70s songs and he knew a lot of songs and, um, he talk about, um, the songs and he, he sang the songs to me and then every time he, sang the songs to me, he would mention this younger brother.
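A cloze handout like the one in Activity #1 can also be produced from the transcript automatically. The following is a minimal sketch, not the authors' procedure: the function name make_cloze, the blank width and the choice of target words are assumptions made for illustration, and the excerpt is shortened from the transcript above. Unlike the printed handout, which gaps only selected occurrences, the sketch blanks every occurrence of each target word.

```python
import re


def make_cloze(text, targets, blank="__________"):
    """Replace every whole-word occurrence of a target word with a blank.

    Returns the gapped text and the answer key in order of appearance.
    """
    # Try longer targets first so a long target wins over a shorter one
    # that happens to be its prefix.
    ordered = sorted(targets, key=len, reverse=True)
    pattern = re.compile(
        r"(?<!\w)(" + "|".join(re.escape(t) for t in ordered) + r")(?!\w)",
        re.IGNORECASE,
    )
    answers = []

    def gap(match):
        # Remember the removed word so a teacher's key can be printed.
        answers.append(match.group(0))
        return blank

    return pattern.sub(gap, text), answers


# Excerpt shortened from the transcript; the target words are hypothetical.
excerpt = ("MC: What stories did, um, your dad tell you? "
           "PC: Um, my dad didn't have a lot of stories.")
gapped, key = make_cloze(excerpt, ["dad", "stories"])
print(gapped)
print("Answer key:", key)
```

A teacher who wants to gap only some occurrences could edit the gapped text by hand afterwards, using the answer key as a checklist.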
Units of Meaning, Parallel Corpora, and their Implications for Language Teaching
Wolfgang Teubert
Department of English, University of Birmingham

Abstract
Translation equivalence is a key issue for all who apply multilingual skills in a professional environment. This includes language teachers, translators, lexicographers and terminologists, as well as experts in computational linguistics. Translation equivalence has therefore to be dealt with by academic foreign language teaching. There are two reasons. The units of meaning are only rarely the traditional single words; much more common are larger chunks, compounds, multi-word units, set phrases and even full sentences. In corpus linguistics, these are called collocations. They are the true vocabulary of a language. Collocations are statistically significant co-occurrences of words in a corpus. But they also have to be semantically relevant. They have to have a meaning of their own, a meaning that is not obvious from the meaning of the parts they are composed of. Whether an English text chunk is a true collocation or just a chain of words can only be decided from the perspective of a source language. This is why a list of English collocations for students with other native languages would have to be compiled from a parallel corpus. I will show how an approach to translation equivalence based on collocations yields results that can be applied in language teaching.
1 Teaching a Foreign Language
We are all aware of the two diametrically opposed paradigms that have informed the teaching of foreign languages over the last one hundred years. Within the one paradigm, the goal is to introduce the foreign language independently from the native language of the students. This will keep them from translating everything they say and hear from and into their native language; it will enable them to use the target language more naturally and to develop a feeling for it similar to the one they have for the language they grew up with. The phrases and sentences the students learn are not linked to equivalent structures in their native languages but to the relevant situations and social practices. This approach certainly has a lot of merits. It empowers the students to take part in discourse activities pertinent to the taught situations quickly, by saying what they are expected to say, and by recognising the phrases and sentences they learned in their training.
Within the other paradigm the students are taught the target language against the background of the source language. This paradigm involves a great deal of linguistic awareness. In order to compare or to contrast the source language with the target language, one has to know the entities and concepts involved, such as
the parts of speech, complements, adjuncts, clause types, noun and verb phrases, morphology, word formation, inflection, word order, and what else linguists use to describe the differences and the commonalities of the two languages. It also involves an awareness of translation equivalence. We have to know where a word overlaps in its meaning with an English word, and where it differs. If students are taught a foreign language within this paradigm, they link what they want to say and what they hear less to situations and more to the equivalent source language structures. It takes them much longer to become fluent in the target language. On the other hand, once they have learned how to move freely from their source language into the target language, they can cope with situations students taught in the direct method find more difficult to master. They have learned to use tools such as grammars and dictionaries, they have learned to properly describe a structure they have not yet come across in terms of the entities and concepts they were taught, and they will feel more confident to deal, actively or passively, with all the variations of the sentences and phrases they constantly are confronted with in real life situations. The direct (or communicative) method and the contrastive method pursue different goals. If I want to get around as a tourist in a foreign country, or even if I work there in a workplace where I can use my native language for all official purposes and where I need the target language only for socialising, I will find it sufficient to be taught the target language in a very direct method. If I need to use the target language professionally (e.g., for document authoring) or for translation, or for teaching the target language to other students, the contrastive method is more advantageous. In reality, most foreign language teaching today is a combination of the two approaches, with more emphasis on one of them, according to the goal. 
The students trained in a foreign language at university level are expected to have to use the target language within the context of their future jobs. The students in the English departments, in the countries where English is not yet the national language, will become English teachers, translators, scientists or managers who will have to use English in a professional way, and this means they are expected to find a solution for a language problem they have not encountered before. They have to know about the entities and concepts that are used for describing, comparing and contrasting languages, so that they can find solutions in the many books that linguists, grammarians and lexicographers have produced over the years to make our life easier. For the language students at university level, we just cannot give up the traditional contrastive method. The disadvantage is, however, that the English they speak sounds more like the books and not quite like the English spoken by native speakers. We all know that both are needed: Our students should learn to speak more or less like natives, but they also have to have the linguistic knowledge to understand what they are doing when they are speaking. Why shouldn’t it be possible to combine the two goals? Why doesn’t the linguistic knowledge we acquire enable us to speak like natives? Why is it that the linguists, grammarians and lexicographers seem to be unable to teach us exactly that?
The problem with the traditional linguistic backgrounding of language teaching is that one of the concepts, indeed the core concept, seems to be seriously flawed. I am speaking here of the concept of the word. There are many good reasons why linguists use this concept. But it does not help much when we deal with meaning. Yet meaning is what links one language to another, even when they are as different in form as Chinese and English. What I can say in English in a particular situation, in a given social practice, in a specified context, I can also say in Chinese. If, however, meaning is not primarily organised in words, my traditional linguistic knowledge will not help me to find the proper Chinese phrase.
Corpus linguistics replaces our traditional notion of the word by the notion of a unit of meaning. In some cases, a unit of meaning may indeed be a single word. In many cases, it will be more complex. It will be a compound, a multi-word unit, a set phrase or even a full sentence. We call these more complex units of meaning collocations. The vast majority of them are not listed in our monolingual and bilingual dictionaries. These dictionaries are organised on the word principle, and they tend to let us down if we are looking for phrases and their equivalents.
Corpus linguistics is empirical linguistics. It looks at language as it occurs in the discourse, this infinite body of all the texts that the members of a discourse community have contributed and are constantly contributing to the discourse. It is in the discourse where we find out what is usually said in a given situation, in a given social practice. Of course, we never have access to the totality of the discourse. All we can aspire to is to set up a corpus which, we hope, is a fair and balanced representation of this discourse. Today’s corpora of half a billion words, in a few cases even several billion words, are a first step in this direction.
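To make the idea of a collocation as a recurrent co-occurrence concrete, the following sketch scores adjacent word pairs in a tiny invented corpus by pointwise mutual information (PMI). PMI is only one simple association measure, chosen here for brevity; it is not a significance test and not a method proposed in this chapter. The corpus, the function name bigram_pmi and the min_count threshold are assumptions made for illustration.

```python
import math
from collections import Counter


def bigram_pmi(sentences, min_count=2):
    """Score adjacent word pairs by pointwise mutual information (PMI).

    PMI(x, y) = log2( P(x, y) / (P(x) * P(y)) ), estimated from raw counts.
    """
    words = Counter()
    pairs = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        words.update(tokens)
        pairs.update(zip(tokens, tokens[1:]))  # adjacent pairs only
    n_words = sum(words.values())
    n_pairs = sum(pairs.values())
    scores = {}
    for (left, right), count in pairs.items():
        if count < min_count:
            continue  # pairs seen only once give unreliable scores
        p_pair = count / n_pairs
        p_left = words[left] / n_words
        p_right = words[right] / n_words
        scores[(left, right)] = math.log2(p_pair / (p_left * p_right))
    # Highest-scoring pairs first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Invented toy corpus; a real study would use millions of words.
toy_corpus = [
    "they came under enemy fire",
    "the enemy fire was heavy",
    "the fire was out by noon",
    "the enemy was near the town",
]
for pair, score in bigram_pmi(toy_corpus):
    print(pair, round(score, 2))
```

A high score signals frequent co-occurrence only. Whether a pair is also semantically relevant, that is, whether it has a meaning of its own, still has to be judged separately.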
But monolingual corpora do not help us to link what is said in one language to what is said in another language. In a bilingual context, a monolingual corpus may often be useful, but it doesn’t really tell us what the target language equivalent is of a given compound, multi-word unit, set phrase or sentence. What we need here are parallel corpora, corpora of original texts in one language and their translations in the target language and vice versa. It has been argued, for instance by Baker (1992) and Sinclair (1996), that we should be very careful in using this evidence. Translated language is slanted, flawed language, and it differs from natural language. I am not convinced. After all, it is only the community of bilingual speakers, most often of translators, who can tell us how the two languages are linked in terms of meaning. Usually, the goal of a translator is to make the original text sound as natural as possible in the target language. Of course, there are good translators and bad translators, and in any parallel corpus we are bound to find a lot of equivalents we, as members of the community of bilingual speakers, would not advise using. But appropriate translations tend to be repeated while wrong translations will remain singular. Therefore, frequency is the parameter that tells us which equivalents should be used and which not. Texts are not translated word by word. Translators have learned to identify units of meaning; indeed most of what they translate are collocations. Therefore we have to look for those we do
not find in bilingual dictionaries, in parallel corpora. Collocations and their target language equivalents belong to the implicit knowledge experienced translators have. Parallel corpora are the repositories of all the natural-sounding phrases and sentences that we want our students to learn. This is how corpus linguistics can contribute to teaching foreign languages.
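The frequency argument can be sketched in a few lines. The example below assumes that candidate equivalents have already been extracted from an aligned parallel corpus as (source phrase, translation) pairs; the pairs and their counts are invented for illustration, and rank_equivalents and min_count are hypothetical names. Renderings that occur only once are set aside as possible one-off translations.

```python
from collections import Counter


def rank_equivalents(aligned_pairs, source_phrase, min_count=2):
    """Rank the target-language renderings observed for one source phrase."""
    observed = Counter(
        target for source, target in aligned_pairs if source == source_phrase
    )
    # Renderings seen fewer than min_count times are treated as one-offs.
    return [(target, n) for target, n in observed.most_common() if n >= min_count]


# Invented alignment data, for illustration only.
PHRASE = "wie ein Blitz aus heiterem Himmel"
toy_pairs = [
    (PHRASE, "like a bolt from the blue"),
    (PHRASE, "like a bolt from the blue"),
    (PHRASE, "out of the blue"),
    (PHRASE, "like a bolt from the blue"),
    (PHRASE, "like lightning from a clear sky"),
    (PHRASE, "out of the blue"),
]
print(rank_equivalents(toy_pairs, PHRASE))
```

The threshold of two is arbitrary. With a real parallel corpus it would have to be set in relation to corpus size and to how often the source phrase occurs at all.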
2 Words, Idioms and Collocations
In the view of corpus linguistics, meaning is an aspect of language and cannot be found outside of it. It is entirely within the confines of the discourse where we can find the answer to what a unit of meaning – be it a single word or, more commonly, a collocation, i.e. the co-occurrence of two or more words – means. A unit of meaning is a word (often called the node or keyword) plus all those words within its textual context that are needed to disambiguate this word, to make it monosemous. As most of the more frequent words are indeed polysemous, they do not, as single words, constitute units of meaning. As any larger dictionary tells us, the word fire is ambiguous. Therefore it is not a unit of meaning. In connection with the noun enemy it becomes a part of the collocation enemy fire, meaning “the shooting of projectiles from weapons by the enemy in an armed conflict.” This collocation is (under normal circumstances) monosemous, and therefore a unit of meaning.
In the venerable field of phraseology people were always aware that language is full of units of meaning larger than the single word. When I hear She has not been letting the grass grow under her feet I do not expect that to be literally true. Rather I have learned that the phrase not let the grass grow under one’s feet is an idiom, a unit of meaning which, according to the New Oxford Dictionary of English (NODE), means ‘not delay in acting or taking an opportunity’. Indeed, the idiomaticity of language is a favourite topic of the discourse community. We like to talk about idioms; we feel that they are an important part of our cultural heritage. There is many a book explaining their origins, and there is hardly a dictionary that would dare to leave them out. Over the last century, we have come up with ever more refined typologies of idioms. Moon’s (1998) excellent study Fixed expressions and idioms in English provides a thorough corpus-based analysis of the phenomenon of idiomatic language.
While some idioms are more or less inalterable (It’s raining cats and dogs), others are rather variable. Most idioms oscillate between the two extremes. If we probe too deeply, our intuition will often desert us. Are figments of imagination an idiom, or can there be other figments? Does figment have a meaning of its own? In the NODE, its meaning is described as “a thing that someone believes to be real but that exists only in their imagination.” The example given is figment of her overheated imagination. We have to look at a corpus (here at the British National Corpus) to find that figment is always followed by an of-phrase, but that there is indeed a small range of nouns. Imagination is by far the most common noun, but we also find: a figment of his own mind; a figment of my neurosis; a figment of its leaders’ fantasies; a figment of his own name; figments of linguistic bewitchment; and figments of fiction. These are six instances out of 58 occurrences. This shows that figment of and figments of, followed by a noun, fall into the category of variable idioms. The word does not occur without an of-phrase; one noun (imagination) accounts for about 90% of all cases, and the other nouns are somehow comparable in meaning.
Idioms have found their way into bilingual dictionaries as well. The Wildhagen Héraucourt German-English Dictionary tells me that the English equivalent of wie ein Blitz aus heiterem Himmel [literally: like a bolt from a serene sky] is like a bolt from the blue. Idioms feature rather prominently in foreign language learning (with the result that speakers of English as a second language tend to overuse those they have learned, such as It’s raining cats and dogs, an idiom studiously avoided by native speakers).
Our modern western concept of the word is not very old. It was the medieval Christian monks, the scribes who copied the few books found in their monastic libraries, who introduced the space between words, not because they believed in a Platonic idea of the word but because it made it simpler to remember the text passage they were copying. We tend to think that words have been around all the time because we have learned to translate Greek logos and Latin verbum as word. Yet any larger dictionary shows that these words mean a lot of diverse things, most generally speech or text, something that is being said, but that they hardly ever refer to what we today normally associate with word. Our own word word originally had the same meaning. The first sense given in the OED is “speech, utterance, verbal expression.” Today, when we hear word, we normally think first of an element of speech, as the second sense given in the OED is circumscribed. If we believe Jack Goody, this concept is foreign to oral societies. That is not so astonishing.
In spoken language we normally don’t insert a pause in between words. Neither were the old Greeks and Romans in the habit of putting in spaces between their written words. Where the space is inserted is largely a matter of convention. What is linguistique de corpus in French is corpus linguistics in English and Korpuslinguistik in German. There is no cogent reason other than tradition why there should be no space between the elements of German compounds: Korpus Linguistik. Other modern languages missed the chance to define words by spaces. When it was recognised that in most cases it didn’t make sense to define a single Chinese character as a word and it became accepted that most Chinese words would consist of two or even three characters, it became a problem to identify words in a sentence. It is often the case that Chinese sentences can be cut up into words in different ways as long as we apply nothing but formal rules and leave out what they mean. Thus, in Chinese language processing there is still no segmentation software that is entirely reliable. How could it be different? We find cases of doubt in practically all Western languages. The problem of where there should be spaces and where there shouldn’t featured prominently in the last German spelling reform.
It is meaning, not grammar, that throws a shadow on the single word. A glance at any monolingual or bilingual dictionary confirms that the main problem
of single words, from a semantic perspective, is their polysemy, their ambiguity and their fuzziness. For the verb strike, the NODE lists 11 senses. One of them is make (a coin or medal) by stamping metal. As a sub-sense of this we find reach, achieve, or agree to (something involving agreement, balance, or compromise: the team has struck a deal with a sports marketing agency). Though we might, upon consideration, come to accept this sense as a metaphorisation of striking coins, the actions seem to have hardly anything in common. The strike in strike a deal means something else than the strike in strike coins, and something different from the other ten senses ascribed to it in the dictionary entry. Indeed one could easily maintain that it has no meaning of its own; only together with deal does it mean something, namely reach an agreement. This is the gist of Sinclair’s (1996) article, “The Empty Lexicon.” Yet once we have identified the semantically relevant collocates of words like strike, their ambiguity and fuzziness disappear. The collocation strike a deal is as monosemous or unambiguous as anyone could wish. Even though neither the NODE nor the Longman Dictionary of English Idioms (1979) lists strike a deal as an idiom, it seems to belong in this category. In the BNC there are 25 occurrences of struck a deal. The absence of strike a deal from larger dictionaries and specialised idioms dictionaries illustrates that the recognised lists of idioms, those we are aware of as part of our cultural heritage, represent no more than the tip of an iceberg. Time and again, corpus evidence suggests that there are many more semantically relevant collocations than dictionaries tell us. But what about the sense of strike described in the NODE as “discover (gold, minerals, or oil) by drilling or mining”?
In a random sample of 500 occurrences, taken from the Bank of English, we find seven instances of this sense of strike: four of struck gold, two of struck oil, and one of struck paydirt. All citations represent metaphorical usage. Here are two examples:
Dixon, who, together with the unfailing Papa San, struck gold with `Run The Route."
telephone franchises. No one has struck paydirt yet, although the Bells have captured business
The example of strike “discover by drilling or mining” shows that there is no obvious feature to tell us whether we should analyse a phrase as consisting of two separate lexical items (strike and gold) or whether we should analyse it as a collocation, i.e. as one lexical item (strike gold). It is not a question of ontological reality, of what there is, but a question of expediency.
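A tally like the one reported for strike can be approximated by counting the word that immediately follows the node in a set of concordance lines. The sketch below is a simplified illustration under stated assumptions: the function name right_collocates is hypothetical, the first two lines are the citations quoted above (the first one truncated), and the remaining lines are invented to make the count less trivial.

```python
import re
from collections import Counter


def right_collocates(lines, node_forms):
    """Count the word immediately following any form of the node word."""
    forms = {form.lower() for form in node_forms}
    counts = Counter()
    for line in lines:
        # Crude tokenisation: runs of letters and apostrophes, lower-cased.
        tokens = re.findall(r"[a-z']+", line.lower())
        for current, following in zip(tokens, tokens[1:]):
            if current in forms:
                counts[following] += 1
    return counts


concordance_lines = [
    # Citations quoted in the text (the first one truncated):
    "Dixon, who, together with the unfailing Papa San, struck gold with",
    "telephone franchises. No one has struck paydirt yet, although the Bells",
    # Invented lines, for illustration only:
    "The company struck oil on its third attempt.",
    "She struck gold with her first novel.",
]
tally = right_collocates(concordance_lines, ["strike", "strikes", "struck"])
for word, n in tally.most_common():
    print(word, n)
```

Counting only the next word is a deliberate simplification. A phrase such as strike a deal would need a wider window to the right of the node, and deciding whether the resulting pair counts as one lexical item remains a matter of interpretation.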
3 The Phrase Friendly Fire: a True Collocation?
Another phrase that is worth looking at from a bilingual perspective is friendly fire. It is a fairly recent addition to our vocabulary, occurring for the first time in 1976 as the title of a novel by C. D. B. Bryan. The story is about the death of an American soldier in the Vietnam War who had been accidentally
killed by U.S. fire. Though this novel wasn’t particularly popular, the phrase quickly entered the general discourse. It replaced the military term fratricide, which we also find in French. But fratricide is also a general language word meaning “the killing of one’s brother (or sister).” As such, it is rare and smacks of erudition. Friendly fire, on the other hand, has a familiar ring, in spite of being a neologism. With each subsequent war, it became more popular. In the 450 million words of the Bank of English, there are 267 occurrences of this phrase.
Do lexicographers regard friendly fire as a unit of meaning? The largest online English dictionary is WordNet, an electronic database that has been and is still being compiled at Princeton University, for some years now under the guidance of Christiane Fellbaum. WordNet is more than a traditional dictionary. It systematically lists relations of each entry with other entries, such as synonymy, hyponymy, meronymy and antonymy. It organises the senses it assigns its entries as “synsets” (sets of synonyms), where each synset is defined as a list of all entries sharing this particular meaning. All synsets or senses come with glosses and often also with an example. For several years now, WordNet has been listing collocations as well. But I did not find an entry for friendly fire. There can be several reasons. Either the phrase was too new, or it was not frequent enough, or it was thought not to be a unit of meaning. The third of these reasons turned out to be the case. For when I looked up friendly, I found friendly fire referred to in synset 4:
The adjective “friendly” has 4 senses in WordNet.
1. friendly (vs. unfriendly) – characteristic of or befitting a friend; “friendly advice”; “a friendly neighbourhood”; “the only friendly person here”; “a friendly host and hostess”
2. friendly – favorably disposed; not antagonistic or hostile; “a government friendly to our interests”; “an amicable agreement”
3. friendly (vs. unfriendly) – (in combination) easy to understand or use; “user-friendly computers”; “a consumer-friendly policy”; “a reader-friendly novel”
4. friendly (vs. hostile) – of or belonging to your own country's forces or those of an ally; “in friendly territory”; “he was accidentally killed by friendly fire”
This entry shows that it was a deliberate decision not to enter friendly fire as a collocation. For the compilers of WordNet, it is a combination of two units of meaning. Are they right? Is there a separate sense of friendly accounting for cases such as friendly fire and friendly territory? Are there other phrases where we find this sense of friendly, such as friendly houses, friendly planes, friendly newspapers? Friendly houses seem to belong to synset 1 (cf. friendly neighbourhood), while friendly newspapers seem to belong to synset 2 (‘favourably disposed’). So perhaps there are really only two instances for the fourth synset. The antonym of friendly territory (Google: 5,130 hits) is sometimes hostile territory (Google: 27,800 hits), but more often enemy territory (Google:
Wolfgang Teubert
239,000 hits). The antonym of friendly fire (Google: 150,000 hits) is sometimes hostile fire (Google: 30,300 hits), but again more often enemy fire (Google: 83,300 hits). Both antonyms should be mentioned in the entry. The question is whether it makes sense to construe a sense that is limited to two instances. Let us now have a look at fire in WordNet:

The noun “fire” has 8 senses in WordNet.
1. fire – the event of something burning (often destructive); “they lost everything in the fire”
2. fire, flame, flaming – the process of combustion of inflammable materials producing heat and light and (often) smoke; “fire was one of our ancestors’ first discoveries”
3. fire, firing – the act of firing weapons or artillery at an enemy; “hold your fire until you can see the whites of their eyes”; “they retreated in the face of withering enemy fire”
4. fire – a fireplace in which a fire is burning; “they sat by the fire and talked”
5. fire, attack, flak, flack, blast – intense adverse criticism; “Clinton directed his fire at the Republican Party”; “the government has come under attack”; “don't give me any flak”
6. ardor, ardour, fervor, fervour, fervency, fire, fervidness – feelings of great warmth and intensity; “he spoke with great ardour”
7. fire – (archaic) once thought to be one of four elements composing the universe (Empedocles)
8. fire – a severe trial; “he went through fire and damnation”
The sense I am interested in is, of course, sense 3. Here, we find the phrase enemy fire in an example. Adding up the glosses for sense 4 of friendly and sense 3 of fire, we obtain, mutatis mutandis, “the act of firing weapons … at our own or our allies’ forces.” This is an appropriate definition.

Is WordNet right to deny friendly fire the status of a unit of meaning? While other dictionaries have nothing equivalent to WordNet sense 4 of friendly, some of them list friendly fire as a separate entry, recognising the phrase as a unit of meaning, e.g., the NODE: [Military] “weapon fire coming from one’s own side that causes accidental injury or death to one’s own people.” Both options seem legitimate. The disadvantage of the first alternative is that it introduces a polysemy which doesn’t exist if we accept the unit of meaning solution. In the context of fire, friendly can only mean sense 4, and in the context of friendly, fire can only mean sense 3. But multiplying the four senses of friendly by the eight senses of fire, we end up with 32 combinations out of which we have to select the only possible one. So, if we accept Ockham’s razor (“Entities are not to be multiplied without necessity.”) as the underlying principle for constructing a semantic model, the interpretation of friendly fire as a unit of meaning is obviously preferable. From a methodological point of view, it makes sense to put friendly fire down as a unit of meaning because it simplifies the linguist’s task of accounting for
what a text, a sentence, and a phrase mean. It is more convenient to treat the phrase as a collocation than to describe it as the contingent co-occurrence of two single words. This aspect is particularly important for the computational processing of natural language (e.g., for machine translation). Computers don’t ask whether the meaning of friendly fire (or of false dawn) is something that cannot be inferred from the meaning of the parts they are made up of. We do not use computers to understand what people talk about. We want them to facilitate the translation of sentences in which we encounter these and comparable phrases. Usage is something computers can cope with. If friendly fire is used in a unique way and not in any of the other 31 ways suggested by WordNet, then it is simpler to deal with it as a unit in its own right, as a lexical item that just happens to be composed of two words.

But usage does not tell us how we understand the phrase. When we want to communicate to other members of the discourse community how we understand friendly fire, we have to paraphrase it. Whether a given paraphrase (i.e. the interpretation of a phrase) is acceptable to the discourse community has to be left to the members of that community. The question is, therefore, whether friendly fire is a unit of meaning also from the perspective of meaning as paraphrase. The answer to this question is simple. It is a unit of meaning if we find paraphrases telling us how others understand it, and thus, how we would do better to understand it as well. In the NODE, we already found one paraphrase. That this is more than the concoction of an assiduous lexicographer is shown by a glance at the Bank of English. It lists several hundred occurrences of friendly fire.
Among them there are about a dozen citations that comment on the phrase, try to explain it, circumscribe it or downright paraphrase it:

four Americans killed in battle during the Gulf War died as a result of friendly fire # in other words, they were killed by their own side. The Defence
and artillery salvos. Whether called fratricide, amicicide, blue on blue, friendly fire, or--as in official U.S. casualty reports from Vietnam #
up with their own bombs. In Vietnam, the Americans coined the phrase “friendly fire,” a monstrous use of the language, as if any such fire could be
friendly fire--a term that means mistakenly shooting at your own side. There's also

We learn that friendly fire is a “phrase”; a “term”; that there are synonyms; that it constitutes a “monstrous use of language”; that the Americans introduced it into the discourse in their Vietnam War; and that it means one’s men are “killed by their own side.” Paraphrases of these kinds abound when a new unit of meaning, be it a single word or a collocation, enters the discourse. Then people must be told. As we have seen, I found the first evidence of friendly fire in the 1976 novel with the
same title. Unfortunately there are no corpora that could verify my assumption that during that and the subsequent year, there was an abundance of paraphrases. Here again, a bilingual perspective might prove useful. What happens when translators are confronted with a lexical item for which they cannot find a translation equivalent because it hasn’t been translated before? Corpus linguistics tells us that translation equivalence is not something that latently always exists and just has to be discovered. Translation equivalence has to be construed. As with meaning, this construal is a communal activity, only that it doesn’t involve a discourse community of a given language such as English, but the community of bilingual speakers of the two languages involved. One translator will come up with his or her proposal, which is then negotiated with the other members of that community, until agreement is reached and every translator starts using the same equivalent or until several equivalents are considered as acceptable and translators choose among them. It seems as if in the case of friendly fire translators had to start from scratch. Apparently there was never a fixed expression in German as an equivalent of fratricide, blue on blue or friendly fire. What does the bilingual perspective add to the issue? As mentioned above, friendly fire is a relatively new expression, first used in 1976, and became more frequent only in the course of the first Gulf War, when more British soldiers were killed by friendly (mostly American) fire than by enemy fire. It was only then that the phrase began to be translated into other languages, German among them. How was it translated? The second edition of the Oxford-Duden, published in 1999, acknowledges friendly fire as a single lexical item and gives it a separate entry. The translation equivalent it proposes is eigenes Feuer (“one’s own fire”). 
The Collins German Dictionary (1999) is more accurate: Beschuss durch eigenes Feuer (“bombardment by one’s own fire”). Other translation equivalents we find in Google and in various corpora are freundliches Feuer, befreundetes Feuer and the English collocation friendly fire, as a borrowing into German. Most of the texts we find there are texts originally written in German, not translations from the English. Still we have to assume that the concept “friendly fire” did not exist before it was introduced into the German discourse via translations. For none of the German equivalents mentioned above occurs in the older texts of our corpora. Thus all four German options have to be seen as the results of translations.

It is noteworthy that there is, in Google, only one occurrence of “durch befreundetes Feuer” (“by/from fire of our friends”), because befreundet is the standard translation for the fourth meaning of friendly in WordNet, where we find friendly fire together with friendly territory. Indeed, friendly territory is befreundetes Territorium in German. This is a first indication that translators understand friendly fire as a collocation and not as a contingent combination of two single words. We can be sure that befreundetes Feuer won’t ever become the default equivalent of friendly fire.

For the phrase “durch freundliches Feuer” we find 48 occurrences in Google. This is a second indication that translators see friendly fire as a true collocation. For freundliches Feuer (freundlich being the
default translation of friendly) would normally (without English interference) never mean “soldiers killed by their own side” but something quite different, as in this singular Google citation:

Ihre nachtschwarzen Augen leuchteten jedoch in freundlichem Feuer, als sie in die Runde ihrer Amazonenkriegerinnen sah. (“Yet her night-black eyes glowed with a friendly fire as she looked round the circle of her Amazon warriors.”)
www.silverbow.de/kilageschichte.htm

As a single lexical item, as a unit of meaning, however, freundliches Feuer can mean anything the discourse community accepts. Before this may happen, people have to do a lot of explaining. This becomes evident from the two examples taken from Google:

Es gab 120 Verletzte durch freundliches Feuer - also Treffer durch die eigenen Leute. (“There were 120 wounded from “friendly fire” – i.e. hits by one’s own people.”)
www.stud.uni-goettingen.de/~s136138/pages/read/depleted.html

Natürlich haben die amerikanischen Militärs auch einige elektronische Mittel erfunden, um den "Fratrizid", wie der Tod durch "freundliches Feuer" im offiziellen Jargon auch genannt wird, möglichst auszuschließen. (“Of course, the American military have invented some electronic gadgets to rule out “fratricide”, as death by “friendly fire” is often called in official jargon.”)
www.ish.com_1048075934919.html

In the first example the audience is told explicitly, in the form of a paraphrase, what friendly fire means. In both instances we find freundliches Feuer in quotation marks, making the audience aware that it is a new expression, and that this expression has to be understood as a unit of meaning. The next few years will show whether freundliches Feuer will become the default translation of friendly fire.

More frequent is eigenes Feuer, with 107 hits in Google for the phrase “durch eigenes Feuer” (“by/from one’s own fire”).
I present two examples which show that this phrase is the result of English interference:

Das Verteidigungsministerium in London hat Berichte bestätigt, nach denen durch "eigenes Feuer" in der Nähe von Basra ein britischer Soldat getötet und fünf weitere verletzt worden sind. (“The Ministry of Defence in London has confirmed reports that near Basra, one British soldier was killed and five more were wounded from “friendly fire.””)
www.tagesschau.de/aktuell/meldungen/0,1185,OID1725410_TYP1_THE1687956_NAVSPM3~1664644_REF,00.html
Man kann es sich leicht vorstellen, dass es für die Moral eines militärischen Verbandes die schlimmste Erfahrung ist, wenn ein Kamerad durch eigenes Feuer, durch friendly fire, ums Leben kommt. (“It is easy to imagine that it is the worst experience for the morale of a military unit when a comrade dies from one’s own fire, from friendly fire.”)
www.dradio.de/cgi-bin/es/neu-kommentar/609.html

It seems strange indeed that the expression eigenes Feuer, which is very easy to understand, is put in quotation marks, but it shows that the speaker uses it as a translation of friendly fire. This becomes even more evident in the second example, where the perfectly transparent eigenes Feuer is paraphrased by the much less familiar friendly fire. There seems to be a certain uneasiness about representing a concept that English expresses by a single unit of meaning by a decomposable adjective+noun phrase (i.e. by two separate words). Therefore it is still doubtful whether eigenes Feuer will become the German default equivalent. Even though it seems to be more common, its other disadvantage is that it sounds less like friendly fire than the option freundliches Feuer.

However, the most frequent equivalent we find is the borrowing friendly fire. There are, in Google, 459 hits for “durch friendly fire”. Again we notice that in most citations, the collocation is put into quotation marks, indicating the novelty and strangeness of the expression. Here are two examples from the Österreichisches Zeitungskorpus (OZK; “Austrian Newspaper Corpus”), a 500-million-word corpus covering the nineties:

Und fast schon ans Zynische grenzt jene Bezeichnung, welche die Militärsprache für den irrtümlichen Beschuß der eigenen Leute kennt. Man nennt das friendly fire - freundliches Feuer. (“And the name that military jargon uses for the erroneous shelling of one’s own people borders almost on cynicism.
They call it friendly fire – freundliches Feuer.”)

An dieser Frontlinie beobachten wir auch immer wieder das, was die Militaristen "friendly fire" nennen, nämlich Verluste in den eigenen Reihen durch fehlgeleitete Geschosse aus den eigenen, nachfolgenden Linien. Was die Haider-Diskussion anlangt, hat sich dieses Phänomen sogar zu einer Art intellektueller Selbstschußanlage verfestigt. (“At this frontline, we constantly find again what the military call “friendly fire”, i.e. losses in one’s own lines from misguided projectiles from one’s own back lines. As to the discussion about Mr Haider, this phenomenon has become solidified as some kind of intellectual automatic firing device.”)

Paraphrases reveal whether a phrase has become a fixed expression, a collocation, a unit of meaning. The paraphrases in these two examples do not tell us what
friendly means; they explain what friendly fire is. While we have learned above to establish, whenever expedient, collocations or fixed expressions on the basis of usage, paraphrases will tell us whether indeed they are understood as units of meaning.

There is one more indicator for a true collocation: its availability for metaphorisation processes. The second example demonstrates that friendly fire in German can now be used to refer to internecine warfare. As a metaphor, friendly fire loses the feature of ‘accidental fire’; instead it refers to consciously hostile actions within a social group. Here is another example, taken from Google:

Nicht alle Liberalen sind eingeschwenkt. Aber das friendly fire schmerzt besonders. Merkels Kandidatur ist streitbesetzt. (“Not all liberals [within the Christian Democratic Party] could be won over. But the friendly fire smarts particularly. [Party chair] Merkel’s candidature is controversial.”)
www.zeit.de/2001/51/Politik/print_200151_k-frage.html

The same metaphorical usage is also found in English texts. Here is an example taken from Google:

Defence Secretary Geoff Hoon faced questions about the deployment, why it happened so quickly, what his exit strategy was and how long it would last - all of which he had answered in previous exchanges. But his opposite number, Bernard Jenkin, offered his overall support for the operation. There was not even much friendly fire from Mr Hoon's own benches.
www.news.bbc.co.uk/hi/english/uk_politics/newsid_1884000/1884226.stm

In this section, I have explored friendly fire in a monolingual and a bilingual context with the aim of finding criteria that set apart statistically significant but contingent co-occurrences of two or more words from semantically relevant collocations, also called fixed expressions. There are two approaches.
If we look at meaning from the perspective of usage, we find that there are good reasons of simplicity to assign collocation status to those expressions which, taken as a whole, are monosemous. The phrase friendly fire belongs here; a collocation analysis will reveal that it (almost) always occurs in comparable contexts. This perspective is decisive for the computational processing of natural language; as we will see, it facilitates computer-aided translation. From the perspective of language understanding, the prime criterion for assigning collocation status to lexical co-occurrence patterns is paraphrase. If we find that a phrase is repeatedly paraphrased as a unit of meaning, we have reason to assume that it is a single lexical item. A supporting criterion is that the phrase, as a whole, can be used in a metaphorical way. This is, as we have seen, the case both for false dawn and for friendly fire. A third criterion is specific to a bilingual perspective. It seems that the translation equivalent of a true collocation is not
what would be the most appropriate translation if each of the elements were translated separately. For then we would expect, as the equivalent of friendly fire, the German phrase befreundetes Feuer, for which we found only one occurrence. Rather, collocations are translated as a whole, and it doesn’t seem to matter whether the equivalent makes any sense if interpreted literally as a combination of the elements involved. The phrase freundliches Feuer is, if taken literally, seriously misleading. As a new unit of meaning this doesn’t matter; it will mean whatever is acceptable to the discourse community. Finally, the high frequency of the English phrase friendly fire in German texts suggests that there is no acceptable German equivalent and that therefore the English phrase has to be imported.

True collocations, therefore, can be shown to be not only statistically significant but also semantically relevant. Semantic relevance can be demonstrated both for the methodological approach and for the theoretical approach to the definition of units of meaning. The analysis presented here has demonstrated that the concept of the unit of meaning as the criterion for fixed expressions is not arbitrary. Corpus linguistics can make an enormous impact on lexicography. It can change our understanding of the vocabulary of a natural language. We can do away with the infelicitous situation that most of the (more common) lexical items in the dictionaries are polysemous. The ambiguity we had to deal with in traditional linguistics will disappear once we replace the medieval concept of the single word by the new concept of a collocation or a unit of meaning. Instead of four senses for friendly plus eight senses for fire we end up with one single meaning for the fixed expression friendly fire.

4 Collocations, Translation and Parallel Corpora
In this section, I will address the methodological aspect of working with collocations. My aim is to demonstrate the impact the appreciation of the collocation phenomenon can make for translation. As an empirical basis, I will produce evidence from several parallel corpora. Parallel corpora, also called translation corpora, are corpora that contain original texts in one language together with their translation into one or more other languages. To work with these corpora, we have to align each text and its translation first on the sentence level and then on the level of the lexical item, be it a single word, an idiom or a true collocation; in short, on the level of the unit of meaning.

As everyone knows who has ever translated a text into his own or a foreign language, we do not translate word by word. However, our traditional translation aid is the bilingual dictionary. Most entries, by far, are single words, and for most of the words we find many alternatives for how to translate them. In most cases, the dictionary cannot tell us which of the alternatives we have to choose in a particular case. This is why bilingual dictionaries are not very helpful when the target language is not our native language. We do not translate single words in isolation but units that are large enough to be monosemous, so that for
them there is only one translation equivalent in the target language, or, if there are more, then these equivalents will be synonymous. I call these units translation units. Are they the same as units of meaning? Not quite. Natural languages cannot be simply mapped onto each other. The ongoing negotiations among the members of a discourse community lead to results which cannot be predicted. Languages go different ways. They construe different realities.

According to most monolingual English dictionaries, the word bone seems to be a unit of meaning, described in the NODE as “any of the pieces of hard, whitish tissue making up the skeleton in humans and other vertebrates.” This accurately describes the way bone is used in English. From a German perspective, however, bone has, traditionally speaking, three different meanings; there are three non-synonymous translation equivalents for it. In the context of fish (or any of its hyponyms), Germans call it Gräte. In the context of non-fishy animals (dead and alive) and of live humans, they call it Knochen. In the context of our deceased, the Germans use the word Gebeine. For translating into German, the relevant unit of meaning therefore is bone plus all the context words that help to make the proper choice between the three German equivalents. What we come up with in our source text is (probably) not a fixed expression, a collocation of the type friendly fire, but rather a set of words (collocates) we find in the close vicinity of bone. Thus in Google we find:

The poor were initially buried in areas in the churchyard or near the church. From time to time, the bones (Gebeine) were dug up and then laid out in a tasteful and decorative manner in the charnel house.
www.death.monstrous.com/graveyards.htm

Then place trout on a plate and run a knife along each side of ... Sever head, fins and remove skin with a fork. All you have left is great eating with no bones (Gräten).
www.mccurtain.com/kiamichi/troutbonanza.htm

We expect a person to say she feels terrible after breaking a bone (Knochen).
www.myenglishteacher.net/unexpectedresults.html

The words in italics are the ones that tell us how bone(s) has to be translated in each of the instances. A suitable parallel corpus would give us a sufficient number of occurrences for each of the three translation equivalents. Once we have found all the instances of Gräte(n) we can then search for bone(s) in the aligned English sentence and set up the collocation profile of bone when translated as Gräte. Such a collocation profile is a list of all words found in the immediate context of the keyword (bone in our case), listed according to their statistical significance as collocates of the keyword. The collocation profile of bone as the equivalent of Gräte will contain words like trout, salmon, eat, fin, remove, etc. A dictionary of translation units would give for each keyword which
is ambiguous in terms of the target language, the collocation profile going with each of the equivalents. The users then have to check which of the words contained in the collocation profiles occur in the context of the word they are about to translate, and the choice can be made almost mechanically. These combinations of a keyword together with its (statistically significant) collocates are also called collocations. Thus we find two kinds of collocations: those which can be described as fixed expressions and to which a grammatical pattern can be assigned (friendly fire: adjective+noun) and those where we can only say the collocates are found in the immediate context of the keyword (trout in the context of bone). Both kinds of collocations have in common that they are monosemous, either in a monolingual or in a bilingual perspective, and that they therefore represent units of meaning or translation units.

The parallel corpora I am working with are compiled from selections of the legal documents issued by the European Commission and excerpts from the proceedings of the European Parliament, together with some reports issued by them. They do not talk much about bones. This is why I choose another keyword, French travail/travaux. I include the plural travaux in my analysis, because often the plural is rendered as a singular when translated into English. The default translation is Arbeit in German, while for English there are two main translation equivalents: work and labour. When do I translate travail/travaux as work, when as labour? The parallel corpus allows us to set up the relevant collocation profiles on the basis of an analysis of the context spanning five words to the left and to the right of the keyword:

Table 1: Collocation profiles for travail/travaux

Travail/travaux translated as work    Travail/travaux translated as labour
Programme (410)                       Marché (747)
Commission (255)                      Ministre (170)
Conseil (212)                         Marchés (151)
Cours (123)                           Sociales (125)
Organisation (122)                    Affaires (117)
Préparatoires (113)                   Emploi (88)
Vue (109)                             Forces (65)
Groupe (108)                          Normes (60)
Temps (99)                            Femmes (60)
Sécurité (97)                         Sociale (50)
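The construction of such profiles can be sketched in a few lines of Python. This is only an illustrative sketch, not the procedure actually used for Table 1: the function names, the toy sentence pairs, the short list of grammatical words and the simple tokenisation are all assumptions made for the example, and raw frequency stands in for any measure of statistical significance.

```python
from collections import Counter

# Assumption: a tiny illustrative list of French grammatical words to ignore.
GRAMMATICAL_WORDS = {"le", "la", "les", "de", "du", "des", "et", "en", "à", "un", "une"}


def tokenize(sentence):
    """Lowercase a sentence and split it into words, dropping punctuation."""
    words = (w.strip(".,;:!?()\"'").lower() for w in sentence.split())
    return [w for w in words if w]


def collocation_profiles(aligned_pairs, keywords, equivalents, span=5, top_n=10):
    """Count the context words of the keyword, separately for each
    equivalent found in the aligned target sentence."""
    profiles = {eq: Counter() for eq in equivalents}
    for source, target in aligned_pairs:
        src = tokenize(source)
        tgt = set(tokenize(target))
        found = [eq for eq in equivalents if eq in tgt]
        if len(found) != 1:
            continue  # no equivalent, or more than one: skip the pair
        for i, word in enumerate(src):
            if word not in keywords:
                continue
            # Context of `span` words to the left and to the right.
            window = src[max(0, i - span):i] + src[i + 1:i + 1 + span]
            profiles[found[0]].update(
                w for w in window
                if w not in GRAMMATICAL_WORDS and w not in keywords
            )
    return {eq: counts.most_common(top_n) for eq, counts in profiles.items()}


# Hypothetical aligned sentence pairs (French source, English translation).
pairs = [
    ("le programme de travail de la Commission",
     "the work programme of the Commission"),
    ("les travaux du Conseil", "the work of the Council"),
    ("le marché du travail", "the labour market"),
    ("le ministre du travail et des affaires sociales",
     "the minister of labour and social affairs"),
]

profiles = collocation_profiles(pairs, {"travail", "travaux"}, ["work", "labour"])
for equivalent, collocates in profiles.items():
    print(equivalent, collocates)
```

On a real parallel corpus the sentence alignment, the lemmatisation and the list of grammatical words would each need far more care than this sketch gives them.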
For each of the collocation profiles, I have selected the ten most frequent non-grammatical words found in the context. The frequency of each item is given in brackets. The most amazing finding is that there is no overlap at all between the two profiles. This is striking evidence that travail/travaux occurs in different contexts when it is translated as work from those when it is translated as labour. Do the collocation profiles help with translation? Here are two French sentences.
The relevant collocates that indicate the appropriate translation equivalent are in italics:

WORK: La réforme du fonctionnement du Conseil soit opérée indépendamment des travaux préparatoires en vue de la future conférence intergouvernementale.

LABOUR: Le Comité permanent de l’emploi s’est réuni aujourd’hui sous la présidence de M. Walter Riester, ministre fédéral du travail et des affaires sociales d’Allemagne.

Indeed, the collocation profile approach to translation seems to work. This has little to do with our human understanding of meaning. In the first example, we find vue, part of the fixed expression en vue de, a prepositional expression meaning “with a view to,” that is in no way semantically connected with travaux meaning “work.” That it is part of the profile is contingent on our corpus. Also, there is no sound reason why travaux in the context of Conseil should be translated as work and not as labour. It just happens to be that way. Again, in the second example there is no sound reason why emploi would necessitate the equivalent labour. It just so happens that in 88 cases where we find emploi close to travail/travaux, we find labour and not work in the translation. The real reason is a different one: le ministre du travail is a named entity in the form of a fixed expression whose British equivalent is called Secretary of Labour.

What we learn here is that the methodological approach to collocation analysis, the approach based on usage rather than on paraphrase, is a technical operation whose results do not map well onto human understanding. Investigations on translation equivalence based on parallel corpora are still very much in their infancy. The collocation profiles have to become more refined. The goal is to increase their significance by allocating positions in grammatical patterns to the lexical elements they contain. For the time being our parallel corpora are too small for that.
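The almost mechanical choice between work and labour illustrated by the two sentences can be sketched as follows. The collocates are copied from Table 1 (lowercased, with accents normalised); everything else is an assumption made for illustration: the function name, the regular-expression tokenisation, and the scoring rule, which simply counts how many profile words fall within five words of the keyword. It is a sketch of the idea, not a description of an existing tool.

```python
import re

# Collocates taken from Table 1; frequencies are omitted because this
# simple scoring rule only checks whether a word belongs to a profile.
PROFILES = {
    "work": {"programme", "commission", "conseil", "cours", "organisation",
             "préparatoires", "vue", "groupe", "temps", "sécurité"},
    "labour": {"marché", "ministre", "marchés", "sociales", "affaires",
               "emploi", "forces", "normes", "femmes", "sociale"},
}
KEYWORDS = {"travail", "travaux"}


def choose_equivalent(sentence, span=5):
    """Return the equivalent whose profile shares most words with the
    context of the keyword, or None if no profile word is found."""
    # Runs of letters only, so "l’emploi" yields "l" and "emploi".
    tokens = re.findall(r"[^\W\d_]+", sentence.lower())
    scores = {eq: 0 for eq in PROFILES}
    for i, token in enumerate(tokens):
        if token not in KEYWORDS:
            continue
        window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
        for eq, profile in PROFILES.items():
            scores[eq] += sum(1 for w in window if w in profile)
    best = max(scores, key=scores.get)  # ties go to the first profile listed
    return best if scores[best] > 0 else None


WORK_SENTENCE = (
    "La réforme du fonctionnement du Conseil soit opérée indépendamment "
    "des travaux préparatoires en vue de la future conférence "
    "intergouvernementale."
)
LABOUR_SENTENCE = (
    "Le Comité permanent de l’emploi s’est réuni aujourd’hui sous la "
    "présidence de M. Walter Riester, ministre fédéral du travail et des "
    "affaires sociales d’Allemagne."
)

print(choose_equivalent(WORK_SENTENCE))
print(choose_equivalent(LABOUR_SENTENCE))
```

With a window of five words, the first sentence is decided by Conseil, préparatoires and vue, the second by ministre, affaires and sociales. The sketch knows nothing about what these words mean, which is precisely the point made above about usage-based profiles.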
Once parallel corpora can compare in size with our monolingual corpora, we may well find out that the kind of collocations which aren’t fixed expressions (like travail/travaux and its collocates as they appear in a collocation profile) can be better described as “true collocations” conforming to a specific grammatical pattern. Thus, in the first sentence, we find travaux préparatoires. This phrase can be seen as a monosemous fixed expression, a unit of meaning, conforming to the adjective+noun pattern, and indeed is (almost) always rendered as preparatory work in our parallel corpus.

5 Conclusion
We all talk in phrases, in ready-made chunks of language. While these chunks do consist of words, we have to keep in mind that it is less the individual word than the chunks that account for the meaning. This has been the important message of
direct or communicative language teaching. These chunks are what corpus linguists call collocations. We still do not know much about them. Some of them seem to come in a host of variants, others are largely fixed. Collocations are recurrent co-ocurrences of words in texts. They certainly are statistically signinficant; but this is not enogh. They also have to be semantically relevant. They have to have a meaning of their own, a meaning that isn’t obvious from the meaning of the parts they are composed of. This property is sometimes called semantic cohesion. Not single words but collocations constitute the true vocabulary of a language. Collocations are what students have to learn. Over the next few years, corpus linguistics has to deliver the vocabulary of collocations. Beginnings have been made. There is the exemplary Oxford Collocations Dictionary for students of English. But language teachers have to bear in mind that it depends on the perspective as to what makes a collocation. What has to count as an English collocation from a French perspective does not necessarily have to count as one from a Chinese perspective. This is why the vocabulary of the target language, including the collocations of a language, has to be taught from the source language perspective. Those who teach English on an academic level have to deal with the issue of translation equivalence. Parallel corpora are the repositories of source language units of meaning and their target language equivalents. All students who will enter a career in which they will have to apply language skills, as teachers, as translators, as lexicographers and terminologists, or as experts in artificial intelligence or machine translation, have to be introduced to working with parallel corpora. References Baker, M. (1992), In other words, London: Routledge. Bank of English. http://titania.cobuild.collins.co.uk/boe_info. British National Corpus. http://www.hcu.ox.ac.uk/BNC/. 
Collins German Dictionary (1999), Glasgow: HarperCollins.
Google. http://www.google.com.
Longman Dictionary of English Idioms (1979), London: Longman.
Moon, R. (1998), Fixed expressions and idioms in English, Oxford: Clarendon.
Österreichisches Zeitungskorpus. http://www.ids-mannheim.de.
Oxford Collocations Dictionary for Students of English (2002), Oxford: Oxford University Press.
New Oxford Dictionary of English (2000), Oxford: Oxford University Press.
Oxford English Dictionary (2nd edition) (1998), on CD-ROM, Oxford: Oxford University Press.
Oxford-Duden German Dictionary: German-English/English-German (1999), Oxford: Oxford University Press.
Sinclair, J. M. (1996), The empty lexicon, International Journal of Corpus Linguistics, 1 (1): 99-119.
Sinclair, J. M., J. Payne, and C. Pérez (eds) (1996), Corpus to Corpus: A Study of Translation Equivalence, International Journal of Lexicography, 9 (3).
Wildhagen-Héraucourt: Deutsch-Englisches/English-Deutsches Wörterbuch, Wiesbaden: Brandstetter Verlag.
WordNet. http://www.cogsci.princeton.edu/~wn/.
Making the Web More Useful as a Source for Linguistic Corpora
William H. Fletcher
United States Naval Academy
Abstract
Both as a corpus and as a source of texts for corpora the Web offers significant benefits in its virtually comprehensive coverage of major languages, content domains and written text types, yet its usefulness is limited by the generally unknown origin and reliability of online texts and by the sheer amount of “noise” on the Web. This paper describes and evaluates linguistic methods and computing tools to identify representative documents efficiently. To test these methods, a pilot corpus of 11,201 online documents in English was compiled. “Noise filtering” techniques based on n-grams helped eliminate both virtually identical and highly repetitive documents. Individual review of the remaining unique texts revealed that Web pages under 5 KB or over 200 KB tend to have a lower “signal to noise” ratio and therefore can be excluded a priori to reduce unproductive downloads. This paper also compares a selection of these web texts (4,949 documents totaling 5.25 million tokens) with the written texts from the British National Corpus (BNC) to assess their similarity. Generally, both corpora are quite similar, but important differences are outlined. With judicious selection Web pages provide representative language samples, often prove more useful than off-the-shelf corpora for special information needs, and complement and verify data from traditional corpora.
1 Web as Corpus
The World Wide Web has much promise as a source of machine-readable texts for corpora. Over ten billion publicly-accessible online documents provide comprehensive coverage of the major languages and language varieties, and span virtually all content domains and written text types. Throughout the developed world the Web is readily accessible at low cost and has become a familiar information source for hundreds of millions of users. As a self-renewing linguistic resource it offers a freshness and topicality unmatched by fixed corpora; emerging usage and current issues are generally well represented in online texts. When analyzing relatively rare features of a language, the Web is a nearly inexhaustible resource. With appropriate tools it is simple to compile an ad-hoc corpus from online documents to answer a specific language question or meet a specialized information need. The following example illustrates convincingly that bigger can be better when it comes to corpora. In January 2003 a discussion thread on the CLLT Listserv (2003) focused on the phrase “not as
ADJECTIVE as you think.” In the Michigan Corpus of Academic Spoken English (MICASE, 1.7 million words) only two occurrences were found, and even the 100-million-word British National Corpus World Edition (BNC) yielded only 77 examples. In contrast, the AltaVista search engine reports over 66,328 Web pages with “as * as you think” and 41,189 with “as * as you * think”, where the first wildcard * almost always matches an adjective or an adverb and the second one typically matches (woul)d, may or might. In about 40 minutes, the Web concordancing search agent application KWiCFinder¹ downloaded and analyzed 500 Web pages, ample material for a thorough analysis.
Unfortunately, one must be cautious when using online texts as linguistic data. Web pages are typically anonymous and Web server location is no certain guide to origin, so it is difficult to establish authorship and provenance and to assess the reliability, representativeness and authoritativeness of texts, both for their linguistic form and their content. Multilingual sites are common, as are English pages authored by non-native speakers of varying competence, raising questions about language quality and influence of the source language. Among the longer prose texts certain types predominate, especially legal, journalistic, commercial and academic prose, a much narrower cross-section of language usage than one might require. Overall, lower standards of form and content verification prevail than in printed sources. Web pages often contain a significant amount of “noise” (i.e. language which is fragmentary, repetitive, formulaic, or ill-formed, and often entire documents which have no cohesive text).
A significant limitation on the Web is that systematic access to linguistic data online can only be gained through full-text searches on commercial search engines.
Designed for the general public, most search engines do not support targeted search criteria such as sophisticated pattern matching which would make them most useful to linguists. Among the search engines AltaVista offers the most powerful combination of features, but its database has often languished months without updating, and its unstable financial position raises doubts about its future. Even more unfortunately for researchers, AltaVista’s reports of the number of documents matching a given query are inconsistent and can vary up to an order of magnitude during peak usage times; consequently they provide only a general numeric indication of the prevalence of a form, not statistically reliable proof. Perhaps the greatest weakness in contrast to most corpora is that the Web has no grammatical markup, so one can only match for strings, not specific structures. Elsewhere I discuss in greater detail the benefits and challenges of exploiting the Web as a corpus for both pure and applied linguistic research and propose a solution to the limitations imposed by commercial search engines (Fletcher 2001, 2002). This paper concentrates on efforts to make the World Wide Web more useful as a source for corpus compilation by developing and evaluating linguistic methods and PC tools to identify linguistically representative documents more efficiently. My long-range goal is to establish the Web both as a “corpus of first resort” and as a supplement to traditionally compiled corpora.
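Once pages have been saved locally, the wildcard queries quoted earlier in this section can be approximated with a regular expression. The following is only a minimal sketch in Python: the names `WORD`, `PATTERN` and `find_matches` are hypothetical, and the definition of a "word" is a simplifying assumption, not the behavior of AltaVista or KWiCFinder.

```python
import re

# Assumption: one "word" is a run of letters, apostrophes or hyphens.
# This stands in for the search engine's "*" wildcard.
WORD = r"[A-Za-z'-]+"

# Covers both "as * as you think" and "as * as you * think".
PATTERN = re.compile(
    rf"\bas\s+({WORD})\s+as\s+you(?:\s+({WORD}))?\s+think\b",
    re.IGNORECASE,
)

def find_matches(text):
    """Return (filler, optional_second_word) pairs for each match in text."""
    return [(m.group(1), m.group(2)) for m in PATTERN.finditer(text)]

sample = ("It is not as hard as you think. "
          "The task may be as easy as you might think.")
print(find_matches(sample))
```

Unlike a search engine query, such a pattern runs only over text already in hand, so it complements rather than replaces the full-text search.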
2 Compiling a Web Corpus
2.1 Objectives and Preliminary Considerations
In seven years of developing and using KWiCFinder I have viewed excerpts from over a quarter million online documents and have examined thousands as complete Web pages. My cumulative impression has convinced me that the Web can yield linguistic data which are both useful and reliable. To confirm this conviction I compiled a pilot corpus with KWiCFinder of Web documents in English for analysis offline in October 2001. These sample documents totaling 5.5 million tokens allowed me to gauge how suitable and representative such texts could be for research or learning and to evaluate techniques to identify Web pages with a high proportion of connected text. My goal was to analyze language samples from the Web, not to investigate the language of the Web in general.
A major objective of this study was to develop procedures and software tools² to automate or expedite identification of the most useful texts. Some steps toward optimizing a search can be taken at the outset when formulating the query by choosing selection criteria which either exclude a range of texts or favor inclusion of more relevant results. For example, by excluding documents with “copyright” or “all rights reserved” one can filter out many commercial and journalistic texts without excluding most academic, government and personal material. Another simple indicator of potential usefulness is document size: a query to the server can reveal how large a Web page is before the search agent “decides” to download it. With guidelines for rejecting a page before fetching it³ because it is relatively unlikely to contain useful text, search agent software can save both bandwidth and processing time.
Web documents typically contain significant chunks of “noise”: headers and footers that identify the document, declare who owns it and explicitly reserve rights to it; links both within the document and to other documents, media and sites (especially advertisers); and other formulaic elements.
I will refer to these as “boilerplate”. Unfortunately HTML provides no standard way to distinguish such boilerplate elements from the unique textual content of each page. Without insight into the structure of a Web page, a search agent has no criteria for extracting content while excluding formal elements.⁴ Obviously, the shorter a Web page is, the lower its “signal to noise” ratio as well, and the less likely it will be to contain more than a few sentences of connected text; practical guidelines for a lower cutoff point are needed. At the other end of the spectrum, the very largest Web pages tend to consist of lists and fragments: server logs and statistics, indexes, glossaries, discussion group messages and headers, and “linketeria” pages. Such Web pages can be enormous, up to several megabytes, while documents of that length consisting primarily of connected prose are exceedingly rare. Since downloading long documents consumes significant bandwidth, guidelines for an upper size limit would be useful as well.
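The idea of rejecting a page by size before fetching it can be sketched as follows. This is an illustration in Python, not KWiCFinder's implementation: the function names are hypothetical, and the default limits are placeholder values (section 2.3.6 discusses empirically motivated ones). The size is read from the `Content-Length` header of an HTTP HEAD response, which a server is not obliged to send.

```python
from urllib.request import Request, urlopen

# Placeholder cutoffs in bytes; assumptions for illustration only.
MIN_BYTES = 10 * 1024
MAX_BYTES = 150 * 1024

def worth_downloading(size, min_bytes=MIN_BYTES, max_bytes=MAX_BYTES):
    """Decide from the reported size alone; unknown sizes are let through."""
    if size is None:
        return True
    return min_bytes <= size <= max_bytes

def reported_size(url, timeout=10):
    """Request headers only (HTTP HEAD) and return Content-Length, if any."""
    request = Request(url, method="HEAD")
    with urlopen(request, timeout=timeout) as response:
        value = response.headers.get("Content-Length")
    return int(value) if value is not None and value.isdigit() else None
```

A search agent would call `reported_size(url)` first and fetch the page only if `worth_downloading` returns `True`. Keeping the decision separate from the network call makes the limits easy to adjust.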
2.2 Collecting Web Pages as Corpus Data
Before compiling a sizable Web corpus I examined a sample of 100 Web pages retrieved and saved to local text files by KWiCFinder for the query “the OR of OR a”. As formulated this matches any document in English containing any of these three very high-frequency words almost certain to occur in an extended text. This search yielded primarily documents from commercial sites: all rights reserved was the most frequent 3-gram, occurring 43 times in 100 texts, and copyright #### fell among the top ten 2-grams. In a second follow-up sampling I ran a series of queries for the ten highest-frequency words in the BNC. Among the 5,859 documents these searches yielded were 2,277 or 39% duplicates.⁵ Early in 2001 AltaVista had instituted preferential treatment for paying advertisers, placing “sponsored links” prominently at the beginning of search results and updating its database only for links to its subscribers.⁶ For exclusion from future searches I determined both which hosts (Web sites) were “overrepresented” in the results (presumably appearing higher within the search results due to sponsorship) and which had yielded the “noisiest” documents.
Finally I conducted a third round of searches. My search terms were the twenty-one highest-frequency words in the BNC, supplemented by the underlined forms: the, of, to, and, a | an, in, is | are | be | was | were | been, that, for, it, on, with, as, he, she, by, I, at, not. The requirement for each search was that it include at least one article and one form of the copula BE, on the assumption that any sizable chunk of prose will contain these words often lacking in fragmentary texts. To reduce the commercial bias of the sample, these searches were limited to documents last indexed by AltaVista before 1 January 2001; any clients who paid for preferential placement in search results would have been updated since then. In addition, the overbearing and noisy hosts identified in the second sampling were explicitly excluded.
This third iteration yielded 11,201 documents and serves as the basis of the analysis below.
2.3 Reducing the “Noise” in the Data
Before analysis of the downloaded documents, four principal “noise-reduction” tasks were completed with a suite of Windows programs I developed.⁷ These procedures help filter out repetitive and fragmentary documents so they do not bloat the corpus and skew the linguistic data.
2.3.1 Filtering out Duplicate Identical Documents
First, duplicate identical documents (IDs) had to be identified and removed. It is common for a given document to have more than one URL⁸ or to be “mirrored” on multiple sites (e.g., Rivest 1992 appears verbatim on over 22,000 sites), so duplicates cannot be avoided simply by comparing URLs. The documents had
been saved by KWiCFinder in text format (i.e. all HTML tags had been stripped and HTML entities had been converted to characters). The challenge was to compare over 11,000 files totaling almost 70 MB (after removing HTML markup). The solution is relatively simple as it reuses portions of programs I had developed for other purposes. For an n-gram extractor I had already developed routines to normalize a text and to build a binary tree of representations of each n-gram for efficient comparison. To reduce memory requirements to a bare minimum my approach took advantage of the Message Digest 5 Secure Hash Algorithm (MD5 SHA), a 16-byte representation or ‘fingerprint’… of the input. It is conjectured that it is computationally infeasible to produce two messages having the same message digest… (Rivest 1992)
In other words, with the MD5 SHA a text of any length can be captured in a code string only 16 characters long which has an extremely high probability of uniqueness; in practice, only two identical texts will produce the same code. Each text in the binary tree requires only 24 bytes (16 bytes for the MD code and 8 bytes for pointers to the next nodes in the tree), so both storage requirements and the number of characters involved in each comparison are minimal.⁹
My program FindDuplicates reads and normalizes each downloaded document, then reduces it to an MD5 hash and compares this code to hashes of previously analyzed documents following a binary tree algorithm. If it is unique, the hash code is added to the tree and the document is retained; otherwise the document is moved to a directory of discarded files. Encoding and comparison of all 11,201 files took only 33 seconds, leaving 7,294 unique documents.¹⁰ Since this search was limited to documents last indexed almost a year earlier, the most common duplicate texts were variants of the infamous “404 – File Not Found” error message, and the second most frequent were warnings that the site requires frames.
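The duplicate filter just described can be sketched in a few lines of Python. This is not the FindDuplicates program itself: a built-in `set` stands in for the binary tree of digests, the normalization is a simplified assumption (lower-casing and collapsing whitespace), and the function names are hypothetical.

```python
import hashlib
import re

def normalize(text):
    """Simplified normalization: lower-case and collapse runs of whitespace."""
    return re.sub(r"\s+", " ", text.lower()).strip()

def fingerprint(text):
    """Return the 16-byte MD5 digest of the normalized text."""
    return hashlib.md5(normalize(text).encode("utf-8")).digest()

def split_duplicates(documents):
    """documents: iterable of (name, text) pairs.

    Returns (kept, discarded) lists of names; the first document with a
    given digest is kept, later ones are treated as duplicates.
    """
    seen = set()  # stands in for the binary tree of digests
    kept, discarded = [], []
    for name, text in documents:
        digest = fingerprint(text)
        if digest in seen:
            discarded.append(name)
        else:
            seen.add(digest)
            kept.append(name)
    return kept, discarded

docs = [
    ("a.txt", "404 - File Not Found"),
    ("b.txt", "404 -  file not found\n"),
    ("c.txt", "Some other page"),
]
print(split_duplicates(docs))
```

Only the digests are held in memory, so the cost per document stays constant however long the texts are.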
2.3.2 Identifying Virtually Identical Documents
Remaining among the unique files were a number of “virtually identical documents” (VIDs). These include multiple instances of the same text with only slight differences, such as news stories from the wire services appearing on several sites, mirrored Web pages with different footers, various pages from the same site in which boilerplate material predominates over unique content, and instances of the same URL with dynamically updated content (time of day, temperature etc.). FindDuplicates cannot help here, since even the slightest difference between normalized texts yields highly dissimilar MD codes. While I could not automate recognition of VIDs, I used n-grams to identify potential VIDs for visual comparison. Here n-gram is used in the sense of “sequence of n words”, and word is defined orthographically as “a string of alphanumeric characters preceded and
followed by whitespace, punctuation or nothing”.¹¹ Normalization converts alphabetic characters to lower case, strips punctuation except word-internal period, hyphen or apostrophe and the symbol ©, and maps numeric characters onto # so that copyright 1997 and copyright 2001 are both tallied as instances of copyright ####. The method I explored for recognizing VIDs rests on two assumptions: after normalization, two or more VIDs will be of approximately the same size, and the identical content will be far more extensive than the surrounding boilerplate material. My program ViewVIDs cycles through all the document files in descending order of file size. For each file it looks for any smaller files whose size differs by no more than 5% or 1000 bytes, whichever is greater. If any are found, lists of the most frequent 3-grams occurring two or more times in each file are made. If 20 of the top 25 3-grams in two documents of comparable size agree in form and frequency, a tentative match is made, and both texts are presented to the user for comparison in side-by-side text windows. If no tentative matches are made, the program continues on down the list. With this approach, 26 VIDs were identified and dropped.¹² Several consisted of extensive boilerplate material with minimal unique content. In the most extreme example, three VIDs shared a footer almost 6000 bytes long!¹³
2.3.3 Finding Highly Repetitive Documents
While IDs and VIDs incorporate significant amounts of text from other Web pages, “highly repetitive documents” (HRDs) repeat substantial chunks of content within the same document. To locate HRDs I tabulated the frequency of longer n-grams in the entire Web corpus for values of n equal to 20, 12 and 8 and kept lists of those found 5 or more times.
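The normalization, the 3-gram comparison used to flag possible VIDs, and the corpus-wide tally of longer n-grams can be sketched together in Python. This is a sketch under stated assumptions, not the ViewVIDs or nGram programs: all names are hypothetical, text length in characters stands in for file size, the 5% tolerance is taken relative to the larger file, and ties among equally frequent 3-grams are broken by order of first appearance.

```python
import re
from collections import Counter

# A token is "©", or alphanumerics with optional word-internal . ' or -
TOKEN = re.compile(r"©|[^\W_]+(?:[.'-][^\W_]+)*")

def tokenize(text):
    """Lower-case, drop other punctuation, and map every digit to #."""
    return [re.sub(r"\d", "#", tok) for tok in TOKEN.findall(text.lower())]

def ngrams(tokens, n):
    """All sequences of n consecutive tokens, joined by single spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def top_repeated(tokens, n=3, limit=25):
    """(n-gram, count) pairs for the `limit` most frequent n-grams seen 2+ times."""
    counts = Counter(ngrams(tokens, n))
    return {(g, c) for g, c in counts.most_common(limit) if c >= 2}

def comparable_size(size_a, size_b, rel=0.05, slack=1000):
    """Sizes differ by at most 5% of the larger or 1000, whichever is greater."""
    return abs(size_a - size_b) <= max(rel * max(size_a, size_b), slack)

def possible_vids(text_a, text_b, required=20):
    """Tentative match: comparable size and `required` shared (3-gram, count) pairs."""
    if not comparable_size(len(text_a), len(text_b)):
        return False
    shared = top_repeated(tokenize(text_a)) & top_repeated(tokenize(text_b))
    return len(shared) >= required

def frequent_long_ngrams(texts, n=8, minimum=5):
    """Corpus-wide tally of n-grams found `minimum` or more times (HRD screening)."""
    counts = Counter()
    for text in texts:
        counts.update(ngrams(tokenize(text), n))
    return {g: c for g, c in counts.items() if c >= minimum}

print(tokenize("Copyright 1997 © Acme, Inc. don't re-use."))
```

A tentative match from `possible_vids` is only a candidate for visual comparison, and `frequent_long_ngrams` would be run once each for n = 20, 12 and 8.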
While the most common shorter n-grams (say for n ≤ 4) are typically found in a wide variety of texts and contexts, these longer n-grams are highly specific and invariably derive from a single source, such as a title, instruction, formulaic expression, quotation, or simple repetition of the same sequence of words. Any single text with several instances of a longer n-gram is a potential HRD, but the recurring text may be insignificant in a large document. To determine the nature and prevalence of the redundancies, I developed and used FindHRDs, which searches each file for instances of frequent longer n-grams and displays any matching passages for assessment and possible elimination. Overall 256 documents were deemed highly repetitive, and many others showed some degree of repetition; some remaining VIDs were identified as well. Shorter elements such as links often recur after each section of a document, or Web pages derived from books or articles may repeat titles as headings at regular intervals. Software documentation and programming tutorials may include the same long sequence of characters again and again. In transcripts of legal and legislative proceedings, repetition of formulaic elements is common, as is the verbatim reappearance of entire passages in laws and contracts. Generally such
repetition was deemed minimal, so the documents were retained. Undoubtedly the most tedious HRDs are server logs, followed by forum threads where each response incorporates all preceding posts to the same discussion. Ironically, search-engine ranking algorithms favor such mindless echoes, since they make a search term very prominent within a document. I typically discarded such “fugues on a theme” without a second thought.
2.3.4 Unproven “Noise” Filters
Other techniques to automate filtering out “noisy” Web pages were investigated but proved less effective without further refinement. The Spelling and Grammar Checker engines of the Microsoft Office suite can be controlled programmatically. These modules could help automate recognition and normalization of ill-formed documents. The primary obstacle encountered was the large number of items not in the default lexicon, such as personal, commercial and place names, technological terms and other neologisms, and abbreviations. Consequently these tools either require constant user intervention when used interactively or else reject too many good documents in automatic mode. Still, they deserve further consideration, particularly for use in a well-defined content domain for which a custom dictionary could be compiled.
Presumably frequency patterns of 1- and 2-grams could indicate “primarily fragmentary documents” (PFDs) such as link lists, server logs and pages bloated with search-engine spam. Some types with high frequency in connected prose like articles, copula, and pronouns are rare in fragments, while others, such as common prepositions, are frequent. Content words, proper nouns and jargon are also relatively prominent in PFDs. In this investigation I did not succeed in exploiting these observations, but do intend to return to this technique in the future.
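To make the last observation concrete, the share of articles and forms of BE among a document's word tokens can be computed as below. This is purely illustrative: the study reports no working filter of this kind, the word lists are taken from the search terms in section 2.2, the function name is hypothetical, and no cutoff value is proposed.

```python
import re

ARTICLES = {"the", "a", "an"}
COPULA = {"is", "are", "be", "was", "were", "been"}

def function_word_share(text):
    """Fraction of word tokens that are articles or forms of BE."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for tok in tokens if tok in ARTICLES or tok in COPULA)
    return hits / len(tokens)

prose = "The cat is on the mat and it was a good day."
links = "Home | Products | Contact | Sitemap | Login"
print(function_word_share(prose), function_word_share(links))
```

Whether any threshold on such a share separates connected prose from fragments reliably would have to be established on reviewed documents.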
2.3.5 Separating Connected Prose from Fragmentary Texts
After sifting out duplicate and repetitive documents with computer assistance, I viewed each of the 7,038 survivors briefly, and classified them as predominately useful text, “noisy” text, i.e. with a significant level of “overhead”, and PFDs; the latter two categories were excluded from the final corpus. The principle of “predominance” was vaguely defined, and since I reviewed up to 12 documents per minute, no rigorous consistency is claimed. This cursory examination of the documents disqualified roughly 30% of the pages, leaving 4,949 documents totaling 5,248,929 tokens and 34,995,762 bytes. Each document included was allowed a “reasonable amount” of overhead for its size – headers, footers, links, bibliography, lists, non-English words – but not exceeding 20% of content for very short documents, dwindling to about 5% maximum for longer ones. During this visual dash through the Web pages I could not savor their content, but it did leave distinct impressions on me. Since I typically conduct
narrowly-defined searches with criteria conceived to limit results to a single content domain, I was struck by the variety of material matching these general queries. Among the shorter documents – those of a few hundred to a couple of thousand words – commercial and personal text prevailed. At the other end of the scale (up to 60,000 words) legal texts and government proceedings were well represented. The middle range was filled with academic texts – papers, theses, syllabi and course materials – some computer hardware and programming documentation, other expository prose, drama (including Shakespeare) and fiction, and personal interest pages, as well as a surprising number of religious documents and commentaries in the Christian, Islamic and Hindu traditions. Numerous “hobbyist” pages broadened the range of topics as well. As expected, it was this middle range that yielded the most useful texts.
2.3.6 File Size as Indication of Usefulness
As anticipated, the shortest and longest documents bore the brunt of this visual selection. Half of all documents were under 3,330 bytes long, and of these about 40% were rejected. Only 10 documents were longer than 100 KB, and more than half of these were deemed primarily non-textual; in fact, no documents over 200 KB were retained. In the range of 5-100 KB, I judged over three-quarters of the documents to be primarily connected prose. The optimum size seems to fall around 50 KB, where only 17.8% of documents were rejected. Nevertheless, owing to the far greater number of smaller files, the median size of texts retained was only 3,770 bytes!
Which HTML files are most worth downloading? Due to variations in HTML markup, the size of a file only indicates roughly how much text it contains. Some HTML editors (most notoriously Microsoft Word) grossly inflate file size, often to 5-10 times that of generic HTML with the same content, and embedded stylesheets and scripts add bulk, but not textual content.
Stripping out such formatting elements typically reduces files to 40-65% of the HTML size; here again shorter files have greater overhead. This signal-to-noise ratio and the observations in the previous paragraph suggest the following rule of thumb: to maximize the “yield” of connected prose, download HTML files only between 10 and 150 KB in size.¹⁴ Had KWiCFinder followed these guidelines for this study, only one-third of the final number of files would have been downloaded, but that would have yielded a corpus two-thirds of the size of the current one with enormous savings in bandwidth and analysis time. The capability to exclude files below a given size is now on my KWiCFinder “to do” list (currently only a maximum file size can be specified).
Other researchers have sampled Web pages as a source of corpus data with other techniques to ensure that samples consisted primarily of running prose. Cavaglià and Kilgarriff (2001) use statistical methods to compare the rank frequencies of lexical items in individual Web pages to those in the BNC. This
comparison requires a sample size of at least 2,000 words per page, so briefer documents were rejected. This cut-off point would exclude about 90% of all Web pages in my sample. In a study for the American National Corpus (ANC) Ide et al. (2002) arrived at minimums of 2,000 words and 30 paragraphs per document as a reasonable indicator of primarily connected text. They report that only 1-2% of Web pages investigated satisfied both criteria. To increase the likelihood of reaching this 2,000-word threshold, one would have to raise the rule-of-thumb for the minimum size of HTML files to download to about 25 KB. In doing so, one would exclude many typical Web pages which consist primarily of prose. Good Web style requires breaking up long documents into shorter Web pages for quicker loading and more responsive hyperactivity.
3 Comparing this Web Corpus to the British National Corpus
My experience with KWiCFinder has convinced me that the Web is a reliable source of data when studying specific words or phrases. How representative of English is this Web Corpus? As a first step toward answering that question I compared lexical data from this corpus to the BNC. The 4,949 Web documents which survived the various “filters” and selection processes were combined into a single file with 5,382,595 tokens (approximately 1/16 of the size of the BNC written corpus). To obtain comparable data from the BNC, I extracted all text within
1. N-grams with a rank frequency of 1 to 250 in both corpora
2. N-grams with a rank frequency of 1 to 200 in one corpus and greater than 300 (i.e. relatively less frequent) in the other
3. N-grams with a rank frequency of 1 to 500 in one corpus not among the 5,000 most frequent in the other.
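The three rank-based groupings listed above can be sketched in Python, given one frequency table per corpus. The function names are hypothetical, ties in frequency are broken alphabetically, and an item absent from the second corpus is treated as ranked below every cutoff; all three choices are assumptions made for illustration.

```python
from collections import Counter

def ranks(counts):
    """Map each item to its frequency rank (1 = most frequent)."""
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {item: i for i, (item, _) in enumerate(ordered, start=1)}

def shared_top(ranks_a, ranks_b, top=250):
    """Items ranked 1..top in both corpora (grouping 1)."""
    return sorted(
        (item for item, r in ranks_a.items()
         if r <= top and ranks_b.get(item, float("inf")) <= top),
        key=ranks_a.get,
    )

def divergent(ranks_a, ranks_b, top=200, far=300):
    """Items ranked 1..top in A but below rank `far`, or absent, in B.

    Grouping 2 uses the defaults; grouping 3 uses top=500, far=5000.
    """
    return sorted(
        (item for item, r in ranks_a.items()
         if r <= top and ranks_b.get(item, float("inf")) > far),
        key=ranks_a.get,
    )

# Toy frequency tables, invented for the example only.
a = ranks(Counter({"the": 10, "he": 8, "said": 5}))
b = ranks(Counter({"the": 10, "you": 9, "site": 7, "he": 1}))
print(divergent(a, b, top=2, far=3), divergent(b, a, top=3, far=3))
```

Each comparison is directional, so it is run once in each direction to find items characteristic of either corpus.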
A thorough analysis of the similarities and differences between the two corpora is beyond the scope of this paper, but will be the subject of a future study. Here I limit myself to preliminary observations about salient differences. Rank frequency lists of the 50 most common words in both corpora are quite similar, but some striking contrasts are found. Beyond these most frequent items the divergences become both greater and more numerous, and thus more indicative of the medium. Table 1 and Table 2 detail all important differences for the top 50 word forms, and sample differences from those ranked 51-200 in frequency. Since these are frequency ranks, lower numbers reflect higher frequencies.
Table 1: Word forms far more frequent in BNC by frequency rank
Rank list   Word form   BNC   Web
1-50        he           23    39
            his          23    44
            she          33   155
            her          34   130
51-200      Mr          123   371
            man         146   414
            old         153   319
            thought     160   729
            never       155   331
            came        184   566
            rather      189   499
Table 2: Word forms far more frequent in Web by frequency rank
Rank list   Word form     BNC     Web
1-50        you            28      15
            will           41      27
            we             43      28
            information   206      45
            our           100      46
51-200      site        1,054      67
            page        1,011      70
            university    586     114
            data          490     120
            search      1,367     135
            please        924     184
            file        1,773     186
Inspection of these word form data and of the distribution of the most frequent phrases (n-grams) in the two corpora reveals the biases and gaps in each. The BNC clearly reflects British institutions, place names and spelling, while the Web sample is more oriented toward the United States. The BNC data show a distinct tendency toward third person, past tense, and narrative style, while the Web corpus prefers first (especially we) and second person, present and future tense,
and interactive style. Since the BNC texts were all written before the mid-nineties, words referring to Internet concepts and information technology which permeate the Web texts (and contemporary life) are rare or missing. In the BNC texts, the language of news and politics stands out, while in the Web corpus academic concepts are quite salient. Finally, the Web data are more varied: none of the most common 5,000 words in the BNC is lacking in the Web corpus, yet the reverse is not true, despite the sixteen-fold greater sample size.
4 Conclusions and Future Plans
This paper has surveyed a number of techniques and algorithms for downloading, preprocessing and evaluating texts from the Web for inclusion in a corpus. Windows software to accomplish these tasks is (in some cases will be) freely available from my Web site so that readers can try it out and help improve it. For comparability with the BNC I aimed to compile a domain-neutral sample Web corpus. Many colleagues will find these procedures especially beneficial for creating small- to medium-sized corpora from the Web for specific professional or pedagogical purposes, or to provide a corpus on a desktop machine for a language for which no corpora are currently available. With the programming done, it should take no more than two or three days’ work to produce another corpus of similar size. I hope to have demonstrated that such a project would be both worthwhile and feasible for a motivated linguist or student.
The continuation of this project will lead me down several complementary paths. Currently I am working on a Web interface for an expanded version of this Web corpus as a prototype for the linguistic search engine outlined elsewhere (Fletcher 2002). Techniques and software developed will be disseminated so colleagues can share any Web corpora they do compile. Next I plan to complete a more sophisticated statistical analysis comparing this corpus with the BNC (and the ANC when it becomes available) to help dispel doubts about the representativeness of selected Web documents for English as a whole. Finally I will investigate further refinements of the procedures and tools described here. Major goals will be to add grammatical markup to the texts and to extend my methods to morphologically richer languages like German and Spanish.
Notes
1. KWiCFinder, the author’s Key Word in Context Web Concordancer, automates finding, analyzing, and saving online documents matching specific search criteria. It is described in detail in Fletcher (2001), and can be downloaded free from http://kwicfinder.com/.
2. All programs were developed for Windows with PowerBasic, which is comparable to C in speed, power and compactness. My intention is to offer tools with a familiar graphical user interface for the most widespread desktop operating system so that colleagues and students need not become proficient UNIX users to do corpus research. I gratefully acknowledge my substantial debt to the PowerBasic user forums for peer support and sample code.
3. An application can obtain information from a Web server about the size and date of a file before downloading it. While search engines report file size, changes to an online resource often make their data unreliable.
4. Many websites do use custom templates with comments or element tags which allow one to find page elements like headers, footers, advertisements and contents automatically. While useful for analyzing numerous documents from a single site, parsing heuristics are rarely transferable from one site to another.
5.
For the first sample of 100 pages a single KWiCFinder search was run, so duplicates occurred only when two URLs pointed to the identical document. Since KWiCFinder uses the AltaVista search engine to find matching documents, it cannot go beyond the latter’s 1000-document limit per query. Consequently it must “merge” data from multiple searches to find larger numbers of texts.
6.
AltaVista’s serious deficiencies in updating its database and distinguishing sponsored links were resolved in 2002. With these factors out of play, AltaVista tends to provide a more random sampling of Web pages than Google. Each site’s ranking algorithms are closely-guarded secrets subject to constant revision. Generally speaking, however, the former tends to rank a page higher in the search results based on formal criteria indicating relative salience of the search terms within the document, while the latter additionally weights results by “link popularity” (i.e. the number of sites that link to a given Web page). Google’s strategy favors relevance and reliability—which is why it quickly became the most popular search engine—but also skews results toward fewer, more prominent sites, often those run for business purposes.
7.
Some of this software is already available at http://kwicfinder.com, and other modules will be released when integration and documentation are complete.
8.
For example, the home page of the departmental website I administer is accessible at either http://www.nadn.navy.mil/LangStudy/ or http://www.usna.edu/LangStudy/, followed or not by homepage.html. All four URLs point to the same document, but some appear redundantly in search engine results.
9.
Prior to settling on MD5 I evaluated numerous hashing algorithms (approaches to “digesting” a string into a short code) for uniqueness of
Making the Web More Useful as a Source for Linguistic Corpora
results and speed. In tests with 20 million unique strings I found that hashes of four bytes or less resulted in numerous “collisions” (i.e. different strings result in the same code). In comparison with SHA-1 and RIPEMD160 (both 20-byte hashes, i.e. 4 bytes longer), MD5 encoded faster over a greater range of string lengths while providing similar protection against collisions. (RIPEMD160 was only marginally slower, but SHA-1 took up to twice as long to encode; for a 67 kB file the range was 1-2 ms. This is not an absolute claim, as tests on different machines showed that the distribution and relative order of run times vary significantly depending on system configuration.) Those who work with much larger datasets where reducing memory load is critical might follow up Dillon’s (no date) suggestion that CRC-64 (8 bytes, i.e. half the size of MD5) is sufficient, as it theoretically would lead to a collision only once in 2.3 trillion (10^12) times.
10.
Approximate run times are based on a Pentium IV / 2.4 GHz / 512 MB desktop under Windows XP. Thanks to the binary tree comparison algorithm and the memory typical of today’s systems, performance would not degrade substantially for much larger document collections.
11.
Frequency-ordered lists of n-grams in each document were produced by my program nGram, for which the MD5 / binary tree algorithm was developed. Since this approach proved incapable of handling the far greater volume of material in the BNC, I subsequently programmed kfNgram, which adapts the far more efficient Virtual Corpus algorithm described by Kit and Wilks (1998) and offers a GUI.
12.
This approach is a first attempt to address the problem of VIDs which requires further testing and refinement. It relies on working assumptions about efficient but effective parameters to identify VIDs. The method does not work for very short texts, since few 3-grams if any are repeated; the distribution of 2-grams or even 1-grams may be more useful. On the other hand, for relatively long texts, patterns of 4-grams are more distinctive. The optimal relationship of file size to “window” size, i.e. the range of sizes of other documents to which a given file should be compared, also deserves study.
13.
Many online documents incorporate large chunks of superfluous text as “search-engine spam” in hopes of increasing traffic by matching more queries.
14.
If these size guidelines had been applied to all 7038 documents remaining after discarding IDs, VIDs and HRDs, 4,724 of them would not have been downloaded. On average the documents eliminated by this rule of thumb were 509 words long, and those retained had a mean size of 1985 words. On the other hand, 23% of the documents kept by this rule were dropped
after visual review, so the suggested size range is only a modest indicator of usefulness.
15.
No attempt was made to normalize spelling. Systematic differences between British and American orthography such as -ize / -ise, -er / -re, -or / -our, as well as national and personal tendencies to write compound forms with a hyphen, a space, or together (log-in, log in, login) can separate lexical variants, thus obscuring important patterns of similarity between the predominantly British BNC texts and the American-biased Web documents.
16.
Extensive excerpts from the database are available online at http://kwicfinder.com/WebCorpus/AAACL2002_ngramdata.pdf. The complete database is available upon request.
References
BNC Consortium (2000), British National Corpus World Edition, Oxford: Humanities Computing Unit, (2 CD-ROMs). http://www.hcu.ox.ac.uk/BNC.
Cavaglià, G. and A. Kilgarriff (2001), Corpora from the Web, Fourth Annual CLUCK Colloquium, Sheffield, UK, January 2001. ftp://ftp.itri.bton.ac.uk/reports/ITRI-01-11.pdf.
CLLT (2003), Discussion thread ‘That/it is not as ADJECTIVE as you think’ on the Corpus Linguistics and Language Teaching Listserv, January 2003. http://listserv.linguistlist.org/cgi-bin/wa?A1=ind0301&L=cllt.
Dillon, M. (no date), CRC1—CRC64 test results on 18.2M dataset. http://apollo.backplane.com/matt/crc64.html.
Fletcher, W.H. (2001), Concordancing the Web with KWiCFinder, American Association for Applied Corpus Linguistics Third North American Symposium on Corpus Linguistics and Language Teaching, Boston, MA, 23-25 March 2001. http://kwicfinder.com/FletcherCLLT2001.pdf.
Fletcher, W.H. (2002), Facilitating compilation and dissemination of ad-hoc Web corpora, TaLC (Teaching and Language Corpora) 5, Bertinoro, Italy, 26-31 July 2002. http://kwicfinder.com/Facilitating_Compilation_and_Dissemination_of_Ad-Hoc_Web_Corpora.pdf.
Ide, N., R. Reppen, and K. Suderman (2002), The American National Corpus: More than the Web can provide, Proceedings of the Third Language Resources and Evaluation Conference (LREC), Las Palmas, Canary Islands, Spain, pp. 839-44. http://www.cs.vassar.edu/~ide/papers/anc-lrec02.pdf.
Kit, C. and Y. Wilks (1998), The Virtual Corpus approach to deriving n-gram statistics from large scale corpora, in C.N. Huang (ed.), Proceedings of the
international conference on Chinese information processing, Beijing, pp. 223-229. http://personal.cityu.edu.hk/~ctckit/papers/vc.pdf.
Rivest, R. (1992), The MD5 Message-Digest Algorithm, RFC1321 (Internet Request for Comments 1321), Cambridge, MA: Network Working Group, MIT Laboratory for Computer Science. http://www.faqs.org/rfcs/rfc1321.html.
Student Use of Large Corpora to Investigate Language Change
Mark Davies
Brigham Young University
Abstract
The use of corpora in historical linguistics courses is an idea whose time has come, but it is a topic that has received scant attention in previous studies. In this paper I examine the way in which students have used large corpora as a fundamental part of an online “History of the Spanish Language” course. These corpora include a parallel corpus of the entire Bible in Late Latin, Old Spanish, and Modern Spanish, which allows students to compare many different linguistic structures across these three languages. The main corpus used in the course is the recently-completed “Corpus del Español” – a web-based, 100 million word, fully-annotated corpus of Spanish texts from the 1200s-1900s. This corpus allows even beginning students of historical linguistics to quickly and easily extract data for a wide range of linguistic phenomena, and thus move beyond the simplistic memorization of “historical rules” that are found in many textbooks.
1
Introduction
Most research on the use of corpora in the classroom deals with using corpora to provide non-native speakers with a database of authentic language data (see the articles from the TALC proceedings: Botley et al. 1996; Wichman et al. 1997; Burnard and McEnery 2000; Kettemann and Marko 2002). Because the goal is language learning by foreign speakers, the focus is obviously on the modern, synchronic stage of the language. In this study, however, I will discuss how language corpora can be used in quite a different sphere of teaching: that of historical linguistics. The use of large, electronic corpora in teaching historical linguistics is still rather uncommon. Of course there have been many valuable applications of corpus methodology to examining problems of historical linguistics (for example, Rissanen 1992, 1993, 1997a,b for English, among many others). Nevertheless, a review of the literature shows only a handful of articles and presentations dealing with the pedagogical use of these materials in the classroom, such as Schmied (1996), Knowles (1997), Davies (2000), and Curzan (2000). This lack of research is unfortunate when one recognizes that the use of corpora in the teaching of historical linguistics can significantly enhance the learning process, as much as (or more than) the use of corpora in learner-oriented and synchronically-based courses. Traditionally, courses in historical linguistics focus on rather abstract rules governing changes in the phonetic, morphological, syntactic, or semantic
structure of the language in question. The students are responsible for memorizing a long list of rules, and perhaps supplying one or two samples of each type of linguistic change. For example, they might include one or two words that have undergone a particular phonetic shift, or one or two sample sentences showing the “before” and “after” stages of a grammatical shift in the language. By using large corpora, however, the students can truly immerse themselves in the data and – by themselves – find new and interesting examples of linguistic change. Depending on the corpus they are using, it may be possible to extract hundreds or thousands of examples of a particular linguistic shift in a very short period of time. This large amount of data can then be used to model linguistic change much more precisely and accurately than had been done by even the best researchers, previous to the use of large electronic corpora. This is very empowering for the students, as they can easily and accurately use data to test the textbook rules for a particular linguistic shift. In essence, even advanced undergraduates or beginning graduate students can use the corpora to add valuable insight into what is known about the evolution of a particular language. 2
“History of the Spanish Language”
Previous studies such as Knowles (1997) and Curzan (2000) are in part “how to manuals,” discussing concrete ways in which corpora have been used in actual courses in historical linguistics. Both of these studies, however, deal just with English. In the present study, I will expand the focus somewhat and look at several different ways in which corpora have been used to teach a “History of the Spanish Language” course that has been offered by Illinois State University (http://davies-linguistics.byu.edu/hisspan). In addition to its strong reliance on corpus-based investigation, this “History of the Spanish Language” course is also unique in terms of its method of delivery. Although originally offered as an in-classroom course, since Spring 2000 it has been offered as an online course, and has been taught entirely via the Web. The lack of traditional classroom interaction was in fact one of the reasons for using large corpora. If the class had been offered in a traditional setting, we could have memorized the different types of linguistic change in Spanish, and the students would have been responsible for duplicating these on the test. There would have also been opportunity for the students to ask questions about the changes, and receive feedback from the professor in areas where clarification was needed or desired. By teaching the class entirely via distance education, the dynamics of the class were altered dramatically. There would be much less opportunity for the traditional “give and take” of the classroom setting, which meant that the students themselves would be more responsible for internalizing the data. In addition, because the class is offered as a distance education course, there are problematic issues regarding the administration of tests and test security. For this reason, it was decided that student projects would form the basis of the evaluation.
Once the decision was made to focus on projects – rather than the rote memorization and recitation of rules – it was obvious that the students would need to have access to a well-built and highly usable database of historical texts, in order to extract the needed data. In subsequent sections, I will focus on the specific corpora that have been used in the class, and the way that they have been used by students to examine and model several different types of linguistic change. First, however, let us briefly consider the basic structure of the class.

Table 1: Course topics
THE EARLIEST STAGES
1. Introduction
2. Pre-romanic languages
3. Indo-European
4. Latin: External
5. Latin: Internal
6. Vulgar Latin and the Romance languages
7. The Visigoths
8. The Arabs
LATIN > MEDIEVAL SPANISH
9. Phonetic
10. Morphosyntax
11. Lexicon
MEDIEVAL SPANISH
12. Medieval Spanish dialects
13. Medieval texts
14. The language c1250-1450
MEDIEVAL > MODERN SPANISH (INTERNAL)
15. Phonetic
16. Orthography
17. Morphology
18. Syntax
19. Lexicon
MODERN SPANISH (EXTERNAL)
20. The language c1475-1700
21. Spanish in the Americas
22. Other modern dialects
23. The future of Spanish
[Columns QUESTIONS and PROJECTS mark each topic with an O; the alignment of the marks with individual topics is not recoverable here.]
3
Overview of Course Topics and Organization
The “History of the Spanish Language” course covers a wide range of topics, dealing with both language-internal and external factors. Table 1 shows the twenty-three topics that receive primary focus during the course. As can be seen in this table, there are two different types of activities in the course. For the topics that are more “language-internal” in nature, there are corpus-based projects. For the “external” topics, there are a number of activities that are somewhat more traditional in nature. These involve readings and selected essay-type questions, which are submitted and evaluated via the class website. Even with some of these topics, however, there is an attempt to use a simple corpus-based approach, wherever possible. For example, in the discussion of the medieval dialects, students are first presented with information on the major features distinguishing the dialects, and are then given 200-300 word extracts from different “unlabeled” dialects and asked to identify the dialects, based on their linguistic features. Likewise, for the final topic – dealing with the present influence of other languages on Spanish – students are asked to use Google to find examples of English-based words in Spanish web pages. In addition to these traditional “question and answer” activities, however, there are many corpus-based projects, and these are the focus of this paper. As we will see, the two major sets of corpora of historical Spanish are used to 1) investigate the relationship between different stages of the language, and 2) accurately model several different types of linguistic change in Spanish. In Section 4, I will discuss how the first goal was addressed in the use of the large parallel “Polyglot” Bible of Late Latin, Old Spanish, and Modern Spanish. In Sections 5 and 6, I will discuss the second goal, by considering the way in which large, multi-million word corpora of Spanish are used to map out linguistic change from one century to the next.
4
The Polyglot Bible
One of the difficulties in teaching a course in historical linguistics is the challenge of having students see the relationship between different stages of the language. One way to address this challenge is by having students study the same passage in a parallel corpus that contains the same text in different stages of the language. Perhaps the best text for this purpose is the Bible, which has been translated into most of the European languages several times since the Middle Ages. With this goal in mind, several years ago I placed online a “Polyglot Bible” that contains the entire Gospel of Luke (1150+ verses) in thirty different languages (see http://davies-linguistics.byu.edu/polyglot). In addition to the modern stages of many different Indo-European and non-Indo-European languages, it also contains older stages of English (Old English [1000s], Middle English [1300s], Early Modern English [1600s], and Present-Day English [1900s]) and Spanish (Old
Spanish [1200s], and Late Latin). Table 2 shows part of the story of the “Good Samaritan” (Luke 10:30-33) in the four stages of English:

Table 2: Polyglot/parallel corpus (stages of English)

10:30
OE (1000s): þa cwæþ se hælend hine up beseonde; Sum man ferde fram hierusalem to hiericho and becom on þa sceaðan. þa hine bereafodon; and tintregodon hine: and forleton hine samcucene:
ME (1300s): soþli Jesus byholdende vp seide, sum man cam doun fro ierusalem to Jericho, & fel in to þeues, þe whiche also robbeden hym, & woundis put in, wenten awey, þe man left half quic
EME (1600s): And Jesus answering said, A certain [man] went down from Jerusalem to Jericho, and fell among thieves, which stripped him of his raiment, and wounded [him], and departed, leaving [him] half dead.
PDE (1900s): In reply Jesus said: "A man was going down from Jerusalem to Jericho, when he fell into the hands of robbers. They stripped him of his clothes, beat him and went away, leaving him half dead.

10:31
OE (1000s): þa gebyrode hit þæt sum sacerd ferde on þam ylcan wege and þa he þæt geseah he hine forbeh.
ME (1300s): forsoþe it befel þat sum prest cam doun in þe same weie, & hym seen, passede forþ
EME (1600s): And by chance there came down a certain priest that way: and when he saw him, he passed by on the other side.
PDE (1900s): A priest happened to be going down the same road, and when he saw the man, he passed by on the other side.

10:32
OE (1000s): and eallswa se diacon. þa he wæs wið þa stowe and þæt geseah he hyne eac forbeah;
ME (1300s): Also forsoþe & a dekne whan he was biside þe place & saÿ hym, passede forþ
EME (1600s): And likewise a Levite, when he was at the place, came and looked [on him], and passed by on the other side.
PDE (1900s): So too, a Levite, when he came to the place and saw him, passed by on the other side.

10:33
OE (1000s): þa ferde sum samaritanisc man wið hine: þa he hine geseah þa wearð he mid mildheortnesse ofer hine astyred
ME (1300s): forsoþe sum samaritan makende iourney, cam biside þe weie, & he seende hym, is stirid bi mercy
EME (1600s): But a certain Samaritan, as he journeyed, came where he was: and when he saw him, he had compassion [on him],
PDE (1900s): But a Samaritan, as he traveled, came where the man was; and when he saw him, he took pity on him.
The parallel text is a useful tool, in that it allows students and other users to see exactly the same text in different historical periods, and thus see quite clearly how the language has changed. One indication of the usefulness of the online “Polyglot Bible” is that the historical English corpus is currently being used as part of a number of “History of the English Language” courses throughout the world.
In the case of Spanish, the parallel text is not just for the 1150-verse Gospel of Luke, but rather it contains the text for nearly all of the Old and New Testaments – nearly 15,000 verses (see http://davies-linguistics.byu.edu/span3). Table 3 is a small selection, containing part of the story of the “Good Samaritan” (Luke 10:30-33) in the three stages of Latin and Spanish.

Table 3: Polyglot/parallel corpus (stages of Latin/Spanish)

10:30
LATIN: suscipiens autem Iesus dixit homo quidam descendebat ab Hierusalem in Hiericho et incidit in latrones qui etiam despoliaverunt eum et plagis inpositis abierunt semivivo relicto
OLD SPANISH: Catando Ihesu Christo a suso, dixo: un ombre decendie de Iherusalem a Iherico, e cayo en ladrones, e despoiaron le, e firieron le; de hy dexaron le medio uiuo e fueron se.
MODERN SPANISH: Respondiendo Jesús dijo: --Cierto hombre descendía de Jerusalén a Jericó y cayó en manos de ladrones, quienes le despojaron de su ropa, le hirieron y se fueron, dejándole medio muerto.

10:31
LATIN: accidit autem ut sacerdos quidam descenderet eadem via et viso illo praeterivit
OLD SPANISH: Acaecio que aquel mismo dia un sacerdot passaua por aquella misma carrera, e quandol uio, passos e fue su uia.
MODERN SPANISH: Por casualidad, descendía cierto sacerdote por aquel camino; y al verle, pasó de largo.

10:32
LATIN: similiter et Levita cum esset secus locum et videret eum pertransiit
OLD SPANISH: E otrosi un leuita que passo cab el, quandol uio, fuesse adelant.
MODERN SPANISH: De igual manera, un levita también llegó al lugar; y al ir y verle, pasó de largo.

10:33
LATIN: Samaritanus autem quidam iter faciens venit secus eum et videns eum misericordia motus est
OLD SPANISH: E un samaritano que passaua por alli, quandol uio, fue mouido de piedat;
MODERN SPANISH: Pero cierto samaritano, que iba de viaje, llegó cerca de él; y al verle, fue movido a misericordia.
In addition to the inherent advantages of presenting the same text in parallel format, the online corpus also has the advantage of being searchable, and this allows students to perform a number of useful queries of the data. For example, one of the projects in the course is to find evidence for seven or eight of the major morphosyntactic changes from Late Latin to Old Spanish, such as the loss of nominal case, the creation of articles, the maintenance of specific verbal inflexions, the loss of others (e.g., future and passive), the creation of others (e.g., analytic perfect tenses), and negation. In this case, a student might investigate the disappearance of the synthetic Latin future (facient; “3PL will make”) and the emergence of the analytic Romance future (VL facere habent > OSp. fazer (h)an > ModSp. harán). In examining this shift, students can search for a Modern Spanish form (e.g., harán), and in less than half a second they retrieve the 33 matching hits in the 15,000 verses of text (see, for example, Table 4).
Table 4: Searching the parallel corpus to compare constructions (Lat/OSp/MSp)

Deut 25:2
LATIN: sin autem eum qui peccavit dignum viderint plagis prosternent et coram se facient verberari pro mensura peccati erit et plagarum modus
OLD SPANISH: mas si eillos vieren que aqueill que erro contra lotro fuere digno de ferir: tender lan & ante si fazer lo an acotar segunt que fuere su peccado assi sera batido.
MODERN SPANISH: Sucederá que si el delincuente merece ser azotado, el juez lo hará recostar en el suelo y lo harán azotar en su presencia. El número de azotes será de acuerdo al delito.
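
The search just described (enter a Modern Spanish form, get back the aligned verses in all three stages) can be illustrated with a minimal Python sketch. It is not the implementation behind the online corpus: the dictionary layout, the stage keys and the function name `search_parallel` are assumptions made for illustration, and only two verses from Tables 3 and 4 are included.

```python
import re

# Hypothetical miniature parallel corpus: verse reference -> text per stage.
# The verse texts come from Tables 3 and 4; the data layout is an assumption.
PARALLEL = {
    "Deut 25:2": {
        "latin": "sin autem eum qui peccavit dignum viderint plagis prosternent "
                 "et coram se facient verberari pro mensura peccati erit et plagarum modus",
        "old_spanish": "mas si eillos vieren que aqueill que erro contra lotro fuere digno "
                       "de ferir: tender lan & ante si fazer lo an acotar segunt que fuere "
                       "su peccado assi sera batido.",
        "modern_spanish": "Sucederá que si el delincuente merece ser azotado, el juez lo hará "
                          "recostar en el suelo y lo harán azotar en su presencia.",
    },
    "Luke 10:33": {
        "latin": "Samaritanus autem quidam iter faciens venit secus eum et videns eum "
                 "misericordia motus est",
        "old_spanish": "E un samaritano que passaua por alli, quandol uio, fue mouido de piedat;",
        "modern_spanish": "Pero cierto samaritano, que iba de viaje, llegó cerca de él; "
                          "y al verle, fue movido a misericordia.",
    },
}


def search_parallel(corpus, stage, word):
    """Return (reference, all aligned versions) for every verse whose text in
    `stage` contains `word` as a whole word, ignoring case."""
    pattern = re.compile(r"(?<!\w)" + re.escape(word) + r"(?!\w)", re.IGNORECASE)
    return [(ref, versions) for ref, versions in corpus.items()
            if pattern.search(versions[stage])]


# Start from the stage the student knows best and read off the older stages.
hits = search_parallel(PARALLEL, "modern_spanish", "harán")
for ref, versions in hits:
    print(ref)
    for stage, text in versions.items():
        print(f"  {stage}: {text}")
```

Because the verses are aligned by reference, a hit on the Modern Spanish form brings the Latin and Old Spanish versions of the same verse along with it.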
Likewise, the assignment might require the student to find evidence for a particular linguistic shift from Old Spanish to Modern Spanish. For example, Modern Spanish often uses [ir + a + INF] to express the future (va a cantar “3SG is going to sing”), whereas this was still very infrequent in Old Spanish. A student can therefore look for cases like [va a *r], and will retrieve several examples like those in Table 5.

Table 5: Searching the parallel corpus to compare constructions (OSp/MSp)

Rev 2:10
OLD SPANISH: Non temas ninguna destas cosas por que as de passar. Euas que el diablo metra de uos en carcel . . .
MODERN SPANISH: No tengas ningún temor de las cosas que has de padecer. He aquí, el diablo va a echar a algunos de vosotros en la cárcel. . .

1 Sam 10:27
OLD SPANISH: Mas los fijos de belial dixieron Como nos podra deffender: Desdennaron lo & non le trayeron dones et eill fazie semblant que no lo oye
MODERN SPANISH: Pero unos perversos dijeron: "¿Cómo nos va a librar éste?" Ellos le tuvieron en poco y no le llevaron un presente. Pero él calló.
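
A wildcard query such as [va a *r] can be approximated with a regular expression. The sketch below is only an illustration, under the assumption that `*` stands for any run of letters within a single word; the function name `wildcard_to_regex` is hypothetical and does not describe the corpus's own search engine.

```python
import re


def wildcard_to_regex(query):
    """Compile a space-separated wildcard query into a regex.
    Assumption: '*' matches zero or more word characters inside one word."""
    words = []
    for word in query.split():
        # Escape the literal pieces and let each '*' expand to \w*
        literal_parts = [re.escape(part) for part in word.split("*")]
        words.append(r"\w*".join(literal_parts))
    # Require word boundaries at both ends and whitespace between words.
    return re.compile(r"(?<!\w)" + r"\s+".join(words) + r"(?!\w)", re.IGNORECASE)


pattern = wildcard_to_regex("va a *r")

# Modern and Old Spanish versions of Rev 2:10 from Table 5.
modern = "He aquí, el diablo va a echar a algunos de vosotros en la cárcel"
old = "Euas que el diablo metra de uos en carcel"

print(pattern.findall(modern))
print(pattern.findall(old))
```

A pattern this simple over-generates: it accepts any word ending in -r after va a, not only infinitives, so the hits still have to be checked by eye.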
In summary, the parallel corpora can help students to find an unknown form in a different stage of the language, simply by working from the stage with which they already feel the most comfortable. 5
The Original “Corpus del Español” (3 million words; unannotated)
The parallel text “Polyglot Bible” that has just been described allows students to easily compare equivalent structures in different stages of the language, and to actually see the contrasting structures in context. However, this corpus would not allow students to see how a particular form or construction developed over a number of centuries (i.e. in the period between the three or four specific stages that appear in the polyglot text). For this type of research, students would need access to a comprehensive corpus of many different texts. In the case of Spanish, this would include texts from each of the centuries from the 1200s to the 1900s.
Fortunately, before the “History of the Spanish Language” course was taught on the web for the first time, I had already developed such a corpus of historical Spanish texts. Table 6 shows the composition of the corpus, which contained more than three million words in nearly 200 texts:

Table 6: Composition of the original 3,000,000 word corpus

Historical
CENTURY   # TEXTS   # WORDS
1200      14        250,000
1300      10        250,000
1400      15        250,000
1500      19        250,000
1600      16        250,000
1700      17        250,000

Modern Spanish
CENTURY/REGISTER      (#) TEXTS                # WORDS
1800-Spain            13                       250,000
1800-LA               14                       250,000
1900-Spain-Spoken     Habla Culta, Esp Oral    250,000
1900-Spain-Written    Novels, Short Stories    250,000
1900-LA-Spoken        Habla Culta +            250,000
1900-LA-Written       Novels, Short Stories    250,000
As can be imagined, because there are at least a quarter of a million words from each century from the 1200s to the 1900s, the students are able to use the corpus to very accurately describe several different types of language change. As was shown in Table 1, Units 15-19 of the course require students to show evidence from the corpus for specific linguistic changes in terms of the sound system, orthography, morphology, syntax, and the lexicon, and the three million word corpus of historical Spanish texts allows them to provide extensive data for these changes. In fact, the range of linguistic phenomena that the students are able to study is both quite broad and quite in-depth. Table 7 provides just a sampling of some of the shifts that the students have to map out and describe for two of these areas of language change – morphology and syntax – and comparable lists are given for phonetic, orthographical, and lexical changes. In each case, the information given in parentheses after the shift (e.g., C 213) refers to the book and page number that describes the shift. The task of the students is to use the data from the corpus to verify whether the information in the textbook is in fact correct. Let us examine a concrete example of how the students carry out their research. Item 5 of the “Pronouns” section of Table 7 notes that pronouns in the [indirect]+[direct] sequence changed from [gelo] in Old Spanish to [se lo] in Modern Spanish (e.g., se lo di “to-him it I-gave”). Students studying this shift would simply enter [gelo] or [se lo] into a web-based search form, and select the centuries for which they wanted to retrieve data.
Table 7: Examples of specific types of phenomena investigated by the students

Morphological shifts, 1200s-1900s
Nouns
1. Gender (C 213, 243) (S 101-2)
2. la + -o (C 243)
3. -íssimo (C 213)
Determiners / pronouns
1. vos(uos) / os (C 214)
2. la tu / tu (C 246)
3. nosotros/vosotros (C 214)
4. los/les (C 214, 245) (S 103, 201-2)
5. mio/mi, sos/sus, etc (C 215)
6. alguien:quien, nadie:otrie (C 215)
7. gelo / se lo (C 244) (S 103-4)
Verbs
1. -zco (verbs) (C 215-6)
2. irregular past participles (C 216)
3. irregular preterites (S 113-4)
4. imperfect in -ié / ía (C 216) (S 112-3)
5. irregular future tense (C 216) (S 115)

Syntactic shifts, 1200s-1900s
Pronouns
1. placement (C 245) (S 119-20, 170-1)
2. mesoclitic future: cantar lo (h)an (S 114-5)
3. “redundant” DO/IO (C 245)
4. impersonal se (C 246)
5. gelo / se lo (C 246)
6. vos / tú / usted (C 214, 244) (S 167-8)
7. omne = se (S 106) (L 402-3)
Meaning and use of verb forms
1. ser / estar (C 218) (S 127-8, 204) (L 400-1)
2. haber / tener (C 249) (S 127)
3. haber / ser + PP (C 249) (S 126, 169)
4. haber / hacer (S 127)
5. subjunctive (C 217, 248) (S 169)
6. infinitives (C 217) (S 123)
They would then see the frequency of the construction in each historical period, as shown in Table 8.

Table 8: 3,000,000 word corpus – search interface and frequency listings

Word/phrase: gelo    [Submit] [Reset]
Time period: 1200s 1300s 1400s 1500s 1600s 1700s 1800s 1900s

Search string   1200s  1300s  1400s  1500s  1600s  1700s  1800s  1900s
gelo               36     30     23      7
se lo               4      2      3     54     56     31     80     70
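
The kind of per-century frequency listing shown in Table 8 can be sketched in a few lines of Python. This is an illustrative sketch only: the three text snippets are invented for the example (they are not corpus data and do not reproduce the counts in Table 8), and the function name `frequency_by_century` is hypothetical.

```python
import re
from collections import Counter

# Invented toy snippets labelled by century; NOT taken from the corpus.
TOY_TEXTS = [
    ("1200s", "dio gelo luego e non gelo tollio"),
    ("1400s", "gelo dixo e despues se lo dio"),
    ("1600s", "se lo dijo y se lo dio"),
]


def frequency_by_century(texts, phrase):
    """Count whole-word occurrences of `phrase` in each century's texts."""
    words = [re.escape(w) for w in phrase.split()]
    pattern = re.compile(r"(?<!\w)" + r"\s+".join(words) + r"(?!\w)", re.IGNORECASE)
    counts = Counter()
    for century, text in texts:
        counts[century] += len(pattern.findall(text))
    return counts


centuries = sorted({century for century, _ in TOY_TEXTS})
print("string  " + "  ".join(centuries))
for phrase in ("gelo", "se lo"):
    counts = frequency_by_century(TOY_TEXTS, phrase)
    print(f"{phrase:<7} " + "  ".join(f"{counts[c]:>5}" for c in centuries))
```

Laying the two search strings side by side, one row each, is what makes the crossover point between the older and the newer form easy to spot.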
By comparing the two sets of data, the student can clearly see that it was about the 1500s that the new [se lo] form became the norm. For more precision, the students can click on the numbers indicating the frequency of any form in any century, and see the examples in context. Because this KWIC display shows the exact date of each text, it would be possible to describe the period of greatest change even more precisely. Similar queries and investigations for any of the other morphological or syntactic shifts could be (and are) carried out in like fashion. Students can easily map the emergence or disappearance of a given word, the variation in the use of a
particular verbal conjugation, or the changes in the spelling (and perhaps also pronunciation) of a certain subset of words. Because of the design of the corpus, even relatively inexperienced students are able to quickly and easily extract large amounts of useful data. In fact, in many cases the descriptions that they give for different types of linguistic change are more detailed (in terms of the historical trajectories) than the descriptions given in the textbooks that we use in the class, which were written by experts with much more experience. All of this is very “empowering” to the students, in helping them to discover data that no one else had ever seen before. 6
The Present “Corpus del Español” (100 million words; richly annotated)
The three million word corpus that has just been described was the corpus that was used the first time that the course was offered online in Spring 2000. Although it was quite useful in its own right, it also had a number of limitations, which made certain types of linguistic investigations quite difficult. For example, the search engine for the corpus (Microsoft Search) does not allow much in the way of wildcard searches, which would have been quite useful for examining sound and spelling changes. More importantly, there was really no way to annotate the corpus. This meant that it was impossible to search by lemma (e.g., all of the forms of a particular verb) or by grammatical category.

Table 9: Composition of the newer, NEH-funded 100,000,000 word corpus

CENTURY        # WORDS        # TEXTS
1200s          6,905,000      71
1300s          2,820,000      50
1400s          8,515,000      160
1200s-1400s    18,240,000     281

1500s          18,001,000     323
1600s          12,746,000     499
1700s          10,263,000     159
1500s-1700s    41,010,000     981

1800s          20,465,000     392 novels
1900s-Lit      6,750,000      850 novels/stories
1900s-Oral     6,800,000      2040+ transcripts
1900s-Misc     6,800,000      4770+ articles
1800s-1900s    40,815,000     8052

TOTAL          100,000,000    9314
In order to address these shortcomings, a new corpus has been created, and this will now serve as the main database for the class. The new corpus was funded by a grant from the National Endowment for the Humanities, and was created between April 2001 and July 2002. It contains 100 million words of text, including 20 million from the 1200s-1400s, 40 million for the 1500s-1700s, and
Student Use of Large Corpora to Investigate Language Change
40 million for the 1800s-1900s. Table 9 provides more details on the composition of the corpus. The process of carrying out queries with the newer 100,000,000 word corpus is fairly similar to that for the older 3,000,000 word corpus. With the new corpus there are more options for limiting the query by frequency in different centuries, for choosing how the results will be grouped (word form or lemma), for choosing how the results will be sorted, etc. But the only field that is required is the [SEARCH] field itself. For example, suppose that a student wants to search for cases of an object pronoun + any form of querer “to want” + an infinitive (e.g., lo quiero hacer “it I-want to-do”). Suppose also that the students want to limit the strings only to those that occur at least once in the 1900s, and that they want to sort the results by the frequency of the string in the 1900s. The students would enter the following into the search form, and then see the following results:

Table 10: 100,000,000 word corpus – query interface and frequency listings

SEARCH   *.pn_obj querer.* *.v_inf
SORT     1900s
LIMITS   +1900s
GROUP    FORMS
[RESET] [SUBMIT]

#     PHRASE(S)
1     te quiero decir
4     me quiero ir
19    Le quiere dar
22    Te quiero contar
…

Frequency by period (values are listed from top to bottom within each column; empty cells are not shown):

12           9
13           1
14           1 1
15           17 32 7 11
16           10 10 6 3
17           1 1 2
18           8 2 6
19           49 23 4 4
19 Lit       11 7
19 Oral      38 16 3
19 Misc.     1 4

Once they are presented with the frequency listing of all matching forms, users can then use the checkboxes to select which phrase(s) to see in context and in which historical period(s). After selecting these phrases, they then see a “keyword in context” (KWIC) display, in which the example sentences can be re-sorted by left and right contextual words, and from which a more expanded block of text can be viewed. (Note: in Table 11 the examples are truncated, unlike on the web).

Table 11: 100,000,000 word corpus – KWIC display

RE-SORT BY: L-2 L-1 C R-1 R-2

TIME   TEXT
12     Libro de los..   tiene gela forçada. Et non le quiere dar lo que a tomado & en logar de
15     La Serrana de..  desdicha el desengaño. No me quiero casar, padre, que creo que mientras no
19_L   Follaje en..     ¡Haré lo que quiera, no me quiero ir! Ya soy grande y sé hacer de
19_O   EspOral:CO..     a mi madre y a mi padre. Te quiero decir que es una cosa que yo - y mis
...    ...              ...

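The KWIC display and its re-sorting options can be illustrated with a minimal Python sketch. This is not the corpus's actual implementation, which the chapter describes only as a relational database; the function names `kwic` and `resort`, the whitespace tokenization, and the simplified sample text (adapted from the Table 11 examples, with punctuation removed) are all assumptions made for illustration.

```python
from typing import List, Tuple

# One hit = (left context, matched phrase, right context), each a list of tokens.
Hit = Tuple[List[str], List[str], List[str]]


def kwic(tokens: List[str], phrase: List[str], width: int = 2) -> List[Hit]:
    """Find every occurrence of `phrase` and keep `width` tokens on each side."""
    target = [w.lower() for w in phrase]
    n = len(target)
    hits = []
    for i in range(len(tokens) - n + 1):
        if [t.lower() for t in tokens[i:i + n]] == target:
            left = tokens[max(0, i - width):i]
            right = tokens[i + n:i + n + width]
            hits.append((left, tokens[i:i + n], right))
    return hits


def resort(hits: List[Hit], slot: str) -> List[Hit]:
    """Re-sort hits by one context slot: 'L-1', 'L-2', 'R-1' or 'R-2'."""
    side, distance = slot.split("-")
    k = int(distance)

    def key(hit: Hit) -> str:
        left, _, right = hit
        if side == "L":
            return left[-k].lower() if len(left) >= k else ""
        return right[k - 1].lower() if len(right) >= k else ""

    return sorted(hits, key=key)


# Simplified, hypothetical sample text based on two of the Table 11 examples.
sample = ("haré lo que quiera no me quiero ir ya soy grande "
          "no me quiero casar padre").split()
hits = kwic(sample, ["me", "quiero"])

for left, match, right in resort(hits, "R-1"):
    print(" ".join(left), "|", " ".join(match), "|", " ".join(right))
```

Sorting on R-1 groups the examples by the infinitive that follows the matched phrase, which is the kind of regrouping that lets a student see at a glance which verbs occur in the construction.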
Even more important than the size of the corpus is its annotation scheme and search engine, which provide capabilities for a wider range of searches than almost any other large corpus in existence. The corpus uses a unique relational database architecture – which I have designed especially for this corpus – which allows searching by substring (advanced wildcard queries), subqueries, lemma, part of speech, synonyms, and user-defined features. In addition, the queries on the corpus are very fast. Even the most complex queries only take three or four seconds to return data from the 100 million word corpus. In the sections that follow, I will discuss very briefly how the new corpus can meet the needs of students in the “History of the Spanish Language” course, in terms of mapping out in very detailed fashion a wide range of linguistic shifts.

First, the substring function allows students to investigate sound change and shifts in spelling. Examples of the types of queries allowed by the search engine are given in Table 12, where the three entries for each query give the student input, examples of the output, and an explanation of the search.

Table 12: Examining sound/spelling changes

Input:        s_fr*
Output:       sofryr, sufre, sufriendo
Explanation:  words relating to the root [s-fr] “to suffer” ([o] in Old Spanish, [u] in Modern Spanish)

Input:        *mbre 1200s>5 1900s<5
Output:       ombre, fambre, combre
Explanation:  words ending in –mbre, which occur at least five times in the 1200s, but less than five times in the 1900s (i.e. forms from Old Spanish)

Input:        *aua* +1200s *aba* +1900s
Output:       fablaua, caualleros, daua / hablaba, caballeros, daba
Explanation:  words with the pattern *aua* in the 1200s, which have an equivalent with *aba* in the 1900s (resulting from a spelling change in the 1700s)

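To make the logic of these wildcard-plus-period queries concrete, here is a small Python sketch that filters a word-frequency table in the same spirit. It is only an approximation of the idea, not the corpus's search engine: the `FREQ` counts are invented for illustration, the function names are hypothetical, `_` is assumed to stand for a single character, and `1200s>5` is read as "at least five", following the gloss in Table 12.

```python
import re
from fnmatch import fnmatchcase

# Invented per-century counts, for illustration only.
FREQ = {
    "ombre":  {"1200s": 120, "1900s": 0},
    "fambre": {"1200s": 30,  "1900s": 0},
    "hombre": {"1200s": 2,   "1900s": 900},
    "hambre": {"1200s": 0,   "1900s": 150},
    "sofryr": {"1200s": 12,  "1900s": 0},
    "sufre":  {"1200s": 1,   "1900s": 80},
}


def limit_predicate(limit):
    """Turn '+1200s', '-1900s', '1200s>5' or '1900s<5' into (period, test)."""
    m = re.fullmatch(r"([+\-–])(\d{4}s)", limit)
    if m:
        sign, period = m.groups()
        if sign == "+":
            return period, lambda n: n >= 1   # occurs at least once
        return period, lambda n: n == 0       # does not occur
    m = re.fullmatch(r"(\d{4}s)([<>])(\d+)", limit)
    if m:
        period, op, k = m.group(1), m.group(2), int(m.group(3))
        if op == ">":
            return period, lambda n: n >= k   # "at least k", as glossed
        return period, lambda n: n < k
    raise ValueError(f"unrecognised limit: {limit!r}")


def search(pattern, limits=(), freq=FREQ):
    """Return the words matching a wildcard pattern and all period limits."""
    glob = pattern.replace("_", "?")          # assumed: '_' = exactly one character
    checks = [limit_predicate(item) for item in limits]
    return sorted(
        word for word, counts in freq.items()
        if fnmatchcase(word, glob)
        and all(test(counts.get(period, 0)) for period, test in checks)
    )


print(search("*mbre", ["1200s>5", "1900s<5"]))  # ['fambre', 'ombre']
print(search("s_fr*"))                          # ['sofryr', 'sufre']
```

The paired query in the last row of Table 12 (*aua* against *aba*) would additionally require linking each old form to its modern equivalent, which this sketch does not attempt.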
Second, the corpus can be used to examine morphological change and variation. This is due to the wildcard searches (just mentioned), as well as the fact that the word forms are annotated for lemma (= lemma.*):

Table 13: Examining morphological changes

Input:        *iere +1200s +1300s –1900s
Output:       fiziere, naciere, touiere
Explanation:  words ending in –iere that occur at least once in the 1200s and 1300s but not the 1900s. This would retrieve many forms of the future subjunctive, a verbal form that has essentially died out by Modern Spanish

Input:        *simo +1400s -1300s
Output:       santísimo, altísimo, grandisimo
Explanation:  words ending in –[ií]simo (a marker of the superlative), which do not occur in the 1300s but which do occur for the first time in the 1400s

Input:        decir.* +1200s –1500s –1900s
Output:       dize, dixiere, dezyr
Explanation:  forms of decir “to say” that occur in the 1200s, but not in the 1500s or 1900s (i.e. forms of the verb from Old Spanish, which have subsequently disappeared)

Third, it is possible to carry out advanced syntactic analysis on the corpus, due to the fact that the corpus is annotated for part of speech (= *.pos):

Table 14: Examining syntactic changes

Input:        *.v_inf -1700s -1800s +1900s
Output:       detectar, liberar, programar
Explanation:  infinitives that occur in the 1900s, but which do not occur in the 1700s or 1800s (i.e. new verbs that have entered into the language)

Input:        poder.* lo/la/los/las *.v_inf +1200s
Output:       puede lo fazer, podemos las fazer
Explanation:  forms of poder “to be able” + object pronouns (e.g. lo/la/los/las) + an infinitive (common word order in Old Spanish)

Input:        estar.* cansado.* de *.v_inf
Output:       estoy harto de vivir, estaba cansada de escuchar
Explanation:  any form of estar “to be” + any form of any synonym of the adjective cansado (“tired”) + de + infinitive

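The slot-by-slot logic behind such queries can be sketched in Python as follows. This is a simplified illustration under stated assumptions rather than a description of the actual search engine: the `Token` structure, the function names, and the tags "adv", "v" and "conj" are hypothetical (only `v_inf` and `pn_obj` appear in the chapter), and the period limits and synonym searches are left out.

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    form: str    # word form as it appears in the text
    lemma: str   # dictionary form
    pos: str     # part-of-speech tag


def slot_matches(slot: str, tok: Token) -> bool:
    """Check one query slot against one annotated token."""
    if slot.startswith("*."):        # *.v_inf  -> any word with this part of speech
        return tok.pos == slot[2:]
    if slot.endswith(".*"):          # poder.*  -> any form of this lemma
        return tok.lemma == slot[:-2]
    # lo/la/los/las -> any of the listed forms (a single word is a list of one)
    return tok.form.lower() in slot.lower().split("/")


def find(query: str, tokens: List[Token]) -> List[str]:
    """Return every contiguous token sequence that satisfies all slots in order."""
    slots = query.split()
    hits = []
    for i in range(len(tokens) - len(slots) + 1):
        window = tokens[i:i + len(slots)]
        if all(slot_matches(s, t) for s, t in zip(slots, window)):
            hits.append(" ".join(t.form for t in window))
    return hits


# Hypothetical annotated sequence built from the Table 14 output examples.
text = [
    Token("non", "no", "adv"),
    Token("puede", "poder", "v"),
    Token("lo", "lo", "pn_obj"),
    Token("fazer", "fazer", "v_inf"),
    Token("e", "e", "conj"),
    Token("podemos", "poder", "v"),
    Token("las", "lo", "pn_obj"),
    Token("fazer", "fazer", "v_inf"),
]

print(find("poder.* lo/la/los/las *.v_inf", text))
# ['puede lo fazer', 'podemos las fazer']
```

Because every slot is tested against the annotation rather than the surface string, the same query retrieves all inflected forms of poder without the student having to list them.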
Fourth, the corpus can be used to investigate semantic change. Two features of the corpus make this possible. The first is the possibility of using collocations to see what other words occur with a given word in different historical periods. If the words that co-occur have changed significantly over time, then that may indicate that the word in question has also changed its meaning. Second, the corpus has a built-in thesaurus for more than 30,000 words. This allows users to see which synonyms of a given word have increased or decreased in frequency over time.

Table 15: Examining semantic changes

Input:        !romper 1900s>10 1700s<5
Output:       irrumpir, incumplir, escacharar
Explanation:  synonyms of romper “to break” that occur at least ten times in the 1900s, but less than five times in the 1700s

Input:        *.n suave +1900s –1800s –1700s
Output:       sabor suave, pelaje suave
Explanation:  nouns that occur with suave “soft” in the 1900s, but not in the 1800s or 1700s. May indicate recent shifts in the meaning of suave.

In addition to all of the types of searches shown previously, it is also possible to create “customized” lists of words that can be re-used in subsequent searches. These lists can include items that are semantically, syntactically, or morphologically related, such as parts of clothes, temporal adverbs, or words ending in –azo (a suffix that sometimes refers to a strike or blow made with an object, e.g., puerta > portazo = “to hit with a door”). The students simply create the list of words via a simple form in the search interface, and they can later modify the list and use it as part of the search syntax. For example, suppose that a student named [susana.rubio] has created lists called [ropa] “clothes” and [azo] “strikes/blows with an X” with the following items:
ropa:  sombrero, pantalón, camisa, zapato, cinturón
azo:   puñetazo, portazo, manotazo, latigazo, collazo
Later that day, or even weeks later, this student could then re-use these lists in a search, as shown in Table 16.

Table 16: User-defined lists

Input:        poner.* el/la/los/las [susana.rubio:ropa].*
Output:       ponerse los pantalones, puso el sombrero
Explanation:  any form of poner (“to put”) + definite article (el/la/los/las) + any form of any word in the [ropa] list

Input:        dar.* un [susana.rubio:azo]
Output:       dé un portazo, da un puñetazo, dio un codazo
Explanation:  any form of dar (“to give”) + un (“a”) + any word in the [azo] list

In summary, the Corpus del Español that I have created offers a wider range of searches than is possible with any other historical corpus of any language. This allows students in the online “History of the Spanish Language” course to investigate and describe an ever wider range of linguistic phenomena than has been possible in the past. All of this suggests that the time has passed when students needed to memorize long lists of overly-abstract rules of linguistic change from textbooks. Using state-of-the-art corpora of the type that I have described, the students themselves are now in control of extracting the data, and can by themselves find evidence for and describe a wide range of historical changes in the language.
The Montclair Electronic Language Database Project1
Eileen Fitzpatrick and M. S. Seegmiller
Montclair State University

Abstract
The Montclair Electronic Language Database (MELD) is an expanding collection of essays written by students of English as a second language. This paper describes the content and structure of the database and gives examples of database applications. The essays in MELD consist of the timed and untimed writing of undergraduate ESL students, dated so that progress can be tracked over time. Demographic data is also collected for each student, including age, sex, L1 background, and prior experience with English. The essays are continuously being tagged for errors in grammar and academic writing as determined by a group of annotators. The database currently consists of 44,477 words of tagged text and another 53,826 words of text ready to be tagged. The database allows various analyses of student writing, from assessment of progress over time to relation of error type and L1 background.
1 Introduction
A corpus of the productions of language learners provides authentic language data that can be analyzed and sampled for language performance. As Granger (1998) argues, the large size of a corpus, the naturalness of the data, and its computerization yield advantages that complement data collected in controlled experiments. Corpus data represents the kind of data that learners use naturally. In addition, the data is collected from many informants, giving it a broad empirical base that enables descriptions of learner language to be generalized. Because of the size of the data set, even infrequent features of learner language can be studied, as well as the avoidance of difficult features of the language. A carefully constructed corpus can provide representative samples covering the different variables affecting learner productions. The large size of a corpus also sets the stage for innovations in teaching methodology and curriculum development as students examine learner data and compare it to native speaker language. Most significant, the automated analysis of language has the “power to uncover totally new facts about language” (Granger 1998: 3). Language learner corpus building has been well established for more than ten years. Pravec (2002) discusses nine projects in Belgium, England, Hong Kong, Hungary, Japan, Poland, and Sweden, all of which represent the productions of foreign language learners.2 Many of these corpora are annotated, giving them additional research value. The annotations include information on part of speech, syntactic structure, semantic relations, and type of error.
These corpora provide models of language performance that can be used to test hypotheses about the process of second language (L2) acquisition, to design teaching materials for the L2 writer, to design a parser for L2 writing, and to check the L2 writer's grammar (Milton and Chowdhury 1994). The language learning experience of a foreign language learner is normally different from that of a second language learner, the latter being immersed in the language and required to use it on a daily basis. Indeed, Nickel (1989: 298) observes that the lack of a distinction between the foreign and second language learner has been partly responsible for the contradictory results, particularly with respect to transfer, in SLA research. However, to date there has been no effort to build corpora comparable to the aforementioned data on foreign language learners that represent the language of learners of English as a second language. The Montclair Electronic Language Database (MELD), under development at Montclair State University in the USA, aims to fill that gap in our understanding of the performance of English language learners. MELD differs from the cited corpora not only in its capture of second language data, but also in its method for annotating errors in the data, and in its goal of making the data publicly available for the building of resources and tools for language learners and for researchers in L2 acquisition. A publicly available corpus will enable analyses to be duplicated and results to be shared. A corpus is a large investment in time, money, and equipment and the lack of access to corpus data diminishes the advantages that these collections provide. This paper provides an overview of the MELD corpus, the annotation it provides, a discussion of its error annotation goals and techniques, sample applications using MELD data, and future plans for the project.
2 MELD Overview
The MELD corpus currently consists of formal essays written by upper level students of English as a Second Language preparing for college work in the United States. The corpus currently contains 44,477 words of text annotated for error and another 53,826 words waiting to be annotated. We expect to add another 50,000 words each year; if a funding source is found, we will accelerate this pace. Some of the essays are timed essays written in class; the rest are written at home at the students' own pace. Essays are either submitted electronically or transcribed from hand-written submissions. A record is kept as to how each essay was submitted and whether it was written in a timed or untimed situation. Timed essays are written in class in response to a general prompt such as “If you had a choice between traditional schooling and studying at home with a computer, which would you choose?” These writing tasks are given to each class on entering and exiting the course. Untimed essays are written outside of class in response to a question about a reading or topic discussed in class. Both the timed and the untimed essays vary widely in length.
Participating student authors sign a release form that permits us to enter their written work into the corpus throughout the semester. These students also complete a background form on native language, other languages, schooling, and extent and type of schooling in the target language, currently only English. The background data for each student is stored in a flat file that links to the essays by that student. The writing of 65 students is currently represented in the database. The L1 languages represented are Arabic, Bengali, Chinese (Mandarin and Taiwanese), Haitian Creole, Gujarati, Hindi, Malayalam, Polish, Spanish, and Vietnamese. Close to a quarter of the students are multilingual. A portion of the background data and text data is currently web accessible.3 MELD currently has a small set of tools to enable entry, viewing and manipulation of both the student author background data and the text data. The student authors fill out a form asking for 21 items of background data including gender, age, native and other languages, and venues and methods of learning English. We have developed a pop-up window tool to ensure accurate entry of these data. Another tool enables the user to view student background data and retrieve the essays written by that student. The data itself can also be viewed with the errors replaced by reconstructions. We hope that by using this viewer to remove low-level errors, annotator reliability might improve on errors that are more difficult to tag. We also have a crude concordancer that enables errors plus reconstructions to be viewed in context.
3 Data Annotation
3.1 Error Annotation
An important feature of MELD is the annotation of errors. Assuming that the goal of L2 learning is mastery of L1 performance, the value of a corpus of L2 productions lies in its ability to allow us to measure the distance between a sample of L2 writing and a comparable L1 corpus. Such a comparison also permits research into patterns of difference. The MELD annotation system allows such comparison. Many of the differences between L1 and L2 corpora can be observed by online comparison of the two. The work in Granger (1998), for example, shows differences in phrase choice (Milton), differences in complement choice (Biber and Reppen), and differences in word choice and sentence length (Meunier). An L2 corpus, however, also differs from a comparable L1 corpus in the number and type of morphological, syntactic, semantic, and rhetorical errors it exhibits, and this difference cannot be observed automatically; it requires the L2 text to be manually tagged for errors. To enable the researcher to find patterns, the individual errors must be tagged as errors and classified as to error type. Systems of error classification often use a predetermined list of error types (see, for example, the studies cited in Polio 1997). The Hong Kong corpus
(Milton and Chowdhury 1994) and the PELCRA corpus at the University of Lodz, Poland, use such a predetermined tagset (see Pravec 2002). The main advantage of a predetermined list of error types is that it guarantees a high degree of tagging consistency among the annotators. However, a list limits the errors recognized to those in the tagset. Our concern in using a tagset was that we would skew the construction of a model of L2 writing by using a list that is essentially already a model of L2 errors, allowing annotators to overlook errors not on the list. The use of a tagset also introduces the possibility that annotators will misclassify those errors that do not fit neatly into one of the tags on the list. In place of a tagset, our annotators “reconstruct” the error to yield an acceptable English sentence. Each error is followed by a slash and a minimal reconstruction of the error, and the pair is written within curly brackets. Missing items and items to be deleted are represented by "0". Tags and reconstructions look like this:
1. school systems {is/are}
2. since children {0/are} usually inspired
3. becoming {a/0} good citizens
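The tag format just described lends itself to simple automatic processing. As an illustration only (the MELD viewer itself is not described in enough detail to reproduce), the following Python sketch recovers both the student's original wording and the reconstructed wording from an annotated string; the function names and the regular expression are assumptions.

```python
import re

# {error/reconstruction}; "0" on either side marks a missing or deleted item.
TAG = re.compile(r"\{([^{}/]*)/([^{}/]*)\}")


def _tidy(text: str) -> str:
    """Collapse the double spaces left behind when a "0" item is dropped."""
    return re.sub(r"\s+", " ", text).strip()


def split_versions(annotated: str):
    """Return (raw, reconstructed) versions of an annotated passage."""
    def pick(group: int):
        def repl(match):
            item = match.group(group)
            return "" if item == "0" else item
        return repl

    raw = TAG.sub(pick(1), annotated)
    reconstructed = TAG.sub(pick(2), annotated)
    return _tidy(raw), _tidy(reconstructed)


def list_errors(annotated: str):
    """Return every (error, reconstruction) pair in the passage."""
    return [(m.group(1), m.group(2)) for m in TAG.finditer(annotated)]


print(split_versions("since children {0/are} usually inspired"))
# ('since children usually inspired', 'since children are usually inspired')
print(split_versions("becoming {a/0} good citizens"))
# ('becoming a good citizens', 'becoming good citizens')
print(list_errors("school systems {is/are}"))
# [('is', 'are')]
```

Keeping both versions recoverable from a single annotated file is what allows the same essay to be viewed either as written or with its errors replaced by reconstructions.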
The advantages of reconstruction over tagging from a predetermined tagset are that reconstruction is faster than classification, there is no chance of misclassifying, and less common errors are captured. An added benefit is that a reconstructed text does not pose the problems for syntactic parsers and part-of-speech taggers that texts with ungrammatical forms pose (though see section 3.3). We anticipate that a reconstructed text can be more easily parsed and tagged for part-of-speech information than the unreconstructed essays. Reconstruction, however, has its own difficulties. Without a tagset, annotators can vary greatly in what they consider an error. The wide discretion given to annotators results in annotation differences that run the gamut from the correction of clearly grammatical errors to stylistic revisions of rhetorical preferences. Even in the case of strictly grammatical errors, different annotators may reconstruct differently. For example, the common error represented in (4) can be reconstructed as either (5) or (6), and the less predictable (7) as either (8) or (9).
4. the student need help
5. the {student/students} need help
6. the student {need/needs} help.
7. We can also look up for anything that we might choose to buy,
8. We can also {look up/search} for anything that we might choose to buy,
9. We can also look {up/0} for anything that we might choose to buy,
We handle such discrepancies by adjudicating the tags as a team. Each text is tagged by two annotators, who then meet with a third annotator to discuss and resolve differences. For examples like (4) and (7), multiple reconstructions are
entered, although we are aware that cases like (7) have several possible reconstructions. More difficult issues involve grammatical rules that have a non-local domain. One recurring example involves the use of articles in English. For instance, the sentence
10. The learning process may be slower for {the/0} students as well
is correct with or without the article before students. However, the use of the indicates that a particular group of students had been identified earlier in the essay, whereas the absence of the indicates that students is being used in the generic sense. We choose to mark errors at the paragraph level; since no students had been identified earlier in the paragraph, we marked (10) as containing an error. Language that involves the imposition of a standard also presents difficulties for error tagging, primarily because the line between casual writing and academic writing is often fuzzy. Because of this vagueness, we are developing a list and using a different tag (square brackets) to annotate writing that violates an academic standard. Examples (11)-(12) illustrate this issue.
11. they learn how to interact with the other [kids/children]
12. [But/However] it doesn't take long for one to fit in
The blurred line between grammatical and rhetorical errors presents the most difficult error tagging problem. It is difficult to categorize examples like (13) and (14) as ungrammatical, yet the error in (13) fails to capture the rhetorical contrast of sad and happy while the choice of the present tense in (14) fails to adhere to tense concord.
13. they felt sad to live far from them {and/but} also happy because
14. Maybe I would have a problem that no computer {can/could} solve
3.2 Annotation Agreement
Consistency among annotators is crucial if the annotation is to be useful. However, the fuzzy nature of many L2 learner errors makes consistency a serious concern. We have conducted several experiments on tagging consistency both between the authors (Fitzpatrick and Seegmiller 2000) and among a group of ESL teachers (Seegmiller and Fitzpatrick 2002). The consistency measures we have used for these experiments included interrater reliability, precision, and recall. Interrater reliability (Polio 1997) measures the percentage of errors tagged by both annotators, which we calculate as 1 minus the number of cases where one tagger or the other, but not both,4 tagged an error divided by an average of the total number of errors tagged:
\[ \text{Reliability} = 1 - \frac{|T_1 \oplus T_2|}{(|T_1| + |T_2|)/2} \]

Here \(T_1\) and \(T_2\) are the sets of errors tagged by the two annotators, \(T_1 \oplus T_2\) is the set of errors tagged by one or the other but not both, and \(|T|\) is the number of errors in a set. This is the most stringent measure possible since we are calculating consistency on actual errors identified in common, not on number of errors identified, and we are not working from a predetermined set of errors, making every word and punctuation mark a target for an error tag. It was clear to us that our initial experiments might yield very low numbers and we could only hope that some basis for greater agreement would come out of the experiments.

Precision and recall are measures commonly used in evaluations of machine performance against a human 'expert'. We use these measures because they enable us to compare the performance of one annotator against the other so that we can address problems attributable to a single annotator. To obtain these measures, we arbitrarily assume one annotator to be the expert. Precision measures the percentage of the non-expert's tags that are accurate. It is represented as the intersection (∩) of the non-expert's (T2) tags with the expert's tags (T1) divided by all of T2's tags.

\[ \text{Precision} = \frac{|T_1 \cap T_2|}{|T_2|} \]

For example, if T1 tagged 25 errors in an essay and T2 tagged the same 25 errors but also tagged 25 more errors not tagged by T1, then T2's precision rate would be .5. Recall measures the percentage of true errors the non-expert found. It is represented as the intersection of the non-expert's (T2) tags with the expert's tags (T1) divided by all of T1's tags.

\[ \text{Recall} = \frac{|T_1 \cap T_2|}{|T_1|} \]

Following our example above, T2's recall would be 1.0 since T2 tagged all the items that T1 tagged. Precision and recall can be illustrated as in Figure 1, which shows one possible outcome of the performance of two annotators. The non-expert has achieved high precision in this task; most of the errors she tagged were identified by the expert as errors. However her recall rate is low; she missed about half of the errors identified by the expert.
We might expect the situation represented in Figure 1 if there are many low level grammatical errors that both annotators tagged as well as another type of error (e.g., errors involving academic writing standards) that T1 tagged but T2 did not. The precision and recall measures allow us to track the overzealous tagger and discover the source of a pattern of tagging disagreements.
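The three agreement measures can be computed directly from the two annotators' sets of tagged errors. The Python sketch below works through the 25/50 example given in the text; it assumes that each tagged error can be identified by its position in the essay (represented here by an integer), which is an assumption made for illustration, and the function name `agreement` is hypothetical.

```python
def agreement(expert: set, other: set) -> dict:
    """Precision, recall and interrater reliability for two sets of error tags.

    `expert` plays the role of T1 and `other` the role of T2.
    Both sets are assumed to be non-empty.
    """
    both = expert & other        # errors tagged by both annotators
    only_one = expert ^ other    # errors tagged by one annotator but not the other
    mean_tagged = (len(expert) + len(other)) / 2
    return {
        "precision": len(both) / len(other),
        "recall": len(both) / len(expert),
        "reliability": 1 - len(only_one) / mean_tagged,
    }


# The example from the text: T1 tags 25 errors; T2 tags the same 25 plus 25 more.
t1 = set(range(25))
t2 = set(range(50))
scores = agreement(t1, t2)

print(scores["precision"])              # 0.5
print(scores["recall"])                 # 1.0
print(round(scores["reliability"], 2))  # 0.33
```

In this example the reliability score (about .33) is much lower than either precision or recall would suggest on its own, which illustrates why the text calls it the most stringent of the three measures.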
Figure 1: Precision and Recall for an expert and a non-expert tagger.

Both experiments that we conducted into tagger agreement involved two tests. The first test let the annotators tag errors with no instruction. This was followed by a meeting in which the taggers established general guidelines for tagging that then guided test two. Table 1 shows the results of these two tests with the authors as annotators. Tables 2-4 show the pair-wise results among three ESL teachers (S, L, and N) who acted as taggers. The data sets were the same for both experiments; set one contained 2476 words, and set two 2418. The error counts indicated were those of the 'expert'; the teachers rotated as experts.

Table 1: Results with authors as annotators

Data set   Errors   Recall   Precision   Reliability
One        241      .73      .84         .54
Two        193      .76      .90         .60

Table 2: Results with J&L as annotators

Essay      Errors   Recall   Precision   Reliability
One        474      .54      .58         .39
Two        206      .57      .78         .49

Table 3: Results with J&N as annotators

Essay      Errors   Recall   Precision   Reliability
One        472      .58      .48         .23
Two        186      .37      .54         .27

Table 4: Results with L&N as annotators

Essay      Errors   Recall   Precision   Reliability
One        411      .65      .70         .37
Two        208      .60      .78         .36

These levels of agreement are clearly unsatisfactory, and have led to our present practice of resolving disagreements between annotators by adjudication with a third annotator. Unfortunately, this is expensive and slows the task considerably. Since taggers differ in the extent to which they mark stylistic and rhetorical features of the essays, another helpful solution has been to use a different type of mark for errors involving a written standard, as mentioned in the previous section. These errors, particularly errors involving punctuation, verb mood (if I [have/had] the chance), and certain lexical choices (the [kids/children] can) make up a large proportion of the disagreements. It has proven effective to separate these from the language acquisition errors.
3.3 Part-of-Speech Tagging
Since automatic part-of-speech (pos) tagging and parsing are built on models of grammatical English, we anticipated that reconstructing errors would aid in the application of these systems to our data. To date, one test of an automatic pos tagger, the Brill tagger (Brill 1995), has assessed the performance of an automatic system on a test set of both the uncorrected and corrected MELD data (Higgins 2002). The pos test included six essays, involving 1521 words of raw text and 1551 words of reconstructed text. Once difficulties with contractions and parentheses were removed, only 22 errors appeared in both sets of essays, an additional four appeared in the raw text alone and another two in the reconstructed text. This gives an error rate of .017 (26 errors in 1521 words, or 1.7 percent) on the raw text and .015 (24 errors in 1551 words, or 1.5 percent) on the reconstructed text. We assume that the high accuracy of the Brill tagger, even on the raw data, resulted from the highly proficient writing currently represented in MELD. We still assume that as we capture the writing of less proficient learners, the reconstruction of errors will aid the pos tagging. The small number of pos tagging errors indicates that automatic pos tagging is a reasonable enhancement to the MELD data. Equally encouraging is the fact that the most common pos tagging error, with 10 occurrences, was caused by the labeling of capitalized ordinal numbers as proper nouns by the Brill tagger.5
4 Possible Applications
MELD, at under 100,000 words, is still a small corpus. However, even with a small corpus, there are trends that we can observe, particularly if we look at the raw data. Looking at the smaller, tagged portion of the corpus, we can present research that is illustrative of what can be done with a tagged corpus.
5 Studies of Progress over a Semester6
Since the data in MELD include longitudinal data in the form of essays written by the same student over the course of a semester or more, one of the possible applications of the data is the study of changes in student writing over time. In this section, we will present some examples of the study of such changes using both the untagged and the tagged versions of the essays. When assessing students' writing over time, there are certain changes that we expect to find if our English-language program is working effectively. If we compare a timed essay written at the beginning of the semester with one written at the end, we would expect to find, among others, the following sorts of changes:
• Fluency will increase. That is, students will be able to write more easily, without having to stop and think about what to say and how to say it.
• Sentences will get longer. As students’ command of the target language increases, they will become more confident in their use of longer sentences.
• Sentences will become more complex. A sentence can be long but fairly simple, for example if it consists of several simple clauses joined by conjunctions (“John got up and he took a shower and he shaved and he got dressed”). But it is an indication of increasing mastery of the syntax of the L2 when students begin to use more complex sentence types (“After getting up but before getting dressed, John showered and shaved”). Sentence complexity is notoriously difficult to measure and many different approaches have been proposed, but one simple one is to count the number of clauses per sentence, measured by counting the number of verbs.
• Vocabulary will increase. It is difficult to measure any person’s total vocabulary. One approach is to count the number of different words used in a timed essay and take that as a rough measure of overall vocabulary. This approach assumes that students with a limited vocabulary will tend to use the same words over and over again, whereas students with a greater command of the language will be able to use a greater variety of words in an essay of limited length.
• The number of errors will decrease. For obvious reasons, this is usually taken as a standard measure of mastery of a language.
In this illustrative study, we used two essays from each of 23 students, for a total of 46 essays. One essay was written at the beginning of the semester and the other at the end, allowing us to measure what kinds of changes occur in the students’ writing during a rigorous ESOL writing course. The essays vary greatly in length, ranging from 86 to 377 words. The authors of the essays are from a variety of L1 backgrounds. Our analyses made use of several standard UNIX text-processing tools, although similar studies could be carried out with any of several software packages. It should be noted that since the results reported below are for purposes of illustration, we have not carried out any statistical calculations to determine which, if any, results are statistically significant. For our first study, we calculated the mean length of the essays and compared the essays written at the beginning of the semester (the Pre-Test) with those written at the end of the semester (the Post-Test). The results are shown in Table 5.
Table 5: Mean Number of Words per Essay
Pre-Test   Post-Test
189.1      236.8
As anticipated, the average number of words per essay increased (substantially, in fact), indicating that in a 20-minute timed essay, students were able to write much more at the end of the semester than they were at the beginning. Next, we counted mean sentence length, which provides a rough but easy measure of syntactic complexity. Table 6 shows the results for 23 students:
Table 6: Mean Sentence Length
                         Pre-Test   Post-Test
Mean words per sentence  18.2       18.8
While this is a far less dramatic result, we still get a change in the predicted direction: the number of words per sentence has increased. We then looked at a slightly more sophisticated measure of sentence complexity, the number of clauses per sentence, measured by simply counting the number of main verbs and dividing by the number of sentences:
Table 7: Number of Clauses per Sentence
                           Pre-Test   Post-Test
Mean clauses per sentence  3.6        3.4
There is actually a slight decrease in the number of clauses per sentence, a phenomenon that might deserve further investigation. The next logical step in investigating changes in sentence complexity would be to separate conjoined clauses from embedded clauses, since the latter are more complex. It is possible that the students are using fewer but more complex clauses, or perhaps they have simply learned that shorter sentences are more effective.
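The length and complexity measures above can also be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure used in the study: the sentence splitter is naive, the word/TAG input format and the set of Penn Treebank verb tags are assumptions, and counting every verb tag only approximates a count of main verbs.

```python
import re

# Penn Treebank verb tags (assumed tagset); a rough proxy for clauses.
VERB_TAGS = {"VB", "VBD", "VBG", "VBN", "VBP", "VBZ"}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def words(text):
    """Alphabetic word tokens; an apostrophe inside a word is kept."""
    return re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)


def mean_sentence_length(text):
    """Mean number of words per sentence in a raw (untagged) essay."""
    return len(words(text)) / len(split_sentences(text))


def verbs_per_sentence(tagged_text):
    """Verb tokens divided by sentences, for text in word/TAG format.

    Sentence boundaries are taken to be tokens tagged '.'.
    """
    tags = [tok.rsplit("/", 1)[-1] for tok in tagged_text.split()]
    n_verbs = sum(1 for t in tags if t in VERB_TAGS)
    n_sentences = sum(1 for t in tags if t == ".")
    return n_verbs / n_sentences


raw = ("John got up and he took a shower and he shaved and he got dressed. "
       "After getting up but before getting dressed, John showered and shaved.")
# Hand-assigned tags for the first example sentence (illustrative only).
tagged = ("John/NNP got/VBD up/RP and/CC he/PRP took/VBD a/DT shower/NN "
          "and/CC he/PRP shaved/VBD and/CC he/PRP got/VBD dressed/VBN ./.")

print(mean_sentence_length(raw))   # 13.0
print(verbs_per_sentence(tagged))  # 5.0
```

With these hand-assigned tags the proxy returns 5 for a sentence of four conjoined clauses, because the participle dressed is counted as a verb. This illustrates why the measure is rough and why separating conjoined from embedded clauses would need more than a tag count.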
Next we looked at changes in the number of errors in the essays. Errors are easy to count in the tagged essays. Here are the data:
Table 8: Errors
                                Pre-Test   Post-Test
Mean number of errors/sentence  1.46       1.34
Once again, we find the expected result: a decrease in the number of errors per sentence. Finally, we counted the number of different words used in each essay and then calculated the type/token ratio to control for the differing lengths of the essays. Table 9 shows the vocabulary results:
Table 9: Vocabulary
                       Pre-Test   Post-Test
Mean total words       189.1      224.5
Mean vocabulary        102.7      117.5
Mean Type/Token Ratio  1.81       1.91
We see that both the total vocabulary and the type/token ratio increase between the first and second essays. Incidentally, when UNIX is used, one of the byproducts of the measure of vocabulary is a word frequency count, which is a list of all the words in a text with the frequency of each, arranged from most to least frequent. This is an interesting document in its own right, and might be studied in various ways, for example to see how many unusual (as opposed to common) words a student uses. In one of our studies, we noticed that the relative frequency of the was about the same for speakers of Spanish as for those of English – the is typically the first or second most common word in the text – while for speakers of Japanese and Russian, the occurred much less frequently, often ranking as low as the fifteenth most frequent word.
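The vocabulary count and the word frequency list described above can be approximated as follows. This is a hedged sketch rather than the UNIX pipeline used in the study; the tokenizer and the function names are assumptions. The values reported in Table 9 are greater than 1, which suggests that they were obtained by dividing tokens by types, so the sketch returns the ratio in both directions.

```python
import re
from collections import Counter


def tokenize(text):
    """Lower-cased word tokens; an apostrophe inside a word is kept."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def vocabulary_profile(text):
    """Token count, type count, both ratios and a frequency list."""
    tokens = tokenize(text)
    counts = Counter(tokens)
    n_tokens, n_types = len(tokens), len(counts)
    return {
        "tokens": n_tokens,                       # total words
        "types": n_types,                         # different words
        "types_per_token": n_types / n_tokens,    # at most 1
        "tokens_per_type": n_tokens / n_types,    # at least 1
        "frequency_list": counts.most_common(),   # most to least frequent
    }


def rank_of(word, frequency_list):
    """1-based rank of a word in the frequency list, or None if absent."""
    for rank, (form, _count) in enumerate(frequency_list, start=1):
        if form == word.lower():
            return rank
    return None


essay = "John got up and he took a shower and he shaved and he got dressed"
profile = vocabulary_profile(essay)
print(profile["tokens"], profile["types"])   # 15 10
print(profile["tokens_per_type"])            # 1.5
print(profile["frequency_list"][:3])
print(rank_of("got", profile["frequency_list"]))
```

A call such as rank_of("the", ...) on each essay's frequency list would give the kind of rank comparison across L1 groups mentioned in the text.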
6 Research on Error Types by L1
Concordancing of the errors currently enables us to compare problematic points with points students have mastered. For example, the essays so far demonstrate difficulty in mastering the correct preposition in a prepositional phrase complement to a head noun, verb or adjective when there is some notion of movement involved, as the following data show.
even before he paid          {to/0} his Aunt Mary for his
have when they arrive        {to/in} the new country Entering into
four months of his arrival   {to/in} the United States so he
Mike and Mary departing      {to/for} the United States In
were sad to separate         {to/from} their son and daughter, but
had never been separated     {to/from} their family before Mr
see the parents separated    {to/from} them {0/,} they felt sad
at the time they separated   {to/from} them they felt sad
are closing the door         {from/to} Ireland I {0/am}going to write
sad to live far              {of/from} them [and/but] also happy
In contrast, the same data shows a good command of prepositional complements to abstract nouns and verbs:
they are taking the risk             of hurting their parent's feelings
time they also have fear             of taking risks in unknown
country may be thought               of as opening a door
never have the same relation         with their siblings They would
as family and their relationship     to their land and country
may never loose their relationship   with their family and friends
her life They were dreaming          about their future but they
happy life she always dreamed        of. As mentioned in the
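A concordance of this kind can be produced directly from the tagged text. The sketch below assumes that errors are marked as {error/correction}, with 0 standing for an empty slot, as in the lines above; the function name and the context width are illustrative choices, not part of the MELD tools.

```python
import re

# One {error/correction} annotation; neither side may contain '/' or braces.
TAG = re.compile(r"\{([^{}/]*)/([^{}/]*)\}")


def error_concordance(text, width=30):
    """Return (left context, error, correction, right context) per tag."""
    rows = []
    for m in TAG.finditer(text):
        left = text[max(0, m.start() - width):m.start()].strip()
        right = text[m.end():m.end() + width].strip()
        rows.append((left, m.group(1), m.group(2), right))
    return rows


line = "were sad to separate {to/from} their son and daughter"
for left, error, correction, right in error_concordance(line):
    # Right-align the left context so that the tags line up in a column.
    print(f"{left:>30}  {{{error}/{correction}}}  {right}")
```

Filtering the rows on the error or correction field (for example, keeping only rows whose correction is from) gives a view of one problem at a time.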
Coupled with the demographic information, the error tagging also permits the correlation of grammatical properties with speaker differences. For example, in our still small data set, we see the following errors in concord between tenses for native speakers of Spanish and Gujarati.
Spanish:
Mr. and Mrs. Feeney worked their whole lives to {gave/give} a good education
they understood that Mary and Michael {can/could} have a better future
The risks they take occur when they {went/go} to the United States
Gujarati:
he would send money home when he {would start/started} earning
it might not be able to see to help that person whenever they {will/0} {need/needs}
If Michael and Mary {will be/are} successful
they are not {use/used} to different types of work
they got used to it as time {passes/passed} by.
They were dreaming, but they {do/did} not know what {is/was} going to happen.
However, in a 2,300 word sample produced by four native Spanish speakers there were only 6 such errors, while in the same size sample produced by four Gujarati speakers there were 31 such errors, a roughly five-fold difference in the frequency of tense-concord errors. While the samples are small, the difference is striking and the grammatical phenomenon – concord between tenses – would probably go unnoticed without the systematic view of the data given by the corpus.
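The comparison between the two groups can be restated as a normalized rate. The short calculation below uses only the figures given in the text (6 and 31 errors in samples of 2,300 words each); the per-1,000-word normalization is our own choice for illustration.

```python
# (tense-concord errors, sample size in words), as reported in the text.
samples = {"Spanish": (6, 2300), "Gujarati": (31, 2300)}


def per_thousand(errors, n_words):
    """Errors per 1,000 words of running text."""
    return 1000 * errors / n_words


rates = {l1: per_thousand(e, n) for l1, (e, n) in samples.items()}
ratio = rates["Gujarati"] / rates["Spanish"]

for l1, rate in rates.items():
    print(f"{l1}: {rate:.1f} errors per 1,000 words")
print(f"ratio: {ratio:.2f}")
```

This gives about 2.6 and 13.5 errors per 1,000 words, a ratio of about 5.2. As noted for the other measures in this chapter, no significance test has been applied.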
7 Preparation of Instructional Materials
The type and frequency of error by level or native language background guides the teacher to writing problems which students of a comparable level and background need to work on. One can also compare the corpus to the work of comparable, proficient native writers to discover gaps in the L2 writing and develop materials accordingly. The corpus can also be used for testing purposes since it allows testing to be targeted to specific levels and language backgrounds. Several types of corpus-based exercises for students have been developed (e.g., Milton 1998), though they are not widely available. A publicly available corpus will enable more exercises of this type. Students can also use portions of the corpus for proofreading exercises, with the reconstructed text available for checking. Certain types of error can be ‘turned off’ so that the student sees only the type of usage s/he needs to master. The student can then compare corrections with those of the annotator.
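A proofreading exercise of the kind just described needs two views of the same tagged passage: the learner's original wording and the annotator's reconstruction. The following sketch derives both from {error/correction} tags, assuming that 0 marks an empty slot; the function names are hypothetical. Turning off particular error types would additionally require type labels on the tags, which the format shown in this chapter does not include.

```python
import re

# One {error/correction} annotation; neither side may contain '/' or braces.
TAG = re.compile(r"\{([^{}/]*)/([^{}/]*)\}")


def _resolve(tagged, group):
    """Replace each tag by its error side (group 1) or correction (group 2)."""
    def pick(match):
        form = match.group(group)
        return "" if form == "0" else form   # '0' marks an empty slot

    text = TAG.sub(pick, tagged)
    text = re.sub(r"\s+", " ", text)              # collapse leftover spaces
    text = re.sub(r"\s+([,.;:!?])", r"\1", text)  # no space before punctuation
    return text.strip()


def learner_text(tagged):
    """The passage as the student wrote it (exercise version)."""
    return _resolve(tagged, 1)


def reconstructed_text(tagged):
    """The passage with the annotator's corrections (answer key)."""
    return _resolve(tagged, 2)


tagged = "even before he paid {to/0} his Aunt Mary"
print(learner_text(tagged))        # even before he paid to his Aunt Mary
print(reconstructed_text(tagged))  # even before he paid his Aunt Mary
```

The student would correct the first version and then compare the result with the second.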
8 Conclusion
MELD is a small but growing database of learner writing. It is accessible online to anyone who wishes to use it, and the tools for searching and analyzing the data will continue to be expanded. We also hope to add data from other institutions, as well as spoken data from L2 learners. Along with the gradual increase in tagged data, we plan to enhance access to MELD and build tools that will enhance the usefulness of the data. We anticipate bringing certain tools online in the near future; some tool development requires funding that puts it beyond our immediate capability. Among our immediate goals are improved online access to the data, including the use of a concordancer to view errors and reconstructions, automatic part-of-speech annotation as a user option, and the addition of data from different ESL skill levels. Our long-range plans include a statistical tool to correlate error frequency with student background; student editing aids, most specifically a grammar checker using our current data as a model; and – dream of dreams – the addition of L2 spoken data. The data in MELD can be used for a variety of both research and educational purposes, including the study of L2 acquisition and the preparation of teaching materials. It is our hope that MELD will prove to be a valuable resource to our colleagues in the field of second-language acquisition and teaching.
Notes
1. We wish to thank the master teachers Jacqueline Cassidy, Norma Pravec and Lenore Rosenbluth, who contributed careful labor and thoughtful discussion in providing a tagged data set and tagging guidelines; the graduate student annotators Jennifer Higgins, Donna Samko and Jory Samkoff; and the programmers and data entry personnel Jennifer Higgins and Kae Shigeta.
2. The corpora created in England (the Cambridge Learner Corpus and the Longman Corpus) represent the writing of students in non-English-speaking countries.
3. Student background data and essays are available at http://www.chss.montclair.edu/linguistics/MELD.
4. ⊕ is to be interpreted as 'exclusive or', indicating that if one tagger marked a feature as an error, the other tagger did not.
5. The Brill tags are based on the manually tagged labels of the Penn Treebank (Marcus et al. 1993), which labels all the items in a name like First National City Bank as proper nouns, giving First, Second, etc. a high frequency as a proper noun.
6. Some of the material in this section was presented in Seegmiller et al. (1999).
References
Biber, D. and R. Reppen (1998), Comparing native and learner perspectives on English grammar: A study of complement clauses, in S. Granger (ed.), Learner English on computer, London: Longman, pp. 145-158.
Brill, E. (1995), Transformation-based error-driven learning and natural language processing: A case study in part-of-speech tagging, Computational Linguistics, 21 (4): 543-566.
Fitzpatrick, E. and M.S. Seegmiller (2000), Experimenting with error tagging in a language learning corpus, The Second North American Symposium of the American Association for Applied Corpus Linguistics, Northern Arizona University, Flagstaff, March 31-April 2.
Granger, S. (ed.) (1998), Learner English on computer, London: Longman.
Granger, S. (1998), The computer learner corpus: A versatile new source of data for SLA research, in S. Granger (ed.), Learner English on computer, London: Longman, pp. 3-18.
Higgins, J. (2002), Comparing the performance of the Brill Tagger on corrected and uncorrected essays. http://picard.montclair.edu/linguistics/MELD/pos.html.
Marcus, M., B. Santorini, and M. Marcinkiewicz (1993), Building a large annotated corpus of English: The Penn Treebank, Computational Linguistics, 19 (2): 313-330.
Meunier, F. (1998), Computer tools for the analysis of learner corpora, in S. Granger (ed.), Learner English on computer, London: Longman, pp. 19-38.
Milton, J. (1998), Exploiting L1 and interlanguage corpora in the design of an electronic language learning and production environment, in S. Granger (ed.), Learner English on computer, London: Longman, pp. 186-198.
Milton, J. and N. Chowdhury (1994), Tagging the interlanguage of Chinese learners of English, in L. Flowerdew and A.K.K. Tong (eds), Entering text, Language Centre, The Hong Kong University of Science and Technology.
Nickel, G. (1989), Some controversies in present-day error analysis: “Contrastive” vs. “non-contrastive” errors, International Review of Applied Linguistics, 27: 292-305.
Polio, C. (1997), Measures of linguistic accuracy in second language writing research, Language Learning, 47: 101-143.
Pravec, N. (2002), Survey of learner corpora, ICAME Journal, 26: 81-114.
Seegmiller, M.S. and E. Fitzpatrick (2002), Practical aspects of corpus tagging, in B. Lewandowska-Tomaszczyk and P.J. Melia (eds), PALC ’01: Practical applications in language corpora, New York: Peter Lang.
Seegmiller, M.S., E. Fitzpatrick, and M. Call (1999), Assessing language development: Using text-processing tools in second-language teaching and research, MEXTESOL, Mazatlan, MX.
Bridging the Gap between Applied Corpus Linguistics and the Reality of English Language Teaching in Germany
Joybrato Mukherjee
University of Giessen
Abstract
The starting point for the present paper is the results of a survey among English language teachers in German secondary schools. The survey shows that the practice of English language teaching in Germany is still largely unaffected by descriptive corpus-linguistic research into authentic language use and applied corpus-linguistic suggestions of using corpus resources and corpus-based methods for teaching purposes. In the light of this gap between applied corpus linguistics and the reality of English language teaching in Germany, it is suggested that a concerted effort is needed to popularise the language-pedagogical potential of corpus linguistics, preferably under the auspices of the local state teaching boards. In this context, particular attention should be paid to the preconceptions and needs of the vast majority of teachers who, for a variety of reasons, have not yet worked with corpora. In particular, it is necessary to implement teacher-centred corpus activities in the classroom before truly learner-centred methods are envisaged.
1 Introduction
Corpus linguists have shown a persistent interest in the language-pedagogical implications and applications of corpus-based research for several decades. The COBUILD project, resulting in a new generation of learner dictionaries (see Sinclair 1987), the early coinage of the notion of “data-driven learning” (see Johns 1991) and the compilation and analysis of learner corpora such as ICLE (see Granger 1998) provide ample testimony of this fact. At first blush, then, one might readily expect that the multitude of suggestions on how to use corpus data, corpus-based resources and corpus-linguistic methods in the English language classroom (see Burnard and McEnery 2000; Aston 2001; Mukherjee 2002) has already revolutionised – or is just about to do so – the way in which English is taught and learned as a foreign language. However, in Germany (and probably in many other countries as well) this turns out to be wishful thinking. In reality, the influence of applied corpus-linguistic research on the actual practice of English language teaching is still relatively limited. Tribble (2000: 31), for example, admits that “not many teachers seem to be using corpora in their classrooms.” In order to empirically assess the extent to which English language teachers in Germany make use of – and actually know about – corpora, I conducted a survey in which 248 qualified English language teachers at secondary schools in North Rhine-Westphalia, by far the most populous federal state of Germany, took part.1 The survey data were collected in the context of several advanced teacher training workshops on corpus linguistics for qualified English language teachers that took place in 2001 and 2002. The idea to conduct such test workshops arose out of the desire to: firstly, introduce teachers of English at secondary schools to basic principles and methods in corpus linguistics; secondly, familiarize them with language-pedagogical applications and implications of corpus-based research; thirdly, find out what they know about corpus linguistics before the test workshop and what they think about the relevance of corpus linguistics to their own classroom practice after the test workshop. I am using the term test workshop in this context because at this stage the workshops were offered and carried out on an ad hoc basis whenever particular schools were willing to host such workshops for their English teachers as voluntary participants. In total, eight half-day test workshops took place. They were designed in slightly different ways in order to find out which of the formats would be most appropriate for an institutionalized introductory workshop to be offered to interested teachers by the local state teaching board (see section 3). What they had in common was the overall structure:
• In a lecture of about one hour, the participants were provided with a general introduction to some key concepts in corpus linguistics (e.g., major corpora of present-day English, the notion of representativeness, word-lists and concordances).
• In a seminar of about an hour, the participants were provided with selected findings from corpus-based research (e.g., concordances) that they had to compare with the descriptive statements given in traditional school textbooks and learner grammars. Of course, the examples focused on those fields in which there is a clear discrepancy between corpus data and traditional learner grammars (e.g., with regard to the use of some and any) or in which corpus data would give access to data not available otherwise (e.g., frequent lexicogrammatical patterns of a given word).
• In a practical part of about two hours, the participants were introduced to some applications of corpus data in the classroom that have been discussed in applied corpus linguistics (e.g., the production of concordance-based exercises). Also, some problems of usage were discussed in the light of corpus data (e.g., the question as to whether example for, typical for etc. – instead of example of, typical of etc. – occur in native usage or not).
All participants in the test workshops were asked to fill in a questionnaire; some questions were asked before the workshop started, others at the end of the workshop. It is the result of this survey to which I will turn in the following section. The participants’ answers reveal that corpus-based methods have not yet exerted much influence on teaching practice in the English classroom in Germany. After discussing the survey results (see section 2), I will sketch out how corpus linguistics may be popularised in the German context (see section 3), which may best be achieved by taking into consideration and focusing on the average teacher’s preconceptions and needs (see section 4). Finally, I will offer a few concluding remarks on the implications of the survey data and the experiences from the test workshops.
Table 1: The role of corpus linguistics in English language teaching in Germany: some survey data
1) Before the workshop: Are you familiar with corpus linguistics?
• Yes, I am familiar with corpus linguistics (> university studies): 27 (10.9%)
• No, I am not familiar with corpus linguistics but I have already heard of it (> colleagues, books/articles, conferences, etc.): 24 (9.7%)
• No, I don’t know anything about corpus linguistics: 197 (79.4%)
2) After the workshop I: Do you think that teachers and/or learners may profit from corpus data?
• Yes, both teachers and learners: 32 (12.9%)
• Yes, but only teachers: 208 (83.9%)
• No: 8 (3.2%)
3) After the workshop II: In which particular fields would you consider consulting or using corpus data in the future? (multiple answers possible)
• Creation of concordance-based teaching material (> teaching of collocations, patterns, spoken/written differences, etc.): 212 (85.5%)
• Correction of classtests (> acceptability/idiomaticity of collocations, patterns, phrases, etc.): 137 (55.2%)
• Creation of word/phrase lists for individual text collections (> set books, texts in ‘bilingual subjects’ such as history and geography in English medium, etc.): 130 (52.4%)
• (Other teacher-centred activities): 128 (51.6%)
• Learner-centred activities (> consultation of corpus data, small-scale corpus studies, corpus browsing, large-scale term papers, etc.): 29 (11.7%)
2 The Role of Corpus Linguistics in the English Classroom in Germany: What Survey Data Show
Some of the questions that the teachers were asked before and after participating in one of the eight test workshops and the answers they gave are listed above in Table 1. Before the workshop on applied corpus linguistics, the participants were asked about their previous knowledge of corpus linguistics in general. The answers to the first question in Table 1 paint a bleak picture: some 80% of the participating teachers had not come across corpus linguistics before. Under the assumption that the survey trend is more or less representative, the answer to the very first question nicely illustrates the low extent to which corpus linguistics has so far had an impact on teaching practice in Germany. After the workshop, all participants were asked to answer several questions on the role that corpus linguistics may play in English language teaching in general and in their own classrooms in particular, including questions 2) and 3) in Table 1 above. The distribution of answers to the second question shows that virtually all participants (i.e. more than 95%) do think that English language teaching may profit in one way or another from the advent of corpora. Note, however, that most teachers would only consider making use of corpus data and corpus-based methods themselves. That learners should have access to corpus data as well is not viewed as a fruitful idea by the vast majority. It should be noted that this bias towards teacher-centred corpus activities holds true for the majority of participants in all the test workshops, regardless of whether the focus was more on teacher-centred or on learner-centred techniques.
In a sense, this sheds light on an important clash between applied corpus-linguistic research and the average teacher’s point of view; while in applied corpus linguistics, there is an increasing tendency to focus on corpus-based activities carried out by increasingly autonomous learners (see Bernardini 2000; Gavioli 2001), most teachers think that corpus data are particularly useful for themselves. This is corroborated by the answers to the third question in Table 1. In the test workshops, a wide range of language-pedagogical applications of corpora were introduced and exemplified – from teacher-centred activities such as the creation of concordance-based teaching material, as described by Flowerdew (2001), Granger and Tribble (1998) and many others, to learner-centred activities such as serendipitous corpus browsing, as sketched out by Bernardini (2000) and others. At the end of the workshops, the participants were asked to list those activities that they found particularly useful and that they intended to put into practice in their own classrooms. The important point here is that most teachers, in answering the third question, exclusively focused on teacher-centred activities and showed that learner-centred activities would presumably have no place in their classrooms.2 In conclusion, the results of the survey show quite clearly that the use of corpora, which may have become mainstream in English linguistics, is so far not at all central to the practice of English language teaching in Germany. On the contrary, only about one in five of the teachers surveyed had come across corpus linguistics in the first place. Paradoxically, most of the teachers who took part in the survey reported using corpus-based resources, especially corpus-based dictionaries. When they were asked which monolingual English dictionary they tend to use for reference purposes, some 80% listed one of the following corpus-based dictionaries: Collins COBUILD English Dictionary, Oxford Advanced Learner’s Dictionary, Longman Dictionary of Contemporary English and Cambridge International Dictionary of English. This finding indicates that, on the one hand, most teachers do use corpus-based dictionaries but that, on the other hand, they are not aware of the corpus-linguistic background of these products, i.e. the fact that these dictionaries are based on the quantitative analysis of large and representative samples of naturally occurring language. The same holds true, by the way, for corpus-based learner grammars such as Ungerer’s (1999) Englische Grammatik Heute. This grammar is increasingly used by teachers and learners alike in Germany, but the author’s comments in the preface on the role of the British National Corpus as a major database of this grammar usually go unnoticed. Here we thus encounter a second gap between corpus-linguistic research and teaching practice. Without any doubt, corpus-based insights into actual language use have already exerted an enormous influence on dictionaries, grammars and modern textbooks that are used by teachers and learners. However, most teachers do not know that many differences between these modern materials and older ones are caused by corpus data and their implications for language teaching.
For example, most teachers were surprised when they were told in the workshop that the order in which irregular verbs are taught in modern teaching materials in Germany is largely based on corpus findings, especially those presented by Grabowski and Mindt (1995). The gap between the rapid development of applied corpus linguistics and its influence on modern classroom resources on the one hand and the average English teacher’s knowledge on the other can only be bridged if many more English language teachers are systematically familiarized with the basic foundations, implications and applications of corpus linguistics. This brings me to the need for a large-scale popularization of corpus linguistics among English teachers in Germany (and probably elsewhere too). Most importantly, it is obvious that learners will only get access to corpus data if teachers themselves work with corpora and make them available to their students.
3 The Need for Popularization
According to Aston (2000), there are three fields in which corpus data prove relevant to English language teaching:
● teaching about corpora, as corpus linguistics finds its way into university linguistics curricula;
● exploiting corpora to teach languages, linguistics, and potentially other subjects;
● teaching to exploit corpora, so that learners can explore them for their own purposes. (Aston, 2000: 7)
However, whether or not corpus linguistics is really about to ‘find its way into university curricula’ is open to discussion – at least when it comes to Germany. Even today, it is still perfectly possible for each and every student of English language and literature in virtually all English departments in Germany to take a university degree without ever having delved into corpus linguistics. Thus, it is important to keep in mind that for the time being – and in the foreseeable future – most newly-fledged English teachers enter schools with anything but a detailed knowledge about corpus linguistics. What is more, if most teachers lack this knowledge, they cannot be expected to exploit corpora to teach languages nor to teach [their students] to exploit corpora. In the light of the fact that university curricula usually do not include an obligatory corpus-linguistic module, a promising short-term solution to this problem is to offer introductory workshops for qualified English language teachers. The test workshops in which the participants in the survey on which I reported in sections 1 and 2 took part are examples of such a ‘quick-and-dirty’ introduction to corpus linguistics for qualified English language teachers. If, however, the target audience of such workshops are qualified and experienced English language teachers – and not, say, students of English language and literature – it is of paramount importance to offer teachers realistic and easily applicable corpus-based solutions to significant problems that they have been facing in their classrooms. I would contend that the use of corpus data only becomes popular if teachers immediately see the advantage of using corpus data in order to solve existing problems. Involving learners in corpus-based activities continues to be a vital objective, but I would regard it as a second step which teachers will only take after being convinced of the usefulness of corpus data for solving their own teaching problems. 
In picking up on Aston’s (2000) systematization above, I have already outlined elsewhere (see Mukherjee 2002: 118) that it is the teachers to whom particular attention should be paid in this process of popularization. As shown in Figure 1, teachers have to be trained in applied corpus linguistics first because only they can be expected to introduce corpus-linguistic methods in the classroom and to involve learners in corpus-based activities. The ultimate objective remains, of course, to make learners work with corpora autonomously.
Teacher education / teacher training:
=> teaching about corpora
=> exploiting corpora to teach language
=> teaching to exploit corpora
Teachers
English language classroom:
=> teaching about corpora
=> exploiting corpora to teach language
=> teaching to exploit corpora
Students
Learner autonomy:
=> autonomous use of corpus data
Figure 1: From corpus-experienced teachers to autonomous learners
There is no point in ignoring the fact that most teachers have no prior knowledge about corpus linguistics. Any effort to popularize the language-pedagogical use of corpus data can thus be successful only if we re-focus on such teachers’ preconceptions and needs. To this end, I would now like to turn in more detail to some aspects of a workshop program for qualified English teachers – aspects that most participants in the eight test workshops found particularly useful and that motivated them to get involved with corpus-linguistic methods. In due course, it is intended to officially include this workshop on corpus linguistics in the teacher training programme which is offered by one of the local state teaching boards in North Rhine-Westphalia. In principle, this institutionalized workshop will then be open to any qualified English language teacher.4
4 Refocusing on Teachers’ Preconceptions and Needs
The one-day workshop on corpus linguistics will consist of three parts which mirror Aston’s (2000) systematization. As shown in Table 2, the focus of the first part – ‘teaching about corpora’ – is not only on some basic issues of corpus linguistics but also on corpus-based findings that even experienced English language teachers find surprising. This part is thus called the surprise-the-teacher module. The eight test workshops (see sections 1 and 2) have shown that this approach makes teachers want to learn more about corpus linguistics right from
Joybrato Mukherjee
the beginning. For example, some 90% of the English language teachers surveyed mark the following sentence as wrong because it violates, in their view, the school-grammar rule which states that there should be no would in if-clauses:
(1) “I would be grateful if you would send me more specific information.”
=> marked as wrong: 221 (89.1%)
=> not marked: 27 (10.9%)
There are many other examples of corpus-based findings that call into question the way English language teachers go about correctness in learner language. Specifically, the discussion of such examples makes it clear to all teachers that their own intuition is often at odds with linguistic reality.
As shown in Table 2, the second module is about exploiting corpora to teach language. The approach here is called help the teacher because special emphasis is placed on practical problems that virtually all teachers have to face. Picking up on the issue of correctness and correction, corpus data are shown to be useful resources for the teacher because they, for example, provide information on whether particular phrases are idiomatic and instantiate native speakers’ “preferred ways of putting things” (see Kennedy 1992: 335) or not.

Table 2: Modules of a one-day workshop on corpus linguistics for qualified English language teachers: an overview
Module | Aspect | Approach
Module I | teaching about corpora → basic notions: corpus design, major corpora, authenticity, representativeness etc. | "surprise the teacher" → corpus-based findings that run counter to preconceived ideas: e.g., would in if-clauses
Module II | exploiting corpora to teach language → idiomaticity, native-like selection, spoken vs. written English, genre differences etc. | "help the teacher" → using corpus data to solve teaching problems: e.g., correction of class tests
Module III | teaching to exploit corpora → learner autonomy, data-driven learning, media literacy etc. | "pass it on to the learner" → involving learners: e.g., identification of genre-specific realisations of moves
Many examples of such usage problems, especially in written school work, are discussed in this part of the workshop. Another aspect that is covered in this section is the corpus-based teaching of spoken English. In Germany, many colleagues use the derogatory term Abiturspeak – with Abitur being the German A-levels – to refer to the phenomenon that many advanced learners leave school without being sufficiently able to use natural spoken English:
Leider ist das in den Klassenzimmern anzutreffende Englisch in der Regel die geschriebene Sprache, mündlich angewendet. [Unfortunately, it is written English, used in the spoken medium, that we usually encounter in the classroom.] (Kieweg 2000: 8; my translation)
In fact, learners very often speak just as they write. Many teachers are aware of this problem, and in an institutionalized workshop on corpus linguistics, it should be our intention to capitalize on their classroom experience and show them how they can use corpus data in order to identify, for example, frequently occurring spoken items and patterns. The principal objective of this module is, of course, to provide teachers with hands-on practical experience so that they regard corpus data not just as a recent (but useless) trend in language pedagogy but as a helpful, problem-solving resource.
It is only in the last module (see Table 2) that the emphasis will be shifted to learners’ interaction with corpora. As pointed out in section 2, most teachers remain sceptical about learner autonomy in this field, and the only thing that we aim at in this last section is to provide some sort of topic-opener in this regard. However, it should be noted that even among the sceptical majority of teachers some applications turn out to be more convincing than others.
For example, Henry and Roseberry’s (2001) corpus-based genre approach to language teaching is a method that some participants in the test workshops have already tried out in their own classrooms; this method is therefore a good candidate for inclusion in the third module of an institutionalised workshop.5
5 Concluding Remarks
I hope to have shown that many qualified English language teachers in Germany do not know very much, if anything at all, about the rapid developments in corpus linguistics and its language-pedagogical applications. Against this background, I have tried to sketch out how the use of corpus data may become more popular among teachers in the German context. Let me emphasise once again that there is, at present, a large gap between the wealth of applied corpus-linguistic research and the teaching practice in Germany which so far has only been affected to a very limited extent by this research. Closing this gap is a challenge to applied corpus linguists and, perhaps more importantly, to those who are involved in teacher training (both for trainee and qualified teachers). In trying to meet this
challenge, special emphasis should be placed on the average teacher’s preconceptions and practical needs. For, as Kettemann (1997: 70) correctly points out, it is only by updating teachers’ brainware that we can change teaching practice in the English language classroom.
I should think that the kind of institutionalized workshop that is envisaged in the present paper would help to popularize corpus-based methods in the English classroom not only in Germany but also in other countries with English as a foreign language (EFL). While the overall modular design may be picked up on in virtually all EFL countries, some aspects would need to be adapted to each individual country. For example, it would be useful to take into account the typical learner errors that are caused by structural differences between the learners’ native language and English and to focus on corpus-based methods that may help to iron out those typical cross-linguistic interferences. Also, it is quite clear that the kind of workshop suggested in the present paper is based on the language-pedagogical concepts of authentic language use, inductive learning and learner autonomy. While the corpus-based, data-driven approach to language learning is perfectly in line with English curricula in Germany, one would need to modify the workshop if curricular frameworks for English language teaching in other EFL countries are fundamentally different (e.g., by emphasizing written language use and deductive language learning).
Notwithstanding these caveats, corpus linguistics will find its way into the reality of English language teaching in all EFL countries only if qualified English teachers, and not just students of English language and literature, are trained on the job. The institutionalisation of introductory workshops may offer a way forward from the present gap between applied corpus-linguistic research and the reality of English language teaching.
Notes
1.
Whether the population of 248 teachers is a truly representative sample of the entirety of all English teachers is, of course, open to debate. However, since the teachers were randomly selected, the general trends are, in my view, indicative of similar trends in the whole teacher population. There is no doubt that further research, including longitudinal studies, is needed.
2.
It should be noted in passing – and this does not come as a surprise – that there is a significant correlation between the age group of the participants and their willingness to let their students work with corpora autonomously. But since the average age of secondary school teachers in Germany is just below fifty, it goes without saying that most teachers belong to the group that is rather sceptical about learner-centred activities. For example, only 3 of 98 teachers of 50 to 65 years of age (3.1%) mentioned learner-centred activities in answering the third question in Table 1, while 25 of 46 teachers of up to 30 years of age (54.3%) did. Unsurprisingly, too, 23 of the 27 teachers (85.2%) that had already been familiar with corpus linguistics before taking part in the test workshop were 30 years of age or younger. No-one in the 50+ age group, on the other hand, considered himself/herself to be already familiar with corpus linguistics.
3.
In fact, most of my students in Giessen and – until recently – in Bonn are not very keen on linguistic branches that make use of computers; I agree with Seidlhofer (2000: 208) that “most of our undergraduates are genuinely technophobic.” This negative attitude towards the computer-based description and analysis of language does not usually change once these students have obtained their degree and become trainee teachers and – eventually – qualified teachers.
4.
In this context, I am particularly grateful to Jan-Marc Rohrbach for sharing – and discussing – with me his classroom experience and to Kunibert Broich for helping to pave the way for an institutionalisation of such a workshop on corpus linguistics.
5.
In most cases, however, the teacher remains strongly involved in the corpus-based activities and we can thus not speak of true learner autonomy, as for example Rohrbach’s (2003) illuminating report on a corpus-based genre approach to the production of travel brochures in class 9 shows. Nevertheless, the workshop is considered to be more than successful if teachers are enabled – and willing – to work with corpora themselves, which is a prerequisite for corpus-based activities on the part of the learners somewhere down the line.
References
Aston, G. (2000), Corpora and language teaching, in L. Burnard and T. McEnery (eds), Rethinking language pedagogy from a corpus perspective: Papers from the Third International Conference on Teaching and Language Corpora, Frankfurt am Main: Peter Lang, pp. 7-17.
Aston, G. (ed.) (2001), Learning with corpora, Houston, TX: Athelstan.
Bernardini, S. (2000), Systematising serendipity: Proposals for concordancing large corpora with language learners, in L. Burnard and T. McEnery (eds), Rethinking language pedagogy from a corpus perspective: Papers from the Third International Conference on Teaching and Language Corpora, Frankfurt am Main: Peter Lang, pp. 225-234.
Burnard, L. and T. McEnery (eds) (2000), Rethinking language pedagogy from a corpus perspective: Papers from the Third International Conference on Teaching and Language Corpora, Frankfurt am Main: Peter Lang.
Flowerdew, L. (2001), The exploitation of small learner corpora in EAP materials design, in M. Ghadessy, A. Henry, and R.L. Roseberry (eds), Small corpus studies and ELT: Theory and practice, Amsterdam: John Benjamins, pp. 363-379.
Gavioli, L. (2001), The learner as researcher: Introducing corpus concordancing in the classroom, in G. Aston (ed.), Learning with corpora, Houston, TX: Athelstan, pp. 108-137.
Grabowski, E. and D. Mindt (1995), A corpus-based learning list of irregular verbs in English, ICAME Journal 19: 5-22.
Granger, S. (ed.) (1998), Learner English on computer, London: Longman.
Granger, S. and C. Tribble (1998), Learner corpus data in the foreign language classroom: Form focused instruction and data-driven learning, in S. Granger (ed.), Learner English on computer, London: Longman, pp. 199-209.
Henry, A. and R.L. Roseberry (2001), Using a small corpus to obtain data for teaching a genre, in M. Ghadessy, A. Henry, and R.L. Roseberry (eds), Small corpus studies and ELT: Theory and practice, Amsterdam: John Benjamins, pp. 93-133.
Johns, T. (1991), Should you be persuaded: Two examples of data-driven learning materials, English Language Research Journal, 4: 1-16.
Kennedy, G. (1992), Preferred ways of putting things with implications for language teaching, in J. Svartvik (ed.), Directions in corpus linguistics: Proceedings of Nobel Symposium 82, Berlin: Mouton de Gruyter, pp. 335-373.
Kettemann, B. (1997), Der Computer im Sprachunterricht, in M. Stegu and R. de Cilia (eds), Fremdsprachendidaktik und Übersetzungswissenschaft: Beiträge zum 1. verbal-workshop, Dezember 1994, Frankfurt am Main: Peter Lang, pp. 63-72.
Kieweg, W. (2000), Zur Mündlichkeit im Englischunterricht, Der fremdsprachliche Unterricht Englisch 34 (5): 4-9.
Mukherjee, J. (2002), Korpuslinguistik und Englischunterricht: Eine Einführung, Frankfurt am Main: Peter Lang.
Rohrbach, J-M. (in press), Don’t miss out on Göttingen’s nightlife: Genreproduktion im Englischunterricht, Praxis des neusprachlichen Unterrichts, 50.
Seidlhofer, B. (2000), Operationalizing intertextuality: Using learner corpora for learning, in L. Burnard and T. McEnery (eds), Rethinking language pedagogy from a corpus perspective: Papers from the Third International Conference on Teaching and Language Corpora, Frankfurt am Main: Peter Lang, pp. 207-223.
Sinclair, J.M. (ed.) (1987), Looking up: An account of the COBUILD project in lexical computing, London: Collins.
Tribble, C. (2000), Practical uses for language corpora in ELT, in P. Brett and G. Motteram (eds), A Special interest in computers: Learning and teaching with information and communications technologies, Whitstable, Kent, UK: IATEFL, pp. 31-41.
Ungerer, F. (1999), Englische Grammatik heute, Stuttgart, Germany: Ernst Klett.
Top-down and Bottom-up Approaches to Corpora in Language Teaching
John Osborne
Université de Savoie, France
Abstract
Both native-speaker and learner corpora are exploited in language teaching. The activities associated with these types of corpora typically proceed in opposing directions: ‘downwards’ from a supposed model of native-speaker performance, helping the learner to discern and assimilate the lexico-grammatical patterns that lie behind this performance; and ‘upwards’ from the learners' collective interlanguage productions, guiding them towards a closer approximation with native-speaker proficiency. There is, ideally, a convergence between these two movements, by which language learners will become better able to perceive discrepancies between their own patterns of use and those of native speakers, starting from what they already know about the target language, and from what they themselves are trying to use it for. The purpose of this paper is to suggest ways in which this convergence can be encouraged, by constructing activities based on both native-speaker and learner corpus data.
1 Language Awareness
The distinction between top-down and bottom-up approaches to corpora in language teaching can be understood in a number of related ways. In the construction of learner competence, knowledge about language may be rule-driven, from explicit language instruction, pedagogical grammars, etc., or data-driven, either from raw input or from input which has been subjected to various degrees of selection, screening and ordering for pedagogical purposes. In the choice of what learners actually attend to, awareness may focus on larger units and more diffuse patterns, or on more specific local phenomena. In the pedagogical exploitation of corpus data, finally, the movement may be either top-down, drawing data from a native-speaker corpus to provide evidence of target usage and increase learners’ awareness of the language, or bottom-up, drawing data from a learner corpus and using the learners’ own productions as a starting point for error correction and gradual enrichment. It is this last point which will be the main concern here, with the aim of identifying ways in which data from learner corpora and from native speakers may usefully be exploited in combination to provide material for language awareness exercises.
2 Top-down
Most data-driven learning is essentially top-down, taking native-speaker data as evidence of how the target language is (and should be) used. This has obvious advantages of authenticity, potential for making patterns salient, and drawing attention to collocational features, but it also raises a number of questions, of which I should like to mention three.
Firstly, it may be asked whether a native-speaker corpus is a realistic or desirable model for foreign language learners (see, for example, Cook 1998; Seidlhofer 2001). Given the very small number of language learners who ever achieve native-like proficiency in the language, presenting the “real” language of corpus data as a model may be setting a goal which is unattainable, and to which most learners do not in fact aspire. In addition, as Cook remarks, much of the language which can be extracted from a corpus is neither very clear nor very expressive, and therefore not an appropriate model for any kind of learner.
Secondly, unless corpus examples are filtered in some way (which rather defeats the “no middle-man” principle of data-driven learning) many of the contexts are likely to be linguistically and culturally bewildering for the language learner. When a randomly selected solution from the British National Corpus contains peripheral items such as Brewer’s Tudor and inglenooky, one inevitably wonders, the benefits of serendipity notwithstanding, whether it will usefully serve to enlighten students on the use of the initial search word.1
Thirdly, data from a large native-speaker corpus frequently contain instances of language usage that run counter to commonly used pedagogical rules. While this can be a salutary illustration of the over-simplified nature of grammar rules, it may have a destabilising effect on learners, and it is necessary to provide guidance on how best to incorporate such contradictions into their explicit grammatical knowledge. One example will suffice to illustrate this point.
Learners of English as a foreign language, particularly those whose L1 has a perfect tense used for past-time reference, are frequently warned against using the English present perfect with a past time marker. In fact, occurrences of this association are not uncommon in native-speaker usage, as the examples in Figure 1 from the BNC illustrate:

 banks to building societies, have  long ago  given up caring a tinker
 erloving family to devour. I have  long ago  chosen a more tortuous p
 inance profitably. Big firms have  long ago  weaned themselves off th
 h margin and short lifespan, have  long ago  been cost-depreciated to
 dgeability. Standard-setters have  long ago  realised that there is a
 rd system of appraisal. They have  long ago  mastered the art of arra
s and how to take them (as we have  long ago  done about alcohol) , th
 ty of the people. The French have  long ago  destroyed the authority
  yman, butcher, and so forth, has  long ago  been replaced by a labor
n annual outing to the seaside has  long ago  displaced this occasion.
   laws. New black immigration has  long ago  been stopped, but any b
e of Arab public opinion which has  long ago  accepted the ‘linka

Figure 1: Examples from BNC of present perfect used with a past time marker
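A concordance of the kind shown in Figure 1 can be approximated with a few lines of code. The following is only a minimal sketch, not the retrieval software actually used with the BNC: the function names `kwic` and `format_kwic`, the regular expression, the context width and the sample sentences are all assumptions introduced here for illustration.

```python
import re

# Assumed search pattern: a present-perfect auxiliary directly followed by "long ago".
PERFECT_LONG_AGO = re.compile(r"\b(?:have|has)\s+long\s+ago\b", re.IGNORECASE)


def kwic(text, pattern=PERFECT_LONG_AGO, width=30):
    """Return (left context, keyword, right context) for every match in text."""
    lines = []
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        lines.append((left, match.group(0), right))
    return lines


def format_kwic(lines, width=30):
    """Right-align the left contexts so that the keywords form a column."""
    return "\n".join(f"{left:>{width}} {key} {right}" for left, key, right in lines)


# Invented sentences, for illustration only; they are not BNC data.
sample = (
    "The company has long ago given up the plan. "
    "They moved away long ago. "
    "Many firms have long ago changed their approach."
)
hits = kwic(sample)
print(format_kwic(hits))
```

The second sample sentence is deliberately not retrieved, since "long ago" there follows a simple past rather than a present-perfect auxiliary. A pattern this narrow would miss cases where other material intervenes between auxiliary and adverbial, so any real search would need to be broader and checked by hand.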
Similar examples can be found in American English, and in earlier states of the language. The pattern is therefore not a curiosity of present-day British English, but we would probably still not wish learners to conclude from examples such as these that the present perfect can safely be used with a past time marker in all contexts. Their exploitation thus needs to be handled with care.
3 Bottom-up
Bottom-up uses of data from a learner corpus offer the advantage of starting from what learners are trying to express with the language, in order to make them aware of deviant uses and help them to correct them. In addition, as Gass and Mackey (2002: 252) observe, “[…] the process of producing or struggling to produce output may sensitize learners to patterns and associations in future input.” Subsequently drawing learners’ attention to problematic areas in their own collective production may further contribute to the saliency of subsequent input. The exploitation of learner-corpus data nevertheless involves a number of practical questions: what sort of learner data to use, what discrepancies to attend to, and how to help learners to perceive and make sense of the patterns. Thanks to an increasing number of corpus-based analyses of learner interlanguage, we already have a clearer idea of the kinds of diffuse errors which tend to appear in learner production, and which are frequently questions of idiosyncratic associations or of under- and over-use. These analyses cover various L1 backgrounds and different areas of language use, notably connectors and cohesion (Altenberg and Tapper 1998; Granger and Tyson 1996), intensifiers and hyperbole (Lorenz 1998, 1999) and formulaic language and lexical phrases (De Cock et al. 1998). In the remainder of this paper, I should like to discuss a few examples of discrepancies observed in a learner corpus (L1 French), which appear to be cases of L1 transfer, before suggesting ways in which such learner-corpus data may be used as a basis for creating remedial activities. The examples are of three main types of discrepancy: lexical overuse (interesting, important), grammatical anomalies (use of determiners with non-count nouns, use of the present perfect), and discourse patterns (use of connectors).
4 Discrepancy and L1 Transfer
The corpus from which the examples are taken consists of just over 600,000 words of carefully monitored written English (no limits on time, number of drafts, or use of reference material) produced by French-speaking university students from three groups: 2nd year students majoring in English (45% of the corpus), 3rd year majors (40%) and 4th year students with English as a minor (15%). Students had on average 8-10 years of instruction in English; most of the errors that appear in this corpus of careful written production can therefore be assumed to be persistent. For purposes of rough comparison, a smaller sample (170,000 words) of native-speaker writing was used; these were essays of similar length and on similar subjects, written by students at the same level in an English-speaking university.
5 Lexical Overuse
Table 1 shows the use of interesting by non-native and native speakers. This item is a clear candidate for over-use, being four times as frequent in the learner corpus as in the native writing. More particularly, there are three characteristic uses which are almost entirely absent from the native-speaker essays: the formulaic expression it is interesting to notice (or note), use with intensifiers (very, particularly, more, etc. interesting), and coordination with other adjectives (relevant and interesting, etc.).

Table 1: interesting in NNS and NS writing
 | NNS | NS
frequency/100,000 words | 22 | 5
number of occurrences | 135 | 9
it is interesting to (notice, note, etc) | 44 | 1
Intensification | 41 | 3
Coordination | 7 | 0
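The relation between the raw counts and the normalised frequencies in Table 1 can be checked with a short calculation. This is a sketch under stated assumptions: the corpus sizes are the rounded figures given in the text (the learner corpus is described as "just over 600,000" words, so the result differs slightly from the rounded table values), and the function name `per_100k` is introduced here for illustration.

```python
def per_100k(occurrences, corpus_size):
    """Normalise a raw count to occurrences per 100,000 words."""
    return occurrences / corpus_size * 100_000


# Approximate corpus sizes as stated in the text (assumed round figures).
NNS_WORDS = 600_000
NS_WORDS = 170_000

nns = per_100k(135, NNS_WORDS)  # occurrences of "interesting" in learner writing
ns = per_100k(9, NS_WORDS)      # occurrences of "interesting" in native writing
ratio = nns / ns

print(f"NNS {nns:.1f}  NS {ns:.1f}  ratio {ratio:.2f}")
```

With these round figures the learner frequency comes out at roughly four times the native-speaker frequency, in line with the statement above. Normalising in this way is what makes counts from corpora of different sizes comparable at all.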
The case of important is somewhat different (Table 2). This word occurs with almost equal frequency in both corpora. Just as for interesting, there are many formulaic uses of the type it is important to note, but this time they are not restricted to the learner corpus. The most apparent differences are collocational. A comparison of the words most frequently occurring to the left and to the right of important, in the NNS and NS essays respectively, suggests that the learners tend to use verb frames of the type play/have an important role whereas the native-speakers prefer equivalence constructions such as is/are an important factor.

Table 2: important in NNS and NS writing
 | NNS | NS
frequency/100,000 words | 86 | 83
occurrences | 527 | 139
it is important to (note, etc) | 69 | 15
collocations (R) | role (26); part (19); thing (15); point (6) | factor (6); changes (6); event (5); part (5)
collocations (L) | play (22); have (21) | be
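Left and right collocate counts of the kind reported in Table 2 can be obtained by counting the words that fall within a small window on either side of the node word. The sketch below is an illustration only and makes several assumptions: the function name `collocates`, the window of three words, the simple tokenisation and the sample sentences are all introduced here, and no lemmatisation is done, so plays and play would be counted separately.

```python
import re
from collections import Counter


def collocates(sentences, node, left_span=3, right_span=3):
    """Count words within a window to the left and to the right of node."""
    left, right = Counter(), Counter()
    for sentence in sentences:
        tokens = re.findall(r"[a-z']+", sentence.lower())
        for i, token in enumerate(tokens):
            if token == node:
                left.update(tokens[max(0, i - left_span):i])
                right.update(tokens[i + 1:i + 1 + right_span])
    return left, right


# Invented sentences, for illustration only; not taken from either corpus.
learner_like = [
    "Tourism plays an important role in the economy.",
    "The media have an important role in politics.",
    "Money plays an important part in this story.",
]
left, right = collocates(learner_like, "important")
print(left.most_common(3))
print(right.most_common(3))
```

In practice function words such as an and in dominate raw window counts, so a stop list or a restriction to particular word classes would probably be needed before the verb frames and head nouns discussed above become visible.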
6 Grammatical Anomalies
6.1 Present Perfect
There is very little difference in the overall frequency of present perfect use by native and non-native writers. It is the context of use, in the learner corpus, which can be problematic, as in the examples below:
(1) It is obvious that the sinking of the Titanic remains one of the most significant tragedies of the century since 1,518 out of 2,223 persons have died that terrible night.
(2) Le Creuset S.A. is a French cookware company which was created in 1924. Traditionally the company produced cast iron cookware but since 1988 it has been acquired by Paul van ZUYDAM, the former Chairman and Chief Executive of the Prestige Group plc, the leading British cookware company.
(3) Thus, since the Forestry Commission has been founded in 1919, it has reforested some 800,000 hectares (which correspond to 2 million acres).
Each of these uses of the present perfect is preceded by since, used either as a time adverbial or as a logical connector. In example (1) the presence of since appears to have a simple triggering effect, even though it is the “wrong” since; (2) begins with a standard association of past-time marker with past tense (which was created in 1924), but this is followed, after since, by an inappropriate combination of lexical choice and verb form; (3) contains an over-extension of the “since + present perfect” rule, applied to both clauses of the sentence. These appear to be rule-driven errors, caused by over-generalisation from statements such as the following, taken from the Cobuild English Grammar:
If you want to talk about a situation that began in the past and is continuing now […], you use the preposition ‘since’ with a time expression or an event to indicate when the situation began. The verb is in the present perfect tense. (Sinclair 1990: 277)2
Although the presence of since seems to be a facilitating factor, the very concept of “a situation that began in the past and is continuing now” and more generally of the present relevance associated with use of the present perfect is one which is prone to over-extended interpretations, as illustrated in this explanation offered by a student to explain why she judged the present perfect to be appropriate in the sentence ‘This text has been written in Middle English’:
(4) The present perfect refers to an action which occurred in the past but which is still true at the moment we are speaking. The present perfect is used because the text is still written in Middle English.
6.2 Non-count Nouns
Non-count nouns such as information are a notorious source of error for French-speaking learners of English (among others), since the cognate word in their L1 is used as a count noun. Conscious awareness of this difference is high among students, but this does not prevent a significant proportion of count-like uses appearing in their own written production; out of 280 uses of information in the learner corpus, 16 are plural, informations, 20 are associated with an indefinite article, an […] information, and 14 circumvent the problem by using the individualising formula a piece of information. Anomalous countable uses of information thus represent nearly 13% of the total, with a further 5% being individualising, but these uses are not indiscriminate. As can be seen in the examples in Figure 2, they occur overwhelmingly in contexts of quantification (more, most, all, an entire page of informations) or of qualification (scientific, precise, all sorts of informations).

duce almost an entire page of informations on the history a
asting scientific and technic informations). A multinational
 sales and can also have more informations of the market. Wi
d violence mean ? All these "informations" come from the s
h you can find every sorts of informations. But I think it
not to give the origin of his informations and his own concl
has to transmit, to broadcast informations of any types as m
  in his heart. Most of these informations, and especially d
 constructions allow to bring informations about the utterer
incredible and rich source of informations about traditions
t needs further evidences and informations to be fully under
cey Morris, need some precise informations on the creature
esents an essential source of informations for manufacturer
 refers to presuppositions or informations already mentioned
 to have access to scientific informations. Another fiel
 ld write or telephone to ask informations. It is a commerci

Figure 2: Anomalous countable uses of informations from learner corpus

The tendency is even more marked for the singular a(n) information; with two exceptions, all occur in contexts of qualification (a new, additional, following information) as shown in Figure 3.

n enables the speaker to give an information and express
er, or even for himself, when an information is hardly acce
, IT is just a means to evoke an old information which is ne
 ow it, psychologically it is an old information ). From thi
r psychology (the referent is an old information) . Gre
 Riding Hood. "Of" introduces an additional information as i
stitutes is nothing more than an old information . Therefore
es beyond it as it is no more a new information. And, the sa
ntrary here refers forward to a following information of ind
         humans must. " It is a new information for the co-u
mple [19], "which" introduces a new information. This is mad
  a sentence, THIS/THESE give a new information, it is calle
icator. What is introduced is a temporal information with a
nore" is underlined for it is a new information. We can now
 ks a pre- supposed relation, a shared information between t
e narrator's point of view is a new information ("clear") as
ary on the matrix clause than a complementary information.
 speaking for it cannot bring a new information to a discour
e " for the first time. It is a new information / rhematic.
vs XII has health problems is a new information since they w

Figure 3: Anomalous countable uses of a(n) information from learner corpus

Countable use of non-count nouns such as information is an instance of what I would call a “priming” issue, in Hoey’s (2002) sense: grammatical category, and sub-categorisation, are questions of priming. A word cannot be said to “be” non-count; rather, it is primed to be non-count by repeated association – or non-association – with other items. Cognate words such as information may be differently primed in English and in French, and in contexts which emphasise non-homogeneity, L1 priming effects may take over, even though they run counter to the learner’s explicit knowledge. An example of over-riding contextual effects, despite recently activated metalinguistic knowledge, is provided inadvertently in the quote below, from a second-year student:
(5) ‘An information’ is unacceptable because ‘information’ is an uncountable noun. ‘The information they gave me’ is acceptable because it is a particular information determined by ‘they gave me’.
6.3 Connectors
A number of studies of learners’ use of connectors have already revealed patterns of over-use and under-use.3 Although my focus here will be qualitative rather than quantitative, Table 3 shows some sample frequencies from the French-speaking learner corpus.

Table 3: Connectors (occurrences/100,000 words)
 | NS | NNS 2 | NNS 3 | NNS 4
Connectors (overall) | 704.27 | 871.19 | 906.95 | 842.56
Indeed | 33.53 | 36.13 | 18.32 | 32.86
In fact | 15.57 | 43.70 | 29.00 | 29.57
As a matter of fact | 0 | 4.20 | 12.98 | 2.19
Anyway | 0 | 7.56 | 9.92 | 5.48
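The size of the gap between learner and native-speaker figures in Table 3 can be expressed as a percentage excess over the NS value. The sketch below simply recomputes this from the table; the dictionary layout and the function name `excess_over_ns` are assumptions made here, and the first row of the table is read as the overall frequency of connectors.

```python
# Frequencies per 100,000 words, copied from Table 3.
TABLE_3 = {
    "Connectors (overall)": {"NS": 704.27, "NNS 2": 871.19, "NNS 3": 906.95, "NNS 4": 842.56},
    "Indeed": {"NS": 33.53, "NNS 2": 36.13, "NNS 3": 18.32, "NNS 4": 32.86},
    "In fact": {"NS": 15.57, "NNS 2": 43.70, "NNS 3": 29.00, "NNS 4": 29.57},
    "As a matter of fact": {"NS": 0, "NNS 2": 4.20, "NNS 3": 12.98, "NNS 4": 2.19},
    "Anyway": {"NS": 0, "NNS 2": 7.56, "NNS 3": 9.92, "NNS 4": 5.48},
}


def excess_over_ns(row):
    """Percentage by which each learner group exceeds the NS frequency.

    Returns None when the NS frequency is zero, since no ratio is defined.
    """
    ns = row["NS"]
    if ns == 0:
        return None
    return {group: (value / ns - 1) * 100
            for group, value in row.items() if group != "NS"}


overall = excess_over_ns(TABLE_3["Connectors (overall)"])
for group, pct in overall.items():
    print(f"{group}: {pct:+.1f}%")
```

For connectors that do not occur at all in the native-speaker sample, such as As a matter of fact and Anyway, a percentage comparison is undefined, which is why the function returns None for those rows rather than a figure.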
Overall, the learner essays (columns 3-5) contained 20-30% more connectors than the native-speaker essays (column 2), with particularly marked discrepancies for connectors such as In fact, As a matter of fact or Anyway. An exception is Indeed, which despite being one of the connectors most notoriously overused by French speakers, is not noticeably more frequent in this corpus. This may be partly attributable to instruction effects, particularly in the case of 3rd year English majors, who have often learnt by this stage to be wary of Indeed. But even when non-native speakers have learnt not to over-use this connector, there is often an impression of strangeness that remains when they do use it, as in examples (6) and (7) below:
(6) [..] it is easy to understand that the Scottish Government does not want to spread such ideas because it would represent a serious blow to the economy of the region. Indeed, Nessie is a profitable business since it attracts thousands of tourists each year and, of course, brings much money to the country
(7) But the Republican woman adopted a very different strategy. Indeed, she decided to give the tapes to Clinton's worst enemy: Kenneth Starr.
It is not always easy to identify what is inauthentic in such usage; inauthenticity is in the mind of the observer, and what may pass as mildly idiosyncratic in a native-speaker text may be perceived as erroneous in learner production. In blind assessment of native and advanced non-native writing, native-speaker judges are not always accurate at deciding which is which (see Ringbom 1998). However, a more qualitative comparison of native and non-native use of Indeed suggests that there is a difference in the way in which this word functions in association with other connectors. Extract (8) is an example of connector-chaining from a native-speaker essay, in which there is an external-internal movement, conceding an alternative view-point before contrasting it with the author’s own line of argument, which is backed up by evidence introduced by Indeed:
(8) While it is undoubtedly true that the most vocal support for the anti-EMU lobby has tended to come from Britain, it would be wrong to imagine that any EC member country has been firmly and unconditionally behind the objective. Indeed, as Tsoukalis points out in ‘The New European Economy’: “The EMS was the product of an initiative taken by Chancellor Schmidt, against the advice, if not the outright opposition, of his central bank.”
In extract (9) below, from a learner essay, the movement is purely internal, and Indeed is used simply to indicate congruence between the writer’s first statement and the following one:
Approaches to Corpora in Language Teaching
(9) As a second part, let's see the possibility that the Indians may only survive as a mere touristical curiosity. Indeed they participate a lot in the American economy thanks to tourism.

7 Applications in Teaching
These brief examples of discrepancies between native and non-native choices in lexis, grammar and rhetoric raise three main questions about the relevance of such data in language teaching. Why do upper-intermediate/advanced learners, after eight or more years of instruction in English, continue to make certain errors? Do these errors really matter? If so, what can be done to help learners make more appropriate choices?

The examples of learner English discussed here are representative of errors which tend to be persistent, but which do not seriously interfere with understanding, and which do not concern the most salient aspects of English grammar and lexis. It is for this reason that, in the general context of teaching English as an international language, it might be asked whether it is really necessary to devote a lot of effort to correcting them. If I choose to ignore this question here, it is mainly because the learners concerned are working in an institutional context which expects them, rightly or wrongly, to model their language use as closely as possible on that of native speakers. The other two questions, though, are central to the exploitation of corpus data in language learning. One of the probable factors contributing to the persistency of such errors is precisely, as Ellis (2002: 175) suggests, that they relate to language phenomena which are neither particularly salient nor essential for understanding:

The real stuff of language acquisition is the slow acquisition of form-function mappings and the regularities therein. This skill, like others, takes tens of thousands of hours of practice, practice that cannot be substituted for by provision of a few declarative rules [...] However, without any focus on form or consciousness raising (Sharwood-Smith, 1981), formal accuracy is an unlikely result; relations that are not salient or essential for understanding the meaning of an utterance are otherwise only picked up very slowly, if at all.
A major advantage of using corpus data in language learning is the possibility of making regularities in the language immediately more salient, by collecting dispersed naturally-occurring examples together as concordance lines, or by using these examples as a basis for language awareness exercises. By combining “top-down” data from a native-speaker corpus and “bottom-up” data from a learner corpus, it is possible to construct a variety of such exercises, to help learners to become more aware of discrepancies between their own usage and that of native speakers, to develop more effective observation skills, to notice less salient patterns, to draw conclusions from the regularities that they observe, and to resolve possible conflicts between their metalinguistic knowledge, input evidence, and their own production.

7.1 Types of Exercises
What follows is by no means an exhaustive typology, but merely a sample of exercises which use data from a learner corpus, a native-speaker corpus, or both.

Native or non-native? These are general awareness activities that ask learners to look at similar extracts from their own writing and that of native speakers, decide which are “authentic” and which are not, and note the features that seem to betray non-nativeness. The objective is to develop critical linguistic distance, and to increase overall sensitivity to the characteristics of native and non-native writing.

Comparison: A variant of the preceding type of exercise, this focuses on a specific language point. The aim is not just to notice discrepancies between native and non-native usage, but also to reflect on the reasons for these discrepancies.

Lexical enrichment: An obvious remedy for lexical overuse is to encourage learners to use alternative words in the same context. This type of exercise takes examples of overused lexical items from a learner corpus (for example, interesting), blanks out the item in question, and asks the learner to fill the gap with an appropriate word taken from a list of alternatives.4 This list is established partly by intuition, but checked and modified by searching for the words commonly used in comparable contexts in a native-speaker corpus.

Collocations: As can be seen from the example of important discussed earlier, lexical divergences are not just a question of over or under-use of specific items, but also of collocation and phraseology. Learners therefore need practice in comparing their choice of associations with the patterns in native-speaker writing.

Concordancing tasks: These are a staple component of data-driven learning, asking learners to investigate language patterns, by looking at teacher-prepared concordances, or by doing it themselves with simple concordancing software or by performing online searches.
Completion exercises: Most gap-fill exercises take a contrived text and blank out items predicted to be problematic. A corpus-based variant is to take examples containing actual errors from a learner corpus, blank out the errors, and ask the learners to complete the text. In this way, the contexts are not just thought to be, but are known to be problematic. Focusing on these contexts can help learners to avoid reproducing these errors in similar circumstances elsewhere. Since the errors have been blanked out, there is no risk of the reinforcing effect that some may fear when learners are asked to focus on their own mistakes.
Proof-reading/revision: Despite residual misgivings from a more behaviourist era, proof-reading exercises are quite widely used in language teaching, and certain authoring packages for creating computer-based language exercises include a proof-reading template.5 Having access to a learner corpus makes it possible to construct guided proof-reading exercises which are more focused on specific problems, and which may be particularly appropriate for more diffuse phenomena such as connector usage.

7.2 Examples of Exercises
To illustrate these exercise types, the following are a few brief examples, chosen from areas discussed above (lexical overuse, use of non-count nouns, present perfect, and connectors). Only the first few items of each exercise are given here. The exercises are written to be computer-based; the examples can also be consulted in a more complete form online.6

7.2.1 Lexical Enrichment: Alternative words for important.

Instruction: All the examples below originally contained the word important in the gap. Try to choose a better word to use instead of important. Choose from the following words: major, leading, wide, strong, severe, crucial, established.

(a) But this Congress has refused to fight another [ ] battle here at home: the fight against gun violence in America.

(b) The prosperous Disneyland Tokyo has become a(n) [ ] destination for many tourists attracted by its magic.

(c) This system suffers from a(n) [ ] weakness, that is it does not keep secret the political leanings of the citizens, and everybody knows for which candidate you will vote during the Primaries.

(d) The gap between black and white income is still very [ ] and blacks in the service industries still represent a minority.
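Items of this kind lend themselves to semi-automatic preparation. The Python sketch below shows one possible way of blanking out an overused word in sentences taken from a learner corpus; it is an assumption about how such items might be generated, not a description of the software actually used. The function name make_gap_item is hypothetical. A preceding article is neutralised to "a(n)" so that it does not reveal whether the expected answer begins with a vowel.

```python
import re


def make_gap_item(sentence, target, gap="[ ]"):
    """Blank out the first whole-word occurrence of `target` in `sentence`.

    If the target is directly preceded by "a" or "an", the article is
    replaced by "a(n)". Returns None when the target does not occur.
    """
    pattern = re.compile(rf"\b(?:(an?)\s+)?{re.escape(target)}\b", re.IGNORECASE)
    match = pattern.search(sentence)
    if match is None:
        return None
    article = match.group(1)
    if article is None:
        replacement = gap
    else:
        # Preserve sentence-initial capitalisation of the article.
        neutral = "A(n)" if article[0].isupper() else "a(n)"
        replacement = f"{neutral} {gap}"
    return sentence[:match.start()] + replacement + sentence[match.end():]


# Learner sentences restored from items (b) and (d) above, which
# originally contained "important" in the gap.
learner_sentences = [
    "The prosperous Disneyland Tokyo has become an important destination "
    "for many tourists attracted by its magic.",
    "The gap between black and white income is still very important and "
    "blacks in the service industries still represent a minority.",
]

for sentence in learner_sentences:
    print(make_gap_item(sentence, "important"))
```

The list of alternative words offered to the learner would still need to be compiled separately, as described in section 7.1, by checking intuitions against a native-speaker corpus.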
7.2.2 Collocations: What sort of things are important?

Instruction: Compare these uses of “important” from native and non-native speaker essays. Look at the nouns which “important” qualifies, and the verbs which precede it. Are they the same? Are there verbs or nouns which are used more by one group than by the other?
Non-native speakers:

fies it. First of all, they play an important part in the selection
 own culture and way of life. In an important sense, therefore, sla
e man's movement : Blacks played an important role, as it is shown i
ntroversy caused by slavery, lay an important paradox : how an insti
ation and TV has now become such an important means of communication
elements give to the American TV an important impact on our society
mic stream. In 1992, Europe took an important part in South African
te stories in which animals play an important role, as Les contes du

Native speakers:

part in explaining Stalinism but an important factor is also that St
w writings on the subject, he is an important link for the feminist
e act of suicide is likely to be an important factor as to why fewer
this heritage...could still make an important contribution to the w
unwa. Things Fall Apart was such an important book, because it oozed
. Material goods had always been an important part of Igbo life, but
n of press ownership is possibly an important factor in ideas of med
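Concordance lines in this keyword-in-context layout can be produced with very little code. The following Python sketch is a minimal illustration, not the concordancing software used to prepare the exercise: the function name kwic, the default context width of 32 characters and the sample sentence are all assumptions made for the example.

```python
import re


def kwic(texts, node, width=32):
    """Return keyword-in-context lines for every whole-word hit of `node`.

    Each line holds `width` characters of left context (right-aligned, so
    that the node words line up vertically), the node as it appears in
    the text, and up to `width` characters of right context.
    """
    pattern = re.compile(rf"\b{re.escape(node)}\b", re.IGNORECASE)
    lines = []
    for text in texts:
        flat = " ".join(text.split())  # collapse line breaks and repeated spaces
        for match in pattern.finditer(flat):
            left = flat[max(0, match.start() - width):match.start()]
            right = flat[match.end():match.end() + width]
            lines.append(f"{left:>{width}}{match.group(0)}{right}")
    return lines


# Invented sample sentence, used only to show the output format.
sample = ["Television has become an important means of communication."]
for line in kwic(sample, "important", width=10):
    print(line)
```

Running the same function separately over a learner corpus and a native-speaker corpus gives two sets of lines that can be placed side by side, as in the exercise above.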
7.2.3 Proof-reading: What is wrong with this?

Instruction: All of these extracts from non-native student essays contain the same basic error. What is it?

is underlined for it is a new information. We can now refer to anot
to transmit, to broadcast informations of any types as much objecti
matrix clause than a complementary information. Moreover 'as' is ca
it needs further evidences and informations to be fully understood
7.2.4 Completion: What word goes here?

Instruction: Each of these extracts can be completed with a word that is usually uncountable in English. To help you guess, the first letter of each word is already given.

(a) This is used as a sort of pretext to introduce almost an entire page of [i____] on the history and the population of Transylvania.

(b) In the same way, the director has also tried and succeeded in giving his actors more or less the same personality. The numerous photos and other [e____] gathered after the wreck or provided by casualties' relatives and survivors unquestionably bring to the fore the amazing similarities between them.

(c) Ellen is fortunate in living with rather weak people like Catherine Linton or even Edgar on whom she can entirely exercise her influence. She enjoys giving [a____] and they all ask for her opinion.
7.2.5 Completion: What verb forms go with since?

Instruction: Complete the gaps with an appropriate verb, using a present or present perfect or past tense, as necessary. Click on the [?] button to see the choices.

(a) Ever since Machiavelli [ ] The Prince and The Discourses in the sixteenth century he [ ] associated with the ugly side of political activity.

(b) Margaret Ensley's son died in High School at the age of 19. Since then, she [ ] a member of MAVIS (Mothers against Violence in Schools).

7.2.6 Native or non-native: Who wrote this?

Instruction: Sometimes, student writing contains no obvious errors, but the style or choice of words makes it seem "foreign." Can you tell whether the following extracts were written by native (English-speaking) or by non-native (French-speaking) students?

(a) The downtown areas which were prosperous in the past, are now inhabited by poor people. Indeed, people who were educated and had money began to move to the suburbs leaving the poor in the ghettos. Therefore, this situation could only lead to violence.

(b) Spontaneous sit-down strikes rapidly spread, and within a few weeks millions of workers had downed their tools. Indeed, the strike was almost too successful since it threatened the coming Blum government before it had even taken office.

8 Conclusion: Perspectives for the Future
Preparing data-driven learning materials is labour intensive, particularly when it involves manipulating several different corpora. The investment is worthwhile if it enables learners to look at the target language in new ways and contributes, in time, to better perception and understanding of its patterns. For the moment, the approach described here results largely from a personal conviction that the convergence of bottom-up and top-down data will highlight real discrepancies between native and non-native usage, that enhanced saliency will make learners more aware of the discrepancies, and that this will ultimately help them to modify inappropriate usage. To support this conviction, three types of additional work are needed: (i) further investigation of the relation between awareness of target language phenomena, frequency effects, and language performance, (ii) studies of the actual effects of data-driven learning on learner production, and (iii), as a
prerequisite to this, development of a wider range of learning materials. One of the reservations sometimes formulated about DDL is that it is essentially aimed at more proficient learners. This is generally true of top-down approaches, for the reasons mentioned in section 1 above, but the introduction of more learner-corpus based bottom-up data may offer possibilities for intermediate-level learners too. There is rich ground here for collaboration between corpus linguistics, language teaching and second-language acquisition research.

Notes

1. On serendipity in using corpora with language learners, see Bernardini (1999).
2. No criticism is intended of this particular grammar, which specifies in a later section (p. 347) that in the time clause, since is followed by a past tense.
3. See, for example, Granger and Tyson (1996), Osborne (1994, 1998), Kaszubski (1997).
4. A very similar, paper-based, exercise is described in Granger and Tribble (1998).
5. An example is the Tense Buster Authoring Kit, produced by Clarity.
6. The URL is http://www.llsh.univ-savoie.fr/siteCELCE/projets.html.
References

Altenberg, B. and M. Tapper (1998), The use of adverbial connectors in advanced Swedish learners' written English, in S. Granger (ed.), Learner English on computer, Harlow, UK: Longman, pp. 80-93.
Bernardini, S. (1999), Systematising serendipity: Proposals for concordancing large corpora with language learners, in L. Burnard and T. McEnery (eds), Rethinking language pedagogy from a corpus perspective, Hamburg, Germany: Peter Lang, pp. 225-234.
Cook, G. (1998), The uses of reality: A reply to Ronald Carter, ELT Journal, 52 (1): 57-64.
De Cock, S., S. Granger, G. Leech, and T. McEnery (1998), An automated approach to the phrasicon of EFL learners, in S. Granger (ed.), Learner English on computer, Harlow, UK: Longman, pp. 67-79.
Ellis, N. (2002), Frequency effects in language processing: A review with implications for theories of implicit and explicit language acquisition, Studies in Second Language Acquisition, 24 (2): 143-188.
Gass, S. and A. Mackey (2002), Frequency effects and second language acquisition: A complex picture? Studies in Second Language Acquisition, 24 (2): 249-260.
Granger, S. and C. Tribble (1998), Learner corpus data in the foreign language classroom: Form-focused instruction and data-driven learning, in S. Granger (ed.), Learner English on computer, Harlow, UK: Longman, pp. 199-209.
Granger, S. and S. Tyson (1996), Connector usage in the English essay writing of native and non-native EFL speakers of English, World Englishes, 15 (1): 17-27.
Hoey, M. (2002), The textual priming of lexis, TALC/02 Conference, Bertinoro, Italy.
Kaszubski, P. (1997), Polish student writers - Can corpora help them? in B. Lewandowska-Tomaszczyk and J. Melia (eds), PALC '97: Practical applications in language corpora, Lodz, Poland: Lodz University Press, pp. 133-158.
Lorenz, G. (1998), Overstatement in advanced learners' writing: Stylistic aspects of adjective intensification, in S. Granger (ed.), Learner English on computer, Harlow, UK: Longman, pp. 53-66.
Lorenz, G. (1999), Adjective intensification - learners versus native speakers: A corpus study of argumentative writing. Amsterdam: Rodopi.
Osborne, J. (1994), La cohésion dans les productions écrites d'étudiants en anglais de spécialité: Un problème culturel? ASp, 5/6: 205-216.
Osborne, J. (1998), Connecteurs inter-phrastiques et apprentissage de la cohésion textuelle: Problèmes linguistiques et culturels, in P. Cahuzac and J.M. Abreu (eds), Actes des 7èmes journées ERLA-GLAT, Brest, pp. 229-244.
Ringbom, H. (1998), Near-native proficiency in writing, in D. Albrechsten (ed.), Perspectives on foreign and second language pedagogy, Odense, Denmark: Odense University Press, pp. 149-159.
Seidlhofer, B. (2001), The case for a corpus of English as a lingua franca, in G. Aston and L. Burnard (eds), Corpora in the description and teaching of English, Bologna, Italy: CLUEB, pp. 70-85.
Sharwood-Smith, M. (1981), Consciousness-raising and the second-language learner, Applied Linguistics, 2: 159-168.
Sinclair, J. (ed.) (1990), Cobuild English Grammar, London: Collins.
Towards an Instrument for the Assessment of the Development of Writing Skills

Pieter de Haan and Kees van Esch
University of Nijmegen

Abstract

An important aspect of academic foreign language writing courses is assessing and grading the quality of students’ writing products. This can be done by using holistic or analytical scales or by ranking. What is needed specifically for the Dutch context is an instrument geared towards the specific objectives and context of our foreign language courses, which can help the teacher to assess students’ written products with more validity and which can be used to assess students’ progress over time. A joint project, aiming at developing such an instrument for the specific Dutch context, has recently started at the departments of English and Spanish in Nijmegen, the Netherlands. The present article describes the first step towards developing the above-mentioned instrument: the set-up of two modest-sized “longitudinal” learner corpora, one for Spanish and one for English. These corpora will contain learner essays written under controlled conditions and on predefined topics. The first batch of student essays was collected in March 2002. Lexical and syntactic analyses of these essays will provide a unique insight into the development of the students’ writing skills. An initial quantitative analysis of the essays has already yielded a number of interesting observations. The article concludes with a tentative suggestion for a more elaborate instrument to relate student performance to teacher assessment.

1 Introduction
It goes without saying that university students who either study a foreign language or who wish to study at a university abroad will want to express themselves adequately in writing in that foreign language. The former will want to acquire a written proficiency that will enable them to work professionally as teachers, translators, editors, etc. The latter will often simply need a good command of the written language in order to be admitted to university courses abroad. Many universities provide academic writing courses with a view to helping students acquire the necessary level. These courses are geared toward the target proficiency level, with a heavy emphasis on argument structure. Precisely how students develop their foreign language writing skill over time has not yet been researched extensively. A better understanding of how this development takes place will help course designers to fine-tune writing courses to students’ needs. What needs to be researched in particular is how students improve their writing skills, and how this improvement can be measured. In a research project, which has recently started at the University of Nijmegen, we aim to gain this better understanding by studying the students’
written products on the basis of a small longitudinal corpus of essays written by Dutch-speaking students of English and Spanish.

This article is written against the background of the academic writing tradition in the Netherlands, which is far less developed than that in the Anglo-Saxon world. Traditionally, much attention was always paid to grammar and translation, especially in foreign language teaching. During the past few decades a greater emphasis has undoubtedly been placed on the development of oral skills, but writing skills are only just beginning to receive proper attention. This also means that more attention is now given to the teaching of writing. It is with a view to the latter that the current research project has been initiated.

In the current article we pay attention only to certain linguistic features by analysing a number of written texts produced by students over time. Section 2 briefly reviews literature on writing assessment and text quality. Section 3 reports on the aims and design of our project, and will present the first quantitative analyses of our data. Section 4, finally, presents our preliminary conclusions.

2 Writing Assessment and Text Quality
Polio (2001) reviews nine categories of features of L2 writers’ texts: overall quality, linguistic accuracy, syntactic complexity, lexical features, content, mechanics, coherence and discourse features, fluency and revision. Breaking down these categories into more specific ones, she gives examples of the various measures and analyses of these features, describes research and discusses issues and problems. The first category she describes is the overall quality, which can be assessed on the basis of holistic scales (e.g., the Test of Written English, or TWE), analytic scales (e.g., the Jacobs scale) and ranking without guidelines. She argues that it is up to researchers to choose any of these three different measures based on considerations of logic, validity and reliability. Other features relating to the linguistic quality of a text are linguistic accuracy (i.e. the absence of errors), mechanics (i.e. spelling, punctuation, capitalization and indentation), and complexity (i.e. the use of more elaborate language and variety of syntactic patterning per T-unit). Polio raises various questions about the use of linguistic accuracy in relation to validity and reliability and to the question of whether or not accuracy in L2 writing is interesting at all. The answer is not yet clear and is related to general L2 proficiency development. Mechanics, a feature related to accuracy, has not been explored very extensively because, according to Polio, it is not clear whether it is a construct at all, and because it has been studied only as a by-product of other studies. Measuring complexity may not involve reliability problems but it certainly has problems of validity because of different ways of measuring this feature, raising questions like how words and clauses are related per T-unit and what exactly more complex syntactic structures mean (see Polio 2001: 97). 
Lexical features intended to measure lexical richness are originality/individuality, sophistication, variation and density (as measured by the
well-known type-token ratio), errors and diversity in form classes. The problem with all these features is, as Polio argues, the lack of a clear theory of lexical acquisition in second language acquisition, and it is not easy to establish which of these features measure quality and which development. Measures of content have to do with different features such as interest, referencing and argumentation, the number of topics included in the texts and the quality of propositions and inferences. Research questions to be answered deal for example with effects of planning conditions and of the particular treatment of the content of students’ writing and again with reliable and valid measurement of the content. Aspects of the quality of the content are coherence, i.e. the organization of the text, and discourse features like hedges and emphatics and cohesive devices. Both these features have been studied extensively, and these studies have shown the importance of both aspects for assessing text quality and possibilities for improving that quality. Another measure reviewed is fluency. Polio states that this is a rather vague feature because it is a combination of totally different features, including the extent to which the text sounds native-like and production is of expected quantity. Further, this feature also includes measures such as complexity and lexical richness. Polio (2001: 109-110) contends that researchers who want to assess quality of writing products must report very explicitly what methodology they use. Moreover, they must be concerned with reliability and validity. It is this last aspect that is an important reason for carrying out our project. In connection with this it is relevant to point to a recent article by Connor and Mbaye (2002) on the assessment of writing and, specifically, the issue of validity when we score writing. 
They argue that, although we have at our disposal a variety of different scoring procedures and practices, what remains the problem is the gap between current practices in the evaluation of writing and the criteria referring to discourse structure. In spite of the developments in testing and changes in practices, Connor and Mbaye (2002) conclude that the assessment of writing still relies too much on linguistic criteria. Hamp-Lyons (2001) also speaks of a “fourth generation of writing assessment”, which involves not only technological but also humanistic, political and ethical dimensions. Connor and Mbaye (2002) review advances in text analysis and propose the inclusion of rhetorical and communicative aspects in the assessment of writing, after which they present a model for writing competence, on the analogy of the communicative competence model for oral production in the foreign or second language (see Canale and Swain 1980; Canale 1983). It includes the same four competences as the oral model:

1. grammatical competence (i.e. grammar, vocabulary, spelling and punctuation)
2. discourse competence (i.e. discourse organization, cohesion and coherence)
3. sociolinguistic competence (i.e. written genre appropriacy, register and tone)
4. strategic competence (i.e. audience / reader awareness, appeals, pertinence of claims and warrants)
Connor and Mbaye’s (2002) proposal seems to be a very useful contribution to the issue of validity in assessing writing products because of the focus on competences other than linguistic competence.

3 The Research Project
In this section we report on an exploratory study that forms part of a larger project whose aims are not only to study any development in non-native writers’ writing skills (see Shaw and Liu 1998), but also to create a tool that will assist (non-native) teachers in assessing writing products with greater validity by focusing on relevant features of the four different competences, as proposed by Connor and Mbaye (2002). We believe that such a tool can be developed on the basis of extensive quantitative and qualitative analyses. These should ideally yield a meaningful checklist that teachers can use to assess student essays, without having to carry out any elaborate analyses of the essays themselves. This is particularly important since there are, as yet, no guidelines for writing assessment available for Dutch university lecturers.

The project is currently envisaged to run from 2002 until 2005. In this period we aim to collect a number of student essays, and study these both quantitatively and qualitatively. The project is carried out at the departments of English and Spanish at the University of Nijmegen. Essays are collected from both Dutch-speaking students of English and Dutch-speaking students of Spanish. Again, the combination is a deliberate one, for two reasons:

1. Students of English at Dutch Universities will have been taught English at primary and secondary school for a total of eight years when they enter university, which makes them fairly competent in English when they start their academic studies. Spanish, on the other hand, is not taught at Dutch primary or secondary schools, which means that Dutch students of Spanish start at a beginning level. It is therefore to be expected that there will be huge differences between the development of the writing skills of the students of Spanish and that of the students of English.

2. English and Dutch are very closely related languages. Writing courses in English, especially at an academic level, will need to concentrate far less on the mechanics of writing than the Spanish writing courses. This, again, will have an effect on the way in which writing skills develop in the two groups of foreign language students. It can also be expected that there will be significant differences in quality between the two groups.

Non-native writing has been studied extensively in the past decade in the ICLE (International Corpus of Learner English) project (see Granger 1998). The main goal of this project has been to collect a large number of non-native essays in order to study the characteristics of writing produced by learners with various language backgrounds. Over two million words of non-native material have so far been collected. However, none of this material can be used for our purpose, since no student has contributed more than a single essay to the ICLE corpus, which makes it impossible to study individual or collective development in writing skills.

4 Data Collection
Student essays are collected according to the schedule presented in Table 1. The essays are collected at the end of March, in four consecutive years. The end of March is a good moment for essay collection, as students will, by that time, have come to the end of the third of the academic year’s four teaching periods. The students will have been taught at least one course with an emphasis on aspects of formal writing. Moreover, it would be quite pointless to collect essays any earlier from the first year students of Spanish, as they would not have a sufficient command of Spanish grammar or lexis to construct more than a few very elementary sentence types in Spanish.

Table 1: Schedule for the collection of student essays

Year   Department of English                  Department of Spanish
2002   Cohort 1: Essay 1                      Cohort 1: Essay 1
2003   Cohort 1: Essay 2; Cohort 2: Essay 1   Cohort 1: Essay 2; Cohort 2: Essay 1
2004   Cohort 1: Essay 3; Cohort 2: Essay 2   Cohort 1: Essay 3; Cohort 2: Essay 2
2005   Cohort 2: Essay 3                      Cohort 2: Essay 3

Essays are collected in March.
It was also decided to collect essays from the same students at intervals of a full year, as research has found (see Ortega 2002) that it is hard to measure any development after shorter intervals. We plan to collect at least three essays from two cohorts of students. This will give us six batches of student essays for English and six batches for Spanish.
The first two batches of first year students’ essays were collected in March 2002. We collected 47 English essays and 21 Spanish essays. All the essays were written on a single prompt, taken from Grant and Ginther (2000), which asked the students to select their preferred source of news and give specific reasons to support their preference. They were allowed 30 minutes to complete this task. The prompt was given in Dutch1 so as to prevent any words or phrases in the prompt from being copied into the essays. Moreover, we wanted to make absolutely sure that the prompt was understood well, which was especially relevant for the first year students of Spanish.

The students handwrote their essays. These essays were later computerized by a student assistant who had been instructed to type in accurately what the students had written, disabling any correction features provided by the word processor. They were later stripped of any titles, student names or numbers, and instead labelled with a unique ID number which would enable us to link essays to students later on.

The total length of the English essays is 13,433 words, which means an average length of 286 words per essay, ranging from 133 words to 528 words. For Spanish the total number of words amounted to 4,338, meaning an average of 206 words, ranging from a mere 67 words to not more than 312 words for the longest essay. If production in quantitative terms is anything to go by, these figures clearly reflect the more mature proficiency of the first year students of English. The remainder of this article is devoted to a discussion of a general analysis of the English essays.

5 Data Analysis
Grant and Ginther (2000) set out to study the relationship between the occurrence of certain linguistic features in student essays and TWE test scores. A number of essays with different scores on the TWE were computer-tagged for linguistic features by means of Biber’s tagger (see Biber 1995). Grant and Ginther were able to demonstrate a correlation between higher TWE scores and the occurrence of linguistic features indicating a greater linguistic maturity. What we want to do is comparable, except that we are interested not so much in the differences between the poorer and the better students (although these are undoubtedly relevant to our ultimate goal of improving our writing courses) as in the differences we expect to find over time, when we compare the later essays to the earlier ones. An important difference between Grant and Ginther’s (2000) study and ours is that there is nothing like a TWE available in the Netherlands. Although university lecturers who teach English proficiency courses have fairly similar ideas about the proficiency levels their students should aim for (near-native level), there are, as yet, no explicit criteria for these levels. Nor is there, for instance, a standard nation-wide test that all Dutch university students of English must take. More than anything else, this project must, therefore, be considered exploratory.
Assessment of the Development of Writing Skills
The absence of any standard against which our essays could be measured was dealt with by asking three friendly colleagues, all of them experienced university proficiency teachers of English, and one a native speaker of British English, to mark the essays, and rather than grade them, simply rank them such that the best one came on top and the poorest at the bottom. The essays were presented as anonymous hard copies of the computerised versions, so as not to bias the graders if they recognised a student’s handwriting. The graders were also asked to write down brief characterisations of the essays, or strong and weak points, which they thought were relevant for the ranking. We then calculated an average rank for each essay, which enabled us to divide the 47 English essays into three proficiency levels: the “best” group, the “middle” group, and the “poor” group. An excerpt from a “good” essay is found in Figure 1.
… First of all, newspapers (broadsheets, that is) seem more reliable than the Internet for example, or than certain types of TV news (SBS 6 etc.) Newspapers have a certain reputation to uphold, whereas TV broadcasts like “SBS 6-news” are looking for sensation and entertainment. The Internet is not reliable at all – unless you know where to look –, because everybody can write there what they want, without having to provide sources or evidence. Another reason why newspapers have my preference, is that they are more elaborate than TV or radionews. They can write more about backgrounds, causes of certain events etc. , whereas TV or radionews can only spend a certain amount of time to each newsitem. Moreover, in an newspaper, you can re-read things you want, which is impossible with TV and radionews (unless you are willing to wait another hour...). …
Figure 1: Excerpt from a “good” essay
One of the three graders described this essay in terms of: “good argumentative style; good sentence construction” on the positive side, and “poor word division” on the negative side. Another described the same essay in terms of: “well-explained; quite idiomatic; fluent English” (positive) and “poor layout; one comma splice” (negative). Note that the comma splice does not occur in this excerpt.
By contrast, consider an excerpt of what was considered to be a “poor” essay, as shown in Figure 2 below. Graders’ comments on this essay were only negative. One of them wrote: “childish; poor paragraphing; comma splice; poor English; repetitive”, while another wrote: “poor punctuation (comma); unidiomatic; poor spelling; poor layout; poor grammar”. While it is undeniably true that the essay in Figure 1 is far from flawless, it is certainly a lot better than the one in Figure 2.
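The averaging step in the ranking procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' actual procedure: the essay IDs and ranks are invented, and the cut into three groups of near-equal size is an assumption, since the text does not say where the group boundaries were drawn.

```python
from statistics import mean

def average_ranks(rankings):
    """rankings: one dict per grader, mapping essay ID -> rank (1 = best)."""
    return {eid: mean(r[eid] for r in rankings) for eid in rankings[0]}

def rank_spread(rankings):
    """Largest minus smallest rank per essay; a large value flags disagreement."""
    return {eid: max(r[eid] for r in rankings) - min(r[eid] for r in rankings)
            for eid in rankings[0]}

def split_into_groups(avg_ranks, labels=("best", "middle", "poor")):
    """Order essays by average rank and cut into near-equal groups (assumed rule)."""
    ordered = sorted(avg_ranks, key=avg_ranks.get)
    n, k = len(ordered), len(labels)
    return {label: ordered[i * n // k:(i + 1) * n // k]
            for i, label in enumerate(labels)}

# Hypothetical ranks from three graders for six essays
grader_a = {"E01": 1, "E02": 2, "E03": 3, "E04": 4, "E05": 5, "E06": 6}
grader_b = {"E01": 2, "E02": 1, "E03": 4, "E04": 3, "E05": 6, "E06": 5}
grader_c = {"E01": 1, "E02": 3, "E03": 2, "E04": 5, "E05": 4, "E06": 6}
rankings = [grader_a, grader_b, grader_c]

avg = average_ranks(rankings)
spread = rank_spread(rankings)
groups = split_into_groups(avg)
print(groups)
```

An essay with a large spread could be reviewed by hand before it is assigned to a group.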
Every newssource has its advantages. The internet is always very quick with news, and it gives you text and pictures, so do newspapers. Radio only gives you spoken words. On t.v. you can see filmed material and that makes it more interesting. I would say that t.v. is my favourite newssource. It seems the most accurate source, and it gives you a good picture of the news, because of the filmed material. If I hear something on the news on the radio that is important, I always turn on the t.v. to receive more elaborate information about it. I suppose you could also check the internet for that, maybe I will do that in a couple of years, but now for me, t.v. is the most common newssource. You might say that newspapers are also accurate and elaborate about the news, and that is true. And it’s nice to smell the paper and everything. …
Figure 2: Excerpt from a “poor” essay
On the whole, the three graders agreed in their ranking of the essays, except for one dramatic case, in which the essay that came out worst in one grader’s ranking was considered to be the best by another. The graders knew the topic on which the students were supposed to write. It turned out that the essay in question was completely off-topic, a fact that was recognized by both graders. One grader felt this to be so serious that this student deserved to be ranked last. However, while admitting that the essay was off-topic, the other grader felt that the essay was nevertheless well written. This case shows the consequence of the graders having been given no instruction other than to mark the essays holistically and then to rank them. They had simply been asked to apply their own criteria. However, given the situation, any further instructions to the assessors might have been regarded as inappropriate interference.
We used WordSmith Tools for our initial general analysis of the essays. In a later stage they will also be computer-tagged by means of Douglas Biber’s tagger.² A series of text features can be studied in relation to text ranking: among them are such items as word length and type/token ratio; lexical features like conjuncts, hedges, and amplifiers; and grammatical features like nouns, nominalizations, personal pronouns, verb characteristics, and use of adjectives and adverbs. Clause level features include complementation, relative clauses, use of the passive, etc. We are now in the process of analysing the untagged essays in detail, but the first few general quantitative analyses of this material have already given us overwhelming evidence of the relationship between certain linguistic features and proficiency level. Figure 3 shows the average essay length in terms of the number of tokens in the three proficiency level groups that we have distinguished.
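Several of the surface measures mentioned here (essay length in tokens and in sentences, words per sentence, word length) are simple to compute. The sketch below is an illustration only: it relies on naive regular-expression tokenisation and sentence splitting, so its counts will not necessarily match those produced by WordSmith Tools, and the sample text is invented.

```python
import re

def surface_measures(text):
    """Return basic length measures for one essay (naive tokenisation)."""
    # A word is a run of letters, digits, apostrophes, or hyphens
    words = re.findall(r"[A-Za-z0-9'’-]+", text)
    # A sentence ends at ., ! or ? followed by whitespace or the end of the text
    sentences = [s for s in re.split(r"[.!?]+(?:\s+|$)", text) if s.strip()]
    if not words or not sentences:
        raise ValueError("text contains no words")
    return {
        "tokens": len(words),
        "sentences": len(sentences),
        "words_per_sentence": len(words) / len(sentences),
        "mean_word_length": sum(len(w) for w in words) / len(words),
    }

# Invented sample text, not taken from the corpus
sample = "Radio only gives you spoken words. On TV you can see filmed material."
measures = surface_measures(sample)
print(measures)
```

Note that a splitter of this kind treats a comma splice as a single sentence, and it would miscount abbreviations such as "t.v.", so real essays would need a more careful segmentation step.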
[Figure 3: Average essay length (# of tokens) for the best, middle, and poor groups; vertical axis 0 to 400]
The essays that were ranked higher are much longer, on average, than the poorer essays. Being able to produce more text in a given time might be considered a sign of a more mature proficiency, a fact which appears to be confirmed by the huge difference in length between the English students’ essays and the Spanish students’ essays (see section 2). Figure 4 shows the average essay length in terms of the number of sentences that the students wrote.
[Figure 4: Average essay length (# of sentences) for the best, middle, and poor groups; vertical axis 0 to 25]
Again, we see that the best essays are also longer in terms of the number of sentences produced. The observations presented in Figs. 3 and 4 may lead to the conclusion that the average sentence length is roughly equal in the three proficiency groups. However, when we look at the average number of words per sentence, we see a striking difference between the best group and the poor group. This is shown in Figure 5.
[Figure 5: Average sentence length (# of words per sentence) for the best, middle, and poor groups; vertical axis 16 to 19]
At first glance, it might seem curious that the poorer students apparently construct longer sentences than the best students, but it should be borne in mind that a common error made by Dutch students is the so-called comma splice, which essentially combines what are two independent sentences into a very long “sentence”. Apparently the poorer students find it hard to avoid comma splices.
Another indication of a more mature proficiency is the average word length of the essays (see Grant and Ginther 2000). Figure 6 below shows the average scores for the three proficiency classes. The poorer students indeed produce shorter words on average, which reflects their less mature command of English.
The poorer students’ lack of linguistic maturity should also be reflected in a smaller type/token ratio. The type/token ratio indicates the degree of lexical variation of a text by dividing the number of different words by the total number of words. It should be noted, however, that as texts become longer there will inevitably be more words that are repeated (especially function words), which lowers the type/token ratio and thus obscures the lexical variation. In our case, it would mean that the poorer students would have better type/token ratios as their essays were much shorter than those written by the best students.
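The length effect on the raw type/token ratio is easy to demonstrate. The following is a minimal sketch under stated assumptions: the function name and the two toy texts are invented, tokens are compared case-insensitively, and the ratio is expressed as a percentage.

```python
def type_token_ratio(tokens):
    """Raw type/token ratio as a percentage: distinct words / all words."""
    tokens = [t.lower() for t in tokens]
    return 100 * len(set(tokens)) / len(tokens)

# Two invented texts: the longer one extends the shorter one
short_text = "the news is on the radio".split()
long_text = short_text + "and the news is on the tv and the news is in the paper".split()

short_ttr = type_token_ratio(short_text)  # 5 types over 6 tokens
long_ttr = type_token_ratio(long_text)    # 9 types over 20 tokens
print(round(short_ttr, 1), round(long_ttr, 1))
```

The longer text scores lower simply because function words such as "the" and "is" keep recurring, which is why raw ratios for essays of unequal length are not directly comparable.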
[Figure 6: Average word length for the best, middle, and poor groups; vertical axis 4.20 to 4.34]
WordSmith Tools allows the user to adjust the type/token ratio by re-calculating the ratio for each sequence of, say, 100 words, and calculating an average score for the entire text.³ We chose sequences of 50 words, which yielded the scores in Figure 7.
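The adjustment described above can be sketched as follows. This is an illustrative reimplementation, not the code used by WordSmith Tools: the function name is invented, and both discarding a final incomplete sequence and reporting the ratio as a percentage are assumptions.

```python
def standardised_ttr(tokens, chunk_size=50):
    """Mean type/token ratio (as a percentage) over consecutive full chunks."""
    tokens = [t.lower() for t in tokens]
    ratios = []
    # Step through the text in non-overlapping chunks; a trailing partial
    # chunk is ignored (assumption)
    for start in range(0, len(tokens) - chunk_size + 1, chunk_size):
        chunk = tokens[start:start + chunk_size]
        ratios.append(100 * len(set(chunk)) / chunk_size)
    if not ratios:
        raise ValueError("text is shorter than one chunk")
    return sum(ratios) / len(ratios)

# Toy example with chunks of 4 words so the arithmetic is easy to follow;
# the study itself used sequences of 50 words
toy = "the cat sat on the mat the end and".split()
score = standardised_ttr(toy, chunk_size=4)
print(score)
```

Here the first chunk has four distinct words out of four and the second has three out of four, so the average is 87.5; the ninth word falls outside any full chunk and is left out.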
[Figure 7: Adjusted type/token ratios for the best, middle, and poor groups; vertical axis 76 to 78]
Although the difference between the scores for the best group and the poor group is not great, it reflects the difference in linguistic maturity. What is remarkable is that the middle group has the highest score. This is something that a more detailed qualitative analysis may shed more light on, but which remains unresolved, for the time being.
6 Summary and Conclusion
A general quantitative analysis of the first batch of student essays has shown that there are measurable differences between what Dutch university lecturers consider to be better essays and poorer essays. What we have been able to demonstrate is that there are a number of global linguistic features that correlate with higher or lower levels of linguistic maturity of the essay writers. However, the figures yielded by the quantitative analysis so far are not unambiguous. It goes without saying that in order to be able to measure any student’s individual progress we need a far more elaborate analysis, both in quantitative and in qualitative terms, of his or her essays. A more advanced quantitative analysis will concentrate on the correlation of those linguistic features that play a role in the linguistic dimensions that are relevant in an academic context, viz. formality and informational focus (see Biber 1988, 1995; Hoogesteger 1998). The results of these quantitative analyses will have to be complemented by qualitative analyses of the other three competences: discourse competence, sociolinguistic competence and strategic competence (Connor and Mbaye 2002). As we stated in the first section, however, this can be the first step towards the creation of a tool that will help us to provide insight into the development of students’ writing skills and that can assist (non-native) teachers in assessing writing products with greater validity.
Notes
1. The Dutch prompt read: “Schrijf een tekst over welke nieuwsbron je voorkeur heeft en geef je redenen voor deze voorkeur” (“write a text about which news source you prefer and state your reasons for this preference”).
2. We would like to express our gratitude to Douglas Biber for tagging both the English and the Spanish material.
3. WordSmith Tools, in its help files, warns the novice user against calculating “raw” type/token ratios for texts of unequal lengths and recommends the use of adjusted type/token ratios for those cases.
References
Biber, D. (1988), Variation across speech and writing, Cambridge: Cambridge University Press.
Biber, D. (1995), Dimensions of register variation: A cross-linguistic comparison, Cambridge: Cambridge University Press.
Canale, M. (1983), From communicative competence to communicative language pedagogy, in J.C. Richards and R. Schmidt (eds), Language and communication, London: Longman, pp. 2-27.
Canale, M. and M. Swain (1980), Theoretical bases of communicative approaches to second language teaching and testing, Applied Linguistics, 1: 1-47.
Connor, U. and A. Mbaye (2002), Discourse approaches to writing assessment, Annual Review of Applied Linguistics, 22: 263-278.
Granger, S. (1998), Learner English on computer, New York: Addison Wesley Longman.
Grant, L. and A. Ginther (2000), Using computer-tagged linguistic features to describe L2 writing differences, Journal of Second Language Writing, 9: 123-145.
Hamp-Lyons, L. (2001), Fourth generation of writing assessment, in T. Silva and P.K. Matsuda (eds), On second language writing, Mahwah, NJ: Lawrence Erlbaum, pp. 117-129.
Hoogesteger, M. (1998), A linguistic comparison of argumentative essays written by native speakers of English and advanced Dutch learners of English, University of Nijmegen: Unpublished MA thesis.
Ortega, L. (2002), Magnitude and rate of syntactic complexity changes in college-level L2 writing: A research synthesis, Paper presented at the American Association for Applied Linguistics (AAAL) 2002 Conference in Salt Lake City.
Polio, C. (2001), Research methodology in L2 writing assessment, in T. Silva and P.K. Matsuda (eds), On second language writing, Mahwah, NJ: Lawrence Erlbaum, pp. 91-115.
Shaw, P. and E. Liu (1998), What develops in the development of second-language writing?, Applied Linguistics, 19: 225-254.